
Machine Learning Basics: Wiki Translation of the Johnson-Lindenstrauss Lemma for Dimensionality Reduction, an Application to Generalized Factor Models, and a Simple sklearn Example

2016-07-17 21:13
The Johnson-Lindenstrauss lemma states that any high-dimensional dataset can be randomly projected into a lower-dimensional Euclidean space while controlling the distortion in the pairwise distances.

(That is, the distortion of the transformation is measured and controlled on the pairwise distances.)

The distortion introduced by a random projection p is asserted by the fact that p defines an eps-embedding with good probability:

(1 - eps) ||u - v||^2 <= ||p(u) - p(v)||^2 <= (1 + eps) ||u - v||^2

where u, v are any rows taken from a dataset of shape [n_samples, n_features] and p is a projection by a random Gaussian N(0, 1) matrix of shape [n_components, n_features] (or a sparse Achlioptas matrix).
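
To make the eps-embedding concrete, here is a small numpy sketch (my own illustration, not part of the quoted sklearn text; all sizes are chosen only for demonstration). It draws a Gaussian N(0, 1) matrix, scales it by 1/sqrt(n_components) so that squared distances are preserved in expectation, and checks the inequality for one pair of rows:

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features, n_components = 100, 5000, 1000
eps = 0.1

X = rng.rand(n_samples, n_features)
u, v = X[0], X[1]

# Gaussian N(0, 1) projection matrix, scaled so that squared distances
# are preserved in expectation.
p = rng.normal(size = (n_components, n_features)) / np.sqrt(n_components)

d_orig = np.sum((u - v) ** 2)
d_proj = np.sum((p.dot(u) - p.dot(v)) ** 2)

print("original squared distance :", d_orig)
print("projected squared distance:", d_proj)
print("within the (1 +/- eps) band:",
      (1 - eps) * d_orig <= d_proj <= (1 + eps) * d_orig)

With good probability the last line prints True; this scaled Gaussian matrix is essentially what sklearn.random_projection.GaussianRandomProjection draws internally.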

Wiki interpretation:

The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

(In other words, for a small set of points in a high-dimensional space there exists a dimension-reducing map that nearly preserves distances and is at least Lipschitz continuous; the map can even be taken to be an orthogonal projection.)

The lemma has uses in compressed sensing, manifold learning, dimensionality reduction, and graph embedding. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see the vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases. It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure. The Johnson-Lindenstrauss lemma is a classic result in this vein.

Lemma

Given 0 < eps < 1, a set X of m points in R^N, and a number n > 8 ln(m) / eps^2, there is a linear map f: R^N -> R^n such that

 (1 - eps) ||u - v||^2 <= ||f(u) - f(v)||^2 <= (1 + eps) ||u - v||^2

for all u, v in X.

Note that f is a linear map, and that a single such map works for the whole given set of sample points: the bound holds simultaneously for every pair of points in X.

One proof of the lemma takes f to be a suitable multiple of the orthogonal projection onto a random subspace of dimension n in R^N, and exploits the phenomenon of concentration of measure.

(That is, one way to prove it is to project onto a randomly generated orthogonal subspace and rescale by a suitable constant.)

Obviously an orthogonal projection will, in general, reduce the average distance between points, but the lemma can be viewed as dealing with relative distances, which do not change under scaling. In a nutshell, you roll the dice and obtain a random projection, which will reduce the average distance, and then you scale up the distances so that the average distance returns to its previous value. If you keep rolling the dice, you will, in polynomial random time, find a projection for which the (scaled) distances satisfy the lemma.

This is an intuitive description of the procedure; the orthogonal projection can be viewed as a special case of it.
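
The "roll the dice" idea can be sketched in a few lines of numpy (again my own illustration, with all constants chosen only for demonstration): draw a random n-dimensional subspace via the QR decomposition of a Gaussian matrix, project the points orthogonally onto it, and rescale by sqrt(N/n) so that squared distances are preserved on average.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
N, n, m = 1000, 200, 50                         # ambient dim, target dim, number of points
X = rng.rand(m, N)

# Orthonormal basis of a random n-dimensional subspace of R^N.
Q, _ = np.linalg.qr(rng.normal(size = (N, n)))  # Q has shape (N, n), Q.T Q = I

# Orthogonal projection (coordinates in the subspace), rescaled so that
# squared distances are preserved on average.
projected = X.dot(Q) * np.sqrt(N / n)

ratios = pdist(projected, 'sqeuclidean') / pdist(X, 'sqeuclidean')
print("distance ratios: mean %0.3f, min %0.3f, max %0.3f"
      % (ratios.mean(), ratios.min(), ratios.max()))

If the minimum and maximum ratios stay inside [1 - eps, 1 + eps] for the eps you care about, this particular roll of the dice already satisfies the lemma; otherwise you would redraw Q and try again.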

Back to sklearn:

The minimum number of components that guarantees the eps-embedding is given by:

 n_components >= 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3)

The scaling here is easy to understand: the required dimension grows only with the logarithm of the number of samples, and shrinks as the tolerated distortion eps grows.
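
As a quick sanity check (my own snippet, not part of the sklearn example), the closed-form bound can be evaluated by hand and compared with what johnson_lindenstrauss_min_dim returns; sklearn reports the bound as an integer.

import numpy as np
from sklearn.random_projection import johnson_lindenstrauss_min_dim

n_samples, eps = 500, 0.1
manual_bound = 4 * np.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3)
print("closed-form bound :", manual_bound)
print("sklearn's function:", johnson_lindenstrauss_min_dim(n_samples, eps = eps))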

Empirical validation

We validate the above bounds on the digits dataset or on the 20 newsgroups text document (TF-IDF word frequencies) dataset:

for the digits dataset, the 8x8 gray-level pixel values of 500 handwritten digit pictures are randomly projected to spaces of various larger numbers of dimensions n_components.

......

sklearn.random_projection::johnson_lindenstrauss_min_dim(n_samples, eps):

 finds a 'safe' (conservative) number of components to randomly project to.

plt.loglog

 puts both axes on a logarithmic scale, which removes large differences of magnitude from the display.

plt.semilogy

 puts only the y axis on a logarithmic scale.

ndarray.ravel():

 returns the array flattened into a 1-D vector (row-major order).

sklearn.random_projection::SparseRandomProjection

 the class that performs random projection with a sparse random matrix;

 the parameter n_components specifies the dimensionality after reduction;

 calling fit_transform projects the data (a minimal usage sketch follows below).
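
Here is that minimal usage sketch on synthetic data (the shapes and random_state are my own choices, purely for illustration):

import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(100, 10000)                          # 100 samples with 10000 features

rp = SparseRandomProjection(n_components = 1000, random_state = 0)
X_projected = rp.fit_transform(X)

print("projected shape        :", X_projected.shape)       # (100, 1000)
print("projection matrix shape:", rp.components_.shape)    # (1000, 10000), stored sparse
print("density of the matrix  :", rp.density_)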

The code below is the sklearn example that validates the above lemma, except that here the dimension is increased (from 64 dimensions up to [300, 1000, 10000] dimensions), so the validation takes the inverse-problem point of view.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.random_projection import johnson_lindenstrauss_min_dim

eps_range = np.linspace(0.1, 0.99, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(eps_range)))

n_samples_range = np.logspace(1, 9, 9)

plt.figure()
for eps, color in zip(eps_range, colors):
 min_n_components = johnson_lindenstrauss_min_dim(n_samples_range, eps = eps)
 plt.loglog(n_samples_range, min_n_components, color = color)

plt.legend(["eps = %0.1f" % eps for eps in eps_range], loc = "lower right")
plt.xlabel("Number of observations to eps-embed")
plt.ylabel("Minimum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_samples vs n_components")

n_samples_range = np.logspace(2, 6, 5)
colors = plt.cm.Blues(np.linspace(0.3, 1.0, len(n_samples_range)))

plt.figure()
for n_samples, color in zip(n_samples_range, colors):
 min_n_components = johnson_lindenstrauss_min_dim(n_samples, eps = eps_range)
 plt.semilogy(eps_range, min_n_components, color = color)

plt.legend(["n_samples = %d" % n for n in n_samples_range], loc = "upper right")
plt.xlabel("Distortion eps")
plt.ylabel("Mininum number of dimensions")
plt.title("Johnson-Lindenstrauss bounds:\nn_components vs eps")

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.datasets import load_digits

#data = fetch_20newsgroups_vectorized().data[:500]
data = load_digits().data[:500]
n_samples, n_features = data.shape

print()
print("data shape :")
print(data.shape)
print()

print("Embedding %d samples with dim %d using various random projections" % (n_samples, n_features))

from sklearn.metrics.pairwise import euclidean_distances

n_components_range = np.array([300, 1000, 10000])
dists = euclidean_distances(data, squared = True).ravel()

nonzero = dists != 0
dists = dists[nonzero]

from time import time
from sklearn.random_projection import SparseRandomProjection

for n_components in n_components_range:
    t0 = time()
    rp = SparseRandomProjection(n_components = n_components)
    projected_data = rp.fit_transform(data)

    print()
    print("projected_data shape :")
    print(projected_data.shape)
    print()

    print("Projected %d samples from %d to %d in %0.3fs" % (n_samples, n_features, n_components, time() - t0))

    # report the memory footprint of the sparse random matrix
    if hasattr(rp, 'components_'):
        n_bytes = rp.components_.data.nbytes
        n_bytes += rp.components_.indices.nbytes
        print("Random matrix with size: %0.3fMB" % (n_bytes / 1e6))

    # pairwise squared distances after projection, restricted to the same
    # non-zero pairs as in the original space
    projected_dists = euclidean_distances(projected_data, squared = True).ravel()[nonzero]

    plt.figure()
    plt.hexbin(dists, projected_dists, gridsize = 100, cmap = plt.cm.PuBu)
    plt.xlabel("Pairwise squared distance in original space")
    plt.ylabel("Pairwise squared distance in projected space")
    plt.title("Pairwise distance distribution for n_components = %d" % n_components)

    cb = plt.colorbar()
    cb.set_label("Sample pairs counts")

    rates = projected_dists / dists
    print("Mean distance rate: %0.2f (%0.2f)" % (np.mean(rates), np.std(rates)))

    plt.figure()
    # density = True replaces the old 'normed' keyword in recent matplotlib
    plt.hist(rates, bins = 50, density = True, range = (0., 2.))
    plt.xlabel("Squared distance rate: projected / original")
    plt.ylabel("Distribution of samples pairs")
    plt.title("Histogram of pairwise distance rates for n_components = %d" % n_components)

#plt.show()


Random projection is a method for reducing the dimensionality of high-dimensional data, but that does not mean it reduces collinearity. When we want to decide whether the data contains strong factors, however, we can apply random projection first and then work with the reduced data: because the projection (approximately) preserves the collinearity structure while shrinking the matrix, it speeds up computations on large matrices, and it can speed up the relevant computation of a large-dimensional generalized factor model as follows:
import numpy as np
from numpy.linalg import svd   # for the singular values of the projected data

# generate_X is assumed to be the author's own data generator for the
# generalized factor model (defined elsewhere, not shown in this post);
# its first return value is the simulated data matrix X.
X = generate_X(1000, 5, 1000, 10, 10, 5, 5)[0]

# we use random projection, which may reach the conclusion faster
from sklearn.random_projection import SparseRandomProjection
from sklearn.random_projection import johnson_lindenstrauss_min_dim

min_n_components = johnson_lindenstrauss_min_dim(1000, eps = 0.5)
print("min_n_components :")
print(min_n_components)

n_components = min_n_components
rp = SparseRandomProjection(n_components = n_components)
projected_X = rp.fit_transform(X)

# singular values of the projected data, used below in an
# eigenvalue-ratio style criterion for the number of strong factors
eigen_v = svd(projected_X)[1]

eigen_v1 = eigen_v / (1 + eigen_v)
print(eigen_v1[:-1] / eigen_v1[1:])
print(np.argmax((eigen_v1[:-1] / eigen_v1[1:])[:20]))