【论文翻译】Nonlinear Dimensionality Reduction by Locally Linear Embedding

2020-08-20 10:17

论文题目：Nonlinear Dimensionality Reduction by Locally Linear Embedding
论文来源：Nonlinear Dimensionality Reduction by Locally Linear Embedding
翻译人：BDML@CQUT实验室

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Sam T. Roweis and Lawrence K. Saul

基于局部线性嵌入的非线性降维

Sam T. Roweis and Lawrence K. Saul

Abstract

Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

许多科学领域都依赖于探索性数据分析和可视化。分析大量多变量数据的需求提出了降维这一基本问题：如何发现高维数据的紧凑表示。在这里，我们介绍局部线性嵌入（LLE），这是一种无监督学习算法，用于计算高维输入的低维、保持邻域的嵌入。与用于局部降维的聚类方法不同，LLE将其输入映射到一个低维的单一全局坐标系中，并且其优化过程不涉及局部极小值。通过利用线性重构的局部对称性，LLE能够学习非线性流形的全局结构，例如由人脸图像或文本文档生成的流形。

正文

How do we judge similarity? Our mental representations of the world are formed by processing large numbers of sensory inputs, including, for example, the pixel intensities of images, the power spectra of sounds, and the joint angles of articulated bodies. While complex stimuli of this form can be represented by points in a high-dimensional vector space, they typically have a much more compact description. Coherent structure in the world leads to strong correlations between inputs (such as between neighboring pixels in images), generating observations that lie on or close to a smooth low-dimensional manifold. To compare and classify such observations (in effect, to reason about the world) depends crucially on modeling the nonlinear geometry of these low-dimensional manifolds.

我们如何判断相似性？我们对世界的心理表征是通过处理大量的感觉输入而形成的，例如图像的像素强度、声音的功率谱以及铰接体的关节角度。虽然这种形式的复杂刺激可以由高维向量空间中的点表示，但它们通常具有更为紧凑的描述。世界中的连贯结构导致输入之间（例如图像中相邻像素之间）存在强相关性，从而使观测值位于一个平滑的低维流形上或其附近。对这些观测值进行比较和分类（实际上就是对世界进行推理），关键取决于对这些低维流形的非线性几何进行建模。

Scientists interested in exploratory analysis or visualization of multivariate data (1) face a similar problem in dimensionality reduction. The problem, as illustrated in Fig. 1, involves mapping high-dimensional inputs into a low-dimensional “description” space with as many coordinates as observed modes of variability. Previous approaches to this problem, based on multidimensional scaling (MDS) (2), have computed embeddings that attempt to preserve pairwise distances [or generalized disparities (3)] between data points; these distances are measured along straight lines or, in more sophisticated usages of MDS such as Isomap (4), along shortest paths confined to the manifold of observed inputs. Here, we take a different approach, called locally linear embedding (LLE), that eliminates the need to estimate pairwise distances between widely separated data points. Unlike previous methods, LLE recovers global nonlinear structure from locally linear fits.

对多变量数据的探索性分析或可视化感兴趣的科学家(1)在降维方面也面临着类似的问题。如图1所示，该问题涉及将高维输入映射到一个低维“描述”空间，该空间的坐标数与观测到的变化模式数相同。以前解决这个问题的方法基于多维尺度分析(MDS)(2)，其计算的嵌入试图保持数据点之间的成对距离[或广义相异度(3)]；这些距离沿直线测量，或者在更复杂的MDS用法（如Isomap (4)）中，沿限定于观测输入流形之内的最短路径测量。这里，我们采用一种不同的方法，称为局部线性嵌入(LLE)，它无需估计相距很远的数据点之间的成对距离。与以往的方法不同，LLE从局部线性拟合中恢复全局非线性结构。

Fig. 1. The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. Note that mixture models for local dimensionality reduction (29), which cluster the data and perform PCA within each cluster, do not address the problem considered here: namely, how to map high-dimensional data into a single global coordinate system of lower dimensionality.

图1. 非线性降维问题，以从二维流形(A)中采样的三维数据(B)为例(10)。无监督学习算法必须在没有信号明确指示数据应如何嵌入二维的情况下，发现流形的全局内部坐标。颜色编码展示了LLE发现的保持邻域的映射；(B)和(C)中的黑色轮廓线显示了单个点的邻域。与LLE不同，通过主成分分析(PCA)(28)或经典MDS(2)得到的数据投影会把相距很远的数据点映射到平面中相近的点，无法识别流形的底层结构。注意，用于局部降维的混合模型(29)将数据聚类并在每个聚类内执行PCA，并未解决此处考虑的问题：即如何将高维数据映射到一个较低维的单一全局坐标系中。

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of $N$ real-valued vectors $\vec{X_i}$, each of dimensionality $D$, sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost function
$$\varepsilon(W)=\sum_i \left| \vec{X_i}-\sum_j W_{ij} \vec{X_j} \right|^2 \qquad (1)$$
which adds up the squared distances between all the data points and their reconstructions. The weights $W_{ij}$ summarize the contribution of the $j$th data point to the $i$th reconstruction. To compute the weights $W_{ij}$, we minimize the cost function subject to two constraints: first, that each data point $\vec{X_i}$ is reconstructed only from its neighbors (5), enforcing $W_{ij}=0$ if $\vec{X_j}$ does not belong to the set of neighbors of $\vec{X_i}$; second, that the rows of the weight matrix sum to one: $\sum_j W_{ij}=1$. The optimal weights $W_{ij}$ subject to these constraints (6) are found by solving a least-squares problem (7).

LLE算法（概述见图2）基于简单的几何直觉。假设数据由$N$个实值向量$\vec{X_i}$组成，每个向量的维数为$D$，它们采样自某个底层流形。只要数据足够多（使流形得到充分采样），我们就可以期望每个数据点及其近邻位于流形的一个局部线性面片上或其附近。我们用线性系数来刻画这些面片的局部几何，这些系数由近邻重构每个数据点。重构误差用代价函数来度量

$$\varepsilon(W)=\sum_i \left| \vec{X_i}-\sum_j W_{ij} \vec{X_j} \right|^2 \qquad (1)$$

它将所有数据点与其重构之间的平方距离相加。权重$W_{ij}$概括了第$j$个数据点对第$i$个重构的贡献。为了计算权重$W_{ij}$，我们在两个约束条件下最小化代价函数：第一，每个数据点$\vec{X_i}$仅由其近邻(5)重构，即如果$\vec{X_j}$不属于$\vec{X_i}$的近邻集合，则强制$W_{ij}=0$；第二，权重矩阵的各行之和为1：$\sum_j W_{ij}=1$。满足这些约束的最优权重$W_{ij}$(6)通过求解一个最小二乘问题(7)得到。
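The constrained least-squares problem for a single data point can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names, the toy data, and the closed form used here (solve $C w = \mathbf{1}$ for the local Gram matrix $C$ of neighbor offsets, then rescale $w$ to sum to one) are assumptions that follow from applying a Lagrange multiplier to Eq. 1. It assumes $K \le D$ so that $C$ is invertible without any regularization.

```python
import numpy as np

def local_weights(x, neighbors):
    """Sum-to-one weights that best reconstruct x from the rows of `neighbors`."""
    Z = neighbors - x                 # neighbor offsets, shape (K, D)
    C = Z @ Z.T                       # local Gram matrix, shape (K, K)
    w = np.linalg.solve(C, np.ones(len(neighbors)))
    return w / w.sum()                # enforce the sum-to-one constraint

def reconstruction_error(x, neighbors, w):
    """Squared distance between x and its weighted reconstruction (one term of Eq. 1)."""
    return float(np.sum((x - w @ neighbors) ** 2))

# Hypothetical toy data: one point in D = 5 dimensions with K = 3 neighbors.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
neighbors = x + 0.1 * rng.normal(size=(3, 5))

w_opt = local_weights(x, neighbors)
err_opt = reconstruction_error(x, neighbors, w_opt)

# Sanity check: no other sum-to-one weight vector should do better.
trial_weights = rng.dirichlet(np.ones(3), size=200)
err_trials = [reconstruction_error(x, neighbors, w) for w in trial_weights]
```

When $K > D$ the Gram matrix becomes singular, and some form of conditioning is needed before solving; that case is left out of this sketch.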

Fig. 2. Steps of locally linear embedding:
(1) Assign neighbors to each data point $\vec{X_i}$ (for example by using the $K$ nearest neighbors).
(2) Compute the weights $W_{ij}$ that best linearly reconstruct $\vec{X_i}$ from its neighbors, solving the constrained least-squares problem in Eq. 1.
(3) Compute the low-dimensional embedding vectors $\vec{Y_i}$ best reconstructed by $W_{ij}$, minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights $W_{ij}$ and vectors $\vec{Y_i}$ are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.

图2. 局部线性嵌入的步骤：
（1）为每个数据点$\vec{X_i}$指定近邻（例如使用$K$个最近邻）。
（2）计算能由其近邻最佳线性重构$\vec{X_i}$的权重$W_{ij}$，即求解式1中的约束最小二乘问题。
（3）计算由权重$W_{ij}$最佳重构的低维嵌入向量$\vec{Y_i}$，通过求式3中稀疏对称矩阵的最小特征模态来最小化式2。虽然权重$W_{ij}$和向量$\vec{Y_i}$是用线性代数的方法计算的，但点只能由近邻重构这一约束会导致高度非线性的嵌入。

The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. By symmetry, it follows that the reconstruction weights characterize intrinsic geometric properties of each neighborhood, as opposed to properties that depend on a particular frame of reference (8). Note that the invariance to translations is specifically enforced by the sum-to-one constraint on the rows of the weight matrix.

使这些重构误差最小的约束权重遵循一个重要的对称性：对于任何特定的数据点，它们对该数据点及其近邻的旋转、缩放和平移都是不变的。由对称性可知，重构权重刻画的是每个邻域的固有几何性质，而不是依赖于特定参考系的性质(8)。请注意，平移不变性是由权重矩阵各行和为一的约束专门保证的。
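The invariance described above can be checked numerically. The sketch below is illustrative only: the helper `local_weights`, the toy data, and the particular rotation, scale, and shift are assumptions made for the demonstration. It computes the weights for one point and its neighbors, applies a rotation, rescaling, and translation to all of them, and recomputes the weights.

```python
import numpy as np

def local_weights(x, neighbors):
    """Sum-to-one weights that best reconstruct x from the rows of `neighbors`."""
    Z = neighbors - x
    C = Z @ Z.T
    w = np.linalg.solve(C, np.ones(len(neighbors)))
    return w / w.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=5)                        # one point, D = 5
neighbors = x + rng.normal(size=(3, 5))       # K = 3 neighbors

# Random rotation: orthogonal factor of a QR decomposition, with the sign of
# one column flipped if needed so that the determinant is +1.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
if np.linalg.det(Q) < 0:
    Q[:, 0] = -Q[:, 0]
scale = 2.5
shift = rng.normal(size=5)

def transform(p):
    """Rotate, rescale, then translate a point or an array of points (rows)."""
    return scale * (p @ Q.T) + shift

w_before = local_weights(x, neighbors)
w_after = local_weights(transform(x), transform(neighbors))
print(np.max(np.abs(w_before - w_after)))     # expected to be at rounding-error level
```

The rotation and rescaling only multiply the local Gram matrix by a constant, which the normalization removes; the translation cancels in the offsets `neighbors - x`, which is valid as a reconstruction of the shifted point precisely because the weights sum to one.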

Suppose the data lie on or near a smooth nonlinear manifold of lower dimensionality $d \ll D$. To a good approximation then, there exists a linear mapping (consisting of a translation, rotation, and rescaling) that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weights $W_{ij}$ reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights $W_{ij}$ that reconstruct the $i$th data point in $D$ dimensions should also reconstruct its embedded manifold coordinates in $d$ dimensions.

假设数据位于一个较低维（$d \ll D$）的平滑非线性流形上或其附近。那么在良好的近似下，存在一个由平移、旋转和缩放组成的线性映射，将每个邻域的高维坐标映射到流形上的全局内部坐标。按照设计，重构权重$W_{ij}$反映了数据的固有几何性质，而这些性质恰好对这类变换保持不变。因此，我们预期它们对原始数据空间中局部几何的刻画对流形上的局部面片同样有效。特别地，在$D$维中重构第$i$个数据点的同一组权重$W_{ij}$，也应能在$d$维中重构其嵌入的流形坐标。

LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation $\vec{X_i}$ is mapped to a low-dimensional vector $\vec{Y_i}$ representing global internal coordinates on the manifold. This is done by choosing $d$-dimensional coordinates $\vec{Y_i}$ to minimize the embedding cost function
$$\Phi(Y)=\sum_i \left| \vec{Y_i} - \sum_j W_{ij} \vec{Y_j} \right|^2 \qquad (2)$$
This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights $W_{ij}$ while optimizing the coordinates $\vec{Y_i}$. The embedding cost in Eq. 2 defines a quadratic form in the vectors $\vec{Y_i}$. Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse $N \times N$ eigenvalue problem (9), whose bottom $d$ nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.

基于上述思想，LLE构造了一个保持邻域的映射。在算法的最后一步，每个高维观测值$\vec{X_i}$被映射到一个低维向量$\vec{Y_i}$，它表示流形上的全局内部坐标。这是通过选择$d$维坐标$\vec{Y_i}$来最小化嵌入代价函数实现的

$$\Phi(Y)=\sum_i \left| \vec{Y_i} - \sum_j W_{ij} \vec{Y_j} \right|^2 \qquad (2)$$

这个代价函数和前一个一样，基于局部线性重构误差，但这里我们固定权重$W_{ij}$，转而优化坐标$\vec{Y_i}$。式2中的嵌入代价定义了关于向量$\vec{Y_i}$的一个二次型。在使问题适定的约束条件下，可以通过求解一个稀疏的$N \times N$特征值问题(9)来最小化它，其最底部的$d$个非零特征向量提供了一组以原点为中心的有序正交坐标。
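The step from Eq. 2 to an eigenvalue problem can be made explicit. The derivation below is a sketch: the matrix name $M$ and the particular centering and unit-covariance constraints are assumptions, since this section does not spell out the matrix or the constraints it uses. Writing the reconstruction residual with the matrix $I-W$ and expanding the square gives

$$\Phi(Y)=\sum_i \left| \sum_j (\delta_{ij}-W_{ij})\,\vec{Y_j} \right|^2=\sum_{jk} M_{jk}\,(\vec{Y_j}\cdot\vec{Y_k}), \qquad M=(I-W)^{\top}(I-W),$$

$$M_{jk}=\delta_{jk}-W_{jk}-W_{kj}+\sum_i W_{ij}W_{ik}.$$

One way to make the minimization well-posed is to require $\sum_i \vec{Y_i}=\vec{0}$ (which removes the freedom to translate the embedding) and $\frac{1}{N}\sum_i \vec{Y_i}\vec{Y_i}^{\top}=I$ (which rules out the collapsed solution $\vec{Y_i}=\vec{0}$). Because each row of $W$ sums to one, $(I-W)\mathbf{1}=\mathbf{0}$, so the constant vector is an eigenvector of $M$ with eigenvalue zero. It is discarded, and the next $d$ eigenvectors with the smallest eigenvalues supply the $d$ coordinates of the embedding.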

Implementation of the algorithm is straightforward. In our experiments, data points were reconstructed from their $K$ nearest neighbors, as measured by Euclidean distance or normalized dot products. For such implementations of LLE, the algorithm has only one free parameter: the number of neighbors, $K$. Once neighbors are chosen, the optimal weights $W_{ij}$ and coordinates $\vec{Y_i}$ are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. 1 and 2.

算法的实现很简单。在我们的实验中，数据点由其$K$个最近邻重构，近邻以欧几里得距离或归一化点积度量。对于这样的LLE实现，算法只有一个自由参数：近邻数$K$。一旦选定近邻，就用线性代数的标准方法计算最优权重$W_{ij}$和坐标$\vec{Y_i}$。该算法只需依次执行一遍图2中的三个步骤，就可以找到式1和式2中重构代价和嵌入代价的全局最小值。
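The single pass through the three steps of Fig. 2 can be sketched with dense NumPy linear algebra. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `lle`, the regularization parameter `reg` (added so the local Gram matrix stays invertible when $K > D$), the matrix $(I-W)^{\top}(I-W)$ obtained by expanding Eq. 2, and the synthetic rolled-sheet data are all choices made for this illustration.

```python
import numpy as np

def lle(X, K, d, reg=1e-3):
    """Return (Y, W): an N x d embedding and the N x N weight matrix."""
    N = X.shape[0]

    # Step 1: K nearest neighbors by Euclidean distance.
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbor
    nbrs = np.argsort(dist2, axis=1)[:, :K]

    # Step 2: sum-to-one reconstruction weights minimizing Eq. 1.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                # neighbor offsets, shape (K, D)
        C = Z @ Z.T                          # local Gram matrix
        C += reg * np.trace(C) * np.eye(K)   # assumed regularizer for K > D
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()

    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W), which minimizes Eq. 2.
    I_minus_W = np.eye(N) - W
    M = I_minus_W.T @ I_minus_W
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    # Skip the first eigenvector (eigenvalue near zero, normally the constant
    # vector) and keep the next d, scaled so that Y^T Y = N * I.
    Y = eigvecs[:, 1:d + 1] * np.sqrt(N)
    return Y, W

# Hypothetical test data: a two-dimensional sheet rolled up in three dimensions.
rng = np.random.default_rng(0)
t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(400))
height = 20.0 * rng.random(400)
X = np.column_stack([t * np.cos(t), height, t * np.sin(t)])

Y, W = lle(X, K=12, d=2)
print(Y.shape)
```

The dense distance matrix and the dense eigendecomposition keep the sketch short but scale poorly with $N$; since each row of $W$ has only $K$ nonzero entries, a sparse eigensolver would be the natural substitute for larger data sets.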

In addition to the example in Fig. 1, for which the true manifold structure was known (10), we also applied LLE to images of faces (11) and vectors of word-document counts (12). Two-dimensional embeddings of faces and words are shown in Figs. 3 and 4. Note how the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces and the semantic associations of words.

除了图1中真实流形结构已知的示例(10)之外，我们还将LLE应用于人脸图像(11)和单词-文档计数向量(12)。人脸和单词的二维嵌入如图3和图4所示。请注意这些嵌入空间的坐标是如何与有意义的属性相关联的，例如人脸的姿态和表情以及单词的语义关联。

Fig. 3. Images of faces (11) mapped into the embedding space described by the first two coordinates of LLE.
Representative faces are shown next to circled points in different parts of the space. The bottom images correspond to points along the top-right path (linked by solid line), illustrating one particular mode of variability in pose and expression.

图3. 映射到由LLE的前两个坐标描述的嵌入空间中的脸部（11）的图像。
代表性的脸显示在空间不同部分的圆圈点旁边。底部图像对应于右上角路径上的点（用实线连接），说明了姿势和表情的一种特殊变化模式。

Fig. 4. Arranging words in a continuous semantic space.
Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors (12), resulting in an embedding location for each word. Shown are words from two different bounded regions (A) and (B) of the embedding space discovered by LLE. Each panel shows a two-dimensional projection onto the third and fourth coordinates of LLE; in these two dimensions, the regions (A) and (B) are highly overlapped. The inset in (A) shows a three-dimensional projection onto the third, fourth, and fifth coordinates, revealing an extra dimension along which regions (A) and (B) are more separated. Words that lie in the intersection of both regions are capitalized. Note how LLE colocates words with similar contexts in this continuous semantic space.

图4. 在一个连续的语义空间中排列单词。
每个单词最初由一个高维向量表示，该向量计算它在不同百科全书文章中出现的次数。将LLE应用于这些单词文档计数向量（12），从而为每个单词生成一个嵌入位置。所示为LLE发现的嵌入空间的两个不同有界区域（A）和（B）中的单词。每个面板显示一个二维投影到LLE的第三和第四个坐标上；在这两个维度中，区域（A）和（B）高度重叠。（A）中的插图显示了一个三维投影到第三、第四和第五坐标，显示了一个额外的维度，沿着该维度，区域（A）和（B）更加分离。位于两个区域相交处的单词将大写。注意LLE是如何在这个连续的语义空间中把具有相似上下文的单词组合起来的。

Many popular learning algorithms for nonlinear dimensionality reduction do not share the favorable properties of LLE. Iterative hill-climbing methods for autoencoder neural networks (13, 14), self-organizing maps (15), and latent variable models (16) do not have the same guarantees of global optimality or convergence; they also tend to involve many more free parameters, such as learning rates, convergence criteria, and architectural specifications. Finally, whereas other nonlinear methods rely on deterministic annealing schemes (17) to avoid local minima, the optimizations of LLE are especially tractable.

许多流行的非线性降维学习算法都不具备LLE的这些优良性质。用于自编码器神经网络(13,14)、自组织映射(15)和潜变量模型(16)的迭代爬山方法没有同样的全局最优性或收敛性保证；它们还往往涉及更多的自由参数，如学习率、收敛准则和网络结构设定。最后，其他非线性方法依赖确定性退火方案(17)来避免局部极小值，而LLE的优化则特别容易处理。

LLE scales well with the intrinsic manifold dimensionality, $d$, and does not require a discretized gridding of the embedding space. As more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. Unlike methods such as principal curves and surfaces (18) or additive component models (19), LLE is not limited in practice to manifolds of extremely low dimensionality or codimensionality. Also, the intrinsic value of $d$ can itself be estimated by analyzing a reciprocal cost function, in which reconstruction weights derived from the embedding vectors $\vec{Y_i}$ are applied to the data points $\vec{X_i}$.

LLE随固有流形维数$d$的扩展性良好，并且不需要对嵌入空间进行离散网格化。随着嵌入空间中加入更多维度，已有的维度不会改变，因此无需重新运行LLE即可计算更高维的嵌入。与主曲线和主曲面(18)或加性分量模型(19)等方法不同，LLE在实践中并不局限于维数或余维数极低的流形。此外，$d$的固有值本身可以通过分析一个对偶（reciprocal）代价函数来估计，在该函数中，由嵌入向量$\vec{Y_i}$得到的重构权重被应用于数据点$\vec{X_i}$。

LLE illustrates a general principle of manifold learning, elucidated by Martinetz and Schulten (20) and Tenenbaum (4), that overlapping local neighborhoods, collectively analyzed, can provide information about global geometry. Many virtues of LLE are shared by Tenenbaum’s algorithm, Isomap, which has been successfully applied to similar problems in nonlinear dimensionality reduction. Isomap’s embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE takes a different approach, analyzing local symmetries, linear coefficients, and reconstruction errors instead of global constraints, pairwise distances, and stress functions. It thus avoids the need to solve large dynamic programming problems, and it also tends to accumulate very sparse matrices, whose structure can be exploited for savings in time and space.

LLE体现了流形学习的一个一般原理，该原理由Martinetz和Schulten(20)以及Tenenbaum(4)阐明，即对相互重叠的局部邻域进行整体分析，可以提供关于全局几何的信息。Tenenbaum的算法Isomap具有LLE的许多优点，已成功应用于非线性降维中的类似问题。然而，Isomap的嵌入经过优化以保持一般数据点对之间的测地距离，而这些距离只能通过计算穿过大型数据子格的最短路径来估计。LLE采用不同的方法，分析局部对称性、线性系数和重构误差，而不是全局约束、成对距离和应力函数。因此，它避免了求解大型动态规划问题的需要，而且往往只需累积非常稀疏的矩阵，其结构可被利用以节省时间和空间。

LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning. For example, a parametric mapping between the observation and embedding spaces could be learned by supervised neural networks (21) whose target values are generated by LLE. LLE can also be generalized to harder settings, such as the case of disjoint data manifolds (22), and specialized to simpler ones, such as the case of time-ordered observations (23).

在数据分析和统计学习中，LLE与其他方法相结合可能会更加有用。例如，观察空间和嵌入空间之间的参数映射可以由监督神经网络（21）学习，其目标值由LLE生成。LLE还可以推广到更困难的设置，例如不相交的数据流形（22），也可专门用于更简单的设置，例如时序观测的情况（23）。

Perhaps the greatest potential lies in applying LLE to diverse problems beyond those considered here. Given the broad appeal of traditional methods, such as PCA and MDS, the algorithm should find widespread use in many areas of science.

也许最大的潜力在于将LLE应用于本文所讨论的问题之外的各种问题。鉴于PCA和MDS等传统方法的广泛应用，该算法应该在许多科学领域得到广泛的应用。
