Nonlinear Dimensionality Reduction by Locally Linear Embedding

See allHide authors and affiliations

Science  22 Dec 2000:
Vol. 290, Issue 5500, pp. 2323-2326
DOI: 10.1126/science.290.5500.2323


Many areas of science depend on exploratory data analysis and visualization. The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction: how to discover compact representations of high-dimensional data. Here, we introduce locally linear embedding (LLE), an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single global coordinate system of lower dimensionality, and its optimizations do not involve local minima. By exploiting the local symmetries of linear reconstructions, LLE is able to learn the global structure of nonlinear manifolds, such as those generated by images of faces or documents of text.

How do we judge similarity? Our mental representations of the world are formed by processing large numbers of sensory inputs—including, for example, the pixel intensities of images, the power spectra of sounds, and the joint angles of articulated bodies. While complex stimuli of this form can be represented by points in a high-dimensional vector space, they typically have a much more compact description. Coherent structure in the world leads to strong correlations between inputs (such as between neighboring pixels in images), generating observations that lie on or close to a smooth low-dimensional manifold. To compare and classify such observations—in effect, to reason about the world—depends crucially on modeling the nonlinear geometry of these low-dimensional manifolds.

Scientists interested in exploratory analysis or visualization of multivariate data (1) face a similar problem in dimensionality reduction. The problem, as illustrated in Fig. 1, involves mapping high-dimensional inputs into a low-dimensional “description” space with as many coordinates as observed modes of variability. Previous approaches to this problem, based on multidimensional scaling (MDS) (2), have computed embeddings that attempt to preserve pairwise distances [or generalized disparities (3)] between data points; these distances are measured along straight lines or, in more sophisticated usages of MDS such as Isomap (4), along shortest paths confined to the manifold of observed inputs. Here, we take a different approach, called locally linear embedding (LLE), that eliminates the need to estimate pairwise distances between widely separated data points. Unlike previous methods, LLE recovers global nonlinear structure from locally linear fits.

Figure 1

The problem of nonlinear dimensionality reduction, as illustrated (10) for three-dimensional data (B) sampled from a two-dimensional manifold (A). An unsupervised learning algorithm must discover the global internal coordinates of the manifold without signals that explicitly indicate how the data should be embedded in two dimensions. The color coding illustrates the neighborhood-preserving mapping discovered by LLE; black outlines in (B) and (C) show the neighborhood of a single point. Unlike LLE, projections of the data by principal component analysis (PCA) (28) or classical MDS (2) map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. Note that mixture models for local dimensionality reduction (29), which cluster the data and perform PCA within each cluster, do not address the problem considered here: namely, how to map high-dimensional data into a single global coordinate system of lower dimensionality.

The LLE algorithm, summarized in Fig. 2, is based on simple geometric intuitions. Suppose the data consist of N real-valued vectors X⃗ i, each of dimensionality D, sampled from some underlying manifold. Provided there is sufficient data (such that the manifold is well-sampled), we expect each data point and its neighbors to lie on or close to a locally linear patch of the manifold. We characterize the local geometry of these patches by linear coefficients that reconstruct each data point from its neighbors. Reconstruction errors are measured by the cost functionEmbedded Image(1)which adds up the squared distances between all the data points and their reconstructions. The weightsW ij summarize the contribution of thejth data point to the ith reconstruction. To compute the weights W ij, we minimize the cost function subject to two constraints: first, that each data pointX⃗ i is reconstructed only from its neighbors (5), enforcing W ij= 0 if X⃗ j does not belong to the set of neighbors of X⃗ i; second, that the rows of the weight matrix sum to one: Σj W ij = 1. The optimal weights W ij subject to these constraints (6) are found by solving a least-squares problem (7).

Figure 2

Steps of locally linear embedding: (1) Assign neighbors to each data pointX⃗ i (for example by using theK nearest neighbors). (2) Compute the weightsW ij that best linearly reconstructX⃗ i from its neighbors, solving the constrained least-squares problem in Eq. 1. (3) Compute the low-dimensional embedding vectorsY⃗ i best reconstructed byW ij, minimizing Eq. 2 by finding the smallest eigenmodes of the sparse symmetric matrix in Eq. 3. Although the weights W ij and vectorsY i are computed by methods in linear algebra, the constraint that points are only reconstructed from neighbors can result in highly nonlinear embeddings.

The constrained weights that minimize these reconstruction errors obey an important symmetry: for any particular data point, they are invariant to rotations, rescalings, and translations of that data point and its neighbors. By symmetry, it follows that the reconstruction weights characterize intrinsic geometric properties of each neighborhood, as opposed to properties that depend on a particular frame of reference (8). Note that the invariance to translations is specifically enforced by the sum-to-one constraint on the rows of the weight matrix.

Suppose the data lie on or near a smooth nonlinear manifold of lower dimensionality d << D. To a good approximation then, there exists a linear mapping—consisting of a translation, rotation, and rescaling—that maps the high-dimensional coordinates of each neighborhood to global internal coordinates on the manifold. By design, the reconstruction weightsW ij reflect intrinsic geometric properties of the data that are invariant to exactly such transformations. We therefore expect their characterization of local geometry in the original data space to be equally valid for local patches on the manifold. In particular, the same weights W ijthat reconstruct the ith data point in Ddimensions should also reconstruct its embedded manifold coordinates ind dimensions.

LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation X⃗ i is mapped to a low-dimensional vector Y⃗ irepresenting global internal coordinates on the manifold. This is done by choosing d-dimensional coordinatesY⃗ i to minimize the embedding cost functionEmbedded Image(2)This cost function, like the previous one, is based on locally linear reconstruction errors, but here we fix the weights W ij while optimizing the coordinatesY⃗ i. The embedding cost in Eq. 2 defines a quadratic form in the vectorsY⃗ i. Subject to constraints that make the problem well-posed, it can be minimized by solving a sparse N × N eigenvalue problem (9), whose bottom d nonzero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.

Implementation of the algorithm is straightforward. In our experiments, data points were reconstructed from their K nearest neighbors, as measured by Euclidean distance or normalized dot products. For such implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K. Once neighbors are chosen, the optimal weights W ij and coordinates Y⃗ i are computed by standard methods in linear algebra. The algorithm involves a single pass through the three steps in Fig. 2 and finds global minima of the reconstruction and embedding costs in Eqs. 1 and 2.

In addition to the example in Fig. 1, for which the true manifold structure was known (10), we also applied LLE to images of faces (11) and vectors of word-document counts (12). Two-dimensional embeddings of faces and words are shown in Figs. 3 and4. Note how the coordinates of these embedding spaces are related to meaningful attributes, such as the pose and expression of human faces and the semantic associations of words.

Figure 3

Images of faces (11) mapped into the embedding space described by the first two coordinates of LLE. Representative faces are shown next to circled points in different parts of the space. The bottom images correspond to points along the top-right path (linked by solid line), illustrating one particular mode of variability in pose and expression.

Figure 4

Arranging words in a continuous semantic space. Each word was initially represented by a high-dimensional vector that counted the number of times it appeared in different encyclopedia articles. LLE was applied to these word-document count vectors (12), resulting in an embedding location for each word. Shown are words from two different bounded regions (A) and (B) of the embedding space discovered by LLE. Each panel shows a two-dimensional projection onto the third and fourth coordinates of LLE; in these two dimensions, the regions (A) and (B) are highly overlapped. The inset in (A) shows a three-dimensional projection onto the third, fourth, and fifth coordinates, revealing an extra dimension along which regions (A) and (B) are more separated. Words that lie in the intersection of both regions are capitalized. Note how LLE colocates words with similar contexts in this continuous semantic space.

Many popular learning algorithms for nonlinear dimensionality reduction do not share the favorable properties of LLE. Iterative hill-climbing methods for autoencoder neural networks (13, 14), self-organizing maps (15), and latent variable models (16) do not have the same guarantees of global optimality or convergence; they also tend to involve many more free parameters, such as learning rates, convergence criteria, and architectural specifications. Finally, whereas other nonlinear methods rely on deterministic annealing schemes (17) to avoid local minima, the optimizations of LLE are especially tractable.

LLE scales well with the intrinsic manifold dimensionality,d, and does not require a discretized gridding of the embedding space. As more dimensions are added to the embedding space, the existing ones do not change, so that LLE does not have to be rerun to compute higher dimensional embeddings. Unlike methods such as principal curves and surfaces (18) or additive component models (19), LLE is not limited in practice to manifolds of extremely low dimensionality or codimensionality. Also, the intrinsic value of d can itself be estimated by analyzing a reciprocal cost function, in which reconstruction weights derived from the embedding vectors Y⃗ i are applied to the data points X⃗ i.

LLE illustrates a general principle of manifold learning, elucidated by Martinetz and Schulten (20) and Tenenbaum (4), that overlapping local neighborhoods—collectively analyzed—can provide information about global geometry. Many virtues of LLE are shared by Tenenbaum's algorithm, Isomap, which has been successfully applied to similar problems in nonlinear dimensionality reduction. Isomap's embeddings, however, are optimized to preserve geodesic distances between general pairs of data points, which can only be estimated by computing shortest paths through large sublattices of data. LLE takes a different approach, analyzing local symmetries, linear coefficients, and reconstruction errors instead of global constraints, pairwise distances, and stress functions. It thus avoids the need to solve large dynamic programming problems, and it also tends to accumulate very sparse matrices, whose structure can be exploited for savings in time and space.

LLE is likely to be even more useful in combination with other methods in data analysis and statistical learning. For example, a parametric mapping between the observation and embedding spaces could be learned by supervised neural networks (21) whose target values are generated by LLE. LLE can also be generalized to harder settings, such as the case of disjoint data manifolds (22), and specialized to simpler ones, such as the case of time-ordered observations (23).

Perhaps the greatest potential lies in applying LLE to diverse problems beyond those considered here. Given the broad appeal of traditional methods, such as PCA and MDS, the algorithm should find widespread use in many areas of science.


View Abstract

Navigate This Article