MDS#


Multidimensional scaling#

  • A popular visualization technique

  • Observation: a distance / dissimilarity matrix \(D_{ij} \overset{?}{=} d(X_i, X_j)\)

  • Note: \(D\) need not be a true distance, but we usually assume

\[ D_{ij} \geq 0, D_{ii}=0 \]
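A quick numerical sketch of these assumptions (toy data assumed): pairwise Euclidean distances satisfy nonnegativity, zero diagonal, and symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                        # 6 sample points in R^3
# pairwise Euclidean distances via broadcasting
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

assert np.all(D >= 0)                              # D_ij >= 0
assert np.allclose(np.diag(D), 0)                  # D_ii = 0
assert np.allclose(D, D.T)                         # symmetry
```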

Goals of MDS#

  • Learn something about structure of the sample.

  • In MDS learn something means find a \(k\)-dimensional Euclidean representation \(\hat{X} \in \mathbb{R}^{n \times k}\)

  • We’ll call this a visualization – MKB calls it a configuration.

PCA and MDS#

  • Recall what we hope to find with rank-\(k\) PCA

\[ X \overset{D}{=} \mu + \sum_{j=1}^k \langle V_j, (X - \mu) \rangle \cdot V_j + \text{small error} \]
  • By taking only the first \(k\) principal components we get an optimal \(k\)-dimensional Euclidean representation.

  • Let \(\hat{X}_k\) be the rank \(k\) SVD of (centered) \(X\), then

\[ \hat{X}_k = U_k \Lambda_k V_k^T = \text{argmin}_{Y:\text{rank}(Y)=k} \frac{1}{2} \|X-Y\|^2_F \]

  • Consider the \(n \times n\) similarity matrix \(S=RXX^TR\), with \(R\) the centering matrix \(I-\frac{1}{n}11^T\).

  • Its rank \(k\)-approximation is

\[ U_k \Lambda_k^2 U_k^T \]
  • This is the similarity matrix of the best rank \(k\) projection of the centered data.

  • Rank \(k\) decomposition of similarity \(\to\) visualizations that (approximately) preserve distances.

  • If distances are Euclidean (or close?) then rank \(k\) decomposition of similarity yields optimal \(\hat{X}\).
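The SVD claims above can be checked numerically. A minimal sketch (toy data assumed): the rank-\(k\) truncation is the best rank-\(k\) approximation in Frobenius norm, and \(U_k \Lambda_k^2 U_k^T\) is the similarity matrix of the projected data.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)                    # centered data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Xk = (U[:, :k] * s[:k]) @ Vt[:k]           # U_k Lambda_k V_k^T

# Frobenius error of the rank-k truncation = sum of discarded singular values^2
err = np.linalg.norm(Xc - Xk, 'fro')**2
assert np.isclose(err, np.sum(s[k:]**2))

# similarity matrix S = R X X^T R and its rank-k approximation
Sk = (U[:, :k] * s[:k]**2) @ U[:, :k].T    # U_k Lambda_k^2 U_k^T
assert np.allclose(Sk, Xk @ Xk.T)          # similarity of the projected data
```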

Euclidean distance matrices#

  • We say that a distance matrix \(D_{n \times n}\) is Euclidean of dimension \(p\) if there exists a collection of points \(Y \in \mathbb{R}^{n \times p}\) such that

\[ D_{ij} = \|Y[i] - Y[j]\|_{2}\]
  • From a Gram matrix \(K\) we can form a distance matrix

\[ D_{ij} = (K_{ii} - 2 K_{ij} + K_{jj})^{1/2} \]
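The Gram-to-distance formula above can be vectorized with broadcasting; a sketch with assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(5, 3))
K = Y @ Y.T                                        # Gram matrix
d = np.diag(K)
# D_ij = (K_ii - 2 K_ij + K_jj)^{1/2}; clip tiny negative round-off
D = np.sqrt(np.maximum(d[:, None] - 2 * K + d[None, :], 0.0))

# agrees with pairwise Euclidean distances of the rows of Y
D_direct = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
assert np.allclose(D, D_direct)
```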

Gram matrices are Euclidean#

Fact: the distance matrix formed from any Gram matrix \(K\) is Euclidean with \(p=\text{rank}(K)\).#

Write \(K=U\Lambda^2U^T\) and take \(\hat{X}=U\Lambda\) from which we see

\[ K_{ij} = \hat{X}[i]^T\hat{X}[j] \]

and

\[ D_{ij} = \|\hat{X}[i]-\hat{X}[j]\|_2 \]
  • For Euclidean distance matrices, taking \(k=p\) and setting \(\hat{X} \in \mathbb{R}^{n \times p}\) we perfectly reconstruct the distances \((D_{ij})_{1 \leq i,j \leq n}\).

  • Projecting the rows of \(\hat{X}\) to a lower-dimensional space might kill some noise (and make it easier to see structure).
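The fact above in code, as a sketch: eigendecompose \(K = U\Lambda^2 U^T\), take \(\hat{X} = U\Lambda\), and check that distances are reproduced exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(6, 3))
K = Y @ Y.T                              # Gram matrix, rank p = 3

evals, U = np.linalg.eigh(K)
evals = np.clip(evals, 0, None)          # clear tiny negative round-off
Xhat = U * np.sqrt(evals)                # rows are the reconstructed points

D_Y = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
D_hat = np.linalg.norm(Xhat[:, None] - Xhat[None, :], axis=-1)
assert np.allclose(D_Y, D_hat)           # distances perfectly reconstructed
```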

Classical MDS solution#

  • Not all distance matrices are Euclidean… but we can try.

  • Given a distance matrix \(D\), define

\[ A_{ij} = -\frac{1}{2} D_{ij}^2 \]

and

\[ B = RAR \]
  • Note that if \(D\) were formed from a centered data matrix

\[ D_{ij}^2 = \|RX[i] - RX[j]\|^2_2 = (RXX^TR)_{ii} - 2(RXX^TR)_{ij} + (RXX^TR)_{jj} \]

so

\[ A = -s1^T-1s^T + RXX^TR \]

with \(s = \text{diag}(RXX^TR)/2\) and

\[ B = RXX^TR \]
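A numerical check of the derivation above (toy data assumed): double centering \(A = -\frac{1}{2}D^2\) recovers \(B = RXX^TR\) when \(D\) comes from a data matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 7
X = rng.normal(size=(n, 3))
R = np.eye(n) - np.ones((n, n)) / n      # centering matrix R = I - 11^T/n
Xc = R @ X                               # centered rows

# squared distances of the centered data
D2 = np.sum((Xc[:, None] - Xc[None, :])**2, axis=-1)
A = -0.5 * D2
B = R @ A @ R                            # double centering

assert np.allclose(B, Xc @ Xc.T)         # B = R X X^T R
```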

General case#

  • \(B=B(D)\) is symmetric so it has eigendecomposition \(U \Gamma U^T\).

  • If the first \(k\) eigenvalues of \(B\) are non-negative, then we take

    \[\hat{X}_k(D) = U_k \Gamma_k^{1/2} \in \mathbb{R}^{n \times k}\]

    as the rank \(k\) classical MDS solution.

  • What if we asked for \(k=5\) but only 3 eigenvalues are non-negative? It is not completely clear what the right thing to do is…

  • Taking all 5 retains some optimality (c.f. Theorem 14.4.2 of MKB) even though eigendecomposition would not “allow” a 5-dimensional configuration.

  • Optimality measured by

\[ \mathbb{R}^{n \times n} \ni B \mapsto \text{Tr}\left[(B(D)-B)^2\right] \]
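A sketch of the classical MDS solution described above (function name assumed): double center, eigendecompose, and keep the top \(k\) non-negative eigenvalues.

```python
import numpy as np

def classical_mds(D, k):
    """Rank-k classical MDS solution from an n x n distance matrix D."""
    n = D.shape[0]
    R = np.eye(n) - np.ones((n, n)) / n
    B = R @ (-0.5 * D**2) @ R                 # B = R A R
    evals, U = np.linalg.eigh(B)
    order = np.argsort(evals)[::-1]           # largest eigenvalues first
    evals, U = evals[order[:k]], U[:, order[:k]]
    if np.any(evals < 0):
        raise ValueError("fewer than k non-negative eigenvalues")
    return U * np.sqrt(evals)                 # U_k Gamma_k^{1/2}

# a Euclidean distance matrix of planar points is reproduced exactly
rng = np.random.default_rng(5)
Y = rng.normal(size=(8, 2))
D = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
Xhat = classical_mds(D, 2)
D_hat = np.linalg.norm(Xhat[:, None] - Xhat[None, :], axis=-1)
assert np.allclose(D, D_hat)
```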

Classical MDS solution and PCA#

  • In the PCA case, i.e. when \(D\) is formed from a centered data matrix, this is exactly the same as first \(k\) sample principal component scores.

  • Let’s start with distances from a covariance kernel \(K\):

\[\begin{split} \begin{aligned} A_{ij} &= -\frac{1}{2} D_{ij}^2 \\ &= K(X_i, X_j) - \frac{1}{2} K(X_i, X_i) - \frac{1}{2} K(X_j, X_j) \end{aligned} \end{split}\]

  • We see that

\[ RAR= RKR \]
  • Note: this is the same matrix we saw in HW problem on kernel PCA.

  • In summary: taking leading scores from kernel PCA is classical MDS with distance based on kernel \(K\).


NCI data#

nci.data = read.table(url("https://hastie.su.domains/ElemStatLearn/datasets/nci.data.csv"),
                      sep=",",
                      row.names=1,
                      header=TRUE)
nci.labels = scan(url("https://hastie.su.domains/ElemStatLearn/datasets/nci.label.txt"), what="")
nci.samples = t(as.matrix(nci.data))
D_nci = dist(nci.samples, method='euclidean')
X_mds = cmdscale(D_nci, k=2)
plot(X_mds,col=as.factor(nci.labels), xlab='1st MDS score', ylab='2nd MDS score')
(figure: scatterplot of the first two classical MDS scores of the NCI samples, colored by label)

Hmm… R and sklearn don’t agree#

Also doesn’t agree with PCA#


What does sklearn.MDS do?#

  • Let \(y \in \mathbb{R}^{n \times k}\) denote a visualization with distance matrix

\[ D(y)_{ij} = \|y[i] - y[j]\|_2. \]
  • Directly attempts to solve

\[ \text{minimize}_{y_1, \dots, y_n} \sum_{i,j} (D_{ij} - D(y)_{ij})^2 \overset{def}{=} \text{Stress}(y_1, \dots, y_n) \]
  • Implements the SMACOF algorithm (see Wikipedia), which also allows weights \(w_{ij}\).
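A sketch of this stress-minimizing (SMACOF) solver via sklearn, on assumed toy data; note this is iterative optimization, not the classical eigendecomposition:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 5))

# metric MDS: SMACOF iterations minimizing Stress(y_1, ..., y_n)
mds = MDS(n_components=2, dissimilarity='euclidean', random_state=0)
emb = mds.fit_transform(X)

print(emb.shape)        # (30, 2)
print(mds.stress_)      # final stress value (depends on random_state)
```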

How is classical MDS not just PCA?#

  • MDS methods (usually) work with \(D_{n \times n}\) rather than original features.

  • As a result, we don’t get loadings / functions on feature space.

  • Mostly used as visualization technique – not as useful for feature generation / downstream supervised procedures.


sklearn.MDS on digits#

Compare to classical solution#

Non-metric solution#

  • Similar inputs to classical case: a distance matrix \(D\) and a dimension \(k\).

  • Constraint: seeks to have order of pairwise distances preserved in the \(k\)-dimensional approximation / visualization

  • The non-metric solution tries to impose the same ordering on the values of \(D(y)\) as appears in \(D\).


  • Along with this ordering (isotonic) constraint on \(D(Y)\), it has a cost

\[ \text{Stress}_1(y_1,\dots,y_n) = \sqrt{\frac{ \inf_{D^* \sim D} \sum_{i,j} (D(y)_{ij} - D^*_{ij})^2}{\sum_{i,j} D(y)_{ij}^2}}. \]
  • Goal: directly minimize over \((y_1, \dots, y_n)\)

Kruskal algorithm#

  1. At the \(t\)-th iteration, call the configuration \(y^t\), with distances

\[ D(y^t)_{ij} = \|y^t[i] - y^t[j]\|_2 \]
  2. Use isotonic regression to find \(\widetilde{D}^t\), the distance matrix satisfying the required ordering that is closest to \(D(y^t)\).

  3. Apply the classical solution to \(\widetilde{D}^t\), yielding \(y^{t+1}\).

  4. Repeat steps 1–3 until convergence.

  • Potential problem: \(\widetilde{D}^t\) may be non-Euclidean.

  • The iteration terminates when \(Y\) is a configuration respecting the ordering that is \(k\)-Euclidean.
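A sketch of non-metric MDS via sklearn (toy data assumed); `metric=False` interleaves the isotonic-regression step with the SMACOF updates internally:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(7)
X = rng.normal(size=(25, 4))
D = pairwise_distances(X)               # input dissimilarities

# non-metric MDS: only the ordering of the D_ij matters
nmds = MDS(n_components=2, metric=False, dissimilarity='precomputed',
           random_state=0)
emb = nmds.fit_transform(D)
print(emb.shape)                        # (25, 2)
```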


Non-metric for NCI60 data#


Non-metric on digits data#

Variants of MDS#

  • Kruskal’s algorithm directly optimizes over the configuration.

  • Classical method crosses the fingers and hopes a Euclidean configuration of rank \(k\) exists.

Variants#

Graph-based distances#

  • (Essentially) uses classical MDS on distances derived from local properties of the dataset.

  • E.g. diffusion eigenmaps, ISOMAP

Cost function#

  • Some methods directly optimize over configuration as well with different loss functions.

  • Local Linear Embedding, t-SNE, local MDS

Manifold learning#

  • MDS looks to find a low-dimensional approximation to \(D\), i.e. points in a low dimension \(k\) whose Euclidean distances are close to the distances recorded in \(D\).

  • Essentially a linearization of \(D\)

  • Techniques that seek to model more general low-dimensional structure fall under manifold learning umbrella, though many (most?) involve some variant of MDS.

  • See Wikipedia on nonlinear dimension reduction


Some examples#

  • ISOMAP

  • Diffusion / Laplacian maps

  • Local linear embedding


Graph-based distances#

ISOMAP#

  • Classical MDS applied to a distance matrix formed by geodesic / graph distance on graph at scale \(r\).
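A sketch of this pipeline via sklearn's `Isomap` (k-NN graph rather than a radius-\(r\) graph; the digits data is an assumption for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
# internally: k-NN graph -> graph shortest-path distances -> classical MDS step
iso = Isomap(n_neighbors=10, n_components=2)
emb = iso.fit_transform(digits.data)
print(emb.shape)                         # (1797, 2)
```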


Graph distance#

  • Let \(D(i,j) = \text{length of the shortest path between vertices } i \text{ and } j \text{ in } G_k\)

  • We can add weights, too… let \(W=(W_{ij})\) be a weight matrix, with \(W_{ij} = \infty\) if \(i\) is not connected to \(j\).

  • Set \(D_W(i,j) = \min_{p: i \mapsto j}\sum_{e \in p} W_{e}\), where the minimum runs over all paths \(p\) connecting \(i\) to \(j\), and a path is described by the edges \(e\) it crosses.
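A sketch of \(D_W\) via scipy's shortest-path routine; absent edges (\(W_{ij}=\infty\)) are simply omitted from the sparse matrix. The small graph here is an assumption for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# a small weighted graph: edges 0--1 and 1--2 of weight 1, plus a
# heavier direct edge 0--2 of weight 5
W = csr_matrix(np.array([[0., 1., 5.],
                         [1., 0., 1.],
                         [5., 1., 0.]]))
D_W = shortest_path(W, directed=False)   # all-pairs graph distances
print(D_W[0, 2])                         # 2.0: path 0 -> 1 -> 2 beats edge 0 -> 2
```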


Source: Balasubramanian et al.

  • LHS: points in curved 3d space.

  • RHS: \(D_W\) passed through classical MDS.


  • ISOMAP can flatten data if there is a low-dimensional Euclidean configuration where the Euclidean distances well approximate the graph distances \(D_W\).

  • Note: this means the data has to be flattish in dimension \(k\) if ISOMAP is to “recover” the true structure.


Diffusion based graph distances#

  • For symmetric \(W\), let

\[ A_{ij} = e^{-W_{ij}} \]
  • Common choice

\[ A_{ij} = e^{-\kappa \|X[i]-X[j]\|^2_2/2}. \]

  • Let’s normalize the rows

\[ T_{ij} = \frac{A_{ij}}{\sum_{k:k \sim i} A_{ik}} = \frac{A_{ij}}{\pi_{i}} = (\Pi^{-1}A)_{ij} \]
  • This is a transition matrix for a Markov chain.

\[ P(X_{t+1}=j|X_{t}=i) = T_{ij}. \]
  • If a chain takes a long time to get to \(j\) starting from \(i\), then \(j\) is far from \(i\).


  • Associated to \(T\) is its generator

\[ I-T \]
  • In unnormalized setting this corresponds to

\[ L = \Pi - A. \]
  • We see \((I-T)1=0\) (and \(L1=0\)):

\[ ((I - T)f)(i) = f(i) - \sum_{j: j \sim i} T_{ij}\, f(j) \]
  • Eigenvalues of \(I-T\) are those of generalized eigenvalue problem

\[ Lf = \lambda \Pi f, \qquad \Pi^{-1/2}L\Pi^{-1/2}h = \lambda h. \]
  • Eigenfunctions corresponding to small eigenvalues are smoother functions.
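The construction above in code, as a sketch (the Gaussian affinity with \(\kappa=1\) and the toy data are assumptions): form \(A\), \(\Pi\), and \(L = \Pi - A\), then solve the symmetrized eigenproblem \(\Pi^{-1/2}L\Pi^{-1/2}h = \lambda h\).

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(12, 2))
sq = np.sum((X[:, None] - X[None, :])**2, axis=-1)
A = np.exp(-sq / 2)                       # Gaussian affinities, kappa = 1
Pi = np.diag(A.sum(axis=1))
L = Pi - A                                # unnormalized Laplacian, L1 = 0

Pi_isqrt = np.diag(1 / np.sqrt(np.diag(Pi)))
evals, H = np.linalg.eigh(Pi_isqrt @ L @ Pi_isqrt)

# smallest eigenvalue is 0, with eigenvector Pi^{1/2} 1 (a constant function)
assert np.isclose(evals[0], 0, atol=1e-8)
```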


Kernels from graph#

  • The matrix \(L\) is non-negative definite… as is its (formal) inverse, the Green’s function. Alternatively, \(\Pi^{-1/2}L\Pi^{-1/2}\) is also non-negative definite, as is its (formal) inverse.

  • Gives another distance based on a graph, different from geodesic distance…


Source: Belkin and Niyogi



---

## Changing the cost function

### Local MDS (c.f. [Chen and Buja (2009)](https://www-tandfonline-com.stanford.idm.oclc.org/doi/abs/10.1198/jasa.2009.0111), 14.9 of ESL)

- Local MDS defines a proximity graph and seeks
$$
\text{argmin}_{y_1, \dots, y_n} \sum_{(i,j): i \sim j} (D_{ij} - D(y)_{ij})^2 - \tau \sum_{(i,j): i \not \sim j} D(y)_{ij}
$$

- Encourages points not considered *close* in the graph to be far away in the
Euclidean visualization / configuration.


## Locally linear embedding

- Another method that directly optimizes over $y \in \mathbb{R}^{n \times k}$.

1. Given graph $G$ (usually some number of nearest neighbors), solve for weights ($W_{ij} \neq 0 \iff i \sim j$) 

$$
\min_{W[i]:W[i]^T1=1} \|X[i] - \sum_{j:j \sim i} W_{ij} X[j]\|^2_2
$$

(The weights are an *averaging* operator, computing norm is similar to Laplacian...)

2. With $W$ fixed, find $Y \in \mathbb{R}^{n \times k}$ 
that minimizes

$$
\sum_i\|Y[i] - \sum_{j:j \sim i}W_{ij}Y[j]\|^2_2
$$

subject to $1^TY=0, Y^TY/n=I$. (The last constraint is non-trivial: otherwise $Y=0$ would be the solution -- uninteresting.)

---

- The first constraint is automatically satisfied: by step 1, the matrix $I-W$ satisfies

$$
(I-W)1=0
$$

- Take the next $k$ eigenvectors of $(I-W)^T(I-W)$ (ordering eigenvalues from *smallest to largest*, skipping the constant eigenvector).

---

## Using `n_neighbors=5`

```{python}
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding as LLE

digits_X = load_digits().data
lle = LLE(n_neighbors=5, n_components=2)
digits_lle = lle.fit_transform(digits_X)   # n x 2 embedding
```

Using n_neighbors=10#

Using n_neighbors=15#


Source: ESL