Peter Ralph
29 February 2018 – Advanced Biological Statistics
t-SNE (t-distributed stochastic neighbor embedding) is a recently published method for dimension reduction (visualization of structure in high-dimensional data).
The general idea is to find a low-dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high-dimensional) observations.
Suppose we have \(n\) data points, \(x_1, \ldots, x_n\) and each one has \(p\) variables: \(x_1 = (x_{11}, \ldots, x_{1p})\).
For each, we want to find a low-dimensional point \(y_i\): \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] so that similarity of \(x_i\) and \(x_j\) is (somehow) reflected by proximity of \(y_i\) and \(y_j\).
We need to define what we mean by "similarity" (of the \(x\)'s) and "proximity" (of the \(y\)'s).
To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^p (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing variables to vary on comparable scales.
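In R this could look like the following sketch, with X standing for a data matrix with one row per observation (a name introduced here for illustration):

X <- scale(X)                # normalize: center and scale each variable
dmat <- as.matrix(dist(X))   # matrix of pairwise Euclidean distances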
A natural choice is to require distances in the new space (\(d(y_i, y_j)\)) to be as close as possible to distances in the original space (\(d(x_i, x_j)\)).
That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.
For each data point \(x_i\), define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and \(p_{ii} = 0\).
This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen according to a Gaussian density centered at \(x_i\).
Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and \(q_{ii} = 0\).
This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen according to a Cauchy density centered at \(y_i\).
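As a concrete version of these two definitions, here are R functions (names invented here for illustration) that compute the full matrices of \(p_{ij}\) and \(q_{ij}\):

p_matrix <- function (dmat, sigma) {
    p <- exp(- dmat^2 / (2 * sigma^2))   # Gaussian kernel
    diag(p) <- 0                         # p_ii = 0
    sweep(p, 1, rowSums(p), "/")         # normalize each row to sum to 1
}
q_matrix <- function (y) {
    q <- 1 / (1 + as.matrix(dist(y))^2)  # Cauchy kernel
    diag(q) <- 0                         # q_ii = 0
    sweep(q, 1, rowSums(q), "/")
}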
We want neighborhoods to look similar, i.e., choose \(y\) so that \(q\) looks like \(p\).
To do this, we minimize \[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]
What’s KL()?
\[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]
This is the average log-likelihood ratio between \(p\) and \(q\) for a single observation drawn from \(p\).
It is a measure of how “surprised” you would be by a sequence of samples from \(p\) that you think should be coming from \(q\).
Facts:
\(\text{KL}(p \given q) \ge 0\)
\(\text{KL}(p \given q) = 0\) if and only if \(p_{ij} = q_{ij}\) for all \(i\) and \(j\).
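Here is the divergence as an R function (a small helper written for illustration); terms with \(p_{ij} = 0\) contribute zero:

kl_divergence <- function (p, q) {
    terms <- p * log(p / q)
    sum(terms[p > 0])   # by convention, 0 * log(0/q) = 0
}

With matrices from p_matrix() and q_matrix() above, kl_divergence(p, p) is zero, and otherwise the result is positive.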
Let’s first make some data. This will be some points distributed around an ellipse in \(n\) dimensions.
n <- 20       # dimension of the data
npts <- 1e3   # number of points
xy <- matrix(rnorm(n*npts), ncol=n)   # Gaussian noise around the ellipse
theta <- runif(npts) * 2 * pi         # position (angle) along the ellipse
# two orthonormal n-vectors (the rows of ab) spanning the plane of the ellipse
ab <- matrix(rnorm(2*n), nrow=2)
ab[2,] <- ab[2,] - ab[1,] * sum(ab[1,] * ab[2,]) / sum(ab[1,]^2)  # Gram-Schmidt
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")                      # normalize the rows
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab
    xy[k,] <- xy[k,] + dxy
}
Here’s what the data look like:
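(The figure isn't reproduced here; a pairs plot of the first few coordinates, for instance, shows what looks mostly like an unstructured cloud:)

pairs(xy[, 1:4], pch=20, cex=0.5)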
But there is hidden, two-dimensional structure:
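Projecting onto the plane spanned by the rows of ab reveals it (this plotting code is a sketch, not from the original):

uv <- xy %*% t(ab)   # coordinates in the plane of the ellipse
plot(uv, asp=1, pch=20, cex=0.5, xlab="axis 1", ylab="axis 2")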
Now let’s build the distance matrix.
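(The original code isn't included in this excerpt; a minimal version, using dist:)

dmat <- as.matrix(dist(xy))   # pairwise Euclidean distances between rows of xy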
tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq; // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // scale in the kernel for p
    // pin point 1 at the origin and point 2 on the positive first axis,
    // to remove translation, rotation, and reflection degrees of freedom
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    matrix[N,k] y;
    y[1] = rep_row_vector(0.0, k);
    y[2] = rep_row_vector(0.0, k);
    y[2,1] = y21;
    y[3:N] = y1;
}
model {
    matrix[N,N] q;
    matrix[N,N] p;
    // q: Cauchy kernel between output points, normalized by row
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // p: Gaussian kernel between input points, normalized by row
    p = exp(-dsq / (2 * sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // the target is -KL(p|q)
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += -p[i,j] * log(p[i,j] / q[i,j]);
            target += -p[j,i] * log(p[j,i] / q[j,i]);
        }
    }
    sigma_sq ~ normal(0, 10);
}'
library(rstan)
tk <- 2   # number of output dimensions
tsne_model <- stan_model(model_code=tsne_block)
We find the embedding by maximizing the log posterior, using optimizing:
runtime <- system.time(tsne_optim <- optimizing(tsne_model,
                           data=list(N=nrow(xy),
                                     n=ncol(xy),
                                     k=tk,
                                     dsq=(dmat/max(dmat))^2)))
## Initial log joint probability = -273.23
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 132.816 0.036 132.881
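To see the result, we can pull the embedding out of the fitted parameters: optimizing returns a named vector with entries like y[1,1], in column-major order (the plotting details here are a sketch, not from the original):

y_hat <- matrix(tsne_optim$par[grepl("^y\\[", names(tsne_optim$par))], ncol=tk)
plot(y_hat, asp=1, pch=20, col=hsv(theta / (2 * pi)),   # color by angle along the ellipse
     xlab="t-SNE 1", ylab="t-SNE 2")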
Now let’s try the same model on a second dataset, rw.
Now let’s build the distance matrix.
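The construction of rw isn't shown in this excerpt; judging by the names, a plausible stand-in is a random walk in n dimensions (the details below, including the number of steps, are guesses):

rw <- apply(matrix(rnorm(n * 100), ncol=n), 2, cumsum)   # hypothetical: a 100-step random walk
rw_dmat <- as.matrix(dist(rw))                           # its pairwise distance matrix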
rw_runtime <- system.time(
        rw_tsne_optim <- optimizing(tsne_model,
                             data=list(N=nrow(rw),
                                       n=ncol(rw),
                                       k=tk,
                                       dsq=(rw_dmat/max(rw_dmat))^2),
                             tol_rel_grad=1e5))
## Initial log joint probability = -25.3982
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 0.878 0.000 0.879