Peter Ralph
29 February 2018 – Advanced Biological Statistics
t-SNE (t-distributed stochastic neighbor embedding) is a recently published method for dimension reduction (visualization of structure in high-dimensional data).
The general idea is to find a low-dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high-dimensional) observations.
Suppose we have \(n\) data points, \(x_1, \ldots, x_n\) and each one has \(p\) variables: \(x_1 = (x_{11}, \ldots, x_{1p})\).
For each, we want to find a low-dimensional point \(y_i\): \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] so that similarity of \(x_i\) and \(x_j\) is (somehow) reflected by proximity of \(y_i\) and \(y_j\).
We need to define what we mean by "similarity" (of the \(x\)'s) and "proximity" (of the \(y\)'s).
To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^p (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing variables to vary on comparable scales.
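In R this could look like the following sketch, with X standing for a data matrix with one row per observation (a name introduced here for illustration):

X <- scale(X)                # normalize: center and scale each variable
dmat <- as.matrix(dist(X))   # matrix of pairwise Euclidean distances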
A natural choice is to require distances in the new space (\(d(y_i, y_j)\)) to be as close as possible to distances in the original space (\(d(x_i, x_j)\)).
That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.
For each data point \(x_i\), define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and \(p_{ii} = 0\).
This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen according to a Gaussian density centered at \(x_i\).
Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and \(q_{ii} = 0\).
This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen according to a Cauchy density centered at \(y_i\).
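As a concrete version of these two definitions, here are R functions (names invented here for illustration) that compute the full matrices of \(p_{ij}\) and \(q_{ij}\):

p_matrix <- function (dmat, sigma) {
    p <- exp(- dmat^2 / (2 * sigma^2))   # Gaussian kernel
    diag(p) <- 0                         # p_ii = 0
    sweep(p, 1, rowSums(p), "/")         # normalize each row to sum to 1
}
q_matrix <- function (y) {
    q <- 1 / (1 + as.matrix(dist(y))^2)  # Cauchy kernel
    diag(q) <- 0                         # q_ii = 0
    sweep(q, 1, rowSums(q), "/")
}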
We want neighborhoods to look similar, i.e., choose \(y\) so that \(q\) looks like \(p\).
To do this, we minimize \[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]
What’s KL()?
\[\begin{aligned} \text{KL}(p \given q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]
This is the average log-likelihood ratio between \(p\) and \(q\) for a single observation drawn from \(p\).
It is a measure of how “surprised” you would be by a sequence of samples from \(p\) that you think should be coming from \(q\).
Facts:
\(\text{KL}(p \given q) \ge 0\)
\(\text{KL}(p \given q) = 0\) if and only if \(p_{ij} = q_{ij}\) for all \(i\) and \(j\).
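Here is the divergence as an R function (a small helper written for illustration); terms with \(p_{ij} = 0\) contribute zero:

kl_divergence <- function (p, q) {
    terms <- p * log(p / q)
    sum(terms[p > 0])   # by convention, 0 * log(0/q) = 0
}

With matrices from p_matrix() and q_matrix() above, kl_divergence(p, p) is zero, and otherwise the result is positive.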
Let’s first make some data. This will be some points distributed around an ellipse in \(n\) dimensions.
n <- 20       # dimension of the data
npts <- 1e3   # number of points
xy <- matrix(rnorm(n*npts), ncol=n)   # Gaussian noise around the ellipse
theta <- runif(npts) * 2 * pi         # position (angle) along the ellipse
# two orthonormal n-vectors (the rows of ab) spanning the plane of the ellipse
ab <- matrix(rnorm(2*n), nrow=2)
ab[2,] <- ab[2,] - ab[1,] * sum(ab[1,] * ab[2,]) / sum(ab[1,]^2)  # Gram-Schmidt
ab <- sweep(ab, 1, sqrt(rowSums(ab^2)), "/")                      # normalize the rows
for (k in 1:npts) {
    dxy <- 4 * c(cos(theta[k]), sin(theta[k])) %*% ab
    xy[k,] <- xy[k,] + dxy
}
Here’s what the data look like:
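(The figure isn't reproduced here; a pairs plot of the first few coordinates, for instance, shows what looks mostly like an unstructured cloud:)

pairs(xy[, 1:4], pch=20, cex=0.5)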
But there is hidden, two-dimensional structure:
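Projecting onto the plane spanned by the rows of ab reveals it (this plotting code is a sketch, not from the original):

uv <- xy %*% t(ab)   # coordinates in the plane of the ellipse
plot(uv, asp=1, pch=20, cex=0.5, xlab="axis 1", ylab="axis 2")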
Now let’s build the distance matrix.
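(The original code isn't included in this excerpt; a minimal version, using dist:)

dmat <- as.matrix(dist(xy))   # pairwise Euclidean distances between rows of xy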
tsne_block <- '
data {
    int N; // number of points
    int n; // input dimension
    int k; // output dimension
    matrix[N,N] dsq; // distance matrix, squared
}
parameters {
    real<lower=0> sigma_sq; // scale in the kernel for p
    // pin point 1 at the origin and point 2 on the positive first axis,
    // to remove translation, rotation, and reflection degrees of freedom
    matrix[N-2,k] y1;
    real<lower=0> y21;
}
transformed parameters {
    matrix[N,k] y;
    y[1] = rep_row_vector(0.0, k);
    y[2] = rep_row_vector(0.0, k);
    y[2,1] = y21;
    y[3:N] = y1;
}
model {
    matrix[N,N] q;
    matrix[N,N] p;
    // q: Cauchy kernel between output points, normalized by row
    q[N,N] = 0.0;
    for (i in 1:(N-1)) {
        q[i,i] = 0.0;
        for (j in (i+1):N) {
            q[i,j] = 1 / (1 + squared_distance(y[i], y[j]));
            q[j,i] = q[i,j];
        }
    }
    for (i in 1:N) {
        q[i] = q[i] / sum(q[i]);
    }
    // p: Gaussian kernel between input points, normalized by row
    p = exp(-dsq / (2 * sigma_sq));
    for (i in 1:N) {
        p[i,i] = 0.0;
        p[i] = p[i] / sum(p[i]);
    }
    // the target is -KL(p|q)
    for (i in 1:(N-1)) {
        for (j in (i+1):N) {
            target += -p[i,j] * log(p[i,j] / q[i,j]);
            target += -p[j,i] * log(p[j,i] / q[j,i]);
        }
    }
    sigma_sq ~ normal(0, 10);
}'
library(rstan)
tk <- 2   # number of output dimensions
tsne_model <- stan_model(model_code=tsne_block)
We find the embedding by maximizing the log posterior, using optimizing:
runtime <- system.time(tsne_optim <- optimizing(tsne_model,
                           data=list(N=nrow(xy),
                                     n=ncol(xy),
                                     k=tk,
                                     dsq=(dmat/max(dmat))^2)))
## Initial log joint probability = -273.23
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 132.816 0.036 132.881
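To see the result, we can pull the embedding out of the fitted parameters: optimizing returns a named vector with entries like y[1,1], in column-major order (the plotting details here are a sketch, not from the original):

y_hat <- matrix(tsne_optim$par[grepl("^y\\[", names(tsne_optim$par))], ncol=tk)
plot(y_hat, asp=1, pch=20, col=hsv(theta / (2 * pi)),   # color by angle along the ellipse
     xlab="t-SNE 1", ylab="t-SNE 2")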
Now let’s try the same model on a second dataset, rw.
Now let’s build the distance matrix.
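The construction of rw isn't shown in this excerpt; judging by the names, a plausible stand-in is a random walk in n dimensions (the details below, including the number of steps, are guesses):

rw <- apply(matrix(rnorm(n * 100), ncol=n), 2, cumsum)   # hypothetical: a 100-step random walk
rw_dmat <- as.matrix(dist(rw))                           # its pairwise distance matrix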
rw_runtime <- system.time(
        rw_tsne_optim <- optimizing(tsne_model,
                             data=list(N=nrow(rw),
                                       n=ncol(rw),
                                       k=tk,
                                       dsq=(rw_dmat/max(rw_dmat))^2),
                             tol_rel_grad=1e5))
## Initial log joint probability = -25.3982
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 0.878 0.000 0.879