
Dimension reduction: t-SNE in Stan

Peter Ralph

29 February 2018 – Advanced Biological Statistics

t-SNE

t-SNE

http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

t-SNE is a recently published method for dimension reduction (visualization of structure in high-dimensional data).

It makes very nice visualizations.

The general idea is to find a low-dimensional representation (i.e., a picture) in which proximity of points reflects similarity of the corresponding (high-dimensional) observations.

Suppose we have \(n\) data points, \(x_1, \ldots, x_n\) and each one has \(p\) variables: \(x_1 = (x_{11}, \ldots, x_{1p})\).

For each, we want to find a low-dimensional point \(y_i\): \[\begin{aligned} x_i \mapsto y_i \end{aligned}\] so that similarity of \(x_i\) and \(x_j\) is (somehow) reflected by proximity of \(y_i\) and \(y_j\).

We need to define:

  1. “low dimension” (usually, \(k=2\) or \(3\))
  2. “similarity” of \(x_i\) and \(x_j\)
  3. “reflected by proximity” of \(y_i\) and \(y_j\)

Similarity

To measure similarity, we just use (Euclidean) distance: \[\begin{aligned} d(x_i, x_j) = \sqrt{\sum_{\ell=1}^p (x_{i\ell} - x_{j\ell})^2} , \end{aligned}\] after first normalizing the variables so they vary on comparable scales.

Proximity

A natural choice is to require distances in the new space (\(d(y_i, y_j)\)) to be as close as possible to distances in the original space (\(d(x_i, x_j)\)).

That’s a pairwise condition. t-SNE instead tries to measure how faithfully relative distances are represented in each point’s neighborhood.

Neighborhoods

For each data point \(x_i\), define \[\begin{aligned} p_{ij} = \frac{ e^{-d(x_i, x_j)^2 / (2\sigma^2)} }{ \sum_{\ell \neq i} e^{-d(x_i, x_\ell)^2 / (2 \sigma^2)} } , \end{aligned}\] and \(p_{ii} = 0\).

This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen proportional to a Gaussian density (with standard deviation \(\sigma\)) centered at \(x_i\).

Similarly, in the output space, define \[\begin{aligned} q_{ij} = \frac{ (1 + d(y_i, y_j)^2)^{-1} }{ \sum_{\ell \neq i} (1 + d(y_i, y_\ell)^2)^{-1} } \end{aligned}\] and \(q_{ii} = 0\).

This is the probability that point \(i\) would pick point \(j\) as its neighbor, if neighbors are chosen proportional to a Cauchy density centered at \(y_i\).
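For concreteness, here is a minimal R sketch of these two neighborhood matrices, assuming `D` is the matrix of input-space distances and `y` is a candidate low-dimensional configuration (both names are just for illustration):

```r
# neighbor probabilities in the input space (Gaussian kernel, bandwidth sigma)
p_matrix <- function (D, sigma) {
    W <- exp(- D^2 / (2 * sigma^2))
    diag(W) <- 0                       # p_ii = 0
    sweep(W, 1, rowSums(W), "/")       # normalize each row to sum to one
}

# neighbor probabilities in the output space (Cauchy kernel)
q_matrix <- function (y) {
    W <- 1 / (1 + as.matrix(dist(y))^2)
    diag(W) <- 0                       # q_ii = 0
    sweep(W, 1, rowSums(W), "/")
}
```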

Similarity of neighborhoods

We want neighborhoods to look similar, i.e., choose \(y\) so that \(q\) looks like \(p\).

To do this, we minimize \[\begin{aligned} \text{KL}(p \;\|\; q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

What’s KL()?

Kullback-Leibler divergence

\[\begin{aligned} \text{KL}(p \;\|\; q) &= \sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) . \end{aligned}\]

This is the average log-likelihood ratio between \(p\) and \(q\) for a single observation drawn from \(p\).

It is a measure of how “surprised” you would be by a sequence of samples from \(p\) that you think should be coming from \(q\).
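As a quick sketch in R, for two probability vectors (or matrices with matching entries) `p` and `q`:

```r
# Kullback-Leibler divergence, with the convention 0 * log(0) = 0
kl <- function (p, q) {
    ok <- p > 0
    sum(p[ok] * log(p[ok] / q[ok]))
}

kl(c(0.2, 0.3, 0.5), c(0.2, 0.3, 0.5))   # = 0
kl(c(0.2, 0.3, 0.5), c(0.5, 0.3, 0.2))   # > 0
```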

Facts:

  1. \(\text{KL}(p \;\|\; q) \ge 0\)

  2. \(\text{KL}(p \;\|\; q) = 0\) only if \(p_{ij} = q_{ij}\) for all \(i\) and \(j\).

Simulate data

A high-dimensional donut

Let’s first make some data. This will be some points distributed around an ellipse, embedded in a larger number of dimensions.
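Here is a sketch of how such data might be simulated; the sizes, noise level, and variable names are arbitrary choices for illustration:

```r
set.seed(123)
n <- 100   # number of points
p <- 20    # number of observed dimensions

theta <- sort(runif(n, 0, 2 * pi))          # position around the hidden ring
ring <- cbind(2 * cos(theta), sin(theta))   # an ellipse in two dimensions
proj <- matrix(rnorm(2 * p), nrow = 2)      # a random embedding into p dimensions
x <- ring %*% proj + matrix(rnorm(n * p, sd = 0.1), nrow = n)
```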

Here’s what the data look like:

(plot of chunk plot_data)

But there is hidden, two-dimensional structure:

(plot of chunk plot_ring)

Now let’s build the distance matrix.
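In R this might look like the following, with `x` the data matrix from the sketch above:

```r
x_scaled <- scale(x)             # center and rescale each variable
D <- as.matrix(dist(x_scaled))   # pairwise Euclidean distances
```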

Implementation

A Stan block
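Here is a minimal sketch of a Stan block for this model, taking the distance matrix, the output dimension, and a fixed bandwidth \(\sigma\) as data; this is one way to set it up, not necessarily the exact code used for the results below:

```stan
data {
  int N;                  // number of data points
  int k;                  // output dimension (2 or 3)
  matrix[N, N] D;         // pairwise distances between the x's
  real<lower=0> sigma;    // bandwidth for the p_ij
}
transformed data {
  matrix[N, N] P;         // neighbor probabilities in the input space
  for (i in 1:N) {
    row_vector[N] w;
    for (j in 1:N)
      w[j] = (i == j) ? 0 : exp(- D[i, j]^2 / (2 * sigma^2));
    P[i] = w / sum(w);
  }
}
parameters {
  matrix[N, k] y;         // the low-dimensional configuration
}
model {
  // minimizing KL(p || q) is the same as maximizing sum_ij p_ij * log(q_ij)
  for (i in 1:N) {
    vector[N] w;
    for (j in 1:N)
      w[j] = (i == j) ? 0 : 1 / (1 + squared_distance(y[i], y[j]));
    w = w / sum(w);
    for (j in 1:N)
      if (j != i)
        target += P[i, j] * log(w[j]);
  }
}
```

Note that \(\sigma\) is held fixed here; the full t-SNE algorithm instead chooses a separate \(\sigma_i\) for each point so that every neighborhood has the same “perplexity”.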

Run the model with optimizing
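A sketch of how this might be run from R with rstan, assuming the Stan code above is saved as `tsne.stan` and `D` is the distance matrix built earlier (the bandwidth value is an arbitrary choice):

```r
library(rstan)

tsne_model <- stan_model("tsne.stan")   # compile the Stan program

system.time(
    tsne_fit <- optimizing(tsne_model,
                           data = list(N = nrow(D),
                                       k = 2,
                                       D = D,
                                       sigma = 1.0))
)
```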

## Initial log joint probability = -273.23
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
##    user  system elapsed 
## 132.816   0.036 132.881

It works!
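To make a picture like the one below, one could pull the optimized coordinates out of the named parameter vector returned by `optimizing()`; this sketch relies on rstan’s column-major naming of matrix parameters (`y[1,1]`, `y[2,1]`, …):

```r
y_names <- grep("^y\\[", names(tsne_fit$par))
y_est <- matrix(tsne_fit$par[y_names], ncol = 2)   # N x 2 matrix of coordinates
plot(y_est, asp = 1, xlab = "t-SNE 1", ylab = "t-SNE 2",
     col = rainbow(nrow(y_est)))                   # color by position in the original ordering
```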

(plot of chunk plot_tsne)

Another case

High-dimensional random walk
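A sketch of such a simulation (dimensions and names again arbitrary): each coordinate does an independent random walk, so rows that are close in time are close in the high-dimensional space.

```r
set.seed(123)
n <- 200   # number of time points
p <- 20    # number of dimensions
x <- apply(matrix(rnorm(n * p), nrow = n), 2, cumsum)   # one random walk per column
```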

Here’s what the data look like:

(plot of chunk plot_rw)

(plot of chunk plot_rw_sub)

Now let’s build the distance matrix.

Run again

## Initial log joint probability = -25.3982
## Optimization terminated normally: 
##   Convergence detected: relative gradient magnitude is below tolerance
##    user  system elapsed 
##   0.878   0.000   0.879

It works!

(plot of chunk plot_rw_tsne)