Peter Ralph
26 February 2018 – Advanced Biological Statistics
Let’s build a conceptual model for descriptive analysis of “mixture” expression data.
Data: expression data from tissue samples that consist of various mixtures of different cell types.
Goal: identify shared coexpression patterns corresponding to cell type.
Similar situations: identifying developmental stages from whole-organism expression; identifying common community structures from metagenomic data.
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
Assume the same amount of sequencing of each sample.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
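In matrix form this says \(Z \approx w x\), where \(w\) is the \(N \times K\) matrix of proportions and \(x\) is the \(K \times L\) matrix of mean expression levels. Here is a toy R sketch of that setup, just for illustration (the sizes here are arbitrary):
K <- 2; N <- 4; L <- 6
x <- matrix(rexp(K * L, rate=1/100), nrow=K)       # mean expression: K cell types x L genes
w <- matrix(rexp(N * K), nrow=N)
w <- sweep(w, 1, rowSums(w), "/")                  # each row: proportions of the K types
Z <- matrix(rpois(N * L, lambda=w %*% x), nrow=N)  # noisy observed expression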
\(y_j\), \(\eta_j\) : mean and SD of expression of gene \(j\) across all cell types; shrink \(x_{kj}\) towards \(y_j\).
data {
    int N; // # samples
    int L; // # genes
    int K; // # cell types
    int Z[N,L]; // expression counts: sample i, gene j
}
parameters {
    matrix<lower=0>[L,K] x;    // relative expression of each gene in each cell type
    vector[L] y;               // overall mean expression of each gene
    simplex[K] w[N];           // cell type proportions in each sample
    vector<lower=0>[L] eta;    // per-gene scale: eta[j] * x[j,k] is on the scale of counts
    vector<lower=0>[K] alpha;  // mean cell type proportions across samples
    real<lower=0> d_alpha;     // concentration of proportions about alpha
}
model {
    for (i in 1:N) {
        Z[i] ~ poisson(eta .* (x * w[i]));
        w[i] ~ dirichlet(d_alpha * alpha);
    }
    for (j in 1:K) {
        // shrink each cell type's profile towards the overall mean y (on the eta scale)
        x[,j] ~ normal(y ./ eta, 1);
    }
    y ~ normal(0, 20);
    alpha ~ normal(0, 1);
    d_alpha ~ exponential(0.2);
    eta ~ cauchy(0, 10);
}
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]
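(The nmf_model object below is the compiled model; the compile step isn't shown here, but it would look something like the following, with "nmf_model.stan" standing in for wherever the Stan program above is saved.)
library(rstan)
# compile the Stan program above ("nmf_model.stan" is a placeholder filename)
nmf_model <- stan_model("nmf_model.stan")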
sampling(nmf_model,
         data=list(N=10,
                   L=5,
                   K=2,
                   Z=matrix(rpois(50, 100), ncol=5)),
         chains=1, iter=100)
##
## SAMPLING FOR MODEL '027c9a0b92082dc735bfb6350861f7ee' NOW (CHAIN 1).
##
## Gradient evaluation took 5.4e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.54 seconds.
## Adjust your expectations accordingly!
##
##
## WARNING: There aren't enough warmup iterations to fit the
## three stages of adaptation as currently configured.
## Reducing each adaptation stage to 15%/75%/10% of
## the given number of warmup iterations:
## init_buffer = 7
## adapt_window = 38
## term_buffer = 5
##
## Iteration: 1 / 100 [ 1%] (Warmup)
## Iteration: 10 / 100 [ 10%] (Warmup)
## Iteration: 20 / 100 [ 20%] (Warmup)
## Iteration: 30 / 100 [ 30%] (Warmup)
## Iteration: 40 / 100 [ 40%] (Warmup)
## Iteration: 50 / 100 [ 50%] (Warmup)
## Iteration: 51 / 100 [ 51%] (Sampling)
## Iteration: 60 / 100 [ 60%] (Sampling)
## Iteration: 70 / 100 [ 70%] (Sampling)
## Iteration: 80 / 100 [ 80%] (Sampling)
## Iteration: 90 / 100 [ 90%] (Sampling)
## Iteration: 100 / 100 [100%] (Sampling)
##
## Elapsed Time: 0.832321 seconds (Warm-up)
## 1.24025 seconds (Sampling)
## 2.07257 seconds (Total)
## Warning: There were 39 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Inference for Stan model: 027c9a0b92082dc735bfb6350861f7ee.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75%
## x[1,1] 1.24 0.11 0.75 0.16 0.74 1.11 1.56
## x[1,2] 1.55 0.12 0.87 0.37 0.98 1.47 2.03
## x[2,1] 0.94 0.16 0.69 0.09 0.40 0.82 1.26
## x[2,2] 1.19 0.08 0.60 0.33 0.79 1.10 1.47
## x[3,1] 1.00 0.10 0.68 0.10 0.46 0.90 1.42
## x[3,2] 1.22 0.10 0.71 0.28 0.72 1.15 1.52
## x[4,1] 1.01 0.13 0.63 0.04 0.61 0.90 1.38
## x[4,2] 1.30 0.09 0.61 0.43 0.95 1.20 1.44
## x[5,1] 1.14 0.08 0.58 0.20 0.76 1.02 1.56
## x[5,2] 1.32 0.09 0.62 0.28 0.86 1.25 1.71
## y[1] 17.72 5.98 16.01 -4.92 4.17 14.21 29.40
## y[2] -0.24 6.73 15.63 -29.54 -9.72 3.55 7.78
## y[3] 1.98 12.59 20.71 -28.37 -19.01 7.52 19.56
## y[4] 7.92 5.13 13.46 -15.87 -3.00 7.16 19.72
## y[5] 4.64 1.49 7.82 -12.16 1.29 4.61 7.32
## w[1,1] 0.30 0.05 0.25 0.03 0.11 0.22 0.40
## w[1,2] 0.70 0.05 0.25 0.13 0.60 0.78 0.89
## w[2,1] 0.29 0.05 0.23 0.04 0.13 0.22 0.32
## w[2,2] 0.71 0.05 0.23 0.19 0.68 0.78 0.87
## w[3,1] 0.31 0.05 0.24 0.02 0.12 0.23 0.44
## w[3,2] 0.69 0.05 0.24 0.17 0.56 0.77 0.88
## w[4,1] 0.32 0.05 0.24 0.04 0.14 0.27 0.44
## w[4,2] 0.68 0.05 0.24 0.16 0.56 0.73 0.86
## w[5,1] 0.30 0.06 0.27 0.02 0.08 0.19 0.45
## w[5,2] 0.70 0.06 0.27 0.11 0.55 0.81 0.92
## w[6,1] 0.31 0.05 0.24 0.03 0.11 0.24 0.43
## w[6,2] 0.69 0.05 0.24 0.12 0.57 0.76 0.89
## w[7,1] 0.30 0.05 0.23 0.02 0.11 0.24 0.43
## w[7,2] 0.70 0.05 0.23 0.21 0.57 0.76 0.89
## w[8,1] 0.32 0.04 0.23 0.05 0.14 0.25 0.43
## w[8,2] 0.68 0.04 0.23 0.17 0.57 0.75 0.86
## w[9,1] 0.29 0.06 0.28 0.01 0.09 0.17 0.47
## w[9,2] 0.71 0.06 0.28 0.10 0.53 0.83 0.91
## w[10,1] 0.30 0.05 0.24 0.04 0.10 0.20 0.43
## w[10,2] 0.70 0.05 0.24 0.17 0.57 0.80 0.90
## eta[1] 88.81 8.69 55.65 29.38 54.69 74.71 102.98
## eta[2] 120.10 11.94 70.23 40.18 72.84 103.04 139.81
## eta[3] 96.83 8.15 56.14 38.45 60.81 85.42 114.59
## eta[4] 91.64 5.35 37.85 41.25 73.75 85.95 103.66
## eta[5] 85.41 5.20 33.04 43.50 58.50 81.54 99.51
## alpha[1] 0.66 0.14 0.63 0.10 0.21 0.39 1.05
## alpha[2] 1.39 0.09 0.67 0.39 0.94 1.30 1.77
## d_alpha 12.67 1.03 6.94 4.33 7.46 10.52 15.90
## lp__ 18252.25 1.75 5.76 18243.44 18246.75 18252.26 18258.05
## 97.5% n_eff Rhat
## x[1,1] 3.07 50 0.98
## x[1,2] 3.70 50 1.02
## x[2,1] 2.46 18 0.99
## x[2,2] 2.55 50 1.05
## x[3,1] 2.50 44 1.00
## x[3,2] 2.73 50 1.10
## x[4,1] 2.30 22 0.99
## x[4,2] 2.81 50 0.98
## x[5,1] 2.32 50 0.99
## x[5,2] 2.43 43 0.98
## y[1] 49.04 7 1.18
## y[2] 28.31 5 1.15
## y[3] 32.39 3 2.77
## y[4] 29.25 7 1.10
## y[5] 19.76 28 1.02
## w[1,1] 0.87 26 0.98
## w[1,2] 0.97 26 0.98
## w[2,1] 0.81 23 0.98
## w[2,2] 0.96 23 0.98
## w[3,1] 0.83 24 0.99
## w[3,2] 0.98 24 0.99
## w[4,1] 0.84 23 0.98
## w[4,2] 0.96 23 0.98
## w[5,1] 0.89 18 0.98
## w[5,2] 0.98 18 0.98
## w[6,1] 0.88 23 0.98
## w[6,2] 0.97 23 0.98
## w[7,1] 0.79 24 0.98
## w[7,2] 0.98 24 0.98
## w[8,1] 0.83 27 0.98
## w[8,2] 0.95 27 0.98
## w[9,1] 0.90 22 0.98
## w[9,2] 0.99 22 0.98
## w[10,1] 0.83 23 0.98
## w[10,2] 0.96 23 0.98
## eta[1] 254.81 41 1.01
## eta[2] 301.68 35 1.00
## eta[3] 243.57 47 1.07
## eta[4] 186.39 50 0.99
## eta[5] 160.69 40 0.98
## alpha[1] 2.16 21 0.98
## alpha[2] 2.62 50 0.98
## d_alpha 27.86 45 1.03
## lp__ 18260.60 11 0.98
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 7 10:44:56 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We are decomposing \(Z\) into the product of two lower-dimensional, nonnegative factors:
\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]
This is like PCA, but sometimes more interpretable.
A random set of \(k\) proportions \(0 \le P_i \le 1\) has a \(\Dirichlet(\alpha_1, \ldots, \alpha_k)\) distribution if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]
This is useful as a prior on proportions.
This generalizes the Beta: if \(X \sim \Beta(a, b)\) then \((X, 1-X) \sim \Dirichlet(a, b)\).
Marginal distributions are Beta distributed: \(P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)\).
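A Dirichlet draw can be simulated by normalizing independent Gammas: if the \(G_i\) are independent Gamma(\(\alpha_i\), 1) random variables, then \((G_1, \ldots, G_k) / (G_1 + \cdots + G_k) \sim \Dirichlet(\alpha_1, \ldots, \alpha_k)\). A quick R check of this (the helper rdirichlet below is written out for illustration, not taken from a package):
rdirichlet <- function (n, alpha) {
    # n draws from Dirichlet(alpha), by normalizing independent Gammas
    g <- matrix(rgamma(n * length(alpha), shape=alpha), nrow=n, byrow=TRUE)
    sweep(g, 1, rowSums(g), "/")
}
p <- rdirichlet(1e4, alpha=c(1, 2, 3))
colMeans(p)   # should be close to alpha/sum(alpha) = (1/6, 1/3, 1/2)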
“The \(k\)-simplex” is the set of proportions, i.e., nonnegative numbers \(p\) satisfying \[\begin{aligned} p_1 + \cdots + p_k = 1 . \end{aligned}\]
parameters {
    simplex[K] w[N];   // N vectors of K proportions, each summing to 1
}
model {
    w ~ dirichlet(d_alpha * alpha);
}
How many cell types? five
How many genes? one thousand
How many samples? one hundred
How much noise in expression? Mean expression per gene is \(\mu \sim \Exp(1/100)\); expression per sample is \(\Poisson(m)\), where \(m\) is truncated \(\Normal(\mu, \sigma \mu)\) and \(\sigma\) is around 1.
How many genes distinguish cell types, and by how much relative to expression? Ten percent of genes; a gene with mean expression \(\mu\) differs between cell types by about \(\mu/2\): means by cell type are \(\mu \times \log\Normal(\log(1/2), 1)\).
How many “noisy” genes? How many “noisy” samples? Two percent of each.
Determine expression profiles per cell type
mu <- rexp(ngenes, .01)
# copy mu into one column per cell type
type_mu <- do.call(cbind, list(mu)[rep(1,ntypes)])
# simulate how different mu is across cell types (for the informative genes)
diff_sigma <- matrix(rlnorm(ntypes * ninf, log(0.5), 1), ncol=ntypes)
type_mu[1:ninf,] <- type_mu[1:ninf,] * diff_sigma
# per-sample expression levels for each cell type (samples x genes)
m_list <- lapply(1:ntypes, function (ct) {
    matrix(pmax(0, rnorm(ngenes*nsamples,
                         type_mu[,ct], type_mu[,ct])),
           ncol=ngenes, byrow=TRUE) })
Simulate proportions per sample, and construct expected expression levels
# simulate proportions per sample
w <- matrix( rexp(nsamples*ntypes), ncol=ntypes)
w <- sweep(w, 1, rowSums(w), "/")
stopifnot(all(abs(rowSums(w) - 1) < 1e-15))
# construct expected expression levels:
m <- matrix(0, nrow=nsamples, ncol=ngenes)
for (ct in 1:ntypes) {
m <- m + (w[,ct] * m_list[[ct]])
}
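The counts Z passed to Stan would then be Poisson draws around these expected levels; that chunk isn't shown here, but following the noise model described above it would be something like:
# observed counts: Poisson noise around the expected expression m
Z <- matrix(rpois(nsamples * ngenes, lambda=m), nrow=nsamples)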
Instead of sampling(), find the posterior mode with optimizing().
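(The call that produced the output below isn't shown; it would look something like this, with the fit saved as nmf_optim, the name used in the extraction code further down.)
nmf_optim <- optimizing(nmf_model,
                        data=list(N=nsamples, L=ngenes, K=ntypes, Z=Z))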
## Initial log joint probability = 2.58746e+07
## Error evaluating model log probability: Non-finite gradient.
## Error evaluating model log probability: Non-finite gradient.
##
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 162.735 0.020 162.780
# pull each block of parameters out of the vector of point estimates
results <- list(x=nmf_optim$par[grepl("^x", names(nmf_optim$par))],
                y=nmf_optim$par[grepl("^y", names(nmf_optim$par))],
                w=nmf_optim$par[grepl("^w", names(nmf_optim$par))],
                eta=nmf_optim$par[grepl("^eta", names(nmf_optim$par))],
                alpha=nmf_optim$par[grepl("^alpha", names(nmf_optim$par))],
                d_alpha=nmf_optim$par[grepl("^d_alpha", names(nmf_optim$par))])
# reshape x and w back into matrices
dim(results$x) <- c(ngenes, ntypes)
dim(results$w) <- c(nsamples, ntypes)
cor(results$w, w)
## [,1] [,2] [,3]
## [1,] -0.5808138 0.9782592 -0.4166005
## [2,] -0.4704923 -0.4772288 0.9730529
## [3,] 0.9826326 -0.5158866 -0.4709883
The mean absolute error in inference of mixture proportions (\(w\)) is 0.0423191.
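Note that the factorization is only recovered up to relabeling of the cell types: in the correlation matrix above, inferred type 1 matches true type 2, inferred type 2 matches true type 3, and inferred type 3 matches true type 1. The error calculation itself isn't shown, but matching columns by their best correlation, it could be computed like this:
# match each true cell type to its best-correlated inferred column,
# then average the absolute error in proportions
perm <- apply(cor(results$w, w), 2, which.max)
mean(abs(results$w[, perm] - w))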
## [,1] [,2] [,3]
## [1,] -0.09646527 0.24097595 -0.1065245
## [2,] -0.09615290 -0.10846957 0.2138270
## [3,] 0.28938213 -0.08564512 -0.1106754
# match each true expression profile to the best-correlated inferred one
xij <- apply(cor(results$x, type_mu), 2, which.max)
x_ord <- apply(type_mu, 2, order)
layout(t(1:ntypes))
for (i in 1:ntypes) {
    plot(type_mu[x_ord[,i],i], pch=20,
         main=paste("expression profile", i))
    points((results$eta * results$x[,xij[i]])[x_ord[,i]], pch=20, col='red')
}
legend("topleft", pch=20, col=c("black", "red"),
       legend=c("truth", "inferred"))
Determine expression profiles per cell type
mu <- rexp(ngenes, .01)
# copy mu into one column per cell type
type_mu <- do.call(cbind, list(mu)[rep(1,ntypes)])
# simulate how different mu is across cell types (for the informative genes)
diff_sigma <- matrix(rlnorm(ntypes * ninf, log(0.5), 1), ncol=ntypes)
type_mu[1:ninf,] <- type_mu[1:ninf,] * diff_sigma
# per-sample expression levels for each cell type (samples x genes)
m_list <- lapply(1:ntypes, function (ct) {
    matrix(pmax(0, rnorm(ngenes*nsamples,
                         type_mu[,ct], type_mu[,ct])),
           ncol=ngenes, byrow=TRUE) })
Simulate proportions per sample, and construct expected expression levels
# simulate proportions per sample
w <- matrix( rexp(nsamples*ntypes), ncol=ntypes)
w <- sweep(w, 1, rowSums(w), "/")
stopifnot(all(abs(rowSums(w) - 1) < 1e-15))
# construct expected expression levels:
m <- matrix(0, nrow=nsamples, ncol=ngenes)
for (ct in 1:ntypes) {
m <- m + (w[,ct] * m_list[[ct]])
}
Again, find the posterior mode with optimizing().
## Initial log joint probability = 3.22418e+06
## Optimization terminated normally:
## Maximum number of iterations hit, may not be at an optima
## user system elapsed
## 22.106 0.011 22.127
results <- list(x=nmf_optim$par[grepl("^x", names(nmf_optim$par))],
                y=nmf_optim$par[grepl("^y", names(nmf_optim$par))],
                w=nmf_optim$par[grepl("^w", names(nmf_optim$par))],
                eta=nmf_optim$par[grepl("^eta", names(nmf_optim$par))],
                alpha=nmf_optim$par[grepl("^alpha", names(nmf_optim$par))],
                d_alpha=nmf_optim$par[grepl("^d_alpha", names(nmf_optim$par))])
dim(results$x) <- c(ngenes, ntypes)
dim(results$w) <- c(nsamples, ntypes)
cor(results$w, w)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.70346443 -0.35350986 0.01131823 0.19682506 -0.4302360
## [2,] -0.52692824 -0.03719675 0.09822368 0.11374785 0.3348279
## [3,] -0.02538077 0.52348238 -0.11243521 -0.25411249 -0.1954045
## [4,] -0.16689208 0.40134261 -0.13149467 0.04997887 -0.1598633
## [5,] -0.25402853 -0.34108062 0.10875600 -0.22951899 0.6125462
The mean absolute error in inference of mixture proportions (\(w\)) is 0.1560075.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.003493925 -0.100144726 -0.06501465 -0.00741866 -0.10650807
## [2,] -0.102344361 -0.076685361 -0.04067808 -0.04602893 -0.04144020
## [3,] -0.002324807 0.039969510 -0.02602720 -0.02708119 -0.03212653
## [4,] -0.036106205 -0.008097775 -0.02712279 -0.01015633 -0.03046783
## [5,] -0.020074330 -0.019037972 0.02745353 -0.01491372 0.05433943
xij <- apply(cor(results$x, type_mu), 2, which.max)
x_ord <- apply(type_mu, 2, order)
layout(t(1:ntypes))
for (i in 1:ntypes) {
    plot(type_mu[x_ord[,i],i], pch=20,
         main=paste("expression profile", i))
    points((results$eta * results$x[,xij[i]])[x_ord[,i]], pch=20, col='red')
}
legend("topleft", pch=20, col=c("black", "red"),
       legend=c("truth", "inferred"))