Peter Ralph
26 February 2018 – Advanced Biological Statistics
Let’s build a conceptual model for descriptive analysis of “mixture” expression data.
Data: expression data from tissue samples that consist of various mixtures of different cell types.
Goal: identify shared coexpression patterns corresponding to cell type.
Similar situations: identifying developmental stages from whole-organism expression; identifying common community structures from metagenomic data.
Each cell type has a typical set of mean expression levels.
Each sample is composed of a mixture of cell types, defined by the proportions that come from each type.
Mean expression levels differ between cell types for only some of the genes.
Some samples are noisier than others.
Assume the same amount of sequencing of each sample.
Mean expression by cell type.
Cell type proportions by sample.
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\(Z_{ij}\) : expression level in sample \(i\) of gene \(j\).
\[\begin{aligned} Z_{ij} \approx \sum_{k=1}^K w_{ik} x_{kj} . \end{aligned}\]
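In matrix form this says \(Z \approx w x\), where \(w\) is the \(N \times K\) matrix of proportions and \(x\) is the \(K \times L\) matrix of mean expression levels. Here is a toy R sketch of that setup, just for illustration (the sizes here are arbitrary):
K <- 2; N <- 4; L <- 6
x <- matrix(rexp(K * L, rate=1/100), nrow=K)       # mean expression: K cell types x L genes
w <- matrix(rexp(N * K), nrow=N)
w <- sweep(w, 1, rowSums(w), "/")                  # each row: proportions of the K types
Z <- matrix(rpois(N * L, lambda=w %*% x), nrow=N)  # noisy observed expression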
\(y_j\), \(\eta_j\) : mean and SD of expression of gene \(j\) across all cell types; shrink \(x_{kj}\) towards \(y_j\).
data {
    int N; // # samples
    int L; // # genes
    int K; // # cell types
    int Z[N,L]; // expression counts: sample i, gene j
}
parameters {
    matrix<lower=0>[L,K] x;    // relative expression of each gene in each cell type
    vector[L] y;               // overall mean expression of each gene
    simplex[K] w[N];           // cell type proportions in each sample
    vector<lower=0>[L] eta;    // per-gene scale: eta[j] * x[j,k] is on the scale of counts
    vector<lower=0>[K] alpha;  // mean cell type proportions across samples
    real<lower=0> d_alpha;     // concentration of proportions about alpha
}
model {
    for (i in 1:N) {
        Z[i] ~ poisson(eta .* (x * w[i]));
        w[i] ~ dirichlet(d_alpha * alpha);
    }
    for (j in 1:K) {
        // shrink each cell type's profile towards the overall mean y (on the eta scale)
        x[,j] ~ normal(y ./ eta, 1);
    }
    y ~ normal(0, 20);
    alpha ~ normal(0, 1);
    d_alpha ~ exponential(0.2);
    eta ~ cauchy(0, 10);
}
\(x_{kj}\) : Mean expression of gene \(j\) in cell type \(k\).
\(w_{ik}\) : Proportion of sample \(i\) of cell type \(k\).
\[\begin{aligned} Z_{ij} \approx \sum_k w_{ik} x_{kj} . \end{aligned}\]
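(The nmf_model object below is the compiled model; the compile step isn't shown here, but it would look something like the following, with "nmf_model.stan" standing in for wherever the Stan program above is saved.)
library(rstan)
# compile the Stan program above ("nmf_model.stan" is a placeholder filename)
nmf_model <- stan_model("nmf_model.stan")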
sampling(nmf_model,
         data=list(N=10,
                   L=5,
                   K=2,
                   Z=matrix(rpois(50, 100), ncol=5)),
         chains=1, iter=100)
##
## SAMPLING FOR MODEL '027c9a0b92082dc735bfb6350861f7ee' NOW (CHAIN 1).
##
## Gradient evaluation took 5.4e-05 seconds
## 1000 transitions using 10 leapfrog steps per transition would take 0.54 seconds.
## Adjust your expectations accordingly!
##
##
## WARNING: There aren't enough warmup iterations to fit the
## three stages of adaptation as currently configured.
## Reducing each adaptation stage to 15%/75%/10% of
## the given number of warmup iterations:
## init_buffer = 7
## adapt_window = 38
## term_buffer = 5
##
## Iteration: 1 / 100 [ 1%] (Warmup)
## Iteration: 10 / 100 [ 10%] (Warmup)
## Iteration: 20 / 100 [ 20%] (Warmup)
## Iteration: 30 / 100 [ 30%] (Warmup)
## Iteration: 40 / 100 [ 40%] (Warmup)
## Iteration: 50 / 100 [ 50%] (Warmup)
## Iteration: 51 / 100 [ 51%] (Sampling)
## Iteration: 60 / 100 [ 60%] (Sampling)
## Iteration: 70 / 100 [ 70%] (Sampling)
## Iteration: 80 / 100 [ 80%] (Sampling)
## Iteration: 90 / 100 [ 90%] (Sampling)
## Iteration: 100 / 100 [100%] (Sampling)
##
## Elapsed Time: 0.832321 seconds (Warm-up)
## 1.24025 seconds (Sampling)
## 2.07257 seconds (Total)
## Warning: There were 39 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Inference for Stan model: 027c9a0b92082dc735bfb6350861f7ee.
## 1 chains, each with iter=100; warmup=50; thin=1;
## post-warmup draws per chain=50, total post-warmup draws=50.
##
## mean se_mean sd 2.5% 25% 50% 75%
## x[1,1] 1.24 0.11 0.75 0.16 0.74 1.11 1.56
## x[1,2] 1.55 0.12 0.87 0.37 0.98 1.47 2.03
## x[2,1] 0.94 0.16 0.69 0.09 0.40 0.82 1.26
## x[2,2] 1.19 0.08 0.60 0.33 0.79 1.10 1.47
## x[3,1] 1.00 0.10 0.68 0.10 0.46 0.90 1.42
## x[3,2] 1.22 0.10 0.71 0.28 0.72 1.15 1.52
## x[4,1] 1.01 0.13 0.63 0.04 0.61 0.90 1.38
## x[4,2] 1.30 0.09 0.61 0.43 0.95 1.20 1.44
## x[5,1] 1.14 0.08 0.58 0.20 0.76 1.02 1.56
## x[5,2] 1.32 0.09 0.62 0.28 0.86 1.25 1.71
## y[1] 17.72 5.98 16.01 -4.92 4.17 14.21 29.40
## y[2] -0.24 6.73 15.63 -29.54 -9.72 3.55 7.78
## y[3] 1.98 12.59 20.71 -28.37 -19.01 7.52 19.56
## y[4] 7.92 5.13 13.46 -15.87 -3.00 7.16 19.72
## y[5] 4.64 1.49 7.82 -12.16 1.29 4.61 7.32
## w[1,1] 0.30 0.05 0.25 0.03 0.11 0.22 0.40
## w[1,2] 0.70 0.05 0.25 0.13 0.60 0.78 0.89
## w[2,1] 0.29 0.05 0.23 0.04 0.13 0.22 0.32
## w[2,2] 0.71 0.05 0.23 0.19 0.68 0.78 0.87
## w[3,1] 0.31 0.05 0.24 0.02 0.12 0.23 0.44
## w[3,2] 0.69 0.05 0.24 0.17 0.56 0.77 0.88
## w[4,1] 0.32 0.05 0.24 0.04 0.14 0.27 0.44
## w[4,2] 0.68 0.05 0.24 0.16 0.56 0.73 0.86
## w[5,1] 0.30 0.06 0.27 0.02 0.08 0.19 0.45
## w[5,2] 0.70 0.06 0.27 0.11 0.55 0.81 0.92
## w[6,1] 0.31 0.05 0.24 0.03 0.11 0.24 0.43
## w[6,2] 0.69 0.05 0.24 0.12 0.57 0.76 0.89
## w[7,1] 0.30 0.05 0.23 0.02 0.11 0.24 0.43
## w[7,2] 0.70 0.05 0.23 0.21 0.57 0.76 0.89
## w[8,1] 0.32 0.04 0.23 0.05 0.14 0.25 0.43
## w[8,2] 0.68 0.04 0.23 0.17 0.57 0.75 0.86
## w[9,1] 0.29 0.06 0.28 0.01 0.09 0.17 0.47
## w[9,2] 0.71 0.06 0.28 0.10 0.53 0.83 0.91
## w[10,1] 0.30 0.05 0.24 0.04 0.10 0.20 0.43
## w[10,2] 0.70 0.05 0.24 0.17 0.57 0.80 0.90
## eta[1] 88.81 8.69 55.65 29.38 54.69 74.71 102.98
## eta[2] 120.10 11.94 70.23 40.18 72.84 103.04 139.81
## eta[3] 96.83 8.15 56.14 38.45 60.81 85.42 114.59
## eta[4] 91.64 5.35 37.85 41.25 73.75 85.95 103.66
## eta[5] 85.41 5.20 33.04 43.50 58.50 81.54 99.51
## alpha[1] 0.66 0.14 0.63 0.10 0.21 0.39 1.05
## alpha[2] 1.39 0.09 0.67 0.39 0.94 1.30 1.77
## d_alpha 12.67 1.03 6.94 4.33 7.46 10.52 15.90
## lp__ 18252.25 1.75 5.76 18243.44 18246.75 18252.26 18258.05
## 97.5% n_eff Rhat
## x[1,1] 3.07 50 0.98
## x[1,2] 3.70 50 1.02
## x[2,1] 2.46 18 0.99
## x[2,2] 2.55 50 1.05
## x[3,1] 2.50 44 1.00
## x[3,2] 2.73 50 1.10
## x[4,1] 2.30 22 0.99
## x[4,2] 2.81 50 0.98
## x[5,1] 2.32 50 0.99
## x[5,2] 2.43 43 0.98
## y[1] 49.04 7 1.18
## y[2] 28.31 5 1.15
## y[3] 32.39 3 2.77
## y[4] 29.25 7 1.10
## y[5] 19.76 28 1.02
## w[1,1] 0.87 26 0.98
## w[1,2] 0.97 26 0.98
## w[2,1] 0.81 23 0.98
## w[2,2] 0.96 23 0.98
## w[3,1] 0.83 24 0.99
## w[3,2] 0.98 24 0.99
## w[4,1] 0.84 23 0.98
## w[4,2] 0.96 23 0.98
## w[5,1] 0.89 18 0.98
## w[5,2] 0.98 18 0.98
## w[6,1] 0.88 23 0.98
## w[6,2] 0.97 23 0.98
## w[7,1] 0.79 24 0.98
## w[7,2] 0.98 24 0.98
## w[8,1] 0.83 27 0.98
## w[8,2] 0.95 27 0.98
## w[9,1] 0.90 22 0.98
## w[9,2] 0.99 22 0.98
## w[10,1] 0.83 23 0.98
## w[10,2] 0.96 23 0.98
## eta[1] 254.81 41 1.01
## eta[2] 301.68 35 1.00
## eta[3] 243.57 47 1.07
## eta[4] 186.39 50 0.99
## eta[5] 160.69 40 0.98
## alpha[1] 2.16 21 0.98
## alpha[2] 2.62 50 0.98
## d_alpha 27.86 45 1.03
## lp__ 18260.60 11 0.98
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 7 10:44:56 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
We are decomposing \(Z\) into the product of two lower-dimensional, nonnegative factors:
\[\begin{aligned} Z_{ij} &\approx \sum_k w_{ik} x_{kj} \\ w_{ik} &\ge 0 \\ x_{kj} &\ge 0 . \end{aligned}\]
This is like PCA, but sometimes more interpretable.
A random set of \(k\) proportions \(0 \le P_i \le 1\) has a \(\Dirichlet(\alpha_1, \ldots, \alpha_k)\) distribution if it has probability density \[\begin{aligned} \frac{1}{B(\alpha)} \prod_{i=1}^k p_i^{\alpha_i - 1} \end{aligned}\] over the set of possible values \[\begin{aligned} P_1 + \cdots + P_k = 1 . \end{aligned}\]
This is useful as a prior on proportions.
This generalizes the Beta: if \(X \sim \Beta(a, b)\) then \((X, 1-X) \sim \Dirichlet(a, b)\).
Marginal distributions are Beta distributed: \(P_i \sim \Beta(\alpha_i, \sum_{j=1}^k \alpha_j - \alpha_i)\).
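A Dirichlet draw can be simulated by normalizing independent Gammas: if the \(G_i\) are independent Gamma(\(\alpha_i\), 1) random variables, then \((G_1, \ldots, G_k) / (G_1 + \cdots + G_k) \sim \Dirichlet(\alpha_1, \ldots, \alpha_k)\). A quick R check of this (the helper rdirichlet below is written out for illustration, not taken from a package):
rdirichlet <- function (n, alpha) {
    # n draws from Dirichlet(alpha), by normalizing independent Gammas
    g <- matrix(rgamma(n * length(alpha), shape=alpha), nrow=n, byrow=TRUE)
    sweep(g, 1, rowSums(g), "/")
}
p <- rdirichlet(1e4, alpha=c(1, 2, 3))
colMeans(p)   # should be close to alpha/sum(alpha) = (1/6, 1/3, 1/2)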
“The \(k\)-simplex” is the set of proportions, i.e., nonnegative numbers \(p\) satisfying \[\begin{aligned} p_1 + \cdots + p_k = 1 . \end{aligned}\]
parameters {
    simplex[K] w[N];   // N vectors of K proportions, each summing to 1
}
model {
    w ~ dirichlet(d_alpha * alpha);
}
How many cell types? five
How many genes? one thousand
How many samples? one hundred
How much noise in expression? Mean expression per gene is \(\mu \sim \Exp(1/100)\); expression per sample is \(\Poisson(m)\), where \(m\) is truncated \(\Normal(\mu, \sigma \mu)\) and \(\sigma\) is around 1.
How many genes distinguish cell types, and by how much relative to expression? Ten percent of genes; a gene with mean expression \(\mu\) differs between cell types by about \(\mu/2\): means by cell type are \(\mu \times \log\Normal(\log(1/2), 1)\).
How many “noisy” genes? How many “noisy” samples? Two percent of each.
Determine expression profiles per cell type
mu <- rexp(ngenes, .01)
# copy mu into one column per cell type
type_mu <- do.call(cbind, list(mu)[rep(1,ntypes)])
# simulate how different mu is across cell types (for the informative genes)
diff_sigma <- matrix(rlnorm(ntypes * ninf, log(0.5), 1), ncol=ntypes)
type_mu[1:ninf,] <- type_mu[1:ninf,] * diff_sigma
# per-sample expression levels for each cell type (samples x genes)
m_list <- lapply(1:ntypes, function (ct) {
    matrix(pmax(0, rnorm(ngenes*nsamples,
                         type_mu[,ct], type_mu[,ct])),
           ncol=ngenes, byrow=TRUE) })
Simulate proportions per sample, and construct expected expression levels
# simulate proportions per sample
w <- matrix( rexp(nsamples*ntypes), ncol=ntypes)
w <- sweep(w, 1, rowSums(w), "/")
stopifnot(all(abs(rowSums(w) - 1) < 1e-15))
# construct expected expression levels:
m <- matrix(0, nrow=nsamples, ncol=ngenes)
for (ct in 1:ntypes) {
m <- m + (w[,ct] * m_list[[ct]])
}
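The counts Z passed to Stan would then be Poisson draws around these expected levels; that chunk isn't shown here, but following the noise model described above it would be something like:
# observed counts: Poisson noise around the expected expression m
Z <- matrix(rpois(nsamples * ngenes, lambda=m), nrow=nsamples)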
Instead of sampling(), find the posterior mode with optimizing().
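(The call that produced the output below isn't shown; it would look something like this, with the fit saved as nmf_optim, the name used in the extraction code further down.)
nmf_optim <- optimizing(nmf_model,
                        data=list(N=nsamples, L=ngenes, K=ntypes, Z=Z))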
## Initial log joint probability = 2.58746e+07
## Error evaluating model log probability: Non-finite gradient.
## Error evaluating model log probability: Non-finite gradient.
##
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
## user system elapsed
## 162.735 0.020 162.780
# pull each block of parameters out of the vector of point estimates
results <- list(x=nmf_optim$par[grepl("^x", names(nmf_optim$par))],
                y=nmf_optim$par[grepl("^y", names(nmf_optim$par))],
                w=nmf_optim$par[grepl("^w", names(nmf_optim$par))],
                eta=nmf_optim$par[grepl("^eta", names(nmf_optim$par))],
                alpha=nmf_optim$par[grepl("^alpha", names(nmf_optim$par))],
                d_alpha=nmf_optim$par[grepl("^d_alpha", names(nmf_optim$par))])
# reshape x and w back into matrices
dim(results$x) <- c(ngenes, ntypes)
dim(results$w) <- c(nsamples, ntypes)
cor(results$w, w)
## [,1] [,2] [,3]
## [1,] -0.5808138 0.9782592 -0.4166005
## [2,] -0.4704923 -0.4772288 0.9730529
## [3,] 0.9826326 -0.5158866 -0.4709883
The mean absolute error in inference of mixture proportions (\(w\)) is 0.0423191.
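Note that the factorization is only recovered up to relabeling of the cell types: in the correlation matrix above, inferred type 1 matches true type 2, inferred type 2 matches true type 3, and inferred type 3 matches true type 1. The error calculation itself isn't shown, but matching columns by their best correlation, it could be computed like this:
# match each true cell type to its best-correlated inferred column,
# then average the absolute error in proportions
perm <- apply(cor(results$w, w), 2, which.max)
mean(abs(results$w[, perm] - w))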
## [,1] [,2] [,3]
## [1,] -0.09646527 0.24097595 -0.1065245
## [2,] -0.09615290 -0.10846957 0.2138270
## [3,] 0.28938213 -0.08564512 -0.1106754
# match each true expression profile to the best-correlated inferred one
xij <- apply(cor(results$x, type_mu), 2, which.max)
x_ord <- apply(type_mu, 2, order)
layout(t(1:ntypes))
for (i in 1:ntypes) {
    plot(type_mu[x_ord[,i],i], pch=20,
         main=paste("expression profile", i))
    points((results$eta * results$x[,xij[i]])[x_ord[,i]], pch=20, col='red')
}
legend("topleft", pch=20, col=c("black", "red"),
       legend=c("truth", "inferred"))
Determine expression profiles per cell type
mu <- rexp(ngenes, .01)
# copy mu into one column per cell type
type_mu <- do.call(cbind, list(mu)[rep(1,ntypes)])
# simulate how different mu is across cell types (for the informative genes)
diff_sigma <- matrix(rlnorm(ntypes * ninf, log(0.5), 1), ncol=ntypes)
type_mu[1:ninf,] <- type_mu[1:ninf,] * diff_sigma
# per-sample expression levels for each cell type (samples x genes)
m_list <- lapply(1:ntypes, function (ct) {
    matrix(pmax(0, rnorm(ngenes*nsamples,
                         type_mu[,ct], type_mu[,ct])),
           ncol=ngenes, byrow=TRUE) })
Simulate proportions per sample, and construct expected expression levels
# simulate proportions per sample
w <- matrix( rexp(nsamples*ntypes), ncol=ntypes)
w <- sweep(w, 1, rowSums(w), "/")
stopifnot(all(abs(rowSums(w) - 1) < 1e-15))
# construct expected expression levels:
m <- matrix(0, nrow=nsamples, ncol=ngenes)
for (ct in 1:ntypes) {
m <- m + (w[,ct] * m_list[[ct]])
}
Again, find the posterior mode with optimizing().
## Initial log joint probability = 3.22418e+06
## Optimization terminated normally:
## Maximum number of iterations hit, may not be at an optima
## user system elapsed
## 22.106 0.011 22.127
results <- list(x=nmf_optim$par[grepl("^x", names(nmf_optim$par))],
                y=nmf_optim$par[grepl("^y", names(nmf_optim$par))],
                w=nmf_optim$par[grepl("^w", names(nmf_optim$par))],
                eta=nmf_optim$par[grepl("^eta", names(nmf_optim$par))],
                alpha=nmf_optim$par[grepl("^alpha", names(nmf_optim$par))],
                d_alpha=nmf_optim$par[grepl("^d_alpha", names(nmf_optim$par))])
dim(results$x) <- c(ngenes, ntypes)
dim(results$w) <- c(nsamples, ntypes)
cor(results$w, w)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.70346443 -0.35350986 0.01131823 0.19682506 -0.4302360
## [2,] -0.52692824 -0.03719675 0.09822368 0.11374785 0.3348279
## [3,] -0.02538077 0.52348238 -0.11243521 -0.25411249 -0.1954045
## [4,] -0.16689208 0.40134261 -0.13149467 0.04997887 -0.1598633
## [5,] -0.25402853 -0.34108062 0.10875600 -0.22951899 0.6125462
The mean absolute error in inference of mixture proportions (\(w\)) is 0.1560075.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.003493925 -0.100144726 -0.06501465 -0.00741866 -0.10650807
## [2,] -0.102344361 -0.076685361 -0.04067808 -0.04602893 -0.04144020
## [3,] -0.002324807 0.039969510 -0.02602720 -0.02708119 -0.03212653
## [4,] -0.036106205 -0.008097775 -0.02712279 -0.01015633 -0.03046783
## [5,] -0.020074330 -0.019037972 0.02745353 -0.01491372 0.05433943
xij <- apply(cor(results$x, type_mu), 2, which.max)
x_ord <- apply(type_mu, 2, order)
layout(t(1:ntypes))
for (i in 1:ntypes) {
    plot(type_mu[x_ord[,i],i], pch=20,
         main=paste("expression profile", i))
    points((results$eta * results$x[,xij[i]])[x_ord[,i]], pch=20, col='red')
}
legend("topleft", pch=20, col=c("black", "red"),
       legend=c("truth", "inferred"))