Peter Ralph
8 March 2018 – Advanced Biological Statistics
A time series is a sequence of observations \[\begin{aligned} (y_1, y_2, \ldots, y_N) , \end{aligned}\] that were taken at some set of times \[\begin{aligned} t_1 < t_2 < \cdots < t_N . \end{aligned}\]
In general, the goal is to understand how what happens next depends on the previous history and maybe some predictor variables \[\begin{aligned} (x_1, x_2, \ldots, x_N) , \end{aligned}\] taken at the same set of times.
The simplest time series model is purely phenomenological: \[\begin{aligned} y_{k+1} &= \alpha + \beta y_k + \epsilon_k \\ \epsilon_k &\sim N(0, \sigma^2) . \end{aligned}\]
This is an autoregressive model of order 1, or “AR(1)”.
Rewriting this as \[\begin{aligned} \left(y_{k+1} - \frac{\alpha}{1-\beta}\right) &= \beta \left(y_k - \frac{\alpha}{1-\beta}\right) + \epsilon_k \\ \epsilon_k &\sim N(0, \sigma^2) , \end{aligned}\] we see that if \(|\beta| < 1\) then the process is stable, fluctuating about its equilibrium \(\alpha / (1-\beta)\).
First, let’s simulate some data.
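The simulation and model code aren’t shown in this excerpt; here is a minimal sketch of the kind of code that could produce a fit like the summary below (the parameter values and all variable names here are assumptions).
library(rstan)
N <- 100
true_ar1 <- list(alpha=4.5, beta=0.3, sigma=0.5)   # assumed "true" values
y <- numeric(N)
y[1] <- true_ar1$alpha / (1 - true_ar1$beta)       # start near the equilibrium
for (k in 1:(N-1)) {
    y[k+1] <- true_ar1$alpha + true_ar1$beta * y[k] + rnorm(1, 0, true_ar1$sigma)
}
ar1_block <- "
data {
    int N;
    vector[N] y;
}
parameters {
    real alpha;
    real beta;
    real<lower=0> sigma;
}
model {
    y[2:N] ~ normal(alpha + beta * y[1:(N-1)], sigma);
}
"
ar1_model <- stan_model(model_code=ar1_block)
ar1_fit <- sampling(ar1_model, data=list(N=N, y=y), iter=3000, chains=3)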
## Inference for Stan model: 6ff5c9ed423ae7b3b71dcd556630e896.
## 3 chains, each with iter=3000; warmup=1500; thin=1;
## post-warmup draws per chain=1500, total post-warmup draws=4500.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 4.47 0.02 0.61 3.29 4.04 4.45 4.87 5.69 1591 1
## beta 0.29 0.00 0.10 0.10 0.23 0.29 0.36 0.48 1598 1
## sigma 0.48 0.00 0.03 0.42 0.45 0.48 0.50 0.56 2087 1
## lp__ 23.47 0.03 1.22 20.29 22.94 23.77 24.34 24.86 1227 1
##
## Samples were drawn using NUTS(diag_e) at Wed Mar 7 10:46:45 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Just like before: \[\begin{aligned} y_{k+1} &= A + B y_k + \epsilon_k \\ \epsilon_k &\sim N(0, \Sigma) , \end{aligned}\]
except now \(y_k\) and \(A\) are vectors, \(B\) is a matrix, and the noise \(\epsilon_k\) is multivariate Gaussian with covariance matrix \(\Sigma\).
If a random vector \((\epsilon_1, \ldots, \epsilon_n)\) is multivariate Gaussian with mean vector \(\mu\) and covariance matrix \(\Sigma\) then \[\begin{aligned} \epsilon_k &\sim N(\mu_k, \Sigma_{kk}) , \end{aligned}\] and \[\begin{aligned} \cov[\epsilon_i, \epsilon_j] &= \Sigma_{ij} . \end{aligned}\]
Also (remarkably), for any vector \(z\), \[\begin{aligned} \sum_i z_i \epsilon_i &\sim N(\sum_i z_i \mu_i, \sum_{ij} z_i \Sigma_{ij} z_j) ,\\ \text{i.e.,}\quad z^T \epsilon &\sim N(z^T \mu, z^T \Sigma z) . \end{aligned}\]
In other words, a multivariate Gaussian vector has a distribution that looks Gaussian in every direction.
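As a quick empirical check of this fact (a sketch, with an arbitrary choice of \(\Sigma\) and \(z\)):
library(mvtnorm)
Sigma <- matrix(c(1, 0.5, 0.5, 2), nrow=2)     # an arbitrary covariance matrix
eps <- rmvnorm(1e5, mean=c(0, 0), sigma=Sigma)
z <- c(2, -1)
var(eps %*% z)         # empirical variance of z^T epsilon
t(z) %*% Sigma %*% z   # theoretical value, z^T Sigma z = 4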
The matrix \(\Sigma\) must be symmetric: \[\begin{aligned} \Sigma_{ij} = \cov[\epsilon_i, \epsilon_j] = \cov[\epsilon_j, \epsilon_i] = \Sigma_{ji} . \end{aligned}\]
Since variances are positive, \[\begin{aligned} \var[z^T \epsilon] &= z^T \Sigma z > 0 , \qquad \text{for any nonzero $z$} . \end{aligned}\]
These conditions constrain \(\Sigma\): it must be symmetric and positive definite.
Note: actually, nonnegative definite would suffice, but Stan only deals with “full rank” multivariate Gaussians.
How to declare a covariance matrix:
cov_matrix[K] Sigma;
A useful prior:
Sigma ~ wishart(nu, S);
If \(X\) is an \(N \times K\) matrix, each row containing an independent sample from \(\Normal(0, S)\), then \[\begin{aligned} X^T X &\sim \Wishart(N, S) . \end{aligned}\]
Challenge: sample from the Wishart, with N=10
and \[\begin{aligned}
S = \begin{bmatrix} 1 & 1/2 & 0 \\
1/2 & 1 & 1/4 \\
0 & 1/4 & 2
\end{bmatrix} .
\end{aligned}\]
library(mvtnorm)
S <- matrix( c(1, 1/2, 0, 1/2, 1, 1/4, 0, 1/4, 2), nrow=3)
X <- rmvnorm(10, mean=rep(0,3), sigma=S)  # ten independent rows from N(0, S)
Sigma <- crossprod(X)                     # t(X) %*% X, a draw from Wishart(10, S)
X
## [,1] [,2] [,3]
## [1,] 0.633593508 1.91531130 1.0339688
## [2,] 0.202938277 -0.06591836 1.7092590
## [3,] -1.558708971 -0.77679237 1.1441345
## [4,] 1.616767409 0.84333610 -1.6951287
## [5,] 0.016669834 -0.66704282 -0.8543952
## [6,] 0.338531008 -0.48013294 1.0009107
## [7,] -1.276957390 -1.47534697 0.6566716
## [8,] 0.001619262 -0.27211542 0.1809046
## [9,] -1.232654633 -0.26403322 1.6616964
## [10,] -0.618324299 -0.10891272 -0.5674274
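For comparison, base R can draw Wishart matrices directly with stats::rWishart(); a quick cross-check of the construction above:
stats::rWishart(1, df=10, Sigma=S)[,,1]   # one draw from Wishart(10, S), as a 3x3 matrix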
Just as positive numbers have well-defined square roots, so do symmetric, positive definite matrices: the Cholesky decomposition writes \(\Sigma = R^T R\), where \(R\) is upper triangular.
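The code that produced the output below isn’t shown; here is a sketch of the same computation (the 4x4 covariance matrix is an assumption, built here from random data), printing a covariance matrix, its Cholesky factor from chol(), and the reconstruction:
A <- matrix(rnorm(400), ncol=4)
Sigma <- cov(A)       # a symmetric, positive definite 4x4 matrix
R <- chol(Sigma)      # upper-triangular Cholesky factor: Sigma = t(R) %*% R
Sigma
R
t(R) %*% R            # recovers Sigma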
## [,1] [,2] [,3] [,4]
## [1,] 1.11291048 -0.04853800 0.05299169 -0.05088764
## [2,] -0.04853800 1.03000432 0.02917037 -0.02774929
## [3,] 0.05299169 0.02917037 1.33573885 0.03820457
## [4,] -0.05088764 -0.02774929 0.03820457 0.83080347
## [,1] [,2] [,3] [,4]
## [1,] 1.054946 -0.04600995 0.05023167 -0.04823721
## [2,] 0.000000 1.01384782 0.03105153 -0.02955935
## [3,] 0.000000 0.00000000 1.15423197 0.03599404
## [4,] 0.000000 0.00000000 0.00000000 0.90901448
## [,1] [,2] [,3] [,4]
## [1,] 1.11291048 -0.04853800 0.05299169 -0.05088764
## [2,] -0.04853800 1.03000432 0.02917037 -0.02774929
## [3,] 0.05299169 0.02917037 1.33573885 0.03820457
## [4,] -0.05088764 -0.02774929 0.03820457 0.83080347
Fact: to simulate from \[\begin{aligned} x &\sim \Normal(0, \Sigma) , \end{aligned}\] you can multiply a vector of independent standard Normals by the Cholesky factor of \(\Sigma\):
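The chunk that produced the output below isn’t shown; here is a sketch of the construction, reusing Sigma and R from the previous chunk (the number of samples is a guess):
Z <- matrix(rnorm(1e4 * 4), ncol=4)   # independent standard Normals, one row per sample
W <- Z %*% R                          # each row now has covariance t(R) %*% R = Sigma
cov(W)                                # close to Sigma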
## [,1] [,2] [,3] [,4]
## [1,] 1.11580841 -0.04972242 0.05386599 -0.05080609
## [2,] -0.04972242 1.03417832 0.02980374 -0.02751660
## [3,] 0.05386599 0.02980374 1.33233655 0.03673243
## [4,] -0.05080609 -0.02751660 0.03673243 0.82966394
Recall the multivariate AR(1) model: \[\begin{aligned} y_{k+1} &= A + B y_k + \epsilon_k \\ \epsilon_k &\sim N(0, \Sigma) , \end{aligned}\] with \(y_k\) and \(A\) vectors, \(B\) a matrix, and \(\Sigma\) a covariance matrix.
First, let’s simulate some data, where \[\begin{aligned} y_1(t+1) &= 1 + 0.8 y_1(t) + 0.2 (y_2(t) - y_3(t)) + \epsilon_1(t) \\ y_2(t+1) &= 2 + 0.9 y_2(t) + \epsilon_2(t) \\ y_3(t+1) &= 3 + 0.9 y_3(t) + \epsilon_3(t) \end{aligned}\] and the noise is correlated: \[\begin{aligned} \Sigma &= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0.5 \\ 0 & 0.5 & 1 \end{bmatrix} \end{aligned}\] i.e., \(y_3\) and \(y_2\) tend to go up and down together.
library(mvtnorm) # for rmvnorm
n <- 3
mv_truth <- list(A=1:3,
                 B=matrix(c(0.8, 0.2, -0.2,
                            0.0, 0.9,  0.0,
                            0.0, 0.0,  0.9), ncol=3, byrow=TRUE),
                 Sigma=matrix(c(1, 0,   0,
                                0, 1,   0.5,
                                0, 0.5, 1), ncol=3, byrow=TRUE))
N <- 100
mv_y <- matrix(0, nrow=N, ncol=n)
mv_y[1,] <- rnorm(n, mean=c(5, 25, 25), sd=5)
for (k in 1:(N-1)) {
    # mv_y[k,] is a row vector, so multiply by t(B) to get B %*% y as in the equations above
    mv_y[k+1,] <- (mv_truth$A + mv_y[k,] %*% t(mv_truth$B)
                   + rmvnorm(1, mean=rep(0,n), sigma=mv_truth$Sigma))
}
Write a Stan block for the multivariate AR(1) model.
mv_ar1_block <- "
data {
int N; // number of observations
int n; // the dimension
matrix[N,n] y;
cov_matrix[n] S;
}
parameters {
row_vector[n] A;
matrix[n,n] B;
cov_matrix[n] Sigma;
}
model {
for (k in 1:(N-1)) {
y[k+1,] ~ multi_normal(A + y[k,] * B, Sigma);
}
A ~ normal(0, 20);
for (k in 1:n) {
B[k,] ~ normal(0, 3);
}
// uninformative prior on Sigma
Sigma ~ wishart(3, S);
}
"
mv_ar1_model <- stan_model(model_code=mv_ar1_block)
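The call to sampling() doesn’t appear in this excerpt; it presumably looked something like the following (the Wishart scale matrix S passed as data is an assumption; the identity is used here):
mv_ar1_fit <- sampling(mv_ar1_model,
                       data=list(N=N, n=n, y=mv_y, S=diag(n)),
                       iter=1000, chains=3)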
Hm, the result looks noisy.
## Error in summary(mv_ar1_fit)$summary: $ operator is invalid for atomic vectors
Suppose we have regular, noisy observations from a discrete, noisy oscillator.
The system itself does \[\begin{aligned} x_{t+1} - x_t &= \alpha y_t + \Normal(0, \sigma_{xy}) \\ y_{t+1} - y_t &= - \beta x_t + \Normal(0, \sigma_{xy}) \end{aligned}\] but we only get to observe \[\begin{aligned} X_t &= x_t + \Normal(0, \sigma_\epsilon) \\ Y_t &= y_t + \Normal(0, \sigma_\epsilon) . \end{aligned}\]
Here’s what this looks like.
true_osc <- list(alpha=.1,
                 beta=.05,
                 sigma_xy=.01,
                 sigma_eps=.1)
N <- 500
xy <- matrix(nrow=N, ncol=2)
xy[1,] <- c(3,0)
for (k in 1:(N-1)) {
    xy[k+1,] <- (xy[k,]
                 + c(true_osc$alpha * xy[k,2],
                     (-1) * true_osc$beta * xy[k,1])
                 + rnorm(2, 0, true_osc$sigma_xy))
}
XY <- xy + rnorm(N*2, 0, true_osc$sigma_eps)
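The plot itself isn’t included in this excerpt; one way to draw it (a sketch) is:
matplot(xy, type="l", lty=1, col=c("black", "red"),
        xlab="time", ylab="value")                      # latent trajectories
matpoints(XY, pch=20, cex=0.5, col=c("black", "red"))   # noisy observations
legend("topright", lty=1, col=c("black", "red"), legend=c("x", "y"))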
osc_block <- "
data {
int N;
vector[N] X;
vector[N] Y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma_xy;
real<lower=0> sigma_eps;
vector[N] x;
vector[N] y;
}
model {
x[2:N] ~ normal(x[1:(N-1)] + alpha * y[1:(N-1)], sigma_xy);
y[2:N] ~ normal(y[1:(N-1)] - beta * x[1:(N-1)], sigma_xy);
X ~ normal(x, sigma_eps);
Y ~ normal(y, sigma_eps);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma_xy ~ normal(0, 1);
sigma_eps ~ normal(0, 1);
}
"
osc_model <- stan_model(model_code=osc_block)
osc_fit <- sampling(osc_model,
                    data=list(N=N,
                              X=XY[,1],
                              Y=XY[,2]),
                    iter=1000, chains=3,
                    control=list(max_treedepth=12))
## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
cbind(rstan::summary(osc_fit, pars=c("alpha", "beta", "sigma_xy", "sigma_eps"))$summary,
truth=c(true_osc$alpha, true_osc$beta, true_osc$sigma_xy, true_osc$sigma_eps))
## mean se_mean sd 2.5% 25%
## alpha 0.10027484 7.548355e-06 0.0002620254 0.099767459 0.100087329
## beta 0.04959407 3.981173e-06 0.0001453410 0.049302785 0.049497526
## sigma_xy 0.01016856 3.163218e-04 0.0015125117 0.007275602 0.009151821
## sigma_eps 0.10415614 6.739671e-05 0.0024588853 0.099336078 0.102485456
## 50% 75% 97.5% n_eff Rhat truth
## alpha 0.10028131 0.10045248 0.10076468 1204.98657 1.0014733 0.10
## beta 0.04958854 0.04969267 0.04987458 1332.76788 0.9984255 0.05
## sigma_xy 0.01012091 0.01112764 0.01342035 22.86332 1.0703627 0.01
## sigma_eps 0.10410202 0.10583318 0.10921730 1331.06464 1.0038145 0.10
Here is a density plot of 100 estimated trajectories (of x and y) from the Stan fit.
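The plot isn’t reproduced here; a sketch of how such a figure could be drawn from the fit (each posterior draw of the latent paths plotted with heavy transparency, so overlap shows density):
post <- rstan::extract(osc_fit, pars=c("x", "y"))
draws <- sample(nrow(post$x), 100)
matplot(xy, type="l", lty=1, col=c("black", "red"), xlab="time", ylab="value")
for (j in draws) {
    lines(post$x[j,], col=adjustcolor("black", 0.05))
    lines(post$y[j,], col=adjustcolor("red", 0.05))
}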
Let’s try that again, with more noise.
Here’s what this looks like.
true_osc2 <- list(alpha=.1,
                  beta=.05,
                  sigma_xy=.05,
                  sigma_eps=.5)
xy2 <- matrix(nrow=N, ncol=2)
xy2[1,] <- c(3,0)
for (k in 1:(N-1)) {
    xy2[k+1,] <- (xy2[k,]
                  + c(true_osc2$alpha * xy2[k,2],
                      (-1) * true_osc2$beta * xy2[k,1])
                  + rnorm(2, 0, true_osc2$sigma_xy))
}
XY2 <- xy2 + rnorm(N*2, 0, true_osc2$sigma_eps)
osc_fit2 <- sampling(osc_model,
                     data=list(N=N,
                               X=XY2[,1],
                               Y=XY2[,2]),
                     iter=1000, chains=3,
                     control=list(max_treedepth=12))
## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
cbind(rstan::summary(osc_fit2, pars=c("alpha", "beta", "sigma_xy", "sigma_eps"))$summary,
truth=c(true_osc2$alpha, true_osc2$beta, true_osc2$sigma_xy, true_osc2$sigma_eps))
## mean se_mean sd 2.5% 25%
## alpha 0.09977365 3.868091e-05 0.0014981051 0.09693679 0.09871924
## beta 0.05058485 2.181817e-05 0.0008450139 0.04888203 0.05003310
## sigma_xy 0.04888524 1.058616e-03 0.0061682198 0.03836894 0.04434006
## sigma_eps 0.49500456 2.933624e-04 0.0113618758 0.47337393 0.48763187
## 50% 75% 97.5% n_eff Rhat truth
## alpha 0.09976657 0.10080819 0.10273807 1500.00000 0.9992035 0.10
## beta 0.05061366 0.05112784 0.05225583 1500.00000 0.9995298 0.05
## sigma_xy 0.04836393 0.05306193 0.06140414 33.95026 1.0846841 0.05
## sigma_eps 0.49472709 0.50237105 0.51755222 1500.00000 1.0026377 0.50
Here is a density plot of 100 estimated trajectories (of x and y) from the Stan fit.
Now what if we actually don’t observe most of the \(Y\) values?
Here’s what this looks like.
true_osc3 <- list(alpha=.1,
                  beta=.05,
                  sigma_xy=.05,
                  sigma_eps=.5)
xy3 <- matrix(nrow=N, ncol=2)
xy3[1,] <- c(3,0)
for (k in 1:(N-1)) {
    xy3[k+1,] <- (xy3[k,]
                  + c(true_osc3$alpha * xy3[k,2],
                      (-1) * true_osc3$beta * xy3[k,1])
                  + rnorm(2, 0, true_osc3$sigma_xy))
}
XY3 <- xy3 + rnorm(N*2, 0, true_osc3$sigma_eps)
obs_y <- floor(seq(1, N, length.out=5))     # keep only five observation times for Y
XY3[setdiff(1:N, obs_y), 2] <- NA           # the rest of Y is missing
osc_block_missing <- "
data {
int N;
vector[N] X;
int k; // number of observed Y
int obs_y[k]; // which Y values are observed
vector[k] Y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma_xy;
real<lower=0> sigma_eps;
vector[N] x;
vector[N] y;
}
model {
x[2:N] ~ normal(x[1:(N-1)] + alpha * y[1:(N-1)], sigma_xy);
y[2:N] ~ normal(y[1:(N-1)] - beta * x[1:(N-1)], sigma_xy);
X ~ normal(x, sigma_eps);
Y ~ normal(y[obs_y], sigma_eps);
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma_xy ~ normal(0, 1);
sigma_eps ~ normal(0, 1);
}
"
osc_model_missing <- stan_model(model_code=osc_block_missing)
osc_fit3 <- sampling(osc_model_missing,
                     data=list(N=N,
                               X=XY3[,1],
                               k=length(obs_y),
                               obs_y=obs_y,
                               Y=XY3[obs_y,2]),
                     iter=1000, chains=3,
                     control=list(max_treedepth=12))
## Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
cbind(rstan::summary(osc_fit3, pars=c("alpha", "beta", "sigma_xy", "sigma_eps"))$summary,
truth=c(true_osc3$alpha, true_osc3$beta, true_osc3$sigma_xy, true_osc3$sigma_eps))
## mean se_mean sd 2.5% 25%
## alpha 0.10393336 0.0009268078 0.007705552 0.09040166 0.09863952
## beta 0.04721744 0.0004147620 0.003438507 0.04049462 0.04493189
## sigma_xy 0.04301800 0.0052723682 0.008877628 0.03001169 0.03576129
## sigma_eps 0.47543913 0.0005778553 0.016542496 0.44599898 0.46314292
## 50% 75% 97.5% n_eff Rhat truth
## alpha 0.10326524 0.10870398 0.12052463 69.123893 1.043283 0.10
## beta 0.04721705 0.04953988 0.05385456 68.729305 1.040078 0.05
## sigma_xy 0.04213721 0.04798025 0.06384010 2.835192 1.686882 0.05
## sigma_eps 0.47495666 0.48555496 0.50975366 819.528259 1.014492 0.50
Here is a density plot of 100 estimated trajectories (of x and y) from the Stan fit.
Autoregressive models: fit much like a linear regression of each observation on the previous one.
Multivariate AR(1): the same idea, with a covariance matrix for the noise and a prior (e.g., a Wishart) on it.
Oscillator: like an AR(1), but explicitly modeling the underlying, unobserved process.
Missing data: the inference worked well even without observing most of one entire dimension.