Note: This is a partial solution, focused on the technical aspects. I generally expect your reports to include somewhat more prose describing what you’re doing and why.
Here are some useful functions we’ll need later.
library(matrixStats)   # colMins, colMaxs, colQuantiles, used below
library(dplyr)         # for the group_by/summarize pipelines
library(rstan)         # for fitting the Stan models

max_col <- function (x) {
    # return vector of indices corresponding to which column has the largest value
    # faster than apply(x, 1, which.max)
    out <- rep(1, NROW(x))
    maxs <- x[,1]
    for (k in seq_len(NCOL(x))[-1]) {
        yes <- (x[,k] > maxs)
        out[yes] <- k
        maxs[yes] <- x[yes,k]
    }
    return(out)
}
stopifnot( { x <- matrix(rnorm(120), ncol=15); all(max_col(x) == apply(x, 1, which.max)) } )
segplot <- function (x, labels=colnames(x), ...) {
    # summarize each column of x by min, 2.5% quantile, mean, 97.5% quantile, max
    splot(far_left=colMins(x),
          left=colQuantiles(x, probs=0.025),
          mid=colMeans(x),
          right=colQuantiles(x, probs=0.975),
          far_right=colMaxs(x),
          labels=labels, ...)
}
splot <- function (far_left, left, mid, right, far_right, labels,
                   yvals=length(mid):1,
                   xlim=range(far_left, left, mid, right, far_right, finite=TRUE),
                   add=FALSE, col='red', pt.col='black',
                   ...) {
    # make those nice horizontal line plots
    par(mar=c(par("mar"), 9)[c(1,5,3,4)])  # widen the left margin for the labels
    if (!add) {
        plot(0, ylab='', yaxt='n', type='n', ylim=range(yvals), xlim=xlim, ...)
        axis(2, at=yvals, labels=labels, las=2)
    }
    segments(x0=far_left, x1=far_right, y0=yvals)
    segments(x0=left, x1=right, y0=yvals, col=col, lwd=2)
    points(mid, yvals, pch=20, col=pt.col)
}
We first read in the data, and for later use relabel the mitochondria as region 0.
altai <- read.table("data/altai.counts.gz", header=TRUE, stringsAsFactors=FALSE)
denis <- read.table("data/denisova.counts.gz", header=TRUE, stringsAsFactors=FALSE)
geno <- rbind(altai, denis)
geno$id <- factor(c(rep("altai", nrow(altai)), rep("denis", nrow(denis))))
rm(altai, denis)
bases <- c("A","T","C","G")
geno$region[geno$region == "mt"] <- 0
geno$region <- as.numeric(geno$region)
geno$coverage <- rowSums(geno[,bases])
geno$major <- factor(bases[max_col(geno[,bases])], levels=bases)
geno$major_coverage <- geno[,bases][cbind(1:nrow(geno), as.numeric(geno$major))]
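As a quick sanity check on the indexing trick in that last line (a sketch, not part of the analysis proper): the major-allele count can never exceed the total coverage, and since it is the maximum of four counts it must be at least a quarter of it.
# sanity checks on the derived columns
stopifnot(all(geno$major_coverage <= geno$coverage))
stopifnot(all(4 * geno$major_coverage >= geno$coverage))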
First, let’s look at the distribution of total coverages:
hist(subset(geno, id=="altai" & region>0)$coverage, breaks=30,
     main="Altai coverage, nuclear", xlab='coverage')
hist(subset(geno, id=="altai" & region==0)$coverage, breaks=30,
     main="Altai coverage, mitochondria", xlab='coverage')
hist(subset(geno, id=="denis" & region>0)$coverage, breaks=30,
     main="Denisovan coverage, nuclear", xlab='coverage')
hist(subset(geno, id=="denis" & region==0)$coverage, breaks=30,
     main="Denisovan coverage, mitochondria", xlab='coverage')
Here’s another look at the range of coverage, this time separated by region. The points give mean coverages, while the black and red lines give the full range (min to max) and the middle 90% (the 5% to 95% quantiles), respectively. We see that coverage varies substantially by region.
coverage_stats <- geno %>%
    group_by(region, id) %>%
    summarize(mean=mean(coverage),
              sd=sd(coverage),
              min=min(coverage),
              max=max(coverage),
              q5=quantile(coverage, .05),
              q95=quantile(coverage, .95))
with(coverage_stats, {
    y <- region + 110*as.numeric(id)   # offset the two samples vertically
    plot(mean, y, xlim=range(max, min), pch=20, xlab='coverage',
         ylab='region', yaxt='n')
    abline(h=105, lty=3)
    segments(x0=min, x1=max, y0=y)
    segments(x0=q5, x1=q95, y0=y, lwd=2, col='red')
    points(mean, y, pch=20) })
It looks like the nuclear genomes have some issues: such large coverages may well be a sign of problems. If we restrict to smaller coverages, we see that the coverage distribution tails off gradually, without a single natural cutoff. We will (somewhat arbitrarily) discard anything on the nuclear genome with coverage greater than or equal to 200.
hist(subset(geno, id == "altai" & region>0 & coverage < 250)$coverage, breaks=30,
     main="Altai coverage, nuclear", xlab='coverage')
hist(subset(geno, id == "denis" & region>0 & coverage < 250)$coverage, breaks=30,
     main="Denisovan coverage, nuclear", xlab='coverage')
geno <- subset(geno, region == 0 | coverage < 200)
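(For the record, here is one way to see how much data the cutoff discards, as a sketch; note it must be run before the subset() call above, or the answer is trivially zero.)
# fraction of nuclear sites at or above the coverage cutoff, per sample
with(subset(geno, region > 0), tapply(coverage >= 200, id, mean))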
Here is a table, and also a plot, of mean base frequencies, separated by (mitochondria/not), sample, and major allele:
errors <- geno %>% mutate(mt=(region==0)) %>%
    group_by(mt, id, major) %>%
    summarise(A=mean(A/coverage),
              T=mean(T/coverage),
              C=mean(C/coverage),
              G=mean(G/coverage))
errors$overall <- rowSums(errors[,bases]) - apply(errors[,bases], 1, max)
errors
## # A tibble: 16 x 8
## # Groups: mt, id [?]
## mt id major A T C G overall
## <lgl> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F altai A 0.997 0.000848 0.000708 0.00146 0.00301
## 2 F altai T 0.000893 0.997 0.00133 0.000762 0.00299
## 3 F altai C 0.00218 0.00468 0.992 0.00140 0.00826
## 4 F altai G 0.00488 0.00223 0.00144 0.991 0.00856
## 5 F denis A 0.997 0.000734 0.00114 0.00144 0.00331
## 6 F denis T 0.000751 0.997 0.00136 0.00119 0.00330
## 7 F denis C 0.00141 0.00209 0.996 0.000835 0.00433
## 8 F denis G 0.00219 0.00139 0.000832 0.996 0.00441
## 9 T altai A 0.999 0.000209 0.000292 0.000500 0.00100
## 10 T altai T 0.000241 0.999 0.000608 0.000218 0.00107
## 11 T altai C 0.00107 0.00299 0.995 0.000438 0.00450
## 12 T altai G 0.00303 0.00127 0.000667 0.995 0.00497
## 13 T denis A 0.997 0.000453 0.00210 0.000601 0.00316
## 14 T denis T 0.000519 0.998 0.00134 0.000614 0.00247
## 15 T denis C 0.000801 0.000531 0.998 0.000246 0.00158
## 16 T denis G 0.000832 0.000703 0.000681 0.998 0.00222
plot(as.vector(4 * (match(errors$major, bases)-1) + col(errors[,bases]) - 1),
     unlist(errors[,bases]),
     col=c("black","red")[as.numeric(errors$id)],
     pch=c(1,5)[1+errors$mt], xlim=c(0, 20),
     log='y', xlab='', xaxt='n', ylab='frequency')
abline(v=4*(1:4)-0.5, lty=3)
axis(1, at=4*(0:3)+1.5, labels=bases, line=2, lwd=0, tick=FALSE)
axis(1, at=0:15, labels=rep(bases,4), line=0.5, lwd=0, tick=FALSE)
legend("topright", pch=c(1,1,5), col=c('black','red','black'),
       legend=c(levels(errors$id), "mito"))
It looks like error rates range from 0.0010016, for the Altai mitochondria at sites whose major allele is A, up to 0.0085579, for the Altai nuclear genome at sites whose major allele is G. There also looks to be substantial heterogeneity by base, as well as reliable differences between the samples and between nuclear and mitochondrial sequence. The overall error rate for the Denisovan was 0.0034253, and for the Altai Neanderthal it was 0.0040799. On average, the error rate in the mitochondria was 2.0606979 times lower.
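(For reference, here is a sketch of how summary numbers like these can be computed; the exact grouping used for the quoted mitochondria/nuclear ratio is an assumption.)
# overall error rate per sample
with(geno, tapply(coverage - major_coverage, id, sum) / tapply(coverage, id, sum))
# ratio of nuclear to mitochondrial overall error rates
mt_err <- with(subset(geno, region == 0), sum(coverage - major_coverage) / sum(coverage))
nuc_err <- with(subset(geno, region > 0), sum(coverage - major_coverage) / sum(coverage))
nuc_err / mt_err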
Now we’ll simulate data. Motivated by the table above, we’ll pick error rates to be around 0.0035, with some differences by sample, major allele, and different between the mitochondria and nuclear regions (but the same otherwise).
sim_rates <- list(
    id=c("altai"=0.004, "denis"=0.0035),
    region_fac=c(0.5, rep(1, length(unique(geno$region))-1)),
    major_fac=1+rnorm(4, 0, 0.05)
)
sim_geno <- geno[,c("region","id","coverage","major")]
sim_geno$error_rate <- with(sim_geno,
        sim_rates$id[id] *
        sim_rates$region_fac[region+1L] *
        sim_rates$major_fac[as.numeric(major)])
sim_geno$major_coverage <- rbinom(nrow(sim_geno), size=sim_geno$coverage,
                                  prob=1-sim_geno$error_rate)
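A quick check that the simulation behaves as intended (illustrative): the overall simulated error rate should come out near the 0.0035 target.
# overall simulated error rate
with(sim_geno, 1 - sum(major_coverage) / sum(coverage))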
The fact mentioned says that if \[\begin{aligned} X &\sim \Binom(10, 1/3) \\ Y &\sim \Binom(12, 1/3) \\ Z &= X + Y \end{aligned}\] then \[\begin{aligned} Z \sim \Binom(22, 1/3) \end{aligned}\] because if you flip a coin 10 times and count up the number of heads (\(X\)), then flip it 12 more times and count up the number of heads (\(Y\)), then (a) the total number of heads (\(X+Y\)) is the same as if you’d just flipped it 22 times, and (b) that’s all the information you need to know about it.
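(If you don’t believe that fact, here is a quick empirical check, as a sketch: the two simulation schemes should produce the same distribution.)
# empirical check of the binomial additivity fact
x <- rbinom(1e5, 10, 1/3)
y <- rbinom(1e5, 12, 1/3)
z <- rbinom(1e5, 22, 1/3)
qqplot(x + y, z); abline(0, 1)  # points should fall on the 1:1 line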
This means it suffices to collapse the data into a single table of how often we saw the right and the wrong base, by category, where “category” is a combination of sample ID, region, and true base.
Here we compute that table. The “minor” column refers to the sum of all less-frequently-seen bases (i.e., everything except the “major” allele), which we are assuming are in error.
counts <- geno %>%
    group_by(id, major, region) %>%
    summarise(A=sum(A),
              T=sum(T),
              C=sum(C),
              G=sum(G),
              coverage=sum(coverage),
              minor=sum(coverage)-sum(major_coverage))
We’ll also want the same thing for the simulated data.
sim_counts <- sim_geno %>%
    group_by(id, major, region) %>%
    summarise(coverage=sum(coverage),
              minor=sum(coverage)-sum(major_coverage))
First let’s formulate and fit a simple model where every site has the same error rate. We’ll put a uniform prior on that error rate. If \(Z\) is the total number of erroneous bases, and \(N\) is the total coverage, then \[\begin{aligned} Z &\sim \Binom(N, \theta) \\ \theta &\sim \Beta(1,1) . \end{aligned}\] Since this is a Beta-Binomial model, we don’t need to use Stan, and can find the posterior distribution on \(\theta\) explicitly as Beta(\(1+Z\), \(1+N-Z\)).
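Since the conjugate answer is available in closed form, here is a sketch of the analytical computation we’ll compare Stan’s output against (using the simulated totals):
# analytic Beta-Binomial posterior: Z errors out of N trials, uniform prior
Z <- sum(sim_counts$minor)
N <- sum(sim_counts$coverage)
(1 + Z) / (2 + N)                          # posterior mean of Beta(1+Z, 1+N-Z)
qbeta(c(0.025, 0.975), 1 + Z, 1 + N - Z)   # 95% credible interval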
We’ll do it anyway, just for practice.
simple_block <- "
data {
    int coverage;
    int minor;
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    minor ~ binomial(coverage, theta);
    // theta ~ uniform(0,1); // a uniform prior is the same as no prior statement
}
"
First we’ll fit to the simulated data:
simple_sim_data <- list(coverage=sum(sim_counts$coverage),
                        minor=sum(sim_counts$minor))
simple_sim_fit <- stan(model_code=simple_block,
                       data=simple_sim_data,
                       iter=1e3, chains=3)
simple_sim_samples <- extract(simple_sim_fit)
print(simple_sim_fit)
## Inference for Stan model: c69ed19821015102bec2e473988a69c3.
## 3 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## theta 0 0.00 0.00 0 0 0 0 0
## lp__ -6736978 0.03 0.68 -6736980 -6736979 -6736978 -6736978 -6736978
## n_eff Rhat
## theta 596 1
## lp__ 691 1
##
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:38:59 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The average error rate, weighted by coverage, that we simulated under was 0.0031785: we should get close to that here. The posterior mean is 0.003185 and a 95% credible interval is from 0.003179 to 0.0031914. This agrees with the analytical answer, for which the posterior is Beta(\(9.98421 \times 10^5\), \(3.1245365 \times 10^8\)), with mean 0.0031852 and a 95% credible interval of 0.003179 to 0.0031915.
Now, the real data:
simple_data <- list(coverage=sum(counts$coverage),
                    minor=sum(counts$minor))
simple_fit <- stan(model_code=simple_block,
                   data=simple_data,
                   iter=1e3, chains=3)
simple_samples <- extract(simple_fit)
print(simple_fit)
## Inference for Stan model: c69ed19821015102bec2e473988a69c3.
## 3 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5%
## theta 0 0.00 0.00 0 0 0 0 0
## lp__ -7870551 0.03 0.69 -7870553 -7870551 -7870550 -7870550 -7870550
## n_eff Rhat
## theta 582 1
## lp__ 550 1
##
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:39:03 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The posterior mean is 0.0038251 and a 95% credible interval is from 0.0038187 to 0.0038318. This agrees with the analytical answer, for which the posterior is Beta(\(1.199004 \times 10^6\), \(3.1225307 \times 10^8\)), with mean 0.0038252 and a 95% credible interval of 0.0038183 to 0.003832.
Now we’ll fit a model where each sample has a different error rate. We could do this separately, using the Beta-Binomial again, but just for practice, we’ll do them jointly. Also, since we don’t even know what scale the concentration parameter for the prior should be on, we’ll put a hyperprior on it. Now, \(Z_s\) and \(N_s\) are the total number of erroneous bases and coverage for sample \(s\), respectively (where \(s\) is “Altai” or “Denisovan”). \[\begin{aligned}
Z_s &\sim \Binom(N_s, \theta_s) \\
\theta_s &\sim \Beta(\mu \kappa, (1-\mu) \kappa) \\
\mu &\sim \Beta(1,1) \\
\kappa &\sim \Normal^+(0, \sigma) \\
\sigma &\sim \Exp(1/10000)
\end{aligned}\]
Here’s the Stan block:
sample_block <- "
data {
    int coverage[2];
    int minor[2];
}
parameters {
    real<lower=0, upper=1> theta[2];
    real<lower=0, upper=1> mu;
    real<lower=0> kappa;
    real<lower=0> sigma;
}
model {
    minor ~ binomial(coverage, theta);
    theta ~ beta(mu * kappa, (1-mu) * kappa);
    // mu ~ beta(1,1); // omit
    kappa ~ normal(0, sigma);
    sigma ~ exponential(.0001);
}
"
Again, first we’ll fit to the simulated data:
sample_sim_data <- as.list(sim_counts %>%
                           group_by(id) %>%
                           summarize(coverage=sum(coverage), minor=sum(minor)))
sample_sim_fit <- stan(model_code=sample_block,
                       data=sample_sim_data,
                       iter=1e3, chains=3)
sample_sim_samples <- extract(sample_sim_fit)
print(sample_sim_fit)
## Inference for Stan model: ccf90a5386d09552cf1de8d09458b372.
## 3 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50%
## theta[1] 0.00 0.00 0.00 0.00 0.00 0.00
## theta[2] 0.00 0.00 0.00 0.00 0.00 0.00
## mu 0.00 0.00 0.00 0.00 0.00 0.00
## kappa 15931.31 1416.43 19769.64 168.86 3546.16 9322.70
## sigma 14761.48 861.56 12627.64 676.82 5464.85 11453.45
## lp__ -6736960.55 0.23 2.22 -6736966.42 -6736961.72 -6736960.11
## 75% 97.5% n_eff Rhat
## theta[1] 0.00 0.00 1500 1.00
## theta[2] 0.00 0.00 1500 1.00
## mu 0.00 0.01 178 1.02
## kappa 20594.37 72618.07 195 1.01
## sigma 20601.12 47185.28 215 1.01
## lp__ -6736958.90 -6736957.77 97 1.04
##
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:47:53 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Looks like we might want to run for more iterations; also, the value of \(\kappa\) here is of order 10,000, so it might be pushing up against the prior.
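(If we did want more iterations, rstan lets us reuse the already-compiled model rather than recompiling, e.g. the following, not run here:)
# sample_sim_fit2 <- stan(fit=sample_sim_fit, data=sample_sim_data,
#                         iter=4e3, chains=3)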
The average error rates, weighted by coverage, that we simulated under were 0.0031689 and 0.0031935 for the Altai Neanderthal and the Denisovan, respectively; we should get close to those here. The posterior means are 0.0031733 and 0.0032037, with 95% credible intervals of (0.0031652, 0.0031815) and (0.0031937, 0.0032139), respectively.
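(Posterior summaries like these can be pulled directly from the samples, e.g.:)
colMeans(sample_sim_samples$theta)                             # posterior means
colQuantiles(sample_sim_samples$theta, probs=c(0.025, 0.975))  # 95% credible intervals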
Now, for the real data.
sample_data <- as.list(counts %>%
                       group_by(id) %>%
                       summarize(coverage=sum(coverage), minor=sum(minor)))
sample_fit <- stan(model_code=sample_block,
                   data=sample_data,
                   iter=1e3, chains=3)
sample_samples <- extract(sample_fit)
print(sample_fit)
## Inference for Stan model: ccf90a5386d09552cf1de8d09458b372.
## 3 chains, each with iter=1000; warmup=500; thin=1;
## post-warmup draws per chain=500, total post-warmup draws=1500.
##
## mean se_mean sd 2.5% 25% 50%
## theta[1] 0.00 0.00 0.00 0.00 0.00 0.00
## theta[2] 0.00 0.00 0.00 0.00 0.00 0.00
## mu 0.00 0.00 0.00 0.00 0.00 0.00
## kappa 11049.52 927.92 12024.42 281.01 2705.85 7134.29
## sigma 12322.01 1297.10 10593.41 766.92 4606.33 9332.06
## lp__ -7866296.20 0.21 1.80 -7866300.30 -7866297.32 -7866295.87
## 75% 97.5% n_eff Rhat
## theta[1] 0.00 0.00 1500 1.00
## theta[2] 0.00 0.00 1500 1.00
## mu 0.00 0.01 99 1.03
## kappa 15207.24 43293.25 168 1.02
## sigma 16808.77 40548.28 67 1.04
## lp__ -7866294.80 -7866293.72 74 1.03
##
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:47:57 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
The observed error rates were 0.0040799 and 0.0034253 for the Altai Neanderthal and Denisovan, respectively; we should get close to those here. The posterior means are 0.0040799 and 0.0034253, with 95% credible intervals of (0.0040713, 0.0040887) and (0.0034159, 0.0034353), respectively.
Now the question is: we know that error rates differ between the samples; but, given this, do they differ by base? We will answer this by allowing the nucleotide-specific error rates to be arbitrary, and different, between the two samples; we could build in something so that the relative error rates of A, C, G, and T are similar between the samples, but we won’t.
Here \(\mu_s\) is the mean error rate for sample \(s\), and \(\theta_{s,b}\) is the error rate at sites of sample \(s\) whose true base is \(b\), which scatters about \(\mu_s\). \[\begin{aligned} Z_{s,b} &\sim \Binom(N_{s,b}, \theta_{s,b}) \\ \theta_{s,b} &\sim \Beta(\mu_s \kappa_s, (1-\mu_s) \kappa_s) \\ \mu_s &\sim \Beta(1, 1) \\ \kappa_s &\sim \Normal^+(0, \sigma) \\ \sigma &\sim \Exp(10^{-5}) \end{aligned}\]
Here’s a block for this:
sb_block <- "
data {
    int coverage[8];
    int minor[8];    // number of errors
    int id[8];       // which sample
    int major[8];    // which base is the true one
}
parameters {
    real<lower=0, upper=1> theta[8];
    vector<lower=0, upper=1>[2] mu;
    vector<lower=0>[2] kappa;
    real<lower=0> sigma;
}
model {
    minor ~ binomial(coverage, theta);
    theta ~ beta(mu[id] .* kappa[id], (1-mu[id]) .* kappa[id]);
    // mu ~ beta(1,1); // omit
    kappa ~ normal(0, sigma);
    sigma ~ exponential(.00001);
}
"
Again, first we’ll fit to the simulated data:
sb_sim_data <- sim_counts %>%
    group_by(id, major) %>%
    summarize(coverage=sum(coverage), minor=sum(minor))
sb_sim_fit <- stan(model_code=sb_block,
                   data=list(
                       coverage=sb_sim_data$coverage,
                       minor=sb_sim_data$minor,
                       id=as.numeric(sb_sim_data$id),
                       major=as.numeric(sb_sim_data$major)),
                   iter=1e3, chains=3)
sb_sim_samples <- extract(sb_sim_fit)
rstan::summary(sb_sim_fit)$summary
## mean se_mean sd 2.5%
## theta[1] 3.048872e-03 1.774589e-07 6.872954e-06 3.035090e-03
## theta[2] 3.389424e-03 2.117187e-07 8.199831e-06 3.373187e-03
## theta[3] 3.031780e-03 2.190216e-07 8.482671e-06 3.014641e-03
## theta[4] 3.241158e-03 2.604123e-07 1.008573e-05 3.221802e-03
## theta[5] 3.071047e-03 2.198717e-07 8.515594e-06 3.054511e-03
## theta[6] 3.390961e-03 2.434702e-07 9.429560e-06 3.373123e-03
## theta[7] 3.167206e-03 2.986842e-07 1.156799e-05 3.144522e-03
## theta[8] 3.178614e-03 2.967080e-07 1.149145e-05 3.155779e-03
## mu[1] 3.188661e-03 5.506554e-06 1.103581e-04 2.966018e-03
## mu[2] 3.202322e-03 4.896433e-06 9.801477e-05 3.012432e-03
## kappa[1] 1.095351e+05 3.379973e+03 7.226984e+04 1.717901e+04
## kappa[2] 1.358618e+05 4.238694e+03 9.562847e+04 2.292444e+04
## sigma 1.453313e+05 4.345127e+03 8.731068e+04 3.240150e+04
## lp__ -6.735936e+06 1.246072e-01 2.719970e+00 -6.735943e+06
## 25% 50% 75% 97.5% n_eff
## theta[1] 3.044288e-03 3.048955e-03 3.053746e-03 3.061761e-03 1500.0000
## theta[2] 3.383868e-03 3.389633e-03 3.394976e-03 3.406069e-03 1500.0000
## theta[3] 3.025953e-03 3.032006e-03 3.037459e-03 3.048672e-03 1500.0000
## theta[4] 3.234325e-03 3.240982e-03 3.247799e-03 3.261464e-03 1500.0000
## theta[5] 3.065257e-03 3.071115e-03 3.076934e-03 3.087499e-03 1500.0000
## theta[6] 3.384702e-03 3.390676e-03 3.397253e-03 3.409990e-03 1500.0000
## theta[7] 3.159565e-03 3.166870e-03 3.175252e-03 3.189995e-03 1500.0000
## theta[8] 3.170953e-03 3.178430e-03 3.186234e-03 3.201454e-03 1500.0000
## mu[1] 3.130030e-03 3.182512e-03 3.245656e-03 3.426296e-03 401.6510
## mu[2] 3.144539e-03 3.201438e-03 3.259524e-03 3.390128e-03 400.7038
## kappa[1] 5.572054e+04 9.269921e+04 1.488857e+05 2.956066e+05 457.1805
## kappa[2] 6.630695e+04 1.146025e+05 1.783088e+05 3.868454e+05 508.9913
## sigma 8.327676e+04 1.273635e+05 1.892683e+05 3.666484e+05 403.7661
## lp__ -6.735938e+06 -6.735936e+06 -6.735935e+06 -6.735932e+06 476.4767
## Rhat
## theta[1] 0.9986473
## theta[2] 0.9992916
## theta[3] 0.9989723
## theta[4] 0.9989596
## theta[5] 0.9986153
## theta[6] 0.9988985
## theta[7] 0.9984938
## theta[8] 0.9987867
## mu[1] 1.0000346
## mu[2] 1.0020745
## kappa[1] 1.0010955
## kappa[2] 1.0032288
## sigma 1.0002993
## lp__ 0.9995747
There are now more parameters to keep track of, so we’ll be a bit more organized. Here is a table of true values, along with posterior means and the boundaries of 95% credible intervals:
sb_tab <- sim_geno %>% group_by(id, major) %>%
    summarize(error_rate=sum(error_rate*coverage)/sum(coverage))
sb_tab$posterior_mean <- colMeans(sb_sim_samples$theta)
sb_tab$post_q.025 <- colQuantiles(sb_sim_samples$theta, prob=.025)
sb_tab$post_q.975 <- colQuantiles(sb_sim_samples$theta, prob=.975)
sb_tab
## # A tibble: 8 x 6
## # Groups: id [?]
## id major error_rate posterior_mean post_q.025 post_q.975
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 altai A 0.00304 0.00305 0.00304 0.00306
## 2 altai T 0.00339 0.00339 0.00337 0.00341
## 3 altai C 0.00303 0.00303 0.00301 0.00305
## 4 altai G 0.00323 0.00324 0.00322 0.00326
## 5 denis A 0.00307 0.00307 0.00305 0.00309
## 6 denis T 0.00337 0.00339 0.00337 0.00341
## 7 denis C 0.00315 0.00317 0.00314 0.00319
## 8 denis G 0.00317 0.00318 0.00316 0.00320
We can see that the true values are within the credible intervals in every case.
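(We can check that claim programmatically rather than by eye:)
# every true rate should lie inside its 95% credible interval
stopifnot(with(sb_tab, all(error_rate >= post_q.025 & error_rate <= post_q.975)))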
Here are the results for higher-level parameters:
theta_names <- paste(sb_sim_data$id, sb_sim_data$major, sep=":")
mu_names <- paste("mean,", levels(sb_sim_data$id))
kappa_names <- paste("concentration,", levels(sb_sim_data$id))
segplot(cbind(sb_sim_samples$theta, sb_sim_samples$mu), c(theta_names, mu_names))
segplot(cbind(sb_sim_samples$kappa, sb_sim_samples$sigma), c(kappa_names, "sigma"))
Now, for the real data:
sb_data <- counts %>%
    group_by(id, major) %>%
    summarize(coverage=sum(coverage), minor=sum(minor))
sb_fit <- stan(model_code=sb_block,
               data=list(
                   coverage=sb_data$coverage,
                   minor=sb_data$minor,
                   id=as.numeric(sb_data$id),
                   major=as.numeric(sb_data$major)),
               iter=1e3, chains=3)
sb_samples <- extract(sb_fit)
rstan::summary(sb_fit)$summary
## mean se_mean sd 2.5%
## theta[1] 2.172189e-03 1.589084e-07 6.154496e-06 2.159926e-03
## theta[2] 2.314288e-03 1.675222e-07 6.488109e-06 2.301658e-03
## theta[3] 6.220719e-03 2.937712e-07 1.137771e-05 6.199269e-03
## theta[4] 7.634249e-03 3.915556e-07 1.516488e-05 7.604839e-03
## theta[5] 3.156039e-03 2.253786e-07 8.728876e-06 3.139309e-03
## theta[6] 3.066519e-03 2.413862e-07 9.348846e-06 3.048579e-03
## theta[7] 3.629858e-03 3.069091e-07 1.188654e-05 3.606856e-03
## theta[8] 4.203468e-03 3.484523e-07 1.349550e-05 4.177913e-03
## mu[1] 5.227407e-03 1.014684e-04 2.264665e-03 2.870230e-03
## mu[2] 3.623621e-03 4.195787e-05 6.135662e-04 2.925861e-03
## kappa[1] 8.944200e+02 2.245731e+01 5.947890e+02 1.126520e+02
## kappa[2] 1.243148e+04 4.061996e+02 9.805067e+03 1.105108e+03
## sigma 2.157941e+04 1.010582e+03 2.567935e+04 1.482786e+03
## lp__ -7.744182e+06 1.579456e-01 2.699572e+00 -7.744189e+06
## 25% 50% 75% 97.5% n_eff
## theta[1] 2.168117e-03 2.172266e-03 2.176305e-03 2.184874e-03 1500.0000
## theta[2] 2.309743e-03 2.314242e-03 2.318744e-03 2.326655e-03 1500.0000
## theta[3] 6.212664e-03 6.220630e-03 6.228557e-03 6.242949e-03 1500.0000
## theta[4] 7.624295e-03 7.634264e-03 7.644449e-03 7.664773e-03 1500.0000
## theta[5] 3.150416e-03 3.155884e-03 3.161655e-03 3.173211e-03 1500.0000
## theta[6] 3.060028e-03 3.066335e-03 3.072874e-03 3.084331e-03 1500.0000
## theta[7] 3.622062e-03 3.629971e-03 3.637550e-03 3.653016e-03 1500.0000
## theta[8] 4.194138e-03 4.203591e-03 4.213496e-03 4.229218e-03 1500.0000
## mu[1] 4.028572e-03 4.801026e-03 5.857157e-03 1.002260e-02 498.1345
## mu[2] 3.367411e-03 3.558818e-03 3.765425e-03 4.676371e-03 213.8435
## kappa[1] 4.543426e+02 7.913557e+02 1.221692e+03 2.375364e+03 701.4719
## kappa[2] 5.107320e+03 9.983179e+03 1.723234e+04 3.790230e+04 582.6692
## sigma 6.438230e+03 1.301162e+04 2.711873e+04 9.129779e+04 645.6916
## lp__ -7.744184e+06 -7.744182e+06 -7.744180e+06 -7.744178e+06 292.1290
## Rhat
## theta[1] 0.9992033
## theta[2] 0.9983825
## theta[3] 0.9989792
## theta[4] 0.9983739
## theta[5] 0.9987056
## theta[6] 0.9985890
## theta[7] 0.9982808
## theta[8] 0.9983814
## mu[1] 0.9991045
## mu[2] 1.0043019
## kappa[1] 0.9999994
## kappa[2] 1.0022672
## sigma 1.0027291
## lp__ 1.0050122
Here is a table of estimated error rates, along with posterior means and boundaries of 95% credible intervals:
sb_tab <- geno %>% group_by(id, major) %>%
    summarize(observed_error=1-sum(major_coverage)/sum(coverage))
sb_tab$posterior_mean <- colMeans(sb_samples$theta)
sb_tab$post_q.025 <- colQuantiles(sb_samples$theta, prob=.025)
sb_tab$post_q.975 <- colQuantiles(sb_samples$theta, prob=.975)
sb_tab
## # A tibble: 8 x 6
## # Groups: id [?]
## id major observed_error posterior_mean post_q.025 post_q.975
## <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 altai A 0.00217 0.00217 0.00216 0.00218
## 2 altai T 0.00231 0.00231 0.00230 0.00233
## 3 altai C 0.00622 0.00622 0.00620 0.00624
## 4 altai G 0.00763 0.00763 0.00760 0.00766
## 5 denis A 0.00316 0.00316 0.00314 0.00317
## 6 denis T 0.00307 0.00307 0.00305 0.00308
## 7 denis C 0.00363 0.00363 0.00361 0.00365
## 8 denis G 0.00420 0.00420 0.00418 0.00423
Here we see that Stan is still inferring the error rates to be nearly exactly what we’d guess just from the empirical proportion (as makes sense, since we have a lot of data).
Here are the results for higher-level parameters:
segplot(cbind(sb_samples$theta, sb_samples$mu), c(theta_names, mu_names))
segplot(cbind(sb_samples$kappa, sb_samples$sigma), c(kappa_names, "sigma"))
It looks like the per-base error rates are somewhat similar between the two samples - error rates are higher for C and G in both - but the detailed patterns differ. The elevated C and G rates are to be expected, because a major source of DNA damage is deamination of cytosine to uracil, which gets read as thymine.
We’d now like to continue the above analysis, but allowing error rates to differ by region. We’ll do this by just adding another level of the hierarchy corresponding to region, so that what we had previously as \(\theta_{s,b}\) will now be \(\nu_{s,b}\), the mean error rate in sample \(s\) with true base \(b\), but the actual error rate in a particular region will be a random deviate from that.
\[\begin{aligned}
Z_{s,b,r} &\sim \Binom(N_{s,b,r}, \theta_{s,b,r}) \\
\theta_{s,b,r} &\sim \Beta(\nu_{s,b} \gamma_{s,b}, (1-\nu_{s,b}) \gamma_{s,b}) \\
\nu_{s,b} &\sim \Beta(\mu_s \kappa_s, (1-\mu_s) \kappa_s) \\
\gamma_{s,b} &\sim \Normal^+(0, \sigma_\gamma) \\
\mu_s &\sim \Beta(1, 1) \\
\kappa_s &\sim \Normal^+(0, \sigma_\kappa) \\
\sigma_\gamma &\sim \Exp(10^{-5}) \\
\sigma_\kappa &\sim \Exp(10^{-5})
\end{aligned}\]
Here’s a block for this:
big_block <- "
data {
    int N;              // number of rows; this will be 808
    int coverage[N];
    int minor[N];       // number of errors
    int id[N];          // which sample
    int major[N];       // which base is the true one
    int region[N];
}
parameters {
    real<lower=0, upper=1> theta[N];
    matrix<lower=0, upper=1>[2,4] nu;
    matrix<lower=0>[2,4] gamma;
    vector<lower=0, upper=1>[2] mu;
    vector<lower=0>[2] kappa;
    real<lower=0> sigma_gamma;
    real<lower=0> sigma_kappa;
}
model {
    minor ~ binomial(coverage, theta);
    for (k in 1:N) {
        theta[k] ~ beta(nu[id[k], major[k]] * gamma[id[k], major[k]],
                        (1 - nu[id[k], major[k]]) * gamma[id[k], major[k]]);
    }
    for (s in 1:2) {
        nu[s,] ~ beta(mu[s] * kappa[s], (1-mu[s]) * kappa[s]);
        gamma[s,] ~ normal(0, sigma_gamma);
    }
    // mu ~ beta(1,1); // omit
    kappa ~ normal(0, sigma_kappa);
    sigma_gamma ~ exponential(.00001);
    sigma_kappa ~ exponential(.00001);
}
"
Once again, first we’ll fit to the simulated data:
big_sim_fit <- stan(model_code=big_block,
                    data=list(
                        N=nrow(sim_counts),
                        coverage=sim_counts$coverage,
                        minor=sim_counts$minor,
                        id=as.numeric(sim_counts$id),
                        major=as.numeric(sim_counts$major),
                        region=1L + as.integer(sim_counts$region)),
                    iter=1e3, chains=3)
big_sim_samples <- extract(big_sim_fit)
We now compute the table of true values, along with posterior means and boundaries of 95% credible intervals, but rather than view it numerically, we’ll plot it. There is one plot for each sample, and purple points depict the true values.
all_theta_names <- paste(sim_counts$id, sim_counts$major, sim_counts$region, sep=":")
all_nu_names <- paste("mean,", outer(levels(sim_counts$id), levels(sim_counts$major), paste))
all_gamma_names <- paste("concentration,", outer(levels(sim_counts$id), levels(sim_counts$major), paste))
all_mu_names <- paste("mean,", levels(sim_counts$id))
all_kappa_names <- paste("concentration,", levels(sim_counts$id))
big_tab <- sim_geno %>% group_by(id, major, region) %>%
    summarize(error_rate=sum(error_rate*coverage)/sum(coverage))
big_tab$posterior_mean <- colMeans(big_sim_samples$theta)
big_tab$post_q.025 <- colQuantiles(big_sim_samples$theta, prob=.025)
big_tab$post_q.25 <- colQuantiles(big_sim_samples$theta, prob=.25)
big_tab$post_q.75 <- colQuantiles(big_sim_samples$theta, prob=.75)
big_tab$post_q.975 <- colQuantiles(big_sim_samples$theta, prob=.975)
layout(t(1:2))
with(subset(big_tab, id=="altai"), {
    splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975,
          labels=sub("[^:]*:","",all_theta_names[big_tab$id=="altai"]),
          main='error rates, Altai Neanderthal')
    points(error_rate, length(error_rate):1, col='purple')
})
with(subset(big_tab, id=="denis"), {
    splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975,
          labels=sub("[^:]*:","",all_theta_names[big_tab$id=="denis"]),
          main='error rates, Denisovan')
    points(error_rate, length(error_rate):1, col='purple')
})
We can see that the true values are within the credible intervals in every case.
Here are the results for higher-level parameters:
segplot(cbind(big_sim_samples$mu, matrix(big_sim_samples$nu, nrow=nrow(big_sim_samples$mu))),
        c(all_mu_names, all_nu_names))
segplot(cbind(big_sim_samples$kappa,
              matrix(big_sim_samples$gamma, nrow=nrow(big_sim_samples$kappa)),
              big_sim_samples$sigma_gamma, big_sim_samples$sigma_kappa),
        c(all_kappa_names, all_gamma_names, "sigma_gamma", "sigma_kappa"))
Now, for the real data, with the big model!
big_fit <- stan(model_code=big_block,
                data=list(
                    N=nrow(counts),
                    coverage=counts$coverage,
                    minor=counts$minor,
                    id=as.numeric(counts$id),
                    major=as.numeric(counts$major),
                    region=1L + as.integer(counts$region)),
                iter=1e3, chains=3)
big_samples <- extract(big_fit)
I’ve looked at the summary, and convergence looks good; the table is too big to reproduce here.
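(Rather than print the whole table, here is a compact convergence check, as a sketch:)
# worst-case diagnostics across all parameters
big_summary <- rstan::summary(big_fit)$summary
max(big_summary[, "Rhat"])
min(big_summary[, "n_eff"])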
Here are the estimated error rates, as depicted above. There is one plot for each sample; different bases are grouped together, and then ordered by region.
big_tab <- geno %>% group_by(id, major, region) %>%
    summarize(observed_error=1-sum(major_coverage)/sum(coverage))
big_tab$posterior_mean <- colMeans(big_samples$theta)
big_tab$post_q.025 <- colQuantiles(big_samples$theta, prob=.025)
big_tab$post_q.25 <- colQuantiles(big_samples$theta, prob=.25)
big_tab$post_q.75 <- colQuantiles(big_samples$theta, prob=.75)
big_tab$post_q.975 <- colQuantiles(big_samples$theta, prob=.975)
layout(t(1:2))
with(subset(big_tab, id=="altai"), {
    splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975,
          labels=sub("[^:]*:","",all_theta_names[big_tab$id=="altai"]),
          main='error rates, Altai Neanderthal')
})
with(subset(big_tab, id=="denis"), {
    splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975,
          labels=sub("[^:]*:","",all_theta_names[big_tab$id=="denis"]),
          main='error rates, Denisovan')
})
Here are the results for higher-level parameters:
segplot(cbind(big_samples$mu, matrix(big_samples$nu, nrow=nrow(big_samples$mu))),
        c(all_mu_names, all_nu_names))
segplot(cbind(big_samples$kappa,
              matrix(big_samples$gamma, nrow=nrow(big_samples$kappa)),
              big_samples$sigma_gamma, big_samples$sigma_kappa),
        c(all_kappa_names, all_gamma_names, "sigma_gamma", "sigma_kappa"))
There is substantial heterogeneity between regions. Let’s look at the results in another way: with the four bases grouped together for each region.
layout(t(1:2))
for (sample_id in levels(big_tab$id)) {
    for (b in bases) {
        with(subset(big_tab, id==sample_id & major == b), {
            splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975,
                  xlim=range(big_tab$post_q.025, big_tab$post_q.975),
                  labels=gsub("[^:]*:","",all_theta_names[big_tab$id==sample_id & big_tab$major==b]),
                  main=paste('error rates,', sample_id),
                  pt.col=match(b, bases), col=match(b, bases),
                  add=(b != bases[1]))
        })
    }
    legend("topright", pch=20, col=1:4, legend=bases)
}
In all cases, our methods worked well on the simulated data, correctly inferring the true error rates with an appropriate margin of error.
The two samples had substantially different overall error probabilities, of about 0.0041 for the Altai Neanderthal and 0.0034 for the Denisovan. However, error rates differed substantially according to what the true nucleotide was, a pattern that didn’t become clear until we split out error rates by region of the genome. Then, we found that both overall error rates and the pattern of error rates (i.e., the relative error rates of A, T, C, and G) differed substantially between regions, as well as between samples. In both samples, the error rates at C-G bases were higher, although this was much more pronounced in the Neanderthal. This is to be expected, because of deamination-induced DNA damage affecting C-G bases; this signature is good evidence that we are in fact getting ancient DNA. Furthermore, the way that overall error rates varied between regions was strongly correlated between samples (see the sketch below), despite these individuals having lived about 25,000 years apart. (In fact, the “regions” of the two samples don’t exactly line up, so this correlation is probably stronger than it appears in the final figure above.) This implies that some regions of the genome are more error-prone than others; one reason for this is that different parts of the genome are more or less covered by heterochromatin (which protects DNA), and these patterns of heterochromatin are conserved over evolutionary time. Puzzlingly, in some regions we see that the C and G error rates are substantially different, despite the prediction that these should be the same, since every C is reverse-complement paired to a G and vice versa. Finally, the error rates in the mitochondria look like those of many other regions in both samples: perhaps on the low side, but not surprisingly different. (Perhaps you did a statistical test of this: if so, good.)
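(Here is a sketch of how that cross-sample correlation could be quantified; it assumes regions with the same label correspond across samples, which as noted is only approximate.)
# per-region posterior mean error rates, averaged over bases, for each sample
rates <- with(big_tab, tapply(posterior_mean, list(region=region, id=id), mean))
cor(rates[, "altai"], rates[, "denis"])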