Homework #2 :: solutions

Note: This is a partial solution, focused on technical aspects. I generally expect your reports to have a bit more words, describing what you’re doing and why.

Set-up

Here’s some useful functions we’ll have use for later.

max_col <- function (x) {
    # return vector of indices corresp to which column has the largest value
    #   faster than apply(, 1, which.max)
    out <- rep(1, NROW(x))
    maxs <- x[,1]
    for (k in seq_len(NCOL(x))[-1]) {
        yes <- (x[,k] > maxs)
        out[yes] <- k
        maxs[yes] <- x[yes,k]
    }
    return(out)
}
stopifnot( { x <- matrix(rnorm(120), ncol=15); all(max_col(x) == apply(x, 1, which.max)) } )

segplot <- function (x, labels=colnames(x), ...) {
    splot(far_left=colMins(x), 
          left=colQuantiles(x, probs=0.025), 
          mid=colMeans(x), 
          right=colMaxs(x, probs=0.975), 
          far_right=colMaxs(x),
          labels=labels, ...)
}
splot <- function (far_left, left, mid, right, far_right, labels, 
                   yvals=length(mid):1, 
                   xlim=range(far_left, left, mid, right, far_right, finite=TRUE), 
                   add=FALSE, col='red', pt.col='black',
                   ...) {
    # make those nice horizontal line plots
    par(mar=c(par("mar"), 9)[c(1,5,3,4)])
    if (!add) { 
        plot(0, ylab='', yaxt='n', type='n', ylim=range(yvals), xlim=xlim, ...) 
        axis(2, at=yvals, labels=labels, las=2)
    }
    segments(x0=far_left, x1=far_right, y0=yvals)
    segments(x0=left, x1=right,
             y0=yvals, col=col, lwd=2)
    points(mid, yvals, pch=20, col=pt.col)
}

The data

We first read in the data, and for later use relabel the mitochondria as region 0.

altai <- read.table("data/altai.counts.gz", header=TRUE, stringsAsFactors=FALSE)
denis <- read.table("data/denisova.counts.gz", header=TRUE, stringsAsFactors=FALSE)

geno <- rbind(altai, denis)
geno$id <- factor(c(rep("altai", nrow(altai)), rep("denis", nrow(denis))))
rm(altai, denis)

bases <- c("A","T","C","G")
geno$region[geno$region == "mt"] <- 0
geno$region <- as.numeric(geno$region)
geno$coverage <- rowSums(geno[,bases])
geno$major <- factor(bases[max_col(geno[,bases])], levels=bases)
geno$major_coverage <- geno[,bases][cbind(1:nrow(geno), as.numeric(geno$major))]

First, let’s look at the distribution of total coverages:

hist(subset(geno,id=="altai" & region>0)$coverage, breaks=30, 
     main="Altai coverage, nuclear", xlab='coverage')

plot of chunk total_coverage

hist(subset(geno,id=="altai" & region==0)$coverage, breaks=30, 
     main="Altai coverage, mitochondria", xlab='coverage')

plot of chunk total_coverage

hist(subset(geno,id=="denis" & region>0)$coverage, breaks=30, 
     main="Densiovan coverage, nuclear", xlab='coverage')

plot of chunk total_coverage

hist(subset(geno,id=="denis" & region==0)$coverage, breaks=30, 
     main="Densiovan coverage, mitochondria", xlab='coverage')

plot of chunk total_coverage

Here’s another look at range of coverage, this time separated by region. Here the points give mean coverages, while the black and red lines give min/max and the middle 90% quantile, respectively. We see that coverage varies substantially by region.

coverage_stats <- geno %>% 
    group_by(region,id) %>% 
    summarize(mean=mean(coverage), 
              sd=sd(coverage), 
              min=min(coverage), 
              max=max(coverage), 
              q5=quantile(coverage,.05), 
              q95=quantile(coverage,.95))

with(coverage_stats, {
       y <- region + 110*as.numeric(id);
       plot(mean, y, xlim=range(max,min), pch=20, xlab='coverage', 
            ylab='region', yaxt='n');
       abline(h=105, lty=3);
       segments(x0=min, x1=max, y0=y);
       segments(x0=q5, x1=q95, y0=y, lwd=2, col='red')
       points(mean, y,pch=20) } )

plot of chunk cov_by_region

It looks like the nuclear genomes have some issues: such large coverages may well be a sign of problems. If we restrict to smaller coverages, we see that maximum coverage tails off gradually without a single natural cutoff. We will (somewhat arbitrarily) discard anything on the nuclear genome with coverage greater or equal to 200.

hist(subset(geno,id == "altai" & region>0 & coverage < 250)$coverage, breaks=30, 
     main="Altai coverage, nuclear", xlab='coverage')

plot of chunk altai_coverage

hist(subset(geno,id == "denis" & region>0 & coverage < 250)$coverage, breaks=30, 
     main="Densiovan coverage, nuclear", xlab='coverage')

plot of chunk altai_coverage

geno <- subset(geno, region == 0 | coverage < 200)

Here is a table, and also a plot, of mean base frequencies, separated by (mitochondria/not), sample, and major allele:

errors <- geno %>% mutate(mt=(region==0)) %>%
            group_by(mt, id, major) %>% 
            summarise(A=mean(A/coverage),
                      T=mean(T/coverage),
                      C=mean(C/coverage),
                      G=mean(G/coverage))
errors$overall <- rowSums(errors[,bases]) - apply(errors[,bases], 1, max)
errors

## # A tibble: 16 x 8
## # Groups:   mt, id [?]
##    mt    id     major         A        T        C        G overall
##    <lgl> <fctr> <fctr>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
##  1 F     altai  A      0.997    0.000848 0.000708 0.00146  0.00301
##  2 F     altai  T      0.000893 0.997    0.00133  0.000762 0.00299
##  3 F     altai  C      0.00218  0.00468  0.992    0.00140  0.00826
##  4 F     altai  G      0.00488  0.00223  0.00144  0.991    0.00856
##  5 F     denis  A      0.997    0.000734 0.00114  0.00144  0.00331
##  6 F     denis  T      0.000751 0.997    0.00136  0.00119  0.00330
##  7 F     denis  C      0.00141  0.00209  0.996    0.000835 0.00433
##  8 F     denis  G      0.00219  0.00139  0.000832 0.996    0.00441
##  9 T     altai  A      0.999    0.000209 0.000292 0.000500 0.00100
## 10 T     altai  T      0.000241 0.999    0.000608 0.000218 0.00107
## 11 T     altai  C      0.00107  0.00299  0.995    0.000438 0.00450
## 12 T     altai  G      0.00303  0.00127  0.000667 0.995    0.00497
## 13 T     denis  A      0.997    0.000453 0.00210  0.000601 0.00316
## 14 T     denis  T      0.000519 0.998    0.00134  0.000614 0.00247
## 15 T     denis  C      0.000801 0.000531 0.998    0.000246 0.00158
## 16 T     denis  G      0.000832 0.000703 0.000681 0.998    0.00222

plot(as.vector(4 * (match(errors$major, bases)-1) + col(errors[,bases]) - 1), 
     unlist(errors[,bases]), 
     col=c("black","red")[as.numeric(errors$id)],
     pch=c(1,5)[1+errors$mt], xlim=c(0, 20),
     log='y', xlab='', xaxt='n', ylab='frequency')
abline(v=4*(1:4)-0.5, lty=3)
axis(1, at=4*(0:3)+1.5, labels=bases, line=2, lwd=0, tick=FALSE)
axis(1, at=0:15, labels=rep(bases,4), line=0.5, lwd=0, tick=FALSE)
legend("topright", pch=c(1,1,5), col=c('black','red','black'),
       legend=c(levels(errors$id), "mito"))

plot of chunk overall_error

It looks like error rates range from 0.0010016, for Densiova mitochondria having a C, to 0.0085579, for Altai mitochondria having a G. There also looks to be substantial heterogeneity by base, reliable differences between the samples, and between nuclear/mitochondria. The overall error rate for the Denisovan was 0.0034253, and for the Altai Neanderthal it was 0.0040799. On average, the error rate in the mitochondria was 2.0606979 times lower.

Simulation

Now we’ll simulate data. Motivated by the table above, we’ll pick error rates to be around 0.0035, with some differences by sample, major allele, and different between the mitochondria and nuclear regions (but the same otherwise).

sim_rates <- list(
                  id=c("altai"=0.004, "denis"=0.0035),
                  region_fac=c(0.5, rep(1, length(unique(geno$region))-1)),
                  major_fac=1+rnorm(4,0,0.05)
              )
sim_geno <- geno[,c("region","id","coverage","major")]
sim_geno$error_rate <- with(sim_geno, 
                            sim_rates$id[id] * 
                                sim_rates$region_fac[region+1L] * 
                                sim_rates$major_fac[as.numeric(major)])
sim_geno$major_coverage <- rbinom(nrow(sim_geno), size=sim_geno$coverage,
                                  prob=1-sim_geno$error_rate)

Summation

The fact mentioned says that if \[\begin{aligned} X &\sim \Binom(10, 1/3) \\ Y &\sim \Binom(12, 1/3) \\ Z &= X + Y \end{aligned}\] then \[\begin{aligned} Z \sim \Binom(22, 1/3) \end{aligned}\] because if you flip a coin 10 times and count up the number of heads ($X$), then flip it 12 more times and count up the number of heads ($Y$), then (a) the total number of heads ($X+Y$) is the same as if you’d just flipped it 22 times, and (b) that’s all the information you need to know about it.

This means it suffices to collapse the data into a single table of how often we saw the right and the wrong base, by category, where “category” is a combination of sample ID, region, and true base.

Here we compute that table. The “minor” column refers to the sum of all less-frequently-seen bases, (i.e., not the “major” allele), which we are assuming are in error.

counts <- geno %>% 
            group_by(id, major, region) %>% 
            summarise(A=sum(A),
                      T=sum(T),
                      C=sum(C),
                      G=sum(G),
                      coverage=sum(coverage),
                      minor=sum(coverage)-sum(major_coverage))

We’ll also want the same thing for the simulated data.

sim_counts <- sim_geno %>%
                group_by(id, major, region) %>% 
                summarise(coverage=sum(coverage),
                          minor=sum(coverage)-sum(major_coverage))

The model(s)

One error rate

First let’s formulate and fit a simple model where every site has the same error rate. We’ll put a uniform prior on that error rate. If $Z$ is the total number of erroneous bases, and $N$ is the total coverage, then \[\begin{aligned} Z &\sim \Binom(N, \theta) \\ \theta &\sim \Beta(1,1) . \end{aligned}\] Since this is a Beta-Binomial model, we don’t need to use Stan, and can find the posterior distribution on $\theta$ explicitly as Beta($1+Z$, $1+N-Z$).

We’ll do it anyways, just to practice.

simple_block <- "
data {
    int coverage;
    int minor;
}
parameters {
    real<lower=0, upper=1> theta;
}
model {
    minor ~ binomial(coverage, theta);
    // theta ~ uniform(0,1);  // uniform is the same as not present
}
"

Simulated data

First we’ll fit to the simulated data:

simple_sim_data <- list(coverage=sum(sim_counts$coverage),
                        minor=sum(sim_counts$minor))
simple_sim_fit <- stan(model_code=simple_block,
                       data=simple_sim_data,
                       iter=1e3, chains=3)
simple_sim_samples <- extract(simple_sim_fit)
print(simple_sim_fit)

## Inference for Stan model: c69ed19821015102bec2e473988a69c3.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##           mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## theta        0    0.00 0.00        0        0        0        0        0
## lp__  -6736978    0.03 0.68 -6736980 -6736979 -6736978 -6736978 -6736978
##       n_eff Rhat
## theta   596    1
## lp__    691    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:38:59 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The average error rate, weighted by coverage, that we simulated under was 0.0031785: we should get close to that here. The posterior mean is 0.003185 and a 95% credible interval is from 0.003179 to 0.0031914. This agrees with the analytical answer, which has the posterior mean of a Beta(9.98421 × 10⁵, 3.1245365 × 10⁸) as 0.0031852, with a 95% credible interval of 0.003179 to 0.0031915.

Real data

Now, the real data:

simple_data <- list(coverage=sum(counts$coverage),
                        minor=sum(counts$minor))
simple_fit <- stan(model_code=simple_block,
                       data=simple_data,
                       iter=1e3, chains=3)
simple_samples <- extract(simple_fit)
print(simple_fit)

## Inference for Stan model: c69ed19821015102bec2e473988a69c3.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##           mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## theta        0    0.00 0.00        0        0        0        0        0
## lp__  -7870551    0.03 0.69 -7870553 -7870551 -7870550 -7870550 -7870550
##       n_eff Rhat
## theta   582    1
## lp__    550    1
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:39:03 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The posterior mean is 0.0038251 and a 95% credible interval is from 0.0038187 to 0.0038318. This agrees with the analytical answer, which has the posterior mean of a Beta(1.199004 × 10⁶, 3.1225307 × 10⁸) as 0.0038252, with a 95% credible interval of 0.0038183 to 0.003832.

Error rates by sample

Now we’ll fit a model where each sample has a different error rate. We could do this separately, using the Beta-Binomial again, but just for practice, we’ll do them jointly. Also, since we don’t even know what scale the concentration parameter for the prior should be on, we’ll put a hyperprior on it. Now, $Z_s$ and $N_s$ are the total number of erroneous bases and coverage for sample $s$, respectively (where $s$ is Altai'' orDensiovan’’). \[\begin{aligned} Z_s &\sim \Binom(N_s, \theta_s) \\ \theta_s &\sim \Beta(\mu \kappa, (1-\mu) \kappa) \\ \mu &\sim \Beta(1,1) \\ \kappa &\sim \Normal^+(0, \sigma) \\ \sigma &\sim \Exp(1/10000) \end{aligned}\]

Here’s the block

sample_block <- "
data {
    int coverage[2];
    int minor[2];
}
parameters {
    real<lower=0, upper=1> theta[2];
    real<lower=0, upper=1> mu;
    real<lower=0> kappa;
    real<lower=0> sigma;
}
model {
    minor ~ binomial(coverage, theta);
    theta ~ beta(mu * kappa, (1-mu) * kappa);
    // mu ~ beta(1,1); // omit
    kappa ~ normal(0, sigma);
    sigma ~ exponential(.0001);
}
"

Simulated data

Again, first we’ll fit to the simulated data:

sample_sim_data <- as.list(sim_counts %>% 
                           group_by(id) %>% 
                           summarize(coverage=sum(coverage), minor=sum(minor)))
sample_sim_fit <- stan(model_code=sample_block,
                       data=sample_sim_data,
                       iter=1e3, chains=3)
sample_sim_samples <- extract(sample_sim_fit)
print(sample_sim_fit)

## Inference for Stan model: ccf90a5386d09552cf1de8d09458b372.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##                 mean se_mean       sd        2.5%         25%         50%
## theta[1]        0.00    0.00     0.00        0.00        0.00        0.00
## theta[2]        0.00    0.00     0.00        0.00        0.00        0.00
## mu              0.00    0.00     0.00        0.00        0.00        0.00
## kappa       15931.31 1416.43 19769.64      168.86     3546.16     9322.70
## sigma       14761.48  861.56 12627.64      676.82     5464.85    11453.45
## lp__     -6736960.55    0.23     2.22 -6736966.42 -6736961.72 -6736960.11
##                  75%       97.5% n_eff Rhat
## theta[1]        0.00        0.00  1500 1.00
## theta[2]        0.00        0.00  1500 1.00
## mu              0.00        0.01   178 1.02
## kappa       20594.37    72618.07   195 1.01
## sigma       20601.12    47185.28   215 1.01
## lp__     -6736958.90 -6736957.77    97 1.04
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:47:53 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Looks like we might want to run for more iterations; also, the value of $\kappa$ here is of order 10,000, so it might be pushing up against the prior.

The average error rates, weighted by coverage, that we simulated under were 0.0031689, 0.0031935 for the Altai Neanderthal and the Denisovan, respectively. we should get close to that here. The posterior means here are 0.0031733, 0.0032037 and a 95% credible interval is from 0.0031652, 0.0031937 to 0.0031815, 0.0032139 (respectively).

Real data

Now, for the real data.

sample_data <- as.list(counts %>% 
                           group_by(id) %>% 
                           summarize(coverage=sum(coverage), minor=sum(minor)))
sample_fit <- stan(model_code=sample_block,
                       data=sample_data,
                       iter=1e3, chains=3)
sample_samples <- extract(sample_fit)
print(sample_fit)

## Inference for Stan model: ccf90a5386d09552cf1de8d09458b372.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##                 mean se_mean       sd        2.5%         25%         50%
## theta[1]        0.00    0.00     0.00        0.00        0.00        0.00
## theta[2]        0.00    0.00     0.00        0.00        0.00        0.00
## mu              0.00    0.00     0.00        0.00        0.00        0.00
## kappa       11049.52  927.92 12024.42      281.01     2705.85     7134.29
## sigma       12322.01 1297.10 10593.41      766.92     4606.33     9332.06
## lp__     -7866296.20    0.21     1.80 -7866300.30 -7866297.32 -7866295.87
##                  75%       97.5% n_eff Rhat
## theta[1]        0.00        0.00  1500 1.00
## theta[2]        0.00        0.00  1500 1.00
## mu              0.00        0.01    99 1.03
## kappa       15207.24    43293.25   168 1.02
## sigma       16808.77    40548.28    67 1.04
## lp__     -7866294.80 -7866293.72    74 1.03
## 
## Samples were drawn using NUTS(diag_e) at Tue Jan 30 15:47:57 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The observed error rates were 0.0040799, 0.0034253 for the Altai Neanderthal and Denisovan respectively; we should get close to that here. The posterior means here are 0.0040799, 0.0034253 and a 95% credible interval is from 0.0040713, 0.0034159 to 0.0040887, 0.0034353 (respectively).

Different error rates by base

Now the question is: we know that error rates differ between the samples; but given this, do they differ by base? We will do this by allowing the nucleotide-specific error rates to be arbitrary, and different, between the two samples; we could build in something so that the relative error rates of A, C, G, and T are similar between the samples, but we won’t.

Here $\mu$ is the per-sample mean error rate, and $\theta$ represents the deviations of each base about this mean. \[\begin{aligned} Z_{s,b} &\sim \Binom(N_{s,b}, \theta_{s,b}) \\ \theta_{s,b} &\sim \Beta(\mu_s \kappa_s, (1-\mu_s) \kappa_s) \\ \mu_s &\sim \Beta(1, 1) \\ \kappa_s &\sim \Normal(0, \sigma) \\ \sigma &\sim \Exp(10^{-5}) \end{aligned}\]

Here’s a block for this:

sb_block <- "
data {
    int coverage[8];
    int minor[8];  // number of errors
    int id[8];     // which sample
    int major[8];  // which base is the true one
}
parameters {
    real<lower=0, upper=1> theta[8];
    vector<lower=0, upper=1>[2] mu;
    vector<lower=0>[2] kappa;
    real<lower=0> sigma;
}
model {
    minor ~ binomial(coverage, theta);
    theta ~ beta(mu[id] .* kappa[id], (1-mu[id]) .* kappa[id]);
    // mu ~ beta(1,1); // omit
    kappa ~ normal(0, sigma);
    sigma ~ exponential(.00001);
}
"

Simulated data

Again, first we’ll fit to the simulated data:

sb_sim_data <- sim_counts %>% 
                       group_by(id, major) %>% 
                       summarize(coverage=sum(coverage), minor=sum(minor))
sb_sim_fit <- stan(model_code=sb_block,
                       data=list(
                                 coverage=sb_sim_data$coverage,
                                 minor=sb_sim_data$minor,
                                 id=as.numeric(sb_sim_data$id),
                                 major=as.numeric(sb_sim_data$major)),
                       iter=1e3, chains=3)

sb_sim_samples <- extract(sb_sim_fit)
rstan::summary(sb_sim_fit)$summary

##                   mean      se_mean           sd          2.5%
## theta[1]  3.048872e-03 1.774589e-07 6.872954e-06  3.035090e-03
## theta[2]  3.389424e-03 2.117187e-07 8.199831e-06  3.373187e-03
## theta[3]  3.031780e-03 2.190216e-07 8.482671e-06  3.014641e-03
## theta[4]  3.241158e-03 2.604123e-07 1.008573e-05  3.221802e-03
## theta[5]  3.071047e-03 2.198717e-07 8.515594e-06  3.054511e-03
## theta[6]  3.390961e-03 2.434702e-07 9.429560e-06  3.373123e-03
## theta[7]  3.167206e-03 2.986842e-07 1.156799e-05  3.144522e-03
## theta[8]  3.178614e-03 2.967080e-07 1.149145e-05  3.155779e-03
## mu[1]     3.188661e-03 5.506554e-06 1.103581e-04  2.966018e-03
## mu[2]     3.202322e-03 4.896433e-06 9.801477e-05  3.012432e-03
## kappa[1]  1.095351e+05 3.379973e+03 7.226984e+04  1.717901e+04
## kappa[2]  1.358618e+05 4.238694e+03 9.562847e+04  2.292444e+04
## sigma     1.453313e+05 4.345127e+03 8.731068e+04  3.240150e+04
## lp__     -6.735936e+06 1.246072e-01 2.719970e+00 -6.735943e+06
##                    25%           50%           75%         97.5%     n_eff
## theta[1]  3.044288e-03  3.048955e-03  3.053746e-03  3.061761e-03 1500.0000
## theta[2]  3.383868e-03  3.389633e-03  3.394976e-03  3.406069e-03 1500.0000
## theta[3]  3.025953e-03  3.032006e-03  3.037459e-03  3.048672e-03 1500.0000
## theta[4]  3.234325e-03  3.240982e-03  3.247799e-03  3.261464e-03 1500.0000
## theta[5]  3.065257e-03  3.071115e-03  3.076934e-03  3.087499e-03 1500.0000
## theta[6]  3.384702e-03  3.390676e-03  3.397253e-03  3.409990e-03 1500.0000
## theta[7]  3.159565e-03  3.166870e-03  3.175252e-03  3.189995e-03 1500.0000
## theta[8]  3.170953e-03  3.178430e-03  3.186234e-03  3.201454e-03 1500.0000
## mu[1]     3.130030e-03  3.182512e-03  3.245656e-03  3.426296e-03  401.6510
## mu[2]     3.144539e-03  3.201438e-03  3.259524e-03  3.390128e-03  400.7038
## kappa[1]  5.572054e+04  9.269921e+04  1.488857e+05  2.956066e+05  457.1805
## kappa[2]  6.630695e+04  1.146025e+05  1.783088e+05  3.868454e+05  508.9913
## sigma     8.327676e+04  1.273635e+05  1.892683e+05  3.666484e+05  403.7661
## lp__     -6.735938e+06 -6.735936e+06 -6.735935e+06 -6.735932e+06  476.4767
##               Rhat
## theta[1] 0.9986473
## theta[2] 0.9992916
## theta[3] 0.9989723
## theta[4] 0.9989596
## theta[5] 0.9986153
## theta[6] 0.9988985
## theta[7] 0.9984938
## theta[8] 0.9987867
## mu[1]    1.0000346
## mu[2]    1.0020745
## kappa[1] 1.0010955
## kappa[2] 1.0032288
## sigma    1.0002993
## lp__     0.9995747

There’s starting to be more parameters to compare to, so we’ll be a bit more organized. Here is a table of true values, along with posterior means and boundaries of 95% credible intervals:

sb_tab <- sim_geno %>% group_by(id, major) %>% 
    summarize(error_rate=sum(error_rate*coverage)/sum(coverage))
sb_tab$posterior_mean <- colMeans(sb_sim_samples$theta)
sb_tab$post_q.025 <- colQuantiles(sb_sim_samples$theta, prob=.025)
sb_tab$post_q.975 <- colQuantiles(sb_sim_samples$theta, prob=.975)
sb_tab

## # A tibble: 8 x 6
## # Groups:   id [?]
##   id     major  error_rate posterior_mean post_q.025 post_q.975
##   <fctr> <fctr>      <dbl>          <dbl>      <dbl>      <dbl>
## 1 altai  A         0.00304        0.00305    0.00304    0.00306
## 2 altai  T         0.00339        0.00339    0.00337    0.00341
## 3 altai  C         0.00303        0.00303    0.00301    0.00305
## 4 altai  G         0.00323        0.00324    0.00322    0.00326
## 5 denis  A         0.00307        0.00307    0.00305    0.00309
## 6 denis  T         0.00337        0.00339    0.00337    0.00341
## 7 denis  C         0.00315        0.00317    0.00314    0.00319
## 8 denis  G         0.00317        0.00318    0.00316    0.00320

We can see that the true values are within the credible intervals in every case.

Here are the results for higher-level parameters:

theta_names <- paste(sb_sim_data$id, sb_sim_data$major, sep=":")
mu_names <- paste("mean,", levels(sb_sim_data$id))
kappa_names <- paste("concentration,", levels(sb_sim_data$id))

segplot(cbind(sb_sim_samples$theta, sb_sim_samples$mu), c(theta_names, mu_names))

plot of chunk sim_sb_hyper

segplot(cbind(sb_sim_samples$kappa, sb_sim_samples$sigma), c(kappa_names, "sigma"))

plot of chunk sim_sb_hyper

Real data

Now, for the real data:

sb_data <- counts %>% 
               group_by(id, major) %>% 
               summarize(coverage=sum(coverage), minor=sum(minor))
sb_fit <- stan(model_code=sb_block,
                   data=list(
                             coverage=sb_data$coverage,
                             minor=sb_data$minor,
                             id=as.numeric(sb_data$id),
                             major=as.numeric(sb_data$major)),
                   iter=1e3, chains=3)

sb_samples <- extract(sb_fit)
rstan::summary(sb_fit)$summary

##                   mean      se_mean           sd          2.5%
## theta[1]  2.172189e-03 1.589084e-07 6.154496e-06  2.159926e-03
## theta[2]  2.314288e-03 1.675222e-07 6.488109e-06  2.301658e-03
## theta[3]  6.220719e-03 2.937712e-07 1.137771e-05  6.199269e-03
## theta[4]  7.634249e-03 3.915556e-07 1.516488e-05  7.604839e-03
## theta[5]  3.156039e-03 2.253786e-07 8.728876e-06  3.139309e-03
## theta[6]  3.066519e-03 2.413862e-07 9.348846e-06  3.048579e-03
## theta[7]  3.629858e-03 3.069091e-07 1.188654e-05  3.606856e-03
## theta[8]  4.203468e-03 3.484523e-07 1.349550e-05  4.177913e-03
## mu[1]     5.227407e-03 1.014684e-04 2.264665e-03  2.870230e-03
## mu[2]     3.623621e-03 4.195787e-05 6.135662e-04  2.925861e-03
## kappa[1]  8.944200e+02 2.245731e+01 5.947890e+02  1.126520e+02
## kappa[2]  1.243148e+04 4.061996e+02 9.805067e+03  1.105108e+03
## sigma     2.157941e+04 1.010582e+03 2.567935e+04  1.482786e+03
## lp__     -7.744182e+06 1.579456e-01 2.699572e+00 -7.744189e+06
##                    25%           50%           75%         97.5%     n_eff
## theta[1]  2.168117e-03  2.172266e-03  2.176305e-03  2.184874e-03 1500.0000
## theta[2]  2.309743e-03  2.314242e-03  2.318744e-03  2.326655e-03 1500.0000
## theta[3]  6.212664e-03  6.220630e-03  6.228557e-03  6.242949e-03 1500.0000
## theta[4]  7.624295e-03  7.634264e-03  7.644449e-03  7.664773e-03 1500.0000
## theta[5]  3.150416e-03  3.155884e-03  3.161655e-03  3.173211e-03 1500.0000
## theta[6]  3.060028e-03  3.066335e-03  3.072874e-03  3.084331e-03 1500.0000
## theta[7]  3.622062e-03  3.629971e-03  3.637550e-03  3.653016e-03 1500.0000
## theta[8]  4.194138e-03  4.203591e-03  4.213496e-03  4.229218e-03 1500.0000
## mu[1]     4.028572e-03  4.801026e-03  5.857157e-03  1.002260e-02  498.1345
## mu[2]     3.367411e-03  3.558818e-03  3.765425e-03  4.676371e-03  213.8435
## kappa[1]  4.543426e+02  7.913557e+02  1.221692e+03  2.375364e+03  701.4719
## kappa[2]  5.107320e+03  9.983179e+03  1.723234e+04  3.790230e+04  582.6692
## sigma     6.438230e+03  1.301162e+04  2.711873e+04  9.129779e+04  645.6916
## lp__     -7.744184e+06 -7.744182e+06 -7.744180e+06 -7.744178e+06  292.1290
##               Rhat
## theta[1] 0.9992033
## theta[2] 0.9983825
## theta[3] 0.9989792
## theta[4] 0.9983739
## theta[5] 0.9987056
## theta[6] 0.9985890
## theta[7] 0.9982808
## theta[8] 0.9983814
## mu[1]    0.9991045
## mu[2]    1.0043019
## kappa[1] 0.9999994
## kappa[2] 1.0022672
## sigma    1.0027291
## lp__     1.0050122

Here is a table of estimated error rates, along with posterior means and boundaries of 95% credible intervals:

sb_tab <- geno %>% group_by(id, major) %>% 
    summarize(observed_error=1-sum(major_coverage)/sum(coverage))
sb_tab$posterior_mean <- colMeans(sb_samples$theta)
sb_tab$post_q.025 <- colQuantiles(sb_samples$theta, prob=.025)
sb_tab$post_q.975 <- colQuantiles(sb_samples$theta, prob=.975)
sb_tab

## # A tibble: 8 x 6
## # Groups:   id [?]
##   id     major  observed_error posterior_mean post_q.025 post_q.975
##   <fctr> <fctr>          <dbl>          <dbl>      <dbl>      <dbl>
## 1 altai  A             0.00217        0.00217    0.00216    0.00218
## 2 altai  T             0.00231        0.00231    0.00230    0.00233
## 3 altai  C             0.00622        0.00622    0.00620    0.00624
## 4 altai  G             0.00763        0.00763    0.00760    0.00766
## 5 denis  A             0.00316        0.00316    0.00314    0.00317
## 6 denis  T             0.00307        0.00307    0.00305    0.00308
## 7 denis  C             0.00363        0.00363    0.00361    0.00365
## 8 denis  G             0.00420        0.00420    0.00418    0.00423

Here we see that Stan is still inferring the error rates to be nearly exactly what we’d guess just from the empirical proportion (as makes sense, since we have a lot of data).

Here are the results for higher-level parameters:

segplot(cbind(sb_samples$theta, sb_samples$mu), c(theta_names, mu_names))

plot of chunk sb_hyper

segplot(cbind(sb_samples$kappa, sb_samples$sigma), c(kappa_names, "sigma"))

plot of chunk sb_hyper

It looks like the per-base error rates are somewhat similar between the two samples - it is higher for C and G for both - but the pattern seen in both is different. This is to be expected, because a major source of DNA damage is deamination of cytosines to uracil, that gets read as thiamene.

Differences by region

We’d now like to continue the above analysis, but allowing error rates to differ by region. We’ll do this by just adding another level of the hierarchy corresponding to region, so that what we had previously as $\theta_{s,b}$ will now be $\nu_{s,b}$, the mean error rate in sample $s$ with true base $b$, but the actual error rate in a particular region will be a random deviate from that.

\[\begin{aligned} Z_{s,b,r} &\sim \Binom(N_{s,b,r}, \theta_{s,b,r}) \\ \theta_{s,b,r} &\sim \Beta(\nu_{s,b} \gamma{s,b}, (1-\nu_{s,b}) \gamma{s,b}) \\ \nu_{s,b} &\sim \Beta(\mu_s \kappa_s, (1-\mu_s) \kappa_s) \\ \gamma_{s,b} &\sim \Normal(0, \sigma_\gamma) \\ \mu_s &\sim \Beta(1, 1) \\ \kappa_s &\sim \Normal(0, \sigma_\mu) \\ \sigma_\gamma &\sim \Exp(10^{-5}) \sigma_\mu &\sim \Exp(10^{-5}) \end{aligned}\]

Here’s a block for this:

big_block <- "
data {
    int N;   // this will be 808
    int coverage[N];
    int minor[N];  // number of errors
    int id[N];     // which sample
    int major[N];  // which base is the true one
    int region[N];
}
parameters {
    real<lower=0, upper=1> theta[N];
    matrix<lower=0, upper=1>[2,4] nu;
    matrix<lower=0>[2,4] gamma;
    vector<lower=0, upper=1>[2] mu;
    vector<lower=0>[2] kappa;
    real<lower=0> sigma_gamma;
    real<lower=0> sigma_kappa;
}
model {
    real alpha;
    real beta;
    minor ~ binomial(coverage, theta);
    for (k in 1:N) {
        theta[k] ~ beta(nu[id[k], major[k]] * gamma[id[k], major[k]],
                        (1 - nu[id[k], major[k]]) * gamma[id[k], major[k]]);
    }
    for (s in 1:2) {
        nu[s,] ~ beta(mu[s] .* kappa[s], (1-mu[s]) .* kappa[s]);
        gamma[s,] ~ normal(0, sigma_gamma);
    }
    // mu ~ beta(1,1); :: omit
    kappa ~ normal(0, sigma_kappa);
    sigma_gamma ~ exponential(.00001);
    sigma_kappa ~ exponential(.00001);
}
"

Simulated data

Once again, first we’ll fit to the simulated data:

big_sim_fit <- stan(model_code=big_block,
                       data=list(
                                 N=nrow(sim_counts),
                                 coverage=sim_counts$coverage,
                                 minor=sim_counts$minor,
                                 id=as.numeric(sim_counts$id),
                                 major=as.numeric(sim_counts$major),
                                 region=1L + as.integer(sim_counts$region)),
                       iter=1e3, chains=3)

big_sim_samples <- extract(big_sim_fit)

We now compute the table of true values, along with posterior means and boundaries of 95% credible intervals, but rather than view it numerically, we’ll plot it. There is one plot for each sample, and purple points depict the true values.

all_theta_names <- paste(sim_counts$id, sim_counts$major, sim_counts$region, sep=":")
all_nu_names <- paste("mean,", outer(levels(sim_counts$id), levels(sim_counts$major), paste))
all_gamma_names <- paste("concentration,", outer(levels(sim_counts$id), levels(sim_counts$major), paste))
all_mu_names <- paste("mean,", levels(sim_counts$id))
all_kappa_names <- paste("concentration,", levels(sim_counts$id))

big_tab <- sim_geno %>% group_by(id, major, region) %>% 
    summarize(error_rate=sum(error_rate*coverage)/sum(coverage))
big_tab$posterior_mean <- colMeans(big_sim_samples$theta)
big_tab$post_q.025 <- colQuantiles(big_sim_samples$theta, prob=.025)
big_tab$post_q.25 <- colQuantiles(big_sim_samples$theta, prob=.25)
big_tab$post_q.75 <- colQuantiles(big_sim_samples$theta, prob=.75)
big_tab$post_q.975 <- colQuantiles(big_sim_samples$theta, prob=.975)

layout(t(1:2))
with(subset(big_tab, id=="altai"), {
     splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975, 
           labels=sub("[^:]*:","",all_theta_names[big_tab$id=="altai"]),
           main='error rates, Altai neanderthal') 
     points(error_rate, length(error_rate):1, col='purple')
   })
with(subset(big_tab, id=="denis"), {
     splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975, 
           labels=sub("[^:]*:","",all_theta_names[big_tab$id=="denis"]),
           main='error rates, Denisovan')
     points(error_rate, length(error_rate):1, col='purple')
   })

$plot of chunk big_tab_sim_plot$ We can see that the true values are within the credible intervals in every case.

Here are the results for higher-level parameters:

segplot(cbind(big_sim_samples$mu, matrix(big_sim_samples$nu, nrow=nrow(big_sim_samples$mu))), 
        c(all_mu_names,all_nu_names))

plot of chunk sim_big_hyper

segplot(cbind(big_sim_samples$kappa, 
              matrix(big_sim_samples$gamma, nrow=nrow(big_sim_samples$kappa)), 
              big_sim_samples$sigma_gamma, big_sim_samples$sigma_kappa), 
        c(all_kappa_names, all_gamma_names, "sigma_gamma", "sigma_kappa"))

plot of chunk sim_big_hyper

Real data, finally

Now, for the real data, with the big model!

big_fit <- stan(model_code=big_block,
                       data=list(
                                 N=nrow(counts),
                                 coverage=counts$coverage,
                                 minor=counts$minor,
                                 id=as.numeric(counts$id),
                                 major=as.numeric(counts$major),
                                 region=1L + as.integer(counts$region)),
                       iter=1e3, chains=3)

big_samples <- extract(big_fit)

I’ve looked at the summary, and convergence looks good; the table is too big to reproduce here.

Here are the estimated error rates, as depicted above. There is one plot for each sample; different bases are grouped together, and then ordered by region.

big_tab <- geno %>% group_by(id, major, region) %>% 
    summarize(error_rate=sum(error_rate*coverage)/sum(coverage))

## Error in summarise_impl(.data, dots): Evaluation error: object 'error_rate' not found.

big_tab$posterior_mean <- colMeans(big_samples$theta)
big_tab$post_q.025 <- colQuantiles(big_samples$theta, prob=.025)
big_tab$post_q.25 <- colQuantiles(big_samples$theta, prob=.25)
big_tab$post_q.75 <- colQuantiles(big_samples$theta, prob=.75)
big_tab$post_q.975 <- colQuantiles(big_samples$theta, prob=.975)

layout(t(1:2))
with(subset(big_tab, id=="altai"), {
     splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975, 
           labels=sub("[^:]*:","",all_theta_names[big_tab$id=="altai"]),
           main='error rates, Altai neanderthal') 
   })
with(subset(big_tab, id=="denis"), {
     splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975, 
           labels=sub("[^:]*:","",all_theta_names[big_tab$id=="denis"]),
           main='error rates, Denisovan')
   })

plot of chunk big_tab_plot

Here are the results for higher-level parameters:

segplot(cbind(big_samples$mu, matrix(big_samples$nu, nrow=nrow(big_samples$mu))), 
        c(all_mu_names,all_nu_names))

plot of chunk big_hyper

segplot(cbind(big_samples$kappa, 
              matrix(big_samples$gamma, nrow=nrow(big_samples$kappa)), 
              big_samples$sigma_gamma, big_samples$sigma_kappa), 
        c(all_kappa_names, all_gamma_names, "sigma_gamma", "sigma_kappa"))

plot of chunk big_hyper

There is substantial heterogeneity between regions. Let’s look at the results in another way: with the four bases grouped together for each region.

layout(t(1:2))
for (sample_id in levels(big_tab$id)) {
    for (b in bases) {
        with(subset(big_tab, id==sample_id & major == b), {
             splot(post_q.025, post_q.25, posterior_mean, post_q.75, post_q.975, 
                   xlim=range(big_tab$post_q.025, big_tab$post_q.975),
                   labels=gsub("[^:]*:","",all_theta_names[big_tab$id==sample_id & big_tab$major==b]),
                   main=paste('error rates,', sample_id),
                   pt.col=match(b, bases), col=match(b,bases),
                   add=(b != bases[1]))
           })
    }
    legend("topright", pch=20, col=1:4, legend=bases)
}

plot of chunk big_tab_plot2

Conclusions

In all cases, our methods worked well on the simulated data, correctly inferring the true error rates with an appropriate margin of error.

The two samples had substantially different overall error probabilities, of about 0.0041 for the Altai Neanderthal and 0.0034 for the Densiovan. However, error rates differed substantially according to what the true nucleotide was, a pattern that didn’t become clear until we split out error rates by region of the genome. Then, we found that both overall error rates and the pattern of error rates (i.e., relative error rates of A, T, C, and G) differed substantially between regions, as well as between samples. In both samples, the error rates of C-G bases was higher, although this was much more pronounced in the Neanderthal. This is to be expected, because of deamination-induced DNA damage affecting C-G bases; this signature is good evidence that we are in fact getting ancient DNA. Furthermore, the way that overall error rates varied between regions was strongly correlated between samples, despite these having lived about 25,000 years apart. (In fact, the “regions” of the two samples don’t exactly line up, so this correlation is probably stronger than it appears in the final figure above.) This implies that some regions of the genome are more error-prone than others; one reason for this is that different parts of the genome are more or less covered by heterochromatin (which protects DNA), and these patterns of heterochromatin are conserved over evolutionary time. Puzzlingly, in some regions we see that the C and G error rates are substantially different, despite the prediction that these should be the same, as every C is reverse-complement paired to a G and vice-versa. Finally, the error rates in the mitochondria look like many other regions in both samples: perhaps on the low side, but not surprisingly different. (Perhaps you did a statistical test of this: if so, good.)