ecology, evolution, and simulation
Peter Ralph
University of Oregon // October 2019
Students/postdocs/researchers:
Funding:
Other collaborators:
Human sickle-cell allele (HbS): (Currat et al 2002)
Human G6PD variants: (Howes et al 2013)
How generalizable is this?
Hudson 1994; Cutter & Payseur 2013; Corbett-Detig et al 2015
The indirect effects of selection on genomic locations that are linked to the sites under selection by a lack of recombination.
\[ \begin{aligned} \pi &= \text{ (within-pop genetic distance) } \\ d_{xy} &= \text{ (between-pop genetic distance) } \end{aligned} \]
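As a concrete reading of these two definitions, here is a minimal sketch that computes both from aligned haplotypes. The toy sequences and the function names (`hamming`, `pi`, `dxy`) are made up for illustration, and distances are left as raw counts of differing sites rather than divided by sequence length.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pi(pop):
    """Mean distance over all pairs of distinct sequences within one population."""
    pairs = list(combinations(pop, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

def dxy(pop_x, pop_y):
    """Mean distance over all pairs with one sequence from each population."""
    pairs = list(product(pop_x, pop_y))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy data: three haplotypes per population, five sites each.
pop_x = ["00000", "00010", "01010"]
pop_y = ["10100", "10101", "10001"]

print("pi(x) =", pi(pop_x))
print("pi(y) =", pi(pop_y))
print("dxy   =", dxy(pop_x, pop_y))
```

In this toy data the between-population distance is larger than either within-population distance, which is the pattern expected after a period of isolation.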
How much genetic variation typically underlies traits?
How important is natural selection in determining genetic diversity?
How will populations respond to changes in the future?
To test theories and fit models, we need simulations with realistic
For a set of sampled chromosomes, at each position along the genome there is a genealogical tree that says how they are related.
The tree sequence is a way to describe this, er, sequence of trees.
Kelleher, Etheridge, and McVean introduced the tree sequence data structure for a fast coalescent simulator, msprime.
stores sequence and genealogical data very efficiently
tree-based sequence storage closely related to haplotype-matching compression
python/C tskit
tools
What do genotypes tell us about the genealogies?
Genotypes:
Example: genetic distance counts how many mutations differ between two sequences.
Genotypes:
Example: sequence divergence counts how many mutations differ between two sequences.
Trees:
Example: the mean time to most recent common ancestor between two sequences.
Any set of sample weights \(w\) and summary function \(f\) defines both
With genealogies fixed, and averaging only over mutations with rate \(\mu\), \[\begin{equation} \text{Branch}(f, w) = \frac{1}{\mu} \E\left[ \text{Site}(f, w) \right] . \end{equation}\]
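A small numerical check of this identity, as a sketch under simplifying assumptions: a single hypothetical three-sample tree, no per-site normalisation, and mutations falling on each branch as a Poisson process with rate \(\mu\) per unit of branch length. The weights and summary function below are chosen so that both statistics measure divergence between samples 0 and 2; all names are illustrative and not taken from any library.

```python
import random

# Hypothetical genealogy: child -> (parent, branch length); node 4 is the root.
BRANCHES = {0: (3, 1.0), 1: (3, 1.0), 3: (4, 2.0), 2: (4, 3.0)}
SAMPLES = [0, 1, 2]

def samples_below(node):
    """Samples that inherit from the branch above `node`."""
    out = []
    for s in SAMPLES:
        u = s
        while u != node and u in BRANCHES:
            u = BRANCHES[u][0]
        if u == node:
            out.append(s)
    return out

def summary(node, w, f):
    """Apply f to the total weights of the samples below `node`."""
    below = samples_below(node)
    dims = len(w[SAMPLES[0]])
    totals = [sum(w[s][k] for s in below) for k in range(dims)]
    return f(*totals)

def branch_stat(f, w):
    """Sum over branches of (branch length) * f(weights below)."""
    return sum(length * summary(node, w, f)
               for node, (_, length) in BRANCHES.items())

def site_stat(f, w, mu, rng):
    """Sum of f(weights below) over randomly placed mutations."""
    total = 0.0
    for node, (_, length) in BRANCHES.items():
        # Exponential waiting times give Poisson(mu * length) mutations.
        t = rng.expovariate(mu)
        while t < length:
            total += summary(node, w, f)
            t += rng.expovariate(mu)
    return total

def mean_site_stat(f, w, mu, reps, seed=1):
    """Average the site statistic over independent rounds of mutation."""
    rng = random.Random(seed)
    return sum(site_stat(f, w, mu, rng) for _ in range(reps)) / reps

# Two-dimensional weights: the first marks sample 0, the second marks sample 2.
w = {0: (1, 0), 1: (0, 0), 2: (0, 1)}

def f(a, b):
    """1 if a branch or mutation separates sample 0 from sample 2, else 0."""
    return a * (1 - b) + (1 - a) * b

mu = 0.5
print("Branch(f, w)       =", branch_stat(f, w))
print("E[Site(f, w)] / mu ~", mean_site_stat(f, w, mu, reps=20000) / mu)
```

The branch statistic is the total branch length separating the two samples, twice their time to most recent common ancestor, and the rescaled average of the site statistic should land close to it.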
Dealing directly with genealogies can remove the layer of noise due to mutation: \[\begin{equation} \frac{1}{\mu^2} \var\left[\text{Site}(f,w)\right] = \var\left[\text{Branch}(f,w)\right] + \frac{1}{n} \E\left[\text{Branch}(f^2,w)\right] \end{equation}\]
Duality, on 1000 Genomes data? Not quite…
Erik Lundgren: “Isolation By Coalescence”
Populus trichocarpa and P. balsamifera data from Moreno Geraldes et al 2014
If we record the tree sequence that relates everyone to everyone else,
after the simulation is over we can put neutral mutations down on the trees.
Since neutral mutations don’t affect demography,
this is equivalent to having kept track of them throughout.
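A minimal sketch of that after-the-fact step, assuming the recorded history is available as edges `(left, right, parent, child)` plus a birth time for each node. The toy history, the rate `mu`, and the function name `overlay_mutations` are hypothetical. Each edge receives a Poisson number of mutations with mean `mu` times its genomic span times its length in time.

```python
import random

def overlay_mutations(edges, node_time, mu, seed=None):
    """Place neutral mutations on recorded edges.

    mu is the mutation rate per unit of genome per unit of time.
    Returns (position, node) pairs sorted by position, where node is the
    child end of the edge the mutation fell on.
    """
    rng = random.Random(seed)
    mutations = []
    for left, right, parent, child in edges:
        area = (right - left) * (node_time[parent] - node_time[child])
        # Exponential waiting times give a Poisson(mu * area) count.
        t = rng.expovariate(mu)
        while t < area:
            mutations.append((rng.uniform(left, right), child))
            t += rng.expovariate(mu)
    return sorted(mutations)

# Hypothetical recorded history on a genome of length 10, with two trees:
# the root is node 4 on [0, 5) and node 5 on [5, 10).
NODE_TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 3.0, 5: 2.0}
EDGES = [
    (0, 10, 3, 0), (0, 10, 3, 1),
    (0, 5, 4, 3), (0, 5, 4, 2),
    (5, 10, 5, 3), (5, 10, 5, 2),
]

for position, node in overlay_mutations(EDGES, NODE_TIME, mu=0.1, seed=42):
    print(f"mutation at position {position:.2f} above node {node}")
```

Because the mutations are neutral, nothing in the recorded history depends on them, so they can be drawn once at the end instead of being tracked every generation.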
This means recording the entire genetic history of everyone in the population, ever.
It is not clear this is a good idea.
But, with a few tricks…
For example:
Runtime: 8 hours
population split, with either:
We need better understanding and prediction of how
To test theories and fit models, we need simulations with realistic
Edges: who inherits from whom; only necessary for coalescent events.
Records: interval (left, right); parent node; child node.
Nodes: the ancestors those events happen in.
Records: time ago (of birth); ID (implicit).
Mutations: when state changes along the tree.
Records: site it occurred at; node it occurred in; derived state.
Sites: where mutations fall on the genome.
Records: genomic position; ancestral (root) state; ID (implicit).
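The four kinds of records above (edges, nodes, mutations and sites, following tskit's table names) can be sketched as plain Python records. This is an illustration of the layout and not tskit's actual API: the class names, the toy data, and the helper `state_at` are assumptions, and the lookup assumes at most one mutation per node at each site.

```python
from dataclasses import dataclass

@dataclass
class Node:
    time: float            # time ago of birth; the ID is the list index

@dataclass
class Edge:
    left: float            # genomic interval [left, right) of inheritance
    right: float
    parent: int
    child: int

@dataclass
class Site:
    position: float
    ancestral_state: str   # state at the root; the ID is the list index

@dataclass
class Mutation:
    site: int              # site it occurred at
    node: int              # node it occurred in
    derived_state: str

# Hypothetical toy data: samples 0, 1, 2 and two trees along a genome of length 10.
nodes = [Node(0.0), Node(0.0), Node(0.0), Node(1.0), Node(3.0), Node(2.0)]
edges = [
    Edge(0, 10, 3, 0), Edge(0, 10, 3, 1),
    Edge(0, 5, 4, 3), Edge(0, 5, 4, 2),
    Edge(5, 10, 5, 3), Edge(5, 10, 5, 2),
]
sites = [Site(2.0, "A"), Site(7.0, "G")]
mutations = [Mutation(0, 3, "T"), Mutation(1, 2, "C")]

def state_at(sample, site_id):
    """Allele carried by `sample` at a site, found by walking up the local tree."""
    pos = sites[site_id].position
    mutated = {m.node: m.derived_state for m in mutations if m.site == site_id}
    parent_of = {e.child: e.parent for e in edges if e.left <= pos < e.right}
    u = sample
    while u is not None:
        if u in mutated:
            return mutated[u]     # nearest mutation above the sample wins
        u = parent_of.get(u)      # None once we step past the root
    return sites[site_id].ancestral_state

for site_id, site in enumerate(sites):
    alleles = [state_at(s, site_id) for s in (0, 1, 2)]
    print(f"site at {site.position}: {alleles}")
```

Genotypes are never stored directly here: each sample's allele is recovered from the tree covering that position, which is where the storage savings come from when many samples share long stretches of ancestry.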