Peter Ralph, University of Oregon
June 2023
A data structure is…
succinct if it only stores each bit of information once.
descriptive if it reflects the underlying process.
So: let’s think about the process that generated the data!
You have two copies of each autosome, one from each parent.
When you make a gamete, the copies recombine,
and copying errors lead to mutations.
Your two copies of the genome were inherited, noisily, from the two parents,
and from the four grandparents,
and the eight great-grandparents,
and the sixteen great-great-grandparents,
… but, how much from each of them?
Tracing back the ancestry of some chromosomes:
Result: a labeled genealogy containing all the genealogical trees.
For a set of sampled chromosomes, at each position along the genome there is a genealogical tree that says how they are related.
The succinct tree sequence
is a way to succinctly describe this, er, sequence of trees
and the resulting genome sequences.
Who inherits from whom.
Records: interval (left, right); parent node; child node.
The ancestors in which those inheritance events happen.
Records: time ago (of birth); ID (implicit).
When state changes along the tree.
Records: site it occurred at; node it occurred in; derived state.
Where mutations fall on the genome.
Records: genomic position; ancestral (root) state; ID (implicit).
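To make the four kinds of record concrete, here is a minimal plain-Python sketch. The record types correspond to what tskit calls edges, nodes, mutations, and sites, but the class names, the toy data, and the helper functions below are all hypothetical and are not the tskit API. The sketch shows how every sample's state at every site can be recovered by walking up the local tree until a mutation is found.

```python
from dataclasses import dataclass

@dataclass
class Node:
    time: float            # time ago of birth; the ID is the row index

@dataclass
class Edge:
    left: float            # interval [left, right) of the genome
    right: float
    parent: int
    child: int

@dataclass
class Site:
    position: float
    ancestral_state: str   # state at the root; the ID is the row index

@dataclass
class Mutation:
    site: int
    node: int              # the node the mutation occurred in
    derived_state: str

# Hypothetical toy data: samples 0, 1, 2 and ancestors 3, 4.
nodes = [Node(0.0), Node(0.0), Node(0.0), Node(1.0), Node(2.0)]
edges = [
    Edge(0, 10, 3, 0), Edge(0, 10, 3, 1),  # 0 and 1 inherit from 3 everywhere
    Edge(0, 5, 4, 2), Edge(0, 5, 4, 3),    # on [0, 5): 2 and 3 inherit from 4
    Edge(5, 10, 3, 2),                     # on [5, 10): 2 inherits from 3
]
sites = [Site(2.0, "A"), Site(7.0, "G")]
mutations = [Mutation(0, 3, "T"), Mutation(1, 2, "C")]

def parent_at(position, child):
    """Parent of `child` in the tree covering `position`, or None at a root."""
    for e in edges:
        if e.child == child and e.left <= position < e.right:
            return e.parent
    return None

def state(site_id, node):
    """Walk up from `node`; the nearest mutation at this site sets the state."""
    site = sites[site_id]
    while node is not None:
        for m in mutations:
            if m.site == site_id and m.node == node:
                return m.derived_state
        node = parent_at(site.position, node)
    return site.ancestral_state

samples = [0, 1, 2]
genotypes = [[state(s, n) for n in samples] for s in range(len(sites))]
print(genotypes)
```

Each mutation is stored once, on the branch where it happened, and every sample below that branch inherits it without it being written down again.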
The result: an encoding of the genomes and all the genealogical trees.
Genotype matrix: \(O(NM)\)
\(N \times M\) things.
Tree sequence: \(O(N + T + M)\)
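A rough worked example of the difference, under assumptions the slide does not state: here N is taken to be the number of sampled genomes, M the number of sites, and T the number of distinct trees, and all three values are made up. Big-O notation also hides constant factors, so this only illustrates the scaling.

```python
# Hypothetical sizes, chosen only for illustration.
N = 100_000      # sampled genomes (assumed meaning of N)
M = 1_000_000    # sites (assumed meaning of M)
T = 500_000      # distinct trees along the genome (assumed meaning of T)

matrix_entries = N * M            # one entry per genome per site
tree_seq_records = N + T + M      # grows with the sum, not the product

print(matrix_entries)                      # 100000000000
print(tree_seq_records)                    # 1600000
print(matrix_entries / tree_seq_records)   # 62500.0
```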
If we record the tree sequence that relates everyone to everyone else,
after the simulation is over we can put neutral mutations down on the trees.
Since neutral mutations don’t affect demography,
this is equivalent to having kept track of them throughout.
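A minimal sketch of putting neutral mutations down afterwards, assuming a continuous genome and a constant mutation rate per unit of genome length per unit of time. The function names, the tuple layout of the edges, and the toy data are hypothetical. On each recorded edge the number of mutations is Poisson with mean rate × span × branch length, and each mutation is assigned to the child end of that edge.

```python
import random

def poisson(rng, mean):
    """Poisson draw: count unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def drop_mutations(edges, node_time, rate, seed):
    """Return sorted (position, node) pairs for mutations placed on edges."""
    rng = random.Random(seed)
    mutations = []
    for left, right, parent, child in edges:
        branch_length = node_time[parent] - node_time[child]
        expected = rate * (right - left) * branch_length
        for _ in range(poisson(rng, expected)):
            position = left + rng.random() * (right - left)
            mutations.append((position, child))  # falls on the branch above child
    return sorted(mutations)

# Hypothetical recorded genealogy: (left, right, parent, child) edges.
node_time = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0, 4: 2.0}
edges = [(0.0, 10.0, 3, 0), (0.0, 10.0, 3, 1),
         (0.0, 10.0, 4, 2), (0.0, 10.0, 4, 3)]

muts = drop_mutations(edges, node_time, rate=0.5, seed=1)
print(len(muts), "mutations placed")
```

Nothing about the genealogy changes here: the edges and node times are read, never written, which is why this step can wait until the simulation is over.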
This means recording the entire genetic history of everyone in the population, ever.
It is not clear this is a good idea.
Every time an individual is born, we must:
This produces waaaaay too much data.
We won’t end up needing the entire history of everyone ever,
but we won’t know what we’ll need until later.
How do we get rid of the extra stuff?
Simplification.
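A sketch of the idea on a single tree, assuming the tree is given as a child-to-parent map. This is a simplified illustration, not the algorithm tskit uses, which works across all the trees along the genome at once. The function name and toy data are hypothetical. Lineages that lead to no sample are dropped, and ancestors with only one remaining child lineage are skipped over.

```python
def simplify_tree(parent, samples):
    """Reduce a child -> parent map to the history of `samples` only."""
    children = {}
    for c, p in parent.items():
        children.setdefault(p, []).append(c)
    roots = set(parent.values()) - set(parent)
    new_parent = {}

    def visit(node):
        # Returns the retained node standing in for this subtree, or None
        # if no sample lies at or below `node`.
        kept = [k for k in (visit(c) for c in children.get(node, []))
                if k is not None]
        if node in samples or len(kept) >= 2:
            for k in kept:
                new_parent[k] = node
            return node
        return kept[0] if kept else None

    for root in roots:
        visit(root)
    return new_parent

# Hypothetical recorded history: node 3 left no sampled descendants,
# so node 5 and the old root 7 end up with a single lineage below them.
parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: 7}
simplified = simplify_tree(parent, samples={0, 1, 2})
print(simplified)
```

In this toy case nodes 3, 5, and 7 disappear, and sample 2 attaches directly to node 6.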
Everything is efficient, open, and tested.
tskit: tree sequence tools
stdpopsim: a library of “standard” simulation tools
msprime: coalescent simulator
SLiM: forwards evolutionary simulator