Perhaps we are presented with data: genome sequences plus other information about a number of individuals that are usually random samples from some population(s), and we want to learn about their shared history: estimate levels of relatedness between the samples; infer their ancestral genome sequence(s); or identify genomic locations subject to selection. Or, perhaps we want to compute the levels of genetic diversity predicted under a certain population model. In either case, we must understand the relationship between process parameters (e.g. migration rates, selection coefficients) and the observed genomic patterns of dissimilarity.
This discussion starts out from the unconventional direction of treating the usual population genetic quantities as summary statistics of the (unobserved) pedigree–with–recombination, the “ancestral recombination graph”, or ARG. Often, things like “coalescent time” are defined only in the context of mathematical models, but they can equally well be thought of as descriptive statistics, whose expected values we can compute under certain models. This seems to me at least conceptually useful when thinking about such quantities as estimated from real data, where assumptions such as random mating are rarely met.
Some other discussions that take this point of view to at least some degree are: , OTHERS. Other good references for coalescent theory are Hudson (3) (a nice article-length review), and the books Wakeley (4) (fairly gentle) and Ewens (1) (fairly mathematical, and covers more general population genetics).
First, an outline of how we model diploid reproduction, i.e. how the autosomal genome of an offspring gets assembled from those of the parents. There are exceptions (of course), but mostly, each organism has two copies of each autosomal chromosome, one from mom and one from dad. Offspring are the union of two gametes (an egg and a sperm), and each gamete is produced by a single diploid cell:
duplicating each chromosome (by DNA replication, before meiosis)
recombining the homologous copies
segregating these copies among four daughter cells.
Sometimes all four daughter cells become gametes; sometimes (e.g. female meiosis in mammals) only one will.
Recombination is complicated, varies along the genome, and is affected by genomic factors and motifs. To make it tractable, we assume that recombination breakpoints occur as a Poisson point process with a constant rate of 2 breakpoints per unit of length, and that each breakpoint is a crossover between a randomly chosen maternal and a randomly chosen paternal chromosome. Since each breakpoint involves two of the four chromosome copies, this produces an average of 1 crossover per chromosome per unit length, i.e. we are measuring length in units of genetic map length (Morgans).
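The recombination model just described can be checked by simulation. The following is a minimal sketch, not a definitive implementation: the function names, the indexing of the four chromosome copies (0 and 1 maternal, 2 and 3 paternal), and the use of exponential gaps to generate the Poisson process are all illustrative assumptions.

```python
import random

def sample_breakpoints(length_morgans, rate, rng):
    """Sorted points of a Poisson point process on [0, length_morgans)."""
    points = []
    x = rng.expovariate(rate)  # gaps between points are exponential
    while x < length_morgans:
        points.append(x)
        x += rng.expovariate(rate)
    return points

def crossovers_per_chromatid(length_morgans, rng):
    """Breakpoint lists for the four copies: indices 0, 1 maternal; 2, 3 paternal."""
    per_chromatid = [[], [], [], []]
    for x in sample_breakpoints(length_morgans, 2.0, rng):
        maternal = rng.randrange(2)      # one of the two maternal copies
        paternal = 2 + rng.randrange(2)  # one of the two paternal copies
        per_chromatid[maternal].append(x)
        per_chromatid[paternal].append(x)
    return per_chromatid

def mean_crossovers(length_morgans, n_meioses, seed=1):
    """Average number of crossovers per chromosome copy over many meioses."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_meioses):
        copies = crossovers_per_chromatid(length_morgans, rng)
        total += sum(len(b) for b in copies)
    return total / (4 * n_meioses)
```

With a chromosome of length 1, `mean_crossovers(1.0, 20000)` should come out close to 1, since each of the 2 expected breakpoints touches two of the four copies.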
In a major concession to mathematical convenience, we’ll simply model mutation as another Poisson point process: suppose that each gamete differs from its progenitor chromosome at the points of a Poisson point process. This is the “infinite sites model”, which assumes that mutation cannot hit the same location twice. Usually we assume the point process of mutations has a constant mean rate per unit of map length, but in reality this rate varies along the genome.
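As a rough illustration of the infinite sites assumption, the sketch below places new mutations at the points of a Poisson process on a continuous interval, so that two mutations essentially never land at the same position. The names `sample_mutations` and `mutate`, and the representation of a chromosome as a list of derived-site positions, are hypothetical choices made for this example.

```python
import random

def sample_mutations(length, rate, rng):
    """Positions of new mutations: a Poisson point process on [0, length)."""
    positions = []
    x = rng.expovariate(rate)
    while x < length:
        positions.append(x)
        x += rng.expovariate(rate)
    return positions

def mutate(parent_sites, length, rate, rng):
    """Derived sites carried by a gamete: inherited sites plus new mutations."""
    new_sites = sample_mutations(length, rate, rng)
    return sorted(set(parent_sites) | set(new_sites))
```

Under this representation, the number of sites at which two chromosomes differ is simply the size of the symmetric difference of their site sets.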
Once four gametes are produced, it remains to be decided which of the four produce the offspring. We assume that one is chosen uniformly at random, independently of the result of recombination. (Again, there are exceptions.)
The pedigree of a set of individuals is a graph describing all parent-offspring relationships between these and some set of their ancestors. The population pedigree describes all such relationships between all individuals, living and possibly dead. This records mate choice, but omits the important information of recombination and segregation, i.e. which parts of each chromosome derive from which of the parents’ two homologous copies. Adding this information to the pedigree gives what is known as the ancestral recombination graph, or ARG. There are a number of ways to formalize this notion; see (2; 3) for discussions. One way simply notes that the relatedness structure at any particular locus is a treelike subset of the pedigree, and that the collection of these trees (one for each base in the genome, say) is sufficient to reconstruct the entire history of inheritance, recombination and segregation. Alternatively, one can annotate each link in the pedigree with a two-valued labeling of the genome, denoting which segments of the parent’s maternal and paternal chromosomal copies were passed down along that link. See figure 1 for a partial depiction.
Perhaps the simplest thing we can obtain from the full ancestral recombination graph is the typical degree of relatedness of pairs of individuals. More concretely, we might ask for the empirical distribution of pairwise times back to the most recent common ancestor across all pairs of chromosomes and all loci: the distribution of the length of a randomly chosen one of these paths.
A more precise way of formulating this is as follows: pick two random chromosomes and a random locus; follow the lineages of the two chromosomes at that locus up through the pedigree until their common ancestor; one-half the number of meioses encountered is the quantity of interest. Since these two lineages will henceforth move together if followed further back through the pedigree, this quantity is known as the “coalescence time” of the two lineages.
In this way, the phrase “coalescence time” is shorthand for “number of generations back to the most recent common ancestor”, taken as a random quantity across random samples of sets of chromosomes and/or loci. In this formulation, it is the empirical distribution of lengths of a certain set of paths to common ancestors through the pedigree.
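To make the descriptive-statistic view concrete, here is a small sketch that computes coalescence times at a single locus. It assumes a hypothetical representation in which `parent` maps each chromosome to the parental chromosome it inherited that locus from (or to `None` for the oldest recorded ancestors); this data layout is an illustration, not something fixed by the text.

```python
from collections import Counter
from itertools import combinations

def lineage(chrom, parent):
    """Chromosomes visited when following one lineage back through the pedigree."""
    path = [chrom]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path

def coalescence_time(a, b, parent):
    """Half the number of meioses on the path from a up to the MRCA and down to b."""
    steps_from_a = {c: i for i, c in enumerate(lineage(a, parent))}
    for steps_from_b, c in enumerate(lineage(b, parent)):
        if c in steps_from_a:
            return (steps_from_a[c] + steps_from_b) / 2
    return None  # no common ancestor in the recorded pedigree

def pairwise_distribution(samples, parent):
    """Empirical distribution of coalescence times over all pairs of samples."""
    return Counter(coalescence_time(a, b, parent)
                   for a, b in combinations(samples, 2))

# A toy locus: b and c share parent p2; p1 and p2 share parent g.
parent = {"g": None, "p1": "g", "p2": "g", "a": "p1", "b": "p2", "c": "p2"}
print(pairwise_distribution(["a", "b", "c"], parent))
```

Averaging such counts over many loci and many pairs gives the empirical distribution described above, with no population model involved.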
The path back through the pedigree along which these two chromosomes have inherited that particular locus is a very simple tree, with two leaves (at the samples) and a root at their most recent common ancestor; the height of that tree is the coalescence time. More generally, the ancestral recombination graph encodes, at each locus, the marginal tree along which any set of sampled individuals have inherited that locus. These are called “gene trees”. Each gene tree follows a path through the links of the pedigree, and if, due to recombination during one of the meioses, two loci are inherited from different parental chromosomes, the marginal gene trees at those two loci will differ.
Now, a formal definition for the so far obvious-yet-vague “ancestral recombination graph”. This records all relationships between both chromosomes of all individuals that ever lived in the population: so, let $\mathcal{C}_t$ denote those chromosomes that were alive at time $t$, and let $\mathcal{C} = \bigcup_t \mathcal{C}_t$ denote all chromosomes in our universe. These are furthermore grouped together into individuals: $\mathcal{I}_t$ is a partition of $\mathcal{C}_t$ into pairs, and likewise $\mathcal{I}$ is a partition of $\mathcal{C}$ into pairs. The relationships occur if two chromosomes in a diploid cell undergo recombination and meiosis, producing one chromosome for an offspring. So, if $(a, b)$ is a pair of chromosomes in $\mathcal{I}_t$, and they recombine at locations $x_1 < x_2 < \cdots < x_n$ to produce chromosome $c$, then we say that the meiosis $(a, b, x, c)$ has occurred. As a matter of convention, say that $c$ has inherited from $a$ on the odd intervals $[x_0, x_1)$, $[x_2, x_3)$, etcetera, and has inherited from $b$ on the remaining intervals, and that $x_0 = 0$ and $x_{n+1} = G$, where $G$ is the length of the chromosome. Let $\mathcal{M}_t$ be the meioses occurring at $t$, and $\mathcal{M} = \bigcup_t \mathcal{M}_t$. To reiterate, if $(a, b, x, c) \in \mathcal{M}_t$, then $c$ must be in an individual birthed at $t$ from living parents; i.e. $c \in \mathcal{C}_t$ but $c \notin \mathcal{C}_s$ for any $s < t$; and $(a, b) \in \mathcal{I}_t$. Note that it is natural to identify each chromosome with the meiosis that produced it.
Since $\mathcal{M}$ contains all the information about the population pedigree as well as how recombination has acted within the pedigree, we refer to $\mathcal{M}$ as “the ancestral recombination graph” (or, the ARG).
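One possible way to hold this structure in code is sketched below. The class name `Meiosis`, its field names, and the dictionary keyed by child chromosome are all assumptions made for illustration. The interval convention used here (first, third, ... intervals from the first parent listed) is one reading of the convention stated above.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Meiosis:
    first_parent: str                # chromosome inherited on the 1st, 3rd, ... intervals
    second_parent: str               # chromosome inherited on the remaining intervals
    breakpoints: Tuple[float, ...]   # sorted recombination locations
    child: str                       # chromosome produced by this meiosis
    time: int                        # time at which the meiosis occurred

def parent_at(meiosis, x):
    """Parental chromosome from which the child inherited position x."""
    # Intervals are half-open, [x_k, x_{k+1}), so count breakpoints <= x.
    k = bisect_right(meiosis.breakpoints, x)
    return meiosis.first_parent if k % 2 == 0 else meiosis.second_parent

# Identify each chromosome with the meiosis that produced it.
arg = {"c": Meiosis("a", "b", (0.2, 0.7), "c", 1)}
print(parent_at(arg["c"], 0.5))
```

Keying the records by child chromosome reflects the remark that each chromosome can be identified with the meiosis that produced it.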
The population pedigree is just the information about who was whose parents; we denote this by . The ARG carries the additional information of recombination locations, which is equivalent to knowing the population gene tree at each location on the chromosome.
For instance, let $T(x)$ be the gene tree for a set of samples at position $x$ along the genome. This is the minimal acyclic graph (a tree) whose nodes are chromosomes (or, equivalently, meioses), that contains the sampled chromosomes, and such that if a chromosome $c$ is in $T(x)$, then so is the parent of $c$ from which $c$ has inherited at $x$. These trees change whenever a breakpoint is encountered in any of the constituent meioses. Formally, define $B(c)$ to be the set of recombination breakpoints of the meiosis that created $c$; then the next point to the right of $x$ at which the tree changes is
$$ \min \{ y > x : y \in B(c) \text{ for some } c \in T(x) \}. \tag{1} $$
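The rule for finding the next change point can be sketched as follows. This assumes a hypothetical encoding in which `arg` maps each chromosome to a tuple `(first_parent, second_parent, breakpoints)` for the meiosis that produced it, and chromosomes absent from `arg` are treated as founders with no recorded parents. The same half-open interval convention as above is assumed.

```python
from bisect import bisect_right

def parent_at(record, x):
    """Parent from which a chromosome inherited position x."""
    first, second, breakpoints = record
    return first if bisect_right(breakpoints, x) % 2 == 0 else second

def tree_nodes(samples, x, arg):
    """All chromosomes ancestral to the samples at position x."""
    nodes = set()
    stack = list(samples)
    while stack:
        c = stack.pop()
        if c in nodes:
            continue
        nodes.add(c)
        if c in arg:  # founders have no recorded meiosis
            stack.append(parent_at(arg[c], x))
    return nodes

def next_change(samples, x, arg):
    """Smallest breakpoint to the right of x among meioses in the tree at x."""
    candidates = [b for c in tree_nodes(samples, x, arg) if c in arg
                  for b in arg[c][2] if b > x]
    return min(candidates) if candidates else None

arg = {
    "s1": ("p1", "p2", (0.3,)),
    "s2": ("p3", "p4", (0.6,)),
    "p1": ("g1", "g2", (0.5,)),
    "p2": ("g1", "g2", (0.1,)),
}
print(next_change(["s1", "s2"], 0.0, arg))
```

Calling `next_change` repeatedly, starting from the left end of the chromosome, walks through the sequence of positions at which a breakpoint is encountered. Note that the breakpoint at 0.5 in `p1` is never visited from 0.3 onward, because `s1` no longer inherits from `p1` there.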
This notation, so far, describes the facts: actual relationships that have already occurred in a population, many of which may be unobservable. To make inferences, we will need to put models on certain of these processes: mutation, recombination, mate choice, and/or offspring number.
When two chromosomes share a common ancestor, we like to say that that ancestor lived some number of “generations” in the past. For most organisms, the notion of a generation is statistical, rather than a fixed quantity. What we actually care about is the number of meioses separating the two chromosomes – so, we hereby define the length of a path through the pedigree in generations as one-half the number of meioses. In fact, in the presence of inbreeding, it is possible for two chromosomes to have inherited different genomic regions from the same ancestor along different paths through the pedigree, which may have different lengths!
We will mostly work in the infinite sites model of mutation; we do this basically so that we can keep track of how many mutations have occurred at each locus, which although unrealistic helps greatly in the analysis. We also usually assume that mutation rates are homogeneous along the genome. This is clearly not correct, but a very good approximation over the right scales.
Here we define $\mu$ to be the (“discrete”) mutation rate per generation per base: the probability that a given base differs from the homologous base in the parent it was inherited from. We will sometimes find it convenient to use $\tilde{\mu} = -\log(1 - \mu)$, so that the probability of no mutation across $n$ meioses is $(1 - \mu)^n = e^{-n \tilde{\mu}}$.
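The relationship between the two rates can be checked numerically. The sketch below assumes the convenient rate is the one obtained by taking minus the logarithm of one minus the per-base mutation probability, which is the choice that makes the probability of no mutation exactly exponential in the number of meioses. The function names are hypothetical.

```python
import math

def continuous_rate(mu):
    """Rate whose exponential matches the discrete per-meiosis probability mu."""
    return -math.log1p(-mu)  # -log(1 - mu), computed stably for small mu

def prob_no_mutation(mu, n):
    """Probability that a base is unchanged across n independent meioses."""
    return (1.0 - mu) ** n

mu, n = 1e-3, 50
print(prob_no_mutation(mu, n), math.exp(-n * continuous_rate(mu)))
```

For realistic per-base mutation probabilities the two rates are nearly equal, so the distinction matters mainly for keeping formulas exact.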