First, we fix some notation. For a sample of individuals indexed by some set $\mathcal{I}$, genotyped at a set of genomic positions indexed by $\mathcal{L}$, the data are $G = \{(G^m_{i\ell}, G^p_{i\ell}) : i \in \mathcal{I}, \ell \in \mathcal{L}\}$, i.e. $G^m_{i\ell}$ is the allele that individual $i$ inherited at position $\ell$ from her mother, and $G^p_{i\ell}$ is the corresponding allele inherited from her father.
Regardless of the process that has generated $G$, it makes sense to think about the sampling distribution of $G$ and of associated statistics, i.e. the distribution of $G$ induced by some sort of random sampling of the individuals. Often, we can actually obtain from $G$ a good estimate of the entire sampling distribution. For instance, we can estimate the distribution of the number of nucleotide differences between two individuals in a 100bp region, across all such regions and all pairs of sampled individuals, as long as $\mathcal{I}$ can be reasonably regarded as a random sample from some population. We can further estimate conditional sampling distributions, e.g. the number of such differences as a function of the geographical distance between the two individuals, or in protein-coding regions.
Here we relate the sampling distributions of a number of statistics easily computable from $G$ to sampling distributions of properties of the pedigree with recombination.
We will study aspects of the distributions of these statistics under various “levels of sampling”:
1. Unconditional, including averaging over the population process.
2. Conditional on the population pedigree ($\mathcal{P}$), and averaging over recombinations, segregation, and mutations.
3. Conditional on the population ARG ($\mathcal{A}$), averaging over choices of individuals.
4. Conditional on the population ARG ($\mathcal{A}$), averaging over choices of locus.
Only the first needs a population model. The second is clearly fictitious, but can be useful (as we see below). The latter two can be interpreted as empirical distributions. Often, in practice, we have an empirical distribution obtained by averaging across many loci (as in the last point), and compare it to a theoretical distribution for a single locus under a population model. This is in principle wrong, since it ignores correlations between loci introduced by the pedigree, but in practice it seems to work well (2).
As a first example, we can relate the mean and variance of the number of segregating sites in a sample to the distribution of total time in the sample tree, similar to the calculation of the number of mutations in (1). The number of segregating sites, $S$, in a sample of $n$ chromosomes at the loci $\mathcal{L}$, is the number of sites at which at least two of the sampled chromosomes differ. For this to happen, there must have been a mutation at that site somewhere on the tree that relates the samples.
Concretely, suppose that we have sampled $n$ chromosomes, and that (as above) the tree relating these samples at genomic position $\ell$ is $\tau_\ell$. We measure the “length” of the tree in meioses, and denote by $T_\ell$ the total number of meioses in the tree, up until the most recent common ancestor. Also, let $N_\ell$ denote the total number of mutations that have occurred at site $\ell$ during any of the meioses anywhere in $\tau_\ell$. Under the usual assumption on mutations, this number is Poisson distributed with mean $\mu T_\ell$, where $\mu$ is the mutation rate per site per meiosis.
Now let $L$ denote a randomly chosen locus, $N = N_L$, and $T = T_L$. It is straightforward that
\begin{equation}
\mathbb{E}[N] = \mu \, \mathbb{E}[T], \tag{1}
\end{equation}
and using the formula for conditional partitioning of variance,
\begin{align}
\operatorname{var}[N] &= \mathbb{E}\big[\operatorname{var}[N \mid T]\big] + \operatorname{var}\big[\mathbb{E}[N \mid T]\big] \tag{2} \\
&= \mu \, \mathbb{E}[T] + \mu^2 \operatorname{var}[T]. \tag{3}
\end{align}
Note that this remains true if depends on .
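Now, if each mutation falls at a distinct site, so that $S = \sum_{\ell \in \mathcal{L}} N_\ell$ counts the segregating sites, then by linearity of expectation,
\begin{equation}
\mathbb{E}[S] = \sum_{\ell \in \mathcal{L}} \mathbb{E}[N_\ell] = \mu \sum_{\ell \in \mathcal{L}} \mathbb{E}[T_\ell]. \tag{4}
\end{equation}
Similarly, by conditioning on the collection of trees $\tau = \{\tau_\ell\}_{\ell \in \mathcal{L}}$, since the mutation counts $N_\ell$ and $N_{\ell'}$ are conditionally independent given the tree lengths $T_\ell$ and $T_{\ell'}$,
\begin{align}
\operatorname{var}[S] &= \mathbb{E}\big[\operatorname{var}[S \mid \tau]\big] + \operatorname{var}\big[\mathbb{E}[S \mid \tau]\big] \tag{5} \\
&= \mathbb{E}\Big[\sum_{\ell} \operatorname{var}[N_\ell \mid T_\ell]\Big] + \operatorname{var}\Big[\sum_{\ell} \mathbb{E}[N_\ell \mid T_\ell]\Big] \tag{6} \\
&= \mathbb{E}\Big[\sum_{\ell} \mu T_\ell\Big] + \operatorname{var}\Big[\sum_{\ell} \mu T_\ell\Big] \tag{7} \\
&= \mu \sum_{\ell} \mathbb{E}[T_\ell] + \mu^2 \operatorname{var}\Big[\sum_{\ell} T_\ell\Big] \tag{8} \\
&= \mu \sum_{\ell} \mathbb{E}[T_\ell] + \mu^2 \sum_{\ell} \sum_{\ell'} \operatorname{cov}[T_\ell, T_{\ell'}] \tag{9} \\
&= \mathbb{E}[S] + \mu^2 \sum_{\ell} \sum_{\ell'} \operatorname{cov}[T_\ell, T_{\ell'}]. \tag{10}
\end{align}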
This agrees with Hudson (1) if we are restricted to a single, nonrecombining locus.
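The mean and variance of the number of mutations at a random locus can be checked numerically. The following is a minimal Python sketch, not the authors' code: it assumes a constant per-meiosis mutation rate `mu`, treats a list of per-locus tree lengths as the entire distribution of tree length (i.e. it averages over loci with the trees held fixed), and compares the law-of-total-variance formula with brute-force moments of the corresponding Poisson mixture. The function names and example values are hypothetical.

```python
import math
from statistics import fmean, pvariance


def mutation_count_moments(tree_lengths, mu):
    """Mean and variance of the mutation count N at a uniformly chosen locus.

    tree_lengths: total tree length (in meioses) at each locus.
    mu: per-site, per-meiosis mutation rate (assumed constant across loci).
    """
    mean_t = fmean(tree_lengths)
    var_t = pvariance(tree_lengths)  # variance of T across loci
    mean_n = mu * mean_t
    var_n = mu * mean_t + mu ** 2 * var_t  # law of total variance
    return mean_n, var_n


def poisson_mixture_moments(tree_lengths, mu, kmax=200):
    """Brute-force mean and variance of N by summing the mixture pmf."""
    m1 = m2 = 0.0
    for t in tree_lengths:
        lam = mu * t
        p = math.exp(-lam)  # P(N = 0 | T = t)
        for k in range(kmax + 1):
            m1 += k * p
            m2 += k * k * p
            p *= lam / (k + 1)  # step to P(N = k + 1 | T = t)
    m1 /= len(tree_lengths)
    m2 /= len(tree_lengths)
    return m1, m2 - m1 * m1


if __name__ == "__main__":
    lengths = [10, 20, 60]
    print(mutation_count_moments(lengths, 0.01))
    print(poisson_mixture_moments(lengths, 0.01))
```

The two functions should agree up to floating-point error whenever `kmax` is large compared with `mu` times the longest tree.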
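The “observed heterozygosity” in a group of individuals $\mathcal{I}$ in a genomic region $\mathcal{L}$ is the probability that a randomly chosen individual is heterozygous at a randomly chosen nucleotide, or
\begin{equation}
H = \frac{1}{|\mathcal{L}|\,|\mathcal{I}|} \sum_{i \in \mathcal{I}} \sum_{\ell \in \mathcal{L}} \mathbf{1}\big[G^m_{i\ell} \neq G^p_{i\ell}\big], \tag{11}
\end{equation}
where $G^m_{i\ell}$ and $G^p_{i\ell}$ are the maternally and paternally inherited alleles of individual $i$ at site $\ell$, $|\mathcal{L}|$ denotes the total number of loci, and $|\mathcal{I}|$ denotes the total number of individuals.
In other words, $H$ is the proportion of homologous alleles that differ from each other, across $\mathcal{L}$ and across $\mathcal{I}$. By calling them “homologous” we assume they share a common ancestor; so if they differ, a mutation must have occurred since that common ancestor. Take a single individual $i$, suppose that the chance of a mutation occurring at site $\ell$ in a particular meiosis is $\mu$, and that there have been $t$ generations since the common ancestor of the maternal and paternal copies. The probability that there have been no mutations since that time is $(1-\mu)^{2t}$, since there are $2t$ meioses separating the two. The proportion of heterozygous sites is determined by the empirical distribution of times back to the most recent common ancestor of paired homologous sites, averaged across sites and across individuals. Let $F$ denote this distribution, i.e. $F(t) = \frac{1}{|\mathcal{I}|\,|\mathcal{L}|} \#\{(i,\ell) : t_{i\ell} \le t\}$, where $t_{i\ell}$ is the number of generations back to the common ancestor of the two copies carried by individual $i$ at site $\ell$. If, as assumed, there is no back mutation, then, averaging over mutations,
\begin{align}
\mathbb{E}[H] &= \frac{1}{|\mathcal{I}|\,|\mathcal{L}|} \sum_{i \in \mathcal{I}} \sum_{\ell \in \mathcal{L}} \Big(1 - (1-\mu)^{2 t_{i\ell}}\Big) \tag{12} \\
&= 1 - \int (1-\mu)^{2t} \, dF(t) \tag{13} \\
&\approx 1 - \int e^{-2\mu t} \, dF(t). \tag{14}
\end{align}
Note that $F$, if it were observable, would be a good (albeit complicated) summary statistic of the pedigree, and depends implicitly on the choice of individuals $\mathcal{I}$ and the choice of genomic region $\mathcal{L}$. If $\mu$ is small, then $\mathbb{E}[H] \approx 2\mu \int t \, dF(t)$, i.e. the proportion of sites at which an individual is heterozygous is approximately the mutation rate multiplied by the average number of meioses separating the maternal and paternal chromosomes (twice the average time back to their common ancestor).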
Similarly,
(15) |
This implies that the observed heterozygosity is an estimator of a statistic of the empirical distribution of pairwise coalescence times, and that we can put some explicit bounds on how good this estimator is.
As stated, the observed heterozygosity $H$ is a single number, the chance that a randomly chosen homologous pair of alleles differ. This averages over levels of relatedness of different individuals, as well as mutation rates and depths of relatedness that may differ systematically across loci. If we know local mutation rates, and partition sites according to them, then we can estimate $H$ as a function of the mutation rate $\mu$, obtaining an estimate of the Laplace transform of the distribution $F$ of times back to the common ancestor.
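To make the link between heterozygosity and coalescence times concrete, here is a small Python sketch under stated assumptions: mutation occurs at rate `mu` per site per meiosis with no back mutation, and a hypothetical list `times` holds the number of generations back to the common ancestor for each (individual, site) pair, so two copies separated by `t` generations are separated by `2 * t` meioses. The function names are illustrative.

```python
import math
from statistics import fmean


def expected_heterozygosity(times, mu):
    """1 minus the average of (1 - mu)^(2 t) over coalescence times t."""
    return 1.0 - fmean([(1.0 - mu) ** (2 * t) for t in times])


def expected_heterozygosity_exp(times, mu):
    """Exponential approximation: 1 minus the average of exp(-2 mu t)."""
    return 1.0 - fmean([math.exp(-2.0 * mu * t) for t in times])


def small_mu_approximation(times, mu):
    """First-order approximation: 2 mu times the mean coalescence time."""
    return 2.0 * mu * fmean(times)


if __name__ == "__main__":
    times = [100, 200, 300]
    for mu in (1e-8, 1e-6, 1e-4):
        print(mu,
              expected_heterozygosity(times, mu),
              expected_heterozygosity_exp(times, mu),
              small_mu_approximation(times, mu))
```

Evaluating `expected_heterozygosity_exp` over a grid of `mu` values traces out one minus the Laplace transform of the empirical distribution of `times`, evaluated at `2 * mu`.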
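Also known as “expected heterozygosity”, the next statistic, $\pi$, is the chance that two randomly chosen alleles from $\mathcal{I}$ at a random site in $\mathcal{L}$ differ:
\begin{equation}
\pi = \frac{1}{|\mathcal{L}| \binom{2|\mathcal{I}|}{2}} \sum_{\ell \in \mathcal{L}} \sum_{\{x, y\}} \mathbf{1}\big[x_\ell \neq y_\ell\big], \tag{16}
\end{equation}
where the inner sum is over unordered pairs of distinct chromosomes among the $2|\mathcal{I}|$ sampled, and $x_\ell$ is the allele carried by chromosome $x$ at site $\ell$.
$\pi$, like $H$, is computable from the distribution of the number of generations available for mutation, where the relevant number of generations here is defined to be $T_\pi$. Concretely, $T_\pi$ is the number of generations back to the common ancestor at a uniformly chosen locus between two uniformly chosen chromosomes in the population (possibly, but not necessarily, in the same individual). Again,
\begin{equation}
\mathbb{E}[\pi] = 1 - \mathbb{E}\big[(1-\mu)^{2 T_\pi}\big] \approx 1 - \mathbb{E}\big[e^{-2\mu T_\pi}\big]. \tag{17}
\end{equation}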
Such measures of heterozygosity can measure not only within-group diversity but also between-group divergence, by computing e.g. the probability that two randomly chosen individuals in different subpopulations differ at a randomly chosen locus. Any such measurement can be thought of as the proportion of some subset of paths through the pedigree along which a mutation has occurred; (crucially) assuming that the mutation process is independent of inheritance, this probability of mutation depends only on the number of meioses along the path, and hence on the distribution of path lengths. Above, these distributions of lengths across certain sets of paths through the pedigree appeared in the expressions for the observed and the expected heterozygosity.
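Both heterozygosities are simple to compute from phased genotypes. The sketch below assumes a hypothetical input format, a list of `(maternal, paternal)` pairs of equal-length allele strings, one pair per individual; the function names are illustrative, and pairs of chromosomes are drawn without replacement for the expected heterozygosity.

```python
from itertools import combinations


def observed_heterozygosity(genotypes):
    """Proportion of (individual, site) pairs at which the two copies differ."""
    n_sites = len(genotypes[0][0])
    het = sum(a != b for mat, pat in genotypes for a, b in zip(mat, pat))
    return het / (n_sites * len(genotypes))


def expected_heterozygosity_from_data(genotypes):
    """Proportion of (chromosome pair, site) combinations that differ."""
    chromosomes = [c for pair in genotypes for c in pair]
    n_sites = len(chromosomes[0])
    pairs = list(combinations(chromosomes, 2))  # unordered, distinct pairs
    diffs = sum(a != b for x, y in pairs for a, b in zip(x, y))
    return diffs / (n_sites * len(pairs))


if __name__ == "__main__":
    # Two fully homozygous individuals that differ from each other at one site.
    genotypes = [("ACGT", "ACGT"), ("ACGA", "ACGA")]
    print(observed_heterozygosity(genotypes))
    print(expected_heterozygosity_from_data(genotypes))
```

In the toy example the two statistics disagree: no individual is heterozygous, yet chromosomes drawn from different individuals differ, which is the kind of contrast between within-individual and between-individual paths described above.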
Mutations at a locus induce a partition of a set of chromosomes into classes of chromosomes that are identical at that locus. Heterozygosities are pairwise statistics; when comparing two chromosomes there are only two possible results: identical or not. When looking at larger samples, any partition is possible; at loci with no more than two alleles, all dichotomous partitions are possible.
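Suppose we are looking at the empirical distribution of allele frequencies in a sample of size $n$ at biallelic sites, including only the polymorphic sites, i.e. the sites where more than one allele is seen in the sample. This is called the “allele frequency spectrum”, or “site frequency spectrum”. Let $n_0(\ell)$ and $n_1(\ell)$ denote the numbers of sampled chromosomes that have the ‘0’ and ‘1’ alleles at site $\ell$, respectively, and define $p(k)$ to be the number of sites at which the allele ‘1’ is at frequency $k/n$, or
\begin{equation}
p(k) = \sum_{\ell \in \mathcal{L}} \mathbf{1}\big[n_1(\ell) = k\big], \qquad 1 \le k \le n-1, \tag{18}
\end{equation}
and the “unfolded” and “folded” allele frequency spectra
\begin{align}
\hat{p}(k) &= \frac{p(k)}{\sum_{j=1}^{n-1} p(j)}, \qquad 1 \le k \le n-1, \tag{20} \\
\bar{p}(k) &= \hat{p}(k) + \hat{p}(n-k), \qquad 1 \le k < n/2, \tag{21}
\end{align}
with $\bar{p}(n/2) = \hat{p}(n/2)$ if $n$ is even.
If we have some way of polarizing mutations, so that e.g. allele ‘0’ is more likely to be the ancestral allele, then the unfolded spectrum is more useful; otherwise, if the choice of allele labeling is arbitrary, we expect $p(k) \approx p(n-k)$ and the folded spectrum is more natural.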
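As an illustration of these definitions, here is a minimal Python sketch that tabulates the spectrum from a 0/1 haplotype matrix. The input format (a list of equal-length lists of 0/1 alleles, one per sampled chromosome), the normalisation by the number of polymorphic sites, and the function names are assumptions made for the example; at least one polymorphic site is assumed.

```python
from collections import Counter


def site_frequency_counts(haplotypes):
    """Number of polymorphic sites at which allele 1 is carried by k chromosomes."""
    n = len(haplotypes)
    counts = Counter()
    for column in zip(*haplotypes):  # iterate over sites
        k = sum(column)
        if 0 < k < n:  # keep polymorphic sites only
            counts[k] += 1
    return {k: counts.get(k, 0) for k in range(1, n)}


def unfolded_spectrum(counts):
    """Counts normalised to proportions of polymorphic sites."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def folded_spectrum(counts):
    """Spectrum of minor-allele counts: classes k and n - k are merged."""
    n = len(counts) + 1
    total = sum(counts.values())
    folded = {}
    for k in range(1, n // 2 + 1):
        c = counts[k] if k == n - k else counts[k] + counts[n - k]
        folded[k] = c / total
    return folded


if __name__ == "__main__":
    haplotypes = [
        [0, 1, 1, 0, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1],
        [0, 0, 0, 0, 1],
    ]
    counts = site_frequency_counts(haplotypes)
    print(counts, unfolded_spectrum(counts), folded_spectrum(counts))
```

The first and last sites in the example are monomorphic in the sample and so are excluded before normalising.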
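Now, we’ll compute the mean and variance of $p(k)$, the number of sites at which allele ‘1’ is carried by $k$ of the $n$ sampled chromosomes, conditional on the ARG $\mathcal{A}$ and averaging across the mutation process. Let $\tau_\ell$ be the gene tree at site $\ell$, and let $T_\ell(k)$ be the total length of branches in $\tau_\ell$ subtended by exactly $k$ tips, so that $T_\ell = \sum_{k=1}^{n-1} T_\ell(k)$ is the total length of the tree (see figure 1). Again assuming that the mutation process is independent of inheritance, the probability that site $\ell$ has no segregating mutation is $e^{-\mu T_\ell}$, and the probability that only a single segregating mutation has occurred is $\mu T_\ell e^{-\mu T_\ell}$. Given that only a single mutation has occurred, the location of that mutation is uniform on the tree, and so the probability that a mutation occurs at frequency $k$ at site $\ell$ is
\begin{equation*}
\mu T_\ell e^{-\mu T_\ell} \cdot \frac{T_\ell(k)}{T_\ell} = \mu T_\ell(k) \, e^{-\mu T_\ell}.
\end{equation*}
Therefore,
\begin{align}
\mathbb{E}[p(k) \mid \mathcal{A}] &= \sum_{\ell \in \mathcal{L}} \mu T_\ell(k) \, e^{-\mu T_\ell} \tag{22} \\
&\approx \mu \sum_{\ell \in \mathcal{L}} T_\ell(k), \tag{23}
\end{align}
and since given the marginal gene trees, the mutation processes at each site are assumed to be independent,
\begin{equation}
\operatorname{var}[p(k) \mid \mathcal{A}] = \sum_{\ell \in \mathcal{L}} \mu T_\ell(k) \, e^{-\mu T_\ell} \Big(1 - \mu T_\ell(k) \, e^{-\mu T_\ell}\Big). \tag{24}
\end{equation}
The variance is thus no larger than the mean, so if the number of mutations in the region under consideration is large, then $p(k)$ is well-approximated by $\mu$ times the sum of the appropriate edges in the trees:
\begin{equation}
p(k) \approx \mu \sum_{\ell \in \mathcal{L}} T_\ell(k). \tag{25}
\end{equation}
Consequently, the normalized spectrum $\hat{p}(k) = p(k) / \sum_{j} p(j)$ satisfies
\begin{equation}
\hat{p}(k) \approx \frac{\sum_{\ell \in \mathcal{L}} T_\ell(k)}{\sum_{\ell \in \mathcal{L}} T_\ell}. \tag{26}
\end{equation}
The approximation holds if the number of sites at which two or more mutations have occurred is small, and if the total number of segregating sites is large. In other words, $\hat{p}(k)$ gives, to good approximation, the chance that a genetic ancestor, chosen uniformly among those genetic ancestors of some (but not all) of the sample, is a genetic ancestor to exactly $k$ of the sampled chromosomes.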
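The relationship between branch lengths and the frequency spectrum can be sketched in a few lines of Python. Everything about the representation is an assumption for illustration: a tree is a nested `(branch_length_above_node, children)` pair with lengths in meioses, leaves have an empty child list, and `mu` is a constant per-site, per-meiosis mutation rate. The expected counts follow the single-mutation argument above (the chance of exactly one mutation, times the share of tree length subtending `k` tips).

```python
import math


def branch_length_by_tip_count(tree, n):
    """Total branch length subtending exactly k tips, for k = 1..n-1.

    Returns a list of length n; index 0 is unused.
    """
    lengths = [0.0] * n

    def visit(node):
        length, children = node
        tips = 1 if not children else sum(visit(child) for child in children)
        if tips < n:  # the branch above the root subtends all n tips; skip it
            lengths[tips] += length
        return tips

    visit(tree)
    return lengths


def expected_spectrum(trees, n, mu):
    """Expected counts per class k, and the small-mu normalised spectrum."""
    expected_counts = [0.0] * n
    total_by_k = [0.0] * n
    for tree in trees:  # one marginal tree per site
        t_k = branch_length_by_tip_count(tree, n)
        t_total = sum(t_k)
        for k in range(1, n):
            expected_counts[k] += mu * t_k[k] * math.exp(-mu * t_total)
            total_by_k[k] += t_k[k]
    grand_total = sum(total_by_k)
    normalised = [x / grand_total for x in total_by_k]
    return expected_counts, normalised


if __name__ == "__main__":
    # Tips a and b coalesce after 1 unit; their ancestor joins tip c at 3 units.
    tree = (0.0, [(2.0, [(1.0, []), (1.0, [])]), (3.0, [])])
    print(branch_length_by_tip_count(tree, 3))
    print(expected_spectrum([tree, tree], 3, 1e-3))
```

For the toy tree, five of the seven units of branch length sit above single tips, so about 5/7 of segregating mutations are expected to be singletons.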
The previous statistics were single-site statistics that took their information from the branching structure of the pedigree and the differentiating action of mutation along it. Consideration of the relationships among multiple loci brings recombination into the picture. Perhaps the simplest summary of this is the measure of linkage disequilibrium. It is a two-site statistic, and is in some sense a single-individual statistic.
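Take two sites $\ell$ and $\ell'$, at recombination distance $r$, so that the mean number of crossovers that fall between them in a generation is $r$. One statistic measuring association between alleles $a$ and $b$ at $\ell$ and $\ell'$ is
\begin{equation}
D_{ab} = p_{ab} - p_a \, p'_b, \tag{27}
\end{equation}
where $p_{ab}$ is the empirical frequency of chromosomes that have allele $a$ at $\ell$ and allele $b$ at $\ell'$ (so that $p_{11}$ is the frequency of chromosomes that have the ‘1’ allele at both sites), and $p_a$ and $p'_b$ are the corresponding single-site frequencies at $\ell$ and $\ell'$. To measure association between the loci we sum over alleles and square, defining
\begin{align}
D^2 &= \sum_{a,b} D_{ab}^2 \tag{28} \\
&= \sum_{a,b} \big(p_{ab} - p_a \, p'_b\big)^2. \tag{29}
\end{align}
Now assume that the loci are biallelic, coded as $\{0,1\}$ (in which case $D^2 = 4 D_{11}^2$), and let $i$, $j$, $k$, and $l$ be the indices of individuals chosen uniformly at random with replacement. Now let $X_i$ be a randomly chosen allele at locus $\ell$ for $i$ (i.e. either $G^m_{i\ell}$ or $G^p_{i\ell}$), $Y_i$ be the same for locus $\ell'$, on the same chromosome as $X_i$, and similarly for $j$, $k$, and $l$. Then
\begin{equation}
D^2 = \mathbb{P}\{X_i = X_j, Y_i = Y_j\} - 2\, \mathbb{P}\{X_i = X_j, Y_i = Y_k\} + \mathbb{P}\{X_i = X_j, Y_k = Y_l\}. \tag{30}
\end{equation}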
(Note: this is an example of the more general idea of a “distance covariance”, here between the alleles $X$ and $Y$ carried at the two loci.)
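The sampling representation of the two-locus statistic can be verified by brute force on a small sample. The sketch below assumes haplotypes are given as a list of `(x, y)` tuples of 0/1 alleles at the two loci (a hypothetical input format), computes the sum of squared $p_{ab} - p_a p_b$ directly from frequencies, and compares it with the three-term expression obtained by enumerating all draws of two, three, and four chromosomes with replacement. The function names are illustrative.

```python
from collections import Counter
from itertools import product


def d_squared_from_frequencies(haps):
    """Sum over allele pairs (a, b) of (p_ab - p_a * p_b)^2."""
    n = len(haps)
    p_ab = Counter(haps)
    p_a = Counter(x for x, _ in haps)
    p_b = Counter(y for _, y in haps)
    return sum(
        (p_ab[(a, b)] / n - (p_a[a] / n) * (p_b[b] / n)) ** 2
        for a in p_a
        for b in p_b
    )


def d_squared_from_sampling(haps):
    """Three-term expression, enumerating draws made with replacement."""
    n = len(haps)
    same_pair = sum(
        xi == xj and yi == yj
        for (xi, yi), (xj, yj) in product(haps, repeat=2)
    ) / n ** 2
    shared_one = sum(
        xi == xj and yi == yk
        for (xi, yi), (xj, _), (_, yk) in product(haps, repeat=3)
    ) / n ** 3
    disjoint = sum(
        xi == xj and yk == yl
        for (xi, _), (xj, _), (_, yk), (_, yl) in product(haps, repeat=4)
    ) / n ** 4
    return same_pair - 2 * shared_one + disjoint


if __name__ == "__main__":
    haps = [(0, 0), (0, 0), (1, 1), (1, 0), (0, 1), (1, 1)]
    print(d_squared_from_frequencies(haps), d_squared_from_sampling(haps))
```

The enumeration costs on the order of `n ** 4` operations, so it is only meant as a check on small samples; the frequency-based function is the practical one.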
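These quantities are things that we can compute in terms of paths through the pedigree if we can assume that the appearance of mutations can be taken as independent of the pedigree. Let $T_{ij}$ be the number of generations back to the common ancestor of the chosen chromosomes of $i$ and $j$ at locus $\ell$, and similarly for $T'_{ij}$ at locus $\ell'$. Then under the infinite alleles model, with mutation rate $\mu$, the probability that the chosen chromosomes of $i$ and $j$ carry identical alleles at both loci is
\begin{equation}
\mathbb{P}\{X_i = X_j, Y_i = Y_j\} = \mathbb{E}\big[e^{-2\mu (T_{ij} + T'_{ij})}\big]. \tag{31}
\end{equation}
Similar equations for the other terms lead to
\begin{align}
\mathbb{E}[D^2] &= \mathbb{E}\big[e^{-2\mu (T_{ij} + T'_{ij})}\big] - 2\, \mathbb{E}\big[e^{-2\mu (T_{ij} + T'_{ik})}\big] + \mathbb{E}\big[e^{-2\mu (T_{ij} + T'_{kl})}\big] \tag{32} \\
&= \operatorname{cov}\big[e^{-2\mu T_{ij}}, e^{-2\mu T'_{ij}}\big] - 2 \operatorname{cov}\big[e^{-2\mu T_{ij}}, e^{-2\mu T'_{ik}}\big] + \operatorname{cov}\big[e^{-2\mu T_{ij}}, e^{-2\mu T'_{kl}}\big] \tag{33} \\
&\approx 4\mu^2 \Big( \operatorname{cov}[T_{ij}, T'_{ij}] - 2 \operatorname{cov}[T_{ij}, T'_{ik}] + \operatorname{cov}[T_{ij}, T'_{kl}] \Big), \tag{34}
\end{align}
where the second line uses the fact that $T'_{ij}$, $T'_{ik}$, and $T'_{kl}$ all have the same marginal distribution, so that the products of expectations cancel, and the latter approximation holds if the expected number of mutations per site ($\mu T$) is small.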
What does $D^2$, this measure of linkage disequilibrium, have to say about the structure of the ancestral recombination graph? Intuitively, since it is the squared correlation between alleles at two loci on the same chromosome, it should be telling us about how much those loci tend to stick together. This is reflected in the formula above, which interprets $D^2$ in terms of covariances of times back to most recent common ancestors at the two sites.
We can do a little more to make these covariances interpretable, in terms of the recombination distance $r$ between $\ell$ and $\ell'$. For convenience, let , etcetera. Again assuming independence of mutation and the pedigree, given the time $T_{ij}$ back to the common ancestor at the first locus, the probability that there was no recombination between the loci along the path between $i$ and $j$ is $e^{-2 r T_{ij}}$; in this case, $T'_{ij} = T_{ij}$. Suppose that in the complementary case, when there was recombination, $T'_{ij}$ is (conditionally) independent of $T_{ij}$ (not true, but not too bad either). The corresponding term in the formula for $D^2$ decays exponentially with $r$:
(35) | ||||
(36) | ||||
(37) |
Now take the second term. The most obvious way that the genealogy induces correlations between and occurs if the most recent common ancestor of and is the same as that of and , in which case (see figure 2A), and there is no recombination along the whole genealogy back to this MRCA. Define to be the age of this MRCA. If we now assume that the case in which there was a recombination on the path from to contributes nothing to the covariance, since the probability that is in this position is ,
(38) | ||||
(39) | ||||
(40) |
It should be clear what to do for the third term now. If the situation in figure 2B occurs (which it does with probability ) then . As before,
(41) | ||||
(42) | ||||
(43) |
Combining these gets us an approximate expression for that is a tad unwieldy, but is in terms of ages of most recent common ancestors of two, three, and four samples: taking only terms first-order in , and letting ,
(44) |