First we read in the data, converting the strings of H/T to a number of Heads and a number of Tails.
datafile <- "coin_flips.txt"
# each entry of the file is one coin's flips as a string like "HTTH";
# split each string into a vector of single characters
flip_text <- strsplit(scan(datafile, what=''), "")
nflips <- sapply(flip_text, length)                    # flips per coin
nheads <- sapply(flip_text, function (x) sum(x=="H"))  # heads per coin
ntails <- sapply(flip_text, function (x) sum(x=="T"))  # tails per coin
# check that every character was either H or T
stopifnot(all(ntails == nflips - nheads))
There are 100 coins in the dataset, each flipped between 1 and 24 times, with a median of 8 flips. The number of flips per coin breaks down as follows:
table(nflips)
## nflips
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 22 24
## 2 4 6 6 3 14 10 15 11 6 3 7 4 1 3 1 1 1 2
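As a quick check of those summary numbers:
length(nflips)   # 100 coins
median(nflips)   # 8 flips per coin, at the median
range(nflips)    # from 1 to 24 flips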
The proportion of heads varies substantially: here is a plot of the proportion of heads against the number of flips, slightly jittered to avoid overplotting.
plot(jitter(nflips), jitter(nheads/nflips), xlab='number of flips',
     ylab='proportion heads', pch=20)
The coins tend towards Heads, although they certainly don’t all have the same probability of coming up heads: observe that one coin flipped 12 times had no heads, while another, also flipped 12 times, had 12 heads.
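We can check this directly from the counts computed above:
# numbers of heads among the coins flipped exactly 12 times;
# both 0 and 12 should appear
range(nheads[nflips == 12])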
From Wikipedia, the Beta-Binomial distribution (for the number of heads among \(n\) flips of a coin whose probability of heads, denoted \(\theta\), has a Beta(\(\alpha\), \(\beta\)) distribution) is \[\begin{aligned}
p(k|n, \alpha, \beta) = \binom{n}{k} \frac{B(k + \alpha, n - k + \beta)}{B(\alpha, \beta)} ,
\end{aligned}\] where \(B()\) is the Beta function. That's the probability that a particular coin flipped \(n\) times gets \(k\) heads; we are assuming that the sequence of flips from each coin is an independent result of this random process, with common \(\alpha\) and \(\beta\) parameters determining the distribution of \(\theta\) across the coins (i.e., how the probability of heads varies between coins). Using the R functions lchoose() and lbeta() for the logarithms of the “choose” function \(\binom{n}{k}\) and the Beta function, respectively, the log-likelihood of the data for a given set of params = \((\alpha, \beta)\) is
loglik <- function (params) {
    # params = c(alpha, beta); returns the sum of the Beta-Binomial
    # log-probabilities across all coins
    sum( lchoose(nflips, nheads) +
         lbeta(nheads + params[1], ntails + params[2]) -
         lbeta(params[1], params[2]) )
}
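As a sanity check on the formula, the Beta-Binomial probabilities of 0 through \(n\) heads should sum to one; here is a quick test with a throwaway helper (bb_prob() is not part of the analysis, and the shape parameters are arbitrary):
# Beta-Binomial pmf, written the same way as the log-likelihood above
bb_prob <- function (k, n, a, b) {
    exp(lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b))
}
stopifnot(abs(sum(bb_prob(0:12, 12, 2, 3)) - 1) < 1e-12)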
There are only two parameters, so to get a picture of the result we’ll do a grid search.
ngrid <- 51
maxgrid <- 10
# grid of candidate values, dropping zero since alpha and beta must be positive
alpha_vals <- beta_vals <- seq(0, maxgrid, length.out=ngrid)[-1]
ab <- expand.grid(alpha=alpha_vals, beta=beta_vals)
ll_mat <- apply(ab, 1, loglik)   # log-likelihood at each grid point
dim(ll_mat) <- c(length(alpha_vals), length(beta_vals))
mle <- ab[which.max(ll_mat),]    # grid point with the highest log-likelihood
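One thing worth checking with any grid search is that the maximum lies in the interior of the grid rather than on its edge:
# if this fails, the grid should be extended or refined
stopifnot(mle$alpha > min(alpha_vals), mle$alpha < max(alpha_vals),
          mle$beta > min(beta_vals), mle$beta < max(beta_vals))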
The MLE obtained by this grid search is \(\alpha = 2.6\) and \(\beta = 1.6\), at which point the log-likelihood is about \(-200.34\).
The (log-)likelihood surface has a single peak, so after some adjusting of the parameters, we end up with the following picture:
image(alpha_vals, beta_vals, ll_mat)
contour(alpha_vals, beta_vals, ll_mat, add=TRUE)
# extra contour lines near the maximum, where the default levels are too coarse
contour(alpha_vals, beta_vals, ll_mat, add=TRUE,
        levels=pretty(max(ll_mat) - 0:10))
points(mle$alpha, mle$beta, pch="*", cex=3)
We could use numerical optimization (e.g., the R function optim()) to get an estimate of \(\alpha\) and \(\beta\) with more digits after the decimal, but it is clear from the likelihood surface that this would be false precision: there is an elliptical region, over which \(\alpha\) ranges from about 2 to 4 and \(\beta\) from about 1 to 2, on which the log-likelihood differs from its maximum by only two units; any reasonable confidence interval would roughly agree with this region.
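For reference, a refinement along those lines might look like the following sketch (optim() minimizes by default, so we hand it the negative log-likelihood, guarded against non-positive parameters):
fit <- optim(c(alpha=mle$alpha[1], beta=mle$beta[1]),
             function (p) if (any(p <= 0)) Inf else -loglik(p))
fit$par   # should land close to the grid estimate above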
In conclusion, our best guess at the distribution of true-probabilities-of-heads in my bag of coins is Beta(2.6, 1.6), which looks like:
xx <- seq(0, 1, length.out=400)
plot(xx, dbeta(xx, shape1=mle$alpha[1], shape2=mle$beta[1]), type='l',
     xlab='probability of heads', ylab='density')
lines(xx, dbeta(xx, shape1=3, shape2=2), col='red', lty=3)
The true values used to simulate these data were \(\alpha = 3\) and \(\beta = 2\), shown in red on the graph – not bad.
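To make the comparison easier to read, one could add a legend to that plot (a small, optional addition):
legend('topleft', lty=c(1, 3), col=c('black', 'red'),
       legend=c('fitted Beta(2.6, 1.6)', 'true Beta(3, 2)'))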
Are all the coins consistent with this distribution of probabilities? To get an idea, we can compute a “\(z\)-score” for each coin: the difference of the observed number of heads from its mean, divided by its SD. Again from Wikipedia, the mean of a Beta-Binomial with \(n\) flips is \(n \alpha / (\alpha + \beta)\) and the variance is \(n \alpha \beta (\alpha + \beta + n) / ((\alpha + \beta)^2 (\alpha + \beta + 1))\). Surprising outcomes should have large values of \(z\) (either positive or negative).
alpha <- mle$alpha[1]
beta <- mle$beta[1]
means <- nflips * alpha / (alpha + beta)   # Beta-Binomial mean for each coin
sds <- sqrt(nflips * alpha * beta * (alpha + beta + nflips) /
            ((alpha + beta)^2 * (alpha + beta + 1)))
z <- (nheads - means) / sds
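For instance, the coin flipped 12 times with no heads has mean \(12 \times 2.6 / 4.2 \approx 7.43\) and SD \(\approx 2.97\), so its \(z\)-score should be about \(-2.5\):
z[nflips == 12 & nheads == 0]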
Here is a stem-and-leaf plot of the \(z\)-scores:
biggest_z <- which.max(abs(z))   # index of the most extreme coin, used below
stem(z)
##
## The decimal point is at the |
##
## -2 | 5
## -2 |
## -1 | 99975
## -1 | 44432222210000
## -0 | 999999777765555
## -0 | 4444333222111
## 0 | 001223334444
## 0 | 556666667788888999
## 1 | 00001222223444444
## 1 | 55555
The largest \(z\)-score in absolute value belongs to the 76th coin, with \(z = -2.50\): the coin flipped 12 times that got no heads. Is this surprising? Using the Beta-Binomial probability distribution again, the probability of getting so few heads in 12 flips is \(1.9 \times 10^{-8}\). This certainly seems surprising – does this coin have two tails and no heads? More investigation is needed.