\[%% % Add your macros here; they'll be included in pdf and html output. %% \newcommand{\R}{\mathbb{R}} % reals \newcommand{\E}{\mathbb{E}} % expectation \renewcommand{\P}{\mathbb{P}} % probability \]
Title: "Generating sequences without repeats" |
Author: "Peter Ralph" |
Date: "Wed Aug 23 14:12:24 2017" |
Goal: To write a function that takes a single argument, \(n\), and returns all strings of length \(n\) composed of A, C, G, and T in which no two adjacent letters are the same.
To start, and for testing purposes, let's compile the list by hand for small \(n\). For instance, truth[[2]]
will have the right answer for n=2
:
truth <- list(
c("A", "C", "G", "T"),
c("AC", "AG", "AT",
"CA", "CG", "CT",
"GA", "GC", "GT",
"TA", "TC", "TG"),
c("ACA", "ACG", "ACT",
"AGA", "AGC", "AGT",
"ATA", "ATC", "ATG",
"CAC", "CAG", "CAT",
"CGA", "CGC", "CGT",
"CTA", "CTC", "CTG",
"GAC", "GAG", "GAT",
"GCA", "GCG", "GCT",
"GTA", "GTC", "GTG",
"TAC", "TAG", "TAT",
"TCA", "TCG", "TCT",
"TGA", "TGC", "TGT"))
As a check, the correct number of these should be 4 * 3^{(n-1)}
-- four for the first letter, and three for each subsequent one.
4 * (3^(0:2))
## [1] 4 12 36
sapply(truth, length)
## [1] 4 12 36
The procedure of making those lists suggests an elegant idea: construct the length \((n+1)\) strings by prepending a letter to the length \(n\) strings, but omitting the ones that would make a repeat.
sequences <- function (n) {
bases <- c("A", "C", "G", "T")
if (n==1) {
out <- bases
} else {
out <- c()
smaller <- sequences(n-1)
for (x in bases) {
omit <- (substr(smaller, 1, 1) == x)
out <- c(out, paste0(x, smaller[!omit]))
}
}
return(out)
}
Now let's check this:
all(sequences(1) == truth[[1]])
## [1] TRUE
all(sequences(2) == truth[[2]])
## [1] TRUE
all(sequences(3) == truth[[3]])
## [1] TRUE
results <- sapply(1:8, sequences)
plot(sapply(results, length), xlab='n', ylab='length of result')
lines(1:8, 4*3^(0:7), col='red')