In the previous post, we explored a simple but powerful Bayesian model of variant calling. This time, we’re climbing higher. Hold on tight — the path is clear, but the maths gets a little wilder. We’re going to reassess what we even mean by a variant, and reframe our model in a more generative way. Cling to your favourite tree branch — we’ll be swinging between two perspectives on variant calling: the binary hypothesis model and the base mixture model.
Previously we examined a simple but powerful Bayesian model of variant calling. It evaluated how likely it was that a variant was present versus not present, using Bayes’ theorem. This model is fundamentally binary — it considers two hypotheses: either a true variant is present at some frequency, or all of the observed variation is due to sequencing error.
However, we didn’t really define what we meant by “a variant”. I would imagine that most people assumed it meant a single, non-reference base at a specific position in the genome. But we only ever talked about “variant supporting reads”. We never said anything about what it meant for a read to support a variant, or not! In reality, a read “supporting a variant” presumes a base-specific model — but our model never defined what the variant was.
Let’s stick with the assumption that we were discussing single-nucleotide polymorphisms. Even in such a simple case we immediately run into a problem: in trying to define “variant supporting reads”, we assume a two-state model, variant or not variant. But wait, doesn’t DNA have (at least) four bases? What do we do with the other two bases?
We can call our original model the binary hypothesis model — it weighs one alternative base against the reference. This false dichotomy does not appear to directly model the reality of the data. What happens if we drop this binary setup and instead ask: what is the probability that each possible base (A, C, G, or T) is the true base at this position? In this post, we will explore such a four-base model, how it relates to the binary hypothesis model, and how both can still be useful in practice.
Before we look at an alternative model, let’s take a step back and look at the binary hypothesis model again. Previously we derived the posterior probability of a variant existing (or not), given the observed data, a prior probability of a variant existing, and an error rate. We also warned against simply counting bases and claiming that the allele frequency was the fraction of variant-supporting reads. We can however use the model to estimate the true allele frequency $p$, given a substitution error rate $e$.
From the previous post, recall that if a variant exists at frequency $p$, the probability of observing $k$ variant-supporting reads out of $n$ total reads is:
$$ P(k \mid \text{Variant}) = \binom{n}{k} p_v^k (1 - p_v)^{n - k} $$
where:
$$ p_v = p(1 - e) + (1 - p)e $$
This accounts for both true variant reads (with some chance of being misread) and erroneous reads (misread as the variant). So the likelihood is a function of $p$. Rather than presume we know the value of $p$, let’s instead ask the question, “what is the value of $p$ for which the likelihood is maximised?” This will give us the so-called “maximum likelihood estimate” (MLE) of the true allele frequency $p$.
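To make this concrete, here is a minimal Python sketch of the likelihood above, evaluated for a few candidate allele frequencies. The counts ($k = 12$ of $n = 100$ reads) and the 1% error rate are illustrative assumptions, not values from the post.

```python
from scipy.stats import binom

def likelihood(k, n, p, e):
    """P(k variant-supporting reads out of n | a variant at frequency p)."""
    p_v = p * (1 - e) + (1 - p) * e   # probability any single read supports the variant
    return binom.pmf(k, n, p_v)

# Hypothetical pileup: 12 variant-supporting reads out of 100, error rate 1%.
for p in [0.05, 0.10, 0.15]:
    print(f"p = {p:.2f}: likelihood = {likelihood(12, 100, p, 0.01):.4f}")
```

Of the three candidate values, $p = 0.10$ scores highest here, which anticipates the maximization we do next.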
To find the MLE we return to high-school calculus and differentiate the formula above for the likelihood. We’ll take logs before differentiating, to make it easier to work with:
$$ \ell(p) = \log \binom{n}{k} + k \log(p_v) + (n - k) \log(1 - p_v) $$ $$ \frac{d\ell}{dp} = \frac{k}{p_v} \cdot \frac{dp_v}{dp} - \frac{n - k}{1 - p_v} \cdot \frac{dp_v}{dp} $$
And substituting $\frac{dp_v}{dp} = 1 - 2e$ gives us:
$$ \frac{d\ell}{dp} = (1 - 2e) \left( \frac{k}{p_v} - \frac{n - k}{1 - p_v} \right) $$
To find the maximum likelihood estimate of $p$, we want this to be zero. After a little algebra we end up with:
$$ p_{\text{MLE}} = \frac{\frac{k}{n} - e}{1 - 2e} $$
This is a nice, compact formula that captures the relationship between observed variant reads, total reads, and error rates. The expression has some interesting, but quite unsurprising, properties: when $e = 0$ it reduces to the naive estimate $\frac{k}{n}$, and when $\frac{k}{n} < e$ it becomes negative.
But allele frequency can’t be negative. What’s going on? The fact that $p_{\text{MLE}}$ would fall below zero tells us that the observed number of variant reads is not sufficient to distinguish true variation from background error. This is what our intuition at the start of the original post caught: when $\frac{k}{n} \approx e$, it’s expected that some variation is just noise, even if there’s a real (small) signal underneath.
A negative MLE is just a formal way of saying: if there is a true variant, it’s too weak to rise above noise.
So rather than indicating the variant is absent, it reveals that you cannot meaningfully estimate a positive frequency given the observed data and error model. When the error rate is on the same order as the variant allele frequency, the model becomes “unidentifiable” — we can’t confidently separate real variants from noise. The negative MLE is a mathematical shadow of this problem. One should not, however, question the value of the MLE only when it is negative. Remember that the MLE tells us the most likely value of the parameter given the data, provided that a variant is present in the first place. To answer that last part we need to consider the posterior probability of a variant existing, as in the previous post.
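As a quick sketch, the closed-form estimate is a one-liner. The pileups below are made-up examples, chosen so that one falls in the well-behaved regime and one falls in the negative-MLE regime discussed above.

```python
def allele_frequency_mle(k, n, e):
    """Closed-form MLE of the true allele frequency: (k/n - e) / (1 - 2e)."""
    return (k / n - e) / (1 - 2 * e)

# Hypothetical pileups with a 1% error rate: 12/100 sits well above the
# error rate, 1/200 does not and yields a negative (uninterpretable) MLE.
for k, n in [(12, 100), (1, 200)]:
    print(f"k = {k}, n = {n}: p_MLE = {allele_frequency_mle(k, n, e=0.01):.4f}")
```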
One thing to note about the MLE is that it doesn’t take into account any prior information we might have about the allele frequency. A common problem with MLEs is that they can be very sensitive to the data, and can give very different estimates depending on the sample size and the observed data.
In many cases, we want to include prior information about the expected distribution of $p$. This is where the “maximum a posteriori” (MAP) estimate comes in. By including a prior distribution on $p$, we can obtain a more robust estimate of the allele frequency. Not only does including priors help to regularise noisy data, but it also allows us to incorporate biological knowledge into the model.
We calculate the MAP estimate through our old friend Bayes’ theorem:
$$ P(p \mid \text{data}) \propto P(\text{data} \mid p) \cdot P(p) $$
We need a prior belief, $P(p)$, for the allele frequency $p$, where $p \in [0, 1]$. When working with Bernoulli trials a standard choice for the prior is the Beta distribution. The reason for this is that it’s “conjugate” to the binomial distribution. This simply means that if we assume a Beta prior and work through a similar maximization process to the one above for the MLE, the function for the posterior will also take the form of a Beta distribution, albeit with modified parameters.
We won’t work through all the maths here because the algebra is a bit tedious, but the starting point would be:
$$ \log P(p \mid \text{data}) = \log P(\text{data} \mid p) + \log P(p) $$
with: $$ \begin{aligned} P(p) &= \text{Beta}(p \mid \alpha, \beta) \\ &= \frac{p^{\alpha - 1} (1 - p)^{\beta - 1}}{B(\alpha, \beta)} \end{aligned} $$
From here we would differentiate and set the derivative to zero to find the MAP estimate.
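Rather than grinding through that algebra, a minimal numerical sketch gets us the MAP estimate directly by maximizing the log-posterior. The Beta(2, 10) prior and the counts are illustrative assumptions.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom, beta

def allele_frequency_map(k, n, e, alpha=2.0, beta_param=10.0):
    """Numerically maximize log P(data | p) + log Beta(p | alpha, beta_param)."""
    def neg_log_posterior(p):
        p_v = p * (1 - e) + (1 - p) * e   # effective variant-read probability
        return -(binom.logpmf(k, n, p_v) + beta.logpdf(p, alpha, beta_param))
    result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
    return result.x

# With a Beta(2, 10) prior the estimate is pulled slightly towards the
# prior mode (0.1) compared with the MLE for the same hypothetical pileup.
print(allele_frequency_map(k=12, n=100, e=0.01))
```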
The binary hypothesis model is a useful way to think about the variant calling problem, but, as alluded to in the introduction, it raises a few questions. Rather than asking whether a variant exists, we can ask: what is the probability of each base being the true base, given the observed data? This is a different question, but it can be answered using the same data and similar principles. In some ways it’s a more natural question to ask, and a simpler model.
We wish to calculate:
$$ P(\text{base} \mid \text{reads}) \text{ for each base} \in \{ A, C, G, T \} $$
This is a marginalization problem: we want to compute the posterior probability of each base, given the observed data. We can use Bayes’ theorem to compute this:
$$ P(b \mid \text{reads}) = \frac{P(\text{reads} \mid b) \cdot P(b)}{\sum_{b'} P(\text{reads} \mid b') \cdot P(b')} $$
where $P(\text{reads} \mid b)$ is the likelihood of observing the reads given a specific truth base $b$. Note that the denominator is simply a sum of the numerator over all possible bases; it normalizes the posterior probabilities. We can also see that if the prior $P(b)$ is uniform, the posterior, $P(b \mid \text{reads})$, is simply proportional to the likelihood, $P(\text{reads} \mid b)$.
In some sense, this is a more general model than the binary hypothesis model, which only considers the reference and one alternative base. It allows us to compute the posterior probability of each base, rather than just the probability of a variant existing or not. There is however an important difference, which we will come to later.
Let’s first look at how to compute the likelihood of observing the reads given a specific base. We assume that each read is generated independently, and that the observed base is a noisy version of the true base. And so we have:
$$ P(\text{reads} \mid b) = \prod_{i=1}^{n} P(r_i \mid b) $$
For each read, $P(r_i \mid b)$ is an assumed probability model. It is the probability of observing the read base given the true base, for example:
|  | $r_i$=A | C | G | T |
|---|---|---|---|---|
| $b$=A | 0.995 | 0.002 | 0.001 | 0.002 |
| C | 0.000 | 0.995 | 0.003 | 0.002 |
| G | 0.002 | 0.002 | 0.995 | 0.001 |
| T | 0.004 | 0.000 | 0.001 | 0.995 |
Notice the diagonal elements of the matrix are largest; the off-diagonal elements are small and correspond to substitution errors in the reads. The model allows for a different error rate for each truth base (and a different rate for each observed base). Given an observed pileup, e.g. 950 A’s, 0 C’s, 45 G’s, and 15 T’s, we can compute the likelihood of each base being true, given the observed data:
$$ P(\text{reads} \mid b) = P(A \mid b)^{950} \cdot P(C \mid b)^{0} \cdot P(G \mid b)^{45} \cdot P(T \mid b)^{15} $$
This is a product of the probabilities of observing each read base, raised to the power of the number of times that base was observed. It’s normal to perform this calculation in log space as:
$$ \begin{aligned} \log P(\text{reads} \mid b) &= \sum_{o \in \{A,C,G,T\}} n_o \log P(o \mid b) \\ &= 950 \log P(A \mid b) \\ &\phantom{= }+ 45 \log P(G \mid b) \\ &\phantom{= }+ 15 \log P(T \mid b) \end{aligned} $$
If we wish to get the vector of $P(\text{reads} \mid b)$ for all truth bases, this is simply a multiplication of the (log) substitution error matrix by the observed counts of each base.
$$ \begin{aligned} \log P(\text{reads} \mid b) &= \log \mathbf{S} \cdot \mathbf{n} \\ &= \log \begin{pmatrix} 0.995 & 0.002 & 0.001 & 0.002 \\ 0.000 & 0.995 & 0.003 & 0.002 \\ 0.002 & 0.002 & 0.995 & 0.001 \\ 0.004 & 0.000 & 0.001 & 0.995 \end{pmatrix} \cdot \begin{pmatrix} 950 \\ 0 \\ 45 \\ 15 \end{pmatrix} \end{aligned} $$
So the log-likelihood is a rather compact function of the substitution error matrix and the observed counts of each base. And for a uniform prior, the posterior is simply proportional to this (up to the normalization constant).
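Here is a minimal numpy sketch of that compact form, using the illustrative substitution matrix and pileup from above; the tiny floor inside the log is only there to avoid taking log(0) on the zero entries.

```python
import numpy as np

S = np.array([[0.995, 0.002, 0.001, 0.002],   # true base A
              [0.000, 0.995, 0.003, 0.002],   # true base C
              [0.002, 0.002, 0.995, 0.001],   # true base G
              [0.004, 0.000, 0.001, 0.995]])  # true base T
n = np.array([950, 0, 45, 15])                # observed counts of A, C, G, T

# Element-wise log of S (floored to avoid log(0)), then a matrix-vector
# product with the counts gives log P(reads | b) for each truth base b.
log_lik = np.log(np.clip(S, 1e-12, None)) @ n

# Uniform prior: the posterior is just the normalized likelihood,
# computed stably by subtracting the maximum before exponentiating.
w = np.exp(log_lik - log_lik.max())
posterior = w / w.sum()
for base, prob in zip("ACGT", posterior):
    print(f"{base}: {prob:.3g}")
```

Working in log space sidesteps the underflow we would hit by multiplying a thousand per-read probabilities together directly.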
A nice feature of this model is that it allows us to incorporate empirical substitution error rates directly into the likelihood function. In practice, we can estimate the substitution error rates from the data itself, using a training set of known variants. This is a powerful way to express the variant calling problem, as it allows us to adapt the model to the specific error characteristics of the sequencing technology or the sample being analyzed.
The substitution matrix here acts to “project” the observed counts of each base into the space of possible true bases (each row of the matrix product above is a dot product with the counts). There’s no reason this projection has to be constrained to four truth bases, and in fact we can extend this model to mixtures of bases, i.e. to heterozygous genotypes. For example, the observation of an A in a read strongly suggests the genotype A/A, but it also gives weight to A/C, A/G, and A/T, and even to C/C if the substitution error rate for C>A is non-zero.
The model as described here is a single-base model. It is tempting to interpret the posterior probabilities as allele frequencies in the data, but this is not correct. The posterior probabilities we calculate for each base $b$ tell us how likely it is that that single base is the true base, given the observed data and the substitution error model. The model assumes that only one base is present, and that the observed mixture of bases is due to sequencing errors.
In the binary hypothesis model, we modeled variant detection as a decision between two hypotheses: all observed variation is due to sequencing error, or a true variant is present at frequency $p$. This model allowed us to reason about allele frequencies. The parameter $p$ controlled how often we expected to observe variant-supporting reads, and in this post we’ve seen how we can fit $p$ by maximizing the likelihood or computing a posterior (the MLE and MAP estimates of $p$). In this way, the binary model allows us to represent mixtures of reference and variant alleles, albeit in a limited form: only two alleles at a time (reference and a single alternative).
How do we extend the four-base model to contend with the idea that multiple bases may be present? For example, low-frequency somatic mutations or heterozygous sites with unexpected bias. The original four-base model assumed that only one true base was present, and all other bases were noise. This does not capture true biological mixtures.
To fix this, we can extend the four-base model to allow for base mixtures. Rather than assuming a single true base, we now assume that each read is drawn from a distribution over true bases, which reflects their biological allele frequencies.
Let us define a truth base mixture as a vector of frequencies for each base, $\mathbf{f}$, i.e.:
$$ \mathbf{f} = (f_A, f_C, f_G, f_T) \quad \text{with } \sum f_b = 1 $$
Each $f_b$ is the true frequency of base $b$ in the sample — due to zygosity, clonal structure, or other biological variation. An observed base in a single read is then produced in two steps: first a true base $b$ is drawn with probability $f_b$, and then the observed base $r$ is drawn from the error model with probability $P(r \mid b)$.
So the probability of observing a read with base $r$ is:
$$ P(r \mid \mathbf{f}) = \sum_{b} f_b \cdot P(r \mid b) $$
Assuming reads are independent, the likelihood of observing a pileup of $n$ reads is the product of $n$ such terms:
$$ P(\text{reads} \mid \mathbf{f}) = \prod_{i=1}^n \left( \sum_b f_b \cdot P(r_i \mid b) \right) $$
And if we summarize the data as base counts:
$$ P(\text{reads} \mid \mathbf{f}) = \prod_{o \in \{A,C,G,T\}} \left( \sum_b f_b \cdot P(o \mid b) \right)^{n_o} $$
(where $n_o$ is the number of observed bases of type $o$). In the previous section, we showed how converting the substitution error matrix and the observed counts into log space yields a compact form of the likelihood. We can do the same here — either for a single base mixture or for many candidate mixtures.
For a single mixture $\mathbf{f}$, the log-likelihood is:
$$ \log P(\text{reads} \mid \mathbf{f}) = \sum_{o \in \{A,C,G,T\}} n_o \log \left( \sum_b f_b \cdot P(o \mid b) \right) $$
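As a small sketch, here is that log-likelihood for a single assumed mixture, reusing the illustrative substitution matrix and counts from earlier; the 50/50 A/G mixture stands in for a heterozygous A/G genotype.

```python
import numpy as np

S = np.array([[0.995, 0.002, 0.001, 0.002],
              [0.000, 0.995, 0.003, 0.002],
              [0.002, 0.002, 0.995, 0.001],
              [0.004, 0.000, 0.001, 0.995]])   # rows: true base, columns: observed base
n = np.array([950, 0, 45, 15])                 # observed counts of A, C, G, T

def mixture_log_likelihood(f, S, n):
    """log P(reads | f): mix the per-base error profiles by f, then weight by the counts."""
    p_obs = f @ S                                        # P(o | f) = sum_b f_b * P(o | b)
    return n @ np.log(np.clip(p_obs, 1e-12, None))       # floor avoids log(0)

f_het_AG = np.array([0.5, 0.0, 0.5, 0.0])                # assumed 50/50 A/G mixture
print(mixture_log_likelihood(f_het_AG, S, n))
```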
Now suppose we want to compute the likelihood across a collection of mixtures. Let $\mathbf{F}$ be a matrix whose rows $\mathbf{f}$ are different allele frequency vectors. Then the log-likelihoods for all mixtures can be written compactly as:
$$ \begin{aligned} \log \mathbf{l} &= \log P(\text{reads} \mid \mathbf{F}) \\ &= \log(\mathbf{F} \cdot \mathbf{S}) \cdot \mathbf{n} \end{aligned} $$
where:

- $\mathbf{S}$ is the substitution error matrix (rows indexed by the true base, columns by the observed base),
- $\mathbf{F}$ is the matrix of candidate mixtures, with one mixture $\mathbf{f}$ per row,
- $\mathbf{n}$ is the vector of observed base counts,

and the logarithm is applied element-wise.
This gives us a clean, vectorized way to compute the likelihood of the observed data across many hypothesized allele mixtures — enabling full genotype likelihoods or soft frequency estimation. Most often we will want to compute the likelihoods for all possible diploid genotypes, i.e. all combinations of two bases. But the methodology is general and can be applied to any set of mixtures.
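Here is a minimal sketch of the vectorized form for the usual diploid case: build $\mathbf{F}$ with one row per genotype (each a 50/50 or pure mixture of two bases) and get every genotype’s log-likelihood from a single pair of matrix products. The substitution matrix and counts are the same illustrative values as before.

```python
import numpy as np
from itertools import combinations_with_replacement

S = np.array([[0.995, 0.002, 0.001, 0.002],
              [0.000, 0.995, 0.003, 0.002],
              [0.002, 0.002, 0.995, 0.001],
              [0.004, 0.000, 0.001, 0.995]])   # rows: true base, columns: observed base
n = np.array([950, 0, 45, 15])                 # observed counts of A, C, G, T
bases = "ACGT"

# One row of F per diploid genotype: A/A, A/C, ..., T/T (10 rows in total).
genotypes = list(combinations_with_replacement(range(4), 2))
F = np.zeros((len(genotypes), 4))
for row, (i, j) in enumerate(genotypes):
    F[row, i] += 0.5
    F[row, j] += 0.5

# log(F @ S) gives log P(o | mixture) for every genotype and observed base;
# multiplying by the counts yields one log-likelihood per genotype.
log_lik = np.log(np.clip(F @ S, 1e-12, None)) @ n

for (i, j), ll in zip(genotypes, log_lik):
    print(f"{bases[i]}/{bases[j]}: {ll:.1f}")
```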
Alternatively, we may simply want the most likely mixture, which we can find by maximizing the log-likelihood. This would typically be performed numerically, using a gradient descent or similar optimization algorithm. In practice, we would also want to include a prior on the mixture frequencies, e.g. a Dirichlet prior, to avoid overfitting to noise. (It’s also common to add “pseudo-counts” to the observed counts, which act in a manner similar to a prior, as a form of regularization).
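A minimal numerical sketch of that idea, assuming a symmetric Dirichlet(2) prior and a softmax parameterization to keep the optimizer on the simplex (both choices are illustrative, not prescriptive):

```python
import numpy as np
from scipy.optimize import minimize

S = np.array([[0.995, 0.002, 0.001, 0.002],
              [0.000, 0.995, 0.003, 0.002],
              [0.002, 0.002, 0.995, 0.001],
              [0.004, 0.000, 0.001, 0.995]])   # rows: true base, columns: observed base
n = np.array([950, 0, 45, 15])                 # observed counts of A, C, G, T
alpha = 2.0                                    # symmetric Dirichlet concentration (assumed)

def softmax(z):
    w = np.exp(z - z.max())
    return w / w.sum()

def neg_log_posterior(z):
    f = softmax(z)                                             # keep f on the simplex
    log_lik = n @ np.log(np.clip(f @ S, 1e-12, None))          # mixture log-likelihood
    log_prior = (alpha - 1) * np.log(np.clip(f, 1e-12, None)).sum()  # Dirichlet log-density, up to a constant
    return -(log_lik + log_prior)

result = minimize(neg_log_posterior, x0=np.zeros(4), method="Nelder-Mead")
f_map = softmax(result.x)
for base, freq in zip("ACGT", f_map):
    print(f"{base}: {freq:.4f}")
```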
In this section we’ve introduced an alternative model of variant calling, focussing on the posterior probability of each base being the true base, given the observed data. This is a different framing of the problem from our previous binary hypothesis model, and provides different information.
The binary hypothesis model and the base mixture model are two perspectives on the same fundamental challenge: determining whether observed variation is due to real biological signal or sequencing error.
Our original binary hypothesis model weighed evidence for the presence of a variant (however defined) against the no-variant hypothesis. It introduced an explicit allele frequency parameter, allowing us to estimate how common a variant was in the sample. It is naturally a two-state view of the world.
By contrast, the mixture model generalizes the problem: rather than testing a specific hypothesis, it models the entire process of base generation probabilistically. We assign frequencies to all four bases, then use the substitution error matrix to ask: how likely are the observed reads, assuming this particular mixture of true bases?
This makes it possible to model diploid genotypes (like A/G), somatic subclones, or low-frequency alleles — without committing in advance to which alternative to test.
| Aspect | Binary Hypothesis Model | Base Mixture Model |
|---|---|---|
| Structure | Hypothesis test: ref vs alt | Full generative model over base frequencies |
| Allele frequency | Explicit scalar $p$, estimated for one variant | Full vector $\mathbf{f}$ over A/C/G/T |
| Number of alleles modeled | Exactly 2 (ref and one alt) | Up to 4, arbitrary mixtures |
| Substitution errors | Folded into likelihood via error rate $e$ | Modeled explicitly via error matrix $P(r \mid b)$ |
| Genotype resolution | One hypothesis at a time | All diploid (or somatic) mixtures simultaneously |
| Output | Variant posterior | Basewise/genotype/mixture posteriors |
| Priors | Requires explicit prior on variant presence | Optional prior over base frequencies (e.g. Dirichlet) |
| Typical use case | Confident detection of known variants | Soft inference of likely genotypes or base states |
In summary, the binary model treats variant detection as a yes/no decision, based on the confidence that one alternative explains the data better than random error. The mixture model treats the data as arising from an underlying mixture of true bases and tries to infer that mixture, regardless of whether it matches a particular expected genotype. The binary model is great for thresholding: is there a variant here or not? The mixture model is great for interpreting ambiguity: how likely is A/G? Could it be low-frequency C contamination?
The binary model can be used for confident, thresholded detection of a candidate variant and for estimating its allele frequency. The mixture model performs softer inference: it can score diploid genotypes, somatic subclones, and low-frequency alleles against an empirical error model, without committing in advance to a single alternative.
Variant calling is fundamentally an inference problem, where signal must be teased from noise under conditions of uncertainty, sparsity, and error. The models we’ve explored offer two complementary ways to approach this challenge.
The binary hypothesis model casts the problem as a decision: is there a variant here or not? It formalizes confidence through Bayes’ theorem and allows precise quantification of variant existence and allele frequency — but only within a restricted hypothesis space. It is sharp, focused, and ideal when we already have a candidate variant in mind.
The base mixture model, by contrast, sidesteps hypothesis testing somewhat. Instead, it builds a generative view: given a proposed mixture of true bases, how well does it explain the data? This model is softer, richer, and more general. It accommodates full genotype inference, somatic variation, and complex mixtures, all while directly incorporating empirical error models.
Together, these models represent not competing answers, but complementary questions. One asks whether a variant exists; the other asks what base composition best explains the evidence. Both perspectives are essential — one to make discrete calls, the other to interpret uncertainty and nuance.
In the end, good variant calling is not just about calling variants. It’s about understanding data, assumptions, and a model’s limitations — and choosing the right lens for the question at hand.