Hidden Markov Model for CpG islands

Does a Short DNA Stretch Come from a CpG Island?

Here we use likelihood ratio scores as the discriminatory statistic.

We need to model both the CpG Islands and the regions that are not CpG Islands. For this we need training sequences. We can collect databases of nucleotides from both types of sequences.

* Collect a database of 60,000 nucleotides
* Extract 48 putative CpG Islands
* For the putative CpG regions, compute the transition probabilities from nucleotide \(s\) to nucleotide \(t\) using a Markovian model

\[a^+_{s,t} = \frac{c^+_{s,t} }{ \sum_k c^+_{s,k}}\]

\(c^+_{s,t}\) is the number of times that \(s\) is followed by \(t\) in the database of CpG Islands.

* Similarly for the regions without CpG Islands

\[a^-_{st}= \frac{c^-_{s,t}}{\sum_k c^-_{s,k}}\]

* Construct table of transition probabilities
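This count-and-normalize step is easy to sketch in code. The following is a minimal illustration (in Python rather than the R used later in these notes, and with a short made-up training string standing in for the real database):

```python
from collections import defaultdict

def transition_probs(seq):
    """Estimate first-order transition probabilities a[s][t] from a
    training sequence: count how often s is followed by t, then
    normalize each row of counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, t in zip(seq, seq[1:]):       # every adjacent pair (s, t)
        counts[s][t] += 1
    probs = {}
    for s, row in counts.items():
        total = sum(row.values())
        probs[s] = {t: c / total for t, c in row.items()}
    return probs

# short made-up training string standing in for the CpG-island database
a_plus = transition_probs("ACGCGGCGCGTACGCGG")
print(a_plus["G"])
```

Each row of the resulting table sums to 1, exactly as in the tables below.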

Table of Transition Probabilities for CpG Islands

Model +        A       C       G       T
A            0.180   0.274   0.426   0.120
C            0.171   0.368   0.274   0.188
G            0.161   0.339   0.375   0.125
T            0.079   0.355   0.384   0.182
stationary   0.155   0.341   0.350   0.154

Table of Transition Probabilities for non CpG Islands

Model -        A       C       G       T
A            0.300   0.205   0.285   0.210
C            0.322   0.298   0.078   0.302
G            0.248   0.246   0.298   0.208
T            0.177   0.239   0.292   0.292
stationary   0.262   0.246   0.239   0.253

Calculate the Log-Odds ratio for a sequence \(x\):

\[S(x)=\log_2 \frac{P(x\mid \mbox{model}+)}{P(x\mid \mbox{model}-)} = \log_2 \frac{P(x_1\mid \mbox{model}+)}{P(x_1\mid \mbox{model}-)} + \sum_{i=2}^L \log_2 \frac{a^+_{x(i-1),x(i)}}{ a^-_{x(i-1),x(i)}} = \log_2 \frac{P(x_1\mid \mbox{model}+)}{P(x_1\mid \mbox{model}-)} + \sum_{i=2}^L \log_2 \beta_{x(i-1),x(i)}\]

The scores \(S(x)\) allow discrimination of one model (+) against the other (-).

We can summarize the computations by writing down the \(\log_2 \beta\) matrix once and for all, and adding the relevant terms from this matrix. For instance, from the two transition matrices above we have

\[\log_2 \beta_{AC}=\log_2 \frac{a^+_{AC}}{ a^-_{AC}}=\log_2 \frac{0.274}{0.205}=0.41855\]

thus we can calculate the table, rather than recompute each \(\beta\) every time:

\[\begin{split}\begin{array} {crrrr} \beta & A & C & G & T\\ A & -0.740 & 0.419 & .580 & -0.803\\ C & -0.913 & 0.302 & 1.812 & -0.685 \\ G & -0.623 & 0.461 & 0.331 & -0.730\\ T & -1.169 & 0.573 & 0.393 & -0.679 \\ \end{array}\end{split}\]
> statp=c(0.155,0.341,0.350,0.154)
> statm=c(0.262,0.246,0.239,0.253)
> statp/statm
 [1] 0.5916031 1.3861789 1.4644351 0.6086957
> log(statp/statm,2)
 [1] -0.7572986  0.4711134  0.5503443 -0.7162070

Score Computation

Thus the sequence \(x= CACTAAGCTA\) has score \(0.471-0.913+0.419-0.685-1.169-0.740+0.580+0.461-0.685-1.169=-3.430\): the initial term \(0.471 = \log_2(0.341/0.246)\) for the first nucleotide C, then one \(\log_2\beta\) term for each transition CA, AC, CT, TA, AA, AG, GC, CT, TA. Normalised by length this is \(-3.430/10=-0.343\) bits per nucleotide. To conclude whether a sequence is from a CpG island or not, we just compute its score; here it is negative, indicating that this is not a CpG island. The histogram of scores for known sequences shows a cutoff at around -0.05 and 0.1, with an uncertain region in between.
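This bookkeeping is easy to mechanize. The sketch below (Python; the \(\log_2\beta\) matrix and stationary log-ratios are hard-coded from the tables above) scores an arbitrary query sequence:

```python
# log2-beta matrix and stationary log-ratios, hard-coded from the tables above
LOG2_BETA = {
    "A": {"A": -0.740, "C": 0.419, "G": 0.580, "T": -0.803},
    "C": {"A": -0.913, "C": 0.302, "G": 1.812, "T": -0.685},
    "G": {"A": -0.623, "C": 0.461, "G": 0.331, "T": -0.730},
    "T": {"A": -1.169, "C": 0.573, "G": 0.393, "T": -0.679},
}
LOG2_STATIONARY_RATIO = {"A": -0.757, "C": 0.471, "G": 0.550, "T": -0.716}

def score(x):
    """Log-odds score S(x): one stationary-ratio term for the first
    nucleotide, then one log2-beta term per transition."""
    s = LOG2_STATIONARY_RATIO[x[0]]
    for prev, cur in zip(x, x[1:]):
        s += LOG2_BETA[prev][cur]
    return s

print(score("CGCG"))   # positive: CpG-island-like
```

A CG-rich string such as CGCG gets a positive score, while AT-rich strings score negative.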

This is a first instance of what is known in statistics as discriminant analysis: given some training data, we build a model and a discriminating statistic (here, the score) that allows us to guess which group (+ or -) an anonymous observation belongs to.

More about finding the stationary distribution

> Aplus <- matrix(scan(), ncol = 4, byrow = TRUE)
.180    .274    .426    .120
.171    .368    .274    .188
.161    .339    .375    .125
.079    .355    .384    .182
> v1 <- eigen(t(Aplus))$vectors
> eigen(t(Aplus))$values
[1]  1.00034120+0.00000000i  0.11202252+0.00000000i -0.00368186+0.05741808i
[4] -0.00368186-0.05741808i
> v1
             [,1]          [,2]                  [,3]                  [,4]
[1,] 0.3868925+0i  0.5198328+0i  0.5541003-0.5988781i  0.5541003+0.5988781i
[2,] 0.8536965+0i -0.5304369+0i  0.6945813+0.2868043i  0.6945813-0.2868043i
[3,] 0.8749564+0i  0.6644252+0i -1.0666597+0.8310571i -1.0666597-0.8310571i
[4,] 0.3865033+0i -0.6532238+0i -0.1826954-0.5193075i -0.1826954+0.5193075i
> v1/sum(v1[,1])
             [,1]          [,2]                   [,3]                   [,4]
[1,] 0.1546303+0i  0.2077628+0i  0.22145866-0.2393551i  0.22145866+0.2393551i
[2,] 0.3411990+0i -0.2120010+0i  0.27760504+0.1146278i  0.27760504-0.1146278i
[3,] 0.3496960+0i  0.2655525+0i -0.42631454+0.3321506i -0.42631454-0.3321506i
[4,] 0.1544747+0i -0.2610756+0i -0.07301833-0.2075529i -0.07301833+0.2075529i
> A4 <- Aplus %*% Aplus %*% Aplus %*% Aplus
> A4 %*% A4
          [,1]      [,2]      [,3]      [,4]
[1,] 0.1549889 0.3419902 0.3505069 0.1548329
[2,] 0.1551591 0.3423660 0.3508920 0.1550031
[3,] 0.1549995 0.3420136 0.3505309 0.1548435
[4,] 0.1550031 0.3420216 0.3505391 0.1548472
> Aminus <- matrix(scan(), ncol = 4, byrow = TRUE)
.300     .205  .285    .210
.322     .298   .078   .302
.248     .246   .298   .208
.177     .239   .292   .292
> eigen(t(Aminus))$values
[1] 1.00000000+0.00000000i 0.11543346+0.00000000i 0.03628327+0.07470554i
[4] 0.03628327-0.07470554i
> v2 <- eigen(t(Aminus))$vectors
> v2
              [,1]          [,2]                   [,3]                   [,4]
[1,] -0.7229627+0i -0.7235334+0i  0.4323611+0.44531554i  0.4323611-0.44531554i
[2,] -0.6799330+0i  0.3554384+0i -0.1760636+0.31000760i -0.1760636-0.31000760i
[3,] -0.6594832+0i -0.6487878+0i -0.6370600-0.72207947i -0.6370600+0.72207947i
[4,] -0.6982125+0i  1.0168828+0i  0.3807625-0.03324366i  0.3807625+0.03324366i
> v2[,1]/sum(v2[,1])
[1] 0.2618869+0i 0.2462998+0i 0.2388920+0i 0.2529213+0i
> Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus%*%Aminus
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2618870 0.2462997 0.2388921 0.2529212
[2,] 0.2618869 0.2462998 0.2388919 0.2529214
[3,] 0.2618869 0.2462998 0.2388920 0.2529213
[4,] 0.2618868 0.2462998 0.2388920 0.2529214
> round(log2(Aplus/Aminus),3)
       [,1]  [,2]  [,3]   [,4]
[1,] -0.737 0.419 0.580 -0.807
[2,] -0.913 0.304 1.813 -0.684
[3,] -0.623 0.463 0.332 -0.735
[4,] -1.164 0.571 0.395 -0.682
> piminus <- Re(round(v2[,1]/sum(v2[,1]), 3))
> piminus
[1] 0.262 0.246 0.239 0.253
> piplus <- Re(round(v1[,1]/sum(v1[,1]), 3))
> piplus
[1] 0.155 0.341 0.350 0.154
> round(log2(piplus/piminus), 3)
[1] -0.757  0.471  0.550 -0.716
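The eigen-decomposition above can be cross-checked without it: repeatedly squaring the transition matrix makes every row converge to the stationary distribution. A pure-Python stand-in for the R session:

```python
def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

Aminus = [[0.300, 0.205, 0.285, 0.210],
          [0.322, 0.298, 0.078, 0.302],
          [0.248, 0.246, 0.298, 0.208],
          [0.177, 0.239, 0.292, 0.292]]

P = Aminus
for _ in range(5):        # five squarings: P = Aminus to the power 32
    P = matmul(P, P)

# every row of P is now (numerically) the stationary distribution
print([round(p, 4) for p in P[0]])
```

Convergence is fast because the second-largest eigenvalue modulus is only about 0.115.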

First -Order Markov Chain for DNA Sequences

Consider a sequence of nucleotides \(x = \{ A,C,G,G,C,C,A,G,T,A,C,C,G,G\}\). Then,

\[P (x) = P (x_L , x_{L-1} ,... , x_1 ) = P(x_L \mid x_{L-1} , ... , x_1 )P (x_{L-1} , x_{L-2} ... , x_1 )\]

Assume now that the dependence is of first order : Markovian:

\[P(x_L \mid x_{L-1} , ... , x_1 )= P(x_L \mid x_{L-1} )\]
\[\mbox{Then } P (x) = P (x_L \mid x_{L-1} ) P (x_{L-1} \mid x_{L-2} )\cdots P (x_1 ) = P (x_1 )\prod_{i=2}^L a_{x(i-1)x(i)}\]

this is by definition of a first order Markov Chain for DNA.

Note: For a Second-Order Markov Chain for DNA Sequences,

\[P(x_L \mid x_{L-1} , \ldots , x_1 )= P(x_L \mid x_{L-1} , x_{L-2} )\]

could be written

\[P(x_L \mid x_{L-1} , x_{L-2}) = P\bigl((x_{L-1}, x_L) \mid (x_{L-2}, x_{L-1})\bigr)\]

Then, instead of working with single-position states, i.e. \(x = \{ A,C,G,G,C,C,A,G,T,A,C,C,G,G\}\), we could work with 2-position states, i.e. \(x = \{(A,C),(C,G),(G,G),(G,C),(C,C),(C,A),(A,G),(G,T),(T,A),(A,C),(C,C),(C,G),(G,G)\}\)

The Second-Order Markov Chain over the 4 elements \(\{A,C,G,T\}\) is equivalent to a First-Order Markov Chain over the 16 two-position states \((AA),(AG),(AC),(AT),(GA),(GG),(GC),(GT)\), etc.
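A small helper makes the equivalence concrete; this Python sketch rewrites a nucleotide sequence as its overlapping two-position states:

```python
def pair_states(x):
    """Rewrite a nucleotide sequence as its sequence of overlapping
    two-position states: a second-order chain on {A,C,G,T} becomes a
    first-order chain on the 16 pairs."""
    return [x[i] + x[i + 1] for i in range(len(x) - 1)]

print(pair_states("ACGGC"))   # ['AC', 'CG', 'GG', 'GC']
```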

Second Question: Given a Long Stretch of DNA, find the CpG Islands in it

    1. First Approach

* Build the two First-Order Markov chains for the two regions, as above.
* Take windows of the DNA segment, e.g. 100 nucleotides long.
* Compute the log-odds score for each window and check it against the two Markov models; the length of the window may need to be changed.
* Determine the regions with CpG islands.

    2. Second Approach: Integrate the Two Markov Models into One

The Hidden Markov Model used as the model

We distinguish the sequence of states from the sequence of symbols: we enrich the original states A, C, G, T with an island label, so there are now 8 possible states: A+, C+, G+, T+ and A-, C-, G-, T-.

Path \(\pi\) : The state sequence (specified, + or -, state of every nucleotide). \(\pi_i\) is the ith state in the path.

\[\{... A_+ G_+ G_+ C_+ A_- T_- C_- C_- T_- C_- A_- A_- G_- T_- C_- T_+ G_+ A_+ C_+ G_+ C_+ G_+ A_- G_- G_- C_- T_- T_- A_- C_- ...\}\]
  • The states in the path follow a simple Markov Chain

Remark: Beware, the island labels (+/-) on their own do not necessarily follow a Markov chain.

  • Transition Probabilities: \(a_{kl}\)
  • Emissions: the sequence of symbols (nucleotides whose state, + or -, is unspecified)
  • States and symbols are decoupled
  • Emission probability: probability of the emitted symbol \(b\):

\(e_k (b) = P (x_i = b \mid \pi_i = k )\) (equal to 0 or 1 for the CpG island problem)

Think of HMMs as generative models that generate or emit sequences:

Example, Casino: generate random sequences of rolls by

  • simulating the successive choices of die (hidden Markov decision)
  • rolling the chosen die (known probability)

More generally:

  • Choose an initial state \(\pi_1\) according to probability \(a_{0\pi(1)}\)
  • Emit observation according to distribution \(e_k (b)=P(x_1=b\mid \pi_1 = k )\) for that state
  • Then a new state \(\pi_2\) is chosen according to the transition probability \(a_{\pi(1)\pi(2)}\) (+ to +, + to -, - to +, - to -)

The above process generates sequences of random observations in which an overt process (the emitted symbols) is combined with a hidden one (the + or - states). Path \(\pi\) :

\[\{... A_+ G_+ G_+ C_+ A_- T_- C_- C_- T_- C_- A_- A_- G_- T_- C_- T_+ G_+ A_+ C_+ G_+ C_+ G_+ A_- G_- G_- C_- T_- T_- A_- C_- ...\}\]
  • Transition Probabilities : \(a_{kl} = P (\pi_i = l \mid \pi_{i-1} = k )\)
  • Emissions : \(\{... AGGCATCCTA AGTCTGACGCGAGGCTTAC..\}\)

Emission Probability: probability of the emitted symbol \(b\): \(e_k (b) = P (x_i = b \mid \pi_i = k )\)

  • Joint probability of an observed sequence of symbols \(x\) and a state sequence \(\pi\):

\[P(x, \pi)= a_{0\pi(1)} \prod_{i=1}^L e_{\pi(i)} (x_i)\, a_{\pi(i)\pi(i+1)}\]

Example: Sequence of emissions (symbols): \(.... CGCG\); state sequence (path): \(......... C_+ G_- C_- G_+\). Joint Probability = \((a_{0,C_+} )*1*(a_{C_+,G_-})*1*(a_{G_-,C_-})*1*(a_{C_-,G_+})*1*(a_{G_+,0})\)

Example: The Occasionally Dishonest Casino. The casino uses two dice, a well-balanced (fair) one and an unbalanced (loaded) one, with different roll probabilities; the choice of die is the hidden state.

Given: a sequence of nucleotides, e.g. \(CGCG\). The sequence of symbols \(\{CGCG \}\) can be generated by any of the following paths, with very different probabilities: \(\{C_+G_+C_+G_+\}\), \(\{C_-G_-C_-G_-\}\), \(\{C_+G_-C_+G_-\}\), \(\{C_-G_+C_-G_+\}\), \(\{C_-G_+C_+G_-\}\), \(\{C_+G_-C_-G_+\}\), etc.

Find: the sequence of the underlying states, i.e. the path.

Solution: from the set of all possible state sequences that can produce the observed symbols, select the one which maximizes the joint probability of the given sequence of symbols \(x\) and the associated sequence of states (path) \(\pi\), i.e.

\[\mbox{The Most Probable Path }= \pi^* = \mbox{argMax}_{\pi} P (x, \pi)\]

Consider the number of possible paths: if \(L\) is the length of the sequence there are \(8^L\) of them, which grows exponentially with \(L\), so we need a new idea.

The Viterbi Algorithm

An extension of the dynamic programming idea of backward induction, made possible by the Markovian property. The notation can be a little confusing. The number of possible states will be \(S\) (2 for the unfair casino; 3 for the weather example: sunny, cloudy, rainy; 8 for the CpG islands). The transitions between states form an \(S\times S\) stochastic matrix. We then have observables: in the unfair casino, the die roll; in the weather example, the state of the seaweed (soggy, damp, dryish, brittle). For these observables we suppose we have a confusion matrix, the matrix of probabilities of seeing a certain observable given a state, e.g. \(P(6 \mid F)=1/6\) or \(P(soggy\mid rainy)\). In the specific case of the nucleotides we know for sure that if we see an A, the underlying state is either A+ or A-; all other emission probabilities are 0. We saw that we denote the emission probabilities by indexing by the state: \(e_{s}(o)=P(\mbox{see } o\mid \mbox{state is } s)\).

We compute partial probabilities up to a point.

We call \(v_k(i)\) the probability of the most probable path ending in state \(k\) at position \(i\) (having emitted \(x_1,\ldots,x_i\)), for all possible states \(k\).

Calculation of \(v_Z(i)\): the most probable path to state \(Z\) at step \(i\) must pass through A, B, or C at step \(i-1\). The most probable path to Z at step \(i\) will therefore be one of:

* ...sequence of states ending in A, then Z
* ...sequence of states ending in B, then Z
* ...sequence of states ending in C, then Z

Viterbi Algorithm

We want the path ending in A,Z, B,Z or C,Z, with observation X at step \(i\), that has the highest (complete) probability. The path going through A has probability

\[P(\mbox{most probable path to A}) P(Z\mid A) P(\mbox{observation X}\mid Z)\]

The other two probabilities are the same with A replaced by B or C. We have to compute these three probabilities and take the maximum one:

\[P(\mbox{Z at step }i)=\max_{\ell=A,B,C} P(\ell \mbox{ at step }i-1)\times P(Z\mid \ell)\times P(\mbox{observation }X\mid Z)\]
\[v_Z(i)=e_Z(x_i)\max_k \{v_{k}(i-1)\times a_{kZ}\}\]

Step-0: Initialize (i=0): \(v_0 (0) = 1, v_k(0) = 0\) for \(k>0\)

Steps \(i=1,\ldots,L\): \(v_\ell(i)=e_\ell(x_i ) \max_k \{v_k(i-1) a_{k\ell} \}\)

* The Viterbi algorithm employs the strategy of dynamic programming.
* Probabilities should be expressed in log space to avoid underflow errors.
* We need a backward pointer to the state \(k^*\) that gave the maximum in \(v_\ell(i)\). The backward pointer doesn't need the emission probabilities, since for the same state \(\ell\) the emission probability is the same, so the pointer is

\[\phi_i(\ell)=\arg\max_j \, v_j(i-1)a_{j\ell}\]

Actual Example: three states of the weather (Sunny, Cloudy, Rainy) with transition matrix

\[\begin{split} A=\left( \begin{array}{rrrr} &Sunny & Cloudy & Rainy\\ S&0.5 & 0.25 & 0.25\\ C&0.375 & 0.125 & 0.375\\ R&0.125 & 0.675 & 0.375\\ \end{array} \right) \qquad \mbox{Stationary probabilities: } \begin{array}{ccc} Sunny & Cloudy & Rainy\\ 0.63 & 0.17 & 0.20\\ \end{array}\end{split}\]

Four observables: the seaweed is Soggy, Damp, Dryish, or Brittle.
\[\begin{split}\mbox{Confusion /Emission matrix:}\qquad \begin{array}{rrrrr} & Brittle& Dryish &Damp &Soggy\\ Sunny&0.6 & 0.2 & 0.15 & 0.05\\ Cloudy&0.25 & 0.25 & 0.25 &0.25\\ Rainy& 0.05 & 0.10 & 0.35 & 0.50\\ \end{array}\end{split}\]

We saw Brittle, Damp, Soggy... What was the weather over those 3 days? Initialization with the first observation, Brittle:

\[\begin{split}\begin{aligned} P(Brittle\mid sunny)P(sunny)&=&0.6 \times 0.63=0.378\\ P(Brittle\mid cloudy)P(cloudy)&=&0.25 \times 0.17=0.0425\qquad \begin{array}{lr} sunny & 0.378\\ cloudy & 0.0425\\ rainy & 0.01 \end{array} \\ P(Brittle\mid rainy)P(rainy)&=&0.05 \times 0.20=0.01\end{aligned}\end{split}\]

Step one, with second observation Damp. State 1: sunny. \(P(sunny\mid sunny) \times P(Damp\mid sunny) \times 0.378=0.5 \times 0.15 \times 0.378= 0.02835\) \(P(sunny\mid cloudy) \times P(Damp\mid sunny) \times 0.0425=0.375 \times 0.15 \times 0.0425= 0.00239\) \(P(sunny\mid rainy) \times P(Damp\mid sunny) \times 0.01=0.125 \times 0.15 \times 0.01= 0.0001875\)

\[\begin{split}\begin{aligned} v_2(sunny)&=& \max(0.5 \times 0.378, 0.375 \times 0.0425, 0.125 \times 0.01)\times 0.15=0.0284 \; \mbox{sunny(2) points to sunny(1)}.\\ \mbox{State 2}&:& \mbox{cloudy.}\\ v_2(cloudy)&=&\max(0.378\times 0.25, 0.0425\times 0.125, 0.01\times 0.675)\times 0.25=0.024\; \mbox{cloudy(2) points to sunny(1)}\\ \mbox{State 3}&:& \mbox{rainy.}\\ v_2(rainy)&=&\max(0.378\times 0.25, 0.0425\times 0.375, 0.01\times 0.375)\times 0.35=0.033\; \mbox{rainy(2) points to sunny(1)}\end{aligned}\end{split}\]

Step two:

\[\begin{split}\begin{aligned} \mbox{State 1}&:& \mbox{sunny.}\\ v_3(sunny)&=&\max(0.0284\times 0.5, 0.024\times 0.25, 0.033\times 0.125)\times 0.05=0.0142\times 0.05=0.00071 \\&& \mbox{sunny(3) points to sunny(2)}\\ \mbox{State 2}&:& \mbox{cloudy.}\\ v_3(cloudy)&=&\max(0.0284\times 0.25, 0.024\times 0.125, 0.033\times 0.675)\times 0.25=0.022275\times 0.25=0.00556875\\&& \mbox{cloudy(3) points to rainy(2)}\\ \mbox{State 3}&:& \mbox{rainy.}\\ v_3(rainy)&=&\max(0.0284\times 0.25, 0.024\times 0.375, 0.033\times 0.375)\times 0.50=0.012375\times 0.5=0.0061875 \\ && \mbox{rainy(3) points to rainy(2)}\end{aligned}\end{split}\]
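The hand computation can be replayed in code. This Python sketch hard-codes the transition, initial, and emission tables of the weather example and runs Viterbi with backpointers:

```python
STATES = ["Sunny", "Cloudy", "Rainy"]
INIT = {"Sunny": 0.63, "Cloudy": 0.17, "Rainy": 0.20}   # stationary probabilities
TRANS = {"Sunny":  {"Sunny": 0.5,   "Cloudy": 0.25,  "Rainy": 0.25},
         "Cloudy": {"Sunny": 0.375, "Cloudy": 0.125, "Rainy": 0.375},
         "Rainy":  {"Sunny": 0.125, "Cloudy": 0.675, "Rainy": 0.375}}
EMIT = {"Sunny":  {"Brittle": 0.6,  "Dryish": 0.2,  "Damp": 0.15, "Soggy": 0.05},
        "Cloudy": {"Brittle": 0.25, "Dryish": 0.25, "Damp": 0.25, "Soggy": 0.25},
        "Rainy":  {"Brittle": 0.05, "Dryish": 0.10, "Damp": 0.35, "Soggy": 0.50}}

def viterbi(obs):
    """Return (most probable path, its joint probability)."""
    v = [{s: INIT[s] * EMIT[s][obs[0]] for s in STATES}]    # initialization
    back = []                                               # backward pointers
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # best predecessor k, maximizing v_k(i-1) * a_{ks}
            k = max(STATES, key=lambda k: v[-1][k] * TRANS[k][s])
            ptr[s], col[s] = k, v[-1][k] * TRANS[k][s] * EMIT[s][o]
        v.append(col)
        back.append(ptr)
    last = max(STATES, key=lambda s: v[-1][s])              # best final state
    path = [last]
    for ptr in reversed(back):                              # follow the pointers
        path.append(ptr[path[-1]])
    return path[::-1], v[-1][last]

path, p = viterbi(["Brittle", "Damp", "Soggy"])
print(path, p)   # ['Sunny', 'Rainy', 'Rainy']
```

Backtracking from the highest final score (rainy) recovers the path sunny, rainy, rainy, matching the pointers above.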

Advantages of Viterbi: the computational complexity is greatly reduced, and the algorithm is not deterred by noise in the middle of the sequence, since it looks at the whole path before deciding which is best.

Disadvantages: we don't get to see any slightly suboptimal paths. We will remedy this when we look at the Gibbs sampler.

Forward Algorithm

We have a sequence \(x\) for which we would like to compute the probability:

\[P(x)=\sum_\pi P(x,\pi)\]

In fact we can compute this with a similar algorithm, where the maximum is replaced by a sum:

\[f_k(i)=P(x_1,x_2,\ldots x_i, \pi_i=k)\qquad f_\ell(i+1)=e_\ell(x_{i+1})\sum_k f_k(i)a_{k\ell}\]

For a sequence of length \(L\): \(P(x)=\sum_k f_k(L)\).
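Since the forward recursion is just Viterbi with the max replaced by a sum, it is easy to check against brute-force enumeration over all \(S^L\) paths on a tiny example (Python, reusing the weather tables of the Viterbi example):

```python
from itertools import product

STATES = ["Sunny", "Cloudy", "Rainy"]
INIT = {"Sunny": 0.63, "Cloudy": 0.17, "Rainy": 0.20}
TRANS = {"Sunny":  {"Sunny": 0.5,   "Cloudy": 0.25,  "Rainy": 0.25},
         "Cloudy": {"Sunny": 0.375, "Cloudy": 0.125, "Rainy": 0.375},
         "Rainy":  {"Sunny": 0.125, "Cloudy": 0.675, "Rainy": 0.375}}
EMIT = {"Sunny":  {"Brittle": 0.6,  "Dryish": 0.2,  "Damp": 0.15, "Soggy": 0.05},
        "Cloudy": {"Brittle": 0.25, "Dryish": 0.25, "Damp": 0.25, "Soggy": 0.25},
        "Rainy":  {"Brittle": 0.05, "Dryish": 0.10, "Damp": 0.35, "Soggy": 0.50}}

def forward(obs):
    """P(x): the forward recursion, i.e. Viterbi with max replaced by sum."""
    f = {s: INIT[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        f = {s: EMIT[s][o] * sum(f[k] * TRANS[k][s] for k in STATES)
             for s in STATES}
    return sum(f.values())

def joint(obs, path):
    """P(x, pi) for one explicit path."""
    p = INIT[path[0]] * EMIT[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
    return p

obs = ["Brittle", "Damp", "Soggy"]
# brute force: sum P(x, pi) over all 3^3 = 27 paths
brute = sum(joint(obs, pi) for pi in product(STATES, repeat=len(obs)))
print(forward(obs), brute)   # the two numbers agree
```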

Backward Algorithm

Suppose we observe a sequence \(x\) and want to evaluate \(P(\pi_i=k\mid x)\).

\[P(x,\pi_i=k)=P(x_1,\ldots, x_i,\pi_i=k)P(x_{i+1}\ldots x_L\mid \pi_i=k)\]

The second term is called \(b_k(i)\) and the first is \(f_k(i)\); \(b_k(i)\) can be computed backwards from the end of the sequence, in the same way the forward algorithm worked.

\[b_k(i)=\sum_\ell a_{k\ell}e_\ell(x_{i+1})b_\ell(i+1)\qquad P(\pi_i=k\mid x)= \frac{f_k(i)b_k(i)}{P(x)}\]
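A sketch of the backward recursion and posterior decoding on the same toy weather tables (Python); a useful sanity check is that the posteriors \(f_k(i)b_k(i)/P(x)\) sum to 1 at every position:

```python
STATES = ["Sunny", "Cloudy", "Rainy"]
INIT = {"Sunny": 0.63, "Cloudy": 0.17, "Rainy": 0.20}
TRANS = {"Sunny":  {"Sunny": 0.5,   "Cloudy": 0.25,  "Rainy": 0.25},
         "Cloudy": {"Sunny": 0.375, "Cloudy": 0.125, "Rainy": 0.375},
         "Rainy":  {"Sunny": 0.125, "Cloudy": 0.675, "Rainy": 0.375}}
EMIT = {"Sunny":  {"Brittle": 0.6,  "Dryish": 0.2,  "Damp": 0.15, "Soggy": 0.05},
        "Cloudy": {"Brittle": 0.25, "Dryish": 0.25, "Damp": 0.25, "Soggy": 0.25},
        "Rainy":  {"Brittle": 0.05, "Dryish": 0.10, "Damp": 0.35, "Soggy": 0.50}}

def forward_cols(obs):
    """All forward columns f_k(i)."""
    f = [{s: INIT[s] * EMIT[s][obs[0]] for s in STATES}]
    for o in obs[1:]:
        f.append({s: EMIT[s][o] * sum(f[-1][k] * TRANS[k][s] for k in STATES)
                  for s in STATES})
    return f

def backward_cols(obs):
    """All backward columns b_k(i), starting from b_k(L) = 1."""
    b = [{s: 1.0 for s in STATES}]
    for o in reversed(obs[1:]):        # o plays the role of x_{i+1}
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][o] * b[0][l] for l in STATES)
                     for k in STATES})
    return b

obs = ["Brittle", "Damp", "Soggy"]
f, b = forward_cols(obs), backward_cols(obs)
px = sum(f[-1].values())
# posterior decoding: P(pi_i = k | x) = f_k(i) b_k(i) / P(x)
posteriors = [{k: f[i][k] * b[i][k] / px for k in STATES}
              for i in range(len(obs))]
print(posteriors[1])
```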

Estimation of the Hidden Markov Model


If the state sequence is known, we can count the number of times state \(l\) followed state \(k\) (a matrix \(A_{kl}\)) and the number of times state \(k\) emitted observable \(b\) (a matrix \(E_k(b)\)); maximum likelihood theory then gives as estimates for \(a\) and \(e\):

\[a_{kl}=\frac{A_{kl}}{\sum_{l'}A_{kl'}} \qquad e_{k}(b)=\frac{E_{k}(b)}{\sum_{b'}E_{k}(b')}\]

We may have to use pseudocounts (equivalent to a Dirichlet prior) if the numbers are small or zero.
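For the fully observed case this estimation takes only a few lines; the sketch below (Python, with a hypothetical fair/loaded-die path as training data) adds a pseudocount to every cell before normalizing:

```python
def estimate(states, symbols, path, emitted, pseudo=1.0):
    """Maximum-likelihood estimates of transition probabilities a and
    emission probabilities e from a fully observed state path, with a
    pseudocount (Dirichlet prior) in every cell to avoid zeros."""
    A = {k: {l: pseudo for l in states} for k in states}
    E = {k: {b: pseudo for b in symbols} for k in states}
    for k, l in zip(path, path[1:]):       # transition counts A_kl
        A[k][l] += 1
    for k, b in zip(path, emitted):        # emission counts E_k(b)
        E[k][b] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e

# hypothetical fair (F) / loaded (L) data: a known path and its rolls
a, e = estimate("FL", "123456", "FFFLLLFF", "31466662")
```

With the pseudocounts, every transition and emission probability stays strictly positive even for events never seen in the training path.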


Baum-Welch, or EM specialized to HMM

We want to find the parameters (the model) that maximize

\[\log P(x\mid \theta)=\log \sum_\pi P(x,\pi\mid \theta)\]

Here \(x\) stands for all the observations, including all the different sequences \(x^j, j=1,\ldots,n\). We build a sequence of models \(\theta^t\); the missing data are the paths \(\pi\). The Baum-Welch algorithm calculates \(A_{kl}\) and \(E_k(b)\) as the expected number of times each transition and emission occurs, given the training sequences:

\[P(a_{kl} \mbox{ used at position } i \mbox{ in sequence } x) = P(\pi_i=k, \pi_{i+1}=l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}\]

thus the estimate by expectation (averaging over positions and sequences) is:

\[A_{kl}=\sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)a_{kl}e_l(x^j_{i+1})b^j_l(i+1)\]

and similarly for the expected emission counts:

\[E_{k}(b)=\sum_j \frac{1}{P(x^j)} \sum_{i\mid x_i^j=b} f_k^j(i)b^j_k(i)\]

Initialize at arbitrary parameter values. For each sequence \(j=1\ldots n\):

  • Calculate \(f_k(i)\) for sequence \(j\) using the forward algorithm.
  • Calculate \(b_k(i)\) for sequence \(j\) using the backward algorithm.
  • Add the contribution of sequence \(j\) to \(A\) and \(E\).

Calculate the new model parameters from the new \(A\) and \(E\), then the new log likelihood of the model. Stop when the incremental change is small, or after enough iterations.
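The loop above can be sketched for a single training sequence (Python, with made-up fair/loaded-die data; the initial distribution is held fixed for simplicity, and no log-space scaling is done, so this is only suitable for short sequences):

```python
STATES = ["F", "L"]            # hypothetical fair / loaded die states
SYMS = "123456"

def forward_cols(obs, init, A, E):
    f = [{k: init[k] * E[k][obs[0]] for k in STATES}]
    for o in obs[1:]:
        f.append({l: E[l][o] * sum(f[-1][k] * A[k][l] for k in STATES)
                  for l in STATES})
    return f

def backward_cols(obs, A, E):
    b = [{k: 1.0 for k in STATES}]
    for o in reversed(obs[1:]):
        b.insert(0, {k: sum(A[k][l] * E[l][o] * b[0][l] for l in STATES)
                     for k in STATES})
    return b

def baum_welch_step(obs, init, A, E):
    """One EM iteration: expected transition/emission counts from the
    forward and backward columns, then re-normalize."""
    f, b = forward_cols(obs, init, A, E), backward_cols(obs, A, E)
    px = sum(f[-1].values())                       # P(x) under current model
    Akl = {k: {l: sum(f[i][k] * A[k][l] * E[l][obs[i + 1]] * b[i + 1][l]
                      for i in range(len(obs) - 1)) / px for l in STATES}
           for k in STATES}
    Ekb = {k: {s: sum(f[i][k] * b[i][k]
                      for i in range(len(obs)) if obs[i] == s) / px
               for s in SYMS} for k in STATES}
    A = {k: {l: Akl[k][l] / sum(Akl[k].values()) for l in STATES} for k in STATES}
    E = {k: {s: Ekb[k][s] / sum(Ekb[k].values()) for s in SYMS} for k in STATES}
    return A, E, px

obs = "315266646146623"        # made-up roll sequence
init = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
E = {"F": {s: 1 / 6 for s in SYMS},
     "L": {s: 0.5 if s == "6" else 0.1 for s in SYMS}}

likelihoods = []
for _ in range(10):
    A, E, px = baum_welch_step(obs, init, A, E)
    likelihoods.append(px)     # EM never decreases the likelihood
```

Tracking `likelihoods` across iterations is the usual stopping criterion: iterate until the increments become negligible.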

Structure of HMMs

Prior knowledge should be used as optimally as possible.


* Length of extent of the model (staying with the same nucleotide frequencies, for instance):

  • Exponentially decaying (geometric) length: \(P (\mbox{k residues}) = (1-p)p^{k-1}\)
  • Defined range of lengths, e.g. a model with lengths distributed between 2 and 8
  • Non-geometric length distribution, e.g. an array of \(n\) states

* Silent (or null) states, e.g. Begin and End states, do not emit any symbols. They are used to simplify the models and to reduce the number of transition probabilities to estimate from training data.

No loops consisting entirely of silent states are allowed. All HMM-related algorithms can be easily adjusted to incorporate silent states.