STANFORD CS 224S/LINGUIST 285   -     Spring 2014
Homework 2: Viterbi in ASR
Due: April 15 at 2:00pm, i.e. 15 minutes before the start of class.

WARNING: Please read this entire page before you start!!!!!

Today's homework is to decode a mystery sequence of digits that are originally from a speech file.

You are going to do this by implementing the Viterbi decoding algorithm, and applying it to a file we will give you that contains phone likelihoods from GMMs, together with a lexicon.

The output of your program should be the correct word sequence and the path probability of the most-likely path.

What you should turn in: the correct word sequence, the path probability of the most-likely path, and your code.

You may write your code in any language you want.

You will need the following three input files:

  1. The lexicon file
  2. The phone set for this exercise
  3. The phone likelihoods

The lexicon file just has the string of phones for the pronunciation of each word, separated by spaces, with a pound-sign (#) at the end of each line. Notice that there is a special phone for silence, called SIL:

silence         SIL #
oh              OW #
zero            Z IH R OW #
one             W AH N #
two             T UW #
three           TH R IY #
four            F AO R #
five            F AY V #
six             S IH K S #
seven           S EH V AX N #
eight           EY TD #
nine            N AY N #

The phone file gives the set of phones in order. The phone set is a version of the ARPAbet. Unlike a real system which would represent phonetic information as triphones or even up to quintaphones, this homework assumes a simple monophone system.

The likelihoods file has the output from the Gaussian models in the following format:

 FRAME  PHONE STATE   LOG P(O|Q,state)
    0   AA      0    -48.804
    0   AA      1    -50.494
    0   AA      2    -50.240
    0   AE      0    -47.295
    0   AE      1    -49.929
    0   AE      2    -46.059
    0   AH      0    -51.344
    0   AH      1    -50.410
    0   AH      2    -47.449
    0   AO      0    -54.502
...
    1   ZH      1    -46.419
    1   ZH      2    -48.539
    2   AA      0    -52.124
...
  557   ZH      2    -59.244

The first number is the frame number; there are 558 frames in this utterance. Since frames are every 10 milliseconds, this means that the utterances was 5.58 seconds long. The second symbol is the phonetic unit. The third number is the state (this is a three-state HMM). The last number is the log-probability (base 10) of observing the feature vector given the base phoneme and HMM state.

Your Viterbi implementation will need to read in the observations file, the phone file, and the lexicon, create a Viterbi_prob matrix, and then fill it out from left to right. You will need to keep an array of back-pointers so when the matrix is complete, you can start from the end and find the most likely sequence of words.

Your output should look something like this:

The best path probability is -500000.00
The best path contains the following words:  seven three four etc etc  (or whatever)

Hint: In order to get the correct answer, you will need to add a word transition penalty of -50 to each path every time it transitions from word to word.

Hint 2: The HMM structure you should assume for each word is technically a "left-to-right non-skip self-loop 3-state HMM". That means that each phone has exactly three substates, that you can't skip any of these states, and that they all have a self-loop. In addition, I assume all the state transition probabilities are 1.0. Technically this means that they aren't true probabilities (if they were true probabilities, they would have to be 0.5, since the two arcs leaving each state must sum to 1). But in practice, since all that matters is that they are all equal, it makes things simpler to just assign everything a probability of 1, since the log of 1 is 0; that means you can just ignore the probabilities altogether, except for the word transition penalty. Anyhow here's an example for the word "two":

Hint 3: You'll need to keep track, for each node of your lattice, which state (subphone) of which model (phone) of which word it corresponds to. This is automatically available from the row number in the viterbi lattice. There are exactly three transition cases you need to consider in your Viterbi algorithm as you move from one frame (column) to the next

  1. Staying in the same state
  2. (only if you are not in the last state of the current word) Transitioning to the next state in the same word
  3. (only if you are in the last state of the current word) Transitioning to the first state of every possible word.

Hint 4: Some frequently asked questions for this homework:

  • Q: should we create separate HMMs for each word, or one big HMM?
  • A: Mathematically it's one big HMM, but as mentioned above you need to represent the separate words in your state column in your lattice, since you need to know which state of which word you are in.

  • Q: Is the "f" of "four" the same state as the "f" of "five"?
  • A: No. You keep a distinct state for each phone of each word. Those two 'f's WILL share the same observation likelihood, i.e. the same "b" matrix. BUT they will have different viterbi scores because they correspond to different rows in the Viterbi lattice, and will be in different paths.

    Hint 5: If you would like to check that your code is correct, we have provided an additional output file, available here which you may run your code on. The correct answer for this file is:

    silence
    three
    one
    silence
    one
    five
    eight
    two
    six
    five
    six
    one
    three
    six
    seven
    five
    silence
    silence
    silence
    silence

    log probability: -19202.51304840347


    NOTE: Please note that in order to match this output, you must force your Viterbi implementation to end in an end state (instead of ending mid-word).

    How to submit the homework:
    Format the correct word sequence and the path probability as shown above, i.e. each word or silence goes into a separate line. Then put one newline character between the words and the log probability line. Name the file as result.txt. Here is an example result.txt. Put the result.txt and your code into a folder. Include a README file which tells us how to run your code. Submit the folder using the submit script from HW1. Remember to use pa2 when prompted.