Please read this entire page before beginning. And please do start early on this assignment.

For this programming assignment, you'll work in groups of 2-3 to build a speech recognizer with Kaldi.

Kaldi is a powerful ASR toolkit written in C++ and used in speech recognition research here at Stanford to build state-of-the-art systems, alongside many other techniques (which you'll learn more about in Andrew's lecture). Kaldi might also be a useful tool if you decide to build your final project on top of a well-written speech recognition system but lack the time to write one from scratch (as we always do). Detailed documentation and tutorials for Kaldi can be found here; you may need to get familiar with them if your final project involves Kaldi.

In this homework, you'll learn to

Next we'll walk you through the setup, then the tasks we would like you to explore, followed by the deliverables and finally the submission guide.


  1. Log into Stanford FarmShare (corn) with your SUNet ID and password. (Students who are less familiar with SSH or FarmShare should feel free to drop by one of our office hours.)

     ssh <Your_SUNet_ID>       
  2. We've provided you a script to set up the environment for this assignment. Simply type

     /afs/ir/class/cs224s/hw_setup hw3

    in your command line window, then press Enter. Note that this might take a while to finish, since it needs to copy the compiled Kaldi binaries.

  3. Navigate to the directory that contains the scripts for this assignment

     cd cs224s/hw3/kaldi-trunk/egs/tidigits/s5

    Side Note: You might have already noticed that Kaldi also has various other example directories under egs for different speech datasets. Those may turn out to be useful in later assignments or your final project.

  4. Use your favorite text editor to view the content of (which we'll be primarily dealing with throughout this assignment)


    or, simply execute this script


    Note: Running the script typically takes 2-3 minutes. You will see a lot of output; most of it comes from Kaldi scripts checking the integrity of the data or verifying the results of the data preparation steps. When the script finishes, it prints two lines at the bottom, %WER 12.32 [ 3522 / 28583, 1096 ins, 377 del, 2049 sub ] exp/mono0a/decode/wer_19 and %WER exp/mono0a/decode/wer_19:%SER 24.39 [ 2122 / 8700 ], which report the word error rate and sentence error rate, respectively, of the baseline system that we've provided you.

    You might have noticed that our baseline recognizer did a very sloppy job recognizing digits -- it got roughly one out of every eight digits wrong! That means that for a typical 16-digit credit card number, your phone banking system gets at least one digit wrong, and makes you repeat the number, roughly 88% of the time (assuming errors on different digits are independent).
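The credit card estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every digit is misrecognized independently with probability equal to the baseline WER and ignores the difference between insertions, deletions, and substitutions. The function name prob_any_error is ours and is not part of the starter code.

```python
# Assumption: digit errors are independent and occur at the baseline WER.
def prob_any_error(per_digit_error: float, num_digits: int) -> float:
    """Probability that at least one of num_digits digits is wrong."""
    return 1.0 - (1.0 - per_digit_error) ** num_digits


# Counts taken from the %WER line printed by the baseline script.
baseline_wer = 3522 / 28583
card_failure = prob_any_error(baseline_wer, 16)

print(f"baseline WER: {baseline_wer:.2%}")
print(f"chance a 16-digit number has an error: {card_failure:.1%}")
```

Plugging in your own WER after each task gives a quick sense of how usable the recognizer would be for 16-digit numbers.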

Your Tasks

1. Improve the monophone acoustic model

In the example script we've given you, a simple monophone acoustic model was built. That is, for each phone to be recognized, it made no use of contextual information about the preceding or succeeding phones (which we will try to fix in a bit).

If you've carefully scanned the content of, you'd find that near the end of that script, we've highlighted a section of commands with a brief introduction on the training process of the acoustic model. There, you'll find the following lines

steps/  --nj 4 --cmd "$train_cmd" \
    data/$monodir data/lang exp/mono0a 

This is calling the script steps/, which can be found in the steps subdirectory of our s5 working directory. It is one of the training scripts that Kaldi shares among recipes for different datasets (which we won't discuss in this assignment, but which might help with your final project).

In this file, we've highlighted the training configuration part that you will need to modify to improve your system performance with ## YOUR CODE HERE and ## END YOUR CODE. The meanings of different parameters you might tune to improve system performance are explained in the comments that follow, and you should adjust them with what you learned about acoustic models in class.

Report the adjustments you've made, and justify your modifications both with reasoning grounded in ASR knowledge and with the resulting system performance.

2. Improve feature extraction

In the above training of the monophone acoustic model, we used raw MFCC features extracted from the audio files. Although you may have improved system performance through parameter tuning, the features themselves might still not be ideal.

In practice, speech recognition systems must cope with all sorts of environments, where factors such as microphone quality, other people talking in the background, or, probably the one you'd least expect, refrigerator noise can greatly hurt performance.

To cope with such noise, we usually perform feature normalization after feature extraction is done. In, you'll find a section commented as # Now make MFCC features, and there you'll find this line of script

 steps/ --fake data/$x exp/make_mfcc/$x $mfccdir || exit 1;

In the starter script we added the --fake flag to the CMVN (cepstral mean and variance normalization) step to essentially skip that step. Now remove that flag, report your observations, and briefly explain in your report what happens.
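To build some intuition for what mean and variance normalization does, here is a minimal Python sketch. It is not Kaldi's implementation: it normalizes a single toy utterance, whereas Kaldi typically accumulates the statistics per speaker, and the feature values and the function name cmvn are made up for illustration. The toy example models a channel difference as a constant offset added to every frame, which is how a fixed linear channel shows up in the cepstral domain.

```python
from statistics import fmean, pstdev


def cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance.

    frames: list of equal-length feature vectors, one per frame.
    """
    dims = list(zip(*frames))                  # one tuple per feature dimension
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) or 1.0 for d in dims]    # avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Hypothetical 2-dimensional features for three frames of one utterance.
clean = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
# The same utterance recorded through a channel that adds a constant offset.
offset = [[x + 10.0 for x in frame] for frame in clean]

print(cmvn(clean))
print(cmvn(offset))
```

Both calls print the same normalized features, which is the point: after normalization the constant channel offset no longer distinguishes the two recordings.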

3. Try different training data

So far, you should have a good monophone acoustic model-based digit recognizer working. What's your word error rate (WER)? How practical is your system for bank cards (assume 16 digits)?

In case you're wondering whether there are other ways to improve your system, here's a quick and easy one. You might have noticed, while reading the script, that we've provided you with a way to switch among different training sets. By default, your system has so far been trained on a reduced version of the TIDIGITS training set, which consists of only 10 male speakers out of the 112 men and women in the full TIDIGITS training set.

Specifically, four training sets were provided to you in this assignment

1 (Default)   10 speakers, male only
2             10 speakers, female only
3             10 speakers, 5 male and 5 female
4             112 speakers, 55 male and 57 female (Full TIDIGITS training set)

You can switch the training set easily by running the script with the training set number as the argument, for example

bash 4

will train your system on the full TIDIGITS training set.

Report your system performance (with/without tuning the monophone training script) x (with/without feature normalization), that is, all four combinations, on all four training sets. Explain the changes across datasets, as well as across different settings in your report. The objective of this task is to familiarize you with (a small part of) the difficulties we face in real speech recognition systems, as well as the various tools we could use to tackle (some of) them.
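One way to keep track of the runs this task asks for is to enumerate them up front. The sketch below only lists the combinations; it does not run Kaldi, and the variable names are our own.

```python
from itertools import product

training_sets = [1, 2, 3, 4]       # the four training sets listed above
tuned_options = [False, True]      # monophone training parameters tuned or not
cmvn_options = [False, True]       # feature normalization enabled or not

# Every (training set, tuned, normalized) combination to report.
grid = list(product(training_sets, tuned_options, cmvn_options))

for train_set, tuned, normalized in grid:
    print(f"training set {train_set}: tuned={tuned}, normalized={normalized}")
print(f"{len(grid)} runs in total")
```

That is four settings on each of four training sets, so a small results table with one row per training set and one column per setting fits the report well.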

In case you've overwritten our starter code scripts/ for good, here are our original settings you might need for the comparisons

num_iters=10    # Number of iterations of training
max_iter_inc=8 # Last iter to increase #Gauss on.
totgauss=100 # Target #Gaussians.  
boost_silence=1.0 # Factor by which to boost silence likelihoods in alignment
realign_iters="1 4 7 10"; # Iterations on which the frames are realigned

4. Adding extra training steps (Last Updated: Apr 22, 1:00am)

You might have noticed that in, there's one section marked with ## YOUR CODE HERE that we haven't covered yet. In the main training process, we commented out some code that trains a decision-tree-based triphone acoustic model on delta features, building on the results we got from the simple monophone one. The commented code should look like the following

steps/ --nj 4 --cmd "$train_cmd" \
    data/train data/lang exp/mono0a exp/mono0a_ali

steps/ --cmd "$train_cmd" \
    10 100 data/train data/lang exp/mono0a_ali exp/tri1

utils/ data/lang exp/tri1 exp/tri1/graph
steps/ --nj 10 --cmd "$decode_cmd" \
    exp/tri1/graph data/test exp/tri1/decode

Uncomment them, and look up steps/ to understand the meanings of the two numbers in the argument list. Tune those numbers in, and report your best system performance in your report as well as in a separate text file (see Submission Guide for details). Also, explain the changes you observe in system performance.

If you are unfamiliar with the idea of decision-tree based triphone acoustic models: due to the large volume of all possible triphones we could have, it is a common practice to cluster them into a decision tree, treat the tree leaves as the "new set of phones" (called senones) and build an acoustic model for them with GMMs (this should sound similar to what you've finished with monophone models).
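To see why clustering is needed, it helps to count how many triphones there could be. The phone inventory sizes below are hypothetical round numbers chosen for illustration, not the size of the TIDIGITS phone set.

```python
def num_possible_triphones(num_phones: int) -> int:
    """Each triphone is a (left context, center phone, right context) triple."""
    return num_phones ** 3


# Hypothetical inventory sizes, for illustration only.
for num_phones in (20, 40):
    print(f"{num_phones} phones -> {num_possible_triphones(num_phones)} possible triphones")
```

Most of these triples never occur, or occur only a handful of times, in a small training set, which is why they are tied together at the leaves of a decision tree instead of being modeled separately.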

Note: For this part and this part only, you might want to use training set #4 to tune the parameters. For possible explanations see Q5 in FAQ.

Required for 3-person groups: Groups of three are also required to use the commented-out line near the bottom in to compare the output with and without the delta-feature triphone acoustic model (code shown below), and to reason about the observations based on what you know about such models. (This may count for 5% extra credit for 2-person groups.)

# Example of looking at the output.
utils/ -f 2- data/lang/words.txt  exp/tri1/decode/scoring/19.tra | sed "s/ $//" | sort | diff - data/test/text

To show your transcript from the monophone model, you can use the code below (changing tri1 to mono0a from the line above)

utils/ -f 2- data/lang/words.txt  exp/mono0a/decode/scoring/19.tra | sed "s/ $//" | sort | diff - data/test/text

Extra credit

You may earn up to 10% extra credit in this assignment by doing one of the following:

At most 1 extra page is allowed for extra credit writeups.

Deliverables of this Assignment

Submission Guide


FAQ

Q1. Do we tune our parameters only once in step 1 and then use those same values for steps 2, 3, and 4? Or do we retune our parameters in each step? Must we use the default training set to tune our parameters or can we use any of the 4 given?

For simplicity, you just need to tune the parameters on the first dataset, and report the results for all datasets with that setting. More experiments with other datasets are encouraged, but definitely not required (again, the competition grading shouldn't be any concern of this HW).

Clearly this is not the optimal approach, but given what you observe in the experiments, it should not seem unreasonable.

Q2. Vagueness in Task 4 description?

To be precise, the commented-out lines we want you to uncomment and try out are the following (these lines enable training with delta features)

  steps/ --nj 4 --cmd "$train_cmd" \
          data/train data/lang exp/mono0a exp/mono0a_ali

  steps/ --cmd "$train_cmd" \
      10 100 data/train data/lang exp/mono0a_ali exp/tri1

  utils/ data/lang exp/tri1 exp/tri1/graph
  steps/ --nj 10 --cmd "$decode_cmd" \
      exp/tri1/graph data/test exp/tri1/decode

Q3. I don't understand some of the parameters.

Please refer to these two Piazza posts (most other parameters should be more intuitive, but feel free to ask about them as well; we'd be happy to clarify).

Q4. WER and SER?

At the end of the training script, several lines are printed, some containing WERs (word error rates) and others SERs (sentence error rates). The definition of WER can be found in the slides; SER is the number of sentences you got wrong (those with any error in your transcript) divided by the total number of sentences. In our case, each sentence is one speech file of spoken digits in the test set.
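The two metrics can be illustrated with a short Python sketch. It is not Kaldi's scoring code; it assumes the standard definition of WER as the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words, and the toy transcripts and function names are made up.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # match or substitution
        prev = curr
    return prev[-1]


def wer_and_ser(refs, hyps):
    """Return (word error rate, sentence error rate) for paired transcripts."""
    pairs = [(r.split(), h.split()) for r, h in zip(refs, hyps)]
    word_errors = sum(edit_distance(r, h) for r, h in pairs)
    ref_words = sum(len(r) for r, _ in pairs)
    wrong_sentences = sum(r != h for r, h in pairs)
    return word_errors / ref_words, wrong_sentences / len(pairs)


# Toy reference and hypothesis transcripts (hypothetical data).
refs = ["one two three", "four five", "six"]
hyps = ["one too three", "four five", "six six"]
wer, ser = wer_and_ser(refs, hyps)
print(f"WER = {wer:.2%}, SER = {ser:.2%}")
```

Here one substitution and one insertion over six reference words give a WER of one third, while two of the three sentences contain an error, so the SER is two thirds. This also shows why SER is always at least as pessimistic per sentence: a single wrong word makes the whole sentence count as wrong.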

Q5. Worse results after using delta features?

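Delta features raise the feature dimensionality and triphone models add many more states, so each parameter is estimated from sparser data. On a small training set this might lead to overfitting, which can be ameliorated by increasing the number of training instances.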