LINGUIST 138/238 - SYMBSYS 138 - Autumn 2004
Homework 5
Due: Thursday November 4 (not Tuesday this time!) at the start of class
Read this entire page before starting!! Note that homework is due Thursday, not Tuesday!
5.1) Do problem 7.1 in the reading.
5.2) Find a speech recognizer and analyze its errors. For example, OS X comes with speech input; if you have a Mac available, try that. If you can't find one, call the "TellMe" speech recognizer again (1-800-555-TELL) or the United Airlines flight information line (1-800-824-6200). You may have to help create some errors (introduce noise, or ask friends with foreign accents to talk), or you may get errors quite naturally. In any case, write down any errors or misrecognitions that occur; get at least 10 total errors. Then, as in problem 7.1 in the reading, analyze each error for whether you think it might be caused by feature extraction (i.e. noise), pronunciation modeling, lexicon size, the language model, or pruning in the decoding search.
5.3) Transcribe the following two wavefiles at the word level (that is, write down the words that occur in each utterance). Make sure to listen to them carefully, and more than once. If you have trouble listening to them, let me know immediately.
5.4) Now open both files in Praat, the speech analysis program we used in class on Thursday, October 28. Transcribe both files into the ARPAbet, using Praat to help you play pieces of each wavefile and to look at the waveform and the spectrogram. Turn in both the ASCII ARPAbet sequence (just type it into your homework answers) and a PDF of a Praat labeled file as we saw in class. This is very hard, so I don't expect you to be perfect; I just want you to try to listen carefully for what's happening in each file.
You may use an on-line ARPAbet dictionary to help you. Here is the CMU dictionary. But most words in the above sentences will not be pronounced the same way they are in the dictionary! So be careful not to just copy the pronunciation from the dictionary (besides the fact that CMU uses a slightly different version of the ARPAbet than the one I handed out in class).
Getting Praat: Praat itself is here; it is free and very simple to download, just grab the executable.
A quick Praat intro, written by Edward Flemming, is here.
A longer Praat tutorial is here.
5.5) This problem involves language modeling. Your assignment is to build two bigram grammars, one from each of two corpora, and generate 10 random sentences from each of your two bigram grammars. The first corpus is the 1 million word Wall Street Journal corpus /afs/ir/class/linguist238/WWW/restricted/wsj.txt, and the second corpus is the complete works of Shakespeare, also about a million words, at /afs/ir/class/linguist238/WWW/shaks12.txt.
HINTS: First, you should compute bigram word probabilities from each corpus. You will have to do some cleanup (for example, remove comments and the like, and put spaces around all the punctuation). You can either write one large program to do all this, or you can do it quite quickly using various convenient UNIX utilities! For example, one useful thing you will need is to put every word on a separate line, which you can do with this command:

tr ' ' '\012'

Before that, you can use perl or java regular expressions to add a space around each punctuation mark, so that each mark appears on its own line.
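If you would rather do the cleanup in one place instead of the perl or java one-liner suggested above, here is a minimal sketch in Python; the punctuation set and the sample sentence are just illustrations, not part of the corpus:

```python
import re

def tokenize(text):
    # Add a space around each punctuation mark so it becomes its own token,
    # then split on whitespace -- the same effect as the tr command above,
    # which puts each word on its own line.
    text = re.sub(r"([.,!?;:\"'()])", r" \1 ", text)
    return text.split()

print(tokenize("Stocks fell sharply, analysts said."))
# ['Stocks', 'fell', 'sharply', ',', 'analysts', 'said', '.']
```

You would run this over each line of the corpus file before counting.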
Once you have a cleaned-up file with one word on each line, you can make a second copy of the file, delete the first line in emacs, and paste the two files together using the unix command paste:
paste wordlist wordlistmissingfirstword > bigramlist

The resulting file will have one line for each bigram in the corpus. Now you can write a program to count how many times each bigram occurs and how many times each unigram occurs, and divide correctly to get the bigram probability.
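The counting-and-dividing step can be sketched in Python; the toy word list below stands in for your cleaned-up corpus:

```python
from collections import Counter

def bigram_probs(words):
    # Count unigrams and bigrams, then divide:
    #   P(w2 | w1) = Count(w1 w2) / Count(w1)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

words = "the cat sat on the mat".split()
probs = bigram_probs(words)
print(probs[("the", "cat")])  # 0.5: "the" occurs twice, once followed by "cat"
```

Note that the last token in the corpus starts no bigram, which is why its unigram count can exceed the sum of its bigram counts.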
Hint 2: to generate random sentences, choose a random number between 0 and 1, and use it to pick a word with the appropriate probability. For example, you could store each word in your unigram list with its own probability, along with a running total of the probabilities of all earlier words in the list. Then you just walk down the list until the running total passes your random number. Once you have the first word of your sentence, you switch to choosing bigrams with a similar method. Alternatively, you can create a special "first and last word" token, called SENTENCEBOUNDARY or something like that, and then you can do everything with bigrams instead of fooling around with unigrams. That will also make it easier to know when your sentence has ended.
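The SENTENCEBOUNDARY idea can be sketched as follows; this is a minimal Python illustration on a two-sentence toy corpus, not the full assignment:

```python
import random
from collections import Counter, defaultdict

BOUNDARY = "SENTENCEBOUNDARY"  # special first-and-last token, as suggested above

def build_bigrams(sentences):
    # Wrap each sentence in boundary tokens so that generation can both
    # start and stop using bigrams alone.
    counts = defaultdict(Counter)
    for s in sentences:
        toks = [BOUNDARY] + s.split() + [BOUNDARY]
        for w1, w2 in zip(toks, toks[1:]):
            counts[w1][w2] += 1
    return counts

def generate(counts):
    # Walk down the successor list, accumulating counts, until the running
    # total passes a random number drawn from [0, total).
    word, sentence = BOUNDARY, []
    while True:
        successors = counts[word]
        r = random.random() * sum(successors.values())
        running = 0
        for w, c in successors.items():
            running += c
            if running > r:
                word = w
                break
        if word == BOUNDARY:
            return " ".join(sentence)
        sentence.append(word)

counts = build_bigrams(["the cat sat", "the dog sat"])
print(generate(counts))  # either "the cat sat" or "the dog sat"
```

Sampling from the cumulative counts directly saves you from normalizing to probabilities first; the two approaches are equivalent.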