STANFORD CS 224S/LINGUIST 285 - Spring 2014
Homework 1: Speech Systems and Phonetics
Due: April 8 at 2:00pm, i.e. 15 minutes before the start of class.

Please read this entire page before beginning.

For these exercises, you should work in groups of 3. Non-native speakers, please team up with a native speaker. Do your write-up together, and put all of your names on the write-up.

  1. For the first three problems, you are going to investigate the performance and errors of speech-enabled personal assistant apps. You can use Siri, Google Now, or any similar speech-based personal assistant on a phone or other mobile device (or use multiple devices and compare!).

    First, write a couple of texts or emails. What is the rough speech recognition word error rate? (Roughly, this is the fraction of words that are wrong; more technically, it is the word-level edit distance, i.e. the number of substitutions + deletions + insertions between the transcribed sentence and the correct word string, divided by the number of words in the correct word string.) Can you characterize what's going on with the errors? Try to "barge in" (i.e. talk while the system is talking to you). Does the system allow barge-in?

  2. Make and cancel some calendar appointments. Again, analyze any errors: did they fail because of speech recognition or the natural language or dialog components? If the natural language or dialog, what went wrong?

  3. Try to find a business (a restaurant, for example). Again, analyze any errors: did they fail because of speech recognition or NL/dialog? If NL/dialog, what went wrong?

  4. Now check out some TTS systems.

    For example, on an Android device, you can download the Google Text to Speech App. On an iOS device, go to the Settings page, select General > Accessibility, and turn on Speak Selection. Now when you highlight any text, it will give you a Speak option that you can click. On a Mac, you can use the "say" command on the Unix command line in a terminal window.

    Test out the TTS by choosing 4 different sentences. Try to be creative, including questions, exclamations, or whatever. Write down at least 5 errors that you hear; note whether these errors are due to wrong phones, due to incorrect stress, or due to a problem with the intonation/prosody.

  5. Find and correct the mistakes in the ARPAbet transcriptions of the following words:

  6. Transcribe the following words into the ARPAbet.

  7. Transcribe the following two wav files at the word level (that is, write down the words that occur in the utterance). Make sure to listen to them carefully and more than once. If you have trouble listening to them, let us know immediately.

    1. Utterance from Boston Radio News corpus
    2. Utterance from Switchboard corpus

  8. Now open both files in Praat, the speech analysis program we used in class on Thursday April 3. Transcribe both files into the ARPAbet, using Praat to help you play pieces of each wav file, and to look at the waveform and the spectrogram. (In fact, you can use Praat to play the files for the previous exercise as well.)

    Turn in the ASCII ARPAbet sequences for the two files (just type them into your homework answers). For the Switchboard file, also label the start, end, and identity of each phone using "Annotate" -> "To Text Grid", with just one tier for "phone" (you don't need to use a word tier). Include a picture of this labeled Praat file using the "Draw" window (select the Sound and TextGrid, then click Draw, then save as EPS; convert the file to PDF before you attach it). If the file is too long to read the fonts clearly in the "Draw" window, just break it into 2 or 3 parts and attach separate pictures. This is very hard, so I don't expect you to be perfect; I just want you to try to listen carefully for what's happening in each file.

  9. Get the minimum and maximum pitch for the two files. Record the pitch range (range = max - min).

  10. What are some differences between the Boston News file and the Switchboard file, in terms of transcription differences, pitch range, or other things you noticed? Switchboard is human-human speech; Boston News is broadcast speech, which resembles human-machine speech. Could this play a causal role in the differences you found? How?
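
    For the word error rate estimate in problem 1, the edit-distance calculation can be sketched in Python as below. This is only a minimal illustration, not a required tool: the function name `word_error_rate` and the example sentences are made up for this sketch, and it assumes words are separated by whitespace and that case should be ignored.

```python
def word_error_rate(reference, hypothesis):
    """Return (edit_distance, WER) between two word strings.

    Assumption for this sketch: words are whitespace-separated and
    compared case-insensitively; punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                      # deletion
                dp[i][j - 1] + 1,                      # insertion
                dp[i - 1][j - 1] + substitution_cost,  # match or substitution
            )

    errors = dp[len(ref)][len(hyp)]
    return errors, errors / len(ref)


# Hypothetical example: what you said vs. what the app transcribed.
errors, wer = word_error_rate(
    "call me when you get home",
    "call me win you get a home",
)
print(errors, round(wer, 3))
```

    Here the transcript has one substitution ("when" -> "win") and one insertion ("a") against a 6-word reference. Because the denominator is the reference length, the rate can exceed 1.0 when the system inserts many extra words.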

    You may use an online ARPAbet dictionary to help you. Here is the CMU dictionary. But many or most words in the above sentences will not be pronounced the same way as they are in the dictionary! So be careful not to just copy the pronunciation from the dictionary. (CMU uses a slightly different version of the ARPAbet than the one in the slides from lecture 1, but you can use any ARPAbet version you want, including the version on Wikipedia.)
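
    If you look words up in a downloaded copy of the CMU dictionary, a small helper like the one below may be handy. It is a sketch under stated assumptions: it assumes each entry is a line of the form `WORD  PH1 PH2 ...`, that vowels carry a trailing stress digit (0, 1, or 2), that alternate pronunciations are marked like `WORD(1)`, and that comment lines start with `;;;`. The function names and the sample line are illustrative only. The result is the citation-form pronunciation, which you still need to check against what you actually hear.

```python
def parse_cmudict_line(line):
    """Parse one dictionary line into (word, phones), or None for comments/blank lines.

    Assumed format (not verified against any particular release):
    'WORD  PH1 PH2 ...', with ';;;' starting a comment line.
    """
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    # Alternate pronunciations are assumed to look like 'WORD(1)'.
    word = word.split("(")[0]
    return word, phones


def strip_stress(phones):
    """Remove trailing stress digits, e.g. 'AH0' -> 'AH', 'OW1' -> 'OW'."""
    return [phone.rstrip("012") for phone in phones]


# Illustrative entry in the assumed format.
word, phones = parse_cmudict_line("HELLO  HH AH0 L OW1")
print(word, " ".join(strip_stress(phones)))
```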

    Getting Praat: Praat itself is here. It is free and very simple to download; just grab the executable. It runs on most popular platforms.

    A quick Praat intro, written by Edward Flemming, is here.

    A longer Praat tutorial is here.
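
    For the pitch range in problem 9, Praat can report the minimum and maximum directly. If you instead copy a list of per-frame pitch values out of Praat, the arithmetic can be sketched as below. This assumes (as an illustration, not a description of any specific Praat export) that the values are in Hz and that unvoiced frames show up as `None` or `0`; the function name and the sample numbers are hypothetical.

```python
def pitch_range(f0_values):
    """Return (min, max, range) in Hz over voiced frames only.

    Assumption for this sketch: unvoiced frames are given as None or 0
    and must be skipped, otherwise they would distort the minimum.
    """
    voiced = [f0 for f0 in f0_values if f0 is not None and f0 > 0]
    if not voiced:
        raise ValueError("no voiced frames found")
    lowest, highest = min(voiced), max(voiced)
    return lowest, highest, highest - lowest


# Hypothetical per-frame pitch values in Hz.
lowest, highest, span = pitch_range([0, 110.0, None, 180.5, 95.2, 0])
print(lowest, highest, round(span, 1))
```

    Also look at the extreme values before trusting them: pitch trackers sometimes halve or double the true pitch, and a single bad frame can inflate the range.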

    How to turn in the homework: