Please read this entire handout before beginning. We advise you to start early and to make use of the TAs by coming to office hours and asking questions! For collaboration and the late day policy, please refer to the home page.
About the Assignment
In this assignment you will become familiar with some easily available spoken language processing systems and perform some basic analysis and manipulation of speech audio. The goal is to familiarize you with some of the basic tools and libraries available and to get you thinking about the challenges in building spoken language systems.
Submission Instructions
This assignment has three parts and is due on 04/15/2024 by 11:59 PM Pacific (or on 04/18/2024 at the latest, using three late days). For parts 1-2, you should submit a PDF to Gradescope and mark in the PDF which page corresponds to which question. For part 3, you will submit your filled-in, executed Colab notebook with all code and output.
All materials for parts 1-2 and part 3 are submitted to Gradescope. Please tag your question responses.
Part 1: Speech APIs and Personal Assistants
For the first part of your assignment, you will be investigating the performance of popular speech transcription and personal assistant services. Your task will be to interact with three different speech systems, document your results, and describe the types of failures or issues you discover in the writeup.
Speech Transcription (10 points)
First, compose some short (2-4 sentence) emails or text messages using the speech input button on your mobile keyboard (usually in the email or messaging app). Start with “everyday” sentences under “optimal” conditions (no obscure vocabulary, low background noise, etc.) to gauge how well the system works at its best. Then try composing messages that include domain-specific words (e.g. machine learning jargon) or proper nouns (e.g. restaurant or actor names) to challenge the system.
- Paste the results for one message in your writeup including any errors the system generated.
- What is the rough number of errors per word in your results? We can count an error as anything you would manually correct before sending the message/email.
- Describe how the system handles punctuation. Does it guess, insert no punctuation, or allow punctuation commands?
- Try composing a message where you correct yourself (e.g. “I’m leaving at five – delete that I meant 6”). Include the resulting text and comment on how the system handles attempts to edit the utterance and to quickly correct partial words.
- Try to break the system. For instance, speak in a different pitch, volume, or distance to the microphone. Try talking with background noises. If you know a different language, try speaking in that language. Show 2 example utterances and describe what types of errors the system makes, along with what you did to cause those errors. Can you consistently produce different types of errors using different approaches to break the system?
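If you would like a starting point for the errors-per-word estimate asked for above, one rough option is a word-level edit distance between what you meant to say and what the system produced. The sketch below is only an approximation of "anything you would manually correct": it assumes each correction is a single word substitution, insertion, or deletion, and it ignores punctuation and capitalization fixes. The function names and the example sentences are hypothetical, and counting errors by hand is equally acceptable.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference (Levenshtein distance)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def errors_per_word(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of words you intended to say."""
    n_words = len(reference.split())
    if n_words == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / n_words


# Hypothetical example: one substituted word out of eight intended words.
intended = "meet me at the thai place at six"
transcribed = "meet me at the tie place at six"
print(errors_per_word(intended, transcribed))  # 0.125
```

Dividing by the length of the intended message (rather than the transcribed one) follows the usual word error rate convention; either choice is fine for a rough estimate as long as you state which one you used.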
Personal Assistants (10 points)
Use Siri, Google Assistant, Amazon Alexa, or any similar speech-based personal assistant. In this section, you will try to perform a few goal-oriented interactions and describe how the system handles your requests. For each of the items below, include a description or, if your system allows it, a screenshot of the interaction (a verbatim description is not necessary).
- Ask some factual questions about a favorite book, show/movie, or sports team. Is the system accurate in its responses? How does the system handle follow-on questions? (e.g. “Who wrote The Great Gatsby? … When was that book published?”)
- Pretend you are searching for a restaurant for take-out food today. Try to explore possible restaurants, learn about their ratings/food, and start an order if possible. How many turns did you take in this interaction (a turn in dialog is each time you speak)? Were you able to explore new places and learn about them? Was the interaction completely speech-driven, or does your assistant prompt you to look at options visually?
- Create some calendar events that involve a meeting name and add details (location, attendees, or similar). If you offer a lengthy initial command, does the system add all the details you specify? If you start with a simple “make a calendar event” prompt, what questions does the system ask?
- Using any of the above themes, try an interaction where you “barge in” to edit or correct something (barging in is talking while the system is talking to you). Does the system allow you to barge in with corrections? Does it detect that you had something to add while it was speaking?
- Describe any types of errors you found while completing the tasks above. When the system didn’t achieve the result you hoped for, can you attribute the issues to limited functionality (e.g. not allowing calendar events to have notes attached), to speech recognition, or to limited knowledge of concepts in the world?
Part 2: Phonetic Transcription
In this section you will do some basic creation and editing of phonetic pronunciations.
ARPAbet Transcriptions (20 points)
- We often process speech data in phonemes instead of words. Find and correct the mistakes in the ARPAbet transcriptions of the following words:
- three [dh r i]
- sing [s ih n g]
- eyes [ay s]
- study [s t uh d i]
- though [th ow]
- planning [p pl aa n ih ng]
- slight [s l iy t]
- action [ae k t ah n]
- tangle [t ae ng g l]
- higher [hh ay g er]
- Transcribe the following words into ARPAbet. If you think there are multiple possible correct pronunciations, you can write both and explain why you think both are valid.
- red
- blue
- black
- block
- humanity
- purple
- huge
- manatee
- verbatim
- water
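When writing transcriptions by hand, it is easy to slip in a symbol that is not part of the ARPAbet inventory at all. The sketch below is one way to check your symbols before submitting. It assumes the 39-phone set used by the CMU Pronouncing Dictionary, written in lowercase as in this handout, with optional stress digits (0, 1, 2) after a symbol. If the course accepts an extended inventory (for example, additional reduced-vowel or flap symbols), you would need to add those symbols to the set. The function name is hypothetical. Note that the check only flags symbols that do not exist; it cannot tell you whether a valid symbol is the right one for the word.

```python
from typing import List

# Assumption: the 39-phone ARPAbet set used by the CMU Pronouncing Dictionary.
ARPABET = frozenset(
    "aa ae ah ao aw ay b ch d dh eh er ey f g hh ih iy jh k l m n ng "
    "ow oy p r s sh t th uh uw v w y z zh".split()
)


def invalid_symbols(transcription: str) -> List[str]:
    """Return the tokens in a transcription that are not ARPAbet symbols."""
    # Drop surrounding brackets and spaces, then split on whitespace.
    tokens = transcription.strip("[] ").lower().split()
    # Remove any trailing stress digit before looking the symbol up.
    return [t for t in tokens if t.rstrip("012") not in ARPABET]


# Hypothetical examples using a word that is not part of the assignment.
print(invalid_symbols("[k ae t]"))   # []
print(invalid_symbols("[c a t]"))    # ['c', 'a']
```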
Part 3: Audio analysis toolkits
Audio Analysis Notebook (70 points)
Complete the exercises described in the Colab notebook provided via the Google Drive folder. Turn in a PDF of your fully executed Colab notebook, showing the plots you created. Remember to make a copy of the Colab notebook before you start working so that your changes are saved!
CS224S