Please read this entire handout before beginning. We advise you to start early and to make use of the TAs by coming to office hours and asking questions! For collaboration and the late day policy, please refer to the home page.
About the Assignment
In this assignment you will become familiar with some easily available spoken language processing systems and perform some basic analysis and manipulation of speech audio. The goal is to introduce some of the basic tools/libraries available and to get you thinking about challenges in building spoken language systems.
Submission Instructions
This assignment is due on 04/14/2025 by 11:59PM Pacific time (or at the latest on 04/17/2025 with three late days) and has three parts. For parts 1-2, you should submit a PDF to Gradescope and mark in the PDF which page corresponds to which question. For part 3, you will submit your filled-in/executed Colab notebook with all code/output.
All materials for parts 1-3 are submitted to Gradescope. Please tag your question responses so each answer is matched to the correct pages.
Part 1: Speech APIs and Personal Assistants
For the first part of your assignment, you will be investigating the performance of popular speech transcription and personal assistant services. Your task will be to interact with three different speech systems, document your results, and describe the types of failures or issues you discover in the writeup.
Speech Transcription (10 points)
First, compose some short (2-4 sentence) emails or text messages using the speech input button on your mobile keyboard (usually in the email or messaging app). Try your best to limit yourself to “everyday” sentences and “optimal” conditions (no obscure vocabulary, low background noise, etc.) to gauge how well the system works at its best. Then try composing messages that include domain-specific words (e.g. machine learning jargon) or proper nouns (e.g. restaurant or actor names) to challenge the system.
- Paste the results for one message in your writeup including any errors the system generated.
- What is the rough number of errors per word in your results? We can count an error as anything you would manually correct before sending the message/email.
- Describe how the system handles punctuation. Does it guess, insert no punctuation, or allow punctuation commands?
- Try composing a message where you correct yourself (e.g. “I’m leaving at five – delete that, I meant six”). Include the resulting text and comment on how the system handles attempts to edit the utterance and to quickly correct partial words.
- Try to break the system. For instance, speak in a different pitch, volume, or distance to the microphone. Try talking with background noises. If you know a different language, try speaking in that language. Show 2 example utterances and describe what types of errors the system makes, along with what you did to cause those errors. Can you consistently produce different types of errors using different approaches to break the system?
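If you want to make the “errors per word” estimate above more precise, the standard metric is word error rate (WER): the word-level edit distance between what you said (the reference) and what the system produced (the hypothesis), divided by the number of reference words. A minimal sketch in Python — the example sentences are made up for illustration:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("leaving" -> "leading") over six words: WER = 1/6
print(wer("I am leaving at six tonight", "I am leading at six tonight"))
```

Counting every manual fix you would make (including substitutions, insertions, and deletions) and dividing by the number of words you actually spoke gives the same quantity.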
Personal Assistants (20 points)
Use Siri, Google Assistant, Amazon Alexa, or a similar speech-based personal assistant. In this section, you will perform a few goal-oriented interactions and describe how the system handles your requests. For each item below, include a description of the interaction or, if possible, a screenshot (a verbatim transcript is not necessary).
- Ask some factual questions about a favorite book, show/movie, sports team. Is the system accurate in its responses? How does the system handle follow-on questions? (e.g. “Who wrote The Great Gatsby? … When was that book published?”)
- Pretend you are searching for a restaurant for take-out food today. Try to explore possible restaurants, learn about their ratings/food, and start an order if possible. How many turns did you take in this interaction (a turn in dialog is each time you speak)? Were you able to explore new places and learn about them? Was the interaction completely speech-driven, or does your assistant prompt you to look at options visually?
- Create some calendar events that involve a meeting name and add details (location, attendees, or similar). If you offer a lengthy initial command, does the system add all the details you specify? If you start with a simple “make a calendar event” prompt, what questions does the system ask?
- Using any of the above themes, try an interaction where you “barge in” to edit or correct something (barging in is talking while the system is talking to you). Does the system allow you to barge in for corrections? Does it detect that you had something to add while it was speaking?
- Describe any types of error you found while completing the tasks above. When the system didn’t achieve the result you hoped, can you attribute issues to limited functionality (e.g. not allowing calendar events to have notes attached), issues with speech recognition, or knowledge of concepts in the world?
Now let’s try a very recent audio-LLM-style model, which takes a different approach to speech processing but might not be built to complete real tasks like the digital assistant systems above. You can easily try a model like this using Sesame’s audio LLM demo, OpenAI’s ChatGPT voice input/output mode, Google’s Gemini Live mode, or a similar system.
- Ask some factual questions about a favorite book, show/movie, sports team. Is the system accurate in its responses? How does the system handle follow-on questions? (e.g. “Who wrote The Great Gatsby? … When was that book published?”) Compare the responses to the personal assistant responses from above in terms of both topic breadth and response quality.
- Try to request help completing a task. For example, ask to book a round-trip flight. How does the system respond? As you try multiple requests, does the system clearly communicate what it can or cannot do?
- Separate from completing a real task, what is your overall impression of the “conversational tone” of this system compared with the task-oriented digital assistants you tried above? Is the system more or less verbose? Does it focus on tasks and respond only to user requests, or does it try to move the conversation in its own direction?
- Try to interrupt the system as it speaks. How does it respond? How does it recover when you interrupt and change the topic? Describe two examples of you interrupting the system, and how it responded.
Part 2: Phonetic Transcription
In this section you will do some basic creation and editing of phonetic pronunciations. Whenever you are working with phonetic transcripts in this homework you may restrict yourself to this simplified ARPAbet.
ARPAbet Transcriptions (20 points)
- We often process speech data in phonemes instead of words. Find and correct the mistakes in the ARPAbet transcriptions of the following words:
- three [dh r i]
- sing [s ih n g]
- eyes [ay s]
- study [s t uh d i]
- though [th ow]
- planning [p pl aa n ih ng]
- slight [s l iy t]
- action [ae k t ah n]
- tangle [t ae ng g l]
- higher [hh ay g er]
- Transcribe the following words into ARPAbet. If you think there are multiple correct pronunciations, pick one you find reasonable and explain which phonemes could differ. Remember to use only this simplified ARPAbet.
- red
- blue
- black
- block
- humanity
- purple
- huge
- manatee
- verbatim
- water (Many possible choices, try to write the way you pronounce this word)
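A common reference when sanity-checking transcriptions is the CMU Pronouncing Dictionary (available, for example, through `nltk.corpus.cmudict`), which writes phones in uppercase ARPAbet with stress digits, e.g. RED → R EH1 D. A small sketch of mapping that style to a stress-free, lowercase inventory like the simplified set used in this handout — this assumes the simplified set simply drops stress markers, so check the handout’s phone list for any other differences:

```python
def to_simplified(arpabet: list[str]) -> list[str]:
    """Strip CMUdict stress digits (0/1/2) and lowercase each phone."""
    return [phone.rstrip("012").lower() for phone in arpabet]

# CMUdict lists "RED" as R EH1 D; the simplified form drops the stress mark.
print(to_simplified(["R", "EH1", "D"]))  # ['r', 'eh', 'd']
```

Note that dictionary pronunciations reflect one dialect; for words like “water” your own pronunciation may legitimately differ.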
Part 3: Audio analysis toolkits
Audio Analysis Notebook (60 points)
Complete the exercises described in the Colab notebook provided via the Google Drive folder. Turn in a PDF of your fully executed Colab notebook, showing the plots you created. Remember to make a copy of the Colab notebook before you start working so your changes will save!
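As a warm-up before opening the notebook, the core of most audio analysis — turning a waveform into a time-frequency representation — can be sketched with only NumPy and SciPy. The notebook itself may use different libraries (e.g. librosa or matplotlib for plotting); the signal below is a synthetic tone standing in for a real recording:

```python
import numpy as np
from scipy.signal import spectrogram

sr = 16000  # 16 kHz is a common sample rate for speech
t = np.linspace(0, 1.0, sr, endpoint=False)
# Synthetic 440 Hz tone in place of loaded speech audio.
audio = 0.5 * np.sin(2 * np.pi * 440 * t)

# Short-time spectrogram: frequency bins x time frames of signal power.
freqs, times, sxx = spectrogram(audio, fs=sr, nperseg=512)

# The frequency bin with the most average energy should sit near 440 Hz
# (within one bin width, sr / nperseg = 31.25 Hz).
peak = freqs[sxx.mean(axis=1).argmax()]
print(peak)
```

Plotting `sxx` (usually on a dB scale) over `times` and `freqs` gives the familiar spectrogram image you will produce in the notebook.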
CS224S