Syllabus

Week 1: Introduction and Acoustic Phonetics

Deliverables

Assignment 1 released on Mon 4.1.24.

Lecture 1 (Mon 4.1.24)

Course introduction.

lecture slides

Lecture 2 (Wed 4.3.24)

Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription.

lecture slides

Readings:

J+M Draft Edition Appendix H: Phonetics, online pdf
Fun read (optional). The Art of Language Invention. David J Peterson. 2015.

Week 2: Speech Synthesis / Text to Speech (TTS)

Course Project Overview released on Mon 4.8.24.

Lecture 3 (Mon 4.8.24)

Some history of ASR, TTS, and dialog. TTS Overview. Text normalization. Letter-to-sound. Prosody.

lecture slides

Readings:

J+M Draft Edition Chapter 16.6: TTS online pdf

Lecture 4 (Wed 4.10.24)

lecture slides

Foundations of TTS: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches.

Readings:

J+M Draft Edition Chapter 16.6: TTS (cont’d) online pdf

Week 3: Course project + TTS with deep learning

Deliverables

Assignment 1 due by Monday 4.15.24 11:59PM Pacific.
Assignment 2 released on Monday 4.15.24.

Lecture 5 (Mon 4.15.24)

lecture slides

Course project overview and Q&A. Social meaning extraction as supervised machine learning.

Readings:

Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.

Lecture 6 (Wed 4.17.24)

lecture slides

Deep learning for TTS.

Readings:

Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.

Deep Learning Preliminaries. Review on your own as needed depending on your experience so far with deep learning models:

J+M Draft Edition Chapter 7: Neural Networks and Neural Language Models. pdf
J+M Draft Edition Chapter 9: RNNs and LSTMs. pdf
J+M Draft Edition Chapter 10: Transformers and Large Language Models. pdf
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time

Week 4: Speech to Text / Automatic Speech Recognition (ASR)

Deliverables

Course Project Proposal due by Wednesday 4.24.24 11:59PM Pacific.

Lecture 7 (Mon 4.22.24)

lecture slides

Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs).

Readings:

J+M Draft Edition Chapter 16.1, 16.2, 16.3, 16.5: Automatic Speech Recognition online pdf
J+M Draft Edition. Appendix A pdf
J+M Draft Edition. Appendix B pdf
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.
(Optional). If you have never studied language modeling (i.e., have never taken CS124, CS224N, or similar) you should do some additional reading and video lecture watching on your own.

J+M Draft Edition Chapter 3: N-gram Language Models. pdf
Lecture videos on introductory NLP including language modeling youtube

Lecture 8 (Wed 4.24.24)

lecture slides

Speech recognition: HMM-DNN systems. Connectionist Temporal Classification (CTC). End-to-end neural ASR.

Readings:

Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
J+M Draft Edition Chapter 16.4: CTC online pdf
Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. ACL-HLT. 2015.
Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
Prabhavalkar, R., Hori, T., Sainath, T. N., Schlüter, R., and Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023.

Week 5: State-of-the-art ASR and customizing ASR for products

Deliverables

Assignment 2 due by Wednesday 5.1.24 11:59PM Pacific.

Lecture 9 (Mon 4.29.24)

lecture slides

Guest lecture: State-of-the-art deep learning approaches for speech recognition. Conformer. Whisper. Fine tuning base models. Abhinav Garg.

Readings:

Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
TBD

Lecture 10 (Wed 5.1.24)

Guest Lecture: Ello: A case study in building spoken language products. Joe Lou, Ello.

Readings:

T. Bluche, M. Primet, & T. Gisselbrecht. Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks. ArXiv. 2020.
K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, & B. Kingsbury. End-to-End ASR-free Keyword Search from Speech. IEEE J. Signal Processing 2017.
N. Sacchi, A. Nanchen, M. Jaggi, & M. Cerňak. Open-Vocabulary Keyword Spotting with Audio and Text Embeddings. Interspeech 2019.

Week 6: Foundation models and non-English languages

Deliverables

Assignment 3 released on Mon 5.6.24.

Lecture 11 (Mon 5.6.24)

Guest Lecture: Foundation models for spoken language. Dr. Karen Livescu.

Readings:

Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
A. Mohamed et al., Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing 16(6):1179-1210, October 2022.
W. Hsu et al., HuBERT: How much can a bad teacher benefit ASR pretraining?, ICASSP 2021.
A. Pasad et al., Comparative layer-wise analysis of self-supervised speech models, ICASSP 2023.

Lecture 12 (Wed 5.8.24)

Guest lecture: Speech Recognition Beyond English. Tolúlọpẹ́ Ògúnrẹ̀mí

lecture slides

Readings:

Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. Unsupervised cross-lingual representation learning for speech recognition. ArXiv. 2020.
Shi J, Berrebbi D, Chen W, Chung HL, Hu EP, Huang WP, Chang X, Li SW, Mohamed A, Lee HY, Watanabe S. ML-SUPERB: Multilingual speech universal performance benchmark. Interspeech 2023.

Week 7: Non-English spoken language understanding cont’d + project check-ins

Lecture 13 (Mon 5.13.24)

Guest lecture: Representing Low-Resource Language Varieties. Martijn Bartelds & Nay San

lecture slides

Lecture 14 (Wed 5.15.24)

Project check-ins during class. Each group will speak for ~2 minutes about progress and planned work.

Week 8: Introduction to spoken dialog + project check-ins

Deliverables

Assignment 3 due by Monday 5.20.24 11:59PM Pacific.

Lecture 15 (Mon 5.20.24)

Project check-ins during class. Each group will speak for ~2 minutes about progress and planned work.

Lecture 16 (Wed 5.22.24)

Overview of dialog: Human conversation. Task-oriented dialog. Dialog system design. GUS and frame-based dialog systems.

lecture slides

Readings:

J+M Draft Edition Chapter 15: Dialogue Systems and Chatbots, online pdf

Week 9: Spoken dialog with LLMs

Course Project Milestone due by Wednesday 5.29.24 11:59PM Pacific.

Memorial Day. NO CLASS (Mon 5.27.24)

Lecture 17 (Wed 5.29.24)

Guest lecture: Developing spoken dialog systems with LLMs. Anthony Scodary, Co-Founder @ Gridspace.

Readings:

Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, & Zhou D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ArXiv. 2022.
Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, & Wang H. Retrieval-Augmented Generation for Large Language Models: A Survey. ArXiv. 2023.
Chen C, Borgeaud S, Irving G, Lespiau JB, Sifre L, & Jumper J. Accelerating Large Language Model Decoding with Speculative Sampling. ArXiv. 2023.

Week 10: Spoken dialog development & final poster session

Lecture 18 (Mon 6.3.24)

Case study: Alexa Skills Kit in the era of LLMs.

lecture slides

Readings:

Alexa Skills Kit Documentation.
- Overview
- Understand Custom Skills. You do not need to cover adding visual components to skills.
- Interaction Model Design

Final project poster session (Wed 6.5.24)

Present posters at in-person session during lecture time.

Spring 2024

Course Project Report due by Saturday 6.8.24 11:59PM Pacific. No late days allowed.