Week 1: Introduction and Acoustic Phonetics
Deliverables
- Assignment 1 released on Mon 4.1.24.
Lecture 1 (Mon 4.1.24)
Course introduction.
Lecture 2 (Wed 4.3.24)
Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription.
Readings:
- J+M Draft Edition Appendix H: Phonetics, online pdf
- Fun read (optional). The Art of Language Invention. David J Peterson. 2015.
Week 2: Speech Synthesis / Text to Speech (TTS)
Deliverables
- Course Project Overview released on Mon 4.8.24.
Lecture 3 (Mon 4.8.24)
Some history of ASR, TTS, and dialog. TTS Overview. Text normalization. Letter-to-sound. Prosody.
Readings:
- J+M Draft Edition Chapter 16.6: TTS online pdf
Lecture 4 (Wed 4.10.24)
Foundations of TTS: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches.
Readings:
- J+M Draft Edition Chapter 16.6: TTS (cont’d) online pdf
Week 3: Course project + TTS with deep learning
Deliverables
- Assignment 1 due by Monday 4.15.24 11:59PM Pacific.
- Assignment 2 released on Monday 4.15.24.
Lecture 5 (Mon 4.15.24)
Course project overview and Q&A. Social meaning extraction as supervised machine learning.
Readings:
- Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.
Lecture 6 (Wed 4.17.24)
Deep learning for TTS.
Readings:
- Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
- Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.
Deep Learning Preliminaries. Review on your own as needed depending on your experience so far with deep learning models:
- J+M Draft Edition Chapter 7: Neural Networks and Neural Language Models. pdf
- J+M Draft Edition Chapter 9: RNNs and LSTMs. pdf
- J+M Draft Edition Chapter 10: Transformers and Large Language Models. pdf
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time
Week 4: Speech to Text / Automatic Speech Recognition (ASR)
Deliverables
- Course Project Proposal due by Wednesday 4.24.24 11:59PM Pacific.
Lecture 7 (Mon 4.22.24)
Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs).
Readings:
- J+M Draft Edition Chapter 16.1, 16.2, 16.3, 16.5: Automatic Speech Recognition online pdf
- J+M Draft Edition. Appendix A pdf
- J+M Draft Edition. Appendix B pdf
- Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.
- (Optional). If you have never studied language modeling (i.e., have never taken CS124, CS224N, or similar) you should do some additional reading and video lecture watching on your own.
- J+M Draft Edition Chapter 3: N-gram Language Models. pdf
- Lecture videos on introductory NLP including language modeling youtube
Lecture 8 (Wed 4.24.24)
Speech recognition: HMM-DNN systems. Connectionist Temporal Classification (CTC). End-to-end neural ASR.
Readings:
- Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
- J+M Draft Edition Chapter 16.4: CTC online pdf
- Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
- Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. ACL-HLT. 2015.
- Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
- Prabhavalkar, R., Hori, T., Sainath, T. N., Schlüter, R., and Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023.
Week 5: State-of-the-art ASR and customizing ASR for products
Deliverables
- Assignment 2 due by Wednesday 5.1.24 11:59PM Pacific.
Lecture 9 (Mon 4.29.24)
Guest lecture: State-of-the-art deep learning approaches for speech recognition. Conformer. Whisper. Fine tuning base models. Abhinav Garg.
Readings:
- Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
- TBD
Lecture 10 (Wed 5.1.24)
Guest Lecture: Ello: A case study in building spoken language products. Joe Lou, Ello.
Readings:
- T. Bluche, M. Primet, & T. Gisselbrecht. Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks. ArXiv. 2020.
- K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, & B. Kingsbury. End-to-End ASR-free Keyword Search from Speech. IEEE J. Signal Processing 2017.
- N. Sacchi, A. Nanchen, M. Jaggi, & M. Cerňak. Open-Vocabulary Keyword Spotting with Audio and Text Embeddings. Interspeech 2019.
Week 6: Foundation models and non-English languages
Deliverables
- Assignment 3 released on Mon 5.6.24.
Lecture 11 (Mon 5.6.24)
Guest Lecture: Foundation models for spoken language. Dr. Karen Livescu.
Readings:
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
- A. Mohamed et al., Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing 16(6):1179-1210, October 2022.
- W. Hsu et al., HuBERT: How much can a bad teacher benefit ASR pretraining?, ICASSP 2021.
- A. Pasad et al., Comparative layer-wise analysis of self-supervised speech models, ICASSP 2023.
Lecture 12 (Wed 5.8.24)
Guest lecture: Speech Recognition Beyond English. Tolúlọpẹ́ Ògúnrẹ̀mí
Readings:
- Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. Unsupervised cross-lingual representation learning for speech recognition. ArXiv. 2020.
- Shi J, Berrebbi D, Chen W, Chung HL, Hu EP, Huang WP, Chang X, Li SW, Mohamed A, Lee HY, Watanabe S. ML-SUPERB: Multilingual speech universal performance benchmark. Interspeech 2023.
Week 7: Non-English spoken language understanding cont’d + project check-ins
Lecture 13 (Mon 5.13.24)
Guest lecture: Representing Low-Resource Language Varieties. Martijn Bartelds & Nay San
Lecture 14 (Wed 5.15.24)
Project check-ins during class. Each group will speak for ~2 minutes about progress and planned work.
Week 8: Introduction to spoken dialog + project check-ins
Deliverables
- Assignment 3 due by Monday 5.20.24 11:59PM Pacific.
Lecture 15 (Mon 5.20.24)
Project check-ins during class. Each group will speak for ~2 minutes about progress and planned work.
Lecture 16 (Wed 5.22.24)
Overview of dialog: Human conversation. Task-oriented dialog. Dialog system design. GUS and frame-based dialog systems.
Readings:
- J+M Draft Edition Chapter 15: Dialogue Systems and Chatbots, online pdf
Week 9: Spoken dialog with LLMs
Deliverables
- Course Project Milestone due by Wednesday 5.29.24 11:59PM Pacific.
Memorial Day. NO CLASS (Mon 5.27.24)
Lecture 17 (Wed 5.29.24)
Guest lecture: Developing spoken dialog systems with LLMs. Anthony Scodary, Co-Founder @ Gridspace.
Readings:
- Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, & Zhou D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ArXiv 2022.
- Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, & Wang H. Retrieval-Augmented Generation for Large Language Models: A Survey. ArXiv. 2023.
- Chen C, Borgeaud S, Irving G, Lespiau JB, Sifre L, & Jumper J. Accelerating Large Language Model Decoding with Speculative Sampling. ArXiv. 2023.
Week 10: Spoken dialog development & final poster session
Lecture 18 (Mon 6.3.24)
Case study: Alexa Skills Kit in the era of LLMs.
Readings:
- Alexa Skills Kit Documentation.
- Overview
- Understand Custom Skills. You do not need to cover adding visual components to skills.
- Interaction Model Design
Final project poster session (Wed 6.5.24)
Present posters at in-person session during lecture time.
CS224S