Syllabus

Spring 2025

Week 1: Introduction and acoustic phonetics

Assignments

  • Assignment 1 released on Wednesday 4.2.25.

Lecture 1 (Mon 3.31.25)

Course introduction.

lecture slides

Lecture 2 (Wed 4.2.25)

Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription.

lecture slides

Readings:

  1. J+M Draft Edition Appendix H: Phonetics, online pdf
  2. Fun read (optional). The Art of Language Invention. David J Peterson. 2015.

Week 2: Speech synthesis / Text to speech (TTS)

Lecture 3 (Mon 4.7.25)

Some history of ASR, TTS, and dialog. TTS Overview. Text normalization. Letter-to-sound. Prosody.

lecture slides

Readings:

  1. J+M Draft Edition Chapter 16.6: TTS online pdf

Lecture 4 (Wed 4.9.25)

Foundations of TTS: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches.

lecture slides

Readings:

  1. J+M Draft Edition Chapter 16.6: TTS (cont’d) online pdf

Week 3: TTS with deep learning

Assignments

  • Assignment 1 due by Monday 4.14.25 11:59PM Pacific.
  • Assignment 2 released on Monday 4.14.25.

Lecture 5 (Mon 4.14.25)

Deep learning background. TTS using deep learning.

lecture slides

Readings:

  1. Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
  2. Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
  3. Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.

Deep Learning Preliminaries. Review on your own as needed depending on your experience so far with deep learning models:

  1. J+M Draft Edition Chapter 7: Neural Networks and Neural Language Models. pdf
  2. J+M Draft Edition Chapter 8: RNNs and LSTMs. pdf
  3. J+M Draft Edition Chapter 9: Transformers. pdf
  4. J+M Draft Edition Chapter 10: Large Language Models. pdf
  5. The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time

Lecture 6 (Wed 4.16.25)

Advanced deep learning for speech and audio synthesis.

lecture slides

Readings:

  1. Du, Z., Wang, Y., Chen, Q., Shi, X., Lv, X., Zhao, T., Gao, Z., Yang, Y., Gao, C., Wang, H., Yu, F., Liu, H., Sheng, Z., Gu, Y., Deng, C., Wang, W., Zhang, S., Yan, Z., & Zhou, J. Cosyvoice 2: Scalable streaming speech synthesis with large language models. arXiv:2412.10117. 2024.
  2. Kong, J., Park, J., Kim, B., Kim, J., Kong, D., & Kim, S. Vits2: Improving quality and efficiency of single-stage text-to-speech with adversarial learning and architecture design. arXiv:2307.16430. 2023.
  3. Triantafyllopoulos, A., Schuller, B. W., İymen, G., Sezgin, M., He, X., Yang, Z., … & Tao, J. An overview of affective speech synthesis and conversion in the deep learning era. Proceedings of the IEEE, 111(10). 2023. [arXiv PDF]

Week 4: Overview of spoken dialog. Designing systems and user experiences

Lecture 7 (Mon 4.21.25)

Overview of dialog: Human conversation. Task-oriented dialog. Dialog system design. GUS and frame-based dialog systems.

lecture slides

Readings:

  1. J+M Draft Edition Chapter 15: Dialogue Systems and Chatbots, online pdf
  2. Mahmood, A., Wang, J., Yao, B., Wang, D., & Huang, C. M. LLM-powered conversational voice assistants: Interaction patterns, opportunities, challenges, and design guidelines. arXiv:2309.13879. 2023.

Lecture 8 (Wed 4.23.25)

Case studies in developing dialog systems: Alexa Skills Kit, Gridspace virtual agent builder, ReTell AI dialog flow builder, Apple App Intents.

lecture slides

Readings:

  1. Alexa Skills Kit Documentation.
  2. Gridspace virtual agent building is free to demo / try (requires account creation).
  3. ReTell AI dialog flow builder.
  4. Apple App Intents.

Week 5: Speech to text / Automatic speech recognition (ASR) introduction

Assignments

  • Assignment 2 due by Monday 4.28.25 11:59PM Pacific.
  • Assignment 3 released on Monday 4.28.25.

Lecture 9 (Mon 4.28.25)

Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs).

lecture slides

Readings:

  1. J+M Draft Edition Chapter 16.1, 16.2, 16.3, 16.5: Automatic Speech Recognition online pdf
  2. J+M Draft Edition. Appendix A pdf
  3. J+M Draft Edition. Appendix B pdf
  4. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.
  5. (Optional). If you have never studied language modeling (i.e., have never taken CS124, CS224N, or similar) you should do some additional reading and video lecture watching on your own.
       • J+M Draft Edition Chapter 3: N-gram Language Models. pdf
       • Lecture videos on introductory NLP, including language modeling. YouTube

Lecture 10 (Wed 4.30.25)

Speech recognition: HMM-DNN systems. Connectionist Temporal Classification (CTC). End-to-end neural ASR.

lecture slides

Readings:

  1. Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
  2. J+M Draft Edition Chapter 16.4: CTC online pdf
  3. Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
  4. Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. NAACL-HLT. 2015.
  5. Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
  6. Prabhavalkar, R., Hori, T., Sainath, T. N., Schlüter, R., and Watanabe, S. End-to-end speech recognition: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023.

Week 6: State-of-the-art ASR and social meaning extraction

Lecture 11 (Mon 5.5.25)

Guest lecture: State-of-the-art deep learning approaches for speech recognition. Conformer. Whisper. Fine tuning base models. Tolúlọpẹ́ Ògúnrẹ̀mí.

lecture slides

Readings:

  1. Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
  2. Whisper ASR model overview.
  3. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. Robust speech recognition via large-scale weak supervision. PMLR. 2023.

Lecture 12 (Wed 5.7.25)

Social meaning extraction as supervised machine learning. Ethics and bias in spoken language systems. Course project overview and Q&A.

lecture slides

Readings:

  1. Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., & Schuller, B. W. Dawn of the transformer era in speech emotion recognition: closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9). 2023. [arXiv PDF]
  2. Zhang, Z., Xu, W., Dong, Z., Wang, K., Wu, Y., Peng, J., … & Huang, D. Y. ParaLBench: a Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation Models. IEEE Transactions on Affective Computing. 2024. [arXiv PDF]
  3. Tao, F., Mirheidari, B., Pahar, M., Young, S., Xiao, Y., Elghazaly, H., … & Christensen, H. Early dementia detection using multiple spontaneous speech prompts: The process challenge. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2025. [arXiv PDF]
  4. Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.

Week 7: Foundation models and non-English languages

Assignments

  • Assignment 3 due by Monday 5.12.25 11:59PM Pacific.
  • (Optional) Project proposals due by Monday 5.12.25 11:59PM Pacific.
  • Assignment 4 released on Wednesday 5.14.25.

Lecture 13 (Mon 5.12.25)

Guest Lecture: Foundation models for spoken language. Dr. Karen Livescu.

lecture slides

Readings:

  1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
  2. A. Mohamed et al., Self-supervised speech representation learning: A review. IEEE Journal of Selected Topics in Signal Processing 16(6):1179-1210, October 2022.
  3. W. Hsu et al., HuBERT: How much can a bad teacher benefit ASR pretraining?, ICASSP 2021.
  4. A. Pasad et al., Comparative layer-wise analysis of self-supervised speech models, ICASSP 2023.

Lecture 14 (Wed 5.14.25)

Guest lecture: Speech Recognition Beyond English. Tolúlọpẹ́ Ògúnrẹ̀mí.

lecture slides

Readings:

  1. Conneau, A., Baevski, A., Collobert, R., Mohamed, A., & Auli, M. Unsupervised cross-lingual representation learning for speech recognition. ArXiv. 2020.
  2. Shi J, Berrebbi D, Chen W, Chung HL, Hu EP, Huang WP, Chang X, Li SW, Mohamed A, Lee HY, Watanabe S. ML-SUPERB: Multilingual speech universal performance benchmark. Interspeech 2023.

Week 8: Low-resource language systems & revisiting dialog control using LLMs

Lecture 15 (Mon 5.19.25)

Guest lecture: Representing Low-Resource Language Varieties. Martijn Bartelds.

lecture slides

Lecture 16 (Wed 5.21.25)

Guest lecture: Developing spoken dialog systems with LLMs. Anthony Scodary, Co-Founder @ Gridspace.

Readings:

  1. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le QV, & Zhou D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ArXiv. 2022.
  2. Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, Dai Y, Sun J, & Wang H. Retrieval-Augmented Generation for Large Language Models: A Survey. ArXiv. 2023.
  3. Chen C, Borgeaud S, Irving G, Lespiau JB, Sifre L, & Jumper J. Accelerating Large Language Model Decoding with Speculative Sampling. ArXiv. 2023.

Week 9: Societal impacts of conversational AI systems

Memorial Day. NO CLASS (Mon 5.26.25)

Lecture 17 (Wed 5.28.25)

Guest lecture: Conversational AI systems’ impact on society. Alex Acero (Google Scholar).

Week 10: Specializing speech systems in industry & course project presentations

Assignments

  • Assignment 4 due by Monday 6.2.25 11:59PM Pacific.

Lecture 18 (Mon 6.2.25)

LLM-based spoken dialog systems. Current trends.

Readings:

  1. Wang, P., Lu, S., Tang, Y., Yan, S., Xia, W., & Xiong, Y. A full-duplex speech dialogue scheme based on large language models. arXiv. 2024.
  2. TBD

Final project presentations (Wed 6.4.25)

Students submitting projects give spotlight talks during lecture time.

Course project report due Saturday 6.7.25 by 11:59PM Pacific. No late days allowed.