Winter 2021

Week 1: Introduction and Acoustic Phonetics


Lecture 1 (Mon 1.11.21)

Course introduction.

Lecture 2 (Wed 1.13.21)

Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription. Readings:

  1. J+M Draft Edition Chapter 25: Phonetics, online pdf
  2. Fun read (optional). The Art of Language Invention. David J Peterson. 2015.

Week 2: Introduction to Dialog

Martin Luther King, Jr. Day. No class. (Mon 1.18.21)

Lecture 3 (Wed 1.20.21)

Overview of dialog: Human conversation. Task-oriented dialog. Dialog systems overview. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, online pdf

Week 3: Machine Learning in Dialog


  • Assignment 1 due on Mon 1.25.21.
  • Assignment 2 released on Wed 1.27.21.

Lecture 4 (Mon 1.25.21)

Dialog systems: Dialog system design. GUS and frame-based dialog systems. Alexa Skills Kit. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
  2. Alexa Skills Kit Documentation.

Lecture 5 (Wed 1.27.21)

Some history of ASR, TTS, and dialog. Course project overview and Q&A.

Week 4: Machine Learning in Dialog (continued)


Lecture 6 (Mon 2.1.21)

Deep Learning Preliminaries. Neural Chatbots as a motivating example. Encoder-decoder models. Lecture by Mike Wu. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
  2. J+M Draft Edition Chapter 9: Sequence Processing with Recurrent Networks. pdf
  3. J+M Draft Edition Chapter 10: Encoder-Decoder Models. pdf
  4. Vaswani, Ashish, et al. Attention is all you need. arXiv 2017.
  5. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv 2014.
  6. Jay Alammar. The Illustrated Transformer. Blog post.

Lecture 7 (Wed 2.3.21)

End-to-End neural approaches for dialog. Readings:

  1. Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., & Heck, L. Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems. 2018. In Proceedings of NAACL-HLT (pp. 2060-2069).
  2. Budzianowski, Paweł, et al. MultiWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. 2018.
  3. Ham, Donghoon, et al. End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. ACL 2020.

Week 5: Automatic Speech Recognition (ASR) Introduction


  • Assignment 2 due on Mon 2.8.21.
  • Assignment 3 released on Wed 2.10.21.

Lecture 8 (Mon 2.8.21)

Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs). Readings:

  1. J+M Draft Edition. Appendix A pdf
  2. J+M Draft Edition. Appendix B pdf
  3. (Optional). If you have never studied language modeling (i.e., have never taken CS124 or CS224N or similar), you should do some additional reading and watch some video lectures on your own.
    • Read J+M 3rd Edition Chapter 4 pages 1-20 (you can skip section 4.5) pdf
    • CS224N Lecture on N-gram and neural network language modeling pdf
    • Lecture videos on introductory NLP including language modeling youtube
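Lecture 8 introduces word error rate (WER). As a minimal illustrative sketch (not code from the course materials): WER is the word-level edit distance between the reference and hypothesis transcripts, normalized by the reference length.

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed as word-level Levenshtein distance via dynamic programming."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # match or substitute
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as an error rate rather than an accuracy.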

Lecture 9 (Wed 2.10.21)

Speech recognition components: Acoustic modeling. Feature extraction. Decoding. Finite state transducers. Readings:

  1. Park, Daniel S., et al. SpecAugment on Large Scale Datasets. ICASSP 2020. arXiv version
  2. (Optional) Weighted Finite-State Transducers in Speech Recognition. Mohri, Mehryar, Fernando Pereira, and Michael Riley. Computer Speech & Language. 2002.
  3. (HMM/CRF basics) Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. 2006.
  4. (HMM/CRF basics) Matt Gormley. HMMs & CRFs. CMU 10-715 Advanced Intro to ML. 2018.

Week 6: Deep Learning in ASR


  • Course Project Proposal due on Wed 2.17.21.

Presidents’ Day. No class. (Mon 2.15.21)

Lecture 10 (Wed 2.17.21)

Deep neural network (DNN) acoustic modeling. DNN-HMM systems. Connectionist Temporal Classification (CTC). Listen, Attend & Spell. Readings:

  1. Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
  2. Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
  3. Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. NAACL-HLT. 2015.

Week 7: Speech Synthesis / Text to Speech (TTS) and DL in ASR (continued)


  • Assignment 3 due on Mon 2.22.21.
  • Assignment 4 released on Fri 2.26.21.

Lecture 11 (Mon 2.22.21)

Recent end-to-end deep learning approaches for speech recognition. Practical considerations for building ethical systems. Readings:

  1. Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
  2. Kim, S., Hori, T. and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. ICASSP. 2017.
  3. Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
  4. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.

Lecture 12 (Wed 2.24.21)

Text to Speech (TTS): Overview. Text normalization. Letter-to-sound. Prosody. Readings: J+M Draft Edition Chapter 26.6: TTS, online pdf
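Text normalization is the first stage of a TTS front end: non-standard tokens (digits, dates, abbreviations) must be expanded into words before letter-to-sound rules apply. A toy sketch of one such pass (illustrative only, with invented helper names; real systems also handle dates, currency, ordinals, and abbreviations):

```python
import re

# Word forms for 0-19 and the tens; enough for a two-digit demo.
ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out 0-99; larger numbers are out of scope for this sketch."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens = TENS[n // 10]
        return tens if n % 10 == 0 else tens + " " + ONES[n % 10]
    raise ValueError("sketch handles 0-99 only")

def normalize(text):
    """Replace each run of digits with its spelled-out form."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
```

Even this tiny case shows why normalization is ambiguous in general: "21" reads differently as a quantity, a year, or part of an address, so production systems classify each non-standard token before expanding it.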

Week 8: TTS (continued)


  • Course Project Milestone due on Wed 3.3.21.

Lecture 13 (Mon 3.1.21)

TTS: Concatenative and parametric approaches. Readings: J+M Draft Edition Chapter 26.6: TTS (cont’d), online pdf

Lecture 14 (Wed 3.3.21)

Guest Lecture: Deep learning for TTS, Alex Barron, Gridspace. Readings:

  1. Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q. Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
  2. Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
  3. Gridspace demo. Spoken dialog system

Week 9: Applications and Meaning Extraction


  • Assignment 4 due on Fri 3.12.21.

Lecture 15 (Mon 3.8.21)

Social meaning extraction: Interpersonal stance. Flirtation. Intoxication. Readings:

  1. Scherer, K. R. Vocal communication of emotion: A review of research paradigms. Speech Communication. 2003. Please read sections 1 and 3 and skim section 2 to get an idea of the previous literature.
  2. Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.
  3. F. Mairesse, M. Walker, M. Mehl, and R. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research. 2007.

Lecture 16 (Wed 3.10.21)

Guest Lecture: Dr. Alborz Geramifard, Facebook. Readings: TBD

Week 10: Project Presentations and Wrap-up


  • Course Project Report due on Mon 3.22.21.

Lecture 17 (Mon 3.15.21)

Project presentations during class.

Lecture 18 (Wed 3.17.21)

Project presentations during class.