Syllabus

Spring 2022

Week 1: Introduction and Acoustic Phonetics

Deliverables

  • Assignment 1 released on Tue 3.29.22.

Lecture 1 (Tue 3.29.22)

Course introduction. Slides.

Lecture 2 (Thu 3.31.22)

Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription. Slides. Readings:

  1. J+M Draft Edition Chapter 25: Phonetics, online pdf
  2. Fun read (optional). The Art of Language Invention. David J Peterson. 2015.

Week 2: Introduction to Dialog

Lecture 3 (Tue 4.5.22)

Overview of dialog: Human conversation. Task-oriented dialog. Dialog systems overview. Slides. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, online pdf

Lecture 4 (Thu 4.7.22)

Dialog systems: Dialog system design. GUS and frame-based dialog systems. Alexa Skills Kit. Slides. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
  2. Alexa Skills Kit Documentation.

Week 3: Machine Learning in Dialog

Course Project Overview released on Sun 4.10.22.

Deliverables

  • Assignment 1 due by Monday 4.11.22 11:59PM Pacific.
  • Assignment 2 released on Tue 4.12.22.

Lecture 5 (Tue 4.12.22)

Deep Learning Preliminaries. Neural Chatbots as a motivating example. Encoder-decoder models. Slides. Readings:

  1. J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
  2. J+M Draft Edition Chapter 9: Sequence Processing with Recurrent Networks. pdf
  3. J+M Draft Edition Chapter 10: Encoder-Decoder Models. pdf
  4. Vaswani, Ashish, et al. Attention is all you need. arXiv 2017.
  5. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv 2014.
  6. The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time

Lecture 6 (Thu 4.14.22)

End-to-End neural approaches for dialog. Slides. Readings:

  1. Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., & Heck, L. Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems. 2018. In Proceedings of NAACL-HLT (pp. 2060-2069).
  2. Budzianowski, Paweł, et al. MultiWOZ–A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. 2018.
  3. Ham, Donghoon, et al. End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. ACL 2020.

Week 4: Course Project & Automatic Speech Recognition (ASR) Introduction

Lecture 7 (Tue 4.19.22)

Some history of ASR, TTS, and dialog. Course project overview and Q&A. Slides.

Lecture 8 (Thu 4.21.22)

Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs). Slides. Readings:

  1. J+M Draft Edition. Appendix A pdf
  2. J+M Draft Edition. Appendix B pdf
  3. (Optional). If you have never studied language modeling (i.e., have never taken CS124 or CS224N or similar) you should do some additional reading and video lecture watching on your own.
    • Read J+M 3rd Edition Chapter 4 pages 1-20 (you can skip section 4.5) pdf
    • CS224N Lecture on N-gram and neural network language modeling pdf
    • Lecture videos on introductory NLP including language modeling youtube

Week 5: Automatic Speech Recognition

Deliverables

  • Assignment 2 due by Monday 4.25.22 11:59PM Pacific.
  • Assignment 3 released on Tue 4.26.22.

Lecture 9 (Tue 4.26.22)

Speech recognition: Acoustic modeling. Deep neural network (DNN) acoustic modeling. HMM-DNN systems. Feature extraction. Slides. Readings:

  1. Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
  2. Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., and Le, Q. V. Specaugment on large scale datasets. ICASSP. 2020. arXiv version
  3. (HMM/CRF basics) Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. 2006.
  4. (HMM/CRF basics) Matt Gormley. HMMs & CRFs. CMU 10-715 Advanced Intro to ML. 2018.

Lecture 10 (Thu 4.28.22)

Connectionist Temporal Classification (CTC). Listen, Attend & Spell (LAS). Multi-task objectives for end-to-end ASR. Slides. Readings:

  1. Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
  2. Maas, A.*, Xie, Z.*, Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. NAACL-HLT. 2015. (* indicates equal contribution)
  3. Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
  4. Kim, S., Hori, T. and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. ICASSP. 2017.

Week 6: Advanced ASR

Deliverables

  • Course Project Proposal due by Tue 5.3.22 11:59PM Pacific.

Lecture 11 (Tue 5.3.22)

Recent end-to-end deep learning approaches for speech recognition. Practical considerations for building ethical systems. Decoding with finite state transducers. Slides. Readings:

  1. Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
  2. Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.

Lecture 12 (Thu 5.5.22)

Guest Lecture: Graph search and Lattices in ASR. Dr. Arlo Faria. Slides. Readings:

  1. Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlíček, P., Qian, Y. and Riedhammer, K. Generating exact lattices in the WFST framework. ICASSP. 2012.
  2. (Optional) Mohri, M., Pereira, F., and Riley, M. Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559-584). Springer. 2008.

Week 7: Spoken language products with modern toolkits

Deliverables

  • Assignment 3 due by Monday 5.9.22 11:59PM Pacific.
  • Assignment 4 released on Tue 5.10.22.

Lecture 13 (Tue 5.10.22)

Foundation models for spoken language. Interactive session: Using the SpeechBrain ASR toolkit. Slides. Readings:

  1. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
  2. Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J. & Chou, J.C., 2021. SpeechBrain: A general-purpose speech toolkit. arXiv. 2021.

Lecture 14 (Thu 5.12.22) ZOOM ONLY

Guest Lecture: Ello: A case study in building spoken language products. Catalin Voss, Ello. Readings:

  1. T. Bluche, M. Primet, & T. Gisselbrecht. Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks. arXiv. 2020.
  2. K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, & B. Kingsbury. End-to-End ASR-free Keyword Search from Speech. IEEE J. Signal Process. 2017.
  3. N. Sacchi, A. Nanchen, M. Jaggi, & M. Cerňak. Open-Vocabulary Keyword Spotting with Audio and Text Embeddings. Interspeech 2019.

Week 8: Speech Synthesis / Text to Speech (TTS)

Deliverables

  • Course Project Milestone due by Tue 5.17.22 11:59PM Pacific.

Lecture 15 (Tue 5.17.22)

NOTE: In room 300-300, not the usual lecture venue.

Text to Speech (TTS): Overview. Text normalization. Letter-to-sound. Prosody. Slides. Readings:

  1. J+M Draft Edition Chapter 26.6: TTS online pdf

Lecture 16 (Thu 5.19.22)

Guest Lecture: Deep learning for TTS, Alex Barron. Slides. Readings:

  1. Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q., Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
  2. Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
  3. Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.
  4. Gridspace demo. Spoken dialog system

Week 9: Practical TTS and Meaning Extraction

Deliverables

  • Assignment 4 due by Monday 5.23.22 11:59PM Pacific.

Lecture 17 (Tue 5.24.22)

Getting TTS working well: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches. Slides. Readings:

  1. J+M Draft Edition Chapter 26.6: TTS (cont’d) online pdf

Lecture 18 (Thu 5.26.22)

Social meaning extraction: Interpersonal stance. Flirtation. Intoxication. Slides. Readings:

  1. Scherer, K. R. Vocal communication of emotion: A review of research paradigms. Speech Communication. 2003. Please read sections 1 and 3 and skim section 2 for an overview of the previous literature.
  2. Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.
  3. F. Mairesse, M. Walker, M. Mehl, and R. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research. 2007.

Week 10: Poster Presentations and Wrap-up

Final project poster session (Tue 5.31.22)

Present posters at the in-person session, 5:30pm - 7:30pm. Location TBD.

Course Project Report due by Saturday 6.4.22 11:59PM Pacific. No late days allowed.