Syllabus

Week 1: Introduction and Acoustic Phonetics

Deliverables

Assignment 1 released on Tue 3.29.22.

Lecture 1 (Tue 3.29.22)

Course introduction. Slides.

Lecture 2 (Thu 3.31.22)

Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription. Slides. Readings:

J+M Draft Edition Chapter 25: Phonetics, online pdf
Fun read (optional). The Art of Language Invention. David J Peterson. 2015.

Week 2: Introduction to Dialog

Lecture 3 (Tue 4.5.22)

Overview of dialog: Human conversation. Task-oriented dialog. Dialog systems overview. Slides. Readings:

J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, online pdf

Lecture 4 (Thu 4.7.22)

Dialog systems: Dialog system design. GUS and frame-based dialog systems. Alexa Skills Kit. Slides. Readings:

J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
Alexa Skills Kit Documentation.
- Overview
- Understand Custom Skills. You do not need to cover adding visual components to skills.
- Interaction Model Design

Week 3: Machine Learning in Dialog

Course Project Overview released on Sun 4.10.22.

Deliverables

Assignment 1 due by Monday 4.11.22 11:59PM Pacific.
Assignment 2 released on Tue 4.12.22.

Lecture 5 (Tue 4.12.22)

Deep Learning Preliminaries. Neural Chatbots as a motivating example. Encoder-decoder models. Slides. Readings:

J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots (cont’d), online pdf
J+M Draft Edition Chapter 9: Sequence Processing with Recurrent Networks. pdf
J+M Draft Edition Chapter 10: Encoder-Decoder Models. pdf
Vaswani, Ashish, et al. Attention is all you need. arXiv 2017.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv 2014.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time

Lecture 6 (Thu 4.14.22)

End-to-End neural approaches for dialog. Slides. Readings:

Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., & Heck, L. Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems. 2018. In Proceedings of NAACL-HLT (pp. 2060-2069).
Budzianowski, Paweł, et al. MultiWOZ–A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. 2018.
Ham, Donghoon, et al. End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. ACL 2020.

Week 4: Course Project & Automatic Speech Recognition (ASR) Introduction

Lecture 7 (Tue 4.19.22)

Some history of ASR, TTS, and dialog. Course project overview and Q&A. Slides.

Lecture 8 (Thu 4.21.22)

Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs). Slides. Readings:

J+M Draft Edition. Appendix A pdf
J+M Draft Edition. Appendix B pdf
(Optional). If you have never studied language modeling (i.e., have never taken CS124 or CS224N or similar) you should do some additional reading and video lecture watching on your own.
- Read J+M 3rd Edition Chapter 4 pages 1-20 (you can skip section 4.5) pdf
- CS224N Lecture on N-gram and neural network language modeling pdf
- Lecture videos on introductory NLP including language modeling youtube

Week 5: Automatic Speech Recognition

Deliverables

Assignment 2 due by Monday 4.25.22 11:59PM Pacific.
Assignment 3 released on Tue 4.26.22.

Lecture 9 (Tue 4.26.22)

Speech recognition: Acoustic modeling. Deep neural network (DNN) acoustic modeling. HMM-DNN systems. Feature extraction. Slides. Readings:

Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., and Le, Q. V. Specaugment on large scale datasets. ICASSP. 2020. arXiv version
(HMM/CRF basics) Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. 2006.
(HMM/CRF basics) Matt Gormley. HMMs & CRFs. CMU 10-715 Advanced Intro to ML. 2018.

Lecture 10 (Thu 4.28.22)

Connectionist Temporal Classification (CTC). Listen, Attend & Spell (LAS). Multi-task objectives for end-to-end ASR. Slides. Readings:

Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
Maas, A., Xie, Z., Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. NAACL-HLT. 2015.
Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
Kim, S., Hori, T. and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. ICASSP. 2017.

Week 6: Advanced ASR

Deliverables

Course Project Proposal due by Tue 5.3.22 11:59PM Pacific.

Lecture 11 (Tue 5.3.22)

Recent end-to-end deep learning approaches for speech recognition. Practical considerations for building ethical systems. Decoding with finite state transducers. Slides. Readings:

Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.

Lecture 12 (Thu 5.5.22)

Guest Lecture: Graph search and lattices in ASR. Dr. Arlo Faria. Slides. Readings:

Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlíček, P., Qian, Y. and Riedhammer, K. Generating exact lattices in the WFST framework. ICASSP. 2012.
(Optional) Mohri, M., Pereira, F., and Riley, M. Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559-584). Springer. 2008.

Week 7: Spoken language products with modern toolkits

Deliverables

Assignment 3 due by Monday 5.9.22 11:59PM Pacific.
Assignment 4 released on Tue 5.10.22.

Lecture 13 (Tue 5.10.22)

Foundation models for spoken language. Interactive session: Using the SpeechBrain ASR toolkit. Slides. Readings:

Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J. & Chou, J.C. SpeechBrain: A general-purpose speech toolkit. arXiv. 2021.

Lecture 14 (Thu 5.12.22) ZOOM ONLY

Guest Lecture: Ello: A case study in building spoken language products. Catalin Voss, Ello. Readings:

T. Bluche, M. Primet, & T. Gisselbrecht. Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks. arXiv. 2020.
K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, & B. Kingsbury. End-to-End ASR-free Keyword Search from Speech. IEEE J. Signal Process. 2017.
N. Sacchi, A. Nanchen, M. Jaggi, & M. Cerňak. Open-Vocabulary Keyword Spotting with Audio and Text Embeddings. Interspeech 2019.

Week 8: Speech Synthesis / Text to Speech (TTS)

Deliverables

Course Project Milestone due by Tue 5.17.22 11:59PM Pacific.

Lecture 15 (Tue 5.17.22)

NOTE: In room 300-300, not the usual lecture venue.

Text to Speech (TTS): Overview. Text normalization. Letter-to-sound. Prosody. Slides. Readings:

J+M Draft Edition Chapter 26.6: TTS online pdf

Lecture 16 (Thu 5.19.22)

Guest Lecture: Deep learning for TTS, Alex Barron. Slides. Readings:

Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q. Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.
Gridspace demo. Spoken dialog system

Week 9: Practical TTS and Meaning Extraction

Deliverables

Assignment 4 due by Monday 5.23.22 11:59PM Pacific.

Lecture 17 (Tue 5.24.22)

Getting TTS working well: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches. Slides. Readings:

J+M Draft Edition Chapter 26.6: TTS (cont’d) online pdf

Lecture 18 (Thu 5.26.22)

Social meaning extraction: Interpersonal stance. Flirtation. Intoxication. Slides. Readings:

Scherer, K. R. Vocal communication of emotion: A review of research paradigms. Speech Communication. 2003. Please read sections 1 and 3 and skim section 2 to get an idea of the previous literature.
Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.
F. Mairesse, M. Walker, M. Mehl, and R. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research. 2007.

Week 10: Poster Presentations and Wrap-up

Final project poster session (Tue 5.31.22)

Present posters at the in-person session, 5:30pm - 7:30pm. Location TBD.

Spring 2022

Course Project Report due by Saturday 6.4.22 11:59PM Pacific. No late days allowed.