Week 1: Introduction and Acoustic Phonetics
Deliverables
- Assignment 1 released on Tue 3.29.22.
Lecture 1 (Tue 3.29.22)
Course introduction. Slides.
Lecture 2 (Thu 3.31.22)
Phonetics: Articulatory phonetics. Acoustics. ARPAbet transcription. Slides. Readings:
- J+M Draft Edition Chapter 25: Phonetics, online pdf
- Fun read (optional). The Art of Language Invention. David J Peterson. 2015.
Week 2: Introduction to Dialog
Lecture 3 (Tue 4.5.22)
Overview of dialog: Human conversation. Task-oriented dialog. Dialog systems overview. Slides. Readings:
- J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, online pdf
Lecture 4 (Thu 4.7.22)
Dialog systems: Dialog system design. GUS and frame-based dialog systems. Alexa Skills Kit. Slides. Readings:
- J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, (cont’d) online pdf
- Alexa Skills Kit Documentation:
  - Overview
  - Understand Custom Skills. You do not need to cover adding visual components to skills.
  - Interaction Model Design
Week 3: Machine Learning in Dialog
Course Project Overview released on Sun 4.10.22.
Deliverables
- Assignment 1 due by Monday 4.11.22 11:59PM Pacific.
- Assignment 2 released on Tue 4.12.22.
Lecture 5 (Tue 4.12.22)
Deep Learning Preliminaries. Neural Chatbots as a motivating example. Encoder-decoder models. Slides. Readings:
- J+M Draft Edition Chapter 24: Dialogue Systems and Chatbots, (cont’d) online pdf
- J+M Draft Edition Chapter 9: Sequence Processing with Recurrent Networks. pdf
- J+M Draft Edition Chapter 10: Encoder-Decoder Models. pdf
- Vaswani, Ashish, et al. Attention is all you need. arXiv 2017.
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. arXiv 2014.
- The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time
Lecture 6 (Thu 4.14.22)
End-to-End neural approaches for dialog. Slides. Readings:
- Liu, B., Tür, G., Hakkani-Tür, D., Shah, P., & Heck, L. Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems. 2018. In Proceedings of NAACL-HLT (pp. 2060-2069).
- Budzianowski, Paweł, et al. MultiWOZ–A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. 2018.
- Ham, Donghoon, et al. End-to-end neural pipeline for goal-oriented dialogue systems using GPT-2. ACL 2020.
Week 4: Course Project & Automatic Speech Recognition (ASR) Introduction
Lecture 7 (Tue 4.19.22)
Some history of ASR, TTS, and dialog. Course project overview and Q&A. Slides.
Lecture 8 (Thu 4.21.22)
Speech recognition overview: Noisy channel model. Word error rate metrics. Hidden Markov models (HMMs). Slides. Readings:
- J+M Draft Edition. Appendix A pdf
- J+M Draft Edition. Appendix B pdf
- (Optional). If you have never studied language modeling (i.e., have never taken CS124 or CS224N or similar) you should do some additional reading and video lecture watching on your own.
Week 5: Automatic Speech Recognition
Deliverables
- Assignment 2 due by Monday 4.25.22 11:59PM Pacific.
- Assignment 3 released on Tue 4.26.22.
Lecture 9 (Tue 4.26.22)
Speech recognition: Acoustic modeling. Deep neural network (DNN) acoustic modeling. HMM-DNN systems. Feature extraction. Slides. Readings:
- Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine. 2012.
- Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., and Le, Q. V. Specaugment on large scale datasets. ICASSP. 2020. arXiv version
- (HMM/CRF basics) Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields. 2006.
- (HMM/CRF basics) Matt Gormley. HMMs & CRFs. CMU 10-715 Advanced Intro to ML. 2018.
Lecture 10 (Thu 4.28.22)
Connectionist Temporal Classification (CTC). Listen, Attend & Spell (LAS). Multi-task objectives for end-to-end ASR. Slides. Readings:
- Graves, A., and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. ICML. 2014.
- Maas, A.*, Xie, Z.*, Jurafsky, D., & Ng, A. Lexicon-free conversational speech recognition with neural networks. NAACL-HLT. 2015. (* indicates equal contribution)
- Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. ICASSP. 2016. arXiv preprint.
- Kim, S., Hori, T. and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. ICASSP. 2017.
Week 6: Advanced ASR
Deliverables
- Course Project Proposal due by Tue 5.3.22 11:59PM Pacific.
Lecture 11 (Tue 5.3.22)
Recent end-to-end deep learning approaches for speech recognition. Practical considerations for building ethical systems. Decoding with finite state transducers. Slides. Readings:
- Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y. and Pang, R. Conformer: Convolution-augmented transformer for speech recognition. arXiv. 2020.
- Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J.R., Jurafsky, D. and Goel, S. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences. 2020.
Lecture 12 (Thu 5.5.22)
Guest Lecture: Graph search and Lattices in ASR. Dr. Arlo Faria. Slides. Readings:
- Povey, D., Hannemann, M., Boulianne, G., Burget, L., Ghoshal, A., Janda, M., Karafiát, M., Kombrink, S., Motlíček, P., Qian, Y. and Riedhammer, K. Generating exact lattices in the WFST framework. ICASSP. 2012.
- (Optional) Mohri, M., Pereira, F., and Riley, M. Speech recognition with weighted finite-state transducers. In Springer Handbook of Speech Processing (pp. 559-584). Springer. 2008.
Week 7: Spoken language products with modern toolkits
Deliverables
- Assignment 3 due by Monday 5.9.22 11:59PM Pacific.
- Assignment 4 released on Tue 5.10.22.
Lecture 13 (Tue 5.10.22)
Foundation models for spoken language. Interactive session: Using the SpeechBrain ASR toolkit. Slides. Readings:
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33. 2020.
- Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J. & Chou, J.C. SpeechBrain: A general-purpose speech toolkit. arXiv. 2021.
Lecture 14 (Thu 5.12.22) ZOOM ONLY
Guest Lecture: Ello: A case study in building spoken language products. Catalin Voss, Ello. Readings:
- T. Bluche, M. Primet, & T. Gisselbrecht. Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks. arXiv. 2020.
- K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran, & B. Kingsbury. End-to-End ASR-free Keyword Search from Speech. IEEE J. Signal Process. 2017.
- N. Sacchi, A. Nanchen, M. Jaggi, & M. Cerňak. Open-Vocabulary Keyword Spotting with Audio and Text Embeddings. Interspeech 2019.
Week 8: Speech Synthesis / Text to Speech (TTS)
Deliverables
- Course Project Milestone due by Tue 5.17.22 11:59PM Pacific.
Lecture 15 (Tue 5.17.22)
NOTE: In room 300-300, not the usual lecture venue.
Text to Speech (TTS): Overview. Text normalization. Letter-to-sound. Prosody. Slides. Readings:
- J+M Draft Edition Chapter 26.6: TTS online pdf
Lecture 16 (Thu 5.19.22)
Guest Lecture: Deep learning for TTS, Alex Barron. Slides. Readings:
- Wang, Y., Skerry-Ryan, R.J., Stanton, D., Wu, Y., Weiss, R.J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S. and Le, Q. Tacotron: Towards end-to-end speech synthesis. arXiv. 2017.
- Oord, A.V.D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv. 2016.
- Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems 32. 2019.
- Gridspace demo. Spoken dialog system
Week 9: Practical TTS and Meaning Extraction
Deliverables
- Assignment 4 due by Monday 5.23.22 11:59PM Pacific.
Lecture 17 (Tue 5.24.22)
Getting TTS working well: Data collection. Evaluation. Signal processing. Concatenative and parametric approaches. Slides. Readings:
- J+M Draft Edition Chapter 26.6: TTS (cont’d) online pdf
Lecture 18 (Thu 5.26.22)
Social meaning extraction: Interpersonal stance. Flirtation. Intoxication. Slides. Readings:
- Scherer, K. R. Vocal communication of emotion: A review of research paradigms. Speech Communication. 2003. Please read sections 1 and 3 and skim section 2 to get an idea of the previous literature.
- Rajesh Ranganath, Dan Jurafsky, and Daniel A. McFarland. Detecting friendly, flirtatious, awkward, and assertive speech in speed-dates. Computer Speech and Language. 2013.
- F. Mairesse, M. Walker, M. Mehl, and R. Moore. Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research. 2007.
Week 10: Poster Presentations and Wrap-up
Final project poster session (Tue 5.31.22)
Present posters at the in-person session, 5:30pm - 7:30pm. Location TBD.
CS224S