Speech Recognition and Understanding Research in Dan Jurafsky's Lab

The lab studies a number of areas in speech recognition, understanding, and synthesis. Our high-level focus is on the use of linguistic knowledge (phonetic, phonological, prosodic, syntactic, semantic, pragmatic) in machine speech processing.

Prosody in Speech Recognition and Synthesis: We are pursuing several projects in prosody; Jason Brenier is working on the automatic detection of prosodic phenomena such as emphatic pitch accents. A new project, joint with Simon King and Mark Steedman at the University of Edinburgh, focuses on the use of prosody in speech synthesis. The postdoctoral position on this project has been filled!
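To make the detection task concrete, here is a minimal sketch of pitch-accent detection framed as word-level classification over acoustic features. The feature set (pitch, energy, and duration z-scores), the toy training data, and the choice of logistic regression are illustrative assumptions, not the lab's actual system.

```python
# Minimal sketch of pitch-accent detection as word-level classification.
# The features (mean F0, energy, duration) and the classifier are illustrative
# assumptions, not the lab's actual system.
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row of acoustic features per word.
# [mean_f0_zscore, rms_energy_zscore, duration_zscore]
X_train = [
    [1.2, 0.9, 0.7],    # accented word: higher pitch, louder, longer
    [1.5, 1.1, 0.4],
    [-0.3, -0.5, -0.2],  # unaccented word
    [-0.8, -0.2, -0.6],
]
y_train = [1, 1, 0, 0]  # 1 = emphatic pitch accent, 0 = no accent

clf = LogisticRegression().fit(X_train, y_train)

# Predict accent status for two new words.
print(clf.predict([[1.0, 0.8, 0.5], [-0.4, -0.3, -0.1]]))
```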

Pronunciation Modeling: A key ASR problem, especially in recognizing human-to-human conversational speech, is predicting how words are likely to be pronounced in context.
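As a concrete (and purely illustrative) picture of the problem, the sketch below shows a probabilistic pronunciation lexicon in which each word maps to several surface variants with probabilities; the words, phone strings, and probabilities are invented, not taken from an actual ASR lexicon.

```python
# Minimal sketch of a probabilistic pronunciation lexicon: each word maps to
# surface pronunciation variants with probabilities. All entries below are
# illustrative, not from a real ASR system.
PRON_LEXICON = {
    "and":     [("ae n d", 0.25), ("ax n d", 0.35), ("ax n", 0.40)],
    "because": [("b ih k ah z", 0.55), ("k ah z", 0.45)],
}

def pronunciation_variants(word):
    """Return the surface variants of a word, most probable first."""
    return sorted(PRON_LEXICON.get(word, []), key=lambda v: -v[1])

for phones, prob in pronunciation_variants("and"):
    print(f"and -> /{phones}/  p={prob:.2f}")
```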

Recognition of Dialect-Accented Speech: In a JHU Summer Workshop 2004 project directed by Thomas Zheng of Tsinghua University and Richard Sproat of the University of Illinois, we are working on robust speech recognition of Mandarin Chinese spoken by speakers with southern (Shanghainese) accents. Related to this project, Stanford linguistics student Rebecca Starr is working on sociolinguistic and phonological causes of variation in southern Mandarin.

PMLA Workshop: With Eric Fosler-Lussier and Bill Byrne, I recently co-organized PMLA-2002 (the Pronunciation Modeling/Lexicon Adaptation Workshop), a satellite conference to the ICSLP-2002 conference.

What Kinds of Pronunciation Variation are Already Modeled by Triphones: We have been trying to understand why improvements in ASR due to pronunciation modeling have proven so elusive. We show that many of the kinds of variation that previous pronunciation models attempted to capture, such as phone substitution and phone reduction, are in fact already well captured by triphones. Our analysis suggests new areas where future pronunciation models should focus instead, including syllable deletion. Jurafsky, Ward, Zhang, Herold, Yu, and Zhang (2001).
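The sketch below illustrates, under simplified assumptions, part of why triphones already capture much of this variation: each phone's acoustic model is conditioned on its left and right neighbors. The phone string for "about" is illustrative.

```python
# Minimal sketch of triphone (context-dependent phone) expansion, showing how
# each phone unit already encodes its left and right context.
def to_triphones(phones):
    """Map a phone sequence to left-context/phone/right-context units."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["ax", "b", "aw", "t"]))
# ['sil-ax+b', 'ax-b+aw', 'b-aw+t', 'aw-t+sil']
```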

The Effect of Disfluencies on Pronunciation Reduction: In a number of recent papers, Alan Bell, Eric Fosler-Lussier, Dan Gildea, Cynthia Girand, Michelle Gregory, Bill Raymond and I have studied what factors cause the pronunciation of words to be reduced and, conversely, what causes words to have full or longer pronunciations. One result is that words are longer when they occur in disfluent contexts, i.e., when they are preceded or followed by pauses, filled pauses, or repetitions. See most recently our JASA paper, Bell et al. (2003).

The Effect of Word Frequency and Probability on Pronunciation Reduction: Our lab has also been working on the effect of word frequency and word predictability or probability on pronunciation variation. We have found that words are more likely to have full pronunciations when they are surprising or unpredictable. See Jurafsky, Bell, Gregory, and Raymond (2000), Bell et al. (2003), Gregory, Raymond, Bell, Fosler-Lussier, and Jurafsky (1999) (ps), and Jurafsky, Bell, Fosler, Girand, and Raymond (1998) (ps).
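For readers unfamiliar with the predictability measures involved, here is a toy sketch of one of them: the conditional (bigram) probability of a word given the previous word. The miniature corpus is invented; the actual studies estimate such probabilities from large conversational corpora and relate them to measures of reduction by regression.

```python
# Toy sketch of word predictability as a bigram conditional probability,
# estimated from counts. The corpus below is invented for illustration.
from collections import Counter

corpus = "i know that i think that it is that".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev) by maximum likelihood estimation."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("think", "that"))  # 1.0: "that" fully predictable after "think" here
print(bigram_prob("i", "know"))      # 0.5: "know" less predictable after "i"
```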

The Effect of Word Sense or Part of Speech on Pronunciation Reduction: We are also studying whether the different senses or parts of speech of ambiguous words have different pronunciations. See Jurafsky, Bell, and Girand (2002).

Recognition of Foreign-Accented Speech: Together with Wayne Ward and other collaborators at Boulder, we have been working on better recognition of foreign-accented English. Here's a paper on recognition of Spanish-accented spontaneous English.

Probabilistic Phonological Rules: Gary Tajchman, Eric Fosler-Lussier, and I have looked at various ways that hand-written phonological rules can be trained probabilistically and then used to augment an ASR lexicon. See Tajchman, Fosler, and Jurafsky 1995 (ps).
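Here is a minimal sketch of the general idea, with invented rules and probabilities (not the trained values from the paper): optional phonological rules are applied to a baseform with some probability, yielding a set of weighted surface pronunciations that could be added to the lexicon.

```python
# Minimal sketch of applying probabilistic phonological rules to a baseform.
# The rules and probabilities are illustrative, not trained values.
RULES = [
    # (name, probability of applying, rewrite function; None = context doesn't match)
    ("final-t-deletion", 0.3, lambda p: p[:-1] if p and p[-1] == "t" else None),
    ("vowel-reduction", 0.5, lambda p: ["ax" if ph == "ih" else ph for ph in p]
                                       if "ih" in p else None),
]

def expand(baseform):
    """Return (pronunciation, probability) pairs after optional rule application."""
    forms = [(baseform, 1.0)]
    for _, p_apply, rewrite in RULES:
        new_forms = []
        for phones, prob in forms:
            out = rewrite(phones)
            if out is None:                       # rule's context doesn't match
                new_forms.append((phones, prob))
            else:                                 # split mass: applied vs. not applied
                new_forms.append((out, prob * p_apply))
                new_forms.append((phones, prob * (1 - p_apply)))
        forms = new_forms
    return forms

for phones, prob in expand(["jh", "ih", "s", "t"]):   # baseform for "just"
    print(" ".join(phones), round(prob, 2))
```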

Language Modeling: One of the most important problems in ASR is predicting the next word the user is likely to say. Among our areas of interest are:

Latent Semantic Analysis: Noah Coccaro and I are exploring the use of Latent Semantic Analysis (LSA), a topic-based or word-association-based model of word-document similarity, as a language model. See for example Coccaro and Jurafsky 1998 (ps) .
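The sketch below shows the flavor of the approach under toy assumptions: a candidate word is scored by the cosine similarity between its vector and a vector summarizing the history, and that score is linearly interpolated with an n-gram probability. The three-dimensional vectors and the interpolation weight are invented; real LSA vectors come from an SVD of a word-document co-occurrence matrix, and the similarity term would need to be renormalized over the vocabulary to yield a true probability.

```python
# Toy sketch of an LSA-style language model component: score a candidate word
# by cosine similarity with the history vector, then interpolate with an
# n-gram probability. Vectors and weights are invented for illustration.
import math

WORD_VECS = {          # hypothetical low-dimensional LSA vectors
    "stock":  [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "falls":  [0.7, 0.1, 0.2],
    "banana": [0.0, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def lsa_interpolated_score(word, history, ngram_prob, lam=0.7):
    hist_vec = [sum(WORD_VECS[w][i] for w in history) for i in range(3)]
    sim = max(cosine(WORD_VECS[word], hist_vec), 0.0)   # clip negative similarity
    return lam * ngram_prob + (1 - lam) * sim            # simple linear interpolation

history = ["stock", "market"]
print(lsa_interpolated_score("falls", history, ngram_prob=0.05))   # topically related: high
print(lsa_interpolated_score("banana", history, ngram_prob=0.05))  # unrelated: low
```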

Stochastic Context-Free Grammars: We have tried various experiments over the years with language models based on stochastic context-free grammars. A typical paper: Jurafsky et al 1995 (ps) .
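As a reminder of the basic mechanics, a stochastic context-free grammar assigns a probability to a derivation by multiplying the probabilities of the rules used; the toy grammar and rule probabilities below are invented, not taken from the cited paper.

```python
# Toy sketch of how a stochastic context-free grammar scores a derivation:
# the probability is the product of the probabilities of the rules used.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Pro",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("Det", ("the",)): 1.0,
    ("N", ("flight",)): 1.0,
    ("Pro", ("i",)): 1.0,
    ("V", ("want",)): 1.0,
}

def derivation_prob(rules):
    """Probability of a derivation = product of its rule probabilities."""
    prob = 1.0
    for lhs, rhs in rules:
        prob *= PCFG[(lhs, rhs)]
    return prob

# Derivation for "i want the flight":
rules = [
    ("S", ("NP", "VP")), ("NP", ("Pro",)), ("Pro", ("i",)),
    ("VP", ("V", "NP")), ("V", ("want",)),
    ("NP", ("Det", "N")), ("Det", ("the",)), ("N", ("flight",)),
]
print(derivation_prob(rules))  # 0.4 * 0.7 * 0.6 = 0.168
```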

Dialogue Modeling: We also work on probabilistic models of dialogue, especially together with our team from the 1997 Johns Hopkins Workshop on Innovative Techniques in LVCSR (Becky Bates, Noah Coccaro, Rachel Martin, Marie Meteer, Klaus Ries, Liz Shriberg, Andreas Stolcke, Paul Taylor, Carol Van Ess-Dykema, and me). We are especially interested in the automatic detection of dialogue structure, such as automatic labeling of speech acts or dialogue acts. See the publications page for various results from this work, including for example: