Stanford University


My research examines the effects of phonetic variation in speech on the perception, recognition, and representation of spoken words.  My perspective is that speech is a multi-faceted information source and that phonetic variation is critical to the process of understanding spoken words, influencing the encoding, processing, and representation of spoken words.  Currently, I am working on a theory of the socially-weighted encoding of spoken words, and my students and I are testing the predictions of that theory.  Some of our current projects are summarized below.

National Science Foundation Grant 1226963: Understanding spoken words: Effects of phonetics, phonological variation, and speech mode in speech perception, 2012 – 2016

Every day, we face variation in language. As readers, we see words printed in different fonts, sizes and typefaces, typically static on a page. As listeners, we hear a speech signal that is riddled with variation. We are exposed to words, but a single word is produced differently each time it is uttered. These words stream by listeners at a rate of about 5 – 7 syllables per second, further complicating the listeners’ task. How listeners map a speech signal onto meaning despite massive variation is an issue central to linguistic theory. One problem we currently face is that the vast majority of these different realizations of words are understood equally well by listeners. We know that listeners have at their disposal detailed and rich lexical representations. But, as I show here, even these cannot account for a listener’s ability to take all the individual parts of a word that often vary drastically and understand them as quickly and adeptly as they do. Given any window of speech, listeners are presented with information about sounds, sound patterns, words, and speakers and their intentions, emotions, accents and other social characteristics. I propose that advancement in understanding the perception and recognition of spoken words will come from examining the ways in which listeners use these ever-present cues together. In this proposal, I begin to pave this path by investigating the effects of phonetic variation that co-varies with phonological variants and social speech modes on the perception and recognition of spoken words. The specific aims of this proposal are to show that:

  1. Listeners rely on the acoustic values of adjacent sounds to facilitate the perception of upcoming phonological variants, even those rarely uttered in speech, and
  2. Acoustic patterns are stored along with social representations that influence speech perception at a level much lower than once thought.

The investigation of these two claims is critical to answering the age-old question of how listeners understand spoken words. For many phonological variant pairs (e.g., [t] or tap in atom), studies have shown both a cost and a benefit for each variant in a pair in speech perception. These data have been used to argue for either specific or abstract representations. But, different phonological variants do not occur in comparable phonetic contexts. Examining the role of the phonetic cues that typically co-occur with different phonological variants will solve this conundrum and provide insight into how the perceptual system maps even rarely uttered sounds to meaning with relative ease. More broadly, this project provides greater understanding of the role of detailed lexical representations in speech perception, suggesting they have a smaller role than thought.
The finding that phonetic detail is stored in lexical representations has considerably advanced the field. It does not, though, imply that phonetic patterns are stored solely in lexical representations. Separating effects due to phonetics and those due to lexical representations enables us to add a new dimension to theories of speech perception – a direct link between linguistic and social experience. This is accomplished by investigating the claim that acoustic patterns are stored with social representations independent of the lexicon. This line of research will show that the activation of social constructs influences the low-level categorization of sounds. This claim, if supported, will have major implications for how linguistic units are stored and recalled by listeners. This research program has the potential to advance theory development and prompts us to reconsider the role of acoustic patterns and their associations in theory more broadly.

Which is better:  Half of a clearly articulated word or a whole casually-articulated word?
With Jeremy Calder, Annette D’Onofrio, Kevin McGowan, Teresa Pratt

Previous work in spoken word recognition and speech perception has shown two seemingly conflicting patterns. While some studies have shown a processing benefit for more frequent word variants (i.e., in a casual speech mode), others have found a benefit for more canonical word forms (i.e., in a careful speech mode). This study aims to reconcile these findings, proposing that a different type of processing applies to each speech mode: top-down processing for casual speech, and bottom-up processing for careful speech. In Exp. 1, listeners in an auditory priming task heard natural (non-spliced) sentences spoken in either a careful or casual speech mode. Sentences with high semantic predictability served as primes.  At the end of the sentence, listeners were presented with a visual probe that was either the final word heard in the prime sentence, or an unrelated probe.  Preliminary results suggest that, regardless of speech style, reaction times are faster for related targets in the semantically predictable conditions than for unrelated targets. Crucially, responses to the target word in the casual condition are delayed compared to careful speech for semantically predictable sentences with unrelated probes. This suggests that unrelated words in casual speech are associated with a processing cost, reflecting the top-down weighting of processing casual speech. In Exp. 2, we move the same probes earlier in the sentence, to the first semantically relevant word, for high-predictability sentences only.  The time point corresponds to the end of the critical word in casually articulated sentences and, in the carefully articulated frame, to the point within the same word at which the duration of the casually articulated word has elapsed.  Data collection is in progress, and we hope to better understand the information provided in a window of time that is filled either by an entire, but highly reduced word or by a partial, but clearly articulated word.

Detangling emotion from the words we speak: Simultaneous and independent processing of words and emotions, with Seung Kyung Kim

Phonetic variation in speech informs listeners about sounds and words and about talkers (e.g., emotion). In speech perception, this indexical variation is accommodated via an exemplar lexicon.  This assumes that lexical and indexical information are coupled.  We investigate the effect of words produced with different emotions on the recognition of spoken words. First, we compared the recognition of emotion word targets (UPSET) preceded by semantically unrelated primes spoken with emotionally related or unrelated prosody (pineapple_[AngryVoice] or pineapple_[NeutralVoice]). Second, we investigated the effects of emotion on semantic priming (pineapple_[AngryVoice]/[NeutralVoice] - FRUIT). Recognition of both emotionally related and semantically related targets was facilitated by primes with angry prosody. These data suggest that indexical variation in speech influences the recognition process beyond detailed lexical representations. We suggest listeners simultaneously process acoustic variation in speech for indexical and lexical meaning and argue that emotional prosody activates emotion features and categories, independent of lexical access.

Voice-specific lexicons: Effects of indexical phonetic variation on semantic activation, with Ed King

The role of indexical variation in spoken word recognition is constrained to acoustically rich lexical representations.  Theoretically, lexical activation depends on indexical variation, but subsequent processes like associative semantic spread depend on activation strength, not indexical variation.  Social psychological theories view indexical variation as integral to online processes such as persona construal.  Therefore, information gleaned from indexical variation might pervade spoken word recognition more broadly. We investigate the effects of indexical variation on semantic activation in word-association and semantic-priming paradigms.  Across three studies, we show that top associates depend on the voice of the associative probe (man’s_voice: space-time, woman’s_voice: space-star, child’s_voice: space-planet).  We also find that semantic priming is stronger for voice-congruent (spacewoman-star) than voice-incongruent targets (spacewoman-time).  We argue that indexical variation affects spoken word recognition beyond an episodic lexicon and provide an account capturing effects of learned associations between acoustic patterns and linguistic and social features/categories in spoken language processing.