STANFORD LINGUIST 138/238     -     SYMBSYS 138
Introduction to Computer Speech and Language Processing 
Autumn 2004,     Dan Jurafsky

Speech Synthesis Part I: Articulatory Phonetics, ARPAbet transcription, TTS Architecture, Festival, Text Normalization, Letter-to-Sound

And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By "they", I mean computers; I doubt scientists will ever be able to talk to us.)
Dave Barry

IP notice: These lecture notes include direct quotes from many web sites, especially Alan Black's, but also many others. Thus any text in any of these lecture notes should be viewed as being stolen from other people; all credit and rights go to them.
Introduction to TTS

  1. Applications of Speech Synthesis (TTS)
  2. Demonstrations
  3. Speech Synthesis Overview
    Articulatory Phonetics
  4. Acoustic Phonetics vs Articulatory Phonetics
  5. Textbook on Phonetics: Peter Ladefoged. 2001. A Course in Phonetics. 4th edition. Harcourt.
  6. Or here's the website for his Vowels and Consonants
  7. The Articulatory Process: Voicing
  8. The Vocal Organs
  9. Places of Articulation
  10. Manner of Articulation
  11. Manner of Articulation for Vowels
  12. Some movies of articulation
  13. Tones:
  14. Context Dependence of phones (from Peter Ladefoged's website at UCLA)
    The ARPAbet
  15. ARPAbet (and the IPA)
    The History of Speech Synthesis
  16. History of Speech Synthesis
    Speech Synthesis Architectures
  17. Major Components of Speech Synthesis Systems
    1. Text Processing (also called Text Normalization): Analysis of raw and labelled text into identifiable words
      • Sample problems:
        • He stole $100 million from the bank
        • It's 13 St. Andrews St.
        • The home page is http://www.stanford.edu
        • yes, see you the following tues, that's 11/12/01
      • Steps
        • Identify tokens in text
        • Chunk tokens into reasonably sized sections
        • Map tokens to words
        • Identify types for words
    2. Linguistic/Prosodic Processing: from words to segments, F0, durations
      • How to pronounce a word?
      • Look it up in a lexicon
      • But what about languages like Turkish or Finnish, where a single word can pack in many morphemes?
        • Turkish word: uygarlaStIramadIklarImIzdanmISsInIzcasIna
        • Meaning: "(behaving) as if you are among those whom we could not civilize/cause to become civilized"
        • Breakdown into morphemes:
          uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
          civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
      • Also, any lexicon will always be missing some words (names, other proper nouns, rare words, etc.)
      • So need letter-to-sound rules
      • Homograph disambiguation (wind, live, read)
      • Prosodic Phrasing
        • need to break utterance into phrases
        • punctuation is useful but not sufficient
      • Intonation
        • Prediction of accents: which syllables should be accented
        • Realization of F0 contour: given accents/tones, generate F0 contour.
    3. Waveform synthesis: From segments, F0, and duration to a waveform.
      • Three possible techniques:
        1. concatenative synthesis
          • diphone synthesis
          • unit selection synthesis
        2. formant synthesis
        3. articulatory synthesis
      • Collecting data
        • Collecting diphones: need to record diphones in correct contexts
        • /l/ sounds different in onset than in coda position, /t/ is sometimes flapped, etc.
        • need an EGG (electroglottograph), a quiet recording room, etc.
        • then need to label them very precisely
      • Unit selection: how to pick the right unit?
      • Joining the units
        • dumb (just stick 'em together)
        • PSOLA (Pitch-Synchronous Overlap and Add)
        • MBROLA (Multi-Band Resynthesis OverLap Add)
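The difference between the "dumb" join and an overlap-and-add join can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: units are plain lists of float samples, the function names `join_dumb` and `join_overlap_add` are made up for this sketch, and the crossfade is a simple linear one. It is not PSOLA, which would additionally align the overlap region with pitch marks, and it is not MBROLA.

```python
def join_dumb(a, b):
    """Concatenate two units with no smoothing at the boundary."""
    return list(a) + list(b)


def join_overlap_add(a, b, overlap):
    """Join two units by crossfading the last `overlap` samples of `a`
    with the first `overlap` samples of `b` (plain linear overlap-add)."""
    if overlap <= 0 or overlap > min(len(a), len(b)):
        raise ValueError("overlap must be between 1 and the shorter unit length")
    head = list(a[:-overlap])          # part of a that is left untouched
    tail = list(b[overlap:])           # part of b that is left untouched
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)    # weight of b rises across the overlap
        mixed.append((1 - w) * a[len(a) - overlap + i] + w * b[i])
    return head + mixed + tail
```

With two four-sample units and `overlap=2`, `join_dumb` returns eight samples with an abrupt step at the boundary, while `join_overlap_add` returns six samples whose middle two values ramp from one unit to the other.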
    Lexicons and Lexical Entries in Festival
  18. Lexicons and Lexical Entries

    Lookup words with

    (lex.lookup WORD PART-OF-SPEECH)
    
    or
    (lex.lookup WORD)
    

    e.g.

    (lex.lookup 'reagan)
    
    (lex.lookup 'object 'v)
    
    (lex.lookup 'object 'n)
    
    (lex.add.entry
     '("reagan" n (((r ey) 1) ((g ax n) 0))))
    

    Format is (WORD POS (SYL0 SYL1 ...))

    Syllable is ((PHONE0 PHONE1 ...) STRESS)

    (lex.lookup 'cepstra)
    ("cepstra" n (((k eh p) 1) ((s t r aa) 0)))
    

    To find out what the phoneme set is and what the possible formats are, it is often useful to look up similar words with the lex.lookup function.
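To make the entry format concrete, here is a rough Python model of a lexicon that stores entries in the same (WORD POS (SYL0 SYL1 ...)) shape. It is a sketch, not Festival's implementation. The names `add_entry`, `lookup` and `phones` are invented for this example. The "reagan" and "cepstra" entries are the ones shown above, while the two "object" transcriptions are illustrative guesses rather than entries copied from a Festival lexicon. One more assumption: this `lookup` returns None when nothing matches, which may not be how lex.lookup behaves.

```python
LEXICON = {}


def add_entry(word, pos, syllables):
    """Store an entry as (WORD, POS, [([PHONE, ...], STRESS), ...])."""
    LEXICON.setdefault(word, []).append((word, pos, syllables))


def lookup(word, pos=None):
    """Return the first entry for `word`, restricted to `pos` if given."""
    for entry in LEXICON.get(word, []):
        if pos is None or entry[1] == pos:
            return entry
    return None  # assumption of this sketch: no fallback for missing words


def phones(entry):
    """Flatten an entry's syllables into a single list of phones."""
    return [phone for syllable, _stress in entry[2] for phone in syllable]


add_entry("reagan", "n", [(["r", "ey"], 1), (["g", "ax", "n"], 0)])
add_entry("cepstra", "n", [(["k", "eh", "p"], 1), (["s", "t", "r", "aa"], 0)])
# Illustrative transcriptions only: stress moves with part of speech.
add_entry("object", "n", [(["aa", "b"], 1), (["jh", "eh", "k", "t"], 0)])
add_entry("object", "v", [(["ax", "b"], 0), (["jh", "eh", "k", "t"], 1)])
```

Here `lookup("object", "v")` and `lookup("object", "n")` return different entries, which is the point of passing a part of speech, and `phones(lookup("reagan"))` gives the flat phone string for the word.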
    Text Normalization

  19. The Task: Text Processing (also called Text Normalization): Analysis of raw and labelled text into identifiable words
  • Step 1: Identify tokens in text
  • Step 2: Chunk tokens into reasonably sized sections

    Alan Black: Festival's currently-used decision tree for determining end of utterance is:

    ((n.whitespace matches ".*\n.*\n[ \n]*") ;; A significant break in the text
      ((1))
      ((punc in ("?" ":" "!"))
       ((1))
       ((punc is ".")
        ;; This is to distinguish abbreviations vs periods
        ;; These are heuristics
        ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
         ((n.whitespace is " ")
          ((0))                  ;; if abbrev, a single space is not enough for a break
          ((n.name matches "[A-Z].*")
           ((1))
           ((0))))
         ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
          ((n.name matches "[A-Z].*")  ;; single space and non-cap is no break
           ((1))
           ((0)))
          ((1))))
        ((0)))))
    

    Thus the more difficult cases above deal with a token that is terminated by a period but could be an abbreviation. A token is treated as an abbreviation if it contains a dot, consists of a capital letter followed by at most two more letters, or is "etc". When an abbreviation is detected, there must be more than one space and the next word must be capitalized to signal a break. If the token doesn't appear to be an abbreviation, then any break longer than a single space, or a capitalized following word, will signal a break.
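For readers who find the nested tree hard to follow, the same decisions can be written as ordinary conditionals. The Python below is a hand port for illustration only. The function name `is_end_of_utterance` and its four arguments (the token name, its trailing punctuation, the whitespace before the next token, and the next token's name) are choices made for this sketch, and Festival's regular expressions are translated to Python syntax on the assumption that "matches" means a match against the whole string.

```python
import re

# Patterns ported from the tree above, applied as full-string matches.
PARAGRAPH_BREAK = re.compile(r".*\n.*\n[ \n]*")
ABBREVIATION = re.compile(r".*\..*|[A-Z][A-Za-z]?[A-Za-z]?|etc")
CAPITALIZED = re.compile(r"[A-Z].*")


def is_end_of_utterance(name, punc, next_whitespace, next_name):
    """Return True if an utterance break should follow this token."""
    if PARAGRAPH_BREAK.fullmatch(next_whitespace):
        return True                       # a significant break in the text
    if punc in ("?", ":", "!"):
        return True
    if punc != ".":
        return False
    next_is_capitalized = CAPITALIZED.fullmatch(next_name) is not None
    if ABBREVIATION.fullmatch(name):
        if next_whitespace == " ":
            return False                  # abbreviation plus a single space
        return next_is_capitalized
    if next_whitespace == " ":
        return next_is_capitalized        # single space and non-cap: no break
    return True
```

For instance, `is_end_of_utterance("Dr", ".", " ", "Smith")` is False because "Dr" looks like an abbreviation and only one space follows, while `is_end_of_utterance("bank", ".", " ", "The")` is True.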

    The decision tree above will fail for such examples as

  • Step 3+4: Identify Types of Tokens, and Converting Tokens to Words

    The pronunciation of a number often depends on its type (for example, a year, a cardinal quantity, or a string of digits).
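A first pass at identifying token types can be done with a handful of regular expressions. The Python sketch below is only an illustration. The type labels and the `token_type` function are invented here, the patterns are deliberately crude, and the order of the list matters because the first matching pattern wins. The test tokens come from the sample problems listed earlier in these notes.

```python
import re

# (label, pattern) pairs tried in order; labels are hypothetical.
TOKEN_TYPES = [
    ("money", re.compile(r"\$[0-9,]+(\.[0-9]+)?")),
    ("url", re.compile(r"https?://\S+")),
    ("date", re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}")),
    ("number", re.compile(r"[0-9][0-9,]*(\.[0-9]+)?")),
    ("abbreviation", re.compile(r"[A-Za-z]+\.")),
]


def token_type(token):
    """Return the label of the first pattern matching the whole token."""
    for label, pattern in TOKEN_TYPES:
        if pattern.fullmatch(token):
            return label
    return "word"
```

Note the weakness in the last pattern: any word followed by a period, including an ordinary sentence-final word, is labelled an abbreviation, and "11/12/01" is labelled a date without saying which field is the month. Resolving such cases needs context, which is what the rules in this section are for.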

    An example rule for dealing with such phrases as "$1.2 million" would be

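    (define (token_to_words utt token name)
     ;; Map one token to a list of words; "n.name" and "p.name" are the
     ;; names of the next and previous tokens.
     (cond
      ;; "$1.2" followed by "million": say the number, then "million"
      ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
            (string-matches (utt.streamitem.feat utt token "n.name")
                            ".*illion.?"))
       (append
        (builtin_english_token_to_words utt token (string-after name "$"))
        (list
         (utt.streamitem.feat utt token "n.name"))))
      ;; "million" preceded by "$1.2": already spoken above, so say "dollars"
      ((and (string-matches (utt.streamitem.feat utt token "p.name")
                            "\\$[0-9,]+\\(\\.[0-9]+\\)?")
            (string-matches name ".*illion.?"))
       (list "dollars"))
      ;; everything else: fall back to the built-in rules
      (t
       (builtin_english_token_to_words utt token name))))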
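The same reordering trick can be tried outside Festival. The Python below is a loose re-creation of the rule, not a port of Festival's internals. Tokens are plain strings in a list, `number_to_words` is a toy stand-in for the built-in number expansion that only handles values below 1000, and all function names are made up for this sketch.

```python
import re

MONEY = re.compile(r"\$[0-9,]+(\.[0-9]+)?")
ILLION = re.compile(r".*illion.?")
NUMBER = re.compile(r"[0-9][0-9,]*(\.[0-9]+)?")

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def small_number_to_words(n):
    """Spell out an integer in the range 0..999 (toy helper)."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0..999")
    words = []
    if n >= 100:
        words += [ONES[n // 100], "hundred"]
        n %= 100
        if n == 0:
            return words
    if n < 20:
        words.append(ONES[n])
    else:
        words.append(TENS[n // 10])
        if n % 10:
            words.append(ONES[n % 10])
    return words


def number_to_words(text):
    """Spell out '100' or '1.2'; digits after the point are read one by one."""
    whole, _, fraction = text.replace(",", "").partition(".")
    words = small_number_to_words(int(whole))
    if fraction:
        words.append("point")
        words.extend(ONES[int(digit)] for digit in fraction)
    return words


def token_to_words(tokens, i):
    """Expand token i, looking at its neighbours as the rule above does."""
    name = tokens[i]
    next_name = tokens[i + 1] if i + 1 < len(tokens) else ""
    prev_name = tokens[i - 1] if i > 0 else ""
    if MONEY.fullmatch(name) and ILLION.fullmatch(next_name):
        return number_to_words(name[1:]) + [next_name]
    if MONEY.fullmatch(prev_name) and ILLION.fullmatch(name):
        return ["dollars"]
    if MONEY.fullmatch(name):
        return number_to_words(name[1:]) + ["dollars"]
    if NUMBER.fullmatch(name):
        return number_to_words(name)
    return [name]


def normalize(text):
    """Split on whitespace and expand every token in turn."""
    tokens = text.split()
    return [word for i in range(len(tokens)) for word in token_to_words(tokens, i)]
```

Calling `normalize("He stole $100 million from the bank")` gives the words "He stole one hundred million dollars from the bank": the money token emits both the number and "million", and the "million" token itself contributes only "dollars".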
    
  • More recent, advanced methods (based on the 1999 Hopkins workshop: "Normalization of Non-Standard Words"):

    Lexicons and Letter-to-Sound Rules
  • Lexicons and Letter-to-Sound Rules
  • Lexicons
  • Letter-to-Sound Rules
    Main errors left in Festival (from Alan Black's lecture notes):
    1. LTS rules failing on novel forms
    2. Foreign proper names often fail
    3. Wrong POS; especially in newspaper headlines
    4. POS is right but not in lexicon
    5. POS is not enough to differentiate pronunciations (wind /w ih n d/ vs. wind /w ay n d/), and the case is not yet dealt with by the homograph disambiguation CART.
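The first error in the list, letter-to-sound rules failing on novel forms, is easy to reproduce with a toy rule set. The Python sketch below is an illustration under heavy assumptions. The rule table is invented for this example and covers only a few letters, each rule looks only at the letters to its right, and the first matching rule wins. Real rule sets are far larger and also use left context.

```python
import re

# (letters, right-context regex, phones); earlier rules take priority.
RULES = [
    ("ch", "", ["ch"]),
    ("c", "[eiy]", ["s"]),      # "soft" c before e, i, y
    ("c", "", ["k"]),
    ("ll", "", ["l"]),
    ("l", "", ["l"]),
    ("a", "", ["ae"]),
    ("e", "", ["eh"]),
    ("i", "", ["ih"]),
    ("o", "", ["ow"]),
    ("y", "$", ["iy"]),         # word-final y only
    ("t", "", ["t"]),
]


def letters_to_phones(word):
    """Apply the first matching rule at each position, left to right."""
    word = word.lower()
    phones = []
    pos = 0
    while pos < len(word):
        for letters, right, output in RULES:
            if not word.startswith(letters, pos):
                continue
            rest = word[pos + len(letters):]
            if right and not re.match(right, rest):
                continue
            phones.extend(output)
            pos += len(letters)
            break
        else:
            raise ValueError("no rule for %r in %r" % (word[pos], word))
    return phones
```

These rules handle "cat" (k ae t) and "city" (s ih t iy), but for the borrowed word "cello" they produce s eh l ow, with the wrong first consonant. That is the kind of failure on novel and foreign forms that the list above describes.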