LINGUIST 138/238 -- SYMBSYS 138: Introduction to Computer Speech and Language Processing. Spring 2004

LINGUIST 138/238 - SYMBSYS 138
Introduction to Computer Speech and Language Processing
Autumn 2004, Dan Jurafsky

Speech Synthesis Part I: Articulatory Phonetics, ARPAbet transcription, TTS Architecture, Festival, Text Normalization, Letter-to-Sound

And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us. (By "they", I mean computers; I doubt scientists will ever be able to talk to us.)
Dave Barry

IP notice: These lecture notes include direct quotes from many web sites, especially Alan Black's, but also many others. Thus any text in these lecture notes should be viewed as being stolen from other people; all credit and rights go to them.

Introduction to TTS

Applications of Speech Synthesis (TTS)
- Games
- Telephone-based Information (directions, air travel, banking, etc)
- Hands-free (in car)
- Speaker Identification
- Language Identification
- Reading/speaking for disabled
- Education (Reading tutors, L2)
Demonstrations
- History of synthesizers
- Rhetorical synthesizer
Speech Synthesis Overview
- Concatenative Unit-Selection Text-to-Speech Synthesis Intuition:
  - Collect lots and lots of speech from one speaker, and transcribe very carefully in detail all the syllables and phones and whatnot
  - To synthesize a sentence, patch together syllables and phones from the training data.
  - Paradigm: Search
Articulatory Phonetics
Acoustic Phonetics vs Articulatory Phonetics
Textbook on Phonetics: Peter Ladefoged. 2001. A Course in Phonetics. 4th edition. Harcourt.
Or here's the website for his Vowels and Consonants
The Articulatory Process: Voicing
The Vocal Organs here
- lip
- teeth
- alveolar ridge
- hard palate
- soft palate = velum
- uvula
- pharynx
- larynx
- glottis (space between vocal cords)
- lungs
- nose
- tongue tip, blade, dorsum
- vocal folds vibrating
places of articulation
- Here's a moving picture of articulators
- labial, coronal (tip or blade of tongue), dorsal (back of tongue)
- bilabial stops (pie, buy, my)
- labiodental (fie, vie)
- dental (thigh, thy)
- alveolar (tongue tip or blade and alveolar ridge)
- retroflex (tongue tip and back of alveolar ridge) rye, row, ray, ire, hour, air
- palato-alveolar tongue blade and back of alveolar ridge (shy, she, show)
- palatal (y)
- velar (k,g)
Manner of Articulation
- stop (complete closure of articulators)
- oral
- nasal
- fricative (close approximation of two articulators so airstream is partially obstructed)
- approximant (one articulator close to another, without the vocal tract being narrowed enough to produce turbulent airflow)
- lateral approximant (obstruction in middle, incomplete closure between sides of tongue and roof)
- tap, affricate
Manner of Articulation for Vowels
- height of body of tongue
- front/back position
- rounding of lips
Some movies of articulation
Tones:
- Mandarin
- Cantonese
Context Dependence of phones (from Peter Ladefoged's website at UCLA)
- Notice the difference in the "l".
- The words leaf and feel, played forwards.
- The same words played in reverse
The ARPAbet
ARPAbet (and the IPA)
The History of Speech Synthesis
History of Speech Synthesis
- Von Kempelen 1780; b. Bratislava 1734, d. Vienna 1804
  - A leather resonator manipulated by the operator to try to copy vocal tract configuration during sonorants (vowels, glides, nasals )
  - Bellows provided air stream, counterweight provided inhalation
  - Vibrating reed produced periodic pressure wave
  - Various small whistles controlled consonants
  - Rubber mouth and nose; nose had to be covered with two fingers for non-nasals.
  - For unvoiced sounds, mouth covered tightly and then small auxiliary bellows driven by a string would provide puff of air.
  - Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine (1791)
  - Here's a page with further info
- Homer Dudley's VODER
  - Manually controlled through a console as complex as a piano keyboard
  - Operator training was a problem
  - Block Diagram of the VODER
  - Picture of VODER demonstration at the World Fair in 1939
  - The voder talks
- An early application: The 1936 UK Speaking Clock
  - (this image and the following caption are from Andrew Emmerson's Speaking Clocks site):
    "P4727 - the first speaking clock mechanism, which used photographic storage in revolving glass discs. BT copyright image, used with acknowledgement."
  - (this image and the following caption are also from Andrew Emmerson's Speaking Clocks site):
    "P1280 - a technician adjusts the amplifiers of the first speaking clock mechanism; note Tele. 121 and Bellset 20 on the wall at rear. BT copyright image, used with acknowledgement."
- 1940's, 50's: Analog synthesizers, formant synthesis.
- Frank Cooper's Pattern Playback
  - Developed at Haskins Lab for investigating speech perception.
  - Works like an inverse of a sound spectrogram
  - Light from a lamp goes through rotating disk then through a spectrogram into photovoltaic cells.
  - Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band.
  - The Pattern Playback talks
- 1960's
  - rule-based TTS, Holmes, Mattingly, Shearne 1964, Coker 1968
- 1970's
- First full TTS system: Umeda et al (1968)
- Klatt 1976 phonological rules
- Joe Olive 1977 concatenation of linear-prediction diphones
- Products, e.g. Speak and Spell
  - The M.I.T. MITalk system by Jonathan Allen, Sheri Hunnicutt, and Dennis Klatt, 1979 demo
  - The Klattalk system by Dennis Klatt of M.I.T. which formed the basis for Digital Equipment Corporation's DECtalk commercial system 1983. demo
- Concatenative Synthesis
  - Advent of digital computers, digital rep of speech led to concatenation of natural recorded speech.
  - Natural representational unit: diphones
  - Diphone is a unit from middle of one phone to middle of another.
  - Why? phone boundaries are where changes happen.
  - Concatenate diphones.
  - Why the rise of concatenative synthesis: Moore's Law
  - Thus after 1980, big effort on using larger, more varied inventories
  - Sagisaka at ATR in Japan in late 80's: use more than one example of each diphone
  - 1995, 1996: idea of many units of different sizes, and task of Unit Selection
  - History of synthesizers
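The diphone idea above (a unit running from the middle of one phone to the middle of the next) can be made concrete with a small sketch. This is an illustration only: the silence symbol `pau` and the `left-right` naming of diphones are assumptions, not something the notes specify.

```python
def phones_to_diphones(phones, silence="pau"):
    """Return the diphone names needed to synthesize a phone sequence.

    Each diphone spans the middle of one phone to the middle of the next,
    so N phones padded with silence on both sides need N + 1 diphones.
    """
    padded = [silence] + list(phones) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(phones_to_diphones(["hh", "ax", "l", "ow"]))
# ['pau-hh', 'hh-ax', 'ax-l', 'l-ow', 'ow-pau']
```

The joins between units then fall in the middle of phones, away from the phone boundaries where most of the change happens.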
Speech Synthesis Architectures
Major Components of Speech Synthesis Systems
1. Text Processing (also called Text Normalization): Analysis of raw and labelled text into identifiable words
  - Sample problems:
    - He stole $100 million from the bank
    - It's 13 St. Andrews St.
    - The home page is http://www.stanford.edu
    - yes, see you the following tues, that's 11/12/01
  - Steps
    - Identify tokens in text
    - Chunk tokens into reasonably sized sections
    - Map tokens to words
    - Identify types for words
2. Linguistic/Prosodic Processing: from words to segments, F0, durations
  - How to pronounce a word?
  - Look it up in a lexicon
  - But what about languages like Turkish or Finnish
    - Turkish word: uygarlaStIramadIklarImIzdanmISsInIzcasIna
    - Meaning ``(behaving) as if you are among those whom we could not civilize/cause to become civilized''
    - Breakdown into morphemes:
      uygar +laS +tIr +ama +dIk +lar +ImIz +dan +mIS +sInIz +casIna
      civilized +bec +caus +NegAble +ppart +pl +p1pl +abl +past +2pl +AsIf
  - Also, always missing some words (names, other proper nouns, rare words, etc)
  - So need letter-to-sound rules
  - Homograph disambiguation (wind, live, read)
  - Prosodic Phrasing
    - need to break utterance into phrases
    - punctuation is useful but not sufficient
  - Intonation
    - Prediction of accents: which syllables should be accented
    - Realization of F0 contour: given accents/tones, generate F0 contour.
3. Waveform synthesis: From segments, F0, and duration to a waveform.
  - Three possible techniques:
    1. concatenative synthesis
      - diphone synthesis
      - unit selection synthesis
    2. formant synthesis
    3. articulatory synthesis
  - Collecting data
    - Collecting diphones: need to record diphones in correct contexts
    - l sounds different in onset than coda, t is flapped sometimes, etc.
    - need EGG, quiet recording room, etc.
    - then need to label them very very exactly
  - Unit selection: how to pick the right unit?
  - Joining the units
    - dumb (just stick'em together)
    - PSOLA (Pitch-Synchronous Overlap and Add)
    - MBROLA (Multi-band overlap and add)
- Festival Overview
  - Pretty close to state-of-the-art TTS system
  - Good for playing around, since has scripting language, so can make changes to system without recompiling
  - Free, runs on UNIX and to some extent Windows.
  - Core system:
    - Scheme-based scripting language
    - C++/C Core modules
    - General utterance representation
    - Supports many common waveform formats
    - Standard data tools (Viterbi decoder, N-gram support, Regex matching, CART tree, etc)
    - English (British and American) and Spanish TTS
    - Everything is configurable: phonesets, lexicons, intonation, POS, duration, diphone/unit selection, letter-to-sound rules, text modes.
    - ```
    festival --tts news.txt
    echo "Hello world" | festival --tts
```
- Interactive command interpreter Scheme-based read-eval-print loop
- C++ library adding modules in C++
- Festival Architecture
- Traditional Architectures:
  - String Pipeline
  - Structured Blackboard
  - Festival: combined: Structure pipeline, each module adds information to structured relation graph which represents utterance
- String Pipeline
  - Start with a string of tokens:
```
We started on Feb 25.
```
  - The first expansion modules would replace all tokens with words to give a string
```
We started on february twenty fifth .
```
  - Next module would replace words with phones, etc.
  - Problem: Information about previous levels is lost at each stage.
- Another method: build a table where boundaries denote times in the eventual synthesized utterance:
```
| Feb                           | 25                                        |
| february                      | twenty                  | fifth           |
|    1       |   0    | 0  | 0  |         1      |     0  |     1           |
| f | eh | b | r | ax | er | iy | t | w | eh | n | t | iy | f | ih | f | th |
```
  Thus giving layers for tokens, words, syllables and phones.
  
  Problems:
  - Intonation accents and boundaries can best be done orthogonal to syllables.
  - Diphones also cross over boundaries.
  - May want to add complex information like trees for syntax or prosody
  - Want to easily find, e.g. "second syllable of second word"
  - Festival solution: heterogeneous relation graph.
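A rough sketch of what a relation graph buys you, in Python rather than Festival's own Scheme/C++. The class names `Item` and `Utterance` and the `add` helper are hypothetical and this is not Festival's API; the point is only that items live in several named relations at once, so no layer is lost and queries like "second syllable of second word" stay easy.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One node of the utterance; it can be reached through several relations."""
    name: str
    features: dict = field(default_factory=dict)
    daughters: list = field(default_factory=list)


@dataclass
class Utterance:
    """Named relations (ordered lists of items) over a shared pool of items."""
    relations: dict = field(default_factory=dict)

    def add(self, relation, item, parent=None):
        """Append item to a relation and optionally link it under a parent item."""
        self.relations.setdefault(relation, []).append(item)
        if parent is not None:
            parent.daughters.append(item)
        return item


utt = Utterance()
feb = utt.add("Token", Item("Feb"))
t25 = utt.add("Token", Item("25"))
utt.add("Word", Item("february"), parent=feb)
twenty = utt.add("Word", Item("twenty"), parent=t25)
utt.add("Word", Item("fifth"), parent=t25)

# Syllables and segments for "twenty", following the table above.
for phones, stress in [(["t", "w", "eh", "n"], 1), (["t", "iy"], 0)]:
    syl = utt.add("Syllable", Item("syl", {"stress": stress}), parent=twenty)
    for phone in phones:
        utt.add("Segment", Item(phone), parent=syl)

# "Second syllable of second word", with the token layer still intact.
second_syl = utt.relations["Word"][1].daughters[1]
print([seg.name for seg in second_syl.daughters])  # ['t', 'iy']
```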
Lexicons and Lexical Entries in Festival
Lexicons and Lexical Entries
Lookup words with
```
(lex.lookup WORD PART-OF-SPEECH)
```
or
```
(lex.lookup WORD)
```
e.g.
```
(lex.lookup 'reagan)
```
```
(lex.lookup 'object 'v)
```
```
(lex.lookup 'object 'n)
```
```
(lex.add.entry
 '("reagan" n (((r ey) 1) ((g ax n) 0))))
```
Format is (WORD POS (SYL0 SYL1 ...))
Syllable is ((PHONE0 PHONE1 ...) STRESS)
```
(lex.lookup 'cepstra)
("cepstra" n (((k eh p) 1) ((s t r aa) 0)))
```
To find out what the phoneme set is and what formats are possible, it is often useful to look up similar words with the lex.lookup function.
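To make the entry format concrete, here is a small Python mirror of the (WORD POS (SYL0 SYL1 ...)) structure. The tuple encoding and the helper name `summarize_entry` are assumptions for illustration; Festival itself stores entries as Scheme lists.

```python
def summarize_entry(entry):
    """Split a (word, pos, syllables) entry into its phones and stress pattern.

    Each syllable is a (phones, stress) pair, mirroring ((PHONE0 PHONE1 ...) STRESS).
    """
    word, pos, syllables = entry
    phones = [phone for syl_phones, _ in syllables for phone in syl_phones]
    stress = [syl_stress for _, syl_stress in syllables]
    return word, pos, phones, stress


reagan = ("reagan", "n", [(["r", "ey"], 1), (["g", "ax", "n"], 0)])
print(summarize_entry(reagan))
# ('reagan', 'n', ['r', 'ey', 'g', 'ax', 'n'], [1, 0])
```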

Text Normalization
The Task: Text Processing (also called Text Normalization): Analysis of raw and labelled text into identifiable words
- Sample problems we saw earlier:
  - He stole $100 million from the bank
  - It's 13 St. Andrews St.
  - The home page is http://www.stanford.edu/
  - yes, see you the following tues, that's 11/12/01
  - IV could be four, fourth, or I.V.
  - IRA could be I.R.A. or Ira
  - 1750 could be seventeen fifty (date, address) or seventeen hundred (and) fifty as cardinal
- Important in both TTS and ASR.
- Steps in Text Normalization
  1. Identify tokens in text
  2. Chunk tokens into utterances or sentences.
  3. Identify types for each token
  4. (Type-specifically) map tokens to strings of words
  5. Assign part of speech to each word.

Step 1: Identify tokens in text

Whitespace (space, tab, newline, and carriage return) can be viewed as separators.
Punctuation can also be separated from the raw tokens.
Festival converts text from files into an ordered list of tokens each with its own preceding whitespace and succeeding punctuation as features of the token.
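A minimal sketch of this kind of tokenization, assuming a fixed set of trailing punctuation characters (the set below is a guess, and Festival's real tokenizer is configurable and handles more cases, such as leading punctuation). Each token keeps its preceding whitespace and trailing punctuation as features.

```python
import re

TOKEN_RE = re.compile(r"(\s*)(\S+)")
TRAILING_PUNC = ".,;:!?\"')"  # assumed punctuation set, for illustration


def tokenize(text):
    """Return tokens as dicts with name, preceding whitespace and trailing punctuation."""
    tokens = []
    for whitespace, raw in TOKEN_RE.findall(text):
        name = raw.rstrip(TRAILING_PUNC)
        if not name:          # token is nothing but punctuation: keep it whole
            name = raw
        tokens.append({"name": name,
                       "whitespace": whitespace,
                       "punc": raw[len(name):]})
    return tokens


for token in tokenize("We started on Feb 25."):
    print(token)
```

Keeping whitespace and punctuation as features, instead of throwing them away, is what lets the later end-of-utterance step ask questions like "was there more than one space here?".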

Step 2: Chunk

Alan Black: Festival's currently-used decision tree for determining end of utterance is:

```
((n.whitespace matches ".*\n.*\n[ \n]*") ;; A significant break in the text
  ((1))
  ((punc in ("?" ":" "!"))
   ((1))
   ((punc is ".")
    ;; This is to distinguish abbreviations vs periods
    ;; These are heuristics
    ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
     ((n.whitespace is " ")
      ((0))                  ;; if abbrev, a single space is not enough for a break
      ((n.name matches "[A-Z].*")
       ((1))
       ((0))))
     ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
      ((n.name matches "[A-Z].*")  ;; single space and non-cap is no break
       ((1))
       ((0)))
      ((1))))
    ((0)))))
```

The difficult cases above handle a token that ends in a period but could be an abbreviation. A token is treated as an abbreviation if it contains a dot, is a capitalized token of at most three letters, or is "etc". After an abbreviation, a break is signaled only if there is more than a single space and the next word is capitalized. If the token doesn't look like an abbreviation, then any longer break or a capitalized following word signals a break.

This will fail for such examples as

cog. sci. Newsletter.
many cases at end of line.
Badly spaced/capitalized sentences.
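The tree is easier to poke at as ordinary code. Below is a Python rendering of the same decisions, written for these notes; the function name and argument names are made up, and Festival's regex dialect may differ from Python's in small ways (for example in whether "." matches a newline).

```python
import re

PARA_BREAK_RE = re.compile(r".*\n.*\n[ \n]*")
ABBREV_RE = re.compile(r"(.*\..*|[A-Z][A-Za-z]?[A-Za-z]?|etc)")
CAPITALIZED_RE = re.compile(r"[A-Z].*")


def is_end_of_utterance(name, punc, next_whitespace, next_name):
    """Return True if the token (name + punc) ends an utterance."""
    if PARA_BREAK_RE.fullmatch(next_whitespace):   # significant break in the text
        return True
    if punc in ("?", ":", "!"):
        return True
    if punc != ".":
        return False
    next_is_capitalized = CAPITALIZED_RE.fullmatch(next_name) is not None
    if ABBREV_RE.fullmatch(name):                  # looks like an abbreviation
        if next_whitespace == " ":
            return False                           # single space: no break
        return next_is_capitalized
    if next_whitespace == " ":                     # not an abbreviation
        return next_is_capitalized
    return True


print(is_end_of_utterance("St", ".", " ", "Andrews"))      # False
print(is_end_of_utterance("bank", ".", " ", "The"))        # True
print(is_end_of_utterance("sci", ".", " ", "Newsletter"))  # True
```

The last call shows the "cog. sci. Newsletter" failure: "sci" is lower case, so it does not look like an abbreviation and the capitalized next word triggers a break.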

Step 3+4: Identify Types of Tokens, and Converting Tokens to Words

The pronunciation of a number often depends on its type.

1776 date: seventeen seventy six.
1776 phone number: one seven seven six.
1776 quantifier: one thousand seven hundred (and) seventy six
25 day: twenty fifth

An example rule for dealing with such phrases as "$1.2 million" would be

```
(define (token_to_words utt token name)
 (cond
  ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches (utt.streamitem.feat utt token "n.name")
                        ".*illion.?"))
   (append
    (builtin_english_token_to_words utt token (string-after name "$"))
    (list
     (utt.streamitem.feat utt token "n.name"))))
  ((and (string-matches (utt.streamitem.feat utt token "p.name")
                        "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches name ".*illion.?"))
   (list "dollars"))
  (t
   (builtin_english_token_to_words utt token name))))
```
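The same rule, sketched in Python so the control flow is easy to follow. Everything here is a stand-in: `basic_expand` plays the role of Festival's builtin expander and is passed in as a parameter, and `toy_expand` is a made-up lookup that only knows "1.2".

```python
import re

MONEY_RE = re.compile(r"\$[0-9,]+(\.[0-9]+)?")
ILLION_RE = re.compile(r".*illion.?")


def token_to_words(name, prev_name, next_name, basic_expand):
    """Expand one token, moving "dollars" after a following million/billion."""
    if MONEY_RE.fullmatch(name) and ILLION_RE.fullmatch(next_name or ""):
        # "$1.2" before "million": say the number and the magnitude word now
        return basic_expand(name[1:]) + [next_name]
    if MONEY_RE.fullmatch(prev_name or "") and ILLION_RE.fullmatch(name):
        # "million" after "$1.2": already spoken, so only "dollars" remains
        return ["dollars"]
    return basic_expand(name)


def toy_expand(name):
    """Hypothetical stand-in for the builtin token-to-words expansion."""
    return {"1.2": ["one", "point", "two"]}.get(name, [name])


tokens = ["He", "stole", "$1.2", "million"]
words = []
for i, name in enumerate(tokens):
    prev_name = tokens[i - 1] if i > 0 else None
    next_name = tokens[i + 1] if i + 1 < len(tokens) else None
    words += token_to_words(name, prev_name, next_name, toy_expand)
print(words)
# ['He', 'stole', 'one', 'point', 'two', 'million', 'dollars']
```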

More recent, advanced methods (based on the 1999 Hopkins workshop: "Normalization of Non-Standard Words"):

Idea: formalize text analysis as a small number of cleanly defined, statistically trainable models.
Project homepage
R. Sproat et al. Normalization of non-standard words. Computer Speech and Language, 15(3):287-333, 2001.
4 basic stages of processing
1. splitter: (on whitespace or also within words ("AltaVista"))
2. type identifier: for each split token identify type,
3. token expander: for each typed token, expand to words (deterministic for number, date, money, letter sequence. Only hard (nondeterministic) for abbreviations).
4. language model: to select between alternative pronunciations
Oops, step 0: the phenomenon itself: NSW Examples.
- Numbers:
  123, 12 March 1994
- Abbreviations, contractions, acronyms:
  approx. mph, ctrl-C, US, pp, lb
- punctuation conventions:
  3-4, +/-, and/or
- dates
- times
- urls
How common are they?
- Varies over text type
- words not in lexicon, or with non-alphabetic characters:
  
  | Text type  | % NSW |
  |------------|-------|
  | novels     | 1.5%  |
  | press wire | 4.9%  |
  | e-mail     | 10.7% |
  | recipes    | 13.7% |
  | classified | 27.9% |
What is the distribution of NSW types?
- In North American News Text Corpus, from 121,464 NSWs
  
  | Major type | Minor type | %   |
  |------------|------------|-----|
  | numeric    | number     | 26% |
  |            | year       | 7%  |
  |            | ordinal    | 3%  |
  | alphabetic | as word    | 30% |
  |            | as letters | 12% |
  |            | as abbrev  | 2%  |
How difficult are they?
- Identification:
  - some homographs: "Wed", "PA"
  - some false positives: OOV
- Realization:
  - simple rule: money, $2.34
  - POS tags: "lives" / "lives"
  - type identification + rules: numbers
  - text type specific knowledge (in classified ads, BR for bedroom)
- Ambiguity (acceptable multiple answers)
  - "D.C." as letters or full words
  - "MB" as "meg" or "megabyte"
  - 250
Existing techniques
- ignored
- lexical lookup
- hacky hand-written rules
- (not so hacky) hand-written rules
- statistical trained prediction
Step 1: Splitter
- letter/number conjunctions (WinNT, SunOS, PC110)
- Hand-written rules in two parts:
  - Part I: group things not to be split (numbers, etc; including commas in numbers, slashes in dates)
  - Part II: apply rules:
    - at transitions from lower to upper case
    - after penultimate upper-case character in transitions from upper to lower
    - at transitions from digits to alpha
    - at punctuation
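A rough Python sketch of the two-part splitter. The patterns are my own reading of the rules listed above, not the workshop's actual rule set: `KEEP_WHOLE` is an assumed pattern for Part I, and only the four Part II transitions from the list are implemented (so PC110, which goes from alpha to digits, stays whole).

```python
import re

# Part I (assumed pattern): numbers, dates and times are kept in one piece.
KEEP_WHOLE = re.compile(r"[0-9][0-9,./:]*[0-9]|[0-9]")

# Part II: zero-width boundaries inside an alphanumeric chunk.
BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])"          # lower -> upper: Alta|Vista
    r"|(?<=[A-Z])(?=[A-Z][a-z])"    # upper -> lower: after penultimate upper
    r"|(?<=[0-9])(?=[A-Za-z])"      # digits -> alpha: 3|A
)


def split_token(token):
    """Split one whitespace-delimited token into smaller NSW candidates."""
    if KEEP_WHOLE.fullmatch(token):
        return [token]
    pieces = []
    # Punctuation becomes a piece of its own (the capture group keeps it).
    for chunk in re.split(r"([^A-Za-z0-9])", token):
        if chunk:
            pieces.extend(piece for piece in BOUNDARY.split(chunk) if piece)
    return pieces


for example in ["AltaVista", "WinNT", "SunOS", "3A", "and/or", "2/2/99", "20,000"]:
    print(example, split_token(example))
```

Splitting on zero-width matches with `re.split` needs Python 3.7 or later.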
Step 2: Classify token into 1 of 20 types
- EXPN: abbreviation, contractions, e.g. adv, N.Y, mph, gov't
- LSEQ: letter sequence, e.g. CIA, D.C, CDs
- ASWD: read as word, e.g. CAT, proper names
- MSPL: misspelling, e.g. geogaphy
- NUM: number (cardinal), e.g. 12, 45, 1/2, 0.6
- NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates III
- NTEL: telephone (or part of), e.g. 212 555-4523
- NDIG: number as digits, e.g. Room 101
- NIDE: identifier, e.g. 747, 386, I5, PC110, 3A
- NADDR: number as street address, e.g. 5000 Pennsylvania, 4523 Forbes
- NZIP: zip code or PO Box, e.g. 91020
- NTIME: a (compound) time, e.g. 3.20, 11:45
- NDATE: a (compound) date, e.g. 2/2/99, 14/03/87 (or US) 03/14/87
- NYER: year(s), e.g. 1998, 80s, 1900s, 2003
- MONEY: money (US or otherwise), e.g. $3.45, HK$300, Y20,000, $200K
- BMONY: money tr/m/billions, e.g. $3.45 billion
- PRCT: percentage, e.g. 75%, 3.4%
- SLNT: not spoken, word boundary, e.g. word boundary or emphasis character: M.bath, KENT*REALTY, _really_, ***Added
- PUNC: not spoken, phrase boundary, e.g. non-standard punctuation: "..." in DECIDE...Year, *** in $99,9K***Whites
- FNSP: funny spelling, e.g. slloooooww, sh*t
- URL: url, pathname or email, e.g. http://apj.co.uk, /usr/local, phj@teleport.com
- NONE: token should be ignored, e.g. ascii art, formatting junk
- For example: 4 categories for alphabetic sequences
  1. EXPN: expand to full word or word sequence (fplc for fireplace) (NY for New York)
  2. LSEQ: say as letter sequence (IBM)
  3. ASWD: say as standard word (either OOV or acronyms said as word like NATO)
- For example: 5 main ways to read numbers:
  1. cardinal: (quantities)
  2. ordinal: (dates)
  3. string of digits: (phone numbers)
  4. pairs of digits: (years)
  5. trailing unit: serial until last non-zero digit: 8765000 is "eight seven six five thousand" (some phone numbers, long addresses)
  6. But still exceptions: (947-3030, 830-7056)
- Type Identifier: create a large hand-labeled training set and build a decision tree to predict type.
- Example of features in tree for sub-classifier for alphabetic tokens:
- p(t|o) = p(o|t)p(t)/p(o)
  - p(o|t), for t in ASWD, LSEQ, EXPN (from trigram letter model):
    p(o|t) = \prod_{i=1}^{N} p(l_i | l_{i-1}, l_{i-2})
  - p(t), from counts of each tag in text
  - p(o), normalization factor
- Hand-written context-dependent rules:
  - list of lexical items (Act, Advantage, amendment ...Wespac, Wrestlemania) after which Roman numbers read as cardinals not ordinals.
Classifier accuracy: 98.1% in news data, 91.8% in email data
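The Bayes formulation for alphabetic tokens, p(t|o) = p(o|t)p(t)/p(o) with p(o|t) from a letter trigram model, can be sketched as below. This is a toy: the training words are made up, add-one smoothing and the padding symbols are my assumptions, and the real system feeds such scores into a decision tree along with other features.

```python
import math
from collections import Counter


class LetterTrigramModel:
    """Add-one smoothed letter trigram model for one tag (toy version)."""

    def __init__(self, words, alphabet_size=27):
        # 26 letters plus the end marker "$"; the smoothing is an assumption.
        self.alphabet_size = alphabet_size
        self.trigrams = Counter()
        self.contexts = Counter()
        for word in words:
            for context, letter in self._events(word):
                self.trigrams[context + letter] += 1
                self.contexts[context] += 1

    @staticmethod
    def _events(word):
        """(two previous letters, current letter) pairs, with padding."""
        padded = "##" + word.lower() + "$"
        return [(padded[i - 2:i], padded[i]) for i in range(2, len(padded))]

    def log_prob(self, word):
        """log p(o|t): sum over letters of log p(l_i | l_i-1, l_i-2)."""
        return sum(
            math.log((self.trigrams[context + letter] + 1)
                     / (self.contexts[context] + self.alphabet_size))
            for context, letter in self._events(word)
        )


TRAINING = {  # tiny made-up training set, for illustration only
    "ASWD": ["nato", "cat", "laser", "radar", "unesco"],
    "LSEQ": ["ibm", "cia", "dc", "fbi", "cds"],
    "EXPN": ["mph", "approx", "govt", "fplc", "adv"],
}
total = sum(len(words) for words in TRAINING.values())
models = {tag: LetterTrigramModel(words) for tag, words in TRAINING.items()}
priors = {tag: len(words) / total for tag, words in TRAINING.items()}


def classify(token):
    """argmax over tags of p(o|t) p(t); p(o) is constant across tags, so dropped."""
    return max(models,
               key=lambda tag: models[tag].log_prob(token) + math.log(priors[tag]))


print(classify("ibm"), classify("nato"))
```

Working in log probabilities turns the product over letters into a sum and avoids underflow on long tokens.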
Step 3: Expanding NSW Tokens; use type-specific heuristics
- ASWD expands to itself
- PUNC expands to itself
- LSEQ expands to a list of words one for each letter
- NUM expands to a string of words representing cardinal
- NORD expands to a string of words representing ordinal.
- NDIG expands to a string of digits.
- NYER expands to two pairs of digits, each read as a NUM, except where the last two digits are 00, in which case the group of four is pronounced as a whole NUM.
- NTEL: string of digits with silence for punctuation
- Abbreviation: use abbreviation lexicon if it's one we've seen
- Abbreviation: else use training set to know how to expand
- Abbreviation cute idea: if "eat in kit" occurs in the text, "eat-in kitchen" will also occur somewhere
- Step 1) predict all possible expansions of the abbreviation with a WFST built from a CART which predicts deletion
- Step 2) Language model to predict occurrence of the full words in text.
- on the "classified ad" domain, works about 80% correct.
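A few of the deterministic expansions above, sketched in Python. The helper names are mine, `cardinal` only covers numbers below one million, and NYER is assumed to be a plain four-digit year (so tokens like "80s" are out of scope). The "00" case follows the rule exactly as stated in these notes.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def cardinal(n):
    """Cardinal words for 0 <= n < 1,000,000."""
    if n < 20:
        return [ONES[n]]
    if n < 100:
        return [TENS[n // 10]] + ([ONES[n % 10]] if n % 10 else [])
    if n < 1000:
        return [ONES[n // 100], "hundred"] + (cardinal(n % 100) if n % 100 else [])
    return cardinal(n // 1000) + ["thousand"] + (cardinal(n % 1000) if n % 1000 else [])


def expand(token, nsw_type):
    """Expand a typed NSW token into a list of words."""
    if nsw_type in ("ASWD", "PUNC"):
        return [token]                           # expands to itself
    if nsw_type == "LSEQ":
        return [c for c in token if c.isalpha()]  # one word per letter
    if nsw_type == "NUM":
        return cardinal(int(token))
    if nsw_type == "NDIG":
        return [ONES[int(digit)] for digit in token]
    if nsw_type == "NYER":
        if token.endswith("00"):
            return cardinal(int(token))          # whole NUM, as the notes state
        return cardinal(int(token[:2])) + cardinal(int(token[2:]))
    raise ValueError(f"no expansion sketched for type {nsw_type}")


for nsw_type in ("NYER", "NDIG", "NUM"):
    print(nsw_type, " ".join(expand("1776", nsw_type)))
```

The same token gives three different readings, which is why type identification has to come before expansion.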
What about languages with no spaces? Chinese, Japanese, etc

Lexicons and Letter-to-Sound Rules


Lexicons

Some history: early systems used all LTS rules due to lack of memory. MITalk was "radical" in having a "huge" dictionary of 10,000 words.

Last rev of CMU dict had 127,000 words. Here's some:

A  AH0
A'S  EY1 Z
A(2)  EY1
A.  EY1
A.'S  EY1 Z
A.S  EY1 Z
A42128  EY1 F AO1 R T UW1 W AH1 N T UW1 EY1 T
AAA  T R IH2 P AH0 L EY1
AABERG  AA1 B ER0 G
AACHEN  AA1 K AH0 N
AAKER  AA1 K ER0
AALSETH  AA1 L S EH0 TH
AAMODT  AA1 M AH0 T
AANCOR  AA1 N K AO2 R
AARDEMA  AA0 R D EH1 M AH0
AARDVARK  AA1 R D V AA2 R K
AARON  EH1 R AH0 N
AARON'S  EH1 R AH0 N Z
AARONS  EH1 R AH0 N Z
AARONSON  EH1 R AH0 N S AH0 N
AARONSON'S  EH1 R AH0 N S AH0 N Z
AARONSON'S(2)  AA1 R AH0 N S AH0 N Z
AARONSON(2)  AA1 R AH0 N S AH0 N
AARTI  AA1 R T IY2
AASE  AA1 S
AASEN  AA1 S AH0 N
AB  AE1 B
AB(2)  EY1 B IY1
ABABA  AH0 B AA1 B AH0
ABABA(2)  AA1 B AH0 B AH0
ABACHA  AE1 B AH0 K AH0
ABACK  AH0 B AE1 K
ABACO  AE1 B AH0 K OW2
ABACUS  AE1 B AH0 K AH0 S
ABAD  AH0 B AA1 D
ABADAKA  AH0 B AE1 D AH0 K AH0
ABADI  AH0 B AE1 D IY0
ABADIE  AH0 B AE1 D IY0
ABAIR  AH0 B EH1 R
ABALKIN  AH0 B AA1 L K AH0 N
ABALONE  AE2 B AH0 L OW1 N IY0
ABALOS  AA0 B AA1 L OW0 Z
ABANDON  AH0 B AE1 N D AH0 N

Letter-to-Sound Rules

The basic form of FESTIVAL LTS rules is
```
( LEFTCONTEXT [ ITEMS ] RIGHTCONTEXT = NEWITEMS )
```
For example
```
( # [ c h ] C = k )
( # [ c h ] = ch )
```
The # denotes the beginning of a word and C is defined to denote all consonants. The two rules above are applied in order, meaning that a word like christmas will be pronounced with a k, while a word starting with ch but not followed by a consonant will be pronounced ch (e.g. choice).
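Ordered rule application can be sketched in a few lines of Python. This is a simplification written for these notes, not Festival's rule interpreter: contexts are limited to a single symbol (or none), the consonant set is an assumption, and letters with no matching rule are just passed through unchanged.

```python
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")   # assumed definition of the class C


def context_matches(symbol, letter):
    """symbol is '#' (word boundary), 'C' (any consonant) or a literal letter."""
    if symbol == "C":
        return letter in CONSONANTS
    return symbol == letter


def apply_rules(word, rules):
    """Scan left to right; at each position the first matching rule fires."""
    padded = "#" + word + "#"
    phones, i = [], 1
    while i < len(padded) - 1:
        for left, items, right, output in rules:
            end = i + len(items)
            if (padded[i:end] == items
                    and (left is None or context_matches(left, padded[i - 1]))
                    and (right is None or context_matches(right, padded[end]))):
                phones.extend(output)
                i = end
                break
        else:
            phones.append(padded[i])        # no rule: pass the letter through
            i += 1
    return phones


# ( # [ c h ] C = k ) and ( # [ c h ] = ch ), in that order
RULES = [
    ("#", "ch", "C", ["k"]),
    ("#", "ch", None, ["ch"]),
]

print(apply_rules("christmas", RULES))  # starts with 'k'
print(apply_rules("choice", RULES))     # starts with 'ch'
```

Swapping the two rules would make the more general one fire first, and christmas would come out with ch, which is why rule order matters.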
What about stress rules?
- Famously evil in English; here's an example from the MITalk system (Allen et al. 1987):
- V -> [1-stress] / X _ C* { Vshort C C? | V } { Vshort C* | V }
  - Where X must contain all prefixes
  - Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and zero or more consonants (e.g. difficult -> d 'ih f f i k ah l t)
  - Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morph-final vowel (e.g. oregano -> ao r 'eh g ae n ow)
  - Assign 1-stress to the vowel in a syllable preceding a vowel followed by a morph-final syllable containing a short vowel and zero or more consonants (e.g. secretariat -> s eh k r eh t 'ae r iy ae t)
  - etc; also various other rules:
  - V -> [1-stress] / X _ C* { Vshort C* | V }
  - run in cycles; as you add on affixes you rerun the stress rule.
  - honor -> honorific, diplomat/diplomacy/diplomatic, photograph/photography/photographic, monotone/monotony/monotonic
  - What are rules for above?
Learning LTS rules automatically
- Induce LTS from a dictionary of the language (Black et al. 1998b)
- Applied to English, German, French
- Two steps: alignment and (CART-based) rule-induction
- Alignment (this is from Alan's lecture notes:)
  There are typically fewer phones than letters in the pronunciation of many languages. Thus there is not a one-to-one mapping of letters to phones in a typical lexical pronunciation. In order for a machine learning technique to build reasonable prediction models we must first align the letters to phones, inserting epsilon where there is no mapping. For example
```
Letters: c  h  e  c  k  e  d
Phones:  ch _  eh _  k  _  t
```
  We provide two methods for this, one fully automatic and one requiring hand seeding. In the fully automatic case we first scatter epsilons in all possible ways to cause the letters and phones to align. Then we collect stats for P(Letter|Phone) and select the best alignments to generate a new set of stats. This is iterated a number of times until it settles (typically 5 or 6 times). This is an example of the EM (expectation maximisation) algorithm.
  
  The alternative method, which may (or may not) give better results, is to hand specify which letters can be rendered as which phones. This is fairly easy to do. For example, letter c goes to phones k ch s sh, letter w goes to w v f, etc. Typically every letter can at some time go to epsilon, consonants go to some small number of phones, and letter vowels go to some larger number of phone vowels. Once the table mapping is created, similarly to the epsilon scatter above, we find all valid alignments and find the probabilities of letter given phone. Then we score all the alignments and take the best.
  
  Typically (in both cases) the alignments are good but some set is very bad. This very bad alignment set, which can be detected automatically due to its low alignment scores, consists of exactly the words whose pronunciations don't match their letters. For example
  
  dept  d ih p aa r t m ah n t
  
  lieutenant  l eh f t eh n ax n t
  
  CMU  s iy eh m y uw
  
  Other such examples are foreign words. As these words are in some sense non-standard these can validly be removed from the set of examples we use to build the phone prediction models.
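The first step of the fully automatic method, scattering epsilons in all possible ways, is easy to sketch. The code below only enumerates the candidate alignments; the EM re-estimation of P(Letter|Phone) that picks among them is not shown, and the function name is made up.

```python
from itertools import combinations


def scatter_epsilons(letters, phones, epsilon="_"):
    """Yield every way to line the phones up with the letters, in order,
    filling the unused letter positions with epsilon."""
    n_letters, n_phones = len(letters), len(phones)
    if n_phones > n_letters:
        return  # needs multi-phone letters, which this sketch does not handle
    for positions in combinations(range(n_letters), n_phones):
        aligned = [epsilon] * n_letters
        for position, phone in zip(positions, phones):
            aligned[position] = phone
        yield aligned


candidates = list(scatter_epsilons("checked", ["ch", "eh", "k", "t"]))
print(len(candidates))                       # 35 = C(7, 4)
print(["ch", "_", "eh", "_", "k", "_", "t"] in candidates)  # True
```

Words like dept or CMU have more phones than letters, so under this simple scheme they get no candidate alignment at all, which is one way such outliers show up.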
- Building CART Trees:
  CART trees are built for each letter in the alphabet (twenty six plus any accented characters in the language), using a context of three letters before and three letters after. Thus we collect feature sets like
```
# # # c h e c --> ch
c h e c k e d --> _
```
  Using this technique we get the following results.
  
  | Lexicon              | Letters correct | Words correct |
  |----------------------|-----------------|---------------|
  | OALD (UK English)    | 95.80%          | 74.56%        |
  | CMUDICT (US English) | 91.99%          | 57.80%        |
  | BRULEX (French)      | 99.00%          | 93.03%        |
  | DE-CELEX (German)    | 98.79%          | 89.38%        |
- CMUDICT, although also English, does not get as good results compared with OALD as it contains many more ``foreign'' words, particularly names, which are much harder to predict without any higher level information (such as ethnic origin).
- (Note: this is all still from Alan Black's lecture notes) The second aspect of measurement is the question of how well the notion of correct matches what the system is actually going to do for real. In order to get a better idea of that we tested the models on actual unknown words from a corpus of 39,923 words in the Wall Street Journal (from the Penn Treebank; Marcus et al. 1993). Of this set 1,775 (4.6%) were not in OALD. Of those 1,360 were names, 351 were unknown words, 57 were American spellings (OALD is a UK English lexicon) and 7 were misspellings.
  After testing various models we found that the best models for the held out test set from the lexicon were not the best set for genuinely unknown words. Basically the lexicon-optimised models were over trained for that test set, so we relaxed the stop criteria for the CART trees and got a better result on the 1,775 unknown words. The best results give 70.65% word correct. In this test we judged correct to be what a human listener judges as correct. Sometimes even though the prediction is wrong with respect to the lexical entry in a test set the result is actually acceptable as a pronunciation.
  This also highlights how a test set may be good to begin with, but after some time and a number of passes and corrections to one's training algorithm any test set will become tainted and you need a new test set. It is normal to have a development test set which is used during development of an algorithm, then keep out a real test set that only gets used once the algorithm is developed. Of course as development happens in cycles the real test set will effectively become the development set and hence you'll need another new test set.
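The letter-context feature sets used for the CART trees (three letters either side of the letter being predicted) can be generated from an aligned word like this. A sketch only: the function name is hypothetical, and the actual tree training is not shown.

```python
def letter_windows(letters, aligned_phones, width=3, pad="#"):
    """Return one (context window, phone) training example per letter.

    The window holds `width` letters before and after the current letter;
    positions beyond the word are filled with the pad symbol.
    """
    padded = [pad] * width + list(letters) + [pad] * width
    return [(padded[i:i + 2 * width + 1], phone)
            for i, phone in enumerate(aligned_phones)]


examples = letter_windows("checked", ["ch", "_", "eh", "_", "k", "_", "t"])
for window, phone in examples:
    print(" ".join(window), "-->", phone)
```

The first and fourth lines printed match the two feature sets shown earlier for checked. Since one tree is built per letter, the examples would then be grouped by the middle letter of each window.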
- Stress: include stressed and unstressed versions of each vowel.
- (this is not from alan black's notes) What about names?
- As we saw above, names may not be well trained from standard dictionary entries
- Liberman and Church 1987:
- Donnelly marketing organization list of names in 1987: 1.5 million names
- that's 3 times larger than the number of entries in a large unabridged dictionary.
- ATT built pronunciation dictionary of 50,000 most frequent names
- Can combine this with morphology (Walters, Lucasville)
- Also can write stress-shifting rules (Jordan -> Jordanian, Washington ->)
- Rhyme analogy: Plotsky by analogy with Trotsky (replace tr with pl)
- Liberman and Church found that for the 250,000 most common Donnelly names, they got 212,000 (85%) from these modified dictionary-based methods (and used LTS for the rest)
Use of part-of-speech tagging in TTS
- From Liberman and Church: 20 most frequent homographs (words spelled the same but pronounced differently):
```
use 319
increase 230
close 215
record 195
house 150
contract 143
lead 131
live 130
lives 105
protest 94
survey 91
project 90
separate 87
present 80
read 72
subject 68
rebel 48
finance 46
estimate 46
```
- Still not very many; these account for well under a percent of word pronunciation errors.
- Festival uses DeRose tagger, which is an early HMM-style tagger.

Main errors left in Festival (from Alan Black lecture notes:)

LTS rules failing on novel forms
Foreign proper names often fail
Wrong POS; especially in newspaper headlines
POS is right but not in lexicon
POS not enough to differentiate pronunciation (wind w ih n d/wind w ay n d) and not yet dealt with by homograph disambiguation CART.
