|
1
|
- Florian Jaeger,
- tiflo@stanford.edu
- For LIN110,
- May 10th, 2004
|
|
2
|
- This presentation will be available online at:
- http://www.stanford.edu/dept/linguistics/corpora/material/X_speech_corpora/
- (that is: X_speech_corpora)
- Local support
- Where are our corpora?
- Setting up your account on AFS
|
|
3
|
- Where can you get help with your project?
- Your TA
- The Corpora@Stanford website (http://www.stanford.edu/dept/linguistics/corpora/)
- The corpora@csli.stanford.edu email list (you have to subscribe first;
for instructions, see):
- http://www.stanford.edu/dept/linguistics/corpora/cas-support.html
- The corpus TA (tiflo@stanford.edu)
|
|
4
|
- AFS:
- AFS is Stanford’s file sharing system
- The linguistic corpora are stored at:
- /afs/ir/data/linguistic-data/
- You need to register for AFS access:
- http://www.stanford.edu/dept/linguistics/corpora/cas-rules.html#reg-standard
- You need to set up your account:
- http://www.stanford.edu/dept/linguistics/corpora/cas-setup.html
|
|
5
|
- Corpus Computer
- The computer is the one closest to the printer in the linguistics
department’s computer cluster (MJH, 1st floor)
- Login: ‘user’
- Pwd:
- The corpora are stored on partition D:\
- Mapping the drive via a network:
- http://www.stanford.edu/dept/linguistics/corpora/material/corpora-list/021204_FYI%20NEW%20local%20resource.txt
- http://www.stanford.edu/dept/linguistics/corpora/material/corpora-list/021204_addendum%20to%20last%20message.txt
|
|
6
|
- Example project
- Overview of available corpora
- Where to find them
- What does the annotation look like?
- How to search speech corpora
|
|
7
|
- Differences in the realization of phonemes depending on their context
- ‘Context’ can be segmental [1]
- How does the realization of syllabic /m/ differ depending on the
preceding onset?
- Word final vowel aspiration
- ‘Context’ can be supra-segmental [3]
- How does the realization of syllabic /m/ differ at the beginning/end
of conversations/utterances/sentences?
- Reduction of complex clusters
|
|
8
|
- ‘Context’ could also include the register, style (formal vs. informal),
genre (reading a fairy tale vs. reading an article), different
dialects, etc. [2]
- Pitch contours related to specific meanings [1]
- Steady-state pitch contours
|
|
9
|
- Cf. Colleen’s handout
- See also:
- http://www.stanford.edu/dept/linguistics/corpora/cas-corpora.html
|
|
10
|
|
|
11
|
- Transcripts uploaded to AFS:
- /afs/ir/data/linguistic-data/Switchboard/
- Sound files available on CD
- available in several formats:
- All in one file
- Separate files for
- Syllables
- Words
- Orthographic transcription
|
|
12
|
- Some files in Switchboard
|
|
13
|
- Key:
- SENTENCE: word1 word2 ...
(2005_A_0041)
- WORD: word canonical? [lm-probs] [rates] [positions] [morebigrams]
part-of-speech phone1 phone2 ...
- SYL: baseform transcribed syl_structure stress length [lm-probs] [rates]
[positions]
- PHONE: baseform stress syl_part [lm-probs] [rates] [positions] tran1
tran2 ...
|
|
14
|
- [lm-probs]= trigram unigram trigram-unigram
- [rates]= seg_tr_syl seg_tr_phn lex_syl lex_phn enrate vrate nvrate mrate
mfrate enmmfrate mmfrate
- [positions] = word_num_in_utterance word_num_in_turn
- [morebigrams] = bigram reverse-bigram reverse-trigram center-trigram
- part-of-speech = syntactic part of speech (currently only done for the
word "to")
- wordX= word number X in acoustically segmented `sentence'
- canonical?= can if canonical (pronlex) pronunciation, alt otherwise
- trigram= p(word | previous two words)
- unigram= p(word)
- trigram-unigram = difference between two probabilities
- seg_tr_syl= transcribed syllable rate between closest two pauses
- seg_tr_phn= transcribed phone rate between closest two pauses
- lex_syl= lexical syllabic rate (i.e. as determined from wd
transcription)
- lex_phn= lexical phone rate (i.e. as determined from wd transcription)
|
|
15
|
- enrate= old enrate measure
- vrate= voicing rate
- nvrate= another voicing rate
- mrate= sub-part of mrate measure
- mfrate= sub-part of mrate measure
- enmmfrate= *this is what we call mrate* average of enrate, mrate, mfrate
- mmfrate= average of mrate, mfrate
- baseform= pronunciation as written in dictionary
- transcribed= transcribed syllable
- syl_structure= onset/nucleus/coda markings from dictionary
- stress= syllable stress marking from dictionary P=primary S=secondary
N=none
- length= syllable length
- tranX= transcribed phone X corresponding to baseform phone
|
|
16
|
- SENTENCE: like finding a proper nursing home (2005_A_0041)
- WORD: like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32
5.79 2.32 4.64 3.59 3.48 0 26 l ay k
- SYL: l_ay_k l_ay_k O_N_C P 0.258 -2.408 -2.152 -0.256 4.64 10.43 3.87
9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26
- PHONE: l P O -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 0 26 l
- PHONE: ay P N -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 0 26 ay
- PHONE: k P C -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 0 26 k
- WORD: finding 2 alt -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32
5.79 2.32 4.64 3.59 3.48 1 27 f ay n ih ng
- SYL: f_ay_n f_ay_n O_N_C P 0.358 -3.604 -4.256 0.652 4.64 10.43 3.87
9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE: f P O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 1 27 f
- PHONE: ay P N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 1 27 ay
- PHONE: n P C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 1 27 n
- SYL: d_ih_ng NULL_ih_ng O_N_C N 0.117 -3.604 -4.256 0.652 4.64 10.43
3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE: d N O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 NULL 1 27
- PHONE: ih N N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 1 27 ih
- PHONE: ng N C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79
2.32 4.64 3.59 3.48 1 27 ng
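
The WORD lines in the example above can be split into named fields. The following is a minimal sketch, not the corpus's official tooling: it assumes the layout seen in the example (word, word number, can/alt, 3 lm-probs, 11 rates, 2 positions, then the phones), which omits the [morebigrams] and part-of-speech fields listed in the key. The names `WordRecord` and `parse_word_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# Field names taken from the key; the order follows the key's listing.
LM_PROBS = ["trigram", "unigram", "trigram_minus_unigram"]
RATES = ["seg_tr_syl", "seg_tr_phn", "lex_syl", "lex_phn", "enrate", "vrate",
         "nvrate", "mrate", "mfrate", "enmmfrate", "mmfrate"]
POSITIONS = ["word_num_in_utterance", "word_num_in_turn"]


@dataclass
class WordRecord:
    word: str
    index: int            # word number within the segmented sentence
    canonical: bool       # True for "can", False for "alt"
    lm_probs: Dict[str, float]
    rates: Dict[str, float]
    positions: Dict[str, int]
    phones: List[str]


def parse_word_line(line: str) -> WordRecord:
    """Parse one unwrapped 'WORD: ...' line laid out as in the example slide."""
    tag, _, rest = line.partition(":")
    if tag.strip() != "WORD":
        raise ValueError("not a WORD line: %r" % line)
    fields = rest.split()
    word, index, canonical = fields[0], int(fields[1]), fields[2] == "can"
    nums = fields[3:]
    a = len(LM_PROBS)
    b = a + len(RATES)
    c = b + len(POSITIONS)
    return WordRecord(
        word=word,
        index=index,
        canonical=canonical,
        lm_probs=dict(zip(LM_PROBS, map(float, nums[:a]))),
        rates=dict(zip(RATES, map(float, nums[a:b]))),
        positions=dict(zip(POSITIONS, map(int, nums[b:c]))),
        phones=nums[c:],   # assumption: everything after the positions is a phone
    )


# The first WORD line of the example, joined back onto one line.
line = ("WORD: like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 "
        "5.79 2.32 4.64 3.59 3.48 0 26 l ay k")
rec = parse_word_line(line)
print(rec.word, rec.canonical, rec.phones)
```

In the example data the third lm-prob equals the first minus the second, and enmmfrate matches the average of enrate, mrate and mfrate after rounding, which is a useful sanity check on the field order.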
|
|
17
|
- Includes read news and similar material (i.e. non-spontaneous, read speech)
- Transcripts uploaded to AFS at:
- /afs/ir/data/linguistic-data/Boston-University-Radio
- Sound files available on CD
|
|
18
|
- Boston News Corpus
- H# 0 4
- >endsil
- DH 4 5
- IH+1 9 10
- S 19 9
- >This
- HH 28 5
- AA+1 33 9
- L 42 12
- AX 54 4
- DCL 58 3
- D 61 1
- EY 62 16
- >holiday
- S 78 11
- IY+1 89 14
- Z 103 7
- EN 110 20
- …
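
The phone listing above can be grouped into words with a few lines of code. This is a rough sketch under assumptions read off the sample, not a documented format description: each phone line is taken to be `label start duration` (in the sample, every start equals the previous start plus duration), and a line beginning with `>` is taken to name the word made up of the phones listed just before it. The slides do not state the time unit, so the code keeps the raw integers. The function name `group_phones_by_word` is hypothetical.

```python
from typing import List, Tuple

Phone = Tuple[str, int, int]          # (label, start, duration), raw units
Word = Tuple[str, List[Phone]]        # (word label, its phones)


def group_phones_by_word(lines: List[str]) -> Tuple[List[Word], List[Phone]]:
    """Return (words, leftover phones that no '>' line has closed yet)."""
    words: List[Word] = []
    pending: List[Phone] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A word label closes the run of phones collected so far.
            words.append((line[1:], pending))
            pending = []
        else:
            label, start, dur = line.split()
            pending.append((label, int(start), int(dur)))
    return words, pending


sample = [
    "H# 0 4", ">endsil",
    "DH 4 5", "IH+1 9 10", "S 19 9", ">This",
    "HH 28 5", "AA+1 33 9", "L 42 12", "AX 54 4",
    "DCL 58 3", "D 61 1", "EY 62 16", ">holiday",
]
words, leftover = group_phones_by_word(sample)
for label, phones in words:
    start = phones[0][1]
    end = phones[-1][1] + phones[-1][2]
    print(label, start, end, [p[0] for p in phones])
```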
|
|
19
|
- XWAVES/PRAAT readable:
- signal st43/f3ast43p1
- type 1
- color 76
- font -*-times-medium-r-*-*-17-*-*-*-*-*-*-*
- separator ;
- nfields 1
- #
- 0.035000 76 H#
- 0.085000 76 DH
- 0.185000 76 IH+1
- 0.275000 76 S
- 0.325000 76 HH
- 0.415000 76 AA+1
- 0.535000 76 L
- 0.575000 76 AX
- 0.605000 76 DCL
- 0.615000 76 D
- 0.775000 76 EY
- 0.885000 76 S
- …
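
A label file like the one above is easy to read programmatically. The sketch below assumes only what the sample shows: a header of `key value` lines, a line containing just `#`, and then one `time color label` line per mark. Whether each time marks the end of its segment is not stated on the slide, so the code simply returns the marks in order. The function name `parse_label_file` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_label_file(lines: List[str]) -> Tuple[Dict[str, str], List[Tuple[float, str]]]:
    """Split a label file into its header fields and its (time, label) marks."""
    header: Dict[str, str] = {}
    marks: List[Tuple[float, str]] = []
    in_body = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not in_body:
            if line == "#":          # a lone '#' ends the header
                in_body = True
            else:
                key, _, value = line.partition(" ")
                header[key] = value.strip()
            continue
        # Body line: time in seconds, a color code, then the label text.
        time_str, _color, label = line.split(None, 2)
        marks.append((float(time_str), label))
    return header, marks


sample = [
    "signal st43/f3ast43p1",
    "type 1",
    "color 76",
    "separator ;",
    "nfields 1",
    "#",
    "0.035000 76 H#",
    "0.085000 76 DH",
    "0.185000 76 IH+1",
    "0.275000 76 S",
]
header, marks = parse_label_file(sample)
print(header["signal"], marks)
```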
|
|
20
|
- CALLHOME – Mandarin
- Transcripts uploaded to AFS:
- /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Transcripts/
- Lexicon with pronunciation information available at:
- /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon/
- Sound files only available on CD/DVD, but I could put them on the
corpus computer
|
|
21
|
- 150.41 150.82 B: 呃 哼 ,
- 154.57 161.32 B: 哎 , 那个 呃 你 说 你 说 他 有没有 &¿Àñ& 的
地址 啊 ? 我 怎么 [channel_noise] 找 了 半天 呢 没 找 着 他 的 地址 啊 ?
- 161.45 162.11 A: 是吗 ?
- 162.24 162.70 B: 呃 ,
- 162.93 163.88 A: %哦% ,
- 163.96 164.41 A: [background_speech_((哎_哟))]
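
Transcript lines of this shape (start time, end time, speaker, text) can be picked apart with a regular expression. This is a minimal sketch based only on the sample above; the name `parse_turn` is hypothetical, the words in the demo are placeholders (`w1`, `w2`, ...) rather than corpus text, and treating every token that starts with `[` as a non-speech annotation is an assumption.

```python
import re
from typing import List, NamedTuple, Optional


class Turn(NamedTuple):
    start: float
    end: float
    speaker: str
    tokens: List[str]        # whitespace-separated transcript tokens
    annotations: List[str]   # tokens in square brackets, e.g. [channel_noise]


# start, end, speaker label followed by a colon, then the rest of the line
TURN_RE = re.compile(r"^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(\w+):\s*(.*)$")


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn for a transcript line, or None if the line does not match."""
    m = TURN_RE.match(line.strip())
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    tokens = text.split()
    annotations = [t for t in tokens if t.startswith("[")]
    return Turn(float(start), float(end), speaker, tokens, annotations)


# Timestamps and speakers from the slide; the words are placeholders.
demo = [
    "154.57 161.32 B: w1 , w2 [channel_noise] w3 ?",
    "161.45 162.11 A: w4 ?",
]
turns = [parse_turn(line) for line in demo]
for t in turns:
    print(t.speaker, round(t.end - t.start, 2), t.annotations)
```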
|
|
22
|
- Telephone recordings of 8 major dialects of American English
- (Orthographic) transcripts on AFS; sound files available on CD
- Comparable dialect corpora exist for the British Isles (IViE; stored on
the corpus computer)
|
|
23
|
- TIMIT
- Word label (.wrd):
- 7470 11362 she
- 11362 16000 had
- 15420 17503 your
- 17503 23360 dark
- 23360 28360 suit
- 28360 30960 in
- 30960 36971 greasy
- Phonetic label (.phn):
- (Note: beginning and ending silence regions are marked with h#)
- 0 7470 h#
- 7470 9840 sh
- 9840 11362 iy
- 11362 12908 hv
- 12908 14760 ae
- 14760 15420 dcl
- 15420 16000 jh
- 16000 17503 axr
|
|
24
|
- Either load the files into your favorite text editor
- Or use a command from the ‘grep’ family (run in a UNIX shell)
- This allows you to search many files at once for patterns that are
described by regular expressions
- For help, see our tutorial page at:
- http://www.stanford.edu/dept/linguistics/corpora/cas-tut-grep.html
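
For readers who prefer a script to the shell, the same idea (one regular expression applied to many files at once) can be sketched in Python. This is a stand-in for grep, not grep itself, and it covers only the basic case of printing matching lines. The function name `grep_files` and the example directory and file pattern in the usage note are hypothetical.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_files(pattern: str, root: str, glob: str = "*") -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, line) for every line under root matching pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        # errors="replace" keeps the search going on files in other encodings
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, 1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")
```

Usage, assuming a local copy of some label files in a directory named `timit_labels`: `for path, n, line in grep_files(r"\bsh$", "timit_labels", "*.phn"): print(path, n, line)` lists every line whose last field is the phone label sh.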
|