Transcript of GBO Presentation: Cross-Lingual Language Modeling (clsp.jhu.edu/~woosung/pdf/gbo.pdf)
GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR – p. 1/19
GBO Presentation
Cross-Lingual Language Modeling
for Automatic Speech Recognition
November 14, 2003
Woosung [email protected]
Center for Language and Speech Processing
Dept. of Computer Science
The Johns Hopkins University, Baltimore, MD 21218, USA
•Title
•Introduction
•Language Models
•Information Retrieval
•Cross-Lingual LM for ASR
•Model Estimation
•Cross-Lingual Triggers
•LSA for CLIR
•Corpora
•Experimental Results
•Conclusions
Introduction
Motivation: Success of statistical modeling techniques
Development of modeling and automatic learning techniques
A large amount of data for training is available
Most resources are in English, French, and German
How to construct stochastic models in resource-deficient languages? ➔ Bootstrap from other languages, e.g.
Universal phone set for ASR (Schultz & Waibel, 98; Byrne et al., 00)
Exploit parallel texts to project morphological analyzers, POS taggers, etc. (Yarowsky, Ngai & Wicentowski, 01)
Language modeling (this talk)
Introduction
We present:
An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages
Story-specific language models from contemporaneous text
Integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)
Language Models
Ex1: Optical Character Recognition
All I am a student
Ex2: Speech Recognition
It’s [tu:] cold → too    He is [tu:]-years-old → two
Many speech & NLP applications deal with ungrammatical sentences/phrases
Need to suppress ungrammatical outputs
Assign a probability to a given sequence of words

P (“He is two-years-old”) ≫ P (“He is too-years-old”)
Language Models: ASR problem
ASR problem: find the word string W = w_1, w_2, ..., w_n given acoustic evidence A

  Ŵ = argmax_W P(W|A)    (1)

Bayes’ Rule:

  P(W|A) = P(W) P(A|W) / P(A)    (2)

  Ŵ = argmax_W P(W) P(A|W)    (3)

where P(W) is the language model and P(A|W) is the acoustic model.
Language Models: Formulation

  P(w_1, w_2, ..., w_n)
    = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_n|w_1, ..., w_{n-1})
    ≈ P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_n|w_{n-2}, w_{n-1})
    = P(w_1) P(w_2|w_1) ∏_{i=3}^{n} P(w_i|w_{i-2}, w_{i-1})    (4)

  P(w_i|w_{i-2}, w_{i-1}) = N(w_{i-2}, w_{i-1}, w_i) / N(w_{i-2}, w_{i-1})  ⇒ Trigram    (5)
  P(w_i|w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1})                             ⇒ Bigram     (6)
  P(w_i) = N(w_i) / Σ_{w∈V} N(w)                                            ⇒ Unigram    (7)

where N(w_{i-2}, w_{i-1}, w_i) denotes the number of times the entry (w_{i-2}, w_{i-1}, w_i) appears in the training data and V is the vocabulary.
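The count-based estimates in (5)-(7) can be sketched in a few lines of Python (a minimal illustration; the function name and toy corpus are invented for this example, and real systems would add smoothing for unseen n-grams):

```python
from collections import Counter

def train_ngram_mle(corpus, n):
    """Maximum-likelihood n-gram model: P(w_i | history) = N(history, w_i) / N(history),
    as in equations (5)-(7). No smoothing: unseen histories raise an error."""
    ngrams, histories = Counter(), Counter()
    for sent in corpus:
        for i in range(n - 1, len(sent)):
            history = tuple(sent[i - n + 1:i])
            ngrams[history + (sent[i],)] += 1
            histories[history] += 1
    return lambda history, w: ngrams[tuple(history) + (w,)] / histories[tuple(history)]

corpus = [["he", "is", "two", "years", "old"], ["he", "is", "here"]]
p_bigram = train_ngram_mle(corpus, 2)
print(p_bigram(["he"], "is"))   # N(he, is) / N(he) = 2/2 = 1.0
print(p_bigram(["is"], "two"))  # N(is, two) / N(is) = 1/2 = 0.5
```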
Language Models: Evaluation
Word Error Rate (WER): Performance of ASR given the LM

  REFERENCE:     UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER *** HUGE AREAS
  HYPOTHESIS:    ** UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
  COR(0)/ERR(1):  1    0     0    0      0     0   1    1    1    0    0
  4 errors per 10 reference words; WER = 40%

➔ The ultimate measure, but expensive to compute

Perplexity (PPL): based on the cross entropy of test data D w.r.t. LM M

  H(P_D; P_M) = − Σ_{w∈V} P_D(w) log2 P_M(w)                        (8)
              = − (1/N) Σ_{i=1}^{N} log2 P_M(w_i|w_{i-2}, w_{i-1})  (9)

  PPL_M(D) = 2^{H(P_D; P_M)} = [P_M(w_1, ..., w_N)]^{−1/N}          (10)

For both WER and PPL, the lower, the better!
Close correlation between WER and PPL
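The WER example above is a minimum-edit-distance (Levenshtein) alignment between reference and hypothesis, divided by the reference length. A minimal sketch (function name is an assumption, not a tool from the talk):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    computed by dynamic-programming edit distance over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS".split()
hyp = "UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS".split()
print(wer(ref, hyp))  # 4 edits (1 del, 2 sub, 1 ins) / 10 reference words = 0.4
```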
Information Retrieval
Given a query doc, find relevant docs from the collection
Vector-based IR:
Represent each doc as a Bag-Of-Words (BOW)
Convert the BOW into a vector of terms
All terms are considered independent (orthogonal)
Measure similarity between a query and docs

          air  automobile  bank  car ...
  query: (  0       1        2    0  ... )
  doc:   (  2       0        5    3  ... )

Similarity measure: cosine similarity ➔ inner product

  sim(d⃗_j, q⃗) = Σ_{i=1}^{t} w_ij × w_iq / ( √(Σ_{i=1}^{t} w_ij²) × √(Σ_{i=1}^{t} w_iq²) )    (11)

where d⃗_j = (w_1j, ..., w_tj) and q⃗ = (w_1q, ..., w_tq)
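Equation (11) over sparse bag-of-words vectors can be sketched as follows (the dictionary representation, function name, and toy weights are assumptions for illustration):

```python
import math

def cosine_sim(q, d):
    """Cosine similarity (eq. 11) of two term-weight vectors stored as dicts term -> weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())   # only shared terms contribute
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query = {"automobile": 1, "bank": 2}
doc = {"air": 2, "bank": 5, "car": 3}
print(cosine_sim(query, doc))  # overlap only on "bank"; ~0.725
```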
Information Retrieval
Retrieval performance evaluation:

  Precision = No. {Retrieved docs ∩ Relevant docs} / No. Retrieved docs                      (12)
  Recall    = No. {Retrieved docs ∩ Relevant docs} / No. Relevant docs (in the collection)   (13)

Problems:
Terms are not orthogonal
Mismatches between queries and docs
Polysemy: e.g., bank (❶ river, ❷ money, · · · ) ➔ precision ⇓
Synonymy: e.g., car, automobile, vehicle, · · · ➔ recall ⇓
➔ Latent Semantic Analysis (LSA) has been proposed
Cross-Lingual IR: query & docs are in different languages ➔ query-translation approach
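Equations (12)-(13) in code (document IDs and function name are invented for this toy example):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    Recall = |retrieved ∩ relevant| / |relevant|, as in eqs. (12)-(13)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 4 docs retrieved, 3 relevant in the collection, 2 of them found
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # precision 2/4 = 0.5, recall 2/3
```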
Cross-Lingual LM for ASR
[System diagram: a Mandarin story is decoded by the baseline ASR system (baseline Chinese acoustic model, Chinese dictionary/vocabulary, baseline Chinese language model) to produce an automatic transcription d^C_i. Cross-language information retrieval matches d^C_i against contemporaneous English articles to find the English article aligned with the Mandarin story, d^E_i, giving P̂(e|d^E_i). Statistical machine translation supplies translation lexicons P_T(c|e), yielding the cross-language unigram model P_CL-unigram(c|d^E_i), which augments the LM for recognition.]
Model Estimation
Assume the document correspondence d^E_i ↔ d^C_i is known for Chinese test doc d^C_i:

  P_CL-unigram(c|d^E_i) = Σ_{e∈E} P_T(c|e) P(e|d^E_i),  ∀c ∈ C    (14)

Cross-Language LM construction:
Build story-specific cross-language LMs, P(c|d^E_i)
Linear interpolation with the baseline trigram LM:

  P_CL-interpolated(c_k|c_{k−1}, c_{k−2}, d^E_i)
    = λ P_CL-unigram(c_k|d^E_i) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})    (15)

λ is optimized to minimize the PPL of held-out data via the EM algorithm
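Equations (14)-(15) can be sketched as follows (the translation table, its probabilities, and the romanized Chinese words are invented toy values, not from the talk):

```python
def cl_unigram(p_t, p_e_doc):
    """Eq. (14): P_CL-unigram(c|d^E) = sum_e P_T(c|e) * P(e|d^E).
    p_t maps English word e -> {Chinese word c: P_T(c|e)};
    p_e_doc maps e -> P(e|d^E)."""
    p_c = {}
    for e, pe in p_e_doc.items():
        for c, pce in p_t.get(e, {}).items():
            p_c[c] = p_c.get(c, 0.0) + pce * pe
    return p_c

def interpolate(lam, p_cl_c, p_trigram_c):
    """Eq. (15): linear interpolation with the baseline trigram probability."""
    return lam * p_cl_c + (1.0 - lam) * p_trigram_c

p_t = {"car": {"qiche": 0.7, "che": 0.3}, "bank": {"yinhang": 1.0}}  # hypothetical lexicon
p_e_doc = {"car": 0.6, "bank": 0.4}                                  # English doc unigram
p_c = cl_unigram(p_t, p_e_doc)
print(p_c["qiche"])                            # 0.7 * 0.6 = 0.42
print(interpolate(0.2, p_c["qiche"], 0.001))   # mixes CL unigram with trigram prob
```

Note that if each P_T(·|e) and P(·|d^E) sums to one, the resulting cross-language unigram is automatically normalized.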
Model Estimation
Document correspondence ➔ obtained by CLIR
For each Chinese test doc d^C_i, create an English BOW
Find the English doc with the highest cosine similarity:

  d^E_i = argmax_{d^E_j ∈ D^E} sim_CL( P(e|d^C_i), P(e|d^E_j) )    (16)

Estimation of P_T(c|e) and P_T(e|c):
GIZA++: statistical MT tool based on IBM Model-4
Input: Hong Kong News Chinese-English sentence-aligned parallel corpus ➔ 18K docs, 200K sents, 4M wds each
We only need the translation tables: P_T(e|c) and P_T(c|e)
Alternatives to the MT lexicons:
Mutual Information-based CL-triggers
CL-LSA (Latent Semantic Analysis)
Cross-Lingual Lexical Triggers: Identification
Monolingual triggers: e.g. “either ... or”
Cross-lingual setting: translation lexicons
Based on average mutual information I(e; c). Let

  P(e, c) = #d(e, c) / N   and   P(e, c̄) = #d(e, c̄) / N    (17)

where #d(e, c) denotes the number of aligned article pairs in which e occurs on the English side and c on the Chinese side, #d(e) the number of English articles in which e occurs, and N the number of article pairs; and let

  P(e) = #d(e) / N   and   P(c|e) = P(e, c) / P(e)    (18)

  I(e; c) = P(e, c) log [P(c|e) / P(c)] + P(e, c̄) log [P(c̄|e) / P(c̄)]
          + P(ē, c) log [P(c|ē) / P(c)] + P(ē, c̄) log [P(c̄|ē) / P(c̄)]    (19)
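Equation (19) from document co-occurrence counts, using the equivalent form P(x, y) log[P(x, y) / (P(x)P(y))] for each of the four joint events (the function name and count arguments are assumptions for this sketch):

```python
import math

def avg_mutual_information(n_e, n_c, n_ec, N):
    """Eq. (19): average MI I(e;c) over the four events (e or not-e, c or not-c),
    with probabilities estimated from document counts as in (17)-(18).
    n_e: #docs with e; n_c: #docs with c; n_ec: #aligned pairs with both; N: total pairs."""
    def term(p_joint, p_marg_e, p_marg_c):
        # p_joint * log( P(c-event | e-event) / P(c-event) ), written via the joint
        return p_joint * math.log(p_joint / (p_marg_e * p_marg_c)) if p_joint > 0 else 0.0
    p_e, p_c = n_e / N, n_c / N
    p_ec = n_ec / N                                        # P(e, c)
    return (term(p_ec, p_e, p_c)
            + term(p_e - p_ec, p_e, 1 - p_c)               # P(e, not-c)
            + term(p_c - p_ec, 1 - p_e, p_c)               # P(not-e, c)
            + term(1 - p_e - p_c + p_ec, 1 - p_e, 1 - p_c))  # P(not-e, not-c)

print(avg_mutual_information(50, 50, 25, 100))  # independent words: I = 0
print(avg_mutual_information(50, 50, 50, 100))  # perfectly correlated: I = ln 2
```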
Cross-Lingual Lexical Triggers: Estimation
We estimate the trigger-based CL unigram probabilities as

  P_Trig(c|e) = I(e; c) / Σ_{c′∈C} I(e; c′)    (20)

Analogous to (14),

  P_Trig-unigram(c|d^E_i) = Σ_{e∈E} P_Trig(c|e) P(e|d^E_i)    (21)

Again, we build the interpolated model

  P_Trig-interpolated(c_k|c_{k−1}, c_{k−2}, d^E_i)
    = λ P_Trig-unigram(c_k|d^E_i) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})    (22)
Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus

[Figure: W = U × S × V^T, where W is the M × N word-document frequency matrix whose columns are the paired documents d^C_j, d^E_j; U is M × R, S is R × R, and V^T is R × N.]

Input: word-document frequency matrix, W
Reduce the dimension into a smaller but adequate subspace ➔ Singular Value Decomposition: U, V, and S
S: diagonal matrix with diagonal entries σ_1, ..., σ_k, where σ_1 ≥ σ_2 ≥ · · · ≥ σ_k (k ≥ R)
Remove noisy entries by setting σ_i = 0 for i > R
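The truncation step can be sketched with NumPy (the matrix below is an invented toy word-by-document table, not corpus data):

```python
import numpy as np

# Toy word-by-document matrix: rows are (English + Chinese) terms,
# columns are aligned document pairs.
W = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [2., 0., 1.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U S V^T, s sorted descending

R = 2                                              # keep the R largest singular values
W_R = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]        # i.e., set sigma_i = 0 for i > R

print(np.linalg.norm(W - W_R))  # rank-R approximation error
```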
Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus

[Figure: a monolingual M × P word-document matrix W (docs d^E_1, ..., d^E_P), with zeros in the rows of the other language, is decomposed with the training-time matrices: W = U × S × V^T.]

Given a monolingual corpus, W, in either side
Use the same matrices U, S
Project into the low-dimensional space: V^T = S^{−1} U^T W (U has orthonormal columns, so its pseudo-inverse is U^T)
Compare a query and a document in the reduced-dimensional space
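Folding-in, sketched with NumPy (toy data; following the projection v = S^{−1} U^T w, with U^T standing in for the pseudo-inverse of the column-orthonormal U):

```python
import numpy as np

# Train-time SVD of the (toy) parallel word-document matrix.
W = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_R, S_R_inv = U[:, :R], np.diag(1.0 / s[:R])

# Fold in a new monolingual document vector w: v = S^{-1} U^T w.
w_new = np.array([1., 0., 2.])
v_new = S_R_inv @ U_R.T @ w_new

# Sanity check: folding in a training column recovers its own latent coordinates,
# since U^T W = S V^T for the training matrix.
v0 = S_R_inv @ U_R.T @ W[:, 0]
print(np.allclose(v0, Vt[:R, 0]))  # True
```

New documents and queries are then compared by cosine similarity between their R-dimensional projections rather than their raw term vectors.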
Training and Test Corpora
Acoustic model training:
HUB4-NE Mandarin training data (96K wds) ∼ 10 hours
Chinese monolingual language model training:
XINHUA: 13M wds
HUB4-NE: 96K wds
ASR test set: NIST HUB4-NE test data (only the F0 portion)
1263 sents, 9.8K wds (1997 ∼ 1998)
English CLIR corpus: NAB-TDT
NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
45K docs, 30M wds
ASR Experimental Results
Vocab: 51K for Chinese; 300-best list rescoring
Oracle best/worst WER: 33.4/94.4% for Xinhua and 39.7/95.5% for HUB4-NE

  Language Model               Perp   WER     CER     p-value
  Xinhua   Trigram             426    49.9%   28.8%   –
           Trig-interpolated   367    49.1%   28.6%   0.004
           LSA-interpolated    364    49.3%   28.9%   0.043
           CL-interpolated     346    48.8%   28.4%   < 0.001
  HUB4-NE  Trigram             1195   60.1%   44.1%   –
           Trig-interpolated   727    58.8%   43.3%   < 0.001
           LSA-interpolated    695    58.6%   43.1%   < 0.001
           CL-interpolated     630    58.8%   43.1%   < 0.001
Conclusions
Exploits side information from contemporaneous articles ➔ useful for resource-deficient languages
Statistically significant improvements in ASR WER
Use of CL triggers & CL-LSA ➔ a document-aligned corpus suffices, rather than a sentence-aligned corpus
Future work:
Extensions to higher-order N-grams (e.g., bigrams)
Discriminative LMs for word sense disambiguation ➔ story-specific translation models
Applications to other languages (e.g., Arabic) and other tasks (e.g., MT)