Transcript of GBO Presentation: Cross-Lingual Language Modeling (clsp.jhu.edu/~woosung/pdf/gbo.pdf)
GBO : Nov. 14, 2003 Cross-Lingual Language Modeling for ASR – p. 1/19
GBO Presentation
Cross-Lingual Language Modeling
for Automatic Speech Recognition
November 14, 2003
Woosung [email protected]
Center for Language and Speech Processing
Dept. of Computer Science
The Johns Hopkins University, Baltimore, MD 21218, USA
•Title
•Introduction
•Language Models
•Information Retrieval
•Cross-Lingual LM for ASR
•Model Estimation
•Cross-Lingual Triggers
•LSA for CLIR
•Corpora
•Experimental Results
•Conclusions
Introduction
Motivation: Success of statistical modeling techniques
Development of modeling and automatic learning techniques
A large amount of data for training is available
Most resources are in English, French, and German
How to construct stochastic models in resource-deficient languages? ➔ Bootstrap from other languages, e.g.
Universal phone set for ASR (Schultz & Waibel, 98; Byrne et al., 00)
Exploit parallel texts to project morphological analyzers, POS taggers, etc. (Yarowsky, Ngai & Wicentowski, 01)
Language modeling (this talk)
Introduction
We present:
An approach to sharpen an LM in a resource-deficient language using comparable text from resource-rich languages
Story-specific language models from contemporaneous text
Integration of machine translation (MT), cross-language information retrieval (CLIR), and language modeling (LM)
Language Models
Ex1: Optical Character Recognition
All I am a student
Ex2: Speech Recognition
It’s [tu:] cold → too    He is [tu:]-years-old → two
Many speech & NLP applications deal with ungrammatical sentences/phrases
Need to suppress ungrammatical outputs
Assign a probability to a given sequence of words

P (“He is two-years-old”) ≫ P (“He is too-years-old”)
Language Models: ASR problem
ASR problem: find the word string W = w_1, w_2, ..., w_n given acoustic evidence A

  Ŵ = argmax_W P(W|A)    (1)

Bayes’ Rule:

  P(W|A) = P(W) P(A|W) / P(A)    (2)

  Ŵ = argmax_W P(W) P(A|W)    (3)

where P(W) is the language model and P(A|W) is the acoustic model.
Language Models: Formulation

  P(w_1, w_2, ..., w_n)
    = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_n|w_1, ..., w_{n-1})
    ≈ P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_n|w_{n-2}, w_{n-1})
    = P(w_1) P(w_2|w_1) ∏_{i=3}^{n} P(w_i|w_{i-2}, w_{i-1})    (4)

  P(w_i|w_{i-2}, w_{i-1}) = N(w_{i-2}, w_{i-1}, w_i) / N(w_{i-2}, w_{i-1})  ⇒ Trigram    (5)
  P(w_i|w_{i-1}) = N(w_{i-1}, w_i) / N(w_{i-1})                             ⇒ Bigram     (6)
  P(w_i) = N(w_i) / Σ_{w∈V} N(w)                                            ⇒ Unigram    (7)

where N(w_{i-2}, w_{i-1}, w_i) denotes the number of times the entry (w_{i-2}, w_{i-1}, w_i) appears in the training data and V is the vocabulary.
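The count-based estimates in (5)-(7) can be sketched in a few lines of Python (a minimal illustration; the function name and toy corpus are invented for this example, and real systems would add smoothing for unseen n-grams):

```python
from collections import Counter

def train_ngram_mle(corpus, n):
    """Maximum-likelihood n-gram model: P(w_i | history) = N(history, w_i) / N(history),
    as in equations (5)-(7). No smoothing: unseen histories raise an error."""
    ngrams, histories = Counter(), Counter()
    for sent in corpus:
        for i in range(n - 1, len(sent)):
            history = tuple(sent[i - n + 1:i])
            ngrams[history + (sent[i],)] += 1
            histories[history] += 1
    return lambda history, w: ngrams[tuple(history) + (w,)] / histories[tuple(history)]

corpus = [["he", "is", "two", "years", "old"], ["he", "is", "here"]]
p_bigram = train_ngram_mle(corpus, 2)
print(p_bigram(["he"], "is"))   # N(he, is) / N(he) = 2/2 = 1.0
print(p_bigram(["is"], "two"))  # N(is, two) / N(is) = 1/2 = 0.5
```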
Language Models: Evaluation
Word Error Rate (WER): Performance of ASR given the LM

  REFERENCE:     UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER *** HUGE AREAS
  HYPOTHESIS:    ** UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
  COR(0)/ERR(1):  1    0     0    0      0     0   1    1    1    0    0
  4 errors per 10 reference words; WER = 40%

➔ The ultimate measure, but expensive to compute

Perplexity (PPL): based on the cross entropy of test data D w.r.t. LM M

  H(P_D; P_M) = − Σ_{w∈V} P_D(w) log2 P_M(w)                        (8)
              = − (1/N) Σ_{i=1}^{N} log2 P_M(w_i|w_{i-2}, w_{i-1})  (9)

  PPL_M(D) = 2^{H(P_D; P_M)} = [P_M(w_1, ..., w_N)]^{−1/N}          (10)

For both WER and PPL, the lower, the better!
Close correlation between WER and PPL
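The WER example above is a minimum-edit-distance (Levenshtein) alignment between reference and hypothesis, divided by the reference length. A minimal sketch (function name is an assumption, not a tool from the talk):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / len(ref),
    computed by dynamic-programming edit distance over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS".split()
hyp = "UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS".split()
print(wer(ref, hyp))  # 4 edits (1 del, 2 sub, 1 ins) / 10 reference words = 0.4
```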
Information Retrieval
Given a query doc, find relevant docs from the collection
Vector-based IR:
Represent each doc as a Bag-Of-Words (BOW)
Convert the BOW into a vector of terms
All terms are considered independent (orthogonal)
Measure similarity between a query and docs

          air  automobile  bank  car ...
  query: (  0       1        2    0  ... )
  doc:   (  2       0        5    3  ... )

Similarity measure: cosine similarity ➔ inner product

  sim(d⃗_j, q⃗) = Σ_{i=1}^{t} w_ij × w_iq / ( √(Σ_{i=1}^{t} w_ij²) × √(Σ_{i=1}^{t} w_iq²) )    (11)

where d⃗_j = (w_1j, ..., w_tj) and q⃗ = (w_1q, ..., w_tq)
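Equation (11) over sparse bag-of-words vectors can be sketched as follows (the dictionary representation, function name, and toy weights are assumptions for illustration):

```python
import math

def cosine_sim(q, d):
    """Cosine similarity (eq. 11) of two term-weight vectors stored as dicts term -> weight."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())   # only shared terms contribute
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

query = {"automobile": 1, "bank": 2}
doc = {"air": 2, "bank": 5, "car": 3}
print(cosine_sim(query, doc))  # overlap only on "bank"; ~0.725
```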
Information Retrieval
Retrieval performance evaluation:

  Precision = No. {Retrieved docs ∩ Relevant docs} / No. Retrieved docs                      (12)
  Recall    = No. {Retrieved docs ∩ Relevant docs} / No. Relevant docs (in the collection)   (13)

Problems:
Terms are not orthogonal
Mismatches between queries and docs
Polysemy: e.g., bank (❶ river, ❷ money, · · · ) ➔ precision ⇓
Synonymy: e.g., car, automobile, vehicle, · · · ➔ recall ⇓
➔ Latent Semantic Analysis (LSA) has been proposed
Cross-Lingual IR: query & docs are in different languages ➔ query-translation approach
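Equations (12)-(13) in code (document IDs and function name are invented for this toy example):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    Recall = |retrieved ∩ relevant| / |relevant|, as in eqs. (12)-(13)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# 4 docs retrieved, 3 relevant in the collection, 2 of them found
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)  # precision 2/4 = 0.5, recall 2/3
```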
Cross-Lingual LM for ASR
[System diagram: a Mandarin story is decoded by the baseline ASR system (baseline Chinese acoustic model, Chinese dictionary/vocabulary, baseline Chinese language model) to produce an automatic transcription d^C_i. Cross-language information retrieval matches d^C_i against contemporaneous English articles to find the English article aligned with the Mandarin story, d^E_i, giving P̂(e|d^E_i). Statistical machine translation supplies translation lexicons P_T(c|e), yielding the cross-language unigram model P_CL-unigram(c|d^E_i), which augments the LM for recognition.]
Model Estimation
Assume the document correspondence d^E_i ↔ d^C_i is known for Chinese test doc d^C_i:

  P_CL-unigram(c|d^E_i) = Σ_{e∈E} P_T(c|e) P(e|d^E_i),  ∀c ∈ C    (14)

Cross-Language LM construction:
Build story-specific cross-language LMs, P(c|d^E_i)
Linear interpolation with the baseline trigram LM:

  P_CL-interpolated(c_k|c_{k−1}, c_{k−2}, d^E_i)
    = λ P_CL-unigram(c_k|d^E_i) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})    (15)

λ is optimized to minimize the PPL of held-out data via the EM algorithm
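Equations (14)-(15) can be sketched as follows (the translation table, its probabilities, and the romanized Chinese words are invented toy values, not from the talk):

```python
def cl_unigram(p_t, p_e_doc):
    """Eq. (14): P_CL-unigram(c|d^E) = sum_e P_T(c|e) * P(e|d^E).
    p_t maps English word e -> {Chinese word c: P_T(c|e)};
    p_e_doc maps e -> P(e|d^E)."""
    p_c = {}
    for e, pe in p_e_doc.items():
        for c, pce in p_t.get(e, {}).items():
            p_c[c] = p_c.get(c, 0.0) + pce * pe
    return p_c

def interpolate(lam, p_cl_c, p_trigram_c):
    """Eq. (15): linear interpolation with the baseline trigram probability."""
    return lam * p_cl_c + (1.0 - lam) * p_trigram_c

p_t = {"car": {"qiche": 0.7, "che": 0.3}, "bank": {"yinhang": 1.0}}  # hypothetical lexicon
p_e_doc = {"car": 0.6, "bank": 0.4}                                  # English doc unigram
p_c = cl_unigram(p_t, p_e_doc)
print(p_c["qiche"])                            # 0.7 * 0.6 = 0.42
print(interpolate(0.2, p_c["qiche"], 0.001))   # mixes CL unigram with trigram prob
```

Note that if each P_T(·|e) and P(·|d^E) sums to one, the resulting cross-language unigram is automatically normalized.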
Model Estimation
Document correspondence ➔ obtained by CLIR
For each Chinese test doc d^C_i, create an English BOW
Find the English doc with the highest cosine similarity:

  d^E_i = argmax_{d^E_j ∈ D^E} sim_CL( P(e|d^C_i), P(e|d^E_j) )    (16)

Estimation of P_T(c|e) and P_T(e|c):
GIZA++: statistical MT tool based on IBM Model-4
Input: Hong Kong News Chinese-English sentence-aligned parallel corpus ➔ 18K docs, 200K sents, 4M wds each
We only need the translation tables: P_T(e|c) and P_T(c|e)
Alternatives to the MT lexicons:
Mutual Information-based CL-triggers
CL-LSA (Latent Semantic Analysis)
Cross-Lingual Lexical Triggers: Identification
Monolingual triggers: e.g. “either ... or”
Cross-lingual setting: translation lexicons
Based on average mutual information I(e; c). Let

  P(e, c) = #d(e, c) / N   and   P(e, c̄) = #d(e, c̄) / N    (17)

where #d(e, c) denotes the number of aligned article pairs in which e occurs on the English side and c on the Chinese side, #d(e) the number of English articles in which e occurs, and N the number of article pairs; and let

  P(e) = #d(e) / N   and   P(c|e) = P(e, c) / P(e)    (18)

  I(e; c) = P(e, c) log [P(c|e) / P(c)] + P(e, c̄) log [P(c̄|e) / P(c̄)]
          + P(ē, c) log [P(c|ē) / P(c)] + P(ē, c̄) log [P(c̄|ē) / P(c̄)]    (19)
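Equation (19) from document co-occurrence counts, using the equivalent form P(x, y) log[P(x, y) / (P(x)P(y))] for each of the four joint events (the function name and count arguments are assumptions for this sketch):

```python
import math

def avg_mutual_information(n_e, n_c, n_ec, N):
    """Eq. (19): average MI I(e;c) over the four events (e or not-e, c or not-c),
    with probabilities estimated from document counts as in (17)-(18).
    n_e: #docs with e; n_c: #docs with c; n_ec: #aligned pairs with both; N: total pairs."""
    def term(p_joint, p_marg_e, p_marg_c):
        # p_joint * log( P(c-event | e-event) / P(c-event) ), written via the joint
        return p_joint * math.log(p_joint / (p_marg_e * p_marg_c)) if p_joint > 0 else 0.0
    p_e, p_c = n_e / N, n_c / N
    p_ec = n_ec / N                                        # P(e, c)
    return (term(p_ec, p_e, p_c)
            + term(p_e - p_ec, p_e, 1 - p_c)               # P(e, not-c)
            + term(p_c - p_ec, 1 - p_e, p_c)               # P(not-e, c)
            + term(1 - p_e - p_c + p_ec, 1 - p_e, 1 - p_c))  # P(not-e, not-c)

print(avg_mutual_information(50, 50, 25, 100))  # independent words: I = 0
print(avg_mutual_information(50, 50, 50, 100))  # perfectly correlated: I = ln 2
```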
Cross-Lingual Lexical Triggers: Estimation
We estimate the trigger-based CL unigram probabilities as

  P_Trig(c|e) = I(e; c) / Σ_{c′∈C} I(e; c′)    (20)

Analogous to (14),

  P_Trig-unigram(c|d^E_i) = Σ_{e∈E} P_Trig(c|e) P(e|d^E_i)    (21)

Again, we build the interpolated model

  P_Trig-interpolated(c_k|c_{k−1}, c_{k−2}, d^E_i)
    = λ P_Trig-unigram(c_k|d^E_i) + (1 − λ) P(c_k|c_{k−1}, c_{k−2})    (22)
Latent Semantic Analysis for CLIR
Singular Value Decomposition (SVD) of the parallel corpus

[Figure: W = U × S × V^T, where W is the M × N word-document frequency matrix whose columns are the paired documents d^C_j, d^E_j; U is M × R, S is R × R, and V^T is R × N.]

Input: word-document frequency matrix, W
Reduce the dimension into a smaller but adequate subspace ➔ Singular Value Decomposition: U, V, and S
S: diagonal matrix with diagonal entries σ_1, ..., σ_k, where σ_1 ≥ σ_2 ≥ · · · ≥ σ_k (k ≥ R)
Remove noisy entries by setting σ_i = 0 for i > R
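The truncation step can be sketched with NumPy (the matrix below is an invented toy word-by-document table, not corpus data):

```python
import numpy as np

# Toy word-by-document matrix: rows are (English + Chinese) terms,
# columns are aligned document pairs.
W = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [2., 0., 1.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U S V^T, s sorted descending

R = 2                                              # keep the R largest singular values
W_R = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]        # i.e., set sigma_i = 0 for i > R

print(np.linalg.norm(W - W_R))  # rank-R approximation error
```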
Latent Semantic Analysis for CLIR
Folding-in a monolingual corpus

[Figure: a monolingual M × P word-document matrix W (docs d^E_1, ..., d^E_P), with zeros in the rows of the other language, is decomposed with the training-time matrices: W = U × S × V^T.]

Given a monolingual corpus, W, in either side
Use the same matrices U, S
Project into the low-dimensional space: V^T = S^{−1} U^T W (U has orthonormal columns, so its pseudo-inverse is U^T)
Compare a query and a document in the reduced-dimensional space
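Folding-in, sketched with NumPy (toy data; following the projection v = S^{−1} U^T w, with U^T standing in for the pseudo-inverse of the column-orthonormal U):

```python
import numpy as np

# Train-time SVD of the (toy) parallel word-document matrix.
W = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_R, S_R_inv = U[:, :R], np.diag(1.0 / s[:R])

# Fold in a new monolingual document vector w: v = S^{-1} U^T w.
w_new = np.array([1., 0., 2.])
v_new = S_R_inv @ U_R.T @ w_new

# Sanity check: folding in a training column recovers its own latent coordinates,
# since U^T W = S V^T for the training matrix.
v0 = S_R_inv @ U_R.T @ W[:, 0]
print(np.allclose(v0, Vt[:R, 0]))  # True
```

New documents and queries are then compared by cosine similarity between their R-dimensional projections rather than their raw term vectors.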
Training and Test Corpora
Acoustic model training:
HUB4-NE Mandarin training data (96K wds) ∼ 10 hours
Chinese monolingual language model training:
XINHUA: 13M wds
HUB4-NE: 96K wds
ASR test set: NIST HUB4-NE test data (only the F0 portion)
1263 sents, 9.8K wds (1997 ∼ 1998)
English CLIR corpus: NAB-TDT
NAB (1997 LA, WP) + TDT-2 (1998 APW, NYT)
45K docs, 30M wds
ASR Experimental Results
Vocab: 51K for Chinese; 300-best list rescoring
Oracle best/worst WER: 33.4/94.4% for Xinhua and 39.7/95.5% for HUB4-NE

  Language Model               Perp   WER     CER     p-value
  Xinhua   Trigram             426    49.9%   28.8%   –
           Trig-interpolated   367    49.1%   28.6%   0.004
           LSA-interpolated    364    49.3%   28.9%   0.043
           CL-interpolated     346    48.8%   28.4%   < 0.001
  HUB4-NE  Trigram             1195   60.1%   44.1%   –
           Trig-interpolated   727    58.8%   43.3%   < 0.001
           LSA-interpolated    695    58.6%   43.1%   < 0.001
           CL-interpolated     630    58.8%   43.1%   < 0.001
Conclusions
Exploits side information from contemporaneous articles ➔ useful for resource-deficient languages
Statistically significant improvements in ASR WER
Use of CL triggers & CL-LSA ➔ a document-aligned corpus suffices, rather than a sentence-aligned corpus
Future work:
Extensions to higher-order N-grams (e.g., bigrams)
Discriminative LMs for word sense disambiguation ➔ story-specific translation models
Applications to other languages (e.g., Arabic) and other tasks (e.g., MT)