
1

Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus

Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA

IBM Research, Tokyo Research Laboratory, IBM Japan

Presenter: Hsuan-Sheng Chiu


2

References

• S. Mori and D. Takuma, “Word N-gram Probability Estimation From A Japanese Raw Corpus,” in Proc. of ICSLP 2004, pp. 201–207.

• H. Feng, K. Chen, X. Deng, and W. Zheng, “Accessor Variety Criteria for Chinese Word Extraction,” Computational Linguistics, vol. 30, no.1, pp. 75–93, 2004.

• M. Asahara and Y. Matsumoto, “Japanese Unknown Word Identification by Character-based Chunking,” in Proc. of COLING 2004, pp. 459–465.


3

Introduction

• Domain-specific words tend to characterize their domain; misrecognition of these words causes severe quality degradation in LVCSR applications

• It has been necessary to segment the target domain’s corpus into words because no space exists between words in Japanese

• An ideal method:

– Experts manually segment a corpus of the target domain

– Domain-specific words are added to the lexicon

– The domain-specific LM is built from this correctly segmented corpus

• This is not realistic because the target domain will change

– A fully automatic method is necessary


4

Proposed method

This method is fully automatic

• 1. Segment the raw corpus stochastically

• 2. Build a word n-gram model from the stochastically segmented corpus

• 3. Add probable words to the lexicon of the LVCSR system


5

Stochastic Segmentation

• Raw corpus: All of the words are concatenated and there is no word boundary information

• Deterministically segmented corpus: corpus with deterministic word boundaries

• Stochastically segmented corpus: corpus with word boundary probabilities


6

Word Boundary Probability

• Estimate probability from a relatively small segmented corpus

• Introduce seven character classes since the number of characters in Japanese is large

– Kanji, symbols, Arabic digits, Hiragana, Katakana, Latin characters, and Cyrillic/Greek characters

• Word Boundary Probability:

$p_i = \frac{f(c(x_i)\,\mathrm{BT}\,c(x_{i+1}))}{f(c(x_i)\,c(x_{i+1}))}$

where

– $c(x)$: the character class to which $x$ belongs

– $\mathrm{BT}$: the word boundary token in the segmented corpus

– $f(s)$: the frequency of the string $s$ in the segmented corpus
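To make this concrete, below is a minimal Python sketch (not the authors' code) that estimates these character-class boundary statistics from a tiny hand-segmented seed corpus; the class inventory is simplified and the seed sentences are invented for illustration.

```python
import unicodedata
from collections import defaultdict

def char_class(ch):
    # Simplified stand-in for the paper's seven classes (Kanji, symbols,
    # Arabic digits, Hiragana, Katakana, Latin, Cyrillic/Greek).
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    if ch.isdigit():
        return "digit"
    if "LATIN" in name:
        return "latin"
    return "symbol"

def boundary_probabilities(segmented_sentences):
    # Count f(c(x_i) c(x_{i+1})) and f(c(x_i) BT c(x_{i+1})) over a small
    # seed corpus given as lists of words (the manual segmentation).
    pair = defaultdict(int)
    pair_bt = defaultdict(int)
    for words in segmented_sentences:
        chars = "".join(words)
        bounds, pos = set(), 0
        for w in words[:-1]:              # character offsets of word boundaries
            pos += len(w)
            bounds.add(pos)
        for i in range(len(chars) - 1):
            key = (char_class(chars[i]), char_class(chars[i + 1]))
            pair[key] += 1
            if i + 1 in bounds:
                pair_bt[key] += 1
    # p_i = f(c(x_i) BT c(x_{i+1})) / f(c(x_i) c(x_{i+1}))
    return {k: pair_bt[k] / pair[k] for k in pair}

# Toy seed corpus: two segmented sentences (invented for illustration)
seed = [["今日", "は", "晴れ"], ["明日", "も", "晴れ"]]
print(boundary_probabilities(seed))
```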


7

Word N-gram Probability

• The expected number of words in the raw corpus $x_1 x_2 \cdots x_{n_r}$:

$f_r = \sum_{i=1}^{n_r - 1} p_i + 1$

• A character sequence in the raw corpus is treated as a word if and only if there is a word boundary before and after the sequence and there is no word boundary inside the sequence

– Unigram frequency:

$f_r(w) = \sum_{i \in O_w} p_{i-1} \Bigl( \prod_{j=1}^{k-1} (1 - p_{i+j-1}) \Bigr) p_{i+k-1}$, where $O_w = \{\, i \mid x_i^{i+k-1} = w \,\}$ is the set of positions at which the $k$-character string $w$ occurs

– Unigram probability:

$P(w) = f_r(w) / f_r$

Example (word boundary probabilities between and around characters): 0.2 今 0.1 天 0.5 天 0.1 氣 0.5 真 0.1 好 0.2
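The expected counts can be reproduced with a short self-contained sketch, reusing the illustrative boundary probabilities from the example line above (all values are for illustration only, not from the paper).

```python
def expected_word_frequency(chars, p, word):
    # p[i] is the probability of a word boundary immediately before
    # chars[i]; p[len(chars)] is the boundary at the end of the string.
    k, total = len(word), 0.0
    for i in range(len(chars) - k + 1):
        if chars[i:i + k] != word:        # i ranges over O_w
            continue
        prob = p[i] * p[i + k]            # boundary before and after w
        for j in range(1, k):             # no boundary inside w
            prob *= 1.0 - p[i + j]
        total += prob
    return total

chars = "今天天氣真好"
p = [0.2, 0.1, 0.5, 0.1, 0.5, 0.1, 0.2]   # 7 boundary probabilities
f_r = sum(p[1:-1]) + 1.0                   # expected number of words
f_w = expected_word_frequency(chars, p, "天氣")
print(f_w, f_w / f_r)                      # unigram frequency and probability
```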


8

Word N-gram Probability (cont.)

• Word-based n-gram model

– Probability of a word sequence:

$M_{w,n}(\boldsymbol{w}_1^h) = \prod_{i=1}^{h+1} P(w_i \mid \boldsymbol{w}_{i-n+1}^{i-1})$, where $w_0 = w_{h+1} = \mathrm{BT}$ (BT: boundary token)

• Since it is impossible to define the complete vocabulary, a special token UW is used for unknown words

• The spelling of an unknown word is predicted by a character-based n-gram model:

$M_{x,n'}(\boldsymbol{x}_1^{h'}) = \prod_{i=1}^{h'+1} P(x_i \mid \boldsymbol{x}_{i-n'+1}^{i-1})$

• Probability of an OOV word:

$P(w_i \mid \boldsymbol{w}_{i-n+1}^{i-1}) = M_{x,n'}(w_i)\, P(\mathrm{UW} \mid \boldsymbol{w}_{i-n+1}^{i-1})$
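Below is a toy sketch of how the word and character models combine for an OOV word. For brevity both models are reduced to unigrams (the paper conditions on the preceding n-1 words), and every probability in the tables is invented.

```python
# Invented unigram tables; <UW> is the unknown-word token.
word_model = {"天氣": 0.10, "好": 0.05, "<UW>": 0.01}
char_model = {"真": 0.02, "好": 0.03}

def word_probability(w):
    if w in word_model:                    # in-vocabulary word
        return word_model[w]
    # OOV: spell the word with the character model, weighted by P(<UW>)
    spelling = 1.0
    for ch in w:
        spelling *= char_model.get(ch, 1e-6)
    return spelling * word_model["<UW>"]

print(word_probability("天氣"))  # known word: 0.1
print(word_probability("真好"))  # OOV: 0.02 * 0.03 * 0.01 = 6e-06
```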


9

Probable Character Strings Added to the Lexicon

• All of the character strings appearing in the domain-specific corpus could be treated as words

• However, a lot of meaningless character strings are also included

• Use traditional character-based approaches to judge whether or not a character string is appropriate as a word

– Accessor Variety criteria

• Accessor varieties
• Adhesive characters

– Character-based chunking

• POS features for chunking
• SVM-based chunking


10

Accessor Variety Criteria

• We first discard those strings with accessor varieties that are smaller than a certain number

• The remaining strings are considered to be potentially meaningful words. In addition, we apply rules to remove strings that consist of a word and adhesive characters

• Rule-based discarding: e.g., the AV of “的人們” is high, but it is not a word

– h: head-adhesive character
– t: tail-adhesive character
– core: meaningful word

Strings of the following patterns should be discarded:

– h + core: 的我
– core + t: 我的
– h + core + t: 的過程是

• AV example: corpus of five sentences

門把手弄壞了 / 小明修好了門把手 / 這個門把手很漂亮 / 這個門把手壞了 / 小明弄壞門把手

– Predecessors of 門把手 (left AV): S, 了, 個, 壞
– Successors of 門把手 (right AV): 弄, E, 很, 壞, E

Distinct characters are counted once; S (sentence start) and E (sentence end) are counted repeatedly

AV = min{left AV, right AV} = 4
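The AV computation on this example can be reproduced with a short sketch; following the slide's convention, distinct accessor characters are counted once while the sentence-edge markers S and E are counted per occurrence.

```python
def accessor_variety(sentences, s):
    left, right = [], []
    for sent in sentences:
        start = 0
        while (i := sent.find(s, start)) != -1:
            left.append(sent[i - 1] if i > 0 else "S")
            j = i + len(s)
            right.append(sent[j] if j < len(sent) else "E")
            start = i + 1

    def variety(accessors):
        # Distinct characters count once; each S/E occurrence counts.
        distinct = {a for a in accessors if a not in ("S", "E")}
        return len(distinct) + sum(a in ("S", "E") for a in accessors)

    return min(variety(left), variety(right))

sents = ["門把手弄壞了", "小明修好了門把手", "這個門把手很漂亮",
         "這個門把手壞了", "小明弄壞門把手"]
print(accessor_variety(sents, "門把手"))  # min(4, 5) = 4, as in the example
```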


11

Summary of the proposed method

• As regards time, the proposed method has an advantage because it only requires a raw corpus and does not need labor-intensive manual segmentation to adapt an LVCSR system to the target domain

– OOV words can be treated as words

– Proper n-gram probabilities are assigned to OOV words and word sequences containing OOV words


12

Basic Material

• Acoustic Model

– 83 hours of a spontaneous speech corpus
– Phones are represented as context-dependent, three-state left-to-right HMMs
– States are clustered using a phonetic decision tree (2,728 states)
– Each state is modeled with a mixture of 11 Gaussians

• General LM

– Large corpus of a general domain
– A small part of the corpus was segmented by experts
– The rest was segmented automatically by the word segmenter and roughly checked by experts
– 24,442,503 words

• General Lexicon

– 45,402 words


13

Experiments

• Experiments on lectures of the University of the Air

• Domain-specific words that never appear in newspaper articles are often used

• Select 3 lectures for the experiments

– Mainly composed of the textbooks

• Small adaptation corpus: about 20 pages
• Large adaptation corpus: one entire textbook


14

Experiment (cont.)

• Three methods are compared

– Ideal
– Automatic
– Proposed

• OOV words are added to the lexicon

• Evaluation

– Use CER instead of WER

• The reason is that ambiguity exists in Japanese word segmentation

– eWER

• Estimate the WER based on the CER and the average number of characters per word

$\mathrm{eWER} = \{1 - (1 - \mathrm{CER})^{\tilde{n}}\} \times 100$, where CER is the character error rate (as a fraction) and $\tilde{n}$ is the average number of characters per word
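As a numeric illustration of the formula (values invented, not taken from the paper): with a CER of 10% and an average of ñ = 1.6 characters per word, the estimated WER is about 15.5%.

```python
def estimated_wer(cer_percent, avg_chars_per_word):
    # eWER = {1 - (1 - CER)^n~} x 100: a word is counted as correct
    # only if all of its characters are recognized correctly.
    return (1.0 - (1.0 - cer_percent / 100.0) ** avg_chars_per_word) * 100.0

# Illustrative numbers only (not from the paper):
print(estimated_wer(10.0, 1.6))   # ~15.5
```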


15

Experimental Results


16

Conclusions

• Considering these results, the larger the corpora are, the better the proposed method performs

• It is not difficult to collect raw corpora, but it is an expensive and time-consuming task to segment a raw corpus manually

• Proposed a new method, based on stochastic segmentation, to adapt an LVCSR system to a specific domain

• The proposed method allows us to adapt an LVCSR system to various domains in much less time