
1

Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus

Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA

IBM Research, Tokyo Research Laboratory, IBM Japan

Presenter: Hsuan-Sheng Chiu


2

References

• S. Mori and D. Takuma, “Word N-gram Probability Estimation From A Japanese Raw Corpus,” in Proc. of ICSLP 2004, pp. 201–207.

• H. Feng, K. Chen, X. Deng, and W. Zheng, “Accessor Variety Criteria for Chinese Word Extraction,” Computational Linguistics, vol. 30, no.1, pp. 75–93, 2004.

• M. Asahara and Y. Matsumoto, “Japanese Unknown Word Identification by Character-based Chunking,” in Proc. of COLING 2004, pp. 459–465.


3

Introduction

• Domain-specific words tend to characterize their domain; misrecognition of these words causes severe quality degradation in LVCSR applications

• It has been necessary to segment the target domain’s corpus into words because no space exists between words in Japanese

• An ideal method:

– Experts manually segment a corpus of the target domain

– Domain-specific words are added to the lexicon

– The domain-specific LM is built from this correctly segmented corpus

• This is not realistic because the target domain will change

– A fully automatic method is necessary


4

Proposed method

This method is fully automatic

• 1. Segment the raw corpus stochastically

• 2. Build a word n-gram model from the stochastically segmented corpus

• 3. Add probable words to the lexicon of the LVCSR system


5

Stochastic Segmentation

• Raw corpus: All of the words are concatenated and there is no word boundary information

• Deterministically segmented corpus: corpus with deterministic word boundaries

• Stochastically segmented corpus: corpus with word boundary probabilities


6

Word Boundary Probability

• Estimate probability from a relatively small segmented corpus

• Introduce seven character classes since the number of characters in Japanese is large

– Kanji, symbols, Arabic digits, Hiragana, Katakana, Latin characters, and Cyrillic/Greek characters

• Word Boundary Probability:

$p_i = \frac{f(c(x_i)\,\mathrm{BT}\,c(x_{i+1}))}{f(c(x_i)\,c(x_{i+1}))}$

where

– $c(x)$: the character class to which $x$ belongs

– $\mathrm{BT}$: the word boundary token in the segmented corpus

– $f(s)$: the frequency of the string $s$ in the segmented corpus
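To make this concrete, below is a minimal Python sketch (not the authors' code) that estimates these character-class boundary statistics from a tiny hand-segmented seed corpus; the class inventory is simplified and the seed sentences are invented for illustration.

```python
import unicodedata
from collections import defaultdict

def char_class(ch):
    # Simplified stand-in for the paper's seven classes (Kanji, symbols,
    # Arabic digits, Hiragana, Katakana, Latin, Cyrillic/Greek).
    name = unicodedata.name(ch, "")
    if "CJK UNIFIED" in name:
        return "kanji"
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    if ch.isdigit():
        return "digit"
    if "LATIN" in name:
        return "latin"
    return "symbol"

def boundary_probabilities(segmented_sentences):
    # Count f(c(x_i) c(x_{i+1})) and f(c(x_i) BT c(x_{i+1})) over a small
    # seed corpus given as lists of words (the manual segmentation).
    pair = defaultdict(int)
    pair_bt = defaultdict(int)
    for words in segmented_sentences:
        chars = "".join(words)
        bounds, pos = set(), 0
        for w in words[:-1]:              # character offsets of word boundaries
            pos += len(w)
            bounds.add(pos)
        for i in range(len(chars) - 1):
            key = (char_class(chars[i]), char_class(chars[i + 1]))
            pair[key] += 1
            if i + 1 in bounds:
                pair_bt[key] += 1
    # p_i = f(c(x_i) BT c(x_{i+1})) / f(c(x_i) c(x_{i+1}))
    return {k: pair_bt[k] / pair[k] for k in pair}

# Toy seed corpus: two segmented sentences (invented for illustration)
seed = [["今日", "は", "晴れ"], ["明日", "も", "晴れ"]]
print(boundary_probabilities(seed))
```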


7

Word N-gram Probability

• The expected number of words in the raw corpus $x_1 x_2 \cdots x_{n_r}$:

$f_r = \sum_{i=1}^{n_r - 1} p_i + 1$

• A character sequence in the raw corpus is treated as a word if and only if there is a word boundary before and after the sequence and there is no word boundary inside the sequence

– Unigram frequency:

$f_r(w) = \sum_{i \in O_w} p_{i-1} \Bigl( \prod_{j=1}^{k-1} (1 - p_{i+j-1}) \Bigr) p_{i+k-1}$, where $O_w = \{\, i \mid x_i^{i+k-1} = w \,\}$ is the set of positions at which the $k$-character string $w$ occurs

– Unigram probability:

$P(w) = f_r(w) / f_r$

Example (word boundary probabilities between and around characters): 0.2 今 0.1 天 0.5 天 0.1 氣 0.5 真 0.1 好 0.2
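The expected counts can be reproduced with a short self-contained sketch, reusing the illustrative boundary probabilities from the example line above (all values are for illustration only, not from the paper).

```python
def expected_word_frequency(chars, p, word):
    # p[i] is the probability of a word boundary immediately before
    # chars[i]; p[len(chars)] is the boundary at the end of the string.
    k, total = len(word), 0.0
    for i in range(len(chars) - k + 1):
        if chars[i:i + k] != word:        # i ranges over O_w
            continue
        prob = p[i] * p[i + k]            # boundary before and after w
        for j in range(1, k):             # no boundary inside w
            prob *= 1.0 - p[i + j]
        total += prob
    return total

chars = "今天天氣真好"
p = [0.2, 0.1, 0.5, 0.1, 0.5, 0.1, 0.2]   # 7 boundary probabilities
f_r = sum(p[1:-1]) + 1.0                   # expected number of words
f_w = expected_word_frequency(chars, p, "天氣")
print(f_w, f_w / f_r)                      # unigram frequency and probability
```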


8

Word N-gram Probability (cont.)

• Word-based n-gram model

– Probability of a word sequence:

$M_{w,n}(\boldsymbol{w}_1^h) = \prod_{i=1}^{h+1} P(w_i \mid \boldsymbol{w}_{i-n+1}^{i-1})$, where $w_0 = w_{h+1} = \mathrm{BT}$ (BT: boundary token)

• Since it is impossible to define the complete vocabulary, a special token UW is used for unknown words

• The spelling of an unknown word is predicted by a character-based n-gram model:

$M_{x,n'}(\boldsymbol{x}_1^{h'}) = \prod_{i=1}^{h'+1} P(x_i \mid \boldsymbol{x}_{i-n'+1}^{i-1})$

• Probability of an OOV word:

$P(w_i \mid \boldsymbol{w}_{i-n+1}^{i-1}) = M_{x,n'}(w_i)\, P(\mathrm{UW} \mid \boldsymbol{w}_{i-n+1}^{i-1})$
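Below is a toy sketch of how the word and character models combine for an OOV word. For brevity both models are reduced to unigrams (the paper conditions on the preceding n-1 words), and every probability in the tables is invented.

```python
# Invented unigram tables; <UW> is the unknown-word token.
word_model = {"天氣": 0.10, "好": 0.05, "<UW>": 0.01}
char_model = {"真": 0.02, "好": 0.03}

def word_probability(w):
    if w in word_model:                    # in-vocabulary word
        return word_model[w]
    # OOV: spell the word with the character model, weighted by P(<UW>)
    spelling = 1.0
    for ch in w:
        spelling *= char_model.get(ch, 1e-6)
    return spelling * word_model["<UW>"]

print(word_probability("天氣"))  # known word: 0.1
print(word_probability("真好"))  # OOV: 0.02 * 0.03 * 0.01 = 6e-06
```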


9

Probable Character Strings Added to the Lexicon

• All of the character strings appearing in the domain-specific corpus could be treated as words

• However, a lot of meaningless character strings are also included

• Use traditional character-based approaches to judge whether or not a character string is appropriate as a word

– Accessor Variety criteria

• Accessor varieties
• Adhesive characters

– Character-based chunking

• POS features for chunking
• SVM-based chunking


10

Accessor Variety Criteria

• We first discard those strings with accessor varieties that are smaller than a certain number

• The remaining strings are considered to be potentially meaningful words. In addition, we apply rules to remove strings that consist of a word and adhesive characters

• Rule-based discarding: e.g., the AV of “的人們” is high, but it is not a word

– h: head-adhesive character
– t: tail-adhesive character
– core: meaningful word

Strings of the following patterns should be discarded:

– h + core: 的我
– core + t: 我的
– h + core + t: 的過程是

• AV example: corpus of five sentences

門把手弄壞了 / 小明修好了門把手 / 這個門把手很漂亮 / 這個門把手壞了 / 小明弄壞門把手

– Predecessors of 門把手 (left AV): S, 了, 個, 壞
– Successors of 門把手 (right AV): 弄, E, 很, 壞, E

Distinct characters are counted once; S (sentence start) and E (sentence end) are counted repeatedly

AV = min{left AV, right AV} = 4
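The AV computation on this example can be reproduced with a short sketch; following the slide's convention, distinct accessor characters are counted once while the sentence-edge markers S and E are counted per occurrence.

```python
def accessor_variety(sentences, s):
    left, right = [], []
    for sent in sentences:
        start = 0
        while (i := sent.find(s, start)) != -1:
            left.append(sent[i - 1] if i > 0 else "S")
            j = i + len(s)
            right.append(sent[j] if j < len(sent) else "E")
            start = i + 1

    def variety(accessors):
        # Distinct characters count once; each S/E occurrence counts.
        distinct = {a for a in accessors if a not in ("S", "E")}
        return len(distinct) + sum(a in ("S", "E") for a in accessors)

    return min(variety(left), variety(right))

sents = ["門把手弄壞了", "小明修好了門把手", "這個門把手很漂亮",
         "這個門把手壞了", "小明弄壞門把手"]
print(accessor_variety(sents, "門把手"))  # min(4, 5) = 4, as in the example
```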


11

Summary of the proposed method

• As regards time, the proposed method has an advantage because it only requires a raw corpus and does not need labor-intensive manual segmentation to adapt an LVCSR system to the target domain

– OOV words can be treated as words

– Proper n-gram probabilities are assigned to OOV words and word sequences containing OOV words


12

Basic Material

• Acoustic Model

– 83 hours of a spontaneous speech corpus
– Phones are represented as context-dependent, three-state left-to-right HMMs
– States are clustered using a phonetic decision tree (2,728 states)
– Each state is modeled with a mixture of 11 Gaussians

• General LM

– Large corpus of a general domain
– A small part of the corpus was segmented by experts
– The rest was segmented automatically by the word segmenter and roughly checked by experts
– 24,442,503 words

• General Lexicon

– 45,402 words


13

Experiments

• Experiments on lectures of the University of the Air

• Domain-specific words that never appear in newspaper articles are often used

• Select 3 lectures for the experiments

– Mainly composed of the textbooks

• Small adaptation corpus: about 20 pages
• Large adaptation corpus: one entire textbook


14

Experiment (cont.)

• Three methods are compared

– Ideal
– Automatic
– Proposed

• OOV words are added to the lexicon

• Evaluation

– Use CER instead of WER

• The reason is that ambiguity exists in Japanese word segmentation

– eWER

• Estimate the WER based on the CER and the average number of characters per word

$\mathrm{eWER} = \{1 - (1 - \mathrm{CER})^{\tilde{n}}\} \times 100$, where CER is the character error rate (as a fraction) and $\tilde{n}$ is the average number of characters per word
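As a numeric illustration of the formula (values invented, not taken from the paper): with a CER of 10% and an average of ñ = 1.6 characters per word, the estimated WER is about 15.5%.

```python
def estimated_wer(cer_percent, avg_chars_per_word):
    # eWER = {1 - (1 - CER)^n~} x 100: a word is counted as correct
    # only if all of its characters are recognized correctly.
    return (1.0 - (1.0 - cer_percent / 100.0) ** avg_chars_per_word) * 100.0

# Illustrative numbers only (not from the paper):
print(estimated_wer(10.0, 1.6))   # ~15.5
```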


15

Experimental Results


16

Conclusions

• Considering these results, the larger the corpora are, the better the proposed method performs

• It is not difficult to collect raw corpora, but it is an expensive and time-consuming task to segment a raw corpus manually

• Proposed a new method, based on stochastic segmentation, to adapt an LVCSR system to a specific domain

• The proposed method allows us to adapt an LVCSR system to various domains in much less time