1 Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus Gakuto KURATA,...
-
Upload
lindsey-austin -
Category
Documents
-
view
215 -
download
0
Transcript of 1 Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus Gakuto KURATA,...
1
Unsupervised Adaptation of a Stochastic Language Model Using a Japanese Raw Corpus
Gakuto KURATA, Shinsuke MORI, Masafumi NISHIMURA
IBM Research, Tokyo Research Laboratory, IBM Japan
Presenter: Hsuan-Sheng Chiu
2
Reference
• S. Mori and D. Takuma, “Word N-gram Probability Estimation From A Japanese Raw Corpus,” in Proc. of ICSLP 2004, pp. 201–207.
• H. Feng, K. Chen, X. Deng, and W. Zheng, “Accessor Variety Criteria for Chinese Word Extraction,” Computational Linguistics, vol. 30, no.1, pp. 75–93, 2004.
• M. Asahara and Y. Matsumoto, “Japanese Unknown Word Identification by Character-based Chunking,” in Proc. of COLING 2004, pp. 459–465.
3
Introduction
• Domain-specific words are likely to characterize their domain, misrecognition of these words causes a severe quality degradation of the LVCSR application
• It has been necessary to segment the target domain’s corpus into words because no space exists between words in Japanese
• A ideal method:– Experts manually segment a corpus of the target domain
– Domain-specific words are added to the lexicon
– The domain-specific LM is built from this correct segmented corpus
• This is not realistic because the target domain will change– A fully automatic method is necessary
4
Proposed method
This method is fully automatic
• 1. segment the raw corpus stochastically
• 2. build a word n-gram model from stochastically segmented corpus
• 3. add probable word into the lexicon to LVCSR
5
Stochastic Segmentation
• Raw corpus: All of the words are concatenated and there is no word boundary information
• Deterministically segmented corpus: corpus with deterministic word boundary
• Stochastically segmented corpus: corpus with word boundary probability
6
Word Boundary Probability
• Estimate probability from a relatively small segmented corpus
• Introduce seven character classes since the number of characters in Japanese is large– Kanji, symbols , Arabic digits, Hiragana, Katakana, Latin characters
(Cyrillic and Greek characters)
• Word Boundary Probability:
11
1
,,,
,,
iisiis
iisi xcxcfxcBTxcf
xcBTxcfp
xxf
BT:
xc
s string the of frequery
corpus segmented in token boundary word
to belongsx which classcharacter
:
:
7
Word N-gram Probability
• The number of word in the corpus
• A character sequence in the raw corpus is treated as a word if and only if there is a word boundary before and after sequence and there is no word boundary inside the sequence– Unigram frequency
– Unigram probability
1
1
1rn
iir pf
wxiOwherepppwf kiiki
Oi
k
jjiir
11
1
1
| ,11
rrg fwfwP1
今 天 天 氣 真 好 0.2 0.1 0.5 0.1 0.5 0.1 0.2
8
Word N-gram Probability (cont.)
• Word-based n-gram model– Probability of word sequence
• Since it is impossible to define the complete vocabulary, use a special token UW for unknown words
• Unknown word spelling is predicted by character-based n-gram model
• Probability of OOV
okenboundary tBT:,,| 10
1
1
111,
h
h
i
inii
hnw wwwwPwM
1'
1
11
'1, |
h
i
inii
hnx xxPxM
11,
11 ||
iniinx
inii wUWPwMwwP
9
Probable Character Strings Added to the Lexicon
• All of the character string appearing in the domain-specific corpus can be treated as words
• However, a lot of meaningless character strings are also included
• Use traditional character-based approach to judge whether or not a character string is appropriate as a word– Accessor Variety Criteria
• Accessor Varieties• Adhesive characters
– Character-based Chunking• POS feature for chunking• SVM-based chunking
10
Accessor Variety Criteria
• We first discard those strings with accessor varieties that are smaller than a certain number
• The remaining strings are considered to be potentially meaningful words. In addition, we apply rules to remove strings that consist of a word and adhesive characters
Example“ 的人們” AV is high, but it’s not a wordh: Head-adhesive charactert: Tail-adhesive character
core: meaningful word
h + core: 的我core + t: 我的h + core + t: 的過程是should be discarded
Rule-based discarding
Example門把手弄壞了小明修好了門把手這個門把手很漂亮這個門把手壞了小明弄壞門把手
Prefix of 門把手 (left AV): S, 了 , 個 , 壞 Suffix of 門把手 (right AV): 弄 , E, 很 , 壞 了 ,E
Distinct words countedS, E repeatedly countedAV=min{ left AV, right AV}=4
11
Summary of the proposed method
• as regards time, the proposed method has an advantage, because it only requires a raw corpus and doesn’t need labor-intensive manual segmentation to adapt an LVCSR system to the target domain– OOV words can be treated as words– Proper n-gram probability are assigned to OOV words and word
sequences containing OOV words
12
Basic Material
• Acoustic Model– 83 hours from spontaneous speech corpus– Phones are represented as context-dependent, three state left-to-right H
MMs – State are clustered using phonetic decision tree (2728) – Each state is modeled using 11 mixture of Gaussians
• General LM– Large corpus of a general domain – A small part of the corpus was segmented by experts– The rest was segmented automatically by the word segmenter and roug
hly checked by experts– 24,442,503 words
• General Lexicon– 45402 words
13
Experiments
• Experiment on lectures of the University of Air
• Domain-specific words which never appear in newspaper articles are often used
• Select 3 lectures for the experiments– Mainly composed of the textbooks
Small: about 20 pagesLarge: one entire textbook
14
Experiment (cont.)
Three methods are compared– Ideal – Automatic– Proposed
• OOV are added to the Lexicon
• Evaluation– Use CER instead of WER
• The reason is that in Japanese ambiguity exists in word segmentation
– eWER• Estimate WER based on CER and the average number of character
per one word
10011~
xCEReWER n
009518655415355819792172.35183764n~
15
Experimental Results
16
Conclusions
• Considering this result, the larger corpora are, the better the performances are with the proposed method
• It is not difficult to collect raw corpora but it will be an expensive and time-consuming task to manually segment a raw corpus
• Propose a new method to adapt an LVCSR system to specific domain based on stochastic segmentation
• The proposed method allows us to adapt LVCSR to various domains in much less time