Exploring Web Scale Language Models for Search Query Processing · 2016-01-09 · N-gram LM at...
Exploring Web Scale Language Models for Search Query Processing
Jianfeng Gao, MSR (joint work with Jian Huang, Jiangbo Miao, Xiaolong Li, Kuansan Wang, and Fritz Behr)
Outline
• N-gram language model (LM) ABC
• N-gram LM at Microsoft
– Bing-It-On-Ngram
• Building Web scale N-gram LM
• Three search query processing tasks
– Query spelling correction
– Query bracketing
– Long query segmentation
• Conclusion
Word n-gram model
• Compute the probability of a word string using the chain rule on its history (the preceding words)

P(the dog of our neighbor barks) = P(the | <s>) × P(dog | <s>, the) × P(of | <s>, the, dog) × ... × P(barks | <s>, the, dog, of, our, neighbor) × P(</s> | <s>, the, dog, of, our, neighbor, barks)

P(w1, w2 ... wn) = P(w1 | <s>) × P(w2 | <s> w1) × P(w3 | <s> w1 w2) × ... × P(wn | <s> w1 w2 ... wn-1) × P(</s> | <s> w1 w2 ... wn)
Word n-gram model
• Markov independence assumption
– A word depends only on the N-1 preceding words
– N = 3 → word trigram model
• Reduce the number of parameters in the model
– By forming equivalence classes
• Word trigram model

P(wi | <s> w1 w2 ... wi-2 wi-1) = P(wi | wi-2 wi-1)

P(w1, w2 ... wn) = P(w1 | <s>) × P(w2 | <s> w1) × P(w3 | w1 w2) × ... × P(wn | wn-2 wn-1) × P(</s> | wn-1 wn)
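The trigram factorization above can be sketched in a few lines of Python. The probability values here are toy numbers invented for illustration, not from any real model:

```python
from collections import defaultdict

# Toy trigram probabilities (hypothetical values for illustration only).
trigram_prob = defaultdict(lambda: 1e-6)
trigram_prob.update({
    ("<s>", "<s>", "the"): 0.1,
    ("<s>", "the", "dog"): 0.05,
    ("the", "dog", "barks"): 0.2,
    ("dog", "barks", "</s>"): 0.4,
})

def sentence_prob(words):
    """P(w1..wn) under a trigram model: each word is conditioned
    only on the two preceding words (Markov assumption)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob[(padded[i - 2], padded[i - 1], padded[i])]
    return p

print(sentence_prob(["the", "dog", "barks"]))  # 0.1 * 0.05 * 0.2 * 0.4
```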
Example: input method editor (IME)
• Software to convert keystrokes (Pinyin) to text output
[Figure: IME conversion lattice for a Pinyin keystroke sequence, showing candidate characters and words at each position (e.g., ma fan → 麻烦, yi xia → 一下, zhe ge → 这个, fang fa → 方法); the best-scoring path reads 麻烦 你 try 一下 这个 方法.]
LM Evaluation
• Perplexity – quality of LM
– Geometric average inverse probability
– Branching factor of a doc: predicting power of the LM
– Lower perplexity is better
– Character perplexity for Chinese/Japanese
• Better to use task-specific evaluation, e.g., character error rate (CER) – quality of IME
– Test set (A, W*)
– CER = edit distance between the converted W and the reference W*
– Correlation with perplexity?
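The geometric-average inverse probability can be sketched directly; computing in log space avoids floating-point underflow on long documents:

```python
import math

def perplexity(probs):
    """Perplexity = (prod p_i)^(-1/N), the geometric-average
    inverse probability, computed in log space."""
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)

# A model that assigns every token probability 1/8 has perplexity 8:
# a "branching factor" of 8 equally likely continuations.
print(perplexity([1 / 8] * 100))  # ≈ 8
```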
MLE for trigram LM
• PML(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
• PML(w2 | w1) = Count(w1 w2) / Count(w1)
• PML(w) = Count(w) / N
• It is easy – let us get real text and start counting
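A minimal sketch of "get real text and start counting" in Python, using relative frequencies for the trigram estimates:

```python
from collections import Counter

def mle_trigram(tokens):
    """Relative-frequency (MLE) estimates:
    P(w3 | w1, w2) = Count(w1 w2 w3) / Count(w1 w2)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return {t: c / bi[t[:2]] for t, c in tri.items()}

toks = "the dog barks the dog sleeps".split()
p = mle_trigram(toks)
# Count(the dog barks) = 1, Count(the dog) = 2
print(p[("the", "dog", "barks")])  # → 0.5
```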
But why is this the MLE solution?
The derivation of MLE for N-gram
• Homework
• Hints
– This is a constrained optimization problem
– Use the log likelihood as the objective function
– Assume a multinomial distribution for the LM
– Introduce a Lagrange multiplier for the constraints
• ∑x∈X P(x) = 1, and P(x) ≥ 0
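One way to work the hinted homework (a sketch following the hints, not the slide's own derivation), writing c(x) for the count of event x:

```latex
\text{Maximize } \ell = \sum_x c(x)\log P(x)
\quad\text{subject to}\quad \sum_x P(x) = 1.
\]
\[
\mathcal{L} = \sum_x c(x)\log P(x) + \lambda\Bigl(1 - \sum_x P(x)\Bigr),
\qquad
\frac{\partial \mathcal{L}}{\partial P(x)} = \frac{c(x)}{P(x)} - \lambda = 0
\;\Rightarrow\; P(x) = \frac{c(x)}{\lambda}.
```

Substituting into the constraint gives λ = ∑x c(x) = N, hence PML(x) = c(x)/N, the relative-frequency estimate.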
Sparse data problems
• Say our vocabulary size is |V|
• There are |V|³ parameters in the trigram LM
– |V| = 20,000 → 20,000³ = 8 × 10¹² parameters
• Most trigrams have a zero count even in a large text corpus
– Count(w1 w2 w3) = 0
– PML(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2) = 0
– P(W) = PML(w1) PML(w2 | w1) ∏i PML(wi | wi-2 wi-1) = 0
– W* = argmaxW P(A|W) P(W) = … oops
Smoothing: backoff
• Back off from trigram to bigram, and from bigram to unigram
• D ∈ (0, 1) is a discount constant – absolute discounting
• α is calculated so that the probabilities sum to 1 (homework)
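A sketch of absolute-discount backoff for the bigram → unigram case (D would normally be tuned on held-out data; the value here is arbitrary):

```python
from collections import Counter

def backoff_bigram(tokens, D=0.5):
    """Absolute-discount backoff, bigram -> unigram:
      P(w2|w1) = (Count(w1 w2) - D) / Count(w1)   if Count(w1 w2) > 0
               = alpha(w1) * P(w2)                 otherwise
    alpha(w1) redistributes the discounted mass so that the
    distribution over w2 sums to 1 for every history w1."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def p_uni(w):
        return uni[w] / n

    def p(w2, w1):
        if bi[(w1, w2)] > 0:
            return (bi[(w1, w2)] - D) / uni[w1]
        # Mass freed by discounting, spread over unseen continuations.
        seen = {v for (a, v) in bi if a == w1}
        reserved = D * len(seen) / uni[w1]
        unseen_mass = 1.0 - sum(p_uni(v) for v in seen)
        alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
        return alpha * p_uni(w2)

    return p

p = backoff_bigram("a b a b a c".split())
# Probabilities over the vocabulary sum to 1 for any history:
print(sum(p(w, "a") for w in {"a", "b", "c"}))  # ≈ 1.0
```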
Smoothing: improved backoff
• Allow D to vary
– Different D's for different N-gram orders
– The value of D as a function of Count(·)
– Modified absolute discounting
• Optimize the D's on dev data using, e.g., Powell search
• Kneser-Ney smoothing: use word-type probabilities rather than token probabilities for the backoff distribution
What is the best smoothing?
• It varies from task to task
– Chen and Goodman (1999) give a very thorough evaluation and description of a number of methods
• My favorite smoothing methods
– Modified absolute discounting (MAD; Gao et al., 2001)
• Simple to implement and use
• Good performance across many tasks, e.g., IME, SMT, ASR, speller
– Interpolated Kneser-Ney
• Recommended by Chen and Goodman (1999)
• Only slightly better than MAD on SMT (more expensive to train, though)
N-gram LM at Microsoft
• Contextual speller in WORD
– 1-5 MB trigram
– LM compression (Church et al. 2007)
• Chinese ASR and IME
– 20-100 MB trigram
– Training data selection and LM compression (Gao et al. 2002a)
• Japanese IME
– 30-60 MB trigram
– Captures language structure – headword trigram (Gao et al. 2002b)
• MS-SMT (MSRLM)
– 1-20 GB tri/four-gram
– LM compression, training data selection, runtime (client/server)
• Bing Search, e.g., query speller/segmentation, ranking
– Terabyte 5-gram
– Model building and runtime via a cloud-based platform
LM research
• Research communities (speech/IR/NLP)
– Make the LM smarter via
• better smoothing
• word classes
• capturing linguistic structure, etc.
• Industry: data is smarter!
– Make the LM simpler and more scalable, e.g.,
• Google's "stupid backoff" model
• "Don't do research until you run out of data" (Eric Brill)
• Bridge the gap between academic and industry research
– Bing-It-On-Ngram service hosted by Microsoft (http://research.microsoft.com/web-ngram)
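Google's scheme (Brants et al. 2007) illustrates the "simpler and more scalable" point: no discounting at all, just unnormalized scores with a fixed backoff penalty. A sketch:

```python
from collections import Counter

def stupid_backoff(tokens, alpha=0.4):
    """'Stupid backoff': returns unnormalized scores S rather than
    true probabilities -- no discounting, just a fixed penalty alpha
    at each backoff step. Cheap to build at web scale."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    n = len(tokens)

    def score(w3, w1, w2):
        if tri[(w1, w2, w3)] > 0:
            return tri[(w1, w2, w3)] / bi[(w1, w2)]
        if bi[(w2, w3)] > 0:
            return alpha * bi[(w2, w3)] / uni[w2]
        return alpha * alpha * uni[w3] / n

    return score

s = stupid_backoff("the dog barks the cat sleeps".split())
print(s("barks", "the", "dog"))   # seen trigram: relative frequency
print(s("sleeps", "the", "dog"))  # backs off through bigram to unigram
```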
Outline
• N-gram language model (LM) ABC
• N-gram LM at Microsoft
– Bing-It-On-Ngram
• Building Web scale N-gram LM
• Three search query processing tasks
– Query spelling correction
– Query bracketing
– Long query segmentation
• Conclusion
Building Web scale n-gram LM
• Everyone can count. Why is it so difficult?
• Count all bigrams of a given text
– Input: tokenized text
– Output: a list of <bigram, freq> pairs
• Counting algorithms
– Use a hash if the text is small
– Sort and merge (used in most SLM toolkits)
• Can be very slow on large amounts of text data
• Probability/backoff estimation often requires sorting the n-grams in different orders
– This is why a KN-smoothed LM is expensive to train
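The sort-and-merge approach can be sketched as follows: emit one record per bigram, sort, then merge equal neighbors in a single pass. Here the sort is in memory; at scale it becomes an external merge sort over disk files:

```python
from itertools import groupby

def count_bigrams(tokens):
    """Sort-and-merge bigram counting, as in most SLM toolkits:
    sorting brings identical bigrams together, so one linear scan
    over the sorted records produces the <bigram, freq> list."""
    records = sorted(zip(tokens, tokens[1:]))
    return [(bg, sum(1 for _ in grp)) for bg, grp in groupby(records)]

print(count_bigrams("to be or not to be".split()))
```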
A cloud-based n-gram platform
Cloud infrastructure
• Build programs using a scripting language
– SQL-like language, easy to use
– User-defined functions (operators) written in C#
– Maps "serial" code to a "parallel" execution plan automatically
Example: n-gram count in cloud
[Figure: map-reduce-style execution plan for n-gram counting. Each of N nodes tokenizes its web pages, parses them, and counts n-grams into a local hash; a recursive reducer merges the per-node counts into the final output.]
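The node-plus-reducer plan can be sketched in miniature: each "node" counts its shard into a local hash, and a reducer merges the tables. The shard text here is made up for illustration:

```python
from collections import Counter
from functools import reduce

# Each node tokenizes its shard of web pages and counts bigrams
# into a local hash (a Counter); a reducer merges the tables.
shards = ["the dog barks", "the dog sleeps", "the cat sleeps"]

def local_count(text):
    toks = text.split()
    return Counter(zip(toks, toks[1:]))

merged = reduce(lambda a, b: a + b, map(local_count, shards))
print(merged[("the", "dog")])  # → 2
```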
Script language: 5-gram counting
Web pages
• A Web page is a multi-field text
– Content fields: URL/Title/Body
– Popularity fields: anchor/query-click (Gao et al. 2009)
Web scale n-gram models (updated)
Perplexity results on a query set (updated)
• Query/anchor/content are different languages
• The Web corpus is an aligned multi-lingual corpus
Outline
• N-gram language model (LM) ABC
• N-gram LM at Microsoft
– Bing-It-On-Ngram
• Building Web scale N-gram LM
• Three search query processing tasks
– Query spelling correction
– Query bracketing
– Long query segmentation
• Conclusion
Query Spelling Correction
• What does speller do?
• How does it work?
• What is the role of LM?
• Results
What does speller do
• Provide suggestions for a misspelled query
http://www.bing.com/search?q=new+years+dances+srilanka&form=QBRE&qs=n
http://www.bing.com/search?q=yetisport&form=QBRE&qs=n
What does speller do
• Alter the original query to improve relevance
http://www.bing.com/search?q=bmw+competetion&form=QBRE&qs=n
WORD Speller vs. Bing Speller
• 1990's: spellers built by hand
– Dictionaries + heuristic rules used to identify misspellings
– Typically suggest only words sanctioned by the dictionary
– No suggestions for unknown words (e.g., names, new words)
– Runs client-side
• 2010: spellers learned from user-generated data
– Search query speller
• Spelling modeled as a statistical translation problem: translate bad queries into good queries
• Models trained on billions of query-suggestion pairs from search logs
– Correct suggestions learned automatically from the data
– Runs on a cluster: large models provide better suggestions
How query speller works
[Figure: ranker-based speller pipeline.
– Input query q: "for eggsample"
– Candidate generators: Speller A (edit distance, e.g., examplw → example), Speller B (phonetic mistakes, eggsample → example), Speller C (word breaker, eggsample → egg sample)
– Candidates GEN(q): t1 = "for eggsample", t2 = "for egg sample", t3 = "for example", t4 = "for eggs ample", ...
– Feature extractor: f0 = N-gram prob., f1 = length, f2 = ED_Bin, fi = ...
– Ranker: Score(q, t) = λ·f(q, t)
– Ranking results: "for example" 0.23, "for egg sample" 0.06, "for eggs ample" 0.03, "for eggsample" 0.01, ...]
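The linear scoring rule Score(q, t) = λ·f(q, t) is just a dot product of a learned weight vector with each candidate's feature vector. A sketch with invented feature values and weights:

```python
# Score(q, t) = lambda . f(q, t): dot product of learned weights
# with a candidate's feature vector. All numbers are invented
# for illustration, not real speller parameters.
def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

weights = [1.0, -0.2, 0.5]             # the weight vector, learned offline
candidates = {
    "for example":   [0.9, 2.0, 1.0],  # f0 n-gram prob, f1 length, f2 ED bin
    "for eggsample": [0.1, 2.0, 0.0],
}
best = max(candidates, key=lambda t: score(weights, candidates[t]))
print(best)  # → "for example"
```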
Speller workflow: Candidate Generation → Path Filtering → Ranking → Auto-correction
Generate candidates for each token in the query:
– Typographic generator
– Phonetic generator
– Word-break generator
– Concatenation generator
{Britnay Spears Vidios}
Speller workflow: Path Filtering
Use a small bigram model to pick the 20 best paths.
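Picking the k best paths through the candidate lattice with a bigram model amounts to a beam search. A sketch with invented bigram scores (log probabilities):

```python
import math

# Beam search over per-token candidate lists, scored with a bigram
# model, keeping the k best paths. The bigram table is invented
# for illustration; unseen bigrams get a small floor probability.
def bigram_logp(w1, w2, table):
    return table.get((w1, w2), math.log(1e-6))

def k_best_paths(candidate_lists, table, k=20):
    beams = [([w], bigram_logp("<s>", w, table)) for w in candidate_lists[0]]
    for cands in candidate_lists[1:]:
        beams = sorted(
            ((path + [w], lp + bigram_logp(path[-1], w, table))
             for path, lp in beams for w in cands),
            key=lambda x: -x[1])[:k]
    return beams

table = {("<s>", "britney"): math.log(0.2),
         ("britney", "spears"): math.log(0.5),
         ("spears", "videos"): math.log(0.3)}
cands = [["britnay", "britney"], ["spears"], ["vidios", "videos"]]
print(k_best_paths(cands, table, k=3)[0][0])
```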
Speller workflow: Ranking
Extract ~200 features and return the path with the highest score.
Speller workflow: Auto-correction
Determine whether we should alter the original query.
Original query = britnay spears vidios
Altered query = word:(britnay britney) spears word:(vidios videos)
Roles of LM in a ranker-based speller
• A light decoder
– Uses only a small bigram model trained on query logs
– Runs Viterbi/A* to produce the top-20 suggestions
– A component of the candidate generator
• A heavy ranker
– Feature extraction
• Derived from the models that generate the candidates
• Additional features (defined by domain experts)
– (Non-)linear model ranker (with complex features)
• Uses ~200 features, including
• 4 Web scale 4-gram language models (amounting to 2 terabytes)
Search Query Speller Accuracy
[Figure: accuracy on a 30K-query test set (y-axis 83-95%) for seven systems:
1. Noisy-channel model trained on query logs
2. Ranker-based speller trained on query/session logs
3. 2 + word-based translation model trained on session logs
4. 3 + phrase-based translation model trained on 1-m session logs
6. 3 + phrase-based translation model trained on 3-m session logs
7. 6 + TB language models trained on the Web collection]
Query Bracketing
• Task: given a three-word NP, determine the sub-NP structure as either left or right bracketing
• Method: compare the word associations between w1w2 and w2w3 (or between w1w2 and w1w3)
• Word association metrics
– PMI based on raw counts or smoothed probabilities
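The comparison can be sketched with plain PMI over raw counts; the counts below are invented for illustration:

```python
import math

# PMI(x, y) = log P(x, y) / (P(x) P(y)), estimated from counts.
# If PMI(w1, w2) > PMI(w2, w3), bracket left: ((w1 w2) w3);
# otherwise bracket right: (w1 (w2 w3)).
def pmi(c_xy, c_x, c_y, n):
    return math.log((c_xy / n) / ((c_x / n) * (c_y / n)))

def bracket(c12, c23, c1, c2, c3, n):
    left = pmi(c12, c1, c2, n)
    right = pmi(c23, c2, c3, n)
    return "left" if left > right else "right"

# e.g. "world oil prices": the strong w2-w3 association
# (oil prices) yields right bracketing.
print(bracket(c12=50, c23=400, c1=1000, c2=800, c3=900, n=10**6))
```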
Results
Long Query Segmentation
• Task
• Method: best-first search based on SPMI
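A best-first segmentation loop can be sketched as follows: repeatedly merge the adjacent pair of segments with the highest association score until no pair clears a threshold. The score here is a plain PMI table with invented values; the talk's SPMI is a variant whose exact definition is not given in this transcript:

```python
# Best-first segmentation sketch: greedily merge the highest-scoring
# adjacent pair of segments until no score exceeds the threshold.
def best_first_segment(words, pair_score, threshold=0.0):
    segs = [[w] for w in words]
    while len(segs) > 1:
        scores = [pair_score(segs[i][-1], segs[i + 1][0])
                  for i in range(len(segs) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        if scores[i] <= threshold:
            break
        segs[i:i + 2] = [segs[i] + segs[i + 1]]
    return [" ".join(s) for s in segs]

# Invented association scores for illustration.
PMI = {("new", "york"): 5.2, ("york", "hotel"): 0.8, ("hotel", "cheap"): -1.0}
seg = best_first_segment("new york hotel cheap".split(),
                         lambda a, b: PMI.get((a, b), -5.0))
print(seg)  # → ['new york hotel', 'cheap']
```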
Results
Conclusion
• A Web page is a multi-field text
– The Web corpus is an aligned multi-lingual corpus
• We can build large and smart models
• The performance of a LM depends on
– Language (style), model size, model order, and smoothing
• The Web as a baseline for IR/NLP research
– Bing-It-On-Ngram