Exploring Web Scale Language Models for Search Query Processing · 2016-01-09 · N-gram LM at...

Exploring Web Scale Language Models for Search Query Processing Jianfeng Gao, MSR (Joint work with Jian Huang, Jiangbo Miao, Xiaolong Li, Kuansan Wang and Fritz Behr)


Outline

• N-gram language model (LM) ABC

• N-gram LM at Microsoft
  – Bing-It-On-Ngram

• Building Web scale N-gram LM

• Three search query processing tasks
  – Query spelling correction
  – Query bracketing
  – Long query segmentation

• Conclusion

Word n-gram model

• Compute the probability of a word string using the chain rule on its history (= preceding words)

P(the dog of our neighbor barks) = P(the | <s>) × P(dog | <s>, the) × P(of | <s>, the, dog) × ... × P(barks | <s>, the, dog, of, our, neighbor) × P(</s> | <s>, the, dog, of, our, neighbor, barks)

P(w1, w2 ... wn) = P(w1 | <s>) × P(w2 | <s> w1) × P(w3 | <s> w1 w2) × ... × P(wn | <s> w1 w2 ... wn-1) × P(</s> | <s> w1 w2 ... wn)

Word n-gram model

• Markov independence assumption
  – A word depends only on the N-1 preceding words
  – N=3 → word trigram model

• Reduce the number of parameters in the model
  – By forming equivalence classes

• Word trigram model

P(wi | <s> w1 w2 ... wi-2 wi-1) = P(wi | wi-2 wi-1)

P(w1, w2 ... wn) = P(w1 | <s>) × P(w2 | <s> w1) × P(w3 | w1 w2) × ... × P(wn | wn-2 wn-1) × P(</s> | wn-1 wn)

Example: input method editor (IME)

• Software to convert keystrokes (Pinyin) to text output

[Figure: candidate lattice for the keystroke string "mafangnitryyixoazegefanfa". Single-syllable candidates include ma (马, 妈, 麻), fang (方), ni (你), yi (一), xia (下, 夏), ze (则), ge (个), fan (反), fa (发, 法), yu (与), zhe (这), ti (替), zeng (增) and e (饿); multi-syllable candidates include nitu (泥土), ma fan (麻烦), yi xia (一下, 以下), zhe ge (这个) and fang fa (方法). Path shown as output: ma fan (麻烦), ni (你), try, yi xia (一下), zhe ge (这个), fang fa (方法).]

LM Evaluation

• Perplexity – quality of LM
  – Geometric average inverse probability
  – Branching factor of a doc: predicting power of LM
  – Lower perplexities are better
  – Character perplexity for Chinese/Japanese

• Better to use task-specific evaluation, e.g., character error rate (CER) – quality of IME
  – Test set (A, W*)
  – CER = edit distance between converted W and W*
  – Correlation with perplexity?
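The two metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code behind the talk: the function names are hypothetical, perplexity is computed from natural-log token probabilities, and normalising the edit distance by the reference length is an assumption (the slide only says CER is the edit distance between the converted W and W*).

```python
import math


def perplexity(log_probs):
    """Geometric average inverse probability: exp(-mean of per-token log p)."""
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))


def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(converted, reference):
    """Edit distance divided by reference length (normalisation is an assumption)."""
    return edit_distance(converted, reference) / len(reference)
```

A model that assigns probability 1/4 to every token has perplexity 4, which matches the "branching factor" reading of the metric.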


MLE for trigram LM

• PML(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)
• PML(w2 | w1) = Count(w1 w2) / Count(w1)
• PML(w) = Count(w) / N
• It is easy – let us get real text and start counting

But why is this the MLE solution?
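"Start counting" can be made concrete with a small sketch. It assumes sentences arrive as lists of tokens and pads each one with two <s> symbols so that every word has a two-word history; the padding scheme and the function names are illustrative choices, not taken from the slides.

```python
from collections import Counter


def train_trigram_mle(sentences):
    """Collect trigram counts and the counts of their two-word histories."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            hist[tuple(padded[i - 2:i])] += 1
            tri[tuple(padded[i - 2:i + 1])] += 1
    return tri, hist


def p_ml(w3, w1, w2, tri, hist):
    """PML(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2); 0 for an unseen history."""
    h = hist[(w1, w2)]
    return tri[(w1, w2, w3)] / h if h else 0.0


def sentence_prob(words, tri, hist):
    """Product of trigram probabilities under the Markov assumption."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    p = 1.0
    for i in range(2, len(padded)):
        p *= p_ml(padded[i], padded[i - 2], padded[i - 1], tri, hist)
    return p
```

Any sentence containing a trigram that never occurred in training gets probability 0 under this estimate, which is the sparse data problem discussed a few slides later.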


The derivation of MLE for N-gram

• Homework

• Hints

– This is a constrained optimization problem

– Use log likelihood as objective function

– Assume a multinomial distribution of LM

– Introduce Lagrange multiplier for the constraints

• ∑_{x∈X} P(x) = 1, and P(x) ≥ 0
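One way to follow the hints, sketched for a single multinomial over outcomes x with observed counts c(x) and N = ∑ c(x). The notation c(x) and the Lagrangian symbol are introduced here for illustration and are not from the slides.

```latex
\begin{align*}
\text{maximize}\quad & \ell(P) = \sum_{x \in X} c(x) \log P(x)
  \quad\text{subject to}\quad \sum_{x \in X} P(x) = 1 \\
\mathcal{L}(P, \lambda) &= \sum_{x \in X} c(x) \log P(x)
  + \lambda \Bigl( 1 - \sum_{x \in X} P(x) \Bigr) \\
\frac{\partial \mathcal{L}}{\partial P(x)} &= \frac{c(x)}{P(x)} - \lambda = 0
  \;\Rightarrow\; P(x) = \frac{c(x)}{\lambda} \\
\sum_{x \in X} P(x) = 1 &\;\Rightarrow\; \lambda = \sum_{x \in X} c(x) = N
  \;\Rightarrow\; P_{ML}(x) = \frac{c(x)}{N}
\end{align*}
```

The constraint P(x) ≥ 0 is satisfied automatically because counts are non-negative. Applying the same argument separately to each history (w1 w2) gives the conditional estimate Count(w1 w2 w3)/Count(w1 w2).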


Sparse data problems

• Say our vocabulary size is |V|

• There are |V|^3 parameters in the trigram LM
  – |V| = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters

• Most trigrams have a zero count even in a large text corpus
  – Count(w1 w2 w3) = 0
  – PML(w3|w1 w2) = Count(w1 w2 w3)/Count(w1 w2) = 0
  – P(W) = PML(w1) PML(w2|w1) ∏i PML(wi|wi-2 wi-1) = 0
  – W = argmax_W P(A|W) P(W) = … oops


Smoothing: backoff

• Backoff trigram to bigram, bigram to unigram

D ∈ (0,1) is a discount constant – absolute discount

α is calculated so probabilities sum to 1 (homework)
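The backoff formula itself did not survive in this transcript, so the sketch below uses a standard textbook form of absolute-discount backoff for the bigram-to-unigram case, which may differ in detail from the slide. Each seen bigram count is reduced by D, and the mass freed this way is spread over unseen continuations in proportion to their unigram probability. All names are hypothetical.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(tokens, discount=0.5):
    """Return p(w2, w1): a backoff bigram probability with absolute discount."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    history = Counter()            # how often w1 occurs as a bigram history
    followers = defaultdict(set)   # words seen after w1
    for (w1, w2), c in bigrams.items():
        history[w1] += c
        followers[w1].add(w2)

    def p_unigram(w):
        return unigrams[w] / total

    def p_bigram(w2, w1):
        h = history[w1]
        if h == 0:                       # unseen history: use the unigram directly
            return p_unigram(w2)
        c = bigrams[(w1, w2)]
        if c > 0:                        # seen bigram: discounted estimate
            return (c - discount) / h
        seen = followers[w1]
        reserved = discount * len(seen) / h            # mass freed by discounting
        unseen_mass = 1.0 - sum(p_unigram(w) for w in seen)
        if unseen_mass <= 0.0:
            return 0.0
        alpha = reserved / unseen_mass   # makes the distribution sum to 1
        return alpha * p_unigram(w2)

    return p_bigram
```

The variable alpha plays the role of α on the slide: it is the normaliser that makes the probabilities for a given history sum to 1 over the vocabulary.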


Smoothing: improved backoff

• Allow D to vary
  – Different D’s for different N-grams
  – Value of D’s as a function of Count(.)
  – Modified absolute discount

• Optimizing D’s on dev data using, e.g., Powell search

• Kneser-Ney smoothing
  – Using word type probabilities rather than token probabilities for backoff models

What is the best smoothing?

• It varies from task to task
  – Chen and Goodman (1999) give a very thorough evaluation and descriptions of a number of methods

• My favorite smoothing methods
  – Modified absolute discount (MAD, Gao et al., 2001)
    • Simple to implement and use
    • Good performance across many tasks, e.g., IME, SMT, ASR, Speller
  – Interpolated Kneser-Ney
    • Recommended by Chen and Goodman (1999)
    • Only slightly better than MAD on SMT (more expensive to train, though)

N-gram LM at Microsoft

• Contextual speller in WORD
  – 1-5 MB trigram
  – LM compression (Church et al. 2007)

• Chinese ASR and IME
  – 20-100 MB trigram
  – Training data selection and LM compression (Gao et al. 2002a)

• Japanese IME
  – 30-60 MB trigram
  – Capture language structure – headword trigram (Gao et al. 2002b)

• MS-SMT (MSRLM)
  – 1-20 GB tri/four-gram
  – LM compression, training data selection, runtime (client/server)

• Bing Search, e.g., query speller/segmentation, ranking
  – Terabyte 5-gram
  – Model building and runtime via cloud-based platform

LM research

• Research communities (speech/IR/NLP)
  – Make LM smarter via
    • Using better smoothing
    • Using word classes
    • Capturing linguistic structure, etc.

• Industry: data is smarter!
  – Make LM simpler and more scalable, e.g.,
    • Google’s “stupid smoothing” model
    • Don’t do research until you run out of data (Eric Brill)

• Bridge the gap between academic and industry research
  – Bing-It-On-Ngram service hosted by MS (http://research.microsoft.com/web-ngram)
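To show how simple such a model can be, here is a sketch that assumes the "stupid smoothing" bullet refers to the Stupid Backoff scheme of Brants et al. (2007): use the relative frequency of the longest matching n-gram, multiplying by a fixed factor (0.4 in that paper) each time the context is shortened. The scores are not normalised probabilities. Function names are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Counts for all n-grams of order 1..max_order, keyed by tuple."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def stupid_backoff(word, context, counts, total, alpha=0.4):
    """Score S(word | context); `total` is the number of training tokens."""
    context = tuple(context)
    factor = 1.0
    while context:
        full = counts[context + (word,)]
        if full > 0:
            return factor * full / counts[context]
        context = context[1:]   # drop the oldest context word
        factor *= alpha         # fixed penalty per backoff step
    return factor * counts[(word,)] / total
```

No discounting or normalisation pass is needed, so training reduces to counting, which is what makes the approach attractive at Web scale.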


Building Web scale n-gram LM

• Everyone can count. Why is it so difficult?

• Count all bigrams of a given text
  – Input: tokenized text
  – Output: a list of <bigram, freq> pairs

• Counting algorithms
  – Use a hash table if the text is small
  – Sort and merge (used in most SLM toolkits)
    • Could be very slow on large amounts of text data

• Probability/backoff estimation often requires sorting n-grams in different orders
  – This is why a KN-smoothed LM is expensive to train
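The two counting strategies can be contrasted in a short sketch. The hash version keeps every distinct bigram in memory. The sort-and-merge version sorts fixed-size chunks and merges the sorted runs, so only one run at a time needs to fit in memory when the runs are spilled to disk (here they stay in memory for brevity). Function names and the chunk size are illustrative.

```python
import heapq
from collections import Counter
from itertools import groupby


def bigrams(tokens):
    """Adjacent token pairs."""
    return zip(tokens, tokens[1:])


def count_with_hash(tokens):
    """Hash-based counting: fine while all distinct bigrams fit in memory."""
    return Counter(bigrams(tokens))


def count_with_sort_merge(tokens, chunk_size=1000):
    """Sort chunks of bigrams, merge the sorted runs, then count equal neighbours."""
    pairs = list(bigrams(tokens))
    runs = [sorted(pairs[i:i + chunk_size])
            for i in range(0, len(pairs), chunk_size)]
    merged = heapq.merge(*runs)  # lazily merges the already-sorted runs
    return [(bg, sum(1 for _ in group)) for bg, group in groupby(merged)]
```

The sort-and-merge output is a list of <bigram, freq> pairs in sorted order, which is also the form that later estimation steps typically need.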


A cloud-based n-gram platform


Cloud infrastructure

• Build programs using a script language
  – SQL-like language, easy to use
  – User-defined functions (operators) written in C#
  – Map “serial” code to a “parallel” execution plan automatically

Example: n-gram count in cloud

[Figure: Node 1, Node 2, … Node N each process their own Web pages through the same steps (Parsing, Tokenize, Counting with a Local Hash); a Recursive Reducer combines the per-node results into the Output.]
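The data flow in this diagram can be imitated on one machine. The sketch below is an interpretation rather than the platform's code: each "node" counts n-grams for its own pages in a local hash, and the "recursive reducer" is read as a pairwise merge repeated until one table remains. Whitespace splitting stands in for real parsing and tokenisation, and all names are hypothetical.

```python
from collections import Counter


def local_count(pages, n=2):
    """Per-node step: tokenize each page and count its n-grams in a local hash."""
    counts = Counter()
    for page in pages:
        tokens = page.lower().split()  # stand-in for parsing + tokenising
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def merge(group):
    """Add up a small group of partial count tables."""
    total = Counter()
    for partial in group:
        total.update(partial)
    return total


def recursive_reduce(partials):
    """Merge partial tables pairwise, level by level, until one remains."""
    partials = list(partials)
    if not partials:
        return Counter()
    while len(partials) > 1:
        partials = [merge(partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0]
```

Counting locally first means each node ships one entry per distinct n-gram rather than one per occurrence, which is the usual reason for a local aggregation step.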


Script language: 5-gram counting


Web pages

• Web page is a multi-field text
  – Content fields: URL/Title/Body
  – Popularity fields: anchor/query-click (Gao et al. 2009)

Web scale n-gram models (updated)


Perplexity results on a query set (updated)

• Query/anchor/content are different languages
• Web corpus is an aligned multi-lingual corpus


Query Spelling Correction

• What does speller do?

• How does it work?

• What is the role of LM?

• Results


What does speller do

• Provide suggestions for misspelled queries

http://www.bing.com/search?q=new+years+dances+srilanka&form=QBRE&qs=n

http://www.bing.com/search?q=yetisport&form=QBRE&qs=n

What does speller do

• Alter the original query to improve relevance

http://www.bing.com/search?q=bmw+competetion&form=QBRE&qs=n

WORD Speller vs. Bing Speller

• 1990’s: spellers built by hand
  – Dictionaries + heuristic rules used to identify misspellings
  – Typically, suggest only words sanctioned by the dictionary
  – No suggestions for unknown words (e.g., names, new words)
  – Runs client-side

• 2010: spellers learned from user-generated data
  – Search query speller
    • Spelling modeled as a statistical translation problem: translate bad queries to good queries
    • Models trained on billions of query-suggestion pairs from search logs
  – Correct suggestions learned automatically from data
  – Runs on cluster: large models provide better suggestions

How query speller works

Input query, q: “for eggsample”

• Speller A (edit distance): examplw → example
• Speller B (phonetic mistake): eggsample → example
• Speller C (word breaker): eggsample → egg sample

Candidates, GEN(q)
  – t1 = “for eggsample”
  – t2 = “for egg sample”
  – t3 = “for example”
  – t4 = “for eggs ample”
  – ...

Feature extractor
  – f0: N-gram prob.
  – f1: Length
  – f2: ED_Bin
  – fi: ...

Ranker
  – Score(q, t) = λ · f(q, t)

Ranking results
  – “for example” 0.23
  – “for egg sample” 0.06
  – “for eggs ample” 0.03
  – “for eggsample” 0.01
  – ...
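The ranker step, Score(q, t) = λ · f(q, t), is a dot product between a weight vector and a feature vector, followed by a sort. The sketch below illustrates only that mechanism: the three features are simplified stand-ins for f0, f1 and f2, and the weights and language model scores used in the usage note are made-up numbers, not values from the Bing speller.

```python
def features(query, candidate, lm_logprob):
    """Feature vector f(q, t); simplified stand-ins for the slide's f0..f2."""
    return [
        lm_logprob(candidate),          # f0: n-gram log-probability of t
        float(len(candidate.split())),  # f1: length of t in tokens
        float(candidate != query),      # crude stand-in for an edit-distance bin
    ]


def score(weights, feats):
    """Linear model: dot product of weights (lambda) and features."""
    return sum(w * f for w, f in zip(weights, feats))


def rank(query, candidates, weights, lm_logprob):
    """Candidates sorted by descending score."""
    scored = [(score(weights, features(query, t, lm_logprob)), t)
              for t in candidates]
    return [t for _, t in sorted(scored, reverse=True)]
```

With a hypothetical lookup table such as `{"for example": -2.0, "for egg sample": -6.0, "for eggs ample": -8.0, "for eggsample": -12.0}` as the language model and weights `[1.0, -0.1, -0.5]`, the candidates come out in the same order as the ranking results on the slide.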


Speller workflow

Candidate Generation → Path Filtering → Ranking → Auto-correction

Candidate Generation: generate candidates for each token in the query, e.g., {Britnay Spears Vidios}
  – Typographic generator
  – Phonetic generator
  – Wordbreak generator
  – Concatenation generator

Speller workflow

Path Filtering: use a small bigram model to pick the 20 best paths.
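Path filtering can be pictured as a search over a lattice with one list of candidates per query token. The sketch below uses a plain beam search that keeps the k highest-scoring partial paths at each position. This is a simplification chosen for brevity (the deck mentions Viterbi/A* for this step, and a beam is not guaranteed to return the exact k best paths), and the function names and bigram scores are hypothetical.

```python
import heapq


def k_best_paths(lattice, bigram_logprob, k=20):
    """Beam search over per-token candidate lists, scored by a bigram model.

    lattice: list of candidate lists, one per query token.
    bigram_logprob(prev, word): log P(word | prev).
    Returns up to k (log-probability, path) pairs, best first.
    """
    beams = [(0.0, ["<s>"])]
    for candidates in lattice:
        expanded = [
            (lp + bigram_logprob(path[-1], w), path + [w])
            for lp, path in beams
            for w in candidates
        ]
        beams = heapq.nlargest(k, expanded, key=lambda item: item[0])
    return [(lp, path[1:]) for lp, path in beams]  # drop the <s> marker
```

The surviving paths are what the ranking stage would then re-score with its much larger feature set.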


Speller workflow

Ranking: extract ~200 features and return the path with the highest score.

Speller workflow

Auto-correction: determine whether we should alter the original query

Original query = britnay spears vidios
Altered query = word:(britnay britney) spears word:(vidios videos)

Roles of LM in a ranker-based speller

• A light decoder
  – Only uses a small bigram model trained on query log
  – Run Viterbi/A* to produce top-20 suggestions
  – A component of candidate generator

• A heavy ranker
  – Feature extraction
    • Derived from the models that generate candidates
    • Additional features (defined by domain experts)
  – (Non-)linear model ranker (with complex features)
    • Uses ~200 features, including
    • 4 Web scale 4-gram language models (amounting to 2 terabytes)

Search Query Speller Accuracy

[Figure: bar chart of accuracy on 30K queries (%), y-axis from 83 to 95, for systems 1 to 7; the bar values are not recoverable from the extracted text.]

1. Noisy-channel model trained on query logs
2. Ranker-based speller trained on query/session logs
3. 2 + word-based translation model trained on session logs
4. 3 + phrase-based translation model trained on 1-m session logs
6. 3 + phrase-based translation model trained on 3-m session logs
7. 6 + TB language models trained on the Web collection

Query Bracketing

• Task: given a three-word NP, determine sub-NP structure either as left or right bracketing

• Methods: compare word associations between w1w2 and w2w3 (or between w1w2 and w1w3).

• Word association metrics

– PMI based on raw counts or smoothed prob
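A minimal sketch of the first comparison (w1w2 against w2w3) using PMI on raw counts. The counts in the usage note are invented for illustration, treating a zero bigram count as minus infinity is a choice made here, and the function names are hypothetical. The second variant on the slide would compare w1w2 with w1w3 instead.

```python
import math


def pmi(x, y, unigram, bigram, total):
    """Pointwise mutual information log(P(x y) / (P(x) P(y))) from raw counts."""
    c_xy = bigram.get((x, y), 0)
    if c_xy == 0:
        return float("-inf")  # unseen pair: weakest possible association
    p_xy = c_xy / total
    return math.log(p_xy / ((unigram[x] / total) * (unigram[y] / total)))


def bracket(w1, w2, w3, unigram, bigram, total):
    """Left bracketing if w1 w2 are more strongly associated than w2 w3."""
    left = pmi(w1, w2, unigram, bigram, total)
    right = pmi(w2, w3, unigram, bigram, total)
    if left > right:
        return f"(({w1} {w2}) {w3})"
    return f"({w1} ({w2} {w3}))"
```

For example, with made-up counts in which "new york" co-occurs far more often than "york hotels" relative to the unigram frequencies, `bracket("new", "york", "hotels", ...)` returns the left bracketing `((new york) hotels)`.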


Results


Long Query Segmentation

• Task

• Method: best-first search based on SPMI


Results


Conclusion

• Web page is a multi-field text

– Web corpus is an aligned multi-lingual corpus

• We can build large and smart models

• Performance of an LM depends on

– Language (style), model size, model order and smoothing

• Web as baseline for IR/NLP research

– Bing-It-On-Ngram
