Natural Language Processing Lecture 5: Language Models and Smoothing

Transcript of Lecture 5: Language Models and Smoothing (demo.clab.cs.cmu.edu/NLP/F20/files/slides/05-lm.pdf)

  • Natural Language Processing

    Lecture 5: Language Models and Smoothing

  • Language Modeling

    • Is this sentence good?
      – This is a pen
      – Pen this is a

    • Help choose between options, help score options:
      – 他向记者介绍了发言的主要内容 (the Chinese source sentence)
      – He briefed to reporters on the chief contents of the statement
      – He briefed reporters on the chief contents of the statement
      – He briefed to reporters on the main contents of the statement
      – He briefed reporters on the main contents of the statement

  • One-Slide Review of Probability Terminology

    • Random variables take different values, depending on chance.

    • Notation:
      – p(X = x) is the probability that r.v. X takes value x
      – p(x) is shorthand for the same
      – p(X) is the distribution over values X can take (a function)

    • Joint probability: p(X = x, Y = y)
      – Independence
      – Chain rule (both written out below)

    • Conditional probability: p(X = x | Y = y)
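    As a quick reference for the two sub-points named above (standard identities, added here because the slide lists them only by name):

        % chain rule
        p(X = x, Y = y) = p(X = x)\, p(Y = y \mid X = x)
        % independence: the joint factors into the marginals
        p(X = x, Y = y) = p(X = x)\, p(Y = y)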

  • Unigram Model

    • Every word in Σ is assigned some probability.

    • Random variables W1, W2, ... (one per word).
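    A minimal sketch of how such a unigram distribution can be estimated by relative frequency (Python; the function and variable names are mine, not from the lecture):

        from collections import Counter

        def unigram_mle(tokens):
            """Estimate p(w) as count(w) / total number of tokens."""
            counts = Counter(tokens)
            total = sum(counts.values())
            return {w: c / total for w, c in counts.items()}

        # e.g. p("a") = 2/9 for this toy corpus
        probs = unigram_mle("this is a pen and this is a book".split())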

  • Part of a Unigram Distribution

    [rank 1]
    p(the) = 0.038
    p(of) = 0.023
    p(and) = 0.021
    p(to) = 0.017
    p(is) = 0.013
    p(a) = 0.012
    p(in) = 0.012
    p(for) = 0.009
    ...

    ...[rank 1001]
    p(joint) = 0.00014
    p(relatively) = 0.00014
    p(plot) = 0.00014
    p(DEL1SUBSEQ) = 0.00014
    p(rule) = 0.00014
    p(62.0) = 0.00014
    p(9.1) = 0.00014
    p(evaluated) = 0.00014
    ...

  • Unigram Model as a Generator

    [example output randomly sampled from the unigram model; the text is garbled beyond recovery in this transcript]
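    A sketch of how such a sample is drawn: each word independently from the unigram distribution, which is why the output reads as word salad (Python; names are mine):

        import random

        def sample_unigram(probs, length=20):
            """Draw `length` words independently from a unigram distribution."""
            words = list(probs)
            weights = [probs[w] for w in words]
            return " ".join(random.choices(words, weights=weights, k=length))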

  • Full History Model

    • Every word in Σ is assigned some probability, conditioned on every history.

  • [Example output from the full-history model: a fluent news-style passage mentioning Bill Clinton, Obama, Hillary Clinton, and the South Carolina primary, effectively reproduced from the training data; the text is garbled in this transcript]

  • N-Gram Model

    • Every word in Σ is assigned some probability, conditioned on a fixed-length history (n – 1 words).
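    A sketch of maximum-likelihood n-gram estimation with a fixed-length history (illustrative only; the function and variable names are mine):

        from collections import Counter

        def ngram_mle(tokens, n):
            """Estimate p(w | h) = count(h, w) / count(h), where h is the
            previous n-1 words; returns a dict keyed by n-gram tuples."""
            ngrams = Counter(tuple(tokens[i:i + n])
                             for i in range(len(tokens) - n + 1))
            histories = Counter(tuple(tokens[i:i + n - 1])
                                for i in range(len(tokens) - n + 1))
            return {g: c / histories[g[:-1]] for g, c in ngrams.items()}

        # bigram example over a padded toy sentence
        bigrams = ngram_mle("<s> this is a pen </s>".split(), 2)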

  • Bigram Model as a Generator

    [example output sampled from the bigram model; the generated text is garbled in this transcript]

  • Trigram Model as a Generator

    [example output sampled from the trigram model; the generated text is garbled in this transcript]

  • What’s in a word

    • Is punctuation a word?
      – Does knowing the last “word” is a “,” help?

    • In speech:
      – “I do uh main- mainly business processing”
      – Is “uh” a word?

  • For Thought

    • Do N-Gram models “know” English?

    • Unknown words

    • N-gram models and finite-state automata

  • Starting and Stopping

    Unigram model:

    ...

    Bigram model:

    ...

    Trigram model:

    ...
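    The formulas on this slide did not survive transcription; written out in the usual convention (my reconstruction of the standard treatment, not a copy of the slide):

        % unigram: words drawn independently, with a stop symbol to end the sentence
        p(x_1 \cdots x_n) = \Big( \prod_{i=1}^{n} p(x_i) \Big)\, p(\text{stop})

        % bigram: condition on one previous word; pad with a start symbol
        p(x_1 \cdots x_n) = \prod_{i=1}^{n+1} p(x_i \mid x_{i-1}), \quad x_0 = \text{start},\ x_{n+1} = \text{stop}

        % trigram: condition on the two previous words
        p(x_1 \cdots x_n) = \prod_{i=1}^{n+1} p(x_i \mid x_{i-2}, x_{i-1}), \quad x_{-1} = x_0 = \text{start},\ x_{n+1} = \text{stop}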

  • Evaluation

  • Which model is better?

    • Can I get a number about how good my model is for a test set?

    • What is P(test_set | Model)?

    • We measure this by perplexity.

    • Perplexity is the inverse probability of the test set, normalized by the number of words.

  • Perplexity
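    The body of this slide did not survive transcription; the standard definition, for a test set W = w_1 … w_N (my reconstruction, not copied from the slide):

        \mathrm{PP}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
                       = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}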

  • Perplexity of different models

    • Better models have lower perplexity
      – WSJ: unigram 962; bigram 170; trigram 109

    • Different tasks have different perplexity
      – WSJ (109) vs. Bus Information Queries (~25)

    • The higher the conditional probability, the lower the perplexity

    • Perplexity is the (weighted) average branching factor
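    A small sketch relating perplexity to the average per-word log probability (my illustration; the numbers just echo the WSJ figures above):

        import math

        def perplexity(total_log2_prob, num_words):
            """Perplexity = 2 ** (average negative log2 probability per word)."""
            return 2 ** (-total_log2_prob / num_words)

        # Perplexity 109 corresponds to about log2(109) ≈ 6.77 bits per word,
        # i.e. roughly the WSJ trigram figure quoted above.
        bits_per_word = math.log2(109)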

  • What about open-class words?

    • What is the probability of unseen words?
      – (Naïve answer is 0.0)

    • But that’s not what you want
      – The test set will usually include words not seen in training

    • What is the probability of:
      – P(Nebuchadnezzar | son of)

  • LM smoothing

    • Laplace or add-one smoothing
      – Add one to all counts
      – Or add “epsilon” to all counts
      – You still need to know your full vocabulary

    • Have an OOV word in your vocabulary
      – The probability of seeing an unseen word (a small sketch of add-k smoothing follows)
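    A sketch of add-k (Laplace) smoothing over a fixed vocabulary that includes an OOV symbol (Python; the names and the <OOV> convention here are my own):

        from collections import Counter

        def laplace_unigram(tokens, vocab, k=1.0):
            """Add-k smoothed unigram probabilities; tokens outside `vocab`
            should already be mapped to an "<OOV>" symbol contained in vocab."""
            counts = Counter(tokens)
            total = len(tokens) + k * len(vocab)
            return {w: (counts[w] + k) / total for w in vocab}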

  • Good-Turing Smoothing

    • Good (1953), from Turing.
      – Use the count of things you’ve seen once to estimate the count of things you’ve never seen.

    • Calculate the frequency of frequencies of N-grams:
      – count of N-grams that appear 1 time
      – count of N-grams that appear 2 times
      – count of N-grams that appear 3 times
      – …
      – Estimate the new count as c* = (c + 1) · N_{c+1} / N_c

    • Change the counts a little so we get a better estimate for count 0 (a code sketch follows).
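    A sketch of the re-estimation above (Python; note that N_0, the number of unseen N-grams, has to be supplied separately, e.g. |V|^n minus the number of seen N-grams, and practical implementations also smooth the N_c values):

        from collections import Counter

        def good_turing(ngram_counts, n_zero, max_c=6):
            """Discounted counts c* = (c + 1) * N_{c+1} / N_c for c = 0..max_c."""
            n_c = Counter(ngram_counts.values())  # frequency of frequencies, N_c
            n_c[0] = n_zero                       # unseen N-grams, supplied by caller
            return {c: (c + 1) * n_c[c + 1] / n_c[c]
                    for c in range(max_c + 1) if n_c[c] and n_c[c + 1]}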

  • Good-Turing’s Discounted Counts

    c      AP Newswire Bigrams          Berkeley Restaurants Bigrams   Smith Thesis Bigrams
           N_c              c*          N_c           c*               N_c        c*
    0      74,671,100,000   0.0000270   2,081,496     0.002553         x          38,048 / x
    1      2,018,046        0.446       5,315         0.533960         38,048     0.21147
    2      449,721          1.26        1,419         1.357294         4,032      1.05071
    3      188,933          2.24        642           2.373832         1,409      2.12633
    4      105,668          3.24        381           4.081365         749        2.63685
    5      68,379           4.22        311           3.781350         395        3.91899
    6      48,190           5.19        196           4.500000         258        4.42248
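    As an arithmetic check on the formula above (my calculation, not from the slide): for the Berkeley Restaurants bigrams, c* at c = 1 is (1 + 1) · N_2 / N_1 = 2 · 1,419 / 5,315 ≈ 0.5340, matching the 0.533960 in the table; likewise for AP Newswire, 2 · 449,721 / 2,018,046 ≈ 0.446.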

  • Backoff

    • If no trigram, use bigram
    • If no bigram, use unigram
    • If no unigram … smooth the unigrams (a schematic sketch follows)
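    A schematic version of that fallback chain (schematic only: true Katz backoff also discounts and redistributes probability mass, as the next slide notes):

        def backoff_prob(w, history, models):
            """models maps order n -> dict of n-gram tuples -> probability.
            Try the longest history first, then progressively shorter ones."""
            for order in range(len(history) + 1, 0, -1):
                key = (tuple(history[-(order - 1):]) + (w,)) if order > 1 else (w,)
                if key in models[order]:
                    return models[order][key]
            return 0.0  # still unseen: the unigrams themselves need smoothing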

  • Estimating p(w | history)

    • Relative frequencies (count & normalize)

    • Transform the counts:
      – Laplace / “add one” / “add λ”
      – Good-Turing discounting

    • Interpolate or “back off”:
      – With Good-Turing discounting: Katz backoff
      – “Stupid” backoff
      – Absolute discounting: Kneser-Ney
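    For contrast with backoff, a sketch of simple linear interpolation of the three orders (the lambdas would be tuned on held-out data; Kneser-Ney additionally modifies the lower-order distributions, which this sketch does not attempt):

        def interpolated_trigram(w, u, v, p_uni, p_bi, p_tri,
                                 lambdas=(0.1, 0.3, 0.6)):
            """Mix unigram, bigram, and trigram estimates (dicts keyed by tuples,
            as produced by the ngram_mle sketch earlier)."""
            l1, l2, l3 = lambdas
            return (l1 * p_uni.get((w,), 0.0)
                    + l2 * p_bi.get((v, w), 0.0)
                    + l3 * p_tri.get((u, v, w), 0.0))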
