
Language Modeling I. Algorithms for NLP. Taylor Berg-Kirkpatrick – CMU. Slides: Dan Klein – UC Berkeley.

Transcript of Algorithms for NLP (demo.clab.cs.cmu.edu/11711fa18/slides/lecture_2_language_models_1.pdf)

Page 1:

Language Modeling I

Algorithms for NLP

Taylor Berg-Kirkpatrick – CMU

Slides: Dan Klein – UC Berkeley

Page 2:

The Noisy-Channel Model

§ We want to predict a sentence given acoustics:

§ The noisy-channel approach:

Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions

Language model: Distributions over sequences of words (sentences)

Page 3:

ASR Components

[Diagram: the source (language model P(w)) generates a word sequence w; the channel (acoustic model P(a|w)) produces the observed acoustics a; the decoder recovers the best w.]

\( w^{*} = \arg\max_{w} P(w \mid a) = \arg\max_{w} P(a \mid w)\, P(w) \)

Page 4:

Acoustic Confusions

the station signs are in deep in english        -14732
the stations signs are in deep in english       -14735
the station signs are in deep into english      -14739
the station's signs are in deep in english      -14740
the station signs are in deep in the english    -14741
the station signs are indeed in english         -14757
the station's signs are indeed in english       -14760
the station signs are indians in english        -14790
the station signs are indian in english         -14799
the stations signs are indians in english       -14807
the stations signs are indians and english      -14815

Page 5:

Translation: Codebreaking?

“Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

Warren Weaver (1947)

Page 6:

MT System Components

[Diagram: the source (language model P(e)) generates an English sentence e; the channel (translation model P(f|e)) produces the observed foreign sentence f; the decoder recovers the best e.]

\( e^{*} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e) \)

Page 7:

Other Noisy-Channel Models?

§ We're not doing this only for ASR (and MT)
§ Grammar / spelling correction
§ Handwriting recognition, OCR
§ Document summarization
§ Dialog generation
§ Linguistic decipherment
§ …

Page 8:

Language Models

§ A language model is a distribution over sequences of words (sentences):

\( P(w) = P(w_1 \ldots w_n) \)

§ What's w? (closed vs. open vocabulary)
§ What's n? (must sum to one over all lengths)
§ Can have rich structure or be linguistically naive

§ Why language models?
§ Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
§ This is not the same as modeling grammaticality

Page 9:

N-Gram Models

Page 10:

N-Gram Models

§ Use chain rule to generate words left-to-right
§ Can't condition on the entire left context:

P(??? | Turn to page 134 and look at the picture of the)

§ N-gram models make a Markov assumption
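The chain-rule and Markov-assumption formulas on this slide were images and did not survive extraction; as a sketch, the standard forms are:

\( P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \)  (chain rule, exact)

\( P(w_i \mid w_1 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k+1} \ldots w_{i-1}) \)  (Markov assumption: keep only the last k-1 words)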

Page 11:

Empirical N-Grams

§ How do we know P(w | history)?
§ Use statistics from data (examples using Google N-Grams)
§ E.g. what is P(door | the)?
§ This is the maximum likelihood estimate

Training Counts:
198015222   the first
194623024   the same
168504105   the following
158562063   the world
…
14112454    the door
-----------------
23135851162 the *
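A minimal sketch of the maximum likelihood estimate computed from these counts (the counts are from the slide; the function and variable names are mine):

    # MLE: P(w | history) = c(history, w) / c(history, *)
    bigram_counts = {
        ("the", "first"): 198015222,
        ("the", "same"): 194623024,
        ("the", "following"): 168504105,
        ("the", "world"): 158562063,
        ("the", "door"): 14112454,
    }
    history_total = 23135851162  # count of "the *" from the slide

    def p_mle(word, history="the"):
        """Maximum likelihood estimate of P(word | history)."""
        return bigram_counts.get((history, word), 0) / history_total

    print(p_mle("door"))  # ~0.0006, matching the next slide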

Page 12:

Increasing N-Gram Order

§ Higher orders capture more dependencies

Bigram counts:
198015222   the first
194623024   the same
168504105   the following
158562063   the world
…
14112454    the door
-----------------
23135851162 the *

Trigram counts:
197302  close the window
191125  close the door
152500  close the gap
116451  close the thread
87298   close the deal
-----------------
3785230 close the *

Bigram Model: P(door | the) = 14112454 / 23135851162 ≈ 0.0006
Trigram Model: P(door | close the) = 191125 / 3785230 ≈ 0.05

Page 13:

Increasing N-Gram Order

Page 14:

Sparsity

3380  please close the door
1601  please close the window
1164  please close the new
1159  please close the gate
…
0     please close the first
-----------------
13951 please close the *

Please close the first door on the left.

Page 15:

Sparsity

§ Problems with n-gram models:
§ New words (open vocabulary)
§ Synaptitute
§ 132,701.03
§ multidisciplinarization
§ Old words in new contexts

§ Aside: Zipf's Law
§ Types (words) vs. tokens (word occurrences)
§ Broadly: most word types are rare ones
§ Specifically:
§ Rank word types by token frequency
§ Frequency inversely proportional to rank
§ Not special to language: randomly generated character strings have this property (try it! A quick sketch follows this list.)
§ This law qualitatively (but rarely quantitatively) informs NLP
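A minimal sketch of that "try it" experiment, assuming nothing beyond the Python standard library: generate random characters over a small alphabet where one symbol acts as the word separator, then check that frequency falls off roughly as a power of rank.

    import random
    from collections import Counter

    random.seed(0)
    text = "".join(random.choice("abcd ") for _ in range(1_000_000))
    counts = Counter(w for w in text.split() if w)

    # Zipf-like behavior: log-frequency falls roughly linearly in log-rank.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000):
            print(rank, word, freq)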

[Graph: fraction of n-grams seen vs. number of training words (0 to 1,000,000), with one curve for unigrams and one for bigrams; y-axis ("Fraction Seen") runs from 0 to 1.]

Page 16:

N-Gram Estimation

Page 17:

Smoothing

§ We often want to make estimates from sparse statistics:

P(w | denied the):
3 allegations
2 reports
1 claims
1 request
7 total

§ Smoothing flattens spiky distributions so they generalize better:

P(w | denied the):
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total

§ Very important all over NLP, but easy to do badly

[Histograms: the raw estimate puts all mass on the four seen words (allegations, reports, claims, request) and none on unseen words such as charges, motion, benefits; the smoothed version shaves mass from the seen words and reallocates it to the unseen ones.]
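A minimal sketch of the reallocation shown above, assuming an absolute-discounting-style scheme (shave a constant d = 0.5 from each seen count and pool the shaved mass as "other"); the function name is mine.

    def smooth(counts, d=0.5):
        """Shave d from each seen count; pool the shaved mass as 'other'."""
        total = sum(counts.values())
        smoothed = {w: c - d for w, c in counts.items()}
        smoothed["other"] = d * len(counts)
        return {w: c / total for w, c in smoothed.items()}  # sums to one

    print(smooth({"allegations": 3, "reports": 2, "claims": 1, "request": 1}))
    # allegations 0.36, reports 0.21, claims 0.07, request 0.07, other 0.29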

Page 18:

Likelihood and Perplexity

§ How do we measure LM “goodness”?

§ Shannon’s game: predict the next word

When I eat pizza, I wipe off the _________

grease  0.5
sauce   0.4
dust    0.05
…
mice    0.0001
…
the     1e-100

3516  wipe off the excess
1034  wipe off the dust
547   wipe off the sweat
518   wipe off the mouthpiece
…
120   wipe off the grease
0     wipe off the sauce
0     wipe off the mice
-----------------
28048 wipe off the *

§ Formally: define test set (log) likelihood

\( \log P(X \mid \theta) = \sum_{w \in X} \log P(w \mid \theta) \)

§ Perplexity: “average per-word branching factor”

\( \mathrm{perp}(X, \theta) = \exp\left( -\frac{\log P(X \mid \theta)}{|X|} \right) \)
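A minimal sketch of these two formulas, assuming the model is given as a function from a word and its history to a probability (all names are mine):

    import math

    def log_likelihood(words, prob):
        """Sum of log P(w | history) over a test sequence."""
        return sum(math.log(prob(w, words[:i])) for i, w in enumerate(words))

    def perplexity(words, prob):
        """exp of the negative average per-word log likelihood."""
        return math.exp(-log_likelihood(words, prob) / len(words))

    # Sanity check: a uniform model over 1000 words has perplexity 1000.
    print(perplexity(["a"] * 50, lambda w, h: 1 / 1000))  # 1000.0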

Page 19:

Measuring Model Quality (Speech)

§ We really want better ASR (or whatever), not better perplexities

§ For speech, we care about word error rate (WER):

WER = (insertions + deletions + substitutions) / true sentence size

Correct answer:     Andy saw a part of the movie
Recognizer output:  And he saw apart of the movie

WER = 4 / 7 = 57%

§ Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible
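A minimal sketch computing WER as word-level edit distance, assuming uniform cost 1 for insertions, deletions, and substitutions (the function name is mine):

    def wer(reference, hypothesis):
        """Word error rate: word-level edit distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between ref[:i] and hyp[:j]
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        return dist[len(ref)][len(hyp)] / len(ref)

    print(wer("Andy saw a part of the movie",
              "And he saw apart of the movie"))  # 4/7 ~ 0.57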

Page 20:

Key Ideas for N-Gram LMs

Page 21:

Idea 1: Interpolation

Please close the first door on the left.

4-Gram (specific but sparse): P(first | please close the) = 0.0
3380  please close the door
1601  please close the window
1164  please close the new
1159  please close the gate
…
0     please close the first
-----------------
13951 please close the *

3-Gram: P(first | close the) ≈ 0.002
197302  close the window
191125  close the door
152500  close the gap
116451  close the thread
…
8662    close the first
-----------------
3785230 close the *

2-Gram (dense but general): P(first | the) ≈ 0.009
198015222   the first
194623024   the same
168504105   the following
158562063   the world
…
-----------------
23135851162 the *

Page 22:

(Linear) Interpolation

§ Simplest way to mix different orders: linear interpolation (formula and sketch below)

§ How to choose lambdas?
§ Should lambda depend on the counts of the histories?
§ Choosing weights: either grid search or EM using held-out data
§ Better methods have interpolation weights connected to context counts, so you smooth more when you know less
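The interpolation formula on the slide was an image; the standard trigram form, with a code sketch (the lambda values below are illustrative, not from the slide):

\( P_{\text{interp}}(w \mid w_{i-2} w_{i-1}) = \lambda_3 P(w \mid w_{i-2} w_{i-1}) + \lambda_2 P(w \mid w_{i-1}) + \lambda_1 P(w), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 \)

    def p_interp(w, history, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
        """Mix trigram, bigram, and unigram estimates; lambdas sum to 1."""
        l3, l2, l1 = lambdas
        return (l3 * p3(w, tuple(history[-2:])) +
                l2 * p2(w, tuple(history[-1:])) +
                l1 * p1(w))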

Page 23:

Train, Held-Out, Test

§ Want to maximize likelihood on test, not training data
§ Empirical n-grams won’t generalize well
§ Models derived from counts / sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization
§ Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

Training Data: counts / parameters from here
Held-Out Data: hyperparameters from here
Test Data: evaluate here
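A minimal sketch of the grid-search option, reusing the p_interp and perplexity sketches above and assuming the held-out data is a word sequence (all names are mine):

    import itertools

    def tune_lambdas(held_out, p3, p2, p1, step=0.1):
        """Pick interpolation weights that minimize held-out perplexity."""
        best, best_perp = None, float("inf")
        grid = [i * step for i in range(int(1 / step) + 1)]
        for l3, l2 in itertools.product(grid, grid):
            if l3 + l2 > 1:
                continue
            lambdas = (l3, l2, 1 - l3 - l2)
            prob = lambda w, h: p_interp(w, h, p3, p2, p1, lambdas)
            perp = perplexity(held_out, prob)
            if perp < best_perp:
                best, best_perp = lambdas, perp
        return best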

Page 24:

Idea 2: Discounting

§ Observation: N-grams occur more in training data than they will later

Empirical Bigram Counts (Church and Gale, 91):

Count in 22M Words    Future c* (Next 22M)
1                     0.45
2                     1.25
3                     2.24
4                     3.23
5                     4.21

(Note that each future count is roughly the training count minus about 0.75, which motivates the constant discount on the next slide.)

Page 25:

Absolute Discounting

§ Absolute discounting
§ Reduce numerator counts by a constant d (e.g. 0.75)
§ Maybe have a special discount for small counts
§ Redistribute the “shaved” mass to a model of new events

§ Example formulation (reconstructed below)
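The example formulation on the slide was an image; a standard absolute-discounting formulation, matching the Kneser-Ney equations later in this deck, is:

\( P_{\text{AD}}(w \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w) - d,\, 0)}{\sum_{v} c(w_{i-1}, v)} + \alpha(w_{i-1})\, P(w) \)

where \( \alpha(w_{i-1}) \) is chosen so the distribution normalizes.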

Page 26:

Idea 3: Fertility

§ Shannon game: “There was an unexpected _____”
§ “delay”?
§ “Francisco”?

§ Context fertility: number of distinct context types that a word occurs in
§ What is the fertility of “delay”?
§ What is the fertility of “Francisco”?
§ Which is more likely in an arbitrary new context?

Page 27:

Kneser-Ney Smoothing

§ Kneser-Ney smoothing combines two ideas
§ Discount and reallocate like absolute discounting
§ In the backoff model, word probabilities are proportional to context fertility, not frequency:

\( P(w) \propto |\{ w' : c(w', w) > 0 \}| \)

§ Theory and practice
§ Practice: KN smoothing has been repeatedly proven both effective and efficient
§ Theory: KN smoothing as approximate inference in a hierarchical Pitman-Yor process [Teh, 2006]

Page 28:

Kneser-Ney Details

§ All orders recursively discount and back off:

\( P_k(w \mid \text{prev}_{k-1}) = \frac{\max(c'(\text{prev}_{k-1}, w) - d,\, 0)}{\sum_{v} c'(\text{prev}_{k-1}, v)} + \alpha(\text{prev}_{k-1})\, P_{k-1}(w \mid \text{prev}_{k-2}) \)

§ Alpha is computed to make the probability normalize (see if you can figure out an expression).

§ For the highest order, c' is the token count of the n-gram. For all others it is the context fertility of the n-gram:

\( c'(x) = |\{ u : c(u, x) > 0 \}| \)

§ The unigram base case does not need to discount.
§ Variants are possible (e.g. different d for low counts)
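A minimal sketch of the bigram case of these equations (the highest order uses token counts and the base case uses context fertility; names are mine, and edge cases such as unseen histories are ignored):

    from collections import Counter, defaultdict

    def train_kn_bigram(tokens, d=0.75):
        """Return P(w | prev) under interpolated bigram Kneser-Ney."""
        bigrams = Counter(zip(tokens, tokens[1:]))
        history = Counter(tokens[:-1])   # c(prev, *) = sum over following words
        fertility = defaultdict(set)     # distinct contexts each word follows
        for prev, w in bigrams:
            fertility[w].add(prev)
        bigram_types = len(bigrams)

        def p_cont(w):
            # Unigram base case: proportional to context fertility.
            return len(fertility[w]) / bigram_types

        def p(w, prev):
            discounted = max(bigrams[(prev, w)] - d, 0) / history[prev]
            # Alpha redistributes the shaved mass: d per seen continuation.
            seen_after = sum(1 for (h, _) in bigrams if h == prev)
            alpha = d * seen_after / history[prev]
            return discounted + alpha * p_cont(w)

        return p

    p = train_kn_bigram("the cat sat on the mat the cat ran".split())
    print(p("cat", "the"))  # discounted bigram mass plus backoff mass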

Page 29:

What Actually Works?

§ Trigrams and beyond:
§ Unigrams, bigrams generally useless
§ Trigrams much better
§ 4-, 5-grams and more are really useful in MT, but gains are more limited for speech

§ Discounting
§ Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc.

§ Context counting
§ Kneser-Ney construction of lower-order models

§ See [Chen+Goodman] reading for tons of graphs…

[Graph from Joshua Goodman]

Page 30:

Idea 4: Big Data

There’s no data like more data.

Page 31:

Data >> Method?

§ Having more data is better…
§ …but so is using a better estimator
§ Another issue: N > 3 has huge costs in speech recognizers

[Graph: test entropy vs. n-gram order (1 through 10, and 20) for Katz and Kneser-Ney smoothing, each trained on 100,000 / 1,000,000 / 10,000,000 / all words.]

Page 32:

Tons of Data?

[Brants et al., 2007]

Page 33:

What about…

Page 34:

Unknown Words?

§ What about totally unseen words?

§ Most LM applications are closed vocabulary
§ ASR systems will only propose words that are in their pronunciation dictionary
§ MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc.)

§ In principle, one can build open vocabulary LMs
§ E.g. models over character sequences rather than word sequences
§ Back-off needs to go down into a “generate new word” model
§ Typically if you need this, a high-order character model will do

Page 35:

What’s in an N-Gram?

§ Just about every local correlation!
§ Word class restrictions: “will have been ___”
§ Morphology: “she ___”, “they ___”
§ Semantic class restrictions: “danced the ___”
§ Idioms: “add insult to ___”
§ World knowledge: “ice caps have ___”
§ Pop culture: “the empire strikes ___”

§ But not the long-distance ones
§ “The computer which I had just put into the machine room on the fifth floor ___.”

Page 36:

Linguistic Pain?

§ The N-Gram assumption hurts one’s inner linguist!
§ Many linguistic arguments that language isn’t regular
§ Long-distance dependencies
§ Recursive structure

§ Answers
§ N-grams only model local correlations, but they get them all
§ As N increases, they catch even more correlations
§ N-gram models scale much more easily than structured LMs

§ Not convinced?
§ Can build LMs out of our grammar models (later in the course)
§ Take any generative model with words at the bottom and marginalize out the other variables

Page 37:

What Gets Captured?

§ Bigram model:
§ [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, ’s, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
§ [outside, new, car, parking, lot, of, the, agreement, reached]
§ [this, would, be, a, record, november]

§ PCFG model:
§ [This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
§ [It, could, be, announced, sometime, .]
§ [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]

Page 38:

Scaling Up?

§ There’s a lot of training data out there…

…next class we’ll talk about how to make it fit.

Page 39:

Other Techniques?

§ Lots of other techniques:

§ Maximum entropy LMs (soon)

§ Neural network LMs (soon)

§ Syntactic / grammar-structured LMs (much later)