Algorithms for NLP: demo.clab.cs.cmu.edu/11711fa18/slides/lecture_2_language_models_1.pdf
Language Modeling I
Algorithms for NLP
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
The Noisy-Channel Model
§ We want to predict a sentence given acoustics:
§ The noisy-channel approach:

  w_best = argmax_w P(w|a) = argmax_w P(a|w) P(w)

ASR Components
§ Language model P(w): distributions over sequences of words (sentences)
§ Acoustic model P(a|w): HMMs over word positions with mixtures of Gaussians as emissions

[Diagram: a source generates w with probability P(w); the channel produces the observed acoustics a with probability P(a|w); the decoder recovers the best w from a.]
Acoustic Confusions

the station signs are in deep in english        -14732
the stations signs are in deep in english       -14735
the station signs are in deep into english      -14739
the station's signs are in deep in english      -14740
the station signs are in deep in the english    -14741
the station signs are indeed in english         -14757
the station's signs are indeed in english       -14760
the station signs are indians in english        -14790
the station signs are indian in english         -14799
the stations signs are indians in english       -14807
the stations signs are indians and english      -14815
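Each hypothesis score above combines the language model and the acoustic model in the log domain. A minimal sketch of the decoding rule, where the candidate sequences and all probabilities are invented toy numbers, not taken from a real system:

```python
import math

# Noisy-channel decoding: pick w maximizing log P(a|w) + log P(w).
# Both tables below are hypothetical toy numbers for illustration.
lm_logprob = {                     # language model log P(w)
    "the station signs": math.log(1e-6),
    "the stations signs": math.log(2e-8),
}
channel_logprob = {                # acoustic model log P(a|w)
    "the station signs": math.log(0.02),
    "the stations signs": math.log(0.03),
}

best = max(lm_logprob, key=lambda w: channel_logprob[w] + lm_logprob[w])
print(best)  # the LM prior outweighs the small acoustic preference
```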
Translation: Codebreaking?

"Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"

Warren Weaver (1947)

MT System Components

  e_best = argmax_e P(e|f) = argmax_e P(f|e) P(e)

[Diagram: a source generates English e with probability P(e) (the language model); the channel produces the observed foreign sentence f with probability P(f|e) (the translation model); the decoder recovers the best e.]
Other Noisy-Channel Models?
§ We're not doing this only for ASR (and MT)
§ Grammar/spelling correction
§ Handwriting recognition, OCR
§ Document summarization
§ Dialog generation
§ Linguistic decipherment
§ …
Language Models
§ A language model is a distribution over sequences of words (sentences):

  P(w) = P(w_1 … w_n)

§ What's w? (closed vs. open vocabulary)
§ What's n? (must sum to one over all lengths)
§ Can have rich structure or be linguistically naive
§ Why language models?
  § Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
  § This is not the same as modeling grammaticality
N-Gram Models

N-Gram Models
§ Use chain rule to generate words left-to-right:

  P(w_1 … w_n) = Π_i P(w_i | w_1 … w_{i-1})

§ Can't condition on the entire left context:

  P(??? | Turn to page 134 and look at the picture of the)

§ N-gram models make a Markov assumption:

  P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-n+1} … w_{i-1})
Empirical N-Grams
§ How do we know P(w | history)?
§ Use statistics from data (examples using Google N-Grams)
  § E.g. what is P(door | the)?
§ This is the maximum likelihood estimate

Training counts:
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
   14112454 the door
-----------------
23135851162 the *
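The maximum likelihood estimate follows directly from the Google N-Gram counts shown above:

```python
# Maximum likelihood estimate: P(door | the) = c(the door) / c(the *)
c_the_door = 14112454
c_the_star = 23135851162
p = c_the_door / c_the_star
print(round(p, 4))  # 0.0006, the bigram estimate used on the next slide
```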
Increasing N-Gram Order
§ Higher orders capture more dependencies

Bigram model: P(door | the) = 0.0006
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
   14112454 the door
-----------------
23135851162 the *

Trigram model: P(door | close the) = 0.05
 197302 close the window
 191125 close the door
 152500 close the gap
 116451 close the thread
  87298 close the deal
-----------------
3785230 close the *
Increasing N-Gram Order: Sparsity

Please close the first door on the left.

 3380 please close the door
 1601 please close the window
 1164 please close the new
 1159 please close the gate
…
    0 please close the first
-----------------
13951 please close the *
Sparsity
§ Problems with n-gram models:
  § New words (open vocabulary)
    § Synaptitute
    § 132,701.03
    § multidisciplinarization
  § Old words in new contexts
§ Aside: Zipf's Law
  § Types (words) vs. tokens (word occurrences)
  § Broadly: most word types are rare ones
  § Specifically:
    § Rank word types by token frequency
    § Frequency inversely proportional to rank
  § Not special to language: randomly generated character strings have this property (try it!)
§ This law qualitatively (but rarely quantitatively) informs NLP
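The "try it!" takes only a few lines; the alphabet, corpus size, and seed below are arbitrary choices for a quick check:

```python
import random
from collections import Counter

# Generate random characters (space included), split on space into
# "words", and rank word types by token frequency.
random.seed(0)
text = "".join(random.choice("abc ") for _ in range(200000))
counts = Counter(text.split())
freqs = sorted(counts.values(), reverse=True)

# Broadly: most word types are rare ones, even in random text.
singletons = sum(1 for f in freqs if f == 1)
print(len(freqs), singletons)  # many types; a large share occur only once
```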
[Figure: fraction of test n-grams seen in training vs. number of training words (x-axis 0 to 1,000,000; y-axis 0 to 1), plotted separately for unigrams and bigrams.]
N-Gram Estimation
Smoothing
§ We often want to make estimates from sparse statistics:

P(w | denied the):
  3 allegations
  2 reports
  1 claims
  1 request
  7 total

§ Smoothing flattens spiky distributions so they generalize better:

P(w | denied the):
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2   other
  7 total

§ Very important all over NLP, but easy to do badly

[Bar charts: the unsmoothed estimate puts all mass on the four seen words; the smoothed one reserves mass for unseen words (charges, motion, benefits, …).]
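One way to read the smoothed counts above: shave a constant 0.5 from each seen count and give the shaved mass to unseen events (an illustrative reconstruction of this particular example, not a statement of the method used):

```python
# Shave d = 0.5 from each seen count; the shaved mass goes to "other".
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
d = 0.5
smoothed = {w: c - d for w, c in counts.items()}
smoothed["other"] = d * len(counts)   # 0.5 shaved from each of 4 types
print(smoothed)                       # 2.5 / 1.5 / 0.5 / 0.5 / 2, total 7
```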
Likelihood and Perplexity
§ How do we measure LM "goodness"?
§ Shannon's game: predict the next word

When I eat pizza, I wipe off the _________

grease 0.5
sauce  0.4
dust   0.05
…
mice   0.0001
…
the    1e-100

 3516 wipe off the excess
 1034 wipe off the dust
  547 wipe off the sweat
  518 wipe off the mouthpiece
…
  120 wipe off the grease
    0 wipe off the sauce
    0 wipe off the mice
-----------------
28048 wipe off the *

§ Formally: define test set (log) likelihood

  log P(X | θ) = Σ_{w ∈ X} log P(w | θ)

§ Perplexity: "average per-word branching factor"

  perp(X, θ) = exp(− log P(X | θ) / |X|)
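The two formulas above translate directly into code; the uniform-model check below illustrates why perplexity reads as a branching factor:

```python
import math

# log P(X|θ) = Σ_{w in X} log P(w|θ);  perp = exp(−log P(X|θ) / |X|)
def perplexity(word_logprobs):
    total_loglik = sum(word_logprobs)
    return math.exp(-total_loglik / len(word_logprobs))

# A model that spreads mass uniformly over V words has perplexity V:
V = 1000
print(perplexity([math.log(1 / V)] * 20))  # ≈ 1000.0
```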
Measuring Model Quality (Speech)
§ We really want better ASR (or whatever), not better perplexities
§ For speech, we care about word error rate (WER):

  WER = (insertions + deletions + substitutions) / true sentence size

Correct answer:    Andy saw a  part of the movie
Recognizer output: And he saw apart of the movie

  WER = 4/7 = 57%

§ Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible
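WER is the word-level edit distance divided by reference length. The slide does not prescribe an algorithm, but the standard Levenshtein dynamic program reproduces the 4/7:

```python
# Word error rate via word-level edit distance (Levenshtein DP).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Andy saw a part of the movie",
          "And he saw apart of the movie"))  # 4/7 ≈ 0.571, the 57% above
```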
Key Ideas for N-Gram LMs

Idea 1: Interpolation

Please close the first door on the left.

4-gram (specific but sparse): P(first | please close the) = 0.0
 3380 please close the door
 1601 please close the window
 1164 please close the new
 1159 please close the gate
…
    0 please close the first
-----------------
13951 please close the *

3-gram: P(first | close the) ≈ 0.002
 197302 close the window
 191125 close the door
 152500 close the gap
 116451 close the thread
…
   8662 close the first
-----------------
3785230 close the *

2-gram (dense but general): P(first | the) ≈ 0.009
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
-----------------
23135851162 the *
(Linear) Interpolation
§ Simplest way to mix different orders: linear interpolation

  P_interp(w | w-2 w-1) = λ3 P(w | w-2 w-1) + λ2 P(w | w-1) + λ1 P(w),  with λ3 + λ2 + λ1 = 1

§ How to choose lambdas?
  § Should lambda depend on the counts of the histories?
§ Choosing weights: either grid search or EM using held-out data
§ Better methods have interpolation weights connected to context counts, so you smooth more when you know less
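A minimal sketch of linear interpolation. The lambda values and the unigram estimate below are illustrative inventions; only P(door | the) = 0.0006 and P(door | close the) = 0.05 come from earlier slides:

```python
# Linear interpolation of unigram, bigram, and trigram estimates.
# Lambdas are illustrative (not tuned) and sum to one.
def interpolated_p(w, bigram_context, trigram_context,
                   p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((trigram_context, w), 0.0)
            + l2 * p_bi.get((bigram_context, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

p_uni = {"door": 0.0002}                  # hypothetical unigram estimate
p_bi = {("the", "door"): 0.0006}          # from the Google N-Gram slide
p_tri = {("close the", "door"): 0.05}     # from the Google N-Gram slide
print(interpolated_p("door", "the", "close the", p_uni, p_bi, p_tri))
```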
Train, Held-Out, Test
§ Want to maximize likelihood on test, not training data
§ Empirical n-grams won't generalize well
§ Models derived from counts/sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization
§ Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

Training data: counts/parameters estimated here
Held-out data: hyperparameters tuned here
Test data: evaluate here
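The grid-search option can be sketched as below. The held-out triples and all component probabilities are hypothetical stand-ins, and the toy data is so small that the search degenerately puts all weight on the trigram:

```python
import math

# Choose interpolation weights (λ3, λ2, λ1) maximizing held-out
# log-likelihood; counts/parameters would come from training data only.
held_out = [("close the", "the", "door"), ("close the", "the", "window")]
p_tri = {("close the", "door"): 0.05, ("close the", "window"): 0.05}
p_bi = {("the", "door"): 0.0006, ("the", "window"): 0.0005}
p_uni = {"door": 0.0002, "window": 0.0002}

def heldout_ll(l3, l2, l1):
    ll = 0.0
    for tri_ctx, bi_ctx, w in held_out:
        p = (l3 * p_tri.get((tri_ctx, w), 0.0)
             + l2 * p_bi.get((bi_ctx, w), 0.0)
             + l1 * p_uni.get(w, 0.0))
        ll += math.log(p)
    return ll

grid = [i / 10 for i in range(11)]
best = max(((l3, l2, 1 - l3 - l2) for l3 in grid for l2 in grid
            if l3 + l2 <= 1), key=lambda ls: heldout_ll(*ls))
print(best)  # (1.0, 0.0, 0.0) on this toy held-out set
```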
Idea 2: Discounting
§ Observation: N-grams occur more in training data than they will later

Empirical bigram counts (Church and Gale, 91):

Count in 22M words | Future c* (next 22M)
        1          |        0.45
        2          |        1.25
        3          |        2.24
        4          |        3.23
        5          |        4.21
Absolute Discounting
§ Absolute discounting
  § Reduce numerator counts by a constant d (e.g. 0.75)
  § Maybe have a special discount for small counts
  § Redistribute the "shaved" mass to a model of new events
§ Example formulation:

  P_abs(w | prev) = max(c(prev, w) - d, 0) / c(prev) + α(prev) P(w)
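A sketch of one possible bigram formulation, with the shaved mass redistributed over a unigram model; d = 0.75 is the constant from the bullet above, and the tiny corpus is just for the demo:

```python
from collections import Counter

# Absolute discounting for bigrams: shave d from each seen count and
# give the leftover mass α(prev) to a unigram model of new events.
def absolute_discount_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])
    unigram = Counter(tokens)
    total = sum(unigram.values())

    def p(w, prev):
        # d is shaved from each of the distinct words seen after prev,
        # so the leftover mass is d * (#types after prev) / c(prev).
        types_after = sum(1 for (u, _) in bigrams if u == prev)
        alpha = d * types_after / context[prev]
        return max(bigrams[(prev, w)] - d, 0) / context[prev] \
            + alpha * unigram[w] / total

    return p

p = absolute_discount_bigram("the cat sat on the mat".split())
print(p("cat", "the"))  # discounted bigram estimate plus unigram mass
```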
Idea 3: Fertility
§ Shannon game: "There was an unexpected _____"
  § "delay"?
  § "Francisco"?
§ Context fertility: number of distinct context types that a word occurs in
  § What is the fertility of "delay"?
  § What is the fertility of "Francisco"?
  § Which is more likely in an arbitrary new context?
Kneser-Ney Smoothing
§ Kneser-Ney smoothing combines two ideas
  § Discount and reallocate like absolute discounting
  § In the backoff model, word probabilities are proportional to context fertility, not frequency:

  P(w) ∝ |{w′ : c(w′, w) > 0}|

§ Theory and practice
  § Practice: KN smoothing has been repeatedly proven both effective and efficient
  § Theory: KN smoothing as approximate inference in a hierarchical Pitman-Yor process [Teh, 2006]
Kneser-Ney Details
§ All orders recursively discount and back off:

  P_k(w | prev_{k-1}) = max(c′(prev_{k-1}, w) - d, 0) / Σ_v c′(prev_{k-1}, v) + α(prev_{k-1}) P_{k-1}(w | prev_{k-2})

§ Alpha is computed to make the probability normalize (see if you can figure out an expression).
§ For the highest order, c′ is the token count of the n-gram. For all others it is the context fertility of the n-gram:

  c′(x) = |{u : c(u, x) > 0}|

§ The unigram base case does not need to discount.
§ Variants are possible (e.g. different d for low counts)
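A bigram instance of the recursion (k = 2), with the unigram base case built from context fertility as described; a minimal sketch on an invented toy corpus:

```python
from collections import Counter

# Bigram Kneser-Ney: discount token counts at the top order, back off
# to a fertility-based (continuation) unigram distribution.
def kn_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter(tokens[:-1])
    # fertility(w) = |{u : c(u, w) > 0}|, distinct left contexts of w
    fertility = Counter(w for _, w in bigrams)   # keys are bigram types
    total_bigram_types = len(bigrams)

    def p(w, prev):
        types_after_prev = sum(1 for (u, _) in bigrams if u == prev)
        alpha = d * types_after_prev / context_count[prev]
        p_continuation = fertility[w] / total_bigram_types
        return (max(bigrams[(prev, w)] - d, 0) / context_count[prev]
                + alpha * p_continuation)

    return p

p = kn_bigram("san francisco is a city , new york is a city".split())
print(p("york", "new"))       # seen bigram: discounted count + backoff
print(p("francisco", "new"))  # unseen: "francisco" has fertility 1
                              # (only "san"), so it stays unlikely here
```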
What Actually Works?
§ Trigrams and beyond:
  § Unigrams, bigrams generally useless
  § Trigrams much better
  § 4-, 5-grams and more are really useful in MT, but gains are more limited for speech
§ Discounting
  § Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc…
§ Context counting
  § Kneser-Ney construction of lower-order models
§ See [Chen+Goodman] reading for tons of graphs…

[Graph from Joshua Goodman]
Idea 4: Big Data

There's no data like more data.

Data >> Method?
§ Having more data is better…
§ …but so is using a better estimator
§ Another issue: N > 3 has huge costs in speech recognizers

[Graph: test entropy vs. n-gram order (1 to 20) for Katz and Kneser-Ney smoothing at training sizes of 100,000, 1,000,000, 10,000,000, and all words.]

Tons of Data?

[Brants et al, 2007]
What about…

Unknown Words?
§ What about totally unseen words?
§ Most LM applications are closed vocabulary
  § ASR systems will only propose words that are in their pronunciation dictionary
  § MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc)
§ In principle, one can build open vocabulary LMs
  § E.g. models over character sequences rather than word sequences
  § Back-off needs to go down into a "generate new word" model
  § Typically if you need this, a high-order character model will do
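A toy character-bigram sketch of the open-vocabulary idea, with add-one smoothing over an assumed fixed character inventory; the training string and inventory are invented for the demo:

```python
from collections import Counter

# A character-level model assigns probability to words never seen as
# whole units, so the vocabulary is effectively open.
def char_bigram_lm(text, alphabet):
    counts = Counter(zip(text, text[1:]))
    context = Counter(text[:-1])

    def p(word):
        prob = 1.0
        for a, b in zip(word, word[1:]):
            # add-one smoothing over the character inventory
            prob *= (counts[(a, b)] + 1) / (context[a] + len(alphabet))
        return prob

    return p

p = char_bigram_lm("the station signs are in english",
                   "abcdefghijklmnopqrstuvwxyz ")
print(p("synaptitute") > 0)  # True: unseen word, nonzero probability
```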
What’sinanN-Gram?§ Justabouteverylocalcorrelation!
§ Wordclassrestrictions:“willhavebeen___”§ Morphology:“she___”,“they___”§ Semanticclassrestrictions:“dancedthe___”§ Idioms:“addinsultto___”§ Worldknowledge:“icecapshave___”§ Popculture:“theempirestrikes___”
§ Butnotthelong-distanceones§ “Thecomputer whichIhadjustputintothemachineroomonthefifthfloor___.”
Linguistic Pain?
§ The N-Gram assumption hurts one's inner linguist!
  § Many linguistic arguments that language isn't regular
    § Long-distance dependencies
    § Recursive structure
§ Answers
  § N-grams only model local correlations, but they get them all
  § As N increases, they catch even more correlations
  § N-gram models scale much more easily than structured LMs
§ Not convinced?
  § Can build LMs out of our grammar models (later in the course)
  § Take any generative model with words at the bottom and marginalize out the other variables
What Gets Captured?
§ Bigram model:
  § [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  § [outside, new, car, parking, lot, of, the, agreement, reached]
  § [this, would, be, a, record, november]
§ PCFG model:
  § [This, quarter, 's, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
  § [It, could, be, announced, sometime, .]
  § [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]
Scaling Up?
§ There's a lot of training data out there…
  …next class we'll talk about how to make it fit.

Other Techniques?
§ Lots of other techniques
  § Maximum entropy LMs (soon)
  § Neural network LMs (soon)
  § Syntactic/grammar-structured LMs (much later)