Algorithms for NLP: demo.clab.cs.cmu.edu/11711fa18/slides/lecture_2_language_models_1.pdf
Language Modeling I
Algorithms for NLP
Taylor Berg-Kirkpatrick – CMU
Slides: Dan Klein – UC Berkeley
The Noisy-Channel Model
§ We want to predict a sentence given acoustics:
§ The noisy-channel approach:

  w_best = argmax_w P(w|a) = argmax_w P(a|w) P(w)

ASR Components
§ Language model P(w): distributions over sequences of words (sentences)
§ Acoustic model P(a|w): HMMs over word positions with mixtures of Gaussians as emissions

[Diagram: a source generates w with probability P(w); the channel produces the observed acoustics a with probability P(a|w); the decoder recovers the best w from a.]
Acoustic Confusions

the station signs are in deep in english        -14732
the stations signs are in deep in english       -14735
the station signs are in deep into english      -14739
the station's signs are in deep in english      -14740
the station signs are in deep in the english    -14741
the station signs are indeed in english         -14757
the station's signs are indeed in english       -14760
the station signs are indians in english        -14790
the station signs are indian in english         -14799
the stations signs are indians in english       -14807
the stations signs are indians and english      -14815
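Each hypothesis score above combines the language model and the acoustic model in the log domain. A minimal sketch of the decoding rule, where the candidate sequences and all probabilities are invented toy numbers, not taken from a real system:

```python
import math

# Noisy-channel decoding: pick w maximizing log P(a|w) + log P(w).
# Both tables below are hypothetical toy numbers for illustration.
lm_logprob = {                     # language model log P(w)
    "the station signs": math.log(1e-6),
    "the stations signs": math.log(2e-8),
}
channel_logprob = {                # acoustic model log P(a|w)
    "the station signs": math.log(0.02),
    "the stations signs": math.log(0.03),
}

best = max(lm_logprob, key=lambda w: channel_logprob[w] + lm_logprob[w])
print(best)  # the LM prior outweighs the small acoustic preference
```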
Translation: Codebreaking?

"Also knowing nothing official about, but having guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'"

Warren Weaver (1947)

MT System Components

  e_best = argmax_e P(e|f) = argmax_e P(f|e) P(e)

[Diagram: a source generates English e with probability P(e) (the language model); the channel produces the observed foreign sentence f with probability P(f|e) (the translation model); the decoder recovers the best e.]
Other Noisy-Channel Models?
§ We're not doing this only for ASR (and MT)
§ Grammar/spelling correction
§ Handwriting recognition, OCR
§ Document summarization
§ Dialog generation
§ Linguistic decipherment
§ …
Language Models
§ A language model is a distribution over sequences of words (sentences):

  P(w) = P(w_1 … w_n)

§ What's w? (closed vs. open vocabulary)
§ What's n? (must sum to one over all lengths)
§ Can have rich structure or be linguistically naive
§ Why language models?
  § Usually the point is to assign high weights to plausible sentences (cf. acoustic confusions)
  § This is not the same as modeling grammaticality
N-Gram Models

N-Gram Models
§ Use chain rule to generate words left-to-right:

  P(w_1 … w_n) = Π_i P(w_i | w_1 … w_{i-1})

§ Can't condition on the entire left context:

  P(??? | Turn to page 134 and look at the picture of the)

§ N-gram models make a Markov assumption:

  P(w_i | w_1 … w_{i-1}) ≈ P(w_i | w_{i-n+1} … w_{i-1})
Empirical N-Grams
§ How do we know P(w | history)?
§ Use statistics from data (examples using Google N-Grams)
  § E.g. what is P(door | the)?
§ This is the maximum likelihood estimate

Training counts:
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
   14112454 the door
-----------------
23135851162 the *
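The maximum likelihood estimate follows directly from the Google N-Gram counts shown above:

```python
# Maximum likelihood estimate: P(door | the) = c(the door) / c(the *)
c_the_door = 14112454
c_the_star = 23135851162
p = c_the_door / c_the_star
print(round(p, 4))  # 0.0006, the bigram estimate used on the next slide
```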
Increasing N-Gram Order
§ Higher orders capture more dependencies

Bigram model: P(door | the) = 0.0006
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
   14112454 the door
-----------------
23135851162 the *

Trigram model: P(door | close the) = 0.05
 197302 close the window
 191125 close the door
 152500 close the gap
 116451 close the thread
  87298 close the deal
-----------------
3785230 close the *
Increasing N-Gram Order: Sparsity

Please close the first door on the left.

 3380 please close the door
 1601 please close the window
 1164 please close the new
 1159 please close the gate
…
    0 please close the first
-----------------
13951 please close the *
Sparsity
§ Problems with n-gram models:
  § New words (open vocabulary)
    § Synaptitute
    § 132,701.03
    § multidisciplinarization
  § Old words in new contexts
§ Aside: Zipf's Law
  § Types (words) vs. tokens (word occurrences)
  § Broadly: most word types are rare ones
  § Specifically:
    § Rank word types by token frequency
    § Frequency inversely proportional to rank
  § Not special to language: randomly generated character strings have this property (try it!)
§ This law qualitatively (but rarely quantitatively) informs NLP
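The "try it!" takes only a few lines; the alphabet, corpus size, and seed below are arbitrary choices for a quick check:

```python
import random
from collections import Counter

# Generate random characters (space included), split on space into
# "words", and rank word types by token frequency.
random.seed(0)
text = "".join(random.choice("abc ") for _ in range(200000))
counts = Counter(text.split())
freqs = sorted(counts.values(), reverse=True)

# Broadly: most word types are rare ones, even in random text.
singletons = sum(1 for f in freqs if f == 1)
print(len(freqs), singletons)  # many types; a large share occur only once
```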
[Figure: fraction of test n-grams seen in training vs. number of training words (x-axis 0 to 1,000,000; y-axis 0 to 1), plotted separately for unigrams and bigrams.]
N-Gram Estimation
Smoothing
§ We often want to make estimates from sparse statistics:

P(w | denied the):
  3 allegations
  2 reports
  1 claims
  1 request
  7 total

§ Smoothing flattens spiky distributions so they generalize better:

P(w | denied the):
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2   other
  7 total

§ Very important all over NLP, but easy to do badly

[Bar charts: the unsmoothed estimate puts all mass on the four seen words; the smoothed one reserves mass for unseen words (charges, motion, benefits, …).]
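One way to read the smoothed counts above: shave a constant 0.5 from each seen count and give the shaved mass to unseen events (an illustrative reconstruction of this particular example, not a statement of the method used):

```python
# Shave d = 0.5 from each seen count; the shaved mass goes to "other".
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
d = 0.5
smoothed = {w: c - d for w, c in counts.items()}
smoothed["other"] = d * len(counts)   # 0.5 shaved from each of 4 types
print(smoothed)                       # 2.5 / 1.5 / 0.5 / 0.5 / 2, total 7
```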
Likelihood and Perplexity
§ How do we measure LM "goodness"?
§ Shannon's game: predict the next word

When I eat pizza, I wipe off the _________

grease 0.5
sauce  0.4
dust   0.05
…
mice   0.0001
…
the    1e-100

 3516 wipe off the excess
 1034 wipe off the dust
  547 wipe off the sweat
  518 wipe off the mouthpiece
…
  120 wipe off the grease
    0 wipe off the sauce
    0 wipe off the mice
-----------------
28048 wipe off the *

§ Formally: define test set (log) likelihood

  log P(X | θ) = Σ_{w ∈ X} log P(w | θ)

§ Perplexity: "average per-word branching factor"

  perp(X, θ) = exp(− log P(X | θ) / |X|)
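The two formulas above translate directly into code; the uniform-model check below illustrates why perplexity reads as a branching factor:

```python
import math

# log P(X|θ) = Σ_{w in X} log P(w|θ);  perp = exp(−log P(X|θ) / |X|)
def perplexity(word_logprobs):
    total_loglik = sum(word_logprobs)
    return math.exp(-total_loglik / len(word_logprobs))

# A model that spreads mass uniformly over V words has perplexity V:
V = 1000
print(perplexity([math.log(1 / V)] * 20))  # ≈ 1000.0
```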
Measuring Model Quality (Speech)
§ We really want better ASR (or whatever), not better perplexities
§ For speech, we care about word error rate (WER):

  WER = (insertions + deletions + substitutions) / true sentence size

Correct answer:    Andy saw a  part of the movie
Recognizer output: And he saw apart of the movie

  WER = 4/7 = 57%

§ Common issue: intrinsic measures like perplexity are easier to use, but extrinsic ones are more credible
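WER is the word-level edit distance divided by reference length. The slide does not prescribe an algorithm, but the standard Levenshtein dynamic program reproduces the 4/7:

```python
# Word error rate via word-level edit distance (Levenshtein DP).
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("Andy saw a part of the movie",
          "And he saw apart of the movie"))  # 4/7 ≈ 0.571, the 57% above
```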
Key Ideas for N-Gram LMs

Idea 1: Interpolation

Please close the first door on the left.

4-gram (specific but sparse): P(first | please close the) = 0.0
 3380 please close the door
 1601 please close the window
 1164 please close the new
 1159 please close the gate
…
    0 please close the first
-----------------
13951 please close the *

3-gram: P(first | close the) ≈ 0.002
 197302 close the window
 191125 close the door
 152500 close the gap
 116451 close the thread
…
   8662 close the first
-----------------
3785230 close the *

2-gram (dense but general): P(first | the) ≈ 0.009
  198015222 the first
  194623024 the same
  168504105 the following
  158562063 the world
…
-----------------
23135851162 the *
(Linear) Interpolation
§ Simplest way to mix different orders: linear interpolation

  P_interp(w | w-2 w-1) = λ3 P(w | w-2 w-1) + λ2 P(w | w-1) + λ1 P(w),  with λ3 + λ2 + λ1 = 1

§ How to choose lambdas?
  § Should lambda depend on the counts of the histories?
§ Choosing weights: either grid search or EM using held-out data
§ Better methods have interpolation weights connected to context counts, so you smooth more when you know less
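A minimal sketch of linear interpolation. The lambda values and the unigram estimate below are illustrative inventions; only P(door | the) = 0.0006 and P(door | close the) = 0.05 come from earlier slides:

```python
# Linear interpolation of unigram, bigram, and trigram estimates.
# Lambdas are illustrative (not tuned) and sum to one.
def interpolated_p(w, bigram_context, trigram_context,
                   p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    l3, l2, l1 = lambdas
    return (l3 * p_tri.get((trigram_context, w), 0.0)
            + l2 * p_bi.get((bigram_context, w), 0.0)
            + l1 * p_uni.get(w, 0.0))

p_uni = {"door": 0.0002}                  # hypothetical unigram estimate
p_bi = {("the", "door"): 0.0006}          # from the Google N-Gram slide
p_tri = {("close the", "door"): 0.05}     # from the Google N-Gram slide
print(interpolated_p("door", "the", "close the", p_uni, p_bi, p_tri))
```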
Train, Held-Out, Test
§ Want to maximize likelihood on test, not training data
§ Empirical n-grams won't generalize well
§ Models derived from counts/sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization
§ Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

Training data: counts/parameters estimated here
Held-out data: hyperparameters tuned here
Test data: evaluate here
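The grid-search option can be sketched as below. The held-out triples and all component probabilities are hypothetical stand-ins, and the toy data is so small that the search degenerately puts all weight on the trigram:

```python
import math

# Choose interpolation weights (λ3, λ2, λ1) maximizing held-out
# log-likelihood; counts/parameters would come from training data only.
held_out = [("close the", "the", "door"), ("close the", "the", "window")]
p_tri = {("close the", "door"): 0.05, ("close the", "window"): 0.05}
p_bi = {("the", "door"): 0.0006, ("the", "window"): 0.0005}
p_uni = {"door": 0.0002, "window": 0.0002}

def heldout_ll(l3, l2, l1):
    ll = 0.0
    for tri_ctx, bi_ctx, w in held_out:
        p = (l3 * p_tri.get((tri_ctx, w), 0.0)
             + l2 * p_bi.get((bi_ctx, w), 0.0)
             + l1 * p_uni.get(w, 0.0))
        ll += math.log(p)
    return ll

grid = [i / 10 for i in range(11)]
best = max(((l3, l2, 1 - l3 - l2) for l3 in grid for l2 in grid
            if l3 + l2 <= 1), key=lambda ls: heldout_ll(*ls))
print(best)  # (1.0, 0.0, 0.0) on this toy held-out set
```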
Idea 2: Discounting
§ Observation: N-grams occur more in training data than they will later

Empirical bigram counts (Church and Gale, 91):

Count in 22M words | Future c* (next 22M)
        1          |        0.45
        2          |        1.25
        3          |        2.24
        4          |        3.23
        5          |        4.21
Absolute Discounting
§ Absolute discounting
  § Reduce numerator counts by a constant d (e.g. 0.75)
  § Maybe have a special discount for small counts
  § Redistribute the "shaved" mass to a model of new events
§ Example formulation:

  P_abs(w | prev) = max(c(prev, w) - d, 0) / c(prev) + α(prev) P(w)
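A sketch of one possible bigram formulation, with the shaved mass redistributed over a unigram model; d = 0.75 is the constant from the bullet above, and the tiny corpus is just for the demo:

```python
from collections import Counter

# Absolute discounting for bigrams: shave d from each seen count and
# give the leftover mass α(prev) to a unigram model of new events.
def absolute_discount_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context = Counter(tokens[:-1])
    unigram = Counter(tokens)
    total = sum(unigram.values())

    def p(w, prev):
        # d is shaved from each of the distinct words seen after prev,
        # so the leftover mass is d * (#types after prev) / c(prev).
        types_after = sum(1 for (u, _) in bigrams if u == prev)
        alpha = d * types_after / context[prev]
        return max(bigrams[(prev, w)] - d, 0) / context[prev] \
            + alpha * unigram[w] / total

    return p

p = absolute_discount_bigram("the cat sat on the mat".split())
print(p("cat", "the"))  # discounted bigram estimate plus unigram mass
```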
Idea 3: Fertility
§ Shannon game: "There was an unexpected _____"
  § "delay"?
  § "Francisco"?
§ Context fertility: number of distinct context types that a word occurs in
  § What is the fertility of "delay"?
  § What is the fertility of "Francisco"?
  § Which is more likely in an arbitrary new context?
Kneser-Ney Smoothing
§ Kneser-Ney smoothing combines two ideas
  § Discount and reallocate like absolute discounting
  § In the backoff model, word probabilities are proportional to context fertility, not frequency:

  P(w) ∝ |{w′ : c(w′, w) > 0}|

§ Theory and practice
  § Practice: KN smoothing has been repeatedly proven both effective and efficient
  § Theory: KN smoothing as approximate inference in a hierarchical Pitman-Yor process [Teh, 2006]
Kneser-Ney Details
§ All orders recursively discount and back off:

  P_k(w | prev_{k-1}) = max(c′(prev_{k-1}, w) - d, 0) / Σ_v c′(prev_{k-1}, v) + α(prev_{k-1}) P_{k-1}(w | prev_{k-2})

§ Alpha is computed to make the probability normalize (see if you can figure out an expression).
§ For the highest order, c′ is the token count of the n-gram. For all others it is the context fertility of the n-gram:

  c′(x) = |{u : c(u, x) > 0}|

§ The unigram base case does not need to discount.
§ Variants are possible (e.g. different d for low counts)
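A bigram instance of the recursion (k = 2), with the unigram base case built from context fertility as described; a minimal sketch on an invented toy corpus:

```python
from collections import Counter

# Bigram Kneser-Ney: discount token counts at the top order, back off
# to a fertility-based (continuation) unigram distribution.
def kn_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_count = Counter(tokens[:-1])
    # fertility(w) = |{u : c(u, w) > 0}|, distinct left contexts of w
    fertility = Counter(w for _, w in bigrams)   # keys are bigram types
    total_bigram_types = len(bigrams)

    def p(w, prev):
        types_after_prev = sum(1 for (u, _) in bigrams if u == prev)
        alpha = d * types_after_prev / context_count[prev]
        p_continuation = fertility[w] / total_bigram_types
        return (max(bigrams[(prev, w)] - d, 0) / context_count[prev]
                + alpha * p_continuation)

    return p

p = kn_bigram("san francisco is a city , new york is a city".split())
print(p("york", "new"))       # seen bigram: discounted count + backoff
print(p("francisco", "new"))  # unseen: "francisco" has fertility 1
                              # (only "san"), so it stays unlikely here
```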
What Actually Works?
§ Trigrams and beyond:
  § Unigrams, bigrams generally useless
  § Trigrams much better
  § 4-, 5-grams and more are really useful in MT, but gains are more limited for speech
§ Discounting
  § Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc…
§ Context counting
  § Kneser-Ney construction of lower-order models
§ See [Chen+Goodman] reading for tons of graphs…

[Graph from Joshua Goodman]
Idea 4: Big Data

There's no data like more data.

Data >> Method?
§ Having more data is better…
§ …but so is using a better estimator
§ Another issue: N > 3 has huge costs in speech recognizers

[Graph: test entropy vs. n-gram order (1 to 20) for Katz and Kneser-Ney smoothing at training sizes of 100,000, 1,000,000, 10,000,000, and all words.]

Tons of Data?

[Brants et al, 2007]
What about…

Unknown Words?
§ What about totally unseen words?
§ Most LM applications are closed vocabulary
  § ASR systems will only propose words that are in their pronunciation dictionary
  § MT systems will only propose words that are in their phrase tables (modulo special models for numbers, etc)
§ In principle, one can build open vocabulary LMs
  § E.g. models over character sequences rather than word sequences
  § Back-off needs to go down into a "generate new word" model
  § Typically if you need this, a high-order character model will do
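A toy character-bigram sketch of the open-vocabulary idea, with add-one smoothing over an assumed fixed character inventory; the training string and inventory are invented for the demo:

```python
from collections import Counter

# A character-level model assigns probability to words never seen as
# whole units, so the vocabulary is effectively open.
def char_bigram_lm(text, alphabet):
    counts = Counter(zip(text, text[1:]))
    context = Counter(text[:-1])

    def p(word):
        prob = 1.0
        for a, b in zip(word, word[1:]):
            # add-one smoothing over the character inventory
            prob *= (counts[(a, b)] + 1) / (context[a] + len(alphabet))
        return prob

    return p

p = char_bigram_lm("the station signs are in english",
                   "abcdefghijklmnopqrstuvwxyz ")
print(p("synaptitute") > 0)  # True: unseen word, nonzero probability
```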
What’sinanN-Gram?§ Justabouteverylocalcorrelation!
§ Wordclassrestrictions:“willhavebeen___”§ Morphology:“she___”,“they___”§ Semanticclassrestrictions:“dancedthe___”§ Idioms:“addinsultto___”§ Worldknowledge:“icecapshave___”§ Popculture:“theempirestrikes___”
§ Butnotthelong-distanceones§ “Thecomputer whichIhadjustputintothemachineroomonthefifthfloor___.”
Linguistic Pain?
§ The N-Gram assumption hurts one's inner linguist!
  § Many linguistic arguments that language isn't regular
    § Long-distance dependencies
    § Recursive structure
§ Answers
  § N-grams only model local correlations, but they get them all
  § As N increases, they catch even more correlations
  § N-gram models scale much more easily than structured LMs
§ Not convinced?
  § Can build LMs out of our grammar models (later in the course)
  § Take any generative model with words at the bottom and marginalize out the other variables
What Gets Captured?
§ Bigram model:
  § [texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
  § [outside, new, car, parking, lot, of, the, agreement, reached]
  § [this, would, be, a, record, november]
§ PCFG model:
  § [This, quarter, 's, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
  § [It, could, be, announced, sometime, .]
  § [Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]
Scaling Up?
§ There's a lot of training data out there…
  …next class we'll talk about how to make it fit.

Other Techniques?
§ Lots of other techniques
  § Maximum entropy LMs (soon)
  § Neural network LMs (soon)
  § Syntactic/grammar-structured LMs (much later)