
Page 1:

1. How do we represent the meaning of a word?

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.

• the idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

Commonest linguistic way of thinking of meaning:

signifier (symbol) ⟺ signified (idea or thing) = denotation

Page 2:

How do we have usable meaning in a computer? Common solution: Use e.g. WordNet, a resource containing lists of synonym sets and hypernyms (“is a” relationships).

e.g. synonym sets containing “good”:

(adj) full, good
(adj) estimable, good, honorable, respectable
(adj) beneficial, good
(adj) good, just, upright
(adj) adept, expert, good, practiced, proficient, skillful
(adj) dear, good, near
(adj) good, right, ripe …
(adv) well, good
(adv) thoroughly, soundly, good
(n) good, goodness
(n) commodity, trade good, good

e.g. hypernyms of “panda”:

[Synset('procyonid.n.01'), Synset('carnivore.n.01'), Synset('placental.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('chordate.n.01'), Synset('animal.n.01'), Synset('organism.n.01'), Synset('living_thing.n.01'), Synset('whole.n.02'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
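Both lists can be reproduced with NLTK's WordNet interface; a minimal sketch, assuming nltk and its WordNet corpus are installed:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet', quiet=True)  # one-time corpus download

# Hypernym ("is a") chain for the first noun sense of "panda"
panda = wn.synset('panda.n.01')
hyper = lambda s: s.hypernyms()
print(list(panda.closure(hyper)))

# Synonym sets containing "good", with their parts of speech
for synset in wn.synsets('good'):
    print(synset.pos(), [lemma.name() for lemma in synset.lemmas()])
```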

Page 3:

Problems with resources like WordNet

• Great as a resource but missing nuance

• e.g. “proficient” is listed as a synonym for “good”. This is only correct in some contexts.

• Missing new meanings of words

• e.g. wicked, badass, nifty, wizard, genius, ninja, bombest

• Impossible to keep up-to-date!

• Subjective

• Requires human labor to create and adapt

• Hard to compute accurate word similarity

Page 4:

Representing words as discrete symbols

In traditional NLP, we regard words as discrete symbols: hotel, conference, motel

Words can be represented by one-hot vectors:

motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

Vector dimension = number of words in vocabulary (e.g. 500,000). “One-hot” means one 1, the rest 0s.
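A minimal numpy sketch of one-hot encoding with a toy vocabulary (the words and size are illustrative):

```python
import numpy as np

# Toy vocabulary; a real one might have ~500,000 entries
vocab = ["conference", "hotel", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index, 0s elsewhere."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel @ hotel)  # 0.0: distinct one-hot vectors are always orthogonal
```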

Page 5:

Problem with words as discrete symbols

Example: in web search, if a user searches for “Seattle motel”, we would like to match documents containing “Seattle hotel”.

But:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]

These two vectors are orthogonal. There is no natural notion of similarity for one-hot vectors!

Solution:
• Could rely on WordNet’s list of synonyms to get similarity?
• Instead: learn to encode similarity in the vectors themselves

Sec. 9.2.2


Page 6:

Representing words by their context

• Core idea: A word’s meaning is given by the words that frequently appear close-by

• “You shall know a word by the company it keeps” (J. R. Firth 1957: 11)

• One of the most successful ideas of modern statistical NLP!

• When a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window).

• Use the many contexts of w to build up a representation of w

…government debt problems turning into banking crises as happened in 2009…
…saying that Europe needs unified banking regulation to replace the hodgepodge…

…India has just given its banking system a shot in the arm…

These context words will represent banking
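A small sketch of collecting such fixed-size context windows, using the three corpus lines above as a toy corpus:

```python
corpus = [
    "government debt problems turning into banking crises as happened in 2009",
    "saying that Europe needs unified banking regulation to replace the hodgepodge",
    "India has just given its banking system a shot in the arm",
]

def context_windows(sentences, target, window=2):
    """Collect the words within a fixed-size window around each occurrence of target."""
    contexts = []
    for sentence in sentences:
        tokens = sentence.split()
        for t, word in enumerate(tokens):
            if word == target:
                contexts.append(tokens[max(0, t - window):t] + tokens[t + 1:t + 1 + window])
    return contexts

print(context_windows(corpus, "banking"))
# [['turning', 'into', 'crises', 'as'],
#  ['needs', 'unified', 'regulation', 'to'],
#  ['given', 'its', 'system', 'a']]
```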

Page 7:

Word vectors

We will build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts.

Note: word vectors are sometimes called word embeddings or word representations.

linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

Page 8:

2. Word2vec: Overview

Word2vec (Mikolov et al. 2013) is a framework for learning word vectors. Idea:

• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

Page 9:

Word2Vec Overview

• Example windows and process for computing $P(w_{t+j} \mid w_t)$

…problems turning into banking crises as…

center word at position t: into
outside context words in window of size 2: problems, turning (left); banking, crises (right)

probabilities computed: $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$


Page 11:

Word2vec: objective function

For each position $t = 1, \ldots, T$, predict context words within a window of fixed size m, given center word $w_t$.

Likelihood:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta)$$

where $\theta$ is all variables to be optimized.

The objective function $J(\theta)$ is the (average) negative log likelihood, sometimes called the cost or loss function:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Minimizing objective function ⟺ Maximizing predictive accuracy
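A direct transcription of this objective into code can make the double sum concrete; here log_prob is a placeholder for $\log P(w_{t+j} \mid w_t; \theta)$, which the next slides define via the softmax:

```python
def negative_log_likelihood(tokens, log_prob, m=2):
    """Average negative log likelihood J(theta) over a tokenized corpus.

    `log_prob(outside, center)` stands in for log P(outside | center; theta).
    """
    total = 0.0
    T = len(tokens)
    for t in range(T):                  # each position t = 1, ..., T
        for j in range(-m, m + 1):      # window of fixed size m, skipping j = 0
            if j != 0 and 0 <= t + j < T:
                total += log_prob(tokens[t + j], tokens[t])
    return -total / T
```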

Page 12:

Word2vec: objective function

• We want to minimize the objective function:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta)$$

• Question: How to calculate $P(w_{t+j} \mid w_t; \theta)$?

• Answer: We will use two vectors per word w:

• $v_w$ when w is a center word

• $u_w$ when w is a context word

• Then for a center word c and a context word o:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Page 13:

Word2Vec Overview with Vectors

• Example windows and process for computing $P(w_{t+j} \mid w_t)$
• $P(u_{problems} \mid v_{into})$ is short for $P(\textit{problems} \mid \textit{into};\, u_{problems}, v_{into}, \theta)$

…problems turning into banking crises as…

center word at position t: into
outside context words in window of size 2: problems, turning (left); banking, crises (right)

probabilities computed: $P(u_{problems} \mid v_{into})$, $P(u_{turning} \mid v_{into})$, $P(u_{banking} \mid v_{into})$, $P(u_{crises} \mid v_{into})$

Page 14:

Word2Vec Overview with Vectors

• Example windows and process for computing $P(w_{t+j} \mid w_t)$

…problems turning into banking crises as…

center word at position t: banking
outside context words in window of size 2: turning, into (left); crises, as (right)

probabilities computed: $P(u_{turning} \mid v_{banking})$, $P(u_{into} \mid v_{banking})$, $P(u_{crises} \mid v_{banking})$, $P(u_{as} \mid v_{banking})$

Page 15:

Word2vec: prediction function

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

• This is an example of the softmax function $\mathbb{R}^n \to \mathbb{R}^n$:

$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$

• The softmax function maps arbitrary values $x_i$ to a probability distribution $p_i$
• “max” because it amplifies the probability of the largest $x_i$
• “soft” because it still assigns some probability to smaller $x_i$
• Frequently used in Deep Learning

In $P(o \mid c)$, the dot product compares the similarity of o and c: a larger dot product means a larger probability. After taking the exponent, we normalize over the entire vocabulary.
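A minimal numpy sketch of this prediction function; the toy sizes and the random matrices U and V (one row per vocabulary word) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 5, 4                  # toy vocabulary size and vector dimension
U = rng.normal(size=(V_size, d))  # u_w: context-word vectors
V = rng.normal(size=(V_size, d))  # v_w: center-word vectors

def predict(c):
    """P(o | c) for every candidate outside word o, via the softmax."""
    scores = U @ V[c]                            # dot products u_o^T v_c
    exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
    return exp_scores / exp_scores.sum()         # normalize over the vocabulary

print(predict(2))        # a probability distribution over the vocabulary
print(predict(2).sum())  # 1.0
```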

Page 16:

To train the model: Compute all vector gradients!

• Recall: $\theta$ represents all model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words, $\theta \in \mathbb{R}^{2dV}$
• Remember: every word has two vectors
• We then optimize these parameters
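Concretely, with illustrative word names, $\theta$ stacks every word's two vectors:

$$\theta = \begin{bmatrix} v_{\text{aardvark}} \\ v_{\text{a}} \\ \vdots \\ v_{\text{zebra}} \\ u_{\text{aardvark}} \\ u_{\text{a}} \\ \vdots \\ u_{\text{zebra}} \end{bmatrix} \in \mathbb{R}^{2dV}$$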

Page 17:

3. Derivations of gradient

• The basic Lego piece

• Useful basics:

• If in doubt: write out with indices

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then: $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$


Page 18:

Chain Rule

• Chain rule! If y = f(u) and u = g(x), i.e. y = f(g(x)), then: $\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$

• Simple example:

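A worked instance with an illustrative function:

$$y = (3x^2 + 5)^4: \quad u = 3x^2 + 5,\; y = u^4 \;\Rightarrow\; \frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx} = 4u^3 \cdot 6x = 24x\,(3x^2 + 5)^3$$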

Page 19:

Note

Let’s derive the gradient for the center word together. For one example window and one example outside word:

You then also need the gradient for context words (it’s similar; left for homework). That’s all of the parameters θ here.
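Filling in the derivation the slide walks through, using the softmax form of $P(o \mid c)$ from earlier:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = \frac{\partial}{\partial v_c}\left[ u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \right] = u_o - \sum_{x \in V} \frac{\exp(u_x^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}\, u_x = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$

That is, the gradient is the observed context vector minus the expected context vector under the model's current distribution.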

Page 20:

Calculating all gradients!

• We went through the gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally in each window we will compute updates for all parameters that are being used in that window. For example:

…problems turning into banking crises as…

center word at position t: banking
outside context words in window of size 2: turning, into (left); crises, as (right)

probabilities computed: $P(u_{turning} \mid v_{banking})$, $P(u_{into} \mid v_{banking})$, $P(u_{crises} \mid v_{banking})$, $P(u_{as} \mid v_{banking})$

Page 21:

Word2vec: More details

Why two vectors? → Easier optimization. Average both at the end.

Two model variants:
1. Skip-grams (SG): Predict context (“outside”) words (position independent) given the center word
2. Continuous Bag of Words (CBOW): Predict the center word from a (bag of) context words

This lecture so far: the Skip-gram model

Additional efficiency in training:
1. Negative sampling

So far: focus on the naïve softmax (a simpler training method)

Page 22:

Gradient Descent

• We have a cost function $J(\theta)$ we want to minimize
• Gradient Descent is an algorithm to minimize $J(\theta)$
• Idea: for the current value of $\theta$, calculate the gradient of $J(\theta)$, then take a small step in the direction of the negative gradient. Repeat.

Note: Our objectives are not convex like this :(


Page 23:

Intuition

For a simple convex function over two parameters.

Contour lines show levels of the objective function.


Page 24:

Gradient Descent

• Update equation (in matrix notation):

$$\theta^{new} = \theta^{old} - \alpha \nabla_\theta J(\theta)$$

• Update equation (for a single parameter):

$$\theta_j^{new} = \theta_j^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta_j^{old}}$$

($\alpha$ = step size or learning rate)

• Algorithm:
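A sketch of the loop; evaluate_gradient stands in for computing $\nabla_\theta J(\theta)$, demonstrated here on a toy convex objective:

```python
import numpy as np

def gradient_descent(theta, evaluate_gradient, alpha=0.01, steps=1000):
    """Plain (batch) gradient descent: step against the gradient, repeat."""
    for _ in range(steps):
        theta_grad = evaluate_gradient(theta)  # gradient of J at current theta
        theta = theta - alpha * theta_grad     # small step in negative gradient direction
    return theta

# Toy demo on a convex J(theta) = ||theta||^2, whose gradient is 2*theta
theta = gradient_descent(np.array([3.0, -2.0]), lambda th: 2 * th)
print(theta)  # approaches [0, 0], the minimum
```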

Page 25:

Stochastic Gradient Descent

• Problem: $J(\theta)$ is a function of all windows in the corpus (potentially billions!)
• So $\nabla_\theta J(\theta)$ is very expensive to compute
• You would wait a very long time before making a single update!
• Very bad idea for pretty much all neural nets!
• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one.

• Algorithm:
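A sketch mirroring the batch loop above, but sampling one window per update; window_gradient is a placeholder for the per-window gradient:

```python
import random

def sgd(theta, windows, window_gradient, alpha=0.01, steps=10000):
    """Stochastic gradient descent: sample one window, update immediately, repeat."""
    for _ in range(steps):
        window = random.choice(windows)                         # sample a window
        theta = theta - alpha * window_gradient(window, theta)  # update right away
    return theta
```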



Page 27:

PSet 1: The skip-gram model and negative sampling

• From the paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

• Overall objective function (they maximize):

• The sigmoid function! (we’ll become good friends soon)

• So we maximize the probability of two words co-occurring in the first log term (see the objective below)
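Restated from that paper in the lecture's u/v notation, with k negative words drawn from a noise distribution $P(w)$, the maximized objective and the sigmoid are:

$$J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J_t(\theta), \qquad J_t(\theta) = \log \sigma\!\left(u_o^\top v_c\right) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\!\left[\log \sigma\!\left(-u_j^\top v_c\right)\right]$$

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$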

Page 28:

PSet 1: The skip-gram model and negative sampling

• Simpler notation, more similar to class and PSet (now written as a loss to minimize):

$$J_{\text{neg-sample}}(u_o, v_c) = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c)$$

where $w_1, \ldots, w_k$ are the sampled negative words.

• We take k negative samples.
• Maximize the probability that the real outside word appears; minimize the probability that random words appear around the center word.

• $P(w) = U(w)^{3/4} / Z$: the unigram distribution $U(w)$ raised to the 3/4 power, with $Z$ normalizing. (We provide this function in the starter code.)

• The power makes less frequent words be sampled more often
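A small numpy sketch of this sampling distribution; the toy counts are illustrative, and the starter code's provided function is the authoritative version:

```python
import numpy as np

counts = np.array([100.0, 10.0, 1.0])  # toy unigram counts for three words
U = counts / counts.sum()              # U(w): unigram distribution

P = U ** 0.75                          # raise to the 3/4 power ...
P = P / P.sum()                        # ... and renormalize (divide by Z)

print(U, P)  # P shifts probability mass toward the rarer words relative to U
rng = np.random.default_rng(0)
print(rng.choice(len(P), size=5, p=P))  # draw 5 negative samples
```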

Page 29:

PSet 1: The continuous bag of words model

• Main idea for continuous bag of words (CBOW): Predict the center word from the sum of surrounding word vectors, instead of predicting surrounding single words from the center word as in the skip-gram model

• To make the assignment slightly easier: Implementation of the CBOW model is not required (you can do it for a couple of bonus points!), but you do have to do the theory problem on CBOW.
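Since the implementation is optional, here is only a minimal sketch of the CBOW prediction step; W_in, W_out, and the toy sizes are illustrative names, not the PSet's API:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 5, 4
W_in = rng.normal(size=(vocab_size, d))   # vectors for context (input) words
W_out = rng.normal(size=(vocab_size, d))  # vectors for the predicted center word

def cbow_predict(context_ids):
    """Softmax distribution over center words, given summed context vectors."""
    h = W_in[context_ids].sum(axis=0)  # sum of surrounding word vectors
    scores = W_out @ h                 # compare against every candidate center word
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

print(cbow_predict([0, 1, 3, 4]))  # probability of each word being the center
```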