Computational Linguistics - The Stanford Natural Language ...

Computational Linguistics (aka Natural Language Processing) Bill MacCartney SymSys 100 Stanford University 26 May 2011 (some slides adapted from Chris Manning)

Transcript of Computational Linguistics - The Stanford Natural Language ...

Page 1: Computational Linguistics - The Stanford Natural Language ...

Computational Linguistics (aka Natural Language Processing)

Bill MacCartney SymSys 100

Stanford University 26 May 2011

(some slides adapted from Chris Manning)

Page 2: Computational Linguistics - The Stanford Natural Language ...

xkcd snarkiness

OK, Randall, it’s funny… but wrong!

cartoon from xkcd.com

Page 3: Computational Linguistics - The Stanford Natural Language ...

A word on terminology

If you call it…

•  Computational Linguistics (CL)
   –  …you’re a linguist!
   –  …you use computers to study language

•  Natural Language Processing (NLP)
   –  …you’re a computer scientist!
   –  …you work on applications involving language

But really, they’re pretty much synonymous

Page 4: Computational Linguistics - The Stanford Natural Language ...

Let’s get situated!

Today, we are in this interstice

cartoon from xkcd.com

Page 5: Computational Linguistics - The Stanford Natural Language ...

NLP: The Vision

I’m sorry, Dave. I can’t do that.

Oh, dear!

That is correct, captain.

Page 6: Computational Linguistics - The Stanford Natural Language ...

Language: the ultimate UI

Where is A Bug’s Life playing in Mountain View?

A Bug’s Life is playing at the Century 16 Theater.

When is it playing there?

It’s playing at 2pm, 5pm, and 8pm.

OK. I’d like 1 adult and 2 children for the first show. How much would that cost?

But we need domain knowledge, discourse knowledge, world knowledge (Not to mention linguistic knowledge!)

Page 7: Computational Linguistics - The Stanford Natural Language ...

NLP: Goals of the field

•  From the lofty…
   –  full-on natural language understanding
   –  participation in spoken dialogues
   –  open-domain question answering
   –  real-time bi-directional translation

•  …to the mundane
   –  identifying spam
   –  categorizing news stories (& other docs)
   –  finding & comparing product information on the web
   –  assessing sentiment toward products, brands, stocks, …

Predominant in recent years

Page 8: Computational Linguistics - The Stanford Natural Language ...

NLP in the commercial world

Powerset

Page 9: Computational Linguistics - The Stanford Natural Language ...

Current motivations for NLP

What’s driving NLP? Three trends:

•  The explosion of machine-readable natural language text
   –  Exabytes (10^18 bytes) of text, doubling every year or two
   –  Web pages, emails, IMs, SMSs, tweets, docs, PDFs, …
   –  Opportunity (and increasing necessity) to extract meaning

•  Mediation of human interactions by computers
   –  Opportunity for the computer in the loop to do much more

•  Growing role of language in human-computer interaction

Page 10: Computational Linguistics - The Stanford Natural Language ...

Further motivation for CL

One reason for studying language, and for me personally the most compelling reason, is that it is tempting to regard language, in the traditional phrase, as a “mirror of mind”.

Chomsky, 1975

For the same reason, computational linguistics is a compelling way to study psycholinguistics and language acquisition.

Sometimes, the best way to understand something is to build a model of it.

What I cannot create, I do not understand.

Feynman, 1988

Page 11: Computational Linguistics - The Stanford Natural Language ...

Early history: 50s and 60s

•  Foundational work on automata, formal languages, probabilistic modeling, and information theory

•  First speech systems (Davis et al., Bell Labs)

•  MT heavily funded by military, with huge overconfidence

•  But using machines dumber than a pocket calculator

•  Little understanding of syntax, semantics, pragmatics

•  ALPAC report (1966): crap, this is really hard!

Page 12: Computational Linguistics - The Stanford Natural Language ...

Refocusing: 70s and 80s

•  Foundational work on speech recognition: stochastic modeling, hidden Markov models, the “noisy channel”

•  Ideas from this work would later revolutionize NLP!

•  Logic programming, rules-driven AI, deterministic algorithms for syntactic parsing (e.g., LFG)

•  Increasing interest in natural language understanding: SHRDLU, LUNAR, CHAT-80

•  But symbolic AI hit the wall: “AI winter”

Page 13: Computational Linguistics - The Stanford Natural Language ...

The statistical revolution: 90s

•  Influx of new ideas from EE & ASR: probabilistic modeling, corpus statistics, supervised learning, empirical evaluation

•  New sources of data: explosion of machine-readable text; human-annotated training data (e.g., the Penn Treebank)

•  Availability of much more powerful machines

•  Lowered expectations: forget full semantic understanding, let’s do text cat, part-of-speech tagging, NER, and parsing!

Page 14: Computational Linguistics - The Stanford Natural Language ...

The rise of the machines: 00s

•  Consolidation of the gains of the statistical revolution

•  More sophisticated statistical modeling and machine learning algorithms: MaxEnt, SVMs, Bayes Nets, LDA, etc.

•  Big big data: 100x growth of web, massive server farms

•  Focus shifting from supervised to unsupervised learning

•  Revived interest in higher-level semantic applications

Page 15: Computational Linguistics - The Stanford Natural Language ...

Subfields and tasks

Text categorization    Coreference resolution    Question answering (QA)

Part-of-speech (POS) tagging    Word sense disambiguation (WSD)    Textual inference & paraphrase

Named entity recognition (NER)    Syntactic parsing    Summarization

Information extraction (IE)    Machine translation (MT)    Discourse & dialog

Sentiment analysis

Spam detection

mostly solved    making good progress    still really hard

Examples shown on the slide:

OK, let’s meet by the big… ✓

D1ck too small? Buy V1AGRA… ✗

Phillies shut down Rangers 2-0    SPORTS

Jobless rate hits two-year low    BUSINESS

Colorless green ideas sleep furiously.

ADJ ADJ NOUN VERB ADV

Obama met with UAW leaders in Detroit…

PERSON ORG LOC

You’re invited to our bunga bunga party, Friday May 27 at 8:30pm in Cordura Hall

Party May 27 add

The pho was authentic and yummy.

Waiter ignored us for 20 minutes.

Obama told Mubarak he shouldn’t run again.

I need new batteries for my mouse.

Our specialty is panda fried rice.

我们的专长是熊猫炒饭

Sheen continues rant against… Sheen continues rant against… Sheen continues rant against…

Sheen is nuts

Q. What currency is used in China?

A. The yuan

I can see Russia from my house!

T. Thirteen soldiers lost their lives…

H. Several troops were killed in the…    YES

Where is Thor playing in SF?

Metreon at 4:30 and 7:30

Semantic search

people protesting globalization    Search

…demonstrators stormed IMF offices…

Page 16: Computational Linguistics - The Stanford Natural Language ...

Why is NLP hard?

Natural language is:

•  highly ambiguous at all levels
•  complex, with recursive structures and coreference
•  subtle, exploiting context to convey meaning
•  fuzzy and vague
•  involves reasoning about the world
•  part of a social system: persuading, insulting, amusing, …

(Nevertheless, simple features often do half the job!)

Page 17: Computational Linguistics - The Stanford Natural Language ...

Meanings and expressions

soda

soft drink

pop

beverage

Coke

Page 18: Computational Linguistics - The Stanford Natural Language ...

One meaning, many expressions

To build a shopping search engine, you need to extract product information from vendors’ websites:

Image Capture Device: 1.68 million pixel 1/2-inch CCD sensor

Image Capture Device    Total Pixels Approx. 3.34 million    Effective Pixels Approx. 3.24 million

Image sensor    Total Pixels: Approx. 2.11 million-pixel

Imaging sensor    Total Pixels: Approx. 2.11 million 1,688 (H) x 1,248 (V)

CCD    Total Pixels: Approx. 3,340,000 (2,140 [H] x 1,560 [V])    Effective Pixels: Approx. 3,240,000 (2,088 [H] x 1,550 [V])    Recording Pixels: Approx. 3,145,000 (2,048 [H] x 1,536 [V])

These all came from the same vendor’s website!

Page 19: Computational Linguistics - The Stanford Natural Language ...

One meaning, many expressions

Or consider a semantic search application:

Russia increasing price of gas for Georgia    Search

Gazprom confirms two-fold increase in gas price for Georgia

Gazprom doubles Georgia's gas bill

Russia gas monopoly to double price of gas

Russia hits Georgia with huge rise in its gas bill

Russia plans to double Georgian gas price

Russia doubles gas bill to “punish” neighbour Georgia

Page 20: Computational Linguistics - The Stanford Natural Language ...

One expression, many meanings

cartoon from qwantz.com

Page 21: Computational Linguistics - The Stanford Natural Language ...

Syntactic & semantic ambiguity

Fruit flies like a banana

Fruit flies like a banana

[figure: two parse trees for the same sentence, with node labels S, NP, VP and S, NP, PP, VP]

syntactic ambiguity

semantic ambiguity

photos from worth1000.com

Page 22: Computational Linguistics - The Stanford Natural Language ...

Ambiguous headlines

Teacher Strikes Idle Kids
China to Orbit Human on Oct. 15
Red Tape Holds Up New Bridges
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Police: Crack Found in Man's Buttocks

Page 23: Computational Linguistics - The Stanford Natural Language ...

OK, why else is NLP hard?

Oh so many reasons!

non-standard English

Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥

segmentation issues

the New York-New Haven Railroad

the New York-New Haven Railroad

idioms

dark horse
get cold feet
lose face
throw in the towel

neologisms

unfriend
retweet
bromance
teabagger

garden path sentences

The man who hunts ducks out on weekends.

The cotton shirts are made from grows here.

tricky entity names

…a mutation on the for gene…

Where is A Bug’s Life playing…

Most of Let It Be was recorded…

world knowledge

Mary and Sue are sisters.

Mary and Sue are mothers.

prosody (each with stress on a different word)

I never said she stole my money.

I never said she stole my money.

I never said she stole my money.

lexical specificity

But that’s what makes it fun!

Page 24: Computational Linguistics - The Stanford Natural Language ...

So, how to make progress?

•  The task is difficult! What tools do we need?
   –  Knowledge about language
   –  Knowledge about the world
   –  A way to combine knowledge sources

•  The answer that’s been getting traction:
   –  probabilistic models built from language data
   –  P(“maison” → “house”) high
   –  P(“L’avocat général” → “the general avocado”) low

•  Some think this is a fancy new “A.I.” idea
   –  But really it’s an old idea stolen from the electrical engineers…

Page 25: Computational Linguistics - The Stanford Natural Language ...

Machine translation (MT)

美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。

The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.

•  The classic acid test for natural language processing.

•  Requires capabilities in both interpretation and generation.

•  About $10 billion spent annually on human translation.

Page 26: Computational Linguistics - The Stanford Natural Language ...

Empirical solution

Parallel Texts: The Rosetta Stone

Hieroglyphs

Demotic

Greek

Page 27: Computational Linguistics - The Stanford Natural Language ...

Empirical solution

Parallel Texts:
–  Hong Kong Legislation
–  Macao Legislation
–  Canadian Parliament Hansards
–  United Nations Reports
–  European Parliament
–  Instruction Manuals
–  Multinational company websites

Hmm, every time one sees “banco”, translation is “bank” or “bench”… If it’s “banco de…”, it always becomes “bank”, never “bench”…

slide from Kevin Knight

Page 28: Computational Linguistics - The Stanford Natural Language ...

Sindarin-English

I amar prestar aen.    The world is changed.

Han mathon ne nen.    I feel it in the waters.

Han mathon ne chae.    I feel it in the earth.

A han noston ned 'wilith.    I smell it in the air.

Fellowship of the Ring movie script

slide from Lori Levin

Page 29: Computational Linguistics - The Stanford Natural Language ...

Statistical MT

Suppose we had a probabilistic model of translation P(e | f)

Example: suppose f is de rien

P(you’re welcome | de rien) = 0.45
P(nothing | de rien) = 0.13
P(piddling | de rien) = 0.01
P(underpants | de rien) = 0.000000001

Then the best translation for f is argmax_e P(e | f)
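
As a minimal sketch of the argmax step, the four probabilities listed for "de rien" can be placed in a dictionary and the most probable English candidate selected. The dictionary and function names are illustrative choices, not part of the slides.

```python
# P(e | f) for f = "de rien", copied from the example on this slide.
p_e_given_f = {
    "you're welcome": 0.45,
    "nothing": 0.13,
    "piddling": 0.01,
    "underpants": 0.000000001,
}


def best_translation(candidates):
    """Return the candidate e with the highest P(e | f)."""
    return max(candidates, key=candidates.get)


best = best_translation(p_e_given_f)
print(best)
```

A real system cannot enumerate every English sentence this way; the sketch only shows what "argmax over e" means for a small candidate set.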

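Page 30: Computational Linguistics - The Stanford Natural Language ...

A Bayesian approach

$$
\begin{aligned}
\hat{e} &= \arg\max_e P(e \mid f) \\
        &= \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)} \\
        &= \arg\max_e P(f \mid e)\, P(e)
\end{aligned}
$$

(P(f) does not depend on e, so it can be dropped from the argmax.)

P(f | e): translation model (fidelity)

P(e): language model (fluency)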
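
The following sketch illustrates why the denominator P(f) can be ignored: dividing every candidate's score by the same constant does not change which candidate wins. All numbers below are invented for illustration and do not come from the slides; the function names are likewise hypothetical.

```python
# Hypothetical scores for three English candidates e and one fixed French input f.
# Every number here is made up purely to illustrate the decision rule.
translation_model = {  # P(f | e): fidelity
    "I am so hungry": 0.20,
    "Hungry I am so": 0.20,
    "I am so tired": 0.01,
}
language_model = {  # P(e): fluency
    "I am so hungry": 1e-6,
    "Hungry I am so": 1e-9,
    "I am so tired": 2e-6,
}


def decode(p_f_given_e, p_e):
    """Pick the e maximizing P(f | e) * P(e)."""
    return max(p_f_given_e, key=lambda e: p_f_given_e[e] * p_e[e])


def decode_with_evidence(p_f_given_e, p_e, p_f):
    """Same rule, but dividing by P(f), which is constant across candidates."""
    return max(p_f_given_e, key=lambda e: p_f_given_e[e] * p_e[e] / p_f)


best = decode(translation_model, language_model)
same = decode_with_evidence(translation_model, language_model, p_f=3e-7)
print(best, "|", same)
```

In this toy setup the translation model cannot tell the faithful candidates apart, and the language model breaks the tie in favor of the fluent word order.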

Page 31: Computational Linguistics - The Stanford Natural Language ...

The “noisy channel” model

illustration from Jurafsky & Martin

Page 32: Computational Linguistics - The Stanford Natural Language ...

Language models (LMs)

•  Noisy channel model requires language model P(e)

•  LM tells us which sentences seem likely or “good”

•  Given some candidate translations, LM helps with:
   –  word choice (“shrank from” or “shrank of”?)
   –  word ordering (“tough decisions” or “decisions tough”?)

sentence                                P(e)
He shrank from tough decisions.         1.89e-11
He shrank from important decisions.     9.46e-12
He shrank of tough decisions.           7.11e-16
He shrank from decisions tough.         3.21e-17

Page 33: Computational Linguistics - The Stanford Natural Language ...

Statistical language models

•  Where will the language model come from?

•  We’ll build it by counting things in corpus data!

•  Statistical estimation of model parameters

•  But we can’t just count whole sentences

sentence                                count      P(e)
He shrank from tough decisions.         1/49208    2.03e-05    too high!
He shrank from important decisions.     0/49208    0           too low!
He shrank of tough decisions.           0/49208    0           too low!
He shrank from decisions tough.         0/49208    0           too low!
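
To see the problem with counting whole sentences, here is a small sketch using a made-up four-sentence corpus (the corpus and names are assumptions for illustration only). Any sentence that happens to appear gets a large share of the probability mass, and every unseen sentence gets exactly zero.

```python
from collections import Counter

# A made-up corpus, purely for illustration.
corpus = [
    "he shrank from tough decisions",
    "she made tough decisions",
    "he shrank from the cold",
    "she made important decisions",
]
sentence_counts = Counter(corpus)


def sentence_prob(sentence):
    """Relative-frequency estimate: count(sentence) / number of sentences."""
    return sentence_counts[sentence] / len(corpus)


seen = sentence_prob("he shrank from tough decisions")
unseen = sentence_prob("he shrank from important decisions")
print(seen, unseen)
```

The unseen sentence is perfectly good English, yet this estimator rules it out entirely.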

Page 34: Computational Linguistics - The Stanford Natural Language ...

N-gram language models

•  Instead, we’ll break things into pieces

P(He shrank from tough decisions) = P(He | •) × P(shrank | He) × P(from | shrank) × … × P(decisions | tough)

•  This is called a bigram language model

•  We can estimate bigram probabilities from corpus

w1        w2        C(w1)    C(w1 w2)    P(w2 | w1)
•         He        49208    978         0.0199
He        shrank    53142    21          0.0004
shrank    from      122      17          0.1393
from      tough     18777    184         0.0098
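
The estimates in the table follow from dividing each bigram count by the count of its first word, P(w2 | w1) = C(w1 w2) / C(w1). The sketch below recomputes them from the counts on this slide and multiplies the four available factors. The token "<s>" is an assumed stand-in for the sentence-start symbol written as "•", and the function names are illustrative.

```python
from math import prod

# (w1, w2) -> (C(w1), C(w1 w2)), copied from the table on this slide.
# "<s>" is an assumed stand-in for the sentence-start symbol "•".
counts = {
    ("<s>", "He"): (49208, 978),
    ("He", "shrank"): (53142, 21),
    ("shrank", "from"): (122, 17),
    ("from", "tough"): (18777, 184),
}


def bigram_prob(w1, w2):
    """Relative-frequency estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    c_w1, c_w1w2 = counts[(w1, w2)]
    return c_w1w2 / c_w1


def prefix_prob(words):
    """Product of bigram probabilities over a word sequence, starting at <s>."""
    tokens = ["<s>"] + words
    return prod(bigram_prob(a, b) for a, b in zip(tokens, tokens[1:]))


for (w1, w2) in counts:
    print(w1, w2, round(bigram_prob(w1, w2), 4))

partial = prefix_prob(["He", "shrank", "from", "tough"])
print(partial)
```

The table does not give counts for P(decisions | tough), so the product covers only the first four factors of the sentence probability.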

Page 35: Computational Linguistics - The Stanford Natural Language ...

Statistical translation models

•  Noisy channel also needs translation model P(f | e)

•  Similar strategy: break sentence pairs into phrases

•  Count co-occurring pairs in a large parallel corpus

•  (But I’ll skip the gory details…)

e                  f                       C(e)     C(e, f)    P(f | e)
he shrank          il lui répugnait        17       6          0.3529
from               de                      27111    17855      0.6586
from               des                     27111    6434       0.2373
tough decisions    décisions difficiles    98       81         0.8265
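
The phrase translation probabilities in the table are again relative frequencies, P(f | e) = C(e, f) / C(e). A minimal sketch, using only the counts shown on this slide (the data structure and function names are assumptions), recomputes them and looks up the most likely French phrase for an English phrase:

```python
# (e, f) -> (C(e), C(e, f)), copied from the table on this slide.
phrase_counts = {
    ("he shrank", "il lui répugnait"): (17, 6),
    ("from", "de"): (27111, 17855),
    ("from", "des"): (27111, 6434),
    ("tough decisions", "décisions difficiles"): (98, 81),
}


def phrase_prob(e, f):
    """Relative-frequency estimate P(f | e) = C(e, f) / C(e)."""
    c_e, c_ef = phrase_counts[(e, f)]
    return c_ef / c_e


def most_likely_french(e):
    """Among the French phrases listed for e, return the one with highest P(f | e)."""
    options = [f for (e2, f) in phrase_counts if e2 == e]
    return max(options, key=lambda f: phrase_prob(e, f))


print(round(phrase_prob("from", "de"), 4))
print(most_likely_french("from"))
```

Phrase extraction and alignment, the "gory details" the slide skips, are not modeled here; the sketch starts from counts that are already given.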

Page 36: Computational Linguistics - The Stanford Natural Language ...

Statistical MT Systems

French/English Parallel Texts → Statistical Analysis → Translation Model P(f | e)

Michelle, ma belle, sont les mots qui vont très bien ensemble

Michelle, my beautiful, are words that go together well

English Texts → Statistical Analysis → Language Model P(e)

Many great traditions in art originated in the art of one of the five…

French → Translation Model P(f | e) → Broken English → Language Model P(e) → English

J’ai très faim

What hunger have I, Hungry I am so, I am so hungry, Have I that hunger…

I am so hungry

Decoding algorithm: argmax_e P(f | e) P(e)

Page 37: Computational Linguistics - The Stanford Natural Language ...

Applications of the noisy channel

This model can be applied to many different problems!

ê = argmax_e P(x | e) P(e)

Channel model                  Language model
speech production              English words
OCR                            English words
typing with spelling errors    English words
translating to English         English words

(Widely used at Google, for example)

Page 38: Computational Linguistics - The Stanford Natural Language ...

If you like NLP / CompLing…

•  learn Java or Python (and play with JavaNLP or NLTK)
•  study logic, probability, statistics, linear algebra
•  get some exposure to linguistics (LING1, …)
•  study AI and machine learning (CS121, CS221, CS229)

•  read Jurafsky & Martin or Manning & Schütze

•  CS124: From Language to Information (Jurafsky)

•  CS224N: Natural Language Processing (Manning)

•  CS224S: Speech Recognition & Synthesis (Jurafsky)
•  CS224U: Natural Language Processing (MacCartney)

Page 39: Computational Linguistics - The Stanford Natural Language ...

One more for the road

cartoon from qwantz.com