600.465 Connecting the dots - II (NLP in Practice)


600.465 Connecting the dots (aka NLP in the Real World)

600.465 Connecting the dots - II (NLP in Practice)
Delip Rao, [email protected]

Last class:
Understood the general methodology and approaches for solving NLP tasks
End-to-end development using an example task: Named Entity Recognition

Shared Tasks: NLP in practice
Shared tasks (aka evaluations):
Everybody works on a (mostly) common dataset
Evaluation measures are defined
Participants get ranked on the evaluation measures
Advance the state of the art and set benchmarks
Tasks involve common hard problems or new interesting problems

Person Name Disambiguation

Photographer

Computational Linguist

Physicist

Psychologist

Sculptor

Biologist

Musician

CEO

Tennis Player

Theologist

Pastor

Rao, Garera & Yarowsky, 2007

Clustering using web snippets

Test Doc 1

Test Doc 2

Test Doc 3

Test Doc 4

Test Doc 5

Test Doc 6

Goal: To cluster 100 given test documents for the name David Smith

Step 1: Extract the top 1000 snippets from Google
Step 2: Cluster all 1100 documents together
Step 3: Extract the clustering of the test documents

Rao, Garera & Yarowsky, 2007
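A minimal sketch of this snippet-augmented clustering recipe, assuming TF-IDF bag-of-words features and agglomerative clustering via scikit-learn; the helper name, the number of clusters, and the already-fetched snippets list are illustrative assumptions, not details from the paper.

```python
# Sketch of snippet-augmented clustering (illustrative only, not the paper's code).
# Assumes `test_docs` is a list of 100 raw document strings and `snippets` is a
# list of ~1000 web-snippet strings already retrieved for the query name.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

def cluster_with_snippets(test_docs, snippets, n_clusters=10):
    all_texts = test_docs + snippets          # Step 2: pool the 1100 documents
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(all_texts).toarray()
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    # Step 3: keep only the cluster assignments of the original test documents
    return labels[:len(test_docs)]
```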

Web Snippets for Disambiguation
Snippets contain high-quality, low-noise features
Easy to extract
Derived from sources other than the document (e.g., link text)

Rao, Garera & Yarowsky, 2007

Term bridging via Snippets

Document 1

Contains term: 780 492-9920

A snippet that contains both terms, 780 492-9920 and T6G2H1, can serve as a bridge for clustering Document 1 and Document 2 together.

Document 2

Contains term: T6G2H1

Rao, Garera & Yarowsky, 2007

Evaluating Clustering output

Dispersion: Inter-cluster

Silhouette: Intra-cluster
Other metrics: Purity, Entropy, V-measure
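For concreteness, a small sketch of cluster purity, one of the metrics listed above; the example labels in the comment are made up.

```python
from collections import Counter

def purity(pred_clusters, gold_labels):
    """Fraction of documents assigned to the majority gold label of their cluster."""
    clusters = {}
    for cid, gold in zip(pred_clusters, gold_labels):
        clusters.setdefault(cid, []).append(gold)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(gold_labels)

# e.g. purity([0, 0, 1, 1, 1], ["A", "A", "A", "B", "B"]) == 0.8
```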

Entity Linking

Query: John Williams
"Richard Kaufman goes a long way back with John Williams. Trained as a classical violinist, Californian Kaufman started doing session work in the Hollywood studios in the 1970s. One of his movies was Jaws, with Williams conducting his score in recording sessions in 1975..."
Candidate KB entries:
John Williams (author, 1922-1994)
J. Lloyd Williams (botanist, 1854-1945)
John Williams (politician, 1955-)
John J. Williams (US Senator, 1904-1988)
John Williams (Archbishop, 1582-1650)
John Williams (composer, 1932-)
Jonathan Williams (poet, 1929-)

Query: Michael Phelps
"Debbie Phelps, the mother of swimming star Michael Phelps, who won a record eight gold medals in Beijing, is the author of a new memoir, ..."
"Michael Phelps is the scientist most often identified as the inventor of PET, a technique that permits the imaging of biological processes in the organ systems of living individuals. Phelps has ..."
Candidate KB entries:
Michael Phelps (swimmer, 1985-)
Michael Phelps (biophysicist, 1939-)

Task: identify the matching KB entry, or determine that the entity is missing from the KB.

Challenges in Entity Linking
Name variation
Abbreviations: BSO vs. Boston Symphony Orchestra
Shortened forms: Osama Bin Laden vs. Bin Laden
Alternate spellings: Osama vs. Ussamah vs. Oussama
Entity ambiguity: polysemous mentions, e.g., Springfield, Washington
Absence: open-domain linking
Not all observed mentions have a corresponding entry in the KB (NIL mentions)
Ability to predict NIL mentions determines KBP accuracy
Largely overlooked in current literature

Entity Linking: Features
Name matching: acronyms, aliases, string similarity, probabilistic FST
Document features: TF/IDF comparisons, occurrence of names or KB facts in the query text, Wikitology
KB node: type (e.g., is this a person), features of the Wikipedia page, Google rank of the corresponding Wikipedia page
Absence (NIL indications): does any candidate look like a good string match?
Combinations: e.g., low-string-match AND acronym AND type-is-ORG

Entity Linking: Name Matching
Acronyms
Alias lists: Wikipedia redirects, stock symbols, misc. aliases
Exact match: with and without normalized punctuation, case, accents, appositive removal
Fuzzier matching: Dice score over character uni/bi/tri-grams, Hamming distance, recursive longest common substring, subsequences; word removal (e.g., Inc., US) and abbreviation expansion
Weighted FST for name equivalence: trained models score name-1 as a rewriting of name-2
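A small illustrative sketch of one fuzzy name-matching signal mentioned above, a Dice score over character bigrams; the function name and examples are mine, not from the system described.

```python
def dice_bigram_score(name1, name2):
    """Dice coefficient over character bigrams, a simple fuzzy name-match signal."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(name1.lower()), bigrams(name2.lower())
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# e.g. dice_bigram_score("Boston Symphony Orchestra", "Boston Symphony") is high,
#      dice_bigram_score("BSO", "Boston Symphony Orchestra") is low (needs acronym handling)
```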

Entity Linking: Document Features
BoW comparisons: TF/IDF & Dice scores for the news article and KB text; examined entire articles and passages around query mentions
Named entities: ran BBN's SERIF analyzer on articles; checked for coverage of (1) query co-references and (2) all names/nominals in the KB text; noted type and subtype of the query entity (e.g., ORG/Media)
KB facts: looked to see if a candidate node's attributes are present in the article text (e.g., spouse, employer, nationality)
Wikitology: the UMBC system predicts relevant Wikipedia pages (or KB nodes) for text
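A hedged sketch of the bag-of-words document-feature idea above: cosine similarity over TF-IDF between the query article and each candidate's KB text. The function and variable names are illustrative, not the actual system's API.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(query_article, candidate_kb_texts):
    """Rank KB candidates by TF-IDF cosine similarity to the query article."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([query_article] + candidate_kb_texts)
    sims = cosine_similarity(X[0], X[1:]).ravel()
    order = sims.argsort()[::-1]
    return [(candidate_kb_texts[i], float(sims[i])) for i in order]
```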

Question Answering

Question Answering: Ambiguity

A further complication: Opinion Question Answering

Q: What is the international reaction to the reelection of Robert Mugabe as President of Zimbabwe?

A: African observers generally approved of his victory while Western Governments strongly denounced it.

Stoyanov, Cardie & Wiebe, 2005; Somasundaran, Wilson, Wiebe & Stoyanov, 2007

Subjectivity and Sentiment Analysis
The linguistic expression of somebody's opinions, sentiments, emotions, evaluations, beliefs, speculations (private states)

Private state: a state that is not open to objective observation or verification
Quirk, Greenbaum, Leech & Svartvik (1985). A Comprehensive Grammar of the English Language.

Subjectivity analysis classifies content as objective or subjective

Thanks: Jan Wiebe

Sentiment analysis vs. subjectivity analysis:
Sentiment analysis: Positive / Negative / Neutral
Subjectivity analysis: Subjective / Objective

Rao & Ravichandran, 2009

Subjectivity & Sentiment: Applications

Sentiment classification
Document level
Sentence level
Product feature level: "For a heavy pot, the handle is not well designed."
Find opinion holders and their opinions

Subjectivity & Sentiment: More Applications

Product review mining: Best Android phone in the market?

Sentiment tracking

Source: Research.ly
Tracking sentiments toward topics over time:

Is anger ratcheting up or cooling down?

Sentiment Analysis Resources: Lexicons

Example lexicon entries (word, polarity):
English: amazing +, banal -, bewilder -, divine +, doldrums -, ...
Spanish: aburrido -, inocente +, mejor +, sabroso +, odiar -, ...
French: magnifique +, céleste +, irrégulier -, haine -, ...

Rao & Ravichandran, 2009
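A minimal sketch of how such a polarity lexicon could drive lexicon-based scoring; the tiny lexicon below reuses the English example entries above and is far smaller than any real resource.

```python
# Toy lexicon taken from the example entries above (illustrative only).
LEXICON = {"amazing": +1, "divine": +1, "banal": -1, "bewilder": -1, "doldrums": -1}

def lexicon_score(text):
    """Sum polarity of lexicon words found in the text; >0 positive, <0 negative."""
    tokens = text.lower().split()
    return sum(LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)

# e.g. lexicon_score("An amazing, divine performance despite a banal script") == 1
```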

Sentiment Analysis Resources: Corpora
Pang and Lee, movie review corpus
Blitzer, multi-domain (Amazon) review corpus

Dependency Parsing
Consider product-feature opinion extraction:
"For a heavy pot, the handle is not well designed."

[Figure: dependency parse of "the handle is not well designed": det(handle, the), nsubjpass(designed, handle), neg(designed, not), advmod(designed, well)]
Syntactic structure consists of binary, asymmetrical relations between the words of a sentence.

Dependency Representations
Directed graphs:
V is a set of nodes (tokens)
E is a set of arcs (dependency relations)
L is a labeling function on E (dependency types)
Example (Swedish): PP På (In), NN 60-talet (the-60s), VB målade (painted), PN han (he), JJ djärva (bold), NN tavlor (pictures), with arcs labeled ADV, PR, OBJ, SUB, ATT
thanks: Nivre

Dependency Parsing: Constraints
Commonly imposed constraints:
Single-head (at most one head per node)
Connectedness (no dangling nodes)
Acyclicity (no cycles in the graph)
Projectivity: an arc i → j is projective iff, for every k occurring between i and j in the input string, i →* k; a graph is projective iff every arc is projective.
thanks: Nivre

Dependency Parsing: Approaches
Link grammar (Sleator and Temperley)
Bilexical grammar (Eisner): lexicalized parsing in O(n³) time
Maximum Spanning Tree (McDonald)
CoNLL 2006/2007 shared tasks
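A toy sketch of the directed-graph representation and the projectivity constraint above, encoding each token's head by index; this is a hypothetical encoding for illustration, not Nivre's implementation.

```python
def is_projective(heads):
    """heads[i] is the head index of token i+1 (0 = artificial root).
    An arc head -> dep is projective iff every token strictly between them
    is (transitively) dominated by the head."""
    n = len(heads)

    def dominates(h, k):
        while k != 0:
            if k == h:
                return True
            k = heads[k - 1]   # follow the head chain upward
        return h == 0

    for dep in range(1, n + 1):
        h = heads[dep - 1]
        lo, hi = min(h, dep), max(h, dep)
        if any(not dominates(h, k) for k in range(lo + 1, hi)):
            return False
    return True

# 1-indexed heads for "the handle is not well designed":
# designed(6) is the root; the->handle, handle/is/not/well->designed
print(is_projective([2, 6, 6, 6, 6, 0]))  # True
```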

Syntactic Variations versus Semantic Roles
Yesterday, Kristina hit Scott with a baseball
Scott was hit by Kristina yesterday with a baseball
Yesterday, Scott was hit with a baseball by Kristina
With a baseball, Kristina hit Scott yesterday
Yesterday Scott was hit by Kristina with a baseball
The baseball with which Kristina hit Scott yesterday was hard
Kristina hit Scott with a baseball yesterday
Roles: Agent (hitter) = Kristina, Patient (thing hit) = Scott, Instrument = a baseball, Temporal adjunct = yesterday
thanks: Jurafsky

There are many syntactic variations of the same sentence. Despite that, what we really need is a simple semantic analysis telling us that Kristina is the person who did the hitting, Scott is the victim who got hit, the baseball was used to hit, and the whole event happened yesterday.

Semantic Role Labeling
For each clause, determine the semantic role played by each noun phrase that is an argument to the verb.
Roles: agent, patient, source, destination, instrument
John drove Mary from Austin to Dallas in his Toyota Prius.
The hammer broke the window.
Also referred to as case role analysis, thematic analysis, and shallow semantic parsing
thanks: Mooney

SRL Datasets
FrameNet: developed at UC Berkeley, based on the notion of frames
PropBank: developed at UPenn, based on elaborating the Treebank
Salsa: developed at Universität des Saarlandes, a German counterpart of FrameNet

SRL as Sequence Labeling
SRL can be treated as a sequence labeling problem.
For each verb, try to extract a value for each of the possible semantic roles for that verb.
Employ any of the standard sequence labeling methods: token classification, HMMs, CRFs
thanks: Mooney

SRL with Parse Trees
Parse trees help identify semantic roles by exploiting syntactic clues, e.g., the agent is usually the subject of the verb.
A parse tree is needed to identify the true subject:
"The man by the store near the dog ate an apple."
The man is the agent of "ate", not the dog.
thanks: Mooney

SRL with Parse Trees
Assume that a syntactic parse is available.
For each predicate (verb), label each node in the parse tree as either not-a-role or one of the possible semantic roles (agent, patient, source, destination, instrument, beneficiary).
[Figure: a color-coded parse tree of an example sentence involving "the big dog", "the girl", and "the boy".]
thanks: Mooney

Selectional Restrictions
Selectional restrictions are constraints that certain verbs place on the fillers of certain semantic roles:
Agents should be animate
Beneficiaries should be animate
Instruments should be tools
Patients of "eat" should be edible
Sources and destinations of "go" should be places
Sources and destinations of "give" should be animate
Taxonomic abstraction hierarchies or ontologies (e.g., hypernym links in WordNet) can be used to determine whether such constraints are met, as sketched below:
John is a Human, which is a Mammal, which is a Vertebrate, which is Animate.
thanks: Mooney
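A hedged sketch of checking a selectional restriction with WordNet hypernym links through NLTK; it assumes NLTK with the WordNet data installed, and the dog/animal pairing is an illustrative choice.

```python
from nltk.corpus import wordnet as wn

def satisfies_restriction(filler_synset, required_synset):
    """True if required_synset appears among the filler's hypernym ancestors."""
    closure = filler_synset.closure(lambda s: s.hypernyms())
    return required_synset in closure or filler_synset == required_synset

# Agents should be animate: check "dog" against "animal".
dog = wn.synset("dog.n.01")
animal = wn.synset("animal.n.01")
print(satisfies_restriction(dog, animal))  # True: a dog is a kind of animal
```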

Word Senses
Ash:
Sense 1: Trees of the olive family with pinnate leaves, thin furrowed bark and gray branches.
Sense 2: The solid residue left when combustible material is thoroughly burned or oxidized.
Sense 3: To convert into ash.
Coal:
Sense 1: A piece of glowing carbon or burnt wood.
Sense 2: Charcoal.
Sense 3: A black solid combustible substance formed by the partial decomposition of vegetable matter without free access to air and under the influence of moisture and often increased pressure and temperature that is widely used as a fuel for burning.
"Beware of the burning coal underneath the ash."
Self-training via Yarowsky's Algorithm (sketched below)
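A compact sketch of the self-training idea behind Yarowsky-style bootstrapping for the ash/coal ambiguity above; the classifier, threshold, and round count are illustrative stand-ins, not the original decision-list algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def yarowsky_self_training(contexts, seed_labels, rounds=5, threshold=0.95):
    """contexts: list of context strings for an ambiguous word ("ash"/"coal" style).
    seed_labels: dict {index: sense} with a few high-confidence seeds per sense.
    Iteratively trains a classifier and adds its most confident predictions
    to the labeled set (self-training sketch, not Yarowsky's decision lists)."""
    labeled = dict(seed_labels)
    vec = CountVectorizer()
    X = vec.fit_transform(contexts)
    for _ in range(rounds):
        idx = sorted(labeled)
        clf = MultinomialNB().fit(X[idx], [labeled[i] for i in idx])
        probs = clf.predict_proba(X)
        for i, p in enumerate(probs):
            if i not in labeled and p.max() >= threshold:
                labeled[i] = clf.classes_[p.argmax()]
    return labeled
```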

Recognizing Textual Entailment
Text: "Overture's acquisition by Yahoo"
Hypothesis: "Yahoo bought Overture"

Question → expected answer form: "Who bought Overture?" >> "X bought Overture"
The text entails the hypothesized answer.
Similar for IE: X acquire Y
Similar for semantic IR, summarization (multi-document), MT evaluation
thanks: Dagan

(Statistical) Machine Translation

Where will we get P(F|E)?
Books in English + the same books in French → a P(F|E) model
We call collections stored in two languages parallel corpora or parallel texts.
Want to update your system? Just add more text!
thanks: Nigam

Machine Translation
Systems:
Early rule-based systems
Word-based models (IBM models)
Phrase-based models (log-linear!)
Tree-based models (syntax-driven)
Adding semantics (WSD, SRL)
Ensemble models
Evaluation:
Metrics (BLEU, BLACK, ROUGE, ...)
Corpora (statmt.org)
Toolkits: EGYPT, GIZA++, MOSES, JOSHUA
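To make the noisy-channel picture concrete, a toy word-by-word decoding sketch combining a translation model P(F|E) and a language model P(E); the probabilities and the word-independence assumption are purely illustrative (roughly IBM-Model-1 flavored), not how Moses or Joshua actually decode.

```python
import math

# Toy translation table P(f|e) and unigram LM P(e); numbers are illustrative only.
P_F_GIVEN_E = {("maison", "house"): 0.8, ("maison", "home"): 0.2,
               ("bleue", "blue"): 0.9, ("bleue", "sad"): 0.1}
P_E = {"house": 0.02, "home": 0.03, "blue": 0.01, "sad": 0.005}

def decode_word(f_word):
    """argmax_e P(e) * P(f|e): the noisy-channel choice for a single word."""
    candidates = [e for (f, e) in P_F_GIVEN_E if f == f_word]
    return max(candidates,
               key=lambda e: math.log(P_E[e]) + math.log(P_F_GIVEN_E[(f_word, e)]))

print([decode_word(w) for w in ["maison", "bleue"]])  # ['house', 'blue']
```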

Allied Areas and Tasks
Information Retrieval: TREC (large-scale experiments), CLEF (Cross-Lingual Evaluation Forum), NTCIR, FIRE (South Asian languages)
(Computational) Musicology: MIREX

Where Next?