Unsupervised Approaches to Sequence Tagging, …nschneid/ls2lit_slides.pdf · ·...
Unsupervised Approaches to Sequence Tagging, Morphology Induction, and Lexical Resource Acquisition
Reza Bosaghzadeh & Nathan Schneider
LS2 ~ 1 December 2008
Unsupervised Methods
– Sequence Labeling (Part-of-Speech Tagging)
– Morphology Induction
– Lexical Resource Acquisition

She      ran   to           the  station  quickly
pronoun  verb  preposition  det  noun     adverb

un-supervise-d learn-ing
Contrastive Estimation, Smith & Eisner (2005)
• Already discussed in class
• Key idea: exploits implicit negative evidence
– Mutating training examples often gives ungrammatical (negative) sentences
– During training, shift probability mass from generated negative examples to given positive examples
• BUT: Requires a tagging dictionary, i.e. a list of possible tags for each word type
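
To make the "mutating training examples" idea concrete, here is a minimal sketch that generates negative examples by swapping adjacent words. The choice of mutation (one adjacent transposition) and the function name `transposition_neighborhood` are assumptions made for illustration; the slide does not say which mutations are used, and the model that shifts probability mass toward the observed sentence is not shown.

```python
from typing import List


def transposition_neighborhood(sentence: List[str]) -> List[List[str]]:
    """Return every sentence obtained by swapping one pair of adjacent words."""
    neighbors = []
    for i in range(len(sentence) - 1):
        mutated = sentence.copy()
        mutated[i], mutated[i + 1] = mutated[i + 1], mutated[i]
        if mutated != sentence:  # swapping two identical words changes nothing
            neighbors.append(mutated)
    return neighbors


# The observed sentence is the positive example; its mutations serve as
# implicit negative examples (most of them are ungrammatical).
positive = "She ran to the station quickly".split()
negatives = transposition_neighborhood(positive)
for neg in negatives:
    print(" ".join(neg))
```

Most of the generated sentences ("ran She to the station quickly", "She to ran the station quickly", ...) are ungrammatical, which is what lets them stand in for negative evidence.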
Prototype-driven tagging, Haghighi & Klein (2006)
[Diagram: Unlabeled Data + Prototype List (columns: Prototypes, Target Label) yields Annotated Data; slide courtesy Haghighi & Klein]
Example unlabeled text:
Newly remodeled 2 Bdrms/1 Bath, spacious upper unit, located in Hilltop Mall area. Walking distance to shopping, public transportation, schools and park. Paid water and garbage. No dogs allowed.
Prototype list: English POS
Tags: NN VBN CC JJ CD PUNC IN NNS IN NNP RB DET

Label  Prototype    Label  Prototype
NN     president    IN     of
VBD    said         NNS    shares
CC     and          TO     to
NNP    Mr.          PUNC   .
JJ     new          CD     million
DET    the          VBP    are

slide courtesy Haghighi & Klein
Information Extraction: Classified Ads
Labels: Features, Location, Terms, Restrict, Size

Prototype list
FEATURE   kitchen, laundry
LOCATION  near, close
TERMS     paid, utilities
SIZE      large, feet
RESTRICT  cat, smoking

slide courtesy Haghighi & Klein
Prototype-driven tagging: model
• Trigram tagger, same features as (Smith & Eisner 2005)
– Word type, suffixes up to length 3, contains-hyphen, contains-digit, initial capitalization
• Tie each word to its most similar prototype, using context-based similarity technique (Schütze 1993)
– SVD dimensionality reduction
– Cosine similarity between context vectors
slide adapted from Haghighi & Klein
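
The "most similar prototype" step can be sketched roughly as follows. This is a simplified illustration rather than the authors' implementation: it compares raw left/right neighbor counts and leaves out the SVD reduction entirely, and the toy corpus, the helper names, and the three-label prototype list are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def context_vectors(sentences: List[List[str]]) -> Dict[str, Counter]:
    """Count the left and right neighbors of every word type."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i in range(1, len(padded) - 1):
            vectors[padded[i]]["L=" + padded[i - 1]] += 1
            vectors[padded[i]]["R=" + padded[i + 1]] += 1
    return vectors


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def nearest_prototype(
    word: str, prototypes: Dict[str, List[str]], vectors: Dict[str, Counter]
) -> Tuple[str, str]:
    """Return the (label, prototype) pair whose context is most similar to word's."""
    candidates = [(label, p) for label, protos in prototypes.items() for p in protos]
    return max(candidates, key=lambda lp: cosine(vectors[word], vectors[lp[1]]))


# Hypothetical toy corpus and prototype list (not from the slides).
corpus = [
    "the president said the shares are new".split(),
    "the chairman said the stocks are new".split(),
]
prototypes = {"NN": ["president"], "NNS": ["shares"], "VBD": ["said"]}
vectors = context_vectors(corpus)
print(nearest_prototype("chairman", prototypes, vectors))
print(nearest_prototype("stocks", prototypes, vectors))
```

In this toy corpus "chairman" shares both of its contexts with "president" and "stocks" shares both with "shares", so each is tied to that prototype.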
Prototype-driven tagging: pros and cons
Pros
• Doesn’t require tagging dictionary
Cons
• Still need a tag set
• May be hard to choose good prototypes
Unsupervised POS tagging: The State of the Art
Best supervised result (CRF): 99.5%!
Unsupervised Approaches to Morphology
• Morphology refers to the internal structure of words
– A morpheme is a minimal meaningful linguistic unit
– Morpheme segmentation is the process of dividing words into their component morphemes
  un-supervise-d learn-ing
– Word segmentation is the process of finding word boundaries in a stream of speech or text
  unsupervised_learning_of_natural_language
ParaMor: Morphological paradigms, Monson et al. (2007, 2008)
• Learns inflectional paradigms from raw text
– Requires only a list of word types from a corpus
– Looks at word counts of substrings, and proposes (stem, suffix) pairings based on type frequency
• 3-stage algorithm
– Stage 1: Candidate paradigms based on frequencies
– Stages 2-3: Refinement of paradigm set via merging and filtering
• Paradigms can be used for morpheme segmentation or stemming
• A sampling of Spanish verb conjugations (inflections)

speak     dance     buy
hablar    bailar    comprar
hablo     bailo     compro
hablamos  bailamos  compramos
hablan    bailan    compran
…         …         …
• A proposed paradigm (correct) for these verbs: stems {habl, bail, compr} and suffixes {-ar, -o, -amos, -an}
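
As a rough illustration of how (stem, suffix) pairings can be proposed from a bare list of word types, the sketch below tries every split point and groups stems that occur with exactly the same suffix set. This exhaustive grouping is an assumption made for illustration; it is not ParaMor's actual frequency-driven search, and `candidate_paradigms` and `min_stems` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def candidate_paradigms(
    words: Set[str], min_stems: int = 2
) -> Dict[FrozenSet[str], List[str]]:
    """Group stems by the full set of suffixes they occur with."""
    suffixes_of: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        for cut in range(1, len(word)):  # keep both stem and suffix non-empty
            suffixes_of[word[:cut]].add(word[cut:])

    stems_of: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        if len(suffixes) > 1:  # a single suffix is not a paradigm
            stems_of[frozenset(suffixes)].append(stem)

    # Keep only suffix sets that are supported by several stems.
    return {
        sfx: sorted(stems)
        for sfx, stems in stems_of.items()
        if len(stems) >= min_stems
    }


words = {
    "hablar", "hablo", "hablamos", "hablan",
    "bailar", "bailo", "bailamos", "bailan",
    "comprar", "compro", "compramos", "compran",
}
paradigms = candidate_paradigms(words)
for suffixes, stems in paradigms.items():
    print(sorted(suffixes), stems)
```

On these twelve word types the sketch recovers the correct paradigm (stems bail, compr, habl with suffixes ar, o, amos, an), but it also proposes candidates with incorrect segmentations, such as stems bai and hab with suffixes lar, lo, lamos, lan.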
• Two subsequent stages:
– Filtering out spurious paradigms (e.g. with incorrect segmentations)
– Merging partial paradigms to overcome sparsity: smoothing
speak     dance
hablar    bailar
hablo     bailo
hablamos  bailamos
hablan    bailan
…         …

• For certain subsets of verbs, the algorithm may propose paradigms with spurious segmentations of forms like those above
• The filtering stage of the algorithm weeds out these incorrect paradigms
• What if not all conjugations were in the corpus?

speak     dance     buy
hablar    bailar    comprar
          bailo     compro
hablamos  bailamos  compramos
hablan
…         …         …
• Another stage of the algorithm merges these overlapping partial paradigms via clustering
speak     dance     buy
hablar    bailar    comprar
hablo     bailo     compro
hablamos  bailamos  compramos
hablan    bailan    compran
…         …         …

• This amounts to smoothing, or “hallucinating” out-of-vocabulary items
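
The effect of merging can be sketched as below, assuming a paradigm is represented as a (stems, suffixes) pair. The clustering criterion that decides which partial paradigms to merge is not shown; here the two are simply merged unconditionally, and the `merge` and `generate` helpers are hypothetical.

```python
from typing import Set, Tuple

Paradigm = Tuple[Set[str], Set[str]]  # (stems, suffixes)


def merge(a: Paradigm, b: Paradigm) -> Paradigm:
    """Union the stems and the suffixes of two overlapping paradigms."""
    return a[0] | b[0], a[1] | b[1]


def generate(paradigm: Paradigm) -> Set[str]:
    """Every word form licensed by the paradigm (stem + suffix)."""
    stems, suffixes = paradigm
    return {stem + suffix for stem in stems for suffix in suffixes}


# Word types observed in a corpus where some conjugations are missing.
observed = {
    "hablar", "hablamos", "hablan",
    "bailar", "bailo", "bailamos",
    "comprar", "compro", "compramos",
}
partial_a: Paradigm = ({"habl"}, {"ar", "amos", "an"})
partial_b: Paradigm = ({"bail", "compr"}, {"ar", "o", "amos"})

merged = merge(partial_a, partial_b)
hallucinated = generate(merged) - observed
print(sorted(hallucinated))
```

The merged paradigm licenses hablo, bailan, and compran even though none of them was observed, which is the smoothing effect described above.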
ParaMor: summary
• Heuristic-based, deterministic algorithm can learn inflectional paradigms from raw text
• Currently, ParaMor assumes suffix-based morphology
• Paradigms can be used straightforwardly to predict segmentations
– Combining the outputs of ParaMor and Morfessor (another system) won the segmentation task at Morpho Challenge 2008 for every language: English, Arabic, Turkish, German, and Finnish
Bayesian word segmentation, Goldwater et al. (2006; in submission)
• Word segmentation results: comparison
[Table: results for Goldwater et al. Unigram DP and Goldwater et al. Bigram HDP; table from Goldwater et al. (in submission)]
• See Narges & Andreas’s presentation for more on this model
Multilingual morpheme segmentation, Snyder & Barzilay (2008)

speak (ES)  speak (FR)
hablar      parler
hablo       parle
hablamos    parlons
hablan      parlent
…           …

• Abstract morphemes cross languages: (ar, er), (o, e), (amos, ons), (an, ent), (habl, parl)
• Considers parallel phrases and tries to find morpheme correspondences
• Stray morphemes don’t correspond across languages
Morphology Papers: Inputs & Outputs
• What does “unsupervised” mean for each approach?
Bilingual lexicons from monolingual corpora, Haghighi et al. (2008)
[Diagram: a matching m links source words s from the source text (state, world, name, nation) to target words t from the target text (estado, política, mundo, nombre); diagram courtesy Haghighi et al.]
• Used a variant of CCA (Canonical Correlation Analysis)
Data representation
[Figure: feature vectors for one source word and one target word; slide courtesy Haghighi et al.]
Source text: state
  Orthographic features: #st 1.0, tat 1.0, te# 1.0
  Context features: world, politics, society (values shown: 5.0, 20.0, 10.0)
Target text: estado
  Orthographic features: #es 1.0, sta 1.0, do# 1.0
  Context features: mundo, politica, sociedad (values shown: 10.0, 17.0, 6.0)
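
The orthographic features in the example look like character trigrams with # marking the word boundary. Under that assumption (the slide does not spell out the feature definition), they could be extracted along these lines; `orthographic_features` is a hypothetical helper name.

```python
from collections import Counter


def orthographic_features(word: str, n: int = 3) -> Counter:
    """Count character n-grams of a word padded with '#' boundary markers."""
    padded = "#" + word + "#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


state = orthographic_features("state")
estado = orthographic_features("estado")
print(sorted(state))                 # includes '#st', 'tat', 'te#'
print(sorted(estado))                # includes '#es', 'sta', 'do#'
print(sorted(set(state) & set(estado)))  # substrings the two words share
```

Shared substrings such as "sta" are what let orthographic features link likely cognates across the two languages.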
Feature experiments
• MCCA: Orthographic and context features
• Data: 4k EN-ES Wikipedia articles

Features  Precision
EditDist  61.1
Ortho     80.1
Context   80.2
MCCA      89.0

slide courtesy Haghighi et al.
Narrative events, Chambers & Jurafsky (2008)
• Given a corpus, identifies related events that constitute a “narrative” and (when possible) predict their typical temporal ordering
– E.g.: CRIMINAL PROSECUTION narrative, with verbs: arrest, accuse, plead, testify, acquit/convict
• Key insight: related events tend to share a participant in a document
– The common participant may fill different syntactic/semantic roles with respect to verbs: arrest.OBJECT, accuse.OBJECT, plead.SUBJECT
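
The shared-participant insight can be sketched as a simple counting step. The input format is an assumption: each document is taken to be a mapping from an entity to the (verb, role) events it takes part in, as a parser plus coreference resolver might produce. The documents are invented, and raw pair counts stand in for whatever association score the actual system uses.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (verb, role filled by the shared participant)


def count_event_pairs(documents: List[Dict[str, List[Event]]]) -> Counter:
    """Count pairs of events that share a participant within a document."""
    pair_counts: Counter = Counter()
    for entity_chains in documents:
        for events in entity_chains.values():
            # Sort so each unordered pair is always stored the same way round.
            for a, b in combinations(sorted(set(events)), 2):
                pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical documents: entity -> events that entity participates in.
documents = [
    {"suspect": [("arrest", "OBJECT"), ("accuse", "OBJECT"), ("plead", "SUBJECT")]},
    {
        "defendant": [("arrest", "OBJECT"), ("plead", "SUBJECT")],
        "judge": [("convict", "SUBJECT")],
    },
]
pair_counts = count_event_pairs(documents)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Event pairs that repeatedly share a participant, such as arrest.OBJECT with plead.SUBJECT here, are candidates for belonging to the same narrative.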
• A temporal classifier can reconstruct pairwise canonical event orderings, producing a directed graph for each narrative
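
Once pairwise orderings are available, assembling them into a directed graph and reading off one consistent order is straightforward. The sketch below assumes the classifier's output is a list of (earlier, later) verb pairs; the pairs themselves are hypothetical, and the classifier is not shown. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple


def order_events(before_pairs: List[Tuple[str, str]]) -> List[str]:
    """Build a directed graph from (earlier, later) pairs and linearize it."""
    predecessors: Dict[str, Set[str]] = {}
    for earlier, later in before_pairs:
        predecessors.setdefault(later, set()).add(earlier)
        predecessors.setdefault(earlier, set())
    # static_order() raises graphlib.CycleError if the pairs are inconsistent.
    return list(TopologicalSorter(predecessors).static_order())


# Hypothetical pairwise decisions for the criminal prosecution narrative.
before_pairs = [
    ("arrest", "accuse"),
    ("accuse", "plead"),
    ("plead", "testify"),
    ("testify", "convict"),
]
ordering = order_events(before_pairs)
print(ordering)
```

Real classifier output would typically be noisier and only partially ordered, so several linearizations (or a cycle) are possible; this chain was chosen so that the order is unique.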
Statistical verb lexicon, Grenager & Manning (2006)
• From dependency parses, a generative model predicts for each verb:
– PropBank-style semantic roles: ARG0, ARG1, etc. (do not necessarily correspond across verbs)
– The roles’ syntactic realizations, e.g.:

  He gave me a cookie      He gave a cookie to me
  subj   ARG0              subj   ARG0
  verb   give              verb   give
  np#1   ARG2              np#2   ARG1
  np#2   ARG1              pp_to  ARG2

• Used for semantic role labeling
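
One way to picture a lexicon entry is as a list of linkings from syntactic positions to roles. The sketch below encodes the two realizations of "give" from the example and applies them by exact lookup. It is only a deterministic stand-in: the actual model is generative and probabilistic, and `GIVE_LINKINGS` and `label_roles` are hypothetical names.

```python
from typing import Dict, List

Linking = Dict[str, str]  # syntactic position -> semantic role

GIVE_LINKINGS: List[Linking] = [
    {"subj": "ARG0", "np#1": "ARG2", "np#2": "ARG1"},   # He gave me a cookie
    {"subj": "ARG0", "np#2": "ARG1", "pp_to": "ARG2"},  # He gave a cookie to me
]


def label_roles(arguments: Dict[str, str], linkings: List[Linking]) -> Dict[str, str]:
    """Label arguments with the first linking whose positions match exactly."""
    for linking in linkings:
        if set(linking) == set(arguments):
            return {linking[pos]: phrase for pos, phrase in arguments.items()}
    raise ValueError("no linking matches these syntactic positions")


ditransitive = label_roles(
    {"subj": "He", "np#1": "me", "np#2": "a cookie"}, GIVE_LINKINGS
)
prepositional = label_roles(
    {"subj": "He", "np#2": "a cookie", "pp_to": "me"}, GIVE_LINKINGS
)
print(ditransitive)
print(prepositional)
```

Both sentences end up with the same role assignment (He as ARG0, a cookie as ARG1, me as ARG2) despite their different syntax.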
“Semanticity”: Our proposed scale of semantic richness
• text < POS < syntax/morphology/alignments < coreference/semantic roles/temporal ordering < translations/narrative event sequences
• We score each model’s inputs and outputs on this scale, and call the input-to-output increase “semantic gain”
– Haghighi et al.’s bilingual lexicon induction wins in this respect, going from raw text to lexical translations
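
The scoring can be sketched numerically if each step on the scale is assumed to be worth one point. The integer levels 0 to 4 and the choice to score a model by its richest input and richest output are assumptions; the slides only give the ordering.

```python
from typing import List

# Levels follow the ordering on the scale; the numeric values are assumed.
SEMANTICITY = {
    "text": 0,
    "POS": 1,
    "syntax": 2, "morphology": 2, "alignments": 2,
    "coreference": 3, "semantic roles": 3, "temporal ordering": 3,
    "translations": 4, "narrative event sequences": 4,
}


def semantic_gain(inputs: List[str], outputs: List[str]) -> int:
    """Richest output level minus richest input level."""
    richest_input = max(SEMANTICITY[i] for i in inputs)
    richest_output = max(SEMANTICITY[o] for o in outputs)
    return richest_output - richest_input


# Bilingual lexicon induction: raw text in, lexical translations out.
print(semantic_gain(["text"], ["translations"]))
# Morphology induction from raw text, for comparison.
print(semantic_gain(["text"], ["morphology"]))
```

Under these assumed levels, going from raw text to translations spans the whole scale, which is why that approach comes out ahead.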
Semantic Gain: Comparison of Methods
Robustness to language variation
• About half of the papers we examined had English-only evaluations
• We considered which techniques were most adaptable to other (esp. resource-poor) languages. Two main factors:
– Reliance on existing tools/resources for preprocessing (parsers, coreference resolvers, …)
– Any linguistic specificity in the model (e.g. suffix-based morphology)
Summary
We examined three areas of unsupervised NLP:
1. Sequence tagging: How can we predict POS (or topic) tags for words in sequence?
2. Morphology: How are words put together from morphemes (and how can we break them apart)?
3. Lexical resources: How can we identify lexical translations, semantic roles and argument frames, or narrative event sequences from text?
In eight recent papers we found a variety of approaches, including heuristic algorithms, Bayesian methods, and EM-style techniques.
Thanks to Noah and Kevin for their feedback on the paper; Andreas and Narges for their collaboration on the presentations; and all of you for giving us your attention!
Questions?
Improvement Ideas
• POS Tagging: Learn the tag set
• Morphology: Non-agglomerative morphology, also parses
• Lexical Resources: Try word classes
• All: Language variability