FROM UNSTRUCTURED INFORMATION TO LINKED DATA
(Transcript)
Axel Ngonga, Head of SIMBA @ AKSW, University of Leipzig
IASLOD, August 15/16th 2012
Motivation
• Where does the LOD Cloud come from?
  • Structured data: Triplify, D2R
  • Semi-structured data: DBpedia
  • Unstructured data: ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data sources?
Overview
1. Problem Definition
2. Named Entity Recognition
   • Algorithms
   • Ensemble Learning
3. Relation Extraction
   • General approaches
   • OpenIE approaches
4. Entity Disambiguation
   • URI Lookup
   • Disambiguation
5. Conclusion
NB: We will mainly be concerned with the newest developments.
Problem Definition
• Simple(?) problem: given a text fragment, automatically retrieve
  • all entities and
  • the relations between these entities, and
  • "ground" them in an ontology
• Also called Knowledge Extraction
John Petrucci was born in New York.
:John_Petrucci dbo:birthPlace :New_York .
Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment
• Common sets of classes
  • CoNLL03: Person, Location, Organization, Miscellaneous
  • ACE05: Facility, Geo-Political Entity, Location, Organisation, Person, Vehicle, Weapon
  • BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
  • Direct solutions (single algorithms)
  • Ensemble Learning
NER: Overview of approaches
• Dictionary-based
• Hand-crafted Rules
• Machine Learning
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
  • Neural Networks
  • k Nearest Neighbors (kNN)
  • Graph Clustering
• Ensemble Learning
  • Veto-Based (Bagging, Boosting)
  • Neural Networks
NER: Dictionary-based
• Simple Idea
  1. Define mappings between words and classes, e.g., Paris → Location
  2. Try to match each token from each sentence
  3. Return the mapped entities
✓ Time-efficient at runtime
× Manual creation of gazetteers
× Low precision (Paris = Person, Location)
× Low recall (esp. on Persons and Organizations as the number of instances grows)
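The dictionary-based idea can be sketched in a few lines of Python; the gazetteer entries here are invented for illustration, and real gazetteers contain many thousands of entries:

```python
# Minimal dictionary-based NER sketch. The gazetteer and tag set are
# toy assumptions, not part of any specific tool.
GAZETTEER = {
    "Paris": "LOC",      # ambiguous in reality: also a PER name
    "New York": "LOC",
    "John Petrucci": "PER",
}

def dictionary_ner(text):
    """Return (surface form, class) pairs found by longest-match lookup."""
    found = []
    # Sort by length so multi-token entries match before shorter ones.
    for surface in sorted(GAZETTEER, key=len, reverse=True):
        if surface in text:
            found.append((surface, GAZETTEER[surface]))
    return found

print(dictionary_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```

Note how the ambiguity problem is hard-coded away here: "Paris" simply gets one class, which is exactly the low-precision weakness listed above.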
NER: Rule-based
• Simple Idea
  1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
  2. Try to match each sentence to one or several rules
  3. Return the matched entities
✓ High precision
× Manual creation of rules is very tedious
× Low recall (finite number of patterns)
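The "[PERSON] was born in [LOCATION]" rule can be sketched as a regular expression; using capitalised-token runs as a stand-in for the PERSON/LOCATION slots is an illustrative simplification (real systems use POS tags or chunkers):

```python
import re

# One hand-crafted rule as a regex. Capitalised-token sequences stand
# in for the [PERSON] and [LOCATION] slots.
RULE = re.compile(
    r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def rule_ner(sentence):
    """Return (surface form, class) pairs matched by the rule."""
    entities = []
    for person, location in RULE.findall(sentence):
        entities.append((person, "PER"))
        entities.append((location, "LOC"))
    return entities

print(rule_ner("John Petrucci was born in New York."))
# [('John Petrucci', 'PER'), ('New York', 'LOC')]
```

The rule is very precise on sentences it matches, but contributes nothing on any other phrasing, which is the low-recall weakness listed above.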
NER: Markov Models
• Stochastic process such that (Markov Property)
  P(X_{t+1} = S_j | X_t = S_i, X_{t-1}, …, X_1) = P(X_{t+1} = S_j | X_t = S_i)
• Equivalent to a finite-state machine
• Formally consists of
  • Set S of states S_1, …, S_n
  • Matrix M such that m_ij = P(X_{t+1} = S_j | X_t = S_i)
NER: Hidden Markov Models
• Extension of Markov Models
  • States are hidden and assigned an output function
  • Only the output is seen
  • Transitions are learned from training data
• How do they work?
  • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.)
  • Goal: Find the best sequence of states that represents the input
  • Output: hopefully the right classification of each token
[Diagram: hidden states S0, S1, …, Sn emitting the tags PER, _, LOC]
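Finding the best state sequence for an observed token sequence is classically done with Viterbi decoding. A toy sketch follows; the states, transition, and emission probabilities are all invented for illustration (a real tagger estimates them from training data):

```python
# Toy Viterbi decoding for an HMM tagger. All probabilities are
# invented toy numbers; unseen words get a small probability floor.
states = ["PER", "O", "LOC"]
start = {"PER": 0.4, "O": 0.5, "LOC": 0.1}
trans = {
    "PER": {"PER": 0.4, "O": 0.5, "LOC": 0.1},
    "O":   {"PER": 0.1, "O": 0.6, "LOC": 0.3},
    "LOC": {"PER": 0.1, "O": 0.6, "LOC": 0.3},
}
emit = {  # P(word | state)
    "PER": {"john": 0.5, "petrucci": 0.4},
    "O":   {"was": 0.3, "born": 0.3, "in": 0.3},
    "LOC": {"new": 0.5, "york": 0.4},
}

def viterbi(words):
    """Return the most probable hidden state (tag) sequence."""
    V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][prev] * trans[prev][s] * emit[s].get(w, 1e-6)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):   # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("john petrucci was born in new york".split()))
# ['PER', 'PER', 'O', 'O', 'O', 'LOC', 'LOC']
```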
NER: k Nearest Neighbors
• Idea
  • Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
  • Each new token t is described with the same features
  • Assign t the majority class of its k nearest neighbors
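A toy sketch of this kNN tagger; the training examples, the (left, token, right) feature triple, and k are illustrative assumptions:

```python
from collections import Counter

# Toy kNN tagger: each token is described by its left and right
# neighbours. Training data and k are invented for illustration.
train = [  # ((left, token, right), class)
    (("", "john", "petrucci"), "PER"),
    (("john", "petrucci", "was"), "PER"),
    (("petrucci", "was", "born"), "O"),
    (("in", "new", "york"), "LOC"),
    (("in", "new", "mexico"), "LOC"),
]

def distance(a, b):
    """Hamming distance on the feature triple."""
    return sum(x != y for x, y in zip(a, b))

def knn_tag(features, k=3):
    """Assign the majority class among the k nearest training tokens."""
    nearest = sorted(train, key=lambda ex: distance(ex[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_tag(("in", "new", "jersey")))
# 'LOC'
```

Even though "jersey" never occurs in the training data, the shared left context ("in", "new") pulls the token towards the LOC neighbours.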
NER: So far …
• "Simple approaches"
  • Apply one algorithm to the NER problem
  • Bound to be limited by the assumptions of the model
• Implemented by a large number of tools
  • Alchemy
  • Stanford NER
  • Illinois Tagger
  • Ontos NER Tagger
  • LingPipe
  • …
NER: Ensemble Learning
• Intuition: Each algorithm has its strengths and weaknesses
• Idea: Use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy
[Diagram: dictionary-based approaches, pattern-based approaches, Conditional Random Fields, and Support Vector Machines combined into one ensemble]
NER: Ensemble Learning
• Idea: Merge the results of several approaches to improve results
• Simplest approaches:
  • Voting
  • Weighted voting
[Diagram: Input → System 1, System 2, …, System n → Merger → Output]
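The merger step can be sketched as (weighted) per-token voting; the system outputs and the default uniform weights are toy assumptions:

```python
from collections import Counter

# Minimal (weighted) voting merger over per-token tag predictions of
# n systems. System outputs and weights are invented for illustration.
def merge(predictions, weights=None):
    """predictions: list of tag sequences, one per system."""
    weights = weights or [1.0] * len(predictions)
    merged = []
    for token_tags in zip(*predictions):   # align systems token by token
        votes = Counter()
        for tag, w in zip(token_tags, weights):
            votes[tag] += w
        merged.append(votes.most_common(1)[0][0])
    return merged

system1 = ["PER", "O", "LOC"]
system2 = ["PER", "O", "O"]
system3 = ["O",   "O", "LOC"]
print(merge([system1, system2, system3]))
# ['PER', 'O', 'LOC']
```

Weighted voting simply replaces the uniform weights with, e.g., per-system F-scores measured on held-out data.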
NER: Ensemble Learning
• When does it work?
• Accuracy
  • The existing solutions need to be "good"
  • Merging random results leads to random results
  • Given: current approaches reach 80% F-score
• Diversity
  • Need for the smallest possible amount of correlation between approaches
  • E.g., merging two HMM-based taggers won't help
  • Given: large number of approaches for NER
NER: FOX
• Federated Knowledge Extraction Framework
• Idea: Apply ensemble learning to NER
• Classical approach: Voting
  • Does not make use of systematic errors
  • Partly difficult to train
• Use neural networks instead
  • Can make use of systematic errors
  • Easy to train
  • Converge fast
• http://fox.aksw.org
NER: FOX
NER: FOX on MUC7
NER: FOX on Website Data
NER: FOX on Companies and Countries
✓ No runtime issues (parallel implementation)
✓ NN overhead is small
× Overfitting
NER: Summary
• Large number of approaches
  • Dictionaries
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
  • …
✓ Combining approaches leads to better results than single algorithms
RE: Problem Definition
• Find the relations between NEs if such relations exist.
• NEs are not always given a priori (open vs. closed RE)

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
RE: Approaches• Hand-crafted rules• Pattern Learning• Coupled Learning
RE: Pattern-based
• Hearst patterns [Hearst: COLING'92]
• POS-enhanced regular expression matching in natural-language text

NP0 {,} such as {NP1, NP2, …, (and|or)} NPn
NP0 {,} {NP1, NP2, …, NPn-1} {,} or other NPn

"The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." → isA("Bambara ndang", "bow lute")

✓ Time-efficient at runtime
× Very low recall
× Not adaptable to other relations
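The first Hearst pattern can be sketched as a plain regex over raw strings; this is a simplification, since a real implementation matches POS-tagged NP chunks and would strip determiners like "the":

```python
import re

# Sketch of the "NP0 such as NP1, ..., NPn" Hearst pattern as a raw
# regex. Real implementations operate on POS-tagged noun phrases.
SUCH_AS = re.compile(r"([\w ]+?),? such as ([\w ,]+?(?: and [\w ]+)?)[,.]")

def hearst_is_a(sentence):
    """Return (isA, hyponym, hypernym) triples matched in the sentence."""
    pairs = []
    for hyper, hypos in SUCH_AS.findall(sentence):
        for hypo in re.split(r", | and ", hypos):
            pairs.append(("isA", hypo.strip(), hyper.strip()))
    return pairs

print(hearst_is_a("The bow lute, such as the Bambara ndang, is plucked."))
# [('isA', 'the Bambara ndang', 'The bow lute')]
```

The pattern only fires on this one lexical construction, which illustrates both the high precision and the very low recall noted above.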
RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Extraction
• Semi-supervised, iterative gathering of facts and patterns
• Positive & negative examples as seeds for a given target relation
  • e.g., +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google)
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to SnowBall / QXtract
[Diagram: seeds (Hillary, Bill), (Carla, Nicolas) yield patterns such as "X and her husband Y", "X and Y on their honeymoon", "X and Y and their children", "X has been dating with Y", "X loves Y", which in turn yield new tuples (Angelina, Brad), (Victoria, David), …, while the negative seed (Larry, Google) helps prune bad patterns]
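One bootstrapping round of the DIPRE idea can be sketched as follows; the corpus and seeds are toy assumptions, and real DIPRE additionally scores patterns and uses negative seeds like –(Larry, Google) for pruning:

```python
import re

# One DIPRE-style bootstrapping round for a "couple" relation.
# Corpus and seeds are toy assumptions.
corpus = [
    "Hillary and her husband Bill appeared together.",
    "Carla and her husband Nicolas visited Berlin.",
    "Angelina and her husband Brad left early.",
]
seeds = {("Hillary", "Bill"), ("Carla", "Nicolas")}

def learn_patterns(corpus, seeds):
    """Keep the text between the two seed arguments as an X<middle>Y pattern."""
    patterns = set()
    for sentence in corpus:
        for x, y in seeds:
            if x in sentence and y in sentence and sentence.index(x) < sentence.index(y):
                patterns.add(sentence[sentence.index(x) + len(x):sentence.index(y)])
    return patterns

def apply_patterns(corpus, patterns):
    """Match the learned middles to extract new candidate tuples."""
    tuples = set()
    for middle in patterns:
        regex = r"(\w+)" + re.escape(middle) + r"(\w+)"
        for sentence in corpus:
            tuples.update(re.findall(regex, sentence))
    return tuples

patterns = learn_patterns(corpus, seeds)   # {' and her husband '}
print(apply_patterns(corpus, patterns))
```

The round rediscovers both seeds and adds the new fact (Angelina, Brad); iterating facts → patterns → facts is what makes DIPRE "dual" and also what makes semantic drift possible.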
RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with an ontological backbone
  • Closed set of categories & typed relations
  • Seeds/counter-seeds (5-10)
  • Open set of predicate arguments (instances)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision

athletePlaysForTeam(Athlete, SportsTeam)
athletePlaysForTeam(Alex Rodriguez, Yankees)
athletePlaysForTeam(Alexander_Ovechkin, Penguins)
RE: NELL
• Conservative strategy → avoid semantic drift
RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: Use instance data in the Data Web to discover NL patterns and new instances

RE: BOA
• Follows a conservative strategy
  • Only top pattern
  • Frequency threshold
  • Score threshold
• Evaluation results
RE: Summary
• Several approaches
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
✓ Large number of instances available for many relations
✓ Runtime problem → parallel implementations
✓ Many new facts can be found
× Semantic drift
× Long tail
× Entity disambiguation
ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. positions), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
  • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
  • Difficult even for humans, e.g., "Paris' mayor died yesterday"
• Several solutions
  • Indexing
  • Surface Forms
  • Graph-based
ED: Problem Definition

John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
:John_Petrucci dbo:birthPlace :New_York .
ED: Indexing
• More retrieval than disambiguation
• Similar to dictionary-based approaches
• Idea
  • Index all labels in the reference knowledge base
  • Given an input label, retrieve all entities with a similar label
× Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie")
× Low precision (Paris = Paris Hilton, Paris (France), …)
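A minimal sketch of such an index lookup; the label-to-URI table is a toy stand-in for a real index over a reference knowledge base such as DBpedia:

```python
# Toy label index over a reference knowledge base; labels and URIs
# are illustrative stand-ins for a real index.
index = {
    "paris": ["dbr:Paris", "dbr:Paris_Hilton", "dbr:Paris,_Ontario"],
    "new york": ["dbr:New_York", "dbr:New_York_County"],
    "marie curie": ["dbr:Marie_Curie"],
}

def lookup(label):
    """Retrieve all candidate URIs whose label matches (case-insensitive)."""
    return index.get(label.lower(), [])

print(lookup("Paris"))      # several candidates -> low precision
print(lookup("Mme Curie"))  # unknown surface form -> recall problem
```

The two calls exhibit exactly the two weaknesses listed above: "Paris" returns several unranked candidates, and the unseen surface form "Mme Curie" returns nothing.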
ED: Type Disambiguation
• Extension of indexing
  • Index all labels
  • Infer type information
  • Retrieve labels from entities of the given type
• Same recall as the previous approach
• Higher precision
  • Paris[LOC] != Paris[PER]
  • Still: Paris (France) vs. Paris (Ontario)
• Need for context
ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
• Based on DBpedia + Wikipedia
• Uses supplementary knowledge including disambiguation pages, redirects, wikilinks
• Three main steps
  • Spotting: Find possible mentions of DBpedia resources, e.g., John Petrucci was born in New York.
  • Candidate Selection: Find possible URIs, e.g., John Petrucci → :John_Petrucci; New York → :New_York, :New_York_County, …
  • Disambiguation: Map the context to a vector for each resource, e.g., New York → :New_York
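The disambiguation step can be sketched as choosing the candidate whose context vector is most similar to the mention's context; the context words per URI and the cosine scoring shown here are illustrative assumptions, not Spotlight's exact scoring:

```python
from collections import Counter
from math import sqrt

# Sketch of vector-based disambiguation: pick the candidate URI whose
# context vector is most similar (cosine) to the mention's context.
# Context words per URI are invented for illustration.
contexts = {
    ":New_York": ["city", "born", "state", "manhattan"],
    ":New_York_County": ["county", "court", "administrative"],
}

def cosine(a, b):
    """Cosine similarity of two bags of words."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(mention_context):
    """Return the candidate URI with the most similar context."""
    return max(contexts, key=lambda uri: cosine(mention_context, contexts[uri]))

print(disambiguate(["john", "petrucci", "was", "born", "in"]))
# ':New_York'
```

The word "born" overlaps with the :New_York context and nothing overlaps with :New_York_County, so the city wins.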
ED: YAGO2
• Joint Disambiguation

"Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album." ♬

[Diagram: mentions linked to entity candidates — Mississippi (State) vs. Mississippi (Song), Bob Dylan Songs, Sheryl Cruz vs. Sheryl Lee vs. Sheryl Crow — with edge weights sim(cxt(m_l), cxt(e_i)), prior(m_l, e_i), and coh(e_i, e_j)]

• Objective: Maximize an objective function (e.g., total weight)
• Constraint: Keep at least one entity per mention
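A brute-force sketch of this joint objective: choose one candidate per mention maximising the sum of local scores (prior + similarity, collapsed into one number here) plus pairwise coherence. All candidate lists and score values are invented toy numbers, and real systems use approximate optimisation rather than enumeration:

```python
from itertools import combinations, product

# Brute-force joint disambiguation over toy candidates and scores.
candidates = {
    "Mississippi": ["Mississippi_(State)", "Mississippi_(Song)"],
    "Sheryl": ["Sheryl_Crow", "Sheryl_Lee"],
}
local = {  # prior(m, e) + sim(cxt(m), cxt(e)) collapsed into one number
    "Mississippi_(State)": 0.6, "Mississippi_(Song)": 0.4,
    "Sheryl_Crow": 0.5, "Sheryl_Lee": 0.5,
}
coherence = {  # coh(e_i, e_j); unlisted pairs score 0
    frozenset({"Mississippi_(Song)", "Sheryl_Crow"}): 0.9,
}

def joint_disambiguate():
    """Enumerate all assignments and keep the one with maximal total weight."""
    best, best_score = None, float("-inf")
    for assignment in product(*candidates.values()):
        score = sum(local[e] for e in assignment)
        score += sum(coherence.get(frozenset(p), 0.0)
                     for p in combinations(assignment, 2))
        if score > best_score:
            best, best_score = assignment, score
    return dict(zip(candidates, best))

print(joint_disambiguate())
# {'Mississippi': 'Mississippi_(Song)', 'Sheryl': 'Sheryl_Crow'}
```

Locally, Mississippi (State) scores higher than the song, but the coherence between the song and Sheryl Crow flips the joint decision, which is exactly the point of joint disambiguation.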
ED: FOX
• Generic Approach
  • A-priori score (a): Popularity of URIs
  • Similarity score (s): Similarity of resource labels and text
  • Coherence score (z): Correlation between URIs
[Chart: evaluation of the score combinations a|s and a|s|z]
ED: FOX
• Allows the use of several algorithms
  • HITS
  • PageRank
  • Apriori
  • Propagation algorithms
  • …
ED: Summary
• Difficult problem even for humans
• Several approaches
  • Simple search
  • Search with restrictions
  • Known surface forms
  • Graph-based
✓ Improved F-score for DBpedia (70-80%)
× Low F-score for generic knowledge bases
× Intrinsically difficult
× Still a lot to do
Conclusion
• Discussed the basics of …
  • the Knowledge Extraction problem
  • Named Entity Recognition
  • Relation Extraction
  • Entity Disambiguation
• Still a lot of research necessary
  • Ensemble and active learning
  • Entity Disambiguation
  • Question Answering …
Thank You! Questions?