SOFIE: A Self-Organizing Framework for Information Extraction
SOFIE: A Self-Organizing Framework for Information Extraction 1Fabian M. Suchanek
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum
(Max-Planck-Institute for Informatics, Saarbrücken, Germany)
Ontologies
[Figure: ontology graph. Node labels: Singer, Country, USA, Entity. Edge labels: bornInPlace, type (twice), subclassOf (twice).]
Wikipedia ("birth-place: USA"): DBpedia, YAGO, KYLIN, ...
Internet: ? "Elvis died in England"
Information Extraction
"Elvis died in England" [figure: a diedInPlace edge to England]
Goal: Extract ontological information from natural language documents
Previous approaches: Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more
- May deliver non-canonical relations: died in, perished in, was killed in, ...
- May deliver non-canonical entities: England, UK, Great Britain, ...
- May deliver inconsistent facts: diedInPlace(Elvis,England), diedInPlace(Elvis,Germany)
Pitfalls of Information Extraction
Web page: "Elvis died in England." "Louis XIV died in France."
Ontology: [figure: a diedInPlace edge to France]
If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.
  "died in" = diedInPlace
If a meaningful pattern occurs with two entities, then the entities stand in the relation.
  "Elvis" diedInPlace "England"
- Reasoning Problem: Taxidophobist?
- Disambiguation Problem: "Elvis", "England"
- Pattern Matching Problem: "died in" = diedInPlace ?
Information Extraction as Formulas
Reasoning Problem: Taxidophobist
type(Elvis,Taxidophobist).
type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z) [0.8]
Information Extraction as Formulas
Disambiguation Problem
Assumptions:
- In one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names
possibleMeaning(Elvis@D15, ElvisPresley). [0.7]
- Elvis@D15: a word in context (wic). Here: the word "Elvis" in document D15
- ElvisPresley: one possible meaning of "Elvis" as given by the ontology
- [0.7]: prior estimate of the likelihood of this meaning:
  $\frac{|\mathrm{words}(D15) \cap \mathrm{rel}(\mathrm{ElvisPresley})|}{|\mathrm{words}(D15)|}$
possibleMeaning(X,Y) => means(X,Y)
means(X,Y) & Y≠Z => ¬means(X,Z)
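The prior on this slide is a word-overlap ratio. The following is a minimal sketch of how such a ratio could be computed. It assumes that rel(E) is the set of words the ontology associates with entity E and that words are compared as lowercase sets; the slide defines neither, and the function name and toy data are hypothetical.

```python
def disambiguation_prior(doc_words, related_words):
    """Share of the document's distinct words that also occur in rel(entity)."""
    doc = {w.lower() for w in doc_words}
    rel = {w.lower() for w in related_words}
    if not doc:
        return 0.0  # an empty document gives no evidence
    return len(doc & rel) / len(doc)


# Hypothetical toy data, not taken from the slides.
d15 = "Elvis sang rock songs in Memphis".split()
rel_presley = {"rock", "Memphis", "singer", "songs", "Graceland"}

prior = disambiguation_prior(d15, rel_presley)
print(prior)
```

Three of the six distinct words in the toy document (rock, songs, Memphis) also appear in the related-word set, so the sketch yields a prior of 0.5.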
Information Extraction as Formulas
Pattern Matching Problem: "died in" = diedInPlace ?
Web page: "Elvis died in England." "Louis XIV died in France."
occurs("died in", Elvis@D15, England@D15). [14]
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R)
  => R(X,Y)
occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y)
  => mapsTo(P,R)
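To illustrate how the two pattern formulas interact, here is a simplified sketch that applies them as hard forward rules. This is an assumption made for illustration only: in SOFIE the formulas are weighted and solved jointly, not applied one after the other. The function names and the toy data are hypothetical.

```python
def learn_patterns(occurrences, means, facts):
    """occurs & means & R(X,Y) => mapsTo(P,R), applied as a hard rule."""
    maps_to = set()
    for pattern, wic1, wic2 in occurrences:
        x, y = means.get(wic1), means.get(wic2)
        if x is None or y is None:
            continue  # a word in context without a known meaning
        for rel, a, b in facts:
            if (a, b) == (x, y):
                maps_to.add((pattern, rel))
    return maps_to


def apply_patterns(occurrences, means, maps_to):
    """occurs & means & mapsTo(P,R) => R(X,Y), applied as a hard rule."""
    derived = set()
    for pattern, wic1, wic2 in occurrences:
        x, y = means.get(wic1), means.get(wic2)
        if x is None or y is None:
            continue
        for p, rel in maps_to:
            if p == pattern:
                derived.add((rel, x, y))
    return derived


# Hypothetical toy input modelled on the slide's example sentences.
occurrences = [("died in", "Louis XIV@D15", "France@D15"),
               ("died in", "Elvis@D15", "England@D15")]
means = {"Louis XIV@D15": "LouisXIV", "France@D15": "France",
         "Elvis@D15": "ElvisPresley", "England@D15": "England"}
facts = {("diedInPlace", "LouisXIV", "France")}

maps_to = learn_patterns(occurrences, means, facts)
candidates = apply_patterns(occurrences, means, maps_to)
print(maps_to)
print(candidates)
```

Applied this naively, the rules also produce diedInPlace(ElvisPresley, England), which is the kind of candidate that the reasoning and disambiguation formulas are meant to keep in check.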
Information Extraction as Formulas
Reasoning Problem:
  type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)
  type(Elvis,Taxidophobist).
Disambiguation Problem:
  meaning(Elvis@D15, ElvisPresley). [0.7]
Pattern Matching Problem:
  occurs("died in", Elvis@D15, England@D15). [14]
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
  means(Elvis@D15, ElvisPresley) ?
  mapsTo("died in", diedInPlace) ?
  diedIn(ElvisPresley, England) ?
Weighted MAX SAT Problem
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
Problems:
- The Weighted MAX SAT Problem is NP-hard
- Our instance of the problem is huge
- The most popular linear-time approximation algorithm (Johnson's) does not work well with our type of formulas
Johnson's cannot approximate better than 2/3
bornInPlace(X,Y) => bornInPlace(X,Z)
A v B, A v C, B v C
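The objective itself is easy to state in code. The sketch below solves a tiny Weighted MAX SAT instance by exhaustive search, which is only feasible for a handful of variables and is exactly what the NP-hardness remark rules out at scale. The clause set, the weights, and the function name are hypothetical.

```python
from itertools import product


def weighted_max_sat(variables, clauses):
    """Try every assignment; a clause is (literals, weight), a literal is (var, polarity)."""
    best_weight, best_assignment = float("-inf"), None
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A clause is satisfied if at least one of its literals holds.
        weight = sum(w for literals, w in clauses
                     if any(assignment[v] == pol for v, pol in literals))
        if weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment


# Hypothetical instance: the four clauses cannot all be satisfied at once.
variables = ["A", "B", "C"]
clauses = [
    ([("A", True), ("B", True)], 2.0),    # A v B
    ([("A", False), ("C", False)], 3.0),  # not A v not C
    ([("C", True)], 4.0),                 # C
    ([("B", False)], 1.0),                # not B
]

best_weight, best_assignment = weighted_max_sat(variables, clauses)
print(best_weight, best_assignment)
```

The search visits 2^n assignments for n hypotheses, which is why an approximation algorithm is needed for instances of realistic size.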
FMS Algorithm
Formulas: A v B [w1], A v B [w2], B v C [w3], C [w4]
Hypotheses: A = true, B = false, C = false
The Functional MAX SAT Algorithm considers only unit clauses.
The Functional MAX SAT Algorithm propagates Dominating Unit Clauses
A v B [10], A [10], A [30]
A = true, since 30 > 10+10
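The propagation step can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes that a unit clause dominates when its weight exceeds the total weight of all clauses containing the negated literal, and the example writes the negations explicitly (not A v B [10], not A [10], A [30]), which is an assumption about the slide's example. All names are hypothetical.

```python
def negate(literal):
    var, polarity = literal
    return (var, not polarity)


def propagate_dominating_units(clauses):
    """clauses: list of (frozenset of (var, polarity) literals, weight)."""
    assignment = {}
    clauses = list(clauses)
    changed = True
    while changed:
        changed = False
        # Sum the weights of unit clauses per literal.
        unit_weight = {}
        for literals, w in clauses:
            if len(literals) == 1:
                (lit,) = literals
                unit_weight[lit] = unit_weight.get(lit, 0.0) + w
        for lit, w_unit in unit_weight.items():
            opposing = sum(w for literals, w in clauses if negate(lit) in literals)
            if w_unit > opposing:  # the unit clause dominates: fix the literal
                assignment[lit[0]] = lit[1]
                simplified = []
                for literals, w in clauses:
                    if lit in literals:
                        continue  # clause is satisfied, drop it
                    rest = literals - {negate(lit)}
                    if rest:  # an empty rest means the clause is falsified
                        simplified.append((rest, w))
                clauses = simplified
                changed = True
                break
    return assignment, clauses


example = [
    (frozenset({("A", False), ("B", True)}), 10.0),  # not A v B [10]
    (frozenset({("A", False)}), 10.0),               # not A     [10]
    (frozenset({("A", True)}), 30.0),                # A         [30]
]
assignment, remaining = propagate_dominating_units(example)
print(assignment, remaining)
```

Here A is set to true because 30 > 10 + 10. The first clause then shrinks to the unit clause B [10], which dominates in turn, so B is set to true as well.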
FMS Algorithm
[Figure: FMS Algorithm box: FOR i=1 TO 42...NEXT i]
- Approximation guarantee
- Polynomial time
- Experiments show better performance in practice than Johnson's algorithm in our setting.
FMS Algorithm
[Figure: FMS Algorithm (FOR i=1 TO 42...NEXT i). Labels: "Elvis died in England"; r(X,Y) & s(Y) => t(X,Y); ontology fragment with England, diedIn, St. Elvis. Resulting truth values:]
type(Elvis,Taxidophobist)=1
diedIn(Elvis,England)=0
means(Elvis@D15,Elvis)=0
means(Elvis@D15,...)=1
Other Experiments
Corpus                   Type             # Docs  Relations  Time   Precision
Wikipedia toy corpus     structured       100     3          2min   100%
Wikipedia subcorpus      semi-structured  2000    15         15h    94%
News article toy corpus  unstructured     150     1          24min  91%
Biographies from Web     unstructured     3440    5          15h    90%
Conclusion
SOFIE unifies the tasks of
- entity disambiguation
- pattern extraction
- semantic constraint reasoning
in a single framework, delivering
- canonicalized facts
- of high precision (experiments show 90% precision)
died in England... but is alive!
SOFIE rules!

occurs(P,WX,WY)
/\ refersTo(WX,X)
/\ refersTo(WY,Y)
/\ R(X,Y)
=> expresses(P,R)

occurs(P,WX,WY)
/\ expresses(P,R)
/\ refersTo(WX,X)
/\ refersTo(WY,Y)
/\ domain(R,D1)
/\ range(R,D2)
/\ type(X,D1)
/\ type(Y,D2)
=> R(X,Y)

R(X,Y)
/\ R(X,Z)
/\ type(R,function)
=> Y = Z

disambiguationPrior(W,X) => refersTo(W,X)

bornInYear(X,B) /\ diedInYear(X,D) => B<D
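As an illustration of the functional-relation rule (R(X,Y) and R(X,Z) with type(R,function) imply Y = Z), the sketch below lists the fact pairs that would violate it. It treats the rule as a hard check, which is a simplification of the weighted setting, and the function name and data are hypothetical.

```python
def functional_violations(facts, functional_relations):
    """Return (relation, subject, first_object, conflicting_object) tuples."""
    first_seen = {}
    violations = []
    for rel, x, y in facts:
        if rel not in functional_relations:
            continue  # non-functional relations may have several objects
        key = (rel, x)
        if key in first_seen and first_seen[key] != y:
            violations.append((rel, x, first_seen[key], y))
        else:
            first_seen.setdefault(key, y)
    return violations


# Hypothetical candidate facts, echoing the inconsistent-facts example.
facts = [
    ("diedInPlace", "Elvis", "England"),
    ("diedInPlace", "Elvis", "Germany"),
    ("bornInPlace", "Elvis", "USA"),
]
violations = functional_violations(facts, {"diedInPlace", "bornInPlace"})
print(violations)
```

In the weighted formulation such a conflict does not reject either fact outright; it means both cannot be true at once, so the solver keeps the better-supported one.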
SOFIE: Experiments
Corpus                   Type                                     # Docs  Relations  Time   Precision  Recall
Wikipedia toy corpus     structured                               100     3          8min   100%       80%
Wikipedia toy corpus     semi-structured (50% infoboxes removed)  100     3          8min   100%       57%
Wikipedia subcorpus      semi-structured                          2000    15         15h    94%        ?
News article toy corpus  unstructured                             150     1          24min  91%        24%, 31%
  Snowball                                                                                  56%        31%
Biographies from Web     unstructured                             3440    5          15h    90%        ?
SOFIE: Large-Scale Experiment
Goal:
Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf
Corpus:
3700 biography documents downloaded from the Web
Runtime: (summed over 5 batches)
Parsing 7:05h
Hypothesis Generation 6:15h
Solving 2:30h
Total 15:50h
Results: (precision in %)
bornIn bornOnD diedIn diedOnD polOf
87 87 13 98 95 90
SOFIE: Relation to Markov Logic
[Figure: probability P of bornIn(Nicholas, Patras) being false or true]
r(x,y) /\ s(x,z) => t(x,z) [w]
...
$P(X) \sim e^{\sum_i \mathrm{sat}(i,X)\, w_i}$
sat(i,X): number of satisfied instances of the ith formula
w_i: weight of the ith formula
$\max_X \; e^{\sum_i \mathrm{sat}(i,X)\, w_i}$
$\max_X \; \log\left( e^{\sum_i \mathrm{sat}(i,X)\, w_i} \right)$
$\max_X \; \sum_i \mathrm{sat}(i,X)\, w_i$
~~~~> Weighted MAX SAT problem
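The step from probabilities to weights relies on the exponential and the logarithm being monotonic, so the most probable world is also the one with the largest weighted count of satisfied formula instances. A small numeric sketch shows this. The two worlds, their counts, and the weights are hypothetical.

```python
import math


def score(sat_counts, weights):
    """Weighted number of satisfied formula instances: sum_i sat(i,X) * w_i."""
    return sum(s * w for s, w in zip(sat_counts, weights))


# Hypothetical example: two possible worlds and two weighted formulas.
weights = [1.5, 0.5]
worlds = {"bornIn=true": [1, 2], "bornIn=false": [0, 1]}

scores = {name: score(counts, weights) for name, counts in worlds.items()}
normalizer = sum(math.exp(s) for s in scores.values())
probabilities = {name: math.exp(s) / normalizer for name, s in scores.items()}

most_probable = max(probabilities, key=probabilities.get)
heaviest = max(scores, key=scores.get)
print(most_probable, heaviest)
```

Both selections pick the same world, and the normalizing constant never has to be computed when only the maximizer is needed.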
Grounding
r(X,Y) & s(Y) => t(X,Y)
{ r(X,Y), s(Y), t(X,Y) }
Entities={a,b}
Ground instances:
  { r(a,a), s(a), t(a,a) }
  { r(a,b), s(b), t(a,b) }
  { r(b,a), s(a), t(b,a) }
  { r(b,b), s(b), t(b,b) }
Immutable, complete facts (e.g. pattern occurrences): r(a,a), r(a,b), r(b,a), r(b,b)
With the fact r(a,a) [w], the instance { r(a,a), s(a), t(a,a) } reduces to { s(a), t(a,a) } [w]
Resulting weighted clauses:
  { s(a), t(a,a) } [w1]
  {p(c,d), q(e), } [w2]
Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized
  means(Elvis@D15, ElvisPresley) = true ?
  mapsTo("died in", diedInPlace) = true ?
  diedIn(ElvisPresley, England) = true ?
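Grounding by exhaustive substitution can be sketched in a few lines. The representation (a literal as a predicate name plus a tuple of variable names) and the function name are assumptions made for illustration; negation signs are left out, as on the slide.

```python
from itertools import product


def ground(variables, literals, entities):
    """Substitute every combination of entities for the variables of one clause."""
    instances = []
    for values in product(entities, repeat=len(variables)):
        binding = dict(zip(variables, values))
        instances.append([(pred, tuple(binding[arg] for arg in args))
                          for pred, args in literals])
    return instances


# The clause { r(X,Y), s(Y), t(X,Y) } over Entities = {a, b}.
literals = [("r", ("X", "Y")), ("s", ("Y",)), ("t", ("X", "Y"))]
instances = ground(["X", "Y"], literals, ["a", "b"])
for instance in instances:
    print(instance)
```

With k variables and n entities this produces n^k instances per clause, which illustrates why the immutable facts are used to discard or shorten instances before solving.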