Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge...
-
Upload
lillian-harrell -
Category
Documents
-
view
218 -
download
1
Transcript of Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge...
Outline
1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge
5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up
• Goal• Extraction from text• Consistency reasoning• Extraction from Tables• Open IE
Source-centric IE vs. Yield-centric IE
many sources
one source
Surajit obtained hisPhD in CS from Stanford ...
Document 1:instanceOf (Surajit, scientist)inField (Surajit, c.science)almaMater (Surajit, Stanford U)…
Yield-centric IE
Student UniversitySurajit Chaudhuri Stanford UJim Gray UC Berkeley … …
Student AdvisorSurajit Chaudhuri Jeffrey UllmanJim Gray Mike Harrison … …
1) recall !2) precision
1) precision !2) recall
Source-centric IE
worksAt
hasAdvisor
+ (optional)targetedrelations
2
We focus on yield-centric IE
many sources
Yield-centric IE
Student UniversitySurajit Chaudhuri Stanford UJim Gray UC Berkeley … …
Student AdvisorSurajit Chaudhuri Jeffrey UllmanJim Gray Mike Harrison … …
1) precision !2) recall
worksAt
hasAdvisor
+ (optional)targetedrelations
3
Goal: Find facts of given binary relations
...find instances of these relationshasAdvisor (JimGray, MikeHarrison)hasAdvisor (Susan Davidson, Hector Garcia-Molina)graduatedAt (JimGray, Berkeley)graduatedAt (HectorGarcia-Molina, Stanford)bornOn (JohnLennon, 9-Oct-1940)
Given binary relations with type signaturehasAdvisor: Person PersongraduatedAt: Person UniversitybornOn: Person Date
4
Facts Patterns(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)
& Fact CandidatesX and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
… • good for recall• noisy, drifting• not robust enough for high precision
(Surajit, Jeff)
(Sunita, Mike)(Alon, Jeff)
(Renee, Yannis)
(Surajit, Microsoft)
(Sunita, Soumen)
(Surajit, Moshe)(Alon, Larry)
(Soumen, Sunita)
Facts yield patterns – and vice versa
5[Brin@WebDB1998 "DIPRE"; Agichtein@SIGMOD2001 "Snowball"]
Facts Patterns(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)
& Fact CandidatesX and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
… • good for recall• noisy, drifting• not robust enough for high precision
(Surajit, Jeff)
(Sunita, Mike)(Alon, Jeff)
(Renee, Yannis)
(Surajit, Microsoft)
(Sunita, Soumen)
(Surajit, Moshe)(Alon, Larry)
(Soumen, Sunita)
Facts yield patterns – and vice versa
6
Extensions:1. use statistics to estimate the trustworthiness of patterns2. use counter examples to "punish" bad patterns
[Ravichandran 2002; Suchanek 2006; ...]
3. use deep parsing to generalize patterns[Bunescu 2005 , Suchanek 2006, …]
Outline
1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge
5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up
• Goal √• Extraction from text √• Consistency reasoning• Extraction from Tables• Open IE
Reasoning
[Suchanek@WWW2009] 8
occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…
Einstein died in 1955
Reasoning
[Suchanek@WWW2009] 9
occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…
Solving a weightedMAX SAT problemat scale
Reasoning
[Suchanek@WWW2009] 10
occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…
Reasoning
[Suchanek@WWW2009] 11
occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…
Extensions:1. parallelize the reasoning by performing a min cut on the dependency graph [Nakashole@WSDM2011 "Prospera"]
2. use Markov logic networks to represent the entire joint probability distribution [M. Richardson / P. Domingos 2006]
MLN>
Using Markov Logic Networks
12
We can model/compute• the marginal probabilities• the joint distribution• the MAP (=maximum a posteriori), i.e. the most likely world
World 1: World 2:
…
Probability:
Application: Extracting facts at large scale[Zhu@WWW2009 "StatSnowball", "EntityCube"]
528528
Outline
1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge
5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up
• Goal √• Extraction from text √• Consistency reasoning √• Extraction from Tables• Open IE
tables>
Web Tables provide relational information[Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]
14
Web Tables can be annotated with YAGO[Limaye, Sarawagi, Chakrabarti: PVLDB 10]
Goal: enable semantic search over Web tables
Idea:• Map column headers to Yago classes,• Map cell values to Yago entities• Using joint inference for factor-graph learning model
15
Title Author
A short history of time S Hawkins
D AdamsHitchhiker's guide
Book Person
Entity
hasAuthorwebtables>
Statistics yield semantics of Web tables
[Venetis,Halevy et al: PVLDB 11]
Idea: Infer classes from co-occurrences, headers are class names
𝑃 (𝑣𝑎𝑙1 ,…,𝑣𝑎𝑙𝑛|𝑐𝑙𝑎𝑠𝑠 )∝∏ 𝑃 (𝑐𝑙𝑎𝑠𝑠∨𝑣𝑎𝑙𝑖)𝑃 (𝑐𝑙𝑎𝑠𝑠)
Result from 12 Mio. Web tables:• 1.5 Mio. labeled columns (=classes)• 155 Mio. instances (=values) 16
but: classes&entities not canonicalized. Instances may include: Google Inc., Google, NASDAQ GOOG, Google search engine, … Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, …
ID-Based Extraction
887128476661
• Unique identifiers exist for books (ISBN), products (GTIN), companies (VAT), people (emails*), etc.
• Unique identifiers can be found by regular expression + check digit verification
id Name URL
123 Puma PowerTech url1123 Please choose url1123 Puma PowerTech url2123 Puma Power Shoe url2124 Puma Slow Cat url3779 Please choose url3779 Canon PowerShot url3…
ID-Based Extraction
ID-Based Extraction
[Talaika@WebDB2015 "IBEX"]
Outline
1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge
5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up
• Goal √• Extraction from text √• Consistency reasoning √• Extraction from Tables √• Open IE
Open Information Extraction
So far we assumed given relations with type signatures <entity1, relation, entity2>
< CarlaBruni marriedTo NicolasSarkozy> Person R Person < NataliePortman wonAward AcademyAward > Person R Prize
Open IE aims to discover new entities and new relation types <name1, phrase, name2>
Madame Bruni in her happy marriage with Sarko…
21<Madame Bruni, her happy marriage with, Sarko>
details>
Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012, Mausam 2012]
Idea: Consider all subject-verb-object triples as facts.
Problem 1: uninformative extractions “Gold has an atomic weight of 196” <Gold,has,atomicweight> “Faust made a deal with the devil” <Faust, made, a deal>
Solutions: 1. enforce regular expressions over POS tags, such as VB (N | ADJ | ADV | PRN | DET)* PREP2. require relation phrase appear with many distinct arg pairs3. intersect with Freebase
Problem 2: over-specific extractions “Elvis is the first and greatest rock and roll star of America” <..., is the first and greatest rock and roll star of, …>
22
Syntactic-Lexical-Ontological (SOL) patterns combine1. ontological types2. lexical surface form3. syntactic properties
Amy Winehouse’s cosy voice in her song ‘Rehab’Jim Morrison’s haunting voice and charisma in ‘The End’Joan Baez’s angel-like voice in ‘Farewell Angelina’
SOL pattern: <singer> ’s ADJECTIVE voice * in <song>
[Nakashole@EMNLP2012 "PATTY"]
24
Enhanced Patterns
Patterns can subsume each other: "wife of" => "spouse of"… which means that we can create synsets of patternsand arrange them in a taxonomy.
350 000 SOL patterns with 4 Mio. instancesaccessible at: www.mpi-inf.mpg.de/yago-naga/patty 25
[Nakashole@EMNLP2012 "PATTY"]
Enhanced Patterns
Open Problems and Grand Challenges
Real-time & incremental fact extractionfor continuous KB growth & maintenance(life-cycle management over years and decades)
Extensions to ternary & higher-arity relationsevents in context: who did what to/with whom when where why …?
Robust fact extraction with both high precision & recallas highly automated (self-tuning) as possible
Extend the approaches to other languages
26