Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge...

26
Outline 1. Introduction 2. Harvesting Classes 3. Harvesting Facts 4. Common Sense Knowledge 5. Knowledge Consolidation 6. Web Content Analytics 7. Wrap-Up Goal Extraction from text Consistency reasoning Extraction from Tables Open IE

Transcript of Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge...

Page 1: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Outline

1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge

5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up

• Goal• Extraction from text• Consistency reasoning• Extraction from Tables• Open IE

Page 2: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Source-centric IE vs. Yield-centric IE

many sources

one source

Surajit obtained hisPhD in CS from Stanford ...

Document 1:instanceOf (Surajit, scientist)inField (Surajit, c.science)almaMater (Surajit, Stanford U)…

Yield-centric IE

Student UniversitySurajit Chaudhuri Stanford UJim Gray UC Berkeley … …

Student AdvisorSurajit Chaudhuri Jeffrey UllmanJim Gray Mike Harrison … …

1) recall !2) precision

1) precision !2) recall

Source-centric IE

worksAt

hasAdvisor

+ (optional)targetedrelations

2

Page 3: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

We focus on yield-centric IE

many sources

Yield-centric IE

Student UniversitySurajit Chaudhuri Stanford UJim Gray UC Berkeley … …

Student AdvisorSurajit Chaudhuri Jeffrey UllmanJim Gray Mike Harrison … …

1) precision !2) recall

worksAt

hasAdvisor

+ (optional)targetedrelations

3

Page 4: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Goal: Find facts of given binary relations

...find instances of these relationshasAdvisor (JimGray, MikeHarrison)hasAdvisor (Susan Davidson, Hector Garcia-Molina)graduatedAt (JimGray, Berkeley)graduatedAt (HectorGarcia-Molina, Stanford)bornOn (JohnLennon, 9-Oct-1940)

Given binary relations with type signaturehasAdvisor: Person PersongraduatedAt: Person UniversitybornOn: Person Date

4

Page 5: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Facts Patterns(JimGray, MikeHarrison)

(BarbaraLiskov, JohnMcCarthy)

& Fact CandidatesX and his advisor Y

X under the guidance of Y

X and Y in their paper

X co-authored with Y

X rarely met his advisor Y

… • good for recall• noisy, drifting• not robust enough for high precision

(Surajit, Jeff)

(Sunita, Mike)(Alon, Jeff)

(Renee, Yannis)

(Surajit, Microsoft)

(Sunita, Soumen)

(Surajit, Moshe)(Alon, Larry)

(Soumen, Sunita)

Facts yield patterns – and vice versa

5[Brin@WebDB1998 "DIPRE"; Agichtein@SIGMOD2001 "Snowball"]

Page 6: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Facts Patterns(JimGray, MikeHarrison)

(BarbaraLiskov, JohnMcCarthy)

& Fact CandidatesX and his advisor Y

X under the guidance of Y

X and Y in their paper

X co-authored with Y

X rarely met his advisor Y

… • good for recall• noisy, drifting• not robust enough for high precision

(Surajit, Jeff)

(Sunita, Mike)(Alon, Jeff)

(Renee, Yannis)

(Surajit, Microsoft)

(Sunita, Soumen)

(Surajit, Moshe)(Alon, Larry)

(Soumen, Sunita)

Facts yield patterns – and vice versa

6

Extensions:1. use statistics to estimate the trustworthiness of patterns2. use counter examples to "punish" bad patterns

[Ravichandran 2002; Suchanek 2006; ...]

3. use deep parsing to generalize patterns[Bunescu 2005 , Suchanek 2006, …]

Page 7: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Outline

1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge

5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up

• Goal √• Extraction from text √• Consistency reasoning• Extraction from Tables• Open IE

Page 8: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Reasoning

[Suchanek@WWW2009] 8

occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…

Einstein died in 1955

Page 9: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Reasoning

[Suchanek@WWW2009] 9

occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…

Solving a weightedMAX SAT problemat scale

Page 10: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Reasoning

[Suchanek@WWW2009] 10

occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…

Page 11: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Reasoning

[Suchanek@WWW2009] 11

occurs("Elvis","died in",528)occurs("Einstein","died in",1955)died(Einstein,1955), born(Elvis, 1935)occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)born(X,Y) & died(X,Z) => Z>Y…

Extensions:1. parallelize the reasoning by performing a min cut on the dependency graph [Nakashole@WSDM2011 "Prospera"]

2. use Markov logic networks to represent the entire joint probability distribution [M. Richardson / P. Domingos 2006]

MLN>

Page 12: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Using Markov Logic Networks

12

We can model/compute• the marginal probabilities• the joint distribution• the MAP (=maximum a posteriori), i.e. the most likely world

World 1: World 2:

Probability:

Application: Extracting facts at large scale[Zhu@WWW2009 "StatSnowball", "EntityCube"]

528528

Page 13: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Outline

1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge

5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up

• Goal √• Extraction from text √• Consistency reasoning √• Extraction from Tables• Open IE

tables>

Page 14: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Web Tables provide relational information[Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]

14

Page 15: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Web Tables can be annotated with YAGO[Limaye, Sarawagi, Chakrabarti: PVLDB 10]

Goal: enable semantic search over Web tables

Idea:• Map column headers to Yago classes,• Map cell values to Yago entities• Using joint inference for factor-graph learning model

15

Title Author

A short history of time S Hawkins

D AdamsHitchhiker's guide

Book Person

Entity

hasAuthorwebtables>

Page 16: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Statistics yield semantics of Web tables

[Venetis,Halevy et al: PVLDB 11]

Idea: Infer classes from co-occurrences, headers are class names

𝑃 (𝑣𝑎𝑙1 ,…,𝑣𝑎𝑙𝑛|𝑐𝑙𝑎𝑠𝑠 )∝∏ 𝑃 (𝑐𝑙𝑎𝑠𝑠∨𝑣𝑎𝑙𝑖)𝑃 (𝑐𝑙𝑎𝑠𝑠)

Result from 12 Mio. Web tables:• 1.5 Mio. labeled columns (=classes)• 155 Mio. instances (=values) 16

but: classes&entities not canonicalized. Instances may include: Google Inc., Google, NASDAQ GOOG, Google search engine, … Jet Li, Li Lianjie,  Ley Lin Git, Li Yangzhong, Nameless hero, …

Page 17: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

ID-Based Extraction

887128476661

• Unique identifiers exist for books (ISBN), products (GTIN), companies (VAT), people (emails*), etc.

• Unique identifiers can be found by regular expression + check digit verification

Page 18: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

id Name URL

123 Puma PowerTech url1123 Please choose url1123 Puma PowerTech url2123 Puma Power Shoe url2124 Puma Slow Cat url3779 Please choose url3779 Canon PowerShot url3…

ID-Based Extraction

Page 19: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

ID-Based Extraction

[Talaika@WebDB2015 "IBEX"]

Page 20: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Outline

1. Introduction2. Harvesting Classes 3. Harvesting Facts4. Common Sense Knowledge

5. Knowledge Consolidation6. Web Content Analytics7. Wrap-Up

• Goal √• Extraction from text √• Consistency reasoning √• Extraction from Tables √• Open IE

Page 21: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Open Information Extraction

So far we assumed given relations with type signatures <entity1, relation, entity2>

< CarlaBruni marriedTo NicolasSarkozy> Person R Person < NataliePortman wonAward AcademyAward > Person R Prize

Open IE aims to discover new entities and new relation types <name1, phrase, name2>

Madame Bruni in her happy marriage with Sarko…

21<Madame Bruni, her happy marriage with, Sarko>

details>

Page 22: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012, Mausam 2012]

Idea: Consider all subject-verb-object triples as facts.

Problem 1: uninformative extractions “Gold has an atomic weight of 196” <Gold,has,atomicweight> “Faust made a deal with the devil” <Faust, made, a deal>

Solutions: 1. enforce regular expressions over POS tags, such as VB (N | ADJ | ADV | PRN | DET)* PREP2. require relation phrase appear with many distinct arg pairs3. intersect with Freebase

Problem 2: over-specific extractions “Elvis is the first and greatest rock and roll star of America” <..., is the first and greatest rock and roll star of, …>

22

Page 23: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

23http://openie.cs.washington.edu/

PATTY>

Page 24: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Syntactic-Lexical-Ontological (SOL) patterns combine1. ontological types2. lexical surface form3. syntactic properties

Amy Winehouse’s cosy voice in her song ‘Rehab’Jim Morrison’s haunting voice and charisma in ‘The End’Joan Baez’s angel-like voice in ‘Farewell Angelina’

SOL pattern: <singer> ’s ADJECTIVE voice * in <song>

[Nakashole@EMNLP2012 "PATTY"]

24

Enhanced Patterns

Patterns can subsume each other: "wife of" => "spouse of"… which means that we can create synsets of patternsand arrange them in a taxonomy.

Page 25: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

350 000 SOL patterns with 4 Mio. instancesaccessible at: www.mpi-inf.mpg.de/yago-naga/patty 25

[Nakashole@EMNLP2012 "PATTY"]

Enhanced Patterns

Page 26: Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge Consolidation 6.Web Content Analytics 7.Wrap-Up Goal.

Open Problems and Grand Challenges

Real-time & incremental fact extractionfor continuous KB growth & maintenance(life-cycle management over years and decades)

Extensions to ternary & higher-arity relationsevents in context: who did what to/with whom when where why …?

Robust fact extraction with both high precision & recallas highly automated (self-tuning) as possible

Extend the approaches to other languages

26