Outline 1.Introduction 2.Harvesting Classes 3.Harvesting Facts 4.Common Sense Knowledge 5.Knowledge...

Post on 26-Dec-2015


Outline

1. Introduction
2. Harvesting Classes
3. Harvesting Facts
   • Goal
   • Extraction from text
   • Consistency reasoning
   • Extraction from Tables
   • Open IE
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up

Source-centric IE vs. Yield-centric IE

Source-centric IE (one source)
Priorities: 1) recall! 2) precision

Example text: Surajit obtained his PhD in CS from Stanford ...

Document 1:
instanceOf (Surajit, scientist)
inField (Surajit, c.science)
almaMater (Surajit, Stanford U)
…

Yield-centric IE (many sources) + (optional) targeted relations
Priorities: 1) precision! 2) recall

worksAt
Student            University
Surajit Chaudhuri  Stanford U
Jim Gray           UC Berkeley
…                  …

hasAdvisor
Student            Advisor
Surajit Chaudhuri  Jeffrey Ullman
Jim Gray           Mike Harrison
…                  …

We focus on yield-centric IE


Goal: Find facts of given binary relations

Given binary relations with type signature
hasAdvisor: Person × Person
graduatedAt: Person × University
bornOn: Person × Date

... find instances of these relations
hasAdvisor (JimGray, MikeHarrison)
hasAdvisor (Susan Davidson, Hector Garcia-Molina)
graduatedAt (JimGray, Berkeley)
graduatedAt (HectorGarcia-Molina, Stanford)
bornOn (JohnLennon, 9-Oct-1940)
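A type signature can act as a cheap filter on fact candidates. The following is a minimal sketch of that idea; the `ENTITY_TYPES` table and the function name `matches_signature` are illustrative assumptions, not part of the tutorial.

```python
# Hypothetical type assignments for a few entities (illustrative only).
ENTITY_TYPES = {
    "JimGray": "Person",
    "MikeHarrison": "Person",
    "Berkeley": "University",
    "JohnLennon": "Person",
    "9-Oct-1940": "Date",
}

# Type signatures of the target relations, as listed on the slide.
SIGNATURES = {
    "hasAdvisor": ("Person", "Person"),
    "graduatedAt": ("Person", "University"),
    "bornOn": ("Person", "Date"),
}


def matches_signature(relation, subject, obj):
    """Return True if both arguments have the types the relation expects."""
    expected = SIGNATURES[relation]
    actual = (ENTITY_TYPES.get(subject), ENTITY_TYPES.get(obj))
    return actual == expected
```

With these tables, `matches_signature("graduatedAt", "JimGray", "MikeHarrison")` is rejected because the second argument is a Person rather than a University.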

Facts yield patterns, and vice versa
[Brin@WebDB1998 "DIPRE"; Agichtein@SIGMOD2001 "Snowball"]

Facts:
(JimGray, MikeHarrison)
(BarbaraLiskov, JohnMcCarthy)

Patterns:
X and his advisor Y
X under the guidance of Y
X and Y in their paper
X co-authored with Y
X rarely met his advisor Y
…

Fact candidates:
(Surajit, Jeff)
(Sunita, Mike)
(Alon, Jeff)
(Renee, Yannis)
(Surajit, Microsoft)
(Sunita, Soumen)
(Surajit, Moshe)
(Alon, Larry)
(Soumen, Sunita)

• good for recall
• noisy, drifting
• not robust enough for high precision
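The fact-pattern duality can be sketched as a small bootstrapping loop. This is a toy illustration in the spirit of DIPRE/Snowball, not either system's actual algorithm: the corpus sentences are invented, a "pattern" is simply the text between X and Y, and all function names are hypothetical.

```python
import re

# Toy corpus (invented sentences, for illustration only).
CORPUS = [
    "JimGray and his advisor MikeHarrison wrote a report.",
    "BarbaraLiskov under the guidance of JohnMcCarthy finished her thesis.",
    "Surajit and his advisor Jeff met weekly.",
    "Alon under the guidance of Jeff studied databases.",
    "Surajit co-authored with Moshe last year.",
    "Alon co-authored with Jeff on a survey.",
]


def learn_patterns(facts, corpus):
    """Collect the text between X and Y in every sentence that contains a known fact."""
    patterns = set()
    for x, y in facts:
        regex = re.compile(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y))
        for sentence in corpus:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, corpus):
    """Find (X, Y) pairs connected by one of the learned patterns."""
    facts = set()
    for pattern in patterns:
        regex = re.compile(r"(\w+)\s+" + re.escape(pattern) + r"\s+(\w+)")
        for sentence in corpus:
            facts.update(regex.findall(sentence))
    return facts


def bootstrap(seeds, corpus, rounds=2):
    """Alternate between learning patterns from facts and facts from patterns."""
    facts, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= learn_patterns(facts, corpus)
        facts |= apply_patterns(patterns, corpus)
    return facts, patterns


seeds = {("JimGray", "MikeHarrison"), ("BarbaraLiskov", "JohnMcCarthy")}
facts, patterns = bootstrap(seeds, CORPUS)
```

In the second round the loop learns "co-authored with" from (Alon, Jeff) and then extracts (Surajit, Moshe), which is exactly the kind of drift the slide warns about.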


Extensions:
1. use statistics to estimate the trustworthiness of patterns
2. use counter examples to "punish" bad patterns [Ravichandran 2002; Suchanek 2006; ...]
3. use deep parsing to generalize patterns [Bunescu 2005; Suchanek 2006; …]
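Extensions 1 and 2 can be illustrated with one simple statistic: the share of a pattern's extractions that hit known facts rather than known counter examples. This is a sketch under assumptions; the cited systems use their own scoring functions, and the counter examples and per-pattern extractions below are made up.

```python
def pattern_confidence(extractions, positives, negatives):
    """Share of extractions that are known facts, among those with a known label.

    Extractions that are neither known facts nor counter examples are ignored.
    """
    hits = sum(1 for pair in extractions if pair in positives)
    misses = sum(1 for pair in extractions if pair in negatives)
    if hits + misses == 0:
        return 0.0
    return hits / (hits + misses)


# Known facts and (hypothetical) counter examples for hasAdvisor.
positives = {("JimGray", "MikeHarrison"), ("BarbaraLiskov", "JohnMcCarthy")}
negatives = {("Surajit", "Microsoft"), ("Soumen", "Sunita")}

# Hypothetical extractions per pattern.
extractions_by_pattern = {
    "X and his advisor Y": [
        ("JimGray", "MikeHarrison"),
        ("BarbaraLiskov", "JohnMcCarthy"),
    ],
    "X and Y in their paper": [
        ("JimGray", "MikeHarrison"),
        ("Soumen", "Sunita"),
        ("Surajit", "Microsoft"),
    ],
}

confidence = {
    pattern: pattern_confidence(pairs, positives, negatives)
    for pattern, pairs in extractions_by_pattern.items()
}
```

A pattern whose confidence falls below some threshold can then be dropped before the next bootstrapping round.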

Outline

1. Introduction
2. Harvesting Classes
3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning
   • Extraction from Tables
   • Open IE
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up

Reasoning
[Suchanek@WWW2009]

Example sentence: Einstein died in 1955

occurs("Elvis","died in",528)
occurs("Einstein","died in",1955)
died(Einstein,1955), born(Elvis, 1935)
occurs(X',P,Y) & means(X',X) & R(X,Y) => pattern(P,R)
occurs(X',P,Y) & means(X',X) & pattern(P,R) => R(X,Y)
born(X,Y) & died(X,Z) => Z>Y
…

Solving a weighted MAX SAT problem at scale
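To make the weighted MAX SAT view concrete, the rules above can be grounded for the Elvis/Einstein example and solved by brute force. This is only a sketch: the clause weights are invented, the `occurs` and `means` atoms are treated as given and folded into the grounding, and exhaustive enumeration stands in for the scalable solver the slide refers to.

```python
from itertools import product

VARIABLES = [
    "died(Einstein,1955)",
    "born(Elvis,1935)",
    "pattern(died in,died)",
    "died(Elvis,528)",
]

# Each clause is (weight, literals); a literal is (variable, required truth value).
# A clause is satisfied if at least one of its literals holds. Weights are made up.
CLAUSES = [
    (10, [("died(Einstein,1955)", True)]),                                   # known fact
    (10, [("born(Elvis,1935)", True)]),                                      # known fact
    (2, [("died(Einstein,1955)", False), ("pattern(died in,died)", True)]),  # fact => pattern
    (1, [("pattern(died in,died)", False), ("died(Elvis,528)", True)]),      # pattern => fact
    (10, [("born(Elvis,1935)", False), ("died(Elvis,528)", False)]),         # 528 > 1935 fails
]


def satisfied_weight(world):
    """Total weight of the clauses that hold in the given truth assignment."""
    return sum(
        weight
        for weight, literals in CLAUSES
        if any(world[var] == value for var, value in literals)
    )


def solve_max_sat():
    """Try every truth assignment and keep the one with the highest satisfied weight."""
    best_world, best_weight = None, -1
    for values in product([False, True], repeat=len(VARIABLES)):
        world = dict(zip(VARIABLES, values))
        weight = satisfied_weight(world)
        if weight > best_weight:
            best_world, best_weight = world, weight
    return best_world, best_weight


best_world, best_weight = solve_max_sat()
```

The best assignment accepts the pattern "died in" but rejects died(Elvis,528): violating the weak pattern rule costs less than violating the born-before-died constraint.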

Extensions:
1. parallelize the reasoning by performing a min cut on the dependency graph [Nakashole@WSDM2011 "Prospera"]
2. use Markov logic networks to represent the entire joint probability distribution [M. Richardson / P. Domingos 2006]

Using Markov Logic Networks

We can model/compute
• the marginal probabilities
• the joint distribution
• the MAP (= maximum a posteriori), i.e. the most likely world

[Figure: two candidate worlds, World 1 and World 2, each with its probability]

Application: Extracting facts at large scale [Zhu@WWW2009 "StatSnowball", "EntityCube"]
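The three quantities can be computed by enumeration for a tiny ground network. In a Markov logic network a world's probability is proportional to the exponential of the summed weights of the ground formulas it satisfies; the two atoms, the three formulas, and their weights below are made-up illustrations, and real systems use approximate inference instead of enumerating worlds.

```python
from itertools import product
from math import exp

ATOMS = ["pattern", "died_elvis_528"]

# Weighted ground formulas: (weight, predicate over a world). Weights are invented.
FORMULAS = [
    (2.0, lambda w: w["pattern"]),                                # evidence supports the pattern
    (1.0, lambda w: (not w["pattern"]) or w["died_elvis_528"]),   # pattern => fact
    (3.0, lambda w: not w["died_elvis_528"]),                     # dying before birth is implausible
]


def unnormalized(world):
    """exp of the total weight of the formulas satisfied in this world."""
    return exp(sum(weight for weight, holds in FORMULAS if holds(world)))


# Joint distribution over all worlds.
worlds = [dict(zip(ATOMS, values)) for values in product([False, True], repeat=len(ATOMS))]
partition = sum(unnormalized(w) for w in worlds)
probability = [unnormalized(w) / partition for w in worlds]

# MAP: the single most likely world.
map_world = worlds[probability.index(max(probability))]

# Marginal probability that the doubtful fact holds.
marginal_fact = sum(p for w, p in zip(worlds, probability) if w["died_elvis_528"])
```

Unlike the MAX SAT solution, which returns only the best world, the marginal gives a graded belief for each fact.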

Outline

1. Introduction
2. Harvesting Classes
3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning √
   • Extraction from Tables
   • Open IE
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up

Web Tables provide relational information
[Cafarella et al: PVLDB 08; Sarawagi et al: PVLDB 09]

Web Tables can be annotated with YAGO
[Limaye, Sarawagi, Chakrabarti: PVLDB 10]

Goal: enable semantic search over Web tables

Idea:
• Map column headers to YAGO classes
• Map cell values to YAGO entities
• Use joint inference in a factor-graph learning model

Title                    Author
A short history of time  S Hawkins
Hitchhiker's guide       D Adams

[Figure: column Title is annotated with class Book, column Author with class Person (both under Entity), and the column pair with relation hasAuthor]

Statistics yield semantics of Web tables
[Venetis, Halevy et al: PVLDB 11]

Idea: Infer classes from co-occurrences; headers are class names

\[ P(val_1, \ldots, val_n \mid class) \propto \prod_{i=1}^{n} \frac{P(class \mid val_i)}{P(class)} \]

Result from 12 million Web tables:
• 1.5 million labeled columns (= classes)
• 155 million instances (= values)

But: classes & entities are not canonicalized. Instances may include:
Google Inc., Google, NASDAQ GOOG, Google search engine, …
Jet Li, Li Lianjie, Ley Lin Git, Li Yangzhong, Nameless hero, …
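The class-scoring idea can be sketched as follows, reading the slide's formula as a product over the column's values of P(class | val_i) / P(class). The (value, class) counts are invented, and the probability floor that avoids log(0) is an assumption of this sketch, not something the slide specifies.

```python
from collections import defaultdict
from math import log

# Hypothetical (value, class) co-occurrence counts, e.g. from "X is a Y" text patterns.
ISA_COUNTS = {
    ("Google", "company"): 80, ("Google", "search engine"): 20,
    ("Microsoft", "company"): 90, ("Microsoft", "search engine"): 10,
    ("Apple", "company"): 70, ("Apple", "fruit"): 30,
    ("Banana", "fruit"): 100,
}


def best_class(values, counts, floor=1e-6):
    """Score each class by sum_i log(P(class | val_i) / P(class)) and return the best."""
    value_totals = defaultdict(int)
    class_totals = defaultdict(int)
    for (value, cls), n in counts.items():
        value_totals[value] += n
        class_totals[cls] += n
    total = sum(class_totals.values())

    scores = {}
    for cls, cls_count in class_totals.items():
        prior = cls_count / total
        score = 0.0
        for value in values:
            seen = value_totals.get(value, 0)
            if seen == 0:
                continue  # values never seen with any class carry no evidence
            conditional = counts.get((value, cls), 0) / seen
            score += log(max(conditional, floor) / prior)
        scores[cls] = score
    return max(scores, key=scores.get), scores


label, scores = best_class(["Google", "Microsoft", "Apple"], ISA_COUNTS)
```

Dividing by the prior favors specific classes over very frequent ones, so a short column can be pulled toward a rare class that happens to cover its values.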

ID-Based Extraction
[Talaika@WebDB2015 "IBEX"]

Example identifier: 887128476661

• Unique identifiers exist for books (ISBN), products (GTIN), companies (VAT), people (emails*), etc.
• Unique identifiers can be found by regular expression + check digit verification

id   Name             URL
123  Puma PowerTech   url1
123  Please choose    url1
123  Puma PowerTech   url2
123  Puma Power Shoe  url2
124  Puma Slow Cat    url3
779  Please choose    url3
779  Canon PowerShot  url3
…
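"Regular expression + check digit verification" can be sketched for GTINs as below. The check follows the standard GS1 rule (weights 3, 1, 3, ... from the rightmost data digit), which also covers ISBN-13; the regular expression and function names are this sketch's own choices, not taken from IBEX.

```python
import re

# Candidate GTINs: runs of 14, 13, 12 or 8 digits that are not part of a longer number.
GTIN_CANDIDATE = re.compile(r"(?<!\d)(?:\d{14}|\d{13}|\d{12}|\d{8})(?!\d)")


def has_valid_check_digit(code):
    """GS1 rule: weight the data digits 3, 1, 3, ... from the right;
    the last digit must bring the weighted sum up to a multiple of 10."""
    digits = [int(c) for c in code]
    data, check = digits[:-1], digits[-1]
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(data)))
    return (10 - total % 10) % 10 == check


def find_gtins(text):
    """Return all digit runs in the text that look like a GTIN and pass the check."""
    return [c for c in GTIN_CANDIDATE.findall(text) if has_valid_check_digit(c)]
```

The 12-digit number shown on the slide, 887128476661, passes this check. The check digit filters out most random digit runs (order numbers, phone numbers) that the regular expression alone would accept.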

Outline

1. Introduction
2. Harvesting Classes
3. Harvesting Facts
   • Goal √
   • Extraction from text √
   • Consistency reasoning √
   • Extraction from Tables √
   • Open IE
4. Common Sense Knowledge
5. Knowledge Consolidation
6. Web Content Analytics
7. Wrap-Up

Open Information Extraction

So far we assumed given relations with type signatures: <entity1, relation, entity2>

<CarlaBruni, marriedTo, NicolasSarkozy>   Person × R × Person
<NataliePortman, wonAward, AcademyAward>  Person × R × Prize

Open IE aims to discover new entities and new relation types: <name1, phrase, name2>

Madame Bruni in her happy marriage with Sarko…
<Madame Bruni, her happy marriage with, Sarko>

Open IE with ReVerb [A. Fader et al. 2011, T. Lin 2012, Mausam 2012]

Idea: Consider all subject-verb-object triples as facts.

Problem 1: uninformative extractions
“Gold has an atomic weight of 196”  <Gold, has, atomic weight>
“Faust made a deal with the devil”  <Faust, made, a deal>

Solutions:
1. enforce regular expressions over POS tags, such as VB (N | ADJ | ADV | PRN | DET)* PREP
2. require that the relation phrase appears with many distinct argument pairs
3. intersect with Freebase

Problem 2: over-specific extractions
“Elvis is the first and greatest rock and roll star of America”  <..., is the first and greatest rock and roll star of, …>

http://openie.cs.washington.edu/
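Solution 1 can be illustrated by matching the slide's POS pattern, VB (N | ADJ | ADV | PRN | DET)* PREP, against a tagged sentence. The sketch below is not ReVerb's implementation: the tokens are tagged by hand with the slide's coarse tag names (a real pipeline would run a POS tagger), and taking everything left and right of the relation phrase as arguments is a simplification.

```python
import re

# Hand-tagged example sentence (tags assigned manually for illustration).
SENTENCE = [
    ("Faust", "N"), ("made", "VB"), ("a", "DET"), ("deal", "N"),
    ("with", "PREP"), ("the", "DET"), ("devil", "N"),
]

# The slide's pattern over a space-separated tag string.
RELATION_PATTERN = re.compile(
    r"(?<![A-Z])VB (?:(?:N|ADJ|ADV|PRN|DET) )*PREP(?![A-Z])"
)


def extract_relation(tagged):
    """Return (left argument, relation phrase, right argument), or None if no match."""
    tags = " ".join(tag for _, tag in tagged)
    match = RELATION_PATTERN.search(tags)
    if match is None:
        return None
    # Convert character offsets in the tag string back to token indices.
    start = tags[:match.start()].count(" ")
    end = start + match.group(0).count(" ") + 1
    words = [word for word, _ in tagged]
    return (
        " ".join(words[:start]),
        " ".join(words[start:end]),
        " ".join(words[end:]),
    )
```

For the Faust sentence the pattern forces the relation phrase to extend to the preposition, giving <Faust, made a deal with, the devil> instead of the uninformative <Faust, made, a deal>.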

Enhanced Patterns
[Nakashole@EMNLP2012 "PATTY"]

Syntactic-Lexical-Ontological (SOL) patterns combine
1. ontological types
2. lexical surface form
3. syntactic properties

Amy Winehouse’s cosy voice in her song ‘Rehab’
Jim Morrison’s haunting voice and charisma in ‘The End’
Joan Baez’s angel-like voice in ‘Farewell Angelina’

SOL pattern: <singer> ’s ADJECTIVE voice * in <song>
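A rough sketch of how the SOL pattern <singer> 's ADJECTIVE voice * in <song> could be matched. Small hand-made lists stand in for the ontology (typed entities) and for the POS tagger (adjectives), quotes are normalized to ASCII, and the regular expression leniently allows extra words between "in" and the song title. None of this is PATTY's actual machinery.

```python
import re

# Tiny lexicons standing in for an ontology and a POS tagger (assumptions).
SINGERS = ["Amy Winehouse", "Jim Morrison", "Joan Baez"]
SONGS = ["Rehab", "The End", "Farewell Angelina"]
ADJECTIVES = ["cosy", "haunting", "angel-like"]


def alternatives(items):
    """Build a non-capturing regex group matching any of the given strings."""
    return "(?:" + "|".join(re.escape(item) for item in items) + ")"


# Ontological types (<singer>, <song>), a syntactic slot (ADJECTIVE),
# lexical items ("voice", "in") and a wildcard for skipped words.
SOL_PATTERN = re.compile(
    r"(?P<singer>" + alternatives(SINGERS) + r")'s\s+"
    + alternatives(ADJECTIVES)
    + r"\s+voice\b.*?\bin\b.*?'(?P<song>" + alternatives(SONGS) + r")'"
)


def match_sol(sentence):
    """Return the (singer, song) pair if the sentence matches the SOL pattern."""
    match = SOL_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("singer"), match.group("song")
```

All three example phrases match the same pattern even though their adjectives and the words around "in" differ, which is what the type and POS slots are meant to achieve.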

Enhanced Patterns
[Nakashole@EMNLP2012 "PATTY"]

Patterns can subsume each other: "wife of" => "spouse of"
… which means that we can create synsets of patterns and arrange them in a taxonomy.

350 000 SOL patterns with 4 million instances, accessible at: www.mpi-inf.mpg.de/yago-naga/patty
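The slide does not say how subsumption is decided. One plausible criterion, assumed in this sketch, is set inclusion over the entity pairs each pattern was observed with: a pattern subsumes another if it covers all of the other's pairs, and patterns with identical pair sets form a synset. The support sets below are invented.

```python
# Hypothetical support sets: the entity pairs each pattern was observed with.
SUPPORT = {
    "wife of": {("CarlaBruni", "NicolasSarkozy")},
    "spouse of": {("CarlaBruni", "NicolasSarkozy"), ("NicolasSarkozy", "CarlaBruni")},
    "married to": {("CarlaBruni", "NicolasSarkozy"), ("NicolasSarkozy", "CarlaBruni")},
}


def subsumes(general, specific, support):
    """general subsumes specific if every pair seen with specific is also seen with general."""
    return support[specific] <= support[general]


def synsets(support):
    """Group patterns whose support sets are identical (mutual subsumption)."""
    groups = {}
    for pattern, pairs in support.items():
        groups.setdefault(frozenset(pairs), []).append(pattern)
    return sorted(sorted(group) for group in groups.values())
```

Here `subsumes("spouse of", "wife of", SUPPORT)` holds but not the reverse, and "spouse of" and "married to" end up in one synset; the subsumption relation between synsets then yields the taxonomy.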

Open Problems and Grand Challenges

Real-time & incremental fact extraction for continuous KB growth & maintenance (life-cycle management over years and decades)

Extensions to ternary & higher-arity relations; events in context: who did what to/with whom when where why …?

Robust fact extraction with both high precision & recall, as highly automated (self-tuning) as possible

Extend the approaches to other languages