Coverage of Information Extraction from Sentences...

13
Coverage of Information Extraction from Sentences and Paragraphs Simon Razniewski 1 , Nitisha Jain 2 , Paramita Mirza 1 , Gerhard Weikum 1 1 Max Planck Institute for Informatics 2 Hasso Plattner Institute

Transcript of Coverage of Information Extraction from Sentences...

Page 1: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Coverage of Information Extraction from Sentences and Paragraphs

Simon Razniewski1, Nitisha Jain2, Paramita Mirza1, Gerhard Weikum1

1Max Planck Institute for Informatics2Hasso Plattner Institute

Page 2: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

The classical IE paradigm

Obama has two children,

Malia and Sasha.

<Obama, child, Malia> - Confidence 0.9<Obama, child, Sasha> - Confidence 0.9

Information extraction

tool

But are these all?

2

Page 3: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Coverage differences

• Obama has two children, Malia and Sasha.

• Jolie brought her children Shiloh and Knox to school.

• New York consists of the districts Manhattan, Bronx, Queens, Brooklyn, and Staten Island.

• Important districts of Hong Kong are Wan Chai, Kowloon City, and Yau Tsim Mong.

3

Page 4: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Relevance (1/3): IE resource efficiency

Districts(Hong Kong) = Wan Chai, Kowloon City, Yau Tsim Mong

Coverage = Low Explore more resources

4

IE

Coverage = High Stop further extraction

Districts(NY) = Manhattan, Bronx, Queens, Brooklyn, Staten Island

Page 5: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Relevance (2/3): Adjust IE thresholds

• District(HK, Wan Chai) - confidence 0.93

• District(HK, Kowloon City) - confidence 0.86

• District(HK, Yau Tsim) - confidence 0.74

• District(HK, Macao) - confidence 0.67

• …

IE

HK consists of the districts Wan Chai, …, …, …, … and ….

Coverage 0.98

Accept

Reject

5

Page 6: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Relevance (3/3): QA negation and completeness

• Which US presidents were married only once?

• Which countries participated in no UN mission?

• For which cities do we know all districts?

Without coverage awareness, QA systems cannot answer these

Focus of our research [SIGMOD’15, WSDM’17, ACL’17, ISWC’18, …]

QA

6

Page 7: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Coverage estimation how?

Maxim of quantity:

• Make your contribution as informative as required

Maxim of relevance:

• Be relevant

Can we automatically determine where full coverage is relevant?

Grice’s maxims of cooperative communication [Logic and conversation, 1975]

7

Obama has two children, Malia and Sasha. Jolie brought her children Shilohand Knox to school.

Page 8: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Formal problem: Full coverage prediction

Subject s

Predicate p

Real-world set of p-objects for s: RW{o | sp}

Ground truth extraction (perfect IE): GTE{o | sp, t}

Given a text segment t:

GTE{o | sp, t} = RW{o | sp}?

If yes t has full coverage

of o for s, p8

Page 9: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Setup

• p ∈ {child, spouse, bandmember, educatedAt, employer}

• s ∈ popular entities from Wikidata

• t: Wikipedia sentences/paragraphs

• GTE{o | sp, t}: OpenIE + predicate dictionary + surface name matching

• RW{o | sp}: Distant supervision

• Wikidata objects – assumed to be complete for popular s

• Classifier/features:

• SVM on text n-grams, LSTM on word embeddings

• Baselines:

• Random

• Longest text segments are complete

• Text segments containing most proper names are complete

9

Page 10: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Results

10

Text unit Model child bandMember educatedAt

Sentence

Random .06 .06 .14

Length .05 .13 .24

#pnames .05 .17 .28

LSTM .45 .60 .64

F1-score on predicting text units w/ full coverage.

Paragraph

Random .12 .15 .12

Length .17 .26 .21

#pnames .19 .31 .29

LSTM .55 .54 .58

1. Baselines capture some signal

2. LSTM on text contains much stronger signals

3. Task is not easy

Page 11: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Paragraph features

11

child spouse bandMember educatedAt

<num> grandsons

married twice consists of briefly attended

<pname> sons secondmarriage

vocals <pname>

left graduating

daugthers: <pname>

later married lineup<pname>

<pname> left

Page 12: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Example predictions

12

Sentence LSTM score

He was the father of actor Pierre Renoir (1885-1952),filmmaker Jean Renoir (1894-1979) and ceramicartist Claude Renoir (1901-1969).

0.54

His daughter Julie Gavras and his son RomainGavras are also filmmakers.

0.46

Genghis Khan was aware of the friction between hissons (particularly between Chagatai and Jochi) andworried of possible conflict between them if he died.

0.42

“From this moment I am no longer the king; theking is Victor my son.”

0.17

Page 13: Coverage of Information Extraction from Sentences …simonrazniewski.com/wp-content/uploads/2019_EMNLP_slides.pdfCoverage of Information Extraction from Sentences and Paragraphs Simon

Take-home

• IE so far only focused on confidence (precision)

• Coverage (recall) has importance for resource efficiency, thresholding, QA w/ negation

• Linguistic theories give handles towards coverage estimation

• Experiments:• Coverage estimation is feasible

• N-grams provide informative features

13