Transcript of: Machine Reading of Web Text, Oren Etzioni, Turing Center, University of Washington (http://turing.cs.washington.edu)

Page 1:

Machine Reading of Web Text

Oren Etzioni
Turing Center, University of Washington
http://turing.cs.washington.edu

Page 2:

Rorschach Test

Page 3:

Rorschach Test for CS

Page 4:

Moore’s Law?

Page 5:

Storage Capacity?

Page 6:

Number of Web Pages?

Page 7:

Number of Facebook Users?

Page 8:

Page 9:

Turing Center Foci

• Scale MT to 49,000,000 language pairs: a 2,500,000-word translation graph, P(V F C)?, PanImages
• Accumulate knowledge from the Web
• A new paradigm for Web search

Page 10:

Outline

1. A New Paradigm for Search
2. Open Information Extraction
3. Tractable Inference
4. Conclusions

Page 11:

Web Search in 2020?

• Type keywords into a search box?
• Social or "human powered" search?
• The Semantic Web?
• What about our technology exponentials?

“The best way to predict the future is to invent it!”

Page 12:

Intelligent Search

Instead of merely retrieving Web pages, read ‘em!

Machine Reading = Information Extraction (IE) + tractable inference

IE(sentence) = who did what?  e.g., speaker(Alon Halevy, UW)

Inference = uncover implicit information, e.g., Will Alon visit Seattle?
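To make the "IE + tractable inference" split above concrete, here is a toy Python sketch; the extraction pattern, predicate names, and the UW-implies-Seattle rule are all invented for illustration and are not TextRunner code.

    # Toy sketch of "Machine Reading = IE + tractable inference" (illustrative only;
    # the pattern and rule below are hypothetical, not the actual system's code).
    import re

    def extract(sentence):
        """IE step: pull a who-did-what fact out of one sentence."""
        m = re.search(r"^(?P<who>[A-Z][\w. ]+?) will speak at (?P<where>[A-Z]\w+)", sentence)
        if m:
            return ("speaker", m.group("who"), m.group("where"))
        return None

    def infer(facts):
        """Inference step: uncover implicit information from explicit facts."""
        derived = set()
        for rel, who, where in facts:
            # Assumed background rule: giving a talk at UW implies visiting Seattle.
            if rel == "speaker" and where == "UW":
                derived.add(("visits", who, "Seattle"))
        return derived

    facts = {extract("Alon Halevy will speak at UW next Tuesday.")} - {None}
    print(facts | infer(facts))   # the explicit fact plus the inferred visit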

Page 13:

Application: Information Fusion

• What kills bacteria?
• What west coast nanotechnology companies are hiring?
• Compare Obama's "buzz" versus Hillary's?
• What is a quiet, inexpensive, 4-star hotel in Vancouver?

Page 14:

Opine (Popescu & Etzioni, EMNLP ’05)

Opinion Mining: IE(product reviews)
• Informative
• Abundant, but varied
• Textual

Summarize reviews without any prior knowledge of the product category

Page 15:

Page 16:

Page 17:

But "Reading" the Web is Tough

• Traditional IE is narrow
• IE has been applied to small, homogeneous corpora
• No parser achieves high accuracy
• No named-entity taggers
• No supervised learning

How about semi-supervised learning?

Page 18:

Semi-Supervised Learning

• Few hand-labeled examples
• Limit on the number of concepts
• Concepts are pre-specified
• Problematic for the Web

Alternative: self-supervised learning
• Learner discovers concepts on the fly
• Learner automatically labels examples per concept!

Page 19:

2. Open IE = Self-Supervised IE (Banko, Cafarella, Soderland, et al., IJCAI '07)

                 Traditional IE                  Open IE
Input:           Corpus + hand-labeled data      Corpus
Relations:       Specified in advance            Discovered automatically
Complexity:      O(D * R), for R relations       O(D), for D documents
Text analysis:   Parser + named-entity tagger    NP chunker

Page 20:

Extractor Overview (Banko & Etzioni, ’08)

1. Use a simple model of relationships in English to label extractions

2. Bootstrap a general model of relationships in English sentences, encoded as a CRF

3. Decompose each sentence into one or more (NP1, VP, NP2) “chunks”

4. Use CRF model to retain relevant parts of each NP and VP.

The extractor is relation-independent!

Page 21:

TextRunner Extraction

Extract Triple representing binary relation (Arg1, Relation, Arg2) from sentence.

Internet powerhouse, EBay, was originally founded by Pierre Omidyar.

→ (EBay, founded by, Pierre Omidyar)
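As a rough illustration of the triple format (not the actual TextRunner extractor, which uses an NP chunker plus a CRF), a single hand-written regular expression can pull the (Arg1, Relation, Arg2) triple out of the example sentence above while dropping the non-essential "was originally":

    # Minimal regex stand-in for the extraction shown above; the pattern is
    # an assumption for this one sentence shape, not a general extractor.
    import re

    SENTENCE = "Internet powerhouse, EBay, was originally founded by Pierre Omidyar."

    def extract_triple(sentence):
        # Hypothetical pattern: "<Arg1>, was (adverb) <verb>ed by <Arg2>."
        m = re.search(r",\s*([A-Z]\w+),\s*was\s+(?:\w+ly\s+)?(\w+ed\s+by)\s+([A-Z][\w ]+?)\.", sentence)
        if not m:
            return None
        arg1, relation, arg2 = m.groups()
        return (arg1, relation, arg2)

    print(extract_triple(SENTENCE))
    # ('EBay', 'founded by', 'Pierre Omidyar')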

Page 22:

Numerous Extraction Challenges

• Drop non-essential info: "was originally founded by" → "founded by"
• Retain key distinctions: EBay founded by Pierre ≠ EBay founded Pierre
• Non-verb relationships: "George Bush, president of the U.S…"
• Synonymy & aliasing: Albert Einstein = Einstein ≠ Einstein Bros.

Page 23:

TextRunner (the Web's 1st Open IE system)

1. Self-Supervised Learner: automatically labels example extractions & learns an extractor
2. Single-Pass Extractor: makes a single pass over the corpus, identifying extractions in each sentence
3. Query Processor: indexes extractions; enables queries at interactive speeds (sketched below)
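A minimal sketch of the Query Processor idea: once triples are extracted, an inverted index over their fields makes lookups interactive. The index layout and helper names are assumptions for illustration, not TextRunner's implementation.

    # Index extracted triples once; answer (Arg1, Relation, Arg2) queries from memory.
    from collections import defaultdict

    triples = [
        ("EBay", "founded by", "Pierre Omidyar"),
        ("Oppenheimer", "taught at", "Berkeley"),
        ("fruit", "contain", "vitamins"),
    ]

    # Inverted index: every field value points at the triples containing it.
    index = defaultdict(set)
    for i, (arg1, rel, arg2) in enumerate(triples):
        for key in (("arg1", arg1.lower()), ("rel", rel.lower()), ("arg2", arg2.lower())):
            index[key].add(i)

    def query(arg1=None, rel=None, arg2=None):
        """Return all triples matching the bound fields (None = wildcard)."""
        hits = set(range(len(triples)))
        for field, value in (("arg1", arg1), ("rel", rel), ("arg2", arg2)):
            if value is not None:
                hits &= index[(field, value.lower())]
        return [triples[i] for i in sorted(hits)]

    print(query(rel="founded by"))   # [('EBay', 'founded by', 'Pierre Omidyar')]
    print(query(arg1="fruit"))       # [('fruit', 'contain', 'vitamins')]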

Page 24:

TextRunner Demo

Page 25:

Page 26:

Page 27:

Sample of 9 million Web pages:

Triples extracted:              11.3 million
  with well-formed relation:     9.3 million
  with well-formed entities:     7.8 million
    abstract:                    6.8 million (79.2% correct)
    concrete:                    1.0 million (88.1% correct)

Concrete facts: (Oppenheimer, taught at, Berkeley)
Abstract facts: (fruit, contain, vitamins)

Page 28:

3. Tractable Inference

Much of textual information is implicit

I. Entity and predicate resolution
II. Probability of correctness
III. Composing facts to draw conclusions

Page 29:

I. Entity Resolution

Resolver (Yates & Etzioni, HLT ’07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin ‘01)

(X, born in, 1941)     (M, born in, 1941)
(X, citizen of, US)    (M, citizen of, US)
(X, friend of, Joe)    (M, friend of, Mary)

P(X = M) ~ shared relations

Page 30:

Relation Synonymy

(1, R, 2)   (2, R, 4)   (4, R, 8)   etc.
(1, R', 2)  (2, R', 4)  (4, R', 8)  etc.

P(R = R') ~ shared argument pairs

• Unsupervised probabilistic model
• O(N log N) algorithm run on millions of docs (toy sketch of both synonymy signals below)
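The following sketch illustrates both synonymy signals on the toy triples from this slide and the previous one: entities are compared by shared (relation, other-argument) pairs, relations by shared argument pairs. The plain overlap count is a stand-in for Resolver's actual probabilistic model.

    # Count shared properties as a rough proxy for the synonymy probabilities above.
    from collections import defaultdict

    triples = [
        ("X", "born in", "1941"), ("M", "born in", "1941"),
        ("X", "citizen of", "US"), ("M", "citizen of", "US"),
        ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
        ("1", "R", "2"), ("2", "R", "4"), ("4", "R", "8"),
        ("1", "R'", "2"), ("2", "R'", "4"), ("4", "R'", "8"),
    ]

    entity_props = defaultdict(set)   # entity   -> {(relation, other argument)}
    relation_args = defaultdict(set)  # relation -> {(arg1, arg2)}
    for a1, rel, a2 in triples:
        entity_props[a1].add((rel, a2))
        relation_args[rel].add((a1, a2))

    def shared(a, b, table):
        """Number of properties the two strings have in common."""
        return len(table[a] & table[b])

    print(shared("X", "M", entity_props))    # 2 shared (relation, argument) pairs
    print(shared("R", "R'", relation_args))  # 3 shared argument pairs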

Page 31:

II. Probability of Correctness

How likely is an extraction to be correct? Factors to consider include:
• Authoritativeness of source
• Confidence in extraction method
• Number of independent extractions

Page 32:

Counting Extractions

Lexico-syntactic patterns (Hearst '92): "…cities such as Seattle, Boston, and…"

Turney's PMI-IR (ACL '02): PMI ~ co-occurrence frequency, estimated from search-engine hit counts (# results), gives confidence in class membership.

Page 33:

Formal Problem Statement

If an extraction x appears k times in a set of n distinct sentences, each suggesting that x belongs to C, what is the probability that x ∈ C?

C is a class (“cities”) or a relation (“mayor of”)

Note: we only count distinct sentences!

Page 34:

Combinatorial Model (“Urns”)

Odds increase exponentially with k, but decrease exponentially with n

See Downey et al.’s IJCAI ’05 paper for formal details.
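To see why the odds behave this way, here is a toy two-hypothesis Bayes calculation; it is explicitly not the Urns model, and the probabilities p_true and p_false are invented. Assume each of the n sentences independently supports x with probability p_true if x ∈ C and p_false otherwise; the posterior log-odds then grow with k and shrink with n - k, so the odds themselves change exponentially.

    # Simplified stand-in for the combinatorial model: independent noisy votes.
    import math

    def log_odds_in_class(k, n, p_true=0.9, p_false=0.2, prior_odds=1.0):
        """Posterior log-odds that x is in C after k supporting sentences out of n."""
        ratio = k * math.log(p_true / p_false) \
              + (n - k) * math.log((1 - p_true) / (1 - p_false))
        return math.log(prior_odds) + ratio

    for k, n in [(3, 3), (3, 30), (30, 30)]:
        print(f"k={k:>2} n={n:>2}  log-odds={log_odds_in_class(k, n):+.2f}")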

Page 35:

Performance (15x Improvement)

[Chart: deviation from ideal log likelihood on City, Film, Country, and MayorOf extractions, comparing urns, noisy-or, and PMI.]

Self-supervised, domain-independent method

Page 36:

URNS limited on "sparse" facts

[Chart: number of times an extraction appears in a pattern context vs. frequency rank of the extraction.]

• Sparse extractions are a mixture of correct and incorrect, e.g., (Dave Shaver, Pickerington), (Ronald McDonald, McDonaldland)
• Frequent extractions tend to be correct, e.g., (Michael Bloomberg, New York City)

Page 37:

Language Models to the Rescue (Downey, Schoenmackers, Etzioni, ACL '07)

Instead of only lexico-syntactic patterns, leverage all contexts of a particular entity.

Statistical "type check":
• Does Pickerington "behave" like a city?
• Does Shaver "behave" like a mayor?

Language model = HMM (built once per corpus)
• Project each string to a point in a 20-dimensional space
• Measure proximity of Pickerington to Seattle, Boston, etc.

Page 38:

III. Compositional Inference (work in progress: Schoenmackers, Etzioni, Weld)

Implicit information (2 + 2 = 4):
• TextRunner: (Turing, born in, London)
• WordNet: (London, part of, England)
• Rule: 'born in' is transitive through 'part of'
• Conclusion: (Turing, born in, England)

Mechanism: an MLN instantiated on the fly (a minimal sketch follows below)
Rules: learned from the corpus (future work)

Inference Demo
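A minimal forward-chaining sketch of this example; the rule is hard-coded here, whereas the actual system instantiates a Markov Logic Network on the fly and aims to learn rules from the corpus.

    # Chain 'born in' through 'part of' until no new facts are derived.
    facts = {
        ("Turing", "born in", "London"),    # from TextRunner
        ("London", "part of", "England"),   # from WordNet
    }

    def apply_rules(facts):
        """Rule: born_in(X, Y) and part_of(Y, Z) => born_in(X, Z), to a fixpoint."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            new = {
                (x, "born in", z)
                for (x, r1, y1) in derived if r1 == "born in"
                for (y2, r2, z) in derived if r2 == "part of" and y1 == y2
            }
            if not new <= derived:
                derived |= new
                changed = True
        return derived

    print(("Turing", "born in", "England") in apply_rules(facts))  # True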

Page 39:

KnowItAll Family Tree

[Diagram: family tree of systems: Mulder '01, WebKB '99, PMI-IR '01, KnowItAll '04, Opine '05, UrnsBE '05, KnowItNow '05, Woodward '06, TextRunner '07, Resolver '07, REALM '07, Inference '08.]

Page 40:

KnowItAll Team

Michele Banko, Michael Cafarella, Doug Downey, Alan Ritter, Dr. Stephen Soderland, Stefan Schoenmackers, Prof. Dan Weld, Mausam

Alumni: Dr. Ana-Maria Popescu, Dr. Alex Yates, and others.

Page 41:

Related Work

• Sekine's "preemptive IE"
• Powerset
• Textual entailment
• AAAI '07 Symposium on "Machine Reading"
• Growing body of work on IE from the Web

Page 42:

4. Conclusions

Imagine search systems that operate over a (more) semantic space

• Keywords, documents → extractions
• TF-IDF, PageRank → relational models
• Web pages, hyperlinks → entities, relations

Reading the Web → a new search paradigm

Page 43:

Page 44:

Machine Reading = unsupervised understanding of text

Much is implicit → tractable inference is key!

Page 45:

HMM in more detail

Training: seek to maximize the probability of the corpus w given latent states t, using EM:

$\max \prod_{i=1}^{N} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, \ldots, t_{i-k})$

[Diagram: hidden states t_i, t_{i+1}, …, t_{i+4} emitting the words "cities such as Los Angeles".]
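For intuition, the sketch below evaluates this factorization numerically with k = 1 (a first-order HMM) on the example phrase. The emission and transition tables are invented; the real model's parameters come from EM training over the corpus.

    # Joint probability of a word sequence and a state sequence under a toy HMM.
    emit = {            # P(word | state); invented numbers
        ("CITY", "Seattle"): 0.4, ("CITY", "Boston"): 0.3,
        ("OTHER", "cities"): 0.05, ("OTHER", "such"): 0.1, ("OTHER", "as"): 0.1,
    }
    trans = {           # P(state_i | state_{i-1}); "START" is the initial state
        ("START", "OTHER"): 0.8, ("OTHER", "OTHER"): 0.6, ("OTHER", "CITY"): 0.3,
    }

    def joint_prob(words, states):
        """P(w, t) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i) for a first-order HMM."""
        p, prev = 1.0, "START"
        for w, t in zip(words, states):
            p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
            prev = t
        return p

    print(joint_prob(["cities", "such", "as", "Seattle"],
                     ["OTHER", "OTHER", "OTHER", "CITY"]))  # ≈ 1.7e-05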

Page 46:

Using the HMM at Query Time

Given a set of extractions (Arg1, Rln, Arg2), seeds = the most frequent Args for Rln.

$f(\mathrm{arg}, \mathrm{seeds}) = \frac{1}{|\mathrm{seeds}|} \sum_{i} \mathrm{KL}\big(P(t \mid \mathrm{seed}_i) \,\|\, P(t \mid \mathrm{arg})\big)$

1. Distribution over t is read from the HMM

2. Compute KL divergence via f(arg, seeds)

3. For each extraction, average f over Arg1 & Arg2

4. Sort “sparse” extractions in ascending order
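A small Python sketch of steps 1-4 above, under stated assumptions: the three-state distributions are invented stand-ins for the 20-dimensional state distributions that the system reads off its trained HMM, and the helper names are hypothetical.

    # Arguments whose state distributions diverge from the seeds' get larger f values,
    # so they sink when sparse extractions are sorted in ascending order of f.
    import math

    def kl(p, q, eps=1e-9):
        """KL(p || q) over the same support, with smoothing to avoid log(0)."""
        return sum(pi * math.log((pi + eps) / (q.get(s, 0.0) + eps))
                   for s, pi in p.items())

    def f(arg_dist, seed_dists):
        """Average KL divergence of the seeds' state distributions from arg's."""
        return sum(kl(seed, arg_dist) for seed in seed_dists) / len(seed_dists)

    # Hypothetical 3-state distributions (the slide uses a 20-dimensional space).
    seattle      = {0: 0.7, 1: 0.2, 2: 0.1}
    boston       = {0: 0.6, 1: 0.3, 2: 0.1}
    pickerington = {0: 0.5, 1: 0.3, 2: 0.2}
    microsoft    = {0: 0.1, 1: 0.2, 2: 0.7}

    seeds = [seattle, boston]
    for name, dist in [("Pickerington", pickerington), ("Microsoft", microsoft)]:
        print(name, round(f(dist, seeds), 3))
    # Pickerington scores lower (closer to the seed cities), so it ranks higher.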

Page 47:

Language Modeling & Open IE

• Self-supervised
• Illuminating phrases → full context
• Handles sparse extractions

Page 48:

Focus: Open IE on Web Text

Advantages                            Challenges
"Semantically tractable" sentences    Difficult, ungrammatical sentences
Redundancy                            Unreliable information
Search engines                        Heterogeneous corpus

Page 49:

II. Probability of Correctness

How likely is an extraction to be correct?

Distributional Hypothesis: "words that occur in the same contexts tend to have similar meanings."

KnowItAll Hypothesis: extractions that occur more frequently in the same informative contexts are more likely to be correct.

Page 50:

Argument "Type Checking" via HMM

A relation's arguments are "typed": (Person, Mayor Of, City)

Training: model the distribution of Person & City contexts in the corpus (Distributional Hypothesis)

Query time: rank sparse triples by how well each argument's context distribution matches that of its type

Page 51:

Silly Example

Prefer (Shaver, Mayor of, Pickerington) over (Spice Girls, Mayor of, Microsoft)

Because:
• Shaver's contexts are more like "other mayors'" than the Spice Girls', and
• Pickerington's contexts are more like "other cities'" than Microsoft's

Page 52:

Utilizing HMMs to Check Types

Challenges:
• Argument types are not known
• Can't build a model for each argument type
• "Textual types" are fuzzy

Solution: Train an HMM for the corpus using EM & bootstrap

REALM improves precision by 90%

Page 53:

[Architecture diagram: knowledge bases (TextRunner, WordNet), a query formula, and inference rules feed an MLN instantiated on the fly; the system finds the best KB + query, runs the query, finds implied nodes & cliques (new nodes + cliques feed back into the MLN), and returns query results.]

Query: Was Turing born in England?  BornIn(Turing, England)?

• TextRunner pattern "Turing born in X": "Turing was born in London" → BornIn(Turing, London)
• WordNet pattern "X is in England": "London is in England" → In(London, England)
• Inference rule: BornIn(X, city) ∧ In(city, country) → BornIn(X, country)
• Derived: BornIn(Turing, London), In(London, England) ⇒ BornIn(Turing, England)

Yes! Turing was born in England!