
Information Extraction

MAS.S60

Catherine Havasi

Rob Speer

Wikipedia as a corpus

• 3.9 million English articles, 284 languages

• 2 billion words
  – Brown has 1 million

• DBpedia and Freebase

Text reveals relations

• “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.”

• “These were performed in town halls and other large buildings...”

• “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
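The three sentences above share the pattern "X and other Y", which suggests that X is a kind of Y. A minimal sketch of a matcher for it follows, assuming plain regular expressions and single-word items; the name `extract_and_other` is made up for this example.

```python
import re

# Hypothetical pattern: a comma-separated list of single words,
# then "and other", then a single-word category.
AND_OTHER = re.compile(r"((?:\w+, )*\w+),? and other (\w+)")


def extract_and_other(sentence):
    """Return (item, category) pairs suggested by 'X and other Y'."""
    pairs = []
    for match in AND_OTHER.finditer(sentence):
        items = match.group(1).split(", ")
        category = match.group(2)
        pairs.extend((item, category) for item in items)
    return pairs


if __name__ == "__main__":
    print(extract_and_other(
        "Various explanations of the overabundance of carbon, oxygen, "
        "nitrogen, and other elements have been proposed."
    ))
    print(extract_and_other(
        "These were performed in town halls and other large buildings..."
    ))
```

The first sentence gives clean pairs, but the second gives ("halls", "large"): without part-of-speech tags or noun-phrase chunks, the regex cannot tell that the item is "town halls" and the category is "buildings".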

NACLO puzzle

Would it be plausible to describe something as “danty but sloshful”?

Possible patterns

• both X and Y

• X but not Y

• use NP to VP

• [Un]fortunately, VP
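Some of these patterns can be tried with regular expressions alone. The sketch below covers three of them, assuming X and Y are single words; "use NP to VP" is left out because it needs phrase boundaries that a regex over raw text cannot see. The names `PATTERNS` and `find_patterns` are hypothetical.

```python
import re

# Hypothetical regex versions of three of the listed patterns.
PATTERNS = {
    "both_and": re.compile(r"\bboth (\w+) and (\w+)", re.IGNORECASE),
    "but_not": re.compile(r"\b(\w+) but not (\w+)", re.IGNORECASE),
    # Group 1 is "Un"/"un" when present (None otherwise); group 2 is the clause.
    "fortunately": re.compile(r"\b(un)?fortunately, ([^.!?]+)", re.IGNORECASE),
}


def find_patterns(text):
    """Return (pattern name, captured groups) for every match in the text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.groups()))
    return hits


if __name__ == "__main__":
    for sentence in [
        "The film was both long and dull.",
        "It is necessary but not sufficient.",
        "Unfortunately, the train left early.",
    ]:
        print(find_patterns(sentence))
```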

Constraints using named entities

Constraints using named entities and parts of speech

TextRunner

• Starts out with some seed patterns

• Label: Uses those to label possible extractions in a sentence

• Learn: Using a graphical model

• Extract: Using the learned pattern, extract the sentence

• Problem: 200,000 – 300,000 labeled training points needed

ReVerb

• Syntactic Constraint
  – Requires extraction to match syntactic patterns

• Lexical Constraint
  – Phrases must have many different arguments in the corpus
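The two constraints can be sketched roughly as follows. This is an illustration, not ReVerb's actual implementation: it assumes the sentence is already tagged with coarse part-of-speech labels (the tag names here are an assumption), uses a simplified relation pattern of one or more verbs optionally followed by filler words and a preposition or particle, and picks an arbitrary threshold for "many different arguments".

```python
import re
from collections import defaultdict

# Assumed coarse tags, each mapped to one character so that a regex
# can run over the tag sequence and character offsets equal token offsets.
TAG_CODES = {
    "VERB": "V", "NOUN": "N", "ADJ": "J", "ADV": "R",
    "PRON": "O", "DET": "D", "ADP": "P", "PRT": "T",
}

# Simplified syntactic constraint: verb(s), optionally followed by
# nouns/adjectives/adverbs/pronouns/determiners and ending in a
# preposition or particle.
RELATION = re.compile(r"V+(?:[NJROD]*[PT])?")


def relation_phrases(tagged):
    """Return candidate relation phrases from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    phrases = []
    for match in RELATION.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        phrases.append(" ".join(words))
    return phrases


def lexically_supported(extractions, min_distinct_args=2):
    """Keep relation phrases seen with enough distinct argument pairs.

    `extractions` is an iterable of (arg1, phrase, arg2) triples.
    The default threshold is hypothetical; a real corpus would use a larger one.
    """
    args_by_phrase = defaultdict(set)
    for arg1, phrase, arg2 in extractions:
        args_by_phrase[phrase].add((arg1, arg2))
    return {p for p, args in args_by_phrase.items() if len(args) >= min_distinct_args}


if __name__ == "__main__":
    sentence = [("Faust", "NOUN"), ("made", "VERB"), ("a", "DET"),
                ("deal", "NOUN"), ("with", "ADP"), ("the", "DET"),
                ("devil", "NOUN")]
    print(relation_phrases(sentence))
```

The lexical check is what filters out overly specific phrases: a phrase that matches the syntactic pattern but only ever appears with one pair of arguments is probably not a general relation.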

Accuracy of IE

• Incoherent extractions make up 15-30% of extracted knowledge bits

• Uninformative extractions 3-7%

Tom Mitchell (NELL)

• Unsupervised learning machine

Categories on Wikipedia (Dan Weld)

How Kylin Works

Word senses on Wikipedia

Named entities on Wikipedia?

[[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
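The double-bracket links in the wikitext above already mark spans that have their own articles, so they are a cheap source of entity candidates. A minimal sketch of pulling them out follows; it handles plain `[[Target]]` and piped `[[Target|anchor text]]` links only, and the capitalization test in `looks_like_name` is a crude heuristic assumed for this example, not an established method.

```python
import re

# Matches [[Target]] and [[Target|anchor text]]; nested brackets are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (target article, anchor text) pairs for each wiki link."""
    links = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        links.append((target, anchor))
    return links


def looks_like_name(anchor):
    """Crude guess: every word of the anchor text starts with a capital."""
    words = anchor.split()
    return len(words) > 0 and all(word[0].isupper() for word in words)


if __name__ == "__main__":
    text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
            "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
    for target, anchor in extract_links(text):
        print(target, "| name?", looks_like_name(anchor))
```

On the example sentence only "Julius Neubronner" passes the heuristic. A single capitalized word at the start of a sentence would be a false positive, which is one reason the slide title ends in a question mark.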

Downloading Wikipedia and other Wikimedia projects

• A 2200-article sample is available on the class web site

Lab

• Find an information pattern besides the ones we’ve listed

• Run it over the Wikipedia front page corpus

• Does it need a tagger? A named entity extractor?

Assignment

• Choose and refine an information extractor

• Hand-tag some examples

• Add a classifier for good vs. bad matches

• You are allowed to work in groups

• Sharing code is fine, but one writeup per person
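One possible shape for the "good vs. bad matches" step is a small perceptron trained on hand-tagged examples. The sketch below is only an illustration: the feature names are invented, the four training examples are toy data, and any other classifier would serve as well.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=5):
    """Train on (feature set, is_good) pairs; return (weights, bias)."""
    weights = defaultdict(int)
    bias = 0
    for _ in range(epochs):
        for features, is_good in examples:
            score = bias + sum(weights[f] for f in features)
            predicted = score > 0
            if predicted != is_good:
                # Move the weights toward the correct label.
                step = 1 if is_good else -1
                for f in features:
                    weights[f] += step
                bias += step
    return weights, bias


def predict(model, features):
    """Return True if the model scores the match as good."""
    weights, bias = model
    return bias + sum(weights.get(f, 0) for f in features) > 0


# Hypothetical hand-tagged matches, each described by made-up binary features.
HAND_TAGGED = [
    ({"category_plural", "items_are_nouns"}, True),
    ({"category_plural"}, True),
    ({"category_is_adjective"}, False),
    ({"items_are_nouns", "category_is_adjective"}, False),
]

if __name__ == "__main__":
    model = train_perceptron(HAND_TAGGED)
    print(predict(model, {"category_plural", "items_are_nouns"}))
    print(predict(model, {"category_is_adjective"}))
```

Features like these would come from whatever the extractor already knows about a match, for example part-of-speech tags of the captured words.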