Wikipedia as a corpus
• 3.9 million English articles, 284 languages
• 2 billion words; the Brown corpus has 1 million
• DBpedia and Freebase
Text reveals relations
• “Various explanations of the overabundance of carbon, oxygen, nitrogen, and other elements have been proposed.”
• “These were performed in town halls and other large buildings...”
• “The splendid artistic legacy of Angkor Wat and other Khmer monuments...”
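All three sentences share the surface pattern "X and other Y", where X names instances and Y names their category. A minimal sketch of matching it with a plain regular expression follows, assuming no tagger is available; the names AND_OTHER and and_other_candidates are made up for illustration, and both the two-word category window and the "last word is the head" rule are crude guesses.

```python
import re

# "X, Y, and other Z": text before "and other" holds candidate instances,
# the one or two words after it are a candidate category.
AND_OTHER = re.compile(r"^(?P<left>.*?),? and other (?P<right>\w+(?: \w+)?)")


def and_other_candidates(sentence):
    """Return (instances, category) for the first "and other" match, or None."""
    match = AND_OTHER.search(sentence)
    if match is None:
        return None
    # Crude head finding: keep only the last word of each comma-separated item.
    instances = [
        part.split()[-1]
        for part in match.group("left").split(",")
        if part.strip()
    ]
    return instances, match.group("right")


sentences = [
    "Various explanations of the overabundance of carbon, oxygen, nitrogen, "
    "and other elements have been proposed.",
    "These were performed in town halls and other large buildings...",
    "The splendid artistic legacy of Angkor Wat and other Khmer monuments...",
]
for sentence in sentences:
    print(and_other_candidates(sentence))
```

Without part-of-speech information the sketch cannot find phrase boundaries: it overshoots on the first sentence ("elements have") and truncates "town halls" and "Angkor Wat" to their last word. That is the kind of gap a tagger or a named entity extractor would close.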
TextRunner
• Starts out with some seed patterns
• Label: Uses those to label possible extractions in a sentence
• Learn: Using a graphical model
• Extract: Using the learned pattern, extract from the sentence
• Problem: 200,000-300,000 labeled training points needed
ReVerb
• Syntactic Constraint: Requires extraction to match syntactic patterns
• Lexical Constraint: Phrases must have many different arguments in the corpus
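The two constraints can be sketched as follows. This is a simplified illustration, not ReVerb's implementation: it assumes the input is already tagged with Penn Treebank tags, collapses them into four coarse classes, and approximates the syntactic pattern as "one or more verbs, optionally followed by any words and a closing preposition or particle". The function names and the min_pairs threshold are hypothetical.

```python
import re
from collections import defaultdict


def tag_class(tag):
    """Map a Penn Treebank tag to a one-letter class (simplification)."""
    if tag.startswith("VB"):
        return "V"  # verb
    if tag in {"IN", "TO", "RP"}:
        return "P"  # preposition, infinitive marker, or particle
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"  # noun, adjective, adverb, pronoun, or determiner
    return "O"  # anything else


# Syntactic constraint (approximate): V | V P | V W* P
RELATION = re.compile(r"V+(?:W*P)?")


def relation_phrases(tagged):
    """Return the word spans whose tag classes match the relation pattern."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in RELATION.finditer(classes)
    ]


def lexically_supported(extractions, min_pairs=2):
    """Lexical constraint: keep phrases seen with enough distinct argument pairs."""
    pairs = defaultdict(set)
    for arg1, relation, arg2 in extractions:
        pairs[relation].add((arg1, arg2))
    return {rel for rel, args in pairs.items() if len(args) >= min_pairs}


# Hand-tagged example sentence (the tags are supplied by hand, not by a tagger).
tagged = [("Neubronner", "NNP"), ("was", "VBD"), ("born", "VBN"),
          ("in", "IN"), ("Kronberg", "NNP")]
print(relation_phrases(tagged))
```

The lexical constraint only becomes meaningful over a large corpus, where an overly specific phrase shows up with very few distinct argument pairs and is filtered out.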
Accuracy of IE
• Incoherent extractions make up 15-30% of extracted knowledge bits
• Uninformative extractions make up 3-7%
Named entities on Wikipedia?
[[Pigeon photography]] is an [[aerial photography]] technique invented in 1907 by the German apothecary [[Julius Neubronner]]...
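The double-bracket links in the wikitext above already mark spans that have their own articles, so they are one cheap source of entity candidates. A minimal sketch of pulling them out with a regular expression, assuming simple links of the form [[Target]] or [[Target|shown text]] (nested or templated links are ignored); wiki_links is a hypothetical helper name.

```python
import re

# [[Target]] or [[Target|shown text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def wiki_links(wikitext):
    """Return (target, shown text) pairs for each simple wiki link."""
    return [
        (m.group(1).strip(), (m.group(2) or m.group(1)).strip())
        for m in WIKILINK.finditer(wikitext)
    ]


text = ("[[Pigeon photography]] is an [[aerial photography]] technique "
        "invented in 1907 by the German apothecary [[Julius Neubronner]]...")
for target, shown in wiki_links(text):
    print(target, "|", shown)
```

Links are only candidates: "aerial photography" is linked but is not a named entity, while "1907" and "German" are not linked at all.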
Downloading Wikipedia and other Wikimedia projects
• A 2200-article sample is available on the class web site
Lab
• Find an information pattern besides the ones we’ve listed
• Run it over the Wikipedia front page corpus
• Does it need a tagger? A named entity extractor?