Relation Extraction from the Web using Distant Supervision
Isabelle Augenstein, Diana Maynard, Fabio Ciravegna Department of Computer Science, University of Sheffield, UK
{i.augenstein,d.maynard,f.ciravegna}@dcs.shef.ac.uk
November 28, 2014
EKAW 2014
• Large knowledge bases are useful for search, question answering etc., but far from complete
• Approach: automatic knowledge base population (KBP) methods using Web information extraction (IE)
  1) Extract entities and relations between them from text on Web pages
  2) Combine information from several sources to populate KBs
Problem
• Why can’t we just use existing tool X?
Motivation
• IE methods requiring manual effort
  • Manually crafted extraction patterns, e.g. "X is a professor at Y"
  • Supervised learning: statistical models with manually annotated training data as input
  → Biased towards a domain, e.g. biology, newswire, Wikipedia
• IE methods requiring no manual effort
  • Unsupervised learning: discovering patterns, clustering → difficult to map to a schema
  • Bootstrapping: learning patterns iteratively, starting with prior knowledge, e.g. a list of names → "semantic drift"
Existing Approaches
• Requirements
  • Works for Web text
  • Extracts with respect to a knowledge base
  • No manual effort required
• What can we do?
  • Use the knowledge base to train a statistical model
  • Distant supervision: automatically label text with relations from the knowledge base, then train a machine learning classifier
  → Extracts relations with respect to the KB, with no manual effort
Proposed Approach
Pipeline: creating positive & negative training examples → feature extraction → classifier training → prediction of new relations

Distant Supervision
Distant supervision = automatically generated training data + supervised learning, applied over the same pipeline (training examples → feature extraction → classifier training → prediction of new relations)
"If two entities participate in a relation, any sentence that contains those two entities might express that relation." (Mintz et al., 2009)
Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
Blur helped to popularise the Britpop genre.
Beckham rose to fame with the all-female pop group Spice Girls.

Knowledge base (with different lexicalisations of the same entity):

Name                                      | Genre
Amy Winehouse / Amy Jade Winehouse / Wino | R&B, soul, jazz
Blur                                      | Britpop
Spice Girls                               | pop
Distant Supervision
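The labelling heuristic above can be sketched in a few lines. The toy knowledge base and sentences are taken from the slide, but the code itself is an illustrative sketch, not the authors' implementation:

```python
# Distant supervision labelling: if a sentence mentions both the subject
# and an object of a KB relation, treat it as a (noisy) positive example.
# Toy KB mapping subject -> relation -> object lexicalisations (toy data).
kb = {
    "Blur": {"genre": {"Britpop"}},
    "Spice Girls": {"genre": {"pop"}},
}

sentences = [
    "Blur helped to popularise the Britpop genre.",
    "Beckham rose to fame with the all-female pop group Spice Girls.",
]

def label_sentences(kb, sentences):
    """Return (sentence, subject, relation, object) tuples for every
    sentence that contains both a subject and one of its related objects."""
    examples = []
    for sent in sentences:
        for subj, relations in kb.items():
            if subj not in sent:
                continue
            for rel, objects in relations.items():
                for obj in objects:
                    if obj in sent:
                        examples.append((sent, subj, rel, obj))
    return examples

examples = label_sentences(kb, sentences)
```

Note that the simple substring match already illustrates the noise of the heuristic: "pop" matches inside "pop group", which is correct here but would also match inside unrelated words.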
• Collect corpus: from the Web, using search patterns containing the relation
• Relation identification: recognise all entities in sentences; check whether sentences contain the subject and object of a relation
• Seed selection: discover, then discard potentially noisy training data
• Extract features: standard features such as context words, part-of-speech tags (noun, verb), etc.
• Train classifier, then apply it to the held-out part of the corpus
  • Same relation identification procedure as for the training data
  • Extract relations across sentence boundaries
• Integrate / combine results
Distant Supervision System
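A rough skeleton of how these stages could fit together. All function names and bodies below are hypothetical placeholders, not the system described in the paper:

```python
# Hypothetical end-to-end skeleton of the pipeline described above.
# Each stage is a stub standing in for the real component.

def collect_corpus(relation):
    # would issue Web search queries built from patterns containing the relation
    return ["Blur helped to popularise the Britpop genre."]

def identify_relation_candidates(sentences, subject, obj):
    # keep sentences that mention both the subject and the object
    return [s for s in sentences if subject in s and obj in s]

def select_seeds(candidates):
    # would discard potentially noisy / ambiguous training examples (seed selection)
    return candidates

def extract_features(sentence):
    # standard features: context words, part-of-speech tags, ...
    return {"tokens": sentence.split()}

def run_pipeline(subject, relation, obj):
    sentences = collect_corpus(relation)
    candidates = identify_relation_candidates(sentences, subject, obj)
    seeds = select_seeds(candidates)
    return [extract_features(s) for s in seeds]

features = run_pipeline("Blur", "genre", "Britpop")
```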
Research described in paper
• Web crawl corpus of 1 million Web pages, created using entity-specific search queries

Class                   | Properties / Relations
Book                    | author, characters, publication date, genre, original language
Musical Artist          | album, active (start), active (end), genre, record label, origin, track
Film                    | release date, director, producer, language, genre, actor, character
Politician              | birthdate, birthplace, educational institution, nationality, party, religion, spouses
Organisation            | industry, employees, city, country, date founded, founders
Educational Institution | school type, mascot, colours, city, country, date founded
River                   | origin, mouth, length, basin countries, contained by

Evaluation: Corpus
Generating training data: is it that easy?

"Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be."

Name: The Beatles | Album: Let It Be | Track: Let It Be
Seed Selection
• Use ‘Let It Be’ mentions as positive training examples for album or for track?
• Problem: if both mentions of ‘Let It Be’ are used to extract features for both album and track, wrong weights are learnt
• How can such ambiguous examples be detected? Develop methods to detect, then automatically discard, potentially ambiguous training data
Seed Selection
Ambiguity within an entity
• Example: "Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be." Let It Be can be both an album and a track of the musical artist The Beatles.
• For every relation, consisting of a subject, a property and an object (s, p, o): is the subject related to (at least) two different objects with the same lexicalisation, expressing two different relations?
• Unam: retrieve the number of such senses using the Freebase API; discard the lexicalisation of the object as positive training data if it has at least two different senses within an entity
Seed Selection
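One way the Unam check could look. Here ambiguity is counted over a toy set of KB triples rather than via the Freebase API mentioned above, so both the data and the function are illustrative only:

```python
# Unam: discard an object lexicalisation as a positive example if, within
# the same subject entity, it appears as the object of two or more
# different relations (e.g. "Let It Be" as both album and track).

def unam_filter(triples):
    """triples: (subject, property, object_lexicalisation) tuples.
    Returns the triples whose object lexicalisation is unambiguous
    within its subject entity."""
    # count distinct properties per (subject, object lexicalisation) pair
    props = {}
    for s, p, o in triples:
        props.setdefault((s, o), set()).add(p)
    return [(s, p, o) for s, p, o in triples if len(props[(s, o)]) < 2]

triples = [
    ("The Beatles", "album", "Let It Be"),
    ("The Beatles", "track", "Let It Be"),
    ("The Beatles", "album", "Abbey Road"),
]
kept = unam_filter(triples)  # only the unambiguous Abbey Road triple survives
```

Both "Let It Be" triples are dropped because the lexicalisation is the object of two different properties of the same subject, exactly the case the slide's example shows.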
Ambiguity across classes
• Example: common names of book authors or common genres, e.g. "Jack mentioned that he read On the Road", in which Jack is falsely recognised as the author Jack Kerouac
• Stop: remove common words that are stopwords
• Stat: estimate how ambiguous a lexicalisation of an object is compared to other lexicalisations of objects of the same relation
  • For every lexicalisation of an object of a relation, retrieve the number of senses using the Freebase API (example: for Jack, n = 1066)
  • Compute a frequency distribution per relation with min, max, median (50th percentile), lower quartile (25th percentile) and upper quartile (75th percentile) (example: for author: min = 0, max = 3059, median = 10, lower = 4, upper = 32)
  • For every lexicalisation of an object of a relation, if the number of senses > upper quartile (or the lower quartile, or median, depending on the model), discard it (example: 1066 > 32, so Jack is discarded)
Seed Selection
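A sketch of the Stat heuristic, under the assumption that sense counts have already been retrieved; the counts below are made-up stand-ins for Freebase API results:

```python
# Stat: per relation, compute a frequency distribution over the number of
# senses of each object lexicalisation, and discard lexicalisations whose
# sense count exceeds the chosen percentile (here: the upper quartile).
from statistics import quantiles

def stat_filter(sense_counts, cutoff_quartile=3):
    """sense_counts: {lexicalisation: number of senses} for one relation.
    Keeps lexicalisations whose sense count is at or below the cutoff
    quartile (1 = lower quartile, 2 = median, 3 = upper quartile)."""
    q1, median, q3 = quantiles(sense_counts.values(), n=4)
    threshold = {1: q1, 2: median, 3: q3}[cutoff_quartile]
    return {lex for lex, n in sense_counts.items() if n <= threshold}

# toy sense counts for object lexicalisations of the author relation
author_senses = {"Jack": 1066, "Kerouac": 3, "J. Kerouac": 5,
                 "Jack Kerouac": 12, "John Kerouac": 2}
kept = stat_filter(author_senses)
# "Jack" falls above the upper quartile of this distribution and is discarded
```

With realistic data the thresholds would come from the full per-relation distribution (as in the author example above: upper quartile 32, so n = 1066 for Jack is discarded).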
• Seed selection: statistical methods for discarding noisy training data improve precision, e.g. Musical Artist: 0.62 → 0.74; Politician: 0.85 → 0.86
• Relation candidate recognition: additional methods for recognising named entities that do not rely on existing tools increase the number of extractions
• Information integration: statistical methods for information integration improve results over simple combination (overall precision: simple 0.74, strategic combination 0.86)
• Extracting across sentence boundaries: improves precision as well as recall, yielding up to 5 times the number of single extractions and on average twice as many extractions combined (overall precision: 0.80 → 0.86)
Results / Key Findings
• Distant supervision makes it possible to populate knowledge bases automatically, without manual effort
• Distant supervision can be applied to any domain (focus of this work: Web data)
• Seed selection, improved named entity recognition, strategies for information integration, and extraction across sentence boundaries all improve performance
• Additional heuristics for named entity recognition work, but the approach still relies on existing tools for it → more work on unsupervised named entity recognition is needed
• Web pages contain not only text, but also lists, tables etc. → more data that can be integrated
Conclusions / Future Work
Thank you for your attention!
Questions?