
Page 1: LCC’s Approaches to Knowledge Base Population
John Lehmann, Sean Monahan, Luke Nezda, Arnold Jung, Ying Shi

Page 2: Outline

• Entity Linking
• Slot Filling
• Surprise Slot Filling

Page 3: Entity Linking Task Definition

• Entity linking involves resolving an entity mention to its corresponding entry in a Knowledge Base (KB); the I/O is illustrated below
  – Input: Query consisting of a document id and an entity mention string
  – Output: Corresponding KB node id, or else NIL
  – Scoring: Accuracy across all queries

• KB consists of 800k entries from Wikipedia, Oct. 2008
  – Fields: title, entity type, infobox, text

• Source document collection
  – ~1.8M articles: ~70% newswire / ~30% web
  – Extended in 2010 to include web data
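To make the I/O concrete, here is a hypothetical query/answer pair; the query id, field names, and KB node id format are invented for illustration (the docid is the one used in later slides):

# Hypothetical illustration of the entity linking I/O (ids and field names are invented).
query = {
    "query_id": "EL0001",                      # invented query id
    "docid": "eng-NG-31-142147-10015518",      # document containing the mention
    "mention": "Black Panther",                # entity mention string
}

# Expected output: the id of the matching KB node, or NIL if the entity has
# no entry in the Oct. 2008 Wikipedia-derived KB.
answer = {"query_id": "EL0001", "kb_id": "NIL"}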

Page 4: Entity Linking Approach and Challenges

• Three step approach (sketched below)
  1. Candidate Generation
     • Determine every potential/candidate entity target
  2. Candidate Ranking
     • Extract features and pick the most likely candidate
  3. NIL Detection
     • Decide between the best candidate and NIL

• Key Challenges
  – How to retrieve non-trivial senses
     • Spelling variations, dropped/added/reordered tokens, acronyms, synonyms/informal names
  – How to model different types of document context
     • Immediate/local, topical, other references, facts
  – How to represent NIL in the model
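A minimal sketch of the three-step pipeline described above; the candidate generator, the scoring function, and the NIL threshold are passed in as assumed helpers rather than LCC’s actual components:

# Sketch of the three-step linking pipeline (helpers and threshold are assumptions).
def link_entity(mention, document, kb_ids, generate_candidates, score_candidate,
                nil_threshold=0.5):
    # 1. Candidate Generation: collect every plausible Wikipedia/KB target.
    candidates = generate_candidates(mention, document)
    if not candidates:
        return "NIL"
    # 2. Candidate Ranking: score each candidate (surface, contextual, semantic,
    #    and generation features) and keep the best one.
    best = max(candidates, key=lambda c: score_candidate(mention, document, c))
    best_score = score_candidate(mention, document, best)
    # 3. NIL Detection: emit NIL if the best candidate is unconvincing or has no
    #    corresponding KB entry ("known NIL").
    if best_score < nil_threshold or best not in kb_ids:
        return "NIL"
    return best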

Page 5: Candidate Generation (Re)sources

Normalized Articles and Redirects (NAR): normalized article names + normalized redirects → original page name
Surface Text to Entity Map (STEM): hyperlink anchor texts → their target pages
Relaxed Disambig. Page (DPR): disambiguation page names → all anchor targets on those pages
Disambig. Page (DP): like DPR, except requires the page’s name to overlap the anchor
Search Engine (SE): mention string → top 3 Google results, filtered
Soft Mentions (SM): strings with a high Dice coefficient to NAR or STEM keys → associated keys
Longer Mentions (LM): superstrings of the entity mention in the document → their STEM keys
Expanded Acronym Bootstrap (EAB): expanded acronym string in the document → candidates resulting from a bootstrap of the generation process

• Generate candidate Wikipedia page targets, map back later
• 5 context-free and 2 context-dependent candidate generators
  – Stored in memory, in custom map data structures for fast lookups (sketched below)
• Cumulative actual entity recall on 2009 data: ~97%
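A sketch of how the context-free lookup resources might be organized as in-memory maps; the entries and the normalization shown here are invented for illustration, and the real system uses custom map structures rather than plain dictionaries:

# Sketch of context-free generator resources as in-memory maps (contents invented).
from collections import defaultdict

nar = defaultdict(set)    # NAR: normalized article/redirect name -> original page
stem = defaultdict(set)   # STEM: hyperlink anchor text -> target pages
dp = defaultdict(set)     # DP/DPR: disambiguation page name -> anchor targets

nar["new black panther party"].add("New Black Panther Party")
stem["black panther"].add("Black Panther Party")
dp["black panther (disambiguation)"].update({"Black panther", "Black Panthers (comics)"})

def generate_candidates_context_free(mention):
    key = mention.lower().strip()           # crude normalization, for illustration only
    return nar[key] | stem[key] | dp[key]   # union over the lookup resources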

Page 6: Candidate Generation

docid = eng-NG-31-142147-10015518

In a related story, during an interview on Thursday morning with Black Panther leader Malik Zulu Shabazz, Fox News Channel viewers learned that Shabazz' group endorsed and supported Senator Barack Obama for President of the United States. […] The New Black Panther Party leader proudly announced on Fox News that his organization endorsed and Obama for President.

[Slide figure: for the mention “Black Panther” the generators propose the candidate pages Black panther, Black Panthers (comics), Black Panther Party, and New Black Panther Party, labeled with the NAR, LM, DP, and STEM sources; a second copy of the sentence shows the expanded acronym NBPP, whose candidates come from the NAR and EAB generators.]

Page 7: Entity Link Candidate Features

Name            Size  Type  Description
Surface
  LINK_PROB       1    D    percent of mention string links in STEM which target the candidate sense
  DICE_TEST       2    B    true if a Dice coefficient score passes the threshold
  ACRO_TEST       2    B    true if passes an acronym test
  SUBSTR_TEST     1    B    true if candidate or mention is a substring of the other
  WEAK_ALIAS      1    B    true if all three surface tests fail
Contextual
  CTX_SIM         1    D    candidate's average LLS score to context terms
  CTX_WT          1    B    sum of all context term scores
  CTX_CT          1    I    number of context terms
  ALIAS_HIT       1    B    true if a high-precision alias of this candidate is found
  FACT_HIT_PTS    1    D    points awarded if a candidate's fact phrase is found
Semantic
  QUERY_TYPE      1    E    semantic type of the query string according to the NER system
  CAND_TYPE       1    E    semantic type of the candidate according to the KB, DBpedia and WRATS
  SEM_CON         1    B    true if query and candidate type are not inconsistent
Generation
  SOURCE          7    B    true for each source which generates the candidate
  SOURCE_CT       1    I    total number of sources that generate the candidate
Other
  LINK_COMBO      1    D    weighted average between CTX_SIM and LINK_PROB
  WEAK_ALIAS_SRC  2    B    joint feature between WEAK_ALIAS and the SE or DPR sources
  LOG_LINK_CT     1    I    log of total link count to the candidate sense page
  SENSE_CT        1    I    number of candidate senses generated
  IS_BLOG         1    B    true if the source document is detected to be a blog
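Several of the surface features above (DICE_TEST, and the Soft Mentions generator) rely on a Dice coefficient between the mention and a candidate name. The slides do not say which units are compared, so this sketch assumes character bigrams and an illustrative threshold:

# Hedged sketch of a Dice-coefficient surface test (character bigrams and the
# threshold are assumptions, not LCC's actual settings).
def dice(a: str, b: str) -> float:
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = grams(a.lower()), grams(b.lower())
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0

def dice_test(mention: str, candidate: str, threshold: float = 0.8) -> bool:
    return dice(mention, candidate) >= threshold

For example, dice("Black Panther Party", "New Black Panther Party") easily clears such a threshold, while dice("AZ", "Arizona") does not.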

Page 8: Contextual Representation

• How to compare source document context and candidate backing text?
• A common approach is comparing weighted term vectors
  – Matching literal or stemmed terms suffers in both recall & precision
     • “North Korea”, “DPRK” (same concept)
     • “Al Qaeda”, “Hamas” (topically related concepts)

• We address these challenges by modeling context terms as Wikipedia pages (Milne and Witten, 2008)
  1. Low ambiguity context terms are identified and disambiguated
  2. Term similarity is determined by common in-bound links (sketched below)

• We call this model the Low Ambiguity [context term] Link Similarity (LLS)
• Originally used for cross-linking documents with Wikipedia articles
• Our focus is on correctly linking specific spans of text / individual entities
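LLS builds on the Milne and Witten (2008) in-link overlap measure. A minimal sketch, assuming the standard formulation and a simple clipping of the resulting distance into a [0, 1] similarity:

# Milne & Witten (2008) link-based relatedness between two Wikipedia pages,
# given the sets of pages linking in to each; num_pages is the size of Wikipedia.
from math import log

def link_similarity(inlinks_a: set, inlinks_b: set, num_pages: int) -> float:
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    distance = (log(max(len(inlinks_a), len(inlinks_b))) - log(len(common))) / \
               (log(num_pages) - log(min(len(inlinks_a), len(inlinks_b))))
    return max(0.0, 1.0 - distance)   # clip the distance into a [0, 1] similarity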

Page 9: LLS Context Method Example

docid = eng-WL-11-174596-12951106

Intergovernmental Panel on Climate Change (IPCC) made the claim which it said was based on detailed research into the impact of global warming... Dr Syed Hasnain, an Indian scientist then based at Jawaharlal Nehru University in Delhi, said that the claim was "speculation" and was not supported by any formal research... Despite this it rapidly became a key source for the IPCC when Prof Lal and his colleagues came to write the section on the Himalayas… When finally published, the IPCC report did give its source as the WWF study but went further, suggesting the likelihood of the glaciers melting was "very high".

[Slide figure: for the mention “WWF”, the candidates World Wrestling Entertainment and World Wide Fund for Nature are scored by LLS against the low-ambiguity context terms Intergovernmental Panel on Climate Change, Jawaharlal Nehru University, and Himalayas; World Wide Fund for Nature receives markedly higher link-similarity scores and is therefore preferred.]

Page 10: Context Term Selection

• Context terms are selected by considering
  – Ambiguity: link probability and link counts must exceed thresholds
  – Relatedness: score to other context terms (e.g. coherence)
  – Linkability: likelihood of being linked in Wikipedia
  – Proximity: prefer terms nearest the entity mention

• Iteration: insufficient context terms result in relaxed criteria
  – Reduce the minimum link count
  – Allow non-sentence zones

• Context Similarity (CTX_SIM) feature is the candidate’s average score to all context terms
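A sketch of how CTX_SIM could be assembled from the selected context terms; the thresholds, the term attributes, and the choice of similarity function (the link_similarity sketch above) are illustrative assumptions:

# Sketch of context term selection and the CTX_SIM feature (thresholds are placeholders).
from dataclasses import dataclass

@dataclass
class ContextTerm:
    inlinks: set       # pages linking in to the term's Wikipedia page
    link_prob: float   # how often this surface form is linked when it appears
    link_count: int    # how often this surface form is linked overall

def select_context_terms(terms, min_link_prob=0.5, min_link_count=10):
    # Keep low-ambiguity, frequently linked terms; the real system also weighs
    # relatedness, linkability, and proximity, and relaxes criteria when too
    # few terms survive.
    return [t for t in terms if t.link_prob >= min_link_prob and t.link_count >= min_link_count]

def ctx_sim(candidate_inlinks, context_terms, num_pages, similarity):
    # CTX_SIM: the candidate's average similarity to all selected context terms.
    if not context_terms:
        return 0.0
    scores = [similarity(candidate_inlinks, t.inlinks, num_pages) for t in context_terms]
    return sum(scores) / len(scores)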

Page 11: Alias and Fact Context

docid = eng-NG-31-142147-10015518 (the same Black Panther passage shown on Page 6)

[Slide figure: the candidates Black panther, Black Panthers (comics), Black Panther Party, and New Black Panther Party are shown again with their NAR, STEM, and DP sources; in addition, the surrounding text is marked with an ALIAS hit (the high-precision alias “New Black Panther Party”) and a FACT hit (the leader “Malik Zulu Shabazz”), both apparently supporting the New Black Panther Party candidate.]

Page 12: Semantic Features

• Indicate semantic type of entity for query mention and candidate sense
  – Important to eliminate spurious matches based on context alone
  – “AZ”, which is a PER, should never match Arizona (U.S. state)
  – Types: PER, ORG, GPE, UKN

• Query Type (QET): CiceroLite NER classifications across all document mentions, or UKN
  – Observed 65% recall and 96% precision

• Candidate Type (CET): determined using 3 sources: KB, DBpedia, LCC’s WRATS
  – LCC’s WRATS: 1.9M high-confidence Wikipedia entries typed with a 12-type ontology
  – Observed 97% precision with the combined approach

• Semantic Consistency: true if (QET == CET) or (QET or CET == UKN)
  – Fired “inconsistent” 10% of the time, with 96% precision
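The consistency rule reads directly as a small predicate; this is a literal sketch of the slide’s definition, using the PER/ORG/GPE/UKN type inventory:

# SEM_CON sketch: consistent if types agree, or if either side is unknown (UKN).
def semantically_consistent(query_type: str, cand_type: str) -> bool:
    return query_type == cand_type or "UKN" in (query_type, cand_type)

# e.g. semantically_consistent("PER", "GPE") is False, so a PER mention "AZ"
# can never be ranked as the state of Arizona.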

Page 13: Candidate Ranking and NIL Detection

1. Heuristic (see the sketch below)
  – LINK_COMBO (weighted average of LINK_PROB and CTX_SIM)
  – Boost score for ALIAS or FACT hit
  – Set to 0 if not semantically consistent
  – Emit NIL if score does not exceed a threshold

2. Binary logistic classifier
  – Train on all previously available 2009 and 2010 data
  – Generated samples from top 3 heuristic results, when key is not NIL
  – Emit NIL if classifier returns false, otherwise select max confidence candidate

• Additional NIL detection occurs if candidate is not in the KB (“known NIL”)
  – Benefit: provides the system concrete concepts for many NILs
  – Drawback: may link to an equivalent concept that is not in the KB, and emit NIL
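A minimal sketch of the heuristic ranker and threshold-based NIL detection; the combination weight, the boost, and the NIL threshold are placeholders rather than LCC’s actual values:

# Sketch of the heuristic ranker (weights and thresholds are placeholders).
def heuristic_score(f, w=0.5, boost=0.2):
    # f is a dict of the per-candidate features listed on Page 7.
    if not f["SEM_CON"]:
        return 0.0                                         # inconsistent types kill the candidate
    score = w * f["LINK_PROB"] + (1 - w) * f["CTX_SIM"]    # LINK_COMBO
    if f["ALIAS_HIT"] or f["FACT_HIT_PTS"] > 0:
        score += boost                                     # alias / fact evidence boost
    return score

def rank_and_detect_nil(candidates, nil_threshold=0.3):
    # candidates: list of {"kb_id": ..., "features": {...}} dicts.
    if not candidates:
        return "NIL"
    best = max(candidates, key=lambda c: heuristic_score(c["features"]))
    if heuristic_score(best["features"]) < nil_threshold:
        return "NIL"
    return best["kb_id"]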

Page 14: Evaluation Results

• LCC3: ML system with reduced bias to emit NIL
  – Motivation: 2009 data was 56% NIL vs. 2010’s 28%
  – Predicted the ratio of NILs almost exactly (45% vs. 50%), but did worse than LCC2

• Bolded scores are best across all 46 submissions
• Scores comparable to post-hoc eval on 2009 (87%) and 2010 training (90%)
• GPEs were lowest performing: usually insufficient or misleading document context
  – “AZ” (U.S. state) linked to Arizona (Rapper) because of entertainment context
  – “BUFFALO, N.Y.” dateline in a document with “Buffalo Bills” or “Buffalo Sabres”
• Better use of local context and semantic features should greatly help

RUN            ALL    KNOWN  NIL    PER    ORG    GPE
LCC1 H-NOWEB   85.78  79.22  91.22  96.01  82.40  78.91
LCC2 ML        86.80  80.59  91.95  95.61  85.20  79.57
LCC3 ML-R      86.44  82.35  89.84  95.34  84.53  79.44
Median         68.36  –      –      84.49  67.67  59.75

Page 15: Contextual Feature Contribution

• Baseline (BL) = all features except C1 & C2
• C1 = ALIAS_HIT, FACT_HIT, EAB
• C2 = CTX_SIM (LLS)
• * comparable to a submitted run, but after refactoring
• BF = bug fix for Wikipedia statistics (related to “Linkability”)

Ranking Features   H Score  ML Score
Baseline (BL)      80.89    79.87
BL+C1              82.58    82.18
BL+C2              84.27    84.79
BL+C1+C2           85.24*   86.49*
BL+C1+C2+BF        –        87.38

Page 16: Candidate Generator Ablation

• STEM and DP(R) provide the most unique gains
• No generator contributed more than 1 point (including web access)

Generator Set       Score
ALL                 87.38
ALL-NAR             -0.18
ALL-STEM (Links)    -0.98
ALL-DP              -0.89
ALL-DPR             -0.89
ALL-SOFT            -0.05
ALL-LM              -0.27
ALL-SE (Google)     -0.58

Page 17: Slot Filling Task and Approach

• Slot Filling Task: populate specific fields for an entity in a knowledge base

• Four step process, with linking system loosely integrated
  1. Query Processing
     – Slot filling queries provide entity ID when applicable
     – If unspecified, attempt to link entity to Wikipedia (for “known NILs”)
     – Generate aliases from Entity Linking resources

  2. Passage Retrieval
     – Aliases used to search the corpus for potential mentions
     – Any sentence with an alias or coreferring names or pronouns is retrieved
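A minimal sketch of the alias-based retrieval step, assuming a simple sentence-level corpus interface; coreference to names and pronouns, which the real system follows, is omitted here:

# Sketch of alias-based passage retrieval (the corpus interface is an assumption).
def retrieve_passages(aliases, corpus_sentences):
    lowered = [a.lower() for a in aliases]
    hits = []
    for docid, sentence in corpus_sentences:   # corpus_sentences yields (docid, sentence) pairs
        text = sentence.lower()
        if any(alias in text for alias in lowered):
            hits.append((docid, sentence))
    return hits

# e.g. retrieve_passages(["New Black Panther Party", "NBPP"], corpus)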

Page 18: Linking Application to Slot Filling

  3. Slot Fill Extraction
     – LCC’s CiceroCustom relation/event extractors (machine learning classifiers)
     – Rule-based extractors over POS, shallow parses and SRL (mostly pre-existing)
     – Attempted to project known answers from DBpedia

  4. Slot Fill Processing
     – Merge duplicate answers using string heuristics, Dice coefficient, WordNet, and Linker-generated aliases (see the sketch below)
     – Heuristically select the best answer using frequency, length, source
     – Also use the confidence with which we linked that mention of the target
        • Prefer answers confidently associated with the target entity
        • Filter answers that are very unconfident
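A hedged sketch of the duplicate-merging step; the greedy clustering, the threshold, and the reuse of the character-bigram dice function from the earlier sketch are assumptions, and the WordNet check is omitted:

# Sketch of duplicate-answer merging (clustering strategy and threshold are assumptions).
def merge_answers(answers, alias_sets, threshold=0.85):
    # alias_sets: collections of strings known to name the same thing (e.g. from the
    # Linker-generated aliases); dice() is the character-bigram Dice coefficient
    # sketched after the feature table on Page 7.
    clusters = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            same_alias = any({ans, rep} <= aliases for aliases in alias_sets)
            if same_alias or dice(ans, rep) >= threshold:
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    # Keep one representative per cluster, e.g. the longest (or most frequent) string.
    return [max(cluster, key=len) for cluster in clusters]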

Page 19: Results/Comments

Submission       P     R     F
LCC1             45.3  18.8  26.6
LCC2             44.9  19.4  27.1
LCC3             43.6  19.2  26.7
No aliases       48.1  15.8  23.8
Manual Aliases   48.9  16.8  25.0

• Initial recall on 2010 training data had been close to 40%
  • Perhaps there were many missing keys in the training data
• Generated aliases were noisy but still beneficial
  • 5/6 linking errors were due to “semi-famous” entities
     • “Sean Preston” → “Britney Spears”
• Manually gathered “gold” aliases performed worse
• Link-based filtering appears to have produced little benefit
  • Non-confusable name strings were used

Page 20: Surprise Slot Filling

• Same framework as the main slot filling task, but here we had no pre-existing slot fill extractors
  – PER:Disease
  – PER:Charity-Supported
  – PER:Award-Won
  – ORG:Product

• Rapidly customized entity/event extractors in a few hours
  – WELDER lexicon generation tool
     • 17k diseases, 12k awards, 7k charities
  – CiceroCustom event extractors
     • Day 1: first pass on all 4 slots (11 hours)
     • Day 2: Charity, Product (34 hours)
  – Rule for person + charity terms + charity in context (Charity)
  – High-recall grammar rules for patterns (Product)

• Product is challenging: it is a top-level type and products often have quite ambiguous names

Page 21: Results/Comments

• Better overall score than the main task
• 40% increase in score during the second day, mostly due to Product
  – Dev set showed improvements for Charity, but not the eval set

Submission   P     R     F     Time
LCC1         50.3  15.4  23.7  11 hours
LCC2         52.4  24.2  33.1  34 hours

Slot Type    LCC1                 LCC2
             P     R     F        P     R     F
Award        56.5  19.7  29.2     55.6  22.7  32.3
Charity      48.1  19.1  27.4     48.1  19.1  27.4
Disease      42.8  22.2  29.3     46.7  25.9  33.3
Product      50.5  13.3  21.1     53.0  25.3  34.3
Total        50.3  15.4  23.6     52.3  24.2  33.1

Page 22: Conclusions

• Slot Filling
  – Relatively good results, but apparent overfitting limited performance
  – Integrated entity linking did not prove helpful given low-polysemy targets

• Surprise Slot Filling
  – Customized system to 4 new slots in 1 and 2 days’ time
  – Scores better than the main slot filling task in much less time
  – Product extractor challenging

• Entity Linking
  – Context modeling approaches proved successful
  – Encouraging opportunities for error reduction
  – Future work: extend this model to add entries to the KB
  – Demo session: come see the linking system process entire documents