Term Informativeness for Named Entity Detection
Jason D. M. Rennie, MIT
Tommi Jaakkola, MIT
Information Extraction
President Bush signed the Central America Free Trade Agreement into law Tuesday…
Who What When
Named Entity Detection
President Bush signed the Central America Free Trade Agreement into law Tuesday, hailing the seven-nation pact as an open-door policy that will benefit U.S. exporters and seed prosperity and democracy in Central America and the Dominican Republic.
Informal Communication
• Other Sources of Information
  – E-mail
  – Web Bulletin Boards
  – Mailing Lists
• More specialized, up-to-date information
• But, harder to extract
IE for Informal Comm.
SUBJECT: Two New Ipswich Seafood Joints to Open Soon.
ALL HOUNDS ON DECK! #1 Across from the new HS, at the old White Cap Seafood is a renovated new joint and the sign says "Salt Box". I suspect they are opening soon; they look ready. Lets hope its great as there is too much 'just average' around here. #2: In the…
NED for Informal Comm.
Subject: finale harvard square
has anyone been to the recently opened finale in harvard square?
Restaurant Bulletin Board
• Gathered from a Restaurant BBoard
  – 6 sets of ~100 posts
  – 132 threads
  – Applied Ratnaparkhi’s POS tagger
  – Hand-labeled each token In/Out of restaurant name
Detecting Named Entities
[Diagram labels: Named Entity, Informative, Bursty; occurrences shown across Document 1, Document 2, Document 3]
Quantifying Informativeness
the, clandestine, Brazil
A Little History…
Z-measure [Brookes, 1968]
Inverse Doc. Freq. [Jones, 1973]
x^I [Bookstein & Swanson, 1974]
Residual IDF [Church & Gale, 1995]
Gain [Papineni, 2001]
Main Idea
• Informative words are:
  – Rare (IDF)
  – Modal (Mixture Score)
• Rarity and Modality are independent qualities
• We quantify informativeness using a product of IDF and Mixture Score
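The product score can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the natural-log base for IDF, the function names, and the example numbers are all assumptions, and the mixture score is taken as an already-computed input.

```python
import math

def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: -log of the fraction of documents containing the term.

    The log base (natural log here) is an assumption; the slides do not state it.
    """
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return -math.log(doc_freq / num_docs)

def informativeness(doc_freq: int, num_docs: int, mixture_score: float) -> float:
    """IDF * Mixture Score, with the mixture score supplied by the caller."""
    return idf(doc_freq, num_docs) * mixture_score

# Hypothetical term: appears in 10 of 1000 documents, mixture score 2.0.
print(informativeness(10, 1000, 2.0))
```

A term scores high only when it is both rare (large IDF) and modal (large mixture score); a value near zero in either factor pulls the product down.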
Binomial Distribution
Term Frequency Distributions
[Figure: per-document occurrence counts for “the” and “Brazil”; extracted values: 7 0 4 0 8 0 5 5 6 0]
Mixture Models
[Figure: mixture model illustration; extracted labels: 0.1%, 5%, 10%, 0, 5, 90%]
Modality
• Modal words fit a mixture much better than a single binomial
• We separately fit the binomial and mixture models to each term frequency distribution
• We quantify modality by comparing the fitness of the two models
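One plausible way to write the two models being compared is given here. The notation is an assumption, since the slide formulas did not survive in the transcript: x_i is the number of occurrences of the term in document i, n_i is the length of document i, theta is a per-token occurrence rate, and lambda is a mixing weight.

```latex
% Single binomial: one occurrence rate \theta shared by all documents
P_{\mathrm{bin}}(x_i \mid n_i, \theta)
  = \binom{n_i}{x_i}\, \theta^{x_i} (1-\theta)^{n_i - x_i}

% Two-component mixture: each document draws its rate from \{\theta_1, \theta_2\}
P_{\mathrm{mix}}(x_i \mid n_i, \lambda, \theta_1, \theta_2)
  = \lambda\, P_{\mathrm{bin}}(x_i \mid n_i, \theta_1)
  + (1-\lambda)\, P_{\mathrm{bin}}(x_i \mid n_i, \theta_2)
```

A word such as “Brazil”, which is absent from most documents but occurs several times where it does appear, is described far better by two rates than by one.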
Learning Mixture Parameters
Use Gradient Descent to learn the mixing weight and the two binomial rate parameters
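A minimal sketch of such a fit, under stated assumptions: the mixture is a two-component binomial mixture over (occurrences, document length) pairs, the parameters are optimized as logits so they stay inside (0, 1), and a step-halving rule is added for stability. The slides do not describe the optimizer details, so the parametrization, gradient scaling, initialization, and all names here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def clamp(z, bound=30.0):
    # Keep logits bounded so sigmoid(z) never rounds to exactly 0 or 1.
    return max(-bound, min(bound, z))

def log_binom(x, n, theta):
    """Log binomial pmf for x occurrences in a document of n tokens."""
    log_coef = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_coef + x * math.log(theta) + (n - x) * math.log(1.0 - theta)

def mixture_terms(data, a, b1, b2):
    """Return (log-likelihood, per-document responsibility of component 1)."""
    lam, t1, t2 = sigmoid(a), sigmoid(b1), sigmoid(b2)
    total, resp = 0.0, []
    for x, n in data:
        l1 = math.log(lam) + log_binom(x, n, t1)
        l2 = math.log(1.0 - lam) + log_binom(x, n, t2)
        m = max(l1, l2)
        log_mix = m + math.log(math.exp(l1 - m) + math.exp(l2 - m))
        total += log_mix
        resp.append(math.exp(l1 - log_mix))
    return total, resp

def fit_mixture(data, steps=500, lr=8.0):
    """Gradient descent on the negative log-likelihood (ascent on the likelihood).

    Assumes the term occurs at least once, so the overall rate is in (0, 1).
    """
    num_docs = len(data)
    num_tokens = sum(n for _, n in data)
    overall = sum(x for x, _ in data) / num_tokens
    # Start with one rate below and one above the overall rate (arbitrary choice).
    params = [0.0, logit(overall) - 1.0, logit(overall) + 1.0]
    ll, resp = mixture_terms(data, *params)
    for _ in range(steps):
        lam, t1, t2 = (sigmoid(p) for p in params)
        # Log-likelihood gradient w.r.t. the logits, rescaled per parameter.
        grad = [
            sum(r - lam for r in resp) / num_docs,
            sum(r * (x - n * t1) for r, (x, n) in zip(resp, data)) / num_tokens,
            sum((1.0 - r) * (x - n * t2) for r, (x, n) in zip(resp, data)) / num_tokens,
        ]
        step = lr
        while step > 1e-8:
            trial = [clamp(p + step * g) for p, g in zip(params, grad)]
            trial_ll, trial_resp = mixture_terms(data, *trial)
            if trial_ll >= ll:  # accept only steps that do not hurt the fit
                params, ll, resp = trial, trial_ll, trial_resp
                break
            step /= 2.0
        else:
            break  # no improving step found; treat as converged
    return sigmoid(params[0]), sigmoid(params[1]), sigmoid(params[2]), ll

# Hypothetical term: absent from 18 documents, 5 occurrences in each of 2 documents.
example = [(0, 100)] * 18 + [(5, 100)] * 2
print(fit_mixture(example))
```

Because every accepted step must not lower the log-likelihood, the returned fit is never worse than the starting point; it can still stop at a local optimum, so several initializations may be worth trying.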
Comparing Fitness
• Use log-odds to compare fitness of the two models
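The comparison can be sketched as a difference of log-likelihoods. This assumes the score is the log of the ratio between the fitted mixture likelihood and the fitted single-binomial likelihood; the exact definition is not in the transcript, and the function names and example parameters are hypothetical. The mixture parameters are passed in, while the single binomial uses its closed-form maximum-likelihood rate.

```python
import math

def log_binom(x, n, theta):
    """Log binomial pmf for x occurrences in a document of n tokens."""
    log_coef = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_coef + x * math.log(theta) + (n - x) * math.log(1.0 - theta)

def mixture_score(data, lam, theta1, theta2):
    """Log-odds of the mixture fit versus the single-binomial fit.

    data: list of (occurrences, document length) pairs for one term.
    Assumes the term occurs at least once, so the binomial rate is in (0, 1).
    """
    # Maximum-likelihood rate for the single binomial: total hits / total tokens.
    theta = sum(x for x, _ in data) / sum(n for _, n in data)
    ll_bin = sum(log_binom(x, n, theta) for x, n in data)
    ll_mix = 0.0
    for x, n in data:
        l1 = math.log(lam) + log_binom(x, n, theta1)
        l2 = math.log(1.0 - lam) + log_binom(x, n, theta2)
        m = max(l1, l2)  # log-sum-exp to avoid underflow
        ll_mix += m + math.log(math.exp(l1 - m) + math.exp(l2 - m))
    return ll_mix - ll_bin

# Hypothetical bursty term: 18 documents without it, 2 documents with 5 occurrences.
bursty = [(0, 100)] * 18 + [(5, 100)] * 2
print(mixture_score(bursty, lam=0.1, theta1=0.05, theta2=1e-6))
```

A score near zero means the mixture adds nothing over a single rate; a large positive score marks a term whose counts split into "absent" and "frequent" documents.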
Top Mixture Score Words
Token      Score    Rest. Occur.
sichaun    99.62    31/52
fish       50.59    7/73
was        48.79    0/483
speed      44.69    16/19
tacos      43.77    4/19
Independence
Are Rareness (IDF) and Modality (Mixture Score) independent?
Correlation Coefficient
Score Pair      Corr. Coefficient
IDF/Mixture     -0.0139
IDF/RIDF         0.4113
Mixture/RIDF     0.7380
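For reference, a correlation coefficient between two per-word score lists can be computed as below. This assumes the Pearson coefficient, which the slide does not name explicitly; the function name and sample values are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for four words under two measures.
print(pearson([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]))
```

A coefficient near zero, as reported for IDF/Mixture, indicates no linear relationship between the two scores; it does not by itself rule out other forms of dependence.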
Top Words Overlap Plot
• Two sorted lists
  – Sorted by IDF
  – Sorted by Mixture Score
• Look at % overlap among top N in both lists
• Plot % overlap as we vary N
• Independent scores would produce a line along the diagonal
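The overlap computation described above can be sketched as follows. The dictionary-based interface and the sample scores are assumptions made for illustration.

```python
def top_n_overlap(scores_a, scores_b, n):
    """Fraction of the top-n words under scores_a that are also top-n under scores_b.

    scores_a, scores_b: dicts mapping each word to its score (same vocabulary).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:n])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:n])
    return len(top_a & top_b) / n

# Hypothetical scores for a four-word vocabulary, ranked in opposite orders.
idf_scores = {"w1": 4.0, "w2": 3.0, "w3": 2.0, "w4": 1.0}
mix_scores = {"w1": 1.0, "w2": 2.0, "w3": 3.0, "w4": 4.0}
for n in range(1, 5):
    print(n, top_n_overlap(idf_scores, mix_scores, n))
```

For two unrelated rankings of a V-word vocabulary, each top-N word in one list lands in the other list's top N with probability N/V, so the expected overlap fraction grows linearly with N; that is the diagonal reference line.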
Overlap Plot
[Plot: Percent Overlap vs. # Top Words; curves: IDF/Mixture, IDF/RIDF]
Top IDF*Mixture Words
Token      Score     Rest. Occur.
sichaun    379.97    31/52
villa      197.08    10/11
tokyo      191.72    7/11
ribs       181.57    0/13
speed      156.23    16/19
Intro to NED Experiments
• Task: Identify Restaurant Names
• Use standard NED features (capitalization, punctuation, POS) as “Baseline”
• Add informativeness score as an additional feature
• Use F1 Breakeven as performance metric
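One common way to compute a breakeven point is sketched here: rank tokens by classifier score and predict as many positives as there are true positives in the data, at which point precision, recall, and F1 coincide. Whether the experiments used exactly this procedure is not stated in the slides, and the function name and sample data are hypothetical.

```python
def f1_breakeven(scores, labels):
    """F1 at the cutoff where precision equals recall.

    scores: classifier confidence per token (higher = more likely in a name).
    labels: 1 if the token is inside a restaurant name, else 0.
    Ties in score are broken by input order in this sketch.
    """
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Predict exactly num_pos positives: precision = recall = TP / num_pos.
    true_pos = sum(label for _, label in ranked[:num_pos])
    return true_pos / num_pos

# Hypothetical scores for four tokens, two of which are in restaurant names.
print(f1_breakeven([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0]))
```

Using a single breakeven number avoids choosing a decision threshold per feature set, which keeps the comparison across feature sets simple.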
NED Experiments
Feature Set     F1 Breakeven
Baseline        55.0%
IDF             56.0%
Mixture         56.0%
IDF,Mixture     56.9%
Residual IDF    57.4%
IDF*RIDF        58.5%
IDF*Mixture     59.3%
Summary
• Traditional syntax-based features are not enough for IE in e-mail & bulletin boards
• We used term occurrence statistics to construct an informativeness score (IDF*Mixture)
• We found IDF*Mixture to be useful for identifying topic-centric words and named entities
Discussion
• Phrases
• Foreign languages, Speech
• Co-reference resolution, context tracking
• Collaborative filtering