Post on 26-Mar-2015
Named-Entity Recognition with Character-Level Models
Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning
Stanford University
CoNLL-2003: Seventh Conference on Natural Language Learning
klein@cs.stanford.edu jsmarr@stanford.edu htnguyen@stanford.edu manning@cs.stanford.edu
Unknown Words are a Central Challenge for NER
Recognizing known named-entities (NEs) is relatively simple and accurate
Recognizing novel NEs requires recognizing context and/or word-internal features
External context and frequent internal words (e.g. “Inc.”) are the most commonly used features
Internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002)
  e.g. Staffordshire, Abdul-Karim al-Kabariti, CentrInvest
Are Names Self-Describing?
NO: names can be opaque/ambiguous
  Word-Level: “Washington” occurs as LOC, PER, and ORG
  Char-Level: “–ville” suggests LOC, but exceptions like “Neville”
YES: names can be highly distinctive/descriptive
  Word-Level: “National Bank” is a bank (i.e. ORG)
  Char-Level: “Cotramoxazole” is clearly a drug name
Question: Overall, how informative are names alone?
How Internally Descriptive are Isolated Named Entities?
Classification accuracy of pre-segmented CoNLL NEs without context is ~90%
Using character n-grams as features instead of words yields 25% error reduction
On single-word unknown NEs, word model is at chance; char n-gram model fixes 38% of errors
Figure: NE Classification Accuracy (%), not CoNLL task
  All NEs: Words 89.1, Char N-Grams 91.8
  Single-word UNKs: Words 37.5, Char N-Grams 60.7
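The error-reduction figures quoted on this slide can be checked from the charted accuracies. The sketch below assumes "error reduction" means the fraction of the word model's errors that the char n-gram model removes; the function name is illustrative, not from the original.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed; accuracies are in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


if __name__ == "__main__":
    # Accuracies as charted on the slide (words vs. char n-grams).
    print(f"All NEs:          {relative_error_reduction(89.1, 91.8):.1%}")
    print(f"Single-word UNKs: {relative_error_reduction(37.5, 60.7):.1%}")
```

Under that assumption the first figure comes out at roughly 25%, and the second at roughly 37%, slightly below the 38% quoted, which may reflect rounding in the plotted values.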
Exploiting Word-Internal Features
Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)
  e.g. Mikheev 97, Wacholder et al 97, Bikel et al 97
  Features usually language-dependent (e.g. morphology)
Our approach: use char n-grams as primary representation
Use all substrings as classification features:
  #Tom# → #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
  Char n-grams subsume word features
  Features are language-independent (assuming the language is alphabetic)
  Similar in spirit to Cucerzan and Yarowsky (99), but uses ALL char n-grams vs. just prefix/suffix
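A minimal sketch of the substring extraction illustrated by the "Tom" example. The function name, the `max_n` cap, and the choice to skip the bare boundary marker (which the slide's list also omits) are assumptions for illustration, not details confirmed by the slides.

```python
def char_ngrams(word, boundary="#", max_n=None):
    """Return all contiguous substrings of the boundary-padded word."""
    padded = f"{boundary}{word}{boundary}"
    limit = len(padded) if max_n is None else min(max_n, len(padded))
    grams = set()
    for n in range(1, limit + 1):
        for start in range(len(padded) - n + 1):
            gram = padded[start:start + n]
            # The slide's list leaves out the lone boundary marker; skip it too.
            if gram != boundary:
                grams.add(gram)
    return grams


if __name__ == "__main__":
    # Longest substrings first, to mirror the ordering on the slide.
    print(sorted(char_ngrams("Tom"), key=lambda g: (-len(g), g)))
```

Because the padded whole word (`#Tom#`) is itself one of the substrings, the word-identity feature is included automatically, which is the sense in which char n-grams subsume word features.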
Character-Feature Based Classifier
Model I: Independent classification at each word
  maxent classifiers, trained using conjugate gradient
  equal-scale Gaussian priors for smoothing
  trained models with >800K features in ~2 hrs
POS tags and contextual features complement n-grams
Description | Added Features | Overall F1 (English Dev.)
Words | w0 | 52.29
Official Baseline | - | 71.18
Char N-Grams | n(w0) | 73.10
POS Tags | t0 | 74.17
Simple Context | w-1, w0, t-1, t1 | 82.39
More Context | ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1› | 83.09
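For Model I, the training objective of a maxent classifier with equal-scale Gaussian priors can be sketched as a negative conditional log-likelihood plus a quadratic penalty shared by all weights. This is an illustrative sketch only: the data layout, the function name, and the `sigma` value are assumptions, and the optimizer (conjugate gradient on the slide) is not shown.

```python
import math


def penalized_neg_log_likelihood(weights, examples, labels, sigma=1.0):
    """Negative conditional log-likelihood plus a Gaussian-prior penalty.

    weights:  dict mapping (feature, label) -> float
    examples: list of (active_features, gold_label) pairs
    sigma:    prior standard deviation shared by every weight (assumed value)
    """
    nll = 0.0
    for features, gold in examples:
        # Linear score of each label given the active (binary) features.
        scores = {y: sum(weights.get((f, y), 0.0) for f in features)
                  for y in labels}
        # Log of the normalizer, computed stably (log-sum-exp).
        top = max(scores.values())
        log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
        nll -= scores[gold] - log_z
    # Equal-scale Gaussian prior: the same variance for all weights.
    penalty = sum(w * w for w in weights.values()) / (2.0 * sigma ** 2)
    return nll + penalty
```

A smaller `sigma` pulls weights harder toward zero, which is how the prior smooths a model with hundreds of thousands of sparse n-gram features.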
Character-Based CMM
Model II: Joint classifications along the sequence
Previous classification decisions are clearly relevant:
  “Grace Road” is a single location, not a person + location
Include neighboring classification decisions as features
Perform joint inference across chain of classifiers
  Conditional Markov Model (CMM, aka. maxent Markov model)
  Borthwick 1999, McCallum et al 2000
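One standard way to perform joint inference over a chain of local classifiers is Viterbi decoding; the slides do not say which search procedure was actually used, so the sketch below is an assumption. The `toy_log_prob` callback and all of its probabilities are invented purely to illustrate the "Grace Road" case.

```python
import math

STATES = ["PER", "LOC", "O"]


def toy_log_prob(prev_state, i, words, state):
    """Hypothetical local classifier: log P(state | prev_state, word i).

    The probabilities are made up for illustration only.
    """
    if words[i] == "Road" and prev_state == "LOC":
        probs = {"PER": 0.01, "LOC": 0.98, "O": 0.01}
    elif words[i] == "Road":
        probs = {"PER": 0.30, "LOC": 0.50, "O": 0.20}
    else:
        probs = {"PER": 0.55, "LOC": 0.40, "O": 0.05}
    return math.log(probs[state])


def viterbi_cmm(words, states, local_log_prob, start_state="<START>"):
    """Return the highest-scoring state sequence under the local classifiers."""
    if not words:
        return []
    # best[i][s] = (log-score of the best path ending in s at i, backpointer)
    best = [{s: (local_log_prob(start_state, 0, words, s), None)
             for s in states}]
    for i in range(1, len(words)):
        column = {}
        for s in states:
            column[s] = max(
                (best[i - 1][p][0] + local_log_prob(p, i, words, s), p)
                for p in states
            )
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for i in range(len(words) - 1, 0, -1):
        state = best[i][state][1]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi_cmm(["Grace", "Road"], STATES, toy_log_prob))
```

With these toy numbers, a greedy left-to-right pass would label "Grace" as PER and then "Road" as LOC, while the joint search prefers LOC for both words because the LOC-then-LOC path has the higher total score.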
Character-Based CMM
Final extra features:
  Letter-type patterns for each word
    United → Xx, 12-month → d-x, etc.
  Conjunction features
    E.g., previous state and current signature
  Repeated last words of multi-word names
    E.g., Jones after having seen Doug Jones
  … and a few more
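The letter-type patterns can be sketched as a mapping from characters to type symbols with repeated symbols collapsed. The exact rules here (uppercase to `X`, lowercase to `x`, digits to `d`, other characters kept as-is) are an assumption chosen to reproduce the two examples on the slide, not the authors' confirmed definition.

```python
def letter_type_signature(word):
    """Map each character to a type symbol and collapse repeated symbols."""
    symbols = []
    for ch in word:
        if ch.isupper():
            sym = "X"
        elif ch.islower():
            sym = "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # keep punctuation such as "-" unchanged
        # Append only when the symbol differs from the previous one.
        if not symbols or symbols[-1] != sym:
            symbols.append(sym)
    return "".join(symbols)


if __name__ == "__main__":
    for token in ["United", "12-month"]:
        print(token, "->", letter_type_signature(token))
```

A conjunction feature of the kind mentioned above could then be formed by pairing this signature with the previous state, for example the tuple `(prev_state, letter_type_signature(word))`.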
Description | Added Features | Overall F1 (English Dev)
More Context | ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1› | 83.09
Simple Sequence | s-1, ‹s-1, t-1, t0› | 85.44
More Sequence | ‹s-2, s-1›, ‹s-2, s-1, t-1, t0› | 87.21
Final | misc. extra features | 92.27
Final Results
Drop from English dev to test largely due to inconsistent labeling
Lack of capitalization cues in German hurts recall more because maxent classifier is precision-biased when faced with weak evidence
Figure: Precision, Recall, and F1 (%) on Eng Dev, Eng Test, Ger Dev, and Ger Test
  Labeled F1 values: Eng Dev 92.27, Eng Test 86.31, Ger Dev 67.03, Ger Test 71.90
Conclusions
Character substrings are valuable and underexploited model features
  Named entities are internally quite descriptive
  25-30% error reduction vs. word-level models
Discriminative maxent models allow productive feature engineering
  30% error reduction vs. basic model
What distinguishes our approach?
  More and better features
  Regularization is crucial for preventing overfitting