Digital Access of Handwritten Documents
Venu Govindaraju, Anurag Bhardwaj, Huaigu Cao
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Motivation
Vast, irreplaceable, culturally vital legacy collections of historical documents are competing ineffectively for attention with billions of digital documents.
Thus historical archives are threatened with neglect, perceived irrelevance, and eventually, oblivion.
Threat: 'If it's not in Google, it doesn't exist!' [Baird 2003]
Postal Context (138 million records)
- ZIP Code: 30% of ZIP Codes contain a single street name; 5% contain a single primary number; 2% contain a single add-on
- <ZIP Code, primary number>: maximum number of records returned is 3,071
- <ZIP Code, add-on>: maximum number of records returned is 3,070
LDR accuracy vs. lexicon size:

Lex size   Top 1   Top 2
10         96.5    98.7
100        89.2    94.1
1000       75.3    86.3
Paradigms
- Context-Ranked Lexicon
- Lexicon-Driven OCR (LDR)
- Lexicon-Free OCR (LFR)
- Segmentation → Recognition → Post-processing
Lexicon Free (LFR)
[Figure: segmentation graph over segments 1–8 with per-edge character hypotheses and confidences, e.g., i[.8], l[.8], u[.5], v[.2], w[.7], w[.6], m[.3], r[.4], d[.8], o[.5]]
- Image from segment 1 to 3 is a 'u' with 0.5 confidence
- Image from segment 1 to 4 is a 'w' with 0.7 confidence
- Image from segment 1 to 5 is a 'w' with 0.6 confidence and an 'm' with 0.3 confidence
Find the best path in the graph from segment 1 to 8.
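The best-path search over the segmentation graph can be sketched as a simple forward dynamic program. This is a minimal illustration, not the authors' implementation: edges carry (character, confidence) hypotheses, confidences are multiplied along a path, and the edge values extend the slide's example with invented numbers for the remaining segments.

```python
def best_path(edges, start, end):
    """edges: {(i, j): [(char, conf), ...]}; returns (confidence, string)."""
    best = {start: (1.0, "")}
    for node in sorted({j for _, j in edges} | {start}):
        if node not in best:
            continue
        conf, text = best[node]
        for (i, j), hyps in edges.items():
            if i != node:
                continue
            ch, p = max(hyps, key=lambda h: h[1])  # best character on this edge
            cand = (conf * p, text + ch)
            if j not in best or cand[0] > best[j][0]:
                best[j] = cand
    return best[end]

# Edge hypotheses: (1,4)='w'[.7] and (1,5)='w'[.6]/'m'[.3] are from the slide;
# the rest are invented so a full path from segment 1 to 8 exists.
edges = {
    (1, 4): [("w", 0.7)], (1, 5): [("w", 0.6), ("m", 0.3)],
    (4, 6): [("o", 0.5)], (5, 6): [("o", 0.4)],
    (6, 7): [("r", 0.4)], (7, 8): [("d", 0.8)],
}
print(best_path(edges, 1, 8))  # the path spelling "word"
```

Because all edges go forward, processing nodes in ascending order visits each node after all of its predecessors, so one pass suffices.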
Lexicon Driven (LDR)
[Figure: match distances between characters of 'word' and segment spans over positions 1–9, e.g., w[5.0], w[7.2], w[7.6], o[6.6], r[6.4], d[4.4]]
Find the best way of accounting for characters 'w', 'o', 'r', 'd' by consuming all segments 1 to 8.
Distance between lexicon entry 'word' first character 'w' and the image between:
- segments 1 and 4 is 5.0
- segments 1 and 3 is 7.2
- segments 1 and 2 is 7.6
[Kim & Govindaraju, TPAMI 1997]
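Lexicon-driven matching can be sketched as a dynamic program that assigns each character of the entry a run of segments so all segments are consumed and total distance is minimized. This is an assumed sketch, not the paper's algorithm; the 'w' distances are from the slide, the others are invented.

```python
def match_entry(entry, dist, n_segments):
    """dist[(char_idx, first_seg, last_seg)] = match distance for that span."""
    INF = float("inf")
    # best[k][s] = min cost of matching the first k chars to segments 1..s
    best = [[INF] * (n_segments + 1) for _ in range(len(entry) + 1)]
    best[0][0] = 0.0
    for k in range(1, len(entry) + 1):
        for s in range(1, n_segments + 1):
            for prev in range(k - 1, s):          # boundary before char k-1
                d = dist.get((k - 1, prev + 1, s), INF)
                if best[k - 1][prev] + d < best[k][s]:
                    best[k][s] = best[k - 1][prev] + d
    return best[len(entry)][n_segments]

dist = {
    (0, 1, 4): 5.0, (0, 1, 3): 7.2, (0, 1, 2): 7.6,   # 'w' (from the slide)
    (1, 4, 5): 6.6, (1, 5, 6): 6.0,                   # 'o' (invented)
    (2, 6, 7): 7.5, (2, 7, 7): 6.4,                   # 'r' (invented)
    (3, 7, 8): 6.5, (3, 8, 8): 4.4,                   # 'd' (invented)
}
print(match_entry("word", dist, 8))  # best total distance over segments 1..8
```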
a) Amherst b) Buffalo c) Boston
Interactive Models (LDR): 2-way interaction
a) San Jose b) Buffalo c) Washington d) None of the above
Search for Handwritten Documents
Lexicon      Good Quality    Historical    Medical
             10K     1K      10K    1K     4K
Top 1 (%)    57      67      12     28     20
Top 3 (%)    69      72      22     44     27
Top 10 (%)   74      75      32     72     42

- Lexicons are typically large: >5K
- Need around 70% accuracy

Strategy
- Reduce lexicon size using topic categorization (DAS 06; 08)
- Use Top-N choices returned by OCR (ICDAR 07)

[Milewski & Govindaraju, DAS 2006] [Farooq et al., DAS 2008] [Cao & Govindaraju, ICDAR 2007]
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Fusion of Recognizers (Type III)
Two recognizers (LDR and LFR) each produce a score vector over the N classes: S^1 = (s_1^1, ..., s_N^1), S^2 = (s_1^2, ..., s_N^2).

Identification task: combine per-class scores f(s_i^1, s_i^2), i = 1, ..., N, and output arg max_i f(s_i^1, s_i^2).
  e.g., Amherst: LDR 5.6, LFR .52 → f(s_1^1, s_1^2); Buffalo: LDR 7.4, LFR .81 → f(s_2^1, s_2^2); ...

Verification task: for the claimed class (e.g., Amherst: 5.6, .52), accept if f(s^1, s^2) > θ, otherwise reject.
Traditional Fusion Rules
- Sum rule: f(s^1, s^2) = s^1 + s^2
- Weighted sum rule: f(s^1, s^2) = w_1 s^1 + w_2 s^2
- Product rule: f(s^1, s^2) = s^1 × s^2
- Max rule: f(s^1, s^2) = max(s^1, s^2)
- Rank-based methods: r_i^1 = rank(s_i^1, {s_1^1, ..., s_N^1}); f(s_i^1, s_i^2) = r_i^1 + r_i^2, or f(s_i^1, s_i^2) = P(r_i^1, r_i^2 | gen)
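The traditional fusion rules are simple to state in code. A minimal sketch (treating higher scores as better for illustration; the example scores are the Amherst/Buffalo pairs from the identification slide):

```python
def sum_rule(s1, s2):              return s1 + s2
def weighted_sum(s1, s2, w1, w2):  return w1 * s1 + w2 * s2
def product_rule(s1, s2):          return s1 * s2
def max_rule(s1, s2):              return max(s1, s2)

def rank_fusion(scores1, scores2):
    """Rank-based fusion: class with the lowest combined rank (0 = best) wins."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    r1, r2 = ranks(scores1), ranks(scores2)
    combined = [a + b for a, b in zip(r1, r2)]
    return min(range(len(combined)), key=combined.__getitem__)

# Identification over two classes (Amherst, Buffalo) with LDR and LFR scores:
print(rank_fusion([5.6, 7.4], [0.52, 0.81]))  # class index 1 (Buffalo)
```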
Likelihood Ratio (Verification Tasks)
[Figure: genuine and impostor score-pair distributions in the (recognizer score 1, recognizer score 2) plane]
- 2 classes: impostor and genuine
- Pattern classification task

f_lr(s^1, s^2) = p_gen(s^1, s^2) / p_imp(s^1, s^2)

Minimum risk criterion: optimal decision boundaries coincide with the contours of the likelihood ratio function: f_V = f_lr.
Metaclassification with NN, SVM, etc. is also possible.
[Prabhakar & Jain 2002] [Nandakumar, Jain & Dass 2008]
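Likelihood-ratio verification can be sketched with toy genuine/impostor densities. This is an assumed illustration (the densities here are independent 2-D Gaussians with invented parameters, not the distributions estimated in the cited work):

```python
import math

def gauss2(s1, s2, mu, sigma):
    """Axis-aligned 2-D Gaussian density."""
    z = ((s1 - mu[0]) / sigma[0]) ** 2 + ((s2 - mu[1]) / sigma[1]) ** 2
    return math.exp(-0.5 * z) / (2 * math.pi * sigma[0] * sigma[1])

def f_lr(s1, s2):
    # Invented parameters: genuine scores cluster high, impostor scores low.
    p_gen = gauss2(s1, s2, mu=(8.0, 0.8), sigma=(1.0, 0.1))
    p_imp = gauss2(s1, s2, mu=(5.0, 0.4), sigma=(1.5, 0.15))
    return p_gen / p_imp

def verify(s1, s2, theta=1.0):
    """Accept when the likelihood ratio exceeds the threshold θ."""
    return "Accept" if f_lr(s1, s2) > theta else "Reject"

print(verify(7.8, 0.75))  # near the genuine mean
print(verify(5.1, 0.40))  # near the impostor mean
```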
Optimal Combination Functions
Identification task results (top-choice correct rate):
  LFR is correct: 54.8%;  LDR is correct: 77.2%;  Both are correct: 48.9%;  Either is correct: 83.0%
  Likelihood Ratio: 69.8%;  Weighted Sum: 81.6%
- The LR combination (f_V = f_lr) is worse than the single matcher
Verification task results: ROC
[Tulyakov & Govindaraju, IJPRAI 2009]
Independence of Scores (in a single trial)
S_i = f(s_i^1, s_i^2, ..., s_i^M, {s_k^1, ..., s_k^M}_{k≠i})
[Table: per-class scores from the two recognizers, e.g., Amherst — LDR 5.6, LFR .52 → f(s_1^1, s_1^2); Buffalo — LDR 7.4, LFR .81 → f(s_2^1, s_2^2); ...]
[Tulyakov & Govindaraju, IJPRAI 2009]
Dependencies
[Figure: two OCRs output score lists over classes A, B, C, ...: OCR 1 — .95, .89, .76, ...; OCR 2 — .80, .54, .43, ...]
arg max_k ∏_j p(s_j, t_j | C_k) / p(s_j, t_j | C̄_k)
[Tulyakov & Govindaraju, IJPRAI 2009]
Iterative Methods
- Initialize a combination function f(s^1, ..., s^M)
- Get scores from the same identification trial (for all trials)
- Update the function so the genuine score is better than any impostor score

Best Impostor Function: f(s_i^1, ..., s_i^M) = p_gen(s_i^1, ..., s_i^M) / p_imp(s_i^1, ..., s_i^M)
Sum of Logistic Functions: f(s^1, ..., s^M) = ∑_j 1 / (1 + e^{−(α_0j + α_1j s^1 + ... + α_Mj s^M)}), α_ij ≥ 0
Method      Likelihood Ratio   Weighted Sum   Best Impostor LR   Logistic Sum   Neural Network
LFR & LDR   69.84              81.58          80.07              81.43          81.67

[Tulyakov & Govindaraju, IJPRAI 2009]
- Pre-Hospital Care Report (PCR)
  WNY: 250,000 filed a year; NYC: 50,000 filed in a day; PDAs not popular
- OHR issues: loosely constrained writing style; large lexicons; heterogeneous data

6,700 carbon forms stored at 300 DPI; 1,000 PCR forms ground-truthed

Search Engine for Handwritten Medical Forms
- Find all people who reported asthma problems in NY
- How many people with high blood pressure are on medication X?
- Is there an epidemic breaking out?
Lexicon Reduction
[Pipeline: Handwritten Medical Documents → Lexicon-Free ICR features → Topic Category → Reduce Lexicon (large lexicon >5K → ~2.5K) → Lexicon-Driven recognition → improved performance]
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
cohesion(w_a, w_b) = z · f(w_a, w_b) / (f(w_a) · f(w_b))

DIGESTIVE-SYSTEM
FQ   CHSN   PHRASE
30   0.72   PAIN INCIDENT
5    0.31   PAIN TRANSPORTED
42   0.54   PAIN CHEST
52   0.81   STOMACH PAIN
9    0.25   HOME PAIN
6    0.43   VOMITING ILLNESS
Topic Features
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
(Chu-Carroll, et al., 1999)
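The cohesion score is a direct computation once phrase and word frequencies are known. A minimal sketch, with the frequencies and the scaling constant z invented for illustration:

```python
def cohesion(f_ab, f_a, f_b, z=1000.0):
    """cohesion(wa, wb) = z * f(wa, wb) / (f(wa) * f(wb))."""
    return z * f_ab / (f_a * f_b)

# e.g., a phrase seen together 52 times, with invented unigram frequencies:
print(cohesion(f_ab=52, f_a=80, f_b=800))
```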
B_{t,c} = A_{t,c} / √(∑_{e=1}^{n} A_{t,e}²)
IDF(t) = log₂(n / c(t))
X_{t,c} = IDF(t) · B_{t,c}

Topic Categorization
Cosine similarity between trained topic vectors and the test document
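The topic-vector construction above can be sketched end to end: A[t] holds per-category counts of term t, each row is L2-normalized into B, weighted by IDF (n categories, c(t) of them containing t), and a test document is assigned the category whose column is most cosine-similar. The counts below are invented.

```python
import math

def topic_vectors(A):
    """A: {term: [count in category 0, count in category 1, ...]} -> X."""
    n = len(next(iter(A.values())))             # number of categories
    X = {}
    for t, row in A.items():
        norm = math.sqrt(sum(a * a for a in row))
        c_t = sum(1 for a in row if a > 0)      # categories containing t
        idf = math.log2(n / c_t)
        X[t] = [idf * a / norm for a in row]
    return X

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / den if den else 0.0

def categorize(X, doc_counts):
    """Pick the category whose topic-vector column best matches the document."""
    terms = list(X)
    n = len(X[terms[0]])
    doc = [doc_counts.get(t, 0) for t in terms]
    sims = [cosine(doc, [X[t][c] for t in terms]) for c in range(n)]
    return max(range(n), key=sims.__getitem__)

A = {"pain": [30, 0], "chest": [12, 0], "trade": [0, 20], "bank": [0, 18]}
print(categorize(topic_vectors(A), {"pain": 3, "chest": 1}))  # category 0
```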
Results
             CLT → RLT   CL → RL    CLT → ALT   CLT → SLT
HR           ↑7.48%      ↑7.42%     ↑17.58%     ↑7.42%
Error Rate   ↓10.78%     ↓10.88%    ↓24.53%     ↓10.21%

C: complete lexicon; R: reduced lexicon; A: category given; S: synthetic features; T: truth present
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
- Train a maximum-entropy topic categorization model
- Generate the topic distribution of the test document
- Use the topic distribution to score each topic as a new prior
- Compute the posterior probability of word recognition
- Improves word recognition from 32% to 40% on the IAM dataset
Statistical Topic Modeling
Input word image → Noisy output: Toggle 0.92, Google 0.90, Noodle 0.70, ... → Correction Model → Corrected output: Google 0.96, Toggle 0.72, Noodle 0.58, ...
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Correction Model
The noisy output scores are p(word-image | term); the corrected output ranks by P(term | word-image):
P(term | word-image) ∝ P(word-image | term) × P(term)
                     = P(word-image | term) × ∑_i P(term | LM_i) × P(LM_i)
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
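The correction model can be sketched as a Bayesian rescoring of OCR candidates: each candidate's image likelihood is multiplied by a topic-weighted language-model prior. All probabilities below are invented for illustration.

```python
def rescore(ocr, term_given_lm, p_lm):
    """ocr: {term: P(image|term)}; term_given_lm: {term: [P(term|LM_i)]};
    p_lm: [P(LM_i)] (the document's topic distribution)."""
    posterior = {}
    for term, p_img in ocr.items():
        prior = sum(p_t * p_l for p_t, p_l in zip(term_given_lm[term], p_lm))
        posterior[term] = p_img * prior
    z = sum(posterior.values())          # normalize over the candidate list
    return {t: p / z for t, p in posterior.items()}

ocr = {"Toggle": 0.92, "Google": 0.90, "Noodle": 0.70}
term_given_lm = {"Toggle": [0.001, 0.01],
                 "Google": [0.002, 0.30],
                 "Noodle": [0.001, 0.02]}
p_lm = [0.3, 0.7]                        # invented topic distribution
scores = rescore(ocr, term_given_lm, p_lm)
print(max(scores, key=scores.get))       # "Google" now ranks first
```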
Language Model = P(t | LM_i)
Category c1 documents → LM1: P(eye|c1) = 0.92, P(brain|c1) = 0.90, ..., P(china|c1) = 0.09
Category c2 documents → LM2: P(trade|c2) = 0.82, P(bank|c2) = 0.78, ..., P(eye|c2) = 0.1
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Topic Distribution = P(LM_i)
P(c | d) = exp(∑_i λ_i f_i(d, c)) / ∑_{c'} exp(∑_i λ_i f_i(d, c'))
- Train the max-entropy model: fix the λ_i
- f_i is a feature (e.g., normalized word counts)
[Figure: OCR candidate lists per word image —
 I 0.80, T 0.65, H 0.35, ...;  JULY 0.90, FULLY 0.75, DULY 0.65, ...;
 CAVE 0.70, HAVE 0.55, HAS 0.15, ...;  DECEIVED 0.95, RECEIVED 0.55, PERCEIVED 0.30, ...;
 FAVOR 0.70, YOUR 0.55, COLOR 0.15, ...;  YOUR 0.95, HAVE 0.15, HAS 0.10, ...]
count(YOUR) = 0.95 + 0.55 = 1.50
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
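The soft word counts above (count(YOUR) = 0.95 + 0.55 = 1.50) simply accumulate each term's OCR confidence over all word images in a document. A minimal sketch using two of the slide's candidate lists:

```python
from collections import defaultdict

def soft_counts(candidate_lists):
    """candidate_lists: one [(term, confidence), ...] list per word image."""
    counts = defaultdict(float)
    for candidates in candidate_lists:
        for term, conf in candidates:
            counts[term] += conf
    return dict(counts)

lists = [
    [("JULY", 0.90), ("FULLY", 0.75), ("DULY", 0.65)],
    [("FAVOR", 0.70), ("YOUR", 0.55), ("COLOR", 0.15)],
    [("YOUR", 0.95), ("HAVE", 0.15), ("HAS", 0.10)],
]
print(soft_counts(lists)["YOUR"])  # ≈ 1.50
```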
Experiments
Corpus: IAM Database;  Word recognizer: WMR;  Topic categorization: Mallet;  LM: CMU-Cambridge LM toolkit
Training docs: 380;  training categories: 13;  test docs: 70;  test word images: 4,033

Method           Word Recognition
Raw              32.33%
Raw + 3-LM       35.95%
Raw + Topic-LM   40.63%

[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Vector IR Model (TF-IDF)
Set of terms {t_i}; set of documents {d_j} of length {L_j}
- Term Frequency (TF): tf_{i,j} = freq_{i,j} / L_j
- Inverse Document Frequency (IDF): idf_i = log( #{d_j} / #{j | freq_{i,j} > 0} )
- Query TF: tf_{i,q} = 1 if t_i is in the query, 0 otherwise
- Similarity: sim(d_j, q) = ∑_i tf_{i,j} · idf_i · tf_{i,q}
[Figure: worked example with q = {"back", "pain"} — tf values 0.024 ("back") and 0.008 ("pain") are multiplied by idf values 4.1 and 2.4 and by query tf 1, then summed into sim(d_j, q)]
[Baeza-Yates 1999]
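The TF-IDF similarity above is easy to sketch end to end on a toy corpus (the documents below are invented):

```python
import math

def tfidf_sim(docs, query):
    """docs: list of token lists; query: set of terms. Returns one score per doc."""
    N = len(docs)
    df = {}                                  # document frequency per term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    sims = []
    for doc in docs:
        L = len(doc)
        s = 0.0
        for t in query:                      # query tf is 1 for query terms
            if t in df:
                tf = doc.count(t) / L
                s += tf * math.log(N / df[t])
        sims.append(s)
    return sims

docs = [["back", "pain", "chest", "pain"],
        ["bank", "trade"],
        ["back", "injury"]]
print(tfidf_sim(docs, {"back", "pain"}))  # first doc scores highest
```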
OCR-Based IR (Prior Work)
- [Mittendorf, SDAIR 96; Ohta, ICDAR 97; Jing, CL 02]: model common OCR errors; machine print
- [Rath, SIGIR 04]: learn a word pdf over image features; requires a large annotated training set; assumes perfect segmentation and a single writer
- [Howe, SIGIR 2005]: assumes ranks obey a Zipfian distribution and that segmentation is perfect
Required (Assumed) Inputs
- Word segmentation result
- Word recognition likelihoods

Estimation
w = [w_1 w_2 ... w_L]: word images of document d_j
E{freq_{i,j}} = ∑_{k=1}^{L} Pr(t_i | w_k)
Example: Pr("pain" | w_k) = 0.02, 0.01, 0.2, 0.01, 0.01 over the word images of d_j, so E{freq_{"pain",j}} = 0.25
[Rath 04, Howe 05]
Estimating Term Frequency
E{freq_{i,j}} = ∑_{I_w} Pr(I_w) · Pr(t_i | I_w) + ε
where I_w ranges over word-image hypotheses and Pr(I_w) is the segmentation (hypothesis) probability.
[Figure: hypotheses with Pr(I_w) and Pr("head" | I_w), Pr("arm" | I_w), Pr("pelvis" | I_w) for terms {t_i}: t_0 = "head", t_1 = "arm", t_2 = "pelvis", ...]
E{freq_{1,j}} = ∑_{I_w} Pr(I_w) Pr("arm" | I_w) = 1 × 0.05 + 1 × 0.7 + 0.5 × 0.01 + 1 × 0.07 + ...
[Cao & Govindaraju, ICDAR 2007]
Estimating Segmentation
- Word segmentation: gap between adjacent connected components above a threshold D
- Generate multiple hypotheses with multiple values of D
- If a hypothesis I_w overlaps m other hypotheses, then Pr(I_w) = 1 / (m + 1)
[Figure: d > D; 3 hypotheses with m = 1, 2, 1, giving Pr(I_w) = 1/2, 1/3, 1/2]
[Cao & Govindaraju, ICDAR 2007]
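The Pr(I_w) = 1/(m+1) rule can be sketched with hypotheses represented as (start, end) pixel spans; the spans below are invented so the overlap counts match the slide's m = 1, 2, 1 example:

```python
def hypothesis_probs(spans):
    """spans: list of (start, end) word-image hypotheses.
    Pr(I_w) = 1 / (m + 1), where m = number of other hypotheses overlapping I_w."""
    probs = []
    for i, (a, b) in enumerate(spans):
        m = sum(1 for j, (c, d) in enumerate(spans)
                if j != i and a < d and c < b)   # open-interval overlap test
        probs.append(1.0 / (m + 1))
    return probs

spans = [(0, 10), (5, 20), (12, 20)]    # m = 1, 2, 1
print(hypothesis_probs(spans))          # [0.5, 0.333..., 0.5]
```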
Word Recognition: Pr(t_i | I_w) (Prior Work)
- Top-Rank (top-S candidates involved): Pr(t_i | I_w) = 1/S if 1 ≤ rank(t_i) ≤ S, 0 otherwise
- Weighted Top-Rank: with R = rank(t_i), Pr(t_i | I_w) = (top-R OCR rate) − (top-(R−1) OCR rate)
- Empirical: Pr(t_i | I_w) = Pr(t_i) · e^{−d_i²/(2σ²)} / ∑_i Pr(t_i) · e^{−d_i²/(2σ²)}
[Cao & Govindaraju, ICDAR 2007]
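The empirical rule can be sketched as a softmax-like conversion of OCR match distances into posteriors. The priors, distances, and σ below are invented for illustration:

```python
import math

def empirical_posteriors(priors, distances, sigma):
    """Pr(t_i | I_w) ∝ Pr(t_i) * exp(-d_i^2 / (2 σ^2)), normalized over the lexicon."""
    w = [p * math.exp(-d * d / (2 * sigma * sigma))
         for p, d in zip(priors, distances)]
    z = sum(w)
    return [x / z for x in w]

post = empirical_posteriors(priors=[0.5, 0.3, 0.2],
                            distances=[5.0, 7.2, 7.6], sigma=3.0)
print(post)  # the closest (smallest-distance) term dominates
```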
Results
- 16,303 word images (342 forms)
- Automatic segmentation: 63% correct, 32% under-segmented, 5% over-segmented
- Lexicon: 4,405 words
- Queries: 22 (1–3 words)

n                             1     2     5     10
Top-n word recognition rate   20%   27%   35%   42%
[Cao & Govindaraju, ICDAR 2007]
MMSE Estimation
- o = [o_1 o_2 ... o_T]: observation series
- w = [w_1 w_2 ... w_L]: word segmentation hypothesis
- τ = [τ_1 τ_2 ... τ_L]: a decoded term sequence
- Pr(w | o): word sequence segmentation probability
- Pr(τ | w): word sequence recognition probability
- #_{t_i}(τ): number of occurrences of t_i in τ

E{freq_{i,j}} = ∑_w ∑_τ Pr(w | o) · Pr(τ | w) · #_{t_i}(τ)

Example: #_"scene"("Pt ambulatory scene in co ...") = 1
Word Gap
- fv1: Euclidean distance between bounding boxes
- fv2: shortest white run between two connected components
- fv3: distance between convex hulls

Pr(Valid | fv) = Pr(Valid) p(fv | Valid) / [ Pr(Valid) p(fv | Valid) + Pr(Non-valid) p(fv | Non-valid) ]

Likelihoods estimated using a Parzen window with a Gaussian kernel
[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]
Word Recognition Likelihood
s(w, t_i): recognizer score for word image w against term t_i
UBM(s) = Pr(Genuine | s) = Pr(Genuine) Pr(s | Genuine) / [ Pr(Genuine) Pr(s | Genuine) + Pr(Impostor) Pr(s | Impostor) ]
p(w | t_i) ∝ UBM(s(w, t_i))
[Kim and Govindaraju, T-PAMI 1997; Cao et al ICFHR 2008]
S: number of top candidates retained in the OCR'ed text
VM: Vector Model; PM: a Probabilistic IR Model
MAP and R-Precision Values of IR Tests
[Chart: MAP and R-Precision over seven configurations — OCR'ed Text (S = 1, 3, 7, 15), VM + HR Estimation, PM + Naive MMSE Estimation, VM + MMSE Estimation; reported values range from about 0.11 to a best of 0.2042]
[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]
Error-prone segmentation; manual labeling; poor performance in multiple-writer scenarios.

Image-Based Methods (Rath et al 07, IJDAR)
- Corpus: Washington's manuscripts
- MAP performance: 40.98% (2,372 good-quality images); 16.5% (3,262 poor-quality images)
- Query: both image and text
- Script-specific upper/lower-profile structural features
- Observation density Pr(wrd, fv); posterior word recognition probability:
  Pr(wrd | fv) = Pr(wrd, fv) / ∑_{wrd} Pr(wrd, fv)
[Rath et al, CVPR 2003]
Keyword Spotting: Previous Work
- Matching in feature space: matching GSC features of two word images (512 bits); sensitive to noise and character segmentation
- Corpus: 9,312 word images (3,104 for queries and 6,208 for tests) from 776 individuals, 4 words
- R-Precision — GSC: 45.5%, 56.59%, 54.11%, 62.04%; DTW: 35.53%, 38.65%, 44.39%, 55.23%
- 1024-bit GSC feature
[Srihari et al, SPIE 2004]
Template-Free Word Spotting
- Matching Gabor features of two word images; posterior probability estimate from an SVM OCR
- Corpus: 12 medical forms with 5,295 character images; 101 samples of 6 keywords
- MAP performance: 67.1%, compared to 12.6% by DTW
V_w = [V_1^T V_2^T V_3^T V_4^T]^T
Probabilistic similarity: C_P(w, V_w) = −(1/n) ∑_{i=1}^{n} ln P(c_i | v_i)
[Cao & Govindaraju, ICAPR 2007]
Probabilistic Indexing
- σ_k: word-gap probability between components c_k and c_{k+1} (σ_0 precedes c_1)
- Word–query similarity for a candidate word w = c_i ... c_j:
  sim(w, q) = σ_{i−1} × (1 − σ_i) × ... × (1 − σ_{j−1}) × σ_j × Pr(q | w)
[Figure: word w with components c_1 c_2 c_3 c_4 and gap probabilities σ_0 σ_1 σ_2 σ_3 σ_4]
[Cao, Bharadwaj & Govindaraju, PR 2009]
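The word–query similarity can be sketched as follows. This is an assumed reading of the slide's notation (gap probabilities σ_0..σ_j, word gaps at the candidate's boundaries, non-word gaps inside); the σ values and recognition probability are invented:

```python
def word_query_sim(sigma, i, j, p_q_given_w):
    """sigma[k]: word-gap probability of gap k (sigma[0] precedes component c_1).
    Candidate word spans components c_i..c_j (1-indexed)."""
    s = sigma[i - 1] * sigma[j]      # boundary gaps should be word gaps
    for k in range(i, j):            # internal gaps should not be word gaps
        s *= 1.0 - sigma[k]
    return s * p_q_given_w

# Word spanning c_1..c_4 with gap probabilities sigma_0..sigma_4 (invented):
sigma = [0.9, 0.1, 0.2, 0.1, 0.8]
print(word_query_sim(sigma, 1, 4, p_q_given_w=0.7))
```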
Summary
- Handwriting recognition remains a challenging task despite success in postal applications
- Improved search technologies are needed to access handwritten documents on the web
- Statistical topic models can help in document categorization and lexicon reduction
- Document indexing can be performed by MMSE modeling that integrates segmentation, language models, and recognition
- Word spotting can be performed by indexing image-level features or OCR results