Digital Access of Handwritten Documents
Venu Govindaraju, Anurag Bhardwaj, Huaigu Cao
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Motivation
Vast, irreplaceable, culturally vital legacy collections of historical documents are competing ineffectively for attention with billions of digital documents.
Thus historical archives are threatened with neglect, perceived irrelevance, and eventually, oblivion.
Threat: 'If it's not in Google, it doesn't exist!' [Baird 2003]
Postal Context (138 million records)
- ZIP Code: 30% of ZIP Codes contain a single street name; 5% contain a single primary number; 2% contain a single add-on
- <ZIP Code, primary number>: maximum number of records returned is 3,071
- <ZIP Code, add-on>: maximum number of records returned is 3,070
LDR accuracy vs. lexicon size:

Lex size   Top 1   Top 2
10         96.5    98.7
100        89.2    94.1
1000       75.3    86.3
Paradigms
- Context-Ranked Lexicon
- Lexicon-Driven OCR (LDR)
- Lexicon-Free OCR (LFR)
- Segmentation → Recognition → Post-processing
Lexicon Free (LFR)
[Figure: segmentation graph over segments 1–8 with per-edge character hypotheses and confidences, e.g., i[.8], l[.8], u[.5], v[.2], w[.7], w[.6], m[.3], r[.4], d[.8], o[.5]]
- Image from segment 1 to 3 is a 'u' with 0.5 confidence
- Image from segment 1 to 4 is a 'w' with 0.7 confidence
- Image from segment 1 to 5 is a 'w' with 0.6 confidence and an 'm' with 0.3 confidence
Find the best path in the graph from segment 1 to 8.
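The best-path search over the segmentation graph can be sketched as a simple forward dynamic program. This is a minimal illustration, not the authors' implementation: edges carry (character, confidence) hypotheses, confidences are multiplied along a path, and the edge values extend the slide's example with invented numbers for the remaining segments.

```python
def best_path(edges, start, end):
    """edges: {(i, j): [(char, conf), ...]}; returns (confidence, string)."""
    best = {start: (1.0, "")}
    for node in sorted({j for _, j in edges} | {start}):
        if node not in best:
            continue
        conf, text = best[node]
        for (i, j), hyps in edges.items():
            if i != node:
                continue
            ch, p = max(hyps, key=lambda h: h[1])  # best character on this edge
            cand = (conf * p, text + ch)
            if j not in best or cand[0] > best[j][0]:
                best[j] = cand
    return best[end]

# Edge hypotheses: (1,4)='w'[.7] and (1,5)='w'[.6]/'m'[.3] are from the slide;
# the rest are invented so a full path from segment 1 to 8 exists.
edges = {
    (1, 4): [("w", 0.7)], (1, 5): [("w", 0.6), ("m", 0.3)],
    (4, 6): [("o", 0.5)], (5, 6): [("o", 0.4)],
    (6, 7): [("r", 0.4)], (7, 8): [("d", 0.8)],
}
print(best_path(edges, 1, 8))  # the path spelling "word"
```

Because all edges go forward, processing nodes in ascending order visits each node after all of its predecessors, so one pass suffices.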
Lexicon Driven (LDR)
[Figure: match distances between characters of 'word' and segment spans over positions 1–9, e.g., w[5.0], w[7.2], w[7.6], o[6.6], r[6.4], d[4.4]]
Find the best way of accounting for characters 'w', 'o', 'r', 'd' by consuming all segments 1 to 8.
Distance between lexicon entry 'word' first character 'w' and the image between:
- segments 1 and 4 is 5.0
- segments 1 and 3 is 7.2
- segments 1 and 2 is 7.6
[Kim & Govindaraju, TPAMI 1997]
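Lexicon-driven matching can be sketched as a dynamic program that assigns each character of the entry a run of segments so all segments are consumed and total distance is minimized. This is an assumed sketch, not the paper's algorithm; the 'w' distances are from the slide, the others are invented.

```python
def match_entry(entry, dist, n_segments):
    """dist[(char_idx, first_seg, last_seg)] = match distance for that span."""
    INF = float("inf")
    # best[k][s] = min cost of matching the first k chars to segments 1..s
    best = [[INF] * (n_segments + 1) for _ in range(len(entry) + 1)]
    best[0][0] = 0.0
    for k in range(1, len(entry) + 1):
        for s in range(1, n_segments + 1):
            for prev in range(k - 1, s):          # boundary before char k-1
                d = dist.get((k - 1, prev + 1, s), INF)
                if best[k - 1][prev] + d < best[k][s]:
                    best[k][s] = best[k - 1][prev] + d
    return best[len(entry)][n_segments]

dist = {
    (0, 1, 4): 5.0, (0, 1, 3): 7.2, (0, 1, 2): 7.6,   # 'w' (from the slide)
    (1, 4, 5): 6.6, (1, 5, 6): 6.0,                   # 'o' (invented)
    (2, 6, 7): 7.5, (2, 7, 7): 6.4,                   # 'r' (invented)
    (3, 7, 8): 6.5, (3, 8, 8): 4.4,                   # 'd' (invented)
}
print(match_entry("word", dist, 8))  # best total distance over segments 1..8
```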
a) Amherst b) Buffalo c) Boston
Interactive Models (LDR): 2-way interaction
a) San Jose b) Buffalo c) Washington d) None of the above
Search for Handwritten Documents
Lexicon      Good Quality    Historical    Medical
             10K     1K      10K    1K     4K
Top 1 (%)    57      67      12     28     20
Top 3 (%)    69      72      22     44     27
Top 10 (%)   74      75      32     72     42

- Lexicons are typically large: >5K
- Need around 70% accuracy

Strategy
- Reduce lexicon size using topic categorization (DAS 06; 08)
- Use Top-N choices returned by OCR (ICDAR 07)

[Milewski & Govindaraju, DAS 2006] [Farooq et al., DAS 2008] [Cao & Govindaraju, ICDAR 2007]
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Fusion of Recognizers (Type III)
Two recognizers (LDR and LFR) each produce a score vector over the N classes: S^1 = (s_1^1, ..., s_N^1), S^2 = (s_1^2, ..., s_N^2).

Identification task: combine per-class scores f(s_i^1, s_i^2), i = 1, ..., N, and output arg max_i f(s_i^1, s_i^2).
  e.g., Amherst: LDR 5.6, LFR .52 → f(s_1^1, s_1^2); Buffalo: LDR 7.4, LFR .81 → f(s_2^1, s_2^2); ...

Verification task: for the claimed class (e.g., Amherst: 5.6, .52), accept if f(s^1, s^2) > θ, otherwise reject.
Traditional Fusion Rules
- Sum rule: f(s^1, s^2) = s^1 + s^2
- Weighted sum rule: f(s^1, s^2) = w_1 s^1 + w_2 s^2
- Product rule: f(s^1, s^2) = s^1 × s^2
- Max rule: f(s^1, s^2) = max(s^1, s^2)
- Rank-based methods: r_i^1 = rank(s_i^1, {s_1^1, ..., s_N^1}); f(s_i^1, s_i^2) = r_i^1 + r_i^2, or f(s_i^1, s_i^2) = P(r_i^1, r_i^2 | gen)
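The traditional fusion rules are simple to state in code. A minimal sketch (treating higher scores as better for illustration; the example scores are the Amherst/Buffalo pairs from the identification slide):

```python
def sum_rule(s1, s2):              return s1 + s2
def weighted_sum(s1, s2, w1, w2):  return w1 * s1 + w2 * s2
def product_rule(s1, s2):          return s1 * s2
def max_rule(s1, s2):              return max(s1, s2)

def rank_fusion(scores1, scores2):
    """Rank-based fusion: class with the lowest combined rank (0 = best) wins."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        r = [0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    r1, r2 = ranks(scores1), ranks(scores2)
    combined = [a + b for a, b in zip(r1, r2)]
    return min(range(len(combined)), key=combined.__getitem__)

# Identification over two classes (Amherst, Buffalo) with LDR and LFR scores:
print(rank_fusion([5.6, 7.4], [0.52, 0.81]))  # class index 1 (Buffalo)
```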
Likelihood Ratio (Verification Tasks)
[Figure: genuine and impostor score-pair distributions in the (recognizer score 1, recognizer score 2) plane]
- 2 classes: impostor and genuine
- Pattern classification task

f_lr(s^1, s^2) = p_gen(s^1, s^2) / p_imp(s^1, s^2)

Minimum risk criterion: optimal decision boundaries coincide with the contours of the likelihood ratio function: f_V = f_lr.
Metaclassification with NN, SVM, etc. is also possible.
[Prabhakar & Jain 2002] [Nandakumar, Jain & Dass 2008]
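Likelihood-ratio verification can be sketched with toy genuine/impostor densities. This is an assumed illustration (the densities here are independent 2-D Gaussians with invented parameters, not the distributions estimated in the cited work):

```python
import math

def gauss2(s1, s2, mu, sigma):
    """Axis-aligned 2-D Gaussian density."""
    z = ((s1 - mu[0]) / sigma[0]) ** 2 + ((s2 - mu[1]) / sigma[1]) ** 2
    return math.exp(-0.5 * z) / (2 * math.pi * sigma[0] * sigma[1])

def f_lr(s1, s2):
    # Invented parameters: genuine scores cluster high, impostor scores low.
    p_gen = gauss2(s1, s2, mu=(8.0, 0.8), sigma=(1.0, 0.1))
    p_imp = gauss2(s1, s2, mu=(5.0, 0.4), sigma=(1.5, 0.15))
    return p_gen / p_imp

def verify(s1, s2, theta=1.0):
    """Accept when the likelihood ratio exceeds the threshold θ."""
    return "Accept" if f_lr(s1, s2) > theta else "Reject"

print(verify(7.8, 0.75))  # near the genuine mean
print(verify(5.1, 0.40))  # near the impostor mean
```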
Optimal Combination Functions
Identification task results (top-choice correct rate):
  LFR is correct: 54.8%;  LDR is correct: 77.2%;  Both are correct: 48.9%;  Either is correct: 83.0%
  Likelihood Ratio: 69.8%;  Weighted Sum: 81.6%
- The LR combination (f_V = f_lr) is worse than the single matcher
Verification task results: ROC
[Tulyakov & Govindaraju, IJPRAI 2009]
Independence of Scores (in a single trial)
S_i = f(s_i^1, s_i^2, ..., s_i^M, {s_k^1, ..., s_k^M}_{k≠i})
[Table: per-class scores from the two recognizers, e.g., Amherst — LDR 5.6, LFR .52 → f(s_1^1, s_1^2); Buffalo — LDR 7.4, LFR .81 → f(s_2^1, s_2^2); ...]
[Tulyakov & Govindaraju, IJPRAI 2009]
Dependencies
[Figure: two OCRs output score lists over classes A, B, C, ...: OCR 1 — .95, .89, .76, ...; OCR 2 — .80, .54, .43, ...]
arg max_k ∏_j p(s_j, t_j | C_k) / p(s_j, t_j | C̄_k)
[Tulyakov & Govindaraju, IJPRAI 2009]
Iterative Methods
- Initialize a combination function f(s^1, ..., s^M)
- Get scores from the same identification trial (for all trials)
- Update the function so the genuine score is better than any impostor score

Best Impostor Function: f(s_i^1, ..., s_i^M) = p_gen(s_i^1, ..., s_i^M) / p_imp(s_i^1, ..., s_i^M)
Sum of Logistic Functions: f(s^1, ..., s^M) = ∑_j 1 / (1 + e^{−(α_0j + α_1j s^1 + ... + α_Mj s^M)}), α_ij ≥ 0
Method      Likelihood Ratio   Weighted Sum   Best Impostor LR   Logistic Sum   Neural Network
LFR & LDR   69.84              81.58          80.07              81.43          81.67

[Tulyakov & Govindaraju, IJPRAI 2009]
- Pre-Hospital Care Report (PCR)
  WNY: 250,000 filed a year; NYC: 50,000 filed in a day; PDAs not popular
- OHR issues: loosely constrained writing style; large lexicons; heterogeneous data

6,700 carbon forms stored at 300 DPI; 1,000 PCR forms ground-truthed

Search Engine for Handwritten Medical Forms
- Find all people who reported asthma problems in NY
- How many people with high blood pressure are on medication X?
- Is there an epidemic breaking out?
Lexicon Reduction
[Pipeline: Handwritten Medical Documents → Lexicon-Free ICR features → Topic Category → Reduce Lexicon (large lexicon >5K → ~2.5K) → Lexicon-Driven recognition → improved performance]
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
cohesion(w_a, w_b) = z · f(w_a, w_b) / (f(w_a) · f(w_b))

DIGESTIVE-SYSTEM
FQ   CHSN   PHRASE
30   0.72   PAIN INCIDENT
5    0.31   PAIN TRANSPORTED
42   0.54   PAIN CHEST
52   0.81   STOMACH PAIN
9    0.25   HOME PAIN
6    0.43   VOMITING ILLNESS
Topic Features
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
(Chu-Carroll, et al., 1999)
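The cohesion score is a direct computation once phrase and word frequencies are known. A minimal sketch, with the frequencies and the scaling constant z invented for illustration:

```python
def cohesion(f_ab, f_a, f_b, z=1000.0):
    """cohesion(wa, wb) = z * f(wa, wb) / (f(wa) * f(wb))."""
    return z * f_ab / (f_a * f_b)

# e.g., a phrase seen together 52 times, with invented unigram frequencies:
print(cohesion(f_ab=52, f_a=80, f_b=800))
```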
B_{t,c} = A_{t,c} / √(∑_{e=1}^{n} A_{t,e}²)
IDF(t) = log₂(n / c(t))
X_{t,c} = IDF(t) · B_{t,c}

Topic Categorization
Cosine similarity between trained topic vectors and the test document
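The topic-vector construction above can be sketched end to end: A[t] holds per-category counts of term t, each row is L2-normalized into B, weighted by IDF (n categories, c(t) of them containing t), and a test document is assigned the category whose column is most cosine-similar. The counts below are invented.

```python
import math

def topic_vectors(A):
    """A: {term: [count in category 0, count in category 1, ...]} -> X."""
    n = len(next(iter(A.values())))             # number of categories
    X = {}
    for t, row in A.items():
        norm = math.sqrt(sum(a * a for a in row))
        c_t = sum(1 for a in row if a > 0)      # categories containing t
        idf = math.log2(n / c_t)
        X[t] = [idf * a / norm for a in row]
    return X

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / den if den else 0.0

def categorize(X, doc_counts):
    """Pick the category whose topic-vector column best matches the document."""
    terms = list(X)
    n = len(X[terms[0]])
    doc = [doc_counts.get(t, 0) for t in terms]
    sims = [cosine(doc, [X[t][c] for t in terms]) for c in range(n)]
    return max(range(n), key=sims.__getitem__)

A = {"pain": [30, 0], "chest": [12, 0], "trade": [0, 20], "bank": [0, 18]}
print(categorize(topic_vectors(A), {"pain": 3, "chest": 1}))  # category 0
```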
Results
             CLT → RLT   CL → RL    CLT → ALT   CLT → SLT
HR           ↑7.48%      ↑7.42%     ↑17.58%     ↑7.42%
Error Rate   ↓10.78%     ↓10.88%    ↓24.53%     ↓10.21%

C: complete lexicon; R: reduced lexicon; A: category given; S: synthetic features; T: truth present
[Milewski, Bhardwaj & Govindaraju, IJDAR 2009]
- Train a maximum-entropy topic categorization model
- Generate the topic distribution of the test document
- Use the topic distribution to score each topic as a new prior
- Compute the posterior probability of word recognition
- Improves word recognition from 32% to 40% on the IAM dataset
Statistical Topic Modeling
Input word image → Noisy output: Toggle 0.92, Google 0.90, Noodle 0.70, ... → Correction Model → Corrected output: Google 0.96, Toggle 0.72, Noodle 0.58, ...
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Correction Model
The noisy output scores are p(word-image | term); the corrected output ranks by P(term | word-image):
P(term | word-image) ∝ P(word-image | term) × P(term)
                     = P(word-image | term) × ∑_i P(term | LM_i) × P(LM_i)
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
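The correction model can be sketched as a Bayesian rescoring of OCR candidates: each candidate's image likelihood is multiplied by a topic-weighted language-model prior. All probabilities below are invented for illustration.

```python
def rescore(ocr, term_given_lm, p_lm):
    """ocr: {term: P(image|term)}; term_given_lm: {term: [P(term|LM_i)]};
    p_lm: [P(LM_i)] (the document's topic distribution)."""
    posterior = {}
    for term, p_img in ocr.items():
        prior = sum(p_t * p_l for p_t, p_l in zip(term_given_lm[term], p_lm))
        posterior[term] = p_img * prior
    z = sum(posterior.values())          # normalize over the candidate list
    return {t: p / z for t, p in posterior.items()}

ocr = {"Toggle": 0.92, "Google": 0.90, "Noodle": 0.70}
term_given_lm = {"Toggle": [0.001, 0.01],
                 "Google": [0.002, 0.30],
                 "Noodle": [0.001, 0.02]}
p_lm = [0.3, 0.7]                        # invented topic distribution
scores = rescore(ocr, term_given_lm, p_lm)
print(max(scores, key=scores.get))       # "Google" now ranks first
```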
Language Model = P(t | LM_i)
Category c1 documents → LM1: P(eye|c1) = 0.92, P(brain|c1) = 0.90, ..., P(china|c1) = 0.09
Category c2 documents → LM2: P(trade|c2) = 0.82, P(bank|c2) = 0.78, ..., P(eye|c2) = 0.1
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Topic Distribution = P(LM_i)
P(c | d) = exp(∑_i λ_i f_i(d, c)) / ∑_{c'} exp(∑_i λ_i f_i(d, c'))
- Train the max-entropy model: fix the λ_i
- f_i is a feature (e.g., normalized word counts)
[Figure: OCR candidate lists per word image —
 I 0.80, T 0.65, H 0.35, ...;  JULY 0.90, FULLY 0.75, DULY 0.65, ...;
 CAVE 0.70, HAVE 0.55, HAS 0.15, ...;  DECEIVED 0.95, RECEIVED 0.55, PERCEIVED 0.30, ...;
 FAVOR 0.70, YOUR 0.55, COLOR 0.15, ...;  YOUR 0.95, HAVE 0.15, HAS 0.10, ...]
count(YOUR) = 0.95 + 0.55 = 1.50
[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
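The soft word counts above (count(YOUR) = 0.95 + 0.55 = 1.50) simply accumulate each term's OCR confidence over all word images in a document. A minimal sketch using two of the slide's candidate lists:

```python
from collections import defaultdict

def soft_counts(candidate_lists):
    """candidate_lists: one [(term, confidence), ...] list per word image."""
    counts = defaultdict(float)
    for candidates in candidate_lists:
        for term, conf in candidates:
            counts[term] += conf
    return dict(counts)

lists = [
    [("JULY", 0.90), ("FULLY", 0.75), ("DULY", 0.65)],
    [("FAVOR", 0.70), ("YOUR", 0.55), ("COLOR", 0.15)],
    [("YOUR", 0.95), ("HAVE", 0.15), ("HAS", 0.10)],
]
print(soft_counts(lists)["YOUR"])  # ≈ 1.50
```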
Experiments
Corpus: IAM Database;  Word recognizer: WMR;  Topic categorization: Mallet;  LM: CMU-Cambridge LM toolkit
Training docs: 380;  training categories: 13;  test docs: 70;  test word images: 4,033

Method           Word Recognition
Raw              32.33%
Raw + 3-LM       35.95%
Raw + Topic-LM   40.63%

[Bhardwaj, Farooq, Cao & Govindaraju, AND 2008]
Outline
- Recognition: Postal Application, Paradigms
- Search: OCR accuracy, Fusion, Lexicon Reduction, Statistical Topic Models
- Document Search: Word Spotting
Vector IR Model (TF-IDF)
Set of terms {t_i}; set of documents {d_j} of length {L_j}
- Term Frequency (TF): tf_{i,j} = freq_{i,j} / L_j
- Inverse Document Frequency (IDF): idf_i = log( #{d_j} / #{j | freq_{i,j} > 0} )
- Query TF: tf_{i,q} = 1 if t_i is in the query, 0 otherwise
- Similarity: sim(d_j, q) = ∑_i tf_{i,j} · idf_i · tf_{i,q}
[Figure: worked example with q = {"back", "pain"} — tf values 0.024 ("back") and 0.008 ("pain") are multiplied by idf values 4.1 and 2.4 and by query tf 1, then summed into sim(d_j, q)]
[Baeza-Yates 1999]
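The TF-IDF similarity above is easy to sketch end to end on a toy corpus (the documents below are invented):

```python
import math

def tfidf_sim(docs, query):
    """docs: list of token lists; query: set of terms. Returns one score per doc."""
    N = len(docs)
    df = {}                                  # document frequency per term
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    sims = []
    for doc in docs:
        L = len(doc)
        s = 0.0
        for t in query:                      # query tf is 1 for query terms
            if t in df:
                tf = doc.count(t) / L
                s += tf * math.log(N / df[t])
        sims.append(s)
    return sims

docs = [["back", "pain", "chest", "pain"],
        ["bank", "trade"],
        ["back", "injury"]]
print(tfidf_sim(docs, {"back", "pain"}))  # first doc scores highest
```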
OCR-Based IR (Prior Work)
- [Mittendorf, SDAIR 96; Ohta, ICDAR 97; Jing, CL 02]: model common OCR errors; machine print
- [Rath, SIGIR 04]: learn a word pdf over image features; requires a large annotated training set; assumes perfect segmentation and a single writer
- [Howe, SIGIR 2005]: assumes ranks obey a Zipfian distribution and that segmentation is perfect
Required (Assumed) Inputs
- Word segmentation result
- Word recognition likelihoods

Estimation
w = [w_1 w_2 ... w_L]: word images of document d_j
E{freq_{i,j}} = ∑_{k=1}^{L} Pr(t_i | w_k)
Example: Pr("pain" | w_k) = 0.02, 0.01, 0.2, 0.01, 0.01 over the word images of d_j, so E{freq_{"pain",j}} = 0.25
[Rath 04, Howe 05]
Estimating Term Frequency
E{freq_{i,j}} = ∑_{I_w} Pr(I_w) · Pr(t_i | I_w) + ε
where I_w ranges over word-image hypotheses and Pr(I_w) is the segmentation (hypothesis) probability.
[Figure: hypotheses with Pr(I_w) and Pr("head" | I_w), Pr("arm" | I_w), Pr("pelvis" | I_w) for terms {t_i}: t_0 = "head", t_1 = "arm", t_2 = "pelvis", ...]
E{freq_{1,j}} = ∑_{I_w} Pr(I_w) Pr("arm" | I_w) = 1 × 0.05 + 1 × 0.7 + 0.5 × 0.01 + 1 × 0.07 + ...
[Cao & Govindaraju, ICDAR 2007]
Estimating Segmentation
- Word segmentation: gap between adjacent connected components above a threshold D
- Generate multiple hypotheses with multiple values of D
- If a hypothesis I_w overlaps m other hypotheses, then Pr(I_w) = 1 / (m + 1)
[Figure: d > D; 3 hypotheses with m = 1, 2, 1, giving Pr(I_w) = 1/2, 1/3, 1/2]
[Cao & Govindaraju, ICDAR 2007]
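The Pr(I_w) = 1/(m+1) rule can be sketched with hypotheses represented as (start, end) pixel spans; the spans below are invented so the overlap counts match the slide's m = 1, 2, 1 example:

```python
def hypothesis_probs(spans):
    """spans: list of (start, end) word-image hypotheses.
    Pr(I_w) = 1 / (m + 1), where m = number of other hypotheses overlapping I_w."""
    probs = []
    for i, (a, b) in enumerate(spans):
        m = sum(1 for j, (c, d) in enumerate(spans)
                if j != i and a < d and c < b)   # open-interval overlap test
        probs.append(1.0 / (m + 1))
    return probs

spans = [(0, 10), (5, 20), (12, 20)]    # m = 1, 2, 1
print(hypothesis_probs(spans))          # [0.5, 0.333..., 0.5]
```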
Word Recognition: Pr(t_i | I_w) (Prior Work)
- Top-Rank (top-S candidates involved): Pr(t_i | I_w) = 1/S if 1 ≤ rank(t_i) ≤ S, 0 otherwise
- Weighted Top-Rank: with R = rank(t_i), Pr(t_i | I_w) = (top-R OCR rate) − (top-(R−1) OCR rate)
- Empirical: Pr(t_i | I_w) = Pr(t_i) · e^{−d_i²/(2σ²)} / ∑_i Pr(t_i) · e^{−d_i²/(2σ²)}
[Cao & Govindaraju, ICDAR 2007]
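The empirical rule can be sketched as a softmax-like conversion of OCR match distances into posteriors. The priors, distances, and σ below are invented for illustration:

```python
import math

def empirical_posteriors(priors, distances, sigma):
    """Pr(t_i | I_w) ∝ Pr(t_i) * exp(-d_i^2 / (2 σ^2)), normalized over the lexicon."""
    w = [p * math.exp(-d * d / (2 * sigma * sigma))
         for p, d in zip(priors, distances)]
    z = sum(w)
    return [x / z for x in w]

post = empirical_posteriors(priors=[0.5, 0.3, 0.2],
                            distances=[5.0, 7.2, 7.6], sigma=3.0)
print(post)  # the closest (smallest-distance) term dominates
```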
Results
- 16,303 word images (342 forms)
- Automatic segmentation: 63% correct, 32% under-segmented, 5% over-segmented
- Lexicon: 4,405 words
- Queries: 22 (1–3 words)

n                             1     2     5     10
Top-n word recognition rate   20%   27%   35%   42%
[Cao & Govindaraju, ICDAR 2007]
MMSE Estimation
- o = [o_1 o_2 ... o_T]: observation series
- w = [w_1 w_2 ... w_L]: word segmentation hypothesis
- τ = [τ_1 τ_2 ... τ_L]: a decoded term sequence
- Pr(w | o): word sequence segmentation probability
- Pr(τ | w): word sequence recognition probability
- #_{t_i}(τ): number of occurrences of t_i in τ

E{freq_{i,j}} = ∑_w ∑_τ Pr(w | o) · Pr(τ | w) · #_{t_i}(τ)

Example: #_"scene"("Pt ambulatory scene in co ...") = 1
Word Gap
- fv1: Euclidean distance between bounding boxes
- fv2: shortest white run between two connected components
- fv3: distance between convex hulls

Pr(Valid | fv) = Pr(Valid) p(fv | Valid) / [ Pr(Valid) p(fv | Valid) + Pr(Non-valid) p(fv | Non-valid) ]

Likelihoods estimated using a Parzen window with a Gaussian kernel
[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]
Word Recognition Likelihood
s(w, t_i): recognizer score for word image w against term t_i
UBM(s) = Pr(Genuine | s) = Pr(Genuine) Pr(s | Genuine) / [ Pr(Genuine) Pr(s | Genuine) + Pr(Impostor) Pr(s | Impostor) ]
p(w | t_i) ∝ UBM(s(w, t_i))
[Kim and Govindaraju, T-PAMI 1997; Cao et al ICFHR 2008]
S: number of top candidates retained in the OCR'ed text
VM: Vector Model; PM: a Probabilistic IR Model
MAP and R-Precision Values of IR Tests
[Chart: MAP and R-Precision over seven configurations — OCR'ed Text (S = 1, 3, 7, 15), VM + HR Estimation, PM + Naive MMSE Estimation, VM + MMSE Estimation; reported values range from about 0.11 to a best of 0.2042]
[Cao, Bharadwaj, & Govindaraju, ICFHR 2008]
Error-prone segmentation; manual labeling; poor performance in multiple-writer scenarios.

Image-Based Methods (Rath et al 07, IJDAR)
- Corpus: Washington's manuscripts
- MAP performance: 40.98% (2,372 good-quality images); 16.5% (3,262 poor-quality images)
- Query: both image and text
- Script-specific upper/lower-profile structural features
- Observation density Pr(wrd, fv); posterior word recognition probability:
  Pr(wrd | fv) = Pr(wrd, fv) / ∑_{wrd} Pr(wrd, fv)
[Rath et al, CVPR 2003]
Keyword Spotting: Previous Work
- Matching in feature space: matching GSC features of two word images (512 bits); sensitive to noise and character segmentation
- Corpus: 9,312 word images (3,104 for queries and 6,208 for tests) from 776 individuals, 4 words
- R-Precision — GSC: 45.5%, 56.59%, 54.11%, 62.04%; DTW: 35.53%, 38.65%, 44.39%, 55.23%
- 1024-bit GSC feature
[Srihari et al, SPIE 2004]
Template-Free Word Spotting
- Matching Gabor features of two word images; posterior probability estimate from an SVM OCR
- Corpus: 12 medical forms with 5,295 character images; 101 samples of 6 keywords
- MAP performance: 67.1%, compared to 12.6% by DTW
V_w = [V_1^T V_2^T V_3^T V_4^T]^T
Probabilistic similarity: C_P(w, V_w) = −(1/n) ∑_{i=1}^{n} ln P(c_i | v_i)
[Cao & Govindaraju, ICAPR 2007]
Probabilistic Indexing
- σ_k: word-gap probability between components c_k and c_{k+1} (σ_0 precedes c_1)
- Word–query similarity for a candidate word w = c_i ... c_j:
  sim(w, q) = σ_{i−1} × (1 − σ_i) × ... × (1 − σ_{j−1}) × σ_j × Pr(q | w)
[Figure: word w with components c_1 c_2 c_3 c_4 and gap probabilities σ_0 σ_1 σ_2 σ_3 σ_4]
[Cao, Bharadwaj & Govindaraju, PR 2009]
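The word–query similarity can be sketched as follows. This is an assumed reading of the slide's notation (gap probabilities σ_0..σ_j, word gaps at the candidate's boundaries, non-word gaps inside); the σ values and recognition probability are invented:

```python
def word_query_sim(sigma, i, j, p_q_given_w):
    """sigma[k]: word-gap probability of gap k (sigma[0] precedes component c_1).
    Candidate word spans components c_i..c_j (1-indexed)."""
    s = sigma[i - 1] * sigma[j]      # boundary gaps should be word gaps
    for k in range(i, j):            # internal gaps should not be word gaps
        s *= 1.0 - sigma[k]
    return s * p_q_given_w

# Word spanning c_1..c_4 with gap probabilities sigma_0..sigma_4 (invented):
sigma = [0.9, 0.1, 0.2, 0.1, 0.8]
print(word_query_sim(sigma, 1, 4, p_q_given_w=0.7))
```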
Summary
- Handwriting recognition remains a challenging task despite success in postal applications
- Improved search technologies are needed to access handwritten documents on the web
- Statistical topic models can help in document categorization and lexicon reduction
- Document indexing can be performed by MMSE modeling that integrates segmentation, language models, and recognition
- Word spotting can be performed by indexing image-level features or OCR results