Autonomous Cleaning of Corrupted Scanned Documents: A Generative Modeling Approach
Zhenwen Dai, Jörg Lücke
Frankfurt Institute for Advanced Studies, Dept. of Physics, Goethe-University Frankfurt
Slide 2
A document cleaning problem
Slide 3
What method can save us?
Optical Character Recognition (OCR)
Slide 4
OCR Software
OCR pipeline: input, character segmentation, character classification.
[Figure: corrupted input vs. OCR result (FineReader 11)]
Slide 5
What method can save us?
Optical Character Recognition (OCR)
Automatic Image Inpainting
Slide 6
Slide 7
The defects cannot be identified because corruption and characters consist of the same features.
A solution requires knowledge of explicit character representations.
Slide 8
What else?
Optical Character Recognition (OCR)
Automatic Image Inpainting
Image Denoising?
The problem requires a new solution!
Slide 9
Our Approach
The training data is only the page of the corrupted document.
No label information is used.
A limited alphabet (currently).
[Figure: input vs. our approach]
Slide 10
How does it work without supervision?
Characters are salient self-repeating patterns.
Corruptions are more irregular.
Related to Sparse Coding.
[Figure: input vs. our approach]
Slide 11
The Flow of Our Approach
1. Cut the document into image patches.
2. Learning: a character model on image patches.
3. Character detection & recognition.
[Figure: example characters "b", "a", "y", "s", "e"]
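The first step of the flow, cutting the page into image patches, can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `cut_into_patches`, the square patch shape, and the `stride` parameter (which makes patches overlap) are assumptions made for illustration.

```python
import numpy as np

def cut_into_patches(image, patch_size, stride):
    """Cut an image (H, W) or (H, W, channels) into square patches.

    patch_size and stride are illustrative assumptions; a stride smaller
    than patch_size yields overlapping patches.
    Returns the stacked patches and the top-left (row, col) of each patch.
    """
    height, width = image.shape[:2]
    patches, corners = [], []
    for row in range(0, height - patch_size + 1, stride):
        for col in range(0, width - patch_size + 1, stride):
            patches.append(image[row:row + patch_size, col:col + patch_size])
            corners.append((row, col))
    return np.stack(patches), corners
```

Keeping the corner coordinates alongside the patches makes it possible to map a detection inside a patch back to a position on the page.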
Slide 12
A Probabilistic Generative Model
The model describes a character generation process.
A character representation (parameters): mask parameters and feature vectors (RGB color).
Slide 13
A Tour of Generation
1. Select a character (example prior probability: 0.2).
2. Translate it to the position (example: translation by [12,10]^T).
3. Generate a background (pixel-wise background distribution).
4. Overlap the character with the background according to the mask.
[Figure labels: learning, masks, features]
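The four generation steps can be written down as a small sampling routine. This is a sketch under stated assumptions, not the model from the talk: Gaussian noise for character features and background, a Bernoulli draw for each mask pixel, cyclic translation via `np.roll`, and all names (`generate_patch`, `sigma`, `bg_mean`, `bg_std`) are chosen here for illustration.

```python
import numpy as np

def generate_patch(priors, masks, features, sigma, bg_mean, bg_std, rng):
    """Sample one image patch following the four generation steps.

    priors:   (n_chars,) prior probability of each character
    masks:    (n_chars, H, W) per-pixel mask probabilities
    features: (n_chars, H, W, 3) mean RGB feature vectors
    sigma, bg_mean, bg_std: noise and background parameters (assumed Gaussian)
    """
    n_chars, height, width = masks.shape
    # 1. Select a character according to the prior probabilities.
    c = rng.choice(n_chars, p=priors)
    # 2. Translate it to a random position (cyclic shift is an assumption).
    shift = (int(rng.integers(height)), int(rng.integers(width)))
    mask_prob = np.roll(masks[c], shift, axis=(0, 1))
    char_mean = np.roll(features[c], shift, axis=(0, 1))
    # 3. Generate a background from a pixel-wise distribution.
    background = rng.normal(bg_mean, bg_std)
    # 4. Overlap character and background according to a sampled binary mask.
    mask = rng.random((height, width)) < mask_prob
    character = rng.normal(char_mean, sigma)
    patch = np.where(mask[..., None], character, background)
    return patch, c, shift, mask
```

Learning then runs in the opposite direction: given many patches, it estimates the masks and features that most plausibly generated them.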
Slide 14
Maximum Likelihood
Iterative parameter update rules from EM.
[Equations: update rules for the prior probabilities, the std, and the remaining parameter set, each involving the posterior; iterations t_0, t_1, t_2, ..., t_n]
A posterior distribution is needed for every image patch in the update rules.
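The update rules themselves are not recoverable from this transcript. As an illustration of why a posterior is needed for every patch, the sketch below shows only the standard mixture-model update for prior probabilities, which averages the per-patch posteriors. The function name `update_priors` and the array layout are assumptions, and the rules on the slide may differ.

```python
import numpy as np

def update_priors(posteriors):
    """Standard EM update for mixture priors: average the per-patch posteriors.

    posteriors: array-like of shape (n_patches, n_chars); each row is the
    posterior over characters for one patch (already summed over positions,
    which is an assumption of this sketch) and sums to 1.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    return posteriors.mean(axis=0)
```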
Slide 15
Posterior Computation Problem
A posterior distribution is needed for every image patch in the update rules.
Inference: Which character (A, B, C, D, E)? Where?
Similar to template matching.
A pre-selection approximation restricts the hidden space (truncated variational EM).
(Lücke & Eggert, JMLR 2010) (Yuille & Kersten, TiCS 2006)
Slide 16
An Intuitive Illustration of Pre-selection
Select some local features according to the parameters.
Very few features yield a number of good guesses.
[Figure: features in image patches narrow the candidate characters A, B, C, D, E down to a few, such as B and D]
(Lücke & Eggert, JMLR 2010) (Yuille & Kersten, TiCS 2006)
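The pre-selection idea can be sketched as a two-stage computation: rank all hypotheses with a cheap score, then evaluate the expensive model only on the best few and normalize over them. This is a minimal sketch and not the method of the cited papers; `truncated_posterior`, `cheap_scores`, `log_joint_fn`, and `n_keep` are hypothetical names.

```python
import numpy as np

def truncated_posterior(cheap_scores, log_joint_fn, n_keep):
    """Approximate a posterior by normalizing over pre-selected hypotheses only.

    cheap_scores: 1-D array-like, one inexpensive score per hypothesis
                  (e.g. a character/position pair); higher is better.
    log_joint_fn: callable mapping a hypothesis index to log p(patch, hypothesis).
    n_keep:       number of hypotheses kept after pre-selection.
    """
    cheap_scores = np.asarray(cheap_scores, dtype=float)
    # Pre-selection: keep the n_keep best-scoring hypotheses.
    kept = np.argsort(cheap_scores)[::-1][:n_keep]
    # Evaluate the expensive log joint only on the kept hypotheses.
    log_joint = np.array([log_joint_fn(int(h)) for h in kept])
    # Normalize stably; every hypothesis outside the kept set gets zero mass.
    weights = np.exp(log_joint - log_joint.max())
    posterior = np.zeros(cheap_scores.shape[0])
    posterior[kept] = weights / weights.sum()
    return posterior
```

The cost of the expensive evaluation then scales with `n_keep` rather than with the full number of character and position combinations.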
Slide 17
Learn the Character Representations
Input: image patches (Gabor wavelets).
A learning course (about 25 mins).
[Figure: learned mask, feature, and std parameters (heat maps) for characters 1 to 6]
Slide 19
Document Cleaning
How to recognize characters against noise?
Character segmentation fails.
Our model assumes one character per patch.
It is a non-trivial task: try to exploit the model as much as possible.
Slide 20
Document Cleaning Procedure
Inference of every patch with the learned model:
1. Paint a clean character at the detected position.
2. Erase the character from the original document.
[Figure: original and reconstructed patch; fully visible = 1, accept. Result: clean characters reconstructed from the corrupted document]
Slide 21
Document Cleaning Procedure
Inference of every patch with the learned model.
Iterate until no more reconstruction occurs (about 1 min per iteration).
More than one character per patch.
[Figure: iterations 1 and 2, original vs. reconstructed; reconstructions with fully visible = 1 are accepted, those with fully visible = 0 are rejected]
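The iterative procedure can be sketched as a loop that paints accepted characters onto a clean canvas and erases them from the working copy until nothing more is accepted. This is a sketch under assumptions: `detect_fn` is a hypothetical stand-in for inference with the learned model, and the accept flag stands in for the "fully visible" decision.

```python
import numpy as np

def clean_document(document, detect_fn, background_value, max_iterations=10):
    """Iteratively move accepted characters from `document` onto a clean canvas.

    detect_fn(image) is a hypothetical stand-in for model inference. It returns
    a list of (row, col, clean_char, accepted) tuples, where clean_char is a
    2-D array and accepted is a bool.
    """
    remaining = document.copy()
    cleaned = np.full_like(document, background_value)
    for _ in range(max_iterations):
        accepted = [d for d in detect_fn(remaining) if d[3]]
        if not accepted:
            break  # no more reconstruction: stop iterating
        for row, col, clean_char, _ in accepted:
            h, w = clean_char.shape
            # 1. Paint a clean character at the detected position.
            cleaned[row:row + h, col:col + w] = clean_char
            # 2. Erase the character from the working copy of the original.
            remaining[row:row + h, col:col + w] = background_value
    return cleaned, remaining
```

Erasing accepted characters before the next pass is what lets later iterations pick up characters that shared a patch with an already detected one.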
Slide 22
Before Cleaning
Slide 23
After Iteration 1
Slide 24
After Iteration 2
Slide 25
After Iteration 3
Slide 26
More Experiments
More characters (9 chars).
Unusual character set (Klingon).
Irregular placement (randomly placed, rotated).
Occluded by spilled ink.
[Figure: original and reconstructed documents for each case: 9 chars; Klingon; rotated, randomly placed; occluded]
Slide 27
Recognition Rates
Slide 28
False Positives
Slide 29
Not only a Character Model
Detect and count cells in microscopic image data
(in collaboration with Thilo Figge and Carl Svensson).
Slide 30
Summary
Addressed the corrupted document cleaning problem.
Followed a probabilistic generative approach.
Autonomous cleaning of a document is possible.
Demonstrated efficiency and robustness.
The dataset will be available online soon.
Future directions:
Extend to a large alphabet by incorporating prior knowledge of documents.
Extend to various different applications.
Slide 35
Document Cleaning Procedure
Character vs. noise?
MAP inference can only choose among learned characters.
3. Define a novel quality measure. Threshold: 0.5.
[Figure: for patches "y" and "a", the MAP estimate, mask parameters, mask posterior, and difference]