Gianmaria Silvello
Information Management Research Group (IMS), Department of Information Engineering
University of Padua, Italy

Boosted Decision Trees for Word Recognition in Handwritten Document Retrieval
Howe, N.R., Rath, T.M. and Manmatha, R.
Department of Computer Science, University of Massachusetts
SIGIR 2005, published by ACM, New York

Applied Functional Analysis, 5 February 2009, Padova, Italy
Outline

• Introduction to recognition and retrieval of handwritten documents
• Classification algorithms: AdaBoost and decision trees
• Classification experiments
• Language models for retrieval
• Conclusions
Introduction

• Recognition and retrieval of off-line handwritten documents based upon word classification
• Decision trees with normalized pixels as features form the basis for AdaBoost
• Problem: skewed distribution of class frequencies
• Experiments are done on the GW20 and GW100 corpora
• Retrieval is done using a language model over the recognized words
• The main goal is to offer access to the world's historical handwritten documents
• Handwriting recognition often works on limited vocabularies (e.g., postal addresses)
• Historical documents add complexity due to ink bleeding and dirt on the paper
• Pixels of the normalized word image at multiple scales (image pyramids) are used as features
• An innovative procedure is proposed to create additional training data
The Boosting Approach

• Boosting is a classification technique that determines its prediction via the weighted vote of a diverse set of base classifiers, each of which has been trained on a different weighting of the training data
• AdaBoost trains successive versions of its base classifier, focusing on hard-to-classify examples
• It can use a simple base classifier, but stronger classifiers yield better results
AdaBoost in brief

• Introduced in 1995 by Freund and Schapire in "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, 1997.
• The algorithm (binary case; reconstructed below as a code sketch):
  • all weights are initially set equally
  • at each round t, find a weak hypothesis ht appropriate for the distribution Dt
  • the error εt measures the goodness of the hypothesis
  • AdaBoost chooses the parameter αt, which measures the importance assigned to ht; αt ≥ 0 whenever εt ≤ 1/2
  • Dt is updated → the weights of misclassified examples are increased → the algorithm concentrates on hard examples
  • the final hypothesis H is a weighted majority vote over the T weak hypotheses, where αt is the weight assigned to ht

Reference: Freund, Y. and Schapire, R. E. "A Short Introduction to Boosting", Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, 1999.
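The algorithm box shown on this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of the binary AdaBoost loop described above, assuming a depth-limited scikit-learn decision tree as the weak learner (names and hyperparameters are illustrative, not taken from the paper):

```python
# Minimal binary AdaBoost sketch (labels in {-1, +1}); illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=200):
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                      # all weights set equally
    hypotheses, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=3)  # weak learner (assumption)
        h.fit(X, y, sample_weight=D)             # h_t fit to distribution D_t
        pred = h.predict(X)
        eps = D[pred != y].sum()                 # weighted error eps_t
        if eps >= 0.5:                           # must beat chance, else stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t >= 0 iff eps_t <= 1/2
        D *= np.exp(-alpha * y * pred)           # raise weights of mistakes
        D /= D.sum()                             # normalize to obtain D_{t+1}
        hypotheses.append(h)
        alphas.append(alpha)
    return hypotheses, alphas

def adaboost_predict(hypotheses, alphas, X):
    # H: weighted majority vote of the weak hypotheses, weighted by alpha_t
    return np.sign(sum(a * h.predict(X) for h, a in zip(hypotheses, alphas)))
```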
AdaBoost in brief: extensions

• Schapire, R. E. and Singer, Y. "Improved boosting algorithms using confidence-rated predictions", Machine Learning, 37(3):297-336, 1999 shows how AdaBoost can handle weak hypotheses that output real values.
• For an input x, ht outputs a value ht(x) ∈ R whose sign is the predicted label (-1 or +1) and whose magnitude |ht(x)| gives the measure of confidence in the prediction.
• AdaBoost.M1 is the extension to the multi-class case → it is adequate only when the weak learner is strong enough to achieve an accuracy of at least 50%.
• Further extensions, AdaBoost.MH and AdaBoost.MR → reduce the multi-class problem to a larger binary one.
Choices and Problems

• The recognition process uses values sampled directly from the word image at varying resolutions
• The choice is to segment word images rather than letters:
  • recognizing letters becomes a limiting step
  • segmentation of individual word images is easier (an image classification problem)
• Skewed distribution of class frequencies (Zipfian distribution) and paucity of training data for most word classes
Classification Algorithm

• Handwritten words belonging to a single class have similar (but not identical) ink distributions
• The position of individual features within the word shifts from example to example
• The pixel representation contains information about word identity that can be amplified by boosting:
  • clearer areas contain more reliable features
  • blurring indicates areas of inconsistency

[Figure: composite image of 21 examples of the word "Instructions". Straightforward use of the pixels is ineffective.]
Common framework

• Pixels are used as features for word image classification
• Each word image is mapped onto a common pixel grid
• Images are scaled and translated so that the baseline spans (0,0) to (1,0) → resampling each image onto this common grid produces a common pixel representation (a toy coordinate sketch follows)
• Long and short words differ greatly in horizontal and vertical dimensions → storing full-resolution grids for all of them would lead to astronomical data sizes
• Solution: a pyramid approach
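As a toy illustration of the scale-and-translate step, a coordinate sketch assuming the baseline endpoints of a segmented word are known and the baseline is horizontal (uniform scaling, no rotation; all names are hypothetical):

```python
# Map image coordinates so the word's baseline spans (0,0) -> (1,0).
import numpy as np

def to_common_frame(points, baseline_start, baseline_end):
    start = np.asarray(baseline_start, dtype=float)
    end = np.asarray(baseline_end, dtype=float)
    scale = np.linalg.norm(end - start)   # baseline length in pixels
    # translate the origin to the baseline start, then scale uniformly
    return (np.asarray(points, dtype=float) - start) / scale
```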
Pyramid Approach

• Define a family of standard grids → the base grid Φ0 covers the square [0,1] × [-0.5, 0.5], broken into 32×32 px; this square area captures all the detail of interest for most words
• Refined grids cover the same square region at double the resolution (64×64 px, 128×128 px, ...)
• The grids form a tree in which each cell of Φk has 4 children in Φk+1 (see the sketch after this list)
✓ The standardized image usually doesn't cover the full vertical extent of the grid → portions above and below the edges of the standardized image may be represented using a single default value
✓ Data need only be stored for Φk with resolution up to that of the reference image
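A small sketch of how such a pyramid might be materialized, assuming the word image has already been normalized to the square frame and is available as a PIL image (resampling choices and names are assumptions, not the paper's code):

```python
# Build [Phi_0, Phi_1, ...] as 32x32, 64x64, ... pixel grids over the same square.
import numpy as np
from PIL import Image

def build_pyramid(word_image, levels=3):
    pyramid = []
    for k in range(levels):
        size = 32 * (2 ** k)                   # Phi_k resolution doubles per level
        img = word_image.resize((size, size), Image.BILINEAR)
        pyramid.append(np.asarray(img, dtype=np.float32) / 255.0)
    return pyramid

def children(row, col):
    # the 4 children in Phi_(k+1) of cell (row, col) of Phi_k
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]
```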
Boosting and Decision Trees

• Word image recognition has many potential classes → to use AdaBoost, a base classifier with at least 50% accuracy is needed
• Decision trees are the foremost option:
  • well understood
  • achieve arbitrary accuracy on the training data in practice
• At each node, the training examples are split into 2 subgroups by comparing the value of a chosen pixel in each example to a chosen threshold
• Growth of a tree branch is stopped when the contained subset is dominated by a majority class
• If growth continues until each leaf holds a single training example, 100% training accuracy is reached → the tree overfits and must be pruned by removing statistically weak branches
C4.5

• C4.5 provides the algorithm for building the decision tree, with some modifications designed to support the grid pyramid data structure
• C4.5 builds decision trees from a set of training data using the concept of information entropy
• It exploits the fact that each attribute of the data can be used to make a decision that splits the data into smaller subsets

The training data is a set S = {s1, s2, ..., sn} of already classified samples, where each sample si = (x1, x2, ..., xm) is a vector of features xj. The training data is augmented with a vector C = (c1, c2, ..., cn), where ci is the class that sample si belongs to (the information-gain computation is sketched below).

Reference: Quinlan, J. R. "C4.5: Programs for Machine Learning". Morgan Kaufmann, 1993.
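A small sketch of the entropy-based criterion, for a single pixel feature and threshold (this shows information gain only, not the full C4.5 algorithm):

```python
# Information gain of splitting on pixel_value <= threshold.
import numpy as np

def entropy(labels):
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(pixel_values, labels, threshold):
    left = labels[pixel_values <= threshold]
    right = labels[pixel_values > threshold]
    n = len(labels)
    children_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children_entropy
```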
C4.5 for the image pyramid

• At each node, a feature (i.e., a pixel location) and a threshold value must be chosen as the split criterion → an exhaustive search is not possible
• Only Φ0 is exhaustively examined → the location and threshold offering the greatest information gain are retained
• The search then proceeds selectively to the children of the best location in Φ1, from there to the children of the best of those locations, and so on until the maximum available resolution is reached
• The grid level, location, and threshold with the highest information gain become the decision criterion for the node (see the sketch below)
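A sketch of the selective coarse-to-fine descent, reusing information_gain and children from the previous sketches; the candidate thresholds and the greedy descent rule are assumptions:

```python
# samples: list of pyramids (one per training example); labels: their classes.
import numpy as np

def best_split(samples, labels, levels, thresholds=(0.25, 0.5, 0.75)):
    labels = np.asarray(labels)

    def evaluate(k, r, c):
        values = np.array([s[k][r, c] for s in samples])  # pixel (r, c) of Phi_k
        return max((information_gain(values, labels, t), t, k, r, c)
                   for t in thresholds)

    # exhaustively examine the 32x32 base grid Phi_0
    best = max(evaluate(0, r, c) for r in range(32) for c in range(32))
    current = best
    for k in range(1, levels):
        _, _, _, r, c = current                # best cell found at level k-1
        current = max(evaluate(k, cr, cc) for cr, cc in children(r, c))
        best = max(best, current)              # keep the overall winner
    return best  # (gain, threshold, level, row, col) = the node's criterion
```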
Boosting

• Single trees do not generalize well for handwritten word images
(1) The base classifier is initially generated from the training data
(2) AdaBoost raises the weights of misclassified examples → forcing the base classifier to work harder on them
(3) After many rounds of boosting, the weighted vote classifies the training set perfectly and shows good generalization to unseen examples
(4) In practice, after a certain number of rounds (here: 200) the results do not improve significantly
Supplementary Training Examples

• Problem: the paucity of training examples for many classes makes generalization difficult
• Zipf's law → few examples for most words
  • 57% of the words appear only once in the test collection
• Solution: generate new training examples for low-frequency classes via stochastic distortion of the available examples
  • improves overall word classification accuracy
• Sample from the original image using a grid of points whose positions have been perturbed from a uniform lattice
• Nearby points should be perturbed by similar amounts
• The new image is a distortion of the old one (a sketch follows)
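A minimal sketch of one way to generate such a distorted copy: random displacements are smoothed with a Gaussian filter so that nearby lattice points move by similar amounts (parameters and names are illustrative assumptions, not the paper's procedure):

```python
# Resample a grayscale word image at a smoothly perturbed grid of positions.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def distort(image, strength=2.0, smoothness=8.0, rng=None):
    rng = rng or np.random.default_rng()
    rows, cols = np.indices(image.shape)
    # smooth random displacement fields: nearby points shift together
    dr = gaussian_filter(rng.standard_normal(image.shape), smoothness) * strength
    dc = gaussian_filter(rng.standard_normal(image.shape), smoothness) * strength
    # sample the original image at the perturbed positions
    return map_coordinates(image, [rows + dr, cols + dc], order=1, mode='nearest')
```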
Classification Experiments

• Test collections: GW20 (previously used) and GW100 (non-overlapping with GW20)
  • written by multiple hands
  • manually segmented to extract images of individual words (4,856 in GW20 and 21,324 in GW100)
  • all images labeled with their ASCII equivalent

GW20 experiments (19 pages for training, 1 for testing) compare three configurations:
① Single decision tree → standard C4.5 grown to completion, then pruned
② AdaBoost + decision tree as base learner
③ AdaBoost + decision tree + synthetic data

No experiments were run with AdaBoost and a simple base classifier, because 50% accuracy cannot be achieved.

GW100: lower performance, due to more out-of-vocabulary (OOV) words and lower image quality.
Retrieval

• Language modeling approach to retrieval
  • Reference: Ponte, J. and Croft, W.B. "A language modeling approach to Information Retrieval", SIGIR 1998, 275-281
• Uses the query-likelihood formulation, where documents are ranked according to P(Q|D) (a toy scoring sketch follows)
• AdaBoost provides classifications rather than probabilities → only the most likely label for each word image is preserved
• One approach would be to set the term probabilities equal to their frequencies in each recognized document → but many words may be misclassified
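A toy sketch of query-likelihood scoring over the recognized words, with simple linear smoothing against the collection model (the smoothing choice is an assumption; this is not the Lemur implementation):

```python
# Rank documents by log P(Q|D) with Jelinek-Mercer smoothing.
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.8):
    doc, coll = Counter(doc_terms), Counter(collection_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc[term] / max(len(doc_terms), 1)     # document model
        p_coll = coll[term] / max(len(collection_terms), 1)  # background model
        score += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
    return score  # higher = better match; rank documents by descending score
```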
Retrieval: Regularization Scheme

• A regularization scheme based upon classification rank information
• Hypothesis: rank information may be more important than the actual probabilities → the top-ranked term is very important, the next few moderately important, and so on
• Probabilities are inferred from the rank-ordered output of the AdaBoost classification → the top n classes are ranked according to their scores
• A probability is associated with each class by fitting a Zipfian distribution to the class ranks (a sketch follows)
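A sketch of one way to realize this: assign the top-n ranked classes probabilities proportional to 1/rank, normalized over the retained classes (the exponent and cutoff are assumptions, not the paper's fit):

```python
# Zipf-like probabilities over the rank-ordered class labels for one word image.
def zipf_probabilities(ranked_classes, s=1.0):
    weights = [1.0 / (r ** s) for r in range(1, len(ranked_classes) + 1)]
    z = sum(weights)
    return {c: w / z for c, w in zip(ranked_classes, weights)}

# e.g. zipf_probabilities(["the", "he", "she"])
#   -> {"the": 0.545, "he": 0.273, "she": 0.182}
```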
Retrieval: Regularization Scheme (cont.)

• Instead of one possible word at each position, a document now contains a probability distribution at each position
• Tested on Lemur with the query-likelihood ranking method
• Because of the limited size of GW20, line retrieval is performed:
  • relevant = a line containing all query terms
  • stop words removed
• GW100 allows for full-page retrieval, with GW20 as training examples
Conclusions

• Learning algorithms are typically not designed to deal with training data that exhibits a highly skewed distribution of class frequencies
• The described methodology does not always work well, because the synthetic training data are not truly independent of the originals
• Performance is good on GW20
• GW100 remains challenging → larger data set, more noise ⇒ using soft classification decisions can improve the results for shorter queries