
An HMM Based Recognition Scheme for Handwritten Oriya Numerals

Tapan K Bhowmik, IBM India Pvt Ltd, Millennium City, Salt Lake, Kolkata, India ([email protected])

Swapan K Parui, CVPR Unit, Indian Statistical Institute, Kolkata, India ([email protected])

Ujjwal Bhattacharya, CVPR Unit, Indian Statistical Institute, Kolkata, India ([email protected])

Bikash Shaw, CVPR Unit, Indian Statistical Institute, Kolkata, India ([email protected])

Abstract

A novel hidden Markov model (HMM) for recognition of handwritten Oriya numerals is proposed. The novelty lies in the fact that the HMM states are not determined a priori, but are determined automatically based on a database of handwritten numeral images. A handwritten numeral is assumed to be a string of several shape primitives. These are in fact the states of the proposed HMM and are found using certain mixture distributions. One HMM is constructed for each numeral. To classify an unknown numeral image, its class conditional probability for each HMM is computed. The classification scheme has been tested on a large handwritten Oriya numeral database developed recently. The classification accuracy is 95.89% and 90.50% for training and test sets respectively.

1. Introduction

The development of a handwritten character recognition system for any script is always a challenging problem, mainly because of the enormous variability in handwriting styles. Although there have been significant developments for Roman script, not much work has been reported on scripts of Indian origin. Moreover, several peculiar features of Indian scripts [1] make recognition of handwritten characters in these scripts more difficult.

India is a multilingual country with multiple scripts in use. However, relatively little research has been reported on handwriting recognition for all of these scripts. Comparatively more attempts are found towards recognition of handwritten characters of Devanagari and Bangla. The earliest available work on recognition of hand printed Devanagari characters is found in [2]. For recognition of handwritten Devanagari numerals, Ramakrishnan et al. [3] used an independent component analysis technique for feature extraction, while Bajaj et al. [4] considered a strategy combining the decisions of multiple classifiers. In an attempt to develop a bilingual handwritten numeral recognition system, Lehal and Bhatt [5] used a set of global and local features derived from the right and left projection profiles of the numeral images for recognition of handwritten numerals of the Devanagari and Roman scripts. A few works on handwritten Bangla numerals/characters include [6-10].

In the present study, we consider automatic recognition of handwritten Oriya numerals. Oriya, an Indo-Aryan language, is spoken by about 31 million people, mainly in the Indian state of Orissa, and also in a few other Indian states such as West Bengal, Jharkhand and Gujarat. The Oriya language is closely related to Bangla and Assamese. Its script, which developed from the Kalinga script, is one of the many descendants of the Brahmi script of ancient India. The earliest known inscription of the Oriya language, in the Kalinga script, dates back to 1051. Like other Indian scripts, Oriya has 10 numerals.

The earliest available work on recognition of hand printed Oriya characters is found in [11]. However, in this study a small set of samples was considered. Recently, Roy et al. [12] considered chain code histogram features computed from the contour of the numerals divided into several blocks for recognition of handwritten Oriya numerals. However, to the best of our knowledge, no HMM based recognition scheme has so far been proposed for handwritten Oriya numerals. On the other hand, it is an established fact that the development of a handwriting recognition system requires a large set of training samples. Generation of such a data set is always difficult since it is time consuming and labor intensive [13]. Such standard training data sets for any Indian script did not exist till recently. However, a few large handwritten databases have recently been developed and these include an image database of handwritten Oriya numerals [14].


In the present report, training and test results of the proposed approach are presented on the basis of this database.

We have used a hidden Markov model (HMM) in the proposed scheme for recognition of handwritten Oriya numerals. Since there are many uncertainties in handwriting recognition, stochastic modeling is a suitable approach to this problem. An HMM is capable of making use of both the statistical and structural information present in handwritten images. This is why HMMs have been used in several handwritten character recognition tasks in recent years [15]. In such HMMs, the states are usually defined as pre-determined entities. However, a novelty of the present HMM is that a data-driven or adaptive approach is taken to define the states. The shapes of the strokes present in the database of handwritten Oriya numeral images are studied and their statistical distribution is modeled as a mixture distribution. Each mixture component is a state of the HMM. The proposed method is robust in the sense that it is independent of several aspects of the input such as stroke thickness and size.

2. Handwritten Oriya Numeral Database

In the present work, we have used a recently developed database of isolated handwritten Oriya numerals [14]. It consists of 5970 samples collected from real-life documents such as postal mail pieces and job application forms. These documents were scanned at 300 dpi using an HP flatbed scanner and stored as gray-level images with 1 byte per pixel. A few samples from this database are shown in Fig. 1.

Figure 1. A few samples from the database of handwritten Oriya numerals.

Table 1. Distribution of samples in the database

The above database is divided into mutually exclusive training and test sets. The distribution of samples in these sets over the 10 digit classes is given in Table 1. Following the usual practice, the training set consists of most of the handwritten samples.


Figure 2. (a) An input numeral image, (b) image obtained after binarization, (c) image obtained after median filtering of binarized image in (b).

3. Feature Extraction

3.1. Preprocessing

The input numeral image is first smoothed by median filtering and then binarized by Otsu's thresholding method. The binarized image is again median filtered for noise cleaning. No size normalization is done at the image level since it is taken care of at the time of feature extraction. A sample image from the present database and the same after binarization and smoothing are shown in Figs. 2(a), 2(b) and 2(c) respectively.
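A minimal sketch of this preprocessing chain follows, assuming OpenCV; it is not the authors' implementation, and the 3x3 median kernel size is an assumption.

```python
# Assumed preprocessing sketch: median filter -> Otsu binarization -> median filter.
import cv2
import numpy as np

def preprocess(gray_img: np.ndarray) -> np.ndarray:
    """Return a binary image (object pixels = 1) from a gray-level numeral image."""
    smoothed = cv2.medianBlur(gray_img, 3)               # pre-smoothing; 3x3 kernel is an assumption
    _, binary = cv2.threshold(smoothed, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # Otsu's thresholding
    cleaned = cv2.medianBlur(binary, 3)                   # noise cleaning after binarization
    return (cleaned > 0).astype(np.uint8)                 # object pixels as 1, background as 0
```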

3.2. Extraction of strokes

Let A be the binarized image. We now describe the process of extraction of the vertical and horizontal strokes that are present in A. Let E be the binary image consisting of the object pixels in A whose right or east neighbour is in the background.


In other words, the object pixels of A that are visible from the east (Fig. 3(a)) form E. Similarly, S is defined as the binary image consisting of the object pixels in A whose bottom or south neighbour is in the background (Fig. 3(b)).

The connected components in E represent strokes that are vertical while the connected components in S represent strokes that are horizontal. Each horizontal or vertical stroke is a digital curve. Shapes of these strokes are analyzed for extraction of features. Very short curves are ignored.
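A possible implementation of this stroke extraction is sketched below, assuming NumPy and SciPy; the 8-connectivity and the minimum stroke length are assumptions, not values from the paper.

```python
# Assumed sketch of the E/S stroke extraction: E keeps object pixels whose east neighbour
# is background, S keeps those whose south neighbour is background. Connected components
# of E are vertical strokes, those of S are horizontal strokes.
import numpy as np
from scipy import ndimage

def extract_strokes(A: np.ndarray, min_len: int = 5):
    """A is a binary image (1 = object). Returns lists of vertical and horizontal strokes,
    each stroke being an array of (row, col) pixel coordinates."""
    east = np.zeros_like(A); east[:, :-1] = A[:, 1:]       # east neighbour of each pixel
    south = np.zeros_like(A); south[:-1, :] = A[1:, :]     # south neighbour of each pixel
    E = A * (east == 0)                                    # object pixels visible from the east
    S = A * (south == 0)                                   # object pixels visible from the south
    eight = np.ones((3, 3), dtype=int)                     # 8-connectivity (an assumption)
    strokes = {"vertical": [], "horizontal": []}
    for img, key in ((E, "vertical"), (S, "horizontal")):
        labels, n = ndimage.label(img, structure=eight)
        for k in range(1, n + 1):
            pts = np.argwhere(labels == k)
            if len(pts) >= min_len:                        # ignore very short curves (threshold assumed)
                strokes[key].append(pts)
    return strokes
```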


3.3. Extraction of features

From each stroke in E and S, 8 scalar features are extracted. These features indicate the shape, size and position of a digital curve with respect to the numeral image. A curve $C$ in $E$ is traced from bottom upward. Suppose the bottom-most and the top-most pixel positions in $C$ are $P_0$ and $P_5$ respectively. The four points $P_1, \ldots, P_4$ on $C$ are found, using the algorithm in [16], such that the curve distances between $P_{i-1}$ and $P_i$ ($i = 1, \ldots, 5$) are equal. Let $\alpha_i$, $i = 1, \ldots, 5$, be the angles that the lines $P_{i-1}P_i$ make with the x-axis. Since the stroke here is vertical, $45^\circ \le \alpha_i \le 135^\circ$. The $\alpha_i$'s are invariant under scaling and represent only the shape. The position features of $C$ are given by $\bar{X}$ and $\bar{Y}$, the x- and y-coordinates of the centre of gravity of the pixel positions in $C$. $\bar{X}$ is also useful in arranging the strokes present in an image from left to right. Let $L$ be the length of the stroke $C$. The three features $\bar{X}$, $\bar{Y}$ and $L$ are normalized with respect to the image height. Thus, the feature vector becomes $(\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5, \bar{X}, \bar{Y}, L)$.

The features extracted from a horizontal stroke $C$ in $S$ are similar. Here $C$ is traced from west to east. Suppose $Q_0, \ldots, Q_5$ are equidistant points on $C$ such that $Q_0$ is the west-most and $Q_5$ the east-most pixel on $C$. These points are found in the same way as the $P_i$. Let $\beta_i$ be the angles that the lines $Q_{i-1}Q_i$ make with the x-axis. Since the stroke is horizontal, $-45^\circ \le \beta_i \le 45^\circ$. The $\beta_i$, like the $\alpha_i$, are invariant under scaling. The feature vector of a horizontal stroke $C$ is defined as $(\beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \bar{X}, \bar{Y}, L)$, where $\bar{X}$, $\bar{Y}$ and $L$ are defined in the same way as before.
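A sketch of this feature computation is given below, assuming NumPy. The equidistant points are obtained here by simple arc-length resampling rather than the algorithm of [16], and the stroke pixels are assumed to be already ordered along the curve; both are simplifications.

```python
# Assumed sketch of the 8-dimensional stroke feature vector described above.
import numpy as np

def stroke_features(points: np.ndarray, img_height: float) -> np.ndarray:
    """points: (n, 2) array of (row, col) pixels of one stroke, assumed ordered along the
    curve (bottom-to-top for a vertical stroke, west-to-east for a horizontal one)."""
    x = points[:, 1].astype(float)                       # image x (columns)
    y = (img_height - 1) - points[:, 0].astype(float)    # image y measured upward from the bottom
    seg = np.hypot(np.diff(x), np.diff(y))
    arc = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative curve length
    L = arc[-1]
    t = np.linspace(0.0, L, 6)                           # P0 ... P5 at equal curve distances
    px, py = np.interp(t, arc, x), np.interp(t, arc, y)
    angles = np.degrees(np.arctan2(np.diff(py), np.diff(px)))   # angles of P_{i-1}P_i with the x-axis
    X_bar, Y_bar = x.mean(), y.mean()                    # centre of gravity of the stroke
    # position and length features normalized by the image height
    return np.concatenate([angles, [X_bar / img_height, Y_bar / img_height, L / img_height]])
```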

4. Proposed HMM Classifier

An HMM with state space $S = \{s_1, \ldots, s_N\}$ and state sequence $Q = q_1, \ldots, q_T$ is characterized by $\gamma = (\pi, A, B)$, where the initial state distribution is given by $\pi = \{\pi_i\}$, $\pi_i = \mathrm{Prob}(q_1 = s_i)$, the state transition probability distribution by $A = \{a_{ij}(t)\}$, where $a_{ij}(t) = \mathrm{Prob}(q_{t+1} = s_j \mid q_t = s_i)$, and the observation symbol probability distributions by $B = \{b_i\}$, where $b_i(O_t)$ is the distribution for state $i$ and $O_t$ is the observation at instant $t$. The HMM here is non-homogeneous.

Here the problem is: given an observation sequence $\mathbf{O} = O_1, \ldots, O_T$ and a model $\gamma = (\pi, A, B)$, how to efficiently compute $P(\mathbf{O} \mid \gamma)$, the probability of the observation sequence. For a classifier of $m$ classes of patterns, we denote the $m$ different HMMs by $\gamma_j$, $j = 1, 2, \ldots, m$. Let an input pattern $X$ of an unknown class have an observation sequence $\mathbf{O}$. The probability $P(\mathbf{O} \mid \gamma_j)$ is computed for each model $\gamma_j$, and $X$ is assigned to the class $c$ whose model shows the highest probability, that is, $c = \arg\max_{1 \le j \le m} \{P(\mathbf{O} \mid \gamma_j)\}$. For a given $\gamma$, $P(\mathbf{O} \mid \gamma)$ is computed using the well-known forward and backward algorithms. In the former, $P(O_1, \ldots, O_t, q_t = s_i \mid \gamma)$ is computed, and in the latter, $P(O_{t+1}, \ldots, O_T \mid q_t = s_j, \gamma)$.

Note that the observation sequence $\mathbf{O} = O_1, \ldots, O_T$ in our problem is the sequence of feature vectors of the strokes (arranged from left to right) that are present in a handwritten numeral image. $T$ is the number of strokes in the image. The states here are certain feature primitives (or, more specifically, individual 8-dimensional Gaussian distributions in the feature space) that are found below using the EM algorithm.
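The decision rule and a log-domain forward recursion can be sketched as follows, assuming SciPy; the packaging of the model parameters (pi, time-indexed A, per-state Gaussian means and covariances) is our own illustrative choice, and sequences longer than the modelled time horizon are not handled.

```python
# Assumed sketch of class-conditional likelihood computation with the forward algorithm.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_forward(obs, pi, A, means, covs):
    """obs: (T, 8) stroke feature vectors of one image; returns log P(O | gamma)."""
    T, N = len(obs), len(pi)
    # log b_i(O_t): per-state Gaussian emission log-densities
    logb = np.array([[multivariate_normal.logpdf(obs[t], means[i], covs[i])
                      for i in range(N)] for t in range(T)])
    logalpha = np.log(pi + 1e-300) + logb[0]                  # initialization: pi_i * b_i(O_1)
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij(t-1)] * b_j(O_t), computed in log space
        logalpha = logb[t] + logsumexp(logalpha[:, None] + np.log(A[t - 1] + 1e-300), axis=0)
    return logsumexp(logalpha)

def classify(obs, models):
    """models: a list of (pi, A, means, covs) tuples, one HMM per numeral class."""
    return int(np.argmax([log_forward(obs, *m) for m in models]))
```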

Figure 3. A sample image of numeral “5”. Dark and gray pixels indicate (a) S and A images respectively and (b) E and A images respectively.


4.1. HMM parameters

We now consider the feature vector of a stroke, that is, $\theta = (\theta_1, \theta_2, \theta_3, \theta_4, \theta_5, \bar{X}, \bar{Y}, L)$. A feature vector can come either from a vertical or from a horizontal stroke. It is assumed that the features follow a multivariate Gaussian mixture distribution. In other words, $\theta$ has a distribution $f(\theta)$ which is a mixture of $K$ 8-dimensional Gaussian distributions, namely,

$$ f(\theta) = \sum_{k=1}^{K} P_k f_k(\theta), \qquad f_k(\theta) = \frac{\exp\{-\tfrac{1}{2}(\theta - \mu_k)^T \Sigma_k^{-1} (\theta - \mu_k)\}}{(2\pi)^{8/2}\,|\Sigma_k|^{1/2}}, $$

where $P_k$ is the prior probability of the $k$-th component. The unknown parameters of the mixture distribution, namely $P_k, \mu_k, \Sigma_k$ ($k = 1, \ldots, K$), are estimated using the EM (Expectation Maximization) algorithm [17], which maximizes the log likelihood of the observed samples (feature vectors) $\{\theta_i, i = 1, 2, \ldots, n\}$ coming from the distribution $f(\theta)$. These samples are all the strokes present in the training set of images of a handwritten numeral class. The state space of the proposed HMM consists of $K$ states, characterized by the probability density functions $f_k(\theta)$, $k = 1, 2, \ldots, K$. In a sense, the vertical and horizontal strokes in the numeral image database are distributed around $K$ different prototype strokes. We call these stroke primitives, corresponding to the mean shape vectors $\mu_1, \mu_2, \ldots, \mu_K$. These $K$ stroke primitives constitute the state space. Thus, the states here are not determined a priori but are constructed adaptively on the basis of the numeral image database.

To determine the optimum value of $K$, we use the Bayesian information criterion (BIC), defined as $\mathrm{BIC}(K) \equiv -2\,LL + m \log(n)$ for a Gaussian mixture model with $K$ components, where $LL$ is the log likelihood value, $m$ is the number of independent parameters to be estimated, and $n$ is the number of observations. The $\mathrm{BIC}(K)$ values are computed for several values of $K$; the first local minimum indicates the optimum $K$.
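A sketch of this per-class mixture fitting with BIC-based selection of K is given below, using scikit-learn's GaussianMixture in place of the authors' EM implementation; the upper bound k_max is an assumption.

```python
# Assumed sketch: fit Gaussian mixtures for increasing K and stop at the first
# local minimum of BIC(K) = -2*LL + m*log(n).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_states(stroke_feats: np.ndarray, k_max: int = 30):
    """stroke_feats: (n, 8) feature vectors of all strokes of one numeral class.
    Returns the fitted mixture whose BIC attains its first local minimum."""
    prev_bic, prev_gm = np.inf, None
    for K in range(1, k_max + 1):
        gm = GaussianMixture(n_components=K, covariance_type="full",
                             random_state=0).fit(stroke_feats)
        bic = gm.bic(stroke_feats)
        if bic > prev_bic:                 # first local minimum reached at K-1
            return prev_gm
        prev_bic, prev_gm = bic, gm
    return prev_gm
```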

4.2. Estimation of HMM parameters

In our implementation, $N = K$ and the observation symbol probability distribution $b_i(O_t)$ is, in fact, the Gaussian distribution $f_i(\theta) = N(\mu_i, \Sigma_i)$. Thus

$$ b_i(O_t) = \frac{\exp\{-\tfrac{1}{2}(O_t - \mu_i)^T \Sigma_i^{-1} (O_t - \mu_i)\}}{(2\pi)^{8/2}\,|\Sigma_i|^{1/2}}. $$

The parameters produced by the EM algorithm are $P_1, \ldots, P_N$, $\mu_1, \ldots, \mu_N$, $\Sigma_1, \ldots, \Sigma_N$. Let the strokes in a character image be arranged from left to right on the basis of $\bar{X}$ to generate the observation sequence $\mathbf{O} = O_1, O_2, \ldots, O_T$. For each $O_t$, compute

$$ h_i(O_t) = \frac{P_i\, b_i(O_t)}{\sum_{j=1}^{N} P_j\, b_j(O_t)}, $$

and assign $O_t$ to the state $k$ where $k = \arg\max_{1 \le i \le N} \{h_i(O_t)\}$. This assignment to respective states is done for all the observation sequences of the class in the training set (one sequence per training image). From these state sequences, the estimates of the initial probabilities are computed as ($1 \le i \le N$)

$$ \pi_i = \frac{\text{number of occurrences of } \{q_1 = s_i\}}{\text{total number of occurrences of } \{q_1\}}. $$

The transition probability estimates $a_{ij}(t)$ are computed as ($1 \le i \le N$, $1 \le j \le N$, $1 \le t \le T-1$)

$$ a_{ij}(t) = \frac{\text{number of occurrences of } \{q_t = s_i \text{ and } q_{t+1} = s_j\}}{\text{number of occurrences of } \{q_t = s_i\}}. $$

The above HMM parameter estimates are fine-tuned by re-estimation using the Baum-Welch forward-backward algorithm.
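These counting estimates can be sketched as follows, assuming NumPy and that each training observation has already been hard-assigned to its most likely state; the handling of sequences of different lengths is simplified.

```python
# Assumed sketch of the relative-frequency estimates of pi and the time-dependent a_ij(t).
import numpy as np

def estimate_pi_A(state_seqs, N, T_max):
    """state_seqs: list of state-index sequences (one per training image of the class).
    Returns pi (N,) and A (T_max-1, N, N) estimated by relative frequencies."""
    pi = np.zeros(N)
    trans = np.zeros((T_max - 1, N, N))
    for seq in state_seqs:
        pi[seq[0]] += 1                          # count occurrences of q_1 = s_i
        for t in range(len(seq) - 1):
            trans[t, seq[t], seq[t + 1]] += 1    # count q_t = s_i and q_{t+1} = s_j
    pi /= max(pi.sum(), 1)
    row_sums = trans.sum(axis=2, keepdims=True)  # occurrences of q_t = s_i
    A = np.divide(trans, row_sums, out=np.zeros_like(trans), where=row_sums > 0)
    return pi, A
```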

5. Experimental Results

The proposed scheme has been tested on the recently developed database [14] of handwritten Oriya numeral images. The results of our study are reported below. To the best of our knowledge, there does not exist any other standard database of handwritten Oriya numerals; the few past studies on recognition of handwritten Indian scripts are based on samples collected in laboratory environments. The training and test sets here consist of 4970 and 1000 images respectively of handwritten numerals of all 10 classes. From these numeral images, 15231 horizontal and 11014 vertical strokes have been extracted from the training set, whereas 3078 horizontal and 2204 vertical strokes have been extracted from the test set. The distributions of horizontal and vertical strokes in the training set are shown in Table 2.


Table 2: Distribution of Horizontal & Vertical Strokes in the Training Set

The parameters of an HMM for each of the 10 numeral classes are determined using the method described in Section 4.

For example, for numeral class "5", the value of $K$ is found to be 14. The curves corresponding to the 14 mean vectors $\mu_k$ are shown in Fig. 4. These represent the 14 HMM states for "5". For the image shown in Fig. 3, the strokes are shown in Fig. 5. The strokes, arranged in terms of $\bar{X}$ from left to right, are e1, e2, e3, e4, e5, e6, e7, e8, e9, e10. The most likely states of these 10 strokes individually are s11, s4, s12, s13, s4, s8, s2, s13, s6 and s9 respectively.


The probability $P(\mathbf{O} \mid \gamma_j)$ is computed for $j = 1, \ldots, 10$ and the image is classified as class $c$ where $c = \arg\max_{1 \le j \le 10} \{P(\mathbf{O} \mid \gamma_j)\}$. We have achieved a 90.50% correct recognition rate on the test set and 95.89% on the training set. Table 3 gives the confusion matrix of the proposed recognizer for the training set while Table 4 provides the same for the test set.
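For illustration, the hypothetical sketches introduced in the earlier sections could be chained as follows; the function names (preprocess, extract_strokes, stroke_features, classify) come from those sketches, not from the paper, and the ordering of stroke pixels along each curve is glossed over.

```python
# Assumed end-to-end usage of the earlier sketches.
import numpy as np

def recognize(gray_img, models):
    A = preprocess(gray_img)                                   # binary numeral image
    strokes = extract_strokes(A)                               # vertical and horizontal strokes
    feats = [stroke_features(pts, A.shape[0])
             for pts in strokes["vertical"] + strokes["horizontal"]]
    obs = np.array(sorted(feats, key=lambda f: f[5]))          # order strokes by X-bar (index 5)
    return classify(obs, models)                               # argmax over the 10 class HMMs
```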

Table 3: Confusion matrix for the training set of handwritten Oriya numeral database

Figure 5. For the image in Fig. 3: (a) horizontal strokes, (b) vertical strokes, and (c) the strokes e1 to e10 arranged from left to right.

Figure 4. Stroke primitives for vertical and horizontal strokes for numeral “5”.


Table 4: Confusion matrix for the test set of handwritten Oriya numeral database

6. Conclusions

The proposed HMM based approach to recognition of handwritten Oriya numerals improves upon the recognition performance reported in [12], where results were presented on the basis of the training set only. In the present article, we have reported results on both the training and test sets. In future, we shall consider combinations of classifiers to further improve the recognition accuracy.

References

[1] V. Bansal and R. M. K. Sinha. Designing a front end OCR system for Indian scripts for Machine Translation - A case study for Devanagari. Symp. Machine Aids for Trans. & Comm. (SMATAC-96), New Delhi, India, 1996.

[2] I. K. Sethi and B. Chatterjee. Machine recognition of constrained handprinted Devanagari. Pattern Recognition, 9: 69 - 75, 1977.

[3] K. R. Ramakrishnan, S. H. Srinivasan and S. Bhagavathy. The independent components of characters are ‘Strokes’. Proc. of the 5th ICDAR: 414 - 417, 1999.

[4] R. Bajaj, L. Dey and S. Chaudhuri. Devanagari numeral recognition by combining decision of multiple connectionist classifiers. Sadhana, 27(1): 59 – 72, 2002.

[5] G. S. Lehal and N. Bhatt. A recognition system for Devnagri and English handwritten numerals. Advances in Multimodal Interfaces – ICMI 2001, T. Tan, Y. Shi and W. Gao (Eds.), LNCS-1948: 442-449, 2000.

[6] A. Dutta, S. Chaudhuri. Bengali alpha-numeric character recognition using curvature features, Pattern Recognition, 26: 1757-1770, 1993.

[7] U. Pal and B. B. Chaudhuri. Automatic recognition of unconstrained off-line Bangla hand-written numerals. Advances in Multimodal Interfaces, LNCS-1948, Eds. T. Tan, Y. Shi and W. Gao: 371-378, 2000.

[8] A. F. R. Rahman, R. Rahman and M. C. Fairhurst. Recognition of handwritten Bengali characters: a novel multistage approach. Pattern Recognition, 35: 997-1006, 2002.

[9] U. Bhattacharya, T. K. Das, A. Datta, S. K. Parui and B. B. Chaudhuri. A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers. Int. J. Patt. Recog. & Art. Intell., 16(7) : 845-864, 2002.

[10] U. Bhattacharya, B. B. Chaudhuri. A majority voting scheme for multiresolution recognition of handprinted numerals. Proc. of the 7th ICDAR, Edinburgh, Scotland, I: 16-20, 2003.

[11] S. Mohanti. Pattern recognition in alphabets of Oriya language using Kohonen neural network. Int. J. Patt. Recog. & Art. Intell., 12: 1007-1015, 1998.

[12] K. Roy, T. Pal, U. Pal and F. Kimura. Oriya handwritten numeral recognition system. Proceedings of the 8th ICDAR’05: 770-774, 2005.

[13] H. Ma, D. S. Doermann. Adaptive Hindi OCR using generalized Hausdorff image comparison. ACM Trans. Asian Lang. Inf. Process, 2: 193-218, 2003.

[14] U. Bhattacharya and B. B. Chaudhuri. Databases for research on recognition of handwritten characters of Indian scripts. Proc. of the 8th Int. Conf. on Document Analysis and Recognition, Seoul, II: 789-793, 2005.

[15] H. Park and S. Lee. Off-line recognition of large-set handwritten characters with multiple hidden Markov models, Pattern Recognition, 29: 231-244, 1996.

[16] S. K. Parui and D. Dutta Majumder. Shape similarity measures for open curves. Pattern Recognition Letters, 1(3): 129-134, 1983.

[17] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 2nd Edition, 1990.
