Machine Printed Handwritten Text Discrimination


Using Radon Transform and SVM Classifier

ET-Tahir Zemouri¹ and Youcef Chibani²

Signal Processing Laboratory, Faculty of Electronic and Computer Sciences

University of Sciences and Technology Houari Boumediene

USTHB, EL-Alia, B.P. 32, 16111, Algiers, Algeria

¹ tzemouri@usthb.dz, ² [email protected]

Abstract—Discriminating machine printed from handwritten text is a major problem in the recognition of mixed texts. In this paper, we address the problem of identifying each type using the Radon transform and Support Vector Machines (SVM). The method proceeds in three steps: preprocessing, feature generation and classification. A new set of features is generated from each word using the Radon transform, and a classifier then distinguishes printed text from handwritten text. The proposed system is tested on the IAM database, where it achieves a recognition rate of over 98%.

Keywords: document analysis; machine printed and handwritten text discrimination; Radon transform; Support Vector Machines (SVM).

I. INTRODUCTION

Machine printed and handwritten text often appear together in application forms, question papers and mail, as well as in notes, corrections and instructions added to printed documents.

In all of these cases it is crucial to detect, distinguish and process differently the areas of handwritten and printed text (OCR for machine printed text and ICR for handwritten annotations), for reasons such as: (a) retrieval of important information (identification of handwriting in application forms), (b) removal of unnecessary information (removal of handwritten notes from official documents), and (c) application of different recognition algorithms in each case.

The main difference between machine printed and handwritten text is their shape structure. Characters in machine printed text have a uniform shape, whereas handwritten text follows arbitrary, curly allograph styles. This difference can be exploited to generate features that measure the regularity of machine printed words relative to handwritten words.

There exist a few papers on the discrimination of machine printed and handwritten text. Kuhnke et al. [1] proposed a neural network-based approach with straightness and symmetry as features. Pal and Chaudhuri [2] used horizontal projection profiles for separating the printed and handwritten lines in Bangla script. Guo and Ma [3] proposed an approach based on the vertical projection profile of the segmented words, with a Hidden Markov Model (HMM) as the classifier. Zheng et al. [4] reported on printed and handwritten text segmentation using k-NN, Support Vector Machines (SVM) and the Fisher classifier, with features such as pixel density, aspect ratio and Gabor features. Kandan et al. [5] used invariant moments, which are insensitive to translation, scale, mirroring and rotation, as features for distinguishing printed from handwritten elements, together with an SVM classifier.

In this paper, we propose a new method for text discrimination using the Radon transform and Support Vector Machines.

The Radon transform is well suited to detecting linear features. Hence, printed words generate more regular Radon coefficients than handwritten words, and this property can be used to distinguish between the two. The SVM, in turn, is well suited to the robust separation of two classes.

The paper is organized as follows. In Section 2, we describe the proposed system. Experiments and conclusions are discussed in Sections 3 and 4, respectively.

II. THE PROPOSED SYSTEM

The system for discriminating between machine printed and handwritten text can be decomposed into three stages [1], as shown in Fig. 1. The first stage is preprocessing, in which the document is cleaned of noise components such as spurious dots and lines. In the second stage, features are generated from each word using the Radon transform. In the third stage, the words are classified as printed or handwritten using SVM classifiers.

A. Preprocessing stage

Due to large variations in image data, preprocessing, which reduces these variations and produces a more consistent set of data, is essential for accurate character recognition. In our system, preprocessing includes filtering, binarization, skew angle correction, smoothing, and word segmentation.


Figure 1. Block-diagram of the classification system.

1) Image filtering: The image acquired from a scanner generally contains noise, which can be reduced using a 3x3 Wiener filter [6].

2) Binarization: The text is separated from the background by automatic thresholding. The Wolf approach [7] is used to obtain the binary image.

3) Skew angle correction: Skew estimation and correction is an important step in any document analysis and recognition system. We use the projection profile to estimate the skew angle [8]: the profile is computed for different angles, and the angle giving the largest magnitude variations corresponds to the skew angle.

4) Smoothing: For binary document images, four filters [9] can be used to smooth the edges and remove small pieces of noise.

5) Segmentation: Segmentation aims to extract the words from the document. It is performed in two consecutive steps: line segmentation and word segmentation. Both steps make use of projection profiles [10].
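The two-step segmentation with projection profiles can be sketched as follows. This is a minimal illustration in plain Python, assuming a binary image stored as a list of rows with 1 for ink and 0 for background; the function names and the `min_gap` and `min_word_gap` thresholds are hypothetical and are not taken from the paper or from [10].

```python
def projection_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs_of_ink(profile, min_gap=1):
    """Return (start, end) pairs (end exclusive) of non-empty stretches.

    Stretches separated by fewer than `min_gap` empty bins are merged,
    so the small gaps between characters do not split a word.
    """
    runs = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs


def segment_words(image, min_word_gap=2):
    """Split a binary page into word boxes (top, bottom, left, right)."""
    boxes = []
    # Step 1: line segmentation from the row profile.
    for top, bottom in runs_of_ink(projection_profile(image, axis=0)):
        line = image[top:bottom]
        # Step 2: word segmentation from the column profile of each line.
        columns = projection_profile(line, axis=1)
        for left, right in runs_of_ink(columns, min_gap=min_word_gap):
            boxes.append((top, bottom, left, right))
    return boxes
```

In practice the word gap threshold would have to be tuned to the scan resolution, since it decides whether a blank stretch separates two characters or two words.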

B. Feature Generation

Many kinds of features can be generated to distinguish printed from handwritten text. Kuhnke et al. [1] proposed the straightness of vertically/horizontally oriented lines and the symmetry relative to different points as features. Pal and Chaudhuri [2] used distinctive structural and statistical features. Guo and Ma [3] evaluated their scheme using the vertical projection profile. Zheng et al. [4] used features such as Gabor filter responses and run-length histograms. Kandan et al. [5] used moments that are invariant under translation, scaling, rotation and reflection.

The main idea of our approach is to take advantage of the structural properties that help to discriminate printed from handwritten text. More precisely, the shape of printed characters is more or less stable within a text word. On the other hand, the distribution of the shapes of handwritten characters is quite diverse.

The Radon transform has been used in many pattern recognition applications, such as shape recognition [11]. In our approach, the Radon transform is used as a tool for generating a feature vector. Hence, we briefly review its main properties.

1) Radon Transform

The Radon transform computes projections of an image along specified directions. A projection of a two-dimensional function $I(x, y)$ is a set of line integrals. The Radon transform computes the line integrals from multiple sources along parallel paths in a certain direction. To represent an image, the Radon transform takes multiple parallel projections of the image from different angles by rotating the source around the center of the image. Formally, the Radon transform of an image is defined as [12]:

$$T_R^I(\rho, \theta) = \int_x \int_y I(x, y)\, \delta(x \cos\theta + y \sin\theta - \rho)\, dx\, dy \qquad (1)$$

where $\delta$ is the Dirac function, $\theta \in [0, 180^\circ]$ and $\rho \in (-\infty, +\infty)$. In other words, $T_R^I$ is the integral of $I(x, y)$ over the line defined by $\rho = x \cos\theta + y \sin\theta$.
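To make the definition in Eq. (1) concrete, the sketch below approximates the Radon transform on a small discrete image: each pixel adds its value to the bin of the nearest offset $\rho$ of the line through it at angle $\theta$. The function name, the `n_angles` parameter, the centering on the image midpoint and the nearest-bin rounding are assumptions of this example, not the authors' implementation.

```python
import math


def radon_transform(image, n_angles=180):
    """Approximate the Radon transform of a small image (list of rows).

    Row k of the result is the projection at angle k * 180 / n_angles
    degrees. Column j corresponds to the offset rho = j - radius,
    measured from the image center.
    """
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    radius = int(math.ceil(math.hypot(cx, cy)))
    n_rho = 2 * radius + 1  # rho takes the values -radius .. +radius
    sinogram = []
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        projection = [0.0] * n_rho
        for y, row in enumerate(image):
            for x, value in enumerate(row):
                if value:
                    # Offset of the line through (x, y) at angle theta.
                    rho = (x - cx) * cos_t + (y - cy) * sin_t
                    projection[int(round(rho)) + radius] += value
        sinogram.append(projection)
    return sinogram
```

For a single vertical stroke, the projection at 0° collapses into one bin while the projection at 90° spreads over the stroke length, which is the kind of directional regularity the method relies on.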

The Radon transform has several useful properties, such as periodicity, symmetry, translation invariance, rotation invariance and scaling invariance.

In our approach, we are interested only in periodicity and symmetry. Fig. 2 shows an example of the Radon transform computed on printed and handwritten words.

Figure 2. A shape (a) and its Radon transform (b).

We can easily see that the Radon transform generates more coefficients for the handwritten word than for the printed word.

[Figure 1 diagram: Document image → Preprocessing (Filtering, Binarization, Skew correction, Smoothing, Segmentation) → Feature generation → Classification → Handwritten / Machine printed]


2) Feature vector generation

To generate features of printed and handwritten words, we fix the number of angular directions, denoted by $N_\theta$ ($\theta \in [0, 360^\circ]$). Since the Radon transform generates redundant coefficients (Fig. 2b), we keep only the positive radial projections and take all directions from 0 to 360°. For each of the $N_\theta$ directions, the feature value is computed from the corresponding column in the positive half of the Radon transform as the normalized sum of the squared coefficients. The feature values $E^I(\theta)$ are defined as:

$$E^I(\theta) = \frac{1}{N_\rho} \sum_{\rho=1}^{N_\rho} \left[ T_R^I(\rho, \theta) \right]^2 \qquad (2)$$

where $N_\rho$ is the number of positive radial positions.
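The energy feature of Eq. (2) can be sketched as below, assuming the Radon coefficients are already available as a list of projections (one per angle) whose middle column corresponds to $\rho = 0$. The function name, this layout, and the choice to include $\rho = 0$ in the positive half are assumptions made for illustration only.

```python
def radon_energy(sinogram):
    """Energy per angle: mean of the squared coefficients over rho >= 0.

    `sinogram[k][j]` is the coefficient at angle index k and radial
    index j, with rho = 0 located at j = len(row) // 2.
    """
    features = []
    for projection in sinogram:
        center = len(projection) // 2
        positive = projection[center:]  # keep the non-negative offsets only
        features.append(sum(c * c for c in positive) / len(positive))
    return features
```

The result is one value per angular direction, so the length of the feature vector equals the number of directions chosen for the transform.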

Fig. 3 illustrates an example of the generated feature values, namely the Radon transform energy for each angle $\theta$.

Figure 3. Feature vector generation: (a) printed word and its Radon transform, (b) handwritten word and its Radon transform, (c) Radon energy versus angle.

We can see that the Radon-based energy is higher for the handwritten word than for the printed word.

3) Feature vector normalization

In many practical situations, a designer is confronted with features whose values lie within different dynamic ranges. Thus, features with large values may have a larger influence on the cost function than features with small values, although this does not necessarily reflect their respective significance in the design of the classifier. The problem is overcome by normalizing the features so that their values lie within similar ranges. This is achieved by using a nonlinear transformation [13].
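The paper does not say which nonlinear transformation is used. One common choice described in [13] is softmax scaling, which passes the z-score of each feature through a logistic function. The sketch below assumes that choice; the function name and the parameter `r` (which controls how wide the near-linear region is) are labels picked for this example.

```python
import math


def softmax_scale(vectors, r=1.0):
    """Squash each feature into (0, 1) with a logistic of its z-score.

    `vectors` is a list of equal-length feature vectors. This is an
    assumed normalization, not necessarily the one used by the authors.
    """
    n = len(vectors)
    dims = len(vectors[0])
    scaled = [[0.0] * dims for _ in vectors]
    for k in range(dims):
        column = [v[k] for v in vectors]
        mean = sum(column) / n
        var = sum((c - mean) ** 2 for c in column) / (n - 1) if n > 1 else 0.0
        std = math.sqrt(var) or 1.0  # constant features: avoid division by zero
        for i, c in enumerate(column):
            y = (c - mean) / (r * std)
            scaled[i][k] = 1.0 / (1.0 + math.exp(-y))
    return scaled
```

With this scaling, two features that differ only by a constant factor end up with identical normalized values, so neither dominates the distance computed inside the kernel.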

C. Classification

SVMs are supervised learning methods that have been widely and successfully used for pattern recognition in applications such as digit recognition [14]. The main idea of the SVM is to find a hyperplane that separates two classes while leaving the largest margin between the vectors of the two classes [14]. In real-life problems, however, the classes may not be linearly separable. To deal with this, a nonlinear decision surface is obtained by lifting the feature space into a higher-dimensional space. A linear separating hyperplane found in the higher-dimensional space gives a nonlinear decision surface in the original feature space. The decision function of the SVM can be expressed as follows:

$$f(x) = \sum_i \alpha_i y_i K(x, x_i) + b \qquad (3)$$

where $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$ are the feature vectors and labels, respectively. In our case, the feature vectors $\{x_i\}$ correspond to the Radon energy, and the labels are $+1$ for printed words and $-1$ for handwritten words. The parameters $\alpha_i$ and $b$ are found by maximizing a quadratic function subject to some constraints [14]. $K(x, x_i)$ is the kernel function, which maps the feature vectors into a higher-dimensional inner product space. In our case, we use the RBF (Radial Basis Function) kernel since it offers better discrimination than other kernels. The RBF kernel is defined as:

$$K(x, x_i) = \exp\left( -\frac{d(x, x_i)}{2\sigma^2} \right) \qquad (4)$$

$$d(x, x_i) = \| x - x_i \|^2 \qquad (5)$$

where $\sigma$ is user defined.
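The RBF kernel and the decision function can be written out directly, as in the sketch below. It only evaluates a classifier whose support vectors, multipliers and bias are already known; training (for instance with SMO) is not shown. All function names are hypothetical, and the numeric values in the usage note are hand-picked for illustration rather than learned.

```python
import math


def rbf_kernel(x, x_i, sigma):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_function(x, support_vectors, labels, alphas, bias, sigma):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b."""
    return bias + sum(
        alpha * y * rbf_kernel(x, sv, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


def classify(x, support_vectors, labels, alphas, bias, sigma):
    """Return +1 (machine printed) or -1 (handwritten) from the sign of f(x)."""
    score = decision_function(x, support_vectors, labels, alphas, bias, sigma)
    return 1 if score >= 0 else -1
```

For example, with two support vectors `[0, 0]` (label +1) and `[4, 4]` (label -1), unit multipliers, zero bias and `sigma = 1.0`, a point close to the first vector gets a positive score and a point close to the second gets a negative one.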

The optimization algorithm adopted for training SVMs is the Sequential Minimal Optimization (SMO), which provides practical advantages [15].


III. EXPERIMENTAL RESULTS

A. Data set

To evaluate the performance of the proposed method, we use the IAM database (Institut für Informatik und angewandte Mathematik) [16]. The forms are scanned at a resolution of 300 dpi in 8 bits/pixel gray-scale and converted into binary images using the Wolf binarization method. The database consists of more than 1500 documents containing printed and handwritten text. An example of a document can be seen in Fig. 4. Regions of printed and handwritten words are easily separable, and the forms contain no auxiliary lines to be filled in with written text. This characteristic facilitates the identification and classification of each type of word.

To test the performance of our system, 21 images are chosen and preprocessed. The resulting set of words is divided into three subsets for training (1/3), validation (1/3) and testing (1/3), respectively. Table I summarizes the data set.

For each word, a vector of Radon-based energy values is calculated. We use the recognition rate (RR) as the metric to evaluate the performance of our system, which is defined as:

$$RR = \frac{\#\ \text{of correctly classified words}}{\text{total}\ \#\ \text{of words}}\ (\%) \qquad (7)$$
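As a small worked form of the recognition rate, the function below counts matching labels and expresses the ratio as a percentage. The function name and the +1/-1 label encoding in the usage note are assumptions for this example.

```python
def recognition_rate(predicted, actual):
    """Percentage of words whose predicted label matches the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)
```

For instance, if three of four words are labeled correctly, as in `recognition_rate([1, 1, -1, -1], [1, -1, -1, -1])`, the rate is 75%.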

Figure 4. IAM Database form.

TABLE I. DATA SET

Data set          Training   Validation   Testing
Machine printed   447        447          438
Handwritten       525        525          484
Total             972        972          922

B. System validation

To validate our system, various experiments are conducted to find the SVM regularization parameter (fixed at 10), the kernel parameter ($\sigma$) and the best number of angular directions ($N_\theta$). Fig. 5 shows the recognition rate obtained on the validation set for each number of angular directions. We can note that the RR is not very sensitive to the number of angular directions. However, the best performance (RR = 77.06%) is obtained for $N_\theta = 20$ and $\sigma = 2.1$.

Figure 5. Recognition rate using Radon transform for the system validation.

To improve the recognition rate, we concatenate statistical features to the Radon-based energy: the mean, the variance, the variance of the projection profiles (vertical and horizontal) and the entropy. Fig. 6 shows the recognition rate versus the number of angular directions.
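The paper names these statistical features without defining them precisely. The sketch below shows one plausible reading for a binary word image (1 for ink, 0 for background): mean and variance of the pixel values, variance of the row and column projection profiles, and Shannon entropy of the ink/background distribution. The function name and every one of these definitions are assumptions, not the authors' specification.

```python
import math


def statistical_features(word):
    """Return [mean, variance, var(row profile), var(column profile), entropy].

    `word` is a binary image given as a list of rows of 0/1 values.
    """
    pixels = [p for row in word for p in row]
    n = len(pixels)
    mean = sum(pixels) / n  # fraction of ink pixels
    variance = sum((p - mean) ** 2 for p in pixels) / n

    def spread(profile):
        m = sum(profile) / len(profile)
        return sum((v - m) ** 2 for v in profile) / len(profile)

    horizontal = [sum(row) for row in word]      # ink per row
    vertical = [sum(col) for col in zip(*word)]  # ink per column

    # Shannon entropy in bits of the ink/background split (assumed definition).
    entropy = 0.0
    for prob in (mean, 1.0 - mean):
        if prob > 0:
            entropy -= prob * math.log2(prob)
    return [mean, variance, spread(horizontal), spread(vertical), entropy]
```

The five values would then be appended to the Radon energy vector of the same word before normalization and classification.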

Figure 6. Recognition rate using Radon transform and statistical features.

We can see that the statistical features carry very useful information for discriminating between machine printed and handwritten text, since the RR on the validation set improves to 92.8% for $N_\theta = 10$ and $\sigma = 2$. Adding the statistical features therefore brings a clear advantage.

C. System testing

After the validation of the system, the testing set is used to evaluate its performance. Hence, the optimal values found during system validation are used for computing the recognition rate. The obtained recognition rate is 98.32%, which is encouraging compared to other works [1-5].

D. Comparison with other similar works

We compare our results with other published research works in terms of RR. Kuhnke et al. [1] proposed a neural network-based approach with the straightness of vertically/horizontally oriented lines and the symmetry relative to different points as features. Their system reached an RR of 78.5%. The approach of Pal and Chaudhuri [2] is based on the distinctive structural and statistical features of machine printed and handwritten text lines in Bangla script. Their classification scheme has an RR of 98.3%. Guo and Ma [3] evaluated their scheme using the vertical projection profile of the segmented word and obtained an RR of 92.86% using an HMM. Zheng et al. [4] obtained an RR of 96% using an SVM classifier and features such as Gabor filter responses and run-length histograms. Kandan et al. [5] obtained an RR of 93.22% using moments that are invariant under translation, scaling, rotation and reflection as features and an SVM classifier.

Our proposed method obtains an RR of 98.32% by using the Radon transform, statistical features and an SVM classifier, which is encouraging compared to other works.

IV. CONCLUSION

In this paper, we proposed a new method for discriminating printed and handwritten text in document images using the Radon transform and SVM classifiers. The system was implemented and tested on the IAM database.

Our approach gives encouraging results by combining Radon energy and statistical features using SVM classifiers with the RBF kernel.

In the future, we plan to apply our methodology to distinguish machine printed from handwritten text in Arabic and Latin scripts.

REFERENCES

[1] K. Kuhnke, L. Simoncini, and Z. M. Kovacs-V, “A System for Machine-Written and Hand-Written Character Distinction,” Proc. 3rd International Conference on Document Analysis and Recognition, vol. 2, pp. 811-814, 1995.

[2] U. Pal and B. B. Chaudhuri, “Machine-printed and Hand-written Text Line Identification,” Pattern Recognition Letters, vol. 22, n. 3-4, pp. 431-441, 2001.

[3] J. K. Guo and M. Y. Ma, “Separating Handwritten Material from Machine Printed Text Using Hidden Markov Models,” Proc. 6th International Conference on Document Analysis and Recognition, pp. 439-443, 2001.

[4] Y. Zheng, H. Li, and D. Doermann, “Machine Printed Text and Handwriting Identification in Noisy Document Images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, n. 3, pp. 337-353, 2004.

[5] R. Kandan, N. K. Reddy, K. R. Arvind, and A. G. Ramakrishnan, “A Robust Two Level Classification Algorithm for Text Localization in Documents,” Advances in Visual Computing, 3rd Int. Symp. (ISVC 07), Part II, LNCS 4842, pp. 96–105, 2007.

[6] B. Gatos, I. Pratikakis, and S. J. Perantonis, “Adaptive degraded document image binarization,” Pattern Recognition, vol. 39, pp. 317-327, 2006.

[7] C. Wolf and J. M. Jolion, “Extraction and recognition of artificial text in multimedia documents,” Pattern Analysis and Applications, vol. 6, n. 4, pp. 309-326, 2003.

[8] T. Akiyama and N. Hagita, “Automatic entry system for printed documents,” Pattern Recognition, vol. 23, n. 11, pp. 1141-1154, 1990.

[9] M. Cheriet, N. Kharma, C. L. Liu, and C. Suen, “Character Recognition Systems: A Guide for Students and Practitioners,” Wiley-Interscience, p. 321, 2007.

[10] E. Ataer and P. Duygulu, “Retrieval of Ottoman Documents,” Proc. 8th ACM International Workshop on Multimedia Information Retrieval, pp. 155-162, 2006.

[11] S. Tabbone, L. Wendling, and J. P. Salmon, “A new shape descriptor defined on the Radon transform,” Computer Vision and Image Understanding, vol. 102, n. 1, pp. 42–51, 2006.

[12] S. R. Deans, “The Radon Transform and Some of Its Applications,” New York: Wiley, 1983.

[13] S. Theodoridis and K. Koutroumbas, “Pattern Recognition,” 4th Ed., Elsevier Inc., 2009.

[14] H. Nemmour and Y. Chibani, “Handwritten digit recognition based on a neural-SVM combination,” Int. Journal of Computers and Applications (Acta Press), vol. 32, n. 1, pp. 104-109, 2010.

[15] H. Nemmour and Y. Chibani, “Integrating class-dependant tangent vectors into SVMs for handwritten digit recognition,” Int. Conf. on Signals, Circuits and Systems (ICSCS), pp. 1-4, 2009.

[16] U. V. Marti and H. Bunke, “The IAM-Database: an English sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, n. 1, pp. 39-46, 2002.