
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 23, 403-419 (2007)


Italic Detection and Rectification*

KUO-CHIN FAN AND CHIEN-HSIANG HUANG

Institute of Computer Science and Information Engineering National Central University

Chungli, 320 Taiwan E-mail: [email protected]

In this paper, a novel italic detection and rectification method that does not require character recognition is proposed. An italic style character can be obtained by performing shear transformation on its corresponding non-italic style character. Traditional italic detection methods have to operate on at least a word, a sentence, or even a whole paragraph. The merit of the proposed method is that it can operate directly on a single character, so that more accurate statistical information can be obtained. The rationale of our proposed method is that the difference of certain features derived from italic style characters will be canceled after shear transformation, whereas the difference will become more obvious for non-italic style (normal style) characters. In our proposed approach, the virtual strokes embedded in the considered character image are extracted first. Then, reverse transformation is performed on the considered character image. The 26 upper-case and 26 lower-case letters are classified into three classes based on the structural information of the extracted virtual strokes. The italic and non-italic style characters can then be distinguished based on the classification rule devised for each class of characters. Last, the exact shear angle of the identified italic character is calculated to perform a more accurate reverse shear transformation, which rectifies the italic style character into a normal (non-italic) style character to facilitate the later OCR task. Experiments were conducted on 50 document images with mixed italic and normal style characters. Satisfactory accuracy rates of 99.59% for italic style characters and 99.85% for normal style characters are achieved. Experimental results verify the validity of our proposed method in distinguishing italic and non-italic style characters.

Keywords: italic detection, virtual stroke, character classification, italic rectification, shear transformation

1. INTRODUCTION

Optical Character Recognition (OCR) is a technique to convert scanned text images into machine-readable format. Document Analysis (DA) is a preprocess to facilitate the OCR task. Font and style detection is one of the important topics in DA. For beautification or highlighting purposes, italic style characters frequently appear in document layouts. In this paper, a novel technique for detecting italic style characters is presented. The purposes of italic detection are twofold. First, it can preserve the flavor (appearance) of the original document, i.e., the original document can be losslessly reconstructed without losing any information. Second, the recognition rate of later OCR can be greatly improved. As we know, italic style characters and normal style characters look distinctly different and hence possess different features. Using a normal style character recognition technique to recognize italic style characters will result in a very low recognition rate. The performance of OCR will be greatly improved if we can distinguish italic style and normal style characters and rectify the detected italic style characters into normal style characters.

Received January 10, 2005; revised March 30, 2005; accepted May 2, 2005. Communicated by Pau-Choo Chung.
* This work was supported by the MOE Program for Promoting Academic Excellence of Universities under grant No. 91-H-FA08-1-4.

An italic style character can be obtained from its corresponding normal style character through shear transformation. Hence, the features of the strokes embedded in an italic style character are distinctly different from those in its corresponding normal style character. The rationale of our proposed method is that the difference of certain features derived from italic style characters will be canceled after shear transformation, whereas the difference will become more obvious for non-italic style (normal style) characters. The problem of italic detection can then be treated as detecting the presence or absence of a shear transformation applied to the character image. The problem of italic rectification, on the other hand, is to find the exact shear angle needed to perform the inverse shear transformation. Through this reverse shear transformation, an italic style character can be rectified back to a normal style character.

Traditional italic detection methods have to operate on at least a word, a sentence, or even a whole paragraph, whereas our proposed method can operate successfully even on a single character. In this way, more accurate statistical information can be obtained. Khoubyari and Hull [1] introduced a method that identifies the predominant font (style) of a document image by matching clusters of word images against a pre-generated font (style) database. The performance of italic or non-italic style detection is best if the font is known in advance. Cooperman [2] used a set of local detectors to estimate style attributes, such as serifness and boldness, and then utilized these features to perform style detection and OCR simultaneously. Zhu et al. [3] proposed a global texture-analysis-based style recognition method operating on normalized text blocks. These content-independent approaches [1-3] avoid segmentation procedures, such as the commonly used connected component analysis and X-Y cut method, for detailed local feature extraction. Shi and Pavlidis [4, 5] proposed a method to discriminate italic and non-italic styles by analyzing the histogram of stroke slopes of the whole text block. The common characteristic of all these methods [1-5] is that they extract features from large text blocks. Consequently, they perform very poorly when only a few italic words are sparsely present in a document image. In contrast, Chaudhuri and Garain [6, 7] presented an OCR-free italic detection technique that operates at the character level by testing the angle of the straight stroke or the vertical midline present in the character image, and then generalized it to the word level. Sun and Si [8] utilized gradient direction to detect slanted characters in a document image with only a few italic words. Although these two methods are based on feature analysis conducted on each individual character, they cannot perform well when characters are inter-connected in distorted document images. Moreover, shape properties and gradient information are usually influenced by font variations, such as size, serifness, and boldness. Zramdini and Ingold [9] introduced an italic detection method that calculates the sum of the first derivative of the horizontal profile of the whole text block. Zhang et al. [10] used stroke pattern analysis operating on wavelet-decomposed word images to detect the presence of italic style. Their method first extracts the normalized total height H of the vertical straight line segments (VSLS) and the normalized total length L of the long continuous diagonal strokes (CDS). Then, thresholds determined experimentally define the ranges of H and L for italic and normal words. This method still works when only a few italic words are present and is robust to inter-connected characters. However, it needs training on a database to select suitable thresholds, and it cannot handle some single characters with slant strokes, such as ‘W’, ‘V’, and ‘X’, because it cannot differentiate between slant straight strokes that belong to normal style characters and vertical strokes that belong to italic style characters. To avoid detection errors on these characters, the authors further extended the method to the word level.

The main goal of our work is to detect italic style character images of various fonts. In our proposed approach, the virtual strokes embedded in the considered character image are extracted first. Then, shear transformation is performed on the considered character image. The 26 upper-case and 26 lower-case letters are classified into three classes based on the structural information of the extracted virtual strokes. The italic and non-italic style characters can then be distinguished based on the classification rule devised for each class of characters. Last, the exact shear angle of the identified italic character is calculated to perform a more accurate reverse shear transformation, which rectifies the italic style character into a normal (non-italic) style character to facilitate the later OCR task. Fig. 1 shows the flowchart of our proposed method.

Fig. 1. The flowchart of italic detection and rectification.

The advantages of our proposed method are threefold. First, owing to its character-based property, it still functions well when just a few italic words are scattered in a document. Second, it does not need any training to select a suitable threshold for judging the italic style. Third, it is suitable for various kinds of font types.


The rest of the paper is organized as follows. Section 2 introduces the preprocesses of character segmentation and shear transformation. In section 3, the process for extracting the virtual strokes of a character is presented. Section 4 presents the categorization process that assigns each virtual stroke to one of four categories according to the gradient direction and stroke direction. The classification process that assigns each character to one of three classes based on its composing virtual strokes is also addressed in that section. In section 5, the decision rules for detecting the presence of italic style in each character class are devised. Section 6 describes the process of italic rectification, which finds the exact shear angle for performing the reverse shear transformation. In section 7, experimental results are presented to demonstrate the feasibility and validity of the proposed approach in detecting italic style characters of various fonts. Finally, conclusions are given in section 8.

2. CHARACTER SEGMENTATION AND SHEAR TRANSFORMATION

As we know, the presence of italic style affects the result of character segmentation, because each character block is assumed to be a rectangle in document analysis. In the presence of italic style characters, the extent of each character block overlaps with its neighboring character blocks and hence hinders the finding of accurate segmentation paths.

If the shear angle of italic style characters is known in advance, it will definitely be helpful in finding accurate segmentation paths. Li et al. [11] proposed a segmentation method for touching italic characters by estimating the slant (shear) angle in finding the suitable cut path.

The process for estimating the slant angle is as follows. First, slant projection is performed, and the blank lines in each direction in the text line are counted; the direction with the maximum count is chosen as the possible slant angle θ. Then, the characters are segmented along the slant angle θ, and each character block is sheared by approximately the angle θ to serve as the reference for the later italic detection task.

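$$\begin{bmatrix} x & y \end{bmatrix} \cdot \begin{bmatrix} 1 & 0 \\ \tan\theta & 1 \end{bmatrix} = \begin{bmatrix} x' & y' \end{bmatrix} \qquad (1)$$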
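The slant-angle estimation described before Eq. (1) can be sketched as follows. This is only one possible reading of "slant projection", not the implementation of Li et al. [11]: each black pixel is projected onto the baseline along a direction tilted by a candidate angle, and the empty bins between the outermost occupied bins are counted as the "blank lines". The function names, the list-of-rows image format, and the candidate angle set are assumptions made for illustration.

```python
import math

def slanted_projection_gaps(img, theta_deg):
    """Count empty bins inside the slanted projection of a binary image.

    img is a list of rows (row 0 at the top) holding 0 (white) or 1 (black).
    Each black pixel is projected onto the bottom row along a line tilted
    theta_deg away from the vertical.
    """
    h = len(img)
    t = math.tan(math.radians(theta_deg))
    occupied = set()
    for r, row in enumerate(img):
        height = h - 1 - r  # height of this row above the bottom row
        for c, v in enumerate(row):
            if v:
                occupied.add(round(c - height * t))
    if not occupied:
        return 0
    span = max(occupied) - min(occupied) + 1
    return span - len(occupied)  # bins between the extremes that stay empty

def estimate_slant(img, candidates=range(0, 21)):
    """Return the candidate angle (degrees) whose projection has the most gaps.

    The default candidate set is a hypothetical choice, not from the paper.
    """
    return max(candidates, key=lambda a: slanted_projection_gaps(img, a))

# Synthetic text line: two thin strokes slanted by 45 degrees.
line = [
    [0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
]
print(estimate_slant(line, [0, 15, 30, 45]))
```

The 45° strokes are deliberately exaggerated to keep the toy image small; real italic fonts would call for a finer candidate grid over much smaller angles.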

The shear transformation is executed by horizontally shifting each row in an image a certain distance according to its height. Actually, we do not need to perform all the calculations in Eq. (1). We only need to calculate the shifting distance on each row and shift each row the calculated shifting distance accordingly.
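A minimal sketch of this row-shifting shortcut is given below. It assumes a binary image stored as a list of rows (row 0 at the top) and measures the height y of a row from the bottom row, so that x' = x + y·tan θ as in Eq. (1). The function name `shear_rows` and the rounding of shifts to whole pixels are illustrative choices, not taken from the paper.

```python
import math

def shear_rows(img, theta_deg):
    """Shear a binary image horizontally by shifting each row.

    A positive angle pushes the upper rows to the right; a negative angle
    pushes them to the left (the reverse shear).
    """
    h, w = len(img), len(img[0])
    t = math.tan(math.radians(theta_deg))
    # Shift of each row, proportional to its height above the bottom row.
    shifts = [round((h - 1 - r) * t) for r in range(h)]
    lo, hi = min(shifts), max(shifts)
    new_w = w + hi - lo  # widen the canvas so that no pixel is lost
    out = [[0] * new_w for _ in range(h)]
    for r, row in enumerate(img):
        off = shifts[r] - lo
        out[r][off:off + w] = row
    return out

bar = [[1], [1], [1]]  # vertical stroke, three pixels high
for row in shear_rows(bar, 45):
    print(row)
```

Only one tangent evaluation and one integer shift per row are needed, which is why the full matrix product of Eq. (1) can be skipped.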

The possible shear angle θ ranges from 5° to 20°, depending heavily on the font type. Although we cannot determine the exact shear angle at this stage, the estimated shear angle can still help in the italic detection process. The exact shear angle will be obtained later in the italic rectification process. The features (e.g., vertical profile) of italic style characters produced by shear transformation will be almost counteracted by the approximate shear angle, whereas the features of normal style characters produced by shear transformation will be enhanced by the approximate shear angle. The rationale of our proposed italic detection method lies in detecting the appearance or disappearance of these features.

3. VIRTUAL STROKE EXTRACTION

Each character is composed of basic units called strokes. An input character can be classified into different character classes based on the structural information of its constituent strokes. Stroke extraction is generally a tedious and time-consuming process in image processing. In some applications, however, it is sufficient to extract only the approximate outline of the character. In response to this need, the virtual stroke developed in our previous work [12] is adopted to analyze the outline structure of a character for later classification purposes. As verified in the experiments of [12], the virtual strokes extracted from a character can indeed preserve the structural information of the character. Moreover, extracting virtual strokes is much faster than extracting real strokes.

The process of virtual stroke extraction is as follows. First, the original character image Fo(x, y) is shifted right by one pixel to get the image Fr(x, y) and shifted left by one pixel to get the image Fl(x, y). Next, an AND operation is performed on Fo(x, y) and Fr(x, y) to generate FAND(x, y), and then an XOR operation is performed on Fo(x, y) and FAND(x, y) to produce the left virtual stroke Sl(x, y) of the character. By replacing Fr(x, y) with Fl(x, y) and repeating the above procedure, we can produce the right virtual stroke Sr(x, y) of the character. Similarly, the up virtual stroke Su(x, y) and bottom virtual stroke Sb(x, y) of the character can be generated by replacing the above Fr(x, y) with Fb(x, y) and Fu(x, y), respectively. Here, Fb(x, y) and Fu(x, y) are obtained by shifting Fo(x, y) down and up by one pixel, respectively. Fig. 2 shows the left virtual strokes and right virtual strokes extracted from the characters A to K.


Fig. 2. (a) Original character images from A to K; (b) Left virtual strokes of the characters; (c) Right virtual strokes of the characters.
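The shift, AND, and XOR steps described above can be sketched in a few lines. The sketch assumes a binary image stored as a list of rows with 1 for black pixels; the helper names `shift` and `virtual_stroke` are hypothetical and are not taken from [12].

```python
def shift(img, dx, dy):
    """Shift a binary image by (dx, dy) pixels, filling vacated cells with 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if img[r][c] and 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = 1
    return out

def virtual_stroke(img, dx, dy):
    """Return Fo XOR (Fo AND shifted Fo) for the given shift direction."""
    shifted = shift(img, dx, dy)
    f_and = [[a & b for a, b in zip(ro, rs)] for ro, rs in zip(img, shifted)]
    return [[a ^ b for a, b in zip(ro, ra)] for ro, ra in zip(img, f_and)]

# A thick vertical bar, three pixels wide.
bar = [[0, 1, 1, 1, 0] for _ in range(3)]
left = virtual_stroke(bar, 1, 0)     # Fo shifted right gives the left virtual stroke
right = virtual_stroke(bar, -1, 0)   # Fo shifted left gives the right virtual stroke
up = virtual_stroke(bar, 0, 1)       # Fo shifted down gives the up virtual stroke
bottom = virtual_stroke(bar, 0, -1)  # Fo shifted up gives the bottom virtual stroke
print(left[0], right[0])
```

In effect, each virtual stroke keeps only those black pixels whose neighbor on the corresponding side is white, i.e., a one-pixel-thick edge of the character.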

4. STROKE CATEGORIZATION AND CHARACTER CLASSIFICATION

The features of italic style characters, such as the slant angle of a stroke, are not stationary, i.e., they vary from character to character. If the character were known in advance through OCR, the feature difference between the italic style and normal style of the same character would be prominent, and the feature could be utilized to discriminate italic and non-italic style characters. However, this requires the time-consuming OCR process, and the presence of italic style in turn deteriorates the performance of OCR. To resolve this problem, the considered character is first classified into one of three classes according to the structural information of the virtual strokes embedded in the character. The features derived from stroke structure within the same character class are then the same, so OCR is not needed.

4.1 Virtual Stroke Categorization

We can categorize each virtual stroke extracted from a character into one of four categories according to the gradient direction and stroke direction: vertical strokes, horizontal strokes, slant strokes, and cursive strokes. The features of strokes in different categories differ after shear transformation, whereas the features of strokes in the same category remain the same after shear transformation. Character classification is performed according to the composition of the identified strokes extracted from the character.

In performing the virtual stroke categorization, virtual strokes are first separated into cursive and non-cursive virtual strokes. The non-cursive strokes are then further separated into horizontal, slant, and vertical strokes. From Eq. (1), we know that shear transformation can be regarded as shifting each pixel in the horizontal direction according to its relative height in the character block. Since the pixels of a horizontal stroke undergo only a hardly noticeable horizontal shift after shear transformation, horizontal strokes contain no discriminating information and can be ignored. Only the other three stroke categories (vertical, slant, and cursive) are considered. By carefully observing the constitution of English characters, we find that all strokes in these three categories are mainly composed of left virtual strokes and right virtual strokes. Hence, only left and right virtual strokes are considered in both the virtual stroke categorization and character classification processes.

The procedure for discriminating cursive and non-cursive virtual strokes is as follows. First, scan the right virtual stroke Sr(x, y) from right to left, mark the first black pixel encountered, and record the distance from the starting scanning point to the marked pixel to generate the depth function Dr(y). Similarly, scan the left virtual stroke Sl(x, y) from left to right, mark the first black pixel encountered, and record the distance from the starting scanning point to the marked pixel to generate the depth function Dl(y). Next, obtain the first-order derivatives Dr′(y) and Dl′(y) of Dr(y) and Dl(y), respectively. Dr′(y) and Dl′(y) can be treated as the direction functions of the right virtual stroke Sr(x, y) and left virtual stroke Sl(x, y), respectively. If the first-order derivative of a virtual stroke changes sign, the considered stroke is a cursive stroke. Otherwise, it is a non-cursive stroke (either a vertical or a slant stroke). An example illustrating the virtual stroke categorization of character “b” is shown in Fig. 3. Figs. 3 (d) and (f) show the depth function and direction function of the left virtual stroke of character “b”. Similarly, the depth function and direction function of the right virtual stroke are depicted in Figs. 3 (e) and (g), respectively. The direction function here is the first-order derivative of the corresponding depth function. From the result, we know that the left virtual stroke of character “b” is a non-cursive stroke, whereas the right virtual stroke of the character is a cursive stroke.


Fig. 3. (a) Original image of character “b”; (b) Left virtual stroke of character “b”, which is a non-cursive stroke; (c) Right virtual stroke of character “b”, which is a cursive stroke; (d) Depth function of (b); (e) Depth function of (c); (f) Direction function of (b); (g) Direction function of (c).
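The depth function and the sign-change test of section 4.1 can be sketched as follows. The sketch assumes a virtual stroke stored as a list of rows with 1 for black pixels. Skipping rows without any stroke pixel and ignoring zero differences are simplifying assumptions of this sketch; noisy scans would probably also need smoothing or a tolerance, which the text above does not specify.

```python
def depth_function(stroke, from_right=False):
    """Distance from the scan start to the first black pixel of each row.

    Rows without any black pixel yield None.
    """
    depths = []
    for row in stroke:
        cols = [c for c, v in enumerate(row) if v]
        if not cols:
            depths.append(None)
        elif from_right:
            depths.append(len(row) - 1 - cols[-1])
        else:
            depths.append(cols[0])
    return depths

def direction_function(depths):
    """First-order difference of the depth function (empty rows skipped)."""
    valid = [d for d in depths if d is not None]
    return [b - a for a, b in zip(valid, valid[1:])]

def is_cursive(depths):
    """A stroke is cursive if its direction function changes sign."""
    signs = {1 if d > 0 else -1 for d in direction_function(depths) if d != 0}
    return len(signs) == 2

bowl = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]  # curved edge
stem = [[1, 0] for _ in range(4)]                               # straight edge
print(is_cursive(depth_function(bowl)), is_cursive(depth_function(stem)))
```

A vertical stroke yields a constant depth function and a slant stroke a monotonic one, so neither changes sign; only a curved edge such as the bowl of “b” does.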

4.2 Character Classification

Since the prominent features embedded in italic style characters appear mainly in the vertical direction, the horizontal virtual strokes can be ignored during the character classification process. Once the composing virtual strokes of the considered character have been categorized, certain rules are applied to classify the character into one of the three character classes. Fig. 4 shows the flow diagram for virtual stroke categorization and character classification.

Fig. 4. Flow-diagram illustrating virtual stroke categorization and character classification processes.


If there is only one non-cursive stroke in a character, then the character is classified as a class 1 character. Class 1 characters are easily detected as italic or non-italic style by most italic detection methods, because the non-cursive stroke is slant in italic style and vertical in non-italic style.

The characters whose left virtual stroke and right virtual stroke are both cursive strokes are classified as class 2 characters. Class 2 characters lack prominent features when only the stroke angle or the horizontal profile is tested in italic detection. Traditional italic detection methods therefore usually misclassify this kind of character. In our work, certain effective rules are devised for the italic detection of class 2 characters.

Class 3 characters are those whose left virtual stroke and right virtual stroke are both non-cursive strokes. Through detailed analysis, we find that the non-cursive strokes in class 3 characters usually appear in symmetrical pairs, i.e., in the form of a pair of symmetrical vertical strokes or a pair of symmetrical slant strokes.

The 26 upper-case characters and the 26 lower-case characters can be classified into the three character classes according to the above classification rules. The classification result is listed in Table 1.

Table 1. Result of character classification.

Class     Content
Class 1   B, D, E, F, J, K, L, P, R, T, b, d, f, h, j, k, p, q, r, t
Class 2   C, G, O, Q, S, U, Z, a, c, e, g, o, s, u, z
Class 3   A, H, I, V, M, N, W, X, Y, i, l, m, n, v, w, x, y
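The three classification rules of section 4.2 reduce to a small decision on whether the left and right virtual strokes are cursive. The sketch below assumes the two flags have already been computed by the sign-change test of section 4.1; the function name is hypothetical.

```python
def classify_character(left_is_cursive, right_is_cursive):
    """Map the cursive/non-cursive flags of the two virtual strokes to a class."""
    if left_is_cursive and right_is_cursive:
        return 2  # both strokes cursive
    if not left_is_cursive and not right_is_cursive:
        return 3  # both strokes non-cursive
    return 1      # exactly one non-cursive stroke

# Character "b": non-cursive left stroke, cursive right stroke.
print(classify_character(False, True))
```

For character “b”, whose left virtual stroke is non-cursive and whose right virtual stroke is cursive (Fig. 3), the function returns class 1, in agreement with Table 1.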

5. ITALIC DETECTION

In our work, different features are used to detect the presence of italic style in different character classes. These features are devised by observing the impact of shear transformation on the original shape of the considered character.

By careful observation, we notice two main effects on the appearance of a character image before and after shear transformation, on which the proposed method is based. Fig. 5 shows examples illustrating the different changes in the vertical projection of a vertical stroke and a slant stroke before and after shear transformation. First, a vertical stroke is changed to a slant stroke and a slant stroke is changed to a vertical stroke through shear transformation, i.e., the shape of a vertical stroke is changed from a rectangle to a parallelogram, and the shape of a slant stroke is changed from a parallelogram to a rectangle. Second, the shape of the vertical projection of a vertical stroke is changed from a rectangle to a trapezoid (or a triangle in the extreme case), whereas the shape of the vertical projection of a slant stroke is changed from a trapezoid or triangle to a rectangle. The designed italic detection method relies mainly on these two observed phenomena.

[Figure panels: original stroke (a); vertical projection of (a); stroke after shear transformation (b); vertical projection of (b).]

Fig. 5. Example illustrating the two effects resulting from shear transformation.

5.1 Italic Detection of Class 1 Characters

Since there exists only one non-cursive stroke in class 1 characters, the first observed phenomenon is useful in the italic detection of class 1 characters. Previously, some researchers utilized features such as the angle of the stroke, the number of peaks in the vertical projection, and the sum of gradient in the vertical projection to perform the detection task. In this paper, the sum of gradient in the vertical projection is adopted to detect the presence of italic style in class 1 characters.

The procedure is as follows. First, project the character image in the vertical direction and get the height f(x) at position x. Then, calculate the gradient f′(x) of the height function f(x). Finally, perform the detection task by comparing the sum of the gradient of the height function, defined in Eq. (2), before and after shear transformation.

$$\text{sum\_of\_gradient} = \sum_{x=1}^{n} f'(x) = \sum_{x=1}^{n-1} \bigl( f(x+1) - f(x) \bigr) \qquad (2)$$

It is obvious that the sum of gradient of a vertical stroke is larger than that of a slant stroke. As mentioned previously, after shear transformation the slant stroke in an italic style character becomes a vertical stroke, and the vertical stroke in a non-italic style character becomes a slant stroke. Hence, if the sum of gradient of a character increases after shear transformation, then it is an italic character. Otherwise, it is a non-italic character. Fig. 6 shows an example illustrating the italic detection of an italic character “D”. We can see that the sum of gradient of the character increases after shear transformation.

[Figure panels: original image; image after shear transformation.]

Fig. 6. Example illustrating the italic detection of a class 1 character “D”.
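The class 1 decision can be sketched as follows. Taken literally, the signed differences in Eq. (2) telescope to f(n) − f(1), so this sketch assumes that absolute differences are intended; that assumption, the list-of-rows image format, and the function names are not taken from the paper.

```python
def vertical_projection(img):
    """Number of black pixels in each column, i.e., the height function f(x)."""
    return [sum(col) for col in zip(*img)]

def sum_of_gradient(img):
    """Sum of |f(x+1) - f(x)| over the projection (absolute values assumed)."""
    f = vertical_projection(img)
    return sum(abs(b - a) for a, b in zip(f, f[1:]))

def is_italic_class1(before, after):
    """Italic if the sum of gradient increases after the shear transformation."""
    return sum_of_gradient(after) > sum_of_gradient(before)

slanted = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # italic-like stroke
upright = [[1, 0, 0, 0] for _ in range(4)]                          # after de-slanting
print(sum_of_gradient(slanted), sum_of_gradient(upright))
print(is_italic_class1(slanted, upright))
```

The slanted stroke spreads its pixels evenly over the columns and gives a flat projection, whereas the upright stroke concentrates them in one column and produces a sharp step, which is the increase the decision rule looks for.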


5.2 Italic Detection of Class 2 Characters

It is difficult to detect the italic style of class 2 characters, which are composed of cursive strokes, by merely using the feature of curvature. The reason is that the curvature of the same stroke differs between fonts. Hence, the curvatures of the cursive strokes after shear transformation are unpredictable and depend heavily on the font type. After careful analysis, we find that the width of the character block of a class 2 character always varies after shear transformation. This agrees with the second observed phenomenon: the widths of non-italic characters increase after shear transformation, and the widths of italic characters decrease after shear transformation. This is the criterion on which the italic detection of class 2 characters is based. Fig. 7 depicts an example illustrating the italic detection of an italic character “G”. Note that the width of the character block decreases after shear transformation.

[Figure panels: original image; image after shear transformation.]

Fig. 7. Example illustrating the italic detection of a class 2 character “G”.
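The width criterion for class 2 characters can be sketched as a comparison of bounding-box widths before and after the shear transformation. The image format (list of rows, 1 for black) and the function names are assumptions made for illustration.

```python
def block_width(img):
    """Width of the tightest box enclosing all black pixels (0 if none)."""
    cols = [c for c, col in enumerate(zip(*img)) if any(col)]
    return cols[-1] - cols[0] + 1 if cols else 0

def is_italic_class2(before, after):
    """Italic if the character block becomes narrower after the shear."""
    return block_width(after) < block_width(before)

slanted = [[0, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]  # italic-like blob
upright = [[0, 1, 1, 0] for _ in range(3)]            # same blob after de-slanting
print(block_width(slanted), block_width(upright))
print(is_italic_class2(slanted, upright))
```

A real implementation would presumably compare the widths with some tolerance to absorb rounding of the row shifts, which the sketch omits.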

5.3 Italic Detection of Class 3 Characters

A class 3 character is composed of a pair of non-cursive strokes. It may be a pair of parallel straight strokes (such as H, M, N) or a pair of non-parallel straight strokes (such as A, V, W). If the character is not known in advance, we cannot know whether the pair of straight strokes is parallel or non-parallel. However, we need to further divide class 3 characters into two groups, group 1 formed by a pair of parallel straight strokes and group 2 formed by a pair of non-parallel straight strokes, because the features adopted in the italic detection of these two groups are different. Fortunately, the parallelism or non-parallelism of the two straight strokes can be determined by measuring the slopes of the two strokes. If the slopes of the two straight strokes are the same, then they constitute a pair of parallel straight strokes. Otherwise, they are a pair of non-parallel straight strokes. If the pair of parallel straight strokes in a character are both vertical strokes (with slope equal to 90°), the character must be a non-italic character. Otherwise, it is an italic character. For characters that possess non-parallel straight strokes, italic detection can be accomplished by measuring the symmetry of the two non-parallel strokes. If the two non-parallel straight strokes of a character are symmetric, then the character is a non-italic character. If the two non-parallel strokes of a character are non-symmetric, the character must be an italic character.


The symmetry or non-symmetry of a pair of non-parallel strokes can be determined by simply measuring the two angles θ1 and θ2 spanned between the two straight strokes and the horizontal line, respectively. The detail procedure is stated as follows.

Let us formally define θ1 as the angle spanned between the left virtual stroke and the horizontal line, and θ2 as the angle spanned between the right virtual stroke and the horizontal line. The decision rules for italic detection of class 3 characters are:

If θ1 = θ2, then the character belongs to group 1 (a pair of parallel straight strokes):
    If θ1 = θ2 ≠ 90°, then the character is italic style.
    Otherwise, the character is non-italic style.

Otherwise, the character belongs to group 2 (a pair of non-parallel straight strokes):
    If θ1 = π − θ2, then the two strokes are symmetric and the character is non-italic style.
    Otherwise, the character is italic style.
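The class 3 rules can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation. It follows the statement in the text that symmetric non-parallel strokes indicate a non-italic character, and it assumes that both angles are measured from the horizontal in the same direction, in degrees within [0, 180). The function name `classify_class3` and the tolerance `tol_deg` are hypothetical; the tolerance is added only because angles measured on pixel data are rarely exactly equal.

```python
def classify_class3(theta1_deg, theta2_deg, tol_deg=1.0):
    """Return (group, style) for a class 3 character.

    theta1_deg / theta2_deg: angles of the left / right virtual stroke,
    measured from the horizontal in the same direction, in [0, 180).
    tol_deg: hypothetical tolerance for measurement noise (not in the paper).
    """
    if abs(theta1_deg - theta2_deg) <= tol_deg:
        # Group 1: parallel strokes. Upright only if both strokes are vertical.
        is_vertical = abs(theta1_deg - 90.0) <= tol_deg
        style = "non-italic" if is_vertical else "italic"
        return 1, style

    # Group 2: non-parallel strokes. Symmetric means theta1 = 180 - theta2.
    is_symmetric = abs(theta1_deg - (180.0 - theta2_deg)) <= tol_deg
    style = "non-italic" if is_symmetric else "italic"
    return 2, style


print(classify_class3(90, 90))    # upright "H"
print(classify_class3(75, 75))    # slanted "H"
print(classify_class3(70, 110))   # upright "A"
print(classify_class3(58, 98))    # slanted "A"
```

The four calls correspond to the situations illustrated for "H" and "A"; the specific angle values are made up for the example.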

Fig. 8. Example illustrating the italic detection of a class 3 group 2 character “A”.

Fig. 9. Example illustrating the italic detection of a class 3 group 1 character “H”.

Shown in Fig. 8 is an example illustrating the italic detection of a class 3 group 2 character “A”. Note that the pair of non-parallel straight strokes in the non-italic character “A” is symmetric, whereas the pair of non-parallel straight strokes in the italic character “A” is non-symmetric.

Shown in Fig. 9 is an example illustrating the italic detection of a class 3 group 1 character “H”. It is also worth noting that the two spanning angles θ1 and θ2 are not equal to 90° in the italic style character “H”, whereas they are equal to 90° in the non-italic style character “H”.

6. ITALIC RECTIFICATION

Since italic style characters appear far less frequently than non-italic style characters, most commonly encountered OCR systems are designed for non-italic style characters. To recognize italic style characters, two approaches are commonly adopted. The first is to adopt structure-based recognition algorithms that rely on extracted structural features. The alternative is to include the 52 upper-case and lower-case italic characters in the training phase. However, this increases the size of the training database and deteriorates the overall recognition performance. In addition to these two approaches, an italic rectification approach is proposed in this paper to resolve the italic character recognition problem. The concept is quite simple: if a character is identified as an italic style character, it is rectified back to its corresponding non-italic style character by performing the reverse shear transformation. Shear transformation is reversible, i.e., its effect can be recovered or cancelled by performing the shear transformation in the opposite direction. Usually, the shear transformation is performed in the clockwise direction with a certain angle θ. If the shear transformation is performed in the counterclockwise direction with the same angle θ, it is called the reverse shear transformation. An italic style character can be corrected to a non-italic style character via reverse shear transformation if the shear (rotating) angle θ is known in advance. The difficulty lies in finding the exact shear angle, because different fonts use different shear angles in forming their italic characters. If a fixed angle is used to perform the reverse shear transformation, severe noise or distortion may be present for certain font types after the rectification process. To remedy this problem, certain intermediate results generated in the italic detection process can be utilized to obtain a more accurate shear angle. The detailed process is stated in the following subsections.

6.1 Italic Rectification of Class 1 Characters

Since each non-italic class 1 character contains only one straight stroke and that stroke is vertical, the shear angle θ of an italic style character is already produced during italic detection. The shear angle θ of a class 1 italic character is given by the slope of the right virtual stroke or the left virtual stroke, depending on the side on which the straight stroke appears.

$$\theta = \tan^{-1}\,\mathrm{slope}(S_r) \quad \text{or} \quad \theta = \tan^{-1}\,\mathrm{slope}(S_l) \qquad (3)$$

Shown in Fig. 10 is an example illustrating the finding of the shear angle of an italic character “P”.
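As a rough illustration of the class 1 case, the sketch below measures the angle of a straight stroke from two of its points and then undoes the slant. It rests on assumptions that are not spelled out in the paper: coordinates have y pointing up (image rows would need to be flipped), θ is measured from the horizontal so that a vertical stroke gives 90° as in the detection rules, and the reverse shear moves each point horizontally by its height times cot θ. The names `stroke_angle` and `reverse_shear` are hypothetical.

```python
import math


def stroke_angle(p0, p1):
    """Angle in radians, in [0, pi), between the stroke p0 -> p1 and the horizontal."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    # The modulo makes the result independent of the order of the two points.
    return math.atan2(dy, dx) % math.pi


def reverse_shear(points, theta, y_base=0.0):
    """Shift points horizontally so that a stroke at angle theta becomes vertical.

    theta must not be 0 (a horizontal stroke cannot be made vertical by shear).
    """
    cot_theta = math.cos(theta) / math.sin(theta)  # ~0 when theta = pi/2
    return [(x - (y - y_base) * cot_theta, y) for x, y in points]


# Hypothetical slanted stem running from (0, 0) to (1, 2).
stem = [(0.0, 0.0), (1.0, 2.0)]
theta = stroke_angle(stem[0], stem[1])
rectified = reverse_shear(stem, theta)
print(round(math.degrees(theta), 2))
print(rectified)
```

After the reverse shear, both end points of the stem share the same x coordinate (up to floating-point error), i.e. the stem is vertical again while every point keeps its height.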

Fig. 10. Example illustrating the rectification of a class 1 italic character “P”.

Fig. 11. Example illustrating the rectification of a class 2 character “G”.

6.2 Italic Rectification of Class 2 Characters

According to the devised classification rule, the width of a class 2 character is almost the same at all positions. For class 2 italic characters, the shear transformation increases the width of the character. The rationale in rectifying class 2 italic characters is to minimize the width increase by applying the reverse shear transformation. Let the width increase of the character be denoted by ΔW and the height of the character by H. The shear angle that minimizes the width increase can be calculated as follows.

$$\theta = \tan^{-1}\frac{\Delta W}{H} \qquad (4)$$
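Eq. (4) gives θ = tan⁻¹(ΔW/H). A minimal numerical sketch follows. It uses a synthetic shape whose upright width is known, so ΔW can be read off directly; on real character images the upright width is not available and ΔW would have to be estimated, which this sketch does not attempt. Note that, as written, θ in Eq. (4) is the lean away from the vertical (horizontal offset over height). The helper names are hypothetical.

```python
import math


def shear_angle_from_width(delta_w, height):
    """Shear angle in radians from the width increase and the character height."""
    if height <= 0:
        raise ValueError("height must be positive")
    return math.atan(delta_w / height)


def width(points):
    """Horizontal extent of a set of (x, y) points."""
    xs = [x for x, _ in points]
    return max(xs) - min(xs)


# Hypothetical example: an upright 10 x 20 box sheared by x' = x + 0.25 * y.
upright = [(0.0, 0.0), (10.0, 0.0), (0.0, 20.0), (10.0, 20.0)]
italic = [(x + 0.25 * y, y) for x, y in upright]

delta_w = width(italic) - width(upright)
theta = shear_angle_from_width(delta_w, 20.0)

# Reverse shear: move each point back by y * tan(theta).
rectified = [(x - y * math.tan(theta), y) for x, y in italic]
print(delta_w, round(math.degrees(theta), 2), round(width(rectified), 6))
```

Applying the reverse shear with the recovered angle brings the width of the box back to its upright value, which is the effect the width-minimization rationale aims for.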

Shown in Fig. 11 is an example illustrating the rectification of a class 2 italic character “G”.

6.3 Italic Rectification of Class 3 Characters

As stated previously, there are two groups of class 3 characters, and the shear angle is found differently for each group. Let us first state the rectification process of class 3 group 1 characters, which is very similar to that of class 1 characters due to the structural similarity between these two kinds of characters. Since there are two parallel straight strokes in class 3 group 1 characters, the shear angle can be found from the slope of either the right virtual stroke or the left virtual stroke. Shown in Fig. 12 is an example illustrating the rectification of a class 3 group 1 character “N” by finding the shear angle from the straight stroke.

Fig. 12. Example illustrating the italic rectification of a class 3 group 1 character “N”.

As to the rectification of group 2 characters, the symmetry in normal style characters is destroyed by shear transformation, i.e., the right virtual stroke and the left virtual stroke in group 2 italic characters are no longer symmetric. The rationale in rectifying this kind of character is to recover the symmetry by applying the reverse shear transformation with the estimated shear angle to the italic style character. Let α and β denote the spanning angles of the right virtual stroke and the left virtual stroke, respectively, with respect to the horizontal line in italic style characters, and let α′ and β′ denote the corresponding spanning angles in non-italic style characters. Here, α′ and β′ are equal due to the symmetry in non-italic style characters. After shear transformation, the two angles are increased and decreased, respectively, by an angle θ to give α and β. According to the mathematical derivation, the shear angle θ used in the reverse shear transformation is thus the mean of the absolute values of the slopes of the right virtual stroke (Sr) and left virtual stroke (Sl).

$$\theta = \frac{\pi - \tan^{-1}\,\mathrm{slope}(S_r) + \tan^{-1}\,\mathrm{slope}(S_l)}{2} \qquad (5)$$
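The symmetry argument can be illustrated with a small sketch. It is not a transcription of Eq. (5). It assumes that both stroke angles are measured from the horizontal in the same direction, in degrees within [0, 180), so that the "one angle increases, the other decreases" of the interior angles becomes a common shift of both strokes. Under that reading the symmetry axis lies at the mean of the two stroke angles, and its deviation from 90° is the lean to be removed. Treating shear as a common angular shift is the approximation used in the text; a true shear changes the two angles by slightly different amounts. The function names are hypothetical.

```python
def lean_from_symmetry(alpha_deg, beta_deg):
    """Estimated lean in degrees of a class 3 group 2 character.

    alpha_deg / beta_deg: angles of the right / left virtual stroke, measured
    from the horizontal in the same direction, in [0, 180).
    """
    axis_deg = (alpha_deg + beta_deg) / 2.0  # direction of the symmetry axis
    return 90.0 - axis_deg                   # 0 for an upright character


def restore_symmetry(alpha_deg, beta_deg):
    """Shift both stroke angles by the estimated lean."""
    lean = lean_from_symmetry(alpha_deg, beta_deg)
    return alpha_deg + lean, beta_deg + lean


# Hypothetical slanted "V": right stroke at 60 degrees, left stroke at 100 degrees.
lean = lean_from_symmetry(60.0, 100.0)
alpha_up, beta_up = restore_symmetry(60.0, 100.0)
print(lean, alpha_up, beta_up)
```

After the correction the two angles sum to 180°, which is the symmetry condition used in the detection rules for group 2 characters.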

Shown in Fig. 13 is an example illustrating the rectification of a class 3 group 2 italic character “V”. Note that the symmetry between the left virtual stroke and the right virtual stroke can be recovered by performing the reverse shear transformation.

Fig. 13. Example illustrating the rectification of a class 3 group 2 character “V”.

7. EXPERIMENTAL RESULTS

In this section, experimental results are presented to demonstrate the validity of the proposed method in detecting and rectifying italic characters. Shown in Fig. 14 is the feature distribution of the sum of gradient square for the 10 numerals, 26 upper-case and 26 lower-case characters in the ‘Arial’ and ‘Times’ fonts, in italic and non-italic styles, respectively. Traditional italic detection methods have to be operated on at least a few characters. As perceived from the figure, a fixed threshold on the sum of gradient square cannot successfully separate the clusters of italic and non-italic style characters. Moreover, the feature clusters vary even more noticeably across font types. Although the variance might be balanced or neutralized by summing the feature values of a few characters, this may still fail when only very few characters are available. In contrast, our proposed algorithm functions very well even on a single character because it adopts stroke-based features.

Fig. 14. Feature distribution of each character in Arial and Times font types.

Table 2. Experimental results of italic detection.

             Detecting Rate
Font Type    FAR (%)    FRR (%)
Arial        0          0
Courier      1.7        0
Impact       1.2        0.7
Comic        0          0
Times        0.4        0.5
Tahoma       0          0
Trebuchet    0          0
Verdana      0          0
Average      0.41       0.15

Experiments were conducted on various font types. Fifty samples of each character in the different font types were selected as the testing images, and a satisfactory accuracy rate was obtained. The average false acceptance rate (FAR) in detecting italic style characters is 0.41% and the average false rejection rate (FRR) is 0.15%. The detailed experimental data are tabulated in Table 2.
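The reported averages can be reproduced from the per-font rows of Table 2 with a short check. The sketch assumes the averages are plain unweighted means over the eight fonts, which is what reproduces the reported figures; the variable names are hypothetical.

```python
# (FAR %, FRR %) per font type, as listed in Table 2.
table2 = {
    "Arial": (0.0, 0.0),
    "Courier": (1.7, 0.0),
    "Impact": (1.2, 0.7),
    "Comic": (0.0, 0.0),
    "Times": (0.4, 0.5),
    "Tahoma": (0.0, 0.0),
    "Trebuchet": (0.0, 0.0),
    "Verdana": (0.0, 0.0),
}


def column_mean(rows, index):
    """Unweighted mean of one column over all font types."""
    return sum(values[index] for values in rows.values()) / len(rows)


avg_far = column_mean(table2, 0)
avg_frr = column_mean(table2, 1)
print(f"average FAR = {avg_far:.2f}%, average FRR = {avg_frr:.2f}%")
print(f"complements = {100 - avg_far:.2f}%, {100 - avg_frr:.2f}%")
```

The means come out as 0.41% and 0.15%, and their complements, 99.59% and 99.85%, agree with the accuracy rates quoted in the abstract.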

Most detection errors result from errors in character classification. For example, some fonts add extra decorative strokes for beautification purposes, which may cause a character to be assigned to the wrong class.

8. CONCLUSIONS

The difficulty in italic detection is that there is no common regular feature inherent in all italic style characters. The features of different italic style characters differ, depending heavily on the shape of the character and the effect resulting from shear transformation. To solve this problem, a novel method is proposed which can be applied to all font types of italic characters. In our proposed approach, the characters are first classified into three classes according to the structure of their composing virtual strokes. Then, the common structural features embedded in each class of characters are extracted. Finally, the italic detection of each class of characters is performed by comparing the common structural features in the character images before and after shear transformation, without any threshold.

Although the virtual stroke is a simple and effective feature for analyzing the structural information of a character, extra strokes added for beautification purposes may result in errors in character classification. More detailed analysis of virtual strokes can help solve this problem. However, it will increase the overhead of italic detection and might even require performing OCR. Hence, how to obtain more structural information from virtual stroke analysis at lower cost is a goal to be pursued in the future.

As we know, the purpose of italic detection is to distinguish the italic style character from its corresponding non-italic style character. Currently, the rules generalized from the features are only suitable for italic detection of English characters. The rules will be different for different languages. Generalizing the italic detection rules so that they suit different languages and their associated font types is also a goal to be pursued in the future.

REFERENCES

1. S. Khoubyari and J. J. Hull, “Font and function word identification in document rec-ognition,” Computer Vision and Image Understanding, Vol. 63, 1996, pp. 66-74.

2. R. Cooperman, “Producing good font attributes determination using error-prone in-formation,” International Society for Optical English Journal, Vol. 3027, 1997, pp. 50-57.

3. Y. Zhu, T. N. Tan, and Y. H. Wang, “Font recognition based on global texture analy-sis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 23, 2001, pp. 1192-1200.

4. H. Shi and T. Pavlidis, “A system for text recognition based on graphic embedding matching,” in Proceedings of International Association for Pattern Recognition Workshop on Document Analysis System, 1996, pp. 413-427.

5. H. Shi and T. Pavlidis, “Font recognition and contextual processing for more accu-rate text recognition,” in Proceedings of the 14th International Conference on Docu-ment Analysis and Recognition, Vol. 1, 1997, pp. 39-44.

6. B. B. Chaudhuri and U. Garain, “Automatic detection of italic, bold and all-capital words in document images,” in Proceedings of the 14th Conference on Pattern Rec-ognition, Vol. 1, 1998, pp. 610-612.

7. B. B. Chaudhuri and U. Garain, “Extraction of type style based meta-information from imaged documents,” International Journal on Document Analysis and Recog-nition, Vol. 3, 2001, pp. 138-149.

8. C. Sun and D. Si, “Skew and slant correction for document image using gradient di-rection,” in Proceedings of the 14th International Conference on Document Analysis and Recognition, Vol. 1, 1997, pp. 142-146.

9. A. Zramdini and R. Ingold, “Optical font recognition using typographical features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, 1998, pp. 877-882.

10. L. Zhang, Y. Lu, and C. L. Tan, “Italic font recognition using stroke pattern analysis on wavelet decomposed word images,” in Proceedings of the 17th International Conference on Pattern Recognition, Vol. 4, 2004. pp. 835-838.

11. Y. Li, S. Naoi, M. Cheriet, and C. Y. Suen “A segmentation method for touching italic characters,” in Proceedings of the 17th International Conference on Pattern Recognition, Vol. 2, 2004, pp. 594-597.

12. K. C. Fan, M. G. Wen, and D. F. Chen, “Skeletonization of binary images with non-uniform width using block decomposition and contour vector matching method,” Pattern Recognition, Vol. 31, 1998, pp. 823-838.

Kuo-Chin Fan (范國清) was born in Hsinchu, Taiwan, on 21 June 1959. He received his B.S. degree in Electrical Engineering from National Tsing Hua University, Taiwan, in 1981. In 1983, he worked for the Electronic Research and Service Organization (ERSO), Taiwan, as a Computer Engineer. He started his graduate studies in Electrical Engineering at the University of Florida in 1984 and received the M.S. and Ph.D. degrees in 1985 and 1989, respectively. From 1984 to 1989 he was a Research Assistant in the Center for Information Research at the University of Florida. In 1989, he joined the Institute of Computer Science and Information Engineering at National Central University, where he became a professor in 1994. He was the chairman of the department from 1994 to 1997. Currently, he is the director of the Communication Research Center at National Central University. His current research interests include image analysis, optical character recognition, and document analysis.

Chien-Hsiang Huang (黃健興) received the B.S. degree in Computer Science and Information Engineering from Chung Yuan Christian University, Chungli, Taiwan, in 1997. He is working toward his Ph.D. degree at National Central University, Chungli, Taiwan. His research interests include pattern recognition, image processing, document analysis, and document understanding.