[IEEE 2008 The Eighth IAPR International Workshop on Document Analysis Systems (DAS) - Nara, Japan (2008.09.16-2008.09.19)] 2008 The Eighth IAPR International Workshop on Document Analysis Systems

A Robust System to Detect and Localize Texts in Natural Scene Images

Yi-Feng Pan, Xinwen Hou, Cheng-Lin Liu

National Laboratory of Pattern Recognition

Institute of Automation, Chinese Academy of Sciences

95 Zhongguancun East Road, Beijing 100190, P. R. China

E-mails: {yfpan, xwhou, liucl}@nlpr.ia.ac.cn

Abstract

In this paper, we present a robust system to accurately detect and localize texts in natural scene images. For text detection, a region-based method utilizing multiple features and a cascade AdaBoost classifier is adopted. For text localization, a window grouping method integrating text line competition analysis is used to generate text lines. Then, within each text line, local binarization is used to extract candidate connected components (CCs), and non-text CCs are filtered out by a Markov Random Field (MRF) model, through which text lines can be localized accurately. Experiments on the public benchmark ICDAR 2003 Robust Reading and Text Locating Dataset (see footnote 1) show that our system is comparable to the best existing methods in both accuracy and speed.

1. Introduction

With the wide use of digital image capture devices such as digital cameras, mobile phones and PDAs, content-based image analysis techniques have received more and more attention in recent years. Among all the contents of images, such as faces, humans and scenes, text has attracted great interest, since it provides a lot of useful information that can be easily understood by both humans and computers. In [7], Jung et al. call an integrated text-based image analysis system a Text Information Extraction (TIE) system, which is composed of three parts: 1) text detection, 2) text localization, and 3) text character recognition, where the first two parts often overlap with each other. In this paper, we design a robust system consisting of the first two parts to detect and localize texts in natural scene images.

Text information in natural scene images is difficult to analyze due to variations in size, font, color and alignment, and it is often affected by complex backgrounds, light and shadow, image distortion and degradation. In the past years, many methods have been proposed to solve the text detection and localization problem [7, 13]. According to the basic unit to be analyzed, these methods can be categorized into two classes: connected component (CC)-based methods and region-based methods.

1 http://algoval.essex.ac.uk/icdar/Datasets.html

CC-based methods [6, 27, 15, 5] are based on the fact that texts in images can be seen as sets of separate connected components, each of which has a distinct intensity or color distribution and linked edge contours. These methods normally include three steps: 1) CC extraction to extract CCs from images, 2) CC analysis to determine whether CCs belong to text components, and 3) post-processing to group text components into text regions such as text lines. Although some CC-based methods give encouraging results, two difficulties degrade system performance. First, CCs are hard to extract accurately because of image degradation and noise. Second, even if CCs can be extracted accurately, designing a fast and highly reliable CC analysis algorithm is also difficult, as there are too many text-like components in images.

In recent years, region-based methods [1, 8, 11, 23] have received more and more attention with the development of image analysis, pattern recognition and machine learning techniques. These methods are based on the fact that text regions in images have characteristics distinct from those of non-text regions, such as a high-density gradient distribution and distinctive texture and structure, which can be used to differentiate text regions from non-text regions effectively. Most region-based methods consist of two steps: text detection and text localization. For text detection, features of sampled regional windows are extracted to determine whether they contain text. Window grouping or clustering methods are then employed to generate candidate text lines, which can be seen as coarse text localization. In some cases, post-processing such as image segmentation or profile projection analysis is employed to localize texts further.

Compared with CC-based methods, region-based methods have similar localization accuracy but are less sensitive to image noise. However, the computational cost of feature extraction and window classification is much heavier, since all sub-windows in the image need to be scanned and analyzed. To solve this problem, Chen et al. [2] proposed a cascade AdaBoost classifier system for text detection and localization, which was inspired by Viola's face detection work [21] and achieved state-of-the-art performance [16].

The Eighth IAPR Workshop on Document Analysis Systems

978-0-7695-3337-7/08 $25.00 © 2008 IEEE

DOI 10.1109/DAS.2008.42

Figure 1. The flow chart of the proposed system.

In this paper, we design a robust system similar to that of [2] and propose three modifications to improve its performance: 1) utilizing Histogram of Oriented Gradient (HOG) and multi-scale Local Binary Pattern (msLBP) features to build the candidate feature pool for text detection, 2) introducing a text line competition analysis technique based on Relaxation Labeling (RL) to filter out incorrect text lines around correct ones, and 3) adopting a connected component analysis (CCA) approach based on a Markov Random Field (MRF) model to filter out non-text CCs.

The rest of this paper is organized as follows. Section 2 presents the framework of our system. Detailed descriptions of its two important parts, region analysis and text localization, are given in Sections 3 and 4, respectively. The experimental results and performance analysis are given in Section 5, and Section 6 provides conclusions and future work.

2. Framework

Similar to most region-based methods, our system is composed of two parts: text detection and text localization.

The first part, text detection, consists of two steps: pre-processing and region analysis. In the pre-processing step, the image is first transformed from RGB to gray level, since the system currently handles only gray-level images. The gray-level image is then re-scaled by nearest neighbor interpolation to form an image pyramid, since texts of different sizes may need to be detected in the image. In the region analysis step, feature integral map generation, window sampling, feature extraction and window classification are applied sequentially to detect candidate text windows from the image pyramid.
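The pre-processing step can be sketched as follows. This is only an illustration, not the authors' C++ code: the RGB-to-gray weights, the function names and the stopping rule (stop when a 16×32 window no longer fits) are assumptions, while the scale step of 1.2 and the 16×32 window size are the values reported in Section 5.1.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an H x W x 3 RGB array to gray level.
    The BT.601 luma weights are an assumption; the paper does not state them."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(np.float64) @ weights

def resize_nearest(gray, scale):
    """Shrink a gray image by `scale` (> 1) with nearest neighbor sampling."""
    h, w = gray.shape
    new_h, new_w = max(1, int(h / scale)), max(1, int(w / scale))
    rows = np.minimum((np.arange(new_h) * scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) * scale).astype(int), w - 1)
    return gray[np.ix_(rows, cols)]

def build_pyramid(gray, step=1.2, min_size=(16, 32)):
    """Return successively smaller images until a window of `min_size`
    (height, width) no longer fits. The stopping rule is hypothetical."""
    levels, scale = [], 1.0
    while True:
        level = gray if scale == 1.0 else resize_nearest(gray, scale)
        if level.shape[0] < min_size[0] or level.shape[1] < min_size[1]:
            break
        levels.append(level)
        scale *= step
    return levels
```

Scanning a fixed-size window over every level of such a pyramid is what lets a single 16×32 classifier respond to texts of different sizes.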

The second part, text localization, also consists of two steps: text line generation and text extraction. In the text line generation step, a window grouping approach is used to group the detected windows into candidate text lines, and the incorrect lines around the correct ones are then filtered out by text line competition analysis. In the text extraction step, connected components are extracted from each text line region by local binarization, and a connected component analysis approach based on the MRF model is then employed to filter out non-text components and localize text lines accurately.

The flow chart of the proposed system is given in Fig. 1, and the technical details of its important parts are explained in the following sections.

3. Region Analysis

For region-based methods, sub-windows scanned at different positions and scales on the image pyramid need to be classified quickly. In this section we present efficient feature extractors and the classifier structure.

3.1 Feature Extraction

For the text detection problem, several feature extractors, such as Gaussian filters [23], the Discrete Cosine Transform (DCT) [25] and wavelets [4], have been used. Recently, to arrange features of different computational complexity for high detection speed, Chen [2] used four types of features to build the feature pool (gray-level means and variances, first-order differential features, histograms of intensity and gradient, and edge linking features) and obtained efficient text detection performance. However, features such as the histogram of intensity and edge linking rely on strict assumptions, which are not always satisfied because of image degradation and noise. We instead adopt two different features, the Histogram of Oriented Gradient and the multi-scale Local Binary Pattern, to build the feature pool; our experiments show that they are also effective for the text detection task.

3.1.1 Histogram of Oriented Gradient (HOG)

Histogram of Oriented Gradient (HOG) and its variations have been widely used in computer vision [3] and optical character recognition (OCR) [14], owing to their strong ability to describe the strength and regularity of object contours. In our implementation of HOG, each pixel's gradient vector, calculated by the Sobel operator, is decomposed into 4 orientations as shown in Fig. 2c or into 8 directions as shown in Fig. 2d, based on the parallelogram law. Within a specific local region, the feature value of each HOG bin is calculated by accumulating the corresponding bin values over all pixels. To reduce sensitivity to illumination, all HOG feature values are normalized by dividing by the intensity standard deviation (STD) of the sampled window.

Figure 2. Histogram of Oriented Gradient (HOG). (a) 3×3 horizontal and vertical Sobel masks. (b) Gradient vector decomposition by the parallelogram law. (c) 4-orientation decomposition mask. (d) 8-direction gradient decomposition mask.

In object recognition and detection, it has been shown that local features are more discriminative and robust than global features, and that local features extracted within different local regions achieve different performance. Optimal local regions therefore need to be defined to achieve the best performance. Fig. 3b gives the average gradient responses of the 4-orientation decomposition of 2500 normalized text windows (some examples are given in Fig. 3a), from which we find that the gradient response of a specific orientation is regularly distributed within a specific local region. Based on these observations, we designed 14 local regions within the normalized window, including 10 single local regions (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) and 4 overlapped local regions (2-3, 3-4, 2-3-4, 6-7), as shown in Fig. 3c, from which more informative HOG features can be extracted.

Figure 3. Local regions design. (a) Some normalized text windows. (b) Average gradient responses with 4-orientation decomposition. (c) Local regions defined to extract local features.

To speed up HOG feature extraction, we adopt the feature integral map technique, which has been widely used in face detection [21] and object tracking [19]. After gradient decomposition, 4-orientation or 8-direction gradient integral maps are generated, from which the HOG value of each local region can be calculated with only a few arithmetic operations. Furthermore, the mean gray value and mean squared gray value of each window, which are used to compute the intensity STD, can also be calculated from gray-level integral maps.
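A minimal sketch of the integral map idea is given below. It shows how the sum over any rectangular region reduces to four table lookups, and how the window STD follows from the gray-level and squared gray-level integral maps. The function names and the zero-padded table layout are illustrative assumptions; the same `region_sum` would apply to each per-orientation gradient map.

```python
import numpy as np

def integral_map(values):
    """Summed-area table with an extra zero row and column in front,
    so that region sums need no special-casing at the image border."""
    table = np.zeros((values.shape[0] + 1, values.shape[1] + 1), dtype=np.float64)
    table[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
    return table

def region_sum(table, top, left, bottom, right):
    """Sum of values[top:bottom, left:right] from four table lookups."""
    return (table[bottom, right] - table[top, right]
            - table[bottom, left] + table[top, left])

def window_std(gray_table, sq_table, top, left, bottom, right):
    """Intensity STD of a window, computed as sqrt(E[g^2] - E[g]^2)."""
    n = (bottom - top) * (right - left)
    mean = region_sum(gray_table, top, left, bottom, right) / n
    mean_sq = region_sum(sq_table, top, left, bottom, right) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return np.sqrt(max(mean_sq - mean * mean, 0.0))
```

Each table is built once per pyramid level, after which the cost of a region feature no longer depends on the region size.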

3.1.2 Multi-scale Local Binary Pattern (msLBP)

Although the HOG feature is useful for capturing the gradient response characteristics of text contours, some objects, such as vegetation and barriers, are often confused with text patterns in the HOG feature space [2]. Zhang et al. [24] have shown that the Local Binary Pattern (LBP) feature captures not only texture characteristics but also local structure characteristics, which makes it suitable for text detection. In this paper, we use the multi-scale Local Binary Pattern (msLBP) as a supplement to HOG.

The original LBP operator [18] compares the intensity value of the central pixel with those of its 3×3 neighbors. The output of the comparison is encoded by convolving with a 3×3 coding mask and transformed into an integer code from 0 to 255. The feature value is then calculated from the histogram of all LBP codes within a local region. Fig. 4a gives an example of the original LBP encoding process. Based on the observations that text character strokes always have a fixed width and that the gradient image is visually more discriminative than the gray-level image, we extend the original LBP operator in two aspects. First, LBP codes are computed not only on the gray-level image but also on the gradient-level image. For computational simplicity, we use the absolute distance to represent the gradient magnitude of each pixel x, defined as $grad(x) = |dev_x(x)| + |dev_y(x)|$, where $dev_x(\cdot)$ and $dev_y(\cdot)$ are the horizontal and vertical Sobel operators, respectively. Second, we calculate large-scale LBP values by elongating the distance between the central pixel and its neighbors or by averaging the corresponding neighborhoods, as shown in Fig. 4b.

Figure 4. Multi-scale Local Binary Pattern (msLBP). (a) The original LBP encoding process. (b) msLBP masks with three different scales (original, elongating and averaging), where the central pixel is drawn in black and its compared neighbors are drawn in gray.
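The LBP encoding and the gradient-level image described above can be illustrated with the sketch below. It covers the original operator and the "elongating" scale only; the bit order of the coding mask, the `>=` comparison convention and the `radius` parameter are assumptions, since the exact masks are given only in Fig. 4.

```python
import numpy as np

# Offsets of the 8 neighbors, clockwise from top-left.
# The bit assigned to each neighbor is an assumed convention.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(image, radius=1):
    """LBP code (0..255) for every pixel at least `radius` from the border.
    radius > 1 elongates the distance between the center and its neighbors."""
    img = image.astype(np.float64)
    h, w = img.shape
    r = radius
    center = img[r:h - r, r:w - r]
    codes = np.zeros(center.shape, dtype=np.int32)
    for bit, (dy, dx) in enumerate(OFFSETS):
        neighbor = img[r + dy * r:h - r + dy * r, r + dx * r:w - r + dx * r]
        codes |= (neighbor >= center).astype(np.int32) << bit
    return codes

def sobel_abs_gradient(image):
    """grad(x) = |dev_x(x)| + |dev_y(x)| on interior pixels (3x3 Sobel masks)."""
    img = image.astype(np.float64)
    dx = (img[:-2, 2:] + 2 * img[1:-1, 2:] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[1:-1, :-2] - img[2:, :-2])
    dy = (img[2:, :-2] + 2 * img[2:, 1:-1] + img[2:, 2:]
          - img[:-2, :-2] - 2 * img[:-2, 1:-1] - img[:-2, 2:])
    return np.abs(dx) + np.abs(dy)
```

Applying `lbp_codes` to the output of `sobel_abs_gradient` gives the gradient-level codes, and applying it to the gray image gives the gray-level codes.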

Because the msLBP code distributions of text patterns are distinctive while those of non-texts are dispersed [24], we use the matching distance to calculate the feature value within a specific local region, which is defined as

$$\mathrm{matchDistance}(region) = \sum_{i=1}^{N} |l_i - m_i|,$$

where $N$ is the number of codes in the code book, $l_i$ is the number of occurrences of the i-th code in the region and $m_i$ is the average number of occurrences of the i-th code over all training text samples. For consistency, the local regions defined for msLBP are the same as those for HOG. Since using all 256 codes of one msLBP feature to build the code book is too redundant given the limited number of pixels within a local region, we employ a simple code selection process to remove non-informative codes. Specifically, we first calculate the matching distances of all training samples and the Fisher ratio of each code. The code with the largest Fisher ratio is selected first to join the code book. After that, we iteratively select the code with the largest score, defined as the ratio between its Fisher ratio and its maximum correlation with the already selected codes, to join the code book. This process repeats until the fixed number of codes is reached. In our experiments, we limited the number of selected informative codes N to no more than 20.
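The matching distance and the greedy code selection can be sketched as below. The paper does not specify how the Fisher ratio and the correlation between codes are computed, so they are taken here as precomputed inputs (`fisher`, `corr`); the use of the absolute correlation and the small guard against division by zero are assumptions.

```python
import numpy as np

def code_histogram(codes, codebook):
    """Count how often each code book entry occurs among a region's LBP codes."""
    flat = np.asarray(codes).ravel()
    return np.array([(flat == c).sum() for c in codebook], dtype=np.float64)

def match_distance(hist, mean_text_hist):
    """sum_i |l_i - m_i| between a region histogram and the mean text histogram."""
    return float(np.abs(hist - mean_text_hist).sum())

def select_codes(fisher, corr, max_codes=20):
    """Greedy selection: take the code with the largest Fisher ratio first,
    then repeatedly add the code maximizing
    fisher[c] / max_s |corr[c, s]| over the already selected codes s."""
    selected = [int(np.argmax(fisher))]
    while len(selected) < min(max_codes, len(fisher)):
        best, best_score = None, -np.inf
        for c in range(len(fisher)):
            if c in selected:
                continue
            max_corr = max(abs(corr[c, s]) for s in selected)
            score = fisher[c] / max(max_corr, 1e-12)  # guard is an assumption
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
    return selected
```

Dividing by the maximum correlation penalizes codes that duplicate information already in the code book, so a moderately discriminative but uncorrelated code can be preferred over a highly discriminative redundant one.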

The total number of features depends on the number of HOG decompositions and the number of msLBP scales. If we adopt 4-orientation HOG and three-scale msLBP on both gray-level and gradient-level images, we have 140 features, consisting of 56 (4 × 14) HOG features and 84 (2 × 3 × 14) msLBP features. To improve the discriminative ability of the features, we use the same strategy as [2] and build the feature pool from single features and pair-wise features, the latter obtained by joining any two different features. With this configuration, the final feature pool includes 9870 (140 + 140 × 139 ÷ 2) features.
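The feature counts quoted above can be checked with a few lines of arithmetic; the variable names are illustrative only.

```python
from math import comb

regions = 14                      # local regions per window (Fig. 3c)
hog = 4 * regions                 # 4 orientations per region
mslbp = 2 * 3 * regions           # 2 image types x 3 scales per region
single = hog + mslbp              # single features
pool = single + comb(single, 2)   # plus all unordered pairs of distinct features

print(hog, mslbp, single, pool)   # 56 84 140 9870
```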

3.2 Classifier Design

After extracting features from the sampled windows, a high-performance classifier is needed to decide whether these windows contain text. Among learning-based classifiers such as MLP and SVM [1], the cascade AdaBoost classifier has been used successfully in many fields such as face detection [21], where it has shown its efficiency and accuracy.

To build the cascade AdaBoost classifier, several AdaBoost strong classifiers are connected sequentially, each of which is trained on the training set to accept almost all texts with a very high detection rate $d_i$ and to reject a certain fraction of non-texts with a moderate false positive rate $f_i$. The total detection rate of the cascade classifier is $D = \prod_{i=1}^{L} d_i$ and the total false positive rate is $F = \prod_{i=1}^{L} f_i$, where $L$ is the number of cascade layers. For fixed $d_i$ and $f_i$, the training procedure stops when the desired total false positive rate is reached. During testing, only windows classified as text by all previous layers are passed as input to the next layer.
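As a worked illustration of these products, the sketch below plugs in the per-layer targets reported in Section 5.1 (a detection rate of 99.9% and a false positive rate of 45% per layer, with a total false positive target of 3 × 10^-6). It assumes every layer exactly meets its targets, which a trained cascade need not do, so the layer count is an estimate rather than a reported result. The `cascade_accepts` helper and its `stages` argument are hypothetical.

```python
def cascade_rates(d, f, layers):
    """Total detection rate D and false positive rate F for identical layers."""
    return d ** layers, f ** layers

def layers_needed(f, target_F):
    """Smallest number of layers L such that f**L <= target_F."""
    layers, total = 0, 1.0
    while total > target_F:
        total *= f
        layers += 1
    return layers

def cascade_accepts(window_features, stages):
    """A window is labeled text only if every stage accepts it.
    `all` stops at the first rejecting stage, which gives the early exit."""
    return all(stage(window_features) for stage in stages)

L = layers_needed(0.45, 3e-6)
D, F = cascade_rates(0.999, 0.45, L)
print(L, round(D, 4), F)
```

Under these assumptions about 16 layers are needed, and the total detection rate stays above 98%. Because most non-text windows are rejected by the first few layers, only a small fraction of windows ever reach the more expensive later stages.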

In general, two factors influence the performance of the cascade AdaBoost classifier: 1) the weak learners, each of which corresponds to one feature in the feature pool, and 2) the AdaBoost learning algorithm. In this paper, we adopt commonly used choices: a depth-2 decision tree (DT) trained with the ID3 algorithm [20] as the weak learner, and the Discrete AdaBoost learning algorithm. In the experiments, we also compare the Fisher Linear Discriminant (FLD) [9] with DT, since FLD is also a widely used weak learner.

4. Text Localization

After region analysis, text lines need to be localized accurately from the detected candidate text windows in order to benefit text character recognition.

4.1 Text Line Generation

To generate text lines, a grouping approach is used that repeatedly merges two windows whose vertical overlap exceeds 80% of the smaller of their two heights and which also overlap horizontally, until no windows can be merged further. The coordinates of each text line are calculated by averaging the grouped windows.
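The grouping rule can be sketched as follows, with windows given as `(x0, y0, x1, y1)` tuples. Two points are assumptions rather than the paper's stated procedure: the repeated merging is approximated by a union-find over all pairwise mergeable windows, and the text line box averages the vertical coordinates while spanning the full horizontal extent of its windows.

```python
def can_merge(a, b, ratio=0.8):
    """True if the vertical overlap exceeds `ratio` of the smaller height
    and the windows also overlap horizontally."""
    v_overlap = min(a[3], b[3]) - max(a[1], b[1])
    h_overlap = min(a[2], b[2]) - max(a[0], b[0])
    min_height = min(a[3] - a[1], b[3] - b[1])
    return v_overlap > ratio * min_height and h_overlap > 0

def group_windows(windows):
    """Group mergeable windows (transitively) and return one box per group."""
    parent = list(range(len(windows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(windows)):
        for j in range(i + 1, len(windows)):
            if can_merge(windows[i], windows[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, w in enumerate(windows):
        groups.setdefault(find(i), []).append(w)

    lines = []
    for members in groups.values():
        n = len(members)
        lines.append((min(w[0] for w in members),        # leftmost edge
                      sum(w[1] for w in members) / n,    # averaged top
                      max(w[2] for w in members),        # rightmost edge
                      sum(w[3] for w in members) / n))   # averaged bottom
    return lines
```

For example, two 32×16 windows shifted by half a window width on almost the same baseline end up in one text line, while a distant window forms a line of its own.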

Since we do not add 'pseudo' texts, which are smaller or larger than the standard text line, to the non-text samples during training, some incorrect text lines are generated around correct ones by grouping these 'pseudo' windows. To avoid this degradation, we introduce a competition analysis mechanism based on Relaxation Labeling (RL) to filter out these incorrect text lines, based on the fact that the detected windows grouped into correct text lines usually have higher classifier outputs than the 'pseudo' windows.

In detail, the confidence value of each text line is first initialized with the average classifier output of its grouped windows. After building connections between overlapping text lines, RL is used to repeatedly update the compatibility and the confidence value of each text line until convergence. Finally, text lines with confidence values higher than a fixed threshold (0.5) are considered correct, and the others are considered incorrect. The details of the RL algorithm can be found in [12]; the compatibility formula used here is defined as $r_{i,j}(c_i, c_j) = 1/(1 + \exp(-\delta \times (over_{i,j} - 0.8)))$, where $i$ and $j$ denote two text lines with class labels $c_i$ and $c_j$, and $over_{i,j}$ is the vertical overlapping ratio between $i$ and $j$. If $c_i \neq c_j$, $\delta$ is fixed to a positive value, and vice versa.
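The compatibility formula can be explored numerically with the sketch below. It reads "vice versa" as a negative $\delta$ when the two labels agree, and the magnitude of 10 is a hypothetical value, since the paper does not report it. The RL update itself is described in [12] and is not reproduced here.

```python
import math

def compatibility(overlap, same_label, delta=10.0, pivot=0.8):
    """r_ij = 1 / (1 + exp(-delta * (over_ij - 0.8))).
    delta is positive when the labels differ and negative when they agree
    (an interpretation); |delta| = 10 is a hypothetical magnitude."""
    signed = -delta if same_label else delta
    return 1.0 / (1.0 + math.exp(-signed * (overlap - pivot)))

for over in (0.5, 0.8, 1.0):
    print(over,
          round(compatibility(over, same_label=False), 3),
          round(compatibility(over, same_label=True), 3))
```

With this reading, two strongly overlapping lines are most compatible when they carry different labels, which is what makes them compete: supporting one as correct pushes its overlapping neighbor toward incorrect.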

Fig. 5 gives an example of the text line generation process, which includes the original image and the results of window detection, text line grouping, text line competition analysis and coarse text line localization. Note that the incorrect text lines, drawn as black rectangles, are all filtered out, and the correct ones, drawn as white rectangles, are all preserved.

Figure 5. An example of the text line generation process (the brighter the rectangle, the higher the confidence value of the detected window or text line).

4.2 Text Extraction

For a full TIE system, text extraction is an important stage before character recognition. Several approaches have been used to extract texts from text line images [1, 22]. Among these approaches, local binarization is appealing for its simple implementation and satisfactory performance. In this paper, we employ a variation of Niblack's binarization method [17] to extract connected components from the text line image; the result can also be used to localize the text line accurately. The formula to binarize each pixel is defined as follows:

$$b(x) = \begin{cases} 0, & \text{if } gray(x) < \mu(x) - k \cdot \sigma(x); \\ 255, & \text{if } gray(x) > \mu(x) + k \cdot \sigma(x); \\ 100, & \text{otherwise,} \end{cases} \tag{1}$$

where $\mu(x)$ and $\sigma(x)$ are the intensity mean and STD within a fixed-size window centered on the pixel $x$, and $k$ is the smoothing coefficient, which is set to 0.5. In the binarized image, only connected components with value 0 or 255 are extracted as candidate text components.
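Equation (1) can be sketched in a few lines. The window half-size of 7 pixels and the replicated border padding are assumptions (the paper only says the window has a fixed size); k = 0.5 is the value given above.

```python
import numpy as np

def local_mean_std(gray, half):
    """Mean and STD in a (2*half+1) x (2*half+1) window around each pixel.
    Borders are handled by replicating edge pixels (an assumption)."""
    img = gray.astype(np.float64)
    padded = np.pad(img, half, mode="edge")
    size = 2 * half + 1

    def box_sum(values):
        # Summed-area table, then one window sum per output pixel.
        table = np.zeros((values.shape[0] + 1, values.shape[1] + 1))
        table[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
        return (table[size:, size:] - table[:-size, size:]
                - table[size:, :-size] + table[:-size, :-size])

    n = size * size
    mean = box_sum(padded) / n
    var = np.maximum(box_sum(padded ** 2) / n - mean ** 2, 0.0)
    return mean, np.sqrt(var)

def binarize(gray, half=7, k=0.5):
    """Three-level output: 0 (dark), 255 (bright), 100 (undecided)."""
    mean, std = local_mean_std(gray, half)
    out = np.full(gray.shape, 100, dtype=np.uint8)
    out[gray < mean - k * std] = 0
    out[gray > mean + k * std] = 255
    return out
```

The middle value 100 marks pixels that are neither clearly darker nor clearly brighter than their neighborhood, so both dark-on-light and light-on-dark text can be extracted from the same output.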

Most previous works filter out non-text components by heuristic rules [2] or by making decisions from OCR results [1]. However, these methods struggle to give robust results, because they involve free parameters or impractical assumptions. In this paper, we propose an adaptive connected component analysis (CCA) process based on a Markov Random Field (MRF) model, exploiting the fact that text CCs have characteristics distinct from those of non-text CCs and similar to those of neighboring text CCs.

In detail, CCA can be formulated as a labeling problem in the maximum a posteriori (MAP) framework. Given the connected component set $C = \{c_1, c_2, ..., c_N\}$, the corresponding observation set $O = \{o_1, o_2, ..., o_N\}$ and the label set $F = \{f_1, f_2, ..., f_N\}$, the MAP solution with the MRF model assigns a label to each CC by maximizing the posterior probability:

$$F^* = \arg\max_F P(F|O) = \arg\max_F \prod_{i=1}^{N} p(O_i|F_i) P(F_i), \tag{2}$$

where $p(O_i|F_i)$ is the likelihood function of $O_i$ given $F_i$ and $P(F_i)$ is the prior probability of $F_i$, both of which can be calculated with single-site and neighboring-site potentials. The MRF model, inspired by [26], estimates the likelihood potentials with a supervised learning algorithm and the prior potentials directly from the training set. We design six single-component features (width, height, aspect ratio, occupy ratio, contour gradient and offset) and six pair-component features (horizontal/vertical distance, shape difference, horizontal/vertical overlap and intensity difference). A multilayer perceptron (MLP) is then employed as the likelihood potential ($p(O_i|F_i)$) generator, since the output of an MLP can be viewed directly as an approximate conditional probability for each class. To construct the component adjacency graph, a three-nearest-neighbors rule based on component centroid distance is employed.
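The adjacency graph construction can be sketched as below. The function name is illustrative, and treating the graph as undirected (an edge exists if either component lists the other among its three nearest) is an assumption.

```python
import numpy as np

def knn_graph(centroids, k=3):
    """Undirected edges linking each component to its k nearest centroids."""
    pts = np.asarray(centroids, dtype=np.float64)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a component is not its own neighbor
    edges = set()
    for i in range(len(pts)):
        for j in np.argsort(dist[i])[:k]:
            if np.isfinite(dist[i, j]):
                edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)

# Five component centroids (x, y) along a rough text line.
print(knn_graph([(0, 0), (10, 1), (21, 0), (30, 2), (80, 40)]))
```

The pair-component features listed above would then be evaluated only on the edges of this graph, which keeps the number of neighboring-site potentials linear in the number of components.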

An example of the text extraction process is given in Fig. 6, where from top left to bottom right are the original image, the binarized image, candidate CC extraction (CCs enclosed in red rectangles) and graph construction (adjacent CCs linked with blue lines), the CCs remaining after MRF-based filtering, and the final text extraction result, respectively.

Figure 6. An example of the text extraction process with local binarization and MRF-based connected component analysis.

5. Experiments

To evaluate the performance of the proposed system, we ran experiments under different settings and compared it with the best existing methods on a public benchmark dataset.

5.1. Dataset and Settings

The dataset we used is the ICDAR 2003 Robust Reading and Text Locating Dataset, which contains the TrialTrain and TrialTest sets. The text samples for training the cascade AdaBoost classifier were collected from 258 TrialTrain set images and generated by randomly perturbing the manually labeled positions. The non-text samples (equal in number to the text samples) for training each AdaBoost classifier were generated by the bootstrap method from regions of the TrialTrain set that do not overlap with text regions. All training samples were normalized to 16×32, which is consistent with the detection window. The scale step of the image pyramid was fixed to 1.2, and the number of pyramid layers was set according to the original image size. For training the MRF model, we labeled 4395 text CCs and 2251 non-text CCs within text line regions on the TrialTrain set images. All 249 TrialTest set images, containing 655 text lines, were used to evaluate the proposed system.

To design the cascade AdaBoost classifier, we set the detection rate $d_i$ to 99.9% and the false positive rate $f_i$ to 45% for each cascade layer, and the target total false positive rate $F$ to $3 \times 10^{-6}$. To speed up training and testing, we used HOG throughout the whole training procedure and added msLBP only after the total false positive rate fell below $5 \times 10^{-4}$, since HOG has much lower computational complexity.

Our system was implemented in C++, and all experiments were run on a PIV 3.4 GHz desktop computer running Windows XP.

5.2. Results

In all experiments, we adopted the performance evaluation criterion of the ICDAR 2005 Text Locating Competition [16], which defines the precision rate and recall rate based on the area matching ratio, except that the text line rather than the text word was adopted as the basic evaluation unit.

To estimate the extent to which the sample size influences system performance, we evaluated our system with three randomly generated training sets of different sizes (5000, 10000 and 20000, for both texts and non-texts). The result in Fig. 7a shows that using a larger training set improves system performance only slightly, which agrees with the result of [10]. The remaining experiments were therefore all performed with 5000 training samples.

In the second experiment, we compared different feature pool configurations. Besides the original feature pool, we tested three other configurations: 1) 4-orientation HOG only, 2) 8-direction HOG only, and 3) 8-direction HOG and msLBP together. The result in Fig. 7b shows that: 1) 4-orientation HOG is enough to capture the texture characteristics of text, whereas 8-direction HOG is so complex that it degrades performance through over-fitting, and 2) msLBP is a meaningful supplement to HOG that improves performance.

The next experiment evaluates which type of weak learner is better suited to the AdaBoost learning algorithm. As stated in Section 3.2, we compared the depth-2 DT and FLD by using each as the weak learner to train the AdaBoost classifier. The result in Fig. 7c shows that, for text detection, DT performs similarly to FLD but needs fewer weak learners (121 DTs vs. 204 FLDs), which might be due to the nonlinearity of DT.

To illustrate the effectiveness of CC extraction and analysis for localizing text lines accurately, we compared the performance after this step with the performance after only coarse text line localization by window grouping and text line competition analysis. Fig. 7d shows that accurate text line localization can improve system performance significantly.

Figure 7. Performance evaluations based on ROC curves (a) with different training set sizes, (b) with different feature pools, (c) with different weak learners, (d) with different localization steps.

Table 1. Evaluation results comparing the proposed method with two other methods

Method                        Recall rate (×100%)   Precision rate (×100%)   Average time (s)
1st ICDAR05 (CC-based)        0.62                  0.67                     14.4
2nd ICDAR05 (region-based)    0.60                  0.60                     0.35
The proposed method           0.68                  0.67                     1.5

Table 1 compares the proposed method with the top two methods of the ICDAR 2005 Text Locating Competition. It should be noted that the results of the other two methods were obtained on a different test set and with different evaluation units, because we could not obtain the exact test set and post-processing package. Nevertheless, the proposed method can be said to be roughly comparable to these methods in both accuracy and speed.

Fig. 8 shows some text localization examples obtained with the proposed method and Chen's method (on the public Internet evaluation platform; see footnote 2).

6. Conclusions and Future Works

In this paper, we propose a robust system to detect and localize texts in natural scene images. The experimental results show that: 1) the feature pool composed of Histogram of Oriented Gradient (HOG) and multi-scale Local Binary Pattern (msLBP) features is suitable for the text detection problem, since these features capture the texture and structure characteristics of text regions and are less sensitive to image noise, 2) the Relaxation Labeling (RL) algorithm is effective in filtering out incorrect text lines grouped from 'pseudo' text windows, and 3) CC analysis based on the MRF model is a robust approach to filtering out non-text components and localizing text lines accurately.

In the future, the system will be improved in two aspects: 1) adding more features to improve the discriminative ability of the feature pool, and 2) modifying the cascade classifier learning procedure to make the system more accurate and faster.

7. Acknowledgments

This work is supported by the Hundred Talents Program of the Chinese Academy of Sciences.

2 Trial web form for evaluating text locaters (http://algoval.essex.ac.uk:8080/textloc/upload.html)

Figure 8. Examples of text localization (from left column to right column: coarse text line localization, text extraction, accurate text line localization with our method, and Chen's localization method).

References

[1] D. T. Chen, J. M. Odobez, and H. Bourlard. Text detection and recognition in images and video frames. Pattern Recognition, 37(3):595–608, 2004.

[2] X. Chen and A. L. Yuille. Detecting and reading text in natural scenes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 366–373, 2004.

[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.

[4] J. Gllavata, R. Ewerth, and B. Freisleben. Text detection in images based on unsupervised classification of high-frequency wavelet coefficients. In Proc. 17th Int'l Conf. Pattern Recognition, volume 1, pages 425–428, 2004.

[5] T. Hiroki. Region graph based text extraction from outdoor images. In Proc. 3rd Int'l Conf. Information Technology and Applications, volume 1, pages 680–685, 2005.

[6] A. K. Jain and B. Yu. Automatic text location in images and video frames. In Proc. 14th Int'l Conf. Pattern Recognition, volume 2, pages 1497–1499, 1998.

[7] K. Jung, K. I. Kim, and A. K. Jain. Text information extraction in images and videos: A survey. Pattern Recognition, 37(5):977–997, 2004.

[8] K. I. Kim, K. Jung, and J. H. Kim. Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(12):1631–1639, 2003.

[9] I. Laptev. Improvements of object detection using boosted histograms. In Proc. 17th British Machine Vision Conf., volume 3, pages 949–958, 2006.

[10] K. Levi and Y. Weiss. Learning object detection from a small number of examples: the importance of good features. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 53–60, 2004.

[11] H. P. Li and D. Doermann. A video text detection system based on automated training. In Proc. 15th Int'l Conf. Pattern Recognition, volume 2, pages 2223–2226, 2000.

[12] S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer, 2001.

[13] J. Liang, D. Doermann, and H. P. Li. Camera-based analysis of text and documents: A survey. Int'l J. Document Analysis and Recognition, 7(2-3):84–104, 2005.

[14] C.-L. Liu, K. Nakashima, H. Sako, and H. Fujisawa. Handwritten digit recognition: Benchmarking of state-of-the-art techniques. Pattern Recognition, 36(10):2271–2285, 2003.

[15] Y. X. Liu, S. Goto, and T. Ikenaga. A contour-based robust algorithm for text detection in color images. IEICE Trans. Information and Systems, 89(3):1221–1230, 2006.

[16] S. M. Lucas. ICDAR 2005 text locating competition results. In Proc. 8th Int'l Conf. Document Analysis and Recognition, volume 1, pages 80–84, 2005.

[17] W. Niblack. An Introduction to Digital Image Processing. Prentice Hall, 1986.

[18] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.

[19] F. Porikli. Integral histogram: a fast way to extract histograms in cartesian spaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 829–836, 2005.

[20] J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[21] P. Viola and M. Jones. Fast and robust classification using asymmetric adaboost and a detector cascade. In Advances in Neural Information Processing Systems, volume 14, 2002.

[22] C. Wolf and J. M. Jolion. Extraction and recognition of artificial text in multimedia documents. Technical Report RVF-RR-2002.01, http://rtf.insalyon.fr/wolf/papers/tr-rfv-2002-01.pdf, February 2002.

[23] V. Wu, R. Manmatha, and E. M. Riseman. Finding text in images. In Proc. 2nd ACM Int'l Conf. Digital Libraries, pages 3–12, 1997.

[24] H. M. Zhang, W. Gao, X. L. Chen, and D. B. Zhao. Object detection using spatial histogram features. Image and Vision Computing, 24(4):327–341, 2006.

[25] Y. Zhong, H. Zhong, and A. K. Jain. Automatic caption localization in compressed video. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(4):385–392, 2000.

[26] X. D. Zhou and C.-L. Liu. Text/non-text ink stroke classification in Japanese handwriting based on Markov random fields. In Proc. 9th Int'l Conf. Document Analysis and Recognition, volume 1, pages 377–381, 2007.

[27] K. H. Zhu, F. H. Qi, R. J. Jiang, L. Xu, M. Kimachi, Y. Wu, and T. Aizawa. Using AdaBoost to detect and segment characters from natural scenes. In Proc. 1st Int'l Workshop on Camera Based Document Analysis and Recognition, pages 52–59, 2005.