
Classifying images on the web automatically

Rainer Lienhart
Alexander Hartmann*
Intel Labs, Intel Corporation
2200 Mission College Boulevard
Santa Clara, California 95052-8119
E-mail: [email protected]

Abstract. Numerous research works about the extraction of low-level features from images and videos have been published. However, only recently has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically broad and meaningful categories. In this paper, novel classification algorithms are presented for three broad and general-purpose categories. In detail, we present algorithms for distinguishing photo-like images from graphical images, actual photos from only photo-like, but artificial images, and presentation slides/scientific posters from comics. On a large image database, our classification algorithm achieved an accuracy of 97.69% in separating photo-like images from graphical images. In the subset of photo-like images, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while with an accuracy of 99.5% the subset of graphical images was successfully partitioned into presentation slides/scientific posters and comics. © 2002 SPIE and IS&T. [DOI: 10.1117/1.1502259]

1 Introduction

Today's web search engines allow searching for text contained in web pages. However, more and more people are also interested in finding images and videos on the World Wide Web. Some search engines, such as AltaVista™ and Google™, have already started to offer the possibility to search for images and videos; however, they often only enable search based on textual hints, which are taken from the image's filename, ALT tag, and/or associated web page.

AltaVista™ also offers the possibility to search for images similar to one already found using textual hints. However, the similarity search is only possible for some images, maybe because either not all images have been analyzed yet or there are certain criteria an image must meet before it can be used for a similarity search. Those criteria, however, are not explained.

The next generation of search engines will also be media portals, which allow searching for all kinds of media elements. For instance, Ref. 1 indexes web images based on the visual appearance of text, faces, and registered trademark logos. There is a high demand for search engines that can index beyond textual descriptions. Media portals of tomorrow need to classify their media content automatically. Image libraries of tens of millions of images cannot be classified manually.

In this paper we present novel classification algorithms for three broad categories. In detail, we present algorithms for distinguishing

1. photos/photo-like images from graphical images,

2. actual photos from artificial photo-like images such as raytracing images or screen shots from photo-realistic computer games, and

3. presentation slides/scientific posters from comics/cartoons.

With the exception of the classification into photos/photo-like images and graphical images, we are not aware of any directly related work.

Our choice for these four classes is the result of a thorough analysis of the image classes we could find most often in our database of web images. Over a period of four months about 300 000 web images, which did not represent buttons or navigational elements, were crawled and downloaded from the web. A large percentage of these images fell into the four categories above. The four categories were arranged into a simple classification hierarchy (see Fig. 1).

2 Related Work

Only recently has automatic semantic classification of images into broad general-purpose classes become the topic of some research. General-purpose classes are meaningful to ordinary people, and classification into them can be performed without expert knowledge in a specific field. Examples of general-purpose classes are outdoor versus indoor and city versus landscape scenes.

In Refs. 2 and 3 Vailaya et al. describe a method to classify vacation images into classes like indoor/outdoor, city/landscape, and sunset/mountain/forest scenes. They use a Bayesian framework for separating the images in a classification hierarchy and report an accuracy of 90.5% for indoor versus outdoor classification, 95.3% for city versus landscape classification, and 96.6% for sunset versus forest/mountain classification.

*Present address: IT-SAS, Gottlieb-Daimler-Str. 12, 68165 Mannheim, Germany.

Paper II-04 received Feb. 18, 2002; revised manuscript received May 31, 2002; accepted for publication June 12, 2002. 1017-9909/2002/$15.00 © 2002 SPIE and IS&T.



Gorkani et al. propose a method for distinguishing city/suburb from country/landscape scenes using the most dominant orientation in the image texture.4 The dominant orientation differs between city and landscape images. The authors state that it takes humans almost no time or "brain power" to distinguish between those image classes, so there should exist a feature that is easy and fast to calculate. The authors report a classification accuracy of 92.8% on 98 test images.

Yiu et al. classify pictures into indoor/outdoor scenes using color histograms and texture orientation.5 For the orientation they use the algorithm by Gorkani and Picard.4 The vertical orientation serves as the discriminant feature, because indoor images tend to have more artifacts, and artifacts tend to have strong vertical lines.

Bradshaw proposes a method for labeling image regions as natural or man-made. For instance, buildings are man-made, while mountains in the background are natural. For homogeneous images, i.e., images depicting either only man-made or only natural objects, an error rate of about 10% is reported. Bradshaw also proposes how this feature can be used for indoor versus outdoor classification.6 He reports a classification accuracy of 86.3%.

Swain et al. describe how to separate photographs and graphics on web pages.7,8 They only search for "simple" graphics such as navigation buttons or drawings, while our work deals with artificial but realistic-looking images, which would be classified as being natural by their algorithm. The features that Swain et al. used are: number of colors, most frequent color, farthest neighbor metric, saturation metric, color histogram metric, and a few more.7,8 An error rate of about 9% is reported for distinguishing photos from graphics encoded as JPEG images.

Schettini et al. recently addressed the problem of separating photographs, graphics, text, and compound documents using color distribution, color statistics, edge distribution, wavelet coefficients, texture features, and percentage of skin color pixels as features. Compound documents here are images consisting of more than one of the categories photographs, graphics, and text. Decision trees trained by the CART algorithm are used as the base classifier. Multiple decision trees are trained and combined in a majority vote. For photos versus text versus graphics, precision values between 0.88 and 0.95 are reported.9 The authors also applied the same approach to the problem of distinguishing indoor, outdoor, and close-up images. Precision values between 0.87 and 0.91 are reported.10

3 Graphical Versus Photo-Like Images

One of the first decisions a user has to make when searching for a particular image is whether the image should be graphical or photo-like. Examples of graphical images are buttons and navigation elements, scientific presentations, slides, and comics; examples of realistic-looking, photo-like images are photos, raytracing images, and photo-realistic images of modern computer games.

3.1 Features

Features that could be distinctive for this separation, some of which have been proposed by Swain et al. in Refs. 7 and 8, are:

• the total number of different colors. Graphics tend to have fewer colors;

• the relative size of the largest region and/or the number of regions with a relative size bigger than a certain threshold. Graphics tend to have larger uniformly colored regions;

• the sharpness of the edges. Edges in graphics are usually sharper than edges in photos;

• the fraction of pixels with a saturation greater than a certain threshold. Colors in graphics are usually more saturated than those in realistic-looking images;

• the fraction of pixels having the prevalent color. Graphics tend to have fewer colors than photos, and thus the fraction of pixels of the prevalent color is higher;

• the farthest neighbor metric, which measures the color distance between two neighboring pixels. The distance is defined as d = |r1 - r2| + |g1 - g2| + |b1 - b2|, the absolute difference of both pixels' RGB values. Three subfeatures can be derived (a sketch of their computation follows this list):

• the fraction f1 of pixels with a distance greater than zero. Graphics usually have larger single-colored regions, so this metric should be lower for graphics;

• the fraction f2 of pixels with a distance greater than a high threshold. This value should be high for graphics; and

• the ratio between f2 and f1. As f1 tends to be larger for photographs, a low value of f2/f1 indicates a photo-like image.
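
For illustration, a minimal sketch of how the three farthest-neighbor subfeatures might be computed with NumPy. The paper fixes neither the examined neighborhood nor the value of the "high" threshold, so the right/lower neighbors and the threshold of 96 below are our assumptions.

import numpy as np

def farthest_neighbor_features(rgb, high_threshold=96):
    """Subfeatures f1, f2, and f2/f1 of the farthest neighbor metric.

    rgb: H x W x 3 uint8 array.  For each pixel, the L1 color distance
    d = |r1 - r2| + |g1 - g2| + |b1 - b2| to its right and its lower
    neighbor is computed; the per-pixel value is the larger of the two.
    high_threshold is an assumed value ("a high threshold" in the text).
    """
    img = rgb.astype(np.int32)
    d_right = np.abs(img[:, 1:] - img[:, :-1]).sum(axis=2)   # H x (W-1)
    d_down = np.abs(img[1:, :] - img[:-1, :]).sum(axis=2)    # (H-1) x W
    d = np.maximum(d_right[:-1, :], d_down[:, :-1])          # interior grid

    f1 = float(np.mean(d > 0))               # distance greater than zero
    f2 = float(np.mean(d > high_threshold))  # distance above high threshold
    return f1, f2, (f2 / f1 if f1 > 0 else 0.0)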

3.2 Training

Obviously, most of these features are not statistically independent, but rather highly correlated. Therefore, we decided to implement all features and then to select the most relevant ones by means of feature selection.

Fig. 1 Classification hierarchy.


The discrete AdaBoost machine learning algorithm with stumps served as our feature selector.11 AdaBoost is a boosting algorithm that combines many "weak" classifiers into a "strong," powerful committee-based classifier (see Fig. 2). Weak classifiers can be very simple and are only required to be better than chance. Common weak classifiers are stumps: single-split trees with only two terminal nodes. In each loop the stump with the lowest training error is selected in step (3a) of Fig. 2. In other words, in (3a) k simple threshold classifiers ("stumps") are trained, one for each of the k dimensions of the input samples. The classifier with the lowest weighted error err_m is selected as f_m(x).
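
A compact sketch of the Discrete AdaBoost loop with stumps as weak classifiers, following the standard formulation of Ref. 11; the exhaustive threshold search and all variable names are ours.

import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, polarity) stump with the lowest
    weighted error; this corresponds to step (3a).  y must be in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature j, threshold t, polarity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) < 0, -1, 1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, pol)
    return best

def discrete_adaboost(X, y, M=7):
    """Train the committee F(x) = sign(sum_m c_m f_m(x))."""
    w = np.full(len(y), 1.0 / len(y))        # uniform initial sample weights
    committee = []
    for _ in range(M):
        err, j, t, pol = train_stump(X, y, w)
        c = np.log((1 - err) / max(err, 1e-12))   # weak-classifier weight
        pred = np.where(pol * (X[:, j] - t) < 0, -1, 1)
        w *= np.exp(c * (pred != y))         # boost the misclassified samples
        w /= w.sum()
        committee.append((c, j, t, pol))
    return committee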

After training with about 7516 images, only four features proved to be useful (the two color features are sketched after this list):

• the total number of colors cn after truncating each color channel to only its five most significant bits (32 × 32 × 32 = 32768 colors);

• the fraction cp of pixels having the prevalent color;

• the fraction f1 of pixels with a distance greater than zero; and

• the ratio between f2 and f1.
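
A sketch of the two color features; reading cp as the fraction of pixels carrying the prevalent color, consistent with the feature list in Sec. 3.1 and the threshold of 0.1872 in Table 1, is our interpretation. f1 and f2/f1 are computed as sketched in Sec. 3.1.

import numpy as np

def color_features(rgb):
    """cn: number of distinct colors after truncating each channel to its
    five most significant bits (at most 32 x 32 x 32 = 32768 colors).
    cp: fraction of pixels showing the most frequent truncated color."""
    q = rgb.astype(np.uint16) >> 3                     # keep 5 MSBs, 0..31
    codes = (q[..., 0] << 10) | (q[..., 1] << 5) | q[..., 2]
    _, counts = np.unique(codes, return_counts=True)
    return len(counts), counts.max() / codes.size      # (cn, cp)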

All other features were not selected by the AdaBoost algorithm. Most likely they were not distinctive enough, partly because all our images were JPEG compressed. Some of the characteristic features of graphics are destroyed by JPEG's lossy compression. Note that in Refs. 7 and 8 most graphical images were GIF compressed, which simplifies the task.

The overall classifier

F(x) = sign( sum_{m=1}^{M} c_m * f_m(x) )

with

f_m(x) = val_left,m    if x_m < threshold_m,
f_m(x) = val_right,m   otherwise,

where x_m denotes the feature value examined by the m-th stump, used seven stumps (M = 7) with the parameters given in Table 1.
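
Evaluating the trained committee then amounts to seven threshold tests and a weighted vote; a sketch using the parameters of Table 1. The paper does not state how the real-valued stump outputs enter the sign decision, so comparing the weighted vote against the midpoint 0.5, the ordering of the feature vector, and the class assignment are our assumptions.

# One entry per stump: (index into x, c_m, threshold_m, val_left, val_right);
# the feature vector x is assumed ordered as (cn, cp, f1, f2/f1).
STUMPS = [
    (3, 3.43024,  0.21425,   0.970419, 0.0402329),
    (3, 1.4672,   0.0638578, 0.931319, 0.263485),
    (1, 1.04537,  0.1872,    0.821982, 0.387983),
    (0, 0.775026, 626.0,     0.330293, 0.687471),
    (3, 0.766541, 0.213204,  0.370636, 0.798874),
    (3, 0.509741, 0.106966,  0.729928, 0.463033),
    (2, 0.538214, 0.987547,  0.636153, 0.397175),
]

def classify(x):
    """Weighted committee vote; the stump outputs are read as confidences
    for the graphical class (an assumption on our part)."""
    vote = sum(c * (vl if x[j] < t else vr) for j, c, t, vl, vr in STUMPS)
    total = sum(c for _, c, _, _, _ in STUMPS)
    return "graphical" if vote / total > 0.5 else "photo-like"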

3.3 Experimental Results

On a test set of 947 graphical images (comics, scientific posters, and presentation slides; the same images as in Sec. 5) and 2272 photographic images (raytracing images and photographs; the same images as in Sec. 4), 91.92% of the graphical images and 98.97% of the photo-like images were classified correctly, resulting in an overall accuracy of 97.69%. The misclassified photo-like images were mostly photos made up of only a few colors (such as the right image in Fig. 3), or raytracing images that did not look realistic at all, but were put into this class because they were part of an image archive of raytracing images (see the two leftmost images in Fig. 3).

Misclassified graphical images were either slides containing large photographs (Fig. 4) or very colorful comics (not shown for copyright reasons).

Overall, most errors in this class were caused by the large visual diversity of the slides/presentation class, which not only consists of PowerPoint presentation images, but also many scientific posters related to space/astronomy, fluid/wind motion, or physics in general. Only 80.5% of them were classified as graphical.

Fig. 2 Discrete AdaBoost training algorithm.

Fig. 3 Examples of realistic looking images, which were misclassified as being graphics. The misclassified images were either photos made up of only a few colors (right image) or raytracing images, which did not look realistic at all, but were put into this class because they were part of an image archive of raytracing images.



4 Computer Generated, Realistic Looking Images Versus Real Photos

The algorithm proposed in this section for distinguishing between real photos and computer-generated, but realistic-looking images can be applied to the set of images, which have been classified as being photo-like by the algorithm described in Sec. 3.

The class of real photos encompasses all kinds of images taken from nature. Typical examples are digital photos and video frames. In contrast, the class of computer-generated images encompasses raytracing images as well as images from graphic tools such as Adobe Photoshop and computer games. Figure 5 shows three examples of each class.

4.1 Features

Every real photo contains noise due to the process of converting an analog image into digital form. For computer-generated, realistic-looking images this conversion/scanning process, however, is not needed. Thus, it can be expected that computer-generated images are far less noisy than digitized images. By designing a feature that measures noise it should be possible to distinguish between scanned images and images that were digital right from the beginning.

A second suitable feature that can be used is the sharpness of the edges. Computer-generated images are supposed to display sharper edges than photographs. However, due to lossy JPEG compression this feature becomes less reliable: sharp edges may be blurred, and blockiness may be added, i.e., sharp edges might be added that were not there before.

In practice, we measure the amount of noise by means of the histogram of the absolute difference image between the original and its denoised version. The difference values can vary between 0 and 255. Two simple and fast filters for denoising are the median and the Gaussian filter. The core difference between the two filters is that they assume different noise sources. The median filter is more suitable for individual pixel outliers, while the Gaussian filter is better for additive noise.12

Both denoising filters were applied with a radius of 1, 2, 3, and 4. Thus, the resulting feature vector consisted of 2048 values: 4 × 256 from the median filter and 4 × 256 from the Gaussian filter.
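
A sketch of the noise feature extraction using SciPy. The paper does not define how the "radius" parameterizes each filter, whether the image is converted to grayscale first, or whether the histograms are normalized; the window size 2r + 1 for the median filter, sigma = r for the Gaussian filter, the grayscale input, and the normalization below are our assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def noise_feature_vector(gray):
    """2048-dim feature vector: for each radius 1..4, a 256-bin histogram
    of |original - denoised| for the median filter (4 x 256 values) and
    for the Gaussian filter (another 4 x 256 values)."""
    img = gray.astype(np.float32)
    feats = []
    for denoise in (lambda r: median_filter(img, size=2 * r + 1),
                    lambda r: gaussian_filter(img, sigma=r)):
        for radius in (1, 2, 3, 4):
            diff = np.abs(img - denoise(radius)).clip(0, 255)
            hist, _ = np.histogram(diff, bins=256, range=(0, 256))
            feats.append(hist / diff.size)       # normalized histogram
    return np.concatenate(feats)                 # 8 x 256 = 2048 values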

Fig. 4 Examples of graphical images misclassified as realistic.

Fig. 5 Examples of raytracing images and photographs.



4.2 Training

As in Sec. 3.2 we can expect that our features are highly correlated. In addition, many of them may only encode noise with respect to the classification task at hand. Therefore, we use boosting again for training and feature selection. This time, however, due to the large number of features, we compare the performance of four boosting variants: Discrete AdaBoost, Gentle AdaBoost, Real AdaBoost, and LogitBoost.13 The latter three usually compare favorably to Discrete AdaBoost with respect to the number of weak classifiers needed to achieve a certain classification performance. The algorithm for Gentle AdaBoost is depicted in Fig. 6. In our experiments it usually produced the classifier with the best performance/computational complexity trade-off.
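
For reference, a minimal sketch of Gentle AdaBoost with regression stumps, following Ref. 13; the exhaustive stump search and all names are ours, and the stopping rule mirrors the target-accuracy termination criterion described in Sec. 4.3.

import numpy as np

def gentle_adaboost(X, y, max_rounds=500, target_train_acc=0.99):
    """Gentle AdaBoost: each round fits f_m by weighted least squares.
    For a split on feature j at threshold t, the leaf values are the
    weighted means of y (in {-1, +1}) on each side; then F <- F + f_m
    and the weights are updated as w <- w * exp(-y * f_m(x))."""
    n, k = X.shape
    w = np.full(n, 1.0 / n)
    F = np.zeros(n)
    model = []
    for _ in range(max_rounds):
        best = None  # (weighted squared error, j, t, val_left, val_right)
        for j in range(k):
            for t in np.unique(X[:, j]):
                left = X[:, j] < t
                wl, wr = w[left].sum(), w[~left].sum()
                if wl == 0 or wr == 0:
                    continue
                vl = (w[left] * y[left]).sum() / wl
                vr = (w[~left] * y[~left]).sum() / wr
                sse = (w * (y - np.where(left, vl, vr)) ** 2).sum()
                if best is None or sse < best[0]:
                    best = (sse, j, t, vl, vr)
        if best is None:
            break
        _, j, t, vl, vr = best
        f = np.where(X[:, j] < t, vl, vr)
        F += f
        w *= np.exp(-y * f)        # down-weight well-classified samples
        w /= w.sum()
        model.append((j, t, vl, vr))
        if np.mean(np.sign(F) == y) >= target_train_acc:
            break                  # target training accuracy reached
    return model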

4.3 Experimental Results

The overall image set consisted of 3225 scenic photographs from the "Master Clips 500.000" collection and 4352 raytracing images from http://www.irtc.com, the Internet Ray Tracing Competition. The overall image set was randomly partitioned into 5305 (70%) training images and 2272 (30%) test images.

Training was performed with

a. all 2048 feature values,

b. only the 1024 median feature values, and

c. only the 1024 Gaussian feature values

in order to analyze the suitability of the median and Gaussian features for the classification task as well as the performance gain by using both feature sets jointly.

The results are shown in Table 2. The number M of weak classifiers was determined by setting the target hit rate as the termination criterion for the iterative loop of the boosting algorithms. The following observations can be drawn from the results shown in Table 2.

a. The test accuracy increases consistently with the training accuracy, demonstrating one of the most impressive features of boosting algorithms: their tendency not to overfit the training data in practice.

b. The median feature values perform better than the Gaussian features. For instance, with Discrete AdaBoost, the test error rate for median values is 3.4% compared to 5.9% for Gaussian values, while using only half the number of weak classifiers and thus roughly half the number of features.

c. Using both feature sets reduces the test error rate to 2.7%. At the same time the number of weak classifiers is reduced by another 30%. Thus, by using a larger feature pool from which the boosting algorithm can pick, fewer features are needed for a better classifier.

d. The best classification results were achieved using Gentle AdaBoost with all 2048 values in the feature pool. Classification accuracy for the test set was 97.3% (98.2% for raytracing images and 96.0% for photos). In our previous work, we used the learning vector quantization package from the Helsinki University of Technology14 to train our classifier. However, the classification accuracy for the same test set was only 87.33%.15

A closer inspection of the misclassified raytracing images revealed two main sources of misclassification. These images either used noisy, real-world textures or were very small in dimension (e.g., only 100 × 75 pixels). Some "photos" were misclassified since they were not real photos (see Fig. 7, left image). Figure 7 shows a few examples of misclassified images.

5 Presentation Slides/Scientific Posters Versus Comics/Cartoons

The algorithm proposed in this section for distinguishing between presentation slides/scientific posters and comics/cartoons can be applied to the set of images, which have been classified as being graphical by the algorithm in Sec. 3.

Fig. 6 Gentle AdaBoost training algorithm (see Ref. 13).

Table 1 Discrete AdaBoost parameters for graphics vs photo/photo-like images.

m   Feature   c_m        Threshold_m   Val_left,m   Val_right,m

1   f2/f1     3.43024    0.21425       0.970419     0.0402329
2   f2/f1     1.4672     0.0638578     0.931319     0.263485
3   cp        1.04537    0.1872        0.821982     0.387983
4   cn        0.775026   626           0.330293     0.687471
5   f2/f1     0.766541   0.213204      0.370636     0.798874
6   f2/f1     0.509741   0.106966      0.729928     0.463033
7   f1        0.538214   0.987547      0.636153     0.397175


Fig. 7 Examples of images misclassified as (a) natural images and (b) photorealistic, but artificial images.

Table 2 Classification performance of computer generated, realistic looking images vs real photos. The results are shown for four common boosting algorithms. Training/test accuracy was determined on a training/test set of 5305/2272 images. Training accuracy was used as a termination criterion for the boosting training. Note that Discrete AdaBoost consistently needs more features to achieve the same training and test accuracy as the other boosting algorithms.

                    |    Median values      | Median + Gaussian values |   Gaussian values
                    | No.    Train   Test   | No.    Train   Test      | No.    Train   Test
                    | feat.  acc.    acc.   | feat.  acc.    acc.      | feat.  acc.    acc.

Gentle AdaBoost     |  61    0.950   0.917  |  45    0.951   0.935     | 155    0.950   0.905
                    |  75    0.961   0.932  |  55    0.963   0.938     | 185    0.960   0.913
                    |  96    0.971   0.948  |  70    0.972   0.951     | 236    0.970   0.917
                    | 127    0.980   0.951  |  91    0.980   0.963     | 272    0.980   0.924
                    | 185    0.990   0.960  | 130    0.991   0.973     | 356    0.990   0.935

Discrete AdaBoost   |  79    0.951   0.922  |  47    0.951   0.929     | 210    0.950   0.908
                    | 101    0.960   0.929  |  57    0.961   0.935     | 256    0.960   0.917
                    | 139    0.970   0.944  |  86    0.971   0.947     | 324    0.970   0.928
                    | 183    0.980   0.953  | 100    0.980   0.959     | 403    0.980   0.928
                    | 295    0.990   0.966  | 164    0.990   0.967     | 590    0.990   0.941

Real AdaBoost       |  58    0.951   0.923  |  46    0.952   0.926     | 158    0.951   0.905
                    |  70    0.961   0.930  |  56    0.961   0.941     | 183    0.960   0.905
                    |  90    0.971   0.945  |  65    0.971   0.947     | 223    0.971   0.915
                    | 122    0.981   0.951  |  92    0.981   0.957     | 273    0.980   0.928
                    | 163    0.991   0.960  | 120    0.990   0.963     | 349    0.990   0.937

LogitBoost          |  52    0.951   0.926  |  39    0.951   0.923     | 144    0.951   0.892
                    |  63    0.961   0.929  |  50    0.961   0.932     | 179    0.960   0.897
                    |  80    0.970   0.942  |  65    0.970   0.941     | 205    0.970   0.908
                    | 104    0.980   0.955  |  88    0.980   0.957     | 261    0.980   0.913
                    | 151    0.990   0.958  | 121    0.990   0.964     | 321    0.990   0.921

Max test accuracy   |        0.966          |        0.973             |        0.941

("No. feat." denotes the number of selected features, i.e., weak classifiers.)


The class of presentation slides includes all images showing slides, independently of whether they were created digitally by presentation programs such as MS PowerPoint or by hand. Many scientific posters are designed like a slide and, therefore, fall into this class, too. However, scientific posters may also differ significantly from the general layout of slides. Both image classes, presentation slides and scientific posters, comprise the class of presentation slides/scientific posters.

The class of comics includes cartoons from newspapers, most of which are available on the web, and from books, as well as other kinds of comics.

Images of both classes can be colored or black and white. Three examples for slides and three for scientific posters are shown in Fig. 8, while examples of comics cannot be shown for copyright reasons.

5.1 Features

We observed the following three main differences between presentation slides/scientific posters and comics/cartoons.

1. In general, the relative size and/or alignment of text line occurrences differs between comics and slides/posters. Thus, images of both classes can be distinguished by means of

• the relative width of the topmost text line, i.e., the ratio between the width of the topmost text line and the width of the entire image,

• the average relative width and height of all text lines and their respective standard deviations, and

• the average relative position and standard deviation ofthe center of mass over all text lines.

These features are motivated by the following observations: Slides usually have a heading, which almost fills the entire width of the image. The subsequent text lines are wider than they are in comics. Moreover, the text lines in slides either have only one center in about the middle of the image, leading to a small standard deviation over the locations of their centers of mass, or they all start in the same column and therefore have different centers of mass; but all those centers of mass are still near each other, resulting in a small standard deviation around the average center location, too.

The relative width of the topmost text line in comics is usually smaller than in slides, as are all other text lines. Slides in general use larger fonts than comics do. Therefore, the larger the average relative height of the text lines, the more probable it is that the image represents a slide.

Further, text in two or more columns is uncommon for slides. Comics, on the other hand, usually consist of more than one image, resulting in more than just one visual center of text blocks. Thus, the standard deviation over the text line center locations will be large.

2. Images containing multiple smaller images aligned on a virtual grid and framed by rectangles are very likely to be comics. These borders can easily be detected by edge detection algorithms.

Fig. 8 Examples for (a) slides and (b) scientific posters.



In comics, the length of those border lines is usually an integral fraction of the image's width or height. For instance, they might be about a third of the image's width. The more lines of such length can be found, the higher is the probability for the image to be a comic instead of a presentation slide.

This criterion can be made more precise by checking for the presence of the other n − 1 lines in the same row/column if a line with a length of one n-th of the image's width/height was found. By means of this procedure, lines are eliminated that happen to have the correct length by chance but have nothing to do with the typical borders in comics.

3. Slides very often have a width-to-height ratio of 4:3 (landscape orientation). If the aspect ratio differs from this ratio, it is very unlikely that the image is a slide.

5.2 Feature Calculation

We used the algorithm and system developed by Lienhart et al. to find all text lines and text columns in the image under analysis.16 The text detection system was retrained with text samples from slides and comics in order to improve text line detection performance. Based on the detected bounding boxes, the following five features were calculated (a sketch follows the list):

• the relative width of the topmost text line with respect to the image's width,

• the average text line width and its standard deviation over all detected text lines, and

• the average horizontal center position and its standard deviation over all detected text lines.
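
Given the detected text-line bounding boxes, these layout features reduce to simple statistics. A sketch, with boxes represented as (x, y, width, height) tuples in pixels; this representation is our assumption.

import numpy as np

def text_layout_features(boxes, img_w):
    """Layout statistics over detected text lines, each given as
    (x, y, w, h).  All values are relative to the image width."""
    if not boxes:
        return [0.0] * 5
    boxes = sorted(boxes, key=lambda b: b[1])           # top to bottom
    rel_w = [b[2] / img_w for b in boxes]
    centers = [(b[0] + b[2] / 2) / img_w for b in boxes]
    return [
        rel_w[0],                  # relative width of the topmost text line
        float(np.mean(rel_w)),     # average relative text line width
        float(np.std(rel_w)),      # and its standard deviation
        float(np.mean(centers)),   # average horizontal center of mass
        float(np.std(centers)),    # small for slides, large for comics
    ]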

Edges were extracted by means of the Canny edge detection algorithm and then vectorized.17 All nonhorizontal or nonvertical edges were discarded. Two vertical or horizontal lines were merged if and only if they had the same orientation and the end point of one line was near the start point of the other. This procedure helped to overcome accidental breakups in the borderlines and merged nearby lines from multiple "picture boxes." Next, the lengths of all remaining edges were determined and checked as to whether they were about one, one half, one third, or one fourth of the width or height of the image. If not, the respective edge was discarded. Finally, the relative frequencies of edges with roughly the n-th fraction of the image's width or height (n ∈ {1, 2, 3, 4}) were counted and taken as another four features.
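
A sketch of the resulting border-line features; the segment representation, the 5% length tolerance, and the normalization by the number of axis-aligned segments are our assumptions (the paper only speaks of "relative frequency"), and the companion-line check of Sec. 5.1 is omitted for brevity.

def fraction_edge_features(lines, img_w, img_h, tol=0.05):
    """Relative frequency of horizontal/vertical segments whose length is
    about 1, 1/2, 1/3, or 1/4 of the image width/height (four features).
    lines: iterable of ((x1, y1), (x2, y2)) segments, e.g., the vectorized
    Canny edges after merging."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    axis_aligned = 0
    for (x1, y1), (x2, y2) in lines:
        if y1 == y2:                   # horizontal: compare to image width
            length, ref = abs(x2 - x1), img_w
        elif x1 == x2:                 # vertical: compare to image height
            length, ref = abs(y2 - y1), img_h
        else:
            continue                   # non-axis-aligned edges are discarded
        axis_aligned += 1
        for n in (1, 2, 3, 4):
            if abs(length - ref / n) <= tol * ref:
                counts[n] += 1         # length is about the n-th fraction
                break
    total = max(axis_aligned, 1)
    return [counts[n] / total for n in (1, 2, 3, 4)]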

The feature set was completed by

• the absolute number of vertical and the absolute number of horizontal edges, as well as
• the aspect ratio of the image dimensions.

In total, 12 features were used.

Fig. 9 The only two misclassified presentation slides/scientific posters.

Table 3 Classification performance for comics/cartoons vs slides/posters.

                    No. features   Training accuracy   Test accuracy

Gentle AdaBoost           2             0.980              0.976
                          2             0.980              0.976
                          2             0.980              0.976
                          5             0.983              0.983
                         36             1.000              0.995

Discrete AdaBoost         2             0.980              0.976
                          2             0.980              0.976
                          2             0.980              0.976
                          5             0.987              0.979
                         42             1.000              0.995

Real AdaBoost             2             0.980              0.976
                          2             0.980              0.976
                          2             0.980              0.976
                          4             0.983              0.983
                         21             1.000              0.993

LogitBoost                2             0.980              0.976
                          2             0.980              0.976
                          2             0.980              0.976
                          5             0.994              0.990
                       1000             0.999              0.992

Max test accuracy                                          0.995


5.3 Experimental Results

During our experiments we observed that in comics the neural network-based text localizer detected a significant number of false text blocks of small width, but large height. In contrast, the text blocks in slides were recognized very well. This stark contrast in the false alarm rate of our text localizer between comics and slides can partly be explained by the fact that large fonts are prevalent in slides, but not in comics, and that our detector worked very well on large text lines. In addition, the kinds of strokes used in comics to draw people and objects sometimes have properties similar to the strokes used for text, and thus result in false alarms. Despite these imperfections of our text detector,16 all our features except the average height of the text lines could be used.

Again the boosting learning algorithms were used for training. The training set consisted of 2211 images (70% of the overall image set): 818 slides/posters and 1393 comics. Our novel classification algorithm was tested on a test set of 947 images (30% of the overall image set): 361 slides/posters and 586 comics. As shown in Table 3, there are not many differences between the different boosting algorithms. For Gentle and Discrete AdaBoost a test accuracy of 99.5% was achieved. This translates to only five misclassified images. The image's aspect ratio and the number of vertical edges were always the first two features chosen by the boosting algorithms. In the Gentle AdaBoost case, even at the test accuracy of 99.5% the following three features were not selected:

• the relative width of the topmost text line with respect to the image width,
• the standard deviation of text line widths, and
• the relative number of edges with a length of about one third of the image width or height.

As mentioned before, only five images were misclassified, of which two were slides/posters (see Fig. 9). The three misclassified cartoons cannot be shown for copyright reasons; however, their schematic layout is shown in Fig. 10. One of them exhibited displaced bounding boxes (Fig. 10, right image), while another violates the assumption that framing lines must be an n-th fraction of the image width or height (Fig. 10, left image). For the third misclassified comic the reason for misclassification was bad text detection.

Fig. 10 Box layout of two of the three misclassified cartoon images.

6 Conclusion

Automatic semantic classification of images is a very interesting research field. In this paper, we presented novel and effective algorithms for two classification problems, which have not been addressed before: comics/cartoons versus slides/posters and real photos versus realistic-looking but computer generated images. On a large image database, true photos could be separated from ray-traced/rendered images with an accuracy of 97.3%, while with an accuracy of 99.5% presentation slides were successfully distinguished from comics. We also enhanced and adjusted the algorithms proposed in Refs. 7 and 8 for the separation of graphical images from photo-like images. On a large image database, our classification algorithm achieved an accuracy of 97.69%.

Acknowledgments

The authors would like to thank Alexander Kuranov and Vadim Pisarevsky for the work they put into designing and implementing the four boosting algorithms.

References

1. www.visoo.com
2. A. Vailaya, "Semantic classification in image databases," PhD thesis, Department of Computer Science, Michigan State University (2000), http://www.cse.msu.edu/~vailayaa/publications.html.
3. A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang, "Bayesian framework for hierarchical semantic classification of vacation images," Proceedings of the IEEE International Conference on Multimedia Computing and Systems (ICMCS), pp. 518–523, Florence, Italy (1999).
4. M. M. Gorkani and R. W. Picard, "Texture orientation for sorting photos 'at a glance'," Proc. ICPR, pp. 459–464 (Oct. 1994).
5. E. Yiu, "Image classification using color cues and texture orientation," Master's thesis, Department of Electrical Engineering and Computer Science, MIT (1996), http://www.ai.mit.edu/projects/cbcl/res-area/current-html/ecyiu/project.html.
6. B. Bradshaw, "Semantic based image retrieval: A probabilistic approach," ACM Multimedia 2000, pp. 167–176 (Oct. 2000).
7. V. Athitsos, M. J. Swain, and C. Frankel, "Distinguishing photographs and graphics on the world wide web," IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 10–17 (June 1997).
8. C. Frankel, M. J. Swain, and V. Athitsos, "WebSeer: An image search engine for the world wide web," University of Chicago, Department of Computer Science, Technical Report TR-96-14 (August 1996), http://www.infolab.nwu.edu/webseer/.
9. R. Schettini, G. Ciocca, A. Valsasna, C. Brambilla, and M. De Ponti, "A hierarchical classification strategy for digital documents," Pattern Recogn. 35(8), 1759–1769 (2002).
10. R. Schettini, C. Brambilla, A. Valsasna, and M. De Ponti, "Content based classification of digital documents," IAPR Workshop on Pattern Recognition in Information Systems, Setubal, Portugal (6–7 July 2001).
11. Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156, Morgan Kaufmann, San Francisco (1996).
12. B. Jaehne, Digital Image Processing, Springer, Berlin (1997).
13. J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Technical Report, Dept. of Statistics, Stanford University (1998).
14. The Learning Vector Quantization Program Package, ftp://cochlea.hut.fi.
15. A. Hartmann and R. Lienhart, "Automatic classification of images on the web," in Storage and Retrieval for Media Databases 2002, Proc. SPIE 4676, 31–40 (2002).
16. R. Lienhart and A. Wernicke, "Localizing and segmenting text in images, videos and web pages," IEEE Trans. Circuits Syst. Video Technol. 12(4), 256–268 (2002).
17. J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986).

Rainer Lienhart received his Master's degree in computer science and applied economics and his PhD in computer science from the University of Mannheim, Germany, on "methods for content analysis, indexing, and comparison of digital video sequences." He was a core member of the Movie Content Analysis Project (MoCA). Since 1998 he has been a Staff Researcher at Intel Labs in Santa Clara. His research interests include image/video/audio content analysis, machine learning, scalable signal processing, scalable learning, ubiquitous and distributed media computing in heterogeneous networks, media streaming, and peer-to-peer networking and mass media sharing. He is a member of the IEEE and the IEEE Computer Society.

Alexander Hartmann received his Master's degree in computer science and applied economics from the University of Mannheim, Germany, on "new algorithms for automatic classification of images." During the summer of 2000 he was a Summer Intern at Intel Labs in Santa Clara. Currently he is working as a software engineer at ITSAS, an IBM Global Services Company, in Germany. His interests include Linux and cryptography.
