
UNIVERSITY OF CALIFORNIA
Santa Barbara

Embedded-Text Detection and Its Application to

Anti-Spam Filtering

A Thesis submitted in partial satisfaction

of the requirements for the degree of

Master of Science

in

Computer Science

by

Ching-Tung Wu

Committee in Charge:

Professor Kwang-Ting Cheng, Chair

Professor Yuan-Fang Wang

Associate Professor Matthew Turk

April 2005

The Thesis of Ching-Tung Wu is approved:

Professor Yuan-Fang Wang

Associate Professor Matthew Turk

Professor Kwang-Ting Cheng, Committee Chairperson

April 2005

Embedded-Text Detection and Its Application to Anti-Spam Filtering

Copyright © 2005

by

Ching-Tung Wu


Abstract

Embedded-Text Detection and Its Application to

Anti-Spam Filtering

Ching-Tung Wu

Embedded-text in images usually carries important messages about the content. In the past, several algorithms have been proposed to detect text boxes in video frames. Previous work often followed a multi-step framework using a combination of image-analysis and machine-learning techniques. In this work, we propose a unified embedded-text detection framework to efficiently and accurately locate text boxes, particularly in web and email images. We approach the embedded-text problem from the angle of object detection. We define position-independent features to capture the essence of characters, and we propose a smart-scan algorithm to trace text lines using their spatial and geometrical properties. We also propose a novel anti-spam system which utilizes visual clues, including the embedded-text information. The experimental results demonstrate the effectiveness of the proposed embedded-text detection framework and the anti-spam filtering system.

Professor Kwang-Ting Cheng

Thesis Committee Chair


Contents

List of Figures

List of Tables

1 Introduction

1.1 Previous Work
1.2 Motivation
1.3 Contributions
1.4 Outline

2 Embedded-Text Detection

2.1 Cascade Detector
2.1.1 Position-Independent Feature - Motivation
2.1.2 Position-Independent Feature - Definition
2.1.3 Cascade Detector Training
2.1.4 Experiments and Discussions

2.2 Detection Framework
2.2.1 One-Step Framework
2.2.2 Two-Step Framework

2.3 Experiments
2.3.1 Metrics
2.3.2 Experimental Results
2.3.3 Computation Cost
2.3.4 Memory Consumption

3 Visual-Based Anti-Spam Filtering

3.1 The Spam Datasets and Analysis
3.2 The Anti-Spam Filter


3.2.1 Feature Description
3.2.2 Feature Extraction
3.2.3 The Classifier

3.3 Experiments
3.3.1 Datasets
3.3.2 Experimental Results

4 Conclusions

Bibliography


List of Figures

1.1 Embedded-Text in Images Example

2.1 Top two features selected by AdaBoosting in Viola’s work
2.2 Canny edge detection result of a sample image
2.3 Edge coding, 3x3 sub-window, horizontal, and vertical pattern
2.4 Edge map of text and confusing background objects
2.5 Top three features selected by BrownBoosting
2.6 Tough case of negative training samples
2.7 One-Step Framework Flow
2.8 Example of Locality Shifts
2.9 Locality-Filter Rectangle Features
2.10 Vertical Locality-Filter Example of Letter A
2.11 Horizontal Locality-Filter Example of Letter A
2.12 Horizontal Locality-Filter Example of Letter I
2.13 Horizontal Locality-Filter Example of Letter L
2.14 Diagram of the search algorithm
2.15 Two-Step Framework Flow
2.16 Embedded-Text Detection Raw Results
2.17 Embedded-Text Detection Final Results

3.1 Spam email with embedded-text in image example
3.2 Text embedded in image example
3.3 Detected embedded-text regions


List of Tables

2.1 18-stage cascade detector using BrownBoosting
2.2 Feature sets comparison
2.3 Locality-Filter Verification
2.4 Performance Comparison
2.5 Computation Cost Comparison

3.1 Statistics of emails containing accessible images
3.2 Classifier comparison
3.3 Filter comparison
3.4 Blocked/Missed email analysis


Chapter 1

Introduction

As the Internet continues to grow, the type of information available to users has shifted from text-only to multimedia-enriched. Embedded-text in multimedia content (e.g., images and videos) is becoming one of the prevalent means of delivering messages to content viewers. Several embedded-text detection and recognition algorithms have been proposed in the past. However, as multimedia content grows ever more diverse and its volume increases exponentially, developing an embedded-text detector that is both accurate and efficient has become an interesting topic and a challenging task.

Embedded-text often carries important messages regarding the where, what, and who of the content. This information clearly helps with the recognition and annotation of the content. Embedded-text within images and videos can be categorized as natural scene text and artificial text. Natural scene text refers to text that appears as part of the scene in images or videos; for example, a news video showing an event that happened in a recognizable location. Artificial text, on the other hand, is manually added to provide content to viewers. One example of artificial text is the closed caption of a video; another is the text banner box on a web image.

The first step in extracting embedded-text from an image or a video frame is to pinpoint the location of the text boxes. In this thesis, we aim to develop a robust and accurate embedded-text detector, particularly for detecting artificial text in web and email images. The detection result can be fed into existing Optical Character Recognition (OCR) packages such as OmniPage or GNU OCR. In the second part of this thesis, we also demonstrate an application to anti-spam filtering which directly uses the text-detection result.

1.1 Previous Work

Artificial text boxes are designed to attract attention. These text boxes usually present strong coherence in space, geometry, and color. For example, they usually align horizontally, contain homogeneous color regions, possess strong edge energies, and exhibit high intensity differences from neighboring objects or a contrasting background. Since visual, spatial, and geometric coherence provides strong clues, previous work has primarily focused on these characteristics for the task of text detection.

Based on the detection framework, previous work can be roughly classified into two categories: (1) image-analysis-based and (2) image-analysis plus machine-learning-based. In the first category, embedded-text detection depends heavily on image-analysis techniques, such as edge detection, connected-component analysis, and texture analysis. In the second category, a machine-learning method is applied in addition to the image-analysis techniques. The machine-learning step is used mainly for verification purposes because it can further reduce the number of false positives.

In the image-analysis category, embedded-text detection depends heavily on image-analysis techniques. These can be further sub-categorized into edge-based and texture-based approaches. The common flow works as follows. First, edge detection or texture analysis is performed on the image under consideration. The image’s raw features are then grouped according to connectivity, spatial, and geometric correlations to form potential text regions. Second, potential text regions are further examined using rule-based heuristics, such as the size, aspect ratio, and orientation of the region. Fletcher et al. [6] used edge detection and connected-component analysis to generate connected text blocks. Then, they applied a Hough transformation to aggregate the text blocks into text strings. LeBourgeois [12] proposed using a linear filter to transform the gradient map, such as the edge-energy map of an image, into a texture map by smearing and dilating the gradient map in one direction. Since text pixels appear in a coherent way, text regions usually stand out and can be easily identified after the transformation. Zhou et al. [39] proposed a texture-based approach to extract text from web images using color clustering and connected-component analysis. Wu et al. [23] presented a text-segmentation method based on the distinctive textures of text: text strokes are extracted and refined in multiple phases to form the final bounding boxes.

In the second category, the machine-learning method is applied in conjunction with image-analysis techniques. The machine-learning step serves as the text-pixel or text-region verifier. It is usually applied after the image-analysis step to further reduce the false-positive rate. Chen et al. [2] used edge dilation to merge horizontal and vertical edges into regions, then applied an SVM model to classify these text regions. The assumption was that text lines contain short horizontal and vertical edges within a confined space; the dilation and merging of these edges make text appear as continuous regions, which are then classified using the trained SVM model. Wolf et al. [25] extended LeBourgeois’ idea by applying more geometrical and spatial heuristics to text-line identification; an SVM model was then trained for the final classification. Kim et al. [10] also presented a texture-based approach, using an SVM to report the probability of a pixel being part of a text texture and then applying the CAMSHIFT algorithm to the probability map to locate text regions.

The common trait of the image-analysis-based and the image-analysis plus machine-learning-based approaches is that they typically follow a two-step procedure. In the first step, potential text regions are identified. Then, in the second step, statistically driven and/or heuristics-based models are applied to refine the potential text regions and to determine the final ones.

1.2 Motivation

The first motivation for our work is that the number of images with embedded-text has increased rapidly, especially in web and email applications. Being able to detect text embedded in these images is a good starting point for further analyzing the received contents. One example is Unsolicited Commercial Email (UCE), also known as "spam", on the Internet. With the increasing importance of email and the incursions of Internet marketers, spam has become a major problem. In the past, researchers have addressed this problem as a text classification or categorization problem. Recently, spammers have been using embedded-text in images to avoid these text-based anti-spam filters. Without the ability to detect embedded-text, it will be very difficult to alleviate this problem.

The second motivation for our work is to improve the text-detection accuracy

beyond the previous two-step framework. We hope to keep the advantages of the

existing two-step framework while reducing or eliminating its disadvantages.

In previous embedded-text frameworks, the text-detection problem is divided into two sub-tasks. In the first step, potential text regions are identified using primarily image-analysis techniques. Then, in the second step, potential text regions are further refined using statistically driven models or heuristic rules based on spatial and geometrical constraints. One advantage of these frameworks is that the embedded-text detection problem can be divided into two easier sub-problems. Each step in the two-step framework deals with one of the sub-tasks, making the process more efficient. The first step locates potential text regions. The second step, which usually takes more time, refines and determines the final text regions. By separating the tasks, the slower step only has to deal with part of the image rather than the entire image. Another advantage is that false alarms can be further reduced as the text regions pass through each step.

A major disadvantage of the two-step framework is that the two steps depend on each other; hence, the overall detection rate is constrained by the lower of the two. For example, if the lower detection rate of the two steps is 80%, then the overall detection rate cannot surpass 80%.

Viola and Jones [22] proposed a rapid object-detection framework using rectangle features and a boosted cascade detector for the task of face detection. The rectangle features are position-dependent and somewhat rigid; they are suitable for capturing the differences in intensity among facial regions. The cascade detector is a degenerate decision tree which acts as a rejection-based classifier. In each stage of the cascade detector, the classifier is trained using AdaBoosting to pass almost every positive sample while rejecting most of the negative ones. The combination of stages yields a desirable ratio of a high detection rate to a low false-positive rate. Both the rectangle features and the cascade detector can be evaluated efficiently at every location and scale.

In our work, we focus on developing a robust embedded-text detection framework to locate artificial text boxes embedded in images, especially web and email images. Our goals are (1) efficiency, especially in detection speed, and (2) a favorable ratio of a high detection rate to a low false-positive rate. We approach the embedded-text detection problem from the angle of object detection. We define three sets of Position-Independent Features (PIF) and propose using a boosted cascade detector, similar to Viola and Jones’, as the text-detection engine. The PIF features can be evaluated efficiently. They are suitable for tackling the diverse and somewhat unpredictable shapes of text. We then propose a unified embedded-text detection framework using the cascade detector and PIF features. The proposed framework is capable of detecting horizontal text lines embedded within an image at every location and scale.

1.3 Contributions

The main contributions of this thesis are summarized as follows:

• We define three sets of position-independent features (PIF) which can be

computed efficiently for the task of text detection. The three sets of PIF

features are local edge-pattern, local edge-density, and global edge-density.

These PIF features are able to capture diverse types and sizes of fonts as

well as to differentiate edges of characters from those of random objects.

• We propose using a cascade detector for the task of text detection. The

proposed cascade detector is trained using the position-independent features

and BrownBoosting mentioned above.

• We propose a novel unified one-step embedded-text detection framework using the above cascade detector. This embedded-text detector does not require any pre-processing step and can perform full-image scans.


• We introduce the use of visual information for anti-spam filtering. By thoroughly analyzing a large collection of spam emails, we demonstrate that useful features and parameters can be derived from images in spam emails for the purpose of anti-spam filtering.

• Based on such visual features and parameters, we propose a novel visual-

based anti-spam filter. The proposed filter, used in conjunction with existing

text-based filters, can improve the filtering accuracy.

• We have successfully integrated the proposed anti-spam filter with Thunderbird and demonstrated very promising results.

1.4 Outline

The rest of this thesis is organized as follows. In Chapter 2, we present the details of the proposed embedded-text detector. In Section 2.1, we introduce the concept of the position-independent features as well as the details of feature selection and training. In Section 2.2, we present the detection framework and explain how to apply the cascade detector to the task of text detection. In Section 2.3, we show experimental results to evaluate the effectiveness of the position-independent features and to demonstrate the robustness of the proposed embedded-text detector.


In Chapter 3, we present the details of the visual-based anti-spam filter. In Section 3.1, we present the statistics of some visual parameters from a thorough analysis of more than 120K spam emails downloaded from SpamArchive [30]. In Section 3.2, we present the proposed filtering system in detail. Then, in Section 3.3, we present experimental results to show the effectiveness of the proposed anti-spam filter.

In Chapter 4, we present the conclusions of this thesis.


Figure 1.1: Embedded-Text in Images Example


Chapter 2

Embedded-Text Detection

At the heart of the proposed embedded-text detection system are the cascade detector and the Position-Independent Features (PIF). The cascade detector is a degenerate decision tree which consists of several stages organized in a sequential manner. Each stage, in turn, is a boosted classifier that is trained to pass almost every positive sample while rejecting as many negative ones as possible. The PIF features are designed to differentiate the edges of text from those of random objects. Both the cascade detector and the PIF features can be evaluated efficiently. Therefore, they can be applied to part of, or even the entire, image under consideration.

In this chapter, we present the architecture of the proposed embedded-text detection system. In Section 2.1, we discuss the details of the PIF features, feature selection, and cascade detector training, and we elaborate the rationale behind the design of the PIF features and the training strategy. In Section 2.2, we demonstrate how to apply the text-detection engine (the trained cascade detector and PIF features) in both the traditional two-step framework and the proposed unified one-step framework.

2.1 Cascade Detector

Viola and Jones [22] proposed a cascade framework that is capable of detecting faces rapidly while achieving high detection accuracy. There are three key contributions in their approach. The first is the introduction of a new image structure called the Integral Image, which allows the features to be computed efficiently. The second is the use of the AdaBoosting learning algorithm to select and combine a small number of critical features into efficient and yet strong classifiers. The third is a smart framework for organizing classifiers in a cascade manner, which allows background regions to be quickly discarded while spending more computation on promising face-like regions. Our purpose is to extend this idea to a new task: detecting embedded-text within images.
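The Integral Image idea can be illustrated with a minimal Python sketch (an illustration, not the thesis' implementation): each entry stores the sum of all pixels above and to its left, so the sum over any rectangle then needs only four table lookups, independent of the rectangle's size.

```python
def integral_image(img):
    """Build an integral image: ii[y][x] = sum of img over rows 0..y, cols 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, using at most four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total
```

This constant-time rectangle sum is what makes evaluating rectangle (and, later, PIF) features cheap at every location and scale.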


2.1.1 Position-Independent Feature - Motivation

Viola and Jones defined a set of rectangle features, which can be calculated by computing the differences of raw pixel sums among regions within the rectangle under consideration. More than 40K rectangle features can be extracted from a 24x24 training sample, and each feature value can be computed in fewer than 100 machine instructions using a data structure called the integral image. Although rectangle features have proven effective and efficient for the specific purpose of face detection, no work has been done so far on general object detection using this type of feature. The main reason is that the rectangle features carry position information (they are position-dependent), and thus the target object has to strictly follow a well-defined rigid structure. For example, the first feature selected by the AdaBoosting algorithm, as shown in Figure 2.1, measures the difference in intensity between the eyes and the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. However, in the task of text detection, due to the different fonts, sizes, and shapes of text, position-dependent features cannot capture consistent information across a set of samples.

Figure 2.1: Top two features selected by AdaBoosting in Viola’s work

Edge information is a dominant feature used by many researchers for embedded-text detection in images. Figure 2.2 shows the Canny edge detection result of a sample image. By comparing the edges from the text and the background, we make several key observations, as follows:

• Text edges always follow a few fixed patterns, such as vertical and horizontal lines, half-circles, and ’T-like’ conjunctions, while background edges are totally unpredictable.

• Text edges should be smooth and continuous, and thus edge points of an

individual character should be connected to each other. Therefore, if we

define a small sub-window around a text edge point, the number of edge points within the sub-window should fall within a certain range. Nevertheless, the number of edge points within a sub-window around a background edge point is completely random.

Figure 2.2: Canny edge detection result of a sample image

• For a sub-window containing a full character, the edge points often spread evenly over the entire sub-window. However, for a sub-window containing a background object, the edge points often cluster into small regions.


2.1.2 Position-Independent Feature - Definition

In this section, we introduce three different types of edge-based features corresponding to the three key observations mentioned above. In addition, the proposed features satisfy the following rules:

• Features must be able to differentiate edges of text from those of background

objects.

• Features do not include any position information, i.e., they are position-independent.

• Features can be computed efficiently.

Local Edge-Pattern

Figure 2.3: Edge coding, 3x3 sub-window, horizontal, and vertical pattern


To compute the local edge-pattern features, we first apply the Canny edge detector to the image under consideration and binarize the result, using ’0’ to represent non-edge points and ’1’ to represent edge points. For each edge point, if we define a 3x3 sub-window as shown in Figure 2.3a, we can use an 8-bit binary string to represent each edge pattern. Figure 2.3b and Figure 2.3c show the horizontal and vertical patterns; they represent a horizontal and a vertical line, respectively. Each binary number can be mapped to a decimal number. For example, the binary and decimal numbers for Figure 2.3b and Figure 2.3c are 01000010 (66) and 00011000 (24), respectively. By using this binary coding scheme and the 3x3 sub-window, we can extract and compute edge patterns efficiently. The total number of edge patterns defined by a 3x3 sub-window is 256. These edge patterns can represent horizontal, vertical, diagonal, curvy, and even conjunctional patterns. Algorithm 1 shows the local edge-pattern extraction workflow.
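The coding scheme can be sketched as follows. Note that the thesis fixes the bit ordering in Figure 2.3, which is not reproduced here; the raster order assumed below is an illustration, so the numeric codes it yields may differ from the 66/24 examples in the text.

```python
# Assumed raster-order neighbor offsets (NW, N, NE, W, E, SW, S, SE), MSB first.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def edge_pattern_code(edge_map, y, x):
    """8-bit code describing which of the eight neighbors of edge point
    (y, x) are themselves edge points; out-of-bounds neighbors count as 0."""
    code = 0
    for dy, dx in NEIGHBORS:
        code <<= 1
        ny, nx = y + dy, x + dx
        if 0 <= ny < len(edge_map) and 0 <= nx < len(edge_map[0]):
            code |= edge_map[ny][nx]
    return code

def pattern_histogram(edge_map):
    """Histogram of the 256 pattern codes over all edge points; counting
    frequencies rather than locations makes the feature position-independent."""
    hist = [0] * 256
    for y, row in enumerate(edge_map):
        for x, v in enumerate(row):
            if v:
                hist[edge_pattern_code(edge_map, y, x)] += 1
    return hist
```

Under this assumed ordering, the interior point of a horizontal line has left and right neighbors set, giving the code 00011000.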

Algorithm 1 Local Edge-Pattern Extraction
INPUT: Image
OUTPUT: Local Edge-Pattern Features
Step 1: Produce an edge map using the Canny edge detector
Step 2: Obtain a pattern map by evaluating each edge point
Step 3: Compute integral images based on the pattern map
Step 4: Extract and evaluate histogram features

Theoretically, we could compute 256 integral images, one for each edge pattern. In a real implementation, however, we do not need to compute all of them; instead, we only need to compute the features that are actually selected during cascade training. It should be noted that multiple integral images can be generated in a single scan, so the additional cost is mainly in memory, not in computation. Moreover, we use a histogram representation for the features of a rectangle under evaluation. This design implicitly eliminates the position information: we only consider the frequency of a pattern of interest within a sub-window, not the position of that pattern. Therefore, we refer to our features as Position-Independent Features (PIF), in contrast to Viola and Jones’ rectangle features.

Local Edge-Density

Based on the second observation in Section 2.1.1, we concluded that the local edge-density around each edge point is useful for discriminating the edges of text from those of background objects. Therefore, we propose the local edge-density feature set to capitalize on this information. We define a number of small sub-windows around each edge point and count the number of edge points within each sub-window. Empirically, we adopt sub-windows of different sizes, from 3x3 to 7x7. Using an algorithm similar to the one mentioned above, we can calculate the integral images based on the local edge-density maps with respect to each pre-defined window. In order to eliminate position information, we adopt a normalized histogram representation as well.
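A minimal sketch of the local edge-density feature follows. The 3x3 to 7x7 window sizes come from the text, but the binning of counts into a fixed-size normalized histogram is an assumption made for illustration.

```python
def local_edge_density_histogram(edge_map, win=3, bins=10):
    """For each edge point, count the edge points inside a win x win window
    centered on it, then accumulate the counts into a normalized histogram
    (position-independent: only the counts matter, not where they occur)."""
    h, w = len(edge_map), len(edge_map[0])
    r = win // 2
    hist = [0.0] * bins
    max_count = win * win
    for y in range(h):
        for x in range(w):
            if not edge_map[y][x]:
                continue
            c = sum(edge_map[ny][nx]
                    for ny in range(max(0, y - r), min(h, y + r + 1))
                    for nx in range(max(0, x - r), min(w, x + r + 1)))
            hist[min(bins - 1, c * bins // (max_count + 1))] += 1
    total = sum(hist) or 1.0
    return [v / total for v in hist]
```

For text edges the counts cluster in a narrow band of bins, whereas random background edges spread their mass across the histogram.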

Global Edge-Density

Besides the local edge-density, we have also observed that the edges of text in the rectangle under consideration often form a well-proportioned distribution. Therefore, we propose the global edge-density feature set to capitalize on this finding. For an NxN rectangle, we use an (N/2)x(N/2) sub-window to perform a sliding scan. Each sub-window steps N/4 pixels in the x and/or y directions from the previous one. As a result, a total of nine sub-windows are scanned. Based on the integral image calculated from the edge map, we can extract nine numbers from the nine sub-windows. These nine numbers provide an intuitive cue for the global distribution of the edge points. A potential problem, however, is that they are not position-independent, since each sub-window carries position information with respect to the rectangle. Therefore, we sort these nine numbers to make them position-independent. Moreover, we can produce more features from the nine basic numbers, such as the addition or subtraction of any pair of them. Because this feature set requires the additional cost of sorting, its computation cost is significantly higher. In Section 2.1.3, we introduce a hierarchical training algorithm to minimize the computation cost.
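The nine-sub-window scan and the sorting step can be sketched as below (a direct pixel count is used instead of the integral-image lookup the thesis employs, purely to keep the example short; N is assumed divisible by 4).

```python
def global_edge_density(edge_map):
    """For an N x N edge map, count edge points in nine (N/2) x (N/2)
    sub-windows stepping N/4 pixels, then sort the nine counts to
    discard their position information."""
    n = len(edge_map)
    half, quarter = n // 2, n // 4
    counts = []
    for sy in (0, quarter, 2 * quarter):       # three vertical offsets
        for sx in (0, quarter, 2 * quarter):   # three horizontal offsets
            counts.append(sum(edge_map[y][x]
                              for y in range(sy, sy + half)
                              for x in range(sx, sx + half)))
    return sorted(counts)                      # position-independent
```

A well-proportioned text region yields nine similar counts, while a background object concentrated in one corner yields a highly skewed sorted list.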

2.1.3 Cascade Detector Training

The cascade detector is a degenerate decision tree in which several simple classifiers (stages) are organized in a sequential manner. During the testing phase, the stages are applied in sequence to a region of interest until the candidate is rejected at some stage or all the stages are passed. In general, such a hierarchical structure, i.e., a cascade, can be learned using a boosting algorithm, which is a method of combining several weak learners into a strong learner.
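The testing-phase control flow amounts to a short rejection loop; the sketch below uses hypothetical stage score functions and thresholds to show the structure, not the trained detector itself.

```python
def cascade_classify(stages, features):
    """Apply boosted stage classifiers in sequence: reject on the first
    stage whose score falls below its threshold, accept only if all pass."""
    for score_fn, threshold in stages:
        if score_fn(features) < threshold:
            return False   # rejected early; most background exits here
    return True            # survived every stage: report a detection

# Usage sketch with two toy stages (hypothetical scores and thresholds):
stages = [
    (lambda f: f["edge_count"], 5),        # cheap stage runs first
    (lambda f: f["pattern_score"], 0.7),   # costlier stage runs last
]
```

Because most candidate windows are rejected by the cheap early stages, the expensive later stages run on only a small fraction of the image.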

In Viola and Jones’ framework, a variant of AdaBoosting, discrete AdaBoosting, was adopted to train the cascade detector for the task of face detection. However, for the specific purpose of embedded-text detection, we adopted not only a different boosting algorithm, BrownBoosting [8], but also a different training strategy, hierarchical training. The reasons are explained in detail as follows:

• BrownBoosting: Different boosting algorithms use different strategies to

re-weight the training samples. In discrete AdaBoosting, the weight of a

sample is adjusted based on its classification result. If a sample is correctly

classified, then the weight of this sample will be reduced for the next round

training. On the other hand, if a sample is misclassified, the weight of the

21

Chapter 2. Embedded-Text Detection

sample will remain the same. While the success of AdaBoosting is indisputable, there is rising awareness that the algorithm is quite susceptible to noise. Dietterich gives a detailed explanation of this behavior in [9]. He shows that, in general, boosting tends to assign much higher weights to noisy samples than to legitimate ones. As a result, the hypotheses generated in the later iterations often over-fit the noisy samples. In our embedded-text detection, this weakness significantly degrades the performance of the resulting classifier. In Figure 2.4, we show the edge maps of two letters, T and H, as well as three examples of noisy objects. We observed an interesting phenomenon: if we consider only the features proposed in Section 2.1.2, i.e., local edge-pattern, local edge-density, and global edge-density, it is very difficult to differentiate the edges of text from those of noisy background objects. Because (1) some negative samples are very similar to positive ones (as shown in Figure 2.4), and (2) millions of negative samples are required during cascade training, a careful and thorough check of every sample is practically impossible; if we follow the original AdaBoosting algorithm, the final classifier will inevitably be less than desirable. Therefore, we borrow a key idea from BrownBoosting [8], another variant of AdaBoosting. We use a non-monotone function to re-weight the misclassified samples: for small values of margin, the weight increases in a way very similar to AdaBoosting; however, from some point onwards, the weight decreases to restrain the impact caused by noisy examples (e.g., confusing background objects).
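The non-monotone re-weighting described above can be illustrated with a Gaussian-shaped function of the sample margin. This is a minimal sketch of the idea, not the thesis' exact function; the `peak` and `width` constants are assumptions chosen for illustration:

```python
import math

def nonmonotone_weight(margin, peak=-0.2, width=0.5):
    """Gaussian-shaped re-weighting as a function of the sample margin.
    As the margin shrinks from 0 toward `peak`, the weight grows, much
    like AdaBoost's re-weighting; past `peak` (strongly negative margins,
    i.e. likely noise) the weight falls off again instead of growing
    without bound.  `peak` and `width` are illustrative constants."""
    return math.exp(-((margin - peak) ** 2) / (2 * width ** 2))
```

With such a curve, a borderline sample (margin slightly negative) receives a large weight, while an extremely misclassified sample, which is likely a noisy or confusing background object, is restrained.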

• Hierarchical Training: We noticed that the computation costs of the three proposed PIF feature sets differ considerably. The local edge-pattern feature set is as cheap to compute as the rectangle features proposed in Viola and Jones' work, while the local and global edge-density feature sets require more time. During the testing phase, the number of samples to be tested is reduced as they pass through the cascade; it is therefore important to keep the early stages as fast as possible. To avoid using the expensive feature sets in the early stages of the cascade, we adopted a hierarchical training strategy. The training procedure is as follows: (1) for the first eleven stages, we used only the local edge-pattern feature set; (2) for the 12th and 13th stages, we added the second feature set, local edge-density; (3) from the 14th stage on, we applied all three feature sets until the training process converged and the final stopping criterion was reached. Our experimental results showed that the hierarchical training scheme significantly boosted the detection performance without much compromise in total computation cost. In other words, the hierarchical training scheme helps the boosting algorithm find the best combination


of the PIF feature sets by balancing the detection performance and computation cost.
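The stage-wise feature pools used in hierarchical training can be sketched as follows (the stage numbering follows the text: stages 1-11, 12-13, and 14 onward; the function name is illustrative):

```python
def feature_pool_for_stage(stage):
    """Return the PIF feature sets available to the boosting algorithm
    at a given cascade stage under the hierarchical training strategy."""
    pools = ["local_edge_pattern"]          # stages 1-11: cheapest set only
    if stage >= 12:
        pools.append("local_edge_density")  # stages 12-13: add second set
    if stage >= 14:
        pools.append("global_edge_density") # stage 14 on: all three sets
    return pools
```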

Figure 2.4: Edge map of text and confusing background objects

2.1.4 Experiments and Discussions

For the cascade-detector training, we used 5,000 positive samples and 5,000 negative samples in each stage. The positive samples were manually labeled and selected from a pool of web images; each positive sample contained a single English letter. The negative samples were dynamically generated from a selected set of Corel images [4]. Table 2.1 shows the details of the 18-stage cascade detector and its stage classifiers. The target criterion for each stage was a 0.97 detection rate and a 0.60 false-positive rate. In other words, each stage classifier must pass at least 97% of the positive samples while rejecting more than 40% of the negative ones.

To save computation cost, the local edge-density and global edge-density feature sets were only used in later stages. As shown in Table 2.2, once a new feature set was added, the number of features selected in that stage was greatly


Table 2.1: 18-stage cascade detector using BrownBoosting

Stage   Features   Detection Rate   False Positive

Features: Local Edge-Pattern
  1          4       0.991163         0.478811
  2          5       0.978423         0.274170
  3          9       0.965846         0.151594
  4         11       0.953431         0.086591
  5         12       0.941941         0.053669
  6         14       0.929834         0.031647
  7         24       0.917882         0.018611
  8         29       0.906083         0.011072
  9         25       0.895164         0.006827
 10         35       0.883658         0.004222
 11         98       0.872299         0.002596

Features: Local Edge-Pattern, Local Edge-Density
 12         29       0.861087         0.001609
 13         98       0.850019         0.000988

Features: Local Edge-Pattern, Local Edge-Density, Global Edge-Density
 14         10       0.839263         0.000584
 15         30       0.828475         0.000361
 16         52       0.817826         0.000223
 17        103       0.807314         0.000137
 18        152       0.796936         0.000084

reduced. This behavior indicated that the classification accuracy and efficiency

can be further improved by combining the three PIF feature sets.

In Figure 2.5, we show the top three features selected in the first round of boosting. The depicted features are considered the most discriminative ones. An intuitive explanation of these three patterns is as follows: Figure 2.5a represents a horizontal line, Figure 2.5b represents a vertical line, and Figure 2.5c


Table 2.2: Feature sets comparison

Configuration              Stage   Features   Detection Rate   False Positive
w/ Local Edge-Density        12        29       0.861087         0.001609
w/o Local Edge-Density       12       216       0.861087         0.001605
w/ Global Edge-Density       14        10       0.839263         0.000584
w/o Global Edge-Density      14       105       0.839092         0.000601

corresponds to a half circle. These patterns appear frequently as parts of the text

edges.

Figure 2.5: Top three features selected by BrownBoosting

Figure 2.6 shows several negative samples that were misclassified when the local edge-pattern feature set was used alone. These samples were correctly classified after the local and global edge-density feature sets were added. This example clearly demonstrates the effectiveness of the edge-density feature sets and supports the rationale behind their design.

Figure 2.6: Tough cases of negative training samples

As shown in Table 2.1, the final detection rate is around 80%. This is relatively low compared to that of face-detection applications (the common detection rate for face detection is around 90%). However, faces appear in images as independent objects, while text objects appear as regions. This implies that if there is a text object detected at some location, then there should be text objects of similar size at nearby positions. These additional spatial and geometric regularities can be used not only to enhance the overall detection accuracy but also to reduce the overall false-positive rate. In the following section, we present the details of our embedded-text framework.

2.2 Detection Framework

Here we describe two detection frameworks: the proposed unified one-step framework and the traditional two-step framework. In the one-step framework, the cascade detector acts as a full-image scanner, while in the two-step framework it acts as a text-region classifier. In the following subsections, we present both frameworks in detail.

2.2.1 One-Step Framework

In the one-step framework, the cascade detector is applied to the entire image at every location and at various scales. For the task of text detection, we use a sliding window to scan the entire image. The scanning starts with an initial window size. After each full-image scan, the sliding window is enlarged by 10% until it exceeds the stopping window size or the image size. In our experiments, we found that an initial window size of 12x10 and a stopping window size of 72x60 are sufficient to detect embedded text in web images. Once a sub-window is detected as text, its neighboring windows are examined and aggregated to form the text region.
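The scale schedule described above can be sketched as follows. This is a minimal illustration; the helper name and the assumption that both window dimensions grow together are ours, not the thesis':

```python
def scan_scales(img_w, img_h, start=(12, 10), stop=(72, 60), growth=1.10):
    """Enumerate the window sizes used for the multi-scale scan: start
    at 12x10 and grow both dimensions by 10% per full-image pass until
    the stopping size (or the image border) would be exceeded."""
    w, h = start
    sizes = []
    while w <= min(stop[0], img_w) and h <= min(stop[1], img_h):
        sizes.append((int(w), int(h)))
        w, h = w * growth, h * growth
    return sizes
```

For a typical web image, this yields roughly twenty window sizes between the initial and stopping scales.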

In addition to the cascade detector, there are three major components in the proposed one-step framework: the locality filter, the search algorithm, and the post-processing. The locality filter centers the detection window on the text object. The search algorithm helps the cascade detector find the text line. The post-processing then cleans up the detection results and merges them into regions. These components guide the cascade detector to the right starting points


and optimize the detection performance during the full-image scan. They are discussed in detail in the following sub-sections.

Figure 2.7: One-Step Framework Flow

During the scanning, the corresponding PIF features of the sub-window are

calculated using integral images and their values are sent to the cascade detector

to determine whether the sub-window contains text.

A. Locality Filter

Since the PIF features used in the cascade detector are completely position-independent, they are insensitive to changes in the locality of the text within a window. An example is depicted in Figure 2.8: the cascade detector and PIF features would not be able to tell the difference between the left and right samples without additional help. To address this, we define three locality filters


to tackle the horizontal and vertical shifts of text objects in the detection sub-windows: the central, vertical, and horizontal locality filters. The locality filters serve the following purposes: (1) to center the detection sub-window on the edge-rich region by eliminating sub-windows that do not contain enough edge points in the center region; (2) to reduce the number of detected sub-windows around the same text object; (3) to reject sub-windows that do not lie entirely on the text line by eliminating sub-windows with vertical shifts; (4) to pinpoint the start and end positions of a text line by eliminating sub-windows with left or right horizontal shifts.

Figure 2.8: Example of Locality Shifts

The locality filters act as a gatekeeper and can be considered the first stage of the cascade detector. Although the locality filters are not trained using a boosting algorithm, they were empirically verified to meet the same detection-rate criterion, i.e., to maintain a near-100% detection rate.


The evaluation of these filters is based on rectangle features that can be calculated from the integral image of edge points in the original image. Figure 2.9 shows the rectangle features used in the three locality filters. These rectangle features are similar to some of the extended features in [14].
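The integral-image trick that makes these rectangle features cheap can be sketched as follows: once the cumulative sums are built, the number of edge points inside any rectangle costs only four array lookups. Helper names are illustrative, not from the thesis:

```python
import numpy as np

def integral_image(edge_map):
    """Cumulative sums of a 2-D 0/1 edge map, so that any rectangle
    sum can later be read off with four lookups."""
    return edge_map.cumsum(0).cumsum(1)

def rect_sum(ii, x0, y0, x1, y1):
    """Number of edge points in the half-open rectangle
    [x0, x1) x [y0, y1), computed from the integral image `ii`."""
    total = ii[y1 - 1, x1 - 1]
    if x0 > 0:
        total -= ii[y1 - 1, x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1, x1 - 1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1, x0 - 1]
    return int(total)
```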

The central locality-filter examines the number of edge points in the center region of a sub-window. The center region covers 81% of the sub-window's area (9/10 of its width and height). The sub-window is rejected if the number of edge points in the center region is lower than a predefined threshold. In our implementation, we set the threshold to 60% of the total number of edge points in the sub-window.

The vertical locality-filter divides the sub-window into top, middle, and bottom regions, then examines the number of edge points in these regions. The height of each region is set to 10% of the sub-window height. If the number of edge points in any of the regions is zero, the sub-window is rejected. Figure 2.10 shows an example of the vertical locality-filter on the letter A.

The horizontal locality-filter divides the sub-window into left and right regions, then examines the number of edge points in them. The width of each region is set to 10% of the sub-window width. If the number of edge points in one of the regions is zero while the other is not, the sub-window is rejected. The reason is that some letters, such as the letter I, are very narrow, while others, such as the letters A and D, are nearly square. If the detection sub-window aligns with the letter, then the letter should overlap with both regions (square letters) or with neither region (narrow letters). Figures 2.11, 2.12, and 2.13 show examples of the horizontal locality-filter on the letters A, I, and L, respectively.
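Taken together, the three checks can be sketched directly on a sub-window's edge map. The thresholds follow the text (60% central mass, 10% border bands), but the exact placement of the middle band and the overall structure are our assumptions:

```python
import numpy as np

def passes_locality_filters(edge_map):
    """Apply the central, vertical, and horizontal locality checks to a
    sub-window given as a 2-D 0/1 edge map.  Returns True if the window
    survives all three filters."""
    h, w = edge_map.shape
    total = edge_map.sum()
    # Central filter: >= 60% of edge points inside the 0.9w x 0.9h center.
    mh, mw = int(0.05 * h), int(0.05 * w)
    center = edge_map[mh:h - mh, mw:w - mw].sum()
    if total == 0 or center < 0.6 * total:
        return False
    # Vertical filter: top, middle, bottom bands (10% of height each)
    # must all contain at least one edge point.
    band = max(1, int(0.1 * h))
    mid = (h - band) // 2
    for rows in (edge_map[:band], edge_map[mid:mid + band], edge_map[h - band:]):
        if rows.sum() == 0:
            return False
    # Horizontal filter: 10%-wide left and right bands must agree
    # (both empty, for narrow letters, or both non-empty, for square ones).
    col = max(1, int(0.1 * w))
    left, right = edge_map[:, :col].sum(), edge_map[:, w - col:].sum()
    return not ((left == 0) != (right == 0))
```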

Figure 2.9: Locality-Filter Rectangle Features

Figure 2.10: Vertical Locality-Filter Example of Letter A


Figure 2.11: Horizontal Locality-Filter Example of Letter A

Figure 2.12: Horizontal Locality-Filter Example of Letter I

B. Search Algorithm

In face detection, since faces have no correlation with each other, an exhaustive scan of the entire image is necessary in order to detect all the independent faces in the image. Text objects, on the other hand, appear in an image as horizontal regions and have strong spatial and geometrical correlation with each other along the same line. If a text object is detected at a particular location, then its neighboring sub-windows are likely to contain text objects of the same height as well. This spatial and geometrical coherence can be used to (1)


Figure 2.13: Horizontal Locality-Filter Example of Letter L

verify whether a detected sub-window is indeed text, and hence reduce false positives; and (2) effectively detect text objects along the same line. In this work, we propose an efficient search algorithm that captures the spatial and geometrical coherence among text objects on the same text line. The search algorithm works as follows:

• For each sub-window size, WxH, we divide the image into 20 bands horizontally. The width of each band is equal to the width of the sub-window, W. Depending on the sub-window width W, each band may or may not overlap with its neighbors.

• For each band, we perform a vertical sliding scan. Each vertical step moves 2 pixels down from the previous one, all the way to the bottom of the image.


• During the vertical sliding scan, if a sub-window is detected as text in band I, its ten neighboring sub-windows to the left and right of the detected location are scanned, each stepping by half of the sub-window width. If fewer than half of the neighboring sub-windows contain text, the original sub-window is regarded as a false positive and discarded. This is the text-verification step.

• If the sub-window is verified as a text sub-window, we perform a horizontal sliding scan from band I toward band I-1 and from band I toward band I+1, with each sub-window stepping by half of the sub-window width to the left or right. Sub-windows detected consecutively are merged into a single text rectangle. This is the horizontal-sliding step.

• The vertical and horizontal sliding scans are repeatedly applied to the image until the entire image has been visited.

Figure 2.14 shows the diagram of the search algorithm. The search algorithm

makes use of the spatial and geometrical coherence between text objects along the

same line.
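The text-verification step can be sketched as follows. Here `is_text(x, y)` stands in for the cascade detector, and the decomposition into a standalone helper is an assumption; the neighbor count follows the text's "ten sub-windows to the left and to the right":

```python
def verify_text_window(is_text, x, y, w, per_side=5):
    """Verification step of the search algorithm: test `per_side`
    neighbors on each side of a detection at (x, y), stepping by half
    the sub-window width w, and keep the detection only if at least
    half of the neighbors are also classified as text."""
    step = w // 2
    offsets = [i * step for i in range(1, per_side + 1)]
    votes = [is_text(x - d, y) for d in offsets] + \
            [is_text(x + d, y) for d in offsets]
    return sum(votes) >= len(votes) / 2
```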

Figure 2.16 shows the raw detection results for the images in Figure 1.1 after the vertical and horizontal sliding scans.


Figure 2.14: Diagram of the search algorithm

C. Post-Processing

Due to the exhaustive scan over the entire image, there are usually multiple text rectangles along the same text line. These rectangles may be neighboring, overlapping, or duplicates. Two rectangles are overlapping if their bounding boxes overlap. Two rectangles are neighboring if they do not overlap but the distance between them is within a threshold (in our application, 10% of the width of the shorter rectangle). A rectangle is a duplicate if the major part of it is covered by another one (in our application, we set the threshold to 80%).

In the post-processing step, the following actions are carried out: (1) overlapping text rectangles along the same line are merged into a single rectangle; (2) duplicate rectangles are removed; (3) single rectangles that are too narrow or too short are removed.

As in [22], detected rectangles are partitioned into subsets based on their height and centroid location on the Y-axis. If the bounding boxes of two rectangles in the same subset overlap, they are merged by enlarging one rectangle to include the other. This process is repeated until no overlapping text rectangles remain. After the merging process, each partition usually yields a single text line, occasionally a few.
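The repeated merging of overlapping rectangles can be sketched as follows. This is a quadratic illustration of the idea; the thesis does not specify the exact data structure:

```python
def merge_overlapping(rects):
    """Repeatedly merge overlapping axis-aligned rectangles, given as
    (x0, y0, x1, y1) tuples, until no pair overlaps.  Each merge
    replaces two rectangles with their bounding union."""
    def overlap(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

    rects = list(rects)
    merged = True
    while merged:
        merged = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                a, b = rects[i], rects[j]
                if overlap(a, b):
                    rects[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del rects[j]
                    merged = True
                    break
            if merged:
                break
    return rects
```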

Figure 2.16 shows examples of the raw detection results, and Figure 2.17 shows the final detection results after post-processing.

2.2.2 Two-Step Framework

The cascade detector can also be applied in the two-step framework. In this case, the cascade detector works as a text-region classifier. There are two advantages to using the cascade detector for classification. First, the cascade detector is relatively fast. Second, it can achieve a very high ratio of detection rate to false-positive rate.

To classify a text region, we use a sliding window to scan the entire region. The height of the sub-window is set to the height of the text region, and its width is set to 1.2 times the height. Each subsequent sub-window steps 2 pixels from the previous one. The final decision for the text region is determined by a majority vote over the individual sub-windows.
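The region-classification pass above can be sketched as follows, with `is_text(x)` standing in for the cascade detector applied at horizontal offset x (the helper decomposition is illustrative):

```python
def classify_region(region_w, region_h, is_text, step=2):
    """Two-step classification sketch: slide a sub-window of height
    region_h and width 1.2 * region_h across the region in 2-pixel
    steps, then decide by majority vote over the sub-windows."""
    win_w = int(1.2 * region_h)
    votes = [is_text(x) for x in range(0, max(1, region_w - win_w + 1), step)]
    return sum(votes) > len(votes) / 2
```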

Figure 2.15: Two-Step Framework Flow

2.3 Experiments

We have designed and conducted experiments to evaluate the following:

• The effectiveness of the locality-filters.

• The comparison between the cascade detector and other learning-based classifiers, particularly Support Vector Machines (SVM).

• The performance comparison between the proposed top-down framework and traditional bottom-up approaches.

• The computation-cost comparison between the proposed top-down framework and traditional bottom-up approaches.


2.3.1 Metrics

To evaluate the performance of the proposed embedded-text detector, we defined two types of metrics: pixel-based and region-based. Before diving into the definitions of the metrics, we define the following types of text regions:

• detected region: a text region reported by the text detector.

• matched region: a detected text region that covers more than 90% and less than 110% of the ground-truth region. In other words, a detected text region matches a ground-truth region if they are relatively similar in position and size.

• refined region: a matched region whose top and height are set to those of the matching ground-truth region. This is done because there is sometimes a small shift between the detected region and the ground truth; the refined region compensates for this shift and better reflects the detection results.

The pixel-based metric is measured as the ratio of the number of correctly detected pixels to the total number of ground-truth pixels. DRpixel (Detection Rate) and FPpixel (False Positive) are defined as follows:


DR_pixel = (area of refined regions) / (area of ground-truth regions)        (2.1)

For the pixel-based metric, false positives come from two sources: (1) if a detected region is larger than the ground-truth region it matches, the region is considered matched, but the excess area is counted as false positive; (2) if a detected text region does not match any ground-truth region, the entire region is counted as false positive.

FP_pixel = (area of mismatched regions) / (area of the image)        (2.2)

The region-based metric is measured as the ratio of the number of correctly detected regions to the total number of ground-truth regions. The false positive is defined as the ratio of the number of mismatched regions to the number of detected regions. DRregion (Detection Rate) and FPregion (False Positive) are defined as follows:

DR_region = (# of matched regions) / (# of ground-truth regions)        (2.3)

FP_region = (# of mismatched regions) / (# of detected regions)        (2.4)
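Equations (2.1) through (2.4) translate directly into code; the argument names below are illustrative:

```python
def detection_metrics(refined_area, mismatched_area, gt_area, image_area,
                      matched, mismatched, gt_regions, detected):
    """Compute the four metrics of equations (2.1)-(2.4): pixel-based
    DR/FP from areas, region-based DR/FP from region counts."""
    dr_pixel = refined_area / gt_area            # eq. (2.1)
    fp_pixel = mismatched_area / image_area      # eq. (2.2)
    dr_region = matched / gt_regions             # eq. (2.3)
    fp_region = mismatched / detected            # eq. (2.4)
    return dr_pixel, fp_pixel, dr_region, fp_region
```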


2.3.2 Experimental Results

Locality-Filter Verification

In this experiment, we verified the performance of the locality-filters. We prepared three sets of data: the positive set, the vertical-shift set, and the horizontal-shift set, summarized as follows:

• Positive set: 1,000 positive samples randomly selected from the cascade

detector training samples.

• Vertical-shift set: 5,000 samples generated from the positive set. Each positive sample was used to generate five samples by randomly shifting it vertically by up to half of its height.

• Horizontal-shift set: 5,000 samples generated from the positive set. Each positive sample was used to generate five samples by randomly shifting it horizontally by up to half of its width.

Table 2.3: Locality-Filter Verification

Filter              Positive Set   Vertical-Shift Set   Horizontal-Shift Set
Central Filter         100.00            91.88                 95.96
Vertical Filter         99.30            16.08                 99.30
Horizontal Filter       99.60            99.90                 56.02
Overall                 98.90             7.00                 48.70


Table 2.3 shows the experimental results (the numbers are the percentage of samples passing each filter). As we can see, the proposed locality-filters maintain a near-99% detection rate while rejecting over 50% of the sub-windows with vertical or horizontal shifts. The vertical locality-filter is especially powerful in rejecting sub-windows with vertical shifts, which helps the cascade detector find the text line more precisely.

Framework Experiments

In this experiment, we compared the performance of different frameworks: image-analysis based [6, 12, 13, 15, 23], image analysis + machine learning [2, 25, 38], image analysis + cascade detector, and the proposed framework. To conduct this experiment, we extracted 2,000 images from a pool of spam emails for training and testing. We manually labeled each image and identified 4,461 text regions. We reserved 3,000 text regions (1,435 images) for training Support Vector Machine (SVM) classifiers; the rest of the images were used for testing. The datasets are summarized as follows:

• Testing dataset: 565 email images, with a total of 1,461 labeled regions.

• SVM training positive dataset: 1,435 email images, with a total of 3,000 labeled regions.

• SVM training negative dataset: 3,000 regions randomly cropped from a pool of Corel images.


We implemented two commonly used features to train the SVM classifiers: the distance map [2] and the star-like pixel pattern [38]. The distance map is generated by transforming every pixel from its raw value to its distance to the closest edge point. The star-like pixel pattern is extracted directly from the gray levels using a star-shaped pattern. The SVM package we used for this experiment was SVMTorch [3]. The two implementations are referred to as SVM1 and SVM2, respectively, in Table 2.4 and Table 2.5.
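The distance-map idea can be sketched as follows. This is a brute-force illustration of the transformation in [2]; production implementations would use a linear-time distance transform instead:

```python
import numpy as np

def distance_map(edge_map):
    """Replace each pixel with its Euclidean distance to the nearest
    edge point of a 2-D 0/1 edge map (brute force, for clarity)."""
    ys, xs = np.nonzero(edge_map)
    h, w = edge_map.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.hypot(ys - y, xs - x).min()
    return out
```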

Table 2.4: Performance Comparison

Framework                  Pixel-based (DR/FP)   Region-based (DR/FP)
Proposed                       93.60/1.13            86.43/3.70
Image Analysis                 90.68/3.54            86.31/14.50
Image Analysis + SVM1          87.11/3.40            82.71/15.13
Image Analysis + SVM2          86.82/3.39            82.91/15.89
Image Analysis + Cascade       87.76/2.05            83.62/9.89

Table 2.4 shows the comparison results. We make three observations: (1) the proposed embedded-text detection framework clearly outperforms the rest; (2) for the frameworks based on an image-analysis step, the detection rate is bounded by that step; (3) the cascade-detector-based frameworks yield relatively fewer false positives than the others.


2.3.3 Computation Cost

In this experiment, we compared the computation cost of the different frameworks. The experiment was carried out on a Pentium-4 2.8GHz Linux machine with 1GB of memory. Table 2.5 shows the experimental results. As indicated in the table, the proposed embedded-text detection system took only 0.32 seconds per image, while the SVM-based systems required 1.98 (SVM1) and 0.54 (SVM2) seconds per image.

Table 2.5: Computation Cost Comparison

Framework                  Step 1   Step 2   Overall
Proposed                     n/a      n/a     0.32s
Image Analysis              0.11s     n/a     0.11s
Image Analysis + SVM1       0.11s    1.87s    1.98s
Image Analysis + SVM2       0.11s    0.43s    0.54s
Image Analysis + Cascade    0.11s    0.26s    0.37s

Comparing the SVM and cascade classifiers alone, it is clear that the cascade approach is much more efficient than the SVM-based ones. Because the star-like pixel pattern in SVM2 is based on raw pixels, its cost is close to that of the cascade; however, the cascade still wins because it generally requires less computation than an SVM.


2.3.4 Memory Consumption

The memory consumption of the embedded-text detector is higher than that

of the face detector. In the face detection problem, rectangle features can be

calculated from a single integral image. In the proposed text detector, multiple

integral images have to be built.

To reduce the memory consumption, the image can be divided into sub-regions

horizontally or vertically. For example, if the image is divided into four regions,

the memory consumption can be reduced to 1/4 of the original without much

computation overhead.


Figure 2.16: Embedded-Text Detection Raw Results


Figure 2.17: Embedded-Text Detection Final Results


Chapter 3

Visual-Based Anti-Spam Filtering

With the increasing importance of email and the incursions of Internet marketers, unsolicited commercial email (also known as spam) has become a major problem on the Internet.

Junk email has been recognized as a problem since 1975 [27], but it was not a serious concern until marketers began to flood the system, overtaxing the resources of Internet Service Providers (ISPs). Since the late '90s, several anti-spam filtering solutions have been proposed. In general, these approaches treat email spam filtering as a text classification or categorization problem, employing various machine learning techniques. In [31], the authors proposed using a Naive Bayesian classifier to filter junk emails. Some researchers have also suggested using Support Vector Machines (SVM) [33] and decision trees


[34] for this task. These text-based approaches have achieved remarkable accuracy in filtering spam emails.

However, there are two major limitations to these text-based approaches. First, spammers often use various tricks to confuse text-based anti-spam filters [28]; examples include text obfuscation, random space or word insertion, HTML layout tricks, and text embedded in images. Second, as the scale and capacity of the Internet continue to grow, the type of information in emails has become more diverse, and the genre of email content has moved from text-only to multimedia-enriched. These limitations greatly reduce the effectiveness of existing text-based anti-spam filters.

The key issue behind these challenges is that the content of emails has shifted from text-based to visual-based. On one hand, legitimate senders are adding more multimedia content, particularly images, to enrich their messages. On the other hand, spammers mask spam emails with machine-unreadable content, such as images, and employ HTML-based tricks to confound text-based anti-spam filters. The raw content of these spam emails may not make any sense, but the key message becomes visible when recipients open them.

Since visual information is becoming more prevalent in emails, it is increasingly necessary to use such information to achieve high accuracy in anti-spam filtering. Our research team has investigated ways of using visual information, particularly images, in anti-spam filtering. We studied spam emails to analyze the characteristics of the visual information they contain. One noticeable characteristic is the type of images used in spam emails: these images are usually artificially generated and contain embedded text (i.e., text boxes embedded into image files). In this work, we propose a novel anti-spam filter, which classifies multimedia-enriched emails based on their visual information. Specifically, the proposed anti-spam filter extracts image features and uses a one-class SVM classifier to decide whether an unseen email belongs to the spam category. Experimental results show that, for emails containing image data, the proposed anti-spam filter can achieve a detection rate of 80% or more with less than 1% false positives. In addition, the proposed anti-spam filter can work with existing text-based filters. In comparison with the Bayesian text-based filter used in Thunderbird (a Mozilla mail client [29]), our proposed filter improves the spam detection rate from 47.7% to 83.9% on the validation set derived from the SpamArchive dataset [30].

The main contributions of this work are summarized as follows:

• We introduce the use of visual information for anti-spam filtering. By thoroughly analyzing a large collection of spam emails, we demonstrate that useful features and parameters can be derived from images in spam emails for the purpose of anti-spam filtering.

• Based on such visual features and parameters, we propose a novel visual-based anti-spam filter. The proposed filter, used in conjunction with existing text-based filters, can improve the filtering accuracy.

• We have successfully integrated the proposed anti-spam filter with Thunderbird and demonstrated very promising results.

The rest of this chapter is organized as follows: In Section 3.1, we present the

statistics of some visual parameters from a thorough analysis of more than 120K

spam emails randomly downloaded from SpamArchive [30]. In Section 3.2, we

present the proposed filtering system in detail. Then, in Section 3.3, we present

experimental results to show the effectiveness of the proposed anti-spam filter.

3.1 The Spam Datasets and Analysis

We prepared the following datasets for analysis and experiments:

• SpamArchive dataset: 122,877 spam emails.

• Ling-Spam dataset: 2,412 legitimate emails from Linguist mailing list and

481 spam emails [32].


The SpamArchive dataset was randomly downloaded from the SpamArchive website; the total number of downloaded and processed spam messages is 122,877. The Ling-Spam dataset is a public anti-spam filtering corpus [32]. For this work, we used the Ling-Spam dataset to train the Bayesian anti-spam filter in Thunderbird. Using these two datasets and the trained Bayesian anti-spam filter, we intended to show experimentally the limitations of text-based anti-spam filters, and also to demonstrate the additional power they can gain from our visual-based anti-spam filter.

In the SpamArchive dataset, 37.76% (46,395/122,877) of the emails contain images. Among those, only 43.72% (20,283/46,395) contain accessible images: many emails did not have images explicitly attached but instead provided links to them, and a fraction of these links were no longer accessible at the time we processed them. We further analyzed these 20,283 spam emails; the statistics are shown in Table 3.1.

Table 3.1: Statistics of emails containing accessible images

Type                            No. of Emails   Percentage
w/ image with embedded-text         16,849        83.07%
w/ banner or graphics               19,868        97.95%
w/ external image                   19,813        97.68%
blocked by Bayesian filter           9,420        46.44%


Figure 3.1: Spam email with embedded-text in image example

The results shown in Table 3.1 clearly indicate that if a spam email contains images, the images are likely to be artificially generated and external (i.e., not explicitly attached to the email). These images may be banners/graphics or may have text boxes embedded within them. Figure 3.1 shows an example of a spam email that contains only an embedded-text message and no traditional text; the embedded message is clearly visible when rendered by an email client, even though the email body contains no machine-readable text at all. Figure 3.2 shows an example of a computer-generated image with embedded-text regions. In the following section, we introduce the ideas behind our visual-based anti-spam filter.


3.2 The Anti-Spam Filter

For each email, we extract a set of features from the images it contains. This feature set is then used for classification, with a one-class Support Vector Machine (SVM) as the base classifier. In the following subsections, we discuss the two major components: the features and the classifier.

3.2.1 Feature Description

Based on the observations summarized in Section 3.1, we define three sets of

features as follows:

• Embedded-text features

• Banner and graphic features

• Image location features

More and more frequently, spam emails are embedding text messages in images

to get around text-based anti-spam filters. To detect such devious techniques, it

would be helpful to know (1) whether there is embedded text in the images, (2) if

so, the area of text regions vs. the total image area. To derive such information,

we have developed a text-in-image detector which is capable of detecting the text

region(s) in an image. The details of the detector will be described later. We use


this text-in-image detector to scan through each image in the email and derive the

following embedded-text features: (1) the total number of text regions detected in

all images in the email, (2) the percentage of images with detected embedded-text

regions, and (3) the pixel count ratio of the detected text regions to that of the

overall image area. Figure 3.3 shows an example of an image with identified text

regions.
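As a rough sketch, the three embedded-text features could be computed from the detector's output boxes as follows. All names here are illustrative, not taken from the thesis implementation:

```python
from dataclasses import dataclass

@dataclass
class Image:
    """An email image with its pixel size and detected text boxes."""
    width: int
    height: int
    text_boxes: list  # (x, y, w, h) boxes from a text-in-image detector

def embedded_text_features(images):
    """Compute the three embedded-text features of one email.

    Returns (total number of text regions, fraction of images containing
    text, ratio of text-region pixels to total image pixels)."""
    n_regions = sum(len(im.text_boxes) for im in images)
    n_with_text = sum(1 for im in images if im.text_boxes)
    text_pixels = sum(w * h for im in images for (_, _, w, h) in im.text_boxes)
    total_pixels = sum(im.width * im.height for im in images)
    return (n_regions,
            n_with_text / len(images) if images else 0.0,
            text_pixels / total_pixels if total_pixels else 0.0)
```

For an email with a 100×100 image containing one 50×20 text box and a 200×100 image with none, this yields one region, a 50% text-image fraction, and a pixel ratio of 1,000/30,000.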

Many of the images in spam emails are banners and computer-generated graph-

ics which are part of advertisements. We have developed a banner detector and

a graphics detector. Banner images are usually very narrow in one dimension
and therefore have a large aspect ratio, either vertically or horizontally.

Graphic images, on the other hand, usually contain homogeneous background and

very little texture.

Using these detectors, we can extract the following banner and graphic fea-

tures: (1) the ratio of the number of banner images to the total number of images,

and (2) the ratio of the number of graphic images to the total.

Spammers usually put the images behind web servers and create references in

the emails to save server and network resources. This is in contrast to personal

emails, where images are usually attached with the emails. We define the image

location feature to be the ratio of the number of external images to the total

number of images in the email.
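A hedged sketch of how the banner, graphic, and image-location ratios might be assembled per email; `is_banner` and `is_graphic` stand in for the detectors described in Section 3.2.2 and, like the `external` flag, are assumptions of this sketch:

```python
def ratio_features(images, is_banner, is_graphic):
    """Per-email ratios: banners/total, graphics/total, external/total.

    `images` is a list of dicts with an 'external' flag set by the MIME
    parser; `is_banner` and `is_graphic` are per-image predicates."""
    n = len(images)
    if n == 0:
        return (0.0, 0.0, 0.0)
    return (sum(1 for im in images if is_banner(im)) / n,
            sum(1 for im in images if is_graphic(im)) / n,
            sum(1 for im in images if im.get("external", False)) / n)
```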


3.2.2 Feature Extraction

Banner and Graphic Feature Extraction

Since banner images carry certain geometric characteristics, they can be de-

tected by using a very simple rule-based detector. In extracting banner features,

we first use the rule-based detector to check the size and aspect ratio of images;

then, we calculate the corresponding features according to the detected number

of banner images.
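The rule-based banner check can be sketched as below; the thesis does not give its exact thresholds, so the values here are illustrative:

```python
def is_banner(width, height, max_short_side=60, min_aspect=4.0):
    """Flag images that are very narrow in one dimension and have a large
    aspect ratio, vertically or horizontally (illustrative thresholds)."""
    short_side = min(width, height)
    long_side = max(width, height)
    return short_side <= max_short_side and long_side >= min_aspect * short_side
```

Under these thresholds a classic 468×60 web banner is flagged, while an ordinary 384×256 photo is not.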

Because computer-generated graphics usually contain homogeneous color pat-

terns, they contain almost no texture in fine resolution. To extract graphics

features, we first apply wavelet transformation on the input images. Then, we ex-

tract texture features in three orientations (vertical, horizontal, and diagonal) at

fine resolution. If any of these extracted texture features falls below a predefined

threshold, the image is likely to be a computer-generated graphic. We calculate

the graphic features based on the detected number of graphic images.
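A minimal sketch of the fine-resolution texture test, using a one-level 2-D Haar transform written out by hand; the thesis does not specify the wavelet basis or threshold, so both are assumptions here:

```python
def haar_detail_energy(img):
    """One-level 2-D Haar transform on a grayscale image (list of rows).

    Returns the mean absolute energy of the horizontal, vertical, and
    diagonal detail sub-bands -- the fine-resolution texture features used
    to flag computer-generated graphics."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    eh = ev = ed = 0.0
    n = 0
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            eh += abs((a + b - c - d) / 4)  # horizontal detail
            ev += abs((a - b + c - d) / 4)  # vertical detail
            ed += abs((a - b - c + d) / 4)  # diagonal detail
            n += 1
    return eh / n, ev / n, ed / n

def looks_like_graphic(img, threshold=1.0):
    """An image with near-zero texture in any orientation is likely a
    computer-generated graphic (threshold is illustrative)."""
    return min(haar_detail_energy(img)) < threshold
```

A flat, homogeneous image has zero energy in all three sub-bands and is flagged as a graphic; a textured photo-like image is not.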

Embedded-Text Feature Extraction

Several text-detection and text-recognition methods have been proposed in

the past. Previous text-detection methods often followed a multi-step framework

using a combination of image analysis and machine learning techniques. While

a multi-step framework greatly simplifies the problem by dividing it


into several sub-tasks, the overall detection accuracy is the product of the
accuracies of the individual steps and is therefore bounded by the lowest one.
Since web and email images are diverse, previous text-detection methods cannot
work well if any single step yields a low detection rate.
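The bound can be stated compactly: if step $i$ of an $n$-step pipeline achieves accuracy $a_i \in [0,1]$, then

```latex
A_{\text{overall}} \;=\; \prod_{i=1}^{n} a_i \;\le\; \min_i a_i .
```

For example, two steps at 90% and 70% accuracy give at most $0.9 \times 0.7 = 63\%$ overall.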

We have developed a unified embedded-text detector particularly for web

and email images. Based on Viola and Jones’ object detection framework [22],

we define position-independent features to capture the essence of characters and

a smart scan algorithm to trace text lines using their spatial and geometrical

regularities. The embedded-text detector can be applied at every location and at

every scale of the image. The heart of the embedded-text detector is the boosted
cascade detector. The cascade detector is a degenerate decision tree which acts
as a rejection-based classifier. In each stage of the cascade, a classifier is
trained using BrownBoost [8], a variant of AdaBoost [7], to reject as
many negative samples as possible while keeping almost all of the positive samples.

As a result, the cascade detector yields a very high ratio of detection rate to
false-positive rate. Figure 3.2 shows an example of an email image with embedded-text.
Figure 3.3 shows the detection result produced by our embedded-text detector. We

calculate the embedded-text feature based on the detected text regions, such as

the number and area of regions.
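The rejection logic of the cascade can be sketched as follows; `stages`, a list of (score function, threshold) pairs standing in for the boosted stage classifiers, is an assumption of this sketch:

```python
def cascade_classify(window, stages):
    """Degenerate-decision-tree evaluation: a window is accepted as a text
    candidate only if it passes every stage. Most windows are rejected by
    the first cheap stages, which keeps the full-image scan fast."""
    for score, threshold in stages:
        if score(window) < threshold:
            return False  # rejected early -- no later stage is evaluated
    return True
```

Because each stage keeps almost all positives, the overall detection rate is the product of the per-stage rates; for instance, ten stages each retaining 99.5% of positives keep about 0.995^10 ≈ 95% overall.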


Figure 3.2: Text embedded in image example

3.2.3 The Classifier

In previous approaches, the anti-spam filtering problem has typically been

treated as a two-class or multiple-class classification problem. In the two-class

case, researchers were trying to determine whether an unseen email was spam,

whereas in the multiple-class case, the unseen emails were divided into several

categories (such as commercial, financial, objectionable, health, spiritual, etc.).

One difficulty with the two-class and multiple-class classification is the need for

multiple sets of training samples. For example, in the Naive Bayesian approach,


Figure 3.3: Detected embedded-text regions

one set of spam emails and one set of legitimate emails are required to train

the classifier. While spam datasets are easily accessible, a representative set of

legitimate emails is difficult to collect. In our anti-spam filter, we define the
anti-spam filtering problem as the task of determining whether an unseen email
belongs to the spam class. We propose using the one-class SVM [37] as the base classifier.

One-Class SVM

The basic model of SVM[36] is a maximal margin classifier. Given a positive

and a negative dataset, the SVM classifier maps the data from the input space to


a higher dimensional space, called the feature space, and constructs a hyperplane

in the feature space which separates the data with a maximal margin. In [37],

the authors extend the SVM to support one-class classification. The basic idea

is to construct a probability-dense region which encompasses the training data in

the input space. In this case, the support vectors define a region, rather than a

hyperplane. For training a one-class SVM classifier, we need only positive training

examples (i.e. spam emails in this target application). There is no need to obtain

a representative set of legitimate emails for training. In training a one-class SVM

classifier, the user can adjust a parameter to allow a certain percentage of training

samples to be outside the defined region (called the outlier percentage). The

larger the outlier percentage, the smaller the defined region, and thus the lower

the detection rate as well as the false-positive rate in the testing phase.
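The outlier percentage corresponds to the `nu` parameter of common one-class SVM implementations. A sketch with scikit-learn follows; the thesis used LIBSVM, and scikit-learn, the toy data, and the parameter values are substituted here purely for illustration:

```python
from sklearn.svm import OneClassSVM

# Toy "spam" feature vectors clustered near the origin; in the filter these
# would be the per-email visual features of Section 3.2.1.
X_train = [[i / 5.0, j / 5.0] for i in range(-5, 6) for j in range(-5, 6)]

# nu plays the role of the outlier percentage: up to 20% of training samples
# may fall outside the learned region, shrinking it and lowering both the
# detection rate and the false-positive rate.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.2).fit(X_train)

pred = clf.predict([[0.0, 0.0], [5.0, 5.0]])  # +1 inside the region, -1 outside
```

A point near the center of the training data is predicted +1 (spam region), while a distant point is predicted -1.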

To evaluate the effectiveness of the one-class classifier, we have also imple-

mented a classifier using the standard two-class SVM. Our study of the trade-offs

between using one-class and two-class classifiers will be presented in Section 3.3.

3.3 Experiments

We have designed and conducted experiments to answer the following ques-

tions:


• The trade-offs between using the one-class and two-class classifiers for the

filtering task.

• The added value of the proposed visual-based anti-spam filter beyond exist-

ing text-based filter.

3.3.1 Datasets

In the experiments, the statistics of the datasets used for training and valida-

tion were as follows:

• 8,500 spam emails as the positive training set.

• 1,500 spam emails as the positive validation set.

• 428 legitimate emails as the negative validation set.

• 10,000 artificial emails as the negative training set.

The positive training and validation set were created from the SpamArchive

dataset introduced in Section 3.1. From the spam emails containing accessible
images, we randomly selected 10,000 as the positive set. We reserved 15%
of the positive set for validation and the rest for training.

We collected 428 legitimate emails which contained images as the negative

validation set. These emails were collected from several people. We also generated

10,000 artificial emails using a large collection of photo images from Corel CDs.


Each email contained a random number (from 1 to 10) of images. Every image

was assumed to be local (i.e. explicitly attached in the email).

3.3.2 Experimental Results

In this experiment, we trained several one-class SVM classifiers using a dif-

ferent percentage of outliers for each classifier. We also trained a two-class SVM

classifier using 8,500 spam emails as our positive training set and 10,000 artificial

emails as our negative training set. The SVM package we used in this experiment

is LIBSVM [35]. Table 3.2 compares the performance of these classifiers. In the

case of one-class classification, it is clear that the tighter the positive region (i.e.

the larger the outlier percentage), the lower the false positive rate will be. In two-

class classification, even though the detection rate is very high, the false positive

rate will also be high. One reason for the high false positive rate could be that the

negative sets (for both training and validation) are not representative or diverse

enough. In anti-spam filtering, due to privacy issues, a representative negative

set is difficult to collect. Moreover, low false positive rates are more important

than high detection rates. Therefore, we choose the one-class SVM classifier
with 20% outliers as the base classifier in our anti-spam filter.

Table 3.3 shows the comparative performances of the anti-spam filters. The

Bayesian filter, trained using Ling-Spam dataset, can only detect 47.73% of the


Table 3.2: Classifier comparison

Classifier                  Detection Rate   False Positive
1-class SVM, 5% outlier     94.87%           6.54%
1-class SVM, 10% outlier    90.73%           2.80%
1-class SVM, 20% outlier    80.07%           0.93%
2-class SVM                 99.93%           14.01%

spam emails. The proposed anti-spam filter can detect 80.05% (using 1-class SVM

with 20% outlier) of the spam emails and can bring an additional 36.2% detection

rate to the Bayesian filter.

Table 3.3: Filter comparison

Filter                            Detection Rate
Text-based Bayesian               47.73%
Proposed                          80.05%
Text-based Bayesian + Proposed    83.93%

Table 3.4: Blocked/Missed email analysis

Filter                 Blocked          Missed
Text-based Bayesian    265 tokens       96 tokens
Proposed               69,224 pixels    48,975 pixels

Table 3.4 shows the statistics of emails blocked and those missed by the text-

based Bayesian and the proposed anti-spam filters. Most of the spam emails that

were missed by the Bayesian filter contained very few text tokens. Some of them

contained no text at all. The average number of text tokens in these emails was


96, while the average number of text tokens in emails that were blocked by the

Bayesian filter was 265. Moreover, the average number of image pixels per email

in those spam emails blocked by the proposed anti-spam filter was 69,224 and the

average number for the missed spam emails was 48,975.

The computation time for our proposed anti-spam filter depends on the number

of images in the email and the number of text regions in each image. For a 384x256

photo image, it takes around 100-200 ms to extract features. For a text-rich image,

it may take around 500-700 ms. To classify an email with four images, it takes

roughly 2-3 seconds. Since the proposed anti-spam filter is a client-side filter, and

email clients do not need real-time performance, the additional overhead caused

by the proposed anti-spam filter is not a concern.


Chapter 4

Conclusions

Previous work on embedded-text detection often followed a two-step frame-

work. First, image analysis techniques were applied to locate potential text re-

gions. Then, heuristic-based or machine learning-based models were applied to

refine the text regions. In other words, the two-step frameworks divided the

embedded-text task into two easier sub-tasks. There are two advantages to
separating the task: the first is a saving in computation cost; the second is a
reduction in the false-positive rate. However, the sequential dependency between
the two steps is a major bottleneck of the two-step framework. The overall
detection accuracy is the product of the accuracies of the two steps, so the
overall result is constrained by the less accurate step.


In this thesis, we have proposed a unified embedded-text detection framework

to push the accuracy and efficiency beyond existing two-step frameworks. We ap-

proached the embedded-text detection problem from the angle of object detection.

Based on Viola and Jones’ rapid object-detection framework, we define three sets

of position-independent features to capture the essence of characters, locality
filters to locate text-line boundaries, and a smart-scan algorithm to trace text
lines using their spatial and geometrical properties.

Previous work in content-based anti-spam filtering relies primarily on the text

content of the emails. As the spammers’ techniques become more sophisticated,

and the genre of email content continues to evolve, these text-based approaches

alone are no longer sufficient for solving the spam problem. In this thesis, we have

analyzed the spam emails containing images and identified a number of useful

visual features that can be efficiently extracted from the emails for anti-spam

filtering. We have proposed a novel anti-spam filter based on these features using

a one-class SVM classifier.

We have integrated the proposed anti-spam filter with Thunderbird, an email

client from Mozilla. The experimental results show that the proposed anti-spam

filter is both effective and efficient. They clearly indicate that the proposed anti-

spam filter can bring extra filtering power to the existing text-based anti-spam

filters.


Bibliography

[1] D. Chen, J. Luettin and K. Shearer. A Survey of Text Detection and Recognition

in Images and Videos. IDIAP-RR 00-38, IDIAP, 2000

[2] D. Chen, H. Bourlard, and J-Ph. Thiran. Text identification in complex background

using SVM. In Proc. IEEE CVPR, p.621-626, Dec. 2001

[3] R. Collobert and S. Bengio. SVMTorch: Support Vector Machines for Large-Scale

Regression Problems. Journal of Machine Learning Research, vol 1, pages 143-160,

2001.

[4] Corel Images. http://www.corel.com

[5] Gary R Bradski. Open Source Computer Vision Library. Intel Corporation, 2001.

[6] L.A. Fletcher and R. Kasturi. A Robust Algorithm for Text String Separation

from Mixed Text/Graphics Images. IEEE Trans. on Pattern Analysis and Machine

Intelligence, vol. 10, no. 6, pp. 910-918, Nov. 1988


[7] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learn-

ing and an application to boosting. Journal of Computer and System Sciences,

55(1):119–139, Aug. 1997.

[8] Y. Freund. An adaptive version of the boost by majority algorithm. Machine

Learning, 43(3):293–318, Jun. 2001.

[9] T. G. Dietterich. An experimental comparison of three methods for constructing en-

sembles of decision trees: Bagging, boosting, and randomization. Machine Learning,

40(2):139–158, 2000

[10] K. I. Kim, K. Jung, and J. H. Kim. Texture-based approach for text detection in

images using support vector machines and continuously adaptive mean shift algo-

rithm. PAMI, 25(12):1631-1639, Dec. 2003.

[11] W. Liu and D. Dori. A Proposed Scheme for Performance Evaluation of Graph-

ics/Text Separation Algorithms. GREC 1997: 359-371

[12] F. LeBourgeois. Robust multifont OCR system from gray level images. In Proc. of

the 4th ICDAR, p.1-5, 1997.

[13] R. Lienhart. Automatic text recognition for video indexing. In Proc. of the 4th
ACM MM, p.11-20, Nov. 1996, Boston.


[14] R. Lienhart, A. Kuranov, and V. Pisarevsky. Empirical analysis of detection cas-
cades of boosted classifiers for rapid object detection. MRL Technical Report, Intel,
May 2002.

[15] R. Lienhart and A. Wernicked. Localizing and segmenting text in images and

videos. IEEE Trans. Circuits Syst. Video Technol., vol. 12, pp. 236-268, Apr. 2002.

[16] R. Lienhart. Video OCR: A Survey and Practitioner’s Guide. VideoMining03,

Chapter 6, 2003

[17] J. Ohya, A. Shio and S. Akamatsu. Recognizing Characters in Scene Images. IEEE

Trans. PAMI, Vol 16, No. 2, pp.214-224, Feb. 1994

[18] C. L. Tan and P. O. Ng. Text extraction using pyramid. Pattern Recognition, 31
(1): 63-72, 1998.

[19] Q. Yuan and C. L. Tan. Text extraction from gray-scale document images using
edge information. In the 6th ICDAR, Sep. 2001, Seattle.

[20] R. Cao and C. L. Tan. Separation of overlapping text from graphics. In the 6th
ICDAR, Sep. 2001, Seattle.

[21] Z.W. Tu, X.R. Chen, A.L. Yuille and S.-C. Zhu. Image parsing: unifying segmen-

tation, detection, and recognition. In ICCV, 2003

[22] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple

features. In CVPR, 2001.


[23] V. Wu, R. Manmatha, and E. M. Riseman. Finding text in images. In Proc. ACM

Int. Conf. Digital Libraries, 1997.

[24] V. Wu, R. Manmatha, and E. M. Riseman. TextFinder: An automatic system to

detect and recognize text in images. IEEE Trans. on PAMI, 21(11):1224-1229, 1999

[25] C. Wolf and J.-M. Jolion. Model-based text detection in images and videos: a
learning approach. Tech. Report RR-2004-13, LIRIS, 2004.

[26] Y. Zhong, K. Karu, and A. K. Jain. Locating text in complex color images. In

Proc. of the 3rd ICDAR, p.146, Washington DC, 1995.

[27] J. Postel. On The Junk Mail Problem. RFC706. November 1975.

[28] J. Graham-Cumming. The Spammer’s Compendium. http://www.jgc.org/tsc/

[29] The Mozilla Project. http://www.mozilla.org

[30] SpamArchive. http://www.spamarchive.org

[31] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to

filtering junk email. AAAI Workshop on Learning for Text Categorization, Jul.

1998, Madison, Wisconsin. AAAI Technical Report WS-98-05

[32] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spy-

ropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the

workshop on Machine Learning in the New Information Age, 2000.


[33] A. Kolcz and J. Alspector. SVM-based filtering of e-mail spam with content-specific

misclassification costs. ICDM Workshop on Text Mining (TextDM 2001), Nov. 2001.

[34] X. Carreras and L. Màrquez. Boosting trees for anti-spam email filtering. In Pro-
ceedings of RANLP-01, 4th International Conference on Recent Advances in Natural
Language Processing, Tzigov Chark, BG, 2001.

[35] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[36] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York,

1995.

[37] B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Es-
timating the support of a high-dimensional distribution. Technical Report 99-87,
Microsoft Research, 1999.

[38] K.I. Kim, K.C. Jung, S.H. Park, and H.J. Kim. Support vector machine-based text

detection in digital video. Pattern Recognition 34(2): 527-529 (2001)

[39] J. Zhou and D. Lopresti. Extracting text from WWW images. In Proc. of the 4th
ICDAR, p.242-252, 1997.
