Improving Image Spam Filtering Using Image Text Features

Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli
Pattern Recognition and Applications Group, Department of Electrical and Electronic Engineering, University of Cagliari, Italy
5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd


Page 1: Improving Image Spam Filtering Using Image Text Features

CEAS 2008

Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli

Pattern Recognition and Applications Group
University of Cagliari, Italy
Department of Electrical and Electronic Engineering

5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd

Improving Image Spam Filtering Using Image Text Features

Page 2: Improving Image Spam Filtering Using Image Text Features

21-08-2008, Image Spam Filtering, CEAS 2008

About me

• Pattern Recognition and Applications Group, http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy

• Contact
  – Battista Biggio, Ph.D. student
  – [email protected]

Page 3: Improving Image Spam Filtering Using Image Text Features

Pattern Recognition and Applications Group

• Research interests
  – Methodological issues
    • Multiple classifier systems
    • Adversarial learning
    • Classification reliability
  – Main applications
    • Intrusion detection in computer networks
    • Multimedia document categorization, spam filtering
    • Biometric authentication (fingerprint, face)
    • Content-based image retrieval

• Faculty members
  – F. Roli (group head)
  – G. Giacinto
  – G. Fumera
  – L. Didaci
  – G.L. Marcialis

– 7 PhD students
– 3 post docs
– 2 consultants

Page 4: Improving Image Spam Filtering Using Image Text Features

Outline

• Introduction
  – What is image spam?

• Image spam filtering
  – Image spam SoA
  – Our work

• Experiments

• A plug-in for SpamAssassin: Image Cerberus

Page 5: Improving Image Spam Filtering Using Image Text Features

Image spam

• Since about 2005: image spam
  – Embedding spam messages into images to evade modules based on machine learning approaches (e.g. Bayesian filters)
  – Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)

Page 6: Improving Image Spam Filtering Using Image Text Features

Image spam SoA (state of the art)

• Commercial / open source anti-spam filters:
  – OCR + keyword search
  – Image low-level feature analysis

• Research:
  – OCR + TC (text categorization)
    • Fumera et al., JMLR 2006
    • BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007
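
The "OCR + keyword search" approach listed above can be illustrated with a minimal Python sketch. It assumes the text has already been extracted from the image by some OCR engine; the keyword list, the function names and the `min_hits` parameter are hypothetical and are not taken from any of the filters mentioned on the slide.

```python
import re

# Hypothetical keyword list, for illustration only.
SPAM_KEYWORDS = {"viagra", "lottery", "stock", "mortgage"}


def keyword_hits(ocr_text, keywords=SPAM_KEYWORDS):
    """Return the spam keywords found in text already produced by an OCR engine."""
    # Lower-case the text and keep only alphabetic tokens.
    tokens = set(re.findall(r"[a-z]+", ocr_text.lower()))
    return sorted(tokens & set(keywords))


def is_spam_by_keywords(ocr_text, min_hits=1):
    """Flag the image as spam when at least `min_hits` keywords appear."""
    return len(keyword_hits(ocr_text)) >= min_hits


# Clean OCR output versus the kind of garbled output a noisy image might yield.
clean = "Buy STOCK now! Cheap mortgage rates"
garbled = "Buy S T 0 C K now! Che@p m0rtgage rates"
```

On the `clean` string the sketch finds two keywords, while on the `garbled` string it finds none, which is the weakness that obfuscated spam images exploit.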

Page 7: Improving Image Spam Filtering Using Image Text Features

Our past work

• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!

• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adversarial obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection

• However, these features did not work as we thought for detecting only adversarial obfuscated text…
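
As a rough illustration of the perimetric complexity measure named above, the Python sketch below computes it on a binarized image. The slide does not give the formula, so the definition used here (squared perimeter divided by ink area) is an assumption; some definitions also divide by 4π. The function name and the input format (rows of 0/1 values, 1 meaning ink) are hypothetical.

```python
def perimetric_complexity(img):
    """img: list of rows of 0/1 ints (1 = ink). Returns P^2 / A, or 0.0 if no ink.

    Assumed definition: P counts the pixel edges between ink and background
    (4-neighbourhood), A counts the ink pixels.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    area = 0
    perimeter = 0
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            area += 1
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                # Pixels outside the image count as background.
                if not (0 <= ny < h and 0 <= nx < w) or not img[ny][nx]:
                    perimeter += 1
    return perimeter ** 2 / area if area else 0.0


# A compact blob versus the same amount of ink broken into isolated pixels.
solid = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
broken = [[1, 0, 1, 0],
          [0, 0, 0, 0],
          [1, 0, 1, 0],
          [0, 0, 0, 0]]
```

With the same ink area, the fragmented pattern scores higher than the compact one, which is why a measure of this kind can react to characters broken up by noise.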

Page 8: Improving Image Spam Filtering Using Image Text Features

This work

• Our image text defect measures seemed to be able to provide some discriminant information about low level text characteristics between ham and spam images

• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability

Page 9: Improving Image Spam Filtering Using Image Text Features

Experiments

• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images

• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)

(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
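
Three of the four "generic" features listed above are simple enough to sketch in a few lines of Python. The sketch assumes the image is given as rows of (r, g, b) tuples and that "log" means the natural logarithm, which the slide does not specify; the function and key names are hypothetical. The fourth feature, text area, needs a text localisation step and is left out.

```python
import math
from collections import Counter


def generic_features(pixels):
    """pixels: list of rows of (r, g, b) tuples.

    Returns the number of colors (log), the number of pixels (log) and the
    relative area occupied by the most common color.
    """
    flat = [p for row in pixels for p in row]
    if not flat:
        raise ValueError("empty image")
    counts = Counter(flat)
    n_pixels = len(flat)
    return {
        "log_num_colors": math.log(len(counts)),
        "log_num_pixels": math.log(n_pixels),
        "most_common_color_area": counts.most_common(1)[0][1] / n_pixels,
    }


# Tiny 2x2 example: three white pixels and one black pixel.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
tiny = [[WHITE, WHITE], [WHITE, BLACK]]
features = generic_features(tiny)
```

For the 2x2 example the most common color covers three of the four pixels, so `most_common_color_area` is 0.75.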

Page 10: Improving Image Spam Filtering Using Image Text Features

Experiments (cont’d)

• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and their fusion (either at feature or score level) with our features (text).

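[Figure: two fusion schemes. Feature level fusion: a single classifier C(x1∪x2) on the merged feature sets. Score level fusion: classifiers C(x1) and C(x2), whose scores s are combined by a classifier C(s).]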
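
The two fusion schemes can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides do not name the classifier at this point, so a toy nearest-centroid classifier (`CentroidClassifier`, a hypothetical name) stands in for it, and the second-stage classifier C(s) is trained on the same samples as C(x1) and C(x2) for brevity, whereas a real experiment would normally use held-out scores.

```python
import math


class CentroidClassifier:
    """Toy stand-in classifier: score > 0 means closer to the spam centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):  # 0 = ham, 1 = spam
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x):
        return math.dist(x, self.centroids[0]) - math.dist(x, self.centroids[1])


def feature_level_fusion(X1, X2, y):
    """C(x1 ∪ x2): one classifier trained on the concatenated feature vectors."""
    clf = CentroidClassifier().fit([a + b for a, b in zip(X1, X2)], y)
    return lambda x1, x2: clf.score(x1 + x2)


def score_level_fusion(X1, X2, y):
    """C(s): a second-stage classifier trained on the scores of C(x1) and C(x2)."""
    c1 = CentroidClassifier().fit(X1, y)
    c2 = CentroidClassifier().fit(X2, y)
    S = [[c1.score(a), c2.score(b)] for a, b in zip(X1, X2)]
    combiner = CentroidClassifier().fit(S, y)
    return lambda x1, x2: combiner.score([c1.score(x1), c2.score(x2)])


# Made-up data: x1 has one feature, x2 has two; labels 0 = ham, 1 = spam.
X1 = [[0.1], [0.2], [0.9], [0.8]]
X2 = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
y = [0, 0, 1, 1]

by_feature = feature_level_fusion(X1, X2, y)
by_score = score_level_fusion(X1, X2, y)
```

Both `by_feature` and `by_score` take one vector from each feature set and return a score whose sign indicates the predicted class in this toy setup.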

Page 11: Improving Image Spam Filtering Using Image Text Features

Results

Page 12: Improving Image Spam Filtering Using Image Text Features

A plug-in for SpamAssassin: Image Cerberus

• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level

• Publicly available
  – http://prag.diee.unica.it/n3ws1t0/imageCerberus

• We will release source code (C++) soon

We need your feedback!

Page 13: Improving Image Spam Filtering Using Image Text Features

Some examples

score = 1.06 score = 0.98 score = 0.28

Page 14: Improving Image Spam Filtering Using Image Text Features

Some examples (cont’d)

score = 0.82 score = 1.00 score = 0.63

Page 15: Improving Image Spam Filtering Using Image Text Features

Spam or ham?

• Ham images from the TREC 2007 spam corpus!

score = 0.20 score = -1.4 score = 0.27

Page 16: Improving Image Spam Filtering Using Image Text Features

Thank you!

• See you at the poster session!

• Contacts
  – [email protected]
  – [email protected]
  – [email protected]
  – [email protected]

• Web
  – http://prag.diee.unica.it