Improving Image Spam Filtering Using Image Text Features
CEAS 2008
Battista Biggio, Ignazio Pillai, Giorgio Fumera, Fabio Roli
Pattern Recognition and Applications Group
Department of Electrical and Electronic Engineering
University of Cagliari, Italy
5th Conference on Email and Anti-Spam (CEAS) 2008, Mountain View, California, USA, August 21st - 22nd
21-08-2008 Image Spam Filtering 2CEAS 2008
About me
• Pattern Recognition and Applications Group, http://prag.diee.unica.it
  – DIEE, University of Cagliari, Italy
• Contact
  – Battista Biggio, Ph.D. student
Pattern Recognition and Applications Group
• Research interests
  – Methodological issues
    • Multiple classifier systems
    • Adversarial learning
    • Classification reliability
  – Main applications
    • Intrusion detection in computer networks
    • Multimedia document categorization, spam filtering
    • Biometric authentication (fingerprint, face)
    • Content-based image retrieval
• Faculty members
  – F. Roli (group head)
  – G. Giacinto
  – G. Fumera
  – L. Didaci
  – G.L. Marcialis
– 7 PhD students
– 3 post docs
– 2 consultants
Outline
• Introduction
  – What is image spam?
• Image spam filtering
  – Image spam SoA
  – Our work
• Experiments
• A plug-in for SpamAssassin: Image Cerberus
Image spam
• Since about 2005: image spam
  – Embedding spam messages into images to evade modules based on machine learning approaches (e.g. Bayesian filters)
  – Adding adversarial noise to prevent OCR from reading embedded text (obfuscated spam images)
Image spam SoA
• Commercial / open source anti-spam filters:
  – OCR + keyword search
  – Image low-level feature analysis
• Research:
  – OCR + text categorization (TC)
    • Fumera et al., JMLR 2006
      – BayesOCR plug-in for SpamAssassin
  – Image classifiers (ham/spam) based on low-level image features (text areas, color distribution, etc.)
    • Wu et al., ICIP 2005
    • Aradhye et al., ICDAR 2005
    • Dredze et al., CEAS 2007
Our past work
• OCR is not effective against obfuscated images
  – Spammers learned from CAPTCHAs / HIPs!
• Our idea: the presence of adversarial obfuscated text can be a spamminess hint (Biggio et al., CEAS 2007)
  – How did we detect the presence of adversarial obfuscated text?
    • Four features based on:
      – Text localisation
      – Perimetric complexity
      – Edge detection
• However, these features did not work as we thought for detecting only adversarial obfuscated text…
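The slide names perimetric complexity without defining it. A commonly used definition, assumed here and not necessarily the exact measure of Biggio et al., CEAS 2007, is the squared perimeter of the foreground divided by its area, normalised as P²/(4πA). The sketch below computes it on a small binary image, estimating the perimeter by counting exposed pixel edges; the function name `perimetric_complexity` and the toy images are illustrative only.

```python
import math

def perimetric_complexity(binary):
    """Squared perimeter over (4 * pi * area) of the foreground pixels.

    binary: list of rows of 0/1 values, where 1 marks foreground ("ink").
    The perimeter is approximated by counting pixel edges that face
    background or the image border (4-connectivity); this is an assumption.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    area = 0
    perimeter = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            area += 1
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = nr < 0 or nr >= rows or nc < 0 or nc >= cols
                if outside or not binary[nr][nc]:
                    perimeter += 1
    if area == 0:
        return 0.0
    return perimeter ** 2 / (4 * math.pi * area)

# A clean 4x4 solid blob versus a 4x4 checkerboard (heavily broken "ink").
solid = [[1] * 4 for _ in range(4)]
broken = [[(r + c) % 2 for c in range(4)] for r in range(4)]

solid_pc = perimetric_complexity(solid)
broken_pc = perimetric_complexity(broken)
```

Under this definition the fragmented shape scores much higher than the solid one, which is the intuition behind using the measure as a hint of broken or noisy characters.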
This work
• Our image text defect measures seemed able to provide some discriminant information about low-level text characteristics that differ between ham and spam images
• We exploit the proposed image text defect measures as additional features in approaches based on image classification techniques, to improve their discriminant capability
Experiments
• Data sets (1)
  – A: 2006 ham images, 3297 spam images
  – B: 2006 ham images, 8549 spam images
• Image feature sets
  – Aradhye et al., ICDAR 2005
    • Color heterogeneity, color saturation, text area
  – Dredze et al., CEAS 2007
    • Image meta-data, visual features
  – Four other visual features, for comparison (generic)
    • Number of colors (log), number of pixels (log), relative area occupied by the most common color, text area
  – Features used in this work (text)
(1) Data sets are publicly available at: http://prag.diee.unica.it/n3ws1t0/eng/spamRepository
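Three of the four generic features listed above can be computed directly from pixel values. The following is a minimal sketch, assuming an image given as rows of (r, g, b) tuples and natural logarithms; the slides specify neither the log base nor the implementation. The function name `generic_features` and the dictionary keys are hypothetical. The fourth feature, text area, needs a text localisation step and is left out.

```python
import math
from collections import Counter

def generic_features(pixels):
    """Compute three simple visual features from a non-empty image.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    Returns the log of the number of distinct colors, the log of the
    number of pixels, and the fraction of the image covered by the
    most common color. Natural log is an assumption.
    """
    flat = [p for row in pixels for p in row]
    counts = Counter(flat)
    n_pixels = len(flat)
    most_common_count = counts.most_common(1)[0][1]
    return {
        "log_num_colors": math.log(len(counts)),
        "log_num_pixels": math.log(n_pixels),
        "top_color_area": most_common_count / n_pixels,
    }

# Toy 2x4 image: six white pixels and two red ones.
WHITE, RED = (255, 255, 255), (255, 0, 0)
toy_image = [
    [WHITE, WHITE, RED, WHITE],
    [WHITE, RED, WHITE, WHITE],
]
toy_features = generic_features(toy_image)
```

In practice the pixel data would come from an image decoding library; the toy list keeps the sketch free of external dependencies.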
Experiments (cont’d)
• We evaluated the performance of image ham/spam classifiers based on individual feature sets (aradhye, dredze, generic) and on their fusion (either at feature or score level) with our features (text).
[Figure: feature level fusion, with a single classifier C(x1∪x2); score level fusion, with classifiers C(x1) and C(x2) feeding a classifier C(s)]
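The two fusion schemes can be sketched as follows. The slides do not say which classifier plays the role of C, so a simple nearest-centroid scorer is swapped in purely for illustration; the helper names (`train`, `feature_level_fusion`, `score_level_fusion`) and the toy data are assumptions. For brevity the score-level combiner is trained on scores from the same training samples, whereas a real experiment would normally use held-out scores.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(X, y):
    """Stand-in classifier C: a positive score means closer to spam."""
    ham = centroid([x for x, label in zip(X, y) if label == 0])
    spam = centroid([x for x, label in zip(X, y) if label == 1])
    return lambda x: euclidean(x, ham) - euclidean(x, spam)

def feature_level_fusion(X1, X2, y):
    """C(x1 ∪ x2): one classifier on the concatenated feature vectors."""
    c = train([a + b for a, b in zip(X1, X2)], y)
    return lambda x1, x2: c(x1 + x2)

def score_level_fusion(X1, X2, y):
    """C(s): a classifier on the scores s = (C(x1), C(x2))."""
    c1, c2 = train(X1, y), train(X2, y)
    c_s = train([[c1(a), c2(b)] for a, b in zip(X1, X2)], y)
    return lambda x1, x2: c_s([c1(x1), c2(x2)])

# Toy data: x1 stands for an existing feature set, x2 for the text features.
X1 = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
X2 = [[0.0], [0.1], [1.0], [0.9]]
y = [0, 0, 1, 1]  # 0 = ham, 1 = spam

feat = feature_level_fusion(X1, X2, y)
scor = score_level_fusion(X1, X2, y)

spam_like = ([0.85, 0.7], [0.95])
ham_like = ([0.2, 0.2], [0.05])
feat_spam, scor_spam = feat(*spam_like), scor(*spam_like)
feat_ham, scor_ham = feat(*ham_like), scor(*ham_like)
```

Feature-level fusion lets one classifier see interactions between the two feature sets, while score-level fusion keeps the base classifiers independent and only combines their outputs.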
Results
A plug-in for SpamAssassin: Image Cerberus
• We implemented a SpamAssassin plug-in based on our approach
  – generic + text fused at feature level
• Publicly available
  – http://prag.diee.unica.it/n3ws1t0/imageCerberus
• We will release source code (C++) soon
We need your feedback!
Some examples
score = 1.06 score = 0.98 score = 0.28
Some examples (cont’d)
score = 0.82 score = 1.00 score = 0.63
Spam or ham?
• Ham images from the TREC 2007 spam corpus!
score = 0.20 score = -1.4 score = 0.27
Thank you!
• See you at the poster session!
• Contacts
  – [email protected]
• Web
  – http://prag.diee.unica.it