Countering Spam Using Classification Techniques


Transcript of Countering Spam Using Classification Techniques

Page 1: Countering Spam Using Classification Techniques

Countering Spam Using Classification Techniques

Steve [email protected] Mining Guest LectureFebruary 21, 2008

Page 2: Countering Spam Using Classification Techniques

Overview

Introduction

Countering Email Spam: Problem Description, Classification History, Ongoing Research

Countering Web Spam: Problem Description, Classification History, Ongoing Research

Conclusions

Page 3: Countering Spam Using Classification Techniques

Introduction

The Internet has spawned numerous information-rich environments: email systems, the World Wide Web, and social networking communities

Their openness facilitates information sharing, but it also makes them vulnerable…

Page 4: Countering Spam Using Classification Techniques

Denial of Information (DoI) Attacks

Deliberate insertion of low quality information (or noise) into information-rich environments

The information analog of Denial of Service (DoS) attacks

Two goals: promotion of ideals by means of deception, and denial of access to high quality information

Spam is currently the most prominent example of a DoI attack

Page 5: Countering Spam Using Classification Techniques

Overview

Introduction

Countering Email Spam: Problem Description, Classification History, Ongoing Research

Countering Web Spam: Problem Description, Classification History, Ongoing Research

Conclusions

Page 6: Countering Spam Using Classification Techniques

Countering Email Spam

Close to 200 billion (yes, billion) emails are sent each day

Spam accounts for around 90% of that email traffic

~2 million spam messages every second

Page 7: Countering Spam Using Classification Techniques

Old Email Spam Examples

Page 8: Countering Spam Using Classification Techniques

Problem Description

Email spam detection can be modeled as a binary text classification problem with two classes: spam and legitimate (non-spam)

It is an example of supervised learning: build a model (classifier) from training data to approximate the target function

Formally, construct a classifier f': M → {spam, legitimate} that agrees with the target function f: M → {spam, legitimate} as much as possible
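
To make this formulation concrete, here is a minimal sketch of training such a classifier, assuming scikit-learn is available; the tiny training set is made up for illustration and is far smaller than the corpora discussed later.

```python
# Minimal sketch of email spam detection as supervised binary text classification.
# Assumes scikit-learn; the toy training set below is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_messages = [
    "cheap meds buy now limited offer",          # spam
    "win a free prize click here",               # spam
    "meeting moved to 3pm, agenda attached",     # legitimate
    "draft of the data mining lecture slides",   # legitimate
]
train_labels = ["spam", "spam", "legitimate", "legitimate"]

# Approximate the target function f: M -> {spam, legitimate}
vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_messages), train_labels)

print(classifier.predict(vectorizer.transform(["free meds, click now"])))  # ['spam']
```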

Page 9: Countering Spam Using Classification Techniques

Problem Description (cont.)

How do we represent a message?

How do we generate features?

How do we process features?

How do we evaluate performance?

Page 10: Countering Spam Using Classification Techniques

How do we represent a message?

Classification algorithms require messages to be represented in a consistent format

Salton’s vector space model (“bag of words”) is the most popular representation

Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>
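
As a hand-rolled illustration of this representation (the vocabulary and message below are made-up examples, not from the lecture):

```python
# Hand-rolled bag of words: map a message m to a feature vector <f1, f2, ..., fn>.
# The vocabulary is a made-up example; in practice it is derived from the corpus.
vocabulary = ["free", "meds", "meeting", "agenda", "prize"]

def to_feature_vector(message: str) -> list[int]:
    tokens = message.lower().split()
    # f_i = number of occurrences of vocabulary word i in the message
    return [tokens.count(word) for word in vocabulary]

print(to_feature_vector("Free free prize inside"))  # [2, 0, 0, 0, 1]
```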

Page 11: Countering Spam Using Classification Techniques

How do we generate features?

Sources of information:

SMTP connections – network properties

Email headers – social networks

Email body – textual parts, URLs, attachments
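
A small standard-library sketch of pulling features from some of these sources (headers, body text, URLs); the sample message and chosen features are illustrative assumptions, not the lecture's feature set.

```python
# Sketch: simple header- and body-based features from a raw email message.
# The sample message and feature choices are illustrative only.
import re
from email import message_from_string

raw = """\
From: [email protected]
Subject: You WON a prize!!!
Content-Type: text/plain

Claim your prize at http://deals.example.com/claim now!!!
"""

msg = message_from_string(raw)
body = msg.get_payload()

features = {
    "subject_exclamations": msg.get("Subject", "").count("!"),
    "sender_domain": msg.get("From", "").split("@")[-1],
    "num_urls": len(re.findall(r"https?://\S+", body)),
    "body_length": len(body),
}
print(features)
```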

Page 12: Countering Spam Using Classification Techniques

How do we process features?

Feature Tokenization – alphanumeric tokens, n-grams, phrases

Feature Scrubbing – stemming, stop word removal

Feature Selection – simple feature removal, information-theoretic algorithms
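
A brief sketch of the tokenization and scrubbing steps above; the stop-word list is a tiny made-up sample, and stemming and information-theoretic selection are left out for brevity.

```python
# Sketch of feature processing: alphanumeric tokenization, word bigrams,
# and stop-word removal. The stop-word list is a tiny illustrative sample.
import re

STOP_WORDS = {"the", "a", "an", "to", "of", "and"}

def tokenize(text: str) -> list[str]:
    # Alphanumeric tokens, lowercased
    return re.findall(r"[a-z0-9]+", text.lower())

def scrub(tokens: list[str]) -> list[str]:
    # Simple stop-word removal
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens: list[str]) -> list[str]:
    # Word-level 2-grams ("phrases" of adjacent tokens)
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = scrub(tokenize("Claim the FREE prize of a lifetime"))
print(tokens)            # ['claim', 'free', 'prize', 'lifetime']
print(bigrams(tokens))   # ['claim free', 'free prize', 'prize lifetime']
```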

Page 13: Countering Spam Using Classification Techniques

How do we evaluate performance?

Traditional IR metrics: Precision vs. Recall

False positives vs. False negatives (imbalanced error costs)

ROC curves

With a confusion matrix where a = legitimate classified as legitimate, b = legitimate classified as spam (false positives), c = spam classified as legitimate (false negatives), and d = spam classified as spam:

FP = b / (a + b)
FN = c / (c + d)
P = d / (b + d)
R = d / (c + d)
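
Computed from the confusion-matrix counts above, with made-up numbers:

```python
# Sketch: precision, recall, and error rates from confusion-matrix counts.
# a = legit->legit, b = legit->spam, c = spam->legit, d = spam->spam.
# The counts below are made-up numbers for illustration.
a, b, c, d = 950, 10, 40, 900

precision = d / (b + d)           # fraction of flagged messages that are spam
recall = d / (c + d)              # fraction of spam that gets flagged
false_positive_rate = b / (a + b)
false_negative_rate = c / (c + d)

print(f"P={precision:.3f} R={recall:.3f} "
      f"FP={false_positive_rate:.3f} FN={false_negative_rate:.3f}")
```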

Page 14: Countering Spam Using Classification Techniques

Classification History

Sahami et al. (1998): Used a Naïve Bayes classifier; were the first to apply text classification research to the spam problem

Pantel and Lin (1998): Also used a Naïve Bayes classifier; found that Naïve Bayes outperforms RIPPER

Page 15: Countering Spam Using Classification Techniques

Classification History (cont.)

Drucker et al. (1999): Evaluated Support Vector Machines as a solution to spam; found that SVM is more effective than RIPPER and Rocchio

Hidalgo and Lopez (2000): Found that decision trees (C4.5) outperform Naïve Bayes and k-NN

Page 16: Countering Spam Using Classification Techniques

Classification History (cont.)

Up to this point, private corpora were used exclusively in email spam research

Androutsopoulos et al. (2000a): Created the first publicly available email spam corpus (Ling-spam); performed various feature set size, training set size, stemming, and stop-list experiments with a Naïve Bayes classifier

Page 17: Countering Spam Using Classification Techniques

Classification History (cont.)

Androutsopoulos et al. (2000b): Created another publicly available email spam corpus (PU1); confirmed previous research that Naïve Bayes outperforms a keyword-based filter

Carreras and Marquez (2001): Used PU1 to show that AdaBoost is more effective than decision trees and Naïve Bayes

Page 18: Countering Spam Using Classification Techniques

Classification History (cont.)

Androutsopoulos et al. (2004): Created 3 more publicly available corpora (PU2, PU3, and PUA); compared Naïve Bayes, Flexible Bayes, Support Vector Machines, and LogitBoost: FB, SVM, and LB outperform NB

Zhang et al. (2004): Used Ling-spam, PU1, and the SpamAssassin corpora; compared Naïve Bayes, Support Vector Machines, and AdaBoost: SVM and AB outperform NB

Page 19: Countering Spam Using Classification Techniques

Classification History (cont.)

CEAS (2004 – present): Focuses solely on email and anti-spam research; generates a significant amount of academic and industry anti-spam research

Klimt and Yang (2004): Published the Enron Corpus – the first large-scale corpus of legitimate email messages

TREC Spam Track (2005 – present): Produces new corpora every year; provides a standardized platform to evaluate classification algorithms

Page 20: Countering Spam Using Classification Techniques

Ongoing Research

Concept Drift

New Classification Approaches

Adversarial Classification

Image Spam

Page 21: Countering Spam Using Classification Techniques

Concept Drift

Spam content is extremely dynamic: topic drift (e.g., specific scams) and technique drift (e.g., obfuscations)

How do we keep up with the Joneses?

Batch vs. Online Learning
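
One common way to keep up, sketched below under the assumption that an incremental learner (here scikit-learn's partial_fit API with a fixed hashing feature space) is acceptable: update the model online as newly labeled messages arrive rather than retraining in large batches.

```python
# Sketch: online (incremental) learning to track concept drift.
# Assumes scikit-learn; HashingVectorizer keeps the feature space fixed
# so the model can be updated one mini-batch at a time.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
model = MultinomialNB()
classes = ["spam", "legitimate"]

def update(messages, labels):
    """Fold a freshly labeled mini-batch into the existing model."""
    X = vectorizer.transform(messages)
    model.partial_fit(X, labels, classes=classes)

# Day 1 batch, then a later batch reflecting a new scam topic (toy data).
update(["win a free prize now", "minutes from today's meeting"],
       ["spam", "legitimate"])
update(["urgent: verify your bank account", "lecture slides attached"],
       ["spam", "legitimate"])

print(model.predict(vectorizer.transform(["verify your account to win"])))
```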

Page 22: Countering Spam Using Classification Techniques

New Classification Approaches

Filter Fusion

Compression-based Filtering (see the sketch below)

Network behavioral clustering
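
As an illustration of compression-based filtering, a toy sketch of the usual formulation: assign a message to the class whose training text compresses it best. The corpora are made up, and real systems use stronger models than zlib.

```python
# Sketch of compression-based filtering: a message is assigned to the class
# whose training text compresses it best (smallest added compressed size).
# The toy corpora below are made up for illustration.
import zlib

spam_corpus = "win a free prize click now cheap meds limited offer " * 20
legit_corpus = "meeting agenda lecture slides project report schedule " * 20

def added_bytes(corpus: str, message: str) -> int:
    base = len(zlib.compress(corpus.encode()))
    combined = len(zlib.compress((corpus + " " + message).encode()))
    return combined - base

def classify(message: str) -> str:
    return ("spam" if added_bytes(spam_corpus, message)
            < added_bytes(legit_corpus, message) else "legitimate")

print(classify("claim your free prize now"))       # expected: spam
print(classify("updated agenda for the meeting"))  # expected: legitimate
```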

Page 23: Countering Spam Using Classification Techniques

Adversarial Classification

Classifiers assume a clear distinction between spam and legitimate features

Camouflaged messages mask spam content with legitimate content, disrupting decision boundaries for classifiers
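
A toy illustration of why camouflage works (classifier and data are made up for demonstration, not results from the lecture): padding a spam message with legitimate-looking text pulls its word statistics toward the legitimate class.

```python
# Toy camouflage attack: padding spam with legitimate text shifts its word
# statistics toward the legitimate class. Classifier and data are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ["free prize win now", "cheap meds offer now",
         "meeting agenda attached", "lecture slides and project report"]
labels = ["spam", "spam", "legitimate", "legitimate"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train), labels)

spam = "free prize win now"
camouflaged = spam + " meeting agenda lecture slides project report" * 3

for msg in (spam, camouflaged):
    probs = clf.predict_proba(vec.transform([msg]))[0]
    # The plain message scores as spam; the padded one leans legitimate.
    print(dict(zip(clf.classes_, probs.round(3))))
```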

Page 24: Countering Spam Using Classification Techniques

Camouflage Attacks

Baseline performance: accuracies consistently higher than 98%

Classifiers under attack: accuracies degrade to between 50% and 70%

Retrained classifiers: accuracies climb back to between 91% and 99%

Page 25: Countering Spam Using Classification Techniques

Camouflage Attacks (cont.)

Retraining postpones the problem, but it doesn’t solve it

We can identify features that are less susceptible to attack, but that’s simply another stalling technique

Page 26: Countering Spam Using Classification Techniques

Image Spam

What happens when an email does not contain textual features?

OCR is easily defeated

Classification using image properties
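
A rough sketch of the image-property route, assuming Pillow is available; the chosen properties (dimensions, format, color count, file size) are illustrative, not the specific feature set used in published image-spam work.

```python
# Sketch: image-property features for image spam, as an alternative to OCR.
# Assumes Pillow is installed; the property choices are illustrative only.
import os
from PIL import Image

def image_features(path: str) -> dict:
    with Image.open(path) as img:
        width, height = img.size
        # getcolors returns None if the image has more than maxcolors colors
        colors = img.convert("RGB").getcolors(maxcolors=1 << 16)
        return {
            "width": width,
            "height": height,
            "format": img.format,
            "file_size": os.path.getsize(path),
            "num_colors": len(colors) if colors else -1,
            "aspect_ratio": round(width / height, 2),
        }

# These feature dicts would then feed a standard classifier (SVM, trees, etc.).
# Example call (hypothetical file path):
# print(image_features("attachment.gif"))
```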

Page 27: Countering Spam Using Classification Techniques

Overview

Introduction

Countering Email Spam: Problem Description, Classification History, Ongoing Research

Countering Web Spam: Problem Description, Classification History, Ongoing Research

Conclusions

Page 28: Countering Spam Using Classification Techniques

Countering Web Spam

What is web spam? Traditional definition vs. our definition

Web spam accounts for between 13.8% and 22.1% of all web pages

Page 29: Countering Spam Using Classification Techniques

Ad Farms

Only contain advertising links (usually ad listings)

Elaborate entry pages used to deceive visitors

Page 30: Countering Spam Using Classification Techniques

Ad Farms (cont.)

Clicking on an entry page link leads to an ad listing

Ad syndicators provide the content

Web spammers create the HTML structures

Page 31: Countering Spam Using Classification Techniques

Parked Domains

Domain parking services provide placeholders for newly registered domains and allow ad listings to be used as placeholders to monetize a domain

Inevitably, web spammers abused these services

Page 32: Countering Spam Using Classification Techniques

Parked Domains (cont.)

Functionally equivalent to Ad Farms: both rely on ad syndicators for content, and both provide little to no value to their visitors

Unique characteristics: reliance on domain parking services (e.g., apps5.oingo.com, searchportal.information.com, etc.); typically for sale by owner (“Offer To Buy This Domain”)

Page 33: Countering Spam Using Classification Techniques

Parked Domains (cont.)

Page 34: Countering Spam Using Classification Techniques

Advertisements

Pages advertising specific products or services

Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

Page 35: Countering Spam Using Classification Techniques

Problem Description

Web spam detection can also be modeled as a binary text classification problem

Salton’s vector space model is quite common

Feature processing and performance evaluation are also quite similar

But what about feature generation…

Page 36: Countering Spam Using Classification Techniques

How do we generate features?

Sources of information:

HTTP connections – hosting IP addresses, session headers

HTML content – textual properties, structural properties

URL linkage structure – PageRank scores, neighbor properties
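
A small standard-library sketch of deriving a few HTTP- and HTML-based features from a page's session headers and markup; the particular features are assumptions for illustration, not the published feature sets.

```python
# Sketch: HTTP- and HTML-based features for web spam classification.
# Feature choices are illustrative; headers/HTML below are toy inputs.
from html.parser import HTMLParser

class PageStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.num_links = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.num_links += 1

    def handle_data(self, data):
        self.visible_chars += len(data.strip())

def page_features(headers: dict, html: str) -> dict:
    stats = PageStats()
    stats.feed(html)
    return {
        "server": headers.get("Server", ""),
        "content_length": int(headers.get("Content-Length", len(html))),
        "num_links": stats.num_links,
        "visible_fraction": round(stats.visible_chars / max(len(html), 1), 3),
    }

html = '<html><body><a href="http://ads.example.com">cheap deals</a></body></html>'
headers = {"Server": "Apache", "Content-Length": str(len(html))}
print(page_features(headers, html))
```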

Page 37: Countering Spam Using Classification Techniques

Classification History

Davison (2000): Was the first to investigate link-based web spam; built decision trees to successfully identify “nepotistic links”

Becchetti et al. (2005): Revisited the use of decision trees to identify link-based web spam; used link-based features such as PageRank and TrustRank scores

Page 38: Countering Spam Using Classification Techniques

Classification History

Drost and Scheffer (2005): Used Support Vector Machines to classify web spam pages; relied on content-based features as well as link-based features

Ntoulas et al. (2006): Built decision trees to classify web spam; used content-based features (e.g., fraction of visible content, compressibility, etc.)

Page 39: Countering Spam Using Classification Techniques

Classification History

Up to this point, web spam research had been limited to small (on the order of a few thousand pages), private data sets

Webb et al. (2006): Presented the Webb Spam Corpus – a first-of-its-kind, large-scale, publicly available web spam corpus (almost 350K web spam pages) – http://www.webbspamcorpus.org

Castillo et al. (2006): Presented the WEBSPAM-UK2006 corpus – a publicly available web spam corpus (only contains 1,924 web spam pages)

Page 40: Countering Spam Using Classification Techniques

Classification History

Castillo et al. (2007): Created a cost-sensitive decision tree to identify web spam in the WEBSPAM-UK2006 data set; used link-based features from [Becchetti et al. (2005)] and content-based features from [Ntoulas et al. (2006)]

Webb et al. (2008): Compared various classifiers (e.g., SVM, decision trees, etc.) using HTTP session information exclusively; used the Webb Spam Corpus, WebBase data, and the WEBSPAM-UK2006 data set; found that these classifiers are comparable to (and in many cases, better than) existing approaches

Page 41: Countering Spam Using Classification Techniques

Ongoing Research

Redirection

Phishing

Social Spam

Page 42: Countering Spam Using Classification Techniques

Redirection

144,801 unique redirect chains (1.54 HTTP redirects per chain on average)

43.9% of web spam pages use some form of HTML or JavaScript redirection

Breakdown of redirection techniques:

302 HTTP redirect – 49%
frame redirect – 14%
301 HTTP redirect – 11%
iframe redirect – 8%
meta refresh and location.replace() – 7%
meta refresh – 5%
meta refresh and location – 3%
location* – 2%
Other – 1%
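
A rough sketch of spotting the HTML and JavaScript redirection types listed above in page markup; the regular expressions are simplistic assumptions and would miss obfuscated variants.

```python
# Sketch: detect common HTML/JavaScript redirection techniques in page markup.
# The patterns are simplistic assumptions and will miss obfuscated variants.
import re

PATTERNS = {
    "meta refresh": r"<meta[^>]+http-equiv=[\"']?refresh",
    "frame redirect": r"<frame[^>]+src=",
    "iframe redirect": r"<iframe[^>]+src=",
    "location.replace()": r"location\.replace\s*\(",
    "location assignment": r"(?:window\.|document\.)?location\s*=",
}

def detect_redirects(html: str) -> list[str]:
    return [name for name, pat in PATTERNS.items()
            if re.search(pat, html, flags=re.IGNORECASE)]

page = '<html><head><meta http-equiv="refresh" content="0;url=http://spam.example.com"></head></html>'
print(detect_redirects(page))  # ['meta refresh']
```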

Page 43: Countering Spam Using Classification Techniques

Phishing

Interesting form of deception that affects email and web users

Another form of adversarial classification

Page 44: Countering Spam Using Classification Techniques

Social Spam

Comment spam

Bulletin spam

Message spam

Page 45: Countering Spam Using Classification Techniques

Conclusions

Email and web spam are currently two of the largest information security problems

Classification techniques offer an effective way to filter this low quality information

Spammers are extremely dynamic, which opens up several important areas of future research…

Page 46: Countering Spam Using Classification Techniques

Questions