Countering Spam Using Classification Techniques Steve Webb [email protected] Data Mining Guest...

Countering Spam Using Classification Techniques

Steve [email protected] Mining Guest LectureFebruary 21, 2008

Overview

Introduction Countering Email Spam

Problem Description Classification History Ongoing Research

Countering Web Spam Problem Description Classification History Ongoing Research

Conclusions

Introduction

The Internet has spawned numerous information-rich environments Email Systems World Wide Web Social Networking

Communities

Openness facilities information sharing, but it also makes them vulnerable…

Denial of Information (DoI) Attacks

Deliberate insertion of low quality information (or noise) into information-rich environments

Information analog to Denial of Service (DoS) attacks

Two goals Promotion of ideals by means of deception Denial of access to high quality information

Spam is the currently the most prominent example of a DoI attack

Overview

IntroductionIntroduction Countering Email Spam

Problem Description Classification History Ongoing Research

Countering Web SpamCountering Web Spam Problem DescriptionProblem Description Classification HistoryClassification History Ongoing ResearchOngoing Research

ConclusionsConclusions

Countering Email Spam

Close to 200 billion (yes, billion) emails are sent each day

Spam accounts for around 90% of that email traffic

~2 million spam messages every second

Old Email Spam Examples

Problem Description

Email spam detection can be modeled as a binary text classification problem Two classes: spam and legitimate (non-spam)

Example of supervised learning Build a model (classifier) based on training data to approximate

the target function

Construct a function : M {spam, legitimate} such that it overlaps : M {spam, legitimate} as much as possible

Problem Description (cont.)

How do we represent a message?

How do we generate features?

How do we process features?

How do we evaluate performance?

How do we represent a message?

Classification algorithms require a consistent format

Salton’s vector space model (“bag of words”) is the most popular representation

Each message m is represented as a feature vector f of n features: <f1, f2, …, fn>


Sources of information SMTP connections

Network properties

Email headers Social networks

Email body Textual parts URLs Attachments

How do we process features?

Feature Tokenization Alphanumeric tokens N-grams Phrases

Feature Scrubbing Stemming Stop word removal

Feature Selection Simple feature removal Information-theoretic

algorithms

dc

cFN

ba

bFP

dc

dR

db

dP

How do we evaluate performance?

Traditional IR metrics Precision vs. Recall

False positives vs. False negatives Imbalanced error costs

ROC curves

Classification History

Sahami et al. (1998)Used a Naïve Bayes classifierWere the first to apply text classification

research to the spam problem

Pantel and Lin (1998)Also used a Naïve Bayes classifierFound that Naïve Bayes outperforms RIPPER

Classification History (cont.)

Drucker et al. (1999) Evaluated Support Vector Machines as a solution to

spam Found that SVM is more effective than RIPPER and

Rocchio

Hidalgo and Lopez (2000) Found that decision trees (C4.5) outperform Naïve

Bayes and k-NN


Up to this point, private corpora were used exclusively in email spam research

Androutsopoulos et al. (2000a) Created the first publicly available email spam corpus

(Ling-spam) Performed various feature set size, training set size,

stemming, and stop-list experiments with a Naïve Bayes classifier


Androutsopoulos et al. (2000b) Created another publicly available email spam corpus

(PU1) Confirmed previous research than Naïve Bayes

outperforms a keyword-based filter

Carreras and Marquez (2001) Used PU1 to show that AdaBoost is more effective

than decision trees and Naïve Bayes


Androutsopoulos et al. (2004) Created 3 more publicly available corpora (PU2, PU3, and PUA) Compared Naïve Bayes, Flexible Bayes, Support Vector

Machines, and LogitBoost: FB, SVM, and LB outperform NB

Zhang et al. (2004) Used Ling-spam, PU1, and the SpamAssassin corpora Compared Naïve Bayes, Support Vector Machines, and

AdaBoost: SVM and AB outperform NB


CEAS (2004 – present) Focuses solely on email and anti-spam research Generates a significant amount of academic and industry anti-spam

research

Klimt and Yang (2004) Published the Enron Corpus – the first large-scale corpus of legitimate

email messages

TREC Spam Track (2005 – present) Produces new corpora every year Provides a standardized platform to evaluate classification algorithms

Ongoing Research

Concept Drift

New Classification Approaches

Adversarial Classification

Image Spam

Concept Drift

Spam content is extremely dynamic Topic drift (e.g., specific

scams) Technique drift (e.g.,

obfuscations)

How do we keep up with the Joneses?

Batch vs. Online Learning

New Classification Approaches

Filter Fusion

Compression-based Filtering

Network behavioral clustering

Adversarial Classification

Classifiers assume a clear distinction between spam and legitimate features

Camouflaged messages Mask spam content with

legitimate content Disrupt decision

boundaries for classifiers

Camouflage Attacks

Baseline performance Accuracies consistently

higher than 98%

Classifiers under attack Accuracies degrade to

between 50% and 70%

Retrained classifiers Accuracies climb back to

between 91% and 99%

Camouflage Attacks (cont.)

Retraining postpones the problem, but it doesn’t solve it

We can identify features that are less susceptible to attack, but that’s simply another stalling technique

Image Spam

What happens when an email does not contain textual features?

OCR is easily defeated

Classification using image properties

Overview

IntroductionIntroduction Countering Email SpamCountering Email Spam

Problem DescriptionProblem Description Classification HistoryClassification History Ongoing ResearchOngoing Research

Countering Web Spam Problem Description Classification History Ongoing Research

ConclusionsConclusions

Countering Web Spam

What is web spam? Traditional definition Our definition

Between 13.8% and 22.1% of all web pages

Ad Farms

Only contain advertising links (usually ad listings)

Elaborate entry pages used to deceive visitors

Ad Farms (cont.)

Clicking on an entry page link leads to an ad listing

Ad syndicators provide the content

Web spammers create the HTML structures

Parked Domains

Domain parking servicesProvide place holders for newly registered

domainsAllow ad listings to be used as place holders

to monetize a domain

Inevitably, web spammers abused these services

Parked Domains (cont.)

Functionally equivalent to Ad Farms Both rely on ad syndicators for content Both provide little to no value to their visitors

Unique Characteristics Reliance on domain parking services (e.g.,

apps5.oingo.com, searchportal.information.com, etc.) Typically for sale by owner (“Offer To Buy This

Domain”)

Parked Domains (cont.)

Advertisements

Pages advertising specific products or services

Examples of the kinds of pages being advertised in Ad Farms and Parked Domains

Problem Description

Web spam detection can also be modeled as a binary text classification problem

Salton’s vector space model is quite common

Feature processing and performance evaluation are also quite similar

But what about feature generation…


Sources of information HTTP connections

Hosting IP addresses Session headers

HTML content Textual properties Structural properties

URL linkage structure PageRank scores Neighbor properties


Davison (2000) Was the first to investigate link-based web spam Built decision trees to successfully identify “nepotistic

links”

Becchetti et al. (2005) Revisited the use of decision trees to identify link-

based web spam Used link-based features such as PageRank and

TrustRank scores


Drost and Scheffer (2005) Used Support Vector Machines to classify web spam

pages Relied on content-based features as well as link-

based features

Ntoulas et al. (2006) Built decision trees to classify web spam Used content-based features (e.g., fraction of visible

content, compressibility, etc.)


Up to this point, previous web spam research was limited to small (on the order of a few thousand), private data sets

Webb et al. (2006) Presented the Webb Spam Corpus – a first-of-its-kind large-scale,

publicly available web spam corpus (almost 350K web spam pages) http://www.webbspamcorpus.org

Castillo et al. (2006) Presented the WEBSPAM-UK2006 corpus – a publicly available web

spam corpus (only contains 1,924 web spam pages)


Castillo et al. (2007) Created a cost-sensitive decision tree to identify web spam in

the WEBSPAM-UK2006 data set Used link-based features from [Becchetti et al. (2005)] and

content-based features from [Ntoulas et al. (2006)]

Webb et al. (2008) Compared various classifiers (e.g., SVM, decision trees, etc.)

using HTTP session information exclusively Used the Webb Spam Corpus, WebBase data, and the

WEBSPAM-UK2006 data set Found that these classifiers are comparable to (and in many

cases, better than) existing approaches

Ongoing Research

Redirection

Phishing

Social Spam

Redirection

144,801 unique redirect chains (1.54 average HTTP redirects)

43.9% of web spam pages use some form of HTML or JavaScript redirection

49%

14%

11%

8%

7%

5%

3%

2%

1%

302 HTTP redirect

frame redirect

301 HTTP redirect

iframe redirect

meta refresh andlocation.replace()

meta refresh

meta refresh and location

location*

Other

Phishing

Interesting form of deception that affects email and web users

Another form of adversarial classification

Social Spam

Comment spam

Bulletin spam

Message spam

Conclusions

Email and web spam are currently two of the largest information security problems

Classification techniques offer an effective way to filter this low quality information

Spammers are extremely dynamic, generating various areas of important future research…

Questions

Countering Spam Using Classification Techniques Steve Webb [email protected] Data Mining Guest...

Documents

Transcript of Countering Spam Using Classification Techniques Steve Webb [email protected] Data Mining Guest...