Spam Detection Jingrui He 10/08/2007. Spam Types Email Spam Unsolicited commercial email Blog Spam...

Post on 22-Dec-2015


Spam Detection

Jingrui He

10/08/2007

Spam Types

Email Spam: Unsolicited commercial email

Blog Spam: Unwanted comments in blogs

Splogs: Fake blogs to boost PageRank

From a Learning Point of View

Spam Detection: Classification problem (ham vs. spam)

Feature Extraction: A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung

Fast Classifier: Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman

A Learning Approach to Spam Detection based on Social Networks

H.Y. Lam and D.Y. Yeung

CEAS 2007

Problem Statement

n email accounts

Sender set; receiver set

Labeled sender set: a subset of the sender set whose accounts have known labels

Goal: Assign a spam score to each of the remaining (unlabeled) sender accounts

System Flow Chart

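Social Network from Logs

Directed graph: one node per email account

Directed edge: an email was sent from one account to another

Edge weight: based on the number of emails sent from the source account to the destination account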


Features from Email Social Networks

In-count / Out-count: The sum of in-coming / out-going edge weights

In-degree / Out-degree: The number of email accounts that a node receives emails from / sends emails to
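The count and degree features above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the log format (a list of sender/recipient pairs), the function names, and the choice of using the raw email count as the edge weight are all assumptions made for the example.

```python
from collections import defaultdict

def build_email_graph(log):
    """Build a weighted directed graph from (sender, recipient) pairs.

    Assumption: the edge weight is simply the number of emails sent
    along that edge; the slides do not preserve the exact definition.
    """
    weights = defaultdict(int)  # (sender, recipient) -> email count
    for sender, recipient in log:
        weights[(sender, recipient)] += 1
    return dict(weights)

def count_and_degree_features(weights, node):
    """In/out-count (sum of edge weights) and in/out-degree (number of accounts)."""
    out_edges = {r: w for (s, r), w in weights.items() if s == node}
    in_edges = {s: w for (s, r), w in weights.items() if r == node}
    return {
        "in_count": sum(in_edges.values()),
        "out_count": sum(out_edges.values()),
        "in_degree": len(in_edges),
        "out_degree": len(out_edges),
    }

# Hypothetical toy log: one (sender, recipient) pair per email.
log = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("d", "a")]
graph = build_email_graph(log)
features_a = count_and_degree_features(graph, "a")
```

Here account "a" sends three emails to two accounts and receives two emails from two accounts, so its out-count and out-degree differ while its in-count and in-degree coincide.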

Features from Email Social Networks

Communication Reciprocity (CR): The percentage of interactive neighbors that a node has, defined in terms of the set of accounts that received emails from the node and the set of accounts that sent emails to the node

Communication Interaction Average (CIA): The level of interaction between a sender and each of the corresponding recipients
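A possible reading of CR and CIA is sketched below. The formulas did not survive in this transcript, so both are assumptions: CR is taken as the fraction of a node's recipients that also wrote back, and CIA as the average, over recipients, of the ratio between emails received from and emails sent to that recipient. The function name and the weight dictionary layout are hypothetical.

```python
def reciprocity_features(weights, node):
    """Assumed definitions (not confirmed by the slides):
    CR  = |O & I| / |O|
    CIA = mean over r in O of w(r -> node) / w(node -> r)
    where O is the set of accounts that received emails from `node`
    and I is the set of accounts that sent emails to `node`.
    """
    out_set = {r for (s, r) in weights if s == node}
    in_set = {s for (s, r) in weights if r == node}
    if not out_set:  # a node that never sends has no recipients to interact with
        return {"CR": 0.0, "CIA": 0.0}
    cr = len(out_set & in_set) / len(out_set)
    cia = sum(weights.get((r, node), 0) / weights[(node, r)] for r in out_set) / len(out_set)
    return {"CR": cr, "CIA": cia}

# Hypothetical edge weights: (sender, recipient) -> number of emails.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 1, ("d", "a"): 1}
features_a = reciprocity_features(weights, "a")
```

Under these assumed definitions, an account that sends to many recipients who never reply gets CR and CIA near zero, which matches the intuition that spam senders have few interactive neighbors.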

Features from Email Social Networks

Clustering Coefficient (CC): Friends-of-friends relationship between email accounts, computed from the number of neighbors of a node and the number of connections between those neighbors
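The clustering coefficient can be illustrated with the standard undirected definition, CC = 2E / (k(k - 1)), where k is the number of neighbors of a node and E is the number of connections between those neighbors. Treating the email graph as undirected is an assumption made here for simplicity; the slide's exact formula is not preserved, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def clustering_coefficient(edges, node):
    """Standard undirected clustering coefficient of `node`.

    Assumption: edge direction and weight are ignored.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:  # fewer than two neighbors: no pair to connect
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy graph: "a" has neighbors b, c, d; only b and c know each other.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
cc_a = clustering_coefficient(edges, "a")
```

In the toy graph one of the three possible neighbor pairs is connected, so the coefficient for "a" is one third.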


Preprocessing

Sender Feature Vector

Weighted Features

Problematic?


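Assigning Spam Score

Similarity-weighted k-NN method

Gaussian similarity between feature vectors gives the weights w_{ij}

Similarity-weighted mean of k-NN scores, where \mathcal{N}_i is the set of k nearest neighbors of x_i:

y_i = \frac{\sum_{j : x_j \in \mathcal{N}_i} w_{ij} y_j}{\sum_{j : x_j \in \mathcal{N}_i} w_{ij}}

Score scaling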
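The similarity-weighted k-NN score can be sketched as follows. Several details are assumptions, since the slide formulas are only partly recoverable: the Gaussian similarity is taken as exp(-||x_i - x_j||^2 / (2 sigma^2)), labels are encoded as 1 for spam and 0 for legitimate, and the final score-scaling step is omitted. The function name and the `sigma` parameter are hypothetical.

```python
import math

def knn_spam_score(x, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the labels of the k nearest labeled senders.

    labeled: list of (feature_vector, y) pairs with y = 1 (spam) or 0 (legitimate).
    Assumption: Gaussian similarity with bandwidth `sigma`.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(labeled, key=lambda item: sq_dist(x, item[0]))[:k]
    sims = [math.exp(-sq_dist(x, xj) / (2.0 * sigma ** 2)) for xj, _ in nearest]
    total = sum(sims)
    if total == 0.0:  # all neighbors too far away: fall back to a plain mean
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w * yj for w, (_, yj) in zip(sims, nearest)) / total

# Hypothetical one-dimensional feature vectors.
labeled = [([0.0], 1), ([1.0], 0), ([10.0], 0)]
score = knn_spam_score([0.0], labeled, k=2, sigma=1.0)
```

Because the query coincides with a labeled spam sender, that neighbor gets similarity 1 and dominates the legitimate neighbor one unit away, so the score lands above 0.5.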

Experiments

Enron dataset: 9150 senders

To get legitimate Enron senders: email transactions within the Enron email domain

5000 generated spam accounts

120 senders from each class

Results: Averaged over 100 Runs

Number of Nearest Neighbors

Feature Weights (CC)

Feature Weights (CIA)

Feature Weights (CR)

Feature Weights In/Out-Count & In/Out-Degree

The smaller the better

Final Weights

In/Out-count & In/Out-degree: 1

CR: 1

CIA: 10

CC: 15

Conclusion

Legitimacy Score: No content needed

Can Be Combined with Content-Based Filters

More Sophisticated Classifiers: SVM, boosting, etc.

Classifiers Using Combined Features

Relaxed Online SVMs for Spam Filtering

D. Sculley and G.M. Wachman

SIGIR 2007

Anti-Spam Controversy: Support Vector Machines (SVMs)

Academic Researchers: Statistically robust; state-of-the-art performance

Practitioners: Quadratic in the number of training examples; impractical!

Solution: Relaxed Online SVMs

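Background: SVMs

Data set: \{(x_1, y_1), \ldots, (x_n, y_n)\}

Class label y_i: 1 for spam; -1 for ham

Classifier: f(x) = \langle w, x \rangle + b

To find w and b, minimize:

\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i

Constraints:

y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n

\xi_i: slack variable

Minimizing \frac{1}{2}\|w\|^2 maximizes the margin; minimizing \sum_i \xi_i minimizes the loss function

C: tradeoff parameter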
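To make the tradeoff concrete, the standard soft-margin objective can be evaluated for a fixed linear classifier: each slack variable is the hinge loss max(0, 1 - y f(x)), and C scales the total slack against the margin term. The function name and the toy data below are hypothetical and chosen only for illustration.

```python
def svm_objective(w, b, data, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(slack) for a linear classifier.

    data: list of (x, y) pairs with y = 1 (spam) or -1 (ham).
    Returns the objective value and the per-example slack values.
    """
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - y * f(x)) for x, y in data]
    return margin_term + C * sum(slacks), slacks

# Hypothetical data: two points outside the margin, one spam message inside it.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], 1)]
objective, slacks = svm_objective([1.0, 0.0], 0.0, data, C=10.0)
```

With C = 10, the single margin violation of 0.5 contributes ten times as much to the objective as the margin term, which is the sense in which a large C prioritizes training error over margin width.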

Online SVMs

Tuning the Tradeoff Parameter C

SpamAssassin data set: 6034 examples

Large C preferred

Email Spam and SVMs

TREC05P-1: 92189 messages

TREC06P: 37822 messages

Blog Comment Spam and SVMs

Leave-one-out cross validation

50 blog posts; 1024 comments

Splogs and SVMs

Leave-one-out cross validation

1380 examples

Computational Cost

Online SVMs: Quadratic training time

Relaxed Online SVMs (ROSVM)

Objective function of SVMs: \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i

Large C preferred: Minimizing training error is more important than maximizing the margin

ROSVM: Full margin maximization is not necessary; relax this requirement

The last value found for when

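Three Ways to Relax SVMs (1)

Only optimize over the most recent p examples

Dual form of SVMs: maximize \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle

Constraints: 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0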

Three Ways to Relax SVMs (2)

Only update on actual errors

Original online SVMs: Update when y_i f(x_i) < 1

ROSVM: Update when y_i f(x_i) < M, with 0 \le M \le 1

M = 0: mistake-driven online SVMs

No significant degradation in performance

Significantly reduces cost

Three Ways to Relax SVMs (3)

Reduce the number of iterations in iterative SVMs

SMO: repeated passes over the training set to minimize the objective function

Parameter T: the maximum number of iterations

T = 1: little impact on performance
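The three relaxations can be combined in one small sketch. This is not the authors' ROSVM: the inner optimizer here is plain subgradient descent on the soft-margin objective, swapped in for SMO to keep the example short, and the class name, the learning rate `lr`, and the toy stream are all hypothetical. What it does show is where p, M, and T enter an online training loop.

```python
from collections import deque

class RelaxedOnlineLinearSVM:
    """Online linear SVM sketch with the three relaxations.

    p: only the most recent p examples are optimized over.
    M: an update is triggered only when y * f(x) < M.
    T: at most T passes over the window per update.
    Assumption: subgradient steps replace SMO as the inner optimizer.
    """

    def __init__(self, n_features, C=1.0, p=100, M=0.8, T=1, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.C, self.M, self.T, self.lr = C, M, T, lr
        self.window = deque(maxlen=p)  # relaxation 1: sliding window of size p

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def partial_fit(self, x, y):
        """Process one example; return True if the model was updated."""
        self.window.append((x, y))
        if y * self.decision(x) >= self.M:  # relaxation 2: skip well-classified examples
            return False
        n = len(self.window)
        for _ in range(self.T):  # relaxation 3: bounded number of passes
            for xj, yj in self.window:
                violated = yj * self.decision(xj) < 1.0
                for d in range(len(self.w)):
                    hinge_grad = self.C * yj * xj[d] if violated else 0.0
                    self.w[d] -= self.lr * (self.w[d] / n - hinge_grad)
                if violated:
                    self.b += self.lr * self.C * yj
        return True

# Hypothetical stream: label 1 = spam, -1 = ham.
stream = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([3.0, 1.0], 1), ([-3.0, -1.0], -1)]
model = RelaxedOnlineLinearSVM(n_features=2, p=10, M=0.8, T=1)
updates = [model.partial_fit(x, y) for x, y in stream]
```

On this toy stream only the first two examples trigger an update; the last two already satisfy y f(x) >= M and are skipped, which is the source of the cost savings the slide describes.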

Testing Reduced Size

Testing Reduced Iterations

Testing Reduced Updates

Online SVMs and ROSVM ROSVM:

Email Spam

Blog Comment Spam

Splog Data Set