Improving Spam Detection Based on Structural Similarity

By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida, Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida

Presented at Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005

Presented by Jared Bott

Outline

Overview

Concepts

Detecting Spam

Experimental Results

Analysis of Paper

Overview

New algorithm to detect spam messages

Uses email information that is harder to change

Works in conjunction with another spam classifier, e.g. SpamAssassin

Fewer false positives than the compared methods

Spam Detection Problem

Spam detection algorithms use some part of emails to determine if a message is spam

Spammers change messages so that they do not meet detection criteria for spam

Very easy to change spam messages, usernames, domains, subjects, etc.

Key Idea

The lists that spammers and legitimate users send messages to and from can be used as the identifiers of classes of email traffic.

The lists of addresses spammers send to are unlikely to be similar to those of legitimate users.

Lists don’t change that often

Using Lists

A user is not just an email address. It can be a domain, etc.

Represent an email user as a vector in a multi-dimensional conceptual space created with all possible contacts

Each sender and each recipient has their own vector

Model the relationship between senders and recipients

Constructing Vectors

If there is at least one email sent from sender si to recipient rn, then the value in si’s vector’s nth dimension is 1. Otherwise, that value is 0.

If there is at least one email received by recipient ri from sender sn, the value in ri’s vector’s nth dimension is 1. Otherwise it is 0.
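
The two rules above can be sketched in a few lines of Python. This is only an illustration: the function name `build_vectors`, the user labels, and the three-user log are hypothetical and do not come from the paper.

```python
def build_vectors(log, users):
    """Build binary sender and recipient vectors from (sender, recipient) pairs."""
    index = {user: n for n, user in enumerate(users)}
    send = {user: [0] * len(users) for user in users}
    recv = {user: [0] * len(users) for user in users}
    for sender, recipient in log:
        # One email is enough to set the dimension to 1; repeats change nothing.
        send[sender][index[recipient]] = 1
        recv[recipient][index[sender]] = 1
    return send, recv

# Hypothetical log: u1 and u2 write to each other and to u3; u3 sends nothing.
log = [("u1", "u2"), ("u1", "u3"), ("u2", "u1"), ("u2", "u3"), ("u1", "u2")]
send, recv = build_vectors(log, ["u1", "u2", "u3"])
```

Because the vectors are binary, the duplicate ("u1", "u2") entry in the log has no effect on the result.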

Example Vectors

User 1: S[0,1,1] R[0,1,0]

User 2: S[1,0,1] R[1,0,0]

User 3: S[0,0,0] R[1,1,0]

Similarity Between Senders

Similarity between senders si and sk is the cosine of the angle between their vectors, cos(si, sk)

0 means no shared contacts

1 means identical contact lists

Among legitimate senders, a 1 means that the senders operate in the same social group.

Among spammers, a 1 means that the senders use the same list or are the same person.
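
A minimal sketch of the cosine measure on binary contact vectors follows. Returning 0 for a user with no contacts is an assumption made here to avoid dividing by zero; the slides do not say how that case is handled.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length contact vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty contact list is similar to nothing
    return dot / (norm_a * norm_b)

same_list = cosine([0, 1, 1], [0, 1, 1])   # identical contact lists
no_overlap = cosine([1, 0, 0], [0, 1, 0])  # no shared contact
one_shared = cosine([0, 1, 1], [1, 0, 1])  # one shared contact out of two each
```

For binary vectors the dot product simply counts shared contacts, so the value grows with the overlap between the two lists.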

Grouping Users Into Clusters

Group users with similar vectors

Users with similar vectors are likely to have related roles, i.e. spammer or legitimate user

Each cluster is represented by a vector

This vector is the sum of all its component users’ vectors
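
As a small illustration of the cluster representation, the sketch below sums member vectors dimension by dimension. The helper name `cluster_vector` is hypothetical, and how users are assigned to clusters in the first place is not covered here.

```python
def cluster_vector(member_vectors):
    """Sum the members' contact vectors dimension by dimension."""
    return [sum(column) for column in zip(*member_vectors)]

# Two hypothetical senders that share their third contact.
sc = cluster_vector([[0, 1, 1], [1, 0, 1]])
```

Unlike the user vectors, the cluster vector is not binary: each dimension counts how many members have that contact.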

Similarity Between a User and a Cluster

Similarity is derived from the user-to-user similarity equation

If sender si is a member of cluster sck, then the similarity is cos(sck – si, si).

If sender si is not a member of cluster sck, then the similarity is cos(sck, si).

Similarity between a user and a cluster will change over time

Remove the user’s vector from the cluster’s vector when computing similarity and reclassifying a user
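
The member and non-member cases can be sketched as below. The function names are hypothetical, and the zero-vector guard is an assumption added to keep the example runnable.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_cluster_similarity(user_vec, cluster_vec, is_member):
    """cos(sck - si, si) for members of the cluster, cos(sck, si) otherwise."""
    if is_member:
        # Take the user's own contribution out so it is not compared with itself.
        cluster_vec = [c - u for c, u in zip(cluster_vec, user_vec)]
    return cosine(cluster_vec, user_vec)

cluster = [1, 1, 2]  # hypothetical cluster: sum of [0, 1, 1] and [1, 0, 1]
as_member = user_cluster_similarity([0, 1, 1], cluster, is_member=True)
as_outsider = user_cluster_similarity([0, 1, 1], cluster, is_member=False)
```

Subtracting the member's vector leaves [1, 0, 1] in this example, so the member is compared only against the rest of its cluster.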

Detecting Spam

Two probabilities to compute

Ps(m) – Probability of an email m being sent by a spammer

Pr(m) – Probability of an email m being addressed to users that receive spam

Detecting Spam

When an email arrives, classify it using some other method

Find the cluster (sc) the email’s sender belongs to

If many users in the cluster send messages that are classified as spam by the auxiliary method, the probability of all the users in that cluster sending spam is high

Update sc’s spam probability

Ps(m) ← sc’s spam probability
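
One way to picture the sender-side step is sketched below. The slides do not give the update rule, so this assumes the cluster's spam probability is the fraction of its messages that the auxiliary classifier labeled as spam; `ClusterStats`, `sender_probability`, and the cluster labels are hypothetical names.

```python
class ClusterStats:
    """Running counts of messages seen from one sender cluster."""

    def __init__(self):
        self.spam = 0
        self.total = 0

    def update(self, is_spam):
        self.total += 1
        if is_spam:
            self.spam += 1

    @property
    def spam_probability(self):
        # Assumption: probability = share of messages the auxiliary method called spam.
        return self.spam / self.total if self.total else 0.0

def sender_probability(sender, aux_says_spam, cluster_of, stats):
    """Update the sender's cluster with the auxiliary label and return Ps(m)."""
    cluster = stats.setdefault(cluster_of[sender], ClusterStats())
    cluster.update(aux_says_spam)
    return cluster.spam_probability

stats = {}
cluster_of = {"a@example.com": "sc1", "b@example.com": "sc1"}
ps_first = sender_probability("a@example.com", True, cluster_of, stats)
ps_second = sender_probability("b@example.com", False, cluster_of, stats)
```

Both senders share a cluster here, so the second message sees a probability shaped by the first sender's history.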

Detecting Spam

For all recipients of the email, find the cluster (rc) each one belongs to

Update the spam probability for each cluster

Pr(m) ← Pr(m) + spam probability of each rc

Pr(m) ← Pr(m)/number of recipients
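
The recipient-side average can be sketched in the same spirit. As before, the per-cluster probability is assumed to be the fraction of messages the auxiliary method labeled as spam, and all names and counts are hypothetical.

```python
def recipient_probability(recipients, aux_says_spam, cluster_of, counts):
    """Update each recipient's cluster and return Pr(m), the mean cluster spam probability."""
    if not recipients:
        raise ValueError("an email needs at least one recipient")
    total = 0.0
    for recipient in recipients:
        rc = cluster_of[recipient]
        spam, seen = counts.get(rc, (0, 0))
        spam, seen = spam + int(aux_says_spam), seen + 1
        counts[rc] = (spam, seen)  # update the cluster with the auxiliary label
        total += spam / seen       # Pr(m) <- Pr(m) + spam probability of rc
    return total / len(recipients)  # Pr(m) <- Pr(m) / number of recipients

counts = {"rc1": (3, 4), "rc2": (0, 1)}  # hypothetical (spam, total) history
cluster_of = {"x@example.com": "rc1", "y@example.com": "rc2"}
pr = recipient_probability(["x@example.com", "y@example.com"], True, cluster_of, counts)
```

In this sketch a cluster that contains several recipients of the same email is updated once per recipient; whether the paper counts it once or several times is not stated in the slides.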

Detecting Spam

Compute a spam rank for the email based upon Pr(m) and Ps(m)

If the spam rank is above some threshold (ω), label it as spam

If the spam rank is below 1 - ω, label it as legitimate

Otherwise label the email as the auxiliary method’s classification
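
The three-way decision can be sketched as follows. The slides do not say how Ps(m) and Pr(m) are combined into the spam rank, so the plain average used here is an assumption, as is the default value of ω.

```python
def classify(ps, pr, aux_says_spam, omega=0.85):
    """Label an email from its spam rank, falling back to the auxiliary classifier."""
    rank = (ps + pr) / 2  # assumption: the spam rank is the mean of Ps(m) and Pr(m)
    if rank > omega:
        return "spam"
    if rank < 1 - omega:
        return "legitimate"
    # In the uncertain band, keep whatever the auxiliary method decided.
    return "spam" if aux_says_spam else "legitimate"

overridden_to_spam = classify(0.95, 0.90, aux_says_spam=False)
overridden_to_ham = classify(0.05, 0.10, aux_says_spam=True)
deferred = classify(0.50, 0.50, aux_says_spam=True)
```

The rule only makes sense for ω above 0.5; otherwise the "spam" and "legitimate" bands would overlap.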

Experimental Results

Tested on a log of eight days of email from a large Brazilian university

Tested on a 2.8 GHz Pentium 4 with 512 MB RAM

Able to classify 20 messages per second

Faster than the average message arrival peak rate

Results

Measure                    Non-Spam   Spam      Aggregate
# of emails                191,417    173,584   365,001
Size of emails             11.3 GB    1.2 GB    12.5 GB
# of distinct senders      12,338     19,567    27,734
# of distinct recipients   22,762     27,926    38,875

Results

Manually checked false positives to see if they were spam or not

Auxiliary algorithm had more false positives

Algorithm                 % of Misclassifications
Original Classification   60.33%
Their approach            39.67%

Strengths

Fewer false positives than SpamAssassin

Low-cost

Works with message information that doesn’t change that much

Weaknesses

Needs an additional message classifier, e.g. SpamAssassin

Manual tuning of algorithm

Improvements

Time correlation of similar addresses

Collaborative filtering based upon user feedback