Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap...

121
Social Networks of Spammers Alfred O. Hero, III Department of Electrical Engineering and Computer Science University of Michigan September 22, 2008 A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 1 / 54

Transcript of Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap...

Page 1: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Social Networks of Spammers

Alfred O. Hero, III

Department of Electrical Engineering and Computer ScienceUniversity of Michigan

September 22, 2008

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 1 / 54

Page 2: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Outline

1 IntroductionObjectivesHarvesting and SpammingSocial Networks

2 MethodologyCommunity DetectionSimilarity Measures

3 Results

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 2 / 54

Page 3: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction

Outline

1 IntroductionObjectivesHarvesting and SpammingSocial Networks

2 MethodologyCommunity DetectionSimilarity Measures

3 Results

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 3 / 54

Page 4: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction

Acknowledgements

My co-authors on “Revealing social networks of spammersthrough spectral clustering,” ICC09 (submitted)

Kevin Xu, Yilun Chen, Peter Woolf - University of MichiganMark Kliger - Medasense Biometrics, Inc

Other collaboratorsJohn Bell, Nitin Nayar - University of MichiganMatthew Prince, Eric Langheinrich, Lee Holloway - UnspamTechnologiesMatt Roughan, Olaf Maennel - University of Adelaide

SponsorsNational Science Foundation CCR-0325571Office of Naval Research N00014-08-1065NSERC graduate fellowship program

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 4 / 54

Page 5: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction

Acknowledgements

My co-authors on “Revealing social networks of spammersthrough spectral clustering,” ICC09 (submitted)

Kevin Xu, Yilun Chen, Peter Woolf - University of MichiganMark Kliger - Medasense Biometrics, Inc

Other collaboratorsJohn Bell, Nitin Nayar - University of MichiganMatthew Prince, Eric Langheinrich, Lee Holloway - UnspamTechnologiesMatt Roughan, Olaf Maennel - University of Adelaide

SponsorsNational Science Foundation CCR-0325571Office of Naval Research N00014-08-1065NSERC graduate fellowship program

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 4 / 54

Page 6: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction

Acknowledgements

My co-authors on “Revealing social networks of spammersthrough spectral clustering,” ICC09 (submitted)

Kevin Xu, Yilun Chen, Peter Woolf - University of MichiganMark Kliger - Medasense Biometrics, Inc

Other collaboratorsJohn Bell, Nitin Nayar - University of MichiganMatthew Prince, Eric Langheinrich, Lee Holloway - UnspamTechnologiesMatt Roughan, Olaf Maennel - University of Adelaide

SponsorsNational Science Foundation CCR-0325571Office of Naval Research N00014-08-1065NSERC graduate fellowship program

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 4 / 54

Page 7: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Objectives

Objectives

Objectives of this studyTo reveal social networks of spammers

Identifying communities of spammersFinding characteristics or “signatures” of communities

To understand temporal dynamics of spammers’ behaviorDetecting changes in social structure

MotivationCurrent anti-spam methods

Content filteringIP address blacklisting

Allows us to fight spam from another perspective by usingspammers’ social structure

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 5 / 54

Page 8: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Objectives

Objectives

Objectives of this studyTo reveal social networks of spammers

Identifying communities of spammersFinding characteristics or “signatures” of communities

To understand temporal dynamics of spammers’ behaviorDetecting changes in social structure

MotivationCurrent anti-spam methods

Content filteringIP address blacklisting

Allows us to fight spam from another perspective by usingspammers’ social structure

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 5 / 54

Page 9: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Objectives

Background

Much of past research on spam has focussed on scalar analysis

Spam/phishing content structural analysis [Chandrasekaran,Narayanan, Uphadhyaya CSC06]Server lifetime and reachability analysis [Duan,Gopalan, Yuan,ICC07]Spam botnet behavior patterns [Ramachandran, Feamster,SIGCOMM06]Honeypot summary statistics [Prince, Holloway, Langheinrich,Dahl, Keller EAS05]

•We perform analysis of spammer interactions over entire spam cycle

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 6 / 54

Page 10: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Objectives

Background

Much of past research on spam has focussed on scalar analysis

Spam/phishing content structural analysis [Chandrasekaran,Narayanan, Uphadhyaya CSC06]Server lifetime and reachability analysis [Duan,Gopalan, Yuan,ICC07]Spam botnet behavior patterns [Ramachandran, Feamster,SIGCOMM06]Honeypot summary statistics [Prince, Holloway, Langheinrich,Dahl, Keller EAS05]

•We perform analysis of spammer interactions over entire spam cycle

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 6 / 54

Page 11: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

The Spam Cycle

Two phases of the spam cycleHarvesting: collecting email addresses from web sites using spambotsSpamming: sending large amounts of emails to collectedaddresses using spam servers

Spammers conceal their identity (IP address) in spamming phaseby using public SMTP servers, open proxies, botnets, etc.Key assumption: spammer IP address in harvesting phase isclosely related to actual location

Previous study found harvester IP address more closely related toactual spammer than spam server IP address (Prince et. al, 2005)We treat the harvester as the spam source

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 7 / 54

Page 12: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

The Spam Cycle

Two phases of the spam cycleHarvesting: collecting email addresses from web sites using spambotsSpamming: sending large amounts of emails to collectedaddresses using spam servers

Spammers conceal their identity (IP address) in spamming phaseby using public SMTP servers, open proxies, botnets, etc.Key assumption: spammer IP address in harvesting phase isclosely related to actual location

Previous study found harvester IP address more closely related toactual spammer than spam server IP address (Prince et. al, 2005)We treat the harvester as the spam source

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 7 / 54

Page 13: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

The Spam Cycle

Two phases of the spam cycleHarvesting: collecting email addresses from web sites using spambotsSpamming: sending large amounts of emails to collectedaddresses using spam servers

Spammers conceal their identity (IP address) in spamming phaseby using public SMTP servers, open proxies, botnets, etc.Key assumption: spammer IP address in harvesting phase isclosely related to actual location

Previous study found harvester IP address more closely related toactual spammer than spam server IP address (Prince et. al, 2005)We treat the harvester as the spam source

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 7 / 54

Page 14: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Harvestor Email Address Collection

Spam bot

Email addresses:[email protected]@abc.com

Links:john.html

www.def.com

HTML Source of www.abc.com

[email protected]@abc.com

[email protected]

[email protected]

...

Spammer’s Database

Email addresses:[email protected]

Links:kevin.html

Email addresses:[email protected]

Links:www.ghi.com

HTML Source of john.html HTML Source of www.def.com

...

How harvesters acquire email addresses using spam bots

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 8 / 54

Page 15: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

The Path of Spam

The path of spam: from an email address on a web page to your inbox

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 9 / 54

Page 16: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot

Network of decoy web pages (“honey pots”) with trap emailaddressesAll email received is spamUnique email address generated at each visitVisitor (harvester) IP address is trackedWhen spam is received, we know the harvester IP address inaddition to the spam server IP address

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 10 / 54

Page 17: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot

Network of decoy web pages (“honey pots”) with trap emailaddressesAll email received is spamUnique email address generated at each visitVisitor (harvester) IP address is trackedWhen spam is received, we know the harvester IP address inaddition to the spam server IP address

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 10 / 54

Page 18: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot

Network of decoy web pages (“honey pots”) with trap emailaddressesAll email received is spamUnique email address generated at each visitVisitor (harvester) IP address is trackedWhen spam is received, we know the harvester IP address inaddition to the spam server IP address

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 10 / 54

Page 19: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot

Network of decoy web pages (“honey pots”) with trap emailaddressesAll email received is spamUnique email address generated at each visitVisitor (harvester) IP address is trackedWhen spam is received, we know the harvester IP address inaddition to the spam server IP address

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 10 / 54

Page 20: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot

Network of decoy web pages (“honey pots”) with trap emailaddressesAll email received is spamUnique email address generated at each visitVisitor (harvester) IP address is trackedWhen spam is received, we know the harvester IP address inaddition to the spam server IP address

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 10 / 54

Page 21: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Project Honey Pot Statistics (as of Sept. 16, 2008)

Spam Trap Addresses Monitored: 29,765,172Spam Trap Monitoring Capability: 272,870,000,000Spam Servers Identified: 29,712,922Harvesters Identified: 52,069

www.projecthoneypot.org

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 11 / 54

Page 22: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Total Emails By Month

0

0.5

1

1.5

2

2.5x 10

6

2005

−09

2006

−03

2006

−09

2007

−03

2007

−09

Month

# of

em

ails

Total emails received at Project Honey Pot trap addresses by month

Outbreak of spam observed in October 2006 consistent withmedia reports

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 12 / 54

Page 23: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Total Active Harvesters By Month

0

1000

2000

3000

4000

5000

6000

7000

2005

−09

2006

−03

2006

−09

2007

−03

2007

−09

Month

# of

har

vest

ers

Total active harvesters tracked by Project Honey Pot by month

Increase in harvesters in October 2006 not as significant asincrease in number of spam emails

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 13 / 54

Page 24: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Harvestor-to-server degree distribution: May 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 14 / 54

Page 25: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Harvestor-to-server degree distribution: Oct 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 15 / 54

Page 26: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Phishing

Phishing is an attempt to fraudulently acquire sensitive informationby appearing to represent a trustworthy entityProject Honey Pot is an excellent data source for studyingphishing emails

Trap email address cannot, for example, sign up for a PayPalaccountAll emails supposedly received from financial institutions can beclassified as phishing

We classify an email as a phishing email if its subject containscommon phishing words

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 16 / 54

Page 27: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Phishing

Phishing is an attempt to fraudulently acquire sensitive informationby appearing to represent a trustworthy entityProject Honey Pot is an excellent data source for studyingphishing emails

Trap email address cannot, for example, sign up for a PayPalaccountAll emails supposedly received from financial institutions can beclassified as phishing

We classify an email as a phishing email if its subject containscommon phishing words

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 16 / 54

Page 28: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Phishing

Phishing is an attempt to fraudulently acquire sensitive informationby appearing to represent a trustworthy entityProject Honey Pot is an excellent data source for studyingphishing emails

Trap email address cannot, for example, sign up for a PayPalaccountAll emails supposedly received from financial institutions can beclassified as phishing

We classify an email as a phishing email if its subject containscommon phishing words

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 16 / 54

Page 29: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Harvesting and Spamming

Phishing Statistics

Define a phishing level for each harvester as

Phishing level =# of phishing emails sent

total # of emails sentLabel harvesters with phishing level > 0.5 as phishersOctober 2006 statistics

4.5% of emails were phishing emails23% of harvesters were phishers

0 0.2 0.4 0.6 0.8 10

500

1000

1500

2000

Phishing level

# of

har

vest

ers

Histogram of harvesters’ phishing levels from October 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 17 / 54

Page 30: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Introduction Social Networks

Social Networks

Social network: social structure consisting of actors and tiesActors represent individualsTies represent relationships between individuals

School friendships Scientific collaborations

Moody, 2001 Girvan and Newman, 2002

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 18 / 54

Page 31: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology

Outline

1 IntroductionObjectivesHarvesting and SpammingSocial Networks

2 MethodologyCommunity DetectionSimilarity Measures

3 Results

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 19 / 54

Page 32: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology

Graph of Harvester Interactions

Represent network of harvesters by undirected weighted graphG = (V ,E ,W )

V : set of vertices (harvesters)E : set of edges between harvestersW : matrix of edge weights (adjacency matrix of graph)

Edge weights represent strength of connection between twoharvestersTotal weights of edges between two sets of harvesters A,B ⊂ V isdefined by

links(A,B) =∑i∈A

∑j∈B

wij

Degree of a set A is defined by

deg(A) = links(A,V )

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 20 / 54

Page 33: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology

Graph of Harvester Interactions

Represent network of harvesters by undirected weighted graphG = (V ,E ,W )

V : set of vertices (harvesters)E : set of edges between harvestersW : matrix of edge weights (adjacency matrix of graph)

Edge weights represent strength of connection between twoharvestersTotal weights of edges between two sets of harvesters A,B ⊂ V isdefined by

links(A,B) =∑i∈A

∑j∈B

wij

Degree of a set A is defined by

deg(A) = links(A,V )

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 20 / 54

Page 34: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology

Graph of Harvester Interactions

Represent network of harvesters by undirected weighted graphG = (V ,E ,W )

V : set of vertices (harvesters)E : set of edges between harvestersW : matrix of edge weights (adjacency matrix of graph)

Edge weights represent strength of connection between twoharvestersTotal weights of edges between two sets of harvesters A,B ⊂ V isdefined by

links(A,B) =∑i∈A

∑j∈B

wij

Degree of a set A is defined by

deg(A) = links(A,V )

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 20 / 54

Page 35: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Community Detection

Characteristics of a communityHigh similarity between actors within communityLow similarity between actors in different communities

Formulate community detection as a graph partitioning problemDivide the graph into clustersMaximize edge weights within clusters (association)Minimize edge weights between clusters (cut)

Using edge weights normalized by group sizes results in bettergroups

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 21 / 54

Page 36: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Community Detection

Characteristics of a communityHigh similarity between actors within communityLow similarity between actors in different communities

Formulate community detection as a graph partitioning problemDivide the graph into clustersMaximize edge weights within clusters (association)Minimize edge weights between clusters (cut)

Using edge weights normalized by group sizes results in bettergroups

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 21 / 54

Page 37: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Community Detection

Characteristics of a communityHigh similarity between actors within communityLow similarity between actors in different communities

Formulate community detection as a graph partitioning problemDivide the graph into clustersMaximize edge weights within clusters (association)Minimize edge weights between clusters (cut)

Using edge weights normalized by group sizes results in bettergroups

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 21 / 54

Page 38: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Normalized Cut and Association

Normalized cut of a graph partition ΓKV is defined as

KNcut(ΓKV ) =

1K

K∑i=1

links(Vi ,V\Vi)

deg(Vi)

Normalized association of ΓKV is defined as

KNassoc(ΓKV ) =

1K

K∑i=1

links(Vi ,Vi)

deg(Vi)

KNcut(ΓKV ) + KNassoc(ΓK

V ) = 1 so minimizing normalized cutsimultaneously maximizes normalized associationWe try to maximize normalized association

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 22 / 54

Page 39: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Normalized Cut and Association

Normalized cut of a graph partition ΓKV is defined as

KNcut(ΓKV ) =

1K

K∑i=1

links(Vi ,V\Vi)

deg(Vi)

Normalized association of ΓKV is defined as

KNassoc(ΓKV ) =

1K

K∑i=1

links(Vi ,Vi)

deg(Vi)

KNcut(ΓKV ) + KNassoc(ΓK

V ) = 1 so minimizing normalized cutsimultaneously maximizes normalized associationWe try to maximize normalized association

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 22 / 54

Page 40: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Normalized Cut and Association

Normalized cut of a graph partition ΓKV is defined as

KNcut(ΓKV ) =

1K

K∑i=1

links(Vi ,V\Vi)

deg(Vi)

Normalized association of ΓKV is defined as

KNassoc(ΓKV ) =

1K

K∑i=1

links(Vi ,Vi)

deg(Vi)

KNcut(ΓKV ) + KNassoc(ΓK

V ) = 1 so minimizing normalized cutsimultaneously maximizes normalized associationWe try to maximize normalized association

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 22 / 54

Page 41: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

The Discrete Optimization Problem

Represent graph partition ΓKV by matrix X = [x1,x2, . . . ,xK]

xi: column indicator vector with ones in the rows corresponding toharvesters in cluster iDegree matrix D = diag(W1M)Rewrite links and deg as

links(Vi ,Vi) = xiT Wxi

deg(Vi) = xiT Dxi

KNassoc maximization problem becomes

maximize KNassoc(X ) =1K

K∑i=1

xiT Wxi

xiT Dxi

subject to X ∈ {0,1}M×K

X1K = 1M

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 23 / 54

Page 42: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

The Discrete Optimization Problem

Represent graph partition ΓKV by matrix X = [x1,x2, . . . ,xK]

xi: column indicator vector with ones in the rows corresponding toharvesters in cluster iDegree matrix D = diag(W1M)Rewrite links and deg as

links(Vi ,Vi) = xiT Wxi

deg(Vi) = xiT Dxi

KNassoc maximization problem becomes

maximize KNassoc(X ) =1K

K∑i=1

xiT Wxi

xiT Dxi

subject to X ∈ {0,1}M×K

X1K = 1M

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 23 / 54

Page 43: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

The Discrete Optimization Problem

Represent graph partition ΓKV by matrix X = [x1,x2, . . . ,xK]

xi: column indicator vector with ones in the rows corresponding toharvesters in cluster iDegree matrix D = diag(W1M)Rewrite links and deg as

links(Vi ,Vi) = xiT Wxi

deg(Vi) = xiT Dxi

KNassoc maximization problem becomes

maximize KNassoc(X ) =1K

K∑i=1

xiT Wxi

xiT Dxi

subject to X ∈ {0,1}M×K

X1K = 1M

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 23 / 54

Page 44: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

The Discrete Optimization Problem

Represent graph partition ΓKV by matrix X = [x1,x2, . . . ,xK]

xi: column indicator vector with ones in the rows corresponding toharvesters in cluster iDegree matrix D = diag(W1M)Rewrite links and deg as

links(Vi ,Vi) = xiT Wxi

deg(Vi) = xiT Dxi

KNassoc maximization problem becomes

maximize KNassoc(X ) =1K

K∑i=1

xiT Wxi

xiT Dxi

subject to X ∈ {0,1}M×K

X1K = 1M

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 23 / 54

Page 45: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 46: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 47: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 48: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 49: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 50: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Spectral Clustering

KNassoc maximization problem has exponential complexity evenfor K = 2Define Z = X (X T DX )−1/2

Reformulate problem

maximize KNassoc(Z ) =1K

tr(Z T WZ )

subject to Z T DZ = IK

Relax Z into continuous domainSolve generalized eigenvalue problem

W zi = λD zi

Form optimal continuous partition matrix Z = [ z1, z2, . . . , zK ] anddiscretize to get near global-optimal solution (Yu and Shi, 2003)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 24 / 54

Page 51: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Community Detection

Choosing the Number of Clusters

How do we choose the number of clusters?Heuristic for spectral clustering: look at the gap betweeneigenvalues of the Laplacian matrix of the graph (von Luxburg,2007)

0 2 4 6 8 100

0.02

0.04

0.06

0.08

Gap between 7thand 8th eigenvalues

Ten smallest eigenvalues of Laplacian matrix

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 25 / 54

Page 52: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 53: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 54: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 55: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 56: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 57: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Choosing Edge Weights

Edge weights wij represent strength of connection betweenharvesters i and jWe cannot observe direct relationships between harvestersUse indirect relationships to determine edge weights

Similarity in spam server usageSimilarity in temporal spammingSimilarity in temporal harvesting

Choice of similarity measure determines topology of the graphPoor choice could lead to detecting no community structureCreate coincidence matrix H as intermediate step to creatingadjacency matrix W

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 26 / 54

Page 58: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Similarity in Spam Server Usage

Spammers need spam servers to send emailsCommon usage of spam servers between harvesters may indicatesocial connectionCreate bipartite graph of harvesters and spam serversChoose edge weights based on correlation in spam server usage

Harvester A Harvester B Harvester C

Spam Server 1

Spam Server 2

Spam Server 3

Spam Server 4

Spam Server 5

Spam Server 6

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 27 / 54

Page 59: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Similarity in Spam Server Usage

Spammers need spam servers to send emailsCommon usage of spam servers between harvesters may indicatesocial connectionCreate bipartite graph of harvesters and spam serversChoose edge weights based on correlation in spam server usage

Harvester A Harvester B Harvester C

Spam Server 1

Spam Server 2

Spam Server 3

Spam Server 4

Spam Server 5

Spam Server 6

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 27 / 54

Page 60: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Similarity in Spam Server Usage

Spammers need spam servers to send emailsCommon usage of spam servers between harvesters may indicatesocial connectionCreate bipartite graph of harvesters and spam serversChoose edge weights based on correlation in spam server usage

Harvester A Harvester B Harvester C

Spam Server 1

Spam Server 2

Spam Server 3

Spam Server 4

Spam Server 5

Spam Server 6

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 27 / 54

Page 61: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Similarity in Spam Server Usage

Spammers need spam servers to send emailsCommon usage of spam servers between harvesters may indicatesocial connectionCreate bipartite graph of harvesters and spam serversChoose edge weights based on correlation in spam server usage

Harvester A Harvester B Harvester C

Spam Server 1

Spam Server 2

Spam Server 3

Spam Server 4

Spam Server 5

Spam Server 6

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 27 / 54

Page 62: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Similarity in Spam Server Usage Coincidence Matrix

Create coincidence matrix H between harvesters and spamservers

H =

[pij

djei

]M,N

i,j=1

pij : the number of emails sent using spam server j to emailaddresses collected by harvester idj : the total number of emails sent by spam server jei : the total number of email addresses collected by harvester iEntries of incidence matrix represent harvester i ’s percentage ofusage of spam server j per address he has acquired

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 28 / 54

Page 63: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Temporal Similarity

Common temporal patterns of activity may also indicate socialconnectionLook at number of emails sent or email addresses collected asfunction of timeDiscretize time into 1-hour intervals

0 100 200 300 400 500 600 700 8000

50

100

150

0 100 200 300 400 500 600 700 8000

50

100

150

Time (in 1−hour bins)

# of

em

ails

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 29 / 54

Page 64: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Sample Temporal Histograms

0 400 8000

5

10

Time (in 1−hour bins)

# of

em

ails

0 400 8000

50

100

150

0 400 8000

10

20

30

0 400 8000

5

10

Sample temporal spamming histograms representing four types ofdistributions

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 30 / 54

Page 65: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Temporal Similarity Coincidence Matrices

Choose edge weights based on correlation in number of emailssent during each time intervalSimilarity in temporal spamming

Create coincidence matrix H between harvesters and discretizedtime intervals

H =

[sij

ei

]M,N

i,j=1

sij : number of emails sent by harvester i during j th time intervalei : the total number of email addresses collected by harvester i

Similarity in temporal harvestingCreate coincidence matrix H between harvesters and discretizedtime intervals

H = [aij ]M,Ni,j=1

aij : number of addresses collected by harvester i during j th timeinterval

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 31 / 54

Page 66: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Temporal Similarity Coincidence Matrices

Choose edge weights based on correlation in number of emailssent during each time intervalSimilarity in temporal spamming

Create coincidence matrix H between harvesters and discretizedtime intervals

H =

[sij

ei

]M,N

i,j=1

sij : number of emails sent by harvester i during j th time intervalei : the total number of email addresses collected by harvester i

Similarity in temporal harvestingCreate coincidence matrix H between harvesters and discretizedtime intervals

H = [aij ]M,Ni,j=1

aij : number of addresses collected by harvester i during j th timeinterval

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 31 / 54

Page 67: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Temporal Similarity Coincidence Matrices

Choose edge weights based on correlation in number of emailssent during each time intervalSimilarity in temporal spamming

Create coincidence matrix H between harvesters and discretizedtime intervals

H =

[sij

ei

]M,N

i,j=1

sij : number of emails sent by harvester i during j th time intervalei : the total number of email addresses collected by harvester i

Similarity in temporal harvestingCreate coincidence matrix H between harvesters and discretizedtime intervals

H = [aij ]M,Ni,j=1

aij : number of addresses collected by harvester i during j th timeinterval

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 31 / 54

Page 68: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Creating the Adjacency Matrix

From coincidence matrix H we can obtain a matrix ofunnormalized pairwise similarities S = HHT

Normalize S to obtain matrix of normalized pairwise similaritiesN = D−1/2SD−1/2

D = diag(S)Scales similarities so each harvester’s self-similarity is 1Ensures each harvester is equally important

Connect harvesters to their k nearest neighbors according tosimilarities in N to form adjacency matrix W

Results in sparser adjacency matrixHow to choose k?Heuristic: Choose k = log n to start and increase as necessary toavoid artificially disconnecting components (von Luxburg, 2007)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 32 / 54

Page 69: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Creating the Adjacency Matrix

From coincidence matrix H we can obtain a matrix ofunnormalized pairwise similarities S = HHT

Normalize S to obtain matrix of normalized pairwise similaritiesN = D−1/2SD−1/2

D = diag(S)Scales similarities so each harvester’s self-similarity is 1Ensures each harvester is equally important

Connect harvesters to their k nearest neighbors according tosimilarities in N to form adjacency matrix W

Results in sparser adjacency matrixHow to choose k?Heuristic: Choose k = log n to start and increase as necessary toavoid artificially disconnecting components (von Luxburg, 2007)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 32 / 54

Page 70: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Methodology Similarity Measures

Creating the Adjacency Matrix

From coincidence matrix H we can obtain a matrix ofunnormalized pairwise similarities S = HHT

Normalize S to obtain matrix of normalized pairwise similaritiesN = D−1/2SD−1/2

D = diag(S)Scales similarities so each harvester’s self-similarity is 1Ensures each harvester is equally important

Connect harvesters to their k nearest neighbors according tosimilarities in N to form adjacency matrix W

Results in sparser adjacency matrixHow to choose k?Heuristic: Choose k = log n to start and increase as necessary toavoid artificially disconnecting components (von Luxburg, 2007)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 32 / 54

Page 71: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Outline

1 IntroductionObjectivesHarvesting and SpammingSocial Networks

2 MethodologyCommunity DetectionSimilarity Measures

3 Results

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 33 / 54

Page 72: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Similarity in Spam Server Usage

Results from October 2006 using similarity in spam server usage(visualization created using Cytoscape)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 34 / 54

Page 73: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Alternate View Colored By Phishing Level

Results from October 2006 colored by phishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 35 / 54

Page 74: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Distribution of Phishers in Clusters

Distribution of phishers in clusters from October 2006 results

Label 1 2 3 4 5 6 7 8

Cluster size 1040 188 77 68 68 35 29 26

# of phishers 17 0 10 0 65 28 24 0

% of phishers 1.63 0 13.0 0 95.6 80 82.8 0

Label 9 10 11 12 13 14 15 16

Cluster size 19 16 14 14 14 11 11 11

# of phishers 18 16 13 1 12 9 0 11

% of phishers 94.7 100 92.9 7.14 85.7 81.8 0 100

Very few phishers in large, loosely-connected clusterMany small, tightly-connected clusters have high concentration ofphishers

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 36 / 54

Page 75: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 76: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 77: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 78: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 79: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 80: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Cluster Validation Indices

Rand index: measure of agreement between clustering resultsand labels

Rand index =a + d

a + b + c + d

a: number of pairs of nodes with same label and in same clusterb: number of pairs with same label but in different clustersc: number of pairs with different labels but in the same clusterd : number of pairs with different labels and in different clusters

Adjusted Rand index: Rand index corrected for chance (Hubertand Arabie, 1985)Expected adjusted Rand index for random clustering result is 0Label clusters as phishing clusters if ratio of phishers toharvesters > 0.5Use phisher or non-phisher as label for each harvesterLook for agreement between harvester labels and cluster labels

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 37 / 54

Page 81: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Validation Indices For Similarity in Spam Server Usage

Validation indices for similarity in spam server usage results

Year 2006 2007

Month July October January April July

Rand index 0.884 0.936 0.923 0.937 0.880

Adj. Rand index 0.759 0.847 0.803 0.802 0.618

Very high Rand and adjusted Rand indices indicates goodagreement between labels and clustering resultsResults highly unlikely to be caused by chance

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 38 / 54

Page 82: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Top Subject Lines in Phishing Clusters

Cluster 5

Subject line Hits

Password Change Required 126

Question from eBay Member 69

Credit Union Online R© $50 Reward Survey 47

PayPal Account 42

PayPal Account - Suspicious Activity 40

Cluster 9

Subject line Hits

Notification from Billing Department 49

IMPORTANT: Notification of limited accounts 25

PayPal Account Review Department 22

Notification of Limited Account Access 13

A secondary e-mail address has been added to your 11

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 39 / 54

Page 83: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Top Subject Lines in Non-Phishing Clusters

Cluster 2

Subject line Hits

tthemee 6893

St ock 6 6729

Notification 4516

Access granted to send emails to 4495

Thanks for joining 4405

Cluster 4

Subject line Hits

Make Money by Sharing Your Life with Friends and F 1027

Premiere Professional & Executive Registries Invit 750

Texas Land/Golf is the Buzz 459

Keys to Stock Market Success 408

An Entire Case of Fine Wine plus Exclusive Gift fo 367

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 40 / 54

Page 84: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Venn Diagram of Phishers’ Life Times

Venn diagram of phishers’ life times as percentage of total (1805 totalphishers)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 41 / 54

Page 85: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Venn Diagram of Non-Phishers’ Life Times

Venn diagram of non-phishers’ life times as percentage of total (4801total non-phishers)

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 42 / 54

Page 86: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Similarity in Spam Server Usage

Clustering divides spammers into communities of mostly phishersand mostly non-phishersEmpirical evidence that phishers tend to form small groups andshare resourcesPhishers have shorter life times than non-phishersDiscovered community structure is highly unlikely by chance

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 43 / 54

Page 87: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Similarity in Spam Server Usage

Clustering divides spammers into communities of mostly phishersand mostly non-phishersEmpirical evidence that phishers tend to form small groups andshare resourcesPhishers have shorter life times than non-phishersDiscovered community structure is highly unlikely by chance

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 43 / 54

Page 88: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Similarity in Spam Server Usage

Clustering divides spammers into communities of mostly phishersand mostly non-phishersEmpirical evidence that phishers tend to form small groups andshare resourcesPhishers have shorter life times than non-phishersDiscovered community structure is highly unlikely by chance

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 43 / 54

Page 89: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Similarity in Spam Server Usage

Clustering divides spammers into communities of mostly phishersand mostly non-phishersEmpirical evidence that phishers tend to form small groups andshare resourcesPhishers have shorter life times than non-phishersDiscovered community structure is highly unlikely by chance

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 43 / 54

Page 90: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Similarity in Temporal Spamming

Results from October 2006 using similarity in temporal spamming

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 44 / 54

Page 91: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Temporal Spamming Histograms

0 200 400 600 8000

200

400

Time (in 1−hour bins)

# of

em

ails

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

0 200 400 600 8000

200

400

Temporal spamming histograms of ten harvesters from same cluster

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 45 / 54

Page 92: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Spamming

Average temporal spamming correlation coefficients between twoharvesters in the aforementioned group

Year 2006 2007

Month July October January April July

ρavg 0.979 0.988 0.933 0.950 0.937

IP addresses of all harvesters in this group have 208.66.195/24prefixThese harvesters are among the heaviest spammers in eachmonthWe discovered several other groups with coherent temporalbehavior and similar IP addresses

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 46 / 54

Page 93: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Spamming

Average temporal spamming correlation coefficients between twoharvesters in the aforementioned group

Year 2006 2007

Month July October January April July

ρavg 0.979 0.988 0.933 0.950 0.937

IP addresses of all harvesters in this group have 208.66.195/24prefixThese harvesters are among the heaviest spammers in eachmonthWe discovered several other groups with coherent temporalbehavior and similar IP addresses

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 46 / 54

Page 94: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Spamming

Average temporal spamming correlation coefficients between twoharvesters in the aforementioned group

Year 2006 2007

Month July October January April July

ρavg 0.979 0.988 0.933 0.950 0.937

IP addresses of all harvesters in this group have 208.66.195/24prefixThese harvesters are among the heaviest spammers in eachmonthWe discovered several other groups with coherent temporalbehavior and similar IP addresses

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 46 / 54

Page 95: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Spamming

Average temporal spamming correlation coefficients between twoharvesters in the aforementioned group

Year 2006 2007

Month July October January April July

ρavg 0.979 0.988 0.933 0.950 0.937

IP addresses of all harvesters in this group have 208.66.195/24prefixThese harvesters are among the heaviest spammers in eachmonthWe discovered several other groups with coherent temporalbehavior and similar IP addresses

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 46 / 54

Page 96: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Similarity in Temporal Harvesting

Results from July 2006 using similarity in temporal harvesting

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 47 / 54

Page 97: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Temporal Harvesting Histograms

0 200 400 600 80005

10

Time (in 1−hour bins)

# of

em

ail a

ddre

sses

col

lect

ed

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

0 200 400 600 80005

10

Temporal spamming histograms of 208.66.195/24 group of harvesters

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 48 / 54

Page 98: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Harvesting

Average temporal harvesting correlation coefficients between twoharvesters in the 208.66.195/24 group

Year 2006

Month May June July August September

ρavg 0.579 0.645 0.661 0.533 0.635

Correlation is not as high as with temporal spammingLower correlation is expected due to randomness of addressacquisition timesResults still indicate high behavioral correlationAll harvesting was done between May and September 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 49 / 54

Page 99: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Harvesting

Average temporal harvesting correlation coefficients between twoharvesters in the 208.66.195/24 group

Year 2006

Month May June July August September

ρavg 0.579 0.645 0.661 0.533 0.635

Correlation is not as high as with temporal spammingLower correlation is expected due to randomness of addressacquisition timesResults still indicate high behavioral correlationAll harvesting was done between May and September 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 49 / 54

Page 100: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Harvesting

Average temporal harvesting correlation coefficients between twoharvesters in the 208.66.195/24 group

Year 2006

Month May June July August September

ρavg 0.579 0.645 0.661 0.533 0.635

Correlation is not as high as with temporal spammingLower correlation is expected due to randomness of addressacquisition timesResults still indicate high behavioral correlationAll harvesting was done between May and September 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 49 / 54

Page 101: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Harvesting

Average temporal harvesting correlation coefficients between twoharvesters in the 208.66.195/24 group

Year 2006

Month May June July August September

ρavg 0.579 0.645 0.661 0.533 0.635

Correlation is not as high as with temporal spammingLower correlation is expected due to randomness of addressacquisition timesResults still indicate high behavioral correlationAll harvesting was done between May and September 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 49 / 54

Page 102: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Statistics from Similarity in Temporal Harvesting

Average temporal harvesting correlation coefficients between twoharvesters in the 208.66.195/24 group

Year 2006

Month May June July August September

ρavg 0.579 0.645 0.661 0.533 0.635

Correlation is not as high as with temporal spammingLower correlation is expected due to randomness of addressacquisition timesResults still indicate high behavioral correlationAll harvesting was done between May and September 2006

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 49 / 54

Page 103: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Temporal Similarity

We discover several groups with coherent temporal behavior andsimilar IP addressesIn particular, a group of ten heavy spammers with 208.66.195/24IP address prefix

Indicates that these computers are very close geographicallyEither the same spammer or a group of spammers in same physicallocation

Highly likely that these groups are coordinated

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 50 / 54

Page 104: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Temporal Similarity

We discover several groups with coherent temporal behavior andsimilar IP addressesIn particular, a group of ten heavy spammers with 208.66.195/24IP address prefix

Indicates that these computers are very close geographicallyEither the same spammer or a group of spammers in same physicallocation

Highly likely that these groups are coordinated

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 50 / 54

Page 105: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Temporal Similarity

We discover several groups with coherent temporal behavior andsimilar IP addressesIn particular, a group of ten heavy spammers with 208.66.195/24IP address prefix

Indicates that these computers are very close geographicallyEither the same spammer or a group of spammers in same physicallocation

Highly likely that these groups are coordinated

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 50 / 54

Page 106: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Temporal Similarity

We discover several groups with coherent temporal behavior andsimilar IP addressesIn particular, a group of ten heavy spammers with 208.66.195/24IP address prefix

Indicates that these computers are very close geographicallyEither the same spammer or a group of spammers in same physicallocation

Highly likely that these groups are coordinated

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 50 / 54

Page 107: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Findings from Temporal Similarity

We discover several groups with coherent temporal behavior andsimilar IP addressesIn particular, a group of ten heavy spammers with 208.66.195/24IP address prefix

Indicates that these computers are very close geographicallyEither the same spammer or a group of spammers in same physicallocation

Highly likely that these groups are coordinated

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 50 / 54

Page 108: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

Border Gateway Protocol

Border Gateway Protocol (BGP) is the core routing protocol at thehighest level in the InternetRouters on edge of autonomous systems (ASes) send updatesbetween themselves about connectivity within their AS

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 51 / 54

Page 109: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

BGP Life Span

BGP life span of a spam server is roughly the amount of time it isconnected to the rest of the InternetIt has been observed that some spam servers have short BGP lifespans, perhaps to remain untraceable (Ramachandran andFeamster, 2006)Are harvesters which use short-lived spam servers tightlyconnected?

Few spam servers have short BGP life spansNo significant correlation found between BGP life span andphishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 52 / 54

Page 110: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

BGP Life Span

BGP life span of a spam server is roughly the amount of time it isconnected to the rest of the InternetIt has been observed that some spam servers have short BGP lifespans, perhaps to remain untraceable (Ramachandran andFeamster, 2006)Are harvesters which use short-lived spam servers tightlyconnected?

Few spam servers have short BGP life spansNo significant correlation found between BGP life span andphishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 52 / 54

Page 111: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

BGP Life Span

BGP life span of a spam server is roughly the amount of time it isconnected to the rest of the InternetIt has been observed that some spam servers have short BGP lifespans, perhaps to remain untraceable (Ramachandran andFeamster, 2006)Are harvesters which use short-lived spam servers tightlyconnected?

Few spam servers have short BGP life spansNo significant correlation found between BGP life span andphishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 52 / 54

Page 112: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

BGP Life Span

BGP life span of a spam server is roughly the amount of time it isconnected to the rest of the InternetIt has been observed that some spam servers have short BGP lifespans, perhaps to remain untraceable (Ramachandran andFeamster, 2006)Are harvesters which use short-lived spam servers tightlyconnected?

Few spam servers have short BGP life spansNo significant correlation found between BGP life span andphishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 52 / 54

Page 113: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Results

BGP Life Span

BGP life span of a spam server is roughly the amount of time it isconnected to the rest of the InternetIt has been observed that some spam servers have short BGP lifespans, perhaps to remain untraceable (Ramachandran andFeamster, 2006)Are harvesters which use short-lived spam servers tightlyconnected?

Few spam servers have short BGP life spansNo significant correlation found between BGP life span andphishing level

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 52 / 54

Page 114: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 115: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 116: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 117: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 118: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 119: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 120: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

Summary

Summary

Clustering using similarity in spam server usage revealscommunities of phishers and non-phishers

Phishing is a phenotype: most harvestors are either phishers ornon-phishersPhishers form gangs: they share resources in isolated closely knitcommunities.

Clustering using temporal similarity reveals coordinated groups ofharvestersSpammer social network patterns might be used for detection andinterdiction

Future workStatistical latent variable models for spammer community discoveryClustering based on combinations of similarity measuresEvolutionary models for community behavior

The dual problem: discovery of spam server communities.

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 53 / 54

Page 121: Social Networks of Spammershero/Preprints/Social...Total emails received at Project Honey Pot trap addresses by month Outbreak of spam observed in October 2006 consistent with media

References

References

1 M. Girvan and M. E. J. Newman, “Community Structure in Socialand Biological Networks,” National Academy of Sciences (2002).

2 L. Hubert and P. Arabie, “Comparing Partitions,” Journal ofClassification (1985).

3 J. Moody, “Race, School Integration, and Friendship Segmentationin America,” American Journal of Sociology (2001).

4 M. Prince et al., “Understanding How Spammers Steal Your E-MailAddress: An Analysis of the First Six Months of Data from ProjectHoney Pot,” 2nd Conference on Email and Anti-Spam (2005).

5 U. von Luxburg, “A Tutorial on Spectral Clustering,” Statistics andComputing, (2007).

6 S. Yu and J. Shi, “Multiclass Spectral Clustering,” 9th IEEEInternational Conference on Computer Vision (2003).

A. Hero (University of Michigan) Social Networks of Spammers September 22, 2008 54 / 54