Network-Level Spam Defenses
![Page 1: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/1.jpg)
Network-Level Spam Defenses
Nick Feamster, Georgia Tech
with Anirudh Ramachandran, Shuang Hao, Alex Gray, Santosh Vempala
![Page 2: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/2.jpg)
2
Spam: More than Just a Nuisance
• 95% of all email traffic
– Image and PDF spam (PDF spam ~12%)
• As of August 2007, one in every 87 emails constituted a phishing attack
• Targeted attacks on the rise
– 20k-30k unique phishing attacks per month
Source: CNET (January 2008), APWG
![Page 3: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/3.jpg)
3
Approach: Filter
• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham
• Question: What features best differentiate spam from legitimate mail?
– Content-based filtering: What is in the mail?
– IP address of sender: Who is the sender?
– Behavioral features: How is the mail sent?
![Page 4: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/4.jpg)
Conventional: Content Filters
• Trying to hit a moving target...
PDFs, Excel sheets, images ...and even mp3s!
![Page 5: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/5.jpg)
5
Problems with Content Filtering
• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.
• Low cost of evasion: Spammers can easily adjust and change features of an email’s content
• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated
![Page 6: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/6.jpg)
6
Another Approach: IP Addresses
• Problem: IP addresses are ephemeral
• Every day, 10% of senders are from previously unseen IP addresses
• Possible causes
– Dynamic addressing
– New infections
![Page 7: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/7.jpg)
7
Our Idea: Network-Based Filtering
• Filter email based on how it is sent, in addition to simply what is sent.
• Network-level properties are less malleable
– Network/geographic location of sender and receiver
– Set of target recipients
– Hosting or upstream ISP (AS number)
– Membership in a botnet (spammer, hosting infrastructure)
![Page 8: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/8.jpg)
8
Why Network-Level Features?
• Lightweight: Don’t require inspecting details of packet streams
– Can be done at high speeds
– Can be done in the middle of the network
• Robust: Perhaps more difficult to change some network-level features than message contents
![Page 9: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/9.jpg)
9
Challenges (Talk Outline)
• Understanding network-level behavior
– What network-level behaviors do spammers have?
– How well do existing techniques work?
• Building classifiers using network-level features
– Key challenge: Which features to use?
– Two algorithms: SNARE and SpamTracker
• Building the system
– Dynamism: Behavior itself can change
– Scale: Lots of email messages (and spam!) out there
![Page 10: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/10.jpg)
10
Some Questions of Study
• Where (in IP space, in geography) does spam originate from?
• What OSes are used to send spam?
• What techniques are used to send spam?
![Page 11: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/11.jpg)
11
Data: Spam and BGP
• Spam Traps: Domains that receive only spam
• BGP Monitors: Watch network-level reachability
Domain 1
Domain 2
17-Month Study: August 2004 to December 2005
![Page 12: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/12.jpg)
12
Data Collection: MailAvenger
• Configurable SMTP server
• Collects many useful statistics
![Page 13: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/13.jpg)
13
Finding: BGP “Spectrum Agility”
• Hijack IP address space using BGP
• Send spam
• Withdraw IP address
A small club of persistent players appears to be using this technique.
Common short-lived prefixes and ASes:
61.0.0.0/8 AS 4678
66.0.0.0/8 AS 21562
82.0.0.0/8 AS 8717
~ 10 minutes
Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
![Page 14: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/14.jpg)
14
Spectrum Agility: Big Prefixes?
• Flexibility: Client IPs can be scattered throughout dark space within a large /8
– Same sender usually returns with different IP addresses
• Visibility: Route typically won’t be filtered (nice and short)
![Page 15: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/15.jpg)
15
Other Findings
• Top senders: Korea, China, Japan
– Still about 40% of spam coming from U.S.
• More than half of sender IP addresses appear less than twice
• ~90% of spam sent to traps from Windows
![Page 16: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/16.jpg)
16
How Well do IP Blacklists Work?
• Completeness: The fraction of spamming IP addresses that are listed in the blacklist
• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
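The two metrics above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the sample IP addresses, and the timestamps are hypothetical and do not come from the study's data.

```python
from datetime import datetime, timedelta

def completeness(spam_ips, listing_times):
    """Fraction of spamming IPs that appear in the blacklist at all."""
    if not spam_ips:
        return 0.0
    listed = sum(1 for ip in spam_ips if ip in listing_times)
    return listed / len(spam_ips)

def responsiveness(first_spam_times, listing_times):
    """Per-IP delay between the first observed spam and the blacklist listing."""
    delays = {}
    for ip, first_seen in first_spam_times.items():
        listed_at = listing_times.get(ip)
        if listed_at is not None:
            # Clamp at zero: the IP may have been listed before we saw spam.
            delays[ip] = max(listed_at - first_seen, timedelta(0))
    return delays

# Hypothetical sample data (documentation-range addresses).
first_spam = {
    "192.0.2.1": datetime(2007, 3, 1, 12, 0),
    "192.0.2.2": datetime(2007, 3, 2, 8, 0),
    "192.0.2.3": datetime(2007, 3, 3, 9, 30),
    "192.0.2.4": datetime(2007, 3, 4, 0, 0),
}
listed = {
    "192.0.2.1": datetime(2007, 3, 1, 18, 0),
    "192.0.2.2": datetime(2007, 3, 1, 8, 0),   # listed before its first spam
    "192.0.2.3": datetime(2007, 3, 10, 9, 30),
}

print(completeness(list(first_spam), listed))
print(responsiveness(first_spam, listed))
```

IPs that never get listed count against completeness and simply have no responsiveness entry.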
![Page 17: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/17.jpg)
17
Completeness and Responsiveness
• 10-35% of spam is unlisted at the time of receipt
• 8.5-20% of these IP addresses remain unlisted even after one month
Data: Trap data from March 2007, Spamhaus from March and April 2007
![Page 18: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/18.jpg)
18
What’s Wrong with IP Blacklists?
• Based on ephemeral identifier (IP address)
– More than 10% of all spam comes from IP addresses not seen within the past two months
  • Dynamic renumbering of IP addresses
  • Stealing of IP addresses and IP address space
  • Compromised machines
• IP addresses of senders have considerable churn
• Often require a human to notice/validate the behavior
– Spamming is compartmentalized by domain and not analyzed across domains
![Page 19: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/19.jpg)
19
Are There Other Approaches?
• Option 1: Stronger sender identity [AIP, Pedigree]
– Stronger sender identity/authentication may make reputation systems more effective
– May require changes to hosts, routers, etc.
• Option 2: Behavior-based filtering [SNARE, SpamTracker]
– Can be done on today’s network
– Identifying features may be tricky, and some may require network-wide monitoring capabilities
![Page 20: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/20.jpg)
20
Outline
• Understanding the network-level behavior
– What behaviors do spammers have?
– How well do existing techniques work?
• Classifiers using network-level features
– Key challenge: Which features to use?
– Two algorithms: SNARE and SpamTracker
• The System: SpamSpotter
– Dynamism: Behavior itself can change
– Scale: Lots of email messages (and spam!) out there
![Page 21: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/21.jpg)
21
Finding the Right Features
• Goal: Sender reputation from a single packet?
– Low overhead
– Fast classification
– In-network
– Perhaps more evasion resistant
• Key challenge
– What features satisfy these properties and can distinguish spammers from legitimate senders?
![Page 22: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/22.jpg)
22
Set of Network-Level Features
• Single-Packet
– AS of sender’s IP
– Distance to k nearest senders
– Status of email service ports
– Geodesic distance
– Time of day
• Single-Message
– Number of recipients
– Length of message
• Aggregate (Multiple Message/Recipient)
![Page 23: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/23.jpg)
23
Sender-Receiver Geodesic Distance
90% of legitimate messages travel 2,200 miles or less
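A minimal sketch of how a sender-receiver geodesic distance could be computed, assuming both endpoints have already been geolocated to latitude/longitude (the slides do not say how geolocation is done). It uses the standard haversine great-circle formula. The function name and the Earth-radius constant are choices made for this illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def geodesic_distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))

# A quarter of the way around the equator.
print(geodesic_distance_miles(0.0, 0.0, 0.0, 90.0))
```

A classifier could then use the distance itself, or a threshold such as the 2,200-mile figure above, as one feature among several.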
![Page 24: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/24.jpg)
24
Density of Senders in IP Space
For spammers, k nearest senders are much closer in IP space
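One way to turn "distance to k nearest senders" into a number is to treat each IPv4 address as a 32-bit integer and average the numeric gaps to the k closest other senders. This is a sketch under that assumption. The exact distance definition used by SNARE is not given on the slide, and the function name is hypothetical.

```python
import ipaddress

def avg_distance_to_k_nearest(sender_ip, other_senders, k):
    """Mean numeric distance from sender_ip to its k nearest senders in IPv4 space."""
    if k <= 0:
        return None
    target = int(ipaddress.IPv4Address(sender_ip))
    distances = sorted(
        abs(target - int(ipaddress.IPv4Address(ip)))
        for ip in other_senders
        if ip != sender_ip
    )
    if not distances:
        return None
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical senders observed in the same time window.
seen = ["10.0.0.1", "10.0.0.7", "10.0.1.5", "192.0.2.1"]
print(avg_distance_to_k_nearest("10.0.0.5", seen, k=2))
```

A small value means the sender sits in a densely populated region of IP space, which is the pattern the slide associates with spammers.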
![Page 25: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/25.jpg)
25
Local Time of Day at Sender
Spammers “peak” at different local times of day
![Page 26: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/26.jpg)
26
Combining Features: RuleFit
• Put features into the RuleFit classifier
• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider
• Comparable performance to SpamHaus
– Incorporating into the system can further reduce FPs
• Using only network-level features
• Completely automated
![Page 27: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/27.jpg)
27
SNARE: Putting it Together
• Email arrival
• Whitelisting
– Top 10 ASes responsible for 43% of misclassified IP addresses
• Greylisting
• Retraining
![Page 28: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/28.jpg)
28
Benefits of Whitelisting
Whitelisting top 50 ASes: False positives reduced to 0.14%
![Page 29: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/29.jpg)
29
Outline
• Understanding the network-level behavior
– What behaviors do spammers have?
– How well do existing techniques work?
• Classifiers using network-level features
– Key challenge: Which features to use?
– Algorithms: SNARE and SpamTracker
• Building the system (SpamSpotter)
– Dynamism: Behavior itself can change
– Scale: Lots of email messages (and spam!) out there
![Page 30: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/30.jpg)
30
SpamTracker
• Idea: Blacklist sending behavior (“Behavioral Blacklisting”)
– Identify sending patterns commonly used by spammers
• Intuition: Much more difficult for a spammer to change the technique by which mail is sent than it is to change the content
![Page 31: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/31.jpg)
31
SpamTracker: Clustering Approach
• Construct a behavioral fingerprint for each sender
• Cluster senders with similar fingerprints
• Filter new senders that map to existing clusters
![Page 32: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/32.jpg)
32
SpamTracker: Identify Invariant
[Figure: A known spammer at IP address 76.17.114.xxx sends spam to domain1.com, domain2.com, and domain3.com. After a DHCP reassignment or a new infection, an unknown sender at IP address 24.99.146.xxx sends spam to the same three domains. Clustering on sending behavior shows that the two behavioral fingerprints are similar.]
![Page 33: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/33.jpg)
33
Building the Classifier: Clustering
• Feature: Distribution of email sending volumes across recipient domains
• Clustering Approach
– Build initial seed list of bad IP addresses
– For each IP address, compute feature vector: volume per domain per time interval
– Collapse into a single IP x domain matrix:
– Compute clusters
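The feature-vector and collapse steps can be sketched as follows. This is an illustration, not the authors' implementation: it assumes the log is a list of (time interval, sender IP, recipient domain) tuples with one tuple per message, and that "collapse" means summing volumes over time intervals. The clustering step itself (spectral clustering, per the summary slide) is not shown.

```python
from collections import Counter

def volume_tensor(records):
    """Count messages per (sender IP, recipient domain, time interval)."""
    return Counter((ip, domain, interval) for interval, ip, domain in records)

def collapse_over_time(tensor):
    """Sum out the time axis to obtain a single IP x domain volume matrix."""
    ips = sorted({ip for ip, _, _ in tensor})
    domains = sorted({domain for _, domain, _ in tensor})
    row = {ip: i for i, ip in enumerate(ips)}
    col = {domain: j for j, domain in enumerate(domains)}
    matrix = [[0] * len(domains) for _ in ips]
    for (ip, domain, _interval), count in tensor.items():
        matrix[row[ip]][col[domain]] += count
    return ips, domains, matrix

# Hypothetical log: (time interval, sender IP, recipient domain).
log = [
    (0, "192.0.2.1", "domain1.com"),
    (0, "192.0.2.1", "domain1.com"),
    (1, "192.0.2.1", "domain2.com"),
    (1, "192.0.2.2", "domain1.com"),
]
ips, domains, matrix = collapse_over_time(volume_tensor(log))
print(ips, domains, matrix)
```

Each row of the resulting matrix is one sender's distribution of volume across recipient domains, which is the feature the clustering operates on.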
![Page 34: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/34.jpg)
34
Clustering: Output and Fingerprint
• For each cluster, compute fingerprint vector:
• New IPs will be compared to this “fingerprint”
IP x IP Matrix: Intensity indicates pairwise similarity
![Page 35: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/35.jpg)
35
Evaluation
• Emulate the performance of a system that could observe sending patterns across many domains
– Build clusters/train on given time interval
• Evaluate classification
– Relative to labeled logs
– Relative to IP addresses that were eventually listed
![Page 36: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/36.jpg)
36
Data
• 30 days of Postfix logs from email hosting service
– Time, remote IP, receiving domain, accept/reject
– Allows us to observe sending behavior over a large number of domains
– Problem: About 15% of accepted mail is also spam
  • Creates problems with validating SpamTracker
• 30 days of SpamHaus database in the month following the Postfix logs
– Allows us to determine whether SpamTracker detects some sending IPs earlier than SpamHaus
![Page 37: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/37.jpg)
37
Clustering Results
[Figure: distribution of SpamTracker score for ham and for spam]
Separation may not be sufficient alone, but could be a useful feature
![Page 38: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/38.jpg)
38
Outline
• Understanding the network-level behavior
– What behaviors do spammers have?
– How well do existing techniques work?
• Building classifiers using network-level features
– Key challenge: Which features to use?
– Algorithms: SpamTracker and SNARE
• Building the system (SpamSpotter)
– Dynamism: Behavior itself can change
– Scale: Lots of email messages (and spam!) out there
![Page 39: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/39.jpg)
39
Deployment: Real-Time Blacklist
• As mail arrives, lookups received at BL
• Queries provide proxy for sending behavior
• Train based on received data
• Return score
Approach
![Page 40: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/40.jpg)
40
Challenges
• Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?
• Dynamism: When to retrain the classifier, given that sender behavior changes?
• Reliability: How should the system be replicated to better defend against attack or failure?
• Evasion resistance: Can the system still detect spammers when they are actively trying to evade?
![Page 41: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/41.jpg)
41
Design Choice: Augment DNSBL
• Expressive queries
– SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org
  • Ans: 127.0.0.3 (=> listed in exploits block list)
– SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net
  • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net
  • Ans: 127.1.3.97 (SpamSpotter score = -3.97)
• Also a source of data
– Unsupervised algorithms work with unlabeled data
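The query names above can be assembled with a few string operations. The sketch below only builds and decodes names and performs no DNS lookups. Reversing the sender's octets is the usual DNSBL convention. The reversed receiver IP, the "-" separator, and the reading of the answer 127.1.3.97 as sign flag, whole part, and hundredths are inferred from the single example on the slide and should be treated as assumptions. All function names are hypothetical.

```python
def reverse_ipv4(ip):
    """Reverse the octets of a dotted-quad IPv4 address."""
    return ".".join(reversed(ip.split(".")))

def dnsbl_qname(sender_ip, zone):
    """Conventional DNSBL query name: reversed sender IP followed by the zone."""
    return f"{reverse_ipv4(sender_ip)}.{zone}"

def spamspotter_qname(receiver_ip, receiver_domain, sender_ip,
                      zone="rbl.gtnoise.net"):
    """Query name in the layout of the slide's example (format is an assumption)."""
    return (f"{reverse_ipv4(receiver_ip)}.{receiver_domain}.-."
            f"{reverse_ipv4(sender_ip)}.{zone}")

def decode_score(answer):
    """Decode 127.s.w.h as a score; s == 1 is assumed to mean negative."""
    octets = [int(part) for part in answer.split(".")]
    if len(octets) != 4 or octets[0] != 127:
        raise ValueError(f"unexpected answer: {answer}")
    _, sign_flag, whole, hundredths = octets
    magnitude = whole + hundredths / 100
    return -magnitude if sign_flag == 1 else magnitude

print(dnsbl_qname("62.90.102.55", "zen.spamhaus.org"))
print(spamspotter_qname("3.2.1.120", "gmail.com", "130.207.1.1"))
print(decode_score("127.1.3.97"))
```

Carrying the receiver's identity in the query is what lets the blacklist observe sending behavior across domains while still answering over plain DNS.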
![Page 42: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/42.jpg)
42
Latency
Performance overhead is negligible.
![Page 43: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/43.jpg)
43
Design Choice: Sampling
Relatively small samples can achieve low false positive rates
![Page 44: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/44.jpg)
44
Possible Improvements
• Accuracy
– Synthesizing multiple classifiers
– Incorporating user feedback
– Learning algorithms with bounded false positives
• Performance
– Caching/Sharing
– Streaming
• Security
– Learning in adversarial environments
![Page 45: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/45.jpg)
45
Summary: Network-Based Behavioral Reputation
• Spam increasing, spammers becoming agile
– Content filters are falling behind
– IP-based blacklists are evadable
  • Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month
• Complementary approach: behavioral blacklisting based on network-level features
– Key idea: Blacklist based on how messages are sent
– SNARE: Automated sender reputation
  • ~90% of the accuracy of existing blacklists with lightweight features
– SpamTracker: Spectral clustering
  • Catches significant amounts of spam faster than existing blacklists
– SpamSpotter: Putting it together in an RBL system
![Page 46: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/46.jpg)
46
Next Steps: Phishing and Scams
• Scammers host Web sites on dynamic scam hosting infrastructure
– Use DNS to redirect users to different sites when the location of the sites moves
• State of the art: Blacklist URL
• Our approach: Blacklist based on network-level fingerprints
Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009
![Page 47: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/47.jpg)
47
References
• Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, 2006
• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, 2007
• Nadeem Syed, Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, GT-CSE-08-02
• Anirudh Ramachandran, Shuang Hao, Hitesh Khandelwal, Nick Feamster, Santosh Vempala, “A Dynamic Reputation Service for Spotting Spammers”, GT-CS-08-09 (In submission)
![Page 48: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/48.jpg)
48
Time Between Record Changes
Fast-flux domains tend to change much more frequently than legitimately hosted sites
![Page 49: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/49.jpg)
49
![Page 50: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/50.jpg)
50
Classifying IP Addresses
• Given “new” IP address, build a feature vector based on its sending pattern across domains
• Compute the similarity of this sending pattern to that of each known spam cluster
– Normalized dot product of the two feature vectors
– Spam score is maximum similarity to any cluster
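The scoring rule above can be sketched directly: a normalized dot product (cosine similarity) against each cluster fingerprint, keeping the maximum. The vectors below are hypothetical per-domain volumes, and any rescaling of the resulting score that the deployed system may apply is not shown.

```python
import math

def normalized_dot(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def spam_score(sender_vector, cluster_fingerprints):
    """Maximum similarity between the sender's pattern and any spam cluster."""
    return max(
        (normalized_dot(sender_vector, fp) for fp in cluster_fingerprints),
        default=0.0,
    )

# Hypothetical fingerprints over three recipient domains.
fingerprints = [[5, 0, 5], [0, 8, 0]]
print(spam_score([1, 0, 1], fingerprints))  # same direction as the first fingerprint
print(spam_score([0, 0, 0], fingerprints))  # a sender with no observed volume
```

Because the dot product is normalized, a new sender with low volume can still match a cluster if its volume is spread across domains in the same proportions.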
![Page 51: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/51.jpg)
51
Sampling: Training Time
![Page 52: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/52.jpg)
52
Additional History: Message Size Variance
Senders of legitimate mail have a much higher variance in sizes of messages they send
[Figure: message size range for certain spam, likely spam, likely ham, and certain ham]
Surprising: Including this feature (and others with more history) can actually decrease the accuracy of the classifier
![Page 53: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/53.jpg)
53
Completeness of IP Blacklists
~80% listed on average
~95% of bots listed in one or more blacklists
Only about half of the IPs spamming from short-lived BGP are listed in any blacklist
Spam from IP-agile senders tends to be listed in fewer blacklists
[Figure: fraction of all spam received (y-axis) vs. number of DNSBLs listing this spammer (x-axis)]
![Page 54: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/54.jpg)
54
Low Volume to Each Domain
[Figure: amount of spam (y-axis) vs. lifetime in seconds (x-axis)]
Most spammers send very little spam, regardless of how long they have been spamming.
![Page 55: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/55.jpg)
55
Some Patterns of Sending are Invariant
[Figure: A sender at IP address 76.17.114.xxx sends spam to domain1.com, domain2.com, and domain3.com. After a DHCP reassignment, the same sender at IP address 24.99.146.xxx sends spam to the same three domains.]
• Spammer's sending pattern has not changed
• IP Blacklists cannot make this connection
![Page 56: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/56.jpg)
56
Characteristics of Agile Senders
• IP addresses are widely distributed across the /8 space
• IP addresses typically appear only once at our sinkhole
• Depending on which /8, 60-80% of these IP addresses were not reachable by traceroute when we spot-checked
• Some IP addresses were in allocated, albeit unannounced space
• Some AS paths associated with the routes contained reserved AS numbers
![Page 57: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/57.jpg)
57
Early Detection Results
• Compare SpamTracker scores on “accepted” mail to the SpamHaus database
– About 15% of accepted mail was later determined to be spam
– Can SpamTracker catch this?
• Of 620 emails that were accepted, but sent from IPs that were blacklisted within one month
– 65 emails had a score larger than 5 (85th percentile)
![Page 58: Network-Level Spam Defenses](https://reader033.fdocuments.net/reader033/viewer/2022051218/56815c2b550346895dca039d/html5/thumbnails/58.jpg)
58
Evasion
• Problem: Malicious senders could add noise
– Solution: Use smaller number of trusted domains
• Problem: Malicious senders could change sending behavior to emulate “normal” senders
– Need a more robust set of features…