04/22/23 Email Spam Filtering - Muthiyalu Jothir
Email Spam Filtering
Computer Security Seminar
N. Muthiyalu Jothir – 271120
Media Informatics
Agenda
• What is Spam?
• Statistics
• Who Benefits from it?
• Spam Filtering Techniques
• Combining Filters
• Conclusion
What is Spam?
• Spam: unsolicited email.
• Email sent as identical or nearly identical messages to thousands (or millions) of recipients.
Caution! “SPAM” (Spiced Ham) is also a popular American canned meat brand.
Problem
• With a tiny investment, a spammer can send over 100,000 bulk emails per hour.
• Junk mail wastes storage and transmission bandwidth.
• The ISP's investment becomes a cost we absorb as the ISP's customers.
• Spam is a problem because the cost is forced onto us, the recipients.
Statistics
• Email considered spam: 40% of all email
• Daily spam emails sent: 12.4 billion
• Daily spam received per person: 6
• Annual spam received per person: 2,200
• Spam cost to all non-corporate Internet users: $255 million
• Spam cost to all U.S. corporations in 2002: $8.9 billion
• Estimated spam increase by 2007: 63%
• Users who reply to spam email: 28%
• Users who purchased from spam email: 8%
• Wasted corporate time per spam email: 4-5 seconds
Who benefits from Spam?
• Financial firms (e.g. mortgage)
• Lead generators (gain 2% of the loan value per customer data)
• Spammers (share the profit with the lead generators)
• Recipient
[Figure labels: “Information about interested customers”; “Recipient replies here”]
Spam Control Techniques
Fight Back Techniques
• Reporting Spam to ISP
• Fight back filters
• Slow Senders
• Law ???
• etc.
Filtering Techniques
• Challenge-Response Filtering
• Blacklists and White lists
• Content based filters: Rule based, Bayesian
Reporting Spam To ISPs
• Original spam solution
• Legitimate ISPs respond to such complaints
• Spammers get kicked off
Disadvantage
• Disguised spammers
• Naïve users cannot interpret the email headers
Filters that Fight Back (FFB)
• The majority of spam contains links to web pages.
• Spam filters could automatically retrieve the URLs and crawl those pages, which would increase the load on the spammer's server.
• If all spam recipients did this at the same time, the server might crash, which raises the cost of spamming.
Caution!
• FFB usually works together with blacklists (of malicious servers) in order to avoid attacking innocent servers.
Filtering Techniques
Spam Vs Ham
Care to be taken in any spam filtering technique:
“All the spam could be allowed to pass through, but not even a single legitimate mail should be filtered.”
• False positive: legitimate mail classified as spam.
• The lowest possible false positive rate is desired.
Caution: Check your junk folder before deleting. Don't believe your spam filter.
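To make the false-positive notion concrete, the sketch below computes a false positive rate from a handful of labelled decisions. The function name `false_positive_rate` and the sample data are illustrative assumptions, not taken from the slides.

```python
def false_positive_rate(actual, predicted):
    """Fraction of legitimate (ham) mails that were wrongly flagged as spam."""
    ham_total = sum(1 for a in actual if a == "ham")
    false_pos = sum(
        1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam"
    )
    # With no ham in the sample the rate is undefined; report 0.0 by convention.
    return false_pos / ham_total if ham_total else 0.0


# Hypothetical sample: true labels and the filter's decisions.
actual = ["ham", "ham", "spam", "ham", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham", "ham"]
fpr = false_positive_rate(actual, predicted)  # 1 of 4 ham mails was flagged
```

Note that the spam mail that slipped through (a false negative) does not affect this rate, which matches the slide's priority: missed spam is tolerable, lost legitimate mail is not.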
Challenge-Response Filtering
• Emails from unknown senders trigger an auto-reply message asking the sender to verify themselves.
• Senders are “challenged” to type in a word that is hidden within a graphic or a sound file.
• The mail is forwarded to the receiver's inbox only after a successful “response”.
• This technique filters almost all spam, since spammers are unlikely to take the extra effort to verify themselves.
• Commercial product: “spamarrest”
Disadvantage
• This technique can come across as rude to legitimate senders.
• Senders sometimes ignore or forget to reply to the challenge.
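The challenge-response flow can be sketched as a small in-memory state machine. Everything here is a hypothetical illustration: the class name `ChallengeResponseFilter`, its methods, and the use of a random hex token as the "hidden word" are assumptions, and a real system would render the word into a graphic or sound file and send it by mail.

```python
import secrets


class ChallengeResponseFilter:
    """Holds mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders=None):
        self.known_senders = set(known_senders or [])
        self.pending = {}  # sender -> (challenge word, held messages)
        self.inbox = []

    def receive(self, sender, message):
        """Deliver mail from known senders; otherwise hold it and return a challenge."""
        if sender in self.known_senders:
            self.inbox.append(message)
            return None
        word, held = self.pending.get(sender, (secrets.token_hex(3), []))
        held.append(message)
        self.pending[sender] = (word, held)
        return word  # would be embedded in the auto-reply

    def respond(self, sender, answer):
        """Release held mail if the sender's answer matches the challenge."""
        word, held = self.pending.get(sender, (None, []))
        if word is None or answer != word:
            return False
        self.inbox.extend(held)
        self.known_senders.add(sender)  # no further challenges for this sender
        del self.pending[sender]
        return True
```

Held messages stay in `pending` indefinitely if the sender never replies, which is exactly the disadvantage noted on the slide.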
Blacklists and White lists
• Blacklists of misbehaving servers or known spammers are collected by several sites.
• The sender ID in the email is compared with the blacklist.
• White lists are complementary to blacklists and contain the addresses of trusted contacts.
• Use blacklists and white lists for first-level filtering (before applying content checks), not as the only tool for making the decision.
Disadvantage
• Prone to misconfiguration: legitimate servers may be unable to get off a list to which they were incorrectly added.
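A first-level list check might look like the sketch below. The function name `first_level_check`, the three-way result, and the sample addresses are assumptions made for illustration; checking the white list first is a design choice here, so that a trusted contact is never rejected because of a bad blacklist entry.

```python
def first_level_check(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'unknown' for a sender address."""
    sender = sender.strip().lower()
    domain = sender.rpartition("@")[2]
    if sender in whitelist or domain in whitelist:
        return "accept"  # trusted contact: deliver directly
    if sender in blacklist or domain in blacklist:
        return "reject"  # known spammer or misbehaving server
    return "unknown"  # hand over to the content-based filter


# Hypothetical lists; entries may be full addresses or bare domains.
WHITELIST = {"colleague@example.org"}
BLACKLIST = {"bulk-mailer.example"}
```

Only the `unknown` outcome needs the more expensive content checks described on the following slides.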
Content based filters
• It is not a good idea to filter mail based on blacklists alone.
• Wiser decision: consider the actual content of the email.
• Almost all successful spam filters use this technique.
• Major types: rule-based and Bayesian.
Rule Based Filters
Rule-based filters apply a set of static rules to decide whether a mail is spam or not.
Rules could be:
• words and phrases
• lots of uppercase characters
• exclamation points
• special characters
• Web links
• HTML messages
• background colors
• crazy Subject lines, etc.
Rule Based Filters
• Rules are given scores based on their importance.
• Incoming mails are parsed and checked for known malicious patterns.
• A total score is calculated from the triggered rules.
• If the final score > threshold, the mail is classified as spam; otherwise it is classified as legitimate mail.
• The threshold is decided by the user.
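The scoring scheme above can be sketched in a few lines. The rule names, patterns, scores, and the default threshold of 3.0 are all invented for illustration and do not come from any real rule set.

```python
import re

# Hypothetical rules: (name, compiled pattern, score).
RULES = [
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("ALL_CAPS_WORD", re.compile(r"\b[A-Z]{5,}\b"), 1.0),
    ("CONTAINS_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
    ("SPAMMY_PHRASE", re.compile(r"\bfree money\b", re.IGNORECASE), 2.0),
]


def score_message(text, rules=RULES):
    """Return (total score, names of triggered rules) for one message."""
    triggered = [(name, score) for name, pattern, score in rules
                 if pattern.search(text)]
    total = sum(score for _, score in triggered)
    return total, [name for name, _ in triggered]


def classify(text, threshold=3.0):
    """Spam if the summed score of triggered rules exceeds the user's threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"


spam_like = "FREE MONEY!!! Click http://example.com now"
ham_like = "Meeting moved to 3pm, see agenda attached."
```

Lowering the threshold catches more spam at the price of more false positives, which is why the slide leaves that choice to the user.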
Rule Based Filters
• “SpamAssassin”, a popular spam filtering product, uses rule-based filtering.
• Perl regular expressions (regex) are used for pattern checking.
Example rules
• header __LOCAL_FROM_NEWS From =~ /news@example\.com/i
• body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
• meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
• score LOCAL_NEWS_SALES_FIGURES 0.8
Rule Based Filters
Advantage
• Easy to implement
• No training required
Disadvantage
• Static rules are too general
• Spammers find new ways to deceive the rules
Bayesian Filters
• Bayesian filters are the latest in spam filtering technology and the most successful.
• Bayes classifiers have been used extensively in the field of pattern recognition.
• Given an unlabeled example, the classifier calculates the most likely classification together with its probability.
Bayesian Filters
Steps in Bayes filtering:
• Training
• Validation
• Implementation
Training starts with two collections of mail: one of spam and one of legitimate mail.
For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences.
Bayesian filters are quite accurate and adapt automatically as spam evolves.
Bayesian filters keep false positives low because they consider evidence of innocence as well as evidence of spam.
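The training step can be sketched as follows, assuming whitespace tokenization and a per-word estimate based on the share of mails in each collection that contain the word. The function names and the tiny corpora are illustrative assumptions; practical filters also smooth or clamp the estimates so that no word gets exactly 0 or 1.

```python
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(spam_mails, ham_mails):
    """Return {word: estimated spam probability} from two labelled collections."""
    # Count each word once per mail, so long mails do not dominate.
    spam_counts = Counter(w for m in spam_mails for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_mails for w in set(tokenize(m)))
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word] / len(spam_mails)  # share of spam containing word
        h = ham_counts[word] / len(ham_mails)    # share of ham containing word
        probs[word] = s / (s + h)
    return probs


# Hypothetical training collections.
spam = ["win money now", "cheap money offer"]
ham = ["meeting agenda now", "project meeting notes"]
probs = train(spam, ham)
```

Words seen only in ham, such as "meeting" here, receive a probability of 0 and act as the "evidence of innocence" mentioned above.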
Bayesian Filtering
Bayes probability:

$$\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{spam}) \cdot \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}$$

A probability closer to 1 is classified as spam; a probability closer to 0 is classified as ham.
0.5 is set as the threshold.
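As a minimal sketch of how the formula is evaluated, the code below assumes the words are independent given the class (the usual "naive" assumption, which the slides do not state) and expands Pr(words) as Pr(spam)·Pr(words | spam) + Pr(ham)·Pr(words | ham). The per-word likelihood tables and the prior of 0.5 are made-up values.

```python
import math


def spam_posterior(words, p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Pr(spam | words) by Bayes' rule under the naive independence assumption.

    Likelihoods must be strictly between 0 and 1; unknown words are skipped.
    """
    # Work in log space to avoid underflow on long messages.
    log_spam = math.log(p_spam)
    log_ham = math.log(1.0 - p_spam)
    for w in words:
        if w in p_word_given_spam and w in p_word_given_ham:
            log_spam += math.log(p_word_given_spam[w])
            log_ham += math.log(p_word_given_ham[w])
    # Normalise: numerator / (spam term + ham term), shifted by the max for stability.
    m = max(log_spam, log_ham)
    num = math.exp(log_spam - m)
    return num / (num + math.exp(log_ham - m))


def classify(words, p_word_given_spam, p_word_given_ham, threshold=0.5):
    p = spam_posterior(words, p_word_given_spam, p_word_given_ham)
    return "spam" if p > threshold else "ham"


# Hypothetical likelihoods Pr(word | class).
P_SPAM = {"money": 0.8, "meeting": 0.1}
P_HAM = {"money": 0.2, "meeting": 0.6}
```

For the single word "money" this gives 0.5·0.8 / (0.5·0.8 + 0.5·0.2) = 0.8, which is above the 0.5 threshold and therefore classified as spam.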
Neural Network for Training
[Figure: Neural Network Structure]
Neural Networks for Training
• Neural networks are used to train the spam filter (rule-based or Bayesian); the network itself is not a filter.
• Input: words, rules, etc.
• Trained over multiple samples of the user's mail (both spam and ham).
• The weights of the links are altered until the desired output is obtained.
Supervised Learning
• Supervised learning: training with a “teacher” signal.
• Train the system until the edge weights settle, i.e. they no longer change.
• Caution! Take care not to over-train the network.
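The idea of adjusting weights from a teacher signal until they stop changing can be illustrated with a single-layer perceptron. This is a deliberately small stand-in, not the network from the slides; the two binary features, the learning rate, and the epoch cap are assumptions. The cap on epochs is one crude guard against training forever; checking accuracy on held-out mail would be the more usual way to notice over-training.

```python
def train_perceptron(samples, labels, lr=0.1, max_epochs=100):
    """Train until a full pass leaves the weights unaltered (or the cap is hit)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(samples, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            output = 1 if activation > 0 else 0
            error = target - output  # the "teacher" signal
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                changed = True
        if not changed:
            break
    return weights, bias


def predict(x, weights, bias):
    return 1 if bias + sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


# Hypothetical features: [contains "money", contains "meeting"]; label 1 = spam.
samples = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
weights, bias = train_perceptron(samples, labels)
```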
Combining Spam Filters
• Goal: a combined filter aims to improve on the performance of the individual filters.
• Combined Filter = Original Filter (OF) + Received Filter (RF)
• Maximum gain: when the received filter contains feature sets not found in the original filter.
E.g.
• Original Filter = {“Share Market”, “Higher Studies”}
• Received Filter = {“Share Market”, “Job Alerts”}
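If each filter is modelled simply as a set of feature names (a simplification made here for illustration), the gain from combining can be read off with set operations:

```python
# Feature sets from the example, lowercased for comparison.
original_filter = {"share market", "higher studies"}
received_filter = {"share market", "job alerts"}

# Features the received filter contributes that the original lacks.
new_features = received_filter - original_filter

# Features both filters already know about: no gain from these.
shared_features = original_filter & received_filter

# Coverage of the combined filter.
combined_features = original_filter | received_filter
```

Here `new_features` is `{"job alerts"}`, so the combined filter covers three topics instead of two.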
Challenges
• Decisions (spam / ham) are made by both filters individually.
• Decisions agree: no problem.
• Disagreement: due to the difference in feature sets.
Challenges
• “How do we select the correct decision or filter?”
• “Who selects it?”
Filter Selector (FS)
• Training phase: FS predicts the unique features (e.g. words) of RF.
• Parse the emails of the training set and extract the features.
• ‘Bag’ of (predicted) features for RF.
• Text similarity comparison between the current e-mail's features and the feature sets of the filters.
Algorithm Flowchart
[Figure: flowchart with two stages]
1. Training Phase
2. Final Verdict
TF – IDF Similarity Measure
• Commonly used in Information Retrieval applications.
• More frequent words would be key to accurate classification of emails.
• The feature set predicted by FS is unique.
• “Query – Document” retrieval procedure:
  • 2 documents: the feature sets
  • Query: the current email
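The query-document procedure can be sketched with a hand-rolled TF-IDF and cosine similarity. This is one plausible reading of the slide, not the exact measure used in the experiments: the smoothed IDF formula, the function names, and the sample bags are assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: {name: list of tokens}. Returns ({name: {term: weight}}, idf table)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    # Smoothed idf, so terms shared by every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: (c / len(tokens)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_filter(email_tokens, feature_bags):
    """Treat the email as the query and each filter's feature bag as a document."""
    vectors, idf = tf_idf_vectors(feature_bags)
    tf = Counter(t for t in email_tokens if t in idf)  # ignore unknown terms
    total = sum(tf.values())
    query = {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}
    scores = {name: cosine(query, vec) for name, vec in vectors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical feature bags for the original (OF) and received (RF) filters.
bags = {
    "OF": ["share", "market", "higher", "studies"],
    "RF": ["share", "market", "job", "alerts"],
}
email = "new job alerts for the share market".split()
chosen, scores = select_filter(email, bags)
```

With these bags the email shares all four of its known terms with RF, so RF is selected and its decision would be taken as the final verdict.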
Experiments & Results
Conclusion
• We discussed the techniques to “kill” spam.
• Comparison between various techniques.
• So far, Bayesian filtering seems to be reliable.
• Discussed a new approach to combine filters.
• Future work:
  • Learning techniques for the Filter Selector
  • Better similarity measures
Thank You