Email Spam Filtering Computer Security Seminar
-
Email Spam Filtering
Computer Security Seminar
N. Muthiyalu Jothir 271120
Media Informatics
Email Spam Filtering - Muthiyalu Jothir
-
Agenda
What is Spam?
Statistics
Who Benefits from It?
Spam Filtering Techniques
Combining Filters
Conclusion
-
What is Spam?
Spam: unsolicited email, typically identical or nearly identical messages sent to thousands (or millions) of recipients.
Caution! SPAM (Spiced Ham) is also a popular American canned meat brand.
-
Problem
With a tiny investment, a spammer can send over 100,000 bulk emails per hour.
Junk mail wastes storage and transmission bandwidth.
ISPs must invest to handle it, and we absorb that cost as the ISPs' customers.
Spam is a problem because the cost is forced onto us, the recipients.
-
Statistics
-
Who benefits from Spam?
Financial firms (e.g. mortgage)
Lead generators (gain 2% of loan value per customer data)
Spammers (share the profit with lead generators)
Recipient
Diagram labels: information about interested customers; recipient replies here
-
Spam Control Techniques
Fight-back techniques: reporting spam to ISP, fight-back filters, slow senders, law (???), etc.
Filtering techniques: challenge-response filtering, blacklists and whitelists, content-based filters (rule-based, Bayesian)
-
Reporting Spam to ISPs
Original spam solution
Legitimate ISPs respond to such complaints
Spammers get kicked off
Disadvantage
Disguised spammers
Naive users cannot interpret the email headers
-
Filters that Fight Back (FFB)
The majority of spam contains links to web pages.
Spam filters could automatically retrieve the URLs and crawl those pages, which would increase the load on the spammer's server.
If all spam recipients do this at the same time, the server might crash, so the cost of spamming increases.
Caution!
FFB usually works with blacklists (of malicious servers) to avoid attacking innocent servers.
-
Filtering Techniques
-
Spam vs. Ham
Care to be taken in any spam filtering technique
Even if all the spam were allowed to pass through, not a single legitimate mail should be filtered.
False positive: legitimate mail classified as spam.
The lowest possible false positive rate is desired.
Caution: check your junk folder before deleting.
Don't blindly trust your spam filter.
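A minimal sketch of how the false positive rate could be measured on a labeled test set. The function name and the sample labels are illustrative assumptions, not from the slides.

```python
def false_positive_rate(labels, predictions):
    """Fraction of legitimate (ham) mails that the filter wrongly marked as spam."""
    ham_total = sum(1 for y in labels if y == "ham")
    false_pos = sum(
        1 for y, p in zip(labels, predictions) if y == "ham" and p == "spam"
    )
    # With no ham in the test set the rate is undefined; report 0.0 here.
    return false_pos / ham_total if ham_total else 0.0


# Hypothetical test set: true labels and the filter's verdicts.
labels = ["ham", "ham", "spam", "ham", "spam", "ham"]
predictions = ["ham", "spam", "spam", "ham", "ham", "ham"]

fpr = false_positive_rate(labels, predictions)
print(f"false positive rate: {fpr:.2f}")
```

Note that the missed spam (the fifth mail) does not affect this metric; only legitimate mail lost to the filter does.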
-
Challenge-Response Filtering
Emails from unknown senders receive an auto-reply message asking the senders to verify themselves.
Senders are "challenged" to type in a word that is hidden within a graphic or a sound file.
Mail is forwarded to the receiver's inbox only after a successful response.
This technique filters almost all spam. No spammer would be interested in taking the extra effort to verify himself or herself.
Commercial product: Spamarrest
Disadvantage
This technique is rude.
Sometimes senders don't reply, or forget to reply, to the challenge.
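The flow above can be sketched roughly as follows. This is an illustration under assumptions: the class and method names are hypothetical, a random token stands in for the word hidden in a graphic or sound file, and remembering verified senders is an added convenience rather than something the slides specify.

```python
import secrets


class ChallengeResponseFilter:
    """Holds mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders=()):
        self.known = set(known_senders)
        self.pending = {}  # challenge token -> (sender, message)
        self.inbox = []

    def receive(self, sender, message):
        """Deliver mail from known senders; hold the rest and issue a challenge."""
        if sender in self.known:
            self.inbox.append((sender, message))
            return None
        token = secrets.token_hex(8)  # would be sent back in the auto-reply
        self.pending[token] = (sender, message)
        return token

    def respond(self, token):
        """Release the held mail if the challenge response is valid."""
        held = self.pending.pop(token, None)
        if held is None:
            return False
        sender, message = held
        self.known.add(sender)  # assumption: verified senders are trusted later
        self.inbox.append((sender, message))
        return True


mail_filter = ChallengeResponseFilter(known_senders=["alice@example.org"])
challenge = mail_filter.receive("bob@example.org", "Hello")
print(mail_filter.respond(challenge), len(mail_filter.inbox))
```

Mail whose challenge is never answered simply stays in `pending`, which mirrors the disadvantage noted above.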
-
Blacklists and White lists
Blacklists are lists of misbehaving servers or known spammers, collected by several sites.
The sender ID in the email is compared with the blacklist.
Whitelists are complementary to blacklists and contain addresses of trusted contacts.
Use blacklists and whitelists for first-level filtering (before applying content checks), not as the only tool for making a decision.
Disadvantage
Prone to misconfiguration: legitimate servers may be unable to get off a list to which they were incorrectly added.
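A small sketch of lists used as a first-level filter that defers to content checks when neither list matches. The function name, the return values, and the choice to let the whitelist take precedence are assumptions made for illustration.

```python
def first_level_filter(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'content_check' for a sender address."""
    if sender in whitelist:
        return "accept"
    domain = sender.rsplit("@", 1)[-1]
    # The blacklist may hold full addresses or whole server domains.
    if sender in blacklist or domain in blacklist:
        return "reject"
    # Unknown sender: the lists alone do not decide, so defer to content filters.
    return "content_check"


whitelist = {"friend@example.org"}
blacklist = {"spam.example.net", "bad@example.com"}

for address in ["friend@example.org", "x@spam.example.net", "new@example.org"]:
    print(address, "->", first_level_filter(address, whitelist, blacklist))
```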
-
Content based filters
Filtering mail based on blacklists alone is not a good idea.
Wiser decision: consider the actual content of the email.
Almost all successful spam filters use this technique.
Major types: rule-based and Bayesian
-
Rule Based Filters
Rule-based filters apply static rules to decide whether a mail is spam or not.
Rules could be based on:
words and phrases
lots of uppercase characters
exclamation points
special characters
Web links
HTML messages
background colors
crazy subject lines, etc.
-
Rule Based Filters
Rules are given scores based on importance.
Incoming mails are parsed and checked for known malicious patterns.
A total score is calculated from the triggered rules.
If final score > threshold, classify as spam; otherwise, classify as legitimate mail.
The threshold is decided by the user.
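The scoring scheme above might look like this in miniature. The rule names, patterns, scores, and default threshold are all invented for illustration and do not come from any real filter.

```python
import re

# Hypothetical rules: (name, pattern, score). Higher score = more important.
RULES = [
    ("FREE_MONEY", re.compile(r"\bfree money\b", re.IGNORECASE), 2.5),
    ("MANY_EXCLAMATIONS", re.compile(r"!{3,}"), 1.5),
    ("UPPERCASE_RUN", re.compile(r"\b[A-Z]{6,}\b"), 1.0),
    ("WEB_LINK", re.compile(r"https?://", re.IGNORECASE), 0.5),
]


def score_message(text):
    """Return the total score and the names of the rules the text triggered."""
    triggered = [(name, score) for name, pattern, score in RULES
                 if pattern.search(text)]
    return sum(score for _, score in triggered), [name for name, _ in triggered]


def classify(text, threshold=3.0):
    """Spam if the summed score exceeds the user-chosen threshold."""
    total, _ = score_message(text)
    return "spam" if total > threshold else "ham"


print(classify("Get FREE MONEY now!!! http://example.com"))
print(classify("Meeting at 10 tomorrow"))
```

Raising the threshold trades missed spam for fewer false positives, which is why the slides leave that choice to the user.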
-
Rule Based Filters
SpamAssassin, a popular spam filtering product, uses rule-based filtering.
Perl regex (regular expressions) are used for pattern checking.
Example rules
header __LOCAL_FROM_NEWS From =~ /[email protected]\.com/i
body __LOCAL_SALES_FIGURES /\bMonthly Sales Figures\b/
meta LOCAL_NEWS_SALES_FIGURES (__LOCAL_FROM_NEWS && __LOCAL_SALES_FIGURES)
score LOCAL_NEWS_SALES_FIGURES 0.8
-
Rule Based Filters
Advantage
Easy to implement
No training required
Disadvantage
Static rules are too general
Spammers find new ways to deceive the rules
-
Bayesian Filters
Bayesian filters are the latest in spam filtering technology and the most successful.
Bayes classifiers have been used extensively in the field of pattern recognition.
Given an unlabeled example, the classifier calculates the most likely classification along with its probability.
-
Bayesian Filters
Steps in Bayes filtering: training, validation, implementation
Training starts with two collections of mail: one of spam and one of legitimate mail.
For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences.
Bayesian filters are quite accurate and adapt automatically as spam evolves.
Bayesian filtering minimizes false positives because it considers evidence of innocence as well as evidence of spam.
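A rough sketch of the training step, assuming whitespace tokenization and counting each word at most once per mail. The function name and the tiny corpora are made up, and real filters would add smoothing so that rare words do not get extreme probabilities.

```python
from collections import Counter


def train(spam_mails, ham_mails):
    """Map each word to a spam probability from its share of spam occurrences."""
    spam_counts = Counter(w for m in spam_mails for w in set(m.lower().split()))
    ham_counts = Counter(w for m in ham_mails for w in set(m.lower().split()))
    probabilities = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_freq = spam_counts[word] / len(spam_mails)  # share of spam mails
        ham_freq = ham_counts[word] / len(ham_mails)     # share of ham mails
        probabilities[word] = spam_freq / (spam_freq + ham_freq)
    return probabilities


# Hypothetical training collections (both must be non-empty).
spam = ["cheap loan offer", "cheap pills"]
ham = ["project meeting", "loan paperwork for meeting"]

word_probs = train(spam, ham)
print(word_probs["cheap"], word_probs["loan"], word_probs["meeting"])
```

Words such as "meeting" end up near 0 and act as the evidence of innocence mentioned above, while "cheap" ends up near 1.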
-
Bayesian Filtering
Bayes probability:
$$\Pr(\text{spam} \mid \text{words}) = \frac{\Pr(\text{spam}) \cdot \Pr(\text{words} \mid \text{spam})}{\Pr(\text{words})}$$
A probability closer to 1 is classified as spam, and one closer to 0 is classified as ham.
0.5 is set as the threshold.
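As a worked example of the Bayes formula on this slide, the sketch below expands Pr(words) with the law of total probability over spam and ham. That expansion, the function name, and the input numbers are assumptions chosen for illustration.

```python
def posterior_spam(p_spam, p_words_given_spam, p_words_given_ham):
    """Pr(spam | words) via Bayes' rule."""
    p_ham = 1.0 - p_spam
    # Pr(words) = Pr(words | spam) Pr(spam) + Pr(words | ham) Pr(ham)
    p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
    return p_spam * p_words_given_spam / p_words


# Hypothetical numbers: half of all mail is spam; the observed words appear
# in 80% of spam mails and 20% of ham mails.
p = posterior_spam(p_spam=0.5, p_words_given_spam=0.8, p_words_given_ham=0.2)
verdict = "spam" if p > 0.5 else "ham"
print(round(p, 3), verdict)
```

With these inputs the numerator is 0.5 × 0.8 = 0.4 and the denominator is 0.4 + 0.1 = 0.5, so the posterior is 0.8, above the 0.5 threshold.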
-
Neural Network for Training
[Figure: neural network structure]
-
Neural Networks for Training
Neural networks are used to train the spam filter (rule-based or Bayesian); the network itself is not a filter.
Input: words, rules, etc.
Trained over multiple samples of the user's mail (both spam and ham).
Weights of the links are adjusted until the desired output is obtained.
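To make the weight adjustment concrete, here is a single-layer perceptron used as a simplified stand-in for the network on the slides; the slides do not specify the architecture. The features, learning rate, and sample data are assumptions.

```python
def predict(weights, bias, x):
    """Output 1 (spam) if the weighted sum of inputs is positive, else 0 (ham)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, n_features, learning_rate=0.1, max_epochs=100):
    """Adjust weights until every training sample gets the desired output."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            delta = target - predict(weights, bias, x)  # teacher signal - output
            if delta:
                errors += 1
                weights = [w + learning_rate * delta * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * delta
        if errors == 0:  # weights no longer change: training is done
            break
    return weights, bias


# Hypothetical features: [contains "free", contains "meeting"]; 1 = spam.
samples = [([1, 0], 1), ([0, 1], 0), ([0, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(samples, n_features=2)
print(weights, bias)
```

The capped `max_epochs` is one simple guard against training indefinitely on data that cannot be separated.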
-
Supervised Learning
Supervised learning: training with a teacher signal.
Train the system until the edge weights converge and no longer change.
Caution! Take care not to overtrain the network.
-
Combining Spam Filters
Goal: a combined filter aims to improve on the individual filters' performance.
Combined Filter = Original Filter (OF) + Received Filter (RF)
Max gain: the received filter contains some feature sets not found in the original filter.
E.g.
Original filter = {Share Market, Higher Studies}
Received filter = {Share Market, Job Alerts}
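The example above can be expressed with plain sets. Treating each filter as a set of feature names is a simplification assumed here for illustration.

```python
original_filter = {"Share Market", "Higher Studies"}
received_filter = {"Share Market", "Job Alerts"}

# Features the received filter contributes beyond the original one:
# this difference is where the gain from combining comes from.
new_features = received_filter - original_filter

# Feature coverage of the combined filter.
combined_features = original_filter | received_filter

print(sorted(new_features))
print(sorted(combined_features))
```

If `new_features` were empty, the received filter would add nothing the original filter does not already cover.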
-
Challenges
Decisions (spam / ham) are made by both filters individually.
Decisions agree: no problem.
Disagreement: due to the difference in feature sets.
How do we select the correct decision or filter?
Who selects it?
-
Filter Selector (FS)
Training phase: FS predicts the unique features (e.g. words) of RF.
Parse the emails of the training set and extract the features.
Build a bag of (predicted) features for RF.
Compare text similarity between the current email's features and the feature sets of the filters.
-
Algorithm Flowchart
[Flowchart stages: Training Phase, Final Verdict]
-
TF-IDF Similarity Measure
Commonly used in information retrieval applications.
More frequent words would be key to accurate classification of emails.
The feature set predicted by FS is unique.
Query-document retrieval procedure:
2 documents: the feature sets
Query: the current email
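One way the query-document comparison could be sketched: each filter's feature set is a document, the current email is the query, and the filter with the higher TF-IDF cosine similarity is selected. The smoothed IDF (log(N/df) + 1), the function names, and the sample tokens are assumptions; the slides do not give the exact weighting.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Term frequency times IDF; terms unknown to every filter get weight 0."""
    tf = Counter(tokens)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_filter(email_tokens, feature_sets):
    """Pick the filter whose feature set is most similar to the email."""
    n_docs = len(feature_sets)
    doc_freq = Counter(t for feats in feature_sets.values() for t in set(feats))
    # Smoothed IDF so terms shared by both filters keep a small weight.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    query = tfidf_vector(email_tokens, idf)
    scores = {name: cosine(query, tfidf_vector(feats, idf))
              for name, feats in feature_sets.items()}
    return max(scores, key=scores.get), scores


feature_sets = {
    "OF": ["share", "market", "higher", "studies"],
    "RF": ["share", "market", "job", "alerts"],
}
chosen, scores = select_filter(["new", "job", "alerts", "in", "market"], feature_sets)
print(chosen, scores)
```

Terms unique to one feature set receive a higher IDF than shared ones, so an email about job alerts leans toward RF even though "market" appears in both sets.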
-
Experiments & Results
-
Conclusion
We discussed techniques to kill spam.
We compared the various techniques.
So far, Bayesian filtering seems to be reliable.
We discussed a new approach to combining filters.
Future work: learning techniques for the Filter Selector; better similarity measures
-
Thank You