SPAM FILTERING


  • SPAM FILTERING
    By Ankur Khator 01005028, Gaurav Sharma 01005029, Arpit Mathur 01D05014

  • What is Spam Email?
    "Junk email" or "unsolicited commercial email".
    Spam filtering is a special case of email classification, with only 2 classes: Spam and Non-spam.

  • Various Approaches
    Bayesian Learning - a probabilistic model for spam filtering, using the Bag of Words representation.
    Ripper algorithm - context-sensitive learning.
    Boosting algorithm - improving accuracy by combining weaker hypotheses.

  • Term Vectors
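    As an illustration of the term-vector (bag-of-words) representation, here is a minimal Python sketch; the vocabulary and the example message are made up for illustration and are not taken from the slides:

```python
# Minimal sketch of a binary bag-of-words term vector.
# The vocabulary and the example message are illustrative only.
vocabulary = ["free", "gift", "click", "meeting", "report"]

def term_vector(message: str) -> list[int]:
    """X_i = 1 if word i occurs in the message at least once, else 0."""
    words = set(message.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

print(term_vector("Click here for a FREE gift"))   # [1, 1, 1, 0, 0]
```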

  • Naive Bayes for Spam
    We seek a model for P(Y=1 | X1=x1, X2=x2, ..., Xd=xd).

    From Bayes' theorem:
    P(Y=1 | X1=x1, ..., Xd=xd) = P(Y=1) * P(X1=x1, ..., Xd=xd | Y=1) / P(X1=x1, ..., Xd=xd)

    P(Y=0 | X1=x1, ..., Xd=xd) = P(Y=0) * P(X1=x1, ..., Xd=xd | Y=0) / P(X1=x1, ..., Xd=xd)

  • Justification for using Bayes' Theorem
    Sparseness of data: P(B|A) can be determined easily and accurately compared to P(A|B).

  • Naive Bayes for Spam (contd.)
    Assume conditional independence: P(X1=x1, ..., Xd=xd | Y=k) = ∏_i P(Xi=xi | Y=k).
    Also assume Xi = 1 if the number of occurrences of word i is >= 1, and Xi = 0 otherwise.

  • Naive Bayes for Spam (contd.)
    The per-word log-probability ratios are referred to as weights of evidence.
    There is an inconsistency when some estimated probability is zero: smooth the estimates by adding a small positive constant to both the numerator and the denominator of each probability estimate.
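    To make the estimation concrete, here is a minimal sketch of Naive Bayes training with the smoothing described above, on binary term vectors. The variable names and the smoothing constant `alpha` are illustrative, not taken from the slides:

```python
import math

def train_naive_bayes(X, y, alpha=1.0):
    """X: list of binary term vectors, y: list of labels (1 = spam, 0 = non-spam).

    Returns the spam prior and the smoothed estimates P(Xi=1 | Y=k); alpha is the
    positive constant added to numerator and denominator of each estimate.
    """
    d = len(X[0])
    n_spam = sum(y)
    n_ham = len(y) - n_spam
    prior_spam = n_spam / len(y)

    p_word_spam = [(sum(x[i] for x, lab in zip(X, y) if lab == 1) + alpha) /
                   (n_spam + 2 * alpha) for i in range(d)]
    p_word_ham = [(sum(x[i] for x, lab in zip(X, y) if lab == 0) + alpha) /
                  (n_ham + 2 * alpha) for i in range(d)]
    return prior_spam, p_word_spam, p_word_ham

def classify(x, prior_spam, p_word_spam, p_word_ham):
    """Return P(Y=1 | X=x) under the conditional-independence assumption."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for xi, ps, ph in zip(x, p_word_spam, p_word_ham):
        # Weight of evidence contributed by word i
        log_odds += math.log(ps / ph) if xi == 1 else math.log((1 - ps) / (1 - ph))
    return 1 / (1 + math.exp(-log_odds))
```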

  • Classifying
    Assume a new mail with the text "The quick rabbit rests".

    Sum of the weights of evidence: 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0 = 2.63, giving Probability = 0.93.
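    Assuming the 2.63 above is the total log-odds (the summed weights of evidence), the probability 0.93 follows from the logistic transform; a quick check:

```python
import math

log_odds = 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0   # = 2.63
p_spam = 1 / (1 + math.exp(-log_odds))          # logistic transform of the log-odds
print(round(p_spam, 2))                         # 0.93
```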

  • Threshold
    Lower threshold: higher false positive rate (legitimate mail classified as spam).

    Higher threshold: higher false negative rate (spam let through) - this is preferred, since false positives are more costly.

  • Non-Linear Classification
    A linear classifier ignores the effect of a word's context on its meaning.

    This is unrealistic. One could build a linear classifier that tests more complex features, like simultaneous occurrences of words, but the computational cost is high! Non-linear classification is the solution.
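    A rough sketch of why explicit simultaneous-occurrence features are expensive: one feature per unordered word pair roughly squares the feature space. The vocabulary below is hypothetical:

```python
from itertools import combinations

vocabulary = ["free", "gift", "click", "meeting", "report"]   # a real filter has tens of thousands of words

# One feature per unordered pair of words; it fires when both words occur in the message.
pair_features = list(combinations(vocabulary, 2))
print(len(pair_features))   # 10 pairs here; ~V*(V-1)/2 in general, so ~5*10^7 for V = 10,000
```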

  • Ripper
    A rule set is a disjunction of different contexts; each context is a conjunction of simple terms.
    Example: the context of w1 is "w2 belongs to the data and w3 belongs to the data", i.e. for the context to be true, w1 must occur together with w2 and w3.
    Three components of the Ripper algorithm:

  • Rule Learning
    Example rules:
    Spam if "spam" occurs in the Subject.
    Spam if "Free" occurs in the Subject and "Spam" occurs in the Subject.
    Spam if "Gift!!" occurs in the Subject and "Click" occurs in the Subject.
    The learned rule is the disjunction of the three statements above. There is an initial set of rules too.
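    A rule set of this kind can be represented as a disjunction of conjunctions of word tests. The sketch below uses hypothetical rules in that spirit; the exact rules on the slide are only partly recoverable:

```python
# Each rule is a conjunction of word tests on a field; the rule set is their disjunction.
# These rules are illustrative, not the exact rules from the slide.
rules = [
    {"field": "subject", "words": ["spam"]},
    {"field": "subject", "words": ["free", "spam"]},
    {"field": "subject", "words": ["gift!!", "click"]},
]

def is_spam(message: dict) -> bool:
    """Predict spam if any rule fires, i.e. all of its words occur in the given field."""
    for rule in rules:
        tokens = set(message.get(rule["field"], "").lower().split())
        if all(word in tokens for word in rule["words"]):
            return True
    return False

print(is_spam({"subject": "Click now for your Gift!!"}))   # True (third rule fires)
```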

  • Constructing the Rule Set
    The initial rule set is constructed using a greedy strategy, based on IREP (Incremental Reduced Error Pruning).
    To construct a new rule, the dataset is partitioned into two parts: a training (growing) set and a pruning set.
    Each time, a single condition is added to the rule.

  • Simplification and Optimization
    At every step, the density of positive examples covered is increased, as sketched below.
    Adding conditions stops when the clause covers no negative example or there is no positive gain.
    After this, pruning (i.e. simplification) is done, again following a greedy strategy at every stage.
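    A hedged sketch of the greedy growing step: repeatedly add the word test that most increases the density of covered positive examples, stopping when no negatives are covered or no test improves the density. This is a simplified density-based criterion, not necessarily the exact gain function RIPPER uses; `covers` is an assumed helper:

```python
def grow_rule(pos, neg, vocabulary, covers):
    """Greedily add word tests to a rule (a set of required words).

    pos/neg: growing-set examples; covers(rule, ex) is assumed to test whether
    every word in `rule` occurs in the example.
    """
    rule = set()
    while True:
        p = [ex for ex in pos if covers(rule, ex)]
        n = [ex for ex in neg if covers(rule, ex)]
        if not n:                        # clause covers no negative example
            break
        density = len(p) / (len(p) + len(n))
        best_word, best_density = None, density
        for w in vocabulary:
            cand = rule | {w}
            cp = sum(1 for ex in p if covers(cand, ex))
            cn = sum(1 for ex in n if covers(cand, ex))
            if cp + cn and cp / (cp + cn) > best_density:
                best_word, best_density = w, cp / (cp + cn)
        if best_word is None:            # no positive gain from any remaining test
            break
        rule.add(best_word)
    return rule
```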

  • Reaching Sufficient Rules
    During pruning, the condition whose deletion maximizes the pruning function (see the sketch below) is deleted,

    where U+ and U- denote the positive and negative examples covered on the pruning set. Rules continue to be added as long as there is information gain, so that every rule covers positive examples. But if the data is noisy, the number of rules grows.
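    The exact pruning function from the slide did not survive extraction. As a hedged sketch, Cohen's RIPPER prunes by greedily deleting final conditions so as to maximize a value of the form (p - n) / (p + n) on the pruning set, where p and n are the positive and negative pruning examples the rule still covers; `covers` is an assumed helper:

```python
def prune_value(rule, prune_pos, prune_neg, covers):
    """(p - n) / (p + n) over the pruning set; higher is better.

    Mirrors RIPPER-style pruning; not necessarily the exact function on the slide.
    """
    p = sum(1 for ex in prune_pos if covers(rule, ex))
    n = sum(1 for ex in prune_neg if covers(rule, ex))
    return (p - n) / (p + n) if (p + n) > 0 else 0.0
```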

  • MDL
    Several heuristics are applied to solve this problem; MDL (Minimum Description Length) is one of them.
    After each rule is added, the total description length of the current rule set plus the examples is calculated.
    Adding rules stops when this length is d bits larger than the shortest length seen so far.

  • AdaBoost
    It is easy to find rules of thumb that are often correct, e.g. "if 'buy now' occurs in the message, then predict spam", but hard to find one rule that is very accurate.
    AdaBoost helps here: it is a general method for converting rough rules of thumb into a highly accurate prediction rule, by concentrating on hard examples.

  • Pictorially

  • Algorithm

    Input: S = {(x_i, y_i)}, i = 1, ..., m
    Initialize D_1(i) = 1/m for all i
    For t = 1 to T:
        h_t = WeakLearner(S, D_t)
        Choose α_t = (1/2) ln((1 - ε_t) / ε_t), where ε_t is the weighted error of h_t (proven to minimize error for the 2-class case) [2]

        Update D_{t+1}(i) = D_t(i) · exp(-α_t · y_i · h_t(x_i)) and normalize

    Final hypothesis: f(x) = Σ_t α_t · h_t(x)
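    A minimal Python sketch of the algorithm above, using one-level decision stumps over binary word features as an assumed weak learner (the slides do not specify the weak learner); labels are taken in {-1, +1}:

```python
import math

def weak_learner(X, y, D):
    """Pick the single-feature stump with the lowest weighted error under weights D."""
    best = None
    for i in range(len(X[0])):
        for sign in (+1, -1):
            h = lambda x, i=i, sign=sign: sign if x[i] == 1 else -sign
            err = sum(D[j] for j in range(len(X)) if h(X[j]) != y[j])
            if best is None or err < best[0]:
                best = (err, h)
    return best  # (epsilon_t, h_t)

def adaboost(X, y, T):
    """X: binary term vectors, y: labels in {-1, +1}. Returns f(x) = sum_t alpha_t * h_t(x)."""
    m = len(X)
    D = [1.0 / m] * m                                   # D_1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        eps, h = weak_learner(X, y, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)           # guard against log(0) / division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)         # alpha_t = (1/2) ln((1 - eps_t)/eps_t)
        D = [D[i] * math.exp(-alpha * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [w / Z for w in D]                          # normalize
        hypotheses.append((alpha, h))
    return lambda x: sum(a * h(x) for a, h in hypotheses)
```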

  • Example

  • Example

  • Accuracy

    Weighted accuracy measure: WAcc = (λ * L- + S+) / (λ * L + S), where λ is the strictness measure,
    L: # legitimate messages, S: # spam messages,
    L-: # legitimate messages classified as legitimate, S+: # spam classified as spam.

    Improving accuracy: increase λ, or introduce a threshold τ - an example is classified positive only if f(x) > τ (the default τ is zero).
    Recall: correctly predicted spam out of the number of spam messages in the corpus.
    Precision: correctly classified spam out of the number of messages predicted as spam.
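    A small sketch of these measures; `lam` (λ) is the strictness weight and `tau` (τ) the decision threshold, both names chosen here for illustration:

```python
def evaluate(scores, labels, lam=1.0, tau=0.0):
    """scores: f(x) per message; labels: 1 = spam, 0 = legitimate."""
    preds = [1 if s > tau else 0 for s in scores]                         # positive only if f(x) > tau
    L = sum(1 for y in labels if y == 0)                                  # legitimate messages
    S = sum(1 for y in labels if y == 1)                                  # spam messages
    L_ok = sum(1 for p, y in zip(preds, labels) if y == 0 and p == 0)     # legit kept as legit (L-)
    S_ok = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1)     # spam caught (S+)
    pred_spam = sum(preds)

    weighted_acc = (lam * L_ok + S_ok) / (lam * L + S)
    recall = S_ok / S if S else 0.0                       # caught spam / all spam in corpus
    precision = S_ok / pred_spam if pred_spam else 0.0    # caught spam / messages predicted as spam
    return weighted_acc, recall, precision
```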

  • Results on corpus PU1 [1]

  • Pros and Cons

    Pros:
    Fast and simple; no parameters to tune.
    Flexible: can be combined with any learning algorithm; no knowledge of the WeakLearner is needed.
    Error reduces exponentially; robust to overfitting.

    Cons:
    Data driven: requires lots of data.
    Performance depends on the WeakLearner; may fail if the WeakLearner is too weak.

  • Conclusion
    RIPPER as a text categorization algorithm works better than Naive Bayes (better for more classes); the two are comparable for spam filtering (2 classes).
    Boosting is better than any weak learner it works on.

  • References
    [1] Xavier Carreras and Lluis Marquez. Boosting Trees for Anti-Spam Email Filtering. 2001.
    [2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
    [3] David Madigan. Statistics and the War on Spam. 2004.
    [4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000. http://citeseer.ist.psu.edu/androutsopoulos00evaluation.html
    [5] William W. Cohen and Yoram Singer. Context-Sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315.