SPAM FILTERING ALGORITHM.pptx


Transcript of SPAM FILTERING ALGORITHM.pptx

  • 8/14/2019 SPAM FILTERING ALGORITHM.pptx


    What is Spam?

    Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.

    Purpose of spam:

    Delivering information to the recipient that contains a payload such as:

    Advertising for a (likely worthless, illegal, or non-existent) product,

    Bait for a fraud scheme,

    Promotion of a cause, or

    Computer malware designed to hijack the recipient's computer.

    As it is so cheap to send information, only a very small fraction of targeted recipients, perhaps 1 in 10,000 or fewer, need to receive and respond to the payload for spam to be profitable to its sender.

    Problems faced due to spam:

    Large amounts of spam traffic between servers cause delays in the delivery of legitimate e-mail.

    People with dial-up Internet access have to spend bandwidth downloading junk mail.

    Sorting out the unwanted messages takes time and introduces a risk of deleting normal mail by mistake.

    Methods for dealing with spam:

    SOCIAL METHODS:

    Legal measures: e.g. anti-spam law introduced in the US

    Plain personal involvement: e.g. never respond to spam, never publish your email address on web pages, never forward chain letters

    TECHNOLOGICAL METHODS:

    Blocking the spammer's IP address

    Email filtering

    Email Filtering:

    Two general approaches to mail filtering are:

    Knowledge engineering

    Machine learning

    Knowledge engineering

    A set of rules is created according to which messages are categorized as spam or legitimate mail.

    Ex. A typical rule of this kind could look like: if the subject of a message contains the text "BUY NOW", then the message is spam.

    These rules are created either by the user of the filter or by some other authority (e.g. the software company that provides a particular rule-based spam-filtering tool).
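A rule of the kind described above can be sketched in a few lines of Python. This is only an illustration: the `RULES` table, the `is_spam` function, and the dictionary layout of a message are hypothetical names chosen here, not part of any particular filtering tool.

```python
import re

# Hypothetical hand-written rules: each rule is a (message field, regex) pair.
RULES = [
    ("subject", re.compile(r"\bBUY NOW\b")),
]

def is_spam(message: dict) -> bool:
    """Return True if any rule matches the corresponding field of the message."""
    return any(pattern.search(message.get(field, "")) for field, pattern in RULES)

print(is_spam({"subject": "BUY NOW and save", "body": "..."}))  # True
print(is_spam({"subject": "Meeting notes", "body": "..."}))     # False
```

Every new spam pattern requires another entry in the rule table, which is the maintenance burden that rule-based filtering implies.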

    Drawbacks of knowledge engineering:

    The set of rules must be constantly updated, and maintaining it is not convenient for most users.

    When the rules are publicly available, the spammer has the ability to adjust the text of his message so that it passes through the filter.

    Machine Learning:

    A set of pre-classified documents (training samples) is needed.

    A specific algorithm is then used to learn the classification rules from this data.

    Problem Statement:

    To obtain a spam filter, that is, a decision function f that tells us whether a given e-mail message m is spam (S) or legitimate mail (L).

    If we denote the set of all e-mail messages by M, we search for a function

    $f : M \to \{S, L\}$.

    SPAM FILTERING ALGORITHMS

    Naïve Bayesian Classifier

    k-Nearest Neighbours Classifier

    Artificial neural networks: the Perceptron and the Multilayer Perceptron

    Support Vector Machine

    Naïve Bayesian Classifier

    We have two categories of messages: spam (S) and legitimate mail (L).

    x is the feature vector of a message, that is, a vector of the numbers of occurrences of certain words in the message.

    P(x | c) denotes the probability of obtaining a message with feature vector x from category c.

    The aim of a spam filter is to determine P(c | x): given a message x, which category c produced it.

    Using Bayes' rule we get:

    $$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} = \frac{P(x \mid c)\,P(c)}{P(x \mid S)\,P(S) + P(x \mid L)\,P(L)}$$

    where P(x) denotes the a priori probability of message x and P(c) denotes the a priori probability of class c.

    The final classification rule is in the form of a likelihood ratio:

    $$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L), \\ L & \text{otherwise,} \end{cases}$$

    where k is the parameter that specifies how dangerous it is to misclassify legitimate mail as spam.

    The greater k is, the fewer false positives the classifier will produce.
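To see why the rule is a statement about a likelihood ratio, divide both sides of the inequality by $P(x \mid L)\,P(S)$, assuming both factors are positive. The symbol $\Lambda(x)$ for the ratio is a notation chosen here for illustration:

$$P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L) \iff \Lambda(x) = \frac{P(x \mid S)}{P(x \mid L)} > k\,\frac{P(L)}{P(S)}$$

For example, with equal priors $P(S) = P(L) = 0.5$ and $k = 10$, the threshold is $10 \cdot 0.5 / 0.5 = 10$, so a message is classified as spam only if its feature vector is more than ten times as likely under the spam model as under the legitimate-mail model.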

    Algorithm:

    Training:

    For the given training set of pairs (x, c), calculate

    $$\Lambda(x) = \frac{P(x \mid S)}{P(x \mid L)} \quad \text{and} \quad k\,\frac{P(L)}{P(S)}$$

    Classification:

    Given a message m, determine x and retrieve the stored value for $\Lambda(x)$.

    Use the decision rule to determine the category of message m:

    $$c = \begin{cases} S & \text{if } P(x \mid S)\,P(S) > k\,P(x \mid L)\,P(L), \\ L & \text{otherwise.} \end{cases}$$
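The training and classification steps above can be sketched in Python as follows. The slides do not say how P(x | c) is estimated, so this sketch assumes the usual naïve independence of word occurrences with add-one smoothing, and it compares logarithms to avoid numeric underflow. The function names (`train`, `log_joint`, `classify`) and the toy messages are hypothetical.

```python
import math
from collections import Counter

def train(samples):
    """samples: list of (text, label) pairs, label being 'S' (spam) or 'L' (legitimate)."""
    word_counts = {"S": Counter(), "L": Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["S"]) | set(word_counts["L"])
    return word_counts, class_counts, vocab

def log_joint(text, label, model):
    """log(P(x | c) P(c)), assuming independent words and add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in text.lower().split():
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def classify(text, model, k=1.0):
    """Return 'S' if P(x|S)P(S) > k P(x|L)P(L), otherwise 'L'."""
    spam_score = log_joint(text, "S", model)
    legit_score = math.log(k) + log_joint(text, "L", model)
    return "S" if spam_score > legit_score else "L"

model = train([
    ("buy cheap pills now", "S"),
    ("cheap offer buy now", "S"),
    ("meeting agenda for monday", "L"),
    ("lunch on monday", "L"),
])
print(classify("buy cheap pills", model))         # S
print(classify("buy cheap pills", model, k=100))  # L
print(classify("lunch monday", model))            # L
```

The second call shows the effect of k: the same message is let through once misclassifying legitimate mail is treated as 100 times more costly.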

    Advantages:

    Conceptually very easy to understand

    Very effective (filters more than 99.691%)

    Everyone's filter is essentially customized, making it difficult for spammers to defeat everyone's filter with a single message

    Disadvantages:

    We need to have a collection of spam and legitimate mail to initialize the filter.

    Initialization is a bit time-consuming.

    For each message, a user-specific database of word probabilities has to be consulted.

    False positives do happen (though rarely).

    k-Nearest Neighbours Classifier

    Training:

    Store the training messages.

    Classification:

    Given a message x, determine its k nearest neighbours among the messages in the training set. If l or more messages among the k nearest neighbours of x are spam, classify x as spam; otherwise classify it as legitimate mail.

    The parameter l is used for controlling the number of false positives.

    We use Euclidean distances for determining the nearest neighbours.

    We have to calculate distances to all training messages and find the k nearest neighbours. This may take about O(nm) time for a training set of n messages containing feature vectors with m elements.
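A minimal Python sketch of this classifier, assuming messages have already been turned into numeric feature vectors. The names `euclidean` and `knn_classify` and the two-element toy vectors are hypothetical. Note that this sketch sorts all distances, which adds an O(n log n) term on top of the O(nm) distance computation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x, training, k=3, l=2):
    """training: list of (feature_vector, label). Spam if at least l of the k nearest are spam."""
    nearest = sorted(training, key=lambda sample: euclidean(x, sample[0]))[:k]
    spam_votes = sum(1 for _, label in nearest if label == "S")
    return "S" if spam_votes >= l else "L"

# Toy feature vectors: [occurrences of "buy", occurrences of "meeting"].
training = [([3, 0], "S"), ([2, 0], "S"), ([0, 2], "L"), ([0, 3], "L"), ([1, 1], "L")]
print(knn_classify([2, 1], training, k=3, l=2))  # S
print(knn_classify([2, 1], training, k=3, l=3))  # L
```

Raising l from 2 to 3 flips the decision for the same message, which is how l trades missed spam against false positives.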

    Artificial Neural Networks

    Artificial neural networks are models inspired by animal central nervous systems that are capable of machine learning and pattern recognition.

    They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network. There are two kinds of neural networks generally used:

    The perceptron, and

    The multilayer perceptron.

    The Perceptron

    The idea of the perceptron is to find a linear function of the feature vector, $f(x) = w^{T}x + b$, such that $f(x) > 0$ for vectors of one class and $f(x) < 0$ for vectors of the other class.

    $w = (w_1, w_2, \ldots, w_m)$ is the vector of coefficients (weights) of the function, and b is the so-called bias.

    Algorithm:

    Training:

    1. Initialize w and b (to random values or to 0).

    2. Find a training example (x, c), with the class encoded as c = +1 or c = -1, for which $\operatorname{sign}(w^{T}x + b) \neq c$. If there is no such example, training is completed: store the final w and b and stop. Otherwise go to the next step.

    3. Update (w, b): $w := w + cx$, $b := b + c$.

    4. Go back to step 2.

    Classification:

    Given a message x, determine its class as $\operatorname{sign}(w^{T}x + b)$.
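The training loop above can be sketched in Python as follows, with classes encoded as +1 and -1. The `max_passes` cap is an addition made here, since the loop as described never terminates when the training data is not linearly separable. The function names and the toy vectors are hypothetical.

```python
def train_perceptron(samples, max_passes=1000):
    """samples: list of (feature_vector, c) with c in {+1, -1}. Returns (w, b)."""
    m = len(samples[0][0])
    w, b = [0.0] * m, 0.0
    for _ in range(max_passes):  # safety cap, not part of the original algorithm
        updated = False
        for x, c in samples:
            # c * f(x) <= 0 means sign(w.x + b) differs from c (or f(x) is exactly 0).
            if c * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + c * xi for wi, xi in zip(w, x)]
                b += c
                updated = True
        if not updated:  # no misclassified example left: training is completed
            break
    return w, b

def perceptron_classify(w, b, x):
    """Return +1 or -1 according to the sign of w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: +1 for spam, -1 for legitimate mail.
samples = [([3, 0], 1), ([2, 1], 1), ([0, 2], -1), ([1, 3], -1)]
w, b = train_perceptron(samples)
print(all(perceptron_classify(w, b, x) == c for x, c in samples))  # True
```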

    Multilayer Perceptron

    A multilayer perceptron is a function that may be visualized as a network with several layers of neurons, connected in a feed-forward manner.

    The neurons in the first layer are called input neurons and represent the input variables.

    The neurons in the last layer are called output neurons and provide the function's result value.

    The layers between the first and the last are called hidden layers.

    Each neuron in the network is similar to a perceptron: it takes input values $x_1, x_2, \ldots, x_k$ and calculates its output value o by the formula

    $$o = \varphi\left(\sum_{i=1}^{k} w_i x_i + b\right)$$

    where $w_i$ and b are the weights and the bias of the neuron and $\varphi$ is a certain nonlinear function. Most often $\varphi(x)$ is either $1/(1 + e^{-ax})$ or $\tanh(x)$.
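The neuron formula and the layer-by-layer evaluation can be sketched in Python as below. The sketch assumes the usual logistic function 1/(1 + e^(-az)) as the nonlinearity, and the layout of `layers` (a list of layers, each a list of per-neuron `(weights, bias)` pairs) is a hypothetical representation chosen for illustration.

```python
import math

def sigmoid(z, a=1.0):
    """Logistic nonlinearity 1 / (1 + e^(-a z)); an assumed choice of activation."""
    return 1.0 / (1.0 + math.exp(-a * z))

def neuron(inputs, weights, bias, phi=sigmoid):
    """Output of a single neuron: phi(sum_i w_i x_i + b)."""
    return phi(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(x, layers):
    """Feed x through the network; each layer is a list of (weights, bias) pairs."""
    for layer in layers:
        x = [neuron(x, weights, bias) for weights, bias in layer]
    return x

# Two inputs, one hidden layer with two neurons, one output neuron.
layers = [
    [([0.0, 0.0], 0.0), ([0.0, 0.0], 0.0)],  # hidden layer: both neurons output 0.5
    [([1.0, 1.0], -1.0)],                    # output neuron sees 0.5 + 0.5 - 1 = 0
]
print(forward([1.0, 0.0], layers))  # [0.5]
```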

    Training of the multilayer perceptron means searching for such weights and biases of all the neurons for which the network will have as small an error on the training set as possible.

    Total training error:

    $$E(f) = \sum_{i} |f(x_i) - c_i|^2,$$

    where $(x_i, c_i)$ are training samples.
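As a small illustration, the total training error can be computed as a sum of squared differences between the network output and the target. The sketch assumes a scalar output f(x) and numeric targets; `training_error` and the constant stand-in classifier are hypothetical.

```python
def training_error(f, samples):
    """Sum over training samples (x_i, c_i) of |f(x_i) - c_i|^2, for scalar outputs."""
    return sum((f(x) - c) ** 2 for x, c in samples)

# Stand-in for a network that always outputs 0.5, with targets 1 (spam) and 0 (legitimate).
constant_half = lambda x: 0.5
print(training_error(constant_half, [([1, 0], 1), ([0, 1], 0)]))  # 0.5
```

Training would then adjust the weights and biases to reduce this value; the slides do not specify the optimisation procedure, so none is shown here.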