Bayesian Learning: Application to Text Classification (Example: Spam Filtering)

Page 1

Bayesian Learning

Application to Text Classification

Example: spam filtering

Kunstmatige Intelligentie / RuG

Marius Bulacu & prof. dr. Lambert Schomaker

KI2 - 3

Page 2

Founders of Probability Theory

Blaise Pascal

(1623-1662, France)

Pierre Fermat

(1601-1665, France)

They laid the foundations of probability theory in a correspondence about a dice game.

Page 3

Prior, Joint and Conditional Probabilities

P(A, B) = joint probability of A and B
P(A | B) = conditional (posterior) probability of A given B
P(B | A) = conditional (posterior) probability of B given A
P(A) = prior probability of A
P(B) = prior probability of B

Page 4

Probability Rules

Product rule: P(A, B) = P(A | B) P(B)

or equivalently

P(A, B) = P(B | A) P(A)

Sum rule:

P(A) = Σ_B P(A, B) = Σ_B P(A | B) P(B)

if A is conditioned on B, then the total probability of A is the sum of its joint probabilities over all values of B
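These two rules can be checked numerically; here is a minimal Python sketch using a small made-up joint distribution (not from the slides):

```python
# Minimal sketch: product and sum rules on a small made-up joint distribution.
# P[a][b] holds the joint probability P(A=a, B=b).
P = {
    "a1": {"b1": 0.10, "b2": 0.30},
    "a2": {"b1": 0.20, "b2": 0.40},
}

# Sum rule: P(A=a) = sum over b of P(A=a, B=b)
P_A = {a: sum(row.values()) for a, row in P.items()}
P_B = {b: sum(P[a][b] for a in P) for b in ("b1", "b2")}

# Product rule: P(A=a, B=b) = P(A=a | B=b) P(B=b)
P_A_given_B = {a: {b: P[a][b] / P_B[b] for b in P_B} for a in P}
for a in P:
    for b in P_B:
        assert abs(P[a][b] - P_A_given_B[a][b] * P_B[b]) < 1e-12

print(P_A)  # marginal of A: {'a1': 0.4, 'a2': ~0.6}
print(P_B)  # marginal of B: {'b1': ~0.3, 'b2': 0.7}
```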

Page 5

Statistical Independence

Two random variables A and B are independent iff:

P(A, B) = P(A) P(B)

P(A | B) = P(A)

P(B | A) = P(B)

knowing the value of one variable does not yield any information about the value of the other

Page 6

Statistical Dependence - Bayes

Thomas Bayes

(1702-1761, England)

“Essay towards solving a problem in the doctrine of chances” published in the Philosophical Transactions of the Royal Society of London in 1764.

Page 7

Bayes Theorem

P(A|B) = P(A, B) / P(B)
P(B|A) = P(A, B) / P(A)

=> P(A, B) = P(A|B) P(B) = P(B|A) P(A)

=> P(A|B) = P(B|A) P(A) / P(B)

Page 8

Bayes Theorem - Causality

Diagnostic:

P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)

Pattern Recognition:

P(Class|Feature) = P(Feature|Class) P(Class) / P(Feature)

P(A|B) = P(B|A) P(A) / P(B)

Page 9

Bayes Formula and Classification

p(C|X) = p(X|C) p(C) / p(X)

- p(C|X): posterior probability of the class after seeing the data
- p(X|C): conditional likelihood of the data given the class
- p(C): prior probability of the class before seeing anything
- p(X): unconditional probability of the data

Page 10

Medical example

p(+disease) = 0.002

p(+test | +disease) = 0.97

p(+test | -disease) = 0.04

p(+test) = p(+test | +disease) * p(+disease) + p(+test | -disease) * p(-disease)
         = 0.97 * 0.002 + 0.04 * 0.998 = 0.00194 + 0.03992 = 0.04186

p(+disease | +test) = p(+test | +disease) * p(+disease) / p(+test)
                    = 0.97 * 0.002 / 0.04186 = 0.00194 / 0.04186 = 0.046

p(-disease | +test) = p(+test | -disease) * p(-disease) / p(+test)
                    = 0.04 * 0.998 / 0.04186 = 0.03992 / 0.04186 = 0.954
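The arithmetic above is easy to reproduce; a minimal Python sketch of the same calculation:

```python
# Minimal sketch reproducing the medical example above.
p_disease = 0.002
p_pos_given_disease = 0.97      # test sensitivity
p_pos_given_healthy = 0.04      # false-positive rate

# Sum rule: total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes theorem: posteriors given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
p_healthy_given_pos = p_pos_given_healthy * (1 - p_disease) / p_pos

print(f"p(+test)            = {p_pos:.5f}")                # 0.04186
print(f"p(+disease | +test) = {p_disease_given_pos:.3f}")  # 0.046
print(f"p(-disease | +test) = {p_healthy_given_pos:.3f}")  # 0.954
```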

Page 11

MAP Classification

p(C1|x) ∝ p(x|C1) p(C1)
p(C2|x) ∝ p(x|C2) p(C2)

[Figure: the curves p(x|C1) p(C1) and p(x|C2) p(C2) plotted against x, with the decision threshold at their crossing point]

To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, i.e. assign x to class C1 if:

p(C1|x) > p(C2|x)  <=>  p(x|C1) p(C1) > p(x|C2) p(C2)

Therefore we must place the decision boundary where the two posterior probability distributions cross each other.
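A minimal sketch of this decision rule, assuming (purely for illustration) Gaussian class-conditional densities and made-up priors; the crossing point of the two weighted curves acts as the decision threshold:

```python
# Minimal MAP decision sketch: two classes with assumed Gaussian
# class-conditional densities and made-up priors.
from statistics import NormalDist

p_x_given_C1 = NormalDist(mu=0.0, sigma=1.0)   # assumed p(x | C1)
p_x_given_C2 = NormalDist(mu=3.0, sigma=1.5)   # assumed p(x | C2)
prior = {"C1": 0.7, "C2": 0.3}                 # assumed p(C1), p(C2)

def map_class(x: float) -> str:
    """Assign x to the class maximizing p(x|C) p(C)."""
    scores = {
        "C1": p_x_given_C1.pdf(x) * prior["C1"],
        "C2": p_x_given_C2.pdf(x) * prior["C2"],
    }
    return max(scores, key=scores.get)

print(map_class(0.5))   # C1: left of the decision boundary
print(map_class(4.0))   # C2: right of the decision boundary
```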

Page 12

Maximum Likelihood Classification

When the prior class distributions are not known, or for equal (non-informative) priors:

p(x|C1)p(C1) > p(x|C2)p(C2)

becomes

p(x|C1) > p(x|C2)

Therefore assign the input x to the class with the Maximum Likelihood to have generated it.

Page 13

Continuous Features

Two methods for dealing with continuous-valued features:

1) Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology.

2) Mixture of Gaussians: make an assumption regarding the functional form of the PDF (linear combination of Gaussians) and derive the corresponding parameters (means and standard deviations). Both approaches are sketched below.
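A minimal sketch of both approaches; the training samples are invented, and the Gaussian variant fits a single Gaussian rather than a full mixture:

```python
# Sketch of the two approaches for one continuous feature of one class.
# The training samples are invented; a single Gaussian is fitted instead
# of a full mixture of Gaussians.
from statistics import NormalDist

samples_C1 = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4]   # invented data for class C1

# 1) Binning: a normalized histogram used as a discrete likelihood P(bin | C1).
def binned_likelihood(samples, lo, hi, n_bins):
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return [c / len(samples) for c in counts]

hist_C1 = binned_likelihood(samples_C1, lo=0.0, hi=2.0, n_bins=10)

# 2) Gaussian assumption: estimate mean and standard deviation from the samples.
gauss_C1 = NormalDist.from_samples(samples_C1)           # p(x | C1) as a Gaussian

print(hist_C1)
print(gauss_C1.mean, gauss_C1.stdev, gauss_C1.pdf(1.2))
```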

Page 14

Accumulation of Evidence

Bayesian inference allows for integrating prior knowledge about the world (beliefs being expressed in terms of probabilities) with new incoming data.

Different forms of data (possibly incommensurable) can be fused towards the final decision using the “common currency” of probability.

As the new data arrives, the latest posterior becomes the new prior for interpreting the new input.

p(C|X,Y) ∝ p(C,X,Y) = p(C) p(X,Y|C) = p(C) p(X|C) p(Y|C,X)

p(C|X,Y,Z) ∝ p(C) p(X|C) p(Y|C,X) p(Z|C,X,Y) = ...

prior → new prior → new prior: after each factor, the product accumulated so far acts as the prior for interpreting the next observation.
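A minimal sketch of this sequential updating, additionally assuming (as the naive Bayes slides below do) that the observations are conditionally independent given the class; all numbers are invented:

```python
# Sketch of accumulating evidence: the posterior after each observation
# becomes the prior for the next one. The observations are assumed to be
# conditionally independent given the class; all numbers are invented.
prior = {"C1": 0.5, "C2": 0.5}

likelihood_X = {"C1": 0.8, "C2": 0.3}   # p(X | class) for the first observation
likelihood_Y = {"C1": 0.4, "C2": 0.9}   # p(Y | class) for the second observation

def update(prior, likelihood):
    """One Bayesian update: posterior ~ likelihood * prior, then normalize."""
    unnorm = {c: likelihood[c] * prior[c] for c in prior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

posterior_after_X = update(prior, likelihood_X)           # becomes the new prior
posterior_after_XY = update(posterior_after_X, likelihood_Y)

print(posterior_after_X)    # X favours C1
print(posterior_after_XY)   # Y pulls the belief back toward C2
```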

Page 15

Example: temperature classification

Classes C:

Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)

[Figure: class-conditional likelihoods P(x|C), P(x|N), P(x|W), P(x|H) and the overall likelihood P(x) of the x values, plotted against temperature x]

Page 16

Bayes: probability “blow up”

Classes C:

Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)

[Figure: posterior probabilities P(C|x), P(N|x), P(W|x), P(H|x) plotted against temperature x]

Page 17

P(C|x) = P(x|C) P(C) / P(x)

[Figure: the class-conditional likelihood P(x|C) (in) and the resulting posterior P(C|x) (out)]

The Bayesian output has a nice plateau, even with an irregular PDF shape.

Page 18

Puzzle

So if Bayes is optimal and can be used for continuous data too, why has it become popular so late, i.e., much later than neural networks?

Page 19

Why Bayes has become popular so late…

[Figure: a one-dimensional histogram P(x) over the feature x]

Note: the example was 1-dimensional.

A PDF (histogram) with 100 bins for one dimension will cost 10000 bins for two dimensions, etc.

Ncells = Nbins^ndims
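The blow-up is easy to verify numerically, assuming 100 bins per dimension as in the example above:

```python
# Number of histogram cells: Ncells = Nbins ^ ndims
n_bins = 100
for n_dims in range(1, 6):
    print(n_dims, "dim ->", n_bins ** n_dims, "cells")
# grows from 100 cells (1-D) to 10**10 cells (5-D)
```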

Page 20

Why Bayes has become popular so late…

Ncells = Nbins^ndims

Yes… but you could use n-dimensional theoretical distributions (Gauss, Weibull etc.) instead of empirically measured PDFs…

Page 21

Why Bayes has become popular so late…

… use theoretical distributions instead of empirically measured PDFs …

still the dimensionality is a problem:

- ~20 samples needed to estimate a 1-dim. Gaussian PDF
- ~400 samples needed to estimate a 2-dim. Gaussian, etc.

massive amounts of labeled data are needed to estimate probabilities reliably!

Page 22

Labeled (ground truthed) data

Example: client evaluation in insurance

0.1 0.54 0.53 0.874 8.455 0.001 –0.111 risk

0.2 0.59 0.01 0.974 8.40 0.002 –0.315 risk

0.11 0.4 0.3 0.432 7.455 0.013 –0.222 safe

0.2 0.64 0.13 0.774 8.123 0.001 –0.415 risk

0.1 0.17 0.59 0.813 9.451 0.021 –0.319 risk

0.8 0.43 0.55 0.874 8.852 0.011 –0.227 safe

0.1 0.78 0.63 0.870 8.115 0.002 –0.254 risk

. . . . . . . .

Page 23

Success of speech recognition

- massive amounts of data
- increased computing power
- cheap computer memory

allowed for the use of Bayes in hidden Markov models for speech recognition.

Similarly (but slower): application of Bayes in script recognition.

Page 24

Global structure:

- year
- title
- date
- date and number of entry (Rappt)
- redundant lines between paragraphs
- jargon words: Notificatie, Besluit, fiat
- imprint with page number

XML model

Page 25

Local probabilistic structure:

P(“Novb 16 is a date” | “sticks out to the left” & is left of “Rappt ”) ?

Page 26

Naive Bayes: Conditional Independence

Naive Bayes assumes the attributes (features) are conditionally independent given the class:

p(X,Y|C) = p(X|C) p(Y|C)

or

p(x1, ..., xn|C) = Π_i p(xi|C)

Often works surprisingly well in practice despite its manifest simplicity.

Page 27

Accumulation of Evidence – Independence

p(C|X,Y) = p(X,Y|C) p(C) / p(X,Y)

= p(Y|C,X) p(X|C) p(C) / ( p(Y|X) p(X) )

= p(Y|C,X) p(C|X) / p(Y|X)

≈ p(Y|C) p(C|X) / p(Y)

using the “naive” assumption that X and Y are independent.

Page 28

The Naive Bayes Classifier

Assume that each sample x to be classified is described by the attributes a1, a2, ..., an.

The most probable (MAP) classification for x is:

class(x) = argmax_ci P(ci | a1, a2, ..., an)
         = argmax_ci P(a1, a2, ..., an | ci) P(ci) / P(a1, a2, ..., an)
         = argmax_ci P(a1, a2, ..., an | ci) P(ci)

Naive Bayes independence assumption:

P(a1, a2, ..., an | ci) = Π_j P(aj | ci)

Therefore:

class(x) = argmax_ci P(ci) Π_j P(aj | ci)
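The final formula maps directly onto code; a minimal sketch with made-up probability tables, computed in log space to avoid numerical underflow:

```python
# Minimal sketch of the naive Bayes MAP rule:
#   class(x) = argmax_ci P(ci) * prod_j P(aj | ci)
# The probability tables are made up; log space avoids numerical underflow.
import math

priors = {"c1": 0.6, "c2": 0.4}
# likelihoods[c][j][value] = P(attribute j takes this value | class c)
likelihoods = {
    "c1": [{"low": 0.7, "high": 0.3}, {"red": 0.2, "blue": 0.8}],
    "c2": [{"low": 0.4, "high": 0.6}, {"red": 0.9, "blue": 0.1}],
}

def classify(attributes):
    """Return the class maximizing log P(c) + sum_j log P(a_j | c)."""
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for j, a in enumerate(attributes):
            log_score += math.log(likelihoods[c][j][a])
        scores[c] = log_score
    return max(scores, key=scores.get)

print(classify(["low", "blue"]))    # -> c1
print(classify(["high", "red"]))    # -> c2
```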

Page 29

Learning to Classify Text

Representation: each electronic document is represented by the set of words that it contains under the independence assumptions

- order of words does not matter

- co-occurrences of words do not matter

i.e. each document is represented as a “bag of words”.

Learning: estimate from the training dataset of documents

- the prior class probability P(ci)

- the conditional likelihood P(wj|ci) of a word wj given the document class ci

Classification: maximum a posteriori (MAP)

Page 30

Learning to Classify e-mail

Is this e-mail spam? e-mail → {spam, ham}

Each word represents an attribute characterizing the e-mail.

Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all the encountered words.

For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis.
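A minimal end-to-end sketch of such a filter; the toy training corpus, the whitespace tokenizer, and the Laplace smoothing constant are all invented for illustration and are not taken from the slides:

```python
# Minimal naive Bayes spam-filter sketch: bag-of-words representation,
# Laplace-smoothed word likelihoods, MAP classification.
# The toy corpus and whitespace tokenizer are invented for illustration.
import math
from collections import Counter

train = [
    ("claim your free credit card now", "spam"),
    ("unsecured platinum card with free limit", "spam"),
    ("call for papers workshop on document analysis", "ham"),
    ("workshop proceedings will be published by springer", "ham"),
]

# Estimate class priors and per-class word counts from the training set.
class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())       # bag of words: order ignored

vocab = {w for counts in word_counts.values() for w in counts}

def log_word_likelihood(word, c, alpha=1.0):
    """Laplace-smoothed log P(word | class)."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def classify(text):
    """MAP class under the naive Bayes independence assumption."""
    scores = {}
    for c, n_docs in class_counts.items():
        score = math.log(n_docs / len(train))      # log prior P(c)
        for w in text.split():
            if w in vocab:                         # ignore words never seen in training
                score += log_word_likelihood(w, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("free platinum credit card"))       # -> spam
print(classify("workshop on document analysis"))   # -> ham
```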

Page 31

Spam filtering

Example of regular mail:

From [email protected] Mon Nov 10 19:23:44 2003
Return-Path: <[email protected]>
Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100

Apologies for multiple postings.
> 2nd C a l l  f o r  P a p e r s
>
> DAS 2004
>
> Sixth IAPR International Workshop on
> Document Analysis Systems
>
> September 8-10, 2004
>
> Florence, Italy
>
> http://www.dsi.unifi.it/DAS04
>
> Note:
> There are two main additions with respect to the previous CFP:
> 1) DAS&DL data are now available on the workshop web site
> 2) Proceedings will be published by Springer Verlag in LNCS series

Page 32

Spam filtering

Example of spam:

From: "Easy Qualify" <[email protected]>
To: [email protected]
Subject: Claim your Unsecured Platinum Card - 75OO dollar limit
Date: Tue, 28 Oct 2003 17:12:07 -0400

==================================================
mbulacu - Tuesday, Oct 28, 2003
==================================================

Congratulations, you have been selected for an Unsecured Platinum Credit Card / $7500 starting credit limit.

This offer is valid even if you've had past credit problems or even no credit history. Now you can receive a $7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.

Page 33

Conclusions

- Effective: about 90% correct classification
- Could be applied to any text classification problem
- Needs to be polished

Page 34

Summary

Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.

Inductive bias of Naive Bayes: the attributes are independent. Although this assumption is often violated, it yields a very efficient and widely used tool (e.g. text classification, spam filtering).

Applicable to discrete or continuous data.