Bayesian Learning: Application to Text Classification (Example: Spam Filtering)

Page 1

Bayesian Learning

Application to Text Classification

Example: spam filtering

Kunstmatige Intelligentie / RuG

Marius Bulacu & prof. dr. Lambert Schomaker

KI2 - 3

Page 2

Founders of Probability Theory

Blaise Pascal

(1623-1662, France)

Pierre Fermat

(1601-1665, France)

They laid the foundations of probability theory in a correspondence about a dice game.

Page 3

Prior, Joint and Conditional Probabilities

P(A, B) = joint probability of A and B
P(A | B) = conditional (posterior) probability of A given B
P(B | A) = conditional (posterior) probability of B given A
P(A) = prior probability of A
P(B) = prior probability of B

Page 4

Probability Rules

Product rule: P(A, B) = P(A | B) P(B)

or equivalently

P(A, B) = P(B | A) P(A)

Sum rule:

P(A) = Σ_B P(A, B) = Σ_B P(A | B) P(B)

if A is conditioned on B, then the total probability of A is the sum of its joint probabilities over all values of B
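These two rules can be checked numerically; here is a minimal Python sketch using a small made-up joint distribution (not from the slides):

```python
# Minimal sketch: product and sum rules on a small made-up joint distribution.
# P[a][b] holds the joint probability P(A=a, B=b).
P = {
    "a1": {"b1": 0.10, "b2": 0.30},
    "a2": {"b1": 0.20, "b2": 0.40},
}

# Sum rule: P(A=a) = sum over b of P(A=a, B=b)
P_A = {a: sum(row.values()) for a, row in P.items()}
P_B = {b: sum(P[a][b] for a in P) for b in ("b1", "b2")}

# Product rule: P(A=a, B=b) = P(A=a | B=b) P(B=b)
P_A_given_B = {a: {b: P[a][b] / P_B[b] for b in P_B} for a in P}
for a in P:
    for b in P_B:
        assert abs(P[a][b] - P_A_given_B[a][b] * P_B[b]) < 1e-12

print(P_A)  # marginal of A: {'a1': 0.4, 'a2': ~0.6}
print(P_B)  # marginal of B: {'b1': ~0.3, 'b2': 0.7}
```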

Page 5

Statistical Independence

Two random variables A and B are independent iff:

P(A, B) = P(A) P(B)

P(A | B) = P(A)

P(B | A) = P(B)

knowing the value of one variable does not yield any information about the value of the other

Page 6

Statistical Dependence - Bayes

Thomas Bayes

(1702-1761, England)

“Essay towards solving a problem in the doctrine of chances” published in the Philosophical Transactions of the Royal Society of London in 1764.

Page 7

Bayes Theorem

P(A|B) = P(A, B) / P(B)
P(B|A) = P(A, B) / P(A)

=> P(A, B) = P(A|B) P(B) = P(B|A) P(A)

=> P(A|B) = P(B|A) P(A) / P(B)

Page 8

Bayes Theorem - Causality

Diagnostic:

P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)

Pattern Recognition:

P(Class|Feature) = P(Feature|Class) P(Class) / P(Feature)

P(A|B) = P(B|A) P(A) / P(B)

Page 9

Bayes Formula and Classification

p(C|X) = p(X|C) p(C) / p(X)

- p(C|X): posterior probability of the class after seeing the data
- p(X|C): conditional likelihood of the data given the class
- p(C): prior probability of the class before seeing anything
- p(X): unconditional probability of the data

Page 10

Medical example

p(+disease) = 0.002

p(+test | +disease) = 0.97

p(+test | -disease) = 0.04

p(+test) = p(+test | +disease) * p(+disease) + p(+test | -disease) * p(-disease)
         = 0.97 * 0.002 + 0.04 * 0.998 = 0.00194 + 0.03992 = 0.04186

p(+disease | +test) = p(+test | +disease) * p(+disease) / p(+test)
                    = 0.97 * 0.002 / 0.04186 = 0.00194 / 0.04186 = 0.046

p(-disease | +test) = p(+test | -disease) * p(-disease) / p(+test)
                    = 0.04 * 0.998 / 0.04186 = 0.03992 / 0.04186 = 0.954
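The arithmetic above is easy to reproduce; a minimal Python sketch of the same calculation:

```python
# Minimal sketch reproducing the medical example above.
p_disease = 0.002
p_pos_given_disease = 0.97      # test sensitivity
p_pos_given_healthy = 0.04      # false-positive rate

# Sum rule: total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes theorem: posteriors given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
p_healthy_given_pos = p_pos_given_healthy * (1 - p_disease) / p_pos

print(f"p(+test)            = {p_pos:.5f}")                # 0.04186
print(f"p(+disease | +test) = {p_disease_given_pos:.3f}")  # 0.046
print(f"p(-disease | +test) = {p_healthy_given_pos:.3f}")  # 0.954
```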

Page 11

MAP Classification

p(C1|x) ∝ p(x|C1) p(C1)
p(C2|x) ∝ p(x|C2) p(C2)

[Figure: the curves p(x|C1) p(C1) and p(x|C2) p(C2) plotted against x, with the decision threshold at their crossing point]

To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, i.e. assign x to class C1 if:

p(C1|x) > p(C2|x)  <=>  p(x|C1) p(C1) > p(x|C2) p(C2)

Therefore we must place the decision boundary where the two posterior probability distributions cross each other.
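A minimal sketch of this decision rule, assuming (purely for illustration) Gaussian class-conditional densities and made-up priors; the crossing point of the two weighted curves acts as the decision threshold:

```python
# Minimal MAP decision sketch: two classes with assumed Gaussian
# class-conditional densities and made-up priors.
from statistics import NormalDist

p_x_given_C1 = NormalDist(mu=0.0, sigma=1.0)   # assumed p(x | C1)
p_x_given_C2 = NormalDist(mu=3.0, sigma=1.5)   # assumed p(x | C2)
prior = {"C1": 0.7, "C2": 0.3}                 # assumed p(C1), p(C2)

def map_class(x: float) -> str:
    """Assign x to the class maximizing p(x|C) p(C)."""
    scores = {
        "C1": p_x_given_C1.pdf(x) * prior["C1"],
        "C2": p_x_given_C2.pdf(x) * prior["C2"],
    }
    return max(scores, key=scores.get)

print(map_class(0.5))   # C1: left of the decision boundary
print(map_class(4.0))   # C2: right of the decision boundary
```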

Page 12

Maximum Likelihood Classification

When the prior class distributions are not known, or for equal (non-informative) priors:

p(x|C1)p(C1) > p(x|C2)p(C2)

becomes

p(x|C1) > p(x|C2)

Therefore assign the input x to the class with the Maximum Likelihood to have generated it.

Page 13

Continuous Features

Two methods for dealing with continuous-valued features:

1) Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology.

2) Mixture of Gaussians: make an assumption regarding the functional form of the PDF (linear combination of Gaussians) and derive the corresponding parameters (means and standard deviations). Both approaches are sketched below.
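A minimal sketch of both approaches; the training samples are invented, and the Gaussian variant fits a single Gaussian rather than a full mixture:

```python
# Sketch of the two approaches for one continuous feature of one class.
# The training samples are invented; a single Gaussian is fitted instead
# of a full mixture of Gaussians.
from statistics import NormalDist

samples_C1 = [1.2, 0.8, 1.5, 1.1, 0.9, 1.3, 1.0, 1.4]   # invented data for class C1

# 1) Binning: a normalized histogram used as a discrete likelihood P(bin | C1).
def binned_likelihood(samples, lo, hi, n_bins):
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return [c / len(samples) for c in counts]

hist_C1 = binned_likelihood(samples_C1, lo=0.0, hi=2.0, n_bins=10)

# 2) Gaussian assumption: estimate mean and standard deviation from the samples.
gauss_C1 = NormalDist.from_samples(samples_C1)           # p(x | C1) as a Gaussian

print(hist_C1)
print(gauss_C1.mean, gauss_C1.stdev, gauss_C1.pdf(1.2))
```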

Page 14

Accumulation of Evidence

Bayesian inference allows for integrating prior knowledge about the world (beliefs being expressed in terms of probabilities) with new incoming data.

Different forms of data (possibly incommensurable) can be fused towards the final decision using the “common currency” of probability.

As the new data arrives, the latest posterior becomes the new prior for interpreting the new input.

p(C|X,Y) ∝ p(C,X,Y) = p(C) p(X,Y|C) = p(C) p(X|C) p(Y|C,X)

p(C|X,Y,Z) ∝ p(C) p(X|C) p(Y|C,X) p(Z|C,X,Y) = ...

prior → new prior → new prior: after each factor, the product accumulated so far acts as the prior for interpreting the next observation.
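A minimal sketch of this sequential updating, additionally assuming (as the naive Bayes slides below do) that the observations are conditionally independent given the class; all numbers are invented:

```python
# Sketch of accumulating evidence: the posterior after each observation
# becomes the prior for the next one. The observations are assumed to be
# conditionally independent given the class; all numbers are invented.
prior = {"C1": 0.5, "C2": 0.5}

likelihood_X = {"C1": 0.8, "C2": 0.3}   # p(X | class) for the first observation
likelihood_Y = {"C1": 0.4, "C2": 0.9}   # p(Y | class) for the second observation

def update(prior, likelihood):
    """One Bayesian update: posterior ~ likelihood * prior, then normalize."""
    unnorm = {c: likelihood[c] * prior[c] for c in prior}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

posterior_after_X = update(prior, likelihood_X)           # becomes the new prior
posterior_after_XY = update(posterior_after_X, likelihood_Y)

print(posterior_after_X)    # X favours C1
print(posterior_after_XY)   # Y pulls the belief back toward C2
```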

Page 15

Example: temperature classification

Classes C:

Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)

[Figure: class-conditional likelihoods P(x|C), P(x|N), P(x|W), P(x|H) and the overall likelihood P(x) of the x values, plotted against temperature x]

Page 16

Bayes: probability “blow up”

Classes C:

Cold P(x|C), Normal P(x|N), Warm P(x|W), Hot P(x|H)

[Figure: posterior probabilities P(C|x), P(N|x), P(W|x), P(H|x) plotted against temperature x]

Page 17

P(C|x) = P(x|C) P(C) / P(x)

[Figure: the class-conditional likelihood P(x|C) (in) and the resulting posterior P(C|x) (out)]

The Bayesian output has a nice plateau, even with an irregular PDF shape.

Page 18

Puzzle

So if Bayes is optimal and can be used for continuous data too, why has it become popular so late, i.e., much later than neural networks?

Page 19

Why Bayes has become popular so late…

[Figure: a one-dimensional histogram P(x) over the feature x]

Note: the example was 1-dimensional.

A PDF (histogram) with 100 bins for one dimension will cost 10000 bins for two dimensions, etc.

Ncells = Nbins^ndims
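The blow-up is easy to verify numerically, assuming 100 bins per dimension as in the example above:

```python
# Number of histogram cells: Ncells = Nbins ^ ndims
n_bins = 100
for n_dims in range(1, 6):
    print(n_dims, "dim ->", n_bins ** n_dims, "cells")
# grows from 100 cells (1-D) to 10**10 cells (5-D)
```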

Page 20

Why Bayes has become popular so late…

Ncells = Nbins^ndims

Yes… but you could use n-dimensional theoretical distributions (Gauss, Weibull etc.) instead of empirically measured PDFs…

Page 21

Why Bayes has become popular so late…

… use theoretical distributions instead of empirically measured PDFs …

still the dimensionality is a problem:

- ~20 samples needed to estimate a 1-dim. Gaussian PDF
- ~400 samples needed to estimate a 2-dim. Gaussian, etc.

massive amounts of labeled data are needed to estimate probabilities reliably!

Page 22

Labeled (ground truthed) data

Example: client evaluation in insurance

0.1 0.54 0.53 0.874 8.455 0.001 –0.111 risk

0.2 0.59 0.01 0.974 8.40 0.002 –0.315 risk

0.11 0.4 0.3 0.432 7.455 0.013 –0.222 safe

0.2 0.64 0.13 0.774 8.123 0.001 –0.415 risk

0.1 0.17 0.59 0.813 9.451 0.021 –0.319 risk

0.8 0.43 0.55 0.874 8.852 0.011 –0.227 safe

0.1 0.78 0.63 0.870 8.115 0.002 –0.254 risk

. . . . . . . .

Page 23

Success of speech recognition

- massive amounts of data
- increased computing power
- cheap computer memory

allowed for the use of Bayes in hidden Markov models for speech recognition.

Similarly (but slower): application of Bayes in script recognition.

Page 24

Global structure:

- year
- title
- date
- date and number of entry (Rappt)
- redundant lines between paragraphs
- jargon words: Notificatie, Besluit, fiat
- imprint with page number

XML model

Page 25

Local probabilistic structure:

P(“Novb 16 is a date” | “sticks out to the left” & is left of “Rappt ”) ?

Page 26

Naive Bayes: Conditional Independence

Naive Bayes assumes the attributes (features) are conditionally independent given the class:

p(X,Y|C) = p(X|C) p(Y|C)

or

p(x1, ..., xn|C) = Π_i p(xi|C)

Often works surprisingly well in practice despite its manifest simplicity.

Page 27

Accumulation of Evidence – Independence

p(C|X,Y) = p(X,Y|C) p(C) / p(X,Y)

= p(Y|C,X) p(X|C) p(C) / ( p(Y|X) p(X) )

= p(Y|C,X) p(C|X) / p(Y|X)

≈ p(Y|C) p(C|X) / p(Y)

using the “naive” assumption that X and Y are independent.

Page 28

The Naive Bayes Classifier

Assume that each sample x to be classified is described by the attributes a1, a2, ..., an.

The most probable (MAP) classification for x is:

class(x) = argmax_ci P(ci | a1, a2, ..., an)
         = argmax_ci P(a1, a2, ..., an | ci) P(ci) / P(a1, a2, ..., an)
         = argmax_ci P(a1, a2, ..., an | ci) P(ci)

Naive Bayes independence assumption:

P(a1, a2, ..., an | ci) = Π_j P(aj | ci)

Therefore:

class(x) = argmax_ci P(ci) Π_j P(aj | ci)
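The final formula maps directly onto code; a minimal sketch with made-up probability tables, computed in log space to avoid numerical underflow:

```python
# Minimal sketch of the naive Bayes MAP rule:
#   class(x) = argmax_ci P(ci) * prod_j P(aj | ci)
# The probability tables are made up; log space avoids numerical underflow.
import math

priors = {"c1": 0.6, "c2": 0.4}
# likelihoods[c][j][value] = P(attribute j takes this value | class c)
likelihoods = {
    "c1": [{"low": 0.7, "high": 0.3}, {"red": 0.2, "blue": 0.8}],
    "c2": [{"low": 0.4, "high": 0.6}, {"red": 0.9, "blue": 0.1}],
}

def classify(attributes):
    """Return the class maximizing log P(c) + sum_j log P(a_j | c)."""
    scores = {}
    for c, prior in priors.items():
        log_score = math.log(prior)
        for j, a in enumerate(attributes):
            log_score += math.log(likelihoods[c][j][a])
        scores[c] = log_score
    return max(scores, key=scores.get)

print(classify(["low", "blue"]))    # -> c1
print(classify(["high", "red"]))    # -> c2
```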

Page 29

Learning to Classify Text

Representation: each electronic document is represented by the set of words that it contains under the independence assumptions

- order of words does not matter

- co-occurrences of words do not matter

i.e. each document is represented as a “bag of words”.

Learning: estimate from the training dataset of documents

- the prior class probability P(ci)

- the conditional likelihood P(wj|ci) of a word wj given the document class ci

Classification: maximum a posteriori (MAP)

Page 30

Learning to Classify e-mail

Is this e-mail spam? e-mail → {spam, ham}

Each word represents an attribute characterizing the e-mail.

Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all the encountered words.

For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis.
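A minimal end-to-end sketch of such a filter; the toy training corpus, the whitespace tokenizer, and the Laplace smoothing constant are all invented for illustration and are not taken from the slides:

```python
# Minimal naive Bayes spam-filter sketch: bag-of-words representation,
# Laplace-smoothed word likelihoods, MAP classification.
# The toy corpus and whitespace tokenizer are invented for illustration.
import math
from collections import Counter

train = [
    ("claim your free credit card now", "spam"),
    ("unsecured platinum card with free limit", "spam"),
    ("call for papers workshop on document analysis", "ham"),
    ("workshop proceedings will be published by springer", "ham"),
]

# Estimate class priors and per-class word counts from the training set.
class_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_counts}
for text, label in train:
    word_counts[label].update(text.split())       # bag of words: order ignored

vocab = {w for counts in word_counts.values() for w in counts}

def log_word_likelihood(word, c, alpha=1.0):
    """Laplace-smoothed log P(word | class)."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def classify(text):
    """MAP class under the naive Bayes independence assumption."""
    scores = {}
    for c, n_docs in class_counts.items():
        score = math.log(n_docs / len(train))      # log prior P(c)
        for w in text.split():
            if w in vocab:                         # ignore words never seen in training
                score += log_word_likelihood(w, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("free platinum credit card"))       # -> spam
print(classify("workshop on document analysis"))   # -> ham
```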

Page 31

Spam filtering

Example of regular mail:

From [email protected] Mon Nov 10 19:23:44 2003
Return-Path: <[email protected]>
Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100

Apologies for multiple postings.
> 2nd C a l l  f o r  P a p e r s
>
> DAS 2004
>
> Sixth IAPR International Workshop on
> Document Analysis Systems
>
> September 8-10, 2004
>
> Florence, Italy
>
> http://www.dsi.unifi.it/DAS04
>
> Note:
> There are two main additions with respect to the previous CFP:
> 1) DAS&DL data are now available on the workshop web site
> 2) Proceedings will be published by Springer Verlag in LNCS series

Page 32

Spam filtering

Example of spam:

From: "Easy Qualify" <[email protected]>
To: [email protected]
Subject: Claim your Unsecured Platinum Card - 75OO dollar limit
Date: Tue, 28 Oct 2003 17:12:07 -0400

==================================================
mbulacu - Tuesday, Oct 28, 2003
==================================================

Congratulations, you have been selected for an Unsecured Platinum Credit Card / $7500 starting credit limit.

This offer is valid even if you've had past credit problems or even no credit history. Now you can receive a $7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.

Page 33

Conclusions

- Effective: about 90% correct classification
- Could be applied to any text classification problem
- Needs to be polished

Page 34

Summary

Bayesian inference allows for integrating prior knowledge about the world (beliefs expressed in terms of probabilities) with new incoming data.

Inductive bias of Naive Bayes: the attributes are independent. Although this assumption is often violated, it yields a very efficient and widely used tool (e.g. text classification, spam filtering).

Applicable to discrete or continuous data.