Transcript of Information Retrieval: Search Engine Technology (5&6)
http://tangra.si.umich.edu/clair/ir09
Prof. Dragomir R. Radev ([email protected])

Page 1:

Information Retrieval: Search Engine Technology (5&6)

http://tangra.si.umich.edu/clair/ir09

Prof. Dragomir R. Radev

[email protected]

Page 2:

Final projects

• Two formats:
– A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
– A research experiment documented in the form of a paper. Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.

• Deliverables:
– System (code + documentation + examples) or paper (+ code, data)
– Poster (to be presented in class)
– Web page that describes the project

Page 3:

SET/IR – W/S 2009

…
9. Text classification
– Naïve Bayesian classifiers
– Decision trees
…

Page 4:

Introduction

• Text classification: assigning documents to predefined categories: topics, languages, users
• A given set of classes C
• Given x, determine its class in C
• Hierarchical vs. flat
• Overlapping (soft) vs. non-overlapping (hard)

Page 5:

Introduction

• Ideas: manual classification using rules, e.g.:
– Columbia AND University → Education
– Columbia AND “South Carolina” → Geography
• Popular techniques: generative (kNN, Naïve Bayes) vs. discriminative (SVM, regression)
• Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly

Page 6:

Bayes formula

P(B|A) = P(A|B) · P(B) / P(A)

Full probability: P(A) = ΣB P(A|B) · P(B)

Page 7:

Example (performance-enhancing drug)

• Drug (D) with values y/n
• Test (T) with values +/−
• P(D=y) = 0.001
• P(T=+|D=y) = 0.8
• P(T=+|D=n) = 0.01
• Given: an athlete tests positive
• P(D=y|T=+) = P(T=+|D=y)·P(D=y) / (P(T=+|D=y)·P(D=y) + P(T=+|D=n)·P(D=n)) = (0.8 × 0.001) / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.074
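A quick numeric check of this posterior (a minimal sketch; the variable names are ours, not from the slide):

```python
# Posterior P(D=y | T=+) via Bayes' rule for the drug-test example.
p_d = 0.001               # prior P(D=y)
p_pos_given_d = 0.8       # P(T=+ | D=y)
p_pos_given_not_d = 0.01  # P(T=+ | D=n)

# Law of total probability gives the evidence P(T=+).
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)
posterior = p_pos_given_d * p_d / p_pos
print(round(posterior, 3))  # 0.074
```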

Page 8:

Naïve Bayesian classifiers

• Naïve Bayesian classifier

• Assuming statistical independence

• Features = words (or phrases) typically

P(d ∈ C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d ∈ C) · P(d ∈ C) / P(F1, F2, …, Fk)

With the independence assumption:

P(d ∈ C | F1, F2, …, Fk) = P(d ∈ C) · Πj=1..k P(Fj | d ∈ C) / Πj=1..k P(Fj)

Page 9:

Example

• p(well) = 0.9, p(cold) = 0.05, p(allergy) = 0.05
– p(sneeze|well) = 0.1, p(sneeze|cold) = 0.9, p(sneeze|allergy) = 0.9
– p(cough|well) = 0.1, p(cough|cold) = 0.8, p(cough|allergy) = 0.7
– p(fever|well) = 0.01, p(fever|cold) = 0.7, p(fever|allergy) = 0.4

Example from Ray Mooney

Page 10:

Example (cont’d)

• Features: sneeze, cough, no fever
• P(well|e) = (.9)(.1)(.1)(.99) / p(e) = 0.0089/p(e)
• P(cold|e) = (.05)(.9)(.8)(.3) / p(e) = 0.01/p(e)
• P(allergy|e) = (.05)(.9)(.7)(.6) / p(e) = 0.019/p(e)
• p(e) = 0.0089 + 0.01 + 0.019 = 0.0379
• P(well|e) = .23
• P(cold|e) = .26
• P(allergy|e) = .50

Example from Ray Mooney
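The same computation in code (a minimal sketch of the slide's arithmetic; the slide rounds the intermediate products, which shifts the cold/allergy posteriors slightly):

```python
# Naive Bayes for the sneeze/cough/fever example (values from the slide).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
cond = {  # P(symptom | class)
    "well":    {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
    "cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.4},
}
evidence = {"sneeze": True, "cough": True, "fever": False}

scores = {}
for c, prior in priors.items():
    p = prior
    for feat, present in evidence.items():
        p *= cond[c][feat] if present else 1 - cond[c][feat]
    scores[c] = p

z = sum(scores.values())  # p(e), the normalizer
for c in scores:
    # well 0.23, cold 0.28, allergy 0.49 (slide's .26/.50 use rounded intermediates)
    print(c, round(scores[c] / z, 2))
```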

Page 11:

Issues with NB

• Where do we get the values P(d ∈ C)? Use maximum likelihood estimation: Ni/N
• Same for the conditionals: these are based on a multinomial generator, and the MLE estimator is Tji / Σj′ Tj′i
• Smoothing is needed – why? (an unseen term would otherwise get probability 0 and zero out the whole product)
• Laplace smoothing: (Tji + 1) / (Σj′ Tj′i + |V|)
• Implementation: how to avoid floating-point underflow? (work with sums of log probabilities; see the sketch below)
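A minimal sketch of these ideas in code, assuming a multinomial model with Laplace smoothing and log-space scoring (the toy data and all names are ours):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns log-space parameters with
    Laplace smoothing: (T_ji + 1) / (sum_j' T_j'i + |V|)."""
    class_counts = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(class_counts.values())
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        total = sum(term_counts[c].values())
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / (total + len(vocab)))
                       for t in vocab}
    return log_prior, log_cond, vocab

def classify(tokens, log_prior, log_cond, vocab):
    # Summing logs avoids the floating-point underflow the slide warns about.
    scores = {c: lp + sum(log_cond[c][t] for t in tokens if t in vocab)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)

docs = [("buy cheap meds now".split(), "spam"),
        ("meeting schedule for monday".split(), "ham"),
        ("cheap cheap now".split(), "spam"),
        ("monday project meeting".split(), "ham")]
params = train_nb(docs)
print(classify("cheap meds".split(), *params))  # spam
```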

Page 12:

Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday

DEAR SIR

FUNDS FOR INVESTMENTS

THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A

Page 13:

SpamAssassin

• http://spamassassin.apache.org/

• http://spamassassin.apache.org/tests_3_1_x.html

Page 14:

Feature selection: The χ² test

• For a term t:

• C = class, It = feature (indicator for term t)
• Testing for independence: P(C=0, It=0) should equal P(C=0) · P(It=0)
– P(C=0) = (k00 + k01)/n
– P(C=1) = 1 − P(C=0) = (k10 + k11)/n
– P(It=0) = (k00 + k10)/n
– P(It=1) = 1 − P(It=0) = (k01 + k11)/n

        It=0   It=1
C=0     k00    k01
C=1     k10    k11

Page 15:

Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them.

χ² = n (k11k00 − k10k01)² / [(k00 + k01)(k10 + k11)(k00 + k10)(k01 + k11)]
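A direct transcription of this formula (the counts in the example call are invented for illustration):

```python
# Chi-squared feature-selection score from a 2x2 term/class table,
# following the formula above: k00..k11 are document counts.
def chi2(k00, k01, k10, k11):
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k00 + k01) * (k10 + k11) * (k00 + k10) * (k01 + k11)
    return num / den

# Example: a term that occurs mostly in class 1 gets a high score.
print(round(chi2(k00=80, k01=20, k10=5, k11=95), 1))  # ~115.1
```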

Page 16:

Feature selection: mutual information

• No document length scaling is needed

• Documents are assumed to be generated according to the multinomial model

• Measures amount of information: if the distribution is the same as the background distribution, then MI=0

• X = word; Y = class

MI(X, Y) = Σx Σy P(x, y) · log [P(x, y) / (P(x) P(y))]
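A small sketch of the MI computation over a joint distribution table (the uniform table below is an invented example; independence gives MI = 0, as the slide notes):

```python
import math

# Mutual information between a word indicator X and a class Y,
# computed from a joint distribution table (an illustrative sketch).
def mutual_information(joint):
    """joint[(x, y)] = P(X=x, Y=y); returns MI in bits."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent word/class distribution, so MI = 0.
print(mutual_information({(0, 0): 0.25, (0, 1): 0.25,
                          (1, 0): 0.25, (1, 1): 0.25}))  # 0.0
```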

Page 17:

Well-known datasets

• 20 newsgroups
– http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
– http://www.daviddlewis.com/resources/testcollections/reuters21578/
– Cats: grain, acquisitions, corn, crude, wheat, trade…
• WebKB
– http://www-2.cs.cmu.edu/~webkb/
– course, student, faculty, staff, project, dept, other
– NB performance (2000): P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35

Page 18:

Evaluation of text classification

• Microaveraging – pool the per-class contingency tables into one table and compute the measure from it (large classes dominate)
• Macroaveraging – compute the measure for each class and average over classes (all classes count equally)

Page 19:

Vector space classification

[Figure: documents from topic1 and topic2 plotted in a two-dimensional (x1, x2) feature space]

Page 20:

Decision surfaces

[Figure: a decision surface separating topic1 from topic2 in the (x1, x2) plane]

Page 21:

Decision trees

[Figure: decision-tree boundaries separating topic1 from topic2 in the (x1, x2) plane]

Page 22:

Classification using decision trees

• Expected information need:

I(s1, s2, …, sm) = − Σi pi log2(pi)

• s = data samples
• m = number of classes
• pi = si/s, the fraction of samples in class i

Page 23:

RID Age Income Student Credit Buys?

1 <= 30 High No Fair No

2 <= 30 High No Excellent No

3 31 .. 40 High No Fair Yes

4 > 40 Medium No Fair Yes

5 > 40 Low Yes Fair Yes

6 > 40 Low Yes Excellent No

7 31 .. 40 Low Yes Excellent Yes

8 <= 30 Medium No Fair No

9 <= 30 Low Yes Fair Yes

10 > 40 Medium Yes Fair Yes

11 <= 30 Medium Yes Excellent Yes

12 31 .. 40 Medium No Excellent Yes

13 31 .. 40 High Yes Fair Yes

14 > 40 Medium No Excellent No

Page 24:

Decision tree induction

• I(s1, s2) = I(9, 5) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.940

Page 25:

Entropy and information gain

E(A) = Σj [(s1j + … + smj) / s] · I(s1j, …, smj)

Entropy = expected information based on the partitioning into subsets by A

Gain(A) = I(s1, s2, …, sm) − E(A)

Page 26:

Entropy

• Age <= 30: s11 = 2, s21 = 3, I(s11, s21) = 0.971
• Age in 31..40: s12 = 4, s22 = 0, I(s12, s22) = 0
• Age > 40: s13 = 3, s23 = 2, I(s13, s23) = 0.971

Page 27:

Entropy (cont’d)

• E(age) = 5/14 · I(s11, s21) + 4/14 · I(s12, s22) + 5/14 · I(s13, s23) = 0.694
• Gain(age) = I(s1, s2) − E(age) = 0.246
• Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit) = 0.048
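The gain computation, reproduced as a sketch (exact arithmetic gives 0.247; the slide's 0.246 comes from rounded intermediates):

```python
import math

# Information gain for the "age" attribute on the buys-computer table.
def info(*counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(splits, pos=9, neg=5):
    """splits: list of (yes, no) counts per attribute value."""
    n = pos + neg
    expected = sum((y + m) / n * info(y, m) for y, m in splits)
    return info(pos, neg) - expected

# age splits: <=30 -> (2,3), 31..40 -> (4,0), >40 -> (3,2)
print(round(gain([(2, 3), (4, 0), (3, 2)]), 3))  # ~0.247 (slide: 0.246)
```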

Page 28:

Final decision tree

[Figure: the induced decision tree]
age <= 30 → test student: no → no, yes → yes
age 31..40 → yes
age > 40 → test credit: excellent → no, fair → yes

Page 29:

Other techniques

• Bayesian classifiers

• X: age <=30, income = medium, student = yes, credit = fair

• P(yes) = 9/14 = 0.643

• P(no) = 5/14 = 0.357

Page 30:

Example

• P (age <= 30 | yes) = 2/9 = 0.222
• P (age <= 30 | no) = 3/5 = 0.600
• P (income = medium | yes) = 4/9 = 0.444
• P (income = medium | no) = 2/5 = 0.400
• P (student = yes | yes) = 6/9 = 0.667
• P (student = yes | no) = 1/5 = 0.200
• P (credit = fair | yes) = 6/9 = 0.667
• P (credit = fair | no) = 2/5 = 0.400

Page 31:

Example (cont’d)

• P (X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P (X | no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
• P (X | yes) P (yes) = 0.044 × 0.643 = 0.028
• P (X | no) P (no) = 0.019 × 0.357 = 0.007

• Answer: yes/no? (0.028 > 0.007, so the classifier predicts yes)

Page 32:

SET/IR – W/S 2009

…
10. Linear classifiers
– Kernel methods
– Support vector machines
…

Page 33:

Linear boundary

[Figure: a linear boundary separating topic1 from topic2 in the (x1, x2) plane]

Page 34:

Vector space classifiers

• Using centroids

• Boundary = the line (hyperplane) that is equidistant from the two centroids; a sketch follows below
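A minimal centroid-classifier sketch (dense toy vectors and names of our choosing; a Rocchio-style set-up):

```python
import math

# Centroid classification: assign a document to the class whose
# centroid is nearest; the implied boundary is equidistant from them.
def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(x, centroids):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(centroids, key=lambda c: dist(x, centroids[c]))

centroids = {"topic1": centroid([[1.0, 0.1], [0.8, 0.3]]),
             "topic2": centroid([[0.1, 1.0], [0.2, 0.7]])}
print(classify([0.9, 0.2], centroids))  # topic1
```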

Page 35:

Generative models: knn

• Assign each element to the closest cluster
• K-nearest neighbors
• Very easy to program
• Tessellation; nonlinearity
• Issues: choosing k, b?
• Demo: http://www-2.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

score(c, d_q) = b_c + Σ_{d ∈ kNN(d_q)} s(d_q, d), where the sum runs over the k nearest neighbors d of d_q that belong to class c
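A kNN sketch along the lines of the scoring formula above, using cosine similarity for s(d_q, d) and a zero bias (assumptions of ours; the toy documents are invented):

```python
import math
from collections import defaultdict

# k-nearest-neighbor text classification over sparse tf vectors.
def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, labeled_docs, k=3):
    # Score each class by the summed similarity of its docs among the k neighbors.
    neighbors = sorted(labeled_docs, key=lambda dc: -cosine(query, dc[0]))[:k]
    scores = defaultdict(float)
    for vec, cls in neighbors:
        scores[cls] += cosine(query, vec)
    return max(scores, key=scores.get)

docs = [({"rate": 2, "discount": 1}, "interest"),
        ({"wheat": 3, "crop": 1}, "grain"),
        ({"rates": 1, "prime": 2}, "interest"),
        ({"corn": 2, "wheat": 1}, "grain")]
print(knn_classify({"rate": 1, "prime": 1}, docs, k=3))  # interest
```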

Page 36:

Linear separators

• Two-dimensional line: w1x1 + w2x2 = b is the linear separator; w1x1 + w2x2 > b for the positive class
• In n-dimensional spaces: wᵀx = b

Page 37:

Example 1

[Figure: a linear separator with its weight vector w in the (x1, x2) plane, topic1 on one side and topic2 on the other]

Page 38:

Example 2

• Classifier for “interest” in Reuters-21578
• b = 0
• If the document is “rate discount dlrs world”, its score will be 0.67·1 + 0.46·1 + (−0.71)·1 + (−0.35)·1 = 0.07 > 0

Example from MSR

wi      xi             wi      xi
0.70    prime          −0.71   dlrs
0.67    rate           −0.35   world
0.63    interest       −0.33   sees
0.60    rates          −0.25   year
0.46    discount       −0.24   group
0.43    bundesbank     −0.24   dlr
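Reproducing the scoring rule with the weights above (a sketch; terms outside the table contribute 0):

```python
# Linear scoring for the "interest" classifier (weights from the slide).
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
           "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}
b = 0.0
doc = "rate discount dlrs world".split()
score = sum(weights.get(t, 0.0) for t in doc) - b
print(round(score, 2), "interest" if score > 0 else "not interest")  # 0.07 interest
```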

Page 39:

Example: perceptron algorithm

Input:

S = ((x1, y1), (x2, y2), …, (xn, yn)), yi ∈ {−1, +1}

Algorithm:

w0 = 0, k = 0
FOR i = 1 TO n
  IF yi (wk · xi) ≤ 0 THEN // mistake
    wk+1 = wk + yi xi
    k = k + 1
  END IF
END FOR

Output: wk
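A runnable version of the algorithm (a sketch: the slide shows a single pass, while this loops over the data until no mistakes remain, which assumes linear separability):

```python
# Perceptron training: add y*x to the weights on every mistake.
def perceptron(samples, epochs=10):
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:  # y in {-1, +1}
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Linearly separable toy data: positive iff x1 + x2 > 0.
data = [((1.0, 2.0), 1), ((2.0, 0.5), 1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w = perceptron(data)
print(w, all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for x, y in data))
```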

Page 40:

[Figure omitted; slide from Chris Bishop]

Page 41:

Linear classifiers

• What is the major shortcoming of a perceptron?
• How to determine the dimensionality of the separator?
– Bias-variance tradeoff (example)
• How to deal with multiple classes?
– Any-of: build a separate classifier for each class
– One-of: harder (J hyperplanes do not divide R^M into J regions); instead, use class complements and scoring

Page 42:

Support vector machines

• Introduced by Vapnik in the early 1990s.
• Key idea: among all separating hyperplanes, choose the one that maximizes the margin to the nearest training points (the support vectors).

Page 43:

Issues with SVM

• Soft margins (inseparability)

• Kernels – non-linearity

Page 44:

The kernel idea

[Figure: the same data before and after the kernel mapping: not linearly separable before, linearly separable after]

Page 45:

Example (a mapping Φ from R² to R³):

(x1, x2) ↦ (z1, z2, z3) = (x1², √2·x1x2, x2²)

(mapping to a higher-dimensional space)

Page 46:

The kernel trick

Φ(x)ᵀΦ(x′) = (x1², √2·x1x2, x2²) · (x1′², √2·x1′x2′, x2′²) = (x1x1′ + x2x2′)² = (xᵀx′)² = k(x, x′)

Polynomial kernel: k(x, x′) = (xᵀx′ + c)^d

Sigmoid kernel: k(x, x′) = tanh(α·xᵀx′ + c)

RBF kernel: k(x, x′) = exp(−‖x − x′‖² / (2σ²))

Many other kernels are useful for IR:e.g., string kernels, subsequence kernels, tree kernels, etc.
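A numeric check of the kernel trick for the quadratic map above (a minimal sketch):

```python
import math, random

# The dot product in the mapped space equals (x . x')^2 in the input space,
# so the mapping never needs to be computed explicitly.
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k_explicit(x, xp):
    return sum(a * b for a, b in zip(phi(x), phi(xp)))

def k_trick(x, xp):
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

x, xp = (random.random(), random.random()), (random.random(), random.random())
print(math.isclose(k_explicit(x, xp), k_trick(x, xp)))  # True
```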

Page 47:

SVM (Cont’d)

• Evaluation: SVM > kNN > decision tree > NB
• Implementation:
– Quadratic optimization
– Use a toolkit (e.g., Thorsten Joachims’s SVMlight)

Page 48:

Semi-supervised learning

• EM

• Co-training

• Graph-based

Page 49:

Exploiting Hyperlinks – Co-training

• Each document instance has two alternative views (Blum and Mitchell 1998):
– terms in the document, x1
– terms in the hyperlinks that point to the document, x2
• Each view is sufficient to determine the class of the instance:
– the labeling function that classifies examples is the same whether applied to x1 or x2
– x1 and x2 are conditionally independent, given the class

[Slide from Pierre Baldi]

Page 50:

Co-training Algorithm

• Labeled data are used to infer two Naïve Bayes classifiers, one for each view
• Each classifier will:
– examine unlabeled data
– pick its most confidently predicted positive and negative examples
– add these to the labeled examples
• The classifiers are then retrained on the augmented set of labeled examples (see the sketch below)

[Slide from Pierre Baldi]
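A schematic of the loop (a sketch only: `train` and `predict_proba` are placeholders for any probabilistic learner, e.g. the Naïve Bayes code earlier, and the one-example-per-view confidence rule is our simplification of Blum and Mitchell's procedure):

```python
# Co-training: two view-specific classifiers label data for each other.
def cotrain(labeled, unlabeled, train, predict_proba, rounds=10):
    """labeled: [(x1_view, x2_view, label)]; unlabeled: [(x1_view, x2_view)]."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        # Train one classifier per view on the current labeled pool.
        c1 = train([(x1, y) for x1, _, y in labeled])
        c2 = train([(x2, y) for _, x2, y in labeled])
        for clf, view in ((c1, 0), (c2, 1)):
            if not unlabeled:
                return labeled
            # Move the example this classifier is most confident about.
            best = max(unlabeled,
                       key=lambda ex: max(predict_proba(clf, ex[view]).values()))
            probs = predict_proba(clf, best[view])
            unlabeled.remove(best)
            labeled.append((best[0], best[1], max(probs, key=probs.get)))
    return labeled
```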

Page 51:

Conclusion

• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.

• NB is also good in many circumstances.

Page 52:

Readings

• MRS18

• MRS17, MRS19