CLASSIFICATION AND CATEGORIZATION

CLASSIFICATION AND CATEGORIZATION INTRODUCTION TO DATA SCIENCE ELI UPFAL


CLASSIFICATION AND CATEGORIZATION

INTRODUCTION TO DATA SCIENCE

ELI UPFAL


MACHINE LEARNING PROBLEMS

              Supervised Learning                  Unsupervised Learning
Discrete      classification or categorization     clustering
Continuous    regression                           dimensionality reduction


EXAMPLE: TITANIC DATASET

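[Figure: table of Titanic passenger records, with one column marked "Label" and the remaining columns marked "Features".]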

Can we predict survival from these features?


THE MACHINE LEARNING FRAMEWORK

y = f(x)

Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set

Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)

In y = f(x): y is the output, f is the prediction function, and x is the features.

Slide credit: L. Lazebnik
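
The training and testing steps above can be sketched as follows. This is a minimal illustration, not the course's own code: it assumes a single numeric feature and a family of threshold rules as candidate prediction functions. The data and the names `training_error` and `train_threshold` are made up for the example.

```python
# Toy illustration of y = f(x): training picks f by minimizing training error.
# The data and the threshold-rule model family are assumptions for illustration.

def training_error(f, examples):
    """Fraction of labeled examples (x, y) on which f(x) != y."""
    return sum(1 for x, y in examples if f(x) != y) / len(examples)

def train_threshold(examples):
    """Return the rule f(x) = 1 if x >= t else 0 with the lowest training error."""
    candidates = sorted({x for x, _ in examples})
    best_t = min(
        candidates,
        key=lambda t: training_error(lambda x: int(x >= t), examples),
    )
    return lambda x: int(x >= best_t)

# Training: labeled examples {(x1, y1), ..., (xN, yN)}
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
f = train_threshold(train)

# Testing: apply f to an example that was not in the training set
y_pred = f(6.5)
```

A low training error does not guarantee a low error on new examples, which is why evaluation on held-out data matters.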


ML PIPELINE (SUPERVISED)



EVALUATION – CROSS-VALIDATION

• Error types:
  • Training error: fraction of errors on the training set
  • Generalization error: expected fraction of errors on new items
• Estimating generalization error:
  • Hold-out set: withhold part of the labeled data from training and test on these fresh items
  • Cross-validation: k-fold, leave-one-out, …
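
A sketch of k-fold cross-validation, under stated assumptions: folds are contiguous (in practice the data would usually be shuffled first), the learner is a stand-in that always predicts the most common training label, and the data set is invented. The names `k_fold_indices`, `train_majority`, and `cross_val_error` are hypothetical.

```python
# k-fold cross-validation sketch. The majority-class "model" and the toy
# data are assumptions chosen only to keep the example self-contained.
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_majority(examples):
    """Stand-in learner: always predict the most common training label."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label

def cross_val_error(examples, k, train_fn):
    """Average error on the held-out fold, over k folds."""
    errors = []
    for held_out in k_fold_indices(len(examples), k):
        held = set(held_out)
        train = [ex for i, ex in enumerate(examples) if i not in held]
        test = [examples[i] for i in held_out]
        f = train_fn(train)
        errors.append(sum(f(x) != y for x, y in test) / len(test))
    return sum(errors) / k

labels = ["a", "a", "b", "a", "a", "b", "a", "a", "b"]
data = list(enumerate(labels))  # (x, y) pairs; x is just an index here

cv_error = cross_val_error(data, k=3, train_fn=train_majority)
# Leave-one-out is the special case k = N.
loo_error = cross_val_error(data, k=len(data), train_fn=train_majority)
```

Each example is used for testing exactly once and is never part of the training set of the model that is tested on it.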


CONFUSION TABLE
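
Only the title of this slide survives in the transcript. As a hedged illustration, the sketch below builds the usual two-class confusion table (true/false positives and negatives). The labels, the predictions, and the function name `confusion_table` are assumptions and are not taken from the slide.

```python
# Two-class confusion table sketch; data and names are illustrative assumptions.
from collections import Counter

def confusion_table(y_true, y_pred, positive):
    """Count TP, FP, FN, TN with respect to the given positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

y_true = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
y_pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

table = confusion_table(y_true, y_pred, positive="spam")
accuracy = (table["TP"] + table["TN"]) / len(y_true)
```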


TEXT FEATURES

Classes: Spam, Not Spam

Bag of Words: Viagra, Soft, Herbel, Pills, Are, …

N-Grams: "herbel pills", "pills for", "for Hair", "Hair enlargement", "enlargement Techniques"
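
A minimal sketch of the two feature types, assuming whitespace tokenization and lower-casing of the slide's example phrase. The helper names `bag_of_words` and `ngrams` are hypothetical.

```python
# Bag-of-words and n-gram features from a token list (illustrative sketch).
from collections import Counter

def bag_of_words(tokens):
    """Word counts; word order is discarded."""
    return Counter(tokens)

def ngrams(tokens, n):
    """All runs of n consecutive tokens, preserving local word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "herbel pills for hair enlargement techniques".split()
bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
```

Bag of words ignores order entirely; bigrams recover adjacent pairs such as ("pills", "for") at the cost of a much larger feature space.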


TOKENIZATION AND STEMMING

WORKING WITH TEXT


TOKENIZATION

Input: “Friends, Romans and Countrymen”

Output: Tokens

• Friends
• Romans
• and
• Countrymen

A token is an instance of a sequence of characters

Sec. 2.2.1
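
The example above can be reproduced with a very simple tokenizer. The rule used here (a token is a maximal run of letters, digits, or apostrophes) is an assumption for illustration. Real tokenizers handle many more cases.

```python
# Regex tokenizer sketch; the token pattern is an assumed simplification.
import re

def tokenize(text):
    """Return maximal runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("Friends, Romans and Countrymen")
```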


COMMON STEPS

• Remove Stop Words (a, an, the, to, be, …)
• Normalization to terms
  • deleting periods: U.S.A. → USA
  • deleting hyphens: anti-discriminatory → antidiscriminatory
  • Abbreviations: Massachusetts Institute of Technology → MIT
  • Case-folding: Meal → meal, Brown → brown
  • Language-issues: Tuebingen, Tübingen → Tubingen
  • asymmetric expansion: windows → window
  • ...
  • What examples above are problematic?
• Thesauri and soundex
  • car = automobile, color = colour
• Stemming
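
A sketch of three of the steps listed above (stop-word removal, deleting periods and hyphens, case-folding), assuming the input is already tokenized. The stop-word set is the short list from the slide, and the names `normalize` and `to_terms` are hypothetical.

```python
# Token normalization sketch; the stop list and rule order are assumptions.
STOP_WORDS = {"a", "an", "the", "to", "be"}

def normalize(token):
    """Delete periods and hyphens, then case-fold."""
    token = token.replace(".", "")  # U.S.A. -> USA
    token = token.replace("-", "")  # anti-discriminatory -> antidiscriminatory
    return token.lower()            # Meal -> meal, Brown -> brown

def to_terms(tokens):
    """Normalize every token and drop stop words and empty strings."""
    terms = [normalize(t) for t in tokens]
    return [t for t in terms if t and t not in STOP_WORDS]

terms = to_terms(["The", "U.S.A.", "anti-discriminatory", "Meal", "to", "Brown"])
```

The output also hints at why some of the rules are problematic: after case-folding, the surname Brown and the color brown become the same term.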


STEMMING

Reduce terms to their “roots” before indexing

“Stemming” suggests crude affix chopping

• language dependent
• e.g., automate(s), automatic, automation all reduced to automat.

Before stemming: for example compressed and compression are both accepted as equivalent to compress.

After stemming: for exampl compress and compress ar both accept as equival to compress

Sec. 2.2.4


PORTER’S ALGORITHM

Commonest algorithm for stemming English

• Results suggest it’s at least as good as other stemming options

Conventions + 5 phases of reductions

• phases applied sequentially
• each phase consists of a set of commands
• sample convention: Of the rules in a compound command, select the one that applies to the longest suffix.

Sec. 2.2.4


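TYPICAL RULES IN PORTER

• sses → ss
• ies → i
• ational → ate
• tional → tion

Rules sensitive to the weight (measure m) of the word

(m>1) EMENT → (deleted): the suffix is removed only if the measure m of the remaining stem exceeds 1
• replacement → replac
• cement → cement

Sec. 2.2.4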
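
The rules on this slide can be sketched in a few lines. This is not the full Porter stemmer: it covers only the four suffix rules and the EMENT rule shown here, it groups the four rules into one compound command purely to illustrate the longest-suffix convention, and it omits the conditions that the real algorithm attaches to some of them. The function names are hypothetical. The measure m counts vowel-to-consonant transitions in the stem, following Porter's form [C](VC)^m[V].

```python
# Partial sketch of Porter-style rules; NOT a complete Porter stemmer.

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC)^m[V]: the number of vowel-to-consonant transitions."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1
        prev_vowel = vowel
    return m

SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_longest(word, rules=SUFFIX_RULES):
    """Of the rules that match, apply the one with the longest suffix."""
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[:-len(suffix)] + repl

def strip_ement(word):
    """(m>1) EMENT -> '': delete the suffix only if the stem's measure exceeds 1."""
    if word.endswith("ement"):
        stem = word[:-len("ement")]
        if measure(stem) > 1:
            return stem
    return word
```

For "relational", both "ational" and "tional" match, and the longest-suffix convention selects "ational", giving "relate". For "cement", the remaining stem "c" has measure 0, so the word is left unchanged.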


OTHER STEMMERS

Other stemmers exist, e.g., Lovins stemmer
• http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
• Single-pass, longest suffix removal (about 250 rules)

Full morphological analysis – at most modest benefits for retrieval

Do stemming and other normalizations help?
• English: very mixed results. Helps recall for some queries but harms precision on others
  • E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
  • 30% performance gains for Finnish!

Sec. 2.2.4


MANY CLASSIFIERS TO CHOOSE FROM

Slide credit: D. Hoiem

• Decision Trees
• K-nearest neighbor
• Support Vector Machines
• Logistic Regression
• Naïve Bayes
• Random Forest
• Bayesian network
• Randomized Forests
• Boosted Decision Trees
• RBMs
• …


CLASSIFIERS: NEAREST NEIGHBOR

Slide credit: L. Lazebnik

f(x) = label of the training example nearest to x

• All we need is a distance function for our inputs
• No training required!

[Figure: a test example shown among training examples from class 1 and training examples from class 2.]
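
A minimal sketch of the nearest-neighbor rule, assuming numeric feature vectors and Euclidean distance as the default. The distance function is a parameter because, as noted above, it is the only ingredient the method needs. The data points and function names are invented for illustration.

```python
# 1-nearest-neighbor sketch; data, names, and default distance are assumptions.
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def nearest_neighbor(x, training, distance=euclidean):
    """f(x) = label of the training example nearest to x."""
    _, label = min(training, key=lambda example: distance(x, example[0]))
    return label

training = [
    ((1.0, 1.0), "class 1"), ((1.5, 2.0), "class 1"),
    ((5.0, 5.0), "class 2"), ((6.0, 4.5), "class 2"),
]
label = nearest_neighbor((2.0, 1.5), training)
```

There is no training step: the "model" is the stored training set, and all of the work happens at prediction time.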


K-NEAREST NEIGHBOR

[Figure: scatter plot of training points from two classes (x and o) over features x1 and x2, with two query points marked +.]


1-NEAREST NEIGHBOR

[Figure: scatter plot of training points from two classes (x and o) over features x1 and x2, with two query points marked +.]


3-NEAREST NEIGHBOR

[Figure: scatter plot of training points from two classes (x and o) over features x1 and x2, with two query points marked +.]


5-NEAREST NEIGHBOR

[Figure: scatter plot of training points from two classes (x and o) over features x1 and x2, with two query points marked +.]
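
The effect of k can be sketched with a majority vote over the k closest training points. The coordinates below are invented (they are not read off the slide's plot) and include one "o" point sitting inside the "x" cluster, so the prediction for the same query changes between k = 1 and larger k. The function name `knn_predict` is hypothetical.

```python
# k-nearest-neighbor majority vote; the data points are illustrative assumptions.
import math
from collections import Counter

def knn_predict(x, training, k):
    """Label chosen by majority vote among the k nearest training examples."""
    neighbors = sorted(training, key=lambda ex: math.dist(x, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "x"), ((1.0, 2.0), "x"), ((2.0, 1.0), "x"), ((2.0, 2.2), "x"),
    ((1.6, 1.6), "o"),  # an isolated 'o' inside the 'x' cluster
    ((5.0, 5.0), "o"), ((5.0, 6.0), "o"), ((6.0, 5.0), "o"),
]
query = (1.5, 1.5)
predictions = {k: knn_predict(query, training, k) for k in (1, 3, 5)}
```

With k = 1 the isolated point decides the label. With k = 3 or k = 5 the surrounding "x" points outvote it. Odd values of k avoid ties in the two-class case.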


DECISION BOUNDARIES KNN

Assign label of nearest training data point to each test data point

Voronoi partitioning of feature space for two-category 2D and 3D data (from Duda et al.)

Source: D. Lowe
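
One way to see the decision boundaries is to label every point of a small grid with its nearest training point. The regions that result are a discretized view of the Voronoi partition. The three training points and the function names below are assumptions for illustration.

```python
# Text rendering of 1-NN decision regions on an integer grid (illustrative).
import math

def nearest_label(point, training):
    """Label of the training example closest to the given point."""
    return min(training, key=lambda ex: math.dist(point, ex[0]))[1]

def decision_map(training, width, height):
    """One string per grid row; each character is the predicted label there."""
    return [
        "".join(nearest_label((col, row), training) for col in range(width))
        for row in range(height)
    ]

training = [((1, 1), "x"), ((6, 1), "o"), ((3, 4), "x")]
grid = decision_map(training, width=8, height=6)
for line in grid:
    print(line)
```

The boundary between the "x" and "o" regions is piecewise linear, because each piece lies on the perpendicular bisector of a pair of training points from different classes.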