
I256: Applied Natural Language Processing

Marti Hearst, Nov 1, 2006

(Most slides originally by Barbara Rosario, modified here)

 

 

Slide 2: Today

Algorithms for Classification

Binary classification:
– Perceptron
– Winnow
– Support Vector Machines (SVM)
– Kernel Methods

Multi-class classification:
– Decision Trees
– Naïve Bayes
– k-Nearest Neighbor

Slide 3: Binary Classification: examples

– Spam filtering (spam, not spam)
– Customer service message classification (urgent vs. not urgent)
– Sentiment classification (positive, negative)
– Sometimes it is convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class

Slide 4: Binary Classification

– Given: some data items that belong to a positive (+1) or a negative (-1) class
– Task: train the classifier and predict the class for a new data item
– Geometrically: find a separator

Slide 5: Linear versus Non Linear algorithms

Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary

Slide 6: Linearly separable data

[Figure: two classes (Class 1, Class 2) separated by a linear decision boundary]

Slide 7: Non linearly separable data

[Figure: two classes (Class 1, Class 2) that cannot be separated by a linear boundary]

Slide 8: Non linearly separable data

[Figure: two classes (Class 1, Class 2) separated by a non-linear classifier]

Slide 11: Simple linear algorithms

– Perceptron and Winnow algorithms
– Binary classification
– Online (process data sequentially, one data point at a time)
– Mistake-driven

Slide 12: Linear binary classification
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Data: {(x_i, y_i)}, i = 1, ..., n
– x_i in R^d (a vector in d-dimensional space): the feature vector
– y_i in {-1, +1}: the label (class, category)

Question: design a linear decision boundary w·x + b = 0 (the equation of a hyperplane) such that the associated classification rule has minimal probability of error.

Classification rule: y = sign(w·x + b), which means:
– if w·x + b > 0 then y = +1 (positive example)
– if w·x + b < 0 then y = -1 (negative example)
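The rule is easy to state in code. A minimal sketch, with made-up toy weights and feature vectors (not taken from the slides):

```python
import numpy as np

def predict(w, b, x):
    """Linear classification rule: y = sign(w·x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Toy example: illustrative weights and 2-D feature vectors.
w = np.array([0.5, -1.0])
b = 0.1
print(predict(w, b, np.array([2.0, 0.5])))   # +1 (positive side of the hyperplane)
print(predict(w, b, np.array([0.0, 1.0])))   # -1 (negative side)
```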

Slide 13: Linear binary classification
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible.

In online fashion: try one data point at a time, updating the weights as necessary.

Hyperplane: w·x + b = 0
Classification rule: y = sign(w·x + b)

Slide 14: Perceptron algorithm
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Initialize: w_1 = 0

Updating rule — for each data point x_i:
  If class(x_i) != decision(x_i, w_k) then
    w_(k+1) ← w_k + y_i x_i
    k ← k + 1
  else
    w_(k+1) ← w_k

Function decision(x, w):
  If w·x + b > 0 return +1
  Else return -1

[Figure: hyperplanes w_k·x + b = 0 (before the update) and w_(k+1)·x + b = 0 (after the update), with the +1 and -1 regions]
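A compact sketch of this mistake-driven update loop on a toy linearly separable dataset; the data, epoch count, and bias handling are illustrative assumptions, not from the slides:

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Mistake-driven perceptron: on each error, w <- w + y_i * x_i."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi                     # additive update
                b += yi
    return w, b

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print([int(np.sign(np.dot(w, x) + b)) for x in X])  # should match y
```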

Slide 15: Perceptron algorithm
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Online: can adjust to a changing target over time.

Advantages:
– Simple and computationally efficient
– Guaranteed to learn a linearly separable problem (convergence, global optimum)

Limitations:
– Only linear separations
– Only converges for linearly separable data
– Not really "efficient with many features"

Slide 16: Winnow algorithm
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Another online algorithm for learning perceptron weights:

– f(x) = sign(w·x + b)
– Linear, binary classification
– Update rule: again error-driven, but multiplicative (instead of additive)

Slide 17: Winnow algorithm
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

[Figure: hyperplanes w_k·x + b = 0 (before the update) and w_(k+1)·x + b = 0 (after the update), with the +1 and -1 regions]

Initialize: w_1 = 0

Updating rule — for each data point x_i:
  If class(x_i) != decision(x_i, w_k) then
    w_(k+1) ← w_k + y_i x_i          (Perceptron)
    w_(k+1) ← w_k * exp(y_i x_i)     (Winnow)
    k ← k + 1
  else
    w_(k+1) ← w_k

Function decision(x, w):
  If w·x + b > 0 return +1
  Else return -1
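A sketch of the multiplicative update for 0/1 features, written in the classical promote/demote form (equivalent to w_i <- w_i * alpha^(y * x_i)). Starting from all-ones weights and comparing against a threshold theta are assumptions added here so the multiplicative rule can actually move the weights (multiplying a zero weight, as in the slide's w_1 = 0 initialization, would leave it at zero):

```python
import numpy as np

def winnow_train(X, y, alpha=2.0, theta=None, epochs=10):
    """Winnow sketch for 0/1 features: multiplicative updates
    w_i <- w_i * alpha^(y * x_i) on a mistake, vs. the perceptron's
    additive w <- w + y * x. The threshold theta plays the role of -b."""
    n, d = X.shape
    w = np.ones(d)                      # multiplicative updates need a non-zero start
    if theta is None:
        theta = d / 2.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) >= theta else -1
            if pred != yi:              # mistake-driven
                w *= alpha ** (yi * xi)  # promote/demote only the active features
    return w, theta

# Toy 0/1 data where only the first feature is relevant (illustrative).
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
w, theta = winnow_train(X, y)
print(w, theta)   # irrelevant features get demoted toward zero
```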

Slide 18: Perceptron vs. Winnow
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Assume:
– N available features
– only K relevant ones, with K << N

Mistake bounds:
– Perceptron: O(K N) mistakes
– Winnow: O(K log N) mistakes

Winnow is more robust to high-dimensional feature spaces.

Slide 19: Perceptron vs. Winnow
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Perceptron
– Online: can adjust to a changing target over time
– Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem
– Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features"

Winnow
– Online: can adjust to a changing target over time
– Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes
– Limitations: only linear separations; only converges for linearly separable data; not really "efficient with many features"

Used in NLP.

Slide 20: Another family of linear algorithms
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Intuition (Vapnik, 1965) — if the classes are linearly separable:
– Separate the data
– Place the hyperplane "far" from the data: large margin
– Statistical results guarantee good generalization

Large margin classifier

[Figure: a separating hyperplane placed close to the data (small margin) — labeled BAD]

Slide 21: Maximal Margin Classifier
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

[Figure: a separating hyperplane far from both classes (large margin) — labeled GOOD]

Intuition (Vapnik, 1965) — if linearly separable:
– Separate the data
– Place the hyperplane "far" from the data: large margin
– Statistical results guarantee good generalization

Large margin classifier

Slide 22: Large margin classifier
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

If not linearly separable:
– Allow some errors
– Still, try to place the hyperplane "far" from each class

Slide 23: Large Margin Classifiers

Advantages:
– Theoretically better (better error bounds)

Limitations:
– Computationally more expensive: a large quadratic programming problem

Slide 24: Support Vector Machine (SVM)
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

– A large margin classifier
– Linearly separable case
– Goal: find the hyperplane that maximizes the margin

[Figure: separating hyperplane w^T x + b = 0 with margin M; the support vectors lie on the parallel hyperplanes w^T x_a + b = 1 and w^T x_b + b = -1]
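A hedged sketch of a maximum-margin linear classifier using scikit-learn; the library, the toy data, and the large C value used to approximate the hard-margin case are assumptions, not part of the slides:

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

# Toy linearly separable data (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# A linear-kernel SVM maximizes the margin; a large C approximates the hard-margin case.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]       # hyperplane w^T x + b = 0
margin = 2.0 / np.linalg.norm(w)             # distance between w^T x + b = +1 and -1
print("w:", w, "b:", b, "margin:", margin)
print("support vectors:\n", clf.support_vectors_)
```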

Slide 25: Support Vector Machine (SVM) Applications
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

– Text classification
– Handwriting recognition
– Computational biology (e.g., micro-array data)
– Face detection
– Facial expression recognition
– Time series prediction

Slide 26: Non Linear problem

Slide 27: Non Linear problem

Slide 28: Non Linear problem
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Kernel methods:
– A family of non-linear algorithms
– Transform the non-linear problem into a linear one (in a different feature space)
– Use linear algorithms to solve the linear problem in the new space

Slide 29: Basic principle of kernel methods
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

Map Φ: R^d → R^D (with D >> d)

Example: X = [x z], Φ(X) = [x^2  z^2  xz]

f(x) = sign(w_1 x^2 + w_2 z^2 + w_3 xz + b)

Decision boundary in the new space: w^T Φ(X) + b = 0
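A minimal sketch of this explicit feature map applied to XOR-like data, which is not linearly separable in R^2 but becomes linearly separable after mapping; the data points and the hand-picked weight vector are illustrative assumptions:

```python
import numpy as np

def phi(p):
    """Explicit feature map from the slide: (x, z) -> (x^2, z^2, x*z)."""
    x, z = p
    return np.array([x * x, z * z, x * z])

# XOR-like data: the two classes cannot be separated by a line in R^2 ...
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# ... but after mapping, the third coordinate x*z alone separates them,
# i.e. the linear classifier w = [0, 0, 1], b = 0 works in the new space.
for p, label in zip(X, y):
    mapped = phi(p)
    pred = 1 if mapped[2] > 0 else -1
    print(p, "->", mapped, "predicted", pred, "true", label)
```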

Slide 30: Basic principle of kernel methods
(From Gert Lanckriet, Statistical Learning Theory Tutorial)

– Linear separability: more likely in high dimensions
– Mapping: Φ maps the input into a high-dimensional feature space
– Classifier: construct a linear classifier in the high-dimensional feature space
– Motivation: an appropriate choice of Φ leads to linear separability
– We can do this efficiently!

Slide 32: Multilayer Neural Networks

– Also known as a multi-layer perceptron
– Also known as artificial neural networks, to distinguish them from the biological ones
– Many learning algorithms, but the most popular is backpropagation:
  – The output values are compared with the correct answer to compute the value of some predefined error function
  – Propagate the errors back through the network
  – Adjust the weights to reduce the errors
  – Continue iterating some number of times
– Can be linear or non-linear
– Tends to work very well, but:
  – is very slow to run
  – isn't great with huge feature sets (slow and memory-intensive)

Slide 33: Multilayer Neural Network Applied to Sentence Boundary Detection
(From Palmer & Hearst '97)

[Figure: features in the descriptor array used as input to the network]

Slide 34: Multilayer Neural Networks
(From the Wikipedia article on backpropagation)

Backpropagation algorithm:
– Present a training sample to the neural network.
– Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.
– For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
– Adjust the weights of each neuron to lower the local error.
– Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
– Repeat the steps above on the neurons at the previous level, using each one's "blame" as its error.

For a detailed example, see: http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
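A minimal numpy sketch of these steps for a tiny one-hidden-layer network trained on XOR; the architecture, learning rate, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)   # desired outputs (XOR)

W1 = rng.normal(0, 1, (2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1))   # hidden -> output weights
b2 = np.zeros(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

lr = 1.0
for _ in range(5000):
    # Forward pass: present the samples, compute the outputs.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Local error at the output, then "blame" propagated back to the hidden units,
    # weighted by the strength of the connecting weights.
    err_out = (out - t) * out * (1 - out)
    err_hid = (err_out @ W2.T) * h * (1 - h)

    # Adjust the weights to reduce the errors (gradient descent step).
    W2 -= lr * h.T @ err_out
    b2 -= lr * err_out.sum(axis=0)
    W1 -= lr * X.T @ err_hid
    b1 -= lr * err_hid.sum(axis=0)

print(np.round(out, 2))   # typically approaches [0, 1, 1, 0]; depends on the random init
```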

Slide 35: Multi-class classification

Slide 36: Multi-class classification

– Given: some data items that belong to one of M possible classes
– Task: train the classifier and predict the class for a new data item
– Geometrically: a harder problem, no more simple geometry

Slide 37: Multi-class classification: Examples

– Author identification
– Language identification
– Text categorization (topics)

Slide 38: (Some) Algorithms for Multi-class classification

Linear:
– Decision Trees, Naïve Bayes

Non-linear:
– k-Nearest Neighbors
– Neural Networks

Slide 39: Linear class separators (ex: Naïve Bayes)

Slide 40: Non Linear (ex: k Nearest Neighbor)

Slide 41: Decision Trees
(From http://dms.irb.hr/tutorial/tut_dtrees.php)

A decision tree is a classifier in the form of a tree structure, where each node is either:
– a leaf node, which indicates the value of the target attribute (class) of examples, or
– a decision node, which specifies some test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.

A decision tree classifies an example by starting at the root of the tree and moving down it until a leaf node is reached; the leaf provides the classification of the instance.

Slide 42: Decision Tree Example

Day  Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cold   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

Goal: learn when we can play tennis and when we cannot.

Slide 43: Decision Tree for PlayTennis
(From www.math.tau.ac.il/~nin/ Courses/ML04/DecisionTreesCLS.pp)

Outlook
– Sunny → Humidity
  – High → No
  – Normal → Yes
– Overcast → Yes
– Rain → Wind
  – Strong → No
  – Weak → Yes

Slide 44: Decision Tree for PlayTennis
(From www.math.tau.ac.il/~nin/ Courses/ML04/DecisionTreesCLS.pp)

(Partial tree)
Outlook
– Sunny → Humidity
  – High → No
  – Normal → Yes
– Overcast
– Rain

– Each internal node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node assigns a classification

Slide 45: Decision Tree for PlayTennis
(From www.math.tau.ac.il/~nin/ Courses/ML04/DecisionTreesCLS.pp)

Outlook
– Sunny → Humidity
  – High → No
  – Normal → Yes
– Overcast → Yes
– Rain → Wind
  – Strong → No
  – Weak → Yes

New example: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak, PlayTennis = ?
The tree classifies it as No (Sunny branch, then Humidity = High).
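A sketch of walking this tree in code, hard-coding the learned tree as nested conditionals (illustrative only):

```python
def classify(example):
    """Walk the PlayTennis tree from the slide: test Outlook first,
    then Humidity (if Sunny) or Wind (if Rain)."""
    if example["Outlook"] == "Sunny":
        return "No" if example["Humidity"] == "High" else "Yes"
    if example["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "No" if example["Wind"] == "Strong" else "Yes"

# The unlabeled example from the slide: Sunny, Hot, High, Weak.
print(classify({"Outlook": "Sunny", "Temperature": "Hot",
                "Humidity": "High", "Wind": "Weak"}))   # -> "No"
```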

Slide 46: Decision Tree for Reuters classification
(From Foundations of Statistical Natural Language Processing, Manning and Schuetze)

Slide 47: Decision Tree for Reuters classification
(From Foundations of Statistical Natural Language Processing, Manning and Schuetze)

Slide 48: Building Decision Trees

– Given training data, how do we construct a tree?
– The central focus of the tree-growing algorithm is selecting which attribute to test at each node. The goal is to select the attribute that is most useful for classifying examples.
– Top-down, greedy search through the space of possible decision trees.

That is, it picks the best attribute and never looks back to reconsider earlier choices.

Slide 49: Building Decision Trees

Splitting criterion:
– Finding the features and the values to split on
  – for example, why test "cts" first and not "vs"?
  – why test on "cts < 2" and not "cts < 5"?
– Choose the split that gives the maximum information gain (i.e., the maximum reduction of uncertainty); see the sketch below.

Stopping criterion:
– When all the elements at one node have the same class, there is no need to split further.

In practice, one first builds a large tree and then one prunes it back (to avoid overfitting)

See Foundations of Statistical Natural Language Processing, Manning and Schuetze for a good introduction
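A minimal sketch of the information-gain computation, applied to the Outlook split of the PlayTennis table above (the helper functions are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_c p(c) * log2 p(c) over the class labels in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Reduction in entropy from splitting `labels` into the partition `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# PlayTennis labels (9 Yes / 5 No) split by Outlook, as in the earlier table.
all_labels = ["Yes"] * 9 + ["No"] * 5
sunny    = ["Yes"] * 2 + ["No"] * 3
overcast = ["Yes"] * 4
rain     = ["Yes"] * 3 + ["No"] * 2
print(round(information_gain(all_labels, [sunny, overcast, rain]), 3))  # ~0.247
```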

Slide 50: Decision Trees: Strengths
(From http://dms.irb.hr/tutorial/tut_dtrees.php)

– Decision trees are able to generate understandable rules.
– Decision trees perform classification without requiring much computation.
– Decision trees are able to handle both continuous and categorical variables.
– Decision trees provide a clear indication of which features are most important for prediction or classification.

Slide 51: Decision Trees: Weaknesses
(From http://dms.irb.hr/tutorial/tut_dtrees.php)

– Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
– Decision trees can be computationally expensive to train:
  – need to compare all possible splits
  – pruning is also expensive

Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.

Slide 52: Naïve Bayes Models

Graphical models: graph theory plus probability theory
– Nodes are variables
– Edges are conditional probabilities

[Figure: node A with children B and C, labeled P(A), P(B|A), P(C|A)]

Slide 53: Naïve Bayes Models

Graphical models: graph theory plus probability theory
– Nodes are variables
– Edges are conditional probabilities
– Absence of an edge between nodes implies independence between the variables of the nodes

[Figure: node A with children B and C, labeled P(A), P(B|A), P(C|A); since there is no edge between B and C, P(C|A,B) = P(C|A)]

Slide 54: Naïve Bayes for text classification
(From Foundations of Statistical Natural Language Processing, Manning and Schuetze)

Slide 55: Naïve Bayes for text classification

[Figure: the topic node "earn" with word nodes such as "shr", "34", "cts", "vs", "per" as its children]

Slide 56: Naïve Bayes for text classification

– The words depend on the topic: P(w_i | Topic)
  – e.g., P(cts | earn) > P(tennis | earn)
– Naïve Bayes assumption: all words are independent given the topic
– From the training set we learn the probabilities P(w_i | Topic) for each word and each topic

[Figure: graphical model with a Topic node and word nodes w_1, w_2, ..., w_(n-1), w_n as its children]

Slide 57: Naïve Bayes for text classification

To classify a new example, calculate P(Topic | w_1, w_2, ..., w_n) for each topic.

Bayes decision rule: choose the topic T' for which
P(T' | w_1, w_2, ..., w_n) > P(T | w_1, w_2, ..., w_n) for each T ≠ T'

[Figure: graphical model with a Topic node and word nodes w_1, w_2, ..., w_(n-1), w_n as its children]
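A minimal from-scratch sketch of this decision rule, scoring each topic with log P(T) + sum_i log P(w_i | T); the toy training documents and the add-one smoothing are assumptions, not from the slides:

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative training set: (topic, document) pairs treated as bags of words.
train = [("earn",   "shr 34 cts vs 20 cts per shr"),
         ("earn",   "net profit cts vs loss"),
         ("sports", "tennis match win set"),
         ("sports", "tennis player wins match")]

topic_counts = Counter(t for t, _ in train)
word_counts = defaultdict(Counter)
for topic, doc in train:
    word_counts[topic].update(doc.split())
vocab = {w for c in word_counts.values() for w in c}

def classify(doc):
    """Choose the topic maximizing log P(T) + sum_i log P(w_i | T)."""
    best_topic, best_score = None, -math.inf
    for topic in topic_counts:
        score = math.log(topic_counts[topic] / len(train))
        total = sum(word_counts[topic].values())
        for w in doc.split():
            # add-one smoothing so unseen words do not zero out the product
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

print(classify("shr cts vs"))       # -> "earn"
print(classify("tennis match"))     # -> "sports"
```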

Slide 59: Naïve Bayes: Strengths

– Very simple model
  – Easy to understand
  – Very easy to implement
– Very efficient: fast training and classification
– Modest storage requirements
– Widely used because it works somewhat well for text categorization
– Linear, but non-parallel, decision boundaries

Slide 60: Naïve Bayes: Weaknesses

– The Naïve Bayes independence assumption ignores the sequential ordering of words (uses a bag-of-words model).
– The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables.
– But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates.

Slide 61: Multinomial Naïve Bayes

(Based on a paper by McCallum & Nigam '98)
– Features are the number of times words occur in the document, not binary (present/absent) indicators.
– Uses a statistical formula known as the multinomial distribution.
– The authors compared, on several text classification tasks:
  – Multinomial Naïve Bayes
  – Binary-featured, multi-variate Bernoulli-distributed Naïve Bayes

Results:
– Multinomial is much better when using large vocabularies.
– However, they note that the Bernoulli model can handle other features (e.g., from-title) as numbers, whereas these will confuse the multinomial version.

Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.
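A hedged sketch of the count-based (multinomial) versus binary (Bernoulli) event models using scikit-learn; the library and the toy documents are assumptions added here, not part of the McCallum & Nigam paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["shr 34 cts vs 20 cts per shr", "net profit cts vs loss",
        "tennis match win set", "tennis player wins match"]
labels = ["earn", "earn", "sports", "sports"]

# Multinomial NB uses word *counts*; Bernoulli NB binarizes them to presence/absence.
counts = CountVectorizer().fit(docs)
X = counts.transform(docs)
print(MultinomialNB().fit(X, labels).predict(counts.transform(["cts vs shr"])))   # -> earn
print(BernoulliNB().fit(X, labels).predict(counts.transform(["tennis match"])))   # -> sports
```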

Slide 62: k Nearest Neighbor Classification

– Nearest neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor.
– k-Nearest Neighbor (kNN): consult the k nearest neighbors; the decision is based on the majority category of these neighbors. More robust than k = 1.
– An example of a similarity measure often used in NLP is cosine similarity; see the sketch below.
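A minimal sketch of kNN with cosine similarity and an unweighted majority vote; the toy term-count vectors are illustrative:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, the similarity measure mentioned on the slide."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def knn_classify(X_train, y_train, x, k=3):
    """Assign the majority category among the k most similar training items."""
    sims = [cosine(xi, x) for xi in X_train]
    top_k = np.argsort(sims)[-k:]                 # indices of the k nearest neighbors
    votes = [y_train[i] for i in top_k]
    return max(set(votes), key=votes.count)

# Toy term-count vectors (illustrative).
X_train = np.array([[3.0, 0.0, 1.0], [2.0, 1.0, 0.0], [0.0, 3.0, 2.0], [0.0, 2.0, 3.0]])
y_train = ["earn", "earn", "sports", "sports"]
print(knn_classify(X_train, y_train, np.array([1.0, 0.0, 0.0]), k=3))  # -> "earn"
```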

Slide 63: 1-Nearest Neighbor

Slide 64: 1-Nearest Neighbor

Slide 65: 3-Nearest Neighbor

Slide 66: 3-Nearest Neighbor

– Assign the category of the majority of the neighbors.
– But one neighbor may be closer than the others: we can weight neighbors according to their similarity.

Slide 67: k Nearest Neighbor Classification

Strengths:
– Robust
– Conceptually simple
– Often works well
– Powerful (arbitrary decision boundaries)

Weaknesses:
– Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k)
– Finding a good similarity measure can be difficult
– Computationally expensive

Slide 68: Summary

Algorithms for Classification
– Linear versus non-linear classification

Binary classification:
– Perceptron
– Winnow
– Support Vector Machines (SVM)
– Kernel Methods
– Multilayer Neural Networks

Multi-class classification:
– Decision Trees
– Naïve Bayes
– k-Nearest Neighbor

Slide 69: Next Time

– More learning algorithms
– Clustering