Text Classification Slides by Tom Mitchell (NB), William Cohen (kNN), Ray Mooney and others at...
Text Classification
Slides by
Tom Mitchell (NB),
William Cohen (kNN),
Ray Mooney and others at UT-Austin,
me
CIS 8590 – Fall 2008 NLP
Outline
• Problem definition and applications
• Representation
  – Vector Space Model (and variations)
  – Feature Selection
• Classification Techniques
  – Naïve Bayes
  – k-Nearest Neighbor
• Issues and Discussion
  – Representations and independence assumptions
  – Sparsity and smoothing
  – Bias-variance tradeoff
Spam or not Spam?
• Most people who’ve ever used email have developed a hatred of spam
• In the days before Gmail (and still sometimes today), you could get hundreds of spam messages per day.
• “Spam Filters” were developed to automatically classify, with high (but not perfect) accuracy, which messages are spam and which aren’t.
Text Classification Problem
Let D be the space of all possible documents
Let C be the space of possible classes
Let H be the space of all possible hypotheses (or classifiers)
Input: a labeled sample X = {<d,c> | d in D and c in C}
Output: a hypothesis h in H: D → C for predicting, with high accuracy, the class of previously unseen documents
Example Applications
• News topic classification (e.g., Google News)
C={politics,sports,business,health,tech,…}
• “SafeSearch” filtering
C={porn, not porn}
• Language classification
C={English,Spanish,Chinese,…}
• Sentiment classification
C={positive review,negative review}
• Email sorting
C={spam, meeting reminders, invitations, …} – user-defined!
Outline
• Problem definition and applications
• Representation
  – Vector Space Model (and variations)
  – Feature Selection
• Classification Techniques
  – Naïve Bayes
  – k-Nearest Neighbor
• Issues and Discussion
  – Representations and independence assumptions
  – Sparsity and smoothing
  – Bias-variance tradeoff
Vector Space Model
Idea: represent each document as a vector.
• Why bother?
• How can we make a document into a vector?
Documents as Vectors
Example:
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

|    | yes | we | got | no | bananas | what | you | I | like |
|----|-----|----|-----|----|---------|------|-----|---|------|
| V1 | 1   | 1  | 1   | 1  | 1       | 0    | 0   | 0 | 0    |
| V2 | 0   | 0  | 1   | 0  | 0       | 1    | 1   | 0 | 0    |
| V3 | 1   | 0  | 1   | 0  | 0       | 1    | 1   | 1 | 1    |
Documents as Vectors
Generically, we convert a document into a vector by:
1. Determining the vocabulary V, the set of all terms in the collection of documents
2. Computing, for each document d, a score s_v(d) for every term v in V
   – For instance, s_v(d) could be the number of times v appears in d.
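As a concrete sketch of the two steps above (plain Python, not from the slides; the variable names are illustrative), applied to documents D1–D3 from the earlier example:

```python
# Slide example documents.
docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}

# Step 1: the vocabulary V = all terms in the collection, in first-seen order.
vocab = []
for text in docs.values():
    for term in text.split():
        if term not in vocab:
            vocab.append(term)

# Step 2: a score s_v(d) for every term v; here the binary score (1 if present).
vectors = {name: [1 if v in text.split() else 0 for v in vocab]
           for name, text in docs.items()}
```

With these documents the vocabulary comes out as yes, we, got, no, bananas, what, you, I, like, and the three vectors match the table on the earlier slide.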
Why Bother?
The vector space model has a number of limitations (discussed later).
But two major benefits:
1. Convenience (notational & mathematical)
2. It’s well understood – that is, there are a lot of side benefits, like similarity and distance metrics, that you get for free.
Handy Tools
• Euclidean distance and norm
• Cosine similarity
• Dot product
Measuring Similarity
Similarity metric:
the size of the angle between document vectors.
“Cosine Similarity”:
CS(v1, v2) = cos(θ) = (v1 · v2) / (‖v1‖ ‖v2‖)
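A minimal implementation of the cosine similarity formula above (a sketch, not from the slides):

```python
import math

def cosine_similarity(v1, v2):
    """CS(v1, v2) = (v1 . v2) / (||v1|| ||v2||)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)
```

On the earlier example, V1 and V2 share only “got”, so CS(V1, V2) = 1/√15 ≈ 0.26, while any vector's similarity to itself is 1.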
Variations of the VSM
• What should we include in V?
  – Stoplists
  – Phrases and ngrams
  – Feature selection
• How should we compute s_v(d)?
  – Binary (Bernoulli)
  – Term frequency (TF) (multinomial)
  – Inverse Document Frequency (IDF)
  – TF-IDF
  – Length normalization
  – Other …
What should we include in the Vocabulary?
Example:Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”
• All three documents include “got”, so it’s not very informative for discriminating between the documents.
• In general, we’d like to include all and only the informative features.
Zipf Distribution of Language
Languages contain a few high-frequency words,
a large number of medium frequency words,
and a ton of low-frequency words.
Stop words and stop lists
A simple way to get rid of uninteresting features is to eliminate the high-frequency ones
These are often called “stop words”
- e.g., “the”, “of”, “you”, “got”, “was”, etc.
Systems often contain a list (“stop list”) of ~100 stop words, which are pruned from the vocabulary
Beyond terms
It would be great to include multi-word features like “New York”, rather than just “New” and “York”
But: including all pairs of words, or all consecutive pairs of words, as features creates WAY too many to deal with.
In order to include such features, we need to know more about feature selection (upcoming)
Variations of the VSM
• What should we include in V?
  – Stoplists
  – Phrases and ngrams
  – Feature selection
• How should we compute s_v(d)?
  – Binary (Bernoulli)
  – Term frequency (TF) (multinomial)
  – Inverse Document Frequency (IDF)
  – TF-IDF
  – Length normalization
  – Other …
Score for a feature in a document
Example:
Document: “yes we got no bananas no bananas we got we got”

|                | yes | we | got | no | bananas | what | you | I | like |
|----------------|-----|----|-----|----|---------|------|-----|---|------|
| Binary         | 1   | 1  | 1   | 1  | 1       | 0    | 0   | 0 | 0    |
| Term frequency | 1   | 3  | 3   | 2  | 2       | 0    | 0   | 0 | 0    |
Inverse Document Frequency
An alternative method of scoring a feature
Intuition: words that are common to many documents are less informative, so give them less weight.
IDF(v) = log (#Documents / #Documents containing v)
TF-IDF
Term Frequency Inverse Document Frequency
TF-IDF_v(d) = TF_v(d) × IDF(v)
            = (# times v occurs in d) × log (#Documents / #Documents containing v)
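A small sketch of the formula (not from the slides). The base of the logarithm is not stated; base 10 is assumed here because it reproduces the .18 and .48 entries on the next slide:

```python
import math

def idf(term, docs):
    """IDF(v) = log10(#documents / #documents containing v)."""
    containing = sum(1 for d in docs if term in d.split())
    return math.log10(len(docs) / containing)

def tf_idf(term, doc, docs):
    """TF-IDF_v(d) = (# times v occurs in d) * IDF(v)."""
    return doc.split().count(term) * idf(term, docs)
```

For the running three-document example, “got” appears in every document, so its TF-IDF is 0; “yes” (in 2 of 3 documents) gets log10(3/2) ≈ 0.18.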
TF-IDF weighted vectors
Example:
Document D1: “yes we got no bananas”
Document D2: “what you got”
Document D3: “yes I like what you got”

|    | yes | we | got | no | bananas | what | you | I | like |
|----|-----|----|-----|----|---------|------|-----|---|------|
| V1 | .18 | 1  | 1   | 1  | 1       | 0    | 0   | 0 | 0    |
| V2 | 0   | 0  | 1   | 0  | 0       | 1    | 1   | 0 | 0    |
| V3 | 1   | 0  | 1   | 0  | 0       | 1    | 1   | 1 | .48  |
Limitations
The vector space model has the following limitations:
• Long documents are poorly represented because they have poor similarity values.
• Search keywords must precisely match document terms; word substrings might result in a false positive match.
• Semantic sensitivity: documents with similar context but different term vocabulary won’t be associated, resulting in a false negative match.
• The order in which the terms appear in the document is lost in the vector space representation.
Outline
• Problem definition and applications
• Representation
  – Vector Space Model (and variations)
  – Feature Selection
• Classification Techniques
  – Naïve Bayes
  – k-Nearest Neighbor
• Issues and Discussion
  – Representations and independence assumptions
  – Sparsity and smoothing
  – Bias-variance tradeoff
Curse of Dimensionality

If the data x lies in a high-dimensional space, then an enormous amount of data is required to learn distributions, decision rules, or clusters.

Example:
• 50 dimensions
• Each dimension has 2 possible values
• This gives a total of 2^50 ≈ 10^15 cells
• But the number of data samples will be far less; there will not be enough samples to learn from.
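The arithmetic behind that claim, checked directly:

```python
# 50 binary dimensions give 2^50 distinct cells.
cells = 2 ** 50

# Even a billion labeled samples would cover under a millionth of them.
samples = 10 ** 9
fraction = samples / cells
```

So almost every cell is empty, and any learner that needs per-cell statistics is starved of data.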
Dimensionality Reduction
• Goal: reduce the dimensionality of the space, while preserving distances
• Basic idea: find the dimensions that have the most variation in the data, and eliminate the others
• Many techniques (SVD, MDS)
• May or may not help
Feature Selection and Dimensionality Reduction in NLP
• TF-IDF (reduces the weight of some features, increases weight of others)
• Latent Semantic Analysis (LSA)
• Mutual Information (MI)
• Pointwise Mutual Information (PMI)
• Information Gain (IG)
• Chi-square or other independence tests
• Pure frequency
Mutual Information
What makes “Shanghai” a good feature for classifying a document as being about “China”?
Intuition: four cases
|           | +China      | -China      |
|-----------|-------------|-------------|
| +Shanghai | How common? | How common? |
| -Shanghai | How common? | How common? |
Mutual Information
What makes “Shanghai” a good feature for classifying a document as being about “China”?
Intuition: four cases
If all four cases are equally common, MI = 0.
|           | +China | -China |
|-----------|--------|--------|
| +Shanghai | X      | X      |
| -Shanghai | X      | X      |
Mutual Information
What makes “Shanghai” a good feature for classifying a document as being about “China”?
Intuition: four cases
MI grows when one (or two) case(s) becomes much more common than the others.
|           | +China | -China |
|-----------|--------|--------|
| +Shanghai | 10X    | X      |
| -Shanghai | X      | 0      |
Mutual Information
What makes “Shanghai” a good feature for classifying a document as being about “China”?
Intuition: four cases
That’s also the case where the feature is useful!
|           | +China | -China |
|-----------|--------|--------|
| +Shanghai | 10X    | X      |
| -Shanghai | X      | 0      |
Mutual Information
MI(term t, class c) =
  P(+t, +c) log [P(+t, +c) / (P(+t) P(+c))] +
  P(−t, +c) log [P(−t, +c) / (P(−t) P(+c))] +
  P(+t, −c) log [P(+t, −c) / (P(+t) P(−c))] +
  P(−t, −c) log [P(−t, −c) / (P(−t) P(−c))]
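The four-case intuition translates directly into code. A sketch (not from the slides) that computes MI from a 2×2 contingency table of counts:

```python
import math

def mutual_information(counts):
    """MI(t, c) from a 2x2 table of counts:
    counts[0] = [+t+c, +t-c], counts[1] = [-t+c, -t-c]."""
    total = sum(sum(row) for row in counts)
    p_t = [sum(row) / total for row in counts]          # P(+t), P(-t)
    p_c = [sum(col) / total for col in zip(*counts)]    # P(+c), P(-c)
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p_joint = counts[i][j] / total
            if p_joint > 0:  # 0 log 0 is taken as 0
                mi += p_joint * math.log(p_joint / (p_t[i] * p_c[j]))
    return mi
```

As the slides say: equal counts in all four cells give MI = 0, while a table concentrated on the diagonal (term and class perfectly associated) gives the maximum, log 2, for a binary class.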
Pointwise Mutual Information
What makes “Shanghai” a good feature for classifying a document as being about “China”?
PMI focuses on just the (+, +) case:
How much more likely than chance is it for Shanghai to appear in a document about China?
|           | +China | -China |
|-----------|--------|--------|
| +Shanghai | 10X    | X      |
| -Shanghai | X      | 0      |
Pointwise Mutual Information
PMI(term t, class c) = P(+t, +c) log [P(+t, +c) / (P(+t) P(+c))]
Outline
• Problem definition and applications
• Representation
  – Vector Space Model (and variations)
  – Feature Selection
• Classification Techniques
  – Naïve Bayes
  – k-Nearest Neighbor
• Issues and Discussion
  – Representations and independence assumptions
  – Sparsity and smoothing
  – Bias-variance tradeoff
Classification Techniques
We’ll discuss three:
– Naïve Bayes
– k-Nearest Neighbor
– Support Vector Machines
Bayes Rule
P(h | D) = P(D | h) P(h) / P(D)

Which is shorthand for: for every hypothesis h and observed data d,

P(H = h | D = d) = P(D = d | H = h) P(H = h) / P(D = d)
For code, see www.cs.cmu.edu/~tom/mlbook.html click on “Software and Data”
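The Naïve Bayes slides in this stretch survive only as images in the transcript. The classifier they build up to, reconstructed from Mitchell's standard formulation (not from the transcript itself), is:

\[
v_{NB} = \operatorname*{argmax}_{v_j \in V} \; P(v_j) \prod_i P(a_i \mid v_j)
\]

where the \(a_i\) are the attribute values of the instance and \(v_j\) ranges over the possible target values; the product reflects the naïve assumption that attributes are conditionally independent given the class.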
How can we implement this if the ai are continuous-valued attributes?
Also called “Gaussian distribution”
Gaussian
Assume P(ai|vj) follows Gaussian distribution, use training data to estimate its mean and variance
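A minimal sketch of that recipe (not from the slides): estimate a per-class, per-attribute mean and variance from the training values, then use the Gaussian density as P(a_i | v_j):

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of one attribute for one class."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x, used as the likelihood P(a_i | v_j)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

These densities drop into the Naïve Bayes product in place of the discrete attribute probabilities.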
K-nearest neighbor methods
William Cohen
10-601 April 2008
BellCore’s MovieRecommender
• Participants sent email to [email protected]
• System replied with a list of 500 movies to rate on a 1-10 scale (250 random, 250 popular)
  – Only a subset needed to be rated
• New participant P sends in rated movies via email
• System compares ratings for P to ratings of (a random sample of) previous users
• Most similar users are used to predict scores for unrated movies (more later)
• System returns recommendations in an email message.
Suggested Videos for: John A. Jamus. Your must-see list with predicted ratings:
• 7.0 "Alien (1979)"
• 6.5 "Blade Runner"
• 6.2 "Close Encounters Of The Third Kind (1977)"
Your video categories with average ratings:
• 6.7 "Action/Adventure"
• 6.5 "Science Fiction/Fantasy"
• 6.3 "Children/Family"
• 6.0 "Mystery/Suspense"
• 5.9 "Comedy"
• 5.8 "Drama"
The viewing patterns of 243 viewers were consulted. Patterns of 7 viewers were found to be most similar. Correlation with target viewer:
• 0.59 viewer-130 ([email protected])
• 0.55 bullert, jane r ([email protected])
• 0.51 jan_arst ([email protected])
• 0.46 Ken Cross ([email protected])
• 0.42 rskt ([email protected])
• 0.41 kkgg ([email protected])
• 0.41 bnn ([email protected])
By category, their joint ratings recommend:
• Action/Adventure:
  – "Excalibur" 8.0, 4 viewers
  – "Apocalypse Now" 7.2, 4 viewers
  – "Platoon" 8.3, 3 viewers
• Science Fiction/Fantasy:
  – "Total Recall" 7.2, 5 viewers
• Children/Family:
  – "Wizard Of Oz, The" 8.5, 4 viewers
  – "Mary Poppins" 7.7, 3 viewers
• Mystery/Suspense:
  – "Silence Of The Lambs, The" 9.3, 3 viewers
• Comedy:
  – "National Lampoon's Animal House" 7.5, 4 viewers
  – "Driving Miss Daisy" 7.5, 4 viewers
  – "Hannah and Her Sisters" 8.0, 3 viewers
• Drama:
  – "It's A Wonderful Life" 8.0, 5 viewers
  – "Dead Poets Society" 7.0, 5 viewers
  – "Rain Man" 7.5, 4 viewers
Correlation of predicted ratings with your actual ratings is: 0.64. This number measures ability to evaluate movies accurately for you. 0.15 means low ability. 0.85 means very good ability. 0.50 means fair ability.
Algorithms for Collaborative Filtering 1: Memory-Based Algorithms (Breese et al, UAI98)
• v_{i,j} = vote of user i on item j
• I_i = items for which user i has voted
• Mean vote for user i is the average of v_{i,j} over the items in I_i
• Predicted vote for the “active user” a is a weighted sum of other users’ votes, using the weights of the n most similar users and a normalizer
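The formulas themselves appear as images in the transcript; in Breese et al.'s notation they are (reconstructed from the paper, not the slide):

\[
\bar{v}_i = \frac{1}{|I_i|} \sum_{j \in I_i} v_{i,j}
\qquad
p_{a,j} = \bar{v}_a + \kappa \sum_{i=1}^{n} w(a,i)\,\bigl(v_{i,j} - \bar{v}_i\bigr)
\]

where \(w(a,i)\) is the similarity weight between the active user \(a\) and user \(i\), and \(\kappa\) is the normalizer.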
Basic k-nearest neighbor classification
• Training method:
  – Save the training examples
• At prediction time:
  – Find the k training examples (x1, y1), …, (xk, yk) that are closest to the test example x
  – Predict the most frequent class among those yi’s
• Example: http://cgm.cs.mcgill.ca/~soss/cs644/projects/simard/
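The whole method fits in a few lines. A sketch (not from the slides) using Euclidean distance and a majority vote:

```python
from collections import Counter
import math

def knn_predict(train, x, k=3):
    """Predict the most frequent class among the k training
    examples (x_i, y_i) closest (Euclidean distance) to x."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Note that "training" really is just saving the examples; all the work happens at prediction time.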
What is the decision boundary? The Voronoi diagram of the training points.
Convergence of 1-NN
[Figure: test point x with label y, and its nearest neighbor x′ with label y′, with conditional distributions P(Y | x) and P(Y | x′).]

P(knnError) = 1 − Σ_y Pr(y | x) Pr(y | x′)
  ≈ 1 − Σ_y Pr(y | x)²   (assume Pr(y | x′) ≈ Pr(y | x); let y* = argmax_y Pr(y | x))
  ≤ 2 (1 − Pr(y* | x))
  = 2 × (Bayes optimal error rate)
Basic k-nearest neighbor classification
• Training method:
  – Save the training examples
• At prediction time:
  – Find the k training examples (x1, y1), …, (xk, yk) that are closest to the test example x
  – Predict the most frequent class among those yi’s
• Improvements:
  – Weighting examples from the neighborhood
  – Measuring “closeness”
  – Finding “close” examples in a large training set quickly
K-NN and irrelevant features
[Figure: + and o training examples along one informative feature, with a query point “?”.]
K-NN and irrelevant features
[Figure: the same + and o examples scattered across an added irrelevant feature dimension, with query point “?”.]
K-NN and irrelevant features
[Figure: + and o examples with the irrelevant feature dimension; the neighbors nearest to the query “?” change.]
Ways of rescaling for KNN
• Normalized L1 distance
• Scale by IG (information gain)
• Modified value distance metric
Ways of rescaling for KNN
• Dot product
• Cosine distance
• TF-IDF weights for text: for doc j, feature i: x_i = tf_{i,j} · idf_i, where tf_{i,j} = # occurrences of term i in doc j, and idf_i = log (#docs in corpus / #docs in corpus that contain term i)
Combining distances to neighbors

Standard KNN:

C(D, y) = |{(x′, y′) ∈ D : y′ = y}|
ŷ = argmax_y C(Neighbors(x), y)

Distance-weighted KNN:

C(D, y) = Σ_{(x′, y′) ∈ D : y′ = y} SIM(x, x′),  where SIM(x, x′) = 1 − Δ(x, x′)
ŷ = argmax_y C(Neighbors(x), y)

or:

C(D, y) = Σ_{(x′, y′) ∈ D : y′ = y} 1 / (1 − SIM(x, x′))
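A sketch of the distance-weighted variant (not from the slides; the similarity function here is 1/(1 + distance), an assumed choice rather than the slide's exact definition):

```python
import math

def weighted_knn_predict(train, x, k=3):
    """Distance-weighted vote: C(D, y) sums SIM(x, x') over the k
    nearest neighbors (x', y') with y' = y, then picks argmax_y."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    scores = {}
    for xp, label in nearest:
        # SIM(x, x') = 1 / (1 + distance): closer neighbors vote harder.
        scores[label] = scores.get(label, 0.0) + 1.0 / (1.0 + math.dist(xp, x))
    return max(scores, key=scores.get)
```

With weighting, a single very close neighbor can outvote two distant ones, which plain majority voting cannot do.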
Computing KNN: pros and cons
• Storage: all training examples are saved in memory
  – A decision tree or linear classifier is much smaller
• Time: to classify x, you need to loop over all training examples (x′, y′) to compute the distance between x and x′.
  – However, you get predictions for every class y
• KNN is nice when there are many, many classes
  – Actually, there are some tricks to speed this up… especially when the data is sparse (e.g., text)
Efficiently implementing KNN (for text)
IDF is nice computationally
Tricks with fast KNN

K-means using r-NN:
1. Pick k points c1 = x1, …, ck = xk as centers
2. For each center ci, find Di = Neighborhood(ci)
3. For each center ci, let ci = mean(Di)
4. Go to step 2…
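The loop above is essentially k-means. A small sketch, where step 2's r-NN neighborhood search is replaced by the usual assign-each-point-to-its-nearest-center step:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """k-means following the slide's loop: pick the first k points as
    centers, find each center's neighborhood (here: the points nearest
    to it), move each center to the mean of its neighborhood, repeat."""
    centers = X[:k].copy()                          # step 1
    for _ in range(iters):                          # step 4: repeat
        # step 2: neighborhood of each center = points assigned to it
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # step 3: move each center to the mean of its neighborhood
        for c in range(k):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    return centers, assign

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
centers, assign = kmeans(X, k=2)
```

The speed-up the slide hints at comes from using a fast r-NN search (e.g., an inverted index over sparse text vectors) inside step 2.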
Efficiently implementing KNN
Selective classification: given a training set and test set, find the N test cases that you can most confidently classify
Support Vector Machines
Slides by Ray Mooney et al.
U. Texas at Austin machine learning group
Perceptron Revisited: Linear Separators
• Binary classification can be viewed as the task of separating classes in feature space:
f(x) = sign(wᵀx + b)

The separating hyperplane is wᵀx + b = 0; examples with wᵀx + b > 0 fall on one side, those with wᵀx + b < 0 on the other.
Linear Separators
• Which of the linear separators is optimal?
Classification Margin

• Distance from example xi to the separator is r = yi(wᵀxi + b) / ||w||
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the distance between support vectors.
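To make the distance formula concrete, here is a small numeric check with a hypothetical separator w, b (values chosen purely for illustration):

```python
import numpy as np

w = np.array([1.0, 1.0])     # hypothetical separator w^T x + b = 0
b = -3.0

X = np.array([[1.0, 1.0], [0.0, 1.0], [3.0, 2.0], [2.0, 2.0]])
y = np.array([-1, -1, 1, 1])

# Distance of each example to the hyperplane: r_i = y_i (w^T x_i + b) / ||w||
r = y * (X @ w + b) / np.linalg.norm(w)

# Margin of the separator: twice the distance of the closest example
margin = 2 * r.min()
```

All r_i are positive, so this w, b correctly separates the toy data; the closest points on either side are the support vectors.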
Maximum Margin Classification

• Maximizing the margin is good according to intuition and PAC theory.
• It implies that only support vectors matter; other training examples are ignorable.
Linear SVM Mathematically

• Let the training set {(xi, yi)}, i = 1..n, xi ∈ Rᵈ, yi ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xi, yi):

    wᵀxi + b ≤ −ρ/2  if yi = −1
    wᵀxi + b ≥  ρ/2  if yi =  1

  or, combined:  yi(wᵀxi + b) ≥ ρ/2

• For every support vector xs the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xs and the hyperplane is

    r = ys(wᵀxs + b) / ||w|| = 1 / ||w||

• Then the margin can be expressed through the (rescaled) w and b as:

    ρ = 2r = 2 / ||w||
Linear SVMs Mathematically (cont.)

• Then we can formulate the quadratic optimization problem:

    Find w and b such that ρ = 2/||w|| is maximized
    and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1

  Which can be reformulated as:

    Find w and b such that Φ(w) = ||w||² = wᵀw is minimized
    and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1
Solving the Optimization Problem

• Need to optimize a quadratic function subject to linear constraints.
• Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist.
• The solution involves constructing a dual problem where a Lagrange multiplier αi is associated with every inequality constraint in the primal (original) problem:

  Primal:  Find w and b such that Φ(w) = wᵀw is minimized
           and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1

  Dual:    Find α1…αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized and
           (1) Σαiyi = 0
           (2) αi ≥ 0 for all αi
The Optimization Problem Solution

• Given a solution α1…αn to the dual problem, the solution to the primal is:

    w = Σαiyixi        b = yk − Σαiyi xiᵀxk  for any αk > 0

• Each non-zero αi indicates that the corresponding xi is a support vector.
• Then the classifying function is (note that we don’t need w explicitly):

    f(x) = Σαiyi xiᵀx + b

• Notice that it relies on an inner product between the test point x and the support vectors xi – we will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products xiᵀxj between all training points.
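A tiny worked example of recovering w, b, and f(x) from a dual solution. For this two-point toy problem the dual solution is α = [0.5, 0.5] (worked out by hand, not by a QP solver):

```python
import numpy as np

# Toy hard-margin problem: two points, each a support vector
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])   # dual solution for this toy problem

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# b = y_k - w^T x_k for any support vector (alpha_k > 0)
b = y[1] - w @ X[1]

def f(x):
    """Classifying function f(x) = sign(sum_i alpha_i y_i x_i^T x + b),
    computed without forming w explicitly."""
    return np.sign((alpha * y * (X @ x)).sum() + b)
```

The recovered separator is the midline x1 = 1, i.e. w = (1, 0), b = −1, and the dual constraint Σαiyi = 0 holds.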
Soft Margin Classification

• What if the training set is not linearly separable?
• Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
Soft Margin Classification Mathematically

• The old formulation:

    Find w and b such that Φ(w) = wᵀw is minimized
    and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1

• Modified formulation incorporates slack variables:

    Find w and b such that Φ(w) = wᵀw + CΣξi is minimized
    and for all (xi, yi), i = 1..n:  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0

• Parameter C can be viewed as a way to control overfitting: it “trades off” the relative importance of maximizing the margin and fitting the training data.
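As a rough sketch of what the soft-margin objective optimizes, here is subgradient descent on the equivalent primal hinge-loss form ½‖w‖² + C Σ max(0, 1 − yi(wᵀxi + b)). This is not the QP/dual route the slides describe, just an illustration of the C trade-off on toy data:

```python
import numpy as np

def train_soft_margin(X, y, C=1.0, lr=0.1, epochs=500):
    """Full-batch subgradient descent on the soft-margin primal:
    1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # examples inside the margin
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= (lr / n) * grad_w
        b -= (lr / n) * grad_b
    return w, b

# Two well-separated toy classes
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
w, b = train_soft_margin(X, y)
pred = np.sign(X @ w + b)
```

Larger C punishes margin violations more harshly (closer to hard-margin); smaller C tolerates slack in exchange for a wider margin.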
Soft Margin Classification – Solution

• The dual problem is identical to the separable case (it would not be identical if the 2-norm penalty for slack variables, CΣξi², were used in the primal objective; we would then need additional Lagrange multipliers for the slack variables):

    Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized and
    (1) Σαiyi = 0
    (2) 0 ≤ αi ≤ C for all αi

• Again, xi with non-zero αi will be support vectors.
• The solution to the dual problem is:

    w = Σαiyixi        b = yk(1 − ξk) − Σαiyi xiᵀxk  for any k s.t. αk > 0

  Again, we don’t need to compute w explicitly for classification:

    f(x) = Σαiyi xiᵀx + b
Theoretical Justification for Maximum Margins

• Vapnik has proved the following:

  The class of optimal linear separators has VC dimension h bounded from above as

    h ≤ min(⌈D²/ρ²⌉, m0) + 1

  where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m0 is the dimensionality.

• Intuitively, this implies that regardless of dimensionality m0 we can minimize the VC dimension by maximizing the margin ρ.
• Thus, the complexity of the classifier is kept small regardless of dimensionality.
Linear SVMs: Overview
• The classifier is a separating hyperplane.
• Most “important” training points are support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points xi are support vectors with non-zero Lagrangian multipliers αi.
• Both in the dual formulation of the problem and in the solution training points appear only inside inner products:
    Find α1…αN such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj xiᵀxj is maximized and
    (1) Σαiyi = 0
    (2) 0 ≤ αi ≤ C for all αi

    f(x) = Σαiyi xiᵀx + b
Non-linear SVMs

• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about… mapping data to a higher-dimensional space:

  [Figure: 1-D points along the x axis that are not linearly separable become separable after mapping each point to (x, x²).]
Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
Φ: x → φ(x)
The “Kernel Trick”

• The linear classifier relies on an inner product between vectors, K(xi, xj) = xiᵀxj.
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes:

    K(xi, xj) = φ(xi)ᵀφ(xj)

• A kernel function is a function that is equivalent to an inner product in some feature space.
• Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)².

  Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):

    K(xi, xj) = (1 + xiᵀxj)²
              = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
              = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
              = φ(xi)ᵀφ(xj),  where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]

• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
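The slide's algebra can be checked numerically: the kernel computed directly in 2-D matches the inner product under the explicit 6-D feature map φ.

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (1 + x^T z)^2 on 2-D vectors."""
    x1, x2 = x
    return np.array([1.0,
                     x1 * x1,
                     np.sqrt(2) * x1 * x2,
                     x2 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

k_direct = (1.0 + x @ z) ** 2          # kernel, computed in 2-D
k_mapped = phi(x) @ phi(z)             # same value via the 6-D feature map
```

The two numbers agree exactly, which is the whole point: the 2-D computation stands in for an inner product in the 6-D space.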
What Functions are Kernels?

• For some functions K(xi, xj), checking that K(xi, xj) = φ(xi)ᵀφ(xj) can be cumbersome.
• Mercer’s theorem: every positive semi-definite symmetric function is a kernel.
• Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:

        | K(x1,x1)  K(x1,x2)  K(x1,x3)  …  K(x1,xn) |
        | K(x2,x1)  K(x2,x2)  K(x2,x3)  …  K(x2,xn) |
    K = |    …         …         …      …     …     |
        | K(xn,x1)  K(xn,x2)  K(xn,x3)  …  K(xn,xn) |
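A quick numerical sanity check of Mercer's condition: build the Gram matrix for a kernel (here the Gaussian/RBF kernel, one of the standard examples) on a few points and confirm it is symmetric positive semi-definite via its eigenvalues.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
n = len(X)

# Gram matrix K[i, j] = K(x_i, x_j)
K = np.array([[rbf(X[i], X[j]) for j in range(n)] for i in range(n)])

# For a valid kernel, K is symmetric and all eigenvalues are >= 0
eigenvalues = np.linalg.eigvalsh(K)
```

A function failing this check on some set of points cannot be an inner product in any feature space.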
Examples of Kernel Functions

• Linear: K(xi, xj) = xiᵀxj
  – Mapping Φ: x → φ(x), where φ(x) is x itself
• Polynomial of power p: K(xi, xj) = (1 + xiᵀxj)ᵖ
  – Mapping Φ: x → φ(x), where φ(x) has C(d + p, p) dimensions
• Gaussian (radial-basis function): K(xi, xj) = exp(−||xi − xj||² / 2σ²)
  – Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); the combination of functions for support vectors is the separator.
• The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
Non-linear SVMs Mathematically

• Dual problem formulation:

    Find α1…αn such that Q(α) = Σαi − ½ ΣΣ αiαj yiyj K(xi, xj) is maximized and
    (1) Σαiyi = 0
    (2) αi ≥ 0 for all αi

• The solution is:

    f(x) = Σαiyi K(xi, x) + b

• Optimization techniques for finding the αi’s remain the same!
SVM applications

• SVMs were originally proposed by Boser, Guyon and Vapnik in 1992 and gained increasing popularity in the late 1990s.
• SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data.
• SVM techniques have been extended to a number of tasks such as regression [Vapnik et al. ’97], principal component analysis [Schölkopf et al. ’99], etc.
• The most popular optimization algorithms for SVMs use decomposition to hill-climb over a subset of αi’s at a time, e.g. SMO [Platt ’99] and SVMlight [Joachims ’99].
• Tuning SVMs remains a black art: selecting a specific kernel and parameters is usually done in a try-and-see manner.
Outline

• Problem definition and applications
• Representation
  – Vector Space Model (and variations)
  – Feature Selection
• Classification Techniques
  – Naïve Bayes
  – k-Nearest Neighbor
• Issues and Discussion
  – Representations and independence assumptions
  – Sparsity and smoothing
  – Bias-variance tradeoff
Other (text) classifiers to be aware of

• Rule-based systems
  – Decision lists (e.g., Ripper)
  – Decision trees (e.g., C4.5)
• Perceptron and Neural Networks
• Log-linear (maximum-entropy) models
Performance Comparison (?)

| Category  | NB   | Rocchio | Dec. Trees | kNN  | linear SVM (C=0.5) | linear SVM (C=1.0) | rbf-SVM |
|-----------|------|---------|------------|------|--------------------|--------------------|---------|
| earn      | 96.0 | 96.1    | 96.1       | 97.8 | 98.0               | 98.2               | 98.1    |
| acq       | 90.7 | 92.1    | 85.3       | 91.8 | 95.5               | 95.6               | 94.7    |
| money-fx  | 59.6 | 67.6    | 69.4       | 75.4 | 78.8               | 78.5               | 74.3    |
| grain     | 69.8 | 79.5    | 89.1       | 82.6 | 91.9               | 93.1               | 93.4    |
| crude     | 81.2 | 81.5    | 75.5       | 85.8 | 89.4               | 89.4               | 88.7    |
| trade     | 52.2 | 77.4    | 59.2       | 77.9 | 79.2               | 79.2               | 76.6    |
| interest  | 57.6 | 72.5    | 49.1       | 76.7 | 75.6               | 74.8               | 69.1    |
| ship      | 80.9 | 83.1    | 80.9       | 79.8 | 87.4               | 86.5               | 85.8    |
| wheat     | 63.4 | 79.4    | 85.5       | 72.9 | 86.6               | 86.8               | 82.4    |
| corn      | 45.2 | 62.2    | 87.7       | 71.4 | 87.5               | 87.8               | 84.6    |
| microavg. | 72.3 | 79.9    | 79.4       | 82.6 | 86.7               | 87.5               | 86.4    |

SVM classifier break-even F from (Joachims, 2002a, p. 114). Results are shown for the 10 largest categories and for microaveraged performance over all 90 categories on the Reuters-21578 data set.
Choosing a classifier

Technique     Train time        Test time     “Accuracy”   Interpretability   Bias-Variance         Data Complexity
Naïve Bayes   |W| + |C||V|      |C|·|Vd|      Medium-low   Medium             High-bias             Low
k-NN          |W|               |V|·|Vd|      Medium       Low                ?                     High
SVM           |C||D|³·|V|ave    |C|·|Vd|      High         Low                Mixed                 Medium-low
Neural Nets   ?                 |C|·|Vd|      High         Low                High-variance         High
Log-linear    ?                 |C|·|Vd|      High         Medium             High-variance/mixed   Medium
Ripper        ?                 ?             Medium       High               High-bias             ?

“Accuracy” – reputation for accuracy in experimental settings. Note that it is impossible to say beforehand which classifier will be most accurate on any given problem.
C = set of classes. W = bag of training tokens. V = set of training types. D = set of train docs. Vd = types in test document d. Vave = average number of types per doc in training.
Feature Engineering
• This is where domain experts and human judgment come into play.
• Not much to say… except that it matters a lot, often more than choosing a classifier.