Text classification
Uploaded by: david-hoen
Category: Technology
Transcript of Text classification
![Page 1: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/1.jpg)
Text Classification and Naïve Bayes
- An example of text classification
- Definition of a machine learning problem
- A refresher on probability
- The Naïve Bayes classifier
![Page 2: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/2.jpg)
Google News
![Page 3: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/3.jpg)
Different ways for classification
Human labor (people assign categories to every incoming article)
Hand-crafted rules for automatic classification
- If an article contains: stock, Dow, share, Nasdaq, etc. → Business
- If an article contains: set, breakpoint, player, Federer, etc. → Tennis
Machine learning algorithms
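
The hand-crafted-rules approach can be sketched in a few lines of Python. This is only an illustration: the `RULES` table and the function name `classify_by_rules` are made up here, and the keyword lists are just the two examples from the slide.

```python
from typing import Optional

# Hypothetical keyword rules, taken from the two examples on the slide.
RULES = {
    "Business": {"stock", "dow", "share", "nasdaq"},
    "Tennis": {"set", "breakpoint", "player", "federer"},
}

def classify_by_rules(article: str) -> Optional[str]:
    """Return the first category whose keyword set overlaps the article."""
    # Lowercase each word and strip surrounding punctuation.
    words = {w.strip(".,;:!?\"'").lower() for w in article.split()}
    for category, keywords in RULES.items():
        if words & keywords:
            return category
    return None  # no rule fired

print(classify_by_rules("Nasdaq closes higher as tech stock rally continues"))
print(classify_by_rules("Federer saves a breakpoint in the second set"))
```

Rules like these are easy to read but have to be maintained by hand, which is the motivation for learning the mapping from labeled examples instead.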
![Page 4: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/4.jpg)
What is Machine Learning?
Definition: A computer program is said to learn from experience E when its performance P at a task T improves with experience E.
Tom Mitchell, Machine Learning, 1997
Examples:
- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon
![Page 5: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/5.jpg)
Components of a ML System (1)
Experience (a set of examples that pair an input with the desired output for a task)
- Text categorization: document + category
- Speech recognition: spoken text + written text
Experience is referred to as Training Data. When labeled training data is available, we talk of Supervised Learning.
Performance metrics
- Error or accuracy on the Test Data
- Test Data are not present in the Training Data
- When there is little training data, methods like ‘leave-one-out’ or ‘ten-fold cross-validation’ are used to measure error.
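
A minimal sketch of how the folds for cross-validation could be generated, assuming the examples are simply numbered `0..n-1`. The helper name `k_fold_splits` is hypothetical, and the sketch does no shuffling, which a real evaluation would normally add.

```python
def k_fold_splits(n_examples: int, k: int):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    for fold in range(k):
        test = indices[fold::k]            # every k-th example, starting at `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Ten-fold cross-validation on 20 examples: each fold holds out 2 examples.
splits = list(k_fold_splits(n_examples=20, k=10))

# Leave-one-out is the special case k == n_examples.
loo = list(k_fold_splits(n_examples=5, k=5))
```

The error is measured on each held-out fold and averaged, so every example is used for testing exactly once and never appears in its own training set.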
![Page 6: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/6.jpg)
Components of a ML System (2)
Type of knowledge to be learned (known as the target function, which maps inputs to outputs)
Representation of the target function
- Decision trees
- Neural networks
- Linear functions
The learning algorithm
- C4.5 (learns decision trees)
- Gradient descent (learns a neural network)
- Linear programming (learns linear functions)
Task
![Page 7: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/7.jpg)
Defining Text Classification
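- $d \in \mathbb{X}$: the document in the multi-dimensional space $\mathbb{X}$
- $\mathbb{C} = \{c_1, c_2, \ldots, c_J\}$: a set of classes (categories, or labels)
- $\mathbb{D}$: the training set of labeled documents $\langle d, c \rangle \in \mathbb{X} \times \mathbb{C}$
- Target function: $\gamma: \mathbb{X} \to \mathbb{C}$, with $\gamma(d) = c$
- Learning algorithm: $\Gamma(\mathbb{D}) = \gamma$
Example: $\langle d, c \rangle = \langle$“Beijing joins the World Trade Organization”, China$\rangle$, so $\gamma(d) =$ China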
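
To make the roles of the training set, the learning algorithm and the target function concrete, here is a small typed sketch. The learner used here, `majority_class_learner`, is a deliberately trivial stand-in and not the method of these slides; the second and third training documents are invented.

```python
from collections import Counter
from typing import Callable, List, Tuple

Document = str                              # stand-in for a point d in the space X
Label = str                                 # an element of the class set C
Classifier = Callable[[Document], Label]    # the target function: X -> C

def majority_class_learner(training_set: List[Tuple[Document, Label]]) -> Classifier:
    """A trivial learning algorithm: the returned classifier always predicts
    the most frequent class in the training set and ignores the document."""
    most_common_label = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda document: most_common_label

D = [
    ("Beijing joins the World Trade Organization", "China"),
    ("Hypothetical second training document", "China"),
    ("Hypothetical third training document", "not China"),
]
gamma = majority_class_learner(D)           # learning algorithm applied to D
print(gamma("some unseen document"))
```

Any learning algorithm has this same shape: it consumes labeled pairs and returns a function from documents to classes.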
![Page 8: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/8.jpg)
Naïve Bayes Learning
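Learning Algorithm: Naïve Bayes
Target Function: $\gamma(d) = c$, where
$$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$
The generative process:
$$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c|d) = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$$
- $P(c)$: a priori probability of choosing a category
- $P(d|c)$: the conditional probability of generating $d$, given the fixed $c$
- $P(c|d)$: a posteriori probability that $c$ generated $d$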
![Page 9: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/9.jpg)
A Refresher on Probability
![Page 10: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/10.jpg)
Visualizing probability
A is a random variable that denotes an uncertain event.
Example: A = “I’ll get an A+ in the final exam”
P(A) is “the fraction of possible worlds where A is true”
[Figure: the event space of all possible worlds, with total area 1. A blue circle marks the worlds in which A is true; the rest are the worlds in which A is false. P(A) = area of the blue circle.]
Slide: Andrew W. Moore
![Page 11: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/11.jpg)
Axioms and Theorems of Probability
Axioms:
- 0 <= P(A) <= 1
- P(True) = 1
- P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)
Theorems:
- P(not A) = P(~A) = 1 - P(A)
- P(A) = P(A ^ B) + P(A ^ ~B)
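
The "fraction of possible worlds" reading makes these identities easy to check on a toy example. The sketch below assumes a made-up event space of 8 equally likely worlds; the sets `A` and `B` are arbitrary.

```python
from fractions import Fraction

# Hypothetical finite event space: 8 equally likely "worlds".
worlds = set(range(8))
A = {0, 1, 2}          # worlds in which A is true
B = {2, 3, 4, 5}       # worlds in which B is true

def P(event: set) -> Fraction:
    """Fraction of possible worlds in which the event is true."""
    return Fraction(len(event & worlds), len(worlds))

not_A = worlds - A
not_B = worlds - B

assert 0 <= P(A) <= 1
assert P(worlds) == 1 and P(set()) == 0
assert P(A.union(B)) == P(A) + P(B) - P(A.intersection(B))          # P(A or B)
assert P(not_A) == 1 - P(A)                                         # P(not A)
assert P(A) == P(A.intersection(B)) + P(A.intersection(not_B))      # split A on B
print(P(A), P(B), P(A.union(B)))
```

Exact fractions are used so that the equalities hold without floating-point rounding.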
![Page 12: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/12.jpg)
Conditional Probability
P(A|B) = the probability of A being true, given that we know that B is true
[Figure: overlapping regions labeled F and H]
H = “I have a headache”
F = “Coming down with flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2
Slide: Andrew W. Moore
Headaches are rare and flu even rarer, but if you got that flu, there is a 50-50 chance you’ll have a headache.
![Page 13: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/13.jpg)
Deriving the Bayes Rule
Conditional Probability:
$$P(A|B) = \frac{P(A \wedge B)}{P(B)}$$
Chain rule:
$$P(A \wedge B) = P(A|B)\,P(B)$$
$$P(A \wedge B) = P(B \wedge A) = P(B|A)\,P(A)$$
Bayes Rule:
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
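
As a worked instance, the headache and flu numbers from the conditional-probability slide (P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2) can be pushed through the chain rule and the Bayes rule. The variable names are illustrative only.

```python
from fractions import Fraction

p_h = Fraction(1, 10)          # P(H): "I have a headache"
p_f = Fraction(1, 40)          # P(F): "Coming down with flu"
p_h_given_f = Fraction(1, 2)   # P(H|F)

p_h_and_f = p_h_given_f * p_f      # chain rule: P(H ^ F) = P(H|F) P(F)
p_f_given_h = p_h_and_f / p_h      # Bayes rule: P(F|H) = P(H|F) P(F) / P(H)

print(p_h_and_f)     # 1/80
print(p_f_given_h)   # 1/8
```

So even though half of the flu cases come with a headache, a headache alone raises the probability of flu only from 1/40 to 1/8.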
![Page 14: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/14.jpg)
Back to the Naïve Bayes Classifier
![Page 15: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/15.jpg)
Deriving the Naïve Bayes
Bayes Rule:
$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$
Given two classes $c_1, c_2$ and the document $d'$:
$$P(c_1|d') = \frac{P(c_1)\,P(d'|c_1)}{P(d')} \qquad P(c_2|d') = \frac{P(c_2)\,P(d'|c_2)}{P(d')}$$
We are looking for a $c_i$ that maximizes the a-posteriori probability $P(c_i|d')$.
$P(d')$ (the denominator) is the same in both cases.
Thus:
$$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$$
![Page 16: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/16.jpg)
Estimating parameters for the target function
We are looking for the estimates $\hat{P}(c)$ and $\hat{P}(d|c)$.
P(c) is the fraction of possible worlds where c is true.
$$\hat{P}(c) = \frac{N_c}{N}$$
- $N$: number of all documents
- $N_c$: number of documents in class $c$
$d$ is a vector in the space $\mathbb{X}$, where each dimension is a term:
$$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle | c)$$
By using the chain rule $P(A \wedge B) = P(A|B)\,P(B)$, we have:
$$P(\langle t_1, t_2, \ldots, t_{n_d} \rangle | c) = P(t_1 | t_2, \ldots, t_{n_d}, c)\; P(t_2, \ldots, t_{n_d} | c) = \ldots$$
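
The prior estimate is just a relative frequency over the training labels. A minimal sketch, assuming the training set is available as one class label per document; the function name `estimate_priors` and the labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def estimate_priors(labels):
    """Return {class: N_c / N} for a list with one class label per document."""
    n = len(labels)
    return {c: Fraction(n_c, n) for c, n_c in Counter(labels).items()}

labels = ["sports", "sports", "politics", "sports", "politics"]  # hypothetical
priors = estimate_priors(labels)
print(priors["sports"], priors["politics"])
```

The estimates always sum to 1, because every document belongs to exactly one class.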
![Page 17: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/17.jpg)
Naïve assumptions of independence
1. All attribute values are independent of each other given the class. (conditional independence assumption)
2. The conditional probabilities for a term are the same independent of position in the document.
We assume the document is a “bag-of-words”.
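$$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle | c) = \prod_{1 \le k \le n_d} P(t_k|c)$$
Finally, we get the target function of Slide 8:
$$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$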
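
The bag-of-words assumption means a document is fully described by its term counts. A small sketch, assuming whitespace tokenization and lowercasing (both are simplifications chosen here, not something the slides specify):

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Map a document to term counts, discarding word order."""
    return Counter(document.lower().split())

d1 = bag_of_words("Chinese Beijing Chinese")
d2 = bag_of_words("Beijing Chinese Chinese")
print(d1 == d2)          # True: same terms, different order
print(d1["chinese"])     # 2
```

Since the product over term probabilities does not depend on the order of its factors, both documents receive the same score under the model.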
![Page 18: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/18.jpg)
Again about estimation
For each term $t$, we need to estimate $P(t|c)$:
$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
Because an estimate will be 0 if a term does not appear with a class in the training data, we need smoothing:
Laplace Smoothing
$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + |V|}$$
- $|V|$ is the number of terms in the vocabulary
- $T_{ct}$ is the count of term $t$ in all documents of class $c$
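
A minimal sketch of the add-one estimate for a single class, assuming the per-class term counts are already available. The function name `laplace_estimates`, the vocabulary and the counts are all invented for illustration.

```python
from collections import Counter
from fractions import Fraction

def laplace_estimates(term_counts: Counter, vocabulary: set) -> dict:
    """Return {t: (T_ct + 1) / (sum of T_ct' over V + |V|)} for every t in V."""
    total = sum(term_counts[t] for t in vocabulary)
    denominator = total + len(vocabulary)
    return {t: Fraction(term_counts[t] + 1, denominator) for t in vocabulary}

vocabulary = {"ball", "goal", "vote"}                 # hypothetical V
counts_in_class = Counter({"ball": 3, "goal": 1})     # hypothetical T_ct; "vote" is unseen
p = laplace_estimates(counts_in_class, vocabulary)
print(p["ball"], p["goal"], p["vote"])
```

The unseen term gets a small non-zero probability, and the estimates over the whole vocabulary still sum to 1.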
![Page 19: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/19.jpg)
An Example of classification with Naïve Bayes
![Page 20: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/20.jpg)
Example 13.1 (Part 1)
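
| Set | docID | Words in document | c = China? |
|---|---|---|---|
| Training set | 1 | Chinese Beijing Chinese | Yes |
| Training set | 2 | Chinese Chinese Shanghai | Yes |
| Training set | 3 | Chinese Macao | Yes |
| Training set | 4 | Tokyo Japan Chinese | No |
| Test set | 5 | Chinese Chinese Chinese Tokyo Japan | ? |

Two classes: “China” ($c$), “not China” ($\bar{c}$)
$N = 4$, $\hat{P}(c) = 3/4$, $\hat{P}(\bar{c}) = 1/4$
$V$ = {Beijing, Chinese, Japan, Macao, Shanghai, Tokyo}, so $|V| = 6$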
![Page 21: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/21.jpg)
Example 13.1 (Part 2)
Estimation (training and test sets as in Part 1; $c$ = “China”, $\bar{c}$ = “not China”)
$$\hat{P}(\text{Chinese}|c) = (5+1)/(8+6) = 3/7$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(8+6) = 1/14$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(3+6) = 2/9$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(3+6) = 2/9$$
Classification
$$\hat{P}(c|d) \propto \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$
$$\hat{P}(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$$
$$\hat{P}(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$$
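
The numbers of Example 13.1 can be reproduced with a short script. This is a sketch under the assumptions of the slides (whitespace-separated terms, Laplace smoothing, a six-term vocabulary built from the training documents); the variable and function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

training = [
    ("Chinese Beijing Chinese", "China"),
    ("Chinese Chinese Shanghai", "China"),
    ("Chinese Macao", "China"),
    ("Tokyo Japan Chinese", "not China"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocabulary = {t for text, _ in training for t in text.split()}
class_docs = Counter(label for _, label in training)      # N_c per class
term_counts = {c: Counter() for c in class_docs}          # T_ct per class
for text, label in training:
    term_counts[label].update(text.split())

def score(doc: str, c: str) -> Fraction:
    """Unnormalized posterior: P(c) times the product of smoothed P(t|c)."""
    total = sum(term_counts[c].values())
    result = Fraction(class_docs[c], len(training))
    for t in doc.split():
        result *= Fraction(term_counts[c][t] + 1, total + len(vocabulary))
    return result

scores = {c: score(test_doc, c) for c in class_docs}
for c, s in scores.items():
    print(c, s, round(float(s), 4))
print(max(scores, key=scores.get))
```

The score for “China” is the larger of the two, so the test document is assigned to that class: the three occurrences of Chinese outweigh the two negative indicators Tokyo and Japan.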
![Page 22: Text classification](https://reader035.fdocuments.net/reader035/viewer/2022070514/588250691a28ab37158b69b5/html5/thumbnails/22.jpg)
Summary: Miscellaneous
The running time of Naïve Bayes is linear in the time it takes to scan the data.
When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we maximize the sum of logarithms instead:
$$c_{MAP} = \arg\max_{c \in \mathbb{C}} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$
For a large training set, the vocabulary is large. It is better to select only a subset of terms, using “feature selection” (Section 13.5).
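
The underflow problem is easy to demonstrate. The sketch below uses invented numbers (a prior of 0.5 and one hundred term probabilities of 1e-4) purely to show the effect with double-precision floats.

```python
import math

prior = 0.5                      # hypothetical P(c)
term_probs = [1e-4] * 100        # hypothetical P(t_k|c) for a 100-term document

# Direct product: the true value is about 1e-400, below the smallest
# representable double, so the result underflows to exactly 0.0.
product = prior
for p in term_probs:
    product *= p

# Sum of logarithms: stays a finite, comparable number.
log_score = math.log(prior) + sum(math.log(p) for p in term_probs)

print(product)      # 0.0
print(log_score)
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability, so the argmax is unchanged.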