AN OPTIMIZED APPROACH FOR kNN TEXT CATEGORIZATION USING P-TREES
Imad Rahal and William Perrizo
Computer Science Department
North Dakota State University
Fargo, [email protected]
Outline
- The Text Categorization problem
- The P-tree technology
- Vector Space Model
- Proposed Solution
  - Intervalization (discretization)
  - P-tree representation
  - Similarity measures
  - Categorization algorithm
- Performance analysis study
Text categorization problem
- Text Categorization (topic spotting or text classification) is the process of assigning categories or labels to documents based entirely on their contents
- Problems:
  - Text has no explicit structure, unlike other data (e.g. relational data); information is described freely in the documents
  - (After introducing structure) a huge number of features
Motivation
- Increase in the number of text documents (on the Internet!)
  - Medical articles
  - Research publications
  - E-mails
  - News reports (e.g. Reuters)
  - Others
- Most algorithms fail to scale up because of the curse of dimensionality
- Most algorithms suffer from relatively low accuracy
The P-tree technology
A tree-like data structure that stores numeric (and categorical) relational data in bit-compressed format by splitting each attribute into bits and representing each bit position by a P-tree
Transformation to binary
Each binary column will form a P-tree
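The transformation to binary columns can be sketched as follows. This is a minimal illustration, assuming plain Python lists stand in for P-trees (a real P-tree would additionally compress each column into a quadrant tree); the function name `bit_slices` and the sample values are hypothetical.

```python
def bit_slices(values, width=8):
    """Return one bit column per bit position, most significant first.

    slices[j][i] is bit (width - 1 - j) of values[i].
    """
    return [[(v >> pos) & 1 for v in values]
            for pos in range(width - 1, -1, -1)]


# Hypothetical 8-bit attribute values for four rows.
column_a = [225, 7, 128, 225]
slices = bit_slices(column_a)

# Each printed column is the uncompressed content of one P-tree.
for pos, bits in zip(range(7, -1, -1), slices):
    print(f"bit {pos}: {bits}")
```

Each of the eight lists corresponds to one bit position of the attribute, which is the unit the slides call a P-tree.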
AND and OR operations
Complement operation
P-trees are characterized by:
- One-time creation cost
- Compression
- High-speed processing (ANDing, no DB scans)

The latest benchmark on P-tree ANDing has shown a speed of 6 ms for two 1320x1320 images (i.e. two bit sequences, each containing 1.6 million bits, represented using P-trees)
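- We have 8 P-trees in total for each attribute shown in the previous example: $P_{A,7}, P_{A,6}, P_{A,5}, P_{A,4}, P_{A,3}, P_{A,2}, P_{A,1}$ and $P_{A,0}$
- To query for a certain attribute value, say A = 1110 0001, we AND the P-tree of every 1 bit with the complement (marked ') of the P-tree of every 0 bit: $P_{A,11100001} = P_{A,7} \,\&\, P_{A,6} \,\&\, P_{A,5} \,\&\, P'_{A,4} \,\&\, P'_{A,3} \,\&\, P'_{A,2} \,\&\, P'_{A,1} \,\&\, P_{A,0}$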
- We can have varying bit precision. To query for A = 001 at 3-bit precision, we do the following: $P_{A,001} = P'_{A,2} \,\&\, P'_{A,1} \,\&\, P_{A,0}$
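A value query of this kind can be sketched with ordinary integers as uncompressed bit vectors. This is an illustration under stated assumptions, not the authors' implementation: each bit slice is a Python `int` whose bit i belongs to row i, and `build_slices` and `value_mask` are hypothetical helper names.

```python
def build_slices(values, width):
    """Slice p is an int whose bit i equals bit p of values[i]."""
    slices = [0] * width
    for row, v in enumerate(values):
        for pos in range(width):
            if (v >> pos) & 1:
                slices[pos] |= 1 << row
    return slices


def value_mask(slices, n_rows, value):
    """AND slice p (for a 1 bit) or its complement (for a 0 bit) over all p."""
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for pos, s in enumerate(slices):
        wanted = (value >> pos) & 1
        result &= s if wanted else (~s & all_rows)
    return result


# Hypothetical attribute column with four rows.
values = [0b11100001, 0b00000111, 0b10000000, 0b11100001]
slices = build_slices(values, width=8)

mask = value_mask(slices, len(values), 0b11100001)
matching_rows = [i for i in range(len(values)) if (mask >> i) & 1]
root_count = bin(mask).count("1")  # number of rows holding the queried value
```

The root count of the resulting mask plays the role of the root count of the ANDed P-tree: it tells how many rows match without scanning the rows themselves.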
Vector Space Model
- Each document is represented as a vector whose dimensions are the terms in the initial document collection
- Each vector coordinate corresponds to a term and has a numeric value that represents the term's relevance to the document. Usually, higher values imply higher relevance
Three popular weighting schemes are: Binary, TF, and TF*IDF.
The binary scheme uses the values 1 and 0 to reveal whether a term exists in the document or not
The term frequency (TF) scheme counts the occurrences of a term in a document. Usually measures are normalized to help overcome the problems associated with document length
The TF*IDF scheme multiplies the coordinate measure derived by the TF scheme by a global weight called the IDF. The IDF measure for term t is defined as $\log(N/N_t)$, where $N$ is the total number of documents and $N_t$ is the number of documents containing t. Cosine normalization is usually applied
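A small sketch of the TF*IDF weighting with cosine normalization may make the definition concrete. The base of the logarithm is not stated in the slides, so the natural logarithm is an assumption here, and `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, cosine-normalized vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # N_t: number of documents that contain term t at least once.
    doc_freq = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        w = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in w))
        # Cosine normalization: scale to unit length (skip all-zero vectors).
        vectors.append([x / norm for x in w] if norm else w)
    return vocab, vectors


# Toy collection, not data from the study.
docs = [["corn", "price", "corn"], ["oil", "price"], ["oil", "crude", "oil"]]
vocab, vectors = tfidf_vectors(docs)
```

Note that a term occurring in every document gets an IDF of log(1) = 0 and therefore contributes nothing to any vector.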
Proposed solution
- Model 1: classification over the binary representation is fast but not accurate
- Model 2: classification using exact counts (tf, idf, normalized tf…) is more accurate but slower (very high dimensional space)
- This can be viewed as a concept hierarchy
Work along this hierarchy by using intervals:
- Better speed than Model 2 (approaching Model 1)
- Better accuracy than Model 1 (approaching Model 2)
An example
- Say we're using TF (values normalized to the range [0,1])
- Divide the range into 4 intervals: None, Low, Medium, High
- Each interval will be represented by a string of bits (we have four intervals, so we need 2 bits)
  - None = "00", Low = "01", Medium = "10" and High = "11" (note the order among them)
- Each bit position will be represented by a P-tree, so we have 2 P-trees for every dimension
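The intervalization step can be sketched as below. The cut points 0.25 and 0.75 are illustrative assumptions for this example, plain lists stand in for the two P-trees of a term, and the names `interval_code` and `term_ptrees` are hypothetical.

```python
def interval_code(tf, low_max=0.25, medium_max=0.75):
    """Map a normalized TF value in [0, 1] to a 2-bit interval code."""
    if not 0.0 <= tf <= 1.0:
        raise ValueError("tf must be normalized to [0, 1]")
    if tf == 0.0:
        return 0b00  # None
    if tf <= low_max:
        return 0b01  # Low
    if tf <= medium_max:
        return 0b10  # Medium
    return 0b11      # High


def term_ptrees(tf_column):
    """Return (high-bit column, low-bit column) for one term dimension."""
    codes = [interval_code(tf) for tf in tf_column]
    high = [(c >> 1) & 1 for c in codes]
    low = [c & 1 for c in codes]
    return high, low


# Hypothetical normalized TF values of one term across four documents.
high, low = term_ptrees([0.0, 0.1, 0.5, 0.9])
```

Because the codes preserve the order None < Low < Medium < High, dropping the low bit merges neighbouring intervals (None with Low, Medium with High) rather than arbitrary ones.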
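| | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 | Term 7 | C1 | C2 |
|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | "00" | "00" | "00" | "01" | "01" | "10" | "11" | 0 | 1 |
| Doc 2 | "00" | "00" | "00" | "10" | "01" | "10" | "11" | 0 | 1 |
| Doc 3 | "00" | "11" | "00" | "01" | "00" | "00" | "11" | 0 | 1 |
| Doc 4 | "00" | "00" | "00" | "01" | "01" | "01" | "00" | 1 | 0 |
| Doc 5 | "11" | "00" | "00" | "10" | "01" | "00" | "11" | 1 | 0 |
| Doc 6 | "11" | "00" | "11" | "00" | "01" | "01" | "00" | 1 | 0 |
| Doc 7 | "01" | "00" | "11" | "10" | "00" | "00" | "11" | 0 | 1 |
| Doc 8 | "10" | "11" | "01" | "01" | "01" | "00" | "00" | 1 | 0 |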
kNN Algorithm
- Used to find the k most similar points (referred to as the k neighbours) to some given point P in some space, and then to assign a class to P using the class labels of the k neighbours
- Usually proceeds by selecting the neighbours first (selection phase) and then assigning the class label (voting phase)
Categorization Algorithm: Selection Phase
- Initialize a P-tree, Pnn, to contain only pure-1 quadrants (i.e. all entries in it are 1's): the identity P-tree
- Order the set S of all term P-trees in descending order of the interval values of their terms in dnew (term P-trees representing higher interval values come first)
- For every term P-tree, Pt, in S do the following:
  - AND Pnn with Pt
  - If the root count of the result is less than k, expand Pt by removing the rightmost bit from the interval value (i.e. intervals 01 and 00 become 0, and intervals 10 and 11 become 1). This can be done by recalculating Pt while disregarding the P-tree of the rightmost bit. Repeat this step until the root count of Pnn AND Pt is at least k; this is guaranteed to happen, at the latest, when all the bits are disregarded
  - Else, put the result in Pnn
- Loop
- End of selection phase
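| | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 | Term 7 | C1 | C2 |
|---|---|---|---|---|---|---|---|---|---|
| Doc 1 | "00" | "00" | "00" | "01" | "01" | "10" | "11" | 0 | 1 |
| Doc 2 | "00" | "00" | "00" | "10" | "01" | "10" | "11" | 0 | 1 |
| Doc 3 | "00" | "11" | "00" | "01" | "00" | "00" | "11" | 0 | 1 |
| Doc 4 | "00" | "00" | "00" | "01" | "01" | "01" | "00" | 1 | 0 |
| Doc 5 | "11" | "00" | "00" | "10" | "01" | "00" | "11" | 1 | 0 |
| Doc 6 | "11" | "00" | "11" | "00" | "01" | "01" | "00" | 1 | 0 |
| Doc 7 | "01" | "00" | "11" | "10" | "00" | "00" | "11" | 0 | 1 |
| Doc 8 | "10" | "11" | "01" | "01" | "01" | "00" | "00" | 1 | 0 |
| dnew | "00" | "00" | "11" | "01" | "01" | "10" | "11" | | |

Order of ANDing: Pnn, P3, P7, P6, P4, P5, P1, P2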
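The selection phase can be traced on the example table with a short sketch. Several points are assumptions of this illustration rather than statements from the slides: integers serve as uncompressed bit vectors in place of P-trees, expansion stops once at least k candidates remain, the expanded result is stored back into Pnn, and ties between equal interval values keep their original term order. The helper names are hypothetical.

```python
def popcount(mask):
    """Root count: number of 1 bits in the mask."""
    return bin(mask).count("1")


def term_mask(column, target, dropped_bits):
    """Rows whose interval code equals target after dropping low-order bits."""
    mask = 0
    for row, code in enumerate(column):
        if code >> dropped_bits == target >> dropped_bits:
            mask |= 1 << row
    return mask


def select_neighbours(train, d_new, k, width=2):
    p_nn = (1 << len(train)) - 1  # identity: every document is a candidate
    # Terms with higher interval values in d_new are ANDed first (stable sort).
    order = sorted(range(len(d_new)), key=lambda t: -d_new[t])
    for t in order:
        column = [doc[t] for doc in train]
        for dropped in range(width + 1):  # 0 = exact match, width = term ignored
            candidate = p_nn & term_mask(column, d_new[t], dropped)
            if popcount(candidate) >= k:
                break
        p_nn = candidate
    return p_nn, order


# Interval codes of Doc 1 to Doc 8 for Term 1 to Term 7, from the table above.
train = [
    [0, 0, 0, 1, 1, 2, 3], [0, 0, 0, 2, 1, 2, 3],
    [0, 3, 0, 1, 0, 0, 3], [0, 0, 0, 1, 1, 1, 0],
    [3, 0, 0, 2, 1, 0, 3], [3, 0, 3, 0, 1, 1, 0],
    [1, 0, 3, 2, 0, 0, 3], [2, 3, 1, 1, 1, 0, 0],
]
d_new = [0, 0, 3, 1, 1, 2, 3]

p_nn, order = select_neighbours(train, d_new, k=3)
neighbours = [row + 1 for row in range(len(train)) if (p_nn >> row) & 1]
```

With k = 3, this sketch visits the terms in the order 3, 7, 6, 4, 5, 1, 2 and ends with documents 1, 2 and 5 as neighbours; terms 3, 6, 4 and 1 are effectively ignored because even their coarsest match would leave fewer than three candidates.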
Categorization Algorithm: Voting Phase
- For every class ci, loop through the dnew vector and do the following for every term tj in dnew:
  - Get the P-tree representing the neighbouring documents (Pnn from the selection phase) that have the same value for tj (Ptj) and belong to class ci (Pi). This can be done by calculating Presult = Ptj AND Pnn AND Pi
  - If the term under consideration has interval value Ij, multiply the root count of Presult by (Ij + 1) (to neglect terms with Ij = "00", do not add 1)
  - Add the result to the counter of ci, w(ci)
- Loop
- Select the class ck having the largest counter w(ck) as the class of dnew
- End of voting phase
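The voting phase can be sketched in the same style. As before, integers act as uncompressed stand-ins for P-trees and the helper names are hypothetical; the neighbour set (documents 1, 2 and 5 of the example table) is supplied by hand here as an assumed output of a selection phase with k = 3.

```python
def popcount(mask):
    """Root count: number of 1 bits in the mask."""
    return bin(mask).count("1")


def equal_mask(column, target):
    """Bit vector of rows whose entry equals target."""
    mask = 0
    for row, value in enumerate(column):
        if value == target:
            mask |= 1 << row
    return mask


def vote(train, labels, p_nn, d_new, count_none=True):
    weights = {}
    for cls in sorted(set(labels)):
        p_cls = equal_mask(labels, cls)
        total = 0
        for t, interval in enumerate(d_new):
            p_t = equal_mask([doc[t] for doc in train], interval)
            p_result = p_t & p_nn & p_cls
            # Weight matches by (Ij + 1); use Ij alone to neglect "00" terms.
            factor = interval + 1 if count_none else interval
            total += popcount(p_result) * factor
        weights[cls] = total
    return max(weights, key=weights.get), weights


# Interval codes and class labels of Doc 1 to Doc 8 from the example table.
train = [
    [0, 0, 0, 1, 1, 2, 3], [0, 0, 0, 2, 1, 2, 3],
    [0, 3, 0, 1, 0, 0, 3], [0, 0, 0, 1, 1, 1, 0],
    [3, 0, 0, 2, 1, 0, 3], [3, 0, 3, 0, 1, 1, 0],
    [1, 0, 3, 2, 0, 0, 3], [2, 3, 1, 1, 1, 0, 0],
]
labels = ["C2", "C2", "C2", "C1", "C1", "C1", "C2", "C1"]
d_new = [0, 0, 3, 1, 1, 2, 3]
p_nn = 0b00010011  # assumed neighbours: Doc 1, Doc 2 and Doc 5

winner, weights = vote(train, labels, p_nn, d_new)
```

In this run the counters are w(C1) = 7 and w(C2) = 24, so dnew is assigned to C2. Matches on high-interval terms dominate the vote because of the (Ij + 1) factor.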
Performance analysis study
- Compared accuracy and speed to cosine-similarity KNN, and accuracy to the string kernels approach by Lodhi et al. (Journal of Machine Learning Research, Feb. 2002)
- Speed: used synthetic document-by-term matrices of different sizes
Accuracy:
- Followed the sampling approach described in the string kernels work
- Tested over a subset of the Reuters-21578 collection (analysis over the whole dataset is still underway)
- Experimented on four classes, namely acquisition, earn, corn, and crude
- We used k = 3 and a 4-interval value set: I0 = [0,0], I1 = (0,0.25], I2 = (0.25,0.75] and I3 = (0.75,1]
- Averaged precision (not shown), recall (not shown) and F1 measures (2pr/(p+r)) for our approach and cosine KNN, and compared with string kernels
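For reference, the F1 measure 2pr/(p+r) and its average over several runs can be computed as in this minimal sketch. The precision and recall pairs are hypothetical placeholders, not values from the study.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall: 2pr / (p + r)."""
    if precision + recall == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs for two runs.
runs = [(0.90, 0.80), (0.50, 0.50)]
scores = [f1_measure(p, r) for p, r in runs]
average_f1 = sum(scores) / len(scores)
```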
F1 measure values

| Class | P-tree based | KNN | String kernels |
|---|---|---|---|
| Earn | 95.4 | 85.5 | 94.4 |
| Acq | 90.2 | 80.8 | 87.2 |
| Crude | 87.3 | 73.8 | 94.9 |
| Corn | 88.7 | 66.0 | 83.1 |
Compared to the KNN approach, we show much better results in terms of speed and accuracy
- The improvement in speed is mainly due to:
  - The complexity of the selection phase: O(n) for our approach vs. O(mn) for KNN, where m is the size of the dataset (number of rows) and n is the number of dimensions
  - P-tree ANDing speed
As for accuracy:
- The KNN approach uses the angle between the vectors and considers all terms
- Our approach uses ANDing to compare the closeness of the value of each term and to ignore unneeded terms (those whose ANDing leaves fewer than k neighbours)
As for the string kernels approach, it would not be appropriate to compare speeds here because the two approaches are fundamentally different:
- Example-based vs. eager
- Context sensitive vs. context insensitive

In general, the results were very comparable
The precision, recall and F1 measurements of the other two approaches spread over a wider range than ours. This indicates that the accuracy of our P-tree based approach is less variable across categories, leading to more stable results in general
Drawbacks

- Needs tuning: we need to decide upon the number of intervals and their ranges ahead of time (analysis for varying those is still underway)
- Since this is a KNN algorithm, k must also be known ahead of time
Conclusion
We have shown:
- Higher accuracy
  - The use of sequential ANDing in selection
  - Very fair voting
  - Use of closed neighbourhood (in case the root count is greater than k); refer to Maleq Khan's thesis (Dec. 2001) for previous work
- Better space utilization (reduced, compressed space)
  - Reduced space due to intervalization (from 8 bits to 2 bits: a reduction by a factor of 4)
  - Compression due to the use of P-trees
- Higher speed
  - Due to P-trees:
    - No DB scans
    - Based on the AND operation, which is among the fastest computer instructions
Future direction
- Solve the problem of random ANDing order among term P-trees having the same values. Information gain?
- Test the effects of varying the number of intervals and their values over different datasets
- Analyze speed and accuracy results over large datasets (the entire Reuters collection)