Hypertext Categorization. Rayid Ghani. IR Seminar - 10/3/00.
![Page 1: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/1.jpg)
Hypertext Categorization
Rayid Ghani
IR Seminar - 10/3/00
![Page 2: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/2.jpg)
![Page 3: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/3.jpg)
“Standard” Approach
• Apply traditional text learning algorithms
• In many cases, the goal is not to classify hypertext but to test the algorithms
• Is it actually the right approach?
![Page 4: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/4.jpg)
Results?
• Mixed results
• Positive results in most cases, BUT the goal was to test the algorithms
• Negative in a few, e.g. Chakrabarti, BUT the goal was to motivate their own algorithm
![Page 5: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/5.jpg)
How is hypertext different?
• Link information
• Diverse authorship
• Short text - topic not obvious from the text
• Structure / position within the web graph
• Author-supplied features (meta-tags)
• Bold, italics, headings, etc.
![Page 6: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/6.jpg)
How to use those extra features?
![Page 7: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/7.jpg)
Specific approaches to classify hypertext
• Chakrabarti et al., SIGMOD 98
• Oh et al., SIGIR 00
• Slattery & Mitchell, ICML 00
• Goal is not classification but retrieval:
  • Bharat & Henzinger, SIGIR 98
  • Croft & Turtle, 93
![Page 8: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/8.jpg)
Chakrabarti et al. SIGMOD 98
• Use the page and linkage information
• Add words from the “neighbors” and treat them as belonging to the page itself
  • Decrease in performance (not surprising)
  • Link information is very noisy
• Use topic information from neighbors instead
![Page 9: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/9.jpg)
Data Sets
• IBM Patent Database: 12 classes (630 train, 300 test for each class)
• Yahoo: 13 classes, 20000 docs (for experiments involving hypertext, only 900 documents were used) (?)
![Page 10: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/10.jpg)
Experiments
• Using text from neighbors
  • Local+Neighbor_Text:
  • Local+Neighbor_Text_Tagged:
• Assume neighbors are pre-classified
  • Text – 36%
  • Link – 34%
  • Prefix – 22.1% (words in class hierarchy used)
  • Text+Prefix – 21%
![Page 11: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/11.jpg)
Oh et al. SIGIR 2000
• The relationship between the class and the neighbors of a web page in the training set is not consistent/useful (?)
• Instead, use the class and neighbor info of the page being classified (use regularities in the test set)
![Page 12: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/12.jpg)
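Classification
Classify test instance d by:

$$
\begin{aligned}
&\arg\max_{c}\,[P(C \mid G, T)] \\
&\arg\max_{c}\,[P(C \mid T)\,P(C \mid G)] \\
&\arg\max_{c}\,\Big[\Big(P(c)\prod_{i=1}^{|T|} P(t_i \mid c)^{N(t_i \mid d)}\Big)\,\mathrm{Neighbor}(c)\Big]
\end{aligned}
$$

where $\mathrm{Neighbor}(c)$ is computed from the weighted neighbors $l \in L_d$ of d.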
![Page 13: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/13.jpg)
Algorithm
• For each test document d, generate a set A of “trustable” neighbors
• For all terms t_i in d, adjust the term weight using the term weights from A
• For each doc a in A, assign a max confidence value if its class is known; otherwise assign a class probabilistically and give it a partial confidence weight
• Classify d using the equation given earlier
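A minimal Python sketch of the final classification step may help: it scores each class as P(c) times the product of P(t_i|c) raised to the term count in d, times a neighbor term, all in log space. The slides do not give a usable definition of Neighbor(c), so `neighbor_scores` below is an assumed stand-in (a smoothed, confidence-weighted share of trusted neighbors per class), and every function and parameter name is hypothetical.

```python
import math
from collections import Counter

def neighbor_scores(neighbors, classes, smoothing=1.0):
    """Assumed Neighbor(c): smoothed, confidence-weighted share of neighbors in c.

    `neighbors` is a list of (class_label, confidence) pairs for the trusted set A.
    Known-class neighbors would carry confidence 1.0, guessed ones less.
    """
    totals = {c: smoothing for c in classes}
    for label, confidence in neighbors:
        totals[label] += confidence
    norm = sum(totals.values())
    return {c: v / norm for c, v in totals.items()}

def classify(doc_terms, priors, term_probs, neighbor):
    """Return argmax_c of P(c) * prod_i P(t_i|c)^N(t_i|d) * Neighbor(c)."""
    counts = Counter(doc_terms)  # N(t_i | d): occurrences of each term in d
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior) + math.log(neighbor[c])
        for term, n in counts.items():
            # Assumed floor for unseen terms; real systems smooth more carefully.
            score += n * math.log(term_probs[c].get(term, 1e-6))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

priors = {"sports": 0.5, "science": 0.5}
term_probs = {
    "sports": {"ball": 0.5, "atom": 0.1},
    "science": {"ball": 0.1, "atom": 0.5},
}
# The text alone is ambiguous; two trusted "science" neighbors break the tie.
nb = neighbor_scores([("science", 1.0), ("science", 0.6)], priors)
label = classify(["ball", "atom"], priors, term_probs, nb)
```

Working in log space keeps the product over many terms from underflowing; the neighbor term then acts as one extra additive factor per class.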
![Page 14: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/14.jpg)
Experiments
• Reuters used to assess the algorithm on datasets without hyperlinks – only varying the size of the training set & # of features (?)
  • Results not directly comparable, but numbers similar to reported results
• Articles from an encyclopedia – 76 classes, 20836 documents
![Page 15: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/15.jpg)
Results
Terms+Classes > Only Classes > Only Terms > No use of inlinks
![Page 16: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/16.jpg)
Other issues
• Link discrimination
• Knowledge of neighbor classes
• Use of links in training set
• Inclusion of new terms from neighbors
![Page 17: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/17.jpg)
Comparison

| | Chakrabarti | Oh et al. | Improvement |
|---|---|---|---|
| Links in training set | Y | N | 5% |
| Link discrimination | N | Y | 6.7% |
| Knowledge of neighbor class | Y | Y | 6.6%, 1.9% |
| Iteration | Y | N | 1.5% |
| Using new terms from neighbors | Y | N | 31.4% |
![Page 18: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/18.jpg)
Slattery & Mitchell ICML 00
Given a problem setting in which the test set contains structural regularities, how can we find and use them?
![Page 19: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/19.jpg)
Hubs and Authorities
Kleinberg (1998)
“.. a good hub is a page that points to many good authorities; a good authority is a page pointed to by many good hubs.”
[Figure: hubs and authorities]
![Page 20: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/20.jpg)
Hubs and Authorities
Kleinberg (1998)
“Hubs and authorities exhibit what could be called a mutually reinforcing relationship”
Iterative relaxation:

$$
\mathrm{Hub}(p) = \sum_{q:\, p \to q} \mathrm{Authority}(q)
\qquad
\mathrm{Authority}(p) = \sum_{q:\, q \to p} \mathrm{Hub}(q)
$$

[Figure: hubs and authorities]
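The iterative relaxation can be sketched in a few lines of Python. This is an illustrative version only: the function name, the fixed iteration count, and the L2 normalization after each round are assumptions chosen to keep the scores bounded, not details taken from the slides.

```python
import math

def hits(links, iterations=20):
    """links: iterable of (p, q) pairs meaning page p points to page q."""
    links = list(links)
    pages = {p for edge in links for p in edge}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority(p) = sum of Hub(q) over all q that point to p
        auth = {p: 0.0 for p in pages}
        for src, dst in links:
            auth[dst] += hub[src]
        # Hub(p) = sum of Authority(q) over all q that p points to
        hub = {p: 0.0 for p in pages}
        for src, dst in links:
            hub[src] += auth[dst]
        # Normalize so scores stay bounded (assumed choice: L2 norm)
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
```

In this toy graph `a1` is pointed to by both hubs and ends up with the higher authority score, while `h1` points to more authorities and ends up the stronger hub.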
![Page 21: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/21.jpg)
The Plan
• Take an existing learning algorithm
• Extend it to exploit structural regularities in the test set
• Using Hubs and Authorities as inspiration
![Page 22: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/22.jpg)
FOIL
Quinlan & Cameron-Jones (1993)
• Learns relational rules like:
  target_page(A) :- has_research(A), link(A,B), has_publications(B).
• For each test example:
  • Pick the matching rule with the best training set performance p
  • Predict positive with confidence p
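The prediction step can be illustrated with a small Python sketch. Rules are modelled here as plain Python predicates paired with a training-set performance score p; the page representation (a dict with `words` and `links`), the rule functions, the scores, and the choice of returning 0.0 when no rule matches are all assumptions made for illustration.

```python
def predict(example, rules):
    """rules: list of (matches, p) pairs, where matches(example) -> bool and
    p is that rule's training-set performance.

    Returns the confidence that the example is positive: the best p among
    matching rules, or 0.0 (an assumed default) when no rule matches.
    """
    matching = [p for matches, p in rules if matches(example)]
    return max(matching, default=0.0)

def research_links_to_publications(page):
    # Mirrors: target_page(A) :- has_research(A), link(A,B), has_publications(B).
    return "research" in page["words"] and any(
        "publications" in linked["words"] for linked in page["links"]
    )

def mentions_research(page):
    # A second, weaker hypothetical rule: target_page(A) :- has_research(A).
    return "research" in page["words"]

pubs = {"words": {"publications"}, "links": []}
home = {"words": {"research"}, "links": [pubs]}
rules = [(research_links_to_publications, 0.75), (mentions_research, 0.4)]
confidence = predict(home, rules)
```

Both rules match `home`, so the prediction takes the confidence of the better-performing one.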
![Page 23: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/23.jpg)
FOIL-Hubs Representation
• Add two rules to a learned rule set:
  target_page(A) :- link(B,A), target_hub(B).
  target_hub(A) :- link(A,B), target_page(B).
• Talk about confidence rather than truth: target_page(page15) = 0.75
• Evaluate by summing instantiations:

$$
\mathrm{target\_page}(\mathrm{page15}) = \sum_{B:\, \mathrm{link}(B,\, \mathrm{page15})} \mathrm{target\_hub}(B)
$$
![Page 24: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/24.jpg)
FOIL-Hubs Algorithm
1. Apply learned FOIL rules: learned(A)
2. Iterate
   1. Evaluate target_hub(A)
   2. Evaluate target_page(A)
   3. Set target_page(A) = s · target_page(A) + learned(A)
3. Report target_page(A)
![Page 25: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/25.jpg)
FOIL-Hubs Algorithm
[Figure labels: Learned FOIL rules, foil(A), target_hub(A), target_page(A)]
1. Apply learned FOIL rules to test set
2. Initialise target_page(A) confidence from foil(A)
3. Evaluate target_hub(A)
4. Evaluate target_page(A)
5. target_page(A) = s · target_page(A) + foil(A)
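A rough Python sketch of this loop follows. It assumes that s is a small scaling constant, that steps 3 to 5 are repeated a fixed number of times, and that no normalization is applied; none of those details are stated on the slides, and the function and variable names are hypothetical.

```python
def foil_hubs(links, foil, s=0.1, iterations=5):
    """links: list of (a, b) pairs meaning link(a, b).
    foil: dict mapping page -> confidence from the learned FOIL rules.
    """
    pages = {p for edge in links for p in edge} | set(foil)
    # Step 2: initialise target_page(A) from foil(A)
    target_page = {p: foil.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # Step 3: target_hub(A) = sum of target_page(B) over link(A, B)
        target_hub = {p: 0.0 for p in pages}
        for a, b in links:
            target_hub[a] += target_page[b]
        # Step 4: target_page(A) = sum of target_hub(B) over link(B, A)
        summed = {p: 0.0 for p in pages}
        for b, a in links:
            summed[a] += target_hub[b]
        # Step 5: target_page(A) = s * target_page(A) + foil(A)
        target_page = {p: s * summed[p] + foil.get(p, 0.0) for p in pages}
    return target_page

links = [("members", "s1"), ("members", "s2"), ("members", "s3"), ("other", "x")]
scores = foil_hubs(links, foil={"s1": 0.8, "s2": 0.7})
```

Here `s3` is covered by no FOIL rule, yet it picks up confidence because it shares a hub (`members`) with two pages FOIL is confident about, while `x` stays at zero. Without normalization the values can grow with more iterations or a larger s, so the settings above are only suitable for a toy graph.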
![Page 26: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/26.jpg)
Data Set
• 4127 pages from Computer Science departments of four universities:
  • Cornell University
  • University of Texas at Austin
  • University of Washington
  • University of Wisconsin
• Hand labeled into:
  • Student: 558 Web pages
  • Course: 243 Web pages
  • Faculty: 153 Web pages
![Page 27: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/27.jpg)
Experiment
Three binary classification tasks
1. Student Home Page
2. Course Home Page
3. Faculty Home Page
Leave-two-universities-out cross-validation
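As a sketch of how such splits could be enumerated, the Python below holds out every pair of universities once, giving six train/test splits for four universities. The short university labels and the function name are placeholders, and treating the held-out pair as the test set is an assumption.

```python
from itertools import combinations

UNIVERSITIES = ["Cornell", "Texas", "Washington", "Wisconsin"]

def leave_two_out_splits(groups):
    """Yield (train_groups, test_groups) with every pair of groups held out once."""
    for held_out in combinations(groups, 2):
        train = [g for g in groups if g not in held_out]
        yield train, list(held_out)

splits = list(leave_two_out_splits(UNIVERSITIES))
for train, test in splits:
    print("train:", train, "test:", test)
```

Splitting by university rather than by page keeps all pages from one site on the same side of the split, so link regularities found in the test set cannot leak in from training pages of the same department.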
![Page 28: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/28.jpg)
Student Home Page
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 29: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/29.jpg)
Course Home Page
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 30: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/30.jpg)
More Detailed Results
• Partition the test data into:
  • Examples covered by some learned FOIL rule
  • Examples covered by no learned FOIL rule
![Page 31: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/31.jpg)
Student – FOIL covered
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 32: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/32.jpg)
Student – FOIL uncovered
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 33: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/33.jpg)
Course – FOIL covered
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 34: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/34.jpg)
Course – FOIL uncovered
[Figure: precision vs. recall curves (both axes 0 to 100) for FOIL-Hubs and FOIL]
![Page 35: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/35.jpg)
Recap
We’ve searched for regularities of the form
  student_page(A) :- link(Web->KB members page, A)
in the test set. We consider this an instance of a regularity schema
  student_page(A) :- link(<page constant>, A)
![Page 36: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/36.jpg)
Conclusions
Test set regularities can be used to improve classification performance
FOIL-Hubs used such regularities to outperform FOIL on three Web page classification problems
We can potentially search for other regularity schemas using FOIL
![Page 37: HypertextHypertext Categorization Rayid Ghani IR Seminar - 10/3/00.](https://reader038.fdocuments.net/reader038/viewer/2022103123/56649d615503460f94a430c9/html5/thumbnails/37.jpg)
Other work
• Using the structure of HTML to improve retrieval. Michal Cutler, Yungming Shih, Weiyi Meng. USENIX 1997
  • Use tf-idf, giving different weights to text in different HTML tags
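To make the tag-weighting idea concrete, here is a small Python sketch of the term-frequency half only (the idf factor is omitted). The weights in `TAG_WEIGHTS`, the rule that a term takes the largest weight among its enclosing tags, and all class and function names are hypothetical; the slide does not say which tags or values Cutler et al. used.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; tags not listed here count with weight 1.0.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "b": 2.0}

class WeightedTermCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.weights = Counter()  # term -> accumulated weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # A term takes the largest weight among its enclosing tags
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags], default=1.0)
        for term in data.lower().split():
            self.weights[term] += weight

def weighted_tf(html):
    parser = WeightedTermCounter()
    parser.feed(html)
    parser.close()
    return parser.weights

tf = weighted_tf(
    "<html><title>Hypertext</title><body><b>links</b> and text</body></html>"
)
```

With these assumed weights, a word in the title counts four times as much as the same word in plain body text; multiplying by an idf factor computed over the collection would complete the tf-idf score.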