Recent Results in Automatic Web Resource Discovery
04/19/23 1
Recent Results in Automatic Web Resource Discovery
Soumen Chakrabarti
Presentation by Cui Tao
04/19/23 2
Introduction
Classical IR:
  Indexing a collection of documents
  Answering queries by returning a ranked list of relevant documents
Problems in retrieving online documents:
  Ambiguity
  Context sensitivity
  Synonymy
  Polysemy
  Large number of relevant Web pages
04/19/23 3
Introduction
Directory-based topic browsing: tree-like structure
  Mostly maintained by human experts
  Advantages: exemplary, influential
  Disadvantages: slow, subjective, and noisy
04/19/23 4
Introduction
Standard crawlers and search engines:
  1997: covered 35-40% of 340 million Web pages
  1999: covered 18% of 800 million Web pages
  Cannot be used for maintaining generic portals or for automatic resource discovery
04/19/23 5
Introduction
Focused crawler:
  Selectively seeks out pages that are relevant to a pre-defined set of topics
  Preferred by experts and researchers
  Two modules:
    Classifier: analyzes the text in and links around a given Web page and automatically assigns it to suitable directories in a Web catalog
    Distiller: identifies the centrality of crawled pages to determine visit priorities
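
The crawl loop behind a focused crawler can be sketched as a best-first search. This is a minimal sketch under stated assumptions, not the actual system: the in-memory `WEB` graph, the keyword-overlap `relevance` function (a stand-in for the classifier), the `TOPIC` term set, and the threshold are all hypothetical. The distiller is reduced to its simplest form: a link inherits the relevance of the page it was found on as its visit priority.

```python
import heapq

# Hypothetical toy Web: url -> (page text, outgoing links).
WEB = {
    "a": ("cycling bike race", ["b", "c"]),
    "b": ("bike repair shop", ["d"]),
    "c": ("stock market news", ["e"]),
    "d": ("mountain bike trails", []),
    "e": ("bond yields", []),
}
TOPIC = {"bike", "cycling"}  # assumed topic vocabulary


def relevance(text):
    """Stand-in classifier: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC for w in words) / len(words)


def focused_crawl(seed, threshold=0.2):
    """Visit pages in order of parent relevance; skip off-topic regions."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    relevant = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant


crawled = focused_crawl("a")
```

In this toy graph the page `e` is never fetched, because the only path to it runs through the off-topic page `c`. A standard crawler would have visited all five pages.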
04/19/23 6
Distillation techniques
Google:
  Simulates a random wander on the Web
  Pages are ranked by pre-computed popularity and visitation rate
  Fast
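
The random-wander idea can be illustrated with a short power iteration that estimates how often a random surfer visits each page. This is a sketch under assumptions, not Google's implementation: the damping factor of 0.85, the uniform handling of pages without out-links, and the toy `GRAPH` are all illustrative choices.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Estimate visitation rates of a random surfer by power iteration.

    graph maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new[v] += share
            else:
                # Assumed policy: a page with no out-links spreads its rank uniformly.
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Hypothetical four-page Web.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
```

Because the scores depend only on the link graph, they can be computed once, ahead of any query, which is why this style of ranking is fast at query time.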
04/19/23 7
Distillation techniques
HITS (Hyperlink Induced Topic Search):
  Depends on a search engine
  Combines two scores:
    Authorities: identify pages with useful information about a topic
    Hubs: identify pages that contain many links to pages with useful information on the topic
  Query dependent and slow
  May lead to topic contamination or drift
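
The mutual reinforcement between hubs and authorities can be sketched as an iteration over the link graph of the query's result neighborhood. This is a minimal sketch: the toy `LINKS` graph, the fixed iteration count, and the L2 normalization are assumptions made for illustration.

```python
import math


def hits(graph, iterations=20):
    """Iterate hub and authority scores on a small link graph.

    graph maps a page to the list of pages it links to.
    """
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        auth = {u: 0.0 for u in nodes}
        for u, outs in graph.items():
            for v in outs:
                auth[v] += hub[u]
        # A page is a good hub if it points to good authorities.
        hub = {u: sum(auth[v] for v in graph.get(u, [])) for u in nodes}
        # Normalize so the scores stay bounded.
        a_norm = math.sqrt(sum(x * x for x in auth.values()))
        h_norm = math.sqrt(sum(x * x for x in hub.values()))
        auth = {u: x / a_norm for u, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}
    return hub, auth


# Hypothetical neighborhood: three link pages pointing at three content pages.
LINKS = {"h1": ["p1", "p2"], "h2": ["p1", "p2", "p3"], "h3": ["p2"]}
hub_scores, auth_scores = hits(LINKS)
```

Unlike a pre-computed ranking, this iteration has to run on a graph assembled for each query, which is the source of the "query dependent and slow" remark.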
04/19/23 8
Distillation techniques
ARC and CLEVER:
  ARC (Automatic Resource Compiler): part of CLEVER
  Root set is expanded by 2 links instead of 1 link (including all pages that are at link-distance two or less from at least one page in the root set)
  Hyperlinks are assigned weights based on the match between the query and the text surrounding the hyperlink in the source document
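
One way to picture the hyperlink weighting is to count query terms in the text window around each link and let that count scale the link's contribution to the target's authority score. The sketch below assumes a weight of one plus the number of matching terms; the window contents, the `ANCHORS` data, and the function names are hypothetical.

```python
def edge_weight(query, anchor_window):
    """Assumed weighting: 1 + number of window words that are query terms."""
    terms = set(query.lower().split())
    words = anchor_window.lower().split()
    return 1 + sum(w in terms for w in words)


def weighted_authority(query, links, hub_scores):
    """One authority update in which each link counts by its edge weight."""
    auth = {}
    for src, dst, window in links:
        auth[dst] = auth.get(dst, 0.0) + edge_weight(query, window) * hub_scores[src]
    return auth


# Hypothetical links: (source page, target page, text surrounding the link).
ANCHORS = [
    ("hub1", "pageA", "best power supply vendors"),
    ("hub1", "pageB", "my holiday photos"),
    ("hub2", "pageA", "switch mode power supply guide"),
]
authority = weighted_authority("power supply", ANCHORS, {"hub1": 1.0, "hub2": 1.0})
```

A link whose surrounding text never mentions the query still counts, but only with the base weight of 1, so off-topic links contribute less than on-topic ones.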
04/19/23 9
Distillation techniques
Outlier filtering:
  Computes relevance weights for pages using the Vector Space Model
  All pages whose weights are below a threshold are pruned
  Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
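
The pruning step can be sketched with raw term-frequency vectors and cosine similarity. This is a simplified illustration: using the query text itself as the reference vector, the unweighted term counts, the threshold of 0.2, and the `PAGES` data are all assumptions.

```python
import math
from collections import Counter


def tf_vector(text):
    """Represent a text as a bag of lowercase words with raw counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def prune_outliers(query, pages, threshold=0.2):
    """Keep only pages whose relevance weight reaches the threshold."""
    q = tf_vector(query)
    return {
        url: text
        for url, text in pages.items()
        if cosine(q, tf_vector(text)) >= threshold
    }


# Hypothetical neighborhood pages.
PAGES = {
    "p1": "power supply design notes",
    "p2": "celebrity gossip daily",
    "p3": "uninterruptible power supply reviews",
}
kept = prune_outliers("power supply", PAGES)
```

Only the surviving pages would then be passed to the distillation step, so an unrelated but heavily linked page such as `p2` cannot pull the result off topic.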
04/19/23 10
Topic distillation vs. Resource discovery
Topic distillation:
  Depends on large, comprehensive Web crawls and indices (post-processing)
  Can it be used to generate a Web taxonomy?
    Set a keyword query for each node in the taxonomy
    Run a distillation program
    Simple, but has some problems
04/19/23 11
Topic distillation vs. Resource discovery
Problems:
  Constructing the query involves trial, error, and complicated thought
  Query: "North American telecommunication companies"
  Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel
  The Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies
  Average query needed to match the directory-based browsing quality of Yahoo!: 7.03 terms and 4.34 operators
  Average Alta Vista query: 2.35 terms and 0.41 operators
04/19/23 12
Topic distillation vs. Resource discovery
Problems:
  Contamination
  Stop-sites: not automatic
  Term weighting
  Edge weighting: no precise algorithm to set the weights
Topic distillation by itself is not enough for resource discovery
04/19/23 13
Hypertext classification: learning from examples
  Adding example pages and their distance-1 neighbors into the graph to be distilled improves the result
  The contents of the given examples and their neighbors provide a way to compute the decision boundary of classification
  NN, Bayesian, and support vector classifiers
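
Learning a decision boundary from example pages can be illustrated with a small Bayesian text classifier. This is a minimal sketch of multinomial naive Bayes with add-one smoothing; the `EXAMPLES` data and topic names are hypothetical, and link information from neighboring pages is left out.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect per-topic word counts from (topic, text) example pages."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, text in examples:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the topic with the highest log posterior for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)  # topic prior
        for w in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[topic][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical example pages supplied by a user.
EXAMPLES = [
    ("cycling", "bike race tour"),
    ("cycling", "mountain bike trail"),
    ("finance", "stock market fund"),
    ("finance", "bond market yield"),
]
model = train(EXAMPLES)
predicted = classify(model, "bike tour news")
```

The user supplies example pages instead of hand-tuning a keyword query, and the classifier derives the term weights from them.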
04/19/23 14
Hypertext classification
Link-based features: important
  Circular topic influence
    The topic of one page influences its text and its neighbor pages' topics
    Knowledge of the linked vicinity's topics provides clues for the test document's topic
  Bibliometric, more general than the simple linear endorsement model used in topic distillation
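
The circular influence between a page's topic and its neighbors' topics can be pictured as an iterative update in which each page's text-only estimate is repeatedly adjusted by the current beliefs of its linked neighbors. The sketch below is a simplified stand-in, not the method from the paper: the update rule, the smoothing constant `eps`, the round count, and the `PRIOR` and `NEIGHBORS` data are all assumptions.

```python
def relax_labels(prior, neighbors, rounds=5, eps=0.01):
    """Refine per-page topic beliefs using the beliefs of linked pages.

    prior maps page -> {topic: probability from the page text alone};
    neighbors maps page -> list of linked pages.
    """
    belief = {page: dict(dist) for page, dist in prior.items()}
    for _ in range(rounds):
        updated = {}
        for page, dist in prior.items():
            nbrs = neighbors.get(page, [])
            scores = {}
            for topic, p_text in dist.items():
                if nbrs:
                    # Average current belief of the neighbors in this topic.
                    support = sum(belief[n][topic] for n in nbrs) / len(nbrs)
                else:
                    support = 1.0  # no link evidence: keep the text estimate
                scores[topic] = p_text * (support + eps)
            total = sum(scores.values())
            updated[page] = {t: s / total for t, s in scores.items()}
        belief = updated
    return belief


# Hypothetical pages: the text of x is ambiguous, but it links to two cycling pages.
PRIOR = {
    "x": {"cycling": 0.5, "finance": 0.5},
    "y": {"cycling": 0.9, "finance": 0.1},
    "z": {"cycling": 0.8, "finance": 0.2},
}
NEIGHBORS = {"x": ["y", "z"], "y": ["x"], "z": ["x"]}
beliefs = relax_labels(PRIOR, NEIGHBORS)
```

The text of page `x` alone gives no preference, yet its linked vicinity pushes its belief firmly toward the cycling topic.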
04/19/23 16
Conclusion
  Emphasized the importance of scalable automatic resource discovery
  Argued that common search engines are not adequate for resource discovery
  Introduced the recently invented focused crawling system