Recent Results in Automatic Web Resource Discovery

04/19/23 1


Soumen Chakrabarti
Presentation by Cui Tao


Introduction
Classical IR:
  Indexing a collection of documents
  Answering queries by returning a ranked list of relevant documents
Problems in retrieving online documents:
  Ambiguity
  Context sensitivity
  Synonymy
  Polysemy
  Large number of relevant Web pages


Introduction
Directory-based topic browsing: tree-like structure
  Mostly maintained by human experts
  Advantages: exemplary, influential
  Disadvantages: slow, subjective, and noisy


Introduction
Standard crawlers and search engines:
  1997: covered 35-40% of 340 million Web pages
  1999: covered 18% of 800 million Web pages
  Cannot be used for maintaining generic portals and automatic resource discovery


Introduction
Focused crawler:
  Selectively seeks out pages that are relevant to a pre-defined set of topics
  Preferred by experts and researchers
  Two modules:
    Classifier: analyzes the text in, and the links around, a given Web page and automatically assigns it to suitable directories in a Web catalog
    Distiller: identifies the centrality of crawled pages to determine visit priorities
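The focused-crawling idea can be illustrated with a minimal best-first crawl sketch. It is not the system described in the talk: `fetch` and `relevance` are hypothetical callables standing in for a page downloader and the classifier, the toy pages are made up, and the distiller is left out (a link is simply prioritized by the relevance of the page that cites it).

```python
import heapq

def focused_crawl(seed_urls, fetch, relevance, max_pages=10, threshold=0.5):
    """Best-first crawl that only expands links found on relevant pages.

    fetch(url) -> (text, outlinks) and relevance(text) -> float in [0, 1]
    are hypothetical callables supplied by the caller.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    relevant = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        visited += 1
        score = relevance(text)
        if score < threshold:
            continue  # do not follow links out of an irrelevant page
        relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of the citing page.
                heapq.heappush(frontier, (-score, link))
    return relevant

# Toy in-memory "Web" (made-up pages): url -> (text, outlinks)
TOY_WEB = {
    "a": ("power supply smps design", ["b", "c"]),
    "b": ("switch mode power supply", ["d"]),
    "c": ("parcel delivery tracking", ["e"]),
    "d": ("uninterruptible power supply", []),
    "e": ("parcel shipping rates", []),
}

def toy_fetch(url):
    return TOY_WEB[url]

def toy_relevance(text):
    # Stand-in for a trained classifier: a single keyword test.
    return 1.0 if "power" in text else 0.0

crawl_result = focused_crawl(["a"], toy_fetch, toy_relevance)
```

In this toy run the page "e" is never fetched, because it is reachable only through the irrelevant page "c".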


Distillation techniques
Google:
  Simulates a random walk on the Web
  Pages are ranked by pre-computed popularity and visitation rate
  Fast
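The random-walk ranking can be sketched as a simple power iteration in the style of PageRank. This is a minimal illustration, not Google's implementation: the damping factor 0.85, the iteration count, the handling of pages without out-links, and the toy graph are all assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate the visitation rate of a random surfer.

    graph maps each page to the list of pages it links to; every link
    target must also appear as a key. damping is an assumed value.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    new_rank[v] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank

# Toy link graph (made up for illustration)
TOY_GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(TOY_GRAPH)
top_page = max(ranks, key=ranks.get)
```

Because the scores do not depend on a query, they can be computed once in advance, which is what makes this kind of ranking fast at query time.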


Distillation techniques
HITS (Hyperlink Induced Topic Search):
  Depends on a search engine
  Combines two scores:
    Authorities: identify pages with useful information about a topic
    Hubs: identify pages that contain many links to pages with useful information on the topic
  Query dependent and slow
  May lead to topic contamination or drift
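The mutual reinforcement between hubs and authorities can be sketched as the following iteration. It is a minimal illustration on a made-up graph; in HITS the graph would be built from search engine results for the query, which is omitted here, and the iteration count is an assumption.

```python
import math

def hits(graph, iterations=50):
    """Iteratively compute hub and authority scores.

    graph maps each page to the list of pages it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {u: 0.0 for u in nodes}
        for u, targets in graph.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {u: sum(auth[v] for v in graph.get(u, ())) for u in nodes}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for u in scores:
                scores[u] /= norm
    return hub, auth

# Toy graph (made up): three link pages pointing at three content pages
TOY_LINKS = {
    "h1": ["p1", "p2"],
    "h2": ["p1", "p2", "p3"],
    "h3": ["p1"],
}
hub_scores, auth_scores = hits(TOY_LINKS)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(auth_scores, key=auth_scores.get)
```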


Distillation techniques
ARC and CLEVER:
  ARC (Automatic Resource Compiler): part of CLEVER
  The root set is expanded by 2 links instead of 1 link (including all pages that are at link-distance two or less from at least one page in the root set)
  Weights are assigned to the hyperlinks, based on the match between the query and the text surrounding the hyperlink in the source document
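One way to picture the link weighting is the sketch below, which counts query terms in a window of text around the anchor. The function name, the window size, the tokenization, and the "1 + number of matching terms" formula are assumptions made for illustration; the slide does not give the exact weighting formula.

```python
import re

def anchor_weight(query, source_text, anchor_start, anchor_end, window=50):
    """Weight a hyperlink by query terms found near its anchor.

    anchor_start and anchor_end are character offsets of the link in
    source_text; window is an assumed context size in characters.
    """
    lo = max(0, anchor_start - window)
    hi = min(len(source_text), anchor_end + window)
    context = set(re.findall(r"[a-z0-9]+", source_text[lo:hi].lower()))
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Every link gets a base weight of 1; each matching term adds 1.
    return 1 + len(terms & context)

# Made-up source page containing one link
page = 'See this guide to switch mode power supply design <a href="x">here</a> for details.'
start = page.index("<a")
end = page.index("</a>") + len("</a>")

weight_on_topic = anchor_weight("power supply", page, start, end)
weight_off_topic = anchor_weight("parcel", page, start, end)
```

The resulting weights would then multiply the contributions of each link in the hub and authority updates, so that links whose surrounding text matches the query count for more.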


Distillation techniques
Outlier filtering:
  Computes relevance weights for pages using the Vector Space Model
  All pages whose weights are below a threshold are pruned
  Effectively prunes away outlier nodes in the neighborhood, thus avoiding contamination
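The pruning step can be sketched with plain term-frequency vectors and cosine similarity. This is a simplified illustration: the raw term-frequency weighting, the fixed threshold of 0.2, and the toy pages are assumptions, and a real system would typically use TF-IDF weights and a data-dependent threshold.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a bag of lower-cased words with counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_outliers(query, pages, threshold=0.2):
    """Keep only pages whose similarity to the query meets the threshold.

    threshold is an assumed fixed cut-off chosen for illustration.
    """
    q = tf_vector(query)
    return {
        url: text
        for url, text in pages.items()
        if cosine(q, tf_vector(text)) >= threshold
    }

# Made-up neighborhood pages
TOY_PAGES = {
    "p1": "switch mode power supply",
    "p2": "power supply repair guide",
    "p3": "parcel delivery service",
}
kept_pages = prune_outliers("power supply", TOY_PAGES)
```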


Topic distillation vs. Resource discovery
Topic distillation:
  Depends on large, comprehensive Web crawls and indices (post-processing)
  Can it be used to generate a Web taxonomy?
    Set a keyword query for each node in the taxonomy
    Run a distillation program
    Simple, but has some problems


Topic distillation vs. Resource discovery
Problems:
  Constructing the query involves trial, error, and complicated thought
  Query: "North American telecommunication companies"
  The Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies
  Query: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel
  To match the directory-based browsing quality of:
    Yahoo!: 7.03 terms and 4.34 operators
    Alta Vista: 2.35 terms and 0.41 operators


Topic distillation vs. Resource discovery
Problems:
  Contamination
  Stop-sites: not automatic
  Term weighting
  Edge weighting: no precise algorithm to set the weights
Topic distillation by itself is not enough for resource discovery


Hypertext classification: learning from examples
  Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result
  The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification
  NN, Bayesian, and support vector classifiers


Hypertext classification
Link-based features: important
Circular topic influence:
  The topic of a page influences its text and the topics of its neighboring pages
  Knowledge of the topics in the linked vicinity provides clues to the test document's topic
  Bibliometric; more general than the simple linear endorsement model used in topic distillation


Putting it together for resource discovery


Conclusion
  Emphasized the importance of scalable automatic resource discovery
  Argued that common search engines are not adequate for resource discovery
  Introduced the recently invented focused crawling system


Future Work
  How to derive the training examples automatically?
  How to personalize the outcome of a focused crawler for individual users?
