BINGO!: Bookmark-Induced Gathering of Information


Page 1: BINGO!:  Bookmark-Induced Gathering of Information

BINGO!: Bookmark-Induced Gathering of Information

Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum

University of the Saarland, Germany

Page 2: BINGO!:  Bookmark-Induced Gathering of Information

BINGO!: Bookmark-Induced Gathering of Information, Sergej Sizov

Part I: System Overview

Page 3: BINGO!:  Bookmark-Induced Gathering of Information

Motivation

Web search engines
    The vector space model
    Link analysis & authority ranking

Information demands
    Mass queries (“madonna tour”)
    Needle-in-a-haystack queries (“solidarity eisler”)

Page 4: BINGO!:  Bookmark-Induced Gathering of Information

Overview (II)

[Figure: documents from the WWW are arranged in a topic tree. Node labels: ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML]

Page 5: BINGO!:  Bookmark-Induced Gathering of Information

Focused Crawling

[Figure labels: Crawler, Queue, Classifier, Results]

Page 6: BINGO!:  Bookmark-Induced Gathering of Information

Focused Crawling (2)

Key aspects:

the mathematical model and algorithm that are used for the classifier (e.g., Naive Bayes vs. SVM)

the feature set upon which the classifier makes its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms)

the quality of the training data

Page 7: BINGO!:  Bookmark-Induced Gathering of Information

Focused Crawling (3)

[Figure labels: Crawler, Queue, SVM Classifier, HITS, Re-Training, SVM Archetypes, Hubs, Authorities]

Page 8: BINGO!:  Bookmark-Induced Gathering of Information

System Overview

[Figure: system architecture. Labels: WWW, Crawler, Document Analyzer, Feature Selection, Classifier, Adaptive Re-Training, Link Analyzer, URL Queue, Docs, Feature Vectors, Ontology Index, Training Docs, Bookmarks, Hubs & Authorities]

Page 9: BINGO!:  Bookmark-Induced Gathering of Information

Part II: System Components

Page 10: BINGO!:  Bookmark-Induced Gathering of Information

Focus Manager

Focusing strategies

Depth-first (df):

$$P_{df}(j) = \left( -\mathrm{depth}(j) + \frac{\mathrm{pos}(j)}{\mathrm{links}(j)} \right) \times (\mathrm{confidence}(j) + 1)^2$$

Breadth-first (bf):

$$P_{bf}(j) = \frac{\mathrm{depth}(j) + \mathrm{pos}(j) / \mathrm{links}(j)}{(\mathrm{confidence}(j) + 1)^2}$$

Strong focus (learning phase)

Soft focus (harvesting phase)

Tunneling
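The two priority formulas can be tried out with a small sketch. It assumes the reading P_bf(j) = (depth(j) + pos(j)/links(j)) / (confidence(j) + 1)^2 and P_df(j) = (-depth(j) + pos(j)/links(j)) × (confidence(j) + 1)^2, and it further assumes that the queue visits URLs in ascending order of P. The `Link` record, its field meanings, and the sample values are hypothetical and not taken from the BINGO! implementation.

```python
from dataclasses import dataclass


@dataclass
class Link:
    depth: int         # link distance from the seed documents (assumed meaning)
    pos: int           # position of the link on its source page (assumed meaning)
    links: int         # number of links on the source page (assumed meaning)
    confidence: float  # classifier confidence of the source page


def priority_bf(link: Link) -> float:
    """Breadth-first: shallow, high-confidence links get small values."""
    return (link.depth + link.pos / link.links) / (link.confidence + 1) ** 2


def priority_df(link: Link) -> float:
    """Depth-first: deep, high-confidence links get small (negative) values."""
    return (-link.depth + link.pos / link.links) * (link.confidence + 1) ** 2


# Illustrative candidates only.
candidates = [Link(1, 1, 4, 0.3), Link(2, 1, 2, 0.85), Link(2, 2, 2, 0.4)]

# Assumption: the smallest priority value is dequeued first.
bf_order = sorted(candidates, key=priority_bf)
df_order = sorted(candidates, key=priority_df)
```

Under these assumptions, the squared confidence term lets a high-confidence page pull its outgoing links ahead of links from weaker pages at a similar depth.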

Page 11: BINGO!:  Bookmark-Induced Gathering of Information

Focus Manager (2)

Sample URL Prioritization

[Figure: link graph over pages 1 to 10. Page annotations: confidence = 0.3, topic = A; confidence = 0.4, topic = A; confidence = 0.85, topic = A; confidence = 0.6, topic = B]

DF strong order: 1–2–5–3–6–4–9–10 ..
BF strong order: 1–2–5–3–4–6–9–10 ..
DF soft order: 1–2–5–6–3–7–8–4–9–10 ..
BF soft order: 1–2–5–3–6–4–7–8–9–10 ..

Page 12: BINGO!:  Bookmark-Induced Gathering of Information

Feature Selection

Mutual Information (MI) criterion:

$$MI(X_i, V_j) = P[X_i \wedge V_j] \, \log \frac{P[X_i \wedge V_j]}{P[X_i] \, P[V_j]}$$

Estimated from document counts:

$$MI(X_i, V_j) = \frac{A}{N} \, \log \frac{A \cdot N}{(A + B)(A + C)}$$

A is the number of documents in V_j containing X_i,
B is the number of documents with X_i in "competitive" topics,
C is the number of documents in V_j without X_i,
N is the overall number of documents in V_j and its competitive topics.

Time complexity: O(n) + O(mk) for n documents, m terms and k competitive topics.
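As a worked illustration of the count-based form MI(X_i, V_j) = (A/N) log(A·N / ((A+B)(A+C))), the following sketch scores terms and keeps the best k. The function names, the natural logarithm, the zero weight for A = 0, and the sample counts are assumptions made for this example.

```python
import math


def mi_weight(a: int, b: int, c: int, n: int) -> float:
    """MI(X_i, V_j) = (A/N) * log(A*N / ((A+B)*(A+C))), natural log assumed."""
    if a == 0:
        return 0.0  # assumed convention: a term absent from V_j gets weight 0
    return (a / n) * math.log(a * n / ((a + b) * (a + c)))


def top_features(counts: dict, n: int, k: int) -> list:
    """counts maps term -> (A, B, C); returns the k terms with highest MI."""
    scored = {t: mi_weight(a, b, c, n) for t, (a, b, c) in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Made-up counts: 50 documents in V_j and 50 in competitive topics (N = 100).
counts = {
    "sql": (40, 5, 10),   # frequent in V_j, rare elsewhere
    "the": (50, 50, 0),   # occurs everywhere, so log(1) = 0
    "xml": (2, 45, 48),   # mostly in competitive topics, negative weight
}
best = top_features(counts, n=100, k=2)
```

A term that is spread evenly over all topics ends up with weight 0, which is why MI favors topic-specific terms over merely frequent ones.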

Page 13: BINGO!:  Bookmark-Induced Gathering of Information

Feature Selection (2)

Top features for the topic "DB Core Technology" with regard to tf*idf (left) and MI (right)

Term         tf*idf score    Term        MI weight
below        1.4927          storag      0.1428
et           1.2778          modifi      0.1258
graph        1.2446          sql         0.1209
involv       1.0406          disk        0.1179
accomplish   0.9491          pointer     0.1150
backup       0.8613          deadlock    0.1001
command      0.8567          redo        0.1001
exactli      0.8112          implement   0.0963
feder        0.7764          correctli   0.0911
histor       0.6822          size        0.0911

Page 14: BINGO!:  Bookmark-Induced Gathering of Information

Classifier

[Figure: feature space with axes x1 and x2; the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ separates the documents of topic V from ¬V with margin δ on either side; a new document is marked "?"]

Training: compute the hyperplane $\vec{w} \cdot \vec{x} + b = 0$

Classification: check whether $\vec{w} \cdot \vec{y} + b > 0$

Input:

n training vectors with components (x1, ..., xm, C), where C = +1 or C = -1
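The classification step can be sketched in a few lines, assuming a hyperplane (w, b) has already been trained elsewhere. The function names and the numeric values of w and b are hypothetical; only the sign test on w · y + b is taken from the slide.

```python
def decision_value(w: list, b: float, y: list) -> float:
    """Signed score w . y + b of feature vector y against the hyperplane."""
    return sum(wi * yi for wi, yi in zip(w, y)) + b


def classify(w: list, b: float, y: list) -> int:
    """Return +1 if y lies on the positive side of the hyperplane, else -1."""
    return 1 if decision_value(w, b, y) > 0 else -1


# Illustrative hyperplane over m = 2 features; not learned from real data.
w, b = [0.5, -0.25], -0.1

label_pos = classify(w, b, [1.0, 0.0])  # 0.5 - 0.1 > 0
label_neg = classify(w, b, [0.0, 1.0])  # -0.25 - 0.1 < 0
```

The test is a single dot product over the m selected features, so its cost grows linearly with m.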

Page 15: BINGO!:  Bookmark-Induced Gathering of Information

Hierarchical Classification

Recursive classification by the taxonomy tree. Decisions are based on topic-specific feature spaces.

[Figure: taxonomy tree with node labels ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML, annotated with the values 0.8, 0.1, -0.5, 0.2, -0.7, 0.2 and 0.4; Semistructured Data and Data Mining are highlighted]
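The recursive descent through the taxonomy tree might look like the following sketch. It assumes that each topic has its own scoring function, that a document is passed to the child with the highest positive score, and that it stays at the current node when no child accepts it. The tie-breaking rule, the toy tree, and the toy scoring functions are assumptions, not details given in the slides.

```python
def classify_hierarchical(doc: dict, node: str, children: dict, classifiers: dict) -> str:
    """Descend the topic tree and return the deepest topic that accepts doc."""
    best, best_conf = None, 0.0
    for child in children.get(node, []):
        conf = classifiers[child](doc)  # topic-specific decision value
        if conf > best_conf:
            best, best_conf = child, conf
    if best is None:
        return node  # no child accepts the document
    return classify_hierarchical(doc, best, children, classifiers)


# Hypothetical two-level tree and scoring functions, for illustration only.
children = {
    "ROOT": ["Data Mining", "Semistructured Data"],
    "Semistructured Data": ["XML"],
}
classifiers = {
    "Data Mining": lambda d: d.get("mine", 0.0) - 0.5,
    "Semistructured Data": lambda d: d.get("xml", 0.0) - 0.2,
    "XML": lambda d: d.get("schema", 0.0) - 0.1,
}

topic = classify_hierarchical({"xml": 1.0, "schema": 0.5}, "ROOT", children, classifiers)
```

Because every node consults only the classifiers of its own children, each decision can use a feature space selected for that particular set of competing topics.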

Page 16: BINGO!:  Bookmark-Induced Gathering of Information

Link Analysis

The HITS Algorithm

Web graph G = (S, E)

Authority score: $x_q = \sum_{(p,q) \in E} y_p$

Hub score: $y_p = \sum_{(p,q) \in E} x_q$

In matrix form, with A the adjacency matrix of G: $\vec{x} = A^T \vec{y}$ and $\vec{y} = A \vec{x}$

Iterative approximation of the dominant eigenvectors of $A^T A$ and $A A^T$:

$\vec{x} := A^T \vec{y} := A^T A \vec{x}$

$\vec{y} := A \vec{x} := A A^T \vec{y}$
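A minimal pure-Python sketch of the HITS iteration on a small edge list follows. The authority and hub updates implement x_q = Σ y_p and y_p = Σ x_q over all edges (p, q); the L2 normalization after each round, the fixed iteration count, and the toy graph are assumptions added to keep the example self-contained.

```python
import math


def normalize(scores: dict) -> dict:
    """Scale the score vector to unit Euclidean length."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {v: s / norm for v, s in scores.items()}


def hits(nodes: list, edges: list, iterations: int = 50):
    """Return (authority, hub) scores for a directed graph given as (p, q) edges."""
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: x_q = sum of y_p over all edges (p, q).
        new_auth = {v: 0.0 for v in nodes}
        for p, q in edges:
            new_auth[q] += hub[p]
        # Hub update: y_p = sum of x_q over all edges (p, q).
        new_hub = {v: 0.0 for v in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Toy graph: a and b both link to c, and a also links to b.
auth, hub = hits(["a", "b", "c"], [("a", "c"), ("b", "c"), ("a", "b")])
```

In this toy graph, page c collects the most in-links from good hubs and ends up as the top authority, while page a is the top hub.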

Page 17: BINGO!:  Bookmark-Induced Gathering of Information

Retraining based on Archetypes

Two sources of potential archetypes:

Link analysis → Nauth good authorities

SVM classifier → Nconf best-rated docs

To avoid the "topic drift" phenomenon, the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.

Page 18: BINGO!:  Bookmark-Induced Gathering of Information

Retraining (2)

if {at least one topic has more than Nmax positive documents
    or all topics have more than Nmin positive documents} {
  for each topic Vi {
    link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents;
    authorities(Vi) = top Nauth documents;
    sort docs of Vi in descending order of confidence;
    archetypes(Vi) = top Nconf from confidence ranking ∪ authorities(Vi);
    remove from archetypes(Vi) all docs with confidence < mean confidence of the previous iteration;
    archetypes(Vi) = archetypes(Vi) ∪ bookmarks(Vi)
  };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi
  };
  re-initialize URL queue using hubs(Vi)
}
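The archetype selection for a single topic can be sketched as below. The sketch assumes that the union of the Nconf best-rated documents and the authorities forms the candidate set, that candidates below the previous mean confidence are dropped, and that the bookmarks are always kept. The function name, the handling of authorities without a confidence value, and the sample data are hypothetical.

```python
def select_archetypes(confidence: dict, authorities: list, bookmarks: list,
                      prev_mean_conf: float, n_conf: int = 20) -> set:
    """Pick the training documents (archetypes) for one topic V_i.

    confidence maps document -> SVM confidence for this topic.
    """
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    candidates = set(ranked[:n_conf]) | set(authorities)
    # Drop candidates whose confidence is below the previous iteration's mean.
    # Assumption: an authority without a confidence value is dropped as well.
    kept = {d for d in candidates
            if confidence.get(d, float("-inf")) >= prev_mean_conf}
    # The original bookmarks always remain part of the training set.
    return kept | set(bookmarks)


# Made-up documents and scores, for illustration only.
archetypes = select_archetypes(
    confidence={"d1": 1.3, "d2": 0.9, "d3": 0.2, "d4": 0.6},
    authorities=["d3", "d4"],
    bookmarks=["b1"],
    prev_mean_conf=0.5,
    n_conf=2,
)
```

Here d3 is a good authority but falls below the confidence threshold, so it is not promoted to an archetype; this is the guard against topic drift.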

Page 19: BINGO!:  Bookmark-Induced Gathering of Information

Part III: Evaluation

Page 20: BINGO!:  Bookmark-Induced Gathering of Information

Testbed

Bookmarks: homepages of researchers in the various areas
Leaf nodes were filled with 9-15 bookmarks
The total training data comprised 81 documents

Focused crawl:
Crawling time: 6 h
Visited: 11000 pages (1800 hosts), link distances 1-7
4230 positively classified (675 different hosts)

Entire crawl: 7 iterations with re-training.

Parameters:
Nmin = 50, Nmax = 200, Nhub = 50, Nauth = 20, Nconf = 20
Feature selection: MI criterion, best 300 for each topic
Authority ranking: HITS algorithm

Page 21: BINGO!:  Bookmark-Induced Gathering of Information

Crawling Precision

Iteration   Data Mining   XML    Entire ontology
1           0.98          0.94   0.98
2           0.98          0.93   0.98
3           0.99          0.97   0.96
4           0.87          0.99   0.97
5           0.90          0.95   0.96
6           0.98          0.98   0.95
7           0.94          0.97   0.96

Page 22: BINGO!:  Bookmark-Induced Gathering of Information

Crawling Precision (2)

Iteration   BINGO!   With focusing, no MI   No focusing, no MI
1           0.98     0.89                   0.84
2           0.98     0.86                   0.86
3           0.96     0.75                   0.79
4           0.97     0.78                   0.73
5           0.96     0.55                   0.63
6           0.95     0.54                   0.52
7           0.96     0.63                   0.50

Page 23: BINGO!:  Bookmark-Induced Gathering of Information

Crawling Recall

Iteration   Data Mining   XML    Entire ontology
1           307           117    807
2           552           343    1615
3           1092          396    2436
4           1553          442    3245
5           2071          562    4072
6           2678          627    4898
7           3027          701    5715

Page 24: BINGO!:  Bookmark-Induced Gathering of Information

Archetype Selection

Topic "Data Mining":

URL                                                      SVM confidence
http://www.it.iitb.ernet.in/~sunita/it642/               1.35
http://www.research.microsoft.com/research/datamine/     1.31
http://www.acm.org/sigs/sigkdd/explorations/             1.28
http://robotics.stanford.edu/users/ronnyk/               1.24
http://www.kdnuggets.com/index.html                      1.18
http://www.wizsoft.com/                                  1.16
http://www.almaden.ibm.com/cs/people/ragrawal/           1.14
http://www.cs.sfu.ca/~han/DM_Book.html                   1.14
http://db.cs.sfu.ca/sections/publication/kdd/kdd.html    1.14
http://www.cs.cornell.edu/johannes/publications.html     0.78

Page 25: BINGO!:  Bookmark-Induced Gathering of Information

Archetype Selection (2)

Iteration   Data Mining   XML      Entire ontology
1           10 (1)        5 (0)    24 (4)
2           10 (2)        11 (0)   27 (5)
3           9 (1)         17 (1)   32 (4)
4           8 (0)         7 (0)    29 (3)
5           22 (2)        26 (2)   62 (8)
6           43 (4)        12 (2)   77 (10)
7           38 (0)        13 (1)   75 (8)

Page 26: BINGO!:  Bookmark-Induced Gathering of Information

Feature Selection

Topic "Data Mining":

Feature     MI weight
mine        0.178
knowledg    0.122
olap        0.106
frame       0.086
pattern     0.066
genet       0.061
discov      0.053
miner       0.053
cluster     0.049
dataset     0.044

Page 27: BINGO!:  Bookmark-Induced Gathering of Information

Future Work

Large-scale experiments (portal generator)

Annotation and semantic classification of HTML sources (e.g., transformation of HTML to XML for improved data management, detection of “information units”)

Advanced feature construction and feature selection algorithms

Fault tolerance on document collections with wrong samples, adaptive re-training

... ?

Page 29: BINGO!:  Bookmark-Induced Gathering of Information

Crawler

Key features:

asynchronous DNS lookups with caching

multiple download attempts

advanced duplicate recognition

following multiple redirects

advanced topic-balanced URL-queue

document filters for common datatypes

focusing strategies

Page 30: BINGO!:  Bookmark-Induced Gathering of Information

Classifier (II)

Training: find the hyperplane $\vec{w} \cdot \vec{x} + b = 0$ that separates the samples with maximum margin (quadratic optimization task):

$$\text{minimize: } V(\vec{w}, b, \vec{\xi}) = \frac{1}{2} \, \vec{w} \cdot \vec{w} + C \sum_{i=1}^{n} \xi_i$$

$$\text{subj. to: } \forall_{i=1}^{n}: \; y_i \, [\vec{w} \cdot \vec{x}_i + b] \geq 1 - \xi_i$$

$$\forall_{i=1}^{n}: \; \xi_i \geq 0$$

Classification: test unlabeled vector $\vec{y}$ for $\vec{w} \cdot \vec{y} + b > 0$

Very efficient runtime in O(m)

Page 31: BINGO!:  Bookmark-Induced Gathering of Information

Related Work

General-purpose crawling

Focused crawling

Authority ranking

Classification of Web documents

Web ontologies