Knowledge Extraction from the Web Monika Henzinger Steve Lawrence.


Knowledge Extraction from the Web

Monika Henzinger

Steve Lawrence


Outline

• Hyperlink analysis in web IR
• Sampling the web:
  – Web pages
  – Web hosts
• Web graph models
• Focused crawling
• Finding communities


Hyperlink analysis in web information retrieval


Graph structure of the web

• Web graph
  – Each web page is a node
  – Each hyperlink is a directed edge
• Host graph
  – Each host is a node
  – If there are k links from host A to host B, there is an edge with weight k from A to B.
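The host graph definition above can be sketched in a few lines of Python. This is only an illustration: the function name `host_graph`, the example URLs, and the decision to drop links that stay within one host are assumptions, not part of the slides.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_graph(page_links):
    """Collapse page-level links into weighted host-level edges.

    page_links: iterable of (source_url, target_url) pairs.
    Returns a Counter mapping (source_host, target_host) -> number of links k.
    """
    weights = Counter()
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host is None or dst_host is None:
            continue  # skip URLs without a host part
        if src_host != dst_host:  # assumption: links within a host are ignored
            weights[(src_host, dst_host)] += 1
    return weights


# Hypothetical page-level links
links = [
    ("http://a.example/x", "http://b.example/1"),
    ("http://a.example/y", "http://b.example/2"),
    ("http://a.example/y", "http://a.example/x"),
    ("http://b.example/1", "http://c.example/"),
]
graph = host_graph(links)
print(graph[("a.example", "b.example")])  # weight of the edge a.example -> b.example
```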


Hyperlink analysis in Web IR

• Idea: Mine structure of the web graph to improve search results
• Related work:
  – Classic IR work (citations = links) a.k.a. “Bibliometrics” [K’63, G’72, S’73,…]
  – Socio-metrics [K’53, MMSM’86,…]
  – Many Web related papers use this approach [PPR’96, AMM’97, S’97, CK’97, K’98, BP’98,…]


Google’s approach

• Assumption: A link from page A to page B is a recommendation of page B by the author of A (we say B is a successor of A)
  Quality of a page is related to its in-degree
• Recursion: Quality of a page is related to
  – its in-degree, and to
  – the quality of pages linking to it
  PageRank [BP ’98]


Definition of PageRank

• Consider the following infinite random walk (surf):
  – Initially the surfer is at a random page
  – At each step, the surfer proceeds
    • to a randomly chosen web page with probability d
    • to a randomly chosen successor of the current page with probability 1-d
• The PageRank of a page p is the fraction of steps the surfer spends at p in the limit.


PageRank (cont.)

By previous theorem:
• PageRank = stationary probability for this Markov chain, i.e.

  $$\mathrm{PageRank}(u) = \frac{d}{n} + (1-d) \sum_{(v,u) \in E} \frac{\mathrm{PageRank}(v)}{\mathrm{outdegree}(v)}$$

  where n is the total number of nodes in the graph
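The stationary probabilities can be approximated by repeatedly applying the random-surfer update PageRank(u) = d/n + (1-d) · Σ PageRank(v)/outdegree(v), summed over pages v linking to u. The sketch below assumes a small graph given as a dictionary of successor lists; the function name, the default d = 0.15, the fixed iteration count, and the treatment of pages without successors (the surfer jumps to a random page) are illustrative assumptions.

```python
def pagerank(successors, d=0.15, iterations=100):
    """Approximate PageRank by power iteration.

    successors: dict mapping each page to a list of its successors
                (every successor must also be a key of the dict).
    d: probability of jumping to a randomly chosen page.
    """
    nodes = list(successors)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # surfer starts at a random page
    for _ in range(iterations):
        new_rank = {u: d / n for u in nodes}  # random-jump term d/n
        for v in nodes:
            out = successors[v]
            if out:
                share = (1 - d) * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # assumption: a page with no successors jumps uniformly at random
                share = (1 - d) * rank[v] / n
                for u in nodes:
                    new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page, value in sorted(ranks.items()):
    print(page, round(value, 3))
```

In this toy graph page c collects links from both a and b, so it ends up with a higher score than b.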


Neighborhood graph

• Subgraph associated to each query
• Query Results = Start Set (Result 1, Result 2, …, Result n)
• Back Set: b1, b2, …, bm (pages linking to the start set)
• Forward Set: f1, f2, …, fs (pages linked from the start set)
• An edge for each hyperlink, but no edges within the same host


HITS [Kleinberg’98]

• Goal: Given a query find:

– Good sources of content (authorities)

– Good sources of links (hubs)


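HITS details

Repeat until HUB and AUTH converge:
  Normalize HUB and AUTH
  HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui)
  AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)

(Figure: pages w1, w2, …, wk link to v and determine its authority score A; v links to u1, u2, …, uk, which determine its hub score H.)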
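The HUB/AUTH update can be sketched as follows. This is a minimal illustration, not Kleinberg's reference implementation: it runs a fixed number of rounds instead of testing convergence, normalizes with the Euclidean norm, and uses a made-up edge list.

```python
import math


def normalize(scores):
    """Scale scores in place to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    for v in scores:
        scores[v] /= norm


def hits(edges, iterations=50):
    """Compute hub and authority scores for a list of (source, target) edges."""
    nodes = {x for edge in edges for x in edge}
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):  # assumption: fixed rounds stand in for "until converged"
        # HUB[v] := sum of AUTH[u] over all u with Edge(v, u)
        hub = {v: 0.0 for v in nodes}
        for v, u in edges:
            hub[v] += auth[u]
        # AUTH[v] := sum of HUB[w] over all w with Edge(w, v)
        auth = {v: 0.0 for v in nodes}
        for w, v in edges:
            auth[v] += hub[w]
        normalize(hub)
        normalize(auth)
    return hub, auth


# Hypothetical neighborhood graph: h1 links to both a1 and a2, h2 and h3 only to a1
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```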


PageRank vs. HITS

PageRank:
• Computation:
  – Once for all documents and queries (offline)
• Query-independent – requires combination with query-dependent criteria
• Hard to spam

HITS:
• Computation:
  – Requires computation for each query
• Query-dependent
• Relatively easy to spam
• Quality depends on quality of start set
• Gives hubs as well as authorities


PageRank vs. HITS

PageRank:
• [Lempel] Not rank-stable: O(1) changes in graph can change O(N^2) order-relations
• [Ng, Zheng, Jordan01] “Value”-stable: change in k nodes (with PR values p1,…,pk) results in p* s.t.
  $$\|p^* - p\| \le \frac{2}{d} \sum_{j=1}^{k} p_j$$

HITS:
• Not rank-stable
• “Value”-stability depends on gap g between largest and second largest eigenvalue of $A^T A$: change of O(g) in $A^T A$ results in p* s.t.
  $$\|p^* - p\| = \Omega(1)$$


Random sampling of web pages


Random sampling of web pages

• Useful for estimating:
  – Web properties: Percentage of pages in a domain, in a language, on a topic, indegree distribution …
  – Search engine comparison: Percentage of pages in a search engine index (index size)


Let’s do the random walk!

• Perform PageRank random walk
• Select uniform random sample from resulting pages
• Can’t jump to a random page; instead, jump to a random page on a random host seen so far.
• Problem:
  – Starting state bias: finite walk only approximates PageRank.

“Quality-biased” sample of the web


Most frequently visited pages

Page                                             Freq. Walk 2   Freq. Walk 1   Rank Walk 1
www.microsoft.com/                               3172           1600           1
www.microsoft.com/windows/ie/default.htm         2064           1045           3
www.netscape.com/                                1991           876            6
www.microsoft.com/ie/                            1982           1017           4
www.microsoft.com/windows/ie/download/           1915           943            5
www.microsoft.com/windows/ie/download/all.htm    1696           830            7
www.adobe.com/prodindex/acrobat/readstep.html    1634           780            8
home.netscape.com/                               1581           695            10
www.linkexchange.com/                            1574           763            9
www.yahoo.com/                                   1527           1132           2


Most frequently visited hosts

Site                    Frequency Walk 2   Frequency Walk 1   Rank Walk 1
www.microsoft.com       32452              16917              1
home.netscape.com       23329              11084              2
www.adobe.com           10884              5539               3
www.amazon.com          10146              5182               4
www.netscape.com        4862               2307               10
excite.netscape.com     4714               2372               9
www.real.com            4494               2777               5
www.lycos.com           4448               2645               6
www.zdnet.com           4038               2562               8
www.linkexchange.com    3738               1940               12
www.yahoo.com           3461               2595               7


Sampling pages nearly uniformly

• Perform PageRank random walk
• Sample pages from walk s.t.
  $$\Pr(p \text{ is sampled} \mid p \text{ is crawled}) \propto 1/\mathrm{PageRank}(p)$$
• Don’t know PageRank(p):
  – PR: PageRank computation of crawled graph
  – VR: VisitRatio on crawled graph

“Nearly uniform” sample of the web
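A minimal sketch of the VR variant: each crawled page is drawn with probability inversely proportional to its visit ratio on the walk, so frequently visited (high-PageRank) pages are down-weighted. The function name, the with-replacement sampling, and the toy walk are assumptions made for illustration.

```python
import random
from collections import Counter


def near_uniform_sample(walk, sample_size, rng=random):
    """Sample crawled pages with probability proportional to 1 / VisitRatio(p).

    walk: list of pages in the order the random walk visited them.
    """
    visits = Counter(walk)
    pages = list(visits)
    # VisitRatio(p) = visits[p] / len(walk); the weight is its inverse
    weights = [len(walk) / visits[p] for p in pages]
    return rng.choices(pages, weights=weights, k=sample_size)


# Hypothetical walk that visits page "a" nine times as often as page "b"
walk = ["a"] * 90 + ["b"] * 10
sample = near_uniform_sample(walk, 1000, rng=random.Random(0))
counts = Counter(sample)
print(counts)
```

Because "b" was visited far less often during the walk, it is drawn more often here, which is the correction that offsets the crawl's bias toward well-linked pages.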


Sampling pages nearly uniformly

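• “Nearly uniform” sample:
  $$\Pr(p \text{ is sampled}) = \Pr(p \text{ is crawled}) \cdot \Pr(p \text{ is sampled} \mid p \text{ is crawled}) \approx \text{constant}$$
  – Recall:
    $$\Pr(p \text{ is sampled} \mid p \text{ is crawled}) \propto 1/\mathrm{PageRank}(p)$$
  – A page is well-connected if it can be reached by almost every other page by short paths ($O(n^{1/2})$ steps)
  – For short paths in a well-connected graph (L is the length of the walk):
    $$\Pr(p \text{ is crawled}) \approx E(\text{number of visits of } p) \approx L \cdot \mathrm{PageRank}(p)$$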


Sampling pages nearly uniformly

• Problems:
  – Starting state bias: finite walk only approximates PageRank.
  – Dependence, especially in short cycles


Synthetic graphs: in-degree


Synthetic graphs: PageRank


Experiments on the real web

• Performed 3 random walks in Nov 1999 (starting from 10,258 seed URLs)
• Small overlap between walks – walks disperse well (82% visited by only 1 walk)

Walk #   visited URLs   unique URLs
1        2,702,939      990,251
2        2,507,004      921,114
3        5,006,745      1,655,799


Percentage of pages in domains

(Figure: bar chart of the percentage of pages in each domain (com, edu, org, net, jp, gov, de, uk), vertical axis from 0 to 50. Series: Crawl; Walks 1, 2, 3 with Uniform sampling; Walks 1, 2, 3 with PR sampling; Walks 1, 2, 3 with VR sampling.)


Estimating search engine index size

• Choose a sample of pages p1, p2, p3, …, pn according to a near-uniform distribution
• Check if the pages are in search engine index S [BB’98]:
  – Exact match
  – Host match
• Estimate for size of index S is the percentage of sampled pages that are in S, i.e.
  $$\frac{1}{n} \sum_{j=1}^{n} I[p_j \in S]$$
  where I[pj in S] = 1 if pj is in S and 0 otherwise
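The estimator is just the average of the indicator I[pj in S] over the sample. The sketch below shows both matching rules; the index contents, the sampled URLs, and the helper names are hypothetical, and a real check would query the search engine rather than a local set.

```python
from urllib.parse import urlsplit


def index_fraction(sample, in_index):
    """Estimate the fraction of pages covered by an index: (1/n) * sum of I[p in S]."""
    return sum(1 for p in sample if in_index(p)) / len(sample)


# Hypothetical index contents and near-uniform sample
indexed_urls = {"http://a.example/1", "http://b.example/home"}
indexed_hosts = {urlsplit(u).hostname for u in indexed_urls}
sample = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/home",
    "http://c.example/",
]

exact = index_fraction(sample, lambda p: p in indexed_urls)  # exact match
host = index_fraction(sample, lambda p: urlsplit(p).hostname in indexed_hosts)  # host match
print(exact, host)
```

Host match is the more lenient test, so its estimate is never smaller than the exact-match estimate.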


Result set for index size (fall ’99)


Random sampling of sites


Publicly indexable web

• We analyzed the “publicly indexable web”
• Excludes pages that are not indexed by the major search engines due to
  – Authentication requirements
  – Pages hidden behind search forms
  – Robots exclusion standard


Random sampling of sites

• Randomly sample IP addresses (256^4 or about 4.3 billion)
• Test for a web server at the standard port
• Many machines and network connections are temporarily unavailable - recheck all addresses after one week
• Many sites serve the same content on multiple IP addresses for load balancing or redundancy
  – Use DNS - only count one address in publicly indexable web
• Many servers not part of the “publicly indexable web”
  – Authorization requirements, default page, sites “coming soon”, web-hosting companies that present their homepage on many IP addresses, printers, routers, proxies, mail servers, etc.
  – Use regular expressions to find a majority, manual inspection
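The sampling step can be sketched as below. The probe is passed in as a function because a real test would open a connection to port 80 and recheck failures later; `has_web_server`, the stand-in probe, and the extrapolation to the full 256^4 address space are illustrative assumptions, and the filtering of duplicate and non-indexable servers described above is not modeled.

```python
import ipaddress
import random


def random_ipv4(rng=random):
    """Draw one address uniformly from the 256**4 possible IPv4 addresses."""
    return ipaddress.IPv4Address(rng.getrandbits(32))


def estimate_server_count(n, has_web_server, rng=random):
    """Probe n random addresses and extrapolate the hit rate to all of IPv4."""
    found = [addr for addr in (random_ipv4(rng) for _ in range(n)) if has_web_server(addr)]
    return found, len(found) / n * 256**4


# Stand-in probe (assumption): pretends every even-numbered address runs a server
fake_probe = lambda addr: int(addr) % 2 == 0
found, estimate = estimate_server_count(1000, fake_probe, rng=random.Random(1))
print(len(found), round(estimate))
```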


Feb 99 results

• Manually classified 2,500 random web servers
• 83% of sites commercial
• Percentage of sites in areas like science, health, and government is relatively small
  – Would be feasible and very valuable to create specialized services that are very comprehensive and up to date
• 65% of sites have a majority of pages in English


Metadata analysis

• Analyzed simple HTML meta tag usage on the homepage of the 2,500 random servers
  – 34% of sites had description or keywords tags
• Low usage of this simple standard suggests that acceptance and widespread use of more complex standards like XML and Dublin Core may be very slow
  – 0.3% of sites contained Dublin Core tags
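A check of this kind can be sketched with Python's standard `html.parser`. The class and function names and the sample homepage are hypothetical; treating any meta name that starts with "dc." as a Dublin Core tag is an assumption about how such tags were detected.

```python
from html.parser import HTMLParser


class MetaTagFinder(HTMLParser):
    """Collect the lower-cased name attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name.lower())


def meta_names(html):
    finder = MetaTagFinder()
    finder.feed(html)
    return finder.names


# Hypothetical homepage
page = """<html><head>
<meta name="Description" content="Example site">
<meta name="DC.Title" content="Example">
</head><body></body></html>"""

names = meta_names(page)
has_simple_tags = bool(names & {"description", "keywords"})
has_dublin_core = any(n.startswith("dc.") for n in names)
print(has_simple_tags, has_dublin_core)
```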


Web graph models


Inverse power laws on the web

• Fraction of pages with k in-links $\propto k^{-\gamma}$, $\gamma \approx 2.1$


Properties with inverse power law

1. indegree of web pages
2. outdegree of web pages
3. indegree of web pages, off-site links only
4. outdegree of web pages, off-site links only
5. size of weakly connected components
6. size of strongly connected components
7. indegree of hosts
8. outdegree of hosts
9. number of hyperlinks between host pairs
10. PageRank
11. …


Category specific web

• All US company homepages
• Histogram with exponentially increasing size buckets (constant size on log scale)
• Strong deviation from pure power law
• Unimodal body, power law tail


Web graph model [BA ’99]

• Preferential attachment model:
• Start with $n_0$ nodes
• At each timestep:
  – add 1 node v and
  – m edges incident to v s.t. for each new edge:
    P(other endpoint is node u) $\propto$ in-degree(u)
• Theorem: P(page has k in-links) $\propto k^{-3}$


Combining preferential and uniform

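• Extension of preferential attachment model:
• Start with $n_0$ nodes
• At timestep t:
  – add 1 node v and
  – m edges s.t. for each new edge:
    $$P(\text{node } u \text{ is endpoint}) = \alpha \, \frac{\mathrm{indegree}(u)}{2mt} + (1-\alpha) \, \frac{1}{n_0 + t}$$
• Theorem: $P(\text{page has } k \text{ in-links}) \propto k^{-(1 + 1/\alpha)}$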
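A small simulation can make the mixture of preferential and uniform attachment concrete. The sketch assumes a mixing parameter, called `alpha` here: each edge endpoint is chosen preferentially with probability alpha and uniformly at random otherwise. It also assumes both endpoints of every new edge are drawn this way and that "preferential" means proportional to the number of edge endpoints a node already has. Function and parameter names are illustrative, not the authors' code.

```python
import random
from collections import Counter


def grow_graph(steps, m=2, alpha=0.9, n0=3, rng=random):
    """Grow a graph by mixed preferential/uniform attachment; return node degrees."""
    n = n0                # number of nodes so far (labelled 0 .. n-1)
    endpoints = []        # node u appears once per edge endpoint it holds,
                          # so a uniform draw from this list is degree-proportional
    degree = Counter()
    for _ in range(steps):
        n += 1            # add one new node
        for _ in range(m):  # add m edges
            edge = []
            for _ in range(2):
                if endpoints and rng.random() < alpha:
                    u = rng.choice(endpoints)   # preferential choice
                else:
                    u = rng.randrange(n)        # uniform choice
                edge.append(u)
            for u in edge:
                degree[u] += 1
                endpoints.append(u)
    return degree


degree = grow_graph(steps=2000, m=2, alpha=0.9, rng=random.Random(0))
print(degree.most_common(3))
```

Lowering `alpha` toward 0.5 spreads links more evenly; raising it toward 1 concentrates them on a few heavily linked nodes.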


Preferential vs. uniform attachment

• α > 0.5 always
  – Preferential attachment plays a greater role in web link growth than uniform attachment
• Distribution of links to companies and newspapers close to power law
• Distribution of links to universities and scientists closer to uniform
  – More balanced mixture of preferential and uniform attachment

Dataset         α (preferential attachment)
Companies       0.95
Newspapers      0.95
Web inlinks     0.91
Universities    0.61
Scientists      0.60
Web outlinks    0.58
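As a worked example, the tabulated values can be converted into the power-law tail exponent predicted by the combined model. This assumes the tabulated number is the model's mixing parameter (written α here) and that the tail exponent is 1 + 1/α; both are a reading of the slides rather than something they state in this form.

```python
# Mixing parameter per dataset, copied from the table above
alphas = {
    "Companies": 0.95,
    "Newspapers": 0.95,
    "Web inlinks": 0.91,
    "Universities": 0.61,
    "Scientists": 0.60,
    "Web outlinks": 0.58,
}

# Assumed tail exponent of the degree distribution: 1 + 1/alpha
exponents = {name: 1 + 1 / a for name, a in alphas.items()}

for name, exponent in exponents.items():
    print(f"{name:<13} alpha={alphas[name]:.2f}  exponent={exponent:.2f}")
```

Under this reading, the value for Web inlinks gives an exponent of about 2.1, and smaller parameter values give steeper tails, that is, less "winners take all" behavior.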


E-commerce categories


Other networks

• Most social/biological networks exhibit drop-off from power law scaling at small k

• Actor collaborations, paper citations, US power grid, global web outlinks, web file sizes


Graph model summary

– Previous research: power law distribution of inlinks - “winners take all”
– Only an approximation - hides important details
– Distribution varies in different categories; may be much less biased
– New model accurately accounts for the distribution of category specific pages, the web as a whole, and other social networks
– May be used to predict degree of “winners take all” behavior


Copy model [KKRRT’99]

• At each timestep add new node u with fixed outdegree d.
• The destinations of these links are chosen:
  – Choose existing node v uniformly at random.
  – For j=1,...,d, the j-th link of u points to a random existing node with probability α and to the destination of v’s j-th link with probability 1-α.
• Models power law as well as large number of small bipartite cliques.
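The copying process can be sketched as below, assuming the probability of choosing a random destination is a parameter called `alpha`. The seed graph (d+1 nodes that each link to all the others) and all names are illustrative choices, since the slide does not specify how the process starts.

```python
import random


def copy_model(steps, d=3, alpha=0.2, rng=random):
    """Grow a graph by the copying process; return the list of out-link lists."""
    # assumption: seed graph of d+1 nodes, each linking to the d other nodes
    out = [[(i + j + 1) % (d + 1) for j in range(d)] for i in range(d + 1)]
    for _ in range(steps):
        existing = len(out)
        v = rng.randrange(existing)  # prototype node chosen uniformly at random
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(existing))  # random existing node
            else:
                links.append(out[v][j])                # copy v's j-th link
        out.append(links)                              # new node u with outdegree d
    return out


graph = copy_model(steps=1000, d=3, alpha=0.2, rng=random.Random(0))
print(len(graph), graph[-1])
```

Copied links keep pointing at pages that are already linked, which is what produces heavily linked pages and groups of nodes sharing the same destinations.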


Relink model

• Hostgraph exhibits drop-off from power law scaling at small k, which motivates the relink model:
• With probability β select a random existing node u, and with probability 1-β create a new node u. Add d edges to u.
• The destinations of these links are chosen:
  – Choose existing node v uniformly at random and choose d random edges with source v.
  – Determine destinations as in the copy model.


Relink model


Linkage between domains

Domain   com    Self   1        2        3        4
com      82.9   82.9   net 6.5  org 2.6  jp 0.8   uk 0.7
cn       15.8   74.1   tw 0.4   jp 0.2   de 0.2   hk 0.1
jp       17.4   74.5   to 0.8   cn 0.6   uk 0.2   de 0.1
tw       22.0   66.0   to 1.3   au 0.6   jp 0.6   ch 0.4
ca       19.4   65.2   uk 0.6   fr 0.4   se 0.3   de 0.3
de       16.0   71.2   uk 0.8   ch 0.6   at 0.5   nl 0.2
br       17.8   69.1   uk 0.4   pt 0.4   de 0.4   ar 0.2
fr       20.9   61.9   ch 0.9   de 0.8   uk 0.7   ca 0.5
uk       34.2   33.1   de 0.6   ca 0.5   jp 0.3   se 0.3


Finding communities


Finding communities

• Identifying communities is valuable for
  – Focused search engines
  – Web directory creation
  – Content filtering
  – Analysis of communities and relationships


Recursive communities

• Several methods proposed
• One link based method:
• A community consists of members that have more links within the community than outside of the community


s-t Maximum flow

• Definition: given a directed graph, G=(V,E), with edge capacities c(u,v) ≥ 0, and two vertices, s, t ∈ V, find the maximum flow that can be routed from the source, s, to the sink, t.

• Intuition: think of water pipes

• Note: maximum flow = minimum cut

• Maximum flow yields communities


Maximum flow communities

• If the source is in the community, the sink is outside of the community, and the degree of the source and sink exceeds the cut size, then maximum flow identifies the entire community.
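The idea can be sketched with a textbook maximum-flow routine (Edmonds-Karp, breadth-first augmenting paths): once no augmenting path remains, the nodes still reachable from the source in the residual graph form the source side of a minimum cut, which is taken as the community. This is an illustration of the exact cut-based idea only, not the authors' approximate method; the toy graph, unit capacities, and function names are assumptions.

```python
from collections import defaultdict, deque


def max_flow_community(capacity, s, t):
    """Return (max flow value, set of nodes on the source side of a minimum cut)."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual[u][v] += c
            residual[v][u] += 0  # make sure the reverse residual edge exists

    def reachable():
        """Breadth-first search over edges with remaining capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if t not in parent:
            return flow, set(parent)  # no augmenting path left: source side of the cut
        # find the bottleneck capacity along the path, then push that much flow
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


def undirected(edges):
    """Treat each link as two directed edges of capacity 1 (assumption)."""
    cap = defaultdict(dict)
    for u, v in edges:
        cap[u][v] = 1
        cap[v][u] = 1
    return cap


# Hypothetical graph: triangle {s, a, b} joined to triangle {x, y, t} by one link b-x
edges = [("s", "a"), ("s", "b"), ("a", "b"), ("b", "x"), ("x", "t"), ("x", "y"), ("y", "t")]
flow, community = max_flow_community(undirected(edges), "s", "t")
print(flow, sorted(community))
```

In the toy graph the source and sink each have degree 2, which exceeds the cut size of 1, so the condition stated above holds and the cut isolates the source's triangle.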


Maximum flow communities


Maximum flow communities


SVM web community

• Seed set consisted of:
  – http://svm.first.gmd.de/
  – http://svm.research.bell-labs.com/
  – http://www.clrc.rhbnc.ac.uk/research/SVM/
  – http://www.support-vector.net/

• Four EM iterations used

• Only external links considered

• Induced graph contained over 11,000 URLs

• Identified community contained 252 URLs


Top ranked SVM pages

1. Vladimir Vapnik's home page (inventor of SVMs)
2. Home page of SVM light, a popular software package
3. A hub site of SVM links
4. Text categorization corpus
5. SVM application list
6. John Platt's SVM page (inventor of SMO)
7. Research interests of Mario Marchand (SVM researcher)
8. SVM workshop home page
9. GMD First SVM publication list
10. Book: Advances in Kernel Methods - SVM Learning
11. B. Schölkopf's SVM page
12. GMD First hub page of SVM researchers
13. Y. Li's links to SVM pages
14. NIPS SVM workshop abstract page
15. GMD First SVM links
16. Learning System Group of ANU
17. NIPS*98 workshop on large margin classifiers
18. Control theory seminar (with links to SVM material)
19. ISIS SVM page
20. Jonathan Howell's home page


Lowest ranked SVM pages

• Ten web pages tied for the lowest score. All were personal home pages of scientists that had at least one SVM publication.

• Other results contained researchers, students, software, books, conferences, workshops, etc.

• A few false positives: NN and data mining.


“Ronald Rivest” community summary

• One seed: http://theory.lcs.mit.edu/~rivest
• Four EM iterations used
• First EM iteration used internal links
• Induced graph contained more than 38,000 URLs
• Identified community contained 150 URLs


“Ronald Rivest” top ranked pages

1. Thomas H. Cormen’s home page
2. The Mathematical Guts of RSA Encryption
3. Charles E. Leiserson’s home page
4. Famous people in the history of Cryptography
5. Cryptography sites
6. Massachusetts Institute of Technology
7. general cryptography links
8. Spektrum der Wissenschaft - Kryptographie
9. Issues in Securing Electronic Commerce over the Internet
10. course based on “Introduction to Algorithms”
11. Recommended Literature for Self-Study
12. Resume of Aske Plaat
13. German article on who's who of the WWW
14. People Ulrik knows
15. A course that uses “Introduction to Algorithms”
16. Bibliography on algorithms
17. an article on encryption
18. German computer science institute
19. security links
20. International PGP FAQ


“Ronald Rivest” lowest ranked

• 23 URLs tied for the lowest rank

• All 23 were personally related to Ronald Rivest or his research

• 11 / 23 were bibliographies of Rivest’s publications


“Rivest” community n-grams

 1  F.rivest          21  F.chaffing_and
 2  F.l_rivest        22  F.shamir
 3  F.ronald_l        23  F.rivest_s
 4  F.ronald          24  F.security
 5  F.cryptography    25  F.public_key
 6  F.rsa             26  F.algorithms
 7  F.ron_rivest      27  F.cormen
 8  T.rivest          28  F.edu_rivest
 9  F.lcs             29  F.adi_shamir
10  T.l_rivest        30  F.cryptography_and
11  T.ronald_l        31  F.mit_edu
12  F.theory_lcs      32  F.computer_science
13  F.encryption      33  F.ron
14  F.lcs_mit         34  F.encrypt
15  F.theory          35  F.mit
16  T.ronald          36  F.leiserson
17  F.chaffing        37  F.adi
18  F.winnowing       38  F.http_theory
19  F.crypto          39  F.adleman
20  F.and_winnowing   40  F.rivest_and


“Rivest” community rules


Web communities summary

• Approximate method gives promising results
• Exact method should be practical as well
• Both methods can be easily generalized
• Applications are numerous and exciting
  – Building a better web directory
  – Focused search engines
  – Filtering undesirable content

• Complements text-based methods


Focused crawling


Focused crawling

• Analyzing the web graph can help locate pages on a specific topic

• Typical crawler considers only the links on the current page

• Graph based focused crawler learns the context of the web graph where relevant pages appear

• Significant performance improvements


Focused crawling


CiteSeer


CiteSeer

• Digital library for scientific literature
• Aims to improve communication and progress in science
• Autonomous Citation Indexing, citation context extraction, distributed error correction, citation graph analysis, etc.
• Helps researchers obtain a better perspective and overview of the literature with citation context and new methods of locating related research
• Lower cost, wider availability, more up-to-date than competing citation indexing services
• Faster, easier, and more complete access to the literature can speed research, better direct research activities, and minimize duplication of effort


CiteSeer

• 575,000 documents
• 6 million citations
• 500,000 daily requests
• 50,000 daily users
• Data for research available on request
• [email protected]


Distribution of articles

SCI ResearchIndex


Citations over time


Citations over time

• Conference papers and technical reports play a very important role in computer science research
  – Citations to very recent research are dominated by these types of articles
• When recent journal papers are cited they are typically “in press” or “to appear”
• The most cited items tend to be journal articles and books
• Conference and technical report citations tend to be replaced with journal and book citations over time
  – May not be a one-to-one mapping


Online or invisible?


Online or invisible?

• Analyzed 119,924 conference articles from DBLP
• Online articles cited 4.5 times more than offline articles on average
• Online articles more highly cited because
  – They are easier to access and thus more visible, or
  – Because higher quality articles are more likely to be made available online?
• Within venues: online articles cited 4.4 times more on average
  – Similar when restricted to top-tier conferences


Persistence of URLs

• Analyzed URLs referenced within articles in CiteSeer

• URLs per article increasing

• Many URLs now invalid
  – 1999 - 23%
  – 1994 - 53%


Persistence of URLs

• 2nd searcher found 80% of URLs the 1st searcher could not find

• Only 3% of URLs could not be found after 2nd searcher


How important are the “lost” URLs?

• With respect to the ability of future research to verify and/or build on the given paper

(Figure panels: After 1st searcher; After 2nd searcher)


Persistence of URLs

• Many URLs now invalid
• Can often relocate information
• No evidence that information very important to future research has been lost yet
• Citation practices suggest more information will be lost in the future unless these practices are improved
• A widespread and easy to use web with invalid links may be more useful than an improved system without invalid links but with added complexity or overhead


Extracting knowledge from the web

• Unprecedented opportunity for automated analysis of a large sample of interests and activity in the world

• Many methods for extracting knowledge from the web
  – Random sampling and analysis of pages and hosts
  – Analysis of link structure and link growth


Extracting knowledge from the web

• Variety of information can be extracted
  – Distribution of interest and activity in different areas
  – Communities related to different topics
  – Competition in different areas
  – Communication between different communities


Collaborators

• Web communities: Gary Flake, Lee Giles, Frans Coetzee

• Link growth modeling: David Pennock, Gary Flake, Lee Giles, Eric Glover

• Hostgraph modeling: Krishna Bharat, Bay-Wei Chang, Matthias Ruhl

• Web page sampling: Allan Heydon, Michael Mitzenmacher, Mark Najork

• Host sampling: Lee Giles
• CiteSeer: Kurt Bollacker, Lee Giles


More information

• http://www.henzinger.com/monika/
• http://www.neci.nec.com/~lawrence/
• http://citeseer.org/