Information Retrieval (9) Prof. Dragomir R. Radev
IR Winter 2010
…14. Webometrics
The Bow-tie model…
Brief history of the Web
• FTP/Gopher
• WWW (1989)
• Archie (1990)
• Mosaic (1993)
• Webcrawler (1994)
• Lycos (1994)
• Yahoo! (1994)
• Google (1998)
Size
• The Web is the largest repository of data and it grows exponentially.
  – 320 million Web pages [Lawrence & Giles 1998]
  – 800 million Web pages, 15 TB [Lawrence & Giles 1999]
  – 20 billion Web pages indexed [now]
• Amount of data: roughly 200 TB [Lyman et al. 2003]
Zipfian properties
• In-degree
• Out-degree
• Visits to a page
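Zipfian (power-law) distributions appear as straight lines on a log-log rank plot. A minimal sketch with synthetic data (the constants here are illustrative, not real Web measurements):

```python
import math

# Synthetic Zipfian in-degrees: the page at rank r receives C / r**s
# links (exponent s = 1, C = 10000 are made-up illustration values).
C, s = 10_000, 1.0
ranks = range(1, 101)
indegree = [C / r ** s for r in ranks]

# On a log-log scale a power law is a straight line; estimate its
# slope by ordinary least squares on (log r, log in-degree).
xs = [math.log(r) for r in ranks]
ys = [math.log(d) for d in indegree]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

print(round(slope, 3))  # -1.0, i.e. the (negated) Zipf exponent s
```

The same log-log regression applied to real in-degree, out-degree, or page-visit counts is the usual quick check for Zipfian behavior.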
Bow-tie model of the Web
• SCC (strongly connected component): 56 M pages
• IN: 44 M
• OUT: 44 M
• TENDRILS: 44 M
• DISC (disconnected): 17 M
• 24% of pages reachable from a given page
(Broder et al. WWW 2000; Dill et al. VLDB 2001)
Measuring the size of the web
• Using extrapolation methods
• Random queries and their coverage by different search engines
• Overlap between search engines
• HTTP requests to random IP addresses
Bharat and Broder 1998
• Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them
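The overlap method is essentially capture-recapture estimation: if engine A indexes n_A pages, engine B indexes n_B, and the two overlap in |A ∩ B| pages, the total indexable Web is about n_A · n_B / |A ∩ B|. A toy sketch (the numbers below are hypothetical, not the 1997 figures):

```python
def web_size_estimate(n_a, n_b, overlap):
    """Capture-recapture estimate of total population size:
    N ~ n_a * n_b / |A intersect B|, assuming the two engines
    index independent random samples of the Web."""
    return n_a * n_b / overlap

# Hypothetical example: two engines index 100M and 120M pages,
# and 60M pages are indexed by both.
print(web_size_estimate(100e6, 120e6, 60e6))  # 200000000.0
```

The independence assumption is the weak point: engines preferentially index the same popular pages, so estimates of this kind are lower bounds.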
Example (from Bharat&Broder)
A similar approach by Lawrence and Giles (1998) yields an estimate of 320M pages.
What makes Web IR different?
• Much bigger
• No fixed document collection
• Users: non-human users, a varied user base, miscellaneous user needs
• Dynamic content
• Evolving content
• Spam
• Infinite size – the size is whatever can be indexed!
IR Winter 2010
…15. Crawling the Web
– Hypertext retrieval & Web-based IR
– Document closures
– Focused crawling…
Web crawling
• The HTTP/HTML protocols
• Following hyperlinks
• Some problems:
  – Link extraction
  – Link normalization
  – Robot exclusion
  – Loops
  – Spider traps
  – Server overload
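Link normalization can be sketched with Python's standard library: resolve each extracted link against the URL of the page it appears on and strip the fragment, so that equivalent links compare (and hash) identically. The URLs below are hypothetical:

```python
from urllib.parse import urljoin, urldefrag

def normalize(base, href):
    """Resolve a (possibly relative) link against its page URL and
    drop the #fragment, so equivalent links map to one canonical URL."""
    url, _fragment = urldefrag(urljoin(base, href))
    return url

base = "http://www.umich.edu/a/b/page.html"
print(normalize(base, "../c/other.html#top"))
# http://www.umich.edu/a/c/other.html
print(normalize(base, "#section2"))
# http://www.umich.edu/a/b/page.html
```

Keeping a set of normalized URLs already seen is what prevents the crawler from revisiting pages and falling into trivial loops.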
Example
• U-M’s root robots.txt file: http://www.umich.edu/robots.txt
  – User-agent: *
  – Disallow: /~websvcs/projects/
  – Disallow: /%7Ewebsvcs/projects/
  – Disallow: /~homepage/
  – Disallow: /%7Ehomepage/
  – Disallow: /~smartgl/
  – Disallow: /%7Esmartgl/
  – Disallow: /~gateway/
  – Disallow: /%7Egateway/
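A crawler can apply such rules with Python's `urllib.robotparser`; a minimal sketch that parses a subset of the rules above from a string instead of fetching them over HTTP:

```python
import urllib.robotparser

# A few of the U-M rules, supplied as text rather than fetched live.
rules = """\
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /~homepage/
Disallow: /~gateway/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://www.umich.edu/~homepage/x.html"))  # False
print(rp.can_fetch("*", "http://www.umich.edu/index.html"))        # True
```

A polite crawler checks `can_fetch` before every request and skips any URL the site's rules disallow.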
Example crawler
• E.g., poacher
  – http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
  – Included in clairlib
```perl
&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version
# number for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
```
Focused crawling
• Topical locality
  – Pages that are linked are similar in content (and vice versa: Davison 00; Menczer 02, 04; Radev et al. 04)
• The radius-1 hypothesis
  – Given that page i is relevant to a query and that page i points to page j, page j is also likely to be relevant (at least, more so than a random Web page)
• Focused crawling
  – Keeping a priority queue of the most relevant pages
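The priority-queue idea can be sketched as a best-first search: always expand the frontier page with the highest relevance score. The link graph and the relevance scorer below are hypothetical stand-ins, not any particular published crawler:

```python
import heapq

def focused_crawl(seed, links, relevance, budget=3):
    """Best-first crawl: repeatedly pop the highest-scoring frontier
    page (heapq is a min-heap, so scores are negated)."""
    frontier = [(-relevance(seed), seed)]
    visited, order = set(), []
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for out in links.get(page, []):   # radius-1 hypothesis: score
            if out not in visited:        # the neighbours of relevant
                heapq.heappush(frontier, (-relevance(out), out))  # pages
    return order

# Toy link graph and a toy scorer (occurrences of a topic word in the URL).
links = {"seed": ["a/physics", "b/sports"], "a/physics": ["c/physics-dept"]}
score = lambda u: u.count("physics")
print(focused_crawl("seed", links, score))
# ['seed', 'a/physics', 'c/physics-dept']
```

Note how "b/sports" stays in the queue: the crawler spends its budget on the on-topic branch first, which is the whole point of focusing.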
Challenges in indexing the web
• Page importance varies a lot
• Anchor text
• User modeling
• Detecting duplicates
• Dealing with spam (content-based and link-based)
Duplicate detection
• Shingles, e.g. for "TO BE OR NOT TO BE":
  – TO BE OR
  – BE OR NOT
  – OR NOT TO
  – NOT TO BE
• Then use the Jaccard coefficient (size of intersection / size of union) to determine similarity
• Hashing
• Shingling (separate lecture)
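The shingle-and-Jaccard computation above can be sketched directly:

```python
def shingles(text, k=3):
    """The set of k-word shingles of a text (k = 3, as on the slide)."""
    words = text.upper().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of intersection over size of union."""
    return len(a & b) / len(a | b)

s1 = shingles("to be or not to be")   # the four shingles listed above
s2 = shingles("to be or not to go")   # a near-duplicate
print(jaccard(s1, s1))                # 1.0
print(jaccard(s1, s2))                # 0.6
```

In practice each shingle is hashed to an integer and only a small sketch of the hash set is kept per document, which is what makes the comparison feasible at Web scale.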
Document closures for Q&A
[Diagram: pages and links (P–L–P) connecting the query terms "capital" and "spain" to the answer "Madrid".]
Document closures for IR
[Diagram: pages and links (P–L–P) connecting "Physics", "Physics Department", "University of Michigan", and "Michigan".]
The link-content hypothesis
• Topical locality: a page is similar to the page that points to it.
• Davison (TF*IDF, 100K pages):
  – 0.31 same domain
  – 0.23 linked pages
  – 0.19 sibling
  – 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979; Chakrabarti et al. WWW 1999; Davison SIGIR 2000; Menczer 2001
[Equation garbled in transcription: Menczer's non-linear least-squares fit of content similarity as an exponentially decaying function of link distance, with fitted parameters α1 = 1.8, α2 = 0.6 and an asymptote near 0.03.]
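Davison's TF*IDF similarity figures come from comparing term-weight vectors with the cosine measure; a minimal sketch, using raw term counts as a stand-in for full TF*IDF weights (the example pages are hypothetical):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-weight dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Two toy "pages": with real TF*IDF, each count would be scaled by
# the term's inverse document frequency before comparing.
p1 = Counter("physics department university michigan".split())
p2 = Counter("physics department ann arbor".split())
print(cosine(p1, p2))  # 0.5 (two of four terms shared)
```

Averaging such scores over many page pairs (same-domain, linked, sibling, random) yields exactly the kind of ranking reported above.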