1
Needle in the Haystack: The Technology of Internet Search
Randy H. Katz
The United Microelectronics Corporation Distinguished Professor
Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776 USA
2
Outline
• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google's Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions
3
Search is BIG!
4
And the World is Going Digital
6
Historical Background: The Perfect Storm
ARPANet, 1969
NSFNet, 1985
Commercial Internet, 1995
World Wide Web
Marc Andreessen, NCSA Mosaic, 1993
Jim Clark, Netscape, 1995
Vannevar Bush, "As We May Think", MEMEX, 1945
Ted Nelson, Xanadu, Hypertext, 1965-1990 (Autodesk)
SGML, 1986
Tim Berners-Lee, URL/HTTP/HTML, 1989
Bill Atkinson, Hypercard, 1987
Est. $15.5 billion spent online from Thanksgiving to Christmas 2004, up 28% from 2003
8
Information Tsunami
• Bit: Binary digit, either a 0 or 1
• Byte: 8 bits
  - 1 byte: Single character
  - 10 bytes: A single word
  - 100 bytes: Telegram or punched card
• Kilobyte: 1,000 or 10^3 bytes
  - 1 kilobyte: Very short story
  - 2 kilobytes: Typewritten page
  - 10 kilobytes: Encyclopedia page
  - 50 kilobytes: Compressed document image page
  - 100 kilobytes: Low-res photo
  - 200 kilobytes: Box of punched cards
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
9
Information Tsunami
• Megabyte: 1,000,000 or 10^6 bytes
  - 1 megabyte: Small novel or 3.5in floppy disk
  - 2 megabytes: Hi-res photo
  - 5 megabytes: Complete works of Shakespeare
  - 10 megabytes: Minute of hi-fi sound
  - 100 megabytes: 1 meter of shelved books
  - 500 megabytes: CD-ROM
• Gigabyte: 1,000,000,000 or 10^9 bytes
  - 1 gigabyte: Pickup truck filled with paper
  - 2 gigabytes: Movie on a DVD
  - 50 gigabytes: Floor of books
  - 100 gigabytes: Floor of academic journals
  - 500 gigabytes: Biggest FTP site
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
10
Information Tsunami
• Terabyte: 1,000,000,000,000 or 10^12 bytes
  - 1 terabyte: 50,000 trees made into paper and printed, or 1 day of EOS data
  - 2 terabytes: Academic research library
  - 10 terabytes: Printed collection of the U.S. Library of Congress
  - 50 terabytes: Contents of a large mass storage system
  - 400 terabytes: National Climatic Data Center (NOAA) database
• Petabyte: 1,000,000,000,000,000 or 10^15 bytes
  - 1 petabyte: 3 years of Earth Observing System (EOS) data
  - 2 petabytes: All U.S. academic research libraries
  - 8 petabytes: All information available on the Web
  - 200 petabytes: All printed material (2001)
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
11
Information Tsunami
• Exabyte: 1,000,000,000,000,000,000 or 10^18 bytes
  - 2 exabytes: Total volume of information generated worldwide annually
  - 5 exabytes: All words ever spoken by humans
• Zettabyte: 1,000,000,000,000,000,000,000 or 10^21 bytes
• Yottabyte: 1,000,000,000,000,000,000,000,000 or 10^24 bytes
http://www.sims.berkeley.edu/research/projects/how-much-info/index.html
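The unit ladder on these slides is plain powers-of-1,000 arithmetic. As a minimal sketch (the function name `describe_size` is an illustrative choice, not something from the talk), a byte count can be expressed in the largest unit that fits:

```python
# Decimal (SI) prefixes, matching the powers of ten listed on the slides.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def describe_size(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value at 1 or more."""
    value = float(num_bytes)
    index = 0
    # Each step up the ladder divides by 1,000 (not 1,024).
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    plural = "" if value == 1 else "s"
    return f"{value:g} {UNITS[index]}{plural}"

print(describe_size(5 * 10**6))    # the slide's "complete works of Shakespeare"
print(describe_size(8 * 10**15))   # the slide's "all information available on the Web"
```

The sketch uses decimal prefixes because that is how the slides define them; storage tools that count in powers of 1,024 would give slightly different figures.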
13
Anatomy of a Web Page: Randy's Home Page
• URL: Uniform Resource Locator
• Images
• Text
14
Anatomy of a Web Page: Randy's Home Page
<html>
<head>
<title>Professor Randy Howard Katz University of California Berkeley Computer Science Division Home Page</title>
<meta name="description" content="Home Page of Berkeley Computer Science Professor Randy Howard Katz">
<meta name="keywords" content="Katz Randy Howard Berkeley Professor University California Electrical Engineering Computer Science Department RAID Redundant Arrays Inexpensive Disks SPUR Snoop Wireless Communications Networks Programmable Network Elements">
</head>
<body>
<p><img height="269" src="Randy_2004.jpg" width="182" align="bottom" naturalsizeflag="0"> <img height="269" src="RHK85a.jpg" width="177" align="bottom" naturalsizeflag="0"> </p>
<p><font size="-1">2005 vs. 1985 ... The hair is grayer, but the smirk remains the same!<br><br>
"... Katz, a thin, almost gaunt man with horn-rimmed glasses magnifying sunken eyes. ..."<br>
--George Johnson, WIRED Magazine, (January 2000), page 150.</font></p>
<p><img src="VISIONAR.JPG" align="bottom"> </p>
…
15
• Text
• Images
• Links!
16
Anatomy of a Web Page: Randy's Web Page
<hr align="left">
<h1>Professor Randy H. Katz</h1>
<h3>Electrical Engineering and Computer Science Department</h3>
<p><a href="http://www.umc.com.tw/"><img hspace="6" src="UMCLogo.gif" align="left"> </a><b><font size="+1">The <a href="http://www.umc.com.tw/">United Microelectronics Corporation</a> Distinguished Professor</font></b></p>
<p><font size="-1"><br clear="left">
Ph.D., University of California, Berkeley, 1980.<br>
M.S., University of California, Berkeley, 1978.<br>
A.B., Cornell University, 1976.<br>
</font></p>
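The three ingredients called out on these slides (text, images, links) can be pulled out of a page mechanically. The following is a rough sketch using Python's standard `html.parser` module; the class name `PageAnatomy` and the trimmed `SAMPLE` markup are illustrative stand-ins, not the full home page shown above.

```python
from html.parser import HTMLParser

class PageAnatomy(HTMLParser):
    """Collects the title, named meta tags, image sources, and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.images = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Shortened, hypothetical excerpt in the spirit of the slide's markup.
SAMPLE = """<html><head><title>Professor Randy Howard Katz</title>
<meta name="keywords" content="Katz Berkeley RAID">
</head><body>
<img src="Randy_2004.jpg">
<a href="http://www.umc.com.tw/"><img src="UMCLogo.gif"></a>
</body></html>"""

page = PageAnatomy()
page.feed(SAMPLE)
print(page.title)
print(page.meta["keywords"].split())
print(page.images)
print(page.links)
```

The title and keywords are the kind of text an indexer might weigh, and the collected links are what a crawler would follow next.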
18
Anatomy of Web Access
[Diagram, steps (1) to (4). Labels: Web Browser; link URL http://www.umc.com.tw/; Naming System (DNS): name-to-address mapping; IP address; Web Server (Taiwan); Web page in HTML.]
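One plausible reading of the numbered steps is: the browser takes the link URL, asks the naming system for the host's address, receives an IP address, and then requests the page from the web server. The sketch below walks through that sequence without touching the network. `FAKE_DNS`, `resolve`, and `build_request` are hypothetical names, and the address comes from a range reserved for documentation, not from the real site.

```python
from urllib.parse import urlsplit

# Stand-in for DNS: a hypothetical name-to-address table. The address is from
# the 192.0.2.0/24 documentation range, not the real address of the site.
FAKE_DNS = {"www.umc.com.tw": "192.0.2.10"}

def resolve(hostname: str) -> str:
    """Name-to-address mapping (simulated; a real browser queries DNS)."""
    return FAKE_DNS[hostname]

def build_request(url: str) -> tuple[str, str]:
    """Return (server IP address, HTTP request text) for a link URL."""
    parts = urlsplit(url)
    ip_address = resolve(parts.hostname)
    path = parts.path or "/"
    # The request names the host again so one server can serve many sites.
    request = f"GET {path} HTTP/1.1\r\nHost: {parts.hostname}\r\n\r\n"
    return ip_address, request

ip, request = build_request("http://www.umc.com.tw/")
print(ip)
print(request.splitlines()[0])
```

A real browser would then open a connection to that address, send the request, and render the HTML that comes back.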
19
Anatomy of Web Access: Content Caching
[Diagram, steps (5) to (8). Labels: Web Browser; link URL …/English/about/index.asp; Naming System (DNS): origin IP; Content Network DNS: edge cache IP; Edge Cache; Origin Web Server; Content Distribution; Web page in HTML. Locations shown: Taiwan and San Jose.]
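The idea behind the edge cache can be shown with a toy model: the browser is pointed at a nearby cache, which contacts the origin server only when it does not already hold the page. The class names and the `/example/page.html` path below are made up for illustration, and the sketch leaves out expiry and invalidation, which a real content network would need.

```python
class OriginServer:
    """Hypothetical origin server holding the authoritative pages."""

    def __init__(self, pages):
        self.pages = pages
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return self.pages[path]

class EdgeCache:
    """Serves pages close to the browser, going to the origin only on a miss."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def fetch(self, path):
        if path not in self.store:
            # Cache miss: one trip to the distant origin, then keep a copy.
            self.store[path] = self.origin.fetch(path)
        # Cache hit (or freshly filled entry): served locally.
        return self.store[path]

origin = OriginServer({"/example/page.html": "<html>Example</html>"})
edge = EdgeCache(origin)
first = edge.fetch("/example/page.html")
second = edge.fetch("/example/page.html")
print(first == second, origin.requests_served)
```

Both requests return the same page, but only the first one reaches the origin.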
21
Challenges of Search
• How to find all the pages on the Web?
• How to order the pages by relevance?
• How to make the content on those pages searchable?
• How to keep it all up to date?
• Web Crawlers/SpiderBots
  - Network software executing in parallel that follows links in the Web to find content
  - Web pages "scraped" for more links to follow
  - Web revisited on the order of once every two to three days
• Indexers
  - Web pages "scraped" for search terms to build indexes
  - (Google) Page rank algorithm: order a page within the index based (roughly) on how many pages refer to it
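The crawler and indexer roles described above can be sketched over a tiny in-memory "web". Everything here is hypothetical: the `WEB` table and its `*.example` URLs stand in for real pages, and the crawl runs in a single thread, whereas the slide describes crawlers executing in parallel.

```python
from collections import deque, defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("search engines crawl the web", ["b.example/", "c.example/"]),
    "b.example/": ("crawlers follow links", ["c.example/"]),
    "c.example/": ("indexes map terms to pages", ["a.example/"]),
    "d.example/": ("nobody links here", []),
}

def crawl(seed):
    """Breadth-first crawl: visit the seed, then every page reachable by links."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        _, links = WEB[url]
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def build_index(urls):
    """Inverted index: search term -> set of URLs whose text contains it."""
    index = defaultdict(set)
    for url in urls:
        text, _ = WEB[url]
        for term in text.lower().split():
            index[term].add(url)
    return index

pages = crawl("a.example/")
index = build_index(pages)
print(pages)
print(sorted(index["links"]))
```

Note that `d.example/` never appears in the results: no crawled page links to it, so a link-following crawler cannot find it. That is the "how to find all the pages" problem in miniature.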
22
Quick (and Incomplete) History of Search Engines
[Timeline figure; axis: Pre-Web, 1993, 1995, 1997, 1999, 2001, 2003, 2005. Entries in the order they appear:]
• UMinn: Veronica & Archie services for gopher & ftp
• MIT: Wandex / WWW Wanderer
• Aliweb
• CMU: Lycos
• 1st commercial search engine
• Stanford: Yahoo! directories
• Battle for popularity: Webcrawler (UWash), HotBot (Wired), Excite (Stanford), Infoseek (ABC), Inktomi (Berkeley), AltaVista (DEC)
• Google (Stanford)
• Yahoo! acquires Inktomi
• Yahoo! acquires Overture (AlltheWeb, AltaVista)
• Yahoo! deploys joint technology
• a9.com, AlltheWeb, Ask Jeeves, Clusty, Gigablast, Ez2Find, Teoma, WiseNut, GoHook, Walhello, Kartoo
23
Search Challenges and Issues
• Web growing faster than search engines can index
• Web pages updated frequently, forcing frequent revisits
• Keyword-only searches result in many false positives
• Difficult to index dynamically generated sites: the so-called "invisible web"
• Some search engines order results by financial "placement" considerations rather than relevance
• Some sites trick search engines into displaying them first for some keywords, which pollutes the search results and pushes more relevant links down among the results
25
Page Ranking Algorithms
• Web page relevancy
  - Many hits: how to ensure the best/most relevant web pages are presented first in answer to a search
• Location and Frequency of Keywords
  - Index terms in page title raise its relevance for that term
  - Keywords near "top" of page more relevant than bottom
  - High keyword frequency boosts relevance
• If search engine strategy is known, page developers will "game" the strategy to get their pages ranked higher
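To see how location and frequency could combine into a score, and why such a scheme is easy to game, here is a toy scoring function. The weights (`title_bonus`, `top_fraction`, and the doubling for early occurrences) are invented for illustration and do not come from any real search engine.

```python
def keyword_score(term, title, body, title_bonus=3.0, top_fraction=0.25):
    """Toy relevance score from keyword location and frequency.

    title_bonus and top_fraction are made-up illustrative weights.
    """
    term = term.lower()
    words = body.lower().split()
    score = 0.0
    if term in title.lower().split():
        score += title_bonus                  # term appears in the page title
    top_cutoff = max(1, int(len(words) * top_fraction))
    for position, word in enumerate(words):
        if word == term:
            # Occurrences near the top count double; every occurrence counts.
            score += 2.0 if position < top_cutoff else 1.0
    return score

early = keyword_score("raid", "RAID papers", "raid arrays store data on many disks")
late = keyword_score("raid", "Storage papers", "many disks store data in a raid")
stuffed = keyword_score("raid", "raid", "raid " * 50)
print(early, late, stuffed)
```

The `stuffed` page contains nothing but the keyword repeated, yet it outscores both genuine pages. That is the gaming problem the last bullet warns about, and part of the motivation for link-based ranking.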
26
Google's Page Rank Algorithm
• Which is the most important page?
27
Google's Page Rank Algorithm
• Googlese from their web page:
  - PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
28
Google Page Rank Algorithm
• Basic idea:
  - Page's rank determined by the number of links to the page (also known as citations)
  - If citing page is more important (has a high page rank/authority page) then the pages it cites are more important
  - If citing page has many links, then each cited page gains less importance from it (normalize for number of links on citing page)
PR(P) is the page rank of page P; T_1, …, T_n are the pages that cite P; C(T) is the number of links out of page T; d is a "decay factor" (called the damping factor in the original paper), e.g., 0.85. Then:
$$PR(P) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
• See http://www-db.stanford.edu/~backrub/google.html
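Because each page's rank depends on the ranks of the pages citing it, the formula is usually evaluated by repeated substitution until the values settle. The sketch below applies the slide's formula directly to a hypothetical four-page web. The `LINKS` graph, the starting rank of 1.0, and the fixed iteration count are assumptions for illustration, not a description of Google's production system.

```python
def page_rank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T)/C(T)) over the pages T citing P.

    links maps each page to the list of pages it links to; C(T) is len(links[T]).
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each citing page passes on its rank, split across its outgoing links.
            incoming = sum(
                ranks[citing] / len(targets)
                for citing, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks

# Hypothetical four-page web: A, B and D all cite C; C cites only A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(LINKS)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 3))
```

In this graph C comes out on top because three pages cite it, A benefits from being cited by the highly ranked C, and D, which nobody cites, stays at the floor value of 1 - d.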
29
Google Conceptual Architecture
30
Google Server Architecture
• Index servers: search terms partitioned and mapped to doc lists
• Intersect to find document list, sort by page rank
• Document IDs used to extract text from Doc Servers
• Over 100,000 processors (and growing) in Googleplex
[Diagram labels: Google Web Server; Spell Checker; Ad Server; Index Server; Doc Server (repeated many times).]
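The first three bullets describe a small pipeline: look up each search term's document list, intersect the lists, sort the survivors by page rank, and fetch their text by document ID. A single-process sketch of that flow follows; the `INDEX`, `PAGE_RANK`, and `DOCS` tables are invented sample data, and nothing here models the partitioning across many servers.

```python
# Hypothetical posting lists (search term -> document IDs) and page ranks.
INDEX = {
    "internet": {1, 2, 3, 5},
    "search": {2, 3, 4, 5},
    "history": {3, 5, 6},
}
PAGE_RANK = {1: 0.4, 2: 1.2, 3: 0.9, 4: 2.0, 5: 1.7, 6: 0.3}
# Stand-in for the doc servers: document ID -> stored text.
DOCS = {doc_id: f"text of document {doc_id}" for doc_id in PAGE_RANK}

def search(query):
    """Intersect the doc lists for every query term, then sort by page rank."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_lists = [INDEX.get(term, set()) for term in terms]
    matches = set.intersection(*doc_lists)
    ranked = sorted(matches, key=lambda doc_id: PAGE_RANK[doc_id], reverse=True)
    return [(doc_id, DOCS[doc_id]) for doc_id in ranked]

print([doc_id for doc_id, _ in search("internet search")])
print([doc_id for doc_id, _ in search("internet search history")])
```

Document 4 has the highest page rank in the sample data but is missing from the first query's results because it does not contain "internet": ranking only orders the documents that match every term.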
32
Fun and Games
• Google Scholar
• Googling Someone
• Google News
• Comparison Shopping
• Google Whacks
33
Google Scholar
34
Google Randy
35
Google Randy Katz "Google Index"
Advertising Placement
36
Google News
37
Comparison Shopping
38
elgooG
39
Google Whacks
40
Business Model: Ad Placement and Click-Thru
Old data (2002): Google is now market leader in ad revenue
2004 revenue through 9/30/04: $2.1B
42
Top 10 Search Engines
10. DMOZ.org
9. Alltheweb.com
8. KartOO.com
7. MSN.com
6. Dogpile.com
5. AskJeeves.com
4. About.com
3. Yahoo.com
2. Vivisimo.com
1. Google.com
43
Clustering
44
Google Video Search
45
Google Video Search
46
Amazon's A9
47
Amazon's A9
48
A9's Yellow Pages
49
A9's Yellow Pages
50
Innovations Now and Yet to Come
• Index ever larger portions of the Web, even beyond traditional web pages, e.g., video
• Better quality/higher relevance searches
• Better presentation of results, e.g., clustering, site information
• Better exploitation of semantic relationships for improved page ranking, more personalization, e.g., user's zip code
• More services (Web, news groups, blogs, comparison shopping, video/audio, yellow pages, etc.)
• Integrate with desktop machine
51
Parting Thoughts
52
Parting Thoughts
53
"Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?"
T.S. Eliot, "Choruses from The Rock", Selected Poems, NY: Harvest / Harcourt, 1962, p. 107.
54
Needle in the Haystack: The Technology of Internet Search
Thanks for Your Patience & Attention!
Questions?