i206: Lecture 22: Web Search Engines; Recap Distributed Systems

Marti Hearst, Spring 2012


How Search Engines Work

• There are MANY issues
• I’m only giving the basics

Slide adapted from Lew & Davis

How Search Engines Work

Three main parts:

i. Gather the contents of all web pages (using a program called a crawler or spider)
ii. Organize the contents of the pages in a way that allows efficient retrieval (indexing)
iii. Take in a query, determine which pages match, and show the results (ranking and display of results)

Standard Web Search Engine Architecture

[Diagram with boxes labeled: Crawl the web; Check for duplicates, store the …; Create an inverted index; Inverted index; Search engine servers; Show results to user]


Search engine architecture, from “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Brin & Page, 1998. http://dbpubs.stanford.edu:8090/pub/1998-8

Slide adapted from Lew & Davis

Spiders or crawlers

• How to find web pages to visit and copy?
  – Can start with a list of domain names, visit the home pages there.
  – Look at the hyperlinks on the home page, and follow those links to more pages.
• Use HTTP commands to GET the pages
  – Keep a list of URLs visited, and those still to be visited.
  – Each time the program loads in a new HTML page, add the links in that page to the list to be crawled.
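
The bookkeeping described on this slide (a list of URLs visited and a list still to be visited) can be sketched in Python. This is a minimal illustration, not a production crawler. The names `crawl`, `LinkExtractor`, and `FAKE_WEB` are hypothetical, and the `fetch` callback reads from a tiny made-up in-memory "web" instead of issuing real HTTP GET requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    to_visit = deque(seeds)   # URLs still to be visited
    seen = set(seeds)         # every URL ever queued, so link cycles do no harm
    pages = {}                # url -> HTML of the pages fetched so far
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:      # server down or page missing: skip it
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return pages


# A made-up three-page site; a real crawler would do an HTTP GET here instead.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="/a.html">A</a> <a href="/missing.html">?</a>',
}

crawled = crawl(["http://example.org/"], FAKE_WEB.get)
print(sorted(crawled))
```

The `seen` set is what keeps the crawler from looping when pages link back to each other, and the `None` check stands in for handling servers that are down.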

Slide adapted from Lew & Davis

Spider behavior varies

• Parts of a web page that are indexed
• How deeply a site is indexed
• Types of files indexed
• How frequently the site is spidered

Four Laws of Crawling

• A Crawler must show identification
• A Crawler must obey the robots exclusion standard
  http://www.robotstxt.org/wc/norobots.html
• A Crawler must not hog resources
• A Crawler must report errors
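
As a rough sketch of the first two laws, Python's standard `urllib.robotparser` module can check a URL against robots exclusion rules. The user-agent name and the rules below are invented for illustration. A real crawler would download the rules from the site's own `/robots.txt` rather than hard-code them.

```python
from urllib import robotparser

# Hypothetical name the crawler uses to identify itself.
USER_AGENT = "i206-demo-crawler"

# Example rules, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if the robots exclusion rules let our crawler fetch this URL."""
    return rules.can_fetch(USER_AGENT, url)


print(allowed("http://example.org/index.html"))
print(allowed("http://example.org/private/grades.html"))
```

A polite crawler would call a check like `allowed` before every fetch and would also pause between requests to the same server so that it does not hog resources.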

The Internet Is Enormous

Image from http://www.nature.com/nature/webmatters/tomog/tomfigs/fig1.html

Lots of tricky aspects

• Servers are often down or slow
• Hyperlinks can get the crawler into cycles
• Some websites have junk in the web pages
• Now many pages have dynamic content
  – Javascript
  – The “hidden” web
  – E.g., schedule.berkeley.edu
    • You don’t see the course schedules until you run a query.
• The web is HUGE

• Need to keep checking pages
  – Pages change
    • At different frequencies
    • Pages are removed
  – Many search engines cache the pages (store a copy on their own servers) to save time/effort
    • But pages that change a lot stymie this strategy

Slide adapted from Lew & Davis

What really gets crawled?

• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the “live” Web, but the search engine’s index
• Not the “Deep Web”
• Mostly HTML pages but other file types too: PDF, Word, PPT, etc.

Slide adapted from Lew & Davis

ii. Index (the database)

Record information about each page
• List of words
  – In the title?
  – How far down in the page?
  – Was the word in boldface?
• URLs of pages pointing to this one
• Anchor text on pages pointing to this one

The importance of anchor text

<a href=http://courses.ischool…>i141 </a>

<a href=http://courses.ischool…>A terrific course on search engines </a>

The anchor text summarizes what the website is about.
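
One way an indexer might collect anchor text for the page being pointed to is sketched below. The class and function names are hypothetical, and the target URL `http://example.org/course` is a stand-in, since the URL on the slide is cut off.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Records (target URL, anchor text) for every link in a page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> tag we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_text_index(pages):
    """Map each link target to the anchor texts other pages use for it."""
    index = defaultdict(list)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        for target, text in collector.pairs:
            index[target].append(text)
    return index


# Two made-up pages that link to the same (stand-in) course URL.
PAGES = [
    '<p>See <a href="http://example.org/course">i141 </a></p>',
    '<p><a href="http://example.org/course">A terrific course on search engines </a></p>',
]

anchors = anchor_text_index(PAGES)
print(anchors["http://example.org/course"])
```

The point of the sketch is that the text is stored under the page being linked to, not under the page that contains the link.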

Inverted Index

• How to store the words for fast lookup
• Basic steps:
  – Make a “dictionary” of all the words in all of the web pages
  – For each word, list all the documents it occurs in.
  – Often omit very common words
    • “stop words”
  – Sometimes stem the words
    • (also called morphological analysis)
    • cats -> cat
    • running -> run
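
The basic steps above can be sketched in a few lines of Python. The stop-word list is a tiny sample, and the `STEMS` lookup table is a toy stand-in for a real stemmer that covers only the two examples from the slide. All names and documents here are made up for illustration.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "are", "is", "of", "in", "on"}  # tiny sample list
STEMS = {"cats": "cat", "running": "run"}   # toy lookup standing in for a real stemmer


def tokenize(text):
    """Lowercase, split into words, drop stop words, apply the toy stemmer."""
    words = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(w, w) for w in words if w not in STOP_WORDS]


def build_inverted_index(docs):
    """For each word, record the set of documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Documents that contain every query word."""
    postings = [index.get(w, set()) for w in tokenize(query)]
    return set.intersection(*postings) if postings else set()


DOCS = {
    1: "The cat is running",
    2: "Cats and dogs",
    3: "Dogs are running in the park",
}

index = build_inverted_index(DOCS)
print(sorted(index["cat"]))
print(sorted(search(index, "running dogs")))
```

Because "cats" is stemmed to "cat", documents 1 and 2 end up in the same posting list, which is the effect stemming is meant to have.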

Inverted Index Example

Image from http://developer.apple.com/documentation/UserExperience/Conceptual/SearchKitConcepts/searchKit_basics/chapter_2_section_2.html

Inverted Index

• In reality, this index is HUGE
• Need to store the contents across many machines
• Need to do optimization tricks to make lookup fast


Query Serving Architecture

• Index divided into segments each served by a node

• Each row of nodes replicated for query load

• Query integrator distributes query and merges results

• Front end creates a HTML page with the query results

[Diagram: Load Balancer above a 4 × N grid of nodes]

Node1,1 Node1,2 Node1,3 … Node1,N
Node2,1 Node2,2 Node2,3 … Node2,N
Node3,1 Node3,2 Node3,3 … Node3,N
Node4,1 Node4,2 Node4,3 … Node4,N
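
The query integrator's job (send the query to every index segment, then merge the results) might look roughly like the sketch below. The data layout, scores, and function names are assumptions made for illustration. Replication of rows and the load balancer are not modeled.

```python
import heapq


def search_segment(segment, word):
    """One node: return (score, doc_id) pairs for documents in its index segment."""
    return segment.get(word, [])


def integrate(segments, word, k=3):
    """Query integrator: send the query to every segment, merge the best k hits."""
    hits = []
    for segment in segments:   # in a real system these calls would run in parallel
        hits.extend(search_segment(segment, word))
    return heapq.nlargest(k, hits)


# Hypothetical data: each segment maps a word to (score, doc_id) pairs.
SEGMENTS = [
    {"hadoop": [(0.9, "doc1"), (0.2, "doc4")]},
    {"hadoop": [(0.7, "doc7")]},
    {"crawler": [(0.5, "doc9")]},
]

top_hits = integrate(SEGMENTS, "hadoop", k=2)
print(top_hits)
```

Each segment only knows about its own documents, so the merge step is what produces a single ranked list for the front end to render.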









Slide adapted from Lew & Davis

Results ranking

• Search engine receives a query, then
• Looks up the words in the index, retrieves many documents, then
• Rank orders the pages and extracts “snippets” or summaries containing query words.
  – In the early days: statistical
  – Next: implicit AND
  – Now: much more complex
• These are complex and highly guarded algorithms unique to each search engine.

Slide adapted from Lew & Davis

Some ranking criteria

• For a given candidate result page, use:
  – Number of matching query words in the page
  – Proximity of matching words to one another
  – Location of terms within the page
  – Location of terms within tags e.g. <title>, <h1>, link text, body text
  – Anchor text on pages pointing to this one
  – Frequency of terms on the page and in general
  – Link analysis of which pages point to this one
  – (Sometimes) Click-through analysis: how often the page is clicked on
  – How “fresh” is the page
• Complex formulae combine these together.
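
To make one of these criteria concrete, the sketch below scores pages by "frequency of terms on the page and in general" using TF-IDF. TF-IDF is a standard textbook weighting that the slide does not name. It is swapped in here purely as an illustration and is not a claim about what any search engine uses. The documents and function name are made up.

```python
import math
from collections import Counter


def tf_idf_scores(query_words, docs):
    """Score each document by summing (term count) * (inverse document frequency)."""
    n_docs = len(docs)
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, words in counts.items():
        score = 0.0
        for w in query_words:
            df = sum(1 for c in counts.values() if w in c)   # documents containing w
            if df:
                # Words that are rare across the collection get a larger weight.
                score += words[w] * math.log(n_docs / df)
        scores[doc_id] = score
    return scores


DOCS = {
    "a": "search engines crawl the web",
    "b": "the web is huge",
    "c": "search search search",
}

scores = tf_idf_scores(["search", "web"], DOCS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection the page that simply repeats "search" three times comes out on top, which hints at why engines combine term frequency with many other signals rather than relying on it alone.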

Measuring Importance of Linking

• PageRank Algorithm
  – Idea: important pages are pointed to by other important pages
  – Method:
    • Each link from one page to another is counted as a “vote” for the destination page
    • But the importance of the starting page also influences the importance of the destination page.
    • And those pages’ scores, in turn, depend on those linking to them.

Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188

Measuring Importance of Linking

• Example: each page starts with 100 points.
• Each page’s score is recalculated by adding up the score from each incoming link.
  – This is the score of the linking page divided by the number of outgoing links it has.
  – E.g., the page in green has 2 outgoing links and so its “points” are shared evenly by the 2 pages it links to.
• Keep repeating the score updates until no more changes.
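
The update rule on this slide can be written out directly. The sketch below follows the simplified version described here: 100 starting points, scores shared evenly over outgoing links, and repetition until the scores stop changing. The published PageRank algorithm also includes a damping factor, which is left out to match the slide. The three-page graph is made up, and the code assumes every page has at least one outgoing link and only links to pages in the graph.

```python
def simple_pagerank(links, start=100.0, tol=1e-9, max_rounds=1000):
    """links maps each page to the list of pages it links to."""
    scores = {page: start for page in links}
    for _ in range(max_rounds):
        new_scores = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            share = scores[page] / len(outgoing)   # points split evenly over out-links
            for target in outgoing:
                new_scores[target] += share
        # Stop once no score moved by more than the tolerance.
        if max(abs(new_scores[p] - scores[p]) for p in links) < tol:
            return new_scores
        scores = new_scores
    return scores


# Made-up graph: A links to B and C, B links to C, C links back to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = simple_pagerank(LINKS)
print({page: round(score, 2) for page, score in ranks.items()})
```

In this graph the scores settle at 120 for A, 60 for B, and 120 for C. The total stays at 300 points throughout, because every page hands out exactly the points it holds.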

Image and explanation from http://www.economist.com/science/tq/displayStory.cfm?story_id=3172188

Slide adapted from Manning, Raghavan, & Schuetze

Bad Actors on the Web (Spam)

• Cloaking
  – Serve fake content to search engine robot
  – DNS cloaking: Switch IP address. Impersonate
• Doorway pages
  – Pages optimized for a single keyword that re-direct to the real target page
• Keyword Spam
  – Misleading meta-keywords, excessive repetition of a term, fake “anchor text”
  – Hidden text with colors, CSS tricks, etc.
• Link spamming
  – Mutual admiration societies, hidden links, awards
  – Domain flooding: numerous domains that point or re-direct to a target page
• Robots
  – Fake click stream
  – Fake query stream
  – Millions of submissions via Add-Url

[Diagram label: Is this a Search Engine spider?]






Meta-Keywords = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”

Inter-Related Topics

• Networked Systems
• Distributed Systems
  – Web search engines
  – Tools like Hadoop
• Web security
• Cryptography (next time)

Motivation for Hadoop

• How do you scale up applications?
  – Run jobs processing 100’s of terabytes of data
  – Takes 11 days to read on 1 computer
• Need lots of cheap computers
  – Fixes speed problem (15 minutes on 1000 computers), but…
  – Reliability problems
    • In large clusters, computers fail every day
    • Cluster size is not fixed
• Need common infrastructure
  – Must be efficient and reliable
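
The "11 days" and "15 minutes" figures are consistent with each other under perfect parallel speed-up, as this back-of-the-envelope check shows. The 100 TB data size is an assumption picked for the calculation, since the slide only says "100's of terabytes".

```python
SECONDS_PER_DAY = 24 * 60 * 60

days_on_one_machine = 11      # from the slide
machines = 1000               # from the slide
data_bytes = 100 * 10**12     # assumption: exactly 100 TB

# Perfect parallel speed-up: each machine reads 1/1000 of the data.
minutes_on_cluster = days_on_one_machine * 24 * 60 / machines

# Read rate one machine would need in order to finish 100 TB in 11 days.
mb_per_second = data_bytes / (days_on_one_machine * SECONDS_PER_DAY) / 10**6

print(round(minutes_on_cluster, 1), "minutes on the cluster")
print(round(mb_per_second), "MB/s per machine")
```

This works out to about 15.8 minutes on 1000 machines, and to roughly 105 MB/s per machine under the 100 TB assumption.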

Slide adapted from Xiaoxiao Shi, Guan Wang

Query Serving Architecture

• Index divided into segments each served by a node

• Each row of nodes replicated for query load

• Query integrator distributes query and merges results

• Front end creates a HTML page with the query results

[Diagram: Load Balancer above a grid of nodes, Node1,1 … Node4,N]









• Open Source Apache Project
• Hadoop Core includes:
  – Distributed File System - distributes data
  – Map/Reduce - distributes application
• Written in Java
• Runs on
  – Linux, Mac OS/X, Windows, and Solaris
  – Commodity hardware

Slide adapted from Xiaoxiao Shi, Guan Wang

Hadoop Users

• Who uses Hadoop?
  • Amazon/A9
  • AOL
  • Facebook
  • Fox interactive media
  • Google
  • IBM
  • New York Times
  • PowerSet (now Microsoft)
  • Quantcast
  • Rackspace/Mailtrust
  • Veoh
  • Yahoo!
  • More at http://wiki.apache.org/hadoop/PoweredBy

Slide adapted from Xiaoxiao Shi, Guan Wang

Typical Hadoop Structure

• Commodity hardware
  – Linux PCs with 4 local disks
• Typically in 2-level architecture
  – 40 nodes/rack
  – Uplink from rack is 8 gigabit
  – Rack-internal is 1 gigabit all-to-all

Slide adapted from Xiaoxiao Shi, Guan Wang

Hadoop structure

• Single namespace for entire cluster
  – Managed by a single namenode.
  – Files are single-writer and append-only.
  – Optimized for streaming reads of large files.
• Files are broken into large blocks.
  – Typically 128 MB
  – Replicated to several datanodes, for reliability
• Client talks to both namenode and datanodes
  – Data is not sent through the namenode.
  – Throughput of file system scales nearly linearly with the number of nodes.
• Access from Java, C, or command line.
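
The block and replica ideas can be illustrated with a small sketch. The replication factor of 3 is an assumption (the slide only says "several"), and the round-robin placement is a stand-in chosen for simplicity, not the placement policy the real file system uses. Function and node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # typical block size given on the slide
REPLICAS = 3          # assumption: the slide only says "several" datanodes


def split_into_blocks(file_size_mb):
    """Sizes (in MB) of the blocks a file of this size is broken into."""
    if file_size_mb <= 0:
        return []
    n_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    sizes = [BLOCK_SIZE_MB] * (n_blocks - 1)
    sizes.append(file_size_mb - BLOCK_SIZE_MB * (n_blocks - 1))   # last block may be short
    return sizes


def place_replicas(n_blocks, datanodes):
    """Round-robin placement; a stand-in, not the real placement policy."""
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(REPLICAS)
        ]
    return placement


blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
```

A 300 MB file becomes two full 128 MB blocks plus a 44 MB block. With four datanodes, each block lands on three different nodes, so losing any one node still leaves two copies of every block.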

Slide adapted from Xiaoxiao Shi, Guan Wang

Map Reduce Architecture

Slide adapted from Xiaoxiao Shi, Guan Wang

Example of Hadoop Programming

• Intuition: design <key, value>
• Assume each node will process a paragraph…
• Map:
  – What is the key?
  – What is the value?
• Reduce:
  – What to collect?
  – What to reduce?
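
One possible answer to these questions is word counting, the customary first MapReduce example. The worked example on the original slides has not survived in this transcript and may have been a different one. The sketch below simulates the map, shuffle, and reduce steps in plain Python rather than using Hadoop's Java API, and all names are made up: the key is a word, the value is the number 1, and reduce adds up the values for each key.

```python
from collections import defaultdict


def map_paragraph(paragraph):
    """Map: emit a <word, 1> pair for every word in one paragraph."""
    return [(word, 1) for word in paragraph.lower().split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: collect all the 1s for a word and add them up."""
    return word, sum(counts)


# Each paragraph stands for the input handled by one node.
PARAGRAPHS = ["the web is huge", "the index is huge too"]

mapped = [pair for p in PARAGRAPHS for pair in map_paragraph(p)]
word_counts = dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```

On a real cluster the map calls would run on different nodes, and the framework would move each key's values to whichever node runs the reduce for that key.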

Slide adapted from Xiaoxiao Shi, Guan Wang

Check the Results


Twitter Course This Fall!

• Big Data Analysis with Twitter
• Topics will include:
  – Large Scale Anomaly Detection (at Twitter)
  – Intro to Pig and Scalding
  – Recommendation Algorithms
  – Real-time Search
  – Information Diffusion and Outbreak Detection on Twitter
  – Trend detection in social streams
  – Graph Algorithms for the Social Graph

Next Two Lectures + Additional Review

• Apr 24: Cryptography (Prof. John Chuang)
• Apr 26: Review (Monica and Alex)
• Tues May 1: Optional Review (Marti)