1
CS/INFO 430 Information Retrieval
Lecture 17
Web Search 3
2
Course Administration
3
Information Retrieval Using PageRank
Simple Method: Rank by Popularity
Consider all hits (i.e., all documents that match the query in the Boolean sense) as equal.
Display the hits ranked by PageRank.
The disadvantage of this method is that it gives no attention to how closely a document matches a query
4
Combining Term Weighting with Reference Pattern Ranking
Combined Method
1. Find all documents that contain the terms in the query vector.
2. Let sj be the similarity between the query and document j, calculated using tf.idf or a related method.
3. Let pj be the popularity of document j, calculated using PageRank or another measure of importance.
4. The combined rank cj = λsj + (1 - λ)pj, where λ is a constant with 0 ≤ λ ≤ 1.
5. Display the hits ranked by cj.
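The combined method can be sketched in Python; the weight 0.7 and the sample scores below are illustrative choices, not values from the lecture:

```python
def combined_rank(similarity, popularity, lam=0.7):
    """Combine similarity s_j (e.g., tf.idf) with popularity p_j
    (e.g., PageRank) as c_j = lam*s_j + (1 - lam)*p_j, 0 <= lam <= 1.
    Both scores should be normalized to a common range first."""
    return lam * similarity + (1 - lam) * popularity

# Hypothetical hits: (document, similarity s_j, popularity p_j).
hits = [("doc1", 0.9, 0.1), ("doc2", 0.5, 0.8), ("doc3", 0.2, 0.2)]

# Step 5: display the hits ranked by c_j, highest first.
ranked = sorted(hits, key=lambda h: combined_rank(h[1], h[2]), reverse=True)
```

Setting lam = 1 reduces to pure similarity ranking; lam = 0 reduces to the simple rank-by-popularity method above.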
5
Questions about PageRank
Most pages have very small page ranks
• For searches that return large numbers of hits, there are usually a reasonable number of pages with high PageRank.
• For searches that return smaller numbers of hits, e.g., highly specific queries, all the pages may have very small PageRanks, so that it is difficult to rank them in a sensible order.
Example
A search by a customer for information about a product may rank a large number of mail order businesses that sell the product above the manufacturer's site that provides a specification for the product. Small numbers of links may make big changes to rank.
6
Advanced Graphical Methods: www.teoma.com
• Carry out a search
• Divide Web sites found by a search into clusters, known as communities
• Calculate authority within communities
• Calculate hubs within communities, known as experts
Note: Teoma does not publish the precise algorithms it uses
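Since Teoma's precise algorithms are unpublished, the sketch below shows a standard way to compute hubs and authorities within a community subgraph: Kleinberg's HITS iteration. The graph and iteration count are illustrative.

```python
def hubs_and_authorities(links, iterations=20):
    """HITS-style hub/authority scoring over a community subgraph.
    links: dict mapping each page to the set of pages it links to.
    Returns (hub, authority) score dicts, each normalized to sum to 1."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded.
        for d in (hub, auth):
            total = sum(d.values()) or 1.0
            for p in d:
                d[p] /= total
    return hub, auth
```

In a community where pages A and B both link to C, the iteration makes C the authority and A and B equally good hubs (Teoma's "experts").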
7
Other Factors in Ranking
Coefficients sj and pj may be varied by adding other evidence.
Similarity ranking sj might weight:
• structural mark-up, e.g., headings, bold, etc.
• meta-tags
• anchor text and adjacent text in the linking page
• file names
Popularity ranking pj might weight:
• usage data of page
• previous searches by same user
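One common way to fold structural evidence into the similarity score sj is to weight each term occurrence by the field it appears in. The field names and weight values below are illustrative, not from the lecture:

```python
# Illustrative field weights: a term in a title, heading, or anchor
# counts for more than the same term in the body text.
FIELD_WEIGHTS = {"title": 5.0, "heading": 3.0, "anchor": 4.0,
                 "meta": 2.0, "body": 1.0}

def weighted_term_frequency(occurrences):
    """occurrences: list of (term, field) pairs found for a document.
    Returns a dict of field-weighted term frequencies, which can
    replace raw tf in a tf.idf similarity calculation."""
    tf = {}
    for term, field in occurrences:
        tf[term] = tf.get(term, 0.0) + FIELD_WEIGHTS.get(field, 1.0)
    return tf
```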
8
Anchor Text and Adjacent Text
Document A provides information about Document B
[Diagram: a link in Document A pointing to Document B, with its anchor text and the adjacent text labelled.]
9
Anchor Text and File Names
The source of Document A contains the marked-up text:
<a href="http://www.cis.cornell.edu/">The Faculty of Computing and Information Science</a>
This string provides the following index terms about Document B:
Anchor text: faculty, computing, information, science
File name: cis, cornell
Note: A specific stop list is needed for each category of text.
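The extraction on this slide can be sketched as follows; the stop-list entries are illustrative, and a real system would tune a separate list per category, as the note says:

```python
import re
from urllib.parse import urlparse

# Separate stop lists per category of text; these entries are illustrative.
ANCHOR_STOPS = {"the", "of", "and", "a", "click", "here"}
FILENAME_STOPS = {"www", "index", "html", "htm", "edu", "com", "org"}

def index_terms(href, anchor_text):
    """Extract index terms about the linked document (Document B)
    from one <a> element in the linking document (Document A)."""
    anchor = [t for t in re.findall(r"[a-z]+", anchor_text.lower())
              if t not in ANCHOR_STOPS]
    host = urlparse(href).netloc
    fname = [t for t in re.findall(r"[a-z]+", host.lower())
             if t not in FILENAME_STOPS]
    return anchor, fname
```

Applied to the example above, this yields the anchor terms faculty, computing, information, science and the file-name terms cis, cornell.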
10
Indexing Non-Textual Materials
Factors that can be used to index non-textual materials:
• anchor text, including <alt> tags
• text adjacent to an anchor
• file names
• PageRank
This is the concept behind image searching on the Web.
11
Context: Image Searching
From the Information Science web site, the HTML source contains:
<img src="images/Arms.jpg" alt="Photo of William Arms">
Captions and other adjacent text on the web page
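Harvesting alt text for image indexing can be sketched as below; a production crawler would use a full HTML parser, but a regular expression keeps the illustration short:

```python
import re

def image_index_terms(html):
    """Collect alt-text index terms for each image in an HTML fragment.
    Returns a dict mapping each image src to its list of terms."""
    terms = {}
    for src, alt in re.findall(r'<img\s+src="([^"]+)"\s+alt="([^"]+)"', html):
        # The alt text describes the image, so its words become index terms.
        terms[src] = [t.lower() for t in re.findall(r"[A-Za-z]+", alt)]
    return terms
```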
12
Evaluation: Web Searching
Test corpus must be dynamic
The Web is dynamic: 10%-20% of URLs change every month
Spam methods change continually
Queries are time sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
13
Evaluation: Search + Browse
Users give queries of 2 to 4 words
Most users click only on the first few results; few go beyond the fold on the first page
80% of users use a search engine to find sites:
search to find site
browse to find information
Amit Singhal, Google, 2004
Browsing is a major topic in the lectures on Usability
14
Evaluation: The Human in the Loop
[Diagram, with the user in the loop: search the index, return hits, browse documents, return objects.]
15
Scalability
Question: How big is the Web and how fast is it growing?
Answer: Nobody knows
Estimates of the Crawled Web:
1994 100,000 pages
1997 1,000,000 pages
2000 1,000,000,000 pages
2005 8,000,000,000 pages
Rough estimates of the Crawlable Web suggest at least 4x
Rough estimates of the Deep Web suggest at least 100x
16
Scalability: Software and Hardware Replication
[Diagram: the search service replicated across clusters of index servers, document servers, spell-checking servers, and advertisement servers.]
17
Scalability: Large-scale Clusters of Commodity Computers
"Component failures are the norm rather than the exception.... The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies...."
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System." 19th ACM Symposium on Operating Systems Principles, October 2003. http://portal.acm.org/citation.cfm?doid=945445.945450
18
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
– Scale with the size of the Web
– Compression/decompression
• System
– Crawling, indexing, sorting simultaneously
• Searching
– Bounded by disk I/O
19
Scalability of Staff: Growth of Google
In 2000: 85 people
50% technical, 14 Ph.D. in Computer Science
In 2000: Equipment
2,500 Linux machines
80 terabytes of spinning disks
30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
By fall 2006, Google had over 9,000 people.
20
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day, required 2,500 computers
In fall 2004, computers were about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8
= about 15,000
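As a check, the slide's rough arithmetic in Python:

```python
searches_2000 = 5.5e6   # searches per day in March 2000
machines_2000 = 2500    # computers in March 2000
speedup = 8             # fall-2004 machines vs. 2000 machines
searches_2004 = 250e6   # searches per day in fall 2004

# Scale the machine count by the search volume, then divide by the
# per-machine speedup: (250/5.5) x 2,500 / 8.
machines_2004 = (searches_2004 / searches_2000) * machines_2000 / speedup
# roughly 14,200, i.e., about 15,000 as the slide rounds it
```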
Some industry estimates (based on Google's capital expenditure) suggest that Google and Yahoo may have had as many as 250,000+ computers in fall 2006.
21
Scalability: Staff
Programming: As the number of programmers grows it becomes increasingly difficult to maintain the quality of software.
Have very well trained staff. Isolate complex code. Most coding is single image.
System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers).
Customer service: Automate everything possible, but complaints, large collections, etc. still require staff.
22
Scalability of Staff: The Neptune Project
The Neptune Clustering Software:
Programming API and runtime support, which allows a network service to be programmed quickly for execution on a large-scale cluster in handling high-volume user traffic.
The system shields application programmers from the complexities of replication, service discovery, failure detection and recovery, load balancing, resource monitoring and management.
Tao Yang, University of California, Santa Barbara
http://www.cs.ucsb.edu/projects/neptune/
23
Scalability: the Long Term
Web search services are centralized systems
• Over the past 12 years, Moore's Law has enabled Web search services to keep pace with the growth of the Web and the number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are: staff costs, telecommunications costs, disk and memory access rates, equipment costs.
24
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reported 250 million Web searches/day, and estimated that the total number over all engines was 500 million searches/day.
Moore's Law and Web searching
In 7 years, Moore's Law predicts computer power increased by a factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as web searching.
25
Other Uses of Web Crawling and Associated Technology
The technology developed for Web search services has many other applications.
Conversely, technology developed for other Internet applications can be applied in Web searching
• Related objects (e.g., Amazon's "Other people bought the following").
• Recommender and reputation systems (e.g., Epinions' reputation system).