1 CS 430: Information Discovery Lecture 21 Web Search 3.

1

CS 430: Information Discovery

Lecture 21

Web Search 3

2

Course Administration

Thursday, November 11

No office hours

Tuesday, November 16

No class

Wednesday, November 17

Discussion class requires you to read three short papers.

Wednesday, December 1

Discussion class requires you to search for and read materials on a specified topic.

3

Effective Information Retrieval

1. Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).

Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.

2. Full text indexing with ranked retrieval (e.g., news articles).

Excellent for relatively homogeneous material, but requires available full text.

Neither of these methods is very effective when applied directly to the Web.

4

Effective Information Retrieval (cont)

3. Full text indexing with contextual information and ranked retrieval (e.g., Google, Teoma).

Excellent for mixed textual information with rich structure.

4. Contextual information with non-textual materials and ranked retrieval (e.g., Google and Yahoo image retrieval).

Promising, but still experimental.

5

New concepts in Web Searching

• Goal of search is redefined to emphasize precision of the most highly ranked group of hits.

• Concept of relevance is changed to include importance of documents as a factor in ranking.

• Browsing is tightly connected to searching.

• Contextual information is used as an integral part of the search.

6

Browsing

Users give queries of 2 to 4 words

Most users click only on the first few results; few go beyond the fold on the first page

80% of users use search engines to find sites

search to find site

browse to find information

Amit Singhal, Google, 2004

7

Browsing and Searching

Searching is followed by browsing.

Browsing the hit list:

• helpful summary records (snippets)

• removal of duplicates

• grouping results from a single site

Browsing the web pages themselves:

• direct links from the snippets to the pages

• cache with highlights

• translation in same format

8

Dynamic Snippets

Query: Cornell sports

LII: Law about...Sports... sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ... www.law.cornell.edu/topics/sports.html

Query: NCAA Tarkanian

LII: Law about...Sports... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ... www.law.cornell.edu/topics/sports.html
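The two snippets above come from the same page but show different passages, because each is built around the query terms. A minimal sketch of that idea follows, assuming a simple fixed-width window around each matching word. The function name `dynamic_snippet`, the `width` parameter, and the sample text (adapted from the snippets above) are illustrative assumptions, not a description of how any search engine builds its snippets.

```python
import re

def dynamic_snippet(text, query, width=40):
    """Build a query-biased snippet: a window of text around each query-term hit."""
    terms = {t.lower() for t in query.split()}
    windows = []
    for match in re.finditer(r"\w+", text):
        if match.group().lower() in terms:
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            if windows and start <= windows[-1][1]:
                # Overlaps the previous window, so extend it instead.
                windows[-1] = (windows[-1][0], end)
            else:
                windows.append((start, end))
    return " ... ".join(text[s:e].strip() for s, e in windows)

# Sample page text adapted from the snippets shown on the slide.
page = ("Sports law: an overview. Sports Law encompasses a multitude areas of law "
        "brought together in unique ways. See NCAA v. Tarkanian (1988). State action "
        "status may also be a factor in mandatory drug testing rules.")

print(dynamic_snippet(page, "Cornell sports"))
print(dynamic_snippet(page, "NCAA Tarkanian"))
```

The same page yields a different summary for each query, which is what distinguishes a dynamic snippet from a fixed description of the page.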

9

Contextual information

The context in which an item exists may give useful information for searching.

Information about a document:

• Content (terms, formatting, etc.)

• Metadata (externally created following rules)

• Context (citations and links, reviews, annotations, etc.)

Context has many uses:

• Selecting documents to index

• Retrieval clues (e.g., anchor text)

• Ranking

10

Context: Anchor Text

words words words Cornell University words words words

<a href = "http://www.cornell.edu">Cornell University</a>

Linking page Linked to page

HTML source
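One way to use anchor text as a retrieval clue is to index the words of the link under the URL of the linked-to page rather than the linking page. The sketch below assumes this approach and uses Python's standard `html.parser` module. The class `AnchorTextParser`, the function `index_anchor_text`, and the in-memory `index` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (target URL, anchor text) pairs from one linking page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

def index_anchor_text(html, index):
    """Index each anchor-text term under the page the link points to."""
    parser = AnchorTextParser()
    parser.feed(html)
    for url, text in parser.pairs:
        for term in text.lower().split():
            index[term].add(url)

index = defaultdict(set)
linking_page = ('words words words '
                '<a href = "http://www.cornell.edu">Cornell University</a> '
                'words words words')
index_anchor_text(linking_page, index)
print(dict(index))
```

Only the words inside the link are credited to the linked-to page; the surrounding "words words words" are not indexed under its URL.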

11

Context: Image Searching

<img src="images/Arms.jpg" alt="Photo of William Arms">

HTML source

From the Information Science web site

Captions and other adjacent text on the web page

12

Reference Pattern Ranking using Dynamic Document Sets

PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.

Concept of dynamic document sets. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.

With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.

13

Reference Pattern Ranking using Dynamic Document Sets

Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)

1. Search using conventional term weighting. Rank the hits using similarity between query and documents.

2. Select the highest ranking hits (e.g., top 5,000 hits).

3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.

4. Display the results ranked in the order of the reference patterns calculated.
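The four steps above can be sketched as follows. This is an illustration under stated assumptions, not Teoma's actual implementation: the term-overlap `similarity` function stands in for conventional term weighting, `pagerank` is a plain power-iteration PageRank with an assumed damping factor of 0.85, and the document and link data are made up for the example.

```python
def similarity(query, text):
    """Toy term-overlap score standing in for conventional term weighting."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

def pagerank(nodes, links, damping=0.85, iterations=50):
    """Power-iteration PageRank over `nodes` only; links leaving the set are ignored."""
    nodes = list(nodes)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    members = set(nodes)
    out = {u: [v for v in links.get(u, []) if v in members and v != u] for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out[u] or nodes  # a page with no out-links spreads its rank evenly
            share = damping * rank[u] / len(targets)
            for v in targets:
                new[v] += share
        rank = new
    return rank

def dynamic_rank(query, docs, links, top_k=5000):
    # Step 1: rank hits by similarity between query and document.
    scored = sorted(((similarity(query, text), doc_id) for doc_id, text in docs.items()),
                    reverse=True)
    hits = [doc_id for score, doc_id in scored if score > 0]
    # Step 2: keep only the highest-ranking hits.
    selected = hits[:top_k]
    if not selected:
        return []
    # Step 3: compute reference-pattern ranks on this query-specific set.
    rank = pagerank(selected, links)
    # Step 4: order the results by the query-specific ranks.
    return sorted(selected, key=lambda d: rank[d], reverse=True)

# Hypothetical four-page collection: "d" does not match the query,
# so its link to "b" plays no part in the ranking.
docs = {"a": "cornell sports news", "b": "cornell sports scores",
        "c": "cornell sports schedule", "d": "cooking recipes"}
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["b"]}
print(dynamic_rank("cornell sports", docs, links))
```

Because the link graph is rebuilt from the hits for each query, a link from a page outside the selected set carries no weight, which is the difference from a single rank computed once for the whole collection.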

14

Scalability

[Figure: The growth of the web, 1994 to 2000, plotted on a logarithmic scale from 1 to 10,000,000,000]

15

Scalability

Web search services are centralized systems

• Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.

• Will this continue?

• Possible areas for concern are: staff costs, telecommunications costs, disk access rates.

16

Growth of Web Searching

In November 1997:

• AltaVista was handling 20 million searches/day.

• Google forecast for 2000 was 100s of millions of searches/day.

In 2004, Google reports 250 million web searches/day, and estimates that the total number over all engines is 500 million searches/day.

Moore's Law and web searching

In 7 years, Moore's Law predicts computer power will increase by a factor of at least 2^4 = 16.

It appears that computing power is growing at least as fast as web searching.
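The arithmetic behind this comparison can be sketched as below. The 21-month doubling period is an assumption inferred from the slide's figure of 4 doublings in 7 years, and the search growth compares the busiest engine cited for each year (AltaVista in 1997, Google in 2004), which is also an assumption about which figures to set side by side.

```python
def growth_factor(years, doubling_months):
    """Growth factor after `years` if capacity doubles every `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)

# Assumed doubling period: 7 years / 4 doublings = 21 months.
computing = growth_factor(7, 21)

# Searches per day, in millions: AltaVista in 1997 versus Google in 2004.
searching = 250 / 20

print(f"computing power: x{computing:g}, searches per day: x{searching:g}")
```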

17

Growth of Google

In 2000: 85 people

50% technical, 14 with Ph.D.s in Computer Science

In 2000: Equipment

• 2,500 Linux machines

• 80 terabytes of spinning disks

• 30 new machines installed daily

Reported by Larry Page, Google, March 2000

At that time, Google was handling 5.5 million searches per day

Increase rate was 20% per month

By fall 2002, Google had grown to over 400 people.

In 2004, Google plans to hire 1,000 new people.

18

Scalability: Performance

Very large numbers of commodity computers

Algorithms and data structures scale linearly

• Storage
  – Scale with the size of the Web
  – Compression/decompression

• System
  – Crawling, indexing, sorting simultaneously

• Searching
  – Bounded by disk I/O

19

Software and Hardware Replication

[Figure: a search service backed by replicated index servers, replicated document servers, replicated spell-checking servers, and an advertisement server]

20

Scalability: Numbers of Computers

Very rough calculation

In March 2000, 5.5 million searches per day required 2,500 computers

In fall 2004, computers are about 8 times more powerful.

Estimated number of computers for 250 million searches per day:

= (250/5.5) x 2,500/8

= about 15,000
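The rough calculation above can be reproduced as a short sketch. It assumes, as the slide does, that the number of computers scales linearly with searches per day and inversely with the power of each computer. The function name `estimate_computers` and its parameter names are illustrative.

```python
def estimate_computers(searches_per_day, base_searches, base_computers, speedup):
    """Scale a known installation linearly with load, then divide by per-machine speedup."""
    return searches_per_day / base_searches * base_computers / speedup

# March 2000 baseline: 5.5 million searches/day on 2,500 computers.
# Fall 2004: 250 million searches/day on computers about 8 times more powerful.
estimate = estimate_computers(250e6, 5.5e6, 2500, 8)
print(round(estimate))  # about 14,200, in line with the slide's "about 15,000"
```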

Some industry estimates suggest that Google may have as many as 100,000 computers.

21

Scalability: Staff

Programming: Have very well trained staff. Isolate complex code. Most coding is single image.

System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers).

Customer service: Automate everything possible, but complaints, large collections, etc. require staff.

22

Evaluation of Web Searching

Test corpus must be dynamic

The web is dynamic: 10%-20% of URLs change every month

Spam methods change continually

Queries are time sensitive

Topics are hot and then not

Need to have a sample of real queries

Languages

At least 90 different languages

Reflected in cultural and technical differences

Amit Singhal, Google, 2004

23

Other Uses of Web Crawling and Associated Technology

The technology developed for web search services has many other applications.

Conversely, technology developed for other Internet applications can be applied in web searching.

• Related objects (e.g., Amazon's "Other people bought the following").

• Recommender and reputation systems (e.g., Epinions' reputation system).

24

Google API

25

Selective searching

26

Google News
