1
CS 430: Information Discovery
Lecture 21
Web Search 3
2
Course Administration
Thursday, November 11
No office hours
Tuesday, November 16
No class
Wednesday, November 17
Discussion class requires you to read three short papers.
Wednesday, December 1
Discussion class requires you to search for and read materials on a specified topic.
3
Effective Information Retrieval
1. Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).
Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
2. Full text indexing with ranked retrieval (e.g., news articles).
Excellent for relatively homogeneous material, but requires available full text.
Neither of these methods is very effective when applied directly to the Web.
4
Effective Information Retrieval (cont)
3. Full text indexing with contextual information and ranked retrieval (e.g., Google, Teoma).
Excellent for mixed textual information with rich structure.
4. Contextual information with non-textual materials and ranked retrieval (e.g., Google and Yahoo image retrieval).
Promising, but still experimental.
5
New concepts in Web Searching
• Goal of search is redefined to emphasize precision of the most highly ranked group of hits.
• Concept of relevance is changed to include importance of documents as a factor in ranking.
• Browsing is tightly connected to searching.
• Contextual information is used as an integral part of the search.
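The first point above, precision of the most highly ranked hits, is commonly measured as "precision at k". The following is a minimal sketch only; the document identifiers and relevance judgments are hypothetical and do not come from the lecture.

```python
def precision_at_k(ranked_hits, relevant, k):
    """Fraction of the top-k ranked hits that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_hits[:k]
    # Divide by k (not len(top)) so a short hit list is not rewarded.
    return sum(1 for doc in top if doc in relevant) / k


# Hypothetical hit list and relevance judgments, for illustration only.
hits = ["d3", "d7", "d1", "d9", "d4"]
relevant_docs = {"d3", "d1", "d5"}
print(precision_at_k(hits, relevant_docs, 5))
```

Because only the first k hits are examined, the measure ignores relevant documents that appear far down the list, which matches the way web users read results.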
6
Browsing
Users give queries of 2 to 4 words
Most users click only on the first few results; few go beyond the fold on the first page
80% of users use a search engine to find sites
search to find site
browse to find information
Amit Singhal, Google, 2004
7
Browsing and Searching
Searching is followed by browsing.
Browsing the hit list:
• helpful summary records (snippets)
• removal of duplicates
• grouping results from a single site
Browsing the web pages themselves:
• direct links from the snippets to the pages
• cache with highlights
• translation in same format
8
Dynamic Snippets
Query: Cornell sports
LII: Law about...Sports... sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ... www.law.cornell.edu/topics/sports.html
Query: NCAA Tarkanian
LII: Law about...Sports... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ... www.law.cornell.edu/topics/sports.html
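The two snippets above come from the same page but differ because each is built around the query terms. One simple way to get this effect, shown here as a sketch and not as any search engine's actual method, is to pick the window of words that contains the most query terms. The function name `dynamic_snippet`, the `width` parameter, and the sample page text are all assumptions made for illustration.

```python
import re


def dynamic_snippet(text, query, width=12):
    """Return the window of `width` words containing the most query terms."""
    words = text.split()
    terms = {t.lower() for t in query.split()}

    def norm(word):
        # Strip punctuation and case so "Tarkanian." matches "tarkanian".
        return re.sub(r"\W+", "", word).lower()

    hits = [1 if norm(w) in terms else 0 for w in words]
    best_start, best_score = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        score = sum(hits[start:start + width])
        if score > best_score:  # ties keep the earliest window
            best_start, best_score = start, score
    snippet = " ".join(words[best_start:best_start + width])
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + width < len(words) else ""
    return prefix + snippet + suffix


# Hypothetical page text loosely modeled on the slide's example.
page = ("Sports law: an overview. Sports Law encompasses a multitude areas "
        "of law. Amateur Sports. See NCAA v. Tarkanian. State action status "
        "may also be a factor in mandatory drug testing rules.")
print(dynamic_snippet(page, "Cornell sports", width=8))
print(dynamic_snippet(page, "NCAA Tarkanian", width=8))
```

A production system would also weight rare terms, respect sentence boundaries, and join several windows with ellipses, as the snippets on the slide do.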
9
Contextual information
The context in which an item exists may give useful information for searching.
Information about a document:
• Content (terms, formatting, etc.)
• Metadata (externally created following rules)
• Context (citations and links, reviews, annotations, etc.)
Context has many uses:
• Selecting documents to index
• Retrieval clues (e.g., anchor text)
• Ranking
10
Context: Anchor Text
words words words Cornell University words words words
<a href = "http://www.cornell.edu">Cornell University</a>
Linking page Linked to page
HTML source
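Anchor text is useful because it describes the linked-to page in someone else's words, so an indexer can credit those words to the target URL. The sketch below uses Python's standard `html.parser` module; the class name `AnchorTextCollector` and its `anchors` attribute are hypothetical names chosen for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect anchor text, keyed by the URL of the linked-to page."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # target URL -> list of anchor texts
        self._href = None                 # href of the <a> currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.anchors[self._href].append(text)
            self._href = None


collector = AnchorTextCollector()
collector.feed('words words <a href="http://www.cornell.edu">'
               'Cornell University</a> words words')
print(dict(collector.anchors))
```

The collected text would then be indexed as if it belonged to the linked-to page, which lets a page be found by terms that never appear in its own content.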
11
Context: Image Searching
<img src="images/Arms.jpg" alt="Photo of William Arms">
HTML source
From the Information Science web site
Captions and other adjacent text on the web page
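Since the image itself carries no searchable text, an image search service can index the words found around it. As a minimal sketch, assuming only the `alt` attribute is used (captions and adjacent text are left out), the following collects index terms per image with the standard `html.parser` module. `ImageTextCollector` is a hypothetical name.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Map each image URL to index terms taken from its alt text."""

    def __init__(self):
        super().__init__()
        self.images = {}  # image src -> list of lowercased alt-text terms

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src")
            if src:
                alt = attributes.get("alt") or ""
                self.images[src] = alt.lower().split()


collector = ImageTextCollector()
collector.feed('<img src="images/Arms.jpg" alt="Photo of William Arms">')
print(collector.images)
```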
12
Reference Pattern Ranking using Dynamic Document Sets
PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries.
Concept of dynamic document sets. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections.
With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.
13
Reference Pattern Ranking using Dynamic Document Sets
Teoma Dynamic Ranking Algorithm (used in Ask Jeeves)
1. Search using conventional term weighting. Rank the hits using similarity between query and documents.
2. Select the highest ranking hits (e.g., top 5,000 hits).
3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order of the reference patterns calculated.
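The four steps above can be sketched as follows. This is an illustration of the general idea, not Teoma's actual implementation: the similarity scores and link graph are hypothetical inputs, and a basic power-iteration PageRank with an assumed damping factor of 0.85 stands in for "PageRank or similar algorithm".

```python
def pagerank(links, nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank computed only over the given nodes."""
    nodes = list(nodes)
    n = len(nodes)
    node_set = set(nodes)
    # Keep only links whose source and target are both in the selected set.
    out = {u: [v for v in links.get(u, []) if v in node_set and v != u]
           for u in nodes}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += damping * rank[u] / len(out[u])
        rank = new
    return rank


def dynamic_rank(similarity, links, top_n=5000):
    """Steps 1-4: select the top hits by similarity, then re-rank by links."""
    hits = sorted(similarity, key=similarity.get, reverse=True)[:top_n]
    if not hits:
        return []
    ranks = pagerank(links, hits)  # ranks are specific to this query
    return sorted(hits, key=ranks.get, reverse=True)


# Hypothetical query-document similarities and link graph.
similarity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["b"]}
print(dynamic_rank(similarity, links, top_n=3))
```

In this toy example, document d is dropped at step 2, so its link to b contributes nothing; only reference patterns among the query's own hits affect the final order.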
14
Scalability
[Figure: The growth of the web, 1994 to 2000, plotted on a logarithmic vertical axis running from 1 to 10,000,000,000]
15
Scalability
Web search services are centralized systems
• Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function.
• Will this continue?
• Possible areas for concern are: staff costs, telecommunications costs, disk access rates.
16
Growth of Web Searching
In November 1997:
• AltaVista was handling 20 million searches/day.
• Google forecast for 2000 was 100s of millions of searches/day.
In 2004, Google reports 250 million web searches/day, and estimates that the total number over all engines is 500 million searches/day.
Moore's Law and web searching
In 7 years, Moore's Law predicts computer power will increase by a factor of at least 2^4 = 16.
It appears that computing power is growing at least as fast as web searching.
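The factor of 16 can be checked with a short calculation. This sketch assumes the common statement of Moore's Law as a doubling every 18 months, which the slide does not state explicitly; the function names are hypothetical.

```python
def whole_doublings(years, doubling_months=18):
    """Number of complete doubling periods that fit in the given time."""
    return (years * 12) // doubling_months


def moore_factor(years, doubling_months=18):
    """Growth factor if doubling is treated as continuous."""
    return 2 ** (years * 12 / doubling_months)


doublings = whole_doublings(7)         # 84 months // 18 months = 4
print(doublings, 2 ** doublings)       # lower bound used on the slide
print(round(moore_factor(7), 1))       # continuous estimate

# Growth in searches/day between the two figures quoted on the slide:
# AltaVista in 1997 versus the 2004 estimate for all engines.
print(500e6 / 20e6)
```

Under the 18-month assumption the continuous estimate is about 25, which is close to the ratio of the two search volumes quoted above, although those two figures do not measure the same set of search engines.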
17
Growth of Google
In 2000: 85 people
50% technical, 14 Ph.D. in Computer Science
In 2000: Equipment
• 2,500 Linux machines
• 80 terabytes of spinning disks
• 30 new machines installed daily
Reported by Larry Page, Google, March 2000
At that time, Google was handling 5.5 million searches per day
Increase rate was 20% per month
By fall 2002, Google had grown to over 400 people.
In 2004, Google plans to hire 1,000 new people.
18
Scalability: Performance
Very large numbers of commodity computers
Algorithms and data structures scale linearly
• Storage
  – Scale with the size of the Web
  – Compression/decompression
• System
  – Crawling, indexing, sorting simultaneously
• Searching
  – Bounded by disk I/O
19
Software and Hardware Replication
[Diagram: a search service built from many replicated components: index servers, document servers, spell checking, and advertisement servers]
20
Scalability: Numbers of Computers
Very rough calculation
In March 2000, 5.5 million searches per day required 2,500 computers
In fall 2004, computers are about 8 times more powerful.
Estimated number of computers for 250 million searches per day:
= (250/5.5) x 2,500/8
= about 15,000
Some industry estimates suggest that Google may have as many as 100,000 computers.
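The rough calculation above can be reproduced directly. The sketch assumes, as the slide does, that the number of computers scales linearly with search volume and inversely with per-machine power; the function and parameter names are hypothetical.

```python
def estimated_computers(searches_per_day,
                        base_searches=5.5e6,   # searches/day in March 2000
                        base_computers=2500,   # machines in March 2000
                        speedup=8):            # 2004 machine vs 2000 machine
    """Scale the March 2000 figures linearly to a new search volume."""
    return searches_per_day / base_searches * base_computers / speedup


print(round(estimated_computers(250e6)))
```

The result is a little over 14,000, which the slide rounds to about 15,000. The gap between this figure and the industry estimate of 100,000 suggests that search volume alone does not account for the size of the installation.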
21
Scalability: Staff
Programming: Have very well trained staff. Isolate complex code. Most coding is single image.
System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers).
Customer service: Automate everything possible, but complaints, large collections, etc. require staff.
22
Evaluation of Web Searching
Test corpus must be dynamic
The web is dynamic: 10%-20% of URLs change every month
Spam methods change continually
Queries are time sensitive
Topics are hot and then not
Need to have a sample of real queries
Languages
At least 90 different languages
Reflected in cultural and technical differences
Amit Singhal, Google, 2004
23
Other Uses of Web Crawling and Associated Technology
The technology developed for web search services has many other applications.
Conversely, technology developed for other Internet applications can be applied in web searching.
• Related objects (e.g., Amazon's "Other people bought the following").
• Recommender and reputation systems (e.g., the Epinions reputation system).