What are we searching for? {week 9 }

Rensselaer Polytechnic Institute, CSCI-4220 – Network Programming, David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0


Page 1: What are we searching for? {week  9 }

What are we searching for? {week 9}

Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.

from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

Page 2: What are we searching for? {week  9 }

What is search?

What is search?
What are we searching for?
How many searches are processed per day?
What is the average number of words in text-based searches?

Page 3: What are we searching for? {week  9 }

Finding things

Applications and varieties of search:
  Web search
  Site search
  Vertical search
  Enterprise search
  Desktop search
  As-you-type search
  Proximity search

Page 4: What are we searching for? {week  9 }

Acquisition and indexing

Page 5: What are we searching for? {week  9 }

User interaction and querying

Page 6: What are we searching for? {week  9 }

Measures of success (i)

Relevance
  Search results contain information the searcher was looking for
  Problems with vocabulary mismatch
    ▪ Homonyms (e.g. “Jersey shore”)

User relevance
  Search results relevant to one user may be completely irrelevant to another user

(image: Snooki)

Page 7: What are we searching for? {week  9 }

Measures of success (ii)

Precision
  Proportion of retrieved documents that are relevant
  How precise were the results?

Recall (and coverage)
  Proportion of relevant documents that were actually retrieved
  Did we retrieve all of the relevant documents?

http://trec.nist.gov
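The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `precision_recall` and the document IDs are hypothetical, and documents are assumed to be identified by simple strings.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Precision: proportion of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: proportion of relevant documents that were actually retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 3 documents relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.67).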

Page 8: What are we searching for? {week  9 }

Measures of success (iii)

Timeliness and freshness
  Search results contain information that is current and up-to-date

Performance
  Users expect subsecond response times

Media
  User devices are constantly changing (cellphones, mobile devices, tablets, etc.)

Page 9: What are we searching for? {week  9 }

Measures of success (iv)

Scalability
  Designs that perform equally well as the system grows and expands
    ▪ Increased number of documents, number of users, etc.

Flexibility (or adaptability)
  Tune search engine components to keep up with changing landscape

Spam-resistance

Page 10: What are we searching for? {week  9 }

Information retrieval (IR)

Gerard Salton (1927-1995)
  Pioneer in information retrieval
  Defined information retrieval as: “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  This was 1968 (before the Internet and Web!)

Page 11: What are we searching for? {week  9 }

(Un)structured information

Structured information:
  Often stored in a database
  Organized via predefined tables, columns, etc.
  Select all accounts with balances less than $200

Unstructured information
  Document text (headings, words, phrases)
  Images, audio, video (often relies on textual tags)

account number   balance
7004533711       $498.19
7004533712       $781.05
7004533713       $147.15
7004533714       $195.75
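The structured query mentioned on this slide ("select all accounts with balances less than $200") can be sketched with Python's built-in `sqlite3` module. The table name `accounts`, the column names, and the helper `low_balance_accounts` are assumptions made for this illustration; only the four rows come from the slide.

```python
import sqlite3

# Rows taken from the slide's example table.
ROWS = [
    ("7004533711", 498.19),
    ("7004533712", 781.05),
    ("7004533713", 147.15),
    ("7004533714", 195.75),
]


def low_balance_accounts(limit):
    """Return (account_number, balance) rows with balance below `limit`."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, balance REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", ROWS)
    rows = conn.execute(
        "SELECT account_number, balance FROM accounts "
        "WHERE balance < ? ORDER BY account_number",
        (limit,),
    ).fetchall()
    conn.close()
    return rows


print(low_balance_accounts(200))
```

Because the data has a predefined schema, the query is exact: a row either satisfies `balance < 200` or it does not. Unstructured text offers no such columns to filter on, which is why search relies on other techniques.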

Page 12: What are we searching for? {week  9 }

Processing text

Search and IR have largely focused on text processing and documents

Search typically uses the statistical properties of text
  Word counts
  Word frequencies
  But ignores linguistic features (noun, verb, etc.)
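Word counts and word frequencies can be computed with the standard library alone. The sketch below is illustrative: the function names and the simple lowercase tokenizer (runs of letters, digits, and apostrophes) are assumptions, not the tokenization rule of any particular search engine.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how many times each word occurs in `text`."""
    # Assumed tokenizer: lowercase, then split on anything that is not
    # a letter, digit, or apostrophe.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


def word_frequencies(text):
    """Return each word's count divided by the total number of words."""
    counts = word_counts(text)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


sample = "The cat and the hat"
print(word_counts(sample).most_common(2))
print(word_frequencies(sample)["the"])
```

Note that nothing here knows whether "cat" is a noun or "and" is a conjunction; the statistics treat every word as an opaque token.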

Page 13: What are we searching for? {week  9 }

Politeness and robots.txt

Web crawlers adhere to a politeness policy:
  GET requests sent every few seconds or minutes

A robots.txt file specifies what crawlers are allowed to crawl:
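The slide's own robots.txt example did not survive in this transcript, so the sketch below uses a made-up file. It shows how a crawler might consult robots.txt with Python's standard `urllib.robotparser` module; the rules, the `ExampleBot` user agent, and the `www.example.com` URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleBot"  # hypothetical crawler name

for url in (
    "http://www.example.com/index.html",
    "http://www.example.com/private/report.html",
):
    print(url, parser.can_fetch(USER_AGENT, url))

# Seconds the site asks crawlers to wait between requests, if specified.
print(parser.crawl_delay(USER_AGENT))
```

A polite crawler would skip any URL for which `can_fetch` returns `False` and would wait at least the crawl delay between successive GET requests to the same host.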

Page 14: What are we searching for? {week  9 }

Sitemaps

default priority is 0.5

some URLs might not be discovered by crawler
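The sitemap shown on this slide is not part of the transcript, so the sketch below parses a small hypothetical one. It uses the element names and namespace of the sitemaps.org XML format and applies the default priority of 0.5 when a URL entry omits `<priority>`. The URLs and the `parse_sitemap` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the second entry has no <priority> element.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2010-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (url, priority) pairs from sitemap XML text."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        # Fall back to the default priority of 0.5 when none is given.
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        entries.append((loc, priority))
    return entries


for loc, priority in parse_sitemap(SITEMAP_XML):
    print(loc, priority)
```

Listing URLs explicitly in a sitemap helps with pages the crawler might never reach by following links alone.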

Page 15: What are we searching for? {week  9 }

A day in the life of a crawler

What about checking for updated pages?

Page 16: What are we searching for? {week  9 }

Freshness vs. age

Freshness is essentially a Boolean value

Age measures the degree to which a crawled page is out of date
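The contrast between the two measures can be sketched as follows. This assumes one common formalization, which the slide does not spell out: a crawled copy is fresh if the live page has not changed since it was crawled, and the age of a stale copy is the time elapsed since the live page changed. Times are plain numbers (for example, days), and the function names are hypothetical.

```python
def is_fresh(last_crawled, last_modified):
    """Boolean freshness: True if the page has not changed since the crawl."""
    return last_modified <= last_crawled


def age(now, last_crawled, last_modified):
    """Age of the crawled copy: 0 while fresh, otherwise the time elapsed
    since the live page changed (assumed definition)."""
    if is_fresh(last_crawled, last_modified):
        return 0
    return now - last_modified


# Page crawled on day 10, then modified on day 15; today is day 20.
print(is_fresh(10, 15))  # the copy is stale
print(age(20, 10, 15))   # and has been stale for 5 days
```

Freshness only says whether a copy is stale; age also says for how long, which lets a crawler distinguish a copy that went stale an hour ago from one that has been out of date for months.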