Search Engines

© Tefko Saracevic 1 part 1: search engines part 2: digital libraries

Transcript of Search Engines

Page 1: Search Engines

part 1: search engines

part 2: digital libraries

Page 2: Search Engines

dictionary definitions

search (computing, transitive verb): to examine a computer file, disk, database, or network for particular information

engine: something that supplies the driving force or energy to a movement, system, or trend

search engine: a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet

Page 3: Search Engines

about definition of search engines

• oh well … search engines do not search only for keywords; some search for other stuff as well

• and they are really not “engines” in the classical sense, but then a mouse is not a “mouse”

Page 4: Search Engines

use of search engines … among others

Page 5: Search Engines

How Search Engines Work (Sherman 2003)

[Diagram. Labels: Your Browser; The Web (URL1, URL2, URL3, URL4); Crawler; Indexer; Search Engine; Database. Example query “Eggs?” with ranked matches: Eggs 90%, Eggo 81%, Ego 40%, Huh? 10%; sample document “All About Eggs” by S. I. Am.]

Page 6: Search Engines

how do search engines work? elaboration

• crawlers, spiders: go out to find content
  - in various ways go through the web looking for new & changed sites
  - periodic, not for each query
  - no search engine works in real time
  - some search engines do it for themselves, others do not: they buy content from companies such as Inktomi
  - for a number of reasons crawlers do not cover all of the web – just a fraction
  - what is not covered is the “invisible web”
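A minimal sketch of the crawling idea, assuming a tiny in-memory stand-in for the web (the `TOY_WEB` dictionary, its URLs, and the `crawl` function are all hypothetical, and nothing is fetched over a network). It illustrates why a crawler that only follows links covers just a fraction of pages: anything that no crawled page links to is never found.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "url1": ["url2", "url3"],
    "url2": ["url3"],
    "url3": ["url1", "url4"],
    "url4": [],
    "url5": ["url1"],  # no page links to url5, so link-following never reaches it
}

def crawl(seed, web, max_pages=100):
    """Breadth-first crawl starting at seed; returns URLs in visiting order."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

covered = crawl("url1", TOY_WEB)
# Pages that exist but were never reached: a toy version of the "invisible web".
invisible = sorted(set(TOY_WEB) - set(covered))
```

A real crawler would also need politeness rules, revisit schedules, and duplicate detection, none of which are modeled here.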

Page 7: Search Engines

elaboration …

• organizing content: labeling, arranging
  - indexing for searching – automatic; keywords and other fields
  - arranging by URL popularity – PageRank, as in Google
  - classifying as directory – mostly handpicked & classified by humans

• as a result of different organization we have basically two kinds of search engines:
  - search – input is a query that is then searched & displayed
  - directory – classified content – a class is displayed
  - and fused: directories now also have search capabilities & vice versa
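Automatic keyword indexing can be sketched as an inverted index: a map from each word to the documents that contain it. This is an illustration under simple assumptions (whitespace tokenization, lowercase matching, hypothetical document ids and texts), not a description of how any particular engine indexes.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every word of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
DOCS = {
    "d1": "green eggs and ham",
    "d2": "eggs benedict recipe",
    "d3": "ego and identity",
}
INDEX = build_index(DOCS)
```

With this index, `search(INDEX, "eggs")` looks up one set instead of scanning every document, which is the point of indexing ahead of query time.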

Page 8: Search Engines

elaboration (cont.)

• databases, caches: storing content
  - humongous files, usually distributed over many computers

• query processor: searching, retrieval, display
  - takes your query as input
  - engines have differing rules for how it is handled
  - displays ranked output
  - some engines also cluster output and provide visualization

• at the other end is your browser
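Ranked output can be illustrated with a deliberately simple scoring rule: the share of a document's words that match the query. The rule, the `rank` function, and the sample documents are assumptions made for this sketch; real engines use their own, mostly undisclosed, rules.

```python
def rank(query, docs):
    """Score each document by the fraction of its words that are query terms,
    and return (doc_id, score) pairs, best first. Non-matching docs are dropped."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        if not words:
            continue
        hits = sum(1 for word in words if word in terms)
        if hits:
            scored.append((doc_id, hits / len(words)))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Hypothetical sample documents.
DOCS = {
    "a": "eggs eggs ham",
    "b": "eggs benedict recipe guide",
    "c": "ego and identity",
}
ranked = rank("eggs", DOCS)
```

Here document "a" ranks above "b" because a larger share of its words match, and "c" is not returned at all.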

Page 9: Search Engines

elaboration … similarities, differences

• all search engines have these basic parts in common

• BUT the actual processes – the methods by which they do it – are based on various algorithms & they differ
  - most are proprietary, with details kept mostly secret, but based on well-known principles from information retrieval or classification
  - to some extent Google is an exception – they published their method

Page 10: Search Engines

case of Google

• developed by Sergey Brin and Lawrence Page while students at Stanford
  - in the beginning run on Stanford computers

• basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
  - well written, simple language, has their pictures
  - in the acknowledgement they cite the support of NSF’s Digital Library Initiative, i.e. initially, Google came out of government-sponsored research
  - describe their method PageRank – based on ranking hyperlinks, as in citation indexing
  - “We chose our system name, Google, because it is a common spelling of googol, or 10^100”
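The idea of ranking pages by their incoming links can be sketched with a few rounds of power iteration. This is a simplified illustration, not the production method: it uses the normalized variant in which ranks sum to 1, a damping factor of 0.85, an even spread for pages with no outgoing links, and a hypothetical three-page link graph.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute link-based ranks; links maps page -> list of targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

In this graph C ends up ranked highest because both other pages link to it, which mirrors the citation-indexing intuition: a page is important if important pages point to it.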

Page 11: Search Engines

coverage differences

• no engine covers more than a fraction of the WWW
  - estimates: none more than 16%
  - hard (even impossible) to discern & compare coverage, but they differ substantially in what they cover

• in addition:
  - many national search engines: own coverage, orientation, governance
  - many specialized or domain search engines: own coverage geared to subject of interest
  - many comprehensive sources independent of search engines: some have compilations of evaluated web sources

Page 12: Search Engines

searching differences

• substantial differences among search engines in searching, retrieval, display
  - need to know how they work & differ in respect to:
  - defaults in searching a query
  - searching of phrases, case sensitivity, categories
  - searching of different fields, formats, types of resources
  - advanced search capabilities and features
  - possibilities for refinement, using relevance feedback
  - display options
  - personalization options

Page 13: Search Engines

business model differences

several business models

• public good – have independent budget
  - e.g. PubMed, Librarians’ Index to Internet

• earn revenue from provision of information
  - all commercial search engines

• using search engines to promote their other activities
  - e.g. telephone directories

Page 14: Search Engines

sponsorship differences

• need to understand treatment of sponsorship – it influences what engines search & how they display results
  - some list results from sponsored sites separately, so you are reasonably clear what is there because it is sponsored & what is not
  - some have display-per-pay – showing first the sites that paid most & do not even tell you that
  - some have pay per update of sites

• imperative to find sources that explain these models for different engines, to know what is covered & what you are getting

Page 15: Search Engines

limitations

• every search engine has limitations as to
  - coverage: meta engines just follow coverage limitations & have more of their own
  - search capabilities
  - finding quality information

• some have compromised search with economics, becoming little more than advertisers

• but search engines are also many times victims of spamindexing, affecting what is included and how it is ranked

Page 16: Search Engines

spamming a search engine

• use of techniques that push rankings higher than they belong is also called spamdexing
  - methods typically include textual as well as link-based techniques
  - like e-mail spam, search engine spam is a form of adversarial information retrieval
  - the conflicting goals: accurate results for search providers vs. high positioning of content in page rank

Page 17: Search Engines

meta search engines

• meta engines search multiple engines
  - getting combined results from a variety of engines

• do not have their own databases
  - but have their own business models affecting results

• a number of techniques used
  - interesting ones: clustering, statistical analyses
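How a meta engine might combine results can be sketched by merging the ranked lists returned by several engines. The scoring rule (sum of 1/position across engines), the engine names, and the URLs are all assumptions for illustration; actual meta engines use their own techniques.

```python
from collections import defaultdict

def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one ranked list.
    Returns (url, engines_that_returned_it) pairs, best first."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in results_by_engine.items():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / position  # earlier positions count for more
            sources[url].append(engine)
    ordered = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ordered]

# Hypothetical result lists from two hypothetical engines.
RESULTS = {
    "engine_a": ["x.example", "y.example", "z.example"],
    "engine_b": ["y.example", "w.example", "x.example"],
}
merged = merge_results(RESULTS)
```

Keeping the source list per URL also shows the overlap between engines: here only x.example and y.example were returned by both.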

Page 18: Search Engines

how to find a search engine?

• variety of resources that list or categorize engines
  - SearchEngines.com: search for engines by topic, geography, reference
  - Search Engine Guide: engines categorized by topic; other engine information
  - Search Engine Colossus: international directory of search engines by country, topic, from 198 countries and 61 territories; engines in choice of languages
  - Phil Bradley’s country based search engines: over 2000 search engines from countries all over the globe

Page 19: Search Engines

sample of meta engines – with organized results

Dogpile: results from a number of leading search engines; gives source, so overlap can be compared (also has a (bad) joke of the day)

Surfwax: gives statistics and text sources & linking to sources; for some terms gives related terms to focus

Teoma: results with suggestions for narrowing; links resources derived; originated at Rutgers

Turbo10: provides results in clusters; engines searched can be edited

Page 20: Search Engines

meta search engines (cont.)

• Large directory
  - Complete Planet: directory of over 70,000 databases & specialty engines

• Results with graphical displays
  - Vivisimo: clusters results; innovative
  - Webbrain: results in tree structure – fun to use
  - Kartoo: results in display by topics of query

Page 21: Search Engines

domain engines & catalogs

• cover specific subjects & topics

• important tool for subject searches
  - particularly for subject specialists
  - valued by professional searchers

• selection mostly hand-picked rather than by crawlers, following inclusion criteria
  - often not readily discernible
  - but content more trustworthy

Page 22: Search Engines

domain engines … sample

Open Directory Project: large edited catalog of the web – global, run by volunteers

BUBL LINK: selected Internet resources covering all academic subject areas; organized by Dewey Decimal System – from UK

Profusion: search in categories for resources & search engines

Resource Discovery Network – UK: “UK's free national gateway to Internet resources for the learning, teaching and research community”

Page 23: Search Engines

domain engines … sample

Think Quest – Oracle Education Foundation: education resources, programs; web sites created by students

All Music Guide: resource about musicians, albums, and songs

Internet Movie Database: treasure trove of American and British movies

Genealogy links and surname search engines: well … that is getting really specialized (and popular)

Daypop: searches the “living web”. “The living web is composed of sites that update on a daily basis: newspapers, online magazines, and weblogs”

Page 24: Search Engines

science, scholarship engines … sample, free access

Psychcrawler – Amer Psychological Association: web index for psychology

Entrez PubMed – Nat Library of Medicine: biomedical literature from MEDLINE & health journals

CiteSeer – NEC Research Center: scientific literature, citations index; strong in computer science

Scholar Google: searches for scholarly articles & resources

Infomine: scholarly internet research collections

Scirus: scientific information in journals & on the web

Page 25: Search Engines

science, scholarship engines … sample, commercial access

• an addition to freely accessible engines
  - many provide search free but access to full text paid by subscription or per item
  - RUL provides access to these & many more:

ScienceDirect – Elsevier: “world's largest electronic collection of science, technology and medicine full text and bibliographic information”

ACM Portal – Assoc. for Computing Machinery: access to ACM Digital Library & Guide to Computing

Page 26: Search Engines

where to find out?

• information about search engines in sources that have updates, news, tips for searching and more – a MUST for searchers:
  - Search Engine Watch: ratings, news, statistics, charts, explanations, tutorials
  - Search Engine Showdown: “The users’ guide to web searching” – run by a librarian; news links, ratings
  - Virtual Chase: a site about “Teaching Legal Professionals How To Do Research”; this section has very good tips and links for consideration of quality on the web

Page 27: Search Engines

where? ….

SiteLines: a blog written by Rita Vine, a professional librarian & web search trainer; many evaluations in archive

ResourceShelf: “Resources and News for Information Professionals,” edited by Gary Price, a librarian & author of Invisible Web – has extensive archive

WebsearchAbout: not evaluative, but provides news, capabilities, sources, articles about web searching

Page 28: Search Engines

art of searching search engines

Page 29: Search Engines

part 2: digital libraries

Page 30: Search Engines

definition

• digital libraries are viewed from several perspectives
  - technical: “Digital library is a managed collection of information, with associated services, where information is stored in digital format and accessible over a network.” (Arms, 2000)
  - institutional: “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities.” (Waters, 1998)

Page 31: Search Engines

a bit of context

• short but volatile history
  - research & development took off by start/mid 1990’s
  - in the next decade phenomenal growth worldwide
  - large investment in research & building

• number of communities involved
  - computer science, primarily in research
  - many subjects: digital libraries in their domain
  - library & information science: operations, studies of users, use, usability

• number of types emerged

Page 32: Search Engines

libraries & digital resources

• libraries (particularly research, academic & special) directed massive funding toward such resources
  - electronic journals
  - databases
  - catalogs
  - digitization of parts of collection

• thus becoming in effect digital libraries – or more accurately hybrid libraries
  - with graphic and digital versions or types of resources

Page 33: Search Engines

emphasis here

• on large academic or research digital libraries that also are related to searching
  - provide search capabilities or access to search engines
  - provide electronic journals that provide full text of articles after a search

• such libraries have also become search portals of sorts, essential for their users in education, research & related activities

Page 34: Search Engines

sample

New York Public Library Digital: “NYPL Digital is your gateway to The New York Public Library’s rare and unique collections in digitized form.” Includes access to searchable databases

U California Berkeley Digital Library SUNsite: “builds digital collections and services while providing information and support to digital library developers worldwide.”

The British Library: “The world’s knowledge.” Includes “Services for library and information Professionals.”

Los Angeles Public Library Kids’ Path: resources for children; search through directory

Page 35: Search Engines

sample …

New Zealand Digital Library: searching of a number of digital collections, including humanity development library

Research Libraries Group: “RLG is a not-for-profit organization of over 150 research libraries, archives, museums, and other cultural memory institutions.” Includes links to a number of searchable collections

Public Library of Science: “PLoS is a nonprofit organization of scientists and physicians committed to making the world's scientific and medical literature a public resource.” Publishes open access journals

Page 36: Search Engines

Rutgers libraries – digital components

• strategic planning in developing digital access

• rich & complex content of digital resources
  - several hundred indexes & databases for searching
  - some 20,000 electronic journals
  - thousand & more digital reference sources
  - subject research guides
  - Searchpath & other tutorials
  - electronic reserve

• affected teaching, learning, research by the whole community

Page 37: Search Engines

some critical issues for searching

• no way yet to do federated searching in digital libraries
  - to search several indexes at the same time
  - each source has to be searched separately
  - most have very different search features, capabilities

• finding items in indexes does not mean that you are always able to get full text

• thus, searching is time-consuming, chaotic

Page 38: Search Engines

where to find out?

• information about digital libraries
  - LibWeb, U California, Berkeley: “lists currently over 7200 pages from libraries in over 125 countries”
  - Digital Library Federation: “a consortium of libraries and related agencies that are pioneering the use of electronic-information technologies to extend their collections and services”
  - D-Lib Magazine: “a solely electronic publication with a primary focus on digital library research and development, including but not limited to new technologies, applications, and contextual social and economic issues”

Page 39: Search Engines

where? …

Ariadne (UK): “to report on information service developments and information networking issues worldwide, keeping the busy practitioner abreast of current digital library initiatives”

Information Technology and Libraries: ALA publication; “related to all aspects of libraries and information technology, including digital libraries”

Journal of Digital Information: “Publishing papers on the management, presentation and uses of information in digital environments”

Biblio Tech Review: “Information Technology for Libraries” – monthly news and review magazine

Page 40: Search Engines

in conclusion

• search engines are great but you have to KNOW what is under the hood
  - as to coverage, business model, search features, outputs …
  - they are NOT for every kind of information need

• digital libraries are great for searching but you have to KNOW requirements for searching different resources that are included
  - there is no federated searching as yet, or for the time to come

Page 41: Search Engines

art of searching digital libraries

Page 42: Search Engines

and rewards …