Tefko Saracevic

search engines

digital libraries

tefkos@rutgers.edu; http://comminfo.rutgers.edu/~tefko/

Central ideas

Search engines

While the structure & basic operation of search engines are similar:

• a great number & variety exist beyond Google, each with its own features
• many of them cover specialized domains

Digital libraries

As a searcher you are also using digital libraries. They have rich & varied resources of use in:

• accessing & searching a variety of databases & reference tools in many domains
• accessing journals for delivery of full texts in all fields

Knowing searching = also knowing these resources

ToC

1. Search engines
2. Digital libraries

1. Search engines

Definitions. How they work. Diversity

dictionary definitions

search
  COMPUTING (transitive verb): to examine a computer file, disk, database, or network for particular information

engine
  something that supplies the driving force or energy to a movement, system, or trend

search engine
  a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet

about the definition of search engines

• oh well … search engines do not search only for keywords; some search for other things as well

• and they are really not “engines” in the classical sense, but then a mouse is not a “mouse” either


use of search engines … among others

How Search Engines Work (Sherman 2003)

[Diagram: a crawler fetches pages (URL1, URL2, URL3, URL4) from the Web; an indexer stores them in a database; your browser sends a query ("Eggs?") to the search engine, which returns ranked results (Eggs - 90%, Eggo - 81%, Ego - 40%, Huh? - 10%), such as "All About Eggs" by S. I. Am.]

how do search engines work? elaboration

• crawlers, spiders: go out to find content
  in various ways go through the web looking for new & changed sites
  periodic, not for each query
    no search engine works in real time
  some search engines do it for themselves, others do not
    they buy content from other companies
  for a number of reasons crawlers do not cover all of the web, just a fraction
    what is not covered is the “invisible web”
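The periodic crawl described above can be sketched as a breadth-first walk over links. This is only an illustration under simple assumptions, not how any particular engine does it: the in-memory `TOY_WEB` dictionary stands in for the real web, and the names `crawl` and `max_pages` are hypothetical. Pages that no visited page links to are never reached, which mirrors the point about the invisible web.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "URL1": ["URL2", "URL3"],
    "URL2": ["URL3"],
    "URL3": ["URL4"],
    "URL4": [],
    "URL5": ["URL1"],  # no page links here, so a crawl from URL1 never finds it
}

def crawl(seed, web, max_pages=10):
    """Breadth-first crawl from seed; returns visited URLs in discovery order."""
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

covered = crawl("URL1", TOY_WEB)
invisible = sorted(set(TOY_WEB) - set(covered))
print("covered:", covered)
print("not covered:", invisible)
```

Lowering `max_pages` shrinks the covered fraction further, a rough stand-in for the resource limits that keep real crawlers from covering everything.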

elaboration …

• organizing content: labeling, arranging
  indexing for searching (automatic): keywords and other fields
  arranging by URL popularity: PageRank, as in Google
  classifying as a directory: mostly handpicked & classified by humans

• as a result of different organization we basically have several kinds of search engines:
  search: input is a query that is then searched & the results displayed
  directory: classified content; a class is displayed
  fused: directories now also have search capabilities & vice versa

elaboration (cont.)

• databases, caches: storing content
  humongous files, usually distributed over many computers

• query processor: searching, retrieval, display
  takes your query as input
    engines have differing rules for how it is handled
  displays ranked output
    some engines also cluster output and provide visualization

• at the other end is your browser
  in addition to Explorer a number of others exist
    Mozilla Firefox, for instance, became quite popular
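The indexer, database, and query processor described in these elaboration slides can be illustrated with a toy inverted index. This is a sketch under simple assumptions: the pages in `PAGES` are made up, terms are lowercased words, and the score is a plain count of query-term occurrences. Real engines apply their own, mostly proprietary, rules.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages (URL -> text).
PAGES = {
    "URL1": "All about eggs: eggs for breakfast",
    "URL2": "Eggo waffles and eggs",
    "URL3": "The ego and the id",
}

def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: build an inverted index, term -> {url: occurrences}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index

def search(index, query):
    """Query processor: return (url, score) pairs, highest score first."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    # Ties are broken alphabetically by URL so the output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

INDEX = build_index(PAGES)
print(search(INDEX, "Eggs"))
```

The index is built once, ahead of time, and each query only looks terms up in it, which is why results reflect the last crawl rather than the live web.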

elaboration … similarities, differences

• all search engines have these basic parts in common

• BUT the actual processes (the methods by which they do it) are based on various algorithms & they differ
  most are proprietary, with details kept secret, but based on well-known principles from information retrieval or classification
  to some extent Google is an exception: they published their original method, but nothing further

case of Google

• developed by Sergey Brin and Lawrence Page while students at Stanford
  in the beginning it ran on Stanford computers

• basic approach has been described in their famous paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
  well written, in simple language; has their pictures
  in the acknowledgement they cite the support of NSF’s Digital Library Initiative, i.e. initially, Google came out of government-sponsored research
  they describe their method PageRank, based on ranking hyperlinks as in citation indexing
  “We chose our system name, Google, because it is a common spelling of googol, or 10^100”
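The idea of ranking pages by their incoming hyperlinks, as in citation indexing, can be sketched with a simplified power iteration. This is not Google's production method. It is a textbook-style variant with ranks normalized to sum to 1, and the link graph, the damping value of 0.85, the iteration count, and the handling of pages without outlinks are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C is linked to by both A and B.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up ranked highest, the same intuition by which a frequently cited paper counts for more.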

coverage differences

• no engine covers more than a fraction of the WWW
  estimates: none covers more than 16%
  hard (even impossible) to discern & compare coverage, but they differ substantially in what they cover

• in addition:
  many national search engines
    own coverage, orientation, governance
  many specialized or domain search engines
    own coverage geared to the subject of interest
  many comprehensive sources independent of search engines
    some have compilations of evaluated web sources

searching differences

• substantial differences among search engines in searching, retrieval & display
  need to know how they work & differ with respect to:
    defaults in searching a query
    searching of phrases, case sensitivity, categories
    searching of different fields, formats, types of resources
    advanced search capabilities and features
    possibilities for refinement, using relevance feedback
    display options
    personalization options

• Greg Notess’ chart & features describe the differences

business model differences

several business models:

• public good: have an independent budget
  e.g. PubMed, Librarians’ Index to Internet

• earn revenue from provision of information
  all commercial search engines

• using search engines to promote their other activities
  e.g. telephone directories

sponsorship differences

• need to understand the treatment of sponsorship, since it influences what engines search & how they display results
  some list results from sponsored sites separately, so it is reasonably clear what is sponsored & what is not
  some have display-per-pay: they show first the sites that paid most & do not even tell you that
  some have pay per update of sites

• imperative to find sources that explain these models for different engines, to know what is covered & what you are getting

limitations

• every search engine has limitations as to:
  coverage
    meta engines inherit these coverage limitations & add more of their own, so be careful in their use
  search capabilities
  finding quality information

• some have compromised search with economics, becoming little more than advertisers

• but search engines are also many times victims of spamindexing, which affects what is included and how it is ranked

spamming a search engine

• use of techniques that push rankings higher than they belong; also called spamdexing
  methods typically include textual as well as link-based techniques
  like e-mail spam, search engine spam is a form of adversarial information retrieval
    the conflicting goals: accurate results sought by search providers vs. high positioning (page rank) sought for content

• search engines are constantly battling this with their own special (& secret) tools

search engine features, reviews, tutorials

• Search Engine Showdown
  lists, reviews, follows search engines, blog; look at the Chart
  by Greg Notess (librarian); book Teaching Web Search Skills has live links

• Recommended search engines by UC Berkeley
  library workshop; lists features, evaluates

• Search Basics: Web Search Essentials
  among others, has a large section on search engines

• Search features chart
  with explanations

how to find a search engine?

• resources that list or categorize engines
  Search Engine Guide
    engines categorized by topic; other engine information
  Search Engine Colossus
    international directory of search engines by country, topic; from 351 countries and territories; engines in many languages
  Phil Bradley’s country based search engines
    “currently a total of 4,017 search engines and 222 countries, territories, islands and regions”

all questions are not created equal

• what engine, what resource to use for what kind of question or information need?
  An exhaustive classification in:
    Finding information: search engines by Phil Bradley
  Sources for different topics:
    Choose the Best Search for Your Information Need by NoodleTools
  List of capabilities for major search engines:
    Best Search Tools Chart by Infopeople

meta search engines

• meta engines search multiple engines
  getting combined results from a variety of engines

• do not have their own databases
  but have their own business models affecting results

• a number of techniques used
  interesting ones: clustering, statistical analyses
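Combining results from several engines can be sketched as a rank-merging step. The example below uses reciprocal rank fusion, a generic technique chosen here for illustration. The slides do not say which method any particular meta engine uses. The engine names, result lists, and the constant `k=60` are assumptions.

```python
from collections import defaultdict

# Hypothetical ranked result lists from three engines (best first).
ENGINE_RESULTS = {
    "engine_a": ["URL1", "URL2", "URL3"],
    "engine_b": ["URL2", "URL1", "URL4"],
    "engine_c": ["URL2", "URL5"],
}

def merge_results(engine_results, k=60):
    """Reciprocal rank fusion: each engine adds 1 / (k + rank) to a URL's score."""
    scores = defaultdict(float)
    sources = defaultdict(list)
    for engine, urls in engine_results.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
            sources[url].append(engine)
    # Highest combined score first; ties broken alphabetically by URL.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sources[url]) for url in ranked]

merged = merge_results(ENGINE_RESULTS)
for url, engines in merged:
    print(url, engines)
```

Keeping the list of source engines next to each URL shows how much the engines overlap. Note that a meta engine built this way can only return what the underlying engines already cover.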

sample of meta engines with organized results

Dogpile
  results from a number of leading search engines; gives source, so overlap can be compared; has SearchSpy, listing searches that were performed

Surfwax
  gives text sources & linking to sources; for some terms gives related terms to focus

Turbo10
  provides results in clusters; engines searched can be edited

Clusty
  results grouped by topics or clusters for further sources

meta search engines (cont.)

• large directory
  Complete Planet
    directory of over 70,000 databases & specialty engines; classified

• results with graphical displays
  Kartoo
    results in display by topics of query

• new kid on the block (not a meta engine, but a search engine)
  Cuil
    Claim: “Cuil searches more pages on the Web than anyone else - three times as many as Google and ten times as many as Microsoft”. Well … I do not know if it holds.

multilingual

• English is still the major language but declining, now slightly over 50%

• multilingual retrieval search engines
  Euroseek
    searches in a number of languages
  All the Web
    results in 45 languages

where to find out?

• information about search engines in sources that have updates, news, tips for searching and more; a MUST for searchers:
  Search Engine Watch
    ratings, news, statistics, charts, explanations, tutorials
  Search Engine Showdown
    “The users’ guide to web searching”; run by a librarian; news, links, ratings
  Virtual Chase
    a site about “Teaching Legal Professionals How To Do Research”; this section has very good tips and links for consideration of quality on the web

where? …

SiteLines
  a blog, written by Rita Vine, a professional librarian & web search trainer; many evaluations in archive

ResourceShelf
  “Resources and News for Information Professionals,” edited by Gary Price, a librarian & author of Invisible Web; has extensive archive

WebsearchAbout
  not evaluative, but provides news, capabilities, sources, articles about web searching


art of searching search engines


part 2: digital libraries


definition

• digital libraries are viewed from several perspectives

  technical: “Digital library is a managed collection of information, with associated services, where information is stored in digital format and accessible over a network.” (Arms, 2000)

  institutional: “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities.” (Waters, 1998)

a bit of context

• digital libraries have a short but volatile history
  research & development took off by the start/mid 1990s
  in the next decade, phenomenal growth worldwide
  large investment in research, development, keeping up

• a number of communities involved
  computer science, primarily in research
  library & information science: operations, studies of users, use, usability
  many subjects: digital libraries in their domain

• diversity is large
  many institutions, e.g. museums, developed their own

libraries & digital resources

• libraries (particularly research, academic & special) invested massive & ongoing funding toward
  electronic journals
  databases
  reference sources
  digitization of parts of collection

• thus becoming in effect digital libraries, or more accurately hybrid libraries
  with graphic and digital versions or types of resources

RUL (Rutgers University Libraries) has substantial holdings & expenditures in all of these

emphasis here

• on large academic or research digital libraries that are also related to searching, including provision of search capabilities & access to
  databases
  electronic journals that provide full text of articles after a search
  digital reference sources

• such libraries have also become search portals of sorts, essential for their users in education, research & related activities

sample

New York Public Library Digital Collections
  a gateway to rare and unique collections in digitized form & to databases; access to most searchable databases requires a library card number

U California Berkeley Digital Library SUNsite
  digital collections and services

The British Library
  “The world’s knowledge.” Includes “Services for library and information Professionals.”

Los Angeles Public Library Kids’ Path
  resources for children; search through directory

sample …

New Zealand Digital Library
  searching of a number of digital collections, incl. humanitarian and UN collections; provision of free software for digital libraries

Public Library of Science
  “PLoS is a nonprofit organization of scientists and physicians committed to making the world's scientific and medical literature a public resource.” Publishes open access journals

Closer to home: New Brunswick Free Public Library
  has online resources, databases (some require library PIN), historical archives and more
  an example of the great many public libraries that have databases for searching

Rutgers libraries – digital components

• strategic planning in developing digital access

• rich & complex content of digital resources
  several hundred indexes & databases for searching
  some 20,000 electronic journals
  a thousand & more digital reference sources
  subject research guides
  Searchpath & other tutorials
  electronic reserve

• affected teaching, learning, research by the whole community

some critical issues for searching

• no way yet to do effective federated searching in digital libraries (to search several indexes at the same time)
  RUL has Searchlight, which searches only 8 major databases
  each source has to be searched separately
    most have very different search features, capabilities

• finding items in indexes does not mean that you are always able to get the full text

• thus, searching is time-consuming, chaotic
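Why federated searching is hard can be sketched in a few lines: each source expects its own query syntax, so a single query has to be translated per source before results can be collected. Everything below is hypothetical. The two sources, their syntaxes, and their record identifiers are invented for illustration and do not describe Searchlight or any real database.

```python
# Hypothetical sources: each has its own query syntax and its own records.
def search_index_a(query):
    # Assumed syntax: a field prefix, e.g. "TI=digital libraries"
    records = {"TI=digital libraries": ["A-101", "A-205"]}
    return records.get(query, [])

def search_index_b(query):
    # Assumed syntax: a quoted phrase followed by a field tag
    records = {'"digital libraries"[title]': ["B-7"]}
    return records.get(query, [])

# One translator per source turns a plain title phrase into that source's syntax.
SOURCES = {
    "index_a": (lambda phrase: f"TI={phrase}", search_index_a),
    "index_b": (lambda phrase: f'"{phrase}"[title]', search_index_b),
}

def federated_search(phrase, sources):
    """Send one phrase to every source and collect the results by source."""
    results = {}
    for name, (translate, search) in sources.items():
        results[name] = search(translate(phrase))
    return results

hits = federated_search("digital libraries", SOURCES)
print(hits)
```

Every added source needs its own translator, and features that one source supports may have no equivalent in another, which helps explain why searching each source separately is still the norm.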

where to find out?

• information about digital libraries for searching
  LibWeb (at Webjunction, formerly at U California, Berkeley)
    “lists currently over 7900 pages from libraries in over 146 countries”
  Digital Library Federation
    “a consortium of libraries and related agencies that are pioneering the use of electronic-information technologies to extend their collections and services”
  D-Lib Magazine
    “a solely electronic publication with a primary focus on digital library research and development, including but not limited to new technologies, applications, and contextual social and economic issues”

where? …

Ariadne (UK)
  “to report on information service developments and information networking issues worldwide, keeping the busy practitioner abreast of current digital library initiatives”

Journal of Digital Information
  “Publishing papers on the management, presentation and uses of information in digital environments”

Tool Kit for the Expert Web Searcher
  one of the wikis by the Library and Information Technology Association, a division of the American Library Association

Expert Web Search Tips
  one of many informative articles from the Living Internet

in conclusion

• search engines are great but you have to KNOW what is under the hood
  as to coverage, business model, search features, outputs …
  they are NOT for every kind of information need

• digital libraries are great for searching but you have to KNOW the requirements for searching the different resources that are included
  as yet federated searching is limited


art of searching digital libraries


more


and rewards …
