1 CS 430 / INFO 430 Information Retrieval Lecture 24 Architecture of Information Retrieval Systems.
1
CS 430 / INFO 430 Information Retrieval
Lecture 24
Architecture of Information Retrieval Systems
2
Course Administration
CS 490 and CS 790 Independent Research Projects
• Web Research Infrastructure -- Build a system to bring complete crawls of the Web from the Internet Archive to the Cornell Theory Center and make them available for researchers through a standard API. (Continues planning work carried out this semester.)
• There will not be an independent research project in information retrieval.
3
Course Administration
Final Examination
• The final examination is on Monday, December 13, between 12:00 and 1:30.
• A make-up examination will be available on another date, which has not yet been chosen. The proposed date is December 9. If you would like to take the make-up examination, send an email message to Anat Nidar-Levi ([email protected]).
4
Distributed Architecture 1: Standard Search Protocols
[Figure: user interfaces send "Find x" queries to search services.]
Strict adherence to standards allows any user interface to search any conforming search service.
5
Distributed Architecture 1: Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching Library Catalogs
Content: Anglo American Cataloging Rules
Structure of Content: MARC
Encoding Rules: Basic Encoding Rules (character sets, separators, etc.)
Message Passing Protocol: Z 39.50
Query Format: Bib 1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards, e.g. the Internet suite of protocols.
6
Z39.50 principles
• Servers store a set of databases with searchable indexes
• Interactions are based on a session
• The client opens a connection with the server(s), carries out a sequence of interactions and then closes the connection.
• During the course of the session, both the server and the client remember the state of their interaction.
7
State
Z39.50
• The server carries out the search and builds a results set
• Server saves the results set.
• Subsequent messages from the client can reference the results set.
• Thus the client can narrow a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
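The value of server-side state can be shown with a toy model. The sketch below is not the Z39.50 wire protocol; the `Session` class, its method names, and the sample records are all hypothetical. It only illustrates how a saved results set lets the client refine or page through results without a fresh scan of the database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    """Toy stateful search session (illustrative, not real Z39.50)."""
    database: Dict[int, str]                                  # record id -> record text
    result_sets: Dict[str, List[int]] = field(default_factory=dict)

    def search(self, name: str, term: str, within: Optional[str] = None) -> int:
        """Run a search, save the results set under `name`, return its size."""
        # Refining an earlier set scans only that set, not the whole database.
        candidates = self.result_sets[within] if within else list(self.database)
        hits = [rid for rid in candidates
                if term.lower() in self.database[rid].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name: str, start: int, count: int) -> List[str]:
        """Return `count` records from a saved set, from 1-based position `start`."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [self.database[rid] for rid in ids]


session = Session({
    1: "Information retrieval systems",
    2: "Digital library architecture",
    3: "Retrieval in digital library catalogs",
})
session.search("A", "retrieval")              # set A holds records 1 and 3
session.search("B", "library", within="A")    # set B is refined from set A only
second_of_a = session.present("A", 2, 1)      # fetch one record by position
```

The server pays for this convenience by keeping every results set alive until the session closes.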
8
Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching Library Catalogs
The Z 39.50 family of standards has proved successful in a tightly knit community, where:
• There is a strong tradition of standardization, with many professionally trained people.
• The categories of material change gradually, allowing a slow-moving standardization process.
The standardization approach has failed where these two criteria are not met.
Historic note: WAIS was based on an early version of Z39.50.
9
Distributed Architecture 2: Broadcast Search (a.k.a. Federated Search)
[Figure: the user sends "Find x" to an Interface Service, which passes the query on to the collections.]
An interface server broadcasts a query to each collection, combines the results and returns them to the user.
Examples: Dienst (digital library protocol), Web metasearch services
10
Distributed Architecture 2: Broadcast Search
Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z 39.50, http, etc.).
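A minimal sketch of an interface service, assuming each collection is reachable through a plain Python function. The helper names (`make_collection`, `broadcast_search`), the collection names, and the timeout value are hypothetical. The query goes to every collection in parallel, and any collection that has not answered when the timeout expires is reported as missing.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def make_collection(docs, delay=0.0):
    """Return a hypothetical search function over one collection."""
    def search(query):
        time.sleep(delay)                       # simulate network latency
        return [d for d in docs if query in d.lower()]
    return search


def broadcast_search(query, collections, timeout=0.5):
    """Send `query` to every collection; keep whatever returns in time."""
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(fn, query): name for name, fn in collections.items()}
    done, not_done = wait(futures, timeout=timeout)
    results = {futures[f]: f.result() for f in done}
    missing = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False)                   # do not block on stragglers
    return results, missing


collections = {
    "physics": make_collection(["Energy and motion", "Optics"]),
    "biology": make_collection(["Energy in cells"], delay=1.0),  # slow server
}
results, missing = broadcast_search("energy", collections, timeout=0.2)
```

Here the slow collection holds a matching document, but the combined result does not include it.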
11
Distributed Architecture 2: Broadcast Search
Problems with Broadcast Search
• Performance: If any collection does not respond, the Interface Server waits for a time out.
• Recall: If any collection does not respond, documents in that collection are not found.
• Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
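The ranking problem can be made concrete with a small sketch. It assumes each collection returns (document, score) pairs on its own scale; the function names and the sample scores are hypothetical. Sorting by raw score lets one collection's scale dominate and leaves duplicates in place, while round-robin interleaving by rank position is one simple fallback that ignores the scores altogether.

```python
from itertools import zip_longest


def merge_by_score(ranked_lists):
    """Naive merge: sort every hit by its raw score, whatever its scale."""
    hits = [hit for lst in ranked_lists for hit in lst]
    return [doc for doc, _ in sorted(hits, key=lambda h: h[1], reverse=True)]


def merge_round_robin(ranked_lists):
    """Interleave lists by rank position, dropping duplicate documents."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):     # tier = hits at the same rank
        for hit in tier:
            if hit is not None and hit[0] not in seen:
                seen.add(hit[0])
                merged.append(hit[0])
    return merged


# Collection A scores on a 0-1 scale, collection B on a 0-100 scale.
a = [("doc1", 0.92), ("doc2", 0.40)]
b = [("doc3", 55.0), ("doc1", 12.0)]

naive = merge_by_score([a, b])
interleaved = merge_round_robin([a, b])
```

Neither merge is a true relevance ranking; the round-robin list is merely free of scale effects and duplicates.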
12
Standardization: Function Versus Cost of Acceptance
[Figure: chart with axes "Function" and "Cost of acceptance"; regions labeled "Many adopters" and "Few adopters".]
13
Example: Textual Mark-up
[Figure: chart with axes "Function" and "Cost of acceptance", positioning ASCII, HTML, XML, and SGML.]
14
Distributed Architecture 3: Centralized Search Services
[Figure: the user sends "Find x" to the Search Service (search) and retrieves items from the collections (retrieve).]
Batch indexing: Metadata about all items is accumulated in a central system.
Real-time searching: The user (a) searches the central system, and (b) retrieves items from collections.
Examples: Union catalogs, Web search services
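The two phases can be sketched as follows, assuming the central system holds one metadata record per item with a title and the item's location. The record layout, the URLs, and the function names are hypothetical. The batch step builds an inverted index; the real-time step answers a query from that index alone and hands back locations, so that retrieval goes to the owning collection.

```python
from collections import defaultdict


def build_index(metadata):
    """Batch step: map each title term to the ids of items that contain it."""
    index = defaultdict(set)
    for item_id, record in metadata.items():
        for term in record["title"].lower().split():
            index[term].add(item_id)
    return index


def search(index, metadata, query):
    """Real-time step (a): AND the query terms and return item locations."""
    terms = query.lower().split()
    if not terms:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(metadata[i]["url"] for i in ids)


# Hypothetical metadata accumulated from two collections.
metadata = {
    "a1": {"title": "Pendulum motion lab", "url": "http://physics.example.org/a1"},
    "b7": {"title": "Pendulum clock history", "url": "http://museum.example.org/b7"},
}
index = build_index(metadata)
hits = search(index, metadata, "pendulum lab")
```

Step (b), fetching the item itself, is an ordinary request to the returned URL and never touches the central system.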
15
Distributed Architecture 3: Centralized Search Services
Gathering by Web Crawling
• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material.
but ...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are known.
• Cannot easily make use of metadata provided by collections.
Examples: Web search services.
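A minimal crawler sketch, assuming pages are fetched through a function that returns HTML or None when a page is not openly accessible. The in-memory `PAGES` dictionary and its paths are hypothetical stand-ins for the Web. The crawl only ever reaches pages that some gathered page links to.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(seed, fetch):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    seen, queue, gathered = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:                 # not openly accessible: skip it
            continue
        gathered.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return gathered


# Hypothetical in-memory "web"; /private is linked but cannot be fetched.
PAGES = {
    "/index": '<a href="/about">About</a> <a href="/private">Members</a>',
    "/about": '<a href="/index">Home</a>',
    "/db?id=42": "A record served from a database; no page links to it.",
}
gathered = crawl("/index", PAGES.get)
```

The database record exists but is never gathered, because no explicit URL for it appears on any crawled page.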
16
Distributed Architecture 3: Centralized Search Services
Harvesting
• Each collection makes a copy of its metadata available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular cycle and builds a central search system.
Advantages ...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but ...
• Requires that collections have metadata and support harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
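One harvesting cycle might look like the sketch below. It assumes each collection exposes records carrying an identifier, a title, and a datestamp; the record layout, collection names, and dates are hypothetical. Only records changed since the previous cycle are copied into the central store.

```python
from datetime import date


def harvest(collections, central, last_harvest):
    """Copy metadata records changed since `last_harvest` into `central`."""
    for name, records in collections.items():
        for rec in records:
            if rec["datestamp"] > last_harvest:
                central[(name, rec["id"])] = rec["title"]
    return central


# Hypothetical metadata exposed by two cooperating collections.
collections = {
    "museum": [
        {"id": "m1", "title": "Fossil images", "datestamp": date(2004, 11, 2)},
    ],
    "library": [
        {"id": "l1", "title": "Physics lessons", "datestamp": date(2004, 9, 15)},
        {"id": "l2", "title": "Chemistry labs", "datestamp": date(2004, 11, 20)},
    ],
}
central = harvest(collections, {}, last_harvest=date(2004, 10, 1))
```

Because the collections choose what to expose, this route can cover database records that no crawler would reach.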
17
Open Archives Initiative Protocol for Metadata Harvesting
• Low-barrier protocol for exposing structured information (metadata) from cooperating repositories
• Provides opportunity for building comprehensive service network
• http://www.openarchives.org/
18
OAI-PMH: A simple two-party model for sharing structured information
[Figure: Service Providers (Discovery, Current Awareness, Preservation) obtain metadata from Data Providers by metadata harvesting.]
19
OAI-PMH Key technical features
• Simple HTTP encoding
• Built on top of established XML standards
• Multiple metadata formats, but Dublin Core required
• Repository partitioning (sets)
• Selective harvesting (sets and dates)
• Clean partition between core and implementation-specific extensions
  – Multiple item-level metadata
  – Collection level metadata
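To show what the XML side looks like, the sketch below parses a hand-written, trimmed response carrying one unqualified Dublin Core record. The sample identifier, set name, and field values are made up, and a real response also carries elements such as the response date and the echoed request. The namespace URIs are the ones defined for OAI-PMH 2.0 and Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Illustrative, trimmed response; not copied from any real repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2004-11-20</datestamp>
        <setSpec>physics</setSpec>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Pendulum simulation</dc:title>
          <dc:creator>Example Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract header fields and the Dublin Core title from each record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "datestamp": rec.findtext("oai:header/oai:datestamp", namespaces=NS),
            "set": rec.findtext("oai:header/oai:setSpec", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
        })
    return records


parsed = parse_records(SAMPLE)
```

The header supplies the datestamp and set membership that selective harvesting relies on; the metadata element carries the format-specific part.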
20
OAI Verbs
• Identify – repository characteristics
• ListMetadataFormats – DC required
• ListSets – repository partitioning
• ListRecords – (selectively) harvest metadata
• ListIdentifiers – (selectively) harvest metadata identifiers
• GetRecord – known item retrieval
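Each verb is sent as an ordinary HTTP request with the verb and its arguments as query parameters. The sketch below only builds the request URLs. The base URL, the set name, and the item identifier are hypothetical; `oai_dc` is the metadata prefix for the required Dublin Core format.

```python
from urllib.parse import urlencode

BASE_URL = "http://repository.example.org/oai"   # hypothetical endpoint


def oai_request(verb, **args):
    """Build an OAI-PMH GET URL; `from_` stands in for the `from` argument."""
    params = {"verb": verb}
    for key, value in args.items():
        if value is not None:
            params[key.rstrip("_")] = value       # from_ -> from
    return f"{BASE_URL}?{urlencode(params)}"


identify = oai_request("Identify")

# Selective harvesting: only the "physics" set, only records since a date.
records = oai_request("ListRecords", metadataPrefix="oai_dc",
                      from_="2004-11-01", set="physics")

# Known-item retrieval.
item = oai_request("GetRecord", identifier="oai:example.org:item-1",
                   metadataPrefix="oai_dc")
```

A harvester would issue these requests on its regular cycle, advancing the `from` date after each successful run.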
21
The National Science Digital Library
The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).
http://nsdl.org/
22
Interoperability in the NSDL
The Problem
Conventional approaches require partners to support agreements (technical, content, and business)
But NSDL needs thousands of very different partners
... most of whom are not directly part of the NSDL program
The challenge is to create incentives for independent digital libraries to adopt agreements
23
Basic Assumptions
• The integration team will not manage most of the collections
• The integration team will not create most of the metadata
Architecture for Searching
24
Full Text or Metadata?
Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
Few collections support an established search protocol (e.g., Z39.50).
The NSDL Search Service
25
NSDL: The Spectrum of Interoperability
Level       Agreements                                                                Example
Federation  Strict use of standards (syntax, semantic, and business)                  AACR, MARC, Z 39.50
Harvesting  Digital libraries expose metadata; simple protocol and registry           Open Archives metadata harvesting
Gathering   Digital libraries do not cooperate; services must seek out information    Web crawlers and search engines
26
The NSDL Repository
[Figure: the NSDL Repository shown together with Users, Services, and Collections.]
The repository is a resource for service providers.
It holds information about every collection and item known to the NSDL, including contextual information.
27
NSDL Search Service: First Phase
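[Figure: architecture diagram with the labels: Portals (three), Search and Discovery Service, Collections, and NSDL Repository; connections labeled SDLIP, OAI, harvest, and crawl; search engine annotation "Inquery -> Lucene".]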
28
NSDL Search Service: First Phase
Approach
(a) Collections map metadata to Dublin Core, provide via Open Archives protocol.
(b) Search service augments Dublin Core metadata with indexing of full-text where available.
(c) User interface returns snippets derived from the metadata, links to full content and to metadata.
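Steps (b) and (c) can be sketched as follows, assuming each record has Dublin Core style title and description fields plus optional full text. The field names, the sample records, and the URLs are hypothetical, and this is not the NSDL implementation. The index covers metadata and full text together, but the snippet is built from the metadata only.

```python
def indexed_text(record):
    """Step (b): metadata fields plus full text, where it is available."""
    parts = [record["title"], record["description"], record.get("full_text", "")]
    return " ".join(parts).lower()


def search(records, term):
    """Step (c): return a metadata-derived snippet and a link for each match."""
    results = []
    for rec in records:
        if term.lower() in indexed_text(rec):
            results.append({
                "snippet": f'{rec["title"]}: {rec["description"]}',
                "link": rec["url"],
            })
    return results


# Hypothetical records; the second has no full text available for indexing.
records = [
    {"title": "Simple machines", "description": "A unit for middle school.",
     "full_text": "Students build a pendulum and time its swing.",
     "url": "http://collection.example.org/1"},
    {"title": "Plate tectonics", "description": "Animated maps.",
     "url": "http://collection.example.org/2"},
]
hits = search(records, "pendulum")
```

The first record matches through its full text, so the returned snippet does not contain the query term at all.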
29
NSDL Search Service: First Phase
Weaknesses
(a) Ranking by similarity to query not sufficient.
(b) Snippets do not indicate why item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment limited.
(e) Most users begin their search with a Web search engine (e.g., Google)
30
NSDL Search Service: Second Phase Developments
Metadata
(a) Accept any metadata that is available in a range of formats
(b) System for reviews and annotations, with reputation management
Search system
(a) Multimodal retrieval and ranking
(b) Dynamic generation of snippets by search engine
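Dynamic snippet generation can be sketched as picking a window of text around the first query term found. The function name, the window width, and the sample text are hypothetical; production search engines use more elaborate passage scoring.

```python
def dynamic_snippet(text, query, width=6):
    """Return up to `width` words of `text` around the first query-term hit."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?") in terms:
            start = max(0, i - width // 2)
            return " ".join(words[start:start + width])
    return " ".join(words[:width])          # no hit: fall back to the opening


text = ("This unit covers simple machines. Students build a pendulum "
        "and measure its period with a stopwatch.")
snippet = dynamic_snippet(text, "pendulum")
```

Because the window is chosen per query, the snippet shows the user why the item was returned.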
31
NSDL Search Service: Second Phase Developments (cont.)
Usability and human factors
(a) Wider range of browsing tools (e.g., collection visualization)
(b) Filters by education level and education quality, where known
Web compatibility
(a) Expose records for Web crawlers to index
(b) Browser bookmarklet to add NSDL information to Web pages