Building Digital Libraries Made Easy: Toward Open Digital Libraries

ICADL 2002 – Singapore – Dec. 2002

Edward A. Fox (with Hussein Suleman, Ming Luo)

fox@vt.edu http://fox.cs.vt.edu
CS DLRL Internet TIC
NDLTD CITIDEL NSDL …
Virginia Tech, Blacksburg, VA, USA

Acknowledgements (Selected)

• Sponsors: ACM, Adobe, DLF, IBM, Mellon Foundation, Microsoft, NSF (Grants CDA-9312611; DUE-0121741, 0136690, 0121679; IIS-0080748, 0086227, 0002935, and 9986089), OCLC, SOLINET, UNESCO, US Dept. Ed. (FIPSE), VTLS, …

• Faculty/Staff (now): Boots Cassel, Su-Shing Chen, Debra Dudley, Jeremy Frumkin, Joe Futrelle, Lee Giles, Martin Halbert, Rex Hartson, John Impagliazzo, Deborah Knox, JAN Lee, Kurt Maly, Gail McMillan, Eric Morgan, Manuel Perez, Muhammad Zubair, …

• Students: Fernando Das Neves, Marcos Goncalves, Rohit Kelapure, Aaron Krowne, Paul Mather, Ryan Richardson, Priya Shivakumar, Wensi Xi, Liang Xu, Baoping Zhang, …

Outline

• Overview, Problem

• Experience: Case Study Projects

• Open Archives Initiative

• Hussein Suleman Dissertation

• DL in a Box, OCKHAM

• Summary and Conclusion

Overview

We

• address the problem of how to develop DLs;

• build on experience in building many DLs;

• strive for simplicity as per OCKHAM initiative;

• build upon the Open Archives Initiative;

• demonstrate our approach in diverse situations;

• and invite all to
  • use DL-in-a-box and
  • help build Open Digital Libraries.

Problem

Why do DL developers continue to “reinvent the wheel”? The top 10 reasons are:

1. The library budget won’t allow purchase of a commercial DL system.

2. Unless the development effort is local, there won’t be any control.

3. DLs are extensions of DBMSs, so they are simple applications to develop.

4. Since DLs operate on the Web, one must adopt the newest W3C proposal.

Problem – cont’d

5. Since technology moves so quickly, it is essential to follow the latest fad.

6. CS students always develop from scratch.

7. This team knows it can do it better.

8. This system must have more capabilities than any other system.

9. This DL has to be more flexible and extensible.

10. This is the right system architecture – at last!

Experience: Case Study Projects

• AmericanSouth.org

• NDLTD

• CSTC

• JERIC

• CITIDEL

• NSDL

• Digital Library in a Box

AmericanSouth.org

• Domain: culture and history of the southern region of America (USA)

• Genre: diverse distributed collections at a dozen universities

• Submission & Collection: local sites → Emory University (for SOLINET)

Networked Digital Library of Theses and Dissertations (NDLTD)

• Domain: graduate education and research

• Genre: electronic theses and dissertations (ETDs)

• Submission & Collection: local sites → www.ndltd.org, www.theses.org

Computer Science Teaching Center (CSTC)

• Domain: teaching computer science

• Genre: courseware

• Submission & Collection: www.cstc.org

CS Teaching Center (CSTC): Lessons Learned

• Instead of building large, expensive multimedia packages, that become obsolete and are difficult to re-use, concentrate on small knowledge units.

• Learners benefit from having well-crafted modules that have been reviewed and tested.

• Use digital libraries to build a powerful base of support for learners, upon which a variety of courses, self-study tutorials & reference resources can be built.

Browsing (2)

ACM Journal of Educational Resources in Computing (JERIC)

• Domain: teaching computer science

• Genre: courseware, scholarly articles

• Submission & Collection: CSTC, ACM Digital Library

JERIC

• Journal of Educational Resources in Computing

• Accessible from www.cstc.org and www.acm.org and www.citidel.org

• ACM and SIGCSE support

• Refereed and interactive

• Part of ACM Digital Library

Computing and Information Technology Interactive Digital Educational Library (CITIDEL)

• Domain: computing / information technology

• Genre: one-stop-shopping for teachers & learners: courseware (CSTC, JERIC), leading DLs (ACM, IEEE-CS, DB&LP, CiteSeer), PlanetMath.org, technical reports, …

• Submission & Collection: sub/partner collections → www.citidel.org

CITIDEL Team

• An NSDL Collection Track project

• Led by Virginia Tech, with co-PIs:
  • Fox (director, DL systems)
  • Lee (history)
  • Perez (user interface, Spanish support)

• Partners
  • College of New Jersey (Knox)
  • Hofstra (Impagliazzo)
  • Villanova (Cassel)
  • Penn State (Giles)

Summary of Spring 2001 Survey of CITIDEL-related Collections and their Sizes

Size of Collection    Number of Collections Identified
1-5 items             100-300
6-100 items           50
101-999 items         20-35
1000+ items           10-25

Multi-dimensional Categorization

• Language: English, Spanish

• Topic: Java, Multimedia, Algorithms

• Quality: Identified by crawl, Nominated, Editor reviewed, Peer reviewed

CITIDEL Collection Sources

[Diagram: sources that CITIDEL includes. Labels: metadata; fulltext; JERIC; CSTC; ResearchIndex; NEC’s data; data processed w. R.I.; ACM; ACM DL; SIGCSE proceedings; IEEE-CS …; NCSTRL; Experts’ finding aids; Borner’s info viz software repository.]

CITIDEL Collection Building

[Diagram: activities through which the collection is built. Labels: Submitting; Nominating; Crawling; Classifying; Searching, Browsing; Composing; Creating; VIADUCT; GetSmart; Crawlifier.]

Overview of CITIDEL architecture

[Diagram layers: USER PORTALS; DIGITAL LIBRARY SERVICES; REPOSITORIES.]

Distributed repository structure

[Diagram labels: Union Metadata Repository; OAI Data Provider; OAI Data Harvester; Laboratories Repository; Applets Repository; Papers Repository; Syllabi Repository; …; Digital Library Services.]

Digital library architecture for local and interoperable CITIDEL services

[Diagram layers and labels. PORTALS: Educators, Administrators, Learners. SERVICES: Multilingual Searching, Revising, Annotating, Filtering, Browsing, Administering. REPOSITORIES: Annotations, Filtering Profiles, User Profiles, Union Metadata. Also shown: OAI Data Harvester; OAI Data Provider; Remote and Peer Digital Libraries (e.g. NSDL-CIS).]

National Science Digital Library (NSDL)

• Domain: undergraduate and K-12 education, etc.

• Genre: educational resources

• Submission & Collection: sites of 90 projects → www.nsdl.org

NSDL Information Architecture

Developed by the Technical Infrastructure Workgroup

[Diagram labels: Portals & Clients (User Interfaces); Usage Enhancement; Collection Building; Core NSDL “Bus”; CI Services: annotation, discussion, personalization, authentication, browsing; Core Services: information retrieval, metadata gathering; Core Collection-Building Services: harvesting, protocols; Other NSDL Services; NSDL Collections; Special Databases; referenced items & collections.]

Digital Library in a Box

• Domain: helping DL projects

• Genre: any domain, but especially those involved in NSDL (since funding is in part through NSDL, with U. FL and NCSA)

• Software and Documentation: http://dlbox.nudl.org

Open Archives Initiative

OAI
www.openarchives.org
openarchives@openarchives.org

The World According to OAI

[Diagram: Service Providers (Discovery, Current Awareness, Preservation) and Data Providers, linked by metadata harvesting.]

Technical Umbrella for Practical Interoperability…

[Diagram labels: Reference Libraries; Publishers; E-Print Archives; Museums.]

…that can be exploited by different communities

Tiered Model of Interoperability

Mediator services

Metadata harvesting

Document models

OAI – Black Box Perspective

[Diagram: seven Open Archives, OA 1 to OA 7, shown as black boxes.]

[Diagram rows. Services: Browse, Summarize, Search, Visualize. Docs: DO (repeated). Metadata: Aggregation through OAI Harvesting.]

[Diagram labels: Archive; Lite Sites; NCSTRL; Eprints; IEEE-CS, ACM, …; Own: History, ResearchIndex, CSTC, …; CITIDEL; Active.]

Protocol for Metadata Harvesting

• Service Requests
  • Identify
  • ListMetadataFormats
  • ListSets
  • GetRecord
  • ListIdentifiers
  • ListRecords

• Metadata Multiplicity

• Date/Time Ranges

• Sets (with semantics depending on local data providers)

• Resumption Tokens
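
The service requests and resumption tokens listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base URL is a placeholder, and `fetch_page` is a hypothetical caller-supplied function that returns one page of records plus the resumption token (or `None`), so that no particular HTTP or XML library is implied.

```python
from urllib.parse import urlencode

# The six OAI-PMH service requests (verbs) named on the slide.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "GetRecord", "ListIdentifiers", "ListRecords"}


def build_request(base_url, verb, **args):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for key, value in args.items():
        if value is not None:
            # 'from' is a Python keyword, so callers pass from_ instead.
            params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


def harvest(fetch_page, base_url, **args):
    """Yield records page by page, following resumption tokens.

    fetch_page(url) is a hypothetical helper returning (records, token).
    """
    url = build_request(base_url, "ListRecords", **args)
    while True:
        records, token = fetch_page(url)
        yield from records
        if not token:
            break
        # A resumption token is sent on its own, replacing the other arguments.
        url = build_request(base_url, "ListRecords", resumptionToken=token)
```

A date range and a set would be passed the same way, for example `harvest(fetch_page, base, metadataPrefix="oai_dc", from_="2002-01-01", set="theses")`, where the set name is whatever the local data provider defines.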

NDLTD OAI Example

[Diagram labels: NDLTD Site / Member; Student Entry; Local DB; Local Search / Browse; OAI Server; NDLTD Central; OAI Harvester; Name Authority Service (e.g. OCLC); MARIAN Union Catalog; VTLS Union Catalog; MARC DB; Virtua; Conversion; Alternate MARC Transport (ftp? tapes?); Librarian Verification / Validation / Enrichment / Maintenance.]

Open Digital Library (ODL) Hypothesis (Hussein Suleman)

• Can we leverage the successful model of the OAI Protocol for Metadata Harvesting to alleviate our architectural problems?

Maybe … if

Digital Libraries can be modeled as

• networks of extended Open Archives, where

• each extended Open Archive is a

• source of data and/or a provider of services.
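
One way to picture the hypothesis is a toy model in which every node is an extended Open Archive that may hold records, harvest from other archives, and optionally apply a service to what it gathers. The class and field names below are illustrative assumptions, not part of the ODL work itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ExtendedOpenArchive:
    """Hypothetical node: a source of data and/or a provider of a service."""
    name: str
    records: List[str] = field(default_factory=list)      # data held locally
    sources: List["ExtendedOpenArchive"] = field(default_factory=list)
    service: Optional[Callable[[List[str]], List[str]]] = None

    def list_records(self) -> List[str]:
        # Gather local records, then harvest from every upstream archive.
        gathered = list(self.records)
        for source in self.sources:
            gathered.extend(source.list_records())
        # A service provider transforms what it harvested; a pure source does not.
        return self.service(gathered) if self.service else gathered


site_a = ExtendedOpenArchive("site-a", records=["etd-1", "report-7"])
site_b = ExtendedOpenArchive("site-b", records=["etd-2"])
union = ExtendedOpenArchive("union", sources=[site_a, site_b])
etd_only = ExtendedOpenArchive(
    "etd-filter",
    sources=[union],
    service=lambda items: [i for i in items if i.startswith("etd-")],
)
```

Here `union` is a pure aggregator and `etd_only` is a service layered on top of it, so the digital library is just the network formed by these nodes.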

Example Architecture (NDLTD)

[Diagram labels: Humboldt; Duisburg; MIT; MIT Filter; Virginia Tech; PhysNet; CalTech; Dresden; Union Catalog; Search; Browse; Recent; User Interface. Legend: OAI/ODL archive; OAI/ODL protocol; User Interface.]

ODL Demonstration - FrontPage

ODL Demonstration - Search

ODL Demonstration - Browse

Hussein Suleman’s Thesis Summary

• Open Digital Libraries (ODLs)

• Open Archives Initiative (OAI)

• Protocol for Metadata Harvesting (PMH)

• Extending OAI-PMH provides the glue for building componentized DLs.

• Lightweight protocols connect the components to support modular systems with good efficiency.

Research in a Nutshell

• We build extensible modular systems with customizable services.

• This supports interoperability and allows distributed development.

• This is in use in www.cstc.org, AmericanSouth.org, www.citidel.org, …

• Components include search, browse, annotate, editorial support, union, filter, whats-new, submit, rate, recommend, …

[Diagram: users and digital objects (programs, documents, images, video), with a question mark between them.]

[Diagram: componentized digital library, with question marks on the connections among the components and the digital objects.]

[Diagram: open digital library, with Open Archives (OA) linked to the digital objects and to each other by PMH and XPMH connections.]

ODL Component Requirements

• Search
  • Retrieve a list of items
  • Index new items

• Annotate
  • Add annotation to item
  • Retrieve a list of annotations for an item
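
The two requirement sets above map naturally onto small component interfaces. The in-memory classes below are a minimal sketch with assumed method names; they say nothing about how the actual ODL components are implemented.

```python
from collections import defaultdict


class SearchComponent:
    """Hypothetical search component: index new items, retrieve a list of items."""

    def __init__(self):
        self._index = defaultdict(set)  # term -> identifiers of matching items

    def index_item(self, identifier, text):
        for term in text.lower().split():
            self._index[term].add(identifier)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        # Keep only items that contain every query term.
        hits = set.intersection(*(self._index.get(t, set()) for t in terms))
        return sorted(hits)


class AnnotateComponent:
    """Hypothetical annotation component: add and list annotations per item."""

    def __init__(self):
        self._notes = defaultdict(list)  # identifier -> list of annotations

    def add_annotation(self, identifier, note):
        self._notes[identifier].append(note)

    def list_annotations(self, identifier):
        return list(self._notes.get(identifier, []))
```

Keeping each component to two operations is what makes it plausible to expose them later through a small harvesting-style protocol.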

Open Digital Library Components

• Running now
  • XML-File (data provider from file system)
  • Union, search, browse, recent, filter
  • E-journal/review, Submit, Edit, Annotation

• Class projects
  • High performance multilingual search
  • Recommender, Rating; Mirroring (see JCDL’02)
  • Working with NCSA: from DB, unstructured text

• Others discussed
  • Classification/categorization
  • DL-Viz interconnection (VIDI – Jun Wang ETD)

[Diagram labels: XML File Coll. & Data Provider 1, 2, 3; Harvest from data providers; DBUnion Archive Merger Component; DBBrowse Browse Engine (As Metadata Browse Service Provider); IRDB-1 Search Engine (As Metadata Search Service Provider).]

Open Digital Library: Extended

[Diagram labels: What’s New Engine (As What’s New Service Provider); OAI-PMH Data Provider; Submit Archive; OAIB (NCSA: from RDBMS); Filter; Recommend; Rate Engine; Annotation Engine; IRDB-2 Search Engine; As Annotation Search Service Provider; As Recommend & Rate Service Provider.]

Example Open Digital Library

Digital Library for the Networked Digital Library of Theses and Dissertations (www.ndltd.org)

[Diagram labels: ETD collections (ETD-1 to ETD-4, among programs, documents, images and video); PMH; ODLUnion; Union; Filter; ODLSearch; Search; ODLBrowse; Browse; ODLRecent; Recent; USER INTERFACE; Students and researchers.]

Example Open Digital Library

Digital Library for the Computer Science Teaching Center (www.cstc.org)

[Diagram labels: USER INTERFACE; DBReview; Box: Reviews; Box: Resources under Review; Box: Accepted Resources; Box: Users; DBUnion: Metadata Union; DBUnion: Legacy Metadata; IRDB; Thread; DBRate; Suggest; DBBrowse. Legend: User Interface; OAI/ODL component; OAI/ODL protocol.]

[Diagram labels: CSTC User Interface; Open Digital Library Component; Extended Open Archive; Open Archive.]

Layer 1 : OAI PMH

• Protocol for Metadata Harvesting
  • Transfer stream of metadata from one archive or component to another

• Service Requests
  • Identify, ListSets, ListMetadataFormats
  • GetRecord, ListIdentifiers, ListRecords
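
On the receiving side, the stream of metadata arrives as XML. The sketch below parses a hand-written `ListRecords` response in the OAI-PMH 2.0 namespace using only the standard library; the sample identifiers, datestamps, and token are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response trimmed to the parts used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:etd-1</identifier>
      <datestamp>2002-11-30</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:example.org:etd-2</identifier>
      <datestamp>2002-12-01</datestamp>
    </header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return ([(identifier, datestamp), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    headers = [
        (h.findtext("oai:identifier", namespaces=OAI_NS),
         h.findtext("oai:datestamp", namespaces=OAI_NS))
        for h in root.iterfind(".//oai:record/oai:header", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An absent or empty token both mean the list is complete.
    return headers, (token or None)
```

Calling `parse_list_records(SAMPLE)` yields the two record headers and the token to send with the next request.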

Layer 2 : Extended OAI-PMH

• OAI-PMH + extensions for general-purpose inter-component communication
  • Added in generic containers in every response for additional information
  • Added “PutRecord” to submit a record
  • Increased granularity to support times as well as dates (same as OAI-PMH v2.0)
  • Ignored DC requirement
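
The granularity point can be made concrete with the two datestamp forms that OAI-PMH v2.0 defines: day granularity (`YYYY-MM-DD`) and seconds granularity in UTC (`YYYY-MM-DDThh:mm:ssZ`). The helper name below is an assumption for illustration.

```python
from datetime import datetime, timezone


def oai_datestamp(moment, granularity="day"):
    """Format a timezone-aware datetime as an OAI-PMH v2.0 datestamp."""
    moment = moment.astimezone(timezone.utc)  # datestamps are expressed in UTC
    if granularity == "day":
        return moment.strftime("%Y-%m-%d")
    if granularity == "second":
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unsupported granularity: {granularity}")
```

With day granularity, a component that harvests several times a day receives the same records again; seconds granularity lets closely coupled components ask only for what changed since the last request.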

Layer 3 : ODL Protocols

• Specialized protocol semantics for different components, e.g.:

• Search component uses ODLSearch protocol
  • ListRecords and ListIdentifiers embed query terms in “set” parameter

• Annotation component uses ODLAnnotate protocol
  • ListRecords and ListIdentifiers specify the item for which annotations are requested in the “set” parameter
  • PutRecord adds an annotation to an item
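
The reuse of the “set” parameter can be sketched as plain URL construction. The exact syntax that ODLSearch and ODLAnnotate expect inside “set” is not given on the slide, so the functions below simply place the raw query or item identifier there; the function names, base URLs, and `oai_dc` default are assumptions.

```python
from urllib.parse import urlencode


def odl_search_request(base_url, query, metadata_prefix="oai_dc"):
    """ListIdentifiers request with the query terms carried in 'set' (assumed encoding)."""
    params = {"verb": "ListIdentifiers",
              "metadataPrefix": metadata_prefix,
              "set": query}
    return f"{base_url}?{urlencode(params)}"


def odl_annotations_request(base_url, item_identifier, metadata_prefix="oai_dc"):
    """ListRecords request asking for the annotations of one item via 'set'."""
    params = {"verb": "ListRecords",
              "metadataPrefix": metadata_prefix,
              "set": item_identifier}
    return f"{base_url}?{urlencode(params)}"
```

Because both requests are ordinary harvesting requests, any client that already speaks the baseline protocol can talk to a search or annotation component without new plumbing.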

Performance Optimizations

• Caching of responses

• Persistent CGI mechanisms
  • FastCGI
  • SpeedyCGI

• Request multiple records in a single operation (proposed)
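
Caching of responses, the first optimization above, might look like the following wrapper around whatever function fetches a component response. The class name, the `max_age` policy, and the injectable clock are illustrative assumptions rather than a description of the ODL implementation.

```python
import time


class ResponseCache:
    """Cache component responses by request URL for a limited time."""

    def __init__(self, fetch, max_age=300.0, clock=time.monotonic):
        self._fetch = fetch        # function: url -> response body
        self._max_age = max_age    # seconds a cached response stays valid
        self._clock = clock        # injectable so the cache can be tested
        self._store = {}           # url -> (time stored, body)

    def get(self, url):
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._max_age:
            return hit[1]          # fresh enough: skip the remote call
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

Since every inter-component call is a URL, identical requests from different user-interface pages share one cached response.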

What have we accomplished?

• Complete protocol-level separation among components within the DL

• Seamless integration with little “glue”

• Simple extensions of OAI-PMH

• Modular and portable components

• Efficient in speed – but not as efficient in storage

Digital Library In A Box

• http://dlbox.nudl.org

• Part of NSF’s National Science Digital Library (www.nsdl.org)

• Offers “Shrink-wrap” Open Digital Library Components – Open Source Software

• Users install ready-made digital library solutions, or build their own from snap-together components.

OCKHAM

• Simplicity (a la OCCAM’s razor)

• Support by Mellon and DLF

• Next meeting in Atlanta Jan. 8, 2003

• Four main ideas:

1. Components

2. Lightweight protocols

3. Open reference models (e.g., 5S, OAIS)

4. Community perspective and involvement

5S Layers

Societies

Scenarios

Spaces

Structures

Streams

Summary and Conclusion

• It is possible to build DLs easily.

• The ODL approach to this has been developed and validated in a number of settings.

• Everyone is invited to:

• Use ODL components

• Refine or add ODL components, protocols

• Join ODL and OCKHAM

• For more information see:

(Somewhat) Open Issues

• Is this scalable? Portable? Extensible?

• Can we define all popular DL services using such a methodology? (completeness problem)

• Can we define DLs as configurations of ODL components? (composition problem)

• Is OAI-PMH a good baseline protocol? Can we design a better baseline protocol upon which to base harvesting and repository access?

• To what degree is an ODL network equivalent to a monolithic system? (comparison problem)

Ultimate Goal

• Package different configurations into instant DL systems or subsystems

• DL building = component configuration

• All DLs speak the same language(s)

• Basic services are trivial to provide so more effort is spent on advanced capabilities of DLs
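
“DL building = component configuration” can be illustrated with a configuration that only names components and the components they harvest from. The layout of `CONFIG`, the component names, and `start_order` are all hypothetical; the sketch just checks the wiring and derives an order in which components could be brought up (requires Python 3.9+ for `graphlib`).

```python
from graphlib import TopologicalSorter

# Hypothetical configuration: each component lists the components it harvests from.
CONFIG = {
    "etd-archive-1": {"type": "data-provider", "sources": []},
    "etd-archive-2": {"type": "data-provider", "sources": []},
    "union": {"type": "union", "sources": ["etd-archive-1", "etd-archive-2"]},
    "search": {"type": "search", "sources": ["union"]},
    "browse": {"type": "browse", "sources": ["union"]},
}


def start_order(config):
    """Validate the wiring and return components with sources before consumers."""
    for name, spec in config.items():
        for source in spec["sources"]:
            if source not in config:
                raise KeyError(f"{name} harvests from undefined component {source}")
    graph = {name: set(spec["sources"]) for name, spec in config.items()}
    # static_order() lists every component after the ones it depends on.
    return list(TopologicalSorter(graph).static_order())
```

Swapping the `browse` entry for a different component, or adding a filter between `union` and `search`, changes the library without touching any component code.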

Selected Links

• CITIDEL – www.citidel.org

• NCSTRL – www.ncstrl.org

• NDLTD – www.ndltd.org

• NSDL – www.nsdl.org

• Open Archives Initiative
  • www.openarchives.org
  • www.openarchives.org/OAI/openarchivesprotocol.htm
  • www.dlib.vt.edu/projects/OAI/

More Links

• Hussein Suleman’s Dissertation
  • http://purl.org/net/hsdiss/odl.pdf

• Repository Explorer
  • http://purl.org/net/oai_explorer

• DL Courseware – http://ei.cs.vt.edu/~dlib

• Virginia Tech Digital Library Research Laboratory (DLRL) – www.dlib.vt.edu

• Listservs
  • dl-in-a-box-l@listserv.vt.edu
  • ockham-sys@listserv.cc.emory.edu