1
William Y. Arms
Cornell University
October 25, 2002
The National Science Digital Library (NSDL) as an Example of Information Science Research
2
Some Light Reading
William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm
William Y. Arms, "Automated digital libraries." D-Lib Magazine, July/August 2000. http://www.dlib.org/dlib/july20/07contents.html
William Y. Arms, "What are the alternatives to peer review? Quality control in scholarly publishing on the web." Journal of Electronic Publishing, 8(1), August 2002. http://www.press.umich.edu/jep/08-01/arms.html
William Y. Arms, et al., "A Spectrum of Interoperability: The Site for Science Prototype for the NSDL." D-Lib Magazine, 8(1), January 2002. http://www.dlib.org/dlib/january02/arms/01arms.html
3
A Scenario
A faculty member wished to find a paper for students to read in a class. He began by asking an expert. She suggested the original research paper as suitable.
Later, he typed a few terms into Google, browsed the hits, selected one that led to ResearchIndex, found the paper, and downloaded a PDF version from the author's web site.
4
Viewpoints
[Diagram labels: Society, Cognitive Studies, HCI, Computer Science]
5
HCI: Eye Tracking
6
7
[Diagram labels: Society, Cognitive Studies, HCI, Computer Science, Applications, Information Science]
8
Open Access to Scientific, Scholarly and Professional Information
9
Before the Web
Access to Scientific, Medical, Legal Information
In the United States:
excellent if you belonged to a rich organization (e.g., a major university)
very poor otherwise (e.g., most K-12 schools)
In many countries of the world:
very poor for everybody
10
Research Libraries are Expensive
library materials
buildings & facilities
staff
11
Baumol's Cost Disease
[Chart: price against year, 1900 to 2050, with curves for a bundle of goods and services, labor-intensive services, and manufactured goods]
12
Baumol's Cost Disease
[Chart: price against year, 1900 to 2050, with curves for a bundle of goods and services, labor-intensive services, and manufactured goods; annotated "Moore's Law"]
13
Brute Force Computing
Few people really understand Moore's Law
Computing power doubles every 18 months
Increases 100 times in 10 years
Increases 10,000 times in 20 years
Simple algorithms
plus
immense computing power
can outperform human intelligence
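The growth figures on this slide follow from compounding the 18-month doubling period. A minimal sketch of that arithmetic is below; the function name and the default argument are illustrative choices, not part of the talk.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Multiplicative growth after `years` if capacity doubles every `doubling_months`."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


if __name__ == "__main__":
    # Compare the compounded values with the rounded figures on the slide.
    for years in (10, 20):
        print(f"{years} years: about {growth_factor(years):,.0f} times")
```

Ten years is a little under seven doublings and twenty years a little over thirteen, which is where the rounded figures of 100 and 10,000 come from.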
14
Example: Catalogs and Indexes
Cost disease: catalogs and indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
Moore's Law: automatic indexing of full text
Retrieval using automatic indexing can be at least as effective as retrieval using manual indexing with controlled vocabularies
(Cleverdon 1967, reporting on experiments by Salton)
15
Brute Force Computing: Substitutes for Human Intelligence
Automated algorithms for information discovery
Similarity of two documents
Vector space and statistical methods
(Salton, Sparck Jones, et al.)
Importance of a digital object
Rank importance of web pages by analysis of the graph of web links
(Kleinberg, Page, et al.)
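To illustrate ranking by link-graph analysis, here is a small sketch of the widely described PageRank power iteration. It is not taken from the talk: the damping factor of 0.85, the iteration count, and the toy graph are all assumptions chosen for illustration.

```python
def rank_pages(links: dict[str, list[str]],
               damping: float = 0.85,
               iterations: int = 50) -> dict[str, float]:
    """Score pages by repeatedly passing each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = rank_pages(toy_web)
best = max(scores, key=scores.get)
```

No human judges the pages; the page that attracts the most link weight ends up with the highest score.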
16
Information Discovery: 1992 and 2002
                     1992          2002
Content              print         digital
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained
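The 2002 column combines automatic indexing, an uncontrolled vocabulary, and ranked retrieval. A minimal sketch of that combination, using the vector space approach with TF-IDF weights and cosine similarity, follows. The weighting formula, the whitespace tokenizer, and the three sample documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Uncontrolled vocabulary: every lowercased word is an index term."""
    return text.lower().split()


def build_index(docs: dict[str, str]):
    """Automatically index full text into TF-IDF weighted term vectors."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n = len(docs)
    idf = {term: math.log(n / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: count * idf[term] for term, count in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return idf, vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, idf, vectors) -> list[str]:
    """Return matching document ids, most similar first."""
    q_counts = Counter(tokenize(query))
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0.0]


docs = {
    "d1": "digital library metadata harvesting",
    "d2": "weather news and forecasts",
    "d3": "digital library search and ranked retrieval",
}
idf, vectors = build_index(docs)
results = search("ranked retrieval in a digital library", idf, vectors)
```

An untrained user types a free-text query and gets an ordered list back, with no Boolean operators and no controlled vocabulary to learn.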
17
Brute Force Computing: Automated Metadata Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news.
Algorithms for:
dividing raw video into discrete items
generating short summaries
indexing the sound track using speech recognition
recognizing faces
(Wactlar, et al.)
18
19
Brute Force Computing + Intelligence of the User
Simple algorithms
plus
immense computing power
plus
the intelligence of the user
can replace labor-intensive services
[Diagram labels: Cognitive Studies, HCI, Computer Science]
20
The National Science Foundation's National Science Digital Library (NSDL)
http://www.nsdl.org
21
Scope
All digital information relevant to any level of education in any branch of science.
Scientific and technical information
Materials used in education
Materials tailored to education
22
How Big might the NSDL be?
All branches of science, all levels of education, very broadly defined:
Five year targets
1,000,000 different users
10,000,000 digital objects
10,000 to 100,000 independent sites
23
The Integration Task ...
... to provide a coherent set of collections and services across great diversity
24
Resources
Integration team
Budget: $4-6 million
Staff: 25-30
Management: Diffuse
How can a small team, without direct management control, create a very large-scale digital library?
25
Philosophy
It is possible to build a very large digital library with a small staff.
But ...
Every aspect of the library must be planned with scalability in mind.
Some compromises will be made.
26
Example 1:
The Mortal behind the Portal
[This space left intentionally blank.]
27
Example 2: Interoperability
The Problem
Conventional approaches require partners to support agreements (technical, content, and business)
But NSDL needs thousands of very different partners
... most of whom are not directly part of the NSDL program
The challenge is to create incentives for independent digital libraries to adopt agreements
28
Function Versus Cost of Acceptance
[Chart: axes are function and cost of acceptance; regions labeled "Many adopters" and "Few adopters"]
29
Example: Textual Mark-up
[Chart: ASCII, HTML, XML, and SGML plotted by function against cost of acceptance]
30
The Spectrum of Interoperability
Level        Agreements                              Example
Federation   Strict use of standards                 AACR, MARC, Z39.50
             (syntax, semantic, and business)
Harvesting   Digital libraries expose metadata;      Open Archives metadata harvesting
             simple protocol and registry
Gathering    Digital libraries do not cooperate;     Web crawlers and search engines
             services must seek out information
31
Example 3: Searching
Basic Assumptions
The integration team will not manage any collections
The integration team will not create any metadata
32
Effective Information Retrieval
Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).
Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
Full text indexing with ranked retrieval (e.g., news articles).
Excellent for relatively homogeneous material, but requires available full text.
Full text indexing with contextual information and ranked retrieval (e.g., Google).
Excellent for mixed textual information with rich structure.
Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval).
Promising, but still experimental.
33
The NSDL Search Service
Full Text or Metadata?
Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
Few collections support an established search protocol (e.g., Z39.50).
34
Broadcast Searching does not Scale
[Diagram labels: User, User interface server, Collections]
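One way to see why broadcast searching struggles at NSDL scale is simple probability. The sketch below assumes, purely for illustration, that every collection independently answers a query in time with the same fixed probability. The 0.99 figure and the collection counts are assumptions, not measurements from the talk.

```python
def chance_all_respond(per_collection: float, collections: int) -> float:
    """Probability that every collection answers, assuming independent responses."""
    return per_collection ** collections


if __name__ == "__main__":
    # Hypothetical availability of 99% per collection.
    for n in (10, 100, 1000, 10000):
        print(f"{n:>6} collections: {chance_all_respond(0.99, n):.6f}")
```

Under these assumptions a query broadcast to a handful of collections usually gets a complete answer, while one broadcast to thousands almost never does, and the user waits for the slowest responder either way.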
35
The Metadata Repository
[Diagram labels: Users, Services, Metadata repository, Collections]
The metadata repository is a resource for service providers.
It holds information about every collection and item known to the NSDL, including contextual information.
36
Support for Service Providers
The Metadata Repository as a Resource
Records are exposed through the Open Archives Initiative Protocol for Metadata Harvesting.
The Core Integration team provides some services based on the metadata repository.
The architecture encourages others to build services.
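A service provider's side of metadata harvesting can be sketched as follows, assuming version 2.0 of the Open Archives protocol and Dublin Core (`oai_dc`) records. The base URL, the sample response, and the helper names are hypothetical; the sketch parses a canned response rather than contacting a real repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical repository address, used only to show the request shape.
BASE_URL = "https://repository.example.org/oai"

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up response in the shape of an OAI-PMH 2.0 ListRecords reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2002-10-25</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL a harvester would fetch to list a repository's records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text: str) -> list[tuple[str, str]]:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext("oai:metadata/oai_dc:dc/dc:title", namespaces=NS)
        records.append((identifier, title))
    return records


request_url = list_records_url(BASE_URL)
harvested = parse_records(SAMPLE_RESPONSE)
```

The cost of acceptance for a collection is low: it only has to answer simple HTTP requests with XML, and any service provider can then build on the harvested records.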
37
Search Service
[Diagram: Portals, Search and Discovery Services, Collections, and the Metadata repository, with connections labeled SDLIP, OAI, and http]
James Allan, Bruce Croft (University of Massachusetts, Amherst)
38
Where is the Center of the Universe?
NSDL
Alexandria
Elsevier
Informedia
Library of Congress
Joe's Pictures
Math DL
39
Where is the Center of the Universe?
NSDL
British Library
Elsevier
OCLC
Library of Congress
Internet Archive
Harvard
40
Where is the Center of the Universe?
NSDL
Course web sites
News and weather
Bill Arms
Office
Technical documentation
Directories
41
Acknowledgement
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration among the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg), and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).