
The Illinois Digital Library Initiative: Processing and Access Issues for Full-Text Journals May 27, 1998 Pennsylvania State University William H. Mischo University of Illinois at Urbana-- Champaign Grainger Engineering Library Information Center


Page 1:

The Illinois Digital Library Initiative:

Processing and Access Issues for Full-Text Journals

May 27, 1998

Pennsylvania State University

William H. Mischo University of Illinois at Urbana--Champaign

Grainger Engineering Library Information Center

Page 2:

Overview

• Testbed Goals & Mission.

• Testbed Issues.

• Testbed Technologies.

• SGML Processing Methodology.

• Accomplishments.

• Transaction Log Analysis.

• Federation Tests & Distributed Repository Model.

• Future Foci.

• What We Have Learned.

• Questions.

Page 3:

“The Business of a University is Information…The Production and Dissemination of Information is the Work of the University.”

– Tom Everhart, President, California Institute of Technology

Page 4:

Digital Library Initiative Program

• Funded by National Science Foundation (NSF), DARPA, and NASA.

• Awarded grants to 6 universities (and partners), September 1994--August 1998.

• The 6: Illinois, Michigan, Stanford, Berkeley, Carnegie Mellon, Santa Barbara.

• Each project: $4 million over 4 years.

• Illinois: Testbed, Research, Evaluation, Web Software.

Page 5:

Scholarship, Publishing, Libraries

• Changing Paradigm: Authors, Publishers, Libraries, A & I Services.

• Scholarly Publishing Issues (We Pay Twice).

• Publisher Costs (85% for First Copy).

• Idea of Universities as Publishers.

• Users’ Information Seeking Behavior (personal collection, colleagues, e-mail, Web, Library).

• Archiving Issues (Depository idea: Great Britain, Canada).

• Role of the Library (Function as well as Place).

Page 6:

Scholarship

• “The normal mode of scientific growth is exponential…(we are) entering a period of crisis marked by rapidly increasing concern over problems of manpower, literature, and expenditure that demand solution by reorganization.” – Derek de Solla Price, 1986.

• Year and Number of Journals:

  – 1665: 1

  – 1932: 6,000

  – 1981: 96,000

  – 1996: 165,000

• Avg. Price of U.S. Periodical rose 155%, 1986-96.
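The growth figures on this slide can be turned into annual rates. The sketch below is only an illustration: it assumes constant compound growth between the quoted endpoints, which the slide does not claim, and the function names are made up for the example.

```python
import math

# Figures copied from the slide: year -> approximate number of journals.
journal_counts = {1665: 1, 1932: 6_000, 1981: 96_000, 1996: 165_000}


def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1 / years) - 1


def doubling_time(rate):
    """Years needed to double at a constant compound annual rate."""
    return math.log(2) / math.log(1 + rate)


# Journal growth between 1932 and 1996 (assumes smooth growth in between).
journal_rate = annual_growth_rate(
    journal_counts[1932], journal_counts[1996], 1996 - 1932
)
journal_doubling = doubling_time(journal_rate)

# A 155% rise means the 1996 price is 2.55 times the 1986 price.
price_rate = annual_growth_rate(1.0, 2.55, 1996 - 1986)

print(f"journals: {journal_rate:.1%} per year, "
      f"doubling every {journal_doubling:.1f} years")
print(f"prices:   {price_rate:.1%} per year")
```

Under these assumptions, periodical prices grew faster per year than the number of journals did, which is one way to read the "crisis" in the quotation above.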

Page 7:

Testbed Goals & Objectives

• Construct Large-Scale, Multipublisher, SGML-Based Full-Text Testbed.

• Investigate Processing, Indexing, Normalization, Retrieval and Rendering.

• Study End-User Searching Behavior and Needs.

• Look at One-Stop-Shopping Retrieval Models (Integration of Services).

• Identify Models for Effective Retrieval in Electronic Full-Text Publishing Environment.

Page 8:

Testbed: 54 Journals, 39K Articles

All items in SGML & 2/3 in PDF

• American Institute of Physics--APL, JAP, RSI

  – 12,000 articles, 1995--, weekly updates.

• American Physical Society--PRL

  – 8,800 articles, 1995--, weekly updates.

• ASCE Journals (25 titles)

  – 5,000 articles, 1995--.

• IEE Proceedings and Electronics Letters

  – 7,400 articles, 1993--.

• IEEE Computer Society (14 titles)

  – 5,000 articles, 1996--.

Page 9:

Issues

• Toward the Holy Grail of Smart Document.

• Top Menu Integration and Cross-Resource Links.

• Searching over Full-Text of Journals vs. Abstract & Index Service Database.

• Full-Text Display (Mathematics Rendering: SGML, HTML, PDF, XML, MathML, TeX).

• Web-Based Problems & Connectivity.

• Breadth and Depth of Collections.

• User Response.

Page 10:

Testbed Technologies

• Open Text (HP-UX) Search Engine / LiveLink Web.

• Item Metadata for Normalization and Short-Entry Display.

• TCP/IP and HTTP for Full-Text, DCOM DLLs for A&I Links, Java Applets (Wordwheels).

• SGML rendering via Panorama.

• Custom Processing Programs on NT and Unix Platforms (Visual Basic, C++, Perl).

• Microsoft IIS (Web Retrieval, ASP for Links and Top Menu, Authentication w/ Bluestem).

Page 11:

Page 12:

Accomplishments (Overview)

• Distributed Repository Model (within Testbed & with AIP).

• Process & Retrieve from Multiple Publishers & Heterogeneous DTDs.

• Use of Aliasing (Normalization) for Cross-Repository Access from Single Client Search Argument.

• Item Metadata Definition.

• Dynamic Linking of Resources and Proxy A&I Service Access from / to Testbed.

• Focused User Studies.
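The aliasing (normalization) idea above, where one client search argument is translated for repositories that use different DTDs, might be sketched as follows. This is a hypothetical illustration: the element names, the alias table, and the query syntax are all invented for the example and do not come from the publishers' DTDs or from the Open Text search engine.

```python
# Hypothetical alias table: normalized field name -> element name used in
# each publisher's DTD. None of these element names are the real ones.
FIELD_ALIASES = {
    "article_title": {"AIP": "title", "APS": "atl", "ASCE": "arttitle"},
    "author": {"AIP": "au", "APS": "aut", "ASCE": "author"},
    "abstract": {"AIP": "abstract", "APS": "abs", "ASCE": "abstract"},
}


def expand_query(field, term, repositories=None):
    """Translate one normalized search argument into per-repository queries.

    The returned query strings use a made-up syntax, not a real engine's.
    """
    aliases = FIELD_ALIASES[field]
    targets = repositories if repositories is not None else sorted(aliases)
    return {repo: f'"{term}" within <{aliases[repo]}>' for repo in targets}


# One search argument from the client fans out to every repository.
queries = expand_query("article_title", "quantum well")
for repo, query in queries.items():
    print(repo, "->", query)
```

Keeping the mapping in a table rather than in code means a new publisher DTD only needs new table entries, which is one plausible reason to normalize at the metadata level.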

Page 13:

UIUC DLI Testbed Architectures Under Investigation

[Architecture diagram. Labels: Repositories (SGML, PDF) for IEE, IEEE CS, APS, ASCE, and AIP; sites Urbana and New York; Metadata and Indexes; Gateways; Clients; HTTP, Java, ASP; LiveLink; Authentication and Authorization; Testbed Links to A & I Services and Other Full Text.]

Page 14:

DeLIver Features

• Retrieval over Subset of Repositories.

• Forward (Citation) & Backward (Bibliography) Links to Testbed.

• Links to INSPEC, Compendex, Current Contents from Items & Bibliography.

• Ovid INSPEC/Compendex Proxy.

• Integration with Other Library Resources.

• Web-Kerberos Based Authentication.

• Capability of Digital Signing.

• User Transaction Logs.

Page 15:

Toplevel Menu Transactions (Total 19738)

Menu Item           Transactions
Compendex           2927
Online Catalog      6552
Wilson              496
Curr Serials        1656
Call Nos            298
New Books           324
Grainger Hm Pg      745
Faculty Interest    200
Comments            49
Ref Coll            1677
First Search        698
Reserves            380
DeLIver             519
ASTI                685
Sci Citation        498
CCON                446
PsychLit            92
INSPEC              793
Help Starting       186
FAQ                 90
News                54

Page 16:

Transaction Logs (1)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Top Bar              # Sessions   Total #
About DeLIver        427          536
Browse (all)         1585         2277
Browse Only          1012
Help                 175          190
Quicktips            189          245
Download Software    1001         1086
Other Resources      230          289

Page 17:

Transaction Logs (2)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Search Fields               # Sessions   Total #
Keyword                     2083         6090
Abstract                    194          747
Article Title               368          976
Article Author              377          926
All Author                  185          468
Citations                   39           74
Body of Article             76           336
Figure Caption              26           60
Table Caption               9            12
Journal Title               218          530
Title, Headings, Caption    118          358

Page 18:

Transaction Logs (3)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Average Length of Search: 727 seconds

Searching Characteristics     # Sessions   Total #
Display Full-Text             2079         4267
PDFs                          842          10104
SGMs                          1516         4660
Extended Citation             578          2212
Boolean Operators             856          5773
ANDs                          682
ORs                           204          668
NOTs                          30           79
KWIC Display                  389          780
Links to Inspec/Compendex     261          404
Multiword Search Arguments    1848         6134

Page 19:

Transaction Logs (4)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Publisher Choices    # Sessions   Total #
All Publishers       2535         9185
AIP                  65           238
APS                  33           84
ASCE                 96           247
IEE                  38           98

Page 20:

Transaction Logs (5)

4035 total end-user sessions (September through May).

3023 end-user sessions where searches were performed.

Points:

Not much use of Help or Quicktips;

a lot of Browsing but < 50% of search sessions;

Not jumping to A&I Services from DeLIver;

mostly Keyword Searching, also fair amount of Author, Article Title, Journal Title;

much more Display Full-Text than Extended Citation (why?);

25% of sessions use Boolean operators;

Multiword Search Arguments (complex terms, not single words) being entered;

Linking to INSPEC/Compendex in 20% of sessions;

predominantly All Publishers being searched.
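The session counts on the Transaction Logs slides can be converted into shares of sessions. The sketch below is only a reading aid: it assumes the 3023 search sessions are the right base for every row, which the slides do not state, so the rounded percentages quoted in the points above may rest on a different base.

```python
# Session counts copied from the Transaction Logs slides.
TOTAL_SESSIONS = 4035
SEARCH_SESSIONS = 3023

feature_sessions = {
    "Browse (all)": 1585,
    "Help": 175,
    "Quicktips": 189,
    "Keyword search": 2083,
    "Display Full-Text": 2079,
    "Extended Citation": 578,
    "Boolean Operators": 856,
    "Multiword Search Arguments": 1848,
    "Links to Inspec/Compendex": 261,
    "All Publishers": 2535,
}


def share(count, base):
    """Fraction of `base` sessions in which a feature was used."""
    return count / base


# Sessions without any search; this equals the "Browse Only" row.
browse_only_sessions = TOTAL_SESSIONS - SEARCH_SESSIONS

# Assumption: every feature is measured against search sessions.
shares = {
    name: share(count, SEARCH_SESSIONS)
    for name, count in feature_sessions.items()
}

print(f"browse-only sessions: {browse_only_sessions}")
for name, value in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<28}{value:6.1%}")
```

Passing `TOTAL_SESSIONS` instead of `SEARCH_SESSIONS` to `share` gives the alternative reading in which browse-only sessions are counted in the base.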

Page 21:

Testbed User Authentication

• Approach:

  – Authenticate Once per Session / Authorize per Use

• Current Mechanism:

  – On 1st Request, User Referred to Bluestem Script

  – Upon Bluestem Authentication:

    • Authorization Record Written to SQL Database

    • Cookie Set Which Points to that Record

• Need to Fix Redirection Problem with MS IE

• Need to Extend Outside Cookie-Setting Domain
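The "authenticate once per session, authorize per use" pattern described on this slide can be sketched roughly as below. Everything here is an assumption made for illustration: the table layout, the function names, and the use of SQLite stand in for the testbed's actual SQL database, and the Bluestem login step itself is not modeled.

```python
import secrets
import sqlite3

# In-memory stand-in for the SQL database of authorization records.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE auth_records ("
    "token TEXT PRIMARY KEY, user_id TEXT, collections TEXT)"
)


def start_session(user_id, collections):
    """Run once, after the external login succeeds.

    Writes an authorization record and returns the value to set as a cookie.
    """
    token = secrets.token_urlsafe(32)
    db.execute(
        "INSERT INTO auth_records VALUES (?, ?, ?)",
        (token, user_id, ",".join(collections)),
    )
    db.commit()
    return token


def is_authorized(cookie_token, collection):
    """Run on every request: look up the record the cookie points to."""
    row = db.execute(
        "SELECT collections FROM auth_records WHERE token = ?",
        (cookie_token,),
    ).fetchone()
    return row is not None and collection in row[0].split(",")


# Hypothetical user and collection names.
cookie = start_session("jdoe", ["AIP", "APS"])
print(is_authorized(cookie, "AIP"))
print(is_authorized(cookie, "ASCE"))
print(is_authorized("forged-token", "AIP"))
```

Because the cookie holds only an opaque pointer, the authorization data stays on the server and can be changed or revoked without touching the client.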

Page 22:

Future Work

• Implementation of Distributed Repository Model.

• Expand Breadth of Testbed (Loading Locally and Linking to other Repositories).

• Use of Digital Object Identifiers and other Standards.

• Rendering via HTML 4.0 & CSS, XML & XSL.

• Adding Dynamic Retrieval Mechanisms (Wordwheels, Co-Occurrence Matrices).

• Expand Simultaneous Search Mechanisms.

• Expanded User Studies.

Page 23:

SGML vs. HTML vs. XML

• SGML:

  – Supports Powerful Indexing, Search & Retrieval

  – But Client, Delivery, & Rendering Issues Remain

• HTML:

  – Ubiquitous; Rendering Has Become More Robust

  – But Remains Presentation Oriented, Less Semantic

• XML:

  – Subset Retains SGML Features of Primary Interest

  – But XML Is New, Untested, Under-Supported

Page 24:

Converting DLI Testbed to XML

• XML Differences from SGML:

  – No SHORTREF (Tag Minimization)

  – Tags Are Case Sensitive

  – Restrictions on Entities, Attributes, Link Mechanisms

  – Empty Tags Handled Differently

• MathML vs. ISO 12083 Math

  – MathML a Major Departure -- Adds Semantics

  – Focus on Java / ActiveX for Initial Deployment; Long-Term Success May Hinge on XSL / DSSSL

    • ‘Content-Markup’ requires XSL, Dynamic HTML functionality
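Some of the XML differences listed on this slide can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the element and attribute names (`article`, `title`, `graphic`, `file`) are invented for the example and are not taken from the testbed DTDs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the string."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


samples = {
    # Start and end tags match exactly.
    "matching case": "<article><title>Testbed</title></article>",
    # XML tags are case sensitive, so <Title> is not closed by </title>.
    "mixed case": "<article><Title>Testbed</title></article>",
    # No tag minimization: every start tag needs its end tag.
    "omitted end tag": "<article><title>Testbed</article>",
    # XML marks an empty element with a trailing slash.
    "empty element, XML form": "<article><graphic file='f1'/></article>",
    # The SGML-style empty tag has no end tag, so XML rejects it.
    "empty element, SGML form": "<article><graphic file='f1'></article>",
}

results = {label: is_well_formed(markup) for label, markup in samples.items()}
for label, ok in results.items():
    print(f"{label:<26}{'well-formed' if ok else 'rejected'}")
```

Markup that an SGML parser could accept with the help of a DTD fails here on well-formedness alone, which suggests that converting a testbed of this kind involves normalizing tag case and closing or self-closing every element.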

Page 25:

CSS, XSL, DSSSL

• CSS1 & CSS2 Have Added:

  – Overlapping Glyphs, Absolute & Relative Positioning

  – Downloadable Fonts (Platform, Browser Variable)

  – Styling by Attributes, 2 Levels of Hierarchy

• XSL, DSSSL, DSSSL-O:

  – XSL Uses XML Notation, Is Extensible (ECMAScript)

  – Allows More Extensive Manipulation In Formatting

    • Supports Re-arrangement, Navigator Frames, etc.

  – Not Yet Implemented in Production Browsers

Page 26:

What We Have Learned (1)

• Power of SGML for Indexing & Retrieval.

• Problems with rendering mathematics--SGML, TeX, HTML, XML, MathML.

• Depth and breadth of collection (TULIP / Red Sage Syndrome; note use of Ovid client).

• Local Processing Implications

• Metadata needs and robustness of Distributed Model.

Page 27:

What We Have Learned (2)

• Efficacy of Full-Text (stand-alone, integrated with A & I, part of TOC Service).

• The Idea of a Digital Library in the Digital Chaos--the role of the Gateway and Linking of Resources.

• Changing roles of Authors, Publishers, A & I Services, Libraries.

• These Technologies Will Transfer to the Web (CSS I & II, HTML 4.0, Dynamic HTML, XML).