Collaborative Digital Libraries: Their Virtual Collections and Aggregating Their Metadata

Research Proposal

Bonnie MacGregor

San Jose State University

Libr 285

Spring 2010


Introduction

Libraries, museums, and archives offer a rich medley of information, artifacts, and

primary source materials that reflect our shared human interests and history. While each

institution remains essential in performing its traditional role, many are focusing their efforts

on creating collaborative digital libraries and making their virtual collections visible and

accessible to the world. A digital library that blends resources from varying institutions

depends on collaborative exchanges and contributions. Working together, these specialized

digital libraries and their virtual collections can be distributed across different servers, be

owned by different organizations, and be displayed in many different orderings and

arrangements. Defining and describing these virtual collections is an important function in

making these collections visible and accessible to our users, but not all institutions describe their

resources in the same way, nor do all institutions rely on the same standards that govern

description.

Although the advent of the Open Archives Initiative Protocol for Metadata Harvesting

(OAI-PMH) has facilitated sharing of item-level descriptive metadata and harvesting across

institutional lines, one concern is that when item-level metadata is created it retains implicit

contextual information associated with the local setting in which it was created. When that

item-level metadata is removed from that context, inherent and referential information is lost

(Foulonneau et al., 2005). Furthermore, when that data is harvested into other, larger

heterogeneous collections of records, users may find it difficult to retrieve needed results when

records lose this contextual information after aggregation.


“Contextual information about the authority of a resource, its relationship to other resources,

its format and type, geographic and temporal coverage, and restrictions and usage rights can be

lost when item-level metadata is aggregated without the retention of implicit context”

(Foulonneau et al., 2005, p. 32).
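
The loss of context described above can be illustrated with a minimal sketch. The record fields and values below are hypothetical and are not drawn from NSDL or from Foulonneau et al.; the sketch only assumes that an aggregator has access to a collection-level description alongside each harvested item-level record and attaches it before indexing.

```python
# Hypothetical item-level record as harvested; its implicit local context is absent.
item = {"title": "Main Street, looking north", "date": "1912"}

# Hypothetical collection-level description supplied by the contributing institution.
collection = {
    "collection": "Example County Historical Photographs",
    "coverage": "Example County",
    "type": "photograph",
    "rights": "Educational use only",
}


def with_context(item_record, collection_record):
    """Return a copy of the item record with collection-level fields attached.

    Fields already present on the item take precedence over collection defaults.
    """
    enriched = dict(collection_record)
    enriched.update(item_record)
    return enriched


enriched_item = with_context(item, collection)
print(sorted(enriched_item))  # item fields plus the inherited collection fields
```

On its own, the harvested record says nothing about geographic coverage, type, or rights; those statements exist only at the collection level and have to travel with the item if they are to survive aggregation.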

The National Science Digital Library (NSDL) is one example of a highly specialized union

database which aggregates resources from varying institutions and offers organized access to

high quality resources and tools. The NSDL team at Cornell University has created a new

open-source library platform called NCore (for NSDL Core). One of the key features of the

new architecture is its use of aggregations. While the data model has several complexities,

of particular interest are the aggregator objects, a special type of data structure that

collects and provides key contextual information contained and retained

in the harvested item-level metadata, as well as in the collection-level metadata.

Research Question

Do NSDL NCore’s aggregation objects improve a digital object’s contextual metadata?

Literature Review

According to the Digital Library Reference Model (DELOS), a “digital library is an

organization that comprehensively collects, manages and preserves for the long term rich

digital content, and offers to its user communities specialized functionality on that content of

measurable quality” (Chang et al., 2004, p. 335). Digital libraries are freed from the boundaries

of physical space and media and operate as rich and adaptive networked systems. Digital

libraries and their virtual collections offer users a plethora of enriched access points and


alternative methods for browsing and exploration, while maintaining a functionality that allows

their collections to be segmented, rearranged, annotated, enhanced, and integrated in ways

not possible before. Thus, digital libraries offer the advantage of providing access to multiple

objects existing in separate collections and repositories, and aggregating them in different ways

to coexist. Yet a commonly noted risk is the loss of contextual metadata when aggregating

objects from differing repositories. “Item-level metadata records are typically written at a level

of descriptive granularity most appropriate to a local application, when item-level metadata is

removed from that context, implicit and referential information is lost” (Foulonneau, Cole,

Habing, and Shreeves, 2005, p. 32). As the paper Using Collection Descriptions to Enhance an

Aggregation of Harvested Item-Level Metadata notes, contextual information on the authority of a

resource, its relationship to other resources, its format and type, its geographic and temporal

coverage, and its use restrictions and rights can be lost (Foulonneau et al., 2005). This

information is vital for articulating the scope, intent, and function of a record, not only at the

collection level but also at the item level of an object.

Defining Context

Contextual information is captured through the creation of metadata; it can be bibliographic

data, provenance data, and/or social and cultural data. More recently, it has come to be

understood more expansively as a means of understanding patterns of use; pedagogical

goals; the nature of learners’ educational systems; and learners’ abilities, preferences, and prior

knowledge. It can also refer to capturing opinions, comments, and reviews about library

resources and their history of use (Lagoze, Krafft, Payette, and Jesuroga, 2005). According to


Krafft, Birkland, and Cramer (2008), many libraries have identified user-contributed content,

personalization, and re-purposing of content as essential value-add features of “Next

Generation” digital libraries, improving the context around those resources, and enriching them

with new information and relationships that express the usage patterns and knowledge of the

library community. “The digital library then becomes the milieu for information collaboration

and accumulation – much more than just a place to find information and access it” (Lagoze et

al., 2006, p. 2). Many digital libraries have relied on an information model based on the

simplicity of a union catalog, such as a ‘search and access’ model which, at its core, collects,

indexes, and provides queries over a catalog of metadata records (Geisler et al., 2002). There has

been recent consensus that a more expansive view on digital libraries is necessary, one that

identifies digital libraries as collaborative, adaptive, and reflexive systems, and one that

elaborates the definition of user-contributed contextual information.

“They should be collaborative, allowing users to contribute knowledge to the library, through annotations, reviews, and the like, or passively through their patterns of resource use. In addition, they should be contextual, expressing the expanding web of inter-relationships and layers of knowledge that extend among selected primary resources. In this manner, the core of the digital library should be an evolving information base, weaving together professional selection and the ‘wisdom of crowds’” (Lagoze et al., 2006, p. 57).

The basic ‘search and access’ record-oriented model employed by most digital (and traditional)

libraries has a limited ability to fully model this multi-dimensional information context (Lagoze

et al., 2005).


Defining Aggregators

The National Science Digital Library (NSDL) aims to push the frontiers and capabilities of

digital library technology through its creation of an open-source architecture software platform

NCore. NCore is a techno-ecosystem that “can support digital library/repository needs ranging

from cultural heritage materials in the arts and humanities, to scholarly communication and

collaboration, to education at every level in every discipline” (Krafft et al., 2008, p. 313). The

central data model and architecture of NCore is quite complex, yet it was designed to represent

multiple types of descriptions offered by their contributors. Resources themselves are not

homogeneous. A digital library will collect a variety of resources, e.g., images, audio, simulations,

and multi-media learning objects. Supporting this diversity raises modeling complexities, in

particular how best to accommodate information at the user interface level while

simultaneously representing the special characteristics of each type of resource, also known as

its context. “In such an environment, data surrounding a resource, such as a subject’s metadata

or membership in an aggregation, does not purely originate from a single cohesive and

consistent curation policy, but from a variety of independent agents with their own

motivations” (Krafft et al., 2008, p. 314).

What matters for our examination of the NSDL NCore model is the

system’s implementation of ‘aggregator’ objects. These aggregations are first-class objects that

occupy a central role in representing and mediating context within the system. Five primary

objects are involved in the schema: “the resource object that contains or specifies

content, a metadata object that contains structured statements about a resource, an


aggregation object that collects resources with other aggregations into a set, a metadata

provider object that provides provenance information, and finally an agent object that specifies

the source for the metadata statements and the selector for aggregations” (Krafft et al., 2008, p. 314).

Together, these five object types function as the building blocks of the many complex structures

occurring within the digital library. The second major release of the NSDL technical

infrastructure, NSDL 2.0, supports creating this web of context around the resources in the

library, in effect claiming that users will be able to discover resources by their context.
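
To make the five object types concrete, the following is a minimal sketch in Python. It is not NCore's actual schema or API: the class names follow the quoted description, while every field name, the `resources` helper, and the example values are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Agent:
    """Source of metadata statements and selector for aggregations."""
    name: str


@dataclass
class Resource:
    """Contains or specifies content (here, simply a URI)."""
    uri: str


@dataclass
class MetadataProvider:
    """Carries provenance information for a set of metadata statements."""
    name: str
    agent: Agent


@dataclass
class Metadata:
    """Structured statements about one resource."""
    describes: Resource
    provider: MetadataProvider
    statements: Dict[str, str]


@dataclass
class Aggregation:
    """Collects resources and other aggregations into a set."""
    name: str
    selector: Agent
    members: List[Union[Resource, "Aggregation"]] = field(default_factory=list)

    def resources(self) -> List[Resource]:
        """Flatten nested aggregations into a single list of resources."""
        found: List[Resource] = []
        for member in self.members:
            if isinstance(member, Aggregation):
                found.extend(member.resources())
            else:
                found.append(member)
        return found


# Hypothetical usage: one aggregation nested inside another.
library = Agent("Example Library")
lesson = Resource("http://example.org/lesson-plan")
simulation = Resource("http://example.org/simulation")
inner = Aggregation("Lesson plans", library, [lesson])
outer = Aggregation("Physics", library, [inner, simulation])
print([r.uri for r in outer.resources()])
```

Because an aggregation may contain other aggregations, membership itself becomes a form of context: a resource can be discovered through every set, at any level of nesting, to which some agent has assigned it.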

Typically a user must examine a resource’s information included in the catalog or else

examine the resource itself. Over several years of operation, NSDL has consistently received

suggestions asserting that users do not want a simple list of resources but rather want to

understand how to use them. The context of a resource (what benchmarks or educational

standards it meets; how it relates to other resources; how teachers have incorporated it into a

lesson plan; and what teachers, scientists, and librarians have to say about it) is critical in

making the digital library effective (The National Science Digital Library [NSDL], 2006).

Contextualization is a critical component in active learning. Gaining an understanding of a

concept includes the process of relating in a meaningful way to an idea, of seating it cognitively

in personal experience or understanding (p. 18). NSDL 2.0 promises two critical features: it will easily

represent the web of related information around and among library resources, and it will make

it very easy for qualified library users to understand and add new contextualization to content

within the library. Therefore, this study seeks to evaluate whether the National Science Digital

Library’s (NSDL) NCore aggregation objects have in fact improved contextual metadata, based


on the evaluation and analysis of respondents’ level of satisfaction with records retrieved from

their system.

Methodology

Study Population

Our target study population consists of students currently enrolled in the master’s program of

the San Jose State University School of Library and Information Science. Due to

the subject matter of the curriculum, students will be knowledgeable about the technical

information infrastructures, classifications, and terminology associated with computer

engineering. This study population will also have specialized understanding in fields concerning

information retrieval processes and human-computer interactions. Respondents involved in

this exploratory study must have completed 25 units or more within the graduate program, to

ensure that respondents have an adequate level of knowledge needed to reliably complete this

evaluation study.

Sampling Design

This study will rely on a systematic sampling technique with a random start to select

respondents. By locating and evaluating the University’s current enrollment records, a list will

be compiled that contains all potential respondents meeting the sampling frame requirement

mentioned above. The study aims to identify at least 100 persons who fit the profile

and recruit at least 50 respondents to participate. To guard against bias in the sample,

a sampling interval, specifically a numerical value between one and ten, will be selected at

random to begin the selection procedure. Once the interval has been selected, every kth unit in


our list will be chosen for inclusion (Babbie, 2009). For example, if 7 is the chosen sampling

interval, every seventh name on the list will be chosen for inclusion. This probability sampling

technique is intended to give all members of the population an equal chance of being

selected and to yield a sample representative of the population from which it is drawn.
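
The selection procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the sampling frame is a hypothetical list of 100 placeholder names, the interval is drawn at random between one and ten as described above, and the random start is taken within the first k names. The function name and parameters are our own.

```python
import random


def systematic_sample(frame, max_interval=10, rng=None):
    """Pick a random interval k in 1..max_interval and a random start
    within the first k names, then take every kth name from the frame."""
    rng = rng or random.Random()
    interval = rng.randint(1, max_interval)  # sampling interval k
    start = rng.randrange(interval)          # random start (0-based offset)
    return interval, frame[start::interval]


# Hypothetical sampling frame of 100 eligible students.
frame = [f"student_{i:03d}" for i in range(1, 101)]
interval, sample = systematic_sample(frame, rng=random.Random(2010))
print(interval, len(sample))
```

Note that the interval bounds the achievable sample size: with a frame of 100 names and an interval of 7, the procedure yields 14 or 15 names, so reaching 50 respondents from a frame of this size would require a small interval or a longer list.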

While it is important to note that no probability sample can ever be perfectly

representative, systematic sampling carries an additional danger:

periodicity. If the elements of the list are arranged in a cyclical pattern that

coincides with the sampling interval, the resulting sample may be biased (Babbie, 2009). Facilitators of this

study will be aware of this problem, and if patterns begin to emerge that are predictable or

attributable to periodicity, a new sampling interval will be chosen and/or a newly generated list of

names will be produced.

A letter of intent (see Appendix A) will be emailed to each randomly

selected respondent. This cover letter will include information about the purpose of the study,

details regarding when and where the study will be administered, confidentiality terms and

conditions, and compensation. In addition to the letter of intent, a more in-depth introduction

and literature review will be distributed in order to clarify the intentions of

the study and what it hopes to accomplish.

Data Collection Instruments

Standardized survey questionnaires will be administered during face-to-face interviews

and will be the primary means of data collection for this evaluation study (see Appendix B).

Interviews will be conducted in groups of ten. Because individuals are the unit of analysis and


their level of satisfaction about contextual metadata of a particular record is under evaluation,

survey research provides the best method for measuring the attitudes and opinions of each

respondent (Babbie, 2009). Interviews will be conducted in a semi-structured manner where

both efficiency and probing can occur.

Study subjects will be required to perform basic IR functions on the NCore platform

administered through NSDL. Subjects will be asked to query the system and retrieve one book

record, one image record, and one primary document record. Respondents will then observe

the characteristics of each record, paying special attention to ‘contextual information’ (as

defined in the literature review above). A 30-minute time limit will be set for all

respondents’ searches.

Once respondents have completed basic IR processes on the NSDL NCore platform,

they will be asked to sit down with the lead investigator and/or another properly

trained interviewer, who will ask a series of open-ended and closed-ended questions

from a prepared questionnaire. Each interviewer will be responsible for digitally recording

each interview administered as well as transcribing additional notes where more elaborate

responses are required and given by subjects. Interviewers will act as neutral mediums and

their presence should not affect a respondent’s perception. Each interviewer must transcribe

responses verbatim. “No attempt should be made to summarize, paraphrase, or correct bad

grammar” (Babbie, 2009, p. 265).

According to Babbie (2009), this method has several advantages:

interview surveys achieve higher response rates and higher completion rates.


Interviews also decrease the number of incomplete answers, and the interviewer has the ability

to make observations such as respondents’ reactions to questions (Babbie, 2009). In addition,

survey interviews are particularly flexible. Many questions can be asked on a given topic, giving

the researcher considerable flexibility in analysis.

Data Analysis Techniques

The process begins by converting the data into numerical form, which must happen before any statistical

analysis can be performed. Code categories will be developed after the data collection process

has been completed in order to identify categories that reflect our research purpose as well as

the logic that emerges from the data (Babbie, 2009). Because our survey interview

contains both open-ended and closed-ended questions, relying on what emerges from our data

collection will be essential in determining code categories that are both exhaustive and

mutually exclusive. While hired interviewers will be responsible for data collection, data

analysis will be performed solely by the principal investigator, eliminating the need to

train coders in the definitions of code categories and in the proper use of those

categories. To detect discrepancies in the coding scheme, the

principal investigator will ask a colleague to code a

sample of the data in order to establish whether similar assignments are being made and to

highlight any discrepancies.
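
One simple way to compare the two sets of code assignments is percent agreement. The sketch below is an assumption on our part, since the proposal does not name a specific agreement measure, and the codes shown are invented for illustration.

```python
def percent_agreement(codes_a, codes_b):
    """Share of responses to which both coders assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same responses")
    if not codes_a:
        raise ValueError("no responses to compare")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


def disagreements(codes_a, codes_b):
    """Positions of the responses on which the two coders differ."""
    return [i for i, (a, b) in enumerate(zip(codes_a, codes_b)) if a != b]


# Hypothetical codes assigned to the same six open-ended responses.
investigator = [1, 2, 2, 3, 1, 4]
colleague = [1, 2, 3, 3, 1, 4]

print(round(percent_agreement(investigator, colleague), 2))
print(disagreements(investigator, colleague))
```

The list of disagreeing positions points the investigator to the specific responses, and therefore the specific code definitions, that need to be revisited.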

A codebook will then be created, converting data categories into numerical codes.

The codebook will record the location of each variable, giving the investigator the

ability to look up the meaning of the codes, which ultimately represent the different


attributes of the variables under evaluation (Babbie, 2009). In essence, the codebook tells the

researcher where to find each variable and what each code assigned to that variable represents.

The principal investigator will review each questionnaire and begin coding the data directly

onto the questionnaire. After completion, the investigator will enter the data into an Excel

spreadsheet that can later be imported into statistical analysis

software.
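
A codebook of this kind can be sketched as a simple lookup structure. The variable names, column positions, and numeric codes below are hypothetical; only the answer options are taken from questions 1 and 7 of the questionnaire in Appendix B.

```python
# Hypothetical codebook: where each variable lives and what each code means.
CODEBOOK = {
    "q1_retrieved_all": {
        "column": 1,
        "codes": {"Yes": 1, "No": 0},
    },
    "q7_amount_of_information": {
        "column": 7,
        "codes": {"Too much": 1, "Too little": 2, "Undecided": 3},
    },
}


def encode(variable, answer):
    """Translate a respondent's answer into its numeric code."""
    return CODEBOOK[variable]["codes"][answer]


def decode(variable, code):
    """Translate a numeric code back into the attribute it represents."""
    for answer, value in CODEBOOK[variable]["codes"].items():
        if value == code:
            return answer
    raise KeyError(f"code {code!r} is not defined for {variable}")


print(encode("q7_amount_of_information", "Too little"))
print(decode("q1_retrieved_all", 1))
```

Keeping the column position next to the code definitions reflects the two jobs described above: locating a variable in the data file and interpreting the codes stored there.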

Once the data has been fully quantified, quantitative analysis will commence. Univariate

analysis, that is, the analysis of a single variable at a time, will be performed on the data (Babbie,

2009). In presenting this univariate data, a measure of central tendency, such as an average, will

be used. The most frequently occurring attribute, also known as the mode, will be the

primary means of calculating the average. The advantage of using averages lies in the

inherent reduction of raw data to the most manageable form, meaning a single number (or

attribute) can represent all the detailed data collected (Babbie, 2009).
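
As a minimal sketch of this univariate summary, the following Python computes a frequency distribution and the mode for one variable. The ratings are invented for illustration and stand in for responses to the 1-to-5 satisfaction item in Appendix B; `multimode` from the standard library is used so that ties are reported rather than hidden.

```python
from collections import Counter
from statistics import multimode


def summarize(ratings):
    """Return the frequency of each attribute and the mode(s) of the variable."""
    frequencies = dict(sorted(Counter(ratings).items()))
    return frequencies, multimode(ratings)


# Hypothetical satisfaction ratings from seven respondents.
ratings = [4, 3, 4, 5, 2, 4, 3]
frequencies, modes = summarize(ratings)
print(frequencies)
print(modes)
```

Reporting the frequency distribution alongside the mode shows how much detail the single summary value leaves out.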

Project Schedule

The table below shows the main objectives or tasks that need to be accomplished. A

full year has been allocated to complete the study. Objectives fall under an initial stage

and four numbered stages of the overall project design and schedule. Each objective

identified has an approximate deadline that falls within a particular range of months.

Because specific processes are highly iterative, setting flexible deadlines will allow revisions to

take place, yet tasks are expected to be completed by the end of the time allocated.


Objective                                                       Approx. Scheduled Completion (2010)

Initial Stage                                                   January - February
    Form research team
    Identify research topic, purpose, objectives, and outcomes
    Define research methods

Stage I                                                         March - May
    Complete thorough literature review
    Define research methods and measurement techniques
    Complete research proposal
    Submit proposal to library director for approval
    Submit proposal to SJSU IRB Board for approval
    Design questionnaire

Stage II                                                        June - August
    Contact respondents
    Schedule interviews
    Train interviewers
    Conduct data collection

Stage III                                                       September - November
    Code data for analysis
    Perform data analysis
    Complete draft of research findings

Stage IV                                                        December
    Review any changes or corrections
    Submit for publication


Qualifications

The principal investigator is currently a graduate student in the School of Library and

Information Science at San Jose State University. With a keen interest in library, museum, and

archival practices, she continually evaluates new technologies and studies that seek to blur

institutional practices in order to create more dynamic and collaborative libraries. In addition

to working at two academic libraries as well as in a museum archival department, the

investigator possesses specialized knowledge concerning professional museum standards as

well as curatorial methods, procedures, and techniques alongside more traditional and

specialized library practices. Her research interests include digital or hybrid libraries, open

source architectures, and special collections.

Significance of Work & Summary

“The ultimate goal of digital library evaluation is to study how digital libraries transform

research, education, learning and life” (Sudatta, Chowdhury, Landoni, Gibb, and Forbes, 2006,

p. 659). Digital libraries are difficult to evaluate due to their richness, complexity, and variety of

uses and users. Recent developments in the field have significantly influenced the ways in

which users access and use electronic information, and the issues explored typically have to do

with information retrieval and usability studies. To date there is no standard model for digital

library evaluation, nor is there a comprehensive set of models and toolkits that can be used by

digital libraries (Sudatta et al., 2006).

There is a need for more studies that focus on other factors involved with digital library

creation such as implementation issues, hardware, software, networking, data formats, access

and transfer times, failure rates, and development and maintenance costs. This study seeks to


contribute quantified data to the fields concentrated on hardware and implementation issues

faced by digital libraries. By basing the study in a real-world application such as NCore, this

study aims to provide a current quantitative analysis measuring how well NSDL NCore

aggregators improve contextual information. The results will be analyzed based on a

respondent’s level of satisfaction with a record’s metadata.

Studies such as this one, which deal with complex hardware and software elements,

are important not only in contributing to the scientific field but also in playing a leadership

role, so that other organizations can adopt methods and practices created by NCore.

Such studies can not only assist with strategic planning with respect to services and

management issues but also investigate the use and impact of this open-source information

architecture and suggest ways in which existing services can be improved.


References

Babbie, E. R. (2009). The Practice of Social Research (12th ed.). Pacific Grove, CA: Wadsworth Publishing.

Chang, M., Leggett, J., Furuta, R., Kerne, A., Williams, P., Burns, S., & Bias, R. (2004, June).

Collection understanding. Presented at the Joint Conference on Digital Libraries,

(Tucson, Arizona), ACM, 334-342.

Foulonneau, M., Cole, T., Habing, T., & Shreeves, S. (2005, June). Using collection descriptions to

enhance an aggregation of harvested item-level metadata. Presented at the Joint

Conference on Digital Libraries, (Denver, Colorado), ACM, 32-41.

Geisler, G., Giersch, S., McArthur, D., & McClelland, M. (2002, July). Creating virtual collections

in digital libraries: Benefits and implementation issues. Presented at the Joint

Conference on Digital Libraries, (Portland, Oregon), ACM, 210-218.

Krafft, D., Birkland, A., & Cramer, E. (2008, June). NCore: Architecture and implementation of a

flexible, collaborative digital library. Presented at the Joint Conference on Digital

Libraries, (Pittsburgh, Pennsylvania), ACM, 313-322.

Lagoze, C., Krafft, D., Cornwell, T., Eckstrom, D., Jesuroga, S., & Wilper, C. (2006). Representing

contextualized information in the NSDL. Presented at the European Conference on

Digital Libraries, (Alicante, Spain), Springer, 1-12.

Lagoze, C., Krafft, D., Cornwell, T., Eckstrom, D., Jesuroga, S., & Wilper, C. (2006). Metadata

aggregation and “automated digital libraries”: A retrospective on the NSDL experience.

Presented at the Joint Conference on Digital Libraries, (Chapel Hill, NC), ACM, 33-67.

Lagoze, C., Krafft, D., Payette, S., & Jesuroga, S. (2005). What is a digital library anymore,

anyway? Beyond search and access in the NSDL. D-Lib Magazine, 11(11). Accessed via

http://www.dlib.org/dlib/november05/lagoze/11lagoze.html

Marshall, Y., Zhang, H., Chen, A., Lally, R., Shen, E., Fox, A., & Cassel, L. (2003). Convergence of

knowledge management and e-Learning: The getsmart experience. Presented at

ACM/IEEE Joint Conference on Digital Libraries, (Houston, TX), ACM, 49-67.

Sudatta, C., Chowdhury, S., Landoni, M., Gibb, M., & Forbes, A. (2006). Usability and impact of

digital libraries. Online Information Review, 30(6), 656-680.

The National Science Digital Library. (2006). NSDL 2006 Annual Report: Leveraging Collaborative Networks. Retrieved from http://nsdl.org/news/?pager=publication


Appendix A: Recipient’s Letter of Intent

Hello!

You have been selected through a random sampling method to participate in a study sponsored by the Association of College and Research Libraries (ACRL) in collaboration with San Jose State University School of Library and Information Science.

This study seeks to evaluate whether the National Science Digital Library’s (NSDL) NCore aggregation objects have improved contextual metadata, based on users’ level of satisfaction with records retrieved from the system.

You will be asked to perform a simple information retrieval (IR) process on the NSDL digital library for an allocated 30 minutes. Once records have been retrieved and evaluated, you will be asked to sit down with an interviewer, who will administer a questionnaire and record your responses. From start to finish, the study will take approximately one hour. The results of this study will help further the understanding of the role contextual metadata plays when digital libraries aggregate their collections from differing repositories.

This survey is voluntary, and you may refuse to participate if you wish. Participation in this

study does not pose any direct benefits or risks to you. All respondents who complete the study will be given a one-year subscription to WIRED magazine. All participants’ responses will be kept confidential, and all identifiable information will be removed from the results. Only researchers involved in this study will have access to the data collected from the survey.

If you are interested in participating in our study, please contact the principal investigator.

This project has been reviewed and approved by the SJSU Institutional Review Board. Questions about your rights as a participant may be sent to IRB Coordinator Alena Filip by email at [email protected] or by phone at (408) 555-2479. If you have any questions about this study, please contact the principal investigator Bonnie MacGregor by email at [email protected]

Thank you for your participation! Best regards, ACRL & SJSU


Appendix B: Survey/Questionnaire

The National Science Digital Library’s NCore Metadata

Name of interviewer: ____________________ Date of interview: ______________________

Serial No. ______________________________

1) Were you able to retrieve all three records from the NSDL online catalog? Yes No

2) Based on your observation of the records’ information, circle the number that best represents your level of satisfaction with the data provided. [On this Likert scale, 1 is the lowest level of satisfaction and 5 is the highest.]

1 2 3 4 5

3) Besides the records’ bibliographic data (for example, author, title, and year published), please circle all other types of data that were present in the records.

Provenance data Descriptive statements Comments Opinions Reviews Restrictions Usage rights

4) Based on your answer to the previous question, please describe what you felt was missing from the data or what you would have found helpful that was not included in the record. [Open-ended]

5) After retrieving the three required records, do you feel that each record’s contextual information adequately reveals relationships to other materials within the collection?

Yes No


6) Were you able to add content to the record?

If yes, please explain how If no, please explain why

7) When observing a record’s metadata, did you feel that there was too much information provided or too little?

Too much Too little Undecided

8) Were you able to identify the originating provider (repository) to which a record belonged? Yes No

9) Was the format and type of record evident just from observing the metadata? Or were more steps in the IR process needed to locate such information? Please explain. [Open-ended]

10) NSDL produced rich and dynamic results based on my information request.

Strongly Agree Agree Disagree Strongly Disagree No Opinion

