Digital Assets Repository 3.0 PASIG User Group Conference Noha Adly Bibliotheca Alexandrina

Page 1: Digital Assets Repository 3 - web.stanford.eduweb.stanford.edu/group/dlss/pasig/PASIG_May2011/IntroductionRecap/... · persistent identifier ... • Load balancing among nodes. Online

Digital Assets Repository 3.0

PASIG User Group Conference

Noha Adly, Bibliotheca Alexandrina

DAR 3.0

• DAR manages the full lifecycle of a digital asset: its creation, ingestion, metadata management, storage, dissemination, publishing and archival

• An eco-system of components for an integrated institutional repository.

DAR 3.0

• Modular design with integrated components

• Consolidation of assets

• Flexible content model for different types of digital objects based on current standards

• Integration with different sources of metadata, e.g. ILS, repositories, databases, …

• Repository-bound applications

• Preservation

Conceptual Overview

• Digital Assets Factory (DAF) – Flexible management for the digitization workflow

– Unified means of ingestion into the system

– Support both physical and born digital materials

• Digital Assets Metadata (DAM) manages the metadata even in an incomplete state.

• Digital Assets Publishing (DAP) components allow applications to synchronize objects and their metadata stored in their databases/indexes with the repository

• Digital Assets Keeper (DAK) manages access to the object files, versions and caching.

Conceptual Overview

• Collections/Sets:

– DAR manages one instance of the object

– Objects are consolidated into sets/collections

– An object can belong to different sets

– Objects are shared among applications

– Applications sync with the repository to get the latest updates of their objects

– Applications maintain different derivatives of same object

– Relies on RDF to define sets and relations between objects
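
The set model above can be pictured as subject-predicate-object triples. The following is a minimal sketch in plain Python, not DAR's implementation: the predicate name `isMemberOf`, the class `TripleIndex`, and the identifiers are all assumptions made for illustration.

```python
# Hypothetical predicate name; the slides do not give DAR's RDF vocabulary.
IS_MEMBER_OF = "isMemberOf"


class TripleIndex:
    """A tiny in-memory stand-in for a triple store (illustration only)."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def sets_of(self, object_id):
        """Return every set the given object belongs to."""
        return sorted(o for s, p, o in self._triples
                      if s == object_id and p == IS_MEMBER_OF)

    def members_of(self, set_id):
        """Return every object that belongs to the given set."""
        return sorted(s for s, p, o in self._triples
                      if o == set_id and p == IS_MEMBER_OF)


index = TripleIndex()
# One stored object, two memberships: the object itself is not duplicated.
index.add("obj:1", IS_MEMBER_OF, "set:maps")
index.add("obj:1", IS_MEMBER_OF, "set:exhibition")
index.add("obj:2", IS_MEMBER_OF, "set:maps")

print(index.sets_of("obj:1"))
print(index.members_of("set:maps"))
```

Because membership lives in the triples rather than in the object, adding an object to another set is a single new triple and leaves the stored instance untouched.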

Conceptual Overview

• Discovery layer

– Core files are kept online on spinning drives

– Simple derivatives for display

– Users can browse and search using simple viewers

– Provides full text search across the whole collection, based on the access rights granted to the user.

• Ingestion plugins

– Flexible integration with different sources of metadata

– Allow ingestion and synchronization with external sources

Digital Assets Factory (DAF)

• Full control over the digitization process workflow

• Configurable and flexible management tool for any digitization workflow

• Flexible workflow definition including:

– Definition of sequence of phases

– Pre-phase and post-phase checks

– Redirects

• Special workflows are defined for different object types
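
A workflow made of ordered phases with pre-phase checks, post-phase checks and redirects could be sketched as follows. This is an illustrative model only, not DAF's actual configuration format: the `Phase` fields, the `run_workflow` function and the example phases are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Check = Callable[[dict], bool]


@dataclass
class Phase:
    name: str
    action: Callable[[dict], None]
    pre_check: Optional[Check] = None    # must pass before the phase runs
    post_check: Optional[Check] = None   # must pass before moving on
    # Hypothetical: phase to jump back to when the post-phase check fails.
    redirect_on_failure: Optional[str] = None


def run_workflow(phases: List[Phase], job: dict, max_steps: int = 20) -> List[str]:
    """Run phases in sequence and return the names of the phases visited."""
    position = {p.name: i for i, p in enumerate(phases)}
    visited = []
    i = 0
    while i < len(phases) and len(visited) < max_steps:
        phase = phases[i]
        if phase.pre_check and not phase.pre_check(job):
            raise RuntimeError(f"pre-phase check failed for {phase.name}")
        visited.append(phase.name)
        phase.action(job)
        if phase.post_check is None or phase.post_check(job):
            i += 1                                   # continue the sequence
        elif phase.redirect_on_failure is not None:
            i = position[phase.redirect_on_failure]  # redirect
        else:
            raise RuntimeError(f"post-phase check failed for {phase.name}")
    return visited


def scan(job):
    job["scans"] += 1
    job["quality_ok"] = job["scans"] >= 2  # pretend the second scan is good


def noop(job):
    pass


phases = [
    Phase("scan", scan),
    Phase("qa", noop, post_check=lambda j: j["quality_ok"],
          redirect_on_failure="scan"),
    Phase("ocr", noop, pre_check=lambda j: j["quality_ok"]),
]
job = {"scans": 0, "quality_ok": False}
visited = run_workflow(phases, job)
print(visited)
```

In this example the first quality check fails, so the job is redirected to scanning once before it proceeds to OCR. A different object type would simply get a different list of phases.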

Digital Assets Factory (DAF)

• Automated integrity checks at each step of the workflow.

• Automated ingestion into the repository and archiving.

• Integrates with external sources of metadata through plugins

• Integrates with enterprise tools and automated software used for digitization

• Compliant with OAIS

• Available for download at http://wiki.bibalex.org/DAFWiki

Metadata Management

• METS and MODS standards for recording metadata

• Fedora as a metadata registry

• Content Models (Hybrid)

– Photo (atomistic) / Album (aggregate)

– Book (compound) / Bibliographic (aggregate)

Triple Store and Handles

• Triple Store

– RDF relations between objects are stored in the Triple Store

– Currently using Mulgara

– Scalability issues

– Alternatives: 4Store? Integration with Fedora

• Handles

– Each object has a unique identifier (UUID)

– The UUID is used to generate the Handle

– A list of external identifiers is maintained
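
The slides do not say how the Handle is derived from the UUID. One plausible reading is sketched below, assuming the UUID is used directly as the Handle suffix. The prefix `12345`, the function `make_handle` and the class `IdentifierRegistry` are hypothetical.

```python
import uuid

# Hypothetical Handle prefix; real prefixes are assigned by the Handle registry.
HANDLE_PREFIX = "12345"


def make_handle(object_uuid: uuid.UUID, prefix: str = HANDLE_PREFIX) -> str:
    """Build a Handle of the form prefix/suffix, using the UUID as the suffix.

    Assumption: the source only states that the UUID is used to generate
    the Handle, not the exact scheme.
    """
    return f"{prefix}/{object_uuid}"


class IdentifierRegistry:
    """Keeps the list of external identifiers known for each object."""

    def __init__(self):
        self._external = {}

    def add_external(self, object_uuid, scheme, value):
        self._external.setdefault(object_uuid, []).append((scheme, value))

    def external_ids(self, object_uuid):
        return list(self._external.get(object_uuid, []))


object_uuid = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
handle = make_handle(object_uuid)

registry = IdentifierRegistry()
registry.add_external(object_uuid, "ils", "b1000001")  # made-up ILS record id
print(handle, registry.external_ids(object_uuid))
```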

METS Store

• A METS skeleton is created for each object even if metadata is incomplete

• When metadata is complete, it is sent to Fedora and disseminated

• Accommodates digitizing objects before metadata is ready

• The METS store can be used to reconstruct Fedora
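
A METS skeleton with an empty MODS section might look like the sketch below, built with Python's standard `xml.etree.ElementTree`. The element layout follows the public METS and MODS schemas, but the completeness rule (a non-empty title) and the function names are assumptions, not DAR's actual logic.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("mods", MODS_NS)


def q(namespace, tag):
    """Return a namespace-qualified tag in ElementTree's {ns}tag notation."""
    return "{%s}%s" % (namespace, tag)


def mets_skeleton(object_id):
    """Create a METS document whose MODS record is still empty."""
    root = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})
    dmd = ET.SubElement(root, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(MODS_NS, "mods"))      # metadata comes later
    struct_map = ET.SubElement(root, q(METS_NS, "structMap"))
    ET.SubElement(struct_map, q(METS_NS, "div"))
    return root


def set_title(mets, title):
    """Fill in the MODS title once a cataloguer provides it."""
    mods = mets.find(".//" + q(MODS_NS, "mods"))
    info = ET.SubElement(mods, q(MODS_NS, "titleInfo"))
    ET.SubElement(info, q(MODS_NS, "title")).text = title


def is_complete(mets):
    """Hypothetical rule: metadata is complete once a title is present."""
    path = ".//" + q(MODS_NS, "titleInfo") + "/" + q(MODS_NS, "title")
    title = mets.find(path)
    return title is not None and bool((title.text or "").strip())


skeleton = mets_skeleton("obj-1")
before = is_complete(skeleton)   # object can be digitized in this state
set_title(skeleton, "A Sample Title")
after = is_complete(skeleton)    # now eligible to be sent onward
print(before, after)
```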

Metadata Synchronization

– External sources

• Synchronization is based on XML templates

• Templates map the output of an ILS or database into MODS

• Templates can be easily created for different sources

– Metadata Tool

• No source of info to extract metadata from

• Relies on human data entry (normal users)

• Generates human-friendly forms through configurable XML templates

• Offers type validation, controlled vocabulary, authority lists

• Metadata is synchronized with METS store

• Allows full text search (Solr) across items in sets/collections

• Represents objects in a hierarchy depicting sets/collections

• Supports simple workflow with designated roles e.g. editors, reviewers, etc.
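
The template-based mapping from an external source into MODS, described under "External sources" above, could work roughly as in this sketch. The template format shown here (`<template>` with `<map from=... to=...>` rules) is invented for illustration; the slides do not show DAR's actual template syntax.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Hypothetical template: each rule maps a source field to a MODS element path.
TEMPLATE = """
<template source="example-ils">
  <map from="title" to="titleInfo/title"/>
  <map from="author" to="name/namePart"/>
</template>
"""


def apply_template(template_xml, record):
    """Build a MODS element from a flat source record using the template."""
    template = ET.fromstring(template_xml.strip())
    mods = ET.Element("{%s}mods" % MODS_NS)
    for rule in template.findall("map"):
        value = record.get(rule.get("from"))
        if not value:
            continue  # the source did not supply this field; skip it
        parent = mods
        for tag in rule.get("to").split("/"):
            parent = ET.SubElement(parent, "{%s}%s" % (MODS_NS, tag))
        parent.text = value
    return mods


record = {"title": "A Sample Title", "author": "Sample Author"}
mods = apply_template(TEMPLATE, record)
title = mods.find("{%s}titleInfo/{%s}title" % (MODS_NS, MODS_NS))
print(title.text)
```

Supporting a new source then means writing a new template rather than new code, which matches the claim that templates are easy to create for different sources.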

Copyright and Access Module

• Access control policy for specific sets or objects

• Can define rights to certain operations (e.g. view, print, download, etc.) based on the application requesting access

• Can define exceptions to override rules (e.g. prevent a certain object from being displayed)

• Coordinate access to objects based on the number of licenses
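
The three ideas above (per-set rules keyed by application, overriding exceptions, and license counts) can be combined as in the following sketch. The class `AccessPolicy`, its fields and the example identifiers are hypothetical, not the module's real interface.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    # (set_id, application) -> operations that application may perform
    rules: dict = field(default_factory=dict)
    # (object_id, operation) pairs that are denied regardless of the rules
    exceptions: set = field(default_factory=set)
    # object_id -> maximum number of simultaneous users (licenses)
    licenses: dict = field(default_factory=dict)
    in_use: dict = field(default_factory=dict)

    def is_allowed(self, application, set_id, object_id, operation):
        if (object_id, operation) in self.exceptions:
            return False  # exceptions override the set-level rule
        return operation in self.rules.get((set_id, application), set())

    def acquire(self, object_id):
        """Take one license for the object; False if none are free."""
        limit = self.licenses.get(object_id)
        used = self.in_use.get(object_id, 0)
        if limit is not None and used >= limit:
            return False
        self.in_use[object_id] = used + 1
        return True

    def release(self, object_id):
        self.in_use[object_id] = max(0, self.in_use.get(object_id, 0) - 1)


policy = AccessPolicy()
policy.rules[("set:books", "book-viewer")] = {"view", "print"}
policy.exceptions.add(("obj:9", "view"))   # never display this object
policy.licenses["obj:1"] = 1               # single-user license

print(policy.is_allowed("book-viewer", "set:books", "obj:1", "view"))
print(policy.is_allowed("book-viewer", "set:books", "obj:9", "view"))
```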

Authentication and Authorization

• Single Sign On module

• Set management and ACLs

• LDAP integration and local users

Digital Assets Keeper

• Keep a working copy of the object online

• Maintain a unique copy of the object with persistent identifier

• Handle entries and external identifiers

• A storage abstraction layer isolates the repository from the storage implementation

• Manages different versions of items

• Manages caching and derivatives

• Load balancing among nodes

Online Archive (OnA)

• Complete hardware and software solution for archival

• Provides reliable and scalable storage

• Based on commodity hardware with spinning hard drives

• Uses in-house developed software for data management

• Any AIP ingested is mirrored at least once

• Relies heavily on checksums to ensure the integrity of the data
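
Checksum-based integrity checking of mirrored copies can be illustrated as below. The slides do not name the checksum algorithm or the OnA software interfaces, so SHA-256 and the function names `sha256_of` and `find_corrupt_copies` are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(expected, copies):
    """Return the copies whose checksum differs from the recorded one."""
    return [path for path in copies if sha256_of(path) != expected]


with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary.bin"
    mirror = Path(tmp) / "mirror.bin"
    primary.write_bytes(b"archival package")
    mirror.write_bytes(b"archival package")   # the AIP is mirrored once

    recorded = sha256_of(primary)             # checksum recorded at ingest
    clean = find_corrupt_copies(recorded, [primary, mirror])

    mirror.write_bytes(b"archival pakcage")   # simulate silent corruption
    damaged = find_corrupt_copies(recorded, [primary, mirror])

print(clean, [p.name for p in damaged])
```

With at least one mirror per AIP, a copy flagged this way can be restored from a copy that still matches the recorded checksum.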

Digital Assets Publishing (DAP)

• Different viewers and applications are built using the RESTful API

• Applications are repository-bound: highly integrated with the repository, not separate silos

• DAR manages one instance of each object

• Applications have access to a slice of the data (sets of objects) based on their access rights

• Applications sync with DAR by querying the API for new or updated metadata and files

• Applications maintain different derivatives independently
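
The synchronization pattern above, where an application asks only for what changed in its own slice since its last sync, is sketched below against an in-memory stand-in. `FakeRepository`, `changed_since` and the revision counter are hypothetical and do not describe DAR's real API; for brevity each object belongs to one set here.

```python
class FakeRepository:
    """In-memory stand-in for the repository API (illustration only)."""

    def __init__(self):
        self._objects = {}   # object_id -> (revision, set_id, metadata)
        self._revision = 0

    def put(self, object_id, set_id, metadata):
        """Add or update an object, stamping it with a new revision."""
        self._revision += 1
        self._objects[object_id] = (self._revision, set_id, metadata)

    def changed_since(self, set_id, revision):
        """Return (revision, object_id, metadata) newer than `revision`."""
        changes = [(rev, oid, meta)
                   for oid, (rev, sid, meta) in self._objects.items()
                   if sid == set_id and rev > revision]
        return sorted(changes, key=lambda change: change[0])


class Application:
    """An application that mirrors one set of objects into its own store."""

    def __init__(self, repository, set_id):
        self.repository = repository
        self.set_id = set_id
        self.last_revision = 0
        self.local = {}      # the application's own index or database

    def sync(self):
        changes = self.repository.changed_since(self.set_id, self.last_revision)
        for revision, object_id, metadata in changes:
            self.local[object_id] = metadata
            self.last_revision = revision
        return len(changes)


repo = FakeRepository()
repo.put("book:1", "set:books", {"title": "A"})
repo.put("photo:1", "set:photos", {"title": "P"})

app = Application(repo, "set:books")
first = app.sync()     # picks up book:1 only
second = app.sync()    # nothing new
repo.put("book:1", "set:books", {"title": "A (revised)"})
third = app.sync()     # picks up the update
print(first, second, third, app.local)
```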

Discovery Layer

• Stores simple derivatives for all objects

• Users can browse and search all assets stored within using simple viewers.

• Provides full text search across the metadata and textual content, based on the access rights granted to the user.

• Full text search is built on Solr with support for 5 languages: Arabic, English, French, Spanish and Italian

Current Status

• More than 430,000 objects including:

– Books

– Photos

– Manuscripts

– Maps

– Documents

• Specialized viewers have been built to display items stored within the repository, such as books and photos.

• More viewers are still under development, e.g. a tiled image viewer and a manuscript viewer.

• Print on demand (POD) integration layer makes part of DAR available through the POD system.

• Several interfaces can also be built on top of this API to integrate DAR with other systems.

DAR Books

– Application built on top of DAR using the RESTful API

– Displays books stored in the repository (185,000)

– Faceted search, including content

• Morphological full text search (5 languages)

• Search results highlighting

• Embeddable book viewer that can be added to any webpage

• Whenever a book is added to or updated in DAR, it is automatically retrieved by DAR Books.

DAR Books

• Annotation Tools

– Sticky Notes

– Highlight and underline, colors

– More to come: Open Annotations, Annotea, etc.

• Web 2.0 Social Features:

– Rating and comments

– Create your own bookshelves

– Sharing and embedding

– Adding to other social sites: Facebook, Twitter, …

Text Highlighting

Text Underlining

Adding Sticky Notes

Future Work

• Enhance the Storage Layer: exploring iRODS, pair trees, etc.

• Extending the Copyright and Access module

• Explore the potential of triple stores

– Beyond defining sets and collections

– Scalability

• Migrating existing applications into repository-bound applications

Thank You