BNSC Report Fall 2007 David Giaretta. CASPAR Consortium Integrated project Total spend 16MEuro.


Page 1: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

BNSC Report Fall 2007

David Giaretta

Page 2: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

CASPAR Consortium

http://www.casparpreserves.eu

Integrated project

Total spend 16MEuro

Page 3: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

…CASPAR

• Strongly based on OAIS

• Passed 1st year EU review

Page 4: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

CASPAR Aims

• Produce tools and techniques to support digital preservation and make it easier to share the cost
– must be relatively easy to use
– must have a low “buy-in” in terms of effort required for adoption
– must avoid requiring wholesale change of everyone else’s systems
– must be decentralised and reproducible so that it can live on after the formal end of the CASPAR project
– must be “preservable”
– must be open: open source, open standards

• Cannot do everything
– Working closely with other projects

Page 5: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Validation

• How can we judge any proposed solution?

• CASPAR validation metrics:
– Theoretic underpinning
– Testbed scenarios addressing real issues
  • No “hand-waving” – use what is there now
  • Accelerated lifetime tests
    – Hardware and Software
    – Environment
    – People
– Improved “trustability”/“certifiability”

Live a long time

Evidence - not proof

Page 6: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

CASPAR information flow architecture

[Diagram; recoverable labels: Rep Info, Virtualisation]

Page 7: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

INFRASTRUCTURE ELEMENTS

[Diagram; recoverable labels: RegRepData, Curator, RepInfo toolkit, Repository, Gap Manager, Orchestration, Application, User, Data Source]

Page 8: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Preservation Aware Storage and Preservation DataStores

• Preservation Aware Storage - The storage component of a digital preservation system that has built-in support for both bit preservation and logical preservation.

• Preservation DataStores (PDS) is a new OAIS-based preservation-aware storage. It offloads functionality to the storage layer
– Decrease the probability of data loss
– Simplify the applications
– Provide improved performance and robustness
– Utilize locality properties
  • Compute data intensive functions internally e.g. fixity
  • Provide better support for links among objects
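The idea of computing fixity inside the storage layer can be illustrated with a minimal Java sketch. The class name `FixitySketch`, its method names, and the choice of SHA-256 are assumptions made for illustration; the slides do not say which algorithm or API the PDS prototype uses.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: fixity is computed next to the data, not in the application. */
final class FixitySketch {
    private FixitySketch() {}

    /** Returns the SHA-256 digest of the content as lowercase hex. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Validates stored content against the fixity value recorded at ingest. */
    static boolean validate(byte[] content, String recordedHex) {
        return sha256Hex(content).equalsIgnoreCase(recordedHex);
    }
}
```

In this sketch the application would only ever see the boolean result of `validate`, so the bulk data never has to leave the storage component.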

Page 9: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Preservation Aware Storage Functionality

Functionality / Rationale

Physically co-locate the Information Object (AIP).

However, this is relaxed if the AIP data already resides in an existing archive

Ensure metadata is never lost when raw data survives

Execute data intensive functions at the storage component:

– fixity computations and validation
– data transformation

Utilize the data locality property

Lessen data transfers to applications

Handle technical provenance events internally

E.g. migration and copy occurs at the storage

Simplify applications

Support the loading and execution of external transformations

Ideally performed during bit-migration performed close to data

Page 10: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Preservation Aware Storage Functionality (Cont.)

Functionality / Rationale

Maintain referential integrity

Update links during migration

Ideally done during migration

Ensure readability of the data by a different system in the future.

Support global self-described formats

Interaction with backend storage

Support media migration

Load and execute transformations

Portable export format

Interaction with backend storage

Support a graceful loss of data

Self-describing self-contained media format

Minimize the effect of media loss/corruption

Page 11: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

PDS Architecture

[Diagram; recoverable labels: Applications; Preservation Web Services (Ingest, Access, Administration, …); Preservation DataStore; AIP; Preservation Engine Layer; XAM Layer; Object/File Layer; backend]

• Layered approach
• Prototype based on open standards
– OAIS, XAM, OSD
• Generic gradual mapping from logical to physical object
• Independent of physical storage
• Independent of stored data type
• Scalable

Page 12: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

PDS Architecture

[Diagram; recoverable labels by layer:
Applications: Preservation WSDL; Preservation Web Services (Ingest, Access, Administration, …); Security; Admin web service; WAS CE
Preservation Engine Layer: Preservation Engine; RepInfo Mgr; Placement Mgr; Migration Mgr; PDI Mgr
XAM Layer: XAM API; XAM Library; VIM API; XAM to OSD; XAM to FS
Object Layer: HL OSD; Object Store; File System; sockets; posix I/O
Other labels: Preservation DataStore; AIP; backend]

Page 13: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Preservation DataStores

• Preservation DataStores are OAIS-based preservation aware storage
• API covers different options for ingest and access, configures policies and enables updates of AIPs and PDS code
• Prototype implements mainly ingest and access using web services
• References
– “Towards OAIS-Based Preservation Aware Storage - A White Paper”.
  • http://www.haifa.il.ibm.com/projects/storage/datastores/public.html
– “The Need for Preservation Aware Storage - A Position Paper”.
  • ACM SIGOPS Operating Systems Review, Special Issue on File and Storage Systems, Volume 41, Issue 1 (Jan 2007), pp 19-23.
– “Preservation DataStores: Architecture for Preservation Aware Storage”, to appear in 24th IEEE Conference on Mass Storage Systems and Technologies (MSST), 2007.
– Web site - http://www.haifa.il.ibm.com/projects/storage/datastores/index.html

Page 14: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Virtualisation - building up data types…

[Diagram; recoverable labels: Data Value, Vector, Image, Earth Observation image, Astronomical image, Spectrum, Time Series, 3-D data]

Page 15: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Content dependent components

• Representation Information tools
– Structure
  • EAST
  • DRB
  • DFDL
  • Virtualisation assistant
– Semantics
  • RDF editors
  • RDFSuite
  • Terminology capture
– Software
  • UVC
  • Hardware emulators

• Trust, Authenticity & Provenance tools
– Certification assistant
– PREMIS

• Packaging tools
– XFDU toolkit

Use existing tools where applicable

Develop new tools as needed and resources allow

Page 16: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Strawman Architecture…

Page 17: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

…CASPAR Architecture Overview

Page 18: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

CASPAR meets OAIS - 2

Page 19: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

OAIS Information Model and CASPAR API

Page 20: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

OAIS Information Model

Capture in UML diagrams

1. Add “obvious” methods
• get/set for sub-components e.g. we know AIP has PDI so need get/setPDI

2. Add “best guess” methods
• Iterators over contents
• May need to change
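The two kinds of method can be sketched in Java as follows. This is a minimal illustration only: the type names `PdiSketch`, `AipSketch` and `SimpleAipSketch`, and the use of plain strings for the contents, are assumptions and are not taken from the CASPAR API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical placeholder for Preservation Description Information. */
interface PdiSketch {
    String describe();
}

/**
 * "Obvious" methods: get/set for a sub-component the model says exists.
 * "Best guess" methods: iteration over contents (here via Iterable).
 */
interface AipSketch extends Iterable<String> {
    PdiSketch getPDI();
    void setPDI(PdiSketch pdi);
}

/** Hypothetical in-memory implementation used only to exercise the interface. */
final class SimpleAipSketch implements AipSketch {
    private PdiSketch pdi;
    private final List<String> contents = new ArrayList<>();

    @Override public PdiSketch getPDI() { return pdi; }
    @Override public void setPDI(PdiSketch pdi) { this.pdi = pdi; }

    void addContent(String item) { contents.add(item); }

    @Override public Iterator<String> iterator() { return contents.iterator(); }
}
```

Keeping the iterator behind `Iterable` means the "best guess" part can change later without touching callers that only use get/setPDI.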

Page 21: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

class Identifier Taxonomy

java.lang.Comparable
«interface» Identifier
+ getLocators() : Collection<Locator>
+ setLocator(Collection<Locator>) : void

DataObject
«interface» PhysicalObjectLocator

«interface» InfoObjectIdentifier

«interface» PersistentIdentifier

«interface» CurationPersistentIdentifier

«interface» Locator
+ getIdValue() : String
+ getResolver() : String
+ setIdValue(String) : void
+ setResolver(String) : void
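Written out as Java, the two interfaces with listed methods might look like the sketch below. The method signatures come from the slide; reading the `java.lang.Comparable` annotation as "Identifier extends Comparable" is an interpretation of the diagram, and the classes `SimpleLocator` and `SimpleIdentifier` (including ordering by a name string) are hypothetical additions for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;

/** Locator as listed on the slide: a value plus the resolver that handles it. */
interface Locator {
    String getIdValue();
    String getResolver();
    void setIdValue(String idValue);
    void setResolver(String resolver);
}

/** Identifier as listed on the slide (assumed to extend Comparable). */
interface Identifier extends Comparable<Identifier> {
    Collection<Locator> getLocators();
    void setLocator(Collection<Locator> locators);
}

/** Hypothetical implementation, not part of the CASPAR code base. */
final class SimpleLocator implements Locator {
    private String idValue;
    private String resolver;

    SimpleLocator(String idValue, String resolver) {
        this.idValue = idValue;
        this.resolver = resolver;
    }

    @Override public String getIdValue() { return idValue; }
    @Override public String getResolver() { return resolver; }
    @Override public void setIdValue(String idValue) { this.idValue = idValue; }
    @Override public void setResolver(String resolver) { this.resolver = resolver; }
}

/** Hypothetical implementation; ordering by name is an assumption of this sketch. */
final class SimpleIdentifier implements Identifier {
    private final String name;
    private Collection<Locator> locators = new ArrayList<>();

    SimpleIdentifier(String name) { this.name = name; }

    @Override public Collection<Locator> getLocators() { return locators; }

    @Override public void setLocator(Collection<Locator> locators) {
        this.locators = new ArrayList<>(locators); // defensive copy
    }

    @Override public int compareTo(Identifier other) {
        // Only meaningful between SimpleIdentifier instances in this sketch.
        return name.compareTo(((SimpleIdentifier) other).name);
    }
}
```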

Page 22: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

class Representation Information

ContentInformation
«interface» RepresentationInformation
+ getClassificationConcepts() : Collection<ClassificationConcept>
+ getLatestVersion() : CurationPersistentIdentifier
+ getStatus() : String
+ setClassificationConcepts(Collection<ClassificationConcept>) : void

«interface» SemanticRepInfo

«interface» StructureRepInfo

«interface» OtherRepresentationInformation

«interface» RepresentationRenderingSoftware

«interface» AccessSoftware

«interface» RepInfoLabel
+ getDOM() : org.w3c.dom.Document
+ setDOM(org.w3c.dom.Document) : void

XXX Have made RepresentationInformation extend InformationPackage

«interface» ClassificationConcept
+ getConceptPath() : List<Concept>
+ setConceptPath(List<Concept>) : void

«interface» Concept
+ getDescription() : String
+ getName() : String
+ setDescription(String) : void
+ setName(String) : void

[Association label shown in the diagram: Interpreted using]

Page 23: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

class Information Package Contents

java.lang.Comparable
«interface» InformationPackage
+ getContentInformation() : ContentInformation
+ getPackageDescription() : PackageDescription
+ getPackagingInformation() : PackagingInformation
+ getPDI() : PreservationDescriptionInformation
+ getVersion() : Version
+ setContentInformation(ContentInformation) : void
+ setPackageDescription(PackageDescription) : void
+ setPackagingInformation(PackagingInformation) : void
+ setPDI(PreservationDescriptionInformation) : void

InformationObject
«interface» ContentInformation

InformationObject
«interface» PreservationDescriptionInformation
+ getContextInformation() : ContextInformation
+ getFixityInformation() : FixityInformation
+ getProvenanceInformation() : ProvenanceInformation
+ getReferenceInformation() : ReferenceInformation

InformationObject
«interface» PackagingInformation

«interface» PackageDescription

ISSUE: Version identifiers point to specific versions. This may cause an issue with Provenance and handing an iAIP from one OAIS to another - if the Provenance changes then so does the version (and therefore the identifier) associated with that AIP.

[Association labels shown in the diagram: delimited by, described by, further described by, identifies, derived from; multiplicities 1, *, 0..1]

Page 24: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

class Archival Information Package Contents

InformationPackage
«interface» ArchivalInformationPackage
+ isValid() : boolean

InformationObject
«interface» PreservationDescriptionInformation
+ getContextInformation() : ContextInformation
+ getFixityInformation() : FixityInformation
+ getProvenanceInformation() : ProvenanceInformation
+ getReferenceInformation() : ReferenceInformation

Note that an AIP must have some PDI. A general Information Package is not required to have any PDI.

[Multiplicities shown in the diagram: 1, *]
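One possible reading of the note above, sketched in Java: a general package may leave its PDI unset, while `isValid()` on an AIP checks that some PDI is present. The `...Stub` class names are hypothetical, and the slides do not state what `isValid()` actually checks, so the rule below is an assumption.

```java
/** Hypothetical placeholder for Preservation Description Information. */
final class PdiStub {
    final String provenanceNote;

    PdiStub(String provenanceNote) {
        this.provenanceNote = provenanceNote;
    }
}

/** A general Information Package: PDI is optional, so the field may stay null. */
class InformationPackageStub {
    private PdiStub pdi;

    PdiStub getPDI() { return pdi; }
    void setPDI(PdiStub pdi) { this.pdi = pdi; }
}

/** An AIP: assumed here to be valid only when it carries some PDI. */
final class ArchivalInformationPackageStub extends InformationPackageStub {
    boolean isValid() {
        return getPDI() != null;
    }
}
```

A fuller check would presumably also look inside the PDI (context, fixity, provenance, reference), but that goes beyond what the slide states.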

Page 25: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

class Information Object Contents

«interface» InformationObject
+ getDataObject() : DataObject
+ getRepresentationInformation() : RepresentationInformation
+ setDataObject(DataObject) : void
+ setRepresentationInformation(RepresentationInformation) : void

«interface» DataObject

ContentInformation
«interface» RepresentationInformation
+ getClassificationConcepts() : Collection<ClassificationConcept>
+ getLatestVersion() : CurationPersistentIdentifier
+ getStatus() : String
+ setClassificationConcepts(Collection<ClassificationConcept>) : void

«interface» DigitalObject
+ getDataResource() : DataResource
+ getInformationObjects() : Collection<InformationObject>
+ setDataResource(DataResource) : void
+ setInformationObjects(Collection<InformationObject>) : void

«interface» BitSequence

Identifier
«interface» PhysicalObjectLocator

[Association labels shown in the diagram: Interpreted using (twice); multiplicities *, 1..*, 1]
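The “Interpreted using” associations can be read as a chain: a Data Object is interpreted using Representation Information, which in OAIS may itself need further Representation Information. The Java sketch below walks such a chain. It assumes a simple linear chain (a real Representation Information network can branch), and the names `RepInfoNode` and `RepInfoChain` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node: one piece of Representation Information. */
final class RepInfoNode {
    final String name;
    final RepInfoNode interpretedUsing; // null where this sketch stops following the chain

    RepInfoNode(String name, RepInfoNode interpretedUsing) {
        this.name = name;
        this.interpretedUsing = interpretedUsing;
    }
}

final class RepInfoChain {
    private RepInfoChain() {}

    /** Lists the Representation Information needed, nearest first. */
    static List<String> resolve(RepInfoNode first) {
        List<String> names = new ArrayList<>();
        for (RepInfoNode node = first; node != null; node = node.interpretedUsing) {
            names.add(node.name);
        }
        return names;
    }
}
```

In practice the chain would stop where the Designated Community's existing knowledge is enough, rather than at a null reference.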

Page 26: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Summary

• The Conceptual Model is based on OAIS and works out some implications

• It suggests areas of research
– Intelligibility
– Structure
  • Virtualisation
– Authenticity

• It leads into the Architecture which is
– Broadly applicable
– Useful not just for Preservation but also interoperability

• Note - Registry/Repository of Representation Information
– http://registry.casparpreserves.eu
– http://registry.dcc.ac.uk

Page 27: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Digital Curation Centre

• DCC Development closely linked to CASPAR

• Other linked JISC funded projects:
– SCARP
– Significant properties of software
– …may be others

Page 28: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Audit and Certification

Page 29: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

The need for Trustable Repositories

• Task Force on Archiving of Digital Information (1996) declared,
– “a critical component of digital archiving infrastructure is the existence of a sufficient number of trusted organizations capable of storing, migrating, and providing access to digital collections.”
– “a process of certification for digital archives is needed to create an overall climate of trust about the prospects of preserving digital information.”

• A recurring request in many subsequent studies and workshops

Page 30: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Trusted Digital Repositories

• Invited group, hosted by Research Library Group (RLG)

• Concerned with organisational and financial issues

• Trusted Digital Repositories: Attributes and Responsibilities (TDR)
– http://www.rlg.org/legacy/longterm/repositories.pdf

Page 31: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Critique of TRAC

• Closed process
– Single review of draft document
• Many changes based on unpublished “test audits”
• Underplays “understandability”
– Important for data
– Assumed not to be important for “documents”
• Simple list
– Do ALL boxes have to be ticked?
– What does a “tick” mean anyway?
• Link to other standards
– ISO 17799/27001 for security (overlap with TRAC section C)
– ISO 9000 – say what you do and do what you say
– but impractical to demand multiple independent audits

Page 32: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

ISO process status

• New group set up with the primary aim of producing an ISO standard
– Repository Audit and Certification (RAC)
• OPEN process
– Wiki open to all
– Mailing list open to all
– Virtual meetings normally every week
– See http://wiki.digitalrepositoryauditandcertification.org
• Into ISO via CCSDS – same route as OAIS
– Some organisational/procedural changes in CCSDS
• Currently a Birds of a Feather (BoF) group
– To demonstrate adequate support for the work
• Subsequently should become a Working Group
• Documents agreed by the WG will then be reviewed by CCSDS and more broadly via international ISO review process

Page 33: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Current status

• Reviewing and comparing
– TRAC
– NESTOR
– DCC documents

• Do we need another ISO standard?
– Could we simply add to existing standards e.g. ISO 27001
– The view is that ISO 27001 CANNOT be modified adequately
  • Its view of Information is too limited

• Started drafting a straw man document
– Taking TRAC and adding concepts from other docs

Page 34: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Key Issues

• How to get from a checklist to an international accreditation/certification system?
• Evidence – short term
• Evidence – long term
– The real crunch!
• Quantification
– The marking system
• Levels of audit?
– External review
– Internal maturity

Page 35: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

The Market

• Transparency

• Trustable?
– certified by whom?
– to what level?
– what evidence?
– for what Designated Community
  • relevant/sensible?

• What cost?

Page 36: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Links

• RAC group Wiki:
– http://wiki.digitalrepositoryauditandcertification.org

• TRAC document
– http://www.crl.edu/PDF/trac.pdf

• Digital Curation Centre
– http://www.dcc.ac.uk

• CASPAR project
– EU project on digital preservation – Science, Culture and Arts data
  • Infrastructure, tools and detailed case studies – what does one need to actually “understand” the data?
– http://www.casparpreserves.eu

Page 37: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Alliance for Permanent Access

• Members:
– Science and Technology Facilities Council
– Koninklijke Bibliotheek
– Deutsche Nationalbibliothek
– Max Planck Gesellschaft
– International Association of Scientific, Technical and Medical Publishers
– European Space Agency, ESRIN
– Fernuniversität in Hagen
– European Organization for Nuclear Research
– Georg-August-Universitat Gottingen Stiftung Oeffentlichen Rechts
– European Science Foundation
– Centre National d’Etudes Spatiales
– Centre Informatique National de l’Enseignement Supérieur
– UK Joint Information Systems Committee
– British Library
– National Archives of Sweden

Page 38: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Alliance status

• First stage – fairly informal sign-up

• Preparing for Conference in Nov

• More formal framework next year

Page 39: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

PARSE bid

• Consortium is a sub-group of the Alliance

• EU bid

• Aims at E-Infrastructure for Preservation
– Roadmap
– Survey of what is in place and planned
– Gap Analysis
– Impact Analysis tool

Page 40: BNSC Report Fall 2007 David Giaretta. CASPAR Consortium  Integrated project Total spend 16MEuro.

Other opportunities

• NSF solicitation, entitled Sustainable Digital Data Preservation and Access Network Partners (DataNet)
– http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.pdf
– informational meeting for prospective Principal Investigators will be held 10 am to noon, Tuesday, November 6, 2007, Room 595 NSF Stafford II building, Arlington, Virginia.
– www.nsf.gov/dir/index.jsp?org=OCI