The Other Security: A New and Nimble Approach to Digital Preservation
Stephen Abrams
Perry Willett
Digital Preservation Program
California Digital Library
University of California
UCCSC 2009: Focus on Security
UC Davis, June 16–17, 2009
Focus on Security
“Traditional” security risks
– Natural disaster
– Infrastructure failure
– Storage failure
– Server failure
– Operating system failure
– Application failure
– Human error
– Malicious attack
Focus on Security
The “other” security risks
– Legal encumbrances
– External dependencies
– Media obsolescence
– Format obsolescence
– Staff competencies
– Institutional commitment
– Financial stability
– Changing user expectations
– Anything that interferes with the usability of managed digital assets now or in the future
Libraries Have a Long Time Horizon
The UC Melvyl union catalog holds over 28 million items; 11,000 are more than 500 years old
[Figure: Melvyl catalog records (logarithmic scale) by date, from before 1500 to before 2100; series: by century, total]
Libraries Have a Long Time Horizon
What can we do to ensure that today’s digital assets are still usable 500 years from now?
[Figure: Melvyl catalog records (logarithmic scale) by date, from before 1500 to before 2100; series: by century, total]
Agenda
What is digital curation?
Redefining the repository: A micro-services approach to curation
Web archiving
CDL/campus curation collaborations
Trusted digital curation services
Summary
Digital Curation
Activities focused on maintaining and adding value to trusted digital content
Encompasses preservation and access, which are complementary, not disparate functions
– Preservation ensures access over time
– Access depends on preservation up to a point in time
How can we make the “Save” button really mean “save”?
[Diagram: Create, Manage, Use, Add value]
Curation Imperatives
Integrated business process
– Robust technological infrastructure and
– Human analysis and decision-making
Programmatic (not project-oriented)
Services (not systems)
Content (not repositories)
Redefining the repository: A micro-services approach to curation
D'où venons nous, que sommes nous, où allons nous? (Where are we from, what are we, where are we going?)
Paul Gauguin, 1897-98, Museum of Fine Arts Boston, 32.270
Where [is our stuff] from, what [is it], where are we going [with it]?
Where From? What? Where To?
– Where from: Producer; Ingest; Provenance
– What: Repository; Data management / archival storage; Characterization
– Where to: Consumer; Access / preservation planning; View paths
Information Landscape
Increasing diversity in types and uses of content
Content arising from non-library contexts
Inevitable technological change
Infrastructure Design Goals
Devolve repository function into a set of independent, but interoperable, services
– Since each is small and self-contained, they are more easily developed and maintained
– Since the level of investment is lower, they are more easily replaced
Provide complex function through the flexible combination of atomistic services
Support interaction through procedural APIs, command line applications, and web interfaces
– Let content managers and curators interact with the services without requiring changes to existing work practices
Rather than force content to come to the services, push the services out to the content
– Easy deployment centrally or locally, either independently or in strategic combinations
Defer implementation decision making until needs and outcomes are clearly articulated
– Requirements are first stated as sets of values and strategies that promote those values
– Strategies are then embodied as abstract services, and, finally, instantiated in technical systems
Object-Centric Values and Strategies
Value | Justification | Strategy
Identity | To distinguish an object from others | Persistent naming, actionable resolution
Viability | To recover an object from its medium | Redundancy, heterogeneity, media refresh
Fixity | To ensure that an object is unchanged from its accepted state | Redundancy, error-correcting codes, message digests
Authenticity | To ensure that an object is what it purports to be | Provenance, cryptographically-secure signatures
Ontology | To understand an object’s significant nature | Syntactic, semantic, and pragmatic characterization
Visibility | To enable users to find objects of interest | Public discovery systems and registries, exposure for web harvest
Utility | To expose an object’s underlying content | Behavior-rich delivery
Portability | To facilitate content sharing and succession planning | Self-contained, self-documenting objects, packaging standards
Appraisement | To understand the consequences of time | Analysis and assessment
Timeliness | To know when a preservation value is threatened | Technology watch, stakeholder engagement
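The Fixity row above names message digests as one strategy. A minimal sketch of such a check follows, assuming SHA-256 as the digest algorithm (the slides do not name one); the helper names `digest` and `is_unchanged` are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 message digest of a byte stream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, accepted_digest: str) -> bool:
    """Fixity check: does the content still match its accepted state?"""
    return digest(data) == accepted_digest


# Record the digest when the object is accepted, then re-check it later.
accepted = digest(b"formatted bit stream")
print(is_unchanged(b"formatted bit stream", accepted))
print(is_unchanged(b"formatted bit strea?", accepted))
```

A digest only detects change; recovering the accepted state still depends on the redundancy strategy listed in the same row.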
Service-Centric Values and Strategies
Value | Justification | Strategy
Availability | To provide access at a time of user choosing | Redundancy, automated failover
Responsivity | To provide appropriate throughput | Redundancy, load-balancing
Security | To enforce appropriate use of services and content | Cryptographically-secure identity and role management
Interoperability | To facilitate creative reuse of content and services | Standard interfaces
Extensibility | To enable graceful evolution over time | Granularity, orthogonality, virtualization
Trustworthiness | To promote users’ sense of predictability and reliability | Transparency, audit, certification
Sustainability | To ensure ongoing access and use | Commodity components, institutional commitment, financial cost-recovery, professional development
Micro-Services
[Diagram: services grouped into four layers, on axes running from necessity to sufficiency and from preservation to curation]
– Value (Interoperation): Publication, Annotation. “Lots of uses keeps stuff valuable”
– Service (Application): Ingest, Index, Search, Transformation. “Lots of services keeps stuff useful”
– Context (Interpretation): Catalog, Characterization. “Lots of description keeps stuff meaningful”
– State: Identity, Storage, Fixity, Replication. “Lots of copies keeps stuff safe”
Design Process
What are the conceptual entities underlying the service?
What are their state properties?
What are their behaviors?
Storage Service
Storage service
– An aggregation of storage nodes
Storage node
– A particular configuration of object storage
Object
– An aggregation of files over time
Version
– A particular configuration of files at a point in time
File
– A formatted bit stream
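The five entities above nest naturally. The following is a minimal in-memory sketch of that containment, not the actual service implementation; every class and field name is an assumption chosen to mirror the slide's definitions, and the identifier is a made-up example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class File:
    """A formatted bit stream."""
    name: str
    content: bytes


@dataclass
class Version:
    """A particular configuration of files at a point in time."""
    files: Dict[str, File] = field(default_factory=dict)


@dataclass
class StoredObject:
    """An aggregation of files over time, held as an ordered list of versions."""
    identifier: str
    versions: List[Version] = field(default_factory=list)

    def add_version(self, files: List[File]) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(Version({f.name: f for f in files}))
        return len(self.versions)


@dataclass
class StorageNode:
    """A particular configuration of object storage."""
    identifier: str
    objects: Dict[str, StoredObject] = field(default_factory=dict)


@dataclass
class StorageService:
    """An aggregation of storage nodes."""
    nodes: Dict[str, StorageNode] = field(default_factory=dict)


obj = StoredObject("example-object")  # hypothetical identifier
obj.add_version([File("a.txt", b"first")])
obj.add_version([File("a.txt", b"second"), File("b.txt", b"new")])
print(len(obj.versions), sorted(obj.versions[-1].files))
```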
Storage Service Methods
Help [idempotent, safe]
Get-state [idempotent, safe]
Get-node-state [idempotent, safe]
Get-object-state [idempotent, safe]
Get-object [idempotent, unsafe]
Get-version-state [idempotent, safe]
Get-version [idempotent, unsafe]
Get-file-state [idempotent, safe]
Get-file [idempotent, unsafe]
Add-version [non-idempotent, unsafe]
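One way a client might use these annotations is to decide which calls can be repeated after a timeout. The sketch below records the properties exactly as listed above; the `may_retry` helper and the retry policy are assumptions, not part of the service definition.

```python
from typing import NamedTuple


class MethodTraits(NamedTuple):
    idempotent: bool
    safe: bool


# Properties copied from the method list above.
METHODS = {
    "Help": MethodTraits(True, True),
    "Get-state": MethodTraits(True, True),
    "Get-node-state": MethodTraits(True, True),
    "Get-object-state": MethodTraits(True, True),
    "Get-object": MethodTraits(True, False),
    "Get-version-state": MethodTraits(True, True),
    "Get-version": MethodTraits(True, False),
    "Get-file-state": MethodTraits(True, True),
    "Get-file": MethodTraits(True, False),
    "Add-version": MethodTraits(False, False),
}


def may_retry(method: str) -> bool:
    """Assumed policy: only idempotent calls are repeated automatically."""
    return METHODS[method].idempotent


print([name for name in METHODS if not may_retry(name)])
```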
Storage Service Interfaces
METHOD Get-File-State [idempotent, safe]
Parameter | Type | Required | Description
Node | Identifier | Mandatory | Node identifier
Object | Identifier | Mandatory | Object identifier
Version | Identifier | Optional | Version identifier
File | Identifier | Mandatory | File identifier
ResponseForm | Enum | Optional | Response form
RETURN | Response form | Mandatory | File state
Web interface:
GET /node/object/version/file?m=state HTTP/1.1
Accept: application/json
Command line:
% store -get node/object/version/file -m state -f JSON
Procedural API:
File.getState("node/object/version/file", Format.JSON);
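The web form of Get-File-State can be assembled with the Python standard library. This sketch only builds the request and does not send it; the base URL and function name are hypothetical, and the version is treated as required because the slide does not show the URL form used when it is omitted.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://storage.example.org"  # hypothetical service address


def get_file_state_request(node: str, obj: str, version: str, file: str,
                           response_form: str = "application/json") -> Request:
    """Build (without sending) a GET request for a file's state."""
    # Percent-encode each path segment so identifiers cannot alter the path.
    path = "/".join(quote(part, safe="") for part in (node, obj, version, file))
    url = f"{BASE_URL}/{path}?{urlencode({'m': 'state'})}"
    return Request(url, headers={"Accept": response_form}, method="GET")


req = get_file_state_request("node", "object", "version", "file")
print(req.get_method(), req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the call against a real endpoint.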
Technological Change and Invariance
Circa 1989
– FTP
– POSIX
– SQL
Circa 2029?
– HTTP
– URI
– XML
Due to their inherent abstracting nature, protocols and interfaces last longer than systems
Storage Service Implementation
Using the file system as the controlling managerial abstraction, what is the thinnest smear of additional functionality that will make it an effective object store?
– Namaste
– CAN
– Pairtree
– Dflat
– ReDD
Name As Text (Namaste) Tags
Directory-level signature files extending Dublin Core Kernel metadata
– [ Tag h0 ] 0=name_version
– Who h1 1=who
– What h2 2=what
– When h3 3=when
– Where h4 4=where
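Tag files of the `N=value` form shown above can be created with ordinary file operations. This is a minimal sketch: the helper name is hypothetical, storing the value as the file's content is an assumption, and the values are not made file-name safe as a real implementation would need.

```python
import os
import tempfile


def write_namaste_tag(directory: str, tag: int, value: str) -> str:
    """Create a tag file named '<tag>=<value>' and return the file name."""
    filename = f"{tag}={value}"
    with open(os.path.join(directory, filename), "w", encoding="utf-8") as fh:
        fh.write(value + "\n")  # assumption: the file body repeats the value
    return filename


with tempfile.TemporaryDirectory() as d:
    write_namaste_tag(d, 0, "dflat_0.11")  # directory type tag, as on the Dflat slide
    write_namaste_tag(d, 1, "who")
    write_namaste_tag(d, 2, "what")
    listing = sorted(os.listdir(d))
    print(listing)
```

Because the tags are plain file names, a directory listing is enough to read them back.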
Content Access Node (CAN)
File system conventions (structure and reserved names) for an object store
can/
  0=can_0.2
  can-info.txt
  log/
  store/
    pairtree...
Pairtree
Use a bigram decomposition of an object’s identifier to determine its file system path
pairtree/
  0=pairtree_0.1
  pairtree-info.txt
  pairtree_root/
    id/
      en/
        ti/
          fi/
            er/
              dflat...
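The bigram decomposition can be sketched in a few lines. This simplified version only splits the identifier into two-character directory names; it omits any escaping of characters that are unsafe in file names, and placing an odd trailing character in its own directory is an assumption. The function name is hypothetical.

```python
def pairtree_path(identifier: str) -> str:
    """Split an identifier into two-character directory names."""
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs)


print(pairtree_path("identifier"))  # id/en/ti/fi/er
```

Deriving the path from the identifier alone means an object can be located without consulting an index.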
Dflat
A “digital flat” for object data and metadata
dflat/
  0=dflat_0.11
  dflat-info.txt
  v001/
    d-manifest.txt
    delta/
      redd...
  v002/
    f-manifest.txt
    full/
      data/
      metadata/
      enrichment/
      annotation/
Reverse Delta Directory (ReDD)
File-level reverse delta compression
redd/
  0=redd_0.1
  add/
  delete.txt
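The idea behind a file-level reverse delta is that the newest version is kept in full and each earlier version is kept only as the changes needed to get back to it. The sketch below models versions as in-memory dictionaries; reading `add/` as "files to restore" and `delete.txt` as "names to remove" is an interpretation of the layout above, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Files = Dict[str, bytes]


def reverse_delta(older: Files, newer: Files) -> Tuple[Files, List[str]]:
    """Describe how to turn `newer` back into `older`.

    Returns (add, delete): files to add or overwrite, and names to delete.
    """
    add = {name: data for name, data in older.items() if newer.get(name) != data}
    delete = sorted(name for name in newer if name not in older)
    return add, delete


def apply_delta(newer: Files, add: Files, delete: List[str]) -> Files:
    """Rebuild the older version from the newer one plus its reverse delta."""
    rebuilt = {name: data for name, data in newer.items() if name not in delete}
    rebuilt.update(add)
    return rebuilt


v001 = {"a.txt": b"one", "b.txt": b"same", "c.txt": b"gone"}
v002 = {"a.txt": b"two", "b.txt": b"same", "d.txt": b"new"}
add, delete = reverse_delta(v001, v002)
print(sorted(add), delete)
print(apply_delta(v002, add, delete) == v001)
```

Unchanged files such as `b.txt` appear in neither part of the delta, which is where the space saving comes from.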
Performance Scaling
Modern file systems, e.g. ZFS, exhibit good performance characteristics at reasonable scale
[Figures: Average Copy Time; Average MkDir Time; Traverse Time]
2,272,000 files = 28.5 TB
127,058,820 files = 25.7 TB
Status
We are completing development of the foundational Storage and Identity services
– Identity is based on N2T (name-to-thing) and Noid systems
We are planning for the Ingest, Catalog, and Characterization services
– Characterization is based on JHOVE2
As these services become available they will be deployed centrally and locally on campuses
Web archiving
Today’s Web is History’s Source Material
The web is indispensable to science, commerce, education, entertainment, and culture
Yet, it is highly volatile
UC faculty and researchers have their own web publications
Libraries and archives wish to preserve important websites
How can we secure this valuable content into the future?
Web Archiving Service (WAS)
Provides open source tools for curators to select and preserve content from the free web
Allows curators to define the scope of a collection, set the frequency of crawling, and work collaboratively
Content is saved in “projects,” grouped by common subject matter or publisher
WAS Public Access
Starting in July, curators will be able to provide public access to their projects
Rights based on recommendations of Section 108 Study Group
– 6-month embargo
– Opportunities for content owner to opt out
Libraries will add links in their online catalogs to documents, websites
Advantages: curated collections, persistent access and URLs, full-text searching
WAS Partners
Library of Congress: grant funding for development
– UC campuses, University of North Texas, and others
Internet Archive: software and experience
– Heritrix crawler, Wayback display, Nutch indexing
National Library of France: standards and leadership
– IWAW international web archiving workshop
– IIPC (national libraries consortium) commitment
CDL/campus curation collaborations
CDL Curation Collaborations
DataONE
– NSF-funded project to preserve distributed scientific data and develop infrastructure for distributed scientific research on global change
– University of New Mexico, UC Santa Barbara
Media Vault Program
– UC Berkeley
Historical Newspapers
– UC Riverside
Trusted digital curation services
Trusted Digital Repositories
Trusted Repositories Audit and Certification (TRAC)
– Criteria for evaluating repository trustworthiness
– Developed by RLG, OCLC, NARA, CRL
– Based on Open Archival Information System (OAIS) reference model (ISO 14721)
TRAC
Basic approach
– TRAC checklist provides framework
– Organization documents planning and policies
Allows organizations to self-audit and identify gaps
Allows other organizations to perform external audit
Total Transparency Is Not Possible
Budgets, personnel issues
NDAs, competitive environment
Computer security, firewalls
Burden of documentation and maintenance
Trust but Verify
Process requires both trust and willingness to question assumptions
For process to work, the underlying motivation must be a desire to improve service…
– Resulting in greater transparency
– Leading to trust between repositories and clients
Summary
Safety through redundancy
Meaning through description
Utility through service
Value through use
Code to interfaces
Orthogonality, but interoperability
Composition, not addition
Bring services to content, not content to services