SAN DIEGO SUPERCOMPUTER CENTER, UC SAN DIEGO LIBRARIES
NDIIPP PARTNERS MEETING 07.09.08
David Minor, SDSC
Robert H. McDonald, SDSC
Sangchul Song, UMIACS
Bryan Beecher, ICPSR
Justin Littman, LC
Chronopolis in Practice
Outline
• Current Chronopolis Implementation
• Accomplishments (2/08 – Present)
  – Ingested Content
  – Transmission
• Technologies for Ingest
  – ICPSR – SRB
  – CDL – BagIt
  – NCSU – BagIt
• Technologies for Integrity
  – Audit Control Environment
• Questions
Chronopolis Implementation
[Architecture diagram. Labels: storage systems (Sun 6140, 62 TB; Sun SAM-QFS; Apple Xsan; tape silos); SRB D-Brokers and an SRB MCAT alongside each storage system; client servers (CDL Server, ICPSR Server); networks (SDSC, NCAR, UMD/Maryland, ICPSR, UC Berkeley); Chronopolis data holdings (12-25 TB and 12 TB).]
Adapted from Bryan Banister (SDSC)
Key Deliverables 07/08
7.2 - A well-integrated network and data grid for content sharing among CDL and ICPSR supporting sustained high-capacity transfer rates.
7.3 - An integrated set of monitoring tools for the Chronopolis Data Grid using the replication monitor, ACE, and INCA for the Library community.
7.5 - A Dissemination Information Package (DIP) for content submitted by both ICPSR and CDL will be available for both ICPSR and CDL to retrieve their content from the Chronopolis gateway.
7.7 - An ingested content collection from ICPSR of 12-15 TB
7.8 - An ingested content collection from CDL of 25 TB
7.5 Deliverable Refinements
7.5 A Dissemination Information Package (DIP) for content submitted by both ICPSR and CDL will be available for both ICPSR and CDL to retrieve their content from the Chronopolis gateway.
Two Components Emerging
Component 1
DIP based on Bagit structure
Component 2
DIP that supports transmission package to load into Fedora repository software
Accomplishments (2/08-Present)
NDIIPP Client Ingested Content
• ICPSR – 5 TB (Staging)
• CDL – 4 TB (Staging)
Chronopolis Replicated Content
• SDSC to UMIACS – 3 TB (Copy 2)
• SDSC to NCAR (forthcoming)
Transmission Speed-Ingest
• ICPSR – Approx. 1 TB per day
• CDL – BagIt tests using LC Python scripts (15 processes)
  – City Bag – 46.22 Mb/sec – 498.96 GB per day
  – State Bag – 42.88 Mb/sec – 463.10 GB per day
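The per-day figures above follow from the per-second rates by simple arithmetic. The sketch below reproduces the conversion, assuming that "Mb" means megabits and that a GB is 1000 MB; both are assumptions, since the slide does not define its units. The function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def gb_per_day(megabits_per_second: float) -> float:
    """Convert a sustained rate in megabits/sec to gigabytes/day.

    Assumes 8 bits per byte and decimal units (1 GB = 1000 MB).
    """
    megabytes_per_second = megabits_per_second / 8
    return megabytes_per_second * SECONDS_PER_DAY / 1000


if __name__ == "__main__":
    for label, rate in [("City Bag", 46.22), ("State Bag", 42.88)]:
        print(f"{label}: {rate} Mb/sec is about {gb_per_day(rate):.2f} GB per day")
```

Under these assumptions the State Bag rate gives 463.10 GB per day, matching the slide. The City Bag rate gives about 499.18 GB per day, close to but not identical with the 498.96 quoted.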
New Partners
• N.C. State – GIS data @ 5 TB
  – Already working with BagIt format
• Scripps Institution of Oceanography – Data @ 2 TB
  – Already working with SRB
Technologies for Ingest/Replication
SRB to SRB Connections
• ICPSR – Client
• Scripps – Client
• UMIACS – Chronopolis Partner
• NCAR – Chronopolis Partner
BagIt Transfers
• CDL
• NC State
Transfer Methodology (ICPSR – Client)
• Synchronize collections of content with SDSC's storage grid
  – Original scope was just our web-delivered content
    • Compressed
    • 400 GB
    • Tens of thousands of files
  – Since then we have copied our complete holdings
    • Uncompressed
    • 5000 GB
    • Millions of files
Transfer method
• SRB utilities are the base
  – Sput
  – Srsync
• Cannot use the utilities "out of the box"
  – Too many files
  – Too many timeouts
• Wrap the utilities with some simple shell script grouping
Example
• Metadata resides in Oracle; dump it nightly to SRB
    Sput -fK /path/to/oracle/export s:/SDSC-chron/icpsr.umich/database
• Files reside elsewhere and there are LOTS
  – Wrap Sinit, Srsync and Sexit in a script, Ssend
  – Invoke via a mechanism like this:
      find /archive <criteria> | xargs -n 3 -P 0 Ssend
  – Select a bunch of "just big enough" directories to feed into Ssend, and not too many at a time
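The grouping that xargs -n 3 performs in the invocation above can be sketched in Python. This is a minimal illustration, not the ICPSR script: Ssend is the wrapper named on the slide, while the helper names batched and build_commands and the directory paths are hypothetical. The sketch only builds the command lines; it does not contact SRB.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive groups of at most `size` paths (like xargs -n)."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def build_commands(paths: Iterable[str], size: int = 3,
                   command: str = "Ssend") -> List[List[str]]:
    """Build one argv list per batch of directories."""
    return [[command, *batch] for batch in batched(paths, size)]


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    dirs = ["/archive/a", "/archive/b", "/archive/c", "/archive/d"]
    for argv in build_commands(dirs):
        print(" ".join(argv))
```

Each argv list could then be handed to subprocess.run, with a worker pool limiting how many transfers run at once; that limit plays the role of the -P option to xargs.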
BagIt
• Motivating use cases:
– Transfer of content internally and between preservation partners
– Long-term storage of content
• Needs:
– Minimally self-identifying and self-describing packages
– Support for error detection and transfer optimization
• Characteristics:
– Low overhead
– Content agnostic
– Supported by off-the-shelf tools (e.g., MD5Deep)
• Informed by
– LC's eDeposit Pilot Project
– NDIIPP Archive and Ingest Handling Test (AIHT)
– Tabata et al., “Enclose-and-Deposit Method,” IWAW ’05
• Documented at
– www.ietf.org/internet-drafts/draft-kunze-bagit-01.txt
– www.cdlib.org/inside/diglib/bagit/bagitspec.html
• Basic bag:
    <bag_dir>/
      bagit.txt
      manifest-<algorithm>.txt
      [optional additional tag files]
      data/
        [content file hierarchy]
• Bag parts:
– bagit.txt: Bag signature
– manifest-<algorithm>.txt: List of content files and fixities
  • Example, manifest-md5.txt:
      49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
      408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
– package-info.txt: Bag contents metadata (optional)
– fetch.txt: Bag contents included by reference (optional)
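The manifest layout above is simple enough to check with a few lines of code. The sketch below is one possible fixity check, assuming each manifest line holds a checksum, whitespace, and a path relative to the bag directory, as in the manifest-md5.txt example. The function names are illustrative and are not taken from the LC scripts or the BagIt draft. It does not look for files under data/ that the manifest omits, and it ignores fetch.txt.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def read_manifest(manifest_path: Path) -> Dict[str, str]:
    """Parse 'checksum path' lines into a {relative_path: checksum} dict."""
    entries = {}
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        checksum, rel_path = line.split(None, 1)
        entries[rel_path.strip()] = checksum.lower()
    return entries


def file_digest(path: Path, algorithm: str) -> str:
    """Hash a file in 1 MiB chunks so large content files fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bag(bag_dir, algorithm: str = "md5") -> List[Tuple[str, str]]:
    """Return (path, problem) pairs; an empty list means every entry matched."""
    bag = Path(bag_dir)
    manifest = read_manifest(bag / f"manifest-{algorithm}.txt")
    problems = []
    for rel_path, expected in sorted(manifest.items()):
        target = bag / rel_path
        if not target.is_file():
            problems.append((rel_path, "missing"))
        elif file_digest(target, algorithm) != expected:
            problems.append((rel_path, "checksum mismatch"))
    return problems
```

Calling verify_bag on a received bag directory before accepting it is one way to get the error detection listed under "Needs" above.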
UNIVERSITY of MARYLAND INSTITUTE for ADVANCED COMPUTER STUDIES
ACE – Auditing Control Environment
• Software to ensure the long term integrity of digital objects.
• Built on rigorous cryptographic techniques and on third-party integrity management and auditing.
• Automatic regular audits based on policies set by the archive manager.
• Scalable, cost-effective, and can interoperate with any archiving architecture.
ACE – System Architecture
[Architecture diagram. Labels: Archiving Node (HDD, CD-ROM, tape drive); ACE Audit Manager; Token Registry; Audit Policy; Third-Party Integrity Management System; Crypto Summary Information; witnesses; request and reply arrows. The archiving-node group of labels appears twice.]
ACE Audit
Each digital object is periodically audited using the integrity token, according to the policy set by the local manager.
Cryptographic summaries are audited as necessary by the archive or an independent party using the published witness values.
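The slides do not say how integrity tokens or cryptographic summaries are constructed. Purely as an illustration of the audit step described above, the sketch below assumes a hash tree: each object's token is the chain of sibling hashes linking its digest to a single summary value, and an audit recomputes that chain. The hash-tree design, SHA-256, and all function names are assumptions, not ACE's documented implementation.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[bytes, bool]]  # (sibling hash, sibling_is_on_the_left)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_summary(leaves: List[bytes]) -> Tuple[bytes, List[Proof]]:
    """Combine object digests into one summary; return it with one proof per leaf.

    `leaves` must be non-empty.
    """
    level = list(leaves)
    positions = list(range(len(leaves)))  # index of each leaf's ancestor
    proofs: List[Proof] = [[] for _ in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # pad odd levels by repeating the last node
        for i, pos in enumerate(positions):
            sibling = pos ^ 1
            proofs[i].append((level[sibling], sibling < pos))
            positions[i] = pos // 2
        level = [h(level[j] + level[j + 1]) for j in range(0, len(level), 2)]
    return level[0], proofs


def audit(content: bytes, proof: Proof, summary: bytes) -> bool:
    """Recompute the chain from the object's content up to the summary."""
    node = h(content)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == summary


if __name__ == "__main__":
    objects = [b"object-1", b"object-2", b"object-3"]
    summary, proofs = build_summary([h(o) for o in objects])
    print(all(audit(o, p, summary) for o, p in zip(objects, proofs)))
    print(audit(b"tampered", proofs[0], summary))
```

In this sketch the archive would keep each proof alongside its object, while the summary value is held elsewhere, so that altering an object cannot go unnoticed without also altering the summary.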
ACE Screen Shots
[Screenshots: Adding a Collection; Auditing a Collection; Viewing an Error Report. Annotations: Status Pane (Overview); Action Pane (Collection Specific); "Last audit: successful". Actions shown: Start Auditing, Edit Collection Location, Remove Collection, Browse Collection, View Events, View Error Report.]