Duraspace Hot Topics Series 6: Metadata and Repository Services

Hot Topics Web Seminar Series: Research Data in Repositories. The UC San Diego Experience. Second Webinar: Metadata and Repository Services for Research Data Curation


Presented by Declan Fleming, Arwen Hutt, and Matt Critchlow. The second in a three-part webinar series on Research Data Curation at UC San Diego, part of the larger Research Cyberinfrastructure initiative.

Transcript of Duraspace Hot Topics Series 6: Metadata and Repository Services

Page 1: Duraspace Hot Topics Series 6: Metadata and Repository Services

Hot Topics Web Seminar Series: Research Data in Repositories

The UC San Diego Experience

Second Webinar: Metadata and Repository Services for Research Data Curation

Page 2: Duraspace Hot Topics Series 6: Metadata and Repository Services

General Series Intro

• First webinar: Intro and Framing: UC San Diego decisions and planning

• Second Webinar: Deep dive into technology and metadata

• Third Webinar: The perspective from researchers, next steps

Page 3: Duraspace Hot Topics Series 6: Metadata and Repository Services

Your esteemed presenters …

First webinar:

David Minor – Program Director, Research Data Curation

Declan Fleming – Chief Technology Strategist

Second webinar:

Declan Fleming – Chief Technology Strategist

Arwen Hutt – Metadata Librarian

Matt Critchlow – Manager of Development and Web Services

Third webinar:

Dick Norris – Professor, Scripps Institution of Oceanography

Rick Wagner – Data Scientist at San Diego Supercomputer Center

Page 4: Duraspace Hot Topics Series 6: Metadata and Repository Services

Today we will …

• Discuss real-world researcher interaction

• Document how metadata and files combine to make digital objects

• Describe the DAMS data model and how it supports complex research objects

• Detail the technology driving the DAMS

• Point to the future

Page 5: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: Pilots

• The Brain Observatory

• NSF OpenTopography Facility

• Levantine Archaeology Laboratory

• Scripps Institution of Oceanography Geological Collections

• The Laboratory for Computational Astrophysics

Page 6: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: Process

• Introductory meeting

• Metadata point person

• Ongoing discussions

• One-on-one work

Iterative, collaborative, customized, experimental…pilot!

Page 7: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: Data management

• Collocation

• Clean up

• Identifiers

• Metadata

Page 8: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: What is an object?

• What are the boundaries on a discrete set or subset of data? What is required to make the data intelligible, usable and reusable?

• What needs to be preserved?

• What do they want to display and/or share?

• What do they want to be able to refer to or cite?

Page 9: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: What is an object?

[Diagram: alternative object boundaries, e.g. Slice or Brain; Artifact or Site; etc.]

Page 10: Duraspace Hot Topics Series 6: Metadata and Repository Services

Working with Researchers: Takeaways

They are the subject experts

There are a lot of broad-level similarities

But no such thing as one size fits all

Page 11: Duraspace Hot Topics Series 6: Metadata and Repository Services

We want a new data model…

• One that is flexible and accommodates disparate metadata from a variety of sources

• While promoting consistency within the data store

• One that supports relationships within and between objects

• One that is more community engaged, both sharing vocabularies and technology, and utilizing others' shared vocabularies and technologies

• One that supports improved management of objects and metadata

Page 12: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model Development Process

• Five people, in a room, 16 hours a week for 4 months

• Worked through existing data, use case scenarios, known data requirements, investigated known ontologies, etc.

• Lots and lots and lots of discussion

• Utilizes MADS (Metadata Authority Description Schema)

• Results = a data dictionary and an OWL ontology

• Living document

Page 13: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model: Flexibility

• The data model provides enough flexibility that we can accommodate a wide variety of data within the schema

– Vocabularies

– Use of “types” or “display labels” to distinguish specific subtypes of a data field

– Flexible structures and relationships

– Extensible

Page 14: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model: Consistency

• But enough consistency that searching and display rules do not need to be customized for each individual collection of material

– Rules can be applied at the level of the broader concept

• As well as establishing the organizational structure necessary for maintaining consistency over time

– Evaluation and approval of modifications
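The balance between flexible "types" or "display labels" and collection-independent display rules can be sketched roughly as follows. This is only an illustration in Python: the `Note` class, its field names, and the sample values are hypothetical and are not taken from the DAMS ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """A generic note field; the subtype lives in the data, not in the schema."""
    value: str
    note_type: str = "general"     # hypothetical subtype, e.g. "methods"
    display_label: str = "Note"    # hypothetical label shown to users


def render(note: Note) -> str:
    """One display rule for the broader concept, whatever the subtype."""
    return f"{note.display_label}: {note.value}"


# Two collections use different subtypes without needing custom display code.
notes = [
    Note("Cores sampled at 10 cm intervals", "methods", "Sampling method"),
    Note("Scanned at 20x magnification", "technical details", "Scan details"),
]
rendered = [render(n) for n in notes]
```

A new subtype is added by supplying different data, while `render` stays unchanged, which is the sense in which a rule can apply at the level of the broader concept.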

Page 15: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model: Relationships

• It allows us to create a number of different relationships

– Collections and sub-collections

– Collections and objects

– Objects and components (complex hierarchical objects)

– Other related resources internal or external to the DAMS

[Figure: complex object example]
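As a rough illustration of the object/component relationship, a complex hierarchical object can be thought of as a tree. The sketch below assumes a hypothetical `Component` class and made-up file names; it is not the DAMS implementation, and the Brain and Slice labels are borrowed from the earlier "What is an object?" slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Component:
    """A node in a complex object: it may carry files and nested components."""
    title: str
    files: List[str] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)


def walk(node: Component, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, title) pairs in document order."""
    yield depth, node.title
    for child in node.children:
        yield from walk(child, depth + 1)


# Hypothetical example: one brain object with two slice components.
brain = Component("Brain", children=[
    Component("Slice 1", files=["slice-0001.tif"]),
    Component("Slice 2", files=["slice-0002.tif"]),
])
outline = list(walk(brain))
```

Whether the object boundary sits at the brain or at each slice is the modeling decision discussed with researchers; the tree accommodates either choice.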

Page 16: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model: Vocabularies

• Allow management of local & community vocabularies

– Vocabulary terms as entities

– Ability to encode authority data (vocabulary source, value URI, etc.) as well as sameAs relationships between the same term expressed in multiple sources

– Ability to update authority records as community vocabularies become more formalized.
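Treating vocabulary terms as entities with authority data and sameAs links might look something like the sketch below. The `Term` class, the source names, and the example.org URIs are all placeholders invented for illustration; the actual classes and properties are defined in the DAMS ontology.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Term:
    """A vocabulary term stored as its own entity, with authority data."""
    label: str
    source: str       # vocabulary source (placeholder values below)
    value_uri: str    # URI of the term within that source
    same_as: Set[str] = field(default_factory=set)


def link_same_as(a: Term, b: Term) -> None:
    """Record that two terms from different sources denote the same concept."""
    a.same_as.add(b.value_uri)
    b.same_as.add(a.value_uri)


# Placeholder terms: one from a community vocabulary, one maintained locally.
community_term = Term("Oceanography", "community-vocab",
                      "http://example.org/community/oceanography")
local_term = Term("Oceanography", "local",
                  "http://example.org/local/terms/42")
link_same_as(community_term, local_term)
```

Because each term is a separate record, an authority record can be updated in one place when a community vocabulary becomes more formalized, without touching the objects that reference it.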

Page 17: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Data Model: Management

• One that supports improved management of objects and metadata

– Authority management of vocabulary terms

– Event metadata!

Page 18: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Architecture

Page 19: Duraspace Hot Topics Series 6: Metadata and Repository Services

Preservation: Chronopolis

Current DAMS Process

1. Create BagIt bags for all objects

2. Host via HTTP(S)

3. Bags are retrieved and ingested into Chronopolis

DAMS4 Process

4. Create BagIt bags for Δ objects using Event metadata

5. Host via HTTP(S) or enqueue on messaging queue for ingestion
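The bagging steps can be sketched with the Python standard library. This is a minimal illustration only: it assumes BagIt 1.0 tag files with a SHA-256 manifest, and it assumes event metadata is available as (object ID, date) pairs. The function names and sample data are hypothetical and do not come from the DAMS code.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def changed_since(events, last_bagged):
    """Return IDs of objects with any event newer than the last bagging run."""
    return sorted({obj_id for obj_id, when in events if when > last_bagged})


def make_bag(bag_dir: Path, payload: dict) -> Path:
    """Write a minimal bag: bagit.txt, a data/ payload, and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8")
    return bag_dir


# Hypothetical event log: only objects touched after the last run are bagged.
events = [("obj1", date(2013, 9, 1)), ("obj2", date(2013, 10, 5)),
          ("obj3", date(2013, 10, 9))]
delta = changed_since(events, date(2013, 10, 1))

with tempfile.TemporaryDirectory() as tmp:
    bag = make_bag(Path(tmp) / delta[0], {"metadata.rdf.xml": b"<rdf/>"})
    bag_files = sorted(p.relative_to(bag).as_posix()
                       for p in bag.rglob("*") if p.is_file())
```

Selecting only the changed objects avoids re-bagging the entire repository on every run, which is the difference between the two processes on the slide.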

Page 20: Duraspace Hot Topics Series 6: Metadata and Repository Services

Storage

Page 21: Duraspace Hot Topics Series 6: Metadata and Repository Services

Storage: EMC Isilon 72NL

Storage For Library Collections

1 cluster of 5 Nodes

1 Node = 36 x 2TB Drives

Total Current Usable Storage of 320TB

OneFS 7.0.2.1

Page 22: Duraspace Hot Topics Series 6: Metadata and Repository Services

Storage: OpenStack

Storage For Research Data Collections

Testing:

• Performance versus Local Storage

• Large Files (up to 1TB)

– Segmenting files > 5GB

– Lexical order bug fix: 1,10,2 -> 0001,0002,…0010

• Rackspace CloudFiles API vs. OpenStack REST API

Testing Notes: https://libraries.ucsd.edu/blogs/dams/openstack-testing-notes/
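The lexical order issue arises because segments of a large object are reassembled in name order, so unpadded numeric suffixes sort as 1, 10, 2. A small sketch of the zero-padding fix follows; the `segment_names` helper, the naming pattern, and the file name are assumptions for illustration, not the DAMS code.

```python
SEGMENT_SIZE = 5 * 1024**3  # 5 GB, the segmenting threshold on the slide


def segment_names(base: str, size: int, width: int = 4) -> list:
    """Return zero-padded segment names so lexical order equals numeric order."""
    count = max(1, -(-size // SEGMENT_SIZE))  # ceiling division
    return [f"{base}/{i:0{width}d}" for i in range(1, count + 1)]


# Unpadded suffixes sort incorrectly as strings.
unpadded = [str(i) for i in range(1, 11)]
broken_order = sorted(unpadded)[:3]

# A hypothetical 48 GB file needs 10 segments; padded names sort correctly.
names = segment_names("video.mov", 48 * 1024**3)
```

With a width of four digits the scheme stays correctly ordered up to 9,999 segments, far more than a 1TB file needs at 5GB per segment.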

Page 23: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository

Page 24: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository

Core Repository Application: Create, Read, Update, Delete (CRUD)

Uses: Jena, ActiveMQ, JHOVE, Apache Tika, FFMPEG, ImageMagick

Manages:

• Metadata Triplestore

• Storage

• Solr

Page 25: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository: Metadata Triplestore

Page 26: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository: Metadata Triplestore

Triplestore was: AllegroGraph

Triplestore is: PostgreSQL DB + Jena

• Schema: (ID), Parent, Subject, Predicate, Object

Jena Usage:

• Core/RDF API – Parsing, loading, updating, serializing RDF

• ARQ API – SPARQL queries
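A relational triple table along the lines of the schema above can be sketched as follows. The sketch uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and it assumes that the Parent column holds the identifier of the object a triple belongs to. That reading, the column types, and the sample identifiers and predicates are assumptions, not details confirmed by the slides.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        id        INTEGER PRIMARY KEY,
        parent    TEXT NOT NULL,  -- assumed: the object that owns the triple
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        object    TEXT NOT NULL
    )""")

# Hypothetical triples; identifiers and predicates are placeholders.
rows = [
    ("obj:1", "obj:1", "ex:title", "Brain atlas"),
    ("obj:1", "obj:1", "ex:hasComponent", "obj:1/1"),
    ("obj:1", "obj:1/1", "ex:title", "Slice 1"),
    ("obj:2", "obj:2", "ex:title", "Sediment core"),
]
conn.executemany(
    "INSERT INTO triples (parent, subject, predicate, object) "
    "VALUES (?, ?, ?, ?)", rows)


def triples_for(parent: str) -> list:
    """Fetch every triple belonging to one object, including nested nodes."""
    cur = conn.execute(
        "SELECT subject, predicate, object FROM triples "
        "WHERE parent = ? ORDER BY id", (parent,))
    return cur.fetchall()
```

Under this assumption, a whole object, components included, can be loaded with a single indexed lookup and then handed to an RDF library for parsing or serialization.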

Page 27: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository: REST API

Page 28: Duraspace Hot Topics Series 6: Metadata and Repository Services

Hydra Framework

Source: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts

Page 29: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Repository: Fedora API-ish

Page 30: Duraspace Hot Topics Series 6: Metadata and Repository Services

Fedora API – Next PID

Page 31: Duraspace Hot Topics Series 6: Metadata and Repository Services

Fedora API – Next PID

Page 32: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Manager

Page 33: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Manager

Java application using Spring MVC framework

• Collection Management

– Metadata Ingest and Export

– File Ingest

– Derivative Generation

– Solr indexing by Collection

• Administrative Reporting and Statistics

Page 34: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Hydra Head

Page 35: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Hydra Head

Page 36: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Hydra Head: Blacklight

Page 37: Duraspace Hot Topics Series 6: Metadata and Repository Services

RDF in Hydra

Page 38: Duraspace Hot Topics Series 6: Metadata and Repository Services

RDF in Hydra: (Read) Nested Attributes

Page 39: Duraspace Hot Topics Series 6: Metadata and Repository Services

RDF in Hydra: (Create) Nested Attributes

Page 40: Duraspace Hot Topics Series 6: Metadata and Repository Services

DAMS Hydra Head: Complex Objects

Page 41: Duraspace Hot Topics Series 6: Metadata and Repository Services

Next Steps

Beta Release: Late October

Production Release: January

Future:

• Sufia/Curate Integration for administrative functionality

• Additional Linked Data Integration and Crosswalks

– Schema.org, OpenURL, Dublin Core, ResourceSync

• Fedora4

Page 42: Duraspace Hot Topics Series 6: Metadata and Repository Services

More Information

DAMS Overview: https://github.com/ucsdlib/dams/wiki/DAMS-Manual

DAMS Hydra Head: https://github.com/ucsdlib/damspas

DAMS Ontology: https://github.com/ucsdlib/dams/tree/master/ontology

DAMS REST API: https://github.com/ucsdlib/dams/wiki/REST-API

Hot Topics Series 3: Get a Head on the Repository with Hydra: http://duraspace.org/hot-topics

Hydra Technical Overview: https://wiki.duraspace.org/display/hydra/Technical+Framework+and+its+Parts

OneFS Technical Overview: http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf

Isilon Overview: http://www.emc.com/collateral/software/data-sheet/h10541-ds-isilon-platform.pdf

Page 43: Duraspace Hot Topics Series 6: Metadata and Repository Services

Coming Up Next

Final Webinar (October 31)

The researcher perspective from two of our pilot participants

Dick Norris – Professor, Scripps Institution of Oceanography

Rick Wagner – Data Scientist at San Diego Supercomputer Center

Page 44: Duraspace Hot Topics Series 6: Metadata and Repository Services

Questions?

Thanks!

Declan Fleming (@declan)

Arwen Hutt (@arwenh)

Matt Critchlow (@mattcritchlow)