Scalable Identifiers for Natural History Collections

Scalable Identifiers for Natural History Collections 12 August 2012 University of California Curation Center California Digital Library


Page 1: Scalable Identifiers for Natural History Collections

Scalable Identifiers for Natural History Collections

12 August 2012

University of California Curation Center
California Digital Library

Page 2: Scalable Identifiers for Natural History Collections

California Digital Library

CDL supports the research lifecycle

• Collections
• Digital Special Collections
• Discovery & Delivery
• Publishing Group
• UC Curation Center (UC3)

Serving the University of California
• 10 campuses
• 360K students, faculty, and staff
• 100s of museums, art galleries, observatories, marine centers, botanical gardens
• 5 medical centers
• 5 law schools
• 3 National Labs

Page 3: Scalable Identifiers for Natural History Collections

The research data problem

an article about data, but no data

Page 4: Scalable Identifiers for Natural History Collections

What EZID data citation offers

• Precise identification of a dataset (DOI, ARK)
• Credit to data producers and data publishers
• A link from traditional literature to the data
• Exposure and research metrics for datasets (Web of Knowledge, Google)

Page 5: Scalable Identifiers for Natural History Collections

EZID: Long term identifiers made easy

Primary Functions
1. Create persistent identifiers
2. Manage identifiers (and associated metadata) over time
3. Resolve identifiers

Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation

Page 7: Scalable Identifiers for Natural History Collections

German National Library of Economics (ZBW)

German National Library of Science and Technology (TIB)

German National Library of Medicine (ZB MED)

GESIS - Leibniz Institute for the Social Sciences, Germany

Australian National Data Service (ANDS)

ETH Zurich, Switzerland

Canada Institute for Scientific and Technical Info. (CISTI)

Technical Information Center of Denmark

Institute for Scientific & Technical Information (INIST-CNRS), France

TU Delft Library, The Netherlands

The Swedish National Data Service (SNDS)

The British Library, UK

California Digital Library (CDL), USA

Office of Scientific & Technical Information (OSTI), USA

Purdue University Library

DataCite

Page 8: Scalable Identifiers for Natural History Collections

EZID Clients

UC Berkeley Library (on behalf of the UC Berkeley campus)

Sponsored accounts:

The Digital Archaeological Record (tDAR)

Open Context

Dryad Digital Repository

CRCNS.org

UC San Diego Library (on behalf of the UC San Diego campus)

Fred Hutchinson Cancer Research Center

American Astronomical Society (AAS)

LabArchives

Centre national de documentation pédagogique (CNDP)

National Center for Atmospheric Research (NCAR)

Cornell Institute for Social & Economic Research

USGS/Earth Sciences Data Clearinghouse (formerly National Biological Info. Infrastructure)

A current, partial list

Page 9: Scalable Identifiers for Natural History Collections

New features in development

• Suffix pass-thru: do N → T and get N/S → T/S for free
• Service replicas: manager and resolver
• Content negotiation and inflections: ? ?? / .
• URN (Uniform Resource Name) support (urn:uuid:)
• ARK community and governance, eg, registries

Page 10: Scalable Identifiers for Natural History Collections

Some identifier dimensions

• registration (storing and updating ids for resolution)
• non-registration (id awareness via rules)
• persistence flavors
• resolution
• clusters (closely coupled ids)
• other relations (part, whole, related)

Page 11: Scalable Identifiers for Natural History Collections

Identifier generation

• inspiration ("I think I'll call it MyKitty/Photos")
• systematic inspiration (title/author/vol/issue)
• counter (421, 422, 423, ...)
• timestamp
• hash computed over content (MD5, SHA256)
• hash of randomized timestamp plus registry (uuidgen, noid)
• randomized counter plus registry (EZID/noid)
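
Several of the generation styles above can be sketched in a few lines of Python. This is only an illustration using the standard library; the function names are made up for this example, and it does not reproduce how EZID or noid mint identifiers.

```python
import hashlib
import itertools
import time
import uuid

# Counter: short and simple, but needs one authority handing out numbers.
# The starting value 421 just mirrors the slide's example.
_counter = itertools.count(421)

def counter_id() -> str:
    return str(next(_counter))

def timestamp_id() -> str:
    # UTC timestamp to the second; two requests in the same second collide.
    return time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())

def content_hash_id(content: bytes) -> str:
    # Same bytes always give the same id; any change in the bytes gives a new id.
    return hashlib.sha256(content).hexdigest()

def random_uuid_id() -> str:
    # Random 128-bit id in the RFC 4122 layout (version 4).
    return str(uuid.uuid4())
```

A content hash suits never-changing content, since a correction to the content would change the name; counters and random ids stay stable when the content is corrected.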

Page 12: Scalable Identifiers for Natural History Collections

Identifier registration

• use filesystem tree as resolver (any old website)
• use web server config file
• use web server backing database
• use a service (bit.ly, EZID, DataCite, local Handle service)

Page 13: Scalable Identifiers for Natural History Collections

Identifier non-registration

Identifiers “exposed” but not registered, eg, awareness via rules

• extension (abc/def is "part of" abc)
• parameter (abc_N_M works for N or M less than 100,000)
• general query (arbitrary data cells)
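
The extension and parameter rules can be pictured as a check that recognizes an identifier without looking it up in a registry. The sketch below is hypothetical: the `REGISTERED` set, the regular expression, and the reading of the parameter rule as "both N and M below 100,000" are assumptions made for illustration.

```python
import re

REGISTERED = {"abc"}  # hypothetical: the only name actually stored
PARAM_RULE = re.compile(r"^(?P<base>[^_/]+)_(?P<n>\d+)_(?P<m>\d+)$")
LIMIT = 100_000

def known_by_rule(identifier: str) -> bool:
    """Return True if the identifier is registered or implied by a rule."""
    if identifier in REGISTERED:
        return True
    # Extension rule: abc/def is "part of" abc.
    base, sep, _rest = identifier.partition("/")
    if sep and base in REGISTERED:
        return True
    # Parameter rule: abc_N_M, assumed here to need both N and M below LIMIT.
    match = PARAM_RULE.match(identifier)
    if match and match["base"] in REGISTERED:
        return int(match["n"]) < LIMIT and int(match["m"]) < LIMIT
    return False
```

One stored name then covers a very large family of exposed identifiers, which is the scaling point of non-registration.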

Page 14: Scalable Identifiers for Natural History Collections

Identifier persistence flavors

• persistent id to very dynamic content (eg, home page)

• persistent id to stable but correctable content (eg, landing page)

• persistent id to never-changing content (eg, spreadsheet)
  – persistent ids to non-recommended content

• persistent id to stable but growing content (serial pub)

Page 15: Scalable Identifiers for Natural History Collections

Identifier resolution

• DNS (domain names)
• DNS + HTTP (any website)
• DNS + HTTP + redirects (eg, URL shorteners, N2T/EZID system)
• DNS + HTTP + redirects + Handle resolver (DOIs and Handles)

Page 16: Scalable Identifiers for Natural History Collections

Identifier clusters

Related, but very closely coupled identifiers
• object files
• alternate object files
• object metadata

Page 17: Scalable Identifiers for Natural History Collections

GUID Definitions

• GUID -- definition 1 (Wikipedia)
  – A 128-bit id generated per RFC 4122, eg,
  – uuidgen -> EEF45689-BBE5-4FB6-9E80-41B78F6578E2
• GUID -- definition 2 (earth sciences?)
  – any globally unique identifier

Page 18: Scalable Identifiers for Natural History Collections

Service replicas

• EZID is an id manager that populates N2T
  – It tolerates down time
  – Other id manager services might one day populate N2T
• N2T (Name-to-Thing) is an id resolver that ...
  – It is very intolerant of down time, since it services all access requests for locations and metadata
  – N2T replicas underway

Page 19: Scalable Identifiers for Natural History Collections

URN support

• N2T and EZID are agnostic about kinds of things, names, and metadata
  – Digital, physical, abstract, living, fictional, groups, etc.
  – Any metadata & known profiles (DataCite, Dublin Kernel)
  – ARK, DOI, URN, Handle, IVOA, LSID, PMID, etc., requiring namespace “write” permission, eg, via DataCite
• In test: Uniform Resource Names (URNs)
  – urn:uuid namespace
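
For readers unfamiliar with the urn:uuid namespace, Python's standard `uuid` module shows what such a name looks like and that it converts back to the same 128-bit value. This only illustrates the naming form; it says nothing about how EZID or N2T store or resolve URNs.

```python
import uuid

u = uuid.uuid4()   # random 128-bit id in the RFC 4122 layout
urn = u.urn        # same value written in URN form: "urn:uuid:<hex-with-dashes>"

# The URN form parses back to the identical UUID value.
round_trip = uuid.UUID(urn)
```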

Page 20: Scalable Identifiers for Natural History Collections

Under the hood keysmithing terms: bows, shoulders, blades, tips, covers

Page 21: Scalable Identifiers for Natural History Collections

Suffix pass-thru: N → T gets N/S → T/S for free

Idea: if name N points to target T, then requests for N extended by any suffix N/S can take you to T/S

• For dataset doi:10.5072/Big4 with 10,000 nameable components,
  – Register and manage 10,001 names or 1 name?
  – Eg, http://x.y.z/foo/Big4/db/table/cell/45-8.txt could be reached with doi:10.5072/Big4/table/cell/45-8.txt
• In test with ARKs. Conflict with other resolvers?
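
The idea can be sketched as a lookup that falls back to progressively shorter prefixes of the requested name. The registry contents below are an assumption chosen so that the slide's example works out (the single registered name is taken to point at http://x.y.z/foo/Big4/db); the function is an illustration, not the N2T resolver's code.

```python
from typing import Optional

# Hypothetical registry: one registered name N and its target T.
REGISTRY = {"doi:10.5072/Big4": "http://x.y.z/foo/Big4/db"}

def resolve(identifier: str) -> Optional[str]:
    """Resolve N directly, or resolve N/S to T/S via the longest registered prefix."""
    if identifier in REGISTRY:
        return REGISTRY[identifier]
    prefix = identifier
    # Shorten the name one "/"-delimited segment at a time.
    while "/" in prefix:
        prefix, _, _ = prefix.rpartition("/")
        if prefix in REGISTRY:
            suffix = identifier[len(prefix):]  # keeps the leading "/"
            return REGISTRY[prefix] + suffix
    return None
```

With this registry, `resolve("doi:10.5072/Big4/table/cell/45-8.txt")` yields the component URL from the slide, while only one name is ever registered and managed.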

Page 22: Scalable Identifiers for Natural History Collections

Tombstone and other surrogate pages

Tombstone, incubation, and other surrogate pages (probation?) auto-generated from metadata, eg,

http://n2t.net/ezid/tombstone/id/ark:/20775/bb3243444z

Page 23: Scalable Identifiers for Natural History Collections

Reserved identifiers and multiple targets

• Some ids must be created and managed (reserved) before going public, eg, for manuscript preparation

• In test: infrastructure for multiple targets and multiple instances of any metadata element

• What should user experience be for multiple targets?
  – Present a menu of targets (burden of choice)?
  – One target chosen for them (burden of inflexibility)?

Page 24: Scalable Identifiers for Natural History Collections

Identifier (ARK) inflections: ? ?? / .

• Inflect: change endings without creating new words
  – Terminal ? means “I want metadata”, which is similar to linked data content negotiation (also in EZID test)
  – Terminal ?? means “I also want support metadata”
  – Drawing board: / could mean “I want a landing page” and . could mean “I want the usual computable thing”
• Allow inflections beyond ARKs to DOIs/URNs?
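
A resolver that honors the ? and ?? inflections has to separate the ending from the identifier before lookup. The function below is a minimal sketch of that step; the function name and the returned labels are invented for this example, and the drawing-board endings / and . are left out because their meaning is not settled.

```python
from typing import Tuple

def split_inflection(request: str) -> Tuple[str, str]:
    """Split a requested name into (identifier, kind of response wanted)."""
    # Check "??" before "?" so the longer ending wins.
    for ending, wanted in (("??", "metadata+support"), ("?", "metadata")):
        if request.endswith(ending):
            return request[: -len(ending)], wanted
    # No inflection: the caller wants the object itself.
    return request, "object"
```

For instance, `split_inflection("ark:/13030/qt0349g1rh?")` separates the ARK from a request for its metadata, so the same registered name serves both kinds of request.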

Page 25: Scalable Identifiers for Natural History Collections

Example: http://n2t.net/ark:/13030/qt0349g1rh?

erc:
who: Renninger, Heidi,; Phillips, Nathan,; Hodel, Donald,
what: Comparative hydraulic and anatomic properties in palm trees (Washingtonia robusta) of varying heights
when: 2009-04-29
where: ark:/13030/qt0349g1rh

Renninger, Heidi; Phillips, Nathan; Hodel, Donald. “Comparative hydraulic and anatomic properties in palm trees (Washingtonia robusta) of varying heights”. 2009-04-29. ark:/13030/qt0349g1rh

HTML content with embedded comments in ANVL/ERC and RDF
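
The step from the ERC record to the citation string above can be sketched as follows. The parser is deliberately simplified (one "label: value" pair per line, no continuation lines or comments), and the function names are made up for this example; it is not the code EZID uses.

```python
ERC = """erc:
who: Renninger, Heidi,; Phillips, Nathan,; Hodel, Donald,
what: Comparative hydraulic and anatomic properties in palm trees (Washingtonia robusta) of varying heights
when: 2009-04-29
where: ark:/13030/qt0349g1rh
"""

def parse_anvl(text: str) -> dict:
    """Parse simple 'label: value' lines into a dict (simplified ANVL)."""
    record = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first colon only, so "ark:/..." values stay intact.
        label, _, value = line.partition(":")
        record[label.strip()] = value.strip()
    return record

def citation(record: dict) -> str:
    """Build a who/what/when/where citation string from a parsed record."""
    # Drop the trailing comma that marks each inverted name in the record.
    authors = "; ".join(name.strip().rstrip(",") for name in record["who"].split(";"))
    return f'{authors}. "{record["what"]}". {record["when"]}. {record["where"]}'
```

Calling `citation(parse_anvl(ERC))` produces a string in the same order as the citation shown on the slide, with straight rather than curly quotes.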

Page 26: Scalable Identifiers for Natural History Collections

ARK community and governance

• ARK mailing list: [email protected]
• Topics: governance, community, standardization
• Registry maintenance: shoulders and NAANs
• N2T consortium with alternative EZID-like services