Introduction to Persistent Identifiers
www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Introduction to Persistent Identifiers: PIDs in EUDAT
This work is licensed under the Creative Commons CC-BY 4.0 licence. Attribution: EUDAT – www.eudat.eu
Content
What are persistent identifiers?
Why use persistent identifiers?
Different persistent identifier systems
The HANDLE system
EPIC PID system
Policies
Use cases
PID Training
PERSISTENT IDENTIFIERS
Science Data
Data generation is getting easier and cheaper
Complexity is shifting from data generation to data processing & analysis
The volume of data output is increasing
Data needs to be:
Findable
Accessible
Interoperable
Reusable
Briefly, what are PIDs?
Pointers to data resources: data files, metadata files, documents …
Globally unique
Exist indefinitely
Identify and retrieve resources: can be resolved to the resource
Examples: ISBNs, DOIs, PURLs, Handles …
Data Creation Cycle
[Figure: data creation cycle with the stages raw data, registration & preservation, analysis & enrichment and citable publication; the data states shown are temporary data, referable data and citable data, underpinned by persistent and robust identification]
PIDs are static
[Figure: the world of data infrastructure, with Data 1 to Data 4 each referenced by its own identifier, PID 1 to PID 4]
Example: Data hierarchies
[Figure: a hierarchy of data objects covering raw data, analysed data and published data; each data object has its own PID and metadata]
What is the problem? Why not use simple URLs?
A URL specifies the location on a particular server from which a resource can be retrieved; URLs are strictly network locations for digital resources.
The domain may change
The resource may be relocated
The link may change
But in the long term, a year or two later, URLs often no longer work: “link rot”
Persistent over time
[Figure: timeline from today (2016) to 2030. The PID 11839/abc123 stays the same while the location of the data (111000010001111) changes from http://www.example.com/ to http://www.moved.com/]
Supports access to a resource as it moves from one location to another, by design.
Persistent Identifiers
A Persistent Identifier is distinct from a URL: it is not strictly bound to a specific server or filename
“A persistent identifier (PID) is a long-lasting reference to a digital object—a single file or set of files.“
https://en.wikipedia.org/wiki/Persistent_identifier
[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) resolves to the data (111000010001111)]
Identifier points to a resource with no actual knowledge of the resource
Responsibility of the PID owner to keep it up-to-date when the resource changes
Persistent Identifier
Points to one or more resources
Is globally unique
[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) resolves to the data (111000010001111)]
Once the PID is created, the resource is globally addressable.
Data, metadata, document, code
Prefix: designates the administrative domain and comes from an issuing instance
Suffix: unique in the realm of the prefix
Managing Persistent Identifiers
Managing data includes managing the persistent identifier for the data.
The domain may change
The resource may be relocated
The link may change
To avoid “link rot”, the PID needs to be updated to point to the new location (URL).
The PID continues to provide the latest information about the resource.
With PIDs
Redirection Layer
[Figure: a stable PID points through an administrative record to the unstable, changing location (http://….) of the data (111000010001111)]
The redirection layer bridges the stable and unstable worlds at the cost of some administrative responsibilities
Data moves over time
Data is always reachable by its PID
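The redirection idea can be illustrated with a toy in-memory registry. This is only a sketch of the concept, not how any real PID system is implemented; the class name `PidRegistry` and its methods are hypothetical, and the PID and URLs are the ones used in the slides.

```python
class PidRegistry:
    """Toy redirection layer: maps a stable PID to its current URL."""

    def __init__(self):
        self._records = {}

    def register(self, pid, url):
        # A PID is created once and never reassigned to another resource.
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = url

    def update(self, pid, new_url):
        # The administrative duty of the PID owner: keep the record current.
        if pid not in self._records:
            raise KeyError(pid)
        self._records[pid] = new_url

    def resolve(self, pid):
        # Resolution always returns the latest known location.
        return self._records[pid]


registry = PidRegistry()
registry.register("11839/abc123", "http://www.example.com/")
before = registry.resolve("11839/abc123")

# The data moves; only the record behind the PID changes.
registry.update("11839/abc123", "http://www.moved.com/")
after = registry.resolve("11839/abc123")
```

Anything that cited `11839/abc123` keeps working after the move, which is the point of the extra layer.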
PID Advantages
Persistent identity via indirection: static references into fluid systems over time
  Data on networks moves
  Ownership/responsibility changes
  Formats change
Embedded IDs: for the data object in hand, current state data
  Updates
  New related entities
Networks of persistent links
  Data / metadata links
  Provenance chains
PID Disadvantages
Extra level of effort / cost on creation
  Analysis: what to identify / granularity
  Coordination across organisations
  Maintain resolution system
Persistence requires sustained effort
  Organisational discipline
  Technology necessary but not sufficient
Analyse cost/benefit ratio
  Don’t start unless it is worthwhile
  Is your data worth it?
PID SYSTEMS
Identifier – PID
Every identifier consists of two parts: its prefix and a unique local name under the prefix, known as its suffix.
Prefix: designates the administrative domain; it is generated by an issuing instance, which makes sure that all prefixes are unique.
Suffix: the local name, which must be unique under its prefix.
The uniqueness of a prefix and of the local name under that prefix ensures that any identifier is globally unique within the context of the system.
< PREFIX > / < SUFFIX > (e.g. 11111/123456745)
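The prefix/suffix structure can be sketched in a few lines of Python. This is an illustrative helper only; the names `Pid` and `parse_pid` are hypothetical and not part of any PID library. It assumes the identifier is split at the first slash, so a suffix may itself contain further slashes.

```python
from typing import NamedTuple


class Pid(NamedTuple):
    prefix: str
    suffix: str

    def __str__(self):
        return f"{self.prefix}/{self.suffix}"


def parse_pid(identifier: str) -> Pid:
    """Split '<PREFIX>/<SUFFIX>' into its two parts."""
    # Split at the first slash only (assumption: the suffix may contain slashes).
    prefix, sep, suffix = identifier.strip().partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a <PREFIX>/<SUFFIX> identifier: {identifier!r}")
    return Pid(prefix, suffix)


pid = parse_pid("11111/123456745")
```

Here `pid.prefix` is `"11111"` and `pid.suffix` is `"123456745"`; a string without a slash is rejected.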
PID Systems
Persistent URLs (PURLs)
  Cost: none
  Metadata: no additional metadata
  Example: purl: GPO/gpo46189
Handle System
  Cost: $50 annual fee per prefix
  Metadata: any metadata can be associated
  Example: hdl:11210/123
Digital Object Identifier (DOI)
  Cost: fee per DOI plus annual fee
  Metadata: the INDECS schema
  Example: DOI: 10.1000/182
Archival Resource Key (ARK)
  Cost: none
  Metadata: ERC (Electronic Resource Citation) metadata
  Example: ark:/12025/654xz321
PID System Requirements
Attach multiple URLs to a PID
Allow part identifiers for complex objects (granularity issue)
Allow attaching extra data records to the PID (MD5 checksum, etc.)
Actionable (URLified) PIDs
HTTP proxy for resolving (use port 80 only)
Control by user community
REST or SOAP interface for administration of PIDs from applications
Delegation of PID administration to other organisations
Distributed, robust, highly available, scalable
No single point of failure
Acceptable non-commercial business model
Identifier String Requirements that contribute to persistence
Not based on any changeable attributes of the entity
  Location
  Ownership
  Any other attribute that may change without changing identity
Unique
  Avoids collisions and referential uncertainty
Opaque, preferably a ‘dumb number’
  A well-known pattern invites assumptions that may be misleading
  Meaningful semantics invite IP wars and language problems
Nice to have
  Human-readable
  Cut-able, paste-able
  Fits common systems, e.g. the URI specification
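One way to obtain an opaque suffix that is not derived from location, ownership or any other changeable attribute is to use a random UUID. The sketch below is one possible approach under that assumption, not a prescribed scheme; `mint_pid` is a hypothetical helper and the prefix `11111` is the example prefix from the slides.

```python
import uuid


def mint_pid(prefix: str) -> str:
    """Return '<prefix>/<random suffix>' with an opaque, 'dumb' suffix."""
    # uuid4 is random: it encodes no location, owner, date or meaning,
    # and collisions are extremely unlikely.
    return f"{prefix}/{uuid.uuid4()}"


first = mint_pid("11111")
second = mint_pid("11111")
```

The trade-off is readability: such suffixes remain cut-able and paste-able but are not meaningful to a human reader.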
PIDs in EUDAT
EUDAT has adopted Handle-based persistent identifiers
A combined solution of the Handle System and the EPIC service (today)
Employing the latest Handle v.8
EUDAT developed a library, B2HANDLE, to interact with Handle v.8
HANDLE SYSTEM
The Handle System
The Handle System is a technology specification for assigning, managing, and resolving persistent identifiers for digital objects and other resources. The protocols specified enable a distributed computer system to store identifiers (names, or handles) of digital resources and resolve those handles to the information necessary to locate, access, and otherwise make use of the resources. That information can be changed as needed to reflect the current state or location of the identified resource without changing the handle.
Handle System
The main goal of the Handle System is to contribute to persistence.
The Handle System is:
reliable
scalable
flexible
trusted
built on open architecture
transparent
A Handle Record
Handle: 10232/1234
Data Type   Index   Handle data                 Timestamp
URL         1       https://www.eudat.eu/ex1    2014-04-09 12:46:53Z
INST        2       EUDAT                       2014-04-09 12:46:53Z
HS_ADMIN    100     eudat/user1                 2014-04-09 12:46:53Z
PID (handle): 10232/1234
Actionable URL: http://hdl.handle.net/10232/1234
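The record above can be modelled as a small data structure. The sketch below is only an illustration of the record layout shown on the slide; `HandleValue`, `actionable_url` and `values_of_type` are hypothetical names, not part of the Handle software or B2HANDLE.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandleValue:
    index: int
    type: str
    data: str
    timestamp: str


# The example record from the slide.
HANDLE = "10232/1234"
VALUES = [
    HandleValue(1, "URL", "https://www.eudat.eu/ex1", "2014-04-09 12:46:53Z"),
    HandleValue(2, "INST", "EUDAT", "2014-04-09 12:46:53Z"),
    HandleValue(100, "HS_ADMIN", "eudat/user1", "2014-04-09 12:46:53Z"),
]


def actionable_url(handle, proxy="http://hdl.handle.net"):
    """Turn a handle into a URL that the proxy can resolve."""
    return f"{proxy}/{handle}"


def values_of_type(values, type_name):
    """Return the data of every value with the given type."""
    return [v.data for v in values if v.type == type_name]
```

With this model, `actionable_url(HANDLE)` gives the actionable URL from the slide, and `values_of_type(VALUES, "URL")` lists the locations stored in the record.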
HANDLE Record Types
Common types
  URL: one or more, pointing to the location(s) referenced by this handle
  HS_ADMIN: special record encoding the permissions configured for this handle
  10320/LOC: supports multiple locations, with the resolver deciding which one to return
Custom types
  Checksum: useful for integrity verification
  EUDAT/ROR: EUDAT-specific, for B2SAFE. ROR (Repository of Records) is the repository where the data was stored first.
  EUDAT/PPID: EUDAT-specific, for B2SAFE. The PID associated with the source object in a replication chain. If the chain has only two elements, the master copy and the first replica, then PPID = ROR.
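A checksum stored in a handle record can be compared against the data that was actually retrieved. The following is a minimal sketch assuming the record holds an MD5 hex digest; the function names are hypothetical, and a real deployment may use a different algorithm or record layout.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=65536):
    """Compute the MD5 hex digest of a binary stream, chunk by chunk."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_record(stream, recorded_checksum):
    """Compare the data's checksum with the one stored in the handle record."""
    return md5_hex(stream) == recorded_checksum.strip().lower()


# Usage with in-memory data; a file opened in binary mode works the same way.
ok = matches_record(io.BytesIO(b"hello"), "5d41402abc4b2a76b9719d911017c592")
```

Reading in chunks keeps memory use constant, which matters for large data files.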
10320/loc Handle Type
The 10320/LOC field is specifically designed to allow the HTTP handle resolver to make an intelligent decision about which location to return if multiple locations are available.
Options:
  Weight: specifies a weight per location. Load is distributed over all locations according to their assigned weights.
  Country: specifies where this location is hosted. This allows the HTTP resolver to return the location closest to the user (based on a GeoIP lookup).
  Weighted: selects a single location based on a random choice.
10320/loc Handle Type Example
<locations>
  <location id="0" href="http://uk.example.com/" country="gb" weight="0" />
  <location id="1" href="http://www1.example.com/" weight="1" />
  <location id="2" href="http://www2.example.com/" weight="1" />
</locations>
PID: 10232/1234
Reference 1: from a client located in the UK
Reference 2: from a client located outside the UK
Reference 3: 10232/1234?locatt=id:1
Reference 4: 10232/1234?locatt=id:0
Reference 5: 10232/1234?locatt=country:us
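How a resolver might pick a location from this record can be sketched as follows. This is a simplified illustration, not the Handle resolver's actual implementation: the order of the checks (explicit `locatt` first, then country, then a weighted random choice) and the fall-through when nothing matches are assumptions, and `choose_location` is a hypothetical function.

```python
import random
import xml.etree.ElementTree as ET

LOC_XML = """<locations>
  <location id="0" href="http://uk.example.com/" country="gb" weight="0" />
  <location id="1" href="http://www1.example.com/" weight="1" />
  <location id="2" href="http://www2.example.com/" weight="1" />
</locations>"""


def choose_location(xml_text, client_country=None, locatt=None, rng=random):
    """Pick one href from a 10320/LOC value (simplified sketch)."""
    locations = [loc.attrib for loc in ET.fromstring(xml_text).findall("location")]

    # 1. Explicit selection, e.g. locatt="id:1" or locatt="country:us".
    if locatt:
        key, _, value = locatt.partition(":")
        matches = [loc for loc in locations if loc.get(key) == value]
        if matches:
            return matches[0]["href"]

    # 2. Location hosted in the client's country (from a GeoIP lookup).
    if client_country:
        matches = [loc for loc in locations
                   if loc.get("country") == client_country.lower()]
        if matches:
            return matches[0]["href"]

    # 3. Weighted random choice; a weight of 0 is never picked here.
    weights = [float(loc.get("weight", 1)) for loc in locations]
    if sum(weights) == 0:
        return locations[0]["href"]
    return rng.choices(locations, weights=weights, k=1)[0]["href"]


uk_client = choose_location(LOC_XML, client_country="gb")
by_id = choose_location(LOC_XML, locatt="id:1")
```

Under these assumptions a UK client is sent to `http://uk.example.com/`, `locatt=id:1` selects `http://www1.example.com/`, and any other client gets one of the two weighted locations at random.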
Part Identifiers
Part identifiers make it possible to compute an unlimited number of handles on the fly by registering just one.
A single template handle can be created as a base that will allow any number of extensions to that base to be resolved as full handles, according to a pattern, without each such handle being individually registered.
In the Handle System, part (fragment) identifiers are enabled with a template. The template is a syntax that defines a delimiter and an extension (the extension can be any kind of string added after the delimiter).
Part Identifiers - Examples
Use part identifiers:
  to reference a part of a dictionary
  to reference an unlimited number of ranges within a video
  to reference a part of a collection of items
Video example
  Create one handle: 10232/1234576
  A range: 10232/1234576@from=1:05&to=1:14
A PID is used to point to a location. Note that when your system offers part identifiers, it is responsible for maintaining the part identification fragment as well.
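On the receiving side, a service has to split the part identifier into the registered base handle and the extension. The sketch below assumes `@` as the delimiter and `from`/`to` as parameter names, as in the video example; both function names are hypothetical.

```python
from urllib.parse import parse_qs


def split_part_identifier(identifier, delimiter="@"):
    """Split 'base@extension' into (base handle, extension or None)."""
    base, sep, extension = identifier.partition(delimiter)
    return base, (extension if sep else None)


def parse_range(extension):
    """Read the 'from' and 'to' parameters of a video range extension."""
    params = parse_qs(extension)
    return params["from"][0], params["to"][0]


base, extension = split_part_identifier("10232/1234576@from=1:05&to=1:14")
start, end = parse_range(extension)
```

Only `10232/1234576` is registered; interpreting `from=1:05&to=1:14` is left to the system that serves the video.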
PID SYSTEM IN EUDAT: Handle System and EPIC
PID System: How does it work?
PID Service: generates and manages PIDs for digital objects
PID Replication: replicates the database of handles to guarantee a robust and highly available PID resolution function
Resolution Service: guarantees reliable resolution of the PIDs, forwarding the user to the resource
Global Handle Mirror: a mirror of the Global Handle in Europe
PID Service
A RESTful web service, using the HTTP application protocol.
[GET]: gets the data of a selected PID; searches for PIDs
[POST]: creates a new PID with automatic generation of the suffix name
[PUT]: creates or updates a PID with manual generation of the suffix name
[DELETE]: deletes a PID
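The four verbs can be sketched as HTTP requests built with Python's standard library. Everything service-specific here is an assumption made for illustration: the base URL `https://pid.example.org/api/handles` is hypothetical, the JSON payload shape is a guess, and authentication is omitted. The requests are only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; a real PID service publishes its own base URL.
BASE = "https://pid.example.org/api/handles"


def build_request(method, prefix, suffix=None, target_url=None):
    """Build (but do not send) a request against the hypothetical service."""
    url = f"{BASE}/{prefix}/" + (suffix or "")
    headers = {"Accept": "application/json"}
    body = None
    if target_url is not None:
        # Assumed payload shape: a list of typed values.
        body = json.dumps([{"type": "URL", "parsed_data": target_url}]).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=body, headers=headers, method=method)


get_req = build_request("GET", "11239", "abc123")                                   # read a PID
post_req = build_request("POST", "11239", target_url="https://www.eudat.eu/ex1")    # suffix generated by the service
put_req = build_request("PUT", "11239", "abc123", "https://www.eudat.eu/ex1")       # suffix chosen by the client
delete_req = build_request("DELETE", "11239", "abc123")                             # remove a PID
```

The difference between POST and PUT mirrors the slide: with POST the client names only the prefix, with PUT it names the full handle.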
Resolution Service
The web address for the handle resolution service that EUDAT uses is http://hdl.handle.net.
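Resolution can also be done programmatically. The sketch below assumes the proxy's JSON interface under `/api/handles/<handle>` and the response layout shown in `SAMPLE`; treat both as assumptions to verify against the Handle proxy documentation. The helper names are hypothetical, and `resolve` needs network access, so it is defined but not called here.

```python
import json
from urllib.request import urlopen

PROXY = "http://hdl.handle.net"


def api_url(handle):
    """URL of the (assumed) JSON view of a handle record on the proxy."""
    return f"{PROXY}/api/handles/{handle}"


def urls_from_response(payload):
    """Extract all URL values from a decoded JSON handle record."""
    return [v["data"]["value"]
            for v in payload.get("values", [])
            if v.get("type") == "URL"]


def resolve(handle, timeout=10):
    """Fetch the record from the proxy and return its URL values."""
    with urlopen(api_url(handle), timeout=timeout) as response:
        return urls_from_response(json.load(response))


# Illustrative response shape (assumed), using the example record 10232/1234.
SAMPLE = {
    "responseCode": 1,
    "handle": "10232/1234",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://www.eudat.eu/ex1"}},
        {"index": 2, "type": "INST",
         "data": {"format": "string", "value": "EUDAT"}},
    ],
}
locations = urls_from_response(SAMPLE)
```

For interactive use, opening `http://hdl.handle.net/<handle>` in a browser is enough; the proxy forwards the user to the resource.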
EUDAT options for PIDs
In order to access a data object stored in EUDAT, an associated persistent identifier (PID) is needed. EUDAT requires integration of Handle in your infrastructure. Before your community or data centre can create PIDs you need a prefix. There are two options:
you can run your own Handle system; or
you can pass the details to EUDAT partners to manage it on your behalf.
An additional benefit of using the EUDAT systems is access to a REST API to manage your PID handles.
POLICIES
How may I use a PID?
Once you own a PID, use it:
  Online
  In your publications
  In your linked data
You may also use it:
  To get the data
  To refer to the data
Use it as an actionable URL: http://hdl.handle.net/11239/GRNET
Policy Document
When to use persistent identifiers?
There is no one-size-fits-all strategy for implementing PIDs; it takes analysis and thought
Create a policy document of what & when
  Analyse the use of PIDs and create a policy for their management
  What to register
  When in the data management life cycle
Policy Document
Simple questions:
  Which data objects need a PID (collections, files, metadata records)?
  What kinds of data are likely to stay online long enough?
  What kinds of data are likely to be linked to?
  What kinds of data are likely to be analysed/processed with tools?
  What will happen after data goes offline?
  etc.
USE CASES
Example 1: B2SHARE
B2SHARE is a user-friendly, reliable and trustworthy way for researchers, scientific communities and citizen scientists to store and share small-scale research data from diverse contexts.
Example 2: B2SAFE
B2SAFE employs PIDs to keep track of and link replicas of data in the EUDAT network
Example 2: Enable data flows
Link directly to the data (?locatt=id:0)
Optionally include a (MIME) type in the handle record; it can be used to select appropriate tooling
Summary
Persistent Identifiers provide a solution to the “link rot” problem by providing an extra layer of indirection
Several systems are available; some offer additional functionality in the form of support for storing additional metadata, providing a global resolver, etc.
Policy Document: How to use persistent identifiers in your repository requires some analysis and thought
Summary
The Handle System, via the EPIC system, is EUDAT's PID framework of choice because it is:
  Low cost, with only a flat annual fee
  Robust, scalable and performant
  Flexible, allowing the addition of any metadata
  Equipped with a global resolver
However, there are some challenges, especially in scenarios where multiple administrative domains are involved.
Hands-on material
Material on PID hands-on (part 7)
Hands-on tutorial which shows how to:
  Create, manage and delete PIDs
  Work with PIDs in workflows
Examples for Handle v8:
  Epicclient.py
  cURL commands
  B2HANDLE library
https://github.com/EUDAT-Training/B2SAFE-B2STAGE-Training
Training module which provides hands-on material for:
  EUDAT B2SAFE
  iRODS4
  B2HANDLE
  and the EUDAT B2STAGE service.
Authors and contributors
Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara
Ellen Leenarts, DANS
Thank you