Distributed data access: THREDDS, OAI, CDP

Page 1: Distributed data access: THREDDS, OAI, CDP

Supercomputing • Communications • Data

NCAR Scientific Computing Division

Distributed data access: THREDDS, OAI, CDP

Presented by: Michael Burek

Acknowledgments: CDP staff: Dave Brown, Luca Cinquini, Don Middleton, Rob Markel, Scott Nixon, Nate Wilhelmi

Page 2: Distributed data access: THREDDS, OAI, CDP

Outline

• Community Data Portal (CDP)
• THREDDS in the CDP introduction
• THREDDS in detail
• THREDDS applied in the CDP, some details
• OAI -- Open Archives Initiative
• Demo
• Thoughts about future developments

Page 3: Distributed data access: THREDDS, OAI, CDP

Introduction to the CDP

Community Data Portal (CDP) Project

UCAR-wide, uniform community resource for data discovery (search and browse) across the organization

Search/browse:
o Supports free or structured queries to find data
o Boolean combinations
o Keyword, controlled vocabularies
  – Creator, Publisher, Science Keyword (GCMD), Variable name (CF)
  – Data Format, Data Type, Data Delivery Service
o Geographic, Time, Altitude

Data delivery services:
o aggregation, subsetting, FTP, HTTP, Mass Store, LAS/FERRET, OPeNDAP

Page 4: Distributed data access: THREDDS, OAI, CDP

Introduction to the CDP, cont.

The CDP serves a diverse range of data providers:
o Project-based archives -- small, often limited resources
o Multi-institutional teams -- geographically separated
o Multiple data types within a project: measurements, models, images
o The CDP cooperates with NCAR's existing data organizations
o A few unusual datasets -- HAO division
o Model software. Visualizations.

Page 5: Distributed data access: THREDDS, OAI, CDP

CDP, Technologies

o The CDP was begun in 2001
o Uses THREDDS* catalogs to describe data content and structure
o Uses Lucene as the search/discovery back end
o Uses the Open Archives Initiative (OAI) to share metadata
o Uses SRM to access deep archive data, share data externally (ESG project)
o Experimental use of SRB for intra-institution sharing
o Sister site, Earth System Grid (ESG), uses grid technology to share data
o Uses DODS/OPeNDAP for aggregating and subsetting data sets
o Uses a distributed model for accessing data and metadata

https://cdp.ucar.edu/

*THREDDS: Thematic Realtime Environmental Distributed Data Services

Page 6: Distributed data access: THREDDS, OAI, CDP

Introduction, THREDDS in the CDP

• THREDDS is a schema used for DATA DELIVERY
  – Can also be used for geoscience data search and discovery

THREDDS catalogs:
• Are ingested into Lucene and GEO extent searching tools for search and discovery
• Are used to supply data for search results and browse pages
• Specify data access mechanisms
  – http, http restricted, OPeNDAP, MSS, TDS, LAS, GDS, CDP/agg
• Point to and use non-THREDDS metadata
  – ESG, DC, NcML, NcML, GML, DIF
  – Can interoperate with WMO metadata when available

Page 7: Distributed data access: THREDDS, OAI, CDP

Introduction, THREDDS in the CDP, cont

• The CDP federates directly with other sites that use THREDDS catalogs
  – NCAR DSS, NCAR EOL, UCAR UNIDATA

• THREDDS catalogs are used inside DODS/OPeNDAP, GDS, and the forthcoming THREDDS Data Server

• THREDDS will support a data access control system, both local and distributed

Page 8: Distributed data access: THREDDS, OAI, CDP

THREDDS Background

• THREDDS v0.6
  – Support for describing the hierarchical structure of datasets
  – Support for describing data delivery services
  – Some very basic descriptive metadata
  – Support for extensible and distributed catalogs
  – Support for “inheritance” of metadata and services
  – Allows other descriptive schemas to be part of the catalog

Emphasizes the hierarchical relationships between data items, containing datasets and groups of datasets

Page 9: Distributed data access: THREDDS, OAI, CDP

THREDDS V1.0

• THREDDS v1.0
  – Added descriptive “minimal” metadata tuned for Earth Science search/discovery
  – “Minimal” defined -- metadata sized for search/discovery
  – Again, metadata can be inherited within the hierarchy
  – Design goal was to interoperate with core elements of DIF, ISO-19115, DC metadata
  – UNIDATA looking at incorporating THREDDS metadata in NetCDF* and forthcoming TDS**
  – Exploring possibly interoperating with BADC model extensions
  – V1.0x will have access control elements

URL: http://my.unidata.ucar.edu/content/projects/THREDDS/index.htm

*NetCDF: UNIDATA-defined binary data format for gridded and other geoscience data. Includes metadata that describes the data in the file header

**TDS: THREDDS Data Server -- will handle GRIB and NetCDF, will have WCS

Page 10: Distributed data access: THREDDS, OAI, CDP

THREDDS -- CDP

• CDP THREDDS design choices
  – Use THREDDS descriptive metadata for search/discovery
  – Use GCMD DIF controlled vocabularies for science keyword hierarchies, creator, publisher, project
  – Use Climate and Forecast (CF) conventions for variable names when applicable
  – Mandate use of a unique identifier to identify data
  – Use forthcoming THREDDS elements for data access control
  – Use OAI to import DIF records from BADC and GCMD, transform these records into equivalent THREDDS for use in the CDP
  – Import ESG (CCSM) records (THREDDS, ESG), extract a subset of descriptive metadata for search and discovery

Page 11: Distributed data access: THREDDS, OAI, CDP

THREDDS, the details

General structure of a simple THREDDS catalog:

<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">   <!-- container dataset -->
    <metadata inherit="true">                          <!-- descriptive metadata -->
      <description type="summary"/>
      <creator/>
      <geospatialCoverage/>                            <!-- geographic location -->
      <!-- ... other metadata (13 total) -->
    </metadata>
    <dataset ID="ucar.scd.cdp.datasetName.item1">      <!-- describes a data item -->
      <dataSize units="Kbytes">123</dataSize>
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
    <!-- more dataset items -->
  </dataset>                                           <!-- close enclosing dataset -->
</catalog>

Dataset URL = service base + access urlPath. In this example both access elements point to a local server or local service.
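The "base + access" rule can be sketched in a few lines of Python. This is a minimal illustration, not CDP code: it assumes a namespace-free catalog shaped like the one on this slide, and the function name `resolve_access_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog modelled on the slide; no XML namespace is used here.
CATALOG_XML = """
<catalog>
  <service name="httpService" type="HTTP"
           base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">
    <dataset ID="ucar.scd.cdp.datasetName.item1">
      <access serviceName="httpService"
              urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService"
              urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
  </dataset>
</catalog>
"""


def resolve_access_urls(catalog_xml):
    """Return {dataset ID: [(service type, base + urlPath), ...]}."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to its (type, base) pair.
    services = {s.get("name"): (s.get("type"), s.get("base"))
                for s in root.iter("service")}
    resolved = {}
    for dataset in root.iter("dataset"):
        entries = []
        # Only direct <access> children belong to this dataset.
        for access in dataset.findall("access"):
            service_type, base = services[access.get("serviceName")]
            entries.append((service_type, base + access.get("urlPath")))
        if entries:
            resolved[dataset.get("ID")] = entries
    return resolved


urls = resolve_access_urls(CATALOG_XML)
for dataset_id, entries in urls.items():
    for service_type, url in entries:
        print(dataset_id, service_type, url)
```

The container dataset has no access elements of its own, so only the data item appears in the result, once per delivery service.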

Page 12: Distributed data access: THREDDS, OAI, CDP

THREDDS, simple catalog

[Diagram: structure of a simple catalog]
catalog
  service: HTTP data service
  service: MSS data service
  dataset (container)
    metadata: description, creator, geospatialCoverage, other elements
    dataset (data item): access, size, extent (one per data item; four shown)
Data access: local data access / local MSS service

Page 13: Distributed data access: THREDDS, OAI, CDP

THREDDS, distributed catalogs example

1. Descriptive metadata is in a separate file, which could be on another server.
2. Dataset contains references to remote catalogs.
3. Catalog-level access control elements.

[Diagram: distributed catalogs. A local catalog holds a dataset (container) with a metadata link (metadata: description, creator, geospatialCoverage, other elements; file dataset.thredds.xml) and two catalogRef elements, one marked ACCESS CONTROL. Each catalogRef points to a remote catalog (service, metadata, description, …, datasets) on a remote server with remote data services; one remote catalog is marked ACCESS CONTROL.]
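Following catalogRef links across servers could look roughly like the Python sketch below. It is an illustration under several assumptions: the catalogs omit the THREDDS namespace, the URLs and the `FAKE_WEB` dictionary are made up so the example runs without a network, and access control is not handled. The `xlink:href` attribute is how a catalogRef names its target.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical catalogs standing in for documents on two servers.
FAKE_WEB = {
    "http://portal.example/top.xml": """
        <catalog xmlns:xlink="http://www.w3.org/1999/xlink">
          <dataset name="top" ID="example.top">
            <catalogRef xlink:href="http://remote.example/a.xml" name="A"/>
            <catalogRef xlink:href="b.xml" name="B"/>
          </dataset>
        </catalog>""",
    "http://remote.example/a.xml": """
        <catalog><dataset name="a1" ID="example.a1"/></catalog>""",
    "http://portal.example/b.xml": """
        <catalog xmlns:xlink="http://www.w3.org/1999/xlink">
          <dataset name="b1" ID="example.b1"/>
          <catalogRef xlink:href="http://portal.example/top.xml" name="loop"/>
        </catalog>""",
}


def crawl(url, fetch, seen=None):
    """Depth-first walk over catalogs; returns dataset IDs in visit order."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against reference cycles
        return []
    seen.add(url)
    root = ET.fromstring(fetch(url))
    ids = [d.get("ID") for d in root.iter("dataset") if d.get("ID")]
    for ref in root.iter("catalogRef"):
        # Relative references are resolved against the referring catalog.
        ids += crawl(urljoin(url, ref.get(XLINK_HREF)), fetch, seen)
    return ids


dataset_ids = crawl("http://portal.example/top.xml", FAKE_WEB.__getitem__)
print(dataset_ids)
```

Passing the fetch function in as a parameter keeps the traversal logic separate from transport, so an HTTP client could be substituted for the dictionary lookup.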

Page 14: Distributed data access: THREDDS, OAI, CDP

THREDDS, database application example

[Diagram: database application. A database of arbitrary metadata on an external server is exposed through a "database to THREDDS catalog builder" web service, which produces a virtual catalog containing a service (external HTTP data service), metadata, and dataset data items (access, size, extent). Data hosting is external.]

Page 15: Distributed data access: THREDDS, OAI, CDP

THREDDS, distributed data example

1. Data is not on the CDP; the service is external and can implement access control if required.
2. Descriptive metadata is in a separate file and does not have to be THREDDS.

[Diagram: distributed data. A catalog with two services (external HTTP data service, MSS data service); a dataset (container) with metadata (description, creator, geospatialCoverage, other elements) and metadata external references to ISO-19115 elements on an external server; dataset data items (access, size, extent) with external data hosting.]

Page 16: Distributed data access: THREDDS, OAI, CDP

CDP - distributed datasets, overview

[Diagram: the Community Data Portal (top-level THREDDS catalog, metadata database, CDP data storage: WACCM, ACD, CME, CGD, …) connected through THREDDS catalogs to distributed sources:
• NCAR Data Support Section
• NCAR Atmospheric Chemistry
• NCAR EOL section
• NCAR MSS (Mass Store)
• BADC: DIFs, OAI server, OAI client, XSLT
• Boston University: SRB
• LANL, ORNL, LBNL: SRM (ESG)
Legend: T = THREDDS catalog, D = Data Archive, M = MSS deep archive, A = Access control]

Page 17: Distributed data access: THREDDS, OAI, CDP

THREDDS review/summary

• THREDDS is a schema used for DATA DELIVERY
• Contains basic geoscience discovery data
• Is designed to work with distributed data, distributed metadata
• Contains elements for data access restriction
• Can work with real time data
• Can be a container for non-THREDDS descriptive metadata
• Defines the hierarchical relationships of datasets
• Defines data delivery services
• Supports a hierarchical view of metadata
• Integrated with many data delivery and visualization services

Page 18: Distributed data access: THREDDS, OAI, CDP

Distributed Descriptive Metadata with OAI

• Metadata is immediately “distributed” if metadata is contained in or is pointed to by THREDDS catalogs

• Metadata can also be shared using OAI technology
  – OAI -- Open Archives Initiative from the Digital Library (DL) community
  – OAI is a web service definition for sharing metadata
  – OAI uses six verbs to define the service
  – OAI uses Dublin Core, DC, as the baseline schema
  – OAI can specify other XML schemas -- we use this capability
  – OAI can be used as a gateway to send information to an established DL community -- THREDDS -> DC => DL community via OAI

• OAI disadvantage -- hierarchical relationships are lost
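As a rough illustration of the six-verb service, the Python sketch below builds OAI-PMH request URLs. The verb names and the mandatory `oai_dc` (Dublin Core) metadata prefix come from the OAI-PMH 2.0 protocol; the endpoint URL, the `dif` prefix, and the helper name `oai_request_url` are assumptions for the example, since each repository chooses the prefixes for its additional schemas.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = ("Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord")


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for one verb."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


BASE = "http://oai.example.org/provider"  # hypothetical endpoint

# Dublin Core is the baseline schema every repository must offer.
list_dc = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
# A repository may also offer other XML schemas under its own prefix.
list_dif = oai_request_url(BASE, "ListRecords", metadataPrefix="dif")
print(list_dc)
print(list_dif)
```

A harvester would typically call ListMetadataFormats first to learn which prefixes a given repository actually supports.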

Page 19: Distributed data access: THREDDS, OAI, CDP

Distributed Metadata with OAI -- CDP

• THREDDS records are “flattened” (hierarchy collapsed): one record -> one dataset
• Flattened records are shared using OAI
• For a test, the THREDDS records were transformed into DIF using XSLT
• DIF records were ingested from BADC, transformed into THREDDS catalogs, and ingested into CDP search and browse
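One way to picture the flattening step is the Python sketch below. It is not the CDP implementation: the nested-dictionary form of a catalog, the sample metadata values, and the choice to emit records only for leaf data items are all assumptions made for illustration. Inherited metadata is copied down so that each flat record stands on its own once the hierarchy is gone.

```python
def flatten(dataset, inherited=None):
    """Collapse a dataset hierarchy into one flat record per data item.

    Each node is a dict with optional "id", "metadata" and "children" keys
    (a hypothetical in-memory form of a catalog). Metadata set on a container
    is inherited by its descendants unless they override it.
    """
    merged = dict(inherited or {})
    merged.update(dataset.get("metadata", {}))
    children = dataset.get("children", [])
    if not children:
        # Leaf: emit a single self-contained record.
        return [{"id": dataset["id"], **merged}]
    records = []
    for child in children:
        records.extend(flatten(child, merged))
    return records


# Hypothetical container with two data items.
tree = {
    "id": "ucar.scd.cdp.datasetName",
    "metadata": {"creator": "Example Group", "project": "Example Project"},
    "children": [
        {"id": "ucar.scd.cdp.datasetName.item1",
         "metadata": {"dataSize": "123 Kbytes"}},
        {"id": "ucar.scd.cdp.datasetName.item2",
         "metadata": {"creator": "Another Group"}},
    ],
}

records = flatten(tree)
for record in records:
    print(record)
```

The parent/child relationship itself is not recorded in the output, which is the loss of hierarchy noted as the OAI disadvantage.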

Page 20: Distributed data access: THREDDS, OAI, CDP

CDP metadata architecture

[Diagram: CDP metadata architecture.
Components: THREDDS catalog, external metadata, Catalog Parsing, Metadata Processing, Metadata DB (Lucene), Metadata Conversion (THREDDS records, DIF metadata, DC metadata), Metadata repository, OAI server, OAI client, remote Data Center or Digital Library.
Web Interface/Web Service: THREDDS catalogs browser (Web UI), XML viewer web application (XML results), free-text search query UI, structured/geospatial/temporal query UI.
Flow labels: invokes, parse, passed to, index into, read, write, export, import.]

Page 21: Distributed data access: THREDDS, OAI, CDP

Data publication on the CDP

[Diagram: data publication workflow.
Components: dataset (disk, HTTP, database, …), metadata authoring tool, THREDDS hierarchy metadata, THREDDS descriptive metadata, catalog crawler application, metadata indexing application, Lucene index, XSLT rendering, CDP catalog presentation, access control link, browser (HTTP).
Flow labels: edits, starts, allows, creates, is ingested.]

Page 22: Distributed data access: THREDDS, OAI, CDP

Demo

• Data searching: controlled vocabularies, GEO searching
• Data browsing: access control
• BADC shared metadata directory
• Metadata editing
• IDV Bundle showing integrated data source

Page 23: Distributed data access: THREDDS, OAI, CDP

Experimental Topology to share data? GISC -> CDP

[Diagram: WMO GISC and WMO DCPC (data, THREDDS catalogs over NetCDF, GRIB, …, WMO metadata) connected to the CDP (CDP Search, DB) through OAI, XSLT, and an HTTP crawler. Legend: W = WMO metadata, T = THREDDS metadata.]

1. OAI transfers of WMO records
2. CDP crawls data hierarchy -- no metadata
3. GISC creates Web interface to produce virtual THREDDS catalogs (embedded WMO descriptive metadata)

Page 24: Distributed data access: THREDDS, OAI, CDP

Experimental Topology to share data CDP -> GISC

[Diagram: labels are WMO GISC, WMO Search, CDP, CDP Search, THREDDS catalog (NetCDF, GRIB, …), OAI, and XSLT. Legend: W = WMO metadata, T = THREDDS metadata.]

Page 25: Distributed data access: THREDDS, OAI, CDP

Questions?