Distributed data access: THREDDS, OAI, CDP
Supercomputing • Communications • Data
NCAR Scientific Computing Division
Distributed data access: THREDDS, OAI, CDP
Presented by: Michael Burek
Acknowledgments: CDP staff: Dave Brown, Luca Cinquini, Don Middleton, Rob Markel, Scott Nixon, Nate Wilhelmi
Outline
• Community Data Portal (CDP)
• THREDDS in the CDP introduction
• THREDDS in detail
• THREDDS applied in the CDP, some details
• OAI -- Open Archives Initiative
• Demo
• Thoughts about future developments
Introduction to the CDP
Community Data Portal (CDP) Project
UCAR-wide, uniform community resource for discovery (search and browse) across the organization
Search/browse:
o Supports free or structured queries to find data
o Boolean combinations
o Keyword, controlled vocabularies
  – Creator, Publisher, Science Keyword (GCMD), Variable name (CF)
  – Data Format, Data Type, Data Delivery Service
o Geographic, Time, Altitude
Data delivery services:
o aggregation, subsetting, FTP, HTTP, Mass Store, LAS/FERRET, OPeNDAP
Introduction to the CDP, cont.
The CDP serves a diverse range of data providers:
o Project based archives -- small, often limited resources
o Multi institutional teams -- geographically separated
o Multiple data types within a project: measurements, models, images
o The CDP cooperates with existing NCAR data organizations
o A few unusual datasets -- HAO division
o Model software. Visualizations.
CDP, Technologies
o The CDP was begun in 2001
o Uses THREDDS* catalogs to describe data content and structure
o Uses Lucene as the search/discovery back end
o Uses Open Archives Initiative (OAI) to share metadata
o Uses SRM to access deep archive data, share data externally (ESG project)
o Experimental use of SRB to share intra-institution
o Sister site, Earth System Grid (ESG), uses grid technology to share data
o Uses DODS/OPeNDAP for aggregation and subsetting of data sets
o Uses a distributed model for accessing data and metadata
https://cdp.ucar.edu/
*Thematic Realtime Environmental Distributed Data Services
Introduction, THREDDS in the CDP
• THREDDS is a schema used for DATA DELIVERY
  – Can also be used for geoscience data search and discovery
THREDDS catalogs:
• Are ingested into Lucene and GEO extent searching tools for search and discovery
• Are used to supply data for search results and browse pages
• Specify data access mechanisms
  – http, http restricted, OPeNDAP, MSS, TDS, LAS, GDS, CDP/agg
• Point to and use non-THREDDS metadata
  – ESG, DC, NcML, NcML, GML, DIF
  – Can interoperate with WMO metadata when available
Introduction, THREDDS in the CDP, cont
• The CDP federates directly with other sites that use THREDDS catalogs
  – NCAR DSS, NCAR EOL, UCAR UNIDATA
• THREDDS catalogs are used inside DODS/OPeNDAP, GDS, and the forthcoming THREDDS Data Server
• THREDDS will support a data access control system, locally and distributed
THREDDS Background
• THREDDS v0.6
  – Support for describing the hierarchical structure of datasets
  – Support for describing data delivery services
  – Some very basic descriptive metadata
  – Support for extensible and distributed catalogs
  – Support for “inheritance” of metadata and services
  – Allows other descriptive schemas to be part of the catalog
Emphasizes the hierarchical relationships between data items, containing datasets and groups of datasets
THREDDS V1.0
• THREDDS v1.0
  – Added descriptive “minimal” metadata tuned for Earth Science search/discovery
  – “Minimal” defined -- metadata sized for search/discovery
  – Again, metadata can be inherited within the hierarchy
  – Design goal was to interoperate with core elements of DIF, ISO-19115, DC metadata
  – UNIDATA looking at incorporating THREDDS metadata in NetCDF* and forthcoming TDS**
  – Exploring possibly interoperating with BADC model extensions
  – V1.0x will have access control elements
URL: http://my.unidata.ucar.edu/content/projects/THREDDS/index.htm
*NetCDF: UNIDATA-defined binary data format for gridded and other geoscience data. Includes metadata that describes the data in the file header
**TDS: THREDDS Data Server -- will handle GRIB and NetCDF, will have WCS
THREDDS -- CDP
• CDP THREDDS design choices
  – Use THREDDS descriptive metadata for search/discovery
  – Use GCMD DIF controlled vocabularies for science keyword hierarchies, creator, publisher, project
  – Use Climate and Forecasting (CF) conventions for variable names when applicable
  – Mandate use of unique identifier to identify data
  – Use forthcoming THREDDS elements for data access control
  – Use OAI to import DIF records from BADC and GCMD, transform these records into equivalent THREDDS for use in the CDP
  – Import ESG (CCSM) records (THREDDS, ESG), extract a subset of descriptive metadata for search and discovery
THREDDS, the details
General structure of a simple THREDDS catalog:

<catalog>
  <service name="httpService" type="HTTP" base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">  <!-- container dataset -->
    <metadata inherit="true">  <!-- descriptive metadata -->
      <description type="summary"/>
      <creator/>
      <geospatialCoverage/>  <!-- geographic location -->
      <!-- … other metadata (13 total) -->
    </metadata>
    <dataset ID="ucar.scd.cdp.datasetName.item1">  <!-- describes a data item -->
      <dataSize units="Kbytes">123</dataSize>
      <access serviceName="httpService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService" urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
    <!-- more dataset items -->
  </dataset>  <!-- close enclosing dataset -->
</catalog>

Dataset URL = service base + access urlPath; here it points to a local server or local service
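The "base + access" rule above can be illustrated with a small Python sketch. It assumes the un-namespaced element and attribute names shown on the slide (published THREDDS catalogs may use a namespace and different attribute names), and the function name resolve_access_urls is hypothetical.

```python
import xml.etree.ElementTree as ET

# Catalog modeled on the slide; names follow the slide, not necessarily
# the official THREDDS schema (assumption).
CATALOG_XML = """
<catalog>
  <service name="httpService" type="HTTP"
           base="http://dataportal.ucar.edu/data/abcData/"/>
  <service name="mssService" type="MSS" base="/mssRoot/abcData/"/>
  <dataset name="abc" ID="ucar.scd.cdp.datasetName">
    <dataset ID="ucar.scd.cdp.datasetName.item1">
      <access serviceName="httpService"
              urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
      <access serviceName="mssService"
              urlPath="subDataset/SOLVE_DC8_19991119.nc"/>
    </dataset>
  </dataset>
</catalog>
"""


def resolve_access_urls(catalog_xml):
    """Return {dataset ID: {service name: service base + urlPath}}."""
    root = ET.fromstring(catalog_xml)
    # Map each service name to its base location.
    bases = {s.get("name"): s.get("base") for s in root.iter("service")}
    urls = {}
    for dataset in root.iter("dataset"):
        # Only direct <access> children belong to this dataset.
        for access in dataset.findall("access"):
            name = access.get("serviceName")
            full = bases[name] + access.get("urlPath")
            urls.setdefault(dataset.get("ID"), {})[name] = full
    return urls


urls = resolve_access_urls(CATALOG_XML)
for service_name, url in urls["ucar.scd.cdp.datasetName.item1"].items():
    print(service_name, "->", url)
```

The container dataset has no access elements of its own, so only the data item appears in the result.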
THREDDS, simple catalog
[Diagram: a catalog containing two services (HTTP data service, MSS data service) and a container dataset. The container dataset holds metadata (description, creator, geospatialCoverage, other elements) and several data-item datasets, each with access, size, extent. Data access is local, including the local MSS service.]
THREDDS, distributed catalogs example
1. Descriptive metadata is in a separate file and could be on another server.
2. Dataset contains references to remote catalogs.
3. Catalog-level access control elements.
[Diagram: a catalog holds a container dataset whose metadata (description, creator, geospatialCoverage, other elements) is reached through a metadata link to dataset.thredds.xml. Two catalogRef elements, one marked ACCESS CONTROL, point to remote catalogs on a remote server; each remote catalog has its own service, metadata description, and datasets, served by remote data services.]
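A minimal Python sketch of how a client might walk such distributed catalogs by following catalogRef links. The URLs, the catalogs, and the in-memory fetch function are hypothetical stand-ins for real HTTP requests. The sketch assumes un-namespaced catalog elements as on the slides, with catalogRef carrying an xlink:href attribute, and it does not resolve relative links or handle access control.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes as "{namespace-uri}local-name".
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical catalogs keyed by URL, standing in for remote servers.
CATALOGS = {
    "http://cdp.example/top.xml": """
<catalog xmlns:xlink="http://www.w3.org/1999/xlink">
  <dataset name="top" ID="top">
    <catalogRef xlink:href="http://remote.example/a.xml"/>
    <catalogRef xlink:href="http://remote.example/b.xml"/>
  </dataset>
</catalog>""",
    "http://remote.example/a.xml": """
<catalog xmlns:xlink="http://www.w3.org/1999/xlink">
  <dataset name="a item" ID="a.item1"/>
  <catalogRef xlink:href="http://cdp.example/top.xml"/>
</catalog>""",
    "http://remote.example/b.xml": """
<catalog>
  <dataset name="b item" ID="b.item1"/>
</catalog>""",
}


def crawl(url, fetch, seen=None):
    """Collect dataset IDs from a catalog and every catalog it references."""
    seen = set() if seen is None else seen
    if url in seen:  # guard against catalogs that reference each other
        return []
    seen.add(url)
    root = ET.fromstring(fetch(url))
    found = [d.get("ID") for d in root.iter("dataset") if d.get("ID")]
    for ref in root.iter("catalogRef"):
        found.extend(crawl(ref.get(XLINK_HREF), fetch, seen))
    return found


dataset_ids = crawl("http://cdp.example/top.xml", CATALOGS.__getitem__)
print(dataset_ids)
```

Passing the fetch function as a parameter keeps the traversal separate from transport, so the same walk could be run against real servers by swapping in an HTTP client.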
THREDDS, database application example
[Diagram: an external server hosts a database of arbitrary metadata. A database-to-THREDDS catalog builder (web service) produces a virtual catalog containing a service (external HTTP data service), metadata, and data-item datasets, each with access, size, extent. Data hosting is external.]
THREDDS, distributed data example
1. Data is not on the CDP; the service is external and can implement access control if required.
2. Descriptive metadata is in a separate file and does not have to be THREDDS.
[Diagram: a catalog with two services (external HTTP data service, MSS data service) and a container dataset. Its metadata (description, creator, geospatialCoverage, other elements) is supplied through external metadata references, one of them to ISO-19115 elements on an external server. The data-item datasets each carry access, size, extent; data hosting is external.]
CDP - distributed datasets, overview
[Diagram: the Community Data Portal, with its top-level THREDDS catalog, CDP data storage (WACCM, ACD, CME, CGD, ….), and metadata database, shown alongside the distributed sources it draws on: NCAR Data Support Section, NCAR Atmospheric Chemistry, NCAR EOL section, NCAR MSS (Mass Store), BADC (DIF records passed between an OAI server and an OAI client and converted with XSLT), Boston University (SRB), and LANL, ORNL, LBNL (SRM, ESG). Access control is marked on several links.]
Legend: T = THREDDS catalog, D = Data Archive, M = MSS deep archive, A = Access control
THREDDS review/summary
• THREDDS is a schema used for DATA DELIVERY
• Contains basic geoscience discovery data
• Is designed to work with distributed data, distributed metadata
• Contains elements for data access restriction
• Can work with real time data
• Can be a container for non-THREDDS descriptive metadata
• Defines the hierarchical relationships of datasets
• Defines data delivery services
• Supports a hierarchical view of metadata
• Integrated with many data delivery and visualization services
Distributed Descriptive Metadata with OAI
• Metadata is immediately “distributed” if metadata is contained in or is pointed to by THREDDS catalogs
• Metadata can also be shared using OAI technology
  – OAI -- Open Archives Initiative from the Digital Library (DL) community
  – OAI is a web service definition for sharing metadata
  – OAI uses six verbs to define the service
  – OAI uses Dublin Core (DC) as the baseline schema
  – OAI can specify other XML schemas -- we use this capability
  – OAI can be used as a gateway to send information to an established DL community -- THREDDS -> DC => DL community via OAI
• OAI disadvantage -- hierarchical relationships are lost
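For reference, the six verbs of the OAI protocol for metadata harvesting (OAI-PMH) are Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord, and oai_dc is the protocol's standard prefix for Dublin Core. The Python sketch below only builds request URLs and makes no network calls. The repository base URL, the record identifier, and the "dif" prefix are hypothetical (non-DC prefixes are chosen by each repository), and the helper name oai_request_url is ours.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL for one verb and its arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = {"verb": verb, **arguments}
    return base_url + "?" + urlencode(query)


BASE = "http://repository.example.org/oai"  # hypothetical repository

# Harvest all records in the baseline Dublin Core schema.
dc_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")

# Ask for one record in another schema; "dif" is an assumed prefix.
dif_url = oai_request_url(
    BASE, "GetRecord", identifier="example-id", metadataPrefix="dif"
)

print(dc_url)
print(dif_url)
```

A harvester would typically call ListMetadataFormats first to discover which prefixes a given repository supports.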
Distributed Metadata with OAI -- CDP
• THREDDS records are “flattened” (hierarchy collapsed): one record -> one dataset
• Flattened records are shared using OAI
• For a test, the THREDDS records were transformed into DIF using XSLT
• DIF records were ingested from BADC, transformed into THREDDS catalogs, and ingested into CDP search and browse
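The flattening step can be sketched in Python as follows. This illustrates the idea rather than the CDP implementation: the catalog, the creator values, and the function name flatten are hypothetical, and it assumes the slide's un-namespaced elements with inherit="true" marking metadata that child datasets receive.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: item1 inherits its creator, item2 overrides it.
CATALOG_XML = """
<catalog>
  <dataset name="abc" ID="ucar.scd.cdp.abc">
    <metadata inherit="true">
      <creator>Example Creator</creator>
    </metadata>
    <dataset name="item1" ID="ucar.scd.cdp.abc.item1"/>
    <dataset name="item2" ID="ucar.scd.cdp.abc.item2">
      <metadata inherit="true">
        <creator>Other Creator</creator>
      </metadata>
    </dataset>
  </dataset>
</catalog>
"""


def flatten(catalog_xml):
    """Collapse the dataset hierarchy into one flat record per dataset."""
    root = ET.fromstring(catalog_xml)
    records = []

    def visit(dataset, inherited):
        own = dict(inherited)        # metadata that applies to this dataset
        passed_on = dict(inherited)  # metadata that children will inherit
        for metadata in dataset.findall("metadata"):
            fields = {c.tag: (c.text or "").strip() for c in metadata}
            own.update(fields)
            if metadata.get("inherit") == "true":
                passed_on.update(fields)
        records.append({"ID": dataset.get("ID"), **own})
        for child in dataset.findall("dataset"):
            visit(child, passed_on)

    for top in root.findall("dataset"):
        visit(top, {})
    return records


records = flatten(CATALOG_XML)
for record in records:
    print(record)
```

Each record carries its resolved metadata but no link to its parent, which is the loss of hierarchical relationships that flattening implies.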
CDP metadata architecture
[Diagram, three stages. Catalog parsing: THREDDS catalogs and external metadata are parsed; an XML viewer web application and a THREDDS catalogs browser (web UI) present XML results. Metadata processing: parsed records are indexed into the metadata DB (Lucene) and queried through the free-text search query UI and the structured, geospatial, temporal query UI (web interface/web service). Metadata conversion: THREDDS records are read from and written to a metadata repository and converted to and from DIF and DC metadata; an OAI server and an OAI client export to and import from a remote data center or digital library.]
Data publication on the CDP
[Diagram components: dataset (disk, HTTP, database, …); THREDDS hierarchy metadata; THREDDS descriptive metadata; metadata authoring tool; catalog crawler application; metadata indexing application; Lucene index; XSLT rendering; CDP catalog presentation in the browser; access control link. Arrow labels: Edits, Starts, Allows, Creates, Is ingested, HTTP.]
Demo
• Data searching: controlled vocabularies, GEO searching
• Data browsing: access control
• BADC shared metadata directory
• Metadata editing
• IDV Bundle showing integrated data source
Experimental Topology to share data, GISC -> CDP
[Diagram components: WMO GISC and WMO DCPC, each with data, WMO metadata records, and a THREDDS catalog (NetCDF, GRIB, …); OAI endpoints on both sides, with XSLT converting between WMO and THREDDS records; CDP search, CDP DB, and an HTTP crawler.]
Legend: W = WMO metadata, T = THREDDS metadata, D = data
1. OAI transfers of WMO records
2. CDP crawls data hierarchy -- no metadata
3. GISC creates web interface to produce virtual THREDDS catalogs (embedded WMO descriptive metadata)
Experimental Topology to share data CDP->GISC
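[Diagram components: CDP with CDP search and a THREDDS catalog (NetCDF, GRIB, …); OAI endpoints on both sides, with XSLT converting between THREDDS and WMO records; WMO GISC with WMO search.]
Legend: W = WMO metadata, T = THREDDS metadata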
Questions?