The NERC DataGrid


Page 1: The NERC DataGrid

The NERC DataGrid

Bryan Lawrence, BADC

David Boyd, Deputy Director, CLRC e-Science Centre

Kerstin Kleese, DL: Climate Database Expert

Roy Lowry, BODC: Marine Database Expert

Dean Williams, PCMDI: ESG Principal Investigator

Bob Drach, PCMDI: ESG Metadata Architecture

Mike Fiorino, PCMDI: Meteorologist

Acronym Summary:

PCMDI: Program for Climate Model Diagnosis and Intercomparison (US Department of Energy, Lawrence Livermore National Laboratory)

ESG: Earth System Grid (US Grid Project: NCAR, Argonne, PCMDI, USC …)

Page 2: The NERC DataGrid

Outline

• Motivation
• The Earth System Grid
  – definitions of “portals” and applications
  – ontologies
• Relations with other NERC e-science programmes
• Architecture
  – querying
  – software stack
• Initial steps and project management
• Connectivity with other grid projects
• Success and failure
• Summary of what we are doing and the road to the future

Page 3: The NERC DataGrid

The BADC – part of NCAS!

The Role: Key words: Curation and Facilitation!

http://www.badc.rl.ac.uk

Page 4: The NERC DataGrid

Just under half of BADC users are NOT atmospheric scientists:

Page 5: The NERC DataGrid

Motivation – Town meeting 2001

E-science should be involved with:

• delivering an enhanced metadata record of archived data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
• the ability to acquire large datasets in near real time (NRT).
• the automatic production of metadata, both by models and, where possible, by observing systems.

Summary from two of the four working groups!

Page 6: The NERC DataGrid

Relevant to many stakeholders

Energy

Water Management

Food Chain

Health

Weather Risk

(Slide from Julia Slingo’s introduction to CGAM as part of NCAS)

Page 7: The NERC DataGrid

Motivation

Page 22:

NERC will … ensure that Earth system science is underpinned by e-science investments to enable access, manipulation … of data from diverse sources.

Page 8: The NERC DataGrid

The Data Use Chain

Discovery

Authentication

Authorisation

Extraction

Sub-Sampling

Regridding

Processing Display

Delivery

Formatting

Time-line

Page 9: The NERC DataGrid

NERC Metadata Gateway - SST

• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
  - multiple logins
  - multiple formats
  - discovery?

Page 10: The NERC DataGrid

Searching: need comprehensive metadata!

A priori would any user know to look in the COAPEC data set?

Earth system-science means we have to remove these boundaries!

• Detailed file-level metadata isn’t visible, so data-mining applications are impossible.

- Need ontologies to help queries match actual data descriptions.

NB: Dynamic catalogues!

Page 11: The NERC DataGrid

What is an Ontology?

An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:

• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have

Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.

Page 12: The NERC DataGrid

Ontology Example:

An example of part of an ontology defined using OIL (Ontology Inference Layer; e.g. see "OIL in a Nutshell", D. Fensel et al.)

With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin to build a thesaurus that can form the basis of one, and we do hope to spin off a project to build one and integrate it in the NDG.

Relationships and their properties:

    ontology-definitions
    slot-def eats
      inverse is-eaten-by
    slot-def has-part
      inverse is-part-of
      properties transitive

Classes:

    class-def animal
    class-def plant
      subclass-of NOT animal
    class-def tree
      subclass-of plant
    class-def branch
      slot-constraint is-part-of
        has-value tree
    class-def leaf
      slot-constraint is-part-of
        has-value branch
    class-def defined carnivore
      subclass-of animal
      slot-constraint eats
        value-type animal
    class-def defined herbivore
      subclass-of animal
      slot-constraint eats
        value-type plant OR (slot-constraint is-part-of has-value plant)
    class-def giraffe
      subclass-of animal
      slot-constraint eats
        value-type leaf
    class-def lion
      subclass-of animal
      slot-constraint eats
        value-type herbivore
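The kind of classification a reasoner could draw from the OIL definitions on this slide can be imitated in a few lines of Python. This is only a sketch: the dictionaries and function names are illustrative assumptions, not part of OIL or of any NDG software, and a real ontology language would reach these conclusions by logical inference rather than hand-written rules.

```python
# Hypothetical, hand-coded encoding of the OIL example (not an OIL parser).
SUBCLASS_OF = {"tree": "plant", "giraffe": "animal", "lion": "animal",
               "carnivore": "animal", "herbivore": "animal"}
PART_OF = {"branch": "tree", "leaf": "branch"}   # is-part-of is transitive
EATS = {"giraffe": "leaf", "lion": "herbivore"}  # value-type of the eats slot


def is_a(cls, target):
    """True if cls equals target or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def part_of_plant(cls):
    """True if cls is, transitively, part of something that is a plant."""
    whole = PART_OF.get(cls)
    while whole is not None:
        if is_a(whole, "plant"):
            return True
        whole = PART_OF.get(whole)
    return False


def is_herbivore(cls):
    """An animal whose food is a plant or part of a plant."""
    food = EATS.get(cls)
    return (food is not None and is_a(cls, "animal")
            and (is_a(food, "plant") or part_of_plant(food)))


def is_carnivore(cls):
    """An animal whose food is an animal."""
    food = EATS.get(cls)
    return food is not None and is_a(cls, "animal") and is_a(food, "animal")
```

With this encoding the giraffe is classified as a herbivore only because a leaf is part of a branch, which is part of a tree, which is a plant. A thesaurus-backed search would need the same sort of chained relationship to match a query term to the way data are actually described.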

Page 13: The NERC DataGrid

ESG: Example of a Web-based Data Portal

ESG will provide support for:

• large but simple data sets,
• limited metadata, but not searchable.

NDG will provide support for:

• small but complex datasets,
• data mining (searchable metadata).

NDG is complementary to ESG!

Page 14: The NERC DataGrid

Live Access Server (1)

… we will keep the basic structure, but gradually replace components.

Page 15: The NERC DataGrid

Live Access Server (2)

Data Request Structure:

Page 16: The NERC DataGrid

ESG: Example of a Client Application

We will:

• Provide Python-based classes for our observational data to complement the access to 3D gridded data.

• Provide a web services wrapper so that other grid applications can access NDG data.
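
A Python class for observational data might look something like the sketch below. The class name, fields and methods are hypothetical, chosen only to illustrate how a station time series could offer the kind of sub-setting that gridded-data tools provide; they are not the NDG or CDAT API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class StationTimeSeries:
    """Hypothetical container for observations from one fixed station."""
    station_id: str
    latitude: float
    longitude: float
    variable: str
    units: str
    samples: List[Tuple[datetime, float]]  # (time, value) pairs

    def subset(self, start: datetime, end: datetime) -> "StationTimeSeries":
        """Return a new series holding only samples with start <= time <= end."""
        kept = [(t, v) for t, v in self.samples if start <= t <= end]
        return StationTimeSeries(self.station_id, self.latitude,
                                 self.longitude, self.variable,
                                 self.units, kept)

    def mean(self) -> float:
        """Arithmetic mean of the sample values; raises ValueError if empty."""
        if not self.samples:
            raise ValueError("no samples")
        return sum(v for _, v in self.samples) / len(self.samples)
```

Keeping the coordinates and time stamps attached to the values addresses the complaint made earlier in the talk, that geospatial and time references are lost when data are pulled out of an archive.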

Page 17: The NERC DataGrid

Applications and Portals

[Diagram: the NERC Grid and the Wider Internet. Elements shown: BADC NDG Wrapper; BODC NDG Wrapper; Group NDG Wrapper; XML databases; Online Data; tape robot; Software Agent; Grid User; Internet User; Internet Links; NDG Web Portal; ESG (& other) Applications; Satellite; Supercomputer; Research Group Data Sources.]

Page 18: The NERC DataGrid

Relationship to GODIVA (Haines et al.)
(Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)

[Diagram: Architecture of the GODIVA Grid]

NDG will:

• improve data discovery tools for GODIVA (even for their own datasets).
• provide metadata creation tools for GODIVA participants.
• provide access to data held outside GODIVA participants.

GODIVA team have already discovered issues with the XML database interface they are going to use.

Page 19: The NERC DataGrid

ClimatePrediction.com

[Diagram. Labels shown: Scientific investigators; Participants & policy-makers; Summary statistics; 100Tb of key output at 10-20 sites; 1Pb total output on 1M participants’ PCs; ESG-II/NERC DataGrid; GridFTP; HTTP (DODS URL); Live Access Server; HTTP; Datamining; Peer-to-peer; visualisation; Conventional FTP/HTTP; Obs.]

CP.COM will need the NDG to make best use of observational data in evaluating their parameter space.

Page 20: The NERC DataGrid

Mining on the Grid

[Diagram: three Grid Mining Agents, each with a Grid Processor, working on Satellite Data Archive X and Satellite Data Archive Y.]

From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002

Page 21: The NERC DataGrid

Data mining: Grid Miner Architecture

[Diagram. Elements shown: IPG Mining Agents; IPG Processors; Mining Daemon; Control Database; Mining Operations Repository; Mining Config Info; Data Archive X; Satellite Data Archive Y.]

From Hinke’s NASA IPG presentation at CEOS, Rome, May 2002

The devil is in the detail: how does the data mining agent get at the data?

Need data mining clients – objects which can read specific datatypes and present themselves to agents!
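
One way to read the "data mining client" idea is that each client wraps a single datatype behind a common interface and registers itself, so that an agent can ask for a reader by datatype. The sketch below assumes this interpretation; the class names, the registry and the `records()` method are all invented for illustration and do not come from the IPG or NDG software.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Type

CLIENTS: Dict[str, Type["MiningClient"]] = {}  # hypothetical registry


def register(cls):
    """Class decorator: advertise a client under its datatype name."""
    CLIENTS[cls.datatype] = cls
    return cls


class MiningClient(ABC):
    """Common interface that every datatype-specific reader presents."""
    datatype: str = ""

    def __init__(self, source: str):
        self.source = source

    @abstractmethod
    def records(self) -> Iterator[dict]:
        """Yield one dict per record."""


@register
class CsvTextClient(MiningClient):
    """Toy client: reads 'name,value' lines from an in-memory string."""
    datatype = "csv-text"

    def records(self):
        for line in self.source.splitlines():
            name, value = line.split(",")
            yield {"name": name.strip(), "value": float(value)}


def open_for_agent(datatype: str, source: str) -> MiningClient:
    """What an agent would call: look up a client by datatype."""
    return CLIENTS[datatype](source)
```

The agent never needs to know the file format; supporting a new datatype means registering one more client class.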

Page 22: The NERC DataGrid

Finding data: Querying!

• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue (&/or database), and data (either in flat files or the database).
  – so the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
  – In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
• Major NDG design issue: developing appropriate data models, database schema and indexing strategies!
  – This is not a generic problem, it will be specific to our datatypes.
  – Technology needs to be public domain (i.e. free) for uptake!
  – NDG approach to database technology will be developed in conjunction with DBTF.
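
The two-stage strategy on this slide (a centralised query on a partial index, followed if necessary by distributed queries to the sites that hold the data) can be sketched as follows. The index layout, dataset names and functions are assumptions made for the example; the real NDG schema and indexing strategy were still to be designed.

```python
from typing import Callable, Dict, List

# Hypothetical central index: parameter name -> sites believed to hold it.
CENTRAL_INDEX: Dict[str, List[str]] = {
    "sea_surface_temperature": ["BADC", "BODC"],
    "ozone": ["BADC"],
}

# Hypothetical per-site catalogues: dataset name -> parameters it contains.
SITE_CATALOGUES: Dict[str, Dict[str, List[str]]] = {
    "BADC": {"dataset-a": ["sea_surface_temperature", "ozone"]},
    "BODC": {"dataset-b": ["sea_surface_temperature"],
             "dataset-c": ["salinity"]},
}


def local_search(site: str, parameter: str) -> List[str]:
    """Stand-in for a query mapped into one site's local schema."""
    catalogue = SITE_CATALOGUES[site]
    return [f"{site}:{name}" for name, params in sorted(catalogue.items())
            if parameter in params]


def find_datasets(parameter: str,
                  search: Callable[[str, str], List[str]] = local_search
                  ) -> List[str]:
    """Stage 1: consult the central index. Stage 2: query only listed sites."""
    hits: List[str] = []
    for site in CENTRAL_INDEX.get(parameter, []):
        hits.extend(search(site, parameter))
    return hits
```

Because the central index is only partial, a parameter that is held at a site but missing from the index ("salinity" here) is not found, which is one reason the choice of what to index is a major design issue.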

Page 23: The NERC DataGrid

Query Pathway; software components

[Flowchart: NERC DataGrid query pathway (BNL V1.01 - 12/01).

Layers labelled on the slide: Application Level; New Data Interfaces and Services; Existing and Required Grid Middleware; Existing Data and Services.

Discovery and Extraction Path:
• User Query (interfaces: NERC, international, generic)
• Query Distributor (Check Authentication); Parallel Queries
• Query Handler: "Dataset" Catalogue Search (Check Authorisation against "Looking")
• Collate Multiple Returns; Reformat Metadata
• Response: DataSet Metadata
• User Assessment; if inadequate, Generate Expansion Query (e.g. time and space)
• Query Distributor (Check Authorisation against "Locating"); Parallel Queries
• Query Handler: Granule Catalogue Search; Return Satisfactory Granule Metadata
• Potentially Interesting Data Exists? Continue to Extraction? (No: exit or return to previous step at this level)
• Check Authorisation for "Extraction" (Not OK: exit or return to previous step at this level)
• OK for extraction: Define Requirements for Sub-Sampling and Reformatting
• Extract Data File; Sub-Sample and Reformat
• Network Path and Cache Identification; Deliver Data to Processor(s) (and cache)
• User Processing, Display and/or Visualisation

Other paths and stores: Data Extraction Path for Known Datasets; New Model and Data Ingestion and Metadata Creation Interfaces; Data Path into Archive; Data and Metadata Archives.]
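
The flowchart checks authorisation separately for three activities: "Looking", "Locating" and "Extraction". A minimal sketch of that staged check follows, assuming a simple per-user permission set. The user names and data structures are hypothetical, and real authentication would sit in the grid middleware rather than in a dictionary.

```python
from enum import Enum
from typing import Dict, Optional, Set


class Stage(Enum):
    LOOKING = "looking"        # dataset catalogue search
    LOCATING = "locating"      # granule catalogue search
    EXTRACTION = "extraction"  # pulling the data files


# Hypothetical permission table: user -> stages allowed.
PERMISSIONS: Dict[str, Set[Stage]] = {
    "alice": {Stage.LOOKING, Stage.LOCATING, Stage.EXTRACTION},
    "bob": {Stage.LOOKING},
}


def furthest_stage(user: str) -> Optional[Stage]:
    """Walk the pathway in order and return the last stage the user may
    reach, or None if even dataset discovery is refused."""
    reached = None
    for stage in (Stage.LOOKING, Stage.LOCATING, Stage.EXTRACTION):
        if stage not in PERMISSIONS.get(user, set()):
            break  # "Not OK": exit or return to the previous step
        reached = stage
    return reached
```

Separating the checks lets a catalogue advertise that data exist to users who are not yet entitled to extract them.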

Page 24: The NERC DataGrid

NERC DataGrid Information Structure

[Diagram. Components shown:
• Inputs: Data Centre Data; RDBMS; Data Files; Raw Data; Docs; Descriptor files.
• Ingestors: Structured Ingestor (e.g. cdscan); Raw Ingestor (e.g. for PP & Grib data); Documentation Ingestor; RDBMS Ingestor; XML Ingestor.
• Catalogues: Catalogue XML; Granule Catalog; Data Catalogue database; Library Catalogue database.
• Services and interfaces: DMS (Data Manipulation System); DSS (Distributed Search System); Python API; Web Interface; GUI Interface.
• Key: PCMDI Components; NDG Components; Joint Interfaces; Information Structure; Existing Components.]

Page 25: The NERC DataGrid

Simplified Software Stack

Key point: make use of existing technology, allow component replacement with time!

Achievable by: interface definition and integration.

Note: Any application will be able to access our data services via the OGSA wrapper in the middleware.

[Diagram key: Existing ESG tools (but ANY application will be able to call NDG services); Globus Middleware Layer; New NDG Components; NDG Enhancements to existing ESG Components; Existing Data.]

Page 26: The NERC DataGrid

Software stack

[Diagram: NERC DataGrid Software Stack. Elements shown:
• GUI Application; Web Client
• Web Service/OGSA wrapper
• NERC DataGrid API (Python)
• Query Handler; Processing Options (Python Packages)
• Data Access Instantiation; XML Parsing; Access & Authorisation
• Object Orientated Class Definitions; XML Schema
• Network Transport Layer - Globus GridFTP/DODS
• DataBase API Libraries; Data File API Libraries; XML Data I/O
• Data Files, Databases; XML Descriptor Files
• Key: Existing ESG tools (but ANY application will be able to call NDG services); Globus Middleware Layer; New NDG Components; NDG Enhancements to existing ESG Components; Existing Data.]

Page 27: The NERC DataGrid

NDG: Ingestion Tasks

[Diagram: NERC DataGrid: BADC Data Ingestion (BNL 03/01/02). Elements shown: Data Files; Raw Data; Docs; Generate XML for Granule Catalog; Generate XML for DataSet Catalog; Generate XML for Library Catalog; Ingest Metadata, Relocate Files.]

Raw Data Input:
- dataset documentation
- binary data files
- possibly doc files with individual data files

Phase One: Produce "Self Describing Data" (e.g. NetCDF).
Phase Two: Generate XML Metadata.
Phase Three: Ingest Metadata into catalogues, and relocate files.

Normally desirable to directly ingest data already in self-describing format (along with additional documentation)!
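
Phase Two (generating XML metadata from self-describing data) might be sketched as below, using only the Python standard library. A plain dictionary stands in for the header of a self-describing file such as NetCDF, and the element names are invented for the example; they do not represent the NDG catalogue schema.

```python
import xml.etree.ElementTree as ET


def granule_xml(header: dict) -> str:
    """Build a small XML record from a file header (hypothetical layout)."""
    granule = ET.Element("granule", {"file": header["file"]})
    for name, attrs in header["variables"].items():
        var = ET.SubElement(granule, "variable", {"name": name})
        # One child element per attribute, in a stable (sorted) order.
        for key, value in sorted(attrs.items()):
            ET.SubElement(var, key).text = str(value)
    return ET.tostring(granule, encoding="unicode")


# Stand-in for what would be read from a self-describing file.
header = {
    "file": "example.nc",
    "variables": {
        "sst": {"units": "K", "long_name": "sea surface temperature"},
    },
}
print(granule_xml(header))
```

This is why Phase One matters: once the file describes itself, the metadata record can be produced automatically rather than typed in by hand.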

Page 28: The NERC DataGrid

Draft Project Schedule

Phase One Delivery

Page 29: The NERC DataGrid

Metadata Gateway

[Diagram: Existing NERC Metadata Gateway (BADC perspective), SJP 12/06/01, BNL 02/01/02. Elements shown:
• User Web Browser (browse www.nmp.rl.ac.uk); User FTP Interface
• Metadata Gateway (zgate); zserver; other zserver; Z39.50 links
• BADC SGML Metadata; isite index; other SGML Metadata
• BADC Metadata INGRES Catalogue; BADC Metadata Dynamic HTML Dataset Pages; BADC Data, Docs & Static Web Pages
• Hosts labelled: badc.rl.ac.uk; tornado.badc.rl.ac.uk
• Annotation: returns link to HTML pages]

NB: All metadata is at the dataset collection level. No info for individual data files or fields! No actual data is returned!

Page 30: The NERC DataGrid

NERC DataGrid: Phase One Architecture

[Diagram. Elements shown: User Web Browser; NERC DataGrid File Request Manager; PCMDI CDMS (LDAP registry); Live Access Server; BADC Disk Farms; BADC Tapes; New BADC Storage Environment; SRB MCAT; UM Data Files held in Uni Res. Grps; Ingres Metadata DB; Web Server Perl Scripts; NERC Metadata Gateway. Key: dataflow pathway; registry pathway; Existing BADC Technology. Annotation on a registry pathway: Replace with Globus Giggle?]

Datasets supported at phase one will be existing 3D data such as ECMWF and Met Office UM analyses at the BADC, and UM simulation data in university groups.

Phase one depends on the integration of existing technologies:

- SRB
- LDAP
- CDAT/CDMS
- XML cataloguing
- Live Access Server
- Cookies, and Unix authentication
- wrapping Z39.50 in WSDL (Zoom)?

along with a new request manager.

Next steps include:

• Replacing the transport layers in the metadata gateway with SOAP
• Replacing the SGML in the metadata gateway with XML

… etc

Page 31: The NERC DataGrid

Connectivity? Evolution! Innovation?

Plagiarism: Copying from one person

Research: Copying from many people

… we can’t afford to be too innovative!

[Diagram: the NERC DataGrid and related activities. Labels shown: ESG II; EU DataGrid WP9; UK DataBase Task Force; ClimatePrediction.com; U.S. Thredds/NOMADS; Digital Libraries (Zoom); Ontologies - NeSC - MyGrid; QinetiQ; CEOS; BNSC; CLRC e-science Data Portal; BADC; BODC; NEODC; Other DDC - CEH; PARADISE; GODIVA; ? Future ? Other Programmes.]

Page 32: The NERC DataGrid

Indicators of Success

Finding and making use of data:

• Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.

• No longer necessary to rely on personal contacts to locate and acquire data of interest if it’s held in the BADC/BODC.

• Key requirement for interdisciplinarity; the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.

• Other NERC designated data centres implementing NDG.

Take up by community:

• NDG software (but not necessarily graphics tools) in use in GODIVA project and in wider UK university community (including data repositories in research groups).

• Earth System Grid uses NDG components.

Page 33: The NERC DataGrid

Risks of Failure

• Someone else does it first – unlikely!
• Performance too slow for users!
  – More cache and replication
  – Improve database performance (UK DBTF!)
  – Data-compression layer for XML
  – Reduce scope and search depth (don’t want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
  – Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
  – Re-deploy existing staff where possible
  – Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK architecture
  – Close relationship with PCMDI means we will be able to proceed effectively anyway.

Page 34: The NERC DataGrid

NDG expected evolution

[Diagram with numbered stages 1 to 6. Elements shown: XML Catalogue Server; Data Repositories (NERC DDC; Other: e.g. PML/ESSC) holding Data Files; Local Catalogue; Catalogue Ingestor; Docs; Satellite; Python API; Catalogue Client; Computation; Graphics (based on LAS); Computation at USER institution; Evolving to OGSA.]

Page 35: The NERC DataGrid

Beyond the next three years: The NDG and earth systems science

Extension to the other NERC data centres requires:

– online (or near-line) data.
– appropriate ingestion tools, and appropriate mappings between discipline-specific metadata and generic metadata.
– GRID enabling data centres.
– decisions about policy and access.

Page 36: The NERC DataGrid

The NERC DataGrid

Bryan Lawrence, BADC

David Boyd, CLRC E-science

Kerstin Kleese, CLRC E-science

Roy Lowry, BODC

Dean Williams, PCMDI

Bob Drach, PCMDI

Mike Fiorino, PCMDI

Page 37: The NERC DataGrid

Project Management

• Weekly workgroup meetings (teleconference and physical).
• Milestoning code and documentation reviews at quarterly intervals.
• Quarterly liaison with both US colleagues and other NERC projects (GODIVA, ClimatePrediction.com etc).
• Bi-Annual target-reprofiling.
• Professional project management at the code level:
  – Both RAL SSTD and RAL e-Science have considerable experience managing and delivering large software projects.
• Two key tenets of management philosophy:
  – Build early, build often.
  – Evolve from a working system.

Page 38: The NERC DataGrid

The NDG: What will we do?

Key components: BADC/BODC

• Project Management.
• Ingestion tools for station data, Oracle database data, and other (e.g. PP; includes tools based on ESML and Marine XML).
• Format conversion tools within CDAT.
• Ingestion!
• Migrate NERC Metadata Gateway to WSDL/SOAP (Zoom?).

Key components: CLRC e-science

• Globus installation at all sites.
• Functional decomposition and interface definitions.
• Search database schema; search software Python API, wrappers.
• Database population. Logical to Physical File Manager.
• Amalgamating search API into LAS (or successor), VCDAT, metadata gateway.
• Add data retrieval interfaces into metadata gateway.