Data and the Grid
http://www.ogsadai.org.uk
Motivation
Entering an age of data
– Data explosion
  • CERN: LHC will generate 1GB/s = 10PB/y
  • VLBA (NRAO) generates 1GB/s today
  • Pixar generates 100 TB/movie
– Storage getting cheaper
Data stored in many different ways
– Data resources
  • Relational databases
  • XML databases
  • Flat files
Need ways to facilitate
– Data discovery
– Data access
– Data integration
Empower e-Business and e-Science
– The Grid is a vehicle for achieving this
Composing Observations in Astronomy
Data and images courtesy Alex Szalay, Johns Hopkins
Number and sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Cardiovascular Functional Genomics
[Figure: collaborating sites (Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands) with shared data and public curated data (US); labels: BRIDGES, IBM]
Business Intelligence and Customer Data
• Company wants real-time integrated view of customer buying behavior
• Data resides in various distributed CRM & ERP systems
• Grid allows developers and apps to access and integrate customer data sources together in real time, across many distributed databases
[Figure: a Data Grid over SAP, Oracle, Siebel and DB2 sources (customer data, customer order information) serving Customer Support and a web-based dashboard; the marketing department identifies likely buyers of a new product]
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
Providing Data to Cluster-Based Analytical Application
• Company has centralized HPC cluster running compute-intensive applications
• Source data for analysis distributed among 3 global sites, one of them an external partner
• Manual data-sharing processes increase costs/errors, and hinder time-to-results
• Grid enables secure, automatic provisioning of remote data to HPC cluster, feeding CPUs more data faster
[Figure: Data Grid nodes at R&D (West Coast), Testing (India) and Engineering (East Coast) feed forward proxy data caches of remote data at the Illinois headquarters, where analytical applications run on a centralized compute cluster]
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
Data Virtualisation
• Abstraction
• Federation
• Transformation
[Figure: a Client uses a client API to reach a Data Service; the Data Service API sits in front of other Data Services and “primitive” data sources]
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
Grand Challenges
Share and share alike!
Many challenges:
– Scalability, performance, heterogeneity, ownership, economics
– Common schema, data description and semantics, data formats, process and procedure, provenance
Can be solved only through collaboration and the sharing of:
– Ideas
– Efforts
– Resources
Perhaps most importantly: sharing of data
– Beware of data huggers!
Emerging Open Grid Infrastructures will
– Allow global collaboration
– Change the way that we can work
Three-way Alliance
Experiment & Advanced Data Collection → Shared Data
Theory, Models & Simulations → Shared Data
Computing Science: Systems, Notations & Formal Foundation → Process & Trust
Requires Much Engineering, Much Innovation
Changes Culture, New Mores, New Behaviours
New Opportunities, New Results, New Rewards
Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies
Emergence of Virtual Organisations
Data Requirements
What do we need for effective sharing of data?
– Structured, organised, annotated & curated data
– Computable data models
– Visualisation of data
– Data provenance
– Shared distributed systems
– Networked workplaces, instruments, data sources
– Metadata, ontologies, standards
– Authentication, authorisation, accounting, policies
Database Growth
Terabyte → Petabyte
| | Terabyte | Petabyte |
|---|---|---|
| RAM time to move | 15 minutes | 2 months |
| 1GB WAN move time | 10 hours ($1000) | 14 months ($1 million) |
| Disk cost | 7 disks = $5000 (SCSI) | 6800 disks + 490 units + 32 racks = $7 million |
| Disk power | 100 Watts | 100 Kilowatts |
| Disk weight | 5.6 Kg | 33 Tonnes |
| Disk footprint | Inside machine | 60 m² |

Approximately correct in May 2003. Source: Jim Gray, "Distributed Computing Economics", Microsoft Research, MSR-TR-2003-24
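The WAN figures in this comparison can be sanity-checked with simple arithmetic. The sketch below assumes that a petabyte is 1000 terabytes and that transfer time grows linearly with volume; the function name and constants are illustrative and do not come from the original slides.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 365.25 / 12  # average month length (assumption)


def scale_move_time(hours_per_terabyte: float, terabytes: float) -> float:
    """Return the transfer time in months, assuming linear scaling with volume."""
    total_hours = hours_per_terabyte * terabytes
    return total_hours / HOURS_PER_DAY / DAYS_PER_MONTH


# 10 hours per terabyte is the WAN figure quoted for May 2003.
wan_months = scale_move_time(hours_per_terabyte=10, terabytes=1000)
print(f"1 PB over the WAN: about {wan_months:.1f} months")
```

Under these assumptions the result rounds to the 14 months quoted for a petabyte, which is why moving data at this scale is treated as impractical in the slides that follow.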
Mohammed & Mountains
Petabytes of Data cannot be moved
– It stays where it is produced or curated
  • Hospitals, observatories, European Bioinformatics Institute
– A few caches and a small proportion cached
Distributed collaborating communities
– Expertise in curation, simulation & analysis
Diverse data collections
– Discovery depends on insights
– Unpredictable or unexpected use of data
Move computation to the data
Assumption: code size << data size
– Minimise data transport
Provision combined storage & compute resources
Develop the database philosophy for this?
– Enhanced stored procedures
– Pre-query analysis for more concise queries
– Mobile code sandbox
– Robustness
Develop the storage architecture for this?
– Compute closer to disk?
– System on a Chip using free space in the on-disk controller
Develop experiment, sensor & simulation architectures
– That take code to select and digest data as an output control
Data Cutter a step in this direction
– Sub-setting and aggregation of datasets using filters executed close to data
– http://www.cs.umd.edu/projects/hpsl/ResearchAreas/DataCutter.htm
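The idea of shipping a small filter to the data, rather than shipping the data to the analysis, can be illustrated with a toy model. This is a minimal sketch, not DataCutter's API: the `DataSite` class, its methods and the sample records are all hypothetical.

```python
from typing import Callable

Record = dict


class DataSite:
    """Hypothetical site that holds a dataset and can run small filters locally."""

    def __init__(self, records: list[Record]) -> None:
        self._records = records

    def fetch_all(self) -> list[Record]:
        # Moves every record to the caller: transport cost grows with data size.
        return list(self._records)

    def run_filter(self, keep: Callable[[Record], bool], field: str) -> float:
        # Runs the caller's filter next to the data and returns one aggregate.
        return sum(r[field] for r in self._records if keep(r))


site = DataSite([
    {"band": "xray", "flux": 2.0},
    {"band": "radio", "flux": 5.0},
    {"band": "xray", "flux": 3.0},
])

# Option 1: move the data, then compute.
moved = site.fetch_all()
total_remote = sum(r["flux"] for r in moved if r["band"] == "xray")

# Option 2: move the computation, receive only the result.
total_local = site.run_filter(lambda r: r["band"] == "xray", "flux")
```

Both options give the same answer, but the second transfers a short function and a single number instead of the whole collection, which is the trade the "code size << data size" assumption relies on.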
Meta-data: describing data
Choosing data sources
– How do you find them?
– How are they described and advertised?
– Is the equivalent of Google possible?
Meta-data is required describing:
– Structure of data
– Types of data
– Operations supported/available
– Access requirements
– Quality of service?
No established standards for heterogeneous data sources
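One way to picture the meta-data listed above is as a small descriptive record per data source, plus a lookup over a catalogue of such records. The sketch below is illustrative only: the field names, the `find_sources` helper and the sample catalogue are assumptions, not part of any standard mentioned in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    """Hypothetical meta-data record for one data source."""
    name: str
    structure: str                # e.g. "relational", "xml", "flat-file"
    data_types: tuple[str, ...]   # types of data held
    operations: tuple[str, ...]   # operations supported/available
    access_requirement: str       # what a user needs in order to connect


def find_sources(catalogue: list[SourceDescription],
                 structure: str, operation: str) -> list[str]:
    """Return the names of sources with the given structure and operation."""
    return [s.name for s in catalogue
            if s.structure == structure and operation in s.operations]


catalogue = [
    SourceDescription("sky_survey", "relational", ("float", "text"),
                      ("query",), "certificate"),
    SourceDescription("protein_docs", "xml", ("text",),
                      ("query", "update"), "password"),
    SourceDescription("sensor_logs", "flat-file", ("float",),
                      ("read",), "none"),
]

matches = find_sources(catalogue, structure="xml", operation="update")
```

A lookup like this only works if every source is described with the same vocabulary, which is the gap the last bullet points to.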
Cultural Challenges
Changing the way we work?
Publication and sharing of results
– Increased volume and diversity = increased opportunity?
– Allows independent validation of methods and derivatives
– Responsibility, ownership, credit, citation
Many distributed data resources
– Data collected from observation, simulation & experiment
– Independently owned & managed
  • No common goals or design
  • Work hard for agreements on foundation types and ontologies
  • Autonomous decisions change data, structure, policy, etc.
– Requires negotiations and patience!
Diversity
– No “one size fits all” solutions will work
Economic Challenges
Data production, publication & management
– Many researchers contributing increments of data
– Who pays for storage, transport, management and curation?
Data longevity
– Research requirements may outlive technical decisions
– Data does not preserve itself indefinitely!
Costs must be shared somehow…
Security Challenges: Medical Imaging Data
Diagnosing based on sensitive patient data
– Users: a (group of) doctor(s)
– Retrieve an image, run algorithm, examine result and write diagnosis, maybe re-run another algorithm.
Secure Data Retrieval
– Patient data is sensitive, needs to be stored anonymously at all times
– Site admins are not trustworthy – strip or encrypt patient data from image
– Replication of data not always allowed
High security needs
– Strong authorization
– Fine-grained access control mechanisms
– Leaking patient information results in prosecution.
Thanks to Dave Berry, Andrew Grimshaw – OGSA-WG
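The "strip or encrypt patient data from image" requirement can be sketched as a step that removes identifying fields from image meta-data before it reaches a storage site. This is only an illustration under stated assumptions: the field names are hypothetical, and replacing the identity with a keyed HMAC-SHA256 pseudonym is one possible approach, not a method described in the slides or a vetted anonymisation scheme.

```python
import hashlib
import hmac

# Hypothetical set of fields that identify a patient.
IDENTIFYING_FIELDS = {"patient_name", "date_of_birth", "hospital_number"}


def pseudonymise(metadata: dict[str, str], secret_key: bytes) -> dict[str, str]:
    """Drop identifying fields and add a keyed pseudonym in their place."""
    cleaned = {k: v for k, v in metadata.items() if k not in IDENTIFYING_FIELDS}
    identity = "|".join(metadata.get(k, "") for k in sorted(IDENTIFYING_FIELDS))
    # The key stays with the data owner, so a site admin cannot recompute it.
    cleaned["pseudonym"] = hmac.new(
        secret_key, identity.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return cleaned


record = {
    "patient_name": "Jane Example",
    "date_of_birth": "1970-01-01",
    "hospital_number": "H-0001",
    "modality": "MRI",
}
stored = pseudonymise(record, secret_key=b"owner-held-secret")
```

The stored record keeps the clinically useful field (`modality`) while the same patient always maps to the same pseudonym, so images can still be grouped without exposing who they belong to.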
Scientific Discovery
Obtaining access to that data
– Overcoming administrative barriers
– Overcoming technical barriers
Understanding that data
– The parts you care about for your research
Combining them using sophisticated models
– The picture of reality in your head
Analysis on scales required by statistics
– Coupling data access with computation
Repeated Processes
– Examining variations, covering a set of candidates
– Monitoring the emerging details
– Coupling with scientific workflows
DAIS Working Group
DAIS WG Goals
Provide service-based access to structured data resources as part of OGSA architecture
Specify a selection of interfaces tailored to various styles of data access starting with relational and XML
Interact well with other GGF OGSA specs
DAIS WG Non-Goals
No new common query language
No schema integration or common data model
No common namespace or naming scheme
No data resource management
– e.g. starting/stopping database managers
No push based delivery
– Information Dissemination WG?
DAIS View Of Data Services Model
[Figure: Consumer, Data Service and Data Resource linked by associations with 0-* multiplicities. Note on the figure: "This structure is not exposed through the Data Service interface to the Consumer."]
A Data Service presents a Consumer with an interface to a Data Resource.
A Data Resource can have arbitrary complexity, for example, a file on an NFS mounted file system or a federation of relational databases.
A Consumer is not typically exposed to this complexity and operates within the bounds and semantics of the interface provided by the Data Service.
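The separation between Consumer, Data Service and Data Resource can be sketched as an interface with interchangeable implementations. The class and method names below are hypothetical and are not taken from the DAIS specifications; the file resource is modelled as an in-memory list of lines to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class DataService(ABC):
    """Interface seen by a consumer; hides how the resource is built."""

    @abstractmethod
    def query(self, term: str) -> list[str]:
        """Return the records that contain the search term."""


class FileDataService(DataService):
    """Simple resource: a single file, modelled here as a list of lines."""

    def __init__(self, lines: list[str]) -> None:
        self._lines = lines

    def query(self, term: str) -> list[str]:
        return [line for line in self._lines if term in line]


class FederatedDataService(DataService):
    """Complex resource: a federation of other data services."""

    def __init__(self, members: list[DataService]) -> None:
        self._members = members

    def query(self, term: str) -> list[str]:
        results: list[str] = []
        for member in self._members:
            results.extend(member.query(term))
        return results


def count_matches(service: DataService, term: str) -> int:
    """Consumer code: works only through the interface."""
    return len(service.query(term))


single = FileDataService(["gene A heart", "gene B liver"])
federation = FederatedDataService([single, FileDataService(["gene C heart"])])
```

`count_matches` runs unchanged against `single` and `federation`, which is the point of the model: the consumer stays within the interface and never sees whether one file or several databases sit behind it.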
Specification Names
Web Services Data Access and Integration (WS-DAI)
– The specification formerly known as the Grid Data Service Specification
– A paradigm-neutral specification of descriptive and operational features of services for accessing data
The WS-DAI Realisations
– WS-DAIR: for relational databases
– WS-DAIX: for XML repositories
DAIS Specification Landscape
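[Figure: specification landscape. Documents: OGSA Data Services, WS-DAI, WS-DAIX, WS-DAIR, Scenarios for Mapping DAIS Concepts. Relationships shown: "Is Informed By", "Extend". Document types shown: GWD-I, GWD-R.]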
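DAIS and Other Standards/Specs
[Figure: DAIS in relation to other groups and specifications.
GGF: areas Arch, Data, ISP, SRM; groups and specs OGSA, CMM, GIR, GSM, GFS, DAIS, CGS, GRAAP, Policy, INFOD, GridFTP, DT, OREP, DFDL, TMBoF, ADFBoF.
Other standards bodies: IETF (SNMP), W3C (XQuery), ANSI (SQL), DMTF (CIM), OASIS (WS-DM, WS-RF, WS-N), JCP (JDBC).
Also shown: WSPolicy, WSAddress, and a body marked "????".]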
DAIS Data Access
[Figure: a Consumer sends SQLExecute ( SQLExpression ) to the SQLAccess interface of a Database Data Service, which fronts a Relational Database, and receives an SQLResponse.]
SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.
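The direct access pattern on this slide, where a consumer passes an SQL expression to a data service and gets the response back in the same call, can be sketched with Python's built-in `sqlite3` module. This is an illustration only: the class and method names are hypothetical stand-ins for the SQLAccess interface, not the WS-DAIR message definitions, and an in-memory SQLite database plays the role of the relational database.

```python
import sqlite3


class SQLAccessService:
    """Hypothetical data service that wraps one relational database."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def sql_description(self) -> dict[str, bool]:
        # A small subset of the descriptive properties named on the slide.
        return {"Readable": True, "Writeable": True}

    def sql_execute(self, sql_expression: str) -> list[tuple]:
        # Run the expression and return the whole response to the consumer.
        with self._connection:  # commits on success, rolls back on error
            return self._connection.execute(sql_expression).fetchall()


connection = sqlite3.connect(":memory:")
service = SQLAccessService(connection)
service.sql_execute("CREATE TABLE customer (name TEXT, orders INTEGER)")
service.sql_execute("INSERT INTO customer VALUES ('Ada', 3), ('Bob', 1)")
response = service.sql_execute("SELECT name FROM customer WHERE orders > 2")
```

The consumer never holds a database connection; it only sees the service's two operations, which mirrors the description/access split on the slide.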
DAIS Derived Data Access
[Figure: a Consumer sends SQLExecuteFactory ( SQLExpression, BehaviouralProperties ) to the SQLFactory interface of a Database Data Service, which fronts a Relational Database. An RDBMS specific mechanism for generating the result set produces a Row Set, exposed by an SQL Response Data Service (SQLResponseDescription, SQLResponseAccess). The Consumer receives a reference to the SQLResponse Data Service and calls GetRowset ( rowsetnumber ) to obtain a Rowset.]
SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.
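The derived access pattern differs from direct access in that the query returns a reference to a new service holding the result, which the consumer then reads in pieces. The sketch below models that factory step with `sqlite3`; the class names, the `rowset_size` parameter and the zero-based rowset numbering are assumptions made for illustration, not details taken from the specification.

```python
import sqlite3


class SQLResponseService:
    """Hypothetical derived service that exposes a stored result set."""

    def __init__(self, rows: list[tuple], rowset_size: int) -> None:
        self._rows = rows
        self._rowset_size = rowset_size

    def get_rowset(self, rowset_number: int) -> list[tuple]:
        # Rowsets are numbered from 0 in this sketch (an assumption).
        start = rowset_number * self._rowset_size
        return self._rows[start:start + self._rowset_size]


class SQLFactoryService:
    """Hypothetical data service that creates response services on request."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def sql_execute_factory(self, sql_expression: str,
                            rowset_size: int) -> SQLResponseService:
        # Run the query once, keep the result, hand back a reference to it.
        rows = self._connection.execute(sql_expression).fetchall()
        return SQLResponseService(rows, rowset_size)


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE obs (id INTEGER)")
connection.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(5)])

factory = SQLFactoryService(connection)
response_service = factory.sql_execute_factory(
    "SELECT id FROM obs ORDER BY id", rowset_size=2
)
first = response_service.get_rowset(0)
last = response_service.get_rowset(2)
```

Because the result lives behind its own service, the reference could be handed to a third party, which then pulls rowsets without re-running the query.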
Usage Patterns
[Figure: three usage patterns (Retrieve, Update/Insert, Pipeline) drawn as call, response and data flow arrows between actors.]
Actors: A - Analyst, C - Consumer, G - GDS, P - Producer (OGSI and non-OGSI processes are drawn with different symbols)
Messages: Q - Query, D - Delivery, S - Status, R - Result, U - Update, I - Data id
The story so far
Technology enables Grids and more data
Distributed systems for sharing information
– Essential, ubiquitous & challenging
– Therefore share methods and technology as much as possible
Collaboration is essential
– Combining approaches
– Combining skills
– Sharing resources
Structured Data is the language of Collaboration
– Data Access & Integration a Ubiquitous Requirement
– Primary data, metadata, administrative & system data
Many hard technical challenges
– Scale, heterogeneity, distribution, dynamic variation
– Intimate combinations of data and computation with autonomous development of both