Data Challenges in e-Science
Aberdeen
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk
2nd December 2003
What is e-Science?
Foundation for e-Science
[Diagram: the Grid linking sensor nets, shared data archives, computers, software, colleagues and instruments]
e-Science methodologies will rapidly transform science, engineering, medicine and business
driven by exponential growth (×1000/decade) enabling a whole-system approach
Diagram derived from Ian Foster’s slide
Three-way Alliance
Computing Science: Systems, Notations & Formal Foundation → Process & Trust
Theory: Models & Simulations → Shared Data
Experiment & Advanced Data Collection → Shared Data
Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies
Requires Much Engineering, Much Innovation
Changes Culture, New Mores, New Behaviours
New Opportunities, New Results, New Rewards
Biochemical Pathway Simulator
Closing the information loop – between lab and computational model.
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
DTI Bioscience Beacon Project Harnessing Genomics Programme
Slide from Professor Muffy Calder, Glasgow
Why is Data Important?
Data as Evidence – all disciplines
Hypothesis, Curiosity or … → Collections of Data → Analysis & Models → Derived & Synthesised Data → Information, knowledge, decisions & designs
Driven by creativity, imagination and perspiration
Data as Evidence - Historically
Individual’s idea (Hypothesis, Curiosity or …) → Personal collection (Collections of Data) → Personal effort (Analysis & Models) → Lab Notebook (Derived & Synthesised Data) → Information, knowledge, decisions & designs
Driven by creativity, imagination, perspiration & personal resources
Data as Evidence – Enterprise Scale
Agreed Hypothesis or Goals (Hypothesis, Curiosity, Business Goals) → Enterprise databases & (archive) file stores (Collections of Digital Data) → Data Production Pipelines (Analysis & Computational Models) → Data Products & Results (Derived & Synthesised Data) → Information, knowledge, decisions & designs
Driven by creativity, imagination, perspiration & company’s resources
Data as Evidence – e-Science
Communities and Challenges (Shared Goals, Multiple hypotheses) → Multi-enterprise & Public Curation (Collections of Published & Private Data) → Multi-enterprise Models, Computation & Workflow, with Synthesis from Multiple Sources (Analysis, Computation, Annotation) → Shared Data Products & Results (Derived & Synthesised Data) → Information, knowledge, decisions & designs
Driven by creativity, imagination, perspiration & shared resources
Global in-flight engine diagnostics
[Diagram: in-flight data flows over a global network (e.g. SITA) to a ground station, then via internet, e-mail and pager to the airline, the maintenance centre and the DS&S Engine Health Center data centre]
Distributed Aircraft Maintenance Environment (DAME): Universities of Leeds, Oxford, Sheffield & York
100,000 aircraft, 0.5 GB/flight, 4 flights/day → ~200 TB/day
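As a quick sanity check of these figures, a minimal sketch (the constants are the slide’s own, the script is only illustrative):

```python
# Back-of-envelope check of the DAME data volumes quoted above.
aircraft = 100_000          # fleet size (slide's figure)
gb_per_flight = 0.5         # GB of engine data per flight
flights_per_day = 4

gb_per_day = aircraft * gb_per_flight * flights_per_day
print(f"{gb_per_day:,.0f} GB/day ≈ {gb_per_day / 1000:,.0f} TB/day")
# -> 200,000 GB/day ≈ 200 TB/day, matching the slide
```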
LHC Distributed Simulation & Analysis
[Diagram: MONARC-style tiered computing model]
Tier 0: CERN Computer Centre (>20 TIPS); Online System → Offline Farm (~20 TIPS) at ~100 MBytes/sec; ~PBytes/sec from the detector
Tier 1: Regional Centres – RAL, US, French and Italian Regional Centres – fed at ~Gbits/sec or by air freight
Tier 2: Centres of ~1 TIPS each (e.g. ScotGRID++ ~1 TIPS), connected at ~Gbits/sec
Tier 3: Institutes (~0.25 TIPS each), connected at 100–1000 Mbits/sec; physics data cache
Tier 4: Workstations
One bunch crossing per 25 ns; 100 triggers per second; each event is ~1 Mbyte
Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
1 TIPS = 25,000 SpecInt95; PC (1999) = ~15 SpecInt95
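A quick check that the quoted rates and capacities are self-consistent (a sketch using only the slide’s figures; the PC-equivalent count is derived, not stated on the slide):

```python
# Sanity checks on the LHC numbers quoted above.
triggers_per_second = 100   # slide: 100 triggers per second
event_size_mb = 1.0         # slide: each event is ~1 Mbyte

print(f"Trigger data rate: {triggers_per_second * event_size_mb:.0f} MB/s")
# -> 100 MB/s, matching the ~100 MBytes/sec links out of the online system

specint95_per_tips = 25_000     # slide: 1 TIPS = 25,000 SpecInt95
specint95_per_pc_1999 = 15      # slide: PC (1999) = ~15 SpecInt95
print(f"1999 PCs per TIPS: {specint95_per_tips / specint95_per_pc_1999:,.0f}")
# -> ~1,667 PCs of 1999 vintage to deliver 1 TIPS
```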
DataGrid Testbed
[Map: >40 testbed sites across Europe – HEP sites and ESA sites]
Dubna, Moscow, RAL, Lund, Lisboa, Santander, Madrid, Valencia, Barcelona, Paris, Berlin, Lyon, Grenoble, Marseille, Brno, Prague, Torino, Milano, BO-CNAF, PD-LNL, Pisa, Roma, Catania, ESRIN, CERN, IPSL, Estec, KNMI
Testbed Sites
[Diagram: several copies of the e-Science evidence cycle – Shared Goals & Multiple Hypotheses, Collections of Published & Private Data, Analysis, Computation & Annotation, Derived & Synthesised Data – overlaid on one another]
Multiple overlapping communities
Supported by common standards & shared infrastructure
Life-science Examples
Database Growth; PDB Content Growth
Bases: 45,356,382,990
Wellcome Trust: Cardiovascular Functional Genomics
Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands
Shared data; public curated data
BRIDGES project, with IBM
Depends on building & maintaining security, privacy & trust
Comparative Functional Genomics
Large amounts of data; highly heterogeneous (data types, data forms, community); highly complex and inter-related; volatile
myGrid Project: Carole Goble, University of Manchester
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
Community = 1000s of home computer users + philanthropic computing vendor (Entropia) + research group (Scripps)
Common goal = advance AIDS research
Home Computers Evaluate AIDS Drugs
From Steve Tuecke 12 Oct. 01
Astronomy Examples
Global Knowledge Communities driven by Data: e.g., Astronomy
Numbers & sizes of data sets as of mid-2002, grouped by wavelength:
• 12-waveband coverage of large areas of the sky
• Total about 200 TB of data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
Sloan Digital Sky Survey Production System
Slide from Ian Foster’s SSDBM 03 keynote
Supernova Cosmology Requires Complex, Widely Distributed Workflow Management
Engineering Examples
Whole-system simulations
landing gear models: braking performance, steering capabilities, traction, dampening capabilities
wing models: lift capabilities, drag capabilities, responsiveness
stabilizer models & airframe models: deflection capabilities, responsiveness
human models (crew capabilities): accuracy, perception, stamina, reaction times, SOPs
engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption
NASA Information Power Grid: coupling all sub-system simulations – slide from Bill Johnson
Mathematicians Solve NUG30
Looking for the solution to the NUG30 quadratic assignment problem, an informal collaboration of mathematicians and computer scientists. Condor-G delivered 3.46E8 CPU-seconds in 7 days (peak 1009 processors, an average of roughly 570 over the week) across 8 sites in the U.S. and Italy.
Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin. From Miron Livny, 7 Aug. 01
Network for Earthquake Engineering Simulation
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other. On-demand access to experiments, data streams, computing, archives, collaboration.
NEESgrid: Argonne, Michigan, NCSA, UIUC, USC. From Steve Tuecke, 12 Oct. 01
National Airspace Simulation Environment
NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to get a Virtual National Air Space (VNAS)
[Diagram: simulation drivers feeding the VNAS]
GRC: engine models; LaRC: airframe models, landing gear models; ARC: wing models, stabilizer models, human models
Inputs: FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data
22,000 commercial US flights a day drive: 50,000 engine runs, 22,000 airframe impact runs, 132,000 landing/take-off gear runs, 48,000 human crew runs, 66,000 stabilizer runs, 44,000 wing runs
Data Challenges
Derived from Ian Foster’s slide at SSDBM, July 03
It’s Easy to Forget How Different 2003 is From 1993
Enormous quantities of data: Petabytes; for an increasing number of communities the gating step is not collection but analysis
Ubiquitous Internet: >100 million hosts; collaboration & resource sharing the norm; security and trust are crucial issues
Ultra-high-speed networks: >10 Gb/s; global optical networks; bottlenecks are the last kilometre & firewalls
Huge quantities of computing: >100 Top/s; Moore’s law gives us all supercomputers; organising their effective use is the challenge
Moore’s law everywhere: instruments, detectors, sensors, scanners, …; organising their effective use is the challenge
Tera → Peta Bytes
                      1 Terabyte                 1 Petabyte
RAM time to move      15 minutes                 2 months
1 Gb WAN move time    10 hours ($1,000)          14 months ($1 million)
Disk cost             7 disks = $5,000 (SCSI)    6,800 disks + 490 units + 32 racks = $7 million
Disk power            100 watts                  100 kilowatts
Disk weight           5.6 kg                     33 tonnes
Disk footprint        inside machine             60 m²
May 2003, approximately correct
See also Distributed Computing Economics Jim Gray, Microsoft Research, MSR-TR-2003-24
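The move-time rows can be roughly reconstructed from the 10-hour figure alone. A minimal sketch, assuming the “1 Gb” WAN delivers the effective throughput implied by that figure rather than its nominal line rate (an assumption, not stated on the slide):

```python
# Rough reconstruction of the move-time arithmetic in the table above.
def wan_move_time_hours(bytes_to_move: float, effective_gbps: float) -> float:
    """Hours to move `bytes_to_move` at `effective_gbps` (10^9 bits/s)."""
    return bytes_to_move * 8 / (effective_gbps * 1e9) / 3600

TB, PB = 1e12, 1e15
# Back out the effective throughput from the slide's "1 TB in 10 hours".
eff = TB * 8 / (10 * 3600) / 1e9
print(f"Effective throughput: {eff:.2f} Gb/s")          # ~0.22 Gb/s
print(f"1 TB: {wan_move_time_hours(TB, eff):.0f} hours") # 10 hours
print(f"1 PB: {wan_move_time_hours(PB, eff) / 24 / 30:.0f} months")
# -> ~14 months, consistent with the table
```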
The Story so Far
Technology enables Grids, More Data & …; Information Grids will dominate; collaboration is essential
Combining approaches, combining skills, sharing resources
(Structured) Data is the language of Collaboration
Data Access & Integration is a Ubiquitous Requirement
Many hard technical challenges: scale, heterogeneity, distribution, dynamic variation
Intimate combinations of data and computation, with unpredictable (autonomous) development of both
Scientific Data: Opportunities and Challenges
Opportunities: global production of published data – volume, diversity, combination, analysis, discovery
Challenges: data huggers; meagre metadata; ease of use; optimised integration; dependability
Opportunities: specialised indexing; new data organisation; new algorithms; varied replication; shared annotation; intensive data & computation
Challenges: fundamental principles; approximate matching; multi-scale optimisation; autonomous change; legacy structures; scale and longevity; privacy and mobility
UK e-Science
e-Science and the Grid
‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
‘e-Science will change the dynamic of the way science is undertaken.’
John Taylor, Director General of Research Councils, Office of Science and Technology
From presentation by Tony Hey
e-Science Programme’s Vision
UK will lead in the exploitation of e-Infrastructure
New, faster and better research: engineering design, medical diagnosis, decision support, …; e-Business, e-Research, e-Design & e-Decision
Depends on leading e-Infrastructure development & deployment
e-Science and SR2002
Research Council         2004-6     (2001-4)
Medical                  £13.1M     (£8M)
Biological               £10.0M     (£8M)
Environmental            £8.0M      (£7M)
Eng & Phys               £18.0M     (£17M)
HPC                      £2.5M      (£9M)
Core Prog.               £16.2M     (£15M) + £20M
Particle Phys & Astro    £31.6M     (£26M)
Economic & Social        £10.6M     (£3M)
Central Labs             £5.0M      (£5M)
www.nesc.ac.uk
National e-Science Centre
HPC(x)
National e-Science Institute; international relationships; Engineering Task Force; Grid Support Centre; Architecture Task Force; OGSA-DAI; one of 11 Centre projects; GridNet to support standards work; one of 6 administration projects; training team; 5 application projects; 15 “fundamental” research projects; EGEE
Globus Alliance
NeSI in Edinburgh
National e-Science Centre
NeSI Events held in the 2nd Year (from 1 Aug 2002 to 31 Jul 2003)
We have had 86 events (Year 1 figures in brackets):
11 project meetings (4); 11 research meetings (7); 25 workshops (17 + 1); 2 “summer” schools (0); 15 training sessions (8); 12 outreach events (3); 5 international meetings (1); 5 e-Science management meetings (7)
(though the definitions are fuzzy!)
> 3600 participant days
Suggestions always welcome
Establishing a training team
Investing in community building, skill generation & knowledge development
NeSI Workshops
Space for real work; crossing communities; creativity: new strategies and solutions; written reports
Scientific Data Mining, Integration and Visualisation; Grid Information Systems; Portals and Portlets; Virtual Observatory as a Data Grid; Imaging, Medical Analysis and Grid Environments; Grid Scheduling; Provenance & Workflow; GeoSciences & Scottish Bioinformatics Forum
http://www.nesc.ac.uk/events/
E-Infrastructure
[Diagram: layered infrastructure architecture]
Data Intensive Users
Data Intensive Applications for Application area X
Simulation, Analysis & Integration Technology for Application area X
OGSA
OGSI: Interface to Grid Infrastructure
Distributed Compute, Data & Storage Resources
Virtual Integration Architecture
Generic Virtual Data Access and Integration Layer
Structured Data Integration; Structured Data Access; Structured Data: Relational, XML, Semi-structured
Supporting services: Transformation, Registry, Job Submission, Data Transport, Resource Usage, Banking, Brokering, Workflow, Authorisation
1a. Request to Registry for sources of data about “x”
1b. Registry responds with Factory handle
2a. Request to Factory for access to database
2b. Factory creates GridDataService (GDS) to manage access
2c. Factory returns handle of GDS to client
3a. Client queries GDS with XPath, SQL, etc.
3b. GDS interacts with database
3c. Results of query returned to client as XML
(SOAP/HTTP; service creation; API interactions)
Components: Client, Registry, Factory, Grid Data Service, XML / relational database
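A minimal sketch of this registry → factory → GDS flow, in Python. The classes and method names here are hypothetical stand-ins that mirror steps 1a–3c, not the real OGSA-DAI client API:

```python
# Illustrative mock of the single-source Grid Data Service interaction.
class Registry:
    def __init__(self, factories):          # {topic: Factory}
        self.factories = factories
    def find_sources(self, topic):           # 1a/1b: respond with Factory handle
        return self.factories[topic]

class Factory:
    def __init__(self, database):
        self.database = database
    def create_gds(self):                     # 2a/2b/2c: create a GDS, return handle
        return GridDataService(self.database)

class GridDataService:
    def __init__(self, database):
        self.database = database
    def perform(self, query):                 # 3a: accept a query (SQL, XPath, ...)
        rows = self.database.get(query, [])   # 3b: GDS talks to the database
        return "<results>" + "".join(f"<row>{r}</row>" for r in rows) + "</results>"

# Client side, following the numbered steps:
db = {"SELECT * FROM x": ["x1", "x2"]}       # toy stand-in for a real database
registry = Registry({"x": Factory(db)})
factory = registry.find_sources("x")          # 1a, 1b
gds = factory.create_gds()                    # 2a, 2b, 2c
print(gds.perform("SELECT * FROM x"))         # 3a, 3b, 3c: results return as XML
```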
Data Access & Integration Services
[Diagram: a network of Grid Data Services (GDS1–GDS3) and Grid Data Transport Services (GDTS1, GDTS2) spanning resources Sx and Sy]
1a. Request to Registry for sources of data about “x” & “y”
1b. Registry responds with Factory handle
2a. Request to Factory for access and integration from resources Sx and Sy
2b. Factory creates a network of GridDataServices
2c. Factory returns handle of GDS to client
3a. Client submits a sequence of scripts, each with a set of queries to the GDS, with XPath, SQL, etc.
3b. Client tells analyst
3c. Sequences of result sets returned to the analyst as formatted binary described in a standard XML notation
(SOAP/HTTP; service creation; API interactions)
Components: Analyst, Client, Data Registry, Data Access & Integration master, XML database, relational database
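A sketch of the multi-source variant above: a master data access & integration service fans each script out to per-resource services and merges the result sets. All names are illustrative, and the “queries” are plain predicates rather than XPath or SQL:

```python
# Illustrative mock of multi-source access and integration over Sx and Sy.
def gds_query(resource, query):
    """Stand-in for one Grid Data Service querying its own resource (3b)."""
    return [row for row in resource if query(row)]

def integration_master(resources, script):
    """Run each query in the script against every resource and merge (2b, 3a, 3c)."""
    results = []
    for query in script:
        merged = []
        for resource in resources.values():
            merged.extend(gds_query(resource, query))
        results.append(merged)
    return results

Sx = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]   # toy relational-style resource
Sy = [{"id": 1, "y": "a"}, {"id": 3, "y": "b"}] # toy XML-style resource
script = [lambda row: row.get("id") == 1]       # one query; a real script is a sequence
print(integration_master({"Sx": Sx, "Sy": Sy}, script))
# -> [[{'id': 1, 'x': 10}, {'id': 1, 'y': 'a'}]]
```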
Future DAI Services?
[Diagram: a Problem Solving Environment and Semantic Metadata support “scientific” application coding, producing Application Code and scientific insights]
Integration is our Focus
Supporting Collaboration: bring together disciplines; bring together people engaged in a shared challenge; inject initial energy; invent methods that work
Supporting Collaborative Research: integrate compute, storage and communications; deliver and sustain an integrated software stack; operate dependable infrastructure services; integrate multiple data sources; integrate data and computation; integrate experiment with simulation; integrate visualisation and analysis
High-level tools and automation are essential; fundamental research as a foundation
Take Home Message
Data is a Major Source of Challenges AND an Enabler of New Science, Engineering, Medicine, Planning, …
Information Grids: support for collaboration; support for computation and data grids; structured data is fundamental; integrated strategies & technologies needed
E-Infrastructure is here – more to do, technically & socio-economically
Join in – explore the potential – develop the methods & standards
NeSC would like to help you develop e-Science; we seek suggestions and collaboration