Data- and Compute-Driven Transformation of Modern Science
Edward Seidel
Assistant Director, Mathematical and Physical Sciences, NSF
(Director, Office of Cyberinfrastructure)
Profound Transformation of Science
Gravitational Physics
Galileo, Newton usher in birth of modern science: c. 1600
Problem: single “particle” (apple) in gravitational field (general 2-body problem already too hard)
Methods
Data: notebooks (Kbytes)
Theory: driven by data
Computation: calculus by hand (1 Flop/s)
Collaboration
1 brilliant scientist, 1-2 students
Profound Transformation of Science
Collision of Two Black Holes
• Science Result
• The “Pair of Pants”
• Year: 1994
• Team size
• ~ 10
• Data produced
• ~ 50Mbytes
• Science Result
• The “Pair of Pants”
• Year: 1972
• Team size
• 1 person (S. Hawking)
• Computation
• Flop/s
• Data produced
• ~ Kbytes (text, hand-drawn sketch)
400 years later…same!
Move to 3D: 1000x more data!
• 3D Collision
• Science Result
• Year: 1998
• Team size
• ~ 15
• Data produced
• ~ 50Gbytes
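The "1000x more data" figure for the move to 3D can be checked against the two data volumes quoted on the slides. This is a minimal sketch, assuming decimal units (1 Mbyte = 10^6 bytes, 1 Gbyte = 10^9 bytes); the dictionary keys and the `growth_factor` helper are illustrative names, not part of the talk.

```python
# Approximate data volumes quoted on the slides, in bytes.
# Assumption: decimal units (1 Mbyte = 1e6 bytes, 1 Gbyte = 1e9 bytes).
MBYTE = 10**6
GBYTE = 10**9

data_produced = {
    "1994 collision": 50 * MBYTE,
    "1998 3D collision": 50 * GBYTE,
}

def growth_factor(before_bytes, after_bytes):
    """Return how many times larger the later data set is."""
    return after_bytes / before_bytes

factor = growth_factor(
    data_produced["1994 collision"],
    data_produced["1998 3D collision"],
)
print(f"1994 -> 1998 growth: {factor:.0f}x")
```

Under these assumptions the ratio comes out to 1000, matching the slide.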
Just ahead: Complexity of Universe
LHC, Gamma-ray bursts!
Gamma-ray bursts!
GR now soluble: complex problems in relativistic astrophysics
All energy emitted in lifetime of sun bursts out in a few seconds: what are they?! Colliding BH-NS? SN?
GR, hydrodynamics, nuclear physics, radiation transport, neutrinos, magnetic fields: globally distributed collab!
Scalable algorithms, complex AMR codes, viz, PFlops*week, PB output!
LHC: Higgs particle?
~10K scientists, 33+ countries, 25 PB
Planetary lab for scientific discovery!
Remote Instrument
Grand Challenge Communities Combine it All...
Where is it going to go?
Same CI useful for black holes, hurricanes
Framing the Question
Science is Radically Revolutionized by CI
Modern science
Data- and compute-intensive
Integrative
Multiscale
Collaborations for Complexity
Individuals, groups
Teams, communities
Must Transition NSF CI approach to support
Integrative, multiscale
4 centuries of constancy, 4 decades 10^9 to 10^12 change!
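One way to get a feel for the "4 decades" figure, read here as a change by a factor of 10^9 to 10^12, is to convert it into an implied doubling time. The sketch below assumes steady exponential growth over 40 years, which is a simplification of mine and not a claim made on the slide; `doubling_time_years` is a hypothetical helper name.

```python
import math

def doubling_time_years(total_factor, years):
    """Doubling time implied by steady exponential growth of
    `total_factor` over `years` (an idealisation, not slide data)."""
    doublings = math.log2(total_factor)
    return years / doublings

for factor in (1e9, 1e12):
    t = doubling_time_years(factor, 40)
    print(f"{factor:.0e}x in 40 years -> doubling every {t:.2f} years")
```

Under that assumption, capability doubles roughly every 1.0 to 1.3 years, which is the pace an incremental approach would have to keep up with.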
…But such radical change cannot be adequately addressed with (our current) incremental approach!
We still think like this…
Students take note!
Scientific Computing and Imaging Institute, University of Utah
Data Crisis: Information Big Bang
PCAST Digital Data
NSF Experts Study
Wired, Nature
Storage Networking Industry Association (SNIA) 100 Year Archive Requirements Survey Report
“there is a pending crisis in archiving… we have to create long-term methods for preserving information, for making it available for analysis in the future.” 80% respondents: >50 yrs; 68% > 100 yrs
Industry
Explosive Trends in Data Growth
Comparative Metagenomics
DNA sequencing of entire families of organisms
Already hundreds of TB, thousands of users
HD Collaborations and Optiportals
Multichannel HD, gigapixel visualizations
Petascale-Exascale simulation
They generate peta-exabytes per simulation!
Square Kilometer Array
3000 radio receivers, 1 km² area! 19 countries!
Possibly beginning in 201X, operational 202X
Data: exabyte per week! Analysis: Exaflops!
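To put "exabyte per week" in networking terms, the figure can be turned into a sustained transfer rate. This is a rough sketch, assuming a decimal exabyte (10^18 bytes) and a perfectly even data flow across the week; both are my assumptions, and the function name is illustrative.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60   # 604800 seconds
EXABYTE = 10**18                      # assumption: decimal exabyte

def sustained_rate_bytes_per_s(volume_bytes, period_s):
    """Average rate needed to move `volume_bytes` in `period_s` seconds."""
    return volume_bytes / period_s

rate = sustained_rate_bytes_per_s(EXABYTE, SECONDS_PER_WEEK)
print(f"{rate / 1e12:.2f} TB/s sustained")
print(f"{rate * 8 / 1e12:.1f} Tbit/s sustained")
```

With these assumptions the average works out to about 1.65 TB/s, or roughly 13 Tbit/s, around the clock.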
Provenance in Science
Provenance is as important as the result
Not a new issue
Lab notebooks have been used for a long time
What is new?
Large volumes of data
Complex analyses: computational processes
Writing notes is no longer an option…
GC Communities require open, sharable data, standards, metadata
DNA recombination
By Lederberg
When
Observed data
Annotation
Source: Juliana Freire, U of Utah
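When hand-written notes no longer scale, the same three notebook elements (when, observed data, annotation) can be captured automatically alongside a computation. The following is only a minimal sketch of that idea: the field names, the `provenance_record` function, and the tool name are hypothetical and do not follow any particular community standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(input_bytes, parameters, tool_name, tool_version,
                      annotation=""):
    """Build a small, machine-readable provenance record.

    All field names are illustrative, not a community standard.
    """
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        # A hash identifies the exact observed data without copying it.
        "observed_data_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "parameters": parameters,
        "tool": {"name": tool_name, "version": tool_version},
        "annotation": annotation,
    }

record = provenance_record(
    input_bytes=b"observed data would go here",
    parameters={"resolution": 128, "dimensions": 3},
    tool_name="example-analysis",   # hypothetical tool name
    tool_version="0.1",
    annotation="first 3D test run",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing a record like this next to each output is one simple way to keep results traceable; real projects would likely adopt a shared metadata standard instead.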
“Data Deluge” Drives Change at NSF
Data Issues resonate the most across NSF units!
DataNet – $100M investment in Sustainable Archive & Access Partners – development of widely accessible network of interactive data archives; driven by today’s grand challenges, integrating multiple disciplines
INTEROP – Community-led interoperability – interdisciplinary, community approaches to combine and re-use data in ways not envisioned by their creators
Data-intensive computing: SDSC Gordon facility
NSF Data Policy – The “Data Working Group” (NSF-wide group of Program Directors) working to assure that data are shared within and across disciplines
There is a major shift in science towards data-intensive methods. NSF is responding…
NSF Vision and National CI Blueprint
[Figure: national CI blueprint diagram; labels: Track 1, Track 2, Campus, DataNet, Software, Nets]
Education Crisis: I need all of this to start to solve my problem!
Science is becoming unreproducible in this environment. Validation? Provenance? Reproducibility?
The Shift Towards Data
Implications
All science is becoming data-dominated
Experiment, computation, theory
Totally new methodologies
Algorithms, mathematics
All disciplines from science and engineering to arts and humanities
End-to-end networking becomes critical part of CI ecosystem
Campuses, please note!
How do we train “data-intensive” scientists?
Data policy becomes critical!
Recent NSF Activities on Data Policy and Implementation
Fundamental points on data and publication policy
Publicly funded scientific data and publications should be available, and science benefits when they are
There has to be a place to keep data, and a way to access it
There needs to be an affordable, sustainable cost model for this
Who pays? The NSF? The Institution? What is the cost model? What is reasonable?
What data must be made available? Raw data? Peer reviewed? When is it available? 6 months? 1 year? After publication?
Where is it placed? Author web site? Library? NSF sites?
How long is it made available? How do we enforce it post-award?
There is great variability in requirements across science communities: peer review can help guide this process.
Changes Coming for Data!
Long-standing NSF Policy on Data (Proposal & Award Policies & Procedures Guide)
“Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data... created or gathered in the course of work under NSF grants”
NSF will soon require a Data Management Plan (DMP), subject to peer review; criterion for award
The DMP will be in the form of a 2-page supplementary document to the proposal
It will not be possible to submit proposals without such a document
Customization by discipline, program necessary
Upcoming Implementation of NSF Data Policy
Directorate-Specific Issues & Peer Review
Many details are implemented/enforced via peer review and Program Officer discretion, including things like embargo period, standards, etc.
The challenge at NSF is that “no one size fits all” so each Directorate will be responsible for its own recommendations for DMP content, appropriate institutional repositories, etc.
This does not address Open Access as applied
Electronic Access to Scientific Publications
Why is this Important?
Science requires it
Science progress accelerated by making publications available and searchable
Results in one community need to easily propagate to another for multidisciplinary complex problem solving
Search technologies can be brought to bear
Publications need to be associated with rich information: videos of simulations, supporting data, simulation and analysis codes…
Equality and Broadening participation
Young scientists at smaller universities at a needless disadvantage without it. They may lose journal access. This hurts science and puts talent at risk
US Administration focus on transparency and accountability
Current Activities
We have begun serious discussions within NSF on these issues
National Science Board Committee on Data started
Goals similar to those for Data
We have had numerous visits from funding agencies from around the world
Primary topic: what is NSF doing on OA?
Discussion with various publishers, libraries to explore options
Quality of science relies on peer review systems of best journals; need a way to support OA
On Working with Publishers
Quality of science, identification of talented scientists: we rely on the peer review systems of the best journals
NSF receives an assurance that the work done on a grant meets a standard
Universities use impact factors as part of their tenure and promotion process
I believe it is in the interests of science, and hence the public interest, to help journals find a viable OA business model.
Bernard Schutz, Presentation to NSF, May 2009
Final Remarks
Science is becoming collaborative and data dominated
We are accelerating efforts to advance NSF in all aspects of data
Science requires that data be open and accessible; we are working to achieve this
All forms of data are important, and must be more tightly connected in the future
Collections, software, publications
Time is of the essence