The unique qualities & responsibilities of a geographical cyberinfrastructure Mark Gahegan Centre...

The unique qualities & responsibilities of a geographical cyberinfrastructure Mark Gahegan Centre for eResearch & Computer Science University of Auckland



The unique qualities & responsibilities of a geographical cyberinfrastructure

Mark Gahegan
Centre for eResearch & Computer Science
University of Auckland


Overview

1. Data-intensive GIScience: from data poor to drowning in 40 years

2. Challenges of organizing Big Data for GIScience

3. Challenges of computing with Big Data for GIScience


Data poor to drowning: the case of remote sensing


Early remote sensing platform


1980s: 30m x 30m pixels


2000s: 2.5m x 2.5m pixels


Airborne sensor platform (much cheaper and more flexible than satellite)


One of the latest unmanned remote sensing platforms


How much data so far?

• NASA’s Earth Observing System (EOS) program has about 4.2 petabytes (2010)
  – 430 times larger than the DT-LoC
  – 3 times smaller than the output from the Large Hadron Collider in a single year
• Similar-sized collections can be expected in Europe and Asia
• EOS contains mostly satellite data… not air photos, map or field data
• What about ‘Volunteered’ data?
• And “The long tail of dark data…”?


How does that compare to other science disciplines?

• Large Hadron Collider (Physics)
  – 10-14 PB per year
  – A 20km high stack of DVDs or 400,000 large PC disks
• Genomics (Biology)
  – Imaging sequencers: data volume doubling every 6 months
  – Can’t back it up to tape fast enough
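The "doubling every 6 months" figure implies exponential growth. As a rough illustration only, and assuming the doubling time stays constant (the slide does not claim this), the volume after t years is V(t) = V0 · 2^(t / 0.5). The function name and the starting volume of 1 unit are hypothetical.

```python
def projected_volume(v0, years, doubling_time_years=0.5):
    """Volume after `years`, assuming it doubles every `doubling_time_years`."""
    return v0 * 2 ** (years / doubling_time_years)


# Hypothetical starting point: 1 unit of sequencer data today.
for years in (1, 2, 5):
    growth = projected_volume(1.0, years)
    print(f"after {years} year(s): {growth:g} x the starting volume")
```

Under that assumption an archive grows roughly a thousand-fold in five years, which gives a feel for why tape backup struggles to keep pace.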


Big Data Challenges for CI

1. Storing unprecedented volumes of data (and accelerating)
  – Data production passed storage capacity in 2007
  – Cost differential is increasing, rate of data production is increasing
2. Describing what we have in ways that are helpful to future users (and our future selves)
  – Metadata and semantics for describing content (this tends to be producer-focused)
  – But also use-case metadata and emergent relationships (tends to be consumer-focused)
3. Finding what we need, in the context of our current task
  – Semantically-enabled search engines that can use the above descriptions (ideally from within analytical tools and workflows)
4. Working out what we do not need to keep
  – Because it will not be used again or offers no ‘information gain’
  – Because it is easier to recreate than to store
5. Governing data collections well, within their communities of use
  – Information and knowledge portals
  – Effective governance of data resources
  – Quality control strategies, including peer review and rewarding excellent contributions


The Big Data responsibilities of CyberGIS

• Create successful tools and languages to describe and find data, so that reuse is actively encouraged
• Enable the analysis,
• Re-educate to reset the expectations…
• The data that we collect forms a natural history of the changing planet on which we live
  – The same cannot be said for many other sciences...
• This ongoing record is more important than the individual research we each engage in
  – Note we may not anticipate the questions that future researchers may need to answer using our data


Emerging data opportunities…

What do these four different spatial analysis tasks have in common?

• Find traffic bottlenecks…?
• Compute earthquake epicenters…?
• Track influenza epidemics…?
• Perform land cover classification…?


‘Fourth paradigm’ – science led from Big Data


BUT


HPC Analysis challenges of massive data in Earth Science

Re-express (spatial) analysis algorithms so that they scale across HPC hardware AND Big Data:

– Geometry (point / line / region / volume): algebra, selection, transformation, projection
– Topology: connectivity, route-finding, friction
– (Spatial) statistics: classification, interpolation
– Point pattern analysis / discovery: cluster detection

The challenge is to be SYSTEMATIC, not piecemeal


What’s limiting the task?

• Memory?
  – 1TB on a single compute node now
  – 2-8TB on some equipment (e.g. SGI UV)
• CPU?
  – Tightly bound: needs a lot of inter-process communication
  – Embarrassingly parallel: can be perfectly decomposed
• Data?
  – Random? Linear? Blocky? (Degree of locality of reference)
  – Replication?
• Communications?
  – Data channels, InfiniBand, metadata
• Nothing?
  – Not everything needs to be parallelized… just the limiting segments
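The point that only the limiting segments need parallelizing is commonly quantified with Amdahl's law. The law is not named on the slide and is introduced here purely as an illustration: if a fraction p of the run time can be spread over n processors, the overall speedup is 1 / ((1 - p) + p / n). The fractions and processor counts below are made-up values.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when `parallel_fraction` of the run time is
    spread evenly over `processors`; the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Hypothetical workloads: how much of the run time is parallelizable?
for fraction in (0.5, 0.9, 0.99):
    for n in (8, 64, 1024):
        print(f"p={fraction:<4} n={n:<5} speedup={amdahl_speedup(fraction, n):.1f}")
```

The serial remainder caps the achievable speedup at 1 / (1 - p), so profiling to find the limiting segment comes before any rewrite.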


Domain Decomposition

Architecture examples | Use case | Scaling
Shared memory (OpenMP) | Usually bound to a single compute node. Requires code rewrite… | Increase node core count / memory
Distributed memory (MPI) | Scaling beyond the compute node. Requires major rewrite | Additional MPI fabric used to pass messages between nodes
Adaptive (Cassandra/Hadoop/SOG) | Data-intensive, evolving, decomposition not fully understood at outset | More data bandwidth by adaptively dividing up and replicating the data
Massively parallel (BlueGene / GPU) | Very high degree of parallelization and power efficiency | Potentially scales to 1,000,000 processors
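As a minimal sketch of what domain decomposition means for a raster operation, the example below splits a grid into row strips, gives each strip a one-row "halo" of neighbouring rows, and computes a 3x3 focal mean per strip. All function names are hypothetical, the grid is toy data, and the strips are processed in a plain serial loop; in an OpenMP or MPI setting each strip would instead be handed to a thread or rank.

```python
def focal_mean_rows(grid, row_start, row_end):
    """3x3 focal mean for rows [row_start, row_end) of `grid`.
    Edge cells average over whichever neighbours exist."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r in range(row_start, row_end):
        out_row = []
        for c in range(n_cols):
            vals = [grid[rr][cc]
                    for rr in range(max(0, r - 1), min(n_rows, r + 2))
                    for cc in range(max(0, c - 1), min(n_cols, c + 2))]
            out_row.append(sum(vals) / len(vals))
        out.append(out_row)
    return out


def make_strips(n_rows, n_strips):
    """Split row indices into contiguous (start, end) ranges, as evenly as possible."""
    base, extra = divmod(n_rows, n_strips)
    strips, start = [], 0
    for i in range(n_strips):
        end = start + base + (1 if i < extra else 0)
        if end > start:
            strips.append((start, end))
        start = end
    return strips


def process_strip(grid, start, end, halo=1):
    """Compute one strip using only its own rows plus `halo` neighbouring rows."""
    lo = max(0, start - halo)
    hi = min(len(grid), end + halo)
    local = grid[lo:hi]  # the only data a worker for this strip would need
    return focal_mean_rows(local, start - lo, end - lo)


def focal_mean_decomposed(grid, n_strips):
    """Stitch the per-strip results back together in row order."""
    out = []
    for start, end in make_strips(len(grid), n_strips):
        out.extend(process_strip(grid, start, end))
    return out


# Toy 7 x 5 raster; the decomposed result should match the whole-grid result.
grid = [[float(r * 5 + c) for c in range(5)] for r in range(7)]
print(focal_mean_decomposed(grid, 3) == focal_mean_rows(grid, 0, len(grid)))
```

The halo is what makes the strips independent: a focal operation with a larger window needs a wider halo, and operations with non-local dependencies (route-finding, for instance) do not decompose this simply.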


Or to put it another way…

• How do GISci algorithms map onto the well understood supercomputing templates?
  – Dense Grids
  – Sparse Grids
  – Computational Fluid Dynamics
  – N-Body interactions
  – Monte Carlo
  – Data Intensive
  – etc...

(Are all our algorithms covered by these templates?)


Cost of reengineering vs. slowdown for GIS algorithms

[Figure: plot with axes labelled "Cost of reengineering" and "Utility"]


Sticky CyberGIS

How to attract and keep the community involved?

– Outreach & community engagement
– Compelling and appealing functionality
  • Data and method repositories
  • Workflows
  • Semantic interoperability
  • Killer Apps…
– Incentives to contribute
– Continuity


Computational workflows embedded in social media

• Scripts, workflows, simulations, experimental plans, statistical models, ...
• Repeatable, reproducible, comparable and reusable
• Sharing propagates expertise and builds reputation
• One can be ‘friends with an experiment’ in a science social network

http://myexperiment.org


Semantically translating map data

SemDat Web Service:

http://semdat.bestgrid.org/semdat/


Killer App example from geosciences: earthquake modelling

[Figure (from Leinen, 2004): a Seismic Hazard Model drawing on Geologic structure (USArray); Rupture dynamics (SAFOD, ANSS, USArray); Seismicity (ANSS); Paleoseismology; Local site effects; Faults (USArray); Stress transfer (InSAR, PBO, SAFOD); Crustal motion (PBO); Crustal deformation (InSAR); Seismic velocity (USArray)]

InSAR Image of the Hector Mine Earthquake
A satellite-generated Interferometric Synthetic Aperture Radar (InSAR) image of the 1999 Hector Mine earthquake. It shows the displacement field in the direction of radar imaging. Each fringe (e.g., from red to red) corresponds to a few centimeters of displacement.

GEON: Chaitan Baru, SDSC


Conclusions

• Big Data creates new ways of approaching GIScience: discovery-led rather than theory-led
  – Need to scale up our storage
  – Useful data is the data that can be reused…
• Scalable GIScience methods are needed now
  – Domain decomposition has always been the challenge for GIScience, and still is
  – A systematic analysis of algorithm bottlenecks and amenability to parallelization has been missing for 20 years
  – Such an analysis is an ongoing task… as new parallel HPC and data paradigms become possible
  – Re-educate to reset expectations among researchers
• Use the best technologies and tools from other disciplines that have made this leap, especially bioinformatics, computational chemistry, high energy physics


Questions?


Fourth paradigm and data complexity

1. Experiment & Measurement
2. Analytical Theory
3. Numerical Simulations
4. Data Intensive Computing

Data fusion + data mining + synthesis/learning + explanation

http://research.microsoft.com/en-us/collaboration/fourthparadigm/


Utilizing massive data to discover and explain

Is not as easy as you might think…

– Poor and sparse samples, surrogates, bias…
– As the number of dimensions increases it becomes increasingly difficult to add in any data point without giving rise to some kind of statistically significant ‘pattern’ or ‘cluster’
– And parametric distributions become unreliable
– It is very difficult to discover useful things that are unknown by experts
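One commonly cited symptom of high dimensionality, related to but not the same as the spurious-cluster problem described above, is that distances between points become nearly indistinguishable. The sketch below is an illustration under stated assumptions: points are drawn uniformly at random from the unit hypercube, the function name is hypothetical, and the point counts and dimensions are arbitrary choices.

```python
import math
import random


def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of the nearest to the farthest distance from one query point
    to the other uniformly random points. Values near 1 mean that
    'near' and 'far' neighbours are almost equally distant."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return min(dists) / max(dists)


for dims in (2, 10, 100, 500):
    ratio = nearest_to_farthest_ratio(200, dims)
    print(f"{dims:>3} dimensions: nearest/farthest = {ratio:.2f}")
```

When that ratio approaches 1, distance-based notions of "cluster" or "neighbour" carry little information, which is one reason apparent patterns in many-dimensional data deserve scepticism.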


Aligning heterogeneous definitions in content, schema

[Figure from www.GEONgrid.org: panels GEOLOGIC AGE (Eon, Era, Period, Series) and ROCK TYPE (Volcanic)]

STANDARD DEFINITIONS
• data content: rock types, time scale, …
• data schema

We need to capture the meaning of data, not just the data itself


OneGeology interoperability portal

Data from different countries can be integrated, despite using different geologic categories/legends
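The idea of integrating data that use different legends can be sketched as a crosswalk from each source's local terms to a shared vocabulary. This is a toy illustration, not a description of how OneGeology or GEON work: the source names, legend terms and standard categories below are all hypothetical.

```python
# Hypothetical crosswalks from two national legends to a shared vocabulary.
CROSSWALK = {
    "survey_a": {"basalt flow": "volcanic", "tuff": "volcanic", "granite": "plutonic"},
    "survey_b": {"VOLC": "volcanic", "PLUT": "plutonic", "SED": "sedimentary"},
}


def harmonize(records, source):
    """Return copies of `records` with a `rock_type_std` field added.
    Terms missing from the crosswalk map to None so gaps stay visible."""
    mapping = CROSSWALK[source]
    return [{**rec, "rock_type_std": mapping.get(rec["rock_type"])}
            for rec in records]


# Toy map units from the two hypothetical surveys.
units_a = [{"id": "a1", "rock_type": "tuff"}, {"id": "a2", "rock_type": "granite"}]
units_b = [{"id": "b1", "rock_type": "VOLC"}, {"id": "b2", "rock_type": "MELANGE"}]

merged = harmonize(units_a, "survey_a") + harmonize(units_b, "survey_b")
volcanic_ids = [u["id"] for u in merged if u["rock_type_std"] == "volcanic"]
print(volcanic_ids)
```

A flat lookup like this captures only synonymy; aligning definitions that overlap partially, or that sit at different levels of a hierarchy such as Eon / Era / Period, needs a richer semantic model.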


Complete connected neighborhood of a research article or dataset (Alfred knowledge browser)