Data-Driven Computational Science and Future Architectures at the Pittsburgh Supercomputing Center

Ralph Roskies
Scientific Director, Pittsburgh Supercomputing Center
Jan 30, 2009
NSF TeraGrid Cyberinfrastructure
• Mission: Advancing scientific research capability through advanced IT
• Resources: Computational, Data Storage, Instruments, Network
Now Is a Resource-Rich Time
• NSF has funded two very large distributed memory machines available to the national research community
– Track 2a (Texas): Ranger (62,976 cores, 579 teraflops, 123 TB memory)
– Track 2b (Tennessee): Kraken (18,048 cores, 166 teraflops, 18 TB memory), growing to close to a petaflop
– Track 2d: data-centric; experimental architecture; … proposals in review
• All part of TeraGrid. Largest single allocation this past September was 46M processor hours.
• In 2011, NCSA is going to field a 10 PF machine.
Increasing Importance of Data in Scientific Discovery
• Large amounts of data from instruments and sensors:
– Genomics
– Large Hadron Collider
– Huge astronomy databases: Sloan Digital Sky Survey, Pan-STARRS, Large Synoptic Survey Telescope
• Results of large simulations (CFD, MD, cosmology, …)
Insight by Volume: NIST Machine Translation Contest
• In 2005, Google beat all the experts by exploiting 200 billion words of documents (high-quality UN Arabic-to-English translations), looking at all 1-word, 2-word, …, 5-word phrases and estimating the best translation of each, then applying that model to the test text.
• No one on the Google team spoke Arabic or understood its syntax!
• Results depend critically on the volume of text analyzed; 1 billion words would not have sufficed. (See the sketch below.)
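To make the phrase-counting idea concrete, here is a minimal C sketch of enumerating every 1- to 5-word phrase, with a toy corpus standing in for the 200 billion words; the translation-model estimation that Google built on top of such counts is beyond the scope of this sketch.

```c
/* Minimal sketch: enumerate all 1- to 5-word phrases in a tokenized
 * corpus, as in the volume-driven approach described above. The toy
 * corpus is a placeholder; a real system would stream billions of
 * words and count phrases in a distributed table. */
#include <stdio.h>

#define MAX_N 5

int main(void) {
    const char *corpus[] = { "the", "quick", "brown", "fox",
                             "jumps", "over", "the", "lazy", "dog" };
    int nwords = (int)(sizeof corpus / sizeof corpus[0]);

    /* Enumerate every phrase of length 1..MAX_N with a sliding window. */
    for (int n = 1; n <= MAX_N; n++) {
        for (int i = 0; i + n <= nwords; i++) {
            for (int j = 0; j < n; j++)
                printf("%s%s", corpus[i + j], j + 1 < n ? " " : "\n");
        }
    }
    /* Each emitted phrase would be counted; only at enormous volume do
     * the counts make the best candidate translation statistically clear. */
    return 0;
}
```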
What computer architecture is best for data-intensive work?

• Based on discussions with many communities, we believe that a complementary architecture embodying large shared memory will be invaluable for:
– Large graph algorithms (many fields, including web analysis, bioinformatics, …)
– Rapid assessment of data-analysis ideas, using OpenMP rather than MPI and with access to large data (see the sketch below)
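A minimal sketch of the rapid-assessment point, assuming a synthetic in-memory dataset: with large shared memory, a quick analysis idea is one OpenMP pragma rather than an MPI domain decomposition.

```c
/* Sketch: the whole dataset sits in one address space, so trying an
 * analysis idea is one parallel loop, not an MPI decomposition.
 * N and the synthetic data are placeholders for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    size_t N = 10000000;                 /* placeholder dataset size */
    double *data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (size_t i = 0; i < N; i++)       /* synthetic stand-in data */
        data[i] = (double)(i % 1000);

    double sum = 0.0;
    /* One directive turns the serial scan into a parallel one; every
     * thread sees the whole dataset because memory is shared. */
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < N; i++)
        sum += data[i];

    printf("mean = %f\n", sum / (double)N);
    free(data);
    return 0;
}
```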
History of first or early systems
PSC Facilities
• Storage silos: 2 PB
• DMF archive server
• Visualization nodes: NVIDIA Quadro4 980 XGL
• Storage cache nodes: 100 TB
• XT3 (BigBen): 4,136 processors, 22 Tflop/s
• Altix (Pople): 768 processors, 1.5 TB shared memory
PSC Shared Memory Systems
• Pople, introduced March 2008: SGI Altix 4700, 768 Intel cores, 1.5 TB coherent shared memory, NUMAlink interconnect
• Highly oversubscribed
• Already stimulated work in new areas, because of the perceived ease of programming in shared memory:
– Game theory (poker)
– Epidemiological modeling
– Social network analysis
– Economics of Internet connectivity
– fMRI study of cognition
Desiderata for New System
• Powerful Performance
• Programmability
• Support for current applications
• Support for a host of new applications and science communities.
Proposed Track 2 System at PSC
• Combines next-generation Intel processors (Nehalem-EX) with SGI's next-generation interconnect technology (NUMAlink 5)
• ~100,000 cores, ~100 TB memory, ~1 PF peak
• Coherent shared-memory components of at least 4 TB, with fully globally addressable memory
• Superb MPI and I/O performance
Accelerated Performance

• MPI Offload Engine (MOE)
– Frees the CPU from MPI activity
– Faster reductions (2-3× compared to competitive clusters/MPPs)
– Order-of-magnitude faster barriers and random access
• NUMAlink 5 advantages
– 2-3× MPI latency improvement
– 3× the bandwidth of InfiniBand QDR
– Special support for block transfers and global operations
• Massively memory-mapped I/O (see the sketch below)
– Under user control
– Big speedup for I/O-bound apps
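As a generic illustration of memory-mapped I/O (a plain POSIX mmap sketch, not SGI's specific mechanism), the pattern that speeds up I/O-bound applications looks roughly like this:

```c
/* Generic POSIX sketch of memory-mapped I/O: the file appears as an
 * ordinary array, so paging machinery replaces explicit read() calls.
 * Illustrative only; not SGI's implementation. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only into the address space. */
    unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Scan the mapping as if it were an in-memory buffer. */
    unsigned long checksum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        checksum += p[i];
    printf("bytes=%lld checksum=%lu\n", (long long)st.st_size, checksum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```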
Enhanced Productivity from Shared Memory

• Easier shared-memory programming for rapid development/prototyping
• Will allow large-scale generation of data and analysis on the same platform, without moving the data (a major problem for current Track 2 systems)
• Mixed shared-memory/MPI programming between much larger blocks (e.g. Woodward's PPM code, or the sketch below)
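A minimal sketch of that mixed model (not Woodward's PPM code itself, just the pattern): OpenMP threads share memory within each large block, while MPI connects the blocks.

```c
/* Hybrid sketch: MPI ranks map to large shared-memory blocks, and
 * OpenMP threads work within each block. Placeholder problem size. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Request thread support suitable for OpenMP inside MPI ranks. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000000;           /* per-rank slice, placeholder size */
    double local = 0.0;

    /* Threads share the rank's memory; no messages inside the block. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (1.0 + i + rank * (double)n);

    double global = 0.0;
    /* Coarse-grained communication only between the big blocks. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```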
High-Productivity, High-Performance Programming Models

The T2c system will support programming models for:
• High productivity: Star-P (parallel MATLAB, Python, R)
• Coherent shared memory: OpenMP, pthreads
• Hybrid: MPI/OpenMP, MPI/threaded
• PGAS: UPC, CAF
• MPI, shmem
• Charm++

(Diagram axis: extreme capability · algorithm expression · user productivity · workflows)
Programming Models: Petascale Capability Applications
• Full-system applications will run in any of the four programming models
• Dual emphasis on performance and productivity:
– Existing codes
– Optimization for multicore
– New and rewritten applications
Programming Models: High-Productivity Supercomputing
• Algorithm development
• Rapid prototyping
• Interactive simulation
• Also:
– Analysis and visualization
– Computational steering
– Workflows
Programming Models: New Research Communities
• Multi-TB coherent shared memory
• Global address space
• Express algorithms not well served by distributed systems:
– Complex, dynamic connectivity
– Simplified load balancing
Enhanced Service for Current Power Users: Analyze Massive Data Where You Produce It

• Combines superb MPI performance with shared memory and higher-level languages for rapid analysis prototyping
Analysis of Seismology Simulation Results

• Validation across models (Quake: CMU; AWM: SCEC). 4D waveform output at 2 Hz (to address civil-engineering structures) for 200 s earthquake simulations will generate hundreds of TB of output.
• Voxel-by-voxel comparison is not an appropriate comparison technique. PSC developed data-intensive statistical analysis tools to understand subtle differences in these vast spatiotemporal datasets.
• This required having substantial windows of both datasets in memory to compare (see the sketch below).
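A toy sketch of the windowed-comparison idea, with synthetic series and a placeholder RMS-difference statistic standing in for PSC's actual analysis tools:

```c
/* Toy sketch: compare two simulation outputs held in memory,
 * window by window, using a per-window RMS difference. Placeholder
 * sizes and synthetic data; the real tools use richer statistics. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void) {
    size_t n = 1 << 20;      /* placeholder series length */
    size_t win = 4096;       /* placeholder window size   */
    double *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b);
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++) {     /* synthetic stand-in data */
        a[i] = sin(i * 1e-3);
        b[i] = sin(i * 1e-3) + 1e-3 * cos(i * 7e-4);
    }

    /* Compare windows rather than individual voxels; both datasets
     * must be resident so corresponding windows line up directly. */
    for (size_t s = 0; s + win <= n; s += win) {
        double sq = 0.0;
        for (size_t i = s; i < s + win; i++) {
            double d = a[i] - b[i];
            sq += d * d;
        }
        printf("window %zu: rms diff %g\n", s / win, sqrt(sq / win));
    }
    free(a); free(b);
    return 0;
}
```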
Design of LSST Detectors
• Gravitational lensing can map the distribution of dark matter in the Universe and make estimates of the dark energy content more accurate.
– Measurements are very subtle.
– High-quality modeling, with robust statistics, is needed for LSST detector design.
• Must calculate ~10,000 light cones through each simulated universe.
– Each universe is 30 TB.
– Each light-cone calculation requires analyzing large chunks of the entire dataset.
Understanding the Processes that Drive Stress-Corrosion Cracking (SCC)

A crack in the surface of a piece of metal grows from the activity of atoms at the point of cracking. Quantum-level simulation (right panel) leads to modeling the consequences (left panel). From http://viterbi.usc.edu/news/news/2004/2004_10_08_corrosion.htm

• Stress-corrosion cracking affects the safe, reliable performance of buildings, dams, bridges, and vehicles.
– Corrosion costs the U.S. economy about 3% of GDP annually.
• Predicting the lifetime beyond which SCC may cause failure requires multiscale simulations that couple quantum, atomistic, and structural scales.
– 100-300 nm, 1-10 million atoms, over 1-5 μs, 1 fs timestep
• Efficient execution requires large SMP nodes to minimize surface-to-volume communication, large cache capacity, and high-bandwidth, low-latency communication.
• The system is expected to achieve the ~1,000 timesteps per second needed for realistic simulation of stress-corrosion cracking.
Courtesy of Priya Vashishta, USC
Analyzing the Spread of Pandemics

From Karla Atkins et al., "An Interaction Based Composable Architecture for Building Scalable Models of Large Social, Biological, Information and Technical Systems," CTWatch Quarterly, March 2008. http://www.ctwatch.org

• Understanding the spread of infectious diseases is critical for effective response to disease outbreaks (e.g. avian flu).
• EpiFast: a fast, reliable method for simulating pandemics, based on a combinatorial interpretation of percolation on directed networks (see the sketch below).
– Madhav Marathe, Keith Bisset, et al., Network Dynamics and Simulation Science Laboratory (NDSSL) at Virginia Tech
• Large shared memory is needed for efficient implementation of the graph-theoretic algorithms that simulate transmission networks, which model how disease spreads from one individual to the next.
• 4 TB of shared memory will allow study of world-wide pandemics.
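A heavily simplified sketch of percolation-style spread on a directed contact network (illustrative only; not NDSSL's EpiFast): each contact edge transmits with probability p, and the infected set is whatever is reachable from the seed case.

```c
/* Simplified percolation sketch of disease spread on a directed contact
 * network (illustrative; not EpiFast). Each edge "opens" with probability
 * p; infected set = nodes reachable from the seed via open edges (BFS). */
#include <stdio.h>
#include <stdlib.h>

#define N 6   /* toy network size */

int main(void) {
    /* Toy adjacency matrix: adj[u][v] = 1 means u has contact with v. */
    int adj[N][N] = {
        {0,1,1,0,0,0}, {0,0,0,1,0,0}, {0,0,0,1,1,0},
        {0,0,0,0,0,1}, {0,0,0,0,0,1}, {0,0,0,0,0,0},
    };
    double p = 0.6;          /* transmission probability per contact */
    srand(42);

    int infected[N] = {0}, queue[N], head = 0, tail = 0;
    infected[0] = 1;         /* seed case */
    queue[tail++] = 0;

    /* BFS over the edges that percolate (i.e., transmit). */
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < N; v++) {
            if (adj[u][v] && !infected[v] &&
                (double)rand() / RAND_MAX < p) {
                infected[v] = 1;
                queue[tail++] = v;
            }
        }
    }

    for (int v = 0; v < N; v++)
        printf("node %d: %s\n", v, infected[v] ? "infected" : "healthy");
    return 0;
}
```

At realistic scale the adjacency structure for a world-wide contact network is far too large and irregular to partition cleanly, which is why the slide argues for multi-TB shared memory.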
Engaging New Communities: Memory-Intensive Graph Algorithms

• Web analytics: nodes are web pages, edges are links
• Applications: fight spam, rank importance, cluster information, determine communities
• Algorithms are notoriously hard to implement on distributed-memory machines
• Scale: 10^10 pages, 10^11 links, 40 bytes/link → 4 TB
courtesy Guy Blelloch (CMU)
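To make the arithmetic concrete, here is a hypothetical compressed-sparse-row (CSR) layout for a link graph; at ~40 bytes per link, 10^11 links come to ~4 TB, which is why a single coherent address space is attractive for these algorithms.

```c
/* Sketch: a link graph in compressed sparse row (CSR) form, held in one
 * shared address space. Toy sizes here; at ~40 bytes/link, 1e11 links
 * need ~4 TB, hence multi-TB coherent shared memory. */
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* Toy graph: page i's outlinks are dst[off[i] .. off[i+1]-1]. */
    uint64_t off[] = {0, 2, 3, 5};          /* 3 pages */
    uint64_t dst[] = {1, 2, 2, 0, 1};       /* 5 links */
    int npages = 3;

    /* An out-link scan over the shared structure; in shared memory this
     * parallelizes trivially, with no edge-list partitioning required. */
    for (int i = 0; i < npages; i++) {
        printf("page %d links to:", i);
        for (uint64_t k = off[i]; k < off[i + 1]; k++)
            printf(" %llu", (unsigned long long)dst[k]);
        printf("\n");
    }
    return 0;
}
```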
More Memory-Intensive Graph Algorithms
Examples (node type / edge type → application):
• protein / interaction → biological pathways
• IP packet / session → computer security
• item / common receipt → analyzing buying habits
• word / adjacency → machine translation

Also: epidemiology, social networks, …

courtesy Guy Blelloch (CMU)
PSC T2c: Summary
• PSC’s T2c system, when awarded, will leverage architectural innovations in the processor (Intel Nehalem-EX) and the platform (SGI Project Ultraviolet) to enable groundbreaking science and engineering simulations using both “traditional HPC” and emerging paradigms
• Complement and dramatically extend existing NSF program capabilities
• Usability features will be transformative
– Unprecedented range of target communities:
• perennial computational scientists
• algorithm developers, especially those tackling irregular problems
• data-intensive and memory-intensive fields
• highly dynamic workflows (modify code, run, modify code again, run again, …)
• Reduced concept-to-results time, transforming NSF user productivity
Integrated into the National Cyberinfrastructure

• Enabled and supported by PSC's advanced user support, application and system optimization, and middleware and infrastructure, leveraging the national cyberinfrastructure
Questions?
Predicting Mesoscale Atmospheric Phenomena
• Accurate prediction of atmospheric phenomena at the 1-100 km scale is needed to reduce economic losses and injuries due to strong storms.
• To achieve this, we require 20-member ensemble runs at 1 km resolution, covering the continental US, with dynamic data assimilation in quasi-real time.
– Ming Xue, University of Oklahoma
– Reaching 1.0-1.5 km resolution is critical. (In certain weather situations, fewer ensemble members may suffice.)
• Expected to sustain 200 Tflop/s for WRF, enabling prediction of atmospheric phenomena at the mesoscale.

Fanyou Kong et al., "Real-Time Storm-Scale Ensemble Forecast Experiment – Analysis of 2008 Spring Experiment Data," Preprints, 24th Conf. on Severe Local Storms, Amer. Meteor. Soc., 27-31 October 2008. http://twister.ou.edu/papers/Kong_24thSLS_extendedabs-2008.pdf
Reliability
• Hardware-enabled fault detection, prevention, containment
• Enhanced monitoring and serviceability
• NUMAlink automatic retry and various error-correcting mechanisms