Building Systems for Big Data and Big Compute
Steve Scott, Cray CTO
Smoky Mountains Conference, September 1, 2016
We’ve Been Doing “Big Data” For a Long Time
Massive Datasets
High Performance Memory, Interconnects, and Storage
Disruptive Memory Technology
● Standard DDR memory BW has not kept pace with CPUs
● HBM:
  ● ~10x higher BW, ~10x less energy/bit (back-of-envelope sketch below)
  ● Costs ~2x DDR4 per bit
[Chart: Today's DDR4 vs. future HBM3 — bandwidth (GB/s) and energy (pJ/bit), comparing 4 channels of 2.4 GHz DDR4 against 4 stacks of gen-3 HBM on package]
May want more, smaller nodes, with better BW and capacity per op
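A back-of-envelope sketch of the gap the chart illustrates. The DDR4 peak figure follows from the standard per-channel rate (8 bytes × 2.4 GT/s); the HBM numbers are simply the slide's ~10x multipliers applied to that baseline, not measured values.

```python
# Back-of-envelope comparison of the slide's DDR4 vs. HBM claims.
# DDR4 channel rate is the standard peak figure; the HBM values just
# apply the slide's ~10x bandwidth and ~0.1x energy-per-bit multipliers.

DDR4_CHANNELS = 4
DDR4_RATE_GTS = 2.4          # the slide's "2.4 GHz DDR4"
DDR4_BYTES_PER_TRANSFER = 8  # 64-bit channel

ddr4_bw = DDR4_CHANNELS * DDR4_RATE_GTS * DDR4_BYTES_PER_TRANSFER  # GB/s
hbm_bw = 10 * ddr4_bw          # slide: ~10x higher bandwidth
hbm_energy_ratio = 1 / 10      # slide: ~10x less energy per bit
hbm_cost_ratio = 2             # slide: ~2x DDR4 cost per bit

print(f"DDR4 (4 channels): ~{ddr4_bw:.0f} GB/s peak")
print(f"HBM on package:    ~{hbm_bw:.0f} GB/s (per the slide's ~10x claim)")
print(f"HBM energy/bit:    ~{hbm_energy_ratio:.0%} of DDR4")
print(f"HBM cost/bit:      ~{hbm_cost_ratio}x DDR4")
```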
Most “Big Data” Jobs Aren’t That Big
● Aggregate data is becoming very large, but most analytic jobs are modest
  ● Typical data analytics workloads: ~10 GB mean, ~100 GB 95th percentile
  ● Prabhat: big HPC analytics jobs are ~10x larger than that
  ● Many data analytics jobs run on a handful of cores
● Meanwhile, the APEX procurement wants multiple PB of memory! (see the arithmetic sketch below)
[Figure: a 1 TB "Big Data" job shown against 3 PB of main memory]
● I'll interpret "Big Data" as meaning data analytics
  ● Extracting knowledge/insight from data
  ● As opposed to simulation and modeling, which generally produces data
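To put the slide's numbers side by side, a tiny sketch; all figures (10 GB mean, 100 GB 95th percentile, the 1 TB job, the 3 PB system) come straight from the slide.

```python
# How the slide's "typical" analytics job sizes compare with an
# APEX-class machine. All figures come from the slide itself.
PB = 1024**5
TB = 1024**4
GB = 1024**3

system_memory = 3 * PB   # APEX-scale main memory
big_data_job = 1 * TB    # the slide's "Big Data" job
typical_job = 10 * GB    # mean analytics job
p95_job = 100 * GB       # 95th-percentile analytics job

print(f"A 1 TB job uses {big_data_job / system_memory:.4%} of 3 PB")
print(f"3 PB could hold ~{system_memory // typical_job:,} mean-sized jobs")
print(f"3 PB could hold ~{system_memory // p95_job:,} 95th-percentile jobs")
```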
Convergence of HPC and Big Data
What do we need to be doing in HPC that is different from what we have done in the past?
What is an optimal design for HPC?
Node Architecture A
• Dual Haswell nodes @ 2.4 GHz
• 128 GB DDR4 @ 2.66 GHz
• 12.5 GB/s/node network bandwidth
Node Architecture B
• Dual Haswell nodes @ 2.6 GHz
• 256 GB DDR4 @ 2.66 GHz
• 25 GB/s/node network bandwidth
Which of these is better? Not at all clear.
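One way to see why the answer isn't obvious: under a fixed budget, the richer per-node configuration buys fewer nodes. The sketch below is a toy model; only the node specs come from the slide, while the relative node prices and the budget are hypothetical illustrations.

```python
# Toy model: which architecture "wins" depends on what the workload is
# bound by. Node specs are from the slide; the relative node prices and
# the fixed budget are hypothetical.
archs = {
    "A": {"ghz": 2.4, "mem_gb": 128, "net_gbs": 12.5, "price": 1.0},
    "B": {"ghz": 2.6, "mem_gb": 256, "net_gbs": 25.0, "price": 1.4},
}
budget = 1000.0  # arbitrary units

for name, a in archs.items():
    nodes = int(budget / a["price"])
    print(f"Arch {name}: {nodes} nodes, "
          f"{nodes * a['ghz']:.0f} aggregate GHz, "
          f"{nodes * a['mem_gb'] / 1024:.0f} TB memory, "
          f"{nodes * a['net_gbs'] / 1000:.1f} TB/s injection bandwidth")
```

With these (made-up) prices, architecture A ends up with more aggregate clock cycles while B ends up with more aggregate memory and network bandwidth, so compute-bound and data-bound workloads would rank them differently.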
Landscape of Parallel Computing Research (Berkeley, 2006/2008):
§ MapReduce
§ N-body methods
§ Graph traversal
§ Graphical models
§ Dense and sparse linear algebra
§ Spectral methods
§ Structured and unstructured grids
§ Combinational logic
§ Dynamic programming
§ Backtrack and branch-and-bound
§ Finite-state machines

State of Big Data: Use Cases and Ogre Patterns (NIST, 2014):
§ Basic statistics – simple MapReduce implementation (sketched below)
§ Generalized n-body problems
§ Graph-theoretic computations
§ Linear algebraic computations
§ Optimizations – e.g., linear programming
§ Integration/machine learning
§ Alignment problems – e.g., BLAST

Data analytics can be considered just another set of workloads in a sea of workloads.
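The NIST list above flags basic statistics as a simple MapReduce implementation. Here is a minimal sketch in plain Python; the partitioning and data are illustrative and not tied to any particular framework.

```python
# Minimal map/reduce over data partitions: each "map" emits a partial
# (count, sum, sum-of-squares) tuple, the "reduce" combines them, and
# the final step derives mean and variance. Data is illustrative.
from functools import reduce

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

def map_stats(part):
    return (len(part), sum(part), sum(x * x for x in part))

def reduce_stats(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

count, total, sumsq = reduce(reduce_stats, map(map_stats, partitions))
mean = total / count
variance = sumsq / count - mean * mean
print(f"n={count} mean={mean:.3f} var={variance:.3f}")
```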
Generalizations About Analytics Workloads
● Data-centric workloads
  ⇒ Larger memories and local SSDs are helpful
● Vertical data motion is important
  ● Hadoop and Spark effectively move computation to the data and do initial filtering of data locally (see the sketch after this list)
  ⇒ Don't (usually) need much network bandwidth
● Notable exceptions: graph analytics and machine learning
  ● Graph analytics
    ● Can't partition the data! So really hard to scale! (many get discouraged)
    ● Wants a network that can do fine-grained RDMA well (similar to some HPC)
  ● Machine learning
    ● Training can be parallelized, can use lots of data, and requires global communication
    ● Wants a very high performance network and memory system
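A minimal PySpark sketch of the "move computation to the data" pattern described above: each executor filters and pre-aggregates its local partitions, so only small partial results cross the network. The input path and column names are hypothetical.

```python
# Each executor scans and filters its local partitions (vertical data
# motion), then pre-aggregates, so the shuffle carries only small
# partial counts rather than raw records. Path and columns hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-at-source").getOrCreate()

events = spark.read.parquet("/data/events")        # hypothetical dataset
summary = (events
           .filter(F.col("status") == "error")     # local filtering
           .groupBy("service")                     # small partial aggregates
           .agg(F.count("*").alias("errors")))
summary.show()
```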
Merging of HPC and Data Analytics
[Slide diagram: the Urika-GD (custom graph analytics engine), the Urika-XA (Hadoop, Spark, NoSQL), and the XC40 ("Minerva", world's leading supercomputer) converging into the Urika-GX ("Athena")]

Why combine HPC and Analytics solutions in a single box?
● HPC + analytics workflows
● HPC underneath the covers (Aries network)
● Open analytics framework
● Cray Graph Engine
● Integrated system: Hadoop/Spark + graph analytics + HPC
Building an Analytics Machine
● Urika-GX approach:
  ● 48 Haswell nodes per cabinet
  ● Aries network
  ● Up to 512 GB DRAM per node
  ● Dual SATA HDDs per node
  ● Up to 4 TB SSD per node
● XC40 approach:
  ● 192 Haswell nodes per cabinet
  ● Aries network
  ● Up to 256 GB DRAM per node
  ● DataWarp 12 TB SSD blades, which can be dynamically shared across the system
But… we need to address the Lustre metadata bottleneck for codes that do lots of "local" file I/O.
Using Shifter to Accelerate Per-Node I/O
• Demonstrated > 100x speedup vs. straight Lustre on IOPS benchmark at 256 nodes
• Demonstrated Spark scaling to 50,000 cores in CUG 2016 paper
"NAS storage surprisingly close to local SSDs"
https://cug.org/proceedings/cug2016_proceedings/includes/files/pap125.pdf
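The idea behind the speedup is to give each node a private, locally mounted scratch filesystem (a Shifter per-node loopback cache or a local SSD), so the many small shuffle and temp files never hit the shared Lustre metadata servers. Below is a minimal sketch of pointing Spark's scratch space at such a node-local mount; the mount path is hypothetical, and the Shifter setup itself happens outside Spark.

```python
# Point Spark's scratch space at a node-local mount (e.g., a Shifter
# per-node loopback cache or local SSD) so small shuffle and temp files
# avoid the shared Lustre metadata servers. The path is hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("node-local-scratch")
         .config("spark.local.dir", "/mnt/node_local_cache")  # hypothetical mount
         .getOrCreate())

# Shuffle-heavy work now writes its intermediate files to the local mount.
df = spark.range(0, 10_000_000).withColumnRenamed("id", "key")
counts = df.groupBy((df.key % 100).alias("bucket")).count()
counts.show(5)
```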
Resource Management and Scheduling
Picture from Malte Schwarzkopf's blog: http://www.firmament.io/blog/scheduler-architectures.html
● Analytics workloads can have very different scheduling needs than HPC workloads
  ● May want very fine-grained scheduling (cores, not nodes)
  ● May have long-running services processing streaming data
  ● May need to dynamically expand/contract
  ● May be tied to real-time events such as experimental control or output processing
  ● May be interactive/bursty (database utilization depends on queries)
Other Analytics Implications (mostly SW)
● Greater diversity of programming languages & environments
  ● Python, R, Julia, Spark, Scala, ML frameworks, etc.
  ● MPI + OpenMP is a foreign concept to the analytics community
  ● Openness and container support are important
● Cloud interoperability
  ● E.g.: source data from cloud ➝ compute/analyze ➝ store data back in cloud
● Data movement between apps
  ● HPC tends to focus on accelerating single applications
  ● Analytics workloads usually involve pipelines
  ● Shared data formats can allow data exchange in memory
    ● E.g.: Arrow in-memory data structure specification for columnar data
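A minimal sketch of the Arrow idea the last bullet points at: one process writes a columnar table in Arrow's IPC file format, and another can memory-map it and read it with essentially zero copy. This uses the present-day pyarrow API, which postdates the 2016 talk; the file name and columns are illustrative.

```python
# Producer writes a columnar table in Arrow IPC format; a consumer can
# memory-map the same file and read it without deserialization.
# Uses current pyarrow APIs (newer than this 2016 talk); names illustrative.
import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"id": [1, 2, 3], "score": [0.2, 0.5, 0.9]})

with pa.OSFile("shared.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# In another process/application: zero-copy read via memory mapping.
with pa.memory_map("shared.arrow", "r") as source:
    shared = ipc.open_file(source).read_all()
print(shared.num_rows, shared.column_names)
```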
Takeaways
● Strong motivation for HPC + Big Data in a single system
  ● Growing desire for HPC + analytics workflows
  ● More efficient when data can be transferred in memory/SSD
  ● Utilization is better with systems that can be dynamically provisioned
● Big Data is just another set of workloads
  ● Not that different (we already build machines to handle big data)
  ● On average, probably want more memory per node for analytics
  ● Some workloads don't need much network, but others need a strong network
  ● May argue for heterogeneous systems (we already do that for HPC)
● Biggest issue may be resource management/scheduling
  ● A few other software issues, but no show-stoppers for converged systems
Thank You! Questions?