Exploring Scientific Discovery with Large-Scale Parallel...
Transcript of Exploring Scientific Discovery with Large-Scale Parallel...
Exploring Scientific Discovery with Large-ScaleParallel Scripting
Tim Armstrong 1 Justin M. Wozniak 2
Michael Wilde12
1University of Chicago
2Argonne National Laboratory
May 15, 2013
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Overview
Parallel scripting: massive scalability with (relative) ease
• Scaling up real science applications difficult:• Must adapt code to radically different programming model• Concurrency bugs• Load balancing, data management, etc
• SciColSim: compute-intensive science app
• Swift/T: super-scalable high-performance scripting system forparallel composition of existing code
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
The Scripting Paradigm
• Low-level language (e.g. C) + high-level language (e.g.Python)
High-level script
Optimized performance-critical functions
orchestrates
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Parallel Scripting
• Can retrofit parallelism onto sequential scripting languages:• Threads• Message passing (MPI, etc.)• Abstractions (MapReduce, etc.)
• But parallelism is a second-class concept in the language...
• Q: Why can’t I express parallelism with loops, conditionals,variables, etc?
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Parallel Scripting in Swift/T• Q: Why can’t I express parallelism with loops, conditionals,
variables?• A: you can in Swift!• The Swift parallel scripting language[WFI+09]:
• Implicit dataflow parallelism• Language statements execute concurrently in dataflow order• Single-assignment variables guarantee determinism• Determinism extends to additional, rich, data structures:
arrays, hash tables, structs.
float results[];
file data = input_file("my.data");
foreach i in [1:N] { Independent parallel iterations
if (predicate(i)) { Dataflow dependencies
results[i] = compute(i, data);
}
}
mean, stdev = stat_summ( results);
Swift code with implied parallel dataflow
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Swift/T Scalable Implementation[WAW+]
• Can harness tens or hundreds of thousands of cores
• All runtime components distributed and scalable: data store,task distributor & script executor
• Optimizing compiler (stc) reduces messaging
Shared State
Data Store
Task Queue
Server Processes
Execution
Control/Worker Processes
ControlFlow
Load Balancing
RuleEngine Server
Server
TaskExecution
…………
…………
Process
Task flow
RuleEngine
Legend
Swift/T runtime services breakdown (left) and task dispatch (right)
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
SciColSim Application: Simulating Scientific Discovery• Ongoing research at University of Chicago• Want to understand process of scientific discovery: [ER10]
• How do scientists select hypotheses to work on?• What are the most effective strategies?
• Can explore with simulation:• Model knowledge as graph of concepts• Simulate different graph exploration strategies• Can measure how “efficient” strategy is
• Computational characteristics:• Each simulation implemented with sequential C++ code• Floating point intensive: many probability calculations
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Evaluating model parameters
• “Ensemble” of randomized simulations
• Results of simulation are averaged to evaluate “goodness” ofcurrent parameters
• Task duration is 0.2-20s. Runtime depends on inputparameters, plus significant random variation.
ensemble of randomized simulations
analyze and choose new parameters
parameter set i
parameter set i + i
parameter set i + 2
Task
Dataflow variable
Datadependency
Evaluating objective function and updating parameters
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Simulated Annealing
• Want to find “best” set of simulation parameters
• Optimize using a simulated annealing algorithm• Basic idea:
1. Perturb one parameter2. Evaluate objective function for current parameters3. Depending on result, maybe undo parameter change4. Repeat...
10x independent simulated annealing instances
…
500-1000x parameter updates
………
…… …
Visualization of parallel simulated annealing with 8-way parallelizedobjective function. Real runs have 1000-way parallelism.
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Scale-up requirements
• Optimization + validation: 0.25–0.5M CPU-hours per model
• Fast feedback needed: scientists want to iterate models
• Need to get high speedup: 4000x+ to get timely results
• Relatively short-lived tasks: 0.2s-20s. Fan-out and fan-inevery 1-2 minutes.
• Unpredictable task duration: need to dynamically assign tasksto processors, in scalable way
• High-performance dynamic task allocation mandatory
10x independent simulated annealing instances
…
500-1000x parameter updates
………
…… …
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Adapting for Swift/T
• Kept compute-intensive simulation logic in C++
• Converted simulated annealing algorithm to Swift/T:• Nested parallel loops• Sequential iteration• Logic and formulas to update parameters• Logging and output
Original Swift/T VersionLines of Code Python: 33 lines
C++: 1175 linesSwift/T: 269 linesC++: 861 lines
Scalability One node, many cores Many cores, 100’s or1000’s of nodes
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Scaling up!
• Strong scaling results for production workload at differentcompiler optimization levels: scales well!
• Mainly limited by amount of parallelism in workload ⇒ couldscale further with different optimization algorithm
• STC compiler optimization: reduces messaging ⇒ betterscaling
0 1000 2000 3000 40000
5
10
15
O0
O1
O2
O3
Ideal
Cores
Iters
/sec
0 2000 4000 6000 80000
0.01
0.02
0.03
0.04
0.05
iters/sec
Ideal
Cores
Iters
/sec
Strong scaling for down-scaled at different STC optimization levels(left) and full-scale problem (right)
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Task Prioritization
• Key technique enabled by Swift/T: task prioritization
• Improves resource utilization and time-to-solution
• Exploits application knowledge:• “Catch-up” heuristic for slower optimization chains• Prioritize long-running tasks: target parameter correlated
with runtime
@prio= 100*(niters - iter) + target
run simulation(...);
1200 1300 1400 1500 160005
1015202530
without priorities
with priorities
Time (seconds)
Bus
y co
res
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
References
J. Evans and A. Rzhetsky, Machine science, Science 329 (2010), no. 5990.
J. M. Wozniak, T. G. Armstrong, M. Wilde, D. S. Katz, E. Lusk, and I. T.Foster, Swift/T: Large-scale application composition viadistributed-memory data flow processing, Proc. CCGrid ’13.
M. Wilde, I. Foster, K. Iskra, P. Beckman, Z. Zhang, A. Espinosa,M. Hategan, B. Clifford, and I. Raicu, Parallel scripting for applications atthe petascale and beyond, Computer 42 (2009), no. 11.
Acknowledgements This research is supported in part by the U.S. DOE Office of
Science under contract DE-AC02-06CH11357, FWP-57810. This research was
supported in part by NIH through resources provided by the Computation Institute
and the Biological Sciences Division of the University of Chicago and Argonne
National Laboratory, under grant S10 RR029030-01.
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Demo
• Compile application from scratch to illustrate toolchain
• Production-scale run of SciColSim on 8400 cores of BeagleCray XE6 supercomputer @ UChicago
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Conclusions
• Can scale up existing applications with parallel scripting
• Quick development cycle: easy to debug and modify code,compared with alternative cluster programming models
• Appropriate for applications that can be implemented asuser-defined tasks with explicit data dependences
• Much better for moderately fine-grained workloads on largeclusters than traditional centralized workflow systems
• Does not support wide-area grids/clouds (yet)
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Task Dispatch Speed
• Cray XE6
• On 10 nodes, 24 cores per node
• Many independent 0s tasks
ADLB O0 O1 O2 O30.00.20.40.60.81.01.21.4
Wor
k T
asks
/s (
Mil.
)
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Scaling up to 105
• Experiment on Blue Gene/P Intrepid at Argonne National Lab
• 100s task durations
• Experiment used old version of Swift/T. Many improvementssince.
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Optimizations
• O0: Only optimize write reference counts.
• O1: Basic optimizations: constant folding, dead codeelimination, forward data flow, and loop fusion.
• O2: More aggressive optimizations: asynchronous opexpansion, wait coalescing, hoisting, and small loop expansion.
• O3: All optimizations: function inlining, pipeline fusion, loopunrolling, intra-block instruction reordering, and simplealgebra.
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Comparison with Other Systems
• Hadoop• Fixed communication pattern• Must reorganize code to fit MapReduce model• Minimize per data-item overhead versus minimize per
task-overhead
• Swift and other workflow systems• Single master node limits scalability• Optimizing compiler• Better foreign-function interface for directly calling C++ code• No support yet for wide-area systems
Parallel Scripting with Swift/T SciColSim Application Scaling up SciColSim with Swift/T
Comparison with PGAS
• Scripting paradigm versus one language for computation +coordination
• Focus on simplicity
• No explicit data placement: managed by runtime
• Strong safety guarantees (e.g. determinism)