Multicore Challenge Conference 2013
UWE, Bristol
Static and Dynamic Load Balancing for Heterogeneous Systems
Greg Michaelson
Turkey Al Salkini, Khari Armih, Deng Xiao Yan
Heriot-Watt University
Phil Trinder
University of Glasgow
Context
• explosive growth in deployment of heterogeneous systems
– desktop/laptop/tablet with many-core + GPU
• HPC with clusters of nodes of many-core + GPU
• upgrade/replace/remove cluster components as opportune rather than replacing whole cluster
• newer components typically higher spec than older components
• hard to allocate and balance load
Context
• typical heterogeneous system
• colours indicate different technologies
[Diagram: two clusters connected by a network; each node has cores, memory, and a GPU on a shared bus]
Context
• static work allocation for regular data parallelism
• simple data parallel skeleton with master & identical workers
• master sends work to workers and receives results
• assume: master on core; workers on cores or GPU
[Diagram: master connected to worker1, worker2, ... workerM]
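The skeleton above can be sketched in Python with a process pool; this is a minimal illustration, not the authors' implementation, and the squaring task and chunking are mine:

```python
from multiprocessing import Pool

def worker(chunk):
    # identical workers: each applies the same function to its data items
    return [x * x for x in chunk]

def master(data, num_workers):
    # static allocation for regular data parallelism:
    # split the work into one equal-sized chunk per worker
    size = (len(data) + num_workers - 1) // num_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(num_workers) as pool:
        results = pool.map(worker, chunks)   # send work, receive results
    return [y for part in results for y in part]

print(master(list(range(8)), 4))  # → [0, 1, 4, 9, 16, 25, 36, 49]
```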
Homogeneity
• cluster of identical single core nodes with no GPU
[Diagram: two single-core nodes (core + memory on a bus), no GPUs, connected by a network]
Amdahl’s Law
• T = total sequential time
[Diagram: a single bar of total time T]
Amdahl’s Law
• T = total sequential time
• Tseq = necessarily sequential time
• Tpar = potentially parallelisable time
• i.e. T = Tseq + Tpar
[Diagram: bar T split into Tseq and Tpar]
Amdahl’s Law
• T = total sequential time
• Tseq = necessarily sequential time
• Tpar = potentially parallelisable time
• i.e. T = Tseq + Tpar
• N = number of nodes
• allocate same amounts of Tpar
to each node
[Diagram: Tpar divided into equal pieces Tpar1, Tpar2, ... TparN alongside Tseq]
Amdahl’s Law
• T = total sequential time
• Tseq = necessarily sequential time
• Tpar = potentially parallelisable time
• i.e. T = Tseq + Tpar
• N = number of nodes
• allocate same amounts of Tpar
to each node
• Tmin = minimum time
• Tmin = Tseq + Tpar/N
[Diagram: sequential bar Tseq + Tpar versus parallel bar Tseq + Tpar/N, with the difference marked as the saving]
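The minimum time and the resulting speedup follow directly from the formula Tmin = Tseq + Tpar/N; a small sketch (function names are mine):

```python
def amdahl_min_time(t_seq, t_par, n):
    # best case: the parallelisable portion divides evenly over N nodes
    return t_seq + t_par / n

def speedup(t_seq, t_par, n):
    total = t_seq + t_par                  # T = Tseq + Tpar
    return total / amdahl_min_time(t_seq, t_par, n)

# e.g. 10% necessarily-sequential work caps speedup well below N
print(speedup(10, 90, 16))  # → 6.4
```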
Communication cost
• Tsend = time to send data to processors
• Trec = time to receive results from processors
• for any parallel benefit, need:
• Tsend+Trec+Tpar/N < Tpar
[Diagram: the parallel bar now includes Tsend before and Trec after the Tpari pieces, reducing the saving]
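The benefit condition above is a one-line predicate; a sketch with illustrative numbers:

```python
def parallel_worthwhile(t_send, t_rec, t_par, n):
    # parallel pays off only if distribution + gathering + the divided
    # parallel work still beat running Tpar sequentially
    return t_send + t_rec + t_par / n < t_par

print(parallel_worthwhile(5, 5, 100, 10))    # → True  (10 + 10 < 100)
print(parallel_worthwhile(60, 60, 100, 10))  # → False (communication dominates)
```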
Heterogeneity 1
• different number of cores/amount of RAM on each node
• same core/RAM technologies
[Diagram: two nodes with different numbers of cores and amounts of RAM (Cores1/Mem1, Cores2/Mem2), same technology, connected by a network]
Heterogeneity 1
• Coresi = number of cores on node i
• P = number of processors = Cores1+...+CoresN
• each core gets:
Tpar/P
• node i gets:
Coresi*Tpar/P
• haven’t taken memory into account...?
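The cores-proportional allocation can be sketched directly (a minimal illustration, names are mine):

```python
def core_shares(cores, t_par):
    # Heterogeneity 1: each core gets Tpar/P,
    # so node i gets Coresi*Tpar/P
    p = sum(cores)                        # P = total number of cores
    return [c * t_par / p for c in cores]

print(core_shares([4, 8, 12], 120))  # → [20.0, 40.0, 60.0]
```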
Heterogeneity 2
• different technologies on each node
• no GPU
[Diagram: two nodes with different core/memory technologies, no GPUs, connected by a network]
Heterogeneity 2
• characterise node strength
– coarse-grain, relative measure
• for node i:
Speedi = clock speed of each core
Strengthi = Coresi*Speedi
i.e. strength increases with number of cores & core speed
• whole system: Strength = Strength1+...+StrengthN
• node i gets: Tpar*Strengthi/Strength
• each node i core gets: (Tpar*Strengthi/Strength)/Coresi
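The strength-based shares follow the same pattern; a sketch of the equations above with illustrative speeds:

```python
def strength_shares(cores, speeds, t_par):
    # Heterogeneity 2: Strengthi = Coresi*Speedi; node i gets
    # Tpar*Strengthi/Strength, split evenly over its own cores
    strengths = [c * s for c, s in zip(cores, speeds)]
    total = sum(strengths)                        # whole-system Strength
    node_share = [t_par * st / total for st in strengths]
    core_share = [ns / c for ns, c in zip(node_share, cores)]
    return node_share, core_share

node, core = strength_shares([4, 8], [2.0, 3.0], 64)
print(node)  # → [16.0, 48.0]  (per-node load)
print(core)  # → [4.0, 6.0]    (per-core load on each node)
```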
Heterogeneity 2
• still doesn’t take memory into account?
• top level shared cache (i.e. L2 or L3) most important
– determines frequencies of access clashes & caching
• for node i:
Cachei = shared cache size
Strengthi = Coresi*Speedi*Cachei
i.e. strength also increases with cache size
• use previous equations for system strength, and node and core load
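Folding the shared cache into the strength measure is a one-line change; a sketch with illustrative values:

```python
def strength(cores, speed, cache):
    # Strengthi = Coresi*Speedi*Cachei: strength also grows with the
    # size of the node's top-level shared cache
    return cores * speed * cache

# two nodes identical except for shared cache size
print(strength(4, 2.0, 8))  # → 64.0
print(strength(4, 2.0, 4))  # → 32.0 (half the cache, half the strength)
```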
Heterogeneity 2
[Figure-only slides]
Heterogeneity 3
• different node technologies
• with GPU
[Diagram: two nodes, each with cores, memory, and a GPU (Cores1/Mem1/GPU1, Cores2/Mem2/GPU2), connected by a network]
Heterogeneity 3
• wish to treat GPU as additional general compute resource - GP-GPU
• but cannot directly adopt current algorithm independent model
– very hard to characterise GP-GPU succinctly from static information
• base GPU strength measure on algorithm specific profiling
• use cost model for core strength
• normalise strength relative to one core
Heterogeneity 3
TCore0 = time for algorithm on base core
• for node i:
TGPUi = time for algorithm on GPU
• strength of GPUi = TCore0/TGPUi, i.e. the GPU's speedup over the base core
• node i core strength = (Speedi*Cachei)/(Speed0*Cache0)
• assume GPU controlled by one core
Strengthi = TCore0/TGPUi + (Coresi-1)*(Speedi*Cachei)/(Speed0*Cache0)
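A sketch of the combined node strength, reading the GPU's strength as its profiled speedup relative to the base core (the numbers below are hypothetical profile results, not measurements from the slides):

```python
def node_strength(t_core0, t_gpu, cores, speed, cache, speed0, cache0):
    # GPU strength comes from algorithm-specific profiling; the remaining
    # Coresi-1 cores use the static cost-model strength, since one core
    # is assumed to be dedicated to driving the GPU
    gpu_strength = t_core0 / t_gpu
    core_strength = (speed * cache) / (speed0 * cache0)
    return gpu_strength + (cores - 1) * core_strength

# hypothetical node: GPU 20x faster than the base core, 4 cores,
# each core with twice the base core's speed*cache product
print(node_strength(t_core0=100.0, t_gpu=5.0, cores=4,
                    speed=2.0, cache=4.0, speed0=1.0, cache0=4.0))  # → 26.0
```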
Heterogeneity 3
[Figure-only slides]
Task mobility
• static model assumes program has sole use of system
– no load changes at run-time
• if system shared then load changes unpredictably at run-time
• base system design on threads?
• hope operating system will manage load?
– optimising movement of arbitrary threads across nodes is a black art
• decentralise mobility
• let master decide when to move work
Task mobility
[Diagram: master M and workers W1, W2, W3 on separate processors]
M: Master, Wi: Worker, colour: amount of load on processor.
Task mobility
[Diagram: W3's processor darkens]
W3 becomes overloaded. The load should be balanced by moving the work to a lightly loaded processor.
Task mobility
[Diagram: W3 has moved to a lightly loaded processor]
The load is re-balanced.
Mobility decision
Ti = time to complete task on location i
Tmove = time to move task from current to new location
Overhead = acceptable completion overhead e.g. 5%
• to justify moving work from location i to location j:
Tj+Tmove < Ti*Overhead
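The decision is a single comparison; this sketch reads Overhead as a multiplicative factor (e.g. 0.95 to demand at least a 5% gain — the slide leaves the exact encoding open):

```python
def should_move(t_i, t_j, t_move, overhead=0.95):
    # move work from location i to location j only if finishing at j,
    # including the cost of the move itself, beats staying at i by
    # more than the acceptable completion overhead
    return t_j + t_move < t_i * overhead

print(should_move(t_i=100.0, t_j=60.0, t_move=20.0))  # → True  (80 < 95)
print(should_move(t_i=100.0, t_j=90.0, t_move=10.0))  # → False (100 > 95)
```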
Mobility decision
• homogeneous case
TBase = time to complete whole task on unloaded core
Loadi,t = load on location i at time t
• assume we inspect load having completed P% of task
Ti = TBase*(100-P)/100*f(Loadi,t,Loadi,0)
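A sketch of this remaining-time estimate; the slide leaves f unspecified, so the load-scaling function below is purely illustrative:

```python
def remaining_time(t_base, pct_done, load_now, load_start,
                   f=lambda l_t, l_0: (1 + l_t) / (1 + l_0)):
    # Ti = TBase*(100-P)/100 * f(Load_t, Load_0): the unloaded time for
    # the remaining (100-P)% of the task, scaled by how the location's
    # load has changed since the task started (f is an assumption here)
    return t_base * (100 - pct_done) / 100 * f(load_now, load_start)

# 40% done on a core whose load has tripled since the task started
print(remaining_time(t_base=200.0, pct_done=40,
                     load_now=3.0, load_start=1.0))  # → 240.0
```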
Evaluation
[Heat-map figures: 16 workers + 10 tasks, without mobility (left) and with mobility (right)]
• 7,000×7,000 matrix multiplication
• Poisson load
• darker == more heavily loaded
Conclusions
• simple models + relative measures suffice for static & dynamic load balancing
• data parallel/regular parallelism only
• small scale experiments
• next:
– larger scale experiments
– explore more complex algorithms
– extend mobility to many core
References
• T. Al Salkini and G. Michaelson, `Dynamic Farm Skeleton Task Allocation Through Task Mobility’, 18th International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'12), Las Vegas, July 16-19, 2012
• K. Armih, G. Michaelson and P. Trinder, `Cache size in a cost model for heterogeneous skeletons', Proceedings of HLPP 2011: 5th ACM SIGPLAN Workshop on High-Level Parallel Programming and Applications, Tokyo, September 2011
• X. Y. Deng, P. Trinder and G. Michaelson, `Cost-Driven Autonomous Mobility', Computer Languages, Systems and Structures, 36(1) (April 2010) pp 34-51, 2010