Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid
-
Upload
olivia-russo -
Category
Documents
-
view
31 -
download
0
description
Transcript of Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid
![Page 1: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/1.jpg)
Fault tolerance, malleability and migration for divide-and-conquer
applications on the Grid
Gosia Wrzesińska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal
vrije Universiteit Ibis
![Page 2: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/2.jpg)
Distributed supercomputing
• Parallel processing on geographically distributed computing systems (grids)
• Needed:– Fault-tolerance: survive node crashes
– Malleability: add or remove machines at runtime
– Migration: move a running application to another set of machines
• We focus on divide-and-conquer applications
LeidenDelft
Brno
Internet
Berlin
![Page 3: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/3.jpg)
Outline
• The Ibis grid programming environment• Satin: a divide-and-conquer framework• Fault-tolerance, malleability and migration in Satin• Performance evaluation
![Page 4: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/4.jpg)
The Ibis system• Java-centric => portability
– „write once, run anywhere”
• Efficient communication– Efficient pure Java implementation– Optimized solutions for special cases
• High level programming models:– Divide & Conquer (Satin)– Remote Method Invocation (RMI)– Replicated Method Invocation (RepMI)– Group Method Invocation (GMI)
http://www.cs.vu.nl/ibis/
![Page 5: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/5.jpg)
Satin: divide-and-conquer on the Grid• Performs excellent on the Grid
– Hierarchical: fits hierarchical platforms
– Java-based: can run on heterogeneous resources
– Grid-friendly load balancing: Cluster-aware Random Stealing [van Nieuwpoort et al., PPoPP 2001]
fib(1) fib(0) fib(0)
fib(0)
fib(4)
fib(1)
fib(2)
fib(3)
fib(3)
fib(5)
fib(1) fib(1)
fib(1)
fib(2) fib(2)cpu 2
cpu 1cpu 3
cpu 1
• Missing support for– Fault tolerance
– Malleability
– Migration
![Page 6: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/6.jpg)
Example application: Fibonacci
Also: Barnes-Hut, Raytracer, SAT solver, Tsp, Knapsack...
Fib(3) Fib(2)
Fib(4) Fib(3)
Fib(2)
Fib(1)
Fib(1)
Fib(0)
Fib(1) Fib(0) Fib(1)
Fib(2) Fib(1)
Fib(0)
Fib(5)
processor 1(master)
processor 2
processor 3 2
5
8
![Page 7: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/7.jpg)
Fault-tolerance, malleability, migration
• Can be implemented by handling processors joining or leaving the ongoing computation
• Processors may leave either unexpectedly (crash) or gracefully
• Handling joining processors is trivial:– Let them start stealing jobs
• Handling leaving processors is harder:– Recompute missing jobs
– Problems: orphan jobs, partial results from gracefully leaving processors
![Page 8: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/8.jpg)
8
Crashing processors
1
12
3
6 7
13
14
10 11
4
2
5
9
15
processor 1
processor 2
processor 3
8
![Page 9: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/9.jpg)
Crashing processors
1
3
6 74
9
15
processor 1
processor 3
8
14
12 13
![Page 10: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/10.jpg)
Crashing processors
1
3
6 74
9
15
processor 1
processor 3
2
8
14
12 13
![Page 11: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/11.jpg)
Crashing processors
1
3
6 74
9
15
processor 1
processor 3
?
8
Problem: orphan jobs – jobs stolen from crashed processors
14
12 13
2
![Page 12: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/12.jpg)
Crashing processors
1
3
6 74
9
15
processor 1
processor 3
2?
4 5
8
Problem: orphan jobs – jobs stolen from crashed processors
14
12 13
![Page 13: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/13.jpg)
Handling orphan jobs• For each finished orphan, broadcast (jobID,processorID) tuple; abort
the rest
• All processors store tuples in orphan tables
• Processors perform lookups in orphan tables for each recomputed job
• If successful: send a result request to the owner (async), put the job on a stolen jobs list
processor 3
broadcast(9,cpu3)(15,cpu3)
14
4
9
15
8
![Page 14: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/14.jpg)
Handling orphan jobs - example
1
3
6 7
11
4
2
5
9
15
processor 1
processor 2
processor 3
8
14
10 12 13
![Page 15: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/15.jpg)
Handling orphan jobs - example
1
3
6 74
9
15
processor 1
processor 3
8
14
12 13
![Page 16: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/16.jpg)
Handling orphan jobs - example
1
3
6 74
9
15
processor 1
processor 3
8
14
12 13
2
![Page 17: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/17.jpg)
Handling orphan jobs - example
1
3
6 7
9
15
processor 1
processor 3
(9, cpu3)(15,cpu3)
9 cpu315 cpu3
12 13
2
![Page 18: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/18.jpg)
Handling orphan jobs - example
1
3
6 74
2
5
9
15
processor 1
processor 3
9 cpu315 cpu3
12 13
![Page 19: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/19.jpg)
Handling orphan jobs - example
1
3
6 74
2
5
9
15
processor 1
processor 3
9 cpu315 cpu3
12 13
![Page 20: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/20.jpg)
Processors leaving gracefully
1
3
6 7
11
4
2
5
9
15
processor 1
processor 2
processor 3
8
14
10 12 13
![Page 21: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/21.jpg)
Processors leaving gracefully
1
3
6 7
11
4
2
5
9
15
processor 1
processor 2
processor 3
8
Send results to another processor; treat those results as orphans14
10 12 13
![Page 22: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/22.jpg)
Processors leaving gracefully
1
3
6 7
11
4
9
15
processor 1
processor 3
8
14
12 13
![Page 23: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/23.jpg)
Processors leaving gracefully
1
3
6 7
119
15
processor 1
processor 3(11,cpu3)(9,cpu3)(15,cpu3)
11 cpu39 cpu3
15 cpu3
12 13
2
![Page 24: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/24.jpg)
Processors leaving gracefully
1
3
6 7
11
2
5
9
15
processor 1
processor 3
4
11 cpu39 cpu3
15 cpu3
12 13
![Page 25: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/25.jpg)
Processors leaving gracefully
1
3
6 7
11
2
5
9
15
processor 1
processor 3
4
11 cpu39 cpu3
15 cpu3
12 13
![Page 26: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/26.jpg)
Some remarks about scalability
• Little data is broadcast (< 1% jobs)
• We broadcast pointers
• Message combining
• Lightweight broadcast: no need for reliability, synchronization, etc.
![Page 27: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/27.jpg)
Performance evaluation
• Leiden, Delft (DAS-2) + Berlin, Brno (GridLab)• Bandwidth:
62 – 654 Mbit/s• Latency:
2 – 21 ms
![Page 28: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/28.jpg)
Impact of saving partial results
0
50
100
150
200
250
300
350
400
450
wide-area DAS-2 GridLab
runt
ime
(sec
.)
1 cluster leavesunexpectedly (withoutsaving orphans)
1 cluster leavesunexpectedly (withsaving orphans)
1 cluster leavesgracefully
1.5/3.5 clusters (nocrashes)
16 cpus Leiden 16 cpus Delft
8 cpus Leiden, 8 cpus Delft4 cpus Berlin, 4 cpus Brno
![Page 29: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/29.jpg)
Migration overhead
0
50
100
150
200
250
300
350
1
runt
ime
(sec
.)
with migration
without migration
8 cpus Leiden 4 cpus Berlin4 cpus Brno(Leiden cpus replaced by Delft)
![Page 30: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/30.jpg)
Crash-free execution overhead
0
5
10
15
20
25
30
35
Raytracer TSP SAT solver Knapsack
Spee
dup
on 3
2 cp
us
plain Satin malleable Satin
Used: 32 cpus in Delft
![Page 31: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/31.jpg)
Summary
• Satin implements fault-tolerance, malleability and migration for divide-and-conquer applications
• Save partial results by repairing the execution tree• Applications can adapt to changing numbers of cpus
and migrate without loss of work (overhead < 10%)• Outperform traditional approach by 25%• No overhead during crash-free execution
![Page 32: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/32.jpg)
Further information
http://www.cs.vu.nl/ibis/
Publications and a software distribution available at:
![Page 33: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/33.jpg)
Additional slides
![Page 34: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/34.jpg)
Ibis design
![Page 35: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/35.jpg)
Partial results on leaving cpus
If processors leave gracefully:
• Send all finished jobs to another processor
• Treat those jobs as orphans = broadcast (jobID, processorID) tuples
• Execute the normal crash recovery procedure
![Page 36: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/36.jpg)
A crash of the master
• Master: the processor that started the computation by spawning the root job
• Remaining processors elect a new master• At the end of the crash recovery procedure the
new master restarts the application
![Page 37: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/37.jpg)
Job identifiers
• rootId = 1
• childId = parentId * branching_factor + child_no
• Problem: need to know maximal branching factor of the tree
• Solution: strings of bytes, one byte per tree level
![Page 38: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/38.jpg)
Distributed ASCI Supercomputer (DAS) – 2
Node configuration
Dual 1 GHz Pentium-III>= 1 GB memory100 Mbit Ethernet +(Myrinet)Linux
VU (72 nodes)UvA (32)
Leiden (32) Delft (32)
GigaPort(1-10 Gb)
Utrecht (32)
![Page 39: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/39.jpg)
Compiling/optimizing programs
Javacompiler
bytecoderewriter JVM
JVM
JVM
source bytecodebytecode
• Optimizations are done by bytecode rewriting– E.g. compiler-generated serialization (as in Manta)
![Page 40: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/40.jpg)
GridLab testbed
interface FibInter extends ibis.satin.Spawnable {
public int fib(long n);}
class Fib extends ibis.satin.SatinObject implements FibInter { public int fib (int n) {
if (n < 2) return n;int x = fib (n - 1);int y = fib (n - 2);sync();return x + y;
}}
Java + divide&conquer
Example
![Page 41: Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid](https://reader036.fdocuments.net/reader036/viewer/2022062422/56813507550346895d9c58f6/html5/thumbnails/41.jpg)
Grid results
Program sites CPUs Efficiency
Raytracer 5 40 81 %
SAT-solver 5 28 88 %
Compression 3 22 67 %
• Efficiency based on normalization to single CPU type (1GHz P3)