Idempotent Work Stealing
1
Idempotent Work Stealing
Maged M. Michael, Martin T. Vechev, Vijay A. Saraswat
PPoPP’09
2
Outline
◦ Memory Operations Reordering
◦ Problem Definition – Idempotent Work-Stealing
◦ The algorithms
◦ Comparison to Previous Work
◦ Summary
3
Memory Operations Reordering
Some architectures reorder memory accesses to achieve faster execution
A good optimization for uniprocessors… but it may be dangerous on multiprocessors
Pairs of operations shown on the slide:
◦ read(a), read(b)
◦ write(a,1), write(b,2)
◦ read(a), write(b,2)
◦ write(a,1), read(b)
4
Memory Operations Reordering
Memory: a = 0; b = 0;
P1:
L1: if (read(a) = 0) goto L1
    print(read(b))
P2:
    write(b, 7)
    write(a, 1)
Expected output of P1?
What happens if P2 changes the order of memory stores?
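The question on this slide can be explored with a small model. The sketch below is a Python simulation, not real hardware behavior (Python exposes no memory fences). It assumes that P1's two loads stay in program order and that P2's stores become visible one at a time. The helper names `memory_after` and `p1_outputs` are made up for illustration.

```python
from itertools import permutations

def memory_after(order, committed):
    """Memory contents once the first `committed` stores of `order` are visible."""
    memory = {"a": 0, "b": 0}
    for name, value in order[:committed]:
        memory[name] = value
    return memory

def p1_outputs(allow_store_reordering):
    """Set of values P1 can print in the modeled example."""
    stores = (("b", 7), ("a", 1))  # P2's stores in program order
    orders = list(permutations(stores)) if allow_store_reordering else [stores]
    outputs = set()
    for order in orders:
        n = len(order)
        for i in range(n + 1):          # point at which P1 reads a
            if memory_after(order, i)["a"] == 0:
                continue                # P1 is still spinning at L1
            for j in range(i, n + 1):   # P1 reads b at the same or a later point
                outputs.add(memory_after(order, j)["b"])
    return outputs

print(sorted(p1_outputs(False)))  # [7]
print(sorted(p1_outputs(True)))   # [0, 7]
```

Under these assumptions, P1 always prints 7 when P2's stores keep their program order, and may print 0 once the two stores can be swapped.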
5
Memory Fences
Operations that synchronize memory accesses
X-Y fence: all preceding operations of type X must commit before any following operation of type Y starts
Example: store-load
read1
write1
store-load
write2
read2
store-store?
6
Memory Operations Reordering – With Memory Fences
Memory: a = 0; b = 0;
P1:
L1: if (read(a) = 0) goto L1
    print(read(b))
P2:
    write(b, 1)
    store-store
    write(a, 7)
7
Sequential Consistency
A model where:
◦ All processors see all memory operations in the same order
◦ That order must adhere to the program order of each thread
With reordering, memory operations are not sequentially consistent
This makes program verification difficult
8
Sequential Consistency vs. Linearizability
Linearizability is stronger than sequential consistency
If operation A is executed before operation B (in real time), then A precedes B in the order
(and not only for a single thread)
10
Problem Definition - Idempotence
Idempotence – the property of certain operations that they can be applied multiple times without changing the result (Wikipedia)
In other words: f(f(x)) = f(x)
Examples:
1. The absolute value function
2. The number 1 is idempotent under multiplication: 1 * 1 = 1
3. An SQL query (without updates)
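The definition f(f(x)) = f(x) can be checked mechanically on sample inputs. The helper `is_idempotent` below is a hypothetical name used only for this illustration, and checking a handful of samples is evidence rather than proof.

```python
def is_idempotent(f, samples):
    """True if applying f twice equals applying it once on every sample."""
    return all(f(f(x)) == f(x) for x in samples)

samples = [-3, -1, 0, 2, 5]
print(is_idempotent(abs, samples))              # True: abs(abs(x)) == abs(x)
print(is_idempotent(lambda x: x + 1, samples))  # False: incrementing twice differs
```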
11
Problem Definition – Work Stealing
A policy to divide procedure executions (jobs/tasks) efficiently among multiple processors
Each processor has a deque (double-ended queue) of jobs
[Figure: processors P1, P2 through Pk, each with its own deque of jobs]
12
Problem Definition – Work Stealing
Each processor can put a new job in its own queue
Each processor can take a job from its own queue
[Figure: processors P1, P2 through Pk, each with its own deque of jobs]
13
Problem Definition – Work Stealing
A processor without work can steal jobs from another processor
[Figure: processors P1, P2 through Pk, each with its own deque of jobs]
14
Work Stealing - Example
Fibonacci numbers – fib(7)
P1 – take() -> fib(7)
P1 – put(fib(6)), put(fib(5))
P1 – take() -> fib(6)
P2 – steal(P1)
P2 – take() -> fib(5)
P1 – put(fib(5)), put(fib(4))
P2 – put(fib(4)), put(fib(3))
P1 – take() -> fib(5)
P3 – steal(P1)
P3 – take() -> fib(4)
P2 – take() -> fib(4)
…
[Figure: the fib tasks held in the deques of P1, P2 and P3]
15
Well…
Work stealing seems like a good idea… but it can be expensive, because it uses:
1. Locks
2. Atomic read-modify-write operations
3. Memory ordering fences
Previous work-stealing algorithms use strong synchronization primitives
Can work-stealing algorithms for idempotent tasks avoid using synchronization primitives?
16
The answer
Not exactly…
Our goal:
◦ Making work stealing cheap when jobs are idempotent
How?
◦ Making the owner’s operations (“put”, “take”) cheap, while “steal” remains expensive
17
The Chase-Lev algorithm
A snippet of the Chase-Lev algorithm:
Task take() {
1. b := bottom;
2. CircularArray a = activeArray;
3. b = b - 1;
4. bottom = b;
5. t = top;
… }
Fence: store-load (between the store to bottom in line 4 and the load of top in line 5)
19
The algorithms
We will see 3 algorithms
All algorithms insert (put) jobs at the tail
1. Idempotent LIFO – extracting tasks (take/steal) from the tail
2. Idempotent FIFO – extracting tasks (take/steal) from the head
3. Idempotent double-ended – the owner takes tasks from the tail, and the others steal from the head
20
1) Idempotent LIFO
Each processor has:
◦ Dynamic array of tasks
◦ A capacity variable
◦ An anchor (tail index)
Example state: capacity = 7, anchor = 0
Insert at the tail; take/steal from the tail
21
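Idempotent LIFO – put(task)
State: capacity = 7, anchor = 0, then 1 after put(task1)
void put(Task task) {
1. t := anchor;
2. if (t = capacity) { expand(); goto 1; }
3. tasks[t] := task;
4. anchor := t + 1;
}
Fence: store-store (orders the task write in line 3 before the anchor update in line 4)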
22
Idempotent LIFO – take()
State: tasks = task1, task2, task3; capacity = 7, anchor = 3, then 2 after take()
Task take() {
1. t := anchor;
2. if (t = 0) return EMPTY;
3. task := tasks[t - 1];
4. anchor := t - 1;
5. return task;
}
23
Idempotent LIFO – steal()
State: tasks = task1, task2, task3; capacity = 7, anchor = 3, then 2 after steal()
Task steal() {
1. t := anchor;
2. if (t = 0) return EMPTY;
3. a := tasks;
4. task := a[t - 1];
5. if !CAS(anchor, t, t-1) goto 1;
6. return task;
}
Fences: load-load, load-CAS
Why must tasks be idempotent?
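The three LIFO operations above can be collected into one small model. This is a sequential Python sketch of the slide pseudocode, not a concurrent implementation: Python has no memory fences, so the fences appear only as comments, and `_cas_anchor` is an ordinary function standing in for an atomic compare-and-swap. The class name `IdempotentLIFO`, the `EMPTY` sentinel and the doubling growth policy in `put` are assumptions made for illustration.

```python
EMPTY = None  # hypothetical sentinel for "no task available"

class IdempotentLIFO:
    def __init__(self, capacity=7):
        self.tasks = [None] * capacity  # capacity is len(self.tasks)
        self.anchor = 0                 # tail index

    def _cas_anchor(self, expected, new):
        # Stand-in for an atomic CAS; atomic on real hardware, plain code here.
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def put(self, task):  # owner only
        t = self.anchor
        if t == len(self.tasks):
            # expand(): assumed here to double the array
            self.tasks = self.tasks + [None] * len(self.tasks)
        self.tasks[t] = task
        # a store-store fence would go here
        self.anchor = t + 1

    def take(self):  # owner only, no CAS and no fence
        t = self.anchor
        if t == 0:
            return EMPTY
        task = self.tasks[t - 1]
        self.anchor = t - 1
        return task

    def steal(self):  # thieves
        while True:  # "goto 1" on CAS failure
            t = self.anchor
            if t == 0:
                return EMPTY
            a = self.tasks
            task = a[t - 1]
            if self._cas_anchor(t, t - 1):
                return task

q = IdempotentLIFO()
for name in ("task1", "task2", "task3"):
    q.put(name)
print(q.take(), q.steal(), q.take())  # task3 task2 task1
```

Only `steal` pays for a CAS; `put` and `take` use plain reads and writes, which is the cost split the slides describe.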
24
Idempotent tasks
State: tasks = task1, task2, task3; capacity = 7, anchor = 3, then 2
Task take() {
1. t := anchor;
2. if (t = 0) return EMPTY;
3. task := tasks[t - 1];
4. anchor := t - 1;
5. return task;
}
Task steal() {
1. t := anchor;
2. if (t = 0) return EMPTY;
3. a := tasks;
4. task := a[t - 1];
5. if !CAS(anchor, t, t-1) goto 1;
6. return task;
}
The owner's take() and a concurrent steal() can both read t = 3 and task = task3, so task3 is returned twice
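One interleaving that returns the same task twice can be replayed step by step. The sketch below is a deterministic Python walk-through under an assumed schedule (the owner pauses after line 3 of take() while a thief runs steal() to completion). It is not a threaded program, and the names `cas` and `duplicated_extraction` are hypothetical.

```python
def cas(state, key, expected, new):
    """Stand-in for an atomic compare-and-swap on state[key]."""
    if state[key] != expected:
        return False
    state[key] = new
    return True

def duplicated_extraction():
    state = {"anchor": 3, "tasks": ["task1", "task2", "task3"]}

    # Owner: take() lines 1-3
    t_owner = state["anchor"]                 # t = 3
    owner_task = state["tasks"][t_owner - 1]  # task3

    # Thief: steal() lines 1-5, run to completion
    t_thief = state["anchor"]                 # t = 3
    a = state["tasks"]
    thief_task = a[t_thief - 1]               # task3
    cas_ok = cas(state, "anchor", t_thief, t_thief - 1)  # anchor 3 -> 2

    # Owner resumes: take() line 4
    state["anchor"] = t_owner - 1             # anchor = 2 (unchanged)

    return owner_task, thief_task, cas_ok, state["anchor"]

print(duplicated_extraction())  # ('task3', 'task3', True, 2)
```

Both sides return task3, which is harmless only if executing task3 twice is acceptable, that is, if tasks are idempotent.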
25
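Preventing ABA
How is ABA possible?
State: tasks = task1, task2, task3; capacity = 7, anchor = 3
Owner: take(); put(taskX); … put(taskY);
Task steal() {
1. t := anchor;
2. if (t = 0) return EMPTY;
3. a := tasks;
4. task := a[t - 1];
5. if !CAS(anchor, t, t-1) goto 1;
6. return task;
}
The thief reads t = 3 and task = task3. The anchor then goes from 3 to 2 (owner's take), back to 3 (put(taskX)), and to 2 again when the thief's CAS succeeds
taskX is lost!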
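The ABA scenario can also be replayed deterministically. This Python sketch assumes the schedule suggested by the slide: the thief stalls between lines 4 and 5 of steal() while the owner runs take() and put(taskX), then the thief's CAS runs, then the owner calls put(taskY). The class `UntaggedLIFO` and the function `aba_scenario` are hypothetical names, and the array has a fixed capacity for brevity.

```python
class UntaggedLIFO:
    """Sequential model of the LIFO queue whose anchor is a bare tail index."""

    def __init__(self, tasks, capacity=7):
        self.tasks = list(tasks) + [None] * (capacity - len(tasks))
        self.anchor = len(tasks)

    def cas_anchor(self, expected, new):
        # Stand-in for an atomic CAS.
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def put(self, task):
        t = self.anchor
        self.tasks[t] = task
        self.anchor = t + 1

    def take(self):
        t = self.anchor
        if t == 0:
            return None
        task = self.tasks[t - 1]
        self.anchor = t - 1
        return task

def aba_scenario():
    q = UntaggedLIFO(["task1", "task2", "task3"])
    t = q.anchor                   # thief, lines 1-4: t = 3
    stolen = q.tasks[t - 1]        # task3
    taken = q.take()               # owner: task3, anchor = 2
    q.put("taskX")                 # owner: anchor = 3 again
    cas_ok = q.cas_anchor(t, t - 1)  # thief, line 5: succeeds, anchor = 2
    q.put("taskY")                 # owner: overwrites the slot holding taskX
    remaining = []
    task = q.take()
    while task is not None:
        remaining.append(task)
        task = q.take()
    return stolen, taken, cas_ok, remaining

print(aba_scenario())
# ('task3', 'task3', True, ['taskY', 'task2', 'task1'])
```

The CAS cannot tell that the anchor left 3 and came back, so taskX never appears among the extracted tasks.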
26
Preventing ABA
How can we prevent it?
anchor: <integer, integer>; // <tail, tag>
void put(Task task) {
1. <t,tag> := anchor;
2. if (t = capacity) { expand(); goto 1; }
3. tasks[t] := task;
4. anchor := <t + 1, tag + 1>;
}
Task steal() {
1. <t,tag> := anchor;
2. if (t = 0) return EMPTY;
3. a := tasks;
4. task := a[t - 1];
5. if !CAS(anchor, <t,tag>, <t-1,tag>) goto 1;
6. return task;
}
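With the tag in place, the stale CAS from the ABA schedule fails. The Python sketch below replays that schedule against a tagged anchor, modeled as a `(tail, tag)` tuple. The slide does not show take() for the tagged version, so this sketch assumes it leaves the tag unchanged. `TaggedLIFO` and `tagged_scenario` are hypothetical names and the capacity is fixed for brevity.

```python
class TaggedLIFO:
    """Sequential model of the LIFO queue with anchor = (tail, tag)."""

    def __init__(self, tasks, capacity=7):
        self.tasks = list(tasks) + [None] * (capacity - len(tasks))
        self.anchor = (len(tasks), 0)

    def cas_anchor(self, expected, new):
        # Stand-in for an atomic CAS over the whole <tail, tag> pair.
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def put(self, task):
        t, tag = self.anchor
        self.tasks[t] = task
        self.anchor = (t + 1, tag + 1)  # every put changes the tag

    def take(self):
        t, tag = self.anchor
        if t == 0:
            return None
        task = self.tasks[t - 1]
        self.anchor = (t - 1, tag)      # assumption: tag is left unchanged
        return task

def tagged_scenario():
    q = TaggedLIFO(["task1", "task2", "task3"])
    t, tag = q.anchor                   # thief reads <3, 0>
    q.take()                            # owner: anchor = <2, 0>
    q.put("taskX")                      # owner: anchor = <3, 1>
    first_cas = q.cas_anchor((t, tag), (t - 1, tag))   # fails: tag differs
    t, tag = q.anchor                   # thief retries ("goto 1"): <3, 1>
    task = q.tasks[t - 1]               # taskX
    second_cas = q.cas_anchor((t, tag), (t - 1, tag))  # succeeds
    return first_cas, second_cas, task, q.anchor

print(tagged_scenario())  # (False, True, 'taskX', (2, 1))
```

On the retry the thief reads the current slot, so taskX is extracted instead of being dropped.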
27
2) Idempotent FIFO
Each processor has:
◦ Dynamic cyclic array of tasks
◦ A capacity variable
◦ Head index (always increasing)
◦ Tail index (always increasing)
Example state: tasks = task2, task3, task4; capacity = 7, head = 1, tail = 4
Insert at the tail; take/steal from the head
28
Idempotent FIFO – put(task)
State: tasks = task2, task3, task4, task5; capacity = 7, head = 1, tail = 4, then 5 after put(task5)
void put(Task task) {
1. h := head;
2. t := tail;
3. if (t = h + tasks.capacity) { expand(); goto 1; }
4. tasks.array[t%tasks.capacity] := task;
5. tail := t + 1;
}
Fence: store-store
29
Idempotent FIFO – take()
State: tasks = task2, task3, task4, task5; capacity = 7, head = 1, then 2 after take(); tail = 4
Task take() {
1. h := head;
2. t := tail;
3. if (h = t) return EMPTY;
4. task := tasks.array[h%tasks.capacity];
5. head := h + 1;
6. return task;
}
30
Idempotent FIFO – steal()
State: tasks = task2, task3, task4, task5; capacity = 7, head = 1, then 2 after steal(); tail = 4
Task steal() {
1. h := head;
2. t := tail;
3. if (h = t) return EMPTY;
4. a := tasks;
5. task := a.array[h%a.capacity];
6. if !CAS(head, h, h+1) goto 1;
7. return task;
}
Fences: load-load, load-CAS, load-load
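The FIFO variant can be modeled the same way. The Python sketch below is sequential: fences are only comments and `_cas_head` is a plain function standing in for an atomic CAS. The class name `IdempotentFIFO`, the `EMPTY` sentinel and the doubling `_expand` helper are assumptions, since the slides do not show expand().

```python
EMPTY = None  # hypothetical sentinel for "no task available"

class IdempotentFIFO:
    def __init__(self, capacity=7):
        self.array = [None] * capacity  # cyclic array
        self.head = 0                   # always increasing
        self.tail = 0                   # always increasing

    def _cas_head(self, expected, new):
        # Stand-in for an atomic CAS on head.
        if self.head != expected:
            return False
        self.head = new
        return True

    def _expand(self):
        # Assumed policy: double the array, keeping index i at slot i % capacity.
        old = self.array
        new = [None] * (2 * len(old))
        for i in range(self.head, self.tail):
            new[i % len(new)] = old[i % len(old)]
        self.array = new

    def put(self, task):  # owner only
        h, t = self.head, self.tail
        if t == h + len(self.array):
            self._expand()
        self.array[t % len(self.array)] = task
        # a store-store fence would go here
        self.tail = t + 1

    def take(self):  # owner only
        h, t = self.head, self.tail
        if h == t:
            return EMPTY
        task = self.array[h % len(self.array)]
        self.head = h + 1
        return task

    def steal(self):  # thieves
        while True:  # "goto 1" on CAS failure
            h, t = self.head, self.tail
            if h == t:
                return EMPTY
            a = self.array
            task = a[h % len(a)]
            if self._cas_head(h, h + 1):
                return task

q = IdempotentFIFO()
for name in ("task2", "task3", "task4"):
    q.put(name)
print(q.take(), q.steal(), q.take())  # task2 task3 task4
```

Because head and tail only grow, the slot of a task is always its index modulo the current capacity, which is what `_expand` has to preserve.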
31
3) Idempotent double-ended
Each processor has:
◦ Dynamic cyclic array of tasks
◦ A capacity variable
◦ An anchor (head, size)
Example state: tasks = task2, task3, task4; capacity = 7, anchor = <1, 3>
Insert at the tail; take from the tail; steal from the head
32
Idempotent double-ended – put(task)
State: tasks = task2, task3, task4, task5; capacity = 7, anchor = <1, 3>, then <1, 4> after put(task5)
void put(Task task) {
1. <h, s> := anchor;
2. if (s = tasks.capacity) { expand(); goto 1; }
3. tasks.array[(h+s)%tasks.capacity] := task;
4. anchor := <h, s + 1>;
}
Fence: store-store
33
Idempotent double-ended – take()
State: tasks = task2, task3, task4, task5; capacity = 7, anchor = <1, 4>, then <1, 3> after take()
Task take() {
1. <h, s> := anchor;
2. if (s = 0) return EMPTY;
3. task := tasks.array[(h+s-1)%tasks.capacity];
4. anchor := <h, s - 1>;
5. return task;
}
34
Idempotent double-ended – steal()
State: tasks = task2, task3, task4, task5; capacity = 7, anchor = <1, 4>, then <2, 3> after steal()
Task steal() {
1. <h, s> := anchor;
2. if (s = 0) return EMPTY;
3. a := tasks;
4. task := a.array[h%a.capacity];
5. h2 := (h + 1) % a.capacity;
6. if !CAS(anchor, <h,s>, <h2,s-1>) goto 1;
7. return task;
}
Fences: load-load, load-CAS
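The double-ended variant packs head and size into one anchor, so a thief's CAS covers both. The Python sketch below follows the slide pseudocode sequentially, with the anchor as a `(head, size)` tuple and `_cas_anchor` as a plain function standing in for an atomic CAS. The names `IdempotentDeque`, `EMPTY` and `_expand`, and the doubling growth policy, are assumptions for illustration.

```python
EMPTY = None  # hypothetical sentinel for "no task available"

class IdempotentDeque:
    def __init__(self, capacity=7):
        self.array = [None] * capacity  # cyclic array
        self.anchor = (0, 0)            # <head, size>

    def _cas_anchor(self, expected, new):
        # Stand-in for an atomic CAS over the whole <head, size> pair.
        if self.anchor != expected:
            return False
        self.anchor = new
        return True

    def _expand(self):
        # Assumed policy: double the array and re-place the live tasks.
        h, s = self.anchor
        old = self.array
        new = [None] * (2 * len(old))
        for i in range(s):
            new[(h + i) % len(new)] = old[(h + i) % len(old)]
        self.array = new

    def put(self, task):  # owner only, inserts at the tail
        h, s = self.anchor
        if s == len(self.array):
            self._expand()
        self.array[(h + s) % len(self.array)] = task
        # a store-store fence would go here
        self.anchor = (h, s + 1)

    def take(self):  # owner only, removes from the tail
        h, s = self.anchor
        if s == 0:
            return EMPTY
        task = self.array[(h + s - 1) % len(self.array)]
        self.anchor = (h, s - 1)
        return task

    def steal(self):  # thieves, remove from the head
        while True:  # "goto 1" on CAS failure
            h, s = self.anchor
            if s == 0:
                return EMPTY
            a = self.array
            task = a[h % len(a)]
            h2 = (h + 1) % len(a)
            if self._cas_anchor((h, s), (h2, s - 1)):
                return task

d = IdempotentDeque(capacity=4)
for name in ("a", "b", "c"):
    d.put(name)
print(d.take(), d.steal(), d.anchor)  # c a (1, 1)
```

The owner works at the tail with plain writes while thieves advance the head through CAS, so the two ends only meet when the deque is nearly empty.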
36
Experimental evaluation
Compared against the “Chase-Lev” and “Cilk THE” algorithms (after adding memory fences)
Benchmarks:
◦ Micro – the common case – take() and put()
◦ Irregular Graph Applications
37
Micro-benchmarks
2 Scenarios:
◦ Both puts and takes (10^6 ops for each type)
◦ Only takes (10^6 ops) – pre-populating the work-queues
40
Irregular Graph Applications
Based on the SIMPLE framework
2D Torus Graph:
◦ Vertices – on the torus
◦ Each vertex connected to its 4 neighbors
Build a spanning tree
41
2D-Torus
Up to 6% redundant work
43
Summary
Memory operations reordering improves execution times
◦ Use with care in multi-processors
“Idempotent Work-Stealing” is useful for some workloads
Idempotent LIFO gives good results for all benchmarks
44
Thank You!
Questions?