Embracing Explicit Communication in Work- Stealing Runtime...
![Page 1: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/1.jpg)
Embracing Explicit Communication in Work-Stealing Runtime Systems
Andreas Prell
![Page 2: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/2.jpg)
Manycore Processors
• “Cluster-on-chip” architectures
• Increasing thread- and data-level parallelism
• Growing importance of scalable communication
Left: S. Bell et al., TILE64™ Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008
Center: T. Mattson et al., The 48-Core SCC Processor: The Programmer’s View, SC 2010
Right: A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro 2016
![Page 3: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/3.jpg)
From Threads to Tasks
Make it easy to express fine-grained task parallelism
Threads:
int recurse(int n) {
    if (n < 2) return base_case();
    int x;
    std::thread t([&] { x = recurse(n-1); });
    int y = recurse(n-2);
    t.join();
    return x + y;
}
![Page 4: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/4.jpg)
From Threads to Tasks
Make it easy to express fine-grained task parallelism
Tasks:
int recurse(int n) {
    if (n < 2) return base_case();
    int x = spawn recurse(n-1);
    int y = recurse(n-2);
    sync;
    return x + y;
}
![Page 5: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/5.jpg)
From Threads to Tasks
Runtime system manages parallel execution
[Figure: a task pool and a thread pool]
![Page 6: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/6.jpg)
Central versus Distributed Task Pools
[Figure: speedup over sequential execution (0 to 25) versus number of threads (0 to 48) for GCC 4.9.1 and ICC 14.0.1; an annotation in the plot reads "223x"]
GCC: GNU libgomp, ICC: Intel OpenMP RTL
Benchmark: UTS T3L (binomial tree of ~111 million nodes)
![Page 7: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/7.jpg)
Load Balancing through Work Stealing
[Animation over slides 7 to 21: workers W1, W2, W3, each with a deque; tasks A, B, C]
Operations shown, in order: push B, push C, pop, pop, pop, steal
Shared deques: a thief steals directly from the victim's deque
Private deques: only the owner pops from its deque; a thief sends a Steal Request, and the steal is then carried out for it (in the animation, task A moves to the thief and B stays behind)
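The private-deque variant above can be sketched in a few lines of C. This is a minimal illustration, not the thesis implementation: the type `PrivateDeque`, the fixed capacity, and the function names are assumptions made for this example. Because only the owner touches the deque, the sketch needs no locks or atomics; a steal is simply the owner removing the oldest task for a thief.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEQUE_CAP 64  /* hypothetical fixed capacity for this sketch */

/* A private deque: only the owning worker reads or writes it. */
typedef struct {
    int tasks[DEQUE_CAP]; /* task IDs stand in for real task records */
    size_t head;          /* index of the oldest task; steals are served here */
    size_t tail;          /* one past the newest task; owner pushes and pops here */
} PrivateDeque;

void deque_init(PrivateDeque *dq) { dq->head = dq->tail = 0; }

size_t deque_size(const PrivateDeque *dq) { return dq->tail - dq->head; }

/* Owner: push a newly spawned task at the tail. */
bool deque_push(PrivateDeque *dq, int task) {
    if (dq->tail == DEQUE_CAP) return false; /* full; no wrap-around in this sketch */
    dq->tasks[dq->tail++] = task;
    return true;
}

/* Owner: pop the most recently pushed task (LIFO order). */
bool deque_pop(PrivateDeque *dq, int *task) {
    if (dq->head == dq->tail) return false;
    *task = dq->tasks[--dq->tail];
    return true;
}

/* Owner, on behalf of a thief: remove the oldest task (FIFO end). */
bool deque_steal(PrivateDeque *dq, int *task) {
    if (dq->head == dq->tail) return false;
    *task = dq->tasks[dq->head++];
    return true;
}
```

Pushing A, B, C, popping once, and then serving one steal leaves B in the deque, which mirrors the state at the end of the animation.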
![Page 22: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/22.jpg)
Embracing Explicit Communication
![Page 23: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/23.jpg)
Requirements
• Work-stealing scheduling
• Explicit communication → Efficient message passing, private-access deques
• Task synchronization → Collective, individual
• Coarse-grained parallelism → Polling
• Fine-grained parallelism → Adaptive stealing strategies, granularity control
![Page 24: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/24.jpg)
Channels
Simple message passing abstraction:
bool channel_send(Channel *, void *, size_t);
bool channel_receive(Channel *, void *, size_t);
Bounded FIFO message queues
Building blocks: MPSC (multi-producer, single-consumer) and SPSC (single-producer, single-consumer) channels
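To make the channel abstraction concrete, here is a minimal sketch of a bounded SPSC channel with the two function signatures shown on the slide. Everything else is an assumption for illustration: the struct layout, the helper names `channel_alloc` and `channel_free`, the fixed message size, and the choice of a ring buffer with C11 atomics. Both operations are non-blocking and report failure through their return value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Bounded FIFO channel for one sender and one receiver (SPSC).
 * The layout is an assumption; the slides only show the two API functions. */
typedef struct {
    size_t cap;           /* capacity in messages */
    size_t msg_size;      /* fixed size of every message in bytes */
    _Atomic size_t head;  /* next slot to read; advanced only by the receiver */
    _Atomic size_t tail;  /* next slot to write; advanced only by the sender */
    unsigned char *buf;   /* cap * msg_size bytes of storage */
} Channel;

/* Hypothetical constructor: returns NULL on invalid arguments or allocation failure. */
Channel *channel_alloc(size_t msg_size, size_t cap) {
    if (msg_size == 0 || cap == 0) return NULL;
    Channel *ch = malloc(sizeof *ch);
    if (!ch) return NULL;
    ch->buf = malloc(msg_size * cap);
    if (!ch->buf) { free(ch); return NULL; }
    ch->cap = cap;
    ch->msg_size = msg_size;
    atomic_init(&ch->head, 0);
    atomic_init(&ch->tail, 0);
    return ch;
}

void channel_free(Channel *ch) {
    if (ch) { free(ch->buf); free(ch); }
}

/* Non-blocking send: copies one message in; returns false if the channel is full. */
bool channel_send(Channel *ch, void *data, size_t size) {
    assert(size == ch->msg_size);
    size_t tail = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&ch->head, memory_order_acquire);
    if (tail - head == ch->cap) return false;
    memcpy(ch->buf + (tail % ch->cap) * ch->msg_size, data, size);
    /* Release: the copied message becomes visible before the new tail. */
    atomic_store_explicit(&ch->tail, tail + 1, memory_order_release);
    return true;
}

/* Non-blocking receive: copies one message out; returns false if the channel is empty. */
bool channel_receive(Channel *ch, void *data, size_t size) {
    assert(size == ch->msg_size);
    size_t head = atomic_load_explicit(&ch->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&ch->tail, memory_order_acquire);
    if (head == tail) return false;
    memcpy(data, ch->buf + (head % ch->cap) * ch->msg_size, size);
    /* Release: the slot may be reused only after the copy is complete. */
    atomic_store_explicit(&ch->head, head + 1, memory_order_release);
    return true;
}
```

An MPSC channel would additionally have to serialize concurrent senders, for example with a lock or a compare-and-swap on the tail; that part is left out here.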
![Page 25: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/25.jpg)
Steal Requests
struct steal_request {
    Channel *chan;
    int thief;
    // ...
};
![Page 26: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/26.jpg)
Steal Requests
Two channels per worker: one MPSC, one SPSC
struct steal_request req = {
    .chan = ,
    .thief = ID,
    // ...
};
int i = select_victim();
channel_send( [i], &req, sizeof(req));
![Page 27: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/27.jpg)
Steal Requests
[Animation over slides 27 to 33: workers W1, W2, W3 exchanging steal requests]
With acknowledgments: a victim without tasks answers a Steal Request with "ACK: Nothing", and the thief sends another Steal Request
Idea: Eliminate ACKs by forwarding steal requests
![Page 34: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/34.jpg)
Steal Requests
Forwarding
• reduces number of messages
• facilitates asynchronous stealing
• improves performance
[Figure: execution time (ms, 400 to 1100) versus task length (µs, 0 to 10) for Acknowledging and Forwarding]
Benchmark: BPC with d = 10^5, n = 9, and t as shown (x-axis)
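The message savings from forwarding can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the runtime's code: the `hops` field, the ring-order victim selection in `next_victim`, the task counters standing in for deques, and the message counting are all hypothetical. A victim without tasks passes the request on instead of replying "ACK: Nothing"; once every other worker has been asked, the request returns to the thief.

```c
#include <assert.h>

#define NUM_WORKERS 3   /* matches W1..W3 on the slides */

/* Simplified steal request: the reply channel is omitted and the thief's ID
 * stands in for it (assumption made to keep the sketch short). */
struct steal_request {
    int thief;
    int hops;   /* hypothetical field: how many workers passed the request on */
};

/* Number of stealable tasks per worker; stands in for the private deques. */
static int num_tasks[NUM_WORKERS];

/* Hypothetical victim selection: next worker in ring order, skipping the thief. */
int next_victim(int current, int thief) {
    int next = (current + 1) % NUM_WORKERS;
    if (next == thief) next = (next + 1) % NUM_WORKERS;
    return next;
}

/* Deliver req to victim and keep forwarding until some worker has a task.
 * Returns the ID of the worker that served the request, or -1 if the request
 * went back to the thief unserved. *messages counts every message sent. */
int route_steal_request(struct steal_request *req, int victim, int *messages) {
    *messages = 1;                      /* thief -> first victim */
    for (;;) {
        if (num_tasks[victim] > 0) {
            num_tasks[victim]--;        /* victim takes a task for the thief */
            *messages += 1;             /* victim -> thief: the task */
            return victim;
        }
        /* No work: forward the request instead of sending "ACK: Nothing". */
        req->hops++;
        *messages += 1;                 /* to the next victim, or back to the thief */
        if (req->hops == NUM_WORKERS - 1) return -1;
        victim = next_victim(victim, req->thief);
    }
}

/* For comparison: with acknowledgments, every failed attempt costs a request
 * and an ACK, and the final successful attempt costs a request and a task. */
int messages_with_acks(int failed_attempts) {
    return 2 * failed_attempts + 2;
}
```

With one failed attempt before a successful steal, the forwarding path uses three messages where the acknowledging path uses four, and the difference grows by one message with every additional failed attempt.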
![Page 35: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/35.jpg)
Stealing Tasks
One task
[Figure: deque with markers H, T, and T’]
Send T or &T to thief
Share memory by communicating*
*A. Gerrand, https://blog.golang.org/share-memory-by-communicating, 13 July 2010
![Page 37: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/37.jpg)
Stealing Tasks
⌊n/2⌋ tasks
[Figure: deque with markers H, T, T’, and H’]
Send &H’ to thief
![Page 38: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/38.jpg)
Stealing Tasks
Task length 100 µs
[Figure: speedup over sequential execution (0 to 48) versus number of workers (0 to 48) for Steal-one and Steal-half]
Benchmark: SPC with n = 10^6 and t = 100 µs
![Page 39: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/39.jpg)
Stealing Tasks
Task length 10 µs
[Figure: speedup over sequential execution (0 to 48) versus number of workers (0 to 48) for Steal-one and Steal-half]
Benchmark: SPC with n = 10^6 and t = 10 µs
![Page 40: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/40.jpg)
Task Synchronization
![Page 41: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/41.jpg)
Task Barrier
With tasking.h:
#include <stdio.h>
#include "tasking.h"

ASYNC_VOID_DECL(puts, const char *s, s);

int main(void) {
    TASKING_INIT();
    ASYNC(puts, "Order");
    ASYNC(puts, "Undefined");
    TASKING_BARRIER();
    ASYNC(puts, "Last");
    TASKING_EXIT();
    return 0;
}

With OpenMP:
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            #pragma omp task
            puts("Order");
            #pragma omp task
            puts("Undefined");
        }
        #pragma omp barrier
        #pragma omp master
        {
            #pragma omp task
            puts("Last");
        }
    }
    return 0;
}
![Page 42: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/42.jpg)
Termination Detection with Steal Requests
Problem: Detect when all workers are idle without resorting to implicit communication
Idea: “Color” steal requests
→ Avoids separate control messages
→ Termination follows from forwarding
![Page 43: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/43.jpg)
Extended Steal Requests
struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle } state;
    // ...
};
![Page 44: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/44.jpg)
Extended Steal Requests
[Figure: request states working, idle, reg_idle]
![Page 45: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/45.jpg)
Extended Steal Requests
[Figure: request states working, idle, reg_idle]
// Manager receives req
switch (req.state) {
case idle:
    // Mark reg_idle
    break;
}
![Page 46: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/46.jpg)
Notifying the Manager
[Animation over slides 46 to 48: workers W1, W2 and the Manager; labels shown: Steal Request (reg_idle), then "Update!"]
![Page 49: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/49.jpg)
Updates
[Figure: message sequence between Worker i (events i1 to i3), Worker j (events j1, j2), and the Manager (events m1 to m3); labels shown: Steal request, Update, Task, (reg_idle)]
struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle, update } state;
    // ...
};
// Manager receives req
switch (req.state) {
case update:
    // ...
    break;
case idle:
    // ...
    break;
}
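One way the manager's bookkeeping could look is sketched below. The slides elide the bodies of both `case` branches, so everything inside them is an assumption: the `Manager` struct, the per-worker flag and counter, the meaning given to `update` (the named worker is no longer idle), and the rule that termination is reached when every worker is registered as idle. The `chan` field is left out to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_WORKERS 3   /* hypothetical worker count */

enum req_state { working, idle, reg_idle, update };

struct steal_request {
    int thief;              /* worker the request belongs to */
    enum req_state state;   /* the "color" of the request */
    /* Channel *chan omitted in this sketch */
};

/* Manager-side bookkeeping (assumption: one flag per worker plus a counter). */
typedef struct {
    bool registered[NUM_WORKERS];
    int num_idle;
} Manager;

void manager_init(Manager *m) {
    for (int i = 0; i < NUM_WORKERS; i++) m->registered[i] = false;
    m->num_idle = 0;
}

/* Process one request at the manager.
 * Returns true once every worker is registered as idle. */
bool manager_handle(Manager *m, struct steal_request *req) {
    assert(req->thief >= 0 && req->thief < NUM_WORKERS);
    switch (req->state) {
    case idle:
        /* Register the worker as idle and mark the request accordingly. */
        if (!m->registered[req->thief]) {
            m->registered[req->thief] = true;
            m->num_idle++;
        }
        req->state = reg_idle;
        break;
    case update:
        /* Assumed meaning: the worker received a task and is working again. */
        if (m->registered[req->thief]) {
            m->registered[req->thief] = false;
            m->num_idle--;
        }
        req->state = working;
        break;
    case working:
    case reg_idle:
        break;  /* nothing to record; the request is forwarded as usual */
    }
    return m->num_idle == NUM_WORKERS;
}
```

Because the state travels inside the steal request itself, the manager learns about idle workers through requests that are forwarded anyway, which is the point made by "Avoids separate control messages".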
![Page 50: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/50.jpg)
Task Barrier
Impact of explicit communication
[Figure: maximum task barrier latency (µs, 0 to 6000) versus number of workers N (60 to 240) for "Return steal request after N attempts" and "Return steal request after 10 attempts"]
![Page 51: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/51.jpg)
Task Barrier
Impact of explicit communication
[Figure: minimum task barrier latency (µs, 0 to 100) versus number of workers N (60 to 240) for "Steal requests + cancel after barrier", "Steal requests + worker 0 as manager", and "Intel OpenMP barrier"]
Worker 0 cancels a random worker’s steal request
No additional communication required
![Page 52: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/52.jpg)
Channel-based Futures
Spawn/sync:
int x = spawn f(n-1);
int y = f(n-2);
sync;
Futures:
future fx = FUTURE(f, n-1);
int y = f(n-2);
int x = AWAIT(fx, int);
![Page 53: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/53.jpg)
Channel-based Futures
future fx = FUTURE(f, n-1);
int y = f(n-2);
int x = AWAIT(fx, int);
FUTURE:
1. Allocates a one-element SPSC channel
2. Creates a task, passing the channel
3. Returns a handle to the channel (future)
AWAIT:
1. Waits for the task to send its result
2. Receives and returns the result
3. Frees or recycles the channel
While waiting, AWAIT tries to schedule other work to avoid idling
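The three steps of creating and awaiting a future can be sketched in a single-threaded model. This is an illustration under assumptions, not the `FUTURE`/`AWAIT` macros from the slides: the functions `future_create` and `future_await`, the `ResultChannel` struct standing in for a one-element SPSC channel, and the `pending` array standing in for the worker's deque are all hypothetical. Awaiting runs pending tasks until the result arrives, which models scheduling other work instead of idling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One-element channel carrying an int result (stand-in for an SPSC channel). */
typedef struct { bool full; int value; } ResultChannel;
typedef ResultChannel *future;

/* A pending task: function, argument, and the channel for its result. */
typedef struct { int (*fn)(int); int arg; ResultChannel *chan; } Task;

#define MAX_TASKS 128
static Task pending[MAX_TASKS];  /* hypothetical stand-in for the worker's deque */
static int num_pending;

future future_create(int (*fn)(int), int arg) {
    ResultChannel *chan = calloc(1, sizeof *chan);    /* 1. allocate the channel */
    assert(chan != NULL && num_pending < MAX_TASKS);
    pending[num_pending++] = (Task){ fn, arg, chan }; /* 2. create a task, passing the channel */
    return chan;                                      /* 3. return a handle to the channel */
}

/* Run the most recently created pending task and "send" its result. */
static void run_one_task(void) {
    Task t = pending[--num_pending];  /* copy first: nested tasks reuse the slot */
    t.chan->value = t.fn(t.arg);
    t.chan->full = true;
}

int future_await(future f) {
    while (!f->full) {            /* 1. wait for the task to send its result */
        assert(num_pending > 0);  /* single-threaded: the result must come from a pending task */
        run_one_task();           /*    run other work instead of idling */
    }
    int result = f->value;        /* 2. receive the result */
    free(f);                      /* 3. free the channel */
    return result;
}

/* Fork/join recursion in the style of the slide, with Fibonacci as the body. */
int fib(int n) {
    if (n < 2) return n;
    future fx = future_create(fib, n - 1);
    int y = fib(n - 2);
    int x = future_await(fx);
    return x + y;
}
```

In a real runtime the task could be stolen and executed by another worker, in which case the awaiting worker would find the result in the channel without running the task itself.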
![Page 56: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/56.jpg)
Performance
Fork/join parallelism
[Figure: three plots of speedup over sequential execution (0 to 48) versus number of workers (0 to 48), each comparing Futures w/ caching and Cilk Plus]
Benchmarks from left to right: Tree recursion with n = 34 and t = 1 µs | 14 Queens Problem | Cilksort of 10^8 integers
![Page 57: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/57.jpg)
Adaptive Strategies
![Page 58: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/58.jpg)
Adaptive Stealing
![Page 59: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/59.jpg)
Adaptive Stealing
Idea: Reevaluate strategy after N steals
Count how many tasks M have been executed:
• Steal-one: switch to steal-half if M/N = 1, stay if M/N > 1
• Steal-half: switch to steal-one if M/N < 2, stay if M/N ≥ 2
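The reevaluation rule based on M/N can be written as a small state function. The sketch below is an illustration, with assumed names (`StealStrategy`, `reevaluate`, `tasks_to_steal`) and one assumed corner case: a steal-half request against a deque holding a single task still hands over that task. The comparisons avoid division by testing M against N and 2N directly.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { STEAL_ONE, STEAL_HALF } StealStrategy;

/* How many tasks a victim with n queued tasks hands to a thief. */
size_t tasks_to_steal(StealStrategy s, size_t n) {
    if (n == 0) return 0;
    if (s == STEAL_ONE) return 1;
    size_t half = n / 2;         /* floor(n/2) */
    return half > 0 ? half : 1;  /* assumption: a lone task is still handed over */
}

/* Reevaluate the strategy after n steals during which m tasks were executed. */
StealStrategy reevaluate(StealStrategy current, unsigned m, unsigned n) {
    assert(n > 0);
    if (current == STEAL_ONE) {
        /* M/N = 1: every steal yielded exactly one executed task, so
         * steal more per request. M/N > 1: stolen tasks spawn work; stay. */
        return (m <= n) ? STEAL_HALF : STEAL_ONE;
    }
    /* M/N < 2: steal-half brings fewer than two tasks per steal on average;
     * fall back to steal-one. M/N >= 2: stay with steal-half. */
    return (m < 2 * n) ? STEAL_ONE : STEAL_HALF;
}
```

For example, a worker in steal-one mode that executed 10 tasks over 10 steals switches to steal-half, while a worker in steal-half mode that executed 15 tasks over 10 steals switches back to steal-one.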
![Page 60: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/60.jpg)
Adaptive Stealing
Task length 10 µs
[Figure: execution time (ms, 600 to 1800) versus number of workers (8 to 48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50]
Benchmark: BPC with d = 10^5, n = 9, and t = 10 µs
![Page 61: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/61.jpg)
Adaptive Stealing
Task length 10 µs
[Figure: execution time (ms, 0 to 2500) versus number of workers (8 to 48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50]
Benchmark: BPC with d = 1, n = 999,999, and t = 10 µs
![Page 62: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/62.jpg)
Very Fine-grained Tasks
Task length 1 µs
for (i = 0; i < N; i++) ASYNC(f, i, ...);
[Figure: speedup over sequential execution (0 to 8) versus number of workers (0 to 48) for Steal-one, Steal-half, and Steal-adaptive]
Benchmark: SPC with n = 10^6 and t = 1 µs
![Page 63: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/63.jpg)
Splittable Tasks
Task length 1 µs
for (i = 0; i < N; i++) ASYNC(f, i, ...);
ASYNC_FOR ( f, 0, N, ... );
[Figure: speedup over sequential execution (0 to 48) versus number of workers (0 to 48) for Steal-adaptive and Lazy work splitting]
Benchmark: SPC with n = 10^6 and t = 1 µs
![Page 64: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/64.jpg)
Lazy Binary Splitting
Assumes concurrent work-stealing deques: Worker splits when local deque is empty, otherwise executes tasks sequentially
→ Splitting is lazy as opposed to eager
→ Chunking (granularity control) is implicit
A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010
![Page 65: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/65.jpg)
Lazy Binary Splitting
[Figure: 1/2 of the work is available to thieves]
![Page 67: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/67.jpg)
Lazy Guided Splitting
Example: Four workers
[Figure: 3/4 of the work is available to thieves]
![Page 69: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/69.jpg)
Lazy Adaptive Splitting
Example: Two workers are idle
[Figure: 2/3 of the work is available to thieves]
![Page 71: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/71.jpg)
Lazy Splitting
Neither strategy is truly lazy
Difficult to know which strategy works best → Explicit communication solves this problem
![Page 72: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/72.jpg)
Lazy Adaptive Splitting
Example: Worker receives two steal requests
[Figure: the worker sends 1/3 of the work to each of the two thieves]
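The fractions in the splitting examples (1/2, 3/4, 2/3, and 1/3 per thief) all follow from one rule: with k shares to give away, the owner keeps about 1/(k+1) of the remaining iterations. The sketch below illustrates that arithmetic; the `Range` type and both function names are assumptions for this example, and integer division decides how remainders are rounded. Binary splitting corresponds to k = 1, guided splitting to k = number of workers minus one, and adaptive splitting to k = number of idle workers or pending steal requests.

```c
#include <assert.h>

/* Iterations [lo, hi) of a splittable task. */
typedef struct { long lo, hi; } Range;

/* Iterations made available to thieves when the owner keeps 1/(k+1). */
long available_to_thieves(long len, int k) {
    assert(len >= 0 && k >= 1);
    return len - len / (k + 1);
}

/* Explicit communication: carve k equal chunks off the upper end of *r,
 * one per pending steal request, and store them in out[0..k-1].
 * Returns the number of chunks created (0 if there is too little work). */
int split_for_thieves(Range *r, int k, Range out[]) {
    assert(k >= 1);
    long len = r->hi - r->lo;
    long chunk = len / (k + 1);
    if (chunk == 0) return 0;        /* not enough iterations to share */
    for (int i = 0; i < k; i++) {
        out[i].hi = r->hi;
        out[i].lo = r->hi - chunk;
        r->hi = out[i].lo;           /* owner keeps the lower part */
    }
    return k;
}
```

With 12 iterations and two steal requests, each thief receives 4 iterations and the owner keeps 4, which matches the 1/3 and 1/3 shown in the example.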
![Page 75: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/75.jpg)
Performance
Lazy Splitting
[Figure: speedup over sequential execution (0 to 48) for Binary splitting, Guided splitting, and Adaptive splitting on the workloads FG, CG, RG, IG, DG; the plot groups the workloads into "Balanced" and "Unbalanced"]
Benchmark: Parallel loops of Fine, Coarse, Random, Increasing, and Decreasing Granularity
![Page 76: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/76.jpg)
Performance
Mixing tasks and splittable tasks
[Figure: speedup over sequential execution (0 to 48) versus size of splittable consumer tasks (9, 99, 999, 9999, 99999) for Cilk Plus, Binary splitting, Guided splitting, and Adaptive splitting]
Benchmark: BPC with d = [10^4, 10^3, …, 1], n = [9, 99, …, 99999], and t = 10 µs
![Page 77: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/77.jpg)
Performance
Mixing tasks and splittable tasks
[Figure: speedup over sequential execution (0 to 48) versus size of splittable consumer tasks (9, 99, 999, 9999, 99999) for Single tasks, Binary splitting, Guided splitting, and Adaptive splitting]
Benchmark: BPC with d = [10^5, 10^4, …, 10], n = [9, 99, …, 99999], and t = 1 µs
![Page 78: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/78.jpg)
Conclusion
![Page 79: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/79.jpg)
Performance Ranking
Average deviations from the best median speedups
21 benchmarks/workloads (20 in the case of Cilk Plus)
2-socket Intel Xeon (24 threads)
1. Chase-Lev WS -1.6 %
2. Channel WS -2.4 %
3. Cilk Plus -4.6 %
4. Intel OpenMP -10.7 %
4-socket AMD Opteron (48 threads)
1. Chase-Lev WS -2.2 %
2. Channel WS -2.4 %
3. Cilk Plus -6.9 %
4. Intel OpenMP -21.8 %
60-core Intel Xeon Phi (240 threads)
1. Chase-Lev WS -13.7 %
2. Channel WS -13.7 %
3. Intel OpenMP -22.2 %
4. Cilk Plus -28.1 %
David Chase and Yossi Lev, Dynamic Circular Work-Stealing Deque, SPAA 2005
![Page 80: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing](https://reader033.fdocuments.net/reader033/viewer/2022043011/5fa68928fb251164b50aa578/html5/thumbnails/80.jpg)
Summary
• Work-stealing runtime system with
  • private deques
  • channel communication
• Workers
  • forward steal requests
  • adapt their stealing strategy
  • split tasks lazily
Flexibility ✔
Performance ✔