Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send...

31
Stefan Kaestle, Reto Achermann, Roni Haecki, Moritz Hoffmann, Sabela Ramos, Timothy Roscoe Systems Group, Department of Computer Science, ETH Zurich 1 Smelt: Machine-aware Atomic Broadcast Trees for Multicores

Transcript of Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send...

Page 1: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Stefan Kaestle, Reto Achermann, Roni Haecki, Moritz Hoffmann, Sabela Ramos, Timothy Roscoe

Systems Group, Department of Computer Science, ETH Zurich

1

Smelt: Machine-aware Atomic Broadcast Trees for Multicores

Page 2: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

2

Large number of trees: topologies and send orders

Number of cross NUMA

links?

Tree Topology?

Root of the tree?

Send order of messages?

Maximum out degree?

Page 3: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

3

Large number of trees: topologies and send orders

Number of cross NUMA

links?

Tree Topology?

Root of the tree?

Send order of messages?

Maximum out degree?

Large number of possible trees

topology + message send order

𝑛! 𝐶𝑛−1 =(2𝑛 − 2)!

𝑛 − 1 !

Page 4: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

0

5

10

15

20

25

30

35

40

Barrier Reduction Broadcast 2PC

Binary Tree Cluster Sequential MST Fibonacci

4

There is no globally optimal tree structure

0

20

40

60

80

100

120

Barrier Reduction Broadcast 2PC

Cluster topology wins Binary tree or Fibonacci win

AMD Interlagos (4 Socket x 4 Cores x 2 Threads) Intel Xeon Phi (61 Cores x 1 Thread)

Execution Time [kCycles] Execution Time [kCycles]

Page 5: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

5

Tree Generation

Time

𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

Multicore Model

Smelt: Automatic optimization ofbroadcast and reduction trees

Page 6: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

6

Example: Building fast and simple barriers

0

5

10

15

20

25

30

35

40

45

Barrier implementation

Dramatic improvement through automatic optimization of communication patterns

Barrier Benchmark on Intel Sandy Bridge 4x8x2Execution Time [kCycles]

Dissemination Parlib MCS Smelt

7x

3x

Page 7: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

7

Broadcasts and reductions are central building blocks for parallel programs

Performance

Atomic broadcasts

Replication for data locality

e.g. Shoal, Carrefour, SMMP OS, FOS

Fault-Tolerance

Agreement protocols, atomic broadcasts

Replication for failure resilience

e.g. 1Paxos

Execution Control

Reductions, broadcast, barriers

Thread synchronization, data gathering

e.g. OpenMP

Page 8: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

8

Multicore hardware is complex

Interconnect / coherence protocol complexity

Distributed resources

Hardware parallelism

Hierarchical:thread/cores

caches/memory

Intel Sandy Bridge 4x8x2

Page 9: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

9

Multicore hardware is complex

Interconnect / coherence protocol complexity

Distributed resources

Hardware parallelism

Hierarchical:thread/cores

caches/memory

Intel Sandy Bridge 4x8x2

Generate a good tree topology and schedules

automatically

Page 10: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Works well for our approach.

Clear concept: Enables reasoning about send and receive costs

10

Smelt is based on peer-to-peer message passing

cache-line sized slot

Thread 1 Thread 2

shared memory page

shared memory page

Page 11: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Core 1

Core 2

Time [10ns]

Core 3

Core 4

Machine 1

Machine 2

Machine 3

Machine 4

Classical Network Multicore interconnect

11

Message-passing on multicores is different

Time [1us]

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒𝑡𝑠𝑒𝑛𝑑 𝑡𝑝𝑟𝑜𝑝𝑎𝑔𝑎𝑡𝑒

𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

𝑡𝑠𝑒𝑛𝑑 𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

Page 12: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Core 1

Core 2

Time [10ns]

Core 3

Core 4

Machine 1

Machine 2

Machine 3

Machine 4

Classical Network Multicore interconnect

12

Message-passing on multicores is different

Time [1us]

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒𝑡𝑠𝑒𝑛𝑑 𝑡𝑝𝑟𝑜𝑝𝑎𝑔𝑎𝑡𝑒

𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

𝑡𝑠𝑒𝑛𝑑 𝑡𝑠𝑒𝑛𝑑

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒

On multicores send and receive times dominate propagation time

Goal: Minimize total time of the broadcast

Page 13: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

𝑡𝑏𝑟𝑜𝑎𝑑𝑐𝑎𝑠𝑡 = 𝑡𝑙𝑎𝑠𝑡 − 𝑡𝑠𝑡𝑎𝑟𝑡

Minimize the longest path from the root to the leaves.

13

Minimizing the total time of a broadcast

Root Leaf

𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒1 𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒2𝑡𝑠𝑒𝑛𝑑1 𝑡𝑠𝑒𝑛𝑑2

𝑡𝑝𝑎𝑡ℎ = ∑(𝑡𝑠𝑒𝑛𝑑 + 𝑡𝑟𝑒𝑐𝑒𝑖𝑣𝑒)

We need to know the send and receive cost between any pair of cores

Page 14: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

14

Information obtained from hardware discovery

$ numactl –hardwarenode distances:node 0 1 2 3 4 5 6 7 0: 10 16 16 22 16 22 16 22 1: 16 10 22 16 16 22 22 16 2: 16 22 10 16 16 16 16 16 3: 22 16 16 10 16 16 22 22 4: 16 16 16 16 10 16 16 22 5: 22 22 16 16 16 10 22 16 6: 16 22 16 22 16 22 10 16 7: 22 16 16 22 22 16 16 10

$ lscpuCPU(s): 64Thread(s) per core: 2Core(s) per socket: 8Socket(s): 4NUMA node(s): 8L1d cache: 16KL1i cache: 64KL2 cache: 2048KL3 cache: 6144KNUMA node0 CPU(s): 0,4,8,12,16,20,24,28NUMA node1 CPU(s): 32,36,40,44,48,52,56,60NUMA node2 CPU(s): 2,6,10,14,18,22,26,30NUMA node3 CPU(s): 34,38,42,46,50,54,58,62NUMA node4 CPU(s): 3,7,11,15,19,23,27,31NUMA node5 CPU(s): 35,39,43,47,51,55,59,63NUMA node6 CPU(s): 1,5,9,13,17,21,25,29NUMA node7 CPU(s): 33,37,41,45,49,53,57,61

AMD Interlagos 4x4x2

Page 15: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

15

Information obtained from hardware discovery

$ numactl –hardwarenode distances:node 0 1 2 3 4 5 6 7 0: 10 16 16 22 16 22 16 22 1: 16 10 22 16 16 22 22 16 2: 16 22 10 16 16 16 16 16 3: 22 16 16 10 16 16 22 22 4: 16 16 16 16 10 16 16 22 5: 22 22 16 16 16 10 22 16 6: 16 22 16 22 16 22 10 16 7: 22 16 16 22 22 16 16 10

$ lscpuCPU(s): 64Thread(s) per core: 2Core(s) per socket: 8Socket(s): 4NUMA node(s): 8L1d cache: 16KL1i cache: 64KL2 cache: 2048KL3 cache: 6144KNUMA node0 CPU(s): 0,4,8,12,16,20,24,28NUMA node1 CPU(s): 32,36,40,44,48,52,56,60NUMA node2 CPU(s): 2,6,10,14,18,22,26,30NUMA node3 CPU(s): 34,38,42,46,50,54,58,62NUMA node4 CPU(s): 3,7,11,15,19,23,27,31NUMA node5 CPU(s): 35,39,43,47,51,55,59,63NUMA node6 CPU(s): 1,5,9,13,17,21,25,29NUMA node7 CPU(s): 33,37,41,45,49,53,57,61

AMD Interlagos 4x4x2

Doesn’t distinguish between send() and recv()

4

2

2

4Symmetric: AB == BA

NUMA distance: abstract value

Page 16: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

16

Complement with microbenchmarks: pairwise send and receive

Cost of Send Operation [Cycles] Cost of Receive Operation [cycles]

sending core sending core

receiving core receiving core

AMD Interlagos 4x4x2

min = 43 max = 166

min = 43 max = 360

Page 17: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

17

Complement with microbenchmarks: pairwise send and receive

Cost of Send Operation [Cycles] Cost of Receive Operation [cycles]

sending core sending core

receiving core receiving core

AMD Interlagos 4x4x2

min = 43 max = 166

min = 43,max = 360

10

22 137

10

22 282

22

10 159

22

10 351

Not symmetric!

Core 10 Core 22: 137 + 282 = 419 cyclesCore 22 Core 10: 159 + 351 = 510 cycles

Page 18: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

18

Smelt

Page 19: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

19

Using Smelt for group communication

Multicore ModelProgram

Hardware Discovery# Cores# NUMA nodes

lstopo; /proc/cpu

MeasurementsMicro-benchmarks

#include <smelt/smelt.h>

void main() {

smelt_init();

smelt_topology_create();

smelt_broadcast(msg);

}

Smelt runtime

smelt_topology_create() {

}

Smelt Algorithm

Tree topology

Page 20: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

20

Smelt’s tree generator heuristics

Remote Cores First

1

Avoid Expensive Communication

No redundancy

r

1

2

3

4

Maximize Parallelism

r

rr

Page 21: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

21

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0

NUMA 3

NUMA 2

NUMA 1

NUMA 0

500 1000 1500 20000

Broadcast onAMD Shanghai 4x4x1

sendreceive

Simplified Example

Time [Cycles]

Page 22: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

22

15

14

13

12

11

10

9

8

7

6

5

4

3

2

1

0

NUMA 3

NUMA 2

NUMA 1

NUMA 0

500 1000 1500 20000

207 199 198 184

186564

450

455

Heuristic #1: Expensive first

Broadcast onAMD Shanghai 4x4x1

sendreceive

Simplified Example

Time [Cycles]

184 183 183

186 183 183

188 188 187

187187188455

357

357

356

352

350

350

354

350

350

357

355

353Step 1: Pick the rootCore with the smallest total send time to any other core.

Step 2: Start SchedulerSchedule cores to send and receive messages

Apply HeuristicsDecide where to send messages to

The tree is good,But not optimal!

Page 23: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

23

Smelt Tree forIntel Sandy Bridge 4 Sockets x 8 Cores x 2 Threads

Smelt Tree forIntel Xeon Phi using 61 cores

Page 24: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

24

Evaluation Testbed

Architecture Sockets Cores / Socket

Threads / Core

Magny Cours 4 12 1

Barcelona 8 4 1

Shanghai 4 4 1

Interlagos 4 4 2

Istanbul 4 6 1

Architecture Sockets Cores / Socket

Threads / Core

Ivy Bridge 2 10 2

Nehalem 4 8 2

Knights Corner 1 61 4

Sandy Bridge 4 8 2

Sandy Bridge 2 10 2

Bloomfield 2 4 2

Full set of results online.

http://machinedb.systems.ethz.ch

Intel AMD

Page 25: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

0

5

10

15

20

25

30

35

40

Barrier Reduction Broadcast 2PC

Binary Tree Cluster Sequential MST Fibonacci Smelt

25

Smelt produces good trees across architectures

0

20

40

60

80

100

120

Barrier Reduction Broadcast 2PC

AMD Interlagos (4 Socket x 4 Threads) Intel Xeon Phi (61 Threads)Execution Time [kCycles] Execution Time [kCycles]

Best other = Cluster Best other = Fibonacci / Binary Tree

Page 26: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

26

Smelt produces good trees across architectures

Manufacturer Codename Sockets x Cores x Threads

slowdown speedup

Page 27: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Cluster Topology on Intel Bloomfield 2x4x2Smelt Tree on Intel Bloomfield 2x4x2

27

Fast broadcast trees are good for reductions in most cases

Additional Cross-NUMA link

Page 28: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

05

1015202530354045

32 Threads 64 ThreadsDissemination Parlib MSC Smelt

Barriers based on reduction and broadcast

28

Smelt provides simple and fast barriers

Execution Time [kCycles]

Barrier Benchmark on Intel Sandy Bridge 4x8x2

Simple barrier implementation

void smelt_barrier(void) {smelt_reduce();smelt_broadcast();

}

Page 29: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

29

OpenMP: EPCC OpenMP Benchmark Collection

/* epcc openmp barrier benchmark */void testbar() {

int j;#pragma omp parallel private(j){

for (j = 0; j < innerreps; j++) {delay(delaylength);#pragma omp barrier

}}

} Implicit barrier at the end of parallel block

Explicit barrier

Remaining results on the website

0

5

10

15

20

25

PARALLEL BARRIER

AMD Interlagos 4x4x2

0

20

40

60

PARALLEL BARRIER

Intel SandyBridge 4x8x2

GOMP Smelt

Execution Time [us]

Execution Time [us]

Replaced GOMP barrier with Smelt

Page 30: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

30

Agreement Protocols: 1Paxos

0

10

20

30

40

50

60

8 12 16 20 24 28

Number of Replicas

Original Broadcast Smelt Broadcast

4 clients to generate loadN replicas executing 1Paxos

Execution Time [kCycles]

1Paxos Benchmark on AMD Interlagos 4x4x2

Page 31: Smelt: Machine-aware Atomic Broadcast Trees for Multicores...Tree Topology? Root of the tree? Send order of messages? Maximum ... topology + message send order 𝑛!𝐶 −1= (2𝑛−2)!

Broadcasts and reductions are central building blocks

No globally optimal tree topology

Information from hardware discovery is not sufficient

Smelt’s produces good trees

31

SummaryTalk to us at

the first poster session

machinedb.systems.ethz.ch github.com/libsmelt