Networks: Topologies - HPC Advisory Council - A community effort

Transcript of "Networks: Topologies - How to Design"

Page 1:

Networks: Topologies - How to Design

Gilad Shainer, [email protected]

Page 2:

TOP500 Statistics

Page 3:

TOP500 Statistics

Page 4:

World Leading Large-Scale Systems

• National Supercomputing Centre in Shenzhen – Fat-tree, 5.2K nodes, 120K cores, NVIDIA GPUs, China (Petaflop)

• Tokyo Institute of Technology – Fat-tree, 4K nodes, NVIDIA GPUs, Japan (Petaflop)

• Commissariat à l'Energie Atomique (CEA) – Fat-tree, 4K nodes, 140K cores, France (Petaflop)

• Los Alamos National Lab – Roadrunner – Fat-tree, 4K nodes, 130K cores, USA (Petaflop)

• NASA – Hypercube, 9.2K nodes, 82K cores, USA

• Jülich JuRoPa – Fat-tree, 3K nodes, 30K cores, Germany

• Sandia National Labs – Red Sky – 3D-Torus, 5.4K nodes, 43K cores, USA

Page 5:

ORNL “Spider” System – Lustre File System

• Oak Ridge National Lab central storage system – 13,400 drives

– 192 Lustre OSS

– 240GB/s bandwidth

– InfiniBand interconnect

– 10PB capacity

Page 6:

Network Topologies

• Fat-tree (CLOS), Mesh, 3D-Torus topologies

• CLOS (fat-tree)

– Can be fully non-blocking (1:1) or blocking (x:1)

– Typically enables best performance

• Non-blocking bandwidth, lowest network latency

• Mesh or 3D Torus

– Blocking network, cost-effective for systems at scale

– Great performance solution for applications with locality

– Support for dedicated sub-networks

– Simple expansion for future growth

[Figure: 3×3 grid of junctions labeled (0,0) through (2,2)]

Page 7:

d-Dimensional Torus Topology

• Formal definition – T=(V,E) is said to be a d-dimensional torus of size N1×N2×…×Nd if:

• V = {(v1, v2, …, vd) : 0 ≤ vi ≤ Ni − 1}

• E = {(u, v) : there exists j such that (1) vi = ui for each i ≠ j, and (2) vj = (uj ± 1) mod Nj}

• Examples

[Figure: two examples – a ring of 5 nodes labeled 0–4 (N1=5), and a 3×3 torus with junctions labeled (0,0) through (2,2) (N1=N2=3)]
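To make the definition concrete, below is a minimal Python sketch (function and variable names are illustrative, not from the slides) that builds V and E as defined above and checks both examples. Generating only the +1 neighbor per dimension with wraparound yields the same undirected edge set as the ±1 rule.

```python
# Minimal sketch: vertex and edge sets of a d-dimensional torus of size
# N1 x N2 x ... x Nd, straight from the formal definition above.
from itertools import product

def torus(dims):
    """Return (V, E) for a torus with the given per-dimension sizes."""
    V = list(product(*[range(n) for n in dims]))
    E = set()
    for u in V:
        for j, Nj in enumerate(dims):
            # +1 neighbor along dimension j with wraparound; the -1
            # neighbor yields the same undirected edge from the other end.
            v = list(u)
            v[j] = (u[j] + 1) % Nj
            E.add(frozenset({u, tuple(v)}))
    return V, E

# The two examples on this slide:
V, E = torus([5])      # N1=5: a ring of 5 nodes, 5 edges
assert len(V) == 5 and len(E) == 5
V, E = torus([3, 3])   # N1=N2=3: 9 junctions, 4 neighbors each, 18 edges
assert len(V) == 9 and len(E) == 18
```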

Page 8:

3D-Torus System – Key Items

• Multiple server nodes per cube junction

• The smaller the 3D cube size, the better

– Lowest latency between remote nodes

– Minimizing throughput contention

• Ability to connect storage

• Support for separate networks

– Dedicated network (links) for specific applications/usage

– Example: links dedicated for collectives or specific jobs

Page 9:

InfiniBand 3D Torus

Page 10:

Routing for 3D Torus (Avoiding Deadlocks)

• Setting up routing might look simple – just route packets on the shortest path between source and destination

• In lossless networks, trivial routing can be disastrous – packets hold buffers while waiting for the next link, and those dependencies can form a cycle (deadlock)

Communication pairs (source → destination):

1. 0→2

2. 1→3

3. 2→4

4. 3→0

5. 4→1

[Figure: 5-node ring, nodes labeled 0–4, with the five pairs' shortest paths overlaid]
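A minimal sketch of the problem (all names illustrative): route each pair along its shortest ring path, record that a packet holding one link's buffer may wait for the next link on its path, and check the resulting channel-dependency graph for a cycle.

```python
# Shortest-path routing for the five pairs above on a 5-node ring,
# followed by a cycle check on the channel-dependency graph.
N = 5
pairs = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]

def shortest_path(src, dst):
    """Hop around the ring in whichever direction is shorter."""
    step = 1 if (dst - src) % N <= (src - dst) % N else -1
    path, cur = [src], src
    while cur != dst:
        cur = (cur + step) % N
        path.append(cur)
    return path

# Dependency: a packet holding link (a,b) may wait for link (b,c).
deps = set()
for s, d in pairs:
    p = shortest_path(s, d)
    links = list(zip(p, p[1:]))
    deps.update(zip(links, links[1:]))

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for e in edges for v in e}
    def dfs(v):
        color[v] = GRAY
        for w in graph.get(v, []):
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and dfs(v) for v in list(color))

print(has_cycle(deps))  # True – all five pairs route clockwise, and their
                        # buffer dependencies close a cycle around the ring
```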

Page 11:

Avoiding Deadlock – Restrictive Approach

• Idea

– Define a set of rules forbidding the use of some resources, or a (temporal) combination of resources, that guarantees freedom from deadlock

– Design a routing that complies with the rules


Page 12:

Avoiding Deadlock – Separation Approach

• Idea

– Decompose each (unidirectional) physical link into several logical channels with private buffer resources

– Use the logical channels to separate the network into virtual networks, each dependency-cycle-free

– Assign communication pairs (with their paths) to the virtual networks

• Back to our ring

[Figure: 5-node ring, nodes 0–4. Routing: shortest path. Virtual mapping: if a path uses the 0→4 or 4→0 link, map it to the red virtual network; otherwise to the black one.]
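Continuing the same sketch (illustrative names again), the mapping rule above can be applied mechanically. Within each resulting virtual network the buffer dependencies form a simple chain rather than a cycle, so neither virtual network can deadlock.

```python
# Assign each pair's shortest path to the red or black virtual network
# according to the rule on this slide.
N = 5
pairs = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]

def shortest_path(src, dst):
    step = 1 if (dst - src) % N <= (src - dst) % N else -1
    path, cur = [src], src
    while cur != dst:
        cur = (cur + step) % N
        path.append(cur)
    return path

red, black = [], []
for s, d in pairs:
    p = shortest_path(s, d)
    links = set(zip(p, p[1:]))
    # Paths crossing the 0-4 wraparound link (either direction) go red.
    if (4, 0) in links or (0, 4) in links:
        red.append(p)
    else:
        black.append(p)

print("red:  ", red)    # [[3, 4, 0], [4, 0, 1]]
print("black:", black)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
# Red dependencies:   (3,4)->(4,0), (4,0)->(0,1)              – a chain
# Black dependencies: (0,1)->(1,2), (1,2)->(2,3), (2,3)->(3,4) – a chain
```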

Page 13:

InfiniBand 3D Torus

• InfiniBand drivers include subnet management routing algorithms for

– Fat Tree – min-hop, Up/Down, etc.

– 3D Torus – Dimension Ordered Routing
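As a rough sketch of how Dimension Ordered Routing behaves on a torus (conceptual only – not OpenSM's actual implementation), each packet corrects one coordinate at a time, in a fixed x, y, z order, going the shorter way around each ring. The wraparound hops within each ring still rely on the virtual-network separation from page 12 to remain deadlock-free.

```python
# Dimension Ordered Routing on a 3D torus: fix x first, then y, then z.
def dor_next_hop(cur, dst, dims):
    """Return the next junction on the DOR path from cur to dst."""
    for i, n in enumerate(dims):         # fixed dimension order
        if cur[i] != dst[i]:
            fwd = (dst[i] - cur[i]) % n  # hops in the "+" ring direction
            step = 1 if fwd <= n - fwd else -1
            nxt = list(cur)
            nxt[i] = (cur[i] + step) % n
            return tuple(nxt)
    return cur                           # already at the destination

# Walk a full path on an 8x8x8 torus:
cur, dst, dims = (7, 0, 3), (1, 6, 3), (8, 8, 8)
path = [cur]
while cur != dst:
    cur = dor_next_hop(cur, dst, dims)
    path.append(cur)
print(path)  # x wraps 7 -> 0 -> 1, then y takes the short way 0 -> 7 -> 6
```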

Page 14:

Mixed Topologies

• Fat-tree topology provides the best-performance solution

• 3D-Torus can be more cost-effective, easier to scale, and a good fit for applications with locality

• Mixed topology

– System connected as a 3D Torus

– Fast fat-tree for collective operations

[Figure: 3×3 grid of junctions labeled (0,0) through (2,2)]

Page 15:

Notes

• The following fat-tree network configurations assume – A flat network

– No unused ports

– Two layers of switch fabric (L1 and L2)

• The following 3D Torus configurations assume – Each 3D Torus junction is a 36-port switch

– Switch counts refer to 36-port switches

• InfiniBand is a great interconnect technology for enabling flat connectivity of thousands to tens of thousands of servers in future mega warehouse data centers

Page 16:

Example: Non-blocking, Fat-Tree, 40Gb/s

[Figure: two-layer fat-tree – each L1 switch has 18 server ports and 18 uplinks into the non-blocking L2 network]

648 L1 36-port switches, 18 L2 648-port switches

Total: 11,664 servers (nodes)

Throughput: 40Gb/s to the node

Page 17:

Example: 2:1 Oversubscription, Fat-Tree, 40Gb/s

[Figure: two-layer fat-tree – each L1 switch has 24 server ports and 12 uplinks into the non-blocking L2 network]

648 L1 36-port switches, 12 L2 648-port switches

Total: 15,552 servers (nodes)

Throughput: 20Gb/s to the node

Page 18:

Example: 3:1 Oversubscription, Fat-Tree, 40Gb/s

[Figure: two-layer fat-tree – each L1 switch has 27 server ports and 9 uplinks into the non-blocking L2 network]

648 L1 36-port switches, 9 L2 648-port switches

Total: 17,496 servers (nodes)

Throughput: ~13Gb/s to the node

Page 19:

Example: 8:1 Oversubscription, Fat-Tree, 40Gb/s

[Figure: two-layer fat-tree – each L1 switch has 32 server ports and 4 uplinks into the non-blocking L2 network]

324 L1 36-port switches, 2 L2 648-port switches

Total: 10,368 servers (nodes)

Throughput: 5Gb/s to the node
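All four configurations above follow the same arithmetic. Here is a minimal sketch (parameter names are illustrative) that reproduces the slides' numbers from the switch radix and the server/uplink split on each L1 switch:

```python
# Fat-tree sizing: 36-port L1 switches, 648-port L2 switches, 40Gb/s links.
def fat_tree(l1_switches, server_ports_per_l1, link_gbps=40, radix=36):
    uplinks = radix - server_ports_per_l1
    nodes = l1_switches * server_ports_per_l1
    oversub = server_ports_per_l1 / uplinks
    per_node_gbps = link_gbps / oversub
    # One L2 port per L1 uplink, packed into 648-port L2 switches.
    l2_switches = -(-l1_switches * uplinks // 648)  # ceiling division
    return nodes, oversub, per_node_gbps, l2_switches

for l1, srv in [(648, 18), (648, 24), (648, 27), (324, 32)]:
    print(fat_tree(l1, srv))
# (11664, 1.0, 40.0, 18)     – non-blocking
# (15552, 2.0, 20.0, 12)     – 2:1
# (17496, 3.0, 13.33..., 9)  – 3:1 (the slide rounds to 13Gb/s)
# (10368, 8.0, 5.0, 2)       – 8:1
```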

Page 20:

Example: 3D Torus

[Figure: 3D Torus switch junction – a 36-port switch with 18 servers (nodes) attached at 40Gb/s each, and 120Gb/s of torus bandwidth in each of the six directions (+x, -x, +y, -y, +z, -z)]

3D Torus size: 8x8x8 (512 36-port switches)

Total number of servers: 9,216
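The junction arithmetic above, as a quick sketch (variable names are illustrative):

```python
# 36-port junction: 18 ports for servers, the other 18 split across the
# six torus directions, three 40Gb/s links per direction.
radix, servers_per_junction, link_gbps = 36, 18, 40
torus_ports = radix - servers_per_junction           # 18
links_per_direction = torus_ports // 6               # 3
print(links_per_direction * link_gbps, "Gb/s per direction")  # 120

x, y, z = 8, 8, 8
junctions = x * y * z                                # 512 36-port switches
print(junctions * servers_per_junction, "servers")   # 9216
```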

Page 21:

36-port Switch – 3D Torus Connections Example

[Figure: one 36-port switch junction with torus ports grouped by direction (+x, -x, +y, -y, +z, -z), 17 compute nodes (Node 0 through Node 16), and one I/O node (ION 1) attached]

Page 22:

Choosing the Right Topology

• Performance: Fat Tree

– Application locality? 3D Torus can become an option

– Multiple users/applications? Fat Tree

– Non-blocking? Fat Tree

• Cost

– Depends on the size of the system

– Very large systems can be more cost-effective with a 3D Torus

• Future expansion? A 3D Torus will be easier to expand

Page 23:

Network Offloading

• Transport offloads – critical for CPU efficiency

• Congestion avoidance – must be done in the network

• Application offloading (MPI offloading)

– For example: MPI Collectives Offloads

[Figure: benchmark results, lower is better. Callouts: "Software MPI: losing performance beyond 20% CPU computation availability" and "Collectives-offload-based MPI: beyond 80% CPU computation availability without any performance loss!"]

Page 24:

Thank You

www.hpcadvisorycouncil.com

[email protected]