Network in Supercomputers


Transcript of Network in Supercomputers

Page 1: Network in Supercomputers

Network in Supercomputers (Advanced Lecture on Information Networks)

Takeshi Nanri, Research Institute for Information Technology, Kyushu Univ.

1

Page 2: Network in Supercomputers

Today's topics

• What are supercomputers?

• How to speed up computers?

• Network in supercomputers.

2

Page 3: Network in Supercomputers

What are supercomputers?

• "Super" computers = computers capable of performance far higher than ordinary computers at that point in time.

• Purposes of supercomputers
• Mainly, fields of science and technology:
• Medicine: molecular simulation
• Design of cars, airplanes, etc.: structural analysis simulation
• Weather forecast: physical simulation
etc.

• Recently, also other fields:
• Prediction of markets
• Traffic
• SNS data analysis
• Customer behavior analysis
etc.

3

Source: http://eng.mod.gov.cn/Database/Academies/2013-06/18/content_4455818_2.htm

Source: https://www.llnl.gov/news/aroundthelab/2012/Jun/ATL-061812_sequoia.html

Demands for faster computers

Source: http://www.aics.riken.jp/jp/k/facility.html

Page 4: Network in Supercomputers

Theoretical speed of computers

• FLOPS (FLoating-point Operations Per Second)
• Number of floating-point operations per second.

• Important for scientific simulations

• Theoretical peak FLOPS = clock frequency of the CPU x number of operations per clock x number of CPUs

• Ex: 1 GHz CPU, 4 operations per clock, 1000 CPUs:

Speed = 4000 GFLOPS = 4 TFLOPS (tera FLOPS)
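As a quick check of this arithmetic, here is a minimal C sketch of the peak-FLOPS formula; the parameter values are simply the ones from the example above:

#include <stdio.h>

int main(void) {
    double clock_ghz     = 1.0;     /* CPU clock frequency in GHz (example value) */
    double ops_per_clock = 4.0;     /* floating-point operations per clock */
    double num_cpus      = 1000.0;  /* number of CPUs */

    /* Theoretical peak = clock frequency x operations per clock x number of CPUs */
    double peak_gflops = clock_ghz * ops_per_clock * num_cpus;
    printf("Peak = %.0f GFLOPS = %.0f TFLOPS\n", peak_gflops, peak_gflops / 1000.0);
    return 0;
}

Running it prints "Peak = 4000 GFLOPS = 4 TFLOPS", matching the figure on the slide.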

4

Page 5: Network in Supercomputers

Actual performance

• The CPU needs to wait for:
• Data to be calculated, from memory, disk or network
• Synchronization with other CPUs

• So, actual performance depends on programs

5

Benchmark program: a program used for comparing speed among computers.

Page 6: Network in Supercomputers

Top500 List: http://www.top500.org

• The most famous supercomputer performance list in the world
• 500 working supercomputers are listed
• Updated every June and November

• LINPACK benchmark
• Solves a linear algebra problem (a dense system of linear equations)
• Shows speed close to the theoretical peak

• Convenient for comparing speeds among computers:
• Lists published since 1993
• Almost all supercomputers are listed

6

Used in the race of supercomputer development

Page 7: Network in Supercomputers

Example: Earth Simulator

• In operation since March 2002
• Built by NEC, Japan

• Target: simulating the globe with a 10 km square mesh

• Previously 100 km square, so typhoons could not be simulated

7

Page 8: Network in Supercomputers

Earth Simulator in Top500

• 35.8 TFLOPS Linpack (theoretical peak = 41.0 TFLOPS)
• Top of the Top500
• Faster than the aggregate performance of the systems ranked 2 to 10, as of Nov. 2002

• Called "Computenik" by Prof. Jack Dongarra (Univ. of Tennessee)
• A Sputnik of the computing field

8

Accelerated the development of supercomputers in the USA

Page 9: Network in Supercomputers

USA's revenge (2004)

• IBM Blue Gene/L (Nov. 2004): http://www.research.ibm.com/bluegene/

• 70.7 TFLOPS (theoretical peak = 91.8 TFLOPS)

• "Not completed yet. It will be four times faster."

9

It will be much faster than the total performance of all supercomputers in Japan.

Page 10: Network in Supercomputers

USA's act (2005)

• IBM Blue Gene/L (Nov. 2005)

• 280.6 TFLOPS (theoretical peak = 367.0 TFLOPS)
• The first supercomputer with more than 100 TFLOPS speed.

• Share of supercomputer performance:
1 USA 68.3%
2 Japan 5.68%
3 UK 5.41%
4 Germany 3.10%
5 China 2.59%

followed by Australia, Switzerland, the Netherlands, Korea, ...

10

Page 11: Network in Supercomputers

11

Performance development (from the Top500 chart):
• ASCI White (USA), 0.007 PFLOPS
• Earth Simulator (Japan), 0.036 PFLOPS
• BlueGene/L (USA), 0.478 PFLOPS
• RoadRunner (USA), 1.1 PFLOPS
• Jaguar (USA), 1.8 PFLOPS
• Tianhe-1A (China), 2.6 PFLOPS
• K Computer (Japan), 10.5 PFLOPS
• Titan (USA), 17.6 PFLOPS
• Tianhe-2 (China), 33.9 PFLOPS
• Sunway TaihuLight (China), 93.0 PFLOPS

Source: http://top500.org

Page 12: Network in Supercomputers

Latest: June 2017

• Systems:
#1 Sunway TaihuLight (China) 93.0 PFLOPS
#2 Tianhe-2 (China) 33.9 PFLOPS
#3 Piz Daint (Switzerland) 19.6 PFLOPS
#4 Titan (USA) 17.6 PFLOPS
#5 Sequoia (USA) 17.1 PFLOPS
#6 Cori (USA) 14.0 PFLOPS
#7 Oakforest-PACS (Japan) 13.6 PFLOPS
#8 K Computer (Japan) 10.5 PFLOPS

• Countries:
#1 USA 33.5% (250.9 PFLOPS)
#2 China 31.4% (235.1 PFLOPS)
#3 Japan 8.3% (62.5 PFLOPS)
#4 Germany 5.0% (37.5 PFLOPS)
#5 France 3.4% (25.7 PFLOPS)

followed by UK, South Korea, Italy, Canada, ...

12

To be updated next month.

Page 13: Network in Supercomputers

How to speed up computers?

• Gain clock speed → limited by electric power and heat
• Increase instruction-level parallelism → limited by the available instruction parallelism

• Increase the number of cores

• Many cores and accelerators

• SIMD (Single Instruction Multiple Data)

13

Page 14: Network in Supercomputers

Increase processors

• Computers today = parallel computers with multiple processors

• from smartphones to supercomputers

14

Number of cores (supercomputers):
• K computer: 705,024
• Titan: 560,640
• TaihuLight: 10,649,600

Number of cores (smartphones):
• iPhone 8: 2 (high-speed cores) + 4 (efficiency cores) + deep-learning PU
• Galaxy S8: 4 (2.35 GHz) + 4 (1.9 GHz)
• XPERIA XZ: 4 (2.45 GHz) + 4 (1.9 GHz)

Page 15: Network in Supercomputers

Parallel computers

• Speed up by distributing tasks among processors

= parallel computation

• Parallel programs are needed for parallel computation.

15

[Figure: work 1, work 2 and work 3 executed one after another (sequential) vs. at the same time (parallel)]

Speed up in proportion to the number of processors, if programs are sufficiently parallelized.

Page 16: Network in Supercomputers

Parallel programs

• Programs with specifications for parallelization:
• How to distribute work to processors
• Exchanging information among processors
• Synchronization
etc.

16

Difference between "serial programs" and "parallel programs"?

Page 17: Network in Supercomputers

Serial program: A = B + C

• sequentially calculate elements from 0 to 99

17

[Figure: arrays A, B and C with elements 0 to 99; A[i] = B[i] + C[i] is computed one element at a time]

Program:

double A[100], B[100], C[100];
...
for (i = 0; i < 100; i++)
    A[i] = B[i] + C[i];

Page 18: Network in Supercomputers

Example of parallel program: parallelization with "threads"

• Threads: flows of computation that share the same memory space

18

[Figure: arrays A, B and C split into four ranges of elements, 0-24, 25-49, 50-74 and 75-99, with each range computed by one thread]

thread 0:
double A[100], B[100], C[100];
...
for (i = 0; i < 25; i++)
    A[i] = B[i] + C[i];

thread 1:
double A[100], B[100], C[100];
...
for (i = 25; i < 50; i++)
    A[i] = B[i] + C[i];

thread 2:
double A[100], B[100], C[100];
...
for (i = 50; i < 75; i++)
    A[i] = B[i] + C[i];

thread 3:
double A[100], B[100], C[100];
...
for (i = 75; i < 100; i++)
    A[i] = B[i] + C[i];
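For reference, a minimal thread-parallel version of this loop, written here with OpenMP (the slide does not name a particular thread library, so this is just one common way to express the same split):

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { B[i] = i; C[i] = 2.0 * i; }

    /* The iterations are divided among threads that share A, B and C in memory;
       with 4 threads, static scheduling gives the ranges 0-24, 25-49, 50-74, 75-99. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i];

    printf("A[99] = %.1f\n", A[99]);  /* 99 + 198 = 297 */
    return 0;
}

Compile with, for example, gcc -fopenmp, and set OMP_NUM_THREADS=4 to obtain the four-thread split shown above.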

Page 19: Network in Supercomputers

Pros and cons of parallelization with threads

• Good: Easy to parallelize
• Most compilers can generate thread-parallel programs automatically.

• Bad: Basically, runs only on "shared-memory parallel computers"
= Cannot be used on "distributed-memory parallel computers"
= Cannot be used on supercomputers

19

Page 20: Network in Supercomputers

Shared-memory parallel computers

• Share one memory among multiple CPUs

20

[Figure: multiple CPU cores all connected to one shared memory]

Bottleneck at the path from the CPU cores to the memory

=> limits the number of CPU cores

Page 21: Network in Supercomputers

Distributed-memory parallel computers

• Multiple sets of CPUs and memory

21

[Figure: several CPU core + memory sets connected to each other by a network]

The number of paths from CPU to memory increases with the number of sets
=> Easy to construct large-scale machines

Page 22: Network in Supercomputers

Process-parallel program

• Needed for parallel computation on distributed-memory parallel computers

• Process-parallel: parallel computation with multiple processes, where each process has its own memory space.

22

Page 23: Network in Supercomputers

Data in process-parallel program

• Data is also distributed among the processes

23

Each of the processes (process 0 to process 3) holds local arrays of 25 elements (indices 0 to 24) and runs the same code on its own data:

double A[25], B[25], C[25];
...
for (i = 0; i < 25; i++)
    A[i] = B[i] + C[i];

[Figure: processes 0 to 3, each with its own arrays A, B and C of elements 0 to 24]

Each process computes on different arrays

Page 24: Network in Supercomputers

Communication in process-parallel program

• A process cannot access another process's memory directly.

• Perform communication instead.

24

[Figure: processes 0 to 3, each with its own array A, connected by a network; data is moved by a send on one process and a receive on another]

Page 25: Network in Supercomputers

MPI (Message Passing Interface)

• Defines the functions for the communications performed in parallel programs.

• Example) Transfer data from process 0 to process 1

25

MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* retrieve the process number */
...
if (myid == 0)
    MPI_Send(&(a[5]), 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* process 0 sends data to process 1 */
if (myid == 1)
    MPI_Recv(&(a[3]), 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);   /* process 1 receives data from process 0 */
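The fragment above omits initialization and cleanup; a minimal but complete version of the same transfer might look like this (the array size and the value 3.14 are arbitrary choices here):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int myid;
    double a[10];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* retrieve this process's number */

    if (myid == 0) {
        a[5] = 3.14;
        /* process 0 sends one double (a[5]) to process 1, with message tag 0 */
        MPI_Send(&(a[5]), 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (myid == 1) {
        /* process 1 receives it into a[3] from process 0, with message tag 0 */
        MPI_Recv(&(a[3]), 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %f\n", a[3]);
    }

    MPI_Finalize();
    return 0;
}

Run it with at least two processes, e.g. mpirun -np 2 ./a.out.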

Page 26: Network in Supercomputers

Pros and cons of distributed memory parallel computers

• Pros: Can increase the "theoretical peak performance" easily.
• Just increase the number of CPU + memory sets.

• Cons: Need techniques to increase the practical performance.
• Develop a "process-parallel" program:
• Distribute computation
• Distribute data
• Perform communication

26

Every supercomputer is a distributed-memory parallel computer now.

Page 27: Network in Supercomputers

Parallelization methods vs Parallel computers

• Parallelization methods

27

                 shared-memory    distributed-memory
thread                 ○                  ×
process (MPI)          ○                  ○

An MPI program may require some effort to develop, but it runs on every parallel computer.

Page 28: Network in Supercomputers

Many-core

• Background: the most severe problem in speeding up supercomputers is "electric power".

• Basic idea of "many-core": use a large number of cores that have low performance and low functionality, but are power-efficient.

• Examples:
• Sunway SW26010 260C: 260 cores
• Developed in China
• Intel Xeon Phi 7290: 72 cores
• Old Pentium architecture + SIMD operations

• Pros and cons:
• Each core is a normal core with a full set of CPU operations
• At least it can run existing programs as they are
• Needs detailed tuning to achieve sufficient performance

28

Ex: achieve 1/2 the speed with 1/8 the power => 4x power efficiency.

[Figure: power vs. performance]

Source: http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf

Page 29: Network in Supercomputers

Accelerator

• Background: similar to many-core; use a large number of power-efficient calculation units.
• In most cases, the processing cores in accelerators are much simpler than the ones in many-core processors.

• GPGPU (General-Purpose Graphics Processing Unit)
• The idea of using processors for graphics as general-purpose accelerators.

• Example) NVIDIA Tesla P100: 3584 CUDA cores

• Pros and cons:
• The accelerator has its own memory, so data transfer between the CPU and the accelerator is required.
• Processors in accelerators are completely different from normal CPUs; a dedicated programming language (such as CUDA) is needed.
• Speeds up some specific computations, such as machine learning.

29

Source: http://www.nvidia.co.jp/object/tesla-p100-jp.html

Page 30: Network in Supercomputers

SIMD (Single Instruction Multiple Data) unit

• SIMD operation: compute multiple data with one instruction
• Ex) Intel AVX-512: computes 8 double-precision floating-point operations with one instruction

• SIMD unit: a part of the computational units in a CPU core
• Power-efficient computation, similar to many-core
• Included in most existing CPU cores

• Pros and cons:
• The compiler tries to automatically convert a source program without SIMD operations into one that invokes SIMD operations.
• In most cases, programmers need to do some tuning:
• Rewrite the source code as simply as possible, so that the compiler can understand it.
• Insert SIMD operations into the source code.

30

__m512 ax = _mm512_load_ps(&a[i]);
__m512 bx = _mm512_load_ps(&b[i]);
sumx = _mm512_fmadd_ps(ax, bx, sumx);

[Figure: elements of a and b loaded into the registers ax and bx, multiplied, and accumulated into sumx]
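Putting the three intrinsics above into context, the following is a minimal AVX-512 dot-product sketch; it assumes an AVX-512-capable CPU, compilation with -mavx512f, and an array length that is a multiple of 16 single-precision elements:

#include <stdio.h>
#include <immintrin.h>

#define N 1024   /* assumed to be a multiple of 16 */

int main(void) {
    /* 64-byte alignment so that the aligned _mm512_load_ps is allowed */
    static __attribute__((aligned(64))) float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    __m512 sumx = _mm512_setzero_ps();
    for (int i = 0; i < N; i += 16) {          /* 16 floats per 512-bit register */
        __m512 ax = _mm512_load_ps(&a[i]);
        __m512 bx = _mm512_load_ps(&b[i]);
        sumx = _mm512_fmadd_ps(ax, bx, sumx);  /* sumx += ax * bx, 16 lanes at once */
    }
    float sum = _mm512_reduce_add_ps(sumx);    /* horizontal sum of the 16 lanes */

    printf("dot product = %.1f\n", sum);       /* 1024 * (1.0 * 2.0) = 2048 */
    return 0;
}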

Page 31: Network in Supercomputers

Memory-speed issue

• Memory speed = the speed of feeding data from memory to the CPU
• Example) Intel Xeon Skylake-SP, 2.0 GHz, 20 cores
• Data rate required by the CPU: 2560 GB/sec
• 20 cores * 2.0 GHz * 8 operations * 8 bytes
• Memory speed: 127.8 GB/sec

• For a faster data feed:
• Hardware techniques:
• High-speed memory: HBM2 for GPUs, MCDRAM for many-core processors
• Cache memory
• Software techniques:
• Prefetching data
• Loop tiling
etc.

31

Only about 1/20 of the CPU speed can be achieved.

Page 32: Network in Supercomputers

GPU vs many-core vs traditional CPU

32

                        NVIDIA Tesla P100    Intel Xeon Phi 5110P    Intel Xeon Gold 6154
Double-precision FLOPS  5.3 TF               3.4 TF                  1.7 TF
Memory                  16 GB                16 GB                   1.5 TB
Memory speed            732 GB/s             490 GB/s                127.8 GB/s
Power consumption       300 W                245 W                   200 W
FLOPS / Watt            14.3 GF/W            13.9 GF/W               8.5 GF/W

Page 33: Network in Supercomputers

Comparison of top computers

33

(For each system: # nodes / architecture / theoretical peak / Linpack / Linpack-to-peak ratio / power / GFLOPS per Watt / network / year of appearance)

• Sunway TaihuLight: 40,960 / Many Core / 125.4 PF / 93.0 PF / 0.73 / 15.4 MW / 6.1 / Fat Tree / 2016
• Tianhe-2: 16,000 / CPU + Many Core / 54.9 PF / 33.9 PF / 0.61 / 17.8 MW / 1.9 / Fat Tree / 2013
• Piz Daint: 6,528 / CPU + GPU / 25.3 PF / 19.6 PF / 0.77 / 2.3 MW / 11.0 / DragonFly / 2017
• Titan: 18,688 / CPU + GPU / 27.1 PF / 17.6 PF / 0.64 / 8.2 MW / 2.1 / 3D Torus / 2012
• Sequoia: 98,304 / CPU / 20.1 PF / 16.3 PF / 0.81 / 7.9 MW / 2.1 / 5D Torus / 2012
• Cori: 9,152 / Many Core / 27.9 PF / 14.0 PF / 0.50 / 3.9 MW / 3.6 / DragonFly / 2016
• Oakforest-PACS: 8,208 / Many Core / 24.9 PF / 13.6 PF / 0.55 / 2.7 MW / 5.0 / Fat Tree / 2016
• K: 88,128 / CPU / 11.3 PF / 10.5 PF / 0.93 / 12.7 MW / 0.83 / 6D Torus / 2011
• Earth Simulator: 640 / CPU / 0.041 PF / 0.036 PF / 0.88 / 3.2 MW / 0.01 / Crossbar / 2002

Page 34: Network in Supercomputers

Actual performance of parallel computers

• Programmers say:"We want the parallel program to be 4 times faster with 4 CPUs."

• Actually:"3 times faster with 4 CPUs is good enough."

• Why?• Amdahl's law• Load balance• Communication

34

Page 35: Network in Supercomputers

Amdahl's law

• "Only tuned portion of the program can be faster."

• In parallel computation:Theoretically, speed-up ratio by parallelization is limited tospeedup = 1/((1-P)+P/N)• P: Ratio in time of the parallelized part of the program.

• N: Number of processes.

• Example)To achieve 3.5 times speed up with N = 4, P should be greater than 0.95.
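A minimal C sketch of this formula; the parameter values are just the ones from the example above plus a larger N for comparison:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - P) + P / N) */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    printf("P = 0.95, N = 4   -> speedup = %.2f\n", amdahl(0.95, 4));    /* about 3.48, just under 3.5 */
    printf("P = 0.95, N = 100 -> speedup = %.2f\n", amdahl(0.95, 100));  /* about 16.8, far below 100 */
    return 0;
}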

35

Page 36: Network in Supercomputers

Load balance

• Load = amount of computation
• The execution time of a parallel program is "the execution time of the slowest process".

[Figure: ranks 0 to 3 with load imbalanced vs. load balanced; in both cases the execution time of the program is that of the slowest rank]

36

Page 37: Network in Supercomputers

Communication

• Was not necessary before parallelization
= an additional cost introduced by parallelization.

[Figure: execution before and after parallelization; communication among ranks 0 to 3 is added after parallelization]

37

Page 38: Network in Supercomputers

Communications in parallel programs

• Exchange results with neighboring processes
• Example) Stencil computation (update each element with the values of the surrounding elements)

• Aggregate results among processes
• Example) Product of two vectors (a sketch follows below)

38

[Figure: stencil computation after parallelization, with neighboring processes exchanging boundary elements; and a product of two vectors where proc0 to procN each multiply and add their own elements, after which the partial sums are aggregated]
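As an illustration of the aggregation case, here is a minimal process-parallel dot product using MPI_Allreduce; the local vector length of 25 elements per process is an arbitrary choice for this sketch:

#include <stdio.h>
#include <mpi.h>

#define N_LOCAL 25   /* each process holds only its own part of the vectors */

int main(int argc, char *argv[]) {
    double a[N_LOCAL], b[N_LOCAL], local = 0.0, global = 0.0;
    int myid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    for (int i = 0; i < N_LOCAL; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* each process multiplies and adds its own elements */
    for (int i = 0; i < N_LOCAL; i++)
        local += a[i] * b[i];

    /* aggregate the partial sums of all processes into the final result */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (myid == 0)
        printf("dot product = %f\n", global);

    MPI_Finalize();
    return 0;
}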

Page 39: Network in Supercomputers

Time for transferring m bytes of data

• T = a + b * m
• a: latency
• The time required for each data transfer, independent of the size of the data.
• Depends on the speed of the network devices and the distance between the processes.
• b: transfer time per byte
• The inverse of the communication bandwidth (bytes/sec)

39

[Figure: transfer time (sec) vs. message size (bytes); the line starts at the latency a for a 1-byte message and grows with slope b]
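A small sketch of the T = a + b * m model; the latency and bandwidth values below are hypothetical, roughly in the range quoted later for high-performance networks:

#include <stdio.h>

/* T = a + b * m : a = latency [sec], b = transfer time per byte [sec/byte] */
static double transfer_time(double a, double b, double m) {
    return a + b * m;
}

int main(void) {
    double a = 1.0e-6;        /* 1 microsecond latency (assumed) */
    double b = 1.0 / 12.5e9;  /* inverse of 12.5 GB/s (= 100 Gbps) bandwidth (assumed) */

    printf("8 bytes: %.3g sec\n", transfer_time(a, b, 8.0));     /* dominated by the latency a */
    printf("1 MB   : %.3g sec\n", transfer_time(a, b, 1.0e6));   /* dominated by the bandwidth term b*m */
    return 0;
}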

Page 40: Network in Supercomputers

Time for computation and communication of parallel programs

• Time for computation: basically decreases as the number of processes increases.

• Time for communication: basically increases with the number of processes.

40

The ratio of communication time increases with the number of processes.

[Figure: time vs. number of processes (1, 2, 4, 8); computation time shrinks while communication time grows]

Page 41: Network in Supercomputers

Requirements for the networks in supercomputers

• Communication performance
• Low latency
• Fast network devices
• Short distances
• High bandwidth
• Fewer collisions on links
• More links
• Advanced control of traffic routes

• Cost
• As cheap as possible
• Depends on the performance of the network devices, the number of devices, the number of links, and the bandwidth of the links.

41

[Figure annotations: long path = slow, short path = fast; narrow vs. wide links; collisions; fewer paths vs. more paths]

Page 42: Network in Supercomputers

Bus / Ring Topology

• Share one path (bus) among all of the nodes
• Ring is a similar topology, with the two ends of the bus connected

• Pros:
• Can use a wide-bandwidth path.
• The numbers of network devices and links are small => lower cost

• Cons:
• Only one communication can be performed at a time

• Sometimes used within a CPU to connect multiple cores (e.g. 8 cores).

42

Page 43: Network in Supercomputers

Crossbar Switch

• Connects all nodes via a matrix of switches.

• Pros:
• Latency is quite low.
• Can achieve bandwidth proportional to the number of nodes.

• Cons:
• Expensive
• Needs N^2 switching devices and links.

• One of the largest is the Earth Simulator (640 nodes)
• Usually used for much smaller systems

43

Page 44: Network in Supercomputers

Fat Tree

• Tree-based layered structure with multiple roots

• Pros:
• Relatively small latency
• Path length up to 2 * the number of layers
• Relatively high bandwidth
• Depends on the number of roots
• Relatively low cost

• Cons:
• Expensive for tens of thousands of nodes
• The number of roots has to be reduced

44

[Figure: a fat tree with roots (= spines), leaves, and nodes]

Page 45: Network in Supercomputers

Multi-dimensional mesh / torus

• Construct a multi-dimensional array of nodes, and connect neighboring nodes with each other
• Torus is a similar topology, with the ends of each dimension connected

• Pros:
• Cheap: cost is linear in the number of nodes
• Faster than Bus or Ring topology
• Especially low latency between neighboring nodes

• Cons:
• Can cause severe collisions => needs careful tuning of communication in a program

• Mainly used for tens of thousands of nodes

45

Page 46: Network in Supercomputers

Tofu: Network for K computer

• 6-dimensional mesh / torus
• Blocks of 2 x 3 x 2 nodes (A-B-C axes) are placed in a three-dimensional array (X-Y-Z axes)
• Each node is connected to the nodes in the same position in the neighboring blocks along the X, Y and Z axes.

• The distance between nodes is shorter than in a two- or three-dimensional mesh / torus.

• Fewer links are needed to connect all nodes => low cost and low power

• Alternative paths exist even with one failed node.

46

From: http://www.ssken.gr.jp/MAINSITE/download/newsletter/2011/20110825-sci-1/lecture-5/ppt.pdf

Page 47: Network in Supercomputers

Full Direct Connection

• Direct connection between all pairs of nodes

• Pros:
• Smallest latency

• Cons:
• Expensive
• The number of links grows with the square of the number of nodes
• Each node needs as many ports as there are nodes

• Mainly used for very small networks within a CPU (e.g. 4 cores)

47

Page 48: Network in Supercomputers

DragonFly

• Divide the nodes into groups, and connect the groups with a full direct connection.

• Small distances
• Large bandwidth

48

Page 49: Network in Supercomputers

Network devices for supercomputers

• Ethernet: only used in systems where communication speed is not important.
• Bandwidth: 1 to 10 Gbps
• Latency: tens of microseconds

• High-performance networks: used in systems that require very fast communication, especially low latency.
• Bandwidth: 10 to 100 Gbps
• Latency: around 1 microsecond
• Products used in Top500 machines:
• Mellanox InfiniBand
• Intel OmniPath
• Cray Aries

49

Page 50: Network in Supercomputers

Functionalities of high-speed networks to support parallel computation

• RDMA (Remote Direct Memory Access)
• Direct data transfer from / to remote nodes
• Fast communication without involving the CPU
• Communication time can be hidden

• Offloading
• Let the NIC (Network Interface Controller) perform the algorithms for high-level communication, such as aggregation
• The CPU can concentrate on its computation

• Adaptive routing
• Avoid links with collisions

50

[Figure: RDMA (Get) transferring data directly between the RAM of two nodes through their NICs, without the CPUs; and an offloaded sequence (recv a, recv b, wait, c = a + b, send c) handled on the NIC side while the CPU computes]

Page 51: Network in Supercomputers

Future of supercomputer development

• Enormous budget requirements: 120 billion JPY for the K computer over 7 years.

• Next target: 1 exaFLOPS around 2020?
• Budget?

• Development is slowing down
• The top system remains at the top for years
• Sum (total of the 500 systems) and #500 (performance of the 500th system) are slowing down significantly.

51

Page 52: Network in Supercomputers

Changes in the USA

• PCAST (President's Council of Advisors on Science and Technology)
http://insidehpc.com/2010/12/22/pcast-report-supercomputing-arms-race-may-be-the-wrong-path-forward/
"an arms race that is very expensive and may not be a good use of funds."

• Founder of TOP500
http://www.top500.org/blog/top500-founder-erich-strohmaier-on-the-lists-evolution/
"It is expected to debut this November in tandem with SC13."
"you will need to keep learning, changing and adapting to the rapidly changing hardware and software environments of HPC."

52

The rules of the game can change.

Page 53: Network in Supercomputers

Alternative rankings

• HPC Challenge: http://icl.cs.utk.edu/hpcc/
• Compares multiple performance measures:
• Linpack, Matrix Multiply, Memory Bandwidth, Matrix Transpose, Random Access, Fast Fourier Transform, Communication Bandwidth and Latency
• The K computer marked the top rank on all of them.

• Graph500: http://www.graph500.org
• Compares the speed of graph processing

• Green500: http://www.green500.org
• Compares FLOPS / W

53

Page 54: Network in Supercomputers

Research Institute for Information Technology of Kyushu Univ.

• A supercomputer is available: http://www.cc.kyushu-u.ac.jp/scp/
• Ask your supervisor

• ITO system
• In service since Oct. 2017
• Network: Fat Tree

54