HIGH PERFORMANCE COMPUTING IN MULTIBODY DYNAMICS · 2014-04-06 · 2D Convolution [Image size: 8192...


HIGH PERFORMANCE COMPUTING IN MULTIBODY DYNAMICS

Dan Negrut, Vilas Associate Professor

Nvidia CUDA Fellow

University of Wisconsin-Madison


People Whose Work/Ideas Shaped Up This Presentation

• Current and past close collaborators:

• Radu Serban, University of Wisconsin-Madison

• Alessandro Tasora, University of Parma, Italy

• Mihai Anitescu, University of Chicago, Argonne National Lab

• Students from University of Wisconsin-Madison, listed alphabetically:

• Omkar Deshmukh, Toby Heyn, Hammad Mazhar, Daniel Melanz, Arman Pazouki, Andrew Seidl

2University of Wisconsin


Acknowledgements: Funding Sources

• National Science Foundation

• US Army

• NVIDIA

• Caterpillar

• Simertis GMBH

• MSC.Software

• FunctionBay, S. Korea


About this presentation

• Summary of my understanding of the advanced computing topic

• Title of talk somewhat misleading

• High Performance Computing (HPC) often times associated with big supercomputers

• This talk: what one can do to speed up a simulation in multibody dynamics

• Supercomputers can sometimes be an option


Multibody Dynamics: Commercial Solution


Multibody Dynamics: Commercial Solution


[Q] Why am I interested in this? [A] Large dynamics problems: e.g., terrain simulation

• How is the Rover moving along on a slope with granular material?

• What wheel geometry is more effective?

• How much power is needed to move it?

• At what grade will it get stuck?

• And so on…


On wigs, ties, and t-shirts


On Computing Speeds: The Price of 1 Mflop/second

• 1961:

• Combine 17 million IBM-1620 computers

• At $64K apiece, when adjusted for inflation, this would cost $7 trillion

• 2000:

• About $1,000

• 2013:

• Less than 20 cents out of the value of a workstation


More Good News: Chip Feature Length & Consequences

• Moore’s law at work

• 2013 – 22 nm

• 2015 – 14 nm

• 2017 – 10 nm

• 2019 – 7 nm

• 2021 – 5 nm

• 2023 – ??? (carbon nanotubes?)

• More transistors = More computational units

• October 2013:

• Intel Xeon w/ 12 cores – 3 billion transistors

• Projecting ahead, estimates:

• 2015 – 24 cores

• 2017 – 32 cores

• 2019 – 64 cores

• 2021 – 124 cores


Frictional Contact Simulation [Commercial Software Simulation - 2007]

• Model Parameters:

• Spheres: 60 mm diameter and mass 0.882 kg

• Penalty Approach: stiffness of 1E5, force exponent of 2.2, damping coefficient of 10.0

• Simulation length: 3 seconds


Frictional Contact Simulation [Commercial Software Simulation - 2013]

• Same problem tested in 2013

• Simulation time reduced by a factor of six

• Simulation times still prohibitively long

• Conclusion: fast computers mean nothing to the bottom line


Should you decide it is time to look for the exit…

• One of the two important points that should come out of this talk is this

• Performing math operations is basically free

• Procuring the operands is very expensive

• In terms of energy

• In terms of time

• Corollary: a program that leverages spatial or temporal locality in data accesses is a fast program
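To make the locality corollary concrete, here is a minimal sketch in C. It assumes a square matrix stored in row-major order; the function names `sum_row_major` and `sum_col_major` are illustrative and not from the talk. Both functions compute the same sum, but the first walks memory contiguously while the second jumps by a full row on every access, which typically makes it noticeably slower for large matrices.

```c
#include <stddef.h>

/* Sum an n x n matrix stored in row-major order, walking memory
   contiguously: consecutive iterations touch adjacent addresses. */
double sum_row_major(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Same sum with stride-n accesses: each iteration jumps n doubles
   ahead, so for large n most loads land on a different cache line. */
double sum_col_major(const double *a, size_t n) {
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}
```

Timing the two functions on a matrix that does not fit in cache is a quick way to see how much the access pattern alone can matter.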


NVIDIA’s Fermi GPU Architecture: Quick Facts


• Lots of ALU (green), not much of CU (orange)

• Explains why GPUs are fast for high arithmetic intensity applications

• Arithmetic intensity: high when many operations are performed per word of memory accessed


The Fermi GPU Architecture

• Late 2009, early 2010

• 40 nm technology

• Three billion transistors

• 512 Scalar Processors (SP, “shaders”)

• 64 KB L1 cache

• 768 KB L2 uniform cache (shared by all SMs)

• Up to 6 GB of global memory

• Operates at several clock rates

• Memory

• Scheduler

• Execution unit

• High memory bandwidth

• Close to 200 GB/s


Fermi: cost of arithmetic vs. cost of memory access

• Fact 1: 32 FMAs (Fused-Multiply-Add) single precision operations in one clock cycle

• Fact 2: One memory request takes 400-600 cycles to service unless it hits L1 or L2 cache

• Conclusions

1. Hundreds of times more expensive to bring data into SM than to compute something with it

2. Arithmetic is free, bringing data over is expensive


GPU Computing Requires High Bandwidth

• Required bandwidth:

• 32 (SP) x 3 (operands) x 4 (bytes) x 1125 MHz x 15 (SMs) = 6,480 GB/s

• Available bandwidth

• 200 GB/s, about 20-30 times smaller

• Two things save the day

• Caching

• High arithmetic intensity algorithms
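The required-bandwidth figure above can be checked with a few lines of C. This is only a sketch of the slide's arithmetic; the function name `required_bandwidth_gbs` is made up for illustration, and 1 GB/s is taken to mean 1e9 bytes per second.

```c
/* Bandwidth (GB/s) needed to feed every scalar processor with fresh
   operands on every clock cycle; 1 GB/s is taken as 1e9 bytes/s. */
double required_bandwidth_gbs(int sp_per_sm, int operands, int bytes_per_operand,
                              double clock_mhz, int num_sm) {
    /* Bytes consumed per clock cycle by the whole chip. */
    double bytes_per_cycle =
        (double)sp_per_sm * operands * bytes_per_operand * num_sm;
    /* Cycles per second = clock_mhz * 1e6; convert bytes/s to GB/s. */
    return bytes_per_cycle * clock_mhz * 1.0e6 / 1.0e9;
}
```

Calling `required_bandwidth_gbs(32, 3, 4, 1125.0, 15)` reproduces the 6,480 GB/s on the slide; dividing by the 200 GB/s that is available gives a gap of roughly 32x, which is the shortfall that caching and high arithmetic intensity have to cover.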


My Lamborghini can drive at 250 mph [I drive it to get groceries]


My Lamborghini can drive at 250 mph [I drive it to get groceries]


Bandwidth in a CPU-GPU System

[Figure, credit: Robert Strzodka, Max Planck Institute, Germany. Surviving labels: 1-8 GB/s; GPU]

NOTE: The width of the black lines is proportional to the bandwidth.


Intel Haswell

• Released in June 2013

• 22 nm technology

• Transistor budget: 1.4 billion

• Tri-gate, 3D transistors

• Typically comes in four cores

• Has an integrated GPU

• Deep pipeline – 16 stages

• Strong Instruction Level Parallelism (ILP) support

• Superscalar

• Supports HTT (hyper-threading technology)


Haswell Layout: “System on Chip” (SoC) Paradigm

• Three clocks:

• A core’s clock ticks at 2.7 to 3.0 GHz – adjustable up to 3.7-3.9 GHz

• Graphics processor ticking at 400 MHz – adjustable up to 1.3 GHz

• Ring bus and the shared L3 cache - a frequency close, but not necessarily identical, to that of the cores

[Image credit: Intel]

Caches

• Data:

• L1 – 32 KB per core

• L2 – 512 KB or 1024 KB per core

• L3 – 8 MB per CPU

• Instruction:

• L0 – storage for about 1500 microoperations (uops) per core

• L1 – 32 KB per core


Running fast on one workstation

• Several options

• Vectorization using AVX

• Leverage multiple cores, up to 16 (AMD Opteron 6274)

• Have four CPUs on a node (64 cores)

• Intel Xeon Phi (60 cores)

[Images: Intel Xeon E5-2690 v2 Ivy Bridge-EP 3.0GHz 25MB L3 Cache 10-Core; Tesla K20X (Kepler architecture); Intel® Xeon Phi™ Coprocessor 5110P (8GB, 1.053 GHz, 60 core)]

How can you drive this HW?

• GPU:

• CUDA (NVIDIA proprietary software ecosystem, freely distributed)

• OpenCL (standard supporting parallel computing in hardware-agnostic fashion)

• x86

• pthreads

• OpenMP

• MPI


Reflections on picking the winning horse

• The hardware is complex

• The compiler flags are numerous

• The problems can be formulated in so many ways

• The second point of this talk:

• The proof of the pudding is in the eating


Next two slides: Intel MKL – bewildering results at times

“In our factory, we make lipstick. In our Advertising, we sell hope.”

Charles Revson, cosmetics magnate, Revlon


DGEMM : C = alpha*A*B + beta*C using Intel MKL 11.1

(alpha = 1, beta = 0)

A(2000x200), B(200x1000)

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
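For readers who want to see what the DGEMM operation computes, here is a plain, unoptimized reference in C. It is not how MKL implements it; the function name `dgemm_ref` and the row-major storage are assumptions made for this sketch.

```c
#include <stddef.h>

/* Reference (unoptimized) DGEMM for row-major matrices:
   C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n).
   When beta is zero, C is treated as write-only, so it need not be
   initialized on input. */
void dgemm_ref(size_t m, size_t n, size_t k, double alpha,
               const double *A, const double *B, double beta, double *C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            /* Dot product of row i of A with column j of B. */
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = (beta == 0.0) ? alpha * acc
                                         : alpha * acc + beta * C[i * n + j];
        }
    }
}
```

For the sizes on the slide (A is 2000 x 200, B is 200 x 1000) the triple loop performs 2 x 2000 x 1000 x 200 = 8e8 floating-point operations, which is the work a tuned library spreads across vector units and cores.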


DGEMM on Intel Xeon Phi (60 Cores with 512-bit SIMD vector registers)

Intel’s MKL 11.1 – used in native mode


Next slides: Picking the winning horse is not obvious


2D Convolution

    for (int yOut = 0; yOut < nHeight; yOut++) {            // Image Y-dimension
        const int yInTopLeft = yOut;
        for (int xOut = 0; xOut < nWidth; xOut++) {         // Image X-dimension
            const int xInTopLeft = xOut;
            float sum = 0.f;
            for (int r = 0; r < nFilterWidth; r++) {        // Kernel Y-dimension
                const int idxFtmp = r * nFilterWidth;
                const int yIn = yInTopLeft + r;
                const int idxIntmp = yIn * nInWidth + xInTopLeft;
                for (int c = 0; c < nFilterWidth; c++) {    // Kernel X-dimension
                    const int idxF = idxFtmp + c;
                    const int idxIn = idxIntmp + c;
                    sum += pFilter[idxF] * pInput[idxIn];
                }
            }
            const int idxOut = yOut * nWidth + xOut;
            pOutput[idxOut] = sum;
        }
    }


2D Convolution [Image size: 8192 x 8192]

• Intel® Core™ i5-3337U, 3M Cache, 1.90 GHz

• Dual core chip, supports 256-bit AVX capable of 8 FMA per clock-cycle

• Middle of the road laptop, single socket configuration

• Used for AVX acceleration, implementation by Omkar

• Vectorization with OpenCL built-in float4 data type

• AMD Opteron™ 6274, 2MB, 2.2 GHz

• Has 16 physical cores

• Used with OpenMP 4/16/64 Threads
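As an illustration of the OpenMP variant mentioned above, the sketch below parallelizes a 2D convolution over output rows. It is not the code used for the reported timings; the function name `conv2d_omp` and the assumption that the input is padded to (h + fw - 1) x (w + fw - 1) are choices made here for a self-contained example.

```c
/* 2D convolution (valid region only), parallelized over output rows.
   in  : (h + fw - 1) x (w + fw - 1) input image, row-major (assumed padded)
   filt: fw x fw filter, row-major
   out : h x w output image, row-major
   Built without OpenMP support, the pragma is ignored and the loop
   simply runs serially. */
void conv2d_omp(const float *in, const float *filt, float *out,
                int h, int w, int fw) {
    const int in_w = w + fw - 1;
    /* Each output row is independent, so rows can be split among threads. */
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float sum = 0.0f;
            for (int r = 0; r < fw; r++)
                for (int c = 0; c < fw; c++)
                    sum += filt[r * fw + c] * in[(y + r) * in_w + (x + c)];
            out[y * w + x] = sum;
        }
    }
}
```

With GCC or Clang the thread count can be chosen at run time through the `OMP_NUM_THREADS` environment variable, which is one way to produce 4, 16, and 64 thread runs like the ones listed above.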


Vectorization with AVX

• AVX: Advanced Vector Extensions – difficult to program, little compiler optimization

• Examples of AVX intrinsics

• Create and initialize variables that map to AVX registers

__m256 prod __attribute__ ((aligned (32))) = _mm256_set1_ps(0.0f);

• Carry out AVX multiplication using 256 bit wide registers

prod = _mm256_mul_ps(data, kernel);

• Multiplication maps eventually into assembly

c5 f4 59 c0 vmulps ymm0,ymm1,ymm0
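The intrinsics above can be put together into a small multiply-accumulate routine. The following is a sketch, not the implementation used in the talk: the name `dot_avx` is hypothetical, unaligned loads are used for simplicity, and the code falls back to a scalar loop when the compiler is not targeting AVX (for example, when built without `-mavx`).

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Sum of data[i] * kernel[i], eight floats at a time when AVX is enabled. */
float dot_avx(const float *data, const float *kernel, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 d = _mm256_loadu_ps(data + i);    /* unaligned load of 8 floats */
        __m256 k = _mm256_loadu_ps(kernel + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(d, k));
    }
    /* Reduce the eight partial sums held in the register. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (int l = 0; l < 8; l++)
        sum += lanes[l];
#endif
    /* Scalar tail, or the whole array when AVX is not enabled. */
    for (; i < n; i++)
        sum += data[i] * kernel[i];
    return sum;
}
```

In a convolution, a routine like this would typically be applied to one filter row against the matching stretch of an image row.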


2D Convolution [Image size: 8192 x 8192]

2D Convolution: Scaling w/ image size [Kernel Size: 8 x 8]

2D Convolution: GPU Results – Tesla K20x, 5GB, 0.73 GHz

http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf

Reference: one core of Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz


Next slides: pthreads is not it


Stencil-Type Operation

• Average over the neighbors (grid[ i ± 1 ][ j ± 1 ])

• Hardware Setup: Dual socket, 4-core Intel Xeon CPUs with hyper-threading

• Shows up as 16 virtual cores

• Software approaches

• Pthreads – do-it-yourself, low-level, handle barriers, locking, synchronization

• OpenMP – Single line of code

• MPI – Middle ground, uses processes that communicate with each other
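To illustrate the "single line of code" claim for OpenMP, here is a sketch of one stencil sweep in C. It assumes a Jacobi-style update that reads from one array and writes to another, which differs from an in-place two-color update; the function name `stencil_sweep_omp` and the flat row-major layout are choices made for this example.

```c
/* One Jacobi sweep of the five-point average on an ny x nx row-major grid.
   Reads from src and writes to dst, so every interior point can be
   updated independently; boundary values are copied unchanged. */
void stencil_sweep_omp(const double *src, double *dst, int ny, int nx) {
    /* The single OpenMP line: rows are divided among the threads. */
    #pragma omp parallel for
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            if (j == 0 || j == ny - 1 || i == 0 || i == nx - 1) {
                dst[j * nx + i] = src[j * nx + i];
            } else {
                dst[j * nx + i] = (src[(j - 1) * nx + i] + src[j * nx + i - 1] +
                                   src[j * nx + i] + src[j * nx + i + 1] +
                                   src[(j + 1) * nx + i]) / 5.0;
            }
        }
    }
}
```

A time-stepping loop would call this repeatedly and swap the `src` and `dst` pointers between sweeps; the implicit barrier at the end of the parallel loop takes the place of the explicit synchronization that pthreads code has to manage by hand.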


    for (int t = 0; t < timesteps / 2; t++) {
        int flag = 1;
        int i, j, k;
        MPI_Status status;

        // Even indexed elements: points where i + j is even
        for (j = 1; j < ydim - 1; j++) {
            for (i = flag; i < xdim - 1; i += 2) {
                grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
            }
            flag = (flag == 1 ? 2 : 1);   // alternate the starting column on each row
        }

        // Odd indexed elements: points where i + j is odd
        flag = 2;
        for (j = 1; j < ydim - 1; j++) {
            for (i = flag; i < xdim - 1; i += 2) {
                grid[j][i] = (grid[j-1][i] + grid[j][i-1] + grid[j][i] + grid[j][i+1] + grid[j+1][i]) / 5;
            }
            flag = (flag == 1 ? 2 : 1);
        }

        Barrier();   // wait for all workers before the next time step
    }


Accelerators

Multicore chips

Many-node configurations


Our Lab’s Cluster


Lab’s Cluster

• More than 1,200 CPU cores

• Mellanox Infiniband Interconnect (QDR), 40Gb/sec

• Memory: about 2.7 TB of RAM

• More than 10 TFlops Double Precision out of x86 hardware

• 60 GPU cards (K40, K30, GTX480) – more than 15 TFlops

• BTW: you can get an account if interested


Rover Simulation


Distributed Computing Dynamics: Is it worth it?

• The cons

• Communication is major bottleneck – latencies are high, bandwidth is ok for dynamics

• Software design is complex

• The pros

• Access to lots of memory

• Can put more cores to work

• Conclusion

• Justifiable if one node not large enough to store the entire problem


Breaking Up Dynamics for Distributed Computing

• Simulation divided into chunks executed on different cores

• Elements leave one chunk (subdomain) to move to a different one

• Key issues:

• Dynamic load balancing

• Establish a dynamic data exchange (DDE) protocol to implement migration at run time
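The migration idea can be sketched in a few lines of C. Everything here is hypothetical and much simpler than a real data exchange protocol: the domain is split into equal slabs along one axis, and the helpers `owner_of` and `count_migrations` (names invented for this example) only detect which bodies change subdomain during a step.

```c
#include <stddef.h>

/* Index of the slab that owns coordinate x when [x_min, x_max) is split
   into n_sub equal slabs; positions outside the range are clamped to
   the first or last slab. */
int owner_of(double x, double x_min, double x_max, int n_sub) {
    int idx = (int)((x - x_min) / (x_max - x_min) * n_sub);
    if (idx < 0) idx = 0;
    if (idx >= n_sub) idx = n_sub - 1;
    return idx;
}

/* Count bodies whose owning slab changes between the old and new
   positions. These are the bodies that would have to be handed over
   to a neighboring subdomain at run time. */
size_t count_migrations(const double *x_old, const double *x_new, size_t n,
                        double x_min, double x_max, int n_sub) {
    size_t moved = 0;
    for (size_t b = 0; b < n; b++) {
        int before = owner_of(x_old[b], x_min, x_max, n_sub);
        int after = owner_of(x_new[b], x_min, x_max, n_sub);
        if (before != after)
            moved++;
    }
    return moved;
}
```

In an MPI setting each subdomain would live on its own rank, and the per-slab body counts that fall out of `owner_of` are the kind of information a dynamic load balancer could use to resize the slabs.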

[Figure: items labeled v1 through v5]

Computation Using Multiple CPUs & MPI


0.5 Million Bodies on 64 Cores [Penalty Approach, MPI-based]


Project Chrono: Using HPC in Multibody Dynamics

• Open source, distributed under BSD-3 license

• 100,000 lines of code

• https://github.com/projectchrono/chrono


Tracked Vehicle Simulation – on GPU


Chrono::Fluid


Chrono::Flex [GPU+CPU]


Additive Manufacturing (3D Printing)

Selective Laser Sintering (SLS) machine


Additive Manufacturing (3D Printing)


Surface Roughness: Before and After Leveling


Put things in perspective: parallel computing options

• Accelerators (GPU/CUDA, Intel Xeon Phi)

• Up one level: multicore

• Up one more level: multi-node


Wrapping Things Up

• In the midst of a democratization of parallel computing

• Many alternatives, landscape very fluid

• Lots of hardware options: GPU, CPU (Intel/AMD), Phi, clusters, FPGAs, etc.

• Lots of software ecosystems: CUDA, OpenCL, OpenMP, OpenACC, MPI, etc.

• Parallel computing can be a game changer

• Can pursue new physics or new problems

• Provide impetus into development of new algorithms that expose more parallelism


Two pieces of practical advice

• Moving data around is hurting performance badly

• Fixing this aspect calls for rethinking of implementation and maybe even numerical algorithm

• Fluid landscape, technology changing fast – best solution is a bit of a guessing game

• The proof of the pudding is in the eating

• “80% of success is showing up” (Woody Allen, mentioned at breakfast by David Stewart)


Looking ahead

• Integration of the CPU and GPU – the “system on a chip” (SoC) paradigm takes over

• Intel

• Haswell, no clear strategy yet for CPU-GPU integration

• NVIDIA

• Maxwell and Denver project - CUDA 6 sees a unified CPU-GPU memory space

• AMD

• Kaveri chip, “Heterogeneous System Architecture” – pushing OpenCL to leverage the architecture


Thank You.

[email protected]

Simulation Based Engineering Lab

Wisconsin Applied Computing Center

University of Wisconsin-Madison


Next two slides – Intel MKL: What you get is not quite what’s advertised

“The very first law in advertising is to avoid the concrete promise and cultivate the delightfully vague.” ~ Bill Cosby