HPC Essentials 0

HPC Essentials Prequel: From 0 to HPC in one hour

OR five ways to do Kriging

Bill Brouwer
Research Computing and Cyberinfrastructure (RCC), PSU

[email protected]

Outline
● Step 0
– Navigating RCC resources
● Step 1
– Ordinary Kriging in Octave
● Step 2
– Vectorized Octave
● Step 3
– Compiled code
● Digression on Profiling & Amdahl's Law
● Step 4
– Accelerating using GPU
● Step 5
– Shared Memory
● Step 6
– Distributed Memory
● Scenarios & Summary

Step 0

● Get an account on our systems
● Check out the system details, or let us help pick one for you
● They are Linux systems, so you'll need some basic command-line knowledge
– You may want to check out the HPC Essentials I seminar, a Unix/C overview
● We use the modules system for software, so you'll need to load what you use eg., to see a list of everything available:
module av
eg., to load Octave:
module load octave
To see which modules you have in your environment:
module list

Step 0

● There are two main types of systems:
– Interactive: you share a single machine, including memory and CPUs, with one or more users; used for
  ● Debugging
  ● Benchmarking
  ● Using a program with a graphical user interface
    – You'll need to log in using Exceed onDemand
  ● Running for short periods of time

Step 0

● Batch systems

– Get dedicated memory and CPUs for a period of time
  ● Maximum time is generally 24 hours
  ● Maximum memory and CPUs depend on the cluster
– You log in to a head node, from which you submit a request eg., an interactive session for 1 node, 1 processor per node (ppn) and 4gb total memory:
qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1
● To check the status of your request:
qstat -u <your_psu_id>

Step 0

● Other notes on clusters:

– Please never run anything significant on head nodes, use PBS to submit a job instead

– If you request more than 1 CPU, remember your code/workflow needs to be able to either

● Use multiple CPUs on a single node (set ppn parameter) using some form of shared memory parallelism

● Use multiple CPUs on multiple nodes (set a combination of the nodes and ppn parameters) using some form of distributed memory parallelism

● A combination of the above

● Parallelism applied in an optimal way is high performance computing

High Performance Computing

● Using one or more forms of parallelism to improve the performance and scaling of your code
– Vector architecture eg., SSE/AVX in Intel CPU
– Shared memory parallelism eg., using multiple cores of CPU
– Distributed memory parallelism eg., using Message Passing Interface (MPI) to communicate between CPUs or GPUs
– Accelerators eg., Graphics Processing Units

Typical Compute Node

[Figure: block diagram of a compute node. Components: CPU, RAM (volatile storage), IOH, ICH, GPU, PCI-e cards, non-volatile storage, BIOS, network. Connections: memory bus, QuickPath Interconnect, PCI-express, Direct Media Interface, SATA/USB, ethernet.]

CPU Architecture

● Composed of several complex processing cores, control elements and high speed memory areas (eg., registers, L3 cache), as well as vector elements including special registers

[Figure: CPU block diagram showing four cores, cache, memory controller, I/O and PCIe.]

Shared + Distributed Memory Parallelism

● Shared memory parallelism:
– is usually implemented with pThreads or directive based programming (OpenMP)
– uses one or more cores in a CPU
● Distributed memory parallelism:
– uses one or more nodes (composed of CPUs + possibly GPUs) communicating with each other over a high speed network eg., Infiniband
– network topology and fabric are critical to ensuring optimal communication

Nvidia GPU Streaming Multiprocessor

[Figure: streaming multiprocessor block diagram. Labels: 32768 x 32-bit registers, 64kB shared mem/L1 cache, two warp schedulers, two dispatch units, cores (16 x 2), load/store units (x16), special function units (x4), interconnect. Each CUDA core is shown with a dispatch port, operand collector, FPU, integer unit and result queue.]

● GPUs run many light-weight threads at once; the device is composed of many more (simpler) cores than a CPU

Step 1: Prototype your problem

● Pick a numerical scripting language eg., Octave, a free and largely Matlab-compatible language
– Solid, well established, linear algebra based
● Code up a solution (eg., we'll consider ordinary kriging)
● Time all scopes/sections of your code to get a feel for bottlenecks
● You can use the keyboard statement to set breakpoints in your code for debugging purposes

Step 1: Prototype your problem

● Kriging is a geospatial statistical method eg., predicting rainfall for locations where no measurements exist, based on surrounding measurements

● Solution involves:
– constructing the Gamma matrix
– solving a system of equations for every desired prediction location
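As a sketch of what "constructing Gamma" and "solving a system" mean here, the textbook ordinary kriging system is written out below. The variogram constants (10 and 3.33) are read off the Octave prototype in this talk; the Lagrange multiplier \mu and the bordering row and column of ones are the standard formulation and are an assumption about what the elided initialization step sets up.

```latex
\gamma(h) = 10\left(1 - e^{-h/3.33}\right), \qquad
h_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}

\begin{pmatrix} \Gamma & \mathbf{1} \\ \mathbf{1}^{T} & 0 \end{pmatrix}
\begin{pmatrix} \mathbf{w} \\ \mu \end{pmatrix}
=
\begin{pmatrix} \boldsymbol{\gamma}_0 \\ 1 \end{pmatrix},
\qquad \Gamma_{ij} = \gamma(h_{ij}), \quad
(\boldsymbol{\gamma}_0)_i = \gamma(h_{i0})

\hat{z}(x_0, y_0) = \sum_{i=1}^{m} w_i \, z_i
```

Here m is the number of measurements, h_{i0} is the distance from measurement i to the prediction location, and the left-hand matrix is the same for every prediction location, which is why it only needs to be built (and factorized) once.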


Step 1: Prototype your problem

function [w,G,g,pred] = krige()
  % load input data & output prediction grid
  load input.csv; load output.csv;
  % init…
  % Gamma; m is size of input space, x,y are coordinates for available data z
  for i=1:m
    for j=1:m
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33));
    end
  end
  % matrix inversion
  Ginv = inv(G);
  % predictions; n is size of output space, xp,yp are prediction coordinates
  % z is available data for x,y coordinates
  for i=1:n
    g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33));
    w = Ginv * g';
    pred(i) = sum(w(1:m).*z);
  end

Results 1

● Use tic/toc statements around code blocks for timing; following times are for:
– Initialization
– Gamma construction
– Matrix inversion
– Solution

octave:1> [a b c d]=krige();
Elapsed time is 0.079224 seconds.
Elapsed time is 40.9722 seconds.
Elapsed time is 0.742576 seconds.
Elapsed time is 10.6134 seconds.

● About 80% of the time is spent in constructing the matrix → need to vectorize

● Interpreted languages like Octave benefit from removing loops and replacing with array operations

– Loops are parsed every iteration by the interpreter

– Vectorizing code by using array operations may take advantage of vector architecture in CPU


Step 2: Vectorize your Prototype

function [w,G,g,pred] = krige()
  % load input data & output prediction grid
  load input.csv; load output.csv;
  % init…
  % Gamma
  XI = (ones(m,1)*x)'; YI = (ones(m,1)*y)';
  G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33));
  % matrix inversion
  Ginv = inv(G);
  % predictions
  XP = (ones(m,1)*xp); YP = (ones(m,1)*yp);
  XI = (ones(n,1)*x)'; YI = (ones(n,1)*y)';
  ZI = (ones(n,1)*z)';
  g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33));
  w = Ginv * g;
  pred = sum(w(1:m,:).*ZI);

Results 2

octave:2> [a b c d]=krige();
Elapsed time is 0.0765891 seconds.
Elapsed time is 0.195605 seconds.
Elapsed time is 0.758174 seconds.
Elapsed time is 3.24861 seconds.

● By these timings the code is roughly 12x faster overall (about 52.4 s down to about 4.3 s), and Gamma construction alone is about 200x faster, for a relatively small investment

● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; harder to read too :)

● When memory or compute time become unacceptable, no choice but to move to compiled code

● C/C++ are logical choices in a Linux environment

– Very stable, heavily used, Linux OS itself is written in C

– Expressive languages containing many innovations, algorithms and data structures

– C++ is object oriented, allows for design of large sophisticated projects


Step 3 : Compiled Code

● Unlike a scripted language, C/C++ must be compiled to run on the CPU, converting a human readable language into machine code

● Several compilers are available on the clusters including Intel, PGI and the GNU compiler collection

● In compilation and linking steps we must specify headers (with interfaces) and libraries (with functions) needed by our application

● Try to avoid reinventing the wheel; always use available libraries if you can instead of reimplementing algorithms and data structures

● As opposed to scripting, we are now responsible for memory management eg., allocating on the heap (dynamically at runtime) or on the stack (statically at compile time)

Step 3 : Compiled Code

● In porting Octave/Matlab code to C/C++ you should always consider using these libraries at least:

– Armadillo, C++ wrappers for BLAS/LAPACK, syntax very similar to Octave/Matlab

– BLAS/LAPACK itself

  ● BLAS == Basic Linear Algebra Subprograms
  ● LAPACK == Linear Algebra PACKage
  ● Both come in many optimized flavors eg., Intel MKL

● If you want to know more about Linux basics including writing/compiling C code, you could check out HPC Essentials I

● If you want to know more about C++, you could check out HPC Essentials V

Step 3 : Compiled Code

#include "armadillo"
#include <mkl.h>
#include <iostream>

using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  //load data, initialize variables, calculate Gamma
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  char uplo = 'U'; int N = m+1; int info;
  int * ipiv = new int[N];
  double * work = new double[3*N];
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  //solve; dgetrs overwrites g with the solution (the weights)
  int nrhs=1; char trans='N';
  for (int i=0; i<n; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0)=dot(z,g.rows(0,m-1));
    …
  }
}

Results 3

● Compiled code is comparable in speed to vectorized code, although we could make some algorithmic changes to improve further:

– The Gamma matrix is symmetric, no need to calculate values for j >= i (ie., just calculate/store a triangular matrix)

– Calculating the inverse is expensive and inaccurate; it is better to factorize the matrix and solve directly eg., using forward/backward substitution (we did do this, but using the full matrix/LU decomp.)

– Armadillo uses operator overloading and expression templates to allow a vectorized approach to programming, although we leave loops in for the moment, to allow parallelization later

● If you have bugs in your code, use gdb to debug

● Always profile completely in order to solve all issues and get a complete handle on your code
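The first algorithmic change listed above (exploiting the symmetry of Gamma) can be sketched in plain C++ as follows. This is a minimal illustration, not the talk's code: it uses std::vector in row-major order instead of Armadillo, and the names variogram and build_gamma are hypothetical. The constants 10 and 3.33 are taken from the prototype.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exponential variogram with the constants used in the prototype.
double variogram(double h) {
    return 10.0 * (1.0 - std::exp(-h / 3.33));
}

// Build the m x m Gamma matrix (row-major) from coordinates x, y.
// Only entries with j < i are computed; each is mirrored to (j, i).
// The diagonal stays 0 because variogram(0) == 0.
std::vector<double> build_gamma(const std::vector<double>& x,
                                const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t m = x.size();
    std::vector<double> G(m * m, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            const double h = std::hypot(x[i] - x[j], y[i] - y[j]);
            const double v = variogram(h);
            G[i * m + j] = v;  // lower triangle
            G[j * m + i] = v;  // mirrored upper triangle
        }
    }
    return G;
}
```

This roughly halves the number of exp/sqrt evaluations. Storing only one triangle would also halve memory, but then the factorization routine has to be one that accepts packed or triangular storage.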


Important Code Profiling Methods

● Solving memory leaks; use valgrind

● Poor memory access patterns/cache usage
– Use valgrind --tool=cachegrind to assess cache hits + misses

● Heap memory usage
– Memory management has a performance impact; assess with valgrind --tool=massif

● And before you consider moving to parallel, develop a call profile for your code eg., in terms of total instructions executed for each scope, using valgrind --tool=callgrind
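Alongside the valgrind tools, the coarse tic/toc timing used on the Octave prototype carries over to compiled code. The sketch below is one way to do that with std::chrono; the class name TicToc and the toy workload slow_sum are hypothetical and only there to make the example self-contained.

```cpp
#include <cassert>
#include <chrono>

// Minimal tic/toc-style timer built on std::chrono::steady_clock.
class TicToc {
public:
    TicToc() : start_(std::chrono::steady_clock::now()) {}
    // Restart the timer.
    void tic() { start_ = std::chrono::steady_clock::now(); }
    // Seconds elapsed since construction or the last tic().
    double toc() const {
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start_;
        return dt.count();
    }
private:
    std::chrono::steady_clock::time_point start_;
};

// Toy workload: sum of the integers 1..n, computed with a loop.
long long slow_sum(long long n) {
    assert(n >= 0);
    long long s = 0;
    for (long long i = 1; i <= n; ++i) s += i;
    return s;
}
```

Typical use is to call tic() before a section such as Gamma construction and print toc() after it. Wall-clock timing like this tells you where the time goes, while callgrind tells you why.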


Amdahl's Law

● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)

● As a natural consequence, we seek both performance and scaling in our scientific applications, thus we parallelize as we run out of resources using a single processor

● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:

1/((1-P) + P/N)

where P is the portion of the application (as a fraction of serial run time) that we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
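The expression above is easy to evaluate for a few values of P and N before committing to a parallelization effort. A minimal sketch follows; the function name amdahl_speedup is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup of a program whose parallel
// fraction is p (0..1) when run on n processors.
double amdahl_speedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, with P = 0.8 (about the share of Gamma construction in the first prototype), eight processors give at most 1/(0.2 + 0.1), or about 3.3x, and no number of processors can reach 5x.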


Amdahl's Law

● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors

Step 4 : Accelerate

● In general not all algorithms are amenable to GPU acceleration, and there is the communication bottleneck between CPU and GPU to overcome

● However, linear algebra operations are extremely efficient on GPU; you can expect 2-10x over a whole CPU socket (ie., running all cores) for many operations

● The language for programming Nvidia series GPUs is CUDA; much like C but you need to know the architecture well and/or:
– Use libraries like cuBLAS (what we'll try)
– Use directive based programming in the form of OpenACC
– Use the OpenCL language (cross platform, but not heavily supported by Nvidia like CUDA)

Step 4 : Accelerate

#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <cuda.h>
#include <cublas_v2.h>   // declares cublasDtrsm
using namespace std;
using namespace arma;

int main(){
  mat G; vec g;
  //load data, initialize variables, calculate Gamma as before
  //factorize using the LU decomp. routine from LAPACK, as before
  //allocate memory on GPU and transfer data
  //solve on gpu; two steps, solve two triangular systems
  cublasDtrsm(...);
  cublasDtrsm(...);
  //transfer data back and free memory on GPU
}

Results 4

● Minimal code changes, recompilation using nvcc compiler, available by loading any CUDA module on lion-GA (where you'll also need to run)

● We still perform matrix factorization on CPU side, move data to GPU for performing solve in two steps

● This overall solution is roughly 6x faster than the single CPU thread solution presented previously, for larger data sizes

● General rule of thumb → minimize communication between CPU and GPU; use the GPU when you can occupy all streaming multiprocessors (SMPs) on the device; don't bother for small problems, where the cost of communication outweighs the benefits

● There is ongoing work performed in porting LAPACK routines to GPU eg., check out our LU/QR work, or the significant MAGMA project from UT/ORNL

● If you're interested in trying CUDA and GPUs further, you could check out HPC Essentials IV
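The "solve in two steps" mentioned above refers to the two triangular systems that remain after LU factorization: first L y = b, then U x = y. The sketch below shows those two steps on the CPU in plain C++ rather than with cuBLAS. The function name solve_lu, the row-major packed layout and the absence of pivoting are assumptions made to keep the illustration short; LAPACK's dgetrf uses column-major storage and partial pivoting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve (L U) x = b where LU packs a unit-lower-triangular L (below the
// diagonal) and an upper-triangular U (on and above it), row-major, n x n.
// No pivoting is applied; this is an assumption of the sketch.
std::vector<double> solve_lu(const std::vector<double>& LU, std::size_t n,
                             std::vector<double> b) {
    assert(LU.size() == n * n && b.size() == n);
    // Step 1: forward substitution, L y = b (y overwrites b).
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            b[i] -= LU[i * n + j] * b[j];
    // Step 2: backward substitution, U x = y (x overwrites y).
    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = i + 1; j < n; ++j)
            b[i] -= LU[i * n + j] * b[j];
        b[i] /= LU[i * n + i];
    }
    return b;
}
```

Each substitution step is what one triangular-solve call performs on the GPU, applied there to many right-hand sides at once, which is where the GPU earns its keep.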


Step 5: Shared memory

● We've determined through profiling that it's worthwhile parallelizing our loops
● By linking against Intel MKL we also have access to threaded functions
● Will simply use OpenMP directive based programming for this example
● We are generally responsible for deciding what variables need to be shared by threads, and which variables should be privately owned by threads
● If we fail to make these distinctions where needed, we end up with race conditions
– Threads operate on data in an uncoordinated fashion, and data elements have unpredictable/erroneous values

● Outside the scope of this talk, but just as pernicious is deadlock, when threads (and indeed whole programs) hang due to improper coordination
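To make the shared/private distinction concrete, here is a small OpenMP sketch (not from the talk; the function name sum_of_squares is hypothetical). The accumulator is shared by all threads, so updating it without coordination would be a race condition; the reduction clause gives each thread a private copy and combines the copies at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squares with an OpenMP reduction.
// v is shared and only read; i and sq are private to each thread because
// they are declared inside the loop; total is combined via reduction.
// Without the reduction clause, concurrent "total += sq" would be a race.
double sum_of_squares(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        const double sq = v[i] * v[i];
        total += sq;
    }
    return total;
}
```

Compiled without OpenMP support the pragma is ignored and the loop runs serially with the same result, which is a handy way to check a parallel loop against its serial version.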


Step 5 : Shared Memory

#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <omp.h>
...
int main(){
  ...
  //load data, initialize variables, calculate Gamma
  #pragma omp parallel for
  for (int i=0; i<m; i++)
    for (int j=0; j<m; j++){
      G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                               +(y(i)-y(j))*(y(i)-y(j)))/3.33));
    }
  // factorize using the LU decomp. routine from LAPACK
  dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
  //initialize data for solve, for all right hand sides
  #pragma omp parallel for
  for (int i=0; i<n; i++)
    for (int j=0; j<m; j++)
      g(i,j) = ...
  //multithreaded solve for all RHS
  dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
  //assemble predictions
}

Results 5

● When compiling and linking, must specify -fopenmp if using GNU compiler, or -openmp for Intel (-qopenmp in newer Intel compilers)

● At runtime, need to export the environment variable OMP_NUM_THREADS, set to the desired number of threads

● Setting this to something beyond the total number of cores you have access to will result in severe performance degradation

● Outside the scope of this talk, but often need to tune CPU affinity for best performance

● For more information, please check out HPC Essentials II


Step 6 : Distributed Memory

● A good motivation for moving to distributed memory is, in a simple case, a shortage of memory on a single node

● From a practical perspective, scheduling distributed CPU cores is easier than shared memory cores ie., your PBS queuing time is shorter :)

● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran

● On the clusters, we use OpenMPI (not to be confused with OpenMP); once you load the module, by using the wrapper compilers, compilation and linking paths are taken care of for you

● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:

module load openmpi

mpic++ my_program.cpp

Step 6 : Distributed Memory

#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <mpi.h>

int main(int argc, char * argv[]){
  int rank, size;
  MPI_Status status;
  MPI_Init(&argc, &argv);

  // size== total processes in this MPI_COMM_WORLD pool
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // rank== my identifier in pool
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // load data, initialize variables, calculate Gamma, perform factorization
  // solve just for my portion of predictions; upper is the last index I own
  int lower = (rank * n) / size;
  int upper = (((rank+1) * n) / size) - 1;

  for (int i=lower; i<=upper; i++){
    g.rows(0,m-1) = ...
    dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
    pred(i,0)=dot(z,g.rows(0,m-1));
    …
  }

  //gather results back to root process

  MPI_Finalize();
  return 0;
}
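The division of the n predictions among processes can be checked in isolation, without MPI. The sketch below uses a half-open range [lower, upper), which is a slight variation on the slide's bounds; the function name block_range is hypothetical.

```cpp
#include <cassert>
#include <utility>

// Half-open range [first, second) of prediction indices owned by `rank`
// when n items are split as evenly as possible across `size` processes.
std::pair<long, long> block_range(long n, int rank, int size) {
    assert(n >= 0 && size > 0 && rank >= 0 && rank < size);
    const long lower = (static_cast<long>(rank) * n) / size;
    const long upper = (static_cast<long>(rank + 1) * n) / size;
    return {lower, upper};
}
```

With n = 10 and 3 processes this yields [0,3), [3,6) and [6,10): every index is owned exactly once and the block sizes differ by at most one, which matters for load balance when n is not a multiple of the process count.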


Results 6

● When you run an MPI job using PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:

mpirun my_application.x

● Here we simply divided the output space between different processors ie., each processor in the pool calculated a portion of the predictions

● However a collective call was needed (not shown) after the solve steps, a gather statement to bring all the results to the root process (with rank 0)

● This was the only communication between different processes throughout the calculation ie., this was close to embarrassingly parallel → no communication, great scaling with processors

● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism

● For more on MPI you could check out HPC Essentials III

Review

● Let's review some of the things we've discussed

● I'll splash up several scenarios and we'll attempt to score them

Score Card

Score   What this feels like       Your HPC vehicle
+5      Civilized society          Something German
+4      Evening with friends       American Muscle
+3      Favorite show renewed      A Honda
+2      Twinkies are back          Sonata
+1      A fairy gets its wings     Camry
 0      meh                        Corolla
-1      A fairy dies               Neon
-2      Twinkies are gone          Pinto
-3      Favorite show canceled     LeBaron
-4      Evening with Facebook      Yugo
-5      Zombie Apocalypse          Abrams tank

Scenario 1

● You get an account for hammer, maybe install and use Exceed onDemand, load and use the Matlab/Octave module after logging in


Scenario 1

● Score : 0
● Meh. You'll run a little faster, probably have more memory. But this isn't HPC and you could almost do this on your laptop. You're driving a Corolla, doing 45 mph in the fast lane.

Scenario 2

● You vectorize your loops and/or create a compiled MEX (Matlab) or OCT (Octave) function


Scenario 2

● A fairy gets its wings! You move up to the Camry!
● By vectorizing loops you use internal functions that are interpreted once at runtime, and under the hood may even get to utilize the vector architecture of the CPU.
● Tricky loops eg., those with conditionals are best converted to MEX/OCT functions eg., for Octave you want the mkoctfile utility
● If compiling new functions, don't forget to link with HPC libraries eg., Intel MKL or AMD ACML where possible.

Scenario 3

● Instead of submitting a PBS job you do all this on the head node of a batch cluster


Scenario 3

● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
● Things could be worse for you, but using memory and CPU on head nodes can grind processes like parallel filesystems to a halt, making other users and sys admin feel downright melancholy. Screens freeze, commands return at the speed of pitchblende.
● If you need dedicated resources and/or to run for more than a few minutes, please use an interactive cluster or PBS : https://rcc.its.psu.edu/user_guides/system_utilities/pbs/

Scenario 4

● You use Armadillo to port your Matlab/Octave code to C++, and use version control to manage your project (eg., SVN, git/github)


Scenario 4

● Twinkies are back! You think Hyundai finally have it together and splash out on the Sonata!

● Vectorized Octave/Matlab code is hard to beat. However you may wish to scale outside the node someday, integrate into an existing C++ project or perhaps use rich C++ objects (found in Boost for eg.,) so this is the way to go. Actually there are myriad reasons.

● Don't forget to compile first with '-Wall -g' options, then when it's working and you get the right answer, optimize!


Scenario 5

● You port your Matlab/Octave code to C++ without use of libraries or version control


Scenario 5

● No Twinkies! You drive a Pinto that bursts into flames immediately!

● Reinventing the wheel is a very bad, time consuming idea. Armadillo uses expression templates to create very efficient code at compile time; without it you could end up with an inefficient mess.

● Neglect to use version control and you will surely regret it. Probably right around a publication deadline too. And while we're on the topic, please back up your data.

Scenario 6

● You target sections of your version controlled C++ code for acceleration, after understanding it better by profiling using valgrind --tool=callgrind


Scenario 6

● Score : +3
● Futurama is back! You get a new Civic!
● Believe the hype, GPUs are here to stay and will accelerate many algorithms, especially linear algebra.
● Take advantage of libraries like CUBLAS before rolling your own code, check in at the CUDAZONE to see what applications and code examples exist already. Get familiar with CUDA, we are an Nvidia CUDA Research Center : https://research.nvidia.com/content/penn-state-crc-summary

Scenario 7

● Your non-version controlled C++ code has bad memory access patterns, memory leaks, creates many temporaries.
● Score : -3
● Bye-Bye Futurama! Hello LeBaron!
● Ignore good memory and cache access patterns at your peril
● Use valgrind (default) and valgrind --tool=cachegrind to learn more. Avoid temporaries by using libraries like Armadillo, or learning and using expression templates.

Scenario 8

● Scenario 6 and you introduce shared memory parallelism using OpenMP. You look into and tune CPU affinity.
● Score : +4
● You provide Babette's feast for your friends and develop a penchant for the Ford Mustang.
● OpenMP is relatively easy eg., a pragma around a for loop.
● Don't forget to check thread correctness (eg., data races) with valgrind --tool=helgrind
● Now your code is a thing of beauty, properly version controlled, profiled completely (well you could run massif as well) and you're able to use all the compute hardware in a single heterogeneous node.

Scenario 9

● Scenario 7 AND you decide to thrash disk. Plus you try to write >= 1M files
● Score : -4
● Yugo is only cool in that Portlandia bit, and Facebook was only good for a brief period in 2006.
● Disk I/O kills in an HPC context, plus the maximum file limit at time of writing is 1M
– You give control to the kernel and your application ceases to execute for some time (a voluntary context switch)
– You might be contending for disk with other processes
– You introduce the lower bandwidth (BW) and higher latency (Delta) of disk versus system memory
– Parallel filesystems → all of the above plus network BW and Delta

Scenario 10

● Scenario 8 AND you decide to scale outside the node with MPI. You look into Patterns. GOF is on the nightstand.
● Score : +5
● You are a cultured individual and you drive a German vehicle. You care about engineering.
● Don't forget Amdahl's law
● Even with IB networks, minimize communication, consider new paradigms in distributed memory parallelism (check out MPI revision 3).

Scenario 11

● Scenario 9 and you do it all on the head node, including OpenMP for 1% of your loops. You also export OMP_NUM_THREADS=20 and you have 10 cores. There's no coordination between threads, races all over the place. You have about 40 MPI processes trying to read the same file as well, without parallel file I/O.


Scenario 11

● Score : -5
● The end is nigh and you're taking out zombies and HPC infrastructure in your Abrams tank, moving at 1 mph, getting 0.2 miles to the gallon
● You ignored all the other advice, and now you throw out Amdahl's law too.
● AND you have no coordination between any of your threads or processes.
● AND you're trying to run more threads and processes than the system can support concurrently, so context switching takes place furiously.
● Expect a not-so-rosy email from sys admin :-)

Summary

● High performance computing is leveraging one or more forms of parallelism in a performant way
● Often the best gains come from writing vectorized Octave code, or making algorithmic changes
● Before you parallelize, fully profile your code and keep Amdahl's law in mind
● All forms of parallelism have their limitations, but in general:
– GPU accelerators are excellent for linear algebra
– Shared memory using OpenMP works well for simple, nested loops
– Consider using MPI (distributed memory) for 'big data', but limit communication
