Big Iron and Parallel Processing


Transcript of Big Iron and Parallel Processing

Page 1: Big Iron and Parallel Processing

USArray Data Processing Workshop

Big Iron and Parallel Processing

Original by: Scott Teige, PhD, IU Information Technology Support

Modified for 2010 course by G Pavlis


Page 2: Big Iron and Parallel Processing


Overview

• How big is “Big Iron”?
• Where is it, what is it?
• One system, the details
• Parallelism, the way forward
• Scaling and what it means to you
• Programming techniques
• Examples
• Exercises

Page 3: Big Iron and Parallel Processing


What is the TeraGrid?

• “… a nationally distributed cyberinfrastructure that provides leading edge computational and data services for scientific discovery through research and education…”

• One of several consortia for high-performance computing supported by the NSF

Page 4: Big Iron and Parallel Processing


Some TeraGrid Systems

System     Site    Vendor  Peak (TFLOPS)  Memory (TB)
Kraken     NICS    Cray    608            128
Ranger     TACC    Sun     579            123
Abe        NCSA    Dell    89             9.4
Lonestar   TACC    Dell    62             11.6
Steele     Purdue  Dell    60             12.4
Queen Bee  LONI    Dell    50             5.3
Lincoln    NCSA    Dell    47             3.0
Big Red    IU      IBM     30             6.0

Page 5: Big Iron and Parallel Processing


System Layout

System    Clock (GHz)  Cores
Kraken    2.30         66048
Ranger    2.66         62976
Abe       2.33         9600
Lonestar  2.66         5840
Steele    2.33         7144

Page 6: Big Iron and Parallel Processing


Availability

System     Peak (TFLOPS)  Utilization  Idle (TFLOPS)
Kraken     608            96%          24.3
Ranger     579            91%          52.2
Abe        89             90%          8.9
Lonestar   62             92%          5.0
Steele     60             67%          19.8
Queen Bee  51             95%          2.5
Lincoln    48             4%           45.6
Big Red    31             83%          5.2

Page 7: Big Iron and Parallel Processing


IU Research Cyberinfrastructure

The Big Picture:
• Compute
  Big Red (IBM e1350 Blade Center JS21)
  Quarry (IBM e1350 Blade Center HS21)
• Storage
  HPSS
  GPFS
  OpenAFS
  Lustre
  Lustre/WAN

Page 8: Big Iron and Parallel Processing


High Performance Systems
• Big Red
  30 TFLOPS IBM JS21 SuSE cluster
  768 blades/3072 cores: 2.5 GHz PPC 970MP
  8 GB memory, 4 cores per blade
  Myrinet 2000
  LoadLeveler & Moab
• Quarry
  7 TFLOPS IBM HS21 RHEL cluster
  140 blades/1120 cores: 2.0 GHz Intel Xeon 5335
  8 GB memory, 8 cores per blade
  1 Gb Ethernet (upgrading to 10 Gb)
  PBS (Torque) & Moab

Page 9: Big Iron and Parallel Processing


Data Capacitor (AKA Lustre)

High-performance parallel file system
• ca. 1.2 PB spinning disk
• local and WAN capabilities
SC07 Bandwidth Challenge winner
• moved 18.2 Gbps across a single 10 Gbps link
Dark side: likes large files; performs badly with large numbers of files and for simple commands like “ls” on a directory

Page 10: Big Iron and Parallel Processing


HPSS

• High Performance Storage System
• ca. 3 PB tape storage
• 75 TB front-side disk cache
• Ability to mirror data between the IUPUI and IUB campuses

Page 11: Big Iron and Parallel Processing

Practical points
• If you are doing serious data processing, NSF cyberinfrastructure systems have major advantages:
  State-of-the-art compute servers
  Large-capacity data storage
  Archival storage for data backup
• Dark side:
  Shared resource
  Have to work through remote sysadmins
  Commercial software (e.g., MATLAB) can be an issue


Page 12: Big Iron and Parallel Processing

Parallel processing

• Why it matters
  Single-CPU systems are reaching their limit
  Multiple-CPU desktops are already the norm
  All current HPC = parallel processing
• Dark side
  Still requires manual coding changes (i.e., it is not yet common for code to be parallelized automatically)
  Lots of added complexity


Page 13: Big Iron and Parallel Processing


Serial vs. Parallel

Serial:
• Calculation
• Flow Control
• I/O

Parallel:
• Calculation
• Flow Control
• I/O
• Synchronization
• Communication

Page 14: Big Iron and Parallel Processing


A Serial Program

[Diagram: a program with serial fraction 1-F and parallelizable fraction F; on N cores the parallel part shrinks from F to F/N while the serial part 1-F is unchanged.]

Amdahl’s Law:

S = 1 / ((1 - F) + F/N)

where F is the fraction of the run time that can be parallelized and N is the number of cores.

Special case, F = 1: S = N, ideal scaling

Page 15: Big Iron and Parallel Processing


Speed for various scaling rules

[Plot: speedup S versus number of cores N for various scaling rules.]

S = N e^(-(N-1)/q)   “Paralyzable Process”

S > N   “Superlinear Scaling”
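To make these scaling rules concrete, here is a small illustrative C program (not part of the original slides) that evaluates Amdahl's law and the paralyzable-process model for a range of core counts; the parallel fraction F = 0.9 and the parameter q = 64 are arbitrary example values.

#include <math.h>
#include <stdio.h>

/* Amdahl's law: speedup when a fraction F of the work parallelizes over N cores. */
static double amdahl(double F, int N) {
    return 1.0 / ((1.0 - F) + F / N);
}

/* "Paralyzable process" scaling from the slide: S = N * exp(-(N-1)/q). */
static double paralyzable(int N, double q) {
    return N * exp(-(N - 1) / q);
}

int main(void) {
    const double F = 0.9;   /* assumed parallel fraction (example value) */
    const double q = 64.0;  /* assumed paralyzation parameter (example value) */
    for (int N = 1; N <= 256; N *= 2)
        printf("N=%4d  Amdahl S=%6.2f  paralyzable S=%6.2f\n",
               N, amdahl(F, N), paralyzable(N, q));
    return 0;
}

Compile with something like cc scaling.c -lm; the output shows Amdahl's speedup saturating near 1/(1-F) while the paralyzable curve rises and then falls off.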

Page 16: Big Iron and Parallel Processing

Architectures

• Shared memory
  These iMacs are shared-memory machines with 2 processors
  Each CPU can address the same RAM
• Distributed memory
  Blades (nodes) = a motherboard in a rack
  Each blade has its own RAM
  Clusters have a fast network to link nodes
• All modern HPC systems are both (each blade uses a multicore processor)


Page 17: Big Iron and Parallel Processing

Current technologies

• Threads
  Low-level functionality
  Good for raw speed on a desktop
  Mainly for the hard-core nerd like me, so I will say no more today
• OpenMP
• MPI


Page 18: Big Iron and Parallel Processing


MPI vs. OpenMP

MPI:
• MPI code may execute across many nodes
• Entire program is replicated for each core (sections may or may not execute)
• Variables not shared
• Typically requires structural modification to code

OpenMP:
• OpenMP code executes only on the set of cores sharing memory
• Simplified interface to pthreads
• Sections of code may be parallel or serial
• Variables may be shared
• Incremental parallelization is easy

Page 19: Big Iron and Parallel Processing

Let’s look first at OpenMP

• Who has heard of the old-fashioned “fork” procedure (part of Unix since the 1970s)?

• What is a “thread” then and how is it different from a fork?

• OpenMP is a simple, clean way to spawn and manage a collection of threads
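As a concrete contrast (a sketch, not from the slides): fork() duplicates the whole process, so the child gets its own copy of every variable, while a POSIX thread runs inside the same address space as its parent. Compile with something like gcc -pthread.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;   /* one copy per process, shared among that process's threads */

static void *thread_body(void *arg) {
    counter++;            /* a thread shares the parent's memory */
    printf("thread sees counter = %d\n", counter);
    return NULL;
}

int main(void) {
    pid_t pid = fork();   /* fork: duplicate the whole process */
    if (pid == 0) {
        counter++;        /* changes only the child's private copy */
        printf("child process sees counter = %d\n", counter);
        exit(0);
    }
    waitpid(pid, NULL, 0);

    pthread_t t;          /* thread: a second flow of control, same memory */
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("parent sees counter = %d after the thread ran\n", counter);
    return 0;
}

The parent still sees counter = 0 after the fork'd child increments its copy, but sees counter = 1 after the thread runs.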


Page 20: Big Iron and Parallel Processing


[Diagram: fork-join model; the master thread forks a team of threads, which later join back into a single thread.]

OpenMP Getting Started Exercise
Preliminaries:
  In a terminal window, cd to the test directory
  export OMP_NUM_THREADS=8
  icc omp_hello.c -openmp -o hello

Run it: ./hello

Look at the source code together and discuss

Run a variant:
  export OMP_NUM_THREADS=20
  ./hello
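The transcript does not show omp_hello.c itself; a minimal version consistent with this exercise (each thread prints a greeting, and the master thread also reports the team size) might look roughly like this:

#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Fork a team of threads; each gets a private copy of tid. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("Hello from thread %d\n", tid);
        if (tid == 0)   /* only the master thread reports the team size */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }   /* implicit join: all threads finish before the program continues */
    return 0;
}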

Page 21: Big Iron and Parallel Processing


      PROGRAM DOT_PRODUCT

      INTEGER N, CHUNKSIZE, CHUNK, I
      PARAMETER (N=100)
      PARAMETER (CHUNKSIZE=10)
      REAL A(N), B(N), RESULT

!     Some initializations
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = I * 2.0
      ENDDO
      RESULT = 0.0
      CHUNK = CHUNKSIZE

!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)
      DO I = 1, N
        RESULT = RESULT + (A(I) * B(I))
      ENDDO
!$OMP END PARALLEL DO NOWAIT

      PRINT *, 'Final Result= ', RESULT
      END


You can even use this in, yes, FORTRAN

Page 22: Big Iron and Parallel Processing

Some basic issues in parallel codes
• Synchronization
  Are the tasks of each thread balanced?
  A CPU is tied up if it is waiting for other threads to exit
• Shared memory means two threads can try to alter the same data
  Traditional threads use a mutex
  OpenMP uses a simpler method (hang on – next slides)

Page 23: Big Iron and Parallel Processing


OpenMP Synchronization Constructs
• MASTER: block executed only by the master thread
• CRITICAL: block executed by one thread at a time
• BARRIER: each thread waits until all threads reach the barrier
• ORDERED: block executed sequentially by threads
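A short illustrative C fragment (not from the slides) that uses three of these constructs in one parallel region:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int sum = 0;

    #pragma omp parallel
    {
        #pragma omp master              /* MASTER: only the master thread runs this */
        printf("team size = %d\n", omp_get_num_threads());

        #pragma omp barrier             /* BARRIER: everyone waits for the master */

        #pragma omp critical            /* CRITICAL: one thread updates sum at a time */
        sum += omp_get_thread_num();
    }
    printf("sum of thread ids = %d\n", sum);
    return 0;
}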

Page 24: Big Iron and Parallel Processing


Data Scope Attribute Clauses

• SHARED: variable is shared across all threads (concurrent updates must still be synchronized, e.g. with CRITICAL, ATOMIC, or a REDUCTION)

• PRIVATE: variable is replicated in each thread (no sharing, so no synchronization is needed – faster)

• DEFAULT: change the default scoping of all variables in a region
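An illustrative fragment (not from the slides) combining these clauses; default(none) forces every variable to be given an explicit scope:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int n = 100, i;
    double a[100], sum = 0.0;

    for (i = 0; i < n; i++) a[i] = i;

    /* a and n are shared by the team; each thread gets a private copy of i;
       the reduction clause combines the per-thread partial sums safely.    */
    #pragma omp parallel for default(none) shared(a, n) private(i) reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* expect 4950.0 */
    return 0;
}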

Page 25: Big Iron and Parallel Processing


Some Useful Library routines

• omp_set_num_threads(integer)
• omp_get_num_threads()
• omp_get_max_threads()
• omp_get_thread_num()
• Others are implementation dependent
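For example (again an illustrative fragment, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);          /* request a team of 4 threads */
    printf("max threads = %d\n", omp_get_max_threads());

    #pragma omp parallel
    printf("thread %d of %d\n",
           omp_get_thread_num(),      /* my id within the team */
           omp_get_num_threads());    /* actual team size */
    return 0;
}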

Page 26: Big Iron and Parallel Processing


OpenMP Advice

• Always explicitly scope variables
• Never branch into/out of a parallel region
• Never put a barrier in an if block
• Avoid I/O in a parallel loop (it nearly guarantees a load imbalance)
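To illustrate the barrier rule (a sketch, not from the slides): an OpenMP barrier must be reached by every thread in the team, so a barrier inside a branch that only some threads execute will hang.

#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        /* WRONG (do not do this): if only some threads take the branch,
           they wait at the barrier for teammates that never arrive.

        if (tid % 2 == 0) {
            #pragma omp barrier
        }
        */

        /* RIGHT: every thread reaches the same barrier unconditionally. */
        #pragma omp barrier
        printf("thread %d past the barrier\n", tid);
    }
    return 0;
}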

Page 27: Big Iron and Parallel Processing


Exercise 2: OpenMP

• The example programs are in ~/OMP_F_examples or ~/OMP_C_examples

• Go to https://computing.llnl.gov/tutorials/openMP/

• Skip to step 4; the compilers are “icc” and “ifort”

• Work on this until I call end

Page 28: Big Iron and Parallel Processing

Next topic: MPI

• MPI = Message Passing Interface
• Can be used on a multicore CPU, but the main application is for multiple nodes
• The next slide is source code for the MPI hello world program we’ll run in a minute


Page 29: Big Iron and Parallel Processing


#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int myrank;
int ntasks;

int main(int argc, char **argv)
{
    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* get number of workers */
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Find out my identity in the default communicator;
       each task gets a unique rank between 0 and ntasks-1 */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    MPI_Barrier(MPI_COMM_WORLD);

    fprintf(stdout, "Hello from MPI_BABY=%d\n", myrank);
    MPI_Finalize();
    exit(0);
}

[Diagram: the same program runs as a separate task on Node 1, Node 2, …]

Page 30: Big Iron and Parallel Processing


Running mpi_baby:

cp -r /N/dc/scratch/usarray/MPI .

mpicc mpi_baby.c -o mpi_baby

mpirun -np 8 mpi_baby

mpirun -np 32 -machinefile my_list mpi_baby

Page 31: Big Iron and Parallel Processing


C AUTHOR: Blaise Barney
      program scatter
      include 'mpif.h'

      integer SIZE
      parameter(SIZE=4)
      integer numtasks, rank, sendcount, recvcount, source, ierr
      real*4 sendbuf(SIZE,SIZE), recvbuf(SIZE)

C Fortran stores this array in column major order, so the
C scatter will actually scatter columns, not rows.
      data sendbuf /1.0, 2.0, 3.0, 4.0,
     &              5.0, 6.0, 7.0, 8.0,
     &              9.0, 10.0, 11.0, 12.0,
     &              13.0, 14.0, 15.0, 16.0 /

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numtasks, ierr)

      if (numtasks .eq. SIZE) then
        source = 1
        sendcount = SIZE
        recvcount = SIZE
        call MPI_SCATTER(sendbuf, sendcount, MPI_REAL, recvbuf,
     &       recvcount, MPI_REAL, source, MPI_COMM_WORLD, ierr)
        print *, 'rank= ',rank,' Results: ',recvbuf
      else
        print *, 'Must specify',SIZE,' processors. Terminating.'
      endif

      call MPI_FINALIZE(ierr)
      end

From the man page:

MPI_Scatter - Sends data from one task to all tasks in a group …message is split into n equal segments, the ith segment is sent to the ith process in the group

Page 32: Big Iron and Parallel Processing


Some Linux tricks to get more information:

man -w MPI
ls /N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/share/man/man3
  MPI_Abort
  MPI_Allgather
  MPI_Allreduce
  MPI_Alltoall
  ...
  MPI_Wait
  MPI_Waitall
  MPI_Waitany
  MPI_Waitsome

mpicc --showme
  /N/soft/linux-rhel4-x86_64/intel/cce/10.1.022/bin/icc \
  -I/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/include \
  -pthread -L/N/soft/linux-rhel4-x86_64/openmpi/1.3.1/intel-64/lib \
  -lmpi -lopen-rte -lopen-pal -ltorque -lnuma -ldl \
  -Wl,--export-dynamic -lnsl -lutil -ldl -Wl,-rpath -Wl,/usr/lib64

Page 33: Big Iron and Parallel Processing


MPI Advice

• Never put a barrier in an if block
• Use care with non-blocking communication; things can pile up fast
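To illustrate the second point (a sketch, not from the slides): every MPI_Isend or MPI_Irecv returns a request that must eventually be completed with MPI_Wait or MPI_Waitall; posting many requests without completing them is how things pile up.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks, incoming;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* each rank exchanges a value with its neighbors in a ring */
    int right = (rank + 1) % nranks;
    int left  = (rank - 1 + nranks) % nranks;

    MPI_Irecv(&incoming, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&rank,     1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* complete BOTH requests before reusing or freeing the buffers */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, incoming, left);
    MPI_Finalize();
    return 0;
}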

Page 34: Big Iron and Parallel Processing


So, can I use MPI with OpenMP?

• Yes you can; extreme care is advised
• Some implementations of MPI forbid it
• You can get killed by “oversubscription” real fast; I’ve (Scott) seen run time increase like N²
• But sometimes you must… some FFTW libraries are OpenMP multithreaded, for example
• As things are going, this caution is likely to disappear
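A minimal hybrid sketch (an illustration, not from the slides): MPI_Init_thread asks the MPI library what level of thread support it provides before OpenMP threads are spawned inside each rank. Compile with something like mpicc -fopenmp (or -openmp for the Intel compilers used above).

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for FUNNELED support: only the main thread will make MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        printf("warning: MPI library offers thread level %d only\n", provided);

    /* OpenMP threads inside each MPI rank; keep ranks*threads <= cores
       to avoid the oversubscription problem mentioned above.           */
    #pragma omp parallel
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}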

Page 35: Big Iron and Parallel Processing


Exercise: MPI
• Examples are in ~/MPI_F_examples or ~/MPI_C_examples
• Go to https://computing.llnl.gov/tutorials/mpi/
• Skip to step 6. MPI compilers are “mpif90” and “mpicc”; normal (serial) compilers are “ifort” and “icc”.
• Compile your code: “make all” (overrides section 9)
• To run an MPI code: “mpirun -np 8 <exe>” …or…
• “mpirun -np 16 -machinefile <ask me> <exe>”
• Skip section 12
• There is no evaluation form.

Page 36: Big Iron and Parallel Processing


Where were those again?
• https://computing.llnl.gov/tutorials/openMP/excercise.html
• https://computing.llnl.gov/tutorials/mpi/exercise.html

Page 37: Big Iron and Parallel Processing


Acknowledgements

• This material is based upon work supported by the National Science Foundation under Grant Numbers 0116050 and 0521433. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).

• This work was supported in part by the Indiana Metabolomics and Cytomics Initiative (METACyt). METACyt is supported in part by Lilly Endowment, Inc.

• This work was supported in part by the Indiana Genomics Initiative. The Indiana Genomics Initiative of Indiana University is supported in part by Lilly Endowment, Inc.

• This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University.