Parallel and Pipeline Programming. Super-scalar, pipelined with vector instruction support.
Definitions
• Super-scalar - multiple integer or floating-point ALUs
• Pipeline - executes instructions in steps like an assembly line
• Stall - instruction execution state that delays a pipeline step
  – If an add takes 2 steps and there are two ALUs, then 3 adds in a row could cause a stall
Memory Hierarchy
Access Times
Hierarchy Access Times
| To Where | CPU Cycles |
|---|---|
| Register | <= 1 |
| L1d cache | ~3 |
| L2 cache | ~14 |
| L3 cache | ~30 |
| Main Memory | ~240 |
| Disk | ~7,000,000 |
Disk Transfer Time
• Fujitsu MHS2060AT 60GB Laptop Hard Drive
• 4200 RPM, 420 sectors/track, 512 B/sector
• One revolution (one track) takes 1/4200 min = 1/70 s
• 1 track/(1/70 s) x 420 sectors/track x 512 B/sector = 15,052,800 B/s
• = 15.05 MB/second transfer rate
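The arithmetic on this slide can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and it assumes the drive streams one full track per revolution with no seek or head-switch time.

```c
#include <assert.h>
#include <stdio.h>

/* Sustained transfer rate in bytes per second for a drive that reads
   one full track per revolution (seeks and head switches ignored). */
double disk_transfer_rate(double rpm, double sectors_per_track,
                          double bytes_per_sector)
{
    assert(rpm > 0 && sectors_per_track > 0 && bytes_per_sector > 0);
    double revolutions_per_second = rpm / 60.0;   /* 4200 RPM -> 70 rev/s */
    return revolutions_per_second * sectors_per_track * bytes_per_sector;
}

/* Prints the rate in decimal megabytes (10^6 bytes) per second. */
void print_transfer_rate(double rpm, double sectors_per_track,
                         double bytes_per_sector)
{
    double rate = disk_transfer_rate(rpm, sectors_per_track, bytes_per_sector);
    printf("%.2f MB/s\n", rate / 1e6);
}
```

Calling `print_transfer_rate(4200, 420, 512)` reproduces the slide's figure of 15.05 MB/s, where MB means 10^6 bytes.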
Key to effective cache utilization is program locality
• Temporal locality refers to the reuse of the same address within relatively small time durations.
• Spatial locality refers to the use of data within relatively "close" storage locations.
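A small sketch of both kinds of locality, assuming a matrix stored in row-major order as C does. The function names are hypothetical. Both functions return the same sum; only the order in which memory is touched differs.

```c
#include <assert.h>
#include <stddef.h>

/* Walks the matrix in storage order, so consecutive accesses touch
   neighbouring addresses: good spatial locality. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;               /* reused every iteration: temporal locality */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Same result, but consecutive accesses are cols * sizeof(int) bytes
   apart, so a large matrix tends to miss the cache far more often. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```

How large the gap is depends on the matrix size relative to the caches, so it is worth timing both on the target machine rather than assuming a figure.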
Page Fault-Rate Curve from OS
Multiprocessor Caching
Sequential Consistency
Reads and writes from each CPU reach memory in the order executed
Strict Consistency
• Reads and writes are seen in the same order by all processors
• In hardware, it is implemented by atomic instructions
Intel Compare-Exchange Semantics

    if (EAX == DEST) { ZF = 1; DEST = SRC; }
    else             { ZF = 0; EAX = DEST; }
Processor Code Improvement
GCC Optimization Options
-falign-functions=n -falign-jumps=n -falign-labels=n -falign-loops=n -falign-loops-max-skip=n -falign-jumps-max-skip=n
-fbounds-check -fmudflap -fmudflapth -fmudflapir -fbranch-probabilities -fprofile-values -fvpt -fbranch-target-load-optimize
-fbranch-target-load-optimize2 -fbtr-bb-exclusive -fcaller-saves -fcprop-registers -fcreate-profile -fcse-follow-jumps
-fcse-skip-blocks -fcx-limited-range -fdata-sections -fdelayed-branch -fdelete-null-pointer-checks -fearly-inlining
-fexpensive-optimizations -ffast-math -ffloat-store -fforce-addr -ffunction-sections -fgcse -fgcse-lm -fgcse-sm -fgcse-las
-fgcse-after-reload -fcrossjumping -fif-conversion -fif-conversion2 -finline-functions -finline-functions-called-once
-finline-limit=n -fkeep-inline-functions -fkeep-static-consts -flocal-alloc (APPLE ONLY)-fmerge-constants
-fmerge-all-constants -fmodulo-sched -fno-branch-count-reg -fno-default-inline -fno-defer-pop -fmove-loop-invariants
-fno-function-cse -fno-guess-branch-probability -fno-inline -fno-math-errno -fno-peephole -fno-peephole2
-funsafe-math-optimizations -funsafe-loop-optimizations -ffinite-math-only -fno-toplevel-reorder
-fno-trapping-math -fno-zero-initialized-in-bss -mstackrealign -fomit-frame-pointer -foptimize-register-move
-foptimize-sibling-calls -fprefetch-loop-arrays -fprofile-generate -fprofile-use -fregmove -frename-registers -freorder-blocks
-freorder-blocks-and-partition -freorder-functions -frerun-cse-after-loop -frounding-math -frtl-abstract-sequences
-fschedule-insns -fschedule-insns2 -fno-sched-interblock -fno-sched-spec -fsched-spec-load
-fsched-spec-load-dangerous -fsched-stalled-insns=n -fsched-stalled-insns-dep=n
-fsched2-use-superblocks -fsched2-use-traces -fsee -freschedule-modulo-scheduled-loops
-fsection-anchors -fsignaling-nans -fsingle-precision-constant -fstack-protector -fstack-protector-all -fstrict-aliasing
-fstrict-overflow -ftracer -fthread-jumps -funroll-all-loops -funroll-loops -fpeel-loops -fsplit-ivs-in-unroller -funswitch-loops
-fvariable-expansion-in-unroller -ftree-pre -ftree-ccp -ftree-dce -ftree-loop-optimize
-ftree-loop-linear -ftree-loop-im -ftree-loop-ivcanon -fivopts -ftree-dominator-opts
-ftree-dse -ftree-copyrename -ftree-sink -ftree-ch -ftree-sra -ftree-ter -ftree-lrs -ftree-fre
-ftree-vectorize -ftree-vect-loop-version -ftree-salias -fuse-profile -fipa-pta -fweb -ftree-copy-prop
-ftree-store-ccp -ftree-store-copy-prop -fwhole-program --param name=value
-O -O0 -O1 -O2 -O3 -Os -Oz (the -O levels are the most important)
Just-In-Time Runtime Optimization
Code Improvement Options
• Constant folding: x=3+4*5; becomes x=23;
• Constant propagation: n=3; b=y*n+n; becomes b=y*3+3;
• Assign variables to registers in C/C++: register int x,y;
• Operator strength reduction: x=y*3; becomes x=y+y+y;
• Peephole optimization (use architecture-specific instructions): a += 1;
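The first few transformations can be applied by hand to see that they preserve the result. This is a minimal sketch: the function names are hypothetical, and an optimizing compiler would normally do all of this itself.

```c
#include <assert.h>

/* As the programmer wrote it. */
int original(int y)
{
    int x = 3 + 4 * 5;      /* constant expression */
    int n = 3;
    int b = y * n + n;      /* n is a known constant here */
    int s = y * 3;          /* multiply by a small constant */
    return x + b + s;
}

/* After the transformations named on the slide. */
int improved(int y)
{
    int x = 23;             /* constant folding */
    int b = y * 3 + 3;      /* constant propagation */
    int s = y + y + y;      /* operator strength reduction */
    return x + b + s;
}

/* Checks that both versions agree for every y in [lo, hi]. */
void check_equivalence(int lo, int hi)
{
    for (int y = lo; y <= hi; y++)
        assert(original(y) == improved(y));
}
```

For example, `check_equivalence(-1000, 1000)` passes, and `improved(2)` returns 23 + 9 + 6 = 38.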
• Compiler option to target different architectures (386, 486, Pentium, i5)
• Aligning data structures on natural boundaries (unaligned data accesses fault on some architectures and can be slower on the others)
• Common sub-expression optimization: x=(n+2)*y; y=z/(n+2); becomes t=(n+2); then use t in both statements
• Inline functions (treat function definition as a macro and substitute the text at every call)
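A sketch of the common sub-expression example, written out before and after. The `pair` struct and the function names are hypothetical and exist only so that both results can be returned and compared.

```c
#include <assert.h>

typedef struct { int x, y; } pair;   /* hypothetical holder for both results */

/* n + 2 is evaluated twice. */
pair without_cse(int n, int y, int z)
{
    assert(n != -2);                 /* avoid dividing by zero */
    pair r;
    r.x = (n + 2) * y;
    r.y = z / (n + 2);
    return r;
}

/* n + 2 is evaluated once and reused through t. */
pair with_cse(int n, int y, int z)
{
    assert(n != -2);
    int t = n + 2;
    pair r;
    r.x = t * y;
    r.y = z / t;
    return r;
}
```

With n = 3, y = 4, z = 20, both versions give x = 20 and y = 4.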
• Invariant code motion out of loops: in while (x++<Y) { z += p+n*6; p--; } the value n*6 never changes, so it can be computed once before the loop
• Loop fusion (make one loop out of two or more; OpenMP's collapse clause does something similar for nested loops)
• Loop unrolling (reduce the iteration count by a factor of n, replicate the loop body n times)
• Loop interchange (change nesting order of loops, which may enable other optimizations)
• Loop blocking or tiling (replace array processing by two loops to divide the iteration space into smaller blocks to minimize cache misses)
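Two of these transformations sketched by hand: invariant code motion on the slide's own loop, and tiling applied to a matrix transpose. The function names, the transpose workload and the `TILE` size of 4 are assumptions chosen for illustration; a real tile size would be tuned to the cache.

```c
#include <assert.h>

/* Slide example as written: n * 6 is evaluated on every pass. */
int loop_original(int x, int limit, int p, int n)
{
    int z = 0;
    while (x++ < limit) {
        z += p + n * 6;
        p--;
    }
    return z;
}

/* Invariant code motion: n * 6 is computed once, outside the loop. */
int loop_hoisted(int x, int limit, int p, int n)
{
    int z = 0;
    int n6 = n * 6;
    while (x++ < limit) {
        z += p + n6;
        p--;
    }
    return z;
}

#define TILE 4   /* assumed block edge, for illustration only */

/* Plain transpose of an n x n row-major matrix. */
void transpose_plain(const int *src, int *dst, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Tiled transpose: the outer two loops pick a TILE x TILE block, the
   inner two loops finish that block before moving to the next one. */
void transpose_tiled(const int *src, int *dst, int n)
{
    assert(src != dst);
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both transposes produce the same matrix; the tiled one keeps its working set to a block that is more likely to stay in cache when n is large.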
• Omit frame pointers (procedure entry-exit code can be simplified when procedure call chain is deterministic)
Max Function

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define N 20000000

    /* Returns the index of the largest element of a[0..n-1]. */
    int array_int_max(int a[], int n) {
        int i, max = 0;
        for (i = 1; i < n; i++)
            if (a[max] < a[i]) max = i;
        return max;
    }

    int test[N];

    int main(int argc, char *argv[]) {
        int i;
        clock_t start;   /* clock_t rather than int, so the tick count is not truncated */
        for (i = 0; i < N; i++) test[i] = rand();
        start = clock();
        i = array_int_max(test, N);
        printf("clock=%ld index=%d max=%d\n", (long)(clock() - start), i, test[i]);
        return 0;
    }

OUTPUT
clock=96782 index=1310 max=2147483531
Microsoft Visual Studio Timings for Max Function
| Build Options | Clock() Timing |
|---|---|
| Debug | 131 |
| Release, Optimization disabled, Favor small code, no whole program optimization | 127 |
| Release, Minimize size, Favor small code, no whole program optimization | 42 |
| Release, Minimize size, Favor fast code, no whole program optimization | 29 |
| Release, Minimize size, Favor fast code, Whole program optimization | 29 |
| Release, Maximize speed, Favor fast code, Whole program optimization | 31 |
| #pragma omp sections, 2 threads | 37 |
Pipeline Hazards
• Structural hazard
  – Hardware resource conflicts prevent overlapped execution.
• Control hazard
  – Arises when any instruction, such as a branch, changes the instruction pointer register (IP). The choices are to stall after the branch's instruction fetch (IF), to undo the instructions fetched from the path that was not taken, or to predict where every branch is going.
• Data hazard
  – An instruction produces output or an action that is needed by a later instruction's pipeline stage before that result is ready.
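One common way to soften a data hazard in source code is to break a single dependency chain into several independent ones. The sketch below is an illustration with hypothetical function names: it sums an array once with one accumulator and once with two. Whether the second form is actually faster depends on the compiler and the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Every add needs the previous sum, so the adds form one dependency
   chain and a pipelined adder cannot overlap them. */
long sum_one_chain(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two independent accumulators give the hardware two chains that it
   can interleave in the pipeline. */
long sum_two_chains(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long even = 0, odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += a[i];
        odd  += a[i + 1];
    }
    if (i < n)                  /* leftover element when n is odd */
        even += a[i];
    return even + odd;
}
```

Both functions return the same total for any input, since integer addition can be regrouped freely.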
Pipeline Optimization
Loop Unrolling

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define N 512   /* assumed: the slide does not give N; must be a multiple of 4 */

    int A[N][N], B[N][N], C[N][N];

    int main(int argc, char *argv[]) {
        int i, j, k;
        clock_t z;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = rand() % 100;   /* small values so the products cannot overflow int */
                B[i][j] = A[i][j] + 1;
                C[i][j] = A[i][j] - 1;
            }
        z = clock();
        for (i = 0; i < N; i += 4)        // increment by unrolling factor
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++) {
                    // 8301 clocks, no unrolling
                    A[i][j] = A[i][j] + B[i][k] * C[k][j];
                    // 4281 clocks, 2 statements
                    A[i+1][j] = A[i+1][j] + B[i+1][k] * C[k][j];
                    // 3251 clocks, 3 statements
                    A[i+2][j] = A[i+2][j] + B[i+2][k] * C[k][j];
                    // 3063 clocks, 4 statements
                    A[i+3][j] = A[i+3][j] + B[i+3][k] * C[k][j];
                }
        printf("clock=%ld\n", (long)(clock() - z));
        return 0;
    }
Software Pipelining
• Loop over statements, each statement is dependent on the previous statement.
  – a_i, b_i, c_i
• Loop unrolling would result in
  – a_i, b_i, c_i, a_{i+1}, b_{i+1}, c_{i+1}
• However, the dependency (data hazard) between b and a and between c and b still exists.
• Software pipelining changes the loop to contain
  – a_i, a_{i+1}, b_i, b_{i+1}, c_i, c_{i+1}
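The reordering on this slide can be sketched in C. The three stage bodies (multiply by 3, add 7, square) and the function names are made up for illustration; only the ordering of a, b and c follows the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: b waits for a, and c waits for b, in every
   iteration. */
void stages_in_order(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        int a = in[i] * 3;      /* a_i */
        int b = a + 7;          /* b_i depends on a_i */
        int c = b * b;          /* c_i depends on b_i */
        out[i] = c;
    }
}

/* Slide ordering: a_i, a_{i+1}, b_i, b_{i+1}, c_i, c_{i+1}.
   Neighbouring statements no longer depend on each other. */
void stages_interleaved(const int *in, int *out, size_t n)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        int a0 = in[i] * 3;
        int a1 = in[i + 1] * 3;
        int b0 = a0 + 7;
        int b1 = a1 + 7;
        int c0 = b0 * b0;
        int c1 = b1 * b1;
        out[i] = c0;
        out[i + 1] = c1;
    }
    if (i < n) {                /* leftover iteration when n is odd */
        int b = in[i] * 3 + 7;
        out[i] = b * b;
    }
}
```

Each statement in the interleaved loop has an independent neighbour, which gives the pipeline something to issue while the previous result is still in flight.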
Vector Instruction Data Types
Vector Instruction Processing
Intel SSE
Vector Max Function

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #define N 20000000

    /* vInt32, vCopy, vMax_int, vSplat_int and vExtract_int are vector helpers
       that the slide uses without showing their definitions. */
    /* a[] holds n vectors of four ints; n is assumed to be a multiple of 4. */
    int array_int_max(vInt32 a[], int n) {
        int i;
        vInt32 max, temp, temp1;
        vCopy(max, a[0]);
        for (i = 0; i < n; i += 4) {   /* start at 0 so a[i+3] never passes a[n-1] */
            vMax_int(temp, a[i], a[i+1]);
            vMax_int(temp1, a[i+2], a[i+3]);
            vMax_int(max, temp, max);
            vMax_int(max, temp1, max);
        }
        /* Fold the four lane maxima into lane 3. */
        vSplat_int(temp, max, 0); vMax_int(max, temp, max);
        vSplat_int(temp, max, 1); vMax_int(max, temp, max);
        vSplat_int(temp, max, 2); vMax_int(max, temp, max);
        return vExtract_int(max, 3);
    }

    int test[N];

    int main(int argc, char *argv[]) {
        int i;
        clock_t j;
        for (i = 0; i < N; i++) test[i] = rand();
        j = clock();
        i = array_int_max((vInt32 *) test, N/4);
        printf("clock=%ld max=%d\n", (long)(clock() - j), i);
        return 0;
    }

Trace for the 16 values 0..15 stored as four vectors a[0]..a[3] (n = 4, one loop iteration with i = 0):

| Statement | Result |
|---|---|
| a[0], a[1], a[2], a[3] | 0 1 2 3, 4 5 6 7, 8 9 10 11, 12 13 14 15 |
| vCopy(max,a[0]); | max = 0 1 2 3 |
| vMax_int(temp,a[i],a[i+1]); | temp = 4 5 6 7 |
| vMax_int(temp1,a[i+2],a[i+3]); | temp1 = 12 13 14 15 |
| vMax_int(max,temp,max); | max = 4 5 6 7 |
| vMax_int(max,temp1,max); | max = 12 13 14 15 |
| vSplat_int(temp,max,0); | temp = 12 12 12 12 |
| vMax_int(max,temp,max); | max = 12 13 14 15 |
| vSplat_int(temp,max,1); | temp = 13 13 13 13 |
| vMax_int(max,temp,max); | max = 13 13 14 15 |
| vSplat_int(temp,max,2); | temp = 14 14 14 14 |
| vMax_int(max,temp,max); | max = 14 14 14 15 |
| return vExtract_int(max,3); | 15 |
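Because the slide's vector helpers are not defined, here is a self-contained sketch of the same idea using the GCC/Clang `vector_size` extension instead. This swaps in a different mechanism from the slide's macros, and the names `v4si`, `v4_max` and `array_int_max_v4` are hypothetical. It assumes a GCC-compatible compiler and a length that is a positive multiple of 4.

```c
#include <assert.h>
#include <string.h>

/* Four 32-bit ints in one 128-bit value (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Lane-wise maximum: the comparison yields -1 (all bits set) in lanes
   where x > y and 0 elsewhere, which is then used as a select mask. */
static v4si v4_max(v4si x, v4si y)
{
    v4si take_x = x > y;
    return (x & take_x) | (y & ~take_x);
}

/* Returns the largest of a[0..n-1]; n must be a positive multiple of 4. */
int array_int_max_v4(const int *a, int n)
{
    assert(a != NULL && n >= 4 && n % 4 == 0);
    v4si max, chunk;
    memcpy(&max, a, sizeof max);            /* plays the role of vCopy(max, a[0]) */
    for (int i = 4; i < n; i += 4) {
        memcpy(&chunk, a + i, sizeof chunk);    /* load without alignment assumptions */
        max = v4_max(max, chunk);
    }
    /* Horizontal step: reduce the four lane maxima to one scalar. */
    int best = max[0];
    for (int lane = 1; lane < 4; lane++)
        if (max[lane] > best)
            best = max[lane];
    return best;
}
```

The final scalar loop does the job of the splat-and-max sequence on the slide: after the main loop each lane holds the maximum of every fourth element, and those four values still have to be combined.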
CPU versus GPU Parallelism
Nvidia Tesla GPU
OpenCL Execution Model
• Context
  – Defines the target execution environment for a Program. A Context can include multiple GPUs and a CPU.
• Kernel
  – A C-like method executed on a streaming processor (also referred to as a processing element).
  – Kernel code only uses registers, no stack and no heap. Kernel code that uses more registers than are available may fail to load or execute inefficiently.
  – No nested kernel calls, no recursion.
  – Kernels are compiled for every device in a context.
• Kernel Arguments
  – Scalar
  – Vector (128 bits, 4 floats or ints, 2 doubles)
  – Pointer to a 1-d sequence of values no matter what the shape of the data.
• Program
  – Collection of kernels. Must be dynamically loaded into one or more CPU/GPUs.
OpenCL GPU Storage Model
| Memory | Access Speed | Visibility | GPU Access | Host Access |
|---|---|---|---|---|
| Private | Faster | Work Item | Read/Write | None |
| Local | Faster | Work Group | Read/Write | None |
| Constant | Slower | NDRange | Read | Write |
| Global | Slower | NDRange | Read/Write | Read/Write |
OpenCL Vector Addition
const char * sProgramSource =
"__kernel void vectorAdd( \n" \
"__global const float * a, \n" \
"__global const float * b, \n" \
"__global float * c) \n" \
"{ \n" \
" // Vector element index \n" \
" int nIndex = get_global_id(0); \n" \
" c[nIndex] = a[nIndex] + b[nIndex]; \n" \
"} \n";
OpenCL Vector Addition
• No use of private or local storage
• References to __global memory are slow
• Computation per PE (processing element) is too little
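One way to address the last point is to give each work item several elements instead of one. The sketch below is plain C that emulates the idea on the host; it is not OpenCL code and not the slide author's kernel. The `work_item` parameter stands in for `get_global_id(0)`, and `ITEMS_PER_WORK_ITEM` and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ITEMS_PER_WORK_ITEM 4   /* assumed; chosen only for illustration */

/* Stand-in for a kernel body. In OpenCL the index would come from
   get_global_id(0); here the caller passes it in. */
void vector_add_work_item(size_t work_item, const float *a, const float *b,
                          float *c, size_t n)
{
    size_t first = work_item * ITEMS_PER_WORK_ITEM;
    for (size_t k = 0; k < ITEMS_PER_WORK_ITEM && first + k < n; k++) {
        float x = a[first + k];     /* copies held in local variables, */
        float y = b[first + k];     /* the analogue of private storage */
        c[first + k] = x + y;
    }
}

/* Host-side emulation of the NDRange: run every work item in turn. */
void vector_add_emulated(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t work_items = (n + ITEMS_PER_WORK_ITEM - 1) / ITEMS_PER_WORK_ITEM;
    for (size_t id = 0; id < work_items; id++)
        vector_add_work_item(id, a, b, c, n);
}
```

On a real GPU the best assignment of elements to work items also depends on memory access patterns, so a strided assignment may suit a device better than the contiguous one used here. Vector addition also has no data reuse, so private or local storage helps far less than in kernels that read each value many times.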