Tools for High Performance Scientific Computing
![Page 1: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/1.jpg)
Tools for High Performance Scientific Computing
http://www.cs.berkeley.edu/~yelick/
Kathy Yelick
U.C. Berkeley
![Page 2: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/2.jpg)
HPC Problems and Approaches
• Parallel machines are too hard to program
  • Users “left behind” with each new major generation
• Efficiency is too low
  • Even after a large programming effort
  • Single-digit efficiency numbers are common
• Approach
  • Titanium: A modern (Java-based) language that provides performance transparency
  • Sparsity: Self-tuning scientific kernels
  • IRAM: Integrated processor-in-memory
![Page 3: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/3.jpg)
Titanium: A Global Address Space Language Based on Java
• Faculty
  • Susan Graham
  • Paul Hilfinger
  • Katherine Yelick
  • Alex Aiken
• LBNL collaborators
  • Phillip Colella
  • Peter McCorquodale
  • Mike Welcome
• Students
  • Dan Bonachea
  • Szu-Huey Chang
  • Carrie Fei
  • Ben Liblit
  • Robert Lin
  • Geoff Pike
  • Jimmy Su
  • Ellen Tsai
  • Mike Welcome (LBNL)
  • Siu Man Yau
http://titanium.cs.berkeley.edu/
![Page 4: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/4.jpg)
Global Address Space Programming
• Intermediate point between message passing and shared memory
• Program consists of a collection of processes.
  • Fixed at program startup time, like MPI
• Local and shared data, as in shared memory model
  • But, shared data is partitioned over local processes
  • Remote data stays remote on distributed memory machines
  • Processes communicate by reads/writes to shared variables
• Note: These are not data-parallel languages
• Examples are UPC, Titanium, CAF, Split-C
  • E.g., http://upc.nersc.gov
![Page 5: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/5.jpg)
Titanium Overview
Object-oriented language based on Java with:
• Scalable parallelism
  • SPMD model with global address space
• Multidimensional arrays
  • points and index sets as first-class values
• Immutable classes
  • user-definable non-reference types for performance
• Operator overloading
  • by demand from our user community
• Semi-automated memory management
  • uses memory regions for high performance
![Page 6: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/6.jpg)
SciMark Benchmark
• Numerical benchmark for Java, C/C++
• Five kernels:
  • FFT (complex, 1D)
  • Successive Over-Relaxation (SOR)
  • Monte Carlo integration (MC)
  • Sparse matrix multiply
  • Dense LU factorization
• Results are reported in Mflops
• Download and run on your machine from:
  • http://math.nist.gov/scimark2
  • C and Java sources also provided
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
![Page 7: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/7.jpg)
SciMark: Java vs. C (Sun UltraSPARC 60)
[Bar chart: MFlops (axis 0 to 90) for the FFT, SOR, MC, Sparse, and LU kernels; series: C, Java]
* Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
Roldan Pozo, NIST, http://math.nist.gov/~Rpozo
![Page 8: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/8.jpg)
Can we do better without the JVM?
• Pure Java with a JVM (and JIT)
  • Within 2x of C and sometimes better
  • OK for many users, even those using high end machines
  • Depends on quality of both compilers
• We can try to do better using a traditional compilation model
  • E.g., Titanium compiler at Berkeley
    • Compiles Java extension to C
    • Does not optimize Java arrays or for loops (prototype)
![Page 9: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/9.jpg)
Java Compiled by Titanium Compiler
Performance on a Sun Ultra 4
[Bar chart: MFlops (axis 0 to 70) for Overall, FFT, SOR, MC, Sparse, and LU; series: Java, C, Ti, Ti -nobc]
![Page 10: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/10.jpg)
SciMark on Pentium III (550 MHz)
![Page 11: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/11.jpg)
SciMark on Pentium III (550 MHz)
![Page 12: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/12.jpg)
Language Support for Performance
• Multidimensional arrays
  • Contiguous storage
  • Support for sub-array operations without copying
• Support for small objects
  • E.g., complex numbers
  • Called “immutables” in Titanium
  • Sometimes called “value” classes
• Unordered loop construct
  • Programmer specifies that iterations are independent
  • Eliminates need for dependence analysis (short-term solution?). Used by vectorizing compilers.
![Page 13: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/13.jpg)
Optimizing Parallel Code
• Compiler writers would like to move code around
• The hardware folks also want to build hardware that dynamically moves operations around
• When is reordering correct?
  • Because the programs are parallel, there are more restrictions, not fewer
  • The reason is that we have to preserve semantics of what may be viewed by other processors
![Page 14: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/14.jpg)
Sequential Consistency
• Given a set of executions from n processors, each defines a total order Pi.
• The program order is the partial order given by the union of these Pi’s.
• The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order.
write x =1 read y 0
write y =3 read z 2
read x 1 read y 3
When this is serialized, the read and write semantics must be preserved.
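The definition above can be checked by brute force for tiny executions. The following is a minimal sketch, not part of the original slides: the function names and the `("w" | "r", variable, value)` encoding are assumptions made for illustration. It enumerates every interleaving that respects each thread's program order and asks whether any of them explains the values the reads observed.

```python
def interleavings(threads):
    """Yield every merge of the per-thread op lists that keeps each thread's own order."""
    if all(len(t) == 0 for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def is_sequentially_consistent(threads, initial=0):
    """threads: list of op lists; an op is ("w", var, value) or ("r", var, observed)."""
    for order in interleavings(threads):
        memory = {}
        ok = True
        for kind, var, value in order:
            if kind == "w":
                memory[var] = value
            elif memory.get(var, initial) != value:
                ok = False  # this total order cannot produce the observed read
                break
        if ok:
            return True  # found a total order consistent with program order
    return False


# Hypothetical two-thread executions (all variables initially 0).
sc_execution = [[("w", "data", 1), ("w", "flag", 1)],
                [("r", "data", 1), ("r", "flag", 0)]]
non_sc_execution = [[("w", "data", 1), ("w", "flag", 1)],
                    [("r", "flag", 1), ("r", "data", 0)]]
```

The enumeration is exponential in the number of operations, so it is only useful as an executable restatement of the definition, not as an analysis a compiler could run.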
![Page 15: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/15.jpg)
Use of Memory Fences
• Memory fences can turn a weak memory model into sequential consistency under proper synchronization:
  • Add a read fence to the acquire lock operation
  • Add a write fence to the release lock operation
• In general, a language can have a stronger model than the machine it runs on if the compiler is clever
• The language may also have a weaker model, if the compiler does any optimizations
![Page 16: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/16.jpg)
Compiler Analysis Overview
• When compiling sequential programs, compute dependencies:

x = expr1;
y = expr2;

may be reordered to

y = expr2;
x = expr1;

Valid if y not in expr1 and x not in expr2 (roughly)
• When compiling parallel code, we need to consider accesses by other processors.

Initially flag = data = 0
Proc A          Proc B
data = 1;       while (flag == 0);
flag = 1;       ... = ...data...;
![Page 17: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/17.jpg)
Cycle Detection
• Processors define a “program order” on accesses from the same thread
  • P is the union of these total orders
• Memory system defines an “access order” on accesses to the same variable
  • A is access order (read/write & write/write pairs)
• A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir]
write data read flag
write flag read data
![Page 18: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/18.jpg)
Cycle Analysis Intuition
• Definition is based on execution model, which allows you to answer the question: Was this execution sequentially consistent?
• Intuition:
  • Time cannot flow backwards
  • Need to be able to construct total order
• Examples (all variables initially 0)

write data 1 read flag 1
write flag 1 read data 0

write data 1 read data 1
write flag 1 read flag 0
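The two examples can be worked through as graphs. The sketch below is an illustration under stated assumptions: the node names and the way access-order edges are derived from the observed values are my own encoding, not taken from the slides. A read that sees the written value gets an edge from the write; a read that still sees the initial 0 gets an edge to the write. The execution violates sequential consistency exactly when P ∪ A contains a cycle.

```python
def has_cycle(nodes, edges):
    """Depth-first search for a directed cycle in the graph (nodes, edges)."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}

    def visit(n):
        color[n] = GREY
        for m in succ[n]:
            if color[m] == GREY:  # back edge: m is on the current DFS path
                return True
            if color[m] == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


nodes = ["w_data", "w_flag", "r_flag", "r_data"]

# First example: read flag sees 1, then read data sees 0.
P1 = [("w_data", "w_flag"), ("r_flag", "r_data")]   # program order
A1 = [("w_flag", "r_flag"), ("r_data", "w_data")]   # access order from observed values

# Second example: read data sees 1, then read flag sees 0.
P2 = [("w_data", "w_flag"), ("r_data", "r_flag")]
A2 = [("w_data", "r_data"), ("r_flag", "w_flag")]

first_violates_sc = has_cycle(nodes, P1 + A1)
second_violates_sc = has_cycle(nodes, P2 + A2)
```

Under this encoding the first execution contains the cycle write data → write flag → read flag → read data → write data, while the second has no cycle and can be serialized.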
![Page 19: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/19.jpg)
Cycle Detection Generalization
• Generalizes to arbitrary numbers of variables and processors
• Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor
• Can simplify the analysis by assuming all processors run a copy of the same code
write x write y read y
read y write x
![Page 20: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/20.jpg)
Static Analysis for Cycle Detection
• Approximate P by the control flow graph
• Approximate A by undirected “conflict” edges
  • Bi-directional edge between accesses to the same variable in which at least one is a write
  • It is still correct if the conflict edge set is a superset of the reality
• Let the “delay set” D be all edges from P that are part of a minimal cycle
  • The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code)
[Diagram: conflict graph over the accesses write y, write z, read y, write z, read x]
![Page 21: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/21.jpg)
Cycle Detection in Practice
• Cycle detection was implemented in a prototype version of the Split-C and Titanium compilers.
  • Split-C version used many simplifying assumptions.
  • Titanium version had too many conflict edges.
• What is needed to make it practical?
  • Finding possibly-concurrent program blocks
    • Use SPMD model rather than threads to simplify
    • Or apply data race detection work for Java threads
  • Compute conflict edges
    • Need good alias analysis
    • Reduce size by separating shared/private variables
  • Synchronization analysis
![Page 22: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/22.jpg)
Communication Optimizations
• Data on an old machine, UCB NOW, using a simple subset of C
[Chart: Time (normalized)]
![Page 23: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/23.jpg)
Global Address Space
• To run shared memory programs on distributed memory hardware, we replace references (pointers) by global ones:
  • May point to remote data
  • Useful in building large, complex data structures
  • Easy to port shared-memory programs (functionality is correct)
  • Uniform programming model across machines
  • Especially true for cluster of SMPs
• Usual implementation
  • Each reference contains:
    • Processor id (or process id on cluster of SMPs)
    • And a memory address on that processor
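The "usual implementation" can be pictured with a toy model. This is a sketch under assumptions, not Titanium's actual runtime: `GlobalPtr`, `Machine`, and the per-processor list used as a heap are hypothetical names chosen for illustration. It shows a reference carrying a processor id and an address, and the locality check that every dereference pays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalPtr:
    proc: int  # owning processor id
    addr: int  # address (here: a list index) in that processor's memory


class Machine:
    """Toy model: one private heap per processor, no real network."""

    def __init__(self, nprocs, heap_size):
        self.heaps = [[0] * heap_size for _ in range(nprocs)]
        self.remote_accesses = 0  # dereferences that would need a message

    def read(self, me, ptr):
        if ptr.proc != me:  # the locality check paid on every global dereference
            self.remote_accesses += 1
        return self.heaps[ptr.proc][ptr.addr]

    def write(self, me, ptr, value):
        if ptr.proc != me:
            self.remote_accesses += 1
        self.heaps[ptr.proc][ptr.addr] = value


machine = Machine(nprocs=2, heap_size=4)
p = GlobalPtr(proc=1, addr=3)
machine.write(0, p, 42)        # processor 0 writes remote data
local_value = machine.read(1, p)   # the owner reads it locally
```

In a real system the remote branch would issue a communication call rather than index another list; the counter only marks where that cost would occur.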
![Page 24: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/24.jpg)
Use of Global / Local
• Global pointers are more expensive than local
  • When data is remote, a dereference turns into a remote read or write, which is a message call of some kind
  • When the data is not remote, there is still an overhead
    • space (processor number + memory address)
    • dereference time (check to see if local)
• Conclusion: not all references should be global; use normal references when possible.
  • Titanium adds “local qualifier” to language
![Page 25: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/25.jpg)
Local Pointer Analysis
• Compiler can infer locals using Local Qualification Inference
• Data structures must be well partitioned
[Bar chart "Effect of LQI": running time (sec, axis 0 to 250) for the applications cannon, lu, sample, gsrb, and poison; series: Original, After LQI]
![Page 26: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/26.jpg)
Region-Based Memory Management
• Processes allocate locally
• References can be passed to other processes

class C { int val; ... }
C gv;        // global pointer
C local lv;  // local pointer
if (thisProc() == 0) {
  lv = new C();
}
gv = broadcast lv from 0;
gv.val = ...; ... = gv.val;

[Diagram: lv and gv pointers on Process 0 and on the other processes, each process with its own LOCAL HEAP]
![Page 27: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/27.jpg)
Parallel Applications
• Genome application
• Heart simulation
• AMR elliptic and hyperbolic solvers
• Scalable Poisson for infinite domains
• Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join,
![Page 28: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/28.jpg)
Heart Simulation
• Problem: compute blood flow in the heart
  • Modeled as an elastic structure in an incompressible fluid.
  • The “immersed boundary method” [Peskin and McQueen].
  • 20 years of development in model
• Many other applications: blood clotting, inner ear, paper making, embryo growth, and more
• Can be used for design of prosthetics
  • Artificial heart valves
  • Cochlear implants
![Page 29: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/29.jpg)
AMR Gas Dynamics
• Developed by McCorquodale and Colella
• 2D Example (3D supported)
  • Mach-10 shock on solid surface at oblique angle
• Future: Self-gravitating gas dynamics package
![Page 30: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/30.jpg)
Benchmarks for GAS Languages
• EEL: End to end latency, or time spent sending a short message between two processes.
• BW: Large message network bandwidth
• Parameters of the LogP Model
  • L: “Latency”, or time spent on the network
    • During this time, processor can be doing other work
  • o: “Overhead”, or processor busy time on the sending or receiving side.
    • During this time, processor cannot be doing other work
    • We distinguish between “send” and “recv” overhead
  • g: “gap”, the rate at which messages can be pushed onto the network.
  • P: the number of processors
• This work was done with the UPC group at LBL
![Page 31: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/31.jpg)
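LogP: Overhead & Latency
• Non-overlapping overhead: EEL = osend + L + orecv
• Send and recv overhead can overlap: EEL = f(osend, L, orecv)
[Timeline diagrams for P0 and P1: in the non-overlapping case osend on P0 is followed by L on the network and then orecv on P1; in the overlapping case osend and orecv overlap]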
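For the non-overlapping case the relation is simple arithmetic. The snippet below is a small sketch with made-up numbers (the microsecond values are hypothetical, not measurements from this talk); it shows how the network latency L can be backed out once EEL and the two overheads have been measured. The overlapping case is left out because the slide gives it only as an unspecified function f.

```python
def end_to_end_latency(o_send, latency, o_recv):
    """EEL for the non-overlapping case: EEL = osend + L + orecv."""
    return o_send + latency + o_recv


def added_latency(eel, o_send, o_recv):
    """Back out L from a measured EEL and measured overheads (non-overlapping case)."""
    return eel - o_send - o_recv


# Hypothetical values in microseconds.
eel = end_to_end_latency(o_send=2.0, latency=5.0, o_recv=3.0)
recovered_latency = added_latency(eel, o_send=2.0, o_recv=3.0)
```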
![Page 32: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/32.jpg)
Benchmarks
• Designed to measure the network parameters
  • Also provide: gap as function of queue depth
  • Measured for “best case” in general
• Implemented once in MPI
  • For portability and comparison to target specific layer
• Implemented again in target specific communication layer:
  • LAPI
  • ELAN
  • GM
  • SHMEM
  • VIPL
![Page 33: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/33.jpg)
Results: EEL and Overhead
[Bar chart: time in usec (axis 0 to 25) for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL; series: Send Overhead (alone), Send & Rec Overhead, Rec Overhead (alone), Added Latency]
![Page 34: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/34.jpg)
Results: Gap and Overhead
[Bar chart: time in usec (axis 0.0 to 20.0); series: Gap, Send Overhead, Receive Overhead; data labels in extraction order: 6.7, 1.2, 0.2, 8.2, 9.5, 6.0, 1.6, 6.5, 10.3, 17.8, 7.8, 4.6]
![Page 35: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/35.jpg)
Send Overhead Over Time
• Overhead has not improved significantly; T3D was best
  • Lack of integration; lack of attention in software
[Scatter plot: send overhead in usec (axis 0 to 14) against year (approximate, 1990 to 2002); labeled points: NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SP3, SCI, Compaq, Dolphin, Myrinet, Myrinet2K]
![Page 36: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/36.jpg)
Summary
• Global address space languages offer alternative to MPI for large machines
  • Easier to use: shared data structures
  • Recover users left behind on shared memory?
  • Performance tuning still possible
• Implementation
  • Small compiler effort given lightweight communication
  • Portable communication layer: GASNet
  • Difficulty with small message performance on IBM SP platform
![Page 37: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/37.jpg)
Future Plans
• Merge communication layer with UPC
  • “Unified Parallel C” has broad vendor support.
  • Uses same execution model as Titanium
  • Push vendors to expose low-overhead communication
• Automated communication overlap
• Analysis and refinement of cache optimizations
• Additional support for unstructured grids
  • Conjugate gradient and particle methods are motivations
• Better uniprocessor optimizations, possibly new arrays
![Page 38: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/38.jpg)
Sparsity: Self-Tuning Scientific Kernels
• Faculty
  • James Demmel
  • Katherine Yelick
• Graduate Students
  • Rich Vuduc
  • Eun-Jin Im
• Undergraduates
  • Shoaib Kamil
  • Rajesh Nishtala
  • Benjamin Lee
  • Hyun-Jin Moon
  • Atilla Gyulassy
  • Tuyet-Linh Phan
http://www.cs.berkeley.edu/~yelick/sparsity
![Page 39: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/39.jpg)
Context: High-Performance Libraries
• Application performance dominated by a few computational kernels
• Today: Kernels hand-tuned by vendor or user
• Performance tuning challenges
  • Performance is a complicated function of kernel, architecture, compiler, and workload
  • Tedious and time-consuming
• Successful automated approaches
  • Dense linear algebra: PHiPAC/ATLAS
  • Signal processing: FFTW/SPIRAL/UHFFT
![Page 40: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/40.jpg)
Tuning pays off – ATLAS
Extends applicability of PHiPAC; incorporated in Matlab (with rest of LAPACK)
![Page 41: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/41.jpg)
Tuning Sparse Matrix Kernels
• Performance tuning issues in sparse linear algebra
  • Indirect, irregular memory references
  • High bandwidth requirements, poor instruction mix
  • Performance depends on architecture, kernel, and matrix
  • How to select data structures, implementations? at run-time?
  • Typical performance: < 10% machine peak
• Our approach to automatic tuning: for each kernel,
  • Identify and generate a space of implementations
  • Search the space to find the fastest one (models, experiments)
![Page 42: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/42.jpg)
Sparsity System Organization
• Optimizations depend on machine and matrix structure
• Choosing optimization is expensive
[Diagram with boxes: Sparsity machine profiler, Machine Profile, Representative Matrix, Maximum # vectors, Sparsity optimizer, Data Structure Definition & Code, Matrix Conversion routine]
![Page 43: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/43.jpg)
Sparse Kernels and Optimizations
• Kernels
  • Sparse matrix-vector multiply (SpMV): $y = A x$
  • Sparse triangular solve (SpTS): $x = T^{-1} b$
  • $y = A^T A x$, $y = A A^T x$
  • Powers ($y = A^k x$), sparse triple-product ($R A R^T$), …
• Optimization (implementation) space
  • A has special structure (e.g., symmetric, banded, …)
  • Register blocking
  • Cache blocking
  • Multiple dense vectors (x)
  • Hybrid data structures (e.g., splitting, switch-to-dense, …)
  • Matrix reordering
![Page 44: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/44.jpg)
Register Blocking Optimization
• Identify small dense blocks of nonzeros.
• Fill in extra zeros to complete blocks.
• Use an optimized multiplication code for the particular block size.
• Improves register reuse, lowers indexing overhead.
• Filling in zeros increases storage and computation.
[Figure: example sparse matrix and its 2x2 register blocked form, with explicit zeros filled in to complete the blocks]
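Register blocking can be sketched with a block compressed sparse row (BCSR) layout. The code below is an illustration under assumptions, not the code Sparsity generates: `to_bcsr` and `bcsr_spmv` are hypothetical names, the input is a small dense list of rows for readability, and the inner loops are written generically rather than unrolled for a fixed block size. It shows the two effects named on the slide: one column index per block instead of per entry, and explicit zeros stored to complete each block.

```python
def to_bcsr(dense, r, c):
    """Convert a dense matrix (list of rows) to r x c BCSR, storing filled-in zeros."""
    n_rows, n_cols = len(dense), len(dense[0])
    assert n_rows % r == 0 and n_cols % c == 0
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(0, n_rows, r):
        for bj in range(0, n_cols, c):
            block = [dense[bi + i][bj + j] for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):  # keep a block if it has any nonzero
                col_idx.append(bj)          # one index per block, not per entry
                blocks.append(block)
        row_ptr.append(len(blocks))
    return row_ptr, col_idx, blocks


def bcsr_spmv(row_ptr, col_idx, blocks, x, r, c):
    """Compute y = A x for A stored in r x c BCSR."""
    y = [0.0] * ((len(row_ptr) - 1) * r)
    for brow in range(len(row_ptr) - 1):
        for k in range(row_ptr[brow], row_ptr[brow + 1]):
            bj, block = col_idx[k], blocks[k]
            for i in range(r):
                acc = 0.0
                for j in range(c):  # a tuned kernel would unroll these loops
                    acc += block[i * c + j] * x[bj + j]
                y[brow * r + i] += acc
    return y


dense = [[2, 1, 0, 0],
         [0, 2, 0, 0],
         [0, 0, 1, 2],
         [4, 0, 0, 3]]
row_ptr, col_idx, blocks = to_bcsr(dense, 2, 2)
y = bcsr_spmv(row_ptr, col_idx, blocks, [1, 2, 3, 4], 2, 2)
```

In this example 7 true nonzeros are stored as 3 blocks of 4 entries, so 5 explicit zeros are added; that fill is the cost weighed against the better register reuse.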
![Page 45: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/45.jpg)
Register Blocking Performance Model
• Estimate performance of register blocking:
  • Estimated raw performance: Mflop/s of dense matrix in sparse $r \times c$ blocked format (machine-dependent)
  • Estimated overhead: to fill in $r \times c$ blocks (matrix-dependent)
  • Maximize over $r \times c$:

$$\max_{r \times c} \; \frac{\text{Estimated raw performance}}{\text{Estimated overhead}}$$

• Use sampling to further reduce time; row and column dimensions are computed separately
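The selection rule can be sketched directly. This is a minimal illustration under assumptions: `fill_ratio` and `choose_block_size` are hypothetical names, the overhead is taken to be the fill ratio (stored entries after blocking divided by true nonzeros), the machine profile numbers are invented, and no sampling is done.

```python
def fill_ratio(nonzeros, r, c):
    """Stored entries after r x c blocking divided by true nonzeros.

    nonzeros: set of (row, col) positions of the matrix's nonzero entries.
    """
    touched_blocks = {(i // r, j // c) for i, j in nonzeros}
    return len(touched_blocks) * r * c / len(nonzeros)


def choose_block_size(nonzeros, profile):
    """Pick the (r, c) maximizing estimated raw performance / estimated overhead.

    profile: {(r, c): Mflop/s of a dense matrix stored in r x c blocked format}.
    """
    return max(profile, key=lambda rc: profile[rc] / fill_ratio(nonzeros, *rc))


# Hypothetical machine profile (Mflop/s), measured once per machine.
profile = {(1, 1): 35.0, (2, 2): 60.0, (4, 4): 73.0}

# A matrix made of two perfect 2x2 blocks, and a diagonal matrix.
blocky = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)}
diagonal = {(i, i) for i in range(4)}

best_for_blocky = choose_block_size(blocky, profile)
best_for_diagonal = choose_block_size(diagonal, profile)
```

With these numbers the blocky matrix selects 2x2 (60 / 1.0 beats 73 / 2.0), while the diagonal matrix stays at 1x1 because any larger block is mostly fill.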
![Page 46: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/46.jpg)
Machine Profiles Computed Offline
[Four panels, one per machine: 333 MHz Sun Ultra 2i, 375 MHz IBM Power3, 500 MHz Intel Pentium III, 800 MHz Intel Itanium; scale labels in extraction order: 35, 73, 42, 105, 88, 172, 110, 250]
Register blocking performance for a dense matrix in sparse format.
![Page 47: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/47.jpg)
Register Blocked SpMV Performance: Ultra 2i
(See upcoming SC’02 paper for a detailed analysis.)
![Page 48: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/48.jpg)
Register Blocked SpMV Performance: P-III
![Page 49: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/49.jpg)
Register Blocked SpMV Performance: Power3
Additional low-level performance tuning is likely to help on the Power3.
![Page 50: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/50.jpg)
Register Blocked SpMV Performance: Itanium
![Page 51: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/51.jpg)
Multiple Vector Optimization
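• Better potential for reuse: A is reused
• Loop unrolled codes multiplying across vectors are generated by a code generator.
[Diagram: matrix entry $a_{ij}$ combined with source vector entries $x_{j1}$, $x_{j2}$ to update $y_{i1}$, $y_{i2}$]

$$y_{i1} \mathrel{+}= a_{ij}\, x_{j1}, \qquad y_{i2} \mathrel{+}= a_{ij}\, x_{j2}$$

• Allows reuse of matrix elements.
• Choosing the # of vectors affects both performance and higher level algorithm.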
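The reuse of matrix elements across vectors can be sketched with a plain CSR matrix. This is an illustration under assumptions (`csr_spmm` is a hypothetical name, and the vectors are stored row-wise so that `X[j]` holds entry j of every vector), not the generated, unrolled code the slide refers to. Each `a_ij` is loaded once and applied to all k vectors.

```python
def csr_spmm(row_ptr, col_idx, vals, X):
    """Compute Y = A X for CSR matrix A and k dense vectors stored row-wise in X."""
    k = len(X[0])
    Y = [[0.0] * k for _ in range(len(row_ptr) - 1)]
    for i in range(len(row_ptr) - 1):
        yi = Y[i]
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[p]            # matrix entry loaded once ...
            xj = X[col_idx[p]]
            for v in range(k):     # ... and reused for every vector
                yi[v] += a * xj[v]
    return Y


# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form, multiplied by two vectors at once.
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 30.0]]
Y = csr_spmm(row_ptr, col_idx, vals, X)
```

A tuned version would fix k at code-generation time and unroll the innermost loop, which is why the number of vectors becomes a tuning parameter.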
![Page 52: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/52.jpg)
Multiple Vector Performance: Itanium
![Page 53: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/53.jpg)
Multiple Vector Performance: Pentium 4
![Page 54: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/54.jpg)
Exploiting Additional Matrix Structure
• Symmetry (numerical or structural)
  • Reuse matrix entries
  • Can combine with register blocking, multiple vectors, …
• Large matrices with random structure
  • E.g., Latent Semantic Indexing (LSI) matrices
  • Technique: cache blocking
    • Store matrix as $2^i \times 2^j$ sparse submatrices
    • Useful when source vector is large
    • Currently, search to find fastest size
![Page 55: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/55.jpg)
Symmetric SpMV Performance: Pentium 4
![Page 56: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/56.jpg)
Cache Blocking Optimization
• Keeping part of source vector in cache
[Diagram: $y = A x$ with sparse matrix (A), source vector (x), and destination vector (y)]
• Improves cache reuse of source vector.
• Used for nearly random nonzero patterns
• When source vector does not fit in cache
![Page 57: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/57.jpg)
Cache Blocked SpMV on LSI Matrix: Ultra 2i
![Page 58: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/58.jpg)
Tuning Sparse Triangular Solve (SpTS)
• Compute $x = T^{-1} b$ where T is sparse lower (upper) triangular, x & b dense
• L arising in sparse LU factorization has rich dense substructure
  • Dense trailing triangle can account for 20 to 90% of matrix non-zeros
• SpTS optimizations
  • Split into sparse trapezoid and dense trailing triangle
  • Use dense BLAS (DTRSV) on dense triangle
  • Use Sparsity register blocking on sparse part
• Tuning parameters
  • Size of dense trailing triangle
  • Register block size
![Page 59: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/59.jpg)
Example: Sparse Triangular Factor
• Raefsky4 (structural problem) + SuperLU + colmmd
• N=19779, nnz=12.6 M
Dense trailing triangle: dim=2268, 20% of total nz
![Page 60: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/60.jpg)
Sparse/Dense Partitioning for SpTS
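• Partition L into sparse ($L_1$, $L_2$) and dense $L_D$:

$$\begin{pmatrix} L_1 & \\ L_2 & L_D \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}$$

• Perform SpTS in three steps:

$$\begin{aligned} (1)\quad & L_1 x_1 = b_1 \\ (2)\quad & \hat{b}_2 = b_2 - L_2 x_1 \\ (3)\quad & L_D x_2 = \hat{b}_2 \end{aligned}$$

• Sparsity optimizations for (1) and (2); DTRSV for (3)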
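The three steps can be checked on a small example. The sketch below makes simplifying assumptions: all blocks are held as dense lists of rows, `forward_solve` stands in for both the sparse kernel of steps (1) and (2) and the DTRSV call of step (3), and the function names and the split point `n1` are hypothetical.

```python
def forward_solve(L, b):
    """Solve L x = b for lower-triangular L by forward substitution."""
    x = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[j] * x[j] for j in range(i))
        x.append(s / row[i])
    return x


def split_lower_solve(L, b, n1):
    """Solve L x = b using the L1 / L2 / LD partition with L1 of size n1 x n1."""
    n = len(L)
    L1 = [row[:n1] for row in L[:n1]]   # leading triangle (sparse part)
    L2 = [row[:n1] for row in L[n1:]]   # rectangle below L1 (sparse part)
    LD = [row[n1:] for row in L[n1:]]   # trailing triangle (dense part)
    x1 = forward_solve(L1, b[:n1])                          # (1) L1 x1 = b1
    b2_hat = [b[n1 + i] - sum(L2[i][j] * x1[j] for j in range(n1))
              for i in range(n - n1)]                       # (2) b2^ = b2 - L2 x1
    x2 = forward_solve(LD, b2_hat)                          # (3) LD x2 = b2^
    return x1 + x2


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 3.0, 4.0, 0.0],
     [1.0, 2.0, 3.0, 2.0]]
b = [2.0, 3.0, 18.0, 22.0]   # equals L times [1, 2, 3, 4]
x_split = split_lower_solve(L, b, n1=2)
x_direct = forward_solve(L, b)
```

The split gives the same answer as a single forward substitution; its value is that step (3) can be handed to a dense triangular solve, with the split point as a tuning parameter.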
![Page 61: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/61.jpg)
SpTS Performance: Itanium
(See POHLL ’02 workshop paper, at ICS ’02.)
![Page 62: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/62.jpg)
SpTS Performance: Power3
![Page 63: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/63.jpg)
Sustainable Memory Bandwidth
![Page 64: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/64.jpg)
Summary and Future Work
• Applying new optimizations
  • Other split data structures (variable block, diagonal, …)
  • Matrix reordering to create block structure
  • Structural symmetry
• New kernels (triple product $R A R^T$, powers $A^k$, …)
• Tuning parameter selection
• Building an automatically tuned sparse matrix library
  • Extending the Sparse BLAS
  • Leverage existing sparse compilers as code generation infrastructure
  • More thoughts on this topic tomorrow
![Page 65: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/65.jpg)
IRAM: Intelligent RAM
• Faculty
  • Dave Patterson
  • Katherine Yelick
• Graduate Students
  • Christoforos Kozyrakis
  • Joe Gebis
  • Sam Williams
  • Manikandan Narayanan
  • Iakovos Kosmidakis
  • Iaonnis Kosmidakis
• LBNL Collaborators (Benchmarking)
  • Parry Husbands
  • Brian Gaeke
  • Xiaoye Li
  • Leonid Oliker
• Staff
  • Dave Judd
  • Steve Pope
![Page 66: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/66.jpg)
Motivation
• Observation: Current cache-based supercomputers perform at a small fraction of peak for memory intensive problems (particularly irregular ones)
  • E.g., optimized Sparse Matrix-Vector Multiplication runs at ~20% of floating point peak on 1.5GHz P4
• Even worse when parallel efficiency considered
  • Overall <10% efficiency is typical for many applications
• Performance directly related to memory system design
  • But “gap” between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
• Is memory bandwidth the problem?
![Page 67: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/67.jpg)
VIRAM Overview
[Die photo: 14.5 mm × 20.0 mm]
• MIPS core (200 MHz)
• Main memory system
  • 13 MB of on-chip DRAM
  • Large on-chip bandwidth: 6.4 GBytes/s peak to vector unit
• Vector unit
  • Efficient way to express fine-grained parallelism and exploit bandwidth
• Typical power consumption: 2.0 W
• Peak vector performance
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single-precision)
• Fabrication by IBM
  • Taping out now
• Our results use simulator with Cray’s vcc compiler
![Page 68: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/68.jpg)
Our Task
• Evaluate use of processor-in-memory (PIM) chips as a building block for high performance machines
  • For now focus on serial performance
• Benchmark VIRAM on Scientific Computing kernels
  • Originally for multimedia applications
• Can we use on-chip DRAM for vector processing vs. the conventional SRAM? (DRAM denser)
• Isolate performance limiting features of architectures
  • More than just memory bandwidth
![Page 69: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/69.jpg)
Benchmarks ConsideredMost taken for DARPA’s DIS Benchmark Suite
• Transitive-closure (small & large data set)
• NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  • Fetch-and-increment a stream of “random” addresses
• Sparse matrix-vector product (SPMV)
  • Order 10000, #nonzeros 177820
• Computing a histogram
  • Different algorithms: 64-element sorting kernel; privatization; retry
• 2D unstructured mesh adaptation
| | Transitive | GUPS | SPMV | Histogram | Mesh |
| --- | --- | --- | --- | --- | --- |
| Ops/step | 2 | 1 | 2 | 1 | N/A |
| Mem/step | 2 ld, 1 st | 2 ld, 2 st | 3 ld | 2 ld, 1 st | N/A |
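To make the per-step counts concrete for one kernel, here is a minimal sketch of sparse matrix-vector product that tallies operations and loads per nonzero. It assumes a compressed sparse row (CSR) layout, which the slides do not specify; the function name and the counters are illustrative only.

```python
# Illustrative sketch of SPMV, assuming a CSR layout (not stated on the
# slide). Each nonzero costs 2 arithmetic ops (multiply, add) and 3 loads
# (matrix value, column index, source vector element), which is the
# "2 ops, 3 ld" entry in the SPMV column.

def spmv_csr(values, col_idx, row_ptr, x):
    """Return (y, ops, loads) for y = A x with A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    ops = 0
    loads = 0
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # loads: values[k], col_idx[k], x[col_idx[k]] (an indexed gather)
            acc += values[k] * x[col_idx[k]]
            ops += 2
            loads += 3
        y[row] = acc
    return y, ops, loads


if __name__ == "__main__":
    # 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

The gather through `col_idx` is the irregular access in this kernel: its addresses depend on the data, not on a fixed stride.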
![Page 70: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/70.jpg)
Power and Performance on BLAS-2
• 100x100 matrix vector multiplication (column layout)
• VIRAM result compiled, others hand-coded or Atlas optimized
• VIRAM performance improves with larger matrices
• VIRAM power includes on-chip main memory
• 8-lane version of VIRAM nearly doubles MFLOPS
[Figure: bar chart of MFLOPS and MFLOPS/Watt (axis 0 to 400) for VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, Power3 630]
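For readers unfamiliar with the kernel being measured, the following is a minimal sketch of matrix-vector multiplication over a column layout. It is not the benchmark code from the talk; the function names and the list-of-columns representation are assumptions made for illustration.

```python
# Illustrative sketch of y = A x with A stored column by column.
# a_cols[j] holds column j, so the inner loop walks one column with unit
# stride and updates all of y (an AXPY-style update).

def matvec_column_layout(a_cols, x):
    """Compute y = A x where a_cols[j] is column j of A."""
    n_rows = len(a_cols[0]) if a_cols else 0
    y = [0.0] * n_rows
    for j, column in enumerate(a_cols):
        xj = x[j]
        for i in range(n_rows):
            y[i] += xj * column[i]
    return y


def matvec_flops(n):
    """Floating point operations for an n x n product: one multiply and one add per entry."""
    return 2 * n * n


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] stored as columns [1, 3] and [2, 4]
    print(matvec_column_layout([[1.0, 3.0], [2.0, 4.0]], [1.0, 1.0]))
    print(matvec_flops(100))
```

For the 100x100 case this is 20,000 floating point operations per product, so an MFLOPS figure follows by dividing that count by the measured time in microseconds.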
![Page 71: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/71.jpg)
Performance Comparison
• IRAM designed for media processing
  • Low power was a higher priority than high performance
• IRAM is better for apps with sufficient parallelism
[Figure: bar chart of MOPS (axis 0 to 1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
![Page 72: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/72.jpg)
Power Efficiency
• Huge power/performance advantage in VIRAM from both
  • PIM technology
  • Data parallel execution model (compiler-controlled)
[Figure: bar chart of MOPS/Watt (axis 0 to 500) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
![Page 73: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/73.jpg)
Power Efficiency
• Same data on a log plot
• Includes low power processors (Mobile PIII)
• The same picture for operations/cycle
[Figure: MOPS/Watt on a log scale (0.1 to 1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
![Page 74: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/74.jpg)
Is Memory Bandwidth the Limit?
• What is the bottleneck in each case?
  • Transitive and GUPS are limited by bandwidth (near 6.4GB/s peak)
  • SPMV and Mesh limited by address generation and bank conflicts
  • For Histogram there is insufficient parallelism
[Figure: per-benchmark chart (Transitive, GUPS, SPMV (Regular), SPMV (Random), Histogram, Mesh) with two axes: memory bandwidth in MB/s (0 to 6000) and MOPS (0 to 1000)]
![Page 75: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/75.jpg)
Computation Memory Balance
| | Imagine SRF | Imagine Memory | IRAM | SX-6 | Itanium |
| --- | --- | --- | --- | --- | --- |
| Clock Rate (MHz) | 500 | 500 | 200 | 500 | 800 |
| Bandwidth (Gbytes/s) | 32 | 2.7 | 6.4 | 32 | 2.1* |
| Single precision flop rate (Gflop/s) | 20 | 20 | 1.6 | 8 | 3.2* |
| Ratio (flop/word) | 2.5 | 30 | 1 | 1 | 6.1* |

* Approximate
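The ratio row can be reproduced from the bandwidth and flop-rate rows. The sketch below assumes a word is 4 bytes (single precision), which is an assumption made here because it is the word size that makes the arithmetic agree with the table; the dictionary and function names are illustrative.

```python
# Illustrative sketch: derive flop/word from the table's bandwidth and
# flop-rate rows. BYTES_PER_WORD = 4 is an assumption (single-precision
# words); the slide does not state the word size.

BYTES_PER_WORD = 4

# machine -> (bandwidth in Gbytes/s, single precision Gflop/s), from the table
MACHINES = {
    "Imagine SRF": (32.0, 20.0),
    "Imagine Memory": (2.7, 20.0),
    "IRAM": (6.4, 1.6),
    "SX-6": (32.0, 8.0),
    "Itanium": (2.1, 3.2),  # approximate figures, as marked in the table
}


def flops_per_word(bandwidth_gbytes, gflops, bytes_per_word=BYTES_PER_WORD):
    """Peak flops available per word of memory traffic."""
    gwords_per_second = bandwidth_gbytes / bytes_per_word
    return gflops / gwords_per_second


if __name__ == "__main__":
    for name, (bandwidth, gflops) in MACHINES.items():
        print(f"{name:15s} {flops_per_word(bandwidth, gflops):5.1f} flop/word")
```

A ratio near 1 means the machine can fetch roughly one operand per flop, so memory-bound kernels can run close to peak; a ratio of 30 means a kernel needs about 30 flops per word fetched to do the same.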
![Page 76: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/76.jpg)
Vector Add Example
• Vector add operation is memory-intensive
• 2 loads, 1 store, 1 arithmetic op
• Imagine runs at a small fraction of peak because of its high computation-to-memory ratio.
• VIRAM - 370 MOPS (23.13% of peak)
• Imagine - 170 MOPS (0.85% of peak)
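The percentages above follow from the peak rates in the Computation Memory Balance table (1.6 Gops for VIRAM, 20 Gops for Imagine). The sketch below reproduces them and adds a simple bandwidth bound for vector add. The bound assumes 4-byte words and that Imagine streams its operands from off-chip memory at 2.7 Gbytes/s; both are assumptions for illustration, and the function names are hypothetical.

```python
# Illustrative sketch: fraction of peak and a bandwidth bound for vector
# add. Assumptions (not stated on the slide): 4-byte words, and Imagine
# limited by its off-chip memory bandwidth of 2.7 Gbytes/s.

def percent_of_peak(measured_mops, peak_gops):
    """Measured rate as a percentage of the peak rate."""
    return 100.0 * measured_mops / (peak_gops * 1000.0)


def bandwidth_bound_mops(bandwidth_gbytes, words_per_op=3, bytes_per_word=4):
    """Upper bound on MOPS when each op moves `words_per_op` words.

    Vector add moves 3 words per arithmetic op (2 loads, 1 store).
    """
    return bandwidth_gbytes * 1000.0 / (words_per_op * bytes_per_word)


if __name__ == "__main__":
    print(f"VIRAM:   {percent_of_peak(370, 1.6):.2f}% of peak, "
          f"bound {bandwidth_bound_mops(6.4):.0f} MOPS")
    print(f"Imagine: {percent_of_peak(170, 20.0):.2f}% of peak, "
          f"bound {bandwidth_bound_mops(2.7):.0f} MOPS")
```

Under these assumptions both measured rates sit below their bandwidth bounds (about 533 MOPS for VIRAM and 225 MOPS for Imagine), which is consistent with vector add being limited by memory traffic and not by arithmetic peak.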
![Page 77: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/77.jpg)
Imagine Streams vs. IRAM Vectors
• Imagine advantages
  • 8 SIMD VLIW clusters => higher absolute peak at low control overhead
  • Extra level of memory (local registers) is good for short vectors
  • To hide high latency to off-chip memory, need:
    • Long “vectors” (streams of length >> 64)
    • Good temporal locality
• VIRAM advantages
  • High memory bandwidth helps many applications
  • 64 element vectors are sufficient to hide memory latency
  • Only one level of memory hierarchy to worry about (Two levels in Imagine)
• Programmability
  • VIRAM has C compiler, which is easy to use and is proven technology
  • Imagine’s Stream C and Kernel C give users more control
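To illustrate what working in 64-element vectors means for a long loop, here is a minimal sketch of strip-mining, the standard way a vectorizing compiler splits a loop into vector-length chunks. It is a plain Python model, not VIRAM or Imagine code; `MVL = 64` is taken from the 64-element figure on the slide and the function name is an assumption.

```python
# Illustrative sketch of strip-mining: a loop of n iterations is split
# into strips of at most MVL elements, each standing in for one vector
# operation. MVL = 64 is an assumption based on the slide's "64 element
# vectors".

MVL = 64


def strip_mined_add(a, b, mvl=MVL):
    """Return (c, strips) where c[i] = a[i] + b[i], computed strip by strip."""
    n = len(a)
    c = [0.0] * n
    strips = 0
    for start in range(0, n, mvl):
        stop = min(start + mvl, n)  # the last strip may be shorter than mvl
        # One strip: models a vector load of a and b, an add, and a store.
        c[start:stop] = [a[i] + b[i] for i in range(start, stop)]
        strips += 1
    return c, strips


if __name__ == "__main__":
    a = [float(i) for i in range(200)]
    b = [1.0] * 200
    c, strips = strip_mined_add(a, b)
    print(strips, c[:3], c[-1])
```

The longer each strip, the more memory requests are in flight per instruction, which is the mechanism by which long vectors or streams cover memory latency.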
![Page 78: Tools for High Performance Scientific Computing](https://reader035.fdocuments.net/reader035/viewer/2022062309/56815097550346895dbe93c7/html5/thumbnails/78.jpg)
Summary
• Both IRAM and Imagine depend on parallelism
• Programmability advantage to VIRAM
  • All benchmarks vectorized by the VIRAM compiler (Cray vectorizer)
  • With restructuring and hints from programmers
• Performance advantage of VIRAM
  • Large on applications limited only by bandwidth
  • More address generators/sub-banks would help irregular performance
• Performance/Power advantage of VIRAM
  • Over both low power and high performance processors
  • Both PIM and data parallelism are key
• Imagine results preliminary
  • Great peak performance for programs with good temporal locality