Scientific Applications on Multi-PIM Systems
WIMPS 2002
Katherine Yelick
U.C. Berkeley and NERSC/LBNL
Joint work with: Xiaoye Li, Lenny Oliker, Brian Gaeke, Parry Husbands (LBNL)
And the Berkeley IRAM group: Dave Patterson, Joe Gebis, Dave Judd, Christoforos Kozyrakis, Sam Williams, Steve Pope
K. Yelick, WIMPS 2002
Algorithm Space
Regularity
Reuse
Two-sided dense linear algebra
One-sided dense linear algebra
FFTs
Sparse iterative solvers
Sparse direct solvers
Asynchronous discrete event simulation
Gröbner Basis (“Symbolic LU”)
Search
Sorting
Why build Multiprocessor PIM?
- Scaling to petaflops: low power, footprint, etc.
- Performance, and performance predictability
- Programmability: let’s not forget this; would like to increase the user base
- Start with the single-chip problem by looking at VIRAM
VIRAM Overview
Die size: 14.5 mm × 20.0 mm
- MIPS core (200 MHz): single-issue, 8 KB I & D caches
- Vector unit (200 MHz): 32 64-bit elements per register; 256-bit datapaths (16b, 32b, 64b ops); 4 address generation units
- Main memory system: 13 MB of on-chip DRAM in 8 banks; 12.8 GB/s peak bandwidth
- Typical power consumption: 2.0 W
- Peak vector performance: 1.6/3.2/6.4 Gops without multiply-add; 1.6 Gflops (single-precision)
- Fabrication by IBM; tape-out in O(1 month)
Benchmarks for Scientific Problems
- Dense matrix-vector multiplication: compare to hand-tuned codes on conventional machines
- Transitive closure (small & large data set), on a dense graph representation
- NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of “random” addresses
- Sparse matrix-vector product: order 10000, 177820 nonzeros
- Computing a histogram: used for image processing of a 16-bit greyscale image (1536 × 1536); 2 algorithms: 64-element sorting kernel and privatization; also used in sorting
- 2D unstructured mesh adaptation: initial grid 4802 triangles, final grid 24010 triangles
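The privatization variant of the histogram kernel can be sketched as below. This is a scalar C illustration of the idea only; the bin count, lane count, and all names are ours, not from the benchmark source.

```c
#include <string.h>

#define NBINS 256   /* 8-bit bins for illustration; the benchmark uses 16-bit pixels */
#define NLANES 4    /* one private copy per (virtual) vector lane */

/* Privatized histogram: each lane accumulates into its own private copy,
   so concurrent increments never collide, then the copies are reduced. */
void histogram_priv(const unsigned char *img, int n, unsigned hist[NBINS]) {
    unsigned priv[NLANES][NBINS];
    memset(priv, 0, sizeof priv);
    for (int i = 0; i < n; i++)
        priv[i % NLANES][img[i]]++;      /* conflict-free per-lane update */
    memset(hist, 0, NBINS * sizeof *hist);
    for (int l = 0; l < NLANES; l++)     /* reduction across private copies */
        for (int b = 0; b < NBINS; b++)
            hist[b] += priv[l][b];
}
```

The trade-off is extra memory (one histogram per lane) in exchange for removing the read-modify-write conflicts that otherwise block vectorization.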
Power and Performance on BLAS-2
- 100×100 matrix-vector multiplication (column layout)
- VIRAM result compiled; others hand-coded or Atlas-optimized
- VIRAM performance improves with larger matrices
- VIRAM power includes on-chip main memory
- 8-lane version of VIRAM nearly doubles MFLOPS
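The kernel being measured is a column-layout dense matrix-vector product, which reduces to one unit-stride saxpy per column; a minimal C sketch (function name ours):

```c
/* y = A*x with A stored column-major ("column layout"), as on the slide.
   Each column j contributes x[j] * A(:,j) to y: a unit-stride saxpy,
   which is the form that vectorizing compilers handle well. */
void matvec_col(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int j = 0; j < n; j++) {
        double xj = x[j];
        const double *col = A + (long)j * n;
        for (int i = 0; i < n; i++)   /* unit stride: vectorizes */
            y[i] += xj * col[i];
    }
}
```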
[Bar chart: MFLOPS and MFLOPS/Watt (0–400) for VIRAM, Sun Ultra I, Sun Ultra II, MIPS R12K, Alpha 21264, PowerPC G3, Power3 630]
Performance Comparison
- IRAM designed for media processing: low power was a higher priority than high performance
- IRAM (at 200 MHz) is better for apps with sufficient parallelism
[Bar chart: MOPS (0–1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
Power Efficiency
Huge power/performance advantage in VIRAM from both:
- PIM technology
- Data-parallel execution model (compiler-controlled)
[Bar chart: MOPS/Watt (0–500) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
Power Efficiency
- Same data on a log plot
- Includes both low-power processors (Mobile PIII)
- The same picture holds for operations/cycle
[Log-scale bar chart: MOPS/Watt (0.1–1000) on Transitive, GUPS, SPMV (reg), SPMV (rand), Hist, Mesh for VIRAM, R10K, P-III, P4, Sparc, EV6]
Which Problems are Limited by Bandwidth?
What is the bottleneck in each case?
- Transitive and GUPS are limited by bandwidth (near 6.4 GB/s peak)
- SPMV and Mesh are limited by address generation and bank conflicts
- For Histogram there is insufficient parallelism
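As a sanity check on the bandwidth bound: a 64-bit GUPS update moves one 8-byte read plus one 8-byte write, so the update rate cannot exceed bandwidth/16. A small helper (ours, not from the talk) makes the arithmetic explicit:

```c
/* Bandwidth bound for 64-bit GUPS: each update is one 8-byte read plus
   one 8-byte write, i.e. 16 bytes of memory traffic.  At B bytes/s the
   update rate is therefore at most B / 16; divide by 1e9 for GUPS. */
double gups_bound(double bytes_per_sec) {
    const double traffic_per_update = 2.0 * 8.0;   /* read + write, 64-bit */
    return bytes_per_sec / traffic_per_update / 1e9;
}
```

At the 6.4 GB/s figure quoted above this caps 64-bit GUPS at about 0.4.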
[Bar chart: memory bandwidth (MB/s, left axis, 0–6000) and MOPS (right axis, 0–1000) for Transitive, GUPS, SPMV (Regular), SPMV (Random), Histogram, Mesh]
Summary of 1-PIM Results
- Programmability advantage: all vectorized by the VIRAM compiler (Cray vectorizer), with restructuring and hints from programmers
- Performance advantage: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance
- Performance/power advantage: over both low-power and high-performance processors; both PIM and data parallelism are key
Analysis of a Multi-PIM System
Machine parameters:
- Floating-point performance: PIM-node dependent; application dependent, not theoretical peak
- Amount of memory per processor: use 1/10th for algorithm data
- Communication overhead: time the processor is busy sending a message; cannot be overlapped
- Communication latency: time across the network (can be overlapped)
- Communication bandwidth: single node and bisection
Back-of-the-envelope calculations!
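One way those parameters combine into a back-of-the-envelope estimate is sketched below. This is our simplification, not the authors' exact model: it charges overhead on the CPU always, and lets latency and transfer time overlap with compute.

```c
/* Back-of-the-envelope time for a step that computes f flops and sends
   m messages of b bytes each, from the machine parameters on the slide:
   achieved flop rate F, per-message overhead o (CPU-busy, never
   overlapped), latency L (overlappable), and bandwidth BW. */
double step_time(double f, double m, double b,
                 double F, double o, double L, double BW) {
    double compute  = f / F;
    double comm_cpu = m * o;            /* overhead: always paid */
    double comm_net = L + m * b / BW;   /* latency + transfer: overlappable */
    /* latency/bandwidth terms overlap with compute; take the max */
    double overlapped = compute > comm_net ? compute : comm_net;
    return overlapped + comm_cpu;
}
```

For example, 1 Gflop of work at 1 Gflop/s with 10 one-MB messages over a 1 GB/s link and microsecond overhead/latency is compute-bound: the communication hides entirely under the 1 s of compute except for the 10 µs of overhead.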
Real Data from an Old Machine (T3E)
- UPC uses a global address space
- Non-blocking remote put/get model
- Does not cache remote data
Sparse Matrix-Vector Multiply (T3E)
[Line chart: Mflops (0–250) vs. processors (1–32) for UPC + Prefetch, MPI (Aztec), UPC Bulk, UPC Small]
Running Sparse MVM on a Pflop PIM
- 1 GHz × 8 pipes × 8 ALUs/pipe = 64 GFLOPS/node peak
- 8 address generators limit performance to 16 Gflops
- 500 ns latency, 1-cycle put/get overhead, 100-cycle MP overhead
- Programmability differences too: packing vs. global address space
[Log-scale bar chart: Ops/sec (1e7–1e16) for Put/Get, Blocking read/write, Synchronous MP, Asynch MP, and Peak]
Effect of Memory Size
- For small memory nodes or smaller problem sizes, low overhead is more important
- For large memory nodes and large problems, packing is better
[Log-scale line chart: Ops/sec (1e7–1e16) vs. MB/node of data (0.3–4210) for Put/Get, Blocking read/write, Synchronous MP, Asynch MP, and Peak]
Conclusions
- Performance advantage for PIMs depends on the application: need fine-grained parallelism to utilize on-chip bandwidth
- Data parallelism is one model, with the usual trade-offs: hardware and programming simplicity, but limited expressibility
- Largest advantages for PIMs are power and packaging: enables a peta-scale machine
- Multiprocessor PIMs should be easier to program, at least at the scale of current machines (Tflops): can we get rid of the current programming model hierarchy?
The End
Benchmarks
Kernels designed to stress memory systems; some taken from the Data Intensive Systems stressmarks.
- Unit and constant stride memory: dense matrix-vector multiplication, transitive closure
- Constant stride: FFT
- Indirect addressing: NSA Giga-Updates Per Second (GUPS), sparse matrix-vector multiplication, histogram calculation (sorting)
- Frequent branching as well as irregular memory access: unstructured mesh adaptation
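The transitive-closure kernel on a dense adjacency matrix is essentially boolean Floyd-Warshall, whose innermost loop is unit stride and therefore vectorizes; a scalar C sketch (the fixed size N and all names are ours):

```c
#define N 4   /* tiny fixed size for illustration */

/* Transitive closure of a dense boolean adjacency matrix
   (Floyd-Warshall form): after the loop, adj[i][j] != 0 iff
   there is a path from i to j. */
void transitive_closure(unsigned char adj[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            if (adj[i][k])                   /* hoisted row condition */
                for (int j = 0; j < N; j++)  /* unit stride: vectorizes */
                    adj[i][j] |= adj[k][j];
}
```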
Conclusions and VIRAM Future Directions
- VIRAM outperforms the Pentium III on scientific problems, with lower power and clock rate than the Mobile Pentium
- Vectorization techniques developed for the Cray PVPs are applicable
- PIM technology provides a low-power, low-cost memory system; a similar combination is used in the Sony PlayStation
- Small ISA changes can have a large impact: limited in-register permutations sped up the 1K FFT by 5x
- The memory system can still be a bottleneck: indexed/variable stride is costly, due to address generation
- Future work: ongoing investigations into the impact of lanes and subbanks; technical paper in preparation (expect completion 09/01); run benchmarks on real VIRAM chips; examine multiprocessor VIRAM configurations
Management Plan
- Roles of different groups and PIs: senior researchers each work on a particular class of benchmarks — Parry: sorting and histograms; Sherry: sparse matrices; Lenny: unstructured mesh adaptation; Brian: simulation; Jin and Hyun: specific benchmarks
- Plan to hire an additional postdoc for next year (focus on Imagine)
- Undergrad model used for targeted benchmark efforts
- Plan for using computational resources at NERSC: few resources used, except for comparisons
Future Funding Prospects
- FY2003 and beyond: DARPA initiated the DIS program; related projects are continuing under Polymorphic Computing; a new BAA is coming in “High Productivity Systems”; interest from other DOE labs (LANL) in the general problem
- General model: most architectural research projects need benchmarking; the work has higher quality if done by people who understand the apps; expertise for hardware projects is different (system-level design, circuit design, etc.); interest from both the IRAM and Imagine groups shows the level of interest
Long Term Impact
Potential impact on Computer Science:
- Promote research on new architectures and micro-architectures
- Understand future architectures; preparation for procurements
- Provide visibility of NERSC in core CS research areas
- Correlate applications: DOE vs. large-market problems
- Influence future machines through research collaborations
Benchmark Performance on IRAM Simulator
IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)
Project Goals for FY02 and Beyond
Use established data-intensive scientific benchmarks with other emerging architectures:
- IMAGINE (Stanford Univ.): designed for graphics and image/signal processing; peak 20 GFLOPS (32-bit FP); key features: vector processing, VLIW, a streaming memory system (not a PIM-based design); preliminary discussions with Bill Dally
- DIVA (DARPA-sponsored: USC/ISI): based on a PIM “smart memory” design, but for multiprocessors; moves computation to data; designed for irregular data structures and dynamic databases; discussions with Mary Hall about benchmark comparisons
Media Benchmarks
- FFT uses in-register permutations, generalized reduction
- All others written in C with the Cray vectorizing compiler
[Bar chart: GOPS (0–4) for the media benchmarks]
Integer Benchmarks
- Strided access important, e.g., RGB; narrow types limited by address generation
- Outer-loop vectorization and unrolling used: helps avoid short vectors; spilling can be a problem
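Outer-loop vectorization as described here can be illustrated on an RGB-like kernel: the natural inner loop (3 color channels) is far too short to fill a vector register, so the long loop over pixels is vectorized instead. The example and its names are ours, not benchmark code.

```c
/* Per-channel gain on interleaved RGB pixels.  Vectorizing the
   3-iteration channel loop gives hopeless vector lengths; keeping the
   channel loop outermost and scalar lets the long pixel loop become
   one strided vector operation per channel (stride 3, length npix). */
void scale_rgb(unsigned short *pix, int npix, const unsigned short gain[3]) {
    for (int c = 0; c < 3; c++)            /* short loop: stays scalar */
        for (int i = 0; i < npix; i++)     /* long loop: vectorizes, stride 3 */
            pix[3*i + c] = (unsigned short)((pix[3*i + c] * gain[c]) >> 8);
}
```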
[Bar chart (0–7000) comparing 1-lane, 2-lane, and 4-lane VIRAM on the integer benchmarks]
Status of benchmarking software release
- Build and test scripts (Makefiles, timing, analysis, ...)
- Standard random number generator
- GUPS C codes: optimized GUPS inner loop; GUPS docs
- Pointer jumping, with and without update
- Transitive, Field, Conjugate Gradient (Matrix), Neighborhood
- Optimized vector histogram code; vector histogram code generator
- Test cases (small and large working sets), optimized and unoptimized
Future work:
- Write more documentation, add better test cases as we find them
- Incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas
Status of benchmarking work
Two performance models: simulator (vsim-p) and trace analyzer (vsimII).
- Recent work on vsim-p: refining the performance model for double-precision FP performance
- Recent work on vsimII: making the backend modular (goal: model different architectures with the same ISA); fixing bugs in the memory model of the VIRAM-1 backend; better comments in code for better maintainability; completing a new backend for a new decoupled cluster architecture
Comparison with Mobile Pentium
GUPS: VIRAM gets 6x more GUPS
Data element width     16 bit   32 bit   64 bit
Mobile Pentium GUPS    .045     .046     .036
VIRAM GUPS             .295     .295     .244
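The GUPS kernel behind these numbers is a stream of fetch-and-increments at pseudo-random table indices; a scalar C sketch. The address-stream generator here is an arbitrary LCG of our choosing, not the NSA benchmark's actual generator.

```c
#include <stdint.h>

/* GUPS-style kernel: read-modify-write a table entry at each
   pseudo-random index.  Reported GUPS = nupdates / (seconds * 1e9). */
void gups_kernel(uint64_t *table, uint64_t tbl_size, uint64_t nupdates) {
    uint64_t x = 1;                       /* illustrative LCG address stream */
    for (uint64_t i = 0; i < nupdates; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        table[x % tbl_size] += 1;         /* fetch-and-increment */
    }
}
```

The performance-relevant property is that every iteration is a dependent gather/scatter pair to an unpredictable address, which is exactly what stresses address generation and memory banks.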
[Three charts of total execution time (seconds) vs. problem size, comparing P-III and 4-lane VIRAM: Transitive (matrix sizes 50–550), Update (working sets tiny–test4), and Pointer (working sets tiny–test3)]
- VIRAM is 30–50% faster than the P-III
- Execution time for VIRAM rises much more slowly with data size than for the P-III
Sparse CG
Solve Ax = b; sparse matrix-vector multiplication dominates.
- Traditional CRS format requires: indexed load/store for the X/Y vectors; variable vector length, usually short
- Other formats for better vectorization:
  - CRS with narrow band (e.g., RCM ordering): smaller strides for the X vector
  - Segmented sum (modified from the old code developed for the Cray PVP): long vector lengths, all of the same size; unit stride
  - ELL format (make all rows the same length by padding zeros): long vector lengths, all of the same size; extra flops
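The CRS-vs-ELL trade-off is visible directly in the two SpMV loops; minimal scalar C sketches of both (array names ours):

```c
/* CRS SpMV: one short, variable-length inner loop per row, with an
   indexed load of x (a gather) -- the hard case for vectorization. */
void spmv_crs(int n, const int *ptr, const int *ind, const double *val,
              const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)  /* usually short */
            s += val[k] * x[ind[k]];             /* gather */
        y[i] = s;
    }
}

/* ELL SpMV: every row padded with zeros to a common length L, stored so
   that the loop over rows is long, uniform, and unit stride -- at the
   cost of the extra flops spent on the padding. */
void spmv_ell(int n, int L, const int *ind, const double *val,
              const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int k = 0; k < L; k++)
        for (int i = 0; i < n; i++)              /* long vector loop */
            y[i] += val[k*n + i] * x[ind[k*n + i]];
}
```

This is exactly the pattern in the results that follow: ELL posts the highest raw MFLOPS, but part of that rate is spent multiplying padding zeros.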
SMVM Performance
DIS matrix: N = 10000, M = 177820 (~ 17 nonzeros per row)
IRAM results (MFLOPS), by number of sub-banks; Mobile PIII (500 MHz) CRS: 35 MFLOPS.

Sub-banks               1         2         4         8
CRS                     91        106       109       110
CRS banded              110       110       110       110
SEG-SUM                 135       154       163       165
ELL (4.6x more flops)   511(111)  570(124)  612(133)  632(137)

(Parenthesized ELL values discount the 4.6x padding flops, i.e., useful MFLOPS.)
2D Unstructured Mesh Adaptation
- Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)
- Complicated logic and data structures; difficult to achieve high efficiency
- Irregular data access patterns (pointer chasing); many conditionals / integer intensive
- Adaptation is a tool for making the numerical solution cost-effective
- Three types of element subdivision
Vectorization Strategy and Performance Results
- Color elements based on vertices (not edges): guarantees no conflicts during vector operations
- Vectorize across each subdivision type (1:2, 1:3, 1:4), one color at a time
- Difficult: many conditionals, low flops, irregular data access, dependencies
- Initial grid: 4802 triangles; final grid: 24010 triangles
- Preliminary results demonstrate VIRAM is 4.5x faster than a 500 MHz Mobile Pentium III
- Higher code complexity (requires graph coloring + reordering)
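The color-at-a-time strategy can be sketched as follows: within one color no two elements share a vertex, so the element loop carries no dependences and can run as a vector operation. This is a scalar C illustration with made-up names, not the benchmark code.

```c
/* Apply per-element contributions to shared vertex data, one color at a
   time.  A valid coloring guarantees that no two elements of the same
   color touch the same vertex, so the inner element loop is free of
   read-modify-write conflicts and can be vectorized. */
void apply_by_color(int nelem, const int *color, int ncolors,
                    const int *vert, double *vdata, const double *edata) {
    for (int c = 0; c < ncolors; c++)
        for (int e = 0; e < nelem; e++)       /* conflict-free: vectorizes */
            if (color[e] == c)
                for (int v = 0; v < 3; v++)   /* 3 vertices per triangle */
                    vdata[vert[3*e + v]] += edata[e];
}
```

The coloring and the reordering of elements by color are the extra code complexity the slide refers to; the payoff is that scatter updates within a color need no atomicity.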
            Pentium III 500   1 Lane   2 Lanes   4 Lanes
Time (ms)   61                18       14        13