The SGI Multi-Stream Processor - CUG · 2010. 8. 11. · • SV1 results using a current F90...
Transcript of The SGI Multi-Stream Processor - CUG · 2010. 8. 11. · • SV1 results using a current F90...
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
28-May-99Copyright © 1999, SGI
The SGI Multi-Stream Processor(MSP)
Terry GreyzckTechnical Lead, MSP Compiler
41st Cray User Group ConferenceMinneapolis, Minnesota
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
We know• Multi-level parallelism experts for 25 years
– Multiple, pipelined functional units (scheduling)
– Automatic vectorization• Very low overhead• Chained, pipelined operations• Asynchronous loads and stores• High percentage of peak performance
– Autotasking/OpenMP• Higher overhead• Both automatic and directive-based• Designed for maximal use of expensive processors in a time-
sharing environment
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Multi-Stream Processing– Multi-Stream Processing
• True parallel execution• Allows optimal use of memory bandwidth• Synergistic union of hardware and software
– Shared registers and cache control on the Cray SV1– The Cray SV2 is a dedicated MSP machine with fully integrated
hardware support– Multi-Streamed jobs are scheduled by the operating system– Codes that vectorize well generally stream well, using
breakthrough compiler technology
• Fills the cost/effort gap between vectorization and OpenMP
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Minimum Effort, Maximum Gain
• Manual parallelization methods require substantially more user effort than automaticvectorization or streaming
• Vectorization requires relatively little user intervention
• Multi-Streaming is designed to be a logical bridge between vectorization and OpenMP
Scheduling
Vectorization
Multi-Streaming
OpenMP
0%
20%
40%
60%
80%
100%
Speedup
User Effort
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
What’s this thing?It’s an abstract representationof an MSP– An SV1 board contains four SSPs– An MSP is ideally configured by
choosing one SSP from each offour boards
– The rod running through the SSPsrepresents a single SV1 MSP
– If you can find it on the web, itspins!
– The SV2 MSP configuration isvery different, but I haven’tcreated its graphic yet…
– (yes, it will spin as well)
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Very simple example…(Reference only)
DO I = 1,250 SSP 0
A(I) = B(I) + C(I)
ENDDO
DO I = 251,500 SSP 1
A(I) = B(I) + C(I)
ENDDO
DO I = 501,750 SSP 2
A(I) = B(I) + C(I)
ENDDO
DO I = 751,1000 SSP 3
A(I) = B(I) + C(I)
ENDDO
DO I = 1,1000
A(I) = B(I) + C(I)
ENDDO
• Only one version of the code exists; allSSPs execute the same loop
• Each SSP determines its own loopiteration range, based on its SSPnumber
• There is no ‘master’ or ‘slave’ concept;each SSP is fully independent and iscapable of executing any segment ofcode
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
End of the TermMSP (n) Multi-Stream Processor
SSP (n) Single-Stream Processor
Stream (n,v) When used as a verb, to use multipleSSPs of an MSP
Streamed (n,v,adj,É) Can be used interchangeably withmulti-streaming (English majors, we’re not)
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Operating System Support
• Currently available on the Cray SV1– Up to six configurable MSPs on a 32 processor system– Each MSP is configured to maximize its access to
memory bandwidth– All SSPs comprising an MSP are considered a unit,
greatly lowering intra-MSP communication costs
• And on the Cray SV2– Consists entirely of Multi-Stream Processors– Treats an MSP like a single CPU
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
MSP Program Support
• F90 support available on the SV1– Available with PE release 3.3 in late July– Primary option is -Ostream[0-3]
– -Ostream2 currently provides automatic streaming ofloop nests, including reductions and last value captures
– SV1 default is stream0, SV2 default is stream2
• C and C++ option is -hstream[0-3]– Available with PE release 3.4
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Program Support - continued
• MSP Programming Environment reliability– Rigorous functional testing suite with full automatic
streaming enabled• Over 20,000 MSP tests covering more than 8,600,000 lines of
Fortran code• Developmental compiler has a passing rate of 99.78%, and is
improving...• New tests being added regularly
– Extensive performance testing and analysis• Of both benchmarks and actual production codes• Results help to channel development efforts, allowing us to
focus our resources where they will do the most good
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
MSP Hardware Support
• Cray SV1 MSP hardware support includes– Shared primary memory– Intra-MSP cache coherency control– Shared B and T registers
• Cray SV2 MSP hardware support includes– Shared primary memory– Intra-MSP cache coherency– Extremely fast hardware synchronization mechanism
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Streaming and Vectorization
• Multi-streaming acts hand-in-hand withvectorization to achieve the greatest speed up– Loops can both stream and vectorize
– Streaming, in conjunction with various loop restructuringtechniques, reduces memory traffic by reusing vectorresults
– Multi-streaming, in the simplest case, behaves as a pipemultiplier for vectors, effectively giving the SV1 an eightpipe, 256-element vector unit per MSP
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Streaming and… everything else
If it works with OpenMP…
It will work with multi-streaming.
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
What Multi-Streams?(Reference only)
• Inner loops, outer loops, middle loops… loops!– Strided loads and stores– Reduction operations, with consistent results– Conditional execution– Last value captures– Loops that require array privatization– Gathers and unordered scatters (ordered coming soon)
– Pack/unpack operations (coming soon)
– Bit matrix manipulations (coming soon)
– Many other constructs available now, with more planned
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
SV1 MSP Performance
• Experiences to date– Codes that run well on long-vector, wide pipe machines perform
very nicely using the MSP
– The ability to multi-stream outer loops that are otherwise non-vectorizable can result in a large performance gain
– MSP performance on the SV1 can exceed a factor of fourimprovement for limited cases, due to a larger apparent cache
– The primary limiting factor on the SV1 is that intra-MSP datasynchronization requires a full cache invalidation
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
SV1 MSP Timings
• SV1 results using a current F90 compiler with -Ostream2• T90 results obtained using the same version of F90 with default options• Both tests run with NCPUS = 1 (single processor execution) and no code modifications
0
10
20
30
40
50
60
ARC2D ARC3D BMK1 BMK11B HYDRO2D MGRID OVERLAP SCATTER STATIC SWIM
MSP Performance SuiteExecution Times in Processor Seconds
Measured 14May99
T90
SV1 MSP
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
-0.4
-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
ARC2D
ARC3D
BMK1
BMK11B
HYDRO2D
MGRID
OVERLAP
SCATTER
STATIC
SWIM
• Results normalilzedwith respect to theT90; ratios indicatethe percentage aSV1 MSP codeexceeded the speedof the T90
• The range between -20% and 20% (-0.2to 0.2) is consideredto be T90 equivalent
• This comprises theMSP PerformanceSuite, defaultoptions atcompilation except -Ostream2 for theSV1
• All tests run withNCPUS set to 1
• Y-axis isT90
MSP SV1 - T90
±20% is roughly T90 equivalent
Speed ExecutionT90
MSP SV1
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
SV1 MSP Performance Continued
• Lessons we have learned– The proper selection of loop nest optimizations is crucial
for obtaining maximum performance– Reducing the amount of data synchronization is often
necessary for achieving peak performance– Aggressive removal of overly conservative cache
invalidations is essential on the Cray SV1– Some manual intervention may be required until
compiler technology for automatic multi-streaming isfully developed
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
Future Directions
• Additional directives and assertions for obtainingthe highest possible performance
• Improved loop restructuring• Improved cache utilization• Multi-streaming of a wider range of constructs• Further reduction of synchronization overhead• Asymmetrical multi-streaming• Multi-streaming of code outside of loops• Closer interaction with inlining
The SGI Multi-Stream ProcessorCopyright © 1999, SGI
28-May-99Terry Greyzck
And the winner is...• The multi-stream processor was designed for you, the
customer• A guiding design goal is ease of use and minimizing the
necessity for hand-tuning• At the same time, allowing low level manipulation for the
seriously geeky (that includes most of the developmentteam, IÕm afraid)
• It’s fun! My boss told me so.• Remember, as a company we’ve got 25 years of design
experience behind us• The winner is… you, the customer, and us, here at SGI.