The SGI Multi-Stream Processor

Copyright © 1999, SGI

28-May-99, Terry Greyzck

The SGI Multi-Stream Processor (MSP)

Terry Greyzck
Technical Lead, MSP Compiler

41st Cray User Group Conference
Minneapolis, Minnesota



We know

• Multi-level parallelism experts for 25 years
– Multiple, pipelined functional units (scheduling)
– Automatic vectorization
  • Very low overhead
  • Chained, pipelined operations
  • Asynchronous loads and stores
  • High percentage of peak performance
– Autotasking/OpenMP
  • Higher overhead
  • Both automatic and directive-based
  • Designed for maximal use of expensive processors in a time-sharing environment



Multi-Stream Processing

– Multi-Stream Processing
  • True parallel execution
  • Allows optimal use of memory bandwidth
  • Synergistic union of hardware and software
    – Shared registers and cache control on the Cray SV1
    – The Cray SV2 is a dedicated MSP machine with fully integrated hardware support
    – Multi-Streamed jobs are scheduled by the operating system
    – Codes that vectorize well generally stream well, using breakthrough compiler technology
  • Fills the cost/effort gap between vectorization and OpenMP



Minimum Effort, Maximum Gain

• Manual parallelization methods require substantially more user effort than automatic vectorization or streaming

• Vectorization requires relatively little user intervention

• Multi-Streaming is designed to be a logical bridge between vectorization and OpenMP

[Chart: Speedup and User Effort (0% to 100%) for Scheduling, Vectorization, Multi-Streaming, and OpenMP]



What’s this thing?

It’s an abstract representation of an MSP
– An SV1 board contains four SSPs
– An MSP is ideally configured by choosing one SSP from each of four boards
– The rod running through the SSPs represents a single SV1 MSP
– If you can find it on the web, it spins!
– The SV2 MSP configuration is very different, but I haven’t created its graphic yet…
– (yes, it will spin as well)



Very simple example… (Reference only)

DO I = 1,250          ! SSP 0
  A(I) = B(I) + C(I)
ENDDO

DO I = 251,500        ! SSP 1
  A(I) = B(I) + C(I)
ENDDO

DO I = 501,750        ! SSP 2
  A(I) = B(I) + C(I)
ENDDO

DO I = 751,1000       ! SSP 3
  A(I) = B(I) + C(I)
ENDDO

DO I = 1,1000
  A(I) = B(I) + C(I)
ENDDO

• Only one version of the code exists; all SSPs execute the same loop

• Each SSP determines its own loop iteration range, based on its SSP number

• There is no ‘master’ or ‘slave’ concept; each SSP is fully independent and is capable of executing any segment of code
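
The way each SSP could derive its own iteration range from its SSP number can be sketched as follows. This is only an illustration: the slides do not say how the compiler divides the iterations, so the equal-block split, the program name, and the variable names (`chunk`, `lo`, `hi`) are assumptions. The loop over `ssp` stands in for four SSPs that would each evaluate the same expressions with their own number.

```fortran
program ssp_ranges
  implicit none
  integer, parameter :: n = 1000     ! trip count of the example loop
  integer, parameter :: nssp = 4     ! SSPs in one SV1 MSP
  integer :: ssp, lo, hi, chunk

  ! Assumed policy: contiguous blocks of equal size, rounded up
  chunk = (n + nssp - 1) / nssp

  ! Every SSP evaluates the same code; only its SSP number differs
  do ssp = 0, nssp - 1
     lo = ssp * chunk + 1
     hi = min(n, (ssp + 1) * chunk)
     print '(a,i1,a,i4,a,i4)', 'SSP ', ssp, ': I = ', lo, ' to ', hi
  end do
end program ssp_ranges
```

With N = 1000 and four SSPs this reproduces the ranges in the example (1 to 250, 251 to 500, 501 to 750, 751 to 1000). No SSP hands out work to the others, which matches the point that there is no master or slave.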



End of the Term

MSP (n) Multi-Stream Processor

SSP (n) Single-Stream Processor

Stream (n,v) When used as a verb, to use multiple SSPs of an MSP

Streamed (n,v,adj,…) Can be used interchangeably with multi-streaming (English majors, we’re not)



Operating System Support

• Currently available on the Cray SV1
– Up to six configurable MSPs on a 32-processor system
– Each MSP is configured to maximize its access to memory bandwidth
– All SSPs comprising an MSP are considered a unit, greatly lowering intra-MSP communication costs

• And on the Cray SV2
– Consists entirely of Multi-Stream Processors
– Treats an MSP like a single CPU



MSP Program Support

• F90 support available on the SV1
– Available with PE release 3.3 in late July
– Primary option is -Ostream[0-3]
– -Ostream2 currently provides automatic streaming of loop nests, including reductions and last value captures
– SV1 default is stream0, SV2 default is stream2

• C and C++ option is -hstream[0-3]
– Available with PE release 3.4



Program Support - continued

• MSP Programming Environment reliability
– Rigorous functional testing suite with full automatic streaming enabled
  • Over 20,000 MSP tests covering more than 8,600,000 lines of Fortran code
  • Developmental compiler has a passing rate of 99.78%, and is improving...
  • New tests being added regularly
– Extensive performance testing and analysis
  • Of both benchmarks and actual production codes
  • Results help to channel development efforts, allowing us to focus our resources where they will do the most good



MSP Hardware Support

• Cray SV1 MSP hardware support includes
– Shared primary memory
– Intra-MSP cache coherency control
– Shared B and T registers

• Cray SV2 MSP hardware support includes
– Shared primary memory
– Intra-MSP cache coherency
– Extremely fast hardware synchronization mechanism



Streaming and Vectorization

• Multi-streaming acts hand-in-hand with vectorization to achieve the greatest speedup
– Loops can both stream and vectorize
– Streaming, in conjunction with various loop restructuring techniques, reduces memory traffic by reusing vector results
– Multi-streaming, in the simplest case, behaves as a pipe multiplier for vectors, effectively giving the SV1 an eight-pipe, 256-element vector unit per MSP
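
A loop nest of the following shape illustrates how streaming and vectorization can combine. It is a sketch under assumptions: the array sizes and names are hypothetical, and which loop a compiler actually streams or vectorizes depends on its own analysis, not on anything written here. The code is ordinary Fortran with no directives.

```fortran
program stream_and_vectorize
  implicit none
  integer, parameter :: n = 256      ! hypothetical inner extent
  integer, parameter :: m = 4        ! hypothetical outer extent
  real :: a(n, m), b(n, m), c(n, m)
  integer :: i, j

  call random_number(b)
  call random_number(c)

  ! Outer loop: the columns are independent of each other, so the
  ! iterations are candidates for distribution across SSPs (streaming)
  do j = 1, m
     ! Inner loop: unit-stride and free of dependences, so it is a
     ! candidate for vectorization within each SSP
     do i = 1, n
        a(i, j) = b(i, j) + c(i, j)
     end do
  end do

  print *, a(1, 1), a(n, m)
end program stream_and_vectorize
```

In the single-loop case from the earlier example, the same idea applies with one loop serving both purposes: each SSP takes a block of iterations and vectorizes within that block.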



Streaming and… everything else

If it works with OpenMP…

It will work with multi-streaming.



What Multi-Streams? (Reference only)

• Inner loops, outer loops, middle loops… loops!
– Strided loads and stores
– Reduction operations, with consistent results
– Conditional execution
– Last value captures
– Loops that require array privatization
– Gathers and unordered scatters (ordered coming soon)
– Pack/unpack operations (coming soon)
– Bit matrix manipulations (coming soon)
– Many other constructs available now, with more planned
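
Two of the constructs listed above, a reduction and a last value capture, can be pictured with a hand-written equivalent of a streamed loop. This is a sketch, not the compiler's actual transformation: the block split, the `partial` array, and all names are assumptions. Combining the per-SSP partial sums in a fixed order is one way a reduction could give the same result from run to run.

```fortran
program streamed_reduction
  implicit none
  integer, parameter :: n = 1000     ! hypothetical trip count
  integer, parameter :: nssp = 4     ! SSPs in one SV1 MSP
  real :: x(n), partial(nssp)
  real :: total_serial, total_streamed, last_serial, last_streamed
  integer :: i, ssp, lo, hi, chunk

  do i = 1, n
     x(i) = real(i)
  end do

  ! The loop as written: a sum reduction plus a scalar whose final
  ! value is used after the loop (last value capture)
  total_serial = 0.0
  last_serial = 0.0
  do i = 1, n
     total_serial = total_serial + x(i)
     last_serial = 2.0 * x(i)
  end do

  ! Hand-written picture of a streamed version: each SSP sums its
  ! own block into a private partial result
  chunk = (n + nssp - 1) / nssp
  last_streamed = 0.0
  do ssp = 0, nssp - 1
     lo = ssp * chunk + 1
     hi = min(n, (ssp + 1) * chunk)
     partial(ssp + 1) = 0.0
     do i = lo, hi
        partial(ssp + 1) = partial(ssp + 1) + x(i)
        ! Only the SSP that owns the final iteration sets the last value
        if (i == n) last_streamed = 2.0 * x(i)
     end do
  end do

  ! Combine the partial sums in a fixed order
  total_streamed = 0.0
  do ssp = 1, nssp
     total_streamed = total_streamed + partial(ssp)
  end do

  print *, 'as written: ', total_serial, last_serial
  print *, 'streamed:   ', total_streamed, last_streamed
end program streamed_reduction
```

For this integer-valued test data both versions produce the same total. With general floating-point data, regrouping the additions can change rounding, which is why the fixed combination order matters for repeatability.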



SV1 MSP Performance

• Experiences to date
– Codes that run well on long-vector, wide-pipe machines perform very nicely using the MSP
– The ability to multi-stream outer loops that are otherwise non-vectorizable can result in a large performance gain
– MSP performance on the SV1 can exceed a factor-of-four improvement in limited cases, due to a larger apparent cache
– The primary limiting factor on the SV1 is that intra-MSP data synchronization requires a full cache invalidation



SV1 MSP Timings

• SV1 results using a current F90 compiler with -Ostream2
• T90 results obtained using the same version of F90 with default options
• Both tests run with NCPUS = 1 (single processor execution) and no code modifications

[Chart: MSP Performance Suite, execution times in processor seconds (0 to 60), measured 14May99, T90 versus SV1 MSP, for ARC2D, ARC3D, BMK1, BMK11B, HYDRO2D, MGRID, OVERLAP, SCATTER, STATIC, and SWIM]



[Chart: SV1 MSP execution speed relative to the T90 (-0.4 to 0.6) for ARC2D, ARC3D, BMK1, BMK11B, HYDRO2D, MGRID, OVERLAP, SCATTER, STATIC, and SWIM]

• Results normalized with respect to the T90; ratios indicate the percentage by which an SV1 MSP code exceeded the speed of the T90

• The range between -20% and 20% (-0.2 to 0.2) is considered to be T90 equivalent

• This comprises the MSP Performance Suite, default options at compilation except -Ostream2 for the SV1

• All tests run with NCPUS set to 1

• Y-axis is (MSP SV1 - T90) / T90, in execution speed

±20% is roughly T90 equivalent



SV1 MSP Performance Continued

• Lessons we have learned
– The proper selection of loop nest optimizations is crucial for obtaining maximum performance
– Reducing the amount of data synchronization is often necessary for achieving peak performance
– Aggressive removal of overly conservative cache invalidations is essential on the Cray SV1
– Some manual intervention may be required until compiler technology for automatic multi-streaming is fully developed



Future Directions

• Additional directives and assertions for obtaining the highest possible performance
• Improved loop restructuring
• Improved cache utilization
• Multi-streaming of a wider range of constructs
• Further reduction of synchronization overhead
• Asymmetrical multi-streaming
• Multi-streaming of code outside of loops
• Closer interaction with inlining



And the winner is...

• The multi-stream processor was designed for you, the customer
• A guiding design goal is ease of use and minimizing the necessity for hand-tuning
• At the same time, allowing low level manipulation for the seriously geeky (that includes most of the development team, I’m afraid)
• It’s fun! My boss told me so.
• Remember, as a company we’ve got 25 years of design experience behind us
• The winner is… you, the customer, and us, here at SGI.
