CS 152 Computer Architecture and Engineering Lecture 15 ...cs152/sp16/lectures/L15...3/30/2016...


3/30/2016 CS152, Spring 2016

CS 152 Computer Architecture and Engineering

Lecture 15: Vector Computers

Dr. George Michelogiannakis

EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory

http://inst.eecs.berkeley.edu/~cs152

Last Time, Lecture 14: Multithreading

[Figure: issue-slot usage over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1 through Thread 5, idle slot.]

Question of the Day

§ Can Vector and VLIW combine?

Supercomputers

§ Definition of a supercomputer:
  § Fastest machine in world at given task
    – Performs at or near the currently highest operational rate for computers
§ A device to turn a compute-bound problem into an I/O-bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray
§ CDC 6600 (Cray, 1964) regarded as first supercomputer

CDC 6600, Seymour Cray, 1963

§ A fast pipelined machine with 60-bit words
  – 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
  – Floating point: adder, 2 multipliers, divider
  – Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for input/output
  – a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
  – over 100 sold ($7-10M each)

IBM Memo on CDC 6600

Thomas Watson Jr., IBM CEO, August 1963:

“Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers ... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.”

To which Cray replied: “It seems like Mr. Watson has answered his own question.”

Top 500 Systems

LINPACK & LAPACK: Software libraries for performing linear algebra

Oak Ridge Titan

§ 560,640 cores
§ LINPACK performance: 17,590 TFlop/s
§ Theoretical peak: 27,112.5 TFlop/s
§ Power: 8,209.00 kW
§ Memory: 710,144 GB
§ Opteron 6274 16C 2.2 GHz

NERSC (LBNL) Cori

§ Cray XC40 supercomputer
§ Theoretical peak performance 1.92 Petaflops/sec
§ 1,630 compute nodes, 52,160 cores in total
§ Cray Aries high-speed interconnect with Dragonfly topology as on Edison (0.25 μs to 3.7 μs MPI latency, ~8 GB/sec MPI bandwidth)
§ Aggregate memory: 203 TB
§ Scratch storage capacity: 30 PB

CDC 6600: A Load/Store Architecture

• Separate instructions to manipulate three types of reg.
    8 60-bit data registers (X)
    8 18-bit address registers (A)
    8 18-bit index registers (B)

• All arithmetic and logic instructions are reg-to-reg

    opcode | i | j | k        Ri ← (Rj) op (Rk)
    6      | 3 | 3 | 3        (field widths in bits)

• Only Load and Store instructions refer to memory!

    opcode | i | j | disp     Ri ← M[(Rj) + disp]
    6      | 3 | 3 | 18       (field widths in bits)

  Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
    - very useful for vector operations

CDC 6600: Datapath

[Figure: datapath. Blocks: Central Memory (128 Kwords, 32 banks, 1 µs cycle); Operand Regs (8 x 60-bit); Address Regs (8 x 18-bit); Index Regs (8 x 18-bit); Inst. Stack (8 x 60-bit) feeding the IR; 10 Functional Units. Labeled paths: operand, result, operand addr, result addr.]

CDC 6600 ISA designed to simplify high-performance implementation

§ Use of three-address, register-register ALU instructions simplifies pipelined implementation
  – No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  – Software can schedule load of address register before use of value
  – Can interleave independent instructions in between
§ CDC 6600 has multiple parallel but unpipelined functional units
  – E.g., 2 separate multipliers
§ Follow-on machine CDC 7600 used pipelined functional units
  – Foreshadows later RISC designs

CDC 6600: Vector Addition

      B0 <- -n
loop: JZE B0, exit
      A0 <- B0 + a0      load X0
      A1 <- B0 + b0      load X1
      X6 <- X0 + X1
      A6 <- B0 + c0      store X6
      B0 <- B0 + 1
      jump loop

Ai = address register
Bi = index register
Xi = data register

Supercomputer Applications

§ Typical application areas
  – Military research (nuclear weapons, cryptography)
  – Scientific research
  – Weather forecasting
  – Oil exploration
  – Industrial design (car crash simulation)
  – Bioinformatics
  – Cryptography
§ All involve huge computations on large data sets
§ In 70s-80s, Supercomputer ≡ Vector Machine

VLIW vs Vector

§ VLIW takes advantage of instruction-level parallelism (ILP) by specifying instructions to execute in parallel

  [Figure: one VLIW instruction word with slots Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2]

§ Vector architectures perform the same operation on multiple data elements
  – Data-level parallelism

  [Figure: vector arithmetic instruction ADDV v3, v1, v2 adds v1 and v2 element by element, elements [0] through [VLR-1], into v3]

Vector Programming Model

[Figure: programming model.
 Scalar registers: r0 to r15.
 Vector registers: v0 to v15, each holding elements [0], [1], [2], ..., [VLRMAX-1].
 Vector Length Register: VLR.
 Vector arithmetic instructions: ADDV v3, v1, v2 adds v1 and v2 element by element, elements [0] through [VLR-1], into v3.
 Vector load and store instructions: LV v1, r1, r2 moves data between memory and vector register v1, with the base in r1 and the stride in r2.]

Control Information

§ VLR limits the highest vector element to be processed by a vector instruction
  – VLR is loaded prior to executing the vector instruction with a special instruction
§ Stride for load/stores:
  – Vectors may not be adjacent in memory addresses
  – E.g., different dimensions of a matrix
  – Stride can be specified as part of the load/store
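To make VLR and stride concrete, here is a minimal C sketch of what a strided vector load does. The name vector_load, the VLRMAX value of 64, and counting the stride in elements rather than bytes are assumptions for illustration, not part of the lecture's ISA.

```c
#include <assert.h>
#include <stddef.h>

#define VLRMAX 64  /* assumed register length; the slides leave VLRMAX symbolic */

/* Sketch of LV v, base, stride: load vlr elements into vector register v.
   The stride is counted in elements here, not bytes (an assumption). */
void vector_load(double v[VLRMAX], const double *base, ptrdiff_t stride, size_t vlr)
{
    assert(vlr <= VLRMAX);  /* VLR cannot exceed the register length */
    for (size_t i = 0; i < vlr; i++)
        v[i] = base[(ptrdiff_t)i * stride];  /* elements past VLR stay untouched */
}
```

For a row-major 4x4 matrix stored in m[16], vector_load(v, &m[1], 4, 4) would fetch column 1, which is the "different dimensions of a matrix" case.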

Vector Code Example

# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

Flynn’s Taxonomy

§ Single instruction, single data (SISD)
  – E.g., our in-order processor
§ Single instruction, multiple data (SIMD)
  – Multiple processing elements, same operation, different data
  – Vector
  – Multiple processing units execute the same instruction on different data in lockstep. Either all complete or none do. Therefore, all units have to execute the same instruction at a given time
§ Multiple instruction, multiple data (MIMD)
  – Multiple autonomous processors executing different instructions on different data
  – Most common and general parallel machine
§ Multiple instruction, single data (MISD)
  – Why would anyone do this?

More Categories

§ Single program, multiple data (SPMD)
  – Multiple autonomous processors execute the program at independent points
  – Difference with SIMD: SIMD imposes a lockstep
  – Programs in SPMD can be at independent points
  – SPMD can run on general purpose processors
  – Most common method for parallel computing
§ Multiple program, multiple data (MPMD)
  – Multiple autonomous processors simultaneously operating at least 2 independent programs

Vector Supercomputers

§ Epitome: Cray-1, 1976
§ Scalar Unit
  – Load/Store Architecture
§ Vector Extension
  – Vector Registers
  – Vector Instructions
§ Implementation
  – Hardwired Control
  – Highly Pipelined Functional Units
  – Interleaved Memory System
  – No Data Caches
  – No Virtual Memory

Cray-1 (1976)

[Figure: Cray-1 block diagram.
 Memory: single port, 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store, 320 MW/sec instruction buffer refill; memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz).
 Instruction fetch: 4 instruction buffers (64-bit x 16), with NIP, LIP, and CIP.
 Scalar registers: S0 to S7, with 64 T regs.
 Address registers: A0 to A7, with 64 B regs.
 Vector registers: V0 to V7, 64-element vector registers, with V. Mask and V. Length.
 Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul.]

Vector Instruction Set Advantages

§ Compact
  – one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
§ Scalable
  – can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: six-stage multiply pipeline computing V3 <- V1 * V2, with elements of V1 and V2 entering the pipeline and results written to V3.]

Vector Instruction Execution

ADDV C, A, B

[Figure, first panel: execution using one pipelined functional unit. Pairs A[3]/B[3] through A[6]/B[6] are in the pipeline; C[0], C[1], C[2] have been produced.]

[Figure, second panel: execution using four pipelined functional units. The units handle elements 0, 4, 8, ...; 1, 5, 9, ...; 2, 6, 10, ...; and 3, 7, 11, ... respectively. Pairs A[12]/B[12] through A[27]/B[27] are in the pipelines; C[0] through C[11] have been produced.]

How Do Vector Architectures Affect Memory?

Interleaved Vector Memory System

[Figure: an address generator combines Base and Stride to produce element addresses, which are spread over memory banks 0 to F and connect to the vector registers.]

Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency

• Bank busy time: Time before bank ready to accept next request
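The effect of stride on an interleaved memory can be sketched with a small C model. It uses the Cray-1 figures above (16 banks, 4 cycle bank busy time) and assumes word-interleaved banks, at most one request issued per cycle, and a stride counted in words; issue_cycles is a hypothetical helper, not a description of the real address generator.

```c
#include <assert.h>

#define NUM_BANKS 16  /* Cray-1 figure from the slide */
#define BANK_BUSY 4   /* cycles before a bank accepts another request */

/* Cycles needed to issue n word accesses at the given stride (in words),
   one request per cycle at best, stalling while the target bank is busy. */
long issue_cycles(unsigned base, unsigned stride, int n)
{
    long ready[NUM_BANKS] = {0};  /* first cycle at which each bank is free */
    long cycle = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned bank = (base + (unsigned)i * stride) % NUM_BANKS;
        if (cycle < ready[bank])
            cycle = ready[bank];      /* stall until the bank is free */
        ready[bank] = cycle + BANK_BUSY;
        cycle++;                      /* the next request goes out a cycle later */
    }
    return cycle;
}
```

Under these assumptions a unit-stride access of 64 words issues in 64 cycles, while a stride of 16 sends every request to the same bank and takes 253 cycles.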

Vector Unit Structure

[Figure: four lanes, each with its part of the vector registers and functional units, connected to the memory subsystem. The lanes hold elements 0, 4, 8, …; elements 1, 5, 9, …; elements 2, 6, 10, …; and elements 3, 7, 11, … respectively.]

T0 Vector Microprocessor (UCB/ICSI, 1995)

Vector register elements striped over lanes (one row per lane):

[0] [8] [16] [24]
[1] [9] [17] [25]
[2] [10] [18] [26]
[3] [11] [19] [27]
[4] [12] [20] [28]
[5] [13] [21] [29]
[6] [14] [22] [30]
[7] [15] [23] [31]
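The striping above follows a simple rule: with L lanes, element i lives in lane i mod L, at position i / L within that lane. A minimal C sketch, where lane_pos, element_home, and cycles_for are hypothetical names and one element per lane per cycle is assumed:

```c
#include <assert.h>

typedef struct { int lane; int slot; } lane_pos;

/* Home of element i in a vector register striped over num_lanes lanes. */
lane_pos element_home(int i, int num_lanes)
{
    assert(i >= 0 && num_lanes > 0);
    lane_pos p = { i % num_lanes, i / num_lanes };
    return p;
}

/* Cycles one vector instruction of vl elements occupies a functional unit,
   assuming each lane processes one element per cycle. */
int cycles_for(int vl, int num_lanes)
{
    assert(vl >= 0 && num_lanes > 0);
    return (vl + num_lanes - 1) / num_lanes;  /* ceiling division */
}
```

With 8 lanes, element 17 lands in lane 1 at position 2, matching the row [1] [9] [17] [25], and a 32-element instruction occupies a unit for 4 cycles.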

Vector Instruction Parallelism

§ Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes

[Figure: instruction issue over time on the Load Unit, Multiply Unit, and Add Unit; successive load, mul, and add instructions overlap across the three units.]

Complete 24 operations/cycle (3 units x 8 lanes) while issuing 1 short instruction/cycle

Vector Chaining

§ Vector version of register bypassing
  – introduced with Cray-1

      LV   v1
      MULV v3, v1, v2
      ADDV v5, v3, v4

[Figure: the Load Unit fills V1 from memory; V1 is chained into the multiplier together with V2, producing V3; V3 is chained into the adder together with V4, producing V5.]

Vector Chaining Advantage

• Without chaining, must wait for last element of result to be written before starting dependent instruction

  [Figure: Load, Mul, and Add run one after another along the time axis.]

• With chaining, can start dependent instruction as soon as first result appears

  [Figure: Load, Mul, and Add overlap along the time axis.]
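A rough timing model shows how much chaining can save. The sketch assumes one element per cycle, each instruction on its own functional unit, and illustrative pipeline latencies; finish_cycle and the latency values used below are hypothetical, not measurements of any machine.

```c
#include <assert.h>

/* Cycle at which the last of `count` dependent vector instructions finishes.
   latency[k] is the pipeline depth of instruction k, n the vector length;
   one element per cycle, each instruction on its own functional unit. */
long finish_cycle(const int *latency, int count, int n, int chaining)
{
    long start = 0, end = 0;
    assert(count > 0 && n > 0);
    for (int k = 0; k < count; k++) {
        end = start + latency[k] + n;  /* last result written by this cycle */
        /* Chained: the consumer starts when the first result appears.
           Unchained: it waits for the last element to be written. */
        start = chaining ? start + latency[k] : end;
    }
    return end;
}
```

For a load, multiply, add sequence with assumed latencies of 12, 7, and 6 cycles and 64-element vectors, the model gives 217 cycles without chaining and 89 with it.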

Vector Startup

§ Two components of vector startup penalty
  – functional unit latency (time through pipeline)
  – dead time or recovery time (time before another vector instruction can start down pipeline). Some pipelines reduce control logic by requiring dead time between instructions to the same vector unit

[Figure: pipeline diagram with stages R X X X W, one element per row. The elements of the first vector instruction are followed by a dead time gap before the elements of the second vector instruction; functional unit latency is marked as the time through the pipeline.]

Dead Time and Short Vectors

Cray C90, two lanes, 4 cycle dead time
  – 64 cycles active, then 4 cycles dead time
  – Maximum efficiency 94% with 128 element vectors

T0, eight lanes, no dead time
  – 100% efficiency with 8 element vectors
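The efficiency figures follow from active cycles divided by active plus dead cycles, where active cycles are the vector length divided by the number of lanes. A small C sketch of that arithmetic (efficiency is a hypothetical helper name):

```c
#include <assert.h>

/* Fraction of cycles a vector unit does useful work when every
   instruction of vl elements is followed by dead_cycles of dead time. */
double efficiency(int vl, int lanes, int dead_cycles)
{
    assert(vl > 0 && lanes > 0 && dead_cycles >= 0);
    int active = (vl + lanes - 1) / lanes;  /* cycles spent issuing elements */
    return (double)active / (double)(active + dead_cycles);
}
```

For the Cray C90 numbers, 128 elements over 2 lanes give 64 active cycles, so 64 / (64 + 4) is about 94%. For T0, 8 elements over 8 lanes with no dead time give 100%.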

Vector Memory-Memory versus Vector Register Machines

§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 (’73) and TI ASC (’71), were memory-memory machines
§ Cray-1 (’76) was first vector register machine

Example Source Code
  for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
    D[i] = A[i] - B[i];
  }

Vector Memory-Memory Code
  ADDV C, A, B
  SUBV D, A, B

Vector Register Code
  LV   V1, A
  LV   V2, B
  ADDV V3, V1, V2
  SV   V3, C
  SUBV V4, V1, V2
  SV   V4, D

Vector Memory-Memory vs. Vector Register Machines

§ Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
  – All operands must be read in and out of memory
§ VMMAs make it difficult to overlap execution of multiple vector operations, why?
  – Must check dependencies on memory addresses
§ VMMAs incur greater startup latency
  – Scalar code was faster on CDC Star-100 (VMM) for vectors < 100 elements
§ Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
§ (we ignore vector memory-memory from now on)

Automatic Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: along the time axis, scalar sequential code runs load, load, add, store for Iter. 1 and then again for Iter. 2. In vectorized code, Iter. 1 and Iter. 2 sit side by side and each vector instruction (load, load, add, store) covers the same operation across iterations.]

Vectorization is a massive compile-time reordering of operation sequencing
⇒ requires extensive loop dependence analysis

Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”

for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

      ANDI   R1, N, 63     # N mod 64
      MTC1   VLR, R1       # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     # Multiply by 8
      DADDU  RA, RA, R2    # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       # Reset full length
      BGTZ   N, loop       # Any more to do?

[Figure: arrays A, B, and C are each divided into a remainder strip and 64-element strips, with one vector add per strip.]
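The same control structure can be written in C as a sketch. Like the assembly, it handles the remainder strip (N mod 64) first and then full-length strips. MVL, vadd, and stripmined_add are names assumed for illustration; vadd stands in for the LV/LV/ADDV.D/SV sequence at a given vector length.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* maximum vector length, matching the 64-element strips */

/* Stand-in for LV, LV, ADDV.D, SV executed at vector length vl. */
static void vadd(double *c, const double *a, const double *b, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* C[i] = A[i] + B[i] for any n; returns the number of non-empty strips. */
size_t stripmined_add(double *c, const double *a, const double *b, size_t n)
{
    size_t strips = 0;
    size_t done = 0;
    size_t vl = n % MVL;  /* remainder strip first, as in the assembly */
    while (done < n) {
        if (vl > 0) {
            vadd(c + done, a + done, b + done, vl);
            strips++;
        }
        done += vl;       /* bump pointers by the elements just processed */
        vl = MVL;         /* every later strip runs at full length */
    }
    return strips;
}
```

With n = 150 this runs a 22-element strip followed by two 64-element strips.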

Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:
  for (i=0; i<N; i++)
    if (A[i]>0) then
      A[i] = B[i];

Solution: Add vector mask (or flag) registers
  – vector version of predicate registers, 1 bit per element
…and maskable vector instructions
  – vector operation becomes bubble (“NOP”) at elements where mask bit is clear

Code example:
  CVM                  # Turn on all elements
  LV      vA, rA       # Load entire A vector
  SGTVS.D vA, F0       # Set bits in mask register where A>0
  LV      vA, rB       # Load B vector into A under mask
  SV      vA, rA       # Store A back to memory under mask
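In C terms, the masked sequence behaves roughly like the sketch below for one strip of at most MVL elements. The mask is computed for the whole strip first and then gates the copy, mirroring the split between SGTVS.D and the masked LV/SV. MVL and conditional_copy are assumed names, and the array a plays the role of both memory and the vA register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* A[i] = B[i] where A[i] > 0, for one strip of vl <= MVL elements. */
void conditional_copy(double *a, const double *b, size_t vl)
{
    bool mask[MVL];
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)  /* SGTVS.D: set mask bit where A > 0 */
        mask[i] = a[i] > 0.0;
    for (size_t i = 0; i < vl; i++)  /* masked LV and SV: skip clear bits */
        if (mask[i])
            a[i] = b[i];
}
```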

Masked Vector Instructions

Mask used in both figures: M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1

Density-Time Implementation
  – scan mask vector and only execute elements with non-zero masks

  [Figure: only the elements with a set mask bit go through the pipeline; A[7]/B[7] is entering while C[1], C[4], and C[5] reach the write data port.]

Simple Implementation
  – execute all N operations, turn off result writeback according to mask

  [Figure: every element goes through the pipeline; A[3]/B[3] through A[7]/B[7] are in flight while C[0], C[1], and C[2] reach the write data port, and the mask bit drives the write enable.]
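The trade-off between the two implementations can be put in numbers with a simple cycle count. This sketch assumes a single lane, one element per cycle, and no overhead for scanning the mask; simple_cycles and density_time_cycles are hypothetical helper names.

```c
#include <assert.h>

/* Simple implementation: every element occupies a pipeline slot. */
int simple_cycles(const unsigned char *mask, int vl)
{
    (void)mask;  /* the mask only gates writeback, not issue */
    assert(vl >= 0);
    return vl;
}

/* Density-time implementation: only elements with a set mask bit issue. */
int density_time_cycles(const unsigned char *mask, int vl)
{
    int cycles = 0;
    assert(vl >= 0);
    for (int i = 0; i < vl; i++)
        if (mask[i])
            cycles++;
    return cycles;
}
```

For the 8-element mask in the figure (bits 1, 4, 5, and 7 set), the simple implementation spends 8 issue slots and the density-time implementation 4.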

Vector Reductions

Problem: Loop-carried dependence on reduction variables
  sum = 0;
  for (i=0; i<N; i++)
    sum += A[i];  # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction
  # Rearrange as:
  sum[0:VL-1] = 0                   # Vector of VL partial sums
  for(i=0; i<N; i+=VL)              # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];     # Vector sum
  # Now have VL partial sums in one vector register
  do {
    VL = VL/2;                      # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]   # Halve no. of partials
  } while (VL>1)
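A runnable C version of the rearranged loop, as a sketch: VL is fixed at 8 here (an assumed value that must be a power of two), and unlike the pseudocode it also handles an N that is not a multiple of VL. Re-association can change floating-point rounding, which is why the slide says "if possible".

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* assumed vector length; must be a power of two */

/* Sum of a[0..n-1] using VL partial sums and a binary-tree reduction. */
double vector_sum(const double *a, size_t n)
{
    double sum[VL] = {0};  /* vector of VL partial sums */
    size_t i = 0;
    assert(a != NULL || n == 0);
    for (; i + VL <= n; i += VL)          /* stripmine VL-sized chunks */
        for (size_t j = 0; j < VL; j++)
            sum[j] += a[i + j];           /* one vector add per chunk */
    for (size_t j = 0; i < n; i++, j++)   /* remainder strip, fewer than VL */
        sum[j] += a[i];
    for (size_t vl = VL / 2; vl >= 1; vl /= 2)  /* halve vector length */
        for (size_t j = 0; j < vl; j++)
            sum[j] += sum[j + vl];        /* halve number of partials */
    return sum[0];
}
```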

Vector Scatter/Gather

Want to vectorize loops with indirect accesses:
  for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]]

Indexed load instruction (Gather)
  LV     vD, rD        # Load indices in D vector
  LVI    vC, rC, vD    # Load indirect from rC base
  LV     vB, rB        # Load B vector
  ADDV.D vA,vB,vC      # Do add
  SV     vA, rA        # Store result
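The gather step can be sketched in C for one strip of at most MVL elements. The local array vc plays the role of the vC register; MVL and indirect_add are names assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* A[i] = B[i] + C[D[i]] for one strip of vl <= MVL elements. */
void indirect_add(double *a, const double *b, const double *c,
                  const size_t *d, size_t vl)
{
    double vc[MVL];
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)  /* LVI: gather C[D[i]] into vC */
        vc[i] = c[d[i]];
    for (size_t i = 0; i < vl; i++)  /* ADDV.D, then SV */
        a[i] = b[i] + vc[i];
}
```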

Vector Scatter/Gather

Histogram example:
  for (i=0; i<N; i++)
    A[B[i]]++;

Is following a correct translation?
  LV   vB, rB        # Load indices in B vector
  LVI  vA, rA, vB    # Gather initial A values
  ADDV vA, vA, 1     # Increment
  SVI  vA, rA, vB    # Scatter incremented values
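One way to explore the question is to model both versions in C and feed them an index vector with a repeated entry. The sketch assumes the scatter writes elements in index order; hist_scalar and hist_vector are hypothetical names. When B contains duplicates, the gather reads the same old value twice, so the vector version loses increments.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* Reference behavior: the scalar loop from the slide. */
void hist_scalar(int *a, const size_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[b[i]]++;
}

/* Model of the LVI / ADDV / SVI sequence for one strip of vl elements. */
void hist_vector(int *a, const size_t *b, size_t vl)
{
    int v[MVL];
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) v[i] = a[b[i]];  /* LVI: gather */
    for (size_t i = 0; i < vl; i++) v[i] += 1;       /* ADDV: increment */
    for (size_t i = 0; i < vl; i++) a[b[i]] = v[i];  /* SVI: scatter */
}
```

With B = {1, 1, 2}, the scalar loop leaves A[1] = 2 while the vector model leaves A[1] = 1.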

A Modern Vector Super: NEC SX-9 (2008)

§ 65nm CMOS technology
§ Vector unit (3.2 GHz)
  – 8 foreground VRegs + 64 background VRegs (256 x 64-bit elements/VReg)
  – 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
  – 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
  – 1 load or store unit (8 x 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
  – 4-way superscalar with out-of-order and speculative execution
  – 64KB I-cache and 64KB data cache

• Memory system provides 256 GB/s DRAM bandwidth per CPU
• Up to 16 CPUs and up to 1TB DRAM form shared-memory node
  – total of 4TB/s bandwidth to shared DRAM memory
• Up to 512 nodes connected via 128 GB/s network links (message passing between nodes)

Multimedia Extensions (aka SIMD extensions)

§ Very short vectors added to existing ISAs for microprocessors
§ Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
  – Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b
  – Newer designs have wider registers
    • 128b for PowerPC Altivec, Intel SSE2/3/4
    • 256b for Intel AVX
§ Single instruction operates on all elements within register

[Figure: a 64-bit register viewed as 1 x 64b, 2 x 32b, 4 x 16b, or 8 x 8b; 4x16b adds combine two registers of four 16b elements, element by element, into a third.]
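The 4x16b add can be imitated in portable C on a plain 64-bit integer, which shows what the hardware has to do: keep carries from crossing 16-bit boundaries. This is a software sketch of the idea, not how any particular multimedia extension is implemented; add4x16 and lane16 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Add four packed 16-bit elements; each element wraps around independently. */
uint64_t add4x16(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8000800080008000ULL;  /* top bit of each element */
    uint64_t low = (x & ~H) + (y & ~H);  /* low 15 bits: carries stop at bit 15 */
    return low ^ ((x ^ y) & H);          /* add top bits with no carry out */
}

/* Extract element k (0 = least significant) from a packed register. */
uint16_t lane16(uint64_t r, int k)
{
    assert(k >= 0 && k < 4);
    return (uint16_t)(r >> (16 * k));
}
```

Adding 0x0001FFFF00020003 and 0x0001000100010001 yields 0x0002000000030004: the element holding 0xFFFF wraps to 0 without disturbing its neighbor.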

Multimedia Extensions versus Vectors

§ Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
§ Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
§ Trend towards fuller vector support in microprocessors
  – Better support for misaligned memory accesses
  – Support of double-precision (64-bit floating-point)
  – New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)

Degree of Vectorization

§ Compilers are good at finding data-level parallelism

Average Vector Length

§ Maximum depends on whether benchmarks use 16-bit or 32-bit operations

Distribution of Instructions

Question of the Day

§ Can Vector and VLIW combine?
  § Yes!
  § Fujitsu FR-V can process both VLIW and vector instructions
  § Exploits both instruction- and data-level parallelism

Acknowledgements

§ These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252
§ “Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”. Christos Kozyrakis and David Patterson. MICRO-35. 2002