Transcript of CS 152 Computer Architecture and Engineering, Lecture 15 (cs152/sp16/lectures/L15)
3/30/2016 CS152, Spring 2016

CS 152 Computer Architecture and Engineering
Lecture 15: Vector Computers

Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152

Last Time, Lecture 14: Multithreading

[Figure: pipeline-slot diagrams comparing superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; the vertical axis is time in processor cycles, and colors distinguish threads 1–5 and idle slots.]

Question of the Day

§ Can Vector and VLIW combine?

Supercomputers

§ Definition of a supercomputer:
– Fastest machine in world at given task
– Performs at or near the currently highest operational rate for computers
– A device to turn a compute-bound problem into an I/O-bound problem
– Any machine costing $30M+
– Any machine designed by Seymour Cray
§ CDC 6600 (Cray, 1964) regarded as first supercomputer

CDC 6600 — Seymour Cray, 1963

§ A fast pipelined machine with 60-bit words
– 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
– Floating point: adder, 2 multipliers, divider
– Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
– a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
– over 100 sold ($7-10M each)

IBM Memo on CDC 6600

Thomas Watson Jr., IBM CEO, August 1963:
"Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers ... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer."

To which Cray replied: "It seems like Mr. Watson has answered his own question."

Top500 Systems

LINPACK & LAPACK: software libraries for performing linear algebra

Oak Ridge Titan

§ 560,640 cores
§ LINPACK performance 17,590 TFlop/s
§ Theoretical peak 27,112.5 TFlop/s
§ 8,209.00 kW
§ 710,144 GB
§ Opteron 6274 16C 2.2 GHz

NERSC (LBNL) Cori

§ Cray XC40 supercomputer
§ Peak performance: … Petaflops/sec
§ 1,630 compute nodes, 52,160 cores in total
§ Cray Aries high-speed interconnect with Dragonfly topology, as on Edison (0.25 µs to 3.7 µs MPI latency, ~8 GB/sec MPI bandwidth)
§ Aggregate memory: 203 TB
§ Scratch storage capacity: 30 PB

CDC 6600: A Load/Store Architecture

• Separate instructions to manipulate three types of registers:
  8 60-bit data registers (X)
  8 18-bit address registers (A)
  8 18-bit index registers (B)
• All arithmetic and logic instructions are reg-to-reg
• Only load and store instructions refer to memory!
  Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
  - very useful for vector operations

Instruction formats (field widths in bits):
  opcode(6) i(3) j(3) k(3)        Ri ← (Rj) op (Rk)
  opcode(6) i(3) j(3) disp(18)    Ri ← M[(Rj) + disp]

CDC 6600: Datapath

[Figure: the instruction stack (8 × 60-bit) and IR feed the ten functional units, which read and write the operand registers (8 × 60-bit), address registers (8 × 18-bit), and index registers (8 × 18-bit). Central memory (128 Kwords, 32 banks, 1 µs cycle) exchanges operands and results with the registers via operand/result addresses.]

CDC 6600 ISA designed to simplify high-performance implementation

§ Use of three-address, register-register ALU instructions simplifies pipelined implementation
– No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
– Software can schedule load of address register before use of value
– Can interleave independent instructions in between
§ CDC 6600 has multiple parallel but unpipelined functional units
– E.g., 2 separate multipliers
§ Follow-on machine CDC 7600 used pipelined functional units
– Foreshadows later RISC designs

CDC 6600: Vector Addition

      B0 ← -n
loop: JZE B0, exit
      A0 ← B0 + a0    # load X0
      A1 ← B0 + b0    # load X1
      X6 ← X0 + X1
      A6 ← B0 + c0    # store X6
      B0 ← B0 + 1
      jump loop

Ai = address register, Bi = index register, Xi = data register

Supercomputer Applications

§ Typical application areas
– Military research (nuclear weapons, cryptography)
– Scientific research
– Weather forecasting
– Oil exploration
– Industrial design (car crash simulation)
– Bioinformatics
– Cryptography
§ All involve huge computations on large data sets
§ In 70s-80s, Supercomputer ≡ Vector Machine

VLIW vs Vector

§ VLIW takes advantage of instruction-level parallelism (ILP) by specifying instructions to execute in parallel
§ Vector architectures perform the same operation on multiple data elements
– Data-level parallelism

[Figure: a VLIW instruction bundle (IntOp1, IntOp2, MemOp1, MemOp2, FPOp1, FPOp2) contrasted with a vector arithmetic instruction ADDV v3, v1, v2 that applies "+" element-wise to elements [0] through [VLR-1] of v1 and v2.]

Vector Programming Model

[Figure: scalar registers r0–r15 alongside vector registers v0–v15, each holding elements [0], [1], [2], … [VLRMAX-1], plus the vector length register VLR. A vector arithmetic instruction ADDV v3, v1, v2 adds v1 and v2 element-wise into v3 for elements [0] through [VLR-1]. A vector load/store instruction LV v1, r1, r2 fills vector register v1 from memory using base address r1 and stride r2.]

Control Information

§ VLR limits the highest vector element to be processed by a vector instruction
– VLR is loaded prior to executing the vector instruction, with a special instruction
§ Stride for load/stores:
– Vectors may not be adjacent in memory addresses
– E.g., different dimensions of a matrix
– Stride can be specified as part of the load/store

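A strided load like the one above can be sketched in a few lines of Python; `lv` is a hypothetical model of the LV instruction, not a real API. Walking a row-major matrix column-by-column is the classic case where the stride equals the row length.

```python
# Model of a strided vector load LV v1, r1, r2: gather VLR elements
# starting at `base`, spaced `stride` words apart. `lv` is an
# illustrative helper, not an actual instruction interface.
def lv(memory, base, stride, vlr):
    return [memory[base + i * stride] for i in range(vlr)]

# A 4x4 row-major matrix flattened into "memory": column j of an
# n-column matrix lives at stride n, so one strided load fetches it.
n = 4
memory = list(range(16))          # memory[r*n + c] == r*n + c
col1 = lv(memory, base=1, stride=n, vlr=4)
print(col1)                       # [1, 5, 9, 13]
```

A unit-stride load is just the special case `stride=1`.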
Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector code
LI     VLR, 64
LV     V1, R1
LV     V2, R2
ADDV.D V3, V1, V2
SV     V3, R3

Flynn's Taxonomy

§ Single instruction, single data (SISD)
– E.g., our in-order processor
§ Single instruction, multiple data (SIMD)
– Multiple processing elements, same operation, different data
– Vector
– Multiple processing units execute the same instruction on different data in lockstep. Either all complete or none do. Therefore, all units have to execute the same instruction at a given time
§ Multiple instruction, multiple data (MIMD)
– Multiple autonomous processors executing different instructions on different data
– Most common and general parallel machine
§ Multiple instruction, single data (MISD)
– Why would anyone do this?

More Categories

§ Single program, multiple data (SPMD)
– Multiple autonomous processors execute the program at independent points
– Difference with SIMD: SIMD imposes lockstep
– Programs in SPMD can be at independent points
– SPMD can run on general-purpose processors
– Most common method for parallel computing
§ Multiple program, multiple data (MPMD)
– Multiple autonomous processors simultaneously operating on at least 2 independent programs

Vector Supercomputers

§ Epitome: Cray-1, 1976
§ Scalar unit
– Load/store architecture
§ Vector extension
– Vector registers
– Vector instructions
§ Implementation
– Hardwired control
– Highly pipelined functional units
– Interleaved memory system
– No data caches
– No virtual memory

Cray-1 (1976)

[Figure: Cray-1 block diagram. Single-port memory: 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store, 320 MW/sec instruction buffer refill; memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz). Four instruction buffers (64-bit × 16) feed the NIP, LIP, and CIP. Registers: eight 64-element vector registers V0–V7 with vector length and vector mask registers; eight scalar registers S0–S7 backed by 64 T registers; eight address registers A0–A7 backed by 64 B registers, addressed via (A0) and ((Ah) + jkm). Functional units: FP add, FP multiply, FP reciprocal (operands Sj, Sk → Si or Vj, Vk → Vi); integer add, logic, shift, population count; address add and address multiply (Aj, Ak → Ai).]

Vector Instruction Set Advantages

§ Compact
– one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
§ Scalable
– can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution

• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)

[Figure: V3 ← V1 * V2 with elements of V1 and V2 streaming through a six-stage multiply pipeline into V3.]

Vector Instruction Execution

ADDV C, A, B

[Figure: execution using one pipelined functional unit — elements A[i]+B[i] enter one per cycle (A[6],B[6] down through A[3],B[3] in flight while C[0], C[1], C[2] complete). Execution using four pipelined functional units — four elements enter per cycle, with unit 0 handling elements 0, 4, 8, 12, …, unit 1 handling 1, 5, 9, 13, …, unit 2 handling 2, 6, 10, 14, …, and unit 3 handling 3, 7, 11, 15, ….]

How Do Vector Architectures Affect Memory?

Interleaved Vector Memory System

[Figure: an address generator (base + stride, driven from the vector registers) issues element addresses across 16 memory banks, numbered 0 through F.]

Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency
• Bank busy time: time before a bank is ready to accept the next request

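The interaction between stride and bank busy time can be sketched with a tiny Python model (the bank count and busy time below match the slide's Cray-1 numbers, but the one-issue-per-cycle pipeline is an assumption for illustration, not the Cray-1's exact behavior): a strided stream stalls whenever it returns to a bank that is still busy.

```python
# Interleaved-memory sketch: NBANKS banks, each busy for BUSY cycles
# after accepting a request; addresses map to banks modulo NBANKS.
NBANKS, BUSY = 16, 4

def issue_cycles(stride, n_elems):
    """Cycle at which each of n_elems strided accesses can issue."""
    ready = [0] * NBANKS          # cycle when each bank is next free
    t, times = 0, []
    for i in range(n_elems):
        bank = (i * stride) % NBANKS
        t = max(t, ready[bank])   # stall until the bank is free
        times.append(t)
        ready[bank] = t + BUSY    # bank busy for BUSY cycles
        t += 1                    # at most one issue per cycle
    return times

# Stride 1 cycles through all 16 banks: no stalls, 1 element/cycle.
print(issue_cycles(1, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
# Stride 16 hits the same bank every time: one element per BUSY cycles.
print(issue_cycles(16, 4))  # [0, 4, 8, 12]
```

This is why strides that share a large factor with the number of banks are the pathological case.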
Vector Unit Structure

[Figure: a four-lane vector unit. Each lane contains a slice of the vector registers plus one pipeline of each functional unit, and connects to the memory subsystem. Lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, ….]

T0 Vector Microprocessor (UCB/ICSI, 1995)

[Figure: T0 die photo with one lane highlighted. Vector register elements are striped over the eight lanes: lane 0 holds elements [0], [8], [16], [24]; lane 1 holds [1], [9], [17], [25]; …; lane 7 holds [7], [15], [23], [31].]

Vector Instruction Parallelism

§ Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes

[Figure: timeline of the load unit, multiply unit, and add unit, each working through successive vector instructions (load, load, mul, mul, add, add) as instructions issue one per cycle; once all three units are busy, their execution overlaps.]

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining

§ Vector version of register bypassing
– introduced with Cray-1

[Figure: LV v1 streams from memory through the load unit into V1; its elements are chained into the multiplier for MULV v3, v1, v2, and the products are chained in turn into the adder for ADDV v5, v3, v4, using V2 and V4 as the other operands.]

LV v1
MULV v3, v1, v2
ADDV v5, v3, v4

Vector Chaining Advantage

• Without chaining, must wait for last element of result to be written before starting dependent instruction
• With chaining, can start dependent instruction as soon as first result appears

[Figure: time-lines for Load → Mul → Add, serialized without chaining and overlapped with chaining.]

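A back-of-the-envelope model makes the advantage concrete. The latencies below (12-cycle load, 7-cycle multiply, 6-cycle add) are illustrative assumptions, not any real machine's numbers: each unit produces one element per cycle after its fill latency; a chained consumer starts when the producer's first element appears, an unchained one only after the last element is written.

```python
VLEN = 64                                   # elements per vector register
LATENCY = {"load": 12, "mul": 7, "add": 6}  # assumed pipeline latencies
UNITS = ("load", "mul", "add")              # dependent chain LV -> MULV -> ADDV

def total_cycles(chained):
    """Cycle when the last element of the final result is written."""
    start = 0
    for unit in UNITS[:-1]:
        if chained:
            start += LATENCY[unit]              # first element forwarded
        else:
            start += LATENCY[unit] + VLEN - 1   # wait for last element
    return start + LATENCY[UNITS[-1]] + VLEN - 1

print(total_cycles(chained=True))    # 88
print(total_cycles(chained=False))   # 214
```

With chaining the three latencies simply add on top of one vector's worth of streaming; without it, each instruction pays the full vector length again.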
Vector Startup

§ Two components of vector startup penalty
– functional unit latency (time through pipeline)
– dead time or recovery time (time before another vector instruction can start down pipeline). Some pipelines reduce control logic by requiring dead time between instructions to the same vector unit

[Figure: per-element pipeline diagram (R X X X W rows) showing the functional unit latency of the first vector instruction, the dead time that follows it, and the delayed start of the second vector instruction.]

Dead Time and Short Vectors

[Figure: pipeline occupancy diagrams. Cray C90, two lanes: 64 cycles active per 128-element vector followed by 4 cycles dead time, for a maximum efficiency of 94% with 128-element vectors. T0, eight lanes: no dead time, so 100% efficiency even with 8-element vectors.]

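The efficiency numbers on this slide fall out of a one-line calculation: a 128-element vector on two lanes keeps the pipeline busy 128/2 = 64 cycles, and the 4-cycle dead time is pure overhead before the next vector instruction can start.

```python
def efficiency(vlen, lanes, dead_time):
    """Fraction of cycles doing useful element operations."""
    active = vlen // lanes          # cycles the pipeline is busy
    return active / (active + dead_time)

print(round(efficiency(128, 2, 4), 3))  # 0.941  (Cray C90: ~94%)
print(efficiency(8, 8, 0))              # 1.0    (T0: no dead time)
```

The same formula shows why dead time is most painful for short vectors: with vlen=16 on the C90's two lanes, efficiency drops to 8/12 = 67%.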
Vector Memory-Memory versus Vector Register Machines

§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
§ Cray-1 ('76) was first vector register machine

Example source code:
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector memory-memory code:
ADDV C, A, B
SUBV D, A, B

Vector register code:
LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D

Vector Memory-Memory vs. Vector Register Machines

§ Vector memory-memory architectures (VMMAs) require greater main memory bandwidth, why?
– All operands must be read in and out of memory
§ VMMAs make it difficult to overlap execution of multiple vector operations, why?
– Must check dependencies on memory addresses
§ VMMAs incur greater startup latency
– Scalar code was faster on CDC Star-100 (VMM) for vectors < 100 elements
§ Apart from CDC follow-ons (Cyber-205, ETA-10), all major vector machines since Cray-1 have had vector register architectures
§ (we ignore vector memory-memory from now on)

Automatic Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: scalar sequential code runs load, load, add, store for iteration 1, then again for iteration 2; vectorized code runs both iterations' loads together, then both adds, then both stores, as single vector instructions.]

Vectorization is a massive compile-time reordering of operation sequencing
⇒ requires extensive loop dependence analysis

Vector Stripmining

Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, "stripmining"

for (i=0; i<N; i++)
  C[i] = A[i]+B[i];

      ANDI   R1, N, 63    # N mod 64
      MTC1   VLR, R1      # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    # Multiply by 8
      DADDU  RA, RA, R2   # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      # Reset full length
      BGTZ   N, loop      # Any more to do?

[Figure: A, B, and C each split into 64-element strips plus a shorter remainder strip.]

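The stripmining pattern can be sketched in Python (MVL stands in for the machine's maximum vector length; 64 matches the assembly above). As in the assembly, the first strip handles N mod MVL elements and every later strip runs at full vector length.

```python
MVL = 64   # assumed maximum vector length (elements per register)

def stripmined_add(A, B):
    """C[i] = A[i] + B[i] computed one vector strip at a time."""
    N = len(A)
    C = [0.0] * N
    i, vl = 0, N % MVL or MVL     # remainder strip first (or a full one)
    while i < N:
        # one ADDV of length vl (here: a slice-wise add)
        C[i:i+vl] = [a + b for a, b in zip(A[i:i+vl], B[i:i+vl])]
        i += vl
        vl = MVL                  # reset VLR to full vector length
    return C

A = list(range(130)); B = [1] * 130     # 130 = 2 + 64 + 64
print(stripmined_add(A, B) == [a + 1 for a in A])   # True
```

For N = 130 the loop body runs three times, with vector lengths 2, 64, 64 — exactly the strips the figure shows.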
Vector Conditional Execution

Problem: Want to vectorize loops with conditional code:
for (i=0; i<N; i++)
  if (A[i]>0) then A[i] = B[i];

Solution: Add vector mask (or flag) registers
– vector version of predicate registers, 1 bit per element
...and maskable vector instructions
– vector operation becomes bubble ("NOP") at elements where mask bit is clear

Code example:
CVM             # Turn on all elements
LV vA, rA       # Load entire A vector
SGTVS.D vA, F0  # Set bits in mask register where A>0
LV vA, rB       # Load B vector into A under mask
SV vA, rA       # Store A back to memory under mask

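The masked sequence above can be modeled in a few lines of Python: build a one-bit-per-element mask with the compare, then let the "vector operation" take effect only where the mask is set (computing everything and gating the writeback, as in the simple hardware implementation).

```python
def masked_copy(A, B):
    """Model of: if (A[i] > 0) A[i] = B[i], via a vector mask."""
    mask = [a > 0 for a in A]              # SGTVS.D: set mask where A > 0
    return [b if m else a                  # masked LV/SV: update only
            for a, b, m in zip(A, B, mask)]   # where the mask bit is set

A = [3, -1, 0, 7]
B = [10, 20, 30, 40]
print(masked_copy(A, B))   # [10, -1, 0, 40]
```

Elements 1 and 2 keep their old values because their mask bits are clear — those slots are bubbles in the vector pipeline.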
Masked Vector Instructions

Simple implementation
– execute all N operations, turn off result writeback according to mask

Density-time implementation
– scan mask vector and only execute elements with non-zero masks

[Figure: both implementations processing mask bits M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1. The simple implementation streams all element pairs A[i],B[i] through the pipeline and gates the write-enable on the write data port; the density-time implementation skips masked-off elements entirely, writing only C[1], C[4], C[5], C[7].]

Vector Reductions

Problem: Loop-carried dependence on reduction variables
sum = 0;
for (i=0; i<N; i++)
  sum += A[i];  # Loop-carried dependence on sum

Solution: Re-associate operations if possible, use binary tree to perform reduction
# Rearrange as:
sum[0:VL-1] = 0              # Vector of VL partial sums
for (i=0; i<N; i+=VL)        # Stripmine VL-sized chunks
  sum[0:VL-1] += A[i:i+VL-1];  # Vector sum
# Now have VL partial sums in one vector register
do {
  VL = VL/2;                     # Halve vector length
  sum[0:VL-1] += sum[VL:2*VL-1]  # Halve no. of partials
} while (VL>1)

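The same recipe, executed in Python (VL and N are example values chosen so that VL divides N, as the slide's pseudocode assumes): accumulate VL independent partial sums with vector adds, then combine them with a binary tree.

```python
VL, N = 8, 64
A = list(range(N))

# Stripmined vector adds: partial[j] accumulates A[j], A[j+VL], ...
partial = [0] * VL
for i in range(0, N, VL):
    partial = [p + a for p, a in zip(partial, A[i:i+VL])]

# Binary-tree combine: halve the vector length until one sum remains.
vl = VL
while vl > 1:
    vl //= 2
    partial = [partial[j] + partial[vl + j] for j in range(vl)]

print(partial[0], sum(A))   # 2016 2016
```

Note the re-association: the elements are added in a different order than the scalar loop, which is exact for integers but can change rounding for floating point.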
Vector Scatter/Gather

Want to vectorize loops with indirect accesses:
for (i=0; i<N; i++)
  A[i] = B[i] + C[D[i]]

Indexed load instruction (gather):
LV  vD, rD         # Load indices in D vector
LVI vC, rC, vD     # Load indirect from rC base
LV  vB, rB         # Load B vector
ADDV.D vA, vB, vC  # Do add
SV  vA, rA         # Store result

Vector Scatter/Gather

Histogram example:
for (i=0; i<N; i++)
  A[B[i]]++;

Is the following a correct translation?
LV   vB, rB      # Load indices in B vector
LVI  vA, rA, vB  # Gather initial A values
ADDV vA, vA, 1   # Increment
SVI  vA, rA, vB  # Scatter incremented values

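A small Python model shows why the answer is no when the index vector contains duplicates: both occurrences gather the same old value, so one of the increments is lost when the results are scattered back.

```python
def histogram_scalar(A, B):
    """The original scalar loop: A[B[i]]++ for each i."""
    A = A[:]
    for i in B:
        A[i] += 1
    return A

def histogram_gather_scatter(A, B):
    """The proposed gather/increment/scatter translation."""
    A = A[:]
    vA = [A[i] for i in B]        # LVI: gather initial A values
    vA = [v + 1 for v in vA]      # ADDV: increment
    for i, v in zip(B, vA):       # SVI: scatter (later writes win)
        A[i] = v
    return A

A, B = [0, 0, 0], [1, 1, 2]       # index 1 appears twice
print(histogram_scalar(A, B))          # [0, 2, 1]
print(histogram_gather_scatter(A, B))  # [0, 1, 1]  -- one update lost
```

The translation is only correct if all indices in B are distinct; with duplicates, the gathered values are stale by the time they are scattered.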
A Modern Vector Super: NEC SX-9 (2008)

§ 65 nm CMOS technology
§ Vector unit (3.2 GHz)
– 8 foreground VRegs + 64 background VRegs (256 × 64-bit elements/VReg)
– 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
– 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
– 1 load or store unit (8 × 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
– 4-way superscalar with out-of-order and speculative execution
– 64 KB I-cache and 64 KB data cache
• Memory system provides 256 GB/s DRAM bandwidth per CPU
• Up to 16 CPUs and up to 1 TB DRAM form a shared-memory node
– total of 4 TB/s bandwidth to shared DRAM memory
• Up to 512 nodes connected via 128 GB/s network links (message passing between nodes)

Multimedia Extensions (aka SIMD extensions)

§ Very short vectors added to existing ISAs for microprocessors
§ Use existing 64-bit registers split into 2×32b or 4×16b or 8×8b
– Lincoln Labs TX-2 from 1957 had 36b datapath split into 2×18b or 4×9b
– Newer designs have wider registers
• 128b for PowerPC Altivec, Intel SSE2/3/4
• 256b for Intel AVX
§ Single instruction operates on all elements within register

[Figure: one 64-bit register viewed as 2×32b, 4×16b, or 8×8b lanes; a 4×16b add performs four independent 16-bit additions with one instruction.]

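Subword SIMD can be imitated in Python by packing four 16-bit lanes into one 64-bit integer and masking off inter-lane carries, the way a 4×16b SIMD add partitions the adder's carry chain. This sketch uses unsigned modular lane arithmetic (wraparound at 16 bits, no saturation), which is an assumption; real multimedia ISAs also offer saturating variants.

```python
LANES, W = 4, 16
MASK = (1 << W) - 1   # 0xFFFF, one lane's worth of bits

def pack(vals):
    """Pack LANES W-bit values into a single integer, lane 0 lowest."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & MASK) << (i * W)
    return word

def unpack(word):
    return [(word >> (i * W)) & MASK for i in range(LANES)]

def simd_add(x, y):
    """Lane-wise add: carries never cross the 16-bit lane boundaries."""
    return pack([(a + b) & MASK for a, b in zip(unpack(x), unpack(y))])

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(unpack(simd_add(a, b)))   # [11, 22, 33, 44]
```

In hardware the four additions happen at once simply by cutting the carry chain at the lane boundaries; here the masking plays that role.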
Multimedia Extensions versus Vectors

§ Limited instruction set:
– no vector length control
– no strided load/store or scatter/gather
– unit-stride loads must be aligned to 64/128-bit boundary
§ Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
§ Trend towards fuller vector support in microprocessors
– Better support for misaligned memory accesses
– Support of double-precision (64-bit floating-point)
– New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)

Degree of Vectorization

§ Compilers are good at finding data-level parallelism

Average Vector Length

§ Maximum depends on whether benchmarks use 16-bit or 32-bit operations

Distribution of Instructions

Question of the Day

§ Can Vector and VLIW combine?
§ Yes!
§ Fujitsu FR-V can process both VLIW and vector instructions
§ Exploits both instruction- and data-level parallelism

Acknowledgements

§ These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252
§ "Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks". Christos Kozyrakis and David Patterson. MICRO-35, 2002