CS 61C: Great Ideas in Computer Architecture Lecture 12 ...
Transcript of CS 61C: Great Ideas in Computer Architecture Lecture 12 ...
CS61C:GreatIdeasinComputerArchitecture
Lecture12:Control&OperatingSpeed
KrsteAsanović &RandyKatz
http://inst.eecs.berkeley.edu/~cs61c/fa17
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 2
Recap:Addingbranchestodatapath
CS61c 3
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
01
10 pc
01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel RegWEn BrUnBrEq BrLT ASelBSel ALUSel MemRW WBSelPCSel
wb
ImplementingJALR Instruction(I-Format)
• JALRrd,rs,immediate−WritesPC+4toReg[rd](returnaddress)− SetsPC=Reg[rs1]+immediate− Usessameimmediates asarithmeticandloads
§ nomultiplicationby2bytes
4
Addingjalr todatapath
CS61c 5
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel RegWEn BrUnBrEq BrLT ASelBSel ALUSel MemRW WBSelPCSel
wb
Addingjalr todatapath
CS61c 6
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel=I RegWEn=1
BrUn=* BrEq=* BrLT=*
Asel=0Bsel=1
ALUSel=Add
MemRW=Read WBSel=2PCSel
wb
Implementingjal Instruction
• JALsavesPC+4inReg[rd](thereturnaddress)• SetPC=PC+offset(PC-relativejump)• Targetsomewherewithin±219 locations,2bytesapart
− ±218 32-bitinstructions• Immediateencodingoptimizedsimilarlytobranchinstructiontoreducehardwarecost
7
Addingjal todatapath
CS61c 8
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel RegWEn BrUnBrEq BrLT ASelBSel ALUSel MemRW WBSelPCSel
wb
Addingjal todatapath
CS61c 9
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel=J RegWEn=1
BrUn=* BrEq=* BrLT=*
Asel=1Bsel=1
ALUSel=Add
MemRW=Read WBSel=2PCSel
wb
“UpperImmediate”instructions
• Has20-bitimmediateinupper20bitsof32-bitinstructionword• Onedestinationregister,rd• Usedfortwoinstructions
− LUI– LoadUpperImmediate(addtozero)− AUIPC– AddUpperImmediatetoPC
10
Implementinglui
CS61c 11
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
001
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel=U RegWEn=1
BrUn=* BrE=* BrLT=*
Asel=*Bsel=1 ALUSel=B MemRW=Read WBSel=1PCSel=pc+4
wb
pc
Implementingauipc
CS61c 12
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
001
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel=U RegWEn=1
BrUn=* BrE=* BrLT=*
Asel=1Bsel=1 ALUSel=Add MemRW=0 WBSel=1PCSel=pc+4
wb
pc
Recap:CompleteRV32IISA
13
NotinCS61C
RV32Ihas47instructionstotal37instructionscoveredinCS61C
Single-CycleRISC-VRV32IDatapath
CS61c 14
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
inst[31:0] ImmSel RegWEn BrUnBrEq BrLT ASelBSel ALUSel MemRW WBSelPCSel
wb
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 15
Processor
CS61c Lecture12:Control&Performance 16
Processor
Control
DatapathPC
RegistersArithmetic&LogicUnit
(ALU)
Memory
Bytes
Enable?Read/Write
Address
WriteDataReadData
Processor-MemoryInterface
Program
Data
Single-CycleRISC-VRV32IDatapath
CS61c 17
IMEMALU
Imm.Gen
+4
DMEMBranchComp.
Reg[]
AddrAAddrB
DataAAddrD
DataB
DataD
Addr
DataWDataR
10
0121
0 pc01
inst[11:7]
inst[19:15]
inst[24:20]
inst[31:7]
pc+4alu
mem
wb
alu
pc+4
Reg[rs1]
pc
imm[31:0]
Reg[rs2]
ControlLogic
inst[31:0] ImmSel RegWEn BrUnBrEq BrLT ASelBSel ALUSel MemRW WBSelPCSel
wb
ControlLogicTruthTable(incomplete)
CS61c Lecture12:Control&Performance 18
Inst[31:0] BrEq BrLT PCSel ImmSel BrUn ASel BSel ALUSel MemRW RegWEn WBSel
add * * +4 * * Reg Reg Add Read 1 ALU
sub * * +4 * * Reg Reg Sub Read 1 ALU
(R-R Op) * * +4 * * Reg Reg (Op) Read 1 ALU
addi * * +4 I * Reg Imm Add Read 1 ALU
lw * * +4 I * Reg Imm Add Read 1 Mem
sw * * +4 S * Reg Imm Add Write 0 *
beq 0 * +4 B * PC Imm Add Read 0 *
beq 1 * ALU B * PC Imm Add Read 0 *
bne 0 * ALU B * PC Imm Add Read 0 *
bne 1 * +4 B * PC Imm Add Read 0 *
blt * 1 ALU B 0 PC Imm Add Read 0 *
bltu * 1 ALU B 1 PC Imm Add Read 0 *
jalr * * ALU I * Reg Imm Add Read 1 PC+4
jal * * ALU J * PC Imm Add Read 1 PC+4
auipc * * +4 U * PC Imm Add Read 1 ALU
ControlRealizationOptions• ROM
−“Read-OnlyMemory”−Regularstructure−Canbeeasilyreprogrammed
§ fixerrors§ addinstructions
−Popularwhendesigningcontrollogicmanually• CombinatorialLogic
−Today,chipdesignersuselogicsynthesistoolstoconverttruthtablestonetworksofgates
CS61c Lecture12:Control&Performance 19
RV32I,anine-bitISA!
20
NotinCS61C
Instructiontypeencodedusingonly9bitsinst[30],inst[14:12],inst[6:2]
inst[30] inst[14:12] inst[6:2]
ROM-basedControl
CS61c Lecture12:Control&Performance 21
ROM
Inst[30,14:12,6:2] BrEq9
PCSel
ALUSel[3:0]4
11-bitaddress(inputs)
15databits(outputs)
BrLT
ImmSel[2:0]3BrUnASelBSel
MemRWRegWEnWBSel[1:0]2
ROMControllerImplementation
CS61c Lecture12:Control&Performance 22
Control Word foraddControl Word forsubControl Word foror
.
.
.
Address
Decode
r...
Inst[]BrEQBrLT
Controlleroutput(PCSel,ImmSel,…)
addsubor
jal
11
Administrivia• Homework2Duetomorrow11:59pm• Project1Part1DueMondayOct.9
− Part2dueMondayOct.16
• Midterm1RegradesduenextTuesday− TalktoaTAifyoudon’tunderstandamidtermquestionorareunsureofaregrade
CS61c Lecture12:Control&Performance 23
Break!
10/5/17 24
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 25
InstructionTiming
IF ID EX MEM WB Total
I-MEM Reg Read ALU D-MEM Reg W
200 ps 100ps 200 ps 200ps 100ps 800psCS61c Lecture12:Control&Performance 26
InstructionTiming
• Maximumclockfrequency− fmax =1/800ps=1.25GHz
• Mostblocksidlemostofthetime− E.g.fmax,ALU =1/200ps=5GHz!− HowcanwekeepALUbusyallthetime?− 5billionadds/sec,ratherthanjust1.25billion?− Idea:Factories usethreeemployeeshifts- equipmentisalwaysbusy!
Instr IF=200ps ID=100ps ALU=200ps MEM=200ps WB =100ps Totaladd X X X X 600psbeq X X X 500psjal X X X 500pslw X X X X X 800pssw X X X X 700ps
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 28
PerformanceMeasures• “Our”RISC-Vexecutesinstructionsat1.25GHz
−1instructionevery800ps
• Canweimproveitsperformance?−Whatdowemeanwiththisstatement?−Notsoobvious:
§ Quickerresponsetime,soonejobfinishesfaster?§ Morejobsperunittime(e.g.webserverreturningpages)?§ Longerbatterylife?
CS61c Lecture12:Control&Performance 29
TransportationAnalogySportsCar Bus
Passenger Capacity 2 50Travel Speed 200mph 50mphGasMileage 5mpg 2mpg
CS61c Lecture12:Control&Performance 30
SportsCar BusTravelTime 15min 60minTimefor100 passengers 750min 120minGallonsperpassenger 5gallons 0.5 gallons
50Miletrip:
ComputerAnalogyTransportation Computer
TripTime Program executiontime:e.g.timetoupdatedisplay
Timefor100passengers Throughput:e.g.number ofserverrequestshandledperhour
Gallonsper passenger Energypertask*:e.g.how manymoviesyoucanwatchperbatterychargeorenergybillfordatacenter
31
*Note: powerisnotagoodmeasure,sincelow-powerCPUmightrunforalongtimetocompleteonetaskconsumingmoreenergythanfastercomputerrunningathigherpowerforashortertime
“IronLaw”ofProcessorPerformance
32
Time =Instructions Cycles TimeProgramProgram*Instruction*Cycle
InstructionsperProgram
Determinedby• Task• Algorithm,e.g.O(N2) vsO(N)• Programminglanguage• Compiler• InstructionSetArchitecture(ISA)
CS61c Lecture12:Control&Performance 33
(Average)ClockcyclesperInstruction
Determinedby• ISA• Processorimplementation(ormicroarchitecture)• E.g.for“our”single-cycleRISC-Vdesign,CPI=1• Complexinstructions(e.g.strcpy),CPI>>1• Superscalarprocessors,CPI<1(nextlecture)
CS61c Lecture12:Control&Performance 34
TimeperCycle(1/Frequency)
Determinedby• Processormicroarchitecture(determinescriticalpath
throughlogicgates)• Technology(e.g.14nmversus28nm)• Powerbudget(lowervoltagesreducetransistorspeed)
CS61c Lecture12:Control&Performance 35
SpeedTradeoffExample• Forsometask(e.g.imagecompression)…
CS61c Lecture12:Control&Performance 36
ProcessorA Processor B#Instructions 1Million 1.5MillionAverageCPI 2.5 1Clockrate f 2.5 GHz 2GHzExecutiontime 1ms 0.75ms
ProcessorBisfasterforthistask,despiteexecutingmoreinstructionsandhavingalowerclockrate!
EnergyperTask
37
Energy =Instructions EnergyProgramProgram*Instruction
Energy αInstructions *CV2
ProgramProgram
“Capacitance” dependsontechnology,processorfeaturese.g.#ofcores
Supplyvoltage,e.g.1V
Wanttoreducecapacitanceandvoltagetoreduceenergy/task
EnergyTradeoffExample• “Next-generation”processor
− C(Moore’sLaw): -15%− Supplyvoltage,Vsup: -15%− Energyconsumption: 1- (1-0.85)3 =-39%
• Significantlyimprovedenergyefficiencythanksto−Moore’sLawAND− Reducedsupplyvoltage
CS61c Lecture12:Control&Performance 38
Energy“IronLaw”
39
Performance=Power*EnergyEfficiency(Tasks/Second)(Joules/Second)(Tasks/Joule)
• Energyefficiency(e.g.,instructions/Joule)iskeymetricinallcomputingdevices
• Forpower-constrainedsystems(e.g.,20MWdatacenter),needbetterenergyefficiencytogetmoreperformanceatsamepower
• Forenergy-constrainedsystems(e.g.,1Wphone),needbetterenergyefficiencytoprolongbatterylife
EndofScaling• Inrecentyears,industryhasnotbeenabletoreducesupplyvoltagemuch,asreducingitfurtherwouldmeanincreasing“leakagepower”wheretransistorswitchesdon’tfullyturnoff(morelikedimmerswitchthanon-offswitch)• Also,sizeoftransistorsandhencecapacitance,notshrinkingasmuchasbeforebetweentransistorgenerations• Powerbecomesagrowingconcern– the“powerwall”• Cost-effectiveair-cooledchiplimitaround~150W
CS61c Lecture12:Control&Performance 40
0
1
10
100
1000
10000
100000
1000000
10000000
1970 1975 1980 1985 1990 1995 2000 2005 2010
Transistors (Thousands)Frequency (MHz)Power (W)Cores
ProcessorTrends
41[Olukotun, Hammond,Sutter,Smith,Batten]
Break!
10/5/17 42
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 43
Pipelining• Afamiliarexample:
− Gettingauniversitydegree
• ShortageofComputerscientists(yourstartupisgrowing):− Howlongdoesittaketoeducate16,000?
CS61c Lecture12:Control&Performance 44
Year1 Year2 Year3 Year4
ComputerScientistEducation
CS61c Lecture12:Control&Performance 45
• Option1:serial
• Option2:pipelining
7years
16,000in16years,averagethroughputis1000/year
• 16,000in7years• Steadystatethroughputis4000/year• Resourcesusedefficiently• 4-foldimprovementoverserialeducation
4000graduate
4years
4000graduate
4years
4000graduate
4years
4000
4years
4000graduate4000graduate
4000graduate4000graduate
4000enter
1year
LatencyversusThroughput• Latency
− Timefromenteringcollegetograduation− Serial 4years− Pipelining 4years
• Throughput− Averagenumberofstudentsgraduatingeachyear− Serial 1000− Pipelining 4000
• Pipelining− Increasesthroughput(4xinthisexample)− Butdoesnothingtolatency
§ sometimesworse(additionaloverheade.g.forshifttransition)CS61c Lecture12:Control&Performance 46
SimultaneousversusSequential
CS61c Lecture12:Control&Performance 47
• Whathappenssequentially?• Whathappenssimultaneously?
4years
4000graduate4000graduate
4000graduate4000graduate
4000graduate4000graduate
4000graduate4000graduate
4000graduate4000graduate
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 48
PipeliningwithRISC-V
CS61c 49
Phase Pictogram tstep SerialInstruction Fetch 200 psReg Read 100psALU 200psMemory 200psRegisterWrite 100pstinstruction 800ps
addt0,t1,t2
ort3,t4,t5
sll t6,t0,t3tcycle
instructionsequence
tinstruction
tcycle Pipelined200 ps200ps200ps200ps200ps1000ps
PipeliningwithRISC-V
CS61c 50
addt0,t1,t2
ort3,t4,t5
sll t6,t0,t3tcycle
instructionsequence
tinstruction
Single Cycle PipeliningTiming tstep =100…200ps tcycle =200ps
Register accessonly100ps AllcyclessamelengthInstruction time,tinstruction =tcycle =800ps 1000psClockrate,fs 1/800ps =1.25GHz 1/200 ps =5GHzRelativespeed 1x 4x
SequentialvsSimultaneous
addt0,t1,t2
ort3,t4,t5
sll t6,t0,t3
tcycle=200ps
instructionsequence
tinstruction =1000ps
sw t0,4(t3)
lw t0,8(t3)
addi t2,t2,1
Whathappenssequentially,whathappenssimultaneously?
Agenda• FinishSingle-CycleRISC-VDatapath• Controller• InstructionTiming• PerformanceMeasures• IntroductiontoPipelining• PipelinedRISC-V Datapath• AndinConclusion,...CS61c Lecture12:Control&Performance 52
AndinConclusion,…• Controller
− Tellsuniversaldatapathhowtoexecuteeachinstruction• Instructiontiming
− Setbyinstructioncomplexity,architecture,technology− Pipeliningincreasesclockfrequency,“instructionspersecond”
§ Butdoesnotreducetimetocompleteinstruction
• Performancemeasures− Differentmeasuresdependingonobjective
§ Responsetime§ Jobs/second§ Energypertask
CS61c Lecture12:Control&Performance 53