Single Cycle Datapath
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
(2)
Reading
• Section 4.1-4.4
• Appendices B.7, B.8, B.11, D.2
• Practice Problems: 1, 4, 6, 9
(3)
Introduction
• We will examine two MIPS implementations
  o A simplified version (this module)
  o A more realistic pipelined version
• Simple subset, shows most aspects
  o Memory reference: lw, sw
  o Arithmetic/logical: add, sub, and, or, slt
  o Control transfer: beq, j
(4)
Instruction Execution
• PC → instruction memory, fetch instruction
• Register numbers → register file, read registers
• Depending on instruction class
  1. Use ALU to calculate
     o Arithmetic result
     o Memory address for load/store
     o Branch target address
  2. Access data memory for load/store
  3. PC ← target address or PC + 4
[Figure: An Encoded Program. Memory words 8d0b0000, 014b5020, 21080004, 2129ffff, 1520fffc, 000a082a, ..., selected by Address]
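The encoded words in the figure can be split into their MIPS fields with a few shifts and masks. The following is a minimal Python sketch, not part of the original slides; the function name `decode` and the dictionary layout are illustrative choices, and every field is extracted regardless of instruction format, so the caller picks the fields that apply.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its bit fields."""
    return {
        "op": (word >> 26) & 0x3F,     # bits 31:26
        "rs": (word >> 21) & 0x1F,     # bits 25:21
        "rt": (word >> 16) & 0x1F,     # bits 20:16
        "rd": (word >> 11) & 0x1F,     # bits 15:11 (R-format only)
        "shamt": (word >> 6) & 0x1F,   # bits 10:6  (R-format only)
        "funct": word & 0x3F,          # bits 5:0   (R-format only)
        "imm": word & 0xFFFF,          # bits 15:0  (I-format only)
    }

# The words shown in the "An Encoded Program" figure.
program = [0x8D0B0000, 0x014B5020, 0x21080004,
           0x2129FFFF, 0x1520FFFC, 0x000A082A]

for word in program:
    f = decode(word)
    print(f"{word:08x}: op={f['op']} rs={f['rs']} rt={f['rt']} "
          f"rd={f['rd']} funct={f['funct']} imm={f['imm']:#06x}")
```

For example, 8d0b0000 decodes to opcode 35 (lw) with rs = 8 and rt = 11, and 014b5020 decodes to opcode 0 with funct 32 (add).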
(5)
Basic Ingredients
• Include the functional units we need for each instruction – combinational and sequential
[Figure: datapath building blocks
  o Instruction memory (input: Instruction address; output: Instruction)
  o Program counter (PC)
  o Adder (output: Sum)
  o Sign-extension unit (16 bits in, 32 bits out)
  o Data memory unit (inputs: Address, Write data, MemRead, MemWrite; output: Read data)
  o Registers (inputs: Read register 1, Read register 2, Write register, each 5 bits, plus Write data and RegWrite; outputs: Read data 1, Read data 2)
  o ALU (inputs: two data operands, ALU control; outputs: ALU result, Zero)]
(6)
Sequential Elements (4.2, B.7, B.11)
• Register: stores data in a circuit
  o Uses a clock signal to determine when to update the stored value
  o Edge-triggered: update when Clk changes from 0 to 1
[Figure: D flip-flop built from two D latches (inputs D and Clk, outputs Q and _Q), with a timing diagram of Clk, D and Q marking the falling edge and the rising edge]
(7)
Sequential Elements
• Register with write control
  o Only updates on clock edge when write control input is 1
  o Used when stored value is required later
[Figure: register with a Write input (D, Clk, Write, Q), built from two D latches, with a timing diagram of Clk, Write, D and Q over one cycle time]
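The behavior of a register with write control can be sketched in a few lines of Python. This is a behavioral model only, assuming a rising-edge-triggered register; the class name `Register` and the `tick` method are hypothetical names introduced for illustration.

```python
class Register:
    """Rising-edge-triggered register with a write-enable input."""

    def __init__(self, value=0):
        self.q = value          # stored value (output Q)
        self._prev_clk = 0      # previous clock level, used to detect edges

    def tick(self, clk, d, write=1):
        """Present clock level clk, data d and write enable; return Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge and write:
            self.q = d          # capture D only on a 0 -> 1 edge with Write = 1
        self._prev_clk = clk
        return self.q


reg = Register()
reg.tick(0, 5)            # clock low: nothing stored
reg.tick(1, 5)            # rising edge, Write = 1: stores 5
reg.tick(0, 9)
reg.tick(1, 9, write=0)   # rising edge, Write = 0: still holds 5
print(reg.q)
```

Holding the clock high while D changes does not alter the stored value, which is the difference between this edge-triggered model and a level-sensitive latch.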
(8)
Clocking Methodology
• Combinational logic transforms data during clock cycles
  o Between clock edges
  o Input from state elements, output to state element
  o Longest delay determines clock period
• Synchronous vs. asynchronous operation
Recall: Critical Path Delay
(9)
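Register File (B.8)
• Built using D flip-flops (remember ECE 2030!)
[Figure: register file read ports. Read register number 1 and Read register number 2 each control a mux that selects among Register 0, Register 1, ..., Register n – 1, Register n to produce Read data 1 and Read data 2. Block symbol: Register file with inputs Read register number 1, Read register number 2, Write register, Write data and Write, and outputs Read data 1 and Read data 2]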
(10)
Register File
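• Note: we still use the real clock to determine when to write
[Figure: register file write port. Register number feeds a decoder with outputs 0, 1, ..., n – 1, n; each decoder output is combined with Write to drive the C input of Register 0, Register 1, ..., Register n – 1, Register n; Register data drives every D input]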
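The two read ports and the single write port can be modeled behaviorally as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, reads are treated as combinational, the write happens when the caller signals a clock edge, and register 0 is kept at zero, which is a MIPS convention rather than something shown on these slides.

```python
class RegisterFile:
    """Behavioral model: two read ports, one clocked write port."""

    def __init__(self, n=32):
        self.regs = [0] * n

    def read(self, read_reg1, read_reg2):
        # The two muxes in the read-port figure: select one register each.
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_edge(self, write_reg, write_data, reg_write):
        # The decoder selects one register; it is written only if RegWrite = 1.
        # Assumption: register 0 stays zero (MIPS $zero convention).
        if reg_write and write_reg != 0:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(write_reg=8, write_data=42, reg_write=1)
rf.clock_edge(write_reg=9, write_data=7, reg_write=0)   # ignored: RegWrite = 0
print(rf.read(8, 9))
```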
(11)
Building a Datapath (4.3)
• Datapath
  o Elements that process data and addresses in the CPU
  o Registers, ALUs, muxes, memories, …
• We will build a MIPS datapath incrementally
  o Refining the overview design
(12)
High Level Description
• Single instruction single data stream model of execution (remember Flynn’s Taxonomy)
  o Serial execution model
• Commonly known as the von Neumann execution model
  o Stored program model
  o Instructions and data share memory
[Figure: execution loop of Fetch Instructions, Execute Instructions and Memory Operations under Control; Flynn's taxonomy grid of Instruction Streams vs. Data Streams with quadrants SISD, SIMD, MISD, MIMD]
(13)
Instruction Fetch
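[Figure: instruction fetch. The PC (a 32-bit register) addresses instruction memory and an adder increments it by 4 for the next instruction. Timing diagram of clk: instruction fetch starts at one clock edge and completes one cycle time later]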
(14)
R-Format Instructions
• Read two register operands
• Perform arithmetic/logical operation
• Write register result
op rs rt rd shamt funct
(15)
Executing R-Format Instructions
[Figure: register file (Read register 1, Read register 2, Write register, each 5 bits; Write data; RegWrite; outputs Read data 1 and Read data 2) feeding the ALU (ALU control; outputs ALU result and Zero)]
op rs rt rd shamt funct
(16)
Load/Store Instructions
• Read register operands
• Calculate address using 16-bit offset
  o Use ALU, but sign-extend offset
• Load: Read memory and update register
• Store: Write register value to memory
op rs rt 16-bit constant
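The address calculation for lw and sw can be sketched in Python. This is an illustrative model, not the slides' implementation; `sign_extend16` and `effective_address` are hypothetical helper names, and addresses are assumed to wrap at 32 bits.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, imm):
    """Address used by lw/sw: base register plus sign-extended 16-bit offset."""
    return (base + sign_extend16(imm)) & 0xFFFFFFFF


print(sign_extend16(0xFFFC))                   # -4
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc
```

A negative offset such as 0xFFFC therefore moves the address backward by 4 bytes instead of adding 65532.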
(17)
Executing I-Format Instructions
[Figure: register file (Read register 1, Read register 2, Write register, RegWrite), sign-extension unit (16 bits in, 32 bits out) and data memory (Address, Write data, Read data, MemRead, MemWrite)]
op rs rt 16-bit constant
(18)
Branch Instructions
• Read register operands
• Compare operands
  o Use ALU, subtract and check Zero output
• Calculate target address
  o Sign-extend displacement
  o Shift left 2 places (word displacement)
  o Add to PC + 4 (already calculated by instruction fetch)
op rs rt 16-bit constant
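The three target-address steps (sign-extend, shift left 2, add to PC + 4) and the Zero-based selection can be sketched as follows. The function names are hypothetical and the PC is assumed to wrap at 32 bits.

```python
def sign_extend16(imm):
    """Interpret the low 16 bits of imm as a two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm):
    """PC + 4 plus the sign-extended word displacement shifted left by 2."""
    return (pc + 4 + (sign_extend16(imm) << 2)) & 0xFFFFFFFF


def next_pc(pc, imm, branch, zero):
    """Select the branch target only when Branch and the ALU Zero output are both 1."""
    if branch and zero:
        return branch_target(pc, imm)
    return (pc + 4) & 0xFFFFFFFF


# A branch at address 0x10 with displacement 0xFFFC (-4 words) targets 0x4.
print(hex(branch_target(0x10, 0xFFFC)))
```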
(19)
Branch Instructions
[Figure: branch datapath. Annotations: the shift left 2 just re-routes wires; sign extension replicates the sign-bit wire]
op rs rt 16-bit constant
(20)
Updating the Program Counter
[Figure: PC update logic. The PC addresses instruction memory (Read address, Instruction [31–0]); one adder computes PC + 4; a second adder computes the branch address from PC + 4 and the sign-extended, shifted Instruction [15–0]; a mux controlled by Branch selects the next PC]
Computation of the branch address
loop: beq  $t0, $0, exit
      addi $t0, $t0, -1
      lw   $a0, arg1($t1)
      lw   $a1, arg2($t2)
      jal  func
      add  $t3, $t3, $v0
      addi $t1, $t1, 4
      addi $t2, $t2, 4
      j    loop
(21)
Composing the Elements
• First-cut data path does an instruction in one clock cycle
  o Each datapath element can only do one function at a time
  o Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
[Figure: An Encoded Program. Memory words 014b5020, 21080004, 2129ffff, 1520fffc, 000a082a, ..., with the PC supplying the Address]
(22)
Full Single Cycle Datapath
Destination register is “instruction-specific”
lw $t0, 0($t4) vs. add $t0, $t1, $t2
(23)
The Main Control Unit
• Control signals derived from instruction
R-type      0         rs     rt     rd     shamt  funct
            31:26     25:21  20:16  15:11  10:6   5:0
Load/Store  35 or 43  rs     rt     address
            31:26     25:21  20:16  15:0
Branch      4         rs     rt     address
            31:26     25:21  20:16  15:0
Annotations:
  o 31:26: opcode
  o rs (25:21): always read
  o rt (20:16): read, except for load
  o rd (15:11) for R-type, rt (20:16) for load: write
  o address (15:0): sign-extend and add
(24)
ALU Control (4.4, D.2)
• ALU used for
  o Load/Store: Function = add
  o Branch: Function = subtract
  o R-type: Function depends on funct field
ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         set-on-less-than
1100         NOR
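The function table above can be turned into a small behavioral ALU model. This is a sketch, assuming 32-bit operands and a signed comparison for set-on-less-than; the names `alu` and `to_signed` are introduced here for illustration.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control codes in the table."""
    a &= MASK32
    b &= MASK32
    if control == 0b0000:
        result = a & b                                      # AND
    elif control == 0b0001:
        result = a | b                                      # OR
    elif control == 0b0010:
        result = (a + b) & MASK32                           # add
    elif control == 0b0110:
        result = (a - b) & MASK32                           # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0    # set-on-less-than
    elif control == 0b1100:
        result = ~(a | b) & MASK32                          # NOR
    else:
        raise ValueError(f"unsupported ALU control {control:04b}")
    return result, int(result == 0)


print(alu(0b0110, 5, 5))   # subtract equal operands: Zero is set
```

The Zero output is what beq checks after a subtract.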
(25)
ALU Control
• Assume 2-bit ALUOp derived from opcode
  o Combinational logic derives ALU control
opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
• How do we turn this description into gates?
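Before reducing the table to gates, it can help to state it as a lookup. The following Python sketch encodes the table directly; `alu_control` and `FUNCT_TO_CONTROL` are hypothetical names, and funct values outside the table are rejected rather than guessed.

```python
# funct field -> 4-bit ALU control, for R-type instructions (ALUOp = 10)
FUNCT_TO_CONTROL = {
    0b100000: 0b0010,  # add
    0b100010: 0b0110,  # subtract
    0b100100: 0b0000,  # AND
    0b100101: 0b0001,  # OR
    0b101010: 0b0111,  # set-on-less-than
}


def alu_control(alu_op, funct):
    """Derive the ALU control code from the 2-bit ALUOp and the funct field."""
    if alu_op == 0b00:          # lw / sw: always add
        return 0b0010
    if alu_op == 0b01:          # beq: always subtract
        return 0b0110
    if alu_op == 0b10:          # R-type: depends on funct
        return FUNCT_TO_CONTROL[funct]
    raise ValueError(f"unused ALUOp {alu_op:02b}")


print(f"{alu_control(0b10, 0b101010):04b}")   # slt
```

Because funct is a don't-care for ALUOp 00 and 01, those rows ignore it entirely.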
(26)
ALU Controller
ALUOp1 ALUOp0  F5 F4 F3 F2 F1 F0  ALUControl  Instruction
0      0       X  X  X  X  X  X   010         lw/sw: add
X      1       X  X  X  X  X  X   110         beq: sub
1      X       X  X  0  0  0  0   010         arith: add
1      X       X  X  0  0  1  0   110         arith: sub
1      X       X  X  0  1  0  0   000         arith: and
1      X       X  X  0  1  0  1   001         arith: or
1      X       X  X  1  0  1  0   111         arith: slt
[Figure: the ALU control block takes ALUOp (generated from decoding inst[31:26]) and funct = inst[5:0], and drives the 3-bit ALU control input of the ALU (outputs ALU result and Zero)]
(27)
ALU Control
• Simple combinational logic (truth tables)
[Figure: ALU control block. Inputs: ALUOp (ALUOp1, ALUOp0) and F (5–0), of which F3, F2, F1 and F0 are used. Outputs: Operation2, Operation1, Operation0]
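One set of gate equations consistent with the 3-bit truth table is Operation2 = ALUOp0 OR (ALUOp1 AND F1), Operation1 = NOT ALUOp1 OR NOT F2, and Operation0 = ALUOp1 AND (F3 OR F0). These equations are an assumption here (a common minimization, not read off the figure), and they treat ALUOp = 11, which no instruction in this subset generates, as a don't-care. The Python sketch below checks them against the truth table rows.

```python
from itertools import product


def alu_control_gates(alu_op1, alu_op0, f3, f2, f1, f0):
    """3-bit ALU control from single-bit inputs (each 0 or 1)."""
    operation2 = alu_op0 | (alu_op1 & f1)
    operation1 = (1 - alu_op1) | (1 - f2)     # NOT ALUOp1 OR NOT F2
    operation0 = alu_op1 & (f3 | f0)
    return operation2, operation1, operation0


def check_truth_table():
    """Compare the gate equations with every row of the truth table."""
    # lw/sw (ALUOp = 00) and beq (ALUOp = 01): funct bits are don't-cares.
    for f3, f2, f1, f0 in product((0, 1), repeat=4):
        if alu_control_gates(0, 0, f3, f2, f1, f0) != (0, 1, 0):
            return False
        if alu_control_gates(0, 1, f3, f2, f1, f0) != (1, 1, 0):
            return False
    # R-type (ALUOp = 10): (F3, F2, F1, F0) -> expected control bits.
    r_type = {
        (0, 0, 0, 0): (0, 1, 0),  # add
        (0, 0, 1, 0): (1, 1, 0),  # sub
        (0, 1, 0, 0): (0, 0, 0),  # and
        (0, 1, 0, 1): (0, 0, 1),  # or
        (1, 0, 1, 0): (1, 1, 1),  # slt
    }
    return all(alu_control_gates(1, 0, *f) == out for f, out in r_type.items())


print(check_truth_table())
```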
(28)
Datapath With Control
Use rt not rd
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
(29)
Commodity Processors
ARM 7
Single Cycle Datapath
(30)
Control Unit Signals
[Figure: control unit logic. Inputs: Op5, Op4, Op3, Op2, Op1, Op0 from Inst[31:26], decoded into R-format, lw, sw and beq. Outputs: RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0, which go to harness the datapath]
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
(31)
Controller Implementation
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE IEEE.STD_LOGIC_ARITH.ALL;
USE IEEE.STD_LOGIC_SIGNED.ALL;

ENTITY control IS
  PORT(
    SIGNAL Opcode       : IN  STD_LOGIC_VECTOR( 5 DOWNTO 0 );
    SIGNAL RegDst       : OUT STD_LOGIC;
    SIGNAL ALUSrc       : OUT STD_LOGIC;
    SIGNAL MemtoReg     : OUT STD_LOGIC;
    SIGNAL RegWrite     : OUT STD_LOGIC;
    SIGNAL MemRead      : OUT STD_LOGIC;
    SIGNAL MemWrite     : OUT STD_LOGIC;
    SIGNAL Branch       : OUT STD_LOGIC;
    SIGNAL ALUop        : OUT STD_LOGIC_VECTOR( 1 DOWNTO 0 );
    SIGNAL clock, reset : IN  STD_LOGIC );
END control;
(32)
Controller Implementation (cont.)
ARCHITECTURE behavior OF control IS
  SIGNAL R_format, Lw, Sw, Beq : STD_LOGIC;
BEGIN
  -- Code to generate control signals using opcode bits
  R_format   <= '1' WHEN Opcode = "000000" ELSE '0';
  Lw         <= '1' WHEN Opcode = "100011" ELSE '0';
  Sw         <= '1' WHEN Opcode = "101011" ELSE '0';
  Beq        <= '1' WHEN Opcode = "000100" ELSE '0';
  RegDst     <= R_format;
  ALUSrc     <= Lw OR Sw;
  MemtoReg   <= Lw;
  RegWrite   <= R_format OR Lw;
  MemRead    <= Lw;
  MemWrite   <= Sw;
  Branch     <= Beq;
  ALUOp( 1 ) <= R_format;
  ALUOp( 0 ) <= Beq;
END behavior;
Implementation of each table column
Instruction  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0
R-format     1       0       0         1         0        0         0       1       0
lw           0       1       1         1         1        0         0       0       0
sw           X       1       X         0         0        1         0       0       0
beq          X       0       X         0         0        0         1       0       1
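The correspondence between the controller equations and the table columns can be checked mechanically. The Python sketch below mirrors the VHDL assignments one for one and compares the result with each table row, treating X as a don't-care; the names `control`, `TABLE` and `matches_table` are hypothetical.

```python
def control(opcode):
    """Control signals from the 6-bit opcode, mirroring the VHDL equations."""
    r_format = opcode == 0b000000
    lw = opcode == 0b100011
    sw = opcode == 0b101011
    beq = opcode == 0b000100
    return {
        "RegDst": int(r_format),
        "ALUSrc": int(lw or sw),
        "MemtoReg": int(lw),
        "RegWrite": int(r_format or lw),
        "MemRead": int(lw),
        "MemWrite": int(sw),
        "Branch": int(beq),
        "ALUOp1": int(r_format),
        "ALUOp0": int(beq),
    }


# Table rows in column order RegDst ... ALUOp0; "X" marks a don't-care.
TABLE = {
    0b000000: "100100010",  # R-format
    0b100011: "011110000",  # lw
    0b101011: "X1X001000",  # sw
    0b000100: "X0X000101",  # beq
}


def matches_table(opcode):
    """True if the equations agree with the table row on every specified bit."""
    bits = "".join(str(v) for v in control(opcode).values())
    return all(t in ("X", b) for t, b in zip(TABLE[opcode], bits))


print(all(matches_table(op) for op in TABLE))
```

Where the table shows X, the equations simply produce 0, which is one valid way to resolve a don't-care.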
(33)
R-Type Instruction
(34)
Load Instruction
(35)
Branch-on-Equal Instruction
(36)
Implementing Jumps
• Jump uses word address
• Update PC with concatenation of
  o Top 4 bits of old PC
  o 26-bit jump address
  o 00
• Need an extra control signal decoded from opcode
Jump  2      address
      31:26  25:0
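The concatenation can be written as a mask, a shift and an OR. This Python sketch assumes that "old PC" means the already-incremented PC + 4, as in the standard MIPS definition; `jump_target` is a hypothetical name.

```python
def jump_target(pc, instruction):
    """Build the jump address from PC + 4 and the 26-bit address field."""
    top4 = (pc + 4) & 0xF0000000             # top 4 bits of the incremented PC
    word_address = instruction & 0x03FFFFFF  # bits 25:0 of the instruction
    return top4 | (word_address << 2)        # append 00 by shifting left 2


# j with address field 0x0100000, fetched from 0x00400010
print(hex(jump_target(0x00400010, 0x08100000)))   # 0x400000
```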
(37)
Datapath With Jumps Added
(38)
Energy Behavior
combinational activity
storage read/write access
(39)
Recall Hierarchy of Energy Models
[Figure: hierarchy of energy models, from a CMOS inverter (Vin, Vout, Vdd, PMOS, NMOS, Ground), to logic gates, to flip-flops built from D latches, to the ALU]
Switch level activity (dynamic) and leakage (static) energy costs
Aggregate energy expenditure into gate level estimates
Aggregate energy expenditure into higher level modules
(40)
A Simple Architecture Energy Model
• To a first order, we can use the per-access energy of each major component
  o Obtain this for a technology generation
• Use this per-access energy to compute the energy of each instruction
• Note: This is a high level approximation. The actual physics is more complicated. However, this is useful for several purposes
• What components does each instruction exercise?
(41)
Example: Updating the PC
[Figure: full single-cycle datapath with control signals (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp): PC, instruction memory, PC + 4 adder, register file, sign extend, shift left 2, branch adder, ALU with ALU control, data memory and muxes]
What is the energy cost of this operation?
(42)
Example: Register Instructions
[Figure: full single-cycle datapath with control signals (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp): PC, instruction memory, PC + 4 adder, register file, sign extend, shift left 2, branch adder, ALU with ALU control, data memory and muxes]
What is the energy cost of this operation?
(43)
Example: I-type Instructions
[Figure: full single-cycle datapath with control signals (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp): PC, instruction memory, PC + 4 adder, register file, sign extend, shift left 2, branch adder, ALU with ALU control, data memory and muxes]
What is the energy cost of this operation?
(44)
Example: I-Type for Branches
[Figure: full single-cycle datapath with control signals (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp): PC, instruction memory, PC + 4 adder, register file, sign extend, shift left 2, branch adder, ALU with ALU control, data memory and muxes]
What is the energy cost of this operation?
(45)
Converting Energy to Power
• For this data path, except for data memory, all components are active every cycle, and dissipating energy on every cycle
  o Later we will see how data paths can be made more energy efficient
• Computing power
  o Compute the total energy consumed over all cycles (instructions)
  o Divide energy by time to get power in watts
Example:
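The slide's own worked example did not survive in this transcript, so the following Python sketch uses made-up numbers purely to illustrate the energy-to-power conversion. The figures of 150 pJ per instruction, one million instructions and a 1 GHz clock are assumptions, as is the function name `average_power_watts`.

```python
def average_power_watts(energy_per_instruction_j, instruction_count,
                        clock_hz, cycles_per_instruction=1):
    """Total energy divided by total execution time."""
    total_energy_j = energy_per_instruction_j * instruction_count
    total_time_s = instruction_count * cycles_per_instruction / clock_hz
    return total_energy_j / total_time_s


# Hypothetical inputs: 150 pJ per instruction, 1e6 instructions, 1 GHz clock.
power = average_power_watts(150e-12, 1_000_000, 1e9)
print(f"{power:.3f} W")
```

In a single-cycle datapath every instruction takes one cycle, so the result reduces to energy per instruction times clock frequency, about 0.15 W for these inputs.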
(46)
Example: A Simple Energy Model
• We can use a simple model of per-access energy for the architecture components
Common Components                 Access           Energy (10^-12 joules)
Inst. Decode                      Logic Switching  16.78
Inst. Registers                   Read             2.74
                                  Write            4.38
FP. Registers                     Read             1.26
                                  Write            1.98
Other Buffers                     Read             9.74
                                  Write            11.18
ALU + Result Bus (interconnect)   Logic Switching  123.2
FPU + Result Bus (interconnect)   Logic Switching  241.02
• Each unit can be accessed multiple times depending on instruction type
• An Intel/AMD x86 instruction consumes 600 pJ ~ 4 nJ dynamic energy.
@16nm
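Using the per-access energies from the table, the energy of one instruction is the sum over the components it exercises. The Python sketch below does this for an R-type instruction. The access counts (one decode, two register reads, one ALU operation, one register write) are an assumption about which components an R-type instruction touches, and the table gives no memory energies, so instruction fetch and data memory are left out.

```python
# Per-access energies in picojoules, copied from the table.
ENERGY_PJ = {
    "decode": 16.78,     # Inst. Decode, logic switching
    "reg_read": 2.74,    # Inst. Registers, read
    "reg_write": 4.38,   # Inst. Registers, write
    "alu": 123.2,        # ALU + result bus, logic switching
}

# Assumed accesses for one R-type instruction (for example, add).
R_TYPE_ACCESSES = {"decode": 1, "reg_read": 2, "alu": 1, "reg_write": 1}


def instruction_energy_pj(accesses):
    """Sum per-access energy times access count over all components."""
    return sum(ENERGY_PJ[unit] * count for unit, count in accesses.items())


print(f"{instruction_energy_pj(R_TYPE_ACCESSES):.2f} pJ")
```

Under these assumptions the ALU and its result bus dominate the total of roughly 150 pJ.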
(47)
ITRS Roadmap for Logic Devices
From: “ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems,” P. Kogge, et al., 2008
(48)
Our Simple Control Structure
• All of the logic is combinational
• We wait for everything to settle down, and the right thing to be done
  o ALU might not produce “right answer” right away
  o We use write signals along with clock to determine when to write
• Cycle time determined by length of the longest path
[Figure: State element 1 → Combinational logic → State element 2, spanning one clock cycle]
We are ignoring some details like setup and hold times
(49)
Performance Issues
• Longest delay determines clock period
  o Critical path: load instruction
  o Instruction memory → register file → ALU → data memory → register file
• Not feasible to vary period for different instructions
• Violates design principle
  o Making the common case fast
• We will improve performance by pipelining
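The clock period follows from summing component delays along each instruction's path and taking the maximum. The Python sketch below illustrates this with assumed delays (200 ps for each memory and the ALU, 100 ps for register read and write); these numbers and the names `DELAY_PS`, `PATHS`, `path_delay` and `cycle_time` are illustrative, not taken from the slides.

```python
# Assumed component delays in picoseconds (illustrative values only).
DELAY_PS = {
    "instr_mem": 200,
    "reg_read": 100,
    "alu": 200,
    "data_mem": 200,
    "reg_write": 100,
}

# Components on the path of each instruction class.
PATHS = {
    "R-type": ["instr_mem", "reg_read", "alu", "reg_write"],
    "lw": ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "sw": ["instr_mem", "reg_read", "alu", "data_mem"],
    "beq": ["instr_mem", "reg_read", "alu"],
}


def path_delay(instruction):
    """Total delay along the path exercised by one instruction class."""
    return sum(DELAY_PS[unit] for unit in PATHS[instruction])


def cycle_time():
    """Single-cycle clock period: the longest path over all instructions."""
    return max(path_delay(instruction) for instruction in PATHS)


for name in PATHS:
    print(name, path_delay(name), "ps")
print("cycle time:", cycle_time(), "ps")
```

With these delays the load path sets the period at 800 ps, so a beq that needs only 500 ps still occupies the full cycle.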
(50)
Summary
• Single cycle datapath
  o All instructions execute in one clock cycle
  o Not all instructions take the same amount of time
  o Software sees a simple interface
  o Can memory operations really take one cycle?
• Improve performance via pipelining, multi-cycle operation, parallelism or customization
• We will address these next
(51)
Study Guide
• Given an instruction, be able to specify the values of all control signals required to execute that instruction
• Add new instructions: modify the datapath and control to effect their execution
  o E.g., jal, jr, shift, etc.
  o Modify the VHDL controller
• Given delays of various components, determine the cycle time of the datapath
• Distinguish between those parts of the datapath that are unique to each instruction and those components that are shared across all instructions
(52)
Study Guide (cont.)
• Given a set of control signal values, determine what operation the datapath performs
• Given the per-access energies of each component:
  o Compute the energy required for any instruction
  o Given a program and clock rate, compute the power dissipation of the datapath
(53)
Glossary
• Asynchronous
• Clock
• Controller
• Critical path
• Flip Flop
• ITRS Roadmap
• Per-access energy
• Program counter
• Register
• Synchronous