Architecture (I) Processor Architecture. – 2 – Processor Goal Understand basic computer...
-
Upload
jonah-williams -
Category
Documents
-
view
241 -
download
1
Transcript of Architecture (I) Processor Architecture. – 2 – Processor Goal Understand basic computer...
Architecture (I)
Processor ArchitectureProcessor ArchitectureProcessor ArchitectureProcessor Architecture
– 2 – Processor
GoalGoal
Understand basic computer organization Understand basic computer organization Instruction set architecture
Deeply explore the CPU working mechanismDeeply explore the CPU working mechanism How the instruction is executed: sequential and pipeline
version
Help you programmingHelp you programming Fully understand how computer is organized and works will
help you write more stable and efficient code.
– 3 – Processor
CPU Design (Why?)CPU Design (Why?)
It is interesting.It is interesting.
Aid in understanding how the overall computer system Aid in understanding how the overall computer system works.works.
Many design hardware systems containing processors.Many design hardware systems containing processors.
Maybe you will work on a processor design.Maybe you will work on a processor design.
– 4 – Processor
CPU DesignCPU Design
Instruction set architectureInstruction set architecture
Logic designLogic design
Sequential implementationSequential implementation
Pipelining and initial pipelined implementationPipelining and initial pipelined implementation
Making the pipeline workMaking the pipeline work
Modern processor designModern processor design
– 5 – Processor
Suggested ReadingSuggested Reading
- - Chap 4.1, 4.2Chap 4.1, 4.2
– 6 – Processor
Instruction Set Architecture #1Instruction Set Architecture #1
What is it ?What is it ? Assemble Language Abstraction Machine Language Abstraction
What does it provide?What does it provide? An abstraction of the real computer, hide the details of
implementationThe syntax of computer instructionsThe semantics of instructionsThe execution modelProgrammer-visible computer status
Instruction Set Architecture (ISA)
– 7 – Processor
Instruction Set Architecture #2Instruction Set Architecture #2Assembly Language ViewAssembly Language View
Processor stateRegisters, memory, …
Instructionsaddl, movl, leal, …How instructions are encoded as
bytes
Layer of AbstractionLayer of Abstraction Above: how to program machine
Processor executes instructions in a sequence
Below: what needs to be builtUse tricks to make it run fastE.g., execute multiple instructions
simultaneously
ISA
Compiler OS
CPUDesign
CircuitDesign
ChipLayout
ApplicationProgram
– 8 – Processor
Instruction Set Architecture #3Instruction Set Architecture #3
ISA define the processor familyISA define the processor family Two main kind: RISC and CISC
RISC:SPARC, MIPS, PowerPCCISC:X86 (or called IA32)
Another divide: Superscalar, VLIW and EPICSuperscalar: all the aboveVLIW: Philips TriMediaEPIC: IA64
Under same ISA, there are many different processorsUnder same ISA, there are many different processors From different manufacturers:
X86 from Intel and AMD and VIA
Different models8086, 80386, Pentium, Pentium 4
– 9 – Processor
%eax%ecx%edx%ebx
%esi%edi%esp%ebp
Y86 Processor StateY86 Processor State
Program RegistersSame 8 as with IA32. Each 32 bits
Condition CodesSingle-bit flags set by arithmetic or logical instructions
» OF: Overflow ZF: Zero SF:Negative
Program Counter Indicates address of instruction
MemoryByte-addressable storage arrayWords stored in little-endian byte order
Program registers Condition
codes
PC
Memory
OF ZF SF
P258 Figure 4.1P258 Figure 4.1P258 Figure 4.1P258 Figure 4.1
– 10 – Processor
Y86 InstructionsY86 Instructions
FormatFormat (P259) (P259) 1--6 bytes of information read from memory
Can determine instruction length from first byteNot as many instruction types, and simpler encoding than with
IA32
Each accesses and modifies some part(s) of the program state
Errata: JXX and call are 5 bytes long.Errata: JXX and call are 5 bytes long.
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
– 11 – Processor
Encoding RegistersEncoding RegistersEach register has 4-bit IDEach register has 4-bit ID
Same encoding as in IA32, but IA32 using only 3-bit ID
Register ID 8 indicates “no register”Register ID 8 indicates “no register” Will use this in our hardware design in multiple places
%eax%ecx%edx%ebx
%esi%edi%esp%ebp
0123
6745
P261 Figure 4.4P261 Figure 4.4P261 Figure 4.4P261 Figure 4.4
– 12 – Processor
Instruction ExampleInstruction ExampleAddition InstructionAddition Instruction
Add value in register rA to that in register rBStore result in register rBNote that Y86 only allows addition to be applied to register data
Set condition codes based on result e.g., addl %eax,%esiEncoding: 60 06 Two-byte encoding
First indicates instruction typeSecond gives source and destination registers
addl rA, rB 6 0 rA rB
Encoded Representation
Generic Form
P261 Figure 4.3P261 Figure 4.3P261 Figure 4.3P261 Figure 4.3
– 13 – Processor
Arithmetic and Logical OperationsArithmetic and Logical Operations Refer to generically as
“OPl” Encodings differ only by
“function code”Low-order 4 bytes in first
instruction word
Set condition codes as side effect
Notice: no multiply or divide operation
addl rA, rB 6 0 rA rB
subl rA, rB 6 1 rA rB
andl rA, rB 6 2 rA rB
xorl rA, rB 6 3 rA rB
Add
Subtract (rA from rB)
And
Exclusive-Or
Instruction Code Function Code
– 14 – Processor
Move OperationsMove Operations
Like the IA32 movl instruction Simpler format for memory addresses Give different names to keep them distinct
rrmovl rA, rB 2 0 rA rB Register --> Register
Immediate --> Registerirmovl V, rB 3 0 8 rB V
Register --> Memoryrmmovl rA, D(rB) 4 0 rA rB D
Memory --> Registermrmovl D(rB), rA 5 0 rA rB D
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
Distinct: Distinct: 清楚的清楚的Distinct: Distinct: 清楚的清楚的
– 15 – Processor
Move Instruction ExamplesMove Instruction Examples
irmovl $0xabcd, %edx movl $0xabcd, %edx 30 82 cd ab 00 00
IA32 Y86 Encoding
rrmovl %esp, %ebx movl %esp, %ebx 20 43
mrmovl -12(%ebp),%ecxmovl -12(%ebp),%ecx 50 15 f4 ff ff ff
rmmovl %esi,0x41c(%esp)movl %esi,0x41c(%esp)
—movl $0xabcd, (%eax)
—movl %eax, 12(%eax,%edx)
—movl (%ebp,%eax,4),%ecx
40 64 1c 04 00 00
– 16 – Processor
Jump InstructionsJump Instructions Refer to generically as “jXX” Encodings differ only by
“function code” Based on values of condition
codes Same as IA32 counterparts Encode full destination
addressUnlike PC-relative
addressing seen in IA32
jmp Dest 7 0
Jump Unconditionally
Dest
jle Dest 7 1
Jump When Less or Equal
Dest
jl Dest 7 2
Jump When Less
Dest
je Dest 7 3
Jump When Equal
Dest
jne Dest 7 4
Jump When Not Equal
Dest
jge Dest 7 5
Jump When Greater or Equal
Dest
jg Dest 7 6
Jump When Greater
Dest
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
– 17 – Processor
Y86 Program StackY86 Program Stack Region of memory holding
program data Used in Y86 (and IA32) for
supporting procedure calls Stack top indicated by %esp
Address of top stack element
Stack grows toward lower addresses
Top element is at lowest address in the stack
When pushing, must first decrement stack pointer
When popping, increment stack pointer
%esp
•
•
•
IncreasingAddresses
Stack “Top”
Stack “Bottom”
– 18 – Processor
Stack OperationsStack Operations
Decrement %esp by 4 Store word from rA to memory at %esp Like IA32
Read word from memory at %esp Save in rA Increment %esp by 4 Like IA32
pushl rA a 0 rA 8
popl rA b 0 rA 8
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
– 19 – Processor
Subroutine Call and ReturnSubroutine Call and Return
Push address of next instruction onto stack Start executing instructions at Dest Like IA32
Pop value from stack Use as address for next instruction Like IA32
call Dest 8 0 Dest
ret 9 0
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
– 20 – Processor
Miscellaneous InstructionsMiscellaneous Instructions
Don’t do anything
Stop executing instructions IA32 has comparable instruction, but can’t execute it in user mode We will use it to stop the simulator
nop 0 0
halt 1 0
P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2P259 Figure 4.2
– 21 – Processor
Y86 Program StructureY86 Program Structure
Program starts at address 0
Must set up stackMake sure don’t
overwrite code!
Must initialize data Can use symbolic
names
irmovl Stack,%esp # Set up stackrrmovl %esp,%ebp # Set up frameirmovl List,%edxpushl %edx # Push argumentcall len2 # Call Functionhalt # Halt
.align 4List: # List of elements
.long 5043
.long 6125
.long 7395
.long 0
# Functionlen2:
. . .
# Allocate space for stack.pos 0x100Stack:
– 22 – Processor
Assembling Y86 ProgramAssembling Y86 Program
Generates “object code” file eg.yoActually looks like disassembler output
unix> yas eg.ys
0x000: 308400010000 | irmovl Stack,%esp # Set up stack 0x006: 2045 | rrmovl %esp,%ebp # Set up frame 0x008: 308218000000 | irmovl List,%edx 0x00e: a028 | pushl %edx # Push argument 0x010: 8028000000 | call len2 # Call Function 0x015: 10 | halt # Halt 0x018: | .align 4 0x018: | List: # List of elements 0x018: b3130000 | .long 5043 0x01c: ed170000 | .long 6125 0x020: e31c0000 | .long 7395 0x024: 00000000 | .long 0
– 23 – Processor
Simulating Y86 ProgramSimulating Y86 Program
Instruction set simulatorComputes effect of each instruction on processor statePrints changes in state from original
unix> yis eg.yo
Stopped in 41 steps at PC = 0x16. Exception 'HLT', CC Z=1 S=0 O=0Changes to registers:%eax: 0x00000000 0x00000003%ecx: 0x00000000 0x00000003%edx: 0x00000000 0x00000028%esp: 0x00000000 0x000000fc%ebp: 0x00000000 0x00000100%esi: 0x00000000 0x00000004
Changes to memory:0x00f4: 0x00000000 0x000001000x00f8: 0x00000000 0x000000150x00fc: 0x00000000 0x00000018
– 24 – Processor
SummarySummary
Y86 Instruction Set ArchitectureY86 Instruction Set Architecture Similar state and instructions as IA32 Simpler encodings Somewhere between CISC and RISC
How Important is ISA Design?How Important is ISA Design? Less now than before
With enough hardware, can make almost anything go fast
Intel is moving away from IA32Does not allow enough parallel execution Introduced IA64
» 64-bit word sizes (overcome address space limitations)» Radically different style of instruction set with explicit parallelism» Requires sophisticated compilers
Radically:Radically: 根本上根本上Radically:Radically: 根本上根本上
– 25 – Processor
Logic DesignLogic Design
Digital circuitDigital circuit What is digital circuit? Know what a CPU will base on?
Hardware Control Language (HCL)Hardware Control Language (HCL) A simple and functional language to describe our CPU
implementation Syntax like C
– 26 – Processor
Category of CircuitCategory of Circuit
Analog CircuitAnalog Circuit Use all the range of Signal Most part is amplifier Hard to model and automatic design Use transistor and capacitance as basis We will not discuss it here
Digital CircuitDigital Circuit Has only two values, 0 and 1 Easy to model and design Use true table and other tools to analyze Use gate as the basis The voltage of 1 is differ in different kind circuit.
E.g. TTL circuit using 5 voltage as 1
Amplifier: Amplifier: 放大器放大器TransistorTransistor ::晶体管晶体管Capacitance: Capacitance: 电容电容
Amplifier: Amplifier: 放大器放大器TransistorTransistor ::晶体管晶体管Capacitance: Capacitance: 电容电容
– 27 – Processor
Digital SignalsDigital Signals
Use voltage thresholds to extract discrete values from continuous signal
Simplest version: 1-bit signalEither high range (1) or low range (0)With guard range between them
Not strongly affected by noise or low quality circuit elementsCan make circuits simple, small, and fast
Voltage
Time
0 1 0
– 28 – Processor
Overview of Logic DesignOverview of Logic Design
Fundamental Hardware RequirementsFundamental Hardware Requirements Communication
How to get values from one place to another Computation Storage (Memory) Clock Signal
Bits are Our FriendsBits are Our Friends Everything expressed in terms of values 0 and 1 Communication
Low or high voltage on wire Computation
Compute Boolean functions Storage Clock Signal
– 29 – Processor
Category of Digital CircuitCategory of Digital Circuit
Combinational CircuitCombinational Circuit Without memory. So the circuit can’t have state. Any same
input will get the same output at any time. Needn’t clock signal Typical application: ALU
Sequential CircuitSequential Circuit = Combinational circuit + memory and clock signal Have state. Two same inputs may not generate the same
output. Use clock signal to control the run of circuit. Typical application: CPU
– 30 – Processor
Computing with Logic Gates P272 Figure 4.8Computing with Logic Gates P272 Figure 4.8
Outputs are Boolean functions of inputs Not an assignment operation, just give the circuit a name Respond continuously to changes in inputs
With some, small delay
ab out
ab out a out
out = a && b out = a || b out = !a
And Or Not
Voltage
Time
a
ba && b
Rising Delay Falling Delay
– 31 – Processor
Combinational CircuitsCombinational Circuits
Acyclic Network of Logic GatesAcyclic Network of Logic Gates Continuously responds to changes on inputs Outputs become (after some delay) Boolean functions of
inputs
Acyclic Network
Inputs Outputs
– 32 – Processor
Bit Equality P272 Figure 4.9Bit Equality P272 Figure 4.9
Generate 1 if a and b are equal
Hardware Control Language (HCL)Hardware Control Language (HCL) Very simple hardware description language
Boolean operations have syntax similar to C logical operations
We’ll use it to describe control logic for processors
Bit equala
b
eqbool eq = (a&&b)||(!a&&!b)
HCL Expression
– 33 – Processor
Bit-Level Multiplexor P273 Figure 4.10Bit-Level Multiplexor P273 Figure 4.10
Control signal s Data signals a and b Output a when s=1, b when s=0 Its name: MUX Usage: Select one signal from a couple of signals
Bit MUX
b
s
a
out
bool out = (s&&a)||(!s&&b)
HCL Expression
– 34 – Processor
Word Equality P274 Figure 4.11Word Equality P274 Figure 4.11
32-bit word size HCL representation
Equality operationGenerates Boolean value
b31Bit equal
a31
eq31
b30Bit equal
a30
eq30
b1Bit equal
a1
eq1
b0Bit equal
a0
eq0
Eq
==B
A
Eq
Word-Level Representation
bool Eq = (A == B)
HCL Representation
– 35 – Processor
Word Multiplexor P275 Figure 4.12Word Multiplexor P275 Figure 4.12
Select input word A or B depending on control signal s
HCL representationCase expressionSeries of test : value pairsOutput value for first successful test
Word-Level Representation
HCL Representation
b31
s
a31
out31
b30
a30
out30
b0
a0
out0
int Out = [ s : A; 1 : B;];
s
B
AOutMUX
– 36 – Processor
HCL Word-Level Examples P277, P266HCL Word-Level Examples P277, P266
Find minimum of three input words
HCL case expression Final case guarantees
match
AMin3MIN3B
Cint Min3 = [ A < B && A < C : A; B < A && B < C : B; 1 : C;];
D0
D3
Out4
s0s1
MUX4D2D1
Select one of 4 inputs based on two control bits
HCL case expression Simplify tests by
assuming sequential matching
int Out4 = [ !s1&&!s0: D0; !s1 : D1; !s0 : D2; 1 : D3;];
Minimum of 3 Words
4-Way Multiplexor
– 37 – Processor
OFZFSF
OFZFSF
OFZFSF
OFZFSF
Arithmetic Logic Unit P278 Figure 4.13Arithmetic Logic Unit P278 Figure 4.13
Combinational logicContinuously responding to inputs
Control signal selects function computedCorresponding to 4 arithmetic/logical operations in Y86
Also computes values for condition codes We will use it as a basic component for our CPU
ALU
Y
X
X + Y
0
ALU
Y
X
X - Y
1
ALU
Y
X
X & Y
2
ALU
Y
X
X ^ Y
3
A
B
A
B
A
B
A
B
– 38 – Processor
StorageStorage
RegistersRegisters Hold single words or bits Loaded as clock rises Not only program registers
Random-access memoriesRandom-access memories Hold multiple words E.g. register file, memory Possible multiple read or write ports Read word when address input changes Write word as clock rises
I O
Clock
– 39 – Processor
Register Operation P279 Figure 4.14Register Operation P279 Figure 4.14
Stores data bits For most of time acts as barrier between input and output As clock rises, loads input
State = x
RisingclockOutput = xInput = y
x
State = y
Output = y
y
Barrier: Barrier: 屏障屏障Barrier: Barrier: 屏障屏障
– 40 – Processor
State Machine ExampleState Machine Example
Accumulator circuit
Load or accumulate on each cycle
Comb. Logic
ALU
0
OutMUX
0
1
Clock
In
Load
x0 x1 x2 x3 x4 x5
x0 x0+x1 x0+x1+x2 x3 x3+x4 x3+x4+x5
Clock
Load
In
Out
– 41 – Processor
Random-Access Memory P280Random-Access Memory P280
Stores multiple words of memoryAddress input specifies which word to read or write
Register fileHolds values of program registers %eax, %esp, etc.Register identifier serves as address
» ID 8 implies no read or write performed
Multiple PortsCan read and/or write multiple words in one cycle
» Each has separate address and data input/output
Registerfile
Registerfile
A
B
W dstW
srcA
valA
srcB
valB
valW
Read ports Write port
Clock
– 42 – Processor
Register File TimingRegister File TimingReadingReading
Like combinational logic Output data generated based on input
address After some delay
WritingWriting Like register Update only as clock rises
Registerfile
Registerfile
A
B
srcA
valA
srcB
valB
y2
Registerfile
Registerfile
W dstW
valW
Clock
x2
Risingclock Register
fileRegister
fileW dstW
valW
Clock
y2
x2
x
2
– 43 – Processor
Hardware Control LanguageHardware Control Language Very simple hardware description language Can only express limited aspects of hardware operation
Parts we want to explore and modify
Data TypesData Types bool: Boolean
a, b, c, …
int: wordsA, B, C, …Does not specify word size---bytes, 32-bit words, …
StatementsStatements bool a = bool-expr ; int A = int-expr ;
– 44 – Processor
HCL OperationsHCL Operations Classify by type of value returned
Boolean ExpressionsBoolean Expressions Logic Operations
a && b, a || b, !a Word Comparisons
A == B, A != B, A < B, A <= B, A >= B, A > B Set Membership
A in { B, C, D }» Same as A == B || A == C || A == D
Word ExpressionsWord Expressions Case expressions
[ a : A; b : B; c : C ]Evaluate test expressions a, b, c, … in sequenceReturn word expression A, B, C, … for first successful test
– 45 – Processor
SummarySummary
ComputationComputation Performed by combinational logic Computes Boolean functions Continuously reacts to input changes
StorageStorage Registers
Hold single wordsLoaded as clock rises
Random-access memoriesHold multiple wordsPossible multiple read or write portsRead word when address input changesWrite word as clock rises