1 EE/CPRE 465 VLSI Design Process. 2 Outline Design Partitioning Design process: MIPS Processor as...

EE/CPRE 465

VLSI Design Process

Outline• Design Partitioning• Design process: MIPS Processor as an example

– Architecture Design

– Microarchitecture Design

– Logic Design

– Circuit Design

– Physical Design

• Fabrication, Packaging, Testing

Coping with Complexity• How to design System-on-Chip?

– Many millions (even billions!) of transistors

– Tens to hundreds of engineers

• Structured Design• Partitioning of Design Process

Structured Design• Hierarchy: Divide and Conquer

– Recursively partition a system into modules

• Regularity– Reuse modules wherever possible

– Example: Uniformly sized transistors at circuit level

Standard cell library at gate level

• Modularity: well-formed interfaces– Allows modules to be treated as black boxes

• Locality– Physical and temporal

Partitioning of Design Process• Architecture Design: User’s perspective, what does it do?

– Instruction set, register set, and memory model

– MIPS, x86, PIC, ARM, Power, SPARC, Alpha,…

• Microarchitecture Design: how the architecture is partitioned into registers and functional units

– Single cycle, multcycle, pipelined, superscalar?

– For x86: 386, 486, Pentium, PII, PIII, P4, Core, Core 2, Atom, Celeron, Cyrix MII, AMD K5, Athlon, Phenom

• Logic Design: how are functional blocks constructed

– Ripple carry, carry lookahead, carry select adders

• Circuit Design: how are transistors used to implement the logic

– Complementary CMOS, pass transistors, domino

• Physical Design: chip layout

Two Types of Engineers• “Short and fat” engineers

– Understand a large amount about a narrow field

• “Tall and skinny” engineers– Understand something about a broad range of topics

• Digital VLSI design favors the tall and skinny engineer– can evaluate how choices in one part of the system impact

other parts of the system

MIPS Architecture• Example: subset of MIPS processor architecture

– Drawn from Patterson & Hennessy• MIPS is a 32-bit architecture with 32 registers

– Consider 8-bit subset using 8-bit datapath– Only implement 8 registers ($0 - $7)– $0 hardwired to 00000000– 8-bit program counter

Original MIPS Architecture

Simplified MIPS Architecture here

Data width 32 bits 8 bits

Address width 32 bits 8 bits

# of registers 32 8

Instruction length 32 bits 32 bits

Instruction Set

101000

Instruction Encoding• 32-bit instruction encoding

– Requires four cycles to fetch on 8-bit datapath

• Note that the destination register is specified by:– Bits 15:11 for R-type instructions

– Bits 20:16 for addi instruction

format example encoding

0 ra rb rd 0 funct

ra rb imm

65 5 5 5

5 5 16

add $rd, $ra, $rb

beq $ra, $rb, imm

j dest dest

Fibonacci (C)f0 = 1; f-1 = -1

fn = fn-1 + fn-2

f1 =0, f2 =1, f3 =1, f4 =2, f5 =3, ...

Fibonacci (Assembly)

Fibonacci (Binary)• Machine language program

Multicycle MIPS Microarchitecture

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[15: 11]

Instruction[7: 0]

Instruction[25 : 21]

Instruction[15 : 0]

Instructionregister

ALUcontrol

ALUresult

ALUZero

Memorydata

register

MemRead

MemWrite

MemtoReg

PCWriteCond

PCWrite

IRWrite[3:0]

ALUSrcB

ALUSrcA

RegDst

PCSource

RegWrite

Control

Outputs

Op[5 : 0]

Instruction[31:26]

Instruction [5 : 0]

JumpaddressInstruction [5 : 0] 6 8

Shiftleft 2

1ALUOut

Memory

MemData

Writedata

Address

ALUControl

Shift left 2

Multicycle MIPS µ-arch (32-bit Design)

Multicycle Controller

PCWritePCSource = 10

ALUSrcA = 1ALUSrcB = 00ALUOp = 01PCWriteCond

PCSource = 01

ALUSrcA =1ALUSrcB = 00ALUOp= 10

RegDst = 1RegWrite

MemtoReg = 0

MemWriteIorD = 1

MemReadIorD = 1

ALUSrcA = 1ALUSrcB = 10ALUOp = 00

RegDst= 0RegWrite

MemtoReg=1

ALUSrcA = 0ALUSrcB = 11ALUOp = 00

MemReadALUSrcA = 0

IorD = 0IRWrite3

ALUSrcB = 01ALUOp = 00

Instruction fetch

Instruction decode/register fetch

Jumpcompletion

BranchcompletionExecution

Memory addresscomputation

Memoryaccess

Memoryaccess R-type completion

Write-back step

121195

MemReadALUSrcA = 0

IorD = 0IRWrite2

1MemRead

ALUSrcA = 0IorD = 0IRWrite1

2MemRead

ALUSrcA = 0IorD = 0IRWrite0

16Chapter 5 of Patterson and Hennessy (32-bit Design)

Summary of Steps for Each Instruction Class

Step nameAction for

R-type Instruction

Action for load

Instruction

Action for store

Instruction

Action for branch

Instruction

Action for jump

Instruction

Instruction fetch

IR <= Memory[PC]PC <= PC + 4

Instruction decode /

register fetch

A <= Reg[IR[25:21]]B <= Reg[IR[20:16]]

ALUOut <= PC + (sign-extend(IR[15:0]) << 2)

Execution / address

computation / branch/jump completion

ALUOut <= A op B

ALUOut <= A + sign-extend(IR[15:0])If (A==B)

PC <= ALUOut

PC <= {PC[31:28],

IR[25:0], 2’b00}

Memory access /R-type

completion

Reg[IR[15:11]] <= ALUOut

MDR <= Memory[ALUOut]

Memory[ALUOut] <= B

Memory read completion

Reg[IR[20:16]]<= MDR

Become 4 steps in our 8-bit design

Instructions from ISA Perspective

• Consider each instruction from the perspective of ISA.• Example: Add instruction

– Instruction specified by the PC. – Operand registers are specified by bits 25:21 and 20:16 of the

instruction– New value is the sum of two registers. – Register written is specified by bits 15:11 of instruction.

Reg[Memory[PC][15:11]] <=

Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]PC <= PC + 4

• In order to accomplish this we must break up the instruction.– kind of like introducing variables when programmingISA: Instruction Set Architecture

Breaking Down an Instruction

• ISA definition of arithmetic:

Reg[Memory[PC][15:11]] <=Reg[Memory[PC][25:21]] + Reg[Memory[PC][20:16]]

• Could break down to:– IR <= Memory[PC]– A <= Reg[IR[25:21]]– B <= Reg[IR[20:16]]– ALUOut <= A + B– Reg[IR[15:11]] <= ALUOut

• Don’t forgot an important part of the definition of arithmetic!– PC <= PC + 4

Idea Behind Multicycle Approach

• We define each instruction from the ISA perspective

• Break it down into steps:– Balance the amount of work to be done in different steps– Restrict each cycle to use only one major functional unit

• Introduce new registers as needed– A, B, ALUOut, MDR, IR, etc.

• Finally try and pack as much work into each step (avoid unnecessary cycles)

while also trying to share steps where possible(minimizes control, helps to simplify solution)

• Result: Our book’s multicycle implementation!

1. Instruction Fetch

2. Instruction Decode and Register Fetch

3. Execution, Memory Address Computation, or Branch / Jump Completion

4. Memory Access or R-type Instruction Completion

5. Memory Read Completion

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

Five Execution Steps

6-8 cycles in our 8-bit design

Become 4 steps since we have an 8-bit design

• Use PC to get instruction and put it in the Instruction Register.• Increment the PC by 4 and put the result back in the PC.• Can be described succinctly using "Register-Transfer

Language“ (RTL):

IR <= Memory[PC];PC <= PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?

Step 1: Instruction Fetch

Become 4 steps in our 8-bit design

• Read registers rs and rt in case we need them• Compute the branch address in case the instruction is a

branch• RTL:

A <= Reg[IR[25:21]];B <= Reg[IR[20:16]];ALUOut <= PC + (sign-extend(IR[15:0]) << 2);

• We are not setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Step 2: Instruction Decode and Register Fetch

• ALU is performing one of three functions, based on instruction type

• R-type:

ALUOut <= A op B;

• Memory Reference:

ALUOut <= A + sign-extend(IR[15:0]);

• Branch:

if (A==B) PC <= ALUOut;

• Jump :

PC <= {PC[31:28], IR[25:0], 2’b00}

Step 3: Execution, Memory Address Computation, or Branch / Jump Completion (Instruction Dependent)

• Loads and stores access memory

MDR <= Memory[ALUOut];or

Memory[ALUOut] <= B;

• R-type instructions completion

Reg[IR[15:11]] <= ALUOut;

Step 4: Memory Access or R-type Instruction Completion

• Reg[IR[20:16]] <= MDR;

Step 5: Memory Read Completion

Logic Design• Start at top level

– Hierarchically decompose MIPS into units

• Top-level interface

crystaloscillator

2-phaseclockgenerator MIPS

processor adr

writedata

memdata

externalmemory

memreadmemwrite

Block Diagram

datapath

controlleralucontrol

memdata[7:0]

writedata[7:0]

adr[7:0]

memread

memwrite

op[5:0]

regwrite

irwrite[3:0]

pcsource[1:0]

alusrcb[1:0]

alusrca

aluop[1:0]

regdst

funct[5:0]

alucontrol[2:0]

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[15: 11]

Instruction[7: 0]

Instruction[15 : 0]

Instructionregister

ALUcontrol

ALUresult

ALUZero

Memorydata

register

MemRead

MemWrite

MemtoReg

PCWriteCond

PCWrite

IRWrite[3:0]

ALUSrcB

ALUSrcA

RegDst

PCSource

RegWrite

Control

Outputs

Op[5 : 0]

Instruction[31:26]

Instruction [5 : 0]

JumpaddressInstruction [5 : 0] 6 8

Shiftleft 2

1ALUOut

Memory

MemData

Writedata

Address

ALUControl

Hierarchical Designmips

controller alucontrol datapath

standardcell library

bitslice zipper

flopinv4x

ramslice

fulladder

nand2nor2

HDLs• Hardware Description Languages

– Widely used in logic design

– Verilog and VHDL

• Describe hardware using code– Document logic functions

– Simulate logic before building

– Synthesize code into gates and layout

• Requires a library of standard cells

Verilog Examplemodule adder( input logic [7:0] a, b,

input logic c,output logic [7:0] s,output logic cout);

wire [6:0] carry;

fulladder fa0(a[0], b[0], c, s[0], carry[0]);fulladder fa0(a[1], b[1], carry[0], s[1], carry[1]);fulladder fa0(a[2], b[2], carry[1], s[2], carry[2]); . . . .fulladder fa0(a[7], b[7], carry[6], s[7], cout);

endmodule

module fulladder(input logic a, b, c, output logic s, cout);

sum s1(a, b, c, s);carry c1(a, b, c, cout);

endmodule module carry(input logic a, b, c, output logic cout)

assign cout = (a&b) | (a&c) | (b&c);endmodule

cout carrysum

fulladder

Circuit Design• How should logic be implemented?

– NANDs and NORs vs. ANDs and ORs?

– Fan-in and fan-out?

– How wide should transistors be?

• These choices affect speed, area, power• Logic synthesis makes these choices for you

– Good enough for many applications

– Hand-crafted circuits are still better

Example: Carry Logic• assign cout = (a&b) | (a&c) | (b&c);

Gate-level design: 26 transistors, 4 stages of gate delays

Example: Carry Logic• assign cout = (a&b) | (a&c) | (b&c);

coutcn

Transistor-level design: 12 transistors, 2 stages of gate delays

Gate-level Netlist

module carry(input a, b, c, output cout)

wire x, y, z;

and g1(x, a, b);and g2(y, a, c);and g3(z, b, c);or g4(cout, x, y, z);

endmodule

Transistor-Level Netlist

coutcn

module carry(input a, b, c, output cout)

wire i1, i2, i3, i4, cn;

tranif1 n1(i1, 0, a);tranif1 n2(i1, 0, b);tranif1 n3(cn, i1, c);tranif1 n4(i2, 0, b);tranif1 n5(cn, i2, a);tranif0 p1(i3, 1, a);tranif0 p2(i3, 1, b);tranif0 p3(cn, i3, c);tranif0 p4(i4, 1, b);tranif0 p5(cn, i4, a);tranif1 n6(cout, 0, cn);tranif0 p6(cout, 1, cn);

endmodule

SPICE Netlist.SUBCKT CARRY A B C COUT VDD GNDMN1 I1 A GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5PMN2 I1 B GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5PMN3 CN C I1 GND NMOS W=1U L=0.18U AD=0.5P AS=0.5PMN4 I2 B GND GND NMOS W=1U L=0.18U AD=0.15P AS=0.5PMN5 CN A I2 GND NMOS W=1U L=0.18U AD=0.5P AS=0.15PMP1 I3 A VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1 PMP2 I3 B VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1PMP3 CN C I3 VDD PMOS W=2U L=0.18U AD=1P AS=1PMP4 I4 B VDD VDD PMOS W=2U L=0.18U AD=0.3P AS=1PMP5 CN A I4 VDD PMOS W=2U L=0.18U AD=1P AS=0.3PMN6 COUT CN GND GND NMOS W=2U L=0.18U AD=1P AS=1PMP6 COUT CN VDD VDD PMOS W=4U L=0.18U AD=2P AS=2PCI1 I1 GND 2FFCI3 I3 GND 3FFCA A GND 4FFCB B GND 4FFCC C GND 2FFCCN CN GND 4FFCCOUT COUT GND 2FF.ENDS

Physical Design• Floorplan

– Area estimation

• Place & route– Standard cells

• Datapaths– Slice planning

Synthesized MIPS

Layout

MIPS Floorplan

datapath2700 x 1050

(2.8 M2)

alucontrol200 x 100

(20 k2)

zipper 2700 x 250

wiring channel: 30 tracks = 240

mips(4.6 M2)

bitslice 2700 x 100

control1500 x 400

(0.6 M2)

10 I/O pads

Area Estimation• Need area estimates to make floorplan

– Compare to another block you already designed

– Or estimate from transistor counts

– Budget room for large wiring tracks

– Your mileage may vary!

MIPS Layout

Standard Cells• Uniform cell height

• Uniform well height

• M1 VDD and GND rails

• M2 Access to I/Os

• Well / substrate taps

• Exploits regularity

Synthesized Controller• Synthesize HDL into gate-level netlist• Place & Route using standard cell library

Snap-Together Cells• Synthesized controller area is mostly wires

– Design is smaller if wires run through/over cells

– Smaller = faster, lower power as well!

• Design snap-together cells for datapaths and arrays– Plan wires into cells

– Pitch Matching required

– Connect by abutment• Exploits locality

• Takes lots of effort

A A A A

MIPS Datapath• 8-bit datapath built from 8 bitslices (regularity)• Zipper at top drives control signals to datapath

MIPS ALU• Arithmetic / Logic Unit is part of bitslice

Slice Plans• Slice plan for bitslice

– Cell ordering, dimensions, wiring tracks

– Arrange cells for wiring locality

Design Verification• Fabrication is slow & expensive

– MOSIS 0.6m: $1000, 3 months– 65 nm: $3M, 1 month

• Debugging chips is very hard– Limited visibility into operation

• Prove design is right before building!– Logic simulation– Ckt. simulation / Formal verification– Layout vs. schematic (LVS) comparison– Design & electrical rule checks (DRC &

• Verification is > 50% of effort on most chips!

Specification

ArchitectureDesign

LogicDesign

CircuitDesign

PhysicalDesign

Function

FunctionTimingPower

Fabrication & Packaging• Tapeout final layout• Fabrication

– 6, 8, 12” wafers

– Optimized for throughput,

not latency (10 weeks!)

– Cut into individual dice

• Packaging– Bond gold wires from die I/O pads to package

Testing• Test that chip operates

– Design errors

– Manufacturing errors

• A single dust particle or wafer defect kills a die– Yields from 90% to < 10%

– Depends on die size, maturity of process

– Test each part before shipping to customer

1 EE/CPRE 465 VLSI Design Process. 2 Outline Design Partitioning Design process: MIPS Processor as...

Documents

Transcript of 1 EE/CPRE 465 VLSI Design Process. 2 Outline Design Partitioning Design process: MIPS Processor as...

Design and microarchitecture of the IBM System z10 microprocessor

Advanced Microarchitecture

The architecture for Discovery...Tick-Tock Development Model Sustained Microprocessor Leadership Nehalem Microarchitecture Sandy Bridge Microarchitecture Haswell Microarchitecture

CprE 488 Embedded Systems Design Lecture 1 Introductionclass.ece.iastate.edu/cpre488/lectures/Lect-01.pdf · CprE 488 – Embedded Systems Design Lecture 1 – Introduction ... machines

CPRE foodmapping

CprE 488 â€“ Embedded Systems Design Lecture 1 â€“ Introduction

Members: Nicholas Allendorf - CprE Christopher Daly – CprE Daniel Guilliams – CprE Andrew Joseph – EE Adam Schuster – CprE Faculty Advisor: Dr. Daji Qiao.

CPRE 583 Reconfigurable Computing Lecture 6: Wed 10/21/2009 (Design Patterns)

Fine Grain 3D Integration for Microarchitecture Design ...cadlab.cs.ucla.edu/~cong/papers/2007_ICCD_Eren.pdfFine Grain 3D Integration for Microarchitecture Design Through Cube Packing

CPRE terapêutico

Microarchitecture-Level Power-Performance …dbrooks/micro2003-tutorial...Microarchitecture-Level Power-Performance Simulators: Modeling, Validation, and Impact on Design Zhigang Hu,

core design with next-generation microarchitecture E5-2400 ...

Microarchitecture. Outline Architecture vs. Microarchitecture Components MIPS Datapath 1.

CprE 488 Embedded Systems Design Lecture 5 Embedded ...class.ece.iastate.edu/cpre488/lectures/Lect-05.pdf · CprE 488 – Embedded Systems Design Lecture 5 – Embedded Operating

May04-14: Laptop Travel Games for Children Advisor: Dr. Jacobson Client: Senior Design Jonathan Gill: CprE Mike Mundy: CprE Nick Ransom: CprE Jonathan.

Computer Architectures - UTFPRpaginapessoal.utfpr.edu.br/gortan/aoc/transparenci...Microarchitecture 1 Microarchitecture Intel Core microarchitecture In computer engineering, microarchitecture

i8085 microarchitecture

CprE 488 Embedded Systems Design Lecture 6 Software ...

Senior Design SD1107 Solar Module Observation Device Team Leader: Collin Howe, CprE Webmaster: Jacob Rasmuson, CprE Communications: Arthur Fiester, CprE.

Nehalem (microarchitecture)