Jim Keller
Digital Semiconductor
Digital Equipment Corp.
Hudson MA

[Figure: Performance (SPECint95; axis ticks 10, 30, 50, 100) versus year, 1995 to 2001, plotting the 21164 and the 21264 across three process generations: CMOS 5 (0.5 um), CMOS 6 (0.35 um) and CMOS 7 (0.25 um).]

Highlights

• Continued performance leadership
  – Est. SPECint95 of 30+ and SPECfp95 of 50+
  – Spectacular memory bandwidth

• Architectural, circuit and process leadership achieve 500 MHz operation in 0.35 um CMOS

• Four-way out-of-order execution

• Implements Alpha Motion Video Instructions
  – MPEG2 encode in real time

Feature               Description
Process Technology    CMOS 6 - 6 layer metal
Min. Feature Size     0.35 um drawn
Transistor Count      15.2 Million
Frequency             500+ MHz
Power                 60 Watts
Die Size              3.0 cm^2
Issue Rate            4 Integer, 2 Floating Point
L1 Instruction Cache  64KByte 2-way Set Associative
L1 Data Cache         64KByte 2-way Set Associative
Package               588 PGA
Samples               Q1 97
Volume                H2 97

Alpha 21264 Block Diagram

[Figure: block diagram laid out over pipeline stages 0 to 6 (FETCH, MAP, QUEUE, REG, EXEC, DCACHE). Labeled blocks: Branch Predictors, Next-Line Address, L1 Ins. Cache (64KB, 2-Set), Int Reg Map, FP Reg Map, Int Issue Queue (20), FP Issue Queue (15), two integer Reg Files (80), FP Reg File (72), four Exec units with two Addr paths, FP ADD Div/Sqrt, FP MUL, L1 Data Cache (64KB, 2-Set), Victim Buffer, Miss Address, Bus Interface Unit. Labeled buses: Sys Bus (64-bit), Cache Bus (128-bit), Phys Addr (44-bit). Annotations: 4 Instructions / cycle; 80 in-flight instructions plus 32 loads and 32 stores.]


Fetch Stage

• Four instructions plus a next-fetch predictor are fetched per cycle from the instruction cache
  – Pipelined branch predictor runs in parallel

• Next-fetch predictor addresses the cache next cycle
  – Eliminates branch bubbles
  – Learns dynamic jump targets for DLLs

• Set predictor stores next-icache set (one of two) in current icache line
  – Enables associative cache at high speed

[Figure: fetch stage block diagram. The PC addresses the Instr. Cache, whose line holds Instructions (4), a Next-Fetch Pred field and a Set Pred field (next address + set). Tag S0 and Tag S1 each feed a Cmp block producing Hit / Miss / Set Miss. Annotations: 0-cycle Penalty On Branches; Set-Associative Behavior.]
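The next-fetch and set prediction on the Fetch Stage slides can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class and function names, the one-entry-per-fetch-block table, the sequential default and the always-overwrite training policy are hypothetical and are not taken from the slides. Only the four-instruction fetch width and the one-of-two set choice come from the source.

```python
from dataclasses import dataclass

FETCH_WIDTH = 4      # instructions fetched per cycle (from the slide)
INSTR_BYTES = 4      # Alpha instructions are 32 bits wide


@dataclass
class LineEntry:
    next_addr: int   # predicted address of the next fetch block
    next_set: int    # predicted icache set (0 or 1) holding that block


class NextFetchPredictor:
    """Hypothetical model: one prediction stored alongside each fetch block."""

    def __init__(self):
        self.entries = {}    # fetch-block address -> LineEntry

    def predict(self, pc):
        # Assumption: an untrained block predicts sequential fetch in set 0.
        fall_through = pc + FETCH_WIDTH * INSTR_BYTES
        return self.entries.get(pc, LineEntry(fall_through, 0))

    def train(self, pc, actual_next, actual_set):
        # Assumption: always overwrite with where fetch really went.
        self.entries[pc] = LineEntry(actual_next, actual_set)


def run(trace, predictor):
    """trace: (address, set) fetch blocks in execution order.
    Returns how many next-block predictions were wrong."""
    misses = 0
    for (pc, _), (nxt, nxt_set) in zip(trace, trace[1:]):
        guess = predictor.predict(pc)
        if (guess.next_addr, guess.next_set) != (nxt, nxt_set):
            misses += 1
        predictor.train(pc, nxt, nxt_set)
    return misses


# A three-block loop: sequential step, then a jump to set 1, then back.
loop = [(0x100, 0), (0x110, 0), (0x200, 1)] * 3
p = NextFetchPredictor()
print(run(loop, p))
```

In this toy trace the two non-sequential transitions miss once each while they are being learned; every later pass through the loop is predicted, which is the behavior the "eliminates branch bubbles" bullet points at.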

Fetch Stage: Branch Prediction

• Some branch directions can be predicted based on their past behavior: Local Correlation

• Others can be predicted based on how the program arrived at the branch: Global Correlation

• 21264 implements both methods and dynamically applies the best one

[Figure: Local and Global predictors and a Chooser produce the Prediction.]
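The local / global / chooser arrangement can be sketched as a generic tournament predictor. The code below is an illustration, not the 21264 design: the history lengths, the 2-bit saturating counters, the indexing of the chooser by global history and all names are assumptions, since the slide gives no table sizes or update rules.

```python
LOCAL_HIST_BITS = 4     # assumed size, chosen only to keep the sketch small
GLOBAL_HIST_BITS = 4    # assumed size


def bump(counter, up):
    """Move a saturating 2-bit counter toward 3 (up) or toward 0 (down)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = {}                              # pc -> recent outcomes of that branch
        self.local_ctr = [1] * (1 << LOCAL_HIST_BITS)     # indexed by a branch's own history
        self.global_hist = 0                              # recent outcomes of all branches
        self.global_ctr = [1] * (1 << GLOBAL_HIST_BITS)   # indexed by global history
        self.chooser = [1] * (1 << GLOBAL_HIST_BITS)      # >= 2 means "trust global"

    def predict(self, pc):
        local = self.local_ctr[self.local_hist.get(pc, 0)] >= 2
        glob = self.global_ctr[self.global_hist] >= 2
        use_global = self.chooser[self.global_hist] >= 2
        return glob if use_global else local

    def update(self, pc, taken):
        lh = self.local_hist.get(pc, 0)
        gh = self.global_hist
        local_ok = (self.local_ctr[lh] >= 2) == taken
        global_ok = (self.global_ctr[gh] >= 2) == taken
        # Train the chooser only when the two methods disagree.
        if local_ok != global_ok:
            self.chooser[gh] = bump(self.chooser[gh], global_ok)
        self.local_ctr[lh] = bump(self.local_ctr[lh], taken)
        self.global_ctr[gh] = bump(self.global_ctr[gh], taken)
        self.local_hist[pc] = ((lh << 1) | taken) & ((1 << LOCAL_HIST_BITS) - 1)
        self.global_hist = ((gh << 1) | taken) & ((1 << GLOBAL_HIST_BITS) - 1)


def count_correct(pred, pc, outcomes):
    """Predict, then train, on each outcome; return the number predicted correctly."""
    correct = 0
    for taken in outcomes:
        correct += pred.predict(pc) == taken
        pred.update(pc, taken)
    return correct


pattern = [True, False] * 100            # one branch that alternates direction
tp = TournamentPredictor()
count_correct(tp, 0x40, pattern[:100])   # warm-up
print(count_correct(tp, 0x40, pattern[100:]), "of 100 correct after warm-up")
```

With a single alternating branch both methods learn the pattern, so the chooser has little to do; it matters when many branches interleave and one method is reliably better for a given history.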


Mapper and Queue Stages

• Mapper:
  – Rename 4 instructions per cycle (8 source / 4 dest)
  – 80 int + 72 fp physical registers

• Queue Stage:
  – Instructions issued out-of-order when data ready
  – Instructions leave the queues after they issue
  – Integer: 20 entries / Quad-Issue
  – Floating Point: 15 entries / Dual-Issue
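The mapper bullet (rename 4 instructions per cycle, 8 sources and 4 destinations, into 80 integer physical registers) can be illustrated with a minimal renaming sketch. The names, the simple free list and the identity initial mapping are assumptions for illustration; freeing registers at retirement and restoring saved map state are not modeled.

```python
NUM_ARCH = 32    # Alpha architectural integer registers
NUM_PHYS = 80    # integer physical registers (from the slide)
GROUP = 4        # instructions renamed per cycle (from the slide)


class Renamer:
    """Hypothetical map table plus free list."""

    def __init__(self):
        self.map = list(range(NUM_ARCH))              # arch reg -> phys reg
        self.free = list(range(NUM_ARCH, NUM_PHYS))   # unallocated phys regs

    def rename(self, group):
        """group: up to 4 (dest, src1, src2) tuples of architectural registers.
        Returns the same instructions expressed in physical registers."""
        if len(group) > GROUP:
            raise ValueError("at most 4 instructions per cycle")
        if len(group) > len(self.free):
            raise RuntimeError("out of physical registers: mapper must stall")
        renamed = []
        for dest, src1, src2 in group:
            # Sources read the current mapping, so a later instruction in the
            # same group sees destinations written by earlier ones.
            p1, p2 = self.map[src1], self.map[src2]
            pd = self.free.pop(0)      # fresh destination removes false dependences
            self.map[dest] = pd
            renamed.append((pd, p1, p2))
        return renamed


r = Renamer()
# r1 = r2 op r3 ; r4 = r1 op r5 ; r1 = r4 op r4
print(r.rename([(1, 2, 3), (4, 1, 5), (1, 4, 4)]))
```

The third instruction reuses architectural register 1 but receives a new physical register, so it does not have to wait for readers of the first result; that is what lets the queues issue instructions out of order.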

Map and Queue Stages

[Figure: MAPPER and ISSUE QUEUE. Mapper labels: Instructions, Register Numbers, MAP CAMs (80 Entries), Saved Map State, Renamed Registers, 4 Inst / Cycle. Issue queue labels: QUEUE (20 Entries), Register Scoreboard, Request, Grant, Arbiter, Issued Instructions.]
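The scoreboard, request / grant and arbiter labels in the figure suggest the following toy issue queue. It is a sketch under assumptions: the oldest-first grant policy, the set-based scoreboard and every name here are hypothetical, and only the 20-entry capacity, the four-wide issue and the rule that instructions leave the queue once issued come from the slides.

```python
class IssueQueue:
    def __init__(self, capacity=20, width=4):
        self.capacity = capacity
        self.width = width
        self.entries = []      # (name, dest, sources) in arrival order
        self.ready = set()     # scoreboard: physical registers whose value is available

    def insert(self, name, dest, sources):
        if len(self.entries) >= self.capacity:
            return False       # queue full: the mapper would have to stall
        self.entries.append((name, dest, tuple(sources)))
        return True

    def issue_cycle(self):
        """Entries whose sources are all ready request issue; the arbiter
        grants up to `width` of them, oldest first (an assumed policy)."""
        requests = [e for e in self.entries
                    if all(s in self.ready for s in e[2])]
        grants = requests[:self.width]
        for e in grants:
            self.entries.remove(e)    # instructions leave the queue once issued
        return [e[0] for e in grants]

    def writeback(self, dest):
        self.ready.add(dest)          # result available: wake up dependents


q = IssueQueue()
q.ready.update({1, 2})
q.insert("A", dest=10, sources=[1, 2])
q.insert("B", dest=11, sources=[10])   # depends on A
q.insert("C", dest=12, sources=[2])
first = q.issue_cycle()                # A and C are ready; B waits on register 10
q.writeback(10)
second = q.issue_cycle()
print(first, second)
```

C issues ahead of the older B because its data is ready first, which is the out-of-order behavior the Queue Stage bullets describe.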


[Figure: execution units. Floating Point Execution Units: FP Mul; FP Add, FP Div, SQRT; one Reg file. Integer Execution Units: two groups, each with its own Reg file. One group is labeled add / logic, mul, add / logic / memory, shift/br; the other is labeled add / logic, m. video, add / logic / memory, shift/br.]

Floating Point Processing

• 21264 dual-issues floating-point instructions to two independent units
  – Add / Div / Square Root unit
    • 64-bit Add: 4-cycle latency, fully pipelined
    • 64-bit IEEE Divide: 16-cycle latency
    • 64-bit Square root: 33-cycle latency
  – Multiply unit
    • 4-cycle latency, fully pipelined
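The difference between latency and pipelined throughput in the floating-point figures can be made concrete with a little arithmetic. The helper names are hypothetical, and the model assumes one add can enter the unit every cycle and that results are forwarded with no extra delay; the slide states only the 4-cycle latency and that the unit is fully pipelined.

```python
ADD_LATENCY = 4    # cycles, fully pipelined (from the slide)


def dependent_chain_cycles(n, latency=ADD_LATENCY):
    """n adds where each one needs the previous result: nothing overlaps."""
    return n * latency


def independent_stream_cycles(n, latency=ADD_LATENCY):
    """n independent adds entering a fully pipelined unit one per cycle:
    the last one starts at cycle n - 1 and finishes `latency` cycles later."""
    return latency + (n - 1) if n else 0


print(dependent_chain_cycles(100))      # 400
print(independent_stream_cycles(100))   # 103
```

Under these assumptions a reduction written as one long dependence chain runs at roughly a quarter of the rate of the same work split into independent adds, which is why the out-of-order queues and compilers try to expose independent operations.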

Load and Store Processing

• L1 Data Cache
  – Two loads / stores per cycle (any combination)
    • 1 GHz operation: phase pipelined, no bank conflicts
    • 16 Byte read / write per cycle

• Loads and stores issue out-of-order
  – 32-entry load and store reorder buffers
  – Stores check load buffer to enforce ordering

• Cache prefetch instructions
  – Read, modify intent, non-cached, evict next

Memory Pipeline

[Figure: memory pipeline datapath. Labeled blocks: Store Addr, Tag (two copies), DTB (two copies), Dcache Set 0, Dcache Set 1, Store Data, Victim Data Buffer, Load Queue, Store Queue, Miss Addr File, Victim Addr File. Two Fill Data inputs and two outputs labeled To Operand Bus.]
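The statement that stores check the load buffer to enforce ordering can be illustrated as follows. This is a sketch of one plausible reading, not the documented mechanism: it assumes the check looks for younger loads that already executed to the same address, and the sequence numbers, exact-address match and method names are all hypothetical. What the hardware does once a conflict is found is not stated on the slides, so the sketch only reports it.

```python
class LoadQueue:
    """Hypothetical 32-entry record of loads that have already executed."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.executed = {}    # program-order sequence number -> load address

    def record_load(self, seq, addr):
        if len(self.executed) >= self.capacity:
            raise RuntimeError("load queue full")
        self.executed[seq] = addr

    def store_check(self, store_seq, store_addr):
        """Called when a store executes. Returns the younger loads (larger
        sequence number) that already read the same address, i.e. loads that
        ran ahead of a store they should have followed."""
        return sorted(seq for seq, addr in self.executed.items()
                      if seq > store_seq and addr == store_addr)

    def retire(self, seq):
        self.executed.pop(seq, None)


lq = LoadQueue()
lq.record_load(seq=5, addr=0x1000)    # executed early, out of order
lq.record_load(seq=7, addr=0x2000)
print(lq.store_check(store_seq=3, store_addr=0x1000))   # load 5 ran too early
print(lq.store_check(store_seq=6, store_addr=0x1000))   # load 5 is older: no conflict
```

A real implementation would compare at some address granularity rather than on exact equality; that detail is left out here.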

• L1 dcache: 8+ GB/s sustained bandwidth
  – 128b datapath with 3-cycle load-to-use latency

• L2 cache: 4+ GB/s sustained bandwidth
  – 128b datapath with 12-cycle load-to-use latency

• System interface: 2+ GB/s sustained bandwidth
  – 64b datapath with 80-cycle load-to-use latency

• 16 64-byte off-chip memory references
  – 8 read-misses and 8 write-backs
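The bandwidth and latency figures can be cross-checked with simple arithmetic. The sketch assumes a 500 MHz core clock (the frequency quoted elsewhere in the deck), 1 GB = 10^9 bytes, and treats the 8 outstanding read misses as a Little's-law bound; the function names are hypothetical and the bound is an estimate, not a figure from the slides.

```python
CLOCK_HZ = 500e6    # core clock assumed from the 500 MHz figure in the deck


def peak_gb_per_s(bytes_per_cycle, clock_hz=CLOCK_HZ):
    """Bytes moved per core cycle converted to GB/s."""
    return bytes_per_cycle * clock_hz / 1e9


def latency_ns(cycles, clock_hz=CLOCK_HZ):
    """Load-to-use latency in nanoseconds."""
    return cycles / clock_hz * 1e9


def in_flight_bound_gb_per_s(requests, bytes_each, latency_cycles,
                             clock_hz=CLOCK_HZ):
    """Little's-law bound: bytes in flight divided by the round-trip time."""
    return requests * bytes_each / (latency_cycles / clock_hz) / 1e9


print(f"L1: {peak_gb_per_s(16):.1f} GB/s at 16 bytes per cycle")
for name, cycles in [("L1", 3), ("L2", 12), ("system", 80)]:
    print(f"{name}: {latency_ns(cycles):.0f} ns load-to-use")
print(f"8 read misses of 64 bytes: {in_flight_bound_gb_per_s(8, 64, 80):.1f} GB/s bound")
```

Sixteen bytes per cycle at 500 MHz gives the 8 GB/s quoted for the L1 data cache, and eight 64-byte misses overlapped across an 80-cycle round trip bound read bandwidth at about 3.2 GB/s, which is consistent with the 2+ GB/s sustained figure for the system interface.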

External Interface

System port
  Frequency:  80 - 200 MHz Clock In
  I/O:        2V HSTL

L2 cache port
  L2 Cache:   0 - 16MB Synch Direct Mapped

  Option                               Peak Bandwidth
  Option 1: Reg-Reg 133 MHz BurstRAM   2.1 GB/sec
  Option 2: 'RISC' 250 MHz Late Write  4.0 GB/sec
  Option 3: Dual Data 333 MHz          5.3 GB/sec

[Figure: the Alpha 21264 with its two ports. L2 Cache Port: L2 Cache TAG RAMs (TAG, Address) and L2 Cache Data RAMs (Data, 128b). System Port: Address Out, Address In, System Data (64b), connecting to an ASIC Example made up of Control, Data Slice (2, 4, 8), Dupl TAG (opt) and PCI.]
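The three peak-bandwidth figures for the L2 options follow from the 128-bit data path if each quoted rate is read as one full-width transfer per cycle. That reading is an assumption, as are the dictionary and function names below.

```python
L2_BUS_BYTES = 128 // 8    # 128-bit L2 data path (from the slide)

# Transfer rates in MHz as quoted for each option on the slide.
OPTIONS_MHZ = {
    "Option 1: Reg-Reg BurstRAM": 133,
    "Option 2: 'RISC' Late Write": 250,
    "Option 3: Dual Data": 333,
}


def peak_gb_per_s(transfer_mhz, bus_bytes=L2_BUS_BYTES):
    """Assumes one full-width transfer per cycle at the quoted rate."""
    return transfer_mhz * 1e6 * bus_bytes / 1e9


for name, mhz in OPTIONS_MHZ.items():
    print(f"{name}: {peak_gb_per_s(mhz):.1f} GB/s")
```

The results round to 2.1, 4.0 and 5.3 GB/s, matching the table.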

Dual 21264 System Example

[Figure: two "21264 & L2 Cache" blocks with 64b data links, 8 Data Switches, two Memory Banks (100MHz SDRAM, 256b each), a Control Chip, and two PCI-chips each driving a 64b PCI bus.]

[Figure: single-processor system example. One "21264 & L2 Cache" block with a 64b data link, 2 or 4 Data Switches, two Memory Banks (128b each), a Control Chip, and one PCI-chip driving a 64b PCI bus.]

Performance Comparison: SPEC95

[Figure: two bar charts, SPECfp95 (axis 0 to 50) and SPECint95 (axis 0 to 30), comparing 21264-500 (est), 21164-500, PA8000-180, P6-200, UltraSparc-200 and R10000-200.]

Performance Comparison: McCalpin

[Figure: bar chart of McCalpin memory bandwidth in MB/s (axis 0 to 1600), comparing 21264-500 (est), 21164-500, PA8000-180, P6-200, UltraSparc-200 and R10000-200.]

Summary

• 21264 will maintain Alpha's performance lead
  – 30+ SPECint95 and 50+ SPECfp95
  – Spectacular memory bandwidth
  – Out-of-order execution can be done at very high frequency