
Copyright © 2006, Intel Corporation

Circuit Technologies for Multi-Core Processor Design

Stefan Rusu

Intel Corporation

Santa Clara, CA


Xeon, Itanium, Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2006, Intel Corporation


Outline

• Dual-core architectural directions
• Interconnect trends
• Power and leakage reduction
– Cache Sleep and Shut-off Modes
– Long-Le Transistor Usage
• Voltage domains
• Clock distribution
• Package details
• DFT/DFM features
• Thermal management
• Summary


Dual Core Processors Everywhere!

Laptop
Intel tips Viiv, Yonah in consumer push
Mark LaPedus, 01/06/2006
SAN JOSE, Calif. - Making a big splash in the consumer market, Intel Corp. on Thursday unveiled a dual-core processor, PC platform and several content alliances that are said to provide the foundation for digital entertainment and wireless laptops.

Server
Intel unwraps dual-core Xeon server processors
Tom Krazit, 10/10/2005
At an event Monday in San Francisco, Intel unveiled its first dual-core Xeon chips for two-processor and four-processor servers, previously known by the Paxville code name.

Desktop
Intel ready to ship dual-core processors
Daniel A. Begun, 3/15/2005
Earlier this year, Intel announced that desktop processors using dual-core technology will be available by the end of June, but the company recently hinted at March's Intel Developer Forum (IDF) that dual-core processors could be available even sooner than that.


Tulsa – Xeon® Processor

• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 1.328 Billion
• Die area: 435 mm²

[Figure: block diagram and die photo. Core 0 and Core 1, each with a 1MB L2, share a 16MB L3 cache (with TAG arrays) and the bus interface (FSB top, FSB bottom, control logic).]


Montecito – Itanium® 2 Processor

• Process technology: 90 nm, 7 Cu interconnect layers
• Transistor count: 1.72 Billion
• Die area: 596 mm²

[Figure: block diagram and die photo. Core 0 and Core 1 each have a 1MB L2I, a 256kB L2D and a 12MB L3; both connect to the bus interface.]


Yonah – Mobile/Desktop/Blade Server

• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 151 Million
• Die area: 90.3 mm²

[Figure: block diagram and die photo. Core 0 and Core 1 share a 2MB L2 cache and the bus interface.]


Moore’s Law for Multi-Core Processors

Process technology scaling enables the number of cores to double every two years

[Figure: transistor count (log scale, 1.E+05 to 1.E+10) versus year (1980 to 2010) for the 286, 386, 486, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium®, Itanium® 2, dual-core Itanium® 2 and dual-core Xeon®.]
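The doubling rule stated on this slide can be written as cores(year) = cores_0 · 2^((year − year_0) / 2). The sketch below is only an illustration of that arithmetic: it assumes a dual-core baseline in 2006 (the date of this talk), and every later value is a pure extrapolation of the stated trend, not data from the slides.

```python
def projected_cores(year, base_year=2006, base_cores=2, doubling_period_years=2):
    """Extrapolate core count assuming one doubling every two years."""
    generations = (year - base_year) / doubling_period_years
    return base_cores * 2 ** generations

# Baseline year and core count are assumptions, not figures from the slide.
for year in (2006, 2008, 2010, 2012):
    print(year, int(projected_cores(year)))
```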


Server Processors L3 Cache Trend

Cache size doubles with every process generation

[Figure: on-die L3 cache size in MB (0 to 28) versus process generation (180nm, 130nm, 90nm, 65nm) for Xeon® processors and Itanium® processors.]


SRAM Cell Size Scaling

SRAM cell size scales ~0.5x per generation

[Figure: SRAM cell size in um² (log scale, 0.1 to 100) versus process generation.]

Process generation     0.5um  0.35um  0.25um  0.18um  0.13um  90nm  65nm
SRAM cell size (um²)   44     21      10.6    5.6     2.4     1.0   0.57
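The "~0.5x per generation" claim can be checked against the seven cell sizes plotted on this slide (44, 21, 10.6, 5.6, 2.4, 1.0 and 0.57 um²). The sketch below assumes those values map onto the process generations in plotted order, from 0.5um down to 65nm, and computes the per-generation ratio and its geometric mean, which comes out just under 0.5x.

```python
import math

# Cell sizes in um^2; the node-to-value mapping is assumed from plot order.
cell_size_um2 = {
    "0.5um": 44.0, "0.35um": 21.0, "0.25um": 10.6, "0.18um": 5.6,
    "0.13um": 2.4, "90nm": 1.0, "65nm": 0.57,
}

nodes = list(cell_size_um2)
sizes = list(cell_size_um2.values())

# Scaling factor between each pair of consecutive generations.
ratios = [new / old for old, new in zip(sizes, sizes[1:])]
for node, ratio in zip(nodes[1:], ratios):
    print(f"{node}: {ratio:.2f}x")

# Geometric mean of the per-generation factors.
geo_mean = math.prod(ratios) ** (1 / len(ratios))
print(f"geometric mean: {geo_mean:.2f}x")
```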


Server Processors L3 Cache Trend

Large per-core and per-thread caches enable server-class performance

Dual-core Processor   Process   L3 cache per core [MB]   L3 cache per thread [MB]
Itanium®              90nm      12                       6
Xeon®                 65nm      8                        4
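The per-core and per-thread figures on this slide follow from simple division. The sketch below takes the L3 totals from the earlier die slides (two 12MB L3 caches on the Itanium® part, one shared 16MB L3 on the Xeon® part) and assumes two hardware threads per core, which the slides do not state directly.

```python
def l3_per_core_and_thread(total_l3_mb, cores, threads_per_core):
    """Return (MB per core, MB per thread) for an evenly divided L3."""
    per_core = total_l3_mb / cores
    per_thread = per_core / threads_per_core
    return per_core, per_thread

# threads_per_core=2 is an assumption; cache totals come from the die slides.
processors = {
    "Itanium (90nm)": dict(total_l3_mb=24, cores=2, threads_per_core=2),
    "Xeon (65nm)": dict(total_l3_mb=16, cores=2, threads_per_core=2),
}
for name, cfg in processors.items():
    per_core, per_thread = l3_per_core_and_thread(**cfg)
    print(f"{name}: {per_core:g} MB/core, {per_thread:g} MB/thread")
```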


On-chip Interconnect Trend

• Local interconnects scale with gate delay
• Global interconnects do not scale

[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (FO4), local interconnect (M1,2), global interconnect with repeaters and global interconnect without repeaters. Source: ITRS]


Layout Techniques to Reduce Coupling

• Interleave bussed signals with other busses that switch at a different time

• Eliminates capacitive coupling and reduces inductive noise

• Switch bit order of bussed signals at every turn

• Noise will not be additive across the entire bus route

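[Figure: voltage versus time waveforms for Bus_A and Bus_B switching at different times; swizzled bus route reordering bits 0 1 2 3 4 5 into 3 0 4 1 5 2.]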
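The swizzled route on this slide reorders bits 0 1 2 3 4 5 into 3 0 4 1 5 2. One way to reproduce that ordering is to interleave the upper half of the bus with the lower half; the sketch below does this and checks that no two wires that were neighbors before the turn are still neighbors after it. The function names are illustrative, and the interleaving rule is inferred from the single example on the slide.

```python
def swizzle(order):
    """Interleave the upper half of the bus with the lower half."""
    half = len(order) // 2
    lower, upper = order[:half], order[half:]
    swizzled = []
    for i, bit in enumerate(upper):
        swizzled.append(bit)
        if i < len(lower):
            swizzled.append(lower[i])
    return swizzled

def adjacent_pairs(order):
    """Set of unordered neighbor pairs along one segment of the route."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

before = [0, 1, 2, 3, 4, 5]
after = swizzle(before)
print(after)  # [3, 0, 4, 1, 5, 2]

# Wires that couple in the first segment do not couple again in the next.
shared = adjacent_pairs(before) & adjacent_pairs(after)
print(len(shared))  # 0
```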


Layout Techniques (Cont)

• Interleave narrow Vcc/Vss lines through bus

• Use staggered inverting buffers [7]

[Figure: bus wires interleaved with narrow Vcc and Vss lines.]


L3 Cache Sleep and Shut-off Modes

Mode       Virtual VSS   Leakage
Active     0V            baseline
Sleep      250mV         2x lower leakage
Shut-off   520mV         2x lower leakage

[Figure: voltage diagram (0V to 1.1V) for the three modes, and sub-arrays whose virtual VSS is controlled by block select, sleep bias and shut-off signals.]
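The three modes differ in how far the virtual VSS rail is raised above ground, which reduces the voltage left across the cells. The sketch below tabulates that remaining voltage from the slide's numbers (1.1V supply; virtual VSS at 0V, 250mV and 520mV). The relative leakage column reads the two "2x lower leakage" labels as active to sleep and sleep to shut-off, which is an interpretation of the figure rather than something the slide spells out.

```python
SUPPLY_V = 1.1  # cache supply shown on the slide

# Virtual VSS level per mode, taken from the slide.
VIRTUAL_VSS_V = {"active": 0.0, "sleep": 0.25, "shutoff": 0.52}

# Assumed reading of the labels: each step halves the leakage.
RELATIVE_LEAKAGE = {"active": 1.0, "sleep": 0.5, "shutoff": 0.25}

def cell_voltage(mode):
    """Voltage left across the SRAM cells once virtual VSS is raised."""
    return SUPPLY_V - VIRTUAL_VSS_V[mode]

for mode in VIRTUAL_VSS_V:
    print(f"{mode:8s} {cell_voltage(mode):.2f} V  leakage x{RELATIVE_LEAKAGE[mode]}")
```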


Leakage Shut-off Infrared Images

[Figure: infrared images. 16MB SKU: all 16MB in sleep mode. 8MB SKU: 8MB in sleep mode, 8MB in shut-off mode.]

Shut-off feature reduces the leakage of the 8MB disabled sub-arrays by about 3W


Dynamic Intel® Smart Cache Sizing

• First implementation in Yonah dual-core mobile processor
– Dynamic implementation of the shut-off mode
• HW based algorithm predicts cache usage requirements
– Considers the % of time the CPU is in Active state compared to the various sleep states
• During periods of low activity or inactivity the processor dynamically adapts its effective cache size
– Cache content is gradually flushed to system memory
– Cache ways are gradually turned off (physically as well as logically), thus reducing power
• Cache ways are re-powered on demand to deliver full performance when needed
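The behavior described above (shrink gradually when the CPU is mostly idle, restore on demand) can be sketched as a small controller. This is a toy model, not the hardware algorithm: the way count, the thresholds and the class name are all hypothetical, and the real predictor's inputs are only described on the slide as the share of time spent in Active versus sleep states.

```python
class CacheSizer:
    """Toy controller: shrink the cache one way at a time when idle."""

    def __init__(self, total_ways=8, min_ways=1, shrink_below=0.2, grow_above=0.6):
        self.total_ways = total_ways      # hypothetical way count
        self.min_ways = min_ways          # hypothetical floor
        self.shrink_below = shrink_below  # hypothetical idle threshold
        self.grow_above = grow_above      # hypothetical busy threshold
        self.active_ways = total_ways

    def step(self, active_fraction):
        """Update powered ways from the fraction of time spent in Active state."""
        if active_fraction < self.shrink_below and self.active_ways > self.min_ways:
            # Gradual shrink: flush one way to memory, then power it off.
            self.active_ways -= 1
        elif active_fraction > self.grow_above:
            # Demand returned: re-power every way for full performance.
            self.active_ways = self.total_ways
        return self.active_ways

sizer = CacheSizer()
trace = [0.9, 0.1, 0.1, 0.1, 0.4, 0.8]
print([sizer.step(a) for a in trace])  # [8, 7, 6, 5, 5, 8]
```

The gap between the two thresholds keeps the controller from toggling ways when activity hovers around a single trip point.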


Leakage Mitigation: Long-Le Transistors

• All transistors can be either nominal or long-Le

• Most library cells are available in both flavors

• Long-Le transistors are about 10% slower, but have 3x lower leakage

• All paths with timing slack use long-Le transistors

• Initial design uses only long channel devices

[Figure: a nominal-Le transistor next to a long-Le transistor (nominal + 10%).]
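The flow implied by these bullets is: start with long-Le devices everywhere, then fall back to nominal devices only where timing fails. The sketch below applies that rule per path, using the "about 10% slower" figure from the slide. Treating the choice as per path rather than per device, and the path names and delays, are simplifying assumptions for illustration.

```python
LONG_LE_SLOWDOWN = 1.10  # long-Le devices are about 10% slower

def choose_device(long_le_delay_ps, cycle_time_ps):
    """Keep long-Le if the path meets timing, else fall back to nominal.

    Returns the chosen device flavor and the resulting path delay.
    """
    if long_le_delay_ps <= cycle_time_ps:
        return "long-Le", long_le_delay_ps
    return "nominal", long_le_delay_ps / LONG_LE_SLOWDOWN

# Hypothetical path delays, measured with long-Le devices, for a 300 ps cycle.
paths = {"alu_bypass": 324.5, "decode": 264.0, "debug_mux": 165.0}
for name, delay in paths.items():
    device, final_delay = choose_device(delay, 300.0)
    print(f"{name}: {device}, {final_delay:.1f} ps")
```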


Long-Le Transistors Usage Map

[Figure: die map of long-Le usage (0% to 100%) across Core 0, Core 1, the L3 cache and control logic.]


Long-Le Transistors Summary

Percentage of Long-Le device width excluding RAM arrays:

Cores: Long-Le 54%, Nominal 46%
Uncore: Long-Le 76%, Nominal 24%

To reduce sub-threshold leakage, most devices will be slower and only a handful of transistors will be fast
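Combining these width percentages (54% long-Le in the cores, 76% in the uncore) with the "3x lower leakage" figure quoted earlier gives a rough estimate of the leakage saved relative to an all-nominal design. The sketch assumes leakage is proportional to device width and that all nominal devices leak equally per unit width, which is a first-order simplification.

```python
LEAKAGE_REDUCTION = 3.0  # long-Le devices leak about 3x less

def relative_leakage(long_le_fraction):
    """Leakage relative to an all-nominal design, per unit of device width."""
    nominal_fraction = 1.0 - long_le_fraction
    return nominal_fraction + long_le_fraction / LEAKAGE_REDUCTION

for block, fraction in {"cores": 0.54, "uncore": 0.76}.items():
    print(f"{block}: {relative_leakage(fraction):.2f} of all-nominal leakage")
```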


Stack Forcing

[Figure: normalized delay under iso-input load versus normalized Ioff (1e-5 to 1e+0) for low-Vt, high-Vt, two-stack low-Vt and two-stack high-Vt devices. Inset: a transistor of width w replaced by a two-transistor stack with wu = ½ w and wl = ½ w, so that wu + wl = w (equal loading); the stack trades a performance loss for a leakage reduction.]

• Force one transistor into a two transistor stack with the same input load

• Can be applied to gates with timing slack

• Trade-off between transistor leakage and speed

[Narendra, et al – ISLPED 2001]
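The equal-loading constraint (wu + wl = w with wu = wl) fixes both stack widths at w/2. The sketch below shows that split and gives a first-order feel for the performance loss, assuming on-resistance is simply proportional to 1/W and that series devices add. That resistance model is an assumption made here for illustration; it ignores body effect and other second-order behavior, and it does not model the leakage reduction itself.

```python
def force_stack(width):
    """Split one device of the given width into a two-transistor stack."""
    upper = lower = width / 2.0
    return upper, lower

def relative_resistance(width, upper, lower):
    """On-resistance of the stack relative to the single device (R ~ 1/W)."""
    single = 1.0 / width
    stacked = 1.0 / upper + 1.0 / lower  # series devices add
    return stacked / single

w = 4.0  # arbitrary width units
wu, wl = force_stack(w)
print(wu + wl == w)                    # input load is unchanged
print(relative_resistance(w, wu, wl))  # 4.0 under this first-order model
```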


Montecito Voltage Domains [2]

[Figure: die floorplan annotated with voltage domains.]

• Core 0: frequency tracks voltage, 40W
• Core 1: frequency tracks voltage, 40W
• Foxton control: 1GHz fixed frequency, 250mW
• Bus arbiter: fixed frequency, 2.5W
• Bus I/O: fixed frequency, 8W total
• 12MB L3 cache (two instances): asynchronous operation, 2.5W each


Tulsa Voltage Domains

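[Figure: die floorplan (cores, 1MB L2 caches, 16MB L3 with TAG arrays, FSB top and bottom, control logic) showing the Core, Uncore, I/O and PLL voltage domains, with a voltage profile along a cut line through the cores, control + tag and 16MB array; marked levels are 0V, 0.25V (virtual VSS), 1.10V and 1.25V.]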


Tulsa Clock Domains [4]

[Figure: die floorplan (cores, 1MB L2 caches, 16MB L3 with TAG arrays, FSB top and bottom) showing the Core, Uncore, I/O and PLL clock domains. Legend: System Clock (BCLK).]


Tulsa Uncore Clock Distribution [4]

[Figure: uncore clock distribution showing the PLL (clock generator), core dense MCLK grid, un-core pre-global MCLK spine, un-core sparse SCLK grid, de-skew buffer, un-core ZCLK grid, un-core pre-global ZCLK spine, horizontal clock spines and vertical clock spines.]


Montecito Clock System [5]

[Figure: clock system block diagram spanning fixed-supply and variable-supply domains. The bus clock feeds a PLL; divisors (1/M, 1/N, 1/1) are set from fuses and pins through a frequency translation table; DFD, SLCB, CVD, RAD, RVD and gater blocks and a phase aligner deliver clocks over a balanced tree clock distribution to Core0, Core1, Foxton, the bus logic and the I/Os.]


Montecito Clock Distribution [6]

[Figure: clock distribution from the PLL and bus clock through L0, L1, L2 and L3 routes with repeaters, DFD, SLCB, RAD, CVD and gater stages down to the latches in core0, core1, Foxton, the bus logic and the IOs. Upper levels are differential and lower levels single ended; one region uses variable frequency with full rail transitions, the other fixed frequency with low voltage swings.]


Single-Ended vs. Differential Clocks

• Differential clock
– Lower skew
– High power
– Longer distance between repeaters
• Single-ended clock
– Lower power
– Need sharp edges to control skew

[Figure: routing cross-sections of a differential clock pair (CLK+, CLK-) and a single-ended clock (CLK), each shielded by GND and VDD lines.]


Dense vs. Sparse Grid Tiles

[Figure: clock grid tiles built from M5 to M8 wires: core full grid tile, un-core dense tile and un-core sparse tile. Labeled dimensions include 85um, 194um, 400um, 450um and 1200um.]


Xeon® Processor Package

• 12 layers organic package (53.3 mm/side)

• 4-4-4 stacking

• Integrated heat spreader (38.5 mm/side)

• 604 total pins

• 366 signal I/Os

• System management components and decoupling capacitors on package


Itanium® 2 Processor Package

[Figure: package photo with labeled power delivery connector, heat spreader, system management components and decoupling caps.]


Intel® Core™ Duo Processor Package


Design for Test and Debug Features

L3 cache DFT/DFM
• Built-in pattern generator (PBIST)
• Programmable weak-write test
• Low-yield analysis
• Stability test mode
• 32-entry cache line disable (Pellston)

FSB DFT/DFM
• I/O loopback
• I/O test generator

Die-level DFT/DFM
• Parallel structural core test with XOR
• Scan and observability registers (scan-out)
• Three TAP controllers (core0, core1, uncore)
• Within-die process monitors
• On-die clock shrink


Itanium® 2 Thermal Map

• Two thermal sensors per core
• Mux thermal diodes into VCOs to measure temp

[Figure: Itanium® 2 thermal map.]


Xeon® Infra-red Emission Image

[Figure: infra-red emission image with the temperature sensors and thermal diode marked.]


Potential Multi-Core Thermal Control

• Multiple core designs require a central thermal management unit and an efficient mechanism for transmitting thermal sensor measurements [8]

[Figure: floorplan with eight cores, eight caches and two I/O blocks, sixteen thermal sensors distributed across the die and one central thermal management unit. Legend: TS = Thermal Sensor, TMU = Thermal Management Unit.]
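A central thermal management unit, as proposed on this slide, gathers readings from sensors spread across the die and makes one decision for the chip. The sketch below is a toy illustration of that idea only, not the mechanism in [8]: the throttle-on-hottest-sensor policy, the hysteresis, the trip temperatures and the sensor names are all hypothetical.

```python
class ThermalManagementUnit:
    """Toy central TMU: collect sensor readings and decide on throttling."""

    def __init__(self, throttle_at_c=90.0, release_at_c=80.0):
        self.throttle_at_c = throttle_at_c  # hypothetical trip point
        self.release_at_c = release_at_c    # hypothetical release point
        self.throttled = False

    def update(self, readings_c):
        """readings_c maps a sensor name to its temperature in Celsius."""
        hottest_sensor = max(readings_c, key=readings_c.get)
        hottest = readings_c[hottest_sensor]
        if hottest >= self.throttle_at_c:
            self.throttled = True
        elif hottest <= self.release_at_c:
            self.throttled = False
        # Between the two points the previous state is kept (hysteresis).
        return hottest_sensor, self.throttled

tmu = ThermalManagementUnit()
print(tmu.update({"core0": 72.0, "core1": 93.5, "cache0": 60.0}))  # ('core1', True)
print(tmu.update({"core0": 70.0, "core1": 85.0, "cache0": 58.0}))  # ('core1', True)
print(tmu.update({"core0": 70.0, "core1": 78.0, "cache0": 58.0}))  # ('core1', False)
```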


Summary

• Dual-core processors cover the entire compute spectrum, from laptops to desktops and servers
• Moore’s Law applied to multi-core processors:
– Process technology scaling enables number of cores and cache size to double every two years
• Power and leakage reduction circuit techniques are essential for multi-core processors
– Massive Long-Le usage → many slow transistors and a few fast devices
– Cache sleep and shut-off modes (static and dynamic)
• Multiple on-die clock and voltage domains are required to control active power and leakage
– Need new verification tools and automated checks


References

[1] S. Rusu, et al., “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 5.3, Feb. 2006.

[2] S. Naffziger, et al., “The Implementation of a 2-Core, Multi-Threaded Itanium® Family Processor,” IEEE J. Solid-State Circuits, pp. 197-209, Jan. 2006.

[3] R. Korner, “Yonah and Sossaman Processor Briefing,” Intel Developer Forum, Fall 2005.

[4] S. Tam, et al., “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 21.2, 2006.

[5] T. Fischer, et al., “A 90-nm variable frequency clock system for a power managed Itanium® architecture processor,” IEEE J. Solid-State Circuits, pp. 217–227, Jan. 2006.

[6] P. Mahoney, et al., “Clock Distribution on a Dual-core Multi-threaded Itanium™-Family Microprocessor,” ISSCC Dig. Tech. Papers, 2005.

[7] S. Rusu, “Timing analysis of high-speed VLSI designs - Trends and challenges,” International Workshop on Timing Issues (TAU), 1997.

[8] S. Rusu and S. Tam, “Apparatus for thermal management of multiple core microprocessors,” US patent 6,908,272, issued 7/21/2005.
