
Circuit Technologies for Multi-Core Processor Design

Stefan Rusu

Intel Corporation

Santa Clara, CA

[email protected]

Xeon, Itanium, Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2006, Intel Corporation


Outline
• Dual-core architectural directions
• Interconnect trends
• Power and leakage reduction
  – Cache Sleep and Shut-off Modes
  – Long-Le Transistor Usage
• Voltage domains
• Clock distribution
• Package details
• DFT/DFM features
• Thermal management
• Summary


Dual Core Processors Everywhere!

Laptop: "Intel tips Viiv, Yonah in consumer push," Mark LaPedus, 01/06/2006
SAN JOSE, Calif. -- Making a big splash in the consumer market, Intel Corp. on Thursday unveiled a dual-core processor, PC platform and several content alliances that are said to provide the foundation for digital entertainment and wireless laptops.

Server: "Intel unwraps dual-core Xeon server processors," Tom Krazit, 10/10/2005
At an event Monday in San Francisco, Intel unveiled its first dual-core Xeon chips for two-processor and four-processor servers, previously known by the Paxville code name.

Desktop: "Intel ready to ship dual-core processors," Daniel A. Begun, 3/15/2005
Earlier this year, Intel announced that desktop processors using dual-core technology will be available by the end of June, but the company recently hinted at March's Intel Developer Forum (IDF) that dual-core processors could be available even sooner than that.


Tulsa – Xeon® Processor

• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 1.328 Billion
• Die area: 435 mm²

[Block diagram: Core 0 and Core 1, each with a 1MB L2, connect to a shared 16MB L3 cache and the bus interface]

[Die photo labels: 16MB L3, TAG (x2), 1MB L2 (x2), Core 0, Core 1, FSB TOP, FSB BOT, Control Logic]


Montecito – Itanium® 2 Processor

• Process technology: 90 nm, 7 Cu interconnect layers
• Transistor count: 1.72 Billion
• Die area: 596 mm²

[Block diagram: Core 0 and Core 1, each with a 1MB L2I, a 256kB L2D and a 12MB L3, connect to the bus interface]

[Die photo labels: Core 0, Core 1, 1MB L2I (x2), 12MB L3 (x2)]


Yonah – Mobile/Desktop/Blade Server

• Process technology: 65 nm, 8 Cu interconnect layers
• Transistor count: 151 Million
• Die area: 90.3 mm²

[Block diagram: Core 0 and Core 1 connect to a shared 2MB L2 cache and the bus interface]

[Die photo labels: Core 0, Core 1, 2 MB L2 Cache, Bus]


Moore’s Law for Multi-Core Processors

Process technology scaling enables the number of cores to double every two years

[Figure: transistor count (log scale, 1.E+05 to 1.E+10) versus year (1980 to 2010) for the 286, 386, 486, Pentium®, Pentium® II, Pentium® III, Pentium® 4, Itanium®, Itanium® 2, Dual-core Itanium® 2 and Dual-core Xeon® processors]
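The doubling claim on this slide can be turned into simple arithmetic. The sketch below is illustrative only: the starting point (2 cores in 2006, taken from the dual-core parts in this deck) and the function name `projected_cores` are assumptions, and the output is a projection of the stated trend, not product data.

```python
def projected_cores(start_year, start_cores, year, doubling_period_years=2):
    """Core count if it doubles once per doubling period (illustrative trend only)."""
    if year < start_year:
        raise ValueError("year must not precede start_year")
    generations = (year - start_year) // doubling_period_years
    return start_cores * 2 ** generations


# Assumed starting point: dual-core parts in 2006.
for year in range(2006, 2015, 2):
    print(year, projected_cores(2006, 2, year))
```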


Server Processors L3 Cache Trend

Cache size doubles with every process generation

[Figure: on-die L3 cache size in MB (axis 0 to 28) versus process generation (180nm, 130nm, 90nm, 65nm) for Xeon® processors and Itanium® processors]


SRAM Cell Size Scaling

SRAM cell size scales ~0.5x per generation

[Figure: SRAM cell size in um² (log scale, 0.1 to 100) versus process generation (0.5, 0.35, 0.25, 0.18, 0.13 um, 90 nm, 65 nm). Data labels: 44, 21, 10.6, 5.6, 2.4, 1.0, 0.57]
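The "~0.5x per generation" statement can be checked against the data labels in the figure. This is a minimal sketch that assumes the seven labels map to the seven process generations in descending size order; that pairing is inferred from the plot residue rather than stated on the slide.

```python
import math

# Assumed pairing of figure labels to process generations (largest cell first).
nodes = ["0.5um", "0.35um", "0.25um", "0.18um", "0.13um", "90nm", "65nm"]
cell_um2 = [44, 21, 10.6, 5.6, 2.4, 1.0, 0.57]

# Shrink factor from each generation to the next.
ratios = [new / old for old, new in zip(cell_um2, cell_um2[1:])]
for node, ratio in zip(nodes[1:], ratios):
    print(f"{node}: {ratio:.2f}x the previous cell size")

# Geometric mean of the per-generation shrink factors.
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(f"geometric mean: {geo_mean:.2f}x per generation")
```

Under that pairing every step falls between 0.4x and 0.6x, which is consistent with the slide's rounded figure.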


Server Processors L3 Cache Trend

Large per-core and per-thread caches enable server-class performance

Dual-core Processor   Process   L3 cache per core [MB]   L3 cache per thread [MB]
Itanium®              90nm      12                       6
Xeon®                 65nm      8                        4
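The table values follow from dividing the total on-die L3 by the core and thread counts. The sketch below assumes two threads per core for both processors (implied by the per-thread values being half the per-core values) and takes the L3 totals from the earlier processor slides: two 12MB caches for Montecito and one shared 16MB cache for Tulsa. The helper name `l3_share` is hypothetical.

```python
def l3_share(total_l3_mb, cores, threads_per_core):
    """Return (L3 per core, L3 per thread) in MB."""
    per_core = total_l3_mb / cores
    per_thread = per_core / threads_per_core
    return per_core, per_thread


# Assumption: 2 threads per core on both dual-core parts.
print("Itanium 90nm:", l3_share(24, cores=2, threads_per_core=2))
print("Xeon 65nm:   ", l3_share(16, cores=2, threads_per_core=2))
```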


On-chip Interconnect Trend

• Local interconnects scale with gate delay
• Global interconnects do not scale

[Figure: relative delay (log scale, 0.1 to 100) versus feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (FO4), local interconnect (M1, M2), global interconnect with repeaters, and global interconnect without repeaters. Source: ITRS]


Layout Techniques to Reduce Coupling

• Interleave bussed signals with other busses that switch at a different time

• Eliminates capacitive coupling and reduces inductive noise

• Switch bit order of bussed signals at every turn

• Noise will not be additive across the entire bus route

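[Figure: voltage-versus-time waveforms for Bus_A and Bus_B, which switch at different times]

[Figure: Swizzled Bus Route. Bit order 0 1 2 3 4 5 before the turn becomes 3 0 4 1 5 2 after it]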
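The swizzle in the figure (0 1 2 3 4 5 becoming 3 0 4 1 5 2) can be reproduced by interleaving the upper half of the bus with the lower half. The sketch below is one way to generate that ordering and to check that no two wires are neighbours both before and after the turn. The function names are hypothetical and the interleave rule is inferred from the six-bit example, so it assumes an even bus width.

```python
def swizzle(order):
    """Interleave the upper half of the bus with the lower half (even width assumed)."""
    half = len(order) // 2
    lower, upper = order[:half], order[half:]
    swizzled = []
    for hi, lo in zip(upper, lower):
        swizzled += [hi, lo]
    return swizzled


def neighbor_pairs(order):
    """Unordered pairs of bits that are routed side by side."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


before = [0, 1, 2, 3, 4, 5]
after = swizzle(before)
print(after)

# Pairs that stay adjacent across the turn would accumulate coupling noise.
shared = neighbor_pairs(before) & neighbor_pairs(after)
print("pairs adjacent on both segments:", len(shared))
```

Because each wire sees different neighbours on each segment, coupling from any one aggressor acts on only part of the route.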


Layout Techniques (Cont)

• Interleave narrow Vcc/Vss lines through bus

• Use staggered inverting buffers [7]

[Figure: bus wires interleaved with Vcc and Vss lines]


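L3 Cache Sleep and Shut-off Modes

[Figure: one sub-array shown in Active Mode, Sleep Mode and Shut-off Mode, with Block Select, Sleep Bias and Shutoff controls on the virtual VSS node]

[Figure: voltage profile between 0V and 1.1V. Virtual VSS levels shown, in mode order: 0V, 250mV, 520mV. Each step is marked "2x lower leakage"]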


Leakage Shut-off Infrared Images

[Infrared image, 16MB SKU: all 16MB in sleep mode]

[Infrared image, 8MB SKU: 8MB in sleep mode, 8MB in shut-off mode]

Shut-off feature reduces the leakage of the 8MB disabled sub-arrays by about 3W
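The relative effect of the two modes can be sketched with the "2x lower leakage" steps from the previous slide. This is a rough model, not measured data: it assumes leakage is proportional to cache capacity and that each mode step halves it exactly. The unit is the leakage of 1MB of cache in active mode, and the names are hypothetical.

```python
# Relative leakage per MB, assuming each mode step gives exactly 2x lower leakage.
RELATIVE_LEAKAGE = {"active": 1.0, "sleep": 0.5, "shutoff": 0.25}


def cache_leakage(mb_by_mode):
    """Total relative leakage for a cache split across modes (MB per mode)."""
    return sum(RELATIVE_LEAKAGE[mode] * mb for mode, mb in mb_by_mode.items())


sku_16mb = cache_leakage({"sleep": 16})
sku_8mb = cache_leakage({"sleep": 8, "shutoff": 8})
print(sku_16mb, sku_8mb)
```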


Dynamic Intel® Smart Cache Sizing
• First implementation in Yonah dual-core mobile processor
  – Dynamic implementation of the shut-off mode
• HW based algorithm predicts cache usage requirements
  – Considers the % of time the CPU is in Active state compared to the various sleep states
• During periods of low activity or inactivity the processor dynamically adapts its effective cache size
  – Cache content is gradually flushed to system memory
  – Cache ways are gradually turned off (physically as well as logically), thus reducing power
• Cache ways are re-powered on demand to deliver full performance when needed
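The behaviour described in these bullets can be sketched as a small state machine. This is a toy model, not the Yonah hardware algorithm: the number of ways, the activity threshold, the minimum way count and the one-way-per-step shrink rate are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheSizer:
    total_ways: int = 8                    # hypothetical way count
    active_ways: int = 8
    min_ways: int = 0                      # hypothetical floor
    low_activity_threshold: float = 0.2    # hypothetical fraction of time in Active state

    def step(self, active_fraction: float) -> int:
        """Update the powered way count from the observed Active-state fraction."""
        if active_fraction >= self.low_activity_threshold:
            # Demand is back: re-power every way for full performance.
            self.active_ways = self.total_ways
        elif self.active_ways > self.min_ways:
            # Low activity: flush one way to memory, then power it off.
            self.active_ways -= 1
        return self.active_ways


sizer = CacheSizer()
history = [sizer.step(f) for f in (0.05, 0.05, 0.05, 0.90)]
print(history)
```

Shrinking gradually and restoring all at once mirrors the asymmetry in the bullets: ways are turned off step by step but re-powered on demand.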


Leakage Mitigation: Long-Le Transistors

• All transistors can be either nominal or long-Le

• Most library cells are available in both flavors

• Long-Le transistors are about 10% slower, but have 3x lower leakage

• All paths with timing slack use long-Le transistors

• Initial design uses only long channel devices

[Figure: transistor layouts with nominal Le and with long Le (nominal + 10%)]
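The selection rule in the bullets (use long-Le wherever timing slack allows) can be sketched at path granularity. This is a simplified illustration that assumes a uniform 10% delay penalty applied to a whole path. A real flow would work per cell, and the path names and delays below are made up.

```python
LONG_LE_DELAY_FACTOR = 1.10  # "about 10% slower", applied to the whole path here


def assign_devices(nominal_delays, cycle_time):
    """Pick long-Le for every path that still meets timing with the delay penalty."""
    choice = {}
    for path, nominal_delay in nominal_delays.items():
        long_le_delay = nominal_delay * LONG_LE_DELAY_FACTOR
        choice[path] = "long-Le" if long_le_delay <= cycle_time else "nominal"
    return choice


# Hypothetical path delays, normalized to a cycle time of 1.0.
paths = {"decode": 0.80, "alu_bypass": 0.95}
print(assign_devices(paths, cycle_time=1.0))
```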


Long-Le Transistors Usage Map

[Figure: die map of long-Le usage (scale 0% to 100%) across Core 0, Core 1, L3 Cache and Control]


Long-Le Transistors Summary

Percentage of Long-Le device width excluding RAM arrays:

Cores:   Long-Le 54%, Nominal 46%
Uncore:  Long-Le 76%, Nominal 24%

To reduce sub-threshold leakage, most devices will be slower and only a handful of transistors will be fast
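Combining these percentages with the 3x leakage figure quoted for long-Le transistors gives a rough estimate of the leakage saved. The sketch assumes sub-threshold leakage is proportional to device width and that every long-Le device leaks exactly one third as much as a nominal one. Both are simplifications, and `relative_leakage` is a hypothetical helper.

```python
def relative_leakage(long_le_fraction, long_le_leak_ratio=1 / 3):
    """Leakage relative to an all-nominal design, weighted by device width."""
    nominal_fraction = 1 - long_le_fraction
    return nominal_fraction + long_le_fraction * long_le_leak_ratio


cores = relative_leakage(0.54)
uncore = relative_leakage(0.76)
print(f"cores:  {cores:.2f} of all-nominal leakage")
print(f"uncore: {uncore:.2f} of all-nominal leakage")
```

Under these assumptions the cores leak about 64% and the uncore about 49% of what an all-nominal design would.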


Stack Forcing

• Force one transistor into a two transistor stack with the same input load

• Can be applied to gates with timing slack

• Trade-off between transistor leakage and speed

[Narendra, et al., ISLPED 2001]

[Figure: a transistor of width w replaced by a two-transistor stack of widths wu and wl, with wu = wl = ½ w so that wu + wl = w. Annotations: Equal Loading, Leakage Reduction, Performance Loss]

[Figure: normalized delay under iso-input load versus normalized Ioff (1e-5 to 1e+0) for Low-Vt, High-Vt, Two-stack Low-Vt and Two-stack High-Vt devices]


Montecito Voltage Domains [2]

[Figure: die map of voltage domains with these labels]

• Core 0: frequency tracks voltage, 40W
• Core 1: frequency tracks voltage, 40W
• Foxton Control: 1GHz fixed frequency
• Bus Arbiter: fixed frequency, 2.5W
• Bus I/O: fixed frequency, 8W total
• 12MB L3 cache (x2): asynchronous operation, 2.5W each
• Other labels in the figure: 250mW, f1


Tulsa Voltage Domains

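[Figure: die map of the Core, Uncore, I/O and PLL voltage domains over the blocks 16MB L3, TAG (x2), 1MB L2 (x2), Core 0, Core 1, FSB TOP, FSB BOT and Control Logic, with a cut line]

[Figure: voltage profile along the cut line through the cores, Ctrl + Tag and the 16MB array. Voltage axis marks: 0V, 0.25V, 1.10V, 1.25V. A virtual VSS level is shown]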


Tulsa Clock Domains [4]

[Figure: die map of the Core, Uncore, I/O and PLL clock domains over the blocks 16MB L3, TAG (x2), 1MB L2 (x2), Core 0, Core 1, FSB TOP and FSB BOT. Legend: System Clock (BCLK)]


Tulsa Uncore Clock Distribution [4]

[Figure: uncore clock distribution. Labeled elements: PLL (Clock Generator), Core dense MCLK grid, Un-Core pre-global MCLK spine, Un-Core sparse SCLK grid, De-skew buffer, Un-Core ZCLK grid, Un-Core pre-global ZCLK spine, Horizontal clock spines, Vertical clock spines]


Montecito Clock System [5]

[Figure: clock system block diagram. Labeled elements: Fixed Supply, Variable Supply, Bus Clock, Pins, Fuses, Divisors, Frequency Translation Table, PLL, 1/M, 1/N and 1/1 dividers, DFD, Phase Aligner, Balanced Tree Clock Distribution, SLCB, RAD, RVD, CVD, Gater, I/Os, Foxton, Bus Logic, Core0, Core1]


Montecito Clock Distribution [6]

[Figure: clock distribution network. Labeled elements: Bus Clock, PLL, DFD, L0 route, L1 route, L2 route, L3 Route, REPEATERS, SLCB, RAD, GATERS, CVD, Latches, IOs, Foxton, Bus Logic, core0, core1. Annotations: "Variable Frequency, Full Rail Transitions", "Fixed frequency, Low Voltage Swings", "Differential", "Single Ended"]


Single-Ended vs. Differential Clocks

• Differential clock
  – Lower skew
  – High power
  – Longer distance between repeaters

• Single-ended clock
  – Lower power
  – Need sharp edges to control skew

[Figure: routing cross-sections showing a differential CLK-/CLK+ pair between GND lines and a single-ended CLK routed with GND and VDD lines]


Dense vs. Sparse Grid Tiles

[Figure: three clock grid tiles: Core Full Grid Tile, Un-Core Sparse Tile and Un-Core Dense Tile, drawn on metal layers M5 to M8 with dimension labels 85um, 194um, 400um, 450um and 1200um]


Xeon® Processor Package

• 12 layers organic package (53.3 mm/side)

• 4-4-4 stacking

• Integrated heat spreader (38.5 mm/side)

• 604 total pins

• 366 signal I/Os

• System management components and decoupling capacitors on package


Itanium® 2 Processor Package

[Figure: package photo with labels: Power delivery connector, Heat spreader, System Management Components, Decoupling Caps]


Intel® Core™ Duo Processor Package


Design for Test and Debug Features

L3 cache DFT/DFM
• Built-in pattern generator (PBIST)
• Programmable weak-write test
• Low-yield analysis
• Stability test mode
• 32-entry cache line disable (Pellston)

FSB DFT/DFM
• I/O loopback
• I/O test generator

Die-level DFT/DFM
• Parallel structural core test with XOR
• Scan and observability registers (scan-out)
• Three TAP controllers (core0, core1, uncore)
• Within-die process monitors
• On-die clock shrink


Itanium® 2 Thermal Map

• Two thermal sensors per core
• Mux thermal diodes into VCOs to measure temp


Xeon® Infra-red Emission Image

[Figure labels: Temperature Sensors, Thermal Diode]


Potential Multi-Core Thermal Control

• Multiple core designs require a central thermal management unit and an efficient mechanism for transmitting thermal sensor measurements [8]

[Figure: floorplan with eight Core blocks, eight Cache blocks and two I/O blocks, 16 thermal sensors (TS) distributed across the die, and one central TMU. Legend: TS = Thermal Sensor, TMU = Thermal Management Unit]
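The role of a central thermal management unit can be sketched as a function that collects every sensor reading and makes one throttling decision. This is a hypothetical illustration, not the mechanism in [8]: the thresholds, the hysteresis band, the sensor names and the single throttle flag are all assumptions.

```python
def tmu_step(readings_c, throttled, throttle_at_c=90.0, resume_below_c=85.0):
    """Return (hottest sensor name, new throttle state) from all sensor readings.

    Thresholds are hypothetical; the gap between them adds hysteresis so the
    throttle state does not toggle on every small temperature change.
    """
    hottest_name = max(readings_c, key=readings_c.get)
    hottest = readings_c[hottest_name]
    if hottest >= throttle_at_c:
        throttled = True
    elif hottest < resume_below_c:
        throttled = False
    return hottest_name, throttled


# Hypothetical readings from three of the on-die sensors, in degrees Celsius.
state = False
for readings in (
    {"core0": 70.0, "core3": 92.0, "cache1": 60.0},
    {"core0": 72.0, "core3": 87.0, "cache1": 61.0},
    {"core0": 71.0, "core3": 80.0, "cache1": 60.0},
):
    name, state = tmu_step(readings, state)
    print(name, state)
```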


Summary
• Dual-core processors cover the entire compute spectrum, from laptops to desktops and servers
• Moore’s Law applied to multi-core processors:
  – Process technology scaling enables number of cores and cache size to double every two years
• Power and leakage reduction circuit techniques are essential for multi-core processors
  – Massive Long-Le usage → many slow transistors and a few fast devices
  – Cache sleep and shut-off modes (static and dynamic)
• Multiple on-die clock and voltage domains are required to control active power and leakage
  – Need new verification tools and automated checks


References

[1] S. Rusu, et al., “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 5.3, Feb. 2006.

[2] S. Naffziger, et al., “The Implementation of a 2-Core, Multi-Threaded Itanium® Family Processor,” IEEE J. Solid-State Circuits, pp. 197-209, Jan. 2006.

[3] R. Korner, “Yonah and Sossaman Processor Briefing,” Intel Developer Forum, Fall 2005.

[4] S. Tam, et al., “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 21.2, 2006.

[5] T. Fischer, et al., “A 90-nm variable frequency clock system for a power managed Itanium® architecture processor,” IEEE J. Solid-State Circuits, pp. 217–227, Jan. 2006.

[6] P. Mahoney, et al., “Clock Distribution on a Dual-core Multi-threaded Itanium™-Family Microprocessor,” ISSCC Dig. Tech. Papers, 2005.

[7] S. Rusu, “Timing analysis of high-speed VLSI designs - Trends and challenges,” International Workshop on Timing Issues (TAU), 1997.

[8] S. Rusu and S. Tam, “Apparatus for thermal management of multiple core microprocessors,” US patent 6,908,272, issued 7/21/2005.