Dual-Core Intel Xeon Processor Intel® Server Products Overview
Circuit Technologies for Multi-Core Processor Design · Server Processors L3 Cache Trend Large...
Transcript of Circuit Technologies for Multi-Core Processor Design · Server Processors L3 Cache Trend Large...
1Copyright © 2006, Intel Corporation
Circuit Technologies for Multi-Core Processor Design
Stefan Rusu
Intel Corporation
Santa Clara, CA
Xeon, Itanium, Core are trademarks or registered trademarks of Intel Corporation or itssubsidiaries in the United States and other countries. *Other names and brands may beclaimed as the property of others. Copyright © 2006, Intel Corporation
2Copyright © 2006, Intel Corporation
Outline• Dual-core architectural directions• Interconnect trends• Power and leakage reduction
– Cache Sleep and Shut-off Modes– Long-Le Transistor Usage
• Voltage domains• Clock distribution• Package details• DFT/DFM features• Thermal management• Summary
3Copyright © 2006, Intel Corporation
Dual Core Processors Everywhere!
Intel tips Viiv, Yonah in consumer pushMark LaPedus, 01/06/2006 SAN JOSE, Calif. — Making a big splash in the consumer market, Intel Corp. on
Thursday unveiled a dual-core processor, PC platform and several content alliances
that are said to provide the foundation for digital entertainment and wireless laptops.
Intel unwraps dual-core Xeon server processors Tom Krazit, 10/10/2005At an event Monday in San Francisco, Intel unveiled its first dual-core Xeon chips for
two-processor and four-processor servers, previously known by the Paxville code name.
Intel ready to ship dual-core processorsDaniel A. Begun, 3/15/2005Earlier this year, Intel announced that desktop processors using dual-core technology will
be available by the end of June, but the company recently hinted at March's Intel
Developer Forum (IDF) that dual-core processors could be available even sooner than that.
Laptop
Server
Desktop
4Copyright © 2006, Intel Corporation
Tulsa – Xeon® Processor
• Process technology: 65 nm, 8 Cu interconnect layers• Transistor count: 1.328 Billion• Die area: 435 mm2
1MB L2
Core 0
Shared 16MB L3 Cache
Bus Interface
1MB L2
Core 1
16MB L3
TAG
TAG
1MB L2
1MB L2
Core 1
Core 0
FSB TOP
FSB BOT
Control Logic
5Copyright © 2006, Intel Corporation
Montecito – Itanium® 2 Processor
• Process technology: 90nm, 7 Cu interconnect layers• Transistor count: 1.72 Billion• Die area: 596mm2
1MB L2I
Core 0
12MB L3
Bus Interface
256kB L2D
1MB L2I
Core 1
12MB L3
256kB L2D
Core 1
Core 01MB
L2I
1MB
L2I
12MB L3 12MB L3
6Copyright © 2006, Intel Corporation
Yonah – Mobile/Desktop/Blade Server
• Process technology: 65 nm, 8 Cu interconnect layers• Transistor count: 151 Million• Die area: 90.3 mm2
Core 0
Shared 2MB L2 Cache
Bus Interface
Core 1
Bus
2 MB L2 Cache
Core 0 Core 1
7Copyright © 2006, Intel Corporation
Moore’s Law for Multi-Core Processors
Process technology scaling enables the number of cores to double every two years
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
1.E+10
1980 1990 2000 2010
Tran
sist
ors
386
486
Pentium ®
286
Pentium ® II
Pentium ® III
Itanium ®
Itanium ® 2
Pentium ® 4
Dual-coreItanium ® 2 Dual-core
Xeon ®
8Copyright © 2006, Intel Corporation
Server Processors L3 Cache Trend
Cache size doubles with every process generation
◄ Xeon®
Processors
Itanium® ►Processors
048
1216202428
180nm 130nm 90nm 65nm
On-DieL3 Cache
[MB]
9Copyright © 2006, Intel Corporation
SRAM Cell Size Scaling
SRAM cell size scales ~0.5x per generation
0.57
4421
1.02.4
5.610.6
0.1
1
10
100
0.5 0.35 0.25 0.18 0.13 90n 65n
SR
AM
Cel
l Siz
e (u
m2 )
10Copyright © 2006, Intel Corporation
Server Processors L3 Cache Trend
Large per-core and per-thread cachesenable server-class performance
Dual-core Processor Process L3 cache per
core [MB]L3 cache per thread [MB]
Itanium® 90nm 12
8Xeon® 65nm
6
4
11Copyright © 2006, Intel Corporation
On-chip Interconnect Trend
• Local interconnects scale with gate delay• Global interconnects do not scale
0.1
1
10
100250 180 130 90 65 45 32
Feature size (nm)Relativedelay
Source: ITRS Gate delay (FO4)
Local interconnect (M1,2)
Global interconnectwith repeaters
Global interconnectwithout repeaters
12Copyright © 2006, Intel Corporation
Layout Techniques to Reduce Coupling
• Interleave bussed signals with other busses that switch at a different time
• Eliminates capacitive coupling and reduces inductive noise
• Switch bit order of bussed signals at every turn
• Noise will not be additive across the entire bus route
V
t
Bus_A
Bus_B
012345
304152
Swizzled Bus Route
13Copyright © 2006, Intel Corporation
• Interleave narrow Vcc/Vss lines through bus
• Use staggered inverting buffers [7]
Layout Techniques (Cont)
VccVssVccVssVss
14Copyright © 2006, Intel Corporation
XX
X
L3 Cache Sleep and Shut-off ModesActive Mode Sleep Mode Shut-off Mode
0V
1.1V
Vol
tage
0V250mV
520mVVirtualVSS
2x lowerleakage
2x lowerleakage
Virtual VSS
BlockSelect
Sub-array
SleepBias
Shutoff
Sub-array Sub-array
15Copyright © 2006, Intel Corporation
Leakage Shut-off Infrared Images
16MB SKU
All 16MB insleep mode
8MB SKU
8MB insleep mode
8MB inshut-off mode
Shut-off feature reduces the leakage of the8MB disabled sub-arrays by about 3W
16Copyright © 2006, Intel Corporation
Dynamic Intel® Smart Cache Sizing• First implementation in Yonah dual-core mobile processor
– Dynamic implementation of the shut-off mode
• HW based algorithm predicts cache usage requirements– Considers the % of time the CPU is in Active state compared to
the various sleep states
• During periods of low activity or inactivity the processor dynamically adapts its effective cache size– Cache content is gradually flushed to system memory– Cache ways are gradually turned off (physically as well as
logically), thus reducing power
• Cache ways are re-powered on demand to deliver full performance when needed
17Copyright © 2006, Intel Corporation
Leakage Mitigation: Long-Le Transistors
• All transistors can be either nominal or long-Le
• Most library cells are available in both flavors
• Long-Le transistors are about 10% slower, but have 3x lower leakage
• All paths with timing slack use long-Le transistors
• Initial design uses only long channel devices
Nominal Le
Long Le(Nom+10%)
18Copyright © 2006, Intel Corporation
Long-Le Transistors Usage Map
0%
20%
40%
60%
80%
100%
Long
-Le
Usa
ge (%
)
Core 0
Cor 1 L3 CacheControl
19Copyright © 2006, Intel Corporation
Long-Le Transistors Summary
Percentage of Long-Le device width excluding RAM arrays:
Cores Uncore
To reduce sub-threshold leakage, most devices will be slower and only a handful of transistors will be fast
Long-Le54%
Nominal46%
Long-Le76%
Nominal24%
20Copyright © 2006, Intel Corporation
10e-12
100e-12
1e-5 1e-4 1e-3 1e-2 1e-1 1e+0
Normalized Ioff
Nor
mal
ized
iso-
load
dela
y
1
Low-VT
High-Vt
Two-stackLow -Vt
Two-stackHigh-Vt
10e-12
100e-12
1e-5 1e-4 1e-3 1e-2 1e-1 1e+0
Normalized Ioff
Nor
mal
ized
iso-
load
dela
y
1
Low-VT
High-Vt
Two-stackLow -Vt
Two-stackHigh-Vt
wu?½ wwl?½ wwu+wl = w
10e-12
100e-12
1e-5 1e-4 1e-3 1e-2 1e-1 1e+0
Normalized Ioff
Nor
mal
ized
iso-
load
dela
y
1
Low-VHigh-Vt
Two-stackLow -Vt
Two-stackHigh-Vt
wu?½ wwl?½ wwu+wl = w
11
10
Nor
mal
ized
del
ayun
der i
so- in
put l
oad
10e-12
100e-12
1e-5 1e-4 1e-3 1e-2 1e-1 1e+0
Nor
mal
ized
iso-
load
dela
y
1
Low-High-Vt
Two-stackLow -Vt
Two-stackHigh-Vt
wu?½ wwl?½ wwu+wl = w
10e-12
100e-12
1e-5 1e-4 1e-3 1e-2 1e-1 1e+0Normalized Ioff
Nor
mal
ized
iso-
load
dela
y
1
Low-High-VT
Two-stackLow -VT
Two-stackHigh-VT
wu?½ wwl?½ wwu+wl = w
11
10
Nor
mal
ized
del
ayun
der i
so- in
put l
oad
Stack Forcing
w
wu
wl
wu
wl
wu
wl
wu
wl
Equal Loading
Leakage Reduction
Performance Loss
Wu = Wl
• Force one transistor into a two transistor stack with the same input load
• Can be applied to gates with timing slack• Trade-off between transistor leakage and speed[Narendra, et al – ISLPED 2001]
21Copyright © 2006, Intel Corporation
Montecito Voltage Domains [2]
Core 0,frequency tracks voltage, 40W
250m
W
Foxton Control1GHz fixed frequency
Bus ArbiterFixed frequency, 2.5W
Bus
I/O
, fix
ed fr
eque
ncy,
8W
to
tal
12M
B L
3 ca
che,
as
ynch
rono
us
oper
atio
n, 2
.5W
12MB
L3 cache, asynchronous operation, 2.5W
Core 1,frequency tracks voltage, 40W
f1
22Copyright © 2006, Intel Corporation
Tulsa Voltage Domains
Uncore
16MB L3
TAG
TAG
1MB L2
1MB L2
Core 1
Core 0
FSB TOP
FSB BOT
Control Logic I/O
Core
PLL
0V
1.25V
Vol
tage
1.10V
0.25VCores 16MB arrayCtrl
+Tag
Voltage ProfileCut Line
VirtualVSS
23Copyright © 2006, Intel Corporation
Tulsa Clock Domains [4]
Uncore I/OCore PLL
16MB L3
TAG
TAG
1MB L2
1MB L2
Core 1
Core 0
FSB TOP
FSB BOT
Legend:
System Clock (BCLK)
24Copyright © 2006, Intel Corporation
Tulsa Uncore Clock Distribution [4]
PLL (Clock Generator)
Core dense MCLK grid
Un-Core pre-global MCLK spine
Un-Core sparse SCLK grid
De-skew buffer
Un-Core ZCLK grid
Un-Core pre-global ZCLK spine
Horizontal clock spines
Vertical clock spinesPLL (Clock Generator)PLL (Clock Generator)
Core dense MCLK gridCore dense MCLK grid
Un-Core pre-global MCLK spineUn-Core pre-global MCLK spine
Un-Core sparse SCLK gridUn-Core sparse SCLK grid
De-skew bufferDe-skew buffer
Un-Core ZCLK gridUn-Core ZCLK grid
Un-Core pre-global ZCLK spineUn-Core pre-global ZCLK spine
Horizontal clock spines
Vertical clock spines
25Copyright © 2006, Intel Corporation
Montecito Clock System [5]
FixedSupply
VariableSupply
Bus Clock
I/Os
Foxton
Bus Logic
Core0
Core1
1/N
CVD
RAD
Gater
1/1
GaterCVD
RVD
DFD
DFD
SLCB CVD GaterPLL
DFD
SLCB CVD Gater
SLCB
DFD
CVD Gater
1/M
DFD
DFD
1/NRAD
SLCB
SLCB
DivisorsFrequency Translation Table
Fuses
PinsBalanced
Tree ClockDistribution
Phase Aligner
26Copyright © 2006, Intel Corporation
Montecito Clock Distribution [6]
Variable FrequencyFull Rail Transitions
Bus Clock
IOs
Fixed frequencyLow Voltage Swings
Foxton
Bus Logic
REPEATERS
core0
core1
SLCB
RAD
RAD
GATERSCVDSLCB
DFD
DFD
PLL
DFD
DFD
DFD
DFD
L0 route L1 route L2 route
Latches
Latches
Latches
Latches
Latches
Latches
Latches
SLCB
SLCB
SLCB
GATERSCVD
GATERSCVD
GATERSCVD
GATERSCVD
L3 Route
Differential Single Ended
27Copyright © 2006, Intel Corporation
Single-Ended vs. Differential Clocks
• Differential clock – Lower skew – High power– Longer distance
between repeaters
• Single-ended clock– Lower power– Need sharp edges
to control skew
GNDCLK -CLK +GND
GND VDD VDD VDDGND GND
GNDCLKVDD
28Copyright © 2006, Intel Corporation
Dense vs. Sparse Grid Tiles
M7M7
85um
M6
M7M7
M5
450u
m
194umM7
M8 M6
1200um
1200um1200um1200um1200um
400umM8
M7
M6M7M7
85um
M6
M7M7
M5
450u
m
M7M7
85um85um
M6
M7M7
M5
450u
m45
0um
194umM7
M8 M6
1200um
194um194umM7
M8 M6
1200um1200um
1200um1200um1200um1200um
400umM8
M7
M6
1200um1200um1200um1200um
400umM8
M7
M6
Core Full Grid Tile Un-Core Sparse Tile
Un-Core Dense Tile
29Copyright © 2006, Intel Corporation
Xeon® Processor Package
• 12 layers organic package (53.3 mm/side)
• 4-4-4 stacking
• Integrated heat spreader (38.5 mm/side)
• 604 total pins
• 366 signal I/Os
• System management components and decoupling capacitorson package
30Copyright © 2006, Intel Corporation
Itanium® 2 Processor Package
Powerdeliveryconnector
Heat spreader
SystemManagementComponents
DecouplingCaps
31Copyright © 2006, Intel Corporation
Intel® Core™ Duo Processor Package
32Copyright © 2006, Intel Corporation
Design for Test and Debug Features
L3 cache DFT/DFM
•Built-in pattern generator (PBIST)
•Programmableweak-write test
•Low-yield analysis
•Stability test mode
•32-entry cache line disable (Pellston)
FSB DFT/DFM
•I/O loopback
•I/O test generator
Die-level DFT/DFM
•Parallel structural core test with XOR
•Scan and observability registers (scan-out)
•Three TAP controllers(core0, core1, uncore)
•Within-die process monitors
•On-die clock shrink
33Copyright © 2006, Intel Corporation
• Two thermal sensors per core• Mux thermal diodes into VCOs to measure temp
Itanium® 2 Thermal Map
34Copyright © 2006, Intel Corporation
Xeon® Infra-red Emission Image
Temperature Sensors
ThermalDiode
35Copyright © 2006, Intel Corporation
Potential Multi-Core Thermal Control
• Multiple core designs require a central thermal management unit and an efficient mechanism for transmitting thermal sensor measurements [8]
Cache
TS Thermal Sensor TMU Thermal Management Unit
Cache Cache Cache
Cache Cache Cache Cache
Core Core Core Core
Core Core Core Core
I/O I/O
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TS
TMU
36Copyright © 2006, Intel Corporation
Summary• Dual-core processors cover the entire compute
spectrum, from laptops, to desktop and servers• Moore’s Law applied to multi-core processors:
– Process technology scaling enables number of coresand cache size to double every two years
• Power and leakage reduction circuit techniques are essential for multi-core processors– Massive Long-Le usage → many slow transistors and
a few fast devices– Cache sleep and shut-off modes (static and dynamic)
• Multiple on-die clock and voltage domains are required to control active power and leakage– Need new verification tools and automated checks
37Copyright © 2006, Intel Corporation
References[1] S. Rusu, et al., “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3
Cache,” ISSCC Dig. Tech. Papers, Paper 5.3, Feb. 2006.[2] S. Naffziger, et al., “The Implementation of a 2-Core, Multi-Threaded Itanium®
Family Processor,” IEEE J. Solid-State Circuits, pp. 197-209, Jan. 2006.[3] R. Korner, “Yonah and Sossaman Processor Briefing”, Intel Developer
Forum, Fall 2005[4] S. Tam, et al., “Clock Generation and Distribution of a Dual-Core Xeon®
Processor with 16MB L3 Cache,” ISSCC Dig. Tech. Papers, Paper 21.2, 2006.[5] T. Fischer, et al., “A 90-nm variable frequency clock system for a power
managed Itanium® architecture processor,” IEEE J. Solid-State Circuits, pp. 217–227, Jan. 2006.
[6] P. Mahoney, et al., “Clock Distribution on a Dual-core Multi-threaded Itanium™-Family Microprocessor,” ISSCC Dig. Tech. Papers, 2005
[7] S. Rusu, “Timing analysis of high-speed VLSI designs - Trends and challenges,” International Workshop on Timing Issues (TAU), 1997
[8] S. Rusu and S. Tam, “Apparatus for thermal management of multiple core microprocessors,” US patent 6,908,272, issued 7/21/2005