Thousand Core Chips: A Technology Perspective. Shekhar Borkar, Intel Corp. June 7, 2007.
Thousand Core Chips
A Technology Perspective
Shekhar Borkar
Intel Corp.
June 7, 2007
3
Outline
Technology outlook
Evolution of multi-core: thousands of cores?
How do you feed thousands of cores?
Future challenges: variations and reliability
Resiliency
Summary
4
Technology Outlook
High volume manufacturing:   2004  2006  2008  2010  2012  2014  2016  2018
Technology node (nm):        90    65    45    32    22    16    11    8
Integration capacity (BT):   2     4     8     16    32    64    128   256
Delay = CV/I scaling:        0.7, ~0.7, >0.7; delay scaling will slow down
Energy/logic op scaling:     >0.35, >0.5, >0.5; energy scaling will slow down
Bulk planar CMOS:            high probability in early generations, low probability later
Alternate, 3G etc.:          low probability in early generations, high probability later
Variability:                 medium, then high, then very high
ILD (K):                     ~3, <3; reduce slowly towards 2-2.5
RC delay:                    1     1     1     1     1     1     1     1
Metal layers:                6-7, 7-8, 8-9; 0.5 to 1 layer per generation
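The first rows of the table follow a simple pattern: integration capacity doubles every two-year generation, and the node shrinks by roughly 0.7x. The sketch below regenerates both rows from the 2004 values. The function names and the exact 0.7 factor are illustrative assumptions; the slide's node values are rounded generation names, so the computed sizes only approximate them.

```python
def integration_capacity_bt(year, base_year=2004, base_bt=2):
    """Integration capacity in billions of transistors, doubling every 2 years."""
    generations = (year - base_year) // 2
    return base_bt * 2 ** generations


def approx_node_nm(year, base_year=2004, base_nm=90.0, shrink=0.7):
    """Approximate feature size, assuming a 0.7x shrink per generation."""
    generations = (year - base_year) // 2
    return base_nm * shrink ** generations


for year in range(2004, 2020, 2):
    print(year, integration_capacity_bt(year), round(approx_node_nm(year), 1))
```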
5
Terascale Integration Capacity
[Figure: total transistors (millions, log scale from 1 to 1,000,000) vs. year, 2001 to 2017, for a 300mm2 die. Annotations: ~1.5B logic transistors, ~100MB cache.]
100+B transistor integration capacity
6
Scaling Projections
[Figure: frequency (GHz, 0 to 50) vs. year, 2001 to 2017. Curves: 1.5X ideal, 1.25X realistic.]
[Figure: Vdd (volts, 0 to 1.2) vs. year, 2001 to 2017. Curves: 0.7X ideal, realistic.]
[Figure: power (watts, 0 to 1,400) vs. year, 2001 to 2017, for a 300mm2 die. Annotation: power too high.]
Freq scaling will slow down
Vdd scaling will slow down
Power will be too high
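To see how quickly the ideal and realistic frequency curves diverge, the scaling factors can be compounded. This is only a sketch: it assumes the 1.5X, 1.25X, and 0.7X labels are per-generation factors and that 2001 to 2017 spans eight two-year generations, neither of which the slide states outright. Results are relative to the starting point, not absolute GHz or volts.

```python
def compound(factor_per_generation, generations):
    """Cumulative scaling after a number of generations."""
    return factor_per_generation ** generations


GENERATIONS = 8  # assumption: 2001 to 2017, one generation every two years

ideal_freq = compound(1.5, GENERATIONS)       # relative frequency, ideal
realistic_freq = compound(1.25, GENERATIONS)  # relative frequency, realistic
ideal_vdd = compound(0.7, GENERATIONS)        # relative Vdd, ideal

print(round(ideal_freq, 1), round(realistic_freq, 1), round(ideal_vdd, 3))
```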
7
Why Multi-core? Performance
[Figure: performance (X) vs. area (X) or power (X), log-log axes from 1 to 10, slope ~0.5.]
Pollack's Rule: 2X power = 1.4X performance
[Figure: relative performance (log scale, 1 to 1,000) vs. year, 2001 to 2017. Curves: single core, multi-core (potential); gap > 10X.]
Ever-increasing single cores yield diminishing performance in a power envelope
Multi-cores provide potential for near-linear performance speedup
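The slope of ~0.5 on a log-log plot means single-core performance grows with the square root of the area or power spent on the core. A minimal sketch of that relationship; the function name is illustrative and not from the slides.

```python
import math


def pollack_performance(relative_resources):
    """Relative single-core performance for a given relative area or power.

    Pollack's Rule: performance scales as the square root of the resources.
    """
    return math.sqrt(relative_resources)


# Doubling the power (or area) of one core buys about 1.4X performance.
print(round(pollack_performance(2), 2))
# Ten times the resources buys only a little over 3X.
print(round(pollack_performance(10), 2))
```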
8
Why Dual-core? Power
Rule of thumb (in the same process technology):
Voltage   Frequency   Power   Performance
1%        1%          3%      0.66%
Single core with cache: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
Dual core with cache: Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8
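The dual-core numbers can be roughly reproduced by applying the rule of thumb linearly to a 15% voltage and frequency reduction and then doubling the core count. This is a sketch under that linear-extrapolation assumption; the constant and function names are illustrative, and it assumes the workload uses both cores fully.

```python
POWER_PCT_PER_VOLTAGE_PCT = 3.0   # 1% voltage change -> 3% power change
PERF_PCT_PER_VOLTAGE_PCT = 0.66   # 1% voltage change -> 0.66% performance change


def scaled_core(voltage_change_pct):
    """Relative (power, performance) of one core after a voltage/frequency change."""
    power = 1 + POWER_PCT_PER_VOLTAGE_PCT * voltage_change_pct / 100
    perf = 1 + PERF_PCT_PER_VOLTAGE_PCT * voltage_change_pct / 100
    return power, perf


core_power, core_perf = scaled_core(-15)
dual_power = 2 * core_power   # two cores, each at reduced voltage and frequency
dual_perf = 2 * core_perf     # assumes both cores are kept busy

print(round(dual_power, 2), round(dual_perf, 2))
```

Under these assumptions the pair lands at about 1.1X power and 1.8X performance, close to the slide's Power = 1, Perf = ~1.8.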
9
From Dual to Multi
[Diagram: one large core with cache next to four small cores (C1 to C4) with cache. Bar charts compare power and performance of the large and small cores on a scale of 1 to 4.]
Small core relative to large core: Power = 1/4, Performance = 1/2
Multi-core: power efficient
Better power and thermal management
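Using the slide's figures for a small core (one quarter of the power, half of the performance), four small cores fit in the power budget of one large core. The sketch below totals both options; it assumes a workload that keeps all four cores busy, and the names are illustrative.

```python
from fractions import Fraction

LARGE = {"power": Fraction(1), "perf": Fraction(1)}
SMALL = {"power": Fraction(1, 4), "perf": Fraction(1, 2)}


def cluster(core, count):
    """Total power and throughput of `count` identical cores, all kept busy."""
    return {"power": core["power"] * count, "perf": core["perf"] * count}


four_small = cluster(SMALL, 4)
print(four_small["power"], four_small["perf"])  # 1 2
```

At equal power, the four small cores offer twice the throughput of the large core, and the power is spread over four separate blocks rather than concentrated in one.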
10
Future Multi-core Platform
[Diagram: an array of general purpose cores (GP), special purpose HW blocks (SP), and blocks labeled C, connected by an interconnect fabric.]
Heterogeneous Multi-Core Platform (SOC)
11
Fine Grain Power Management
[Diagram: the same core array shown in three states.]
Cores with critical tasks: Freq = f, at Vdd; TPT = 1, Power = 1
Non-critical cores: Freq = f/2, at 0.7xVdd; TPT = 0.5, Power = 0.25
Cores shut down: TPT = 0, Power = 0
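The Power = 0.25 figure for non-critical cores is consistent with dynamic power scaling as frequency times voltage squared. The sketch below checks that and totals throughput and power for a mix of core states. The f x V^2 model and the core counts are assumptions for illustration, not values stated on the slide.

```python
def relative_power(freq_scale, vdd_scale):
    """Relative dynamic power, assuming P is proportional to f * Vdd^2."""
    return freq_scale * vdd_scale ** 2


# Per-core (throughput, power) for each state, using the slide's rounded values.
STATES = {
    "critical": (1.0, 1.0),       # f at Vdd
    "non_critical": (0.5, 0.25),  # f/2 at 0.7 x Vdd
    "off": (0.0, 0.0),            # shut down
}


def chip_totals(core_counts):
    """Sum throughput and power over a dict of {state: number_of_cores}."""
    tpt = sum(STATES[s][0] * n for s, n in core_counts.items())
    power = sum(STATES[s][1] * n for s, n in core_counts.items())
    return tpt, power


print(round(relative_power(0.5, 0.7), 3))  # close to the slide's 0.25
# Hypothetical mix of 16 cores.
print(chip_totals({"critical": 6, "non_critical": 5, "off": 5}))
```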
12
Performance Scaling
[Figure: performance (0 to 10) vs. number of cores (0 to 30).]
Amdahl’s Law: Parallel Speedup = 1/(Serial% + (1-Serial%)/N)
Serial% = 6.7%: N = 16, N1/2 = 8 (16 cores, Perf = 8)
Serial% = 20%: N = 6, N1/2 = 3 (6 cores, Perf = 3)
Parallel software key to multi-core success
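Both data points can be checked directly against the Amdahl's Law formula. A minimal sketch; the function name is illustrative.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


for serial, cores in [(0.067, 16), (0.20, 6)]:
    speedup = amdahl_speedup(serial, cores)
    limit = 1.0 / serial  # speedup ceiling as the core count grows without bound
    print(f"serial={serial:.1%} cores={cores} speedup={speedup:.2f} limit={limit:.1f}")
```

In both cases the chip delivers about half the performance its core count suggests, and no number of additional cores can exceed the 1/Serial% ceiling.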
13
From Multi to Many...
Die: 13mm, 100W, 48MB cache, 4B transistors, in 22nm
Core size   Cores   Single core performance (relative)
Large       12      1
Med         48      0.5
Small       144     0.3
[Figure: system performance (0 to 30) for Large, Med, and Small cores under TPT, OneApp, TwoApp, FourApp, and EightApp workloads.]
14
From Many to Too Many...
Die: 13mm, 100W, 96MB cache, 8B transistors, in 16nm
Core size   Cores   Single core performance (relative)
Large       24      1
Med         96      0.5
Small       288     0.3
[Figure: system performance (0 to 30) for Large, Med, and Small cores under TPT, OneApp, TwoApp, FourApp, and EightApp workloads.]
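The single-core performance values of 1, 0.5, and 0.3 line up with Pollack's Rule applied to the core counts. The sketch below assumes the same total core area is divided among 12, 48, or 144 cores (24, 96, or 288 at 16nm), which the slides imply but do not state; the names are illustrative.

```python
import math

CORE_COUNTS = {
    "22nm": {"Large": 12, "Med": 48, "Small": 144},
    "16nm": {"Large": 24, "Med": 96, "Small": 288},
}


def relative_core_performance(n_cores, n_large_cores):
    """Pollack's Rule: per-core performance ~ sqrt(per-core area).

    Per-core area is taken relative to the large core, assuming the total
    core area is the same for every option.
    """
    relative_area = n_large_cores / n_cores
    return math.sqrt(relative_area)


for node, counts in CORE_COUNTS.items():
    for size, n in counts.items():
        perf = relative_core_performance(n, counts["Large"])
        print(node, size, n, round(perf, 2))
```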
15
On Die Network Power
[Figure: relative throughput (log scale, 1 to 10,000) vs. year, 2001 to 2017. Curves: small, 1.5MT core, ~1000 cores; large, 15MT core, ~100 cores.]
[Figure: network power (W, log scale, 1 to 1,000) vs. year, 2001 to 2017, for a 300mm2 die with 4B wide links and 4 links/core. Annotations: ~150W and ~15W.]
A careful balance of:
1. Throughput performance
2. Single thread performance (core size)
3. Core and network power
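One way to read the network power chart is that the ~150W and ~15W annotations belong to the ~1000-core and ~100-core designs respectively. That pairing is an assumption, as the transcript does not preserve which curve each label sits on. Under it, network power per core is the same in both designs, so total network power grows in step with the core count.

```python
def network_power_per_core(total_network_power_w, n_cores):
    """Average on-die network power attributable to each core, in watts."""
    return total_network_power_w / n_cores


# Assumed pairing of chart annotations with core counts.
small_core_design = network_power_per_core(150, 1000)
large_core_design = network_power_per_core(15, 100)

print(small_core_design, large_core_design)
```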
16
Observations
Scaling multi-core demands more parallelism every generation
• Thread level, task level, application level
Many (or too many) cores does not always mean...
• The highest performance
• The highest MIPS/Watt
• The lowest power
If on-die network power is significant, then power is even worse
Now software, too, must follow Moore’s Law
17
Memory BW Gap
Busses have become wider to deliver necessary memory BW (10 to 30 GB/sec)
Yet, memory BW is not enough
Many Core System will demand 100 GB/sec memory BW
[Figure: core clock vs. bus clock (MHz, 0 to 6000), 1985 to 2010, with the gap between them marked.]
How do you feed the beast?
18
IO Pins and Power
[Figure: power (mW/Gbps, 0 to 30) vs. signaling rate (Gbit/sec, 0 to 20). Curves: state of the art, research.]
State of the art:
• 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec
• 25 mW/Gb/sec = 25 Watts
• Bus-width = 1,000/5 = 200, about 400 pins (differential)
Too many signal pins, too much power
19
Solution
[Diagram: two chips connected by a bus more than 5mm long.]
High speed busses
• Busses are transmission lines
• L-R-C effects
• Need signal termination
• Signal processing consumes power
Solutions:
• Reduce distance to << 5mm
• R-C bus
• Reduce signaling speed (~1Gb/sec)
• Increase pins to deliver BW
• 1-2 mW/Gbps
[Diagram: two chips less than 2mm apart.]
• 100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec
• 2 mW/Gb/sec = 2 Watts
• Bus-width = 1,000/1 = 1,000 pins
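The arithmetic for the long high-speed bus and the short low-speed bus can be put side by side. This sketch uses the slides' rounding of 100 GB/sec to 1,000 Gb/sec (the exact conversion is 800 Gb/sec); the function and parameter names are illustrative.

```python
def io_budget(bandwidth_gbps, mw_per_gbps, lane_rate_gbps, differential):
    """Return (power in watts, number of lanes, number of signal pins)."""
    power_w = bandwidth_gbps * mw_per_gbps / 1000
    lanes = bandwidth_gbps // lane_rate_gbps
    pins = lanes * (2 if differential else 1)
    return power_w, lanes, pins


BANDWIDTH_GBPS = 1000  # slide's rounding of 100 GB/sec

# State of the art: 25 mW/Gb/sec at 5 Gb/sec per lane, differential signaling.
print(io_budget(BANDWIDTH_GBPS, 25, 5, differential=True))
# Short R-C bus: 2 mW/Gb/sec at 1 Gb/sec per lane.
print(io_budget(BANDWIDTH_GBPS, 2, 1, differential=False))
```

The short bus trades signaling speed for pin count: power falls from 25 W to 2 W while the pin count rises from about 400 to 1,000, which only works when the chips sit close enough for that many connections.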
20
Anatomy of a Silicon Chip
[Diagram: Si chip mounted on a package with a heat-sink. Labels: heat, power, signals.]
21
System in a Package
[Diagram: two Si chips side by side on one package.]
Limited pins: 10mm / 50 micron = 200 pins
Signal distance is large (~10 mm): higher power
Complex package
22
DRAM on Top
[Diagram: package, CPU, DRAM, and heat-sink in a stack. Labels: Temp = 85°C, Junction Temp = 100+°C.]
High temp, hot spots
Not good for DRAM
23
DRAM at the Bottom
[Diagram: package, DRAM, CPU, and heat-sink in a stack.]
Power and IO signals go through DRAM to CPU
Thin DRAM die
Through DRAM vias
The most promising solution to feed the beast
24
Reliability
[Figure: soft error FIT/chip (logic & mem), relative, 0 to 150.]
[Figure: time dependent device degradation; Ion (relative, 0 to 1) vs. time (1 to 10).]
[Figure: Jox (relative, log scale, 1 to 10,000) vs. technology node (180, 90, 45, 22). Annotations: Hi-K?; burn-in may phase out...?]
[Figure: extreme device variations; relative (0 to 100) vs. Vt (mV, 100 to 200). Annotation: wider.]
25
Implications to Reliability
Extreme variations (static & dynamic) will result in unreliable components
Impossible to design reliable system as we know today
• Transient errors (Soft Errors)
• Gradual errors (Variations)
• Time dependent (Degradation)
Reliable systems with unreliable components: Resilient Architectures
26
Implications to Test
One-time-factory testing will be out
Burn-in to catch chip infant-mortality will not be practical
Test HW will be part of the design
Dynamically self-test, detect errors, reconfigure, & adapt
27
In a Nut-shell...
100 Billion Transistors
100 BT integration capacity
Billions unusable (variations)
Some will fail over time
Intermittent failures
Yet, deliver high performance in the power & cost envelope
28
Resiliency with Many-Core
Dynamic on-chip testing
Performance profiling
Cores in reserve (spares)
Binning strategy
Dynamic, fine grain, performance and power management
Coarse-grain redundancy checking
Dynamic error detection & reconfiguration
Decommission aging cores, swap with spares
Dynamically...
1. Self test & detect
2. Isolate errors
3. Confine
4. Reconfigure, and
5. Adapt
[Diagram: array of blocks labeled C.]
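The self-test, isolate, and reconfigure loop with spare cores can be sketched in software terms. Everything below is a hypothetical illustration: the class names, the health flag, and the test callback are assumptions, not a description of any actual hardware mechanism from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    core_id: int
    healthy: bool = True  # hypothetical stand-in for a real self-test result


@dataclass
class ManyCoreChip:
    active: list
    spares: list
    retired: list = field(default_factory=list)

    def self_test(self, test):
        """Steps 1-2: run the test on every active core, return the failures."""
        return [core for core in self.active if not test(core)]

    def reconfigure(self, failed):
        """Steps 3-5: decommission failed cores and swap in spares if any remain."""
        for core in failed:
            self.active.remove(core)
            self.retired.append(core)
            if self.spares:
                self.active.append(self.spares.pop(0))


def make_chip(n_active, n_spare):
    cores = [Core(i) for i in range(n_active + n_spare)]
    return ManyCoreChip(active=cores[:n_active], spares=cores[n_active:])


chip = make_chip(n_active=6, n_spare=2)
chip.active[2].healthy = False  # simulate one core degrading over time
chip.reconfigure(chip.self_test(lambda core: core.healthy))
print(len(chip.active), len(chip.spares), [c.core_id for c in chip.retired])
```

The chip keeps six active cores after the failure, at the cost of one spare; once spares run out, the active count shrinks and performance degrades gradually rather than the whole chip failing.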
29
Summary
Moore’s Law with Terascale integration capacity will allow integration of thousands of cores
Power continues to be the challenge
On-die network power could be significant
Optimize for power with the size of the core and the number of cores
3D Memory technology needed to feed the beast
Many-cores will deliver the highest performance in the power envelope with resiliency