© 2013 IBM Corporation
High Performance Microprocessor Design and Automation:
Challenges and Opportunities
Ruchir Puri IBM Fellow, VLSI Systems
IBM Thomas J Watson Research Center, Yorktown Heights, NY
First, something about lunch and technical work…
[Figure: cartoon with labels "LUNCH" and "ISPD/TAU attendee"]
The Technology Era: Frequency Scaling
Once upon a time, life used to be great: technology was the superman, design tagged along for the ride, and even EDA grabbed the designer's legs for the fun!
Characteristics of Single Thread Era
[Figure: frequency (GHz, 0 to 6) for POWER4, POWER5, POWER6]
[Figure: TXs per core (0 to 500) for POWER4, POWER5, POWER6]
Dennard Scaling
Exponential Frequency Growth
Optical Scaling / Node Migration
Expanding uArch Complexity
Single Thread Era EDA: Transistor Analysis & Optimization
Static timing analysis of complex circuits
Transistor Level timing optimization
[Figure: dynamic circuit timing waveforms for clkin, w0, w2, w3, w3_int, and fbk over one cycle (evaluate, precharge), with 1st timing (fbk = 0) and 2nd timing (fbk = 1, w2 = 1)]
[Figure: histogram of # of paths (0 to 1200) vs. slack (ps, -6 to 8), pre-tuning and post-tuning]
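As a rough illustration of where the slack values in such a histogram come from, here is a minimal block-based static timing sketch over a small combinational DAG. It is a gate-level simplification, not the transistor-level analysis the slide refers to, and the node names, delays, and required time are invented for the example.

```python
from collections import defaultdict, deque

def compute_slacks(edges, required_time):
    """Return {node: slack} for a combinational DAG.

    edges: list of (src, dst, delay_ps). All names and numbers are
    hypothetical; slack = required time - arrival time at each node.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst, delay in edges:
        succ[src].append((dst, delay))
        indeg[dst] += 1
        nodes.update((src, dst))

    # Topological order (Kahn's algorithm).
    order = []
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:
        n = queue.popleft()
        order.append(n)
        for m, _ in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)

    # Forward pass: latest arrival time at each node.
    arrival = {n: 0.0 for n in nodes}
    for n in order:
        for m, d in succ[n]:
            arrival[m] = max(arrival[m], arrival[n] + d)

    # Backward pass: earliest required time at each node.
    required = {n: float(required_time) for n in nodes}
    for n in reversed(order):
        for m, d in succ[n]:
            required[n] = min(required[n], required[m] - d)

    return {n: required[n] - arrival[n] for n in nodes}

example_edges = [("a", "g1", 30), ("b", "g1", 40), ("g1", "out", 50), ("b", "out", 20)]
slacks = compute_slacks(example_edges, required_time=100)
```

Tuning then amounts to changing delays so that negative-slack paths move to the right of zero, which is the shift the pre-tuning and post-tuning curves show.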
End of Frequency Scaling: The Power Wall
[Figure: power density (W/cm²) vs. gate length (microns), 1994 to 2004: active power and passive power curves, with the air cooling limit marked]
Inability to scale Oxide thickness & lower voltage resulted in a power wall for single thread performance
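A minimal sketch of why this happens, using the textbook dynamic power relation P ∝ C·V²·f (an assumption brought in for illustration; the slide does not state it). Under ideal Dennard scaling by a factor k, capacitance and voltage shrink by 1/k, frequency rises by k, and area shrinks by 1/k², so power density stays flat. If voltage stops scaling, power density grows as k². Passive (leakage) power is ignored here.

```python
def power_density_ratio(k, voltage_scales=True):
    """Relative dynamic power density after shrinking linear dimensions by k (> 1).

    Uses P ~ C * V^2 * f with idealized scaling rules; this is a
    simplification for illustration, not a technology model.
    """
    capacitance = 1.0 / k
    voltage = 1.0 / k if voltage_scales else 1.0
    frequency = k
    area = 1.0 / k**2
    dynamic_power = capacitance * voltage**2 * frequency
    return dynamic_power / area

ideal = power_density_ratio(2.0)                          # voltage scales with dimensions
stuck = power_density_ratio(2.0, voltage_scales=False)    # voltage held constant
```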
Frequency Scaling : POWER6 (65nm, 2007)
5+ GHz operation, >790M transistors, 341 mm² die
65nm SOI with 10 levels of Cu interconnect
Same pipeline depth & power @ 2x frequency versus POWER5
[Figure: POWER6 die floorplan: Core 1 units (LSU, IFU/IDU, VMX, FXU, BFU, DFU, RU), four 2 MB L2 blocks, two L2 directories, two memory controllers, SMP fabric, L3 controller]
Technology Tantrums
[Figure: cartoon with labels "Designers", "Technology", "Design"]
Shock and awe of 65nm: wire delays overtaking gate delays
End of frequency scaling with technology
Squeezing the design hard
Unfortunately, the liner for copper interconnects doesn't scale
[Figure: wire cross-section showing liner and copper]
Multi-Core Era
End of frequency scaling ushered in a new era of innovation with multi-core design
POWER Processors Began the Multi-Core / Multi-Thread Era
POWER4 (2001): introduced the first dual core
POWER5 (2004): dual core; introduces SMT (4 threads)
POWER6 (2007): dual core, 4 threads; enhances SMT efficiency
Life starts to become interesting: the technology ride gets very bumpy
[Figure: IBM Transistor Performance Improvement: relative % improvement (0% to 100%) at 180nm, 130nm, 90nm, 65nm, 45nm, 32nm, split into gain by traditional scaling and gain by innovation]
[Figure: eDRAM deep trench cell: WL, passing WL, BL, BOX, node, deep trench cap (4.0 um), 18fF storage node, 3fF BL (32 cells)]
High-K Metal Gate
eDRAM
Innovation playing a dominant role
POWER7 Processor Chip
567 mm², 45nm SOI w/ eDRAM
1.2B transistors – equivalent function of 2.7B (eDRAM)
Eight processor cores w/ 4 way SMT – 32 threads / chip
32MB on chip eDRAM shared L3
Dual DDR3 Memory Controllers w/ 100GB/s sustained Memory bandwidth
Scalability up to 32 Sockets – 360GB/s SMP bandwidth/chip – 20,000 coherent operations in flight
[Figure: 4-way SMT issue diagram across execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL; legend: Thread 0 executing, Thread 1 executing, Thread 2 executing, Thread 3 executing, no thread executing]
zEC12 (32nm, 2012)
[Figure: zEC12 die floorplan: six cores, six L2 D and six L2 I blocks, L3 cache, MCU, I/Os, GX, GX I/Os, SC I/Os]
32nm high-k CMOS
597 mm²
5.5 GHz
2.75B transistors
15 levels of metal
Multi-Core Era Limiters
[Figure: log(performance) vs. technology node (90, 65, 45, 32, 22, 14, 10): ideal growth curve vs. likely multi-core path]
Technology complexity & rising costs
Power
SW parallelism
Socket BW
Innovation Drive: Architecture & Productivity Innovation
Compute Throughput Potential
Socket Throughput Limitation (power, memory bandwidth)
Need to amplify effective socket throughput to achieve the potential multi-core advantage
High performance uP Designs: Extending Multi-Core Gains (Power processor)
Compute Throughput Potential
Socket Throughput Limitation (power, memory bandwidth)
eDRAM = large, low-power cache
High-bandwidth memory buffer
Low-power off-chip signaling technology
Coherence innovation to minimize socket-to-socket communication
Innovation Drive : System Level Technologies
3D Stacking with Through Silicon Vias
FPGA Accelerators
Flash Memory / SSD
Silicon Photonics
Single Processor–Memory Socket
Heterogeneous systems on Chip
Specialized functions
Specialized cores:
Single thread focused
Throughput focused
As you start to get sleepy… Dilbert to the rescue.
Productivity Innovation: Structured Synthesis and Large Block Synthesis
[Figure: P/Z server macro: quad FPU]
Customs take a large amount of resources, and productivity is key
Merge the domains of custom and synthesis, targeting design productivity and improved quality
– through merging of custom and synthesis hierarchy with structure in synthesis (not random logic any more)
– Global optimization view; targeted structured data paths and synthesis
– A methodology with numerous algorithmic and practical innovations, spanning from incremental logic design processing, to data paths, to structured clocking, to merged custom and synthesis techniques.
Productivity Innovation : Reduce Custom Design (Structured Synthesis)
[Figure: # of customs over time (normalized, 0 to 1)]
Synthesis results w/ custom-like data flow alignment.
>10x reduction over 5 generations
*ISPD 2013 Best Paper Award: “Network Flow Based Datapath Bit Slicing” H. Xiang et al.
Milestone: digital logic in 22nm server-class microprocessors 99% synthesized and signed off by gate-level signoff
Productivity Innovation: Reduced # of Design Partitions (Large Block Synthesis)
Before: 60 logic macros, 25 customs, 14 unique arrays/RFs
After: 1 macro, 0 customs, 9 unique arrays/RFs
Reduced area & power; equal cycle time
[Figure: # of macros over time (normalized, 0 to 1)]
Productivity & TAT Innovation: Gate Level Analysis & Signoff
[Figure: TX level vs. gate level analysis compared on runtime, cleanup work, and accuracy (arbitrary units, 0 to 1)]
Large speedup
Reduced cleanup
Similar accuracy
Productivity & TAT Innovation: Hierarchical Abstraction & Multi-Threading
[Figure: projected chip timing runtime (hrs., 0 to 50) for base, cleanup, coarse parallelism, hierarchical abstracts, and multi-threading]
Fast global analysis tools allow designers to iterate more often, resulting in improved final designs.
Hierarchical abstraction & multi-threading are the most promising ways to minimize TAT.
– Applies to all disciplines (timing, verification, etc.)
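The structure of these two ideas can be sketched as follows: each block is timed on its own, in a separate task, and reduced to an abstract (here a single worst in-to-out delay), and the chip-level pass then touches only the abstracts. Block names, delays, and the single-number abstract are hypothetical simplifications. Note also that CPython threads do not speed up pure-Python computation, so the thread pool only shows the shape of the flow.

```python
from concurrent.futures import ThreadPoolExecutor

def longest_path(edges, source, sink):
    """Worst-case delay from source to sink in a small DAG.

    edges: list of (src, dst, delay), assumed listed so that every edge
    into a node appears before any edge out of it.
    """
    arrival = {source: 0.0}
    for src, dst, delay in edges:
        if src in arrival:
            arrival[dst] = max(arrival.get(dst, float("-inf")), arrival[src] + delay)
    return arrival[sink]

def build_abstracts(blocks, max_workers=4):
    """Time every block independently and keep only its in-to-out delay."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(longest_path, edges, "in", "out")
                   for name, edges in blocks.items()}
        return {name: future.result() for name, future in futures.items()}

def chip_delay(path, abstracts, wire_delay):
    """Top-level timing along one chain of blocks, using abstracts only."""
    return sum(abstracts[block] for block in path) + wire_delay * (len(path) - 1)

blocks = {
    "fxu": [("in", "a", 10), ("in", "b", 5), ("a", "out", 20), ("b", "out", 30)],
    "lsu": [("in", "out", 40)],
}
abstracts = build_abstracts(blocks)
total = chip_delay(["fxu", "lsu"], abstracts, wire_delay=5)
```

When one block changes, only its abstract needs to be rebuilt before the cheap top-level pass is rerun, which is what makes frequent iteration affordable.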
Productivity Innovation Challenges: Retiming
A significant fraction of logic designer effort is spent optimizing cycle boundaries
Retiming enables physical synthesis to optimally place latches in logic cones to balance timing/area/power
Invention is required to seamlessly handle the divergence between functional RTL (Verilog/VHDL) and physical implementation throughout the methodology.
[Figure: three latch placements in a logic cone: doesn't meet cycle time; area/power too high; optimal]
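A toy version of the latch placement decision, assuming a single chain of gates between two fixed cycle boundaries and one movable latch: the best cut is the one that minimizes the slower of the two resulting stages. Real retiming works on general logic cones with many latches and also weighs area and power, which this sketch ignores; the gate delays are made up.

```python
def best_latch_position(gate_delays):
    """Return (cut, cycle_time) for one movable latch in a chain of gates.

    cut = number of gates placed before the latch (0 .. len(gate_delays)).
    cycle_time = delay of the slower of the two stages for that cut.
    """
    total = sum(gate_delays)
    best_cut, best_cycle = 0, total
    prefix = 0
    for cut in range(len(gate_delays) + 1):
        cycle = max(prefix, total - prefix)
        if cycle < best_cycle:
            best_cut, best_cycle = cut, cycle
        if cut < len(gate_delays):
            prefix += gate_delays[cut]
    return best_cut, best_cycle

cut, cycle_time = best_latch_position([30, 10, 20, 40])
```

Here the latch lands after the second gate, giving stages of 40 and 60 delay units instead of the 100 units of the unbalanced extremes.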
Challenges Back-end: Scaling and Interconnects
Continued pressure on wiring. Innovation needed.
[Figure: wire cross-section showing liner and copper]
High-performance designs will not be able to tolerate such large RC increases
• Push for more wiring interconnect layers (coarse-pitch)
• Will still need some number of fine-pitch layers for short run local connections
Improved DA tools (routers) needed*
• Optimize wire plane usage to limit technology complexity
• Negotiate through special design rules for the finest levels
• Via optimization, especially at driver end
• Tricky performance vs wireability tradeoffs
• Many wires will need “special” treatment
• Increase width, push higher, add buffers, etc.
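A back-of-the-envelope sketch of why a non-scaling liner hurts: the liner takes a fixed thickness off both sidewalls and the bottom of the trench, so the copper cross-section shrinks faster than the drawn wire. All dimensions, the 2:1 aspect ratio, and the use of bulk copper resistivity are assumptions for illustration (real narrow wires have higher resistivity still).

```python
def resistance_per_um(width_nm, liner_nm, aspect_ratio=2.0, rho_ohm_m=1.7e-8):
    """Wire resistance in ohm/um for a trench of the given drawn width.

    The liner is assumed to cover both sidewalls and the bottom; only the
    remaining copper area conducts. Values are illustrative, not IBM data.
    """
    height_nm = aspect_ratio * width_nm
    copper_width_nm = width_nm - 2 * liner_nm
    copper_height_nm = height_nm - liner_nm
    if copper_width_nm <= 0 or copper_height_nm <= 0:
        raise ValueError("liner leaves no room for copper")
    area_m2 = (copper_width_nm * 1e-9) * (copper_height_nm * 1e-9)
    return rho_ohm_m / area_m2 * 1e-6  # ohm/m converted to ohm/um

def liner_penalty(width_nm, liner_nm):
    """Resistance with the liner divided by resistance of an all-copper wire."""
    return resistance_per_um(width_nm, liner_nm) / resistance_per_um(width_nm, 0.0)

penalties = {w: liner_penalty(w, liner_nm=3.0) for w in (50, 35, 20)}
```

With the liner held at 3 nm, the penalty grows from roughly 1.2x at a 50 nm width to roughly 1.5x at 20 nm, on top of the resistance increase from the smaller wire itself.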
[Figure: Mx resistance (ohm/um, 0 to 7) for technology generations 9s, 10s, 11s, 12s, 13s]
[Figure: Mx via resistance (ohm, 0 to 20), NOM and WC, for 9s (Cu11), 10s (Cu08), 11s (Cu65), 12s (Cu45)]
*J. Warnock’s talk yesterday at ISPD
Innovation at Technology, Design Interface: Double/Triple Patterning
Need for Double / Triple Patterning
[Figure: pitch (nm, 0 to 150) vs. technology node (32nm, 20nm, future): device pitch and metal pitch against the single exposure limit and the double patterning limit; EUV?]
Ripe field for physical design innovation
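At its core, double patterning layout decomposition is a two-coloring problem: features closer than the single-exposure spacing must go on different masks, and an odd cycle of such conflicts cannot be decomposed without changing the layout. The sketch below reduces features to points and uses a hypothetical spacing threshold; production decomposers work on polygons and can insert stitches.

```python
import math
from collections import deque

def conflict_edges(points, min_same_mask_spacing):
    """Pairs of features too close to be printed on the same mask."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if math.dist(points[i], points[j]) < min_same_mask_spacing]

def assign_masks(num_features, conflicts):
    """Two-color the conflict graph; return a mask (0 or 1) per feature,
    or None if an odd cycle makes double patterning impossible."""
    neighbors = [[] for _ in range(num_features)]
    for a, b in conflicts:
        neighbors[a].append(b)
        neighbors[b].append(a)
    mask = [None] * num_features
    for start in range(num_features):
        if mask[start] is not None:
            continue
        mask[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if mask[v] is None:
                    mask[v] = 1 - mask[u]
                    queue.append(v)
                elif mask[v] == mask[u]:
                    return None
    return mask

row = [(0, 0), (40, 0), (80, 0), (120, 0)]      # coordinates in nm, made up
triangle = [(0, 0), (40, 0), (20, 30)]
row_masks = assign_masks(len(row), conflict_edges(row, 60))
triangle_masks = assign_masks(len(triangle), conflict_edges(triangle, 60))
```

The row of features alternates cleanly between the two masks, while the triangle has three mutual conflicts and returns None, the kind of case that calls for a layout fix or a third mask.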
How We Stage Designs to Market: Current vs. Desired
[Figure: staged timelines of Innovation, Implement, and Go to Market phases for successive designs]
Current: significant effort spent on implementation
Desired: significant effort spent on innovation
Innovation: The sweet spot in this new era
[Figure: two stacked bars of designer time (0% to 100%) split into Wait for Tools, Implement, Plan, Innovate]
New Frontier: Innovating up the Value Stack
Design Implementation Innovation
IP Design Process Innovation
IP Design content creation innovation
System value moving up the stack
Innovation Drive : Hardware-Software Co-Optimization
[Figure: benefit of specialized HW acceleration for an encryption application: base vs. w/ HW accel (scale 0 to 100)]
[Figure: dynamic re-compilation with profile data: original program, profiling hardware, software optimizer, execution engine, dynamically optimized program]
Heterogeneous design: Specialized HW acceleration viable in many areas
•Graphics •Compression •Cryptography…
FPGA Accelerators
Innovation on acceleration in the PD/timing domain?
Hardware Programming
Traditional: VHDL / Verilog, Synthesis, Place & Route, then LUT, FF, RAM (1000s of RTL designers)
High-level: HLL (C/C++, Lime, OpenCL), HL Compiler, VHDL / Verilog, Synthesis, Place & Route, then LUT, FF, RAM (millions of software designers)
Hardware
- Hardware mistakes cannot be patched
- Significant barriers to be overcome between RTL, the physical design domain, and high-level languages.
- Correlation between higher-level spec. and lower-level design.
There is more religion in logic design than the number of priests in religion
Liquid Metal (a high-level hardware compiler)
[Figure: Liquid Metal toolchain: Lime, Lime Compiler, artifact store (bytecode, binary, bitfile), Lime Virtual Machine (LVM), “bus”, CPU, GPU, FPGA]
http://www.research.ibm.com/liquidmetal
Architectural Synthesis
[Figure: successive refinement from functional, to cycle accurate, to RTL (VHDL/Verilog), to back-end PD implementation and analysis. Labels: C/C++ model (twice); Metric: cache miss rate etc.; Metrics: performance models, CPI etc.; Metrics: electrical, timing, area, noise etc.]
Capturing the Vibrant spirit of DA Researcher
Once upon a time there lived three men: a doctor, a chemist, and a DA Researcher. For some reason all three offended the king and were sentenced to die on the same day.
The day of the execution arrived, and the doctor was led up to the guillotine.
As he strapped the doctor to the guillotine, the executioner asked, "Head up or head down?" "Head up," said the doctor.
"Blindfold or no blindfold?" "No blindfold."
So the executioner raised the axe, z-z-z-z-ing! Down came the blade--and stopped barely an inch above the doctor's neck. Well, the law stated that if an execution didn't succeed the first time the prisoner had to be released, so the doctor was set free.
Then the chemist was led up to the guillotine. "Head up or head down?" said the executioner.
"Head up."
"Blindfold or no blindfold?" "No blindfold."
So the executioner raised his axe, z-z-z-z-ing! Down came the blade--and stopped an inch above the chemist's neck. Well, the law stated that if the execution didn't succeed the first time the prisoner had to be released, so the chemist was set free.
Finally the DA Researcher was led up to the guillotine.
"Head up or head down?" "Head up."
"Blindfold or no blindfold?" "No blindfold."
So the executioner raised his axe, but before he could cut the rope, ever curious Chandu yelled out:
"WAIT! I see what the problem is!".
Role of Memory
Systems view of Data processing
Traditional Computing: cache levels, main memory, storage class memory, and storage feed a single processor. Data moves across all levels of the storage hierarchy before and after being processed at the processor.
Hierarchical Data Processing: processing at storage, processing at SCM, processing in memory, and the processor. Increasing opportunity to exploit parallelism.
More Memory Capacity per Socket through More DRAM Chips per Socket
[Figure: memory capacity per socket (GBytes, 0 to 4500), 2005 to 2016: capacity per socket, capacity per socket with pure DRAM, and Moore's Law trend line]
• Memory is becoming pervasive throughout, and bits are becoming more plentiful and cheaper every day (< 50c/GB for flash)
- Memory technology continues to evolve even if scaling slows down.
Scope of memories is expanding
Storage Class Memory – Market View
Technologies: Phase Change Memory (PCM); Spin Torque Transfer MRAM (STT-MRAM); Resistive RAM (RRAM)
Columns: projected date of enterprise availability; vendors researching; initial characteristics (density, cost, latency); risk; comments
[Figure labels: Bit Line Complement, Bit Line, Word Line]
Table entries, in the order they appear in the source:
Micron; Hynix; Samsung
Low density 2015+; high density 2018+
= DRAM; 0.25x NAND; parity in 2015; 4x NAND; 3x DRAM reads, 10x DRAM writes
Most advanced technology. Productization and aggressive dev. on hold until market is identified. Current very low density NOR flash offerings exist for a small market.
2016/2017+
Micron: partnership with IBM but also pursuing other DRAM replacement tech; Hynix: allied w/ Toshiba; Samsung: acquired start-up Grandis
= DRAM; = DRAM in both reads & writes
Early in technology dev. Promising scalability. Manufacturability unexplored.
Target DRAM parity
Low density 2016+; high density 2018+
Micron; Hynix: partnered with HP; Samsung; Sandisk/Toshiba
2x NAND; target NAND parity; > DRAM, < NAND (~us)
Early in technology dev. Promising scalability & performance. Manufacturability unexplored.
Scope of memories: The whole spectrum
Typical access latency in processor cycles (@ 4 GHz)
[Figure: latency axis marked 2^1, 2^3, 2^5, 2^7, 2^9, 2^11, 2^13, 2^15, 2^17, 2^19, 2^21, 2^23, divided into memory system and storage (IO). Technologies shown: L1 (SRAM), eDRAM, DRAM, HDD, NAND, PCM?, RRAM?, STT-MRAM, PCM]
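To put the cycle counts on that axis into wall-clock terms, the conversion at the 4 GHz clock stated on the slide is simply cycles divided by frequency. The helper names below are made up for the example.

```python
CLOCK_HZ = 4e9  # the 4 GHz clock stated on the slide

def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a latency in processor cycles to seconds."""
    return cycles / clock_hz

def format_latency(seconds):
    """Render a duration with the largest unit that keeps the value >= 1."""
    for unit, scale in (("s", 1.0), ("ms", 1e-3), ("us", 1e-6), ("ns", 1e-9)):
        if seconds >= scale:
            return f"{seconds / scale:.3g} {unit}"
    return f"{seconds / 1e-12:.3g} ps"

# Walk the axis marks 2^1, 2^3, ... 2^23.
for exponent in range(1, 24, 2):
    latency = cycles_to_seconds(2 ** exponent)
    print(f"2^{exponent} cycles = {format_latency(latency)}")
```

The axis therefore spans from about half a nanosecond at 2^1 cycles to about two milliseconds at 2^23 cycles, more than six orders of magnitude.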
Computing with Memory
• Potential to solve the systems bottleneck of memory bandwidth in a flexible way for data-intensive applications/tasks
– Reconfigurable on demand (compilation step)
– Robust (ECC protected)
– Verification advantage
– Fast, and enables implementation of explicit parallelism
• Utilizes bit-level parallelism.
• One must think about computational memory from a systems and software perspective (how and where computational memory will fit in and what functions it can implement). Active ongoing research…
Innovating in IP Content Creation
Life Cycle: Technology Trigger, Hype, Disillusionment, Enlightenment, and Commoditization
Summary
• The information technology landscape is changing dramatically
– Value is in innovating across the entire stack and increasingly higher up in the stack
– Key problems remain to be solved in technology, design, and automation as technology continues to scale
– Significant emerging opportunities in new ways to solve system bottlenecks at every level: logic, architecture, memory.
– In the last several years, life became very challenging but also very interesting as the ride has gotten a lot choppier…
– With challenges and opportunities abounding, winners and losers will be decided by which organizations grab these challenges and innovate their way out of the current dilemmas…
[Figure: two stacked bars of designer time (0% to 100%) split into Wait for Tools, Implement, Plan, Innovate]
[Figure: design implementation innovation, IP design process innovation, IP design content creation innovation: system value moving up the stack]
Memoirs of a DA Engineer
A lawyer is flying in a hot air balloon over Lake Tahoe and realizes he is lost.
He reduces height and spots a man down below. He lowers the balloon further and shouts, "Excuse me, can you tell me where I am?"
The man below said, "Yes, you're in a hot air balloon, hovering 30 feet above this field." "You must be a DA engineer," said the lawyer. "I am," replied the man. "How did you know?"
"Well," said the lawyer, "everything you have told me is technically correct, but it's of absolutely no use to anyone." The man below said, "You must be a lawyer." "I am," replied the lawyer, "but how did you know?"
"Well," said the man, "you don't know where you are, or where you're going, but you expect me to be able to help. You're in the same position you were before we met, but now it's my fault."