ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ( [email protected]) · 2003 IEEE NSREC Short Course 2003...
ΗΜΥ 664 (ECE 664): Digital Design with FPGAs
Winter Semester 2010
Lecture 7: FPGAs: Research Issues
ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ ([email protected])
Ack:
• Rick Padovani, Peter Alfke
• Xilinx Corporation
Are Von Neumann processors running out of steam?
[Figure: Compute Density of Processors, in MOPS/MHz/million transistors (axis 0.00 to 0.45), for Pentium MMX (P55C) 1997, Celeron (Mendocino) 1998, Pentium III EB 1999, Pentium III-S 2001, Pentium 4 (Willamette) 2001, and Pentium 4 (Northwood) 2002]
Source: UC Berkeley HERC and CPUscorecard.com
[Figure: Clock Speed in GHz (log axis, 0.1 to 10.0), 1997 to 2005]
• Lack of increased clock speed is being addressed by:
– Increased cache size
– Longer pipelines
– Trying to do more per cycle
• This approach is also nearing its limit
What’s next for Computing Platforms?
• Hyperthreading?
• CMP?
• Clusters?
• Configurable instruction sets?
• Configurable coprocessors?
In general, the need for parallel execution is now recognized as a requirement, as is the desire for customizable instruction sets
Reconfigurable FPGAs to the rescue
• For at least 15 years people have seen the Von Neumann limitations and have argued that FPGAs were the ultimate supercomputer
– Better programmability – not stuck with a fixed ALU
– Parallel processing – not just hyperthreading but limitless opportunities for parallelism
– No wasted cost on features that you don’t need
• Some traction over the years, but very limited
– Numerous chess-playing machines from Deep Thought to Hydra
– Craig Venter used Xilinx chips for the Human Genome project
– Other people are using Xilinx chips for Bioinformatics
– Cray, SGI and others have been using FPGAs as coprocessors to offload certain operations
– Berkeley Emulation Engine is a recent example
– Numerous companies represented in the consortium have been extolling the virtues of FPGA computing for a long time
Raw Processing Performance Characteristics and Comparisons
Three axes of performance
• Computational capability
• Memory Bandwidth
• IO Bandwidth
[Figure: Pentium vs. Virtex-4 compared on Computation (GOPS), Memory Bandwidth (GB/sec), and IO Bandwidth (Gbps); axis 0 to 250]
Why has adoption taken so long?
• Traction has been limited by programming model
– Direct C translation to gates
  • Definite progress in development and productization
  • Limited customer acceptance in the supercomputer market, but the picture may be changing
– Direct HDL design
  • Difficult to implement current applications of supercomputing in HDL
  • Need for high connectivity lowers performance
To date, the only model in widespread use for supercomputing-type applications is HDL
Process Technology Advances
Advanced 90-nm process
11-layer metallization: 10 copper + 1 aluminum
New triple-oxide structures: lower quiescent power consumption
Benefits: best cost, highest performance, lowest power, highest density
Over 1 million 90 nm FPGAs shipped
[Figure: transistor cross-section labeled Channel, Gate, Source, Drain, Source Metal Connection, Drain Metal Connection]
Moore’s Law Continues Fueling Reprogrammable FPGA Advances
[Figure: FPGA process technology roadmap, 1999 to 2017, with nodes 180 nm, 150 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, and 8 nm, grouped into Mature FPGA Product Technology, Developing FPGA Product Technology, and Future Process Technology]
• Planned continuation of the 2-year technology node cycle
• “Traditional Scaling” is starting to be affected by the fundamental material limits of the planar CMOS process
• “Equivalent Scaling”, or the assimilation of new materials, structures and functional integration, will drive continued scaling
Architectural Evolution: Reconfigurable FPGAs
[Figure: Device Complexity and Performance vs. year, 1985 to 2005]
• 1985, Glue Logic: FPGA Fabric
• 1992, Block Logic: FPGA Fabric, Block RAM
• 2000, Platform Logic: FPGA Fabric, Block RAM, Embedded Registers and Multipliers, Clock Management, Multi-standard Programmable IO
• 2002, System Logic: FPGA Fabric, Block RAM, Embedded Registers and Multipliers, Clock Management, Multi-standard Programmable IO, Embedded Microprocessor, Multigigabit Transceivers
• 2004, Domain-optimized System Logic: FPGA Fabric, Block RAM, Embedded Registers and Multipliers, Clock Management, Multi-standard Programmable IO, Embedded Microprocessor, Multigigabit Transceivers, Embedded DSP-optimized Multipliers, Embedded Ethernet MACs
• 2005: Programmable “System in a Package”
Moore Meets Einstein
[Figure: Clock Frequency in MHz and Trace Length in cm per 1/4 clock period, 1965 to 2010, log axis from 1 to 2048]
Speed doubles every 5 years... but the speed of light never changes
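The trace-length axis of this chart can be reproduced with a short calculation. The sketch below assumes a signal propagation speed of about 15 cm/ns on a PC-board trace (roughly half the speed of light in vacuum); that value and the function name `quarter_period_trace_cm` are illustrative assumptions, not figures from the slide.

```python
# Assumed propagation speed on a PC-board trace, in cm per nanosecond.
# Roughly half the speed of light in vacuum; an illustrative value only.
ASSUMED_SPEED_CM_PER_NS = 15.0


def quarter_period_trace_cm(clock_mhz, speed_cm_per_ns=ASSUMED_SPEED_CM_PER_NS):
    """Distance a signal travels during one quarter of a clock period."""
    period_ns = 1000.0 / clock_mhz          # 1 MHz corresponds to a 1000 ns period
    return speed_cm_per_ns * period_ns / 4.0


# Tabulate a few clock frequencies taken from the chart's axis labels.
for mhz in (1, 16, 128, 512, 2048):
    print(f"{mhz:5d} MHz -> {quarter_period_trace_cm(mhz):10.2f} cm")
```

Under this assumption, every doubling of clock frequency halves the usable trace length, which is the point the slide makes about board-level signal integrity.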
A Bird’s Eye View...
Lower Cost
Moore’s Law is alive
Smaller geometries, larger wafers, and lower defect density (= higher yield) continue to achieve lower cost per function
LUT + flip-flop: $1.00 in 1990, $0.002 in 2003
State-of-the-art: 90 nm on 300 mm wafers
Spartan-3 uses this technology for lowest cost
Rapid price reductions, intense competition
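The two LUT + flip-flop price points imply a steady halving rate. A minimal sketch of that arithmetic, assuming a constant exponential decline between 1990 and 2003 (the function name is hypothetical):

```python
import math


def halving_time_years(price_start, price_end, years):
    """Years per price halving, assuming a constant exponential decline."""
    halvings = math.log2(price_start / price_end)
    return years / halvings


# Price points quoted on the slide: $1.00 in 1990, $0.002 in 2003.
reduction_factor = 1.00 / 0.002
t_half = halving_time_years(1.00, 0.002, 2003 - 1990)

print(f"overall reduction: {reduction_factor:.0f}x")
print(f"price halves roughly every {t_half:.2f} years")
```

The result is a halving roughly every year and a half, in line with the cadence usually associated with Moore's Law.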
A Bird’s Eye View…
More Logic and Better Features:
>100,000 LUTs & flip-flops
>200 BlockRAMs, and the same number of 18 x 18 multipliers
1156 pins (balls) with >800 GP I/O
50 I/O standards, incl. LVDS with internal termination
16 low-skew global clock lines
Multiple clock management circuits
On-chip microprocessor(s) and Gbps transceivers
Gate count is really a meaningless metric
A Bird’s Eye View…
Higher Speed
Smaller and faster transistors
90 nm technology, using 193 nm ultra-violet light
Cu interconnect (instead of Al) was easily achieved
Low-K dielectric progress is disappointing
System speed: up to 500 MHz, mainly through smart interconnects, clock management, dedicated circuits, flexible I/O
Integrated transceivers running at 10 Gigabits/sec
Speeding up general-purpose logic is getting difficult
A Bird’s Eye View…
Better tools
Back-End Place & Route and XST synthesis
VHDL and Verilog becoming entry point
IP/Cores speed up design and verification
Embedded Software Development Tools support architectures and merge HW and SW
Domain-Specific Languages
System Generator bridges the gap between Matlab/Simulink and FPGA circuit description
ASIC-size FPGAs need ASIC-like tools
ASIC-like size requires ASIC-quality tools
ASICs Are Losing Ground
Mask set >$1M + design + verification + risk
ASICs are only for extreme designs:
Extreme volume, speed, size, low power
Source:IBM
FPGAs in 2003
1000 to 80,000 LUTs and flip-flops, millions of bits in dual-ported RAMs
Low-skew Global Clocks, Frequency synthesis, 50 ps phase control
18 Kbit BlockRAMs and 18 x 18 multipliers
FPGAs are not glue-logic anymore
Virtex-4 in September 2004 / Virtex-5 Even Better
ASMBL™ Column-Based Architecture
500 MHz SmartRAM™ BRAM/FIFO
0.6 - 11.1 Gbps RocketIO™
SelectIO with ChipSync™ Technology:
- 1 Gbps LVDS
- 600 Mbps SE
500 MHz Xtreme DSP™ Slice
500 MHz Xesium™ Clocking
Integrated System Monitor
Integrated Tri-Mode Ethernet MAC Cores
Integrated 450 MHz PowerPC Cores
4th Generation Advanced Logic
1 Hz to 640 MHz Pulse Generator
Direct Digital Synthesis in smallest Spartan3 chip
PicoBlaze for arithmetic and user interface
Special DCM frequency synthesis for <350 ps jitter
External PLL for jitter reduction to 100 picoseconds
Max 640 MHz in 1 Hz steps, 1 ppm accuracy
Three SMA outputs: LVDS plus single-ended
1000 frequency values can be stored in EEPROM
Small size, low cost, easy single-knob control
Early 2005, next generation will reach 5 GHz
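Direct digital synthesis is commonly built around a phase accumulator: a tuning word is added to an N-bit register every clock, and the register's top bit toggles at the requested frequency. The sketch below models that idea in software, assuming a 32-bit accumulator and a 100 MHz reference clock. Both values and all function names are illustrative assumptions; the slide does not describe the internals of the pulse generator.

```python
ASSUMED_ACC_BITS = 32            # assumed accumulator width, not from the slide
ASSUMED_F_CLK_HZ = 100_000_000   # assumed reference clock, not from the slide


def tuning_word(f_out_hz, f_clk_hz=ASSUMED_F_CLK_HZ, acc_bits=ASSUMED_ACC_BITS):
    """Phase increment per clock that yields f_out_hz at the accumulator MSB."""
    return round(f_out_hz * (1 << acc_bits) / f_clk_hz)


def actual_frequency(word, f_clk_hz=ASSUMED_F_CLK_HZ, acc_bits=ASSUMED_ACC_BITS):
    """Frequency actually produced by a given tuning word."""
    return word * f_clk_hz / (1 << acc_bits)


def dds_msb_stream(word, n_cycles, acc_bits=ASSUMED_ACC_BITS):
    """Simulate the accumulator and return its MSB for each clock cycle."""
    mask = (1 << acc_bits) - 1
    phase = 0
    bits = []
    for _ in range(n_cycles):
        phase = (phase + word) & mask          # wrap around like a hardware register
        bits.append(phase >> (acc_bits - 1))   # MSB is the square-wave output
    return bits


word = tuning_word(25_000_000)
print(hex(word), actual_frequency(word))
print(dds_msb_stream(word, 8))
print("frequency resolution (Hz):", ASSUMED_F_CLK_HZ / (1 << ASSUMED_ACC_BITS))
```

With these assumed values the frequency resolution is well below 1 Hz, which shows how a phase accumulator can support the 1 Hz step size quoted on the slide.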
640 MHz Pulse Generator
Challenges
Technology moves rapidly: 130, 90, 65 nm
Multiple Vcc, lower voltage - higher current
Lower Vcc makes decoupling very critical
Moore’s law becomes more difficult to sustain
Leakage current has increased significantly
Triple-oxide transistors and clever design provide relief
Signal integrity on pc-boards is crucial
“Homebrew” prototyping would waste money and time
Use Standard Evaluation Boards Instead
AFX Basic Evaluation Boards
Low-Cost ML40X (~ $ 700)
ML46X- Memory Eval. Board
ChipScope Pro for Real-Time Debug
Debugging usually dominates the design effort
- needs access to chip-internal nodes and busses
- practically impossible to dedicate extra pins and routing
- don’t waste time “debugging the debugger”
ChipScope Pro has internal virtual test headers
- Small cores that act as internal logic state analyzers
ChipScope Pro provides full visibility at speed
- Read-out via JTAG, no extra pins needed
ChipScope Pro is the best tool for logic debugging
FPGAs in 2004+
“A bold new course into the cosmos”
Reconfigurable Scalable Computing (RSC) for Space Applications - $14.8M
Spirit & Opportunity Rovers
6 Radiation-tolerant FPGAs: 1M gates @ 100 kRads
Next: 6M gates @ 200 kRads
Power and Reliability
Supply voltage fluctuations can increase in power-aware design
- Need models that can be adapted by architects/software designers that abstract detailed circuit issues
- Cost of hardware solutions – supply grids, decoupling capacitances
- Cost of software solutions – balancing work load
Substrate coupling between digital and analog
Interconnect reliability and power
- Noise sources – single errors, multiple errors, amplitude of the error sources, modes of failure
- Challenge – detecting the unobserved
- How to offset encoding/decoding costs – Just Enough Power
Soft errors
Higher Leakage Current…
High leakage current = static power consumption
Was <100 microamps, now >100 mA, even amps (!)
Caused by:
Gate leakage due to 16 Å gate oxide thickness
Sub-threshold leakage current: incomplete turn-off because the threshold voltage does not scale
Tyranny of numbers:
10 nA x 100 million transistors = 1 A
Evenly distributed, thus no reliability problem
Sub-100 nm is not ideal for portable designs
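The "tyranny of numbers" arithmetic can be checked directly. The sketch below multiplies the per-transistor leakage by the transistor count quoted on the slide, then converts the result to static power at an assumed 1.2 V core supply; the supply voltage and the function names are illustrative assumptions.

```python
def total_leakage_amps(per_transistor_amps, transistor_count):
    """Total leakage when every transistor leaks the same small current."""
    return per_transistor_amps * transistor_count


def static_power_watts(leakage_amps, vcc_volts):
    """Static power drawn from the core supply by the leakage current."""
    return leakage_amps * vcc_volts


# Slide values: 10 nA per transistor, 100 million transistors.
leakage = total_leakage_amps(10e-9, 100e6)

ASSUMED_VCC_VOLTS = 1.2   # assumed core voltage, not stated on this slide
power = static_power_watts(leakage, ASSUMED_VCC_VOLTS)

print(f"total leakage: {leakage:.2f} A")
print(f"static power at {ASSUMED_VCC_VOLTS} V: {power:.2f} W")
```

A device that idles at around a watt is hard to justify in a battery-powered product, which is the slide's point about portable designs.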
Dramatic Power Reduction in Virtex-4
[Figure: Power Consumption vs. Frequency, showing a 50% reduction]
Virtex-4 cuts power by 50%
• Measured 40% lower static power with Triple-Oxide technology
• 50% lower dynamic power with 90-nm
  – Lower core voltage
  – Less capacitance
• Up to 10x lower dynamic power with hard IP
  – Integration means fewer transistors per function
Challenges
- Static power grows with process generations
  - Transistor leakage current
- Dynamic power grows with frequency
  - $P = C V^2 f$
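Because dynamic power scales with the square of the supply voltage, a modest voltage drop accounts for much of the claimed 50% reduction. The sketch below assumes the core voltage falls from 1.5 V to 1.2 V between generations; those two voltages and the helper name `dynamic_power` are illustrative assumptions, not numbers from the slide.

```python
def dynamic_power(c_farads, v_volts, f_hz):
    """Dynamic switching power, P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz


ASSUMED_V_OLD = 1.5   # assumed previous-generation core voltage
ASSUMED_V_NEW = 1.2   # assumed 90-nm core voltage

# Hold capacitance and frequency fixed to isolate the voltage effect.
voltage_ratio = dynamic_power(1.0, ASSUMED_V_NEW, 1.0) / dynamic_power(1.0, ASSUMED_V_OLD, 1.0)

# Capacitance scaling still required to reach a total 50% reduction.
capacitance_ratio_needed = 0.5 / voltage_ratio

print(f"power ratio from voltage alone: {voltage_ratio:.2f}")
print(f"capacitance ratio needed for 50% total: {capacitance_ratio_needed:.3f}")
```

Under these assumptions the voltage change alone cuts dynamic power to about two thirds, and reduced capacitance has to supply the remainder.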
Single-Event Upsets in Virtex-II
SEU = random soft error, directly or indirectly caused by solar radiation
Known problem at high altitude and in space; traditionally not a problem at sea level.
Many tests and papers show ways to mitigate: readback, scrubbing, triple redundancy
Aerospace apps tolerate the cost/size penalty.
Creates FUD: Fear, Uncertainty & Doubt
Radiation Sources
[Figure: radiation sources: Trapped Particles (Protons, Electrons, Heavy Ions); Galactic Cosmic Rays (GCRs); Solar Protons & Heavier Ions]
Nikkei Science, Inc. of Japan, by K. Endo, Prof. Yohsuke Kamide
FPGA Radiation Tolerance: TID Trends vs Product/Technology
[Figure: TID in Krads (Si) (per 1019.6), axis 0 to 400, vs. process node in nm, 50 to 350]
TID tolerance of Military-grade FPGAs with full production test:
• 350 nm - XQ4000XL: 60K Rads (Si)
• 220 nm - XQVR (Virtex): 100K Rads (Si)
• 150 nm - XQR2V (Virtex-II): 200K Rads (Si)
• 130 nm - XQR2VP: 250K Rads (Si)
• 90 nm (Preliminary): 300K Rads (Si)
Process trends*:
• Gate oxide continues to thin
• Oxide tunnel currents increase
• Gate stress voltage decreases
*See “CMOS SCALING, DESIGN PRINCIPLES and HARDENING-BY-DESIGN METHODOLOGIES” by Ron Lacoe, Aerospace Corp., 2003 IEEE NSREC Short Course
Applied Mitigation (TMR + Scrubbing)
[Figure: Virtex-II with TMR, PROM, and Scrub Control]
Single FPGA with TMR and Configuration Scrubbing
• Continuous, uninterrupted operation (except SEFI)
• Can employ readback for error detection
• Scrub controller detects and handles SEFIs
• Critical data processing applications (Communications, Navigation)
FPGA can manage its own configuration scrubbing!
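The two mitigation mechanisms named on this slide can be illustrated with a small software model. The sketch below shows a bitwise majority voter, which is the core of TMR, and a readback-and-repair loop that compares configuration frames against a golden copy. The frame representation and the function names are hypothetical; a real scrub controller works on device-specific configuration frames and must also handle SEFIs, which this sketch does not attempt.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 majority vote across three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def scrub(config_frames, golden_frames):
    """Compare each frame with the golden copy, repair mismatches in place,
    and return the indices of the frames that were rewritten."""
    repaired = []
    for index, (frame, good) in enumerate(zip(config_frames, golden_frames)):
        if frame != good:
            config_frames[index] = good
            repaired.append(index)
    return repaired


# TMR: one of three copies suffers a single-bit upset, the vote masks it.
print(bin(majority(0b1010, 0b1010, 0b0010)))

# Scrubbing: flip one bit in frame 1, then let the scrubber restore it.
golden = [0x1F, 0x2E, 0x3D]
live = list(golden)
live[1] ^= 0x04                  # simulated single-event upset
print(scrub(live, golden), live == golden)
```

The voter hides an upset immediately, while scrubbing removes it before a second upset in another copy can defeat the vote, which is why the two techniques are applied together.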
Xilinx TMR (XTMR)
XTMR
Single-String
Spectrum of Reconfiguration
[Figure: spectrum of reconfiguration frequency (Occasionally, Periodic, Frequent, Run-time) with example uses: Field Upgrades, Rapid Design, Data Processing, Networking, Signal Processing]
New use models enabled with Reconfigurable FPGAs
- More efficient use of hardware
- Adaptive hardware algorithms
- Design modules that time share device resources
- Reduced device count and lower power consumption
FPGA Partial Reconfiguration
Think of an FPGA as Two Layers
- Configuration Memory Layer
- User Logic Layer
Configuration memory controls functions on user logic layer
Partial Reconfiguration allows a portion of the device to be changed while the rest is still running
Documented in XAPP 290
[Figure: Configuration Memory Layer above User Logic Layer]
What FPGA Configuration Memory Controls
• All interconnection (wiring)
• Logic Definition (Look-up Tables or “LUTs”)
• Multiply by, divide by, etc.
• Inversion
• Feature selection
• Interface to hardwired blocks, e.g. PPC
• Pipeline on/off
• ECC enable/disable
• BRAM width
• I/O Modes
> really EVERYTHING!
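The two-layer picture can be made concrete with a toy model of a 4-input look-up table: the 16 configuration bits are the "configuration memory layer" and the evaluated function is the "user logic layer". The class `Lut4` and its methods are hypothetical names for illustration; this is a software analogy, not a description of the device's actual configuration format.

```python
class Lut4:
    """Toy 4-input LUT whose behavior is set entirely by 16 configuration bits."""

    def __init__(self, init_bits):
        self.init_bits = init_bits & 0xFFFF

    def reconfigure(self, init_bits):
        """Rewrite the configuration memory; the logic function changes with it."""
        self.init_bits = init_bits & 0xFFFF

    def evaluate(self, a, b, c, d):
        """Use the four inputs as an address into the configuration bits."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.init_bits >> index) & 1


lut = Lut4(0x8000)                 # only address 15 is set: 4-input AND
print(lut.evaluate(1, 1, 1, 1), lut.evaluate(1, 0, 1, 1))

lut.reconfigure(0x6996)            # odd-parity pattern: 4-input XOR
print(lut.evaluate(1, 0, 0, 0), lut.evaluate(1, 1, 0, 0))
```

Rewriting one LUT's bits while the others keep running is, in miniature, what partial reconfiguration does for a whole region of the device.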
Partial Reconfiguration Modules (PRMs)
[Figure: XC2VP30 with two partial reconfiguration regions: PR Region A (PRM_A0, PRM_A1, PRM_A2) and PR Region B (PRM_B0, PRM_B1, PRM_B2)]
One or more PR regions can be defined
Multiple PRMs can be defined for each region
Logic Capacity of Virtex Rad Tolerant Families
Current Virtex Families with TMR Mitigation vs. Future RHBD Families
[Figure: bar chart of Available Logic Cells (K), axis 0 to 125, for XQVR1000 (Virtex), XQR2V6000 (Virtex-II), XQR2VP70 (Virtex-II Pro), and SIRF 4V100 (Virtex-4 RHBD), grouped as Current Virtex Families and Future Rad Hard by Design Families]
Issues
CMOS scaling will continue well into the next decade, fueling reconfigurable FPGA architectural advances and system-level integration
The computing industry is trying to increase performance with parallel execution and reconfigurability today, and this is clearly the way of the future
Performance of FPGAs as a compute platform exceeds that of conventional processors in all three performance vectors; implementing an effective programming model is the main issue the industry is working hard to solve
Partial Reconfiguration capability is here today, enabling new use models, and software support tools are imminent
Rad Tolerant Reconfigurable FPGAs available today achieve virtual SEE immunity by applying Partial Reconfiguration and soft TMR techniques
Rad Hard by Design Reconfigurable FPGAs are under development and will offer a dramatic increase in available logic cells and radiation performance, while freeing up reconfiguration resources for more efficient use of hardware, reconfigurable processing, or computing applications