경종민
Multiple-FPGA System; SoC Verification using an array of FPGA’...
Introduction
• Hardware Emulation System
  – A device for verifying a digital circuit design together with its target system before the chips are fabricated
  – Merits
    • Faster verification of the logic design than software simulation
    • Real (physical) signals can be applied and monitored.
[Figure: two uses of an emulator. In-circuit logic emulator: the emulator alone. Simulation accelerator: the emulator with a C testbench or HDL testbench.]
Introduction
• Why Emulation?
  – An FPGA prototype allows design verification with real signals to and from the target environment, at a speed far higher than software simulation.
[Figure: setup time and execution time, on scales from minutes to a year, for logic simulation, accelerated simulation, logic emulation, and final silicon]
FPGA vs. Emulation System
• FPGA
  – Developed in the 1980s
  – Reconfigurable logic and routing architecture
  – The gate capacity of an FPGA is smaller than that of a state-of-the-art ASIC design.
    • Currently, the FPGA with the largest capacity holds about 8M gates (about 1.6M of these are logic gates; the rest is memory).
• Emulation Systems
  – An array of FPGAs or special processors is interconnected via interconnection networks.
  – The whole target design must be partitioned into a set of subcircuits, each small enough to fit in one FPGA.
Requirements of Hardware Emulator
• Five Requirements
  – 1. Gate capacity
    • With the advent of the SoC era, about 50-100M gates are needed (Intel’s P4 is about 60M gates including cache memories).
  – 2. Speed
    • An emulation system should be faster than other verification environments.
[Figure: speed in cycles/sec (Hz), on a scale marked 100 Hz, 10 kHz, and 1 MHz, for software simulation, coverification with an emulator, and a hardware emulator]
Requirements of Hardware Emulator
  – 3. Debuggability
    • Today’s emulators provide 100% debugging capability.
  – 4. Expandable architecture
    • The emulator architecture should be expandable to include more logic gates.
  – 5. Low cost
    • The cost of current emulators is still very high.
    • Mercury from Quickturn costs $4M to $5M, and each additional FPGA board costs $0.2M.
Basic architecture of Emulators
• Multiple FPGAs
  – Multiple FPGAs can be connected to increase the gate capacity of the emulation system.
  – Several interconnection architectures
    • Full crossbar network, folded-Clos network
    • Time-multiplexed interconnect
    • Virtual wire
• Embedded logic analyzer
  – Customized FPGA to extract internal values
  – Local memory to save the extracted values
Pin limitation
• Commercial FPGAs (Xilinx, Altera) do not have enough pins for the partitioned circuits.
[Figure: circuits partitioned by commercial partitioning tools, for several designs]
• Pin count vs. gate count of partition
  – Many large designs yield partitions that need far more pins than the FPGAs provide.
    • Sparcle processor: processor designed by MIT, LSI Logic and Sun for multiprocessor systems, in 1994
    • Alewife CC: cache controller designed at MIT
Interconnection Architectures
• Mesh-type interconnection
  – 2-D mesh for FPGA interconnection
• Crossbar network (separated interconnection)
  – One full crossbar network
  – Partial crossbar network (folded-Clos)
• Time-multiplexed
  – Dynamic FPID interconnect architecture
  – Time-multiplexed interconnect from Quickturn
  – Virtual Wire
Mesh Interconnection
• Another 2-D mesh
  – (US patent 6389379, Axis, 2002)
  – FPGAs on the same row or column are connected.
  – Only two “Hops” and “Jumps” are sufficient for any type of net.
  – Each FPGA’s resources are used for routing as well as logic mapping, which aggravates the pin limitation problem.
[Figure: 4 × 4 array of FPGAs, labeled FPGA 11 to FPGA 44]
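The two-hop property can be checked with a small sketch. It assumes, as the slide states, that every pair of FPGAs sharing a row or a column has a direct connection. The (row, column) tuples and the names `route` and `max_links` are illustrative and are not taken from the patent.

```python
from itertools import product


def route(src, dst):
    """Return the FPGAs visited from src to dst, both included.

    FPGAs are (row, col) tuples. Assumption: a direct link exists
    between any two FPGAs that share a row or a column.
    """
    if src == dst:
        return [src]
    if src[0] == dst[0] or src[1] == dst[1]:
        return [src, dst]  # one direct link
    # Otherwise pass through the FPGA in src's row and dst's column.
    corner = (src[0], dst[1])
    return [src, corner, dst]


def max_links(n):
    """Largest number of links used between any two FPGAs in an n x n array."""
    cells = list(product(range(1, n + 1), repeat=2))
    return max(len(route(a, b)) - 1 for a in cells for b in cells)


if __name__ == "__main__":
    print(route((1, 1), (3, 4)))
    print(max_links(4))
```

Under this assumption no pair of FPGAs in the 4 × 4 array needs more than two links, but the intermediate FPGA spends pins and routing resources on the pass-through, which is the cost the slide points out.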
Crossbar Network
• Full crossbar
  – Separates the logic FPGAs from the interconnection device.
  – After programming, one full crossbar connects any net of any FPGA to any net of any other FPGA.
  – The size of the full crossbar grows quadratically with the total number of FPGA pins, so it quickly becomes impractical as the number of FPGAs increases.
[Figure: FPGA 0 to FPGA 3, each with pin groups A, B, C, D, all connected to one full crossbar]
Crossbar Network
• Partial crossbar
  – (“An Efficient Logic Emulation System,” TVLSI, 1993)
  – The I/O pins of each FPGA are divided into subsets. Each crossbar chip is connected to the same subset of pins from every FPGA.
  – Still requires a large number of crossbar chips, called FPIDs (Field Programmable Interconnection Devices).
[Figure: FPGA 0 to FPGA 3, each with pin subsets A, B, C, D, connected to crossbar chips C0 to C3]
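A rough size comparison of the full and partial crossbar can be sketched as follows. It assumes a simplified model in which a crossbar chip with p pins needs p × p programmable crosspoints; that model, the function names, and the example numbers are assumptions for illustration, not figures from the cited paper.

```python
def full_crossbar_crosspoints(num_fpgas, pins_per_fpga):
    """Crosspoints of one crossbar that sees every pin of every FPGA."""
    total_pins = num_fpgas * pins_per_fpga
    return total_pins * total_pins


def partial_crossbar_crosspoints(num_fpgas, pins_per_fpga, subset_size):
    """Crosspoints summed over all chips of a partial crossbar.

    Each FPGA's pins are split into subsets of subset_size pins, and
    each crossbar chip connects to the same subset on every FPGA.
    """
    if pins_per_fpga % subset_size != 0:
        raise ValueError("subset_size must divide pins_per_fpga")
    num_chips = pins_per_fpga // subset_size
    pins_per_chip = num_fpgas * subset_size
    return num_chips * pins_per_chip * pins_per_chip


if __name__ == "__main__":
    # Hypothetical system: 4 FPGAs, 64 I/O pins each, 4 subsets of 16 pins.
    print(full_crossbar_crosspoints(4, 64))
    print(partial_crossbar_crosspoints(4, 64, 16))
```

In this model the partial crossbar trades one very large device for several smaller ones, which is why the chip count stays high even though the total switch area drops.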
Crossbar Network
• What is FPIC?
  – “Field-Programmable Interconnect Component”
  – Reconfigurable interconnection chip
  – Aptix Inc. uses FPICs for the interconnection.
[Figure: FPIC]
Time-Multiplexed Interconnect
• 1) Dynamic FPID
  – (“Routability Improvement Using Dynamic Interconnect Architecture,” TVLSI, 1998)
  – Provides different interconnections among the same set of logic modules.
  – Each FPID is a time-multiplexed FPID.
[Figure: crossbars numbered 1 to L, with the L-th crossbar labeled]
Time-Multiplexed Interconnect
• 2) Time-multiplexed interconnect from Quickturn
  – (US Patent 5960191, Quickturn, 1999)
  – A partial crossbar is used, but the connected pins are time-multiplexed.
  – Multiple signals share each pin, so only 1/n of the pins are required when n-to-1 muxes are used.
[Figure: signals A to G passing through MUX and DEMUX blocks and two MUX chips, each containing a crossbar]
Time-Multiplexed Interconnect
  – “MUX Clock” samples signal A and signal B.
  – “SYNC” disables sampling and synchronizes the sampling operation.
[Timing diagram: MUX Clock, Divided Clock, SYNC for user clock, Signal A, Signal B, and the Composite Signal alternating between A and B samples]
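The sampling scheme can be illustrated with a minimal sketch. It models only the interleaving of samples onto one composite signal and the matching demultiplexing; the SYNC handling and the exact slot order of the patent are not modeled, and the function names are hypothetical.

```python
import math


def multiplex(signals):
    """Interleave equal-length sample lists onto one composite signal.

    In every user-clock cycle, mux slot k carries the sample of signals[k].
    """
    length = len(signals[0])
    if any(len(s) != length for s in signals):
        raise ValueError("all signals need the same number of samples")
    composite = []
    for cycle in range(length):
        for samples in signals:
            composite.append(samples[cycle])
    return composite


def demultiplex(composite, n):
    """Recover the n original signals from the composite signal."""
    return [composite[k::n] for k in range(n)]


def pins_required(num_signals, n):
    """Physical pins needed when n signals share each pin."""
    return math.ceil(num_signals / n)


if __name__ == "__main__":
    signal_a = [1, 0, 1]
    signal_b = [0, 0, 1]
    composite = multiplex([signal_a, signal_b])
    print(composite)
    print(demultiplex(composite, 2))
    print(pins_required(7, 2))
```

The mux clock has to run n times faster than the user clock for this to work, so the pin saving is paid for in emulation speed.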
Time-Multiplexed Interconnect
• 3) Virtual Wire
  – (“Logic Emulation with Virtual Wires,” TCAD, 1997)
  – Several logical connections share the same physical wire.
  – The communication schedule is static and determined in advance (the logic circuit must be analyzed before a phase is assigned to each circuit partition).
[Figure: FPGA #1 and FPGA #2 joined by a virtual wire; muxes and shift loops carry the simultaneous logical outputs of one FPGA to the logical inputs of the other in phase 1 and phase 2]
Time-Multiplexed Interconnect
  – Phase assignment
    • At the end of each phase, the outputs produced are transferred to the other partition.
[Timing diagram: one emulation clock cycle divided into Phase 1 to Phase 4, with CLK and Enable signals; the phases alternate evaluation of combinational logic with communication]
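A static phase assignment of this kind can be sketched as a simple list scheduler. This is an illustrative stand-in, not the scheduling algorithm of the Virtual Wires paper: it assumes each logical wire may be sent only after the wires it depends on were sent in an earlier phase, and that at most `physical_wires` logical wires fit in one phase. All names are hypothetical.

```python
def schedule_phases(depends_on, physical_wires):
    """Assign logical wires to phases.

    depends_on maps each logical wire to the set of wires whose values
    must have been transferred in an earlier phase.
    Returns a list of phases, each a list of wire names.
    """
    if physical_wires < 1:
        raise ValueError("need at least one physical wire")
    remaining = {wire: set(deps) for wire, deps in depends_on.items()}
    sent = set()
    phases = []
    while remaining:
        # A wire is ready once everything it depends on has been sent.
        ready = sorted(w for w, deps in remaining.items() if deps <= sent)
        if not ready:
            raise ValueError("cyclic or unknown dependency among wires")
        batch = ready[:physical_wires]
        phases.append(batch)
        sent.update(batch)
        for wire in batch:
            del remaining[wire]
    return phases


if __name__ == "__main__":
    deps = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}}
    print(schedule_phases(deps, physical_wires=1))
    print(schedule_phases(deps, physical_wires=2))
```

Fewer physical wires mean more phases per emulation clock cycle, so the schedule length directly sets the emulation speed.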
Software for Multi-FPGA
• Partitioning
  – Various partitioning algorithms have been proposed.
  – Placing highly interconnected circuits into a single chip is desirable because the number of I/O pins is limited.
  – Circuit paths that require a short delay should stay inside one FPGA. This is a routability vs. performance trade-off.
• Placement
  – Assign each partitioned circuit to one of the FPGAs.
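The effect of a partition on pin demand can be estimated with a short sketch. It assumes a simplified netlist model in which every net that spans more than one partition costs one I/O pin on each partition it touches; the data layout and names are hypothetical.

```python
def pins_per_partition(nets, assignment):
    """Count the I/O pins each partition needs.

    nets maps a net name to the cells it connects.
    assignment maps each cell to a partition (FPGA) id.
    """
    pins = {part: 0 for part in set(assignment.values())}
    for cells in nets.values():
        touched = {assignment[cell] for cell in cells}
        if len(touched) > 1:  # the net crosses a partition boundary
            for part in touched:
                pins[part] += 1
    return pins


if __name__ == "__main__":
    nets = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["c", "d"]}
    clustered = {"a": 0, "b": 0, "c": 1, "d": 1}
    scattered = {"a": 0, "b": 1, "c": 0, "d": 1}
    print(pins_per_partition(nets, clustered))
    print(pins_per_partition(nets, scattered))
```

Keeping tightly connected cells together cuts the pin count, which is the reason the slide prefers placing highly interconnected circuits in a single chip.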
Software for Multi-FPGA
• Routing
  – Global routing
    • Select the routing switches (or crossbars) or additional FPGAs that a signal must pass through to reach the destination FPGA.
  – Detailed routing
    • Assign signals to actual traces on each FPGA.
• Time-multiplexed FPGA
  – With Virtual Wire, the routing algorithm must satisfy the relevant precedence relations.
Run-Time Reconfiguration
• Meeting the exploding gate capacity
  – With RTR, a time-multiplexed FPGA executes the whole circuit in time-domain slices, which greatly increases the effective gate capacity.
  – Run-time reconfiguration was proposed in the mid-1990s.
• Run-time reconfiguration (RTR)
  – Technology to swap different configurations in the reconfigurable hardware
  – The configuration in each time slot should be assigned registers for communicating with the other configurations.
Reconfiguration Model
• Single context
  – Only one full-chip configuration can be loaded at a time.
  – Sequential access for reconfiguration incurs high overhead (configuring an FPGA takes 5 s to 20 s for the Xilinx Virtex series).
[Figure: FPGA (logic and routing) with its configuration, before and after an incoming configuration is loaded]
Reconfiguration Model
• Multi-context
  – Multiple planes of configuration information
  – Switching between the configurations is fast.
  – Examples: Xilinx XC4000E, Chameleon Inc.’s CS2000 RCP
[Figure: FPGA (logic and routing) with its configuration, before and after an incoming configuration is loaded]
Xilinx XC4000E
Each logic element, including its LUT, has multiple (8) configuration planes in SRAM.
Micro-registers, which store the result of each context, are routed to the relevant logic block.
Reconfiguration Model
• Partially reconfigurable
  – Part of the FPGA can be reconfigured without reconfiguring the entire array.
  – Reduces the amount of configuration data.
  – The programming information can still be large because it includes address information.
  – Examples: Xilinx 6200, Xilinx Virtex-II
[Figure: FPGA (logic and routing) with its configuration, before and after an incoming configuration is loaded]
Reconfiguration Model
• Pipeline reconfigurable
  – Partial reconfiguration occurs in increments of pipeline stages.
  – Primarily used in datapath-style computations.
[Figure: pipeline stages vs. time, showing a schedule of Configure 1 to Configure 6 and Execute 1 to Execute 5 steps]
Configuration Scheduling
The precedence relation should be preserved when the circuit is partitioned.
[Figure: a circuit graph with nodes 1 to 7 partitioned into group 0, group 1, and group 2, and a chart of configuration vs. time showing the lifetime of each wire (wire 2,5; wire 4,6; wire 4,7; wire 5,6)]
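One way to preserve the precedence relation is to cut a topological order of the circuit graph into consecutive groups. The sketch below does this and reports, for each wire that crosses groups, the configurations in which it is produced and consumed. The graph edges in the usage example are hypothetical (the original figure is not fully recoverable), and the fixed group size is an illustrative simplification.

```python
from collections import deque


def topological_order(nodes, edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order


def assign_groups(nodes, edges, group_size):
    """Map each node to a configuration group, earlier groups run first."""
    order = topological_order(nodes, edges)
    return {node: index // group_size for index, node in enumerate(order)}


def wire_lifetimes(edges, group):
    """For wires crossing groups: (producing group, consuming group)."""
    return {(s, d): (group[s], group[d]) for s, d in edges if group[s] != group[d]}


if __name__ == "__main__":
    nodes = [1, 2, 3, 4, 5, 6, 7]
    edges = [(1, 4), (2, 5), (3, 5), (4, 6), (4, 7), (5, 6)]  # hypothetical
    group = assign_groups(nodes, edges, group_size=3)
    print(group)
    print(wire_lifetimes(edges, group))
```

Every wire with a lifetime spanning more than one group needs a register that survives the configuration swap, so shorter lifetimes mean fewer such registers.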
Fast Configuration
• Problem of run-time reconfiguration
  – Run-time reconfigurable systems reconfigure the hardware during program execution.
  – The reconfiguration time can offset part of the performance improvement achieved by hardware acceleration.
• Configuration time
  – DISC II system
    • “Dynamic Instruction Set Computer” implemented on partially reconfigurable FPGAs.
    • M. J. Wirthlin and B. L. Hutchings, “Sequencing run-time reconfigured hardware with software,” ACM/SIGDA International Symposium on FPGAs, pp. 122-128, 1996.
    • 25%-71% of the execution time is spent on reconfiguration.
  – UCLA ATR
    • “Automatic Target Recognition” implemented with RTR.
    • W. H. Mangione-Smith et al., “Seeking solutions in configurable computing,” IEEE Computer, vol. 30, no. 12, pp. 38-43, 1997.
    • Over 98.5% of the execution time is reconfiguration time.
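The percentages above translate into an upper bound on what faster configuration can buy. The sketch below applies the usual Amdahl-style argument: if a fraction f of the run time is reconfiguration and that part is made k times faster, the overall speedup is 1 / ((1 - f) + f / k). The function names are illustrative.

```python
def speedup_from_faster_reconfig(reconfig_fraction, reduction_factor):
    """Overall speedup when reconfiguration becomes reduction_factor times faster."""
    if not 0.0 <= reconfig_fraction < 1.0:
        raise ValueError("reconfig_fraction must be in [0, 1)")
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    remaining = (1.0 - reconfig_fraction) + reconfig_fraction / reduction_factor
    return 1.0 / remaining


def max_speedup(reconfig_fraction):
    """Limit of the speedup if reconfiguration time were removed entirely."""
    if not 0.0 <= reconfig_fraction < 1.0:
        raise ValueError("reconfig_fraction must be in [0, 1)")
    return 1.0 / (1.0 - reconfig_fraction)


if __name__ == "__main__":
    for fraction in (0.25, 0.71, 0.985):
        print(fraction, round(max_speedup(fraction), 2))
```

For the 98.5% case the bound is roughly 67 times, which shows why configuration time dominates the design of such systems.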
Fast Configuration
• Configuration Prefetching
  – Used in cosimulation environments.
  – Configuration time and host execution time are overlapped, which hides the configuration time.
• Configuration Compression
  – Used in cosimulation environments.
  – The communication time between the host processor and the FPGA can be minimized by compressing the configuration data.
  – Xilinx XC6200: one item of configuration data can configure several configuration registers.
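The benefit of prefetching can be estimated with a simple timing model. It assumes the run consists of steps, each needing one configuration load followed by one execution, and that with prefetching the load for the next step proceeds while the current step executes. This model and the function names are assumptions for illustration.

```python
def total_time_sequential(config_times, exec_times):
    """Every configuration load is followed by its execution, no overlap."""
    return sum(config_times) + sum(exec_times)


def total_time_prefetched(config_times, exec_times):
    """The load for step i+1 overlaps the execution of step i."""
    if len(config_times) != len(exec_times):
        raise ValueError("need one configuration time per execution time")
    if not config_times:
        return 0
    total = config_times[0]  # the first load cannot be hidden
    for i, run in enumerate(exec_times):
        next_load = config_times[i + 1] if i + 1 < len(config_times) else 0
        total += max(run, next_load)
    return total


if __name__ == "__main__":
    loads = [4, 4, 4]
    runs = [6, 2, 5]
    print(total_time_sequential(loads, runs))
    print(total_time_prefetched(loads, runs))
```

A load is fully hidden only when the overlapping execution lasts at least as long; otherwise the remainder of the load still stalls the run.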
Fast Configuration
• Configuration Caching
  – The communication time needed to send configuration data from the host to the target FPGA is the main reason for slow configuration.
  – A cache that holds configuration data can be placed near the FPGA to reduce the communication time.
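A configuration cache can be sketched in software as follows. The least-recently-used replacement policy is an assumption chosen for illustration; the slides do not name a policy. `fetch_from_host` stands for the slow host-to-FPGA transfer and is a hypothetical callback.

```python
from collections import OrderedDict


class ConfigCache:
    """Small LRU cache for configuration data kept near the FPGA."""

    def __init__(self, capacity, fetch_from_host):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._fetch = fetch_from_host
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def load(self, name):
        """Return the configuration data, fetching from the host on a miss."""
        if name in self._store:
            self._store.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self._store[name]
        self.misses += 1
        data = self._fetch(name)
        self._store[name] = data
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return data


if __name__ == "__main__":
    cache = ConfigCache(2, fetch_from_host=lambda name: "bits-of-" + name)
    for request in ["A", "B", "A", "C", "B"]:
        cache.load(request)
    print(cache.hits, cache.misses)
```

Only the misses pay the host communication time, so the gain depends on how often the same configurations recur.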
Commercial Emulators
• Axis
  – Xtreme™
    • Simulation/Emulation/Acceleration
    • System of multiple PCI cards, each with multiple FPGAs.
    • The FPGAs are interconnected in a mesh.
    • RCC technology is used for simulating designs.
Commercial Emulators
  – RCC (ReConfigurable Computing)
    • Consists of many computing elements (small, compact processors, each dedicated to performing one function)
[Figure: Verilog fragments (always @(posedge clk) blocks, gate instances such as inv and nand, and initial blocks with system tasks) and an array of eight Altera FLEX10K devices, with a PCI interface and SIMD controller, connected to an UltraSPARC II workstation]
Commercial Emulators
  – RTL language compiler
    • The RCC RTL compiler compiles HDL into RCC computing elements.
    • The user does not need to debug at the gate level.
[Figure: traditional RTL verification flow (RTL design, synthesis, gate-level design, debugger) next to the RCC flow (RTL design, RCC RTL compiler, computing elements on the RCC array in the emulation engine, debugger)]
Commercial Emulators
  – Debugging
    • HotSwapping between the software simulation state and the RCC state.
    • HotSwapping lets the user probe RTL constructs starting from whichever simulation time the user wants to view.
Commercial Emulators
• Quickturn
  – Palladium™
    • Components
      – 1. Custom ASIC matrix to emulate circuits
      – 2. FPGA array
      – 3. Embedded logic analyzer
      – 4. External I/O interface
    • Simulation acceleration modes
      – Synthesized testbench: testbenches are synthesized into the emulator.
      – Transaction-based simulation: the simulator and accelerator are decoupled.
      – Accelerated cosimulation (cycle-level transaction): the simulator and accelerator run in lock-step.
Summary
• Multi-FPGA Architecture
  – Mesh
  – Crossbar
  – Virtual Wire, time-multiplexed interconnect
  – Problems
    • Nets with multiple ports may not be routable in these architectures.
    • FPGA logic utilization in the mesh topology is low (under 30%).
    • The crossbar architecture needs additional hardware for interconnection, which raises the cost of emulators.
• RTR Architecture