ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

27
Lecture 10: Logic Emulation October 8, 2013 ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

description

ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation. Overview. Background Rent’s Rule Overcoming pin limitations through scheduling “Virtual wires” implementation Veloce 2 DEEP. The Challenge. Making a large multi-FPGA system is easy. Making it programmable is hard. - PowerPoint PPT Presentation

Transcript of ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Page 1: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

ECE 636

Reconfigurable Computing

Lecture 13

Logic Emulation

Page 2: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Overview

• Background

• Rent’s Rule

• Overcoming pin limitations through scheduling

• “Virtual wires” implementation

• Veloce 2

• DEEP

Page 3: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

The Challenge

• Making a large multi-FPGA system is easy. Making it programmable is hard.

• New approach is a software technology that facilitates hardware implementation.

• Effectively make a large number of discrete devices look like one large one.

• Leads to low-cost, scalable multi-FPGA substrate.

Page 4: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Logic Emulation

• Emulation takes a sizable amount of resources

• Compilation time can be large due to FPGA compiles

• One application: also direct ties to other FPGA computing applications.

Page 5: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Are Meshes Realistic?

• The number of wires leaving a partition grows with Rent’s Rule

P = KGB

• Perimeter grows as G0.5 but unfortunately most circuits grow at GB where B > 0.5

• Effectively devices highly pin limited

• What does this mean for meshes?

Page 6: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Possible Device Scenarios

• Rent’s Rule indicates that pin limited situation is getting worse.

• Frequently some logic must be left unused leading to limited utilization

• Perhaps this logic can be “reclaimed”

Page 7: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Partition vs FPGA Pin Count

• FPGAs don’t have enough pins

• Problem may or may not get worse depending on “structured” design.

Page 8: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Virtual Wires

• Overcome pin limitations by multiplexing pins and signals

• Schedule when communication will take place.

Page 9: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Virtual Wires Software Flow

• Global router enhanced to include scheduling and embedding.

• Multiplexing logic synthesized from FPGA logic.

Page 10: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

A Simple Example

FPGA 1 FPGA 2

FPGA 3FPGA 4

Page 11: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Clocking Strategy

• Evaluation and communication split into phases

• Longest dependency path determines number of phases

- Overall emulation performance

Page 12: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Example Scheduling

• Initial phase requires one uClk for computation, one for communication.

• Second phase requires 2 communication uClks due to through hop.

• Note this example assumed needed bandwidth was available.

Page 13: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Routing Algorithm

• For each phase, only some internal signals are ready for routing.

• Routing resources between FPGAs may be considered channels.

• Solution: Route signals use maze route for each phase.

• If available bandwidth not present, delay signals until later phases.

Page 14: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Worst Case Microcycle Count

• Most designs dominated by latency bound.

• If original design has been pipelined this is less of an issue

V >= max ( L*D, PC/Pf )

L = critical path lengthD = network diameterPC = max circuit partition pin countPf = FPGA pin count

Page 15: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Improved Scheduling

• Overlap computation and communication.

• Effectively create a “data flow” of information

• Schedule communication to happen as soon as possible

- No need for phases.

uCLK

Page 16: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Physical Implementation

• Small finite state machine encoded and placed in each FPGA

• Current implementation is one-hot encoding.

Page 17: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

System Implementation

• Low cost hardware

• So simple a graduate student can build it

Page 18: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Benchmark Designs

• Sparcle – modified Sparc processor

- 17K gates

- 4,352 bits of memory

- Emulated in circuit.

• CMMU – cache controller for scalable multiprocessor system

- 85K gates

- Designed as gate array and optimized with SIS

• Palindrome

- 14K gates

- systolic

Page 19: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Emulation Results

• At least 31 FPGAs needed for HW full connectivity (>100 for torus)

• Some degradation in overall system performance.

Page 20: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Device Utilization

• Approximately 45% of CLBs used for design logic.

• ~10% virtual wires overhead

Page 21: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Utilizations

• As devices scale projected utilization increases

• Hardwired approach doesn’t scale

• Equation ->

Page 22: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Veloce 2

• Emulates up to 2 billion logic gates using 128 boards of FPGAs• VirtuaLAB environment

- Allows emulator sharing across a company (128 users)- Allows for emulation of popular interfaces (Ethernet, USB,

PCIe)• Significant company investment• 40 MG per hour compile• 11 KW power consumption

Veloce 2 Quattro

Source: Mentor Graphics Veloce2 brochure

Page 23: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Veloce 2 Execution Environment

• Emulation environment contains three parts- Generation of test vectors and other stimulus- Design under test (DUT) and other support tools- Design checking resources (response checker)

Source: Mentor Graphics Veloce2 brochure

Page 24: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Veloce 2

• Emulator must interact with a number of interfaces

Source: Mentor Graphics Veloce2 brochure

Page 25: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

A Contemporary Example: DEEP

• Developed at U. Delaware in 2011

• Ten FPGA processor boards

• Backplane used for communication in conjunction with switch boards

• Ethernet interface

• Somewhat limited scalability

Page 26: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

A Contemporary Example: DEEP

• Used to emulate a multi-threaded processor

• Subblocks assigned to specific FPGAs

• The same logic is repetitively used for different threads

• Hand-partitioning of subblocks

• Bottom line: 80K cycles per second

Page 27: ECE 636 Reconfigurable Computing Lecture 13 Logic Emulation

Lecture 10: Logic Emulation October 8, 2013

Summary

• Virtual wires overcome pin limitations by intelligently multiplexing I/O signals

• Key CAD step is scheduling. Simplifies routing and partitioning.

• Latest push is towards incremental compilation

• Commercialized by Mentor Graphics – Veloce2

• Contemporary multi-FPGA systems often contain many boards

- Some require multiplexing of subblocks