Introduction
DESIGN OF FAST AND EFFICIENT HYBRID-FPGAs FOR NUMERICALLY INTENSIVE APPLICATIONS IN FLUID DYNAMICS, MOLECULAR MODELING AND IMAGE/VIDEO PROCESSING
A. Akoglu, A. Dasu, S. Panchanathan Center for Ubiquitous Computing, Arizona State University, Tempe, AZ
{akoglu,dasu,panch}@asu.edu
Abstract
This research work presents results obtained from the hybrid FPGA architecture design methodology proposed in earlier work. The hybrid architecture is formed of ASIC units and LUT-based processing elements. The ASIC units represent tasks or core clusters, essentially recurring patterns, obtained through common sub-graph analysis between basic blocks within and across routines of computation-intensive applications. Results show that partial reconfiguration with computation cores embedded in a sea of LUTs offers the potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations. Since the ASICs cover only parts of the data flow graphs, the remaining computations are implemented on LUT-based reconfigurable hardware. A new packing function is proposed to form the LUT-based processing elements. Its cost function prioritizes reducing the number of input/output pins of the clusters being formed. Results show that the proposed method yields significant savings in the number of nets to be routed.
We have conducted experiments on MPEG-4 VVM, the GNU Scientific Library (GSL) and the NAMD molecular modeling library. A map report based on the Spartan 2E architecture was obtained from the synthesis report. Results show that partial reconfiguration with computation cores embedded in a sea of LUTs offers the potential for massive savings in gate density by eliminating the need for redundant sub-circuit pattern configurations (Table 1).
Since CIPEs cover only parts of the DFGs, the remaining computations (reconfigurable data flow computations) are implemented on LUT-based reconfigurable hardware. The methodology in Figure 4 aims to provide optimum interconnection pathways between different hierarchy levels with variable-size processing elements, allocating just enough switching and wiring resources based on profiling the computational characteristics of the application domains. In existing approaches, packing treats the number of intersecting nets as a positive gain and does not address how the wiring requirement grows after an LUT is included in a cluster. We also argue that the cost function should give priority to nets that decrease the number of input or output pins of the target cluster. A Rent's rule based packing mechanism (Figure 5), designed to improve the routing architecture by reducing the number of nets to be routed, has been implemented. In addition to the routability-driven cost metrics defined by other researchers, this mechanism prioritizes (Figures 6 and 7) nets that reduce the number of input/output pins during packing. Table 2 presents the performance of the packing compared to V-Pack (Rose et al.) and R-Pack (Sarrafzadeh et al.).
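The pin-reduction argument above can be illustrated with a small sketch. This is not the cost function of Figure 7; it is a minimal stand-in that assumes undirected nets, ignores the input/output distinction, and uses hypothetical names (`cluster_pins`, `pin_delta`, `packing_gain`, `pin_weight`). The `rent_pins` helper is the standard Rent's rule estimate T = t * g^p; how the authors combine it with the gain is not stated in the text.

```python
from typing import Dict, Set

# Hypothetical netlist model: each net maps to the set of blocks (LUTs) it touches.
Netlist = Dict[str, Set[str]]


def cluster_pins(cluster: Set[str], nets: Netlist) -> int:
    """Count nets crossing the cluster boundary; each one needs a cluster pin."""
    pins = 0
    for terminals in nets.values():
        inside = terminals & cluster
        if inside and inside != terminals:
            pins += 1
    return pins


def pin_delta(cluster: Set[str], candidate: str, nets: Netlist) -> int:
    """Change in cluster pin count if `candidate` is added (negative is good)."""
    return cluster_pins(cluster | {candidate}, nets) - cluster_pins(cluster, nets)


def shared_nets(cluster: Set[str], candidate: str, nets: Netlist) -> int:
    """Nets that connect `candidate` to at least one block already in the cluster."""
    return sum(1 for t in nets.values() if candidate in t and t & cluster)


def packing_gain(cluster: Set[str], candidate: str, nets: Netlist,
                 pin_weight: float = 2.0) -> float:
    """Assumed gain: reward shared nets, reward pin reduction more strongly."""
    return shared_nets(cluster, candidate, nets) - pin_weight * pin_delta(cluster, candidate, nets)


def rent_pins(t: float, g: int, p: float) -> float:
    """Rent's rule: expected terminals of a block of g gates, T = t * g**p."""
    return t * g ** p


# Both candidates share exactly one net with the cluster {"a"},
# but only "b" absorbs a net completely and frees a pin.
nets = {"n1": {"a", "b"}, "n2": {"a", "c", "d"}, "n3": {"c", "e"}}
cluster = {"a"}
best = max(["b", "c"], key=lambda blk: packing_gain(cluster, blk, nets))
```

A gain based only on shared nets would tie the two candidates here; the pin term separates them, which is the behavior the paragraph argues for.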
analysis between basic blocks within and across routines are recurring computation patterns implemented as ASICs on the non-reconfigurable area (CIPEs). Designing processing elements by identifying correlated compute-intensive regions within each application and between applications results in large amounts of processing in localized regions of the chip. This reduces the number of reconfigurations and the on-chip communication, and hence results in faster application switching and reduced power consumption. This task consists of finding the Common Sub-graphs (Figure 3), which is closely related to the Largest Common Sub-Graph problem (a proven NP-complete problem). Core reusable regions detected as common within or across applications by peer research efforts have been either at the granularity of MAC units (2 nodes) or at the granularity of entire function modules. There has been no reported work that detects core reusable regions consisting of several operation nodes (multiply, add, divide, etc.) between basic blocks in applications, with emphasis on accelerating data flow on hardware. Our method generates ASIC cores of higher granularity by focusing specifically on the data flow graphs of hardware computations and taking advantage of the restrictions they offer. In the Hybrid-FPGA model, we propose that the CIPE region be constrained and mapped onto a slab (a region of LUTs isolated by MacroBus, as in the Virtex architecture) or implemented as gates in ASIC technology. Even though a large ASIC on chip increases the cost of mask design, it offers the maximum amount of gate savings. The remaining slabs are implemented on LUT-based reconfigurable hardware. To the best of our knowledge, no existing technology maps regions within a single DFG into multiple slabs for partial reconfiguration. We have conducted experiments on several complex routines from the target applications.
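As a rough illustration of detecting recurring multi-operation patterns across basic blocks, the sketch below enumerates linear chains of operations in each data flow graph and keeps those found in at least two blocks. It is a deliberately restricted stand-in (chains only, exact operation labels) and not the authors' common sub-graph method; the graph encoding and the names `op_chains` and `common_chains` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumed DFG encoding: (node -> operation label, list of (producer, consumer) edges).
DFG = Tuple[Dict[str, str], List[Tuple[str, str]]]


def op_chains(ops: Dict[str, str], edges: List[Tuple[str, str]],
              length: int) -> Set[Tuple[str, ...]]:
    """Return every chain of `length` operation labels along data flow edges."""
    succ: Dict[str, List[str]] = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)

    chains: Set[Tuple[str, ...]] = set()

    def walk(node: str, path: Tuple[str, ...]) -> None:
        path = path + (ops[node],)
        if len(path) == length:
            chains.add(path)
            return
        for nxt in succ.get(node, []):
            walk(nxt, path)

    for node in ops:
        walk(node, ())
    return chains


def common_chains(blocks: Dict[str, DFG], length: int = 3) -> Dict[Tuple[str, ...], int]:
    """Map each chain to the number of basic blocks containing it (at least 2)."""
    counts: Counter = Counter()
    for ops, edges in blocks.values():
        counts.update(op_chains(ops, edges, length))
    return {chain: n for chain, n in counts.items() if n >= 2}


# Three toy basic blocks; the multiply-add-divide chain recurs in two of them.
blocks: Dict[str, DFG] = {
    "bb1": ({"m1": "mul", "a1": "add", "d1": "div"},
            [("m1", "a1"), ("a1", "d1")]),
    "bb2": ({"m1": "mul", "m2": "mul", "a1": "add", "d1": "div", "s1": "sub"},
            [("m1", "a1"), ("m2", "a1"), ("a1", "d1"), ("d1", "s1")]),
    "bb3": ({"a1": "add", "s1": "sub"},
            [("a1", "s1")]),
}
recurring = common_chains(blocks, length=3)
```

Chains are a much narrower pattern class than general sub-graphs, which keeps the enumeration cheap; a candidate found this way would still need to be checked against the full graph structure before being treated as a core.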
Conclusion
From this research effort, we believe that partial reconfiguration with computation cores embedded in a sea of LUTs offers the potential for massive savings in gate density by eliminating the need for unnecessary and redundant sub-circuit pattern configurations. We believe that this direction will lead to next-generation FPGA devices geared towards computationally intensive applications such as bio-chemical algorithms and scientific applications.
Selected Publications
1. A. Dasu et al., “Reconfigurable Media Processing”, Parallel Computing, Vol. 28, August 2002, pp. 1111-1139.
2. A. Akoglu, A. Dasu and S. Panchanathan, “A Framework for the Design of the Heterogeneous Hierarchical Routing Architecture of a Dynamically Reconfigurable Application Specific Media Processor”, Workshop on Embedded Systems for Media Processing, Dec 17, 2003, Hyderabad, India.
3. A. Dasu, A. Akoglu and S. Panchanathan, “An Analysis Tool Set for Reconfigurable Media Processing”, The International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'03), June 2003, Las Vegas.
4. 3 patents pending under US and international protection in the technologies for “Designing High Performance Reconfigurable Processing”.
Figure 1. Methodology [diagram labels: application source code; compiler (Lance, SUIF, gcc); analysis tools; analysis granularity: function level, basic-block (BB) level, or something in between?]
Figure 2a. Hybrid-FPGA
Table-1 configuration bits and clock cycles savings
Figure 5. Rent's Rule based Packing Parameters
$\mathrm{Gain} = \sum_{i=1}^{7} \mathrm{Case}(i)$
Figure 7. Packing Cost Function
Table-2 Amount of nets and tracks savings
Figure 4. LUT Based Architecture Methodology
Figure 6. How to prioritize net reduction
The methodology (Figure 1) involves extracting tasks or core clusters from the Control Data Flow Graphs (CDFGs) of applications, followed by designing the architecture to embed them in Hybrid-FPGA environments (Figures 2a and 2b). By hybrid, we mean that the proposed FPGA architectures involve both LUT and ASIC regions. Tasks or core clusters obtained through the common-sub-graph
Numerical simulations in Computational Fluid Dynamics and Molecular Modeling share common computation features and are performed iteratively. Similarly, video and image processing applications involve tasks (mosaic building to compress video into images, image compression such as DCT, DWT, etc.) that also share common computation features and require iterative processing. Several researchers have shown that these applications are well suited to execution on spatially parallel processor architectures. FPGAs in particular offer large numbers of on-chip spatially parallel units and are thus capable of performing orders of magnitude faster than regular serial processors. However, FPGAs are application agnostic and hence incur penalties: clock cycles lost in redundant reconfigurations, generic routing and poor memory architectures, all of which impact speed, power and silicon area. These factors have led us to explore the reconfigurable architecture design space with the application domain prioritized.
Figure 2b. Hybrid-FPGA
Figure 3. Finding Common Sub-Graph; Dominant Sub-Graph
[Figure labels: Step-1, Step-2, Step-3, Step-4, Step-5. Equation symbols for the Rent's rule packing parameters are garbled in the source.]