AutoPilot

The AutoESL AutoPilot High-Level Synthesis Tool

© 2010 BDTI (www.BDTI.com). All rights reserved.

An Independent Evaluation of:

Berkeley Design Technology, Inc.

By the staff of

ContentsIntroduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1About BDTI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2Design Using AutoPilot and Xilinx ISE . . . . . . . . 2The BDTI High-Level Synthesis Tool Certification Program™. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Evaluation Workloads . . . . . . . . . . . . . . . . . . . . . . 4Description of Metrics . . . . . . . . . . . . . . . . . . . . . . 5Description of Platforms . . . . . . . . . . . . . . . . . . . . 5Implementation, Certification Process . . . . . . . . . 5Quality of Results Metrics . . . . . . . . . . . . . . . . . . . 6Usability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 7Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9The BDTI Optical Flow Workload™ . . . . . . . . . 11The BDTI DQPSK Receiver Workload™ . . . . . . 12Usability Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 13

IntroductionAutoESL Design Technologies, Inc., (“AutoESL”)

is a high-level synthesis tool company founded in2006 and headquartered in Cupertino, California.The company unveiled its AutoPilot high-level syn-thesis tool at the Design Automation Conference in

2009. AutoPilot is based on tools and techniquesdeveloped at UCLA and licensed exclusively toAutoESL.

AutoPilot takes as its input a C, C++ or SystemCdescription of functionality at a high level of abstrac-tion. It then generates a device-specific Verilog orVHDL register-transfer-level (RTL) description of ahardware implementation targeting an FPGA (forexample, from Altera or Xilinx) or ASIC. This elim-inates the time-consuming and error-prone step ofmanually creating the RTL implementation. Accord-ing to AutoESL, AutoPilot also generates a cycle-accurate SystemC simulation model for the synthe-sized results.

High-level synthesis tools, such as AutoPilot, thattarget FPGAs are typically of interest to two classesof prospective users: current FPGA users who wantto improve their productivity and gain design porta-bility, and processor users who are consideringswitching to an FPGA to achieve superior perfor-mance or cost/performance for computationallydemanding applications. There are good reasons forconsidering such a switch—BDTI’s report, FPGAs for

OVERVIEW Interest in high-level synthesis tools for FPGAs is intensifying as FPGAs and their applications grow larger and more complex. Prospective users want to understand how well high-level synthesis tools work, both in terms of usability and quality of results. To meet this need, BDTI launched the BDTI High-Level Synthesis Tool Certification Program™. This program evaluates high-level synthesis tools used to implement demanding digital signal processing applications on FPGAs.

This white paper presents detailed results of BDTI’s analysis of the AutoESL AutoPilot high-level synthesis tool used in conjunction with Xilinx RTL tools to target a Xilinx Spartan-3A DSP 3400 FPGA. We discuss the ease of use, productivity, and quality of results obtained using AutoPilot compared to implementing the same application on a DSP processor. We also compare AutoPilot quality of results vs. traditional RTL FPGA design.

Our findings will be surprising to many—and may indicate a major shift on the horizon for FPGA and DSP processor users.


DSP, includes benchmark results showing thatFPGAs can achieve 100X higher performance and30X better cost-performance than DSP processors insome highly parallel DSP applications. Yet FPGAsare not widely used as processing engines in theseapplications, primarily due to development chal-lenges—using traditional design techniques, it takesmuch longer to develop an application on a FPGAthan on a DSP processor. In addition, RTL FPGAdesign requires different skills than those required forDSP processor software development, and most DSPengineers don’t have the necessary RTL hardwaredesign expertise.

High-level synthesis tools have the potential toovercome these challenges and bring the performanceof FPGAs to a much wider range of users, but suchtools are often met with skepticism. In part, this skep-ticism is due to the common notion that softwarecannot produce results that are as good as what askilled engineer can produce. And in part, the skepti-cism is due to the fact that high-level synthesis toolshave been around for a long time but have neverachieved widespread use.

Opposing this skepticism is a growing body ofanecdotal evidence suggesting that some modernhigh-level synthesis tools, such as AutoPilot, are veryeffective, both in terms of usability and quality ofresults. Given this conflicting information, how is aprospective user to judge whether a high-level synthe-sis tool is worth considering?

In 2009, BDTI created the BDTI High-Level Syn-thesis Tool Certification Program to fill this gap,with the goal of providing objective, credible data andanalysis to enable potential users of high-level synthe-sis tools for FPGAs to quickly understand the capa-bilities and limitations of these tools.

BDTI has used this methodology to evaluateAutoESL’s AutoPilot tool used in conjunction withXilinx’s ISE and EDK tool chain targeting a XilinxSpartan-3A DSP 3400 FPGA. This white paper sum-marizes the results of BDTI’s evaluation, and is basedon BDTI’s in-depth, independent assessment of Auto-Pilot augmented with the results of detailed inter-views of AutoPilot users conducted by BDTI. Itincludes several results that will be surprising tomany, and that may indicate that the time for high-level synthesis tools for FPGAs has finally arrived.

About BDTI Berkeley Design Technology, Inc. (BDTI) is an

independent technology analysis, consulting, andengineering services company headquartered in Oak-

land, California. Founded in 1991, BDTI is widelyknown for its highly regarded, independent perfor-mance analysis of processing platforms for embeddedapplications. Its six suites of chip benchmarks havebeen licensed for use with nearly 100 processing plat-forms, from MCUs to FPGAs. Further informationabout the company and benchmark results are avail-able at www.BDTI.com.

Design Using AutoPilot and Xilinx ISEFigure 1 shows the design flow used in BDTI’s

evaluation of the AutoESL AutoPilot high-level syn-thesis tool. AutoPilot is used to compile C code and

generate an RTL implementation, and Xilinx’s RTLtools (which include the ISE tool chain and theEmbedded Developer’s Kit, or EDK) are used totransform that RTL implementation into a completeFPGA implementation in the form of a bitstream forprogramming a specific FPGA. AutoPilot offers a sig-nificant advantage over hand-coded RTL in terms ofretargetability; the tool enables users to migrate fromone FPGA to another (or even to an ASIC design)without manually rewriting their RTL.

For this project, BDTI could have limited the eval-uation to the use of AutoPilot alone, ignoring theRTL-to-bitstream portion of the design flow. Webelieve, however, that it is important for potentialusers to understand how difficult (or easy) it is to get

FIGURE 1. Design Flow Using the AutoPilot High-Level Synthesis Tool with Xilinx “ISE” RTL Tools

Tool-generated data enables code tuning

Modify code based on reports generated by RTL tools

Manual restructuring ofC code

Standard SW Tools(e.g. MS Visual Studio)

AutoPilot

Xilinx ISE/EDK

ModelSim

Reference C code

Implementation C code

RTL

Bitstream, Netlist

Tool

Design representation

AutoCC

Tool-generated data enables code tuning

Modify code based on reports generated by RTL tools

Manual restructuring ofC code

Standard SW Tools(e.g. MS Visual Studio)

AutoPilot

Xilinx ISE/EDK

ModelSim

Reference C code

Implementation C code

RTL

Bitstream, Netlist

Tool

Design representation

AutoCC


from the high-level application description all theway to an FPGA implementation—which requiresthe Xilinx RTL tools in addition to the high-levelsynthesis tool. For this reason, BDTI evaluated theentire implementation process shown in Figure 1—not just the C-to-RTL portion, but also the XilinxRTL tool chain.

When starting with a C language description ofrequired functionality, the first step in implementingan application on any hardware target is often torestructure the input reference C code to represent anarchitecture that is suitable for hardware implementa-tion. On a DSP, for example, it may be appropriateto rearrange the application’s control flow so thatintermediate data always fits in cache. On an FPGA,restructuring often provides a more parallel, FPGA-friendly representation of the application.

Restructuring can sometimes enable improve-ments of several orders of magnitude in speed and/orresource utilization. If the original C source code isnot appropriate for the hardware target, restructur-ing is required in order to obtain an efficient imple-mentation. For example, a video application may passentire video frames of intermediate data betweenalgorithm blocks. A straightforward C implementa-tion of the application may require large frame buff-ers that do not fit in an FPGA, but typically the codecan be restructured to enable much smaller buffers.By appropriately streaming the dataflow betweenalgorithm blocks, the application may be able tobuffer just a few scan lines—or even just a few pixels—rather than buffering entire frames.

In some cases, applications may require the use ofexternal memory (for example, if a video applicationmust store entire frames and can’t be restructured touse scan lines or pixels). In this case, the restructuringprocess requires the addition of code that supportsthe use of an external memory interface.

The effort required to restructure the applicationdepends on the specifics of the application and thehardware being targeted, as well as the capabilities ofthe tools used.

AutoPilot, like other current high-level synthesistools, does not handle restructuring automatically.Instead, the restructuring is typically done by hand.In fact, the restructuring can be done entirely inde-pendently of AutoPilot (in our evaluation, for exam-ple, we used Microsoft Visual Studio for restructuringand verifying the C code). Compared to hand-writtenRTL, where restructuring and language translationare performed as a single combined step, restructur-ing entirely in C is easier and less error-prone.

The reference C code may also need to be modi-fied to take advantage of the FPGA’s ability to imple-ment arbitrary-precision data paths and otherfeatures. AutoPilot supports arbitrary-width datatypes and also automates synthesis of a variety ofinterfaces between modules, including interfacesbetween various buses and third-party intellectualproperty modules such as the Xilinx multi-port mem-ory controller (MPMC).

The AutoPilot tool chain includes AutoCC,which compiles the C code for execution on theworkstation and handles FPGA-oriented modifica-tions such as arbitrary-width data types. AutoCCdoesn’t generate RTL; instead, it provides a quickcheck of the functionality of the FPGA-oriented Ccode. This intermediate verification capability pro-vides a significant advantage over verifying hand-written RTL, since RTL simulation is orders of mag-nitude slower. In our evaluation, for example, the Csimulation typically required 15-30 seconds, whileRTL simulation time varied from hours to days,depending on the workload under consideration.However, C simulation does not simulate the fullhardware features (such as pipelining and scheduling),so RTL simulation is still needed for hardware verifi-cation.

The functionally verified C code can then be opti-mized for better performance or efficiency. This stepis primarily accomplished using AutoPilot synthesisdirectives that can be either embedded in the C codeas pragmas or placed in AutoPilot scripts as com-mands. Because AutoPilot generates RTL veryquickly, the user can explore a variety of design strat-egies that would be prohibitively time-consuming toexplore using RTL.

AutoPilot compiles the optimized C code andgenerates an RTL implementation by convertingeach function call in the C code into an RTL module.For the moderately complex workloads used in thisproject, that process typically took less than 30 sec-onds.

In addition to generating RTL, AutoPilot alsogenerates reports that estimate the FPGA resourceutilization, latency, and throughput of the RTLimplementation. The reports include a breakdownby individual functions and loops in the C sourcecode, allowing users to pinpoint areas for improve-ment and fine-tune synthesis directives or C code inresponse.

Once the user is satisfied with the performanceand efficiency estimated by AutoPilot, the processswitches over to more traditional RTL tools. An


RTL simulator such as Mentor Graphics’ ModelSimcan be used to functionally verify the AutoPilot-gen-erated RTL. Alternatively, AutoPilot can generate acycle-accurate SystemC RTL model with all the nec-essary test bench related files and scripts based on theC language reference test bench which then can beused for a faster functional verification than thatachieved with ModelSim.

Finally, Xilinx’s RTL tools (ISE and EDK) areused to take the RTL generated by AutoPilot asinput, perform synthesis and place-and-route tasks,report the exact resource utilization of the implemen-tation and alert the user to any timing closure issues.The user may then further tune the C source codeand repeat the RTL generation if needed.

The transition point from AutoPilot to the Xilinxtools is smoothed, to some extent, by AutoPilot’sability to automatically generate scripts and con-straints for running synthesis and place-and-routesteps, which would be quite time-consuming if donemanually. AutoPilot does not, however, provide aunified “cockpit” from which the entire C-to-FPGAimplementation process can be executed. For exam-ple, AutoPilot does not generate scripts to transferRTL and netlist outputs to appropriate folders for theXilinx EDK tools; this must be done manually. It’s atrivial process once the user understands exactly whatneeds to be done—but understanding what needs tobe done requires a high level of familiarity with theXilinx tools, including their complex directory struc-ture requirements. AutoPilot also does not provide ascript to download the bitstream onto the FPGA;this requires working with the Xilinx tools. In otherwords, while AutoPilot eliminates the labor-inten-sive (and error-prone) process of hand-coding inRTL, it does not entirely insulate the user from hav-ing to learn and use the RTL tools.

The BDTI High-Level Synthesis Tool Certification Program™

BDTI evaluates high-level synthesis tools (includ-ing the underlying RTL tools) using two well-definedsample applications, or “workloads” (describedbriefly below, and in more detail in Appendix A).These applications (described in the next section) arerepresentative of typical digital signal processingapplications for FPGAs. The two applications areimplemented using several approaches. First, a givenworkload is implemented on the target FPGA usingthe high-level synthesis tool in conjunction with theXilinx RTL tools. The same workload is then imple-mented on the same FPGA using a traditional RTL-

based approach (for the BDTI DQPSK ReceiverWorkload), or on a DSP processor using its associateddevelopment tools (for the BDTI Optical FlowWorkload). In this manner, BDTI is able to comparethe quality of results and productivity levels associ-ated with using various design flows.

The two workloads have been chosen to bebroadly representative of the types of embedded com-puting applications that electronic system designersimplement using FPGAs, and as such, they are inher-ently well suited for FPGAs. This is an importantpoint to keep in mind. There are many importantapplications (such as high-definition audio codecs)that don’t require the computational performancelevels or data rates of the workloads used here, andthat require much more complex algorithms. Suchapplications may yield very different results thanthose reported in this analysis.

Evaluation WorkloadsThe two workloads used in BDTI’s evaluation are

the BDTI Optical Flow Workload™ and the BDTIDQPSK Receiver Workload™.

The term “optical flow” (or “optic flow”) refers toa class of video processing algorithms that analyze themotion of objects and object features (such as edges)within a scene. The BDTI Optical Flow Workloadoperates on a 720p resolution (1280×720 progressivescan) input video sequence and produces a series oftwo-dimensional matrices characterizing the appar-ent vertical and horizontal motion within thesequence. In designing this workload, BDTI hasincreased the control complexity relative to what isoften used in this class of optical flow algorithms inorder to ensure that the workload provides a suffi-ciently challenging test case for the tools. More specif-ically, BDTI incorporated dynamic, data-dependentdecision making and array indexing into the opticalflow application.

There are two Operating Points associated withthe BDTI Optical Flow Workload, each of whichuses the same algorithm but is optimized for a differ-ent metric.

• Operating Point 1 is a fixed workload defined asprocessing video with 720p resolution (1280×720progressive scan) at 60 frames per second. Theobjective for Operating Point 1 is to achieve therequired throughput while minimizing resourceutilization. Resource utilization refers to the per-centage of total processing engine resourcesrequired to implement the workload.


• Operating Point 2 is defined as the maximumthroughput capability of the workload imple-mentation on the target device (measured inframes per second) for 720p resolution (1280×720progressive scan). The objective for OperatingPoint 2 is to maximize the throughput (measuredin frames per second) using all available deviceresources.

The secondary workload is the BDTI DQPSKReceiver Workload. This workload is a wireless com-munications receiver baseband application thatincludes classical communications blocks found inmany types of wireless receivers. It is a fixed work-load with a single Operating Point defined as process-ing an input stream of complex, modulated data at18.75 Msamples/second with the receiver chainclocked at 75 MHz. The receiver produces a demodu-lated output bitstream of 4.6875 Mbits/second. Theobjective for this workload is to minimize the FPGAresource utilization needed to achieve the specifiedthroughput.

Additional details on the two workloads are pro-vided in Appendix A.

Description of MetricsTwo categories of metric are used in this evalua-

tion:• Quality of results metrics assess the perfor-

mance and efficiency of the workload implemen-tation. For the BDTI Optical Flow Workload,quality of results metrics are reported for theAutoESL-Xilinx implementation and for theDSP processor implementation. For the BDTIDQPSK Receiver Workload, quality of resultsmetrics are reported for the AutoESL-Xilinx flowand for a traditional FPGA implementation usinga hand-written RTL design.

• Usability metrics assess the productivity andease of use associated with the AutoESL-Xilinxdesign flow, and are based on the BDTI OpticalFlow Workload. These metrics compare the pro-ductivity and ease of use associated with using theAutoPilot and Xilinx tools targeting an FPGArelative to using a DSP processor with its associ-ated software development tool chain.Usability metrics are evaluated qualitativelybased on nine aspects of tool use, including out-of-the-box experience, ease of use, completenessof tool capabilities, efficiency of overall designmethodology, and quality of documentation andsupport.

Description of PlatformsFor this evaluation, the target FPGA was the Xil-

inx Spartan-3A DSP 3400 (XC3SD3400A). For theBDTI Optical Flow Workload, the Xilinx Xtrem-eDSP Video Starter Kit (which is based on theXC3SD3400A) was used. Spartan-3A DSPs are basedon Xilinx’s low-cost Spartan-3A family, but have anumber of enhancements to accelerate digital signalprocessing. For example, Spartan-3A DSP chips havedouble the block RAM (“BRAM”) memory of otherSpartan devices and incorporate hard-wired DSP datapaths, called “DSP48A slices.” Each DSP48A slicecontains an 18×18 multiplier with pre-adders and anaccumulator, among other features. TheXC3SD3400A includes 126 DSP48A slices that can beclocked at up to 250 MHz, and roughly 54,000 logiccells. Xilinx RTL tools, including the ISE and EDKtool suites, were used with AutoPilot. (ISE and EDKversion 10.1.03 (lin64)).

The target DSP processor was the Texas Instru-ments TMS320DM6437. The TMS320DM6437 is avideo-oriented processor that includes a 600 MHzTexas Instruments TMS320C64x+ DSP core alongwith video hardware accelerators. (The hardwareaccelerators do not support the BDTI Optical FlowWorkload, and therefore were not used in the DSPprocessor implementation of the BDTI Optical FlowWorkload.) The evaluation used the Texas Instru-ments DM6437 Digital Video Development Environ-ment combined with the Texas Instruments CodeComposer Studio tools suite (version V3.3.82.13,Code Generation Tools version 6.1.9).

We should note here that, as is typical for high-level synthesis tools, AutoPilot costs considerablymore than DSP processor software developmenttools. Tool cost is not reflected in the cost-perfor-mance results presented later in this paper. A funda-mental question is whether the higher tool cost isjustified by higher quality of results and/or higherproductivity enabled by AutoPilot. We believe thatthe results and analysis presented in this paper willprove valuable to prospective AutoPilot users inanswering that question.

Implementation, Certification ProcessThe work of implementing the two workloads on

the two chips was distributed between AutoESL, Xil-inx, and BDTI based on the chip and tool chain used.AutoESL implemented both workloads using theAutoPilot and Xilinx tools and submitted resourceutilization results to BDTI for independent verifica-tion and certification. In parallel, BDTI’s engineers


independently implemented portions of the BDTIOptical Flow Workload using the AutoPilot and Xil-inx tools to gain first-hand insight into the usabilityof the tool chain. BDTI implemented the BDTI Opti-cal Flow Workload on the DSP processor, and Xilinximplemented the hand-coded RTL FPGA version ofthe BDTI DQPSK Receiver workload (which wasthen verified and certified by BDTI).

In addition, BDTI interviewed a number of Auto-Pilot customers about their experiences in using thetool and the quality of results they have attained.These interviews were used to augment (and sanity-check) the results obtained using the workload imple-mentations.

Quality of Results MetricsIn this section we present the Quality of Results

findings of our evaluation. The first results wepresent are for the BDTI Optical Flow Workload.Table 1 shows results for Operating Point 1, whichevaluates the percentage of on-chip hardwareresources consumed by a 60 fps Optical Flow Work-load at 720p resolution.

As shown in Table 1, the FPGA implementationusing the AutoESL-Xilinx tool chain utilized 39% ofthe FPGA resources to implement the workload. TheDSP processor was unable to implement this work-load; a minimum of 12 ‘DM6437 chips would beneeded to achieve 60 fps operation on the BDTI Opti-cal Flow Workload.

These results illustrate the difference in processinghorsepower available on the Spartan-3A DSP versusthe TI ‘DM6437 chip—for a 25% higher chip price,the FPGA used with AutoPilot provides more than

an order of magnitude higher computational power.For applications that can make good use of this horse-power (such as the BDTI Optical Flow Workload),the FPGA’s performance advantage is compelling.

In contrast to Operating Point 1 of the BDTIOptical Flow Workload, which specifies a fixedthroughput requirement, the goal of Operating Point2 is to achieve the highest throughput possible, usingall of the available chip resources. Table 2 presentsthese results in terms of the maximum frames per sec-ond supported by each single chip and the associatedcost per frame per second.

Here again, the FPGA implementation createdwith AutoPilot achieves much higher performancethan the DSP processor, and also much better cost-performance.

Prospective users are often interested in under-standing the capacity of synthesis tools. The OpticalFlow Operating Point 2 represents the largest designevaluated by BDTI, and used 76% of the XilinxXC3SD3400A FPGA. AutoPilot had no trouble han-dling a design of this size. For this workload, the ref-erence C code contained 559 functional lines of code.The restructured and optimized reference C codecontained 1,604 lines of code, which then generated38,222 lines of Verilog RTL or 35,849 lines of VHDLRTL. AutoPilot synthesized the C description andgenerated the RTL in less than 30 seconds.

The next question our evaluation attempted toanswer was: How efficient is an AutoPilot-basedFPGA implementation relative to an implementation

TABLE 1. Quality of Results for BDTI Optical Flow Workload Operating Point 1: Fixed Throughput (1280×720 Progressive Scan, 60 fps)

PlatformChip Unit

Cost (Qty. 10K)

Chip Resource Utilization

AutoESL AutoPilot plus Xilinx RTL tools targeting the Xilinx XC3SD3400A FPGA

$26.65 39%

Texas Instruments soft-ware development tools targeting the TMS320DM6437 DSP processor

$21.25

N/A(a minimum of 12 DSPs would be required to meet this oper-

ating point)

CERTIFIEDTM

TABLE 2. Quality of Results for BDTI Optical Flow Workload Operating Point 2: Maximum Throughput (1280×720 Progressive Scan)

Platform

Chip Unit Cost(Qty. 10K)

Maxi-mum

Framesper

Second(FPS)

Cost per FPS (Lower

is Better)

AutoESL AutoPilot plus Xilinx RTL tools targeting the Xilinx XC3SD3400A FPGA

$26.65 183 fps $0.14

Texas Instruments software develop-ment tools targeting the TMS320DM6437 DSP processor

$21.25 5.1 fps $4.20

CERTIFIEDTM


created using hand-written RTL code? For this partof the evaluation we used the BDTI DQPSK ReceiverWorkload (the Optical Flow workload would havebeen prohibitively time-consuming to hand code inRTL).

Table 3 shows the efficiency of the implementa-tion using the AutoESL-Xilinx tool chain versus thehand-written RTL implementation.

In this case, the hand-written RTL implementa-tion was created by an experienced FPGA designer,and made use of Xilinx Coregen IP blocks whereapplicable. As shown in Table 3, AutoPilot was ableto achieve a level of efficiency comparable to that ofhand-coded RTL on this workload. Achieving thisresult required modifications to the C languagedescription of the workload; however, the extent ofthese modifications was modest.

The similarity of AutoPilot and hand-writtenRTL results is probably not accidental; AutoESL wasprovided with the resource utilization for the hand-coded RTL result at the outset of the implementationprocess, and likely used this as a target in optimizingits implementation. It is important to note, however,that such information is not required for effective useof AutoPilot, and that AutoESL was not providedwith the hand-coded design. The results shown inTable 3 are consistent with results reported by Auto-Pilot customers interviewed by BDTI. These usersgenerally reported that the tool produced results thatwere similar in efficiency to hand-coded RTL. Fur-thermore, they said that using a high-level synthesistool enabled them to easily explore alternative archi-tectures, which often led to more efficient implemen-tations.

Historically, poor quality of results has been oneof the biggest pitfalls of high-level synthesis tools.

From the results presented here and the user inter-views conducted by BDTI, it is clear that AutoESLhas done an excellent job in overcoming this prob-lem, at least for the class of workloads used in thisanalysis.

Usability Metrics In this section we present the usability metrics for

the BDTI High-Level Synthesis Tool CertificationProgram. The usability metrics, presented in Table 4and Table 5, provide an assessment of the productiv-ity and ease of use of the high-level synthesis toolflow compared with the DSP processor tool chain.For each usability metric described below, BDTIassigns a score of Excellent, Very Good, Good, Fair,or Poor. For the AutoESL-Xilinx flow, the first scorelisted is the score for the complete flow. Beneaththose scores, in parenthesis, separate scores are pro-vided for AutoPilot and for the Xilinx RTL toolchain.

In assigning these scores, BDTI considers the over-all design process for a complete project—startingwith a C language application specification and end-ing with a real-time implementation on the targetprocessing platform (either an FPGA or DSP proces-sor). Detailed descriptions of each usability metricare provided in Appendix B.

In general, AutoPilot was quite straightforward toinstall and use. In contrast, the Xilinx RTL tools weredifficult to install and complicated to use, particularlyfor a novice FPGA user. The net result is that, asshown in Table 4 and Table 5, the AutoPilot plus Xil-inx tool chain has productivity ratings similar tothose of the DSP processor flow.

Although we did not directly compare the ease ofuse of AutoPilot vs. hand-written RTL code, theAutoPilot customers we interviewed estimated thatusing AutoPilot cut their design times by roughly50% compared with using hand-written RTL code. Inaddition some saw a more significant reduction intime required for verification. For example, some cus-tomers said that because much of their simulationcould be done at the C level rather than at the RTLlevel, it was much faster and easier. And some saidthat they expected to be able to reduce total designtime even further.

The quality of documentation is a weak spot forAutoPilot; there is very little of it. This is likelyattributable to the newness of the tool.

TABLE 3. Quality of Results for DQPSK Receiver Workload: Fixed Throughput (18.75 Msamples/Second Input Data with a 75 MHz Clock Speed)

PlatformChip Resource

Utilization(Lower is Better)

AutoESL AutoPilot plus Xilinx RTL toolstargeting the Xilinx XC3SD3400A FPGA

5.6%

Hand-written RTL code using Xilinx RTL toolstargeting the Xilinx XC3SD3400A FPGA

5.9%

CERTIFIEDTM


TABLE 4. Usability metrics (1 of 2)

Out-of-Box Experience Ease of Use Completeness of

Capabilities

Quality of Documentation

and Support

Combined AutoESL AutoPilot plus Xilinx RTL tools rating1

(AutoESL AutoPilot rating / Xilinx rating)

Fair

(Very Good/Poor)

Good

(Very Good/Fair)

Good

(Good/Good)

Good

(Fair/Very Good)

Texas Instruments soft-ware development tools rating2

Good Very Good Very Good Very Good

1.AutoESL AutoPilot plus Xilinx RTL tools targeting the Xilinx XC3SD3400A FPGA2.Texas Instruments software development tools targeting the TMS320DM6437 DSP processor

TABLE 5. Usability metrics (2 of 2)

Efficiency of Design Methodology

Extent of Modifications Required to

Reference Code

Learning to Use the Tool

Design and Implementation

(First Compiling Version)

Design and Implementation

(Final Optimized Version)

Platform InfrastructureDevelopment

Combined AutoESL Auto-Pilot plus Xil-inx RTL tools rating1

(AutoESL Auto-Pilot rating / Xil-inx rating)

Very Good

(Very Good/NA2)

Very Good

(Very Good/NA)

Good

(Good/Good)

Good

(Good/Good)

Good

(Good/NA)

Texas Instru-ments software development tools rating3

NA (assuming already familiar) Excellent Good Good Fair

1.AutoESL AutoPilot plus Xilinx RTL tools targeting the Xilinx XC3SD3400A FPGA2.“NA” = not applicable3.Texas Instruments software development tools targeting theTMS320DM6437 DSP processor

CERTIFIEDTM

CERTIFIEDTM


The process of developing the platform infrastruc-ture was similar for the FPGA and DSP in thisproject. In general, DSP processors come wellequipped with peripherals for implementing DSPapplications, but provide limited flexibility in periph-erals and interfaces. FPGAs, in contrast, provideexcellent flexibility in peripherals and interfaces, butmay not, by default, include needed peripheralblocks.

As mentioned earlier, obtaining an efficient imple-mentation of the BDTI Optical Flow Workloadusing AutoPilot required modifications to the Ccode. This was also true for the DSP processor. Theextent of modifications required to the C code wasmoderate for the AutoESL-Xilinx flow and high forthe DSP processor flow. Optimizing the BDTI Opti-cal Flow Workload for the DSP processor requiredextensive code reorganization relative to the refer-ence code. This reorganization process is not onlytime-consuming but also requires a very high level ofexpertise.

Part of the reason the DSP processor required somuch reorganization is that the BDTI Optical FlowWorkload used in this analysis is well suited forimplementation on FPGAs and not well suited forimplementation on DSP processors with caches (suchas the TMS320DM6437 chip used in this analysis).The amount of reorganizing required will vary byapplication.

Overall, considering the entire design flow weexpect that, in general, the level of effort for imple-menting many kinds of applications on a XilinxFPGA using AutoPilot and the Xilinx RTL tools willbe similar to that of implementing the same applica-tions on a DSP processor using software developmenttools.

As shown in Table 6, two different skill sets arerequired for the FPGA implementation: skills associ-ated with AutoPilot, and skills associated with theXilinx RTL tools. In comparison, for DSP processorapplication development, an engineer with special-ized DSP algorithm and programming skills isrequired. A typical DSP software engineer with anawareness of hardware architecture fundamentals(e.g., pipelining, latency) can learn to effectively useAutoPilot. Although a learning curve on the order ofseveral weeks to a few months is required to becomeproficient using AutoPilot (depending on the back-ground of the engineer), no other specialized exper-tise is required. Since BDTI engineers learned to useAutoPilot from the ground up, we include a Learningto Use the Tool usability metric. However, because the

DSP processor and FPGA tools were used by engi-neers already familiar with them, BDTI did not assigna score for learning to use these tools.

Perhaps the most significant usability weakness ofthe AutoPilot-based tool flow is not AutoPilotitself—it’s the Xilinx RTL tools. We had assumed thatAutoPilot would largely abstract the user from theprocess of going from RTL to the FPGA bitstream,and, as mentioned earlier, that turned out to be a falseassumption. Generating the final FPGA implementa-tion requires the user to create scripts that run someparts of the Xilinx tools, and more importantly, canrequire the user to manually integrate and run vari-ous RTL blocks together.

Though it is relatively straightforward to get goodresults from AutoPilot, the remaining task of gettingfrom RTL to bitstream using the Xilinx RTL toolchain was time-consuming and not a process thatmost DSP software engineers would be equipped tohandle without assistance from an FPGA expert. It’sclear that, while AutoPilot can produce efficientresults and significant improvements in productivity,it does not eliminate the requirement for an FPGAexpert as part of the team. Along these same lines, forcurrent FPGA users, AutoPilot accelerates severalsignificant portions of the design process, but doesnot address all important aspects of the process.

Conclusions BDTI’s earlier benchmarking of FPGAs and DSP

processors showed large performance and cost-per-formance advantages for FPGAs on some applica-tions when the FPGA implementations were createdusing traditional RTL design techniques. The newanalysis presented here shows that FPGAs canachieve similar performance and cost performanceadvantages when used with the AutoESL AutoPilothigh-level synthesis tool. In addition, we found thatAutoPilot can achieve quality of results equivalent tohand-written RTL code. We were surprised andimpressed by the quality of results that AutoPilot wasable to produce, given that this has been a historicweakness for high-level synthesis tools in general.

FPGA designs done using traditional hand-writ-ten RTL coding typically take much more effort thanthe equivalent application implemented in softwareon a DSP processor. Therefore, perhaps the most sur-prising outcome of this project was that it tookroughly the same effort to implement the evaluationworkload on the FPGA using the AutoPilot plus Xil-inx RTL tools as it took on the DSP processor. Thisis a significant breakthrough, and one with the poten-


tial to have a major impact on the design of high-per-formance embedded computing applications. Giventhe FPGA’s advantages in speed and cost-perfor-mance, we expect that the availability of AutoPilotwill significantly change the trade-offs between DSPsand FPGAs for certain types of applications.

Overall, our analysis indicates that the AutoESL-Xilinx tool chain used with the Spartan-3A DSP 3400FPGA yields much better performance and cost/per-formance than the TI DSP processor on the work-load we considered, while offering similardevelopment effort. And when compared with hand-written RTL, AutoPilot was able to deliver equiva-lent results. In exchange for these advantages, userswill pay much more for the tool chain and will needFPGA expertise on the development team.

We expect that many system designers will behappy to make that trade-off.

TABLE 6. BDTI High-Level Synthesis Tool Certification Program Results Skills Required

Platform Required Skill Set

AutoESL AutoPilot high-level synthesis tool

• Application expertise• C programming• Hardware architecture fundamentals• Algorithmic optimization and restructuring

Xilinx RTL tools

• FPGA architecture details• Basic RTL knowledge• RTL tools• Devices and interfaces

TI DSP tools

• Application expertise• C and assembly programming• Processor chip architecture details• Algorithmic optimization and restructuring• Devices and drivers


APPENDIX A: Workload Details

The BDTI Optical Flow Workload™A block diagram of the BDTI Optical Flow

Workload is shown in Figure 2. The BDTI Optical Flow Workload is a video pro-

cessing application suitable for implementation on anFPGA or high-performance processor. The term“optical flow” (or “optic flow”) refers to a class ofvideo processing algorithms that analyze the motionof objects and object features (such as edges) within ascene.

The BDTI Optical Flow workload operates on a720p resolution (1280×720 progressive scan) inputvideo sequence and produces a series of two-dimen-sional matrices characterizing the apparent verticaland horizontal motion within the sequence. As men-tioned in the main text of this white paper, there aretwo Operating Points associated with the BDTIOptical Flow Workload, each of which uses the samealgorithm:

• Operating Point 1 is a fixed workload defined asprocessing video with 720p resolution (1280×720progressive scan) at 60 frames per second. Theobjective for Operating Point 1 is to achieve therequired throughput while minimizing resourceutilization. Resource utilization refers to the per-centage of total processing engine resourcesrequired to implement the workload.

• Operating Point 2 is defined as the maximumthroughput capability of the workload imple-mentation on the target device (measured inframes per second) for 720p resolution (1280×720progressive scan). The objective for OperatingPoint 2 is to maximize the throughput (measuredin frames per second) using all available deviceresources.

Quality of results scores for Operating Points 1and 2 are based on results obtained when using aBDTI-proprietary video clip.

The block diagram shown in Figure 2 character-izes the implementation of the BDTI Optical FlowWorkload in the C language reference implementa-tion included with the BDTI specification package.However, it is not required that final optimizedimplementations of the workload maintain this struc-ture. Blocks may be merged or restructured toimprove the efficiency of the final implementation,provided all acceptance criteria are met (e.g., all testvectors pass).

Note that, at the level of this block diagram, theworkload is characterized by a feed-forward chain ofsignal processing blocks. The feed-forward nature ofthe application enables straightforward pipelining ofworkload implementations to increase throughput.The recommended data widths at the input and out-put of each block are specified by BDTI.

A brief description of each of the five main func-tional blocks in the workload follows.

Gradient CalculationThe gradient calculation block computes lumi-

nance gradients of the incoming video sequence in thehorizontal, vertical, and time dimensions. It is imple-mented as a one dimensional FIR filter applied to thevideo sequence in each of the respective dimensionsto produce three 1280×720 gradient matrices.

Gradient WeightingThe gradient weighting block smoothes the gradi-

ent matrices computed in the gradient calculationblock using a separable two-dimensional FIR filter.The separable filter is implemented in the referencecode using one-dimensional horizontal and verticalFIR filters applied to each of the three gradient matri-ces.

Outer ProductThe outer product calculation is implemented as

an element-wise product of each of the weighted gra-dient matrices as shown below:

FIGURE 2. Block Diagram of the BDTI Optical Flow Workload

Image Sequence

1.Gradient

Calculation

2.Gradient

Weighting

3.Outer

Product

4.Tensor

Calculation And

Normalization

5.Velocity

CalculationY

Vx

Vy

Gy

Gx

Gz

Gx'

Gy'

Gz'

Txx

Tyy

Txy

Txz

Tyz

Pxx

Pyy

Pxy

Pxz

Pyz

Image Sequence

1.Gradient

Calculation

2.Gradient

Weighting

3.Outer

Product

4.Tensor

Calculation And

Normalization

4.Tensor

Calculation And

Normalization

5.Velocity

Calculation

5.Velocity

CalculationY

Vx

Vy

Gy

Gx

Gz

Gx'

Gy'

Gz'

Txx

Tyy

Txy

Txz

Tyz

Pxx

Pyy

Pxy

Pxz

Pyz


Pxx = Gx' * Gx'Pyy = Gy' * Gy'Pxy = Gx' * Gy'Pxz = Gx' * Gz'Pyz = Gy' * Gz'

Where Gx', Gy', and Gz' are each 1280×720 gradi-ent matrices.

Tensor Calculation and NormalizationThe tensor calculation is a two-dimensional, sepa-

rable FIR filter. It is implemented in the referencecode using one-dimensional horizontal and verticalFIR filters, applied to each outer product.

Tensor normalization manages data word widthsby scaling the five tensors at each pixel position. Thetensors at each pixel position are normalized based onthe largest tensor magnitude within a small neighbor-hood of pixels. The size of the neighborhood is datadependent, but this block is designed to havebounded computational requirements, guaranteeingthat a real-time implementation is feasible.

Velocity CalculationThe velocity calculation computes the horizontal

and vertical components of velocity at each pixelposition, and is implemented as follows:

velocityX = (Tyz*Txy-Txz*Tyy) / (Txx*Tyy-Txy*Txy)

velocityY = (Txz*Txy-Tyz*Txx) / (Txx*Tyy-Txy*Txy)

Where each T in the above equations is a 1280×720matrix output from the tensor calculation.

The BDTI DQPSK Receiver Workload™A block diagram of the BDTI DQPSK Receiver

Workload is shown in Figure 3. The BDTI DQPSK

Receiver Workload is a wireless communicationsapplication suitable for implementation on an FPGAor a high performance digital signal processor. Theworkload includes classic communications blocksthat can be found in many wireless receivers in vari-ous forms and complexities.

The BDTI DQPSK Receiver Workload is a fixedworkload with a single Operating Point defined asprocessing an input stream of complex modulateddata at 18.75 Msamples/second with the receiverchain clocked at 75 MHz. The correspondingDQPSK demodulated output bitstream is4.6875 Mbits/second. The objective for this work-load is to minimize the FPGA resource utilizationneeded to achieve the specified throughput. Resourceutilization refers to the percentage of total processingengine resources required to implement the work-load.

The high-level block diagram shown in Figure 3characterizes the implementation of the workload inthe C language reference implementation includedwith BDTI’s specification package. Blocks may bemerged or restructured to improve the efficiency ofthe final implementation, provided all acceptance cri-teria are met (e.g., all test vectors pass). The recom-mended data widths at the input and output of eachblock are specified by BDTI.

A brief description of each of the five main func-tional blocks in the workload follows.

Matched FilterThe matched filter block implements a square

root raised cosine (SQRC) FIR filter with selectableroll-off factor. The input to the block is a complexDQPSK modulated signal at four times the symbolrate. The output of the block is a complex filtered sig-nal at four times the symbol rate.

FIGURE 3. Block Diagram of the BDTI DQPSK Receiver Workload

1 .M a tch ed

F ilte r

I

Q

2 .C arrie r

R ecove ry

coe ff_ in d

3 .T im ing

R ecov e ry

4 .D Q P S KD e m od

5 .V ite rb i

D e code r

6 .D efram e r

de fra m ed _d a ta

fram e_s ize

C o n tro l R e g is te rs

s oft_va lue co rr_ th resho ld

fou nd_ w o rd s1 .

M a tch edF ilte r

I

Q

2 .C arrie r

R ecove ry

coe ff_ in d

3 .T im ing

R ecov e ry

4 .D Q P S KD e m od

5 .V ite rb i

D e code r

6 .D efram e r

de fra m ed _d a ta

fram e_s ize

C o n tro l R e g is te rs

s oft_va lue co rr_ th resho ld

fou nd_ w o rd s


Carrier RecoveryThe carrier recovery block implements a modified

Costas Loop to estimate, correct, and track errorsbetween the transmitter and receiver carrier frequen-cies.

The input to this block is the complex output ofthe SQRC at four times the symbol rate. The outputis a complex sample corrected by the estimated phaseerror at four times the symbol rate.

Timing RecoveryThe timing recovery block implements a modified

Mueller and Mueller algorithm for symbol timingrecovery. The timing error is calculated using the cur-rent received and detected (sliced) symbols, and previ-ously received and detected symbols. The block alsoincludes an interpolator that interpolates the receiveddata value at the estimated sampling time using thenewly received data and the previously sampled data.

The input to this block is a phase corrected com-plex sample at four times the symbol rate from thecarrier recovery block and the output is a complexsample at the estimated sampling time at symbol rate.

DQPSK DemodulatorThe DQPSK demodulator accepts a complex sym-

bol and differentially decodes its phase relative to thepreviously received complex symbol. It then slicesthis phase into corresponding decoded bits.

The input to the DQPSK demodulator is a com-plex value and the output consists of two soft valuescorresponding to the two decoded bits.

Viterbi DecoderThe Viterbi decoder accepts a continuous stream

of soft decisions from the slicer and outputs a streamof decoded binary bits.

DeframerThe deframer is a correlator that operates on

every output bit searching for an “access code.” Theaccess code is a predefined sequence of bits indicatingthe start of the payload.

APPENDIX B: Usability Metrics Details

Usability MetricsThe usability metrics shown in Table 4 are

defined as follows:

Required SkillsIn addition to assigning scores for the usability

metrics listed below, BDTI also identifies the skillsrequired to effectively work with each tool chain. Noscore is provided for this metric.

Out-of-the-Box ExperienceThis includes all activities from unpacking the box

to getting everything set up and installed so that theuser can start the design process. The assessmentincludes items such as: clarity of documentation,smoothness of the installation process, the timerequired to perform the installation, and the helpful-ness of tutorials or demo applications. Note: for theXilinx tools, the Out-of-the-Box Experience wasassessed by DSP software engineers rather than anexperienced FPGA designer.

Ease of UseThis is an assessment of how easy it is to use the

features provided by the tool chain. It is not meant toidentify missing features (this is addressed as part ofthe “completeness of capabilities” metrics). Theassessment includes items such as the intuitivenessand user-friendliness of the user interface, responsive-ness (i.e., whether the tool chain was slow to com-plete actions), reliability (did the tools crash or hang?)and clarity of on-line help.

Completeness of CapabilitiesThis is an assessment of the extent to which the

tool chain includes all capabilities necessary to enablea user to efficiently complete the implementation ofthe workloads.

Quality of Documentation and SupportThroughout the certification process BDTI

assesses the documentation supplied with the toolchain. This includes the quality of the getting startedguide, tutorials, and the ease of finding answers tospecific questions (including technical support).

Efficiency of the Overall Design MethodologyThis is primarily an assessment of user productiv-

ity. It includes assessing the extent to which a user of


the high-level tool is abstracted from the underlyingarchitecture (RTL for FPGA implementations andthe DSP processor core and chip architecture for DSPprocessor implementations). Since this is a broad cat-egory, it is broken up into the following sub-catego-ries:

Learning to Use the Tool:• For FPGA designs: Learning to efficiently use the

high-level synthesis tool. As mentioned above, noscore is provided for learning the Xilinx RTLtools because an experienced BDTI FPGA engi-neer worked with them.

• For DSP designs: As mentioned above, no score isprovided for learning the DSP tools because expe-rienced BDTI DSP engineers worked with them.

First Compiling Version:• The effort required to create an initial functional

implementation of the application. For the FPGAthis includes only the HLS tool—use of the RTLtools to support integration into the FPGA is notincluded. The first compiling version is notexpected to be optimized in terms of performanceor resource utilization, but rather an initial imple-mentation based on which the optimization pro-cess can begin.

Final Optimized Version:• The effort required to take the application from

the first compiling version to a final optimizedimplementation (not including completion ofinterfaces required to run on an actual chip). Forthe FPGA, this does not include final integrationwith the video and memory interfaces, but doesinclude adding C language code required to inter-act with external memory. For the DSP proces-sor, this includes testing via file I/O, but notintegration with external video ports.

Platform Infrastructure Development:• This category includes integration of platform

components that must be incorporated into adesign so that it will run on a physical chip (i.e., aDSP processor or FPGA). This includes, but isnot limited to:• For FPGA implementations: Complete inte-

gration of the memory controller, externalmemory and video I/O.

• For DSP processor implementations: Installa-tion and configuration of drivers and libraries,and interfacing to video I/O.

Extent of Modification to the Original Reference Code

This is an assessment of how closely the code usedto generate the final optimized BDTI Optical FlowWorkload implementation matches the original Clanguage reference code provided by BDTI. This isnot a simple count of the number of lines of code thathave been changed, but rather reflects the complexityand effort involved in making the necessary changes.Typically changes are made for one of the followingreasons:

• Structural changes: Changes to the overall dataflow and code structure required to map the appli-cation the underlying device architecture

• Changes required because of tool restrictions orlimitations

• Timing and resource optimizations on individualblocks

• Interfacing to peripherals and other external mod-ules

BDTI considers factors such as: the number ofchanges, the extent to which the tools automateimplementing these changes, the level of difficulty forthe developer in incorporating the changes and thelevel of difficulty in debugging and testing thechanges.

AutoPilot

Documents

Transcript of AutoPilot