Research Article

Toward the Implementation of an ASIC-Like System on FPGA for Real-Time Video Processing with Power Reduction

Lilia Kechiche,1 Lamjed Touil,2 and Bouraoui Ouni2

1 Laboratory of Electronics and Microelectronics, University of Monastir, Monastir, Tunisia
2 Networked Objects Control and Communication Systems Lab, Sousse, Tunisia

Correspondence should be addressed to Lilia Kechiche; [email protected]

Received 4 January 2018; Revised 15 February 2018; Accepted 1 March 2018; Published 22 April 2018

Academic Editor: Michael Hubner

Copyright © 2018 Lilia Kechiche et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Driven by the importance of energy consumption as an evaluation factor in system-on-chip design, this paper presents a system-level design methodology to optimize power consumption on an ARM-based architecture for real-time video processing. The proposed design flow is based on the interaction between tool and user optimizations. The tool optimizations are the options and best practices available in the integrated design environment for the Xilinx technology and the target Zynq-7000 architecture. The user optimizations are methods proposed by the user to optimize power consumption; here they rely on the principles of voltage scaling and frequency scaling. These two techniques allow energy to be consumed in proportion to the work to be done. The suggested flow is applied to a real-time video processing system. The results show power savings of up to 60% while respecting performance and real-time constraints.

1. Introduction

Embedded real-time video applications are becoming widespread and are used in more and more systems. They have important applications in various domains such as segmentation [1], object tracking [2], visual detection and matching [3], and stereo vision [4]. These systems are generally executed in an embedded environment and are subject to many constraints such as power consumption, time, and resource constraints.

To develop and implement real-time high-quality applications, a combination of dedicated technologies and methodologies has been proposed in order to obtain optimized systems with the required performance. The technologies used range from processors such as General Purpose Processors (GPPs), Graphic Processing Units (GPUs), and Digital Signal Processors (DSPs) to parallel architectures like Application Specific Integrated Circuits (ASICs) or even programmable logic devices (FPGAs).

Today, FPGAs are increasingly used to build complex video processing applications. They give real-time performance that is hard to achieve with GPPs or DSPs [5, 6], while limiting the extensive design work required for ASICs. Furthermore, FPGAs enable the implementation of highly parallel architectures thanks to the huge amount of programmable logic available on a chip [7].

One of the major problems that FPGAs face compared with ASICs is power consumption, which becomes a limiting factor [8]. Therefore, more effort is being spent on proposing designs with low power dissipation.

Since FPGAs are based on CMOS transistors, power consumption can be divided into two main categories, static and dynamic. The static power is dissipated when the circuit is in a quiescent state. It is caused by the leakage currents of CMOS transistors: the subthreshold leakage current, the gate leakage current, and the junction leakage current [9]. Dynamic power consumption is given by

$$P_d = \alpha \cdot C \cdot V^2 \cdot f, \qquad (1)$$

where $C$ is the capacitance, $\alpha$ is the charging rate (it depends on the clock frequency), $V$ is the supply voltage, and $f$ is the clock frequency.

Hindawi, International Journal of Reconfigurable Computing, Volume 2018, Article ID 2843582, 11 pages, https://doi.org/10.1155/2018/2843582

This power is highly dependent on the technology and has become a concern with modern FPGAs, which are implemented in a 28 nm technology.

To reduce power consumption, the principal causes of power dissipation have to be investigated during all the steps of the design process of integrated circuits, from the algorithmic level down to the transistor level. Designers should consider the latest methods in low-power technology at all design levels. Power savings can be realized by making the right choices at every abstraction level.

In this paper, we propose a design methodology for real-time video processing that optimizes power consumption and yields an ASIC-like system on FPGA. This methodology is applied to a high-performance video system handling 1920 × 1080 pictures at 60 frames/sec. The paper is organized as follows: the second part reviews the state of the art of some methods used to reduce static and dynamic power in CMOS devices. The third part presents a real-time video processing system. The fourth part gives the suggested design methodology leading to an ASIC-like system. The fifth part contains the experimental results and discussion, and the sixth part concludes the work.

2. FPGA Power Reduction: State of the Art

Most modern FPGA boards are computing platforms that include programmable hardware elements, memory resources, configurable I/O, embedded processors, and even embedded operating systems running within the FPGA. These platforms allow a high level of system integration, with the ability to combine related peripheral devices and embedded processor cores with hardware functions. The challenge in providing such complete platforms with hardware and software functionalities is to combine hardware performance with software flexibility. This combination comes at the expense of high power consumption, which is the major limit of FPGA platforms.

To overcome this limit, researchers continually try to find new methods to optimize power consumption. The proposed methods have explored all the abstraction levels of the design process, from the system level to the circuit and technological levels. This section is dedicated to some works and methods for power reduction at various design levels.

Some works [10–12] have used HW/SW partitioning to put forward optimal systems with low power consumption, taking power-aware decisions at a very early stage of the design process. HW/SW partitioning is the problem of assigning application tasks to the existing computational cores under defined constraints such as area and power. It is formalized as an optimization problem that aims to minimize an objective function under defined constraints. The authors in [10] proposed an algorithm for HW/SW partitioning to find the best tradeoff between power and latency. They modeled the application as a data flow graph and computed the latency and power consumption for every suggested partitioning. The proposed algorithm made a heuristic search for the best solution that respected the defined constraints.

The authors in [12] modeled the system as a data flow graph and used the bee colony algorithm to solve the optimization problem of HW/SW partitioning under power and time constraints. Heuristic algorithms were used because the optimization problem is NP-hard and an exact resolution might take too long. To adjust the constraints according to the user's wishes, weighting coefficients were added to the constraints to specify which of the two conflicting terms would be more important for the final partitioning result.
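As an illustration of the weighted formulation, the following minimal sketch enumerates every HW/SW assignment of a small task set and keeps the one with the lowest weighted cost. The task names, the per-task power and latency figures, the additive cost model, and the weights `w_power` and `w_time` are hypothetical. The cited works use heuristic searches on data flow graphs rather than this exhaustive enumeration, which is only practical for a handful of tasks.

```python
from itertools import product

# Hypothetical per-task estimates: (power in mW, latency in ms) for each mapping.
TASKS = {
    "filter":  {"HW": (120.0, 2.0), "SW": (60.0, 9.0)},
    "scale":   {"HW": (90.0, 1.5),  "SW": (40.0, 6.0)},
    "overlay": {"HW": (70.0, 1.0),  "SW": (30.0, 3.0)},
}

def best_partition(tasks, w_power, w_time, max_latency):
    """Exhaustively search HW/SW assignments for the lowest weighted cost.

    Returns (cost, mapping, power, latency), or None if no assignment
    satisfies the latency constraint.
    """
    names = list(tasks)
    best = None
    for choice in product(("HW", "SW"), repeat=len(names)):
        power = sum(tasks[n][c][0] for n, c in zip(names, choice))
        # Assumption: tasks run one after another, so latencies add up.
        latency = sum(tasks[n][c][1] for n, c in zip(names, choice))
        if latency > max_latency:
            continue  # violates the time constraint
        cost = w_power * power + w_time * latency
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)), power, latency)
    return best
```

Raising `w_time` relative to `w_power` pushes the result toward hardware mappings, which is the role the weighting coefficients play in the description above.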

As HW/SW partitioning is a high-level method used at the system level, the metrics of the problem are computed using estimations and assumptions. These estimations make the proposed solution very abstract, with low accuracy when confronted with the constraints of a real-world implementation.

Moving deeper into the design process, other researchers have used methods at the architecture level such as dynamic voltage scaling (DVS) and dynamic frequency scaling (DFS). These methods adjust voltage and frequency dynamically at run time. They aim to save power by enabling the device to regulate its frequency and voltage based on operating conditions. These methods need deep knowledge of the working blocks and their frequencies.

The DVS, the DFS, or even Dynamic Voltage and Frequency Scaling (DVFS) were first proposed to reduce the power consumption of microprocessors [15–17] and, as they were successful, they were generalized and used on FPGAs [18, 19]. The authors in [15] propose a method of static timing analysis for dynamic scheduling schemes in order to use the DFS. They propose a safe timing analysis for systems with off-chip and hierarchical memories where memory latency does not scale with processor frequency. This method, called “frequency-aware,” replaces the Worst Case Execution Time (WCET) obtained by static timing analysis. It expresses WCET bounds with frequency-sensitive parameters, where cycles are interpreted in terms of the processor frequency and memory accesses are expressed in terms of the memory latency overhead. The new WCET is determined on the fly for a given frequency.

The authors of [16] addressed the problem of real-time systems with time-critical applications. They put forward a new algorithm for the DFS, which is applied directly to the scheduler by modifying the OS's scheduler and task management service. First, a static frequency was assigned to every task. This frequency was the lowest possible frequency that allowed the scheduler to meet the deadlines for a given task set.

If a task finished before its worst-case specification, the frequency was recomputed using the latest information. The new value was used until the task was released for a future invocation. The frequency is therefore recomputed at every new scheduling point using the actual execution time of completed tasks and the specified worst case for the others. This method provided a power reduction of 20 to 40%.
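The static step and the recomputation described above can be sketched as follows. This is a minimal illustration that assumes an earliest-deadline-first scheduler, for which a periodic task set is schedulable when its total utilization, scaled by the relative frequency, does not exceed 1, and that assumes execution time scales inversely with frequency. The task tuples, the frequency list, and the function names are illustrative and are not taken from [16].

```python
def utilization(tasks):
    """Total utilization at full speed; tasks are (wcet_at_fmax, period) pairs."""
    return sum(c / t for c, t in tasks)

def lowest_feasible_frequency(tasks, frequencies):
    """Return the lowest frequency that keeps the EDF utilization test satisfied.

    Assumption: execution time scales inversely with frequency.
    """
    f_max = max(frequencies)
    u = utilization(tasks)
    for f in sorted(frequencies):
        if u * f_max / f <= 1.0:
            return f
    return None  # not schedulable even at f_max

def recompute_after_completion(tasks, actual, frequencies):
    """Re-run the selection after some tasks finished early.

    `actual` maps a task index to its measured execution time; the other
    tasks keep their specified worst-case value.
    """
    updated = [(actual.get(i, c), t) for i, (c, t) in enumerate(tasks)]
    return lowest_feasible_frequency(updated, frequencies)
```

With three tasks of (execution time, period) equal to (3, 8), (3, 10), and (1, 14) and relative frequencies 0.5, 0.75, and 1.0, the static choice is 0.75; if the first task then completes in 1 time unit instead of 3, the recomputed choice drops to 0.5.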

The authors of [17] proposed a DVFS technique for non-real-time applications. The main idea relies on lowering the CPU frequency when accessing off-chip peripherals. The temporal distribution of the on-chip and off-chip workload is computed for different scenarios to determine the CPU frequency during idle periods. The energy savings went up to 70%, with a variable performance penalty depending on the saving value.

In [18], the authors described a methodology for DVS on commercial FPGAs. They used a circuit to measure the logic delay, which is used by the voltage controller block to dynamically adjust the supply voltage in a closed loop. The voltage controller was an external module implemented on a PC. Experimental results using a Xilinx Virtex XCV300E FPGA were presented, and the power savings were between 4% and 54%. Although the authors said that implementing the voltage controller as an internal module is simple, giving results and statistics based on the external module is disputable, since an internal implementation may cause delay violations or use more resources.

The authors in [19] suggested a method for dynamic voltage scaling. The critical path is used to locate the in situ detectors, which are used by the logic delay measurement circuit to identify the critical point of operation. The voltage is adjusted, and then a suitable corresponding operating frequency is identified in a closed-loop configuration. As the Virtex-5 XC5VLX110T platform used did not allow voltage scaling, the authors redesigned the DC-to-DC module that supplies the Vccint voltage to the FPGA.

In the next sections, we present the proposed HW/SW architecture for a real-time video platform and the methods used for optimization.

3. Hardware Software Architecture for Real-Time Video Processing

Real-time video processing plays a key role in industrial systems, as it is expanding into many fields such as multimedia-based electronic products, video surveillance systems, and cell-phone cameras.

Real-time video processing systems process huge amounts of image data in a short time. The purpose varies from a simple display to the extraction of useful information for intelligent scene analysis. Digital videos are multidimensional signals, which makes them data intensive and resource demanding, as they need a significant amount of resources for computations and memory operations. For example, at full HDTV resolution, a 1920 × 1080 digital image frame with 3 bytes per pixel contains 1920 × 1080 × 24 bits of data, that is, about 49.8 million bits per frame. Furthermore, at 60 frames/sec, the large dimension of digital video demands the processing of approximately 124 million pixels per second.
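The figures for this example follow from simple arithmetic. The short sketch below, whose function name is ours, computes the bits per frame and the pixel and bit rates for a given format; it counts active pixels only and ignores blanking intervals.

```python
def video_stream_figures(width, height, bytes_per_pixel, fps):
    """Return bits per frame, pixels per second, and bits per second."""
    pixels_per_frame = width * height
    bits_per_frame = pixels_per_frame * bytes_per_pixel * 8
    return {
        "bits_per_frame": bits_per_frame,
        "pixels_per_second": pixels_per_frame * fps,
        "bits_per_second": bits_per_frame * fps,
    }

# Full HD, 3 bytes per pixel, 60 frames/sec
figures = video_stream_figures(1920, 1080, 3, 60)
print(figures)
```

For the full HD case this gives 49,766,400 bits per frame, 124,416,000 pixels per second, and close to 3 Gbit per second of raw data.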

While designing FPGA systems for real-time video processing, the following issues have to be considered:

(i) How to propose an architecture that can fulfill the performance required by the different communicating blocks: the blocks of a video processing system are subject to timing constraints for high-bandwidth, data-intensive operations and need to access the memory at the video rate.

(ii) How to propose an architecture that can be extended when needed: extension is the ability to add any block related to the video processing system within the limit of available resources.

(iii) How to get an optimal use of available resources: partitioning the system between HW and SW resources makes it possible to combine the high performance of hardware with the flexibility of software.

Figure 1: Block diagram of implemented design. (Blocks shown: ARM A9, central interconnect, DDR controller, DDR3, AXI3, AXI4, HP and GP ports, AXI-VDMA, video-in, video to stream, custom processing, video-out.)

In previous works [20, 21], we suggested various real-time video processing architectures with a variety of power saving techniques. In [20], the authors put forward a hardware architecture for video zoom-in processing with power reduction. The proposed hardware architecture was based on the MicroBlaze 32-bit microprocessor and used the Processor Local Bus (PLB), an older protocol adopted by Xilinx before moving to the Advanced eXtensible Interface (AXI) protocol. To reduce power consumption, the authors proposed a clock controller block, which checks every input before processing. A new input goes through processing only if it differs from the previous one; otherwise, the blocks are deactivated. In [21], the authors explored different design methodologies at different levels of abstraction. They discussed how the choice of the design methodology has an impact on the proposed architecture. Platform-based design was proposed as a design methodology and was tested on a real-time video processing system. Both works were implemented on a Virtex-5 platform.
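The input-comparison idea behind the clock controller of [20] can be modeled in software as below. This is only a behavioral sketch: the class name, the `process` callback, and the reuse of the previous output when the input repeats are our assumptions, and the actual block in [20] is a hardware module that gates clocks rather than a Python object.

```python
class ClockController:
    """Behavioral model: run the processing stage only when the input changes."""

    def __init__(self, process):
        self._process = process      # the processing function being gated
        self._last_input = None
        self._last_output = None
        self.activations = 0         # how many times processing was enabled

    def submit(self, frame):
        # A repeated input leaves the processing blocks deactivated and
        # returns the result already computed for the previous input.
        if self.activations and frame == self._last_input:
            return self._last_output
        self._last_input = frame
        self._last_output = self._process(frame)
        self.activations += 1
        return self._last_output
```

Counting `activations` against the number of submitted frames gives a rough idea of how often the processing stage could stay idle for a given video sequence.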

The choice of the appropriate target platform is essential for real-time systems. Many existing FPGA platforms are suitable for real-time video processing, as they can work at frequencies exceeding 150 MHz and can therefore support HD resolutions. In addition, FPGA vendors have incorporated various hardware and software IP cores for the most important functions needed for video processing, such as timing generation and RGB conversions.

Figure 1 shows the block diagram of the proposed architecture implementation. The hardware, or programmable logic (PL), is used to implement video-related blocks like video-in, video processing subsystems, and video-out. The software part, or processing system (PS), is responsible for system-level control registers, DMA controllers, and the accelerator coherency port (ACP). The memory interface enables the PS and the blocks implemented in the PL to access the memory. The use of the PS and the PL to implement video processing modules depends on the available resources, performance constraints, power constraints, and other issues like flexibility and time to market.

Choosing the best hardware platform to implement a real-time video processing engine is challenging. In this work, the Zynq ZC702 based Xilinx evaluation kit is chosen as the target platform. It includes software, hardware, and IP components that facilitate the development of custom video applications. In this design, we use the AXI protocol [22], which is part of ARM AMBA, a family of microcontroller buses, and was adopted by Xilinx with Spartan-6 and later platforms.

The AXI interconnect, AXI3, and AXI Video DMA IP cores can form the basis of video systems capable of handling video frame buffers and giving access to a shared DDR3 SDRAM [23–25]. The AXI Video Direct Memory Access (VDMA) core implements a video-optimized direct memory access engine with frame buffering. The AXI VDMA core transfers video data to and from memory under dynamic software control. The processor can access the PL using the AXI3 master General Purpose (GP) interfaces. The AXI HP interface allows the PL to access the DDR memory through high-bandwidth data path bus masters.

The PS is used to initialize blocks and to master memory access. The custom processing block represents any kind of processing that can be added to the video system. This processing can be a low-level treatment, which changes the video directly, or a high-level treatment, which aims to extract semantic information. This block represents the extensible part of the video system.

The video-out block is responsible for displaying the video on a monitor.

The next section presents the methods used to optimize power consumption on the proposed architecture.

4. Real-Time Video Processing System Design for Power Optimization

As shown in the introduction, power consumption is a key factor in the evaluation of embedded systems. A challenge for researchers as well as for system designers is to propose an architecture with minimal power consumption. The design of an ASIC-like system needs tailored power optimization decisions that respect the system requirements and performance. The design process has to start by defining the different application requirements. Figure 2 shows the steps followed during the design process in order to propose an ASIC-like system. The figure shows a typical FPGA design process and its interaction with the power-aware decisions. At the specification step, the device is selected and the system requirements are defined. A thermal power and supply current allocation is done at this step. After the architecture is proposed and the hardware and software tasks are defined, the designer adds the appropriate power-aware decisions. These decisions can be tool methods or user methods. The tool methods are high-level decisions available within the design tool that the user can select as optimization decisions. When chosen, these decisions are added automatically to the design during the synthesis and implementation steps. The user methods are optimization strategies suggested and implemented by the user; they can be application-specific or general. The power-aware decisions are followed by test and verification to check that the system performance is not violated. This verification uses the physical information that becomes available after the synthesis, placement, and routing steps. At these stages, a more accurate power estimation can be made, so the designer can change a decision if the power budget is exceeded or performance is violated.

Figure 2: Power in the FPGA design process. (Steps shown: specification, power metrics definition, power optimization decisions, optimized architecture, synthesis, placement, routing, power estimation, performance check, lab testing.)

The validation of the power-aware methods is based on power estimations at every design stage. Xilinx provides two different tools for power estimation: Vivado Power Analysis (VPA) and the Xilinx Power Estimator (XPE). XPE is a power estimation tool used in the predesign and preimplementation phases. Based on the architecture and the device, it helps to select the power supply and to specify I/O loading, design resource usage, and activity rates. The VPA allows power estimation at all stages of the design flow: synthesis, placement, and routing. It is most accurate after routing, since it can read the exact logic and routing resources used from the implemented design database.

4.1. Tool Optimization Method. The FPGA market is largely dominated by two FPGA companies, Xilinx and Altera. Each company provides its own platforms and the design tools used for these platforms. As a result, the tool optimizations are platform-specific and tool-specific. They are applied to specific details of the design. In this work, we target Xilinx FPGAs and hence use the Vivado design tool. The Vivado Integrated Design Environment (IDE) is a tool that offers many options to help designers during the design process.

The available optimization options are as follows:

(i) Intelligent clock gating: this method is used to reduce dynamic power consumption by turning off the clock. Clock gating can be general or specific to a block. General clock gating is provided by the Vivado IDE to automatically reduce the switching factor $\alpha$ defined in (1). This option can be selected either during the preplacement savings or during the postplacement savings.

(ii) BRAM savings: BRAMs are configurable memory modules compatible with a variety of BRAM interface controllers. The Vivado IDE offers power optimization on the available Xilinx 7 series devices for BRAMs in true dual port mode. The write mode can be changed safely for both reading and writing operations if the BRAM output is not connected or not needed during writing or reading operations. These optimizations are performed as long as they do not affect user functionality and performance. This option can be explicitly enabled and disabled during the preplacement savings. The designer can exclude a part of the design from the specified optimizations when it is necessary to protect a defined critical path that may be changed by the power savings.

(iii) Standby mode: this option puts the CPU in standby mode and starts it up when an event or interrupt is detected. In this mode of operation, the device is powered up, but the majority of its clocks are gated off. This puts the processor in a static state where the only consumption is caused by leakage currents and by the logic responsible for detecting wakeup conditions.

4.2. User Methods for Power Optimization. User methods are all the methods that can be added to optimize the power consumption of the whole system. Two well-known methods are voltage scaling and frequency scaling.

The DVS or the DFS is the process of varying the voltage or the frequency of a target processor voltage domain or frequency domain. As voltage and frequency are directly related to power consumption through (1), reducing the voltage gives a quadratic reduction of power consumption, while reducing the frequency gives a linear reduction.
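The quadratic and linear dependencies can be checked numerically with the small sketch below, which evaluates equation (1) directly. The function names and the scaling factors used in the example are ours; the sketch covers dynamic power only and ignores static leakage.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic power P_d = alpha * C * V^2 * f, as in equation (1)."""
    return alpha * capacitance * voltage ** 2 * frequency

def relative_saving(v_scale, f_scale):
    """Fraction of dynamic power saved when V and f are multiplied by the given factors."""
    return 1.0 - (v_scale ** 2) * f_scale

# Lowering only the voltage by 10% saves about 19% of dynamic power,
# while lowering only the frequency by 10% saves about 10%.
print(round(relative_saving(0.9, 1.0), 3), round(relative_saving(1.0, 0.9), 3))
```

Applying both reductions together compounds the effect, since the saving is one minus the product of the squared voltage factor and the frequency factor.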

To use the DVS and the DFS on a target FPGA, the following definitions have to be considered [26]:

(i) The process strength of a device is defined as the ability and degree of variation in the attributes of the integrated transistors of a device. There are three classes: weak, nominal, and strong. A weak device is able to operate at the lowest acceptable frequency at nominal voltages. Strong devices can run at faster frequencies than required at nominal voltages and can function at voltages lower than the nominal values at the minimum specified frequency.

(ii) The voltage domain is defined as the group of modules sharing the same power supply voltage for the core logic of each device.

(iii) The operating performance point is the voltage required for every device to be able to operate at a desired processor clocking frequency. For example, for the processing system DDR I/O supply voltage, the minimum value is 1.14 V and the maximum is 1.89 V [27]. This information can be found in the technical report of the target FPGA device.

(iv) The critical path of the system is defined as the longest path for achieving the execution of a program from the source to the sink of a data flow graph. This information is used to track the critical tasks when scaling the voltage and the frequency, to ensure the performance of the system.

    Inputs: V_min_x, V_op_x, st;
    Begin
      Identify voltage bounds for the target block;
      Do
        V_op = V_op − st;
        Test performance;
        If test failed
          V_op = V_op + st;
          Break;
        End if
      While (V_op > V_min_x)
    End

Algorithm 1: Voltage scaling algorithm.

4.2.1. Dynamic Voltage Scaling. In this paper, the DVS is used for power saving. Let $V_{\min,x}$ be the minimum voltage required for the block $x$ to operate in normal conditions, below which the system fails. These values are characteristics fixed at the manufacturing of the device. Let $V_{\mathrm{op},x}$ be the operating voltage of the block $x$. The algorithm used to scale the voltage is given in Algorithm 1, where st is a float representing the step used to increment and decrement the voltage.
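The search performed by Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' code: `test_performance` is a hypothetical caller-supplied check, and the loop tests a candidate voltage before committing to it, which yields the same result as stepping down and then stepping back up after a failed test. Working in integer millivolts is our choice to avoid floating-point drift.

```python
def scale_voltage(v_op, v_min, step, test_performance):
    """Lower the voltage step by step until the test fails or v_min is reached.

    `test_performance(v)` must return True when the block still meets its
    requirements at voltage v. All voltages share one unit (e.g. millivolts).
    """
    while v_op - step >= v_min:
        candidate = v_op - step
        if not test_performance(candidate):
            break            # keep the last voltage that passed the test
        v_op = candidate
    return v_op

# Hypothetical block that stops meeting its requirements below 950 mV.
print(scale_voltage(1000, 900, 10, lambda mv: mv >= 950))
```

The returned value is the lowest tested voltage at which the performance check still passed, which is then the value written to the regulator.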

On the ZC702 board, power is supplied to the components through a number of independent rails using programmable power regulators (UCD9248) and a Power Management Bus (PMBus) compliant system controller from Texas Instruments. The PMBus is an open standard protocol that defines communication with power converters. It allows writing and reading power, current, and voltage information. The processors can access the PMBus through the 1-to-8 I2C switch present on the board. The Zynq XC7Z020 device is divided into several power domains. These power domains are supplied by 6 regulators, which generate the different voltages required by the Zynq and the onboard components. The supplied voltages and currents are continuously measured and monitored by the three UCD9248 power controllers available on the ZC702 board.

In this work, the voltage of hard blocks is scaled using the PMBus protocol, which simplifies communication with the power converters of a power system. It enables software or hardware in the device to access a manageable power supply [28–30].
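The paper does not list the PMBus commands it issues. As a hedged illustration of what setting a rail voltage involves, the sketch below encodes a voltage in the standard PMBus Linear16 format and builds the payload of a VOUT_COMMAND write. The command codes are those of the PMBus specification; the helper names, the example VOUT_MODE value 0x14 (exponent −12), and the omission of the actual I2C transfer and of the PAGE selection are our assumptions.

```python
PMBUS_PAGE = 0x00          # selects the rail (page) on a multi-output controller
PMBUS_VOUT_MODE = 0x20     # reports the data format and the exponent
PMBUS_VOUT_COMMAND = 0x21  # sets the output voltage

def linear16_exponent(vout_mode):
    """Extract the 5-bit two's-complement exponent from a VOUT_MODE byte."""
    exp = vout_mode & 0x1F
    return exp - 32 if exp > 15 else exp

def encode_linear16(volts, exponent):
    """Convert volts to the 16-bit unsigned mantissa: volts = mantissa * 2**exponent."""
    mantissa = round(volts / (2.0 ** exponent))
    if not 0 <= mantissa <= 0xFFFF:
        raise ValueError("voltage not representable with this exponent")
    return mantissa

def decode_linear16(mantissa, exponent):
    """Convert a Linear16 mantissa back to volts."""
    return mantissa * (2.0 ** exponent)

def vout_command_bytes(volts, vout_mode):
    """Build the payload (command code, low byte, high byte) of a VOUT_COMMAND write."""
    word = encode_linear16(volts, linear16_exponent(vout_mode))
    return bytes([PMBUS_VOUT_COMMAND, word & 0xFF, word >> 8])
```

In a real system, the exponent would be read back from the controller with VOUT_MODE and the rail selected with PAGE before the write; both steps depend on the I2C driver in use and are left out here.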

The optimal operating voltages are computed using Algorithm 1. The system is then configured with these values once, at every start of the system. The configuration and the implementation of the proposed algorithm are done by the processor in software. More details about the scaled blocks are given in the experimental results.

4.2.2. Frequency Scaling. Frequency scaling is the ability to change the frequency of a block without degrading the performance of the system.

As we target a defined reconfigurable platform, which is the Zynq, the analysis of the technical details of the clocking system is essential to determine to which level the scaling is possible.

In this work, we test two methods for frequency scaling: scaling the frequency of the processor, which we call the DFS, and scaling the frequency of the blocks implemented on the PL, which we call adaptive frequency scaling (AFS). Whether applying the DFS or the AFS, an analysis of the different execution paths is essential to ensure that the system performance is not violated.

Let $O = \{o_1, o_2, \ldots, o_n\}$ be a set of $n$ tasks defining an application $A$. The meaning of a task varies according to the chosen granularity; it can be an instruction, a portion of code, or a block. Let $G$ be a Control Data Flow Graph (CDFG). $G$ is defined as $G(E, V)$ with

$$V = \{v_1; v_2; \ldots; v_n\}, \qquad E = \{e_{ij} \mid 1 \le i, j \le n\}. \qquad (2)$$

$G(E, V)$ is a directed graph that describes the dependencies between the tasks $o_i$ of an application, $V$ is the set of nodes, and $E$ is the set of edges.

Every task $o_i$ is mapped to a node $v_i$, and $e_{ij}$ is the edge linking task $o_i$ to task $o_j$.

Let $S = \{s_1, s_2, \ldots, s_m\}$ be the set of paths from the source to the sink, with $s_i = \{v_1 - v_j - v_k - \cdots - v_n\}$, $1 \le i \le m$. Every path starts with the source node $v_1 \in V$ and ends with the sink node $v_n \in V$.

Every task $o_i$ has a Best Case Execution Time (BCET) and a Worst Case Execution Time (WCET). The WCET of a program is the WCET of the longest path from the source to the sink, $\mathrm{WCET} = \max\{\mathrm{wcet}_i \mid 1 \le i \le m\}$, and the BCET of an application is the BCET of the longest path, $\mathrm{BCET} = \max\{\mathrm{bcet}_i \mid 1 \le i \le m\}$.
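These definitions can be illustrated with a small sketch that enumerates the source-to-sink paths of an acyclic graph and takes the maximum over them. The graph, the per-task times, and the function names are hypothetical, and computing the execution time of a path as the sum of the times of its tasks is our assumption. Enumerating all paths is only practical for small graphs.

```python
def enumerate_paths(edges, source, sink):
    """Yield every source-to-sink path of an acyclic graph given as {node: [successors]}."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == sink:
            yield path
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, path + [nxt]))

def path_bounds(edges, wcet, bcet, source, sink):
    """Return (BCET, WCET, critical_path) as maxima over all paths."""
    paths = list(enumerate_paths(edges, source, sink))
    # Assumption: the time of a path is the sum of the times of its tasks.
    path_wcet = [sum(wcet[v] for v in p) for p in paths]
    path_bcet = [sum(bcet[v] for v in p) for p in paths]
    worst = max(range(len(paths)), key=path_wcet.__getitem__)
    return max(path_bcet), path_wcet[worst], paths[worst]

# Hypothetical four-task graph with two paths from v1 to v4.
edges = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"]}
wcet = {"v1": 2, "v2": 5, "v3": 3, "v4": 1}
bcet = {"v1": 1, "v2": 2, "v3": 3, "v4": 1}
print(path_bounds(edges, wcet, bcet, "v1", "v4"))
```

The path returned alongside the WCET is the critical path, which is the one to monitor when scaling voltage or frequency.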

The BCET occurs when everything goes as expected, without exceptions, and the WCET is due to unexpected scenarios like resource occupation, memory refresh, or pipeline hazards.

Timing analysis and the computation of the WCET help to define the bounds of the execution time of a task on a particular hardware or software platform.

There exist two methods to perform WCET analysis [31]:

(i) Static methods: they take the task code with some annotations, analyze the set of possible paths, and combine the control flow with a model of the target architecture to obtain upper bounds for this combination.

(ii) Measurement-based methods: they execute the task on the target hardware or on a simulator for a vector of inputs.

In this work, we use a measurement-based method to determine the bounds of the system. The computed WCETs are used as input information for the scaling algorithm.

The 7 series FPGA clocking resources [32] manage clocking using clock management tiles (CMTs), which provide clock frequency synthesis, deskew, and jitter filtering. Each CMT consists of one phase-locked loop (PLL) and one mixed-mode clock manager (MMCM). The PLLs and MMCMs are used to synthesize frequencies over a wide range. The PS clocks are derived from one of three programmable PLLs. These three PLLs drive the clocks of the CPU, DDR, and I/O; each of them is loosely associated with the clocks in the CPU, DDR, and peripheral subsystems.

The PL has its own independent clock management, generation, and distribution structures. It also receives four clock signals from the PS side. These signals are completely asynchronous to each other and have no relation with other PL clocks. Communication between the PL and the PS is possible through functional interfaces such as AXI interfaces, interrupts, and clocks, and through configuration signals [27]. The Dynamic Reconfiguration Port (DRP) allows software to configure hardware using a DRP interface, which is an AXI4-Lite interface. The DRP also enables generating custom clocks on the fly through a dynamic partial reconfiguration of PLL or MMCM modules. The relation between the input frequency and the output frequency is given by

$$F_{\mathrm{clkout}} = F_{\mathrm{clkin}} \cdot \frac{D}{M}, \qquad (3)$$

where $D$ and $M$ are two frequency dividers programmable through the DRP.
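Choosing the two programmable values for a target frequency can be sketched as a small search. The code below reads equation (3) as $F_{\mathrm{clkin}} \cdot D / M$, which is our reading of the printed formula; the function names and the integer ranges passed for `d_range` and `m_range` are hypothetical, and real devices impose further limits (for example on the internal oscillator frequency) that are not modeled here.

```python
def output_frequency(f_in, d, m):
    """Equation (3) as read from the text: F_clkout = F_clkin * D / M."""
    return f_in * d / m

def closest_setting(f_in, f_target, d_range, m_range):
    """Pick the (D, M) pair giving the lowest output that is not below f_target.

    Returns (f_out, d, m), or None if no pair reaches the target.
    """
    best = None
    for d in d_range:
        for m in m_range:
            f_out = output_frequency(f_in, d, m)
            if f_out < f_target:
                continue  # too slow to guarantee performance
            if best is None or f_out < best[0]:
                best = (f_out, d, m)
    return best

# Hypothetical search: 100 MHz input, 150 MHz target, values from 1 to 4.
print(closest_setting(100.0, 150.0, range(1, 5), range(1, 5)))
```

Rejecting settings below the target mirrors the requirement that scaling must never violate the performance of the system.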

The DFS is performed at run time with frequencies that conform to the processor mode in use, whereas the AFS is done according to the input data size. As the target is a real-time video processing system, the input varies according to the size of the video. The operating frequency of the PL blocks therefore also varies with the video size: an HD video of 1360 × 768 does not require the same processing frequency as a full HD video of 1920 × 1080.

The AFS allows the system to operate at the minimum frequency needed to guarantee that performance is not violated.
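A minimal sketch of this selection is given below. It assumes that the PL processes one pixel per clock cycle and that the required rate is the total frame size, blanking included, multiplied by the frame rate; the function names and the list of available frequencies are hypothetical and do not come from the paper.

```python
def required_pixel_clock(h_total, v_total, fps):
    """Pixel clock in Hz needed to deliver one full frame (with blanking) per frame period."""
    return h_total * v_total * fps

def select_pl_frequency(h_total, v_total, fps, available_hz):
    """Return the lowest available PL clock that still meets the required pixel rate."""
    needed = required_pixel_clock(h_total, v_total, fps)
    candidates = [f for f in sorted(available_hz) if f >= needed]
    return candidates[0] if candidates else None

# 1080p60 uses a total raster of 2200 x 1125, i.e. a 148.5 MHz pixel clock.
print(select_pl_frequency(2200, 1125, 60, [100_000_000, 150_000_000, 200_000_000]))
```

A smaller input format yields a lower required rate, so a lower entry of the list is selected and the dynamic power of the PL blocks drops accordingly.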

5. Experimental Results

The proposed architecture is prototyped on the Zynq ZC702 evaluation board. The ZC702 provides a hardware environment for developing and evaluating designs targeting the Zynq XC7Z020 device. It includes 1 GB of DDR3 component memory, 128 Mb of Quad SPI flash memory, USB 2.0, a Secure Digital (SD) connector, an HDMI codec, an I2C bus, and 2 UART interfaces.

This section gives the experimental results and shows the benefits of the proposed design flow in power reduction. A performance comparison is also carried out to verify the performance of the whole system.

Figure 3: Simplified block diagram of proposed system. (Blocks shown: Zynq processing system, AXI interconnect, AXI performance monitor, and a video pipeline with Video On Screen Display, RGB YCrCb space converter, chroma resampler, video timing controller, and AXI4-Stream to Video-out.)

5.1. Hardware and Software Architecture. The target system is designed using the Vivado Design Suite, which is used to design, integrate, and implement systems on Zynq-7000 All Programmable, Xilinx 7 series, and UltraScale devices. With this design suite, implementation can be accelerated by place and route tools that analytically optimize multiple concurrent design metrics, such as timing and power. Vivado makes it possible to analyze the design at each stage, which allows modifications earlier in the design process. It also provides timing and power estimations after synthesis, placement, or routing.

Figure 3 illustrates a simplified block diagram of the hardware implementation of the video processing architecture. The figure is limited to the interface connections between blocks. It contains the Zynq7 processing system, a video pipeline, AXI interconnects, an AXI performance monitor, and video-related blocks.

(i) The processing system is used to initialize the blocks and to master memory accesses. The frequency and voltage scaling modules are also implemented on the processing system.

(ii) The AXI interconnects are responsible for carrying information to and from the processing system.

(iii) The AXI performance monitor gathers statistics on the AXI protocol interfaces. It is used to monitor transactions and external system events and to measure performance for AXI4, AXI3, AXI4-Stream, and AXI4-Lite interfaces. It captures real-time performance metrics for throughput and latency. In this work, we used the performance monitor IP to perform real-time profiling using the SDK.

(iv) The video pipeline contains the Video Direct Memory Access (VDMA), which is a soft IP core that provides high-bandwidth direct memory access for AXI4-Stream video peripherals. The AXI4-Lite slave interface is used for initialization and for register and status access.

(v) The video-related blocks are standard blocks used in a video chain; they transfer the video stream after making the necessary conversions. The RGB to YCrCb block converts the pixels from the RGB format used by the AXI4-Stream to the 16-bit YUV 4 : 4 : 4 signal format. The Chroma Resampler block takes the output of the RGB to YCrCb block and converts it to YUV 4 : 2 : 2, which is the format required by the HDMI output.
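The two conversions described in item (v) can be sketched in software as follows. This is only an illustration of the arithmetic: it assumes the full-range ITU-R BT.601 coefficients and averages each horizontal pair of chroma samples, whereas the hardware IP cores may be configured for a different standard, range, or filter. All function names are hypothetical.

```python
def _clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion of one 8-bit RGB pixel (assumed standard)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return _clamp8(y), _clamp8(cb), _clamp8(cr)


def resample_444_to_422(line):
    """Convert one line of (Y, Cb, Cr) pixels to 4:2:2.

    Each output element is (Y0, Y1, Cb, Cr): two luma samples sharing one
    chroma pair obtained by averaging. An even line length is assumed.
    """
    out = []
    for i in range(0, len(line) - 1, 2):
        (y0, cb0, cr0), (y1, cb1, cr1) = line[i], line[i + 1]
        out.append((y0, y1, (cb0 + cb1 + 1) // 2, (cr0 + cr1 + 1) // 2))
    return out


line_rgb = [(255, 255, 255), (0, 0, 0), (255, 0, 0), (255, 0, 0)]
line_444 = [rgb_to_ycbcr(*pixel) for pixel in line_rgb]
print(resample_444_to_422(line_444))
```

Halving the horizontal chroma resolution is what reduces the stream from three samples per pixel to two on average.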

Table 1 gives the resource utilization of the proposed architecture and of some of its blocks.

The software part running on the PS is built using the Xilinx Software Development Kit (SDK). The SDK is an embedded environment for creating and testing software platforms and applications targeting Xilinx embedded processors. This software environment works with hardware designs created with Vivado. In this work, the software part runs on the ARM processor with a standalone operating system and is responsible for controlling the hardware.

5.2. Tool Optimizations Results. Figures 4(a) and 4(b) show, respectively, the power consumption and the resource utilization for various optimization approaches using tool optimizations. The groups of values are named as follows:

(i) No optimization (no-opt): synthesis and implementation without any additional optimization

(ii) Preplacement power gating (pre-opt): a power optimization added by the tool and performed before the placement step

(iii) Postplacement clock gating (post-opt): a power optimization added by the tool after the placement step

(iv) Low DDR mode (low-ddr).

The maximum power savings are obtained with the low DDR mode. The tool optimizations allow power savings of up to 7%.

5.3. Voltage Scaling Results. The voltage scaling method is implemented as software running on the processor. The communication with the voltage rails is done over the I2C bus. To avoid unnecessary additional resources, the optimal values are computed using extensive test scenarios. This analysis defines the optimal values that will be used by the system, without the need to collect them at run time using additional resources.

The DVS method requires knowledge of the voltage bounds of the different blocks. Table 2 [27] shows the minimum, typical, and maximum recommended operating voltages for some blocks of the ZC702 device.
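A minimal sketch of how the bounds in Table 2 could guard a voltage request before it is sent over the I2C bus is given below. The rail names and limits are copied from Table 2; the function name and the decision to reject (rather than clamp) out-of-range requests are assumptions made for illustration, and the actual I2C write to the power controller is deliberately left out.

```python
# Recommended operating range per rail in volts (min, typ, max), from Table 2.
RAIL_LIMITS_V = {
    "Vccpint": (0.95, 1.00, 1.05),
    "Vccint": (0.95, 1.00, 1.05),
    "Vccpaux": (1.71, 1.80, 1.89),
    "Vccaux": (1.71, 1.80, 1.89),
    "Vccbram": (0.95, 1.00, 1.05),
}


def check_voltage_request(rail, volts):
    """Return volts if it lies within the recommended range, else raise."""
    v_min, _typ, v_max = RAIL_LIMITS_V[rail]
    if not v_min <= volts <= v_max:
        raise ValueError(f"{rail}: {volts} V outside [{v_min}, {v_max}] V")
    return volts


# Requesting the minimum recommended value for the PL internal supply.
print(check_voltage_request("Vccint", 0.95))
```

Rejecting a request such as 0.5 V on 𝑉ccpint keeps the system inside the range for which device reliability is specified.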

Although the voltage can be lowered below the values indicated in the safety bounds (e.g., the minimal 𝑉ccpint value can reach 0.5 V) [27], the safe operation of the device outside the indicated bounds has not been tested. The technical report [27] indicates that exposure to maximum values for extended periods of time might affect device reliability.

Figure 4: Resource utilization for different saving approaches. (a) Power consumption (mW) for the tool-based approaches; (b) resource utilization (slice LUTs, slices, LUT flip-flops, slice registers). Series in both panels: no-opt, post-opt, pre-opt, low-ddr.

Table 1: Resource utilization.

| Block | Slice LUTs | Slice registers | Slice | LUT Flip Flop | Block RAM | DSPs |
|---|---|---|---|---|---|---|
| The total system | 10823 | 14524 | 4689 | 5121 | 10.5 | 19 |
| AXI performance monitor | 2787 | 3504 | 1135 | 1079 | 0 | 0 |
| Clock reset | 19 | 40 | 14 | 15 | 0 | 0 |
| Processing system | 538 | 655 | 226 | 284 | 0 | 0 |
| Video-out | 239 | 355 | 126 | 103 | 1 | 0 |
| RGB to YCrCb | 365 | 335 | 144 | 163 | 0 | 4 |
| Video Timing Controller | 131 | 212 | 78 | 39 | 0 | 0 |
| Video pipe (VDMA) | 5886 | 8371 | 2514 | 3042 | 9.5 | 12 |

Table 2: Voltage recommended values for some blocks of the target Zynq ZC702 device.

| Block | Description | Minimum operating voltage (V) | Typical operating voltage (V) | Maximum operating voltage (V) |
|---|---|---|---|---|
| Vccpint | PS internal supply voltage | 0.95 | 1.00 | 1.05 |
| Vccint | PL internal supply voltage | 0.95 | 1.00 | 1.05 |
| Vccpaux | PS auxiliary supply voltage | 1.71 | 1.80 | 1.89 |
| Vccaux | PL auxiliary supply voltage | 1.71 | 1.80 | 1.89 |
| Vccbram | PL block RAM supply voltage | 0.95 | 1.00 | 1.05 |

Figure 5 highlights the voltage scaling results for some 𝑉cc rails. Each panel indicates the power margin for every chosen value. For example, for the 𝑉ccint rail, when the minimum operating value is chosen (0.95 V), the power consumption of this block varies between 22 and 31 mW.

5.4. Frequency Scaling and Design Performances. Frequency scaling varies according to the target architecture: there is the frequency of the PL blocks and the frequency of the PS. Defining the scaling values requires an analysis of how the system operates. To do so, we use the System Performance Analysis (SPA) toolbox available in the SDK development tool. This toolbox allows exploring the performance of hardware and software at run time. Observing the system performance at critical stages makes it possible to take the right optimization decisions and refine the system without degrading its performance.

Figure 6 illustrates the CPU utilization rate. We use only one core in the proposed architecture. The graph shows that the utilization rate is between 91% and 100%. This shows that applying DFS on a processor executing a real-time video system does not allow a noticeable optimization, as the CPU utilization is usually at its maximum.


Figure 5: Voltage scaling results on different rails. Each panel plots power (mW) against the minimum, typical, and maximum voltage (V): (a) 𝑉ccint rail; (b) 𝑉ccbram rail; (c) 𝑉ccaux rail; (d) 𝑉ccpint rail.

Figure 6: CPU utilization. The plot shows CPU utilization against time.

Figure 7: AFS power results. The plot shows power (W) against video size (1920/1080, 1280/1024, 1024/768, and 800/600).

We therefore propose the AFS, which scales the frequency according to the input size.

Figure 7 shows the result of the AFS on different video input sizes. Power savings are up to 12%.

5.5. Comparison with Other Works. Voltage and frequency scaling, for both the hard and soft parts of commercial FPGAs, has been implemented by several researchers using different methods. In this part, we compare our work with two existing works, which are the main works in the literature that focus on frequency and voltage scaling for commercial 28 nm FPGAs. The authors used voltage and frequency scaling in [13] and added logic scaling in [14] to optimize power consumption. The voltage scaling and frequency scaling were performed through two separate units. In [13], the DVS unit was composed of a MicroBlaze processor, a dual-port RAM, and an I2C IP core. It allowed accessing the configuration of the PMBus power rails and monitoring them. Power and voltage values were controlled and recorded at run time. The MicroBlaze received the power and voltage values and communicated with the PMBus via I2C to write new values. The dialog between the MicroBlaze and the power rails was done through commands written in C.

The frequency scaling unit utilized a PicoBlaze 8-bit microcontroller. Frequency scaling was performed by communicating with the off-chip Silicon Labs Si570, an oscillator programmable at run time, to scale the frequency to the desired values. The communication was done through the I2C protocol.

In [14], the authors used voltage, frequency, and logic scaling to optimize power consumption. The voltage scaling unit is identical to the one used in [13]. The frequency scaling unit utilized a ROM containing the configuration parameters used by the MMCM to generate the clock for the user logic. The frequency was decreased continually at run time until a value causing timing violations was detected.


Table 3: Comparison with different works.

| Work | Additional resources | Frequency scaling method | Voltage scaling method | Test method | Power achievements |
|---|---|---|---|---|---|
| Work 1 [13] | MicroBlaze, PicoBlaze, I2C IP core, Dual Port RAM | I2C communication with Si570 oscillator | I2C communication with PMBus | Random functions with different sizes | Up to 64.98% (using voltage and frequency scaling) |
| Work 2 [14] | MicroBlaze, I2C IP core, Dual Port RAM, ROM | Configuration of MMCM ROM | I2C communication with PMBus | Random functions with different sizes | Up to 60% (using frequency, voltage, and logic scaling) |
| Our work | I2C IP core | Soft configuration of the MMCM | Soft configuration of PMBus | Real-time video application with real-time constraints | Up to 60% (using frequency and voltage scaling and tool optimizations) |

Compared with the previously cited works, we use adaptive frequency and voltage scaling as user methods in addition to the tool methods. The two previous works collected information such as the frequency and voltage of the functioning blocks at run time and scaled them accordingly. Collecting information at run time requires extra resources and adds complexity to the algorithms that manage this information. The two works applied the proposed methods to test modules, which they called Power Consuming and Speed Testing Modules (PCASTMs). These PCASTMs were instantiated in various numbers to occupy different percentages of the device. The proposed tests were tasks without real-time requirements. In addition, for voltage scaling, they tested voltage values below the recommended values given in the datasheet. Testing with values below the recommended minimum could give larger power savings, up to 70% [33], but with doubtful safety for long-term operation under real-time constraints.

In our work, the idea is simple: as the different execution scenarios are known in advance, the information is collected at design time using tests and timing analysis to determine the optimal operating points. The proposed method requires only an I2C IP core, and its implementation has low complexity. In addition, scaling the voltage many times at run time is not useful in our opinion, as accessing the power rails itself leads to power and heat dissipation. Configuring the device blocks with the optimal voltage at start-up is more efficient. Table 3 summarizes the three works.

6. Conclusion

Driven by the complexity of SoC design and the strong competition among SoC vendors, the design of an optimized system is very challenging. With increasing consumer demands and the saturation facing Moore's law, power consumption has become a major challenge, and designers have to find new solutions to it. In this work, we have proposed a design methodology that brings together tool and user optimizations. These methods have been used to design a real-time video processing system. The work has been implemented on the Zynq ZC702 board with an ARM Cortex-A9 target processor. The tool methods are specific to the design tool and work with a target platform. The user methods are general and can be applied to other real-time video processing systems. The adaptive frequency scaling and the dynamic voltage scaling are controlled by the ARM processor. Compared with existing works, this methodology has allowed up to 60% power savings with minimal additional resources. Future work will address the application of other methods, such as clock gating, and their integration into the design of real-time video systems.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] M. Genovese and E. Napoli, “ASIC and FPGA implementation of the gaussian mixture model algorithm for real-time segmentation of high definition video,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 3, pp. 537–547, 2014.

[2] M. Happe, E. Lubbers, and M. Platzner, “A self-adaptive heterogeneous multi-core architecture for embedded real-time video object tracking,” Journal of Real-Time Image Processing, vol. 8, no. 1, pp. 95–110, 2013.

[3] J. Wang, S. Zhong, L. Yan, and Z. Cao, “An embedded system-on-chip architecture for real-time visual detection and matching,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 525–538, 2014.

[4] W. Wang, J. Yan, N. Xu, Y. Wang, and F.-H. Hsu, “Real-Time High-Quality Stereo Vision System in FPGA,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 10, pp. 1696–1708, 2015.

[5] J. Fowers, G. Brown, P. Cooke, and G. Stitt, “A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications,” in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’12), pp. 47–56, ACM, Monterey, Calif, USA, February 2012.

[6] K. Pauwels, M. Tomasi, J. Díaz Alonso, E. Ros, and M. M. Van Hulle, “A Comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features,” IEEE Transactions on Computers, vol. 61, no. 7, pp. 999–1012, 2012.

[7] M. Baklouti, Y. Aydi, P. Marquet, J. L. Dekeyser, and M. Abid, “Scalable mpNoC for massively parallel systems-Design and implementation on FPGA,” Journal of Systems Architecture, vol. 56, no. 7, pp. 278–292, 2010.

[8] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215, 2007.

[9] Altera, 40-nm FPGA Power Management and Advantages, Altera Corporation, 2008.

[10] S. Ben Haj Hassine, M. Jemai, and B. Ouni, “Power and execution time optimization through hardware software partitioning algorithm for core based embedded system,” Journal of Optimization, vol. 2017, Article ID 8624021, 11 pages, 2017.

[11] H. Han, W. Liu, W. Jigang, and G. Jiang, “Efficient algorithm for hardware/software partitioning and scheduling on MPSoC,” Journal of Computers (Finland), vol. 8, no. 1, pp. 61–68, 2013.

[12] L. Kechiche, L. Touil, and B. Ouni, “SOPC for real time multi-video treatments with QoS requirements,” International Journal of Advanced Intelligence Paradigms, 2017.

[13] A. F. Beldachi and J. L. Nunez-Yanez, “Run-time power and performance scaling in 28 nm FPGAs,” IET Computers & Digital Techniques, vol. 8, no. 4, pp. 178–186, 2014.

[14] J. Luis Nunez-Yanez, M. Hosseinabady, and A. Beldachi, “Energy Optimization in Commercial FPGAs with Voltage, Frequency and Logic Scaling,” IEEE Transactions on Computers, vol. 65, no. 5, pp. 1484–1493, 2016.

[15] K. Seth, A. Anantaraman, F. Mueller, and E. Rotenberg, “FAST: frequency-aware static timing analysis,” in Proceedings of the 24th IEEE International Real-Time Systems Symposium, pp. 40–51, Cancun, Mexico.

[16] P. Pillai and K. G. Shin, “Real-time dynamic voltage scaling for low-power,” in Proceedings of the eighteenth ACM symposium on Operating systems principles, pp. 89–102, Alberta, Canada, October 2001.

[17] K. Choi, R. Soma, and M. Pedram, “Fine-grained dynamic voltage and frequency scaling for precise energy and performance tradeoff based on the ratio of off-chip access to on-chip computation times,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 1, pp. 18–28, 2005.

[18] C. T. Chow, L. S. M. Tsui, P. H. W. Leong, W. Luk, and S. J. E. Wilton, “Dynamic voltage scaling for commercial FPGAs,” in Proceedings of the 2005 IEEE International Conference on Field Programmable Technology, pp. 173–180, Singapore, December 2005.

[19] J. L. Nunez-Yanez, “Adaptive voltage scaling with in-situ detectors in commercial FPGAs,” IEEE Transactions on Computers, vol. 64, no. 1, pp. 45–53, 2015.

[20] L. Touil, L. Kechiche, and B. Ouni, “Design of low power system on programmable chip for video zoom-in processing,” Turkish Journal of Electrical Engineering & Computer Sciences, vol. 24, no. 5, pp. 3405–3418, 2016.

[21] L. Kechiche, L. Touil, and B. Ouni, “Real-time image and video processing: Method and architecture,” in Proceedings of the 2nd International Conference on Advanced Technologies for Signal and Image Processing, ATSIP 2016, pp. 194–199, Tunisia, March 2016.

[22] Xilinx, AXI Reference Guide UG761 (v13.4), Xilinx, 2012.

[23] Xilinx, ZC702 Evaluation Board for the Zynq-7000 XC7Z020 all programmable SoC, UG850, Xilinx, 2015.

[24] Xilinx, Zynq-7000 all Programmable SoC technical reference manual, UG585, Xilinx, 2016.

[25] J. Lucero and B. Slous, Designing High-Performance Video Systems with the Zynq-7000 All Programmable SoC Using IP Integrator, XAPP1205, Xilinx, 2014.

[26] F. Dehmelt, Adaptive (Dynamic) Voltage (Frequency) Scaling: Motivation and Implementation, Texas Instruments, 2014.

[27] Xilinx, “Zynq-7000 all programmable SoC (Z-7007S, Z-7012S, Z-7014S, Z-7010, Z-7015, and Z-7020),” in DC and AC Switching Characteristics, Xilinx Inc., 2017.

[28] A. F. Beldachi and J. L. Nunez-Yanez, “Accurate power control and monitoring in ZYNQ boards,” in Proceedings of the 24th International Conference on Field Programmable Logic and Applications, FPL 2014, Germany, September 2014.

[29] J. Nunez-Yanez, “Adaptive voltage scaling in a heterogeneous FPGA device with memory and logic in-situ detectors,” Microprocessors and Microsystems, vol. 51, pp. 227–238, 2017.

[30] Xilinx, Zynq 7000 All Programmable SOC PCB Design Guide (UG933), Xilinx, 2016.

[31] R. Wilhelm, J. Engblom, A. Ermedahl et al., “The worst-case execution-time problem-overview of methods and survey of tools,” ACM Transactions on Embedded Computing Systems, vol. 7, no. 3, article no. 36, 2008.

[32] Xilinx, 7 Series FPGAs Clocking Resources user guide (UG472), Xilinx, 2017.

[33] M. Hosseinabady and J. L. Nunez-Yanez, “Run-time power gating in hybrid ARM-FPGA devices,” in Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL’14), 6, 1 pages, Germany, September 2014.
