8/3/2019 6 Month Technical Report
1/23
RECONFIGURABLE SYSTEM FOR VIDEO PROCESSING
(SIX MONTH TECHNICAL REPORT)
Student:
BEN COPE
Supervisors:
PROF PETER CHEUNG & PROF WAYNE LUK
Research Group:
CIRCUITS & SYSTEMS, EEE
Submission Date:
FRIDAY 22ND APRIL, 2005
2 Literature Survey
This section presents a literature survey of the use of FPGAs and graphics hardware for video processing. Firstly, current architectures using GPUs (Graphics Processing Units) and FPGAs for video processing are examined, followed by interconnect structures (including Networks on Chip). FPGA architectural features which make the device adaptable to such applications are then described, along with tools used for debugging and programming.
The implementation of video processing applications is moving away from being predominantly software based towards more hardware based solutions. This can be seen with the first video cards: all image processing was performed before the data reached the card, whereas today much more of the processing, such as lighting effects, is performed on the card itself. Graphics hardware has progressed from pure display hardware to allowing user programmability, which in turn has led to non-graphics applications.
There have also been advances in research into interconnect structures: the bus is no longer seen as the only way to connect hardware cores together. Switch-boxes and networks are emerging, and it is likely that more topologies will be considered in the future, with some of today's new ideas becoming commonplace.
2.1 Current Architectures
This section covers the use of GPUs and FPGAs individually for video applications, and discusses the possibility of interlinking these modules.
GPU Architectures:
Emerging Field: Research into the use of GPUs for general-purpose computation began on the Ikonas machine[1], developed in 1978. This machine was used for the Genesis planet sequence in Star Trek: The Wrath of Khan, and has also been used in TV commercials and the Superman III video game, highlighting early on the potential for graphics hardware to do much more than output successive images. GPUs were first used for a non-graphics application (www.GPGPU.org) by Lengyel et al. in 1990, for robot motion planning, marking the start of the era of GPGPU (General-Purpose computation on GPUs).
Recent Developments: Trendall and Stewart in 2000[2] gave a summary of the calculations possible on a GPU, including a real-time calculation of refractive caustics. These capabilities have progressed further since, with more on-board memory
and greater processing ability. More recently, Moreland and Angel[3] implemented an FFT routine on the GPU, performing FFT, filtering and IFFT on a 512x512 image in under 1 second with a GeForce 5800. This has been made possible by the graphics hardware manufacturers (Nvidia and ATI being the largest) allowing programmers more control over the GPU. This is facilitated through shader programs (released in 2001), written in languages such as Cg (C for Graphics) by Nvidia or DirectX by Microsoft, which have since been enhanced for greater user control (DirectX 9.0 now allows 128-bit precision, i.e. 32 bits per RGBA pixel component[4]). There is still room for progress: some hardware functionality remains hidden from the user, there is no pointer support, and debugging is difficult. Recognition for non-graphical applications of GPUs was given at the Siggraph / Eurographics Hardware Workshop in San Diego (2003), showing its emergence as a recognised field.
Nvidia: The intentions of the manufacturers are clear: in an article in July 2002[5], Nvidia's CEO announced teaming with AMD to develop nForce, capable of handling multimedia tasks and bringing theatre-style DVD playback to the home computer. Previously the CPU offloaded tasks, such as video processing, onto plug-in cards, which were later shrunk and encapsulated within the CPU. This minimisation was beneficial to the likes of Intel and less so to GPU producers such as Nvidia. The implementation of multimedia applications on graphics cards (and their importance to the customer) means the screen is now seen as the computer, rather than the network it sits on. In the development of Microsoft's Xbox more money was given to Nvidia than to Intel; this trend is likely to continue, showing a power shift towards graphics card manufacturers.
Performance: The rate of increase in processing performance of GPUs has been 2.8 times per year since 1993[4] (compared to 2 times per 1.5 years for CPUs, according to Moore's Law), a trend which is expected to continue until 2013. The GeForce 5900 performs at 20 GFLOPS, equivalent to a 10 GHz Pentium processor[4]. This shows the potential of graphics hardware to out-perform CPUs, with a new generation being unveiled every 6 months; TFLOP performance is expected from graphics hardware in 2005. For example, Strzodka and Garbe's implementation of motion estimation[6] out-performs a P4 2 GHz processor by 4 to 5 times (with a GeForce 5800 Ultra).
Increases in performance also benefit filtering operations: the previously mentioned FFT implementation[3] showed the potential for frequency domain filtering. The number of computations required for filtering is reduced by performing it in the frequency domain, from an O(NM^2) problem to an O(NM) + FFT + IFFT (about O(MN(log(M) + log(N)))) one. Moreland and Angel implemented clever tricks with indexing (dynamic programming), frequency compression and
splitting the 2D problem into two 1D problems to achieve this speed-up. With the rapidly increasing power of graphics cards it can be expected that the computation time will fall below 1 second, allowing real-time processing. A final factor which aids this is that 32-bit precision calculations, vital for such computations, are now possible on GPUs.
Cost: Another benefit of the GPU is cost: a top-end graphics card (capable of near-TFLOP performance) can be purchased for less than 300. Such a card performs equivalently to image generators costing 1000s in 1999[4]. This gives the opportunity for real-time video processing capabilities on a standard workstation.
Parallelism: The architecture of graphics hardware is equivalent to that of stretch computers (designed for fast floating-point arithmetic). They use stream processing, requiring a sequence of data in some order. This method exploits the dataflow in the organisation of the processing elements to reduce caching (CPUs are typically 60% cache[4]). Other features are the exploitation of the spatial parallelism of images and the fact that pixels are generally independent.
Strzodka and Garbe, in their paper on motion estimation and visualisation on graphics cards[6], show how a parallel computer application can be implemented in graphics hardware. They identify GPUs as not the best solution, but as having a better price-performance ratio than other hardware solutions. In such applications the data stream, rather than instructions, controls the flow, facilitating the cache benefit above. Moreland and Angel[3] go further in branding the GPU no longer a fixed-pipeline architecture but a SIMD (Single Instruction-stream, Multiple Data-stream) parallel processor, which highlights its flexibility in the eyes of the programmer.
How to program: There are 2 programming sources in graphics hardware programming[6]:
Flowware: assembly and direction of dataflow
Configware: configuration of processing elements
In an FPGA these 2 features are implemented together, whereas in graphics hardware they are explicitly separate. Careful implementation of GPU code is necessary for platform (e.g. DX 9.0) and system (e.g. C++) independence. APIs also handle Flowware and Configware separately. This becomes important if considering programming FPGAs and GPUs simultaneously.
FPGA Architectures:
ASIC solutions for processing tasks are optimal in speed, power and size[7]; however, they are expensive and inflexible. DSPs allow for more flexibility and can be reused for many applications, but are energy inefficient and can cause delays if not optimised per task. For these reasons it is often favourable to implement such applications in a reconfigurable unit.
An Alternative: When deciding which hardware to use for a graphics sub-system there is a trade-off between operating speed and flexibility. To maximise its benefit, an FPGA implementation must give more flexibility than custom graphics processors and be faster than a general-purpose processor. The need for flexibility is justified as one may need to change an algorithm (e.g. a compression standard) post-manufacture. By utilising its re-programmability a small FPGA can appear as a large and efficient device.
Example: Singh and Bellec in 1994[8] implemented three graphics applications on an FPGA, namely circle outline, filled circle and a fast sphere algorithm. They found a RAM-based FPGA was favourable due to the large storage requirements. The performance in drawing the circle was satisfactory, out-performing a general-purpose display processor (TMS34020) by a factor of 6 (achieving 16 million pixels/sec). It was, however, worse in the application of fast sphere rendering, at only 2627 spheres/sec vs. 14,300 from a custom hardware block. Improvements are expected with new FPGAs, such as the Virtex 4, which have larger on-chip storage and more processing power. FPGAs today also have more built-in blocks to speed up operations such as multiplication (these will be considered later).
Sonic / UltraSonic: Two more possibilities for accelerating video processing which highlight the benefits of a reconfigurable architecture. The hardware is systolic: 1 data item is clocked into and 1 out of the modules on every clock cycle, which maintains a high throughput rate although latency can vary. The challenges involved, highlighted in [9, 10], are: correct hardware and software partitioning, spatial and temporal resolution, hardware integration with software, keeping memory accesses low and real-time throughput. Sonic approaches these challenges with PIPEs (Plug-In Processing Elements), each with 3 main components: an engine (for computations), a router (for routing, formatting and data access) and memory for storing video data.
Typical applications are the removal of distortions introduced by watermarking an image[10], 2D filtering[11] and 2D convolution[12]. In the latter, an implementation at half the clock rate of state-of-the-art technology was adequate, suggesting
a lower-power solution. 2D filtering was split into two 1D filters and showed a 5.5 times speed-up when using 1 PIPE, with greater speed-up from more PIPEs.
Bottlenecks: In contrast to memory, the FPGA's bottleneck isn't bus speed but configuration time. Configurations can be stored in a memory bank and copied into a local cache as required. Singh and Bellec[8] propose partitioning the FPGA into zones, each with good periphery access to the network and a different size. The capability of partial reconfiguration is important here: if a new task is required, only a section of the FPGA need be reconfigured, leaving other sections untouched for later reuse. A practical example of this is seen with the Sonic architecture above: the router and engine are implemented on separate FPGA elements. If a different application required only a different memory access pattern (e.g. the 2x1D implementation of a 2D filter[11]), only the router need be reconfigured; this separation also provides abstraction. Another architecture where the bus bottleneck problem is reduced is seen in [7], where a daughter board is incorporated to perform D/A conversion. The sharing of the data and configuration control path reduces the bottleneck; data loss occurs during the configuration phase, but this is seen as acceptable.
Parallelism: Task-level parallelism is often ignored in designs; by proposing a design method focused on the system dataflow, Sedcole et al.[12] hope to overcome this. Taking Sonic as an example: spatial parallelism is exploited by distributing parts of each frame across multiple hardware blocks (PIPEs in this case). Temporal parallelism can be exploited by distributing entire frames over these blocks. Further, these elements can be grouped to perform bigger tasks; Singh and Bellec[8] similarly suggest grouping zones of a partitioned FPGA in a design. Sedcole et al. propose that the following general issues be considered in such large-scale implementations:
Design complexity
Modularisation - allocation / management of resources
Connectivity / communication between modules
Power minimisation (ties in with low memory accesses)
Hardware or Software: The benefits of a software implementation are seen with irregular, data-dependent or floating-point calculations. A hardware solution is beneficial for regular, parallel computation at high speeds[7]. Tasks must be split optimally between these 2 methods. Advancements in hardware mean that some of the problems with floating-point calculations and the like have been overcome; hardware can now perform equally to or even better than software. The software designer needs
a good software model of the hardware, and the hardware designer requires good abstraction[11]. Hardware acceleration is particularly suited to video processing applications due to their parallelism and relatively simple calculations.
In the Sonic example, PIPEs act as plug-ins, analogous to software plug-ins, which provides an easy path for Sonic into software. This overcomes a previous problem with reconfigurable hardware: that there were no good software models.
Co-operation: Another way to look at the use of FPGAs in a graphics system is as virtual hardware extending the instruction set of a host processor. This idea is approached by Vermeulen et al.[13], where a processor is mixed with some hardware to extend its instruction set. In general this hardware could be another processor or an ASIC component; again there are issues with finding ways to get the components to communicate and work together.
The requirements of a reconfigurable implementation are therefore to be flexible, powerful, low cost, run-time / partially reconfigurable, and to fit in well with software. The current FPGA limitations highlighted by papers [9, 14] are: configuration speed, debugging, number of gates, partial reconfiguration (Altera previously had no support) and the PCI bus bottleneck. These considerations would be important when considering an FPGA implementation alongside other hardware, and some or all of these requirements may also apply to such a mixed system.
2.2 Interconnects
Interconnects currently used for graphics card to processor communications will be discussed, followed by a look at some
System-on-Chip (SOC) and Network-on-Chip (NOC) architectures.
GPU view: GPU components are implemented in conjunction with CPUs, acting as graphics sub-systems working as co-processors to the CPU. To do this a high-speed interface is required, as GPUs can process large amounts of data in parallel, with the required bandwidth doubling every 2 years[15]. The AGP standard has progressed through 1x to the current 8x model (peaking at 2.1 GBytes/sec); however, with new GPUs working at higher bit precisions (128-bit/RGBA in the GeForce 6800 series), greater throughput was required. AGP uses parallel point-to-point interconnections with timing relative to the source. As the transfer speed increased, the capacitance and inductance of the connectors needed to be reduced, which became restrictive past 8x. A new transfer method was required: serial differential point-to-point offers a high-speed interconnect at
The Network: The advantages of a network are that it has high performance / bandwidth, modularity, can handle concurrent communications and has better electrical properties than a bus or switch. As the size of chips increases, global synchrony becomes infeasible, as it takes a signal several clock cycles to travel across a chip. The NOC overcomes this problem by being a GALS (Globally-Asynchronous Locally-Synchronous) architecture.
Dally and Towles[18] propose a mesh-structured NOC as a general-purpose interconnect structure. The advantage of being general purpose is that the frequency of use would be greater, justifying more design effort; the disadvantage is that one could do better by optimising for a certain application (though this may not be financially viable).
In Dally and Towles' example they divide a chip into an array of 16 tiles, numbered 0 through 3 in each axis. Interconnections between tiles are made as a folded torus topology (i.e. in the order 0,2,3,1). This attempts to minimise the number of tiles a packet must pass through to reach its destination. Each tile therefore has N, S, E and W connections, plus an input and an output path to put data into the network or take it out respectively. The data, address and control signals are grouped and sent as a single flit. Area is dominated by buffers (6.6% of tile area in their example). The limitations are opposite to those of computer networks: less constraint on the number of interconnections, but more on buffer space. The network could be run at 4 GB/s (at least twice the speed of the tiles) to increase efficiency; however, this would increase the space required for buffers.
The disadvantage of the above example is that the tiles will not always be the same size, so space would be wasted for smaller designs. Jantsch[19] proposes a solution which overcomes this, using a similar mesh structure. The main differences are that he no longer uses the torus topology but a standard connection to a tile's neighbours, and he provides a region wrapper, placed around a block considerably larger than the others, which emulates the original network being present.
Jantsch suggests 2 possibilities for the future: many NOC designs for many applications (expensive in design time) or 1 NOC design for many applications (inflexible). The latter would justify the design cost; however, one would need to decide on the correct architecture (mix of CPU, DSP, etc.), language (to configure the NOC), operating system (for run-time) and design method for a set of tasks.
There are other suggested interconnect methods: Hemani et al.[20] suggest a honeycomb structure where each component connects to 6 others. Benini and De Micheli[21] introduce SPIN (Scalable, Programmable, Interconnect Network), with a tree structure of routers and the nodes as the leaves of the tree. Dobkin et al.[22] propose a mesh structure similar to Jantsch's, but including bit-serial long-range links; they use a non-return-to-zero method for the bit-serial connection and believe it to be best for NOC. This is a snapshot of NOC ideas, for which there are possibly as many topology proposals as for today's standard computer networks.
2.3 FPGA Internal Structure
FPGAs were first designed to be as programmable as possible, comprising configurable logic blocks and interconnects. As they have developed, manufacturers have introduced standard components into them, such as embedded memory blocks and, in some of the latest Xilinx FPGAs, PowerPC processors. There is potential for future work in this area in the development of new blocks which could be placed into an FPGA to improve functionality. In this section, interesting modules which could be used within FPGAs in the future are considered.
Multipliers: The motivation for embedded multipliers is that binary multiplication implemented in FPGA fabric is often too large and slow. One possible solution is Programmable Array Modules (PAMs); these are fixed in size, however, and waste space if small bit-length multiplications are required. Other solutions are trees or pre-processing methods, although these are difficult to generalise. A better solution, presented by Haynes and Cheung[23], is to use reconfigurable multiplier blocks. They designed a Flexible Array Block (FAB) capable of multiplying two 4-bit numbers; FABs combine to multiply numbers of lengths 4n and 4m, and the 2 input numbers can be independently signed or unsigned. The speed of the FABs is comparable to that of non-configurable blocks, at the cost of being twice the size and having twice the number of interconnects. The latter isn't a problem due to the many metal layers in an FPGA, and the blocks are still smaller than a pure FPGA implementation.
A modification was proposed later by Haynes, Ferrari and Cheung[24], with a design based on the radix-4 overlapped multiple-bit scanning algorithm, which was more speed and area efficient. The MFAB (Modified FAB) multiplies 2 numbers of length 8 together, or shorter numbers with redundancy; the length must be greater than 7 for the MFAB to save space over the FAB. The blocks are 1/30th the size of the equivalent pure FPGA implementation and need only 40% usage to make them a worthwhile asset.
Function Evaluation: A more specific block is one for function evaluation, such as that proposed by Sidahao, Constantinides and Cheung[25]. Previously a Lookup Table (LUT) approach was used; their architecture provides a lower-area solution at the cost of execution speed.
Memory: In video applications the storage of frames of data is important, so it is useful to be able to store this data in memory efficiently. Embedded dual-port RAMs, currently available in devices such as the Xilinx Virtex II Pro family, enable two concurrent accesses. This technology is likely to progress further, perhaps to an Autonomous Memory Block (AMB), proposed by Melis, Cheung and Luk[26], which can generate its own memory addresses.
2.4 Debugging tools / coding
The testing of a hardware module can be split into 2 areas: pre-load and post-load. A downside of FPGAs compared with ASICs is in pre-load testing, specifically back-annotated compared with initial testing. In ASIC design only the wiring capacitance is missing from pre-synthesis tests, whereas in FPGA design the module placement is decided at synthesis, drastically affecting timing.
Pre-load: The most widely known pre-load test environments are ModelSim (Xilinx) and Quartus (Altera). COMPASS (Avant!) is an automated design tool, creating a level of abstraction for the user. The benefits are highlighted by Singh and Bellec in 1994[8]: the user can enter a design as a state machine or dataflow, and therefore implement at the system level rather than a lower (e.g. VHDL) level.
Post-load: The issue of post-load testing is currently approached by using part of the FPGA space for a debugging environment, invoked during on-board test. A previously popular test strategy was Bed of Nails, where pins are connected directly to the chip and a logic analyser. Due to the large pin count on today's devices this is impractical, and even if possible it would significantly alter the timing. Following this was boundary scanning by JTAG (Joint Test Action Group); however, this only probed external signals. Better still is Xilinx ChipScope: an embedded black box which resides inside the FPGA as a probe unit. The downside is that it uses the slow JTAG interface to communicate readings.
An example of an on-chip debugging environment which uses a faster interface (the PCI bus) is the SONICmole[27], used with UltraSonic[14]. This takes up only 4% of a Virtex XCV1000 chip (512 slices). Its function is to act as a logic analyser, viewing and driving signals, whilst being as small as possible and having a good software interface. It uses the PIPE memory to store signal captures. It has been implemented at the UltraSonic maximum frequency of 66 MHz[27] and is portable to other reconfigurable systems.
Coding: Firstly, coding for FPGAs: these can be programmed through well-known languages such as VHDL and Verilog at the lower level, and MATLAB (System Generator) and, more recently, SystemC (see systemc.org) and Handel-C at the higher level. The focus of this sub-section will be on programming the GPU, as FPGA coding is widely understood.
Cg Language: Cg[28] was developed by Nvidia to let developers program GPUs in a C-like manner. The features of C that are beneficial for an equivalent GPU programming tool are performance, portability, generality and user control over machine-level operations. The main difference from C is the stream processing model for parallelism in GPUs.
Cg supports high-level programming, but is linkable with assembly code for optimised units, giving the programmer more control. Cg supports user-defined compound types (e.g. arrays and structures), which are useful for non-graphics applications. It also allows vectors of floating-point numbers up to size 4 (e.g. RGBA), along with matrices up to size 4x4 (for operations on those vectors). A downside is that Cg doesn't support pointers or recursive calls (as there is no stack structure); pointers may be implemented at a later date.
Nvidia separates programming of the 2 GPU processors (vertex and fragment) to avoid branching and loop problems, so they are accessed independently. The downside is that optimisations across this boundary aren't possible; a solution is to use a meta-programming system to merge the boundary. Nvidia introduces the concept of profiles for handling differences between generations of GPUs: each GPU era has a profile of what it is capable of implementing, and there is also a profile level common to all GPUs, necessary for portable code.
In the development of the PlayStation 2, Sony supported a full C implementation with the on-chip GPUs combined with off-chip resources. This shows the trend towards more user programmability. When developing Cg, Nvidia worked closely with other companies (such as Microsoft) who were developing similar tools. An aim of Cg was to support non-shading uses of the GPU, which is of particular interest here. (Fernando and Kilgard[29] provide a tutorial on using Cg to program graphics hardware.) For
the non-programmable parts of a GPU, CgFX[28] handles the configuration settings and parameters.
2.5 Literature Survey Conclusions
In summary, some current architectural uses of GPUs and FPGAs have been considered, including an FFT routine on the GPU and some graphics routines on an FPGA. The Sonic architecture was examined, particularly how it is used as a hardware accelerator for graphics applications. This was followed by interconnect structures, looking at buses, switches and networks and their respective advantages and disadvantages. The internal structure of an FPGA was then considered, investigating embedded components that could be useful in video applications, such as multipliers, memory and function evaluators. Finally, tools used pre- and post-device-function load, and in device programming, were analysed, specifically the Cg language.
3 Research Questions
The interconnect between cores in a design is a common bottleneck. It is important to have a good model of the interconnect, to either eliminate or reduce this delay. Many architectures have been proposed or developed for module interconnects (groupable as bus, switch and network), as discussed in the literature survey. This leads to the first research question: investigate suitable interconnect architectures for mixed-core hardware blocks and find adequate ways to model interconnect behaviour. A model is important for deciding the best interconnect for a task without the need for full implementation.
The potential of graphics hardware has long been exploited in the gaming industry, focusing on its high pixel throughput and fast processing. It has been shown to be particularly efficient where there is no inter-dependence between pixels. Programming this hardware was historically difficult: one could use an assembly-level language, in which prototyping takes a long time, or an API such as OpenGL, which limits the programmer's choice to a set of functions. In 2003 Nvidia produced a language called Cg, allowing high-level programming without losing the control of assembly-level coding. Following this, non-graphical applications were explored, for example Moreland and Angel's FFT algorithm[3].
The adaptability of graphics hardware to non-standard tasks leads to the second research question: to further investigate graphics hardware used in a mixed-core architecture. This takes advantage of the price-performance ratio of graphics hardware, whilst maintaining the current benefits of using FPGA / processor cores. FPGA cores allow for high levels of parallelism
and flexibility, as many designs can be implemented on the same hardware. Processors can be optimised for certain types of instructions and run many permutations of them without the costly reprogramming associated with FPGAs.
When one wishes to resize an image there are two possibilities for determining the new pixel values: filtering or interpolation. Filtering could be an FIR (Finite Impulse Response) low-pass filter, with complexity varying in the number of taps. Interpolation could be a bi-linear, bi-cubic or spline method, each of varying complexity. The final research question is: investigate the perceived quality versus computational complexity of the 2 methods. Theory suggests that FIR filtering, of a long enough tap length, should produce a smoother result; this may not, however, be perceptually the best, or it could be too computationally complex.
4 Interconnect Model
My first task was to implement a high-level model of the ARM AMBA bus. This would model its performance for varying numbers of masters and slaves and be cycle accurate. SystemC, a relatively new hardware modelling library, was used for this. The motivation came from a paper by Vermeulen and Catthoor[13], where an ARM7 processor was used, in addition to custom hardware, to allow for up to 10% post-manufacture functional modification.
A multiply function, for a communicating processor and memory, was modelled: two values to be multiplied are loaded in consecutive cycles, multiplied, then returned to memory using an interconnect. The interconnect consists of data plus control signals, as a simple bus model. This demonstrates how to display and debug the results of a hardware model: SystemC is used to create a VCD (Value Change Dump) file, which can be displayed in a waveform viewer such as ModelSim's. The results are shown in Figure 1.
Figure 1. Waveform for multiplier implementation
Figure 3. Test output showing reset and bus request / grant procedure
A number of meetings were held with Ray Cheung from Computing (currently modelling processors) to discuss possible interoperability between an AMBA bus model and a processor model. A fully flexible bus and processor model was suggested, which could later be extended to include other hardware blocks such as FPGAs.
Following this, my attention turned to the design of such a bus model. A physical interpretation of how the AMBA AHB bus blocks fit together can be seen in figure 2. Missing from figure 2 are the global clock and reset signals, which are routed to each block. HWDATA and HRDATA carry write and read data respectively, and the H prefix denotes the AHB bus as opposed to ASB. The control signals are requests from masters and split (resume transfer) signals from slaves. The complexity in coding the multiplexer blocks lay in making them general: constants were used, in place of literal numbers, for data and address signal widths throughout. The master multiplexer uses a delayed master select signal from the arbiter to pipeline the address and data buses, so one master can use the data bus whilst another controls the address bus.
For the decoder, an assumption was made about how a slave is chosen. The number of address bits used to decide which
slave to select is log2(number of slaves), rounded up. These bits are taken as the MSBs of the address, and their literal
binary value indicates which slave to use, i.e. 01 would select slave 1.
A test procedure was produced; this loads stimulus from a text file, with the result viewed as a waveform, as in the multiplier
example. The file consists of lines of either signal-and-value pairs, or tick followed by the number of cycles to
run for. Initially, simple tests were carried out to check for correct reset behaviour and that the two multiplexers worked (with
a setup of one master and two slaves). An example of a test output is shown in figure 3.
In the example, as the HSEL signals change at the bottom of the waveform, the two read-data signals are multiplexed. When reset,
all outputs are set to zero, irrespective of inputs, as would be expected. When a master requests the bus, the
arbiter waits until HREADY goes high before granting access through HGRANT. In the case of more than one master, the
HMASTER signal changes immediately (with HBUSREQ) to the correct master, allowing for multiplexing, so that the slaves
know which master is communicating.
The model was further tested with two masters and two slaves, a common configuration, and the correct, cycle-accurate results
were seen. Within this, the sending of packets consisting of one or multiple data items was experimented with, along with split
transfers and error responses from slaves. The waveforms for these become complicated and large very quickly, but are
of a similar form to figure 3.
5 Primary Colour Correction
Primary Colour Correction is a non-graphical application; as with the FFT-on-a-GPU algorithm discussed above, I will now
discuss my optimised version of it. The algorithm performs three main transformations per pixel: Input Correction, His-
togram Equalisation and Colour Balancing (see Figure 4).
Input Correction and Colour Balancing require the RGB signal to be converted to HSL (Hue, Saturation and Luminance)
space. In my optimisations, I converted halfway to a chroma representation (YCbCr) and implemented the algorithm at this
level, which showed considerable speed-up.
Other key optimisations were to perform calculations in vector space and to remove, where possible, conditional statements,
which are inefficient on GPUs. The lessons learnt are summarised below:
Figure 4. Primary Colour Correction Block Diagram
- Perform calculations in vectors and matrices
- Use in-built functions to replace complex maths and conditional statements
- Pre-compute uniform inputs, where possible, avoiding repetition for each pixel
- Consider what is happening at the assembly-code level; decipher the code if necessary
- Don't convert between colour spaces if not explicitly required
Table 1 shows the performance results for the initial and optimised designs on various generations of GPUs. There is
a large variation in the throughput rates of the devices, although only 2-3 years separate them. For more
information on the optimisation of the primary colour correction algorithm see [30].
Architecture | Throughput (Final) MP/s | Throughput (Initial) MP/s
6800 Ultra   | 116.36                  | 44.14
6800 GT      | 101.82                  | 38.62
6600         | 72.73                   | 27.59
5700 Ultra   | 12.67                   | 2.12
5200 Ultra   | 7.08                    | 1.24
Table 1: Performance Comparison on GeForce architectures for the Optimised (Final) and Initial Designs
For efficient optimisation of an algorithm it is important to understand the performance penalty of each section. A detailed
breakdown of the above primary colour correction algorithm, in terms of delay, was carried out. Among the performance
bottlenecks in the implementation were compare and clamping operations; the Colour Balancing function, which includes
many of each, was seen to be the slowest of the three main blocks. The conversion between colour spaces was seen
to have a large delay penalty, due mainly to the conversion from RGB to XYL space. In Histogram Equalisation, the pow
function was also seen to account for almost 50% of the delay (0.00089 s/MP).
The register usage, although minimal, was seen to be larger in calculations than in compare operations. This is due to the large
number of min-terms in the calculations, and to less intermediate storage being required in compares. In this case
register usage was not a limiting factor in the implementation; however, it may be for other algorithms. The breakdown of
delay for each block can be seen below; for more detail see [31].
Block             | Cycles | R regs | H regs | Instructions | Throughput (MP/s) | Delay (s/MP)
Input Correction  | 16     | 3      | 1      | 35           | 350.00            | 0.00286
Histogram Correct | 12     | 2      | 1      | 25           | 466.67            | 0.00214
Colour Balancing  | 23     | 3      | 1      | 56           | 243.47            | 0.00411
Table 2: Effect on Performance of Each Block of the Primary Colour Correction Algorithm
6 Plan of Work Leading to Transfer
The next step in the modelling of interconnects is to consider a general bus structure; this could also comprise multiple masters
and slaves, varying methods of arbitration, clock speeds, shared or individual read/write lines, etc. This requires a more
abstract implementation, which the SystemC library allows for. Models of crossbar switches and a network-on-chip
structure are other possibilities for future work on interconnect modelling.
The next stage on the question of graphics hardware is to implement the primary colour correction algorithm on a Pentium
processor and on an FPGA. An optimised implementation in MATLAB completed the computation for a 512x512 image, on a
Pentium 4, in 2.3 seconds. This equates to 0.1 MP/s, which is much slower than the graphics card. An implementation in
C/C++ is expected to perform better, but to still be 1-2 orders of magnitude worse. The FPGA implementation is expected to
out-perform both, if a large enough device is used. When limited to a device of equivalent cost to a graphics card, the FPGA
is expected to perform worse than the graphics card but better than the CPU.
A comparison of the visual differences between filtering and interpolation will be performed, along with the computation time
required by each. The algorithms will be tried on the graphics hardware, and any limitations of the interconnect, either on or off
board, noted. Implementations may also be prototyped on an FPGA device and a Pentium 4 processor for further comparison
of computational capabilities. The literature survey will also be updated to include documents relating to interpolation and
filtering algorithms, particularly in hardware.
An updated Gantt chart for my intended work, up to transfer, can be found in Appendix 1 at the rear of this document. It
relates to the above aims.
7 Conclusion
A literature survey of work related to my chosen research area has been presented, highlighting possibilities for work in
the areas of interconnects and the utilisation of graphics hardware in a mixed-core system. My three main research questions
were then explained: investigating interconnects and their modelling; the use of graphics hardware for video processing; and
the comparison of FIR filtering and interpolation. The work covered to date on Interconnect Modelling and the Primary Colour
Correction implementation on a graphics card was summarised, followed by a plan of my future work including a Gantt chart.
References
[1] J.N. England, A System for Interactive Modelling of Physical Curved Surface Objects, SIGGRAPH '78, 1978, pp. 336-340
[2] Chris Trendall and A. James Stewart, General calculations using graphics hardware, with applications to interactive caustics, 2000
[3] Kenneth Moreland and Edward Angel, The FFT on a GPU, in The Eurographics Association, 2003, pp. 112-136
[4] Michael Macedonia, The GPU Enters Computing's Mainstream, Entertainment Computing, pp. 106-108, 2003
[5] Jeffrey M. O'Brien, Nvidia, www.wired.com Issue 10.07, 2002
[6] Robert Strzodka and Christoph Garbe, Real-Time Motion Estimation and Visualisation on Graphics Cards, University of Duisburg,
2004
[7] Wayne Luk, P. Andreou, A. Derbyshire, F. Dupont-De-Dinechin, J. Rice, N. Shirazi and D. Siganos, A Reconfigurable Engine for
Real-Time Video Processing, Lecture Notes in Computer Science, 1998
[8] Satnam Singh and Pierre Bellec, Virtual Hardware for Graphics Applications Using FPGAs, FCCM 1994
[9] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, Video Image Processing with the Sonic Architecture, Computer, pp.
50-57, 2000
[10] Wim Melis, Peter Cheung and Wayne Luk, Image Registration of Real-Time Broadcast Video Using the UltraSONIC Reconfigurable
Computer, FPL, pp. 1148-1151, 2002
[11] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, SONIC - A Plug in Architecture for Video Processing, FPGA, pp.21-30,
1999
[12] Pete Sedcole, Peter Cheung, G.A. Constantinides and Wayne Luk, A Reconfigurable Platform for Real-Time Embedded Video
Image Processing, FPGA, 2003
[13] Frederik Vermeulen and Francky Catthoor, Power-Efficient Flexible Processor Architecture for Embedded Applications, IEEE
Transactions on VLSI Systems Vol 11, pp. 376-385, 2003
[14] Simon Haynes, Sonic - A reconfigurable image processing architecture, Poster - IEEE Symposium on FPGAs for Custom Com-
puting Machines, 1999
[15] Intel Developer Network for PCI Express Architecture, Why PCI Express Architectures for Graphics, www.express-lane.org,
2004
[16] AMBA Specification (Rev 2.0), ARM, 1999
[17] Jiang Xu, Wayne Wolf, Joerg Henkel, Srimat Chakradhar and Tiehan Lv, A Case Study in Networks-on-Chip Design for Embedded
Video, Design, Automation and Test in Europe Conference, 2004
[18] William J. Dally and Brian Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, DAC, 2001
[19] Axel Jantsch, Networks on Chip, 2002
[20] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Oberg, Mikael Millberg and Dan Lindqvist, Network on Chip:
An Architecture for the Billion Transistor Era, Proceedings of the IEEE NorChip Conference, 2000
[21] Luca Benini and Giovanni De Micheli, Networks on Chips: A New SOC Paradigm, Computer, pp. 70-78, 2002
[22] Rostislav Dobkin, Israel Cidon, Ran Ginosar, Avinoam Kolodny and Arkadiy Morgenshtein, Fast Asynchronous Bit-Serial Intercon-
nects for Network-on-Chip, 2004
[23] Simon Haynes and Peter Cheung, A Reconfigurable Multiplier Array for Video Image Processing Tasks, Suitable for Embedding in
an FPGA Structure, IEEE Symposium on Field-Programmable Custom Computing Machines, 1998
[24] Simon Haynes, Antonio Ferrari and Peter Cheung, Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architec-
ture of FPGAs, Proceedings of Custom Integrated Circuit Conference, 1999
[25] Nalin Sidahao, George Constantinides and Peter Cheung, Architectures for Function Evaluation on FPGAs, IEEE Symposium on
Circuits and Systems, pp. 804-807, 2003
[26] Wim Melis, Peter Cheung and Wayne Luk, Autonomous Memory Block for Reconfigurable Computing, ISCAS, pp. 581-584, 2004
[27] T. Wiangtong, C.T. Ewe and P.Y.K. Cheung, SONICmole: A Debugging Environment for the UltraSONIC Reconfigurable Com-
puter, ISCAS, pp.808-811, 2003
[28] William R. Mark, R. Stephen Glanville, Kurt Akeley and Mark J. Kilgard, Cg: A system for programming graphics hardware in a
C-like language, ACM Transactions on Graphics, pp. 896-907, 2003
[29] R. Fernando and M.J. Kilgard, The Cg Tutorial: The Definitive Guide to Programming Real-Time Graphics, Addison Wesley, 2003
[30] Ben Cope, Efficient Implementation of Primary Colour Correction on Graphics Hardware, available from author, 2005
[31] Ben Cope, Breakdown of Performance for Primary Colour Correction, available from author, 2005