Floating Point Support for FPGAs
Floating-point library for FPGAs
Rathlin
Aim: create a high-performance FPGA-based image-processor architecture and programming environment that allows complex image-processing algorithms to be implemented.
Rathlin
Trade-offs considered before this summer:
How much computation can you do on one frame?
How many objects can you have in one frame?
How many frames are you willing to drop?
New trade-off to take into consideration:
Precision: a sliding scale from whole integers, through fixed-point, to high-precision floats.
Motivation
Original picture – count the number of people using an image-processing algorithm.
Result when using only integers.
Result when using floating-point numbers.
Why FPGAs?
High flexibility – reprogrammable and easily upgradable.
Can be application specific – no wasted resources.
Offers a compromise between the flexibility of general-purpose
processors and the hardware-based speed of ASICs (application-specific integrated circuits).
Large number of memory banks – supports high-level parallelism.
Designers need to minimize the number of operations and memory
accesses in each unit to exploit more parallelism.
Example – stereo vision algorithm (extraction of 3D information from digital images).
FPGA performance is superior when high-level parallelism is involved.
Example – performance of two-dimensional filters (processing of 2D digital signals).
FPGA performance is inferior here because of its comparatively low, fixed clock frequency.
Problem
Most FPGAs do not have Floating Point Units, so they do not support
floating-point operations.
The few that do have FPUs are costly.
Solutions
1. Implement floating point with integer libraries in higher-level
languages:
Short implementation time.
Possible cost of non-optimal use of FPGA resources.
2. Implement floating point with low-level description languages:
Efficient use of FPGA hardware.
Not easily abstracted to high-level languages.
RIPL DSL
Toolchain: image-processing DSL (RIPL) → dataflow (CAL) → HDL (Verilog).
Floating-point and fixed-point libraries implemented in CAL.
Floating-point representation
IEEE-754 floating-point representation standard.
Single-precision (32-bit).
Each number has a sign, an exponent and a mantissa (1 integer bit and 23 fraction bits).
In the binary string representation: 1 sign bit, 8 exponent bits and 23 fraction bits.
This is the internal representation of floats on most machines.
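This bit layout can be inspected with Python's `struct` module; a minimal sketch for illustration (not part of the CAL library):

```python
import struct

def float_to_bits(x: float) -> str:
    """32-bit IEEE-754 single-precision pattern of x, as a bit string."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

def decompose(x: float):
    """Split a single-precision float into its sign, exponent and fraction fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, fraction

print(decompose(3.14))   # (0, 128, 4781507): 3.14 ≈ +1.57 × 2^(128-127)
```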
Fixed-point representation
QM.N number format.
M bits for the integer part, N bits for the fraction part.
Internal two's-complement representation: +1 sign bit.
We used Q8.7 (16 bits: 1 sign bit, 8 integer bits, 7 fraction bits).
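A Q8.7 encode/decode round trip can be sketched in Python (an illustration of the format, not the CAL implementation; `to_q8_7`/`from_q8_7` are hypothetical names):

```python
def to_q8_7(x: float) -> int:
    """Scale by 2**7 = 128 and round to a 16-bit two's-complement pattern."""
    v = round(x * 128)
    if not -32768 <= v <= 32767:
        raise OverflowError("out of Q8.7 range")
    return v & 0xFFFF               # 16-bit two's-complement encoding

def from_q8_7(v: int) -> float:
    """Decode a 16-bit pattern as a signed Q8.7 value."""
    if v & 0x8000:                  # sign bit set: negative number
        v -= 0x10000
    return v / 128

print(from_q8_7(to_q8_7(3.14)))     # 3.140625, within 1/128 of 3.14
```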
Floating-point vs. Fixed-point
Choice depends on the algorithms at hand.
Floating-point representation:
High precision.
Range: [-3.402823 × 10^38, 3.402823 × 10^38].
Position of the point must be determined at processing time.
Internal operations are performed on 32-bit integers, so resource utilization is higher.
Precision of 6–7 decimal digits.
Complex implementation.
Floating-point vs. Fixed-point
Fixed-point representation (Q8.7):
Lower precision.
Range: [-256, 255.9921875].
Position of the point is always fixed.
Internal operations are performed on 16-bit integers, so fewer resources are used.
Easier to implement.
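Both ranges follow directly from the bit widths; a quick Python check:

```python
import struct

# Largest finite single-precision value: exponent field 254, fraction all ones.
(max_single,) = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))
print(max_single)        # ≈ 3.402823 × 10^38

# Q8.7 endpoints: the 16-bit two's-complement limits, scaled by 2**7 = 128.
print(-32768 / 128)      # -256.0
print(32767 / 128)       # 255.9921875
```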
CAL Floating-point library
Size: 962 lines
Functions for: addition, subtraction, multiplication, division, square root.
Convert the floating-point number to its binary representation.
Use it to perform operations using the library.
Convert the result back to floating-point.
CAL Floating-point library
Example:
3.14 + 2.53 = 5.67
Add(0x4048F5C3, 0x4021EB85) = 0x40B570A4
(4.57 * 6.75) / (3.22 * 1.55) = 6.1806
Div(Mul(0x40923D71, 0x40D80000), Mul(0x404E147B, 0x3FC66666)) = 0x40C5C77A
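The hex arguments are plain IEEE-754 bit patterns, so the convert-operate-convert workflow can be checked with Python's `struct` module (this mirrors only the conversion steps, not the CAL code itself):

```python
import struct

def f2h(x: float) -> int:
    """Single-precision bit pattern of x, as an integer."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b

def h2f(b: int) -> float:
    """Float value encoded by a single-precision bit pattern."""
    (x,) = struct.unpack(">f", struct.pack(">I", b))
    return x

print(hex(f2h(3.14)))   # 0x4048f5c3
print(hex(f2h(2.53)))   # 0x4021eb85
# The first example above: Add(0x4048F5C3, 0x4021EB85)
print(hex(f2h(h2f(0x4048F5C3) + h2f(0x4021EB85))))   # 0x40b570a4
```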
CAL Fixed-point library
Size: 42 lines.
Functions for: multiplication, division.
Addition and subtraction are performed by “+” and “-” on 16-bit
integers.
Procedures for: square root, rounding.
Scale the floating-point number by 2^7 and round the result.
Use it to perform operations using the library.
Divide the result by 2^7.
CAL Fixed-point library
Example:
3.14 + 2.53 = 5.67
401 + 323 = 724
(4.57 * 6.75) / (3.22 * 1.55) = 6.1806
Div(Mul(584,864), Mul(412,198)) = 791
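The Q8.7 workflow can be sketched with plain integer arithmetic in Python (truncating variants with hypothetical names `q_mul`/`q_div`; the CAL library's rounding may differ by a least-significant step or two, e.g. it reports 791 for the example above):

```python
SCALE = 7  # Q8.7: 7 fraction bits, i.e. values scaled by 2**7 = 128

def q_mul(a: int, b: int) -> int:
    """Multiply two Q8.7 values; the raw product has 14 fraction bits, so shift back by 7."""
    return (a * b) >> SCALE

def q_div(a: int, b: int) -> int:
    """Divide two Q8.7 values; pre-shift the dividend to keep 7 fraction bits."""
    return (a << SCALE) // b

# (4.57 * 6.75) / (3.22 * 1.55) with operands scaled by 2**7:
result = q_div(q_mul(584, 864), q_mul(412, 198))
print(result / 128)   # within a few 1/128 steps of the exact 6.1806
```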
Evaluation
We used Xilinx ISE to create micro-benchmarks to test the
performance of the libraries.
Each micro-benchmark consisted of one actor containing either 1 or 10
instances of a function, in order to:
See how the maximum clock frequency and resource usage scale across
multiple instances of a function.
Observe the trade-off compared to using 8-bit integers.
8-bit Integers micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 379.651 100 1% 174 1% 32 3%
Addition 10 379.651 105 1% 181 1% 32 3%
Subtraction 1 379.651 100 1% 174 1% 32 3%
Subtraction 10 379.651 100 1% 197 1% 32 3%
Multiplication 1 379.651 100 1% 166 1% 32 3%
Multiplication 10 379.651 100 1% 166 1% 32 3%
Division 1 223.169 140 1% 268 1% 32 3%
Division 10 245.842 144 1% 750 1% 32 3%
Floating-point micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 61.630 94 1% 2,281 1% 0 0%
Addition 10 7.546 378 1% 33,021 10% 0 0%
Subtraction 1 64.336 98 1% 1,969 1% 0 0%
Subtraction 10 7.728 399 1% 34,107 11% 0 0%
Multiplication 1 58.891 64 1% 736 1% 0 0%
Multiplication 10 6.274 67 1% 11,082 3% 0 0%
Division 1 34.967 283 1% 1,771 1% 0 0%
Division 10 3.955 2,295 1% 17,553 5% 0 0%
Fixed-point micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 381.403 80 1% 113 1% 0 0%
Addition 10 381.403 121 1% 206 1% 62 6%
Subtraction 1 381.403 108 1% 192 1% 64 6%
Subtraction 10 381.403 119 1% 209 1% 60 5%
Multiplication 1 381.403 108 1% 176 1% 64 6%
Multiplication 10 381.403 108 1% 176 1% 64 6%
Division 1 206.752 204 1% 506 1% 64 6%
Division 10 68.049 981 1% 3,597 1% 64 6%
Conclusions
There is a significant trade-off for high precision.
Users should carefully consider the requirements of their algorithms to
determine how vital high precision is for them.
Thank you!