Floating Point Support for FPGAs
Floating-point library for FPGAs
Rathlin
Aim: create a high-performance FPGA-based image-processor architecture and programming environment that allows complex image-processing algorithms to be implemented.
Rathlin
Trade-offs considered before this summer:
How much computation can you do on one frame?
How many objects can you have in one frame?
How many frames are you willing to drop?
New trade-off to take into consideration:
Precision: a sliding scale from whole integers, through fixed-point, to high-precision floats.
Motivation
Original picture – count the number of people using an image-processing algorithm.
Result when using only integers.
Result when using floating-point numbers.
Why FPGAs?
High flexibility – reprogrammable and easily upgradable.
Can be application specific – no wasted resources.
Offers a compromise between the flexibility of general-purpose
processors and the hardware-based speed of ASICs (application-specific integrated circuits).
Large number of memory banks – supports high-level parallelism.
Designers need to minimize the number of operations and memory
accesses in each unit to exploit more parallelism.
Example – stereo vision algorithm (extraction of 3D information from digital images).
FPGA performance is superior when high-level parallelism is involved.
Example – performance of two-dimensional filters (processing of 2D digital signals).
FPGA performance is inferior here because of its comparatively low, fixed clock frequency.
Problem
Most FPGAs do not have Floating Point Units, so they do not support
floating-point operations.
The few that do have FPUs are costly.
Solutions
1. Implement floating point with integer libraries in higher-level
languages:
Short implementation time.
Possible cost of non-optimal use of FPGA resources.
2. Implement floating point with low-level description languages:
Efficient use of FPGA hardware.
Not easily abstracted to high-level languages.
RIPL DSL
Toolchain: image-processing DSL (RIPL) → dataflow (CAL) → HDL (Verilog).
Floating-point and fixed-point libraries implemented in CAL.
Floating-point representation
IEEE-754 floating-point representation standard.
Single-precision (32-bit).
Each number has a sign, an exponent and a mantissa (1 integer bit and 23 fraction bits).
In the binary string representation: 1 sign bit, 8 exponent bits and 23 fraction bits.
This is the internal representation of floats on most machines.
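This bit layout can be inspected with Python's `struct` module; a minimal sketch for illustration (not part of the CAL library):

```python
import struct

def float_to_bits(x: float) -> str:
    """32-bit IEEE-754 single-precision pattern of x, as a bit string."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

def decompose(x: float):
    """Split a single-precision float into its sign, exponent and fraction fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF        # 23 fraction bits
    return sign, exponent, fraction

print(decompose(3.14))   # (0, 128, 4781507): 3.14 ≈ +1.57 × 2^(128-127)
```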
Fixed-point representation
QM.N number format.
M bits for the integer part, N bits for the fraction part.
Internal two's-complement representation: +1 sign bit.
We used Q8.7 (16 bits: 1 sign bit, 8 integer bits, 7 fraction bits).
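A Q8.7 encode/decode round trip can be sketched in Python (an illustration of the format, not the CAL implementation; `to_q8_7`/`from_q8_7` are hypothetical names):

```python
def to_q8_7(x: float) -> int:
    """Scale by 2**7 = 128 and round to a 16-bit two's-complement pattern."""
    v = round(x * 128)
    if not -32768 <= v <= 32767:
        raise OverflowError("out of Q8.7 range")
    return v & 0xFFFF               # 16-bit two's-complement encoding

def from_q8_7(v: int) -> float:
    """Decode a 16-bit pattern as a signed Q8.7 value."""
    if v & 0x8000:                  # sign bit set: negative number
        v -= 0x10000
    return v / 128

print(from_q8_7(to_q8_7(3.14)))     # 3.140625, within 1/128 of 3.14
```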
Floating-point vs. Fixed-point
Choice depends on the algorithms at hand.
Floating-point representation:
High precision.
Range: [-3.402823 × 10^38, 3.402823 × 10^38].
Position of the point must be determined at processing time.
Internal operations are performed on 32-bit integers, so resource utilization is higher.
Precision of 6–7 decimal digits.
Complex implementation.
Floating-point vs. Fixed-point
Fixed-point representation (Q8.7):
Lower precision.
Range: [-256, 255.9921875].
Position of the point is always fixed.
Internal operations are performed on 16-bit integers, so fewer resources are used.
Easier to implement.
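Both ranges follow directly from the bit widths; a quick Python check:

```python
import struct

# Largest finite single-precision value: exponent field 254, fraction all ones.
(max_single,) = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))
print(max_single)        # ≈ 3.402823 × 10^38

# Q8.7 endpoints: the 16-bit two's-complement limits, scaled by 2**7 = 128.
print(-32768 / 128)      # -256.0
print(32767 / 128)       # 255.9921875
```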
CAL Floating-point library
Size: 962 lines
Functions for: addition, subtraction, multiplication, division, square root.
Convert the floating-point number to its binary representation.
Use it to perform operations using the library.
Convert the result back to floating-point.
CAL Floating-point library
Example:
3.14 + 2.53 = 5.67
Add(0x4048F5C3, 0x4021EB85) = 0x40B570A4
(4.57 * 6.75) / (3.22 * 1.55) = 6.1806
Div(Mul(0x40923D71, 0x40D80000), Mul(0x404E147B, 0x3FC66666)) = 0x40C5C77A
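The hex arguments are plain IEEE-754 bit patterns, so the convert-operate-convert workflow can be checked with Python's `struct` module (this mirrors only the conversion steps, not the CAL code itself):

```python
import struct

def f2h(x: float) -> int:
    """Single-precision bit pattern of x, as an integer."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b

def h2f(b: int) -> float:
    """Float value encoded by a single-precision bit pattern."""
    (x,) = struct.unpack(">f", struct.pack(">I", b))
    return x

print(hex(f2h(3.14)))   # 0x4048f5c3
print(hex(f2h(2.53)))   # 0x4021eb85
# The first example above: Add(0x4048F5C3, 0x4021EB85)
print(hex(f2h(h2f(0x4048F5C3) + h2f(0x4021EB85))))   # 0x40b570a4
```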
CAL Fixed-point library
Size: 42 lines.
Functions for: multiplication, division.
Addition and subtraction are performed by “+” and “-” on 16-bit
integers.
Procedures for: square root, rounding.
Scale the floating-point number by 2^7 and round the result.
Use it to perform operations using the library.
Divide the result by 2^7.
CAL Fixed-point library
Example:
3.14 + 2.53 = 5.67
401 + 323 = 724
(4.57 * 6.75) / (3.22 * 1.55) = 6.1806
Div(Mul(584,864), Mul(412,198)) = 791
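The Q8.7 workflow can be sketched with plain integer arithmetic in Python (truncating variants with hypothetical names `q_mul`/`q_div`; the CAL library's rounding may differ by a least-significant step or two, e.g. it reports 791 for the example above):

```python
SCALE = 7  # Q8.7: 7 fraction bits, i.e. values scaled by 2**7 = 128

def q_mul(a: int, b: int) -> int:
    """Multiply two Q8.7 values; the raw product has 14 fraction bits, so shift back by 7."""
    return (a * b) >> SCALE

def q_div(a: int, b: int) -> int:
    """Divide two Q8.7 values; pre-shift the dividend to keep 7 fraction bits."""
    return (a << SCALE) // b

# (4.57 * 6.75) / (3.22 * 1.55) with operands scaled by 2**7:
result = q_div(q_mul(584, 864), q_mul(412, 198))
print(result / 128)   # within a few 1/128 steps of the exact 6.1806
```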
Evaluation
We used Xilinx ISE to create micro-benchmarks to test the
performance of the libraries.
Each micro-benchmark consisted of one actor containing either 1 or 10
instances of a function, in order to:
See how the maximum clock frequency and resource usage scale across
multiple instances of a function.
Observe the trade-off compared to using 8-bit integers.
8-bit Integers micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 379.651 100 1% 174 1% 32 3%
Addition 10 379.651 105 1% 181 1% 32 3%
Subtraction 1 379.651 100 1% 174 1% 32 3%
Subtraction 10 379.651 100 1% 197 1% 32 3%
Multiplication 1 379.651 100 1% 166 1% 32 3%
Multiplication 10 379.651 100 1% 166 1% 32 3%
Division 1 223.169 140 1% 268 1% 32 3%
Division 10 245.842 144 1% 750 1% 32 3%
Floating-point micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 61.630 94 1% 2,281 1% 0 0%
Addition 10 7.546 378 1% 33,021 10% 0 0%
Subtraction 1 64.336 98 1% 1,969 1% 0 0%
Subtraction 10 7.728 399 1% 34,107 11% 0 0%
Multiplication 1 58.891 64 1% 736 1% 0 0%
Multiplication 10 6.274 67 1% 11,082 3% 0 0%
Division 1 34.967 283 1% 1,771 1% 0 0%
Division 10 3.955 2,295 1% 17,553 5% 0 0%
Fixed-point micro-benchmark
Columns: Operation, Instances, Max clock frequency (MHz), Slice registers, Register utilization, Slice LUTs, LUT utilization, Block RAM, Block RAM utilization.
Addition 1 381.403 80 1% 113 1% 0 0%
Addition 10 381.403 121 1% 206 1% 62 6%
Subtraction 1 381.403 108 1% 192 1% 64 6%
Subtraction 10 381.403 119 1% 209 1% 60 5%
Multiplication 1 381.403 108 1% 176 1% 64 6%
Multiplication 10 381.403 108 1% 176 1% 64 6%
Division 1 206.752 204 1% 506 1% 64 6%
Division 10 68.049 981 1% 3,597 1% 64 6%
Conclusions
There is a significant trade-off for high precision.
Users should carefully consider the requirements of their algorithms to
determine how vital high precision is for them.
Thank you!