
Double-Precision Floating-Point Numbers

Douglas Wilhelm Harder

Department of Electrical and Computer Engineering

University of Waterloo

Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.

ECE 204 Numerical Methods for Computer Engineers


• This topic introduces binary numbers

– requirements

– a poor means of storage

– a good means of storage


• We will now use this same floating-point format, but we will apply it to binary numbers


• In our example, we used six decimal digits

• The double-precision floating-point format uses 64 bits (or eight bytes)

• Like our format, these bits are broken up into

– a leading sign bit,

– an exponent (with a bias), and

– a mantissa


• Like our six-digit version, the bits are stored in the order (1 sign bit, 11 exponent bits, 52 mantissa bits):

SEEEEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

• The bias is 01111111111, or 1023

• This allows us to represent numbers in the range 2⁻¹⁰²³ to 2¹⁰²⁵, though the IEEE 754 floating-point standard reserves the lowest (all zeros) and highest (all ones) exponents
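As a minimal sketch of the layout just described, the three fields can be pulled out of a double with shifts and masks. The helper name `fields` is hypothetical, and the sketch makes no attempt to interpret the reserved exponents.

```python
import struct

def fields(x):
    """Split a double into its sign, biased-exponent, and mantissa fields."""
    # Reinterpret the eight bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, mantissa

# -6.5 is -1.101 (binary) times 2^2
sign, exponent, mantissa = fields(-6.5)
print(sign, exponent - 1023, format(mantissa, "052b"))
```

Subtracting the bias 1023 from the stored exponent recovers the actual power of two.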


• Recall that the leading bit in a floating-point representation must be non-zero; in binary, that bit must therefore be 1

• We therefore do not store the leading bit, and the mantissa actually represents 1.MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
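The role of the implicit leading 1 can be illustrated by rebuilding a value from its 64-bit pattern. This is a sketch only: the function name `decode` is hypothetical, and the reserved all-zeros and all-ones exponents are deliberately rejected rather than handled.

```python
import math

BIAS = 1023

def decode(bits):
    """Value of a normalized double, given its 64-bit pattern as an integer."""
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent in (0, 0x7FF):
        raise ValueError("reserved exponent: outside the scope of this sketch")
    # Restore the leading 1 that is not stored, giving a 53-bit integer
    # that stands for 1.MMMM... scaled up by 2^52.
    significand = (1 << 52) | mantissa
    value = math.ldexp(significand, exponent - BIAS - 52)
    return -value if sign else value

print(decode(0x3FF8000000000000))  # 1.1 (binary) times 2^0
```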


• Rather than printing out a lot of 1s and 0s, we will use hexadecimal numbers:

Hex   Decimal   Binary
0     0         0000
1     1         0001
2     2         0010
3     3         0011
4     4         0100
5     5         0101
6     6         0110
7     7         0111
8     8         1000
9     9         1001
a     10        1010
b     11        1011
c     12        1100
d     13        1101
e     14        1110
f     15        1111


• To convert a binary number into hexadecimal, simply group the bits into groups of four (starting at the radix point, if there is one) and replace each group with the corresponding hexadecimal value

• To convert from hexadecimal to binary, replace each hexadecimal digit with its four-bit equivalent (including leading zeros)
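The two conversions can be sketched as follows for bit strings without a radix point, where grouping starts from the right-hand end. The names `bin_to_hex` and `hex_to_bin` are hypothetical helpers chosen for this illustration.

```python
def bin_to_hex(bits):
    """Convert a string of bits (no radix point) to hexadecimal digits."""
    # Pad on the left with zeros so the length is a multiple of four.
    width = -(-len(bits) // 4) * 4
    bits = bits.zfill(width)
    # Replace each group of four bits with its hexadecimal digit.
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(group, 2), "x") for group in groups)

def hex_to_bin(digits):
    """Replace each hexadecimal digit with its four-bit equivalent."""
    return "".join(format(int(digit, 16), "04b") for digit in digits)

print(bin_to_hex("01111111111"))
print(hex_to_bin("3ff"))
```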


• Some of the more common numbers are:

>> format hex

>> 1

ans = 3ff0000000000000

>> 2

ans = 4000000000000000

>> -1

ans = bff0000000000000

>> -2

ans = c000000000000000

• Recall that 3ff₁₆ = 001111111111₂, which is a sign bit of 0 followed by our bias, 01111111111
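The same representations can be reproduced outside of the interactive session above. The sketch below uses Python's standard `struct` module; the helper name `to_hex` is hypothetical.

```python
import struct

def to_hex(x):
    """Hexadecimal form of the 64 bits of a double, most significant byte first."""
    return struct.pack(">d", x).hex()

for x in (1.0, 2.0, -1.0, -2.0):
    print(x, to_hex(x))
```

The first three hexadecimal digits hold the sign bit and the 11-bit exponent, so a change of sign shows up as 3ff becoming bff, or 400 becoming c00.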


• Some operations are quite straightforward:

– multiplication by 2 adds 1 to the exponent and leaves the mantissa unchanged

– division by 2 subtracts 1 from the exponent and leaves the mantissa unchanged
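This behaviour can be checked directly by comparing the stored fields before and after scaling. The sketch assumes the results stay within the range of normalized numbers; the helper name `exponent_and_mantissa` and the sample value 3.7 are arbitrary choices for illustration.

```python
import struct

def exponent_and_mantissa(x):
    """Return the 11-bit biased exponent and 52-bit mantissa of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

x = 3.7
e, m = exponent_and_mantissa(x)
e_double, m_double = exponent_and_mantissa(2 * x)
e_half, m_half = exponent_and_mantissa(x / 2)

print(e_double - e, m_double == m)  # exponent change and mantissa check for 2*x
print(e_half - e, m_half == m)      # exponent change and mantissa check for x/2
```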


• Rounding rules are simplified

• To round a binary number that has more than 53 bits of precision to 53 bits:

– if the 54th bit is 0, truncate (round down),

– if the bits after the 53rd bit are exactly 1000···, round up if the 53rd bit is 1, otherwise truncate, and

– otherwise, round up (add 1 to the 53rd bit)


• Remember, we deal with 53 bits because we store 52 bits together with the implicit leading 1
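The rounding rules can be sketched on positive integers, which makes every bit explicit. The function name `round_to_53_bits` is hypothetical, and the comparison against Python's own integer-to-float conversion is only a sanity check for this illustration.

```python
def round_to_53_bits(n):
    """Round a positive integer to 53 significant bits using the rules above."""
    extra = n.bit_length() - 53
    if extra <= 0:
        return n                       # already fits in 53 bits
    kept = n >> extra                  # the leading 53 bits
    rest = n & ((1 << extra) - 1)      # all bits after the 53rd
    half = 1 << (extra - 1)            # the pattern 1000...
    if rest > half or (rest == half and kept & 1):
        kept += 1                      # round up: add 1 to the 53rd bit
    return kept << extra               # otherwise the extra bits are truncated

for n in (2**53 + 1, 2**53 + 2, 2**53 + 3):
    print(n, round_to_53_bits(n), round_to_53_bits(n) == int(float(n)))
```

Here 2**53 + 1 and 2**53 + 3 are both exact ties: the first has a 53rd bit of 0 and is truncated, while the second has a 53rd bit of 1 and is rounded up.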

Usage Notes

• These slides are made publicly available on the web for anyone to use

• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:

– that you inform me that you are using the slides,

– that you acknowledge my work, and

– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides

Sincerely,

Douglas Wilhelm Harder, MMath

[email protected]
