Double-Precision Floating-Point Numbers

Douglas Wilhelm Harder

Department of Electrical and Computer Engineering

University of Waterloo

Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.

ECE 204 Numerical Methods for Computer Engineers

• This topic introduces binary numbers
– requirements
– a poor means of storage
– a good means of storage

• We will now use this same floating-point format, but we will apply it to binary numbers

• In our example, we used six decimal digits

• The double-precision floating-point format uses 64 bits (or eight bytes)

• Like our format, these bits are broken up into
– a leading sign bit,
– an 11-bit exponent (with a bias), and
– a 52-bit mantissa

• Like our six-digit version, the bits are stored in the order:
SEEEEEEEEEEEMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

• The bias is 01111111111, or 1023

• This allows us to represent numbers in the range 2⁻¹⁰²³ to 2¹⁰²⁵, though the floating-point standard IEEE 754 reserves the lowest (all zeros) and highest (all ones) exponents, so normalized numbers lie between 2⁻¹⁰²² and just under 2¹⁰²⁴
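As a minimal sketch of this layout, the three fields can be pulled out of a double with shifts and masks. The sketch is in Python rather than the Matlab used elsewhere in these slides, and the names `decompose` and `BIAS` are illustrative, not part of any library.

```python
import struct

BIAS = 1023  # 01111111111 in binary

def decompose(x):
    """Split a double into its (sign, biased exponent, mantissa) bit fields."""
    # Reinterpret the eight bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, still carrying the bias
    mantissa = bits & ((1 << 52) - 1)  # 52 stored bits
    return sign, exponent, mantissa

for value in (1.0, -2.0, 0.15625):
    s, e, m = decompose(value)
    print(f"{value!r}: sign={s} exponent={e:011b} (unbiased {e - BIAS}) mantissa={m:052b}")
```

Subtracting `BIAS` from the stored exponent recovers the true power of two.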

• Recall that the leading digit of a normalized floating-point representation must be non-zero; in binary, that digit must therefore be 1

• We therefore do not store the leading bit; the 52 stored mantissa bits actually represent
1.MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
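To make the implicit leading 1 concrete, the following sketch rebuilds the exact value of a normalized double from its raw fields. It is written in Python, and the helper names `fields` and `value_from_fields` are hypothetical. It assumes a normalized number, so the reserved all-zeros and all-ones exponents are not handled.

```python
import struct
from fractions import Fraction

def fields(x):
    """Return the (sign, biased exponent, mantissa) bit fields of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def value_from_fields(sign, exponent, mantissa):
    """Exact value of a normalized double, computed with rational arithmetic."""
    significand = 1 + Fraction(mantissa, 2**52)  # the 1 here is the bit that is not stored
    return (-1) ** sign * significand * Fraction(2) ** (exponent - 1023)

print(value_from_fields(0, 1023, 1 << 51))  # mantissa 1000... means 1.1 in binary
print(value_from_fields(*fields(0.1)))      # the exact rational that 0.1 is stored as
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly against the stored double.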

• Rather than printing out long strings of 1s and 0s, we will use hexadecimal numbers:

Hex  Decimal  Binary
0    0        0000
1    1        0001
2    2        0010
3    3        0011
4    4        0100
5    5        0101
6    6        0110
7    7        0111
8    8        1000
9    9        1001
a    10       1010
b    11       1011
c    12       1100
d    13       1101
e    14       1110
f    15       1111

• To convert a binary number into hexadecimal, simply group the bits into groups of four (starting at the radix point, if one exists) and replace each group with the corresponding hexadecimal digit

• To convert from hexadecimal to binary, replace each hexadecimal digit with its four-bit equivalent (including leading zeros)
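The two conversions above can be sketched in a few lines of Python. The function names `bin_to_hex` and `hex_to_bin` are illustrative, and the sketch assumes bit strings without a radix point, so grouping starts from the right-hand end and any short leading group is padded with zeros.

```python
HEX_DIGITS = "0123456789abcdef"

def bin_to_hex(bits):
    """Convert a string of 0s and 1s (no radix point) to hexadecimal."""
    # Pad on the left so that the length is a multiple of four.
    width = -(-len(bits) // 4) * 4
    padded = bits.zfill(width)
    groups = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)

def hex_to_bin(digits):
    """Replace each hexadecimal digit with its four-bit equivalent."""
    return "".join(format(int(d, 16), "04b") for d in digits)

print(bin_to_hex("001111111111"))
print(hex_to_bin("c0"))
```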

• Some of the more common numbers are:

>> format hex
>> 1
ans = 3ff0000000000000
>> 2
ans = 4000000000000000
>> -1
ans = bff0000000000000
>> -2
ans = c000000000000000

• Recall that 3ff₁₆ = 001111111111₂: a sign bit of 0 followed by the exponent 01111111111₂, which is our bias
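For readers without Matlab, a rough Python counterpart to the `format hex` display is sketched below. The helper name `double_to_hex` is an assumption of this sketch. It packs the double in big-endian byte order so that the sign bit comes first, as in the listing above.

```python
import struct

def double_to_hex(x):
    """Return the 16 hexadecimal digits of the 64-bit representation of x."""
    # ">d" packs a double big-endian, so the sign and exponent bits come first.
    return struct.pack(">d", x).hex()

for value in (1.0, 2.0, -1.0, -2.0):
    print(value, double_to_hex(value))
```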

• Some operations are quite straightforward (provided the result neither overflows nor underflows):
– multiplication by 2 adds 1 to the exponent and leaves the mantissa unchanged
– division by 2 subtracts 1 from the exponent and leaves the mantissa unchanged
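The effect of doubling and halving on the stored fields can be checked directly. This is a small Python sketch with an arbitrarily chosen sample value. The helper `exponent_and_mantissa` is hypothetical, and the claim is only expected to hold while the results stay normalized.

```python
import struct

def exponent_and_mantissa(x):
    """Return the biased exponent and stored mantissa of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

x = 3.14159  # any normalized value works here
e, m = exponent_and_mantissa(x)
e_double, m_double = exponent_and_mantissa(2 * x)
e_half, m_half = exponent_and_mantissa(x / 2)

print(e, e_double, e_half)           # consecutive exponents
print(m == m_double and m == m_half)  # mantissa bits are untouched
```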

• Rounding rules are simplified

• To round a binary number with more than 53 bits of precision to 53 bits:
– if the 54th bit is 0, truncate (round down),
– if the bits after the 53rd bit are exactly 1000··· (a tie), round up if the 53rd bit is 1, otherwise truncate, and
– otherwise, round up (add 1 to the 53rd bit)
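These three rules can be sketched on positive integers, where the bits are easy to inspect. The Python function `round_to_53_bits` below is a hypothetical helper, not a library routine. Converting an integer with `float()` is used as a cross-check, on the assumption that the conversion rounds to nearest with ties to even.

```python
def round_to_53_bits(n):
    """Round a positive integer to 53 significant bits, ties going to even."""
    extra = n.bit_length() - 53
    if extra <= 0:
        return n                      # already fits in 53 bits
    kept = n >> extra                 # the leading 53 bits
    dropped = n & ((1 << extra) - 1)  # everything after the 53rd bit
    half = 1 << (extra - 1)           # the tie pattern 1000...
    if dropped > half or (dropped == half and kept & 1):
        kept += 1                     # round up: add 1 to the 53rd bit
    return kept << extra              # otherwise the extra bits are truncated

for n in (2**53 + 1, 2**53 + 3, 2**54 + 1, 2**54 + 3):
    print(n, round_to_53_bits(n), int(float(n)))
```

A tie is broken so that the 53rd bit of the result is 0, which is why the rule depends on whether that bit was 1 before rounding.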

• Remember, we deal with 53 bits because we store 52 bits together with the implicit leading 1
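One way to see the 53-bit limit is to look just above and just below 2⁵³. The short Python sketch below uses variable names chosen only for illustration, and `sys.float_info.mant_dig` reports the precision including the implicit leading 1.

```python
import sys

# Number of significand bits, counting the implicit leading 1.
precision = sys.float_info.mant_dig

big = 2.0 ** 53
absorbed = (big + 1.0 == big)  # 2**53 + 1 would need 54 bits, so the 1 is lost
distinct = (big - 1.0 != big)  # 2**53 - 1 fits in 53 bits, so it is stored exactly

print(precision, absorbed, distinct)
```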

Usage Notes

• These slides are made publicly available on the web for anyone to use

• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
– that you inform me that you are using the slides,

– that you acknowledge my work, and

– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides

Sincerely,

Douglas Wilhelm Harder, MMath

[email protected]