Double-Precision Floating-Point Numbers
Douglas Wilhelm Harder
Department of Electrical and Computer Engineering
University of Waterloo
Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.
ECE 204 Numerical Methods for Computer Engineers
• This topic introduces binary numbers
– requirements
– a poor means of storage
– a good means of storage
• We will now use this same floating-point format, but we will apply it to binary numbers
• In our example, we used six decimal digits
• The double-precision floating-point format uses 64 bits (or eight bytes)
• Like our format, the 64 bits are broken up into
– a leading sign bit,
– an exponent (with a bias), and
– a mantissa
• Like our six-digit version, the bits are stored in the order (1 sign bit, 11 exponent bits, 52 mantissa bits):
S EEEEEEEEEEE MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
• The bias is 01111111111, or 1023
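• This allows us to represent numbers in the range $2^{-1023}$ to $2^{1025}$, though the floating-point standard IEEE 754 reserves the lowest (all zeros) and highest (all ones) exponents for special values, so ordinary normalized numbers lie between $2^{-1022}$ and just under $2^{1024}$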
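The layout above can be checked directly. The following is a minimal sketch in Python (not the MATLAB used elsewhere in these slides); the helper name `fields` is hypothetical and simply slices the 64 bits into the three parts described above.

```python
import struct

BIAS = 1023  # 01111111111 in binary

def fields(x):
    """Return (sign, biased exponent, mantissa) of a double as integers.

    `fields` is an illustrative helper, not part of any library.
    """
    # Reinterpret the eight bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 leading bit
    exponent = (bits >> 52) & 0x7FF    # next 11 bits
    mantissa = bits & ((1 << 52) - 1)  # last 52 bits
    return sign, exponent, mantissa

sign, exponent, mantissa = fields(-6.0)   # -6 = -1.5 x 2^2
print(sign, exponent - BIAS, hex(mantissa))
```

Subtracting `BIAS` from the stored exponent recovers the true power of two.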
• Recall that the leading digit of a normalized floating-point number must be non-zero; in binary, that digit must therefore be 1
• We therefore do not store the leading digit, and the mantissa actually represents
1.MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
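To make the implicit leading 1 concrete, here is a small sketch in Python that rebuilds the represented value from the three stored fields. The function name `value` is an assumption made for illustration, and the sketch covers only normalized numbers (it ignores the reserved all-zeros and all-ones exponents).

```python
from fractions import Fraction

BIAS = 1023

def value(sign, exponent, mantissa):
    """Exact value of a normalized double given its three stored fields.

    Hypothetical helper: assumes 0 < exponent < 2047.
    """
    # The stored 52 bits are the fractional part; the leading 1 is implicit.
    significand = 1 + Fraction(mantissa, 2**52)
    return (-1) ** sign * significand * Fraction(2) ** (exponent - BIAS)

print(value(0, 1023, 0))        # mantissa bits all zero, true exponent 0
print(value(1, 1025, 2**51))    # leading mantissa bit set, true exponent 2
```

Exact fractions are used here so that the reconstruction itself introduces no rounding.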
• Rather than printing out a lot of 1s and 0s, we will use hexadecimal numbers:
Hex  Decimal  Binary
0    0        0000
1    1        0001
2    2        0010
3    3        0011
4    4        0100
5    5        0101
6    6        0110
7    7        0111
8    8        1000
9    9        1001
a    10       1010
b    11       1011
c    12       1100
d    13       1101
e    14       1110
f    15       1111
• To convert a binary number into hexadecimal, simply group the bits into groups of four (starting at the radix point if it exists) and replace each group with the corresponding hexadecimal value
• To convert from hexadecimal to binary, replace each hexadecimal digit with its four-bit equivalent (including leading zeros)
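The two conversion rules above can be sketched in a few lines of Python. The names `bin_to_hex` and `hex_to_bin` are hypothetical, and the sketch assumes bit strings without a radix point, so grouping starts at the right-hand end.

```python
def bin_to_hex(bits):
    """Group a bit string into fours and replace each group by a hex digit."""
    # Pad on the left with zeros so the length is a multiple of four.
    width = -(-len(bits) // 4) * 4
    bits = bits.zfill(width)
    groups = (bits[i:i + 4] for i in range(0, len(bits), 4))
    return "".join(format(int(group, 2), "x") for group in groups)

def hex_to_bin(digits):
    """Replace each hex digit by its four-bit equivalent, keeping leading zeros."""
    return "".join(format(int(digit, 16), "04b") for digit in digits)

print(bin_to_hex("01111111111"))   # the 11-bit bias
print(hex_to_bin("3ff"))
```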
• Some of the more common numbers are:
>> format hex
>> 1
ans = 3ff0000000000000
>> 2
ans = 4000000000000000
>> -1
ans = bff0000000000000
>> -2
ans = c000000000000000
• Recall that $\mathrm{3ff}_{16} = 001111111111_2$: the leading 0 is the sign bit and the remaining 11 bits, $01111111111_2$, are our bias
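For readers without MATLAB, the same 16-digit hexadecimal strings can be reproduced in Python. This is a sketch only: `to_hex` is a hypothetical helper built on the standard `struct` module, and it is not how MATLAB's `format hex` is implemented.

```python
import struct

def to_hex(x):
    """Return the 64 bits of a double as 16 hexadecimal digits."""
    # ">d" packs the double with the sign and exponent bits first.
    return struct.pack(">d", x).hex()

for x in (1.0, 2.0, -1.0, -2.0):
    print(x, to_hex(x))
```

Note that the sign bit alone distinguishes 1 from -1: the leading hex digit changes from 3 (0011) to b (1011).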
• Some operations are quite straightforward (provided the result stays within the representable exponent range):
– multiplication by 2 adds 1 to the exponent and leaves the mantissa unchanged
– division by 2 subtracts 1 from the exponent and leaves the mantissa unchanged
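This behaviour is easy to observe with Python's standard `math.frexp`, which splits a float into a fraction and a power of two. One caveat: `frexp` normalizes the fraction to the interval [0.5, 1) rather than [1, 2), so its exponent is one larger than in the convention used in these slides; the point of the sketch is only that the fraction stays fixed while the exponent moves by one.

```python
import math

x = 0.1
fraction, exponent = math.frexp(x)

# Doubling and halving change only the exponent.
doubled_fraction, doubled_exponent = math.frexp(2 * x)
halved_fraction, halved_exponent = math.frexp(x / 2)

print(fraction, exponent)
print(doubled_fraction, doubled_exponent)
print(halved_fraction, halved_exponent)
```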
• Rounding rules are simplified
• Given a binary number which has more than 53 bits of precision, to round it to a 53-bit number:
– if the 54th bit is 0, then truncate (round down),
– if the bits after the 53rd bit are exactly 1000··· (a 1 followed only by zeros), then round up if the 53rd bit is 1, otherwise truncate, and
– otherwise, round up (add 1 to the 53rd bit)
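The three rules above can be sketched for positive integers in Python. The function `round_to_53_bits` is a hypothetical illustration, not a library routine. It is compared against Python's own integer-to-float conversion, which is assumed here to follow the same rule (rounding to nearest, with ties going to the value whose 53rd bit is 0).

```python
def round_to_53_bits(n):
    """Round a positive integer to 53 significant bits using the rules above."""
    extra = n.bit_length() - 53        # number of bits beyond the 53rd
    if extra <= 0:
        return n                       # already fits: nothing to round
    kept = n >> extra                  # the leading 53 bits
    dropped = n & ((1 << extra) - 1)   # everything after the 53rd bit
    half = 1 << (extra - 1)            # the pattern 1000...
    if dropped < half:
        pass                           # 54th bit is 0: truncate
    elif dropped == half:
        kept += kept & 1               # tie: round up only if the 53rd bit is 1
    else:
        kept += 1                      # otherwise: round up
    return kept << extra

for n in (2**53 + 1, 2**53 + 2, 2**53 + 3, 2**60 + 129):
    print(n, round_to_53_bits(n), int(float(n)))
```

If rounding up carries past the 53rd bit, the result is a power of two, which is still exactly representable, so no extra case is needed.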
• Remember, we deal with 53 bits because we store 52 bits together with the implicit leading 1
Usage Notes
• These slides are made publicly available on the web for anyone to use
• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:
– that you inform me that you are using the slides,
– that you acknowledge my work, and
– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely,
Douglas Wilhelm Harder, MMath