ECE 313 - Computer Organization Floating Point Feb 2005 Reading: 3.6-3.9 HW Due Monday 3/28:...

ECE 313 - Computer Organization

Floating PointFeb 2005

Reading: 3.6-3.9HW Due Monday 3/28: 3.7,3.9, 3.10, 3.14

3.30, 3.35, 3.37, 3.38, 3.40, 3.44

EXAM 1: Wednesday 3/30Why did the Ariane 5 Explode?(image source: java.sun.com)

Portions of these slides are derived from: Textbook figures © 1998 Morgan Kaufmann Publishers all rights reserved Tod Amon's COD2e Slides © 1998 Morgan Kaufmann Publishers all rights reserved Dave Patterson’s CS 152 Slides - Fall 1997 © UCB Rob Rutenbar’s 18-347 Slides - Fall 1999 CMU other sources as noted

Feb 2005 Floating Point 2

Outline - Floating Point

Motivation and Key Ideas IEEE 754 Floating Point Format Range and precision Floating Point Arithmetic MIPS Floating Point Instructions Rounding & Errors Summary


Floating Point - Motivation

Review: n-bit integer representations Unsigned: 0 to 2n-1 Signed Two’s Complement: - 2n-1 to 2n-1-1 Biased (excess-b): -b to 2n-b

Problem: how do we represent: Very large numbers 9,345,524,282,135,672,

2354

Very small numbers 0.00000000000000005216,2-100

Rational numbers 2/3 Irrational numbers sqrt(2) Transcendental numbers e, π


Fixed Point Representation

Idea: fixed-point numbers with fractions Decimal point (binary point) marks start of fraction

Decimal: 1.2503 = 1 X 100 + 2 X 10-1 + 5 X 10-2 + 3 X 10-4

Binary: 1.0100001 = 1 X 20 + 1 X 2-2 + 1 X 2-7

Problems Limited locations for “decimal point” (binary point”) Won’t work for very small or very larger numbers


Another Approach: Scientific Notation

Represent a number as a combination of Mantissa (significand): Normalized number

AND Exponent (base 10)

Example: 6.02 X 1023

Significand(mantissa)

Radix(base)

Exponent


Floating Point

Key idea: adapt scientific notation to binary Fixed-width binary number for significand Fixed-width binary number for exponent (base 2)

Idea: represent a number as1.xxxxxxxtwo X 2yyyy

Significand(mantissa)

Radix(2)

Exponent

Leading ‘1’(Implicit)

Important Points: This is a tradeoff between precision and range Arithmetic is approximate - error is inevitable!


IEEE 754 Floating Point

Single precision (C/C++/Java float type)

Value N = (-1)S X 1.F X 2E-127

Double precision (C/C++/Java double type)

Value N = (-1)S X 1.F X 2E-1023

S E Exponent F Significand

1 bit 8 bits 23 bits


F Significand (continued - 52 bits total)

1 bit 11 bits 20 bits

32 bits

Bias

Bias


Floating Point Examples

8.75ten = 1 X 23 + 1 X 2-1 + 1 X 2-2 = 1.00011 X 23

Single Precision: • Significand: 1.00011000…. (note leading 1 is implied)

• Exponent: 3 + 127 = 130 = 10000010two

Double Precision:• Significand: 1.00011000…

• Exponent: 3 + 1023 = 1026 = 10000000010two

0E Exponent F SignificandS

1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0



0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0



-0.375ten = 1 X 2-2 + 1 X 2-3 = 1. 1 X 2-2

Single Precision: • Significand: 1.1000….

• Exponent: -2 + 127 = 125 = 01111101two

Double Precision:• Significand: 1.1000…

• Exponent: -2 + 1023 = 1021 = 01111111101two


0 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0



1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0



Q: What is the value of the following single-precision word?

Significand = 1 + 2-1 + 2-4 + 2-8 + 2-10 + 2-12

Exponent = 8 - 127 = -119 Final Result = (1 + 2-1 + 2-4 + 2-8 + 2-10 + 2-12) X 2-119

= 2.36 X 10-36


0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0


Special Values in IEEE Floating Point

0000000 exponent - reserved for zero value (all bits zero) “Denormalized numbers” - drop the “1.”

• Used for “very small” numbers … “gradual underflow”

• Smallest denormalized number (single precision):0.00000000000000000000001 X 2-126 = 2-149

1111111 exponent Infinity - 111111 exponent, zero significand NaN (Not a Number) - 1111111 exponent, nonzero

significand


Floating Point Range and Precision

The tradeoff: range in exchange for uniformity “Tiny” example: floating point with:

3 exponent bits 2 signficand bits

– –10 –5 0 +5 +10 +

Denormalized Normalized Infinity

–1 –0.8 –0.6 –0.4 –0.2 0 +0.2 +0.4 +0.6 +0.8 +1

Denormalized Normalized Infinity

+0–0

Graphic and Example Source: R. Bryant and D. O’Halloran, Computer Systems: A Programmer’s Perspective, © Prentice Hall, 2002

s exp S

01245


Visualizing Floating Point - “Small” FP Representation

8-bit Floating Point Representation the sign bit is in the most significant bit. the next four bits are the exponent, with a bias of 7. the last three bits are the frac

Same General Form as IEEE Format normalized, denormalized representation of 0, NaN, infinity)

s exp significand

02367

Example Source: R. Bryant and D. O’Halloran, Computer Systems: A Programmer’s Perspective, © Prentice Hall, 2002


Small FP - Values Related to Exponent

Exp exp E 2E

0 0000 -6 1/64 (denorms)1 0001 -6 1/642 0010 -5 1/323 0011 -4 1/164 0100 -3 1/85 0101 -2 1/46 0110 -1 1/27 0111 0 18 1000 +1 29 1001 +2 410 1010 +3 811 1011 +4 1612 1100 +5 3213 1101 +6 6414 1110 +7 12815 1111 n/a (inf, Nan).


Small FP Example - Dynamic Range

s exp frac E Value0 0000 000 -6 00 0000 001 -6 1/8*1/64 = 1/5120 0000 010 -6 2/8*1/64 = 2/512…0 0000 110 -6 6/8*1/64 = 6/5120 0000 111 -6 7/8*1/64 = 7/5120 0001 000 -6 8/8*1/64 = 8/5120 0001 001 -6 9/8*1/64 = 9/512…0 0110 110 -1 14/8*1/2 = 14/160 0110 111 -1 15/8*1/2 = 15/160 0111 000 0 8/8*1 = 10 0111 001 0 9/8*1 = 9/80 0111 010 0 10/8*1 = 10/8…0 1110 110 7 14/8*128 = 2240 1110 111 7 15/8*128 = 2400 1111 000 n/a inf

closest to zero

largest denormsmallest norm

closest to 1 below

closest to 1 above

largest norm

Denormalizednumbers

Normalizednumbers


Learning from Tiny & Small FP

Non-uniform spacing of numbers very small spacing for large negative exponents very large spacing for large positive exponents

Exact representation: sums of powers of 2 Approximate representation: everything else


Summary: IEEE Floating Point Values

Single

precision

Double precision Object Represented

Exponent Significand Exponent Significand

0 0 0 0 0

0 nonzero 0 nonzero +/- denormalizednumber

1-254 anything 1-2046 anything +/- floating-pointnumber

255 0 2047 0 +/0 infinity

255 nonzero 2047 nonzero NaN (Not a Number)

Source: book p. 301


IEEE Floating Point - Interesting Numbers

Description exp frac Numeric Value

Zero 00…00 00…00 0.0

Smallest Pos. Denorm. 00…00 00…01 2– {23,52} X 2– {126,1022}

Single 1.4 X 10–45

Double 4.9 X 10–324

Largest Denormalized 00…00 11…11 (1.0 – ) X 2– {126,1022}

Single 1.18 X 10–38

Double 2.2 X 10–308

Smallest Pos. Normalized 00…01 00…00 1.0 X 2– {126,1022}

Just larger than largest denormalized

One 01…11 00…00 1.0

Largest Normalized 11…10 11…11 (2.0 – ) X 2{127,1023}

Single 3.4 X 1038

Double 1.8 X 10308


Floating Point Addition (Fig. 3.16)

1. Align binary point to number with larger exponent

2. Add significands

3. Normalize result and adjust exponent

4. If overflow/underflow throw exception

5. Round result (go to 3 if normalization needed again)

A 1.11 X 20 1.11 X 20 1.75

+ B + 1.00 X 2-2 + 0.01 X 20 0.25

10.00 X 20

(Normalize) 1.00 X 21 2.00

Hardware - Fig. 3.17, p. 201


Compare the exponent of the two numbers.Shift the smaller to the right until its exponent Would match the larger exponent

Normalize the sum, either shifting right and Increment the exponent or shifting left andDecrementing the exponent

Round the significant to the appropriateNumber of bits

Add the significants

Overflow or underflow

Still normalized?


Hardware Fig. 3.17, p. 201


Floating Point Multiplication (Fig. 3.18)

1. Add 2 exponents together to get new exponent (subtract 127 to get proper biased value)

2. Multiply significands

3. Normalize result if necessary (shift right) & adjust exponent

4. If overflow/underflow throw exception

5. Round result (go to 3 if normalization needed again)

6. Set sign of result using sign of X, Y


MIPS Floating Point Instructions

Organized as a coprocessor Separate registers $f0-$f31 Separate operations Separate data transfer (to same memory)

Basic operations add.s - single add.d - double sub.s - single sub.d - double mul.s - single mul.d - double div.s - single div.d - double


MIPS Floating Point Instructions (cont’d)

Data transfer lwc1, swcl (l.s, s.s) - load/store float to fp reg

l.d, s.d - load/store double to fp reg pair

Testing / branching c.lt.s, c.lt.d, c.eq.s, c.eq.d, …

compare and set condition bit if true bclt - branch if condition true bclf - branch if condition false


Rounding

Extra bits allow rounding after computation Guard Digit (may shift into number during normalization) Round digit - used to round when guard bit shifted during

normalization Sticky bit - used when there are 1’s to the right of the

round digit e.g., “0.010000001” (round to nearest even)

IEEE 754 supports four rounding modes Always round up Always round down Truncate Round to nearest even (most common)


Limitations on Floating-Point Math

Most numbers are approximate Roundoff error is inevitable Range (and accuracy) vary depending on exponent “Normal” math properties not guaranteed:

Inverse (1/r)*r ≠ 1 Associative (A+B) + C ≠ A + (B+C)

(A*B) * C ≠ A * (B*C) Distributive (A+B) * C ≠ A*B + B*C

Scientific calculations require error management take a numerical analysis for more info


IEEE Floating Point - Special Properties Floating Point 0 same as Integer 0

All bits = 0

Can (Almost) Use Unsigned Integer Comparison A > B if:

• A.EXP > B.EXP or

• A.EXP=B.EXP and A.SIG > B.SIG

But, must first compare sign bits Must consider -0 == 0 NaNs problematic

• Will be greater than any other values

• What should comparison yield?

This is equivalent to unsigned comparision!


Addendum - Why Did the Ariane 5 Explode?

In 1996 Ariane 5 Flight 501 exploded after launch. Estimated cost of accident: $500 million


Addendum - Why Did the Ariane 5 Explode?

The cause was traced to the Inertial reference system (SRI).

Both the main and backup SRI failed. Both units failed due to an out-of-range conversion

Input: double precision floating point Output: 16-bit integer for “horizontal bias” (BH)

Careful analysis during software design had indicated that BH would “fit” in 16 bits

So, why didn’t it fit?


Addendum -Why did the Ariane 5 Explode?

Careful analysis during software design had indicated that BH would “fit” in 16 bits

BUT, all analysis had been done for the Ariane 4, the predecessor of Ariane 5 - software was reused

Since Ariane 5 was a larger rocket, the values for BH were higher than anticipated

AND, there was no handler to deal with the exception!

For more information: http://www.ima.umn.edu/~arnold/disasters/ariane.html Or, Google “Ariane 5”


Summary - Chapter 3

Important Topics Signed & Unsigned Numbers (3.2) Addition and Subtraction (3.3) Carry Lookahead (B.6) Constructing an ALU (B.5) Multiplication and Division (3.4, 4.5) Floating Point (3.6)

Coming Up: Performance (Chapter 4)

ECE 313 - Computer Organization Floating Point Feb 2005 Reading: 3.6-3.9 HW Due Monday 3/28:...

Documents

Transcript of ECE 313 - Computer Organization Floating Point Feb 2005 Reading: 3.6-3.9 HW Due Monday 3/28:...