7/28/2019 Project Report Multiplication
1/32
CONTENTS

ABSTRACT
LIST OF SYMBOLS AND ACRONYMS
VHDL
INTRODUCTION TO FLOATING POINT NUMBERS
INTRODUCTION
HISTORY
RANGE OF FPN
FPN PRECISION
IEEE 754 FPN STANDARD
FPN REPRESENTATION
COMPUTER REPRESENTATION
IEEE FPN REPRESENTATION
ATTRIBUTES & ROUNDING
FPN ARITHMETIC
FPN REPRESENTATION FORMAT
PARAMETERS FOR THE IEEE 754 FLOATING-POINT STANDARD
FPN MULTIPLICATION
DENORMALS
FPN MULTIPLICATION ALGORITHM
HARDWARE OF FLOATING POINT MULTIPLIER
UNSIGNED MULTIPLIER
ADDITION PROCESS
NORMALIZER
UNDERFLOW/OVERFLOW DETECTION
MULTIPLICATION FLOWCHART
STRUCTURE OF MULTIPLICATION
FLOATING POINT MULTIPLIER ARCHITECTURE
PROPOSED CIFM ARCHITECTURE
Real-life application
Optimization criteria
Application
APPENDIX
CONCLUSION
REFERENCES
FLOATING POINT MULTIPLICATION USING VHDL

A report submitted in partial fulfillment of the requirements for the Degree
of
Bachelor of Technology
in
Electronics and Communication Engineering

Under the Guidance of
Manas Ranjan Tripathy
Department of Electronics and Communication Engineering

INSTITUTE OF TECHNICAL EDUCATION & RESEARCH, BHUBANESWAR
(SIKSHA O ANUSANDHAN UNIVERSITY, ODISHA)
2012

Submitted by:
Bibhu Bhushan Panda (0911016214)
Sadbhab Patra (0911016231)
Chandrakanta Parida (1021016041)
Sweta Chandan (0911016244)
INSTITUTE OF TECHNICAL EDUCATION AND RESEARCH
CERTIFICATE
This is to certify that the project titled FLOATING POINT MULTIPLICATION USING VHDL is the bonafide work of group C4, in partial fulfillment for the award of the Degree of Bachelor of Technology in Electronics and Communication Engineering, conducted under my supervision.
Project guide:
Mr. Manas Ranjan Tripathy
(Lecturer)
Department of Electronics and Communication Engineering
ITER, BHUBANESWAR
DECLARATION
We certify that
a. The work contained in this report is original and has been done by us under the guidance of
our supervisor.
b. The work has not been submitted to any other Institute for any degree or diploma.
c. We have followed the guidelines provided by the Institute in preparing the report.
d. We have conformed to the norms and guidelines given in the Ethical Code of Conduct of the
Institute.
e. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources, whenever necessary.
BIBHU BHUSHAN PANDA (0911016214)
SADBHAB PATRA (0911016231)
CHANDRAKANTA PARIDA (1021016041)
SWETA CHANDAN (0911016244)
ACKNOWLEDGMENT

We would like to thank Mr. Manas Ranjan Tripathy for providing us this opportunity to present the project on FLOATING POINT MULTIPLICATION USING VHDL.

We would like to thank Prof. Bibhu Prasad Mohanty (HOD) and Prof. Niva Das (Associate Dean), along with Mr. Manas Ranjan Tripathy, for their constant support and guidance. We would also like to extend our gratitude to the faculty and staff of the Department of Electronics and Communication Engineering for their valuable insights, which made this project a success.

Lastly, we thank one and all who helped in building this project and guided us in all aspects of its success.

BIBHU BHUSHAN PANDA (0911016214)
SADBHAB PATRA (0911016231)
CHANDRAKANTA PARIDA (1021016041)
SWETA CHANDAN (0911016244)
ABSTRACT
Shrinking feature sizes give designers more headroom to extend the functionality of microprocessors. As processor support for decimal floating-point arithmetic emerges, it is important to investigate efficient algorithms and hardware designs for common decimal floating-point arithmetic operations.

This report presents designs for a decimal floating-point multiplier. Binary floating-point arithmetic is usually sufficient for scientific and statistical applications. However, it is not sufficient for many commercial applications and database systems, in which operations often need to mirror manual calculations. Therefore, these applications often use software to perform decimal floating-point arithmetic operations.

The IEEE 754 standard provides a method for computation with floating-point numbers that yields the same result whether the processing is done in hardware, software, or a combination of the two. The results of the computation are identical, independent of implementation, given the same input data. Errors, and error conditions, in the mathematical processing are reported in a consistent manner regardless of implementation.

Keywords: exponent, normalized value, subnormal numbers.
LIST OF SYMBOLS

Serial No.  Symbol  Meaning
1           X       Real number
2           M       Significand
3           E       Exponent

LIST OF ACRONYMS

Serial No.  Acronym  Meaning
1           OFL      Overflow level
2           UFL      Underflow level
3           NaN      Not a Number
Chapter 1
1. VHDL

The VHSIC (Very High Speed Integrated Circuit) Hardware Description Language (VHDL) was first proposed in 1981. The development of VHDL was originated by IBM, Texas Instruments, and Intermetrics in 1983. The result, contributed to by many participating EDA (Electronic Design Automation) groups, was adopted as the IEEE 1076 standard in December 1987. VHDL is intended to provide a tool that the digital systems community can use to distribute their designs in a standard format. Using VHDL, designers are able to talk to each other about their complex digital circuits in a common language without the difficulties of revealing technical details.

As a standard description of digital systems, VHDL is used as input and output to various simulation, synthesis, and layout tools. The language provides the ability to describe systems, networks, and components at a very high behavioral level as well as at a very low gate level. It also supports a top-down design methodology and environment. Simulations can be carried out at any level, from a general functional analysis to a very detailed gate-level waveform analysis.
1.1 INTRODUCTION TO FLOATING POINT NUMBERS

1. INTRODUCTION

In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10, or 16. The typical number that can be represented exactly is of the form:

significand x base^exponent
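For instance, a value can be rebuilt from these three parts directly. The sketch below is illustrative Python, not part of the original report; the helper name is ours:

```python
# Rebuild a value from its significand, base, and exponent parts.
def from_parts(significand: float, base: int, exponent: int) -> float:
    return significand * base ** exponent

print(from_parts(1.5, 2, 10))    # 1536.0  (1.5 x 2^10)
print(from_parts(1.25, 2, -2))   # 0.3125  (1.25 x 2^-2)
```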
The term floating point refers to the fact that the radix point (decimal point or, more commonly in computers, binary point) can "float"; that is, it can be placed anywhere relative to the significant digits of the number. This position is indicated separately in the internal representation, and floating-point representation can thus be thought of as a computer realization of scientific notation. Over the years, a variety of floating-point representations have been used in computers. However, since the 1990s, the most commonly encountered representation is that defined by the IEEE 754 standard.

The advantage of floating-point representation over fixed-point and integer representation is that it can support a much wider range of values. For example, a fixed-point representation that has seven decimal digits with two decimal places can represent the numbers 12345.67, 123.45, 1.23, and so on, whereas a floating-point representation (such as the IEEE 754 decimal32 format) with seven decimal digits could in addition represent 1.234567, 123456.7, 0.00001234567, 1234567000000000, and so on. The floating-point format needs slightly more storage (to encode the position of the radix point), so when stored in the same space, floating-point numbers achieve their greater range at the expense of precision.
i. History

In 1914 Leonardo Torres y Quevedo designed an electro-mechanical version of Charles Babbage's Analytical Engine which included floating-point arithmetic. In 1938, Konrad Zuse of Berlin completed the Z1, the first mechanical binary programmable computer; it was, however, unreliable in operation. It worked with 22-bit binary floating-point numbers having a 7-bit signed exponent, a 15-bit significand (including one implicit bit), and a sign bit. The memory used sliding metal parts to store 64 words of such numbers. The relay-based Z3, completed in 1941, had representations for plus and minus infinity. It implemented defined operations with infinity, such as 1/inf = 0, and stopped on undefined operations such as 0/0. It also implemented the square root operation in hardware.

Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that would have included infinities and NaNs, anticipating features of the IEEE floating-point standard by four
decades. By contrast, von Neumann recommended against floating point for the 1951 IAS machine, arguing that fixed-point arithmetic was preferable.

The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942 to 1945. The Bell Laboratories Mark V computer implemented decimal floating point in 1946.

Prior to the IEEE 754 standard, computers used many different forms of floating point. These differed in the word sizes, the format of the representations, and the rounding behavior of operations. These differing systems implemented different parts of the arithmetic in hardware and software, with varying accuracy.

The IEEE 754 standard was created in the early 1980s, after word sizes of 32 bits (or 16 or 64) had been generally settled upon. It was based on a proposal from Intel, who were designing the i8087 numerical coprocessor. Prof. W. Kahan was the primary architect behind this proposal, along with his student Jerome Coonen at U.C. Berkeley and visiting Prof. Harold Stone, for which he was awarded the 1989 Turing Award. Among the innovations are these:

-- A precisely specified encoding of the bits, so that all compliant computers would interpret bit patterns the same way. This made it possible to transfer floating-point numbers from one computer to another.

-- A precisely specified behavior of the arithmetic operations: arithmetic operations were required to be correctly rounded, i.e. to give the same result as if infinitely precise arithmetic was used and then rounded. This meant that a given program, with given data, would always produce the same result on any compliant computer. This helped reduce the almost mystical reputation that floating-point computation had for seemingly nondeterministic behavior.

-- The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled way.
ii. Range of floating-point numbers

By allowing the radix point to be adjustable, floating-point notation allows calculations over a wide range of magnitudes, using a fixed number of digits, while maintaining good precision. For example, in a decimal floating-point system with three digits, the multiplication that humans would write as

0.12 x 0.12 = 0.0144

would be expressed as

(1.20 x 10^-1) x (1.20 x 10^-1) = (1.44 x 10^-2).

In a fixed-point system with the decimal point at the left, it would be

0.120 x 0.120 = 0.014.

A digit of the result was lost because of the inability of the digits and decimal point to 'float' relative to each other within the digit string.
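The same contrast can be reproduced with Python's decimal module (an illustrative sketch, not part of the report):

```python
from decimal import Decimal, getcontext

# Emulate the three-significant-digit decimal floating-point system from the text.
getcontext().prec = 3
product = Decimal("0.12") * Decimal("0.12")
print(product)   # 0.0144 -- all three significant digits survive

# A fixed-point system with three decimal places loses the last digit:
fixed = (Decimal("0.120") * Decimal("0.120")).quantize(Decimal("0.001"))
print(fixed)     # 0.014
```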
The range of floating-point numbers depends on the number of bits or digits used for the representation of the significand (the significant digits of the number) and for the exponent. On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit. Positive floating-point numbers in this format have an approximate range of 10^-308 to 10^308, because the range of the exponent is [-1022, 1023] and 308 is approximately log10(2^1023). The complete range of the format is from about -10^308 through +10^308.

The number of normalized floating-point numbers in a system F(B, P, L, U) (where B is the base of the system, P is the precision of the system to P digits, L is the smallest exponent representable in the system, and U is the largest exponent used in the system) is:

2 (B - 1) B^(P-1) (U - L + 1) + 1.

There is a smallest positive normalized floating-point number, Underflow level = UFL = B^L, which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.
There is a largest floating-point number, Overflow level = OFL = B^(U+1) (1 - B^-P), which has B - 1 as the value for each digit of the significand and the largest possible value for the exponent.

In addition, there are representable values strictly between -UFL and UFL: namely, zero and negative zero, as well as subnormal numbers.
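These quantities are easy to check numerically. The sketch below (illustrative Python, not from the report) evaluates UFL, OFL, and the count of normalized numbers for a toy system F(10, 3, -5, 5), and compares double precision against sys.float_info:

```python
import sys

# Toy decimal system F(B=10, P=3, L=-5, U=5), per the formulas in the text.
B, P, L, U = 10, 3, -5, 5

ufl = B ** L                          # smallest positive normalized number
ofl = B ** (U + 1) * (1 - B ** -P)    # largest representable number
count = 2 * (B - 1) * B ** (P - 1) * (U - L + 1) + 1   # including zero

print(ufl)    # 1e-05
print(ofl)    # 999000.0  (significand 9.99 times 10^5)
print(count)  # 19801

# IEEE double precision: UFL = 2^-1022, exponents up to 1023.
assert sys.float_info.min == 2.0 ** -1022
assert sys.float_info.max_exp == 1024
```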
iii. Floating-point precisions

IEEE 754 defines:

16-bit: Half (binary16)
32-bit: Single (binary32), decimal32
64-bit: Double (binary64), decimal64
128-bit: Quadruple (binary128), decimal128
Extended precision formats
Other: Minifloat, arbitrary precision

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754 (also known as IEC 60559). This standard is followed by almost all modern machines. Notable exceptions include IBM mainframes, which support IBM's own format (in addition to the IEEE 754 binary and decimal formats), and Cray vector machines, where the T90 series had an IEEE version but the SV1 still uses Cray floating-point format.

The standard provides for many closely related formats, differing in only a few details. Five of these formats are called basic formats and others are termed extended formats; three of these are especially widely used in computer hardware and languages:

-- Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).

-- Double precision, called "double" in the C language family, and "double precision" or "real*8" in Fortran. This is a binary format that occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (about 16 decimal digits).
-- Double extended format, an 80-bit floating-point value. This is implemented on most personal computers but not on other devices. Sometimes "long double" is used for this in the C language family (the C99 and C11 standards' "IEC 60559 floating-point arithmetic extension", Annex F, recommend the 80-bit extended format be provided as "long double" when available), though "long double" may be a synonym for "double" or may stand for quadruple precision. Extended precision can help minimise accumulation of round-off error in intermediate calculations.

Any integer with absolute value less than or equal to 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than or equal to 2^53 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers. To a rough approximation, the bit representation of an IEEE binary floating-point number is proportional to its base 2 logarithm.
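These exact-integer limits can be observed directly (illustrative Python; the `to_f32` helper simulating single precision is ours):

```python
import struct

# Round-trip a value through single precision (32 bits) to observe its limit.
def to_f32(x) -> float:
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]

# Doubles represent every integer up to 2**53 exactly; 2**53 + 1 is the first gap.
assert float(2 ** 53) == 2 ** 53
assert float(2 ** 53 + 1) == float(2 ** 53)   # rounds back down

# Singles are exact only up to 2**24.
assert to_f32(2 ** 24) == 2 ** 24
assert to_f32(2 ** 24 + 1) == 2 ** 24         # 16777217 is not representable
```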
Chapter 2
2. IEEE-754 FLOATING-POINT STANDARD

In the early days of digital computers, it was quite common for machines from different vendors to have different word lengths and unique floating-point formats. This caused many problems, especially in porting programs between different machines (designs). A main objective in developing a floating-point representation standard is to make numerical programs predictable and completely portable, in the sense of producing identical results when run on different machines. The IEEE-754 floating-point standard, formally named ANSI/IEEE Std 754-1985 and introduced in 1985, tried to solve these problems. A main objective of this standard is that an implementation of a floating-point system conforming to it can be realized entirely in software, entirely in hardware, or in any combination of the two. The standard specifies two formats for floating-point numbers, basic (single precision) and extended (double precision); it also specifies the basic operations for both formats, including addition and subtraction. Finally, it describes the different floating-point exceptions and their handling, including non-numbers (NaNs).
Table 1: Features of the ANSI/IEEE Standard Floating-Point Representation

Feature               Single                      Double
Word length, bits     32                          64
Significand bits      23 + 1 (hidden)             52 + 1 (hidden)
Significand range     [1, 2 - 2^-23]              [1, 2 - 2^-52]
Exponent bits         8                           11
Exponent bias         127                         1023
Zero (+/-0)           e + bias = 0, f = 0         e + bias = 0, f = 0
Denormal              e + bias = 0, f /= 0        e + bias = 0, f /= 0
Infinity (+/-inf)     e + bias = 255, f = 0       e + bias = 2047, f = 0
Not-a-Number (NaN)    e + bias = 255, f /= 0      e + bias = 2047, f /= 0
Minimum               2^-126 ~ 1.2 x 10^-38       2^-1022 ~ 2.2 x 10^-308
Maximum               2^128 ~ 3.4 x 10^38         2^1024 ~ 1.8 x 10^308
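The single-precision field layout in Table 1 can be checked by unpacking the raw bits (an illustrative Python sketch; the function name is ours, and the masks assume the 1/8/23 layout above):

```python
import struct

def decode_single(x: float):
    """Split an IEEE 754 single-precision value into (sign, biased exponent, fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    biased_exp = (bits >> 23) & 0xFF   # 8 exponent bits, bias 127
    fraction = bits & 0x7FFFFF         # 23 stored significand bits
    return sign, biased_exp, fraction

print(decode_single(1.0))            # (0, 127, 0): exponent 0 + bias 127
print(decode_single(-2.5))           # (1, 128, 2097152): -1.25 x 2^1
print(decode_single(float("inf")))   # (0, 255, 0): e + bias = 255, f = 0
```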
PROBLEMS ASSOCIATED WITH FLOATING POINT ADDITION
For the inputs, the exponents of the numbers may be dissimilar, and dissimilar exponents cannot be added directly. So the first problem is equalizing the exponents: the exponent of the smaller number must be increased until it equals that of the larger number. Then the significands are added. The fixed sizes of the mantissa and exponent of a floating-point number cause several problems to arise during addition and subtraction. The second problem is overflow of the mantissa; it can be solved by rounding the result. The third problem is overflow and underflow of the exponent. The former occurs when the mantissa overflows and an adjustment in the exponent is attempted; underflow can occur while normalizing a small result. Unlike the case of fixed-point addition, an overflow in the mantissa is not disabling; simply shifting the mantissa and increasing the exponent can compensate for such an overflow. Another problem is the normalization of addition and subtraction results: the sum or difference of two significands may be a number which is not in normalized form, so it should be normalized before returning the result.
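The alignment and renormalization steps above can be sketched on decimal significand/exponent pairs (illustrative Python, not the report's VHDL; truncating arithmetic is assumed for simplicity):

```python
def fp_add(m1: int, e1: int, m2: int, e2: int, digits: int = 3):
    """Add two decimal values m * 10^e with a fixed-width significand of `digits` digits."""
    # Step 1: equalize exponents by shifting the smaller operand right.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 //= 10 ** (e1 - e2)
    # Step 2: add the aligned significands.
    m, e = m1 + m2, e1
    # Step 3: renormalize if the significand overflowed its digit budget.
    while m >= 10 ** digits:
        m, e = m // 10, e + 1   # truncate, bump exponent
    return m, e

# 999 + 100 overflows three digits, so the exponent is bumped: 1099 -> 109 x 10^1
print(fp_add(999, 0, 100, 0))   # (109, 1)
```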
2.1 Floating Point Representation

i. Computer Representation of Numbers

Computers which work with real arithmetic use a system called floating point. Suppose a real number x has the binary expansion

x = +/- m x 2^E, where 1 <= m < 2, and

m = (b0.b1b2b3...)_2 with b0 = 1.

To store a number in floating point representation, a computer word is divided into 3 fields, representing the sign, the exponent E, and the significand m respectively. A 32-bit word could be divided into fields as follows: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. Since the exponent field is 8 bits, it can be used to represent exponents between -128 and 127. The significand field can store the first 23 bits of the binary representation of m, namely b0.b1...b22.
FORMATS: This clause defines floating-point formats, which are used to represent a finite subset of real numbers. Formats are characterized by their radix, precision, and exponent range, and each format can represent a unique set of floating-point data. All formats can be supported as arithmetic formats; that is, they may be used to represent floating-point operands. Specific fixed-width encodings for binary and decimal formats are defined in this clause for a subset of the formats. These interchange formats are identified by their size and can be used for the exchange of floating-point data between implementations.

Five basic formats are defined: three binary formats, with encodings in lengths of 32, 64, and 128 bits, and two decimal formats, with encodings in lengths of 64 and 128 bits. Additional arithmetic formats are recommended for extending these basic formats. The choice of which of this standard's formats to support is language-defined or, if the relevant language standard is silent or defers to the implementation, implementation-defined. The names used for formats in this standard are not necessarily those used in programming environments.
ii. IEEE Floating Point Representation

In the 1960s and 1970s, each computer manufacturer developed its own floating point system, leading to a lot of inconsistency in how the same program behaved on different machines. For example, although most machines used binary floating point systems, the IBM 360/370 series, which dominated computing during this period, used a hexadecimal base, i.e. numbers were represented as +/- m x 16^E. Other machines, such as HP calculators, used a decimal floating point system. Through the efforts of several computer scientists, particularly W. Kahan, a binary floating point standard was developed in the early 1980s and, most importantly, followed very carefully by the principal manufacturers of floating point chips for personal computers, namely Intel and Motorola. This standard has become known as the IEEE floating point standard, since it was developed and endorsed by a working committee of the Institute of Electrical and Electronics Engineers.

The IEEE standard has three very important requirements:

-- consistent representation of floating point numbers across all machines adopting the standard
-- correctly rounded arithmetic
-- consistent and sensible treatment of exceptional situations such as division by zero
We start with the following observation. In the last section, we chose to normalize a nonzero number x so that x = +/- m x 2^E, where 1 <= m < 2, i.e.

m = (b0.b1b2...)_2

with b0 = 1. In the simple floating point model, we stored the leading nonzero bit b0 in the first position of the field provided for m. Note, however, that since we know this bit has the value one, it is not necessary to store it. Consequently, we can use the 23 bits of the significand field to store b1b2...b23 instead of b0b1...b22, changing the machine precision from 2^-22 to 2^-23. Since the bit string stored in the significand field is now actually the fractional part of the significand, we shall refer henceforth to the field as the fraction field. Given a string of bits in the fraction field, it is necessary to imagine that the symbols "1." appear in front of the string, even though these symbols are not stored. This technique is called hidden bit normalization and was used by Digital for the VAX machine in the late 1970s.
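The hidden bit can be seen directly in the stored encoding (an illustrative Python sketch; the helper name is ours, and the masks assume the single-precision layout described earlier):

```python
import struct

def single_significand(x: float) -> float:
    """Recover the significand 1.f of a positive normalized single-precision value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    fraction = bits & 0x7FFFFF         # only b1..b23 are stored
    return 1.0 + fraction / 2 ** 23    # the leading 1 (b0) is implied, never stored

print(single_significand(1.0))   # 1.0 -> the fraction field is all zeros
print(single_significand(6.0))   # 1.5 -> 6 = 1.5 x 2^2
```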
iii. Attributes and rounding

Attribute specification: An attribute is logically associated with a program block to modify its numerical and exception semantics. A user can specify a constant value for an attribute parameter. Some attributes have the effect of an implicit parameter to most individual operations of this standard; language standards shall specify rounding-direction attributes and should specify alternate exception handling attributes.

Other attributes change the mapping of language expressions into operations of this standard; language standards that permit more than one such mapping should provide support for:

-- preferredWidth attributes
-- value-changing optimization attributes
-- reproducibility attributes
For attribute specification, the implementation shall provide language-defined means, such as
compiler directives, to specify a constant value for the attribute parameter for all standard
operations in a block; the scope of the attribute value is the block with which it is associated.
Language standards shall provide for constant specification of the default and each specific value
of the attribute.
Rounding and Correctly Rounded Arithmetic:

We use the terminology "floating point numbers" to mean all acceptable numbers in a given IEEE floating point arithmetic format. This set consists of +/-0, subnormal and normalized numbers, and +/- infinity, but not NaN values, and is a finite subset of the reals. We have seen that most real numbers, such as 1/10 and pi, cannot be represented exactly as floating point numbers. For ease of expression we will say a general real number is normalized if its modulus lies between the smallest and largest positive normalized floating point numbers, with a corresponding use of the word subnormal. In both cases the representations we give for these numbers will parallel the floating point number representations in that b0 = 1 for normalized numbers, and b0 = 0 with E = -126 for subnormal numbers.

For any number x which is not a floating point number, there are two obvious choices for the floating point approximation to x: the closest floating point number less than x, which we call x-, and the closest floating point number greater than x, which we call x+. The IEEE standard defines the correctly rounded value of x, which we shall denote round(x), as follows. If x happens to be a floating point number, then round(x) = x. Otherwise, the correctly rounded value depends on which of the following four rounding modes is in effect:

Round down: round(x) = x-.

Round up: round(x) = x+.

Round towards zero: round(x) is either x- or x+, whichever is between zero and x.
Round to nearest
round(x) is either x_ or x+, whichever is nearer to x. In the case of a tie, the one with its least
significant bit equal to zero is chosen.
If x is positive, then x- is between zero and x, so round down and round towards zero have the
same effect. If x is negative, then x+ is between zero and x, so it is round up and round towards
zero which have the same effect. In either case, round towards zero simply requires truncating
the binary expansion, i.e. discarding bits.
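The four modes can be illustrated with Python's decimal module, whose context rounding options map closely onto the IEEE rounding modes. This is an illustrative sketch only; the report's own implementation language is VHDL.

```python
# Illustrative only: decimal's context rounding mirrors the four IEEE modes.
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING, ROUND_DOWN, ROUND_HALF_EVEN

x = Decimal("3.14159")  # not representable in 4 significant digits
for name, mode in [("round down",        ROUND_FLOOR),
                   ("round up",          ROUND_CEILING),
                   ("round toward zero", ROUND_DOWN),
                   ("round to nearest",  ROUND_HALF_EVEN)]:
    # plus() applies the context's precision and rounding mode to +x
    print(name, Context(prec=4, rounding=mode).plus(x))
# round down / round toward zero give 3.141; round up / round to nearest give 3.142
```

For a positive x, round down and round toward zero coincide, exactly as the text states.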
The most useful rounding mode, and the one which is almost always used, is round to nearest,
since this produces the floating point number which is closest to x. In the toy precision system,
with x = 1.7, round to nearest gives a rounded value of x equal to 1.75. When the word round is
used without any qualification, it almost always means round to nearest. In the more familiar
decimal context, if we round the number pi = 3.14159 to four significant decimal digits, we obtain
the result 3.142, which is closer to pi than the truncated result 3.141.
iv. Floating Point Arithmetic
Although integers provide an exact representation for numeric values, they suffer from two
major drawbacks:
-- the inability to represent fractional values
-- a limited dynamic range
Floating point arithmetic solves these two problems at the expense of accuracy and, on some
processors, speed. Most programmers are aware of the speed loss associated with floating point
arithmetic; however, they are blithely unaware of the problems with accuracy.
For many applications, the benefits of floating point outweigh the disadvantages.
A big problem with floating point arithmetic is that it does not follow the standard rules of
algebra. Nevertheless, many programmers apply normal algebraic rules when using floating
point arithmetic. This is a source of bugs in many programs. One of the primary
goals of this section is to describe the limitations of floating point arithmetic so it can be properly
used. Normal algebraic rules apply only to infinite precision arithmetic. Let us consider the
simple statement x := x + 1, where x is an integer. On any modern computer this statement follows the
normal rules of algebra as long as overflow does not occur. That is, this statement is valid only
for certain values of x (minint <= x < maxint).
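By contrast, floating point breaks even this simple identity without any overflow occurring. A small illustrative check (an assumed example, not taken from the report): once x is large enough that 1 falls below the rounding threshold of a double, x + 1 rounds back to x.

```python
# Assumed example (not from the report): 2**53 is the first double whose
# successor integer cannot be represented, so adding 1 is lost entirely.
x = 2.0 ** 53
print(x + 1 == x)   # True: the +1 is rounded away (round-to-nearest-even)
print(x + 2 == x)   # False: x + 2 is exactly representable
```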
smaller number, obtaining 1.68e1 which is even less correct. Extra digits available during a
computation are known as guard digits (or guard bits in the case of a binary format). They
greatly enhance accuracy during a long chain of computations.
The accuracy loss during a single computation usually isn't enough to worry about
unless we are greatly concerned about the accuracy of our computations. However, if we compute
a value which is the result of a sequence of floating point operations, the error can accumulate
and greatly affect the computation itself. For example, suppose we
were to add 1.23e3 and 1.00e0. Adjusting the numbers so their exponents are the same before
the addition produces 1.23e3 + 0.001e3. The sum of these two values, even after rounding, is
1.23e3. This might seem perfectly reasonable; after all, we can only maintain three significant
digits, and adding in a small value shouldn't affect the result at all.
However, suppose we were to add 1.00e0 to 1.23e3 ten times. The first time we add 1.00e0 to
1.23e3 we get 1.23e3. Likewise, we get this same result the second, third, fourth, and tenth
time we add 1.00e0 to 1.23e3. On the other hand, had we added 1.00e0 to itself ten times, then
added the result (1.00e1) to 1.23e3, we would have gotten a different result, 1.24e3. This is the
most important thing to know about limited precision arithmetic:
The order of evaluation can affect the accuracy of the result.
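The three-significant-digit example above can be reproduced with Python's decimal module. This is an illustrative sketch; the report's arithmetic is binary and its implementation target is VHDL.

```python
# Reproducing the 3-significant-digit example with Python's decimal module.
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)
big, one = Decimal("1.23E+3"), Decimal("1.00E+0")

acc = big
for _ in range(10):          # ten separate additions: each rounds the 1 away
    acc = ctx.add(acc, one)
print(acc)                   # 1.23E+3

small = Decimal(0)
for _ in range(10):          # sum the small values first...
    small = ctx.add(small, one)
print(ctx.add(big, small))   # ...then add once: 1.24E+3
```

The same two operand sets, evaluated in a different order, give different results.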
We can get more accurate results if the relative magnitudes (that is, the exponents) are close to
one another. Whenever a chain calculation involving addition and subtraction is being
performed, we should attempt to group the values appropriately. Another problem with
addition and subtraction is that you can wind up with false precision. Consider the computation
1.23e0 - 1.22e0. This produces 0.01e0. Although this is mathematically equivalent to 1.00e-2,
this latter form suggests that the last two digits are exactly zero. Unfortunately, we've only got a
single significant digit at this time. Indeed, some FPUs or floating point software packages might
actually insert random digits (or bits) into the least significant positions. This brings up a second
important rule concerning limited precision arithmetic:
Whenever subtracting two numbers with the same signs or adding two numbers with different
signs, the accuracy of the result may be less than the precision available in the floating point
format.
Multiplication and division do not suffer from the same problems as addition and
subtraction since we do not have to adjust the exponents before the operation; all we need to do
is add the exponents and multiply the mantissas (or subtract the exponents and divide the
mantissas). By themselves, multiplication and division do not produce particularly poor results.
However, they tend to multiply any error which already exists in a value. For example, if we
multiply 1.23e0 by two, when we should be multiplying 1.24e0 by two, the result is even less
accurate. This brings up a third important rule when working with limited precision arithmetic:
When performing a chain of calculations involving addition, subtraction, multiplication,
and division, try to perform the multiplication and division operations first.
Often, by applying normal algebraic transformations, we can arrange a calculation so the
multiply and divide operations occur first. For example, suppose we want to compute x*(y+z).
Normally we would add y and z together and multiply their sum by x. However, we can get a
little more accuracy if we transform x*(y+z) to get x*y+x*z and compute the result by
performing the multiplications first. Multiplication and division are not without their own
problems. When multiplying two very large or very small numbers, it is quite possible for
overflow or underflow to occur. The same situation occurs when dividing a small number by a
large number or dividing a large number by a small number. This brings up a fourth rule we
should attempt to follow when multiplying or dividing values:
When multiplying and dividing sets of numbers, try to arrange the multiplications
so that they multiply large and small numbers together; likewise, try to divide numbers that have
the same relative magnitudes.
Comparing floating point numbers is very dangerous. Given the inaccuracies present in any
computation (including converting an input string to a floating point value), two floating point
values should never be compared to see if they are equal. In a binary floating point format,
different computations which produce the same (mathematical) result may differ in their least
significant bits. For example, adding 1.31e0 + 1.69e0 should produce 3.00e0. Likewise, adding
1.50e0 + 1.50e0 should produce 3.00e0. However, if we were to compare (1.31e0 + 1.69e0) against
(1.50e0 + 1.50e0), we might find out that these sums are not equal to one another. The test for
equality succeeds if and only if all bits (or digits) in the two operands are exactly the same. Since
this is not necessarily true after two different floating point computations which should produce
the same result, a straight test for equality may not work.
The standard way to test for equality between floating point numbers is to determine how much
error (or tolerance) you will allow in a comparison and check to see if one value is within this
error range of the other. The straightforward way to do this is to use a test like the following:
if Value1 >= (Value2 - error) and Value1 <= (Value2 + error) then ...
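The tolerance test can be sketched in Python as follows (an illustration only; the report's implementation language is VHDL, and the bound `error` here is a caller-chosen value, not one from the report).

```python
# A sketch of the tolerance comparison: two sums that are mathematically
# equal may differ in their low bits, but agree within a chosen tolerance.
import math

a = 1.31 + 1.69              # mathematically 3.00
b = 1.50 + 1.50              # mathematically 3.00, exact in binary
error = 1e-9

within = (a >= b - error) and (a <= b + error)
print(within)                              # True
print(math.isclose(a, b, abs_tol=error))   # True: the idiomatic equivalent
```

`math.isclose` is the idiomatic Python form of the same test, with both relative and absolute tolerance parameters.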
A binary floating-point number x is represented as a significand and an exponent, x = s * 2^e.
The formula
(s1 * 2^e1) * (s2 * 2^e2) = (s1 * s2) * 2^(e1+e2)
shows that a floating-point multiply algorithm has several parts. The first part multiplies
the significands using ordinary integer multiplication. Because floating point numbers are
stored in sign-magnitude form, the multiplier need only deal with unsigned numbers
(although we have seen that Booth recoding handles signed two's complement numbers
painlessly). The second part rounds the result. If the significands are unsigned p-bit
numbers (e.g., p = 24 for single precision), then the product can have as many as 2p bits and
must be rounded to a p-bit number. The third part computes the new exponent.
Because exponents are stored with a bias, this involves subtracting the bias from the sum
of the biased exponents.
Example
How does the multiplication of the single-precision numbers
1 10000010 000. . . = -1 * 2^3
0 10000011 000. . . = 1 * 2^4
proceed in binary?
Answer
When unpacked, the significands are both 1.0, their product is 1.0, and so the
result is of the form
1 ???????? 000. . .
To compute the exponent, use the formula
biased exp(e1 + e2) = biased exp(e1) + biased exp(e2) - bias
The bias is 127 = 01111111 in binary, so in two's complement -127 is 10000001. Thus, the biased
exponent of the product is
  10000010
  10000011
+ 10000001
----------
  10000110
Since this is 134 decimal, it represents an exponent of 134 - bias = 134 - 127 = 7, as
expected.
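The exponent arithmetic in this example can be checked mechanically. The following Python sketch (the helper `fields` is mine, not part of the report's VHDL design) unpacks the two operands and confirms that adding the biased exponents and subtracting the bias once yields 134.

```python
# Checking the worked example: biased exponents add, the bias is removed once.
import struct

def fields(x):
    """Unpack an IEEE-754 single into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

s1, e1, f1 = fields(-1.0 * 2**3)   # 1 10000010 000...
s2, e2, f2 = fields( 1.0 * 2**4)   # 0 10000011 000...
print(e1, e2)                      # 130 131

e_prod = e1 + e2 - 127             # add biased exponents, subtract the bias
print(e_prod)                      # 134 = 10000110, i.e. unbiased exponent 7
print(fields((-1.0 * 2**3) * (1.0 * 2**4))[1] == e_prod)   # True
```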
The interesting part of floating-point multiplication is rounding. Since the cases are similar
in all bases, the discussion here uses human-friendly base 10 rather than base 2.
For floating point number multiplication it is necessary to know about floating point number
addition, since while performing floating point multiplication we have to perform addition
anyhow to get the final result. The addition may generate a carry, after which the result must be
renormalized, and in renormalizing it may lose precision bits. For that we keep three extra bits:
guard, round and sticky. Hence, it is important to know how addition occurs before studying
multiplication. The next chapter describes how addition is done and what procedures are
followed in order to get the final result.
Chapter 3
3. ADDITION ALGORITHM
Let a1 and a2 be the two numbers to be added. The notations ei and si are used for the
exponent and significand of the addend ai. This means that the floating-point inputs
have been unpacked and that si has an explicit leading bit. To add a1 and a2, perform
these steps:
1. If e1 < e2, swap the operands. This ensures that the difference of the exponents
satisfies d = e1 - e2 >= 0. Tentatively set the exponent of the result to e1.
2. If the signs of a1 and a2 differ, replace s2 by its two's complement.
3. Place s2 in a p-bit register and shift it d = e1 - e2 places to the right (shifting in 1s if
s2 was complemented in the previous step). From the bits shifted out, set g to the most-
significant bit, r to the next most-significant bit, and set the sticky bit s to the OR of the rest.
4. Compute a preliminary significand S = s1 + s2 by adding s1 to the p-bit register
containing s2. If the signs of a1 and a2 are different, the most-significant bit of S is 1, and
there was no carry out, then S is negative. Replace S with its two's complement. This can
only happen when d = 0.
5. Shift S as follows. If the signs of a1 and a2 are the same and there was a carry out in step
4, shift S right by one, filling the high order position with one (the carry out). Otherwise
shift it left until it is normalized. When left shifting, on the first shift fill in the low order
position with the g bit. After that, shift in zeros. Adjust the exponent of the result
accordingly.
6. Adjust r and s. If S was shifted right in step 5, set r := low order bit of S before shifting
and s := g OR r OR s. If there was no shift, set r := g, s := r. If there was a single left shift,
don't change r and s. If there were two or more left shifts, set r := 0, s := 0. (In the last
case, two or more shifts can only happen when a1 and a2 have opposite signs and the
same exponent, in which case the computation s1 + s2 in step 4 will be exact.)
7. Compute the sign of the result. If a1 and a2 have the same sign, this is the sign of the
result. If a1 and a2 have different signs, then the sign of the result depends on which of
a1, a2 is negative, whether there was a swap in step 1, and whether S was replaced by its
two's complement in step 4.
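Step 3 above, extracting the guard, round, and sticky bits during alignment, can be sketched as follows. This is an illustrative Python model, not the report's VHDL, and the helper name `align` is my own.

```python
# An illustrative model of step 3: shift a significand right by d places and
# capture the guard (g), round (r), and sticky (s) bits from what falls off.
def align(sig, d):
    shifted_out = sig & ((1 << d) - 1) if d > 0 else 0
    kept = sig >> d
    g = (shifted_out >> (d - 1)) & 1 if d >= 1 else 0   # most-significant lost bit
    r = (shifted_out >> (d - 2)) & 1 if d >= 2 else 0   # next lost bit
    s = int((shifted_out & ((1 << max(d - 2, 0)) - 1)) != 0)  # OR of the rest
    return kept, g, r, s

# 10110 shifted right 3 places loses the bits 1, 1, 0 -> g = 1, r = 1, s = 0
print(align(0b10110, 3))   # (2, 1, 1, 0)
```

The kept bits plus (g, r, s) are exactly what step 6 later consults when the sum is normalized and rounded.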
3.1 ABOUT FLOATING POINT ARITHMETIC
Arithmetic operations on floating point numbers consist of addition, subtraction, multiplication
and division. The operations are done with algorithms similar to those used on sign-magnitude
integers (because of the similarity of representation): for example, only add numbers of the same
sign; if the numbers are of opposite sign, a subtraction must be done.
ADDITION
Example on a decimal value given in scientific notation:
3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------
first step: align decimal points
second step: add
3.25 x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
3.250263 x 10 ** 3
(presumes use of infinite precision, without regard for accuracy)
third step: normalize the result (already normalized!)
example on a floating point value given in binary:
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
to add these floating point representations,
step 1: align radix points
shifting the mantissa LEFT by 1 bit DECREASES THE EXPONENT by 1
shifting the mantissa RIGHT by 1 bit INCREASES THE EXPONENT by 1
we want to shift the mantissa right, because the bits that fall off the end should come from the
least significant end of the mantissa
-> we choose to shift the .25, since we want to increase its exponent
-> shift by:
  10000101
- 01111101
----------
  00001000 (8) places
0 01111101 00000000000000000000000 (original value)
0 01111110 10000000000000000000000 (shifted 1 place)
(note that the hidden bit is shifted into the msb of the mantissa)
0 01111111 01000000000000000000000 (shifted 2 places)
0 10000000 00100000000000000000000 (shifted 3 places)
0 10000001 00010000000000000000000 (shifted 4 places)
0 10000010 00001000000000000000000 (shifted 5 places)
0 10000011 00000100000000000000000 (shifted 6 places)
0 10000100 00000010000000000000000 (shifted 7 places)
0 10000101 00000001000000000000000 (shifted 8 places)
step 2: add (the hidden bit for the 100 shouldn't be forgotten)
0 10000101 1.10010000000000000000000 (100)
+ 0 10000101 0.00000001000000000000000 (.25)
---------------------------------------
0 10000101 1.10010001000000000000000
step 3: normalize the result (get the "hidden bit" to be a 1)
it already is for this example.
result is: 0 10000101 10010001000000000000000
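The worked example can be verified mechanically. This Python check (an illustration; the report's implementation target is VHDL) decodes the three bit patterns and confirms the sum.

```python
# Decoding the example's single-precision bit patterns to confirm the result.
import struct

def to_float(bits):
    """Interpret a 32-character binary string as an IEEE-754 single."""
    return struct.unpack(">f", struct.pack(">I", int(bits, 2)))[0]

a = to_float("0" + "01111101" + "00000000000000000000000")   # .25
b = to_float("0" + "10000101" + "10010000000000000000000")   # 100
r = to_float("0" + "10000101" + "10010001000000000000000")   # claimed sum
print(a, b, r)     # 0.25 100.0 100.25
print(a + b == r)  # True: 100.25 is exactly representable, no rounding needed
```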
CONCLUSION
For floating point number multiplication it is necessary to know about floating
point number addition, since while performing floating point multiplication we have
to perform addition anyhow to get the final result. The addition may generate a
carry, after which the result must be renormalized, and in renormalizing it may lose
precision bits. For that we keep three extra bits: guard, round and sticky. It is
important to know how addition occurs before studying multiplication. This report
has described how addition is done and what procedures are followed in order to
get the final result. We have now studied and gathered ideas about floating point
addition, which will be helpful while doing the multiplication part in our next
semester as our major project.
REFERENCES
Liang-Kai Wang and Michael J. Schulte, "Decimal Floating-Point Adder and Multifunction
Unit with Injection-Based Rounding," 18th IEEE Symposium on Computer Arithmetic
(ARITH'07), 2007.
G. Even and P. M. Seidel, "A comparison of three rounding algorithms for IEEE
floating-point multiplication," IEEE Transactions on Computers, 49(7), July 2000.
N. Burgess, "Renormalization rounding in IEEE floating-point operations using a flagged
prefix adder," IEEE Transactions on VLSI Systems, 13(2):266-277, Feb 2005.
"IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2008, IEEE Computer Society,
sponsored by the Microprocessor Standards Committee, 29 August 2008.
M. S. Schmookler and A. W. Weinberger, "High speed decimal addition," IEEE Transactions
on Computers, C-20:862-867, Aug 1971.