5 Practical Floating Point System
Computers represent real numbers using a finite number of bits. This limitation introduces rounding errors and the need for standardized formats. The most widely adopted convention is the IEEE 754 floating-point standard, developed by the Institute of Electrical and Electronics Engineers.
5.1 Floating-Point Number Formats
A floating-point number consists of three main parts:
- Sign bit (\(s\)): indicates the sign of the number.
- Exponent (\(e\)): encodes the order of magnitude.
- Mantissa (fraction) (\(m\)): encodes the significant digits.
5.1.1 Precision levels
The IEEE 754 standard defines several levels of precision:
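- Single precision (32 bits): 1 sign, 8 exponent, 23 fraction bits.
- Double precision (64 bits): 1 sign, 11 exponent, 52 fraction bits.
- Extended precision (80 bits, the x87 format often exposed as long double): 1 sign, 15 exponent, 64 significand bits. Unlike the other two formats, the leading significand bit is stored explicitly, so only 63 of those bits are fraction bits.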
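As a rough illustration of the single-precision layout, the following sketch splits a value into its three fields using Python's standard `struct` module. The helper name `float32_fields` is hypothetical and chosen only for this example.

```python
import struct

def float32_fields(x):
    """Return (sign, biased_exponent, fraction) of x stored in single precision."""
    # Pack as a big-endian float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits (still biased)
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
```

Here \(-2.5 = -1.01_2 \times 2^1,\) so the stored fraction is \(0.01_2 \times 2^{23} = 2097152.\)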
5.2 Normalization and Rounding
In normalized form, the leading (leftmost) bit of the mantissa must be \(1.\) For example: \[1001.01_2 = 1.00101_2 \times 2^3\] Here the leading \(1\) is placed to the left of the binary point, and the exponent is adjusted accordingly.
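Many decimal fractions (e.g. \(0.3_{10}\)) have no finite binary expansion, so they must be approximated when stored. Two common strategies are:
Chopping: discarding the extra bits (rounding toward zero).
Rounding to nearest: choosing the closest representable number, i.e. adding 1 to the last stored bit when the discarded bits amount to more than half a unit in that position. On an exact tie, the neighbor whose last stored bit is 0 is chosen (round half to even). This is the default rounding mode in IEEE 754.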
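The difference between the two strategies can be sketched with exact rational arithmetic. This is a simplified model that keeps \(t\) bits after the binary point of a nonnegative number; it is not how hardware implements rounding, and the function names are hypothetical.

```python
from fractions import Fraction
import math

def chop(x, t):
    """Keep t fraction bits of x >= 0 and discard the rest."""
    scale = 2 ** t
    return Fraction(math.floor(x * scale), scale)

def round_nearest(x, t):
    """Round x to t fraction bits; Python's round() breaks exact ties to even."""
    scale = 2 ** t
    return Fraction(round(x * scale), scale)

x = Fraction(3, 10)
print(chop(x, 4))           # 1/4
print(round_nearest(x, 4))  # 5/16
```

With four fraction bits, chopping gives \(0.0100_2 = 0.25\) while rounding to nearest gives \(0.0101_2 = 0.3125,\) which is closer to \(0.3.\)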
5.3 Exponent Bias
The exponent must represent both positive and negative values. To achieve this, a bias is added: \[\text{Bias} = 2^{k-1} - 1\] where \(k\) is the number of exponent bits. Examples:
Single precision (\(k=8\)): Bias \(=127.\)
Double precision (\(k=11\)): Bias \(=1023.\)
Representing 263 in IEEE 754 (single precision)
Convert \(263\) to binary: \[263_{10} = 100000111_2\]
Normalize: \[100000111_2 = 1.00000111_2 \times 2^8\]
Biased exponent: \[e = 8 + 127 = 135 = 10000111_2\]
Fraction (mantissa): the bits after the binary point, \(00000111,\) padded with zeros to 23 bits.
Final IEEE 754 representation: \[0\ 10000111\ 00000111000000000000000\]
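The steps of this example can be checked with a short sketch. It recomputes the bias and biased exponent by hand, then compares against the bit pattern produced by Python's `struct` module; the helper `float32_bits` is a hypothetical name.

```python
import struct

def float32_bits(x):
    """Format x as 'sign exponent fraction' in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

k = 8                          # exponent bits in single precision
bias = 2 ** (k - 1) - 1        # 127
n = 263
exponent = n.bit_length() - 1  # 8, position of the leading 1
biased = exponent + bias       # 135

print(format(n, "b"))          # 100000111
print(format(biased, "08b"))   # 10000111
print(float32_bits(263.0))     # 0 10000111 00000111000000000000000
```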
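Representing 0.3 in binary
Multiply the fractional part by 2 repeatedly; the integer part of each product is the next bit: \[0.3 \times 2 = 0.6 \quad (0)\] \[0.6 \times 2 = 1.2 \quad (1)\] \[0.2 \times 2 = 0.4 \quad (0)\] \[0.4 \times 2 = 0.8 \quad (0)\] \[0.8 \times 2 = 1.6 \quad (1)\] \[0.6 \times 2 = 1.2 \quad (1)\] and so on. The fractional parts \(0.6, 0.2, 0.4, 0.8\) recur, so the bit group \(1001\) repeats forever: \[0.3_{10} = 0.01001100110011\ldots_2\] Any finite number of bits therefore stores \(0.3\) only approximately.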
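The repeated-doubling procedure is easy to sketch in code. The version below uses exact fractions so that the doubling itself introduces no rounding; `fraction_bits` is a hypothetical helper name.

```python
from fractions import Fraction

def fraction_bits(x, n):
    """First n binary digits of 0 <= x < 1, found by repeated doubling."""
    bits = []
    for _ in range(n):
        x *= 2
        bit = int(x >= 1)   # integer part of the product is the next bit
        bits.append(str(bit))
        x -= bit            # keep only the fractional part
    return "".join(bits)

print(fraction_bits(Fraction(3, 10), 14))  # 01001100110011
```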
Double precision to decimal
Given: \[x = 0\ 10000000011\ 10111001000100\ldots\]
Sign bit \(=0 \Rightarrow\) positive.
Exponent \(=10000000011_2 = 1027.\) Subtract bias (1023): \[e = 1027 - 1023 = 4\]
Significand (implied leading 1 followed by the fraction bits) \(=1.10111001000100\ldots_2 \approx 1.7229.\)
Value: \[x \approx 1.7229 \times 2^4 = 27.5664\]
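A minimal sketch of this decoding, using only the 14 fraction bits shown above (the elided bits are not guessed; they are simply left out, which is an assumption of this example). The function name `decode_fields` is hypothetical.

```python
def decode_fields(sign, biased_exponent, fraction_bits, bias):
    """Value of a normalized number; fraction_bits is a string of 0s and 1s."""
    # Bit i (counting from 0) after the binary point has weight 2^-(i+1).
    f = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(fraction_bits))
    return (-1) ** sign * (1 + f) * 2.0 ** (biased_exponent - bias)

x = decode_fields(0, int("10000000011", 2), "10111001000100", 1023)
print(x)  # 27.56640625
```

The small difference from \(27.5664\) comes from rounding the significand to \(1.7229\) in the hand calculation.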
5.3.1 General Form
In general, a normalized floating-point number is given by: \[x = (-1)^s \times (1 + f) \times 2^e\] where \(f\) is the fractional part and \(e\) is the unbiased exponent.
5.4 General Floating-Point System (summary)
A floating-point system is characterized by four parameters \((\beta, t, L, U)\):
\(\beta\): base of the system (e.g., 2 for binary).
\(t\): length of the fractional part (precision).
\(L\): lower bound of the exponent. No normalized number can have an exponent smaller than \(L.\) Denormalized numbers and \(0\) are encoded with the exponent field that would correspond to \(L-1.\)
\(U\): upper bound of the exponent. No normalized number can have an exponent larger than \(U.\) Special numbers (\(\pm\infty,\) NaN) are stored at exponent \(U+1.\)
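For double precision, these four parameters can be read off Python's `sys.float_info`, assuming the interpreter's `float` is an IEEE 754 double (true on mainstream platforms). Note that `sys.float_info` follows the C convention, which counts the implied bit in `mant_dig` and places the significand in \([0.5, 1),\) so the values are shifted by one to match the convention used here.

```python
import sys

info = sys.float_info
beta = info.radix       # base of the system
t = info.mant_dig - 1   # fraction bits (mant_dig also counts the implied 1)
L = info.min_exp - 1    # smallest exponent of a normalized number
U = info.max_exp - 1    # largest exponent of a finite number

print(beta, t, L, U)                             # 2 52 -1022 1023
print(2.0 ** L == info.min)                      # True
print((2 - 2.0 ** -t) * 2.0 ** U == info.max)    # True
```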
5.4.1 Normalized IEEE 754 Numbers
A normalized floating-point number has the form \[(-1)^s \cdot (1.m) \cdot 2^{e - \text{bias}},\] where
\(s\) is the sign bit.
\(m\) is the fractional part (mantissa).
\(e\) is the biased exponent.
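Because the leading 1 is implied rather than stored, the significand gains one bit of precision at no storage cost (24 bits in single precision, 53 in double).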
5.4.2 Normalized Numbers (Single Precision)
The extreme values and special encodings below use the single-precision layout:
Sign: 1 bit
Exponent: 8 bits (bias = 127)
Fraction: 23 bits
5.4.2.0.1 Largest Normalized Positive Number
\[(1.111\ldots1)_2 \times 2^{127} \approx 3.403 \times 10^{38}.\] If a result exceeds this, an overflow occurs.
5.4.2.0.2 Smallest Normalized Positive Number
\[1.0_2 \times 2^{-126} = 2^{-126} \approx 1.1755 \times 10^{-38}.\]
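Both limits can be checked by building the bit patterns directly. This is a sketch using the standard `struct` module; `from_bits32` is a hypothetical helper name.

```python
import struct

def from_bits32(bits):
    """Interpret a 32-bit unsigned integer as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

largest = from_bits32(0x7F7FFFFF)    # 0 11111110 111...1
smallest = from_bits32(0x00800000)   # 0 00000001 000...0

print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)   # True
print(smallest == 2.0 ** -126)                    # True
```

The largest exponent field used by a normalized number is \(254,\) not \(255,\) because the all-ones pattern is reserved for special values.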
5.4.2.0.3 Representation of Zero
\[\begin{aligned} +0 &: 0~00000000~000\ldots0 \\ -0 &: 1~00000000~000\ldots0 \end{aligned}\] Although their bit patterns differ, \(+0\) and \(-0\) compare as equal.
5.4.2.0.4 Representation of Infinity
\[\begin{aligned} +\infty &: 0~11111111~000\ldots0 \\ -\infty &: 1~11111111~000\ldots0 \end{aligned}\]
5.4.2.0.5 Not a Number (NaN)
IEEE 754 defines two types of NaNs:
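- Quiet NaN (QNaN): propagates through computations without raising an exception; it is the result of invalid operations such as \(0/0.\)
- Signaling NaN (SNaN): raises an invalid-operation exception when used as an operand.
In single precision, both are encoded with all exponent bits set (exponent field \(=255\)) and a nonzero fraction.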
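The special encodings for zero, infinity, and NaN can be explored with a short sketch. The NaN pattern `0x7FC00000` is just one possible choice with exponent field 255 and a nonzero fraction; `from_bits32` is a hypothetical helper name.

```python
import math
import struct

def from_bits32(bits):
    """Interpret a 32-bit unsigned integer as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

pos_zero = from_bits32(0x00000000)
neg_zero = from_bits32(0x80000000)
print(pos_zero == neg_zero)           # True
print(math.copysign(1.0, neg_zero))   # -1.0, the sign bit is still there

pos_inf = from_bits32(0x7F800000)     # exponent 255, fraction 0
neg_inf = from_bits32(0xFF800000)
print(pos_inf, neg_inf)               # inf -inf

nan = from_bits32(0x7FC00000)         # exponent 255, nonzero fraction
print(math.isnan(nan), nan == nan)    # True False
```

A NaN is the only value that does not compare equal to itself, which is why `math.isnan` is used for the test.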
5.4.3 Denormalized (Subnormal) Numbers
Denormalized numbers have the form \[(-1)^s \cdot (0.m) \cdot 2^{-126}.\]
No implied leading 1.
Exponent bits are all zero.
They allow gradual underflow: results below \(2^{-126}\) lose precision bit by bit instead of jumping straight to zero.
5.4.3.0.1 Largest Denormalized Number
\[(0.111\ldots1)_2 \times 2^{-126} \approx 0.99999988 \times 2^{-126}.\]
5.4.3.0.2 Smallest Denormalized Number
\[2^{-23} \times 2^{-126} = 2^{-149}.\] A nonzero result of smaller magnitude cannot be represented and underflows: it is rounded to \(0\) or to \(\pm 2^{-149}.\)
5.4.4 Ranges of Single Precision
Normalized: \([2^{-126}, (2-2^{-23})\cdot 2^{127}].\)
Denormalized: \([2^{-149}, (1-2^{-23})\cdot 2^{-126}].\)
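The endpoints of the denormalized range can be verified the same way as the normalized ones. This sketch assumes only the standard `struct` module; `from_bits32` is a hypothetical helper name.

```python
import struct

def from_bits32(bits):
    """Interpret a 32-bit unsigned integer as a single-precision value."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits))
    return x

largest_sub = from_bits32(0x007FFFFF)    # 0 00000000 111...1
smallest_sub = from_bits32(0x00000001)   # 0 00000000 000...01

print(largest_sub == (1 - 2.0 ** -23) * 2.0 ** -126)   # True
print(smallest_sub == 2.0 ** -149)                     # True
print(largest_sub < 2.0 ** -126)                       # True
```

The last line confirms that the two ranges do not overlap: the largest denormalized number lies just below the smallest normalized one.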