Lab 4: Floating point

Lab sessions Mon Apr 25 to Thu Apr 28

Lab written by Julie Zelenski

Learning goals

In this lab, you will:

  1. experiment with float/double types and observe their operational behavior
  2. explore the rounding inherent in floating point representation/arithmetic
  3. dissect the internals of the floating point format
  4. get more practice using bitwise operators

Find an open computer and somebody new to sit with. Introduce yourself and share whether you are a CISC or RISC processor when using your favorite text editor.

Lab exercises

  1. Get started. Clone the starter repo using Mercurial

    hg clone /afs/ir/class/cs107/repos/lab4/shared lab4
    

    This creates the lab4 directory which contains source files and a Makefile. Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.

  2. Basic behavior of floating point types. Compile and run the floats program to answer the following questions about the basic behavior of floating point types:

    • What are the values of FLT_MAX? FLT_MIN? FLT_DIG? What are those values for type double?
    • What happens on overflow to a value greater than max? What about divide by zero? How does this compare to those same operations for integer types?
    • What are the exceptional float values? What kind of calculations produce an exceptional result? What happens when you use an exceptional value in further calculations?

  3. Float bits. Now, let's go under the hood to look at the raw bits for a float. An IEEE 32-bit float is stored as:

    N EEEEEEEE SSSSSSSSSSSSSSSSSSSSSSS
    

    where N is the sign bit (1 if negative), E is the 8-bit exponent (with a bias of 127), and S is the 23-bit significand, with an implicit leading "1." for normalized values. The sign bit is the bit position of most significance; the final bit of the significand is the least. The floats program prints the raw bits for various float values. Run the program and use its output to visualize the bit patterns and answer the following questions. There is also a neat online tool to interactively view float bit patterns.

    • What is the bit pattern for FLT_MAX? FLT_MIN? the smallest non-zero denormalized value? the smallest normalized value? the value 1.0? the value 1.75?
    • How can you identify an exceptional value by its bit pattern?
    • How can you identify a negative value by its bit pattern?
    • How does a normalized float's bit pattern change when multiplied by 2? How does a denormalized float's bit pattern change when multiplied by 2?
    • Consider the normalized float values that represent exact powers of 2. What do all their bit patterns have in common?

    Combine last week's efforts to master bitwise manipulation with this week's understanding of float bits and implement the function quadruple, which is given a float value and returns that value multiplied by 4. The function should operate without using any multiplication; instead it modifies the exponent bits of the float. The C bitwise operators cannot be applied directly to a float variable; these operators are legal only for int-family types. To manipulate the raw float bits, you must first copy those bits into an unsigned int; on that type you can play bitwise games. You don't want to assign the float to an int, as that would perform a type conversion; you want the raw bits copied unchanged. Our code shows two different ways to accomplish this:

    bits = *(unsigned int *)&f;     // (1) trick compiler to copy bits using cast    OR
    memcpy(&bits, &f, sizeof(f));   // (2) use raw memcpy to copy bits
    

    Once you have the raw bits, use a bitmask to extract the exponent bits, increment the exponent by 2 (multiplying by 4 raises the power of two by 2), merge the modified exponent bits back in, and convert the result back to float.

  4. Limits of a finite representation (nearest representable floats). As a finite representation trying to model values drawn from an infinite range, floating point is inherently a compromise. Although some values are exactly representable, many others have to be approximated. Let's drill down a bit to understand what precision you can expect. In this exercise, we refer to the 8-bit minifloat (1 sign bit, 4 exponent bits, 3 significand bits) described in lecture/textbook. Although there is no such C type, it can be helpful to first work through how the limits apply to the minifloat on paper before trying to generalize to a larger bitwidth such as the 32-bit float. We will see that it is the number of bits dedicated to the significand that dictates the precision of what can be stored/computed.

    • Rounding during assignment from a constant. Many constants cannot be assigned to a float without loss of precision. Floating point representation is based on binary decimals. If a given constant does not terminate when expressed as a binary decimal, it will have to be approximated. Consider the constant 0.4. This is 4/10, or, in binary, 100/1010. Apply division to that binary fraction and you'll get the repeating binary decimal 0.011001100110... (the block 0110 repeats forever). There is no terminating sequence of powers of two that sums exactly to 0.4, so no hope of exactly representing this value! Let's follow what happens during these assignments:

      minifloat mf = 0.4;     // not exact, mf rounded to nearest representable minifloat 
      float f = 0.4;          // not exact,  f rounded to nearest representable float
      

      Work on paper for the minifloat assignment shown above. The constant 0.4 expressed in binary decimal is 0.0110011001100... which normalizes to 1.100110011... x 2^-2. We can only record 4 bits of that infinite significand (3 explicit bits + hidden leading 1). What happens to the excess bits? The default approach is round-to-nearest, which adds 1 to the least significant bit if the truncated bits would have contributed more than one-half. This significand is rounded to 1.101. The minifloat mf holds the bit pattern 0 0101 101. This value is not 0.4 but the representable minifloat nearest to 0.4. What value is it? How close is it to what we wanted?

      Now consider assigning the constant 0.4 to a float. The 23-bit significand allows retaining more of the repeated expansion than the minifloat, but excess bits still must be rounded. Run the nearest program under gdb and step through the rounding function. After assigning f = 0.4, use the gdb command x/1wt &f to look at the stored bit pattern. Identify the sign bit, the 8-bit exponent, and the 23-bit significand. Can you see the repeating binary decimal in the bits? Now step through the printf. What value is the nearest representable float to 0.4? How close is it to what we wanted? The larger bitwidth of the float versus minifloat (or double versus float) tightens the resolution, but some rounding is inevitable for any non-terminating binary decimal.

      Some constants are rounded because the binary decimal expansion (though terminating) requires more significand bits than are available in the data type. Consider these assignments:

      minifloat mf = 9.25;     // not exact, mf rounded to nearest representable minifloat 
      float f = 9.25;          // exact, 9.25 can be exactly represented as float
      

      Trace the minifloat assignment shown above on paper. The value 9.25 as a normalized binary decimal is 1.00101 x 2^3. The constant has a significand of 6 bits and a minifloat can only store 4 bits (3-bit significand + 1 hidden), so the excess 2 bits must be rounded. The minifloat mf holds the bit pattern 0 1010 001. This value is not exactly 9.25 but the representable minifloat nearest to it. What value is it?

      The constant 9.25 can be exactly represented as a float, as the 6 bits fit with room to spare in a 23-bit significand. Step through this float assignment in gdb and examine the float bits to identify the sign, exponent, and significand bits. Execute the printf and see that the exact value is printed. Other constants such as 24.25 or 1000.25 can also be represented with no loss of precision, but this is not true for every constant N.25; for example, the float value 1000000000.25 is rounded to a cool billion when assigned to float. What is the smallest value of N such that N.25 has to be rounded on assignment to float?

    • Rounding during addition (and subtraction). Even if we start with exactly representable values, floating point operations on those values may introduce rounding into the result. To add/subtract two floating point values, the decimal points must first be aligned. For this, the operand with the smaller exponent is adjusted to match the larger. The significands are then combined and result is re-normalized.

      minifloat mfa = 9, mfb = 1.25;  // both constants exactly representable as minifloat
      minifloat sum = mfa + mfb;      // sum not exactly representable, rounded to nearest representable minifloat
      float fa = 10000000, fb = 1.25; // both constants exactly representable as float
      float sum = fa + fb;            // sum not exactly representable, rounded to nearest representable float
      

      Work through the minifloat addition shown above on paper. The normalized binary decimal for 9 is 1.001 x 2^3; for 1.25 it is 1.01 x 2^0. Both of these are exactly representable as minifloat. To sum them, we temporarily denormalize the smaller magnitude value to align its decimal point with the larger, raising its exponent to match and counteracting by right-shifting the significand bits the same number of positions. 1.25 becomes 0.00101 x 2^3. This is a 6-bit significand with room for only 4, so the excess bits are rounded off. Trace the addition from here. The minifloat sum stores the bit pattern 0 1010 010. What value is this? It is the nearest representable minifloat to the exact sum. How far off is it?

      After you've worked through the mechanics on the minifloat addition example, step through the code for the float example shown above in gdb and observe the rounded result. How far is it off from what was expected?

      Rounding during addition is responsible for the oddity of floating point values which appear to be their own successor. There are positive float values x for which x+1.0 == x due to rounding during addition. First, come up with such a number by working with your partner on paper, then write code or evaluate expressions in gdb to verify your answer. What is the smallest such positive float x for which x+1.0 == x?

    • Rounding during multiplication (and division). Floating point multiplication works by separately multiplying significands and adding exponents, and normalizing the result. If the result of multiplying two N-bit significands has more than N bits, rounding will be needed.

      minifloat mfc = 7, mfd = 5;     // both constants exactly representable as minifloat
      minifloat product = mfc * mfd;  // product not exactly representable, rounded to nearest representable minifloat
      float fc = 8388607, fd = 5;     // both constants exactly representable as float
      float product = fc * fd;        // product not exactly representable, rounded to nearest representable float
      

      Work through the minifloat multiplication shown above on paper. The normalized binary decimal for 7 is 1.11 x 2^2; for 5 it is 1.01 x 2^2. Both of these are exactly representable as minifloat. Multiplying the significands produces 10.0011, and adding their exponents gives 2^4. Normalizing produces 1.00011 x 2^5. The resulting 6-bit significand has to be rounded to 4 bits. The minifloat product has the bit pattern 0 1100 001. What value is this? How far off is it from the exact product?

      After you've worked through the mechanics on the minifloat multiplication example, step through the code for the float example shown above in gdb and observe the rounded result. How far is it off from what was expected? Can you explain why the integer multiply of the same values is able to produce the correct answer with no rounding?

  5. Conversion between integer and float. Assigning from int to float or float to int performs a type conversion. The bits from one numeric representation are rearranged to convert the value into a different representation. The transformation between numeric types is implicit on assignment; no explicit typecast is required. Given a 4-byte type (such as int or float), it can hold one of ~4 billion (2^32) distinct bit patterns. This makes for 4 billion distinct representable ints and 4 billion representable floats, but these are not the same set of 4 billion values. The set of representable integers includes every whole number from INT_MIN to INT_MAX. The set of representable floats spans a much larger magnitude from -FLT_MAX to +FLT_MAX with ever-widening gaps between neighbors. When converting from integer to float or vice versa, only those values that are representable in both camps survive conversion with value intact. As an example, the value 2 can be exactly represented both as an int and as a float, thus converting 2.0 => 2 or 2 => 2.0 is value-preserving (although the int and float bit patterns for 2/2.0 are not the same). Now let's look into the cases that are not value-preserving during conversion. Experiment with the float_conversions function in the nearest program to explore.

    • First consider converting from float => int. Any float with a fractional component is truncated to its integer value, discarding the fraction. You are familiar with this conversion, and it is unsurprisingly not value-preserving for fractional numbers. However, there are whole-valued floats with no fractional component (float values F.0 for some values of F) that do not have a corresponding integer representation and thus do not retain their value across conversion. What are those float values? What happens when you try to assign one of those floats to an int?
    • Now consider converting int => float. At first blush, it seems any integer x has an obvious float equivalent x.0 that is value-preserving. But given we just showed some of the 4 billion floats don't have a match in the 4 billion ints, applying the pigeonhole principle implies that some ints are guaranteed not to have a match in the floats. What are those ints that don't have a corresponding float representation? What happens when you try to assign one of those ints to a float?
    • Consider converting int => double. A double is basically a bigger version of float that apportions 11 bits for the exponent and 52 for the significand. Every integer value is value-preserving when converted to a double. Why is this guaranteed?
    • Optional diversion of bugs in conversion. See this bug report of a recent discovery that certain values send Java and php into an infinite loop when trying to convert from long to double. This stuff is tricky even for the experts!

  6. Floating point arithmetic. An important lesson that comes from studying floats is learning why/how to arrange calculations to minimize rounding and loss of precision. Read over the code in the arith.c program. Execute it and observe the result of the sum_doubles function to see another surprising outcome. A sequence of doubles is summed forward and again backwards. The same values are included in the sum, but the order of operations differs. The results are not the same, and not just epsilon different either! Has the law of associativity been repealed? Explain what is going on.

    The relative error for floating point multiplication and division is small, but addition and subtraction have some gotchas to watch out for. As the earlier exercise on rounding during addition showed, addition and subtraction cannot maintain precision when applied to operands of very different magnitude. Two operands very close in magnitude can also cause a problem for subtraction (or for addition of values with opposite sign). If two floats have the same exponent and agree in the leading 20 bits of the significand, only a few bits of precision remain after subtracting the two, and those surviving bits were the least significant ones to begin with --- double bummer! This event is called catastrophic cancellation (the catastrophe is that you lose a large number of significant digits all at once, unlike roundoff error, which accumulates gradually in the bits of low significance). A subtraction of nearly equal values ideally should be reformulated to avoid the cancellation.

    Consider the task of calculating the roots of a quadratic equation ax^2 + bx + c = 0. The classic formula to find the roots is (-b ± sqrt(b^2 - 4ac))/2a. This computation works out for many values, but when b is large relative to a and c, the sqrt of the discriminant is nearly equal to b. For one of the roots, you subtract these two nearly equal terms in the numerator, possibly losing much precision. In extreme cases, the result can go to zero, which is clearly wrong! Thankfully, some simple algebraic rearrangement can avoid this fate and turn the dreaded subtraction into an addition. Take a look at the code we give for the two find_roots functions. The first uses the classic formulation for both roots. The second function changes things up on the second root to avoid the subtraction. Read the comments and work through the algebra of the revised version. Run the program and play with the inputs to observe the results of the two versions. If a and c are both 1, roughly how big does b need to be before you will see a difference in results between the two? How big does b need to be before you get an erroneous zero root from the classic formula?

Fun and interesting further reading on floats:

Check off with TA

Before you leave, be sure to submit your checkoff sheet and have the lab TA come by and approve it so you will be properly credited for the lab. If you don't finish all the exercises in the lab period, we encourage you to work through the remainder on your own. Double-check your progress with the self check.

xkcd float comic