CS107 Lab 4: Floating point

Lab sessions Mon Feb 06 to Thu Feb 09

Lab written by Julie Zelenski

Learning goals

In this lab, you will:

practice using bitwise operators
experiment with float/double types and observe their operational behavior
explore the rounding inherent in floating point representation/arithmetic
dissect the internals of the floating point format

Find an open computer and somebody new to sit with. Introduce yourself and share your favorite vegetable.

Lab exercises

1. Get started.

Clone the starter repo using Mercurial

    hg clone /afs/ir/class/cs107/repos/lab4/shared lab4

This creates the lab4 directory which contains source files and a Makefile. Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.

2. C bitwise operators and use of bitmasks

The bitops.c program demonstrates use of the bitwise operators (& | ^ ~ << >>). Start the program under gdb and set a breakpoint on bitwise and run to the breakpoint. Now try out the display command, which sets up an auto-print expression to be evaluated each time you single-step. In gdb, use display/t val to show binary for val. Step from here using the gdb step command. Before each line is executed, work out for yourself what you expect the new value to be and then step to verify your understanding is correct.

Now let's try your hand at creating masks and using them in bitwise expressions. The bitmasks function contains comments marked FIXME, which are placeholders for you to fill in:

Initialize each bitmask variable to its correct bit pattern. Don't write as a hard-coded hex/decimal value, instead construct the mask as an expression of simple constants (0, 1, ...) and bitwise ops. Once you have created the masks, run the code through the given assertions to confirm you're correct.
Use those masks to complete the tasks listed in the comments further down in the function. Trace your code to verify its operation. Play around with constructing bitmasks (either on paper or in gdb) until you feel comfortable with bit patterns and bitwise manipulation. For example, start with an int N, set to a number between 0 and 7, and build an expression around N to create a mask with all 0 bits and 1 on bit at position N? a mask with all 1s except for a zero in position N? Least significant N bits on? Most significant N bits on?

3. Basic behavior of floating point types

Compile and run the floats program to answer the following questions about the basic behavior of floating point types:

What are the values of FLT_MAX? FLT_MIN? FLT_DIG? What are those values for type double?
What happens on overflow to a value greater than max? What about divide by zero? How does this compare to those same operations for integer types?
What are the exceptional float values? What kind of calculations produce an exceptional result? What happens when you use an exceptional value in further calculations?

4. Float bits

Now, let's go under the hood to look at the raw bits for a float. An IEEE 32-bit float is stored as:

    N EEEEEEEE SSSSSSSSSSSSSSSSSSSSSSS

where N is the sign bit (1 if negative), E is the 8-bit exponent (with a bias of 127), and S is the 23-bit significand, with an implicit leading "1." for values. The sign bit is the bit position of most significance, the final bit of the significand is the least. The floats program prints the raw bits for various float values. Run the program and use its output to visualize the bit patterns and answer the following questions. There is also a neat online tool to interactively view float bit patterns.

What is the bit pattern for FLT_MAX? FLT_MIN? the value 1.0? the value 1.75?
How can you identify an exceptional value by its bit pattern?
How can you identify a negative value by its bit pattern?
How does a float's bit pattern change when multiplied by 2?
Consider the float values that represent exact powers of 2. What do all their bit patterns have in common?

Combine bitwise manipulation with understanding of float bits and implement the function quadruple which is given a float value and returns value multiplied by 4. The function should operate without using any multiplication (arithmeic * operator), but instead modify the exponent bits of the float. The C bitwise operators cannot be directly applied to a float variable; these operators are only legal for int-family types. In order to manipulate the raw float bits, you must first copy those bits into an unsigned int, on that type you can play bitwise games. You don't want to assign the float to an int as that will do a type conversion, you want the raw bits to be copied unchanged. Our code shows two different ways to accomplish this:

    bits = *(unsigned int *)&f;     // (1) trick compiler to copy bits using cast    OR
    memcpy(&bits, &f, sizeof(f));   // (2) use raw memcpy to copy bits

Once you have the raw bits, use a bitmask to extract the exponent bits, increment them by 2, and merge in the modified exponent bits and convert back to float.

5. Limits of a finite representation (nearest representable floats)

As a finite representation trying to model values drawn from an infinite range, floating point is inherently a compromise. Although some values are exactly representable, many others will have to approximated. Let's drill down a bit to understand what precision you can expect. In this particular exercise, we to refer to the 8-bit mini-float (1 sign bit 4 exponent bits 3 significand bits) described in lecture/textbook. Although there is no such C type, it can be helpful to first work through how the limits apply to the minifloat on paper before trying to generalize to a larger bitwidth such as the 32-bit float. We will see that it is the number of bits dedicated to the significand that dictates the precision of what can be stored/computed.

Rounding during assignment from a constant. Many constants cannot be assigned to a float without loss of precision. Floating point representation is based on binary decimal. If a given constant does not terminate when expressed as a binary decimal, it will have to be approximated. Consider the constant 0.4. This is 4/10, or, in binary, 100/1010. Apply division to that binary fraction and you'll get a repeating binary decimal 0.01100. There is no terminating sequence of powers-of-two that sums exactly to 0.4, so no hope of exactly representing this value! Let's follow what happens during these assignments:
```
    minifloat mf = 0.4;     // not exact, mf rounded to nearest representable minifloat 
    float f = 0.4;          // not exact,  f rounded to nearest representable float
```

Work on paper for the minifloat assignmemt shown above. The constant 0.4 expressed in binary decimal is 0.0110011001100... which normalizes to 1.100110011... x 2^-2. We can only record 4 bits of that infinite significand (3 explicit bits + hidden leading 1). What happens to the excess bits? The default approach is round-to-nearest which adds 1 to the least significant bit if the truncated bits would have contributed more than one-half. This significand is rounded to 1.101. The minifloat mf holds the bit pattern 0 0101 101 This value is not 0.4 but the representable minifloat nearest to 0.4. What value is it? How close is it to what we wanted?

Now consider assigning the constant 0.4 to a float. The 23-bit significand allows retaining more of the repeated expansion than the minifloat, but excess bits still must be rounded. Run the nearest program under gdb and step through rounding function. After assigning f = 0.4, use the gdb command x/1wt &f to look at the stored bit pattern. Identify the sign bit, the 8-bit exponent, and the 23-bit significand. Can you see the repeating binary decimal in the bits? Now step through the printf. What value is the nearest representable float to 0.4? How close is it to what we wanted? The larger bitwidth of the float versus minifloat (or double versus float) tightens the resolution, but some rounding is inevitable for any non-terminating binary decimal.

Some constants are rounded because the binary decimal expansion (though terminating) requires more significand bits than are available in the data type. Consider these assignments:

        minifloat mf = 9.25;     // not exact, mf rounded to nearest representable minifloat 
        float f = 9.25;          // exact, 9.25 can be exactly represented as float

Trace the minifloat assignment shown above on paper. The value 9.25 as a normalized binary decimal is 1.00101 x 2³. The constant has a significand of 6 bits and a minifloat can only store 4 bits (3-bit significand + 1 hidden), so the excess 2 bits must be rounded. The minifloat mf holds the bit pattern 0 1010 001. This value is not exactly 9.25 but the representable minifloat nearest to it. What value is it?

6. Conversion between integer and float

Assigning from int to float or float to int performs a type conversion. The bits from one numeric representation are rearranged to convert the value into a different representation. The transformation between numeric types is implicit on assignment, no explicit typecast is required. Given a 4-byte type (such as int or float), it can hold one of ~4 billion (2³²) distinct bit patterns. This makes for 4-billion distinct representable ints and 4-billion representable floats, but these are not going to be the same set of 4 billion values. The set of representable integers includes every whole number from INT_MIN to INT_MAX. The set of representable floats spans a much larger magnitude from -FLT_MAX to +FLT_MAX with ever-widening gaps between neighbors. When converting from integer to float or vice versa, only those values that are representable in both camps are value-preserving during conversion. As an example, the value 2 can be exactly represented both as an int and as a float, thus converting 2.0 => 2 or 2 => 2.0 is value-preserving (although the int and float bit patterns for 2/2.0 are not the same). Now let's look into the cases that are not value-preserving during conversion. Experiment with the float_conversions function in the nearest program to explore.

Optional diversion of bugs in conversion. See this bug report of a recent discovery that certain values send Java and php into an infinite loop when trying to convert from long to double. This stuff is tricky even for the experts!

7. Floating point arithmetic

An important lesson that comes from studying floats is learning why/how to arrange calculations to minimize rounding and loss of precision. Read over the code in the arith.c program. Execute it and observe the result of the sum_doubles function to see another surprising outcome. A sequence of doubles is summed forward and again backwards. The same values are included in the sum, but the order of operations is different. The result is not the same, and not just epsilon different either! Has the law of associativity been repealed? Explain what is going on. The relative error for floating point multiplication and division is small, but addition and subtraction have some gotchas to watch out for. For example, both addition and subtraction cannot maintain precision when applied to operands of very different magnitude, as shown by an earlier lab exercise about rounding during addition. Two operands very close in magnitude can also cause a problem for subtraction or addition of values with opposite sign. If two floats have the same exponent and agree in the leading 20 bits of the significand, then only few bits of precision will remain after subtracting the two, and leaving just those few digits that were least significant too--- double bummer! This event is called catastrophic cancellation (the catastrophe is so-called because you lose a large amount of significant digits all at once, unlike roundoff error which occurs more gradually in the bits of low significance). A subtraction of nearly equal values ideally should be reformulated to avoid this cancellation. Consider the task of calculating the roots of a quadratic equation ax² + bx + c = 0. The classic formula to find roots is (-b ± sqrt(b²- 4ac))/2a. This computation works out for many values, but when b is large relative to a and c, the sqrt of the discriminant is nearly equal to b. For one of the roots, you subtract the two terms in the numerator, possibly losing much precision. In extreme cases, the result can go to zero, which is clearly wrong! Thankfully, some simple algebraic rearrangement can avoid this fate and turn the dreaded subtraction into an addition. Take a look at the code we give for the two find_roots functions. The first uses the classic formulation for both roots. The second function changes things up on the second root to avoid the subtraction. Read the comments and work through the algebra of the revised version. Run the program and play with the inputs to observe the results of the two versions. If a and c are both 1, roughly how big does b need to be before you will see a difference in results between the two? How big does b need to be before you get an erroneous zero root from the classic formula?

Fun and interesting further reading on floats:

The classic article What every computer scientist should know about floating-point arithmetic. (how's that for a title that makes you feel compelled to read it?)
A little bit of history on the 1994 Pentium floating-point bug that led to a half-billion dollar chip recall. Thanks to Cay Horstmann for this excerpt.
An excellent blog series on floating point intricacies written by Bruce Dawson.
Can you get rich by capitalizing on floating-point roundoff? More on the great salami slicing scam.

Check off with TA

Before you leave, be sure to submit your checkoff sheet and have lab TA come by and approve it so you will be properly credited for lab. If you don't finish all the exercises in the lab period, we encourage you to work through any remainder on your own. Double-check your progress with self check.

xkcd float comic

Lab 4: Floating point