Lab sessions Tue May 08 to Sat May 12
Lab written by Julie Zelenski
Learning goals
In this lab, you will:
- experiment with float/double types and their operational behavior
- observe the limitations of floating point representation/arithmetic
- explore code that manipulates floats
Find an open computer and somebody new to sit with. Introduce yourself and offer each other encouragement on how to crush Friday's midterm!
Get started
Clone the repo by using the command below to create a lab5 directory containing the project files.
git clone /afs/ir/class/cs107/repos/lab5/shared lab5
Open the lab checkoff form.
Lab exercises
1) Observe float (15 minutes)
Open up the floats.c program and look through the operations in the float_behavior function. Put your heads together with your partner and make predictions about the expected results from executing this code. Compile and run the program to see how floats operate. (A small standalone sketch that reproduces a few of these experiments follows the question list below.)
- How do the min/max values of float/double range compare to the min/max of int/long?
- Note the lack of symmetry in the definition names: INT_MAX/FLT_MAX are the largest magnitude positive int/float, but INT_MIN is the largest magnitude negative int, whereas FLT_MIN is the smallest magnitude positive float (normalized). The largest magnitude negative float is -FLT_MAX.
- When an integer operation overflows, the result is wrapped back around the number circle. What is the result of a float operation that overflows?
- An integer divide by zero halts the program. What about floating point divide by zero?
- The float representation includes the exceptional float values infinity and nan. What kind of calculations produce an exceptional result? What happens when you use an exceptional value in further calculations?
- Do the observed behaviors seem sensible choices to you?
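As promised above, here is a minimal standalone sketch that reproduces a few of these experiments (assuming IEEE 754 float behavior, which essentially all modern hardware provides):

#include <stdio.h>
#include <float.h>

int main(void)
{
    float big = FLT_MAX * 2;               // overflow saturates to infinity, no wraparound
    float dz = 1.0f / 0.0f;                // float divide by zero yields inf, no halt
    float bad = 0.0f / 0.0f;               // 0/0 is one way to produce nan
    printf("%f %f %f\n", big, dz, bad);    // prints inf inf nan (sign of nan can vary)
    printf("%f\n", bad + 1);               // prints nan: exceptional values propagate
    return 0;
}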
2) Code study: fmax (20 minutes)
The standard library fmax function returns the larger of its two arguments. Review the musl_fmax implementation:
double musl_fmax(double x, double y)
{
1    if (isnan(x))
2        return y;
3    if (isnan(y))
4        return x;
5    /* handle signed zeros, see C99 Annex F.9.9.2 */
6    if (signbit(x) != signbit(y))
7        return signbit(x) ? y : x;
8    return x < y ? y : x;
}
Shouldn't max be just a simple one-liner? What's all this about? Talk it over with your partner:
- Review the man page for fmax to see what it says about exceptional inputs. How is NaN to be handled by the function? There is no mention of INFINITY, but how do you expect it to be treated based on what you observed in part 1?
- Lines 1-4 make a special case for NaN. If Lines 1-4 were removed from the function, which inputs would now get different results? Try your test case with the code.c program (a driver sketch follows this list). (Note that NaN values have a sign bit, oddly enough, meaning both negative and positive NaNs are possible.)
- Lines 6-7 seem redundant with Line 8. The comment hints at the motivation for making a special case. Read the man page for signbit. How does accessing the sign bit differ from testing x < 0.0? If Lines 6-7 were removed, which inputs to fmax would now get different results?
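Here is a sketch of a tiny driver you might adapt for these experiments (assuming musl_fmax is defined above it in the same file; the inputs shown are just a starting point):

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%f\n", musl_fmax(NAN, 3.0));     // prints 3.000000: the NaN argument is passed over
    printf("%f\n", musl_fmax(-0.0, 0.0));    // prints 0.000000: the positively signed zero wins
    return 0;
}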
The exceptional values are an essential part of the float family, but they do throw a wrench into things. Any float operation that aspires to be robust to all inputs will likely need special consideration of exceptional values. Ignoring or mishandling those values has a hand in many a bug. (my amusing bug story)
3) Nearest representable (45 minutes)
As a finite representation trying to model values from an infinite set, floating point is inherently a compromise. No matter how the bits are divided between range (exponent) and precision (significand), there are only so many values that can be exactly represented; everything else must be approximated.
Let's examine some of the ways approximation/rounding is introduced.
Rounding during assignment from a constant. A float's value is stored as its binary fraction. A value is exactly representable only if it has a terminating binary fraction that fits into the available precision.
float f = 12.5; // has exact representation
float g = 0.1; // must be rounded to nearest representable
float h = 1e12; // also rounded
- Run our dissect program and enter the above values as inputs to see how each is stored as a float.
- What is the nearest representable value to 0.1? to 0.2? to 0.5? How far off is each from what was wanted?
- Note that this rounding applies across the entire range of magnitudes, not just to small values. The relative error is capped, but the absolute difference can be a bit surprising. How far off is the nearest representable value to 1e12?
- Rounding on assignment can be hard to get your head around. You have assigned an ordinary-looking decimal constant to a float, and it's easy to assume it will be recorded exactly. But hidden in there is a change of base! A value expressed in decimal has a high likelihood of not converting to an exact binary float. Unless you are sure otherwise, assume that assigning a constant to a float produces an approximation. (A compilable sketch follows this list.)
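One way to see this rounding directly is the minimal sketch below; printing more digits than a float can honor exposes the value actually stored:

#include <stdio.h>

int main(void)
{
    float g = 0.1;             // 0.1 has no terminating binary fraction
    printf("%.20f\n", g);      // prints 0.10000000149011611938, the nearest representable float
    return 0;
}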
Rounding during addition/subtraction. Even if we start with exactly representable values, floating point operations may introduce rounding into the result. To add/subtract, the binary points must first be aligned. The significand bits of the smaller magnitude value will be right-shifted so as to match the exponent of the larger. Given a large difference in exponents, the low bits can be shifted away and lost entirely.
float fa = 1e8, fb = 513.25; // both exactly representable
float sum = fa + fb;         // sum not exact, rounded
- Use the dissect program to verify that 1e8 and 513.25 are each representable, then enter the expression 1e8 + 513.25 to see how the sum is rounded. How far off is it from what was wanted?
- Try entering other expressions to observe when the sum is exact and when rounded. Try some subtraction, too, to confirm it is similarly affected.
- Rounding during addition is responsible for the oddity of floating point values which appear to be their own successor. There are positive float values x for which x+1 == x, that's wacky! Play with dissect to find such a value that is large enough to behave this way (a compilable sketch follows this list). Challenge problem: Can you find the smallest such positive float x for which x+1 == x? (Hint: it will be an exact power of 2.) What is the result of the expression x-1 for that x?
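The own-successor oddity is easy to demonstrate in C. A minimal sketch, reusing 1e8 from above (neighboring floats at that magnitude are 8 apart, so adding 1 rounds away entirely):

#include <stdio.h>

int main(void)
{
    float x = 1e8;                 // exactly representable as a float
    printf("%d\n", x + 1 == x);    // prints 1 (true): x appears to be its own successor
    return 0;
}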
Rounding during multiplication/division. Floating point multiplication works by separately multiplying significands and adding exponents, and normalizing the result. If the result of multiplying two N-bit significands has more than N bits, rounding will be needed.
float fc = 8388607, fd = 5; // both exactly representable
float product = fc * fd;    // product not exact, rounded
- Use the dissect program to verify that the starting values are each representable, then enter the multiply expression to see how the product is rounded. How far off is it from what was wanted?
- Can you explain why the integer multiply for the same values produces the correct answer with no rounding?
- If you start with a value that is approximated, the rounding of a subsequent operation can sometimes "cancel it out" and land on the exact result (try the expression .4 * 2.5). More often, once an approximation has entered the calculation it continues to propagate (.4 * .25). Use the dissect program to explore various expressions to see this. (A compilable version of the multiply example follows this list.)
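Here is the multiply example as a minimal sketch, comparing the float result against the exact integer multiply (assuming a 32-bit int, which is wide enough to hold the true product):

#include <stdio.h>

int main(void)
{
    float fc = 8388607, fd = 5;     // both exactly representable
    float product = fc * fd;        // true product 41943035 needs 26 significand bits, float has 24
    printf("%.1f\n", product);      // prints 41943036.0, the nearest representable float
    printf("%d\n", 8388607 * 5);    // prints 41943035, the integer multiply is exact
    return 0;
}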
Conversions. Assigning from int to float or float to int performs a type conversion. The bits from one numeric representation are rearranged to convert the value into a different representation. The transformation between numeric types is implicit on assignment, no explicit typecast is required. Given a 4-byte type (such as int or float), it can hold one of ~4 billion (2^32) distinct bit patterns. This makes for 4 billion distinct representable ints and 4 billion representable floats, but these are not going to be the same set of 4 billion values. The set of representable integers includes every whole number from INT_MIN to INT_MAX. The set of representable floats spans a much larger magnitude from -FLT_MAX to +FLT_MAX with ever-widening gaps between neighbors. When converting from integer to float or vice versa, only those values that are representable in both camps are value-preserving during conversion. As an example, the value 2 can be exactly represented both as an int and as a float, thus converting 2.0 => 2 or 2 => 2.0 is value-preserving (although the int and float bit patterns for 2/2.0 are not the same). Now let's look into the cases that are not value-preserving during conversion.
- First consider converting from float => int. Any float with a fractional component is truncated to its integer value, discarding the fraction. You are familiar with this conversion and it is unsurprising that it is not value-preserving for fractional numbers. However, there are whole-valued float values with no fractional component (float values F.0 for some values of F) that do not have a corresponding integer representation and thus do not retain their value across conversion. What are those float values? What happens when you try to assign one of those floats to an int?
- Now consider converting int => float. At first blush, it seems any integer x has an obvious float equivalent x.0 that is value-preserving. But given we just showed some of the 4 billion floats don't have a match in the 4 billion ints, applying the pigeonhole principle implies that some ints are guaranteed not to have a match in the floats. What are those ints that don't have a corresponding float representation? What happens when you try to assign one of those ints to a float?
- Consider converting int => double. A double is basically a bigger version of float that apportions 11 bits for the exponent and 52 for the significand. Every integer value is value-preserving when converted to a double. Why is this guaranteed? (A sketch demonstrating these conversions follows this list.)
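Here is a minimal sketch demonstrating these conversions; the choice of 16777217 (2^24 + 1) is one example of an int that doesn't survive the round trip through float, left to you to generalize:

#include <stdio.h>

int main(void)
{
    int i = 16777217;          // 2^24 + 1
    float f = i;               // implicit int => float conversion must round
    double d = i;              // double's 52 significand bits hold any 32-bit int
    printf("%d\n", (int)f);    // prints 16777216: value not preserved
    printf("%d\n", (int)d);    // prints 16777217: value preserved
    return 0;
}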
4) Code study: hypotenuse (30 minutes)
The C math library function hypot(x, y) computes the length of the hypotenuse of a right triangle with sides of length x and y. It seems like a simple application of Pythagoras would suffice, but the challenge comes in meeting the requirement that the "calculation is performed without undue overflow or underflow during the intermediate steps." Consider the two implementations below:
// NOTE: both hypot implementations ignore exceptional inputs as a simplification
double naive_hypot(double x, double y)
{
    return sqrt(x*x + y*y);
}

double better_hypot(double x, double y)
{
    x = fabs(x);
    y = fabs(y);
    double max = fmax(x, y);
    double min = fmin(x, y);
    double r = min / max;
    return max*sqrt(1 + r*r);
}
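You can exercise both versions with a small driver like this sketch (assuming the two implementations above are in the same file; remember to link with -lm for the math library):

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%g\n", naive_hypot(3, 4));     // prints 5
    printf("%g\n", better_hypot(3, 4));    // prints 5; now probe more extreme inputs
    return 0;
}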
Here are some issues to tackle with your partner:
- The naive version exhibits problems both with overflow and underflow.
  - For what inputs does naive_hypot(x,y) cause overflow? What result is returned if the calculation overflows?
  - What inputs cause underflow? What result is returned on underflow?
  - Edit the program in code.c and try it!
- Although it may not look like it at first glance, the better version is simply a rearranged version of the original. Work out on paper why the rearrangement is mathematically valid.
- What is the expected range of r in the better version? Is it possible for r*r to overflow? To underflow? Why or why not?
- Is the better version capable of correctly handling the inputs you earlier identified as problematic for the naive version?
- Are there still inputs which cause overflow or underflow for the better version? Explain.
Check off with TA
Before you leave, be sure to submit your checkoff sheet and have a lab TA come by and approve it so you will be properly credited for lab. If you don't finish all the exercises in the lab period, we encourage you to work through the remainder on your own. Double-check your progress with the self check.
Further reading
Some fun and interesting articles on floats to check out!
- The classic article What every computer scientist should know about floating-point arithmetic. (how's that for a title that makes you feel compelled to read it?)
- A little bit of history on the 1994 Pentium floating-point bug that led to a half-billion dollar chip recall. Thanks to Cay Horstmann for this excerpt.
- An excellent blog series on floating point intricacies written by Bruce Dawson.
- Can you get rich by capitalizing on floating-point roundoff? More on the great salami slicing scam.
