Lab 5: Floating Point

Lab sessions Tue May 07 to Thu May 09

Lab written by Julie Zelenski, with modifications by Nick Troccoli

Learning Goals

During this lab, you will:

  1. experiment with float/double types and their operational behavior
  2. observe the limitations of floating point representation/arithmetic
  3. explore code that manipulates floats

Find an open computer and somebody new to sit with. Introduce yourself and offer each other encouragement on how to crush Friday's midterm!


One of this week's labs recorded the first portion of lab where they reviewed the week's important concepts. You can view this video below.

Get Started

Clone the repo by using the command below to create a lab5 directory containing the project files.

    git clone /afs/ir/class/cs107/repos/lab5/shared lab5

Open the lab checkoff form.


1) Float Behavior (15 minutes)

Open up the floats.c program and look through the operations in the float_behavior function. Put your head together with your partner and make predictions about the expected results from executing this code. Compile and run the program to see how floats operate.

  • How do the min/max values of the float/double range compare to the min/max of int/long?
    • Note the lack of symmetry in the definition names: INT_MAX/FLT_MAX are largest magnitude positive int/float, but INT_MIN is largest magnitude negative int whereas FLT_MIN is smallest magnitude positive float (normalized). The largest magnitude negative float is -FLT_MAX,
  • When an integer operation overflows, the result is wrapped back around the number circle. What is the result of a float operation that overflows?
  • An integer divide by zero halts the program. What about floating point divide by zero?
  • The float representation includes the exceptional float values infinity and nan. What kind of calculations produce an exceptional result? What happens when you use an exceptional value in further calculations?
  • Do the observed behaviors seem like reasonable choices to you?

2) Code Study: fmax (20 minutes)

The standard library fmax function returns the larger of its two arguments. Review the musl_fmax implementation:

   double musl_fmax(double x, double y) {
 1     if (isnan(x)) {
 2        return y;
 3     }
 4     if (isnan(y)) {
 5        return x;
 6     }
 8     /* handle signed zeros, see C99 Annex F.9.9.2 */
 9     if (signbit(x) != signbit(y)) {
10         return signbit(x) ? y : x;
11     }
12     return x < y ? y : x;

Shouldn't this be just a simple one-liner? What's all this about? Talk it over with your partner:

  • Review the man page for fmax to see what it says about exceptional inputs. How is NaN to be handled by the function? There is no mention of INFINITY but how do you expect it to be treated based on what you observed in part 1?
  • Lines 1-6 make a special case for NaN. If those lines were removed from the function, which inputs would now get different results? Try your test case with the fmax.c program. This program can take two command line arguments and prints out the max of them using the above function. (Note that NaN values have a signbit, oddly enough, meaning both negative and positive NaNs are possible. If you're curious as to why there are multiple possible bit representations for NaN, you can read more about the design here!).
  • Lines 9-11 seem redundant with line 12. The comment hints at the motivation for making a special case. Read the man page for signbit. How does accessing the sign bit differ from testing x < 0.0? If those lines were removed, which inputs to fmax would now get different results?

The exceptional values are an essential part of the float family, but they do throw a wrench into things. Any float operation that aspires to be robust to all inputs will likely need special consideration of exceptional values. Ignoring or mishandling those values has a hand in many a bug. Once you've understood why these cases are necessary, exercise your understanding in the next section.

3) Float Bug: Acing Every Test (15 minutes)

This is a true story of code encountered at a well-known online educational platform (name withheld). Specifically, they had a function to check correct answers in their quizzes - we've provided an implementation resembling the original code below, and in test.c. However, their failure to account for exceptional float values created an embarrassing vulnerability in the implementation. The function worked by scoring quiz items with a numeric response using a "tolerance" feature that accepted all responses within a tolerance of the true answer as correct. For example, if the true answer was 200.7 and the tolerance was 0.01, responses within 1% of 200.7 would be scored as correct.

The code to score the student's quiz response looked something like this:

bool is_correct(const char *response, float target) {
    /* Is student response within acceptable tolerance of target?

        - response : student's answer (string)
        - target  : instructor result (number)
    const float TOLERANCE = 0.01;

    float val = strtof(response, NULL);
    float at_most = TOLERANCE * fmax(fabs(val), fabs(target));
    return fabs(val - target) <= at_most;

It turns out that there is a response that enabled users to ace every test. What is the magic answer? You can test your hypothesis by running the test.c program. It takes in one command-line argument and uses that as your "answer" to various hardcoded questions. Try different answers to see what you find!

4) Nearest Representable (35 minutes)

As a finite representation trying to model values from an infinite set, floating point is inherently a compromise. No matter whether we devote all the bits to range (exponent) or all the bits to precision (fraction/"significand"), there are only so many values that can be exactly represented; everything else must be approximated.

Let's examine some of the ways approximation/rounding is introduced.

Rounding during assignment from a constant. A float's value is stored as its binary fraction. A value is exactly representable only if it has a terminating binary fraction that fits into the available precision.

    float f = 12.5;  // has exact representation
    float g = 0.1;   // must be rounded to nearest representable
    float h = 1e12;  // also rounded
  • Run our dissect program and enter the above values as inputs to see how each is stored as a float.
  • What is the nearest representable value to 0.1? to 0.2? to 0.5? How far off is each from what was wanted?
  • Note that this rounding applies across the entire range of magnitudes, not just to small values. The relative error is capped, but the absolute difference can be a bit surprising. How far off is the nearest representable value to 1e12?
  • Rounding on assignment can be hard to get your head around. You have assigned an ordinary looking decimal constant to a float and it's easy to assume it will be recorded exactly. But hidden in there is a change in base! A value expressed in decimal has a high likelihood of not converting to an exact binary float. Unless you are sure otherwise, assume assigning a constant to a float to be an approximation.

Rounding during addition/subtraction. Even if we start with exactly representable values, floating point operations may introduce rounding into the result. To add/subtract, the binary points must first be aligned. The significand bits of the smaller magnitude value will be right-shifted so as to match the exponent of the larger. Given a large difference in exponents, the low bits can be shifted away and lost entirely.

    float fa = 1e8, fb = 513.25;   // both exactly representable 
    float sum = fa + fb;           // sum not exact, rounded
  • Use the dissect program to verify that 1e8 and 513.25 are each representable, then enter the expression 1e8 + 513.25 to see how the sum is rounded. How far off is it from what was wanted?
  • Try entering other expressions to observe when the sum is exact and when rounded. Try some subtraction, too, to confirm it is similarly affected.
  • Rounding during addition is responsible for the oddity of floating point values which appear to be their own successor. There are positive float values x for which x+1 == x, that's wacky! Play with dissect to find such a value that is large enough to behave this way. Challenge problem: Can you find the smallest such positive float x for which x+1 == x? (Hint: it will be an exact power of 2) What is the result of the expression x-1 for that x?

Rounding during multiplication/division. Floating point multiplication works by separately multiplying significands and adding exponents, and normalizing the result. If the result of multiplying two N-bit significands has more than N bits, rounding will be needed.

    float fc = 8388607, fd = 5;     // both exact representable
    float product = fc * fd;        // product not exact, rounded
  • Use the dissect program to verify that the starting values are each representable, then enter the multiplication expression to see how the sum is rounded. How far off is it from what was wanted?
    • Can you explain why the integer multiply for the same values is capable of produces the correct answer with no rounding?
  • If you start with a value that is approximated, the rounding of a subsequent operation can sometimes "cancel it out" and land on the exact result (try the expression .4 * 2.5). More often once an approximation has entered the calculation it continues to propagate (.4 * .25). Use the dissect program to explore various expressions to see this.

Conversions. Assigning from int to float or float to int performs a type conversion. The bits from one numeric representation are rearranged to convert the value into a different representation. The transformation between numeric types is implicit on assignment, no explicit typecast is required. Given a 4-byte type (such as int or float), it can hold one of about 4 billion (232) distinct bit patterns. This makes for 4-billion distinct representable ints and 4-billion representable floats, but these are not going to be the same set of 4 billion values. The set of representable integers includes every whole number from INT_MIN to INT_MAX. The set of representable floats spans a much larger magnitude from -FLT_MAX to +FLT_MAX with ever-widening gaps between neighbors. When converting from integer to float or vice versa, only those values that are representable in both camps are value-preserving during conversion. As an example, the value 2 can be exactly represented both as an int and as a float. Thus converting 2.0 => 2 or 2 => 2.0 is value-preserving (although the int and float bit patterns for 2/2.0 are not the same). Now let's look into the cases that are not value-preserving during conversion.

  • First consider converting from float => int. Any float with a fractional component is truncated to its integer value, discarding the fraction. You are familiar with this conversion and it is unsurprising that it is not value-preserving for fractional numbers. However, there are whole-valued float values with no fractional component (float values F.0 for some values of F) that do not have a corresponding integer representation and thus do not retain their value across conversion. What are those float values? What happens when you try to assign one of those floats to an int? Test things out writing some code to play around with float variables in the dissect.c program's main function.
  • Now consider converting int => float. At first blush, it seems any integer x has an obvious float equivalent x.0 that is value-preserving. But given we just showed some of the 4 billion floats don't have a match in the 4 billion ints, some ints are guaranteed not have a match in the floats. What are those ints that don't have a corresponding float representation? What happens when you try to assign one of those ints to a float? Test things out writing some code to play around with float variables in the dissect.c program's main function.
  • Consider converting int=>double. A double is basically a bigger version of float that apportions 11 bits for the exponent and 52 for the significand. Every integer value is value-preserving when converted to a double. Why is this guaranteed? (hint: think about the size of an int vs. the size of the double's significand) Test things out writing some code to play around with float variables in the dissect.c program's main function.

5) Code Study: hypotenuse (25 minutes)

The C math library function hypot(x, y) computes the length of the hypotenuse of a right triangle with sides of length x and y. It seems like a simple application of Pythagoras would suffice, but the challenge comes in meeting the requirement that the "calculation is performed without undue overflow or underflow during the intermediate steps." Consider the two implementations below, also provided in hypotenuse.c:

// NOTE: both ignore exceptional inputs as simplification

double naive_hypot(double x, double y) {
    return sqrt(x * x + y * y);

double better_hypot(double x, double y) {
    x = fabs(x);
    y = fabs(y);
    double max = fmax(x, y);
    double min = fmin(x, y);
    double r = min / max;
    return max * sqrt(1 + r * r);

You can use the provided hypotenuse program to test by specifying two command line arguments, which are the lengths of the two sides you wish to use to calculate. You can also modify the program as you wish to experiment. Here are some issues to tackle with your partner:

  • The naive version exhibits problems both with overflow and underflow.
    • For what inputs does naive_hypot(x,y) cause overflow? What result is returned if the calculation overflows?
    • What inputs cause underflow? What result is returned on underflow?
  • Although it may not look like it at first glance, the better version is simply a rearranged version of the original. Work out on paper why the rearrangement is mathematically valid.
  • What is the expected range of r in the better version? Is it possible for r * r to overflow ? to underflow? Why or why not?
  • Is the better version capable of correctly handling the inputs you earlier identified as problematic for the naive version?
  • Are there still inputs which cause overflow or underflow for the better version? Explain.

Check off with TA

At the end of the lab period, submit the checkoff form and ask your lab TA to approve your submission so you are properly credited for your work. It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is need to solidify your knowledge. Also take a chance to reflect on what you got what from this lab and whether you feel ready for what comes next! The takeaways from lab5 should be an introduction to floating point and understanding the representation as a form of "binary scientific notation". You should have a feel for the inherent compromises in this system: which values are representable, which are not, the gaps in the floating point number line, and a general sense of the rounding/approximation to expect from a floating point calculation. Here are some questions to verify your understanding and get you thinking further about these concepts:

  • Explain the reasons why a floating point constant may have to be rounded on assignment to a float variable. Are any integer constants rounded on assignment to a float? Which?
  • Roughly how large is the gap between 1.0 and the next larger float value? between 1e6 (1 million) and the next larger float value? between FLT_MAX and the next smaller float value?
  • If the float bits were re-apportioned to designate 7 bits for the exponent and 24 for the significand, how would this change what float values could be represented (e.g. dynamic range, precision, size of gaps)?

Further Reading

Some fun and interesting articles on floats to check out!

xkcd float comic