Assignment 5: Floating Point and Assembly

Due: Fri Feb 22 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Sun Feb 24 11:59 pm

Assignment by Julie Zelenski, with modifications by Nick Troccoli

Learning Goals

This assignment focuses on floats and assembly. You will be building your skills with:

  • understanding floating point representation and its features/limitations
  • reading assembly code and practicing reverse-engineering

You will do some float and assembly experiments and write up your results to submit as your deliverable. No code to write this time!

Overview

In this assignment, you will explore real-world tools for evidence of floating point limitations like those we have discussed in class and lab, compare the tradeoffs between different implementations of a floating point average function, and use assembly language to explore details of the bbsort program's implementation without having access to its source code!

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign5/$USER assign5

The starter project contains the following:

  • readme.txt: the file where you will write your answers to the different written questions for this assignment.
  • average.c and Makefile: the file containing different implementations of code to average an array of floats, and its Makefile for compiling
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • bbsort: an executable implementing a version of the Unix sort command that you will examine to better understand its implementation and behavior
    • colors and numbers: sample text files for testing the bbsort program
  • tools: contains a symbolic link to the submit program for submitting your work

1. Exploring The Real World

In lab, you investigated how floating point values are necessarily approximated subject to the limits of the IEEE representation. Although you used C code to make these observations, this behavior transcends C, as it is driven by the underlying hardware reality. If the IEEE 754 standard is dominant in modern processors, does it follow that all software that layers on it will also follow suit? Let's find out!

Choose two different tools/systems you have access to, such as:

  • another programming language (Python, JavaScript, Swift, ... (don't explore C++, since that's based on C!))
  • a computation tool (Matlab, Mathematic, Maple, MathHandbook, ...)
  • a spreadsheet (Excel, Numbers, ...)
  • a Unix tool (dc, bc, xcalc, gdb (gdb works on doubles by default) ...)
  • a website (https://www.desmos.com/scientific, https://web2.0calc.com, https://google.com, ...)
    • (note: you can enter an equation directly into Google search!)
  • your trusty hand-held calculator
  • ... ?

You must pick from two different categories in your exploration.

Experiment with your chosen tools to find evidence to confirm or deny that it uses the same IEEE binary floating point for representing real-valued numbers. Here are some possible questions to guide your inquiry (this is neither intended to be an exhaustive list of everything you might investigate and report on, nor to be the minimum requirements--just some ideas to guide your inquiry):

  • Are values being rounded on assignment and during calculations?
  • How large a value can be represented? How close to zero? What happens if you go beyond those limits?
  • How many decimal digits of precision are maintained?
  • What happens if you attempt to calculate a value outside the available range/precision?
  • Can you determine if it's using 32-bit float or 64-bit double (or something else?)
  • Can you observe the presence of exceptional values (overflow, infinity, nan)?

The deliverable for this exercise is simply to summarize your conclusions. Edit your readme.txt file to share what you were able to discern about how floating point values are handled by the two tools you examined, one paragraph per tool. We expect you won't find much that is not IEEE-based, but if you do, it will definitely make for more interesting explorations (but you don't need to worry about e.g. reverse engineering it to figure out what it is instead). We're curious to learn what's out there!

2. Floating Average

This is a code study exercise that considers various approaches for computing the average of a sequence of float values. The naive approach of summing the values and dividing by the count should do the job in theory, but in practice will not compute a correct answer for all inputs. Floating point math brings the possibility of overflow (magnitude too large for range), underflow (magnitude too small for range), loss of precision due to rounding errors in addition, and loss of significant digits in subtraction, all of which can throw a wrench into even simple calculations.

In pure mathematics, a sequence of numbers has one true average. On a computer, the best we can hope for is the nearest representable float to that true average, so that's where we set the standard. If the average function produces that nearest representable float, we will call that a "correct" result. The true average of {1,1,0,0,0} is 0.4, but 0.4 is a non-terminating binary decimal. Its nearest representable float is 0.400000006, so this is considered the correct result. Assuming the function is given a sequence of exactly represented floats, you might think it a straightforward task to return the nearest representable float to the average, but even seemingly valid approaches won't compute the correct float for some inputs.

An average function operates on a sequence of floats. We give you four pre-written variants of average in average.c: one "naive" version and three "improved" variants A, B, and C. Each of A, B, and C makes an intentional rearrangement of operations to avoid a specific floating point pitfall present in the naive version. The reformulations are equivalent according to the rules of pure math, but in a floating point world, these simple changes can and do affect the outcome for certain inputs. Each improved version succeeds somewhat on its good intentions by fixing some previously incorrect outcomes, yet each still has inputs for which it is incorrect.

Read over the given naive average and compare its approach to averages A, B, and C. Each has changed the computation; you should work out why those modifications were made -- what weakness in the naive version is it trying to address? For which inputs do you expect these changes to produce an improved result?

The code in average.c sets up some different test sequences. Each sequence is passed to each of the four average functions. Work out what is the "correct" result for each sequence and run the code and observe which (if any) of the average functions return that result. Where the result differs from expected, dig in to find out what is happening. The sequences we provide are intentionally composed to show certain effects. You might find it helpful to experiment with additional test sequences of your own to explore further until you have a clear idea of how each function operations.

Edit the readme.txt file to summarize your findings in a short report, one paragraph for each of the averageA/B/C functions. You should identify the strengths of this version relative to the others, as well as what vulnerabilities remain. Aim to be precise in your analysis and relate the observed behaviors to the underlying mechanics of floating point operations. The grading for this question will value the thoughtfulness and thoroughness of your explanation above all -- show us that you're starting to make some sense of how these quirky floats work!

An important learning goal is to build an understanding of the floating point limitations. You should be able to reason about errors that arise in floating point calculations and recognize the inherent tradeoffs that come in trying to work around them. An infinite domain in a finite number of bits is inherently a compromise, but knowing some of the techniques you can use to mitigate the worst of its effects and what approximations must be tolerated is valuable. Who knew it could be so challenging to write robust floating point code!

3. Reverse-Engineering Busybox Sort

The sorting program bbsort from the busybox project is a cousin of the mysort program you wrote in assign4 — each implements a simplified version of the standard Unix sort command. bbsort's interface and external behavior is quite similar to mysort; it will be interesting to dissect it to see if its internal implementation is also like yours!

Check out the samples directory for the executable and different test files. Use ./bbsort --help to see the program's usage. Invoke ./bbsort on various inputs and command-line flags to see how it behaves. (Note: bbsort will ignore invalid command-line flags, but may not print anything out complaining about them - this is expected, and a known bug in the program!)

The program is given to you in compiled form only, but even with no access to the C source, you can still discern quite a lot about how it operates. You can run the program on various inputs, use objdump to review its assembly, and poke around in gdb to examine the program state while executing. Working backwards from the compiled assembly to construct a picture of the original C source is a process known as reverse-engineering.

Let's try reverse-engineering bbsort! Follow along with the tasks listed here and answer each of the questions posed below. Write up your answers in your readme.txt file.

Tip: some of the steps below require using GDB to examine assembly. GDB has additional commands that let you print out register values, disassemble functions, and more. Check out the last section in the CS107 Guide to x86-64 for details!

  1. Start gdb on bbsort. You can run any program under the debugger, not just those that you wrote yourself. You won't be able to do things that require the program source files (e.g. cannot list main for example), but you can run/step, set breakpoints, examine registers and memory, disassemble and more. Note that because you don't have access to the source, you will see "No such file or directory" messages when stopped at a breakpoint or during single stepping - this is expected, and you can ignore this. Set a breakpoint on qsort, which this program calls, and run the program on one of the sample input files. When the breakpoint is hit, stop here and look at arguments being passed to qsort. In particular, look at the comparison callback function. Re-run sort a few times, each time with different flags, and you'll discover that the same comparison function is used for every sort, regardless of sort options. Answer the following questions in readme.txt for part a):
    • What is the name of this all-purpose comparison function?
    • Disassemble it and count the number of assembly instructions it contains.
  2. Set a breakpoint on this comparison function. When that breakpoint is hit, figure out how to print the two elements being compared. The arguments to the function are not accessible by name, but you can access their contents by other means (hint: think assembly-level! See the CS107 x86-64 guide for information that may provide hints). The type of the arguments to the comparison function in the C source would have been const void *, but you should deduce the actual type of the arguments. Answer the following question in readme.txt for part b):
    • Give the gdb expressions that can be used to print the contents of the two lines being compared when stopped at the comparison callback. Hint: you will need a typecast and, as always, be careful with the level of indirection when interpreting a void*.
  3. The all-purpose comparison function used by bbsort uses the command-line flags to determine which kind of comparison to apply to the elements. The function accesses this information via a global variable (boo!). The main function calls the getopt32 function to parse the command-line options, which records the settings in a global variable that is later read by the comparison function. Read through the assembly for the comparison function to identify where it reads from this global variable. (Hint: the instruction that accesses the global variable has a comment with the variable's address and name). Print the value of that global variable in gdb. Run the program with various command-line flags and see how the value changes based on the chosen options. (Printing the value in binary or hex makes it more clear what is happening). Answer the following questions in readme.txt for part c):
    • What is the name of this global variable? What is its address?
    • What is its value when bbsort is invoked with the flags -n -r? What if invoked with just -r? With all three flags -n -r -u?
    • What can you conclude about how the global variable is represented/used?
  4. Set a breakpoint on qsort and when hit, print the elements from the array being sorted (review gdb array-printing tips from from lab4). Then, use the gdb finish command to execute the rest of the qsort call and return to the caller, and then print the array again to see how it was sorted (note that after finish, you'll no longer be stepping inside of qsort anymore). From these observations, determine how bbsort handles its flags. Answer the following questions in readme.txt for part d):
    • For -r, is the array actually sorted in reverse or just printed in reverse?
    • For -u, are duplicates removed from the array before or after sorting?
    • For -n, are lines converted first and sorted as numbers or is each line repeatedly converted to a number to compare?

Great job! Your sleuthing skills are well on their way to being ready for the famous binary bomb coming up as your next assignment.

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished version is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

Grading

We estimate around 35 points allocated to the readme answers. You will be graded on the completeness of your answers, thoughtfulness of your explanations, and your demonstrated understanding of the concepts.

On-time Bonus (+5%) The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly! The bonus is calculated as a percentage of the point score earned by the submission. See the General Information handout for more information about the late policy.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have experience understanding floating point representation and the ramifications of how it is represented. You will also have experience using your debugging and assembly skills to reverse-engineer a program without having access to its source code. Way to go! At this point, you should feel more comfortable reading assembly and building up your skills with the tools to disassemble, trace, and examine programs at an assembly level. You are ready for binary bomb!

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before the exams.

  • Is floating point addition guaranteed to be commutative? associative? Explain why or why not.
  • If your bank stored your balance as a 32-bit float, roughly how large would your balance need to be such that trying to add a single penny would not register?
  • How can you spot a read/write of a global variable in a sequence of assembly instructions?
  • How does the caller pass arguments to a function?

We would also appreciate if you filled out this homework survey to tell us what you think. We appreciate your feedback!