Lab 7: Optimizing, Profiling, and Ethics

Lab sessions Tue Nov 16 to Thu Nov 18

Lab written by Julie Zelenski, with modifications by Nick Troccoli and Kathleen Creel

Learning Goals

This lab is designed to give you a chance to explore how to analyze the efficiency of C programs, useful for making your heap allocator as fast and efficient as possible, and also to pull together various ethical concepts from the quarter in discussions with your labmates. Specifically, you will:

  1. experiment with different mechanisms for execution timing
  2. run a profiler to get dynamic instruction counts
  3. examine an ethical case study and discuss responsibility and accountability

This is the last CS107 lab of the quarter!

Get Started

Clone the repo by using the command below to create a lab7 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab7/shared lab7

Next, pull up the online lab checkoff and have it open in a browser so you can jot things down as you go. Only one checkoff needs to be submitted for both you and your partner.

1) Profiling with Callgrind (35min + 10min discussion)

Learning Goals: Become familiar with using callgrind to measure instruction counts and observe the efficiencies of different implementations of various functions. Callgrind will be a useful tool on the next assignment to optimize your heap allocators!

Starter code files: trials.c and copy.c

Callgrind, a part of the Valgrind suite of tools, is a profiler, a developer tool that observes an executing program to gather detailed statistics about resource consumption such as time spent, instructions executed, memory used, and so on. Callgrind specifically counts instructions and memory accesses. It is one way to measure program efficiency, alongside others, such as using the Unix time command to measure execution time ("runtime"), as we have done on past assignments.
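As a refresher, timing a whole program with the time command looks like this; after the program's own output, it reports wall-clock (real), user, and system times:

myth$ time ./trials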

Open Callgrind Guide

Powers of 2 (20min)

Let's use Callgrind to profile the trials.c program, which has functions that implement various ways of computing 2^n, and find out how these different approaches compare in terms of instruction counts.

  • Versions A and B use the math library routines pow and exp2. Library routines are fast, right? Well, the math library routines work on float/double types -- scroll through the 300+ lines of the musl implementation of pow to get a glimpse of what is involved. Integers don't get a free ride here: they must be converted to floating point and operated on using floating point instructions, which may be expensive!
  • Version C is a loop that iterates n times, multiplying by 2.
  • Version D capitalizes on the relationship between bit patterns and binary numbers. (A rough sketch of all four versions follows below.)
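Here is an illustrative sketch of the four approaches. The names and signatures below are assumptions for the sake of the example -- see trials.c for the actual code:

    #include <math.h>    // pow, exp2 (compile with -lm)

    long two_to_power_A(unsigned int exp) {    // library pow, float/double math
        return pow(2, exp);
    }

    long two_to_power_B(unsigned int exp) {    // library exp2, float/double math
        return exp2(exp);
    }

    long two_to_power_C(unsigned int exp) {    // loop: multiply by 2, exp times
        long result = 1;
        for (unsigned int i = 0; i < exp; i++) {
            result *= 2;
        }
        return result;
    }

    long two_to_power_D(unsigned int exp) {    // bitwise: shift 1 left by exp
        return 1L << exp;
    }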

Q1: How do you think the performance will compare between these different versions?

Let's find the static instruction count. Use objdump, or the disassemble command in gdb, to get the assembly for each function and count its instructions.
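Either of these will show you the disassembly (the function name below assumes the sketch above; substitute the names you find in trials.c):

myth$ objdump -d trials | less

or, within gdb:

myth$ gdb ./trials
(gdb) disassemble two_to_power_A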

Q2: Which function has the most instructions? The least? Does a shorter instruction count necessarily mean a function will be faster?

(Side note: what are cvtsi2sd and cvttsd2si? These are floating point assembly instructions! Floating point is a whole different side of assembly - feel free to search for them online for more info, but we're not going to worry about it for our purposes).

Let's use Callgrind to get the dynamic instruction counts. You run Callgrind just like Valgrind, but with --tool=callgrind to select this tool. The --toggle-collect option tells callgrind to count only the instructions executed within functions whose names match a given pattern (here, just the power functions), so as to focus on the functions of interest:

valgrind --tool=callgrind --toggle-collect='*power*' ./trials

In the output, the number labeled I refs is the total count of instructions executed. Callgrind also writes detailed information to a file named callgrind.out.[pid] (where [pid] is the process number shown in the left column of the callgrind summary, e.g. ==29240==). You can run the annotator, callgrind_annotate, on the output file to parse this information and get a breakdown of instruction counts per function or line. When invoked with the --auto=yes option, the annotator displays the original C source file and annotates each line with the number of assembly instructions executed on its behalf. Try it now to compare the counts for the various two_to_power functions:

callgrind_annotate --auto=yes callgrind.out.[pid]

In the output, for function calls you may see an => pointing to a function name. This separates the count of instructions for setting up and calling a function from the count of instructions for executing that function itself. For example, in the output below, callgrind is telling us there are 7 instructions to set up and call pow. The actual execution of the function (which happens once here) takes 126 instructions. Functions you write will be annotated separately in the callgrind output as well.

  1  long two_to_power_A(unsigned int exp) {
  7       return pow(2, exp);
126  => /build/glibc-eX1tMB/glibc-2.31/math/./w_pow_template.c:pow@@GLIBC_2.29 (1x)
  2  }

Q3: The bitwise version is untouchably fast (just a few instructions!). How much more work is involved in the other options? Did this surprise you?

Note that callgrind's counts are for that specific run of the program -- they could change depending on how the program executes (e.g. with different inputs). Callgrind can be useful for identifying performance bottlenecks: scan the annotated result for lines with large counts. A large count means the line was executed many times and/or corresponds to many assembly instructions. These "hot spots" are where the bulk of the program's instructions execute, and they are the passages to streamline to improve performance.

Copying Data (15min)

Now let's use callgrind to explore the cost of different means of copying a large array in copy.c. In each method, the total amount of data copied is the same, but changing how much work is done per iteration (and thus how many iterations are needed) can have a profound impact on performance. The program experiments with a variety of methods: chunk-by-chunk in a loop, a memcpy in a loop, or a single call to memcpy or memmove for the entire block.
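To give a flavor of the contrast, here is an illustrative sketch of a byte-at-a-time loop versus an eight-bytes-at-a-time loop (the names here are made up, not the verbatim contents of copy.c):

    #include <stddef.h>    // size_t

    void copy_by_char(char *dst, const char *src, size_t nbytes) {
        for (size_t i = 0; i < nbytes; i++) {
            dst[i] = src[i];    // 1 byte copied per iteration
        }
    }

    // assumes nbytes is a multiple of sizeof(long) and both pointers are suitably aligned
    void copy_by_long(char *dst, const char *src, size_t nbytes) {
        long *d = (long *)dst;
        const long *s = (const long *)src;
        for (size_t i = 0; i < nbytes / sizeof(long); i++) {
            d[i] = s[i];        // 8 bytes copied per iteration
        }
    }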

First, review the C code in copy.c, talk it over with your group and make some predictions about the relative performance of the different copying functions.

Q4: How do you think the performance will compare between these different versions?

Run under callgrind and annotate the output:

myth$ valgrind --tool=callgrind --toggle-collect='copy*' ./copy
myth$ callgrind_annotate --auto=yes callgrind.out.[pid]

Q5: Looking at the instruction counts for the char, int, and long loops, about how many fewer instructions does copying by ints take than copying by chars? Copying by longs rather than ints?

Copying a larger chunk demonstrates the value of "loop unrolling" to amortize the loop overhead by doing more work in each iteration and iterating fewer times.
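Loop unrolling isn't specific to copying. As a generic illustration (a sketch, not from the starter code), a summing loop unrolled by a factor of 4 pays for one loop test and increment per four elements of work:

    #include <stddef.h>    // size_t

    // assumes n is a multiple of 4 for simplicity
    long sum_unrolled(const long *arr, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i += 4) {    // one test/increment per 4 elements
            total += arr[i] + arr[i + 1] + arr[i + 2] + arr[i + 3];
        }
        return total;
    }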

Next, compare the instruction counts in the body of the long loop to the body of the memcpy loop (which memcpys 8 bytes each iteration, over the same number of iterations as the long loop) and you'll see the cost of using memcpy when a simple assignment would do -- ouch!

Q6: About how many fewer instructions does copying by longs directly take than memcpying 8 bytes at a time?

Lastly, check out the instruction counts for the single block memcpy/memmove. Wowzers -- somebody sure worked hard at optimizing the library routines! There's a speedup of an order of magnitude from copying the entire block in one go rather than copying each element individually.

Note: memcpy can have a performance advantage over memmove, since memcpy can ignore the possibility of overlapping regions while memmove has to deal with them. Read the man page for memmove to see what it says about handling overlap. The wording seems to say that the data is copied twice (once from src to tmp and again from tmp to dst), which would be a noticeable jump in instruction count, but the very slight rise in the instruction counts for memmove versus memcpy suggests otherwise. If you're interested, check out the extra problem in lab4 to see how one implementation from musl chooses to handle overlap. What does it do?
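One common strategy (sketched below; whether musl does exactly this is for you to find out) skips the temporary buffer entirely: compare the pointers and copy front-to-back or back-to-front so overlapping bytes are never clobbered before they are read. A simplified byte-at-a-time version with a made-up name:

    #include <stddef.h>    // size_t

    void *overlap_safe_copy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (d < s) {
            for (size_t i = 0; i < n; i++) {
                d[i] = s[i];            // copy front-to-back
            }
        } else {
            for (size_t i = n; i > 0; i--) {
                d[i - 1] = s[i - 1];    // copy back-to-front
            }
        }
        return dst;
    }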

Wrapping Up

In summary, re-implementing functionality already present in the C library is generally not a good idea. The library routines are already written, tested, debugged, and aggressively optimized -- what's not to like? However, you do need to take care to choose the right tool for the job. Knowing the cost/benefit of the library functions and being knowledgeable about the compiler's optimizing capabilities will allow you to better focus your attention on passages that require human intervention, and not bother where the library functions and gcc can work their magic on your behalf. And you should never use a floating point math function to do what could have been a simple integer or bitwise operation!

Try your hand at this quiz on what gcc can/will optimize -- fun and informative!

2) Ethics Case Study: Therac-25 (25min + 10min discussion)

Learning Goals: Learn about ensuring the safety of systems on which people rely, and determine responsibility and accountability for complex systems created by "many hands".

So far in CS107, you have diagnosed problems in the launch sequence of a multi-million dollar rocket and secured a bank vault from threats both internal and external. As a working computer scientist, software engineer, or programmer, you could be called on to take responsibility for a system equally consequential -- or more so. In lab today, we will consider a case study of one of the worst software bugs in history: a lethal race condition in the software of a medical radiation device, the Therac-25, between 1985 and 1987.

First, take 5-10 minutes to read this description of the case.

Open Therac-25 Article

(If you would like more context, or to follow up later, an optional fuller description is here).

For this problem, form larger groups of 4-5 to discuss your ideas for the questions below. Discuss each question among your group-mates for 5-10 minutes, and write your thoughts and answers in the shared doc for your lab!

Open Your Lab's Shared Doc

Q1: Who should be held responsible for incidents such as these? (Some ideas: the creator company, its shareholders, the software engineers, the quality testers, the hospitals using the machine, etc.) In what way should they be held responsible?

Q2: The FDA now regulates medical devices, including contemporary radiation machines, more thoroughly than it did in the 1980s. However, most uses of software are not regulated. In addition to medical software, what other kinds of software (if any) ought to be regulated in order to preserve human life, health, and safety?

[Optional] Extra Problem

Finished with lab and itching to further explore optimization? Check out our extra problem!

Recap

Nice work on the lab! It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is needed to solidify your knowledge. Also take a chance to reflect on what you got out of this lab and whether you feel ready for what comes next! The takeaways from this lab should be learning how to use Valgrind's callgrind tool to analyze program efficiency and bottlenecks, and examining the nuances of system safety and accountability. Your awesome assembly skills are just what is needed for examining the generated assembly and identifying gcc's cool transformations to make code run faster. Here are some questions to verify your understanding and get you thinking further about these concepts:

  1. Loop unrolling is the process of reworking a loop to accomplish more per iteration and iterate fewer times. In what situations might this provide a performance win?
  2. Avoiding the expense of function call/return is often cited as the motivation for inlining a function (though modern compilers will often inline small functions for you). What are the estimated savings in terms of number of instructions? Inlining also enables additional optimization opportunities, given that gcc applies most transformations within a function, but not inter-procedurally. Do you expect these gains to be more or less significant than the call overhead? Explain.
  3. Error messages are a form of documentation. What should have been different about the Therac-25 error messages? What would have made them more informative?
  4. AECL, the manufacturer of the Therac-25, was sued by a patient who was injured in 1985. AECL claimed that it was "impossible" that the software could have malfunctioned. Why might they have been so overconfident in their own software? What systems could be put in place to help calibrate such confidence correctly?