Lab 8: Optimizing and Profiling

Lab sessions Tue May 28 to Thu May 30

Lab written by Julie Zelenski, with modifications by Nick Troccoli

Learning Goals

This lab is designed to give you a chance to explore how to analyze the efficiency of C programs. This is useful for making your heap allocator as fast and efficient as possible. Specifically, you will:

  1. experiment with different mechanisms for execution timing
  2. run a profiler to get dynamic instruction counts

This is the last CS107 lab! Share a computer with somebody you haven't worked with before and meet a new classmate as we reach the end of the quarter.

Get Started

Clone the repo by using the command below to create a lab8 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab8/shared lab8

Open the lab checkoff form.

Exercises

1) Timing Tools

One common measurement of program efficiency is execution time, also known as runtime. A simple way to measure runtime is to use the system process timer via the Unix time command. This measurement is fairly coarse and is limited to timing the entire program, but it's convenient in that you can time any program, whether or not you have its source code, without making any code changes. Let's try it now!

  • The samples folder in the lab folder contains mysort_soln from assign4 and bbsort from assign5, along with input files of random numbers. Compare the time for these two programs and the standard Unix sort command on the same input using the commands below. Make sure to execute these commands from within your main lab8 directory. (Note: the >/dev/null at the end of each command suppresses output so you can more easily see the timing information.) The time labeled "user" is the one of interest.

    time samples/mysort_soln -n samples/million >/dev/null
    time samples/bbsort -n samples/million >/dev/null
    time sort -n samples/million >/dev/null
    

    How do the runtimes of the three sort programs rank? The results should show that the runtimes for bbsort and mysort are comparable, but standard sort is much faster. Let's investigate further!

  • A profiler is a developer tool that observes an executing program to gather detailed statistics about resource consumption such as time spent, instructions executed, memory used, and so on. Although our primary use of valgrind has been for help finding memory errors and leaks, it can also operate as a profiler by using the callgrind tool to count instructions and memory accesses. (More info in the CS107 guide to callgrind)

    valgrind --tool=callgrind samples/mysort_soln -n samples/thousand >/dev/null
    valgrind --tool=callgrind samples/bbsort -n samples/thousand >/dev/null
    valgrind --tool=callgrind sort -n samples/thousand >/dev/null
    

    Run each of the three sort programs under callgrind and find the number labeled I refs in the output. This is the total count of instructions executed. What do you find? As expected from the earlier timing data, the instruction counts for bbsort and mysort are comparable, but the standard sort is doing a lot less work.

  • In addition to summarizing the total instruction count, callgrind also writes detailed information to a file named callgrind.out.pid (where pid is the process number shown in the left column of the callgrind summary, e.g. ==29240==). You can run the annotator on the output file to get a breakdown of instruction count per function or line. Try it now (replacing pid below with your process id):

    callgrind_annotate callgrind.out.pid
    

    If you run the annotator on the callgrind output for mysort and bbsort, you will see that one single function accounts for the vast majority of the instructions executed. What function is it? Does that function suggest an explanation for what the built-in sort is likely doing that makes it so much faster?

2) Integer, Float and Bitwise Math

The trials.c program has code to implement some numeric operations in a variety of ways. Let's find out how these different approaches compare in terms of instruction counts.

  • The two_to_power functions are various ways of computing 2^n.
    • Versions A and B use the math library routines pow and exp2. Library routines are fast, right? Well, the math library routines work on float/double type -- scroll through the 300+ lines of the musl implementation of pow to get a glimpse of what is involved. Integers don't get a free ride here, they must be converted to float and operated on using floating point instructions, which may be expensive!
    • Version C is a loop that iterates n times, multiplying by 2. Look at the assembly for the function. Is there an actual mul instruction in there or did the clever compiler come up with something better?
    • Version D capitalizes on the relationship between bit patterns and binary numbers. That seems very likely to be our winner. How much better do we expect this to be than the others?
  • The various is_power functions show a variety of approaches for testing whether an integer is an exact power of 2. How do each of these work?
  • After reviewing the C code, talk it over with your partner and make some predictions about the relative performance of the two_to_power and is_power functions.
  • Now let's do some counting. First use objdump or gdb disassemble to get the assembly for each function and count instructions. This is the static count. In what situations is it expected that the static count will match the dynamic count?
  • Use callgrind to get the dynamic counts. The toggle-collect option in the command below tells callgrind to only count instructions within functions whose name includes power, so as to focus on the functions of interest:

    valgrind --tool=callgrind --toggle-collect='*power*' ./trials
    
  • Use callgrind_annotate to break down instruction counts per function/line.

    callgrind_annotate --auto=yes callgrind.out.pid
    

    When invoked with the --auto=yes option, the annotator displays the original C source file, annotating each line with the number of assembly instructions executed on its behalf. You can identify performance bottlenecks by scanning the annotated result for lines with large counts; a larger count means the line was executed more times and/or corresponds to many assembly instructions. These lines are the program's "hot spots" -- the passages to streamline if you want to improve performance.

  • Compare the counts for the various two_to_power functions. The bitwise version is untouchably fast (just a few instructions!). How much more work is involved in the other options? Did this surprise you?

  • Compare the counts for the is_power functions. The bitwise approach is out in front, but the switch and popcount versions are not far behind. Which version is bringing up the rear?

Note: you may see counts of instructions executed to set up and call a function, and counts of instructions for when the function is actually executing. For example, in the example output below, callgrind is telling us there are 9000 instructions executed to set up and call find_min. This doesn't include the actual execution of the function - just the setup and calling of it.

4,005      for (int i = 0; i < n; i++) {
9,000          int min = find_min(arr, i, n - 1);

Separately, callgrind annotates the actual function's code to tell us the number of instructions executed when that function is running:

            . int find_min(int arr[], int start, int stop)
        3,000  {
        2,000      int min = start;
    2,005,000      for(int i = start + 1; i <= stop; i++)
    4,995,000          if (arr[i] < arr[min])
        6,178              min = i;
        1,000      return min;
        2,000  }

This output shows us the number of instructions spent executing the find_min function (but not the number of instructions spent setting up and calling it from other functions).

In summary, re-implementing functionality already present in the C library is generally not a good idea. The library routines are already written, tested, debugged, and aggressively optimized -- what's not to like? However, you do need to take care to choose the right tool for the job. Knowing the cost/benefit of the library functions and being knowledgeable about the compiler's optimizing capabilities will allow you to better focus your attention on the passages that require human intervention and not bother where the library functions and gcc can work their magic on your behalf. You should never use a floating point math function to do what could have been a simple integer or bitwise operation!

Try your hand at this quiz on what gcc can/will optimize -- fun and informative!

3) Copying Data

The copy program explores the relative cost of different means of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and the number of iterations can have a profound impact on performance. The program experiments with a variety of methods: chunk-by-chunk in a loop, a memcpy in a loop, or a single call to memcpy or memmove for the entire block.

  • Review the C code in copy.c, talk it over with your partner and make some predictions about the relative performance of the different copying functions.

  • Run under callgrind and annotate the output:

    valgrind --tool=callgrind --toggle-collect='copy*' ./copy 
    callgrind_annotate --auto=yes callgrind.out.pid
    
  • Look at the instruction counts for the char, int, and long loops. About how many fewer instructions does copying by ints take compared to copying by chars? By longs compared to ints? Copying a larger chunk per iteration demonstrates the value of "loop unrolling": amortizing the loop overhead by doing more work in each iteration and iterating fewer times.

  • Compare the instruction counts in the body of the long loop to the body of the memcpy loop (memcpy 8 bytes each iteration, same number of iterations as the long loop) and you'll see the cost of using memcpy when a simple assignment would do -- ouch!
  • Check out the instruction counts for the single block memcpy/memmove. Wowzers -- somebody sure worked hard at optimizing the library routines! There's a speedup of an order of magnitude from copying the entire block in one go rather than copying each element individually.
  • memcpy can have a performance advantage over memmove, since memcpy can ignore the possibility of overlapping regions while memmove has to handle them. Read the man page for memmove to see what it says about handling overlap. The wording seems to say that the data is copied twice (once from src to tmp and again from tmp to dst), which would mean a noticeable jump in instruction count, but the very slight rise in the instruction counts for memmove versus memcpy suggests otherwise. Do you remember what the musl_memmove (review lab4) did instead to handle overlap?

4) Stack vs. Heap Allocation

Stack allocation is preferable to heap allocation for a multitude of reasons (automatic allocation/deallocation, no risk of forgetting to free, etc.), and it also offers significant performance advantages. Let's use callgrind to profile a sorting program to explore the relative runtime cost of using the stack versus the heap.

The isort program implements insertion sort. When sorting the array, it will need temporary space when swapping array elements and the program explores the performance impact of allocating that temporary space using a fixed-size stack buffer, a variable-length stack buffer, or dynamically allocating from the heap.

  • Review the C code in isort.c, talk it over with your partner and make some predictions about the relative performance of the different functions.

  • Run under callgrind and annotate the output:

    valgrind --tool=callgrind --toggle-collect='swap_*' ./isort
    callgrind_annotate --auto=yes callgrind.out.pid
    
  • All three swap functions have the same instruction count in the three memcpy calls; the issue to examine is the instructions spent in setting up the tmp memory and cleaning it up.

  • Find the instruction counts for the swap_fixedstack and swap_varstack functions. The lines corresponding to the function opening/closing brace and variable declaration are annotated with counts for function entry/exit. What is the difference in the number of instructions executed on entry and exit to swap_fixedstack vs swap_varstack?
  • Compare the disassembly for these two functions. How many instructions are required to set up and clean up a constant-length stack array? How many additional instructions are needed for a variable-length stack array? Does the count of additional instructions in the disassembly match the count of executed instructions reported by callgrind?
  • Now look at the instruction counts for swap_heap. The annotation will indicate a small-ish number for the instructions that set up for the call to malloc/free and a much larger number of instructions executed within the malloc and free library functions. About how many instructions does each call to malloc incur? Each call to free? How does that compare to the cost of allocating and deallocating space on the stack?

Check off with TA

At the end of the lab period, submit the checkoff form and ask your lab TA to approve your submission so you are properly credited for your work. It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is needed to solidify your knowledge. Also take a chance to reflect on what you got from this lab and whether you feel ready for what comes next! The takeaway from lab8 should be learning how to use valgrind's callgrind tool to analyze program efficiency and find bottlenecks. Your awesome assembly skills are just what is needed for examining the generated assembly and identifying gcc's cool transformations to make code run faster. Here are some questions to verify your understanding and get you thinking further about these concepts:

  • Loop unrolling is the process of reworking a loop to accomplish more per iteration and iterate fewer times. In what situations might this provide a performance win?
  • Avoiding the expense of function call/return is often cited as the motivation for inlining a function (though with current compilers this is rarely necessary to do by hand). What are the estimated savings in terms of number of instructions? Inlining also enables additional optimization opportunities, given that gcc applies most transformations within a function but not inter-procedurally. Do you expect these gains to be more or less significant than the call overhead? Explain.