Lab 7: Optimizing and Profiling

Lab sessions Tue Nov 10 to Thu Nov 12

Lab written by Julie Zelenski, with modifications by Nick Troccoli

There is no prelab for lab7.

Learning Goals

This lab is designed to give you a chance to explore how to analyze the efficiency of C programs. This is useful for making your heap allocator as fast and efficient as possible. Specifically, you will:

  1. experiment with different mechanisms for execution timing
  2. run a profiler to get dynamic instruction counts

This is the last CS107 lab of the quarter!

Get Started

Clone the repo by using the command below to create a lab7 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab7/shared lab7

Exercises

1) Timing Tools (15 minutes)

One common measurement for program efficiency is execution time, also known as run time. A simple way to measure run time is to use the system process timer via the Unix time command. This measurement is fairly coarse and is limited to timing the entire program, but it's convenient in that you can time any program, whether or not you have its source code, and you don't need to make any code changes. Let's try it now!

  1. The samples folder in the lab folder contains mysort_soln from assign4 and the bbsort executable. bbsort is a sorting program from the busybox project, and is a cousin of the mysort program you wrote in assign4 — each implements a simplified version of the standard Unix sort command. bbsort's interface and external behavior are quite similar to mysort's. The samples folder also contains input files of random numbers. Compare the times for these two programs and the standard Unix sort command on the same input using the commands below. Make sure to execute these commands from within your main lab directory. (Note: the >/dev/null at the end of each command suppresses output so you can more easily see the timing information.) The time labeled "user" is the one of interest.

    time samples/mysort_soln -n samples/million >/dev/null
    time samples/bbsort -n samples/million >/dev/null
    time sort -n samples/million >/dev/null
    

    How do the runtimes of the three sort programs rank? The results should show that the runtimes for bbsort and mysort are comparable, but standard sort is much faster. Let's investigate further!

  2. A profiler is a developer tool that observes an executing program to gather detailed statistics about resource consumption such as time spent, instructions executed, memory used, and so on. Although our primary use of valgrind has been for help finding memory errors and leaks, it can also operate as a profiler by using the callgrind tool to count instructions and memory accesses. (More info in the CS107 guide to callgrind)

    valgrind --tool=callgrind samples/mysort_soln -n samples/thousand >/dev/null
    valgrind --tool=callgrind samples/bbsort -n samples/thousand >/dev/null
    valgrind --tool=callgrind sort -n samples/thousand >/dev/null
    

    Run each of the three sort programs under callgrind and find the number labeled I refs in the output. This is the total count of instructions executed. What do you find? As expected from the earlier timing data, the instruction counts for bbsort and mysort are comparable, but the standard sort is doing a lot less work.

  3. In addition to summarizing the total instruction count, callgrind also writes detailed information to a file named callgrind.out.pid (where pid is the process number shown in the left column of the callgrind summary, e.g. ==29240==). You can run a tool called the annotator on this output file to parse the information and get a breakdown of instruction counts per function or line. Try it now (replacing pid below with your process id):

    callgrind_annotate callgrind.out.pid
    

    If you run the annotator on the callgrind output for mysort and bbsort, you will see that one single function accounts for the vast majority of the instructions executed. What function is it? Does that function suggest an explanation for what the built-in sort is likely doing that makes it so much faster?

2) Integer, Float and Bitwise Math (20 minutes)

The trials.c program has code to implement some numeric operations in a variety of ways. Let's find out how these different approaches compare in terms of instruction counts.
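
For orientation before you open trials.c, here is a rough sketch of the flavor of the variants described below. The names and exact code are illustrative guesses, not the actual contents of trials.c (which also has a switch-based is_power variant not sketched here):

    // Illustrative sketch only -- names and details in the real trials.c differ.
    #include <math.h>
    #include <stdbool.h>

    // Versions A/B: lean on the floating-point math library
    unsigned long two_to_power_pow(int n)  { return (unsigned long)pow(2, n); }
    unsigned long two_to_power_exp2(int n) { return (unsigned long)exp2(n); }

    // Version C: integer loop that multiplies by 2 on each iteration
    unsigned long two_to_power_loop(int n) {
        unsigned long result = 1;
        for (int i = 0; i < n; i++) {
            result *= 2;
        }
        return result;
    }

    // Version D: exploit the binary representation directly
    unsigned long two_to_power_shift(int n) { return 1UL << n; }

    // is_power variants: repeated division, popcount, and a bitwise trick
    bool is_power_loop(unsigned long v) {
        if (v == 0) return false;
        while (v % 2 == 0) v /= 2;
        return v == 1;
    }

    bool is_power_popcount(unsigned long v) {
        return __builtin_popcountl(v) == 1;    // gcc builtin counts the 1 bits
    }

    bool is_power_bitwise(unsigned long v) {
        return v != 0 && (v & (v - 1)) == 0;   // a power of 2 has exactly one 1 bit
    }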

  1. The two_to_power functions are various ways of computing 2^n.
    • Versions A and B use the math library routines pow and exp2. Library routines are fast, right? Well, the math library routines work on the float/double types -- scroll through the 300+ lines of the musl implementation of pow to get a glimpse of what is involved. Integers don't get a free ride here: they must be converted to floating point and operated on using floating-point instructions, which may be expensive!
    • Version C is a loop that iterates n times, multiplying by 2. Look at the assembly for the function. Is there an actual mul instruction in there or did the clever compiler come up with something better?
    • Version D capitalizes on the relationship between bit patterns and binary numbers. That seems very likely to be our winner. How much better do we expect this to be than the others?
  2. The various is_power functions show a variety of approaches for testing whether an integer is an exact power of 2. How does each of them work?
  3. After reviewing the C code, talk it over with your group and make some predictions about the relative performance of the two_to_power and is_power functions.
  4. Now let's do some counting. First use objdump or gdb's disassemble command to get the assembly for each function and count the instructions (example invocations appear after this list). This is the static count. In what situations would you expect the static count to match the dynamic count?
  5. Use callgrind to get the dynamic counts. The toggle-collect option in the command below tells callgrind to only count instructions within functions whose name includes power, so as to focus on the functions of interest:

    valgrind --tool=callgrind --toggle-collect='*power*' ./trials
    
  6. Use callgrind_annotate to break down instruction counts per function/line.

    callgrind_annotate --auto=yes callgrind.out.pid
    

    When invoked with the --auto=yes option, the annotator displays the original C source file and annotates each line with the number of assembly instructions executed on its behalf. You can identify performance bottlenecks by scanning the annotated result for lines with large counts. A larger count means the line was executed more times and/or it corresponds to many assembly instructions. These lines are the program's "hot spots": they account for the bulk of the instructions executed, and they are the passages to streamline to improve performance.

  7. Compare the counts for the various two_to_power functions. The bitwise version is untouchably fast (just a few instructions!). How much more work is involved in the other options? Did this surprise you?

  8. Compare the counts for the is_power functions. The bitwise approach is out in front, but the switch and popcount versions are not far behind. Which version is bringing up the rear?
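
    Here are the objdump/gdb invocations referenced in step 4 for dumping the assembly; the function name passed to gdb's disassemble is a placeholder, so substitute the actual function names from trials.c:

    objdump -d ./trials | less
    gdb ./trials
    (gdb) disassemble some_power_function    # placeholder name -- use a real function from trials.c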

Note: you may see counts of instructions executed to set up and call a function, and separate counts of instructions executed while the function itself is running. For example, in the output below, callgrind is telling us that 9,000 instructions were executed to set up and call find_min. This doesn't include the actual execution of the function - just the setup and calling of it.

4,005      for (int i = 0; i < n; i++) {
9,000          int min = find_min(arr, i, n - 1);

Separately, callgrind annotates the actual function's code to tell us the number of instructions executed when that function is running:

            . int find_min(int arr[], int start, int stop)
        3,000  {
        2,000      int min = start;
    2,005,000      for(int i = start + 1; i <= stop; i++)
    4,995,000          if (arr[i] < arr[min])
        6,178              min = i;
        1,000      return min;
        2,000  }

This output shows us the number of instructions spent executing the find_min function (but not the number of instructions spent setting up and calling it from other functions).

In summary, re-implementing functionality already present in the C library is generally not a good idea. The library routines are already written, tested, debugged, and aggressively optimized -- what's not to like? However, you do need to take care to choose the right tool for the job. Knowing the cost/benefit of the library functions and being knowledgeable about the compiler's optimizing capabilities will allow you to better focus your attention on the passages that require human intervention, and not bother where the library functions and gcc can work their magic on your behalf. You should never use a floating-point math function to do what could have been a simple integer or bitwise operation!

Try your hand at this quiz on what gcc can/will optimize -- fun and informative!

3) Copying Data (15 minutes)

The copy program explores the relative cost of different ways of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and how many iterations are needed can have a profound impact on performance. The program experiments with a variety of methods: copying chunk-by-chunk in a loop, calling memcpy in a loop, or making a single call to memcpy or memmove for the entire block.
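
For orientation, here is a rough sketch of the flavor of these approaches. The function names and chunk handling below are illustrative, not the actual code in copy.c:

    // Illustrative sketch only -- the real copy.c differs in names and details.
    #include <string.h>
    #include <stddef.h>

    void copy_by_char(char *dst, const char *src, size_t nbytes) {
        for (size_t i = 0; i < nbytes; i++) {
            dst[i] = src[i];                                 // 1 byte per iteration
        }
    }

    // Assumes nbytes is a multiple of sizeof(long) and the pointers are suitably aligned.
    void copy_by_long(char *dst, const char *src, size_t nbytes) {
        for (size_t i = 0; i < nbytes; i += sizeof(long)) {
            *(long *)(dst + i) = *(const long *)(src + i);   // 8 bytes per iteration
        }
    }

    void copy_by_memcpy_loop(char *dst, const char *src, size_t nbytes) {
        for (size_t i = 0; i < nbytes; i += sizeof(long)) {
            memcpy(dst + i, src + i, sizeof(long));          // a function call per 8 bytes
        }
    }

    void copy_whole_block(char *dst, const char *src, size_t nbytes) {
        memcpy(dst, src, nbytes);                            // one call for the entire block
    }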

  1. Review the C code in copy.c, talk it over with your group and make some predictions about the relative performance of the different copying functions.

  2. Run under callgrind and annotate the output:

    valgrind --tool=callgrind --toggle-collect='copy*' ./copy 
    callgrind_annotate --auto=yes callgrind.out.pid
    
  3. Look at the instruction counts for the char, int, and long loops. About how many fewer instructions does copying by ints take than copying by chars? Copying by longs instead of ints? Copying a larger chunk per iteration demonstrates the value of "loop unrolling": amortizing the loop overhead by doing more work in each iteration and iterating fewer times.

  4. Compare the instruction counts in the body of the long loop to the body of the memcpy loop (which memcpys 8 bytes each iteration, over the same number of iterations as the long loop) and you'll see the cost of using memcpy when a simple assignment would do -- ouch!
  5. Check out the instruction counts for the single block memcpy/memmove. Wowzers -- somebody sure worked hard at optimizing the library routines! There's a speedup of an order of magnitude from copying the entire block in one go rather than copying each element individually.
  6. memcpy can have a performance advantage over memmove, since memcpy can ignore the possibility of overlapping regions while memmove has to handle it. Read the man page for memmove to see what it says about handling overlap. The wording seems to say that the data is copied twice (once from src to tmp and again from tmp to dst), which would be a noticeable jump in instruction count, but the very slight rise in the instruction counts for memmove versus memcpy suggests otherwise. If you're interested, check out the extra problem in lab4 to see how one implementation from musl chooses to handle overlap. What does it do? (A sketch of one common overlap-handling strategy appears after this list.)
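
    As referenced in step 6, here is a sketch of one common way an implementation can handle overlap without a temporary buffer: pick the copy direction so that source bytes are never overwritten before they are read. This is only an illustration of the idea, not necessarily what musl does:

    #include <stddef.h>

    // Illustration only: copy front-to-back when dst is below src, back-to-front
    // when dst is above, so overlapping bytes are read before being clobbered.
    void *overlap_safe_copy(void *dst, const void *src, size_t n) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        if (d < s) {
            for (size_t i = 0; i < n; i++) d[i] = s[i];           // front to back
        } else if (d > s) {
            for (size_t i = n; i > 0; i--) d[i - 1] = s[i - 1];   // back to front
        }
        return dst;
    }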

[Optional] Extra Problem

Finished with lab and itching to further explore optimization? Check out our extra problem!

Check off with TA

Nice work on the lab! It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is needed to solidify your knowledge. Also take a chance to reflect on what you got from this lab and whether you feel ready for what comes next! The takeaway from this lab should be learning how to use valgrind's callgrind tool to analyze program efficiency and find bottlenecks. Your awesome assembly skills are just what is needed for examining the generated assembly and identifying gcc's cool transformations to make code run faster. Here are some questions to verify your understanding and get you thinking further about these concepts:

  1. Loop unrolling is the process of reworking a loop to accomplish more per iteration and iterate fewer times (a tiny illustration appears after these questions). In what situations might this provide a performance win?
  2. Avoiding the expense of function call/return is often cited as the motivation for inlining a function (though this is not as necessary with modern compilers). What are the estimated savings in terms of number of instructions? Inlining also enables additional optimization opportunities, given that gcc applies most transformations within a function, but not inter-procedurally. Do you expect these gains to be more or less significant than the call overhead? Explain.
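
    If the term is new to you, here is a tiny, hypothetical illustration of unrolling a summation loop by a factor of four (it assumes the length is a multiple of 4 to keep the sketch short):

    // Hypothetical example: both loops compute the same sum, but the unrolled
    // version does four additions per iteration and iterates a quarter as often.
    long sum_rolled(const int *a, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i];
        }
        return sum;
    }

    long sum_unrolled(const int *a, int n) {      // assumes n is a multiple of 4
        long sum = 0;
        for (int i = 0; i < n; i += 4) {
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        return sum;
    }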