Lab sessions Tue May 29 to Sat Jun 02
Lab written by Julie Zelenski
Learning goals
After this lab, you will have
- experimented with different mechanisms for execution timing
- run a profiler to get dynamic instruction counts
Share a computer with somebody new. Brag to each other about your plan to crush the heap allocator.
Get started
Clone the lab8 repo by using the command below.
```
git clone /afs/ir/class/cs107/repos/lab8/shared lab8
```
Open the lab checkoff form.
Lab exercises
1) Timing tools
A simple means of measuring runtime is the system process timer, accessed via the Unix `time` command. This measurement is fairly coarse and is limited to timing the entire program, but it's convenient in that you can time any program, whether or not you have the source, and without making code changes. Let's try it now!
- The `samples` subdir for this lab contains `mysort_soln` from assign4 and `bbsort` from assign5, along with input files of random numbers. Compare the time for these two programs with the standard unix `sort` command on the same input. (Note: the `>/dev/null` at the end of the command suppresses output so you can more easily see the timing information.)

  ```
  time samples/mysort_soln -n samples/million >/dev/null
  time samples/bbsort -n samples/million >/dev/null
  time sort -n samples/million >/dev/null
  ```

  The results should show that the runtimes for bbsort and mysort are comparable, but standard sort is much faster. Let's investigate further!
- A *profiler* is a developer tool that observes an executing program to gather detailed statistics about resource consumption such as time spent, instructions executed, memory used, and so on. Although our primary use of valgrind has been for help finding memory errors and leaks, it can also operate as a profiler by using the `callgrind` tool to count instructions and memory accesses. (More info in the CS107 guide to callgrind.)

  ```
  valgrind --tool=callgrind samples/mysort_soln -n samples/thousand >/dev/null
  valgrind --tool=callgrind samples/bbsort -n samples/thousand >/dev/null
  valgrind --tool=callgrind sort -n samples/thousand >/dev/null
  ```

  Run each of the three sort programs under callgrind and find the number labeled `I refs` in the output; this is the total count of instructions executed. As expected from the earlier timing data, the instruction counts for bbsort and mysort are comparable, but the standard sort is doing a lot less work.
- In addition to summarizing the total instruction count, callgrind also writes detailed information to a file named callgrind.out.pid, where pid is the process number shown in the left column of the callgrind summary (e.g. `==29240==`). You run the annotator on the output file to get a breakdown of instruction count per function or line. Try it now (replacing pid below with your process id):

  ```
  callgrind_annotate callgrind.out.pid
  ```

  If you run the annotator on the callgrind output for mysort and bbsort, you will see that one single function accounts for the vast majority of the instructions executed. What function is it? Does that function suggest an explanation for what the built-in sort is likely doing that makes it so much faster?
2) Math: integer, float, and bitwise
The trials.c program has code to implement some numeric operations in a variety of ways (some clever, some less so, and some rather lame). Let's find out how these different approaches compare in terms of instruction counts.
- The `two_to_power` functions are various ways of computing 2^n.
  - Versions A and B use the math library routines `pow` and `exp2`. Library routines are fast, right? Well, the math library routines work on float/double types -- scroll through the 300+ lines of the musl implementation of pow to get a glimpse of what is involved. Integers don't get a free ride here; they must be converted to float and operated on using floating point instructions. Do you wonder how expensive that will be?
  - Version C is a loop that iterates n times, multiplying by 2. Look at the assembly for the function. Is there an actual `mul` instruction in there, or did the clever compiler come up with something better?
  - Version D capitalizes on the relationship between bit patterns and binary numbers. That seems very likely to be our winner. How much better do you expect this to be than the others?
- The various `is_power` functions show a variety of approaches for testing whether an integer is an exact power of 2. How does each of these work?
- After reviewing the C code, talk it over with your partner and make some predictions about the relative performance of the `two_to_power` and `is_power` functions.
- Now let's do some counting. First use `objdump` or gdb `disassemble` to get the assembly for each function and count instructions. This is the static count. In what situations is it expected that the static count will match the dynamic count?
- Use callgrind to get the dynamic counts. The toggle-collect option in the command below tells callgrind to only count instructions within functions whose names include power, so as to focus on the functions of interest:

  ```
  valgrind --tool=callgrind --toggle-collect='*power*' ./trials
  ```
- Use `callgrind_annotate` to break down instruction counts per function/line:

  ```
  callgrind_annotate --auto=yes callgrind.out.pid
  ```

  When invoked with the `--auto=yes` option, the annotator displays the original C source file and annotates each line with the number of assembly instructions executed on its behalf. You can identify performance bottlenecks by scanning the annotated result to find those lines with large counts. A larger count means the line was executed more times and/or it corresponds to many assembly instructions. These are the program's "hot spots": a lot of instructions are being executed in these sections, and these are the passages to streamline to improve performance.
- Compare the counts for the various `two_to_power` functions. The bitwise version is untouchably fast (just a few instructions!). How much more work is involved in the other options? Did this surprise you?
- Compare the counts for the `is_power` functions. The bitwise approach is out in front, but the switch and popcount versions are not far behind. Which version is bringing up the rear?
Re-implementing functionality already present in the C library is generally a big loss. The library routines are already written, tested, debugged, and aggressively optimized -- what's not to like? However, you do need to take care to choose the right tool for the job. Knowing the cost/benefit of the library functions and being knowledgeable about the compiler's optimizing capabilities will allow you to better focus your attention on passages that require human intervention and not bother where the library functions and gcc can work their magic on your behalf. You must promise to never again use a floating point math function to do what could have been a simple integer or bitwise operation!
Try your hand at this quiz on what gcc can/will optimize -- fun and informative!
3) Copying data
The copy program explores the relative cost of different means of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and the number of iterations can have a profound impact on performance. The program experiments with a variety of ways: chunk-by-chunk in a loop, a memcpy in a loop, or a single call to memcpy or memmove for the entire block.
- Run under callgrind and annotate the output:

  ```
  valgrind --tool=callgrind --toggle-collect='copy*' ./copy
  callgrind_annotate --auto=yes callgrind.out.pid
  ```

- Look at the instruction counts for the char, int, and long loops. About how many fewer instructions does copying by ints take than copying by chars? By longs instead of ints? Copying a larger chunk demonstrates the value of "loop unrolling" to amortize the loop overhead by doing more work in each iteration and iterating fewer times.
- Compare the instruction counts in the body of the long loop to the body of the memcpy loop (memcpy 8 bytes each iteration, same number of iterations as the long loop) and learn the cost of using memcpy when a simple assignment would do -- ouch!
- Check out the instruction counts for the single block memcpy/memmove. Wowzers -- somebody sure worked hard at optimizing the library routines! A speedup of an order of magnitude is achieved from copying the entire block in one go rather than copying each element individually.
- memcpy can have a performance advantage over memmove, since memcpy can ignore overlapping regions while memmove has to deal with them. Read the man page for `memmove` to see what it says about handling overlap. The wording seems to say that the data is copied twice (once from src to tmp and again from tmp to dst), which would be a noticeable jump in instruction count, but the very slight rise in the instruction counts for memmove versus memcpy suggests otherwise. Do you remember what the musl memmove (review lab4) did instead to handle overlap?
4) Stack vs heap allocation
Stack allocation is preferable to heap allocation for a multitude of reasons (type-safety, automatic allocation/deallocation, cache-friendly), while also offering significant performance advantages. Let's use callgrind to profile a sorting program to explore the relative runtime cost of using the stack versus the heap.
The isort program implements a simple insertion sort. When sorting the array, it will need temporary space when swapping array elements and the program explores the performance impact of allocating that temporary space using a fixed-size stack buffer, a variable-length stack buffer, or dynamically allocating from the heap.
- Run under callgrind and annotate the output:

  ```
  valgrind --tool=callgrind --toggle-collect='swap_*' ./isort
  callgrind_annotate --auto=yes callgrind.out.pid
  ```

- All three swap functions have the same instruction count in the three memcpy calls; the issue to examine is the instructions spent setting up the tmp memory and cleaning it up.
- Find the instruction counts for the `swap_fixedstack` and `swap_varstack` functions. The lines corresponding to the function opening/closing brace and variable declaration are annotated with counts for function entry/exit. What is the difference in the number of instructions executed on entry and exit to `swap_fixedstack` vs `swap_varstack`?
- Compare the disassembly for these two functions. How many instructions are required to set up and clean up a constant-length stack array? How many additional instructions are needed for a variable-length stack array? Does the count of additional instructions in the disassembly jibe with the count of executed instructions reported by callgrind?
- Now look at the instruction counts for `swap_heap`. The annotation will indicate a small-ish number for the instructions that set up the call to malloc/free and a much larger number of instructions executed within the malloc and free library functions. About how many instructions does each call to malloc incur? Each call to free? How does that compare to the cost of allocating and deallocating space on the stack?
Check off with TA
Before you leave, be sure to submit your checkoff sheet (in the browser) and have a lab TA come by and confirm so you will be properly credited for the lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with the self check.