Lab sessions Mon Mar 06 to Thu Mar 09
Lab written by Julie Zelenski
Learning goals
After this lab, you will have
- observed results of memory misuse detection by libc and valgrind
- experimented with different mechanisms for execution timing
- disassembled and compared code compiled with and without optimization
- run a profiler to analyze a program's runtime behavior
Share a computer with somebody new. Brag to each other about your plan to crush the heap allocator assignment.
Lab exercises
1. Get started.
Clone the lab starter project using the command
hg clone /afs/ir/class/cs107/repos/lab8/shared lab8
This creates the lab8 directory which contains source files and a Makefile.
Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
2. Understanding malloc
A conforming malloc implementation has requirements such as returning NULL on failure and freeing the old pointer when realloc moves the block. However, some behaviors are unspecified, such as what happens when freeing a pointer twice or realloc'ing a non-heap pointer. In those situations, implementations can, and do, behave differently. Given the heavy use of the allocator, it is critical that it run fast. Including code to deflect errant requests might help a careless programmer, but it would come at the expense of slowing down every call. This preference for efficiency over safety means that allocators typically blunder through bad calls.

The malloctest.c program demonstrates some incorrect uses of the allocator functions. The argument to the program is a number from 1 to 6 that identifies which error to make. Execute the program for each of the 6 arguments to see what happens on each error. In some cases, you might see no visible effects. At best, this means the allocator detected that your request was invalid and ignored it. At worst, it means you have subtly corrupted the heap and only later will the program happen upon the damage.
Run the program again for each of the 6 arguments, but this time run under valgrind and see what help it can provide. Valgrind's virtual environment instruments the code with extra memory checks and uses its own heap allocator with different behaviors such as inserting padding around heap blocks, never reusing memory, and storing the heap housekeeping information outside the heap so it is less likely to be stomped on. It also tracks when each address is written to detect use of uninitialized data and watches for leaks. These extra checks mean your program runs differently and more slowly under valgrind, but it can provide invaluable help in identifying and fixing memory errors and leaks. Which of the errors are caught and reported by Valgrind?
Optional malloc reverse-engineering challenge. The heap allocator on myth uses a pre-node header to track node housekeeping, like that discussed in lecture/textbook. By poking around, casting pointers, examining the memory, reading the disassembly, and so on, you can use your now-killer reverse-engineering skills to figure out the layout of the pre-node header. Given a pointer returned by malloc, how can you dig out the number of bytes allocated to that address? The true size can be larger than requested because of rounding/alignment. Run malloctest 7 to do some spying on the libc heap. While you're digging around, read <malloc.h> or the man page for mallinfo for some interesting additional allocator functions. The function malloc_stats prints a brief summary of the heap use. The function malloc_usable_size(void *) will return the size allocated to a malloc block. What happens if you pass a non-heap pointer to this function? Note these functions are not standard C and you should not rely on them in production code, but they can be handy for debugging.
3. Execution timing
The matrix, swap, and copy programs allow you to experiment with various means of timing a program's execution.
a) The matrix program sums the elements of a 2-dimensional array. The argument given to the program controls the order in which it accesses the elements and which timer is used to measure performance.
- The simplest way to get a rough runtime estimate is using the system process timer via the Unix time command. This measurement is not that precise, but it's convenient in that you can time any program without making code changes. Try time ./matrix rows and time ./matrix cols a few times and read the seconds reported. How consistent are the times from run to run? Even at this coarse granularity, you should observe a difference in summing by rows versus columns.
- The matrix program also includes code to time itself if you execute it with one of its two timer arguments (either gtd or rtc). The code for the gtd timer uses gettimeofday to access the OS clock and reports the time used in seconds. This is a simple and portable way to measure time, but its accuracy can vary depending on the system implementation. Execute a few runs of ./matrix gtd to get results from this timer. How much variability do you observe? How does the time reported compare to the results you got earlier from time?
- The rtc timer counts cycles by reading the chip's real-time cycle counter and does repeated trials until results converge (the K-best cycle timer from ch 5 of B&O). This is much more precise, but the technique is hardware-specific. Execute a few runs of ./matrix rtc to get results from this timer. How much variability do you observe?
- Use the timing data you've gathered to answer the question: does C arrange the elements of a 2-d array in row-major or column-major order?
b) The swap program explores the relative cost of stack versus heap allocation for the temporary space needed to swap values. The program sorts a large array using three different swap functions: one that uses a fixed-size stack buffer, one a variable-length stack buffer, and a third that uses a dynamically-allocated heap buffer. The program has code for all three versions, times each via RTC, and prints the timing results.
- Run the swap program several times and review the printed results. How do the strategies rank in terms of time required?
- Disassemble and compare the code for the fixed-stack versus variable-stack versions. Allocating a constant-length stack array takes one instruction (subtracting a compile-time constant from %rsp); how many additional instructions are inserted to set up a variable-length stack array?
c) The copy program explores the relative cost of different means of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and the number of iterations can have a profound impact on performance. The program experiments with a variety of ways: char-by-char in a loop, int-by-int in a loop, a memcpy per element in a loop, or a single call to memcpy or memmove for the entire block. The program uses the RTC counter to time the operation and prints the timing results.
- Run the copy program several times and look at the cycle counts for the char and int loops. About how much faster is copying by ints instead of chars? The int version demonstrates the value of "loop unrolling": amortizing the loop overhead by doing more work in each iteration and iterating fewer times.
- Compare the char/int loops to the memcpy loop (it memcpy's 4 bytes each iteration, the same number of iterations as the int loop) and learn the cost of using memcpy when a simple assignment would do -- ouch!
- Check out the cycle counts for the single memcpy/memmove calls. Wowzers -- somebody sure worked hard at optimizing the library routines! More than an order of magnitude speedup is achieved by memmove'ing the entire block rather than memcpy'ing each element individually. memcpy can have a performance advantage over memmove (since memcpy can assume its regions don't overlap while memmove has to deal with that possibility), but it is generally slight -- how much difference do you observe between the two?
4. Optimization
Higher levels of gcc optimization can do impressive things to streamline the generated assembly. By becoming knowledgeable of the compiler's capabilities, you'll be better able to focus your attention on passages that require human intervention and not bother where gcc can work its magic. The trials.c program has code to implement a few simple numeric operations in a variety of ways (some clever, some less so, and some laughably lame) and runs time trials on each. The Makefile builds two versions of the program: trials_unopt using -O0 and trials_opt using -O2.
- Before you run the program, first look at the code for the different implementations (version A, B, etc.) of the operations and make predictions about how you think the variants will stack up in performance and how large the differences will be. Run trials_unopt (the unoptimized version) and use the timings reported to find out if your intuition holds up. Having seen the timings, you must promise to never again invoke pow/exp2 to do what could have been a simple bit shift ...
- Now run trials_opt (the optimized version) and compare to the unoptimized times. Which operations gain large improvements from the optimizer? Which don't? Can you pick out some of the features that make code amenable to being optimized by the compiler?
- Pick a function or two that shows a large improvement when optimized. Examine the unoptimized and optimized disassembly to identify what transformations have been made by the compiler.
Just for fun: Here's a cute quiz on what gcc can/will optimize -- fun to check out!
Check off with TA
Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with the self-check.