Assignment 6: Leak detector

Due date: Sat May 21 11:59 pm - Hard deadline: Mon May 23 11:59 pm

Assignment by Julie Zelenski

[Quick Links: Advice page, Grading]

Learning goals

After completing this assignment, you can proudly say that you have learned how to:

  1. dissect an ELF object file
  2. interpose on functions in the standard library
  3. gather information from the runtime stack
  4. build a useful developer tool!

The problem

Thank goodness for Valgrind, the C world without it would be a sad, lawless place. This week's assignment introduces you to the challenges and glory of implementing a tool like Valgrind. The true heroes of the software world are those programmers who apply their deep knowledge of systems to provide the debuggers, profilers, and simulators that enable all developers to be more productive!

The assignment consists of two tasks:

  1. Implement symbols.c to harvest function symbol information from an ELF file. We provide the namelist program that builds on your symbols module to provide a simplified version of the nm utility. This intermediate milestone verifies your handling of the ELF symbol table.
  2. Implement lmallc.c to add leak-checking to the standard heap. It intercepts calls to malloc/realloc/free, tracks in-use blocks, and reports any unfreed blocks at the time of program exit. You'll use your symbols module to attach symbolic names to the backtrace of function calls that led to a leak.

The ELF format

An object file is the result of compiling, assembling, and possibly linking, C source code. It contains global variables, string constants, function names, binary-encoded instructions, etc., written as a lump of binary data. Object files on the myth machines are in Executable and Linking Format (ELF). For this assignment, you must dig into the ELF file to extract the function symbols. The relevant ELF data to this task are:

The diagram below shows the parts of the ELF file you will access (uint is used as an abbreviation for unsigned int):

ELF diagram

It takes a few hops within the ELF file to read the contents of a particular section: start at the file header, jump from there to the list of section headers, find the desired section header and then follow it to its associated section data. One neat feature of ELF is that it is designed to be directly accessed from its untranslated on-disk representation. For example, the list of section headers is stored as an array of section header structs; the symtab section is an array of symbol structs. You must use manual pointer arithmetic to calculate the location in the file where the array starts, but once there, you can assign the location to a pointer of the appropriate type, and treat it like an array, accessing entries using ordinary array notation.

Here are the steps to read the ELF symbols (data identified by color to match the diagram above):

  1. ELF File header (yellow)
    • Read data at offset 0 in file and compare to the expected 64-bit ELF header. If header is valid, read the entire ELF file into memory. The starter project includes code for this task.
    • Use the offset and nsectionheaders fields from the file header to identify the location and length of the list of section headers. The offset is the number of bytes from the start of the ELF file to the start of the list (offset labeled 'c' in the diagram above).
  2. Search list of section headers (blue) for symtab section header
    • Treat location of the list as base address of array and loop over the array to find the section header whose type field is equal to SHT_SYMTAB. This is the header for the symbol table (symtab) section. An ELF file will have at most one symtab section.
    • Use the offset and size fields from the symtab section header to identify the location and size of the symtab section data. Each section header contains an offset which is the number of bytes from the start of the ELF file to the start of its section data. (offset labeled 'a' in the diagram)
  3. Read symbols from symtab section (red)
    • Treat location of the symtab section data as base address of array and loop over the array to access each symbol.
    • The strings for the symbol names are not stored within the symtab section. Instead, there is a companion string table (strtab) section that contains the strings for all symbol names. The symtab section header has a strtab_index field, which is the index into the list of section headers for the companion strtab section header. There can be more than one strtab section in an ELF file, so be sure to use strtab_index to access the correct section header.
    • Access the strtab section header from the list of section headers and use the offset field from the strtab section header to identify the location of the strtab section data. (offset labeled 'b' in the diagram)
  4. Read symbol names from strtab section (green)
    • Apply a typecast to location of the strtab section data to access the strings. The strtab section data is a sequence of null-terminated strings laid out contiguously. A symbol's name field identifies where to find the symbol's name. The name offset is expressed as the number of bytes from the start of the strtab section data to the start of the symbol's name string. As an example, the name 'second' is at offset 12 for the strtab in the diagram.

The ELF file header (yellow) will always be at offset 0; the red, green, and blue pieces can be in a different relative order than shown in the diagram. Additional info on ELF can be found in the elf man page, the elf.h include file (on myth at /usr/include/elf.h), sections 7.3-7.5 of Bryant and O'Hallaron, ELF 101 stolen from HackerNews, and the full 100-page ELF specification (for all you insomniacs).

The symbols module and the namelist program

The shared symbols module that is compiled into both namelist and leak detector is responsible for extracting the symbol data from an ELF file. The symbols.h interface is comprised of three functions: one to harvest symbols, another to search for a symbol by address, and one to dispose of the symbols. You are to implement these functions to match their specification as given in the .h file. This very tiny interface is intentionally limited to exactly and only those operations needed by its two known clients. As a client, you will appreciate how ultra-simple the interface is and how it abstracts away the low-level grunginess of ELF, but as the implementor, you are responsible for making good on the contract. It will be your job to dig through through the raw ELF to find the appropriate data and then clean it up to repackage in the tidy format promised by the interface.

A few notes on the symbols implementation:

Once you have implemented your symbols module, test it out via the namelist client program. This program is a simplified version of the standard nm utility. We give you the namelist program pre-written so once your symbols module is correctly implemented, you're already done with namelist! Here's how you can try it out:

Compare the output from your namelist with the output from our solution version to verify your symbols module is working correctly. Pretty cool, you have just written your own version of nm!

Leak detector

The lmalloc.c module is responsible for providing leak detection for heap-allocated memory. The leak detector operates by spying on every allocation/deallocation request and keeping tabs on the in-use blocks. At program exit, it produces a leak report for those blocks that were allocated but never freed. The leak detector uses the symbols module to attach function names to the leak backtrace.

Let's explain a little more about how this is going to work. The leak detector maintains a list of in-use blocks. The list starts empty. When a new block is allocated, it is added to the list. When a block is freed, it is removed from the list. Any block remaining in the list at program exit is considered a leak. The in-use list is stored in a global array. Each array entry contains the information for a single in-use block: address, size, and callstack. The callstack is the sequence of function calls that led up to this allocation request. The GNU backtrace function is used to take a snapshot of the current callstack by reading the return addresses from the runtime stack.

At program exit, the leaks are summarized into a report. All leaks that originate from the same callstack are consolidated into a single "loss record". Two callstacks are considered equal if the have the same return addresses, i.e. both originate from the exact same code path. For example, if a loop executes 10 times and each iteration allocates a length 6 string, each of those allocations will have the same callstack. If any of those blocks are not freed, they be consolidated into one combined loss record. For each loss record, the report shows the total number of bytes, the number of blocks, and the callstack. When printing the callstack, each return address is matched to its function name using the symbols module. For two leaks to have the "same callstack", it means the exact same sequence of return addresses.

A sample leak report looks like this:

==sample_ld==   6 bytes in 1 block(s) are lost, allocated
==sample_ld==     by 0x4017dd: __wrap_malloc
==sample_ld==     by 0x41c48a: __strdup
==sample_ld==     by 0x4011d6: main
==sample_ld==  
==sample_ld==   12 bytes in 2 block(s) are lost, allocated
==sample_ld==     by 0x401810: __wrap_realloc
==sample_ld==     by 0x40115f: concatenate
==sample_ld==     by 0x4011bc: main
==sample_ld==  
==sample_ld==   100 bytes in 1 block(s) are lost, allocated
==sample_ld==     by 0x4017dd: __wrap_malloc
==sample_ld==     by 0x603094: <unknown>
==sample_ld==     by 0x401172: create_array
==sample_ld==     by 0x4011f0: main
==sample_ld==  
==sample_ld==  LEAK SUMMARY:
==sample_ld==     In use at exit: 118 bytes in 4 blocks

This report indicates the program exited with four leaked blocks consolidated into three loss records. The first was a 6-byte block allocated by malloc which was called by strdup which was called by main. The second loss record indicates two blocks totaling 12 bytes were leaked along the code path realloc which was called by concatenate which was called by main. The third loss record was one block of 100 bytes, allocated by malloc, which was called by an unidentified function, called by create_array called by main.

Interposing
The leak detector sets up wrapper functions around malloc, calloc, realloc, and free. These functions are named __wrap_malloc, __wrap_calloc, and so on. Each wrapper function calls the real function to do the work and then tacks on the processing to update the in-use list. For example, __wrap_malloc invokes the real malloc to allocate memory and then jots down the block info (address, size, callstack) and adds to in-use list. The __wrap_free calls the real free to deallocate the memory and then removes the block from the in-use list.

In order for this scheme to work, the client program has to call the wrapper function in place of the real function. Rather than require the client to manually make this change, we explore two automated ways to interpose on an existing function:

A downside of the cpp approach is that the preprocessor will only wrap calls that are explicit in the client code. For example, if the client program uses strdup to copy a string, although that eventually calls malloc, if the call to malloc is not explicit in the client program, it will not be wrapped. However, the cpp approach works in any standard C environment and it gives the client control of exactly which calls are wrapped by careful application of the #define. The ld approach ensures that every call throughout the entire program is wrapped, not just the explicit ones. That is generally what is wanted, but allows thoroughly hosing things if you install a buggy wrapper for a critical function. It also relies on the linker to provide wrap support (non-standard). Section 7.13 of B&O gives more background information on library interpositioning including cpp (compile-time) and ld (link-time) approaches along with a third runtime strategy (dynamic loader).

Our makefile is configured for both approaches. From a client program binky.c, it will build three executables:

  1. binky (compiled/linked in the ordinary way, no leak detection)
  2. binky_cpp has leak detection, preprocessor #defines inserted into binky.c and re-compiled
  3. binky_ld has leak detection, object files re-linked using the linker wrap option

If you run the regular binky program, it will execute normally, with no leak detection. The binky_cpp will execute with leak detection as installed by the cpp approach and binky_ld will execute with leak detection as installed by ld. The leak reports from binky_cpp and binky_ld may differ depending on whether all allocation calls are explicit in the client program.

A few notes on the leak detector implementation:

Required format for the leak report:
In order that your output can be correctly evaluated by the autotester, adhere to the format details listed below and take care to compare your output to the sample to ensure it matches.

Summary of project requirements

Don't miss out on the companion advice page! Go to advice/FAQ page

Grading

Background information on how we grade assignments. The tentative rubric for this assignment is:

Functionality (100 points)

Code quality (buckets weighted to contribute ~25 points)

Here we will read and evaluate your code in areas such as:

Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned.

Getting started

Check out a copy of the starter project from your cs107 repository using the command

hg clone /afs/ir/class/cs107/repos/assign6/$USER assign6

The assign6 samples directory is linked in your repo as slink and includes a sample namelist program and sample leak report.

Finishing

There is a sanity check that verifies output conformance of namelist and a sample leak report. Be sure that all debugging print statements are removed/disabled as extraneous output can interfere with the autotester. When finished, don't forget to submit your work for grading. If needed, familiarize yourself with our late policy.

How did this assignment go for you? Review the post-task self-check.