Assignment 6: Leak detector

Due date: Sat May 21 11:59 pm - Hard deadline: Mon May 23 11:59 pm

Assignment by Julie Zelenski

Learning goals

After completing this assignment, you can proudly say that you have learned how to:

dissect an ELF object file
interpose on functions in the standard library
gather information from the runtime stack
build a useful developer tool!

The problem

Thank goodness for Valgrind, the C world without it would be a sad, lawless place. This week's assignment introduces you to the challenges and glory of implementing a tool like Valgrind. The true heroes of the software world are those programmers who apply their deep knowledge of systems to provide the debuggers, profilers, and simulators that enable all developers to be more productive!

The assignment consists of two tasks:

Implement symbols.c to harvest function symbol information from an ELF file. We provide the namelist program that builds on your symbols module to provide a simplified version of the nm utility. This intermediate milestone verifies your handling of the ELF symbol table.
Implement lmallc.c to add leak-checking to the standard heap. It intercepts calls to malloc/realloc/free, tracks in-use blocks, and reports any unfreed blocks at the time of program exit. You'll use your symbols module to attach symbolic names to the backtrace of function calls that led to a leak.

The ELF format

An object file is the result of compiling, assembling, and possibly linking, C source code. It contains global variables, string constants, function names, binary-encoded instructions, etc., written as a lump of binary data. Object files on the myth machines are in Executable and Linking Format (ELF). For this assignment, you must dig into the ELF file to extract the function symbols. The relevant ELF data to this task are:

File header. Every ELF file begins with a file header. The file header describes the object file characteristics such as endianness and architecture and provides a roadmap to the file contents. readelf -h filename prints the ELF file header.
Sections. The data within an ELF file is divided into sections. An object file commonly includes a text section (asm instructions), data section (global variables), symtab section (symbols), and strtab section (strings). The object file may also include sections for debugging, relocation, and other information.
List of section headers. This is the "table of contents" for the sections. The list contains a section header for every section in the file. A section header indicates the section type, an offset to the section data within the ELF file, and the section data size in bytes. readelf -S filename prints the list of section headers.
Symbol table section. The symtab section is a list of symbols. Each symbol is a function or global variable and has a type, address, and size. The symbol name is also included, but in a roundabout way, as a numeric offset into the symtab's companion string table section. readelf -s filename prints the symbol table section.
String table section. A strtab section is a sequence of strings laid out end-to-end with terminating null characters. readelf -x index filename prints the contents of the section data for the section header at the specified index in the list of section headers.

The diagram below shows the parts of the ELF file you will access (uint is used as an abbreviation for unsigned int):

ELF diagram

It takes a few hops within the ELF file to read the contents of a particular section: start at the file header, jump from there to the list of section headers, find the desired section header and then follow it to its associated section data. One neat feature of ELF is that it is designed to be directly accessed from its untranslated on-disk representation. For example, the list of section headers is stored as an array of section header structs; the symtab section is an array of symbol structs. You must use manual pointer arithmetic to calculate the location in the file where the array starts, but once there, you can assign the location to a pointer of the appropriate type, and treat it like an array, accessing entries using ordinary array notation.

Here are the steps to read the ELF symbols (data identified by color to match the diagram above):

ELF File header (yellow)
- Read data at offset 0 in file and compare to the expected 64-bit ELF header. If header is valid, read the entire ELF file into memory. The starter project includes code for this task.
- Use the offset and nsectionheaders fields from the file header to identify the location and length of the list of section headers. The offset is the number of bytes from the start of the ELF file to the start of the list (offset labeled 'c' in the diagram above).
Search list of section headers (blue) for symtab section header
- Treat location of the list as base address of array and loop over the array to find the section header whose type field is equal to SHT_SYMTAB. This is the header for the symbol table (symtab) section. An ELF file will have at most one symtab section.
- Use the offset and size fields from the symtab section header to identify the location and size of the symtab section data. Each section header contains an offset which is the number of bytes from the start of the ELF file to the start of its section data. (offset labeled 'a' in the diagram)
Read symbols from symtab section (red)
- Treat location of the symtab section data as base address of array and loop over the array to access each symbol.
- The strings for the symbol names are not stored within the symtab section. Instead, there is a companion string table (strtab) section that contains the strings for all symbol names. The symtab section header has a strtab_index field, which is the index into the list of section headers for the companion strtab section header. There can be more than one strtab section in an ELF file, so be sure to use strtab_index to access the correct section header.
- Access the strtab section header from the list of section headers and use the offset field from the strtab section header to identify the location of the strtab section data. (offset labeled 'b' in the diagram)
Read symbol names from strtab section (green)
- Apply a typecast to location of the strtab section data to access the strings. The strtab section data is a sequence of null-terminated strings laid out contiguously. A symbol's name field identifies where to find the symbol's name. The name offset is expressed as the number of bytes from the start of the strtab section data to the start of the symbol's name string. As an example, the name 'second' is at offset 12 for the strtab in the diagram.

The ELF file header (yellow) will always be at offset 0; the red, green, and blue pieces can be in a different relative order than shown in the diagram. Additional info on ELF can be found in the elf man page, the elf.h include file (on myth at /usr/include/elf.h), sections 7.3-7.5 of Bryant and O'Hallaron, ELF 101 stolen from HackerNews, and the full 100-page ELF specification (for all you insomniacs).

The symbols module and the namelist program

The shared symbols module that is compiled into both namelist and leak detector is responsible for extracting the symbol data from an ELF file. The symbols.h interface is comprised of three functions: one to harvest symbols, another to search for a symbol by address, and one to dispose of the symbols. You are to implement these functions to match their specification as given in the .h file. This very tiny interface is intentionally limited to exactly and only those operations needed by its two known clients. As a client, you will appreciate how ultra-simple the interface is and how it abstracts away the low-level grunginess of ELF, but as the implementor, you are responsible for making good on the contract. It will be your job to dig through through the raw ELF to find the appropriate data and then clean it up to repackage in the tidy format promised by the interface.

A few notes on the symbols implementation:

When harvesting symbols, return only defined function symbols with static or extern binding. A symbol must have type STT_FUNC, binding STB_GLOBAL or STB_LOCAL, and section_tag not SHN_UNDEF (not undefined); otherwise is not included in the harvest.
The returned array of harvested symbols is to be sorted in order of address range (i.e. a symbol spanning address range 0x10-0x14 precedes one at the higher range 0x132-0x154). This is to support the requirements of namelist as well to enable efficient search later.
The find operation searches for a symbol matching a target address. A target address matches a symbol if it is within the symbol's address range. If symbol binky starts at address 0x58 and has size 0x1a = 26 bytes, all addresses from 0x58 to (0x58 + 0x1a - 1 = 0x71) are within its range.
Use bsearch to efficiently search the sorted array. The same comparator should be used for both qsort and bsearch to ensure they operate consistently. When searching, the target address can be considered a symbol with a singleton range that starts at address and has size one. Your comparator should consider two symbols are equal if one symbol range encloses the other. A properly written comparator should be symmetric--- i.e. both arguments are of the same type and cmp(a, b) == -cmp(b, a).
For somewhat obscure reasons, there can be few cases of symbol "synonyms"--- two or more symbols with identical or overlapping ranges. Do not make a special case out of this, just allow the default behavior to proceed. Synonym symbols may compare as equal because of the overlap in their address range. They can be sorted in either order. If searching for an address that matches to two symbol ranges, it is allowable to match it to either.
One final oddity is the occasional zero-size symbol. Rather than make a special case out, it's fine to assume these don't exist. We won't test lookup on such symbols, so whether you find or don't find a zero-size symbol if searching for its start address doesn't matter.

Once you have implemented your symbols module, test it out via the namelist client program. This program is a simplified version of the standard nm utility. We give you the namelist program pre-written so once your symbols module is correctly implemented, you're already done with namelist! Here's how you can try it out:

The arguments to namelist are a required filename and an optional address. The filename is the name of an ELF object file. The optional additional argument is expected to be a well-formed hexadecimal address.
If invoked with just a filename, namelist prints every defined function symbol from that file. The symbols are printed one per line and listed in order of increasing address. The four columns are the address, size, binding (T for extern or t for static), and name.
If invoked with filename and hex target address, namelist prints the function symbol from that file whose address range contains the target address or indicates that no such symbol was found.

Compare the output from your namelist with the output from our solution version to verify your symbols module is working correctly. Pretty cool, you have just written your own version of nm!

Leak detector

The lmalloc.c module is responsible for providing leak detection for heap-allocated memory. The leak detector operates by spying on every allocation/deallocation request and keeping tabs on the in-use blocks. At program exit, it produces a leak report for those blocks that were allocated but never freed. The leak detector uses the symbols module to attach function names to the leak backtrace.

Let's explain a little more about how this is going to work. The leak detector maintains a list of in-use blocks. The list starts empty. When a new block is allocated, it is added to the list. When a block is freed, it is removed from the list. Any block remaining in the list at program exit is considered a leak. The in-use list is stored in a global array. Each array entry contains the information for a single in-use block: address, size, and callstack. The callstack is the sequence of function calls that led up to this allocation request. The GNU backtrace function is used to take a snapshot of the current callstack by reading the return addresses from the runtime stack.

At program exit, the leaks are summarized into a report. All leaks that originate from the same callstack are consolidated into a single "loss record". Two callstacks are considered equal if the have the same return addresses, i.e. both originate from the exact same code path. For example, if a loop executes 10 times and each iteration allocates a length 6 string, each of those allocations will have the same callstack. If any of those blocks are not freed, they be consolidated into one combined loss record. For each loss record, the report shows the total number of bytes, the number of blocks, and the callstack. When printing the callstack, each return address is matched to its function name using the symbols module. For two leaks to have the "same callstack", it means the exact same sequence of return addresses.

A sample leak report looks like this:

==sample_ld==   6 bytes in 1 block(s) are lost, allocated
==sample_ld==     by 0x4017dd: __wrap_malloc
==sample_ld==     by 0x41c48a: __strdup
==sample_ld==     by 0x4011d6: main
==sample_ld==  
==sample_ld==   12 bytes in 2 block(s) are lost, allocated
==sample_ld==     by 0x401810: __wrap_realloc
==sample_ld==     by 0x40115f: concatenate
==sample_ld==     by 0x4011bc: main
==sample_ld==  
==sample_ld==   100 bytes in 1 block(s) are lost, allocated
==sample_ld==     by 0x4017dd: __wrap_malloc
==sample_ld==     by 0x603094: <unknown>
==sample_ld==     by 0x401172: create_array
==sample_ld==     by 0x4011f0: main
==sample_ld==  
==sample_ld==  LEAK SUMMARY:
==sample_ld==     In use at exit: 118 bytes in 4 blocks

This report indicates the program exited with four leaked blocks consolidated into three loss records. The first was a 6-byte block allocated by malloc which was called by strdup which was called by main. The second loss record indicates two blocks totaling 12 bytes were leaked along the code path realloc which was called by concatenate which was called by main. The third loss record was one block of 100 bytes, allocated by malloc, which was called by an unidentified function, called by create_array called by main.

Interposing
The leak detector sets up wrapper functions around malloc, calloc, realloc, and free. These functions are named __wrap_malloc, __wrap_calloc, and so on. Each wrapper function calls the real function to do the work and then tacks on the processing to update the in-use list. For example, __wrap_malloc invokes the real malloc to allocate memory and then jots down the block info (address, size, callstack) and adds to in-use list. The __wrap_free calls the real free to deallocate the memory and then removes the block from the in-use list.

In order for this scheme to work, the client program has to call the wrapper function in place of the real function. Rather than require the client to manually make this change, we explore two automated ways to interpose on an existing function:

The cpp approach leverages a very simple (some might say even say "hacky") use of the C preprocessor. A #define malloc __wrap_malloc is introduced into the client program, causing every mention of malloc to be transformed into _wrap_malloc before the compiler even sees it. The client code is then re-compiled.
The ld approach works through the linker. Since the linker is responsible for resolving references, it is in the perfect position to perform the substitution we want. The gnu linker supports a wrap flag (-Wl,--wrap,malloc) that wires up every call to malloc to instead call _wrap_malloc. The client program does not need to be re-compiled, just re-linked.

A downside of the cpp approach is that the preprocessor will only wrap calls that are explicit in the client code. For example, if the client program uses strdup to copy a string, although that eventually calls malloc, if the call to malloc is not explicit in the client program, it will not be wrapped. However, the cpp approach works in any standard C environment and it gives the client control of exactly which calls are wrapped by careful application of the #define. The ld approach ensures that every call throughout the entire program is wrapped, not just the explicit ones. That is generally what is wanted, but allows thoroughly hosing things if you install a buggy wrapper for a critical function. It also relies on the linker to provide wrap support (non-standard). Section 7.13 of B&O gives more background information on library interpositioning including cpp (compile-time) and ld (link-time) approaches along with a third runtime strategy (dynamic loader).

Our makefile is configured for both approaches. From a client program binky.c, it will build three executables:

binky (compiled/linked in the ordinary way, no leak detection)
binky_cpp has leak detection, preprocessor #defines inserted into binky.c and re-compiled
binky_ld has leak detection, object files re-linked using the linker wrap option

If you run the regular binky program, it will execute normally, with no leak detection. The binky_cpp will execute with leak detection as installed by the cpp approach and binky_ld will execute with leak detection as installed by ld. The leak reports from binky_cpp and binky_ld may differ depending on whether all allocation calls are explicit in the client program.

A few notes on the leak detector implementation:

Before starting, read over the given lmalloc.c to see the "skeleton" of the module. The starter code defines the four wrapper functions and the module's init and finish functions. These functions don't yet do much, but they demonstrate the game plan for how the module will operate. You will edit/extend these functions and add new functions of your own.
You will design the data structures and operations to track the in-use blocks and prepare the leak report. Be sure to leverage stdlib where appropriate-- lfind and qsort will come in handy. When designing your data structures, remember that the functions in lmalloc.c will be injected into the code path for dynamic allocation. In this tricky position, it cannot also itself use dynamic memory, so there should be no calls to malloc/realloc/free on behalf of the leak detector itself. The leak detector should use stack memory wherever possible and global static data when necessary.
The backtrace function is a nifty GNU extension that trolls the current runtime stack to take a snapshot of the calling sequence. It uses hints provided by the compiler to determine the offsets between stack frames and reads the return addresses directly from stack memory at those offsets. Use backtrace to obtain the return addresses and then match each return address to its function name to produce a symbolic backtrace -- cool! Read backtrace man page for how to use the function. (Ignore the man page's companion backtrace_symbols functions, although it seems possibly relevant, they won't work in our context and the symbols module already provides the desired functionality).
When you need to harvest the symbols of the executing program, there is a special path you can open that refers to the currently running executable: FILE *fp = fopen("/proc/self/exe", "r")
Attempting to run your leak detector under Valgrind has interference due to both your code and Valgrind trying to interpose on the heap, so don't do that. But Valgrind can still be helpful here. Run the unadulterated programs (with no leak detection) under Valgrind to get a leak report. Run your leak detector-enabled version (outside Valgrind) to get your report. Now compare the two to verify you are finding the same leaks as Valgrind. Did you ever think you'd be capable of going head-to-head with such industrial-strength tool? You rock!

Required format for the leak report:
In order that your output can be correctly evaluated by the autotester, adhere to the format details listed below and take care to compare your output to the sample to ensure it matches.

All lines in the report are prefixed with the special marker ==program_name==. We provide the lm_printf function to automate this.
Leaks are consolidated into loss records. All leaks with the same callstack are consolidated into one loss record.
The loss records should be printed in order of increasing size. The loss record with fewest lost bytes is printed first, the one with the most is last. Break ties arbitrarily between loss records of the same size.
Each loss record includes the total number of bytes and blocks lost and the callstack.
Print the callstack one function per line starting from the inner call and stopping at main. For each call, print the return address using format %#lx and function name. If the return address cannot be matched to symbol, print the name as <unknown>. Stop the backtrace after printing main. Although there can be other frames behind main in the callstack, they are not of interest.
Note that the absolute addresses can vary depending on executable layout, but the function names will be stable.
The final part of the leak report is a summary of the total number of bytes and blocks lost.
Whitespace/empty lines are insignificant but take care to exactly match the wording and use of punctuation.

Summary of project requirements

The starter project include these files:
- Makefile The Makefile builds namelist and the client programs for testing leak detector. You can edit the Makefile to build additional client programs.
- symbols.c This module harvest symbols from an ELF file and provides functions to manipulate that symbol data. Read the interface in symbols.h and implement it in the symbols.c file
- lmalloc.c The module has a starting skeleton for the leak detector that you are to implement.
- leaky.c sieve.c wacky.c These sample programs use heap memory but are sloppy about deallocation. You can use/change/cannibalize these programs to test your leak detector.
There are also namelist.c and some header files. These files are supplied to you pre-written and should not be edited.
Output format. Please pay careful attention to the required output for the leak report, especially the wording, punctuation, and numeric formats.
Memory. We expect the symbols module and namelist to run cleanly under valgrind with no errors nor leaks reported. We will not test leak detection under valgrind.
Efficiency. We expect reasonable use of time/space for symbols/namelist. Efficiency is not emphasized for leak detector due to its large static data structures and linear-time algorithms to search/update.
Design/style/readability. This code has potential to become rather ugly if you aren't keeping your eye on your goal to write publication-quality code. Keep it clean!

Don't miss out on the companion advice page! Go to advice/FAQ page

Grading

Background information on how we grade assignments. The tentative rubric for this assignment is:

Functionality (100 points)

Sanity cases (45 points) The standard sanity check tests are used for sanity grading.
Comprehensive cases (35 points) symbols/namelist: symbols correctly harvested/found and correct information printed for varied ELF files. Leak detector: correct leak report for an assortment of programs including leaks whose backtraces come from diverse callstacks.
Robustness cases (10 points) Correct handling of required unusual/edge conditions.
Clean compile (4 points) We expect your code to compile cleanly without warnings.
Clean run under valgrind (4 points) Namelist is expected to have a clean run under valgrind. The leak detector will not be tested under valgrind.
Reasonable efficiency (4 points) We expect your code to be reasonably efficient in its use of space/time.

Code quality (buckets weighted to contribute ~25 points)

Here we will read and evaluate your code in areas such as:

Style and readability. We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
Clean use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, pointee types and typecasts used correctly, and so on. Show you are wise in the ways of pointers!
Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. Common code should be factored out and unified, not duplicated. Leverage existing C library functions rather than re-implement that functionality.

Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned.

Getting started

Check out a copy of the starter project from your cs107 repository using the command

hg clone /afs/ir/class/cs107/repos/assign6/$USER assign6

The assign6 samples directory is linked in your repo as slink and includes a sample namelist program and sample leak report.

Finishing

There is a sanity check that verifies output conformance of namelist and a sample leak report. Be sure that all debugging print statements are removed/disabled as extraneous output can interfere with the autotester. When finished, don't forget to submit your work for grading. If needed, familiarize yourself with our late policy.

How did this assignment go for you? Review the post-task self-check.