Due date: Sat May 21 11:59 pm - Hard deadline: Mon May 23 11:59 pm
Assignment by Julie Zelenski
[Quick Links: Advice page, Grading]
After completing this assignment, you can proudly say that you have learned how to:
Thank goodness for Valgrind, the C world without it would be a sad, lawless place. This week's assignment introduces you to the challenges and glory of implementing a tool like Valgrind. The true heroes of the software world are those programmers who apply their deep knowledge of systems to provide the debuggers, profilers, and simulators that enable all developers to be more productive!
The assignment consists of two tasks:
symbols.c
to harvest function symbol information from an ELF file. We provide the namelist
program that builds on your symbols
module to provide a simplified version of the nm
utility. This intermediate milestone verifies your handling of the ELF symbol table.lmallc.c
to add leak-checking to the standard heap. It intercepts calls to malloc/realloc/free, tracks in-use blocks, and reports any unfreed blocks at the time of program exit. You'll use your symbols
module to attach symbolic names to the backtrace of function calls that led to a leak.An object file is the result of compiling, assembling, and possibly linking, C source code. It contains global variables, string constants, function names, binary-encoded instructions, etc., written as a lump of binary data. Object files on the myth machines are in Executable and Linking Format (ELF). For this assignment, you must dig into the ELF file to extract the function symbols. The relevant ELF data to this task are:
readelf -h filename
prints the ELF file header.readelf -S filename
prints the list of section headers.symtab
section is a list of symbols. Each symbol is a function or global variable and has a type, address, and size. The symbol name is also included, but in a roundabout way, as a numeric offset into the symtab's companion string table section. readelf -s filename
prints the symbol table section.strtab
section is a sequence of strings laid out end-to-end with terminating null characters. readelf -x index filename
prints the contents of the section data for the section header at the specified index in the list of section headers.The diagram below shows the parts of the ELF file you will access (uint
is used as an abbreviation for unsigned int):
It takes a few hops within the ELF file to read the contents of a particular section: start at the file header, jump from there to the list of section headers, find the desired section header and then follow it to its associated section data. One neat feature of ELF is that it is designed to be directly accessed from its untranslated on-disk representation. For example, the list of section headers is stored as an array of section header structs; the symtab section is an array of symbol structs. You must use manual pointer arithmetic to calculate the location in the file where the array starts, but once there, you can assign the location to a pointer of the appropriate type, and treat it like an array, accessing entries using ordinary array notation.
Here are the steps to read the ELF symbols (data identified by color to match the diagram above):
offset
and nsectionheaders
fields from the file header to identify the location and length of the list of section headers. The offset is the number of bytes from the start of the ELF file to the start of the list (offset labeled 'c' in the diagram above).type
field is equal to SHT_SYMTAB
. This is the header for the symbol table (symtab) section. An ELF file will have at most one symtab section.offset
and size
fields from the symtab section header to identify the location and size of the symtab section data. Each section header contains an offset which is the number of bytes from the start of the ELF file to the start of its section data. (offset labeled 'a' in the diagram)strtab_index
field, which is the index into the list of section headers for the companion strtab section header. There can be more than one strtab section in an ELF file, so be sure to use strtab_index
to access the correct section header.offset
field from the strtab section header to identify the location of the strtab section data. (offset labeled 'b' in the diagram)name
field identifies where to find the symbol's name. The name
offset is expressed as the number of bytes from the start of the strtab section data to the start of the symbol's name string. As an example, the name 'second' is at offset 12 for the strtab in the diagram.The ELF file header (yellow) will always be at offset 0; the red, green, and blue pieces can be in a different relative order than shown in the diagram. Additional info on ELF can be found in the elf
man page, the elf.h include file (on myth at /usr/include/elf.h), sections 7.3-7.5 of Bryant and O'Hallaron, ELF 101 stolen from HackerNews, and the full 100-page ELF specification (for all you insomniacs).
The shared symbols
module that is compiled into both namelist and leak detector is responsible for extracting the symbol data from an ELF file. The symbols.h
interface is comprised of three functions: one to harvest symbols, another to search for a symbol by address, and one to dispose of the symbols. You are to implement these functions to match their specification as given in the .h file. This very tiny interface is intentionally limited to exactly and only those operations needed by its two known clients. As a client, you will appreciate how ultra-simple the interface is and how it abstracts away the low-level grunginess of ELF, but as the implementor, you are responsible for making good on the contract. It will be your job to dig through through the raw ELF to find the appropriate data and then clean it up to repackage in the tidy format promised by the interface.
A few notes on the symbols implementation:
STT_FUNC
, binding STB_GLOBAL
or STB_LOCAL
, and section_tag not SHN_UNDEF
(not undefined); otherwise is not included in the harvest.binky
starts at address 0x58 and has size 0x1a = 26 bytes, all addresses from 0x58 to (0x58 + 0x1a - 1 = 0x71) are within its range.bsearch
to efficiently search the sorted array. The same comparator should be used for both qsort and bsearch to ensure they operate consistently. When searching, the target address can be considered a symbol with a singleton range that starts at address and has size one. Your comparator should consider two symbols are equal if one symbol range encloses the other. A properly written comparator should be symmetric--- i.e. both arguments are of the same type and cmp(a, b) == -cmp(b, a)
. Once you have implemented your symbols module, test it out via the namelist
client program. This program is a simplified version of the standard nm
utility. We give you the namelist
program pre-written so once your symbols module is correctly implemented, you're already done with namelist! Here's how you can try it out:
Compare the output from your namelist with the output from our solution version to verify your symbols module is working correctly. Pretty cool, you have just written your own version of nm
!
The lmalloc.c
module is responsible for providing leak detection for heap-allocated memory. The leak detector operates by spying on every allocation/deallocation request and keeping tabs on the in-use blocks. At program exit, it produces a leak report for those blocks that were allocated but never freed. The leak detector uses the symbols module to attach function names to the leak backtrace.
Let's explain a little more about how this is going to work. The leak detector maintains a list of in-use blocks. The list starts empty. When a new block is allocated, it is added to the list. When a block is freed, it is removed from the list. Any block remaining in the list at program exit is considered a leak. The in-use list is stored in a global array. Each array entry contains the information for a single in-use block: address, size, and callstack. The callstack is the sequence of function calls that led up to this allocation request. The GNU backtrace
function is used to take a snapshot of the current callstack by reading the return addresses from the runtime stack.
At program exit, the leaks are summarized into a report. All leaks that originate from the same callstack are consolidated into a single "loss record". Two callstacks are considered equal if the have the same return addresses, i.e. both originate from the exact same code path. For example, if a loop executes 10 times and each iteration allocates a length 6 string, each of those allocations will have the same callstack. If any of those blocks are not freed, they be consolidated into one combined loss record. For each loss record, the report shows the total number of bytes, the number of blocks, and the callstack. When printing the callstack, each return address is matched to its function name using the symbols module. For two leaks to have the "same callstack", it means the exact same sequence of return addresses.
A sample leak report looks like this:
==sample_ld== 6 bytes in 1 block(s) are lost, allocated
==sample_ld== by 0x4017dd: __wrap_malloc
==sample_ld== by 0x41c48a: __strdup
==sample_ld== by 0x4011d6: main
==sample_ld==
==sample_ld== 12 bytes in 2 block(s) are lost, allocated
==sample_ld== by 0x401810: __wrap_realloc
==sample_ld== by 0x40115f: concatenate
==sample_ld== by 0x4011bc: main
==sample_ld==
==sample_ld== 100 bytes in 1 block(s) are lost, allocated
==sample_ld== by 0x4017dd: __wrap_malloc
==sample_ld== by 0x603094: <unknown>
==sample_ld== by 0x401172: create_array
==sample_ld== by 0x4011f0: main
==sample_ld==
==sample_ld== LEAK SUMMARY:
==sample_ld== In use at exit: 118 bytes in 4 blocks
This report indicates the program exited with four leaked blocks consolidated into three loss records. The first was a 6-byte block allocated by malloc
which was called by strdup
which was called by main
. The second loss record indicates two blocks totaling 12 bytes were leaked along the code path realloc
which was called by concatenate
which was called by main
. The third loss record was one block of 100 bytes, allocated by malloc
, which was called by an unidentified function, called by create_array
called by main
.
Interposing
The leak detector sets up wrapper functions around malloc
, calloc
, realloc
, and free
. These functions are named __wrap_malloc
, __wrap_calloc
, and so on. Each wrapper function calls the real function to do the work and then tacks on the processing to update the in-use list. For example, __wrap_malloc
invokes the real malloc
to allocate memory and then jots down the block info (address, size, callstack) and adds to in-use list. The __wrap_free
calls the real free
to deallocate the memory and then removes the block from the in-use list.
In order for this scheme to work, the client program has to call the wrapper function in place of the real function. Rather than require the client to manually make this change, we explore two automated ways to interpose on an existing function:
The cpp
approach leverages a very simple (some might say even say "hacky") use of the C preprocessor. A #define malloc __wrap_malloc
is introduced into the client program, causing every mention of malloc to be transformed into _wrap_malloc before the compiler even sees it. The client code is then re-compiled.
The ld
approach works through the linker. Since the linker is responsible for resolving references, it is in the perfect position to perform the substitution we want. The gnu linker supports a wrap flag (-Wl,--wrap,malloc
) that wires up every call to malloc to instead call _wrap_malloc. The client program does not need to be re-compiled, just re-linked.
A downside of the cpp
approach is that the preprocessor will only wrap calls that are explicit in the client code. For example, if the client program uses strdup
to copy a string, although that eventually calls malloc
, if the call to malloc
is not explicit in the client program, it will not be wrapped. However, the cpp
approach works in any standard C environment and it gives the client control of exactly which calls are wrapped by careful application of the #define. The ld
approach ensures that every call throughout the entire program is wrapped, not just the explicit ones. That is generally what is wanted, but allows thoroughly hosing things if you install a buggy wrapper for a critical function. It also relies on the linker to provide wrap support (non-standard). Section 7.13 of B&O gives more background information on library interpositioning including cpp (compile-time) and ld (link-time) approaches along with a third runtime strategy (dynamic loader).
Our makefile is configured for both approaches. From a client program binky.c
, it will build three executables:
binky
(compiled/linked in the ordinary way, no leak detection)binky_cpp
has leak detection, preprocessor #defines inserted into binky.c and re-compiled binky_ld
has leak detection, object files re-linked using the linker wrap option If you run the regular binky
program, it will execute normally, with no leak detection. The binky_cpp
will execute with leak detection as installed by the cpp
approach and binky_ld
will execute with leak detection as installed by ld
. The leak reports from binky_cpp
and binky_ld
may differ depending on whether all allocation calls are explicit in the client program.
A few notes on the leak detector implementation:
lmalloc.c
to see the "skeleton" of the module. The starter code defines the four wrapper functions and the module's init and finish functions. These functions don't yet do much, but they demonstrate the game plan for how the module will operate. You will edit/extend these functions and add new functions of your own.lfind
and qsort
will come in handy. When designing your data structures, remember that the functions in lmalloc.c
will be injected into the code path for dynamic allocation. In this tricky position, it cannot also itself use dynamic memory, so there should be no calls to malloc/realloc/free on behalf of the leak detector itself. The leak detector should use stack memory wherever possible and global static data when necessary.backtrace
function is a nifty GNU extension that trolls the current runtime stack to take a snapshot of the calling sequence. It uses hints provided by the compiler to determine the offsets between stack frames and reads the return addresses directly from stack memory at those offsets. Use backtrace
to obtain the return addresses and then match each return address to its function name to produce a symbolic backtrace -- cool! Read backtrace
man page for how to use the function. (Ignore the man page's companion backtrace_symbols
functions, although it seems possibly relevant, they won't work in our context and the symbols module already provides the desired functionality).FILE *fp = fopen("/proc/self/exe", "r")
Required format for the leak report:
In order that your output can be correctly evaluated by the autotester, adhere to the format details listed below and take care to compare your output to the sample to ensure it matches.
==program_name==
. We provide the lm_printf
function to automate this.main
. For each call, print the return address using format %#lx
and function name. If the return address cannot be matched to symbol, print the name as <unknown>
. Stop the backtrace after printing main
. Although there can be other frames behind main in the callstack, they are not of interest.The starter project include these files:
Makefile
The Makefile builds namelist
and the client programs for testing leak detector. You can edit the Makefile to build additional client programs.symbols.c
This module harvest symbols from an ELF file and provides functions to manipulate that symbol data. Read the interface in symbols.h and implement it in the symbols.c filelmalloc.c
The module has a starting skeleton for the leak detector that you are to implement.leaky.c
sieve.c
wacky.c
These sample programs use heap memory but are sloppy about deallocation. You can use/change/cannibalize these programs to test your leak detector.There are also namelist.c
and some header files. These files are supplied to you pre-written and should not be edited.
Output format. Please pay careful attention to the required output for the leak report, especially the wording, punctuation, and numeric formats.
Don't miss out on the companion advice page! Go to advice/FAQ page
Background information on how we grade assignments. The tentative rubric for this assignment is:
Functionality (100 points)
Code quality (buckets weighted to contribute ~25 points)
Here we will read and evaluate your code in areas such as:
Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned.
Check out a copy of the starter project from your cs107 repository using the command
hg clone /afs/ir/class/cs107/repos/assign6/$USER assign6
The assign6 samples directory is linked in your repo as slink
and includes a sample namelist program and sample leak report.
There is a sanity check that verifies output conformance of namelist and a sample leak report. Be sure that all debugging print statements are removed/disabled as extraneous output can interfere with the autotester. When finished, don't forget to submit your work for grading. If needed, familiarize yourself with our late policy.
How did this assignment go for you? Review the post-task self-check.