Assignment 6 advice

Written by Julie Zelenski

Getting oriented

This is your first foray into the nitty-gritty of systems program. The good news is there is not a particularly large amount of code to write and it doesn't involve any tricky algorithms. Past students have told us the most daunting issues were getting started in the first place and keeping track of the big picture: which data is where, how to access it, and how the parts relate. So take a deep breath, spend time up front to get oriented, ask questions, draw pictures, and poke around in gdb to be sure your understanding of the big picture is solid before tackling the code. Jump to the FAQ and read the question about the distinction in the two kinds of addresses being manipulated as an important bit of advice.

Here is a recommended plan of attack when ready to start coding:

Symbols
1. B&O 7.3-7.5 is good background reading on ELF files. There is a lot of complexity to the format, but take comfort in the fact that you only need to extract a little slice of it.
2. Use the step-by-step instructions from the assignment writeup as a guide for writing harvest_symbols.
3. Implement find_symbol to match symbol by address. With a correctly written comparator, qsort and bsearch work together like a dream.
Namelist

No coding for this step (we provide namelist pre-written), but stop here and use namelist to thoroughly vet your symbols functions, especially lookup by address. It will be much easier to find and fix bugs here than having to hunt them down while working on leak detector.
Leak detector
1. Review skeleton code in lmalloc.c to understand the lay of the land
2. Add tracking for in-use blocks (no backtrace snapshot yet) and produce an unconsolidated leak report
3. Use backtrace to snapshot callstack on allocation and add to leak report (no function names yet)
4. Consolidate leaks into loss reports by matching callstacks
5. Add function names to callstack printed in leak report

Keeping your addresses straight

Any address just refers a location in memory, but knowing what that address represents and how it relates to other addresses can be critical. In this program, there are two distinct kinds of addresses being manipulated: symbol addresses and heap block addresses. First consider the symbol addresses. A symbol, e.g. main, occupies a range in memory where its compiled instructions are placed. Use objdump -d leaky to disassemble the executable. Find main in the disassembly listing and look at the addresses along the left side. Each address corresponds to an instruction of the main function. All of those instructions together form the range for the main symbol. The symbol table in an ELF file stores information about each symbol, including its name and address range. Use nm leaky to dump the symbol table and look for the entry for main and verify it has same address you saw in the objdump listing. The namelist program searches the symbol table to find the entry whose range encloses a target address. Now consider how these symbol addresses relate to an executing program. A symbol address is the address of an instruction, located within the text (code) segment of the executing program. The return address stored on the stack by a call instruction is also the address of the instruction. If you look up return address to find its match in the symbol table, you get the name of the function it corresponds to. The address range for a function symbol and the return addresses being read from the stack are the same kind of thing: address of an instruction.

The leak detector is also tracking a different kind of address: addresses that correspond to heap blocks. These are addressees returned by a a call to malloc and the leak detector tracks these addresses to confirm that they are later freed. These are address of data, located with the heap segment of the executing program. The address of the heap block is not a symbol address and it not expected to be found within a symbol address range nor will it match any return addresses found on the stack.

Nothing good comes from conflating the two kinds of addresses, so take the time to understand the difference and save yourself much headache and heartache.

Pro tip: gdb is wise to symbol addresses. Printing a pointer usually just prints the address, but if that address is within the range of a known symbol, gdb will also include a annotation with the symbol name and offset. The gdb run shown below prints two different addresses. gdb finds a match for the first address within the symbol table, the second one is not a match and gets no annotation.

(gdb) p (void *)0x400730
$3 = (void *) 0x400730 <singleton+11>
(gdb) p (void *)0x80329c
$4 = (void *)0x80329c

Asking gdb to print an address can be a quick way to validate whether that address lies within the range of a symbol. Pretty neat!

Code quality

Do not throw typecasts around indiscriminately. Whenever possible, you should work within the type system, not against it. Choose the appropriate type for a value, declare it properly, and stick with that type. In the rare situations where you must overrule the declared type, the typecast can be your friend, but it is your worst enemy when used inappropriately. Make it a rule to stop and think twice before inserting a cast and justify to yourself why each cast is necessary and correct.
Accessing the data within the ELF file requires a little pointer arithmetic with an offset and one typecast to the content type, but once oriented, you can (and should) use ordinary C array/struct access. In our solution, there is a single typecast per each layer: a typecast on the location to access the file header, another typecast on the location of the list containing section header structs, another typecast on the location of the section data containing symbol structs, and a typecast on the location of the section data containing name strings.
Watch out for careless co-mingling of int, long, and numeric pointer types, or haphazard coercing/casting between them. In the ELF file, symbol addresses are stored as the type uintptr_t. This type is declared in <stdint.h> as an unsigned long guaranteed to have sufficient range to store any address. This type is intended for addresses that you intend to manipulate numerically, it is it is compatible with other numeric types, but it is not a pointer type and cannot be dereferenced. The symbols module, namelist program, and leak detector manipulate symbol addresses solely as numbers -- the addresses are never dereferenced or otherwise treated as pointers so uintptr_t is an appropriate choice to standardize on. Starting with a thoughtful choice of whether to declare an address as numeric type versus pointer type will lead to fewer conversions later and cleaner code.

Testing

One complicating factor for testing leak detector is that are injecting your code into program-critical code paths. The consequence of adding buggy code to every malloc call is sure to be a lot of trouble! Do your initial testing using the cpp approach as it only wraps the explicit calls from the client program which limits the potential for havoc.
Be sure to develop and test on myth to avoid any surprises from platform-dependent discrepancies.

Program facts

Code size. Our solution has 200 total lines (100 symbol.c, 100 in lmalloc.c) about third of these lines are given in starter.
Time spent. The leak detector is new this quarter, so we don't have any historical data on yet. It does share a lot with the assignment used in previous quarters, for which the students self-reported this amount of time spent:
- < 5 hours 2%
- 5-10 hours 14%
- 10-15 hours 42%
- 15-20 hours 29%
- 20-25 hours 12%
- more than 25 hours 3%

Frequently asked questions about assign6

How can I verify that I'm reading the right data from the ELF file?

Use the readelf tool. For example, to verify that you are correctly accessing the section headers, use readelf -S to dump the list of section headers and compare its entries to those your code is finding or use readelf -s to list the contents of the symbol table. By using readelf, you will learn there are some surprising entries (such as the first section header or symbol containing all zero fields), that might have otherwise had you wondering if your code was malfunctioning.

How can the symbols module properly size the array of function symbols?

When dynamically allocating memory, it is desirable that the memory be "right-sized", but sometimes you won't know until after you've finished processing what the right size is. To get around this, consider adopting an idiom used in assignments 1 & 2: use the stack for cheap temporary storage of maximum-size, fill it, and copy the used part to dynamically-allocated memory of the correct size when the size is known. There is no pre-determined max on the number of symbols for any ELF file, but once you have found the symtab section header of a particular file, you will know the number of symbols it contains, of which only some will fit the required criteria.

Can we add new functions to the symbols module beyond the three public ones?

Yes! You are strongly encouraged to decompose the longer tasks into smaller helper functions that are used internally. However, you should not change the public interface of the symbols module.

How do I strip an executable?

The strip utility (see man page) will destructively modify an object file to discard symbols. It does this by removing the entire symbol table section, both the section data and section header. There is no way to reverse/unstrip, but you can compile again to regenerate the object file with its symbol table intact. First make clean to discard any existing build products and make anew.

Can we use global variables?

Yes, in lmalloc.c only. Our prohibition against global variables is relaxed for this module as it needs to maintain persistent state across function calls for which neither stack nor heap is appropriate. It is allowable to use global state to track in-use blocks.

If the client program uses a lot of dynamic memory, there can be a lot of blocks to track. How is the leak detector supposed to cope? Can we set a maximum on the storage?

There is a cap on the number of simultaneously in-use blocks that can be tracked. The leak detector sets aside a fixed amount of space for the in-use list and if that space is exhausted, the excess blocks go untracked. Programs can allocate and deallocate many more blocks during the program lifetime without running into the cap just as long as there is never more than maximum number of blocks simultaneously in-use. The constant of 100K in-use given in starter will cover most ordinary programs. The total number of loss records depends on how much consolidation can be done on the leaks when summarizing. If each leak is unique, the number of loss records would be same as leak count, but in practice, they will be consolidated into a tiny fraction. A client program that creates 100K leaks, easy, but 100K leaks each coming from a distinct code path of its own, not same as any others? That's some crazy code. We promise to not test on a program with more than 50K loss records. In fact, setting your cap to a number much, much lower would have no visible effect on any reasonable program.

A program that does have more than 100K blocks in-use at one time or more than 50K loss records may have an erroneous leak report, i.e. miss reporting a leak due to block going untracked or miss a loss record from the report. However, leak detector (both ld and cpp) should run without error/assert/crash on any correct client program, even one that makes a ridiculous number of allocation requests.

A client program that does a lot of allocation/deallocation runs very slowly under my leak detector. Should I be worried?

No. A linear search to find an entry in the in-use array definitely bogs down the leak detector as N gets large. This is expected. Switching to bsearch may seem appealing, but given that the in-use array is constantly being updated, the work to keep it in sorted order negates much of the benefit of the efficient search. Keep it simple and unsorted and use good old lfind and you'll be just fine.

Trying to call `qsort` from my wrapper functions seems to blow out my stack. From gdb, I can see my code is 4000 calls deep into an endless qsort->malloc->qsort->malloc cycle. What gives?

In certain situations (particularly when sorting large array with large elem size), qsort can make its own call to malloc/free internally -- surprise! Consider the consequence this can have: if you call qsort from your malloc wrapper and that qsort tries to malloc memory, it re-enters into your malloc wrapper, which tries to call qsort, which tries to malloc memory.... and you can see where this is going. The easiest fix is to heed our advice above and don't sort. Our recommendation is not specifically about avoiding this problem, it more that the sort is not worth it. Keeping the in-use array sorted requires an expensive update on every change (e.g. re-sort or a sorted insert/remove). Each of the operations that requires a search also does an update, so there is little benefit in providing a fast search if is always coupled with a slow update anyway. If you insist, you could turn off collecting before qsort and turn back on after to avoid the cycle. This will allow you to confirm for yourself that the performance of keeping sorted is disappointing rather than take our word for it. :-)

When consolidating into loss records, should callstacks be compared by return address or by function name?

By return address. Matching by return address guarantees that allocations being consolidated are on the same exact code path, not just coming from somewhere within the same function. If there are two distinct calls to malloc within the binky function, both return addresses will be within the address range for the binky symbol, but the return point is different and these leaks should not be consolidated. These two allocations represent two distinct loss records.

What functions do we expect to see in the callstack printed for a loss record? What about if I am using helper functions within my leak detector?

The callstack should start at the allocation function (malloc/calloc/realloc/free wrapper) and stop after main or after DEPTH total frames (DEPTH is set to 10 in the starter code), whichever comes first. Similarly, when comparing two callstacks for possible consolidation into one loss record, you only need to consider at most DEPTH frames starting from the allocation function. Depending on where you call backtrace() you may have to adjust the callstack a bit to elide evidence of your helper functions. Sanity/grading require your callstack to cover the sequence from wrapper function to main (or stop after DEPTH), no more, no less. If the executable file has been stripped, you won't be able to identify main, so in that case just print the entire callstack (or at most DEPTH frames of it).

What should leak detector do if the client program corrupts the stack/heap or makes incorrect calls such as freeing a block twice or attempting to free the wrong pointer?

A buggy client program will likely crash the leak detector. You are not expected to devise handling to detect or prevent this, nor will we test on such buggy programs.

(Follow up on last question) But given how the cpp approach wraps calls, a correct client program may have unmatched calls to the wrapped versions of alloc/free. How should leak detector handle these?

Your wrapper functions should track/report on every allocation event that they observe. For the cpp version you need to anticipate the possibility of mismatched events, as you may hear about allocation without free or vice versa. The leak report from the cpp version may be a bit erroneous in those situations (i.e. could miss reporting a leak, or report a block as leaked that was freed). However, leak detector (both ld and cpp) should run without error/assert/crash on any correct client program.

How should leak detector handle an address in the backtrace that doesn't have a matching symbol? What if the executable is stripped?

Some addresses may not have a matching symbol in the symbol table, such as symbols in dynamically-linked libraries or selectively stripped symbols. This is not an error, nor cause for alarm. Print those addresses normally and instead of identifying with symbol name, simply label as <unknown>. If the entire executable is stripped (i.e. has no symbol table at all), all addresses in the backtrace will be labelled <unknown>.

Should the report from my leak detector match what is reported by running `valgrind --leak-check=full`?

Yes. Your leak report from the ld-wrapped executable and Valgrind report from regular executable should match in terms of byte/block counts per loss record and total counts. Note that Valgrind divides leaks into categories of directly/indirectly/reachable, but we make no such distinctions-- any block that was allocated and not freed is considered "lost". To get a Valgrind leak report, remember to run Valgrind on the unadulterated programs (with no leak detection). Attempting to run your leak detector under Valgrind has interference due to both your code and Valgrind trying to interpose on the heap, so don't do that.

What is the expected difference in the report from the `cpp`-wrapped executable versus the `ld`-wrapped?

The ld version wrap all calls to malloc/free, including allocation that is hidden with library functions such as strdup or fopen whereas the cpp version is only able to wrap calls that are explicit in the client program. If a block's allocation was wrapped and deallocation not wrapped or vice versa, the cpp leak report can have have erroneous/missing entry for that block.

How can I add new client test programs to the project?

To add a client program:

Create a myprogram.c file with the desired program.
Use hg add myprogram.c to add the file to your repo.
Edit the Makefile to add myprogram to the names listed for LM_PROGS.

Now myprogram is a part of your project and make will build it. Try adding your reassemble.c for a fun stress test!