Written by Julie Zelenski
Choosing your design
You are to implement a correct allocator that runs efficiently, but the what/how of that implementation is completely open-ended. Unlike our previous assignments which have dictated nearly every detail, this project is yours to spec. There is no task list of what to implement next, no mandate about how to structure the code, no prescribed design to guide you, just lots of open questions and tradeoffs to weigh. The experience can be liberating, but also a bit disorienting. Here are just a few prompts to consider:
- How to store and access housekeeping data: block headers and/or footers, separate table/list of block information?
- Determining how to manage the free list: implicit? explicit? array/list/tree/hashtable? sorted or segregated by size?
- Deciding whether/when/how to split and coalesce blocks
- Finding available space using first-fit, next-fit, best-fit or something else entirely
- Are the blocks themselves segregated by size in the storage? Do they have buddies?
- Will you use the same strategy for all sizes of request, or have different handling by size?
- Will you implement a standalone realloc or just delegate to malloc/memcpy/free?
Section 9.9 of Bryant and O'Hallaron contains helpful reading on heap implementation strategies and presents partial code for an allocator case study. This allocator uses boundary tags and an implicit free list and demonstrates some worthwhile motifs such as abstracting the pointer arithmetic/typecasts (although we strongly recommend using inline functions instead of macros) and strategically installing a few extra blocks to avoid making special cases. This design is a moderate performer, the code is missing some key pieces, and I know of at least one bug, so I would recommend it as code you can learn a lot from, but be wary of adopting from it without careful scrutiny of what you're getting.
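If you do go the inline-function route, the textbook's macros translate naturally. The sketch below assumes one particular hypothetical layout (a single 8-byte header word per block, size in the upper bits, in-use flag in bit 0); the names and layout are illustrative, not a prescribed design:

```c
#include <stdbool.h>
#include <stddef.h>

// Block sizes are multiples of 8, so the low 3 bits of the header
// word are free to hold status flags (bit 0 = in-use).
static inline size_t pack(size_t size, bool inuse) {
    return size | (inuse ? 1 : 0);
}

static inline size_t blocksize(const size_t *header) {
    return *header & ~(size_t)0x7;   // mask off the status bits
}

static inline bool is_inuse(const size_t *header) {
    return *header & 0x1;
}

// The payload begins immediately after the header word.
static inline void *payload_for(size_t *header) {
    return (char *)header + sizeof(size_t);
}

static inline size_t *header_for(void *payload) {
    return (size_t *)((char *)payload - sizeof(size_t));
}
```

Unlike macros, these helpers are type-checked and can be stepped through in gdb, and at -O2 gcc inlines them so there is no call overhead.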
A much simpler starting point is to consider the code given in allocator.c. This is a naive but largely functional implementation of the allocator functions. The code is so short and the approach so simple that you can easily read and thoroughly understand it line-by-line. It's a very long way from being an efficient implementation, but you could evolve it forward incrementally, applying improvements one step at a time. Analyze what about its approach is most problematic, figure out a way to make it better (how about re-using freed memory, or not requiring a whole page for a 2-byte request?), implement the improvement, measure the gain, and repeat.
The design process can be fun and creative but often requires some iteration until your design solidifies. Be sure to get an early start so that you have sufficient time to brainstorm alternatives, weigh trade-offs, make decisions, experiment, react and rework as necessary to land a successful final result. Also be sure to keep up with regular Mercurial commits to allow for easy trial-and-error exploration. If you commit before trying an experiment, you will have a known good snapshot to revert to should it not work out.
Design choices and performance opportunities
The sample scripts reflect a range of scenarios for how the allocator might be used. Reviewing the relative performance on different scripts provides guidance about which situations your allocator handles efficiently and which it does not, allowing you to identify opportunities for improvement. For example, if you have bottlenecks in realloc, you will see it reflected in poorer performance on those scripts that make heavy use of realloc as opposed to the other scripts that have little or no use of it. Sometimes improving the results for one particular script comes at the expense of degrading others, so you'll have to weigh the results on balance. The per-script results tell you where you have strengths/weaknesses, but your allocator will be evaluated on aggregate performance, not any one script in isolation.
Some suggestions to help get you thinking about designing for performance:
- How much internal fragmentation does your design carry? This overhead may be in the noise for large blocks, but probably not for small blocks. Can you squeeze some bytes out of the general per-block overhead or make a special case out of small blocks?
- How is external fragmentation affecting your utilization? You can trim this waste with more strategic choices in fit and placement: how you choose which block to use, whether/when you split and coalesce, the effects of segregated storage, pairing for buddies, and so on.
- Malloc throughput is mostly about search: examining blocks to find an appropriate one. How might you spend less time with each block and/or examine fewer blocks to find a good one? Can you organize the free list to allow you to more quickly narrow in on what you need or be less picky about which block you take?
- Throughput on free is largely constrained by how quickly you can access/update your data structures.
- Realloc has its own issues that directly pit throughput against utilization. It's worth noting that realloc is used less frequently than malloc/free, so optimizing realloc at the expense of malloc/free is of questionable merit. Here's a little tip: most blocks are never reallocated, but a block that is reallocated at least once is often reallocated repeatedly (e.g. the storage backing a CVector or merging strings in reassemble), so anticipating future reallocations might pay off.
- Coalescing seems like a good idea in theory, but can be disappointing in practice if not thoughtfully implemented. It may help to experiment with easily-tunable parameters to find the sweet spot for when/where/how to use coalescing for best effect.
- What choices make for cache-friendly access for the allocator itself and/or the client's blocks? For example, a segregated storage design that places same-size blocks (e.g. a gazillion homogeneously-sized linked list or tree nodes) in a contiguous run would be very cache-friendly.
- Some changes improve both utilization and throughput. For example, tightly-packing blocks both increases utilization and ups the cache hit rate. Other improvements sacrifice one for the other. Searching for the very best fit improves utilization but costs cycles. A deliberately-over-allocated block is easy to realloc but wastes space. Fancier free lists (doubly-linked, trees, segregated) are faster to search, but add overhead, may require extra cycles to update, and necessitate a larger investment in complex code and careful debugging. Most design questions don't have a single "best" answer, just trade-offs to be balanced.
- Improvements in utilization tend to come from changes in strategy (placement choices, fit algorithms, coalescing approach, and the like). You'll likely find hitting 50% utilization to be relatively straightforward even with rather simple approaches. However, it tends to get trickier after the low-hanging fruit has been cleared, and you'll work harder for gains from there. It's somewhat challenging to get utilization above 80%, and achieving even tighter packing will often consume a lot of throughput.
- Other than the one-time boost you'll get from engaging compiler optimization, the improvements on throughput will tend to be smaller and more continuous across the entire range. Streamline one bottleneck at a time to accumulate small, steady gains. No matter where your throughput currently stands, whether 10% or 110%, there are usually a fair number of modest improvements in reach. Efforts can be applied across the spectrum, from changing up a strategy at the high level down to the micro-detail of streamlining a few heavily-trafficked assembly instructions.
- Experiment and measure! Unsure about the trade-offs between immediate versus deferred coalescing? Curious about whether or how finely to segregate your free list? Rather than rely on hunches or speculation, try some A/B experimentation. Introduce some global settings that allow you to experiment with different values or enable/disable features and then run performance trials to measure under various scenarios to gauge the effects in order to choose the right parameters to tune your implementation.
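One low-tech way to set up the A/B experimentation described above is a handful of file-scope tunables that gate each policy decision. The knob names, values, and the should_split helper below are all placeholders for illustration, not recommendations:

```c
#include <stdbool.h>
#include <stddef.h>

// Tunable policy knobs. Flip these (or wire them up to environment
// variables) and re-run performance trials to compare configurations.
static bool deferred_coalescing = false;  // coalesce on free, or defer until malloc?
static size_t split_threshold = 32;       // don't split off remainders smaller than this
static size_t realloc_slop = 64;          // extra bytes to over-allocate on realloc

// Example of a policy decision driven by a knob: only split a found
// block if the leftover piece would be big enough to be useful.
static bool should_split(size_t found_size, size_t needed_size) {
    return found_size >= needed_size + split_threshold;
}
```

Once the trials have settled the question, you can bake the winning value in as a constant so the optimizer can exploit it.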
Measuring and optimizing
Once you have a solid design that is correctly implemented, you can then work to reduce the constant factors by applying code optimization techniques. The first step in any optimization effort is measuring. The alloctest results give a high-level view of your allocator performance as a whole, but you'll need more fine-grained measurements to make effective optimizations in the code itself. Run alloctest under callgrind to profile the code and obtain instruction counts and cache hit/miss counts. Use the report to identify code passages that are heavily-traveled and/or particularly cache-unfriendly. These critical passages are where to focus your attention for maximum gain. Even just a slight decrease in the instruction count or cache miss rate for a high-traffic area can pay off handsomely in absolute results.
A few notes on optimization:
- As we saw in lecture, gcc's optimization can provide a significant boost on many common patterns. Don't preemptively optimize on these fronts (i.e. your efforts to hand-fold constants or convert multiplication by 2 into a shift are redundant); let the optimizer do it for you. You will likely be quite pleased with how much help even just -O2 provides, and be sure to try out higher levels and experimental options (gcc optimizations) as well. It can be educational to examine and compare the disassembly to understand how it has been transformed.
- Your manual efforts are needed for optimizations the compiler can't or won't do. You have knowledge about the code that is deeper than what the compiler can glean. You know where you can cache results and avoid unnecessary re-computation. You can manually hoist constant work out of a loop that the compiler cannot determine is redundant. If you know that one of the variables to a particular multiply/divide operation is always a power of two, you can convert into the equivalent bit shift. And most importantly, you measure carefully first to identify which passages are critical so you know where to be ultra-aggressive and where not to bother.
- Decomposition and code unification into helper functions are still a great idea for managing complexity, but you may be concerned about the cost of the function-call overhead. Never fear, you can have your cake and eat it too by using gcc's support for inlining. Inlining is enabled at gcc optimization level -O2. Inlines are a great way to get the speed of preprocessor macros without any of their accompanying headaches.
- Gcc provides a number of highly optimized built-in functions (gcc builtins). Many are faster versions of the expensive math functions and not relevant for an allocator, but there are a few bit operations (such as finding the first/last bit set) that might come in handy. Check them out!
Script files for alloctest
The alloctest harness runs on script files. A script file contains a sequence of allocator requests in a simple text-based format. The three request types are a (allocation), r (reallocation), and f (free). Each allocation result is assigned an id-number that can be referred to in a subsequent realloc or free.
a id-number size
r id-number size
f id-number
A script file containing:
a 0 24
a 1 100
f 0
r 1 300
f 1
is translated by alloctest into these calls to your allocator:
void *ptr0 = mymalloc(24);
void *ptr1 = mymalloc(100);
myfree(ptr0);
ptr1 = myrealloc(ptr1, 300);
myfree(ptr1);
You can open any script file in your text editor. A script can begin with a comment that summarizes the scenario presented by this script. You can also peruse the entire sequence to see the pattern of requests in more detail. You can make your own simple scripts using a text editor and generate larger, complex scripts programmatically.
The importance of validate_heap
You are to implement validate_heap to scrutinize your heap data structures and squawk loudly should any problem be found. When checking correctness, the alloctest harness makes a call to validate_heap after every allocator request. If validate_heap is implemented to identify when the heap goes from valid to invalid, the monitoring from alloctest can pinpoint the exact moment the heap got out of whack. Instead of trying to postmortem a corrupted heap with wild guessing about how and when it got that way, you can instead jump right in and trace the critical call in context to see how it went astray. This debugging aid is priceless, but depends on you having usefully implemented validate_heap!
Review the code in alloctest.c to see its external checks of heap correctness (e.g. one check verifies addresses are aligned and within heap, another verifies allocated blocks don't overlap). You don't need to repeat these checks; instead validate_heap will augment them by reviewing the internal consistency of the heap. The implementation of validate_heap should be written alongside your allocation functions. As your heap data structures become more complex, validate_heap becomes more sophisticated to match them.
For example, validate_heap might walk the entire heap and verify such things as:
- Does the housekeeping information (location, size, free status) for each block appear reasonable?
- Is every block in the free list marked as free? Is every block marked free listed on the free list? Is each listed only once?
- If using sorting or segregation, are these properties being properly maintained?
- Have adjacent free blocks been coalesced according to your policy?
- Are redundancies in the data structures consistent, e.g. does count of free blocks match length of free list, total bytes in-use match sum of in-use block sizes?
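As a concrete sketch, here is what such a walk might look like for a hypothetical implicit-list design (one size_t header per block, low bit marking in-use, sizes multiples of 8 and counted to include the header). Every name and layout detail is an assumption about one possible design, not the required one:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical module state for an implicit-list allocator.
static void *segment_start;
static size_t segment_size;
static size_t nfree;   // running count maintained by malloc/free

static bool validate_heap(void) {
    size_t free_seen = 0;
    char *cur = segment_start;
    char *end = (char *)segment_start + segment_size;
    while (cur < end) {
        size_t header = *(size_t *)cur;
        size_t size = header & ~(size_t)0x7;
        if (size == 0 || size % 8 != 0 || cur + size > end) {
            printf("Bad block header %p: size %zu\n", (void *)cur, size);
            return false;
        }
        if (!(header & 0x1)) free_seen++;
        cur += size;   // advance to the next block header
    }
    if (free_seen != nfree) {
        printf("Free count mismatch: walked %zu, recorded %zu\n", free_seen, nfree);
        return false;
    }
    return true;   // report nothing when all is well
}
```

Note that it stays silent on a healthy heap and speaks up only at the first sign of trouble, exactly the behavior described above.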
Do not have validate_heap spew a dump of the heap data structures for you to wade through in an attempt to manually validate the contents. Instead have validate_heap traverse the structures, directly look for trouble, and report only on failures (e.g. an in-use block found on the free list). No need for running commentary about things that are working fine. In this way, validate_heap will directly bring your attention to exactly those issues that require your intervention and otherwise keep a silent vigilance.
The more comprehensive and thorough the checks done by validate_heap, the more useful it will be to you. A good validate_heap will also be slow; do not be at all concerned about this. This function is a development/debugging aid, there is no desire for it to be efficient, and it will not be invoked during any performance measurements.
It can be very difficult to try to debug an allocator without decent validation, so do yourself a favor and invest in yours from the get-go. If you somehow manage to get your allocator working without one and then throw together a lame version on your way to submit to appease us in grading, you will have missed the entire point.
Code quality
Our grading priorities heretofore have put little emphasis on efficiency: we designated only a few points for meeting the "reasonable efficiency" benchmark and were quick to ding for hacky algorithms, code duplication, lack of decomposition, unnecessary special cases, and other choices that make for ugly code, even if they offered some efficiency advantage. The priorities change for this assignment, where efficiency is the focus. This doesn't mean you should set out to write something ghastly, but it does shift the trade-offs. Obviously the best outcome is streamlined code that is also clean and high-quality, but when the two are in opposition, compromise is inevitable. When making the choice to sacrifice a bit on quality in the name of performance, only do so if the performance gain you're getting is measurably important and there is no elegant way to get the same boost.
- There will be some unavoidable use of void * and typecasts, but you can tame the complexity by writing helpers that clear away the dense pointer arithmetic/typecasting and increase the clarity of the surrounding code. As always, unify common code -- if it was tricky to write once, you surely don't want to write it more than once.
- Decomposition is essential. A page-long malloc or realloc can be dense code that makes your head swim. Break it down into tasks: finding a block, updating the housekeeping, etc. If needed, use inlining to keep down the cost of the function calls.
- Don't slap in quick fixes, adding +1 here and a typecast there, an extra * there, without thinking about the underlying root cause and finding the right way to fix it.
- The standard malloc interface takes its arguments as a size_t, which is an unsigned long. Carelessly co-mingling signed and unsigned types and being sloppy about bitwidth can lead to unexpected troubles, so be thoughtful and consistent.
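A classic instance of the unsigned pitfall: subtracting two size_t values never goes negative, it wraps around to an enormous positive number. This toy example (hypothetical names) shows the failure mode and the safe rewrite:

```c
#include <stdbool.h>
#include <stddef.h>

// DANGER: if needed > avail, the unsigned subtraction wraps around to a
// huge positive value rather than producing a negative result.
static size_t leftover(size_t avail, size_t needed) {
    return avail - needed;   // e.g. leftover(16, 24) is 2^64 - 8, not -8
}

// SAFE: compare before subtracting so no wraparound can occur.
static bool fits(size_t avail, size_t needed) {
    return avail >= needed;
}
```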
Program facts
Some facts that might help you gauge what to expect.
- Code size. Submissions have ranged from 71 to 482 lines; the average is around 180. These are counts of programming statement lines (not including blank lines, comments, #include).
- Performance metrics. Utilization has ranged from single digits to a high of 88, and throughput from single digits to a high of 150. Median utilization is in the high 60s, median throughput in the mid 70s. Over a third of past submissions have exceeded the full-credit mark and earned performance bonus points. A couple of submissions achieved combined utilization > 80 and throughput > 95!
Frequently asked questions about assign7
Is it required to work with a partner?
No, partnering is completely optional. In past quarters, about a third to half of the class has chosen to work individually and the remainder pair up. Working in a team has advantages (two heads are better than one) and disadvantages (scheduling, coordinating, arguing). The average score has been roughly the same for partnered versus individual submissions, although there was a smaller variance in the partnered group (i.e. solo submissions have more outcomes at the extreme edges). Read our advice for successful partnership if considering the partnered route.
Are we allowed to adopt the allocator code from the textbook? Are we allowed to adopt code from elsewhere?
We plan for you to invest your hard-earned CS107 superpowers in your own creation, but if you do need to borrow from the text to get started, we have a specific carve-out that allows you to incorporate Ch 9.9 code from the text with appropriate citation. All other code is entirely off-limits -- you must not review or adopt code for your allocator from any other resource.
Must my allocator <insert-technique-here>? ("maintain an explicit free list", "use a pre-node header", "implement coalescing", ...)
No. You get to make all design choices for yourself!
What does it mean that every address must be aligned to a multiple of 8?
Each payload must be placed such that its address has a last hex digit of 0 or 8. Addresses 0xa5b8 and 0xa5b0 are aligned; 0xa5b4 and 0xa5b1 are not. The alignment requirement applies to the base address of the payload; whatever headers/footers you have placed around it are your own business. The textbook indicates the alignment multiple is 16 for 64-bit code, but we are aligning to 8, the size of the largest primitive type. You can choose to align to 16 (or any larger multiple of 8) for convenience if you like.
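The usual trick for rounding a size up to a multiple of 8 is bit masking; a small pair of helpers (illustrative names, and the same trick works for any power-of-two alignment) might look like:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGNMENT 8   // must be a power of two for the mask trick to work

// Round sz up to the nearest multiple of ALIGNMENT.
static size_t roundup(size_t sz) {
    return (sz + ALIGNMENT - 1) & ~(size_t)(ALIGNMENT - 1);
}

// Check a payload address for proper alignment, e.g. from validate_heap.
static bool is_aligned(const void *ptr) {
    return ((uintptr_t)ptr & (ALIGNMENT - 1)) == 0;
}
```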
What is the largest payload size the allocator must handle?
The theoretical maximum request for a 64-bit address space is beyond enormous but in practice, the max size will be constrained by other in-use segments. Your design can plan that payload size tops out at INT_MAX bytes. A request larger than max should be returned unserviced (i.e. return NULL).
Is there a minimum payload size?
No. For some allocator designs, it will be convenient to round up small requests to a minimum payload of 8 or 16 bytes, but there is no requirement to do so.
My validate_heap routine is very slow. Should I be concerned?
No. In order to do its job well, validate_heap should be thorough and comprehensive, which requires time to execute. This function is a development/debugging aid, there is no desire for it to be efficient and it will not be invoked during any performance measurements.
I'm getting an ALLOCATOR_ERROR from alloctest. What does this mean?
When testing correctness, alloctest has some simple checks to look for heap trouble. It checks that each malloc/realloc call returns an address within the heap segment that is correctly aligned and that the new block doesn't overlap any other currently in-use block. It writes an identifying byte to the payload of an allocated block. When asked to realloc/free a block, it verifies the payload contents have remained intact. It also calls your validate_heap after each request. If any of these checks turn up trouble, it reports an ALLOCATOR_ERROR to draw attention to the problem. Review the code in alloctest.c to understand the symptom for the particular error being reported to you.
Some of my heap housekeeping seems to get randomly overwritten with a repeated pattern, such as 0x2a2a2a2a. What gives?
After allocating a block, alloctest copies a pattern over the entire payload, e.g. repeats 0x2a or 0x08 in every payload byte. It does not write before/after the block, only on the payload bytes. If this pattern appears where you think it should not, this means there are bytes that alloctest believes are in-use payload yet your allocator thinks store heap housekeeping. Perhaps an in-use block was incorrectly left on the free list, a block coalesce/split was bungled, or a block is wrongly both in-use and free? Identifying when the corruption came about sounds like a job for validate_heap!
When I valgrind alloctest, it warns about "set address range perms: large range". What gives?
The default Valgrind tool (memcheck) is designed to spy on the actions of the real malloc/free and will pay no attention to what you are doing over in your own heap segment. Using memcheck on alloctest is not only unhelpful, its complaints may needlessly worry you. Do not use Valgrind memcheck on this assignment. You can run alloctest under Valgrind's callgrind (the profiling tool) for help in finding performance bottlenecks.
My allocator gets a segmentation fault dereferencing an invalid address, but when I access that same address within gdb, it seems fine. Why?
gdb doesn't heed the protection bits on memory segments, so it can access read-protected locations without complaint, while accessing these same locations in a running program will cause a segmentation fault. If your allocator is faulting due to trying to access a read-protected location, it could be due to accessing part of the heap segment that was previously mapped but has since been marked unreadable. This symptom is highly suggestive that your myinit doesn't correctly re-set the state of the allocator back to a clean slate.
GDB seems confused about my code; control flow jumps around, variables are missing. What gives?
Are you attempting to debug on optimized code? The debugger will do its best, but the optimizing transformations make the assembly much less of a literal rendition of the C, so stepping may appear to jump through the code erratically, statements may have been elided, variables optimized out of existence so unknown to gdb, and so on. You'll have your best luck doing your debugging on code compiled with little or no optimization. Some bugs only show up when compiled with optimization, in which case you have no choice. In those cases, it may be more helpful to follow the asm than try to work from the C code.
My allocator works on all scripts except for <insert-some-case-here> (e.g. crashes on line 1428 of binky.script). Can someone please look at my code and find what's wrong or tell me how to fix?
Sounds like you have a bug! Don't look for someone else to wave a magic wand, you've got this! Put to work those fabulous debugging chops you have been building up all quarter. Get that program under your gdb microscope, devise experiments to run, gather data, draw pictures, disassemble, poke around in memory, and trace to find how and where your code is going astray. One particularly useful leverage point is the validate_heap function (see section above on "The importance of validate_heap").
In office hours/forum/email, our focus will be on supporting you so that you can track down your own bug. Please do ask us how to best use tools, what strategies to consider, and advice about how to improve your debugging process, but asking us to diagnose your bug by description/symptom? Not so much. The staff will not look at your allocator code nor attempt to trace/review/debug it by inspection. We expect you to take ownership of your own code and drive your own debugging. You're ready!
My allocator has abysmal performance on the tiny scripts. Should I be concerned?
No. These scripts are much too small to produce meaningful performance results. They are intended only as a first step in early development and debugging. We will not test allocator performance on such small scripts and neither should you. Figure that a script needs more than 500 requests before it is even worth measuring.
My performance measurements vary from run to run on the same scripts, with no code changes. Is this normal?
Utilization is generally stable on the same inputs, but throughput varies with CPU/cache contention, so expect small ups and downs from run to run and even larger discrepancies when the system is under heavy load (use uptime to view activity/load). During grading, we will evaluate performance on an isolated quiescent host to reduce artifacts. All myth hosts are running identical software, but some have skimpier hardware (e.g. 4M L2 cache versus 6M, slower clock) which can affect results. You can view a host's specs in the file /proc/cpuinfo if you're curious. We will measure performance for all submissions on the same myth under the same conditions to ensure consistency in grading.
Am I guaranteed the same allocator performance on the grading scripts that I get for the sample scripts?
No. The mix of grading scripts will be similar, but not identical, to the published samples. Your allocator should be designed to perform well in a variety of situations, rather than be tailored to exactly one set of known inputs; it may help to create some scripts of your own to broaden the mix of workloads you test on. It is typical for some scenarios to play into your best-case handling and others not, so ups and downs are expected, but far outliers can dominate the aggregate, so keep an eye on your extreme cases: try to bring your worst case closer to the median, and be wary of depending on your best case to make up for the others. If your performance is consistent across most/all sample scripts, then taking a few out and replacing them with others should result in only minor change in overall performance, which gives you higher confidence that your results will replicate across a wider variety of scenarios. If instead you have wider variability script-to-script, your allocator performance may swing more significantly when we mix things up in grading. You can estimate your own worst-case outcome by calculating how your allocator would fare on a mix constructed from the published samples with your top few best performers removed and replaced by copies of your weakest performers.
What gcc flags are allowed?
Anything goes. Edit ALLOCATOR_EXTRA_CFLAGS in the Makefile to configure your desired options. The initial flags are -Og. While still in active development, this flag generates code that is more debugger-friendly, but once you're working on performance, experiment with more aggressive levels and options. We will compile your allocator using the setting from your submitted Makefile, so be sure to submit with the flags configured to work best for your code.