Assignment 1 advice/faq

Written by Julie Zelenski

Welcome to the assignment advice/FAQ page!

The assignment writeup is the place to go for the formalities (required specifications, due date, grading standards, logistics, and so on). This is the companion advice page where we offer hints and recommendations and answer common question that we expect may come up along the way. You may want to skim this page now to get a sense of what's covered and return here on an as-needed basis for further review when/if these issues arise.

Our first recommendation is that you check out our best practices for completing cs107 assignments for broadly applicable advice for tackling any assignment!

Memory allocation

One of the challenges in this program concerns correct memory management. Systems programmers are expected to be conscious and careful in their use of memory. Here are some general memory guidelines as well as a few recommendations specific to this program:

Stack allocation is as simple as declaring a variable. The stack is convenient (auto-allocation/deallocation), type-safe, and very efficient.
Dynamic allocation is via malloc/realloc/free. The heap gives the programmer control over lifetime (explicit allocation/deallocation) and flexibility (storage can be re-sized/re-purposed), but comes at a cost in safety and efficiency.
Take care to understand proper use of both allocations and when each is appropriate. As a general rule of thumb, unless a situation requires dynamic allocation, stack allocation is preferred. Some programs, such as this one, use both techniques together.
Dynamically-allocated C-strings should be the proper size, e.g. the string "apple" needs 6 characters worth of storage, no more, no less. A useful idiom to learn is allocating a maximum-sized stack buffer to temporarily store a string of indeterminate length, which is later copied to a dynamically-allocated buffer of the correct size once the length is known. This technique is demonstrated in the fragment read given in the starter code.
The starter code declares a maximum-sized array of fragments on the stack. Most inputs will only use a small portion of that array, but given that stack storage is cheap and plentiful, the excess is of no consequence.
Each time you merge a pair of fragments, the array will contain one fewer entry. Do not attempt to re-size the array, allow it to remain oversized and just ignore the unused capacity. It is a much better design to maintain the current set of fragments in the first K consecutive slots of the array, rather than have to pick your way through the entire array to find the K in-use entries scattered among many unused ones. This choice will result in cleaner code and more efficient access to the array.

Algorithms, function design

The algorithmic core of the program consists of two operations we will refer to as align_overlap and merge_overlap. align_overlap operates on a pair of strings and finds the alignment that has the maximum overlap. merge_overlap takes a pair of strings and merges them at the alignment of maximum overlap. Implementing these functions will be excellent practice with manipulating C-strings and memory management. The design of these functions and their interface is also an interesting task in itself. One possible approach is to envision a fancy all-purpose routine that is internally complex, but easy to use. Alternatively, you can simplify the function's implementation at the cost of moving complexity to the calling site. Some questions to ask yourself:

Given two strings A and B, does align_overlap consider alignment in both orders (A-B and B-A)? Would it help or hurt if you required the function to be called in such a way that only one order has to be examined?
Would it help if align_overlap were guaranteed that the first argument was the shorter string?
Does align_overlap need to consider both overlapping and containing matches or can it assume that overlapping matches have already been pruned?
align_overlap has to examine many possible alignments for a given pair. Is there an efficiency or design advantage to trying the possibilities in a particular order?
align_overlap and merge_overlap have some commonality in their operation. How can you share/unify/communicate between the two to avoid repeating code/work?

There is no one "right" answer to these questions; just as there is no one "best" design. But there are better and worse approaches and you should be thoughtful and deliberate about the trade-offs that come with the choices you make.

Efficiency

Each round runs a contest among all remaining fragments to find and merge the pair with largest overlap. N² pairings can be made from N fragments and align_overlap is called for each of those pairs. That makes N² calls to align_overlap per round and N rounds in total which adds up to a whopping total of N³ calls. This means that any inefficiency within align_overlap can dramatically decrease program performance. Making align_overlap reasonably zippy is not about hacking and zealotry, but does require thoughtful choices. Here are some things to consider:

Whenever possible, C technique is to perform operations on substrings "in-place", i.e. starting at a location within the original string and stopping at the desired length. Working in-place is an efficiency win in both space and time compared to the alternative of creating new substring copies.
The standard C library offers highly-efficient functions for common string operations in string.h. Absolutely use these functions rather than re-implement them! Do take care to choose the right function(s) and use appropriately. As an example, the workhorse function strstr searches the entirety of one string for an embedded instance of another string. A call to strstr is perfect for determining if one fragment contains another, but not at all correct nor efficient for comparing the tail of one fragment to the head of another to find overlap. Check out strncmp instead!
Scrutinize the high-traffic inner loop within align_overlap to identify expensive computations that are unnecessarily re-executed every iteration. For example, strlen runs in linear time to count characters. If the string isn't changing inside the loop, calculating its length once outside the loop and storing into a variable avoids that recalculation.
The merge_overlap function should also make appropriate use of string.h functions (e.g. strcpy/strcat and the like) to simplify and streamline its operation. However, as merge_overlap is typically called just once per round, aggressively tuning this function has less overall impact on efficiency. If your program is running too slowly, it's likely the align_overlap function that needs attention.

Note there is significant redundancy in the computations performed each round. Most fragments are unchanged from round to round, and the same alignments are recomputed that were worked out in previous rounds. A round retains no memory of past rounds and requires a complete repeat of align_overlap for all pairs. This is per the specification and you are not to avoid this by caching results from previous rounds. The simple and straightforward approach is what we ask for.

Incremental development and testing

Some suggested coding milestones and testing strategies.

Read fragments from a well-formed input. The starter code provides a good start, but it only handles well-formed inputs read from stdin. A few changes and you'll have it reading from files in no time.
Robustness to malformed inputs. Error-handling on read is a fussy little job, so you might find it more satisfying to jump ahead to the meat of align-merge and come back to this later. When you are ready to work on robustness, make a list of of ill-formed inputs and update read_frag to correctly detect/report each kind of malformation. Be sure to verify that correctly reading valid inputs doesn't regress along the way.
align_overlap. This function tries all alignments for a given pair to identify the alignment with longest overlap. The program depends on this function working perfectly, so be sure to test it thoroughly before moving on. You might manually test by making calls in gdb, add code to make calls from main, or apply align to pairs read from file. Now would also be a good time to verify that Valgrind raises no red flags about your use of memory.
merge_overlap. The function constructs the merged result for a pair. Like align, merge warrants very thorough testing in isolation. Check in with Valgrind, too.
Do a round. Once align and merge are solid, it is straightforward to add code to examine all pairs, choose the pair with max overlap, merge it, and update the collection. Use Valgrind to ensure no memory errors.
Do all rounds. Add a loop around the previous step and now you have a program able to compare its output to the sample solution. First test on default sanity check and. Add your own custom tests and eventually work your way up to the large stress cases.
Finalize. Find and clean up memory leaks using valgrind. Re-read the specification and verify you have all required functionality. Review your code to make final improvements in quality. Reward yourself with an entire pint of ice cream!

Program facts

Some facts that might help you gauge what you're in for with this assignment.

Code size. Our solution has 92 lines, student submissions have ranged from 70 to almost 300 lines. These are counts of programming statement lines (not including blank lines, comments, #include). Our solution is decomposed into 11 functions.
Time spent. Student hours spent (self-reported):
- < 5 hours 7%
- 5-10 hours 26%
- 10-15 hours 26%
- 15-20 hours 23%
- 20-25 hours 9%
- more than 25 9%

Frequently asked questions about assign1

The fscanf call given in the starter code makes use of %[]. What is this?

The square brackets are a scanset, a format specifier for scanning a string of characters to include/exclude. The scanset %[A-Z] matches any sequence of uppercase letters; the negated scanset %[^Zz] accepts any characters other than lower/upper Z. Have your C reference(s) at the ready so you can look up further information on this and other C language and library features on a need-to-know basis.

Why must I use a field width with scanf?

When scanning a string, you must take care that the characters read fit into the buffer used to store them. The format %[A-Z] matches a sequence of uppercase letters. No field width is specified, so this format reads unbounded, stopping only at the first non-letter or EOF. If there are more uppercase letters than the buffer length, it overflows the buffer and cause data corruption or a crash. The proper technique is to use a field width that constrains the maximum count of characters to read. For a buffer of length 16, you must stop after reading at most 15 letters, leaving one slot for the null terminator. The format specifier %15[A-Z] stops at the first non-letter, EOF, or after reading 15 characters, whichever comes first. One sad detail is that the field width you need (1 smaller than your buffer size) cannot easily depend on the variable declaration or a shared constant, so just use a raw number in the format string. There are some slightly goopy ways to tie the two together, but none of them are especially obvious or pleasing.

What is stdin and how does it differ from an opened file?

The <stdio.h> header declares stdin for the already-opened FILE* connected to the terminal. Code that reads fragments from an ordinary FILE* is equally capable of reading from stdin without changes -- you will not need to make a special case for either. The only constraint to be aware of is that you can only move forward through stdin (i.e must start at beginning and read to end once with no ability to jump ahead or rewind), this means your code must read the fragments in a single linear pass over the input.

What is the proper way to handle a fatal error?

Print an informative error message and halt the program by calling exit(1). The non-standard convenience function error() combines print with exit if you'd rather (read its man page for info). We recommend that you report/handle the error directly at the point of detection. Attempting to pass out codes and information to be used further back up the chain is awkward and messier.

Must I clean up memory on fatal errors?

We do not require recovery, memory deallocation, or other cleanup from fatal errors. When an error surfaces mid-execution, it can be convoluted to manually route control through the deallocation appropriate from that context, so we don't ask for it. We will not look for leaks in program runs that exit due to error.

Must my error messages exactly match the wording of the sample?

An error message must be accurate and actionable, giving the user sufficient information to understand the problem and what is needed to resolve it. Our sample executable models appropriate handling and feedback on errors. You are not required to match our wording, but doing so is a simple way to ensure that your messages will be judged as sufficient.

Is there only one correct reassembled result for a given input?

Since you are allowed to break ties arbitrarily, there can be more than one correct input for some inputs. For example, given the fragments ['ax', 'xb', 'xx'] all three pairings have a max overlap of length 1, thus any pair can be chosen to merge in the first round. If you first merge 'xx' with either 'ax' or 'xb', the final reassembly is 'axxb'. If you first merge 'ax' with 'xb', the output can be either 'xxaxb' or 'axbxx'. All of these are considered correct. When comparing your output to that of the sample solution, be aware of these possible (harmless) discrepancies. Our large sample files have been constructed to guarantee only one correct reassembly to aid your testing.

What is a good way to test my program?

Our samples directory contains a working executable and some sample fragment files. The provided samples cover a range of inputs but you will want to supplement with your own, especially consider making small targeted inputs that isolate for testing specific behaviors. Use our sanity check tool to automate comparing your output to that of the sample. Use custom sanity check as a full test suite against all of your inputs. Don't overlook the convenience of entering inputs via the terminal for quick tests, too.

How do I measure the runtime of a program?

Prefix a command with time to execute and report on the time used to complete it. For example, time ./reassemble slink/preamble_frags will measure your program's runtime reassembling the preamble, time slink/reassemble_soln slink/preamble_frags will measure the sample executable's runtime on the same input. The time labeled "user" is the one of interest.

How do I measure the memory use of a program?

Run the program under Valgrind. Look for the line labeled "total heap usage" to see the count of allocations and the number of bytes allocated.

How can I determine if my program meets the runtime and memory efficiency benchmarks?

Measure the sample executable on a given input and compare to the measurement of your program running on the same input. If your performance is in the same ballpark as ours (say within a factor of 2 or 3), you're golden. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for. If your program uses noticeably more memory or time, it indicate you have some unnecessary or redundant work that should be eliminated.

My program takes a long time to reassemble a large input and requires an eternity to run under Valgrind. Does this indicate a problem with my code?

The specification requires each round to re-examine all N² pairings to find the maximal overlap alignment. This is known to bog down on larger fragment files. As long your program is operating at par with the sample solution on the same input, there is no reason for concern. Expect that running under Valgrind will slow down a program by a factor of 3-10x.

I have an idea about how to make my program run faster/use less memory. Should I pursue it?

Performance is not high priority for this and other early assignments. Grading reserves only a few points for matching the efficiency of the sample. There is no bonus for outperforming the benchmark and much to lose as we will strongly frown on approaches that violate the specification, add complexity, or show questionable judgment about appropriate tradeoffs. (That said, we know of a few pleasing tweaks, including a one-line addition that single-handedly produces a 5-10x speedup in the sample solution. We would welcome a tiny non-disruptive change that produces a nice win.)