CS107 Assignment 3 advice

Written by Julie Zelenski

Here's our best advice on how to hit a home run with your CVector and CMap implementation!

Good design and wise use of memory/pointers

Use stdlib, don't re-implement. The functions in the standard libraries are already written, debugged, and optimized--- what more could you ask for? Study up on malloc/realloc/memcpy/memmove/qsort/bsearch/lfind and be ready to make good use of them.
Unify, don't duplicate. The common code across operations should be unified. Sharing code requires effort in upfront design, but pays off by giving you less code to write/test/debug and produces a more readable and malleable implementation. Sharing code has two general forms: creating a private function that is used by multiple public functions or calling existing public functions when implementing another public function. Re-using public functions as-is is typically the easier approach, and a great solution when it's appropriate. However, there are some situations where the public function isn't quite what is needed--- it only indirectly accomplishes the task in a roundabout way or its use will unnecessarily downgrade performance. When use of public functions is problematic in these ways, create private helper functions that factor out tasks in common to enable code-sharing without the negative consequences. The goal is to unify common code, whether public-public or public-private is the better option depends on context. Reusing code via copy and paste does not count as sharing. :-)
Write helper functions. CVector and CMap have a few vital expressions that are heavily-repeated. For CVector, you will calculate the address of the nth element again and again; for CMap, you repeatedly dig into a cell to access its fields (key, value, next). Even though these expressions are one-liners, they are complicated, dense, pointer-laden one-liners that you don't want to have to repeatedly stare down to ensure it's correct each time. Move these expressions into tiny shared helpers. Write it once, get it right, and then use the helper everywhere for readability and reliability.
Pointee types should communicate the truth. If you know the type of a pointer, by all means, declare it with its specific type. Use void* only if the pointee type is not known or can vary. Conversely, don't declare what should be a void* pointer with a specific pointee type. Declaring a void* as a char* misleads the reader into thinking this pointer is a C-string and allows it to be mis-used as such with no compiler warning.
Easy on the pointer arithmetic. Having been exposed to pointer arithmetic, some students go a little crazy and start using it for everything. Even though you can access ordinary array elements or struct fields by calculating an offset from the base address, you should use array subscripts and struct field names for readability and safety. Work within the type system whenever possible and only drop down to low-level typecasts and pointer arithmetic when raw memory is all you've got.
Prefer assignment to raw memcpy. Similarly, although memcpy can be used to do any kind of assignment, that doesn't mean it always should be used. Consider the three following ways of assigning an integer:
```
int dst, src;
void *pDst = &dst, *pSrc = &src;

dst = src;                          // option A
*(int *)pDst = *(int *)pSrc;        // option B
memcpy(pDst, pSrc, sizeof(int));    // option C
```
All three options accomplish the same thing (copying a 4-byte integer from src to dst). Option A, the ordinary assignment statement is direct, clean, and safe. This is always your best option as long as you are able to work within the type system. If you are forced to operate through a void*, you must instead consider options B and C. Option B, assignment using a typecast, is your second best choice. This option is readable, efficient, and relatively safe (given the correct typecast). Option C operates at the lowest-level, with no type safety whatsoever. A mismatch in the pointer types or size is silently accepted, yet causes havoc. When trying to read/write contents of a void* , go with Option B if possible. Drop down to the low-level memory copying of Option C only when you have no other choice. As a rule of thumb: if the size argument to memcpy is a compile-time constant (and especially so if it uses sizeof), this indicates you shouldn't be using memcpy/memmove. You only need to invoke raw byte-copying when the type/amount of memory being copied is determined at runtime.
Be wary of typecasts. Typecasts are not to be thrown about indiscriminately. Your typecast defeats the type system and coerces the compiler to go along with it without complaint. With this power comes great responsibility. Before introducing one, consider: why is this cast necessary? what happens if it is not present? is there a safer way to communicate my intention? how might I work within the type system rather than against it? (i.e. correcting a mismatch in types rather than covering up with a cast). You should never cast away const-ness (fix the types instead!) and you should never ever cast a function pointer (fix the function prototype!). You also never need to cast anything to a void*, since it is the universal recipient and is compatible with any pointer type (such as cast is redundant and can be removed). You will need a few typecasts, but add them exactly and only where you must. If your code repeatedly uses an expression that contains a necessarily unavoidable typecast, consider placing that expression into a shared helper to be invoked everywhere rather than repeat code.

Incremental development & testing

Attempting to implement and test all of the operations simultaneously is sheer lunacy. One operation at a time! If your cmap_get isn't finding an entry, is it because cmap_get searches incorrectly, cmap_put didn't store the entry, or something else entirely? If you build one operation at a time and throughly test/debug before moving on, you have a reliable foundation to build on and won't be forced into trying to debug all the code all the time. Here is a suggested order of attack:

CVector (definitely do vector ahead of map)
1. Helper function: pointer math to access nth element
  It's just one little expression but grungy and error-prone enough that you don't want to repeat it. Write/test/debug just once, and from then on, you get to use the helper without a further thought. Trust us, it is worth it!
2. create, insert, nth (these three are enough critical mass to enable external testing via vectest.c)
3. count, append
4. first, next
5. sort, search
6. dispose (and applications of client cleanup callback)
CMap
1. Blob helper functions: get/set next field, get/set key field, get/set value field, create blob
  These interconnected one-liners abstract the goopy 107 memory tricks needed for blob handling. With these finished and debugged, what's left to write is a mostly just an ordinary 106B hashtable. Without the helpers, every line becomes an entangled un-fun mess of both 106B and 107 code.
2. create, put, get (enough critical mass for maptest.c)
3. count
4. first, next
5. dispose (and applications of client cleanup callback)

Testing is HUGE !
(You have read our page on effective software testing, right?) This assignment will require a more significant investment in testing than any previous assignment. The number of variables in play (settings for CVector/CMap when created, which operations are called in which order, etc.) create many diverse code paths to verify. If your testing passes through only a limited subset of those paths, then you can't be confident it will stand up to our rigorous scrutiny in grading. Be vigilant and thorough!

The provided test programs are rather basic. Take stock of their coverage, being especially sure to note the omissions. To build out adequate testing coverage, you will have to write additional client code. You can add that code to the existing client programs or create new client programs of your own.
Construct small, targeted test cases per operation. Consider not only the mainstream code path but also various edge and unusual cases and write focused tests to poke at those cases. If the code makes a special case of something (i.e there is a distinct code path to insert at the end, to overwrite an existing key, to expand capacity, to handle a NULL or non-NULL cleanup function, ...?), then it's likely you'll need a special case test to exercise that path.
When you have confidence that operations work in isolation, mix things up with sequences of intermingled operations to flush out bugs due to more complex interactions.
Run all your tests under Valgrind. Getting a clean Valgrind report on a simple test case confirms the correctness of that particular code path, but only that path. You must run Valgrind on a comprehensive set of tests to obtain broader confirmation of your code's memory hygiene.

Program facts

Code size. Our CVector has 85 lines, our CMap 120. Submissions have ranged from 155 to almost 500 lines total for both. These are counts of programming statement lines (not including blank lines, comments, #include).
Time spent. Student hours spent (self-reported):
- < 5 hours 0%
- 5-10 hours 18%
- 10-15 hours 26%
- 15-20 hours 34%
- 20-25 hours 12%
- more than 25 hours 11%

Frequently asked questions about assign3

What is an incomplete type? How do I complete it?

Both CVector and CMap are declared as incomplete types in their respective header files:

typedef struct CVectorImplementation CVector;     // this is from cvector.h

The above typedef establishes the nickname CVector as a a synonym for the type struct CVectorImplementation. But nowhere in cvector.h is this struct defined! This makes CVector an incomplete type. It mentions a struct by name, but no further details are revealed here. An incomplete type limits the client to uses that don't depend on its internal structure. The client can declare/pass/return variables of CVector *, never CVector itself. The client cannot dereference a CVector*, access CVector fields, nor compute sizeof(CVector).

The issue is resolved in the implementation. At the top of cvector.c, complete the type with your definition of struct CVectorImplementation:

    struct CVectorImplementation {   // this type definition is at top of cvector.c
        int a, b;                    // include your desired fields in the struct
        char c;
    };                               // don't forget to end with semi-colon!

With this type definition, the true nature of CVector is revealed. The code in cvector.c can now use CVector as a regular struct type with no restrictions.

What fields should I declare in my CVector and CMap structs?

There is a mandate for the general layout (see diagrams in writeup), but the specific struct fields are yours to determine. One strong recommendation is that you choose good names. You use these fields repeatedly throughout your implementation and names that clearly communicate their purpose will be a great help. A vague name such as size_avail forces you to repeatedly ponder whether it tracks the total allocated or unused portion and whether size is expressed in number of bytes or number of elements.

How do I store a function pointer as a field in a struct?

Ordinary assignment operation works for function pointers, the only sticky part is the syntax to declare a field of the proper function pointer type. The function pointer typedefs in the .h file make this easier, just use the typedef name :

    struct CVectorImplementation {
        ...
        CleanupElemFn clean;        // field to store function pointer to client callback
    };

What data structure should I use for the CMap buckets?

The buckets are an indexed homogeneous collection. Although you could use a CVector, it provides little advantage since the number of buckets does not change, each bucket has a known type, and you don't need any of the fancy CVector operations.. You should instead use an ordinary C array; its use is simpler and provides more type-safety. Each bucket is a pointer to a linked list cell, that pointer is typed as a void*, and a dynamic array of such pointers would be a void**. Note that a void ** is not the same type as a void*. For example, consider these two declarations:

void *a, **b;
...
a[1] = *a;      // doesn't compile, need cast before you can access pointee
b[1] = *b;      // totally fine, copies pointer to second slot to first

a has an unknown pointee type (void) and requires a typecast to access the pointee. b has a known pointee type (void*) which can be accessed normally.

Can I define a struct for the CMap linked list cells?

No. Although the linked list cell is "struct-like" (i.e. a region in memory with three adjacent fields), two of its fields are not statically-sized, which makes it impossible to define a compile-time C-struct for it. Instead, you must manually create a blob of raw bytes that models the struct. You will allocate the correct-size memory region and manually dissect it using a combination of pointer arithmetic and typecasts. Defining your own helper functions to read/write each field will help immensely in keeping the complexity down.

What are the recommended helper functions?

CVector is all about the pointer arithmetic within an array, address_of_nth is its handy helper. For CMap, dissecting the cell/blob to access the fields is handled by helpers get_next, get_key, and get_value. Although needed less frequently, the setter functions set_next, set_key, set_value would also be welcome. Using the three setters in create_new_blob is the icing on the cake. Once you have perfected these helpers, the rest of your work becomes a whole lot easier.

How do I resolve the compiler warning `pointer of type void * used in arithmetic`?

The pointer arithmetic expression pointer + n implicitly scales n by the size of the pointee before adding it to pointer. If pointer is a void*, its pointee size is unknown, and the compiler will squawk because it cannot apply the proper scaling. To perform pointer arithmetic on a void* you must first cast it to indicate the desired scaling. The typical approach is to cast to char* to get no scaling and then manually multiply yourself.

Should I copy the function comments from cvector.h/cmap.h into the cvector.c/cmap.c?

No. The comments in cmap.h/cvector.h explain to a client how to use the container. Your comments in the .c file should instead explain to another developer how the the internals/code work.

How does iteration work for a CVector? In what order does it iterate?

Elements are iterated in order of increasing index. cvec_first returns the element at index 0 and each call to cvec_next returns the element at the next higher index. The implementation of cvec_next has to orient itself from the previous to identify the next in sequence. At first, it may seem a bit mystical how you can do this, but remember you are the omniscient implementer and can take advantage of your knowledge of how the data is laid out internally to root around in memory to locate where you are in the iteration and access what you need. It may be helpful to know you can subtract two addresses to get the distance between them.

How does iteration work for a CMap? In what order does it iterate?

Keys are iterated in arbitrary order. cmap_first can return any key in the CMap (e.g. no requirement that this key is alphabetically first or was entered first) and each call to cmap_next returns another key in the CMap, in no particular order. The easiest approach is to just iterate over the keys in the order they appear in the buckets. It can take a little thinking to see how you can orient from the previous key and access the key that follows, but you are allowed to take advantage of your privileged knowledge of the CMap internals as necessary.

The sample CMap adapts its capacity to load when created with capacity_hint 0. Must mine do that too?

No. In your version, the number of buckets is permanently configured at creation of the CMap and does not change. Performance will degrade when overloaded. The sample version operates as fixed-size when created with a non-zero capacity hint, so use that as a model for the expected decline in performance. The sample version rehashes when created with zero capacity hint, you are not expected to implement this particular behavior and should not use it as a model for your implementation.

What is the difference between memcpy and memmove?

Both are optimized routines for copying a region of raw bytes. memcpy can be more efficient that memmove but requires that the source and destination regions are distinct, whereas memmove correctly handles if the regions overlap. Either can be used to write an entire sequential block of a memory in one call, which can be much more efficient that repeated calls to work on the memory in smaller pieces.

What is assert? How do I use it?

The assert macro from <assert.h> is a mechanism for fatal errors. You assert a condition that must be true to continue:

    assert(num < 1000);

If the condition fails, assert prints the error and halts the program. Liberal use of asserts is a defensive programming practice used by a programmer to communicate with themselves (or other programmers) about fatal situations that should "never happen" and from which there is no possibility of recovery. For example, you can verify an allocation request was successful by asserting the result was non-NULL. A NULL result can mean that heap is exhausted or the internal malloc data structures have been damaged, in either case, there is no point in continuing. The alternative of blundering on with a NULL pointer will eventually lead to a crash, but far removed from the original error. Better to assert right away!

Can I raise asserts for additional error cases beyond those listed in the interface?

We want you to assert only the listed conditions. You might aspire to further bullet-proof, for example, rejecting a NULL CMap* or complaining if the client passes the address of an element that is the wrong type/size/level of indirection/already freed. Although we salute your intentions, it is impossible to make good on them. Determining whether a pointer is valid is not solvable in general. Testing ptr == NULL will catch exactly that one case, but doesn't detect all other invalid addresses. Any measure to detect bad pointers will be half-hearted at best. As the implementer, you have little choice but to assume the client will supply valid addresses and should write your code accordingly. Assert those conditions specifically listed and assume the rest. (That said, our solution CVector/CMap goes out of its way to include special-case defenses against certain detectable client misuses as a help to novice clients. You may have tripped these when working on assign2 or will observe when testing your CVector/CMap against the solution. These protections are not documented in the public interface and we do not want you to implement them.)

How do I add a private function to the implementation?

Declare the function in the .c file and qualify as static, do not add its prototype to the .h file. A static function is private and will not be visible/accessible outside the module in which it is defined.

How can I add new client test programs to the project?

The starter repo is configured for the given three client programs. To add a new client program of your own:

Create a myprogram.c file with the desired client program.
Use hg add myprogram.c to add the file to your repo.
Edit the Makefile to add myprogram to the names listed for PROGRAMS.

Now myprogram is a part of your project and make will build it. If your assign2 was solid, you could copy your client programs into your assign3 directory and follow the above steps above to use them as additional test programs.

How can I compare my CVector/CMap to the sample implementation?

The Makefile includes a target to build solution versions of all test programs. The binky_soln executable is built from binky.c but substitutes the sample library implementation of CVector/CMap for yours. For example, you can build vectest_soln (which uses our CVector/CMap) and compare its behavior to vectest (which uses your CVector/CMap). You can make the solution version of a single program by name, e.g. make vectest_soln, or use make soln to build solution versions of all programs named in the Makefile.

Custom sanity check can be used to compare the output of a custom program with its solution version. Let's assume you have written your own test program mytest and followed the steps above to add mytest to repo/Makefile. Build the program and its solution using make mytest mytest_soln. You can then use mytest as the executable in a custom sanity test and it will compare the output from mytest to mytest_soln. Pretty cool!

How can I set gdb breakpoints in a different source file that the one containing main?

The command break 114 sets a breakpoint at line 114 in the most recently accessed source file. To choose a different file, prefix the line with the filename, e.g. break cmap.c:114. You can also set breakpoints on functions by name, e.g. b cvec_append which finds the named function in whichever file it is defined.

Our collaboration policy allows you to exchange testing plans, strategies, and ideas for test cases (no exchange of actual code though). Sharing these insights on the forum allows everyone to benefit, too! As always, collaboration with others should be properly cited.

Assignment 3 advice