Assignment 2 advice

Written by Julie Zelenski

Better late than never

You have read our advice on "Best Practices" for 107 assignments, haven't you?? :)

void* client: a cautionary tale

Our provided CVector is a full-featured, memory-managed, re-sizable array and CMap is a super-efficient hash table. Each is a clean package that abstracts away the low-level details, allowing you to put your energy toward solving more interesting problems. Data abstraction is a power tool, and one you have taken advantage of throughout the CS106 courses. What makes it a bit different in C is that these generic collections have to operate using a void* interface that requires the client stay on their toes at all times.

These are two short programs (our solution is only a total 3 pages of code) and the handy CVector and CMap do a lot of the heavy lifting. However, don't mistakenly assume this will be a breeze. The code can be dense and it's frighteningly easy to make mistakes when handling memory using void* pointers. Go slow, develop in careful steps, do lots of testing along the way, and keep gdb and Valgrind in the loop.

Start by studying the interfaces in cvector.h and cmap.h. Using them as a client requires a solid understanding of how void* is used and steadfast vigilance about details. We strongly recommend doing this fill-in-the-blank warmup exercise before you even think about coding.

Memory efficiency

Remember this is big data with an emphasis on big. The google corpus contains 5 million of bigrams. If each bigram is stored twice (once under each of first and second word) and each copy requires 30 bytes of storage (strings, frequency, prevalence), you're easily looking at 300MB of memory use. Measure the sample on the same input to get a benchmark and calm your nerves about the expected size of the memory footprint.
During file-reading, the strings are usually scanned into temporary storage. Think about when you do and don't need to copy a string to persistent storage. Any persistent string storage should be appropriately sized, not arbitrarily sized to some fixed maximum.
Some of the data you store in the CVector and CMaps will be strings. It can be convenient/efficient to share the same strings (i.e. store the same char* pointer in more than one collection), but this necessitates that you coordinate the sharing to avoid surprises such as one collection freeing the strings out from under another. Alternatively, you can make distinct copies to avoid aliasing issues. Either is fine with us, but be sure to have a plan and document it.

Code quality

There are many features provided with the CVector and CMap implementations, so be sure to read the .h header files and leverage those features rather than rolling your own.
You will need some pointer typecasts (most notably to convert the void* retrieved from the collection back into its specific type pointer) but don't become enamored with sprinkling them about. When you use a typecast, you are taking full responsibility for type-correctness and suppressing any help you might have otherwise gotten from the compiler. Use typecasts exactly and only when you must.
When you write a callback function, its prototype must match the callback prototype exactly as given in the CVector/CMap interface file, including matching use of const. The const qualifier on the void* is an indication that the pointee contents is only to be read, not written. Never cast a function pointer to get around type mismatch; instead, change the function prototype to correctly match.
Arguments to the callback function will be typed as void* , but you know they are really int* or struct binky * . You could add a typecast to the argument everywhere you use in the function body, but a better strategy declares a local variable of the correct pointer type, assigns it once from the parameter, and uses the strongly-typed local in the rest of the body.

Program facts

The data on student submissions below is from last quarter. The assignment has been re-designed this quarter, but retains similar elements. Your mileage may vary.

Code size. Our solution has 108 lines. Student submissions range from 80 to over 340 lines. These are counts of programming statement lines (not including blank lines, comments, #include). Our solution is decomposed into 13 functions.
Time spent. Student hours spent (self-reported):
- < 5 hours 4%
- 5-10 hours 6%
- 10-15 hours 20%
- 15-20 hours 29%
- 20-25 hours 18%
- more than 25 23%

Frequently asked questions about assign2

What is a symbolic link? How do I make one? How can I test my searchdir visited detection?

A symbolic link (such as the slink we provide in your repos) is a directory entry that is a synonym for another path. It's like the filesystem version of a pointer. You can create a symbolic link using ln -s srcpath dstname. The directory /afs/ir/class/cs107/samples/assign2/loopy contains a variety of nested and circular links that might be useful for testing.

My searchdir traverses directories in a different order than the sample. Is this okay?

Yes. Your output should always list the same number of files as the sample. However, it might refer to some files by a different full path than the sample when multiple paths lead to that file. The full path chosen is dependent on which path was returned earlier from readdir and readdir makes no guarantee about the order it returns the directory entries. It has stable behavior per-host, so your program and the solution running on the same host should produce the same paths. Running on different hosts could mix things up, but rest assured the permutation in output will not be an issue in grading. A change in order of traversal will never cause a difference in number of files found, so those such mismatches are a sign of programming error.

My searchdir uses a lot of time/memory. Is that expected?

This little program does a surprisingly hefty amount of work, especially when searching a larger filesystem. The searchdir program makes calls to opendir call to traverse the directory entries. Behind the scenes, opendir makes a large allocation (~32K per call, freed by closedir) causing heap usage to balloon when traversing a deeply-recursive directory structure. A directory with 100 subdirectories is a 3MB hit to traverse. The directory /usr/include has over 1400 subdirectories so be prepared for chewing through ~45MB of memory! Don't be alarmed! As always, measure the sample using time and Valgrind to get a benchmark on what is reasonable. It would be unwise to attempt to search an enormous directory such as /afs/ir/class/ or /. We will not test on such cases and don't expect you to either :-)

Can you remind me how struct declarations work in C?

C struct declarations are almost, but not exactly, the same as C++. In C, the following declares a new struct type:

struct coord {
   int x, y;
};  // don't forget to end with semi-colon, without it you'll be mired in compiler errors

The name of the type is struct coord and you cannot drop the struct keyword; declaring a coord will not compile. You can wrap the struct definition in a typedef to set up a nickname:

typedef struct coord {
   int x, y;
} Coord;           // need to end with semi-colon here, too

This second version declares struct coord and gives it the nickname Coord. Either name (struct coord or Coord) can be used for the same type. The . (dot) operator is used to access the fields within a struct. If you have a pointer to a struct, your first attempt to access the fields is likely to run afoul of the fact that dot has higher precedence than * (dereference). You can add parentheses to force the desired precedence, or better, use the -> operator which combines . and * for this common need.

Coord origin;    // allocates struct on stack
Coord *p = malloc(sizeof(Coord));   // allocates struct in heap (note: sizeof works for correctly for structs)

origin.x = 0;    // access field from struct variable

                 // these next 3 lines access field via struct pointer
*p.x = 0;        // this line doesn't compile because default precedence applies . first then *
(*p).x = 0;      // this line works, use parens to override default precedence and apply * first, then .
p->x = 0;        // this line shows the preferred way to access field via struct pointer

origin = *p;     // btw, can copy a struct using ordinary assignment (copies all fields in one go)
*p = origin;

I don't understand the use of function pointer typedefs in the CVector and CMap header files. Please help!

The raw declaration of function types can be a bit goopy so CVector and CMap use typedef nicknames. For example, the cleanup function for CVector takes one void* parameter and returns void. The typedef below establishes the nickname CleanupElemFn as synonymous to the raw declaration of a void (*)(void *) function type.

typedef void (*CleanupElemFn)(void *addr);

This typedef allows parameters and variables to be declared using the typename CleanupElemFn. To write a cleanup function as a client, you merely need to define a function with the correct prototype. The function's name is used to access its function pointer . A function pointer is compatible with any function type matching its prototype.

void byebye(void *ptr)
{
    Binky *b = (Binky *)ptr;
    // do cleanup of b here7f

}

int main()
{
    CVector *all_my_binkys = cvec_create(sizeof(Binky), 10, byebye);

What is the expected prototype/behavior for a comparison callback?

The standard library qsort, bsearch, and lfind/lsearch functions and CVector sort and search all use the same format for the comparison callback. It is a two-argument function that take pointers to elements (const void *) and returns an integer which indicates whether the first element is less than, equal, or greater than the second. Here is a sample callback for comparing double elements:

int cmp_dbl(const void *p1, const void *p2)
{
    double d1 = *(double *)p1, d2 = *(double *)p2;
    if (d1 < d2) return -1;
    if (d1 > d2) return 1;
    return 0;
}

I'm getting compiler warnings about `initialization discards ‘const’ qualifier from pointer target type`. How do I resolve these?

The const qualifier on a declaration is an indication that type is only to be read, not written. A const char * and a char * are not quite the same type. You can read characters from either, but only the non-const version allows the characters to be changed. While it is perfectly fine to supply a non-const pointer where a const is expected (no cast required or recommended), the inverse will raise a warning from the compiler. Throwing down a typecast to const would quiet the compiler, but that's not the appropriate fix. If write-access is required, a read-only pointer is not going to work and a cast only serves to cover up the mistake. You should instead be fixing the logic to supply the correct pointer. If write-access is not needed, then the simple and correct choice is to add the const qualifier to the declaration.

What is the proper response if asked to find synonyms for an invalid target word (not alphabetic, longer than 50 chars, etc.)?

If the target word is not in the corpus (for whatever reason), print an empty list of synonyms and move on. No need to report or handle as an error.

Should I store structs or pointers to structs in my CVector/CMap?

If you want to put "widgets" in your CVector/CMap, the preference is to directly store the actual widget itself. Storing pointers when the value itself could have instead been stored is adding an unnecessary level of indirection, resulting in extra overhead and added complexity. For what reasons might it be required/appropriate to instead store a pointer to widget stored elsewhere?

Widgets are not of a known fixed size, so actual widget cannot be directly stored as elements. Only by referring to widgets by address do they become homogeneous, so you store a pointer to a widget instead. (this is the case with storing strings, also types such as FILE/CVector/CMap whose size is unknown)
Widgets are shared. There is one copy of the actual widget but multiple uses of it. Storing actual widgets creates duplicate copies and add redundancy to your data structure, so you store a pointer to a shared copy instead.

(following up on previous question) But I thought all elements for CVector/CMap must be pointer type! cvec_append takes a `void*`!

You will always refer to the element by its address (a void*) when adding to or retrieving it from the container, but that does not imply that the element type is itself must be pointer. The CVector and CMap are capable of storing elements of any type: int, struct, fixed-size array, pointer, and so on. When you pass/retrieve the element, you refer to it at a level of indirection away from its base type, thus an int element will be retrieved as a int *, a struct * element as a struct **, and so on.

How do I print/examine the contents of a CVector or a CMap from within gdb?

The internal structure of these types is opaque to the client (or at least will be until you complete assign3 :-), and not well-suited for you to manually dissect in the debugger. Instead, we recommend that you write your own debugging functions print_map and print_vec that iterate and print the elements. You can then call these functions from within the debugger as needed:

(gdb) p print_map(mymap)

When reading the corpus, should I try to constrain the scanf format string to only accept well-formed input or check after the fact?

Try to use scanf as fully as you can. Consider the requirement that words be in lowercase. If you scan using the loose %s format to accept all non-white chars, it will require additional processing to check that the chars scanned were lowercase. If the format string applies a scanset of lowercase letters, a non-lowercase input simply fails to scan. It is true that scanf can be a little bit goopy to use, but its powerful features provide a lot of leverage. Try to make the most of them!

To compute a bigram's prevalence, I need the total count for all the word's bigrams. I'm not currently storing that. Should I?

Given that you need the count temporarily as an input when computing prevalence and then never again, it seems best to compute and use in a local context without permanently storing it. At the time you read a bigram, you can store its frequency, but will not yet be able to assign its prevalence, as the final total count for the word's bigram is pending. After having read the entire file, run a one-time post-processing step to go back and fill in the field. Iterate over a word's bigrams to sum frequencies into total count, then iterate again to fill in the prevalence field based on the ratio of frequency to total count. Do this for every word in the corpus. The idea is to do some upfront work now to enable later efficient computation of similarity scores.

When I compare two similarity scores of type float/double using ==, I get a compiler warning that says this is unsafe. How can I fix?

It will take until next week for us to delve into floating point types and learn how naive equality has issues and why the compiler squawks about it. For now, we're good with naive but hate compiler warnings, so will disguise our intentions from the compiler with this little trick: transform if (a == b) <equal> else <notequal> into the equivalent if ((a < b) || (a > b)) <notequal> else <equal>.

My similarity/leader board results seem slightly off relative to the sample. What gives?

During the calculation of similarity, non-terminating binary decimals 1/3, 3/28, etc. have to be rounded. Depending on which data type is used (float versus double) and the order of operations, that rounding is subtly different and can result in a similarity score ever-so-slightly higher/lower when compared to a calculation done in a different manner. The upshot is that the contents of leader board can be jiggered between you and the sample due to floating point discrepancy. Assume such discrepancies will be ignored in grading. To get reliable outcomes for testing yourself, choose words with denominators (total bigram count) that are powers of 2, those ratios are exactly representable and no rounding will interfere. Next week, we'll dig into what's going on with those wacky floating point types!

Words with no bigrams in common with the target have a similarity score of 0. Should such words be considered for the leader board?

Add a candidate to the leader board if there are unfilled slots or if the candidate knocks an existing entry off the board because it has higher priority. Although a 0-similarity candidate might get added to the leader board early in the process, it won't often survive to the final list, but either way, don't make a special case out of adding or not adding it, just treat like all other candidates.

My program takes a very long time to read the corpus. When I break in gdb, it always seems to be buried in cmap_put or cmap_get? Isn't the CMap supposed to be fast?

The CMap is very efficient as long as optimized for the proper number of entries (set when calling cmap_create), but performance can critically degrade if configured for the the wrong capacity. If you aren't sure of the eventual number of entries for a CMap, use 0 as the capacity hint when creating it. A 0 capacity hint enables a special adaptive sizing behavior that will maintain efficient performance across a wide range of load.

I am using my CVector/CMap to store elements of type struct. When I retrieve elements, only the first field(s) of my struct is correct, the rest is garbage. What gives?

This symptom could indicate that you have specified the wrong size to cvec_create/cmap_create. To create a container to store elements of structname, use sizeof(structname). If you use a size that is too small, the container will only store the first bytes of the struct, not the entire contents.