Assignment 4: Into the void*

Due: Mon May 7 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed May 9 11:59 pm

Assignment by Julie Zelenski and Michael Chang

He who fights monsters might take care lest he thereby becomes a monster. And when you gaze long into a void*, the void* also gazes into you.
—Friedrich Nietzsche, Beyond Good and Evil

Learning goals

Completing this assignment will level up your understanding and skills with:

  • the purpose and use of function pointers in C
  • using generic void* interfaces as a client
  • implementing a generic void* function using raw memory operations
  • debugging memory errors like a C warrior-ninja!

You will be writing simplified version of the unix programs ls and sort.

Get started

Check out the starter project from your cs107 repo using the command

git clone /afs/ir/class/cs107/repos/assign4/$USER assign4

The starter project contains various C files and supporting Makefile and custom_tests files. In the samples subdirectory, you'll find our sample solution programs.

1. Code study: scandir

The library function scandir provides an example use of client callbacks in C. This function is used to create a directory listing for a given path. It uses two function pointers: a filter function which controls which entries to include in the list and a comparison function which controls the order of entries in the list.

You will be a client of scandir for this assignment. Before using it, let's look inside and see how the function is implemented.

A scandir implementation (from musl) is shown below. This function bares the inner soul of C (function pointers, dynamic allocation, triple star, oh my!) but your last three weeks of immersion into deep C waters has made you ready for this. Go slow, draw pictures, and ask questions about anything you don't understand.

 1  int musl_scandir(const char *path, struct dirent ***res,
 2        int (*sel)(const struct dirent *),
 3        int (*cmp)(const struct dirent **, const struct dirent **))
 4  {
 5      DIR *d = opendir(path);
 6      struct dirent *de, **names = NULL, **tmp;
 7      size_t cnt = 0, len = 0;
 8
 9      if (!d) return -1;
10
11      while ((de = readdir(d))) {
12          if (sel && !sel(de)) continue;
13          if (cnt >= len) {
14              len = 2*len+1;
15              if (len > SIZE_MAX/sizeof(*names)) break;
16              tmp = realloc(names, len * sizeof(*names));
17              if (!tmp) break;
18              names = tmp;
19          }
20          names[cnt] = malloc(de->d_reclen);
21          if (!names[cnt]) break;
22          memcpy(names[cnt++], de, de->d_reclen);
23      }
24
25      closedir(d);
26
27      if (errno) {
28          if (names) while (cnt-- > 0) free(names[cnt]);
29          free(names);
30          return -1;
31      }
32      if (cmp) qsort(names, cnt, sizeof *names, 
32                   (int (*)(const void *, const void *))cmp);
33      *res = names;
34      return cnt;
35  }

Review the code above and consider the issues we highlight below. You do not need to submit answers to these questions, just work through them for your own understanding. You want a solid grasp on how scandir operates so you can successfully use it in writing the myls program.

  • Review man readdir to see the definition of struct dirent. An oddity to note is that the last struct field is an array of characters that is described as of "unspecified size". The d_name portion will be custom-sized to fit a particular entry's filename including a terminating null.
  • Decipher the types of each of the function's arguments. (Here is it pasted into decl.org). Draw a picture of the memory layout of the stack frame for scandir including all parameters and local variables. Use this diagram when tracing the code to see what it happening with the memory.
  • That triple-star res parameter is pretty scary. What is this parameter used for? Why does it need a triple-star?
  • How does a client indicate that they want no filtering applied to the directory listing? No sorting?
  • I find the choice of the names len and cnt rather unhelpful in indicating their purpose. The names variable is also misleading -- is this an array of string names? How might you rename these variables for better readability?
  • Line 12 uses continue, a C construct you may not have seen before. What is its purpose here? This construct is not used that often, but good to know what it does when you see it.
  • Does a non-zero result from the filter callback function sel cause an entry to be kept or discarded from the list?
  • Line 15 is a near-repeat of a piece of code we looked at in calloc (lab3). What is the purpose of this test and what would be the consequence of removing it?
  • The number of directory entries is not known in advance, so scandir allocates the array using classic "resize-as-you-go" allocation strategy. This code should look familiar, but there are a few details to examine closely:
    • It calls realloc on line 16 without having made an initial call to malloc. Why is this valid? (Hint: review the realloc man page to see how this particular case is handled).
    • It uses the expression len * sizeof(*names) to compute the number of total bytes. If names is NULL, this expression seems to be dereferencing a NULL pointer but remember that sizeof doesn't evaluate the expression, just determines the expression's type and reports the size of that type.
    • The size expression could have equivalently been written as len * sizeof (struct dirent *). What is the benefit to using sizeof(*names) instead of explicitly naming the type?
  • On line 16, it assigns the return value from realloc to tmp and two lines later copies from the pointer from tmp to names. Why does it not just assign directly to names?
  • Lines 20-22 make a heap copy of a struct dirent. Because of the variable-sized d_name field (see first issue above), it allocates and copies a custom size.
  • Line 27 refers to a mysterious errno which comes out of left field. Read man 3 errno to learn more about its purpose and function. If an allocation failure occurred, what will be the value of errno? (Hint: read NOTES section of the malloc man page)
  • Line 32 is a little hard to parse, but it is applying a typecast to the function pointer being passed to qsort. Try compiling the code both with and without the cast. What warning/error do you get without the cast? (I argue that casting a function pointer is a sketchy thing to do, but can appreciate why they chose to do so here. Do you agree?)
  • The scandir filter function receives its argument as a const struct dirent *; the comparison function receives its arguments as const struct dirent **. Why the inconsistency?
  • Trace the various malloc/free calls to understand what memory is allocated and deallocated within the function. What deallocation remains to be done after the function exits? What are the proper steps for the caller to properly free that memory?

Forging your way through this is a significant accomplishment. There is a lot going on in this code and you should feel crazy-proud of being able to dissect such a dense passage! Armed with that understanding, you're ready to move onto implementing myls. This is a tasty little piece of cake that rewards you for learning how to use scandir.

2. Implement myls

You have used ls many a time to list a directory's contents, now it will be your turn to implement a simplified version of this workhorse utility.

The myls program operates similarly to standard ls but with many simplifications and some differences. While it may be helpful to think of myls as the same as standard ls in spirit, please don't mistakenly attempt to match the more complex features of standard ls. The list below enumerates the required features for myls. If in doubt about the expected behavior for myls, observe the behavior of the sample solution rather than compare to standard ls.

  • myls takes zero or more directory paths as arguments. It lists the directory entries from each path. If invoked with no arguments, myls prints the entries from the current directory (.)
  • myls only supports directory paths as arguments. If an argument refers to a file or non-existent path, an error message is printed and that argument is skipped.
  • myls prints the entries one per-line.
  • myls ignores entries whose names start with . unless invoked with the -a flag
  • myls adds a trailing slash when printing the name of an entry that is itself a directory.
  • myls prints the entries in a directory sorted in order by name. Names are compared lexicographically (i.e. strcmp). If invoked with the -z flag, the sort order is instead by type (directories ahead of non-directories). Entries of the same type are sorted by name.
  • myls supports no command-line flags other than -a and -z

myls should call the standard scandir function and not re-implement its functionality. Given that scandir is most of an ls program already, you just need to write the filter/comparison functions, make a single call to scandir, and print the resulting list.

You may want to read ahead to the section on "Starter code" for guidance on reviewing the initial myls.c program.

This little program is a nice warmup exercise on callback functions!

3. Code study: bsearch

Writing a fully correct binary search is notoriously difficult. (A good read: Are you one of the 10% of programmers who can write a binary search?) Implementing as a generic void* function only adds to the challenge. Below is Apple's code for bsearch, which I find to be a fairly tight and reasonably readable implementation.

void *apple_bsearch(const void *key, const void *base, size_t nmemb,  
               size_t width, int (*compar)(const void *, const void *))
{
    for (size_t nremain = nmemb; nremain != 0; nremain >>= 1) {
        void *p = (char *)base + (nremain >> 1) * width;
        int sign = compar(key, p);
        if (sign == 0)
            return p;
        if (sign > 0) {  /* key > p: move right */
            base = (char *)p + width;
            nremain--;
        }       /* else move left */
    }
    return NULL;
}

You are going to implement a binsert function that is based on the above code, so it's very important that you know how it works. Please review carefully! Consider the following questions to verify your understanding.

  • For problem 2 of assign1, we read Joshua Bloch's article Nearly All Binary Searches and Mergesorts are Broken. Does the apple_bsearch exhibit the bug called out in the article? Justify why or why not.
  • The comment /* else move left */ appears to apply to an empty statement. How does the search move left when sign is negative?
  • Pointer arithmetic on a void* is illegal, thus the casts to char*. In exercise 1 of lab4, you reviewed musl_memmove, which copied its void* arguments into local variables of char* at the start of the function and thereby avoided the need for further casting. In contrast, apple_bsearch maintains its arguments as void* and explicitly casts before each pointer arithmetic. My preference is for this latter strategy. The void* type communicates the special nature of this pointer, rather than misleadingly labeling it as an ordinary c-string. Keeping it a void* also means the compiler will squawk should you accidentally dereference or apply pointer arithmetic. Each manipulation of the void* requires an explicit cast. That typecast confirms the programmer's deliberate intent and serves as documentation.
  • Assume the variable void *arr is the base address of the array, void *found the address of the matching element and size_t width the size of each element in bytes. What is the C expression that converts found into its corresponding array index?

If you did not get to fully work through the lab4 exercises on void*, now is a good time to shore up that knowledge before moving forward.

4. Write binsert

lfind and lsearch provide two variants of a generic linear search. These functions are not standard C, but are commonly available, such as on our myth systems. Read man lsearch for an introduction. The feature that distinguishes lsearch from lfind is that lsearch will add the search key to the array if not found. This is a handy convenience!

You are to write the generic function binsert which is a bsearch variant with this same functionality. A call to binsert will perform a binary search to search for the key and if no matching element is found, it will insert the key into the proper position in the sorted array. Consider this function prototype for binsert:

void *binsert(const void *key, void *base, size_t *nel, 
               size_t width, int (*cmp)(const void *, const void *));

Specific details of the function's operation:

  • The binsert arguments are patterned after the arguments to bsearch (review man bsearch). The one difference is that the number of elements is passed by reference. This is necessary as binsert will update the count when inserting a new element into the array.
  • If binsert does not find a matching element, the key is inserted into the array at the proper position and the number of elements is incremented. It is the client's responsibility to ensure the array has sufficient space to accommodate a new element. binsert does no allocation, reallocation, or deallocation of the client's array memory.
  • The function returns a pointer to a matching array member or to the newly added member if no existing match was found.
  • You can copy/paste the code from apple_bsearch above into binsert to get your starting point. It would be ideal for binsert to simply call bsearch and not repeat its code, but the standard bsearch doesn't expose the necessary information from an unsuccessful search that would allow you to properly position the new element. If you called bsearch and got a negative result, you would have to make a second search yourself to find the position, which means redundant code and duplicate execution, yuck! By subsuming the bsearch code into your binsert, the function will search once using one code path.

Write your implementation in the util.c file. You will write and test this function in isolation, and then use the function later when writing the mysort program. You can test your binsert function using our provided test_binsert.c program. The test_binsert program is integrated with sanitycheck.

5. Implement mysort

What does the sort command do?

The unix sort program is another filter like uniq and tail.

sort reads input line-by-line and then prints out the lines in sorted order. It supports various command-line flags that control the sort behavior. Consider the following sample uses of sort:

myth> cat samples/colors
red
green
green
red
blue
blue
blue
red

myth> sort samples/colors 
blue
blue
blue
green
green
red
red
red

myth> sort -u -r samples/colors 
red
green
blue

How does mysort operate?

The mysort program operates similarly to standard sort but with many simplifications and some differences. While it may be helpful to think of mysort as the same as standard sort in spirit, please don't mistakenly attempt to match the more complex features of standard sort. If in doubt about the expected behavior for mysort, observe the behavior of the sample solution rather than compare to standard sort.

Here are the required features for mysort:

  • mysort reads one file; either the named file (if specified as argument) or standard input (if not).
  • The default sort order for mysort is lexicographic order.
  • If invoked with -l flag, the sort order is by line length.
  • If invoked with the -n flag, the sort order is by string numerical value (applies atoi to each line and compares numbers).
  • There is no tie-breaker. Duplicate lines, i.e. lines that compare as equal according to sort order, can be output in any order.
  • If invoked with the -r flag, the sort order for mysort is reversed.
  • If invoked with the -u flag, mysort discards duplicate lines. The sorted output contains one instance of each unique line from the input. The sort order determines which lines are duplicates (e.g. if sorting by length, two lines of same length are considered duplicates).
  • The flags to mysort may be used alone or in combination. If both -l and -n used, whichever flag is last on the command line "wins" as sort order.
  • mysort supports no command-line flags other than -l -n -r -u.

Requirements that apply to the internal implementation of mysort:

  • mysort reads a line into a stack array using fgets. This stack array should be sized to a large maximum size (see MAX_LINE_LEN constant). We will not test on inputs containing any lines longer than this maximum. You may furthermore assume that all inputs will end with a final newline. This avoids having to make any special case out of the last line in the input.
  • After reading a line you intend to store, it should be copied to dynamically-allocated storage of the appropriate size. The function strdup will be handy here.
  • mysort should be able to handle an input containing any number of lines. Such a large array of lines could much too big for the stack, so this array must be heap-allocated. Because the number of lines cannot be determined in advance, mysort should allocate the array to a minimum initial number (see MIN_NLINES constant) and then each time it fills up, double the size.
  • The -u option of mysort must call your binsert function and only store lines that are unique. It should not store duplicate lines for later removal.
  • If invoked without the -u flag, mysort should make exactly one call to qsort and no calls to binsert. If invoked with the -u flag, it should make no calls to qsort and repeated calls to binsert.

One of the key goals for mysort is to cleanly handle the mix of sorting options without repeating code or overly complicating things. This requires thoughtful design and you may have to iterate a little until you arrive at a pleasing end result.

Review and comment starter code

Both myls.c and mysort.c are given to you with a small amount of code to handle the command-line arguments. Before starting on either program, first read and understand the given code, work out how to incorporate it into your plans, and finally add comments to document your strategy. The starter code for this assignment is intended to help you get going, but you are free to remove/change this code as you prefer.

Some topics to explore as self-test: (do not submit answers)

  • Read down into the man page for readdir for info on struct dirent and file types.
  • Processing command-line arguments is crufty work. The GNU extension getopt helps to clean it up somewhat. Use man 3 getopt to learn more about how it works. How is an invalid option detected/reported when using getopt?
  • Note how a typedef is used in mysort.c to give a nickname to a function pointer type to avoid repeating the raw syntax.
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

As per usual, the code we provide has been stripped of its comments and it will be your job to provide the missing documentation.

Advice/FAQ

Don't miss out on the good stuff in our companion document!
Go to advice/FAQ page

Grading

Here is the tentative point breakdown:

Functionality (90 points)

  • Sanity cases (25 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (30 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
  • Reasonable efficiency (15 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality...).

Code review (buckets together weighted to contribute ~20 points)

  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

On-time bonus (+5%)

The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. The bonus is calculated as a percentage of the point score earned by the submission.

Finish and submit

Review the How to Submit page for instructions. Submissions received by the due date receive the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!

How did it go for you? Review the post-task self-check.

Further reading

I highly recommend you check out: Engineering a sort function
This article is a great read on the thought and engineering that went into the qsort library function. Jon Bentley and Doug McIlroy are two Hall of Famers from the former Bell Labs and anything they write is worth reading!