Assignment 4: Into the void*

Due: Wed Nov 3 11:59 pm
Late submissions accepted until Fri Nov 5 11:59 pm

Assignment by Julie Zelenski and Michael Chang, with modifications by Nick Troccoli, Katie Creel and Brynne Hurst

Learning Goals

This assignment focuses on generics with void *, function pointers, generic memory handling and disclosure and partiality. You will be building your skills with:

the purpose and use of function pointers in C
using generic void* interfaces as a client
implementing a generic void* function using raw memory operations
understanding disclosure processes and the role of partiality in security

Overview

In this assignment, you will implement simplified versions of the unix commands ls and sort. Your ls command will print out the contents of directories in various ways. Your sort command will print out the lines in files in sorted orders.

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign4/$USER assign4

The starter project contains the following:

readme.txt: a text file where you will answer questions for the assignment
util.c mysort.c, myls.c and Makefile: three files that you modify, and their Makefile for compiling
custom_tests: the file where you should add custom tests for your programs
test_binsert.c: a program that calls your binsert function in util.c for testing purposes. You do not need to comment or modify this file.
samples: a symbolic link to the shared directory for this assignment. It contains:
- SANITY.ini and prototypes.h: files to configure and run Sanity Check. You can ignore these.
- myls_soln, mysort_soln and test_binsert_soln: executable solutions for the code you will implement.
- colors, numbers and names: sample text files for testing your unix commands
tools: contains symbolic links to the sanitycheck, submit, and codecheck programs for checking, testing and submitting your work.

Codecheck: Part of your style / code review score will be dependent on having no code issues when run through the codecheck tool. Make sure to run it to ensure your code adheres to all necessary guidelines!

Testing and Debugging

As you work through the problems below, make sure to develop incrementally and use good debugging practices. Incremental development means implementing small pieces in isolation, thoroughly testing them, and then building on them only once you know they are correct. This ensures a working implementation at each step, and if you encounter a later bug, you can be confident it is a result of your most recent changes.

Testing

As part of this assignment you should also add at least 10 additional tests of your own in the custom_tests file that show thoughtful effort to develop comprehensive test coverage.

3 - 5 tests for each of the parts: myls, binsert, and mysort
When you add a test, also document your work by including comments in the custom_tests file that explain why you included each test and how the tests relate to one another.
The tests supplied with the default SanityCheck are a start that you should build on, with the goal of finding and fixing any bugs before submitting, just like how a professional developer is responsible for vetting any code through comprehensive testing before adding it to a team repository.
We recommend you run SanityCheck early and often (and remember, it even makes a snapshot of your code to guard against editing mishaps!).
Quick links from the Assignments dropdown: Information on SanityCheck on the working on assignments page, testing strategies.

IMPORTANT NOTE ON CUSTOM FILES: by default, only files in the starter project are submitted when running the submit tool. If you create any additional files for your custom tests that you need to include in your submission, you must execute the following command once per file to add it to the list of files that will be submitted: git add MY_FILE_NAME. For instance, if you create a custom text file for testing called myfile.txt that you need to submit, the command would be git add myfile.txt. Then, when you run the submit tool, this file will be submitted as well. You can confirm files were properly included in your submission by submitting and then re-cloning your starter code project into a different location, and you should see your submitted files there.

Debugging

Good testing can help you uncover bugs that you can then track down and fix. When debugging, you should follow the debugging checklist listed in the debugging guide.

Check out the Memory/Pointers Advice and the FAQ at the bottom of this page as you're designing and writing your code!

1. Code Study: `scandir`

In a later part of this assignment, you'll be writing your own version of the ls command. To do this, you'll need to be a client of the scandir library function, which is used to create a directory listing (an array representing the contents of a directory) for a given path. It also uses two function pointers: a filter function which controls which entries to include in the list, and a comparison function which controls the order of entries in the list.

Your understanding of scandir will be critical for implementing myls. Start by reviewing the manual page for scandir to understand its general behavior.

Though usually reading the manual page is enough to understand client usage of a function, for our purposes we will dig deeper and use our growing C skills to understand some of its implementation. This can help us even better understand its implementation and see many core 107 concepts (pointers, heap, function pointers, and more) applied in a library function.

A scandir implementation based on code (from musl) is shown below, with added comments to help navigate. This function bares the inner soul of C (function pointers, dynamic allocation, triple star, oh my!) but your last three weeks of immersion into deep C waters has made you ready for this. We have picked out questions below that are key to your understanding of the function as a whole and how to use it. Go slow, draw pictures, and ask questions about anything you don't understand. A link to a plaintext version of this code is here in case you want to experiment with the implementation in your own program.

Click for text file


 1  int musl_scandir(const char *dirp, struct dirent ***namelist,
 2     int (*filter)(const struct dirent *),
 3     int (*compar)(const struct dirent **, const struct dirent **)) {
 4
 5      DIR *directory = opendir(dirp);
 6      if (directory == NULL) {
 7          return -1;
 8      }
 9
10      // Initialize an array of dirent pointers to hold the entries
11      struct dirent **dir_entries = NULL;
12      size_t dir_entries_used = 0;
13      size_t dir_entries_capacity = 0;
14
15      struct dirent *curr_entry_ptr;
16      while ((curr_entry_ptr = readdir(directory))) {
17
18          // If we should filter out this entry, skip it
19          if (filter && !filter(curr_entry_ptr)) {
20              continue;
21          }
22
23          // If we have run out of space, resize our array
24          if (dir_entries_used >= dir_entries_capacity) {
25              dir_entries_capacity = 2 * dir_entries_capacity + 1;
26
27              // If the new bytes needed > SIZE_MAX, it's invalid
28              if (dir_entries_capacity > SIZE_MAX / sizeof(*dir_entries)) {
29                  break;
30              }
31
32              struct dirent **tmp = realloc(dir_entries,
33                  dir_entries_capacity * sizeof(*dir_entries));
34              if (!tmp) {
35                  break;
36              }
37              dir_entries = tmp;
38          }
39
40          // Make heap space for this entry, and add it to the array
41          dir_entries[dir_entries_used] = malloc(curr_entry_ptr->d_reclen);
42          if (!dir_entries[dir_entries_used]) {
43              break;
44          }
45          memcpy(dir_entries[dir_entries_used++], curr_entry_ptr,
46              curr_entry_ptr->d_reclen);
47      }
48
49      closedir(directory);
50
51      // If an error occurred (errno is a global), free memory
52      if (errno) {
53          if (dir_entries) {
54              while (dir_entries_used-- > 0) {
55                  free(dir_entries[dir_entries_used]);
56              }
57          }
58          free(dir_entries);
59          return -1;
60      }
61
62      // If specified, sort the entries based on `compar`
63      if (compar) {
64          qsort(dir_entries, dir_entries_used, sizeof(*dir_entries),
65              (int (*)(const void *, const void *))compar);
66      }
67
68      *namelist = dir_entries;
69      return dir_entries_used;
70  }

Review the code above and consider the issues we highlight below. You do not need to submit answers to these questions, just work through them for your own understanding.

Review man readdir to see the definition of struct dirent. An oddity to note is that the last struct field is an array of characters that is described as of "unspecified size". The d_name portion will be custom-sized to fit a particular entry's filename including a null terminator.
Review the scandir man page to read about each of the function's arguments and its return value.
Q: How does a client indicate that they want no filtering applied to the directory listing? No sorting?
Q: Does a non-zero result from the filter callback function filter cause an entry to be kept or discarded from the list?
Q: What does it return on success, and on error?
Q: What heap memory must be freed by the client? What are the proper steps for the client to properly free that memory?
That triple-star namelist parameter is pretty scary. Q: What is this parameter used for? Why does it need a triple-star? (Hint: take a look at line 68, where that parameter is used. What is the code doing there? Also think back to the pattern we discussed in class of passing a pointer to something if the caller wants to allow it to be modified in a function).
The number of directory entries is not known in advance, so scandir allocates the array using the classic "resize-as-you-go" allocation strategy. It varies slightly from what we have seen, but you can see similarities in the code from lines 24 to 38.
Lines 41-46 make a heap copy of a struct dirent. Because of the variable-sized d_name field (see first issue above), it allocates and copies a custom size.
Line 64/65 is a little hard to parse, but it is applying a typecast to the function pointer being passed to qsort. Casting a function pointer is a sketchy thing to do, and it should generally be avoided, but it's perhaps understandable why they chose to do so here.
The scandir filter function receives its argument as a const struct dirent *; the comparison function receives its arguments as const struct dirent **. Q: Why the inconsistency?

Forging your way through this is a significant accomplishment. There is a lot going on in this code and you should feel crazy-proud of being able to dissect such a dense passage! Armed with that understanding, you're ready to move onto implementing myls. Your understanding of scandir will be critical for this next step.

2. Implement `myls`

What does `myls` do?

You have used ls many a time to list a directory's contents. Now it's your turn to implement a simplified version of this workhorse utility, which is also a nice exercise with function pointers.

The myls program operates similarly to standard ls but with many simplifications and some differences. While it may be helpful to think of myls as the same as standard ls in spirit, please don't mistakenly attempt to match the more complex features of standard ls. The list below enumerates the required features for myls. If in doubt about the expected behavior for myls, observe the behavior of the sample solution in the samples folder rather than compare to standard ls.

How does `myls` operate?

These are the required features for myls:

myls takes zero or more directory paths as arguments. It lists the directory entries from each path. If invoked with no arguments, myls prints the entries from the current directory (.)
myls only supports directory paths as arguments. If an argument refers to a file or non-existent path, it should print an error message and skip that argument. You should match the error message exactly to the solution using the error function with the text "cannot access _____", filling in the name of the invalid argument, along with an error code of 0 and a status code of 0 (so the program still continues running).
myls prints the entries one per-line.
myls ignores entries whose names start with . unless invoked with the -a flag
myls adds a trailing slash when printing the name of an entry that is itself a directory. (Note: the provided code for discerning files and directories categorizes symbolic links, such as the samples directory in the assignment folder, as non-directories, and it's fine for your program to treat symbolic links as non-directories - this is what the sample solution does)
myls prints the entries in a directory sorted in order by name. Names are compared lexicographically (i.e. strcmp). If invoked with the -z flag, the sort order is instead by type (directories ahead of non-directories). Entries of the same type are sorted by name.
myls supports no command-line flags other than -a and -z
myls should call the standard scandir function and not re-implement its functionality. Given that scandir is most of an ls program already, you just need to write the filter/comparison functions, make a single call to scandir, and print the resulting list. As an additional note, the man page for scandir contains a short example program. Minor heads up: that example program happens to print the entries in reverse order which may not be what you want/expect. Be wary of dropping in code from the documentation without reviewing it first to know what you are getting and whether you will need to adapt it.

Note: an important restriction on code you write for this problem (and for this assignment in general) is you are not allowed to use the versionsort or alphasort functions that are part of the C library as part of this assignment. You should implement your own comparison functions. A solution that violates this restriction will receive zero credit, so please verify your approach is in compliance!

`myls` Tips

myls.c is given to you with a small amount of code to handle the command-line arguments. Before starting, first read and understand the given code, work out how to incorporate it into your plans, and finally add comments to document your strategy and document the provided code.

TIP: you can use a typedef to give a nickname to a function pointer type to avoid repeating the raw syntax. It also makes it easier to make a variable of that function pointer type. Check out the FAQ at the bottom of the page for how to do this - it can help make your code cleaner. Also check out the starter code for mysort.c for an example.

Some topics to explore as a self-test: (do not submit answers)

Processing command-line arguments is crufty work. The GNU extension getopt helps to clean it up somewhat. Use man 3 getopt to learn more about how it works. How is an invalid option detected/reported when using getopt?
Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

3. Disclosure and Partiality Case Study

In lecture, we learned about the importance of safe, fair and just disclosure processes (lecture 7), and the role that partiality plays in computer security (lecture 9). Please read the following obituary of security researcher Dan Kaminsky (Stanford login required) and answer the questions below in your readme.txt file about his security work and your own beliefs about disclosure and partiality.

View Article

[1-3 sentences] What disclosure process did Dan Kaminsky follow in response to finding the DNS vulnerability? Given the context, what kind of partiality did he demonstrate in doing so?
[1 paragraph] In lecture 9 we discussed the EternalBlue incident and the Vulnerabilities Equities Process. In this case, the US showed preference towards using vulnerabilities in the interests of its own intelligence gathering capabilities or its own citizens. What degree of partiality do you think is appropriate for a government in deciding whether to disclose or stockpile vulnerabilities? (We will not be grading you based on which degree you chose. Your answer should rely on critical reflection of the four degrees and include clear reasoning of your position.)
[1 paragraph] Which of the four degrees of partiality most guides your conduct and why? (We will not be grading you based on which degree you chose. Your answer should rely on critical reflection of the four degrees and include clear reasoning of your position.)

4. Code Study: `bsearch`

Next you'll be studying the following code in order to write a generic binary insert function to efficiently insert unique elements into a sorted array using binary search. Writing a fully correct binary search is notoriously difficult, and implementing a generic void* function only adds to the challenge. (Need a refresher on the binary search algorithm? Check out this link). Below, we've provided Apple's code for generic binary search, bsearch. You'll be modifying this implementation's code as part of your own binsert function later, so it's important that you know how it works. Review carefully! If you did not get to fully work through the lab4 exercises on void*, now is a good time to shore up that knowledge before moving forward.


 1    void *apple_bsearch(const void *key, const void *base,
 2                        size_t nmemb, size_t width,
 3                        int (*compar)(const void *, const void *)) {
 4        for (size_t nremain = nmemb; nremain != 0; nremain >>= 1) {
 5            void *p = (char *)base + (nremain >> 1) * width;
 6            int sign = compar(key, p);
 7            if (sign == 0) {
 8                return p;
 9            }
10            if (sign > 0) {  /* key > p: move right */
11                base = (char *)p + width;
12                nremain--;
13            }       /* else move left */
14        }
15        return NULL;
16    }

Consider the following questions to verify your understanding (do not submit answers):

The comment /* else move left */ appears to apply to an empty statement. How does the search move left when sign is negative?
Pointer arithmetic on a void* is illegal, thus the casts to char*. Some generics code in real-world systems copies void* arguments into local variables of type char* at the start of the function and thereby avoids the need for further casting. In contrast, Apple's bsearch maintains its arguments as void* and explicitly casts before each pointer arithmetic. The argument behind this approach is the void* type communicates the special nature of this pointer, rather than misleadingly labeling it as an ordinary c-string. Keeping it a void* also means the compiler will complain should you accidentally dereference or apply pointer arithmetic. Each manipulation of the void* requires an explicit cast. That typecast confirms the programmer's deliberate intent and serves as documentation.
(Hypothetical question not related to this code) - assume the variable void *arr is the base address of the array, void *found is the address of the matching element and size_t width is the size of each element in bytes. What is the C expression that converts found into its corresponding array index?

Forging your way through this is a big accomplishment. There is a lot going on in this code and you should feel proud that you can understand library code like this to implement a generic binary search! Armed with that understanding, you're ready to move onto implementing binsert, a variation of Apple's bsearch.

5. Write `binsert`

What does `binsert` do?

lfind and lsearch provide two variants of a generic linear search. These functions are not standard C, but are commonly available, such as on our myth systems. Read man lsearch for an introduction. The feature that distinguishes lsearch from lfind is that lsearch will add the search key to the array if not found. This is a handy convenience!

Your next task is to write the generic function binsert which is a bsearch variant with this same functionality; a call to binsert will perform a binary search for the key and if no matching element is found, it will insert the key into the proper position in the sorted array. This will come in handy for the final part of the assignment, where you implement your own version of the sort command. To write binsert, you'll use the bsearch implementation code above as a starting point and modify it. Consider this function prototype for binsert:

void *binsert(const void *key, void *base, size_t *p_nelem,
   size_t width, int (*compar)(const void *, const void *));

How does `binsert` operate?

Here are the details of the function's operation:

The binsert arguments are patterned after the arguments to bsearch (review man bsearch). The one difference is that the number of elements is passed as a pointer. This is necessary as binsert will update the count when inserting a new element into the array.
If binsert does not find a matching element, it inserts the key into the array at the proper position and increments the number of elements. It is the client's responsibility to ensure the array has sufficient space to accommodate a new element. binsert does no allocation, reallocation, or deallocation of the client's array memory.
The function returns a pointer to the matching array member or to the newly added member if no existing match was found.
You can copy/paste the code from Apple's bsearch above into binsert as a starting point. It would be ideal for binsert to simply call bsearch and not repeat its code, but the standard bsearch doesn't expose the necessary information from an unsuccessful search that would allow you to properly position the new element. If you called bsearch and got a negative result, you would have to make a second search yourself to find the position, which means redundant code and duplicate execution! By subsuming the bsearch code into your binsert, the function will search once using one code path. Your solution must use the bsearch code as a starting point to write binsert, modified to fit the requirements for this problem. As a hint, you may not need to make significant modifications to the provided code. Our solution adds/modifies less than 10 lines.
If you need to move elements in the array to make room, instead of a loop to move array items one by one you should use memmove, which can scoot the whole array over in one call!

`binsert` Tips

Write your implementation in the util.c file. You will write and test this function in isolation, and then use the function later when writing the mysort program. You can test your binsert function using our provided test_binsert.c program. The test_binsert program is integrated with sanitycheck. NOTE: besides the -i and -s flags, test_binsert does not support inputs with negative numbers or strings beginning with "-". See the test_binsert.c file for more information on how to use it.

Before continuing: have you thoroughly tested your implementation of binsert? An essential part of the development process is incremental development, where you implement small pieces in isolation, thoroughly test them, and then build on them once you know they are correct. This way, if you encounter a later bug, you can be confident it is a result of your most recent changes. mysort builds on this code, so make sure you are confident in its correctness before moving on.

6. Implement `mysort`

What does `mysort` do?

For the final part of the assignment, you'll implement your own version of the Unix sort command, called mysort, that uses your binsert function as a client. sort is another filter command like uniq and tail, which you implemented versions of in the last assignment. The standard Unix sort reads input line-by-line and then prints out the lines in sorted order. It supports various command-line flags that control the sort behavior. Consider the following sample uses of sort:

myth$ cat samples/colors
red
green
green
red
blue
blue
blue
red

myth$ sort samples/colors
blue
blue
blue
green
green
red
red
red

myth$ sort -u -r samples/colors
red
green
blue

The mysort program operates similarly to standard sort but with many simplifications and some differences. While it may be helpful to think of mysort as the same as standard sort in spirit, please don't mistakenly attempt to match the more complex features of standard sort. If in doubt about the expected behavior for mysort, observe the behavior of the sample solution rather than compare to standard sort.

How does `mysort` operate?

Here are the required features for mysort:

mysort reads one file; either the named file (if specified as an argument) or standard input (if not).
The default sort order for mysort is lexicographic (alphabetical) order.
If invoked with the -l flag, the sort order is by increasing line length.
If invoked with the -n flag, the sort order is by string numerical value (applies atoi to each line and compares numbers).
There is no tie-breaker. Duplicate lines, i.e. lines that compare as equal according to sort order, can be output in any order.
If invoked with the -r flag, the sort order for mysort is reversed. (Hint: sort the input as you would normally, and just change the order you print the results)
If invoked with the -u flag, mysort discards duplicate lines. The sorted output contains one instance of each unique line from the input. The sort order determines which lines are duplicates (e.g. if sorting by length, two lines of same length are considered duplicates).
The flags to mysort may be used alone or in combination. If both -l and -n used, whichever flag is last on the command line "wins" as sort order.
mysort supports no command-line flags other than -l -n -r -u.

Internally, mysort should have the following behavior:

mysort reads a line into a stack array using fgets. This stack array should be sized to a large maximum size (see the MAX_LINE_LEN constant). We will not test on inputs containing any lines longer than this maximum. You may furthermore assume that all inputs will end with a final newline. This avoids having to make any special case out of the last line in the input.
After reading a line you intend to store, it should be copied to dynamically-allocated storage of the appropriate size. The function strdup will be handy here.
mysort should be able to handle an input containing any number of lines. Such a large array of lines could much too big for the stack, so this array must be heap-allocated. Because the number of lines cannot be determined in advance, mysort should allocate the array to a minimum initial number (see MIN_NLINES constant) and then each time it fills up, double the size.
The -u option of mysort must make repeated calls to your binsert function and only store lines that are unique. It should not store duplicate lines for later removal, and should make no calls to qsort.
If invoked without the -u flag, mysort should make exactly one call to qsort and no calls to binsert.

Important Note: in this implementation, lines should only be heap allocated if you intend to store them. In other words, you should not allocate a line on the heap if it will not ultimately need to be saved and printed out later. If you have significantly higher memory usage than the solution, issues with this are likely the reason why.

`mysort` Tips

Starting out: mysort.c is given to you with a small amount of code to handle the command-line arguments. Before starting on the program, first read and understand the given code, work out how to incorporate it into your plans, and finally add comments to document your strategy and the code.

Some topics to explore as self-test: (do not submit answers)

Processing command-line arguments is crufty work. The GNU extension getopt helps to clean it up somewhat. Use man 3 getopt to learn more about how it works. How is an invalid option detected/reported when using getopt?
Note how a typedef is used in mysort.c to give a nickname to a function pointer type to avoid repeating the raw syntax. It also makes it easier to make a variable of that function pointer type (function pointers can be variables just like any other type!)
Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

Comparing with assign3: This may seem similar to the previous assignment, and you may initially be inclined to copy-paste code over, but realize there are some significant differences in this implementation.

Unlike the last assignment, this assignment requires using a different (and also valid) strategy for reading text of unknown size - allocating it on the stack, and then only using heap allocation once you know the full size, can allocate exactly what you need, and know you will need to store it.
For the last assignment, you potentially allocated slightly more than you needed on the heap, but you didn't use a ton of space on the stack first.
Conversely, in this case you use significant stack space to store the initial string, but as a result only allocate exactly the heap memory that you need.
We want you to get exposure to both valid patterns!

Handling all your options: One of the key goals for mysort is to cleanly handle the mix of sorting options without repeating code or overly complicating things. In particular, the 4 command-line flags can each be turned on independently, making a total of 2⁴ combinations and if you aren't careful about it, you can end up making a distinct case out of each combination and debugging 16x more code than you needed to. Instead, take care to unify all the tasks that are in common, making only the tiniest of detours where needed to add code to handle their differences. You may have iterate a little to work out the details, but you can arrive at a very pleasing end result if you put your mind to it!

Memory/Pointers Advice

Stack versus heap. Take care to understand proper use of stack versus heap allocation and when each is appropriate. As a general rule of thumb, unless a situation requires dynamic allocation, stack allocation is preferred. Often both techniques are used together in a program. For example, it is desirable that a dynamically-allocated C-strings be of the proper size, e.g. the string "apple" needs 6 characters worth of storage, no more, no less. A useful idiom to learn is allocating a maximum-sized stack buffer to temporarily store a string of indeterminate length, which is later copied to a dynamically-allocated buffer of the correct size once the length is known. This is how the line-reading code for mysort should operate.
Dynamic resizing. The mysort program doesn't know the count of lines in the input in advance, so it allocates based on an estimate and grows as needed. Doubling-in-size is a classic technique you'll see again and again such as in the buffer-handling for last assignment's read_line and the entries gathered by musl_scandir.
Assert allocation requests. It's not likely you will exhaust the entire heap on a modern system, but NULL can also be returned when the heap is corrupted. Either way,there is nothing good that comes after a failed allocation, so best to just call a halt to things before it goes from bad to worse.
Pointee types should communicate the truth. If you know the type of a pointer, by all means, declare it with its specific type. Use void* only if the pointee type is not known or can vary. Conversely, don't declare what should be a void* pointer with a specific pointee type. Declaring a void* as a char* misleads the reader into thinking this pointer is a C-string and allows it to be mis-used as such with no compiler warning.
Use pointer arithmetic sparingly. Even though you can access ordinary array elements or struct fields by calculating an offset from the base address, you should use array subscripts and struct field names for readability and safety. Work within the type system whenever possible and only drop down to low-level typecasts and pointer arithmetic when raw memory is all you've got.
Prefer assignment to raw memcpy. Similarly, although memcpy can be used to do any kind of assignment, that doesn't mean it always should be used. Consider the three following ways of assigning an integer:

int dst, src;
void *pDst = &dst;
void *pSrc = &src;

dst = src;                          // option A
*(int *)pDst = *(int *)pSrc;        // option B
memcpy(pDst, pSrc, sizeof(int));    // option C

All three options accomplish the same thing (copying a 4-byte integer from src to dst). Option A, the ordinary assignment statement is direct, clean, and safe. This is always your best option as long as you are able to work within the type system. If you are forced to operate through a void*, you must instead consider options B and C. Option B, assignment using a typecast, is your second best choice. This option is readable, efficient, and relatively safe (given the correct typecast). Option C operates at the lowest-level, with no type safety whatsoever. A mismatch in the pointer types or size is silently accepted, yet causes havoc. When trying to read/write contents of a void* , go with Option B if possible. Drop down to the low-level memory copying of Option C only when you have no other choice. As a rule of thumb: if the size argument to memcpy is a compile-time constant (and especially so if it uses sizeof), this indicates you shouldn't be using memcpy/memmove. You only need to invoke raw byte-copying when the type/amount of memory being copied is determined at runtime.

Be wary of typecasts. Use typecasting sparingly. Your typecast defeats the type system and coerces the compiler to go along with it without complaint. With this power comes great responsibility. Before introducing one, consider: why is this cast necessary? what happens if it is not present? is there a safer way to communicate my intention? how might I work within the type system rather than against it? (i.e. correcting a mismatch in types rather than covering up with a cast). You should never cast away const-ness (fix the types instead!) and casting a function pointer is an extremely sketchy move (fix the function prototype instead!). You also never need to cast anything to a void*, since it is the universal recipient and is compatible with any pointer type (such a cast is redundant and can be removed). You will need a few typecasts, but add them exactly and only where you must. If your code repeatedly uses an expression that contains a necessarily unavoidable typecast, consider placing that expression into a shared helper to be invoked everywhere rather than repeat code.

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

We would also appreciate if you filled out this homework survey to tell us what you think once you submit. We appreciate your feedback!

Grading

Here is the tentative point breakdown:

Readme (10 points)

Functionality (95 points)

Sanity cases (22 points) Correct results on the default sanity check tests.
Comprehensive/stress cases (30 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
Clean compile (2 points) Compiles cleanly with no warnings.
Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
Reasonable efficiency (15 points) We expect your programs and code to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality...). Use valgrind and time to measure your programs' efficiency vs. the solutions'.
custom_tests (5 points) Your custom_tests file should include tests of your own that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code review (buckets together weighted to contribute ~20 points)

The grader's code review is scored into a bucket per assignment part to emphasize the qualitative features of the review over the quantitative. The style guide is a great overall resource for good program style. Here are some highlights for this assignment:

Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
Documentation. You should document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.
Codecheck compliance. We will be running your code through codecheck and verifying that there are no issues present.

For more information about style guidelines to use for assignments, take a look at the CS107 Style Guide, linked to from the Assignments dropdown.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of two workhorse Unix commands using raw memory, function pointers, and generics, and reflected more on partiality and proper disclosure. Way to go! You should now have great command of memory management and the client use of a void* interface and the generic implementation. The critical mastery is correctly passing/receiving the void*s flying around: what level of indirection to use, what typecasts are needed, using * and & in exactly and only the necessary places. You should know without question how to determine the necessary level of indirection and correctly apply it. It's also valuable to understand and be able to reason through the consequences of getting it wrong.

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before exams.

To read the input, the mysort program reads each line into a maximum-sized piece of stack memory used temporarily and the copies to right-sized dynamic memory for persistent storage. This stack-heap dance is a common technique in situations like this. Explain why this approach is necessary/appropriate given the characteristics of the stack and heap.
The strcmp function almost matches the prototype for the comparison function argument of bsearch and qsort. A sketchy typecast (such as the one exhibited by scandir) could be used to hush the compiler's complaint about the mismatch, but this would be a huge mistake. strcmp is pretty much never the right function to use as a callback comparison function. Explain why not.
Read the man page for lfind and describe how to use lfind with an appropriate callback to re-implement the functionality of strlen.

Frequently Asked Questions

The last sanity check test's output is really long. Is there anything I can do to more easily read it or other test output?

The last test uses a long file of numbers for sorting. If you pass the test, the entire output will not be printed, but if your output does not match then it will print the entire diff, which can be annoying. One tip to better navigate potential long sanity check outputs is either to just scroll in your terminal if you're trying to view previous test output, or save the sanity check output to file using > and access it that way. One note is that saving it to file may display strange characters due to the formatting and coloring of the output. To get around this, you can open the file with proper formatting by using the following command: less -R mysanityfile.txt, where mysanityfile.txt is the name of the file where you saved the sanity check output. Then, you can use the mouse or arrow keys to scroll through the output more easily. To exit this view, press q. To summarize these steps:

myth$ tools/sanitycheck > outputfile.txt    # run sanitycheck and save its output to outputfile.txt
myth$ less -R outputfile.txt                # view outputfile.txt in its own scrolling view

How can I print arrays in GDB?

If you print a stack array from within the function in which it is declared, gdb will show the array and its contents. However, it cannot automatically do the same in other contexts, such as for a heap array or for an array/pointer passed into a function. Instead, you must use different GDB syntax - p ELEM@COUNT, where ELEM is the first element to print and COUNT is the count of elements to print. ELEM and COUNT are C expressions and can refer to any variables in the current scope. E.g. p argv[0]@argc prints the entire argv array, and p argv[1]@2 prints the 2-element portion starting at index 1. You can also add in a typecast if needed. For example, if you are given a parameter ptr of type void* that you know is the base address of an array of nelems elements of type char*, you could print the entire array as p *(char **)ptr@nelems.

Since a function pointer is just a variable, how do I declare a parameter/variable of function pointer type?

The syntax for this is a bit ugly. The name of the variable is buried in the middle of the declaration and you have to read from inside out. The declaration below says the variable c is a pointer to a function that takes two const void* parameters and returns an int.

int (*c)(const void *, const void *) = my_compare;

Making a typedef for the function pointer type is a nice way to clean this up. The typedef below establishes cmp_fn_t as a nickname for int (*)(const void *, const void*). This typedef means you can now use cmp_fn_t as the type for a variable, parameter, or return type, in place of the longhand form.

typedef int (*cmp_fn_t)(const void *, const void *);

cmp_fn_t c = my_compare;

See your C reference or web search for more info on typedef. A neat online resource for unraveling complex C declarations is cdecl.org.

How do I resolve the compiler warning `pointer of type void * used in arithmetic`?

The pointer arithmetic expression pointer + n implicitly scales n by the size of the pointee before adding it to pointer. If pointer is a void*, its pointee size is unknown, and the compiler will complain because it cannot apply the proper scaling. To perform pointer arithmetic on a void* you must first cast it to indicate the desired scaling. The typical approach is to cast to char* to get no scaling and then manually multiply yourself.

My custom sanitycheck tests for mysort report mismatch due to re-ordering of equal lines. Is this a problem?

There is no required tie-breaker for equal lines - the spec allows mysort to print them in any order. But sanity check is looking for an exact match so you should construct your custom tests so as to not trip on spurious discrepancies. In grading, our tests will accept any valid reordering of equal lines.

Learning Goals

Overview

Testing and Debugging

Testing

Debugging

1. Code Study: scandir

2. Implement myls

What does myls do?

How does myls operate?

myls Tips

3. Disclosure and Partiality Case Study

4. Code Study: bsearch

5. Write binsert

What does binsert do?

How does binsert operate?

binsert Tips

6. Implement mysort

What does mysort do?

How does mysort operate?

mysort Tips