Due date: Mon Apr 18 11:59 pm - Hard deadline: Wed Apr 20 11:59 pm
Assignment by Julie Zelenski
[Quick Links: Implementation details, Advice page, Grading]
Completing this assignment should sharpen your skills in:
- void* generic interfaces in C
- client callback functions

You are to implement two programs (one small guided exercise and a larger follow-on program) that feature client use of void* generic interfaces. We provide the efficient and full-featured CVector and CMap, the C versions of what you have previously known as ArrayList and HashMap. As C has no sophisticated language support for type-safe generics, these containers have no choice but to work via untyped pointers and manipulation of raw memory. The client's role is considerably more challenging given the lack of clear pointer types. By the time you're done, you'll appreciate that C is a versatile language that puts few restrictions in your way, but you'll also wish it didn't always play it so fast and loose -- there is danger here!
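As a warm-up for the void* conventions these containers rely on, the same idea appears in the C library's qsort: the comparison callback receives untyped pointers to elements and must cast them back to the true element type. A minimal standalone sketch (not assignment code):

```c
#include <stdlib.h>
#include <string.h>

// qsort hands each element to the callback as a const void*. If the
// array's elements are char*, each const void* is really the address
// of an element, i.e. a const char**, and must be cast back before use.
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char **)a, *(const char **)b);
}
```

A client sorting an array of strings would call `qsort(words, n, sizeof(words[0]), cmp_str)`; the same cast-the-void*-back pattern is exactly what the CVector and CMap callbacks will require of you.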
The searchdir program recursively searches a directory to find and print those files whose name contains the user's search string. The starter code provides a draft version of the program that you are to modify and extend to full functionality. Here are the specific steps to take:
1. Familiarize yourself with opendir/readdir/closedir, stat, and sprintf, and skim our code to see how we are using them to access the filesystem and manipulate file paths.
2. Right-size the stored paths. Each path is built in a buffer of PATH_MAX (a whopping 4K) of which only a fraction will be used. The CVector of matches is created to hold elements of size PATH_MAX, and when adding an entry, the full-sized buffer is copied and stored. At 4K per element, a CVector of 300 paths accumulates a memory footprint of over 1MB -- ouch! Change the element type of the CVector to instead store each path as a pointer to a right-sized dynamically-allocated string rather than that wasteful huge buffer. This will necessitate modifying how the CVector is created and disposed and how its elements are added and accessed. Storing 300 right-sized paths in the CVector now requires several orders of magnitude less memory than before.
3. Avoid revisiting directories. Create a CVector named visited and use it to store the inode number of each directory that has been visited. The rationale for tracking by inode instead of name is that the inode is unique and will detect a visited directory even if it is linked under a different name. This task gives you a chance to experiment with storing a different kind of element type in the CVector, using cvec_search, and writing a comparison callback.
4. Sort the matches. cvec_sort is just what you need, and a little more practice writing callbacks for you.

When finished, the behavior of your searchdir should be as follows:
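The pointer-element scheme can be sketched in isolation (function names here are my own invention; consult cvector.h in the starter project for the exact cvec_* specifics):

```c
#include <stdlib.h>
#include <string.h>

// Make a heap copy sized exactly to the path, instead of keeping a
// full PATH_MAX buffer per match. The CVector element type then
// becomes char* (elemsz of sizeof(char*)), and you append the address
// of this pointer.
static char *rightsized_copy(const char *buf)
{
    char *copy = malloc(strlen(buf) + 1);   // exact fit, not 4K
    strcpy(copy, buf);
    return copy;
}

// Shape of the cleanup callback for a CVector of char* elements: the
// void* argument is the address of an element, i.e. a char**, and
// cleanup must free the pointed-to string.
static void cleanup_str(void *addr)
{
    free(*(char **)addr);
}
```

Supplying a cleanup function like this when creating the vector is what lets disposal reclaim the dynamically-allocated strings along with the vector itself.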
Here are a few sample runs:
myth> searchdir .h
./cmap.h
./cvector.h
myth> searchdir ea /afs/ir/class/cs107/samples/assign0
/afs/ir/class/cs107/samples/assign0/sea_frags
/afs/ir/class/cs107/samples/assign0/dream_frags
/afs/ir/class/cs107/samples/assign0/preamble_frags
/afs/ir/class/cs107/samples/assign0/reassemble_soln
Compiling a thesaurus seems a task best suited for the nuance and intelligence of a human etymologist, but a hard-working yet otherwise dull computer supplied with an immense stockpile of data from which to draw statistical correlations can also be surprisingly effective at the job. My colleague Greg Valiant demonstrated a sophisticated version of this technique at the recent CS department retreat which gave me the inspiration to adopt the idea for use in this assignment.
The idea is to infer probable synonyms by dredging an enormous corpus of representative English text to find words that are used similarly to the target word. For any word, you can extract from the corpus a list of all the bigrams (phrases of length 2) in which it appears. The number of bigrams in common between the lists for the target and another word tracks how similarly the words are used. Rather than simply counting the number in common, we'll actually use a weighted contribution based on how prevalent each bigram is, but the basic idea stays the same.
Let's make this more concrete with an example. "Green Eggs and Ham" by Dr. Seuss is being used as the corpus. Below are a few bigrams with their frequency:
try them 4
will try 1
eat green 1
eat them 25
not eat 11
you eat 2
would eat 2
will eat 9
The bigrams excerpted above are those that include the word "try" or "eat". Let's trace the calculation of the similarity of these two words. "try" appears in just two bigrams. Sum the bigram frequencies to obtain a total occurrence count of 5 for "try". The prevalence of a bigram is defined as the ratio of the bigram frequency to the word's total occurrence count. For the word "try", the prevalence of "try them" is 4/5 or 0.8, and the prevalence of "will try" is 1/5 or 0.2. For "eat", the prevalence of "eat them" is 25/50 or 0.5 and of "will eat" is 9/50 or 0.18.
We define the similarity score as the sum of the prevalences of the bigrams in common, divided by two. The words "eat" and "try" have two bigrams in common: "~ them" and "will ~". The prevalences for "~ them" are (0.8 + 0.5) and for "will ~" are (0.2 + 0.18). The sum is 1.68; divide by 2 for a final similarity of 0.84. Similarity increases with more bigrams in common or higher prevalences. Similarity ranges from 0 (no bigrams in common) to a maximum of 1.0 (all bigrams in common).
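The arithmetic above can be double-checked in a few lines of C, with the Green Eggs and Ham counts hard-coded (a sketch for verification only, not assignment code):

```c
// Prevalence = bigram frequency / word's total occurrence count.
static double prevalence(int bigram_freq, int total_count)
{
    return (double)bigram_freq / total_count;
}

// Similarity of "try" and "eat" from the table above.
static double try_eat_similarity(void)
{
    int try_total = 4 + 1;                     // "try them" + "will try" = 5
    int eat_total = 1 + 25 + 11 + 2 + 2 + 9;   // all six "eat" bigrams = 50

    double sum = prevalence(4, try_total) + prevalence(25, eat_total)   // "~ them"
               + prevalence(1, try_total) + prevalence(9, eat_total);   // "will ~"
    return sum / 2;   // (0.8 + 0.5 + 0.2 + 0.18) / 2 = 0.84
}
```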
To find synonyms for a target word, the program will compute the similarity score for the target against every other word in the corpus and pull out those words with the highest similarity. These candidates are offered as the most probable synonyms.
The Google corpus comprises over 5 million bigrams, which is a significant amount of data to store and process. But never fear, you are building on CMap and CVector, which provide flexible storage and efficient access when appropriately used. The recommended arrangement for storing the corpus information is a CMap associating each word with a CVector of its bigrams. Each CVector element is a struct packaging a bigram phrase with its frequency/prevalence.
It is important that you be thoughtful in how you store/arrange/access the data, as careless choices can have a profound impact on the program's runtime and memory requirements. For example, were you to store each bigram phrase in a buffer sized to hold the full 100-char maximum, 10 million stored bigrams (each corpus bigram is stored under both of its words) would chew through 1GB of storage, and that's just for the phrases! You absolutely should right-size strings rather than waste a full max-sized buffer, and furthermore, do you even need to store both words in a bigram phrase? Every bigram associated with "enthusiastic" duplicates the word ("enthusiastic applause", "is enthusiastic"). If you shorten each to "~ applause" and "is ~", you eliminate many redundant characters. Do you see how this choice also makes it simpler to determine that two words share the same bigram?
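One possible shape for the per-word element and the "~" abbreviation (the struct and function names are my own, not mandated by the assignment):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Element stored in a word's bigram vector: the abbreviated phrase
// plus its count. (Prevalence could be stored here too.)
typedef struct {
    char *phrase;   // e.g. "~ applause" or "is ~", right-sized on the heap
    int freq;
} bigram;

// Build the abbreviated phrase for a bigram containing `word`,
// collapsing the word itself to "~".
static char *abbreviate(const char *first, const char *second, const char *word)
{
    const char *a = strcmp(first, word) == 0 ? "~" : first;
    const char *b = strcmp(second, word) == 0 ? "~" : second;
    char *phrase = malloc(strlen(a) + strlen(b) + 2);   // space + NUL
    sprintf(phrase, "%s %s", a, b);
    return phrase;
}
```

With this scheme, "eat them" stored under "eat" and "try them" stored under "try" both become the identical string "~ them", so a shared bigram is detected by a plain string comparison.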
To compute the similarity score, you must intersect two bigram vectors to find the elements in common. Given that you compute the intersection for thousands of word pairs, efficiency is key! The naive approach of iterating over one vector and repeatedly calling the unsorted vector search on the second will work correctly but runs much too slowly to be practical. But the simple fix of switching to the sorted vector search makes this algorithm much more viable. A sorted search requires that the vector be sorted beforehand. Is it more efficient to build the vector in sorted order as you construct it, or to do a one-time sort after it has been populated? Must both vectors be sorted, or just one? Is there an advantage in which of the two vectors you iterate and which you search? Try out the options for yourself to find the answers!
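The iterate-one/sorted-search-the-other strategy can be sketched with plain arrays and the C library's bsearch standing in for the sorted vector search (a standalone illustration, not the required implementation):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char **)a, *(const char **)b);
}

// Count the phrases two sorted arrays have in common: iterate one
// array and binary-search the other. With n elements iterated and m
// searched, this is O(n log m), versus O(n * m) for the naive
// unsorted search.
static int count_common(const char **iter, int niter,
                        const char **search, int nsearch)
{
    int common = 0;
    for (int i = 0; i < niter; i++)
        if (bsearch(&iter[i], search, nsearch, sizeof(char *), cmp_str))
            common++;
    return common;
}
```

Note that only the searched array needs to be in sorted order for this to work, which is one hint toward the questions posed above.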
Another factor that can impact efficiency is appropriate capacity hints when creating a CMap. The CMap is optimized for the hinted capacity on creation and does not change thereafter. The wrong choice of hinted capacity can cause the table to be grossly over or under-sized and result in memory or time inefficiency. If you aren't sure of the eventual number of entries for a CMap, use 0 as the capacity hint when creating it. A 0 capacity hint enables a special adaptive sizing behavior that will maintain efficient performance across a wide range. The hinted capacity is less critical for CVector, which adapts as needed.
The synonyms program finds the most probable synonyms within the corpus for one or more target words. Here are its detailed requirements:
Read corpus frequencies. The corpus data is provided already processed, with the number of occurrences of each bigram tallied (thank you, Google Ngram team!). Each line in the file is expected to list a bigram and its frequency in the format <first> <second> <integer> as shown below. The words consist of only lowercase letters, and the maximum allowable word length is 50 characters. A line can contain arbitrary whitespace before/between/after the tokens, up to a line length of at most 512 characters.
common ground 1644
common interests 355
You may assume that each line is no longer than 512 chars, that the corpus contains no duplicated bigrams, and that each frequency is greater than zero; you do not need to verify these assumptions nor handle inputs that violate them. All other malformed entries (non-lowercase letters, too-long words, punctuation, blank lines, etc.) should be reported with a clear and helpful message and treated as a fatal error. You should read the corpus file in a single linear pass over the input, with no jumping around or rewinding.
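One possible shape for a line parser under these rules (an sscanf-based sketch with names of my own choosing; your approach may differ):

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { MAX_WORD = 50 };

// A well-formed word is non-empty, at most MAX_WORD chars, all lowercase.
static bool valid_word(const char *s)
{
    if (*s == '\0' || strlen(s) > MAX_WORD) return false;
    for (; *s != '\0'; s++)
        if (!islower((unsigned char)*s)) return false;
    return true;
}

// Parse "<first> <second> <integer>" allowing arbitrary surrounding
// whitespace. Buffers must hold MAX_WORD + 2 chars: %51s reads one
// char past the limit so valid_word can detect an over-long word.
// The trailing " %c" deliberately fails to match on a well-formed
// line, making the expected conversion count exactly 3.
static bool parse_line(const char *line, char first[], char second[], int *freq)
{
    char extra;
    if (sscanf(line, " %51s %51s %d %c", first, second, freq, &extra) != 3)
        return false;   // wrong token count or trailing junk
    return valid_word(first) && valid_word(second);
}
```

Per the assumptions above, the frequency itself needs no validation; a false return here is the fatal-error case your program must report.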
Case-insensitivity. Distinctions of case should be ignored. "SHOUT" and "shout" are considered the same word. Although well-formed corpus input must already be in lowercase, the target arguments supplied by the user may require conversion.
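Converting a user-supplied target to lowercase takes only a short loop (a trivial sketch, assuming plain ASCII words as in the corpus):

```c
#include <ctype.h>

// Lowercase a target argument in place so it can be compared against
// the all-lowercase corpus words.
static void to_lowercase(char *s)
{
    for (; *s != '\0'; s++)
        *s = tolower((unsigned char)*s);
}
```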
Find synonyms. For each target word, examine every word in the corpus and consider it as a possible synonym for the target. Use a CVector as a "leader board" that tracks the N most likely candidates found so far. Update the leader board whenever you find a better candidate. Compare candidates using the following metrics:
The size of the leader board is constrained by the constant NUM_RESULTS. This constant is set to 5 by default, but can be changed by editing the constant definition in synonyms.c. The number of entries offered on the leader board should respect the value of this constant. For efficiency purposes, maintain just the top N entries in your leader board vector. The alternative of loading an enormous vector with every word from the corpus and waiting until the end to prune to the top N will result in significantly increased runtime and memory overhead.
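One shape the top-N maintenance can take, sketched with a plain array in place of the CVector (struct, names, and eviction policy are illustrative, not required): keep the board sorted by decreasing score, and let a new candidate in only if the board has room or the candidate beats the current last-place entry.

```c
enum { NUM_RESULTS = 5 };   // mirrors the constant in synonyms.c

typedef struct {
    const char *word;
    double score;
} candidate;

// Insert cand into board (at most NUM_RESULTS entries, sorted by
// decreasing score), shifting lower-scoring entries down. *count is
// the current number of entries. Candidates that cannot beat a full
// board's last place are discarded immediately, so memory use stays
// bounded at N entries no matter the corpus size.
static void update_board(candidate board[], int *count, candidate cand)
{
    int i = *count;
    if (i == NUM_RESULTS) {
        if (cand.score <= board[NUM_RESULTS - 1].score) return; // not a contender
        i = NUM_RESULTS - 1;   // evict the last-place entry
    } else {
        (*count)++;
    }
    while (i > 0 && board[i - 1].score < cand.score) {
        board[i] = board[i - 1];
        i--;
    }
    board[i] = cand;
}
```

Because the board never exceeds N entries, printing the results is just a walk from index 0, already in order of decreasing likelihood.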
Output format. The output is a list of each target word and its proposed synonyms, one per line. The synonyms are printed in order of decreasing likelihood. The target and synonym words are printed in all lowercase. The format of each line is as shown below. If a target word does not appear in the corpus, no candidates are printed for it.
gray: grey dark blue brown white
said: asked decided answered wanted told
notarealword:
rapidly: quickly slowly gradually steadily soon
...
Memory. We expect your program to run cleanly under Valgrind with no errors reported in any situation and no leaks if exiting cleanly. (We do not require cleanup of memory leaks from fatal errors).
Have you read how assignments are graded? For this assignment, the anticipated point breakdown will be in the neighborhood of:
Functionality (95 points)
Code review (buckets together weighted to contribute ~20 points) Here we will read and evaluate your code in areas such as:
Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned by the submission.
The starter project contains the source files searchdir.c and synonyms.c, our CVector/CMap implementation provided as a compiled library, and a Makefile. The project is distributed as a Mercurial repository. Check out a copy of your cs107 repository using the command
hg clone /afs/ir/class/cs107/repos/assign2/$USER assign2
The shared directory /afs/ir/class/cs107/samples/assign2 contains a sample executable and input files. Your repo contains slink which is a shortcut way to refer to this directory if you find typing the full path tedious. The enormous corpus in the samples is courtesy of the Google NGram Project (check out their fun online viewer). This large corpus is perfect for stress-testing at the end but you'll want to use smaller files early in development.
When your code is complete, don't forget the all-important submit step to send your work to us for grading. If needed, familiarize yourself with our late policy.
How did this assignment go for you? Review the post-task self-check.