Assignment 2: Void* client fun

Due date: Mon Apr 18 11:59 pm - Hard deadline: Wed Apr 20 11:59 pm

Assignment by Julie Zelenski

[Quick Links: Implementation details, Advice page, Grading]

Learning goals

Completing this assignment should sharpen your skills in:

  1. client use of generic void* interfaces in C
  2. writing void* client callback functions
  3. debugging memory errors (because we are sure you will encounter at least one or two :-)

The problem

You are to implement two programs (one small guided exercise and a larger follow-on program) that feature client use of void* generic interfaces. We provide the efficient and full-featured CVector and CMap, the C versions of what you previously have known as ArrayList and HashMap. As C has no sophisticated language support for type-safe generics, these containers have no choice but to work via untyped pointers and manipulation of raw memory. The role as client is considerably more challenging given the lack of clear pointer types. By the time you're done, you'll appreciate that C is a versatile language that puts few restrictions in your way, but you'll also wish that it didn't always play it so fast and loose -- there is danger here!

The searchdir program recursively searches a directory to find and print those files whose names contain the user's search string. The starter code provides a draft version of the program that you are to modify and extend to full functionality. Here are the specific steps to take:

  1. The starter code uses a few library functions you may not have seen. Review the man pages for opendir/readdir/closedir, stat, and sprintf and skim our code to see how we are using them to access the filesystem and manipulate file paths.
  2. Study how the starter code stores the filenames into the CVector. Each path is assembled into a stack-allocated buffer. That buffer is sized to length PATH_MAX (a whopping 4K), of which only a fraction will be used. The CVector of matches is created to hold elements of size PATH_MAX, and when adding an entry, the full-sized buffer is copied and stored. At 4K per element, a CVector of 300 paths accumulates a memory footprint of over 1MB -- ouch! Change the element type of the CVector to store each path as a pointer to a right-sized, dynamically allocated string instead of that wasteful huge buffer. This will necessitate modifying how the CVector is created and disposed and how its elements are added and accessed. Storing 300 right-sized paths into the CVector now requires several orders of magnitude less memory than before.
  3. The starter code as-is prints all files, but it should only include those with matching names. Add a test to check if a file's name (not the full path, just the filename itself) contains the search string before adding it to the CVector of matches.
  4. The starter code has no protection against runaway recursion when a directory being searched contains a link back to itself or its parent. Fix this by tracking which directories have already been visited and skipping over any previously explored directory. Add code to manage an additional CVector visited and use it to store the inode number of each directory that has been visited. The rationale for tracking by inode instead of name is that the inode is unique and will detect a visited directory even if it is linked under a different name. This task gives you a chance to experiment with storing a different kind of element type in the CVector, using cvec_search, and writing a comparison callback.
  5. Before printing the CVector of matching files, sort them by path length, breaking ties lexicographically for paths of the same length. cvec_sort is just what you need, and it offers a little more practice writing callbacks.
  6. Once all the functionality is perfect, fire up Valgrind again to look for leaks. The coordination of memory management between client and CVector has some extra challenges, specifically in when and how to use a cleanup callback.
  7. If there is any part of using CVector/CMap that does not fully make sense to you, resolve it before moving on. Bring questions to office hours, post on the forum, send us an email, get the help you need now!

When finished, the behavior of your searchdir should be as follows:

Here are a few sample runs:

myth> searchdir .h 
./cmap.h
./cvector.h

myth> searchdir ea /afs/ir/class/cs107/samples/assign0
/afs/ir/class/cs107/samples/assign0/sea_frags
/afs/ir/class/cs107/samples/assign0/dream_frags
/afs/ir/class/cs107/samples/assign0/preamble_frags
/afs/ir/class/cs107/samples/assign0/reassemble_soln

Data mining for synonyms

Compiling a thesaurus seems a task best suited for the nuance and intelligence of a human etymologist, but a hard-working yet otherwise dull computer supplied with an immense stockpile of data from which to draw statistical correlations can also be surprisingly effective at the job. My colleague Greg Valiant demonstrated a sophisticated version of this technique at the recent CS department retreat which gave me the inspiration to adopt the idea for use in this assignment.

The idea is to infer probable synonyms by dredging an enormous corpus of representative English text to find words that are used similarly to the target word. For any word, you can extract from the corpus a list of all the bigrams (phrases of length 2) in which it appears. The number of bigrams in common between the lists for the target and another word tracks how similarly the words are used. Rather than simply counting the number in common, we'll use a weighted contribution based on how prevalent each bigram is, but the basic idea stays the same.

Let's make this more concrete with an example. "Green Eggs and Ham" by Dr. Seuss is being used as the corpus. Below are a few bigrams with their frequency:

try them 4
will try 1
eat green 1
eat them 25
not eat 11
you eat 2
would eat 2
will eat 9

The bigrams excerpted above are those that include the word "try" or "eat". Let's trace the calculation of the similarity of these two words. "try" appears in just two bigrams. Sum the bigram frequencies to obtain a total occurrence count of 5 for "try". The prevalence of a bigram is defined as the ratio of the bigram frequency to the word's total occurrence count. For the word "try", the prevalence of "try them" is 4/5 or 0.8, and the prevalence of "will try" is 1/5 or 0.2. For "eat", the total occurrence count is 50, so the prevalence of "eat them" is 25/50 or 0.5 and "will eat" is 9/50 or 0.18.

We define the similarity score as the sum of the prevalences of the bigrams in common, divided by two. The words "eat" and "try" have two bigrams in common: "~ them" and "will ~". The prevalences for "~ them" are (0.8 + 0.5) and for "will ~" are (0.2 + 0.18). The sum is 1.68; divide by 2 for a final similarity of 0.84. Similarity increases with more bigrams in common or higher prevalences. Similarity ranges from 0 (no bigrams in common) to a maximum of 1.0 (all bigrams in common).

To find synonyms for a target word, the program will compute the similarity score for the target against every other word in the corpus and pull out those words with the highest similarity. These candidates are offered as the most probable synonyms.

Data structure leverage

The Google corpus comprises over 5 million bigrams which is a significant amount of data to store and process. But never fear, you are building on CMap and CVector which provide flexible storage and efficient access when appropriately used. The recommended arrangement for storing the corpus information is a CMap associating each word with a CVector of its bigrams. Each CVector element is a struct packaging a bigram phrase with its frequency/prevalence.

It is important that you be thoughtful in how you store/arrange/access the data, as careless choices can have a profound impact on the program's runtime and memory requirements. For example, were you to store each bigram phrase in a buffer sized to hold the full 100-char maximum, 10 million stored bigrams would chew through 1GB of storage, and that's just for the phrases! You absolutely should right-size strings as opposed to wasting the full max-sized buffer. Furthermore, do you even need to store both words in a bigram phrase? Every bigram associated with "enthusiastic" duplicates the word ("enthusiastic applause", "is enthusiastic"). If you shorten each to "~ applause" and "is ~", you have eliminated many redundant characters. Do you see how this choice also makes it simpler to determine that two words share the same bigram?

To compute the similarity score, you must intersect two bigram vectors to find the elements in common. Given that you compute the intersection for thousands of word pairs, efficiency is key! The naive approach of iterating over one vector and repeatedly calling the unsorted vector search on the second will work correctly but runs much too slowly to be practical. But the simple fix of switching to the sorted vector search makes this algorithm much more viable. A sorted search requires that the vector be sorted beforehand. Is it more efficient to build up the vector in sorted order as you construct it or to do a one-time sort after it has been populated? Must both vectors be sorted, or just one? Is there an advantage to which of the two vectors you choose to iterate and which you search? Try out the options for yourself to find out the answers!

Another factor that can impact efficiency is an appropriate capacity hint when creating a CMap. The CMap is optimized for the hinted capacity on creation and does not change thereafter. The wrong choice of hinted capacity can cause the table to be grossly over- or under-sized and result in memory or time inefficiency. If you aren't sure of the eventual number of entries for a CMap, use 0 as the capacity hint when creating it. A 0 capacity hint enables a special adaptive sizing behavior that will maintain efficient performance across a wide range. The hinted capacity is less critical for CVector, which adapts as needed.

Implementation details

The synonyms program finds the most probable synonyms within the corpus for one or more target words. Here are its detailed requirements:

Grading

Have you read how assignments are graded? For this assignment, the anticipated point breakdown will be in the neighborhood of:

Functionality (95 points)

Code review (buckets together weighted to contribute ~20 points) Here we will read and evaluate your code in areas such as:

Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned by the submission.

Getting started

Finishing

When your code is finished, don't forget the all-important submit step to send your code to us for grading. If needed, familiarize yourself with our late policy.

How did this assignment go for you? Review the post-task self-check.