Due date: Mon Apr 18 11:59 pm - Hard deadline: Wed Apr 20 11:59 pm
Assignment by Julie Zelenski
[Quick Links: Implementation details, Advice page, Grading]
Completing this assignment should sharpen your skills in:
- void* generic interfaces in C
- client callback functions

You are to implement two programs (one small guided exercise and a larger follow-on program) that feature client use of void* generic interfaces. We provide the efficient and full-featured CVector and CMap, the C versions of what you have previously known as ArrayList and HashMap. As C has no sophisticated language support for type-safe generics, these containers have no choice but to work via untyped pointers and manipulation of raw memory. The client's role is considerably more challenging given the lack of clear pointer types. By the time you're done, you'll appreciate that C is a versatile language that puts few restrictions in your way, but you'll also wish it didn't always play it so fast and loose -- there is danger here!
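As a warm-up for the void* conventions these containers rely on, the same idea appears in the C library's qsort: the comparison callback receives untyped pointers to elements and must cast them back to the true element type. A minimal standalone sketch (not assignment code):

```c
#include <stdlib.h>
#include <string.h>

// qsort hands each element to the callback as a const void*. If the
// array's elements are char*, each const void* is really the address
// of an element, i.e. a const char**, and must be cast back before use.
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char **)a, *(const char **)b);
}
```

A client sorting an array of strings would call `qsort(words, n, sizeof(words[0]), cmp_str)`; the same cast-the-void*-back pattern is exactly what the CVector and CMap callbacks will require of you.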
The searchdir program recursively searches a directory to find and print those files whose name contains the user's search string. The starter code provides a draft version of the program that you are to modify and extend to full functionality. Here are the specific steps to take:
1. Familiarize yourself with opendir/readdir/closedir, stat, and sprintf, and skim our code to see how we are using them to access the filesystem and manipulate file paths.
2. Right-size the stored paths. Each path is built in a buffer of PATH_MAX (a whopping 4K) of which only a fraction will be used. The CVector of matches is created to hold elements of size PATH_MAX, and when adding an entry, the full-sized buffer is copied and stored. At 4K per element, a CVector of 300 paths accumulates a memory footprint of over 1MB -- ouch! Change the element type of the CVector to instead store each path as a pointer to a right-sized dynamically-allocated string rather than that wasteful huge buffer. This will necessitate modifying how the CVector is created and disposed and how its elements are added and accessed. Storing 300 right-sized paths in the CVector now requires several orders of magnitude less memory than before.
3. Avoid revisiting directories. Create a CVector named visited and use it to store the inode number of each directory that has been visited. The rationale for tracking by inode instead of name is that the inode is unique and will detect a visited directory even if it is linked under a different name. This task gives you a chance to experiment with storing a different kind of element type in the CVector, using cvec_search, and writing a comparison callback.
4. Sort the matches. cvec_sort is just what you need, and a little more practice writing callbacks for you.

When finished, the behavior of your searchdir should be as follows:
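The pointer-element scheme can be sketched in isolation (function names here are my own invention; consult cvector.h in the starter project for the exact cvec_* specifics):

```c
#include <stdlib.h>
#include <string.h>

// Make a heap copy sized exactly to the path, instead of keeping a
// full PATH_MAX buffer per match. The CVector element type then
// becomes char* (elemsz of sizeof(char*)), and you append the address
// of this pointer.
static char *rightsized_copy(const char *buf)
{
    char *copy = malloc(strlen(buf) + 1);   // exact fit, not 4K
    strcpy(copy, buf);
    return copy;
}

// Shape of the cleanup callback for a CVector of char* elements: the
// void* argument is the address of an element, i.e. a char**, and
// cleanup must free the pointed-to string.
static void cleanup_str(void *addr)
{
    free(*(char **)addr);
}
```

Supplying a cleanup function like this when creating the vector is what lets disposal reclaim the dynamically-allocated strings along with the vector itself.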
Here are a few sample runs:
myth> searchdir .h
./cmap.h
./cvector.h
myth> searchdir ea /afs/ir/class/cs107/samples/assign0
/afs/ir/class/cs107/samples/assign0/sea_frags
/afs/ir/class/cs107/samples/assign0/dream_frags
/afs/ir/class/cs107/samples/assign0/preamble_frags
/afs/ir/class/cs107/samples/assign0/reassemble_soln
Compiling a thesaurus seems a task best suited for the nuance and intelligence of a human etymologist, but a hard-working yet otherwise dull computer supplied with an immense stockpile of data from which to draw statistical correlations can also be surprisingly effective at the job. My colleague Greg Valiant demonstrated a sophisticated version of this technique at the recent CS department retreat which gave me the inspiration to adopt the idea for use in this assignment.
The idea is to infer probable synonyms by dredging an enormous corpus of representative English text to find words that are used similarly to the target word. For any word, you can extract from the corpus a list of all the bigrams (phrases of length 2) in which it appears. The number of bigrams in common between the lists for the target and another word tracks how similarly the words are used. Rather than simply counting the number in common, we'll actually use a weighted contribution based on how prevalent each bigram is, but the basic idea stays the same.
Let's make this more concrete with an example. "Green Eggs and Ham" by Dr. Seuss is being used as the corpus. Below are a few bigrams with their frequency:
try them 4
will try 1
eat green 1
eat them 25
not eat 11
you eat 2
would eat 2
will eat 9
The bigrams excerpted above are those that include the word "try" or "eat". Let's trace the calculation of the similarity of these two words. "try" appears in just two bigrams. Sum the bigram frequencies to obtain a total occurrence count of 5 for "try". The prevalence of a bigram is defined as the ratio of the bigram frequency to the word's total occurrence count. For the word "try", the prevalence of "try them" is 4/5 or 0.8, and the prevalence of "will try" is 1/5 or 0.2. For "eat", the prevalence of "eat them" is 25/50 or 0.5 and of "will eat" is 9/50 or 0.18.
We define the similarity score as the sum of the prevalences of the bigrams in common, divided by two. The words "eat" and "try" have two bigrams in common: "~ them" and "will ~". The prevalences for "~ them" are (0.8 + 0.5) and for "will ~" are (0.2 + 0.18). The sum is 1.68; divide by 2 for a final similarity of 0.84. Similarity increases with more bigrams in common or higher prevalences. Similarity ranges from 0 (no bigrams in common) to a maximum of 1.0 (all bigrams in common).
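The arithmetic above can be double-checked in a few lines of C, with the Green Eggs and Ham counts hard-coded (a sketch for verification only, not assignment code):

```c
// Prevalence = bigram frequency / word's total occurrence count.
static double prevalence(int bigram_freq, int total_count)
{
    return (double)bigram_freq / total_count;
}

// Similarity of "try" and "eat" from the table above.
static double try_eat_similarity(void)
{
    int try_total = 4 + 1;                     // "try them" + "will try" = 5
    int eat_total = 1 + 25 + 11 + 2 + 2 + 9;   // all six "eat" bigrams = 50

    double sum = prevalence(4, try_total) + prevalence(25, eat_total)   // "~ them"
               + prevalence(1, try_total) + prevalence(9, eat_total);   // "will ~"
    return sum / 2;   // (0.8 + 0.5 + 0.2 + 0.18) / 2 = 0.84
}
```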
To find synonyms for a target word, the program will compute the similarity score for the target against every other word in the corpus and pull out those words with the highest similarity. These candidates are offered as the most probable synonyms.
The Google corpus comprises over 5 million bigrams, which is a significant amount of data to store and process. But never fear, you are building on CMap and CVector, which provide flexible storage and efficient access when appropriately used. The recommended arrangement for storing the corpus information is a CMap associating each word with a CVector of its bigrams. Each CVector element is a struct packaging a bigram phrase with its frequency/prevalence.
It is important that you be thoughtful in how you store/arrange/access the data, as careless choices can have a profound impact on the program's runtime and memory requirements. For example, were you to store each bigram phrase in a buffer sized to hold the full 100-char maximum, 10 million stored bigrams (each corpus bigram is stored under both of its words) would chew through 1GB of storage, and that's just for the phrases! You absolutely should right-size strings rather than waste a full max-sized buffer, and furthermore, do you even need to store both words in a bigram phrase? Every bigram associated with "enthusiastic" duplicates the word ("enthusiastic applause", "is enthusiastic"). If you shorten each to "~ applause" and "is ~", you eliminate many redundant characters. Do you see how this choice also makes it simpler to determine that two words share the same bigram?
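One possible shape for the per-word element and the "~" abbreviation (the struct and function names are my own, not mandated by the assignment):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Element stored in a word's bigram vector: the abbreviated phrase
// plus its count. (Prevalence could be stored here too.)
typedef struct {
    char *phrase;   // e.g. "~ applause" or "is ~", right-sized on the heap
    int freq;
} bigram;

// Build the abbreviated phrase for a bigram containing `word`,
// collapsing the word itself to "~".
static char *abbreviate(const char *first, const char *second, const char *word)
{
    const char *a = strcmp(first, word) == 0 ? "~" : first;
    const char *b = strcmp(second, word) == 0 ? "~" : second;
    char *phrase = malloc(strlen(a) + strlen(b) + 2);   // space + NUL
    sprintf(phrase, "%s %s", a, b);
    return phrase;
}
```

With this scheme, "eat them" stored under "eat" and "try them" stored under "try" both become the identical string "~ them", so a shared bigram is detected by a plain string comparison.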
To compute the similarity score, you must intersect two bigram vectors to find the elements in common. Given that you compute the intersection for thousands of word pairs, efficiency is key! The naive approach of iterating over one vector and repeatedly calling the unsorted vector search on the second will work correctly but runs much too slowly to be practical. But the simple fix of switching to the sorted vector search makes this algorithm much more viable. A sorted search requires that the vector be sorted beforehand. Is it more efficient to build the vector in sorted order as you construct it, or to do a one-time sort after it has been populated? Must both vectors be sorted, or just one? Is there an advantage in which of the two vectors you iterate and which you search? Try out the options for yourself to find the answers!
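The iterate-one/sorted-search-the-other strategy can be sketched with plain arrays and the C library's bsearch standing in for the sorted vector search (a standalone illustration, not the required implementation):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char **)a, *(const char **)b);
}

// Count the phrases two sorted arrays have in common: iterate one
// array and binary-search the other. With n elements iterated and m
// searched, this is O(n log m), versus O(n * m) for the naive
// unsorted search.
static int count_common(const char **iter, int niter,
                        const char **search, int nsearch)
{
    int common = 0;
    for (int i = 0; i < niter; i++)
        if (bsearch(&iter[i], search, nsearch, sizeof(char *), cmp_str))
            common++;
    return common;
}
```

Note that only the searched array needs to be in sorted order for this to work, which is one hint toward the questions posed above.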
Another factor that can impact efficiency is appropriate capacity hints when creating a CMap. The CMap is optimized for the hinted capacity on creation and does not change thereafter. The wrong choice of hinted capacity can cause the table to be grossly over or under-sized and result in memory or time inefficiency. If you aren't sure of the eventual number of entries for a CMap, use 0 as the capacity hint when creating it. A 0 capacity hint enables a special adaptive sizing behavior that will maintain efficient performance across a wide range. The hinted capacity is less critical for CVector, which adapts as needed.
The synonyms program finds the most probable synonyms within the corpus for one or more target words. Here are its detailed requirements:
Read corpus frequencies. The corpus data is provided already processed, with the number of occurrences of each bigram tallied (thank you, Google Ngram team!). Each line in the file is expected to list a bigram and its frequency in the format <first> <second> <integer> as shown below. The words consist of only lowercase letters, and the maximum allowable word length is 50 characters. A line can contain arbitrary whitespace before/between/after the tokens, up to a line length of at most 512 characters.
common ground 1644
common interests 355
You may assume that each line is no longer than 512 chars, that the corpus contains no duplicated bigrams, and that each frequency is greater than zero; you do not need to verify these assumptions nor handle inputs that violate them. All other malformed entries (non-lowercase letters, too-long words, punctuation, blank lines, etc.) should be reported with a clear and helpful message and treated as a fatal error. You should read the corpus file in a single linear pass over the input, with no jumping around or rewinding.
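One possible shape for a line parser under these rules (an sscanf-based sketch with names of my own choosing; your approach may differ):

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { MAX_WORD = 50 };

// A well-formed word is non-empty, at most MAX_WORD chars, all lowercase.
static bool valid_word(const char *s)
{
    if (*s == '\0' || strlen(s) > MAX_WORD) return false;
    for (; *s != '\0'; s++)
        if (!islower((unsigned char)*s)) return false;
    return true;
}

// Parse "<first> <second> <integer>" allowing arbitrary surrounding
// whitespace. Buffers must hold MAX_WORD + 2 chars: %51s reads one
// char past the limit so valid_word can detect an over-long word.
// The trailing " %c" deliberately fails to match on a well-formed
// line, making the expected conversion count exactly 3.
static bool parse_line(const char *line, char first[], char second[], int *freq)
{
    char extra;
    if (sscanf(line, " %51s %51s %d %c", first, second, freq, &extra) != 3)
        return false;   // wrong token count or trailing junk
    return valid_word(first) && valid_word(second);
}
```

Per the assumptions above, the frequency itself needs no validation; a false return here is the fatal-error case your program must report.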
Case-insensitivity. Distinctions of case should be ignored. "SHOUT" and "shout" are considered the same word. Although well-formed corpus input must already be in lowercase, the target arguments supplied by the user may require conversion.
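Converting a user-supplied target to lowercase takes only a short loop (a trivial sketch, assuming plain ASCII words as in the corpus):

```c
#include <ctype.h>

// Lowercase a target argument in place so it can be compared against
// the all-lowercase corpus words.
static void to_lowercase(char *s)
{
    for (; *s != '\0'; s++)
        *s = tolower((unsigned char)*s);
}
```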
Find synonyms. For each target word, examine every word in the corpus and consider it as a possible synonym for the target. Use a CVector as a "leader board" that tracks the N most likely candidates found so far. Update the leader board whenever you find a better candidate. Compare candidates using the following metrics:
The size of the leader board is constrained by the constant NUM_RESULTS. This constant is set to 5 by default, but can be changed by editing the constant definition in synonyms.c. The number of entries offered on the leader board should respect the value of this constant. For efficiency purposes, maintain just the top N entries in your leader board vector. The alternative of loading an enormous vector with every word from the corpus and waiting until the end to prune to the top N will result in significantly increased runtime and memory overhead.
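One shape the top-N maintenance can take, sketched with a plain array in place of the CVector (struct, names, and eviction policy are illustrative, not required): keep the board sorted by decreasing score, and let a new candidate in only if the board has room or the candidate beats the current last-place entry.

```c
enum { NUM_RESULTS = 5 };   // mirrors the constant in synonyms.c

typedef struct {
    const char *word;
    double score;
} candidate;

// Insert cand into board (at most NUM_RESULTS entries, sorted by
// decreasing score), shifting lower-scoring entries down. *count is
// the current number of entries. Candidates that cannot beat a full
// board's last place are discarded immediately, so memory use stays
// bounded at N entries no matter the corpus size.
static void update_board(candidate board[], int *count, candidate cand)
{
    int i = *count;
    if (i == NUM_RESULTS) {
        if (cand.score <= board[NUM_RESULTS - 1].score) return; // not a contender
        i = NUM_RESULTS - 1;   // evict the last-place entry
    } else {
        (*count)++;
    }
    while (i > 0 && board[i - 1].score < cand.score) {
        board[i] = board[i - 1];
        i--;
    }
    board[i] = cand;
}
```

Because the board never exceeds N entries, printing the results is just a walk from index 0, already in order of decreasing likelihood.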
Output format. The output is a list of each target word and its proposed synonyms, one per line. The synonyms are printed in order of decreasing likelihood. The target and synonym words are printed in all lowercase. The format of each line is as shown below. If a target word does not appear in the corpus, no candidates are printed for it.
gray: grey dark blue brown white
said: asked decided answered wanted told
notarealword:
rapidly: quickly slowly gradually steadily soon
...
Memory. We expect your program to run cleanly under Valgrind with no errors reported in any situation and no leaks if exiting cleanly. (We do not require cleanup of memory leaks from fatal errors).
Have you read how assignments are graded? For this assignment, the anticipated point breakdown will be in the neighborhood of:
Functionality (95 points)
Code review (buckets together weighted to contribute ~20 points) Here we will read and evaluate your code in areas such as:
Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned by the submission.
The starter project contains the source files searchdir.c and synonyms.c, our CVector/CMap implementation provided as a compiled library, and a Makefile. The project is distributed as a Mercurial repository. Check out a copy of your cs107 repository using the command
hg clone /afs/ir/class/cs107/repos/assign2/$USER assign2
The shared directory /afs/ir/class/cs107/samples/assign2 contains a sample executable and input files. Your repo contains slink which is a shortcut way to refer to this directory if you find typing the full path tedious. The enormous corpus in the samples is courtesy of the Google NGram Project (check out their fun online viewer). This large corpus is perfect for stress-testing at the end but you'll want to use smaller files early in development.
When your code is complete, don't forget the all-important submit step to send your work to us for grading. If needed, familiarize yourself with our late policy.
How did this assignment go for you? Review the post-task self-check.