Due: Wed Feb 1 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Fri Feb 3 11:59 pm
Assignment by Julie Zelenski
Learning goals
Completing this assignment should sharpen your skills in:
- client use of generic
void*interfaces in C - writing
void*client callback functions - debugging memory errors (because we are sure you will encounter at least one or two :-)
The problem
You are to implement a small guided exercise that features client use of void* generic interfaces. We provide the efficient and full-featured CVector and CMap, the C versions of what you previously have known as ArrayList/Vector and HashMap (in Java/C++). As C has no sophisticated language support for type-safe generics, these containers have no choice but to work via untyped pointers and manipulation of raw memory. The role as client is considerably more challenging given the lack of clear pointer types. By the time you're done, you'll appreciate that C is a versatile language that puts few restrictions in your way, but you'll also wish that it didn't always play it so fast and loose -- there is danger here!
searchdir usage
The searchdir program recursively searches a directory to find and print various information, depending on flag used. The starter code provides a draft version of program that you are to modify/extend to full functionality. Here is what the program must do in the end:
- Usage is:
./searchdir [one of: -i or -d or searchstr] [optional: directory name] - The directory name is optional, and, if used, is always the 2nd argument. If a directory name is not given, the default of
"."(current directory) should be used. - If an inadequate number of arguments is given, or the specified directory does not exist, a helpful error message is printed.
- If the first argument is neither "-i" nor "-d", then the first argument is assumed to be a
searchstr, that is, a string to search file and directory names for. The search string is a simple fixed string (no metacharacters, no quotes around it, nor whitespace in it). If a search string is given, the program should recursively search the directory to find and print the names (including path) of alldirectories andfiles whose names match the search string. - If the first argument is "-i", the program should recursively search the directory to find and save the inode values (more on this below, it is a unique file/directory ID number) of all directories and files. It should then repeatedly ask the user to enter an inode, and then print for the user the name (including path) of the directory or file with that inode, until the user chooses to quit. If the user types a non-existent inode, it prints nothing.
- If the first argument is "-d", the program should recursively search the directory to find and save the modified dates of all directories and files. It should then repeatedly ask the user to enter a date, and then print for the user the names (including paths) of all directories and files with that modified date, until the user chooses to quit. If there are no matches for a given date, nothing is printed.
Below is more detail on each of the three modes.
Name search (searchstr)
We recommend you begin with this feature. The other features will follow a similar template, but have much less starter code to guide you.
- The starter code uses a few library functions you may not have seen. Review the man pages for
opendir/readdir/closedir,stat, andsprintfand skim our code to see how we are using them to access the filesystem and manipulate file paths. - Study how the starter code stores the filenames into the CVector. Each path is assembled into a stack-allocated buffer. That buffer is sized to length
PATH_MAX(a whopping 4K) of which only a fraction will be used. The CVector of matches is created to hold elements of sizePATH_MAXand when adding an entry, the full-sized buffer is copied and stored. At 4K per element, a CVector of 300 paths accumulates a memory footprint of over 1MB-- ouch! Change the element type of the CVector to instead store each path as a pointer to a right-sized dynamically-allocated string instead of that wasteful huge buffer. This will necessitate modifying how the CVector is created and disposed and how its elements are added and accessed. Storing 300 right-sized paths into the CVector now requires several orders of magnitude less memory than previous. Win! - The starter code as-is prints all files, but it should only include those with matching names. Add a test to check if a file's name (not the full path, just the filename itself) contains the search string before adding it to the CVector of matches.
- The starter code has no protection against deep recursion when a directory being searched contains a link back to itself/parent (this can happen with symbolic links, like the
slinksymbolically-linked directory you had in your assign1 and again in assign2). Fix this by tracking which directories have already been visited and skip over any previously explored directory. Add code to manage an additional CVector you should namevisited, and use it to store the inode numbers of each directory that has been visited. The rationale for tracking by inode instead of name is that the inode is unique and will detect a visited directory even if linked under a different name. This task gives you a chance to experiment with storing a different kind of element type in the CVector and use ofcvec_searchand writing a comparison callback. (Although it is possible to use CMap for this, we ask that you use CVector.) - Before printing the CVector of matching files, sort them by path length, breaking ties lexicographically for paths of the same length.
cvec_sortis just what you need and a little more practice writing callbacks for you. - Once all the functionality is perfect, fire up Valgrind again to look for leaks. The coordination of memory management between client and CVector has some extra challenges, specifically in when and how to use a cleanup callback.
- If there is any part of using CVector/CMap that does not fully make sense to you, resolve it before moving on. Bring questions to office hours, post on the forum, send us an email.
When finished, the behavior of your searchdir with the default searchstr option should be as follows:
- It is invoked with arguments of a search string and optional directory. If no directory is specified, it searches the current directory. If no search string is given, a helpful error message is reported.
- Iterates over the directory entries, including recursively examining subdirectories but not re-exploring loops, and identifies every file whose name contains the search string. The matching is case-sensitive.
- Prints the list of matching paths one per line. The paths are printed in order of increasing length of the path string. Use case-sensitive lexicographic order to break ties between paths of the same length.
- It should run cleanly under Valgrind with no errors nor leaks.
- UPDATE: this note is added to clear up a previous point of confusion: the searchstr should ONLY be applied to files NOT to directory names. Refer to the examples below and run the sample solution.
Here are a few sample runs:
myth> searchdir .h
./cmap.h
./cvector.h
myth> searchdir ea /afs/ir/class/fakeclass/
/afs/ir/class/fakeclass/sea_frags.txt
/afs/ir/class/fakeclass/dream_frags.txt
/afs/ir/class/fakeclass/preamble_frags.txt
/afs/ir/class/fakeclass/reassemble_soln.txt
Inode search (-i)
After gaining guided practice with two different uses of the CVector, it's time to get to know CMap by implementing the Inode search. Recall from implementing the Name search (with searchstr) that C has a function stat() that will give you the inode of a file or directory, given its name. With this feature, you'll provide the reverse lookup: given an inode, report the file's name. Repeated key-value pair lookup is a good use case for CMap! (Though, admittedly, you likely wouldn't notice a difference between O(logN) and O(N).) At the highest level, CMap is a hash map whose keys are always strings, and whose values are a generic type, but you should read the cmap.h file in detail. Here's how you will use CMap to implement Inode search:
- Using the same type of recursive directory search you used in
gather_filesfor the name search, recurse the directory tree. For each file and directory, save its inode in a CMap that uses the inode as a key (keys must be strings, so a string version) and the file or directory name (with path) as the value. All strings in the CMap must be right-sized (see "tips" below). - Once you've exhausted the recursive directory search, enter a loop where you repeatedly ask the user to enter an inode number, retrieve the associated file or directory name from the CMap, and print it. When the user types "q" instead of an inode, that is the sentinal value indicating you should exit the program (but not before cleaning up memory, of course!).
- If you encountered a file or directory more than once during your recursive directory search, giving non-unique path value for a given inode key, any valid path may be printed. However, as a reminder, you should be keeping track of visited so as not to get stuck in a loop, and so printing a name that traces a loop in its path would not be acceptable. (you may use the CVector visited again, or the same CMap that you build up for the queries--though recall you must use CVector for visited for the Name search option).
- If the user types anything other than q or a valid inode, print nothing and continue asking for another inode search.
- NOTE: Unlike with the Name search, the -i search should collect and include the CMap both directory and file inode numbers.
Tips:
- You'll want to use something like
sprintf(inodestr, "%lu", inode)to convert the inode (type unsigned long) to a string (where inodestr is a char array). - You may be tempted to read the user's input inode from scanf as its usual unsigned long type, and then convert it to a string using sprintf as you did when adding it to the CMap. But remember that input from the console can always be seen as a string, even if it is digits, so it will be easier to just read it as a string using
%s(you also need to take the input as a string to capture the possible "q" sentinel value). - You may assume that the inode values never need a char array of more than MAX_INODE_LEN characters (including null terminating character), and that the file and directory names never need more than PATH_MAX. However, you should only use arrays of these sizes for temporary storage of strings. All strings in the CMap need to be right-sized. The CMap will take care of that for you for the keys, but you'll need to dynamically allocate (and clean up) the values.
- For debugging, you'll need to know what some files' inode values are, in order to enter them as input. The advice page has tips for doing this.
Here are a few examples:
myth> ./searchdir -i /afs/ir/class/cs107/samples/unixhunt
Enter inode (or q to quit): 1800012521
/afs/ir/class/cs107/samples/unixhunt/nums
Enter inode (or q to quit): 1800012458
/afs/ir/class/cs107/samples/unixhunt/clue1
Enter inode (or q to quit): 1800012898
/afs/ir/class/cs107/samples/unixhunt/air/rockhopper/gatekeeper.py
Enter inode (or q to quit): blah
Enter inode (or q to quit): 1234
Enter inode (or q to quit): q
myth>
myth> ./searchdir -i
Enter inode (or q to quit): 1800025021
./Makefile
Enter inode (or q to quit): 1800020940
./searchdir.c
Enter inode (or q to quit): q
myth>
Date search (-d)
Now that you love CVector and CMap, we're going to put a CVector in a CMap. This feature allows users to look up files by modified date. Here's how you will use CMap and CVector together to implement Date search:
- Using the same type of recursive directory search you used twice before, recurse the directory tree. For each file and directory, save its modified date as a string "MM/DD" in a CMap that uses the date string as a key and a CVector of file and directory names (with path) that share that date as the value. All strings in the CVectors in the CMap must be right-sized.
- Once you've exhausted the recursive directory search, enter a loop where you repeatedly ask the user to enter a date as MM/DD, retrieve the associated CVector of files and directories names from the CMap, and then enumerate over the vector printing each entry. When the user types "q" instead of a date, that is the sentinal value indicating you should exit the program (but not before cleaning up memory, of course!).
- If you encountered a file or directory more than once during your recursive directory search, it is ok (but not required) to have duplicate names for it in the results that are printed. However, as a reminder, you should be keeping track of visited so as not to get stuck in a loop, and so printing a name that traces a loop in its path would not be acceptable.
- If the user types anything other than q or a MM/DD date that exists in your map, print nothing and continue asking for another date search.
- NOTE: Unlike with the Name search (but like the -i Inode search), the -d search should collect and include the CMap both directory and file modify dates.
Tips:
- You don't need to parse out the MM and DD from the user's input, becasue your CMap keys will already be in the form "MM/DD". They strings will naturally compare correctly.
- Note that this program will group files that were modified on the same date but in different years. For simplicity, we won't worry about this. You could imagine this could be useful for some purpose--perhaps gathering images from a certain date across many years to show to you on your Facebook "on this day..." feature.
- Dealing with dates in Unix/C (or any computer language, to be honest) can be tricky. The following code shows how to get a date from the same stat object we extracted the inode from, and then how to print it into a string MM/DD format (this code requires
#include <sys/time.h>and#include <time.h>, and you should add a#defineto your code for DATE_MAX, with value 6):struct stat ss; if (stat(fullpath, &ss) != 0) continue; int inode = ss.st_ino; struct tm *timeobj = localtime(&(ss.st_mtime)); char datebuf[DATE_MAX]; strftime(datebuf, DATE_MAX, "%m/%d", timeobj);
Here are a few examples:
myth> ./searchdir -d
Enter date MM/DD (or q to quit): 10/11
./lect3/Makefile
./lect3/ptr.c
./demo.mp4
Enter date MM/DD (or q to quit): 10/12
./lect5/sizes.c
Enter date MM/DD (or q to quit): blah
Enter date MM/DD (or q to quit): 10/13
Enter date MM/DD (or q to quit): 55/99
Enter date MM/DD (or q to quit): q
myth>
myth> ./searchdir -d /afs/ir/user/c/b/cbl/foo
Enter date MM/DD (or q to quit): 03/05
/afs/ir/user/c/b/cbl/foo/bar.txt
/afs/ir/user/c/b/cbl/foo/doc.txt
Enter date MM/DD (or q to quit): q
myth>
Guidelines
- Memory. We expect your program to run cleanly under Valgrind with no errors reported in any situation and no leaks if exiting cleanly. (We do not require cleanup of memory leaks from fatal errors).
- Efficiency. You can compare time and memory use with our sample executable as a benchmark for what is reasonable. Being with a factor of 2-3x will be considered on par.
- Design/style/readability. You know the drill-- aim for publication-quality code!
- Don't miss out on the companion advice page: Go to advice/FAQ page
Grading
Have you read how assignments are graded? For this assignment, the anticipated point breakdown will be in the neighborhood of:
Functionality (90 points)
- Sanity cases. (45 points) The published sanity check tests are used for sanity grading.
- Comprehensive cases. (18 points) Correct results for small inputs targeted to test specific input variations and broad coverage.
- Robustness cases. (10 points) Correct behavior on unusual cases, edge conditions, and graceful handling of required error conditions.
- Stress cases. (3 points) Large inputs touching on all required features.
- Clean compile. (2 points) We expect your code to compile cleanly without warnings.
- Clean run under valgrind. (7 points) We expect your program to run cleanly under Valgrind during all normal operation. Memory errors are significant deductions, leaks are more minor.
- Reasonable efficiency. (5 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality...). Programs with extreme runtime inefficiencies (requiring more than 10x the time of the sample solution) may lose additional functionality points from failed functionality tests that time out under the autotester.
Code review (buckets together weighted the usual style percent of score) Here we will read and evaluate your code in areas such as:
- Style and readability We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
- Use of pointers and memory We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on.
- Client use of CVector/CMap We expect your program to use CVector and CMap to manage and manipulate the corpus data and leader board. You should leverage the features provided and not re-implement the functionality.
- Program design We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. When the C library provides needed functionality, you should leverage the existing library functions rather than re-implement that functionality.
Submissions received by the assignment deadline will receive an on-time bonus equal to 5% of the functionality points earned by the submission.
Getting started
-
The starter project contains the source files
searchdir.c, our CVector/CMap implementation provided as a compiled library, and a Makefile. The project is distributed as a Mercurial repository. Check out a copy of your cs107 repository using the commandhg clone /afs/ir/class/cs107/repos/assign2/$USER assign2 -
The shared directory
/afs/ir/class/cs107/samples/assign2contains a sample executable. Your repo containsslinkwhich is a shortcut way to refer to this directory if you find typing the full path tedious.
Finishing
When your code is finished, don't forget to finish with the all-important submit step to send your code to us for grading. If needed, familiarize yourself with our late policy.
How did this assignment go for you? Review the post-task self-check.