Due: Tue Jan 24 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Fri Jan 27 11:59 am
Assignment by Julie Zelenski
Learning goals
This assignment is designed to give you
- first experiences editing, compiling, testing, and debugging C programs under Unix
- an opportunity to view Unix utility programs from an internal perspective, as implementer not just client
- practice with C arrays, pointers, and strings (both raw manipulation and string library functions)
- exposure to the role of Unix shell environment variables
Overview
Your task for this week's assignment is to implement simplified versions of the Unix utilities which and grep. This is an especially apropos way to be introduced to C and Unix, not only does it continues a thread you began in assign0, but implementing the Unix operating system and its command-line tools were what motivated the creation of the C language in the first place! Implementing these utility programs is a very natural use of C, and you'll see how comfortably it fits in this role.
Deliverable 1: mywhich
The which command and the PATH environment variable
The which command searches for a command by name and reports where its matching executable file was found. Pull up its man page and try out the command, e.g. which ls or which make or which vim. The response from which is the full path to the matching executable file or no output if not found.
It may not be obvious at first, but this search is intimately related to how commands are executed by the shell. When you run a command such as ls or vim, the shell searches for an executable program that matches that command name and then runs that program. If the shell cannot find a matching executable, it responds "command not found" (this is different from which that has no output when the command is not found).
How does the search for executables work? You might imagine that it looks for an executable file named in vim in every directory on the entire filesystem, but such an exhaustive search would be both incredibly inefficient and dangerously insecure. Instead, it searches only those directories that have been explicitly listed in the user's search path. The default search path includes directories such as /usr/local/bin/ and /usr/bin/ which house the executable files for the standard unix commands. (The name bin is a nod to the fact that executable files are encoded in binary).
The user can configure their search path by changing the value of their PATH environment variable. The user's environment is a collection of NAME=value pairs. Here is an excerpt from printenv, a command that prints all variables in your environment:
myth> printenv
HOST=myth9
SHELL=/bin/bash
USER=zelenski
PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/sbin:/sbin:/usr/games
PWD=/afs/ir/users/z/e/zelenski
...
Environment variables track information like username (USER=zelenski), the host where logged in (HOST=myth9), and the current working directory (PWD=/afs/ir/users/z/e/zelenski). The environment variable of particular interest is PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/sbin:/sbin:/usr/games. The value for PATH is a sequence of directories separated by colons; these are the directories searched when looking for an executable. The directories are searched in the order they are listed, stopping at the first matching executable.
Implementing mywhich
Your first assignment task is to write your own implementation of mywhich. It will operate similarly to the standard which. mywhich searches the directories in the user's PATH for a command name and prints the full path if a matching executable is found. Your version of mywhich supports a command-line flag -p for use as a testing aid, this flag allows the user to dictate the search path to use in place of the PATH environment variable.
The sample output below to illustrates its operation. In the first use, the user asks mywhich to find four executables. Three of them were found in various bin directories, but no executable named submit was found in any directory in the user's PATH and thus nothing was printed for it.
myth> ./mywhich ls xemacs submit cp
/bin/ls
/usr/bin/xemacs
/bin/cp
In the second use, the -p flag is used to change the search path to the two directories /tmp and /afs/ir/class/cs107/tools. The latter directory is where submit is now found.
myth> ./mywhich -p /tmp:/afs/ir/class/cs107/tools submit
/afs/ir/class/cs107/tools/submit
The basic approach we recommend you take for mywhich is to:
-
get search path from environment (if not superseded by -p argument)
- iterate over environment variables, compare each to "PATH", if match, use its associated value
-
for each command name, look for matching executable in search path
- extract directories one by one from search path, use colon to demarcate divisions
- construct string representing the full path
"directory/command_name" - test if full path is valid executable file and if so, print it and stop looking
Read up on the string functions offered in the standard C library as they can be useful tools to help out with in searching the envp (strncmp), deconstructing the searchpath (strchr), and assembling the full path (sprintf).
The entire program will take about less than a page of C code, but represents a significant milestone: you've just written your first complete C program in unix -- congratulations!
Requirements for mywhich
-
Usage. The
mywhichprogram is invoked with one or more command names to search for in the user'sPATH. It accepts the optional flag argument-p <searchdirs>where<searchdirs>is a colon-delimited sequence of directories to search instead of the user'sPATH. -
Assumptions. You may assume correct usage in all cases. The
-pflag, if used, must be the first argument and the second argument is a colon-delimited sequence of directories; the third and subsequent arguments are command names. If run without the-pflag, the arguments are one or more command names. Furthermore, you may assume that the user'sPATHand any-pargument will be a well-formed sequence of one or more directories separated by colons. You do not need to detect or cope with situations where these assumptions does not hold. We will not test on usage such as no command names, out of order-pargument, malformed sequence of directories inPATHor given as the-pargument, and so on. -
Operation. For each command name, it searches for a matching executable. The directories to be searched is given by the value of user's
PATH(or the value of the-pargument if used). The directories are searched in the order they are listed. A search stops at the first directory containing an executable file matching the command name. -
Expected output. For each command name, it prints the full path to its executable or nothing if no matching executable was found. The matched executables are listed in the order that the command names were specified on the command-line.
-
Restrictions. You are to manually search for
PATHin the environment using theenvpargument tomainand directly test for the presence of the executable file in each directory in the search path. The intended purpose of the task is for you to write code to implement these features; your code is prohibited from using facilities such asgetenv,env, andwhich.
Deliverable 2: mygrep
What is mygrep?
Your final assignment task will be to implement mygrep, a simplified version of standard grep. Assign0 and lab1 have already acquainted you with the handy use of grep pattern files to search the contents of the files for a pattern. In its simplest usage, the pattern is a fixed string (no special characters), but where grep especially shines is in matching complex patterns expressed as regular expressions. The regex support in standard grep is quite extensive, but mygrep will allow fixed strings and only these four metacharacters:
- The metacharacter
.matches any character:mygrep st.pmatches bothstepandstop - The metacharacters
^and$match the beginning and end of the line, respectively:mygrep '^happy$'matches a line that consists solely of the wordhappy(no whitespace or other characters before or afterhappy) - The metacharacter
*matches zero or more occurrences of preceding char:mygrep 'argh*'matchesarg,argh,arghhhhhh, etc. The match is defined to be greedy, which is to say it grabs as many copies of the repeat character as possible.
(This week's lab1 includes some practice forming regular expressions using these metacharacters -- check it out if you haven't yet gone to lab!)
You won't have to start from a blank slate on mygrep as the starter code we give you is a working implementation of a basic fixed-string match. Our provided code processes the command-line arguments, determines which file(s) to search, finds a match within a line, and prints the line with a highlight around the matching portion. Your job will be to extend the program with these two major upgrades:
- report multiple matches within a single line (e.g. pattern
reshould be displayed as matching both the start and end ofrevere) - replace fixed-string match with regex match and support the four metacharacters:
. ^ $ *
Extending mygrep from the given starter code
Here is the order of attack we recommend for you to turn the starter mygrep into a fully functional version:
-
Play with the existing program. Before even thinking about coding, take stock of what you have. Use
maketo build the program and try it out. Do a few manual tests of your own, then run sanitycheck to see how it fares thus far. Play with the sample solution some, too, to have your eye on the goal. -
Study the existing code. Since you are going to be extending on top of this base, it is essential that you understand everything this code does (and doesn't) do. Do not just give a passive skim, dig in and scrutinize each line and be sure that you know how it operates and how it relates to the rest of the code. Ask questions about anything you can't work out for yourself.
-
Document the existing code. We require that you document all of the existing code. Each function should have a header comment that documents the function's purpose and behavior. Inline comments should be added where appropriate to explain unusual or dense passages. Be thorough!
-
Multiple matches per line. Now you're ready to code! The least revolutionary of the changes is matching the pattern more than once on the same line. You already have a working single match that you can reuse/repurpose, and so the approach is to figure how to work it into a loop that handles multiple matches per line. Test, test and test some more. Run sanitycheck, too!
-
Support metacharacter
.Upgrading from fixed-string to regex match is necessarily going to shake things up a bit. The givensearchfunction does a sliding match callingstrncmpto look for the pattern in the input. You don't want the sliding match to go away, but do need to replacestrncmpwith a fancier comparison capable of also matching regex patterns as well as fixed strings. Start by writing aregex_matchthat is basically re-implementsstrncmpand changesearchto use it instead. Feel free to change the interface to take the parameters you want; you don't need it to match the prototype ofstrncmpexactly. Do some focused regression testing now to make sure you haven't broken anything that previously worked. Now, consider how to tweak the notion of "equal characters" used withinregex_matchto allow.in the pattern to match any char in the input. You could writeregex_matchiteratively or recursively, but we recommend approaching recursively as it sets you nicely for the later metacharacters. -
Support metacharacter
^Matching to pattern that starts with^is mostly just a matter of adding a special case to block the sliding part of the match so a pattern starting with^is only matched at the start of the string. -
Support metacharacter
$Matching to a pattern that ends with$is a little bit trickier. It requires some careful thinking to get this one. In particular, pay attention how you detect/handle the base case(s) inregex_matchto ensure that$at end of the pattern matches if and only if the end of line has been reached. Extra testing is warranted here to ensure you have the logic correct in all cases. -
Support metacharacter
*You're in the homestretch now, but we did save the toughest for last. Matching the*metacharacter is a test of your mettle with backtracking recursion. When matching patterna*, if the first char of input is not ana, then the only choice is to matcha*to empty, advance past it in pattern, and recur. So that case can be directly handled. But if first char isa, you have the choice to either matcha*to thataor matcha*to empty. Because*is defined to prefer the greedy match, you should first extend the sequence of a's being matched, and only backtrack if that fails. This is where having writtenregex_matchrecursively saves the day as you can optimistically recur on this greedy path, and if doesn't work out, backtrack to the alternative and try again.
Requirements for mygrep
-
Usage. The
mygrepprogram is invoked with a pattern and zero or or more filenames to search. If no filenames are given as arguments, it reads fromstdin. It is a usage error to run the program without a pattern argument. An error is also raised if a filename cannot be read.mygrepdoes not support any of the flags from standardgrepnor any metacharacters beyond. ^ $ *. The user should single-quote the pattern argument to avoid the shell giving unwanted special treatment to any metacharacters. -
Assumptions. You may assume that all patterns are valid and well-formed. Some metacharacters only make sense if used in a certain way--for example,
*needs to be preceded by the character to repeat and^is only legal at the beginning of the pattern. You do not need to detect or cope with invalid patterns. The specific list of invalid patterns is on the advice page. We will not test on invalid patterns. -
Operation. For each file, it reads the contents line by line and searches for a match to the pattern within each line. It outputs the matching lines.
-
Expected output. For each matching line, the entire line is displayed and the characters of each match are highlighted in inverse video. If there is more than one more match within the line, each match is highlighted. Line without a match are not printed. If more than one file is being searched, each matching line is prefixed with the name of the file.
-
Restrictions. You are to implement regex matching yourself; your code is prohibited from using
grepor other pre-written regex match.
Requirements that apply to both mywhich and mygrep
-
Clean compile. All code you submit to us for grading is expected to compile without warnings or errors.
-
Documented. All code you submit should be appropriately documented. That usually means a program overview comment, individual function comments, and inline comments where needed. You are to document both code that you inherited in the starter and code that you added or modified.
-
Memory clean. We test submissions for memory errors/leaks by running various scenarios under Valgrind. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
-
Reasonable memory and runtime efficiency. We have modest requirements for efficiency. Use our sample solution as the benchmark of what constitutes reasonable use of time and memory. The sample is implemented in a straightforward manner and employs no trickery. If your performance is roughly comparable to the sample, you're golden.
-
Excellent design/style/readability. From CS106, you should be well-versed in good coding practice and we expect your solutions to show thoughtful design and attention to readability. This will continue to matter in CS107. You know the drill: decompose into manageable functions, factor common code to avoid duplication, no global variables, strive for readability in all ways, and so on. See Nick Parlante's stolen prose on landmarks in coding quality.
Advice/FAQ
The assignment writeup (this page) is the place where we spell out the formalities (required specifications, due date, grading standards, logistics, and so on). We also maintain a companion advice page for the assignment that offers informal recommendations and hints, along with answers to common student questions. Given this is a new assignment this quarter, we will be sure to update the advice page to address any late-breaking issues that surface. Please check it out! Go to advice/FAQ page
Grading
Below is the tentative grading rubric. Your grade will be based on a combination of autograder tests for functionality and a style review by our staff. The autograder functionality tests are scored in points; the human grader's code review is scored into buckets to emphasize the qualitative features of the review over the quantitative. A detailed explanation of the different components is available on the Guide to Assignment Grading page.
Functionality (100 points)
- Sanity cases (50 points) Correct results on the published sanity check tests.
- Comprehensive/stress cases (36 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
- Clean compile (2 points) Compiles cleanly with no warnings.
- Clean run under valgrind (8 points) Clean memory report(s) when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction.
- Reasonable efficiency (4 points) Meets runtime/memory benchmark as set by the sample. Full points are awarded if on par (2-3x) with the sample; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality).
Code review (buckets together weighted to contribute ~20 points)
-
Documentation We expect overview, per-function, and line-level comments as appropriate to explain the design and workings of the program, and especially to illuminate any particularly dense or tricky parts of the code. The audience for the comments is your C-savvy peer.
-
Style and readability We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
-
Use of pointers and memory We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation (for assign1, you do not need and should not use dynamic memory), no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on.
-
Program design We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste.
Get started
To begin working, you need to clone your repo for assign1. Recall the clone command (for more detail, check the Mercurial Guide:
hg clone /afs/ir/class/cs107/repos/assign1/$USER assign1
In the cloned assign1 directory, you will find several files including a Makefile, mywhich.c, and mygrep.c. The directory /afs/ir/class/cs107/samples/assign1 contains sample solution programs and some text files for grep. You can access these via their full path (e.g. /afs/ir/class/cs107/samples/assign1/mywhich_soln), but it is a bit unwieldy. Your repo has slink, which is a symbolic link to this shared samples directory. Use slink to more easily refer to those files (e.g. slink/mygrep_soln) and avoid retyping the full path. Note that the samples directory is shared and read-only, so you will not be able to add or edit files within that directory.
Finishing
As you work, you should be making regular commits, but keep in mind that commit is not the same as submit! The former just saves your progress for your own checkpoint, and the latter submits the assignment to us for grading. Review the How to Submit page for instructions. We strongly recommend you do a trial submit well in advance of the deadline to familiarize yourself with the process and allow time to work through any snags. You can replace that submission with a subsequent one if desired. It's also fine to submit a stable but perhaps unpolished/unfinished as an insurance policy. If some disaster happens (e.g., computer melts down) and you miss the deadline to submit your final version, you at least have something submitted that will earn some points. Without a submit, we cannot grade your work.
How CS107 handles deadlines and late work may differ from other classes, be sure you have read our complete late policy. Submissions received before the due date earn the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!
How did it go for you? Review the post-task self-check.