Assignment 1: Unix utilities

Due: Tue Jan 24 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Fri Jan 27 11:59 am

Assignment by Julie Zelenski

Learning goals

This assignment is designed to give you

  1. first experiences editing, compiling, testing, and debugging C programs under Unix
  2. an opportunity to view Unix utility programs from an internal perspective, as implementer not just client
  3. practice with C arrays, pointers, and strings (both raw manipulation and string library functions)
  4. exposure to the role of Unix shell environment variables

Overview

Your task for this week's assignment is to implement simplified versions of the Unix utilities which and grep. This is an especially apropos way to be introduced to C and Unix, not only does it continues a thread you began in assign0, but implementing the Unix operating system and its command-line tools were what motivated the creation of the C language in the first place! Implementing these utility programs is a very natural use of C, and you'll see how comfortably it fits in this role.

Deliverable 1: mywhich

The which command and the PATH environment variable

The which command searches for a command by name and reports where its matching executable file was found. Pull up its man page and try out the command, e.g. which ls or which make or which vim. The response from which is the full path to the matching executable file or no output if not found.

It may not be obvious at first, but this search is intimately related to how commands are executed by the shell. When you run a command such as ls or vim, the shell searches for an executable program that matches that command name and then runs that program. If the shell cannot find a matching executable, it responds "command not found" (this is different from which that has no output when the command is not found).

How does the search for executables work? You might imagine that it looks for an executable file named in vim in every directory on the entire filesystem, but such an exhaustive search would be both incredibly inefficient and dangerously insecure. Instead, it searches only those directories that have been explicitly listed in the user's search path. The default search path includes directories such as /usr/local/bin/ and /usr/bin/ which house the executable files for the standard unix commands. (The name bin is a nod to the fact that executable files are encoded in binary).

The user can configure their search path by changing the value of their PATH environment variable. The user's environment is a collection of NAME=value pairs. Here is an excerpt from printenv, a command that prints all variables in your environment:

myth> printenv
HOST=myth9
SHELL=/bin/bash
USER=zelenski
PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/sbin:/sbin:/usr/games
PWD=/afs/ir/users/z/e/zelenski
...

Environment variables track information like username (USER=zelenski), the host where logged in (HOST=myth9), and the current working directory (PWD=/afs/ir/users/z/e/zelenski). The environment variable of particular interest is PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/sbin:/sbin:/usr/games. The value for PATH is a sequence of directories separated by colons; these are the directories searched when looking for an executable. The directories are searched in the order they are listed, stopping at the first matching executable.

Implementing mywhich

Your first assignment task is to write your own implementation of mywhich. It will operate similarly to the standard which. mywhich searches the directories in the user's PATH for a command name and prints the full path if a matching executable is found. Your version of mywhich supports a command-line flag -p for use as a testing aid, this flag allows the user to dictate the search path to use in place of the PATH environment variable.

The sample output below to illustrates its operation. In the first use, the user asks mywhich to find four executables. Three of them were found in various bin directories, but no executable named submit was found in any directory in the user's PATH and thus nothing was printed for it.

myth> ./mywhich ls xemacs submit cp
/bin/ls
/usr/bin/xemacs
/bin/cp

In the second use, the -p flag is used to change the search path to the two directories /tmp and /afs/ir/class/cs107/tools. The latter directory is where submit is now found.

myth> ./mywhich -p /tmp:/afs/ir/class/cs107/tools submit
/afs/ir/class/cs107/tools/submit

The basic approach we recommend you take for mywhich is to:

  1. get search path from environment (if not superseded by -p argument)

    • iterate over environment variables, compare each to "PATH", if match, use its associated value
  2. for each command name, look for matching executable in search path

    • extract directories one by one from search path, use colon to demarcate divisions
    • construct string representing the full path "directory/command_name"
    • test if full path is valid executable file and if so, print it and stop looking

Read up on the string functions offered in the standard C library as they can be useful tools to help out with in searching the envp (strncmp), deconstructing the searchpath (strchr), and assembling the full path (sprintf).

The entire program will take about less than a page of C code, but represents a significant milestone: you've just written your first complete C program in unix -- congratulations!

Requirements for mywhich

Deliverable 2: mygrep

What is mygrep?

Your final assignment task will be to implement mygrep, a simplified version of standard grep. Assign0 and lab1 have already acquainted you with the handy use of grep pattern files to search the contents of the files for a pattern. In its simplest usage, the pattern is a fixed string (no special characters), but where grep especially shines is in matching complex patterns expressed as regular expressions. The regex support in standard grep is quite extensive, but mygrep will allow fixed strings and only these four metacharacters:

(This week's lab1 includes some practice forming regular expressions using these metacharacters -- check it out if you haven't yet gone to lab!)

You won't have to start from a blank slate on mygrep as the starter code we give you is a working implementation of a basic fixed-string match. Our provided code processes the command-line arguments, determines which file(s) to search, finds a match within a line, and prints the line with a highlight around the matching portion. Your job will be to extend the program with these two major upgrades:

  1. report multiple matches within a single line (e.g. pattern re should be displayed as matching both the start and end of revere)
  2. replace fixed-string match with regex match and support the four metacharacters: . ^ $ *

Extending mygrep from the given starter code

Here is the order of attack we recommend for you to turn the starter mygrep into a fully functional version:

  1. Play with the existing program. Before even thinking about coding, take stock of what you have. Use make to build the program and try it out. Do a few manual tests of your own, then run sanitycheck to see how it fares thus far. Play with the sample solution some, too, to have your eye on the goal.

  2. Study the existing code. Since you are going to be extending on top of this base, it is essential that you understand everything this code does (and doesn't) do. Do not just give a passive skim, dig in and scrutinize each line and be sure that you know how it operates and how it relates to the rest of the code. Ask questions about anything you can't work out for yourself.

  3. Document the existing code. We require that you document all of the existing code. Each function should have a header comment that documents the function's purpose and behavior. Inline comments should be added where appropriate to explain unusual or dense passages. Be thorough!

  4. Multiple matches per line. Now you're ready to code! The least revolutionary of the changes is matching the pattern more than once on the same line. You already have a working single match that you can reuse/repurpose, and so the approach is to figure how to work it into a loop that handles multiple matches per line. Test, test and test some more. Run sanitycheck, too!

  5. Support metacharacter . Upgrading from fixed-string to regex match is necessarily going to shake things up a bit. The given search function does a sliding match calling strncmp to look for the pattern in the input. You don't want the sliding match to go away, but do need to replace strncmp with a fancier comparison capable of also matching regex patterns as well as fixed strings. Start by writing a regex_match that is basically re-implements strncmp and change search to use it instead. Feel free to change the interface to take the parameters you want; you don't need it to match the prototype of strncmp exactly. Do some focused regression testing now to make sure you haven't broken anything that previously worked. Now, consider how to tweak the notion of "equal characters" used within regex_match to allow . in the pattern to match any char in the input. You could write regex_match iteratively or recursively, but we recommend approaching recursively as it sets you nicely for the later metacharacters.

  6. Support metacharacter ^ Matching to pattern that starts with ^ is mostly just a matter of adding a special case to block the sliding part of the match so a pattern starting with ^ is only matched at the start of the string.

  7. Support metacharacter $ Matching to a pattern that ends with $ is a little bit trickier. It requires some careful thinking to get this one. In particular, pay attention how you detect/handle the base case(s) in regex_match to ensure that $ at end of the pattern matches if and only if the end of line has been reached. Extra testing is warranted here to ensure you have the logic correct in all cases.

  8. Support metacharacter * You're in the homestretch now, but we did save the toughest for last. Matching the * metacharacter is a test of your mettle with backtracking recursion. When matching pattern a*, if the first char of input is not an a, then the only choice is to match a* to empty, advance past it in pattern, and recur. So that case can be directly handled. But if first char is a, you have the choice to either match a* to that a or match a* to empty. Because * is defined to prefer the greedy match, you should first extend the sequence of a's being matched, and only backtrack if that fails. This is where having written regex_match recursively saves the day as you can optimistically recur on this greedy path, and if doesn't work out, backtrack to the alternative and try again.

Requirements for mygrep

Requirements that apply to both mywhich and mygrep

Advice/FAQ

The assignment writeup (this page) is the place where we spell out the formalities (required specifications, due date, grading standards, logistics, and so on). We also maintain a companion advice page for the assignment that offers informal recommendations and hints, along with answers to common student questions. Given this is a new assignment this quarter, we will be sure to update the advice page to address any late-breaking issues that surface. Please check it out! Go to advice/FAQ page

Grading

Below is the tentative grading rubric. Your grade will be based on a combination of autograder tests for functionality and a style review by our staff. The autograder functionality tests are scored in points; the human grader's code review is scored into buckets to emphasize the qualitative features of the review over the quantitative. A detailed explanation of the different components is available on the Guide to Assignment Grading page.

Functionality (100 points)

Code review (buckets together weighted to contribute ~20 points)

Get started

To begin working, you need to clone your repo for assign1. Recall the clone command (for more detail, check the Mercurial Guide:

hg clone /afs/ir/class/cs107/repos/assign1/$USER assign1

In the cloned assign1 directory, you will find several files including a Makefile, mywhich.c, and mygrep.c. The directory /afs/ir/class/cs107/samples/assign1 contains sample solution programs and some text files for grep. You can access these via their full path (e.g. /afs/ir/class/cs107/samples/assign1/mywhich_soln), but it is a bit unwieldy. Your repo has slink, which is a symbolic link to this shared samples directory. Use slink to more easily refer to those files (e.g. slink/mygrep_soln) and avoid retyping the full path. Note that the samples directory is shared and read-only, so you will not be able to add or edit files within that directory.

Finishing

As you work, you should be making regular commits, but keep in mind that commit is not the same as submit! The former just saves your progress for your own checkpoint, and the latter submits the assignment to us for grading. Review the How to Submit page for instructions. We strongly recommend you do a trial submit well in advance of the deadline to familiarize yourself with the process and allow time to work through any snags. You can replace that submission with a subsequent one if desired. It's also fine to submit a stable but perhaps unpolished/unfinished as an insurance policy. If some disaster happens (e.g., computer melts down) and you miss the deadline to submit your final version, you at least have something submitted that will earn some points. Without a submit, we cannot grade your work.

How CS107 handles deadlines and late work may differ from other classes, be sure you have read our complete late policy. Submissions received before the due date earn the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!

How did it go for you? Review the post-task self-check.



Contents