Assignment 2: String me along

Due: Mon Apr 23 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed Apr 25 11:59 pm

Assignment by Julie Zelenski

Learning goals

This assignment is designed to give you

  • practice with C-strings (both raw manipulation and using string library functions)
  • an opportunity to view unix utility programs from an internal perspective, as implementer not just client
  • exposure to programmatic access of the filesystem and shell environment variables

For this assignment, you will write programs that replicate some of the functionality of the unix commands printenv and which. This is an especially apropos way to be introduced to C and unix; not only does it continue a thread you began in assign0, but implementing the unix operating system and its command-line tools were what motivated the creation of the C language in the first place! Implementing these programs is a very natural use of C, and you'll see how comfortably it fits in this role.

Get started

Check out the starter project from your cs107 repo using the command

git clone /afs/ir/class/cs107/repos/assign2/$USER assign2

The starter project contains various C files and supporting Makefile and custom_tests files. In the samples subdirectory, you'll find our sample solution programs.

Background: unix filesystem, shell environment

In this assignment, you will write code that interacts with the unix filesystem and shell environment variables.

If you need an introduction or refresher on the filesystem, review our unix reference library tutorials on the tree structure, absolute and relative paths, and various commands to interact with files and directories as a user.

When interacting with the filesystem programmatically in C, you will use a C-string to represent a path and can construct and dissect paths by string manipulation and string.h library functions. Working with paths is mostly just practice with C-strings!

The POSIX standards establish a set of C functions for interacting with the OS. Whereas the standard C library functions provide only simple file reading/writing, the POSIX functions add more comprehensive services, including access to filesystem metadata (e.g. modification time, permissions), directory contents, and filesystem operations that are necessary for implementing unix commands like ls and mkdir. There is one POSIX function you will use for this assignment: the function access to check the user's permissions for a file. Take a look at its man page to get a preview.

On a unix system, programs run in the context of the user's "environment". The environment is a list of key-value pairs that provide information about the terminal session and configure the way processes behave. You have already used the USER environment variable when cloning your assignment repo; USER is set to your SUNet ID when you log into myth. Other variables include PATH (where the system looks for programs to run), HOME (path to your home directory), and SHELL (your command line interpreter).

Explore your environment by trying out the printenv and env commands. You will be implementing your own version of the myprintenv program as part of the assignment.

  • printenv will show your environment variables, env allows you to change them. Read the man pages for printenv and env.
  • Run printenv with no arguments to see your entire environment. Then try printenv USER SHELL HOME. What is the output from a bad request printenv BOGUS?
  • Run printenv, then run env BINKY=1 WINKY=2 printenv. What changes between the two? Run printenv again with no arguments to determine whether changes to the environment were transient or permanent.

1. Write get_env_value

Your first task will be to implement two small utility functions. Each function is designed to be written and tested in isolation, and then later integrated into a full program in the final part of the assignment.

The get_env_value function is used to lookup an environment variable by name. This is a warmup to get you practicing with C-strings and the <string.h> library functions.

The required prototype for get_env_value is:

const char *get_env_value(const char *envp[], const char *varname);

Each entry in the envp array is a string of the form "VARNAME=VALUE". A NULL pointer is placed after the last entry to mark the end of the entries in the environment array.

Your function will iterate through the entries in the environment and look for a matching entry. If found, the function returns the associated value; otherwise it returns NULL. For example, if the envp contains an entry "USER=zelenski", the associated value for "USER" is "zelenski". Your function does not need to make a copy of the value string, it just returns a pointer to the portion of the string following the '=' char.

Write your implementation of get_env_value in the util.c file.

You can use our provided myprintenv.c program for testing the function. It is integrated with sanitycheck for your convenience.

Note: get_env_value should not call the getenv/secure_getenv functions. To operate correctly, it must manually search the environment that was passed via the envp argument.

2. Write scan_token

The second utility is the scan_token function. We studied strtok in lab as a code study exercise. Such a function to tokenize a string by delimiter is handy to have, but the standard strtok has design flaws that make it less than ideal. The intention of scan_token is to separate a string into tokens in the manner of strtok but with a cleaner design. The required prototype for scan_token is:

bool scan_token(const char **p_input, const char *delimiters,
                char buf[], size_t buflen);

The function scans the input string to determine the extent of the token using the delimiters as separators and then writes the token characters to buf, making sure to terminate with a null char. The function returns true if a token was written to buf, and false otherwise.

Specific details of the function's operation:

  • Your implementation of scan_token should take the same general approach as strtok, meaning it can (and should!) use the handy <string.h> functions such as strspn and strcspn, but it should not replicate the bad parts of its design, specifically:
    • no static variables
    • does not destroy the input string
    • no weird carryover of state between calls. Each call operates independently.
  • The function separates the input into tokens in the same way that strtok does: it scans the input string to find the first character not contained in delimiters. This is the beginning of the token. It scans from there to find the first character contained in delimiters. This delimiter (or the end of the string if no delimiter was found) marks the end of the token.
  • Note that the parameter p_input is a char **. This is a pointer argument that is being passed by reference. The pointer is tracking the progress through the input. (Note that this pointer parameter replaces what was tracked by the static variable within the strtok implementation that you studied in lab) The client passes a pointer to the pointer to the first char of the input string. The function will update the pointer held by p_input to point to the next character in the input that follows what was just scanned. If the scanned token consumed all of the remaining input, *p_input will point to the null terminator.
  • buf is a fixed-length array to store the token and buflen is the length of the buffer. scan_token should not write past the end of buf. If a token does not fit in buf, the function should write buflen - 1 characters into buf, write a null byte in the last slot, and the pointer held by p_input should be updated to point to the next character following the buflen - 1 characters in the token. In other words, the next token scanned will start at the first character that would have overflowed buf.

Consider this sample tokenization loop:

    const char *input = "super-duper-awesome-magnificent";
    char buf[10];
    const char *remaining = input;

    while (scan_token(&remaining, "-", buf, sizeof(buf))) {
        printf("Next token: %s\n", buf);
    }

Running the above code produces this output:

    Next token: super
    Next token: duper
    Next token: awesome
    Next token: magnifice
    Next token: nt

Write your implementation of scan_token in the util.c file. You can test it using our provided tokenize.c program. The tokenize program is integrated with sanitycheck. Be sure to make use of the custom tests option to extend your testing beyond the default cases.

Review and comment starter code

With your utility functions tested and debugged, you're now ready to build a program out of them.

The file mywhich.c is given to you with an incomplete main function that sketches the expected behavior for the case when mywhich is invoked with no arguments. You are to first read and understand this code, work out how to change/extend it to suit your needs, and finally add comments to document your strategy.

Some issues to consider when looking over code:

  • The main function in mywhich.c receives an extra argument on top of argc/argv. This is an alternate form of main available to unix programs. What is that third argument?
  • What is PATH_MAX? What is it used for?
  • What does sizeof evaluate to when applied to an array?
  • If the user's environment does not contain a value for MYPATH, what does mywhich use instead?
  • How does a client properly use scan_token? (see sample uses in both tokenize.c and mywhich.c)
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

As per usual, the code we provide has been stripped of its comments and it will be your job to provide the missing documentation.

3. Implement the mywhich program

The final program you are to write is a version of the unix which command, a utility used to locate and identify executable programs to run. Your mywhich program will use the two utility functions (get_env_value and scan_token) you wrote earlier.

What does the which command do?

The which command searches for a command by name and reports the full path to its matching executable file. Read its man page (man which) and try it out, e.g. which ls or which make or which vim. The response from which is the matching executable or no output if not found.

It may not be obvious at first, but this search is intimately related to how commands are executed by the shell. When you run a command such as ls or vim, the shell searches for an executable program that matches that command name and then runs that program.

Where does it search for executables? You might imagine that it looks for an executable file named vim in every directory on the entire filesystem, but such an exhaustive search would be both incredibly inefficient and dangerously insecure. Instead, it looks only those directories that have been explicitly listed in the user's search path. The default search path includes directories such as /usr/local/bin/ and /usr/bin/ which house the executable files for the standard unix commands. (The name bin is a nod to the fact that executable files are encoded in binary).

The user can configure their search path by changing the value of their PATH environment variable. The value for PATH is a sequence of directories separated by colons such as PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/games. When looking for a command, which considers the directories in the order they are listed in PATH and stops at the first one that contains a matching executable. In order to match, the file's name must be an exact match and the file must be readable and executable by the user.

How does mywhich operate?

The mywhich program you are to write is similar in operation to the standard which with these differences:

  • mywhich uses the environment variable MYPATH for the search path. If no such environment variable exists, it falls back to PATH. (Standard which always uses PATH as the search path)
  • mywhich invoked with no arguments prints the list of directories searched. (standard which with no arguments does nothing)
  • mywhich does not support any command-line flags. (standard which -a prints all matches)

When invoked with no arguments, mywhich prints the directories in the search path, one directory per line. This use case is a testing aid to verify that your utility functions work correctly.

myth> ./mywhich 
Directories in search path:
/usr/bin
/usr/local/bin
/usr/pubsw/bin
/bin

When invoked with one or more arguments, mywhich searches for a matching executable for each argument. The sample output below shows invoking mywhich to find three executables. Two of them were found, but no executable named submit was found in any directory in the user's MYPATH and thus nothing was printed for it.

myth> ./mywhich xemacs submit cp
/usr/bin/xemacs
/bin/cp

You will want to test with different configurations of the search path. Rather than muck with your actual PATH (which can create total chaos), we recommend that you change MYPATH, which only affects mywhich and nothing else. Use env to set the value of MYPATHwhen running mywhich like this:

    myth> env MYPATH=/tmp:tools ./mywhich submit
    tools/submit

Requirements for mywhich

  • Usage. The mywhich program is invoked with zero or more arguments. Each argument is a command to be searched for. A command will contain only non-special characters (i.e. no ^ # * / $ % which have special meaning in unix)

  • Assumptions. You may assume correct usage in all cases and that the user's MYPATH and PATH variables are well-formed sequences of one or more paths separated by colons. You do not need to detect or cope with situations where these assumptions do not hold and we will not test on any inputs that violate these assumptions, e.g. no usage of unsupported -a flag, no malformed values for MYPATH, no special characters in command.

  • Operation. The user's MYPATH (or PATH if there is no MYPATH variable in the user's environment) defines the search path. It considers the directories in the order they are listed in the search path. If the directory contains a readable, executable file whose name is an exact match to the command, the search stops here and the full path of that file is printed.

  • Expected output. For each command name, it prints the full path to the first matching executable or nothing if no matching executable was found. The matched executables are listed one per line in the order that the command names were specified on the command-line.

Advice/FAQ

Don't miss out on the good stuff in our companion document!
Go to advice/FAQ page

Grading

Have you read how assignments are graded? For this assignment, the anticipated point breakdown will be in the neighborhood of:

Functionality (80 points)

  • Sanity cases (25 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean run under valgrind (6 points) Clean memory report(s) when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exception/error cases under Valgrind.
  • custom_tests (5 points) Your custom_tests file should include at least 5 tests of your own that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code review (buckets together weighted to contribute ~15 points)

  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on. For this program, you should not need and should not use dynamic memory (i.e. no malloc/strdup).
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

On-time bonus (+5%)

The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. The bonus is calculated as a percentage of the point score earned by the submission.

Finish and submit

Review the How to Submit page for instructions. Submissions received by the due date receive the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!

How did it go for you? Review the post-task self-check.