Assignment 2: C Strings

Due: Mon Jan 28 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed Jan 30 11:59 pm

Assignment by Julie Zelenski, with modifications by Nick Troccoli

Learning Goals

This assignment delves into those topics covered in lectures 4 and 5 and the second lab. You will be building your skills with:

  • C-strings (both raw manipulation and using string library functions)
  • viewing Unix utility programs from an internal perspective, as implementer not just client
  • exposure to programmatic access of the filesystem and shell environment variables

Overview

For this assignment, you will write programs that replicate some of the functionality of the Unix commands printenv and which. This is an especially appropriate way to be introduced to C and Unix; not only does it continue a thread you began in assign0, but implementing the Unix operating system and its command-line tools were what motivated the creation of the C language in the first place! Implementing these programs is a very natural use of C, and you'll see how comfortably it fits in this role.

Take a look at the working on assignments page linked to from the assignments page as you are working on this assignment; it outlines everything you need to know about working through a CS107 assignment, from getting the starter code to testing to submitting.

NEW: Also take a look at the CS107 Style Guide page linked to from the assignments page as you are working on this assignment; it outlines style guidelines to follow when working on C programs, and how to create well-designed programs. Additionally, we cannot over-emphasize how important it is to dedicate yourself to habits that allow you to do your best work. All that your CS106 instructors taught you about the value of decomposition, incremental development, readability, and early-and-often testing is even more true in the world of C.

Note: To remind you of the collaboration policy as it applies to the assignments: you are to do your own independent thinking, design, coding, and debugging. The assignments are for you to solidify your own understanding and develop your individual skills. If you are having trouble completing the assignment on your own, please reach out to the course staff; we are here to help!

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign2/$USER assign2

The starter project contains the following:

  • util.c mywhich.c and Makefile: two files that you will modify, and their Makefile for compiling
  • custom_tests: the file where you will add custom tests for your programs
  • myprintenv.c: a program that calls your get_env_value function in util.c for testing purposes.
  • tokenize.c: a program that calls your scan_token function in util.c for testing purposes.
  • util_grader: a program to help with output comparison. You can ignore this.
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini and prototypes.h: files to configure and run Sanity Check. You can ignore these.
    • myprintenv_soln, mywhich_soln and tokenize_soln: executable solutions for the programs you will write.
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

Background: Unix Filesystem and the Shell Environment

In this assignment, you will write code that interacts with the Unix filesystem and something called shell environment variables. If you need an introduction or refresher on the filesystem, review our Resources page for tutorials on the tree structure, absolute and relative paths, and various commands to interact with files and directories as a user.

The POSIX standards establish a set of C functions for interacting with the operating system. Whereas the standard C library functions provide only simple file reading/writing, the POSIX functions add more comprehensive services, including access to filesystem metadata (e.g. modification time, who can access files), directory contents, and filesystem operations that are necessary for implementing Unix commands like ls and mkdir, which are themselves just executable programs. There is one POSIX function you will use for this assignment: the function access to check the user's permissions (how they can access a file) for a file. Take a look at its manual page to get a preview. We'll explain access in more detail later on this page.

When interacting with the filesystem programmatically in C, you will use a C-string to represent a path and can construct and dissect paths by string manipulation and string.h library functions. Working with paths is thus practice with C-strings!

We made a video explaining some of the background information about Unix and the terminal that may be helpful for this assignment. We recommend watching it before continuing.

As mentioned in the video above, on a Unix system, programs run in the context of the user's "environment". The environment is a list of key-value pairs that provide information about the terminal session and configure the way processes behave. You have already used the USER environment variable when cloning your assignment repo; USER is set to your SUNet ID when you log into myth. Other variables include PATH (where the system looks for programs to run), HOME (path to your home directory), and SHELL (your command line interpreter).

Explore your environment by trying out the printenv and env commands mentioned in the video, and reading their manual pages. You will be implementing your own version of the functionality of the printenv program as part of the assignment. As a summary:

  • printenv will show your environment variables. Run printenv with no arguments to see your entire environment. Then try printenv USER SHELL HOME. What is the output from a bad request printenv BOGUS?
  • env is a command that allows you to temporarily change environment variables. You can execute something like:
env BINKY=1 OTHERARG=2 ./myprogram

and what will happen is myprogram will be executed in a temporary environment with all of the original environment variables, plus BINKY set to 1 and OTHERARG set to 2. To see this, run printenv, then run env BINKY=1 WINKY=2 printenv. What changes between the two?

1. get_env_value

Your first task will be to implement a function get_env_value that can be written and tested in isolation, and will be used later in the assignment. This function should be implemented in util.c. It is used in the provided myprintenv.c program.

The get_env_value function takes as parameters an array of environment variables, and the name of a specific variable of interest. It should return the value of the variable name of interest in the array of environment variables, or NULL if it was not found in the array.

The required prototype for get_env_value is:

const char *get_env_value(const char *envp[], const char *varname);

Each entry in the envp array is a string of the form "VARNAME=VALUE". A NULL pointer is placed after the last entry to mark the end of the entries in the environment array. Your function should iterate through the entries in the environment variables array and look for a matching entry.

For example, if the envp contains an entry "USER=troccoli", the associated value for "USER" is "troccoli". Your function does not need to make a copy of the value string, it just returns a pointer to the portion of the string following the '=' char.

You can use our provided myprintenv.c program for testing the function. If you execute ./myprintenv, it will print out all the environment variables, just like printenv; this functionality does not rely on your get_env_value function. If you specify one or more command line arguments, it will print out the value of each of those in the environment; this functionality relies on your get_env_value function. You do not need to modify myprintenv.c. It is also integrated with sanitycheck for your convenience. Be sure to make use of the custom tests option to extend your testing beyond the default cases.

Note: get_env_value should not call the getenv/secure_getenv functions, if you are familiar with them. To operate correctly, it must manually search the environment that was passed via the envp argument.

For testing, try using the env program to temporarily change environment variable values and ensure they print out correctly. For instance, if you execute

env USER=otheruser ./myprintenv USER

this will change the USER environment variable just for executing myprintenv this one time to be otheruser instead of your SUNET ID. You can then ensure that myprintenv prints out the correct value, otheruser.

2. scan_token

Your second task will be to implement a function scan_token that can be written and tested in isolation, and will be used later in the assignment. This function should be implemented in util.c. It is used in the provided tokenize.c program.

We studied strtok in lab as a code study exercise. Such a function to tokenize a string by delimiters is handy to have, but the standard strtok has design flaws that make it difficult to use. The intention of scan_token is to separate a string into tokens in the manner of strtok but with a cleaner design. The required prototype for scan_token is:

bool scan_token(const char **p_input, const char *delimiters,
                char buf[], size_t buflen);

The function scans the input string to determine the extent of the next token, using the delimiters as separators, and then writes the token characters to buf, making sure to terminate with a null char. The function returns true if a token was written to buf, and false otherwise.

Here are specific details of the function's operation:

  • Your implementation of scan_token should take the same general approach as strtok, meaning it can (and should!) use the handy <string.h> functions such as strspn and strcspn, but it should not replicate the bad parts of its design. Specifically, it should not:
    • use static variables
    • modify the input string
    • have carryover of state between calls. Each call should operate independently.
  • The function separates the input into tokens in the same way that strtok does: it scans the input string to find the first character not contained in delimiters. This is the beginning of the token. It scans from there to find the first character contained in delimiters. This delimiter (or the end of the string if no delimiter was found) marks the end of the token.
  • buf is a fixed-length array to store the token and buflen is the length of the buffer. scan_token should not write past the end of buf. If a token does not fit in buf, the function should write buflen - 1 characters into buf, write a null terminator in the last slot, and the pointer held by p_input should be updated to point to the next character following the buflen - 1 characters in the token. In other words, the next token scanned will start at the first character that would have overflowed buf.
  • Note that the parameter p_input is a char **. This is a pointer to a string, as introduced in lecture 5. This is necessary because the scan_token function should update this string to point to the next character in the input that follows what was just scanned. Here's an example:
char *str = "hello world I am a string!";
char buffer[6];
bool result = scan_token(&str, " ", buffer, sizeof(buffer));
printf("%s\n", buffer);     // hello
printf("%s\n", str);        // world I am a string!

In this usage, we want the first token in str, separated by spaces. Thus, the first token is "hello", which is stored in buffer. After the function, we also want str to be updated to point to the remainder of the string that was not yet processed. If the function processed all of the remaining input, then str here would point to the null terminator. In order to have scan_token modify the char * itself, we must pass a pointer to it. Otherwise, we just pass a copy. (This parameter replaces what was tracked by the static variable within the strtok implementation that you studied in lab.)

This means, as the implementer of scan_token, you must update the pointer held by p_input to point to the next character in the input that follows what was just scanned. If the scanned token consumed all of the remaining input, *p_input should point to the null terminator.

Consider this sample tokenization loop:

    const char *input = "super-duper-awesome-magnificent";
    char buf[10];
    const char *remaining = input;

    while (scan_token(&remaining, "-", buf, sizeof(buf))) {
        printf("Next token: %s\n", buf);
    }

Running the above code produces this output:

    Next token: super
    Next token: duper
    Next token: awesome
    Next token: magnifice
    Next token: nt

You can use our provided tokenize.c program for testing the function. If you execute ./tokenize, it will use your scan_token function to calculate the number of syllables of various test words, as a test of your function. You can also run it by specifying other text you would like to use to test, in this format:

./tokenize <DELIMITERS> <TEXT>

For example, if I would like to tokenize the text "hello I am a C-string" using the delimiters "-" and " ", you could run:

./tokenize " -", "hello I am a C-string"

The first string contains the characters to use as delimiters, and the second string is the text to tokenize. This command should output something like:

./tokenize " -", "hello I am a C-string"
Tokenized: { "hello" "I" "am" "a" "C" "string" }

You do not need to modify tokenize.c. It is also integrated with sanitycheck for your convenience. Be sure to make use of the custom tests option to extend your testing beyond the default cases.

Assumptions: We will only test the function on valid arguments. More specifically this means:

  • buf is a valid address to a region of memory that has space for buflen characters
  • buflen > 1
  • p_input is a valid pointer to a pointer
  • *p_input is a well-formed C-string. (may be empty string)
  • delimiters is a well-formed C-string containing one or more delimiter chars. (will not be empty string)

You may aspire to bullet-proof your function, for example, rejecting a NULL value for p_input or complaining if the client passes a buf with a mismatched buflen. Although your intentions are good, it is impossible to make good on them. Determining whether a pointer is valid is not solvable in general. Rejecting p_input == NULL will catch exactly that one case, but doesn't detect all other invalid addresses. Any measure to detect bad pointers will be half-hearted at best. As the implementer, you have little choice but to assume the client will supply valid addresses and must write your code accordingly.

Review and Comment Starter Code

With your utility functions tested and debugged, you're now ready to build a larger program that uses them: your own version of the which command mentioned in the introduction video.

The file mywhich.c is given to you with an incomplete main function that sketches the expected behavior for the case when mywhich is invoked with no arguments. You should first read and understand this code, work out how to change/extend it to suit your needs, and finally add comments to document your strategy. You are welcome (and encouraged!) to modify the existing code to make your implementation as clean and well-designed as possible.

Some concepts to think about when looking over the code:

  • What is PATH_MAX? What is it used for?
  • What does sizeof evaluate to when applied to an array?
  • If the user's environment does not contain a value for MYPATH, what does mywhich use instead?
  • How does a client properly use scan_token? (see sample uses in both tokenize.c and mywhich.c)
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

The code we provide has been stripped of its comments and it's your job to provide the missing documentation.

3. mywhich

The final program you should write is a version of the Unix which command, a utility used to locate and identify executable programs to run. Your mywhich program will use the two utility functions (get_env_value and scan_token) you wrote earlier.

What does the which command do?

As mentioned in the assignment overview video, the which command searches for a command by name and reports the full path to its matching executable file. Read its man page (man which) and try it out, e.g. which ls or which make or which emacs. The response from which is the matching executable, or no output if not found.

This search is intimately related to how commands are executed by the shell. When you run a command such as ls or vim, the shell searches for an executable program that matches that command name and then runs that program.

Where does it search for executables? You might imagine that it looks for an executable file named vim in every directory on the entire filesystem, but such an exhaustive search would be both inefficient and insecure. Instead, it looks only those directories that have been explicitly listed in the user's PATH environment variable. The default search path includes directories such as /usr/local/bin/ and /usr/bin/ which house the executable files for the standard Unix commands. (The name bin is a nod to the fact that executable files are encoded in binary).

The user can configure their search path by changing the value of their PATH environment variable. The value for PATH is a sequence of directories separated by colons such as PATH=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/games. When looking for a command, which considers the directories in the order they are listed in PATH and stops at the first one that contains a matching executable. In order to match, the file's name must be an exact match and the file must be readable and executable by the user.

How does mywhich operate?

The mywhich program you are to write is similar in operation to the standard which with these differences:

  • mywhich uses the environment variable MYPATH for the search path. If no such environment variable exists, it falls back to PATH. (Standard which always uses PATH as the search path)
  • mywhich does not support any command-line flags. (standard which -a prints all matches)
  • mywhich invoked with no arguments prints the directories in the search path, one directory per line. (standard which with no arguments does nothing). This use case is a testing aid to verify that your utility functions work correctly.
myth$ ./mywhich 
Directories in search path:
/usr/bin
/usr/local/bin
/usr/pubsw/bin
/bin
  • mywhich invoked with one or more arguments searches for a matching executable for each argument. The sample output below shows invoking mywhich to find three executables. For each command name, it prints the full path to the first matching executable or nothing if no matching executable was found. The matched executables are listed one per line in the order that the command names were specified on the command-line. In this example, two of them were found, but no executable named submit was found in any directory in the user's MYPATH and thus nothing was printed for it.
myth$ ./mywhich xemacs submit cp
/usr/bin/xemacs
/bin/cp

A command will contain only non-special characters (i.e. no ^ # * / $ % which have special meaning in Unix).

You will want to test with different configurations of the search path. Rather than permanently change your actual PATH (which can prevent normal operations from working), we recommend that you change MYPATH, which only affects mywhich and nothing else. Use the env command mentioned in the earlier section and the assignment video to set the value of MYPATH just when running mywhich, like this:

    myth> env MYPATH=/tmp:tools ./mywhich submit
    tools/submit

We also encourage using sanity check and the sample solution. For example, for any command ./mywhich arguments, try the same invocation on the solution samples/mywhich_soln arguments and validate that your program has the same behavior as the solution.

Assumptions. You may assume correct usage in all cases and that the user's MYPATH and PATH variables are well-formed sequences of one or more paths separated by colons. You do not need to detect or handle situations where these assumptions do not hold and we will not test on any inputs that violate these assumptions, e.g. no usage of the unsupported -a flag, no malformed values for MYPATH, and no special characters in commands.

Operation. The user's MYPATH (or PATH if there is no MYPATH variable in the user's environment) defines the search path. mywhich considers the directories in the order they are listed in the search path. If the directory contains a readable, executable file whose name is an exact match to the command, the search stops here and prints the full path of that file.

How can I test if a file is an executable?

You must use the library function access() (man access). Given a full path and a "mode", the function reports whether you can access that path for that mode. The mode to check for is a combination of R_OK and X_OK. This verifies that you have permission to read the file and the file is executable. Be sure to carefully read the man page so you know how to properly interpret the return value from a call to access.

The "mode" is provided as a bitmask - in particular, if you want both R_OK and X_OK, you must provide a mask with both of those bits on.

One other minor detail is that a path that is a directory may test as readable and executable, and thus appear to be an executable file to mywhich. There is further filesystem information you can use (man stat) to distinguish directories from files, but we don't ask you to do this. mywhich should just use access to filter results to those matching entries that are readable/executable, without concern for if those matches are files or directories. This is the behavior exhibited by the sample solution.

Testing

In addition to the above programs, as part of this assignment you should also add at least 5 additional tests of your own in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the file that explain why you included each test and how the tests relate to one another. The tests supplied with the default SanityCheck are a start that you should build on, with the goal of finding and fixing any bugs before submitting, just like how a professional developer is responsible for vetting any code through comprehensive testing before adding it to a team repository. As a reminder, you can visit the working on assignments linked to from the Assignments page for information about SanityCheck and how to use it with custom tests. We recommend you run SanityCheck early and often (and remember, it even makes a snapshot of your code to guard against editing mishaps!). You can also find suggested testing strategies on the testing page, linked to from the Assignments page.

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

Grading

Below is the tentative grading rubric. We use a combination of automated tests and human review to evaluate your submission. More details are given in our page linked to from the Assignments page explaining how assignments are graded.

Functionality (80 points)

  • Sanity cases (25 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean run under valgrind (8 points) Clean memory report(s) when run under valgrind. Memory errors (invalid read/write, etc) are significant deductions. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exception/error cases under Valgrind.
  • custom_tests (5 points) Your custom_tests file should include at least 5 tests of your own that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code Quality (buckets weighted to contribute ~15 points)

  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on. For this program, you should not need and should not use dynamic memory (i.e. no malloc/free/strdup).
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

For more information about style guidelines to use for assignments, take a look at the CS107 Style Guide, linked to from the Assignments page.

On-time Bonus (+5%) The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly! The bonus is calculated as a percentage of the point score earned by the submission. See the General Information handout for more information about the late policy.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of a standard Unix utility program and an improved version of a standard library function. That's a pretty darn impressive accomplishment, especially so given only two weeks of learning about Unix and C -- wow!

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before the exams.

  • The string library contains several functions to perform a form of string comparison, e.g. strncmp, strstr, strchr, strspn, ... Explain the differences between the functions and identify a situation in which each is appropriate.
  • Write a C expression that converts a hexadecimal digit char to its numerical value, i.e. '1' => 1, 'f' => 15. Be sure to consider string.h functions that can help with the job.
  • The first parameter to the function scan_token is of type const char **. Explain the purpose of the extra level of indirection on that argument.
  • It is controversial (see section 13) whether to add . (the current directory) to your PATH. Why might it be convenient? Why does it introduce a security risk?

Frequently Asked Questions

We will add common questions and answers here as students work on the assignment. Check back for updates!

Can I use strtok, getenv or strsep on this assignment?

No. For this assignment, you are writing scan_token as the better replacement for strtok. Your mywhich program should call scan_token, not strtok or strsep. Similarly, you are writing your own version of getenv.

Should I separate out all the directories from searchpath and store them into an array of strings?

No. Attempting to first tokenize the searchpath into directories and only then process the directories adds the complication of allocating/managing/deallocating an intermediate memory structure for the array and strings, which we don't want you to do. The preferred approach is "tokenize-and-test". Searching for a command should tokenize a directory from the searchpath and test for the presence of a matching executable in that directory. If it's not found, tokenize the next directory and test, and so on until all directories in the searchpath have been examined. In this way, the memory needs are simplified to just what is needed to process a single directory at a time. If you are searching for multiple commands, you will repeat the tokenization of the search path for each command and that's fine.

How can I assemble a full path from a directory and command name?

Declare a stack buffer of a large size into which you will construct the full path. The appropriate value for "large size" is the constant PATH_MAX defined in the included file <limits.h>. PATH_MAX is the system's limit on the maximum length of a full path (including the null terminator). Now you want to fill the buffer with the concatenation of the directory, a forward slash, and command name.

I'm getting compiler warnings about initialization discards 'const' qualifier from pointer target type. How do I resolve these?

The const qualifier on a declaration is an indication that type is only to be read, not written. A const char * and a char * are not quite the same type. You can read characters from either, but only the non-const version allows the characters to be changed. While it is perfectly fine to supply a non-const pointer where a const is expected (no cast required or recommended), the inverse will raise a warning from the compiler. Throwing down a typecast to const would quiet the compiler, but that's not the appropriate fix. If write-access is required, a read-only pointer is not going to work and a cast only serves to cover up the mistake. You should instead be fixing the logic to supply the correct pointer. If write-access is not needed, then the simple and correct choice is to add the const qualifier to the declaration.

Is it allowable to call string.h functions?

Yes! For this assignment, use of getenv and strtok/strsep is prohibited since you are writing your own versions of those functions, but the rest of the standard library is at your disposal and its use is strongly preferred over re-implementing its functionality. The functions in the standard library are already written, tested, debugged, and highly optimized. What's not to like? One important consideration, though, is to choose the appropriate function to use. As one example, there are several different functions that do variants of string compare/search (strstr, strchr, strncmp, strspn and so on). Rather than pick a function at random and add it into what you need, be sure to choose the approach that most directly accomplishes the task at hand.