Assignment 2: C Strings

Due: Fri Jul 11 11:59 pm
Late submissions accepted until Sun Jul 13 11:59 pm

Notes: DO NOT edit the tokenize_util.c file until reading the instructions for the scan_token section of the assignment.

This assignment focuses just on the material related to strings and pointers - you should not need material related to heap allocation for this assignment.

The scan_token problem requires material on double pointers, which is covered in the lecture following the release of this assignment. Make sure you also understand the strtok function from lab2; that understanding will help significantly as you implement scan_token!

Through TA helper hours and the discussion forum, our focus will be on supporting you so that you can track down your own bugs. Please ask us how to best use tools (like GDB and the brand-new Valgrind!), what strategies to consider, and advice about how to improve your debugging process or track down your bug. We are happy to help you with these in order to help you drive your own debugging. For this reason, if you have debugging questions during helper hours, please make sure to gather information and explore the issue on your own first, and fill out the queue questions with this information.

Assignment by Julie Zelenski, with modifications by Nick Troccoli, Katie Creel, Brynne Hurst and Jonathan Kula

Learning Goals

This assignment covers topics in recent string lectures and the second lab. You will be building your skills with:

  • C-strings (both raw manipulation and using string library functions)
  • viewing Unix utility programs from an internal perspective - as an implementer, not just a client
  • exposure to programmatic access of the filesystem and shell environment variables
  • thoroughly documenting your code, and learning about the importance of good documentation

Overview

For this assignment, you will write programs that replicate some of the functionality of the Unix commands printenv and which. This is an especially appropriate way to learn more about C and Unix; implementing the Unix operating system and its command-line tools was what motivated the creation of the C language in the first place! Implementing these programs is a very natural use of C, and you'll see how comfortably it fits in this role. Moreover, when we interact with the filesystem programmatically in C, as we will do in part of this assignment, we can use a C string to represent a path (like /a/b/c) and can construct and dissect paths by string manipulation and string.h library functions. Working with paths is thus practice with C strings!

This assignment asks you to complete two functions and one program. Each part gives practice with string manipulation:

  • get_env_value has you extract a specific value from a list of strings
  • scan_token has you implement and document an improved version of strtok from lab2
  • mywhich has you use your get_env_value and scan_token functions to print out the location of an executable on the filesystem

A few reminders:

  • The working on assignments page contains info about the assignment process.
  • The collaboration policy page outlines permitted assignment collaboration, emphasizing that you are to do your own independent thinking, design, coding, and debugging. If you are having trouble completing the assignment on your own, please reach out to the course staff; we are here to help!

Our debugging guide has tips on how to approach diagnosing various issues and has specific steps to resolve many common bugs that may arise. Keep it handy while you work!

View Debugging Guide

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign2/$USER assign2

The starter project contains the following:

  • readme.txt: a text file where you will answer questions for the assignment
  • env_util.c, tokenize_util.c, mywhich.c and Makefile: the code files that you will modify, and their Makefile for compiling
  • custom_tests: the file where you will add custom tests for your programs
  • myprintenv.c: a program that calls your get_env_value function in env_util.c for testing purposes. You do not need to modify this file.
  • tokenize.c: a program that calls your scan_token function in tokenize_util.c for testing purposes. You do not need to modify this file.
  • bugreport.txt: a file you can optionally fill out for extra credit to document how you applied the debugging checklist with GDB for one bug you encounter on this assignment.
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini, sanity.py and prototypes.h: files to configure and run Sanity Check. You can ignore these.
    • myprintenv_soln, mywhich_soln and tokenize_soln: executable solutions for the programs you will write.
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work. It also contains the codecheck tool for checking for style and other common code issues.

Codecheck: Starting with this assignment, part of your style / code review score will be dependent on having a clean run (no reported code issues) when run through the codecheck tool. Make sure to run it frequently to ensure your code adheres to all necessary guidelines!

Working With Strings

For this assignment, use of getenv/secure_getenv and strtok/strsep is prohibited since you are writing your own versions of those functions, but the rest of the standard library is at your disposal and its use is strongly preferred over re-implementing its functionality. The functions in the standard library are already written, tested, debugged, and highly optimized. What's not to like? One important consideration, though, is to choose the appropriate function to use. As one example, there are several different functions that do variants of string compare/search (strstr, strchr, strncmp, strspn and so on). While working on this assignment, be sure to choose the approach that most directly accomplishes the task at hand.

Something else that appears on this assignment is the const keyword; a const char * means that the characters pointed to by this pointer cannot be changed. It also means that if you create another pointer to point to these same characters, it must also be const; think of const like part of the variable type itself. You can, however, reassign the const char * to point to something else; it is just that you are not able to change the characters at the location to which it points. In other words, const char * (and const char **, const char ***, and so on) mean the characters at the location ultimately being referred to cannot be modified, but any pointer on the way there can be modified. Also, it's usually okay to use a non-const pointer for a const pointer argument or variable (no cast required or recommended) - e.g., for strlen(const char *), though its parameter type is technically const char *, we can pass in non-const char *s without casting. But the inverse (supplying a const pointer where a non-const is expected) will raise a warning from the compiler and is likely to result in problems. Here are some examples:

// cannot modify this char
const char c = 'h';
// cannot modify chars pointed to by str
const char *str = ...
// cannot modify chars pointed to by *strPtr
const char **strPtr = ...


char buf[6];
strcpy(buf, "Hello");
const char *str = buf;

// not allowed
str[0] = 'M';

// allowed!
str = "Mello";

// not allowed
str[1] = 'a';

// allowed!
buf[0] = 'M';

If you get compiler warnings about initialization discards 'const' qualifier from pointer target type, it means that the "const-ness" does not match; make sure you follow the rules above for your variable declarations and preserve const-ness where needed.

Testing

This assignment heavily emphasizes testing. Across the 2 functions and the mywhich program you write below, you should add at least 11 tests of your own in total - at least 3 for get_env_value, at least 5 for scan_token, and at least 3 for mywhich - in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the custom_tests file that explain why you included each test and how the tests relate to one another. The tests supplied with the default SanityCheck are a start that you should build on, with the goal of finding and fixing any bugs before submitting, just like how a professional developer is responsible for vetting any code through comprehensive testing before adding it to a team repository. We recommend you run tests early and often (and remember, running tests even makes a snapshot of your code to guard against editing mishaps!). You can also find suggested testing strategies on the testing page, linked to from the Assignments dropdown.

The best way to approach testing on this assignment is:

  1. Understand the expected program behavior
  2. BEFORE writing code, write some tests that cover various cases you can think of
  3. Write your code
  4. Write more tests to cover additional cases

This is because once you start writing code, you may start to think in terms of how your code works rather than how the code should work, meaning if you omit handling a case in your code, you may also omit covering that case in your testing. Thus, a good strategy is to write some tests before implementing anything, and then as you implement, you can add further tests. Use the tests as a way to gauge your progress and uncover bugs! We provide some testing recommendations in each problem section.

Background: Unix Filesystem and the Shell Environment

In this assignment, you will write code that interacts with the Unix filesystem and something called shell environment variables. If you need an introduction or refresher on the filesystem, review our Unix guide for tutorials on the tree structure, absolute and relative paths, and various commands to interact with files and directories as a user.

We made a video explaining some of the background information about Unix and the terminal that's necessary for this assignment - make sure to watch it before continuing!

As mentioned in the video above, on a Unix system, programs run in the context of the user's "environment". The environment is a list of key-value pairs that provide information about the terminal session and configure the way processes behave. You have already used the USER environment variable when cloning your assignment repo; USER is set to your SUNet ID when you log into myth. Other variables include PATH (where the system looks for programs to run), HOME (path to your home directory), and SHELL (your command line interpreter).

Explore your environment by trying out the printenv and env commands mentioned in the video, and reading their manual pages. You will be implementing a core part of the printenv program as part of the assignment. As a summary:

  • printenv will show your environment variables. Run printenv with no arguments to see your entire environment. Then try printenv USER SHELL HOME. What is the output from a bad request like printenv BOGUS?
  • env is a command that allows you to temporarily change environment variables. You can execute something like:
env BINKY=1 OTHERARG=2 ./myprogram

and myprogram will be executed in a temporary environment with all of the original environment variables, plus BINKY set to 1 and OTHERARG set to 2. To see this, run printenv, then run env BINKY=1 WINKY=2 printenv. What changes between the two?

You can also use env with GDB; e.g. if you want to debug a program that is run using env, start gdb prefixed with env, and then run as normal - for instance: env USER=otheruser gdb myprogram

You can use env with Valgrind as well; e.g. if you want to run valgrind with a program that is run using env, run it like this:

env NAME=VAR valgrind [EXECUTABLE] [ARGS]

Before moving on: make sure you have understood what environment variables are and what the printenv program does. Also make sure you're familiar with how to use the env command; this will be essential for thorough testing!

Extra Credit: Bug Report

Practice with GDB and Valgrind will pay off tremendously as the quarter progresses! To incentivize applying GDB and the debugging checklist to help squash bugs while working on this assignment, for an optional 5 points of extra credit, fill out the bug report file bugreport.txt with information about how you applied the debugging checklist with GDB for one bug you encounter on this assignment. We certainly encourage using the checklist for all bugs, but we require just one bug report for extra credit. See bugreport.txt for more information. We also strongly encourage using Valgrind alongside the debugging checklist, but this extra credit exercise focuses on GDB specifically. It's fine if the bug you document is one you ask questions about at helper hours.

1. get_env_value

View Instructions

2. scan_token and Documentation

Note: one important requirement for this part is, before you make any changes to the tokenize_util.c file, to write and submit a portion of your custom tests for this part. See the full instructions for this part for how to do this.

View Instructions

3. mywhich

View Instructions

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work.

When you submit, you may optionally indicate that you do not plan to make a submission after the on-time deadline. This allows the staff to start grading some submissions as soon as the on-time deadline passes, rather than waiting until after the late period to start grading.

  • When in doubt, it's fine to indicate that you may make a late submission, even if you end up submitting on time
  • If you do indicate you won't submit late, this means once the on-time deadline passes, you cannot submit again. You can resubmit any time before the on-time deadline, however.
  • If you want to change your decision, you can do so any time before the on-time deadline by resubmitting and changing your answer.
  • If you know that you will not make a late submission, we would appreciate you indicating this so that we can grade assignments more quickly!

You only need to modify the following files for this assignment: env_util.c, tokenize_util.c, mywhich.c, custom_tests, readme.txt

We would also appreciate it if you filled out this homework survey to tell us what you think once you submit. We appreciate your feedback!

Grading

Below is the tentative grading rubric. We use a combination of automated tests and manual review to evaluate your submission. More details are given in our page linked to from the Assignments dropdown explaining how assignments are graded.

Readme (12 points)

Functionality (83 points)

  • Sanity cases (25 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (40 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean run under valgrind (10 points) Clean memory report(s) when run under valgrind. Memory errors (invalid read/write, etc) are significant deductions. Every normal execution path is expected to run cleanly with no memory errors or leaks reported. We will not test exception/error cases under Valgrind.
  • custom_tests (6 points) Your custom_tests file should include at least 11 tests of your own (minimums of 3 for get_env_value, 5 for scan_token, and 3 for mywhich), that show thoughtful effort to develop comprehensive testing coverage. Part of this score category is submitting at least 3 scan_token custom tests prior to editing tokenize_util.c, and adding at least 2 more in your final submission. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code Quality (buckets weighted to contribute ~15 points)

The grader's code review is scored into a bucket per assignment part to emphasize the qualitative features of the review over the quantitative. The styleguide is a great overall resource for good program style. Here are some highlights for this assignment:

  • Using library functions where possible. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on. For this program, you should not need and should not use dynamic memory (i.e. no malloc/free/strdup).
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided (except for tokenize.c and myprintenv.c). We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

Codecheck compliance. We will be running your code through codecheck and verifying that there are no issues present.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of a standard Unix utility program and an improved version of a standard library function, along with comprehensive documentation. That's a pretty darn impressive accomplishment, especially so given only a few weeks of learning about Unix and C -- wow!

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to fill any gaps in your understanding before moving on from a task. They could also be useful as review before the exams.

  • The string library contains several functions to perform a form of string comparison, e.g. strncmp, strstr, strchr, strspn, ... Explain the differences between the functions and identify a situation in which each is appropriate.
  • Write a C expression that converts a hexadecimal digit char to its numerical value, i.e. '1' => 1, 'f' => 15.
  • The first parameter to the function scan_token is of type const char **. Explain the purpose of the extra level of indirection on that argument.
  • It is controversial (see section 13) whether to add . (the current directory) to your PATH. Why might it be convenient? Why does it introduce a security risk?
  • Why is good function documentation (like manual pages) critical for good software development?