Lab 2: C Strings

Lab sessions Tue Sep 29 to Thu Oct 01

Lab written by Julie Zelenski, with modifications by Nick Troccoli

Pre-Lab Exercises

Before attending your lab, please work through the following exercises. We plan for the exercises to take about 20-30 minutes max. There's nothing to submit, but the exercises will be essential to the further problems you'll work through during your lab session. Feel free to post on the discussion forum if you have questions while working!

  1. Start by reading over the "Learning Goals", "Get Started" and "Standard C Library Functions" sections of the lab writeup below to get familiar with the focus of this week's lab.
  2. Read over the description of the strtok problem (up to but not including the implementation itself) and make sure you understand how strtok functions. If you get stuck, feel free to post on the discussion forum or jot down any thoughts or questions you have to bring with you to lab to discuss with your group and lab TA.

Lab Overview

Lab starts all together in the main lab room for the first 10 minutes, where you'll get a chance to chat with others in your lab and review relevant concepts for this lab. Your TA will then break everyone off into the same groups as the previous lab to start working on the first problem together. After each problem, the TA will have everyone return to the main room for about 10 minutes to discuss and review the problem, what people found, what questions there are, and what was confusing. Then you'll return to your small group rooms to work on the next problem.

Learning Goals

During this lab, you will:

  1. read and analyze C code that operates on chars and C-strings
  2. use gdb and Valgrind to detect and debug memory errors

Group Work Tips

We hope you have a chance to get to know your lab group over the course of the quarter! We encourage you to take some time at the start to chat and see how things are going for everyone. How is the current assignment? What general tips or tricks have you learned while working? Any favorite tunes you listen to while working?

While you're working through each part of the lab, we recommend a few things to help your group get the most out of the exercises:

  • share your video if you can
  • for each problem, choose one person to "drive", meaning they share their terminal screen and type in commands, while everyone else views their terminal screen to work through the problem and ask questions
  • rotate the "driver" each problem so that each person has a chance to share their screen and lead the terminal interactions
  • In the last few minutes of working on each problem, spend some time brainstorming with your group what your overall takeaways were, and what questions you have that you want to bring back to the full-lab discussion. It's ok if you don't have time to get through everything!

Get Started

Clone the lab starter code by using the command below. This command creates a lab2 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab2/shared lab2

Standard C Library Functions

The term "standard library" refers to the collection of functions that are packaged with a programming language. Our guide to stdlib gives an overview of the functions available in C's standard library.

Many programmers treat the system libraries as a "black box" -- using the functions as-is, with only a vague idea of how they are implemented. Given our goal to truly understand the inner workings of the system, we are going to go deeper.

To that end, some of our code study exercises will dig into library code. There is much to learn from studying industrial-strength code and reflecting on the tradeoffs in design, readability, and efficiency. You will also become a better-informed client, with deep knowledge of the proper use of these functions and the pitfalls and misuses to avoid.

This knowledge may also embolden you to consider writing your own version of isdigit or strncpy into your programs, but the best practice is to prefer using the existing library functions whenever possible. These versions are already written, fully debugged, and highly tuned -- what's not to like? We will study these functions, learn from the code, and critique them, but also be grateful to later just call upon them when needed.

Exercises

1) Valgrind

(20 minutes + 10min all-lab discussion)

Each lab will devote some time to hands-on practice with the tools. Last week, it was sanitycheck and gdb. This week, we introduce Valgrind.

Now that we are ramping up the use of pointers, our programs become more at risk for memory errors. These can be difficult bugs to track down and Valgrind is a necessary tool to learn. Valgrind is a tool that helps detect memory errors. If you haven't already, review our guide to Valgrind. Let's try using it to fix a program.

  • Read over the code in buggy.c to see its two planted memory errors. Consider just error #1 to start.
  • Run ./buggy 1 and you should see Segmentation fault (core dumped). A segfault results from an attempt to read/write an inaccessible/invalid address. Ok, so now you know your program has a memory error but not much else.
  • Now, run the program under gdb. You observe the same segfault, but now try using the gdb backtrace command to learn where execution was at the time of the crash. That's helpful, but we want more!
  • Exit gdb and run the program under Valgrind: valgrind ./buggy 1. The Valgrind report gives more detail including the type of memory error, the execution call stack, and the value of the invalid address. Furthermore, Valgrind also detected an earlier problem that preceded the seg fault -- use of uninitialized value of size 8 (sizeof pointer) right before the invalid read of size 1 (char). This is the clue that can lead you to the source of the error!
  • Repeat above steps for buggy 2. Note that whereas error #1 fails in every context, error #2 is less obvious to the naked eye and are only outed by Valgrind's vigilance.

The terminology and reporting that Valgrind uses for errors can be difficult to understand, and it takes practice. Your goal is to learn how to relate the error as reported by Valgrind back to the location in the code and root cause.

We chose these particular cases to further demonstrate the trickiness of memory errors. Both bugs come from some form of memory error, but the observed consequences differ. buggy 1 crashes with a seg fault every time. buggy 2 produces output that is sometimes correct and other times not, but it does not crash. A memory error may cause no visible harm, but that doesn't mean the program is correct, it just "got lucky" this time. Running Valgrind can detect these errors that are lying in wait, even before they produce an observable effect.

Once you've finished, review the following question a group: What string-handling error is likely to be the root cause of a Valgrind report that "Conditional jump or move depends on uninitialised value(s)"? This is an error that may occur for you on assign2, so it's vital you understand the causes now!

We recommend that you run Valgrind early and often during your development cycle. Whenever Valgrind reports any memory errors, you should stop and focus on resolving them before moving on. Make routine valgrind checks a part of your assignment process, especially when you run into bugs! A quick run through valgrind can potentially help unearth issues. In grading, we run your submission under Valgrind and look for a clean report, so be sure you are running Valgrind yourself to find these errors.

2) Code Study: strtok

(40 minutes + 10min all-lab discussion, with 5min check-in discussion in the middle)

Take about 20 minutes to work on this problem up through the questions about line 14 of the implementation. Then, there will be a brief 5min check-in discussion to see how everyone is doing. After that, you'll have another 15 minutes to continue working on the rest of the problem.

A common string-handling pattern is to split a string into "tokens" separated by one or more delimiting characters, such as splitting a sentence into space-separated words or dividing a file path into components delimited by slashes. The C library provides the strtok function, but its awkward design makes using the function messy and error-prone. We are going to take a look at this function and its issues as a cautionary tale. You will be writing an improved tokenize function as part of assign2 and should understand the pitfalls to avoid!

Let's first consider the client use of strtok. Acquaint yourself with its operation from the man page (man strtok), taking note of the critiques of the function's design in the BUGS section. Unlike tokenization functions you may have used in other languages, a single call to strtok does not return a nicely formatted vector of tokens. Rather, you must call strtok repeatedly, each time receiving a single token, and continue until strtok returns NULL to indicate there are no more tokens. A tokenization loop looks like this:

char *token = strtok(input, delims);
while (token != NULL) {
    printf("%s\n", token);
    token = strtok(NULL, delims);
}

The two main complaints about strtok are that it destructively modifies its input and it is restricted to managing one active tokenization at a time. Let's consider these issues separately.

Rather than create a new substring for each token, strtok overwrites the token's ending delimiter in the input string with a null char and returns a pointer to the token in-place within the input string. It effectively chops up the input string into a sequence of tokens, re-purposing the existing characters where they are, rather than copying the chars elsewhere.

If the input string originally contained "red-green-blue\0", repeated calls to strtok to split on - would change the contents of the input string to "red\0green\0blue\0". Wacky! Not only is it undesirable that a function destroy its input in this way, this design also disallows tokenizing an input that is read-only, such as a string constant. The token.c program has a sample call that passes a string constant as the input to strtok. Uncomment that call and compile and run the program to see the consequence when strtok attempts to change the string constant - ouch! What happens if you call strtok passing a string constant as the input to tokenize?

Another odd feature is that strtok keeps track of the current state of the tokenization process on behalf of future calls to the function. In the first call to strtok, the first argument is the input string to tokenize, but in subsequent calls you pass NULL as the first argument, which tells strtok to continue tokenizing the same input as the previous call, picking up where it left off. Internally, strtok preserves the original input string from the first call using a global variable (yikes!) and updates it across subsequent calls. This behavior is in stark contrast to a typical function call which operates completely independently of any previous or future calls. Not only does this make it difficult to trace the client use, it also dictates that at most one tokenization can be active at a time. The function has one saved state that is shared by all calls. Starting a second tokenization before finishing the first will delete the saved state of the first and replace it with the second.

Are you curious about how this poorly-designed strtok function is implemented? Here is code based on musl's implementation. Note that not all stylistic decisions below are ones you should emulate - we use this code mostly as a case study of how to understand core implementations.

 1    char *musl_strtok(char *s, const char *delimiters) {
 2        static char *p = NULL;
 3
 4        if (s == NULL) {
 5            s = p;
 6            if (s == NULL) {
 7                return NULL;
 8            }
 9        }
10
11        s += strspn(s, delimiters);
12        if (!*s) {
13            p = NULL;
14            return p;
15        }
16
17        p = s + strcspn(s, delimiters);
18        if (*p) {
19            *(p++) = '\0';
20        } else {
21            p = NULL;
22        }
23
24        return s;
25    }

There is lot to unpack here! It may be helpful to draw a memory diagram and trace how s and p move about during a call and/or run the token program under gdb so you can observe the code in action.

  • The wikipedia page on static variables gives some background info and an example of a static local variable. What does static do and why is it being used here? Take care to note that static variables are initialized differently than non-static local variables. Although the function opens with what looks like a declaration and initialization of a new variable p, a static variable is declared and initialized exactly once. The value of p is preserved across calls.
  • When NULL is passed as the first argument, how does s get its value?
  • What do the functions strspn and strcspn do?
  • [Line 11] How is s updated during the execution of this line?
  • [Line 12] What does the expression *s evaluate to? What is another equivalent way to write this expression? (Perhaps this alternate way would be preferred for readability, but alas...) What does it mean when we put a ! before it?
  • [Line 14] In what situation does the function exit through this line?

Take a break here and reconvene with the rest of your lab! Your lab leader will pull everyone back together for a short 5 minute discussion about how things are going so far. Then you can get back with your group to review and complete the last questions for an additional 15 minutes.

  • [Line 17] Where does p point after executing this line?
  • [Line 19] Unravel the assignment here and work out how it changes the string. One quirky note about p++ is that the expression not only modifies p, but it also evaluates to the original value of p (Check out this link for more about this quirky syntax). In other words, p++ will modify p but then evaluate to p rather than the incremented value. After line (19) executes, where does p now point?
  • [Lines 19 + 21] These lines appear to be somewhat similar assignments of zero to p, yet there are critical differences in what they accomplish. What is the difference in the effect of p = NULL versus *p = '\0' versus p = ""?

[Optional] Extra Problem

Finished with lab and itching to further exercise your string skills? Check out our extra problem!

Recap

Nice work on the lab! It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is need to solidify your knowledge. Also take a chance to reflect on what you got what from this lab and whether you feel ready for what comes next! The takeaways from lab2 should be further experience with gdb and Valgrind, learning your way around the string.h functions and manipulating C-strings as raw arrays/pointers. Here are some questions to verify your understanding and get you thinking further about these concepts:

  • In which situations does a call to strncpy null-terminate the destination string and in which does it not? What are some of the consequences that result from using a string that is not properly null-terminated?
  • A program appears to run just fine, but when run under Valgrind, it reports an error. Your friend dismisses the Valgrind report as off-base and paranoid. They argue that the program runs ok as proof, and what's the big deal? Explain to them why accessing any memory that doesn't belong to you is a problem and further describe how a memory bug can lurk asymptomatically, despite being in error.
  • Give a C-expression that reports whether a given char * has a matching prefix/suffix N chars long. Be sure to take advantage of the string.h library functions. What would happen if your expression were evaluated with a value for N that was longer the string's length?
  • How would you test whether string s is a suffix of string t? Hint: try your ideas on the following examples: s = "car", t = "care", s = "na", t = "banana"
  • The atoi/strtol functions convert a string of digits to a decimal, but no standard function that converts in reverse. Describe a technique you could use to convert a number to a string equivalent.