Lab 2: C-Strings

Lab sessions Tue Jan 22 to Thu Jan 24

Lab written by Julie Zelenski, with modifications by Nick Troccoli

Learning Goals

During this lab, you will:

  1. read and analyze C code that operates on chars and C-strings
  2. use gdb and Valgrind to detect and debug memory errors

Find an open computer to share with a partner and introduce yourselves. Introduce yourself and tell them about your favorite music to listen to while working!

Get Started

Clone the lab starter code by using the command below. This command creates a lab2 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab2/shared lab2

Next, pull up the online lab checkoff and have it open in a browser so you can jot things down as you go. Only one checkoff needs to submitted for both you and your partner.

Standard C Library Functions

The term "standard library" refers to the collection of functions that are packaged with a programming language. Our guide to stdlib gives an overview of the functions available in C's standard library.

Many programmers treat the system libraries as a "black box" -- using the functions as-is, with only a vague idea of how they are implemented. Given our goal to truly understand the inner workings of the system, we are going to go deeper.

To that end, some of our code study exercises will dig into library code. There is much to learn from studying industrial-strength code and reflecting on the tradeoffs in design, readability, and efficiency. You will also become a better-informed client, with deep knowledge of the proper use of these functions and the pitfalls and misuses to avoid.

This knowledge may also embolden you to consider writing your own version of isdigit or strncpy into your programs, but the best practice is to prefer using the existing library functions whenever possible. These versions are already written, fully debugged, and highly tuned -- what's not to like? We will study these functions, learn from the code, and critique them, but also be grateful to later just call upon them when needed.

Exercises

Do your best to spend the recommended amount of time on each exercise to ensure you get to each part of this week's lab.

1) Code Study: atoi (20 minutes)

In assign0, you used the atoi function to convert a string argument to an integer. The function name comes from ascii to integer. As you found out in assign0, atoi is convenient to use, but not particularly robust. It has largely been superseded by strtol (used in assign1), which provides more features, but is more complicated.

Below is the implementation of atoi from musl. This code is included in the atoi.c program so you can compile and run it.

int musl_atoi(const char *s)
{
    int n=0, neg=0;
    while (isspace(*s)) s++;
    switch (*s) {
        case '-': neg=1;
        case '+': s++;
    }
    /* Compute n as a negative number to avoid overflow on INT_MIN */
    while (isdigit(*s))
        n = 10*n - (*s++ - '0');
    return neg ? n : -n;
}

First trace through the function with your partner. Here are some questions to get you started:

  • How is a single digit character converted to its numeric value? How is that value combined with the number being built up so far?
  • If the string begins with a leading minus, at what point does it advance s past that char? This is subtle! (The fallthrough behavior of a switch statement can be surprising and often triggered by accident, so a comment noting there is intentionally no break statement would probably have been a good idea, but alas...)
  • The code has two loops that advance along the string, but neither has an explicit test to stop at the null terminator. Why is this not necessary?
  • The loop builds up n as a negative value and later negates it. The comment indicates this choice avoids overflow on INT_MIN. Why does INT_MIN seem to necessitate its own special case?
  • What happens when the input string contains a non-digit char? What if that non-digit char is the very first character?

Compile the atoi program and do some testing. Identify an input of interest and first manually trace how the function will process it, then step through the program in gdb to verify your understanding is correct. Try some inputs that should successfully convert, e.g. ./atoi 107 0 -12, then consider inputs that are problematic or malformed, e.g. ./atoi 3.14 binky @55 --8. What about an input is outside the representable range of integer? If the string is a number larger than INT_MAX, will it overflow, saturate, return 0, raise an error, or do something else entirely?

2) Valgrind (30 minutes)

Each lab will devote some time to hands-on practice with the tools. Last week, it was sanitycheck and gdb. This week, we introduce Valgrind.

Now that we are ramping up the use of pointers, our programs become more at risk for memory errors. These can be difficult bugs to track down and Valgrind is a necessary tool to learn. Valgrind is a tool that helps detect memory errors. If you haven't already, review the our guide to Valgrind. Let's try using it to fix a program.

  • Read over the code in buggy.c to see its three planted memory errors. Consider just error #1 to start.
  • Run ./buggy 1 and you should see Segmentation fault (core dumped). A segfault results from an attempt to read/write an inaccessible/invalid address. Ok, so now you know your program as a memory error but not much else.
  • Now, run the program under gdb. You observe the same segfault, but now try using the gdb backtrace command to learn where execution was at the time of the crash. That's helpful, but we want more!
  • Exit gdb and run the program under Valgrind: valgrind ./buggy 1. The Valgrind report gives more detail including the type of memory error, the execution call stack, and the value of the invalid address. Furthermore, Valgrind also detected an earlier problem that preceded the seg fault -- use of uninitialized value of size 8 (sizeof pointer) right before the invalid read of size 1 (char). This is the clue that can lead you to the source of the error!
  • Repeat above steps for buggy 2 and buggy 3. Note that whereas error #1 fails in every context, error #2 is less obvious to the naked eye and are only outed by Valgrind's vigilance.

The terminology and reporting that Valgrind uses for errors can be difficult to understand, and it takes practice. Your goal is to learn how to relate the error as reported by Valgrind back to the location in the code and root cause.

We chose these particular cases to further demonstrate the trickiness of memory errors. All three bugs come from some form of memory error, but the observed consequences differ. buggy 1 and buggy 3 crash with a seg fault every time. buggy 2 produces output that is sometimes correct and other times not, but it does not crash. A memory error may cause no visible harm, but that doesn't mean the program is correct, it just "got lucky" this time. Running Valgrind can detect these errors that are lying in wait, even before they produce an observable effect.

For this reason, we recommend that you run Valgrind early and often during your development cycle. Whenever Valgrind reports any memory errors, you should stop and focus on resolving them before moving on.

In grading, we run your submission under Valgrind and look for a clean report, so be sure you are running Valgrind yourself to find these errors.

3) Code Study: strtok (50 minutes)

A common string-handling pattern is to split a string into "tokens" separated by one or more delimiting characters, such as splitting a sentence into space-separated words or dividing a file path into components delimited by slashes. The C library provides the strtok function, but its awkward design makes using the function messy and error-prone. We are going to take a look at this function and its issues as a cautionary tale. You will be writing an improved tokenize function as part of your next assignment and should understand the pitfalls to avoid!

Let's first consider the client use of strtok. Acquaint yourself with its operation from the man page (man strtok), taking note of the critiques of the function's design in the BUGS section. Unlike tokenization functions you may have used in other languages, a single call to strtok does not return a nicely formatted vector of tokens. Rather, you must call strtok repeatedly, each time receiving a single token, and continue until strtok returns NULL to indicate there are no more tokens. A tokenization loop looks like this:

char *token = strtok(input, delims);
while (token != NULL) {
    printf("%s\n", token);
    token = strtok(NULL, delims);
}

The two main complaints about strtok are that it destructively modifies its input and it is restricted to managing one active tokenization at a time. Let's consider these issues separately.

Rather than create a new substring for each token, strtok overwrites the token's ending delimiter in the input string with a null char and returns a pointer to the token in-place within the input string. It effectively chops up the input string into a sequence of tokens, re-purposing the existing characters where they are, rather than copying the chars elsewhere.

If the input string originally contained "red-green-blue\0", repeated calls to strtok to split on - would change the contents of the input string to "red\0green\0blue\0". Wacky! Not only is it undesirable that a function destroy its input in this way, this design also disallows tokenizing an input that is read-only, such as a string constant. The token.c program has a sample call that passes a string constant as the input to strtok. Uncomment that call and compile and run the program to see the consequence when strtok attempts to change the string constant - ouch!

Another odd feature is that strtok keeps track of the current state of the tokenization process on behalf of future calls to the function. In the first call to strtok, the first argument is the input string to tokenize, but in subsequent calls you pass NULL as the first argument, which tells strtok to continue tokenizing the same input as the previous call, picking up where it left off. Internally, strtok preserves the original input string from the first call and updates it across subsequent calls. This behavior is in stark contrast to a typical function call which operates completely independently of any previous or future calls. Not only does this make it difficult to trace the client use, it also dictates that at most one tokenization can be active at a time. The function has one saved state that is shared by all calls. Starting a second tokenization before finishing the first will delete the saved state of the first and replace it with the second.

Are you curious about how this poorly-designed strtok function is implemented? Here is musl's implementation:

 1    char *musl_strtok(char *s, const char *sep)
 2    {
 3         static char *p = NULL;
 4
 5         if (s == NULL && ((s = p) == NULL))
 6             return NULL;
 7         s += strspn(s, sep);
 8         if (!*s)
 9             return p = NULL;
10         p = s + strcspn(s, sep);
11         if (*p)
12             *p++ = '\0';
13         else 
14             p = NULL;
15         return s;
16     }

There is lot to unpack here! It may be helpful to draw a memory diagram and trace how s and p move about during a call and/or run the token program under gdb so you can observe the code in action. Note that, while it can be confusing syntax, an assignment expression (e.g. s = p) completes and then evaluates to the value of p and s.

  • The wikipedia page on static variables gives some background info and an example of a static local variable. What does static do and why is it being used here?

    Take care to note that static variables are initialized differently than non-static local variables. Although the function opens with what looks like a declaration and initialization of a new variable p, a static variable is declared and initialized exactly once. The value of p is preserved across calls.

  • When NULL is passed as the first argument, how does s get its value?

  • What do the functions strspn and strcspn do?
  • How is s updated during the execution of line (7)?
  • In what situation does the function exit through line (9)?
  • Where does p point after executing line (10)?
  • Unravel the assignment in line (12) and work out how it changes the input string. After line (12) executes, where does p now point?
  • Lines (12) and (14) appear to be somewhat similar assignments of zero to p, yet there are critical differences in what they accomplish. What is the difference in the effect of p = NULL versus *p = '\0' versus p = ""?

Recap and Check Off With TA

At the end of the lab period, submit the checkoff form and ask your lab TA to approve your submission so you are properly credited for your work. It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is need to solidify your knowledge. Also take a chance to reflect on what you got what from this lab and whether you feel ready for what comes next! The takeaways from lab2 should be further experience with gdb and Valgrind, learning your way around the string.h functions and manipulating C-strings as raw arrays/pointers. Here are some questions to verify your understanding and get you thinking further about these concepts:

  • In which situations does a call to strncpy null-terminate the destination string and in which does it not? What are some of the consequences that result from using a string that is not properly null-terminated?
  • A program appears to run just fine, but when run under Valgrind, it reports an error. Your friend dismisses the Valgrind report as off-base and paranoid. They argue that the program runs ok as proof, and what's the big deal? Explain to them why accessing any memory that doesn't belong to you is a problem and further describe how a memory bug can lurk asymptomatically, despite being in error.
  • Give a C-expression that reports whether a given char* has a matching prefix/suffix N chars long. Be sure to take advantage of the string.h library functions. What would happen if your expression were evaluated with a value for N that was longer the string's length?
  • The atoi/strtol functions convert a string of digits to a decimal, but no standard function that converts in reverse. Describe a technique you could use to convert a number to a string equivalent.

Lecture 5 Video

If you finish the lab early and are comfortable with the concepts and exercises above, if you have not done so already, take the remaining time to watch the posted video for Lecture 5 on the course website's syllabus page. This video provides important additional material related to C strings and this week's assignment.