Lab 2 (Solution): C Strings

Lab sessions Tue Jan 26 to Thu Jan 28

Solutions by Nick Troccoli and Lisa Yan

Go back to this week's lab writeup (without solutions) here.

Learning Goals

During this lab, you will:

  1. read and analyze C code that operates on chars and C-strings
  2. use gdb and Valgrind to detect and debug memory errors

Get Started

Clone the lab starter code by using the command below. This command creates a lab2 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab2/shared lab2

Standard C Library Functions

The term "standard library" refers to the collection of functions that are packaged with a programming language. Our guide to stdlib gives an overview of the functions available in C's standard library.

Many programmers treat the system libraries as a "black box" -- using the functions as-is, with only a vague idea of how they are implemented. Given our goal to truly understand the inner workings of the system, we are going to go deeper.

To that end, some of our code study exercises will dig into library code. There is much to learn from studying industrial-strength code and reflecting on the tradeoffs in design, readability, and efficiency. You will also become a better-informed client, with deep knowledge of the proper use of these functions and the pitfalls and misuses to avoid.

This knowledge may also embolden you to consider writing your own version of isdigit or strncpy into your programs, but the best practice is to prefer using the existing library functions whenever possible. These versions are already written, fully debugged, and highly tuned -- what's not to like? We will study these functions, learn from the code, and critique them, but also be grateful to later just call upon them when needed.

Exercises

1) Valgrind

(20 minutes + 10min all-lab discussion)

Each lab will devote some time to hands-on practice with the tools. Last week, it was sanitycheck and gdb. This week, we introduce Valgrind.

Now that we are ramping up the use of pointers, our programs become more at risk for memory errors. These can be difficult bugs to track down and Valgrind is a necessary tool to learn. Valgrind is a tool that helps detect memory errors. If you haven't already, review our guide to Valgrind. Let's try using it to fix a program.

  • Read over the code in buggy.c to see its two planted memory errors. Consider just error #1 to start.
  • Run ./buggy 1 and you should see Segmentation fault (core dumped). A segfault results from an attempt to read/write an inaccessible/invalid address. Ok, so now you know your program has a memory error but not much else.
  • Now, run the program under gdb. You observe the same segfault, but now try using the gdb backtrace command to learn where execution was at the time of the crash. That's helpful, but we want more!
  • Exit gdb and run the program under Valgrind: valgrind ./buggy 1. The Valgrind report gives more detail including the type of memory error, the execution call stack, and the value of the invalid address. Furthermore, Valgrind also detected an earlier problem that preceded the seg fault -- use of uninitialized value of size 8 (sizeof pointer) right before the invalid read of size 1 (char). This is the clue that can lead you to the source of the error!

    The first error is calling strlen on an uninitialized pointer. The program should always seg fault, as the program is accessing memory at a garbage location. gdb should show what line the crash occurred on, and Valgrind should show more details about the memory issues. "Use of uninitialised value" refers to the uninitialized pointer (of size 8!). "Invalid read of size 1" refers to strlen reading a byte of memory that the program does not own. "Access not within mapped region at address" also refers to this.

  • Repeat above steps for buggy 2. Note that whereas error #1 fails in every context, error #2 is less obvious to the naked eye and are only outed by Valgrind's vigilance.

    The second error is calling strlen on a string without a null-terminating character. This may output various different lengths that may be correct or incorrect, as strlen is reading uninitialized memory in search of a null-terminating character. GDB does not help here, as the program executes normally.

The terminology and reporting that Valgrind uses for errors can be difficult to understand, and it takes practice. Your goal is to learn how to relate the error as reported by Valgrind back to the location in the code and root cause.

We chose these particular cases to further demonstrate the trickiness of memory errors. Both bugs come from some form of memory error, but the observed consequences differ. buggy 1 crashes with a seg fault every time. buggy 2 produces output that is sometimes correct and other times not, but it does not crash. A memory error may cause no visible harm, but that doesn't mean the program is correct, it just "got lucky" this time. Running Valgrind can detect these errors that are lying in wait, even before they produce an observable effect.

Once you've finished, review the following question a group: What string-handling error is likely to be the root cause of a Valgrind report that "Conditional jump or move depends on uninitialised value(s)"? This is an error that may occur for you on assign2, so it's vital you understand the causes now!

We saw this come up when running ./valgrind buggy 2. "Conditional jump or move depends on uninitialised value(s)" here means that the code in strlen executed based on the contents of memory that was not initialized.

We recommend that you run Valgrind early and often during your development cycle. Whenever Valgrind reports any memory errors, you should stop and focus on resolving them before moving on. Make routine valgrind checks a part of your assignment process, especially when you run into bugs! A quick run through valgrind can potentially help unearth issues. In grading, we run your submission under Valgrind and look for a clean report, so be sure you are running Valgrind yourself to find these errors.

2) Code Study: strtok

(40 minutes + 10min all-lab discussion, with 5min check-in discussion in the middle)

Take about 20 minutes to work on this problem up through the questions about line 14 of the implementation. Then, there will be a brief 5min check-in discussion to see how everyone is doing. After that, you'll have another 15 minutes to continue working on the rest of the problem.

During group time, your Lab TA will be rotating around and checking in with all groups. If your group has a pressing question or clarification, you can use the Ask for Help feature in Zoom (Breakout Rooms -> Ask for Help) so that your Lab TA can prioritize helping your group next.

A common string-handling pattern is to split a string into "tokens" separated by one or more delimiting characters, such as splitting a sentence into space-separated words or dividing a file path into components delimited by slashes. The C library provides the strtok function, but its awkward design makes using the function messy and error-prone. We are going to take a look at this function and its issues as a cautionary tale. You will be writing an improved tokenize function as part of assign2 and should understand the pitfalls to avoid!

Let's first consider the client use of strtok. Acquaint yourself with its operation from the man page (man strtok), taking note of the critiques of the function's design in the BUGS section. Unlike tokenization functions you may have used in other languages, a single call to strtok does not return a nicely formatted vector of tokens. Rather, you must call strtok repeatedly, each time receiving a single token, and continue until strtok returns NULL to indicate there are no more tokens. A tokenization loop looks like this:

char *token = strtok(input, delims);
while (token != NULL) {
    printf("%s\n", token);
    token = strtok(NULL, delims);
}

The two main complaints about strtok are that it destructively modifies its input and it is restricted to managing one active tokenization at a time. Let's consider these issues separately.

Rather than create a new substring for each token, strtok overwrites the token's ending delimiter in the input string with a null char and returns a pointer to the token in-place within the input string. It effectively chops up the input string into a sequence of tokens, re-purposing the existing characters where they are, rather than copying the chars elsewhere.

If the input string originally contained "red-green-blue\0", repeated calls to strtok to split on - would change the contents of the input string to "red\0green\0blue\0". Wacky! Not only is it undesirable that a function destroy its input in this way, this design also disallows tokenizing an input that is read-only, such as a string constant. The token.c program has a sample call that passes a string constant as the input to strtok. Uncomment that call and compile and run the program to see the consequence when strtok attempts to change the string constant - ouch! What happens if you call strtok passing a string constant as the input to tokenize?

Another odd feature is that strtok keeps track of the current state of the tokenization process on behalf of future calls to the function. In the first call to strtok, the first argument is the input string to tokenize, but in subsequent calls you pass NULL as the first argument, which tells strtok to continue tokenizing the same input as the previous call, picking up where it left off. Internally, strtok preserves the original input string from the first call using a global variable (yikes!) and updates it across subsequent calls. This behavior is in stark contrast to a typical function call which operates completely independently of any previous or future calls. Not only does this make it difficult to trace the client use, it also dictates that at most one tokenization can be active at a time. The function has one saved state that is shared by all calls. Starting a second tokenization before finishing the first will delete the saved state of the first and replace it with the second.

Are you curious about how this poorly-designed strtok function is implemented? Here is code based on musl's implementation. Note that not all stylistic decisions below are ones you should emulate - we use this code mostly as a case study of how to understand core implementations.

 1    char *musl_strtok(char *s, const char *delimiters) {
 2        static char *p = NULL;
 3
 4        if (s == NULL) {
 5            s = p;
 6            if (s == NULL) {
 7                return NULL;
 8            }
 9        }
10
11        s += strspn(s, delimiters);
12        if (!*s) {
13            p = NULL;
14            return p;
15        }
16
17        p = s + strcspn(s, delimiters);
18        if (*p) {
19            *(p++) = '\0';
20        } else {
21            p = NULL;
22        }
23
24        return s;
25    }

There is lot to unpack here! It may be helpful to draw a memory diagram and trace how s and p move about during a call and/or run the token program under gdb so you can observe the code in action.

  • The wikipedia page on static variables gives some background info and an example of a static local variable. Q:: What does static do and why is it being used here? Take care to note that static variables are initialized differently than non-static local variables. Although the function opens with what looks like a declaration and initialization of a new variable p, a static variable is declared and initialized exactly once. The value of p is preserved across calls.

    A static variable is a variable whose lifetime is the entire run of the program, instead of a single run of the function in which it was declared. strtok needs a static variable because it keeps track of the state of the tokenization process for future calls to the function. It is essentially global state (bad!). However, because the static variable is declared within the function, its lifetime is the entire program execution, but its scope is still just that function. No matter how many times line 3 is executed, the static variable p is really only initialized once.

  • Q:: When NULL is passed as the first argument, how does s get its value?

    If s is NULL, it sets s, which tracks the next position to continue tokenizing, equal to p. Then, it checks if that copied value is NULL. In English, this says, "if I should continue tokenizing from earlier, but there is nothing left, return NULL".

  • Q:: What do the functions strspn and strcspn do?

    strspn returns the number of characters (i.e., bytes) of the initial segment of the first string parameter that consist entirely of characters in the second string parameter. strcspn returns the number of characters of the initial segment of the fisrt string parameter that consist entirely of characters not in the second string parameter.

  • Q: [Line 11]: How is s updated during the execution of this line?

    Line 11 uses the strspn function to advance past any delimiters at the start of the string. strspn returns the number of characters of the initial segment which consist entirely of bytes from accept. Here s += strspn(s, delimiters) adds to s the number of characters that are in delimiters, which skips the starting separators. After this, s points to the first char that is NOT in delimiters.

  • Q: [Line 12]: What does the expression *s evaluate to? What is another equivalent way to write this expression? (Perhaps this alternate way would be preferred for readability, but alas...) What does it mean when we put a ! before it?

    This conditional (albeit cryptic) checks if s is now pointing to the null terminating character. *s dereferences s, which is equivalent to s[0]. Then, !*s checks equality to zero, i.e, if it's the null terminating character (from man ascii, '\0' has ASCII value 0).Another (perhaps clearer) way to write this would have been if (s[0] == '\0') {.

  • Q: [Line 14]: In what situation does the function exit through this line?

    If s now points to the null terminating character, then this means that all characters that were in s were delimiters; the function then sets p = NULL to indicate that this tokenization is done, and then returns NULL.

Take a break here and reconvene with the rest of your lab! Your lab leader will pull everyone back together for a short 5 minute discussion about how things are going so far. Then you can get back with your group to review and complete the last questions for an additional 15 minutes.

  • Q: [Line 17] Where does p point after executing this line?

    Line 17 sets p to point to the next delimiter (or null-terminating character) found in the string, after the token sequence. Here, s + strcspn(s, delimiters); returns s moved forward by the number of characters that are all not in delimiters, which skips over the next token sequence. The expression evaluates to point to the first char that IS in delimiters (delimiter) or is a null-terminating character.

  • Q: [Line 19] Unravel the assignment here and work out how it changes the string. One quirky note about p++ is that the expression not only modifies p, but it also evaluates to the original value of p (Check out this link for more about this quirky syntax). In other words, p++ will modify p but then evaluate to p rather than the incremented value. After line (19) executes, where does p now point?

    Line 18 checks if p points to a delimiter (not null-terminating character). If it does, on line 19 it writes a null-terminating character there, and then advances one place. Note that p++ increments the pointer one space but evaluates to the original p value. This is because ++ is "post-increment", meaning the increment happens after the expression evaluation. In other words, *(p++) = '\0' is equivalent to

    *p = '\0'; // overwrite *p with null terminator
    p++;       // move p one character forward

    Otherwise, p is set to NULL to indicate that this tokenization is done. Finally, it returns s, which now points to a shorter string that is just the first token, since strtok added a new null-terminating character where it found the first delimiter.

  • Q: [Lines 19 + 21] These lines appear to be somewhat similar assignments of zero to p, yet there are critical differences in what they accomplish. What is the difference in the effect of p = NULL versus *p = '\0' versus p = ""?

    p = NULL changes a pointer,*p = '\0' changes a character, and p = "" points to a string constant with a single null terminator char. It helps to draw pictures!

[Optional] Extra Problem

Finished with lab and itching to further exercise your string skills? Check out our extra problem! (Extra problems solution)

Recap

Nice work on the lab! It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is need to solidify your knowledge. Also take a chance to reflect on what you got what from this lab and whether you feel ready for what comes next! The takeaways from lab2 should be further experience with gdb and Valgrind, learning your way around the string.h functions and manipulating C-strings as raw arrays/pointers. Here are some questions to verify your understanding and get you thinking further about these concepts:

  • In which situations does a call to strncpy null-terminate the destination string and in which does it not? What are some of the consequences that result from using a string that is not properly null-terminated?

    strncpy only null-terminates if one of the N bytes it is told to copy over is the null-terminator. Without a null terminator, string functions may behave incorrectly if they rely on knowing where the end of the string is (e.g. strlen, strcat, etc.) and may access memory that doesn't belong to that variable!

  • A program appears to run just fine, but when run under Valgrind, it reports an error. Your friend dismisses the Valgrind report as off-base and paranoid. They argue that the program runs ok as proof, and what's the big deal? Explain to them why accessing any memory that doesn't belong to you is a problem and further describe how a memory bug can lurk asymptomatically, despite being in error.

    Accessing memory that doesn't belong to you can create cryptic bugs by changing variable values without explicitly modifying such data in your code. A memory bug may "appear" correct by coincidence, such as when it is looping over uninitialized memory after a non-null-terminated string to count up its length, and coincidentally encounters a null-terminator immediately in that memory that does not belong to us. But this behavior could change across runs of the program, or on different machines.

  • Give a C-expression that reports whether a given char * has a matching prefix/suffix N chars long. Be sure to take advantage of the string.h library functions. What would happen if your expression were evaluated with a value for N that was longer the string's length?

    You could iterate over the possible prefix/suffix lengths (up to N/2) and for each one, calculate where that suffix would start and end as char *s, and use strcmp to check if they are equal.

  • How would you test whether string s is a suffix of string t? Hint: try your ideas on the following examples: s = "car", t = "care", s = "na", t = "banana"

    One possible solution:

    if (strlen(s) > strlen(t)) return false;
    return strcmp(s, t + strlen(t) - strlen(s)) == 0;
    
  • The atoi/strtol functions convert a string of digits to a decimal, but no standard function that converts in reverse. Describe a technique you could use to convert a number to a string equivalent.

    You could either a) use a printf variant that outputs to a string, or loop over each digit and convert them by using / and %.