Lab 2: C Strings

Lab sessions Tue Oct 05 to Thu Oct 07

Lab written by Julie Zelenski, with modifications by Nick Troccoli

Lab Overview

For the first 10-20 minutes, you'll chat and review concepts as a group. You'll then split off into groups to start working on the first problem. After each problem, you'll pull back together to discuss what you found and what questions you have, and then return to your group to work on the next problem.

Learning Goals

During this lab, you will:

read and analyze C code that operates on chars and C-strings
use gdb and Valgrind to detect and debug memory errors

Group Work Tips

We hope you can get to know your labmate(s) over the course of the quarter! We encourage you to take some time at the start to chat and see how things are going. How is the current assignment? What general tips or tricks have you learned while working? Any favorite tunes you listen to while working?

To help your group get the most out of the exercises, we recommend reviewing key takeaways in the last few minutes of group time: before the full-lab discussion, brainstorm as a group about key takeaways and questions you have for the full discussion. It's ok if you don't have time to get through everything!

Get Started

Clone the lab starter code by using the command below. This command creates a lab2 directory containing the project files.

git clone /afs/ir/class/cs107/repos/lab2/shared lab2

Next, pull up the online lab checkoff and have it open in a browser so you can jot things down as you go. Only one checkoff needs to submitted for both you and your partner.

Standard C Library Functions

The term "standard library" refers to the collection of functions that are packaged with a programming language. Our guide to stdlib gives an overview of the functions available in C's standard library.

Many programmers treat the system libraries as a "black box" -- using the functions as-is, with only a vague idea of how they are implemented. Given our goal to truly understand the inner workings of the system, we are going to go deeper.

To that end, some of our code study exercises will dig into library code. There is much to learn from studying industrial-strength code and reflecting on the tradeoffs in design, readability, and efficiency. You will also become a better-informed client, with deep knowledge of the proper use of these functions and the pitfalls and misuses to avoid.

This knowledge may also embolden you to consider writing your own version of isdigit or strncpy into your programs, but the best practice is to prefer using the existing library functions whenever possible. These versions are already written, fully debugged, and highly tuned -- what's not to like? We will study these functions, learn from the code, and critique them, but also be grateful to later just call upon them when needed.

Exercises

1) Valgrind

(20 minutes + 10min all-lab discussion)

Learning Goal: learn how to use Valgrind output to identify the code location and root cause of memory errors in programs. Valgrind is an essential tool on assignments going forward!
Starter code file: buggy.c

Each lab will devote time to hands-on tools practice. Last week, it was sanitycheck and gdb. This week, we introduce Valgrind.

When programming with pointers, our programs become more at risk for memory errors, which can be difficult bugs to track down. Valgrind is a critical tool that can help detect memory errors. The terminology and reporting that Valgrind uses for errors can be difficult to understand, and it takes practice. We have a guide to valgrind in the handouts tab that you can reference as you need. Let's try using it to fix a program.

Read over buggy.c. It contains two memory errors. Let's start with error #1.
Run ./buggy 1 and you should see Segmentation fault (core dumped). A segfault results from an attempt to read/write an inaccessible/invalid address. Ok, so now you know your program has a memory error but not much else.
Run the program under gdb. You observe the same segfault, but now try using the gdb backtrace command to learn where execution was at the time of the crash. That's helpful, but we want more!
Exit gdb and run the program under Valgrind: valgrind ./buggy 1. The Valgrind report gives more detail including the type of memory error, the execution call stack, and the value of the invalid address. Furthermore, Valgrind also detected an earlier problem that preceded the seg fault -- use of uninitialized value of size 8 (sizeof pointer) right before the invalid read of size 1 (char). This is the clue that can lead you to the source of the error!

Repeat the above steps for ./buggy 2. Note that whereas error #1 fails in every context, error #2 is less obvious to the naked eye - but Valgrind can help!

The terminology and reporting that Valgrind uses for errors can be difficult to understand, and it takes practice. Your goal is to learn how to relate the error as reported by Valgrind back to the location in the code and root cause.

Memory errors are tricky and can vary in their observed consequences. buggy 1 crashes with a seg fault every time. buggy 2 produces output that is sometimes correct and other times not, but it does not crash. A memory error may cause no visible harm, but that doesn't mean the program is correct, it just "got lucky" this time. Running Valgrind can detect these errors that are lying in wait, even before they produce an observable effect.

Once you are done examining these bugs, answer the following question as a group: What string-handling error is likely to be the root cause of a Valgrind report that "Conditional jump or move depends on uninitialised value(s)"? This is an error that may occur for you on assign2, so it's vital you understand the causes now!

You should run Valgrind early and often during your development cycle. Whenever Valgrind reports any memory errors, you should stop and focus on resolving them before moving on. Make routine valgrind checks a part of your assignment process, especially when you run into bugs! A quick run through valgrind can potentially help unearth issues. In grading, we run your submission under Valgrind and look for a clean report, so be sure you are running Valgrind yourself to find these errors.

2) Code Study: `strtok`

(30 minutes + 10min all-lab discussion)

Learning Goal: understand the implementation of the strtok library function so that you can implement an improved version of it on assign2.
Starter code file: token.c

A common string-handling pattern is to split a string into "tokens" separated by one or more delimiting characters, such as splitting a sentence into space-separated words or dividing a file path into components delimited by slashes. The C library provides the strtok function, but its awkward design makes using the function messy and error-prone. We are going to examine this function and its issues as a cautionary tale. You will be writing an improved tokenize function as part of assign2 and should understand the pitfalls to avoid! We will first understand how to use the function; then we will look at how it is implemented.

Using `strtok` (10 minutes)

Unfortunately, strtok does not just return a nicely formatted list of tokens. Rather, you must call strtok repeatedly, each time receiving a single token, and continue until strtok returns NULL to indicate there are no more tokens. Click the link below to open a CPlayground where you can edit and run the sample code right in the window.

Open CPlayground

Design Issue #1: strtok modifies the passed-in input. Rather than create a new substring for each token, strtok overwrites the token's ending delimiter in the input string with a null terminating character and returns a pointer to the token in-place within the input string. It effectively chops up the input string into a sequence of tokens, re-purposing the existing characters. In the CPlayground example, all of the -s in input are replaced with null terminators! This is not only undesirable, but it means we cannot pass in read-only strings (like string constants) to tokenize. Try uncommenting line 26 and see what happens - ouch!

Design Issue #2: strtok can only manage one active tokenization at a time. The first time you call strtok, you specify the input to tokenize as the first argument, but in subsequent calls you pass NULL as the first argument, which tells strtok to continue tokenizing the same input as the previous call. Internally, strtok preserves the original input string from the first call using a global variable (yikes!) that is shared by all calls, and updates it across subsequent calls. This is unlike most funtions, where each call is independent. This means only 1 tokenization can be active at a time; starting a second tokenization before finishing the first will delete the saved state of the first and replace it with the second.

`strtok` Implementation (20 minutes)

Are you curious about how this poorly-designed strtok function is implemented? Here is code based on musl's implementation. Note that not all stylistic decisions below are ones you should emulate - we use this code mostly as a case study of how to understand core implementations.

 1    char *musl_strtok(char *s, const char *delimiters) {
 2        static char *p = NULL;
 3
 4        if (s == NULL) {
 5            s = p;
 6            if (s == NULL) {
 7                return NULL;
 8            }
 9        }
10
11        s += strspn(s, delimiters);
12        if (*s == '\0') {
13            p = NULL;
14            return p;
15        }
16
17        p = s + strcspn(s, delimiters);
18        if (p[0] != '\0') {
19            p[0] = '\0';
20            p++;
21        } else {
22            p = NULL;
23        }
24
25        return s;
26    }

There is lot to unpack here! It may be helpful to draw a memory diagram and trace how s and p move about during a call and/or run the token program under gdb so you can observe the code in action. Let's work through it.

Line 2 declares a static local variable. It does not go away when the function finishes, and is accessible only by this function. Line 2 executes only once, no matter how many times musl_strtok is called, but p's value is saved across calls.
Q: [Lines 4-9] When NULL is passed as the first argument, how does s get its value?
Q: [Line 11] How is s updated during the execution of this line?
Q: [Line 12] What does the expression *s evaluate to? What is another equivalent way to write this expression? (Perhaps this alternate way would be preferred for readability, but alas...)
Q: [Line 14] In what situation does the function exit through this line?
Q: [Line 17] Where does p point after executing this line?
Q: [Lines 19 + 22] These lines appear to be somewhat similar assignments of zero to p, yet there are critical differences in what they accomplish. What is the difference in the effect of p = NULL versus *p = '\0' versus p = ""?

[Optional] Extra Problem

Finished with lab and itching to further exercise your string skills? Check out our extra problem!

Recap

Nice work on the lab! It's okay if you don't completely finish all of the exercises during lab; your sincere participation for the full lab period is sufficient for credit. However, we highly encourage you to finish on your own whatever is need to solidify your knowledge. Also take a chance to reflect on what you got what from this lab and whether you feel ready for what comes next! The takeaways from lab2 should be further experience with gdb and Valgrind, learning your way around the string.h functions and manipulating C-strings as raw arrays/pointers. Here are some questions to verify your understanding and get you thinking further about these concepts:

In which situations does a call to strncpy null-terminate the destination string and in which does it not? What are some of the consequences that result from using a string that is not properly null-terminated?
A program appears to run just fine, but when run under Valgrind, it reports an error. Your friend dismisses the Valgrind report as off-base and paranoid. They argue that the program runs ok as proof, and what's the big deal? Explain to them why accessing any memory that doesn't belong to you is a problem and further describe how a memory bug can lurk asymptomatically, despite being in error.
Give a C-expression that reports whether a given char * has a matching prefix/suffix N chars long. Be sure to take advantage of the string.h library functions. What would happen if your expression were evaluated with a value for N that was longer the string's length?
How would you test whether string s is a suffix of string t? Hint: try your ideas on the following examples: s = "car", t = "care", s = "na", t = "banana"
The atoi/strtol functions convert a string of digits to a decimal, but no standard function that converts in reverse. Describe a technique you could use to convert a number to a string equivalent.