Lab 2 Solutions

Written by John Louie, with modifications by Nick Troccoli and Lisa Yan

Lecture Review

Here are key concepts relevant to this week's lab:

  • C strings are arrays of characters terminated with a null-terminating character
  • For common operations, such as comparing strings, copying strings, concatenating strings, searching strings, etc. we use string library functions in string.h.
  • C strings can be represented as pointers to characters, and we can adjust those pointers to advance through the string (e.g. make a substring that skips the first few characters)
  • it's vital that a string be properly null-terminated when creating, manipulating or copying it!
  • when passing strings as parameters, we always pass them as a char * (a pointer to the first character). For this reason, if we change characters in a string passed to a function, these changes will persist outside of the function.
  • (NOTE: covered after some labs) characters of strings can live either on the stack (modifiable) or in the data segment (read-only), depending on how they are declared.
  • Valgrind is a tool that may be able to help us track down memory errors
  • Strings are ultimately represented using pointers, which allows us to pass around their locations, use pointer arithmetic to advance through strings, and have changes to strings persist across functions.

Solutions

1) Valgrind

Learning how to use Valgrind will pay dividends throughout the quarter to diagnose memory errors! A memory error may be asymptomatic but we must make sure to not have any memory errors in our programs.

Error 1

The first error is calling strlen on an uninitialized pointer. The program should always seg fault, as the program is accessing memory at a garbage location. gdb should show what line the crash occurred on, and Valgrind should show more details about the memory issues. Here's what the Valgrind report should look like:

==10750== Use of uninitialised value of size 8
==10750==    at 0x4C30F62: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10750==    by 0x4006D3: makeError1 (buggy.c:17)
==10750==    by 0x4007F6: main (buggy.c:48)
==10750== 
==10750== Invalid read of size 1
==10750==    at 0x4C30F62: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10750==    by 0x4006D3: makeError1 (buggy.c:17)
==10750==    by 0x4007F6: main (buggy.c:40)
==10750==  Address 0x100000000 is not stack'd, malloc'd or (recently) free'd
==10750== 
==10750== 
==10750== Process terminating with default action of signal 11 (SIGSEGV)
==10750==  Access not within mapped region at address 0x100000000
==10750==    at 0x4C30F62: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10750==    by 0x4006D3: makeError1 (buggy.c:17)
==10750==    by 0x4007F6: main (buggy.c:40)
==10750==  If you believe this happened as a result of a stack
==10750==  overflow in your program's main thread (unlikely but
==10750==  possible), you can try to increase the size of the
==10750==  main thread stack using the --main-stacksize= flag.
==10750==  The main thread stack size used in this run was 8388608.

"Use of uninitialised value" refers to the uninitialized pointer (of size 8!). "Invalid read of size 1" refers to strlen reading a byte of memory that the program does not own. "Access not within mapped region at address" also refers to this.

Error 2

Q1: Repeat the above steps for ./buggy 2. Note that whereas error #1 fails in every context, error #2 is less obvious to the naked eye - but Valgrind can help! What does it tell us about the error that occurs?

A1: The second error is calling strlen on a string without a null-terminating character. This may output various different lengths that may be correct or incorrect, as strlen is reading uninitialized memory in search of a null-terminating character. GDB does not help here, as the program executes normally.

Here's what the Valgrind report should look like:

==11338== Conditional jump or move depends on uninitialised value(s)
==11338==    at 0x4C30F78: strlen (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==11338==    by 0x400722: makeError2 (buggy.c:25)
==11338==    by 0x400808: main (buggy.c:42)

"Conditional jump or move depends on uninitialised value(s)" here means that the code in strlen executed based on the contents of memory that was not initialized.

Q2: Once you are done examining these bugs, answer the following question as a group: What string-handling error is likely to be the root cause of a Valgrind report that "Conditional jump or move depends on uninitialised value(s)"? This is an error that may occur for you on assign2, so it's vital you understand the causes now!

A2: This means that you are reading into uninitialized/garbage memory and using those values to affect execution. E.g. calling strlen on an uninitialized char array, or on a string missing a null terminator.

2) Code Study: strtok

Q3: Try uncommenting line 26 - what happens? Ouch!

A3: Crash! strtok tries to modify a read-only string, which it's not allowed to do (we cannot modify characters in the data segment), so we get a segmentation fault.

This function keeps a static variable - a static variable is a variable whose lifetime is the entire run of the program, instead of a single run of the function in which it was declared. strtok needs a static variable because it keeps track of the state of the tokenization process for future calls to the function. It is essentially global state (bad!). However, because the static variable is declared within the function, its lifetime is the entire program execution, but its scope is still just that function. No matter how many times line 3 is executed, the static variable p is really only initialized once. (Additional details: if there is no explicit initialization, global/static variables default to zero-ed. Also, the initial value must be a constant expression because the initialization is processed at compile-time.)

Q4: [Lines 4-9] When NULL is passed as the first argument, how does s get its value?

A4: Lines 4-9 check if there is an existing tokenization to continue. If s is not NULL, it always tokenizes that string. If s is NULL, it sets s, which tracks the next position to continue tokenizing, equal to p. Then, it checks if that copied value is NULL. In English, this says, "if I should continue tokenizing from earlier, but there is nothing left, return NULL".

Q5: [Line 11] How is s updated during the execution of this line?

A5: Line 11 uses the strspn function to advance past any delimiters at the start of the string. strspn returns the number of characters of the initial segment which consist entirely of bytes from accept. Here s += strspn(s, delimiters) adds to s the number of characters that are in delimiters, which skips the starting separators. After this, s points to the first char that is NOT in delimiters.

Q6: [Line 12] What does the expression *s evaluate to? What is another equivalent way to write this expression? (Perhaps this alternate way would be preferred for readability, but alas...)

A6: *s gets the first character of s; an alternate way to write this is s[0].

Q7: [Line 14] In what situation does the function exit through this line?

A7: Lines 12-15 check if s now points to the null terminating character, meaning that all characters that were in the input were delimiters. If this is true, it sets p = NULL to indicate that this tokenization is done, and then returns NULL.

Q8: [Line 17] Where does p point after executing this line?

A8: Line 17 sets p to point to the next delimiter (or null-terminating character) found in the string, after the token sequence. strcspn returns the number of characters of the initial segment which consist entirely of characters NOT from reject. Here s + strcspn(s, delimiters); returns s moved forward by the number of characters that are all not in delimiters, which skips over the next token sequence. The expression evaluates to point to the first char that IS in delimiters (delimiter) or is a null-terminating character.

Q9: [Lines 19 + 22] These lines appear to be somewhat similar assignments of zero to p, yet there are critical differences in what they accomplish. What is the difference in the effect of p = NULL versus *p = '\0' versus p = ""?

A9: Line 18 checks if p points to a delimiter (not null-terminating character). If it does, on line 19 it writes a null-terminating character there, and then on line 20 advances one place. Otherwise, p is set to NULL to indicate that this tokenization is done. Finally, it returns s, which now points to a shorter string that is just the first token, since strtok added a new null-terminating character where it found the first delimiter. Note that NULL is a pointer (specifically a pointer with numeric value 0), \0 is a character (a character with numeric value 0) and "" is the empty string (a string consisting of just a null terminator). Therefore, p = NULL changes a pointer,*p = '\0' changes a character, and p = "" points to a string constant with a single null terminator char. Drawing pictures can also help to differentiate between these.

Checkoff Answers

  • string.h: How would you test whether string s is a suffix of string t?
if (strlen(s) > strlen(t)) return false;
return strcmp(s, t + strlen(t) - strlen(s)) == 0;

Initially, it may seem like strstr is a good fit to solve this, but that may cause problems. As an example, say s = "car", t = "care" (strstr would find a match, even though it's at the beginning), or s = "na", t = "banana" (strstr would find a match, but not the one we want).

  1. Valgrind: What string-handling error is likely to be the root cause of a Valgrind report that "Conditional jump or move depends on uninitialised value(s)"? This means that you are reading into uninitialized/garbage memory and using those values to affect execution. E.g. calling strlen on an uninitialized char array, string missing null terminator.
  2. C-strings: What happens if you call strtok passing a string constant as the input to tokenize? Segfault (protection violation) when assigning to p[0]
  3. strtok: strtok uses both span functions, strspn and strcspn. How does it use each of them when tokenizing a string? It uses strspn to advance past any initial delimiters. It uses strcspn to find the delimiters that mark the end of the current token.

Lab-End Thought Questions

  • In which situations does a call to strncpy null-terminate the destination string and in which does it not? What are some of the consequences that result from using a string that is not properly null-terminated? strncpy only null-terminates if one of the N bytes it is told to copy over is the null-terminator. Without a null terminator, string functions may behave incorrectly if they rely on knowing where the end of the string is (e.g. strlen, strcat, etc.) and may access memory that doesn't belong to that variable!
  • A program appears to run just fine, but when run under Valgrind, it reports an error. Your friend dismisses the Valgrind report as off-base and paranoid. They argue that the program runs ok as proof, and what's the big deal? Explain to them why accessing any memory that doesn't belong to you is a problem and further describe how a memory bug can lurk asymptomatically, despite being in error. Accessing memory that doesn't belong to you can create cryptic bugs by changing variable values without explicitly modifying such data in your code. A memory bug may "appear" correct by coincidence, such as when it is looping over uninitialized memory after a non-null-terminated string to count up its length, and coincidentally encounters a null-terminator immediately in that memory that does not belong to us. But this behavior could change across runs of the program, or on different machines.
  • Give a C-expression that reports whether a given char* has a matching prefix/suffix N chars long. Be sure to take advantage of the string.h library functions. What would happen if your expression were evaluated with a value for N that was longer the string's length? You could iterate over the possible prefix/suffix lengths (up to N/2) and for each one, calculate where that suffix would start and end as char *s, and use strcmp to check if they are equal.
  • How would you test whether string s is a suffix of string t? Hint: try your ideas on the following examples: s = "car", t = "care", s = "na", t = "banana" See checkoff question solution
  • The atoi/strtol functions convert a string of digits to a decimal, but no standard function that converts in reverse. Describe a technique you could use to convert a number to a string equivalent. You could either a) use a printf variant that outputs to a string, or loop over each digit and convert them by using / and %.

[Optional] Extra Problems

atoi

Q1: [Line 2] What does the expression *s evaluate to? What is another equivalent way to write this expression? (Perhaps this alternate way would be preferred for readability, but alas...)

A1: *s gets the first character of s; an alternate way to write this is s[0].

Q2: [Line 3] What does the expressions++ do to s?

A2: This advances s to point to the next character in the string. This loop is advancing the s pointer past any space characters at the start of the string.

Q3: [Lines 6-12] If the string begins with a leading minus, at what point does it advance s past that char? This is subtle! (The fallthrough behavior of a switch statement can be surprising and often triggered by accident, so a comment noting there is intentionally no break statement would probably have been a good idea, but alas...)

A3: This switch statement is looking for a possible leading sign character (+/-). If the string begins with a leading minus, it will execute both cases because it "falls through" to the second one. If it finds a "-" sign, it updates the boolean to reflect this, and also increments s to advance past that character. Switch statement cases "fall through" by default, unless they hit a break statement, and continue executing the case code below them. If it finds a "+" sign, it just increments s to advance past that character.

Q4: The code has two loops that advance along the string, but neither has an explicit test to stop at the null terminator. Why is this not necessary?

A4: The isspace and isdigit functions return false when the parameter value is the null terminator, so loops will already stop correctly.

Q5: How is a single digit character converted to its numeric value? How is that value combined with the number being built up so far?

A5: Subtracting '0' gives us the numeric value of a character digit, because ASCII digits are numbered sequentially. This code builds up the numeric value, after initializing it to zero. It loops over each character in the remaining string as long as the character is numeric; as soon as it finds a non-digit, the function stops and returns whatever value it has constructed so far. It constructs the number one digit at a time; it sets number's new value to be its previous value times 10, minus the new digit. Note that the number is built up as a negative number and then potentially negated at the end.

Q6: [Line 17] One quirky note about s++ is that the expression not only modifies s, but it also evaluates to the original value of s (Check out this link for more about this quirky syntax). In other words, s++ will modify s but then evaluate to s rather than the incremented value. How is this used to make the string processing more concise?

A6: In the expression number = 10 * number - (*(s++) - '0');, s++ increments the pointer one space but evaluates as a whole to the character s originally points to. This allows us to advance the string pointer in the same expression where we are accessing the current character.

Q7: [Lines 15-20] The loop builds up n as a negative value and later negates it. The comment indicates this choice avoids overflow on INT_MIN. Why does INT_MIN seem to necessitate its own special case? Why does atoi construct a numeric value as a negative number instead of a positive number?

A7: The number is built up as a negative number because, if the user wants to convert INT_MIN, that cannot be represented as a positive number, because it would overflow. Conversely, every positive number has a representable negative equivalent. That being said, technically it will still work even without the special case, since the overflow will result in INT_MIN and -INT_MIN is still INT_MIN. But this special case avoids this overflow behavior.

Q8: What happens when the input string contains a non-digit char? What if that non-digit char is the very first character?

A8: If it's a space, atoi will move to the next character. If it's any other character, number is initialized at 0, and so atoi will return 0. This is because in the digit-processing loop, as soon as it finds a non-digit, the loop stops and returns whatever value it has constructed so far (which is initially 0).

Q9: [Line 20] Note that the ternary expression (with the question mark and colon) is evaluated using the form "CONDITION ? IF_TRUE : IF_FALSE". Try applying this to interpret the return statement above.

A9: Another way to read this is "if negative is true, return number; otherwise, return -number."

Q10: What about an input that is outside the representable range of integer? If the string is a number larger than INT_MAX, will it overflow, saturate, return 0, raise an error, or do something else entirely?

A10: It overflows in constructing the value number in the while loop, but otherwise doesn't crash or raise an error.