Lab 7: Compilation tools and executables

Lab sessions Mon May 16 to Thu May 19

Lab written by Julie Zelenski

Learning goals

After this lab, you should be able to:

- describe the steps to build an executable from C code and the tasks performed by the preprocessor/compiler/assembler/linker
- interpret the symptoms of a build error and know what to do to fix it
- diagram the program address space

Find an open computer and somebody new to sit with. Introduce yourself and share war stories about your efforts to defuse your binary bomb.

Lab exercises

Get started. Clone the starter project using the command
```
hg clone /afs/ir/class/cs107/repos/lab7/shared lab7
```
This creates the lab7 directory which contains source files and a Makefile.

Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
Understanding object files. An object file is the product of the compiler/assembler translating C source into object code. There are several Unix tools that can be used to poke around in object files, such as the disassembler objdump, your old friend from disarming the bomb. Try out the commands below to see what information they provide. Each tool has a man page you can check into for further information.
- The strings command extracts text strings from a given file. From an object file, it will find string literals from the original C source and other character sequences. The way this tool works is surprisingly simple: it scans the raw file contents and prints any byte sequence of 4 or more printable characters in a row. Try strings on emacs, gcc, or your reassemble program to see what it finds.
- nm prints the symbol table from an object file. The symbol table lists all functions and global variables referenced in the object file, giving the address, status, and segment (code, data, etc.) for each symbol. The symbol table can be removed from an object file using the strip command. If you invoke nm on a system executable like emacs, it reports "no symbols" because these executables have already been stripped. Use nm and strings on one of your own executables. Now strip that executable and try again. What changes?
- readelf is a comprehensive tool for dissecting ELF files (ELF = Executable and Linking Format used by our myth machines). readelf has many flags to control which information to extract. readelf -e will dump the file header, the section header table, and the program header table, which together serve as a road map to the contents of the object file. This information is used by the OS loader to configure the address space of the new process when starting the executable.
- We think of gcc as the compiler, but technically it is a compiler driver. When you invoke gcc, it sequences together the tools to do the build. Invoking gcc with the -v flag runs verbosely so you can observe the preprocessor, compiler, assembler, and linker each getting a turn, and adding the -save-temps flag will leave the intermediate files behind so you can examine the transformation stage by stage. Try make addrspace which is set to build with these flags so you can see the full build process and poke around into the intermediate files.
  - (file.c -> file.i) The preprocessor cpp/cc1 does various text transformations on the source first. You can run just the preprocessor with gcc -E file.c.
  - (file.i -> file.s) The compiler proper cc1 parses the C code and generates assembly for it. You can use gcc -S file.c to stop after compiling and before assembling.
  - (file.s -> file.o) The assembler as converts assembly into an object file containing binary-encoded machine instructions. Running gcc -c file.c stops after assembling but before linking.
  - (file.o + file2.o + libs -> executable) The linker ld/collect2 combines multiple object files and any libraries into an executable file. Now the program is ready to execute!
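
To make the stages and the symbol table concrete, here is a minimal sketch of a source file you could push through the tools above. The file name stages.c and every identifier in it are invented for illustration; none of them come from the lab7 starter files. The nm letters in the comments are what we would expect from gcc on Linux when compiling without optimization.

```c
/* stages.c: hypothetical file, not part of the lab7 starter code.
 *
 *   gcc -E stages.c    preprocess only (SCALE below is replaced by 3)
 *   gcc -S stages.c    stop after compiling, leaves stages.s
 *   gcc -c stages.c    stop after assembling, leaves stages.o
 *   nm stages.o        list the symbols in the object file
 */
#include <string.h>

#define SCALE 3                  /* gone after preprocessing; never a symbol */

int label_width = 8;             /* nm: D, a global in the data segment */

/* nm: t, a code symbol local to this module because it is static.
 * With optimization on, the compiler may inline it away entirely. */
static int clamp(int n)
{
    return n > label_width ? label_width : n;
}

/* nm: T, a code symbol visible to other modules. */
int scaled_length(const char *s)
{
    /* strlen is defined in libc, so nm typically lists it as U
     * (undefined) until the linker resolves it. */
    return clamp(SCALE * (int)strlen(s));
}
```

Since this file has no main, it can be compiled with gcc -c but will not link into a program by itself, which is a small preview of the linking discussion later in the lab.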
The preprocessor cpp. As the first step of compilation, the preprocessor does a variety of text-based transformations such as:
- Removing comments and rearranging white space
- Handling #include by inserting the entire contents of the named file
- Expanding #define constants and macros
- Removing code based on the conditional compilation macros #ifdef/#if/#endif etc.
Read through the pre.c file and predict what it would look like after preprocessing. Run just the preprocessor using gcc -E pre.c and look at the output and verify you have the correct ideas.
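
If you want a warm-up before opening pre.c, the following sketch shows two of the transformations listed above. It is not the lab's pre.c; the macro and function names are made up. The comments describe what we would expect gcc -E to leave behind.

```c
/* Hypothetical mini-example, not the lab's pre.c. */
#define BUFSIZE 64                 /* object-like macro */
#define SQUARE(x) ((x) * (x))      /* function-like macro */

int buffer_area(void)
{
    /* After gcc -E the next line reads: return ((64) * (64)); */
    return SQUARE(BUFSIZE);
}

int verbosity(void)
{
#ifdef VERBOSE
    return 2;   /* survives preprocessing only when built with -DVERBOSE */
#else
    return 1;   /* survives preprocessing otherwise */
#endif
}
```

Note that the comments themselves are also stripped by the preprocessor, so the compiler never sees them.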

The list below identifies some of the things that might go wrong with preprocessor directives. A missing #include or wrong #define seems like a preprocessor error, but in most cases, the consequences won't show up until further downstream and it will require sleuthing to relate the symptom back to the root cause. Edit the pre.c file to create each of the problems listed below and try to build. If the build fails, when is the problem detected and by which tool (preprocessor, compiler, linker)? Is it a hard error or just a warning? If it builds despite the problem, does the program run correctly?
- typo in #define, e.g. #define MY_STRING "CS107 without closing quote
- typo in #include, e.g. #include <dstio.h>
- include the same header more than once
- missing type declaration, e.g. declare a variable of type FILE * without including stdio.h to get the typedef
- missing function prototype, e.g. call qsort without including stdlib.h to get the prototype
- missing constant define, e.g. use NULL without including stddef.h or any other header that includes stddef.h. Yes, it's true, NULL is not a C keyword, instead just a #define! What does NULL expand to after preprocessing?
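
Besides reading gcc -E output, one way to peek at a macro's expansion from inside a program is the two-level stringizing idiom. The sketch below is only an illustration; STR, XSTR, LIMIT, and the two function names are invented here. The exact text NULL expands to depends on your headers, so the code makes no claim about it.

```c
#include <stddef.h>   /* one of the headers that defines NULL */

#define STR(x)  #x        /* turn the argument's tokens into a string literal */
#define XSTR(x) STR(x)    /* expand the argument first, then stringize it */

#define LIMIT 10          /* hypothetical constant for comparison */

/* Returns the text that LIMIT expands to, which is "10". */
const char *limit_expansion(void)
{
    return XSTR(LIMIT);
}

/* Returns the text that NULL expands to on this system. */
const char *null_expansion(void)
{
    return XSTR(NULL);
}
```

Printing null_expansion() with printf("%s\n", ...) shows the same text you would find in place of NULL in the preprocessed source.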
Linking. The relationship between the compiler and linker is one of the more misunderstood aspects of the build process. The compiler operates on a single .c file at a time and produces an object file (also referred to as a relocatable file). A .o file contains compiled assembly for all the functions defined in the .c file, but it is not a full program until linked. The linker mashes together the object file(s) and system libraries, and in the process has to resolve cross-module references and relocate addresses to their final location. A key task for the linker is resolving symbols-- ensuring there is at least one and no more than one definition for each symbol name in the global namespace. The linker detects exactly two kinds of errors-- undefined symbols and multiply-defined symbols.
- One job of the linker is to relocate each symbol in the symbol table to its final address. Use nm util.o and nm main.o to view their symbol tables. In these relocatable files, the symbol addresses are all small numbers, offsets relative to start of this module. Now view nm main. In a fully-linked executable, the addresses are much larger numbers. The linker has relocated each symbol to the final address it will occupy in the executing program's address space. Run gdb on the main program and examine a few symbols with the gdb command info address symbolname to verify the addresses as written in the symbol table match the executing program. The relocation process mostly just consists of setting the base address for each module and calculating the final address for each symbol by adding its small-number offset to its module base address.
- The linker is also responsible for resolving cross-module references. First, let's understand how a cross-module reference is created. Each C source file is compiled independently. A call to any function not defined within this module is compiled using a placeholder that is passed along to the linker to later resolve. Use nm main.o and look for symbols marked with U; these are symbols referenced within main.c, but not defined there. Now disassemble with objdump -d main.o and look into the instruction sequence for the main function to find a call to one of these undefined functions. The compiler doesn't know the function's address, so it inserts a placeholder for the target address when generating the call instruction. The placeholder value is always 00 00 00 00. Compare this to the call to the average or range function, where the call instruction has the target correctly set. No placeholder is needed because the compiler knows the address of these functions, as they are defined in this module. Calls to standard library functions such as strcmp or qsort also create unresolved cross-module references. The linker is responsible for resolving all of these references. The linker joins the symbol tables of all modules/libraries being linked and verifies there are no duplicate names. It looks up the name of each unresolved reference in the combined table to retrieve the symbol's address, which is used to fill the placeholder. Disassemble with objdump -d main and find those same call instructions in the main function that you looked at earlier. In the fully-linked executable, the linker has replaced the placeholder in the call instructions with the function's address.
- A build failure due to an undefined reference generally indicates you are missing one or more modules/libraries you intended to link. To #include a module's interface and to link with the module's implementation are two independent actions, and many a novice has been tripped up by the wrong assumption that #include was enough to do both. As an example, consider the assign2 searchdir program. searchdir.c #include'd "cvector.h" to show the compiler the exported CVector features (types, constants, function prototypes). The #include is necessary to allow code in searchdir.c to make use of those features and enables the compiler to type-check function calls against their prototypes. Even with the proper #include, searchdir.o is compiled with undefined references to CVector functions. If you try to build a program out of searchdir.o without linking the cvector.o module, the link will fail because the linker cannot resolve those references. Remember that what is #include'd has no bearing on linking; it is the modules/libraries named in the link step that determine what is linked.
- A library is a collection of .o files mashed into one archive file that is linked to as a group. The header file for each module in the library details the exported features. As noted above, use of #include shows the compiler the interface of a module, but linking is required to bring in the library implementation. The standard C library libc is the archive of modules for string/stdio/stdlib/ctype/etc. The libc library is always linked by default, which explains how references to the standard library functions printf/qsort/malloc are resolved without any explicit action to link the standard library. For reasons of historical accident, the math functions (sqrt, cos, etc.) are separated into their own libm library that is not a part of libc, nor is it linked by default. A program that uses the math functions must both show the prototypes to the compiler (#include the math.h header) and tell the linker to link the math library (by adding -lm to the LDLIBS in the Makefile). Edit the main.c source to uncomment the call to sqrt and observe the build issues it creates. Make the two changes needed to correct the build problems.
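
The two linker jobs described above, relocation and symbol resolution, can be modeled in a few lines. The sketch below is a toy model only and is not how ld is implemented; the struct layout, the function name resolve, the module contents, and the addresses are all invented for illustration. It computes a final address as module base plus offset and reports the two kinds of link errors.

```c
#include <stddef.h>
#include <string.h>

/* One definition inside a module: a name and an offset from the module start. */
struct symdef {
    const char *name;
    unsigned long offset;
};

/* A module after the toy linker has chosen where to place it. */
struct module {
    const char *file;
    unsigned long base;
    const struct symdef *defs;
    size_t ndefs;
};

enum resolve_status { RESOLVED, UNDEFINED_SYMBOL, MULTIPLY_DEFINED };

/* Search every module for a definition of name. On exactly one match,
 * store base + offset in *addr; otherwise leave *addr untouched. */
enum resolve_status resolve(const struct module *mods, size_t nmods,
                            const char *name, unsigned long *addr)
{
    size_t matches = 0;
    unsigned long found = 0;

    for (size_t m = 0; m < nmods; m++) {
        for (size_t d = 0; d < mods[m].ndefs; d++) {
            if (strcmp(mods[m].defs[d].name, name) == 0) {
                found = mods[m].base + mods[m].defs[d].offset;  /* relocation */
                matches++;
            }
        }
    }
    if (matches == 0) return UNDEFINED_SYMBOL;
    if (matches > 1) return MULTIPLY_DEFINED;
    *addr = found;
    return RESOLVED;
}

/* Made-up sample data: two modules with small offsets, as nm would show
 * for relocatable files, and invented base addresses. */
static const struct symdef main_defs[] = { { "main", 0x0 }, { "helper", 0x40 } };
static const struct symdef util_defs[] = { { "swap", 0x0 }, { "total", 0x30 } };
const struct module demo_modules[] = {
    { "main.o", 0x400000, main_defs, 2 },
    { "util.o", 0x400100, util_defs, 2 },
};
```

In this model, resolving "total" against demo_modules yields 0x400130, while resolving a name that no module defines reports UNDEFINED_SYMBOL, the analogue of the undefined reference error you see when a module or library is left off the link line.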
Build for interposing. The leak detector assignment has an interesting build configuration to allow interposition of library functions. If you haven't already, make a clone of assign6 and use it to explore the different executables built from the leaky.c program.
- cd to your assign6 clone and do a make clean to remove any existing build products. This will allow you to see all the build steps run anew.
- Use make leaky to build the executable in the ordinary manner. The single file involved is leaky.c. It is first compiled into leaky.o and then linked into leaky. Use nm leaky.o to see its symbols and note the mix of defined and undefined symbols. Those undefined symbols are resolved when linking with libc (remember that libc is always linked by default).
- Use make leaky_cpp leaky_cpp.o to build the leak-detector-enhanced version using the cpp approach. Look at the build steps echoed by make to see what's different about this build from the previous one. When compiling leaky.c, it adds a header file with #defines that rename the wrapped functions. Take a look at lmalloc_client.h to see what's in there. What transformation is this applying to the leaky.c code? Use nm leaky_cpp.o to view its symbols. Compare its symbols to those in the ordinary leaky.o. What is different? leaky_cpp.o contains references to undefined symbols that are not part of libc. How are these going to be resolved when linking? The build compiles two additional files (symbols/lmalloc) and links them into the executable. Do an nm symbols.o and nm lmalloc_cpp.o to see what symbols are being provided by these support modules. A-ha, mystery resolved!
- Use make leaky_ld to build the leak-detector-enhanced version using the ld approach. Notice that it doesn't recompile leaky.c; it just uses the leaky.o object file from the earlier build. Running nm leaky.o shows references to the undefined functions malloc and calloc. If you were to link leaky.o by itself in the ordinary way (exactly as is done for the ordinary build), it would work out fine: those undefined references would be resolved by the libc versions. But notice this build links leaky with symbols and lmalloc. Try manually linking without a wrap flag using gcc leaky.o symbols.o lmalloc.o and you get a link error about undefined symbols. What does the linker think is missing? Do an nm lmalloc.o to see its symbols. What symbols does it provide? What symbols does it reference that need to be defined elsewhere? How does this relate to the link error? Look at the link line echoed by the build process to see its use of the wrap flag. If you ask the linker to wrap the function binky, it resolves every undefined reference to binky (i.e. every call from another module) to __wrap_binky instead, and resolves every undefined reference to __real_binky to the original definition of binky. How does this solve the link error?
- Consider what is different about these two approaches to interposing. The ld approach wraps every allocation call everywhere throughout the program, but not so with cpp. Which calls get wrapped by cpp? Which calls do not? What is an example program that would get different behavior under the cpp approach versus ld?
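
As a starting point for thinking about that comparison, here is a small sketch of cpp-style renaming in a single file. It does not reproduce lmalloc_client.h or the assignment's lmalloc module; counting_malloc, wrapped_calls, and demo_wrapped_calls are invented names, and a real leak detector would record far more than a count. The sketch assumes a POSIX system so that strdup is available.

```c
#define _POSIX_C_SOURCE 200809L   /* ask the headers to declare strdup */
#include <stdlib.h>
#include <string.h>

static size_t wrapped_calls = 0;  /* hypothetical bookkeeping */

/* The wrapper is defined BEFORE the macro below, so the malloc named
 * inside it is still the real library function. */
void *counting_malloc(size_t nbytes)
{
    wrapped_calls++;
    return malloc(nbytes);
}

/* From this point on, the preprocessor rewrites calls in THIS file. */
#define malloc(n) counting_malloc(n)

/* Makes one direct malloc call and one indirect allocation via strdup,
 * then reports how many calls went through the wrapper. */
size_t demo_wrapped_calls(void)
{
    wrapped_calls = 0;

    char *direct = malloc(16);       /* rewritten to counting_malloc(16) */
    char *copy = strdup("cs107");    /* allocates inside libc, which was
                                        compiled without our macro */
    free(direct);
    free(copy);
    return wrapped_calls;
}
```

The count that demo_wrapped_calls returns tells you which of the two allocations the text-level renaming was able to see, which is a useful hint for the questions above.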
Who detects what? One of the most important benefits of understanding the entire tool chain is that you are in a better position to know the right fix when you hit a build error. Below are a few common build errors. First, think through how you think each would be handled, then try making the error and building to verify your understanding is correct.
- You can't remember the right header for lfind but make a call to it anyway in your program. Does this compile? Does it link? Does it execute? Why or why not? What is affected if you decide to quiet the warning by adding your own prototype into your source? What if the prototype you hacked up is wrong -- you forget that the third argument is a size_t* and instead use a size_t in your prototype. How and when will you see a symptom of this mismatch? (As a rule, you should seek out the correct #include for a needed prototype instead of ignoring the warning or making your own)
- Your code makes calls to printf without #including the stdio.h header. Does this compile/link/execute? Why or why not? What if your code uses the global variable stdin from stdio.h without #including it? Does this compile/link/execute? Why or why not?
- Your code makes calls to math functions such as cos or sqrt without #including the math.h header nor linking the math library. Does it compile/link/execute? What changes if you only fix the #include? What changes if you only link the library?
- Your code invokes a preprocessor macro such as assert and you don't #include the header file that defines it. Does this compile/link/execute?
- You include the header cvector.h and make calls to cvec_create and the other functions declared in the header file, but don't link with the module/library containing the CVector object code. Does this compile/link/execute?
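
For reference when you try the lfind experiment, this is a sketch of a call made with the proper header in place, so the compiler can check the arguments. The helper names compare_ints and find_index are invented for illustration. The detail to notice is that the third argument is the address of the element count, not the count itself.

```c
#include <search.h>   /* declares lfind */
#include <stddef.h>

/* Comparison callback: returns 0 when the two ints are equal. */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns the index of key in arr, or -1 if it is not present. */
long find_index(const int *arr, size_t n, int key)
{
    size_t count = n;   /* lfind takes a size_t *, so we need a variable */
    const int *hit = lfind(&key, arr, &count, sizeof arr[0], compare_ints);
    return hit ? (long)(hit - arr) : -1;
}
```

With the prototype visible, passing count instead of &count is flagged by the compiler. Without it, the mistake would be passed along silently and surface only at runtime.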
Charting the address space. A program's address space is divided into segments: text (code), stack, global data, etc. The segments tend to be placed in predictable locations. Developing a feel for the address range used for each purpose can help you theorize about what has gone wrong when memory is out of whack. Run the addrspace program under gdb to answer these questions:
- Where are global variables being stored?
- Where are string constants placed? What happens if you write to a string constant?
- Where is your code positioned in memory? (i.e. find the address of main) What about the code for library functions such as printf? Are functions at the same locations for different runs of the same program? How do these addresses relate to the symbol addresses printed by nm?
- Edit the code to attempt to write to an address in the code segment. (i.e. try to write over the instructions for a function by casting the function pointer and dereferencing through it) What happens?
- Where does the stack start? Does it start at the same location for different runs of the same program? How big can the stack grow (do you remember from the stack lab?)
- Where is the heap located, i.e. addresses being returned by malloc?
- While you have the addrspace program stopped in gdb, use the gdb command info proc mappings to see the list of memory segments. Set a breakpoint at main and view the initial memory map, then view it again after executing the function that allocates gobs of heap memory. Can you identify which segment in the memory map contains the heap? Can you identify what each segment holds (e.g. your code, library code, stack, heap, global data, etc.)?
Chart the address space, label segments and note where gaps occur. Given a troublesome address, you can use this chart to identify whether the address is located within the stack/heap/global/code, which is a helpful clue when tracking down the problem. Of the entire addressable region, about what percentage appears in use?
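
If you would like a second data point besides the lab's addrspace program, the sketch below prints one sample address from each region. It is not the addrspace source; the names are invented, and the output differs between systems and possibly between runs, so no expected addresses are shown. Converting a function pointer to an integer is implementation-defined in C, which is acceptable for this kind of poking around with gcc on Linux.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int initialized_global = 107;   /* lives in the global data segment */

/* Prints one address per region and returns the number of lines printed. */
int report_addresses(void)
{
    int local = 0;                               /* on the stack */
    const char *literal = "a string constant";   /* read-only data */
    int *heap = malloc(sizeof *heap);            /* on the heap (NULL prints as 0) */
    int lines = 0;

    lines += printf("code    %#" PRIxPTR "\n", (uintptr_t)report_addresses) > 0;
    lines += printf("literal %#" PRIxPTR "\n", (uintptr_t)literal) > 0;
    lines += printf("global  %#" PRIxPTR "\n", (uintptr_t)&initialized_global) > 0;
    lines += printf("heap    %#" PRIxPTR "\n", (uintptr_t)heap) > 0;
    lines += printf("stack   %#" PRIxPTR "\n", (uintptr_t)&local) > 0;

    free(heap);
    return lines;
}
```

Call report_addresses from a main of your own, run it a few times, and compare the values against the ranges shown by gdb's memory map to see which ones move between runs.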
Optional extra exploration: preprocessor macros and inline functions. Preprocessor macros have a number of pitfalls and we strongly discourage their use in favor of inlines. However, you may encounter macros in the code of others and it can be useful to understand the mechanism and why macros can be problematic. Review the code in macro.c to see the definition and use of the macro MAX(x, y).
- One "feature" of macros is that they are type-less. Do you see how the macro can work for a variety of numeric inputs? When applied to integer arguments, the expansion uses integer operations, and applied to float arguments it uses floating point ops. That's pretty neat! But this lack of types can also lead to errors. What happens if you apply the macro to a string constant? If you were to see this symptom, would it be obvious what the root cause was?
- This first attempt at the macro is sloppy and has several issues. The program invokes the macro in various ways to trigger its problems. Compile and run the program to see its output. The problems with uses #1a and #1b can be resolved by adding parentheses around the entire macro definition, #2 requires additional parens around the arguments within the definition (The compiler does thankfully give a little warning to draw your attention), but the issue with use #3 is at odds with how macros expand. (It can be addressed, but requires some trickery we'd rather not get into).
- You can get the performance benefits of macros and avoid the drawbacks via inline functions. The inline keyword recommends a function to the compiler as a good candidate for inlining. The inline function max operates almost like a macro. Use nm macro.o and look for max: it doesn't even appear as a symbol. If you ask gdb to break at max or disassemble max, you'll discover the debugger knows nothing about it either. Disassemble main and you'll see no caller setup, no parameter passing, no call <max> instruction; instead, the instructions in the body of max were directly pasted in place of the call. Inlining avoids all the function call overhead (setting up the stack, copying parameters, transfer of control, return, etc.) at the cost of duplicating the instructions in the function's body at every call site. This micro-optimization can be appropriate for a very small function that is called repeatedly on a performance-critical path. The inline keyword is treated as "advisory": the compiler can disregard your advice and either inline what you didn't ask for or not inline what you did. In particular, gcc does not inline at all when compiling without optimization (-O0), so build with optimization enabled (e.g. -O2) if you want to see the effect. You can examine the disassembly (now that you are a superb reader of assembly!) to find out whether the compiler followed through on your recommendation to inline.
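
The pitfalls above are easier to see in a compact example. The sketch below is not the lab's macro.c and its cases may not line up with uses #1a through #3 there; MAX_SLOPPY, MAX_PAREN, max_int, and the helper functions are all invented names. It contrasts an unparenthesized macro, a fully parenthesized macro, and an inline function.

```c
#define MAX_SLOPPY(x, y)  x > y ? x : y              /* no parentheses at all */
#define MAX_PAREN(x, y)   ((x) > (y) ? (x) : (y))    /* fully parenthesized */

static inline int max_int(int x, int y)
{
    return x > y ? x : y;
}

/* Expands to 1 + a > b ? a : b, which parses as (1 + a) > b ? a : b. */
int sloppy_plus_one(int a, int b)
{
    return 1 + MAX_SLOPPY(a, b);
}

/* Expands to 1 + ((a) > (b) ? (a) : (b)), as intended. */
int paren_plus_one(int a, int b)
{
    return 1 + MAX_PAREN(a, b);
}

/* Helper with a visible side effect: counts how often it is called. */
static int bump(int *calls)
{
    (*calls)++;
    return 10;
}

/* The macro pastes its argument text twice, so bump can run twice. */
int evaluations_with_macro(void)
{
    int calls = 0;
    (void)MAX_PAREN(bump(&calls), 0);
    return calls;
}

/* A function evaluates each argument exactly once before the call. */
int evaluations_with_inline(void)
{
    int calls = 0;
    (void)max_int(bump(&calls), 0);
    return calls;
}
```

With a = 2 and b = 5, sloppy_plus_one returns 5 while paren_plus_one returns 6. Parentheses fix that precedence problem, but evaluations_with_macro still reports 2 calls to bump where evaluations_with_inline reports 1, which is the kind of issue parentheses alone cannot repair.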
Optional extra diversion: binary hacking. The loader is responsible for running an executable file by starting the new process and configuring its address space. The code and data segments of the address space consist of data directly mapped in from the executable file. The executable file contains object code along with string constants, global data, and possibly symbol and debugging information. If you are very careful, there are ways to directly edit an executable file to change its runtime behavior, for example, by directly modifying data in the segments that will be mapped in. To be clear, there is rarely legitimate cause to do this, but we can play around with binary hacking to better understand the contents of the executable and its relationship to the executing program. If you open the binary file in emacs and invoke M-x hexl-mode, emacs will act as a raw hex editor. We're going to experiment with editing the addrspace executable file.
- One simple change is to edit string constants. Adding new characters to extend the length of a string constant will cause havoc, but it is fine to overwrite existing chars with different chars (including overwriting with null characters to make the string appear shorter). Edit the executable file to change the string that says "Hello world!\n" to "Hello cs107!\n". Run your hacked executable and see the change.
- Similarly, the values for initialized global variables are directly stored in the executable and can be edited. Find the location of the initial values for the extern and static global variables by searching the addrspace executable file for their distinctive values. Find and change the value for one of the variables in order to pass the test and "win". The same trick won't work to "double-win", but there is a different binary hack that can achieve it if you're determined. What might that be?
- Now that you're warmed up, what kind of binary hacking could you devise to suppress explosions from your binary bomb?
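
The same-length rule for editing string constants can be expressed as a small routine that works on an in-memory copy of a file's bytes. This is only a sketch of the idea behind the hex-editor exercise, with an invented function name; it does not read or write any file, and it deliberately refuses a replacement of a different length because shifting the bytes that follow would invalidate the offsets recorded elsewhere in the executable.

```c
#include <stddef.h>
#include <string.h>

/* Overwrites the first occurrence of old_text in image with new_text.
 * Both strings must have the same nonzero length. Returns the byte
 * offset of the patch, or -1 if nothing was changed. */
long patch_same_length(unsigned char *image, size_t image_len,
                       const char *old_text, const char *new_text)
{
    size_t n = strlen(old_text);

    if (n == 0 || n != strlen(new_text) || n > image_len)
        return -1;

    for (size_t i = 0; i + n <= image_len; i++) {
        if (memcmp(image + i, old_text, n) == 0) {
            memcpy(image + i, new_text, n);   /* overwrite in place */
            return (long)i;
        }
    }
    return -1;
}
```

For example, patching "Hello world!\n" to "Hello cs107!\n" succeeds because both are 13 bytes long, while an attempt to patch "Hello" to "Hi" is rejected.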

Just for fun lab followups

Reverse engineering has a number of frivolous, yet entertaining, uses. Check out the kcheat project that lets you forcibly manipulate values in an executing program. How does it work? It scans the address space looking for a particular value (e.g. the current value of your "strength") to identify the right location to poke a new value into to give you the strength you deserve.
Another web tidbit, this one on creating the smallest executable file by torturing the ELF format.
You gotta love a language that inspires competition to write the most hideous, unreadable code. The winners of the Obfuscated C Contest often abuse the preprocessor as one of their tricks. Look at hague-1986 or herrmann-2001 for particularly crazy examples! The Wikipedia entry on obfuscation gives more history and background on this crazy competition.
Another ode to the nature of C is found in the Underhanded C Contest which is about writing seemingly innocent, simple C code that secretly does something evil. That's C for ya!

Check off with TA

Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with self check.