Lab sessions Mon Feb 27 to Thu Mar 02
Lab written by Julie Zelenski
Learning goals
After this lab, you should be able to:
- describe the steps to build an executable from C code and tasks performed by the preprocessor/compiler/assembler/linker
- interpret the symptoms of a build error and what to do to fix it
- diagram the program address space
Find an open computer and somebody new to sit with. Introduce yourself and share war stories about your efforts to defuse your binary bomb.
Lab exercises
1. Get started.
Clone the starter project using the command
hg clone /afs/ir/class/cs107/repos/lab7/shared lab7
This creates the lab7 directory which contains source files and a Makefile.
Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
2. Understanding object files.
An object file is the product of the compiler/assembler translating C source into object code. There are several Unix tools that can be used to poke around in object files, such as the disassembler objdump, your old friend from disarming the bomb. Try out the commands below to see what information they provide. Each tool has a man page you can check into for further information.
- The strings command extracts text strings from a given file. From an object file, it will find string literals from the original C source and other character sequences. The way this tool works is surprisingly simple: it scans the raw file contents and prints any byte sequence of 4 or more printable characters in a row. Try strings on emacs, gcc, or your reassemble program to see what it finds.
- nm prints the symbol table from an object file. The symbol table lists all functions and global variables referenced in the object file, giving the address, status, and segment (code, data, etc.) for each symbol. The symbol table can be removed from an object file using the strip command. If you invoke nm on a system executable like emacs, it reports "no symbols" because these executables have already been stripped. Use nm and strings on one of your own executables. Now strip that executable and try again. What changes?
- readelf is a comprehensive tool for dissecting ELF files (ELF = Executable and Linking Format used by our myth machines). readelf has many flags to control which information to extract. readelf -e will dump the file header, the section header table, and the program header table, which together serve as a road map to the contents of the object file. This information is used by the OS loader to configure the address space of the new process when starting the executable.
- We think of gcc as the compiler, but technically it is a compiler driver. When you invoke gcc, it sequences together the tools to do the build. Invoking gcc with the -v flag runs verbosely so you can observe the preprocessor, compiler, assembler, and linker each getting a turn, and adding the -save-temps flag will leave the intermediate files behind so you can examine the transformation stage by stage. Try make addrspace which is set to build with these flags so you can see the full build process and poke around in the intermediate files.
  - (file.c -> file.i) The preprocessor (cpp/cc1) does various text transformations on the source first. You can run just the preprocessor with gcc -E file.c.
  - (file.i -> file.s) The compiler (cc1) parses the C code and generates assembly for it. You can use gcc -S file.c to stop after compiling and before assembling.
  - (file.s -> file.o) The assembler (as) converts assembly into an object file containing binary-encoded machine instructions. Running gcc -c file.c stops after assembling but before linking.
  - (file.o + file2.o + libs -> executable) The linker (ld/collect2) combines multiple object files and any libraries into an executable file. Now the program is ready to execute!
3. The preprocessor cpp.
As the first step of compilation, the preprocessor does a variety of text-based transformations such as:
- Removing comments and rearranging white space
- Handling #include by inserting the entire contents of the named file
- Expanding #define constants and macros
- Removing code based on the conditional compilation macros #ifdef/#if/#endif etc.
Read through the pre.c file and predict what it would look like after preprocessing. Run just the preprocessor using gcc -E pre.c and look at the output and verify you have the correct ideas.
The list below identifies some of the things that might go wrong with preprocessor directives. A missing #include or wrong #define seems like a preprocessor error, but in most cases, the consequences won't show up until further downstream and it will require sleuthing to relate the symptom back to the root cause. Edit the pre.c file to create each of the problems listed below and try to build. If the build fails, when is the problem detected and by which tool (preprocessor, compiler, linker)? Is it a hard error or just a warning? If it builds despite the problem, does the program run correctly?
- typo in #define, e.g. #define MY_STRING "CS107 without closing quote
- typo in #include, e.g. #include <dstio.h>
- include the same header more than once
- missing type declaration, e.g. declare a variable of type FILE * without including stdio.h to get the typedef
- missing function prototype, e.g. call qsort without including stdlib.h to get the prototype
- missing constant define, e.g. use NULL without including stddef.h or any other header that includes stddef.h. Yes, it's true, NULL is not a C keyword, instead just a #define! What does NULL expand to after preprocessing?
4. Linking.
The relationship between the compiler and linker is one of the more misunderstood aspects of the build process. The compiler operates on a single .c file at a time and produces an object file (also referred to as a relocatable file). A .o file contains compiled assembly for all the functions defined in the .c file, but it is not a full program until linked. The linker mashes together the object file(s) and system libraries, and in the process has to resolve cross-module references and relocate addresses to their final location. A key task for the linker is resolving symbols-- ensuring there is at least one and no more than one definition for each symbol name in the global namespace. The linker detects exactly two kinds of errors-- undefined symbols and multiply-defined symbols.
- One job of the linker is to relocate each symbol in the symbol table to its final address. Use nm util.o and nm main.o to view their symbol tables. In these relocatable files, the symbol addresses are all small numbers, offsets relative to the start of this module. Now view nm main. In a fully-linked executable, the addresses are much larger numbers. The linker has relocated each symbol to the final address it will occupy in the executing program's address space. Run gdb on the main program and examine a few symbols with the gdb command info address symbolname to verify the addresses as written in the symbol table match the executing program. The relocation process mostly just consists of setting the base address for each module and calculating the final address for each symbol by adding its small-number offset to its module base address.
- The linker is also responsible for resolving cross-module references. First, let's understand how a cross-module reference is created. Each C source file is compiled independently. A call to any function not defined within this module is compiled using a placeholder that is passed along to the linker to later resolve. Use nm main.o and look for symbols marked with U; these are symbols referenced within main.c, but not defined there. Now disassemble with objdump -d main.o and look into the instruction sequence for the main function to find a call to one of these undefined functions. The compiler doesn't know the function's address, so it inserts a placeholder for the target address when generating the call instruction. The placeholder value is always 00 00 00 00. Compare this to the call to the average or range function where the call instruction has the target correctly set, no placeholder needed as the compiler knows the address of these functions because they are defined in this module. Calls to standard library functions such as strcmp or qsort also create unresolved cross-module references. The linker is responsible for resolving all of these references. The linker joins the symbol tables of all modules/libraries being linked and verifies there are no duplicate names. It looks up the name of each unresolved reference in the combined table to retrieve the symbol's address which is used to fill the placeholder. Disassemble with objdump -d main and find those same call instructions in the main function that you looked at earlier. In the fully-linked executable, the linker has replaced the placeholder in the call instructions with the function's address.
- A build failure due to an undefined reference generally indicates you are missing one or more modules/libraries you intended to link. To #include a module's interface and to link with the module's implementation are two independent actions, and many a novice has been tripped by the wrong assumption that #include was enough to do both. As an example, consider the assign2 searchdir program. searchdir.c #include'd "cvector.h" to show the compiler the exported CVector features (types, constants, function prototypes). The #include is necessary to allow code in searchdir.c to make use of those features and enables the compiler to type-check function calls against their prototypes. Even with the proper #include, searchdir.o is compiled with undefined references to CVector functions. If you try to build a program out of searchdir.o without linking the cvector.o module, the link will fail because those references cannot be resolved. Remember that what is #include'd has no bearing on linking; it is those modules/libraries named in the link step that determine what is linked.
- A library is a collection of .o files mashed into one archive file that is linked to as a group. The header file for each module in the library details the exported features. As noted above, use of #include shows the compiler the interface of a module, but linking is required to bring in the library implementation. The standard C library libc is the archive of modules for string/stdio/stdlib/ctype/etc. The libc library is always linked by default, which explains how references to the standard library functions printf/qsort/malloc are resolved without any explicit action to link the standard library. For reasons of historical accident, the math functions (sqrt, cos, etc.) are separated into their own libm library that is not a part of libc nor is it linked by default. A program that uses the math functions must both show the prototypes to the compiler (#include the math.h header) and tell the linker to link the math library (by adding -lm to the LDLIBS in Makefile). Edit the main.c source to uncomment the call to sqrt and observe the build issues it creates. Make the two changes needed to correct the build problems.
5. Who detects what?
One of the most important benefits of understanding the entire tool chain is that you are in a better position to know the right fix when you hit a build error. Below are a few common build errors. First, think through how each would be handled, then try making the error and building to verify your understanding is correct.
- You can't remember the right header for lfind but make a call to it anyway in your program. Does this compile? Does it link? Does it execute? Why or why not? What is affected if you decide to quiet the warning by adding your own prototype into your source? What if the prototype you hacked up is wrong -- you forget that the third argument is a size_t * and instead use a size_t in your prototype. How and when will you see a symptom of this mismatch? (As a rule, you should seek out the correct #include for a needed prototype instead of ignoring the warning or making your own.)
- Your code makes calls to printf without #including the stdio.h header. Does this compile/link/execute? Why or why not? What if your code uses the global variable stdin from stdio.h without #including it? Does this compile/link/execute? Why or why not?
- Your code makes calls to math functions such as cos or sqrt without #including the math.h header or linking the math library. Does it compile/link/execute? What changes if you only fix the #include? What changes if you only link the library?
- Your code invokes a preprocessor macro such as assert and you don't #include the header file that defines it. Does this compile/link/execute?
- You include the header cvector.h and make calls to cvec_create and the other functions declared in the header file, but don't link with the module/library containing the CVector object code. Does this compile/link/execute?
6. Charting the address space.
A program's address space is divided into segments: text (code), stack, global data, etc. The segments tend to be placed in predictable locations. Developing a feel for the address range used for each purpose can help you theorize about what has gone wrong when memory is out of whack. Run the addrspace program under gdb to answer these questions:
- Where are global variables being stored?
- Where are string constants placed? What happens if you write to a string constant?
- Where is your code positioned in memory? (i.e. find the address of main) What about the code for library functions such as printf? Are functions at the same locations for different runs of the same program? How do these addresses relate to the symbol addresses printed by nm?
- Edit the code to attempt to write to an address in the code segment. (i.e. try to write over the instructions for a function by casting the function pointer and dereferencing through it) What happens?
- Where does the stack start? Does it start at the same location for different runs of the same program? How big can the stack grow (do you remember from the stack lab?)
- Where is the heap located, i.e. addresses being returned by malloc?
- While you have the addrspace program stopped in gdb, use the gdb command info proc mappings to see the list of memory segments. Set a breakpoint at main and view the initial memory map, then view it again after executing the function that allocates gobs of heap memory. Can you identify which segment in the memory map contains the heap? Can you identify what each segment holds (e.g. either your code, library code, stack, heap, global data, etc.)?
Chart the address space, label segments and note where gaps occur. Given a troublesome address, you can use this chart to identify whether the address is located within the stack/heap/global/code, which is a helpful clue when tracking down the problem. Of the entire addressable region, about what percentage appears in use?
7. Optional extra exploration: preprocessor macros and inline functions
Preprocessor macros have a number of pitfalls and we strongly discourage their use in favor of inlines. However, you may encounter macros in the code of others and it can be useful to understand the mechanism and why macros can be problematic. Review the code in macro.c to see the definition and use of the macro MAX(x, y).
- One "feature" of macros is that they are type-less. Do you see how the macro can work for a variety of numeric inputs? When applied to integer arguments, the expansion uses integer operations, and applied to float arguments it uses floating point ops. That's pretty neat! But this lack of types can also lead to errors. What happens if you apply the macro to a string constant? If you were to see this symptom, would it be obvious what the root cause was?
- This first attempt at the macro is sloppy and has several issues. The program invokes the macro in various ways to trigger its problems. Compile and run the program to see its output. The problems with uses #1a and #1b can be resolved by adding parentheses around the entire macro definition, #2 requires additional parens around the arguments within the definition (the compiler does thankfully give a little warning to draw your attention), but the issue with use #3 is at odds with how macros expand. (It can be addressed, but requires some trickery we'd rather not get into.)
- You can get the performance benefits of macros and avoid the drawbacks via inline functions. The inline keyword recommends a function to the compiler as a good candidate for inlining. The inline function max operates almost like a macro. Use nm macro.o and look for max -- it doesn't even appear as a symbol. If you ask gdb to break at max or disassemble max, you'll discover the debugger knows nothing about it either. Disassemble main and you'll see no caller setup, no parameter passing, no call <max> instruction; instead, the instructions in the body of max were directly pasted in place of the call. Inlining avoids all the function call overhead (setting up stack, copying parameters to stack, transfer of control, return, etc.) at the cost of duplicating the instructions in the function's body at every calling site. This micro-optimization can be appropriate for a very small function that is called repeatedly on a performance-critical path. Adding the inline keyword is treated as "advisory" -- the compiler can disregard your advice and either inline what you didn't ask for or not inline what you did. In particular, gcc does no inlining when compiling without optimization (-O0); you need to build at a higher optimization level such as -O2. You can examine the disassembly (now that you are a superb reader of assembly!) to find out whether the compiler followed through on your recommendation to inline.
8. Optional extra diversion: binary hacking
The loader is responsible for running an executable file by starting the new process and configuring its address space. The code and data segments of the address space consist of data directly mapped in from the executable file. The executable file contains object code along with string constants, global data, and possibly symbol and debugging information. If you are very careful, there are ways to directly edit an executable file to change its runtime behavior, for example, by directly modifying data in the segments that will be mapped in. To be clear, there is rarely legitimate cause to do this, but we can play around with binary hacking to better understand the contents of the executable and its relationship to the executing program. If you open the binary file in emacs and invoke M-x hexl-mode, emacs will act as a raw hex editor. We're going to experiment with editing the addrspace executable file.
- One simple change is to edit string constants. Adding new characters to extend the length of a string constant will cause havoc, but it is fine to overwrite existing chars with different chars (including overwriting with null characters to make the string appear shorter). Edit the executable file to change the string that says "Hello world!\n" to "Hello cs107!\n". Run your hacked executable and see the change.
- Similarly, the values for initialized global variables are directly stored in the executable and can be edited. Find the location of the initial values for the extern and static global variables by searching the addrspace executable file for their distinctive values. Find and change the value for one of the variables in order to pass the test and "win". The same trick won't work to "double-win", but there is a different binary hack that can achieve it if you're determined. What might that be?
- Now that you're warmed up, what kind of binary hacking could you devise to suppress explosions from your binary bomb?
Just for fun lab followups
- Reverse engineering has a number of frivolous, yet entertaining, uses. Check out the kcheat project that lets you forcibly manipulate values in an executing program. How does it work? It scans the address space looking for a particular value (e.g. the current value of your "strength") to identify the right location to poke a new value into to give you the strength you deserve.
- Another web tidbit, this one on creating the smallest executable file by torturing the ELF format.
- You gotta love a language that inspires competition to write the most hideous, unreadable code. The winners of the Obfuscated C Contest often abuse the preprocessor as one of their tricks. Look at hague-1986 or herrmann-2001 for particularly crazy examples! The Wikipedia entry on obfuscation gives more history and background on this crazy competition.
- Another ode to the nature of C is found in the Underhanded C Contest which is about writing seemingly innocent, simple C code that secretly does something evil. That's C for ya!
Check off with TA
Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with self check.