Lab sessions Mon May 16 to Thu May 19
Lab written by Julie Zelenski
After this lab, you should be able to:
Find an open computer and somebody new to sit with. Introduce yourself and share war stories about your efforts to defuse your binary bomb.
Get started. Clone the starter project using the command
hg clone /afs/ir/class/cs107/repos/lab7/shared lab7
This creates the lab7 directory which contains source files and a Makefile.
Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
Understanding object files. An object file is the product of the compiler/assembler translating C source into object code. There are several Unix tools that can be used to poke around in object files, such as the disassembler objdump, your old friend from disarming the bomb. Try out the commands below to see what information they provide. Each tool has a man page you can check into for further information.
The strings command extracts text strings from a given file. From an object file, it will find string literals from the original C source and other character sequences. The way this tool works is surprisingly simple: it scans the raw file contents and prints any byte sequence of 4 or more printable characters in a row. Try strings on emacs, gcc, or your reassemble program to see what it finds.
nm prints the symbol table from an object file. The symbol table lists all functions and global variables referenced in the object file, giving the address, status, and segment (code, data, etc.) for each symbol. The symbol table can be removed from an object file using the strip command. If you invoke nm on a system executable like emacs, it reports "no symbols" because these executables have already been stripped. Use nm and strings on one of your own executables. Now strip that executable and try again. What changes?
readelf is a comprehensive tool for dissecting ELF files (ELF = Executable and Linkable Format, used by our myth machines). readelf has many flags to control which information to extract. readelf -e will dump the file header, the section header table, and the program header table, which together serve as a road map to the contents of the object file. This information is used by the OS loader to configure the address space of the new process when starting the executable.
We think of gcc as the compiler, but technically it is a compiler driver. When you invoke gcc, it sequences together the tools to do the build. Invoking gcc with the -v flag runs verbosely so you can observe the preprocessor, compiler, assembler, and linker each getting a turn, and adding the -save-temps flag will leave the intermediate files behind so you can examine the transformation stage by stage. Try make addrspace, which is set to build with these flags so you can see the full build process and poke around in the intermediate files.
(file.c -> file.i) The preprocessor cpp/cc1 does various text transformations on the source first. You can run just the preprocessor with gcc -E file.c.
(file.i -> file.s) The compiler cc1 parses the C code and generates assembly for it. You can use gcc -S file.c to stop after compiling and before assembling.
(file.s -> file.o) The assembler as converts assembly into an object file containing binary-encoded machine instructions. Running gcc -c file.c stops after assembling but before linking.
(file.o + file2.o + libs -> executable) The linker ld/collect2 combines multiple object files and any libraries into an executable file. Now the program is ready to execute!
The preprocessor cpp. As the first step of compilation, the preprocessor does a variety of text-based transformations such as:
processing #include by inserting the entire contents of the named file
expanding #define constants and macros
conditional compilation with #ifdef/#if/#endif etc.
Read through the pre.c file and predict what it would look like after preprocessing. Run just the preprocessor using gcc -E pre.c, look at the output, and verify you have the correct ideas.
The list below identifies some of the things that might go wrong with preprocessor directives. A missing #include or wrong #define seems like a preprocessor error, but in most cases, the consequences won't show up until further downstream and it will require sleuthing to relate the symptom back to the root cause. Edit the pre.c file to create each of the problems listed below and try to build. If the build fails, when is the problem detected and by which tool (preprocessor, compiler, linker)? Is it a hard error or just a warning? If it builds despite the problem, does the program run correctly?
#define MY_STRING "CS107 without closing quote
#include <dstio.h> (a misspelled header name)
Use FILE * without including stdio.h to get the typedef
Call qsort without including stdlib.h to get the prototype
Use NULL without including stddef.h or any other header that includes stddef.h. Yes, it's true, NULL is not a C keyword, instead just a #define! What does NULL expand to after preprocessing?
Linking. The relationship between the compiler and linker is one of the more misunderstood aspects of the build process. The compiler operates on a single .c file at a time and produces an object file (also referred to as a relocatable file). A .o file contains compiled assembly for all the functions defined in the .c file, but it is not a full program until linked. The linker mashes together the object file(s) and system libraries, and in the process has to resolve cross-module references and relocate addresses to their final location. A key task for the linker is resolving symbols-- ensuring there is at least one and no more than one definition for each symbol name in the global namespace. The linker detects exactly two kinds of errors-- undefined symbols and multiply-defined symbols.
Run nm util.o and nm main.o to view their symbol tables. In these relocatable files, the symbol addresses are all small numbers, offsets relative to the start of this module. Now view nm main. In a fully-linked executable, the addresses are much larger numbers. The linker has relocated each symbol to the final address it will occupy in the executing program's address space. Run gdb on the main program and examine a few symbols with the gdb command info address symbolname to verify the addresses as written in the symbol table match the executing program. The relocation process mostly just consists of setting the base address for each module and calculating the final address for each symbol by adding its small-number offset to its module base address.
Run nm main.o and look for symbols marked with U; these are symbols referenced within main.c, but not defined there. Now disassemble with objdump -d main.o and look into the instruction sequence for the main function to find a call to one of these undefined functions. The compiler doesn't know the function's address, so it inserts a placeholder for the target address when generating the call instruction. The placeholder value is always 00 00 00 00. Compare this to the call to the average or range function, where the call instruction has the target correctly set; no placeholder is needed because the compiler knows the address of these functions, which are defined in this module. Calls to standard library functions such as strcmp or qsort also create unresolved cross-module references. The linker is responsible for resolving all of these references. The linker joins the symbol tables of all modules/libraries being linked and verifies there are no duplicate names. It looks up the name of each unresolved reference in the combined table to retrieve the symbol's address, which is used to fill the placeholder. Disassemble with objdump -d main and find those same call instructions in the main function that you looked at earlier. In the fully-linked executable, the linker has replaced the placeholder in the call instructions with the function's address.
libc is the archive of modules for string/stdio/stdlib/ctype/etc. The libc library is always linked by default, which explains how references to the standard library functions printf/qsort/malloc are resolved without any explicit action to link the standard library. For reasons of historical accident, the math functions (sqrt, cos, etc.) are separated into their own libm library, which is not a part of libc and is not linked by default. A program that uses the math functions must both show the prototypes to the compiler (#include the math.h header) and tell the linker to link the math library (by adding -lm to the LDLIBS in the Makefile). Edit the main.c source to uncomment the call to sqrt and observe the build issues it creates. Make the two changes needed to correct the build problems.
Build for interposing. The leak detector assignment has an interesting build configuration to allow interposition of library functions. If you haven't already, make a clone of assign6 and use it to explore the different executables built from the leaky.c program.
cd to your assign 6 clone and do a make clean to remove any existing build products. This will allow you to see all the build steps run anew.
Run make leaky to build the executable in the ordinary manner. The single file involved is leaky.c. It is first compiled into leaky.o and then linked into leaky. Use nm leaky.o to see its symbols and note the mix of defined and undefined symbols. Those undefined symbols are resolved when linking with libc (remember that libc is always linked by default).
Run make leaky_cpp leaky_cpp.o to build the leak-detector-enhanced version using the cpp approach. Look at the build steps echoed by make to see what's different about this build from the previous one. When compiling leaky.c, it adds a header file with #defines that rename the wrapped functions. Take a look at lmalloc_client.h to see what's in there. What transformation is this applying to the leaky.c code? Use nm leaky_cpp.o to view its symbols. Compare its symbols to those in the ordinary leaky.o -- what is different? leaky_cpp.o contains references to undefined symbols that are not part of libc -- how are these going to be resolved when linking? The build compiles two additional files (symbols/lmalloc) and links them into the executable. Do an nm symbols.o and nm lmalloc_cpp.o to see what symbols are being provided by these support modules. A-ha -- mystery resolved!
Run make leaky_ld to build the leak-detector-enhanced version using the ld approach. Notice that it doesn't recompile leaky.c -- it just uses the previous leaky.o object file from the earlier build. An nm leaky.o shows references to the undefined functions malloc and calloc. If you were to link leaky.o by itself in the ordinary way (exactly as is done for the ordinary build), it would work out fine -- those undefined references would be resolved by the libc version. But notice this build links leaky with symbols and lmalloc. Try manually linking without a wrap flag (gcc leaky.o symbols.o lmalloc.o) and you get a link error about undefined symbols. What does the linker think is missing? Do an nm lmalloc.o to see its symbols. What symbols does it provide? What symbols does it reference that need to be defined elsewhere? How does this relate to the link error? Look at the link line echoed by the build process to see its use of the wrap flag. If you ask the linker to wrap the function binky, it will rename the definition of binky to __real_binky and change every reference to binky (i.e. every call) into a call to __wrap_binky instead. How does this solve the link error?
What are the tradeoffs of the cpp approach versus ld?
Who detects what? One of the most important benefits of understanding the entire tool chain is that you are in a better position to know the right fix when you hit a build error. Below are a few common build errors. First, think through how you think each would be handled, then try making the error and building to verify your understanding is correct.
You forget to #include the header that declares lfind but make a call to it anyway in your program. Does this compile? Does it link? Does it execute? Why or why not? What is affected if you decide to quiet the warning by adding your own prototype into your source? What if the prototype you hacked up is wrong -- you forget that the third argument is a size_t* and instead use a size_t in your prototype. How and when will you see a symptom of this mismatch? (As a rule, you should seek out the correct #include for a needed prototype instead of ignoring the warning or making your own.)
You call printf without #including the stdio.h header. Does this compile/link/execute? Why or why not? What if your code uses the global variable stdin from stdio.h without #including it? Does this compile/link/execute? Why or why not?
You call cos or sqrt without #including the math.h header or linking the math library. Does it compile/link/execute? What changes if you only fix the #include? What changes if you only link the library?
You use assert and you don't #include the header file that defines it. Does this compile/link/execute?
You #include the CVector header and call cvec_create and the other functions declared in the header file, but don't link with the module/library containing the CVector object code. Does this compile/link/execute?
Charting the address space. A program's address space is divided into segments: text (code), stack, global data, etc. The segments tend to be placed in predictable locations. Developing a feel for the address range used for each purpose can help you theorize about what has gone wrong when memory is out of whack. Run the addrspace program under gdb to answer these questions:
Where is the code for your own functions located (e.g. main)? What about the code for library functions such as printf? Are functions at the same locations for different runs of the same program? How do these addresses relate to the symbol addresses printed by nm?
Use the gdb command info proc mappings to see the list of memory segments. Set a breakpoint at main and view the initial memory map, then view it again after executing the function that allocates gobs of heap memory. Can you identify which segment in the memory map contains the heap? Can you identify what each segment holds (e.g. either your code, library code, stack, heap, global data, etc.)?
Chart the address space, label segments and note where gaps occur. Given a troublesome address, you can use this chart to identify whether the address is located within the stack/heap/global/code, which is a helpful clue when tracking down the problem. Of the entire addressable region, about what percentage appears in use?
Optional extra exploration: preprocessor macros and inline functions. Preprocessor macros have a number of pitfalls and we strongly discourage their use in favor of inlines. However, you may encounter macros in the code of others and it can be useful to understand the mechanism and why macros can be problematic. Review the code in macro.c to see the definition and use of the macro MAX(x, y).
The inline keyword recommends a function to the compiler as a good candidate for inlining. The inline function max operates almost like a macro. Use nm macro.o and look for max -- it doesn't even appear as a symbol. If you ask gdb to break at max or disassemble max, you'll discover the debugger knows nothing about it either. Disassemble main and you'll see no caller setup, no parameter passing, no call <max> instruction; instead, the instructions in the body of max were directly pasted in place of the call. Inlining avoids all the function call overhead (setting up stack, copying parameters to stack, transfer of control, return, etc.) at the cost of duplicating the instructions in the function's body at every calling site. This micro-optimization can be appropriate for a very small function that is called repeatedly on a performance-critical path. Adding the inline keyword is treated as "advisory" -- the compiler can disregard your advice and either inline what you didn't ask for or not inline what you did. In particular, gcc generally inlines only when optimization is enabled; when compiling without optimization (-O0) it does not inline. You can examine the disassembly (now that you are a superb reader of assembly!) to find out whether the compiler followed through on your recommendation to inline.
Optional extra diversion: binary hacking. The loader is responsible for running an executable file by starting the new process and configuring its address space. The code and data segments of the address space consist of data directly mapped in from the executable file. The executable file contains object code along with string constants, global data, and possibly symbol and debugging information. If you are very careful, there are ways to directly edit an executable file to change its runtime behavior, for example, by directly modifying data in the segments that will be mapped in. To be clear, there is rarely legitimate cause to do this, but we can play around with binary hacking to better understand the contents of the executable and its relationship to the executing program. If you open the binary file in emacs and invoke M-x hexl-mode, emacs will act as a raw hex editor. We're going to experiment with editing the addrspace executable file.
Change the string "Hello world!\n" to "Hello cs107!\n". Run your hacked executable and see the change.
Just for fun lab followups
Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with self check.