Lab 5: Assembly

Lab sessions Mon May 02 to Thu May 05

Lab written by Julie Zelenski

Learning goals

This lab is designed to give you a chance to:

  1. use objdump and gdb to disassemble and trace assembly code
  2. study the relationship between source code and its assembly translation
  3. reverse-engineer a little assembly back to C

Find an open computer and somebody new to sit with. Introduce yourself and share your suggestions about how to best prep for the upcoming midterm.

Lab exercises

  1. Get started. Clone the lab starter project using the command

    hg clone /afs/ir/class/cs107/repos/lab5/shared lab5
    

    Have our guide to x86-64 basics and this handy one page of x86-64 in your browser for reference during lab. Bring up the online lab checkoff so you can jot down things as you go. At the end of the lab period, submit the sheet and have the TA check off your work.

  2. Deadlisting with objdump. As part of the compilation process, the assembler takes in assembly instructions and encodes them into the binary form understood by the hardware. Disassembly is the reverse process that converts binary-encoded instructions back into human-readable assembly. You wrote a little disassembler in assign4. objdump is a tool that operates on object files (i.e. files containing compiled machine code). It can dig out all sorts of information from the object file, but one of the more common uses is as a disassembler. Let's try it out!

    • Invoking objdump -d extracts the instructions from an object file and outputs the sequence of binary-encoded machine instructions alongside the assembly equivalent. This dump is called a deadlist ("dead" to distinguish from the study of "live" assembly as it executes). If symbol names are present, the instructions are grouped into sequences by function name. If you add the --no-show-raw-insn flag, objdump omits the binary encoding and just shows the assembly (cuts down on clutter). If the object file was compiled with debugging information, adding the -S flag to objdump will intersperse the original C source. Use objdump -d -S --no-show-raw-insn trace to get a sample deadlist.
    • The countops.py python script in the repo reports the assembly instructions most heavily used in a given object file. Try out countops.py trace for an example. The script operates by invoking objdump to disassemble the file, tallies instructions by opcode, and reports the top 10 most frequent. Try it out on a few executables (your reassemble or synonyms or tools like emacs and gcc). Does the mix of assembly instructions seem to vary much by program?

  3. Gdb commands for live assembly-level debugging. The debugger has great support for working with code at the assembly level. Load the trace program in gdb and use the gdb command start to get the program going and stopped in main. From there, try out the gdb commands listed below that allow you to poke around at the assembly level. To learn more about any gdb command, try gdb's built-in help.

    • The gdb command disassemble with no argument will print the disassembled instructions for the currently executing function. You can also give an optional argument naming what to disassemble: a function name or code address.

      (gdb) disassemble main
      Dump of assembler code for function main:
         0x0000000000400700 <+0>: push   %rbx
         0x0000000000400701 <+1>: callq  0x4005b8 <locals>
         0x0000000000400706 <+6>: mov    %eax,%ebx
         0x0000000000400708 <+8>: callq  0x40063e <solve>
      ...
      

      In the disassembly as printed by gdb, the hex number in the leftmost column is the address in memory for that instruction and in angle brackets is the offset of that instruction relative to the start of the function. You may notice minor differences in presentation between the disassembled instructions as printed by gdb versus the output from objdump, e.g. use of movq instead of mov, negative signed values may display as large unsigned, and so on.

    • The disassemble option /m intersperses C source with the asm. This can be helpful when trying to relate the two. (Though for more complex passages that are significantly rearranged during compilation, both may be confusing.)

      (gdb) disassemble/m main
      Dump of assembler code for function main:
      130 {
         0x0000000000400700 <+0>: push   %rbx
      
      131     int a;
      132     a = locals();
         0x0000000000400701 <+1>: callq  0x4005b8 <locals>
         0x0000000000400706 <+6>: mov    %eax,%ebx
      
    • You can set a breakpoint at a specific machine instruction by specifying its address, b *address, or an offset within a function, b *main+6. Note that the latter is not 6 instructions into main, but 6 bytes worth of instructions into main. Given the variable-length encoding of instructions, 6 bytes can correspond to one or several instructions.

      (gdb) b *0x400784              break at specified address
      (gdb) b *main+6                break at instruction 6 bytes past start of main
      
    • The gdb commands stepi and nexti allow you to single-step through assembly instructions. These are the assembly-level equivalents of the source-level step and next commands. They can be abbreviated si and ni.

      (gdb) stepi                    executes next single machine instruction
      (gdb) nexti                    executes next machine instruction (proceed over fn calls)
      
    • The gdb command info reg will print the value of the integer registers and condition codes. You can refer to an individual register by name to view or change the register's value. Within gdb, a register name is prefixed with $ instead of the usual %.

      (gdb) info reg
      (gdb) p $rax                   show current value in %rax register
      (gdb) set $rax = 9             change current value in %rax register
      
    • The gdb command set disassemble on (an abbreviation of set disassemble-next-line on) turns on assembly-level display. When execution is paused/stopped, gdb usually shows you the C source line to be executed next. After setting disassemble on, it will also show the assembly instructions corresponding to the C source.

      (gdb) set disassemble on
      
    • The tui (text user interface) we showed in lecture splits your session into panes for simultaneously viewing the C source, assembly translation, and/or current register state. The gdb command layout <argument> starts tui mode. The argument specifies which pane you want (src, asm, regs, or split). Tui mode is super-handy for tracing execution and observing what is happening with code/registers as you stepi. Occasionally, tui trips over itself and garbles the display. The gdb command refresh sometimes works to clean it up. If things get really out of hand, ctrl-x a will exit tui mode and return you to ordinary non-graphical gdb.

  4. Reading and tracing assembly in gdb. Read over the C code in trace.c. Compile the program and run in gdb. Use the gdb commands from the previous exercise to set breakpoints, disassemble, stepi through the assembly, print registers, and so on to answer the following questions.

    In the my_variables function:

    • Where is arr being stored? How are the values in arr initialized? What happened to the strlen call on the string constant to init the last array element?
    • What instructions were emitted to compute the value assigned to count? What does this tell you about the sizeof operator?
    • Use the gdb command display total to set up an auto-display expression for the variable total and single-step through the function. At the start and end of the function, gdb reports that total has been <optimized out>, but during the instructions where the value is "live", its value will be shown. Use the disassembly to figure out the location where total is being kept and for what range of instructions it is live. What other way could you view the live value during execution without referencing it by name?
    • Stop at the function start and use the gdb command info locals to show the local variables. Compare this list to the declarations in the C source. You'll see some variables are shown with values ("live"), some are <optimized out>, but others don't show up at all. Look at the disassembly to figure out what happened to these entirely missing variables. How does gdb respond when you ask it to print the value of one of the unlisted variables? What if you try to set its value? Step through the function repeating the info locals command to observe which variables are live at each step. Examine the disassembly to explain why there is no step at which both total and squared are live.

    In the u_arith and s_arith functions:

    • These functions invoke the same sequence of arithmetic operations but differ in the signedness of the operands. Carefully compare the disassembly for the two functions.
    • The first three C statements using add, subtract, and multiply compile into exactly the same assembly sequence for both functions. How is it possible that these instructions do the correct thing for both unsigned and signed arithmetic?
    • The branch instruction emitted for the if statement is different depending on the signedness of the operand -- why? For what values will the path taken differ due to the difference in branch? Set a breakpoint before the cmp instruction, change the value of the register being compared to one of those values, and stepi from there to verify the difference in paths taken for unsigned versus signed.
    • When doing a right-shift, does gcc emit an arithmetic (sar) or logical (shr) shift? Does the choice depend on whether the type is signed or unsigned?
    • To divide by 2, what instruction is used for unsigned? For signed, the assembly sequence is more complex. Trace through by hand or stepi in gdb and explain what the sequence is doing and why it differs from the unsigned calculation.

    In the for_loop, while_loop and dowhile_loop functions:

    • First, read the C code for the three loop variants. Under which conditions are these loops expected to have the same behavior and when will they differ?
    • Now examine the assembly. Two of the loops have the exact same assembly sequence -- which two? How does the assembly of the third loop differ from the other two? Why does it differ?
    • Set a breakpoint on loops and change the value of the parameter n being passed to the three calls such that the loop results will differ. Continue from there and see what is printed.

  5. Exploring C compilation to assembly. A fun tool for investigating C to asm is the GCC Explorer, an online "interactive compiler". (Thanks, Josh K, for sharing!) Use the link https://godbolt.org/g/fHoZ7S configured to use myth's version of GCC (4.8.x) and the compiler flags from the CS107 makefiles. You can enter some C code, tweak it a bit, and immediately observe how those changes are reflected in the assembly. The tool is doing the same tasks you could do on myth using gcc/gdb, but in a quick exploratory context. Here are a few experiments to try:

    • The lea instruction allows two adds and a multiply by constant 1, 2, 4 or 8 to be jammed into one instruction. It was designed for address arithmetic, but the math is compatible with regular integer operations and it is often used by the compiler to do an efficient add/multiply combo. Type in a simple sum(x, y) function that takes two integer arguments and returns their sum. Look at the assembly and you'll note it issued a lea instead of the expected add. Interesting! Change the function to return x + 2*y or x + 8*y - 17 and see how the lea can adapt. If you try return x + 3*y, it no longer fits the pattern for the lea. What does the compiler use instead?
    • Multiply is a mildly expensive operation and the compiler will do its darnedest to use a combo of add, shift, or lea instead. Type in a simple scale(x) function that takes one integer and returns the argument multiplied by constant 2. What instruction does the compiler use for the computation? What about a multiply by 8 or 16 or 256? Making a special case for powers of 2 is perhaps unsurprising but what does it do for multiply by 3 or 17 or 25? Experiment to find an integer constant C such that C*x is expressed as a true imul instruction.

  6. Reverse-engineering. The program babybomb asks for input and uses it to make a call to the function mystery in hopes of getting a successful return value. What kind of input is necessary to win at this game? Let's look into this mystery! Open the mystery.s file to view the assembly, then use gdb to stepi through a call to mystery and observe its execution. Once you understand how it operates, give input to the program that will pass the test and win. There are multiple ways to win -- try to find at least two different ones. You're on your way to tackling binary bomb!

Check off with TA

Before you leave, be sure to submit your checkoff sheet and have your lab TA come by and confirm so you will be properly credited. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own. Double-check your progress with self check.