Lecture Review

Here are key concepts relevant to this week's lab:

optimization is the task of making your program faster or more efficient with space or time.
GCC can perform a variety of optimizations for us when compiling with different optimization flags.
the static instruction count is the number of assembly instructions included in the program code (e.g. objdump). the dynamic instruction count is the number of assembly instructions executed. These are not necessarily equal - for instance, a few static assembly instructions could encode a loop where each of the individual instructions is executed many times (resulting in a large dynamic instruction count).
There are a variety of different optimizations GCC does, such as constant folding, strength reduction, loop unrolling, and more. These are outlined in the lecture and slides.
GCC can't predict or optimize everything - there are some cases where we must do targeted optimization to improve efficiency and remove bottlenecks.

Solutions

1) Profiling with Callgrind

Powers of 2

This is a walkthrough of how to use callgrind on a program that calculates powers of 2 in different ways.

Q1: How do you think the performance will compare between these different versions?

A1: version C will probably be the worst, because it does calculations manually and inefficiently. Versions A and B will likely be faster, and version D will likely be the fastest, because bit operations are single assembly instructions!

Q2: Which function has the most instructions? The least? Does a shorter static instruction count necessarily mean a function will be faster?

A2: two_to_power_A and two_to_power_C have 9 instructions, two_to_power_B is 8 instructions, and two_to_power_D is 4 instructions. A shorter static instruction count doesn't necessarily mean a function will be faster - the few instructions could execute many times, or some instructions could be more expensive than others.

Q3: The bitwise version is untouchably fast (just a few instructions!). How much more work is involved in the other options? Did this surprise you?

A3: (these include the instructions inside called functions as well). A is 136 instructions total, B is 51 instructions total, C is 236 instructions total, and D is 4. Float functions can be expensive, and for C all calculations + the loop is expensive. D is wicked fast!

Copying Data

Q4: How do you think the performance will compare between these different versions?

A4: Copying large amounts of data at once, using library functions, will likely be the fastest. If we copy smaller chunks of data, it will take longer.

Q5: In the instruction counts for the char, int and long loops, about how many fewer instructions is copying by ints instead of chars? By longs instead of ints?

A5: Copying by chars took roughly 4 times as many instructions as ints and ints took approximately 2 times as many instructions as longs; this makes sense because the number of iterations was 4:1 for chars to ints and 2:1 for ints to longs.

Q6: About how many fewer instructions is copying by longs directly instead of memcpying 8 bytes?

A6: copying longs directly is 774 instructions (the whole function), vs. memcpying is 2963! (1,664 instructions just in the execution of memcpy).

Extra question: If you're interested, check out the extra problem in lab4 to see how one implementation from musl chooses to handle overlap. What does it do?

Answer: memmove copies in different directions depending on overlap type; pretty clever!

2) Ethics Case Study: Therac-25

This lab problem is different from other lab problems before it, in that it is a more open-ended ethics case study discussion. Here are examples of possible answers, but there are many possible ones - our main goal was to facilitate a discussion and encourage thinking about the breadth of possibilities.

Q1: Who should be held responsible for incidents such as these ones? (some ideas: the creator company, its shareholders, software engineers, quality testers, the hospitals using it, etc.?) In what way should they be held responsible?

A1: Company could be held accountable - perhaps by being liable to lawsuits, monetary damages, penalties from the government, etc. Either the company itself, leadership, or specific project managers could be held accountable for failure to implement responsible processes such as testing procedures or external review of code. The company could also be held responsible for their failures of transparency and disclosure. The shareholders could be held accountable, though on the other hand maybe they didn't know as much about the product's flaws? The software engineers - should they be held accountable? They implemented the buggy code, but maybe one person alone didn't know about the scope of the issue? This gets back to the question of manager or team responsibility for instituting testing and review procedures, which might be more important than individual responsibility. Quality testers - it's hard to know, it sounds like their testing did not raise the issue. The hospitals using it - perhaps the employee who misused it should be held responsible via lawsuits, firing, etc.?

Q2: The FDA now regulates medical devices, including contemporary radiation machines, more thoroughly than they did in the 1980s. However, most uses of software are not regulated. In addition to medical software, what other kinds of software (if any) ought to be regulated in order to preserve human life, health, and safety?

A2: some ideas: navigation apps? military hardware and software? software inside self-driving cars? or just regular cars, which now have significant on-board computers? Stock trading software? Financial software for FDIC insured banks? medical apps? Note that medical apps that only provide health advice or personal health data were de-regulated.

End-Lab Thought Questions

Loop unrolling is the process of reworking a loop to accomplish more per iteration and iterate fewer times. In what situations might this provide a performance win?
This may improve performance in a case where the loop would run many times, as it would remove overhead from the loop jmp/test procedures, but at the expense of static instruction counts.
Avoiding the expense of function call/return is often cited for the motivation for inlining a function (though this is not as necessary in current versions of C). What are the estimated savings in terms of number of instructions? Inlining also enables additional optimization opportunities, given that gcc applies most transformations within a function, but not inter-procedurally. Do you expect these gains to be more or less significant than the call overhead? Explain.
Inlining may perhaps remove the need to call or ret, and set up values in parameter registers or push extra parameters beyond the first 6 onto the stack. However, this may be much less significant than optimizations as a result of inlining the instructions from the other function, since a variety of operations, potentially taking many instructions, could be optimized!
Error messages are a form of documentation. What should have been different about the Therac-25 error messages? What would have made them more informative? (just one possible answer) The error message should have explicitly stated what went wrong, what could happen due to this error message, and the severity of the error.
AECL, the manufacturers of Therac-25, were sued by a patient who was injured in 1985. AECL claimed that it was "impossible" that the software could have malfunctioned. Why might they have been this overly confident in their own software? What systems could be put in place to help calibrate their own confidence correctly? (just one possible answer) They seemed to gain confidence from the fact that they carried over code from their already-shipped product, and they used a hardware simulator. However, they should have had independent testing, real-world testing, and had tests for their code (even carried over code) to ensure it didn't have lingering issues.

Checkoff Answers

Callgrind. What's the difference between Callgrind and callgrind_annotate? Callgrind is a profiler - it observes an executing program and gathers statistics. callgrind_annotate is an annotator - it is a separate program that can read the file created by callgrind with the statistics and output useful info about the execution.
Hot Spots. What is a "hot spot"? How do you identify a hot spot in the annotated source? How does finding a hot spot help you optimize your code? A hot spot is a line or group of lines of code that are responsible for a large number of dynamic instructions - you can spot this high number in the callgrind annotated output. Knowing what line(s) take the most instructions can help you target your efforts to make your program more efficient by focusing on removing or changing the lines that contribute most to the program's dynamic instruction count.
Copying. About how much much faster is a block memcpy copy versus a loop copying chunk by chunk using memcpy? It is an order of magnitude faster. E.g. copy_memloop is 2963 instructions while copy_memcpy is 131.
Responsibility. Who should be held responsible for incidents such as the Therac-25 incident? In what way should they be held responsible? see some possible answers from walkthrough above)
Regulation. The FDA now regulates medical devices, including contemporary radiation machines, more thoroughly than they did in the 1980s. However, most uses of software are not regulated. In addition to medical software, what other kinds of software (if any) ought to be regulated in order to preserve human life, health, and safety? (see some possible answers from walkthrough above)

[Optional] Extra Problem

Stack vs. Heap Allocation

Q1: Review the C code in isort.c and talk it over with your group - what predictions do you have about the relative performance of the different functions?

A1: Perhaps the heap will be more expensive than either stack example, due to the performance cost of finding space on the heap. Between the two stack examples, it might take more instructions to support a variable-sized stack allocation vs. fixed-size that can be hardcoded into the assembly.

Q2: What is the difference in the number of instructions executed on entry and exit to swap_fixedstack vs swap_varstack?

A2: swap_fixedstack: entry = 7, exit = 5. swap_varstack: entry = 24, exit = 6.

Q3: Compare the disassembly for these two functions. How many instructions are required to set up and clean up a constant-length stack array? How many additional instructions are needed for a variable-length stack array? Does the count of additional instructions in the disassembly match the count of executed instructions reported by callgrind?

A3: There are many additional instructions to create the variable-length stack array, and it even includes conditional jumps; you can see this by comparing the instructions before memcpy is called for the first time in each function's disassembly. In swap_fixedstack those are up through +14, and in swap_varstack those are up through +87. However, it turns out that not all of them are executed in this run of the program, when the third parameter is always 8. It takes the jump on +51, but not on +85, and the number of instructions executed in total is 24 for this run. There is also 1 additional instruction to clean up a variable-length stack array, an additional pop; you can see this by comparing the instructions after the last memcpy call in each function's disassembly. There is also a modification from an add to an lea for an existing instruction.

Q4: About how many instructions does each call to malloc incur? Each call to free? How does that compare to the cost of allocating and deallocating space on the stack?

A4: Each call to malloc is 38 + 4, to free is 82 + 3. This total (about 127) is way more than for stack!