Assignment 1: A bit of fun

Due: Mon Apr 16 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed Apr 18 11:59 pm

Assignment by Julie Zelenski

Learning goals

This assignment delves into those topics covered in the first week of lecture and explored in lab1. You will be building your skills with:

  • editing, compiling, testing, and debugging C programs under Unix
  • writing code that manipulates bits and integers using the C bitwise and arithmetic operators
  • working within/around the limits of integer representation

Overview

This assignment asks you to complete three small programs, each of which explores a particular aspect of bitwise or numeric manipulation. The code you will write is quite short, but these small passages are mighty! Digging into this code will solidify your understanding of bits, integer representation, and computer arithmetic.

Note: To remind you of the collaboration policy as it applies to the assignments: you are to do your own independent thinking, design, coding, and debugging. The assignments are for you to solidify your own understanding and develop your individual skills. If you are having trouble completing the assignment on your own, please reach out to the course staff, we are here to help!

Get started

Check out the starter project from your cs107 repo using the command

git clone /afs/ir/class/cs107/repos/assign1/$USER assign1

The starter project contains three partially-written programs (sat.c, automata.c and utf8.c) and the supporting Makefile and custom_testsfiles. In the samples subdirectory, you'll find our sample solutions.

All of the programs are designed to be conveniently tested with our sanitycheck tool.

Review and comment starter code

When you inherit code as a starting point, your first task should be to carefully review it. You want to know how the given code operates and what it does and doesn't handle so you can plan how the code you write will fit into the program. You are given about 25 lines for each program and will add a dozen lines of your own, so there is more starter code to review than new code to add.

Our provided code processes the command-line arguments and handles user error in how the program is invoked. This kind of code is fairly mundane, but has fiddly details. One idiomatic pattern to note is converting string arguments to numbers. The triangle program from assign0 used the venerable atoi function which is simple to use but has no capability to detect malformed inputs. For these programs, we have switch to using the full-featured strtol which adds flexibility and robustness to the conversion, but requires more complex code. Read the function documentation at man strtol and review how the function is used in convert_arg.

Some questions you might consider as a self-test:

  • How do the sample programs respond if invoked with missing arguments? excess arguments?
  • What is the error function and how is it used here?
  • When converting the user's string argument to a number, what base is passed to strtol? In which bases does this allow the argument to be specified? How does the user's argument indicate which base to use?
  • What does strtol do if it encounters an invalid character in the string to convert?
  • How is the second argument to strtol used for error-detection? Why is it ok that the end variable is used without being initialized? If you did not want that error-detection, what do you pass as the second argument instead?
  • What does strtol do if asked to convert a string whose value would be too large to be represented in a long?
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

After reading through the code to get the lay of the land, now experiment with it! Build and run the program, invoking with different arguments to see how each is treated. Run the program under the debugger and step through code, printing intermediate values as you go. Observing the code in action is an excellent way to truly make sense of it. Confirm your answers to the self-test questions as a double-check on your understanding.

We intentionally left off all the comments on the starter code and expect you to pick up the slack and add the missing documentation. You should add an appropriate overview comment for the entire program, document each of the functions, and add inline comments where necessary to highlight any particularly tricky details.

1. Saturating arithmetic

The sat program performs a synthetic saturating addition operation. Saturating addition clamps the result into the representable range. Instead of overflowing with wraparound as ordinary two's-complement addition does, a saturating addition returns the type's maximum value when there would be positive overflow, and minimum when there would be negative overflow. Saturating arithmetic is a common feature in 3D graphics and digital signal processing applications.

Below shows two sample runs of the sat program:

$ ./sat 8
8-bit signed integer range
min: -128   0xffffffffffffff80
max:  127   0x000000000000007f
$ ./sat 8 126 5
126 + 5 = 127

The program reports that an 8-bit signed value has a range of -128 to 127 and if you attempt to add 126 to 5, the result overflows and sticks at the maximum value of 127.

You are to implement the functions below to support saturating addition for the sat program.

long signed_min(int bitwidth);
long signed_max(int bitwidth);
long sat_add(long operand1, long operand2, int bitwidth);

The bitwidth argument is a number between 4 and 64. A two's-complement signed value with bitwidth total bits is limited to a fixed range. The signed_min and signed_max functions return the smallest and largest values of that range. The sat_add function implements a saturating addition operation which returns the sum of its operands if the sum is within range or the appropriate min/max value when the result overflows. The type of the two operands is long but you can assume the value of the operands will always be within the representable range for a bitwidth-sized signed value.

There are two important restrictions on the code you write for this problem:

  • No relational operators, no math.h functions. Your code is prohibited from making any use of the relational operators. This means absolutely no use of < > <= >=. You also cannot call any function from the floating point math.h library (e.g no pow, no exp2). These restrictions are intended to guide you to implement the operation via bitwise manipulation.
  • No special cases based on bitwidth. Whether the value of bitwidth is 4, 64, or something in between, your functions must use one unified code path to handle any/all values of bitwidth without special-case handling. No use of if switch ?: to divide the code into different cases based on the value of bitwidth. This doesn't mean that you can't use conditional logic (such as to separately handle overflow or non-overflow cases), but conditionals that dispatch based on the value of bitwidth or make a special case out of one or more bitwidths are disallowed.

A solution that violates either restriction will receive zero credit, so please verify your approach is in compliance!

2. Cellular automata

Many of you implemented Conway's delightful Life simulation as a CS106B assignment and this problem will revisit cellular automaton, albeit in a simpler form and implemented via bit tricks. Daniel Shiffman's neat book The Nature of Code contains an excellent explanation of the elementary cellular automaton. Rather than have me attempt to recreate it here, please read Sections 7.1 and 7.2 for essential background information. (Note that Section 7.3 of the above link goes on to develop an automata program, but it employs a more heavyweight array/OO design that we eschew in favor of a streamlined approach using bits.)

We compactly represent a generation as a 64-bit unsigned long, one bit for each cell. A 1 bit indicates the cell is live, 0 if not. Using this bit-packed representation, reading and updating a single cell is implemented as a bitwise operation.

Let's trace how one generation advances to the next. A cell and its two neighbors form a "neighborhood". A neighborhood is effectively a 3-bit number, a value from 0 to 7. Consider a live cell with a live left neighbor and an empty right neighbor. The cell's neighborhood in binary is 110, which is the 3-bit number 6. The ruleset dictates whether a cell with neighborhood 6 will be live or empty in the next generation. A ruleset is represented as an 8-bit number, one bit for each of the 8 possible neighborhood configurations. Let's say the ruleset is 77. The number 77 expressed in binary is 01001101. Because the bit at position 6 (note: position 0 -> least significant) of the ruleset is 1, this cell will be live in the next generation. Are you getting the sense there will be much binary data to practice your bitwise ops on for this problem?

The automata program animates an elementary cellular automaton starting from the initial generation and applying a specified ruleset through a number of iterations. In the output, each line represents one generation. Below shows a sample run of the program for ruleset 77:

$ ./automata 77
                               +                                
++++++++++++++++++++++++++++++ + +++++++++++++++++++++++++++++++
+                            + + +                             +
+ ++++++++++++++++++++++++++ + + + +++++++++++++++++++++++++++ +
+ +                        + + + + +                         + +
+ + ++++++++++++++++++++++ + + + + + +++++++++++++++++++++++ + +
+ + +                    + + + + + + +                     + + +
+ + + ++++++++++++++++++ + + + + + + + +++++++++++++++++++ + + +
+ + + +                + + + + + + + + +                 + + + +
+ + + + ++++++++++++++ + + + + + + + + + +++++++++++++++ + + + +
+ + + + +            + + + + + + + + + + +             + + + + +
+ + + + + ++++++++++ + + + + + + + + + + + +++++++++++ + + + + +
+ + + + + +        + + + + + + + + + + + + +         + + + + + +
+ + + + + + ++++++ + + + + + + + + + + + + + +++++++ + + + + + +
+ + + + + + +    + + + + + + + + + + + + + + +     + + + + + + +
+ + + + + + + ++ + + + + + + + + + + + + + + + +++ + + + + + + +
+ + + + + + + ++ + + + + + + + + + + + + + + + + + + + + + + + +

You are to implement these functions for the automata program.

void draw_generation(unsigned long gen);
unsigned long advance(unsigned long gen, unsigned char ruleset);

The draw_generation function outputs a row of 64 characters, one for each cell. A live cell is shown as an inverted box, an empty cell is a blank space. The generation is ordered such that the cell corresponding to the most significant bit is in the leftmost column and the least significant bit in the rightmost column.

The advance function advances the generation forward one iteration. For each cell, it examines its neighborhood and uses the ruleset to determine the cell's state in the next generation. The two cells at the outside edges require a small special case due to having only one neighbor; their non-existent neighbor should be treated as though permanently off.

When testing your program, try playing with different values for the rulesets. Can you find a ruleset that produces Sierpinski's triangle? What about an inverted Sierpinski? Which ruleset produces your most favorite output?

Rather than trying to manually eyeball the output for correctness, sanitycheck can be used to compare your output to that of the sample. Sanitycheck converts the output to show live cells as # and empty as .. This makes the output less aesthetically pleasing, but easier for visual discrimination.

3. UTF-8

A C char is a one-byte data type, capable of storing 28 different bit patterns. The ASCII standard (man ascii) establishes a mapping for those patterns, but is limited to 256 options, so only ordinary letters, digits, and punctuation make the cut to be included. Unicode is ASCII's more cosmopolitan cousin, defining a universal character set that supports glyphs from a wide variety of world languages, both modern and ancient (Egyptian hieroglyphics, the original emoji? ).

A Unicode character is identified by a number called its code point and the Unicode codespace encompasses over a million different code points. The notation U+NNNN refers to the code point whose value is hex 0xNNNN. The utf8 program takes one or more code points and displays the Unicode hex encoding and corresponding character glyph for each code point. Here is a sample run:

$ utf8 0x41 0xfc 0x3c0 0x4751
U+0041   Hex: 41         Character: A  
U+00FC   Hex: c3 bc      Character: ü  
U+03C0   Hex: cf 80      Character: π  
U+4751   Hex: e4 9d 91   Character: 䝑

UTF-8 is an encoding used for Unicode. It represents a code point as a sequence of bytes; the length of the sequence is determined by the number of significant bits in the code point's binary representation. The following table shows the structure for one, two and three-byte UTF-8 encodings. If the number of significant bits in the code point is no more than 7, it fits into the one-byte sequence; if number of significant bits is between 8 and 11, it requires a two-byte sequence, and if between 12 and 16 bits, a three-byte sequence. (There is also a four-byte encoding, but we will ignore it).

Code point range Num sigbits UTF-8 encoded bytes (in binary)
U+0000 to U+007F 7 0xxxxxxx
U+0080 to U+07FF 11 110xxxxx 10xxxxxx
U+0800 to U+FFFF 16 1110xxxx 10xxxxxx 10xxxxxx

The 0 and 1 bits shown in the UTF-8 bytes above are fixed for all code points. The xxx bits store the binary representation of the code point. The one-byte sequence stores 7 bits, the two-byte sequence stores 11 bits (5 bits in first byte and 6 in the other), and the three-byte stores 16 bits (divided 4, 6, and 6).

In a single-byte sequence, the high-order bit is always 0, the other 7 bits are the value of the code point itself. (The first 128 Unicode code points deliberately use the same single byte representation as the corresponding ASCII char, for backwards-compatibility with older systems that only know about ASCII -- neat!)

For the multi-byte sequences, the first byte is called the leading byte, the subsequent byte(s) are continuation bytes. The high-order bits of the leading byte indicate the number of bytes in the sequence. The high-order bits of a continuation byte are always 10. The bit representation of the code point is then divided across the low-order bits of the leading and continuation bytes.

Example (paraphrased from Wikipedia UTF-8 page)

Consider encoding the Euro sign, €.

  • The Unicode code point for € is U+20AC.
  • The code point is within the range U+0800 to U+FFFF and will require a three-byte sequence. There are 14 significant bits in the binary representation of 0x20AC.
  • Hex code point 20AC is binary 00100000 10101100. Two leading zero bits of padding are used to fill to 16 bits. These 16 bits will be divided across the three-byte sequence.
  • Leading byte. High-order bits are fixed: three 1s followed by a 0 indicate the three-byte sequence. The low-order bits store the 4 highest bits of the code point. This byte is 11100010. The code point has 12 bits remaining to be encoded.
  • Continuation byte. High-order bits are fixed 10, low-order bits store the next 6 bits of the code point. This byte is 10000010.
  • Continuation byte. High-order bits are fixed 10, low-order bits store the last 6 bits of the code point. This byte is 10101100.

The final three-byte sequence 1110001010000010 10101100can be more concisely written in hexadecimal, as e2 82 ac. The underscores indicate where the bits of the code point were distributed across the encoded bytes.

You are to write the to_utf8 function that takes a Unicode code point and constructs its sequence of UTF-8 encoded bytes.

int to_utf8(unsigned short code_point, unsigned char seq[])

The to_uft8 function has two parameters; an unsigned short code_point and a byte array seq. The function constructs the UTF-8 representation for the given code point and writes the sequence of encoded bytes into the array seq. The seq array is provided by the client and is guaranteed to be large enough to hold a full 3 bytes, although only 1 or 2 may be needed. The function returns the number of bytes written to the array (either 1, 2, or 3).

This task is excellent practice with constructing bitmasks and applying bitwise ops and standard techniques to extract/rearrange/pack bits.

Testing and custom sanity tests

In addition to teaching you the inner workings of computer systems, CS107 also provides strength and cardio training for your coding skills. A key piece of this training is learning the value of thoughtful and thorough testing. When checking code into the team repository, a professional developer is responsible for having first vetted that code through comprehensive testing. In the case of CS107, you want to find and fix the bugs before the autotester ferrets them out!

Sanitycheck is a wonderful tool and we recommend you use it early and often. (And not only is it helpful for testing, running sanitycheck also makes a snapshot of your code as a safety precaution to recover against editing mishaps).

Keep in mind that the basic tests supplied with default sanitycheck tests are just a start. You'll need to devise tests of your own to achieve comprehensive coverage. We designed sanitycheck to allow you to integrate your tests into via the custom option. Using custom sanitycheck means you get the benefit of having automatic comparison to our sample solution, very convenient!

How do you brainstorm what additional tests to consider? First review the cases supplied with default sanitycheck to see what is already covered. Then consider what is the full range of inputs/possibilities beyond those simple cases. If the variety of inputs is sufficiently small, you may be able to enumerate a test case for each possibility. In other situations, exhaustive coverage is not feasible, so you instead identify representative samples to validate the operation on specific cases. You should target ordinary cases as well as inputs that are unusual or edge conditions.

For example, consider testing sat. Your tests should cover a variety of different bitwidths, including extreme values, and on sums within range as well as those that overflow. For automata, look carefully at where you have special cases in advancing the generation, and set up test cases that explicitly poke at those issues. There are some 65,000 possible code points you could feed to utf8, but given that they all flow through just a few distinct paths, having a test case targeted to hit each of those paths, with a little extra attention at the boundaries, should be sufficient to give you confidence about the rest.

To encourage you to flex your testing muscles, for assign1 we require that you submit your custom_tests file that shows the tests you have added. We are expecting at least 3 to 5 additional tests of your own per program that show thoughtful effort to develop comprehensive test coverage. When you add a test to your custom_tests, document your plan by including comments in the file that remind you why you included each test and how the tests relate to one another.

Advice/FAQ

The assignment writeup (this page) is the place where we spell out the formalities (required specifications, due date, grading standards, logistics, and so on). We also maintain a separate companion advice page for the assignment that offers informal recommendations and hints, along with answers to common student questions. Please check it out! Go to advice/FAQ page

Grading

Below is the tentative grading rubric. We use a combination of automated tests and human review to evaluate your submission. More details are given in our page explaining How assignments are graded.

Code functionality (80 points)

  • Sanity cases. (30 points) The default sanity check tests are used for sanity grading.
  • Comprehensive cases. (45 points) We will thoroughly test on a variety of an additional inputs ranging from simple to complex, including edge conditions where appropriate.
  • Custom sanitycheck tests (5 points) Your custom_tests file should include at least 10 tests of your own that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.
  • Clean compile. (2 points) We expect your code to compile cleanly without warnings.

Code quality (buckets weighted to contribute ~10 points)

The grader's code review is scored into buckets to emphasize the qualitative features of the review over the quantitative.

  • Bitwise manipulation. We expect you to show proficiency in bitwise manipulation through clean construction and use of masks and proper application of the bitwise operations.
  • Algorithms. Your chosen approach should demonstrate that you understand how to leverage bit patterns and numeric representation to directly and efficiently accomplish the task. Unnecessary divergence and special-case code should be avoided; unify into one general-purpose path wherever possible.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You are to document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

On-time bonus (+5%)

How CS107 handles deadlines and late work may differ from other classes, be sure you have read our complete late policy.

The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. The bonus is calculated as a percentage of the point score earned by the submission.

Finish and submit

Review the How to Submit page for instructions. You can submit as many times as desired; your most recent submission is the one we will grade. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submit, we cannot grade your work.

Submissions received by the due date receive the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly!

How did it go for you? Review the post-task self-check.

Congratulations are in order on completing your first step on your journey in leveling up your systems knowledge and skills! Way to go and looking forward to more great things to come!