Assignment 1: A Bit of Fun

Due: Wed Apr 14 11:59 pm
Late submissions accepted until Fri Apr 16 11:59 pm

Assignment by Julie Zelenski, with modifications by Nick Troccoli, Katie Creel and Brynne Hurst

Along with this assignment, we've posted a page of extra bits practice problems with solutions. These are very helpful to confirm your understanding of bit operators, binary, and more as you get started with this assignment!

Learning Goals

This assignment covers topics from the first few lectures and the first lab. You will be building your skills with:

  • editing, compiling, testing, and debugging C programs under Unix
  • writing code that manipulates bits and integers using the C bitwise and arithmetic operators
  • working within/around the limits of integer representation
  • understanding the ramifications of integer representations in real-world software

Overview

This assignment asks you to complete three programs and one case study. Each program explores a particular aspect of bitwise or numeric manipulation. The code you will write is short, but these small passages are mighty!

Take a look at the working on assignments page linked to from the assignments dropdown as you are working on this assignment; it outlines everything you need to know about working through a CS107 assignment, from getting the starter code to testing to submitting.

Note: To remind you of the collaboration policy as it applies to the assignments: you are to do your own independent thinking, design, coding, and debugging. The assignments are for you to solidify your own understanding and develop your individual skills. If you are having trouble completing the assignment on your own, please reach out to the course staff; we are here to help!

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign1/$USER assign1

The starter project contains the following:

  • readme.txt: a text file where you will answer questions for the assignment
  • sat.c. automata.c, utf8.c and Makefile: three partially-written programs that you will modify, and their Makefile for compiling
  • custom_tests: the file where you will add custom tests for your programs
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini, sanity.py and prototypes.h: files to configure and run Sanity Check. You can ignore these.
    • automata_soln, sat_soln and utf8_soln: executable solutions for the programs you will modify.
    • stylecheck.py: a program that can check your code for common style issues. This is not exhaustive, and is only meant to identify common style issues.
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

Assignment Support: Through TA helper hours and the discussion forum, our focus will be on supporting you so that you can track down your own bugs. Please ask us how to best use tools (like the brand-new GDB!), what strategies to consider, and advice about how to improve your debugging process or track down your bug. We're happy to help you with these in order to help you drive your own debugging. For this reason, if you have debugging questions during helper hours, please make sure to gather information and explore the issue on your own first, and fill out the QueueStatus questions with this information. For instance, if your program is failing in certain cases, try to find a small replicable test case where the program fails; then, try using GDB to narrow down to what function/code block/etc. you think is causing the issue; then, further investigate just that area to try and find the bug. As another example, if your program is crashing, take a similar approach, but try using GDB and the backtrace command (check it out in lab!) to narrow in on the source of the crash. This information is important for the course staff to effectively help you with debugging. Starting with a future assignment, we will require this information when signing up for helper hours for debugging help, so please make sure to provide as much information as possible.

Review and Comment

When you inherit code as a starting point, your first task should be to carefully review it. You want to know how the given code operates and what it does and doesn't handle so you can plan how the code you write will fit into the program. You are given about 25 lines for each program and will add about a dozen lines of your own per program, so there is more starter code to review than new code to add.

To get you started, our provided code for each program processes the command-line arguments, handles user error in how the program is invoked, and has a few important details, particularly in converting string arguments to numbers. The triangle program from assign0 used the atoi function, which is simpler to use but has no capability to detect malformed inputs. For these programs, we are using the more robust, but complex, strtol. Read the function documentation at man strtol and review how the function is used in convert_arg.

Some questions you might consider as a self-test:

  • How do the sample programs respond if invoked with missing arguments? excess arguments?
  • How is the error function used here?
  • When converting the user's string argument to a number, what base is passed to strtol? In which bases does this allow the argument to be specified? How does the user's argument indicate which base to use?
  • What does strtol do if it encounters an invalid character in the string to convert?
  • How is the second argument to strtol used for error-detection? Why is it ok that the end variable is used without being initialized? If you did not want that error-detection, what do you pass as the second argument instead?
  • What does strtol do if asked to convert a string whose value would be too large to be represented in a long?
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

After reviewing the code, try experimenting with it! Build and run the programs, invoking them with different arguments to see how each is treated. Run the program under the debugger (check out our Debugger Guide under "Handouts") and step through code, printing intermediate values as you go. Observing the code in action is an excellent way to better understand it. Confirm your answers to the self-test questions above to verify your understanding.

We intentionally left off all the comments on the starter code, so you should add in documentation. Add an appropriate overview comment for the entire program, document each of the functions, and add inline comments where necessary to highlight any particularly tricky details.

1. Saturating Arithmetic

The sat program performs a synthetic saturating addition operation. Saturating addition clamps the result into the representable range. Instead of overflowing with wraparound as ordinary two's-complement addition does, a saturating addition returns the type's maximum value when there would be positive overflow, and minimum when there would be negative overflow. Saturating arithmetic is a common feature in 3D graphics and digital signal processing applications.

Here are two sample runs of the sat program:

$ ./sat 8
8-bit signed integer range
min: -128   0xffffffffffffff80
max:  127   0x000000000000007f
$ ./sat 8 126 5
126 + 5 = 127

The program reports that an 8-bit signed value has a range of -128 to 127 and if you attempt to add 126 to 5, the result overflows and sticks at the maximum value of 127.

Your task is to implement the functions below to support saturating addition for the sat program.

long signed_min(int bitwidth);
long signed_max(int bitwidth);
long sat_add(long operand1, long operand2, int bitwidth);

The bitwidth argument is a number between 4 and 64, inclusive. A two's-complement signed value with bitwidth total bits is limited to a fixed range. The signed_min and signed_max functions return the smallest and largest values of that range. The sat_add function implements a saturating addition operation which returns the sum of its operands if the sum is within range, or the appropriate min/max value when the result overflows. The type of the two operands is long but you can assume the value of the operands will always be within the representable range for a bitwidth-sized signed value. That being said, this means there may be more bits than you need for a bitwidth-sized value, and these extra bits should be set appropriately to ensure the value is correctly interpreted when it is inside a long. For example, with a bitwidth of 4, the max is binary 111 , which represents the value 7. We want to return the value 7 but represented using 64 bits, which would be 0000...0111.

There are three important restrictions on the code you write for this problem (these restrictions apply only to this program, and not others):

  • No relational operators or math.h functions. You are prohibited from making any use of the relational operators. This means no use of < > <= >=. You may use != ==. You also should not call any function from the floating point math.h library (e.g no pow, no exp2). These restrictions are intended to guide you to implement the operation via bitwise manipulation. All other operators (arithmetic, logical, bitwise, ...) are fine.
  • No special cases based on bitwidth. Whether the value of bitwidth is 4, 64, or something in between, your functions must use one unified code path to handle any/all values of bitwidth without special-case handling. You should not use if switch ?: to divide the code into different cases based on the value of bitwidth. This doesn't mean that you can't use conditional logic (such as to separately handle overflow or non-overflow cases), but conditionals that dispatch based on the value of bitwidth or make a special case out of one or more bitwidths are disallowed.
  • No loops or recursion. No loops or recursion at all. (In particular, no loop increment by one and stop at max.)

A solution that violates any of these restrictions will receive zero credit, so please verify your approach is in compliance! These restrictions are only for this problem, and not later problems.

2. Case Study: Ariane-5

Once you have implemented your sat program, you will have a taste of functions that work around overflow (completing sat is not required to complete this part, but provides context). Overflow errors can cause many problems, which we saw some examples of in Lecture 2. These errors are not very common, but they can be disastrous when they occur. Read through this article explaining an issue experienced by the European Space Agency, and answer the following questions in the readme.txt file.

View Ariane-5 Article

1) What was the cause of the Ariane 5 rocket failure, and why was it not detected before the rocket was launched? Write your answer in 3-5 sentences.

2) Imagine that you are developing software for Ariane 6. You notice that Ariane 6 takes the average of two different sensors' measurements of horizontal velocity (which are stored as ints) by adding them together and dividing by 2, as in the following code:

int average = (velocity1 + velocity2) / 2;

What issues might arise from this calculation? How would you test for other potential overflow issues that could arise before the launch date? Write your answer in 3-5 sentences. Note: There are many possible answers to this question. Your answer should rely on critical reflection about appropriate testing strategies.

3. Cellular Automata

A cellular automaton models the life cycle of a cell colony, which you can think of as a 1D array of cells, each of which can be either live or empty. Starting with an initial pattern, the automaton simulates the birth and death of future generations of cells using a set of simple rules. In this part of the assignment, you will implement a cellular automaton using bits and bit operations so that it uses memory as efficiently as possible.

For this assignment, we will represent a colony of cells as a 64-bit unsigned long, with one bit for each cell in a 1D world. A 1 indicates the cell is live, and 0 indicates the cell is empty. A location's neighbors are the cells to its immediate left and right. The ruleset tells us, based on the state of a cell and its neighbors in the current generation, whether that cell is live or empty in the next generation. Using this bit vector representation, reading and updating a single cell is implemented as a bitwise operation.

Daniel Shiffman's neat book The Nature of Code contains an excellent explanation of how the simulation works. Rather than repeat it here, please read Sections 7.1 and 7.2 for essential background information. (Note that Section 7.3 of the above link goes on to develop an automata program, but it employs a more heavyweight array/OO design that we eschew in favor of a streamlined approach using bits.) Note that we'll be using larger colonies than those examples, but the idea is the same. Once you've read these sections, continue on below to confirm your understanding of the simulation.

Let's review how one generation advances to the next. As mentioned in the book sections, a cell and its two neighbors form a "neighborhood". A neighborhood is effectively a 3-bit number, a value from 0 to 7. Consider a live cell with a live left neighbor and an empty right neighbor. The cell's neighborhood in binary is 110, which is the 3-bit number 6. The ruleset dictates whether a cell with neighborhood 6 will be live or empty in the next generation. A ruleset is represented as an 8-bit number, one bit for each of the 8 possible neighborhood configurations. Let's say the ruleset is 77. The number 77 expressed in binary is 01001101. Because the bit at position 6 (note: position 0 refers to the least significant bit) of the ruleset is 1, this cell will be live in the next generation. Important note: the Nature of Code diagrams may be confusing in determining the ruleset bit order. A ruleset's least significant bit says what to do for the neighborhood of all 0s. A ruleset's most significant bit says what to do for the neighborhood of all 1s. Make sure not to invert this order!

The automata program displays an elementary cellular automaton starting from the initial generation and applying a specified ruleset through a number of iterations. It stops early if the generations do not change. In the output, each line represents one generation, going from the first generation at the top to the last generation at the bottom. Below shows a sample run of the program for ruleset 77:

$ ./automata 77
                               +                                
++++++++++++++++++++++++++++++ + +++++++++++++++++++++++++++++++
+                            + + +                             +
+ ++++++++++++++++++++++++++ + + + +++++++++++++++++++++++++++ +
+ +                        + + + + +                         + +
+ + ++++++++++++++++++++++ + + + + + +++++++++++++++++++++++ + +
+ + +                    + + + + + + +                     + + +
+ + + ++++++++++++++++++ + + + + + + + +++++++++++++++++++ + + +
+ + + +                + + + + + + + + +                 + + + +
+ + + + ++++++++++++++ + + + + + + + + + +++++++++++++++ + + + +
+ + + + +            + + + + + + + + + + +             + + + + +
+ + + + + ++++++++++ + + + + + + + + + + + +++++++++++ + + + + +
+ + + + + +        + + + + + + + + + + + + +         + + + + + +
+ + + + + + ++++++ + + + + + + + + + + + + + +++++++ + + + + + +
+ + + + + + +    + + + + + + + + + + + + + + +     + + + + + + +
+ + + + + + + ++ + + + + + + + + + + + + + + + +++ + + + + + + +
+ + + + + + + ++ + + + + + + + + + + + + + + + + + + + + + + + +

Pretty cool, huh? Since a ruleset is an 8-bit number, this means there are 256 different possible rulesets to dictate how cell generations evolve. There are some really cool outcomes you get depending on the ruleset. We'll let you explore these!

You are to implement these functions for the automata program.

void draw_generation(unsigned long gen);
unsigned long advance(unsigned long gen, unsigned char ruleset);

The draw_generation function outputs a row of 64 characters, one for each cell. A live cell is shown as an inverted box (a provided constant), and an empty cell is shown as a blank space. The generation is ordered such that the cell corresponding to the most significant bit is in the leftmost column and the least significant bit is in the rightmost column. We recommend you implement this first. If you then run the program, it should print out one line for the first generation, another blank line for a generation of 0 (since that is what the advance function below initially returns), and then it will stop because the generation does not change.

The advance function advances the generation forward one iteration. For each cell, it examines its neighborhood and uses the ruleset to determine the cell's state in the next generation. The two cells at the outside edges require a small special case due to having only one neighbor; their non-existent neighbor should be treated as though permanently off.

When testing your program, try playing with different values for the rulesets. Can you find a ruleset that produces Sierpinski's triangle? What about an inverted Sierpinski? Which ruleset produces your favorite output?

Rather than trying to manually eyeball the output for correctness, we recommend using SanityCheck to compare your output to that of the sample. Sanitycheck converts the output to show live cells as # and empty as .. This makes the output less aesthetically pleasing, but easier for visual discrimination.

4. UTF-8

A C char is a one-byte data type, capable of storing 28 different bit patterns. The ASCII encoding standard (man ascii) establishes a mapping for those patterns to various letters, digits and punctuation, but is limited to 256 options, so only a subset are included. Unicode is ASCII's more cosmopolitan cousin, defining a universal character set that supports glyphs from a wide variety of world languages, both modern and ancient (Egyptian hieroglyphics, the original emoji?). This large number of characters requires a different system than just a single char, however. For this part of the assignment, you will implement the most popular Unicode encoding, UTF-8, which according to recent measurements, is used by over 93% of all websites (with known character encodings)!

Unicode maps each character to a code point, which is a hexadecimal number. There are over 1 million different code points in the Unicode standard! A character's code point is commonly written in the format U+NNNN, which signifies the hexadecimal value 0xNNNN.

However, the Unicode standard does not specify how to best encode these code points in binary. There are a variety of different possible Unicode encodings which specify, given a code point, how to actually store it. Why multiple encodings? Each may have a different priority, whether it is compatibility with other non-Unicode encodings, the locale it is intended for, space efficiency, etc. Some questions that may have different answers depending on the encoding are: should smaller code points take up the same amount of space as larger code points? If not, how can we tell how long the encoding is? And is it possible to preserve compatibility with other encodings, such as ASCII?

One of the most popular Unicode encodings is UTF-8. UTF-8 is popular partially because it preserves backwards compatibility with ASCII - in fact, it was designed such that ASCII is a subset of UTF-8, meaning that the characters represented in the ASCII encoding have the same encoding in ASCII and UTF-8.

The UTF-8 encoding represents a code point using 1-4 bytes, depending on the size of the code point. If it falls in the range U+0000 to U+007F, its UTF-8 encoding is 1 byte long. If it falls in the range U+0080 to U+07FF, its UTF-8 encoding is two bytes long. And if it falls in the range U+0800 to U+FFFF, its UTF-8 encoding is 3 bytes long (there is a 4 byte range, but we'll be ignoring that for this assignment).

The way UTF-8 works is it splits up the binary representation of the code point across these UTF-8 encoded byte(s). Code points from U+0000 to U+007F use at most 7 bits (these are called its "significant bits"), so the UTF-8 encoding is one byte with its most significant bit0, and its remaining 7 bits as the seven significant bits from the code point. Code points from U+0080 to U+07FF have at most 11 significant bits, so the first byte of the UTF-8 encoding is a leading 110 (we'll see why later), followed by the 5 most significant code point bits. The second byte of this UTF-8 encoding is a leading 10, followed by the remaining 6 significant code point bits. Code points from U+0800 to U+FFFF have at most 16 significant bits, so the first byte of the UTF-8 encoding is a leading 1110, followed by the 4 most significant code point bits. The second byte of this UTF-8 encoding is a leading 10, followed by the 6 next most significant code point bits. The third byte of this UTF-8 encoding is a leading 10, followed by the final 6 significant code point bits.

Here is a table containing the design described above. For each byte layout, the 0 and 1 bits are fixed for all representations. The xxx bits store the binary representation of the code point.

Code Point Range Significant Bits UTF-8 Encoded Bytes (Binary)
U+0000 to U+007F 7 0xxxxxxx
U+0080 to U+07FF 11 110xxxxx 10xxxxxx
U+0800 to U+FFFF 16 1110xxxx 10xxxxxx 10xxxxxx

As you can see, the one-byte sequence stores 7 bits of the code point, the two-byte sequence stores 11 bits (5 bits in first byte and 6 in the other), and the three-byte stores 16 bits (divided 4, 6, and 6). The different UTF-8 byte lengths are determined based on how many bytes are needed to store all the bits of the code point (1 for up to 7 significant bits, 2 for up to 11 significant bits, and 3 for up to 16 significant bits).

In a single-byte sequence, the high-order bit is always 0, and the other 7 bits are the value of the code point itself. This is so the first 128 Unicode code points use the same single byte representation as the corresponding ASCII character!

For the multi-byte sequences, the first byte is called the leading byte, and the subsequent byte(s) are continuation bytes. The high-order bits of the leading byte indicate the number of total bytes in the sequence through the number of 1s (for instance, the leading byte of an encoding for a code point from U+0080 to U+07FF starts with 110, indicating the entire encoding is 2 bytes long, since there are two 1s). The high-order bits of a continuation byte are always 10. The bit representation of the code point is then divided across the low-order bits of the leading and continuation bytes.

Example (paraphrased from Wikipedia UTF-8 page)

Consider encoding the Euro sign, €.

  1. The Unicode code point for € is U+20AC.
  2. The code point is within the range U+0800 to U+FFFF and will require a three-byte sequence. There are 14 significant bits in the binary representation of 0x20AC.
  3. Hex code point 20AC is binary 00100000 10101100. Two leading zero bits of padding are used to fill to 16 bits. These 16 bits will be divided across the three-byte sequence.
  4. Let's calculate the leading byte. High-order bits are fixed: three 1s followed by a 0 indicate the three-byte sequence. The low-order bits store the 4 highest bits of the code point. This byte is 11100010. The code point has 12 bits remaining to be encoded.
  5. Next, let's calculate the first continuation byte. High-order bits are fixed 10, low-order bits store the next 6 bits of the code point. This byte is 10000010. The code point has 6 bits remaining to be encoded.
  6. Finally, let's calculate the last continuation byte. High-order bits are fixed 10, low-order bits store the last 6 bits of the code point. This byte is 10101100.
  7. The three-byte sequence this results in is 1110001010000010 10101100, which can be more concisely written in hexadecimal, as e2 82 ac. The underscores indicate where the bits of the code point were distributed across the encoded bytes.

Your Task

You are to write the to_utf8 function that takes a Unicode code point and constructs its sequence of UTF-8 encoded bytes.

int to_utf8(unsigned short code_point, unsigned char utf8_bytes[])

The to_utf8 function has two parameters; an unsigned short code_point and a byte array utf8_bytes. The function constructs the UTF-8 representation for the given code point and writes the sequence of encoded bytes into the array utf8_bytes. The first byte (e.g. leading byte) should go at index 0, and the remaining bytes (if any, e.g. continuation bytes) should follow at index 1, etc. The utf8_bytes array is provided by the client and is guaranteed to be large enough to hold a full 3 bytes, although only 1 or 2 may be needed. While we will talk about C arrays in more depth later, it turns out that if you pass an array as a parameter, modifying the elements of that array parameter will modify the elements of the original array passed in, so the function above can pass back the byte(s) it creates. The function returns the number of bytes written to the array (either 1, 2, or 3).

You will implement this within the provided utf8 program, which takes one or more code points and displays the UTF-8 hex encoding and corresponding character glyph for each code point using your to_utf8 function. Here is a sample run:

$ utf8 0x41 0xfc 0x3c0 0x4751
U+0041   UTF-8 Hex: 41         Character: A  
U+00FC   UTF-8 Hex: c3 bc      Character: ü  
U+03C0   UTF-8 Hex: cf 80      Character: π  
U+4751   UTF-8 Hex: e4 9d 91   Character: 䝑

This task is excellent practice with constructing bitmasks and applying bitwise ops and standard techniques to extract/rearrange/pack bits.

Testing

In addition to the above programs, as part of this assignment you should also add at least 3 to 5 additional tests of your own per program (total of at least 10) in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the file that remind you why you included each test and how the tests relate to one another. The tests supplied with the default SanityCheck are a start that you should build on, with the goal of finding and fixing any bugs before submitting, just like how a professional developer is responsible for vetting any code through comprehensive testing before adding it to a team repository. As a reminder, you can visit the working on assignments linked to from the Assignments dropdown for information about sanitycheck and how to use it with custom tests. We recommend you run Sanity Check early and often (and remember, it even makes a snapshot of your code to guard against editing mishaps!).

How do you brainstorm what additional tests to consider? First review the cases supplied with default sanitycheck to see what is already covered. Then consider the full range of inputs/possibilities beyond those simple cases. If the variety of inputs is sufficiently small, you may be able to enumerate a test case for each possibility. In other situations, exhaustive coverage is not feasible, so you instead identify representative samples to validate the operation on specific cases. You should target ordinary cases as well as edge conditions and unusual inputs.

For example, consider testing sat. Your tests should cover a variety of different bitwidths, including extreme values, and on sums within range as well as those that overflow. For automata, look carefully at where you have special cases in advancing the generation, and set up test cases that explicitly poke at those issues. There are some 65,000 possible code points you could feed to utf8, but given that they all flow through just a few distinct paths, having a test case targeted to hit each of those paths, with a little extra attention at the boundaries, should be sufficient to give you confidence about the rest.

For more advice, check out our guide to testing, from the assignments dropdown.

Style Check

We have added a new tool, stylecheck, that can check your code for common style issues. You can run it like this from within your assignment folder:

samples/stylecheck.py

This check is NOT exhaustive; it is only meant to identify common style issues. Check the course style guide and assignment spec for more style guidelines. We hope you find this helpful!

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. We recommend you do a trial submit in advance of the deadline to familiarize yourself with the process and allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

When you submit, you may optionally indicate that you do not plan to make a submission after the on-time deadline. This allows the staff to start grading some submissions as soon as the on-time deadline passes, rather than waiting until after the late period to start grading.

  • When in doubt, it's fine to indicate that you may make a late submission, even if you end up submitting on time
  • If you do indicate you won't submit late, this means once the on-time deadline passes, you cannot submit again. You can resubmit any time before the on-time deadline, however.
  • If you want to change your decision, you can do so any time before the on-time deadline by resubmitting and changing your answer.
  • If you know that you will not make a late submission, we would appreciate you indicating this so that we can grade assignments more quickly!

You should only need to modify the following files for this assignment: readme.txt, sat.c, automata.c, utf8.c, custom_tests

We would also appreciate if you filled out this homework survey to tell us what you think once you submit. We appreciate your feedback!

Grading

Below is the tentative grading rubric. We use a combination of automated tests and human review to evaluate your submission. More details are given in our page linked to from the Assignments dropdown explaining How assignments are graded.

Readme Answers (10 points)

Code Functionality (80 points)

  • Sanity cases. (30 points) The default sanity check tests are used for sanity grading.
  • Comprehensive cases. (45 points) We will thoroughly test on a variety of an additional inputs ranging from simple to complex, including edge conditions where appropriate.
  • Custom sanitycheck tests (5 points) Your custom_tests file should include at least 10 tests of your own, 3-5 per program, that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.
  • Clean compile. (2 points) We expect your code to compile cleanly without warnings.

Code Quality (buckets weighted to contribute ~10 points)

The grader's code review is scored into buckets to emphasize the qualitative features of the review over the quantitative. The styleguide is a great overall resource for good program style. Here are some highlights for this assignment:

  • Bitwise manipulation. We expect you to show proficiency in bitwise manipulation through clean construction and use of masks and proper application of the bitwise operations.
  • Algorithms. Your chosen approach should demonstrate that you understand how to leverage bit patterns and numeric representation to directly and efficiently accomplish the task. Avoid unnecessary divergence and special-case code; unify into one general-purpose path wherever possible.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You should document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have completed your first C programs manipulating bits, and understood more about the limitations of integer representations! This assignment may definitely have a bit of a learning curve getting used to the write/compile/debug process in C and the command line. Celebrate this important milestone; the skills you develop now will pay off throughout the quarter!

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task.

  • Consider all negative ints whose absolute value is an exact power of two (-8, -32 and so on). What do their bit patterns have in common?
  • How can you determine if an integer bit pattern has a pair of consecutive 'on' bits? (Straightforward to do using a loop, but there is also a clever single expression...)
  • How can integer limitations manifest themselves in real-world software?
  • A treasure trove of bit wizardry curated by Sean Anderson, Stanford CS PhD.

Advice

To help approach this assignment, we've compiled some tips that may help with the various problems, as well as bit manipulation in general.

Don't use 0b-prefixed literals. There is no official support in C for numbers written in binary with 0b. We used this form in lecture to explicitly write out binary numbers, but it's not something you can use in standard C programs. Instead, you should... (see the next part)

Embrace hexadecimal. The base-16 system very conveniently translates to and from binary -- each byte is expressed in two hex digits, one hex digit for the 4 high-order bits, the other for the 4 low bits. Masks in hex or expressed via their construction are more evocative of the binary representation and help you avoid mistakes. For example, a mask to extract the sign bit from a 32-bit value can be cleanly expressed as 1 << 31 or 0x80000000 and both are quite readable. The mask also happens to be decimal 2147483648U, but who can tell? Would you notice if you had accidentally mistyped 2147486348U? Say no to decimal, hex is your pal!

Learn your bitwise operators. It is wasteful to invoke the (expensive, floating-point) pow function to compute integer powers of two: 1 << 10 does a perfect job of computing pow(2, 10) and requires way fewer compute cycles. The bitwise operations are also the right tools to rearrange/isolate/change bits as opposed to the more indirect use of integer multiply/divide/mod with powers of two.

Aim for directness. There may be more direct ways to write certain bit manipulations. Here are a few:

No left-shift right-shift. In the code below, the first version right-shifts then immediately left-shifts back. This seems to just cycle the bits back to where they were originally, but in fact the right-shift will discard the lsb and the left-shift will then zero-fill. This is roundabout way to wipe the lsb, but the better approach is to just directly mask it off as in second version:

    result = (val >> 1) << 1;       // original version

     // Better version:
     result = val & ~1;             // better: directly turn off lsb

No redundant/unnecessary ops. Imagine your goal is to extract the 4 high-order bits of a byte and shift them down to the lower half-byte. You could do it with two steps (mask and shift), but in fact, the shift will already discard the low-order bits, so there's no need to mask them off first!

    result = (val & 0xf0) >> 4;  // mask off lower 4, then shift down

    result = val >> 4;            // better: directly shift, 
                                  // low-order bits gone!

Know your identities/relationships. When putting together a bitwise construction, stop and consider whether there is a more direct way to accomplish the same thing, e.g.

    b = a | 0                   // this is no op!
    b = a ^ -1                  // more clearly written as b = ~a
    b = ~a + 1                  // more clearly written as b = -a
    b = (1U << 31) - 1          // INT_MAX from <limits.h> 
    b = INT_MIN >> 31           // This is -1!

At first you may find yourself using a longer expression than needed, but reviewing in a later pass, you can spot these issues and make them more concise. The goal is eventually to go straight to the most direct expressions from the beginning.

Frequently Asked Questions

How do I display values in binary?

In gdb, p/t prints in the binary format. The /t format also works for the gdb x and display commands. The C printf function does not have a binary format, but %x will print in hex, which is the next best thing.

Are there any differences in the bitwise operations when applied to signed versus unsigned values?

The bitwise operations & | ^ ~ << operate identically on signed versus unsigned types. The right shift operator >> comes in two flavors: logical and arithmetic. Check out the lecture materials for more information!

I'm getting the error warning: left shift count >= width of type. What does this mean?

Code such as this will produce that warning:

long val = 1 << 32;

An integer constant defaults to type int, so shifting a 32-bit int by 32 positions would effectively shift off all the bits, and thus the compiler warns about it. Assigning the result to a variable of type long (64 bits) doesn't help; you have to start with a 64-bit value if you intend to compute a 64-bit result. Read the next question!

What bitwidth and signedness does a numeric integer constant have?

An integer literal is a signed int by default. Suffixes can be added to the constant to dictate its type. A suffix of L indicates the type is long and U (either by itself or with L) indicates it is unsigned. E.g. 1UL means a literal 1 that is an unsigned long.

What are the rules for integer promotion in C?

When assigning from narrower bitwidth (e.g. char, short) to a wider one, the source bytes are copied to the low-order bytes of the destination, and the upper bytes are assigned based on the promotion rules. If the source type was unsigned, the upper bytes of destination are zero-filled. If the source type was signed, its sign bit (the msb) is replicated across the upper bytes of destination (this is called sign extension). It may help to trace through this and confirm for yourself how/why this is value-preserving. If an operation is performed on two operands of mixed type, integer promotion is implicitly applied to the narrower first. It also the case (and a bit surprising, too) that integer promotion is also applied to types narrower than int when an operation is performed on them, even if no other wider type is involved. For example, the bitshift and invert operations promote the operand to an int, so bit-shifting a char will promote to an int before shifting, bit-inverting of a char similarly promotes, and the default result type of these expressions will be int. If the bitwidth of the operand was larger than int, the bitwidth will be unchanged by these promotion rules.

My Unicode characters are not printing or appear garbled. How can I fix this?

If the characters output by the UTF-8 program do not appear correctly on your terminal, but the encoded hex bytes are correct, you may need to set your terminal to expect UTF-8. Changing the configuration is OS/terminal-specific, but within 5 minutes of Googling we found utf8 advice for Putty, MobaXterm and Xterm. The default Mac OS X Terminal.app works by default. Note that sanity check should still pass even if your terminal is showing the wrong character.