Assign7: Huffman coding
Huffman coding is an algorithm devised by David Huffman in 1952 for compressing data. This simple and elegant approach is powerful enough that variants of it are still used today in computer networks, fax machines, modems, HDTV, and other areas. You will be impressed with its clever use of trees and your ability to implement such a nifty tool. What a great capstone to your awesome quarter in CS106B!
Background
Please review the May 29 lecture on Huffman coding and this supplemental document with background on Huffman coding. Also be sure to have completed the preparatory warmup exercises.
The rest of this handout assumes familiarity with that content and will not repeat it.
Review starter code
Before you get started in earnest, take stock of the files in the starter project and what's provided to you. Here are a few things to be sure you are acquainted with:
- treenode.h defines the EncodingTreeNode type you will use for representing encoding trees.
- bits.h defines the Bit type that is used to represent a single value of 0 or 1. A Bit can be used somewhat interchangeably with the integers 0 and 1:

    Queue<Bit> q; Bit b = 0; q.enqueue(0); if (q.dequeue() == 1) { }

  The special feature of Bit is that it is restricted to only allow the values 0 or 1; any other value will raise an error. (Stray bit values were a surprisingly common error in the past and the source of much grief, so we introduced this constrained type to help you avoid those issues.) bits.h also defines the EncodedData struct for the data stored in a Huffman-compressed file.
- main.cpp contains an interactive main program that uses your functions to provide end-to-end file compression and decompression.
- huffman.cpp is where you will implement the required functions for Huffman compression and decompression. The function interfaces are given to you and should not be changed. There are also some provided test cases that you can use and augment with your own student tests.
- When constructing a Huffman coding tree, you will need a priority queue. The PQHeap you implemented for Assignment 5 is close to what you need, but it only stored elements of type int. The priorityqueue.h interface in the Stanford library provides a general-purpose PriorityQueue implemented as an efficient binary min-heap. Its design is similar to your PQHeap, with a few differences of note: it is defined as a template and thus capable of storing the client's choice of element type, and it uses priorities of type double instead of int. Consult the PriorityQueue documentation for more details. In particular, make sure you know how to enqueue and dequeue elements and how to determine the priorities of relevant items. The elements you store in the priority queue will be trees, that is to say, the element type is EncodingTreeNode* (see the sketch after this list).
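Here is a minimal sketch of storing trees in a PriorityQueue, weighted by character frequency. The variables someLeaf and someOtherLeaf stand in for trees you have built; they are illustrative, not part of the starter code:

    PriorityQueue<EncodingTreeNode*> pq;
    pq.enqueue(someLeaf, 5);                   // weight a tree by its character frequency
    pq.enqueue(someOtherLeaf, 2);
    double smallestWeight = pq.peekPriority(); // 2: priority of the lowest-weight tree
    EncodingTreeNode* smallest = pq.dequeue(); // removes someOtherLeaf, the lowest-weight tree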
Strategy
Ultimately you will develop compress and decompress operations that apply Huffman coding to an input. This is a fairly complex goal, and it will be more manageable if you decompose it into smaller tasks. To provide some structure, we have decomposed the project into a number of helper functions. You'll build the assignment bottom-up, writing and testing individual functions first and then combining them into the final program.
In terms of where to start, we recommend you first tackle the decode/unflatten/decompress tasks. This may seem a little backwards, but it turns out to be a great strategy. The code to decode/decompress is a bit simpler than encode/compress, and it is easier to verify the correctness of its output, since its result is the original text.
Then go on to implement the generate/encode/flatten/compress tasks. Because you already have a working decode/unflatten/etc., you can feed the output from your encode operation back into your decode as part of your testing.
At the end, use our provided main program to do end-to-end file compression and decompression and bask in awe of the impressive tool you have created!
Decode/decompress
We recommend that you implement the functions in order listed below:
string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits)
The decodeText function is a gentle introduction to writing code that operates on trees.
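One possible shape for the decoding loop is sketched below. The member names ch, zero, and one are assumptions here; check treenode.h for the actual names. The sketch also assumes a well-formed tree whose root is an interior node:

    string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits) {
        string result;
        EncodingTreeNode* cur = tree;
        while (!messageBits.isEmpty()) {
            Bit b = messageBits.dequeue();
            cur = (b == 0) ? cur->zero : cur->one;             // follow one link per bit
            if (cur->zero == nullptr && cur->one == nullptr) { // reached a leaf
                result += cur->ch;                             // emit its character
                cur = tree;                                    // restart at the root
            }
        }
        return result;
    }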
EncodingTreeNode* unflattenTree(Queue<Bit>& treeBits, Queue<char>& treeLeaves)
If needed, revisit the warmup to practice the operation on paper before trying to code unflattenTree. This task is all about tree traversal.
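A recursive sketch follows. It assumes the pre-order flattening convention from the warmup (a 1 bit for each interior node, a 0 bit for each leaf, with leaf characters listed in traversal order) and assumes EncodingTreeNode has a leaf constructor taking a char and an interior constructor taking two children; adjust to match treenode.h:

    EncodingTreeNode* unflattenTree(Queue<Bit>& treeBits, Queue<char>& treeLeaves) {
        if (treeBits.dequeue() == 0) {                                // 0 marks a leaf
            return new EncodingTreeNode(treeLeaves.dequeue());
        }
        EncodingTreeNode* zero = unflattenTree(treeBits, treeLeaves); // 1 marks an interior
        EncodingTreeNode* one = unflattenTree(treeBits, treeLeaves);  // node: rebuild children
        return new EncodingTreeNode(zero, one);
    }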
string decompress(EncodedData& data)
The final decompress sounds like it would be the hard one, but it's simply stringing together the first two, woo hoo!
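For example, the body can be little more than the sketch below. The EncodedData field names treeBits, treeLeaves, and messageBits, along with the cleanup helper deallocateTree, are assumptions here; substitute the names from bits.h and your own tree-cleanup code:

    string decompress(EncodedData& data) {
        EncodingTreeNode* tree = unflattenTree(data.treeBits, data.treeLeaves);
        string text = decodeText(tree, data.messageBits);
        deallocateTree(tree);   // free the tree to avoid a memory leak
        return text;
    }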
Although we say that these functions can assume that the inputs they receive will be valid and well-formed, it may be wise to write your code defensively anyway. If you verify those assumptions hold and raise an error if not, this may help you out later. We won't test on invalid inputs, but while you are in development, you might very well accidentally slip up and feed your function a bad input, and your efforts to defend against that could save you some time debugging an incorrect test case.
Encode/compress
We recommend that you implement the functions in order listed below:
Queue<Bit> encodeText(EncodingTreeNode* tree, string text)
It is significantly easier to implement encodeText if you start by making a table that explicitly associates each character with its bit sequence from the tree. You can then encode each character from the text by reading its sequence from the table, rather than having to repeatedly search the tree for it.
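A sketch of that table-first approach follows. The helper buildTable is hypothetical (not part of the starter code), and the member names ch, zero, and one are again assumptions; check treenode.h:

    // Hypothetical helper: records the bit path to every leaf in a map.
    void buildTable(EncodingTreeNode* tree, Vector<Bit> path, Map<char, Vector<Bit>>& table) {
        if (tree->zero == nullptr && tree->one == nullptr) {
            table[tree->ch] = path;          // path from root to this leaf is its code
            return;
        }
        path.add(0);
        buildTable(tree->zero, path, table); // 0 bit for the zero subtree
        path[path.size() - 1] = 1;
        buildTable(tree->one, path, table);  // 1 bit for the one subtree
    }

    Queue<Bit> encodeText(EncodingTreeNode* tree, string text) {
        Map<char, Vector<Bit>> table;
        buildTable(tree, {}, table);
        Queue<Bit> result;
        for (char ch : text) {
            for (Bit b : table[ch]) {        // copy each character's code to the output
                result.enqueue(b);
            }
        }
        return result;
    }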
void flattenTree(EncodingTreeNode* tree, Queue<Bit>& treeBits, Queue<char>& treeLeaves)
This function is back to being all about tree traversals – once you are done writing flattenTree you will be a tree traversal expert!
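As a sketch, flattening is the mirror image of unflattenTree, using the same pre-order convention assumed above (1 for an interior node, 0 for a leaf):

    void flattenTree(EncodingTreeNode* tree, Queue<Bit>& treeBits, Queue<char>& treeLeaves) {
        if (tree->zero == nullptr && tree->one == nullptr) {
            treeBits.enqueue(0);              // 0 marks a leaf...
            treeLeaves.enqueue(tree->ch);     // ...whose character goes to treeLeaves
        } else {
            treeBits.enqueue(1);              // 1 marks an interior node
            flattenTree(tree->zero, treeBits, treeLeaves);
            flattenTree(tree->one, treeBits, treeLeaves);
        }
    }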
EncodingTreeNode* buildHuffmanTree(string text)
The buildHuffmanTree function is the workhorse of the entire program. It takes the input text and follows the Huffman algorithm to construct an optimal Huffman coding tree for the input.
Keep in mind that there can be many equally good Huffman trees for a given string, differing only in how the algorithm broke ties. When writing your own test cases, take care to recognize how tie-breaking affects the resulting tree. It is a bummer to get mired in trying to resolve why your tree is not equal to the expected one, only to realize your tree is valid and equally optimal, just different.
You should be able to handle strings made of any characters, not just letters, numbers, etc. In particular, you can’t reserve certain char values to mean “this is an internal node.” Many files on disk – especially files containing non-text data – make use of all possible characters.
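One way the core loop can look is sketched below (the leaf and interior constructors are assumptions, as above; the real function should also first verify that the text contains at least two distinct characters):

    EncodingTreeNode* buildHuffmanTree(string text) {
        Map<char, int> freq;                         // tally each character's frequency
        for (char ch : text) freq[ch]++;
        PriorityQueue<EncodingTreeNode*> pq;         // seed with one leaf per distinct char
        for (char ch : freq) {
            pq.enqueue(new EncodingTreeNode(ch), freq[ch]);
        }
        while (pq.size() > 1) {                      // repeatedly join the two lowest-weight
            double w1 = pq.peekPriority();           // trees into one, reweighted by the sum
            EncodingTreeNode* t1 = pq.dequeue();
            double w2 = pq.peekPriority();
            EncodingTreeNode* t2 = pq.dequeue();
            pq.enqueue(new EncodingTreeNode(t1, t2), w1 + w2);
        }
        return pq.dequeue();                         // last remaining tree is the result
    }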
EncodedData compress(string messageText)
This final compress pulls together the above operations into a complete whole. In doing so, you may discover that there were some bugs lurking in your implementation, which you'll need to correct to get everything working. (It's common in software engineering to find that each individual piece of a program passes its tests independently but fails when things come together; it's usually due either to tests not covering every possible case or to some unexpected interactions between the components.) When that happens, use the debugger to isolate where the issue is. Did you build the encoding tree incorrectly? Did you encode the message bits wrong, or is the issue in the decoder?
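Mirroring decompress, compress can be little more than the sketch below (again, the EncodedData field names and deallocateTree are assumptions to adjust to your code):

    EncodedData compress(string messageText) {
        EncodingTreeNode* tree = buildHuffmanTree(messageText);
        EncodedData data;
        data.messageBits = encodeText(tree, messageText);
        flattenTree(tree, data.treeBits, data.treeLeaves);
        deallocateTree(tree);   // no leaks: free the tree once its data is captured
        return data;
    }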
Testing
The starter code includes provided tests that exercise the basic operation of each function on a simple known input. These cases are good for early testing. You may want to supplement them with other small, similarly targeted test cases. However, constructing large-scale tests by hand is quite tedious and not something we recommend. When ready for more comprehensive tests, move on to an end-to-end test strategy across the full compress -> decompress cycle. We also provide a sample main program that runs compress -> decompress on files, which gives you easy access to larger and more varied test inputs than would be reasonable to construct by hand.
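For example, a round-trip test might look like this (written with the same SimpleTest macros used by the provided tests):

    STUDENT_TEST("compress followed by decompress returns the original text") {
        string input = "happy hip hop";
        EncodedData data = compress(input);
        EXPECT_EQUAL(decompress(data), input);
    }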
Advice
- Develop piecewise: Do not try to write all the functions in one go and debug the entire program at once. Concentrate on one function in isolation: write, test, and debug it thoroughly, and use test cases to verify it before moving on to the next.
- Testing: Do start on small files (one word, two words, a sentence) that you can easily verify with hand-tracing before attempting to compress lengthy Russian novels.
- What sort of files do you expect Huffman to be particularly effective at compressing? On what sort of files will it be less effective? Are there files that grow instead of shrink when Huffman encoded? Consider creating sample files to test out your theories.
- Your implementation should be robust enough to compress all types of files: text, binary, image, or even a file it has previously compressed. Your program probably won't be able to further squish an already compressed file (in fact, the file can get larger because of the additional overhead of the encoding table), but it should be possible to compress a file multiple times, decompress the same number of times, and return to the original file.
- We provide code that can read and write EncodedData to and from a stream. The internal logic of how this works is more of a CS107-level topic that involves manipulating individual bits, which is beyond the scope of what we're going to cover in CS106B. However, if you're curious to see how it works, you're welcome to poke around in bits.cpp to learn more!
Style
As your last hurrah, this is your chance to show your section leader that you've been absorbing all that great feedback on style from your IGs. Let's see your best work!
- Don't modify required function headers: Please do not make modifications to the required functions' names, parameter types, or return types. Our client code should be able to call the functions successfully without any modification.
- Reuse instead of repeat: Redundancy is another major grading focus; avoid repeated logic as much as possible. If two of your functions are similar, have one call the other, or utilize a common helper function.
- No global variables: As always, all program data should be properly passed in and out of functions, not declared or accessed globally.
- No memory leaks: Your code should have no memory leaks, including no memory leaks in test cases.
Extensions
There are many, many other techniques based on Huffman encoding that you can use to get even better data compression. Here are a few concepts to look up if you'd like to learn more about this.
- Adaptive Huffman coding. Huffman encoding works by looking at the entire piece of text to compress, finding the global frequencies of each character, and then building a single encoding tree. However, in some files – especially image data – different regions of the file will likely have wildly different frequencies of each letter. For example, in a picture of a yellow sunflower against a bright blue sky, the parts of the file that store the sky will have most of their bits dedicated to storing different shades of blue, and the parts of the file storing the sunflower will primarily store bits used to hold shades of yellow. For such patterns, it can be worth the effort to adjust the encoding while running through the file. Adaptive Huffman coding slices the file into smaller sections and uses a different encoding tree per section.
- LZW compression. The LZW algorithm was developed by Lempel, Ziv, and Welch. This extension of Huffman coding achieves higher compression by additionally assigning short codes to common letter sequences, instead of just common letters. For example, the letter “r” is often followed by the letter “e”, so we could treat the sequence “re” as just another letter when assigning codes.
- Shannon entropy. Huffman encoding is based on the assumption that, when reading characters from the file, each character is chosen randomly, with the likelihood of any character coming up being directly proportional to the number of times that character appears in the overall file. Under this assumption, you can show that the theoretical limit of any compression algorithm for a file is related to a quantity called the Shannon entropy of the distribution. The Shannon entropy is actually relatively easy to compute (see the sketch after this list), and if you're looking for a nifty “how does theory meet practice” exercise, take a look at the file sizes generated by Huffman encoding and see how they compare against the theoretical lower bound.
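As a sketch of that exercise: the entropy in bits per character is H = -(sum over distinct characters c of p(c) * log2(p(c))), and H times the text length is a lower bound on the message bits. The helper below is illustrative, not part of the starter code (it uses log2 from <cmath> and the Stanford Map):

    // Illustrative: Shannon entropy of a string, in bits per character.
    double shannonEntropy(string text) {
        Map<char, int> freq;
        for (char ch : text) freq[ch]++;
        double h = 0;
        for (char ch : freq) {
            double p = freq[ch] / (double) text.length();
            h -= p * log2(p);               // each character contributes -p log2 p
        }
        return h;   // multiply by text.length() for the total-bits lower bound
    }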