Huffman coding


Thanks to Keith Schwarz for the structure of the tree flatten/unflatten operations.

Huffman coding is an algorithm devised by David Huffman in 1952 for compressing data. This simple and elegant approach is powerful enough that variants of it are still used today in computer networks, fax machines, modems, HDTV, and other areas. You will be impressed with its clever use of trees and your ability to implement such a nifty tool!

Huffman warmup

Be sure to have completed the preparatory warmup exercises.

Review starter code

Before you get started in earnest, take stock of the files in the starter project and what's provided to you. Here are some details to become acquainted with:

  • treenode.h defines the EncodingTreeNode struct you will use as the node type for an encoding tree. You were introduced to this type in the warmup.
  • bits.h defines the Bit type that is used to represent a single value of 0 or 1. A Bit can be used somewhat interchangeably with the integers 0 and 1:

      Queue<Bit> q;
      Bit b = 0;   
      q.enqueue(0);   
      if (q.dequeue() == 1) { }
    

    The special feature of Bit is that it is restricted to only allow values 0 or 1. Any other values will raise an error (this was a surprisingly common error in the past and the source of much grief; thus we have introduced this constrained type to help you avoid these issues).

  • bits.h also defines the EncodedData struct for the data stored in a Huffman compressed file.
  • In bits.cpp, we provide code to read and write bits to and from a stream as this is material beyond the scope of CS106B. The code uses the C++ features for bit-manipulation, a topic that you'll explore if you go on to CS107. However, if you’re curious to get a jump on it, you’re welcome to poke around in bits.cpp to check it out.
  • main.cpp contains an interactive console main program to do end-to-end file compression and decompression by calling your functions.
  • huffman.cpp is where you will implement the required functions for Huffman compression and decompression and add your student test cases.
  • When constructing a Huffman coding tree, you will need a priority queue. The PQHeap you previously implemented is close to what you need, but it only stored elements of type int. The priorityqueue.h interface in the Stanford library provides a general-purpose PriorityQueue implemented as an efficient binary min-heap. Its design is similar to your PQHeap but has a few differences of note: it is defined as a template and thus is capable of storing the client's choice of element type, and it uses priorities of type double instead of int. Consult the PriorityQueue documentation for more details. In particular, make sure you know how to enqueue and dequeue elements and how to determine the priorities of relevant items. The elements you store in the priority queue will be trees; that is to say, the element type is EncodingTreeNode*.
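
To get a feel for the interface, here is a minimal sketch of the PriorityQueue operations you are likely to need. It assumes the EncodingTreeNode leaf constructor introduced in the warmup, and the characters and counts are made up purely for illustration; consult treenode.h and the PriorityQueue documentation for the exact details.

    #include "priorityqueue.h"
    #include "treenode.h"

    PriorityQueue<EncodingTreeNode*> pq;

    // Enqueue singleton leaf trees, using each character's frequency
    // count as its priority (assumes a leaf constructor taking a char).
    pq.enqueue(new EncodingTreeNode('a'), 3);   // 'a' occurs 3 times
    pq.enqueue(new EncodingTreeNode('b'), 7);   // 'b' occurs 7 times

    // dequeue removes the element with the smallest priority, so the
    // leaf for 'a' comes out first here.
    EncodingTreeNode* smallest = pq.dequeue();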

Strategy

"I think that little by little I’ll be able to solve my problems" – wise words from Frida Kahlo

Ultimately, the goal is to develop a full compress and decompress pipeline, but your job will be more manageable if you break it down into steps and develop in stages. We have decomposed the project into a number of helper functions and propose an order of implementation that works from the bottom up, writing and testing individual functions and then moving on to combine them into the final program.

We recommend you first tackle the decode/unflatten/decompress tasks. This may seem a little backwards, but it turns out to be a great strategy. The code to decode/decompress is a bit simpler than encode/compress, and it is more straightforward to verify the correctness of its output because it restores the original text.

Then when you move on to implement encode/flatten/compress, you will already have a working decode/unflatten/decompress, which means you can feed the output from your encode operation back into your decode for testing.

With both decompress and compress working, you can use our provided main program to do end-to-end file compression and decompression and then bask in awe of the impressive tool you have created!

Decode/decompress

We recommend that you implement the functions in the order listed below:

  • string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits)
    

    The decodeText function is a good introduction to writing code that operates on trees. To review decoding using a tree, look back to Q1 of the warmup. Be sure to test! The starter code has one provided test for decodeText; augment it with student tests of your own to achieve full coverage. (A rough sketch of the decoding traversal appears after this list.)

  • EncodingTreeNode* unflattenTree(Queue<Bit>& treeShape, Queue<char>& treeLeaves)
    

    If needed, revisit Q5 of the warmup to practice on paper with the unflatten operation before trying to code unflattenTree. This task is all about tree traversal and tree construction. You may find it easier to confirm that the unflattened tree is correct by exploring the tree in the debugger than by writing test cases.

  • string decompress(EncodedData& data)
    

    The final decompress sounds like it would be a difficult task, but you've already done all the hard work, so all that's left to do is string the functions together, woo hoo! The trees from the warmup make good candidates for simple test cases.
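
As promised above, here is a rough sketch of the decoding traversal used by decodeText. It is not the only valid structure, and it assumes the tree exposes zero/one child pointers plus isLeaf() and getChar() helpers; adapt the details to whatever treenode.h actually declares.

    #include "bits.h"
    #include "treenode.h"

    string decodeText(EncodingTreeNode* tree, Queue<Bit>& messageBits) {
        string result;
        EncodingTreeNode* cur = tree;               // start each character at the root
        while (!messageBits.isEmpty()) {
            Bit b = messageBits.dequeue();
            cur = (b == 0) ? cur->zero : cur->one;  // 0 follows the zero child, 1 the one child
            if (cur->isLeaf()) {                    // reached a leaf: emit its character
                result += cur->getChar();
                cur = tree;                         // and restart at the root
            }
        }
        return result;
    }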

Decode a mystery

Now that you have a working decompress, use it to unpack the compressed mystery file we included in the starter project.

Run your program, and when the main menu prompts you for which tests to run, respond 0 (None). The main program then enters a simple console loop that lets you select a file to compress or decompress. The starter project contains a compressed file you can use to try out your decompress function.

When prompted to enter the input filename, specify the name relative to the project folder, for example res/mystery.jpg.huf (on Windows, use the backslash: res\mystery.jpg.huf). The uncompressed version will be written to a file named res/unhuf.mystery.jpg. Open the uncompressed file to confirm your success at restoring the original. Neat!

Working versions of decode/unflatten/decompress and the full decompression pipeline will be a big help when writing and testing the encode/compress side coming up next.

Encode/compress

We recommend that you implement the functions in the order listed below:

  • Queue<Bit> encodeText(EncodingTreeNode* tree, string text)
    

    Review Q2 of the warmup for practice with encoding. It is significantly easier to implement encodeText if you first traverse the tree and build a map that associates each character with its encoded bit sequence. You can then encode a character by accessing its bit sequence from the map, rather than having to repeatedly traverse the tree to find it. (One possible shape of this map-building helper is sketched after this list.) encodeText paired with decodeText gives you a round-trip to use for your test cases.

  • void flattenTree(EncodingTreeNode* tree, Queue<Bit>& treeShape, Queue<char>& treeLeaves)
    

    Review Q4 of the warmup for practice with the flatten operation. This function involves more tree traversals – once you are done writing flattenTree, you will be a tree traversal expert! A good test is to flatten a tree, unflatten the result, and confirm the input and output trees are equal.

  • EncodingTreeNode* buildHuffmanTree(string text)
    

    The buildHuffmanTree function is the workhorse of the entire program. It takes the input text and follows the Huffman algorithm to construct an optimal Huffman coding tree for the input. You practiced this task in Q6 of the warmup.

    When ready to test this function, keep in mind that there can be many valid Huffman trees for a given input, differing only in how the algorithm breaks ties. Be sure to consider how tie-breaking affects the tree outcome when constructing your test cases. It would be a bummer to get mired in trying to resolve why your tree is not equal to an expected output, only to realize that your tree is valid and equally optimal but just different than you thought it would be.

  • EncodedData compress(string messageText)
    

    This final compress function pulls together the above operations into a complete whole. Test cases that run an input through compress -> decompress should reconstruct the original and allow you to confirm the functionality of the entire pipeline. You can also use the console main program to try compressing and decompressing files now.

    When you get to the final assembly, you may discover bugs lurking in functions that were not surfaced earlier. It's not uncommon to find that each individual piece of a program passes its tests independently but the program fails when things come together; this may be an indication that the earlier test coverage was lacking, that there are unexpected interactions between the components that were not previously triggered, or that a nefarious memory bug is lurking. If this happens to you, bust out your superior testing and debugging skills to hunt down those gremlins and eradicate them.
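
As referenced in the encodeText item above, here is one possible shape for the map-building approach. The helper name buildEncodingMap is made up for illustration, and the zero/one child pointers and isLeaf()/getChar() helpers are assumptions about treenode.h; adjust to match the actual struct.

    #include "map.h"
    #include "vector.h"
    #include "bits.h"
    #include "treenode.h"

    // Hypothetical helper: record the bit path taken to reach each leaf character.
    void buildEncodingMap(EncodingTreeNode* tree, Vector<Bit>& path,
                          Map<char, Vector<Bit>>& table) {
        if (tree->isLeaf()) {
            table[tree->getChar()] = path;
            return;
        }
        path.add(0);                                  // a 0 bit follows the zero child
        buildEncodingMap(tree->zero, path, table);
        path.remove(path.size() - 1);
        path.add(1);                                  // a 1 bit follows the one child
        buildEncodingMap(tree->one, path, table);
        path.remove(path.size() - 1);
    }

    Queue<Bit> encodeText(EncodingTreeNode* tree, string text) {
        Map<char, Vector<Bit>> table;
        Vector<Bit> path;
        buildEncodingMap(tree, path, table);

        Queue<Bit> result;
        for (char ch : text) {
            for (Bit b : table[ch]) {                 // look up the precomputed bit sequence
                result.enqueue(b);
            }
        }
        return result;
    }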

We included a variety of files in the res folder of the starter project to use as inputs to compress. You can also add files of your own for testing.

Encode a message for your SL

Compress a secret message to send your section leader as part of your submission. In the res subfolder of the starter project, you'll find a file named secretmessage.txt. Edit that file to contain your message. The message can be anything (non-offensive ☺) that you like.

Use your program to compress the file and save the result to secretmessage.txt.huf. Submit the compressed huf file to Paperless along with your code. Your section leader will uncompress your message with your program and read it while grading.

Specifications

  • Do not change prototypes for any of the required functions. This applies to both the utility functions described in warmup and the functions described in this document.
  • The documentation for the individual functions (decode, unflatten, etc.) states what assumptions you can make about the inputs being valid and well-formed. Even so, it may be wise to write your code defensively anyway. If you check the required assumptions and raise an error when they are violated, this may help you out later. We won't test on invalid inputs, but while you are in development, you might accidentally slip up and feed your function a bad input. Your efforts to defend against that misstep could save you some time debugging an incorrect test case.
  • You should be able to handle inputs consisting of any characters, not just letters, numbers, etc. Many files, especially those containing non-text data such as sounds and images, make use of all possible character values from 0 to 255, so you must not place any restrictions on what are considered allowable characters.
  • The input to be compressed is required to contain at least two distinct characters to be Huffman-encodable. If the given input does not meet this requirement, you should reject it and report an error. (A sketch of one way to perform this check appears after this list.)
  • Your code should have no memory leaks, including no memory leaks in test cases.
  • Consider this assignment an opportunity to show your section leader that you've been absorbing all that great feedback on style from your IGs. Let's see your best work on writing the beautiful code we know you are capable of!
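
Related to the requirement above about rejecting inputs with fewer than two distinct characters, here is a minimal sketch of one way to perform that check at the start of compress. The helper name confirmAtLeastTwoDistinct is made up for illustration; error() comes from the Stanford error.h header.

    #include "error.h"
    #include "set.h"

    // Hypothetical helper: raise an error unless the text contains
    // at least two distinct characters.
    void confirmAtLeastTwoDistinct(string messageText) {
        Set<char> distinct;
        for (char ch : messageText) {
            distinct.add(ch);
        }
        if (distinct.size() < 2) {
            error("compress: input must contain at least two distinct characters");
        }
    }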

Testing

  • The starter code includes provided tests that test basic operation of each function on a simple known input. These cases are good for early testing. You should supplement with other small test cases that are similarly very targeted.
  • Constructing large-scale tests by hand is quite tedious and not something we recommend. When you are ready for more comprehensive tests, instead move on to an end-to-end test strategy across the full compress -> decompress cycle. Our provided main console program runs compress -> decompress on files, which gives you easy access to larger and more varied test inputs than would be reasonable to construct by hand. (A round-trip test sketch appears after this list.)
  • Your implementation should be robust enough to compress all types of files: text, binary, image, or even one that has previously been compressed. Your program probably won’t be able to further squish an already compressed file (and in fact, it can get larger because of the additional overhead of the flattened tree), but it should be possible to compress a file several times over, decompress it the same number of times, and get back the original file.
  • On what sorts of files do you expect Huffman to achieve particularly good compression? On what sorts of files will it be less effective? Are there files that grow instead of shrink when they are Huffman encoded? Create sample files to test out your theories.
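
As mentioned in the end-to-end testing bullet above, a round-trip student test might look roughly like this, assuming the SimpleTest macros (STUDENT_TEST, EXPECT_EQUAL) already used by the provided tests in the starter code:

    STUDENT_TEST("compress -> decompress round trip restores the original text") {
        string input = "happy hip hop";      // any input with at least two distinct characters
        EncodedData data = compress(input);
        string output = decompress(data);
        EXPECT_EQUAL(output, input);
    }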

Extensions

There are many, many other techniques based on Huffman encoding that you can use to get even better data compression. Here are a few concepts to look up if you’d like to learn more about this.

  • Adaptive Huffman coding. Huffman encoding works by looking at the entire piece of text to compress, finding the global frequencies of each character, and then building a single encoding tree. However, in some files – especially image data – different regions of the file will likely have wildly different frequencies of each letter. For example, in a picture of a yellow sunflower against a bright blue sky, the parts of the file that store the sky data will have most of their bits dedicated to storing different shades of blue, and the parts of the file storing the sunflower will primarily store bits used to hold shades of yellow. For such patterns, it can be worth the effort to adjust the encoding while running through the file. An adaptive Huffman coding slices the file into smaller sections and uses different encoding trees per section.
  • LZW compression. The LZW algorithm was developed by Lempel, Ziv, and Welch. This extension on Huffman coding achieves higher compression by additionally assigning short codes to common letter sequences, instead of just common letters. For example, the letter 'r' is often followed by the letter 'e', so we could treat the sequence "re" as just another letter when assigning codes.
  • Shannon entropy. Huffman encoding is based on the assumption that, when reading characters from the file, each character is chosen randomly, with the likelihood of any character coming up being directly proportional to the number of times that letter appears in the overall file. Under this assumption, you can show that the theoretical limit of any compression algorithm for a file is related to a quantity called the Shannon entropy of the distribution. The Shannon entropy is actually relatively easy to compute, and if you’re looking for a nifty “how does theory meet practice” exercise, take a look at the file sizes generated by Huffman encoding and see how they compare against the theoretical lower bound.
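
If you want to try that comparison, here is a small sketch of computing the Shannon entropy of a text, H = -sum over characters c of p(c) * log2 p(c); multiplying H by the text length gives a lower bound on the encoded size in bits (ignoring the overhead of the flattened tree). The function name shannonEntropy is made up for illustration.

    #include <cmath>
    #include "map.h"

    // Hypothetical helper: Shannon entropy in bits per character.
    double shannonEntropy(string text) {
        Map<char, int> counts;
        for (char ch : text) {
            counts[ch]++;
        }
        double entropy = 0.0;
        for (char ch : counts) {                          // iterates the keys of the map
            double p = counts[ch] / (double) text.size();
            entropy -= p * log2(p);
        }
        return entropy;
    }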