Huffman tree warmup


In these exercises, you will practice with Huffman trees to establish a solid understanding of the Huffman algorithm before you start implementing the program.

Huffman coding was introduced in Monday's lecture. We have also posted a handout on Huffman that repeats the background info if reading works better for you than listening and watching. The assignment writeup assumes familiarity with this background content on Huffman and does not repeat it.

Encoding and decoding using an encoding tree

First, let's practice decoding a bit sequence using an encoding tree The diagram below is an encoding tree for the characters O N M and S. Each leaf node corresponds to a character. The path from the root to a leaf node traces the sequence of bits that encode the node's character. In the diagram, we marked interior nodes with * for visualization purposes; but an interior node does not store a character and the path from root to an interior node is only a partial encoding path.

             *
            / \
           *   O
          / \
         N   *  
            / \
           M   S

We label the leftward branch zero and rightward one. The path from the root node to the leaf node S traces left-right-right which corresponds to the bit sequence 011.

Answer the following question in short_answer.txt:

Q1. Use the above encoding tree to decode the bit sequence 0101100011.

Q2. Prepare a table for the above encoding tree that lists each character with its assigned bit sequence. Use your table to encode the string "SONS".

As you work these operations manually, take a moment to consider what will be required to do these same operations in code.

Q3. Huffman codes obey the prefix property: no character's encoded bit sequence is a prefix of any other. What feature of an encoding tree demonstrates that it obeys the prefix property?

Flattening and unflattening an encoding tree

One of the practical concerns of Huffman coding is the compressed data must include information about the encoding tree used, since that encoding is unique to the input data. It is not possible to decode the bit sequences without knowledge of the original encoding tree.

We want to write the Huffman tree into the compressed file, but it isn't possible to write a tree directly. Therefore, we must come up with a way to "flatten" the tree into a form that records the data and structure of the tree and allows you to later reconstruct the tree.

There are a variety of ways to flatten the tree, but one tidy and compact approach is to summarize the tree as two sequences: a sequence of bits that give the shape of the tree and a sequence of characters that correspond to the tree leaves.

  1. The tree shape is flattened into a sequence of bits as follows:

    • If the root of the tree is a leaf node, it’s represented by the bit 0.
    • If the root of the tree is not a leaf node, it’s represented by a 1 bit, followed by the flattened version of its zero (left) subtree, followed by the flattened one (right) subtree.

    The sequence of bits describes the tree structure in the order that the tree nodes are visited in a pre-order traversal.

  2. The tree leaves are flattened into a sequence of characters by listing the characters of the leaf nodes as visited during an in-order traversal.

Together the two flattened sequences describe the tree shape and data in a way that allows us to later recreate the original tree.

For example, the two encoding trees below left and middle are labeled with the flattened sequence of bits and sequence of characters:

     *                          *                          *
   /   \                      /   \                       / \
  E     *                    A     *                     *   O
       / \                        /  \                  / \
      W   K                      *    N                N   * 
                                / \                       / \
                               D   B                     M   S
 
  10100                       1011000
  EWK                         ADBN

Answer the following question in short_answer.txt:

Q4. Flatten the encoding tree above on the right into its sequence of bits and sequence of characters.

Q5. Unflatten the sequences 110100100 and FLERA to reconstruct the original encoding tree.

As you do these operations by hand, be sure to also work out how you will translate these same steps to code. Each sequence is stored into a Queue. The flatten operation writes the tree shape into a Queue<bit> and the leaves into a Queue<char> . The unflatten operation takes in two queues and uses them to reconstruct the encoding tree.

Generating an optimal Huffman tree

The final practice task is to construct a Huffman tree for a given input by manually following the Huffman algorithm.

Answer the following questions in short_answer.txt:

Q6. Construct a Huffman coding tree for the input "BOOKKEEPER".

As you run the Huffman algorithm, you may encounter multiple nodes/subtrees that have the same weight and are thus tied. The algorithm can break ties arbitrarily. This means there can be many equally optimal Huffman trees for a given string, differing only in how the algorithm broke ties. Exchanging the zero/one subtrees of any interior node would create a mirrored but also equally optimal result. Any of the resulting trees is considered correct.

Answer the following questions in short_answer.txt:

Q7. A node in a Huffman coding tree has two non-null children or no children. Why does it not make sense for a node in a Huffman tree to have just one non-null child?

Q8. Describe the difference in shape of a Huffman coding tree that will lead to significant savings for compression versus one that will achieve little to no compression.

Test utility functions

As you learned on the previous assignment, the test cases to validate operations on linked structures don't come easy. By first investing in writing utility functions (like functions to create linked lists, compare linked lists, deallocate linked lists, etc), it is more manageable to write test cases later. Testing on trees will be similar. Our provided tests give you some idea of what writing tests for trees should look like, but they depend on you writing some tree utility functions in huffman.cpp as a foundation.

EncodingTreeNode

First review the definition of struct EncodingTreeNode in the file treenode.h. The struct declares two convenience constructors, one to initialize a new leaf node, the other for a new interior node. Every EncodingTreeNode has a character field ch and two child pointers, zero and one. For a leaf node, both child pointers are nullptr and the ch field is set to a character value. For an interior node, the child pointers are non-null and the character field is unused.

The struct fields (zero, one, and ch) are public, which allows your code to directly access them. However the struct also provides member functions isLeaf and getChar which access the fields on your behalf. We recommend that you use these functions. It will make your code a bit tidier and in the case of getChar, you gain the safety feature to alert if you mistakenly access the character field of an interior node. The code below demonstrates some sample use:

    EncodingTreeNode* t = new EncodingTreeNode('A');
    if (t->isLeaf()) {
        cout << t->getChar() << endl;
    }

createExampleTree

Some of the provided test cases use this sample encoding tree:

               *
             /   \
            T     *
                 / \
                *   E
               / \
              R   S

The tree has 4 leaf nodes, each corresponding to one of the characters {T, R, S, E} in the encoding. The 3 interior nodes are marked with * just for the purpose of visualization. An interior node does not store any character value.

EncodingTreeNode* createExampleTree()

The utility function createExampleTree is intended to construct an encoding tree that matches the one shown above. This is not a general-purpose tree creation function; it manually constructs exactly and only this tree. Implement the function to allocate the necessary nodes and wire them together in the exact form shown above.

To confirm that your function is doing what you intend, run it under the debugger and set a breakpoint after constructing the tree. Use the Variables pane of the debugger to explore the tree structure that you have created, starting from the root. Your process will be reminiscent of how you explored your personal labyrinth in the previous assignment. Confirm that the tree has the correct shape and correct character in each leaf node. Remember that the character field for an interior node is unused, so its value is not relevant.

deallocateTree

Any memory allocated using new during a test case should be deallocated with delete before finishing with the test case. Our provided test cases call deallocateTree to deallocate all of the memory for the nodes of a tree.

void deallocateTree(EncodingTreeNode* t)

After implementing deallocateTree, confirm it works correctly by writing a student test case that calls your createExampleTree to allocate a tree and uses deallocateTree to clean it up. Run the test case and confirm no memory leaks are reported.

areEqual

The areEqual utility function is given pointers to the root nodes of two encoding trees and returns true if the two trees are equal; that is, they have the same shape and same characters in each leaf node. If the two trees have mismatching shapes or differing leaf characters the function should return false.

bool areEqual(EncodingTreeNode* a, EncodingTreeNode* b)

To confirm your areEqual function is working as intended, you should add a variety of your own student tests. Here are some test cases to be sure to cover:

  • Create a singleton tree of just one leaf node, compare to an empty tree and confirm areEqual returns false.
  • Create a second leaf node and compare to the first. If both nodes are initialized to same character, they should be equal, and false otherwise.
  • Call createExampleTree to build tree and confirm it is not equal to the singleton tree.
  • Call createExampleTree a second time and confirm the two copies of the tree are equal.
  • Compare the example tree with one of its subtrees and confirm they are not equal.

Be sure that areEqual is only comparing the valid fields of each tree node. In particular, interior nodes do not use the character field, so that only characters you should be comparing are those in the leaf nodes.

After you have implemented and thoroughly tested your utility functions, you're ready to move on to the main event!