Assign7: Huffman warmup

The warmup exercises are intended to give you practice with encoding trees and the Huffman algorithm to ensure you are able to build a solid foundation of understanding before you get started on actually implementing the algorithm.

You should start by reviewing the 5/29 lecture on Huffman Coding. We have also posted a written document with supplemental background on Huffman if reading works better for you than watching and listening to lecture. The rest of this document assumes familiarity with the Huffman algorithm and will not repeat the content of the lecture and handout.

Encoding and decoding using an encoding tree

First, let's practice using an encoding tree to decode a bit sequence. Look at the encoding tree diagrammed below. Interior nodes are marked with * (Note: there is no asterisk character stored in the node, it is only for visualization in the diagram); leaf nodes are marked with the character encoded on that path.

We label the left-going branch zero and right-going one. As an example the bit sequence for S is 011, corresponding to the traversal left-right-right.

Answer the following question in short_answer.txt:

Q1. Decode the characters from the bit sequence 0101100011.
Q2. Encode the string "SONS" into its bit sequence.

As you do these operations by hand, take a moment to consider what will be required to do these same operations in code.

Flattening and unflattening an encoding tree

One of the practical concerns of Huffman coding is the compressed data has to include information about the encoding tree used. It is not possible to decode the bit sequences back to the original characters without knowledge of the original encoding tree.

In spirit, we want to write the Huffman tree into the compressed file; but it isn't possible to write a tree directly, so we must come up with a way to "flatten" the tree into a form that records the data and structure of the tree and allows you to later reconstruct the tree.

There are many ways to do this, but one very space-efficient approach is to flatten the tree into two sequences: a sequence of bits that gives the shape of the tree and the sequence of characters from the leaves.

The shape of the tree is flattened into a sequence of bits as follows:

If the root of the tree is a leaf node, it’s represented by the bit 0.
If the root of the tree is not a leaf node, it’s represented by a 1 bit, followed by the encoding of its zero (left) subtree, followed by the encoding of its one (right) subtree.

In this way this first sequence of bits describes the node structure in the order that the nodes are visited in a pre-order traversal of the tree.

Then, the characters of the tree are flattened into a second sequence that contains the characters from the leaf nodes in the order they are visited during an in-order traversal of the tree.

Together the two flattened sequences describe the tree shape and data in a way that allows to to recreate that exact tree later.

For example, here are two encoding trees and how they’d be flattened to a sequence of bits and a sequence of characters:

               *                                   *
             /   \                               /   \
            E     *                             A     *
                 / \                                 /  \
                W   K                               *    N
                                                   / \
                                                  D   B
          10100                                 1011000
          EWK                                    ADBN

Answer the following question in short_answer.txt:

Q3. Give the flattened sequences for the encoding tree that you used for Questions 1 and 2.
Q4. Cover up the trees above and confirm that you can correctly rebuild the tree from its flattened sequences. What process did you go through to convince yourself that you could properly rebuild the tree from the two flattened sequences?

As you do these operations by hand, be sure to consider what will be required to do these same operations in code. You will use a queue to store a flattened sequence. The flatten operation takes in an encoding tree and writes the tree shape into a queue of bits and the leaves into a queue of characters. The unflatten operation takes in the two queues and uses them to reconstruct the encoding tree.

Generating an optimal Huffman tree

The final task to practice with is generating a Huffman coding tree. Be sure have read the background material on Huffman coding to understand the algorithm for how the optimal tree is constructed.

Answer the following questions in short_answer.txt:

Q5. Show an optimal Huffman coding tree for the input "BOOKKEEPER".

If multiple characters are tied for having the same weight, or if two or more intermediate trees are tied for having the same weight, you can break ties arbitrarily. There can be many equally optimal Huffman trees for a given string, differing only in how the algorithm broke ties. Exchanging the zero/one subtrees of any interior node would create a mirrored but also equally optimal result.

Answer the following questions in short_answer.txt:

Q6. A node in a Huffman coding tree has two non-null children or zero. Why is not possible for a node to have just one non-null child?
Q7. Describe the difference in shape of a Huffman coding tree that will lead to significant savings for compression versus one that will achieve little to no compression.

Testing code

As you saw on your previous assignment, test cases to validate operations on linked structures don't come easy. By first investing in writing utility routines (like routines to create linked lists, compare linked lists, deallocate linked lists, etc), it is more manageable to write test cases later. Testing on trees will be similar and will follow much of the same pattern. Our provided tests give you some idea of what writing tests for tress should look like, but they depend on you writing some tree utility functions as a foundation.

createExampleTree

Some of the provided test cases are designed around this example encoding tree:

The tree has 4 leaf nodes, each corresponding to one of the characters {T, R, S, E} in the encoding. The 3 interior nodes are marked with * just for the purpose of visualization. An interior node does not store any character value.

The utility function createExampleTree is intended to construct an encoding tree that matches the one shown above. The function doesn't need to be a general-purpose tree creation routine, it just manually constructs exactly this tree.

Read the treenode.h file to see the definition of the EncodingTreeNode struct that is used to represent the nodes in the tree. Note that this struct provides two convenient constructors, one used to create leaf nodes and the other for interior nodes. All EncodingTreeNodes have a character field ch and two child pointers, zero and one, but note that the character field will be unused for an interior node.

Implement the body of createExampleTree to create the necessary 7 nodes and wire them together in the exact form shown above.

To confirm that your function is doing what you intend, run it under the debugger and set a breakpoint after constructing the tree. Use the Variables pane of the debugger to explore explore the tree structure that you have created, starting from the root. Your process for doing this exploration should be very reminiscent of what you did when exploring your personal labyrinth in the last assignment. Confirm that the tree has the correct shape and correct character value in each leaf node. Remember that the character field for an interior node is unused, so its value is not relevant.

deallocateTree

Any memory allocated using new during a test case should be deallocated with delete before finishing with the test case. Our provided test cases depend on the deallocateTree to deallocate all of the memory for the nodes of a tree starting from the root.

Implement the deallocateTree function. Confirm it works correcty by writing a student test case that calls your createExampleTree to build a tree and uses deallocateTree to clean it up. Run the test case and confirm no memory leaks are reported.

areEqual

The final utility function is the areEqual function which is given pointers to the root nodes of two encoding trees and returns true if the two trees are equal; that is, they have the same shape and same values in each leaf node. If the two trees have mismatching shapes or differences in leaf node values then the function should return false.

Implement the areEqual function to do this comparison. To confirm areEqual is working as expected, you should write some student tests. For these two student tests, you should consider a sequence of operations like:

Call createExampleTree twice to get two trees that should be identical to one another
Compare the two trees directly to one another and confirm that areEqual returns true
Compare two identical subtrees of the original trees and confirm that areEqual returns true
Compare two different subtrees of the original trees and confirm that areEqual returns false

Once you have thoroughly tested the behavior of this last utility function, you're ready to move on to the main event!