Overview:

This program implements a data compression and decompression operation using binary trees. This is a pair exercise; you are allowed to work individually or work with a single partner on this part of the assignment.

  • icon Starter Code

    We provide a ZIP archive with a starter project that you should download and open with Qt Creator. You will edit and turn in only the following files. The ZIP contains other files/libraries; do not modify these. Your code must work with the other files unmodified.

    Turn in encoding.cpp and secretmessage.txt.huf.

  • icon Paperless (turn in)

    If you work as a pair, only one of you should submit the assignment; do not turn in two copies. Comment both members' names on top of every submitted code file.

Program Description:

Huffman coding is an algorithm devised by David Huffman in 1952 for compressing data. This simple and elegant approach is powerful enough that variations of it are still used today in computer networks, fax machines, modems, HDTV, and other areas. You will be impressed with its clever use of trees and your ability to implement such a nifty tool. What a great capstone to your awesome quarter in CS106B!

Your task is to write a program that uses Huffman coding to compress and decompress files. Carefully read the companion Huffman document for background information on data compression and description of the algorithm. This specification document doesn't repeat that material, and instead just describes the structure of the assignment.

Ultimately, you will implement these two functions:

    void compress(istream& input, HuffmanOutputFile& output);
    void decompress(HuffmanInputFile& input, ostream& output);
                

The compress operation reads the original file text and writes a compressed binary version to a Huffman file. The decompress operation reads the compressed binary data from a Huffman file and outputs the reconstructed original file text.

Rather than just turning you loose on these functions, we’ve decomposed the program into number of helper functions. You’ll build the assignment bottom-up, first writing the helpers that handle individual tasks and then combining them into the final product. (Hey, that’s kinda like how Huffman coding works, too!) The main function of our provided starter code has a test driver that lets you try out each helper function as you complete it, as well as test the final compress and decompress operations.

Step 1: Build the frequency table

The Huffman Coding algorithm starts with the frequencies of the characters in the input. Thus, your first task is a function that takes an istream containing the input data and returns a map associating each character with its frequency:

    Map<char, int> buildFrequencyTable(istream& input)
                

This operation simply reads characters one-by-one and counts occurrences. To read a single character from a stream, do not use getline or the stream extraction operator >> (this would skip whitespace, which we don’t want.) Instead, use the function istream::get(char& ch), which you can invoke as shown below:

    char ch;
    while (stream.get(ch)) { //  true if successful, false if fail/EOF
        // do something with the character ch 
    }
                

This loop read characters one by one, stopping at the end of the input.

Step 2: Build the Huffman encoding tree

Your next task is to implement the function

    HuffmanNode* buildEncodingTree(Map<char, int>& freqTable);
                

which takes the frequency map as an input and constructs a Huffman encoding tree. (If you haven't already reviewed the lecture and/or companion Huffman document, now would be a good time to so!).

The nodes in the encoding tree use the HuffmanNode type which is defined in the header file HuffmanNode.h as follows:

struct HuffmanNode {
    char ch;             // character being represented if leaf node, not used if interior node
    HuffmanNode* zero;   // 0 (left) subtree (nullptr if empty)
    HuffmanNode* one;    // 1 (right) subtree (nullptr if empty)
};
                

During construction of the encoding tree, you will maintain a forest of partial trees. The most appropriate collection ADT for the forest is a priority queue, which supports quick access to elements in order of priority. In this case of the Huffman algorithm, the "priority" of a tree is based on its weight, where smaller-weight trees are more urgent priority.

The PatientQueue you implemented for Assignment 5 was a priority queue, but since it was written to store elements of type Patient, it isn't quite what is needed here. The priorityqueue.h interface in the Stanford library provides a general-purpose PriorityQueue implemented as a binary heap for efficiency. Its design is similar to your PatientQueue but has a few differences of note: it is defined as a template and thus capable of storing the client's choice of element type, it supports only min-heap mode, and uses priorities of type double instead of int. Consult the PriorityQueue documentation for more details. The elements you store in the priority queue will be trees, that is to say, the element type is Huffman Node*.

Step 3: Build the encoding map

You’ve now got yourself a nifty tree that assigns a bit pattern to each character. Nice! Your next task is to rearrange the encoding into a more convenient form for use during the compress operation. The function

    Map<char, string> buildEncodingMap(HuffmanNode* encodingTree);
                

takes as input a Huffman encoding tree and returns a map that associates each character with the string of zero/ones used to encode that character. For example, suppose you’re given this Huffman tree which has encodings for five characters:

               +-----+    
               |     |    
               +-----+    
              /       \
             /         \
     +-----+             +-----+
     |     |             |     |
     +-----+             +-----+
       / \                 / \
      /   \               /   \
+-----+   +-----+   +-----+   +-----+
| ' ' |   |     |   | 'A' |   | 'n' |
+-----+   +-----+   +-----+   +-----+
            / \
           /   \
     +-----+   +-----+
     | '$' |   | 'Z' |
     +-----+   +-----+

The encoding map for the above tree contains these key/value pairings:

    ' ':  "00"
    'A':  "10"
    'n':  "11"
    '$':  "010"
    'Z':  "011"

Step 4: Encoding tree as header

When compressing an input, you must store your carefully constructed encoding tree with the compressed output, as that tree will be needed for the later decompress operation. Without access to the original encoding tree, you would not know the mappings from bit patterns to characters!

In spirit we want to write out the Huffman tree into the compressed file; but it isn't possible to write out the tree directly, so we must come up with a way to "flatten" the tree into a string form suitable to write to the file. This string will contain the necessary information and structure of the tree that allows you to recreate the tree later.

An encoding tree is made up of leaf and interior nodes. The representation we suggest is to write a leaf node in the form .A (i.e. a period followed by the specific character) and write an interior node as a pair of parens wrapped around its two child nodes (ZO) (Z and O could each be either leaf or internal nodes). Using this strategy to flatten the encoding tree diagrammed above in Step 3 produces this string:

     ((. (.$.Z))(.A.n))

The string contains 5 leaf nodes (each beginning with a period), and 4 interior nodes (each wrapped in parentheses). The order of the nodes within the string corresponds to a pre-order traversal of the encoding tree. Note how this matches the tree diagrammed above.

Creating the flattened string is a nice opportunity to practice tree traversal. The inverse function that reconstructs the tree from the flattened string is a particularly pleasing application of recursive parsing. Writing these two functions is great practice with recursion and trees!

    string flattenTreeToHeader(HuffmanNode* encodingTree);
    HuffmanNode* recreateTreeFromHeader(string str);
                

Note that for very small input files, the flattened tree header can end up being larger than the compressed file's data; it can even make it so that the compressed file is larger than the original uncompressed file was -- oops! Don't sweat this; compression is most beneficial for large files anyway, where the cost of the header will be a mere pittance against the savings.

The final helper function in this step is used to clean up the memory for an encoding tree when no longer needed:

    void freeTree(HuffmanNode* t);
                

This function accepts a pointer to the root of a Huffman tree as a parameter and frees all of the heap-allocated memory associated with that tree: The idea is that your other functions should call this function as appropriate to clean up after themselves.

Putting it all together

After you have completed and tested the helper functions for steps 1-4, you're ready to assemble them into an end-to-end Huffman encoder and decoder.

    void compress(istream& input, HuffmanOutputFile& output);
    void decompress(HuffmanInputFile& input, ostream& output);
                

The compress operation takes an input stream to be compressed. You set up for the compression by calling your various helper functions to count characters, construct a Huffman encoding tree, build an encoding Map, and flatten the tree into header form.

The compress operation takes it over the finish line by writing the flattened tree header into the Huffman File, followed by iterating over characters in the input and writing the bit sequence corresponding to each character's assigned code.

One gotcha for compress is that the initial task of counting characters reads the entire input stream, so when you later want to read the characters again to encode them, you’ll find that there aren’t any characters left in the stream. You must rewind the stream to the beginning in order to read it a second time. You can do this by calling the handy rewindStream(istream& input) function from filelib.h.

The decompress operation reconstructs the original from the compressed information. You first read the header from the Huffman File and use it to re-construct the encoding tree. You then read bits one by one and decode each bit sequence into a character by tracing the corresponding zero/one paths in the encoding tree.

You may assume that the given input and output parameters are valid. The streams are already opened and ready to be read/written; you do not need to prompt the user or open/close the files yourself. Our custom Huffman File classes are used to read and write binary-encoded files. You can also further assume that anything you’re decompressing is something that you yourself compressed and is in the proper format.

All assumptions aside, when you’re first writing these operations, it still may be wise to defensively include some error-detection for malformed inputs just in case there’s something a bit buggy in your helper functions.

If compress or decompress allocate any dynamic heap memory, be sure to clean up when finished and delete it.

Creative Aspect, a secret message to your SL

You now have a working compress and decompress, so why not use to transmit a secret message? Edit the file named secretmessage.txt file in the res folder of your project to contain a nice message for your section leader. The message can be anything (non-offensive ☺) that you like.

Use your program to compress the file and save the result to secretmessage.txt.huf. The output file will written to the build directory that Qt Creator automatically makes for you when you configure a project. Submit the secretmessage.txt.huf file to Paperless along with your code. Your section leader will uncompress your message with your program and read it while grading. This is worth a small part of your grade.

Output logs

The following is one sample partial log of execution of the provided main program. More logs are available above or through the Compare Output feature in your console window.

CS 106B/X Shrink-It!
This program uses the Huffman coding algorithm for compression.
...

1) test build character frequency table 2) test build encoding tree 3) test build encoding map 4) test read/write encoding tree header

C) compress file D) decompress file

H) view contents of huffman file T) view contents of text file S) side-by-side file comparison Q) quit

Your choice? c Input file name: seuss.txt Output file name (Enter for seuss.txt.huf): Reading 3273 input bytes. Compressing ... Wrote 1775 compressed bytes.

Example log of execution

Huffman Nodes

A Huffman encoding tree contains entries of type HuffmanNode. Review the HuffmanNode.h header in the starter files for additional information.

HuffmanNode(ch) Constructs a leaf node for the specified character. All HuffmanNodes should be allocated via new, e.g. HuffmanNode* leaf = new HuffmanNode('a');
HuffmanNode(zero,one) Constructs a interior node that is the parent of the given zero and one subtrees. All HuffmanNodes should be allocated via new, e.g. HuffmanNode* parent = new HuffmanNode(z, o);
bool isLeaf() Returns true if this node is a leaf, false otherwise.

Huffman Files

Regular stream objects do not have operations for reading or writing single bits, so we provide custom classes with the necessary operations for your use in this assignment. You will use the HuffmanInputFile and HuffmanOutputFile objects to read and write the binary-encoded files. To keep things simple for you, these classes provide a tiny, controlled interface that is easy to use properly, along with some behind-the-scenes machinery to help avoid client mis-steps.

Review the HuffmanFile.h header in the starter files for additional information.

HuffmanOutputFile
void writeHeader(str) Writes the given header string to the file
void writeBit(bit) Writes a single bit to the file. Value of bit must be 0 or 1.
HuffmanInputFile
string readHeader() Reads the header from the file and returns it as a string
int readBit() Reads a single bit from the input file and returns 0 or 1 depending on the bit value. If all bits have been read from the file, returns EOF (-1)

Development Strategy and Hints:

Here’s some general advice and recommendations for working through this assignment:

  • Review starter code. The code you write all goes into the encoding.cpp file but you should also review the other code files to see the rest of the program and how the code you will write fits in context. In particular, there is a substantial test driver program in main.cpp that can be used to test your helper functions one at a time. If you understand how to use the driver effectively, it will super-charge your development process.
  • Develop piecewise: Do not try to write all the functions in one go and debug the entire program at once. Concentrate on one function in isolation and write, test, and debug it thoroughly before moving on to the next. The provided test driver should make this relatively painless.
  • Testing: Do start on small files (one word, two words, a sentence) you can easily verify with hand-tracing before attempting to compress lengthy Russian novels.
    • What sort of files do you expect Huffman to be particularly effective at compressing? On what sort of files will it less effective? Are there files that grow instead of shrink when Huffman encoded? Consider creating sample files to test out your theories.
    • Your implementation should be robust enough to compress all types of file: text, binary, image, or even one it has previously compressed. Your program probably won’t be able to further squish an already compressed file (and in fact, it can get larger because of the additional overhead of the encoding table) but it should be possible to compress multiple iterations, decompress the same number of iterations, and return to the original file.
    • You can assume that any input file will have at least two distinct characters. Empty files or a file of only one distinct character would require a special case for encoding. Do you see why? We don't want you to worry about this and will not test against such cases.
  • You program only need be self-consistent. Your compressed output may not exactly match files compressed by other students' programs, since slight variations in the tree-building algorithm can cause the program to encode the same file in valid but different ways. You do not need to take special precautions to protect against decompressing an incompatible compressed file. The important test is that your program can reliably decompress files that you yourself compressed.
  • Be tidy about memory: Don’t leak memory. Make sure not to leak any tree nodes, and if you allocate any extra memory, be sure to deallocate it!
  • Bitwise I/O can be slow. The bit operations are somewhat inefficient and it is expected that large files will take some time to read and write. It is definitely worthwhile to do some tests on large files as part of stress-testing your program, but don’t be concerned if the reading/writing is slow.
  • Note that Qt Creator writes the compressed binary files created by program code into the build_Xxxxxxx folder, rather than in the normal res/ folder of your project.

Style Details:

As in other assignments, you should follow our Style Guide for information about expected coding style. You are also expected to follow all of the general style constraints emphasized in the previous Homework specs, such as the ones about good problem decomposition, parameters, redundancy, using proper C++ idioms, and commenting. The following are additional points of emphasis and style constraints specific to this problem.

  • Binary tree usage: Part of your grade will come from appropriately utilizing binary trees and recursive algorithms to traverse them. Any functions that traverse a binary tree from top to bottom should implement that traversal recursively. If a particular function must traverse a tree multiple times, it is okay to write a loop that initiates each traversal, as long as the traversal itself is recursive. We will check this particular constraint strictly; no exceptions!
  • Don't modify required function headers: Please do not make modifications to the required functions' names, parameter types, or return types. Our client code should be able to call the functions successfully without any modification.
  • Reuse instead of repeat: Redundancy is another major grading focus; avoid repeated logic as much as possible. If two of your functions are similar, have one call the other, or utilize a common helper function.
  • No global variables: As always, all program data should be properly passed in and out of functions, not declared or accessed globally.
  • Memory usage: Your code should have no memory leaks. Free the memory associated with any new objects you allocate internally. The Huffman nodes you will allocate when building encoding trees are passed back to the caller, so it is that caller's responsibility to call your freeTree function to clean up the memory. But if you create a Huffman tree yourself to help you implement another function, you must free that entire tree yourself.

Frequently Asked Questions (FAQ):

Q: I have no idea what is going on in this assignment.
A: Review the slides/video from Friday's class and/or read the background document on Huffman Coding. The textbook chapter on binary trees may also be helpful.
Q: The spec says I am not supposed to modify the .h files. But I want to use a helper function. Don't I need to modify the .h file to add a function prototype declaration for my helpers? Can I still use helper functions even if I don't modify the .h file?
A: Do not modify the provided .h file. Just declare your function prototypes in your .cpp file (near the top, above any code that tries to call those functions) and it'll work fine. You can declare a function prototype anywhere: in a .cpp file, in a .h file, wherever you want. The idea of putting them in a .h file is a convention used when the functions are published to other modules.
Q: My Huffman tree doesn't get created correctly. How can I tell what's going on?
A: We suggest inserting print statements in the function that builds the tree. There is also a printSideways function provided that takes a HuffmanNode* and prints that entire tree sideways.
Q: The contents of my priority queue don't seem to be in sorted order. Why?
A: A PriorityQueue's ordering is based on the priorities you pass in when you enqueue each element. Are you sure you are adding each node with the right priority?
Q: What should the priority queue's ordering be if the two nodes' frequencies are equal?
A: If the counts are the same, just add them both with the same priority and let the priority queue decide how to relatively order those two items.
Q: I don't understand the different kinds of input/output streams in the assignment. Which kind of stream is used in what situation? How do I create and open/close them?

A: Here's a rundown of the different types of streams:

  • An istream (aka ifstream) is used to read file data character by character.
  • An ostream (aka ofstream) is used to write file data character by character.
  • A HuffmanInputFile is used to read binary file data bit by bit.
  • A HuffmanOutputFile is used to write binary file data bit by bit.

Here's a diagram summarizing the streams:

                                    compress:
   
+-----------------+   read bytes                write bits    +-----------------+
|   normal file   |    istream        YOUR   HuffmanOutputFile| compressed file |
|     foo.txt     | -------------->   CODE   ---------------> |   foo.txt.huf   |
+-----------------+  'h', 'i', ...             010101010101   +-----------------+

=================================================================================
                                    decompress:

+-----------------+   read bits                 write bytes   +-----------------+
| compressed file |HuffmanInputFile   YOUR       ostream      |   normal file   |
|  foo.txt.huf    | -------------->   CODE   ---------------> |  unhuf.foo.txt  |
+-----------------+  010101010101              'h', 'i', ...  +-----------------+
                    

You never need to create or initialize a stream; the client code does that for you. You are passed a stream that is ready to use; you don't need to create it or open it or close it.

Q: How can I tell what bits are getting written to my compressed file?
A: The main testing program has a "view Huffman file" option to print out the bits of a binary file. Between that and print statements in your own code for debugging, you should be able to figure out what bits came from where.
Q: My individual functions for each step of the encoding process seem to work. But my compress function always produces an empty file or a very small file. Why?
A: Maybe you are forgetting to "rewind" the input stream. Your compress function reads over the input stream data twice: once to count the characters so that you can build a Huffman tree, and a second time to actually compress it using your encodings. Between those two actions, you must rewind the input stream by writing code such as:
rewindStream(input);   // tells the stream to seek back to the beginning
                    
Q: My program works for most files, but when I try to decompress a large file, I get a crash. Why?
A: If your decompress function works correctly for most files but crashes for very large files, it is possible that your algorithm is nesting too many recursive calls. When you reach the bottom of the tree (a leaf), you must not make a recursive call to return to the top of the tree. Instead, let your recursive calls return, and then initiate a new descent of the tree. The overall initiation of each descent of the tree can be a loop, but the process of descending the tree a single time must be recursive.
Q: My program runs really slowly on large files. How can I speed it up?
A: It is expected that the code will take a little while to run on a large file. Our solution takes a few seconds to process a long novel. Your program also might be slow because you're running it on a slow disk drive such as a USB thumb drive.
Q: When do I call my freeTree function? Do I ever need to call it myself?
A: If you ever create an encoding tree yourself as a helper to assist you in solving some larger task, then you should free that tree so that you don't leak memory. So for example, your buildEncodingTree function should not free the tree because it is supposed to return that tree to the client, and presumably that client will later free it. But if you call buildEncodingTree somewhere in your code because you want to use an encoding tree to help you, then when you are done using it, you should call freeTree on it.
Q: How do I make a secretmessage.huf file? Where does it get stored on my computer?
A: Create it using your own program by compressing the text that you want to send to your grader. Your program will output it into the build_xxxxxxxxxx folder which is usually found in the parent directory of your project's folder.

Possible Extra Features:

  • Make the encoding tree header ultra-compact: The header containing the encoding tree is somewhat bulky, and undesirably increases the size for small files See if you can find a a more compact way to represent the flattened tree. If you're feeling up for a challenge, try looking up succinct data structures and see if you can write out the encoding tree using just one bit per node and one byte per character!

  • Add support for encryption: Without knowledge of the encoding table, it's impossible to decode compressed files. Protect the encoding header so that it prompts for a password or uses some other technique to make it hard for Bad People to decompress the data.

  • Implement a more advanced compression algorithm: Huffman coding is a good compression algorithm, but there are much better alternatives in many cases. Try researching and implementing a more advanced algorithm, like LZW, in addition to Huffman coding.

  • Other: If you have your own creative idea for an extra feature, ask your SL and/or the instructor about it.

Indicating that you have done extra features: If you complete any extra features, then in the comment heading on the top of your program, please list all extra features that you worked on and where in the code they can be found (what functions, lines, etc. so that the grader can look at their code easily).

Submitting a program with extra features: Since we use automated testing for part of our grading process, it is important that you submit a program that conforms to the preceding spec, even if you want to do extra features. If your feature(s) cause your program to change the output that it produces in such a way that it no longer matches the expected sample output test cases provided, you should submit two versions of your program file: a first one with the standard file name without any extra features added (or with all necessary features disabled or commented out), and a second one whose file name has the suffix -extra.cpp with the extra features enabled. Please distinguish them in by explaining which is which in the comment header. Our turnin system saves every submission you make, so if you make multiple submissions we will be able to view all of them; your previously submitted files will not be lost or overwritten.

Turning In:

When you are finished, submit your assignment using Paperless web system. General rules:

  • Please turn in only the file(s) specified in the assignment.
  • (For pair assignments) Please turn in only one submission per pair, and make sure both names are entered in the system.