Problem Set 4

Due Tuesday, May 13 at 1:00 pm Pacific

Solutions are available!

🔑 Solutions

This assignment may be completed individually or with a partner.

This problem set explores balanced and augmented search trees and how to use them to solve problems much faster than seems reasonable at first glance. You'll also see how to use data structure isometries to elegantly design novel data structures.

$\LaTeX$ Template

Assignments are required to be typed. We've provided a $\LaTeX$ template for this assignment, though you're not required to use it.

🖋 PS4 $\LaTeX$ Template

Problem One: Implementing Order Statistic Trees

This problem involves writing C++ code. You can find the starter files at /usr/class/cs166/assignments/a4/. As usual, use cp -r to copy the files and /usr/class/cs166/bin/submit to submit them. Your code needs to compile without warnings (-Wall -Werror) and run cleanly under valgrind. And, of course, your code should be well-commented and so beautiful to behold that it brings tears to your eyes. Check the README for details about how to build and run everything.

In this problem, you’ll take an implementation of a red/black tree that only supports insertions and lookups, then convert into an order statistics tree by adding support for the rankOf and select operations. The select operation is the one we talked about in lecture: it takes in a number $k\text,$ then returns the kth order statistic. The rankOf operation is a sort of inverse of select: it takes in a key, then returns the number of elements in the red/black tree smaller than the key. (The key in question doesn’t actually have to be in the tree.)

Edit RedBlackTree.h and .cpp with your solution. Use ./run-tests to validate your solution. Check the README for details on the provided starter files.

Some notes on this problem:

Feel free to edit whatever parts of the provided starter code that you see fit, provided that (1) you still back the data structure with a red/black tree and (2) all operations run in time $O(\log n)\text,$ except for the destructor (time $O(n)$) and printDebugInfo (can be whatever you’d like). We don’t think you will need to do much surgery on the provided RedBlackTree type, so if you find yourself fundamentally rewriting large parts of the code, chances are you’re missing an easier solution.
The select function should throw a std::out_of_range exception if the index to select is invalid (for example, trying to select the fifth item in a tree with four elements). However, the rankOf function should work just fine regardless of whether the key is in the tree.

Problem Two: Beyond Red/Black Trees

We can encode a B-tree of order 2 (a 2-3-4 tree) as a binary search tree by coloring the nodes one of two colors (red and black) and enforcing various structural properties on those colors (the root is black, no red node has a red child, etc.). In lecture, we used this to argue that red/black trees have height $O(\log n)\text,$ but by paying more attention to constant factors we can be even more precise about their heights.

Using the isometry between 2-3-4 trees and red/black trees, prove that every red/black tree with $n$ nodes has height at most $(2 - o(1)) \lg n\text.$

Prove that the bound you came up with in part (i) of this problem is tight. Specifically, show that there are arbitrarily large red/black trees whose heights are $(2 - o(1)) \lg n\text.$

Red/black trees are one point in a larger isometry-based data structure designs. The rest of this problem probes this design space to see what other structures exist and what their relative tradeoffs are.

Show how to encode a B-tree of order 4 as a binary search tree using three colors (red, blue, and black) subject to various structural constraints. Specifically, give a set of color rules for a binary search tree such that
1. every B-tree of order 4 can be encoded as a red/black/blue tree according to your rules, and
2. every red/black/blue tree colored according to your rules can be converted to a B-tree of order 4.
Briefly justify your answer, though no formal proof is required. You do not need to describe the algorithms to handle insertions and deletions in your red/black/blue tree; just give the structural requirements.

Your results from parts (i) and (ii) collectively show that all red/black trees with $n$ nodes have height at most $(2 - o(1)) \lg n\text.$ Give the tightest value of $k$ you can for which every red/black/blue tree with $n$ nodes has height at most $(k - o(1)) \lg n\text.$ Briefly explain why your bound is tight, though no formal proofs are necessary. (Those proofs would look like what you did in parts (i) and (ii) and you already did the work there.)

Red/black trees are widely used in practice, while red/black/blue trees are, to the best of my knowledge, purely of theoretical interest. Some of this might be because they're not widely known or are a bit trickier to code up, but there are solid engineering reasons why they aren't deployed more commonly.

Give one advantage of red/black/blue trees over traditional red/black trees. Then, list one disadvantage of red/black/blue trees compared with red/black trees. Justify your answers.

Problem Three: Dynamic Prefix Parity

Consider the following problem, called the dynamic prefix parity problem (DPP). Your task is to design a data structure that logically represents an array of $n$ bits and supports these operations:

initialize(n), which creates a new data structure for an array of $n$ bits, all initially 0;
ds.flip(i), which flips the $i\text{th}$ bit; and
ds.prefix-parity(i), which returns the parity of the subarray consisting of the first $i$ bits of the array. (The parity of a subarray is zero if the subarray contains an even number of 1 bits and is one if it contains an odd number of 1 bits. Equivalently, the parity of a subarray is the logical XOR of all the bits in that array).

It's possible to solve this problem with initialize taking $O(n)$ time such that flip runs in time $O(1)$ and prefix-parity runs in time $O(n)$ or vice-versa. (Do you see how?) However, by using balanced trees, it’s possible to do significantly better than this. In this problem, you’ll work through a series of smaller milestones that culminates in a theoretically optimal result.

Describe how to use augmented binary trees to solve dynamic prefix parity such that initialize runs in time $O(n)$ and both flip and prefix-parity run in time $O(\log n)\text.$ Argue correctness and justify your runtime bounds.

Some hints and suggestions:
- You have the luxury of working with a tree where insertions and deletions will never happen; the number of bits is specified by initialize(n) and never changes after that. Therefore, provided you can give initial values to any extra information you’re caching at each node, and provided that you can update that information in $O(\log n)$ total time after a flip operation, you can meet the required time bounds.
- If you squint at this problem in just the right way, this will look a lot an order statistic tree. See you can adapt some of the augmentation techniques from there.
- A powerful technique we haven’t yet encountered, but which might be useful to you here: consider making your tree such that only the leaf nodes store actual data, with all the internal nodes just serving as a way to join smaller subtrees together and cache relevant data.
- When designing an augmented tree, it often helps to first solve the problem on a static array using a divide-and-conquer algorithm. So consider doing the following: suppose you had an array of $n$ bits. Could you design a divide-and-conquer algorithm for computing prefix parities that has the recurrence relation $T(n) = 2T(\frac{n}{2}) + O(1)\text?$ If so, you may be able to translate your idea into an augmented tree by caching, at each node in the tree, the value that would be returned by running that divide-and-conquer algorithm on the elements in that tree.

Explain how to revise your solution from part (i) of this problem so that instead of using augmented binary trees, you use augmented multiway trees. Your solution should have initialize take time $O(n)\text,$ flip take time $O(\log_b n)\text,$ and prefix-parity take time $O(b \log_b n)\text.$ Here, $b$ is a tunable parameter. Argue correctness and justify your runtime bounds.

Some hints:
- In an order statistic tree, we store one piece of information per node, since of a node’s two children we only needed to care about the left child. Now that you have a multiway tree where each node can have multiple children, you may want to store multiple pieces of extra information per node.
- You may find it useful to tag each node with information about which child of the parent it is.

Using the Method of Four Russians, modify your data structure from part (ii) so that initialize still runs in time $O(n)\text,$ but both flip and prefix-parity run in time $O\left( \frac{\log n}{\log \log n} \right)\text.$ Argue correctness, and, in detail, justify your runtime bounds.

This last step is probably the trickiest part. Here are some hints:
- All basic integer arithmetic operations are assumed to take time $O(1)\text.$ Floating-point operations are not considered basic arithmetic operations, nor are operations like “count the number of 1 bits in a machine word” or “find the leftmost 1 bit in a machine word.”
- An array of bits can be thought of as an integer, and integers can be used as indices in array-based lookup structures.
- Be precise with your choice of $b\text;$ Constant factors matter! If you're using B-trees, make sure you have precise counts of the number of keys per node.

Pat yourself on the back when you finish this problem. Isn’t that an amazing data structure?