Problem Set 3


Due Thursday, May 1 at 1:00 pm Pacific


Solutions are available!

🔑 Solutions


This assignment may be completed individually or with a partner.

This problem set explores hashing and sketching algorithms. You'll fill in some details from the analyses in lecture, develop a cardinality estimator, and learn more about HyperLogLog by implementing and evaluating it. By the time you're done, you'll have a much deeper understanding of how these data structures work!

$\LaTeX$ Template

Assignments are required to be typed. We've provided a $\LaTeX$ template for this assignment, though you're not required to use it.

🖋 PS3 $\LaTeX$ Template

Problem One: Final Details on Count Sketches

In our analysis of count sketches from lecture, we made the following simplification when determining the variance of our estimate:

\[\Var\left[ \sum_{i \ne j}{\mathbf{a}_j s(x_i) s(x_j) \mathbb{1}_{h(x_i) = h(x_j)}} \right] = \sum_{i \ne j}{\Var\left[{\mathbf{a}_j s(x_i) s(x_j) \mathbb{1}_{h(x_i) = h(x_j)}} \right]}\text.\]

In this expression, we’ve fixed some value for an index $i$ and are summing over all the other indices.

In general, the variance of a sum of random variables is not the same as the sum of their variances. That only works in the case where all those random variables are pairwise uncorrelated, as you saw on Problem Set Zero.

Prove that for any indices $j \ne k$ (where $i \ne j$ and $i \ne k$),

\[\mathbf{a}_j s(x_i) s(x_j) \mathbb{1}_{h(x_i) = h(x_j)} \qquad \text{and} \qquad \mathbf{a}_k s(x_i) s(x_k) \mathbb{1}_{h(x_i) = h(x_k)}\]

are uncorrelated random variables, under the assumption that both $s$ and $h$ are drawn uniformly and independently from separate 2-independent families of hash functions. Refer back to the slides on the count sketch for the definitions of the relevant terms here. Remember that the $\mathbf{a}$ terms are not random variables. Two random variables $X$ and $Y$ are uncorrelated when $\E[XY] = \E[X]\E[Y]\text.$
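As a reminder, here is the 2-independence property you will likely want to invoke. (This matches the definition from lecture; writing the codomain as $[M]$ is a notational choice on our part, not part of the problem statement.) A family $\mathscr{H}$ of hash functions from $\mathscr{U}$ to $[M]$ is 2-independent if, for any distinct $x, y \in \mathscr{U}$ and any $a, b \in [M]\text,$

\[\Pr_{h \sim \mathscr{H}}\left[h(x) = a \text{ and } h(y) = b\right] = \frac{1}{M^2}\text.\]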

Problem Two: Cardinality Estimation

In this problem, you’ll design and analyze a cardinality estimator that works on a different principle than the HyperLogLog estimator we saw in lecture. For the purposes of this problem, let’s assume our elements are drawn from the set $\mathscr{U}$ and that the true cardinality of the set seen is $n\text.$

Here’s an initial data structure for cardinality estimation. Choose a hash function $h$ uniformly at random from a family of 2-independent hash functions $\mathscr{H}\text,$ where every function in $\mathscr{H}$ maps $\mathscr{U}$ to the open interval of real numbers $(0, 1)\text.$ (Hashing to a uniformly-random real number poses some theoretical challenges; the "correct" way to do this uses a slightly different strategy. For the purposes of this problem, we'll ignore this and assume that hashing to the real interval is possible.)

Our data structure works by hashing the elements it sees using $h$ and doing some internal bookkeeping to keep track of the $k$th-smallest hash code out of all the hash codes seen so far, ignoring duplicates. The fact that we ignore duplicate hash codes is important; we’d like it to be the case that if we call see(x) multiple times, it has the same effect as just calling see(x) a single time. (The fancy term for this is that the see operation is idempotent.) We’ll implement estimate() by returning the value $\hat{n} = \frac{k}{h_k}\text,$ where $h_k$ denotes the $k$th-smallest hash code seen.

  i. Explain, intuitively, why $\hat{n}$ is a somewhat reasonable guess for the actual number of elements.

Let $\varepsilon \in (0, 1)$ be some accuracy parameter that’s provided to us.

  ii. Prove that $\Pr{\left[ \hat{n} \ge (1 + \varepsilon)n \right]} \le \frac{2}{k \varepsilon^2}\text.$ This shows that by tuning $k\text,$ we can make it unlikely that we overestimate the true value of $n\text.$

    Use the techniques we covered in class: use indicator variables and some sort of concentration inequality. What has to happen for the estimate $\hat{n}$ to be too large? As a reminder, your hash function is only assumed to be 2-independent, so you can’t assume it behaves like a truly random function and can only use properties of 2-independent hash functions.

    As a hint, $\hat{n}$ is not an unbiased estimator and computing $\E[\hat{n}]$ is extremely challenging – as in, we’re not sure how to do it! See if you can solve this problem without computing $\E[\hat{n}]\text.$

Using a proof analogous to the one from part (ii), we can also show that $\Pr{\left[ \hat{n} \le (1 - \varepsilon)n \right]} \le \frac{2}{k \varepsilon^2}\text.$ Since that proof is so similar, we won't ask you to write it up. Together, these two bounds imply that by tuning $k\text,$ you can make it fairly likely that you get an estimate within $\pm \varepsilon n$ of the true value! All that's left to do now is to tune our confidence in our answer.

  iii. Using the above data structure as a starting point, design a cardinality estimator with tunable parameters $\varepsilon \in (0, 1)$ and $\delta \in (0, 1)$ such that

    • see(x) takes time $O(poly(\varepsilon^{-1}, \log \delta^{-1}))\text,$ where $poly(x, y)$ denotes “something bounded from above by a polynomial in $x$ and $y$,” such as $x^3 y + y^2 \log x\text;$

    • estimate() takes time $O(poly(\log \delta^{-1}))\text,$ and if $\hat{C}$ denotes the estimate returned this way, then $\Pr \left[ \abs{\hat{C} - n} \ge \varepsilon n \right] \le \delta\text;$ and

    • the total space usage is $O(poly(\varepsilon^{-1}, \log \delta^{-1}))\text.$

You’ve just built a tunable cardinality estimator that needs only 2-independent hash functions. Nicely done!

Problem Three: Implementing HyperLogLog

This problem involves writing C++ code. You can find the starter files at /usr/class/cs166/assignments/a3/. As usual, use cp -r to copy the files and /usr/class/cs166/bin/submit to submit them. Your code needs to compile without warnings (-Wall -Werror) and run cleanly under valgrind. And, of course, your code should be well-commented and so beautiful to behold that it brings tears to your eyes. Check the README for details about how to build and run everything.


In lecture, we briefly sketched (no pun intended) how the HyperLogLog estimator works. In this problem, you'll code up an implementation of HyperLogLog and apply some adjustments to make it an excellent practical cardinality estimator.

Suppose we are interested in estimating cardinalities of sets up to $2^{64}\text,$ a very generous upper bound. We pick a hash function $h$ that hashes from our universe of elements $\mathscr{U}$ to 64-bit values. This hash function will not be drawn from a family of 2-independent hash functions; instead, we'll pick some specific, fixed hash function and assume that it behaves "randomly enough" for our purposes.

Next, we pick a parameter $m$ that is analogous to the width $w$ of the count- and count-min sketches. Increasing $m$ increases the memory used by the estimator, but also improves the accuracy of the estimator. Note that $m$ is required to be a perfect power of two, and in what follows we'll let $m = 2^p\text,$ where we'll refer to $p$ as the precision parameter. (It turns out that setting $w = \Theta(\varepsilon^{-2})$ gives a constant probability of getting an estimate within $(1 \pm \varepsilon)n\text;$ check the original paper for details.)

To initialize the data structure, we form an array $A$ of $m$ 6-bit integers, all initially zero. (We use 6-bit integers because $\lg \lg 2^{64} = 6\text{.)}$

To implement see(x), we do the following:

  1. Compute $h(x)\text.$
  2. Read the highest $p$ bits of $h(x)$ to form an index $i$ into the array $A\text.$
  3. Compute $b\text,$ the zero-indexed position of the least-significant 1 bit of the lower $64 - p$ bits of $h(x)\text.$
  4. Set $A[i] = \max\set{A[i], b + 1}\text.$

To implement estimate(), we do the following:

  1. Compute the harmonic mean of the values $2^{A[0]}, 2^{A[1]}, 2^{A[2]}, \ldots, 2^{A[m - 1]}\text;$ call this $H\text.$
  2. Return $m \cdot H\text.$

Your first task is to code up this basic implementation.

  i. Implement the above as the NaiveHyperLogLog type. We've provided you a SixBitIntArray type that, like IntArray from PS2, represents an array of 6-bit integers. Use ./run-tests to run some basic tests to validate your implementation. We encourage you to add your own tests beyond what we provided.

    Once you're passing those tests, run the ./generate-csv program to generate data about the accuracy of your NaiveHyperLogLog estimator across a range of values of $n\text.$

    As your written submission to this problem, use ./generate-csv to generate data for your NaiveHyperLogLog type using $p = 12$ and create a plot showing the mean and standard deviation of your estimates.

In reviewing your data sets, you should see that while HyperLogLog's estimates track the true set cardinalities, they are consistently off from the correct values. This is due to multiple independent factors that you'll address in the remainder of this problem.

The first problem is that the naive HyperLogLog estimator you implemented in part (i) is not unbiased. Let's define $NHLL_m(n)$ to be the expected value of the naive HyperLogLog estimator, with precision parameter $m\text,$ as applied to a set of cardinality $n\text.$ Then the ratio

\[\alpha_m \triangleq \lim_{n \to \infty} \frac{n}{NHLL_m(n)}\]

tends toward a fixed, nonzero value depending on $m\text.$ By determining what $\alpha_m$ is, we can debias the HyperLogLog estimate by simply multiplying the result by $\alpha_m$ before returning it.

There are several ways you could work out values for $\alpha_m\text.$ You could try to numerically evaluate the following improper integral from the original HyperLogLog paper, which gives exact values for $\alpha_m\text:$

\[\alpha_m = \left(m \int_{0}^\infty{\left( \log_2 \frac{2 + u}{1 + u} \right)^m \, du} \right)^{-1}\]

Another option would be to run your NaiveHyperLogLog estimator over various values of $m\text,$ look at the generated data, and empirically work out an approximation for $\alpha_m\text.$ A third option would be to find a paper describing HyperLogLog and see what approximations they recommend. And we're sure others exist as well.

  ii. Come up with approximations for $\alpha_m$ for $m = 2^p$ with $p$ ranging from 3 to 12. Then, implement the HyperLogLog type, which works via the same strategy as NaiveHyperLogLog except that it scales the estimate by $\alpha_m$ before returning it.

    As your written submission to this problem, use ./generate-csv to generate data for your HyperLogLog type using $p = 12$ and create a plot showing the mean and standard deviation of your estimates. Then, describe what approach you took to determine $\alpha_m$ and why you chose it.

If you've done everything right at this point, you should see that your HyperLogLog estimator now gives very good estimates when $n$ is large, but strays from the true value when $n$ is small. The degree to which it strays depends on both $n$ and $m\text.$

Here's a practical and simple way to correct for this error in the low-$n$ regime. Run experiments over a large range of values of $n$ and $m$ to determine the empirical bias resulting from the HyperLogLog estimator (the ratio of the actual cardinality $n$ to the estimated value). Then, when performing an estimate for small cardinalities, use the empirically-generated data to "debias" the estimate before returning it.

We're going to leave it up to you to determine how exactly you would "debias" the result in light of the experiments you run. There are many, many ways to do this. Here are some questions to keep in mind as you decide how to proceed:

  • Once $n$ is sufficiently large, you likely won't need to do any debiasing; multiplying by $\alpha_m$ already handles that. However, you don't know the exact value of $n\text;$ you only have an estimate for it. What criteria will you use to determine when to stop applying debiasing terms?

  • Your approach to debiasing the result should be time- and space-efficient. We're dealing with small $n$ here, so big-O notation is not a very good tool for quantifying efficiency. Instead, think about how many bytes you'll need to use to introduce your correction and how fast, from a wall-clock perspective, your approach is.

We encourage you to experiment a bit to see what approaches work well and which don't. After all, if you were in charge of actually implementing this estimator for a database application, you would want to be sure you selected something that worked well across a variety of workflows!

  iii. Implement the ImprovedHyperLogLog type to use known information about HyperLogLog error estimates to produce an estimator that corrects for bias for low values of $n\text.$

    Use ./generate-csv to run your ImprovedHyperLogLog with $p = 12$ and include a plot of the data in your writeup.

    Additionally, in your writeup, summarize and justify your design decisions. What approach did you use to debias the estimate? What alternatives did you consider? Why did you choose this final version?

You've just built a high-quality cardinality estimator from scratch. Nicely done!

If you're interested in exploring ways to further improve the estimator, check out the following:

  • The original HyperLogLog paper recommends using a technique called linear counting when $n$ is small. This simple approach works very well in the small-$n$ range and is easy to implement once you already have a basic HyperLogLog implementation coded up.

  • Heule et al. give a simple strategy for reducing the space usage of HyperLogLog in the case where $n$ is small, based on the idea that allocating a full suite of $m$ 6-bit integers isn't a good idea when $n$ is much smaller than $m\text.$

  • Karppa and Pagh found a simple but effective way to reduce the number of bits per integer from six to three in what they dubbed the "HyperLogLogLog" estimator. The core idea is not too complex: when the integer values all start to get large, subtract out some amount from each of them and store that value once, plus some extra information to handle outlier integers.
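For reference, the linear counting estimate mentioned above is a one-liner: if $V$ of the $m$ registers are still zero, it returns

\[\hat{n}_{\text{LC}} = m \ln \frac{m}{V}\text.\]

Intuitively, each distinct element leaves its register nonzero with probability roughly $1/m\text,$ so after $n$ distinct elements we expect $V/m \approx e^{-n/m}\text;$ solving for $n$ gives the formula.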