\documentclass[12pt]{exam}
\usepackage[utf8]{inputenc} % For UTF8 source encoding.
\usepackage{amsmath} % For displaying math equations.
\usepackage{amsfonts} % For mathematical fonts (like \mathbb{E}!).
\usepackage{upgreek} % For upright Greek letters, such as \upvarphi.
\usepackage{wasysym} % For additional glyphs (like \smiley!).
\usepackage{mathrsfs} % For script text (hash families and universes).
% For document margins.
\usepackage[left=.8in, right=.8in, top=1in, bottom=1in]{geometry}
\usepackage{lastpage} % For a reference to the number of pages.
% TODO: Enter your name here :)
\newcommand*{\authorname}{[Your name goes here]}
\newcommand*{\psetnumber}{5}
\newcommand*{\psetdescription}{Randomized Data Structures}
\newcommand*{\duedate}{Tuesday, May 28th}
\newcommand*{\duetime}{2:30 pm}
% Fancy headers and footers
\headrule
\firstpageheader{CS166\\Spring 2019}{Problem Set \psetnumber\\\psetdescription}{Due: \duedate\\at \duetime}
\runningheader{CS166}{Problem Set \psetnumber}{\authorname}
\footer{}{\footnotesize Page \thepage\ of \pageref{LastPage}}{}
% Exam questions.
\newcommand{\Q}[1]{\question{\large\textbf{#1}}}
\qformat{} % Remove formatting from exam questions.
% Useful macro commands.
\newcommand*{\bigtheta}[1]{\Theta\left( #1 \right)}
\newcommand*{\bigo}[1]{O \left( #1 \right)}
\newcommand*{\bigomega}[1]{\Omega \left( #1 \right)}
\newcommand*{\prob}[1]{\Pr \left[ #1 \right]}
\newcommand*{\ex}[1]{\operatorname{E} \left[ #1 \right]}
\newcommand*{\var}[1]{\operatorname{Var} \left[ #1 \right]}
\newcommand*{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand*{\HH}{\mathscr{H}} % Family of hash functions.
\newcommand*{\UU}{\mathscr{U}} % Universe.
\newcommand*{\eps}{\varepsilon} % Epsilon.
% Custom formatting for problem parts.
\renewcommand{\thepartno}{\roman{partno}}
\renewcommand{\partlabel}{\thepartno.}
% Framed answers.
\newcommand{\answerbox}[1]{
\begin{framed}
\hspace{\fill}
\vspace{#1}
\end{framed}}
\printanswers
\setlength\answerlinelength{2in} \setlength\answerskip{0.3in}
\begin{document}
\title{CS166 Problem Set \psetnumber: \psetdescription}
\author{\authorname}
\date{}
\maketitle
\thispagestyle{headandfoot}
\begin{questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem One: Final Details on Count Sketches (3 Points)}
In our analysis of count sketches from lecture, we made the following simplification when determining the variance of our estimate:
\[
\var{\sum_{j \ne i}{\mathbf{a}_j s(x_i) s(x_j) X_j}} = \sum_{j \ne i}\var{\mathbf{a}_j s(x_i) s(x_j) X_j}
\]
In general, the variance of a sum of random variables is not the same as the sum of their variances. That only works in the case where all those random variables are \textbf{\textit{pairwise uncorrelated}}, as you saw on Problem Set Zero.
Prove that any two terms in the above summation are uncorrelated under the assumption that both $s$ and $h$ are drawn uniformly and independently from separate 2-independent families of hash functions.
As a refresher, two random variables $X$ and $Y$ are uncorrelated if $\ex{XY} = \ex{X}\ex{Y}$.
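In case it's helpful, recall where this condition comes from. Writing $Y_j$ for the $j$th term of a sum (our notation, not anything specific to count sketches), expanding the variance gives
\[
\var{\sum_{j} Y_j} = \sum_{j} \var{Y_j} + \sum_{j \ne \ell} \left( \ex{Y_j Y_\ell} - \ex{Y_j}\ex{Y_\ell} \right),
\]
so the cross terms vanish exactly when every pair of terms is uncorrelated.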
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\Q{Problem Two: Cardinality Estimation (12 Points)}
A \textbf{\emph{cardinality estimator}} is a data structure that supports the following two operations:
\begin{itemize}
\item $ds.\texttt{see}(x)$, which records that the value $x$ has been seen; and
\item $ds.\texttt{estimate}()$, which returns an estimate of the number of \textit{\textbf{distinct}} values we've seen.
\end{itemize}
Imagine that we are given a data stream consisting of elements drawn from some universe $\UU$. Not all elements of $\UU$ will necessarily be present in the stream, and some elements of $\UU$ may appear multiple times. We'll denote the number of unique elements in the stream as $\textbf{\textit{F}}_\mathbf{0}$.
Here's an initial data structure for cardinality estimation. We'll begin by choosing a hash function $h$ uniformly at random from a family of 2-independent hash functions $\HH$, where every function in $\HH$ maps $\UU$ to the open interval of real numbers $(0, 1)$.
Our data structure works by hashing the elements it \texttt{see}s using $h$ and doing some internal bookkeeping to keep track of the $k$th-smallest unique hash code seen so far. The fact that we're tracking \textit{unique} hash codes is important; we'd like it to be the case that if we call $\texttt{see}(x)$ multiple times, it has the same effect as just calling $\texttt{see}(x)$ a single time. (The fancy term for this is that the \texttt{see} operation is \textbf{\emph{idempotent}}.) We'll implement $\texttt{estimate}()$ by returning the value $\hat{F} = \frac{k}{h_k}$, where $h_k$ denotes the $k$th smallest hash code seen.
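To make the bookkeeping concrete, here is a minimal Python sketch of the estimator described above. The class name and internal layout are our own choices, not part of the problem; as a stand-in for drawing $h$ from a 2-independent family, it uses the standard linear hash $h(x) = (ax + b) \bmod p$ over a Mersenne prime with $a, b$ chosen uniformly at random, scaled into $(0, 1)$, which is pairwise independent for integer keys below $p$.

```python
import heapq
import random

class KMVEstimator:
    """Sketch of the k-th-smallest-hash-code estimator described above."""
    _PRIME = (1 << 61) - 1  # Mersenne prime; keys are assumed to be ints below this.

    def __init__(self, k):
        self.k = k
        # Random linear hash h(x) = (ax + b) mod p, a 2-independent family.
        self.a = random.randrange(self._PRIME)
        self.b = random.randrange(self._PRIME)
        self.seen = set()  # hash codes currently retained, for idempotence
        self.heap = []     # max-heap (via negation) of the k smallest codes

    def _hash(self, x):
        # Scale the hash into the open interval (0, 1).
        return ((self.a * x + self.b) % self._PRIME + 1) / (self._PRIME + 2)

    def see(self, x):
        h = self._hash(x)
        if h in self.seen:
            return  # idempotent: re-seeing a value changes nothing
        if len(self.heap) < self.k:
            self.seen.add(h)
            heapq.heappush(self.heap, -h)
        elif h < -self.heap[0]:
            # h displaces the current k-th smallest code.
            evicted = -heapq.heappushpop(self.heap, -h)
            self.seen.discard(evicted)
            self.seen.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return float(len(self.heap))  # fewer than k distinct values: exact count
        return self.k / -self.heap[0]     # F-hat = k / (k-th smallest hash code)
```

Note that when fewer than $k$ distinct values have been seen, this sketch just returns the exact count rather than $k / h_k$; the interesting regime is when the heap is full.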
%%%%%%%%%%%%%%%%%%
\begin{parts}
\part Explain, intuitively, why $\hat{F}$ is likely to be close to the number of distinct elements.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%%%%%%%
\uplevel{Let $\eps \in (0, 1)$ be some accuracy parameter that's provided to us.}
\part Prove that $\prob{\hat{F} > (1+\eps) F_0} \le \frac{2}{k\eps^2}$. This shows that by tuning $k$, we can make it unlikely that we overestimate the true value of $F_0$.
As a hint, use the techniques we covered in class: indicator variables and some sort of concentration inequality. What has to happen for the estimate $\hat{F}$ to be too large? As a reminder, your hash function is only assumed to be 2-independent, so you can't assume it behaves like a truly random function; you may only use the properties of 2-independent hash functions.
\begin{solution}
Your solution goes here!
\end{solution}
\uplevel{Using a proof analogous to the one you did in part (ii) of this problem, we can also prove that
\[
\prob{\frac{k}{h_k} < (1-\eps) F_0} \le \frac{2}{k\eps^2}.
\]
The proof is so similar to the one from part (ii) that we won't ask you to write it up. However, these two bounds collectively imply that by tuning $k$, you can make it fairly likely that your estimate lands within $\pm \eps F_0$ of the true value! All that's left to do now is to tune our confidence in that answer.}
%%%%%%%%%%%%%%%%%%
\part Using the above data structure as a starting point, design a cardinality estimator with tunable parameters $\eps \in (0, 1)$ and $\delta \in (0, 1)$ such that:
\begin{itemize}
\item $\texttt{see}(x)$ takes time $\bigo{\log \eps^{-1} \cdot \log \delta^{-1}}$;
\item $\texttt{estimate}()$ takes time $\bigo{\log \delta^{-1}}$, and if we let $C$ denote the estimate returned this way, then
\[
\prob{|C - F_0| > \eps F_0} < \delta; \textrm{and}
\]
\item the total space usage is $\bigtheta{\eps^{-2} \log \delta^{-1}}$.
\end{itemize}
\begin{solution}
Your solution goes here!
\end{solution}
\end{parts}
As a reminder, $\log \eps^{-p} = \Theta(\log \eps^{-1})$ for any constant $p$.
There are a number of really beautiful approaches out there for building cardinality estimators. Check out the impressive HyperLogLog estimator, which is widely used in practice, for a totally different approach to this problem!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\Q{Problem Three: Hashing IRL (10 Points)}
This one is mostly, but not all, coding! In addition to the brief writeup below, make sure to submit your code on \texttt{myth}.
\begin{parts}
\part The theory predicts that linear probing and cuckoo hashing degenerate rapidly beyond a certain load factor. How accurate is that in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\part How does second-choice hashing compare to chained hashing across the range of load factors? Why do you think that is?
\begin{solution}
Your solution goes here!
\end{solution}
\part How does Robin Hood hashing compare to linear probing across the range of load factors? Why do you think that is?
\begin{solution}
Your solution goes here!
\end{solution}
\part In theory, cuckoo hashing requires much stronger classes of hash functions than the other types of hash tables we've covered. Do you see this in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\part In theory, cuckoo hashing's performance rapidly degrades as the load factor approaches $\alpha = \frac{1}{2}$. Do you see that in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\end{parts}
\end{questions}
\end{document}