\documentclass[12pt]{exam}
\usepackage[utf8]{inputenc} % For UTF8 source encoding.
\usepackage{amsmath} % For displaying math equations.
\usepackage{amsfonts} % For mathematical fonts (like \mathbb{E}!).
\usepackage{upgreek} % For upright Greek letters, such as \upvarphi.
\usepackage{wasysym} % For additional glyphs (like \smiley!).
\usepackage{mathrsfs} % For script text (hash families and universes).
% For document margins.
\usepackage[left=.8in, right=.8in, top=1in, bottom=1in]{geometry}
\usepackage{lastpage} % For a reference to the number of pages.
% TODO: Enter your name(s) here :)
\newcommand*{\authorname}{[Your name(s) go(es) here]}
\newcommand*{\psetnumber}{5}
\newcommand*{\psetdescription}{Randomized Data Structures}
\newcommand*{\duedate}{Thursday, May 28}
\newcommand*{\duetime}{2:30 PM Pacific}
% Fancy headers and footers
\headrule
\firstpageheader{CS166\\Spring 2020}{Problem Set \psetnumber\\\psetdescription}{Due: \duedate\\at \duetime}
\runningheader{CS166}{Problem Set \psetnumber}{\authorname}
\footer{}{\footnotesize{Page \thepage\ of \pageref{LastPage}}}{}
% Exam questions.
\newcommand{\Q}[1]{\question{{\large\textbf{#1}}}}
\qformat{} % Remove formatting from exam questions.
% Useful macro commands.
\newcommand*{\ex}{\mathbb{E}}
\newcommand*{\bigtheta}[1]{\Theta\left( #1 \right)}
\newcommand*{\bigo}[1]{O \left( #1 \right)}
\newcommand*{\bigomega}[1]{\Omega \left( #1 \right)}
\newcommand*{\prob}[1]{\Pr \left[ #1 \right]}
\newcommand*{\var}[1]{\operatorname{Var} \left[ #1 \right]}
\newcommand*{\RMQ}{\textrm{RMQ}}
\newcommand*{\RMQcomplexity}[2]{\left< #1, #2 \right>}
\newcommand*{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand*{\HH}{\mathscr{H}} % Family of hash functions.
\newcommand*{\UU}{\mathscr{U}} % Universe.
\newcommand*{\eps}{\varepsilon} % Epsilon.
% Custom formatting for problem parts.
\renewcommand{\thepartno}{\roman{partno}}
\renewcommand{\partlabel}{\thepartno.}
% Framed answers.
\newcommand{\answerbox}[1]{
\begin{framed}
\hspace{\fill}
\vspace{#1}
\end{framed}}
% MZ
\usepackage{amsthm}
\usepackage{amssymb}
\let\oldemptyset\emptyset
\renewcommand{\emptyset}{\text{\O}}
\renewcommand\qedsymbol{$\blacksquare$}
\newenvironment{prf}{{\bfseries Proof.}}{\hfill\qedsymbol}
\newcommand{\bi}[1]{\textit{\textbf{#1}}}
\newcommand{\annotate}[1]{\textit{\textcolor{blue}{#1}}}
\usepackage{stmaryrd}
\usepackage{pgfplots}
\pgfplotsset{compat=1.15}
\makeatletter
\@namedef{ver@framed.sty}{9999/12/31}
\@namedef{opt@framed.sty}{}
\makeatother
\usepackage{minted}
\usepackage{mathtools}
\usepackage{alltt}
\printanswers
\setlength\answerlinelength{2in} \setlength\answerskip{0.3in}
\begin{document}
\title{CS166 Problem Set \psetnumber: \psetdescription}
\author{\authorname}
\date{}
\maketitle
\thispagestyle{headandfoot}
\begin{questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem One: Cardinality Estimation}
A \textbf{\emph{cardinality estimator}} is a data structure that supports the following two operations:
\begin{itemize}
\item $ds.\texttt{see}(x)$, which records that the value $x$ has been seen; and
\item $ds.\texttt{estimate}()$, which returns an estimate of the number of \textit{\textbf{distinct}} values we've seen.
\end{itemize}
Imagine that we are given a data stream consisting of elements drawn from some universe $\UU$. Not all elements of $\UU$ will necessarily be present in the stream, and some elements of $\UU$ may appear multiple times. We'll denote the number of unique elements in the stream as $\textbf{\textit{F}}_\mathbf{0}$.
Here's an initial data structure for cardinality estimation. We'll begin by choosing a hash function $h$ uniformly at random from a family of 2-independent hash functions $\HH$, where every function in $\HH$ maps $\UU$ to the open interval of real numbers $(0, 1)$. \textit{(We're aware that hashing to a uniformly-random real number poses some theoretical challenges; in practice, you'd use a slightly different strategy, but for the purposes of this problem assume this is possible.)}
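To make "choosing $h$ uniformly at random from a 2-independent family" concrete, here is a brief Python sketch of the classic construction $h(x) = ((ax + b) \bmod p)/p$ over a prime $p$, which is 2-independent for integer keys. This sketch is illustrative only and is not part of the problem: the names are our own, and strictly speaking it maps into $[0, 1)$ rather than the open interval $(0, 1)$ assumed above, a detail we gloss over.

```python
import random

# A prime comfortably larger than the universe of keys.
P = (1 << 61) - 1

def random_hash():
    """Draw one function h(x) = ((a*x + b) mod P) / P uniformly
    from the 2-independent family, returned as a closure."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) / P
```

Drawing the random coefficients $a$ and $b$ once up front, and then evaluating the resulting fixed function on every key, is exactly the "choose $h$ at random, then use it" pattern from lecture.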
Our data structure works by hashing the elements it \texttt{see}s using $h$ and doing some internal bookkeeping to keep track of the $k$th-smallest hash code out of all the hash codes seen so far, ignoring duplicates. The fact that we're tracking \textit{unique} hash codes is important; we'd like it to be the case that if we call $\texttt{see}(x)$ multiple times, it has the same effect as just calling $\texttt{see}(x)$ a single time. (The fancy term for this is that the \texttt{see} operation is \textbf{\emph{idempotent}}.) We'll implement $\texttt{estimate}()$ by returning the value $\hat{F} = \frac{k}{h_k}$, where $h_k$ denotes the $k$th smallest hash code seen.
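As a hedged sketch of the bookkeeping just described (this construction is often called a $k$-minimum-values, or KMV, estimator), the Python below tracks the $k$ smallest unique hash codes with a max-heap. The hash function here is a stand-in, a cached uniform draw per key rather than a genuine 2-independent family, just to keep the sketch self-contained; the class and method names are our own choices, not part of the problem statement.

```python
import heapq
import random

class CardinalityEstimator:
    def __init__(self, k):
        self.k = k
        self._hash_cache = {}  # stand-in for the 2-independent hash h
        self.heap = []         # negated codes: a max-heap of the k smallest
        self.codes = set()     # codes currently retained, for idempotence

    def _h(self, x):
        # Cache one uniform draw per key so duplicates hash identically.
        if x not in self._hash_cache:
            self._hash_cache[x] = random.random()
        return self._hash_cache[x]

    def see(self, x):
        code = self._h(x)
        if code in self.codes:
            return  # duplicate element: seeing it again has no effect
        if len(self.heap) < self.k:
            self.codes.add(code)
            heapq.heappush(self.heap, -code)
        elif code < -self.heap[0]:
            # Evict the current k-th smallest in favor of the new code.
            self.codes.discard(-heapq.heappushpop(self.heap, -code))
            self.codes.add(code)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)  # fewer than k distinct: count is exact
        h_k = -self.heap[0]        # k-th smallest hash code seen so far
        return self.k / h_k
```

Note that \texttt{see} is idempotent by construction: a repeated element maps to its cached code, which is either still retained (an explicit no-op) or was already evicted for being too large (so it fails the comparison against the current $k$-th smallest).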
%%%%%%%%%%%%%%%%%%
\begin{parts}
\part Explain, intuitively, why $\hat{F}$ is likely to be close to the number of distinct elements.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%%%%%%%
\newpage
\uplevel{Let $\eps \in (0, 1)$ be some accuracy parameter that's provided to us.}
\part Prove that $\prob{\hat{F} > (1+\eps) F_0} \le \frac{2}{k\eps^2}$. This shows that by tuning $k$, we can make it unlikely that we overestimate the true value of $F_0$.
As a hint, use the techniques we covered in class: indicator variables and some sort of concentration inequality. What has to happen for the estimate $\hat{F}$ to be too large? As a reminder, your hash function is only assumed to be 2-independent, so you can't assume it behaves like a truly random function; you can only rely on the properties of 2-independent hash functions.
\begin{solution}
Your solution goes here!
\end{solution}
\uplevel{Using a proof analogous to the one you did in part (ii) of this problem, we can also prove that
\[
\prob{\frac{k}{h_k} < (1-\eps) F_0} \le \frac{2}{k\eps^2}.
\]
The proof is very similar to the one you did in part (ii), so we won't ask you to write this one up. However, these two bounds collectively imply that by tuning $k$, you can make it fairly likely that you get an estimate within $\pm \eps F_0$ of the true value! All that's left to do now is to tune our confidence in our answer.}
%%%%%%%%%%%%%%%%%%
\newpage
\part Using the above data structure as a starting point, design a cardinality estimator with tunable parameters $\eps \in (0, 1)$ and $\delta \in (0, 1)$ such that:
\begin{itemize}
\item $\texttt{see}(x)$ takes time $\bigo{\mathrm{poly}(\eps^{-1}, \log \delta^{-1})}$, where $\mathrm{poly}(x, y)$ denotes ``something bounded from above by a polynomial in $x$ and $y$,'' such as $x^3y + y^2 \log x$;
\item $\texttt{estimate}()$ takes time $\bigo{\mathrm{poly}(\log \delta^{-1})}$, and if we let $C$ denote the estimate returned this way, then
\[
\prob{|C - F_0| > \eps F_0} < \delta; \quad \text{and}
\]
\item the total space usage is $\bigo{\mathrm{poly}(\eps^{-1}, \log \delta^{-1})}$.
\end{itemize}
\begin{solution}
Your solution goes here!
\end{solution}
\end{parts}
There are a number of really beautiful approaches out there for building cardinality estimators. Check out the impressive HyperLogLog estimator for an example of a totally different approach to solving this problem that's used widely in practice!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\Q{Problem Two: Hashing IRL}
This one is mostly, but not all, coding! In addition to the brief writeup below, make sure to submit your code on \texttt{myth}.
\begin{parts}
\part The theory predicts that linear probing and cuckoo hashing degenerate rapidly beyond a certain load factor. How accurate is that in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\part How does Robin Hood hashing compare to linear probing across the range of load factors? Why do you think that is?
\begin{solution}
Your solution goes here!
\end{solution}
\part In theory, cuckoo hashing requires much stronger classes of hash functions than the other types of hash tables we've covered. Do you see this in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\part In theory, cuckoo hashing's performance rapidly degrades as the load factor approaches $\alpha = \frac{1}{2}$. Do you see that in practice?
\begin{solution}
Your solution goes here!
\end{solution}
\end{parts}
\end{questions}
\end{document}