\documentclass[12pt]{exam}

\usepackage[utf8]{inputenc}  % For UTF8 source encoding.
\usepackage{amsmath}  % For displaying math equations.
\usepackage{amsfonts} % For mathematical fonts (like \mathbb{E}!).
\usepackage{upgreek}  % For upright Greek letters, such as \upvarphi.
\usepackage{wasysym}  % For additional glyphs (like \smiley!).
\usepackage{mathrsfs} % For script text (hash families and universes).
% For document margins.
\usepackage[left=.8in, right=.8in, top=1in, bottom=1in]{geometry}
\usepackage{lastpage} % For a reference to the number of pages.

% TODO: Enter your name here :)
\newcommand*{\authorname}{[Your name goes here]}

\newcommand*{\psetnumber}{5}
\newcommand*{\psetdescription}{Randomized Data Structures}
\newcommand*{\duedate}{Thursday, May 24th}
\newcommand*{\duetime}{2:30 pm}

% Fancy headers and footers
\headrule
\firstpageheader{CS166\\Spring 2018}{Problem Set \psetnumber\\\psetdescription}{Due: \duedate\\at \duetime}
\runningheader{CS166}{Problem Set \psetnumber}{\authorname}
\footer{}{\footnotesize{Page \thepage\ of \pageref{LastPage}}}{}

% Exam questions.
\newcommand{\Q}[1]{\question{\large{\textbf{#1}}}}
\qformat{}  % Remove formatting from exam questions.

% Useful macro commands.
\newcommand*{\bigtheta}[1]{\Theta\left( #1 \right)}
\newcommand*{\bigo}[1]{O \left( #1 \right)}
\newcommand*{\bigomega}[1]{\Omega \left( #1 \right)}
\newcommand*{\prob}[1]{\text{Pr} \left[ #1 \right]}
\newcommand*{\ex}[1]{\text{E} \left[ #1 \right]}
\newcommand*{\var}[1]{\text{Var} \left[ #1 \right]}

\newcommand*{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand*{\HH}{\mathscr{H}}   % Family of hash functions.
\newcommand*{\UU}{\mathscr{U}}   % Universe.
\newcommand*{\eps}{\varepsilon}  % Epsilon.


% Custom formatting for problem parts.
\renewcommand{\thepartno}{\roman{partno}}
\renewcommand{\partlabel}{\thepartno.}

% Framed answers.
\newcommand{\answerbox}[1]{
\begin{framed}
\hspace{\fill}
\vspace{#1}
\end{framed}}

\printanswers

\setlength\answerlinelength{2in} \setlength\answerskip{0.3in}

\begin{document}
\title{CS166 Problem Set \psetnumber: \psetdescription}
\author{\authorname}
\date{}
\maketitle
\thispagestyle{headandfoot}

\begin{questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem One: Count Sketches with 2-Independent Hashing}

In our analysis of count sketches from lecture, we made the following simplification when determining the variance of our estimate:
\[
	\var{\sum_{j \ne i}{\mathbf{a}_j s(x_i) s(x_j) X_j}} = \sum_{j \ne i}{\var{\mathbf{a}_j s(x_i) s(x_j) X_j}}
\]

In general, the variance of a sum is not the sum of the variances, so this step required some justification. The justification we provided in lecture was that if we let $s$ and $h$ be 3-independent hash functions, then the terms in the sum were pairwise independent. However, to write the variance of a sum as a sum of variances, we don't actually need the terms of the sum to be independent of one another. They just need to be \textbf{\emph{uncorrelated}}.

Prove that any two terms in the above summation are uncorrelated under the assumption that both $s$ and $h$ are drawn uniformly and independently from 2-independent families of hash functions. This shows that 2-independent hashing is sufficient for count sketching, and does so in a way that doesn't require us to redo any of the remaining analysis.
As a refresher, two random variables $X$ and $Y$ are uncorrelated if $\ex{XY} = \ex{X}\ex{Y}$.
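To see why uncorrelated terms suffice, recall the standard expansion of the variance of a sum (a general identity, stated here for reference): for random variables $Y_1, \dots, Y_n$,
\[
	\var{\sum_{j=1}^{n} Y_j} = \sum_{j=1}^{n} \var{Y_j} + \sum_{j \ne \ell} \left( \ex{Y_j Y_\ell} - \ex{Y_j} \ex{Y_\ell} \right),
\]
so the variance of the sum equals the sum of the variances exactly when every cross term vanishes, which is precisely the statement that the terms are pairwise uncorrelated.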

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\Q{Problem Two: Cardinality Estimation}

A \textbf{\emph{cardinality estimator}} is a data structure that supports the following two operations:
\begin{itemize}
	\item $ds.\texttt{see}(x)$, which records that the value $x$ has been seen; and
	\item $ds.\texttt{estimate}()$, which returns an estimate of the number of distinct $x$'s we've seen.
\end{itemize}

Imagine that we are given a data stream consisting of elements drawn from some universe $\UU$. Not all elements of $\UU$ will necessarily be present in the stream, and some elements of $\UU$ may appear multiple times. We'll denote the number of unique elements in the stream as $F_0$.

Here's an initial data structure for cardinality estimation. We'll begin by choosing a hash function $h$ uniformly at random from a family of 2-independent hash functions $\HH$ from $\UU$ to the open interval of real numbers $(0, 1)$. For simplicity's sake, we'll assume that there are no hash collisions, which isn't too unreasonable given that the codomain is infinite.

Our data structure works by hashing the elements it \texttt{see}s using $h$ and doing some internal bookkeeping to keep track of the $k$th-smallest unique hash code seen so far. The fact that we're tracking unique hash codes is important; we'd like it to be the case that if we call $\texttt{see}(x)$ multiple times, it has the same effect as just calling $\texttt{see}(x)$ a single time. (The fancy term for this is that the \texttt{see} operation is \textbf{\emph{idempotent}}.) We'll implement $\texttt{estimate}()$ by returning $\frac{k}{h_k}$, where $h_k$ denotes the $k$th smallest hash code seen so far.

%%%%%%%%%%%%%%%%%%
\begin{parts}
\part Explain, intuitively, why you'd expect $\frac{k}{h_k}$ to be a good estimate of $F_0$. As a hint, think about two different ways of counting up how many elements should have hash codes less than $h_k$.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%%%%%%%
\part Let $\eps \in (0, 1)$ be some accuracy parameter that's provided to us.

Prove that $\prob{\frac{k}{h_k} > (1+\eps) F_0} \le \frac{4}{k\eps^2}$. This shows that by tuning $k$, we can make it unlikely that we overestimate the true value of $F_0$.

\begin{solution}
Your solution goes here!
\end{solution}

Some hints and things to think about as you work through the above problem:
\begin{itemize}
	\item At some point you should end up with an expression involving $(1+\eps)^{-1}$. We strongly recommend applying the inequality $(1 + \eps)^{-1} \le 1 - \frac{\eps}{2}$, which holds for any $\eps \in (0, 1)$.
	\item Find a random variable $X$ that's a sum of indicator variables and satisfies \[\prob{\frac{k}{h_k} > (1+\eps) F_0} \le \prob{X > k}.\]
    \item Use Chebyshev's inequality.
\end{itemize}

Using a proof analogous to the one you did in part (ii) of this problem, we can also prove that
\[
	\prob{\frac{k}{h_k} < (1-\eps) F_0} \le \frac{2}{k\eps^2}. 
\]

Since the proof is so similar, we won't ask you to write this one up. However, these two bounds collectively imply that by tuning $k$, you can make it fairly likely that your estimate lands within $\pm \eps F_0$ of the true value! All that's left to do now is to tune our confidence in the answer.

%%%%%%%%%%%%%%%%%%
\part Using the above data structure as a starting point, design a data structure with tunable parameters $\eps \in (0, 1)$ and $\delta \in (0, 1)$ such that:
\begin{itemize}
	\item $\texttt{see}(x)$ takes time $\bigo{\log \eps^{-1} \cdot \log \delta^{-1}}$;
	\item $\texttt{estimate}()$ takes time $\bigo{\log \delta^{-1}}$, and if we let $C$ denote the estimate returned this way, then 
\[
    	\prob{(1-\eps) F_0 \le C \le (1+\eps) F_0} \ge 1 - \delta; \text{ and}
\]
    \item the total space usage is $\bigtheta{\eps^{-2} \log \delta^{-1}}$.
\end{itemize}

\begin{solution}
Your solution goes here!
\end{solution}

\end{parts}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\Q{Problem Three: Hashing in the Real World}

This one is mostly, but not all, coding! In addition to the brief writeup below, make sure to submit your code on \texttt{myth}.

\begin{parts}
\part The theory predicts that linear probing and cuckoo hashing degenerate rapidly beyond a certain load factor. How accurate is that in practice?

\begin{solution}
Your solution goes here!
\end{solution}

\part How does second-choice hashing compare to chained hashing across the range of load factors? Why do you think that is?

\begin{solution}
Your solution goes here!
\end{solution}

\part How does Robin Hood hashing compare to linear probing across the range of load factors? Why do you think that is?

\begin{solution}
Your solution goes here!
\end{solution}

\part In theory, cuckoo hashing requires much stronger classes of hash functions than the other types of hash tables we've covered. Do you see this in practice?

\begin{solution}
Your solution goes here!
\end{solution}

\part In theory, cuckoo hashing's performance rapidly degrades as the load factor approaches $\alpha = \frac{1}{2}$. Do you see that in practice?

\begin{solution}
Your solution goes here!
\end{solution}

\end{parts}

\end{questions}
\end{document}
