\documentclass[12pt]{exam}

\usepackage[utf8]{inputenc}  % For UTF8 source encoding.
\usepackage{amsmath}  % For displaying math equations.
\usepackage{amsfonts} % For mathematical fonts (like \mathbb{E}!).
\usepackage{upgreek}  % For upright Greek letters, such as \upvarphi.
\usepackage{wasysym}  % For additional glyphs (like \smiley!).
% For document margins.
\usepackage[left=.8in, right=.8in, top=1in, bottom=1in]{geometry}
\usepackage{lastpage} % For a reference to the number of pages.

% TODO: Enter your name here :)
\newcommand*{\authorname}{[Your name goes here]}

\newcommand*{\psetnumber}{2}
\newcommand*{\psetdescription}{String Data Structures}
\newcommand*{\duedate}{Tuesday, April 24th}
\newcommand*{\duetime}{2:30 pm}

% Fancy headers and footers
\headrule
\firstpageheader{CS166\\Spring 2018}{Problem Set \psetnumber\\\psetdescription}{Due: \duedate\\at \duetime}
\runningheader{CS166}{Problem Set \psetnumber}{\authorname}
\footer{}{\footnotesize{Page \thepage\ of \pageref{LastPage}}}{}

% Exam questions.
\newcommand{\Q}[1]{\question{\large{\textbf{#1}}}}
\qformat{}  % Remove formatting from exam questions.

% Useful macro commands.
\newcommand*{\ex}{\mathbb{E}}
\newcommand*{\bigtheta}[1]{\Theta\left( #1 \right)}
\newcommand*{\bigo}[1]{O \left( #1 \right)}
\newcommand*{\bigomega}[1]{\Omega \left( #1 \right)}
\newcommand*{\prob}[1]{\text{Pr} \left[ #1 \right]}
\newcommand*{\var}[1]{\text{Var} \left[ #1 \right]}

% Custom formatting for problem parts.
\renewcommand{\thepartno}{\roman{partno}}
\renewcommand{\partlabel}{\thepartno.}

% Framed answers.
\newcommand{\answerbox}[1]{
\begin{framed}
\hspace{\fill}
\vspace{#1}
\end{framed}}

\printanswers

\setlength\answerlinelength{2in} \setlength\answerskip{0.3in}

\begin{document}
\title{CS166 Problem Set \psetnumber: \psetdescription}
\author{\authorname}
\date{}
\maketitle
\thispagestyle{headandfoot}

\begin{questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem One: Time/Space Tradeoffs in Tries (10 Points)}

If you code up a trie, you'll quickly run into a design choice: how do you represent the child pointers associated with each node? This decision shouldn't be taken lightly; it will have an impact on the amount of time required to perform operations on the trie and space used to represent the trie. In this problem, you'll analyze the time/space tradeoffs involved in three different representations.

Let's introduce some terminology that we'll use throughout this problem. When referring to a trie, we'll have $\mathbf{\Sigma}$ represent the set of characters that can appear in the strings we'll be storing. An assumption that's common throughout Theoryland that's also usually true in practice is that, given some character $x \in \Sigma$, it's possible to map $x$ to some integer in $\{0, 1, 2, \dots, |\Sigma| - 1\}$ in time $O(1)$, making it possible to do index-based lookups on characters efficiently. We'll also assume that each character fits into a machine word.

Additionally, we'll have \textit{\textbf{m}} denote the number of total nodes in the trie, and we'll have \textit{\textbf{n}} denote the number of total words stored in the trie. (Make sure you see why these aren't necessarily the same!)

%%%%%%%%%%%%

First, suppose you decide to store the child pointers in each trie node as an array of size $|\Sigma|$. Each pointer in that array points to the child labeled with that character and is null if there is no child with that label.

\begin{parts}

\part Give a big-$\Theta$ expression for the total memory usage of this trie. Then, give the big-O runtime of looking up a string of length $L$ in a trie represented this way. Briefly justify your answers.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

The above approach is one of the more common ways of implementing tries because of the speed of performing a lookup. However, as you can see, the memory usage isn't great.

Now, let's imagine that you decide to store the child pointers in each trie node in a binary search tree. Each node in the BST stores (in addition to its left and right fields) a pair of a character and a pointer to the trie node at the end of the edge labeled with that character. (This representation of a trie is sometimes
called a \textit{\textbf{ternary search tree}}.)

\part Find an asymptotically tight (big-$\Theta$) expression for the space used by a ternary search tree. As a way of checking your work, your overall expression should have no dependence on $|\Sigma|$. (Whoa! That's unexpected!)

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

The cost of performing a search in a trie represented this way depends on the particular shape of the BSTs stored in each node.

\part Suppose each node in the ternary search tree stores its children in a balanced BST. What is the runtime (in big-$O$ notation) of performing a lookup of a string of length $L$ in such a data structure?

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

As you can see here, there's a time/space tradeoff involved in using ternary search trees as opposed to an array-based trie. The (asymptotic) memory usage is lower for the ternary search tree than for the array-based trie, but the lookups are slightly slower.

It's relatively common to see tries built for medium-sized collections of strings built from very large alphabets. For example, suppose you want to store your contacts in a trie. Each person's name would probably be represented as a Unicode string (where $|\Sigma|$ is around 140,000), but you'd probably only have a couple
hundred contacts (n would be around 500 or so). If you have a bunch of friends from different parts of the world and try representing their names in their native alphabets, you're really going to feel the effect of a $|\Sigma|$ or even log $|\Sigma|$ term! For cases like these, where $|\Sigma|$ is huge but n is not too large, there's a very clever way of representing a trie that completely eliminates any runtime or space dependence on $\Sigma$. The technique involves a surprisingly useful type of binary search tree called a \textit{\textbf{weight-balanced tree}}.

Suppose that you have a set of keys $x_1,\dots,x_k$ that you'd like to store in a binary search tree. Each key is associated with a positive weight $w_1,\dots,w_k$. We'll define the \textit{\textbf{weight}} of a BST to be the sum of the weights of all the keys in that tree. A \textit{\textbf{weight-balanced tree}} is then defined as follows: the root node is chosen so that the difference in weights between the left and right subtrees is as close to zero as possible, and the left and right subtrees are then recursively constructed using the same rule.

\part Suppose that the total weight in a weight-balanced tree is $W$. Prove that there is a universal constant $\varepsilon$ (that is, a constant that doesn't depend on $W$ or the specific tree chosen) where $0 < \varepsilon < 1$ and the left subtree and right subtree of a weight-balanced tree each have weight at most $\varepsilon W$.

This one is a little trickier than the preceding problems. You might have some luck tackling this one by first trying to find the most imbalanced weight-balanced tree that you can, then seeing if you can use that intuition to put together a formal proof.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

Imagine that we start with a ternary search tree, which already stores child pointers in a BST, but instead of using a regular balanced BST, we instead store the child pointers for each node in weight-balanced trees where the weight associated with each BST node is the number of strings stored in the subtrie associated with the given child pointer.

\part Prove that looking up a string of length $L$ in a trie represented this way takes time $O(L + \log n)$. As a hint, imagine that you have two bank accounts, one with $O(L)$ dollars in it and one with $O(\log n)$ dollars with it. Explain how to charge each unit of work done by a search in a trie represented this way to one of the two accounts without ever overdrawing your funds.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

So now we have three ways of representing a trie.
\begin{itemize}
\item You can use a regular old array for the child pointers. That gives you crazy fast lookups, but the memory usage grows pretty quickly with $|\Sigma|$. This would be great in a setting like computational genomics where $|\Sigma|$ is tiny.
\item You can store all the child pointers in a balanced BST. This reduces the space down to a much more manageable amount, but slows down your searches in the event that your alphabet gets large.
\item You can store all the child pointers in a weight-balanced tree. This has all the space advantages of the above approach, but makes lookups much faster in the case where the trie doesn't have too many strings in it.
\end{itemize}

We'll spin back around to weight-balanced trees in a few weeks; they have all sorts of applications!

There are a bunch of other ways you can encode tries, each of which has its advantages and disadvantages. If you liked this problem, consider looking into this as a topic for your final project.

\end{parts}
\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\Q{Problem Two: Longest $k$-Repeated Substrings (7 Points)}

\begin{parts}

In Thursday's lecture, we saw how to find the longest repeated substring of a string of length $m$ in time $O(m)$ by using a suffix tree. (Remember that those repeats can overlap, so in the string banana, the longest repeated substring would be ``ana", not ``na".)

\part Design and describe an $O(m)$-time algorithm for the longest repeated substring problem that uses suffix arrays and LCP information instead of suffix trees.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

The \textit{\textbf{longest k-repeated substring}} problem is the following: given a string $T$ and a number $k$, what is the longest substring of $T$ that appears at least $k$ times in $T$?

\part Design and describe an $O(m)$-time, suffix-tree-based algorithm for solving the longest $k$-repeated substring problem. Note that there should be no runtime dependence on $k$.

\begin{solution}
Your solution goes here!
\end{solution}

%%%%%%%%%%%%

\part Design and describe an $O(m)$-time, suffix-array-based algorithm for solving the longest $k$-repeated substring problem. Note that there should be no runtime dependence on $k$.

\begin{solution}
Your solution goes here!
\end{solution}

\end{parts}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\Q{Problem Three: Suffix Array Search (3 Points)}

This one is all coding! Make sure to submit your implementation on \texttt{myth}.

\Q{Problem Four: Implementing SA-IS (15 Points)}

This is one is also all coding! Make sure to submit your implementation on \texttt{myth}.

\end{questions}
\end{document}
