\documentclass[12pt]{exam}
\usepackage[utf8]{inputenc} % For UTF8 source encoding.
\usepackage{amsmath} % For displaying math equations.
\usepackage{amsfonts} % For mathematical fonts (like \mathbb{E}!).
\usepackage{upgreek} % For upright Greek letters, such as \upvarphi.
\usepackage{wasysym} % For additional glyphs (like \smiley!).
% For document margins.
\usepackage[left=.8in, right=.8in, top=1in, bottom=1in]{geometry}
\usepackage{lastpage} % For a reference to the number of pages.
% TODO: Enter your name here :)
\newcommand*{\authorname}{[Your name goes here]}
\newcommand*{\psetnumber}{2}
\newcommand*{\psetdescription}{String Data Structures}
\newcommand*{\duedate}{Thursday, April 25th}
\newcommand*{\duetime}{2:30 pm}
% Fancy headers and footers
\headrule
\firstpageheader{CS166\\Spring 2019}{Problem Set \psetnumber\\\psetdescription}{Due: \duedate\\at \duetime}
\runningheader{CS166}{Problem Set \psetnumber}{\authorname}
\footer{}{\footnotesize{Page \thepage\ of \pageref{LastPage}}}{}
% Exam questions.
\newcommand{\Q}[1]{\question{\large{\textbf{#1}}}}
\qformat{} % Remove formatting from exam questions.
% Useful macro commands.
\newcommand*{\ex}{\mathbb{E}}
\newcommand*{\bigtheta}[1]{\Theta\left( #1 \right)}
\newcommand*{\bigo}[1]{O \left( #1 \right)}
\newcommand*{\bigomega}[1]{\Omega \left( #1 \right)}
\newcommand*{\prob}[1]{\text{Pr} \left[ #1 \right]}
\newcommand*{\var}[1]{\text{Var} \left[ #1 \right]}
% Custom formatting for problem parts.
\renewcommand{\thepartno}{\roman{partno}}
\renewcommand{\partlabel}{\thepartno.}
% Framed answers.
\newcommand{\answerbox}[1]{
\begin{framed}
\hspace{\fill}
\vspace{#1}
\end{framed}}
% MZ
\makeatletter
\@namedef{ver@framed.sty}{9999/12/31}
\@namedef{opt@framed.sty}{}
\makeatother
\usepackage{minted}
\printanswers
\setlength\answerlinelength{2in} \setlength\answerskip{0.3in}
\begin{document}
\title{CS166 Problem Set \psetnumber: \psetdescription}
\author{\authorname}
\date{}
\maketitle
\thispagestyle{headandfoot}
\begin{questions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem One: Time/Space Tradeoffs in Tries (10 Points)}
If you code up a trie, you'll quickly run into a design choice: how do you represent the child pointers associated with each node? This decision shouldn't be taken lightly; it will have an impact on the amount of time required to perform operations on the trie and on the space used to encode the trie. In this problem, you'll analyze the time/space tradeoffs involved in three different trie representations.
Here's some notation we'll use throughout this problem:
\begin{itemize}
\item We'll have $k$ denote the total number of words stored in our trie.
\item We'll have $t$ denote the total number of nodes in the trie.
\item We'll have $\Sigma$ denote the alphabet from which the strings in our trie are drawn.
\end{itemize}
An assumption that's common throughout Theoryland that's also usually true in practice is that, given some character $x \in \Sigma$, it's possible to map $x$ to some integer in $\{0, 1, 2, \dots, |\Sigma| - 1\}$ in time $O(1)$, making it possible to do index-based lookups on characters efficiently. We'll also assume that each character fits comfortably into a machine word.
For the purposes of this problem, we’ll imagine that each node in the trie is represented with some sort of structure that looks like this:
\begin{center}
\begin{minipage}{0.5\textwidth}
\begin{minted}{c++}
struct TrieNode {
bool isWord;
Collection children;
};
\end{minted}
\end{minipage}
\end{center}
%%%%%%%%%%%%
First, suppose you decide to store the child pointers in each trie node as an array of size $|\Sigma|$. Each pointer in that array points to the child labeled with that character and is null if there is no child with that label.
\begin{parts}
\part Give a big-$\Theta$ expression for the total memory usage of this trie. Briefly justify your answer.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\part Give the big-O runtime of determining whether a string of length $L$ in contained in a trie when the trie's children are represented as an array as in part (i). Briefly justify your answer.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\uplevel{Now, let's imagine that you decide to store the child pointers in each trie node in a binary search tree. (A trie represented this way is sometimes called a \textbf{\textit{ternary search tree}}.)}
\part Give a big-$\Theta$ expression for the space used by a ternary search tree. As a way of checking your work, your overall expression should have no dependence on $|\Sigma|$. (Whoa! That's unexpected!) Briefly justify your answer.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\part Suppose each node in the ternary search tree stores its children in a balanced BST. Briefly explain why the cost of a lookup in the ternary search tree is O$(L \log |\Sigma|)$.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\uplevel{As you can see here, there's a time/space tradeoff involved in using ternary search trees as opposed to an array-based trie. The memory usage is lower for the ternary search tree than for the array-based trie, but the lookups are slightly slower.}
\uplevel{If you’re working in computational biology where $\Sigma$ is often $\texttt{\{A, C, T, G\}}$, then the cost of an extra factor of $|\Sigma|$ or $\log |\Sigma|$ isn't going to be too big. On the other hand, if you're working with general text, where $\Sigma$ might be all of Unicode, including emojis, the impact of a $|\Sigma|$ or $\log |\Sigma|$ term can be enormous. Are there ways of representing tries where neither the time nor space depends on $|\Sigma|$?}
\uplevel{Amazingly, the answer is yes, using a tool called a \textit{\textbf{weight-balanced tree}}. A weight-balanced tree is a binary search tree in which each key $x_i$ has an associated weight $w_i > 0$. A weight-balanced tree is then defined as follows: the root node is chosen so that the difference in weights between the left and right subtrees is as close to zero as possible, and the left and right subtrees are then recursively constructed using the same rule.}
\part Draw a weight-balanced tree containing the following key/weight pairs.
$$(a, 2)~~~(b, 7)~~~(c, 1)~~~(d, 8)~~~(e, 4)~~~(f, 5)~~~(g, 9)$$
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\part Suppose that the total weight in a weight-balanced tree is $W$. Prove that there is a universal constant $\varepsilon$ (that is, a constant that doesn't depend on $W$ or the specific tree chosen) where $0 < \varepsilon < 1$ and the left subtree and right subtree of a weight-balanced tree each have weight at most $\varepsilon W$.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\uplevel{Let's now bring things back around to tries. Imagine that we start with a ternary search tree, which already stores child pointers in a BST, but instead of using a regular balanced BST, we instead store the child pointers for each node in weight-balanced trees where the weight associated with each BST node is the number of strings stored in the subtrie associated with the given child pointer.}
\uplevel{For example, consider the trie from Slide 10 of our lecture on suffix trees. Focus on the node associated with the word \texttt{ant}. Then its child labeled \texttt{e} would have weight 3, since that subtrie contains three words, and its child labeled \texttt{i} would have weight 1, since that subtrie contains just one word.}
\part Prove that looking up a string of length $L$ in a trie represented this way takes time O$(L + \log k)$. As a hint, imagine that you have two bank accounts, one with O$(L)$ dollars in it and one with O$(\log k)$ dollars with it. Explain how to charge each unit of work done by a search in a trie represented this way to one of the two accounts without ever overdrawing your funds.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\uplevel{Weight-balanced trees have a bunch of other wonderful properties, and we'll spin back around to them later in the quarter.}
\uplevel{There are a bunch of other ways you can encode tries, each of which has its advantages and disadvantages. If you liked this problem, consider looking into this as a topic for your final project!}
\end{parts}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem Two: The Anatomy of Suffix Trees (2 Points)}
Consider the following paragraph, which is taken from an English translation of the excellent short story ``Before the Law'' by Franz Kafka:
\begin{center}
\begin{minipage}{0.9\textwidth}
\texttt{Before the law sits a gatekeeper. To this gatekeeper comes a man from the country who asks to gain entry into the law. But the gatekeeper says that he cannot grant him entry at the moment. The man thinks about it and then asks if he will be allowed to come in later on. "It is possible," says the gatekeeper, "but not now."}
\end{minipage}
\end{center}
Actually building a suffix tree for this text, by hand, would be quite challenging. But even without doing so, you can still infer much about what it would look like.
\begin{parts}
\part Without building a suffix tree or suffix array for the above text, determine whether the suffix tree for this passage contains a node labeled with the string \texttt{moment}. Briefly justify your answer.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\part Repeat the above exercise for the string \texttt{man}.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%
\part Repeat the above exercise for the string \texttt{gatekeeper}.
\begin{solution}
Your solution goes here!
\end{solution}
\end{parts}
\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem Three: Longest $\mathbf{k}$-Repeated Substrings (3 Points)}
Design an O$(m)$-time algorithm that, given a string $T$ of length $m$ and a positive integer $k$, returns the longest substring that appears in $T$ in at least $k$ different places. The substrings are allowed to overlap with one another; for example, given the string \texttt{HAHAHAHAHA} and $k = 3$, you'd return \texttt{HAHAHA}.
As a note, the runtime of O$(m)$ is independent of $k$. For full credit, your solution use should not use suffix trees, though you're welcome to use any other data structures you'd like.
\begin{solution}
Your solution goes here!
\end{solution}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\Q{Problem Four: Suffix Array Search (3 Points)}
This one is all coding! Make sure to submit your implementation on \texttt{myth}.
\Q{Problem Five: Implementing SA-IS (12 Points)}
This is one is also all coding! Make sure to submit your implementation on \texttt{myth}.
\end{questions}
\end{document}