\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{amsfonts}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{moreverb}
\usepackage{pdfpages}
\title{CS 246: Problem Set 4}
\author{Tony Hyun Kim}
\setlength{\parindent}{0pt}
\setlength\parskip{0.1in}
\setlength\topmargin{0in}
\setlength\headheight{0in}
\setlength\headsep{0in}
\setlength\textheight{8.2in}
\setlength\textwidth{6.5in}
\setlength\oddsidemargin{0in}
\setlength\evensidemargin{0in}

\pdfpagewidth 8.5in
\pdfpageheight 11in

% THK: Commands accumulated over the problem sets...
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}

\begin{document}

\maketitle

\section{Strategies for high-frequency trading}

\subsection{Baseline: Mean reversion\label{subsec:meanrevert}}

Fig.~\ref{fig:removebad679} shows the trajectory of stock $\# 679$ between $10$ AM and $3$ PM on Jan 5, 2006. The bold blue trace shows the result of removing bad trades, \emph{i.e.} all ticks where the trade price is not between the bid and ask prices.

\begin{figure}[b!]
	\begin{center}
		\includegraphics[width=0.6\textwidth]{removebad679.pdf}
	\end{center}
	\caption{Trajectory of stock $\# 679$. The bold blue trace shows only ``good'' trades, \emph{i.e.} ticks where the trade price is between the bid and ask prices.\label{fig:removebad679}}
\end{figure} 

The percent accuracy, utilizing the $+1$ and $-1$ scoring for correct and incorrect predictions respectively, is about $10\%$. I suppose this is better than random guessing. The Matlab script is attached.
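As a sanity check on the baseline, here is a minimal Python sketch of the bad-trade filter and the $\pm 1$ scoring rule. The attached Matlab script is the actual implementation; the function names and the toy data below are my own.

```python
# Hedged sketch, not the attached Matlab script. Ticks are modeled as
# (price, bid, ask) tuples; a "bad" trade falls outside the bid-ask spread.

def remove_bad_trades(ticks):
    """Keep only ticks whose trade price lies between bid and ask."""
    return [(p, b, a) for (p, b, a) in ticks if b <= p <= a]

def mean_reversion_score(prices):
    """Predict each move as the reverse of the previous move; score
    +1 for a correct direction call and -1 for an incorrect one."""
    score = 0
    for i in range(2, len(prices)):
        prev_move = prices[i - 1] - prices[i - 2]
        next_move = prices[i] - prices[i - 1]
        if prev_move == 0 or next_move == 0:
            continue  # no directional call on flat ticks
        predicted_up = prev_move < 0  # mean reversion: expect a bounce back
        score += 1 if predicted_up == (next_move > 0) else -1
    return score
```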

\subsection{Improvement 1}

\subsubsection{Rolling $5$ minute high-low}

A simple dynamic programming algorithm that uses $1$-minute buckets would be as follows. For the current minute, utilize a priority queue to determine the high and the low within that minute. When the minute is up, store the final high and low in a shift register of length four. We then compute the ``five-minute rolling'' statistics using the shift register and the instantaneous priority queue.

The Matlab script is attached.
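The bucket scheme above can be sketched in Python (a stand-in for the attached Matlab script; a running min/max per minute replaces the priority queue, since each bucket is discarded whole):

```python
from collections import deque

class RollingHighLow:
    """Rolling 5-minute high/low from 1-minute buckets: a length-4
    shift register of completed minutes plus the current minute."""

    def __init__(self):
        self.completed = deque(maxlen=4)  # (high, low) of last 4 full minutes
        self.cur_minute = None
        self.cur_high = self.cur_low = None

    def update(self, minute, price):
        if minute != self.cur_minute:
            # Minute rolled over: shift the finished bucket into the register.
            if self.cur_minute is not None:
                self.completed.append((self.cur_high, self.cur_low))
            self.cur_minute = minute
            self.cur_high = self.cur_low = price
        else:
            self.cur_high = max(self.cur_high, price)
            self.cur_low = min(self.cur_low, price)

    def high_low(self):
        """Rolling statistics: register of 4 past minutes + current minute."""
        highs = [h for h, _ in self.completed] + [self.cur_high]
        lows = [l for _, l in self.completed] + [self.cur_low]
        return max(highs), min(lows)
```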

\subsubsection{Regression for next-tick's percentage change}

I have tried the model that takes into account small and large moves on stocks $679$, $980$, $17948$ and $27969$. Statistics are given in Table~\ref{tab:regstats}. For all stocks, I find that the coefficient for both small and large moves is negative.
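The setup of this regression can be sketched as follows. The data, threshold, and coefficients below are synthetic stand-ins (the real fit uses the tick stream and the rolling high-low band); they are chosen only to show how routing the previous move into a ``small'' or ``big'' feature decouples the two least-squares coefficients.

```python
import random

# Hypothetical sketch of the regression behind the table: the next move is
# regressed on the previous move, which is routed into a "small" or a "big"
# feature depending on a threshold (a stand-in for the rolling high-low band).

random.seed(0)
moves = [random.gauss(0.0, 1.0) for _ in range(500)]
threshold = 1.0
beta_true = {"small": -0.3, "big": -0.5}  # synthetic negative coefficients

# Each row has exactly one nonzero feature, so the least-squares fit
# decouples per feature: beta_j = sum(x_j * y) / sum(x_j ** 2).
rows = []
for prev in moves[:-1]:
    kind = "small" if abs(prev) <= threshold else "big"
    rows.append((kind, prev, beta_true[kind] * prev))  # noiseless target

beta_hat = {}
for kind in ("small", "big"):
    num = sum(x * y for k, x, y in rows if k == kind)
    den = sum(x * x for k, x, _ in rows if k == kind)
    beta_hat[kind] = num / den
```

With a noiseless synthetic target the fit recovers the planted coefficients exactly, which makes the decoupling easy to verify.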

\begin{table}[t]
\begin{center}
\begin{tabular}{|l|c|c|c|c|}
\hline
Stock & Coeff & StdErr & tStat & pVal\\
\hline
$679$ (Small) & $-9.50\times10^{-5}$ & $8.68\times10^{-6}$ & $-10.95$ & $6.00\times10^{-27}$ \\
$679$ (Big)   & $-1.25\times10^{-4}$ & $5.19\times10^{-5}$ & $-2.41$  & $1.61\times10^{-2}$\\
\hline
$980$ (Small) & $-3.35\times10^{-4}$ & $1.54\times10^{-5}$ & $-21.79$ & $7.28\times10^{-92}$\\
$980$ (Big)   & $-4.16\times10^{-4}$ & $3.10\times10^{-5}$ & $-13.44$ & $5.18\times10^{-39}$\\
\hline
$17948$ (Small) & $-1.30\times10^{-4}$ & $1.02\times10^{-4}$ & $-1.28$ & $0.20$\\
$17948$ (Big)   & $-2.81\times10^{-4}$ & $1.90\times10^{-4}$ & $-1.48$ & $0.14$\\
\hline
$27969$ (Small) & $-2.52\times10^{-4}$ & $1.80\times10^{-5}$ & $-13.98$ & $6.70\times10^{-42}$\\
$27969$ (Big)   & $-3.53\times10^{-4}$ & $1.32\times10^{-4}$ & $-2.68$  & $7.49\times10^{-3}$\\
\hline 
\end{tabular}
\end{center}
\caption{Regression statistics when taking into account whether the last move is ``big'' or ``small'' (with respect to the rolling high-low). For both big and small moves, I see a negative correlation between the previous and next move, which leads to a trade strategy that is identical to mean reversion.\label{tab:regstats}}
\end{table}

\subsubsection{Primitive algorithmic trading}

As shown in Table~\ref{tab:regstats}, I find that the previous price move is negatively correlated with future moves, whether or not the prior move was big or small compared to the rolling window. Hence, the trading strategy will be identical to mean reversion of Section~\ref{subsec:meanrevert}, and I hard-coded for mean reversion. I applied this trading strategy to the stock trajectories during $3$-$4$ PM, and found the following profits for each of the stocks:
\begin{itemize}
	\item Stock \#$679$: $\$1.1050$
	\item Stock \#$980$: $\$1.8100$
	\item Stock \#$17948$: $\$0.0900$
	\item Stock \#$27969$: $\$1.2090$
\end{itemize}
I suppose, technically, we made some money.

\subsection{Hello world SVM}

\begin{itemize}
	\item We use $C \cdot \sum_i e_i$ in the objective in order to penalize the use of slack variables $e_i$. If we were to set $C=0$, the slack variables would be free to take any value, every margin constraint could be satisfied trivially, and the optimizer would simply minimize $\frac{1}{2}w^Tw$ while ignoring the training set.
	\item One obtains the dual form of the SVM by introducing Lagrange multipliers $\alpha_i$ (one for each training example) and forming the Lagrangian
		\begin{equation*}
			\mathcal{L} = \frac{1}{2}w^T w + C\sum_i e_i + \sum_i \alpha_i g_i
		\end{equation*}
		where the first two terms correspond to the original cost function, and 
		\begin{equation*}
			g_i = -\left[y^{(i)}\left(\langle w, x^{(i)}\rangle + b\right) - 1 + e_i\right] \leq 0
		\end{equation*}
		represent the margin and slack constraints. 
		The dual function of the SVM optimization problem is obtained by defining $W(\alpha) = \min_{w,b} \mathcal{L}$. The dual optimization task is then to maximize over $\alpha$
		\begin{equation*}
			W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y^{(i)}y^{(j)}\alpha_i\alpha_j \langle x^{(i)}, x^{(j)} \rangle
		\end{equation*}
		subject to the constraints $0\leq \alpha_i \leq C$ and $\sum_i \alpha_i y^{(i)}=0$. Under further conditions (known as KKT conditions), the dual solution may be used to obtain the solutions to the original (``primal'') cost function. Note that I consulted my machine learning course notes (CS 229) for this section.
\end{itemize}
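For completeness, the elimination of $w$, $b$ and $e_i$ proceeds via the stationarity conditions. This expansion is my own; note that the constraints $e_i\geq 0$ require additional multipliers $\mu_i\geq 0$, which the Lagrangian above suppresses. Setting the gradients of $\mathcal{L}$ to zero gives

```latex
\begin{eqnarray*}
	\frac{\partial\mathcal{L}}{\partial w} = 0 &\Rightarrow& w = \sum_i \alpha_i y^{(i)} x^{(i)},\\
	\frac{\partial\mathcal{L}}{\partial b} = 0 &\Rightarrow& \sum_i \alpha_i y^{(i)} = 0,\\
	\frac{\partial\mathcal{L}}{\partial e_i} = 0 &\Rightarrow& C - \alpha_i - \mu_i = 0.
\end{eqnarray*}
```

Substituting $w=\sum_i \alpha_i y^{(i)} x^{(i)}$ back into $\mathcal{L}$ yields $W(\alpha)$, and the last condition together with $\mu_i\geq 0$ gives the box constraint $\alpha_i\leq C$.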

\subsection{Improvement 2: SVM-based trading}

Here is my toy implementation:
\begin{itemize}
	\item At any time index $i$, the feature vector is the last $K$ price moves, \emph{i.e.}
		\begin{equation*}
			X(i) = \left[ \begin{array}{cccc}
							p_i-p_{i-1} & p_{i-1}-p_{i-2} & \cdots & p_{i-(K-1)}-p_{i-K}
				   		  \end{array} \right]
		\end{equation*}
		I selected $K=3$ admittedly without too much thought. I also tried a few values of the regularization parameter $C$.
	\item On the training dataset, I computed the margin $|w^T X + b|$ for each example, and computed the $33\%$, $50\%$, and $75\%$ quantiles. This is to give me a sense of the typical ``confidences'' observed in the training portion of the stream.
	\item During the trading period, I streamed the past $K$ price differentials as was done in my training model, and computed the margin $w^T X + b$ with respect to the separating hyperplane. Depending on the magnitude of the margin with respect to the previously-computed quantiles at $33\%$, $50\%$ and $75\%$, I traded $1$, $2$ or $5$ units in the proper direction (\emph{i.e.} sell when the model predicted the price to fall, and buy when the price was predicted to rise).
\end{itemize} 
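The trade-sizing rule can be sketched as follows. The helper names are my own, the nearest-rank quantile differs slightly from Matlab's interpolating \verb=quantile()=, and the behavior below the $33\%$ quantile is not specified above, so this sketch trades the minimum size there.

```python
def margin(w, x, b):
    """Signed margin w^T x + b for a feature vector x of K lagged moves."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_quantiles(margins):
    """33%/50%/75% quantiles of |margin| over the training stream
    (nearest-rank convention)."""
    m = sorted(abs(x) for x in margins)
    pick = lambda q: m[min(int(q * len(m)), len(m) - 1)]
    return pick(0.33), pick(0.50), pick(0.75)

def trade_size(margin_value, q33, q50, q75):
    """Map |margin| to 1, 2, or 5 units via the training quantiles; the
    sign of the margin gives the direction (+buy / -sell). Trading the
    minimum size below the 33% quantile is my assumption."""
    m = abs(margin_value)
    if m >= q75:
        units = 5
    elif m >= q50:
        units = 2
    else:
        units = 1
    return units if margin_value >= 0 else -units
```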

The Matlab script is attached.

\includepdf[pages=-,nup=2x2]{p1code.pdf}

\section{Decision tree learning}

\subsection{Reduction of impurity}

Our toy decision tree seeks to predict whether a person enjoys beer.

\subsubsection{Necessary and sufficient condition for an attribute to be ``useful,'' $G>0$}

We have a set $D$ that is split by a binary attribute into disjoint sets $D_k$, where $k=0,1$ labels the value taken by the attribute. Furthermore, $D_0$ has $u$ positive and $v$ negative examples, and $D_1$ has $x$ positive and $y$ negative examples. It follows that the set $D$ has $x+u$ positive and $y+v$ negative examples.

We then compute the impurities of sets $D$, $D_0$ and $D_1$,
\begin{eqnarray*}
	I(D) &=& (x+y+u+v)\cdot\left[1-\left(\frac{x+u}{x+y+u+v}\right)^2-\left(\frac{y+v}{x+y+u+v}\right)^2\right]\\
	I(D_0) &=& (u+v)\cdot\left[1-\left(\frac{u}{u+v}\right)^2-\left(\frac{v}{u+v}\right)^2\right]\\
	I(D_1) &=& (x+y)\cdot\left[1-\left(\frac{x}{x+y}\right)^2-\left(\frac{y}{x+y}\right)^2\right]
\end{eqnarray*}
and hence the reduction in impurity, defined by $G=I(D)-I(D_0)-I(D_1)$, is
\begin{equation*}
	G = \frac{u^2}{u+v} + \frac{v^2}{u+v} + \frac{x^2}{x+y} + \frac{y^2}{x+y} - \frac{(x+u)^2}{x+y+u+v} - \frac{(y+v)^2}{x+y+u+v}.
\end{equation*}

Finally, we apply the condition of ``usefulness,'' \emph{i.e.} $G>0$, which yields after simple rearrangement
\begin{equation*}
	\frac{u^2}{u+v} + \frac{v^2}{u+v} + \frac{x^2}{x+y} + \frac{y^2}{x+y} > \frac{(x+u)^2}{x+y+u+v} + \frac{(y+v)^2}{x+y+u+v}.
\end{equation*}

Since all of our manipulations are reversible, the above condition is sufficient and necessary for the reduction in impurity to be positive. Note that we assume $x+y>0$ and $u+v>0$; if these assumptions are false, the attribute is useless for tree construction.

\subsubsection{Wine, Running and Pizza}

I have transcribed the given $\{$wine, running, pizza$\}\times$beer information into the previous notation of $x$, $y$, $u$ and $v$ in Table~\ref{tab:decisionattr}. Given the reductions $G$ of each attribute, I would choose \textbf{pizza} as the root attribute for the decision tree.

\begin{table}[t]
\begin{center}
\begin{tabular}{|c|c c|c c|c|}
\hline
Attribute & $x$ & $y$ & $u$ & $v$ & $G_\mathrm{attr}$ \\
\hline
Wine 	& $30$ & $20$ & $30$ & $20$ & $0.0000$\\
Running & $20$ & $10$ & $40$ & $30$ & $0.3810$\\
Pizza   & $50$ & $30$ & $10$ & $10$ & $0.5000$\\ 
\hline
\end{tabular}
\end{center}
\caption{Potential indicators for whether someone enjoys beer. There are $100$ people overall in the sample set, of which $60$ people like beer and $40$ do not.\label{tab:decisionattr}}
\end{table}
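The $G$ values in Table~\ref{tab:decisionattr} can be checked with a few lines of Python, using the closed form derived above (the function name is mine):

```python
def gini_gain(x, y, u, v):
    """Reduction in count-weighted Gini impurity, G = I(D) - I(D0) - I(D1),
    in the x, y, u, v notation of the table."""
    def impurity(pos, neg):
        n = pos + neg
        return n * (1 - (pos / n) ** 2 - (neg / n) ** 2)
    return impurity(x + u, y + v) - impurity(u, v) - impurity(x, y)
```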

\subsection{Prevention of overfitting}

\subsubsection{A ridiculous example}

The given scenario is an extreme. If we have a dataset that contains all possible assignments to $100$ binary attributes, and we were to construct a complete binary decision tree, we will have a full binary tree of depth $100$, with $2^{100}$ leaves that correspond exactly to each of the examples. This is the ``truest'' definition of overfitting, since we are explicitly setting the labels (as seen in the training set) for all possible values of the attributes.

Since we are told that the target label is equal to the first of these attributes $a_1$ for $99\%$ of these examples, a more sensible decision tree is one that looks at only $a_1$ and predicts the label accordingly.

\subsubsection{Pruning the tree}

\textbf{Sequence of trees}

Fig.~\ref{fig:prunetree} shows the sequence of trees that I generated manually using the $\alpha$ metric.

\begin{figure}[h]
	\begin{center}
		\includegraphics[width=1\textwidth]{prunetree.pdf}
	\end{center}
	\caption{Sequence of pruned trees. Nodes are indicated as circles, and leaves as squares. The red label below each leaf indicates the predicted class of each leaf. Nodes (possibly sub-trees) are removed in order of apparent error rate per pruned node $\alpha$ (computed with respect to the training set).\label{fig:prunetree}}
\end{figure} 

\textbf{Test error}

With respect to the given four-element test set, the trees $T_0,\cdots,T_4$ yield a corresponding error sequence $2, 2, 1, 0, 2$. The tree $T_3$ has the best generalization error.

\subsection{Categorical attributes}

\subsubsection{Partitioning the attributes}

Partitioning $d$ possible values into two classes (\emph{i.e.} the left and right branches) is identical to the task of assigning $0$ or $1$ to each value. Hence, the number of possible partitions is $2^d$. (Note that, if we do not distinguish between ``left'' and ``right'' classes, the number of possibilities is $2^{d-1}$.)
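The count can be verified by brute force; a tiny Python check (names mine):

```python
from itertools import product

# Each of the d values gets a 0/1 label, giving 2**d labeled partitions,
# or 2**(d-1) if "left" and "right" are interchangeable.

def labeled_partitions(values):
    parts = []
    for bits in product([0, 1], repeat=len(values)):
        left = frozenset(v for v, b in zip(values, bits) if b == 0)
        right = frozenset(set(values) - left)
        parts.append((left, right))
    return parts

parts = labeled_partitions(["red", "green", "blue"])
```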

\subsubsection{Number of partitionings that need to be considered is only $O(d)$ if the attribute categories are ordered in increasing $P(Y=1|X=x_i)$}

\textbf{Expressions for the impurity}

We wish to show that 
\begin{equation}
	I(D_L) = |D|\cdot\sum_{i\in L} P(X=x_i) \cdot \left[2P(Y=1|X\in L)-2P^2(Y=1|X\in L)\right].
\end{equation}

We begin with the definition of the impurity for a binary target variable $Y$:
\begin{equation*}
	I(D_L) = |D_L|\cdot(1-p_{Y=1|L}^2-p_{Y=0|L}^2).
\end{equation*}

Now, it is clear that $|D_L| = |D|\cdot\sum_{i\in L} P(X=x_i)$ and that $p_{Y=0|L} = 1-p_{Y=1|L}$. Making these substitutions in the above expression, we find
\begin{eqnarray*}
	I(D_L) &=& |D|\cdot\sum_{i\in L} P(X=x_i) \cdot \left[1-p_{Y=1|L}^2-\left(1-p_{Y=1|L}\right)^2\right]\\
		   &=& |D|\cdot\sum_{i\in L} P(X=x_i) \cdot \left[2p_{Y=1|L}-2p_{Y=1|L}^2\right].
\end{eqnarray*}

\textbf{Partial derivatives of the combined impurity}

As per the assignment description, we will assume the following result.

Let the variable $a_i$ indicate whether $x_i\in L$ (by $a_i=1$) or $x_i\in R$ (by $a_i=0$). The partial derivative of the combined impurity $I(D_L)+I(D_R)$ evaluated at $a_i=0$ exists, and is of the form 
\begin{equation}
	P(X=x_i)\cdot\left[A P(Y=1|X=x_i) + B\right]
	\label{eqn:freebie}
\end{equation}
for some constants $A$, $B$ that do not depend on $i$. Furthermore, the partial derivative maintains the same sign as $a_i$ varies over $[0, 1]$.

\textbf{Only ``contiguous'' binary splits need to be considered}

Let us begin by interpreting how to ``use'' the partial derivative of the impurity with respect to $a_i$. First, recall that $a_i$ takes only binary values. We have two cases:
\begin{itemize}
	\item Suppose $\frac{\partial}{\partial a_i}\left[I(D_L)+I(D_R)\right]_{a_i=0}>0$. Since the derivative maintains the same sign over $[0, 1]$, it follows that $\left[I(D_L)+I(D_R)\right]_{a_i=1} > \left[I(D_L)+I(D_R)\right]_{a_i=0}$. Since we seek to minimize the combined impurity, in this case we would then assign $a_i=0$.
	\item Suppose $\frac{\partial}{\partial a_i}\left[I(D_L)+I(D_R)\right]_{a_i=0}<0$. By a similar reasoning, we would then assign $a_i=1$.
\end{itemize}

Hence, it is seen that the sign of the partial derivative (with respect to $a_i$) determines whether we assign $a_i=0$ or $a_i=1$. Now, returning to the result of Eq.~\ref{eqn:freebie}, we see that the sign of the partial derivative is controlled by the linear function $AP(Y=1|X=x_i)+B$. Since the categorical values are ordered with respect to $P(Y=1|X=x_i)$, they fall -- in the same order -- on the line defined by $(A,B)$. Let the index $l$ indicate the value for which the derivative changes sign, \emph{i.e.}
\begin{equation*}
	\mathrm{sign}\left[AP(Y=1|X=x_l)+B\right] \neq \mathrm{sign}\left[AP(Y=1|X=x_{l+1})+B\right].
\end{equation*}
It then follows that we would assign all $x_i$ for $i\leq l$ \emph{or} all $x_i$ for $i>l$ to $L$. (The choice between the two forms depends on the signs of $A$ and $B$.)

\textbf{Splitting colors}

When we sort the given colors in increasing order of $P(Y=1|\,\mathrm{color})$, we obtain $\{$green, red, blue, black$\}$. I then performed the calculation of $G$, and find $\{$green$\}$, $\{$red, blue, black$\}$ to be the best split.
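A sketch of the $O(d)$ search follows. The per-color counts below are hypothetical, not the assignment's data; they are chosen only so that the sorted order matches green, red, blue, black.

```python
# Sort the categorical values by P(Y=1|X=x_i), then evaluate only the
# d-1 contiguous prefix splits rather than all 2**(d-1) partitions.

def best_contiguous_split(counts):
    """counts: {value: (n_pos, n_neg)}. Returns (left_values, gain)."""
    def impurity(pos, neg):
        n = pos + neg
        return 0.0 if n == 0 else n * (1 - (pos / n) ** 2 - (neg / n) ** 2)

    order = sorted(counts, key=lambda v: counts[v][0] / sum(counts[v]))
    tot_p = sum(p for p, _ in counts.values())
    tot_n = sum(n for _, n in counts.values())
    parent = impurity(tot_p, tot_n)

    best = (None, float("-inf"))
    lp = ln = 0
    for l in range(len(order) - 1):  # only contiguous prefixes
        lp += counts[order[l]][0]
        ln += counts[order[l]][1]
        gain = parent - impurity(lp, ln) - impurity(tot_p - lp, tot_n - ln)
        if gain > best[1]:
            best = (order[: l + 1], gain)
    return best

# Hypothetical counts (pos, neg) per color, ordered green < red < blue < black
# in P(Y=1|color); NOT the assignment's numbers.
colors = {"green": (1, 9), "red": (7, 3), "blue": (8, 2), "black": (9, 1)}
left, gain = best_contiguous_split(colors)
```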

\section{Clustering data streams}

\subsection{A simple proof\label{subsec:simpleproof}}

I believe that the inequality holds for any real numbers $a$ and $b$, not just non-negative numbers. We proceed by expanding the LHS:
\begin{eqnarray*}
	(a+b)^2 &\leq& 2a^2 + 2b^2\\
	a^2+2ab+b^2 &\leq& 2a^2 + 2b^2\\
	0 &\leq& a^2 - 2ab + b^2\\
	0 &\leq& (a-b)^2
\end{eqnarray*}
which is true for any real numbers $a$ and $b$. Since each step is reversible, the original inequality follows.

\subsection{Prove $\mathrm{cost}(S,T) \leq 2 \cdot \mathrm{cost}_w(\hat{S},T)+2\sum_{i=1}^l \mathrm{cost}(S_i,T_i)$\label{subsec:costbound}}

As always, we begin with the definition
\begin{equation}
	\mathrm{cost}(S,T) = \sum_{x\in S} d(x,T)^2 = \sum_{i=1}^l \sum_{x\in S_i} d(x,T)^2 = \sum_{i=1}^l \sum_{x\in S_i} \left[\min_{z\in T} d(x,z)\right]^2
	\label{eq:costdef}
\end{equation}
where we have used the fact that $S=S_1 \cup S_2 \cup \cdots \cup S_l$.

Now recall the triangle inequality, \emph{i.e.} for any $x, y, z$ we have
\begin{equation*}
	d(x,z) \leq d(x,y) + d(y,z)
\end{equation*}
from which it follows that (for fixed $x$, $y$)
\begin{equation*}
	\min_{z\in T} d(x,z) \leq \min_{z\in T} \left[d(x,y)+d(y,z)\right] = d(x,y) + \min_{z\in T} d(y,z).
\end{equation*}

We apply the above result in Eq.~\ref{eq:costdef}. Furthermore, for every $x\in S_i$, we let $y=t_{ij}$ where $j$ is the index for which $x\in S_{ij}$. In other words, $y$ is the centroid that $x \in S_i$ is assigned to in the $i$-th iteration of \verb=ALG=. We then obtain:
\begin{eqnarray*}
	\mathrm{cost}(S,T) &=& \sum_{i=1}^l \sum_{x\in S_i} \left[\min_{z\in T} d(x,z)\right]^2\\
					  &\leq& \sum_{i=1}^l \sum_{x\in S_i} \left[d(x,y)+\min_{z\in T} d(y,z)\right]^2 
\end{eqnarray*}
to which we apply the result of Section~\ref{subsec:simpleproof}
\begin{eqnarray*}
	\mathrm{cost}(S,T) &\leq& 2 \sum_{i=1}^l \sum_{x\in S_i} d(x,y)^2 + 2 \sum_{i=1}^l \sum_{x\in S_i} \left[\min_{z\in T} d(y,z)\right]^2.
\end{eqnarray*}

Consider the first term. We have defined $y$ to be the cluster vector to which each $x\in S_i$ is assigned. It thus follows that $\sum_{x\in S_i} d(x,y)^2 = \sum_{x\in S_i} d(x,T_i)^2 = \mathrm{cost}(S_i,T_i)$.

Now consider the second term. We note that $y$ takes values in $\hat{S}=\left\{t_{ij}\right\}$, and the number of times that $y$ equals a particular $t_{ij}$ is exactly the number of points $x\in S_i$ assigned to cluster center $t_{ij}$, namely $|S_{ij}| = w(t_{ij})$. We conclude that $\sum_{i=1}^l\sum_{x\in S_i} d(y,T)^2 = \sum_{y\in\hat{S}} w(y)\cdot d(y,T)^2 = \mathrm{cost}_w(\hat{S},T)$.

Putting these two results together, we conclude
\begin{equation}
	\mathrm{cost}(S,T) \leq 2 \sum_{i=1}^l \mathrm{cost}(S_i,T_i) + 2\cdot\mathrm{cost}_w(\hat{S},T).
	\label{eq:partb}
\end{equation}

\subsection{Prove $\sum_{i=1}^l \mathrm{cost}(S_i,T_i) \leq \alpha\cdot \mathrm{cost}(S,T^*)$\label{subsec:partc}}

Consider each term $\mathrm{cost}(S_i,T_i)$ in the summation. The subroutine \verb=ALG= guarantees that
\begin{equation*}
	\mathrm{cost}(S_i,T_i) \leq \alpha \cdot \mathrm{cost}(S_i,T_i^*) \leq \alpha \cdot\mathrm{cost}(S_i,T^*).
\end{equation*}
In the first inequality, $T_i^*$ is defined to be the globally optimal set of cluster centers for $S_i$. The inequality follows from the assumption that the subroutine \verb=ALG= returns a set $T_i$ that is an $\alpha$-approximation of $T_i^*$. The second inequality follows since $T_i^*$ is the optimal clustering set for $S_i$, \emph{i.e.} it must necessarily have a cost no larger than that of any other candidate $T'$, including $T^*$.

Thus we may write
\begin{equation*}
	\sum_{i=1}^l \mathrm{cost}(S_i,T_i) \leq \alpha \sum_{i=1}^l \mathrm{cost}(S_i,T^*) = \alpha \cdot \mathrm{cost}(S,T^*).
\end{equation*}
The final equality uses the fact that $S = \cup_{i=1}^l S_i$ to collect the summations over the $S_i$ into a single sum over $S$ (embedded in the definition of the cost function).

\subsection{Prove $\mathrm{cost}_w(\hat{S},T)\leq \alpha \cdot \mathrm{cost}_w(\hat{S},T^*)$}

The proof is identical to the previous part. Let $\hat{T}^*$ denote the optimal solution of the clustering problem for dataset $\hat{S}$. Then, the $\alpha$-approximate property of \verb=ALG= yields
\begin{equation*}
	\mathrm{cost}_w(\hat{S},T) \leq \alpha\cdot\mathrm{cost}_w(\hat{S},\hat{T}^*) \leq \alpha\cdot\mathrm{cost}_w(\hat{S},T^*)
\end{equation*}
where the last inequality follows from the fact that $T^*$ must necessarily have a larger cost with respect to $\hat{S}$ (with weight $w$) than $\hat{T}^*$ since the latter is defined to be the global optimum.

\subsection{Prove $\mathrm{cost}_w(\hat{S},T^*) \leq 2 \sum_{i=1}^l \mathrm{cost}(S_i,T_i) + 2\cdot\mathrm{cost}(S,T^*)$\label{subsec:parte}}

This proof will be similar to that in Section~\ref{subsec:costbound}. As before, our basic technique is to use the triangle inequality of $d$, and the results of Section~\ref{subsec:simpleproof}.

First, we begin by carefully writing out the definition of $\mathrm{cost}_w(\hat{S},T^*)$
\begin{equation}
	\mathrm{cost}_w(\hat{S},T^*) = \sum_{t\in\hat{S}} w(t)\cdot d(t,T^*)^2 = \sum_{i=1}^l \sum_{t_{ij}\in T_i} w(t_{ij})\cdot d(t_{ij},T^*)^2 = \sum_{i=1}^l \sum_{t_{ij}\in T_i} |S_{ij}|\cdot d(t_{ij},T^*)^2.
	\label{eq:costdef2}
\end{equation}
Here, we have utilized the fact that $\hat{S} = \cup_{i=1}^l T_i$ and that $w(t_{ij}) = |S_{ij}|$.

We now consider the ``innermost'' term $|S_{ij}|\cdot d(t_{ij},T^*)^2$. We have
\begin{eqnarray*}
	|S_{ij}|\cdot d(t_{ij},T^*)^2 &=& \sum_{k=1}^{|S_{ij}|} d(t_{ij},T^*)^2\\
						      &\leq& \sum_{x\in S_{ij}} \left[d(t_{ij},x)+d(x,T^*)\right]^2\\
						      &\leq& 2 \sum_{x\in S_{ij}} d(t_{ij},x)^2 + 2 \sum_{x\in S_{ij}} d(x,T^*)^2
\end{eqnarray*}
In the first equality, the RHS consists of $|S_{ij}|$ identical terms. In the second line, we apply the triangle inequality with a different intermediate point $x$ for each term. The third line uses the results of Section~\ref{subsec:simpleproof}.

Inserting the above inequality into Eq.~\ref{eq:costdef2} we find:
\begin{equation*}
	\mathrm{cost}_w(\hat{S},T^*) \leq 2 \sum_{i=1}^l \sum_{t_{ij}\in T_i} \sum_{x\in S_{ij}} d(t_{ij},x)^2 + 2 \sum_{i=1}^l \sum_{t_{ij}\in T_i} \sum_{x\in S_{ij}} d(x,T^*)^2
\end{equation*}

Finally, it is clear that $\sum_{t_{ij}\in T_i} \sum_{x\in S_{ij}} d(t_{ij},x)^2$ is simply the (unweighted) cost of $S_i = \cup_j S_{ij}$ with respect to $T_i$. Likewise, the summation $\sum_{i=1}^l \sum_{t_{ij}\in T_i} \sum_{x\in S_{ij}} d(x,T^*)^2$ simply iterates over all elements $x\in S$. We therefore conclude
\begin{equation*}
	\mathrm{cost}_w(\hat{S},T^*) \leq 2 \sum_{i=1}^l \mathrm{cost}(S_i,T_i) + 2\cdot \mathrm{cost}(S,T^*).
\end{equation*}

Note that the above result can also be written using Section~\ref{subsec:partc}
\begin{equation}
	\mathrm{cost}_w(\hat{S},T^*) \leq 2 \alpha\cdot\mathrm{cost}(S,T^*) + 2\cdot \mathrm{cost}(S,T^*).
\end{equation}

\subsection{Conclude that $\mathrm{cost}(S,T) \leq (4\alpha^2+6\alpha)\cdot\mathrm{cost}(S,T^*)$}

Here, the task is to simply cascade all of our previous results. We begin with Eq.~\ref{eq:partb}, and apply the results of Sections~\ref{subsec:costbound}--\ref{subsec:parte} sequentially
\begin{eqnarray*}
	\mathrm{cost}(S,T) &\leq& 2\cdot\mathrm{cost}_w(\hat{S},T)+2\sum_{i=1}^l\mathrm{cost}(S_i,T_i)\\
	&\leq& 2\cdot\mathrm{cost}_w(\hat{S},T)+2\alpha\cdot\mathrm{cost}(S,T^*)\\
	&\leq& 2\alpha\cdot\mathrm{cost}_w(\hat{S},T^*)+2\alpha\cdot\mathrm{cost}(S,T^*)\\
	&\leq& 2\alpha\cdot\left[2\alpha\cdot\mathrm{cost}(S,T^*)+2\cdot\mathrm{cost}(S,T^*)\right]+2\alpha\cdot\mathrm{cost}(S,T^*)\\
	&\leq& \left(4\alpha^2+6\alpha\right)\cdot\mathrm{cost}(S,T^*).
\end{eqnarray*}

So, \verb=ALGSTR=, which is based on $\alpha$-approximate \verb=ALG= applied to the ``segmented'' datasets $S_i$, is $(4\alpha^2+6\alpha)$-approximate for the overall problem on $S=\cup_{i}S_i$.

\subsection{Memory requirement}

Suppose $|S|=n$ and we run \verb=ALGSTR= with $k$ cluster centers. Let $S$ be partitioned into $l$ parts $S_1,\cdots,S_l$. We run \verb=ALG= on one partition at a time, and we must also accumulate the results $T_i$ for each partition. Since $|T_i|\propto k$, the basic memory requirement is
\begin{equation*}
	\mathrm{Required~memory} \propto n/l + k\cdot l.
\end{equation*}
Let $l=\sqrt{n/k}$, which balances the two terms. It then follows that the memory usage of \verb=ALGSTR= is $O(\sqrt{nk})$.
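A toy sketch of \verb=ALGSTR= in Python follows, with plain Lloyd $k$-means standing in for the $\alpha$-approximate \verb=ALG=, and one-dimensional points for brevity; all names are my own.

```python
import random

def kmeans(points, weights, k, iters=20, seed=0):
    """Weighted Lloyd k-means on 1-D points; a stand-in for ALG."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(points)), k)  # k distinct starting values
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            groups[j].append((p, w))
        # Recompute centers as weighted means.
        for j, g in enumerate(groups):
            tw = sum(w for _, w in g)
            if tw > 0:
                centers[j] = sum(p * w for p, w in g) / tw
    return centers

def algstr(stream, k, l):
    """Partition the stream into l pieces, cluster each with ALG, then
    cluster the |S_ij|-weighted centers S-hat."""
    chunks = [stream[i::l] for i in range(l)]
    s_hat, w_hat = [], []
    for chunk in chunks:
        centers = kmeans(chunk, [1] * len(chunk), k)
        # Weight w(t_ij) = number of chunk points assigned to center t_ij.
        counts = {c: 0 for c in centers}
        for p in chunk:
            counts[min(centers, key=lambda c: (p - c) ** 2)] += 1
        for c in centers:
            s_hat.append(c)
            w_hat.append(counts[c])
    return kmeans(s_hat, w_hat, k)
```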

\section{Data streams}

\subsection{Prove that $\tilde{F}[i]\geq F[i]$ for any $i=1,2,\cdots,n$}

Let the true count of item $i$ be $c$, \emph{i.e.} $F[i]=c$. For every hash function $h_j$, item $i$ will be hashed into the bucket $h_j(i)$ precisely $c$ times. Now, it may be the case that the hash $h_j$ maps other indices to $h_j(i)$. Hence, we conclude
\begin{equation*}
	c_{j,h_j(i)} \geq c
\end{equation*}
for every $j$. It follows immediately that
\begin{equation*}
	\tilde{F}[i] = \min_j \left\{c_{j,h_j(i)}\right\} \geq c = F[i].
\end{equation*}
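The count-min structure above can be sketched in Python as follows; the multiply-mod hash is my stand-in for the pairwise-independent family that the analysis formally requires.

```python
import random
from math import ceil, e, log

class CountMin:
    """Count-min sketch with width ceil(e/eps) and depth ceil(log(1/delta))."""

    def __init__(self, eps, delta, seed=0):
        rng = random.Random(seed)
        self.width = ceil(e / eps)
        self.depth = ceil(log(1 / delta))
        # One random multiplier per row; a crude stand-in for a real
        # pairwise-independent hash family.
        self.salts = [rng.randrange(1, 2 ** 31) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, item):
        return (hash(item) * self.salts[j]) % self.width

    def add(self, item):
        for j in range(self.depth):
            self.table[j][self._bucket(j, item)] += 1

    def estimate(self, item):
        # F~[i] = min over rows; each bucket holds F[i] plus collisions,
        # so the estimate never undercounts.
        return min(self.table[j][self._bucket(j, item)]
                   for j in range(self.depth))
```

The one-sided guarantee $\tilde{F}[i]\geq F[i]$ holds deterministically, while the upper bound of Section~\ref{subsec:goodapprox} holds with probability $1-\delta$.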

\subsection{Prove that $E[c_{j,h_j(i)}] \leq F[i] + \frac{\epsilon}{e}(t-F[i])$ for all $i,j$\label{subsec:expectation}}

Consider $j$ to be fixed. First, we note that the expectation is over the distribution of possible hash functions that can be assigned to $h_j$.

Now, for any hash function $h_j$, the algorithm will map the $F[i]$ occurrences of the item $i$ into bucket $h_j(i)$. Given the distribution over the hash function $h_j$, all other items $i'\neq i$ are equally likely to be mapped to any of the buckets $1,2,\cdots,\left\lceil e/\epsilon \right\rceil$. Thus, we obtain the following expectation
\begin{eqnarray*}
	E[c_{j,h_j(i)}] &=& F[i] + \frac{1}{\left\lceil e/\epsilon\right\rceil} \sum_{i'\neq i}F[i'] = F[i] + \frac{1}{\left\lceil e/\epsilon\right\rceil} \left(t-F[i]\right)\\
				   &\leq& F[i] + \frac{1}{e/\epsilon}\left(t-F[i]\right) = F[i] + \frac{\epsilon}{e}(t-F[i])
\end{eqnarray*}
where, in the last line, we have utilized the fact that $\left\lceil e/\epsilon\right\rceil \geq e/\epsilon$.

\subsection{Prove that $Pr\left(\tilde{F}[i]\leq F[i]+\epsilon t\right) \geq 1-\delta$\label{subsec:goodapprox}}

We consider the probability of the complement event. We make use of the fact that for the minimum of a set to exceed some threshold value, all elements in the set must exceed that threshold.
\begin{eqnarray*}
	Pr\left(\tilde{F}[i] > F[i]+\epsilon t\right) &=& Pr\left(\min_j\left\{c_{j,h_j(i)}\right\} > F[i]+\epsilon t\right)\\
	&=& \prod_j Pr\left(c_{j,h_j(i)}>F[i]+\epsilon t\right) = \prod_j Pr\left(c_{j,h_j(i)}-F[i] > \epsilon t\right)\\
	&\leq& \left[Pr\left(c_{j,h_j(i)}-F[i]>\epsilon t\right)\right]^{\log{1/\delta}}.
\end{eqnarray*}
In the second line, we use the fact that the hash functions $h_j$ are i.i.d. In the last line, we obtain the inequality since $\log{1/\delta}\leq\left\lceil\log{1/\delta}\right\rceil$.

We then apply the Markov inequality, incorporating the results of Section~\ref{subsec:expectation}, yielding
\begin{equation*}
	Pr\left(c_{j,h_j(i)}-F[i]>\epsilon t\right) \leq \frac{E(c_{j,h_j(i)}-F[i])}{\epsilon t} \leq \frac{(\epsilon/e)(t-F[i])}{\epsilon t} = \frac{1}{e}\cdot\frac{t-F[i]}{t}.
\end{equation*}

Making use of the above result, we find
\begin{equation*}
	Pr\left(\tilde{F}[i] > F[i]+\epsilon t\right) \leq \left[\frac{1}{e}\cdot\frac{t-F[i]}{t}\right]^{\log{1/\delta}} = \delta \cdot \left(\frac{t-F[i]}{t}\right)^{\log{1/\delta}} \leq \delta.
\end{equation*}
Note that the final inequality follows since $\frac{t-F[i]}{t} \leq 1$.

Taking the complement, we thus conclude
\begin{equation*}
	Pr\left(\tilde{F}[i] \leq F[i]+\epsilon t\right) \geq 1-\delta.
\end{equation*}

\subsection{Application to the dense subgraph search}

The algorithm for dense subgraph search in the previous problem set required us to count the number of edges connected to each node (\emph{i.e.} the degree of the node). With $n$ nodes in the graph, my implementation had used an array of $n$ counters. We can instead use the current algorithm to approximate the counts with $O\left(\frac{e}{\epsilon}\log\frac{1}{\delta}\right)$ counters, which is $O(\log{n})$ if we take $\delta=1/n$.

The approximate counting routine is particularly suitable to the previous dense subgraph search algorithm. As noted in Section~\ref{subsec:goodapprox}, the approximation is better for items $i$ where $F[i]$ is not very small compared to $t$. In the dense subgraph search algorithm, we are seeking nodes that possess large degrees. So, the approximation should work well in this application.

\end{document}