\documentclass[twoside]{article}
\usepackage{amssymb, amsmath}
\usepackage{mathrsfs}
\newtheorem{theorem} {Theorem}[section]
\newtheorem{remark} [theorem] {Remark}
\newtheorem{lemma} [theorem] {Lemma}
\newcommand{\R}{\mathbb{R}}
\newcommand{\X}{\mathcal{X}}
\oddsidemargin 0in \evensidemargin 0in \topmargin -0.5in
\headheight 0.2in \headsep 0.2in
\textwidth 6.5in \textheight 9in
\parskip 1.5ex \parindent 0ex \footskip 40pt
\begin{document}
\framebox[6.4in]{
\begin{minipage}{6.4in}
\vspace{1mm}
\center \makebox[6.2in]{{\bf CS369M: Algorithms for Modern Massive Data Set Analysis \hfill Lecture 5 - 10/07/2009}}
\vspace{2mm} \\
\center \makebox[6.2in]{{\Large Approximating L2 regression }}
\vspace{1mm} \\
\center \makebox[6.2in]{{\it Lecturer: Michael Mahoney \hfill Scribes: Kshipra Bhawalkar and Mengqiu Wang}}
\vspace{1mm}
\end{minipage}
} \vspace{2mm} \\
\mbox{{ \it * Unedited notes.}}
\section{ Introduction}
All of the methods we have seen so far provide an additive error guarantee, i.e., the approximate matrix $A'$ satisfies
\[ \| A - A_k\| \leq \|A - A'\| \leq \|A - A_k\| + \epsilon \|A\| \]
In this lecture we study some methods that provide relative error guarantees, i.e., guarantees of the form
\[ \|A - A'\| \leq (1+ \epsilon)\|A - A_k\| \]
Relative error guarantees are relatively new; for many years it was not even known whether such guarantees could be achieved. Once the first such guarantees appeared, many more followed. Notable examples include DMM, HP (using geometric algorithms), DV (using an iterative update), and Sarlos (using fast matrix computations). One notable property of these algorithms is that they are not efficiently implementable in a small number of passes.
In this lecture we study a relative error guarantee for approximately solving the linear regression problem. More formally, given a matrix $A \in \R^{n \times d}$ and a vector $b \in \R^n$, we want to find a vector $x \in \R^d$ such that $Ax = b$. Such an $x$ may not exist, so instead we find the best possible solution, i.e.,
\[ x_{OPT} = \arg\min_x \|Ax - b\|_2 \quad \text{and} \quad z = \min_x \|Ax - b\|_2 \]
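As a concrete baseline, the exact problem can be solved with a standard library routine. A minimal sketch in Python with NumPy (the matrix sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                        # overdetermined: n >> d
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# x_opt = argmin_x ||Ax - b||_2 and z = min_x ||Ax - b||_2
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
z = np.linalg.norm(A @ x_opt - b)

# The optimal residual is orthogonal to the column space of A.
assert np.allclose(A.T @ (A @ x_opt - b), 0)
```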
\section{Solving linear regression in $O(nd^2)$ time}
The linear regression problem can be solved exactly in $O(nd^2)$ time using various decompositions of $A$, such as the QR decomposition and the SVD. We now look at randomized methods that compute an approximately correct answer in $O(nd^2)$ time with high probability. Note that these methods are clearly no better than solving the system exactly, but we present them here because they form the basis for the faster algorithms in the next section.
{\bf Approach 1: Random Projection} \\
We generate a random matrix $R \in \R^{l \times n}$, as in previous lectures, and then solve
\[ \min_x \|RAx - Rb\|_2 \]
This gives a $(1 + \epsilon)$ approximation and takes $O(nd^2)$ time.
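A sketch of this approach, using a Gaussian random matrix for $R$ (the Gaussian choice and the sketch size $l$ are illustrative assumptions, not specified above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

l = 60                                         # illustrative sketch size, l >> d
R = rng.standard_normal((l, n)) / np.sqrt(l)   # Gaussian random projection

x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
x_proj, *_ = np.linalg.lstsq(R @ A, R @ b, rcond=None)

z = np.linalg.norm(A @ x_exact - b)
z_proj = np.linalg.norm(A @ x_proj - b)
# z <= z_proj always, and w.h.p. z_proj <= (1 + eps) z for l = O(d / eps^2)
```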
{\bf Approach 2: Random Sampling} \\
We could carry out random sampling similar to that in previous lectures. Unfortunately, this does not work well: $A$ might lose rank under random sampling, and that loss is hard to capture in a relative error bound. To illustrate this, consider a matrix $A$ with $n-1$ rows equal to a vector $\alpha$ and one row equal to a vector $\alpha'$ that is very different from $\alpha$. Any such sampling would lose $\alpha'$ with high probability, and we would incur high error. Recall that in the previous lecture we had an additive error bound for random sampling, which is why losing rank was not an issue.
Q: How can we still make random sampling work? \\
Idea: Decouple the subspace information of $A$ from its size information. \\
Let $A = U_A \Sigma_A V_A^T$ be the singular value decomposition of $A$. We use this decomposition to define modified sampling probabilities $\{P_i\}_{i=1}^n$:
\[ P_i = \frac{\|(U_A)_{(i)}\|_2^2}{\sum_{i'} \|(U_A)_{(i')}\|_2^2} = \frac{\|(U_A)_{(i)}\|_2^2}{d}, \]
where $(U_A)_{(i)}$ denotes the $i$-th row of $U_A$, and the second equality uses $\sum_{i'} \|(U_A)_{(i')}\|_2^2 = d$ (assuming $A$ has full column rank).
Intuition for $\|(U_A)_{(i)}\|_2^2$: \\
the values $\|(U_A)_{(i)}\|_2^2$ are the diagonal elements of the projection matrix $P_A$ onto the column space of $A$:
\[ (P_A)_{ij} = (A (A^TA)^{-1}A^T)_{ij} = (U_A U_A^T)_{ij} = \langle{(U_A)_{(i)}}, { (U_A)_{(j)}} \rangle \]
These values allow us to detect points in the data set that might have a disruptive effect on, or high leverage over, the fitted model; they are also referred to as leverage scores or effective resistances. Real data often exhibit a wide range of leverage scores, so these values allow us to detect such points early on. \\
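The leverage scores can be computed from a thin SVD of $A$. A short sketch (assuming, as above, that $A$ has full column rank):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 8
A = rng.standard_normal((n, d))

U, _, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is n x d
lev = np.sum(U**2, axis=1)    # ||(U_A)_(i)||_2^2 = diag of A (A^T A)^{-1} A^T
P = lev / d                   # sampling probabilities (the scores sum to d)

# Check against the diagonal of the projection matrix P_A.
P_A = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(lev, np.diag(P_A))
```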
Now we propose the modified random sampling algorithm: \\
{\bf Algorithm}: Sampling $L_2$ algorithm. \\
\begin{itemize}
\item Compute $P_i = \frac{1}{d} \| (U_A)_{(i)} \|_2^2$.
\item Sample $r = O(\frac{d \lg d}{\epsilon^2})$ rows according to the probabilities $\{P_i\}$, generating the sampling matrix $S$.
\item Solve $\min_x \|SAx - Sb\|_2$.
\end{itemize}
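A sketch of the sampling algorithm. One detail the pseudocode leaves implicit is that the sampling matrix $S$ rescales each chosen row by $1/\sqrt{rP_i}$, the standard importance-sampling correction; that assumption is made explicit below:

```python
import numpy as np

def sample_l2_regression(A, b, r, rng):
    """Leverage-score sampling for least squares (sketch)."""
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    P = np.sum(U**2, axis=1) / d                # leverage-score probabilities
    idx = rng.choice(n, size=r, replace=True, p=P)
    scale = 1.0 / np.sqrt(r * P[idx])           # rows of the sampling matrix S
    x_tilde, *_ = np.linalg.lstsq(scale[:, None] * A[idx], scale * b[idx],
                                  rcond=None)
    return x_tilde

rng = np.random.default_rng(3)
n, d = 2000, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_tilde = sample_l2_regression(A, b, r=400, rng=rng)
```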
This algorithm still takes $O(nd^2)$ time just for the first step. It is an open question whether the $P_i$'s can be approximated to low relative error in $o(nd^2)$ time. \\
In the next section we analyze the error rate of this algorithm and discuss how it can be modified for faster performance.
\section{Analysis of the modified Random Sampling Algorithm}
\begin{theorem}
If the $\{P_i\}$ satisfy $P_i \geq \beta \cdot \frac{\|(U_A)_{(i)}\|_2^2}{d}$ for some $\beta \in (0, 1]$, and $r = \Theta(d^2/(\beta \epsilon))$, then with high probability
\begin{eqnarray}
\| b - A \tilde{x}_{OPT} \|_2 \leq (1 + \epsilon)z = (1 + \epsilon) \| b - A x_{OPT} \|_2 \\
\| x_{OPT} - \tilde{x}_{OPT}\|_2 \leq \sqrt{\epsilon}\, \kappa(A) \sqrt{\gamma^{-2} - 1}\, \|x_{OPT}\|_2
\end{eqnarray}
where $\tilde{x}_{OPT}$ is the output of the sampling algorithm, $\kappa(A)$ is the condition number of $A$, and $\gamma \in (0, 1]$ is such that $\|U_AU_A^Tb\|_2 \geq \gamma \|b\|_2$.
\end{theorem}
\begin{remark} The $d^2$ in the bound on $r$ can be improved to $d \lg d$.
\end{remark}
Before proceeding to prove the theorem stated above we establish the following structural result.
\begin{lemma}
Let $\X$ be any matrix such that
\begin{eqnarray}
\sigma_{\min}(\X U_A) \geq 9/10 \\
\| U_A^T \X^T \X b^{\perp} \|_2^2 \leq \frac{\epsilon}{2} z^2
\end{eqnarray}
where $b^{\perp}$ is the part of $b$ outside the column space of $A$, i.e., $b^{\perp} = b - AA^+b$. Then conclusions (1) and (2) of the theorem above hold for $\tilde{x}_{OPT} = \arg\min_x \|\X A x - \X b\|_2$. \\
\end{lemma}
{\it Proof:} Since $b = A x_{OPT} + b^{\perp}$,
\begin{eqnarray}
\min_{x'} \| \X b - \X A x' \|_2 & = & \min_{y'} \| \X b^{\perp} - \X A y' \|_2 \\
& = & \min_{z'} \| \X b^{\perp} - \X U_A z' \|_2 \qquad (**)
\end{eqnarray}
where $y' = x' - x_{OPT}$ and $z' = \Sigma_A V_A^T y'$.
Define $Z = \Sigma_A V_{A}^{T} (x_{OPT} - \tilde{x}_{OPT})$, and note that $-Z$ is an optimal solution to $(**)$. Solving $(**)$ via the normal equations gives
\[ (\X U_A)^T (\X U_A) Z = -(\X U_A)^T \X b^{\perp}. \]
Taking squared norms on both sides, and using the first condition of the lemma on the left and the second on the right,
\[ \frac{1}{2} \| Z \|_2^2 \leq \| (\X U_A)^T (\X U_A) Z \|_2^2 \leq \frac{\epsilon}{2} z^2, \]
so $\|Z\|_2^2 \leq \epsilon z^2$.
Now consider the residual vector:
\begin{eqnarray}
\| b - A \tilde{x}_{OPT} \|_{2}^{2} & = & \| b - A x_{OPT} + A x_{OPT} - A \tilde{x}_{OPT} \|_{2}^{2} \\
& = & \| b - A x_{OPT} \|_{2}^{2} + \| A (x_{OPT} - \tilde{x}_{OPT}) \|_{2}^{2} \\
& = & z^2 + \| U_A Z \|_{2}^{2} \\
& \leq & z^2 + \epsilon z^2 \\
& = & (1+\epsilon) z^2
\end{eqnarray}
where the second equality holds because $b - A x_{OPT} = b^{\perp}$ is orthogonal to the column space of $A$. Taking square roots yields the first claim.
For the second claim, note that $A( x_{OPT} - \tilde{x}_{OPT} ) = U_A Z$, since $Z = \Sigma_{A} V_A^T (x_{OPT} - \tilde{x}_{OPT})$. Hence
\begin{eqnarray}
\| x_{OPT} - \tilde{x}_{OPT} \|_2^2 & \leq & \frac{\| U_A Z \|_2^2}{\sigma_{\min}^2(A)} \\
& \leq & \frac{\epsilon z^2}{\sigma_{\min}^2(A)} \qquad (***)
\end{eqnarray}
Assume $\| U_A U_A^T b \|_2 \geq \gamma \| b \|_2$. Then
\begin{eqnarray}
z^2 & = & \| b \|_2^2 - \| U_A U_A^T b \|_2^2 \\
& \leq & (\gamma^{-2} - 1) \| U_A U_A^T b \|_2^2 \\
& \leq & \sigma_{\max}^2(A)\, (\gamma^{-2}-1)\, \| x_{OPT} \|_2^2
\end{eqnarray}
since
\begin{eqnarray}
x_{OPT} & = & A^{+} b \; = \; V_A \Sigma_A^{-1} U_A^T b \\
\| x_{OPT} \|_2 & = & \| V_A \Sigma_A^{-1} U_A^T b \|_2 \\
& \geq & \sigma_{\min}(\Sigma_A^{-1}) \| U_A^T b \|_2 \\
& = & \frac{\| U_A U_A^T b \|_2}{ \sigma_{\max}(A) }
\end{eqnarray}
Plugging this back into $(***)$, we get
\begin{eqnarray}
\| x_{OPT} - \tilde{x}_{OPT} \|_2^2 & \leq & \frac{\epsilon z^2}{ \sigma_{\min}^2(A)} \\
& \leq & \epsilon\, \kappa^2(A)\, (\gamma^{-2}-1)\, \| x_{OPT} \|_2^2
\end{eqnarray}
and taking square roots gives the second claim.
The key property behind the two conditions of the lemma is that $\X U_A$ behaves approximately like a matrix with orthonormal columns, and that $\X U_A$ stays nearly orthogonal to $\X b^{\perp}$.
\begin{remark}
If $S$ is constructed by sampling with probabilities $P_i \geq \beta \cdot \|(U_A)_{(i)}\|^2_2/d$, then with high probability $S$ satisfies the conditions on $\X$ in the above lemma.
\end{remark}
{\bf Extension 1: Relative error low rank matrix approximation. } \\
Given a matrix $A$, let $p_i := \|(U_{A,k})_{(i)}\|_2^2/k$, where $U_{A,k}$ contains the top $k$ left singular vectors of $A$. Sample $r = O(d^2)$ columns of $A$ with these probabilities and let $C$ be the matrix of sampled columns; then
\[ \|A - P_C A\|_2 \leq (1+\epsilon) \|A- A_k\|_2 \]
where $P_C$ is the projection onto the column space of $C$.
{\bf Extension 2: Fast Johnson Lindenstrauss Transform (Ailon, Chazelle)} \\
Algorithm for fast least squares. \\
\begin{itemize}
\item Preprocess A,b by randomized Hadamard transform
\item Uniformly sample $O(d \lg d/\epsilon)$ rows.
\item Solve induced subproblem.
\end{itemize}
This gives a $(1+ \epsilon)$ approximation in $o(nd^2)$ time. \\
~\\
definition: {\bf Hadamard Matrix} \\
A Hadamard matrix is a square matrix with entries from $\{-1, 1\}$ whose rows (and columns) are mutually orthogonal. A family of Hadamard matrices for $n = 2^k$ can be constructed recursively as follows:
\[H_n = \left( \begin{array}{cc} H_{n/2} & H_{n/2} \\ H_{n/2} & -H_{n/2} \end{array} \right)\; \text{with} \; H_2 = \left( \begin{array}{cc} 1& 1 \\ 1& -1 \end{array} \right) \]
Any Hadamard matrix can be normalized (made orthonormal) by multiplying it by $1/\sqrt{n}$. \\
{\bf Note} \\
For a Hadamard matrix $H$ and any $x \in \mathbb{R}^n$, the product $Hx$ can be computed in $O(n \log n)$ time. \\
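The $O(n \log n)$ bound is achieved by the fast Walsh-Hadamard transform, which applies the recursion above in place instead of forming $H$. A minimal iterative sketch:

```python
import numpy as np

def fwht(x):
    """Return H_n x for the (unnormalized) recursive Hadamard matrix H_n,
    in O(n log n) time; len(x) must be a power of two."""
    x = np.array(x, dtype=float)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b           # top half of the recursion
            x[i + h:i + 2 * h] = a - b   # bottom half of the recursion
        h *= 2
    return x
```

Dividing the result by $\sqrt{n}$ gives the normalized transform.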
~\\
We hope to use Hadamard matrices to generate smoother versions of a vector $x$. To that end, we use $\|x\|_{\infty}/\|x\|_2$ as a measure of localization and set out to construct a transform that produces vectors with low localization. With a deterministic $H$, there exist dense vectors $x$ such that $Hx$ is sparse. Hence we use a randomized Hadamard transform $HD$, where $D$ is a diagonal matrix with entries from $\{-1, 1\}$. The entries of $D$ are chosen randomly so that $HDx$ has low localization with high probability. \\
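A small demonstration of why the random signs matter. If $x$ is itself a (dense) column of the normalized $H$, then $Hx$ is a standard basis vector, i.e., fully localized; multiplying by a random sign matrix $D$ first spreads the result out. The matrix size here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 256
# Build a normalized Hadamard matrix by the Sylvester recursion.
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(n)

x = H[:, 0]                          # dense unit vector (all entries 1/sqrt(n))
D = rng.choice([-1.0, 1.0], size=n)  # random diagonal sign matrix

y_det = H @ x                        # = e_0: ||y||_inf / ||y||_2 = 1, worst case
y_rand = H @ (D * x)                 # delocalized with high probability

assert np.isclose(np.abs(y_det).max(), 1.0)
assert np.abs(y_rand).max() < 0.5    # far from fully localized
```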
~\\
\begin{theorem}[Ailon, Chazelle 2006] \cite{ailon}
Let $x \in \R^n$ be a unit vector and let $\mathcal{H} = HD$, where $H$ is a normalized Hadamard matrix and $D$ is a diagonal matrix with random entries from $\{-1, 1\}$. Let $y = \mathcal{H}x$. Then $\|y\|_2 = 1$ and, with high probability, $\|y\|_{\infty} = O\left(\sqrt{\frac{\log n}{n}}\right)$.
\end{theorem}
~\\
\begin{theorem} \cite{DMMS}
Let $U$ be an $n \times d$ orthogonal matrix, let $N = \max \{n, d\}$, and let $\mathcal{H} = HD$ be the $n \times n$ randomized Hadamard transform described above. Then with high probability,
\[ | (\mathcal{H} U)_{ij} | \leq C_0 \sqrt{\frac{\log N}{n}}\]
\[ \| (\mathcal{H}U)_{(i)}\|_2^2 \leq C_0 \frac{d \log N}{n} \]
for some constant $C_0$.
\end{theorem}
Now we define the modified algorithm, which runs in $o(nd^2)$ time. \\
The algorithm computes a solution $x'_{OPT}$ for the problem
\[ \mathcal{Z} = \min_{x \in \mathbb{R}^d} \|\mathcal{H}(Ax-b)\|_2 \]
Since $\mathcal{H}$ is orthonormal, this is also a solution to the original problem. The algorithm is as follows: \\
~\\
{\bf Algorithm}
\begin{itemize}
\item Define a uniform sampling matrix $\mathcal{S}$ and the randomized Hadamard transform $\mathcal{H}$.
\item Preprocess the data with $\mathcal{H}$, i.e., compute $\mathcal{H}A$ and $\mathcal{H}b$.
\item Sample $r$ rows uniformly, where $r = O(d \log d/\epsilon)$.
\item Solve $\min_x \| \mathcal{S} \mathcal{H} A x - \mathcal{SH}b\|_2$.
\end{itemize}
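Putting the pieces together, a sketch of the full algorithm. For clarity it forms $H$ explicitly, which is slow; a fast Hadamard transform would replace the matrix products to reach the stated $o(nd^2)$ running time. The uniform-sampling rescaling factor $\sqrt{n/r}$ and the constants are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1024, 6                       # n a power of two
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Randomized Hadamard transform: preprocess A and b.
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.block([[H, H], [H, -H]])
H /= np.sqrt(n)
D = rng.choice([-1.0, 1.0], size=n)
HA, Hb = H @ (D[:, None] * A), H @ (D * b)

# Leverage scores are now roughly uniform, so sample rows uniformly.
r = 200                              # illustrative; r = O(d log d / eps)
idx = rng.choice(n, size=r, replace=False)
scale = np.sqrt(n / r)               # rescaling for uniform sampling
x_tilde, *_ = np.linalg.lstsq(scale * HA[idx], scale * Hb[idx], rcond=None)
```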
~\\
More details and proofs can be found in \cite{DMMS}. See bibliography for other related papers.
\begin{thebibliography}{9}
\bibitem{ailon} N. Ailon and B. Chazelle, ``Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform,'' STOC, pages 557--563, 2006.
\bibitem{DMMS} P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlos, ``Faster least squares approximation.''
\bibitem{dmm} P. Drineas, M. W. Mahoney, and S. Muthukrishnan, ``Sampling algorithms for $\ell_2$ regression and applications.''
\bibitem{rt} V. Rokhlin and M. Tygert, ``A fast randomized algorithm for overdetermined linear least-squares regression.''
\bibitem{amt} H. Avron, P. Maymounkov, and S. Toledo, ``Blendenpik: Supercharging LAPACK's least-squares solver.''
\end{thebibliography}
\end{document}