\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{amsfonts}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{moreverb}
\usepackage{pdfpages}
\usepackage{multirow}
\usepackage[ruled,linesnumbered]{algorithm2e}

\title{CS 224w: Problem Set 4}
\author{Tony Hyun Kim}
\setlength{\parindent}{0pt}
\setlength\parskip{0.1in}
\setlength\topmargin{0in}
\setlength\headheight{0in}
\setlength\headsep{0in}
\setlength\textheight{8.2in}
\setlength\textwidth{6.5in}
\setlength\oddsidemargin{0in}
\setlength\evensidemargin{0in}

\pdfpagewidth 8.5in
\pdfpageheight 11in

% Custom commands
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}

\begin{document}

\maketitle

\section{Variations on a theme of PageRank}

\subsection{Personalized PageRanks for $E$, $F$, $G$}

The basic intuition is that personalized PageRank should follow superposition in the teleport set $S$. For instance, the personalized PageRank vector for $S=\left\{1,2\right\}$ should be a superposition (up to normalization) of the PageRank vectors of $\left\{1\right\}$ and $\left\{2\right\}$ separately. The superposition property makes particular sense given the ``random surfer with restarts'' interpretation of personalized PageRank, where each instance of the path made by the random surfer is independent of others.
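This superposition property is easy to verify numerically. Below is a minimal power-iteration sketch in Python; the graph, teleport sets, and parameter values are made up purely for the check, not taken from the problem.

```python
# Sketch: personalized PageRank is linear in the teleport distribution.
# The toy graph below is illustrative, not the one from the problem.

def personalized_pagerank(adj, teleport_set, beta=0.85, iters=200):
    """Power iteration with uniform teleport over teleport_set.

    adj maps node -> list of out-neighbors (no dead ends assumed).
    """
    nodes = sorted(adj)
    r = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            share = beta * r[v] / len(adj[v])  # spread rank along out-links
            for w in adj[v]:
                nxt[w] += share
        for v in teleport_set:                  # teleport mass goes only to S
            nxt[v] += (1 - beta) / len(teleport_set)
        r = nxt
    return r

adj = {1: [2, 3], 2: [3], 3: [1], 4: [1, 3]}
r1 = personalized_pagerank(adj, {1})
r2 = personalized_pagerank(adj, {2})
r12 = personalized_pagerank(adj, {1, 2})

# Superposition: r_{1,2} equals the (normalized) average of r_1 and r_2.
for v in adj:
    assert abs(r12[v] - 0.5 * (r1[v] + r2[v])) < 1e-9
```

The check works because the power-iteration update is affine in the teleport vector, so averaging the teleport distributions averages the iterates at every step.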

\subsubsection{Eloise}

Yes. Eloise's PageRank vector is given by:
\begin{equation}
	r_E = 3r_A - (3r_B-(3r_C-r_D)) - r_D = 3r_A - 3r_B + 3r_C - 2r_D.
\end{equation}

\subsubsection{Felicity}

No. Given the known PageRank vectors, it is not possible to form a linear combination of the corresponding teleport sets that produces the desired set $\left\{5\right\}$. Note that the only known teleport sets that include $5$ are Bertha's and Clementine's; however, both also include $4$, which cannot be eliminated without also eliminating $5$.


\subsubsection{Glynnis}

Yes, Glynnis' PageRank vector is given by:
\begin{eqnarray}
	r_G &=& \frac{1}{10} \left( 2\cdot 3r_A + 1\cdot 3r_B + 1\cdot 3r_C - 2\cdot 1 r_D\right),\label{eqn:glynnis_comb}\\
			&=& \frac{1}{10} \left( 6r_A + 3r_B + 3r_C - 2r_D\right).
\end{eqnarray}

In Eq.~\ref{eqn:glynnis_comb}, I factored each coefficient as $a\cdot b$ where $a$ is the combination coefficient required to construct Glynnis' teleport set from the corresponding known teleport sets, and $b$ is a normalization factor.

\subsection{Set of possible personalized PageRank vectors}

Given $V$, we can compute the personalized PageRank vectors of exactly those teleport sets that can be formed as linear combinations (up to normalization) of the teleport sets associated with $V$.

\subsection{Isolated spam farm}

We use the fact that by symmetry of the boosting pages, $p_1 = p_2 = \cdots = p_k$. The PageRank equation for the target page can be written as follows:
\begin{eqnarray}
	p_0 &=& \beta \cdot \sum_{i\rightarrow 0} \frac{r_i}{d_i} + (1-\beta)\cdot\frac{1}{N},\\
			&=& \beta \cdot \left( \lambda + k p_1 \right) + (1-\beta)\cdot\frac{1}{N}\label{eqn:spamfarm_p0},
\end{eqnarray}
where, in the second line, we have separately accounted for the PageRank flow coming from the ``rest of the network'' and the boosting pages.

The PageRank equation for the boosting pages can be written as:
\begin{equation}
	p_1 = \beta \cdot \frac{p_0}{k} + (1-\beta)\cdot\frac{1}{N}\label{eqn:spamfarm_p1}.
\end{equation}

We then eliminate $p_1$ from Eq.~\ref{eqn:spamfarm_p0} using Eq.~\ref{eqn:spamfarm_p1}:
\begin{equation}
	p_0 = \beta \cdot \left[\frac{\lambda + (1-\beta)\cdot\frac{k}{N}}{1-\beta^2}\right] + (1-\beta)\cdot \left[\frac{1}{(1-\beta^2)\cdot N}\right]\label{eqn:p_0}.
\end{equation}
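As a sanity check, the closed form in Eq.~\ref{eqn:p_0} can be verified numerically against Eqs.~\ref{eqn:spamfarm_p0} and \ref{eqn:spamfarm_p1}; the parameter values below are arbitrary test values.

```python
# Numeric sanity check of the isolated-spam-farm closed form.
# beta, k, N, lam are arbitrary test values.
beta, k, N, lam = 0.85, 20, 10_000, 0.003

# Closed form for the target page (Eq. for p_0):
p0 = beta * (lam + (1 - beta) * k / N) / (1 - beta**2) \
     + (1 - beta) / ((1 - beta**2) * N)
# Boosting-page equation gives p_1 in terms of p_0:
p1 = beta * p0 / k + (1 - beta) / N

# p_0 must also satisfy its own PageRank equation:
assert abs(p0 - (beta * (lam + k * p1) + (1 - beta) / N)) < 1e-12
```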

\subsection{Linked spam farm: Configuration 1}

In the new configuration, $p_1 = p_2 = \cdots = p_k = q_1 = q_2 = \cdots = q_m$. In addition, $\bar{p}_0 = \bar{q}_0$.

Applying the PageRank equation, we find:
\begin{eqnarray}
	\bar{p}_0 = \bar{q}_0 &=& \beta \cdot \left[k\cdot \frac{p_1}{2}+ m\cdot \frac{q_1}{2}\right] + (1-\beta)\cdot \frac{1}{N},\\
												&=& \beta \cdot \left[ (k+m)\cdot \frac{p_1}{2} \right] + (1-\beta)\cdot \frac{1}{N},\\
	p_1 = q_1 &=& \beta \cdot \left[ \frac{\bar{p}_0}{k+m} + \frac{\bar{q}_0}{k+m} \right] + (1-\beta)\cdot \frac{1}{N},\\
						&=& \beta \cdot \left[ \frac{2 \cdot\bar{p}_0}{k+m} \right] + (1-\beta)\cdot\frac{1}{N}.
\end{eqnarray}

Solving for $\bar{p}_0$ yields:
\begin{equation}
	\bar{p}_0 = \bar{q}_0 = \beta \cdot \left[\frac{\left(\frac{k+m}{2}\right)}{(1+\beta)\cdot N}\right] + (1-\beta)\cdot\left[\frac{1}{(1-\beta^2)\cdot N}\right].
\end{equation}
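The same numerical sanity check applies here: the closed form for $\bar{p}_0$ should satisfy the coupled PageRank equations above (arbitrary test parameters).

```python
# Numeric sanity check of the linked-farm (configuration 1) closed form.
beta, k, m, N = 0.85, 20, 30, 10_000

# Closed form for the (identical) target pages:
pbar = beta * ((k + m) / 2) / ((1 + beta) * N) \
       + (1 - beta) / ((1 - beta**2) * N)
# Boosting-page equation:
p1 = beta * 2 * pbar / (k + m) + (1 - beta) / N

# Target-page equation must hold:
assert abs(pbar - (beta * (k + m) * p1 / 2 + (1 - beta) / N)) < 1e-12
```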

For comparison against the isolated spam farm case, we set $\lambda = 0$ in Eq.~\ref{eqn:p_0}:
\begin{eqnarray}
	p_0' &=& \beta \cdot \left[\frac{k}{(1+\beta)\cdot N}\right] + (1-\beta)\cdot \left[\frac{1}{(1-\beta^2)\cdot N}\right],\\
	q_0' &=& \beta \cdot \left[\frac{m}{(1+\beta)\cdot N}\right] + (1-\beta)\cdot \left[\frac{1}{(1-\beta^2)\cdot N}\right].
\end{eqnarray}

It then follows that:
\begin{eqnarray}
	\bar{p}_0 - p_0' &=& \beta \cdot \left[\frac{\left(\frac{m-k}{2}\right)}{(1+\beta)\cdot N}\right],\\
	\bar{q}_0 - q_0' &=& \beta \cdot \left[\frac{\left(\frac{k-m}{2}\right)}{(1+\beta)\cdot N}\right],
\end{eqnarray}
where the $(1-\beta)$ teleport terms cancel in the differences.

Overall, this spam farm configuration simply ``averages'' the effort of the two independent farms and does not offer any nonlinear ``compounding'' gains from the resources used (\emph{i.e.} $\bar{p}_0 + \bar{q}_0 = p_0' + q_0'$).

\subsection{Linked spam farm: Configuration 2}

We now consider the case where only the target pages are linked. In this configuration, the boosting pages do not have any in-links, so their PageRank values are $p_1 = \cdots = p_k = q_1 = \cdots = q_m = (1-\beta)\cdot\frac{1}{N}$. The PageRank equations for $\bar{\bar{p}}_0$ and $\bar{\bar{q}}_0$ are:
\begin{eqnarray}
	\bar{\bar{p}}_0 &=& \beta \cdot \left(k\cdot p_1 + \bar{\bar{q}}_0\right) + (1-\beta)\cdot\frac{1}{N} = \beta \cdot \left( (1-\beta)\cdot \frac{k}{N} + \bar{\bar{q}}_0\right) + (1-\beta)\cdot\frac{1}{N},\\
	\bar{\bar{q}}_0 &=& \beta \cdot \left(m\cdot q_1 + \bar{\bar{p}}_0\right) + (1-\beta)\cdot\frac{1}{N} = \beta \cdot \left( (1-\beta)\cdot \frac{m}{N} + \bar{\bar{p}}_0\right) + (1-\beta)\cdot\frac{1}{N}.
\end{eqnarray}

Solving the set of equations for $\bar{\bar{p}}_0$ yields:
\begin{equation}
	\bar{\bar{p}}_0 = \beta \cdot \left[\frac{k+\beta m+1}{(1+\beta)\cdot N}\right] + (1-\beta)\cdot\left[\frac{1}{(1-\beta^2)\cdot N}\right].
\end{equation}
We can also obtain $\bar{\bar{q}}_0$ by swapping $k$ and $m$:
\begin{equation}
	\bar{\bar{q}}_0 = \beta \cdot \left[\frac{m+\beta k+1}{(1+\beta)\cdot N}\right] + (1-\beta)\cdot\left[\frac{1}{(1-\beta^2)\cdot N}\right].
\end{equation}
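Both closed forms can be checked against the coupled PageRank equations numerically (arbitrary test parameters).

```python
# Numeric sanity check of the linked-farm (configuration 2) closed forms.
beta, k, m, N = 0.85, 20, 30, 10_000

pb = beta * (k + beta * m + 1) / ((1 + beta) * N) \
     + (1 - beta) / ((1 - beta**2) * N)
qb = beta * (m + beta * k + 1) / ((1 + beta) * N) \
     + (1 - beta) / ((1 - beta**2) * N)

# Each target must satisfy its PageRank equation, with the other target
# as an in-link and k (resp. m) boosting pages of rank (1-beta)/N:
assert abs(pb - (beta * ((1 - beta) * k / N + qb) + (1 - beta) / N)) < 1e-12
assert abs(qb - (beta * ((1 - beta) * m / N + pb) + (1 - beta) / N)) < 1e-12
```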

So, it follows that:
\begin{eqnarray}
	\bar{\bar{p}}_0 - p_0' &=& \beta \cdot \left[\frac{\beta m + 1}{(1+\beta)\cdot N}\right],\\
	\bar{\bar{q}}_0 - q_0' &=& \beta \cdot \left[\frac{\beta k + 1}{(1+\beta)\cdot N}\right].
\end{eqnarray}
In other words, both $\bar{\bar{p}}_0$ and $\bar{\bar{q}}_0$ gain by linking their spam farms in this manner (\emph{i.e.} $\bar{\bar{p}}_0 + \bar{\bar{q}}_0 > p_0' + q_0'$).

\section{Approximate betweenness centrality}

The comparison between the exact and approximate betweenness centrality results is shown in Fig.~\ref{fig:p2}. The approximate result is plausible, given that we only used $10\%$ of the nodes for the approximation.

On the other hand, the approximate method does a poor job of reproducing the exact ordering of the edges by decreasing betweenness centrality (\emph{e.g.} consider the true rank of the edge with the highest \emph{approximate} betweenness centrality). This makes sense, since the ordering of edges is very sensitive to errors in the approximate betweenness centralities.

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.75\textwidth]{p2.pdf}
	\end{center}
	\caption{Comparison of the exact and approximate betweenness centrality computation results.\label{fig:p2}}
\end{figure}

I attached the exact and approximate code on the following pages. Admittedly, my approximate code is not efficient: it always performs \texttt{max\_iter} $= n/10$ BFS iterations, even if all $\Delta_e$'s are greater than $cn$. To implement the approximate algorithm correctly, one should properly implement the terminating condition in the loop. However, for the current case, even when I ran BFS for every $v \in V$ (\emph{i.e.} the exact algorithm), very few edges attained an aggregate $\Delta_e > cn = 5000$.
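For reference, here is a compact, self-contained sketch of Brandes-style edge betweenness with optional source sampling. It uses plain sampling with an $n/|S|$ rescaling rather than the assignment's $\Delta_e$-based stopping rule, and the toy path graph at the end is only for the assertion.

```python
from collections import deque

def edge_betweenness(adj, sources=None):
    """Brandes-style edge betweenness for an unweighted, undirected graph.

    adj maps node -> list of neighbors (symmetric). If `sources` is a
    sample of the nodes, results are rescaled by n/|sources|; the
    Delta_e early-stopping rule from the assignment is omitted here.
    """
    nodes = list(adj)
    sources = nodes if sources is None else list(sources)
    bet = {tuple(sorted((u, w))): 0.0 for u in adj for w in adj[u]}
    for s in sources:
        sigma = {v: 0 for v in nodes}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in nodes}; dist[s] = 0
        preds = {v: [] for v in nodes}
        order, queue = [], deque([s])
        while queue:                                   # BFS from s
            v = queue.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):                      # dependency accumulation
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[tuple(sorted((v, w)))] += c
                delta[v] += c
    scale = len(nodes) / (2 * len(sources))            # /2: undirected pairs
    return {e: b * scale for e, b in bet.items()}

# Path graph 0-1-2-3: the middle edge lies on the most shortest paths.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
exact = edge_betweenness(path)
assert exact[(0, 1)] == 3.0 and exact[(1, 2)] == 4.0
```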

\includepdf[pages=-,nup=1x2,landscape=true,]{p2_code.pdf}

\section{Stochastic Kronecker graphs}

\subsection{Indexing scheme} 

\begin{figure}[t!]
	\begin{center}
		\includegraphics[width=0.5\textwidth]{p3_indexing.pdf}
	\end{center}
	\caption{Indexing scheme (decreasing binary counter) that allows $\Theta_k$ to be represented as a product of $\Theta_1$, here illustrated for $k=2$.\label{fig:p3_indexing}}
\end{figure}

In order to use the desired factorization $\Theta_k(u_1u_2\cdots u_k, v_1v_2\cdots v_k) = \prod_{i=1}^k \Theta_1(u_i,v_i)$, the indexing should be a decrementing $k$-bit counter (\emph{i.e.} column indices decrease from left to right, and row indices decrease from top to bottom). This indexing is explicitly illustrated for $k=2$ in Fig.~\ref{fig:p3_indexing}.

\subsection{Compute $P[u,v]$}

Given the results of the previous part, we are able to compute $P[u,v]$ by considering each of the $k$ bit pairs $(u_i, v_i)$. Since we are told
\begin{itemize}
	\item the weight of node $u$ is $l$ (\emph{i.e.} the number of $1$'s in the binary representation of $u$),
	\item $i$ is the number of bits where $u_b = v_b = 1$,
	\item $j$ is the number of bits where $u_b = 0$ and $v_b = 1$,
\end{itemize}
the frequency of each bit pair combination is enumerated in Table~\ref{tab:freq}. Based on this table, we conclude that:
\begin{equation}
	P[u,v] = \alpha^i\cdot \beta^{(l-i)+j} \cdot \gamma^{(k-l)-j}.
\end{equation}
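This closed form can be brute-force checked against the bitwise product $\prod_{i=1}^k \Theta_1(u_i,v_i)$ for small $k$; the values of $\alpha$, $\beta$, $\gamma$ below are arbitrary.

```python
from itertools import product

# Check P[u,v] = alpha^i * beta^{(l-i)+j} * gamma^{(k-l)-j} against the
# entrywise product of Theta_1 values; a, b, g are arbitrary test values.
a, b, g = 0.9, 0.6, 0.2
theta1 = {(1, 1): a, (1, 0): b, (0, 1): b, (0, 0): g}

k = 4
for u in product([0, 1], repeat=k):
    for v in product([0, 1], repeat=k):
        l = sum(u)                                      # weight of u
        i = sum(ub & vb for ub, vb in zip(u, v))        # bits with u_b = v_b = 1
        j = sum((1 - ub) & vb for ub, vb in zip(u, v))  # bits with u_b = 0, v_b = 1
        prod_form = 1.0
        for ub, vb in zip(u, v):
            prod_form *= theta1[(ub, vb)]
        closed = a**i * b**((l - i) + j) * g**((k - l) - j)
        assert abs(prod_form - closed) < 1e-12
```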

\begin{table}[b!]
	\begin{center}
		\begin{tabular}{cccc}
			$u_b$ & $v_b$ & $\Theta_1(u_b,v_b)$ & Frequency \\
			\hline
			$1$ & $1$ & $\alpha$ & $i$   \\
			$1$ & $0$ & $\beta$  & $l-i$ \\
			$0$ & $1$ & $\beta$  & $j$   \\
			$0$ & $0$ & $\gamma$ & $(k-l)-j$
		\end{tabular}
	\end{center}
	\caption{Frequency of each bit pair combination\label{tab:freq}}
\end{table}

\subsection{Expected degree of node $u$ with weight $l$ represented using $k$-bits}

I claim that the expected degree of a node $u$ with weight $l$, represented using $k$ bits, is:
\begin{equation}
	\left(\alpha+\beta\right)^l \cdot \left(\beta+\gamma\right)^{k-l}\label{eq:expdeg}.
\end{equation}

I observed the above pattern in the case $k=2$. Consider again Fig.~\ref{fig:p3_indexing}. The expected degree of any particular node is the column sum of the adjacency matrix for that node, since each entry in the matrix is the probability of forming a single edge. For $k=2$, the column sums are:
\begin{itemize}
	\item $l=2$ (first column): The column sum is $\alpha^2 + 2\alpha\beta + \beta^2 = (\alpha+\beta)^2$,
	\item $l=1$ (second and third columns): The column sum is $\alpha\beta + \alpha\gamma + \beta^2 + \beta\gamma = (\alpha+\beta)\cdot(\beta+\gamma)$,
	\item $l=0$ (last column): The column sum is $\beta^2+2\beta\gamma+\gamma^2 = (\beta+\gamma)^2$.
\end{itemize}
I then make a leap of faith for $k\geq3$.
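The leap of faith can be backed up by brute force for small $k$: summing $P[u,v]$ over all $2^k$ columns $v$ reproduces the claimed closed form (arbitrary parameter values below).

```python
from itertools import product

# Brute-force check of E[deg(u)] = (a+b)^l * (b+g)^{k-l} for k = 5.
a, b, g = 0.9, 0.6, 0.2
theta1 = {(1, 1): a, (1, 0): b, (0, 1): b, (0, 0): g}

k = 5
for u in product([0, 1], repeat=k):
    l = sum(u)                       # weight of u
    expdeg = 0.0
    for v in product([0, 1], repeat=k):
        p = 1.0
        for ub, vb in zip(u, v):     # P[u,v] as a product of Theta_1 entries
            p *= theta1[(ub, vb)]
        expdeg += p                  # column sum = expected degree
    assert abs(expdeg - (a + b)**l * (b + g)**(k - l)) < 1e-9
```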

\subsection{Expected number of edges in the graph}

We assume that the graph is undirected. In this case, we can obtain the expected number of edges $\bar{M}$ in the graph by computing the sum of the expected degrees of each node and dividing by two. (We should in principle exempt self-edges from the divide-by-two, but I'll ignore this complication as instructed.)

The sum of the expected degrees of all nodes is the aggregate sum of all entries in the probability matrix $\Theta_k$. We have previously computed the column sums in Eq.~\ref{eq:expdeg}. Now, noting that there are $\binom{k}{l}$ nodes with weight $l$ (the possible arrangements of $l$ ones in a $k$-bit binary sequence), we have:
\begin{eqnarray}
	\bar{M} &=& \frac{1}{2}\cdot\sum_{l=0}^k \binom{k}{l} \cdot (\alpha+\beta)^l\cdot(\beta+\gamma)^{k-l},\\
					&=& \frac{1}{2}\cdot(\alpha+2\beta+\gamma)^k,
\end{eqnarray}
as the expected number of edges in the graph.

\subsection{Expected number of self loops}

The expected number of self-loops is the sum over the main diagonal of the probability matrix.

Consider a node with weight $l$. The probability it forms a self-edge is then given by $\alpha^l\cdot\gamma^{k-l}$. Again, taking the combinatorial multiplicity of the nodes with weight $l$ into account, we have:
\begin{eqnarray}
	\bar{M}_\mathrm{self} &=& \sum_{l=0}^k \binom{k}{l} \cdot \alpha^l \cdot \gamma^{k-l},\\
												&=& (\alpha+\gamma)^k,
\end{eqnarray}
as the expected number of self-loops.
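Both binomial-theorem reductions (for $\bar{M}$ and $\bar{M}_\mathrm{self}$) can be checked numerically; the parameter values are arbitrary.

```python
from math import comb

# Verify the binomial-theorem collapses used for M_bar and M_self.
a, b, g, k = 0.9, 0.6, 0.2, 7

# Expected number of edges: sum over weight classes vs. closed form.
M_bar = 0.5 * sum(comb(k, l) * (a + b)**l * (b + g)**(k - l)
                  for l in range(k + 1))
assert abs(M_bar - 0.5 * (a + 2 * b + g)**k) < 1e-9

# Expected number of self-loops.
M_self = sum(comb(k, l) * a**l * g**(k - l) for l in range(k + 1))
assert abs(M_self - (a + g)**k) < 1e-9
```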

\section{Anchored $k$-cores in social networks}

\subsection{The equilibrium network}

\begin{figure}[t!]
	\begin{center}
		\includegraphics[width=0.75\textwidth]{p4_eq.pdf}
	\end{center}
	\caption{[Left] The equilibrium network when each node requires $k=3$ neighbors in order to stay in the graph. [Right] The equilibrium network when nodes J and Q are ``brainwashed'' to stay in the network.\label{fig:p4_eq}}
\end{figure}

I pruned the provided network to its $3$-core. The equilibrium network consists of nodes A, B, C, D, E, F, I, K, and L, as shown in the left panel of Fig.~\ref{fig:p4_eq}.

\subsection{Finding the $k$-core}

The algorithm is given in Alg.~\ref{alg:kcore} below.

\begin{algorithm}[b]
	\SetAlgoLined
	$V_0 \gets$ set of nodes in $G$\;
	$V_1 \gets \emptyset$\;
	\While{true}{
		$G_1 \gets \mathrm{InducedGraph}(G, V_0)$\;
		$V_1 \gets \left\{v \in G_1\; |\; \mathrm{deg}_{G_1}(v)\geq k\right\}$\;
		\If {$V_0 = V_1$}
			 {break\;}
		$V_0 \gets V_1$\;
	}
	\Return{$V_1$}
	\caption{Find the $k$-core of a graph $G=(V,E)$\label{alg:kcore}}
\end{algorithm}
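Alg.~\ref{alg:kcore} translates almost line-for-line into Python. The sketch below peels all sub-threshold nodes in each round, which is equivalent to recomputing the induced subgraph; the node-to-neighbor-set adjacency representation is my own choice.

```python
def k_core(adj, k):
    """Iterative peeling of nodes with degree < k (the k-core algorithm).

    adj maps each node to a set of neighbors (undirected graph).
    """
    alive = set(adj)
    while True:
        # degree of each node within the currently induced subgraph
        doomed = {v for v in alive if len(adj[v] & alive) < k}
        if not doomed:
            return alive
        alive -= doomed

# Triangle {a, b, c} with a dangling chain c-d-e:
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'},
       'd': {'c', 'e'}, 'e': {'d'}}
assert k_core(adj, 2) == {'a', 'b', 'c'}   # only the triangle survives
assert k_core(adj, 3) == set()             # no 3-core in this graph
```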

\subsection{Saving nodes}

In the provided ``Facebook'' network, I would save nodes J and Q. This case is illustrated in the right panel of Fig.~\ref{fig:p4_eq}. With this intervention, we save $5$ additional nodes beyond the original case.

\subsection{Failure of naive greedy}

Indeed, the anchored $k$-core problem is greatly simplified for $k=2$. I will demonstrate a network structure in which the naive greedy algorithm for anchored $k$-core performs arbitrarily badly.

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.6\textwidth]{p4_naive_failure.pdf}
	\end{center}
	\caption{An example of a network for which the naive greedy algorithm for anchored $k$-core will fail arbitrarily badly (for $k=2$, $b=2$).\label{fig:p4_naive_failure}}
\end{figure}

Consider the network in Fig.~\ref{fig:p4_naive_failure}. The network consists of two disjoint components. One component is a single triad with a short, dangling segment. The other is a linear chain of $N$ nodes, where $N$ is assumed to be very large.

Note that for $k=2$, linear chains are not stable structures. For instance, the short, dangling segment connected to the triad will iteratively be eliminated (starting from $v_1$). Likewise, the long linear chain will also decay from its ends (starting from $v_2$ and $v_3$). In the basic network, only the triad persists in the equilibrium network.

In the network of Fig.~\ref{fig:p4_naive_failure}, the optimal solution for anchored $k$-core is to select $v_2$ and $v_3$ which, when taken together, will prevent the long linear chain from being eliminated. The optimal solution will save the $N$ nodes involved in the linear chain.

On the other hand, the naive greedy algorithm, which considers only one anchor at a time, will initially select $v_1$, since that choice provides an incremental improvement of two nodes ($v_1$ and its neighbor) compared to an incremental improvement of one node for any other choice. In the second iteration, it would then choose some arbitrary node in the long linear chain. Unfortunately, for $k=2$, you cannot save a chain without capping both ends!

So, with $b=2$, the optimal solution would have saved $N$ nodes, whereas the greedy algorithm only saves $3$ nodes. Hence, as $N$ is made large, the naive method fails arbitrarily badly.
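The failure mode is easy to simulate. The sketch below computes the anchored equilibrium by iterated removal (anchored nodes are never removed), using a $10$-node chain as a stand-in for the long chain in the figure.

```python
def equilibrium(adj, k, anchors=frozenset()):
    """Equilibrium network: iteratively remove non-anchored nodes
    with fewer than k surviving neighbors.

    adj maps node -> set of neighbors (undirected).
    """
    alive = set(adj)
    while True:
        doomed = {v for v in alive - set(anchors)
                  if len(adj[v] & alive) < k}
        if not doomed:
            return alive
        alive -= doomed

# A 10-node chain (stand-in for the long N-node chain):
n = 10
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

assert equilibrium(chain, 2) == set()                   # chain decays completely
assert equilibrium(chain, 2, anchors={0, n - 1}) \
       == set(range(n))                                 # capping both ends saves all
assert equilibrium(chain, 2, anchors={0}) == {0}        # one anchor saves only itself
```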

\subsection{RemoveCore}

Note that nodes in the $k$-core ($C_k$) of the unanchored network will always be in the equilibrium network when anchors are added. Hence, we do not need to explicitly consider the elements of $C_k$ when searching for the optimal anchor set.

Instead, we can solve the anchored $k$-core problem on $G'$ (with the edges of $C_k$ removed), where nodes in $C_k$ are treated as anchored for ``free'' (\emph{i.e.} without using up the budget). In effect, we are optimizing the incremental saves beyond the baseline case, rather than the absolute number of nodes in equilibrium.

\subsection{TwoStepGreedy}

Nothing.

\subsection{Data exploration}

My implementations of \texttt{TwoStepGreedy} and \texttt{HighestDeg} are attached at the end of this submission.

Fig.~\ref{fig:p4_results} shows the performance of the \texttt{TwoStepGreedy} and \texttt{HighestDeg} algorithms for anchored $K$-core ($K=2$) on \texttt{g1.txt} and \texttt{g2.txt}. It is interesting that the \texttt{HighestDeg} algorithm stalls on \texttt{g2.txt} for $b\geq2$ (even though there is room for improvement, as illustrated by the results of \texttt{TwoStepGreedy}).

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.7\textwidth]{p4_results.pdf}
	\end{center}
	\caption{The performance of \texttt{TwoStepGreedy} (blue) and \texttt{HighestDeg} algorithms on \texttt{g1.txt} and \texttt{g2.txt}.\label{fig:p4_results}}
\end{figure}

\subsection{Possible structures for $G_1$ and $G_2$}

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.9\textwidth]{p4_structures.pdf}
	\end{center}
	\caption{Two network structures in which the \texttt{HighestDeg} algorithm will perform well (left) and poorly (right). The salient difference is the average distances between the nodes of high degree (highlighted in red).\label{fig:p4_structures}}
\end{figure}

A notable feature of the $K$-core problem for $K=2$ is that we can save linear chains by anchoring their two endpoints. In the \texttt{HighestDeg} algorithm, we are ``hoping'' that the nodes with large degrees are likely to be endpoints of chains that may be saved.

In Fig.~\ref{fig:p4_structures}, I show two structures: the \texttt{HighestDeg} algorithm will perform well on one (saving many nodes in addition to the anchors) and poorly on the other. The distinction between the two network structures is the average distance between high-degree branching points.

The graph in the left panel of Fig.~\ref{fig:p4_structures} has relatively large distances between the nodes of high degree (marked in red) that are likely to be picked by the \texttt{HighestDeg} algorithm. In this case, the degree of a node is a decent proxy for effective anchors, since the nontrivial chains between the selected (red) nodes will also be saved. I suspect that \texttt{g1.txt} may be of this structure.

In contrast, the graph in the right panel has very short distances between the nodes of high degree. In this case, high-degree nodes are not effective anchors. I suspect that \texttt{g2.txt} may have such a structure, which would explain why its performance plateaus for $b\geq2$ (Fig.~\ref{fig:p4_results}).

\includepdf[pages=-,nup=1x2,landscape=true,]{p4_code.pdf}

\end{document}