\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{amsfonts}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{moreverb}
\usepackage{pdfpages}
\usepackage{multirow}

\title{CS 224w: Problem Set 1}
\author{Tony Hyun Kim}
\setlength{\parindent}{0pt}
\setlength\parskip{0.1in}
\setlength\topmargin{0in}
\setlength\headheight{0in}
\setlength\headsep{0in}
\setlength\textheight{8.2in}
\setlength\textwidth{6.5in}
\setlength\oddsidemargin{0in}
\setlength\evensidemargin{0in}

\pdfpagewidth 8.5in
\pdfpageheight 11in

% Custom commands
\newcommand{\vectornorm}[1]{\left|\left|#1\right|\right|}

\begin{document}

\maketitle

\section{Fighting Reticulovirus avarum}

\subsection{Set of nodes that will be infected}

We are assuming that once R. avarum infects a host, it always infects all of the host's contacts. Given an initially infected node $v$, it follows that the set $\mathrm{Out}(v)$ (i.e. nodes that can be reached from $v$) will be infected.
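The set $\mathrm{Out}(v)$ is just the forward-reachable set of $v$, computable by breadth-first search. A minimal stdlib-only sketch on a hypothetical toy contact network (the graph `toy` is invented for illustration):

```python
from collections import deque

def out_set(graph, v):
    """Forward-reachable set Out(v): every node infected if v is patient zero.

    graph: dict mapping node -> iterable of out-neighbors (adjacency list).
    """
    infected = {v}
    frontier = deque([v])
    while frontier:
        u = frontier.popleft()
        for w in graph.get(u, ()):
            if w not in infected:
                infected.add(w)
                frontier.append(w)
    return infected

# Hypothetical contact network: 0 -> 1, 1 -> {2, 3}, node 4 isolated.
toy = {0: [1], 1: [2, 3], 2: [], 3: [], 4: []}
print(sorted(out_set(toy, 0)))  # -> [0, 1, 2, 3]
```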

\subsection{Bow-tie structure of the email network}

Basic measurements on the email network:
\begin{itemize}
	\item Total number of nodes: $85,591$;
	\item Largest SCC: $22,868$ nodes ($26.7\%$ of total nodes);
	\item In-component of the largest SCC: $8,579$ ($10.0\%$);
	\item Out-component of the largest SCC: $12,319$ ($14.4\%$);
	\item Disconnected components (all nodes not part of the above ``bow-tie''): $41,825$ ($48.9\%$).
\end{itemize}
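The bow-tie decomposition follows from the largest SCC: the in-component is the set of nodes that reach the SCC (minus the SCC itself), and the out-component is the set reached from it. The numbers above were measured on the full email network; as a sketch of the method, here is a stdlib-only Kosaraju two-pass SCC computation on a hypothetical miniature bow-tie:

```python
from collections import defaultdict

def sccs(nodes, edges):
    """Kosaraju's algorithm: list of strongly connected components (as sets)."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    order, seen = [], set()
    for s in nodes:                      # pass 1: DFS finish order on fwd graph
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(fwd[s]))]
        while stack:
            node, it = stack[-1]
            for w in it:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(fwd[w])))
                    break
            else:
                order.append(node)
                stack.pop()
    comps, assigned = [], set()
    for u in reversed(order):            # pass 2: sweep the reverse graph
        if u in assigned:
            continue
        comp, stack = {u}, [u]
        assigned.add(u)
        while stack:
            x = stack.pop()
            for w in rev[x]:
                if w not in assigned:
                    assigned.add(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

# Miniature bow-tie: 0 (In) -> core cycle {1,2,3} -> 4 (Out); 5 disconnected.
edges = [(0, 1), (1, 2), (2, 3), (3, 1), (3, 4)]
core = max(sccs(range(6), edges), key=len)
print(sorted(core))  # -> [1, 2, 3]
```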

\subsection{Probability that a randomly chosen infected node leads to a large-scale epidemic (at least $30\%$ of the graph)}

Per the instructions, we focus on the ``bow-tie'' structure (the SCC core, the in-component and the out-component). There are three possibilities:
\begin{itemize}
	\item \textbf{A node in the SCC core is initially infected.} In this case, all nodes in the SCC will be infected, as well as the out-component. Hence the infection is large-scale.
	\item \textbf{A node in the in-component is initially infected.} In this case, the initially infected node (and any in-component nodes downstream of it) is infected, as well as the entirety of the SCC and the out-component. The infection is large-scale.
	\item \textbf{A node in the out-component is initially infected.} In this case, at most the entire out-component is infected, whose size is below the ``large-scale'' criterion ($30\%$). The infection is \emph{not} large-scale.
\end{itemize}

We ignore potential large-scale outbreaks in the non-bowtie components (tendrils, tubes, disconnected components).

Hence, the probability that a randomly chosen infected node leads to a large-scale epidemic is:
\begin{equation}
	\left(8,579+22,868\right)/85,591 = 36.7\%,
\end{equation}
\emph{i.e.} the probability that the randomly chosen node lies in the core or the in-component of the largest SCC.
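A one-line sanity check of the arithmetic, using the node counts measured above:

```python
# Counts from the bow-tie measurements: total nodes, SCC core, in-component.
n_total, n_scc, n_in = 85_591, 22_868, 8_579

# P(large-scale epidemic) = P(initial node in core or in-component).
p_epidemic = (n_scc + n_in) / n_total
print(f"{p_epidemic:.1%}")  # -> 36.7%
```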

\subsection{Twitter bow-tie}

\subsubsection{Infection in SCC}
If a node in the SCC is infected, then the entirety of the SCC and the out-component is infected, leading to $40\mathrm{M} + 30\mathrm{M} = 70\mathrm{M}$ infected nodes ($70\%$ of the entire graph).

\subsubsection{Worst-case outbreak size}
With no assumptions about the network's structure, the worst site for the initial infection is the in-component, since the infection will capture the entirety of the SCC and the out-component.

Now, the fraction of the in-component that will become infected will depend on the structure of the network. The worst case scenario is that a single infection site in the in-component can reach all nodes in the in-component. 

\begin{figure}[t]
	\begin{center}
		\includegraphics[width=0.5\textwidth]{bowtie.pdf}
	\end{center}
	\caption{[Left] A ``chain'' structure for the in-component of the Twitter network. We assume that the in-component consists of a single chain of length $20$M that leads into the SCC. The initial node (red) represents the ideal initial infection site for maximum havoc. In this structure, the in-component is ``weakly'' connected to the SCC and can be disconnected by, for instance, deleting the edge to SCC (blue arrow).\label{fig:bowtie}}
\end{figure}

A simple network structure (though highly unlikely) that achieves this worst-case infection scenario is if the in-component is a single directed chain (of $20$M nodes) that leads into the SCC, as shown in Fig.~\ref{fig:bowtie}. In this case, the evil TA should target the leading node in the in-component (marked red) for maximum havoc, i.e. infecting the entire bow-tie of $90$M nodes.

\subsubsection{Reduce the worst-case outbreak size}

Given the simple ``chain'' assumption for the in-component (Fig.~\ref{fig:bowtie}), it is trivial to reduce the size of the worst-case outbreak. I would remove the single edge (marked in blue) that connects the in-component to the SCC. By removal of this bridge edge, the worst case infection is reduced from $90$M to $70$M.
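The effect of deleting the bridge edge can be illustrated on a miniature chain-into-core graph (shapes and sizes are hypothetical; BFS over a directed edge list):

```python
from collections import deque

def reachable(edges, start):
    """BFS over a directed edge list; returns the set reachable from start."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, frontier = {start}, deque([start])
    while frontier:
        u = frontier.popleft()
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                frontier.append(w)
    return seen

# In-component chain 0->1->2, bridge 2->3, core cycle {3,4,5}, out-node 6.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 3), (5, 6)]
full = reachable(edges, 0)                             # worst case: everything
cut = reachable([e for e in edges if e != (2, 3)], 0)  # bridge edge removed
print(len(full), len(cut))  # -> 7 3
```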

\includepdf[pages=-,]{p1.pdf}

\section{Network characteristics}

\subsection{Degree distribution}

The (unnormalized) degree distribution is plotted in the left panel of Fig.~\ref{fig:degdistr}. Some comments about the individual curves:
\begin{itemize}
	\item $G(n,m)$ random network: the degree distribution is expected to be a binomial distribution with success probability $p = m/\binom{n}{2} = 2m/\left(n(n-1)\right)$, with mean $\bar{k} = (n-1)\cdot p = 2m/n \approx 5.5$. In the right panel of Fig.~\ref{fig:degdistr}, the measurements on the simulated graph are compared against theory (binomial distribution), showing excellent agreement.
	\item Small-world network: The salient feature of this distribution is that it is non-zero only for $k \geq 4$. This follows directly from the way in which the network is constructed: we began with an ordered network where each node has exactly $4$ neighbors, and the random long-distance edges can only increase a node's degree.
	\item Real-world collaboration network: Interestingly, the real world graph has a longer tail compared to the previous two examples. It appears that certain authors are extremely prolific and influential (many co-authors).
\end{itemize}

\begin{figure}[p]
	\begin{center}
		\includegraphics[width=1\textwidth]{p2a.pdf}
	\end{center}
	\caption{[Left] The degree distributions of $G(n=5242, m=14496)$ (blue), small world graph (red) and a real-world collaboration graph (black) on a log-log plot. [Right] Comparison of the observed degree distribution of $G(n,m)$ (dots) to the theoretically expected binomial distribution (dashed line).\label{fig:degdistr}}
\end{figure}

\subsection{Excess degree distribution}

\subsubsection{Plot of the excess degree distributions}

The (unnormalized) excess degree distribution is plotted in the left panel of Fig.~\ref{fig:excessdegdistr}. While the distributions are not normalized in the figure, it can be seen that the tail (high degrees) is more significant in the excess degree distribution than in the degree distribution. This is because the excess degree distribution weights each degree by the number of edge endpoints that carry it, so high-degree nodes contribute disproportionately (shown explicitly in Section~\ref{subsubsec:closed-form}).

\begin{figure}[p]
	\begin{center}
		\includegraphics[width=1\textwidth]{p2b.pdf}
	\end{center}
	\caption{[Left] Excess degree distributions of $G(n=5242, m=14496)$ (blue), small world graph (red) and a real-world collaboration graph (black) on a log-log plot. [Right] Comparison of the explicitly calculated excess degree distribution of the arXiv network (dots) to the closed-form expression based on the degree distribution (dashed line).\label{fig:excessdegdistr}}
\end{figure}

\subsubsection{Closed-form formula\label{subsubsec:closed-form}}

We wish to express the excess degree distribution $\left\{q_k\right\}$ in terms of the degree distribution $\left\{p_k\right\}$.

Consider the unnormalized excess degree distribution $q_k'$, given by
\begin{equation}
	q_k' = \sum_{i \in V} \sum_{(i,j) \in E} I_{\left[k_j=k+1\right]} = \sum_{i \in V} \sum_{(i,j) \in E} I_{\left[k_i=k+1\right]}.
	\label{eqn:unnorm-excess-deg}
\end{equation}
Both double sum expressions in Eq.~\ref{eqn:unnorm-excess-deg} enumerate the $2\cdot m$ terms that represent the two terminating nodes of every edge in the graph. It then follows:
\begin{eqnarray}
	q_k' &=& \sum_{i \in V} \sum_{(i,j) \in E} I_{\left[k_i=k+1\right]},\\
		   &=& \sum_{i \in V} \mathrm{deg}(i) \cdot I_{\left[k_i=k+1\right]} = \sum_{i \in V} (k+1)\cdot I_{\left[k_i=k+1\right]},\label{eqn:deriv-l2}\\
			 &=& (k+1)\cdot \sum_{i \in V} I_{\left[k_i=k+1\right]} = (k+1) \cdot p_{k+1}',\label{eqn:deriv-l3}
\end{eqnarray}
where in Eq.~\ref{eqn:deriv-l2} we have used the property that the indicator function is nonzero only when $\mathrm{deg}(i) = k+1$. The right panel of Fig.~\ref{fig:excessdegdistr} shows a comparison between the explicitly counted excess degree distribution (of the arXiv network) and the closed-form formula of Eq.~\ref{eqn:deriv-l3}. Using the normalization relations $p_k = 1/n \cdot p_k'$ and $q_k = 1/(2m) \cdot q_k'$, we conclude:
\begin{equation}
	q_k = \frac{n}{2m}\cdot(k+1)\cdot p_{k+1}.
\end{equation}
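The closed-form relation $q_k = \frac{n}{2m}(k+1)p_{k+1}$ can be checked numerically against direct edge-endpoint counting, here on a small hypothetical graph (a triangle with a pendant edge):

```python
from collections import Counter

# Hypothetical undirected graph: triangle {0,1,2} plus pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
nodes = sorted({u for e in edges for u in e})
n, m = len(nodes), len(edges)

deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Degree distribution p_k, normalized over nodes.
p = {k: c / n for k, c in Counter(deg[u] for u in nodes).items()}

# Excess degree distribution q_k by direct counting: for each of the 2m
# edge endpoints, record that endpoint's degree minus one.
q_direct = Counter()
for u, v in edges:
    q_direct[deg[u] - 1] += 1
    q_direct[deg[v] - 1] += 1
q_direct = {k: c / (2 * m) for k, c in q_direct.items()}

# Closed form: q_k = n/(2m) * (k+1) * p_{k+1}.
q_formula = {k - 1: n / (2 * m) * k * p[k] for k in p}
print(q_direct == q_formula)  # -> True
```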

\subsection{Clustering coefficient}

My approach for computing the clustering coefficient is as follows. Fix a node $i$, with at least $2$ neighbors. Obtain the induced subgraph consisting of node $i$ and its immediate neighbors. The number of edges in the induced graph minus $\mathrm{deg}(i)$ gives the number of edges $e_i$ between the neighbors of $i$, and the clustering coefficient for node $i$ can be computed as $C_i = \frac{2\cdot e_i}{k_i\cdot(k_i-1)}$.

Note that in computing the average clustering coefficient $\bar{C} = \frac{1}{|V|}\sum_{i \in V} C_i$, I let $|V|$ equal the number of nodes that have degree at least $2$, \emph{not} the total number of nodes in the graph.
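A sketch of this counting scheme in stdlib Python (the graph `adj` is a hypothetical example; set intersection over neighbor sets yields the same $e_i$ as the induced-subgraph count described above):

```python
def local_clustering(adj, i):
    """C_i = 2*e_i / (k_i*(k_i-1)); adj maps node -> set of neighbors."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return None  # excluded from the average, per the convention above
    # e_i: edges among the neighbors of i; each edge is seen from both ends.
    e = sum(len(nbrs & adj[u]) for u in nbrs) // 2
    return 2 * e / (k * (k - 1))

def average_clustering(adj):
    """Average C_i over nodes with degree at least 2 only."""
    vals = [c for c in (local_clustering(adj, i) for i in adj) if c is not None]
    return sum(vals) / len(vals)

# Hypothetical graph: triangle {0,1,2} with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))  # -> 0.7777... (= (1 + 1 + 1/3) / 3)
```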

The calculated clustering coefficients are as follows:
\begin{itemize}
	\item $G(n,m)$ random network: $\bar{C} = 0.0017$;
	\item Small-world network: $\bar{C} = 0.2839$;
	\item Real-world collaboration network: $\bar{C} = 0.3479$.
\end{itemize}

As we know from our class discussion, the purely random network $G(n,m)$ has a very small clustering coefficient. The small world model has a higher clustering coefficient, which owes to the regular connectivity structure of the graph prior to the random long-distance edges. (The small world model with no random edges would have a clustering coefficient of $0.5$.) The real-world collaboration network has an even higher clustering coefficient, reflecting our general observation that real-life (social) networks have significant ``local structure.''

\includepdf[pages=-,nup=1x2,landscape=true,]{p2.pdf}

\section{Decentralized search}

\subsection{Basic tree properties}

\subsubsection{Write $h(T)$ in terms of $N$}
\begin{equation}
	h(T) = \log_b(N)
\end{equation}

\subsubsection{Maximum value of $h(v,w)$}
\begin{equation}
	h(v,w) \leq h(T) = \log_b(N)
\end{equation}

\subsubsection{Number of nodes satisfying $h(v,w) = d$}

Let the distance $d$ and a particular node $v$ be fixed. Consider the subtree $T_1$ whose root node $r$ is the $d$-th ancestor of $v$. The subtree $T_1$ has a total of $b^d$ leaves. Among the direct descendants of $r$, there is one subtree $T_2$ that contains $v$ as one of its leaves; $T_2$ has a total of $b^{d-1}$ leaves. Any leaf of $T_2$ has a tree distance to $v$ of at most $d-1$. It follows that there are exactly $b^d-b^{d-1}$ nodes at tree distance exactly $d$ from $v$.
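This count can be verified numerically. Label the $N = b^h$ leaves $0,\dots,N-1$ in base $b$; then $h(v,w)$ is $h$ minus the length of the common prefix of the two base-$b$ labels (parameters below are arbitrary small values chosen for the check):

```python
def tree_distance(v, w, b, h):
    """h(v, w) between leaves of a complete b-ary tree with b**h leaves."""
    d = h
    # Shrink d while v and w agree on all base-b digits above level d-1.
    while d > 0 and v // b ** (d - 1) == w // b ** (d - 1):
        d -= 1
    return d

b, h = 3, 4
N = b ** h
v = 17  # arbitrary fixed leaf
for d in range(1, h + 1):
    count = sum(1 for w in range(N) if w != v and tree_distance(v, w, b, h) == d)
    assert count == b ** d - b ** (d - 1)
print("verified: b^d - b^(d-1) leaves at each distance d")
```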

\subsection{Network path properties}

\subsubsection{Show that $Z \leq \log_b(N)$}

The partition function $Z$ is defined as a sum over the nodes of the graph:
\begin{equation}
	Z = \sum_{w \neq v} b^{-h(v,w)}\label{eqn:partitionfun}.
\end{equation}

We may express the sum in Eq.~\ref{eqn:partitionfun} as a summation over the possible distances from a node $v$:
\begin{equation}
	Z = \sum_{d=1}^{h(T)} (b^d - b^{d-1})\cdot b^{-d} = \left(1-\frac{1}{b}\right)\cdot h(T) = \left(1-\frac{1}{b}\right)\cdot \log_b(N) \leq \log_b(N),
\end{equation}
where in the first equality we have used the fact that there are exactly $b^d - b^{d-1}$ nodes that have a distance $d$ to a fixed node $v$.
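Numerically, the sum works out to exactly $\left(1-\frac{1}{b}\right)\log_b(N)$, since every distance class contributes $1 - \frac{1}{b}$. A quick check with $b=2$, $h=10$ (so $N = 2^{10}$, the assignment's simulation size):

```python
import math

b, h = 2, 10
N = b ** h

# Z = sum over d of (# leaves at distance d) * b^{-d}; each term is 1 - 1/b.
Z = sum((b ** d - b ** (d - 1)) * b ** (-d) for d in range(1, h + 1))
expected = (1 - 1 / b) * math.log2(N)  # (1 - 1/b) * log_b(N), with b = 2
print(Z, expected)  # -> 5.0 5.0
```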

\subsubsection{Probability of edge pointing to $T'$}

By construction, the subtree $T'$ has $b^{h(v,t)-1}$ leaves, and any leaf $w$ in $T'$ has distance $h(v,t)$ to the original node $v$. It follows that the probability of obtaining an edge from $v$ into $T'$ is:
\begin{equation}
	p_{e\to T'} = b^{h(v,t)-1}\cdot \frac{1}{Z}b^{-h(v,t)} = \frac{1}{bZ} \geq \frac{1}{b\log_b(N)}.
\end{equation}

\subsubsection{Probability of no edges into $T'$ given $k$ out-degree}

Based on the previous result, the probability that a single edge from $v$ does \emph{not} reach $T'$ is:
\begin{equation}
	p_{e\not\to T'} = 1 - p_{e\to T'} \leq 1 - \frac{1}{b\log_b(N)}.
\end{equation}

The probability that $k = c\cdot (\log_b(N))^2$ independent edges all fail to reach $T'$ is then bounded by:
\begin{equation}
	p_{e\not\to T'}^k \leq \left(1 - \frac{1}{b\log_b(N)}\right)^k = \left(1 - \frac{1}{b\log_b(N)}\right)^{c\cdot(\log_b(N))^2}.\label{eqn:k-not-reach-T}
\end{equation}

Using the substitution $x = b\log_b(N)$, Eq.~\ref{eqn:k-not-reach-T} can be rewritten as:
\begin{equation}
	p_{e\not\to T'}^k \leq \left[\left(1 - \frac{1}{x}\right)^x\right]^{\frac{c}{b}\cdot\log_b(N)}.
\end{equation}

Since $\left(1 - \frac{1}{x}\right)^x \leq e^{-1}$ for all $x > 1$, we obtain
\begin{equation}
	p_{e\not\to T'}^k \leq e^{-\frac{c}{b}\cdot\log_b(N)} = N^{-\frac{c}{b\ln b}},
\end{equation}
which tends to $0$ as $N \to \infty$: with $k = c\cdot(\log_b(N))^2$ out-edges, the probability that $v$ has no edge into $T'$ vanishes.

\subsubsection{Show that starting from any node $s$, within $O(\log_bN)$ steps, we can reach any node $t$}

Previously we showed that for any current node $v$ with $k = c\cdot (\log_b(N))^2$ out-edges, the probability that no edge leads into $T'$ (the subtree containing the target node $t$) vanishes as $N\to\infty$. Hence, with high probability, we can take an edge into $T'$ and thereby reduce the tree distance to $t$ by at least $1$.

For any two nodes $s$ and $t$, the maximum tree distance $h(s,t)$ is $\log_bN$. At each iteration, by taking the edge that leads into $T'$, we reduce the tree distance by $1$. It follows that we can reach the target node $t$ in $O(\log_bN)$ steps.

\subsection{Simulation}

\subsubsection{Navigation simulation in Matlab}

I didn't see any particular reason to use Snap for this simulation, so I decided to implement it in Matlab. For efficiency, I precomputed the tree distance between all pairs of nodes. Furthermore, I wrote one script to generate the random graphs, and another to perform random searches over the precomputed graphs.% By the way, in the previous section we showed that the tree distance can enable efficient search as long as $k \approx c \cdot (\log_bN)^2$. The parameters given here -- $h(T)=10, b=2, k=5$ -- doesn't have ``enough'' out-degree per node (neglecting the role of $\alpha$). But I will do as you say...

%Also, the fact that $b=2$ allows us to calculate the tree distance between two leaves efficiently by using the binary representation of their node id. See the implementation for details.

The main results, namely the search success probability and the average path length for successful searches as a function of $\alpha$, are shown in Fig.~\ref{fig:sim}. In the figure, I also show some representative adjacency matrices as a function of $\alpha$. (I have ordered the node indexing such that clustering around the diagonal represents local connectivity, whereas the off-diagonal clusters represent long-range links.)
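The simulation itself was done in Matlab; below is a minimal stdlib-Python sketch of the same experiment, at hypothetical smaller parameters ($h=6$, $b=2$, $k=5$). Each node draws $k$ out-edges with probability proportional to $b^{-\alpha\, h(v,w)}$, and the search greedily follows the out-edge minimizing the tree distance to $t$; one reasonable failure rule (assumed here) is to stop if no edge strictly decreases that distance.

```python
import random

def tree_dist(v, w, h):
    """Tree distance between leaves of a complete binary tree (b = 2)."""
    d = h
    while d > 0 and v >> (d - 1) == w >> (d - 1):
        d -= 1
    return d

def make_graph(h, k, alpha, rng):
    """Each node gets k out-edges, drawn with weight 2^(-alpha*h(v,w))."""
    n = 2 ** h
    adj = []
    for v in range(n):
        weights = [0 if w == v else 2.0 ** (-alpha * tree_dist(v, w, h))
                   for w in range(n)]
        adj.append(rng.choices(range(n), weights=weights, k=k))
    return adj

def search(adj, s, t, h):
    """Greedy decentralized search; returns path length, or None on failure."""
    v, steps = s, 0
    while v != t:
        best = min(adj[v], key=lambda w: tree_dist(w, t, h))
        if tree_dist(best, t, h) >= tree_dist(v, t, h):
            return None  # stuck: no out-edge decreases the tree distance
        v, steps = best, steps + 1
    return steps

rng = random.Random(0)
h, k, alpha, trials = 6, 5, 1.0, 300
adj = make_graph(h, k, alpha, rng)
n = 2 ** h
results = [search(adj, rng.randrange(n), rng.randrange(n), h)
           for _ in range(trials)]
ok = [r for r in results if r is not None]
mean_len = sum(ok) / len(ok) if ok else float("nan")
print(len(ok) / trials, mean_len)  # success rate, mean successful path length
```

Since the greedy step strictly decreases the tree distance, any successful path has length at most $h$, mirroring the $O(\log_b N)$ bound shown above.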

\begin{figure}[p]
	\begin{center}
		\includegraphics[width=1\textwidth]{p3_simulation.pdf}
	\end{center}
	\caption{[Top left] The success probability of $N_\mathrm{search}=1000$ random $(s,t)$ searches as a function of $\alpha$. [Top right] The average path length of successful searches as a function of $\alpha$. [Bottom] Some representative adjacency matrices. The plots correspond to graphs with $\alpha = 0.1, 1.0, 8.0$. The node indexing is such that clustering around the diagonal represents local (in the tree) connectivity, whereas the off-diagonal entries (roughly) represent long-range links.\label{fig:sim}}
\end{figure}

\subsubsection{Comments on results}

Firstly, consider the success probability as a function of $\alpha$. We find that the success probability peaks for $\alpha \approx 1$. This phenomenon can be understood as follows:
\begin{enumerate}
	\item Case $\alpha \to 0$: In this limit, the edges in the graph are basically random and do not ``respect'' the tree organization that underlies the construction of the graph. The randomness is evidenced in the bottom panel of Fig.~\ref{fig:sim} for $\alpha=0.1$ where the adjacency matrix has uniformly distributed nonzero entries. So we are trying to search in a random graph, which we showed in lecture is not efficient.
	\item Case $\alpha \to \infty$: In this limit, the edges strongly prefer to stay within the local structure as defined by the tree organization, and there are few long-range edges. The strong locality makes it difficult to find paths between arbitrary leaves of the tree.
	\item Case $\alpha \approx 1$: In this case, there is both local structure \emph{and} long-range edges to other parts of the tree organization. (See the $\alpha = 1.0$ adjacency matrix in Fig.~\ref{fig:sim}.) So, this case is most like the ``small-world network'' and is optimally (relatively speaking) searchable.
\end{enumerate}

Secondly, we consider the average path length of successful searches.
\begin{enumerate}
	\item Case $\alpha \to 0$: In this limit, the edges are basically randomly distributed (ignoring the tree organization). In this case, a successful search consists of short paths that, by chance, connect $s$ to $t$.
	\item Case $\alpha \to \infty$: In this limit, successful searches are when $s$ and $t$ are chosen (by chance) to be very near one another in the tree organization. Given the strong local connectivity for such values of $\alpha$, it becomes likely that $s$ and $t$ are directly connected.
	\item Case $\alpha \approx 1$: In this case, the graph has both local structure and long-range links. So, we can obtain nontrivial path lengths by our heuristic, which seeks to decrease the tree distance at each iteration.
\end{enumerate}

\includepdf[pages=-,nup=1x2,landscape=true,]{p3.pdf}

\end{document}