CS 246: Mining Massive Data Sets

February 28, 2026 · 46 min · Shurui Liu

Association Rule Discovery: Market-basket model

Definitions

item, basket (a set of items). We are given the set $\mathbf{Items}$ of all items, and the set of all baskets $\mathbf{Baskets}=\{\mathbf{B}_i\}\subset\operatorname{Power}(\mathbf{Items})$.
frequent itemsets:
- support for itemset $I$ is the number of baskets containing $I$, i.e. $$\textrm{supp}(I):=|\{\mathbf{B}\in \mathbf{Baskets}:\mathbf{B}\supseteq I\}|.$$ Support is monotonically decreasing in the itemset: $I\subseteq J\Rightarrow \textrm{supp}(I)\geq \textrm{supp}(J).$
- given a support threshold $s$, frequent itemsets are $\{I\subseteq \mathbf{Items}:\textrm{supp}(I)\geq s\}.$
association rules:
- If-then rule $I\rightarrow j$: if a basket contains $I$, then it is likely to contain $j$.
- support of this rule is the support of the combined itemset, $\textrm{supp}(I\cup\{j\})$: the number of baskets that contain all of $I$ together with $j$.
- confidence: $\textrm{conf}(I\rightarrow j):=\frac{\textrm{supp}(I\cup\{j\})}{\textrm{supp}(I)}.$ This estimates $\textrm{Pr}(j|I)=\frac{\textrm{Pr}(I,j)}{\textrm{Pr}(I)}.$
- interest value: $\mathrm{Interest}(I\rightarrow j):= |\textrm{conf}(I\rightarrow j)-\textrm{Pr}(j)| = |\textrm{Pr}(j|I) - \textrm{Pr}(j)|,$ where $\mathrm{Pr}(j):=\frac{\textrm{supp}(j)}{|\mathbf{Baskets}|}.$
- interesting rules are those with high interest values (usually above 0.5).
- absolute value to capture both positive and negative association rules.
Association rule mining: find all association rules with support $\geq s$ and confidence $\geq c$.
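
The definitions above can be checked on a toy data set. The following is a minimal sketch; the function names and the eight example baskets are illustrative assumptions, not part of the course material.

```python
def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(lhs, j, baskets):
    """conf(I -> j) = supp(I ∪ {j}) / supp(I)."""
    return support(set(lhs) | {j}, baskets) / support(lhs, baskets)

def interest(lhs, j, baskets):
    """|conf(I -> j) - Pr(j)| with Pr(j) = supp({j}) / number of baskets."""
    return abs(confidence(lhs, j, baskets) - support({j}, baskets) / len(baskets))

# Hypothetical example baskets.
baskets = [frozenset(b) for b in (
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "coke", "beer", "juice"},
    {"coke", "beer", "juice"},
    {"beer", "coke"},
)]

print(confidence({"milk", "beer"}, "coke", baskets))  # 2 of the 4 {milk, beer} baskets contain coke
print(interest({"milk", "beer"}, "coke", baskets))
```

Here the rule {milk, beer} → coke has confidence 0.5 while coke appears in 5 of 8 baskets, so its interest is low and the rule is not interesting.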

Algorithms

Overview

Step 1: find all frequent itemsets $I$.
Step 2: rule generation
- For every nonempty proper subset $A$ of $I$, generate the rule $A\rightarrow I\backslash A$.
- observation: if $A,B,C\rightarrow D$ is below the confidence threshold, then so is $A,B\rightarrow C,D.$
Output the generated rules whose confidence is at least the threshold $c$. To compact the output, we can report only maximal frequent itemsets (no immediate superset is frequent) or closed itemsets (no immediate superset has the same support).

Step 1 is the hard part, and in practice the most expensive case is usually finding frequent pairs ($|I| =2$).

Finding frequent pairs

Naive approach

Read the file once, counting the occurrences of each pair in main memory. This fails if the number of possible pairs (on the order of $|\mathbf{Items}|^2$) exceeds main memory. Two ways to store the counts:
- triangular matrix: update $\textrm{count}[i][j]$ for $i<j$, with one counter per possible pair.
- triples: store $(i,j,\textrm{count})$ only for pairs that actually occur. This beats the triangular matrix if fewer than $1/3$ of the possible pairs occur.

A-priori algorithm: two-pass approach

Pass 1: read baskets and count occurrences of each individual item. So after pass 1, we know $\textrm{supp}(i)$ for $i\in \mathbf{Items}.$
Pass 2: read baskets again and count those pairs $(i,j)$ where both $i,j$ are frequent, using triangular matrix.
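
A minimal sketch of the two passes for pairs. It counts pairs in a dictionary (the "triples" layout) rather than a triangular matrix, which is a simplification; `apriori_pairs` and the example baskets are hypothetical names and data.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Return {pair: support} for all pairs with support >= s."""
    # Pass 1: count individual items and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, count in item_counts.items() if count >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= s}

example_baskets = [
    {"milk", "coke", "beer"}, {"milk", "pepsi", "juice"}, {"milk", "beer"},
    {"coke", "juice"}, {"milk", "pepsi", "beer"},
    {"milk", "coke", "beer", "juice"}, {"coke", "beer", "juice"}, {"beer", "coke"},
]
print(apriori_pairs(example_baskets, s=3))
```

Because pepsi is infrequent at $s=3$, no pair containing it is ever counted in the second pass.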

PCY algorithm for frequent pairs:

Uses the memory left idle in the first pass of A-priori. Maintain a hash table with as many buckets as fit in memory.

Pass 1:

For (each basket B):
    For (each item i in B):
        count[i]+=1
    For (each pair (i,j) in B):
        hash (i,j) to a bucket b;
        count_bucket[b]+=1

Between passes, replace the bucket counts with a bit-vector: bit $b$ is $1$ if bucket $b$ is frequent (its count is at least $s$), and $0$ otherwise.
Pass 2: count only candidate pairs $(i,j)$, i.e. those where $i$ and $j$ are both frequent and the bucket $b$ to which $(i,j)$ hashes is frequent.
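
A runnable sketch of both PCY passes. The bucket hash (`zlib.crc32` of the pair's text form), the bucket count and the example data are assumptions chosen for reproducibility; any hash function works.

```python
import zlib
from collections import Counter
from itertools import combinations

def bucket_of(pair, n_buckets):
    """Deterministic bucket index for a pair (illustrative hash choice)."""
    return zlib.crc32(repr(pair).encode()) % n_buckets

def pcy_pairs(baskets, s, n_buckets=101):
    # Pass 1: count items, and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [count >= s for count in bucket_counts]  # one bit per bucket

    # Pass 2: count a pair only if both items and its bucket are frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket_of(pair, n_buckets)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

demo_baskets = [
    {"milk", "coke", "beer"}, {"milk", "pepsi", "juice"}, {"milk", "beer"},
    {"coke", "juice"}, {"milk", "pepsi", "beer"},
    {"milk", "coke", "beer", "juice"}, {"coke", "beer", "juice"}, {"beer", "coke"},
]
print(pcy_pairs(demo_baskets, s=3))
```

The bucket filter can only remove infrequent pairs: a frequent pair always lands in a frequent bucket, so the result matches plain A-priori for any number of buckets.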

k passes for size $k$

A-priori and PCY generalize to itemsets of size $k$ by using $k$ passes: pass $k$ counts only those candidates all of whose $(k-1)$-subsets were found frequent in pass $k-1$.

Frequent itemsets in $\leq2$ passes for all sizes

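Plain random sampling may miss some frequent itemsets. SON and Toivonen's algorithm avoid this, the latter at the cost of an occasional restart.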

Random Sampling

Take a random sample of the market baskets. Reduce support threshold proportionally: if your sample is $1/100$ of the baskets, use $s/100$ as your support threshold instead of $s$.

SON algorithm

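1st pass: Repeatedly read small chunks of the baskets into main memory and run an in-memory algorithm to find all itemsets that are frequent in the chunk, using support threshold $\frac{|\textrm{chunk}|}{|\textrm{whole file}|}\cdot s.$ An itemset becomes a candidate if it is frequent in at least one chunk.
2nd pass: count all the candidate itemsets over the whole file and keep those with support $\geq s$.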

Toivonen's algorithm

Pass 1: Start with a random sample, but lower the threshold slightly for the sample. To the itemsets that are frequent in the sample, add the negative border (not frequent in the sample, but all its immediate subsets are).
Pass 2: Count all candidate itemsets and the itemsets in the negative border over the whole file.
If no itemset from the negative border turns out to be frequent, then we found all the frequent itemsets, otherwise, start over with another sample.
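
The negative border can be computed directly from the itemsets found frequent in the sample. A minimal sketch, treating the empty set as frequent so that infrequent single items qualify; `negative_border` and the example itemsets are hypothetical.

```python
def negative_border(frequent, items):
    """Itemsets that are not frequent but whose immediate subsets all are."""
    frequent = {frozenset(f) for f in frequent} | {frozenset()}
    border = set()
    for itemset in frequent:
        for item in items:
            if item in itemset:
                continue
            candidate = itemset | {item}
            if candidate in frequent:
                continue
            # Every subset obtained by removing one item must be frequent.
            if all(candidate - {x} in frequent for x in candidate):
                border.add(candidate)
    return border

sample_frequent = [{"A"}, {"B"}, {"C"}, {"D"}, {"B", "C"}, {"C", "D"}]
print(sorted("".join(sorted(b)) for b in negative_border(sample_frequent, "ABCDE")))
```

In this example the border consists of {E}, {A,B}, {A,C}, {A,D} and {B,D}; {B,C,D} is excluded because its subset {B,D} is not frequent.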

Recommendation Systems

Problem Setup

Given customers $X$, items $S$, and a utility function $u: X\times S\to\mathbb{R}$ (e.g. 1--5 star ratings), the utility matrix records known ratings (most entries blank). Three problems: (1) gather known ratings; (2) extrapolate unknown ratings; (3) evaluate extrapolation.

Key challenge: the matrix is sparse + cold-start (new items or users have no history).

Content-Based Filtering

Pipeline

Build an item profile (feature vector) for each item. For text: use the words with highest TF-IDF scores.
Build a user profile as a weighted average of the item profiles of items the user rated highly.
Recommend items closest (by cosine similarity) to the user profile.

TF-IDF

Let $f_{ij}$ = frequency of term $i$ in item $j$, $n_i$ = number of items containing term $i$, $N$ = total items.

$$ TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad IDF_i = \log\frac{N}{n_i}, \qquad w_{ij} = TF_{ij}\times IDF_i. $$

High TF-IDF $\Rightarrow$ term is frequent in this item but rare globally.
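
A small sketch of the weights defined above. The natural logarithm is an assumption (the base only rescales all weights), and `tf_idf` with its token-list input is a hypothetical interface.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # n_i: number of documents containing term i.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values())  # max_k f_kj
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["a", "a", "b"], ["a", "c"]]
print(tf_idf(docs))
```

The term "a" occurs in every document, so its IDF and hence its weight is zero, while "b" and "c" receive positive weights.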

Prediction

$$ u(x, i) = \cos(x, i) = \frac{x\cdot i}{\|x\|\,\|i\|}. $$

Use LSH to find items closest to user profile $x$ efficiently.

Pros / Cons

Pros: no data from other users needed; can recommend new/unpopular items; no item cold-start; explanations available.
Cons: hard to extract features for non-text items; user cold-start; over-specialisation (never discovers items outside the user's profile).

Collaborative Filtering

Uses the utility matrix directly (no content features). Two flavours: user--user and item--item.

User Similarity Measures

Let $S_{xy}$ = items rated by both users $x$ and $y$, $\bar{r}_x$ = mean rating of user $x$.

Jaccard: $\mathrm{sim}(x,y)=|r_x\cap r_y|/|r_x\cup r_y|$. Ignores rating values.
Cosine: $\mathrm{sim}(x,y)=r_x\cdot r_y/(\|r_x\|\,\|r_y\|)$. Treats missing entries as 0 (negative bias).
Pearson correlation (preferred):

$$ \mathrm{sim}(x,y) = \frac{\displaystyle\sum_{s\in S_{xy}}(r_{xs}-\bar{r}_x)(r_{ys}-\bar{r}_y)} {\displaystyle\sqrt{\sum_{s\in S_{xy}}(r_{xs}-\bar{r}_x)^2}\cdot \sqrt{\sum_{s\in S_{xy}}(r_{ys}-\bar{r}_y)^2}}. $$

Mean-centering removes user-level bias; sum only over co-rated items. (Cosine on mean-centered data equals Pearson.)
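
A minimal sketch of the Pearson similarity between two users, each given as a hypothetical `{item: rating}` dictionary. Returning 0 when there is no overlap or no variance is an assumption made here to avoid division by zero.

```python
import math

def pearson(rx, ry):
    """Pearson correlation over the items rated by both users."""
    common = rx.keys() & ry.keys()
    if not common:
        return 0.0
    mean_x = sum(rx.values()) / len(rx)  # mean over all of x's ratings
    mean_y = sum(ry.values()) / len(ry)
    num = sum((rx[s] - mean_x) * (ry[s] - mean_y) for s in common)
    den = (math.sqrt(sum((rx[s] - mean_x) ** 2 for s in common))
           * math.sqrt(sum((ry[s] - mean_y) ** 2 for s in common)))
    return num / den if den else 0.0

alice = {"a": 5, "b": 3, "c": 1}
bob = {"a": 4, "b": 3, "c": 2}
carol = {"a": 1, "b": 3, "c": 5}
print(pearson(alice, bob), pearson(alice, carol))
```

Alice and Bob rank the items the same way and are perfectly correlated even though their rating scales differ; Alice and Carol have opposite tastes.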

User--User CF: Rating Prediction

Let $N$ = $k$ users most similar to $x$ who rated item $i$.

$$ r_{xi} = \frac{\sum_{y\in N} s_{xy}\cdot r_{yi}}{\sum_{y\in N} s_{xy}}, \qquad s_{xy}=\mathrm{sim}(x,y). $$

Item--Item CF: Rating Prediction

Let $N(i;x)$ = items most similar to $i$ that were also rated by $x$.

$$ r_{xi} = \frac{\sum_{j\in N(i;x)} s_{ij}\cdot r_{xj}}{\sum_{j\in N(i;x)} s_{ij}}. $$

Empirical finding: item--item CF usually outperforms user--user CF because items are simpler entities and item neighbourhoods are more stable over time.
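
The item--item prediction formula can be sketched as follows. The neighbourhood is taken to be the $k$ most similar items the user has rated, restricted to positive similarities; that restriction, the function name and the example numbers are assumptions for illustration.

```python
def predict_item_item(user_ratings, sim_to_target, k=2):
    """Weighted average of the user's ratings on the k items most similar to the target.

    user_ratings: {item: rating} for one user.
    sim_to_target: {item: similarity between that item and the target item}.
    """
    neighbours = sorted(
        (j for j in user_ratings if sim_to_target.get(j, 0.0) > 0),
        key=lambda j: sim_to_target[j],
        reverse=True,
    )[:k]
    if not neighbours:
        return None  # no usable neighbour: fall back to a baseline elsewhere
    num = sum(sim_to_target[j] * user_ratings[j] for j in neighbours)
    den = sum(sim_to_target[j] for j in neighbours)
    return num / den

ratings_of_user = {"m2": 5, "m3": 2, "m6": 3}
similarity_to_m1 = {"m2": -0.18, "m3": 0.41, "m6": 0.59}
print(predict_item_item(ratings_of_user, similarity_to_m1))
```

With these numbers the prediction is $(0.41\cdot 2 + 0.59\cdot 3)/(0.41+0.59) = 2.59$; the negatively correlated item is ignored.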

Baseline-Adjusted Prediction

$$ r_{xi} = b_{xi} + \frac{\sum_{j\in N(i;x)} s_{ij}\cdot(r_{xj}-b_{xj})}{\sum_{j\in N(i;x)} s_{ij}}, \qquad b_{xi} = \mu + b_x + b_i, $$

where $\mu$ = global mean rating, $b_x$ = user deviation, $b_i$ = item deviation.

Problems with Basic CF

Similarity $s_{ij}$ is arbitrary (Pearson, cosine, etc.) and not optimized.
Pairwise similarities neglect interdependencies among items in $N(i;x)$.
The normalized weighted average constrains the model unnecessarily ($\sum w = 1$, same sign as similarities).

Complexity & Scalability

Finding $k$ nearest users costs $O(|X|)$ per query --- too slow at runtime. Solutions: pre-compute similarities offline; LSH for approximate neighbours; cluster users before search; dimensionality reduction.

Pros / Cons of Collaborative Filtering

Pros: works for any item type; no feature engineering.
Cons: cold start (new users/items); sparsity; first-rater problem; popularity bias (over-recommends popular items).

Hybrid Methods

Combine content-based and collaborative filtering: use item profiles to handle the new-item cold start, and user demographics to handle the new-user cold start; combine predictions via a linear model.

Evaluation

Split the utility matrix into train / test.

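RMSE: $\displaystyle\sqrt{\frac{1}{N}\sum_{x,i}(r_{xi}-r^*_{xi})^2}$, where the sum runs over the $N$ test ratings, $r_{xi}$ is the predicted rating and $r^*_{xi}$ the true rating.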
Precision @ 10: fraction of relevant items in top-10 recommendations.
Coverage / ROC: fraction of user--item pairs predictable; false-positive vs. true-positive tradeoff.

Note: RMSE penalises all rating levels equally, but in practice only high ratings matter.

Latent Factor Models (Matrix Factorization)

Core Idea

Map both users and items to a shared $k$-dimensional latent space. Factorize the ratings matrix $R$ (items $\times$ users) as

$$ R \approx Q\,P^{\top}, $$

where $q_i\in\mathbb{R}^k$ is the latent vector for item $i$ and $p_x\in\mathbb{R}^k$ is the latent vector for user $x$. Predicted rating:

$$ \hat{r}_{xi} = q_i \cdot p_x. $$

Why Not Standard SVD?

Standard SVD is not defined when entries are missing (blanks are unknown ratings, not zeros). Instead, minimize SSE only over known entries:

$$ \min_{P,Q} \sum_{(i,x)\in R}(r_{xi} - q_i \cdot p_x)^2. $$

$P$ and $Q$ need not be orthonormal.

Regularized Objective (prevent overfitting)

Large $k$ overfits. Add L2 penalty:

$$ \min_{P,Q} \sum_{(i,x)\in R}(r_{xi} - q_i\cdot p_x)^2 + \lambda_1\sum_x \|p_x\|^2 + \lambda_2\sum_i \|q_i\|^2. $$

Regularization shrinks factor vectors toward zero where data is scarce.

Full Model with Biases

$$ \hat{r}_{xi} = \mu + b_x + b_i + q_i\cdot p_x, $$

where $\mu$ = global mean, $b_x$ = user bias, $b_i$ = item bias. Full regularized objective:

$$ \min_{Q,P,b}\sum_{(x,i)\in R}(r_{xi}-\mu-b_x-b_i-q_i\cdot p_x)^2 + \lambda_1\sum_i\|q_i\|^2 + \lambda_2\sum_x\|p_x\|^2 + \lambda_3\sum_x b_x^2 + \lambda_4\sum_i b_i^2. $$

BellKor Multi-Scale Model

The winning Netflix Prize approach decomposes predictions into three levels:

Global (biases): $b_{xi} = \mu + b_x + b_i$. Example: global mean $\mu=3.7$, the movie is rated $0.5$ above average, and the user rates $0.2$ below average $\Rightarrow$ baseline $= 3.7 + 0.5 - 0.2 = 4.0$.
Regional (latent factors): matrix factorization $q_i\cdot p_x$ capturing latent user--item affinity.
Local (CF/NN): nearest-neighbour adjustment using ratings of similar items/users to fine-tune the prediction.

Learned Interpolation Weights (CF++)

Replace fixed similarity $s_{ij}$ with learned weights $w_{ij}$ (not constrained to sum to 1; can be negative; captures item interdependencies):

$$ \hat{r}_{xi} = b_{xi} + \sum_{j\in N(i;x)} w_{ij}(r_{xj}-b_{xj}). $$

Objective (minimize SSE over training ratings):

$$ J(w) = \sum_{(x,i)\in R} \Bigl(r_{xi} - b_{xi} - \sum_{j\in N(i;x)} w_{ij}(r_{xj}-b_{xj})\Bigr)^2. $$

Gradient w.r.t. $w_{ij}$ (for $j\in N(i;x)$; zero otherwise):

$$ \frac{\partial J}{\partial w_{ij}} = -2\sum_{(x,i)\in R} \Bigl(r_{xi} - b_{xi} - \sum_{k\in N(i;x)} w_{ik}(r_{xk}-b_{xk})\Bigr) (r_{xj}-b_{xj}). $$

Gradient Descent update (fix item $i$; iterate over all its ratings):

$$ w_{\text{new}} \leftarrow w_{\text{old}} - \eta\,\nabla_w J, \quad\text{repeat until } |w_{\text{new}} - w_{\text{old}}| < \varepsilon. $$

Optimization: Stochastic Gradient Descent (SGD)

Full Gradient Descent (slow)

$$ Q \leftarrow Q - \eta\,\nabla_Q J, \qquad P \leftarrow P - \eta\,\nabla_P J. $$

Requires scanning all training ratings per step --- too slow at scale.

SGD (preferred)

The full gradient for one entry $q_{if}$ is a sum over all ratings of item $i$:

$$ \frac{\partial J}{\partial q_{if}} = \sum_{x:\,(x,i)\in R} \Bigl[-2(r_{xi}-q_i\cdot p_x)\,p_{xf} + 2\lambda_2\,q_{if}\Bigr]. $$

SGD approximates this with a single rating $(x,i)$ at a time. Update after each rating:

Compute error: $\varepsilon_{xi} = 2(r_{xi} - q_i\cdot p_x)$.
Update item vector: $q_i \leftarrow q_i + \mu_1(\varepsilon_{xi}\,p_x - 2\lambda_2\,q_i)$.
Update user vector: $p_x \leftarrow p_x + \mu_2(\varepsilon_{xi}\,q_i - 2\lambda_1\,p_x)$.
Here $\mu_1$ and $\mu_2$ are learning rates, distinct from the global mean $\mu$.

Repeat sweeps over all training ratings until convergence. Each step is $O(k)$ vs. $O(k|R|)$ for full GD. Initialize $P$, $Q$ with standard SVD (treat missing entries as 0).
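
A compact sketch of the SGD updates above in plain Python. For brevity it uses one learning rate `lr` and one regularization weight `lam` for both factor matrices, and small random initial values instead of an SVD-based start; all names and hyperparameter values are illustrative assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sse(ratings, P, Q):
    """Sum of squared errors over the known ratings (user x, item i, rating r)."""
    return sum((r - dot(Q[i], P[x])) ** 2 for x, i, r in ratings)

def sgd_factorize(ratings, n_users, n_items, k=2, lr=0.01, lam=0.05,
                  epochs=300, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the ratings in random order each sweep
        for x, i, r in order:
            err = 2 * (r - dot(Q[i], P[x]))
            for f in range(k):
                q_f, p_f = Q[i][f], P[x][f]  # use the old values for both updates
                Q[i][f] += lr * (err * p_f - 2 * lam * q_f)
                P[x][f] += lr * (err * q_f - 2 * lam * p_f)
    return P, Q

data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
P, Q = sgd_factorize(data, n_users=3, n_items=3)
print(sse(data, P, Q))
```

Each inner update touches only the $k$ entries of one item vector and one user vector, which is where the $O(k)$ cost per step comes from.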

Temporal Dynamics

User tastes and movie popularities drift over time. Extend biases to be time-dependent:

$$ \hat{r}_{xi} = \mu + b_x(t) + b_i(t) + q_i\cdot p_x, \qquad b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}, $$

where each bin covers $\approx 10$ consecutive weeks. User preference vectors $p_x(t)$ can similarly be made time-dependent.

Netflix Prize Performance Summary

| Method | RMSE |
| --- | --- |
| Global average | 1.130 |
| User average | 1.065 |
| Movie average | 1.053 |
| Netflix baseline | 0.951 |
| Basic CF | 0.940 |
| CF + biases + learned weights | 0.910 |
| Latent factors | 0.900 |
| Latent factors + biases | 0.890 |
| Latent factors + biases + temporal | 0.876 |
| Grand Prize (BellKor ensemble) | 0.856 |

High Dimensional Data

Locality-sensitive hashing (LSH)

Problem: given $q$, find data points $x_j$, s.t. $d(q,x_j)\leq s$. Naive solution $O(N)$ time, LSH gives $O(1)$.
Problem: given $s>0$, find all $(x_i,x_j)$, such that $d(x_i,x_j)<s$. Naive solution $O(N^2)$, LSH gives $O(N)$.

Finding similar documents

Shingling - min hashing - LSH

After shingling, the goal becomes: find similar columns in large sparse matrices with high Jaccard similarity

Shingling

Shingling: $D$ turns into $S(D):=$ the set of sequences of $k$ consecutive tokens ($k$-shingles) in $D$.
Jaccard similarity $\textrm{sim}(D_1,D_2) = \frac{|S(D_1)\cap S(D_2)|}{|S(D_1)\cup S(D_2)|}$, and Jaccard distance $d(C_1,C_2):= 1- \frac{|C_1\cap C_2|}{|C_1\cup C_2|}$, where $C_i := S(D_i)$.
Sets into boolean Matrix: $c_ {ij} = \mathbf{1}[\textrm{shingle }i\textrm{ in Set(}\text{Document}_j\text{)}]$

Min-Hashing: Convert large sets to short signatures, while preserving similarity

minhash function: Permute the rows of the Boolean matrix using some permutation $\pi_ {k}$, $h_\pi(C_j):= \min\{\pi(i):c_ {ij}=1\}$. Choose $K$ permutations $\pi_k$ to create a signature for each column (a vector $(h_ {\pi_k}(C))_ {1\leq k\leq K}$) and signature matrix $ M = (h_ {\pi_k}(C_j))_ {1\leq k\leq K,j}$ with fewer rows.

$\mathrm{Pr}[h_ \pi(C_ 1) = h_ \pi(C_ 2)] = \textrm{sim}(C_ 1, C_ 2)$. The similarity of two signatures is the fraction of the hash functions in which they agree. Therefore, $\mathbb{E}(\textrm{sim}(\text{sig}(C_1),\text{sig}(C_2))) = \textrm{sim}(C_1,C_2).$
Implementation: instead of actually permuting rows of data, use row hashing to simulate, i.e. a permutation $\pi_i$ will be replaced by a hash function $h_i = ((ax+b) \mod p) \mod N$, where $a,b$ are random integers, prime $p>N$.
1. Initialize: Set all slots in the Signature Matrix to infinity: $M(i, c) = \infty$ (for hash function $i$, column/document $c$).
2. Scan: Read the original matrix row by row.
3. Update: If row $j$ has a 1 in column $c$:
  - Compute $h_i(j)$ for all $K$ hash functions.
  - Apply update rule: $M(i, c) \leftarrow \min(M(i, c), h_i(j))$
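
The row-scanning procedure above, as a minimal Python sketch. Columns are given as sets of row indices that hold a 1; the prime $2^{31}-1$, the number of hash functions and the function names are assumptions. Because the hashes are reduced modulo the number of rows, distinct rows can collide, so signature agreement only approximates the Jaccard similarity.

```python
import random

def minhash_signatures(columns, n_rows, n_hashes=20, seed=0, prime=2_147_483_647):
    """Build the signature matrix M (n_hashes rows, one column per input set)."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n_hashes)]
    sig = [[float("inf")] * len(columns) for _ in range(n_hashes)]
    for row in range(n_rows):
        # h_i(row) = ((a*row + b) mod p) mod N for every hash function i.
        hashes = [((a * row + b) % prime) % n_rows for a, b in coeffs]
        for c, column in enumerate(columns):
            if row in column:
                for i, h in enumerate(hashes):
                    sig[i][c] = min(sig[i][c], h)
    return sig

def signature_similarity(sig, c1, c2):
    """Fraction of hash functions on which two columns agree."""
    return sum(1 for row in sig if row[c1] == row[c2]) / len(sig)

cols = [{0, 2, 5}, {0, 2, 5}, {1, 3, 4}]
sig = minhash_signatures(cols, n_rows=6)
print(signature_similarity(sig, 0, 1), signature_similarity(sig, 0, 2))
```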

LSH

Divide signature matrix $M$ into $b$ bands with each band having $r$ rows, and candidate column pairs are those that hash to the same bucket for $\geq 1$ band.

Need to tune $b$, $r$, $K=b*r$ to balance false positive and false negative.

A second pass is needed to verify that candidate pairs sharing a bucket really have similar signatures, and possibly a further pass to verify that similar signatures correspond to similar documents.

Assume $C_1$ and $C_2$ have similarity $t$. Then $\mathrm{Pr}(\textrm{no band identical}) = (1-t^r)^b$, so the probability that at least one band is identical (the pair becomes a candidate) is $1-(1-t^r)^b.$
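
To see how $b$ and $r$ shape this probability, the formula can be evaluated directly. The sketch below also computes $(1/b)^{1/r}$, a commonly used approximation for the similarity at which the curve rises most steeply; the function names and the choice $r=5$, $b=20$ are illustrative.

```python
def candidate_probability(t, r, b):
    """Probability that two columns with similarity t share at least one band."""
    return 1 - (1 - t ** r) ** b

def approx_threshold(r, b):
    """Rough location of the steep part of the S-curve."""
    return (1 / b) ** (1 / r)

for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(candidate_probability(t, r=5, b=20), 4))
print(round(approx_threshold(r=5, b=20), 3))
```

With 100 hash functions split into 20 bands of 5 rows, pairs at similarity 0.8 are almost always candidates, while pairs at 0.2 become candidates less than 1% of the time.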

General metric space case

1. Locality-Sensitive (LS) Families A hash family is $(d_1, d_2, p_1, p_2)$-sensitive if for points $x, y$:

If $d(x,y) \le d_1 \rightarrow Pr[h(x)=h(y)] \ge p_1$
If $d(x,y) \ge d_2 \rightarrow Pr[h(x)=h(y)] \le p_2$

2. Amplification (The S-Curve) Goal: Create a step-function probability around a similarity threshold $t$.

AND ($r$ functions/rows): Match requires all $r$ to match. $Pr \rightarrow p^r$. Lowers both false positives and true positives.
OR ($b$ functions/bands): Match requires $\ge 1$ of $b$ to match. $Pr \rightarrow 1 - (1-p)^b$. Raises both true positives and false positives.
AND-OR Cascade: $Pr = 1 - (1-s^r)^b$. Standard LSH technique.
OR-AND Cascade: $Pr = (1 - (1-s)^b)^r$.

3. Specific Distance Metrics

Jaccard: Min-Hashing. $Pr[match] = 1 - d(x,y)$. Forms a $(d_1, d_2, 1-d_1, 1-d_2)$-sensitive family.
Cosine: Random Hyperplanes. $h_v(x) = +1$ if $v \cdot x \ge 0$, else $-1$. $Pr[match] = 1 - d(x,y)/\pi$, where $d(x,y)$ is the angle between $x$ and $y$ in radians.
Euclidean: Random Projections. Project points onto random lines partitioned into buckets of width $a$. Yields a $(a/2, 2a, 1/2, 1/3)$-sensitive family.

Clustering

Problem: Given a set of points in a metric space, group them into clusters. Some points may be outliers.

Cluster Representatives

Euclidean space: use centroid --- the average of all points in the cluster. This may be an artificial point not in the data.
Non-Euclidean space: use clustroid --- an existing data point that minimises some aggregate distance to all other cluster members:

$$ c^* = \argmin_{c\in C} \sum_{x\in C} d(x,c)^2. $$

Curse of Dimensionality

In $d$ dimensions, capturing a fraction $f$ of the data requires searching within radius $f^{1/d}$. As $d$ grows, almost all pairs of points become nearly equidistant, making meaningful clustering difficult.

Hierarchical (Agglomerative) Clustering

Algorithm

Initialise: each point is its own cluster.
Repeatedly find the two nearest clusters and merge them.
Stop when a criterion is met (number of clusters, diameter threshold, etc.).

The result can be visualised as a dendrogram with the merge distance on the $y$-axis.

Cluster Distance Definitions

Centroid / clustroid distance: distance between cluster representatives. Best for convex, well-separated clusters.
Single-link: $\min$ distance between any pair $(x\in C_1,\,y\in C_2)$. Can chain clusters together; handles concentric shapes.
Average-link: average distance over all pairs $(x\in C_1,\,y\in C_2)$.
Cohesion-based: merge the pair whose union is most cohesive (smallest diameter, smallest average distance, or highest density in the merged set).

$k$-Means Clustering

Objective

Given Euclidean space and a fixed $k$, minimise the sum of squared distances from each point to its nearest centroid:

$$ \min_{c_1,\ldots,c_k} \sum_i \min_j \|x_i - c_j\|^2. $$

Exact optimisation is NP-hard; Lloyd's algorithm finds an approximate solution.

Lloyd's Algorithm

1. Initialise: pick $k$ centroids (e.g. $k$ random data points).
2. Assignment: assign each point to its nearest centroid.
3. Update: recompute each centroid as the mean of its assigned points.
4. Repeat steps 2–3 until no point changes cluster.

Converges to a local optimum; quality depends heavily on initialisation.

$k$-Means++ Initialisation

To spread initial centres across the data:

Pick the first centre uniformly at random.
For each subsequent centre, select point $p$ with probability proportional to $D(p)^2$, where $D(p)$ is the distance from $p$ to the nearest already-chosen centre.
Repeat until $k$ centres are chosen.

This biases selection toward points far from existing centres, reducing worst-case behaviour.
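
A minimal sketch combining the $D(p)^2$ seeding with the assign/update loop of Lloyd's algorithm. It assumes points are tuples of floats and that $k$ does not exceed the number of distinct points; the function names and example data are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centres, each new one with probability proportional to D(p)^2."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sq_dist(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres

def lloyd(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centres = kmeans_pp_init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                          for p in points]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centres, assignment

blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = lloyd(blobs, k=2)
print(centres, labels)
```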

Choosing $k$: Elbow Method

Plot average distance to centroid vs. $k$. The optimal $k$ is at the "elbow" --- the point of diminishing returns where adding more clusters yields little improvement.

BFR Algorithm (Large-Scale $k$-Means)

Assumption: clusters are normally distributed (axis-aligned Gaussian ellipses) around centroids. Goal: process disk-resident data in $O(\text{clusters})$ memory.

Three Sets of Points

DS (Discard Set): points close enough to a centroid; summarised and discarded.
CS (Compressed Set): groups of points that are close together but not near any centroid; summarised but not yet assigned.
RS (Retained Set): isolated points stored as-is; waiting to join a CS or DS.

Cluster Summary Statistics

Each cluster (and CS group) is stored as $2d+1$ values, where $d$ = number of dimensions:

$N$: number of points.
$\mathbf{SUM}$: $d$-vector; $i$-th component $= \sum x_i$ over all points in the cluster.
$\mathbf{SUMSQ}$: $d$-vector; $i$-th component $= \sum x_i^2$.

Derived quantities:

$$ \text{centroid}_i = \frac{\mathbf{SUM}_i}{N}, \qquad \sigma_i = \sqrt{\frac{\mathbf{SUMSQ}_i}{N} - \left(\frac{\mathbf{SUM}_i}{N}\right)^2}. $$

Algorithm Steps (per batch)

Load a new batch of points from disk.
For each new point, if it is close enough to an existing centroid (by Mahalanobis distance), add it to that cluster's DS and update $N$, $\mathbf{SUM}$, $\mathbf{SUMSQ}$.
Cluster remaining points (plus old RS) using in-memory clustering. Subclusters $\to$ CS; isolated points $\to$ RS.
Optionally merge CS groups that are close enough to each other or to a DS centroid.

Final step: merge all CS groups and RS points into their nearest cluster.

Mahalanobis Distance

Standard Euclidean distance is inappropriate for axis-aligned elliptical clusters. The Mahalanobis distance from point $x$ to centroid $c$ (with per-dimension std. dev. $\sigma_i$) is:

$$ d_M(x,c) = \sqrt{\sum_{i=1}^d \left(\frac{x_i - c_i}{\sigma_i}\right)^2}. $$

For $d$-dimensional Gaussian clusters, approximately 68% of points satisfy $d_M < \sqrt{d}$.

Merging Two CS Groups

Compute the combined $N$, $\mathbf{SUM}$, $\mathbf{SUMSQ}$ from the two groups' statistics (no need to revisit raw points) and merge if the resulting variance is below a threshold.
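
The $(N, \mathbf{SUM}, \mathbf{SUMSQ})$ summary, the derived centroid and standard deviation, the Mahalanobis distance and the merge of two summaries fit in a few lines. This is a sketch under the assumption that every $\sigma_i$ is non-zero; the class and method names are hypothetical.

```python
import math

class ClusterSummary:
    """BFR-style summary of a set of d-dimensional points: N, SUM, SUMSQ."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def merge(self, other):
        """Summary of the union, computed without revisiting raw points."""
        merged = ClusterSummary(len(self.sum))
        merged.n = self.n + other.n
        merged.sum = [a + b for a, b in zip(self.sum, other.sum)]
        merged.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]
        return merged

    def centroid(self):
        return [s / self.n for s in self.sum]

    def std(self):
        return [math.sqrt(sq / self.n - (s / self.n) ** 2)
                for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        return math.sqrt(sum(((x - c) / sd) ** 2
                             for x, c, sd in zip(point, self.centroid(), self.std())))

a = ClusterSummary(2)
for p in [(1, 2), (3, 6)]:
    a.add(p)
b = ClusterSummary(2)
b.add((5, 10))
print(a.centroid(), a.std(), a.mahalanobis((3, 6)), a.merge(b).centroid())
```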

Dimensionality Reduction: SVD & CUR

Singular Value Decomposition (SVD)

Definition

Any real $m\times n$ matrix $A$ can be decomposed as

$$ A = U\,\Sigma\,V^{\top}, $$

where:

$U$ ($m\times r$): left singular vectors (column-orthonormal, $U^{\top}U=I$).
$\Sigma$ ($r\times r$): diagonal matrix of singular values $\sigma_1\geq\sigma_2\geq\cdots\geq 0$.
$V$ ($n\times r$): right singular vectors (column-orthonormal, $V^{\top}V=I$).
$r = \mathrm{rank}(A)$.

Equivalently, $A = \sum_i \sigma_i\, \mathbf{u}_i \mathbf{v}_i^{\top}$ (sum of rank-1 outer products).

Best Low-Rank Approximation

Let $B$ be the SVD of $A$ with all but the top-$k$ singular values set to zero. Then

$$ B = \argmin_{\mathrm{rank}(X)=k} \|A - X\|_F, $$

where $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ is the Frobenius norm. Setting $\sigma_i=0$ for $i>k$ zeroes out the corresponding rank-1 components.

Choosing the Rank $k$

Define the energy of a set of singular values as the sum of their squares. Keep the smallest $k$ such that the retained energy is $\geq 90\%$ of the total:

$$ \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^r \sigma_i^2} \geq 0.90. $$
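
Given the singular values, the energy rule is a short loop. The function name and the example values are illustrative assumptions.

```python
def choose_rank(singular_values, energy=0.90):
    """Smallest k whose top-k singular values retain the given share of the energy."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(s * s for s in ordered)
    running = 0.0
    for k, s in enumerate(ordered, start=1):
        running += s * s
        if running >= energy * total:
            return k
    return len(ordered)

print(choose_rank([12.4, 9.5, 1.3]))
```

For singular values 12.4, 9.5 and 1.3, the first value alone keeps about 63% of the energy and the first two keep over 99%, so $k=2$.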

Interpretation (Concept Space)

In a users-by-movies matrix: $U$ = user-to-concept loadings, $V$ = movie-to-concept loadings, $\Sigma$ = concept strengths. $U\Sigma$ gives coordinates of users projected onto the concept axes.

To map a query $q$ (a row vector in original space) into concept space:

$$ q_{\mathrm{concept}} = q\,V. $$

Two users with zero ratings in common can still have high similarity in concept space, capturing latent shared preferences.

SVD Drawbacks

Interpretability: singular vectors are dense linear combinations of all columns/rows.
Sparsity loss: even if $A$ is sparse, $U$ and $V$ are dense.

Computing SVD via Power Iteration

Power Iteration (largest eigenpair of symmetric $M$)

Start with any guess $x_0$.
Iterate: $x_{k+1} = M x_k / \|M x_k\|_F$ until convergence.
Eigenvalue: $\lambda = x^{\top} M x$.

Deflation (subsequent eigenpairs)

After finding eigenpair $(\lambda, x)$, eliminate its contribution and recurse:

$$ M^* := M - \lambda\, x\, x^{\top}, $$

then apply power iteration to $M^*$.
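
Power iteration and deflation for a small symmetric matrix, as a plain-Python sketch. The random starting vector is an assumption made so that the start is unlikely to be orthogonal to the eigenvector being sought; the function names, tolerances and example matrix are illustrative.

```python
import math
import random

def mat_vec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def power_iteration(M, iters=1000, tol=1e-12, seed=0):
    """Approximate the dominant eigenpair of a symmetric matrix M."""
    rng = random.Random(seed)
    x = [rng.random() + 0.1 for _ in M]
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    for _ in range(iters):
        y = mat_vec(M, x)
        norm = math.sqrt(sum(v * v for v in y))
        if norm == 0.0:
            break  # x lies in the null space of M
        y = [v / norm for v in y]
        converged = max(abs(a - b) for a, b in zip(x, y)) < tol
        x = y
        if converged:
            break
    eigenvalue = sum(a * b for a, b in zip(x, mat_vec(M, x)))  # x^T M x
    return eigenvalue, x

def deflate(M, eigenvalue, x):
    """Remove the contribution of one eigenpair: M - lambda * x x^T."""
    n = len(M)
    return [[M[i][j] - eigenvalue * x[i] * x[j] for j in range(n)] for i in range(n)]

M = [[2.0, 1.0], [1.0, 2.0]]
lam1, v1 = power_iteration(M)
lam2, v2 = power_iteration(deflate(M, lam1, v1))
print(lam1, lam2)
```

The example matrix has eigenvalues 3 and 1, which the two calls recover in order.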

Connection to SVD

$$ A^{\top}A = V\Sigma^2 V^{\top}, \qquad AA^{\top} = U\Sigma^2 U^{\top}. $$

So $V$ (and $\Sigma$) come from eigenpairs of $A^{\top}A$, and $U$ from $AA^{\top}$. Full SVD complexity: $O(nm^2)$ or $O(n^2 m)$ (whichever is smaller).

CUR Decomposition

Motivation

CUR preserves sparsity: it uses actual rows and columns of $A$ as basis vectors, so if $A$ is sparse, $C$ and $R$ are also sparse.

Structure

$$ A \approx C\,U\,R, $$

where:

$C$ ($m\times c$): $c$ randomly sampled columns of $A$.
$R$ ($r\times n$): $r$ randomly sampled rows of $A$.
$U$ ($c\times r$): pseudoinverse of the intersection $W$ of $C$ and $R$.

Computing $U$ (Pseudoinverse)

Let $W$ ($r\times c$) be the submatrix of $A$ at the intersection of the sampled rows and columns. Compute the SVD $W = X Z Y^{\top}$; then

$$ U = W^+ = Y\,Z^+\,X^{\top}, $$

where $Z^+$ is diagonal with $Z^+_{ii} = 1/Z_{ii}$ (reciprocals of non-zero singular values; zero if $Z_{ii}=0$).

Sampling Algorithm

Sample columns (and rows symmetrically) with probability proportional to their squared Frobenius norm (importance):

$$ P(\text{column }j) = \frac{\sum_i A_{ij}^2}{\sum_{i,j} A_{ij}^2}. $$

Normalize each sampled column $j$ by $\sqrt{c\cdot P(j)}$ to remove bias:

$$ C_{:,i} = \frac{A_{:,j}}{\sqrt{c\cdot P(j)}}. $$

The same column can be sampled more than once.
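
A sketch of the column-sampling step on a matrix stored as a list of rows. The function name, the seed and the example matrix are assumptions; row sampling is symmetric and omitted.

```python
import math
import random

def sample_columns(A, c, seed=0):
    """Sample c columns with probability proportional to squared norm, then rescale."""
    rng = random.Random(seed)
    n_cols = len(A[0])
    col_sq = [sum(row[j] ** 2 for row in A) for j in range(n_cols)]
    total = sum(col_sq)
    probs = [v / total for v in col_sq]
    # Sampling is with replacement, so a column may appear more than once.
    chosen = rng.choices(range(n_cols), weights=probs, k=c)
    C = [[row[j] / math.sqrt(c * probs[j]) for j in chosen] for row in A]
    return C, chosen, probs

A = [[1.0, 0.0, 2.0],
     [0.0, 0.0, 2.0],
     [1.0, 0.0, 0.0]]
C, chosen, probs = sample_columns(A, c=4)
print(chosen, probs)
```

After rescaling, every sampled column has the same squared norm $\|A\|_F^2/c$, and the all-zero column is never selected.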

Quality Guarantee

Sample $c = r = O(k\log k/\varepsilon^2)$ columns and rows. Then with probability $\geq 0.98$:

$$ \|A - CUR\|_F \leq (2+\varepsilon)\,\|A - A_k\|_F, $$

where $A_k$ is the best rank-$k$ SVD approximation. In practice, $4k$ columns/rows suffice.

SVD vs. CUR Summary

| Property | SVD | CUR |
| --- | --- | --- |
| Factor matrices | $U,V$: dense | $C,R$: sparse (actual rows/cols) |
| Middle matrix | $\Sigma$: sparse, small | $U$: dense, small |
| Optimality | Exact best rank-$k$ | $(2+\varepsilon)$ factor |
| Interpretability | Low (dense combos) | High (actual data rows/cols) |
| Sparsity preserved | No | Yes |

Graph Data

Link Analysis: PageRank

Core Idea

Model the web as a directed graph (nodes = pages, edges = hyperlinks). Not all pages are equally important; rank them by link structure.

Two intuitions:

Random surfer: PageRank of page $j$ = limiting probability that a random surfer (following out-links uniformly at random) is at $j$.
Recursive importance: A page is important if important pages link to it.

Flow Formulation

Define the rank $r_j$ of page $j$:

$$ r_j = \sum_{i\to j} \frac{r_i}{d_i}, $$

where $d_i$ is the out-degree of page $i$. Normalisation: $\sum_j r_j = 1$.

Matrix Formulation

Define the column-stochastic transition matrix $M$:

$$ M_{ji} = \frac{1}{d_i} \text{ if } i\to j, \qquad M_{ji} = 0 \text{ otherwise.} $$

Each column of $M$ sums to 1. The flow equations become:

$$ r = M\,r. $$

Thus $r$ is the principal eigenvector of $M$ (eigenvalue 1).

Power Iteration

Algorithm:

Initialise: $r^{(0)} = [1/N,\ldots,1/N]^\top$.
Iterate: $r^{(t+1)} = M\,r^{(t)}$, i.e. $r_j^{(t+1)} = \sum_{i\to j} r_i^{(t)} / d_i$.
Stop when $\|r^{(t+1)} - r^{(t)}\|_1 < \varepsilon$.

Typically $\approx 50$ iterations suffice. $r^{(t)}$ is the probability distribution over pages of a random surfer at time $t$; convergence gives the stationary distribution of the random walk.

Why it works: Write $r^{(0)} = \sum_k c_k x_k$ in the eigenbasis of $M$. Then $M^t r^{(0)} = \lambda_1^t\left[c_1 x_1 + \sum_{k\geq 2} c_k(\lambda_k/\lambda_1)^t x_k\right]$. Since $\lambda_1 > |\lambda_k|$ for $k\geq 2$, the ratio $(\lambda_k/\lambda_1)^t\to 0$, leaving only $x_1$ (the principal eigenvector).

Problems with Naive PageRank

Dead ends (pages with no out-links): the column of $M$ sums to 0; rank "leaks out" --- all scores converge to 0.
Spider traps (closed group of pages with no external out-links): the random walk gets stuck; all rank is absorbed by the trap.

Google PageRank with Teleportation

At each step the random surfer either:

With probability $\beta$ ($\approx 0.8$--$0.9$), follows a random out-link.
With probability $1-\beta$, teleports to a uniformly random page.

From a dead-end, always teleport (probability 1).

PageRank equation (Brin--Page, 1998):

$$ r_j = \sum_{i\to j}\beta\frac{r_i}{d_i} + \frac{1-\beta}{N}. $$

Google matrix:

$$ A = \beta M + (1-\beta)\frac{\mathbf{1}\mathbf{1}^\top}{N}, \qquad r = A\,r. $$

Efficient sparse form (avoids storing the dense $N\times N$ matrix):

$$ r = \beta M\,r + \frac{1-\beta}{N}\,\mathbf{1}. $$

In each iteration, compute $r' = \beta M r$ (sparse), then add $(1-S)/N$ to each entry, where $S = \sum_j r'_j$. The leaked mass $1-S$ covers both teleportation and the rank lost to dead ends; it equals exactly $1-\beta$ when there are no dead ends.

Complete PageRank Algorithm

Initialise r_j = 1/N  for all j
Repeat until  sum_j |r_j_new - r_j_old| < eps:
    r'_j = sum_{i->j}  beta * r_i_old / d_i   for all j
    S    = sum_j r'_j
    r_j  = r'_j + (1 - S) / N               // redistribute leaked rank
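
The same algorithm as a runnable Python sketch. The graph is a hypothetical `{page: [out-links]}` dictionary, and the default parameter values are assumptions.

```python
def pagerank(out_links, beta=0.85, eps=1e-10, max_iter=200):
    """PageRank with teleportation; leaked rank is redistributed uniformly."""
    nodes = sorted(set(out_links) | {v for targets in out_links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_rank = {v: 0.0 for v in nodes}
        for u, targets in out_links.items():
            if targets:  # dead ends contribute nothing here
                share = beta * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
        leaked = 1.0 - sum(new_rank.values())  # teleport mass plus dead-end loss
        new_rank = {v: r + leaked / n for v, r in new_rank.items()}
        diff = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if diff < eps:
            break
    return rank

# Small example with a spider trap at "m".
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
print(pagerank(graph, beta=0.8))
```

For this three-page graph with $\beta=0.8$ the fixed point is $r_y=7/33$, $r_a=5/33$, $r_m=21/33$: the trap still collects most of the rank, but no longer all of it.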

Undirected Graphs

For a connected undirected graph with $N$ nodes and $m$ edges, the random walk without teleportation has the exact stationary solution:

$$ r_v = \frac{d_v}{2m}, $$

i.e. rank is proportional to degree.

Scalable Computation

For $N=10^9$ pages, the rank vector needs $\approx 4$ GB; the sparse matrix $M$ ($\approx10$ links/node) needs $\approx 40$ GB on disk.

Basic update ($r^{\text{new}}$ fits in RAM): read $M$ and $r^{\text{old}}$ once per iteration. Cost: $2|r| + |M|$.
Block-based update ($k$ blocks of $r^{\text{new}}$ fit in RAM): scan $M$ once per block. Cost: $k|M| + (k+1)|r|$.
Block-stripe update (improved): partition $M$ into $k$ stripes, each stripe containing only the entries mapping into one block of $r^{\text{new}}$. Each block of $r^{\text{new}}$ requires only one stripe of $M$. Cost: $|M|(1+\varepsilon) + (k+1)|r|$ --- $M$ is read only once overall.

Extensions & Limitations

Topic-Specific (Personalized) PageRank: replace the uniform teleport distribution with a topic-specific distribution; gives topic-sensitive rankings.
TrustRank: teleport only to manually verified "seed" pages to combat link spam.
HITS (Hubs & Authorities): separate hub score ($h$) and authority score ($a$); $a_j = \sum_{i\to j} h_i$, $h_i = \sum_{j:i\to j} a_j$.
Limitation: standard PageRank measures generic popularity and is susceptible to link spam / artificial link topologies.

Topic-Specific PageRank, Web Spam & TrustRank

Topic-Specific (Personalized) PageRank

Motivation

Standard PageRank measures generic importance. Topic-Specific PageRank biases the random walk toward a topic by restricting teleportation to a small set $S$ of topic-relevant pages (the teleport set). Each choice of $S$ produces a different ranking vector $r_S$.

Matrix Formulation

$$ A_{ij} = \beta M_{ij} + \frac{1-\beta}{|S|}\,\mathbf{1}[i\in S]. $$

Pages outside $S$ receive no teleport probability. Computation is identical to standard PageRank (sparse $M$ plus a bias vector); Power Iteration still converges.

Choosing $S$

Use a topic taxonomy (e.g. the 16 DMOZ categories). Determine the relevant topic from: (1) explicit user selection; (2) query classification; (3) browsing history / bookmarks.

Random Walk with Restarts

Set $S = \{q\}$ for a single query node $q$. The stationary distribution then gives the proximity of every node to $q$: nearby nodes accumulate high visit probability.

A good proximity measure should capture: multiple connections, multiple paths, degree of intermediate nodes, and penalise long paths. Shortest-path and network-flow measures fail on at least one of these criteria; random walk with restarts satisfies all of them.

Web Spam

Term Spam (1st generation)

Early search engines ranked by keyword frequency. Spammers stuffed invisible text (same colour as background) or copied top-ranked pages verbatim. Defeat: use anchor text (what others say about the page) and PageRank (pages with no inbound links cannot rank highly).

Link Spam / Spam Farms (2nd generation)

Spammers build artificial link topologies to inflate PageRank of a target page $t$.

Link farm structure:

$M$ owned pages each link to $t$.
$t$ links back to all $M$ owned pages (recirculating rank).
Spammer also posts links to $t$ on accessible pages (blogs, forums).

Mathematical analysis. Let $N$ = total pages, $x$ = PageRank acquired from accessible pages, $y$ = PageRank of $t$. Each owned page has rank $\beta y/M + (1-\beta)/N$. Substituting into the flow equation for $t$ and discarding small terms:

$$ y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta}\cdot\frac{M}{N}. $$

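For $\beta=0.85$: $1/(1-\beta^2)\approx 3.6$ and $\beta/(1+\beta)\approx 0.46$. The first term amplifies the externally acquired rank $x$ by a fixed factor; the second grows linearly with $M$, so the spammer can make $y$ as large as desired by adding owned pages.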
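The closed form follows from writing out the flow equation for $t$. A sketch of the algebra, assuming each owned page links only to $t$ and treating $t$'s own teleport share $(1-\beta)/N$ as the small term that is discarded:

$$\begin{aligned} y &= x + \beta M\left[\frac{\beta y}{M} + \frac{1-\beta}{N}\right] + \frac{1-\beta}{N} \\ &\approx x + \beta^2 y + \beta(1-\beta)\frac{M}{N}, \\ y &= \frac{x}{1-\beta^2} + \frac{\beta(1-\beta)}{(1-\beta)(1+\beta)}\cdot\frac{M}{N} = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta}\cdot\frac{M}{N}. \end{aligned}$$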

TrustRank

Definition: Topic-Specific PageRank with teleport set = a small set of human-verified trusted pages.

Key principle: Good pages rarely link to spam; trust propagates through links but attenuates with distance (split equally across out-links, scaled by $\beta$).

Algorithm

Select seed set: pick top-$k$ pages by PageRank, or use controlled domains (e.g. .edu, .gov, .mil).
Human labelling: an oracle labels each seed page as trusted or spam.
Run Topic-Specific PageRank with teleport set = trusted seed pages.
Classify: pages whose TrustRank falls below a threshold are marked spam.

Trust propagation (formal)

Set trust of each trusted seed to 1. Page $p$ with trust $t_p$ and out-degree $|o_p|$ confers trust $\beta\,t_p/|o_p|$ to each out-neighbour. Trust is additive; within a scaling factor this equals PageRank with the trusted pages as the teleport set.

Spam Mass

Complementary view: estimate what fraction of a page's PageRank originates from spam.

Let $r_p$ = standard PageRank of page $p$, $r_p^+$ = TrustRank of $p$ (PageRank with trusted teleport set). Define:

$$ \mathrm{SpamMass}(p) = \frac{r_p - r_p^+}{r_p}. $$

Pages with high spam mass ($\approx 1$) receive nearly all their rank from spam sources and are classified as spam.

Community Detection

Graph Partitioning & Conductance

Goal

Divide an undirected graph $G=(V,E)$ into clusters (communities) that are internally dense and externally sparse.

Cut and Conductance

For a subset $A\subseteq V$:

Cut: $\mathrm{cut}(A) = |\{(i,j)\in E : i\in A,\, j\notin A\}|$.
Volume: $\mathrm{vol}(A) = \sum_{i\in A} d_i$ (sum of degrees).
Conductance: $$ \phi(A) = \frac{\mathrm{cut}(A)}{\min(\mathrm{vol}(A),\, 2m - \mathrm{vol}(A))}, $$ where $m=|E|$. Lower conductance $\Rightarrow$ better cluster. Conductance produces more balanced partitions than minimum cut.
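The three quantities can be computed in one pass over an edge list. This is a small sketch; the function name `conductance` and the example graph (two triangles joined by a single edge) are illustrative, and $A$ is assumed to be a non-empty proper subset so the denominator is non-zero.

```python
def conductance(edges, A):
    """Conductance of node set A in an undirected graph given as an edge list."""
    A = set(A)
    degree = {}
    cut = 0
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if (u in A) != (v in A):  # exactly one endpoint lies inside A
            cut += 1
    vol_A = sum(d for node, d in degree.items() if node in A)
    two_m = 2 * len(edges)
    return cut / min(vol_A, two_m - vol_A)


# Two triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
phi_triangle = conductance(edges, {0, 1, 2})  # cut = 1, vol = 7
phi_single = conductance(edges, {0})          # cut = 2, vol = 2
```

Cutting along the bridge gives $\phi = 1/7$, while isolating a single node gives $\phi = 1$, which illustrates why low conductance favours balanced, well-separated clusters.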

Local Clustering via Approximate PPR

Algorithm (PageRank-Nibble)

Pick a seed node $s$.
Compute approximate Personalized PageRank (PPR) with teleport set $\{s\}$.
Sort nodes by decreasing PPR score, $r_{u_1} \geq r_{u_2} \geq \cdots \geq r_{u_n}$.
Sweep: for each prefix $A_i = \{u_1,\ldots,u_i\}$, compute $\phi(A_i)$. Local minima of $\phi(A_i)$ correspond to good clusters.

Approximate PPR (Push-based)

Maintain estimate $r$ and residual $q = p - r$ (where $p$ is the true PPR vector). Initialise $r=\mathbf{0}$, $q=\mathbf{a}$ (teleport vector). Repeatedly Push from any node $u$ with $q_u/d_u \geq \varepsilon$:

$$\begin{aligned} r'_u &= r_u + (1-\beta)\,q_u, \\ q'_u &= \tfrac{1}{2}\beta\, q_u, \\ q'_v &= q_v + \tfrac{1}{2}\beta\, q_u / d_u \quad\text{for each } u\to v. \end{aligned}$$

Runtime: $O(1/(\varepsilon(1-\beta)))$, independent of graph size. Guarantee: if a cut of conductance $\phi$ and volume $k$ exists, the method finds a cut of conductance $O(\sqrt{\phi \log k})$.
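A sketch of the push procedure using the three update rules above. The worklist scheduling, the function name `approximate_ppr`, and the adjacency-dict format are assumptions made for illustration; any order of pushes that ends with every $q_u/d_u < \varepsilon$ is acceptable.

```python
def approximate_ppr(adj, seed, beta=0.85, eps=1e-4):
    """Push-based approximate PPR on an undirected graph.

    adj: dict node -> list of neighbours (every node assumed to have degree >= 1).
    Returns (r, q): the PPR estimate and the leftover residual.
    """
    r = {u: 0.0 for u in adj}
    q = {u: 0.0 for u in adj}
    q[seed] = 1.0
    work = [seed]  # nodes whose residual may exceed the threshold
    while work:
        u = work.pop()
        d_u = len(adj[u])
        if q[u] / d_u < eps:
            continue
        q_u = q[u]
        r[u] += (1.0 - beta) * q_u          # settle part of the residual at u
        q[u] = 0.5 * beta * q_u             # lazy walk: half stays at u
        for v in adj[u]:
            q[v] += 0.5 * beta * q_u / d_u  # the other half spreads to neighbours
            work.append(v)
        work.append(u)
    return r, q


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
r, q = approximate_ppr(adj, seed=0)
```

Each push conserves probability mass, so the sum of all entries of `r` and `q` stays equal to 1, and only nodes near the seed are ever touched.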

$k$-NN Graph Construction: NN-Descent

Given $n$ data points and a similarity oracle $\sigma$, build an approximate $K$-nearest-neighbour graph.

Algorithm

Initialise: for each node $v$, set $B[v]$ to $K$ random nodes.
Iterate:
1. Compute reverse neighbours $R[v] = \{u : v\in B[u]\}$.
2. General neighbours: $B^*[v] = B[v]\cup R[v]$.
3. For each $v$, for each $u_1\in B^*[v]$, for each $u_2\in B^*[u_1]$: compute $\sigma(v, u_2)$ and update $B[v]$ if $u_2$ is closer.
Stop when no updates occur.

Empirical cost: $O(n^{1.14})$, much better than brute-force $O(n^2)$.

Modularity

Definition

Given a partitioning $S$ of graph $G$ with $n$ nodes and $m$ edges:

$$ Q(G,S) = \frac{1}{2m}\sum_{s\in S}\sum_{i\in s}\sum_{j\in s} \left(A_{ij} - \frac{k_i k_j}{2m}\right), $$

where $A_{ij}$ is the adjacency matrix entry and $k_i$ is the degree of node $i$. The null model is a random graph with the same degree distribution, in which the expected number of edges between $i$ and $j$ is $k_i k_j/(2m)$. $Q\in[-1,1]$; values above roughly $0.3$--$0.7$ indicate significant community structure.
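A small sketch that evaluates $Q$ for an unweighted graph without self-loops. It uses the fact that the sum of $k_i k_j$ over ordered pairs in one community is the squared total degree of that community. The function name and the example graph are illustrative.

```python
def modularity(edges, communities):
    """Modularity Q of a partition; edges is an undirected edge list."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    label = {node: c for c, group in enumerate(communities) for node in group}
    # Sum of A_ij over ordered pairs in the same community:
    # every intra-community edge is counted twice.
    intra = sum(2 for u, v in edges if label[u] == label[v])
    # Sum of k_i * k_j / (2m) over ordered pairs in the same community.
    expected = 0.0
    for group in communities:
        vol = sum(degree.get(node, 0) for node in group)
        expected += vol * vol / (2 * m)
    return (intra - expected) / (2 * m)


# Two triangles joined by one edge, split along the bridge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])     # 5/14
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])      # 0
```

Placing all nodes in one community always gives $Q=0$, which is why $Q$ is read relative to zero.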

Louvain Algorithm

Greedy modularity maximisation with $O(n\log n)$ runtime.

Two-Phase Iteration

Phase 1 (local moves): start with each node in its own community. For each node $i$, compute the modularity gain $\Delta Q$ of moving $i$ into each neighbour's community; move $i$ to the community with the largest positive $\Delta Q$. Repeat until no move improves $Q$.
Phase 2 (aggregation): contract each community into a super-node. Edge weights between super-nodes $=$ sum of edge weights between the corresponding communities. Self-loops $=$ sum of internal edges.
Return to Phase 1 on the contracted graph; repeat until $Q$ no longer increases.

The hierarchy of contractions yields a dendrogram of communities at multiple resolutions.

Graph Embeddings

Problem Setup

Goal

Learn a mapping $f: V \to \mathbb{R}^d$ such that $\mathrm{similarity}(u,v) \approx \mathbf{z}_u^\top \mathbf{z}_v$, where $\mathbf{z}_v = f(v)$ is the $d$-dimensional embedding of node $v$.

Encoder--Decoder Framework

Encoder: maps each node to a low-dimensional vector. Simplest: embedding lookup $\mathbf{z}_v = Z\mathbf{e}_v$, where $Z\in\mathbb{R}^{d\times|V|}$ is the embedding matrix.
Decoder: maps embeddings back to a similarity score (e.g., dot product $\mathbf{z}_u^\top \mathbf{z}_v$).
Optimise: encoder parameters so that decoded similarity approximates the original graph similarity.

Random-Walk Embeddings (DeepWalk / node2vec)

Core Idea

Define node similarity via co-occurrence on short random walks. Let $N_R(u)$ be the multiset of nodes visited on random walks starting from $u$ under strategy $R$.

Objective

$$ \max_{\mathbf{z}} \sum_{u\in V} \log P\left(N_R(u) \mid \mathbf{z}_u\right) = \max_{\mathbf{z}} \sum_{u\in V} \sum_{v\in N_R(u)} \log P(\mathbf{z}_v \mid \mathbf{z}_u). $$

Softmax parametrisation: $P(\mathbf{z}_v \mid \mathbf{z}_u) = \exp(\mathbf{z}_v\cdot\mathbf{z}_u) / \sum_{n\in V}\exp(\mathbf{z}_n\cdot\mathbf{z}_u)$.

Negative Sampling

Computing the full softmax is $O(|V|)$ per pair. Approximate using $k$ negative samples:

$$ \log\sigma(\mathbf{z}_v\cdot\mathbf{z}_u) + \sum_{i=1}^{k} \mathbb{E}_{n_i\sim P_V} \left[\log\sigma(-\mathbf{z}_{n_i}\cdot\mathbf{z}_u)\right], $$

where $\sigma$ is the sigmoid and $P_V$ is a distribution over nodes (typically proportional to degree). In practice $k=5$--$20$.

node2vec: Biased Random Walks

Two parameters control the walk:

Return parameter $p$: controls likelihood of returning to the previous node.
In-out parameter $q$: controls exploration outward (DFS-like, small $q$) vs. staying local (BFS-like, small $p$).

After traversing edge $(s_1, w)$, unnormalised transition probabilities to next node $t$:

$$ \alpha(t) = \begin{cases} 1/p & \text{if } d(t, s_1) = 0 \text{ (return)}, \\ 1 & \text{if } d(t, s_1) = 1, \\ 1/q & \text{if } d(t, s_1) = 2 \text{ (move away)}. \end{cases} $$

BFS-like walks ($p$ small) capture local/structural roles; DFS-like walks ($q$ small) capture global community structure.
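The case analysis for $\alpha(t)$ translates directly into code. In this sketch the graph is an adjacency dict of sets, and the names `node2vec_weights` and `node2vec_step` are hypothetical helpers, not part of any library.

```python
import random


def node2vec_weights(adj, prev, cur, p, q):
    """Unnormalised alpha(t) for each neighbour t of cur, having arrived from prev."""
    weights = {}
    for t in adj[cur]:
        if t == prev:            # d(t, prev) = 0: return to the previous node
            weights[t] = 1.0 / p
        elif t in adj[prev]:     # d(t, prev) = 1: stay at the same distance
            weights[t] = 1.0
        else:                    # d(t, prev) = 2: move further away
            weights[t] = 1.0 / q
    return weights


def node2vec_step(adj, prev, cur, p, q, rng=random):
    """Sample the next node in proportion to alpha(t)."""
    weights = node2vec_weights(adj, prev, cur, p, q)
    nodes = sorted(weights)
    return rng.choices(nodes, weights=[weights[t] for t in nodes], k=1)[0]


adj = {
    "s1": {"w", "s2"},
    "w": {"s1", "s2", "s3"},
    "s2": {"s1", "w"},
    "s3": {"w"},
}
alpha = node2vec_weights(adj, prev="s1", cur="w", p=2.0, q=0.5)
```

With $p=2$ and $q=0.5$ the weights are $0.5$ for returning to `s1`, $1$ for `s2`, and $2$ for `s3`, so the walk is pushed outward.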

Algorithm

Compute random walk transition probabilities.
Simulate $r$ walks of length $l$ from each node.
Optimise embeddings via SGD with negative sampling.

All three steps are parallelisable; linear-time complexity.

Downstream Tasks

Node classification: predict label $f(\mathbf{z}_i)$.
Link prediction: predict edge from $f(\mathbf{z}_i, \mathbf{z}_j)$ (concatenation, Hadamard product, sum, or distance).
Clustering: cluster nodes by their embeddings.

Graph Neural Networks (GNNs)

Motivation

Shallow embedding methods (DeepWalk, node2vec) have $O(|V|d)$ parameters, are transductive (cannot embed unseen nodes), and ignore node features. GNNs address all three limitations by using a shared neural network encoder that aggregates neighbourhood information.

Message-Passing Framework

A GNN layer consists of two steps applied to each node $v$:

Message: each neighbour $u\in N(v)$ computes a message $\mathbf{m}_u^{(l)} = \mathrm{MSG}^{(l)}(\mathbf{h}_u^{(l-1)})$.
Aggregation: messages are combined and used to update $v$'s embedding: $\mathbf{h}_v^{(l)} = \mathrm{AGG}^{(l)}\left(\{\mathbf{m}_u^{(l)} : u\in N(v)\},\, \mathbf{m}_v^{(l)}\right)$.

Initialisation: $\mathbf{h}_v^{(0)} = \mathbf{x}_v$ (node features). Final embedding: $\mathbf{z}_v = \mathbf{h}_v^{(L)}$ after $L$ layers. A node at layer $L$ aggregates information from its $L$-hop neighbourhood.

GNN Variants

GCN (Graph Convolutional Network)

$$ \mathbf{h}_v^{(l)} = \sigma\!\left( \sum_{u\in N(v)} \frac{\mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}}{|N(v)|} \right). $$

Message: $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)} / |N(v)|$. Aggregation: sum, then apply activation $\sigma$.
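A dependency-free sketch of one GCN layer as written above. It assumes ReLU for the activation $\sigma$ (the notes do not fix a choice) and that every node has at least one neighbour; the function name and the toy inputs are illustrative.

```python
def gcn_layer(h, adj, W):
    """One GCN layer: average W @ h_u over neighbours u, then apply ReLU.

    h: dict node -> feature list, adj: dict node -> neighbour list,
    W: weight matrix as a list of rows.
    """
    def matvec(matrix, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

    out = {}
    for v, nbrs in adj.items():
        total = [0.0] * len(W)
        for u in nbrs:
            message = matvec(W, h[u])  # W h_u; the division below adds 1/|N(v)|
            total = [a + b / len(nbrs) for a, b in zip(total, message)]
        out[v] = [max(0.0, a) for a in total]  # ReLU stands in for sigma
    return out


h0 = {0: [1, 0], 1: [0, 1], 2: [1, 1]}
adj = {0: [1, 2], 1: [0], 2: [0]}
W = [[1, 0], [0, -1]]
h1 = gcn_layer(h0, adj, W)
```

Stacking $L$ such calls, with a different `W` per layer, gives each node access to its $L$-hop neighbourhood.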

GraphSAGE

Two-stage aggregation:

$$ \mathbf{h}_{N(v)}^{(l)} = \mathrm{AGG}\left(\{\mathbf{h}_u^{(l-1)} : u\in N(v)\}\right), \qquad \mathbf{h}_v^{(l)} = \sigma\!\left(\mathbf{W}^{(l)}\cdot \mathrm{CONCAT}\left(\mathbf{h}_v^{(l-1)},\, \mathbf{h}_{N(v)}^{(l)}\right)\right). $$

AGG can be mean, max, or a learned pooling function.

GAT (Graph Attention Network)

$$ \mathbf{h}_v^{(l)} = \sigma\!\left(\sum_{u\in N(v)} \alpha_{vu}\, \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}\right), $$

where attention weights $\alpha_{vu}$ are learned:

Compute attention coefficient: $e_{vu} = a\left(\mathbf{W}^{(l)}\mathbf{h}_v^{(l-1)},\, \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}\right)$, where $a$ is a small neural network (e.g., a linear layer on the concatenation).
Normalise via softmax: $\alpha_{vu} = \exp(e_{vu}) / \sum_{k\in N(v)}\exp(e_{vk})$.

Multi-head attention: run $H$ independent attention heads and concatenate (or average) outputs for stability.

Training

Supervised

Node classification with cross-entropy loss:

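$$ \mathcal{L} = -\sum_{v\in V}\left[y_v\log\sigma(\mathbf{z}_v^\top\boldsymbol{\theta}) + (1-y_v)\log(1-\sigma(\mathbf{z}_v^\top\boldsymbol{\theta}))\right]. $$

Train the layer weight matrices $\mathbf{W}^{(l)}$ (and the self-transformation matrices $\mathbf{B}^{(l)}$, if the layer has them) together with the classifier weights $\boldsymbol{\theta}$ via SGD.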

Unsupervised

Use random-walk co-occurrence (as in node2vec) as the similarity signal for the GNN encoder.

Graph Augmentation

Feature augmentation

When the graph has no node features, use: (a) constant features (all nodes get the same value --- inductive, low cost), or (b) one-hot node IDs (high expressive power, transductive, $O(|V|)$ cost). Other useful augmentations: node degree, PageRank, clustering coefficient.

Structure augmentation

Virtual edges: add 2-hop connections ($A + A^2$) to improve message passing in sparse graphs (e.g., bipartite $\to$ collaboration graphs).
Virtual nodes: a single node connected to all others reduces diameter to 2.
Neighbourhood sampling: randomly sample a fixed number of neighbours per node to reduce computation; in expectation recovers the full-neighbourhood result.

Relational Deep Learning (RDL)

Motivation

Most real-world data lives in relational databases (multiple tables linked by primary/foreign keys), not single flat tables. Traditional ML requires extensive feature engineering (ETL, SQL aggregations); RDL applies GNNs directly on the relational structure, eliminating manual feature work.

Relational Entity Graph

Each row in each table becomes a node in the graph.
Edges connect rows whose primary and foreign keys match (equivalent to SQL inner joins on pkey = fkey).
Node features are the non-key columns of each row.
The schema graph captures the high-level table relationships (one node per table, edges for foreign-key links).

Temporal Prediction Tasks

A training table specifies (Entity ID, Timestamp, Label). Tasks are inherently temporal: entity labels change over time, and the database evolves. At prediction time $t$, only information available up to $t$ may be used.

GNN computation graphs become time-dependent: message passing and neighbour sampling respect temporal constraints to avoid data leakage.

Why GNNs Work on Relational Data

GNN aggregation is a learnable version of hand-crafted SQL aggregation features (e.g., SUM, AVG over time windows).
GNNs can discover cross-table patterns via multi-hop message passing (e.g., user $\to$ transaction $\to$ product relationships).
Information exchange between training examples through shared graph structure enriches entity representations.

Machine Learning

Decision Trees

Setup

Input: $n$ examples $(\mathbf{x}_i, y_i)$ with $d$ features $x^{(1)},\ldots,x^{(d)}$. Features can be numerical or categorical; output $y$ can be categorical (classification) or numerical (regression).
Structure: a tree where each internal node tests a feature (e.g., $x^{(j)} < v$), each branch corresponds to an outcome, and each leaf stores a prediction.
Prediction: drop input $\mathbf{x}$ down the tree until it reaches a leaf; output the leaf's stored value.

Tree Construction (BuildSubtree)

Three decisions at each node with data $D$:

(1) How to split

Regression (variance reduction): Choose split $(x^{(j)}, v)$ that maximises

$$ |D|\cdot\mathrm{Var}(D) - \bigl[|D_L|\cdot\mathrm{Var}(D_L) + |D_R|\cdot\mathrm{Var}(D_R)\bigr], $$

where $\mathrm{Var}(D) = \frac{1}{|D|}\sum_{i\in D}(y_i - \bar{y})^2$.

Classification (information gain):

Entropy: $H(Y) = -\sum_j p(Y_j)\log_2 p(Y_j)$.
Conditional entropy: $H(Y\mid X) = \sum_j P(X=v_j)\, H(Y\mid X=v_j)$.
Information gain: $IG(Y\mid X) = H(Y) - H(Y\mid X)$.
Choose the split with highest $IG$.
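The three definitions above can be checked on a tiny categorical example. This is a sketch; `entropy` and `information_gain` are illustrative names and the data are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(xs, ys):
    """IG(Y | X) = H(Y) - H(Y | X) for a categorical feature X."""
    n = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    # H(Y | X): entropy of each branch, weighted by the branch's share of the data.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(ys) - conditional


labels = [1, 1, 0, 0]
ig_perfect = information_gain(["a", "a", "b", "b"], labels)  # separates the classes
ig_useless = information_gain(["a", "b", "a", "b"], labels)  # tells us nothing
```

The first feature splits the labels perfectly and recovers the full 1 bit of entropy; the second leaves both branches at 50/50 and gains nothing.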

(2) When to stop

Leaf is "pure": $\mathrm{Var}(y) < \epsilon$ or all labels are the same.
Too few examples: $|D| <$ threshold (e.g., 100).

(3) How to predict at a leaf

Regression: average $y_i$ in the leaf (or fit a local linear model).
Classification: majority class in the leaf.

Ensemble Methods

Bagging & Random Forests

Bagging (Bootstrap Aggregation)

Bootstrap: create $T$ datasets $D'_t$ by sampling $n$ points from $D$ with replacement ($\approx 63\%$ unique points per sample).
Train: build a decision tree independently on each $D'_t$.
Aggregate: average (regression) or majority vote (classification) over all $T$ trees.

Random Forests

Bagged decision trees with feature bagging: at each split, consider only a random subset of $\sqrt{d}$ features (out of $d$). This breaks correlation between trees, improving ensemble diversity.

Boosting

AdaBoost

Combines decision stumps (1-level trees) sequentially.

Initialise equal weights $w_i = 1/n$ for all examples.
At each round $t$:
1. Train stump $G_t$ on weighted data.
2. Compute weighted error $\epsilon_t = \sum_{i: G_t(x_i)\neq y_i} w_i$.
3. Compute tree weight $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
4. Reweight: $w_i \leftarrow w_i \cdot \exp(-\alpha_t\, y_i\, G_t(x_i))$; normalise.
Final prediction: $G(x) = \mathrm{sign}\!\left(\sum_t \alpha_t\, G_t(x)\right)$.

Harder-to-classify examples get higher weight; more accurate stumps get higher $\alpha_t$.
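Steps 2 to 4 of one round can be sketched as follows, taking the stump's predictions as given. Labels and predictions are assumed to lie in $\{-1,+1\}$, the weights are assumed to sum to 1, and the error is assumed to satisfy $0<\epsilon_t<1$; `adaboost_round` is an illustrative name.

```python
from math import exp, log


def adaboost_round(weights, y, predictions):
    """Return (alpha_t, new normalised weights) for one AdaBoost round."""
    # Weighted error: total weight of the misclassified examples.
    eps = sum(w for w, yi, gi in zip(weights, y, predictions) if yi != gi)
    alpha = 0.5 * log((1.0 - eps) / eps)
    # y_i * G(x_i) is +1 when correct (weight shrinks) and -1 when wrong (weight grows).
    new_w = [w * exp(-alpha * yi * gi) for w, yi, gi in zip(weights, y, predictions)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


alpha, w = adaboost_round([0.25] * 4, y=[1, 1, -1, -1], predictions=[1, 1, -1, 1])
```

Here one of four equally weighted examples is misclassified, so $\epsilon_t = 0.25$ and $\alpha_t = \tfrac12\ln 3$. After normalisation the misclassified example carries weight $1/2$ and each correct one $1/6$.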

Gradient Boosted Decision Trees (GBDT)

Additive model: $\hat{y}^{(t)} = \hat{y}^{(t-1)} + \epsilon\, f_t(\mathbf{x})$, where $\epsilon$ is the learning rate (shrinkage, typically $\sim 0.1$).
At round $t$, find tree $f_t$ minimising the second-order Taylor approximation of the loss:

$$ \widetilde{\mathcal{L}}^{(t)} = \sum_i \bigl[g_i\, f_t(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_t(\mathbf{x}_i)^2\bigr] + \Omega(f_t), $$

where $g_i = \partial_{\hat{y}} \ell(y_i, \hat{y}^{(t-1)})$, $h_i = \partial^2_{\hat{y}} \ell(y_i, \hat{y}^{(t-1)})$.
Regularisation: $\Omega(f) = \gamma T + \frac{\lambda}{2}\sum_{j=1}^{T} w_j^2$, where $T$ = number of leaves, $w_j$ = leaf weight.
For a given tree structure, optimal leaf weight: $w_j^* = -\sum_{i\in I_j} g_i / (\sum_{i\in I_j} h_i + \lambda)$.
Split gain:

$$ \mathrm{Gain} = \frac{1}{2}\left[ \frac{(\sum_{i\in I_L} g_i)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{(\sum_{i\in I_R} g_i)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{(\sum_{i\in I} g_i)^2}{\sum_{i\in I} h_i + \lambda} \right] - \gamma. $$
Grow tree greedily; post-prune splits with negative gain.
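A short numeric sketch of the optimal leaf weight and the split gain. The gain here adds the scores of the left and right children and subtracts the score of the unsplit parent, then charges $\gamma$ for the extra leaf. The example assumes squared loss $\ell=\tfrac12(y-\hat{y})^2$, for which $g_i=\hat{y}_i-y_i$ and $h_i=1$; the function names are illustrative.

```python
def leaf_weight(g, h, lam):
    """Optimal weight w* of a leaf holding gradients g and Hessians h."""
    return -sum(g) / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain from splitting a node into the given left and right children."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Targets y = [1, 1, -1, -1] with current prediction 0, so g = -y and h = 1.
g_left, h_left = [-1.0, -1.0], [1.0, 1.0]
g_right, h_right = [1.0, 1.0], [1.0, 1.0]
gain = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0)
w_left = leaf_weight(g_left, h_left, lam=1.0)
```

The split separates the positive from the negative targets, giving a gain of $4/3$ and a left-leaf weight of $2/3$ (shrunk from 1 by $\lambda$).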

XGBoost

A scalable implementation of GBDT with L1/L2 regularisation, column-block structure for parallel tree construction, distributed computing, and out-of-core computation for large datasets.

Infinite Data

Mining Data Streams

Data arrives one element at a time at rapid rate; cannot store entire stream; must process immediately or lose forever.

Sampling from a Stream

Fixed-proportion sampling

To sample fraction $a/b$ of the stream: hash each tuple's key uniformly into $b$ buckets; keep the tuple if its hash value $< a$. Hashing on keys (not individual tuples) preserves group structure (e.g., all queries by the same user are either all in or all out).

Fixed-size sampling (Reservoir Sampling)

Maintain a sample $S$ of exactly $s$ elements.

Store the first $s$ elements.
When the $n$-th element arrives ($n>s$): with probability $s/n$, add it to $S$ (replacing a uniformly random element); otherwise discard it.

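Invariant: after seeing $n$ elements, each element is in $S$ with probability $s/n$. Proof by induction: an element already in $S$ survives the arrival of element $n{+}1$ with probability $1 - \frac{s}{n+1}\cdot\frac{1}{s} = \frac{n}{n+1}$, so $P(\text{in } S \text{ at } n{+}1) = \frac{s}{n}\cdot\frac{n}{n+1} = \frac{s}{n+1}$; the new element itself is kept with probability $\frac{s}{n+1}$ by construction.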
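The two steps translate into a few lines of code. This is a sketch; `reservoir_sample` is an illustrative name and the `rng` argument exists only so a seeded generator can be passed in.

```python
import random


def reservoir_sample(stream, s, rng=random):
    """Keep a uniform random sample of s elements from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)             # store the first s elements
        elif rng.random() < s / n:          # keep the n-th element w.p. s/n
            sample[rng.randrange(s)] = item  # evict a uniformly random element
    return sample


sample = reservoir_sample(range(10_000), s=5, rng=random.Random(0))
```

Only the sample and the counter `n` are stored, so memory use is $O(s)$ regardless of stream length.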

Filtering: Bloom Filters

Given a set $S$ of $m$ keys, determine which stream elements have keys in $S$, using limited memory.

Basic Bloom Filter

Allocate a bit array $B$ of $n$ bits (all 0).
Hash each $s\in S$: set $B[h(s)]=1$.
For stream element $a$: output $a$ if $B[h(a)]=1$.

No false negatives; false positive probability $\approx 1 - e^{-m/n}$.

Bloom Filter with $k$ hash functions

Use $k$ independent hash functions $h_1,\ldots,h_k$; declare $x\in S$ only if $B[h_i(x)]=1$ for all $i$. False positive probability: $(1 - e^{-km/n})^k$. Optimal $k = \frac{n}{m}\ln 2$; for $m=10^9$, $n=8\times10^9$ this is $8\ln 2\approx 5.5$, and $k=6$ gives FP rate $\approx 2.2\%$.
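A compact sketch of a Bloom filter with $k$ hash functions, plus the false-positive formula. Deriving the $k$ hashes by salting SHA-256 with the index $i$ is an assumption made for illustration, as is storing one byte per bit for readability; the class and function names are hypothetical.

```python
import hashlib
from math import exp


class BloomFilter:
    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # k hash functions obtained by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def false_positive_rate(m, n, k):
    """Approximate FP probability for m keys, n bits and k hash functions."""
    return (1.0 - exp(-k * m / n)) ** k


allowed = BloomFilter(n_bits=1000, k=3)
for key in ["alice@example.com", "bob@example.com"]:
    allowed.add(key)
fp = false_positive_rate(m=1e9, n=8e9, k=6)
```

Every inserted key is always reported as present; `false_positive_rate(1e9, 8e9, 6)` evaluates to roughly 0.0216, matching the figure quoted above.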

Counting Distinct Elements: Flajolet--Martin

Algorithm

Choose hash $h$ mapping elements to $\geq\log_2 N$ bits.
For each element $a$, let $r(a)$ = number of trailing zeros in $h(a)$.
Maintain $R = \max_a r(a)$.
Estimate number of distinct elements $\approx 2^R$.
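The four steps with a single hash function can be sketched as follows. Using the first 32 bits of SHA-256 as $h$ is an illustrative assumption, and the names `trailing_zeros` and `fm_estimate` are hypothetical.

```python
import hashlib


def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x (defined as width when x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit


def fm_estimate(stream, salt=""):
    """Flajolet-Martin estimate 2^R of the number of distinct elements."""
    R = 0
    for a in stream:
        digest = hashlib.sha256(f"{salt}:{a}".encode()).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        R = max(R, trailing_zeros(h))
    return 2 ** R


estimate = fm_estimate(["x", "y", "x", "z", "y"])
```

Because the hash is deterministic, repeated elements never change $R$; only one integer is stored per hash function. Different `salt` values give the independent hash functions needed to combine several estimates.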

Analysis

Probability of not seeing a tail of length $r$ among $m$ distinct elements: $(1-2^{-r})^m$. If $m \gg 2^r$, this $\to 0$; if $m \ll 2^r$, this $\to 1$. So $2^R \approx m$.

$\mathbb{E}[2^R]$ is infinite (heavy tail), so use multiple hash functions: partition samples into groups, take median within groups, then average the medians.

Counting Frequent Items: Exponentially Decaying Windows

Maintain a smoothed count for each item $x$:

$$ w_x(T) = \sum_{t=1}^{T} \delta_t\,(1-c)^{T-t}, \quad \delta_t = \mathbf{1}[a_t = x], $$

where $c$ is a small decay constant (e.g., $10^{-6}$).

Update rule: when new element $a_{T+1}$ arrives, multiply all weights by $(1-c)$; add 1 to $w_{a_{T+1}}$.

The total weight across all items is at most $\sum_{t\geq 0}(1-c)^t = 1/c$, so at most $2/c$ items can have weight $\geq 1/2$. Drop items whose weight falls below the threshold (here $1/2$).
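A sketch of the update rule with pruning. The threshold of $1/2$ follows the counting argument above; the function name and the dict-based storage are assumptions for illustration.

```python
def update_decayed_counts(weights, item, c, threshold=0.5):
    """Process one stream element: decay every count, prune, then credit item."""
    for key in list(weights):
        weights[key] *= 1.0 - c
        if weights[key] < threshold:
            del weights[key]  # too small to matter; stop tracking it
    weights[item] = weights.get(item, 0.0) + 1.0
    return weights


counts = {}
for element in ["a", "a", "b"]:
    update_decayed_counts(counts, element, c=0.1)
```

With $c=0.1$ the weight of `a` evolves as $1 \to 1.9 \to 1.71$, and `b` enters with weight 1. A realistic $c$ such as $10^{-6}$ decays far more slowly.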

Extension to Itemsets

When a basket $B$ arrives, start counting an itemset $S\subseteq B$ only if all proper subsets of $S$ are already being counted (mirroring A-priori).

Online Algorithms & Web Advertising

Online Bipartite Matching

Setting

One side of a bipartite graph ("boys") is known upfront. The other side ("girls") arrives one at a time; upon arrival, we must irrevocably match or leave unmatched.

Greedy

Match each arriving node to any available neighbour. Competitive ratio $= 1/2$ (worst case: greedy exhausts the wrong side first).

Competitive Ratio

$\mathrm{CR} = \min_{\text{all inputs } I} |M_{\mathrm{ALG}}| / |M_{\mathrm{OPT}}|$.

The Adwords Problem

Setting

A stream of search queries $q_1, q_2, \ldots$ arrives. Each advertiser has a bid on certain queries, a click-through rate (CTR), and a daily budget. For each query, select an advertiser to show; goal: maximise total revenue.

Simplified Model

All advertisers have budget $B$, all bids $=1$, same CTR.

BALANCE Algorithm

For each query, assign it to the advertiser with the largest remaining budget (break ties deterministically).

Analysis (2 advertisers): Competitive ratio $= 3/4$. Proof sketch: BALANCE must exhaust at least one budget; the number of missed queries $x$ satisfies $x \leq B/2$, giving revenue $\geq 3B/2$ vs. optimal $2B$.

General case ($N$ advertisers): Competitive ratio $= 1 - 1/e \approx 0.63$. Worst case: $N$ rounds of $B$ queries, where round $i$ queries are bid on by advertisers $A_i,\ldots,A_N$. After $k = N(1-1/e)$ rounds, all budgets are exhausted; revenue $= BN(1-1/e)$.
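The two-advertiser bound can be reproduced with a small simulation of the simplified model (all bids equal to 1). The input format, the alphabetical tie-break, and the name `balance` are assumptions for illustration.

```python
def balance(queries, budgets):
    """BALANCE for unit bids.

    queries: list of lists, each naming the advertisers bidding on that query.
    budgets: dict advertiser -> initial budget.
    Returns (assignment, revenue).
    """
    remaining = dict(budgets)
    assignment = []
    for bidders in queries:
        # Sorting makes tie-breaking deterministic (alphabetical order).
        eligible = [a for a in sorted(bidders) if remaining[a] > 0]
        if eligible:
            best = max(eligible, key=lambda a: remaining[a])
            remaining[best] -= 1
            assignment.append(best)
        else:
            assignment.append(None)  # the query earns nothing
    revenue = sum(a is not None for a in assignment)
    return assignment, revenue


# A bids only on x, B bids on x and y; budgets of 4; stream x x x x y y y y.
queries = [["A", "B"]] * 4 + [["B"]] * 4
assignment, revenue = balance(queries, {"A": 4, "B": 4})
```

BALANCE alternates the four `x` queries between A and B, leaving B with budget for only two of the four `y` queries: revenue 6 against an optimal 8, exactly the $3/4$ ratio.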

Generalised BALANCE

For arbitrary bids $x_i$ and budgets $b_i$, define

$$ \psi_i(q) = x_i\,(1 - e^{-f_i}), \quad f_i = 1 - m_i/b_i, $$

where $m_i$ = amount spent so far. Allocate query $q$ to the advertiser with the largest $\psi_i(q)$. Same competitive ratio $1-1/e$.

No online algorithm can achieve a better competitive ratio.

Submodular Optimisation

Submodularity & Set Cover

Set Cover Problem

Given a universe $W = \{w_1,\ldots,w_n\}$ and sets $X_1,\ldots,X_m \subseteq W$, find $k$ sets whose union is as large as possible:

$$ \max_{|A|\leq k} F(A), \qquad F(A) = \left|\bigcup_{i\in A} X_i\right|. $$

This is NP-hard in general.

Submodular Functions

Definition

A set function $F: 2^V \to \mathbb{R}$ is submodular if for all $A\subseteq B\subseteq V$ and $d\notin B$:

$$ F(A\cup\{d\}) - F(A) \geq F(B\cup\{d\}) - F(B). $$

This is the diminishing returns property: adding an element to a smaller set helps at least as much as adding it to a larger set.

Equivalently, for all $A,B\subseteq V$: $F(A) + F(B) \geq F(A\cup B) + F(A\cap B)$.

Properties

The coverage function $F(A) = |\bigcup_{i\in A} X_i|$ is submodular and monotone.
Non-negative linear combinations of submodular functions are submodular.
Submodularity is the discrete analogue of concavity.

Greedy Algorithm

$A_0 = \emptyset$.
For $i=1,\ldots,k$: let $d_i = \arg\max_d F(A_{i-1}\cup\{d\}) - F(A_{i-1})$; set $A_i = A_{i-1}\cup\{d_i\}$.

Guarantee (Nemhauser--Fisher--Wolsey, 1978)

For any monotone submodular function $F$ with $F(\emptyset)=0$:

$$ F(A_{\mathrm{greedy}}) \geq \left(1 - \frac{1}{e}\right) F(A_{\mathrm{OPT}}) \approx 0.63\cdot\mathrm{OPT}. $$

Lazy Greedy (CELF)

Exploit submodularity: marginal gains $\Delta_i(d)$ can only decrease over rounds. Keep an ordered list of upper bounds from the previous round; re-evaluate only the current top element, re-sort, and pick the new top. Provides identical output to greedy but with significant speedup in practice.
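A sketch of lazy greedy for the coverage objective $F(A)=|\bigcup_{i\in A}X_i|$, using a heap of stale upper bounds. The function name and the example sets are illustrative; ties may be broken differently from plain greedy, but each step still selects an element of maximum marginal gain.

```python
import heapq


def lazy_greedy_max_coverage(sets, k):
    """Pick up to k sets greedily, re-evaluating marginal gains lazily.

    sets: dict name -> set of covered elements.
    """
    covered, chosen = set(), []
    # Max-heap (via negated values) of upper bounds on each marginal gain.
    heap = [(-len(items), name) for name, items in sets.items()]
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        _, name = heapq.heappop(heap)
        gain = len(sets[name] - covered)  # refresh only the current top element
        if not heap or gain >= -heap[0][0]:
            # Fresh gain beats every other (stale) upper bound: it is the argmax.
            chosen.append(name)
            covered |= sets[name]
        else:
            heapq.heappush(heap, (-gain, name))  # reinsert with the tighter bound
    return chosen, covered


sets = {"X1": {1, 2, 3, 4}, "X2": {3, 4, 5}, "X3": {5, 6}, "X4": {1, 2}}
chosen, covered = lazy_greedy_max_coverage(sets, k=2)
```

After `X1` is chosen, the stale bound for `X2` (3) drops to a true gain of 1, so `X3` with gain 2 is selected without `X4` ever being re-evaluated.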

Application: Diverse Document Selection

Probabilistic Coverage

Each document $d$ covers concept $c$ with probability $\mathrm{Cover}_d(c)$. Probability that at least one document in $A$ covers $c$:

$$ P_A(c) = 1 - \prod_{d\in A}(1 - \mathrm{Cover}_d(c)). $$

Objective with concept weights $w_c$:

$$ F(A) = \sum_c w_c\, P_A(c). $$

This is monotone and submodular, so greedy gives a $(1-1/e)$-approximation.
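The objective is easy to evaluate directly, which also lets diminishing returns be checked numerically. The document names, concept names, and probabilities below are made up for illustration, and `coverage_objective` is a hypothetical helper.

```python
def coverage_objective(cover, weights, selected):
    """F(A) = sum_c w_c * (1 - prod_{d in A} (1 - Cover_d(c))).

    cover: dict document -> {concept: coverage probability}.
    weights: dict concept -> weight w_c.
    """
    total = 0.0
    for concept, w in weights.items():
        miss = 1.0  # probability that no selected document covers the concept
        for d in selected:
            miss *= 1.0 - cover[d].get(concept, 0.0)
        total += w * (1.0 - miss)
    return total


cover = {"d1": {"sports": 0.5}, "d2": {"sports": 0.5, "tech": 1.0}}
weights = {"sports": 0.6, "tech": 0.4}
gain_on_empty = coverage_objective(cover, weights, ["d1"])
gain_after_d2 = (coverage_objective(cover, weights, ["d2", "d1"])
                 - coverage_objective(cover, weights, ["d2"]))
```

Adding `d1` to the empty set gains 0.3, but adding it after `d2` gains only 0.15, since `d2` already covers half of the `sports` concept.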

Personalisation (Multiplicative Weights)

Learn per-user concept weights from feedback: after recommending document $d$ and receiving feedback $r\in\{+1,-1\}$, update $w_c \leftarrow \beta^r w_c$ for each concept $c\in X_d$, then renormalise so $\sum_c w_c = 1$.
