CS 246: Mining Massive Data Sets
Association Rule Discovery: Market-basket model
Definitions
- item, basket (a set of items). We are given the set $\mathbf{Items}$ of all items, and the set of all baskets $\mathbf{Baskets}=\{\mathbf{B}_i\}\subset\operatorname{Power}(\mathbf{Items})$.
- frequent itemsets:
- support for itemset $I$ is the number of baskets containing $I$, i.e. $$\textrm{supp}(I):=|\{\mathbf{B}\in \mathbf{Baskets}:\mathbf{B}\supseteq I\}|.$$ Support is monotonically decreasing: $I\subseteq J\Rightarrow \textrm{supp}(I)\geq \textrm{supp}(J).$
- given a support threshold $s$, frequent itemsets are $\{I\subseteq \mathbf{Items}:\textrm{supp}(I)\geq s\}.$
- association rules:
- If-then rule $I\rightarrow j$: if a basket contains $I$, then it is likely to contain $j$.
- support of this rule is defined to be the support of the combined itemset, $\textrm{supp}(I\cup\{j\})$.
- confidence: $\textrm{conf}(I\rightarrow j):=\frac{\textrm{supp}(I\cup\{j\})}{\textrm{supp}(I)}.$ This estimates $\textrm{Pr}(j|I)=\frac{\textrm{Pr}(I,j)}{\textrm{Pr}(I)}.$
- interest value: $\mathrm{Interest}(I\rightarrow j):= |\textrm{conf}(I\rightarrow j)-\textrm{Pr}(j)| = |\textrm{Pr}(j|I) - \textrm{Pr}(j)|,$ where $\mathrm{Pr}(j):=\frac{\textrm{supp}(j)}{|\mathbf{Baskets}|}.$
- interesting rules are those with high interest values (usually above 0.5).
- absolute value to capture both positive and negative association rules.
- Association rule mining: find all association rules with support $\geq s$ and confidence $\geq c$.
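The definitions above can be checked on a toy data set. The following is a minimal sketch, not part of the original notes; the function names and the four example baskets are illustrative assumptions.

```python
def supp(itemset, baskets):
    """Support: number of baskets that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for b in baskets if items <= b)


def confidence(I, j, baskets):
    """conf(I -> j) = supp(I ∪ {j}) / supp(I)."""
    return supp(set(I) | {j}, baskets) / supp(I, baskets)


def interest(I, j, baskets):
    """|conf(I -> j) - Pr(j)|, with Pr(j) = supp({j}) / number of baskets."""
    return abs(confidence(I, j, baskets) - supp({j}, baskets) / len(baskets))


# Hypothetical toy data, chosen only for illustration.
BASKETS = [frozenset(b) for b in (
    {"milk", "bread", "beer"},
    {"milk", "bread"},
    {"milk", "diapers"},
    {"bread", "beer"},
)]

print(supp({"milk"}, BASKETS))                # 3
print(confidence({"milk"}, "bread", BASKETS))  # 2/3
print(interest({"milk"}, "bread", BASKETS))    # |2/3 - 3/4| = 1/12
```

Here the rule milk → bread has confidence 2/3 but low interest, because bread already appears in 3 of the 4 baskets.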
Algorithms
Overview
- Step1: find all frequent itemsets $I$.
- Step2: rule generation
- For every subset $A$ of $I$, generate the rule $A\rightarrow I\backslash A$.
- observation: if $A,B,C\rightarrow D$ is below confidence, then so is $A,B\rightarrow C,D.$
- Output the generated rules whose confidence is at least the confidence threshold $c$. To make the output more compact, we can output only maximal frequent itemsets (no immediate superset is frequent), or closed itemsets (no immediate superset has the same support).
Step 1 is the hard part. In practice the most expensive case is usually finding frequent pairs ($|I| =2$).
Finding frequent pairs
Naive approach
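Read the file once, counting in main memory the occurrences of each pair. This fails if the roughly $|\mathbf{Items}|^2/2$ pair counts exceed main memory. Two ways to store the counts:
- triangular matrix: update $\textrm{count}[i][j]$ for $i<j$; with 4-byte integers this costs 4 bytes per possible pair.
- triples: store $(i,j,\textrm{count})$ only for pairs that occur; this costs 12 bytes per occurring pair, so it beats the triangular matrix if fewer than $1/3$ of possible pairs actually occur.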
A-priori algorithm: two-pass approach
- Pass 1: read baskets and count occurrences of each individual item. So after pass 1, we know $\textrm{supp}(i)$ for $i\in \mathbf{Items}.$
- Pass 2: read baskets again and count those pairs $(i,j)$ where both $i,j$ are frequent, using triangular matrix.
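The two passes can be sketched as follows. This is an illustrative sketch rather than the course's reference code: the function name `apriori_pairs` is an assumption, and pair counts are kept in a dictionary (the "triples" layout) instead of a triangular matrix to keep the example short.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, s):
    """Return {pair: support} for all pairs with support >= s."""
    # Pass 1: count individual items (each item at most once per basket).
    item_counts = Counter(item for b in baskets for item in set(b))
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two items are both frequent.
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(i for i in set(b) if i in frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {p: c for p, c in pair_counts.items() if c >= s}


example = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
print(apriori_pairs(example, s=2))
```

Item `d` is infrequent, so no pair containing it is ever counted in pass 2; that pruning is the whole memory saving.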
PCY algorithm for frequent pairs:
Uses the memory left idle in the first pass of A-priori: maintain a hash table with as many buckets as fit in memory.
- Pass 1:
    For (each basket B):
        For (each item i in B):
            count[i] += 1
        For (each pair (i,j) in B):
            hash (i,j) to a bucket b
            count_bucket[b] += 1
- Bit-vector: between the passes, replace the bucket counts by a bitmap with $\textrm{bit}[b]=1$ if bucket $b$ is frequent (its count is at least $s$), and $0$ otherwise.
- Pass 2: count only candidate pairs $(i,j)$, i.e. those where $i$ and $j$ are both frequent and the bucket $b$ to which $(i,j)$ hashes is frequent.
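A compact sketch of PCY for pairs, under assumptions that are not in the notes: the name `pcy_pairs`, the default of 101 buckets, and the use of Python's built-in `hash` as the bucket function are all illustrative choices.

```python
from collections import Counter
from itertools import combinations


def pcy_pairs(baskets, s, n_buckets=101):
    """Return {pair: support} for all pairs with support >= s (PCY sketch)."""
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: count items and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for b in baskets:
        item_counts.update(b)
        for pair in combinations(b, 2):
            bucket_counts[hash(pair) % n_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]  # one bit per bucket

    # Pass 2: count a pair only if both items and its bucket are frequent.
    pair_counts = Counter()
    for b in baskets:
        for i, j in combinations(b, 2):
            if (i in frequent_items and j in frequent_items
                    and bitmap[hash((i, j)) % n_buckets]):
                pair_counts[(i, j)] += 1

    return {p: c for p, c in pair_counts.items() if c >= s}


example_pcy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
print(pcy_pairs(example_pcy, s=2))
```

A frequent pair always lands in a frequent bucket, so the bitmap can only remove candidates, never a truly frequent pair; even with a single bucket the result is the same as A-priori.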
k passes for size $k$
A-priori and PCY generalize to $|I|=k$ case by $k$-passes: prune non-frequent $|I'|=k-1$ itemsets from $(k-1)$-th pass.
Frequent itemsets in $\leq2$ passes for all sizes
May miss some frequent itemsets.
Random Sampling
Take a random sample of the market baskets. Reduce support threshold proportionally: if your sample is $1/100$ of the baskets, use $s/100$ as your support threshold instead of $s$.
SON algorithm
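- 1st pass: Repeatedly read small chunks of the baskets into main memory and run an in-memory algorithm on each chunk to find all itemsets frequent in that chunk, with support threshold $= \frac{|\textrm{chunk}|}{|\textrm{whole file}|}\cdot s.$ An itemset that is frequent overall must be frequent in at least one chunk, so the union of the per-chunk results is a candidate set with no false negatives.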
- 2nd pass: count all the candidate itemsets
Toivonen's algorithm
- Pass 1: Start with a random sample, but lower the threshold slightly for the sample. To the itemsets that are frequent in the sample, add the negative border (not frequent in the sample, but all its immediate subsets are).
- Pass 2: Count all candidate and also their negative border.
- If no itemset from the negative border turns out to be frequent, then we found all the frequent itemsets, otherwise, start over with another sample.
Recommendation Systems
Problem Setup
Given customers $X$, items $S$, and a utility function $u: X\times S\to\mathbb{R}$ (e.g. 1--5 star ratings), the utility matrix records known ratings (most entries blank). Three problems: (1) gather known ratings; (2) extrapolate unknown ratings; (3) evaluate extrapolation.
Key challenge: the matrix is sparse + cold-start (new items or users have no history).
Content-Based Filtering
Pipeline
- Build an item profile (feature vector) for each item. For text: use the words with highest TF-IDF scores.
- Build a user profile as a weighted average of the item profiles of items the user rated highly.
- Recommend items closest (by cosine similarity) to the user profile.
TF-IDF
Let $f_{ij}$ = frequency of term $i$ in item $j$, $n_i$ = number of items containing term $i$, $N$ = total items.
$$ TF_{ij} = \frac{f_{ij}}{\max_k f_{kj}}, \qquad IDF_i = \log\frac{N}{n_i}, \qquad w_{ij} = TF_{ij}\times IDF_i. $$
High TF-IDF $\Rightarrow$ term is frequent in this item but rare globally.
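As a worked illustration of the formula, here is a small sketch. The function name `tf_idf`, the natural logarithm (the notes do not fix a base), and the assumption that every document is a non-empty token list are choices made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # n_i: number of documents containing term i.
    doc_freq = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        counts = Counter(d)
        max_f = max(counts.values())  # frequency of the most common term in this doc
        weights.append({
            t: (f / max_f) * math.log(n_docs / doc_freq[t])
            for t, f in counts.items()
        })
    return weights


w = tf_idf([["a", "a", "b"], ["a", "c"]])
print(w)
```

The term `a` occurs in every document, so its IDF is $\log 1 = 0$ and its weight vanishes, while `b` and `c` each occur in only one document and keep a positive weight.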
Prediction
$$ u(x, i) = \cos(x, i) = \frac{x\cdot i}{\|x\|\,\|i\|}. $$
Use LSH to find items closest to user profile $x$ efficiently.
Pros / Cons
- Pros: no data from other users needed; can recommend new/unpopular items; no item cold-start; explanations available.
- Cons: hard to extract features for non-text items; user cold-start; over-specialisation (never discovers items outside the user's profile).
Collaborative Filtering
Uses the utility matrix directly (no content features). Two flavours: user--user and item--item.
User Similarity Measures
Let $S_{xy}$ = items rated by both users $x$ and $y$, $\bar{r}_x$ = mean rating of user $x$.
- Jaccard: $\mathrm{sim}(x,y)=|r_x\cap r_y|/|r_x\cup r_y|$. Ignores rating values.
- Cosine: $\mathrm{sim}(x,y)=r_x\cdot r_y/(\|r_x\|\,\|r_y\|)$. Treats missing entries as 0 (negative bias).
- Pearson correlation (preferred):
$$ \mathrm{sim}(x,y) = \frac{\displaystyle\sum_{s\in S_{xy}}(r_{xs}-\bar{r}_x)(r_{ys}-\bar{r}_y)} {\displaystyle\sqrt{\sum_{s\in S_{xy}}(r_{xs}-\bar{r}_x)^2}\cdot \sqrt{\sum_{s\in S_{xy}}(r_{ys}-\bar{r}_y)^2}}. $$
Mean-centering removes user-level bias; sum only over co-rated items. (Cosine on mean-centered data equals Pearson.)
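A minimal sketch of the Pearson similarity between two users, each given as a dictionary from item to rating. The name `pearson`, the use of each user's mean over all of their own ratings, and returning 0 when there is no overlap or no variance are assumptions of this example.

```python
import math


def pearson(rx, ry):
    """Pearson similarity of two users; rx, ry map item -> rating."""
    common = rx.keys() & ry.keys()          # S_xy: co-rated items
    if not common:
        return 0.0
    mean_x = sum(rx.values()) / len(rx)     # mean over all ratings of x
    mean_y = sum(ry.values()) / len(ry)
    num = sum((rx[s] - mean_x) * (ry[s] - mean_y) for s in common)
    den = (math.sqrt(sum((rx[s] - mean_x) ** 2 for s in common))
           * math.sqrt(sum((ry[s] - mean_y) ** 2 for s in common)))
    return num / den if den else 0.0


alice = {"a": 1, "b": 2, "c": 3}
bob = {"a": 2, "b": 4, "c": 6}      # same taste, more generous scale
carol = {"a": 3, "b": 2, "c": 1}    # opposite taste
print(pearson(alice, bob), pearson(alice, carol))
```

Because of the mean-centering, a user who rates everything one notch higher still gets similarity 1 with a like-minded but stricter user.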
User--User CF: Rating Prediction
Let $N$ = $k$ users most similar to $x$ who rated item $i$.
$$ r_{xi} = \frac{\sum_{y\in N} s_{xy}\cdot r_{yi}}{\sum_{y\in N} s_{xy}}, \qquad s_{xy}=\mathrm{sim}(x,y). $$
Item--Item CF: Rating Prediction
Let $N(i;x)$ = items most similar to $i$ that were also rated by $x$.
$$ r_{xi} = \frac{\sum_{j\in N(i;x)} s_{ij}\cdot r_{xj}}{\sum_{j\in N(i;x)} s_{ij}}. $$
Empirical finding: item--item CF usually outperforms user--user CF because items are simpler entities and item neighbourhoods are more stable over time.
Baseline-Adjusted Prediction
$$ r_{xi} = b_{xi} + \frac{\sum_{j\in N(i;x)} s_{ij}\cdot(r_{xj}-b_{xj})}{\sum_{j\in N(i;x)} s_{ij}}, \qquad b_{xi} = \mu + b_x + b_i, $$
where $\mu$ = global mean rating, $b_x$ = user deviation, $b_i$ = item deviation.
Problems with Basic CF
- Similarity $s_{ij}$ is arbitrary (Pearson, cosine, etc.) and not optimized.
- Pairwise similarities neglect interdependencies among items in $N(i;x)$.
- The normalized weighted average constrains the model unnecessarily ($\sum w = 1$, same sign as similarities).
Complexity & Scalability
Finding $k$ nearest users costs $O(|X|)$ per query --- too slow at runtime. Solutions: pre-compute similarities offline; LSH for approximate neighbours; cluster users before search; dimensionality reduction.
Pros / Cons of Collaborative Filtering
- Pros: works for any item type; no feature engineering.
- Cons: cold start (new users/items); sparsity; first-rater problem; popularity bias (over-recommends popular items).
Hybrid Methods
Combine content-based and collaborative filtering: use item profiles to handle the new-item cold start, and user demographics to handle the new-user cold start; combine predictions via a linear model.
Evaluation
Split the utility matrix into train / test.
- RMSE: $\displaystyle\sqrt{\frac{1}{N}\sum_{x,i}(r_{xi}-r^*_{xi})^2}$.
- Precision @ 10: fraction of relevant items in top-10 recommendations.
- Coverage / ROC: fraction of user--item pairs predictable; false-positive vs. true-positive tradeoff.
Note: RMSE penalises all rating levels equally, but in practice only high ratings matter.
Latent Factor Models (Matrix Factorization)
Core Idea
Map both users and items to a shared $k$-dimensional latent space. Factorize the ratings matrix $R$ (items $\times$ users) as
$$ R \approx Q\,P^{\top}, $$
where $q_i\in\mathbb{R}^k$ is the latent vector for item $i$ and $p_x\in\mathbb{R}^k$ is the latent vector for user $x$. Predicted rating:
$$ \hat{r}_{xi} = q_i \cdot p_x. $$
Why Not Standard SVD?
Standard SVD is not defined when entries are missing, and treating the missing entries as zeros would wrongly count them as low ratings. Instead, minimize SSE only over known entries:
$$ \min_{P,Q} \sum_{(i,x)\in R}(r_{xi} - q_i \cdot p_x)^2. $$
$P$ and $Q$ need not be orthonormal.
Regularized Objective (prevent overfitting)
Large $k$ overfits. Add L2 penalty:
$$ \min_{P,Q} \sum_{(i,x)\in R}(r_{xi} - q_i\cdot p_x)^2 + \lambda_1\sum_x \|p_x\|^2 + \lambda_2\sum_i \|q_i\|^2. $$
Regularization shrinks factor vectors toward zero where data is scarce.
Full Model with Biases
$$ \hat{r}_{xi} = \mu + b_x + b_i + q_i\cdot p_x, $$
where $\mu$ = global mean, $b_x$ = user bias, $b_i$ = item bias. Full regularized objective:
$$ \min_{Q,P,b}\sum_{(x,i)\in R}(r_{xi}-\mu-b_x-b_i-q_i\cdot p_x)^2 + \lambda_1\sum_x\|p_x\|^2 + \lambda_2\sum_i\|q_i\|^2 + \lambda_3\sum_x b_x^2 + \lambda_4\sum_i b_i^2. $$
BellKor Multi-Scale Model
The winning Netflix Prize approach decomposes predictions into three levels:
- Global (biases): $b_{xi} = \mu + b_x + b_i$. Example: global mean $\mu=3.7$, the movie is rated $0.5$ above average, and the user rates $0.2$ below average $\Rightarrow$ baseline $= 3.7+0.5-0.2 = 4.0$.
- Regional (latent factors): matrix factorization $q_i\cdot p_x$ capturing latent user--item affinity.
- Local (CF/NN): nearest-neighbour adjustment using ratings of similar items/users to fine-tune the prediction.
Learned Interpolation Weights (CF++)
Replace fixed similarity $s_{ij}$ with learned weights $w_{ij}$ (not constrained to sum to 1; can be negative; captures item interdependencies):
$$ \hat{r}_{xi} = b_{xi} + \sum_{j\in N(i;x)} w_{ij}(r_{xj}-b_{xj}). $$
Objective (minimize SSE over training ratings):
$$ J(w) = \sum_{(x,i)\in R} \Bigl(r_{xi} - b_{xi} - \sum_{j\in N(i;x)} w_{ij}(r_{xj}-b_{xj})\Bigr)^2. $$
Gradient w.r.t. $w_{ij}$ (for $j\in N(i;x)$; zero otherwise):
$$ \frac{\partial J}{\partial w_{ij}} = -2\sum_{(x,i)\in R} \Bigl(r_{xi} - b_{xi} - \sum_{k\in N(i;x)} w_{ik}(r_{xk}-b_{xk})\Bigr) (r_{xj}-b_{xj}). $$
Gradient Descent update (fix item $i$; iterate over all its ratings):
$$ w_{\text{new}} \leftarrow w_{\text{old}} - \eta\,\nabla_w J, \quad\text{repeat until } \|w_{\text{new}} - w_{\text{old}}\| < \varepsilon. $$
Optimization: Stochastic Gradient Descent (SGD)
Full Gradient Descent (slow)
$$ Q \leftarrow Q - \eta\,\nabla_Q J, \qquad P \leftarrow P - \eta\,\nabla_P J. $$
Requires scanning all training ratings per step --- too slow at scale.
SGD (preferred)
The full gradient for one entry $q_{if}$ is a sum over all ratings of item $i$:
$$ \frac{\partial J}{\partial q_{if}} = \sum_{x:\,(x,i)\in R} -2\,(r_{xi}-q_i\cdot p_x)\,p_{xf} \;+\; 2\lambda_2\,q_{if}. $$
SGD approximates this with a single rating $(x,i)$ at a time. Update after each rating:
- Compute error: $\varepsilon_{xi} = 2(r_{xi} - q_i\cdot p_x)$.
- Update item vector: $q_i \leftarrow q_i + \mu_1(\varepsilon_{xi}\,p_x - 2\lambda_2\,q_i)$.
- Update user vector: $p_x \leftarrow p_x + \mu_2(\varepsilon_{xi}\,q_i - 2\lambda_1\,p_x)$.
Here $\mu_1,\mu_2$ are learning rates (not the global mean $\mu$).
Repeat sweeps over all training ratings until convergence. Each step is $O(k)$ vs. $O(k|R|)$ for full GD. Initialize $P$, $Q$ with standard SVD (treat missing entries as 0).
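The SGD updates can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the names `sgd_mf` and `rmse`, a single learning rate `lr` and a single regularisation weight `lam` for both factor matrices, the hyperparameter values, and small random initialisation (instead of the SVD initialisation mentioned above) are all simplifications for the example.

```python
import numpy as np


def sgd_mf(ratings, n_users, n_items, k=2, lr=0.01, lam=0.1, epochs=200, seed=0):
    """ratings: list of (user, item, rating). Returns factor matrices (P, Q)."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors p_x
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors q_i
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):  # one sweep in random order
            x, i, r = ratings[idx]
            eps = 2.0 * (r - Q[i] @ P[x])          # error term for this rating
            q_old = Q[i].copy()                    # use the old q_i in the p_x update
            Q[i] += lr * (eps * P[x] - 2.0 * lam * Q[i])
            P[x] += lr * (eps * q_old - 2.0 * lam * P[x])
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the predictions q_i . p_x on `ratings`."""
    return float(np.sqrt(np.mean([(r - Q[i] @ P[x]) ** 2 for x, i, r in ratings])))


toy = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]
P0, Q0 = sgd_mf(toy, 3, 3, epochs=0)   # untrained factors
P1, Q1 = sgd_mf(toy, 3, 3)             # trained factors
print(rmse(toy, P0, Q0), rmse(toy, P1, Q1))
```

Each update touches only one row of `P` and one row of `Q`, which is why a step costs $O(k)$.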
Temporal Dynamics
User tastes and movie popularities drift over time. Extend biases to be time-dependent:
$$ \hat{r}_{xi} = \mu + b_x(t) + b_i(t) + q_i\cdot p_x, \qquad b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}, $$
where each bin covers $\approx 10$ consecutive weeks. User preference vectors $p_x(t)$ can similarly be made time-dependent.
Netflix Prize Performance Summary
| Method | RMSE |
|---|---|
| Global average | 1.130 |
| User average | 1.065 |
| Movie average | 1.053 |
| Netflix baseline | 0.951 |
| Basic CF | 0.940 |
| CF + biases + learned weights | 0.910 |
| Latent factors | 0.900 |
| Latent factors + biases | 0.890 |
| Latent factors + biases + temporal | 0.876 |
| Grand Prize (BellKor ensemble) | 0.856 |
High Dimensional Data
Locality-sensitive hashing
- Problem: given $q$, find data points $x_j$, s.t. $d(q,x_j)\leq s$. Naive solution $O(N)$ time, LSH gives $O(1)$.
- Problem: given $s>0$, find all $(x_i,x_j)$, such that $d(x_i,x_j)<s$. Naive solution $O(N^2)$, LSH gives $O(N)$.
Finding similar documents
Shingling - min hashing - LSH
After shingling, the goal becomes: find similar columns in large sparse matrices with high Jaccard similarity
Shingling
- Shingling: $D$ turns into $S(D):=$ the set of sequences of $k$ tokens in $D$.
- Jaccard similarity $\textrm{sim}(D_1,D_2) = \frac{|S(D_1)\cap S(D_2)|}{{|S(D_1)\cup S(D_2)|}}$, and Jaccard distance $d(C_1,C_2):= 1- \frac{|C_1\cap C_2|}{|C_1\cup C_2|}$, where $C_i := S(D_i)$.
- Sets into boolean Matrix: $c_ {ij} = \mathbf{1}[\textrm{shingle }i\textrm{ in Set(}\text{Document}_j\text{)}]$
Min-Hashing: Convert large sets to short signatures, while preserving similarity
- minhash function: Permute the rows of the Boolean matrix using some permutation $\pi_ {k}$, $h_\pi(C_j):= \min\{\pi(i):c_ {ij}=1\}$. Choose $K$ permutations $\pi_k$ to create a signature for each column (a vector $(h_ {\pi_k}(C))_ {1\leq k\leq K}$) and signature matrix $ M = (h_ {\pi_k}(C_j))_ {1\leq k\leq K,j}$ with fewer rows.
- $\mathrm{Pr}[h_ \pi(C_ 1) = h_ \pi(C_ 2)] = \textrm{sim}(C_ 1, C_ 2)$. The similarity of two signatures is the fraction of the hash functions in which they agree. Therefore, $\mathbb{E}(\textrm{sim}(\text{sig}(C_1),\text{sig}(C_2))) = \textrm{sim}(C_1,C_2).$
- Implementation: instead of actually permuting rows of data, use row hashing to simulate it, i.e. a permutation $\pi_i$ is replaced by a hash function $h_i(x) = ((a_i x+b_i) \bmod p) \bmod N$, where $a_i,b_i$ are random integers and $p>N$ is prime.
- Initialize: Set all slots in the signature matrix to infinity: $M(i, c) = \infty$ (for hash function $i$, column/document $c$).
- Scan: Read the original matrix row by row.
- Update: If row $j$ has a 1 in column $c$:
  - Compute $h_i(j)$ for all $K$ hash functions.
  - Apply update rule: $M(i, c) \leftarrow \min(M(i, c), h_i(j))$
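The row-scan procedure above can be sketched as follows. The names `minhash_signatures` and `signature_similarity`, the prime $2^{31}-1$, and the representation of each column as a set of row indices are assumptions of this example.

```python
import random


def minhash_signatures(columns, n_rows, n_hashes=100, seed=0, prime=2_147_483_647):
    """columns: list of sets of row indices holding a 1. Returns the signature matrix."""
    rng = random.Random(seed)
    # One (a, b) pair per simulated permutation: h(x) = ((a*x + b) mod p) mod N.
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n_hashes)]
    sig = [[float("inf")] * len(columns) for _ in range(n_hashes)]

    for row in range(n_rows):                       # scan the matrix row by row
        cols_with_one = [c for c, col in enumerate(columns) if row in col]
        if not cols_with_one:
            continue
        hashes = [((a * row + b) % prime) % n_rows for a, b in coeffs]
        for i, h in enumerate(hashes):
            for c in cols_with_one:
                if h < sig[i][c]:                   # keep the minimum hash seen so far
                    sig[i][c] = h
    return sig


def signature_similarity(sig, c1, c2):
    """Fraction of hash functions on which two columns agree."""
    return sum(1 for row in sig if row[c1] == row[c2]) / len(sig)


cols = [{0, 2, 5}, {0, 2, 5}, {1, 3}]
sig = minhash_signatures(cols, n_rows=6, n_hashes=50)
print(signature_similarity(sig, 0, 1), signature_similarity(sig, 0, 2))
```

Identical columns always agree on every hash function. For other pairs the agreement fraction only estimates the Jaccard similarity, and with few rows the hash functions collide often enough to bias that estimate.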
LSH
Divide signature matrix $M$ into $b$ bands with each band having $r$ rows, and candidate column pairs are those that hash to the same bucket for $\geq 1$ band.
Need to tune $b$, $r$, $K=b*r$ to balance false positive and false negative.
A second pass is needed to verify that candidate pairs landing in the same bucket really have similar signatures, and optionally a third to verify that similar signatures correspond to similar documents.
Assume $C_1$ and $C_2$ have similarity $t$. Then a given band is identical with probability $t^r$, so $\mathrm{Pr}(\textrm{no band identical}) = (1-t^r)^b$ and $\mathrm{Pr}(\textrm{at least one band identical}) = 1-(1-t^r)^b.$
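To see how $b$ and $r$ shape the S-curve, the probability can simply be evaluated. This is a small sketch; the function names and the choice $b=20$, $r=5$ are illustrative, and $(1/b)^{1/r}$ is the usual rough estimate of where the curve rises most steeply.

```python
def candidate_probability(t, r, b):
    """Probability that two columns with similarity t share at least one band."""
    return 1.0 - (1.0 - t ** r) ** b


def approx_threshold(r, b):
    """Rough location of the steep part of the S-curve: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


for t in (0.3, 0.5, 0.8):
    print(t, round(candidate_probability(t, r=5, b=20), 4))
print(round(approx_threshold(r=5, b=20), 3))
```

With $b=20$ and $r=5$, pairs at similarity 0.8 are almost always candidates, while pairs at 0.3 become candidates less than 5% of the time.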
General metric space case
1. Locality-Sensitive (LS) Families A hash family is $(d_1, d_2, p_1, p_2)$-sensitive if for points $x, y$:
- If $d(x,y) \le d_1 \rightarrow Pr[h(x)=h(y)] \ge p_1$
- If $d(x,y) \ge d_2 \rightarrow Pr[h(x)=h(y)] \le p_2$
2. Amplification (The S-Curve) Goal: Create a step-function probability around a similarity threshold $t$.
- AND ($r$ functions/rows): Match requires all $r$ to match. $Pr \rightarrow p^r$. Lowers both false positives and true positives.
- OR ($b$ functions/bands): Match requires $\ge 1$ of $b$ to match. $Pr \rightarrow 1 - (1-p)^b$. Raises both true positives and false positives.
- AND-OR Cascade: $Pr = 1 - (1-s^r)^b$. Standard LSH technique.
- OR-AND Cascade: $Pr = (1 - (1-s)^b)^r$.
3. Specific Distance Metrics
- Jaccard: Min-Hashing. $Pr[match] = 1 - d(x,y)$. Forms a $(d_1, d_2, 1-d_1, 1-d_2)$-sensitive family.
- Cosine: Random Hyperplanes. $h_v(x) = +1$ if $v \cdot x \ge 0$, else $-1$. $Pr[match] = 1 - d(x,y)/\pi$.
- Euclidean: Random Projections. Project points onto random lines partitioned into buckets of width $a$. Yields a $(a/2, 2a, 1/2, 1/3)$-sensitive family.
Clustering
Problem: Given a set of points in a metric space, group them into clusters. Some points may be outliers.
Cluster Representatives
- Euclidean space: use centroid --- the average of all points in the cluster. This may be an artificial point not in the data.
- Non-Euclidean space: use clustroid --- an existing data point that minimises some aggregate distance to all other cluster members:
$$ c^* = \operatorname*{arg\,min}_{c\in C} \sum_{x\in C} d(x,c)^2. $$
Curse of Dimensionality
In $d$ dimensions with data spread uniformly over the unit cube, capturing a fraction $f$ of the data requires a sub-cube of side length $f^{1/d}$, which approaches 1 as $d$ grows. As $d$ grows, almost all pairs of points become nearly equidistant, making meaningful clustering difficult.
Hierarchical (Agglomerative) Clustering
Algorithm
- Initialise: each point is its own cluster.
- Repeatedly find the two nearest clusters and merge them.
- Stop when a criterion is met (number of clusters, diameter threshold, etc.).
The result can be visualised as a dendrogram with the merge distance on the $y$-axis.
Cluster Distance Definitions
- Centroid / clustroid distance: distance between cluster representatives. Best for convex, well-separated clusters.
- Single-link: $\min$ distance between any pair $(x\in C_1,\,y\in C_2)$. Can chain clusters together; handles concentric shapes.
- Average-link: average distance over all pairs $(x\in C_1,\,y\in C_2)$.
- Cohesion-based: merge the pair whose union is most cohesive (smallest diameter, smallest average distance, or highest density in the merged set).
$k$-Means Clustering
Objective
Given Euclidean space and a fixed $k$, minimise the sum of squared distances from each point to its nearest centroid:
$$ \min_{c_1,\ldots,c_k} \sum_i \min_j |x_i - c_j|^2. $$
Exact optimisation is NP-hard; Lloyd's algorithm finds an approximate solution.
Lloyd's Algorithm
- Initialise: pick $k$ centroids (e.g. $k$ random data points).
- Assignment: assign each point to its nearest centroid.
- Update: recompute each centroid as the mean of its assigned points.
- Repeat the assignment and update steps until no point changes cluster.
Converges to a local optimum; quality depends heavily on initialisation.
$k$-Means++ Initialisation
To spread initial centres across the data:
- Pick the first centre uniformly at random.
- For each subsequent centre, select point $p$ with probability proportional to $D(p)^2$, where $D(p)$ is the distance from $p$ to the nearest already-chosen centre.
- Repeat until $k$ centres are chosen.
This biases selection toward points far from existing centres, reducing worst-case behaviour.
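Lloyd's algorithm with $k$-means++ seeding can be sketched in NumPy as follows. The function names, the iteration cap, and the assumption that the data contain at least $k$ distinct points are choices of this example, not of the notes.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Pick k initial centres; later ones with probability proportional to D(p)^2."""
    centres = [X[rng.integers(len(X))]]            # first centre: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already-chosen centre.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)


def lloyd(X, k, seed=0, max_iter=100):
    """Return (centres, labels) after Lloyd iterations from a k-means++ start."""
    rng = np.random.default_rng(seed)
    centres = kmeans_pp_init(X, k, rng)
    labels = None
    for _ in range(max_iter):
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)          # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # no point changed cluster
        labels = new_labels
        for j in range(k):                         # update step
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres, labels


X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centres, labels = lloyd(X, k=2)
print(labels)
```

On the two well-separated groups of this toy data, the first three points share one label and the last three the other.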
Choosing $k$: Elbow Method
Plot average distance to centroid vs. $k$. The optimal $k$ is at the "elbow" --- the point of diminishing returns where adding more clusters yields little improvement.
BFR Algorithm (Large-Scale $k$-Means)
Assumption: clusters are normally distributed (axis-aligned Gaussian ellipses) around centroids. Goal: process disk-resident data in $O(\text{clusters})$ memory.
Three Sets of Points
- DS (Discard Set): points close enough to a centroid; summarised and discarded.
- CS (Compressed Set): groups of points that are close together but not near any centroid; summarised but not yet assigned.
- RS (Retained Set): isolated points stored as-is; waiting to join a CS or DS.
Cluster Summary Statistics
Each cluster (and CS group) is stored as $2d+1$ values, where $d$ = number of dimensions:
- $N$: number of points.
- $\mathbf{SUM}$: $d$-vector; $i$-th component $= \sum x_i$ over all points in the cluster.
- $\mathbf{SUMSQ}$: $d$-vector; $i$-th component $= \sum x_i^2$.
Derived quantities:
$$ \text{centroid}_i = \frac{\mathbf{SUM}_i}{N}, \qquad \sigma_i = \sqrt{\frac{\mathbf{SUMSQ}_i}{N} - \left(\frac{\mathbf{SUM}_i}{N}\right)^2}. $$
Algorithm Steps (per batch)
- Load a new batch of points from disk.
- For each new point, if it is close enough to an existing centroid (by Mahalanobis distance), add it to that cluster's DS and update $N$, $\mathbf{SUM}$, $\mathbf{SUMSQ}$.
- Cluster remaining points (plus old RS) using in-memory clustering. Subclusters $\to$ CS; isolated points $\to$ RS.
- Optionally merge CS groups that are close enough to each other or to a DS centroid.
Final step: merge all CS groups and RS points into their nearest cluster.
Mahalanobis Distance
Standard Euclidean distance is inappropriate for axis-aligned elliptical clusters. The Mahalanobis distance from point $x$ to centroid $c$ (with per-dimension std. dev. $\sigma_i$) is:
$$ d_M(x,\,c) = \sqrt{\sum_{i=1}^d \left(\frac{x_i - c_i}{\sigma_i}\right)^2}. $$
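A point that lies one standard deviation from the centroid in every dimension has $d_M=\sqrt{d}$, so acceptance thresholds are usually stated as multiples of $\sqrt{d}$. In one dimension about 68% of a Gaussian cluster's points satisfy $d_M<1$; in higher dimensions the fraction with $d_M<\sqrt{d}$ is somewhat lower.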
Merging Two CS Groups
Compute the combined $N$, $\mathbf{SUM}$, $\mathbf{SUMSQ}$ from the two groups' statistics (no need to revisit raw points) and merge if the resulting variance is below a threshold.
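The summary statistics, their merge, and the Mahalanobis distance fit in one small class. This is an illustrative sketch: the class name `Summary` and its method names are assumptions, and `mahalanobis` assumes every per-dimension standard deviation is non-zero.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Summary:
    """BFR-style summary of a point set: N, SUM and SUMSQ per dimension."""
    n: int
    total: List[float]
    total_sq: List[float]

    @classmethod
    def of(cls, points: Sequence[Sequence[float]]) -> "Summary":
        dims = len(points[0])
        return cls(
            n=len(points),
            total=[sum(p[i] for p in points) for i in range(dims)],
            total_sq=[sum(p[i] ** 2 for p in points) for i in range(dims)],
        )

    def merge(self, other: "Summary") -> "Summary":
        # Merging needs only the statistics, never the raw points.
        return Summary(
            self.n + other.n,
            [a + b for a, b in zip(self.total, other.total)],
            [a + b for a, b in zip(self.total_sq, other.total_sq)],
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.total]

    def std(self) -> List[float]:
        # sigma_i = sqrt(SUMSQ_i / N - (SUM_i / N)^2), clipped at 0 against rounding.
        return [math.sqrt(max(q / self.n - (s / self.n) ** 2, 0.0))
                for s, q in zip(self.total, self.total_sq)]

    def mahalanobis(self, x: Sequence[float]) -> float:
        return math.sqrt(sum(((xi - ci) / si) ** 2
                             for xi, ci, si in zip(x, self.centroid(), self.std())))


merged = Summary.of([(0.0, 0.0)]).merge(Summary.of([(2.0, 4.0)]))
print(merged.centroid(), merged.std(), merged.mahalanobis((2.0, 4.0)))
```

A merge test would compute `merged.std()` and accept the merge only if the resulting variances stay below a chosen threshold.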
Dimensionality Reduction: SVD & CUR
Singular Value Decomposition (SVD)
Definition
Any real $m\times n$ matrix $A$ can be decomposed as
$$ A \;=\; U\,\Sigma\,V^{\top}, $$
where:
- $U$ ($m\times r$): left singular vectors (column-orthonormal, $U^{\top}U=I$).
- $\Sigma$ ($r\times r$): diagonal matrix of singular values $\sigma_1\geq\sigma_2\geq\cdots\geq 0$.
- $V$ ($n\times r$): right singular vectors (column-orthonormal, $V^{\top}V=I$).
- $r = \mathrm{rank}(A)$.
Equivalently, $A = \sum_i \sigma_i\, \mathbf{u}_i \mathbf{v}_i^{\top}$ (sum of rank-1 outer products).
Best Low-Rank Approximation
Let $B$ be the SVD of $A$ with all but the top-$k$ singular values set to zero. Then
$$ B = \operatorname*{arg\,min}_{\mathrm{rank}(X)\leq k} \|A - X\|_F, $$
where $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$ is the Frobenius norm. Setting $\sigma_i=0$ for $i>k$ zeroes out the corresponding rank-1 components.
Choosing the Rank $k$
Define the energy of a set of singular values as the sum of their squares. Keep the smallest $k$ such that the retained energy is $\geq 90\%$ of the total:
$$ \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^r \sigma_i^2} \;\geq\; 0.90. $$
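The energy rule and the resulting truncation can be sketched with NumPy's SVD. The function names `rank_for_energy` and `truncate` are assumptions of this example.

```python
import numpy as np


def rank_for_energy(A, fraction=0.90):
    """Smallest k whose top-k singular values retain `fraction` of the energy."""
    sigma = np.linalg.svd(A, compute_uv=False)     # singular values, descending
    energy = np.cumsum(sigma ** 2) / np.sum(sigma ** 2)
    k = int(np.searchsorted(energy, fraction)) + 1
    return min(k, len(sigma))


def truncate(A, k):
    """Rank-k approximation: keep only the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k]


A = np.diag([3.0, 2.0, 1.0])
k = rank_for_energy(A)                             # energies 9/14, 13/14, 14/14
print(k, np.linalg.norm(A - truncate(A, k)))
```

For this matrix the top two singular values hold $13/14 \approx 93\%$ of the energy, so $k=2$, and the Frobenius error of the truncation equals the dropped singular value, 1.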
Interpretation (Concept Space)
In a users-by-movies matrix: $U$ = user-to-concept loadings, $V$ = movie-to-concept loadings, $\Sigma$ = concept strengths. $U\Sigma$ gives coordinates of users projected onto the concept axes.
To map a query $q$ (a row vector in original space) into concept space:
$$ q_{\mathrm{concept}} = q\,V. $$
Two users with zero ratings in common can still have high similarity in concept space, capturing latent shared preferences.
SVD Drawbacks
- Interpretability: singular vectors are dense linear combinations of all columns/rows.
- Sparsity loss: even if $A$ is sparse, $U$ and $V$ are dense.
Computing SVD via Power Iteration
Power Iteration (largest eigenpair of symmetric $M$)
- Start with any guess $x_0$.
- Iterate: $x_{k+1} = M x_k \;/\; \|M x_k\|$ until convergence.
- Eigenvalue: $\lambda = x^{\top} M x$.
Deflation (subsequent eigenpairs)
After finding eigenpair $(\lambda, x)$, eliminate its contribution and recurse:
$$ M^* := M - \lambda\, x\, x^{\top}, $$
then apply power iteration to $M^*$.
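Power iteration with deflation can be sketched as follows. The function names, the random start vector, and the stopping tolerance are assumptions of this example, which is intended for symmetric positive semi-definite matrices such as $A^{\top}A$.

```python
import numpy as np


def power_iteration(M, n_iter=1000, tol=1e-12, seed=0):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of symmetric M."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = M @ x
        norm = np.linalg.norm(y)
        if norm == 0.0:                  # x lies in the null space; nothing to refine
            break
        y /= norm
        done = np.linalg.norm(y - x) < tol
        x = y
        if done:
            break
    return float(x @ M @ x), x           # eigenvalue via x^T M x


def top_eigenpairs(M, k):
    """Top-k eigenpairs by repeated power iteration and deflation."""
    M = np.array(M, dtype=float)         # work on a copy
    pairs = []
    for _ in range(k):
        lam, x = power_iteration(M)
        pairs.append((lam, x))
        M -= lam * np.outer(x, x)        # M* := M - lambda x x^T
    return pairs


pairs = top_eigenpairs(np.array([[2.0, 1.0], [1.0, 2.0]]), 2)
print([round(lam, 6) for lam, _ in pairs])
```

The example matrix has eigenvalues 3 and 1; the first call finds 3, and after deflation the same routine finds 1.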
Connection to SVD
$$ A^{\top}A = V\Sigma^2 V^{\top}, \qquad AA^{\top} = U\Sigma^2 U^{\top}. $$
So $V$ (and $\Sigma$) come from eigenpairs of $A^{\top}A$, and $U$ from $AA^{\top}$. Full SVD complexity: $O(nm^2)$ or $O(n^2 m)$ (whichever is smaller).
CUR Decomposition
Motivation
CUR preserves sparsity: it uses actual rows and columns of $A$ as basis vectors, so if $A$ is sparse, $C$ and $R$ are also sparse.
Structure
$$ A \;\approx\; C\,U\,R, $$
where:
- $C$ ($m\times c$): $c$ randomly sampled columns of $A$.
- $R$ ($r\times n$): $r$ randomly sampled rows of $A$.
- $U$ ($c\times r$): pseudoinverse of the intersection $W$ of $C$ and $R$.
Computing $U$ (Pseudoinverse)
Let $W$ ($r\times c$) be the submatrix of $A$ at the intersection of the sampled rows and columns. Compute SVD $W = X Z Y^{\top}$; then
$$ U = W^+ = Y\,Z^+\,X^{\top}, $$
where $Z^+$ is diagonal with $Z^+_{ii} = 1/Z_{ii}$ (reciprocals of non-zero singular values; zero if $Z_{ii}=0$).
Sampling Algorithm
Sample columns (and rows symmetrically) with probability proportional to their squared Frobenius norm (importance):
$$ P(\text{column }j) = \frac{\sum_i A_{ij}^2}{\sum_{i,j} A_{ij}^2}. $$
Normalize each sampled column $j$ by $\sqrt{c\cdot P(j)}$ to remove bias:
$$ C_{:,\,i} = \frac{A_{:,\,j}}{\sqrt{c\cdot P(j)}}. $$
The same column can be sampled more than once.
Quality Guarantee
Sample $c = r = O(k\log k\,/\,\varepsilon^2)$ columns and rows. Then with probability $\geq 0.98$:
$$ \|A - CUR\|_F \;\leq\; (2+\varepsilon)\,\|A - A_k\|_F, $$
where $A_k$ is the best rank-$k$ SVD approximation. In practice, $4k$ columns/rows suffice.
SVD vs. CUR Summary
| Property | SVD | CUR |
|---|---|---|
| Factor matrices | $U,V$: dense | $C,R$: sparse (actual rows/cols) |
| Middle matrix | $\Sigma$: sparse, small | $U$: dense, small |
| Optimality | Exact best rank-$k$ | $(2+\varepsilon)$ factor |
| Interpretability | Low (dense combos) | High (actual data rows/cols) |
| Sparsity preserved | No | Yes |
Graph Data
Link Analysis: PageRank
Core Idea
Model the web as a directed graph (nodes = pages, edges = hyperlinks). Not all pages are equally important; rank them by link structure.
Two intuitions:
- Random surfer: PageRank of page $j$ = limiting probability that a random surfer (following out-links uniformly at random) is at $j$.
- Recursive importance: A page is important if important pages link to it.
Flow Formulation
Define the rank $r_j$ of page $j$:
$$ r_j = \sum_{i\to j} \frac{r_i}{d_i}, $$
where $d_i$ is the out-degree of page $i$. Normalisation: $\sum_j r_j = 1$.
Matrix Formulation
Define the column-stochastic transition matrix $M$:
$$ M_{ji} = \frac{1}{d_i} \text{ if } i\to j, \qquad M_{ji} = 0 \text{ otherwise.} $$
Each column of $M$ sums to 1. The flow equations become:
$$ r = M\,r. $$
Thus $r$ is the principal eigenvector of $M$ (eigenvalue 1).
Power Iteration
Algorithm:
- Initialise: $r^{(0)} = [1/N,\ldots,1/N]^\top$.
- Iterate: $r^{(t+1)} = M\,r^{(t)}$, i.e. $r_j^{(t+1)} = \sum_{i\to j} r_i^{(t)} / d_i$.
- Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$.
Typically $\approx 50$ iterations suffice. $r^{(t)}$ is the probability distribution over pages of a random surfer at time $t$; convergence gives the stationary distribution of the random walk.
Why it works: Write $r^{(0)} = \sum_k c_k x_k$ in the eigenbasis of $M$. Then $M^t r^{(0)} = \lambda_1^t\left[c_1 x_1 + \sum_{k\geq 2} c_k(\lambda_k/\lambda_1)^t x_k\right]$. Since $\lambda_1 > |\lambda_k|$ for $k\geq 2$, the ratio $(\lambda_k/\lambda_1)^t\to 0$, leaving only $x_1$ (the principal eigenvector).
Problems with Naive PageRank
- Dead ends (pages with no out-links): the column of $M$ sums to 0; rank "leaks out" --- all scores converge to 0.
- Spider traps (closed group of pages with no external out-links): the random walk gets stuck; all rank is absorbed by the trap.
Google PageRank with Teleportation
At each step the random surfer either:
- With probability $\beta$ ($\approx 0.8$--$0.9$), follows a random out-link.
- With probability $1-\beta$, teleports to a uniformly random page.
From a dead-end, always teleport (probability 1).
PageRank equation (Brin--Page, 1998):
$$ r_j = \sum_{i\to j}\beta\frac{r_i}{d_i} + \frac{1-\beta}{N}. $$
Google matrix:
$$ A = \beta M + (1-\beta)\frac{\mathbf{1}\mathbf{1}^\top}{N}, \qquad r = A\,r. $$
Efficient sparse form (avoids storing the dense $N\times N$ matrix):
$$ r = \beta M\,r + \frac{1-\beta}{N}\,\mathbf{1}. $$
In each iteration, compute $r' = \beta M r$ (sparse), then add $(1-S)/N$ to each entry, where $S = \sum_j r'_j$. The term $1-S$ is the rank not carried by links: the teleport mass plus whatever leaked through dead ends (it equals exactly $1-\beta$ when there are no dead ends).
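The sparse update with redistribution of leaked rank can be sketched in plain Python. The function name `pagerank` and the adjacency-dictionary input (every node must appear as a key, with an empty list for dead ends) are assumptions of this example.

```python
def pagerank(out_links, beta=0.85, eps=1e-10, max_iter=1000):
    """out_links: dict node -> list of successors. Returns dict node -> rank."""
    nodes = list(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # r' = beta * M * r, computed edge by edge (sparse).
        r_new = {v: 0.0 for v in nodes}
        for i, succs in out_links.items():
            for j in succs:
                r_new[j] += beta * r[i] / len(succs)
        # Re-insert teleport mass and rank leaked through dead ends.
        leaked = 1.0 - sum(r_new.values())
        for v in nodes:
            r_new[v] += leaked / n
        converged = sum(abs(r_new[v] - r[v]) for v in nodes) < eps
        r = r_new
        if converged:
            break
    return r


cycle = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
dead_end = pagerank({"a": ["b"], "b": []})
print(cycle, dead_end)
```

On the three-node cycle every page gets rank 1/3. In the second graph `b` is a dead end, yet the ranks still sum to 1 because the leaked mass is spread uniformly.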
Complete PageRank Algorithm
Initialise r_j = 1/N for all j
Repeat until sum_j |r_j_new - r_j_old| < eps:
    r'_j = sum_{i->j} beta * r_i_old / d_i    for all j
    S = sum_j r'_j
    r_j_new = r'_j + (1 - S) / N              // redistribute leaked rank
Undirected Graphs
For an undirected graph with $N$ nodes and $m$ edges, the exact PageRank solution is:
$$ r_v = \frac{d_v}{2m}, $$
i.e. rank is proportional to degree.
Scalable Computation
For $N=10^9$ pages, the rank vector needs $\approx 4$ GB; the sparse matrix $M$ ($\approx10$ links/node) needs $\approx 40$ GB on disk.
- Basic update ($r^{\text{new}}$ fits in RAM): read $M$ and $r^{\text{old}}$ once per iteration. Cost: $2|r| + |M|$.
- Block-based update ($k$ blocks of $r^{\text{new}}$ fit in RAM): scan $M$ once per block. Cost: $k|M| + (k+1)|r|$.
- Block-stripe update (improved): partition $M$ into $k$ stripes, each stripe containing only the entries mapping into one block of $r^{\text{new}}$. Each block of $r^{\text{new}}$ requires only one stripe of $M$. Cost: $|M|(1+\varepsilon) + (k+1)|r|$; $M$ is read only once overall.
Extensions & Limitations
- Topic-Specific (Personalized) PageRank: replace the uniform teleport distribution with a topic-specific distribution; gives topic-sensitive rankings.
- TrustRank: teleport only to manually verified "seed" pages to combat link spam.
- HITS (Hubs & Authorities): separate hub score ($h$) and authority score ($a$); $a_j = \sum_{i\to j} h_i$, $h_i = \sum_{j:i\to j} a_j$.
- Limitation: standard PageRank measures generic popularity and is susceptible to link spam / artificial link topologies.
Topic-Specific PageRank, Web Spam & TrustRank
Topic-Specific (Personalized) PageRank
Motivation
Standard PageRank measures generic importance. Topic-Specific PageRank biases the random walk toward a topic by restricting teleportation to a small set $S$ of topic-relevant pages (the teleport set). Each choice of $S$ produces a different ranking vector $r_S$.
Matrix Formulation
$$ A_{ij} = \beta M_{ij} + \frac{1-\beta}{|S|}\,\mathbf{1}[i\in S]. $$
Pages outside $S$ receive no teleport probability. Computation is identical to standard PageRank (sparse $M$ plus a bias vector); Power Iteration still converges.
Choosing $S$
Use a topic taxonomy (e.g. the 16 DMOZ categories). Determine the relevant topic from: (1) explicit user selection; (2) query classification; (3) browsing history / bookmarks.
Random Walk with Restarts
Set $S = \{q\}$ for a single query node $q$. The stationary distribution then gives the proximity of every node to $q$: nearby nodes accumulate high visit probability.
A good proximity measure should capture: multiple connections, multiple paths, degree of intermediate nodes, and penalise long paths. Shortest-path and network-flow measures fail on at least one of these criteria; random walk with restarts satisfies all of them.
Web Spam
Term Spam (1st generation)
Early search engines ranked by keyword frequency. Spammers stuffed invisible text (same colour as background) or copied top-ranked pages verbatim. Defeat: use anchor text (what others say about the page) and PageRank (pages with no inbound links cannot rank highly).
Link Spam / Spam Farms (2nd generation)
Spammers build artificial link topologies to inflate PageRank of a target page $t$.
Link farm structure:
- $M$ owned pages each link to $t$.
- $t$ links back to all $M$ owned pages (recirculating rank).
- Spammer also posts links to $t$ on accessible pages (blogs, forums).
Mathematical analysis. Let $N$ = total pages, $x$ = PageRank acquired from accessible pages, $y$ = PageRank of $t$. Each owned page has rank $\beta y/M + (1-\beta)/N$. Substituting into the flow equation for $t$ and discarding small terms:
$$ y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta}\cdot\frac{M}{N}. $$
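For $\beta=0.85$: $1/(1-\beta^2)\approx 3.6$, so the farm multiplies the externally acquired rank $x$ by about 3.6. The second term grows linearly with $M$, so by adding owned pages the spammer can make $y$ as large as desired.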
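A quick numerical check of the closed form; the function name and the example values of $x$, $M$ and $N$ are illustrative assumptions.

```python
def spam_target_rank(x, owned, total, beta=0.85):
    """y = x / (1 - beta^2) + (beta / (1 + beta)) * M / N for a spam farm."""
    amplification = 1.0 / (1.0 - beta ** 2)   # multiplier on external rank x
    per_page = beta / (1.0 + beta)            # gain per unit of M / N
    return amplification * x + per_page * owned / total


print(spam_target_rank(1.0, owned=0, total=10))           # amplification only
print(spam_target_rank(0.0, owned=1000, total=10 ** 6))   # farm contribution only
```

The first call isolates the amplification factor, and the second shows the contribution of 1000 owned pages in a web of one million pages.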
TrustRank
Definition: Topic-Specific PageRank with teleport set = a small set of human-verified trusted pages.
Key principle: Good pages rarely link to spam; trust propagates through links but attenuates with distance (split equally across out-links, scaled by $\beta$).
Algorithm
- Select seed set: pick top-$k$ pages by PageRank, or use controlled domains (e.g. .edu, .gov, .mil).
- Human labelling: an oracle labels each seed page as trusted or spam.
- Run Topic-Specific PageRank with teleport set = trusted seed pages.
- Classify: pages whose TrustRank falls below a threshold are marked spam.
Trust propagation (formal)
Set trust of each trusted seed to 1. Page $p$ with trust $t_p$ and out-degree $|o_p|$ confers trust $\beta\,t_p/|o_p|$ to each out-neighbour. Trust is additive; within a scaling factor this equals PageRank with the trusted pages as the teleport set.
Spam Mass
Complementary view: estimate what fraction of a page's PageRank originates from spam.
Let $r_p$ = standard PageRank of page $p$, $r_p^+$ = TrustRank of $p$ (PageRank with trusted teleport set). Define:
$$ \mathrm{SpamMass}(p) = \frac{r_p - r_p^+}{r_p}. $$
Pages with high spam mass ($\approx 1$) receive nearly all their rank from spam sources and are classified as spam.
Community Detection
Graph Partitioning & Conductance
Goal
Divide an undirected graph $G=(V,E)$ into clusters (communities) that are internally dense and externally sparse.
Cut and Conductance
For a subset $A\subseteq V$:
- Cut: $\mathrm{cut}(A) = |\{(i,j)\in E : i\in A,\, j\notin A\}|$.
- Volume: $\mathrm{vol}(A) = \sum_{i\in A} d_i$ (sum of degrees).
- Conductance: $$ \phi(A) = \frac{\mathrm{cut}(A)}{\min(\mathrm{vol}(A),\, 2m - \mathrm{vol}(A))}, $$ where $m=|E|$. Lower conductance $\Rightarrow$ better cluster. Conductance produces more balanced partitions than minimum cut.
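The three quantities above can be computed as in the following sketch. The adjacency-dict representation and the two-triangle example graph are assumptions made for illustration.

```python
def conductance(adj, A):
    """Conductance of node set A in an undirected graph.

    Assumption: adj maps node -> set of neighbours, with no self-loops.
    """
    A = set(A)
    cut = sum(1 for u in A for v in adj[u] if v not in A)
    vol_A = sum(len(adj[u]) for u in A)
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2m
    denom = min(vol_A, two_m - vol_A)
    return cut / denom if denom > 0 else float("inf")

# Two triangles joined by the single edge 2 -- 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(adj, {0, 1, 2})  # cut = 1, vol = 7
phi_single = conductance(adj, {0})          # cut = 2, vol = 2
```

The triangle has conductance $1/7$, while a single node has conductance 1, which matches the intuition that the triangle is the better cluster.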
Local Clustering via Approximate PPR
Algorithm (PageRank-Nibble)
- Pick a seed node $s$.
- Compute approximate Personalized PageRank (PPR) with teleport set $\{s\}$.
- Sort nodes by decreasing PPR score: $r_1 > r_2 > \cdots > r_n$.
- Sweep: for each prefix $A_i = \{u_1,\ldots,u_i\}$, compute $\phi(A_i)$. Local minima of $\phi(A_i)$ correspond to good clusters.
Approximate PPR (Push-based)
Maintain estimate $r$ and residual $q = p - r$ (where $p$ is the true PPR vector). Initialise $r=\mathbf{0}$, $q=\mathbf{a}$ (teleport vector). Repeatedly Push from any node $u$ with $q_u/d_u \geq \varepsilon$:
$$\begin{aligned} r'_u &= r_u + (1-\beta)\,q_u, \\ q'_u &= \tfrac{1}{2}\beta\, q_u, \\ q'_v &= q_v + \tfrac{1}{2}\beta\, q_u / d_u \quad\text{for each } u\to v. \end{aligned}$$
Runtime: $O(1/(\varepsilon(1-\beta)))$, independent of graph size. Guarantee: if a cut of conductance $\phi$ and volume $k$ exists, the method finds a cut of conductance $O(\sqrt{\phi / \log k})$.
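A sketch of the push procedure, using the update rule $r_u \mathrel{+}= (1-\beta)q_u$, $q_u \leftarrow \tfrac12\beta q_u$, $q_v \mathrel{+}= \tfrac12\beta q_u/d_u$. The work-stack bookkeeping, the function name, and the example graph are assumptions. The graph is assumed undirected with no isolated nodes.

```python
from collections import defaultdict

def approximate_ppr(adj, seed, beta=0.85, eps=1e-4):
    """Push-based approximate PPR; returns (estimate r, residual q)."""
    r = defaultdict(float)
    q = defaultdict(float)
    q[seed] = 1.0            # all mass starts as residual on the seed
    work = [seed]            # nodes whose residual may exceed the threshold
    while work:
        u = work.pop()
        d_u = len(adj[u])
        if q[u] / d_u < eps:
            continue         # stale entry: nothing to push
        q_u = q[u]
        r[u] += (1.0 - beta) * q_u
        q[u] = 0.5 * beta * q_u
        for v in adj[u]:
            q[v] += 0.5 * beta * q_u / d_u
            work.append(v)
        work.append(u)       # u may still be above the threshold
    return dict(r), dict(q)

# Two triangles joined by the edge 2 -- 3 (illustrative graph).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
r, q = approximate_ppr(adj, seed=0)
```

Each push moves at least $(1-\beta)\,\varepsilon\,d_u$ of the unit mass from $q$ into $r$, which is where the bound on the number of pushes comes from.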
$k$-NN Graph Construction: NN-Descent
Given $n$ data points and a similarity oracle $\sigma$, build an approximate $K$-nearest-neighbour graph.
Algorithm
- Initialise: for each node $v$, set $B[v]$ to $K$ random nodes.
- Iterate:
- Compute reverse neighbours $R[v] = \{u : v\in B[u]\}$.
- General neighbours: $B^*[v] = B[v]\cup R[v]$.
- For each $v$, for each $u_1\in B^*[v]$, for each $u_2\in B^*[u_1]$: compute $\sigma(v, u_2)$ and update $B[v]$ if $u_2$ is closer.
- Stop when no updates occur.
Empirical cost: $O(n^{1.14})$, much better than brute-force $O(n^2)$.
Modularity
Definition
Given a partitioning $S$ of graph $G$ with $n$ nodes and $m$ edges:
$$ Q(G,S) = \frac{1}{2m}\sum_{s\in S}\sum_{i\in s}\sum_{j\in s} \left(A_{ij} - \frac{k_i k_j}{2m}\right), $$
where $A_{ij}$ is the adjacency matrix entry and $k_i$ is the degree of node $i$. The null model is a random graph with the same degree distribution: expected edges between $i$ and $j$ is $k_i k_j/(2m)$. $Q\in[-1,1]$; values $> 0.3$--$0.7$ indicate significant community structure.
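A small sketch of the modularity computation. It uses the identity $\sum_{i,j\in s} k_i k_j = (\sum_{i\in s} k_i)^2$ to avoid the double loop. The function name and test graph are illustrative assumptions.

```python
import numpy as np

def modularity(A, communities):
    """Q(G, S) for a symmetric adjacency matrix A and a list of node sets."""
    A = np.asarray(A, dtype=float)
    k = A.sum(axis=1)        # degrees
    two_m = k.sum()          # 2m
    total = 0.0
    for s in communities:
        idx = np.array(sorted(s))
        internal = A[np.ix_(idx, idx)].sum()     # sum of A_ij over i, j in s
        expected = k[idx].sum() ** 2 / two_m     # sum of k_i k_j / 2m over i, j in s
        total += internal - expected
    return total / two_m

# Two triangles joined by the edge 2 -- 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

Q_split = modularity(A, [{0, 1, 2}, {3, 4, 5}])   # 5/14
Q_whole = modularity(A, [set(range(6))])          # 0
```

Splitting along the bridge gives $Q = 5/14 \approx 0.36$, while putting every node in one community gives $Q = 0$.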
Louvain Algorithm
Greedy modularity maximisation with $O(n\log n)$ runtime.
Two-Phase Iteration
- Phase 1 (local moves): start with each node in its own community. For each node $i$, compute the modularity gain $\Delta Q$ of moving $i$ into each neighbour's community; move $i$ to the community with the largest positive $\Delta Q$. Repeat until no move improves $Q$.
- Phase 2 (aggregation): contract each community into a super-node. Edge weights between super-nodes $=$ sum of edge weights between the corresponding communities. Self-loops $=$ sum of internal edges.
- Return to Phase 1 on the contracted graph; repeat until $Q$ no longer increases.
The hierarchy of contractions yields a dendrogram of communities at multiple resolutions.
Graph Embeddings
Problem Setup
Goal
Learn a mapping $f: V \to \mathbb{R}^d$ such that $\mathrm{similarity}(u,v) \approx \mathbf{z}_u^\top \mathbf{z}_v$, where $\mathbf{z}_v = f(v)$ is the $d$-dimensional embedding of node $v$.
Encoder--Decoder Framework
- Encoder: maps each node to a low-dimensional vector. Simplest: embedding lookup $\mathbf{z}_v = Z\mathbf{e}_v$, where $Z\in\mathbb{R}^{d\times|V|}$ is the embedding matrix.
- Decoder: maps embeddings back to a similarity score (e.g., dot product $\mathbf{z}_u^\top \mathbf{z}_v$).
- Optimise: encoder parameters so that decoded similarity approximates the original graph similarity.
Random-Walk Embeddings (DeepWalk / node2vec)
Core Idea
Define node similarity via co-occurrence on short random walks. Let $N_R(u)$ be the multiset of nodes visited on random walks starting from $u$ under strategy $R$.
Objective
$$ \max_{\mathbf{z}} \sum_{u\in V} \log P\left(N_R(u) \mid \mathbf{z}_u\right) = \max_{\mathbf{z}} \sum_{u\in V} \sum_{v\in N_R(u)} \log P(\mathbf{z}_v \mid \mathbf{z}_u). $$
Softmax parametrisation: $P(\mathbf{z}_v \mid \mathbf{z}_u) = \exp(\mathbf{z}_v\cdot\mathbf{z}_u) / \sum_{n\in V}\exp(\mathbf{z}_n\cdot\mathbf{z}_u)$.
Negative Sampling
Computing the full softmax is $O(|V|)$ per pair. Approximate using $k$ negative samples:
$$ \log\sigma(\mathbf{z}_v\cdot\mathbf{z}_u) + \sum_{i=1}^{k} \mathbb{E}_{n_i\sim P_V} \left[\log\sigma(-\mathbf{z}_{n_i}\cdot\mathbf{z}_u)\right], $$
where $\sigma$ is the sigmoid and $P_V$ is a distribution over nodes (typically proportional to degree). In practice $k=5$--$20$.
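A sketch of the per-pair negative-sampling objective in its word2vec form, $\log\sigma(\mathbf{z}_v\cdot\mathbf{z}_u) + \sum_i \log\sigma(-\mathbf{z}_{n_i}\cdot\mathbf{z}_u)$, with the expectation replaced by $k$ already-drawn negatives. The function names and the array layout (one negative per row of `Z_neg`) are assumptions.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def negative_sampling_objective(z_u, z_v, Z_neg):
    """Objective to maximise for one (u, v) pair and k sampled negatives."""
    positive = log_sigmoid(z_v @ z_u)                # pull co-occurring pair together
    negative = log_sigmoid(-(Z_neg @ z_u)).sum()     # push sampled nodes away
    return positive + negative

# With all-zero embeddings every term equals log(1/2).
value_at_zero = negative_sampling_objective(np.zeros(4), np.zeros(4), np.zeros((3, 4)))
```

The cost per pair is $O(kd)$ instead of the $O(|V|d)$ needed for the full softmax denominator.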
node2vec: Biased Random Walks
Two parameters control the walk:
- Return parameter $p$: controls likelihood of returning to the previous node.
- In-out parameter $q$: controls exploration outward (DFS-like, small $q$) vs. staying local (BFS-like, small $p$).
After traversing edge $(s_1, w)$, unnormalised transition probabilities to next node $t$:
$$ \alpha(t) = \begin{cases} 1/p & \text{if } d(t, s_1) = 0 \text{ (return)}, \\ 1 & \text{if } d(t, s_1) = 1, \\ 1/q & \text{if } d(t, s_1) = 2 \text{ (move away)}. \end{cases} $$
BFS-like walks ($p$ small) capture local/structural roles; DFS-like walks ($q$ small) capture global community structure.
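The case analysis for $\alpha(t)$ translates into a few lines. In this sketch `prev` plays the role of $s_1$ and `cur` the role of $w$. The adjacency-dict representation and the toy graph are assumptions.

```python
def node2vec_weights(adj, prev, cur, p, q):
    """Unnormalised transition weights from cur, having just arrived from prev."""
    weights = {}
    for t in adj[cur]:
        if t == prev:              # distance 0 from prev: return
            weights[t] = 1.0 / p
        elif t in adj[prev]:       # distance 1 from prev: stay close
            weights[t] = 1.0
        else:                      # distance 2 from prev: move away
            weights[t] = 1.0 / q
    return weights

# Toy graph: node 2 is a common neighbour of 0 and 1; node 3 hangs off node 1.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
w = node2vec_weights(adj, prev=0, cur=1, p=2.0, q=4.0)
```

Dividing each weight by their sum gives the actual transition probabilities for the next step.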
Algorithm
- Compute random walk transition probabilities.
- Simulate $r$ walks of length $l$ from each node.
- Optimise embeddings via SGD with negative sampling.
All three steps are parallelisable; linear-time complexity.
Downstream Tasks
- Node classification: predict label $f(\mathbf{z}_i)$.
- Link prediction: predict edge from $f(\mathbf{z}_i, \mathbf{z}_j)$ (concatenation, Hadamard product, sum, or distance).
- Clustering: cluster nodes by their embeddings.
Graph Neural Networks (GNNs)
Motivation
Shallow embedding methods (DeepWalk, node2vec) have $O(|V|d)$ parameters, are transductive (cannot embed unseen nodes), and ignore node features. GNNs address all three limitations by using a shared neural network encoder that aggregates neighbourhood information.
Message-Passing Framework
A GNN layer consists of two steps applied to each node $v$:
- Message: each neighbour $u\in N(v)$ computes a message $\mathbf{m}_u^{(l)} = \mathrm{MSG}^{(l)}(\mathbf{h}_u^{(l-1)})$.
- Aggregation: messages are combined and used to update $v$'s embedding: $\mathbf{h}_v^{(l)} = \mathrm{AGG}^{(l)}\left(\{\mathbf{m}_u^{(l)} : u\in N(v)\},\, \mathbf{m}_v^{(l)}\right)$.
Initialisation: $\mathbf{h}_v^{(0)} = \mathbf{x}_v$ (node features). Final embedding: $\mathbf{z}_v = \mathbf{h}_v^{(L)}$ after $L$ layers. A node at layer $L$ aggregates information from its $L$-hop neighbourhood.
GNN Variants
GCN (Graph Convolutional Network)
$$ \mathbf{h}_v^{(l)} = \sigma\!\left( \sum_{u\in N(v)} \frac{\mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}}{|N(v)|} \right). $$
Message: $\mathbf{m}_u^{(l)} = \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)} / |N(v)|$. Aggregation: sum, then apply activation $\sigma$.
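A dense NumPy sketch of one GCN layer as written above (mean over neighbours, then a shared linear map, then an activation). ReLU as the activation, the row-per-node layout of `H`, and the function name are assumptions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: h_v = ReLU( mean_{u in N(v)} W h_u ).

    A: (n, n) adjacency matrix; H: (n, d_in), one row per node; W: (d_out, d_in).
    """
    deg = A.sum(axis=1, keepdims=True)
    mean_nbr = (A @ H) / np.maximum(deg, 1.0)   # guard against isolated nodes
    return np.maximum(mean_nbr @ W.T, 0.0)      # shared weights, then ReLU

# Path graph 0 - 1 - 2 with scalar features 1, 2, 3 and identity weights.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H0 = np.array([[1.0], [2.0], [3.0]])
H1 = gcn_layer(A, H0, W=np.array([[1.0]]))
```

Every node ends with value 2: the end nodes copy their only neighbour, and the middle node averages 1 and 3. Stacking $L$ such calls reaches the $L$-hop neighbourhood.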
GraphSAGE
Two-stage aggregation:
$$ \mathbf{h}_{N(v)}^{(l)} = \mathrm{AGG}\left(\{\mathbf{h}_u^{(l-1)} : u\in N(v)\}\right), \qquad \mathbf{h}_v^{(l)} = \sigma\!\left(\mathbf{W}^{(l)}\cdot \mathrm{CONCAT}\left(\mathbf{h}_v^{(l-1)},\, \mathbf{h}_{N(v)}^{(l)}\right)\right). $$
AGG can be mean, max, or a learned pooling function.
GAT (Graph Attention Network)
$$ \mathbf{h}_v^{(l)} = \sigma\!\left(\sum_{u\in N(v)} \alpha_{vu}\, \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}\right), $$
where attention weights $\alpha_{vu}$ are learned:
- Compute attention coefficient: $e_{vu} = a\left(\mathbf{W}^{(l)}\mathbf{h}_v^{(l-1)},\, \mathbf{W}^{(l)}\mathbf{h}_u^{(l-1)}\right)$, where $a$ is a small neural network (e.g., a linear layer on the concatenation).
- Normalise via softmax: $\alpha_{vu} = \exp(e_{vu}) / \sum_{k\in N(v)}\exp(e_{vk})$.
Multi-head attention: run $H$ independent attention heads and concatenate (or average) outputs for stability.
Training
Supervised
Node classification with cross-entropy loss:
$$ \mathcal{L} = -\sum_{v\in V}\left[y_v\log\sigma(\mathbf{z}_v^\top\boldsymbol{\theta}) + (1-y_v)\log(1-\sigma(\mathbf{z}_v^\top\boldsymbol{\theta}))\right]. $$
Train weight matrices $\mathbf{W}^{(l)}, \mathbf{B}^{(l)}$ via SGD.
Unsupervised
Use random-walk co-occurrence (as in node2vec) as the similarity signal for the GNN encoder.
Graph Augmentation
Feature augmentation
When the graph has no node features, use: (a) constant features (all nodes get the same value --- inductive, low cost), or (b) one-hot node IDs (high expressive power, transductive, $O(|V|)$ cost). Other useful augmentations: node degree, PageRank, clustering coefficient.
Structure augmentation
- Virtual edges: add 2-hop connections ($A + A^2$) to improve message passing in sparse graphs (e.g., bipartite $\to$ collaboration graphs).
- Virtual nodes: a single node connected to all others reduces diameter to 2.
- Neighbourhood sampling: randomly sample a fixed number of neighbours per node to reduce computation; in expectation recovers the full-neighbourhood result.
Relational Deep Learning (RDL)
Motivation
Most real-world data lives in relational databases (multiple tables linked by primary/foreign keys), not single flat tables. Traditional ML requires extensive feature engineering (ETL, SQL aggregations); RDL applies GNNs directly on the relational structure, eliminating manual feature work.
Relational Entity Graph
- Each row in each table becomes a node in the graph.
- Edges connect rows whose primary and foreign keys match (equivalent to SQL inner joins on pkey = fkey).
- Node features are the non-key columns of each row.
- The schema graph captures the high-level table relationships (one node per table, edges for foreign-key links).
Temporal Prediction Tasks
A training table specifies (Entity ID, Timestamp, Label). Tasks are inherently temporal: entity labels change over time, and the database evolves. At prediction time $t$, only information available up to $t$ may be used.
GNN computation graphs become time-dependent: message passing and neighbour sampling respect temporal constraints to avoid data leakage.
Why GNNs Work on Relational Data
- GNN aggregation is a learnable version of hand-crafted SQL aggregation features (e.g., SUM, AVG over time windows).
- GNNs can discover cross-table patterns via multi-hop message passing (e.g., user $\to$ transaction $\to$ product relationships).
- Information exchange between training examples through shared graph structure enriches entity representations.
Machine Learning
Decision Trees
Setup
- Input: $n$ examples $(\mathbf{x}_i, y_i)$ with $d$ features $x^{(1)},\ldots,x^{(d)}$. Features can be numerical or categorical; output $y$ can be categorical (classification) or numerical (regression).
- Structure: a tree where each internal node tests a feature (e.g., $x^{(j)} < v$), each branch corresponds to an outcome, and each leaf stores a prediction.
- Prediction: drop input $\mathbf{x}$ down the tree until it reaches a leaf; output the leaf's stored value.
Tree Construction (BuildSubtree)
Three decisions at each node with data $D$:
(1) How to split
Regression (variance reduction): Choose split $(x^{(j)}, v)$ that maximises
$$ |D|\cdot\mathrm{Var}(D) - \bigl[|D_L|\cdot\mathrm{Var}(D_L) + |D_R|\cdot\mathrm{Var}(D_R)\bigr], $$
where $\mathrm{Var}(D) = \frac{1}{|D|}\sum_{i\in D}(y_i - \bar{y})^2$.
Classification (information gain):
- Entropy: $H(Y) = -\sum_j p(Y_j)\log_2 p(Y_j)$.
- Conditional entropy: $H(Y\mid X) = \sum_j P(X=v_j)\, H(Y\mid X=v_j)$.
- Information gain: $IG(Y\mid X) = H(Y) - H(Y\mid X)$.
- Choose the split with highest $IG$.
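The entropy and information-gain definitions above can be sketched as follows for a categorical feature. The function names and the four-example dataset are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(Y | X) = H(Y) - sum_j P(X = v_j) H(Y | X = v_j)."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

y = [1, 1, 0, 0]
ig_perfect = information_gain(y, ["a", "a", "b", "b"])   # feature determines the label
ig_useless = information_gain(y, ["a", "b", "a", "b"])   # feature is independent of the label
```

A feature that separates the classes perfectly recovers the full 1 bit of label entropy. An uninformative feature gains nothing.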
(2) When to stop
- Leaf is "pure": $\mathrm{Var}(y) < \epsilon$ or all labels are the same.
- Too few examples: $|D| <$ threshold (e.g., 100).
(3) How to predict at a leaf
- Regression: average $y_i$ in the leaf (or fit a local linear model).
- Classification: majority class in the leaf.
Ensemble Methods
Bagging & Random Forests
Bagging (Bootstrap Aggregation)
- Bootstrap: create $T$ datasets $D'_t$ by sampling $n$ points from $D$ with replacement ($\approx 63\%$ unique points per sample).
- Train: build a decision tree independently on each $D'_t$.
- Aggregate: average (regression) or majority vote (classification) over all $T$ trees.
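A short derivation of the $\approx 63\%$ figure, assuming the $n$ draws are uniform and independent: a fixed point is missed by a single draw with probability $1 - 1/n$, so

$$ P(\text{point } i \in D'_t) = 1 - \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\,n\to\infty\,} 1 - \frac{1}{e} \approx 0.632. $$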
Random Forests
Bagged decision trees with feature bagging: at each split, consider only a random subset of $\sqrt{d}$ features (out of $d$). This breaks correlation between trees, improving ensemble diversity.
Boosting
AdaBoost
Combines decision stumps (1-level trees) sequentially.
- Initialise equal weights $w_i = 1/n$ for all examples.
- At each round $t$:
- Train stump $G_t$ on weighted data.
- Compute weighted error $\epsilon_t = \sum_{i: G_t(x_i)\neq y_i} w_i$.
- Compute tree weight $\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
- Reweight: $w_i \leftarrow w_i \cdot \exp(-\alpha_t\, y_i\, G_t(x_i))$; normalise.
- Final prediction: $G(x) = \mathrm{sign}\!\left(\sum_t \alpha_t\, G_t(x)\right)$.
Harder-to-classify examples get higher weight; more accurate stumps get higher $\alpha_t$.
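One boosting round can be sketched as below, given the stump's predictions. Labels and predictions are assumed to lie in $\{-1,+1\}$, the weights to sum to 1, and $0 < \epsilon_t < 1$. The function name is hypothetical.

```python
import numpy as np

def adaboost_round(w, y, pred):
    """Return (alpha_t, normalised new weights) for one AdaBoost round."""
    eps = w[pred != y].sum()                   # weighted error of the stump
    alpha = 0.5 * np.log((1.0 - eps) / eps)    # stump weight
    w_new = w * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight hits
    return alpha, w_new / w_new.sum()

# Four equally weighted examples; the stump gets the last one wrong.
w0 = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
pred = np.array([1, 1, -1, 1])
alpha, w1 = adaboost_round(w0, y, pred)
```

After reweighting, the single misclassified example carries weight $1/2$ and the three correct ones $1/6$ each, so the next stump sees a weighted error of exactly $1/2$ if it repeats the same predictions.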
Gradient Boosted Decision Trees (GBDT)
- Additive model: $\hat{y}^{(t)} = \hat{y}^{(t-1)} + \epsilon\, f_t(\mathbf{x})$, where $\epsilon$ is the learning rate (shrinkage, typically $\sim 0.1$).
- At round $t$, find tree $f_t$ minimising the second-order Taylor approximation of the loss:
$$ \widetilde{\mathcal{L}}^{(t)} = \sum_i \bigl[g_i\, f_t(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_t(\mathbf{x}_i)^2\bigr] + \Omega(f_t), $$
where $g_i = \partial_{\hat{y}} \ell(y_i, \hat{y}^{(t-1)})$, $h_i = \partial^2_{\hat{y}} \ell(y_i, \hat{y}^{(t-1)})$.
- Regularisation: $\Omega(f) = \gamma T + \frac{\lambda}{2}\sum_{j=1}^{T} w_j^2$, where $T$ = number of leaves, $w_j$ = leaf weight.
- For a given tree structure, optimal leaf weight: $w_j^* = -\sum_{i\in I_j} g_i / (\sum_{i\in I_j} h_i + \lambda)$.
- Split gain:
$$ \mathrm{Gain} = \frac{1}{2}\left[ \frac{(\sum_{i\in I_L} g_i)^2}{\sum_{i\in I_L} h_i + \lambda} + \frac{(\sum_{i\in I_R} g_i)^2}{\sum_{i\in I_R} h_i + \lambda} - \frac{(\sum_{i\in I} g_i)^2}{\sum_{i\in I} h_i + \lambda} \right] - \gamma. $$
- Grow tree greedily; post-prune splits with negative gain.
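A sketch of the leaf-weight and split-gain formulas in terms of the summed gradients $G=\sum g_i$ and Hessians $H=\sum h_i$ of each child. The gain is the left score plus the right score minus the parent score, halved, minus $\gamma$. The function names and numbers are illustrative assumptions.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Reduction in regularised loss from splitting a leaf into (L, R)."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Hypothetical leaf under squared loss (h_i = 1): two points per side,
# with gradient sums -4 on the left and +4 on the right.
gain_plain = split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0, gamma=0.0)
gain_regularised = split_gain(-4.0, 2.0, 4.0, 2.0, lam=2.0, gamma=1.0)
w_left = leaf_weight(-4.0, 2.0, lam=0.0)
```

Raising $\lambda$ shrinks the leaf scores and raising $\gamma$ charges a fixed price per split, so both make the tree more conservative.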
XGBoost
A scalable implementation of GBDT with L1/L2 regularisation, column-block structure for parallel tree construction, distributed computing, and out-of-core computation for large datasets.
Infinite Data
Mining Data Streams
Data arrives one element at a time at rapid rate; cannot store entire stream; must process immediately or lose forever.
Sampling from a Stream
Fixed-proportion sampling
To sample fraction $a/b$ of the stream: hash each tuple's key uniformly into $b$ buckets; keep the tuple if its hash value $< a$. Hashing on keys (not individual tuples) preserves group structure (e.g., all queries by the same user are either all in or all out).
Fixed-size sampling (Reservoir Sampling)
Maintain a sample $S$ of exactly $s$ elements.
- Store the first $s$ elements.
- When the $n$-th element arrives ($n>s$): with probability $s/n$, add it to $S$ (replacing a uniformly random element); otherwise discard it.
Invariant: after seeing $n$ elements, each element is in $S$ with probability $s/n$. Proof by induction: $P(\text{in } S \text{ at } n{+}1) = \frac{s}{n}\cdot\frac{n}{n+1} = \frac{s}{n+1}$.
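A minimal reservoir-sampling sketch following the two steps above. The function name and the optional `rng` argument (any object with `random()` and `randrange()`) are assumptions.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Uniform sample of s elements from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            sample.append(item)                 # keep the first s elements
        elif rng.random() < s / n:              # keep the n-th with probability s/n
            sample[rng.randrange(s)] = item     # evict a uniformly random slot
    return sample

picked = reservoir_sample(range(1000), s=10, rng=random.Random(0))
short = reservoir_sample(range(5), s=10)
```

Memory stays at $s$ items regardless of stream length. A stream shorter than $s$ is returned whole.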
Filtering: Bloom Filters
Given a set $S$ of $m$ keys, determine which stream elements have keys in $S$, using limited memory.
Basic Bloom Filter
- Allocate a bit array $B$ of $n$ bits (all 0).
- Hash each $s\in S$: set $B[h(s)]=1$.
- For stream element $a$: output $a$ if $B[h(a)]=1$.
No false negatives; false positive probability $\approx 1 - e^{-m/n}$.
Bloom Filter with $k$ hash functions
Use $k$ independent hash functions $h_1,\ldots,h_k$; declare $x\in S$ only if $B[h_i(x)]=1$ for all $i$. False positive probability: $(1 - e^{-km/n})^k$. Optimal $k = \frac{n}{m}\ln 2$; for $m=10^9$, $n=8\times10^9$, optimal $k\approx 6$ gives FP rate $\approx 2.2\%$.
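A sketch of a $k$-hash Bloom filter together with the false-positive formula. Deriving the $k$ hash functions by salting SHA-256 with the index $i$, and storing one byte per bit, are simplifying assumptions made here for readability.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = bytearray(n_bits)   # one byte per bit: clear, not compact

    def _positions(self, key):
        # Assumption: k "independent" hashes obtained by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

def false_positive_rate(m, n, k):
    """(1 - e^{-km/n})^k for m keys, n bits, k hash functions."""
    return (1.0 - math.exp(-k * m / n)) ** k

bf = BloomFilter(n_bits=8000, k=6)
members = [f"user{i}@example.com" for i in range(1000)]
for key in members:
    bf.add(key)
fp = false_positive_rate(m=1e9, n=8e9, k=6)
```

Every inserted key is reported as present (no false negatives), and `fp` evaluates to roughly 0.0216, matching the figure quoted above.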
Counting Distinct Elements: Flajolet--Martin
Algorithm
- Choose hash $h$ mapping elements to $\geq\log_2 N$ bits.
- For each element $a$, let $r(a)$ = number of trailing zeros in $h(a)$.
- Maintain $R = \max_a r(a)$.
- Estimate number of distinct elements $\approx 2^R$.
Analysis
Probability of not seeing a tail of length $r$ among $m$ distinct elements: $(1-2^{-r})^m$. If $m \gg 2^r$, this $\to 0$; if $m \ll 2^r$, this $\to 1$. So $2^R \approx m$.
$\mathbb{E}[2^R]$ is infinite (heavy tail), so use multiple hash functions: partition samples into groups, take median within groups, then average the medians.
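A single-hash Flajolet--Martin sketch. Using the first 32 bits of a salted SHA-256 digest as $h$ is an assumption, and the function names are hypothetical. A practical estimator would combine many seeds as described above.

```python
import hashlib

def trailing_zeros(x, width=32):
    """Number of trailing zero bits in x (width if x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1   # x & -x isolates the lowest set bit

def fm_estimate(stream, seed=0):
    """Estimate the number of distinct elements as 2^R for one hash function."""
    R = 0
    for a in stream:
        digest = hashlib.sha256(f"{seed}:{a}".encode()).digest()
        h = int.from_bytes(digest[:4], "big")   # 32-bit hash value
        R = max(R, trailing_zeros(h))
    return 2 ** R

once = fm_estimate(["a", "b", "c"])
repeated = fm_estimate(["a", "b", "c"] * 100)
```

Repeating elements does not change the estimate, because $R$ depends only on the set of hash values seen. The only state kept is the single integer $R$.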
Counting Frequent Items: Exponentially Decaying Windows
Maintain a smoothed count for each item $x$:
$$ w_x(T) = \sum_{t=1}^{T} \delta_t\,(1-c)^{T-t}, \quad \delta_t = \mathbf{1}[a_t = x], $$
where $c$ is a small decay constant (e.g., $10^{-6}$).
Update rule: when new element $a_{T+1}$ arrives, multiply all weights by $(1-c)$; add 1 to $w_{a_{T+1}}$.
Total weight across all items $= 1/c$, so at most $2/c$ items can have weight $\geq 1/2$. Drop items with weight below threshold.
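The update rule and the pruning step can be sketched as one function. The dict-of-weights representation, the function name, and the large decay constant used in the example (so the arithmetic is easy to follow) are assumptions.

```python
def update_decayed_counts(weights, item, c, threshold=0.5):
    """Apply one stream arrival: decay all weights, credit item, prune."""
    decayed = {x: w * (1.0 - c) for x, w in weights.items()}
    decayed[item] = decayed.get(item, 0.0) + 1.0
    return {x: w for x, w in decayed.items() if w >= threshold}

# c = 0.1 for illustration only; the notes suggest something like 1e-6.
w = {}
for a in "aab":
    w = update_decayed_counts(w, a, c=0.1)

# With aggressive decay, an item seen once drops out after one step.
w_fast = update_decayed_counts({"a": 1.0}, "b", c=0.6)
```

After the stream `a, a, b` the weight of `a` is $1\cdot 0.9^2 + 1\cdot 0.9 = 1.71$ and the weight of `b` is 1.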
Extension to Itemsets
Start counting an itemset $S\subseteq B$ only if all proper subsets of $S$ are already being counted (mirroring A-priori).
Online Algorithms & Web Advertising
Online Bipartite Matching
Setting
One side of a bipartite graph ("boys") is known upfront. The other side ("girls") arrives one at a time; upon arrival, we must irrevocably match or leave unmatched.
Greedy
Match each arriving node to any available neighbour. Competitive ratio $= 1/2$ (worst case: greedy exhausts the wrong side first).
Competitive Ratio
$\mathrm{CR} = \min_{\text{all inputs } I} |M_{\mathrm{ALG}}| / |M_{\mathrm{OPT}}|$.
The Adwords Problem
Setting
A stream of search queries $q_1, q_2, \ldots$ arrives. Each advertiser has a bid on certain queries, a click-through rate (CTR), and a daily budget. For each query, select an advertiser to show; goal: maximise total revenue.
Simplified Model
All advertisers have budget $B$, all bids $=1$, same CTR.
BALANCE Algorithm
For each query, assign it to the advertiser with the largest remaining budget (break ties deterministically).
Analysis (2 advertisers): Competitive ratio $= 3/4$. Proof sketch: BALANCE must exhaust at least one budget; the number of missed queries $x$ satisfies $x \leq B/2$, giving revenue $\geq 3B/2$ vs. optimal $2B$.
General case ($N$ advertisers): Competitive ratio $= 1 - 1/e \approx 0.63$. Worst case: $N$ rounds of $B$ queries, where round $i$ queries are bid on by advertisers $A_i,\ldots,A_N$. After $k = N(1-1/e)$ rounds, all budgets are exhausted; revenue $= BN(1-1/e)$.
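A sketch of BALANCE in the simplified model (all bids 1, integer budgets), run on the classic two-advertiser instance. Representing each query by the set of advertisers that bid on it, and breaking ties by name order, are assumptions.

```python
def balance(queries, budgets):
    """Assign each query to the bidding advertiser with the most budget left.

    queries: list of sets of advertiser names bidding on that query.
    budgets: dict advertiser -> initial budget. Unassigned queries give None.
    """
    remaining = dict(budgets)
    assignment = []
    for bidders in queries:
        eligible = [a for a in sorted(bidders) if remaining.get(a, 0) > 0]
        if not eligible:
            assignment.append(None)
            continue
        chosen = max(eligible, key=remaining.get)   # first maximum wins ties
        remaining[chosen] -= 1
        assignment.append(chosen)
    return assignment

# A1 bids only on x; A2 bids on x and y; both have budget B = 4.
B = 4
queries = [{"A1", "A2"}] * B + [{"A2"}] * B   # stream: x x x x y y y y
result = balance(queries, {"A1": B, "A2": B})
revenue = sum(a is not None for a in result)
```

BALANCE splits the `x` queries evenly, leaving A2 with only $B/2$ budget for the `y` queries, so revenue is $3B/2 = 6$ against an optimum of $2B = 8$.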
Generalised BALANCE
For arbitrary bids $x_i$ and budgets $b_i$, define
$$ \psi_i(q) = x_i\,(1 - e^{-f_i}), \quad f_i = 1 - m_i/b_i, $$
where $m_i$ = amount spent so far. Allocate query $q$ to the advertiser with the largest $\psi_i(q)$. Same competitive ratio $1-1/e$.
No online algorithm can achieve a better competitive ratio.
Submodular Optimisation
Submodularity & Set Cover
Set Cover Problem
Given a universe $W = \{w_1,\ldots,w_n\}$ and sets $X_1,\ldots,X_m \subseteq W$, find $k$ sets whose union is as large as possible (this budgeted form is usually called maximum coverage):
$$ \max_{|A|\leq k} F(A), \qquad F(A) = \left|\bigcup_{i\in A} X_i\right|. $$
This is NP-hard in general.
Submodular Functions
Definition
A set function $F: 2^V \to \mathbb{R}$ is submodular if for all $A\subseteq B\subseteq V$ and $d\notin B$:
$$ F(A\cup\{d\}) - F(A) \geq F(B\cup\{d\}) - F(B). $$
This is the diminishing returns property: adding an element to a smaller set helps at least as much as adding it to a larger set.
Equivalently, for all $A,B\subseteq V$: $F(A) + F(B) \geq F(A\cup B) + F(A\cap B)$.
Properties
- The coverage function $F(A) = |\bigcup_{i\in A} X_i|$ is submodular and monotone.
- Non-negative linear combinations of submodular functions are submodular.
- Submodularity is the discrete analogue of concavity.
Greedy Algorithm
- $A_0 = \emptyset$.
- For $i=1,\ldots,k$: let $d_i = \arg\max_d F(A_{i-1}\cup\{d\}) - F(A_{i-1})$; set $A_i = A_{i-1}\cup\{d_i\}$.
Guarantee (Nemhauser--Fisher--Wolsey, 1978)
For any monotone submodular function $F$ with $F(\emptyset)=0$:
$$ F(A_{\mathrm{greedy}}) \geq \left(1 - \frac{1}{e}\right) F(A_{\mathrm{OPT}}) \approx 0.63\cdot\mathrm{OPT}. $$
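A sketch of the greedy algorithm specialised to the coverage function $F(A)=|\bigcup_{i\in A}X_i|$. The function name and the example sets are illustrative assumptions.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets greedily by marginal coverage gain."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        candidates = [i for i in range(len(sets)) if i not in chosen]
        # Marginal gain F(A + {i}) - F(A) is the number of newly covered elements.
        best = max(candidates, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break                      # nothing left to gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

X = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6, 7}]
chosen, covered = greedy_max_coverage(X, k=2)
```

The first pick is the largest set. In the second round the marginal gains are 1, 1, and 3, so the last set is chosen and all seven elements are covered.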
Lazy Greedy (CELF)
Exploit submodularity: marginal gains $\Delta_i(d)$ can only decrease over rounds. Keep an ordered list of upper bounds from the previous round; re-evaluate only the current top element, re-sort, and pick the new top. Provides identical output to greedy but with significant speedup in practice.
Application: Diverse Document Selection
Probabilistic Coverage
Each document $d$ covers concept $c$ with probability $\mathrm{Cover}_d(c)$. Probability that at least one document in $A$ covers $c$:
$$ P_A(c) = 1 - \prod_{d\in A}(1 - \mathrm{Cover}_d(c)). $$
Objective with concept weights $w_c$:
$$ F(A) = \sum_c w_c\, P_A(c). $$
This is monotone and submodular, so greedy gives a $(1-1/e)$-approximation.
Personalisation (Multiplicative Weights)
Learn per-user concept weights from feedback: after recommending document $d$ and receiving feedback $r\in\{+1,-1\}$, update $w_c \leftarrow \beta^r w_c$ for each concept $c\in X_d$, then renormalise so $\sum_c w_c = 1$.