Alternative two-sample tests#


Kernelizing Hotelling \(T^2\)#

library(pdfCluster)
data(oliveoil)
# keep two of the three macro areas, giving a two-sample dataset
X = oliveoil[oliveoil$macro.area %in% c('South', 'Centre.North'),]
South = X[X$macro.area=='South',3:10]
North = X[X$macro.area=='Centre.North',3:10]
X_1 = as.matrix(South)
X_2 = as.matrix(North)

Olive oil data South vs Centre.North#

pairs(X, col=X$macro.area)
[Figure: pairs plot of the olive oil variables, colored by macro area (South vs Centre.North).]
\[H_0:F_1=F_2\]

Gram matrices#

  • We used the term Gram matrix last time for a reason.

  • For any covariance kernel \(K:\mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}\) we could use

\[ G_{ij} = K(X_i, X_j) \]
  • Gram matrices can be computed for data that is not even Euclidean…

\[\begin{split} \begin{aligned} \text{Tr}(GP^{\Delta}) &= \frac{1}{\frac{1}{n_1}+\frac{1}{n_2}} \biggl(\frac{1}{n_1^2} \sum_{i,j=1}^{n_1} K(X_1[i,],X_1[j,]) \\ & \qquad + \frac{1}{n_2^2} \sum_{i,j=1}^{n_2} K(X_2[i,],X_2[j,]) \\ & \qquad - \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} K(X_1[i,],X_2[j,])\biggr) \end{aligned} \end{split}\]

  • Rearranging a little

\[ \begin{aligned} \text{Tr}(GP^{\Delta}) =W_1(G) +W_2(G) \end{aligned} \]

with

\[\begin{split} \begin{aligned} W_1(G) &= W_1(G;X_1,X_2) \\ &= \frac{1}{\frac{1}{n_1}+\frac{1}{n_2}} \biggl(\frac{1}{n_1^2} \sum_{i,j=1:i\neq j}^{n_1} K(X_1[i,],X_1[j,]) \\ & \qquad \qquad + \frac{1}{n_2^2} \sum_{i,j=1: i \neq j}^{n_2} K(X_2[i,],X_2[j,]) \\ & \qquad \qquad - \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} K(X_1[i,],X_2[j,])\biggr) \end{aligned} \end{split}\]

  • The other term is

\[\begin{split} \begin{aligned} W_2(G) &= W_2(G;X_1,X_2) \\ &= \frac{n_1 n_2}{n_1+n_2} \biggl( \frac{1}{n_1^2} \sum_{i=1}^{n_1} K(X_1[i,],X_1[i,]) \\ & \qquad \qquad + \frac{1}{n_2^2} \sum_{i=1}^{n_2} K(X_2[i,], X_2[i,]) \biggr) \\ &\overset{H_0: (n_1, n_2) \to \infty }{=} C(n_1, n_2) E_{F_1}(K(X,X)) + O(n^{-1/2}) \end{aligned} \end{split}\]
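
As a concrete illustration, here is a minimal sketch (not code from the notes) that computes \(W_1(G)\) and \(W_2(G)\) for the olive oil samples from a Gram matrix built with kernlab's RBF kernel; the choice of kernel and the bandwidth sigma = 1e-6 are arbitrary here.

library(kernlab)
rbf = rbfdot(sigma = 1e-6)
G_11 = kernelMatrix(rbf, X_1)        # within-sample Gram blocks
G_22 = kernelMatrix(rbf, X_2)
G_12 = kernelMatrix(rbf, X_1, X_2)   # cross-sample block
n_1 = nrow(X_1)
n_2 = nrow(X_2)
c_n = 1 / (1/n_1 + 1/n_2)
# off-diagonal (W_1) and diagonal (W_2) parts of Tr(G P^Delta)
W_1 = c_n * ((sum(G_11) - sum(diag(G_11))) / n_1^2 +
             (sum(G_22) - sum(diag(G_22))) / n_2^2 -
             2 * sum(G_12) / (n_1 * n_2))
W_2 = c_n * (sum(diag(G_11)) / n_1^2 + sum(diag(G_22)) / n_2^2)
c(W_1 = W_1, W_2 = W_2, trace = W_1 + W_2)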

Where does \(W_2\) come from?#

  • Let \(K(x,y)=x^Ty\) be the linear kernel. Then,

\[ E((\bar{X}_1-\bar{X}_2)^T(\bar{X}_1-\bar{X}_2)) = \|E_{F_1}(X)-E_{F_2}(X)\|^2_2 + \text{Tr}(\text{Var}(\bar{X}_1 - \bar{X}_2)) \]
  • This last term should be subtracted to get an unbiased estimate of \(\|E_{F_1}(X)-E_{F_2}(X)\|^2_2\).

  • We see, then, that for any feature map \(h:\mathbb{R}^p \rightarrow \mathbb{R}^k\) leading to the kernel \(K(x,y)=h(x)^Th(y)\), subtracting this \(W_2\) term yields an unbiased estimate of

\[ \|E_{F_1}(h(X))-E_{F_2}(h(X))\|^2_2 \]
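
To see the bias numerically, here is a small simulation sketch (an illustration added here, not from the notes): with the linear kernel and both samples drawn from the same \(N(0, I_3)\) distribution, \(W_1\) averages to about 0 while \(W_2\) averages to about \(E_{F_1}(K(X,X)) = p = 3\).

set.seed(1)
lin_parts = function(Z_1, Z_2) {
  n_1 = nrow(Z_1); n_2 = nrow(Z_2)
  c_n = 1 / (1/n_1 + 1/n_2)
  G_11 = Z_1 %*% t(Z_1)              # linear-kernel Gram blocks
  G_22 = Z_2 %*% t(Z_2)
  G_12 = Z_1 %*% t(Z_2)
  W_1 = c_n * ((sum(G_11) - sum(diag(G_11))) / n_1^2 +
               (sum(G_22) - sum(diag(G_22))) / n_2^2 -
               2 * sum(G_12) / (n_1 * n_2))
  W_2 = c_n * (sum(diag(G_11)) / n_1^2 + sum(diag(G_22)) / n_2^2)
  c(W_1 = W_1, W_2 = W_2)
}
sims = replicate(2000, lin_parts(matrix(rnorm(50 * 3), 50, 3),
                                 matrix(rnorm(50 * 3), 50, 3)))
rowMeans(sims)   # W_1 near 0, W_2 near 3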

Maximum Mean Discrepancy Statistic (MMDS)#

  • When \(G\) is the Gram matrix of a covariance kernel \(K\), Gretton et al. (almost) propose \(W_1(G)\) as a two-sample test statistic.

  • Specifically they propose (up to scaling)

\[\begin{split} \begin{aligned} T_{MMD} &= \widehat{MMD}({\cal H}; X_1, X_2) \\ &= \frac{1}{\frac{1}{n_1}+\frac{1}{n_2}} \biggl(\frac{1}{n_1 (n_1-1)} \sum_{i,j=1:i\neq j}^{n_1} K(X_1[i,],X_1[j,]) + \\ & \qquad \qquad \frac{1}{n_2 (n_2-1)} \sum_{i,j=1: i \neq j}^{n_2} K(X_2[i,],X_2[j,]) \\ & \qquad \qquad - \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} K(X_1[i,],X_2[j,]) \biggr)\\ \end{aligned} \end{split}\]

Comments#

  • This is almost \(W_1(G;X_1,X_2)\): if \(K\) is bounded by \(C'\), the difference (up to the overall scaling) is bounded by \(C'/n\).

  • Their form is easy to analyze, as it is exactly a two-sample \(U\)-statistic (degenerate under \(H_0\)).
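
For concreteness, here is a minimal sketch (again not the notes' code) of this \(T_{MMD}\), reusing the RBF Gram blocks G_11, G_22, G_12 and the constant c_n from the earlier sketch; only the normalization of the within-sample sums differs from \(W_1\).

T_MMD = c_n * ((sum(G_11) - sum(diag(G_11))) / (n_1 * (n_1 - 1)) +
               (sum(G_22) - sum(diag(G_22))) / (n_2 * (n_2 - 1)) -
               2 * sum(G_12) / (n_1 * n_2))
c(T_MMD = T_MMD, W_1 = W_1)   # close when n_1, n_2 are large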

MMDS in general#

  • For a class of functions \({\cal F}\) define
\[ MMD({\cal F}; F_1, F_2) = \sup_{h \in {\cal F}} E_{F_1}(h(X)) - E_{F_2}(h(X)) \]

  • When \({\cal H}\) is the unit ball in an RKHS with kernel \(K\), they show
\[ MMD^2({\cal H}; F_1, F_2) = E_{F_1 \times F_1}(K(X,X')) + E_{F_2 \times F_2}(K(X,X')) - 2 E_{F_1 \times F_2}(K(X,X')) \]

  • Notation: \(E_{F \times G}[h(X,X')]\) denotes expectation under \((X,X') \sim F \times G\).

  • The statistic of Gretton et al. is (up to scaling) an estimator of \(MMD^2({\cal H}; F_1, F_2)\); see the short derivation after this list.


  • Using the reproducing property of the kernel, it is not hard to see that the analog of the (rescaled) \(\text{Tr}(XX^TP^{\Delta})\) is given by

\[\begin{split} \begin{aligned} \text{Tr}(GP^{\Delta}) &= \frac{1}{\frac{1}{n_1}+\frac{1}{n_2}} \left\| \frac{1}{n_1} \sum_{i=1}^{n_1} K_{X_1[i,]} - \frac{1}{n_2} \sum_{j=1}^{n_2} K_{X_2[j,]} \right\|^2_{\cal H} \\ &= \frac{1}{\frac{1}{n_1}+\frac{1}{n_2}} \left( \sup_{h: \|h\|_{\cal H} \leq 1} \left \langle \frac{1}{n_1} \sum_{i=1}^{n_1} K_{X_1[i,]} - \frac{1}{n_2} \sum_{j=1}^{n_2} K_{X_2[j,]}, h \right\rangle_{\cal H}\right)^2 \end{aligned} \end{split}\]
  • Like \(\widehat{MMD}({\cal H}; X_1, X_2)\) but includes this \(W_2\) term…
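
To spell out the step behind the last two bullets (a short derivation added here, writing \(\mu_j = E_{F_j}(K_X)\) for the kernel mean embeddings): by the reproducing property, \(E_{F_j}(h(X)) = \langle h, \mu_j \rangle_{\cal H}\), so

\[\begin{split} \begin{aligned} MMD({\cal H}; F_1, F_2) &= \sup_{\|h\|_{\cal H} \leq 1} \langle \mu_1 - \mu_2, h \rangle_{\cal H} = \|\mu_1 - \mu_2\|_{\cal H}, \\ MMD^2({\cal H}; F_1, F_2) &= \langle \mu_1, \mu_1 \rangle_{\cal H} + \langle \mu_2, \mu_2 \rangle_{\cal H} - 2 \langle \mu_1, \mu_2 \rangle_{\cal H} \\ &= E_{F_1 \times F_1}(K(X,X')) + E_{F_2 \times F_2}(K(X,X')) - 2 E_{F_1 \times F_2}(K(X,X')). \end{aligned} \end{split}\]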

kernlab#

library(kernlab)
kmmd(X_1, X_2)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
Kernel Maximum Mean Discrepancy object of class "kmmd" 
 
Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  2.8390782080874e-06 


 H0 Hypothesis rejected :  TRUE
 Rademacher bound :  0.444461696552373

 1st and 3rd order MMD Statistics :  0.901532029840108 0.643725854362546
kmmd(X_1, X_2, kernel='vanilladot', kpar=list())
Kernel Maximum Mean Discrepancy object of class "kmmd" 
 
Linear (vanilla) kernel function. 


 H0 Hypothesis rejected :  FALSE
 Rademacher bound :  3760.42987748469

 1st and 3rd order MMD Statistics :  798.276202069945 399126.145253863

Energy statistics#

  • We saw that Hotelling’s \(T^2\) test is essentially determined by an \(n \times n\) (case) similarity matrix \(G\):

\[ \text{Tr}(P^{\Delta}GP^{\Delta}) \]
  • A simple way to construct a similarity matrix from a distance matrix \(D_{ij} = d(X_i, X_j)\) is to set

\[ S = \kappa I - D \]

  • Noting that \(\text{Tr}(P^{\Delta}) = 1\) suggests simply using

\[\begin{split} \begin{aligned} -\text{Tr}(DP^{\Delta}) &= -\frac{1}{\frac{1}{n_1}+\frac{1}{n_2}}\text{Tr}((M_1-M_2)D(M_1-M_2)^T) \\ &= \frac{n_1 n_2}{n_1+n_2} \biggl( \frac{2}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d(X_1[i,],X_2[j,]) \\ & \qquad \quad - \frac{1}{n_1^2} \sum_{i,j=1}^{n_1} d(X_1[i,],X_1[j,]) \\ & \qquad \quad - \frac{1}{n_2^2} \sum_{i,j=1}^{n_2} d(X_2[i,],X_2[j,])\biggr) \end{aligned} \end{split}\]
  • These are the energy test statistics of Székely et al.

  • Permutations can be used to get valid tests (implemented in the energy package in R).

Energy test with Euclidean distance#

library(energy)
X = rbind(X_1, X_2)
n_1 = nrow(X_1)
n_2 = nrow(X_2)
D = dist(X)
T = eqdist.etest(D, c(n_1, n_2), distance=TRUE, R=500)
T
	2-sample E-test of equal distributions

data:  sample sizes 323 151, replicates 500
E-statistic = 94382, p-value = 0.001996

Sanity check#

# group-mean weight vectors and the rescaled contrast matrix P^Delta
M_1 = c(rep(1, n_1), rep(0, n_2)) / n_1
M_2 = c(rep(0, n_1), rep(1, n_2)) / n_2
P_delta = (M_1 - M_2) %*% t(M_1 - M_2) * n_1 * n_2 / (n_1 + n_2)
Dm = as.matrix(D)
# -Tr(D P^Delta) should match the package's E-statistic
c(package.value=T$statistic, byhand=-sum(diag(P_delta %*% Dm)))
package.value.E-statistic                    byhand 
         94382.2812754321          94382.2812754317 

Classification two-sample tests (C2ST)#

LDA classifier#

  • A natural classifier related to our two-sample problem maps a new test point \(x\) to the class \(i\) that (under equal sample sizes) minimizes

\[ (x - \bar{X}_i)^TS_p^{-1}(x-\bar{X}_i). \]
  • Under \(H_0:F_1=F_2\), such a classifier should do no better than a coin flip on held-out test data.

  • This is the basis of the C2ST

General procedure#

  • Create a label vector
\[ Y = \begin{pmatrix} 1_{n_1} \\ 2_{n_2} \end{pmatrix} \]
and form D=cbind(Y, X)

  • Split D randomly into D_train and D_test so that the test data contain the same number of cases from X_1 as from X_2. Let n_test denote the total number of cases held out.

  • Fit a classifier \(\hat{C}\) on cbind(X_train, Y_train), and evaluate its classification accuracy on cbind(X_test, Y_test).

  • Under \(H_0:F_1=F_2\), conditional on D_train the accuracy has (approximate) distribution
\[ \frac{1}{n_{test}} \text{Binomial}\left(n_{test}, \frac{1}{2} \right). \]
(This seems easiest to justify when the \(X_i\) are simple random samples from large populations; in that case, given that case \(x\) is selected, a Bernoulli(1/2) determines which sample it finds itself in.)
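
Below is a minimal sketch of this procedure on the olive oil data (an illustration added here, not the notes' code), using MASS::lda as the classifier, a balanced hold-out of half of each sample, and a one-sided binomial p-value; the seed and split size are arbitrary choices.

library(MASS)   # for lda
set.seed(42)
Y = factor(c(rep('South', n_1), rep('North', n_2)))
dat = data.frame(Y = Y, rbind(X_1, X_2))
# hold out an equal number of cases from each sample
m = floor(min(n_1, n_2) / 2)
test_idx = c(sample(which(Y == 'South'), m), sample(which(Y == 'North'), m))
# lda may warn about collinear columns here; fine for this illustration
fit = lda(Y ~ ., data = dat[-test_idx, ])
correct = sum(predict(fit, dat[test_idx, ])$class == dat$Y[test_idx])
n_test = 2 * m
# under H_0 the number correct is approximately Binomial(n_test, 1/2)
c(accuracy = correct / n_test,
  p.value = pbinom(correct - 1, n_test, 1/2, lower.tail = FALSE))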

References#