Some notes on RKHS

STATS 305C

Reproducing kernel Hilbert spaces (RKHS)

Here, we briefly cover some properties of RKHS that are well-known and potentially useful.
Kernels are used to produce new (hopefully interesting) features on some existing feature space \(T\).
For concreteness, let’s think \(T=\mathbb{R}^p\).

Covariance kernel

The key ingredient is a covariance kernel \(K:T \times T \mathbb{R}\)
1. \(K(t,s) = K(s,t)\)
2. Non-negative definiteness: for any \((a_1,\dots,a_k) \in \mathbb{R}, (t_1, \dots, t_k) \in T\):

\[ \sum_{i,j=1}^k a_j a_j K(t_i, t_j) \geq 0 \]

RKHS

Feature map

For every \(t \in T\), define \(K_t:T \to \mathbb{R}\) by

\[ K_t(s) = K(t,s) \]

The RKHS \({\cal H}\) is formed by linear combinations of the form

\[ \sum_{i=1}^k a_i K_{t_i} \]

Inner product

Define (this is the reproducing aspect)

\[ \langle K_t, K_s \rangle_{\cal H} = K(t,s) \]

We see that

\[ \sum_{i,j=1}^k a_j a_j K(t_i, t_j) = \langle \|\sum_{i=1}^k a_i K_{t_i} \rangle\|^2_{\cal H} \]

Completion

Call all finite linear combinations \({\cal H}_0\).
The RKHS \({\cal H}\) is the completion of \({\cal H}_0\) (i.e. equivalence classes of Cauchy sequences in \({\cal H}_0\)).

Evaluation

For any \(h \in {\cal H}\) and any \(t \in T\)

\[ \langle h, K_t \rangle = h(t) \]

Proof: verify on \({\cal H}_0\), extend to \({\cal H}\).

Example: linear kernel

Take \(T=\mathbb{R}^p\) and any \(\Sigma \geq 0 \in \mathbb{R}^{p \times p}\)
For \(v \in \text{row}(\Sigma)\) define

\[ K_v(x) = v'x \]

We see that \({\cal H}\) is \(\text{row}(\Sigma)\) equipped with the inner product

\[ u'\Sigma v, \qquad u, v \in \text{row}(\Sigma). \]

Example: piecewise linear kernel

Take \(T=[0,1]\) and

\[ {\cal H} = \left\{f:[0,1] \to \mathbb{R}: f(0)=0, \dot{f} \in L^2([0,1]) \right\} \] - I.e.

\[ \|f\|^2_{\cal H} = \|\dot{f}\|^2_{L^2([0,1])} \] - This is the completion of

\[ {\cal H}_0 = \left\{ \sum_i a_i K_{t_i} \right\} \]

with

\[ K_{t}(s) = \min(t, s), \qquad s,t \in [0,1] \]

The feature maps are piecewise linear functions…

Example: piecewise cubic kernel

Take \(T=[0,1]\) and

\[ {\cal H} = \left\{f:[0,1] \to \mathbb{R}: f(0)=0, \dot{f}(0)=0, \ddot{f} \in L^2([0,1]) \right\} \]

We recognize the smoothing spline penalty

\[ \int_{[0,1]} \ddot{f}(s)^2 \; ds \]

The feature maps are piecewise cubic functions…
What is the corresponding kernel?

Gaussian process

To see what the corresponding kernel is, it is helpful to consider another object associated to \(K\)
For any covariance kernel \(K\), we can associate a Gaussian process \(Z:T \times \Omega \rightarrow \mathbb{R}\):

\[ \text{Cov}(Z_t, Z_s) = K(t,s) \]

Piecewise linear kernel the process is Brownian motion on \([0,1]\), let’s call it \(B\).
Piecewise cubic kernel

\[ Z_t = \int_{[0,t]} B_s \; ds \]

Straightforward to check that its covariance is piecewise cubic…

Kernel trick (aka Representer theorem)

Let \({\cal T}=(t_1, \dots, t_n)\) denote any finite set in \(T\)
Associated to \({\cal T}\) are:

\[ \begin{aligned} L_{\cal T} &= \text{span}(K_{t_i}, t_i \in {\cal T}) \\ P_{\cal T}(h) &= \text{argmin}_{k \in L_{\cal T}} \|h - k\|^2_{\cal H} \\ \end{aligned} \]

Orthogonality / interpolation

For any \(h \in {\cal H}\) and any \(t_i \in {\cal T}\)

\[ 0 = \langle h - P_{\cal T} h, K_{t_i} \rangle_{\cal H} = h(t_i) - (P_{\cal T} h)(t_i) \]

For any function \(\ell\) of the fitted values for \(h \in {\cal H}\)

\[ \ell(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \geq \ell((P_{\cal T}h)(t_1), \dots, (P_{\cal T}(h)(t_n)) + \lambda \|P_{\cal T}h\|^2_{\cal H} \]

Implications

\(\implies\)

\[ \text{argmin}_{h \in {\cal H}} {\ell}(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \subset L_{\cal T} \]

\(\implies\) suffices to solve the finite dimensional problem

\[ \text{argmin}_{h \in L_{\cal T}} {\ell}(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \]

Functions in \(L_{\cal T}\) can be parameterized as

\[ h = \sum_i \alpha_i K_{t_i} \]

Bases for an RKHS

Mercer’s theorem / Karhunen Loeve expansion

Take \(T\) to be a bounded subset of \(\mathbb{R}^p\) with Lebesgue measure \(\Lambda_T\) and consider

\[ {\cal H} \ni h(\cdot) \overset{\bar{K}}{\to} \int_T K(t, \cdot) h(t) \; \Lambda_T(dt) \]

We can view \(\bar{K}:L^2(\Lambda_T) \to L^2(\Lambda_T)\) which admits an eigen-decomposition

\[ \bar{K}e_i = \lambda_i e_i \]

Formally (sums over index \(i\) may be finite or infinite)

\[ K(t,s) = \sum_i \lambda_i e_i(t) e_i(s) \]

Multiple expansions

Replace Lebesgue measure with a probability measure \(F\) with density \(f\) whose support is all of \(T\):

\[ {\cal H} \ni h(\cdot) \overset{\bar{K}_F}{\to} \int_T K(t, \cdot) f(t) h(t) \; dt \]

We now have \(\bar{K}_f:L^2(F) \to L^2(F)\) with a similar eigen-decomposition

\[ K(t,s) = \sum_i \lambda_{i,F} e_{i,F}(t) e_{i,F}(s) \]

Which is preferable?

Computationally, one may want to use a low-rank approximation to \(K\)…
We’d expect Lebesgue expansion \(L^2(\Lambda_T \times \Lambda_T)\) while the above would be preferable for approximation in \(L^2(F \times F)\)

Realizing the Gaussian process

Let \(\xi_i(\omega) \overset{IID}{\sim} N(0,1)\)
Given a basis \(e_{i,F}\) from such an expansion (or any other ONB) set

\[ Z_t(\omega) = \sum_i \lambda_{i,F}^{1/2} \xi_i(\omega) e_{i,F}(t) \]

As \(e_{i,F}(t) = \langle e_{i,F}, K_t \rangle_{\cal H}\) we can define, for any \(h \in {\cal H}\)

\[ Z(h,\omega) = \sum_i \lambda_{i,F}^{1/2} \xi_i(\omega) \langle e_{i,F}, h \rangle_{\cal H} \]

Orthonormal basis for \({\cal H}\)

Given any orthonormal basis \((\varphi_i)_{1 \leq i \leq dim({\cal H})}\) of \({\cal H}\) we see

\[ \begin{aligned} K(s,t) &= \langle K_s, K_t \rangle_{\cal H} \\ &= \sum_{i} \langle K_s, \varphi_i \rangle_{\cal H} \cdot \langle \varphi_i, K_t \rangle_{\cal H} \\ &= \sum_{i} \varphi_i(t) \varphi_i(s). \end{aligned} \] we can realize the law as

\[ Z_t(\omega) = \sum_{i} \xi_i(\omega) \varphi_i(t) \]

Conversely, for countable index set \({\cal I}\), suppose

\[ K(s,t) = \sum_{i \in {\cal I}} \varphi_i(t) \varphi_i(s) \] Then, \((\varphi_i)_{i \in {\cal I} }\) is an ONB for \({\cal H}\) determined by \(K\). The corresponding process is

\[ Z_t(\omega) = \sum_{i \in {\cal I}} \xi_i(\omega) \varphi_i(t) \]

\(\implies\) both \((\lambda_i^{1/2} e_{i})\) and \((\lambda_{i,F}^{1/2} e_{i,F})\) are ONB for \({\cal H}\).

A statistical problem

Suppose we want to approximate the function \(Z\) by its projection onto a single function \(g \in L^2(\Lambda_T)\) (\(\|g\|^2_{L^2(\Lambda_T)}=1\)):

\[ \hat{Z}(\cdot) = \int_T Z_t \cdot g(t) \; \Lambda_T(dt) \cdot g(\cdot) \]

We’ll measure accuracy by

\[ \mathbb{E}\left[\int_T (Z_t - \hat{Z}_t)^2 \Lambda_T(dt)\right] \]

Minimizing over \(g\) (unsurprisingly?) is seen to be the same as

\[ \text{maximize}_{\{g:\|g\|^2_{L^2(\Lambda_T)}=1\}} \int_T \int_T K(s,t) g(s) g(t) \; \Lambda_T(ds) \; \Lambda_T(ds) \]

Solution is to take top Karhunen-Loeve eigenfunction \(e_{1,\Lambda_T}\)
Generalizes to projection onto a subspace of dimension \(k\).
Swapping \(\Lambda_T\) to \(F\) yields \(e_{1,F}\)

Connections to PCA #1: Functional PCA

Image credit: Towards data science
For functional data analysis, the observations are functions

Connections to PCA #2: Kernel PCA

Given a sample \(X_i \overset{IID}{\sim} F\) and kernel \(K\), consider

\[ \text{maximize}_{h: \|h\|_{\cal H} \leq 1} \hat{\text{Var}}(h(X)) \]

Resulting eigenproblem (see 14.5.4 in ESL)

\[ \left(I - n^{-1}11'\right) G \left(I - n^{-1}11'\right) \]

with \(G_{ij} = K(X_i, X_j)\).

This is (the population version of) a low-rank perturbation of the original kernel… corresponding process is

\[ Z_t - \int_T Z_s f(s) \; ds \]