Reproducing kernel Hilbert spaces (RKHS)#

  • Here, we briefly cover some properties of RKHS that are well-known and potentially useful.

  • Kernels are used to produce new (hopefully interesting) features on some existing feature space \(T\).

  • For concreteness, let’s think of \(T=\mathbb{R}^p\).

Covariance kernel#

  • The key ingredient is a covariance kernel \(K:T \times T \to \mathbb{R}\) satisfying:

    1. \(K(t,s) = K(s,t)\)

    2. Non-negative definiteness: for any \((a_1,\dots,a_k) \in \mathbb{R}^k\) and \((t_1, \dots, t_k) \in T^k\):

\[ \sum_{i,j=1}^k a_i a_j K(t_i, t_j) \geq 0 \]
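The two defining properties can be checked numerically. A minimal numpy sketch, using the kernel \(K(t,s)=\min(t,s)\) on \([0,1]\) (which reappears below as the piecewise linear kernel) as an illustrative choice:

```python
import numpy as np

# Illustrative check of symmetry and non-negative definiteness for the
# kernel K(t, s) = min(t, s) on T = [0, 1] (any covariance kernel works).
rng = np.random.default_rng(0)
t = rng.uniform(0, 1, size=50)            # points t_1, ..., t_k in T
G = np.minimum(t[:, None], t[None, :])    # Gram matrix G_ij = K(t_i, t_j)

assert np.allclose(G, G.T)                # symmetry: K(t, s) = K(s, t)
# sum_{ij} a_i a_j K(t_i, t_j) = a' G a >= 0 for any coefficients a:
a = rng.standard_normal(50)
print(a @ G @ a >= -1e-8)                 # True
print(np.linalg.eigvalsh(G).min() >= -1e-8)   # all eigenvalues nonnegative
```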

RKHS#

Feature map#

  • For every \(t \in T\), define \(K_t:T \to \mathbb{R}\) by

\[ K_t(s) = K(t,s) \]
  • The RKHS \({\cal H}\) is formed by linear combinations of the form

\[ \sum_{i=1}^k a_i K_{t_i} \]

Inner product#

  • Define (this is the reproducing aspect)

\[ \langle K_t, K_s \rangle_{\cal H} = K(t,s) \]
  • We see that

\[ \sum_{i,j=1}^k a_i a_j K(t_i, t_j) = \Big\| \sum_{i=1}^k a_i K_{t_i} \Big\|^2_{\cal H} \]

Completion#

  • Call all finite linear combinations \({\cal H}_0\).

  • The RKHS \({\cal H}\) is the completion of \({\cal H}_0\) (i.e. equivalence classes of Cauchy sequences in \({\cal H}_0\)).

Evaluation#

  • For any \(h \in {\cal H}\) and any \(t \in T\)

\[ \langle h, K_t \rangle = h(t) \]
  • Proof: verify on \({\cal H}_0\), extend to \({\cal H}\) by continuity.

Example: linear kernel#

  • Take \(T=\mathbb{R}^p\), any positive semidefinite \(\Sigma \in \mathbb{R}^{p \times p}\), and the kernel \(K(u,v) = u'\Sigma v\)

  • For \(v \in \mathbb{R}^p\) the feature map is

\[ K_v(x) = K(v,x) = v'\Sigma x \]
  • We see that \({\cal H}\) can be identified with \(\text{row}(\Sigma)\), equipped with the inner product

\[ \langle K_u, K_v \rangle_{\cal H} = u'\Sigma v, \qquad u, v \in \text{row}(\Sigma). \]
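A minimal numpy sketch of this example, with an illustrative random psd \(\Sigma\); it checks that the feature map \(K_v\) is the linear function with coefficient vector \(\Sigma v \in \text{row}(\Sigma)\), and that the inner product is symmetric:

```python
import numpy as np

# Sketch of the linear-kernel RKHS: K(u, v) = u' Sigma v for a positive
# semidefinite Sigma (the matrix below is an illustrative random choice).
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
Sigma = A.T @ A                          # 5 x 5, psd, rank 3

def K(u, v):
    return u @ Sigma @ v                 # the kernel K(u, v) = u' Sigma v

# Feature map K_v(x) = K(v, x) = v' Sigma x is the linear function with
# coefficient vector Sigma v, an element of row(Sigma):
v, x = rng.standard_normal(5), rng.standard_normal(5)
coef = Sigma @ v
assert np.isclose(K(v, x), coef @ x)

# <K_u, K_v>_H = K(u, v) is symmetric, as an inner product must be:
u = rng.standard_normal(5)
assert np.isclose(K(u, v), K(v, u))
```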

Example: piecewise linear kernel#

  • Take \(T=[0,1]\) and

\[ {\cal H} = \left\{f:[0,1] \to \mathbb{R}: f(0)=0, \dot{f} \in L^2([0,1]) \right\} \]
  • I.e.

\[ \|f\|^2_{\cal H} = \|\dot{f}\|^2_{L^2([0,1])} \]
  • This is the completion of

\[ {\cal H}_0 = \left\{ \sum_i a_i K_{t_i} \right\} \]

with

\[ K_{t}(s) = \min(t, s), \qquad s,t \in [0,1] \]
  • The feature maps are piecewise linear functions…
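To see the piecewise linear shape concretely, here is a small numpy sketch: \(K_t(s)=\min(t,s)\) rises with slope 1 until \(s=t\), then stays flat at height \(t\):

```python
import numpy as np

# Sketch: the feature map K_t(s) = min(t, s) is a piecewise linear "hinge",
# slope 1 on [0, t] and constant (equal to t) on [t, 1].
t = 0.4
s = np.linspace(0, 1, 101)
K_t = np.minimum(t, s)

assert np.isclose(K_t[0], 0.0)                     # K_t(0) = 0
assert np.allclose(K_t[s >= t - 1e-12], t)         # flat at height t for s >= t
slopes = np.diff(K_t[s <= t + 1e-12]) / np.diff(s[s <= t + 1e-12])
assert np.allclose(slopes, 1.0)                    # slope 1 before the kink
```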

Example: piecewise cubic kernel#

  • Take \(T=[0,1]\) and

\[ {\cal H} = \left\{f:[0,1] \to \mathbb{R}: f(0)=0, \dot{f}(0)=0, \ddot{f} \in L^2([0,1]) \right\} \]
  • We recognize the smoothing spline penalty

\[ \int_{[0,1]} \ddot{f}(s)^2 \; ds \]
  • The feature maps are piecewise cubic functions…

  • What is the corresponding kernel?

Gaussian process#

  • To see what the corresponding kernel is, it is helpful to consider another object associated to \(K\)

  • For any covariance kernel \(K\), we can associate a mean-zero Gaussian process \(Z:T \times \Omega \rightarrow \mathbb{R}\):

\[ \text{Cov}(Z_t, Z_s) = K(t,s) \]
  • For the piecewise linear kernel, the process is Brownian motion on \([0,1]\); let’s call it \(B\).

  • For the piecewise cubic kernel, the process is integrated Brownian motion:

\[ Z_t = \int_{[0,t]} B_s \; ds \]
  • Straightforward to check that its covariance, \(K(s,t) = \frac{s^2 t}{2} - \frac{s^3}{6}\) for \(s \leq t\), is piecewise cubic…
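The covariance computation can be sanity-checked by simulation. A Monte Carlo sketch (grid size, path count, and the evaluation points are arbitrary choices):

```python
import numpy as np

# Monte Carlo sketch: simulate integrated Brownian motion Z_t = int_0^t B_s ds
# and compare its covariance with the closed form (for s <= t)
#   Cov(Z_s, Z_t) = s^2 t / 2 - s^3 / 6,
# which is piecewise cubic in each argument.
rng = np.random.default_rng(2)
n_paths, n_grid = 50_000, 100
dt = 1.0 / n_grid
dB = rng.standard_normal((n_paths, n_grid)) * np.sqrt(dt)
B = np.cumsum(dB, axis=1)                 # Brownian motion on the grid
Z = np.cumsum(B, axis=1) * dt             # Riemann sum for int_0^t B_s ds

s, t = 0.5, 0.8
emp = np.mean(Z[:, 49] * Z[:, 79])        # empirical Cov(Z_s, Z_t)
exact = s**2 * t / 2 - s**3 / 6
print(emp, exact)                         # close, up to MC/discretization error
```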

Kernel trick (aka Representer theorem)#

  • Let \({\cal T}=(t_1, \dots, t_n)\) denote any finite set in \(T\)

  • Associated to \({\cal T}\) are:

\[\begin{split} \begin{aligned} L_{\cal T} &= \text{span}(K_{t_i}, t_i \in {\cal T}) \\ P_{\cal T}(h) &= \text{argmin}_{k \in L_{\cal T}} \|h - k\|^2_{\cal H} \\ \end{aligned} \end{split}\]

Orthogonality / interpolation#

  • For any \(h \in {\cal H}\) and any \(t_i \in {\cal T}\)

\[ 0 = \langle h - P_{\cal T} h, K_{t_i} \rangle_{\cal H} = h(t_i) - (P_{\cal T} h)(t_i) \]
  • For any function \(\ell\) of the fitted values for \(h \in {\cal H}\)

\[ \ell(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \geq \ell((P_{\cal T}h)(t_1), \dots, (P_{\cal T}h)(t_n)) + \lambda \|P_{\cal T}h\|^2_{\cal H} \]

Implications#

  • \(\implies\)

\[ \text{argmin}_{h \in {\cal H}} {\ell}(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \subset L_{\cal T} \]
  • \(\implies\) suffices to solve the finite dimensional problem

\[ \text{argmin}_{h \in L_{\cal T}} {\ell}(h(t_1), \dots, h(t_n)) + \lambda \|h\|^2_{\cal H} \]
  • Functions in \(L_{\cal T}\) can be parameterized as

\[ h = \sum_i \alpha_i K_{t_i} \]
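A minimal numpy sketch of this reduction, using squared-error loss (kernel ridge regression) and the piecewise linear kernel as illustrative choices; the data-generating function and \(\lambda\) are arbitrary:

```python
import numpy as np

# Representer theorem at work: with h = sum_i alpha_i K_{t_i}, the objective
#   sum_i (y_i - h(t_i))^2 + lam * ||h||^2_H
# becomes ||y - G alpha||^2 + lam * alpha' G alpha, whose minimizer is
#   alpha = (G + lam I)^{-1} y.
rng = np.random.default_rng(3)
n = 30
t = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(n)

G = np.minimum(t[:, None], t[None, :])       # G_ij = K(t_i, t_j), min kernel
lam = 1e-2
alpha = np.linalg.solve(G + lam * np.eye(n), y)

def h(s):
    return np.sum(alpha * np.minimum(t, s))  # h(s) = sum_i alpha_i K_{t_i}(s)

print(h(0.0))                                # 0.0, since every K_{t_i}(0) = 0
```

Note the infinite-dimensional problem over \({\cal H}\) has been replaced by an \(n\)-dimensional linear solve.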

Bases for an RKHS#

Mercer’s theorem / Karhunen-Loève expansion#

  • Take \(T\) to be a bounded subset of \(\mathbb{R}^p\) with Lebesgue measure \(\Lambda_T\) and consider

\[ {\cal H} \ni h(\cdot) \overset{\bar{K}}{\to} \int_T K(t, \cdot) h(t) \; \Lambda_T(dt) \]
  • We can view \(\bar{K}:L^2(\Lambda_T) \to L^2(\Lambda_T)\) which admits an eigen-decomposition

\[ \bar{K}e_i = \lambda_i e_i \]
  • Formally (sums over index \(i\) may be finite or infinite)

\[ K(t,s) = \sum_i \lambda_i e_i(t) e_i(s) \]
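A numerical sketch of this eigen-decomposition for the min kernel with Lebesgue measure, where the eigenvalues are classically known to be \(\lambda_k = ((k-\tfrac12)\pi)^{-2}\); the grid size is an arbitrary discretization choice:

```python
import numpy as np

# Discretize the integral operator (K-bar h)(s) = int_0^1 min(t, s) h(t) dt
# on a midpoint grid and compare its eigenvalues with the classical values
#   lambda_k = ((k - 1/2) pi)^{-2}.
n = 1000
t = (np.arange(n) + 0.5) / n                  # midpoint grid on [0, 1]
G = np.minimum(t[:, None], t[None, :])
evals = np.linalg.eigvalsh(G / n)[::-1]       # operator eigenvalues, descending

exact = 1.0 / ((np.arange(1, 6) - 0.5) * np.pi) ** 2
print(evals[:5])                              # close to exact for small k
print(exact)
```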

Multiple expansions#

  • Replace Lebesgue measure with a probability measure \(F\) with density \(f\) whose support is all of \(T\):

\[ {\cal H} \ni h(\cdot) \overset{\bar{K}_F}{\to} \int_T K(t, \cdot) f(t) h(t) \; dt \]
  • We now have \(\bar{K}_F:L^2(F) \to L^2(F)\) with a similar eigen-decomposition

\[ K(t,s) = \sum_i \lambda_{i,F} e_{i,F}(t) e_{i,F}(s) \]

Which is preferable?#

  • Computationally, one may want to use a low-rank approximation to \(K\)

  • We’d expect the Lebesgue expansion to be preferable for approximation in \(L^2(\Lambda_T \times \Lambda_T)\), while the \(F\)-based expansion would be preferable for approximation in \(L^2(F \times F)\)

Realizing the Gaussian process#

  • Let \(\xi_i(\omega) \overset{IID}{\sim} N(0,1)\)

  • Given a basis \(e_{i,F}\) from such an expansion (or any other ONB) set

\[ Z_t(\omega) = \sum_i \lambda_{i,F}^{1/2} \xi_i(\omega) e_{i,F}(t) \]
  • As \(e_{i,F}(t) = \langle e_{i,F}, K_t \rangle_{\cal H}\) we can define, for any \(h \in {\cal H}\)

\[ Z(h,\omega) = \sum_i \lambda_{i,F}^{1/2} \xi_i(\omega) \langle e_{i,F}, h \rangle_{\cal H} \]

Orthonormal basis for \({\cal H}\)#

  • Given any orthonormal basis \((\varphi_i)_{1 \leq i \leq \dim({\cal H})}\) of \({\cal H}\) we see

\[\begin{split} \begin{aligned} K(s,t) &= \langle K_s, K_t \rangle_{\cal H} \\ &= \sum_{i} \langle K_s, \varphi_i \rangle_{\cal H} \cdot \langle \varphi_i, K_t \rangle_{\cal H} \\ &= \sum_{i} \varphi_i(t) \varphi_i(s). \end{aligned} \end{split}\]

  • With \(\xi_i \overset{IID}{\sim} N(0,1)\) as before, we can realize the law as

\[ Z_t(\omega) = \sum_{i} \xi_i(\omega) \varphi_i(t) \]
  • Conversely, for countable index set \({\cal I}\), suppose

\[ K(s,t) = \sum_{i \in {\cal I}} \varphi_i(t) \varphi_i(s) \]

Then, \((\varphi_i)_{i \in {\cal I} }\) is an ONB for \({\cal H}\) determined by \(K\). The corresponding process is

\[ Z_t(\omega) = \sum_{i \in {\cal I}} \xi_i(\omega) \varphi_i(t) \]
  • \(\implies\) both \((\lambda_i^{1/2} e_{i})\) and \((\lambda_{i,F}^{1/2} e_{i,F})\) are ONB for \({\cal H}\).

A statistical problem#

  • Suppose we want to approximate the function \(Z\) by its projection onto a single function \(g \in L^2(\Lambda_T)\) (\(\|g\|^2_{L^2(\Lambda_T)}=1\)):

\[ \hat{Z}(\cdot) = \left( \int_T Z_t \, g(t) \; \Lambda_T(dt) \right) g(\cdot) \]
  • We’ll measure accuracy by

\[ \mathbb{E}\left[\int_T (Z_t - \hat{Z}_t)^2 \Lambda_T(dt)\right] \]

  • Minimizing over \(g\) (unsurprisingly?) is seen to be the same as

\[ \text{maximize}_{\{g:\|g\|^2_{L^2(\Lambda_T)}=1\}} \int_T \int_T K(s,t) g(s) g(t) \; \Lambda_T(ds) \; \Lambda_T(dt) \]
  • Solution is to take the top Karhunen-Loève eigenfunction \(e_{1,\Lambda_T}\)

  • Generalizes to projection onto a subspace of dimension \(k\).

  • Swapping \(\Lambda_T\) to \(F\) yields \(e_{1,F}\)

Connections to PCA #1: Functional PCA#

Connections to PCA #2: Kernel PCA#

  • Given a sample \(X_i \overset{IID}{\sim} F\) and kernel \(K\), consider

\[ \text{maximize}_{h: \|h\|_{\cal H} \leq 1} \hat{\text{Var}}(h(X)) \]
  • Resulting eigenproblem (see 14.5.4 in ESL) involves the matrix

\[ \left(I - n^{-1}11'\right) G \left(I - n^{-1}11'\right) \]

with \(G_{ij} = K(X_i, X_j)\).

  • This is (the population version of) a low-rank perturbation of the original kernel… corresponding process is

\[ Z_t - \int_T Z_s f(s) \; ds \]