Linear models

  • We’ve seen several linear models in the course, most of them likelihood based…

  • Generic setup: independent \(Y_i\)’s parameterized by \(\eta_i\).

  • Linear model imposes \(\eta_i=X_i^T\beta\).

  • Deviance (\(-2\) log-likelihood) provides an objective function for estimating \(\beta\).

  • These models have nice structure when the \(Y_i\)’s come from an exponential family: the negative log-likelihood is often convex, which makes it easy to regularize, go to high dimensions, etc.


Likelihoods we’ve looked at

  1. Binomial

  2. Poisson

  3. Multinomial

  4. Cox partial likelihood


Other models: multivariate Gaussian

  • Setup: \(Y_{n \times q}\), \(X_{n \times p}\)

  • Model: \(Y|X \sim N(XB, I_{n \times n} \otimes \Sigma_{q \times q})\) for \(B_{p \times q}\) (the Kronecker covariance is notation for the rows of the error being IID draws from \(N(0,\Sigma)\)).

  • A likelihood-based model: the MLE (exactly analogous to the univariate response; a numpy sketch follows below):

\[\begin{split} \begin{aligned} \widehat{B} &= (X^TX)^{-1}X^TY \\ \widehat{\Sigma} &= \frac{1}{n}Y^T(I - X(X^TX)^{-1}X^T)Y \end{aligned} \end{split}\]
  • Can regularize in groups for group sparsity

\[ {\cal P}(B) = \sum_j \|B[j,]\|_2 \]
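
A minimal numpy sketch of this slide on simulated data: it computes the MLE displayed above and evaluates the row-wise group penalty \({\cal P}(\widehat{B})\). All dimensions and parameter values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 5, 3

# simulate Y = XB + E, rows of E being IID N(0, Sigma)
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
Y = X @ B + rng.multivariate_normal(np.zeros(q), Sigma, size=n)

# MLE, exactly as on the slide
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
R = Y - X @ B_hat                           # residuals (I - H) Y
Sigma_hat = R.T @ R / n                     # (1/n) Y^T (I - H) Y

# row-wise group penalty P(B) = sum_j ||B[j,]||_2
penalty = np.linalg.norm(B_hat, axis=1).sum()
print(B_hat.round(2), Sigma_hat.round(2), round(penalty, 2))
```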

Other models: robust regression rlm

  • Setup: \(Y_{n \times 1}\), \(X_{n \times p}\)

  • Model: typically not thought of as a likelihood…

  • Estimator:

\[ (\widehat{\beta}, \widehat{\sigma}) = \text{argmin}_{\beta, \sigma} \sum_{i=1}^n \rho((Y_i-X_i^T\beta)/\sigma) + n \log(\sigma) \]
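
A sketch of this estimator under one concrete choice of \(\rho\) (the Huber function listed on the next slide), optimizing jointly over \(\beta\) and \(\log\sigma\) with scipy; the simulated heavy-tailed data and the tuning constant are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.standard_t(df=2, size=n)   # heavy-tailed noise

def huber(r, c=1.345):
    """Huber rho: quadratic near 0, linear in the tails."""
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def objective(par):
    beta, log_sigma = par[:p], par[p]
    sigma = np.exp(log_sigma)                      # keep sigma > 0
    r = (Y - X @ beta) / sigma
    return huber(r).sum() + n * log_sigma          # objective from the slide

fit = minimize(objective, x0=np.zeros(p + 1), method="BFGS")
beta_hat, sigma_hat = fit.x[:p], np.exp(fit.x[p])
print(beta_hat.round(2), round(sigma_hat, 2))
```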

Other models: robust regression rlm

  • Examples of \(\rho\): \(\ell_1\), Huber, Tukey biweight, or other choices with a redescending \(\psi\).

  • Regularization? Sure… (Huber and \(\ell_1\) are at least convex…)


Other models: quantile regression

  • Consider \(\ell_1\):

\[ \text{minimize}_{\beta} \sum_{i=1}^n |Y_i-X_i^T\beta| \]
  • Instead of modeling \(E[Y|X]\) this models \(\text{median}(Y|X)\).

  • Pinball loss: for \(\tau \in (0,1)\)

\[\begin{split} \ell_{\tau}(Y,\eta) = \begin{cases} \tau (Y-\eta) & Y > \eta \\ -(1-\tau)(Y-\eta) & Y\leq \eta \end{cases} \end{split}\]
  • Models the \(\tau\)-quantile of \(Y|X\).
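
A small numpy sketch of the pinball loss, plus a sanity check that minimizing it over a constant \(\eta\) recovers (approximately) the sample \(\tau\)-quantile; the simulated data are arbitrary.

```python
import numpy as np

def pinball(y, eta, tau):
    """Pinball loss: tau*(y-eta) above eta, (1-tau)*(eta-y) at or below."""
    r = y - eta
    return np.where(r > 0, tau * r, -(1 - tau) * r)

# the constant eta minimizing the average pinball loss is roughly the tau-quantile
rng = np.random.default_rng(2)
y = rng.exponential(size=10_000)
grid = np.linspace(0, 5, 2001)
losses = [pinball(y, eta, tau=0.9).mean() for eta in grid]
print(grid[np.argmin(losses)], np.quantile(y, 0.9))   # should be close
```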


Other models: robust regression rlm

  • Score equations

\[\begin{split} \begin{aligned} X^T\psi((Y - X\widehat{\beta})/\widehat{\sigma}) &= 0 \\ \sum_{i=1}^n\psi((Y_i - X_i^T\widehat{\beta})/\widehat{\sigma}) (Y_i - X_i^T\widehat{\beta})/\widehat{\sigma} &= n \end{aligned} \end{split}\]
  • Inference? Sandwich or bootstrap.


Other models: support vector machine (SVM)

  • Not always phrased this way, but equivalent to using hinge loss

  • For binary outcome \(Y\) coded as \(\pm 1\) and linear predictor \(\eta\)

\[ \ell_{\text{Hinge}}(Y,\eta) = \max(0, 1 - Y \cdot \eta) \]
  • The Huberized SVM smooths out this loss, just as the Huber loss smooths the \(\ell_1\) loss (Moreau smoothing)
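
A tiny numpy sketch of the hinge loss and of one common Moreau-smoothed (“Huberized”) version of it; the smoothing parameter \(\delta\) here is an arbitrary choice.

```python
import numpy as np

def hinge(y, eta):
    """Hinge loss for y in {-1, +1}: max(0, 1 - y*eta)."""
    return np.maximum(0.0, 1.0 - y * eta)

def huberized_hinge(y, eta, delta=0.5):
    """Moreau envelope of the hinge loss with parameter delta:
    zero past the margin, quadratic on (1 - delta, 1), linear below."""
    t = y * eta
    return np.where(t >= 1.0, 0.0,
           np.where(t > 1.0 - delta,
                    (1.0 - t) ** 2 / (2.0 * delta),
                    1.0 - t - delta / 2.0))

eta = np.linspace(-2, 2, 9)
print(hinge(1.0, eta).round(2))
print(huberized_hinge(1.0, eta).round(2))
```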


Other models: support vector machine (SVM)

  • Regularization: convex in \(\eta\) so, sure…

  • Inference? Bootstrap?
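
One way to carry out the bootstrap suggested above, sketched with scikit-learn’s LinearSVC (assumed available); the simulated data, \(C = 1\), and 200 resamples are all arbitrary choices.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n, p = 300, 2
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.normal(size=n))

# nonparametric bootstrap of the fitted coefficients
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)   # resample rows with replacement
    clf = LinearSVC(C=1.0, loss="hinge", max_iter=10_000).fit(X[idx], y[idx])
    coefs.append(clf.coef_.ravel())
coefs = np.array(coefs)
print(coefs.mean(axis=0).round(2), coefs.std(axis=0).round(2))   # bootstrap SEs
```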


Mixed effects models

  • Setup: \(Y_{n \times 1}\), \(X_{n \times p}\)

  • Model: \(Y|X \overset{D}{=} X\beta + Z\gamma + \epsilon\)

  • Measurement error \(\epsilon \sim N(0, \sigma^2 I)\)

  • Random effect (usually independent of \(\epsilon\)) \(\gamma \sim N(0, \Sigma(\theta))\).


Mixed effects models

  • Log-likelihood

\[\begin{split} \begin{aligned} \ell(\beta,\sigma,\theta | Y,X) &= -\frac{1}{2} (Y-X\beta)^T(\sigma^2 I + Z\Sigma(\theta)Z^T)^{-1}(Y-X\beta) \\ & \qquad - \frac{1}{2} \log\left(\det(\sigma^2 I + Z\Sigma(\theta)Z^T)\right) \end{aligned} \end{split}\]
  • Estimation: not convex but smooth…

  • Inference: likelihood based (if you believe the model)
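
A numpy sketch of the marginal log-likelihood above (dropping the additive constant) for a toy random-intercept design; all dimensions and parameter values are illustrative.

```python
import numpy as np

def mixed_loglik(beta, sigma2, Sigma_gamma, Y, X, Z):
    """Marginal Gaussian log-likelihood (up to an additive constant):
    Y ~ N(X beta, sigma2 * I + Z Sigma_gamma Z^T)."""
    V = sigma2 * np.eye(len(Y)) + Z @ Sigma_gamma @ Z.T
    r = Y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * r @ np.linalg.solve(V, r) - 0.5 * logdet

# toy random-intercept example: 10 groups of 5; Z maps group to observation
rng = np.random.default_rng(4)
g = np.repeat(np.arange(10), 5)
Z = np.eye(10)[g]                                   # n x 10 indicator matrix
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta = np.array([1.0, 2.0])
Y = X @ beta + Z @ rng.normal(scale=0.7, size=10) + rng.normal(scale=0.3, size=50)
print(mixed_loglik(beta, 0.3**2, 0.7**2 * np.eye(10), Y, X, Z))
```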


Mixed effects models: estimation

  • EM algorithm: if we observed \(\gamma\), the complete-data likelihood would be nice and simple.

  • Gradient descent…

  • Messy, but feasible to run Newton-Raphson…

  • Since we have a likelihood, Bayesian methods are applicable here.
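
In practice the optimization is usually handed to an existing implementation; a sketch assuming statsmodels is available, whose MixedLM fits the random-intercept model by ML or REML (simulated data, arbitrary parameter values).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a random-intercept data set: 20 groups of 5 observations
rng = np.random.default_rng(5)
g = np.repeat(np.arange(20), 5)
x = rng.normal(size=100)
b = rng.normal(scale=0.5, size=20)                  # group-level random intercepts
y = 1.0 + 2.0 * x + b[g] + rng.normal(scale=0.3, size=100)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# random-intercept fit; reml=True uses the REML criterion discussed next
fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit(reml=True)
print(fit.summary())
```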


Mixed effects models: REML

  • REML: notes that the law of the residuals does not depend on \(\beta\)

\[ R = (I - X(X^TX)^{-1}X^T)Y \]
  • Decomposes the problem into noise-parameter estimation and fixed-effects estimation.

  • The likelihood of the residuals \(R\) can be used to estimate \((\theta,\sigma^2)\).

  • Given this estimate \((\widehat{\sigma}^2, \widehat{\theta})\) the optimal \(\beta\) is

\[ (X^T(\widehat{\sigma}^2I + Z\Sigma(\widehat{\theta})Z^T)^{-1}X)^{-1} X^T(\widehat{\sigma}^2I + Z\Sigma(\widehat{\theta})Z^T)^{-1}Y \]
  • REML often gives a better (less biased) estimate of the noise parameters than the MLE: in the usual Gaussian model, \(\widehat{\sigma}^2_{REML} = \|Y-X\widehat{\beta}\|^2_2/(n-p)\) rather than dividing by \(n\).
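
A numpy sketch of the plug-in GLS step displayed above; here the true variance parameters of a toy simulated design stand in for the estimates \((\widehat{\sigma}^2, \widehat{\theta})\) that would normally come from the REML fit.

```python
import numpy as np

def gls_beta(Y, X, Z, sigma2_hat, Sigma_hat):
    """GLS fixed-effect estimate with plug-in covariance
    V = sigma2 * I + Z Sigma Z^T (the formula displayed above)."""
    V = sigma2_hat * np.eye(len(Y)) + Z @ Sigma_hat @ Z.T
    Vinv_X = np.linalg.solve(V, X)
    Vinv_Y = np.linalg.solve(V, Y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

# toy random-intercept design; true variances used in place of estimates
rng = np.random.default_rng(4)
g = np.repeat(np.arange(10), 5)
Z = np.eye(10)[g]
X = np.column_stack([np.ones(50), rng.normal(size=50)])
Y = X @ np.array([1.0, 2.0]) + Z @ rng.normal(scale=0.7, size=10) + rng.normal(scale=0.3, size=50)
print(gls_beta(Y, X, Z, 0.3**2, 0.7**2 * np.eye(10)).round(2))
```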


Generalized linear mixed models

  • The Gaussian mixed effects model essentially models the linear predictor

\[\begin{split} \begin{aligned} \eta &= X\beta + Z\gamma \\ &= \text{fixed effect} + \text{noise} \end{aligned} \end{split}\]
  • Lends itself naturally to glm likelihoods as well.

  • E.g. for Binomial data with logistic link

\[ Y|X,Z,\gamma \sim \text{Binomial}\left(\frac{e^{\eta}}{1+e^{\eta}}\right) \]

Generalized linear mixed models

  • The likelihood marginalizes over \(\gamma\); there is no closed form (unlike the Gaussian model). See the sketch below.

  • Many computational issues to deal with here.

  • Well suited to Bayesian modeling (though if optimization is difficult, the posterior is likely quite multi-modal under non-informative priors…)
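
The sketch referenced above: a crude Monte Carlo approximation of the marginal log-likelihood for a random-intercept logistic model. The data, parameter values, and number of draws are all arbitrary, and far better integration schemes exist.

```python
import numpy as np

def mc_marginal_loglik(beta, tau, Y, X, g, n_mc=2000, rng=None):
    """Monte Carlo approximation of the marginal log-likelihood of a
    random-intercept logistic GLMM: for each group k, average the
    Bernoulli likelihood over draws gamma_k ~ N(0, tau^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for k in np.unique(g):
        idx = g == k
        eta0 = X[idx] @ beta                      # fixed-effect part of eta
        gam = rng.normal(scale=tau, size=n_mc)    # draws of the random intercept
        eta = eta0[:, None] + gam[None, :]        # n_k x n_mc linear predictors
        p = 1.0 / (1.0 + np.exp(-eta))
        lik = np.prod(np.where(Y[idx, None] == 1, p, 1.0 - p), axis=0)
        total += np.log(lik.mean())
    return total

# toy data: 15 groups of 8 binary responses
rng = np.random.default_rng(6)
g = np.repeat(np.arange(15), 8)
X = np.column_stack([np.ones(120), rng.normal(size=120)])
gam = rng.normal(scale=1.0, size=15)
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]) + gam[g])))
Y = rng.binomial(1, p)
print(mc_marginal_loglik(np.array([-0.5, 1.0]), 1.0, Y, X, g))
```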


Beyond linear models: basis expansions

  • We’ve always used linear predictor

\[ \eta = X\beta = \sum_{j=1}^p \beta_j X_j \]
  • Could use a basis expansion with \(N_j\) functions for feature \(j\)

\[ \eta = \sum_{j=1}^p \sum_{l=1}^{N_j} \beta_{jl} h_{jl}(X_j) \]
  • Any linear method that takes linear predictor can profit from basis expansion.

  • Generally, these form additive models

\[ \eta = \sum_{j=1}^p f_j(X_j) \]
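
A minimal sketch of an additive model built from an explicit basis expansion and fit by ordinary least squares; the degree-5 polynomial basis and simulated data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=n)

def poly_basis(x, degree=5):
    """Simple polynomial basis expansion h_l(x) = x^l, l = 1..degree."""
    return np.column_stack([x ** l for l in range(1, degree + 1)])

# additive model: eta = sum_j sum_l beta_{jl} h_{jl}(X_j), fit by least squares
H = np.column_stack([np.ones(n)] + [poly_basis(X[:, j]) for j in range(2)])
beta = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(beta, 2))
```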

Implicit basis expansions

  • As the basis expansion grows in complexity, we should regularize in some fashion…

  • For an explicit basis, we could choose sparsity or a ridge-type penalty

\[ {\cal P}_j\left(\sum_{l=1}^{N_j} \beta_{jl} h_{jl}\right) = \sum_{l=1}^{N_j} \beta_{jl}^2 \]

Kernel trick

  • A natural class of penalties comes from covariance (kernel) functions

  • A covariance function \(K:\mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) is a function that satisfies:

    1. \(K(s,t)=K(t,s)\) (symmetry)

    2. For any \(\{t_1, \dots, t_k\} \subset\mathbb{R}\) and coefficients \(\{a_1, \dots, a_k\}\) we have \(\sum_{i,j=1}^k a_i a_j K(t_i, t_j)\geq 0\). (non-negative definite)

  • Example: radial basis function \(K(t,s) = e^{-(t-s)^2/(2\gamma)}\).

  • Example: Brownian motion \(K(t,s) = \min(t,s)\).
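
A small numpy sketch of the two example kernels, with a numerical check of non-negative definiteness via the eigenvalues of a Gram matrix on an arbitrary grid of points.

```python
import numpy as np

def rbf_kernel(s, t, gamma=1.0):
    """Radial basis function kernel K(t,s) = exp(-(t-s)^2 / (2*gamma))."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

def brownian_kernel(s, t):
    """Brownian motion kernel K(t,s) = min(t,s), for t, s >= 0."""
    return np.minimum(s[:, None], t[None, :])

t = np.sort(np.random.default_rng(8).uniform(0, 1, size=20))
for K in (rbf_kernel(t, t), brownian_kernel(t, t)):
    # non-negative definiteness: the Gram matrix has no eigenvalue below 0
    print(np.linalg.eigvalsh(K).min() >= -1e-10)
```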


Reproducing kernel penalty

  • Consider functions of the form:

\[ h(s) = \sum_{j=1}^k a_j K(t_j,s) \]
  • The penalty is

\[ \|h\|^2_{K} = \sum_{i,j=1}^k a_i a_j K(t_i,t_j) \]

Penalized regression

\[ \text{minimize}_{f_1, \dots, f_p} -\log L(f_1,\dots,f_p|X,Y) + \sum_{j=1}^p \lambda_j \|f_j\|^2_{K_j}. \]
  • Kernels can take multivariate arguments \(K:\mathbb{R}^k \times \mathbb{R}^k \rightarrow \mathbb{R}\).

  • Greatly expands the flexibility of regression modelling… (at the cost of having more choices to make!)
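
For the special case of a Gaussian likelihood with a single RBF kernel, the problem above reduces to kernel ridge regression; a minimal numpy sketch of that special case (bandwidth, \(\lambda\), and simulated data are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)

def rbf(s, t, gamma=0.5):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

# Gaussian log-likelihood + RKHS penalty with one feature:
# by the representer theorem f = sum_i a_i K(x_i, .), and the problem
# becomes  minimize_a ||y - K a||^2 + lam * a^T K a,
# whose solution is a = (K + lam I)^{-1} y
lam = 1.0
K = rbf(x, x)
a = np.linalg.solve(K + lam * np.eye(n), y)
f_hat = K @ a                                       # fitted values f(x_i)
print(np.round(np.mean((f_hat - np.sin(x)) ** 2), 3))
```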


Combining ideas

  • Kernel trick tells us how to make lots of cool feature mappings…

  • Can be plugged into other regression problems, e.g. “non-parametric” quantile regression

\[ \text{minimize}_f \sum_{i=1}^n \ell_{\tau}(Y_i,f(X_i)) + \lambda \|f\|^2_{K}. \]
  • “Non-parametric” survival analysis

\[ \text{minimize}_f \sum_{i:\delta_i=1} \left[ \log \left(\sum_{j:O_j \geq O_i} \exp(f(X_j)) \right) - f(X_i)\right] + \lambda \|f\|^2_K. \]
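
A crude sketch of the “non-parametric” quantile regression above: by the representer theorem the minimizer is a kernel expansion, and here the resulting finite-dimensional problem is attacked with plain subgradient descent. The step size, iteration count, \(\lambda\), and bandwidth are untuned, arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(10)
n, tau, lam = 150, 0.9, 0.1
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + rng.exponential(scale=0.5, size=n)   # skewed noise

def rbf(s, t, gamma=0.5):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

# representer theorem: f = sum_i a_i K(x_i, .), so minimize over a
#   sum_i pinball_tau(y_i, (K a)_i) + lam * a^T K a
K = rbf(x, x)
a = np.zeros(n)
for _ in range(20_000):
    eta = K @ a
    g = np.where(y > eta, -tau, 1.0 - tau)            # subgradient of pinball in eta
    a -= 2e-4 * (K @ g + 2.0 * lam * (K @ a))         # chain rule through eta = K a
print(np.mean(y <= K @ a))   # fraction of points below the fit (ideally close to tau)
```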