Linear models

  • We’ve seen several linear models in the course, most of them likelihood based…

  • Generic setup: independent \(Y_i\)’s parameterized by \(\eta_i\).

  • Linear model imposes \(\eta_i=X_i^T\beta\).

  • Deviance (\(-2\) log-likelihood) provides an objective function for estimating \(\beta\).

  • These models have nice structure when the \(Y_i\)’s come from an exponential family: the negative log-likelihood is often convex, which makes it easy to regularize, go to high dimensions, etc.


Likelihoods we’ve looked at

  1. Binomial

  2. Poisson

  3. Multinomial

  4. Cox partial likelihood


Other models: multivariate Gaussian

  • Setup: \(Y_{n \times q}\), \(X_{n \times p}\)

  • Model: \(Y|X \sim N(XB, I_{n \times n} \otimes \Sigma_{q \times q})\) for \(B_{p \times q}\) (the Kronecker covariance is notation for the rows of the error being IID draws from \(N(0,\Sigma)\)).

  • A likelihood-based model: the MLE (exactly analogous to the univariate response; a numpy sketch follows below):

\[\begin{split} \begin{aligned} \widehat{B} &= (X^TX)^{-1}X^TY \\ \widehat{\Sigma} &= \frac{1}{n}Y^T(I - X(X^TX)^{-1}X^T)Y \end{aligned} \end{split}\]
  • Can regularize in groups for group sparsity

\[ {\cal P}(B) = \sum_j \|B[j,]\|_2 \]
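
A minimal numpy sketch of this slide on simulated data: it computes the MLE displayed above and evaluates the row-wise group penalty \({\cal P}(\widehat{B})\). All dimensions and parameter values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 5, 3

# simulate Y = XB + E, rows of E being IID N(0, Sigma)
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
Y = X @ B + rng.multivariate_normal(np.zeros(q), Sigma, size=n)

# MLE, exactly as on the slide
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
R = Y - X @ B_hat                           # residuals (I - H) Y
Sigma_hat = R.T @ R / n                     # (1/n) Y^T (I - H) Y

# row-wise group penalty P(B) = sum_j ||B[j,]||_2
penalty = np.linalg.norm(B_hat, axis=1).sum()
print(B_hat.round(2), Sigma_hat.round(2), round(penalty, 2))
```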

Other models: robust regression rlm

  • Setup: \(Y_{n \times 1}\), \(X_{n \times p}\)

  • Model: typically not thought of as a likelihood…

  • Estimator:

\[ (\widehat{\beta}, \widehat{\sigma}) = \text{argmin}_{\beta, \sigma} \sum_{i=1}^n \rho((Y_i-X_i^T\beta)/\sigma) + n \log(\sigma) \]
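
A sketch of this estimator under one concrete choice of \(\rho\) (the Huber function listed on the next slide), optimizing jointly over \(\beta\) and \(\log\sigma\) with scipy; the simulated heavy-tailed data and the tuning constant are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.standard_t(df=2, size=n)   # heavy-tailed noise

def huber(r, c=1.345):
    """Huber rho: quadratic near 0, linear in the tails."""
    return np.where(np.abs(r) <= c, 0.5 * r**2, c * np.abs(r) - 0.5 * c**2)

def objective(par):
    beta, log_sigma = par[:p], par[p]
    sigma = np.exp(log_sigma)                      # keep sigma > 0
    r = (Y - X @ beta) / sigma
    return huber(r).sum() + n * log_sigma          # objective from the slide

fit = minimize(objective, x0=np.zeros(p + 1), method="BFGS")
beta_hat, sigma_hat = fit.x[:p], np.exp(fit.x[p])
print(beta_hat.round(2), round(sigma_hat, 2))
```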

Other models: robust regression rlm

  • Examples of \(\rho\): \(\ell_1\), Huber, Tukey biweight, or other choices with a redescending \(\psi\).

  • Regularization? Sure… (Huber and \(\ell_1\) are at least convex…)


Other models: quantile regression

  • Consider \(\ell_1\):

\[ \text{minimize}_{\beta} \sum_{i=1}^n |Y_i-X_i^T\beta| \]
  • Instead of modeling \(E[Y|X]\) this models \(\text{median}(Y|X)\).

  • Pinball loss: for \(\tau \in (0,1)\)

\[\begin{split} \ell_{\tau}(Y,\eta) = \begin{cases} \tau (Y-\eta) & Y > \eta \\ -(1-\tau)(Y-\eta) & Y\leq \eta \end{cases} \end{split}\]
  • Models the \(\tau\)-quantile of \(Y|X\).
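
A small numpy sketch of the pinball loss, plus a sanity check that minimizing it over a constant \(\eta\) recovers (approximately) the sample \(\tau\)-quantile; the simulated data are arbitrary.

```python
import numpy as np

def pinball(y, eta, tau):
    """Pinball loss: tau*(y-eta) above eta, (1-tau)*(eta-y) at or below."""
    r = y - eta
    return np.where(r > 0, tau * r, -(1 - tau) * r)

# the constant eta minimizing the average pinball loss is roughly the tau-quantile
rng = np.random.default_rng(2)
y = rng.exponential(size=10_000)
grid = np.linspace(0, 5, 2001)
losses = [pinball(y, eta, tau=0.9).mean() for eta in grid]
print(grid[np.argmin(losses)], np.quantile(y, 0.9))   # should be close
```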


Other models: robust regression rlm

  • Score equations

\[\begin{split} \begin{aligned} X^T\psi((Y - X\widehat{\beta})/\widehat{\sigma}) &= 0 \\ \sum_{i=1}^n\psi((Y_i - X_i^T\widehat{\beta})/\widehat{\sigma}) (Y_i - X_i^T\widehat{\beta})/\widehat{\sigma} &= n \end{aligned} \end{split}\]
  • Inference? Sandwich or bootstrap.


Other models: support vector machine (SVM)

  • Not always phrased this way, but equivalent to using hinge loss

  • For binary outcome \(Y\) coded as \(\pm 1\) and linear predictor \(\eta\)

\[ \ell_{\text{Hinge}}(Y,\eta) = \max(0, 1 - Y \cdot \eta) \]
  • The Huberized SVM smooths out this loss, just as the Huber loss smooths the \(\ell_1\) loss (Moreau smoothing)
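
A tiny numpy sketch of the hinge loss and of one common Moreau-smoothed (“Huberized”) version of it; the smoothing parameter \(\delta\) here is an arbitrary choice.

```python
import numpy as np

def hinge(y, eta):
    """Hinge loss for y in {-1, +1}: max(0, 1 - y*eta)."""
    return np.maximum(0.0, 1.0 - y * eta)

def huberized_hinge(y, eta, delta=0.5):
    """Moreau envelope of the hinge loss with parameter delta:
    zero past the margin, quadratic on (1 - delta, 1), linear below."""
    t = y * eta
    return np.where(t >= 1.0, 0.0,
           np.where(t > 1.0 - delta,
                    (1.0 - t) ** 2 / (2.0 * delta),
                    1.0 - t - delta / 2.0))

eta = np.linspace(-2, 2, 9)
print(hinge(1.0, eta).round(2))
print(huberized_hinge(1.0, eta).round(2))
```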


Other models: support vector machine (SVM)

  • Regularization: convex in \(\eta\) so, sure…

  • Inference? Bootstrap?
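
One way to carry out the bootstrap suggested above, sketched with scikit-learn’s LinearSVC (assumed available); the simulated data, \(C = 1\), and 200 resamples are all arbitrary choices.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n, p = 300, 2
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.normal(size=n))

# nonparametric bootstrap of the fitted coefficients
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)   # resample rows with replacement
    clf = LinearSVC(C=1.0, loss="hinge", max_iter=10_000).fit(X[idx], y[idx])
    coefs.append(clf.coef_.ravel())
coefs = np.array(coefs)
print(coefs.mean(axis=0).round(2), coefs.std(axis=0).round(2))   # bootstrap SEs
```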


Mixed effects models

  • Setup: \(Y_{n \times 1}\), \(X_{n \times p}\)

  • Model: \(Y|X \overset{D}{=} X\beta + Z\gamma + \epsilon\)

  • Measurement error \(\epsilon \sim N(0, \sigma^2 I)\)

  • Random effect (usually independent of \(\epsilon\)) \(\gamma \sim N(0, \Sigma(\theta))\).


Mixed effects models

  • Log-likelihood

\[\begin{split} \begin{aligned} \ell(\beta,\sigma,\theta | Y,X) &= -\frac{1}{2} (Y-X\beta)^T(\sigma^2 I + Z\Sigma(\theta)Z^T)^{-1}(Y-X\beta) \\ & \qquad - \frac{1}{2} \log\left(\det(\sigma^2 I + Z\Sigma(\theta)Z^T)\right) \end{aligned} \end{split}\]
  • Estimation: not convex but smooth…

  • Inference: likelihood based (if you believe the model)
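
A numpy sketch of the marginal log-likelihood above (dropping the additive constant) for a toy random-intercept design; all dimensions and parameter values are illustrative.

```python
import numpy as np

def mixed_loglik(beta, sigma2, Sigma_gamma, Y, X, Z):
    """Marginal Gaussian log-likelihood (up to an additive constant):
    Y ~ N(X beta, sigma2 * I + Z Sigma_gamma Z^T)."""
    V = sigma2 * np.eye(len(Y)) + Z @ Sigma_gamma @ Z.T
    r = Y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * r @ np.linalg.solve(V, r) - 0.5 * logdet

# toy random-intercept example: 10 groups of 5; Z maps group to observation
rng = np.random.default_rng(4)
g = np.repeat(np.arange(10), 5)
Z = np.eye(10)[g]                                   # n x 10 indicator matrix
X = np.column_stack([np.ones(50), rng.normal(size=50)])
beta = np.array([1.0, 2.0])
Y = X @ beta + Z @ rng.normal(scale=0.7, size=10) + rng.normal(scale=0.3, size=50)
print(mixed_loglik(beta, 0.3**2, 0.7**2 * np.eye(10), Y, X, Z))
```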


Mixed effects models: estimation

  • EM algorithm: if we observed \(\gamma\), the complete-data likelihood would be nice and simple.

  • Gradient descent…

  • Messy, but feasible to run Newton-Raphson…

  • Since we have a likelihood, Bayesian methods are applicable here.
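
In practice the optimization is usually handed to an existing implementation; a sketch assuming statsmodels is available, whose MixedLM fits the random-intercept model by ML or REML (simulated data, arbitrary parameter values).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulate a random-intercept data set: 20 groups of 5 observations
rng = np.random.default_rng(5)
g = np.repeat(np.arange(20), 5)
x = rng.normal(size=100)
b = rng.normal(scale=0.5, size=20)                  # group-level random intercepts
y = 1.0 + 2.0 * x + b[g] + rng.normal(scale=0.3, size=100)
df = pd.DataFrame({"y": y, "x": x, "g": g})

# random-intercept fit; reml=True uses the REML criterion discussed next
fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit(reml=True)
print(fit.summary())
```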


Mixed effects models: REML

  • REML: notes that the law of the residuals does not depend on \(\beta\)

\[ R = (I - X(X^TX)^{-1}X^T)Y \]
  • Decomposes the problem into noise-parameter estimation and fixed-effects estimation.

  • The likelihood of the residuals \(R\) can be used to estimate \((\theta,\sigma^2)\).

  • Given this estimate \((\widehat{\sigma}^2, \widehat{\theta})\) the optimal \(\beta\) is

\[ (X^T(\widehat{\sigma}^2I + Z\Sigma(\widehat{\theta})Z^T)^{-1}X)^{-1} X^T(\widehat{\sigma}^2I + Z\Sigma(\widehat{\theta})Z^T)^{-1}Y \]
  • REML often gives a better (less biased) estimate of the noise parameters than the MLE: in the usual Gaussian model, \(\widehat{\sigma}^2_{REML} = \|Y-X\widehat{\beta}\|^2_2/(n-p)\) rather than dividing by \(n\).
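
A numpy sketch of the plug-in GLS step displayed above; here the true variance parameters of a toy simulated design stand in for the estimates \((\widehat{\sigma}^2, \widehat{\theta})\) that would normally come from the REML fit.

```python
import numpy as np

def gls_beta(Y, X, Z, sigma2_hat, Sigma_hat):
    """GLS fixed-effect estimate with plug-in covariance
    V = sigma2 * I + Z Sigma Z^T (the formula displayed above)."""
    V = sigma2_hat * np.eye(len(Y)) + Z @ Sigma_hat @ Z.T
    Vinv_X = np.linalg.solve(V, X)
    Vinv_Y = np.linalg.solve(V, Y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)

# toy random-intercept design; true variances used in place of estimates
rng = np.random.default_rng(4)
g = np.repeat(np.arange(10), 5)
Z = np.eye(10)[g]
X = np.column_stack([np.ones(50), rng.normal(size=50)])
Y = X @ np.array([1.0, 2.0]) + Z @ rng.normal(scale=0.7, size=10) + rng.normal(scale=0.3, size=50)
print(gls_beta(Y, X, Z, 0.3**2, 0.7**2 * np.eye(10)).round(2))
```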


Generalized linear mixed models

  • The Gaussian mixed effects model essentially models the linear predictor

\[\begin{split} \begin{aligned} \eta &= X\beta + Z\gamma \\ &= \text{fixed effect} + \text{noise} \end{aligned} \end{split}\]
  • Lends itself naturally to glm likelihoods as well.

  • E.g. for Binomial data with logistic link

\[ Y|X,Z,\gamma \sim \text{Binomial}\left(\frac{e^{\eta}}{1+e^{\eta}}\right) \]

Generalized linear mixed models

  • The likelihood marginalizes over \(\gamma\); there is no closed form (unlike the Gaussian model). See the sketch below.

  • Many computational issues to deal with here.

  • Well suited to Bayesian modeling (though if optimization is difficult, the posterior is likely quite multi-modal under non-informative priors…)
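
The sketch referenced above: a crude Monte Carlo approximation of the marginal log-likelihood for a random-intercept logistic model. The data, parameter values, and number of draws are all arbitrary, and far better integration schemes exist.

```python
import numpy as np

def mc_marginal_loglik(beta, tau, Y, X, g, n_mc=2000, rng=None):
    """Monte Carlo approximation of the marginal log-likelihood of a
    random-intercept logistic GLMM: for each group k, average the
    Bernoulli likelihood over draws gamma_k ~ N(0, tau^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for k in np.unique(g):
        idx = g == k
        eta0 = X[idx] @ beta                      # fixed-effect part of eta
        gam = rng.normal(scale=tau, size=n_mc)    # draws of the random intercept
        eta = eta0[:, None] + gam[None, :]        # n_k x n_mc linear predictors
        p = 1.0 / (1.0 + np.exp(-eta))
        lik = np.prod(np.where(Y[idx, None] == 1, p, 1.0 - p), axis=0)
        total += np.log(lik.mean())
    return total

# toy data: 15 groups of 8 binary responses
rng = np.random.default_rng(6)
g = np.repeat(np.arange(15), 8)
X = np.column_stack([np.ones(120), rng.normal(size=120)])
gam = rng.normal(scale=1.0, size=15)
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]) + gam[g])))
Y = rng.binomial(1, p)
print(mc_marginal_loglik(np.array([-0.5, 1.0]), 1.0, Y, X, g))
```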


Beyond linear models: basis expansions

  • We’ve always used linear predictor

\[ \eta = X\beta = \sum_{j=1}^p \beta_j X_j \]
  • Could use a basis expansion with \(N_j\) functions for feature \(j\)

\[ \eta = \sum_{j=1}^p \sum_{l=1}^{N_j} \beta_{jl} h_{jl}(X_j) \]
  • Any linear method that takes linear predictor can profit from basis expansion.

  • Generally, these form additive models

\[ \eta = \sum_{j=1}^p f_j(X_j) \]
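
A minimal sketch of an additive model built from an explicit basis expansion and fit by ordinary least squares; the degree-5 polynomial basis and simulated data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.2, size=n)

def poly_basis(x, degree=5):
    """Simple polynomial basis expansion h_l(x) = x^l, l = 1..degree."""
    return np.column_stack([x ** l for l in range(1, degree + 1)])

# additive model: eta = sum_j sum_l beta_{jl} h_{jl}(X_j), fit by least squares
H = np.column_stack([np.ones(n)] + [poly_basis(X[:, j]) for j in range(2)])
beta = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(beta, 2))
```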

Implicit basis expansions

  • As the basis expansion grows in complexity, we should regularize in some fashion…

  • For an explicit basis, we could choose sparsity or a ridge-type penalty

\[ {\cal P}_j\left(\sum_{l=1}^{N_j} \beta_{jl} h_{jl}\right) = \sum_{l=1}^{N_j} \beta_{jl}^2 \]

Kernel trick

  • A natural class of penalties comes from covariance (kernel) functions

  • A covariance function \(K:\mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}\) is a function that satisfies:

    1. \(K(s,t)=K(t,s)\) (symmetry)

    2. For any \(\{t_1, \dots, t_k\} \subset\mathbb{R}\) and coefficients \(\{a_1, \dots, a_k\}\) we have \(\sum_{i,j=1}^k a_i a_j K(t_i, t_j)\geq 0\). (non-negative definite)

  • Example: radial basis function \(K(t,s) = e^{-(t-s)^2/(2\gamma)}\).

  • Example: Brownian motion \(K(t,s) = \min(t,s)\).
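
A small numpy sketch of the two example kernels, with a numerical check of non-negative definiteness via the eigenvalues of a Gram matrix on an arbitrary grid of points.

```python
import numpy as np

def rbf_kernel(s, t, gamma=1.0):
    """Radial basis function kernel K(t,s) = exp(-(t-s)^2 / (2*gamma))."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

def brownian_kernel(s, t):
    """Brownian motion kernel K(t,s) = min(t,s), for t, s >= 0."""
    return np.minimum(s[:, None], t[None, :])

t = np.sort(np.random.default_rng(8).uniform(0, 1, size=20))
for K in (rbf_kernel(t, t), brownian_kernel(t, t)):
    # non-negative definiteness: the Gram matrix has no eigenvalue below 0
    print(np.linalg.eigvalsh(K).min() >= -1e-10)
```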


Reproducing kernel penalty

  • Consider functions of the form:

\[ h(s) = \sum_{j=1}^k a_j K(t_j,s) \]
  • The penalty is

\[ \|h\|^2_{K} = \sum_{i,j=1}^k a_i a_j K(t_i,t_j) \]

Penalized regression

\[ \text{minimize}_{f_1, \dots, f_p} -\log L(f_1,\dots,f_p|X,Y) + \sum_{j=1}^p \lambda_j \|f_j\|^2_{K_j}. \]
  • Kernels can take multivariate arguments \(K:\mathbb{R}^k \times \mathbb{R}^k \rightarrow \mathbb{R}\).

  • Greatly expands the flexibility of regression modelling… (at the cost of having more choices to make!)
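
For the special case of a Gaussian likelihood with a single RBF kernel, the problem above reduces to kernel ridge regression; a minimal numpy sketch of that special case (bandwidth, \(\lambda\), and simulated data are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)

def rbf(s, t, gamma=0.5):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

# Gaussian log-likelihood + RKHS penalty with one feature:
# by the representer theorem f = sum_i a_i K(x_i, .), and the problem
# becomes  minimize_a ||y - K a||^2 + lam * a^T K a,
# whose solution is a = (K + lam I)^{-1} y
lam = 1.0
K = rbf(x, x)
a = np.linalg.solve(K + lam * np.eye(n), y)
f_hat = K @ a                                       # fitted values f(x_i)
print(np.round(np.mean((f_hat - np.sin(x)) ** 2), 3))
```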


Combining ideas

  • Kernel trick tells us how to make lots of cool feature mappings…

  • Can be plugged into other regression problems, e.g. “non-parametric” quantile regression

\[ \text{minimize}_f \sum_{i=1}^n \ell_{\tau}(Y_i,f(X_i)) + \lambda \|f\|^2_{K}. \]
  • “Non-parametric” survival analysis

\[ \text{minimize}_f \sum_{i:\delta_i=1} \left[ \log \left(\sum_{j:O_j \geq O_i} \exp(f(X_j)) \right) - f(X_i)\right] + \lambda \|f\|^2_K. \]
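
A crude sketch of the “non-parametric” quantile regression above: by the representer theorem the minimizer is a kernel expansion, and here the resulting finite-dimensional problem is attacked with plain subgradient descent. The step size, iteration count, \(\lambda\), and bandwidth are untuned, arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(10)
n, tau, lam = 150, 0.9, 0.1
x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + rng.exponential(scale=0.5, size=n)   # skewed noise

def rbf(s, t, gamma=0.5):
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * gamma))

# representer theorem: f = sum_i a_i K(x_i, .), so minimize over a
#   sum_i pinball_tau(y_i, (K a)_i) + lam * a^T K a
K = rbf(x, x)
a = np.zeros(n)
for _ in range(20_000):
    eta = K @ a
    g = np.where(y > eta, -tau, 1.0 - tau)            # subgradient of pinball in eta
    a -= 2e-4 * (K @ g + 2.0 * lam * (K @ a))         # chain rule through eta = K a
print(np.mean(y <= K @ a))   # fraction of points below the fit (ideally close to tau)
```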