Non-linear regression#

Problem: How do we model a non-linear relationship?

  • Left: Regression of wage onto age.

  • Right: Logistic regression for the classes wage>250 and wage≤250

Basis functions#

Strategy:

  • Define a model: $$Y = \beta_0 + \beta_1 f_1(X) + \beta_2 f_2(X) + \dots + \beta_d f_d(X) + \epsilon.$$
  • Fit this model through least-squares regression: the $f_j$'s are nonlinear, but the model is linear in the coefficients! (See the sketch after this list.)
  • Some options for $f_1,\dots,f_d$:
    1. Polynomials, $f_i(x) = x^i$.
    2. Indicator functions, $f_i(x) = \mathbf{1}(c_i \leq x < c_{i+1})$.
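
A minimal sketch of this strategy in Python (numpy only, with synthetic data standing in for wage and age), fitting a polynomial basis by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(18, 80, size=200))            # e.g. age
y = np.sin(x / 10) + 0.1 * rng.standard_normal(200)   # synthetic response ("wage")

# Basis matrix [1, f_1(x), ..., f_d(x)] with polynomial basis f_i(x) = x^i.
d = 4
X = np.vander(x, N=d + 1, increasing=True)

# The f_i are nonlinear in x, but the model is linear in beta: least squares applies.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```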

Piecewise constant#

Piecewise polynomials#

Cubic splines#

  • Define a set of knots $\xi_1< \xi_2 < \dots<\xi_K$.
  • We want the function $f$ in $Y= f(X) + \epsilon$ to:
    1. Be a cubic polynomial between every pair of knots $\xi_i,\xi_{i+1}$.
    2. Be continuous at each knot.
    3. Have continuous first and second derivatives at each knot.
  • It turns out that we can write $f$ in terms of $K+3$ basis functions (a numerical sketch follows after this list): $$f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 h(X,\xi_1) + \dots + \beta_{K+3} h(X,\xi_K)$$
  • Above, $$ h(x,\xi) = \begin{cases} (x-\xi)^3 \quad \text{if }x>\xi \\ 0 \quad \text{otherwise} \end{cases} $$
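
A minimal numerical sketch of this basis (numpy only, synthetic data; the number and placement of knots here are illustrative assumptions):

```python
import numpy as np

def h(x, knot):
    """Truncated cubic: (x - xi)^3 if x > xi, else 0."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(18, 80, size=300))
y = np.sin(x / 10) + 0.1 * rng.standard_normal(300)

knots = np.quantile(x, [0.25, 0.50, 0.75])       # K = 3 knots at quantiles of x

# Columns: 1, X, X^2, X^3, h(X, xi_1), ..., h(X, xi_K)  ->  intercept + (K + 3) basis functions.
B = np.column_stack([np.ones_like(x), x, x**2, x**3] + [h(x, k) for k in knots])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
f_hat = B @ beta
```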

Natural cubic splines#

  • A spline which is linear rather than cubic for \(X<\xi_1\) and \(X>\xi_K\).

  • The predictions are more stable for extreme values of \(X\).

Choosing the number and locations of knots#

  • The locations of the knots are typically quantiles of \(X\).

  • The number of knots, \(K\), is chosen by cross validation (a sketch follows below).
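
A minimal sketch of this choice (numpy only, synthetic data): knots are placed at interior quantiles of \(x\) and \(K\) is picked by 5-fold cross validation. The candidate range for \(K\) is an arbitrary assumption.

```python
import numpy as np

def spline_basis(x, knots):
    # 1, x, x^2, x^3, (x - xi)^3_+ for each knot xi
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = rng.uniform(18, 80, size=400)
y = np.sin(x / 10) + 0.1 * rng.standard_normal(400)

folds = np.array_split(rng.permutation(len(x)), 5)
cv_err = {}
for K in range(1, 9):
    knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])   # interior quantiles
    err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(x)), test)
        beta, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        err += np.mean((y[test] - spline_basis(x[test], knots) @ beta) ** 2)
    cv_err[K] = err / len(folds)

best_K = min(cv_err, key=cv_err.get)
```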

Natural cubic splines vs. polynomial regression#

  • Splines can fit complex functions with few parameters.

  • Polynomials require high degree terms to be flexible.

  • High-degree polynomials can be unstable at the edges.

Smoothing splines#

Find the function \(f\) which minimizes

\[\color{Blue}{\sum_{i=1}^n (y_i - f(x_i))^2} + \color{Red}{ \lambda \int f''(x)^2 dx}\]
  • The RSS when using \(f\) to predict.

  • A penalty for the roughness of the function.

Facts#

  • The minimizer \(\hat f\) is a natural cubic spline, with knots at each sample point \(x_1,\dots,x_n\).

  • Obtaining \(\hat f\) is similar to a Ridge regression (see the sketch below).
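
A minimal sketch of fitting a smoothing spline, assuming a recent SciPy is available (version ≥ 1.10 provides `scipy.interpolate.make_smoothing_spline`, whose `lam` argument plays the role of \(\lambda\)):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, size=150))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(150)

# Penalized criterion: sum_i (y_i - f(x_i))^2 + lam * int f''(t)^2 dt.
spl = make_smoothing_spline(x, y, lam=1e-3)    # larger lam -> smoother fit
grid = np.linspace(0, 1, 500)
f_hat = spl(grid)
```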

Advanced: deriving a smoothing spline#

  1. Show that if you fix the values $f(x_1),\dots,f(x_n)$, the roughness $$\int f''(x)^2 dx$$ is minimized by a natural cubic spline.
  2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions. $$f(x) = \beta_0 + \beta_1 f_1(x) + \dots + \beta_{n+3} f_{n+3}(x)$$
  3. Letting $\mathbf N$ be a matrix with $\mathbf N(i,j) = f_j(x_i)$, we can write the objective function: $$ (y - \mathbf N\beta)^T(y - \mathbf N\beta) + \lambda \beta^T \Omega_{\mathbf N}\beta, $$ where $\Omega_{\mathbf N}(i,j) = \int N_i''(t) N_j''(t) dt$.

Advanced: deriving a smoothing spline#

  1. By simple calculus, the coefficients $\hat \beta$ which minimize $$ (y - \mathbf N\beta)^T(y - \mathbf N\beta) + \lambda \beta^T \Omega_{\mathbf N}\beta, $$ are $\hat \beta = (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T y$.
  2. Note that the predicted values are a linear function of the observed values: $$\hat y = \underbrace{\mathbf N (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T}_{\mathbf S_\lambda} y$$
  3. The **degrees of freedom** for a smoothing spline are: $$\text{Trace}(\mathbf S_\lambda)= \mathbf S_\lambda(1,1) + \mathbf S_\lambda(2,2) + \cdots + \mathbf S_\lambda(n,n) $$ (a numerical sketch of these steps follows below).
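
A numerical sketch of these formulas (numpy only). As a simplifying assumption it uses the truncated-power cubic basis with knots at the data points and approximates \(\Omega_{\mathbf N}\) on a grid, so read it as an illustration of the linear algebra rather than a faithful natural-spline implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0, 1, size=n))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def basis(t):
    # Cubic-spline basis with knots at the data points: 1, t, t^2, t^3, (t - x_i)^3_+.
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.where(t > k, (t - k) ** 3, 0.0) for k in x]
    return np.column_stack(cols)

N = basis(x)                                    # N(i, j) = f_j(x_i)

# Omega(i, j) = int f_i''(t) f_j''(t) dt, approximated on a fine grid.
grid = np.linspace(0, 1, 2000)
d2 = np.gradient(np.gradient(basis(grid), grid, axis=0), grid, axis=0)
Omega = np.einsum('ti,tj->ij', d2, d2) * (grid[1] - grid[0])

lam = 1e-3
A = N.T @ N + lam * Omega
beta_hat = np.linalg.solve(A, N.T @ y)          # (N^T N + lam * Omega)^{-1} N^T y
S = N @ np.linalg.solve(A, N.T)                 # smoother matrix S_lambda
y_hat = S @ y
df = np.trace(S)                                # effective degrees of freedom
```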

Natural cubic splines vs. Smoothing splines#

**Natural cubic splines**

  • Fix the locations of $K$ knots at quantiles of $X$; the number of knots $K$ is much smaller than $n$.
  • Find the natural cubic spline $\hat f$ which minimizes the RSS: $$\sum_{i=1}^n (y_i - f(x_i) )^2$$
  • Choose $K$ by cross validation.

**Smoothing splines**

  • Put $n$ knots at $x_1,\dots,x_n$.
  • We could find a cubic spline which makes the RSS $=0$ $\longrightarrow$ **overfitting!**
  • Instead, we obtain the fitted values $\hat f(x_1),\dots,\hat f(x_n)$ through an algorithm similar to Ridge regression.
  • The function $\hat f$ is the only natural cubic spline that has these fitted values.

Choosing the regularization parameter \(\lambda\)#

  • We typically choose \(\lambda\) through cross validation.

  • Fortunately, we can solve the problem for any \(\lambda\) at the cost of diagonalizing a single \(n\times n\) matrix.

  • There is a shortcut for LOOCV: $$\begin{aligned} RSS_\text{loocv}(\lambda) &= \sum_{i=1}^n (y_i - \hat f_\lambda^{(-i)}(x_i))^2 \\ &= \sum_{i=1}^n \left[\frac{y_i-\hat f_\lambda(x_i)}{1-\mathbf S_\lambda(i,i)}\right]^2 \end{aligned}$$

Local linear regression#

  • Sample points nearer to \(x\) receive higher weight in the corresponding regression.

Algorithm#

To predict the regression function \(f\) at an input \(x\):

  1. Assign a weight $K_i(x)$ to the training point $x_i$, such that:
    • $K_i(x)=0$ unless $x_i$ is one of the $k$ nearest neighbors of $x$ (not strictly necessary).
    • $K_i(x)$ decreases when the distance $d(x,x_i)$ increases.
  2. Perform a **weighted least squares** regression; i.e. find the $(\beta_0,\beta_1)$ which minimize the weighted RSS: $$\hat{\beta}(x) = \text{argmin}_{(\beta_0, \beta_1)} \sum_{i=1}^n K_i(x) (y_i - \beta_0 -\beta_1 x_i)^2$$ (see the sketch after this list).
  3. Predict $\hat f(x) = \hat \beta_0(x) + \hat \beta_1(x) x$.
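
A minimal sketch of this algorithm (numpy only, synthetic data), using a Gaussian kernel weight with an arbitrary bandwidth instead of a nearest-neighbor cutoff:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=200))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)

def local_linear(x0, x, y, bandwidth=0.1):
    """Predict f(x0) by a weighted least-squares line fit around x0."""
    w = np.exp(-(x - x0) ** 2 / (2 * bandwidth ** 2))   # K_i(x0): nearer points weigh more
    X = np.column_stack([np.ones_like(x), x])            # intercept and slope
    W = np.diag(w)
    # Weighted least squares: minimize sum_i K_i(x0) (y_i - b0 - b1 x_i)^2.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0] + beta[1] * x0                         # hat f(x0)

grid = np.linspace(0, 1, 100)
f_hat = np.array([local_linear(x0, x, y) for x0 in grid])
```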

Generalized nearest neighbors#

  1. Set $K_i(x)=1$ if $x_i$ is one of $x$'s $k$ nearest neighbors.
  2. Perform a "regression" with only an intercept; i.e. find the $\beta_0$ which minimizes $$\hat{\beta}_0(x) = \text{argmin}_{\beta_0} \sum_{i=1}^n K_i(x)(y_i - \beta_0)^2.$$
  3. Predict $\hat f(x) = \hat \beta_0(x)$.
  4. A common (Gaussian-kernel) choice that is smoother than nearest neighbors: $$K_i(x) = \exp(-\|x-x_i\|^2/(2\lambda))$$ (see the sketch below).
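
A minimal sketch of the intercept-only version with the Gaussian weights above (numpy only): the weighted "regression" has a closed form, the kernel-weighted average of the \(y_i\).

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)

def kernel_average(x0, x, y, lam=0.01):
    # K_i(x0) = exp(-||x0 - x_i||^2 / (2 lam)); the intercept-only weighted least squares
    # problem is solved by beta0 = sum_i K_i y_i / sum_i K_i.
    w = np.exp(-(x0 - x) ** 2 / (2 * lam))
    return np.sum(w * y) / np.sum(w)

f_hat = np.array([kernel_average(x0, x, y) for x0 in np.linspace(0, 1, 100)])
```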

Local linear regression#

  • The span \(k/n\) is chosen by cross-validation.

Generalized Additive Models (GAMs)#

  • Extension of non-linear models to multiple predictors:

\[\;\;\mathtt{wage} = \beta_0 + \beta_1\times \mathtt{year} + \beta_2\times \mathtt{age} + \beta_3 \times\mathtt{education} +\epsilon\]
\[\longrightarrow \;\;\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]
  • The functions \(f_1,\dots,f_p\) can be polynomials, natural splines, smoothing splines, local regressions…

Fitting a GAM#

\[\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]
  • If the functions $f_1,\dots,f_p$ have a basis representation, we can simply use least squares (see the sketch after this list):
    • Natural cubic splines
    • Polynomials
    • Step functions
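
A minimal sketch of fitting a GAM by a single least-squares regression (numpy only, synthetic data standing in for the Wage variables; the choice of polynomial bases and dummy coding here is an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
year = rng.uniform(2003, 2009, n)
age = rng.uniform(18, 80, n)
edu = rng.integers(0, 3, n)                              # 3 education levels, coded 0/1/2
wage = 0.5 * year + np.sin(age / 10) + edu + rng.standard_normal(n)

def poly_basis(v, degree):
    return np.column_stack([v ** d for d in range(1, degree + 1)])

# Design matrix: intercept | basis for f1(year) | basis for f2(age) | dummies for education.
X = np.column_stack([
    np.ones(n),
    poly_basis(year - year.mean(), 2),                   # centered to reduce collinearity
    poly_basis(age - age.mean(), 4),
    (edu[:, None] == np.array([1, 2])).astype(float),    # level 0 is the baseline
])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
wage_hat = X @ beta
```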

Fitting a GAM#

  • Otherwise, we can use **backfitting**:
    1. Keep $\beta_0,f_2,\dots,f_p$ fixed, and fit $f_1$ using the partial residuals: $$y_i - \beta_0 - f_2( x_{i2}) -\dots - f_p( x_{ip}),$$ as the response.
    2. Keep $\beta_0,f_1,f_3,\dots,f_p$ fixed, and fit $f_2$ using the partial residuals: $$y_i - \beta_0 - f_1( x_{i1}) - f_3( x_{i3}) -\dots - f_p( x_{ip}),$$ as the response.
    3. ...
    4. Iterate
  • This works for smoothing splines and local regression.
  • **For smoothing splines this is a descent method, descending on a convex loss, so the objective decreases at every step** (a sketch of backfitting follows below).
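
A minimal sketch of backfitting with two predictors (numpy only, synthetic data). A simple Gaussian-kernel smoother stands in for the smoothing-spline or local-regression fit of each \(f_j\):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x1) + (x2 - 0.5) ** 2 + 0.1 * rng.standard_normal(n)

def smooth(x, r, bandwidth=0.1):
    """Kernel-smooth the response r against x; return fitted values at the x's."""
    w = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bandwidth ** 2))
    fit = (w * r[None, :]).sum(axis=1) / w.sum(axis=1)
    return fit - fit.mean()              # center so the intercept stays identifiable

beta0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                       # iterate until the fits stabilize
    f1 = smooth(x1, y - beta0 - f2)       # fit f1 to its partial residuals
    f2 = smooth(x2, y - beta0 - f1)       # fit f2 to its partial residuals
y_hat = beta0 + f1 + f2
```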

Backfitting: coordinate descent#

  • Also works for linear regression...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j - \alpha X_{ik(T)}\right)^2 \\ &= \frac{\sum_{i=1}^n X_{ik(T)}\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j: j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j\right)} {\sum_{i=1}^n X_{ik(T)}^2} \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\hat{\alpha}(T)$.
    4. Iterate until the coefficients stop changing (a sketch of this loop follows below).
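
A minimal numpy sketch of this coordinate-descent update, cycling through the coordinates in order; centering \(y\) and the columns of \(X\) so that the intercept is simply the sample mean is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Center y and the columns of X so the intercept is y.mean() and stays fixed.
beta0 = y.mean()
Xc = X - X.mean(axis=0)
yc = y - beta0

beta = np.zeros(p)
for sweep in range(50):                   # cycle through the coordinates
    for k in range(p):
        # Partial residual: remove coordinate k's current contribution.
        r = yc - Xc @ beta + Xc[:, k] * beta[k]
        beta[k] = (Xc[:, k] @ r) / (Xc[:, k] @ Xc[:, k])

# After enough sweeps, beta agrees (up to tolerance) with the least-squares fit on the centered data.
```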

Backfitting: coordinate descent and LASSO#

  • Also works for LASSO...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}_{\lambda}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(r^{(T-1)}_{ik(T)} - \alpha X_{ik(T)}\right)^2 \\ & \qquad + \lambda \sum_{j: j \neq k(T)} |\hat{\beta}^{(T-1)}_j| + \lambda |\alpha| \end{aligned} $$ with $r^{(T-1)}_j$ the $j$-th partial residual at iteration $T$ $$ r^{(T-1)}_j = Y - \hat{\beta}^{(T-1)}_0 - \sum_{l:l \neq j} X_l \hat{\beta}^{(T-1)}_l. $$ Solution is a simple soft-thresholded version of previous $\hat{\alpha}(T)$ -- **Very fast! Used in `glmnet`**
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\hat{\alpha}_{\lambda}(T)$.
    4. Iterate... (a soft-thresholding sketch follows below).
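
A minimal numpy sketch of the same loop with soft-thresholding, for the LASSO objective $\sum_{i=1}^n (y_i - \beta_0 - x_i^\top\beta)^2 + \lambda \|\beta\|_1$ (on centered, synthetic data; this illustrates the update only and is not glmnet's actual implementation, which differs in scaling and details):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

rng = np.random.default_rng(10)
n, p = 200, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 20.0
beta0 = y.mean()
Xc = X - X.mean(axis=0)
yc = y - beta0

beta = np.zeros(p)
for sweep in range(100):
    for k in range(p):
        r = yc - Xc @ beta + Xc[:, k] * beta[k]           # partial residual for coordinate k
        # Coordinate update: soft-threshold the unpenalized least-squares update.
        beta[k] = soft_threshold(Xc[:, k] @ r, lam / 2) / (Xc[:, k] @ Xc[:, k])

# Coefficients whose signal is weaker than the threshold are set exactly to zero.
```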

Backfitting: GAM#

  • Let's look at the case where each $f_j$ is expanded in basis functions:
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n\biggl(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad - \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)})\biggr)^2 \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\hat{\alpha}(T)$. **Blockwise coordinate descent!**
    4. Iterate...

Properties#

  • GAMs are a step from linear regression toward a fully nonparametric method.
  • The only constraint is additivity. This can be partially relaxed by adding key interaction variables $X_iX_j$ (or tensor products of basis functions -- e.g. polynomials in two variables).
  • We can report degrees of freedom for many non-linear functions.
  • As in linear regression, we can examine the significance of each of the variables.

Example: Regression for Wage#

  • year: natural spline with df=4.

  • age: natural spline with df=5.

  • education: factor.

Example: Regression for Wage#

  • year: smoothing spline with df=4.

  • age: smoothing spline with df=5.

  • education: step function

Classification#

We can model the log-odds in a classification problem using a GAM:

\[\log \frac{P(Y=1\mid X)}{P(Y=0\mid X)} = \beta_0 + f_1(X_1) + \dots + f_p(X_p).\]

Again fit by backfitting …

Backfitting: GAM with logistic loss#

  • Also works for logistic...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and, with $\ell$ the logistic loss, find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n \ell\biggl(Y_i, \hat{\beta}^{(T-1)}_0 + \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad + \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)}) \biggr) \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\hat{\alpha}(T)$.
    4. Works for losses that have a *linear predictor*.
    5. For GAMs, the linear predictor is $$ \beta_0 + f_1(X_1) + \dots + f_p(X_p) $$

Example: Classification for Wage>250#

  • year: linear

  • age: smoothing spline with df=5

  • education: step function

Example: Classification for Wage>250#

  • Same model, excluding cases with education == "<HS".