Non-linear regression#

Problem: How do we model a non-linear relationship?

  • Left: Regression of wage onto age.

  • Right: Logistic regression for the classes wage>250 and wage≤250

Basis functions#

Strategy:

  • Define a model: $$Y = \beta_0 + \beta_1 f_1(X) + \beta_2 f_2(X) + \dots + \beta_d f_d(X) + \epsilon.$$
  • Fit this model through least-squares regression: the $f_j$'s are nonlinear, but the model is linear in the coefficients! (See the sketch after this list.)
  • Some options for $f_1,\dots,f_d$:
    1. Polynomials, $f_i(x) = x^i$.
    2. Indicator functions, $f_i(x) = \mathbf{1}(c_i \leq x < c_{i+1})$.
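
A minimal sketch of this strategy in Python (numpy only, with synthetic data standing in for wage and age), fitting a polynomial basis by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(18, 80, size=200))            # e.g. age
y = np.sin(x / 10) + 0.1 * rng.standard_normal(200)   # synthetic response ("wage")

# Basis matrix [1, f_1(x), ..., f_d(x)] with polynomial basis f_i(x) = x^i.
d = 4
X = np.vander(x, N=d + 1, increasing=True)

# The f_i are nonlinear in x, but the model is linear in beta: least squares applies.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```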

Piecewise constant#

Piecewise polynomials#

Cubic splines#

  • Define a set of knots $\xi_1< \xi_2 < \dots<\xi_K$.
  • We want the function $f$ in $Y= f(X) + \epsilon$ to:
    1. Be a cubic polynomial between every pair of knots $\xi_i,\xi_{i+1}$.
    2. Be continuous at each knot.
    3. Have continuous first and second derivatives at each knot.
  • It turns out that we can write $f$ in terms of $K+3$ basis functions (a numerical sketch follows after this list): $$f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 h(X,\xi_1) + \dots + \beta_{K+3} h(X,\xi_K)$$
  • Above, $$ h(x,\xi) = \begin{cases} (x-\xi)^3 \quad \text{if }x>\xi \\ 0 \quad \text{otherwise} \end{cases} $$
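
A minimal numerical sketch of this basis (numpy only, synthetic data; the number and placement of knots here are illustrative assumptions):

```python
import numpy as np

def h(x, knot):
    """Truncated cubic: (x - xi)^3 if x > xi, else 0."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(18, 80, size=300))
y = np.sin(x / 10) + 0.1 * rng.standard_normal(300)

knots = np.quantile(x, [0.25, 0.50, 0.75])       # K = 3 knots at quantiles of x

# Columns: 1, X, X^2, X^3, h(X, xi_1), ..., h(X, xi_K)  ->  intercept + (K + 3) basis functions.
B = np.column_stack([np.ones_like(x), x, x**2, x**3] + [h(x, k) for k in knots])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
f_hat = B @ beta
```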

Natural cubic splines#

  • A spline which is linear rather than cubic for \(X<\xi_1\) and \(X>\xi_K\).

  • The predictions are more stable for extreme values of \(X\).

Choosing the number and locations of knots#

  • The locations of the knots are typically quantiles of \(X\).

  • The number of knots, \(K\), is chosen by cross validation (a sketch follows below).
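
A minimal sketch of this choice (numpy only, synthetic data): knots are placed at interior quantiles of \(x\) and \(K\) is picked by 5-fold cross validation. The candidate range for \(K\) is an arbitrary assumption.

```python
import numpy as np

def spline_basis(x, knots):
    # 1, x, x^2, x^3, (x - xi)^3_+ for each knot xi
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > k, (x - k) ** 3, 0.0) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(2)
x = rng.uniform(18, 80, size=400)
y = np.sin(x / 10) + 0.1 * rng.standard_normal(400)

folds = np.array_split(rng.permutation(len(x)), 5)
cv_err = {}
for K in range(1, 9):
    knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])   # interior quantiles
    err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(x)), test)
        beta, *_ = np.linalg.lstsq(spline_basis(x[train], knots), y[train], rcond=None)
        err += np.mean((y[test] - spline_basis(x[test], knots) @ beta) ** 2)
    cv_err[K] = err / len(folds)

best_K = min(cv_err, key=cv_err.get)
```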

Natural cubic splines vs. polynomial regression#

  • Splines can fit complex functions with few parameters.

  • Polynomials require high degree terms to be flexible.

  • High-degree polynomials can be unstable at the edges.

Smoothing splines#

Find the function \(f\) which minimizes

\[\color{Blue}{\sum_{i=1}^n (y_i - f(x_i))^2} + \color{Red}{ \lambda \int f''(x)^2 dx}\]
  • The RSS when using \(f\) to predict.

  • A penalty for the roughness of the function.

Facts#

  • The minimizer \(\hat f\) is a natural cubic spline, with knots at each sample point \(x_1,\dots,x_n\).

  • Obtaining \(\hat f\) is similar to a Ridge regression (see the sketch below).
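
A minimal sketch of fitting a smoothing spline, assuming a recent SciPy is available (version ≥ 1.10 provides `scipy.interpolate.make_smoothing_spline`, whose `lam` argument plays the role of \(\lambda\)):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, size=150))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(150)

# Penalized criterion: sum_i (y_i - f(x_i))^2 + lam * int f''(t)^2 dt.
spl = make_smoothing_spline(x, y, lam=1e-3)    # larger lam -> smoother fit
grid = np.linspace(0, 1, 500)
f_hat = spl(grid)
```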

Advanced: deriving a smoothing spline#

  1. Show that if you fix the values $f(x_1),\dots,f(x_n)$, the roughness $$\int f''(x)^2 dx$$ is minimized by a natural cubic spline.
  2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions. $$f(x) = \beta_0 + \beta_1 f_1(x) + \dots + \beta_{n+3} f_{n+3}(x)$$
  3. Letting $\mathbf N$ be a matrix with $\mathbf N(i,j) = f_j(x_i)$, we can write the objective function: $$ (y - \mathbf N\beta)^T(y - \mathbf N\beta) + \lambda \beta^T \Omega_{\mathbf N}\beta, $$ where $\Omega_{\mathbf N}(i,j) = \int N_i''(t) N_j''(t) dt$.

Advanced: deriving a smoothing spline#

  1. By simple calculus, the coefficients $\hat \beta$ which minimize $$ (y - \mathbf N\beta)^T(y - \mathbf N\beta) + \lambda \beta^T \Omega_{\mathbf N}\beta, $$ are $\hat \beta = (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T y$.
  2. Note that the predicted values are a linear function of the observed values: $$\hat y = \underbrace{\mathbf N (\mathbf N^T \mathbf N + \lambda \Omega_{\mathbf N})^{-1} \mathbf N^T}_{\mathbf S_\lambda} y$$
  3. The **degrees of freedom** for a smoothing spline are: $$\text{Trace}(\mathbf S_\lambda)= \mathbf S_\lambda(1,1) + \mathbf S_\lambda(2,2) + \cdots + \mathbf S_\lambda(n,n) $$ (a numerical sketch of these steps follows below).
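
A numerical sketch of these formulas (numpy only). As a simplifying assumption it uses the truncated-power cubic basis with knots at the data points and approximates \(\Omega_{\mathbf N}\) on a grid, so read it as an illustration of the linear algebra rather than a faithful natural-spline implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0, 1, size=n))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def basis(t):
    # Cubic-spline basis with knots at the data points: 1, t, t^2, t^3, (t - x_i)^3_+.
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.where(t > k, (t - k) ** 3, 0.0) for k in x]
    return np.column_stack(cols)

N = basis(x)                                    # N(i, j) = f_j(x_i)

# Omega(i, j) = int f_i''(t) f_j''(t) dt, approximated on a fine grid.
grid = np.linspace(0, 1, 2000)
d2 = np.gradient(np.gradient(basis(grid), grid, axis=0), grid, axis=0)
Omega = np.einsum('ti,tj->ij', d2, d2) * (grid[1] - grid[0])

lam = 1e-3
A = N.T @ N + lam * Omega
beta_hat = np.linalg.solve(A, N.T @ y)          # (N^T N + lam * Omega)^{-1} N^T y
S = N @ np.linalg.solve(A, N.T)                 # smoother matrix S_lambda
y_hat = S @ y
df = np.trace(S)                                # effective degrees of freedom
```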

Natural cubic splines vs. Smoothing splines#

**Natural cubic splines**

  • Fix the locations of $K$ knots at quantiles of $X$; the number of knots $K$ is much smaller than $n$.
  • Find the natural cubic spline $\hat f$ which minimizes the RSS: $$\sum_{i=1}^n (y_i - f(x_i) )^2$$
  • Choose $K$ by cross validation.

**Smoothing splines**

  • Put $n$ knots at $x_1,\dots,x_n$.
  • We could find a cubic spline which makes the RSS $=0$ $\longrightarrow$ **overfitting!**
  • Instead, we obtain the fitted values $\hat f(x_1),\dots,\hat f(x_n)$ through an algorithm similar to Ridge regression.
  • The function $\hat f$ is the only natural cubic spline that has these fitted values.

Choosing the regularization parameter \(\lambda\)#

  • We typically choose \(\lambda\) through cross validation.

  • Fortunately, we can solve the problem for any \(\lambda\) at the cost of diagonalizing a single \(n\times n\) matrix.

  • There is a shortcut for LOOCV: $$\begin{aligned} RSS_\text{loocv}(\lambda) &= \sum_{i=1}^n (y_i - \hat f_\lambda^{(-i)}(x_i))^2 \\ &= \sum_{i=1}^n \left[\frac{y_i-\hat f_\lambda(x_i)}{1-\mathbf S_\lambda(i,i)}\right]^2 \end{aligned}$$

Local linear regression#

  • Sample points nearer to \(x\) receive higher weight in the corresponding regression.

Algorithm#

To predict the regression function \(f\) at an input \(x\):

  1. Assign a weight $K_i(x)$ to the training point $x_i$, such that:
    • $K_i(x)=0$ unless $x_i$ is one of the $k$ nearest neighbors of $x$ (not strictly necessary).
    • $K_i(x)$ decreases when the distance $d(x,x_i)$ increases.
  2. Perform a **weighted least squares** regression; i.e. find the $(\beta_0,\beta_1)$ which minimize the weighted RSS: $$\hat{\beta}(x) = \text{argmin}_{(\beta_0, \beta_1)} \sum_{i=1}^n K_i(x) (y_i - \beta_0 -\beta_1 x_i)^2$$ (see the sketch after this list).
  3. Predict $\hat f(x) = \hat \beta_0(x) + \hat \beta_1(x) x$.
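
A minimal sketch of this algorithm (numpy only, synthetic data), using a Gaussian kernel weight with an arbitrary bandwidth instead of a nearest-neighbor cutoff:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=200))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)

def local_linear(x0, x, y, bandwidth=0.1):
    """Predict f(x0) by a weighted least-squares line fit around x0."""
    w = np.exp(-(x - x0) ** 2 / (2 * bandwidth ** 2))   # K_i(x0): nearer points weigh more
    X = np.column_stack([np.ones_like(x), x])            # intercept and slope
    W = np.diag(w)
    # Weighted least squares: minimize sum_i K_i(x0) (y_i - b0 - b1 x_i)^2.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0] + beta[1] * x0                         # hat f(x0)

grid = np.linspace(0, 1, 100)
f_hat = np.array([local_linear(x0, x, y) for x0 in grid])
```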

Generalized nearest neighbors#

  1. Set $K_i(x)=1$ if $x_i$ is one of $x$'s $k$ nearest neighbors.
  2. Perform a "regression" with only an intercept; i.e. find the $\beta_0$ which minimizes $$\hat{\beta}_0(x) = \text{argmin}_{\beta_0} \sum_{i=1}^n K_i(x)(y_i - \beta_0)^2.$$
  3. Predict $\hat f(x) = \hat \beta_0(x)$.
  4. A common (Gaussian-kernel) choice that is smoother than nearest neighbors: $$K_i(x) = \exp(-\|x-x_i\|^2/(2\lambda))$$ (see the sketch below).
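
A minimal sketch of the intercept-only version with the Gaussian weights above (numpy only): the weighted "regression" has a closed form, the kernel-weighted average of the \(y_i\).

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(200)

def kernel_average(x0, x, y, lam=0.01):
    # K_i(x0) = exp(-||x0 - x_i||^2 / (2 lam)); the intercept-only weighted least squares
    # problem is solved by beta0 = sum_i K_i y_i / sum_i K_i.
    w = np.exp(-(x0 - x) ** 2 / (2 * lam))
    return np.sum(w * y) / np.sum(w)

f_hat = np.array([kernel_average(x0, x, y) for x0 in np.linspace(0, 1, 100)])
```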

Local linear regression#

  • The span \(k/n\) is chosen by cross-validation.

Generalized Additive Models (GAMs)#

  • Extension of non-linear models to multiple predictors:

\[\;\;\mathtt{wage} = \beta_0 + \beta_1\times \mathtt{year} + \beta_2\times \mathtt{age} + \beta_3 \times\mathtt{education} +\epsilon\]
\[\longrightarrow \;\;\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]
  • The functions \(f_1,\dots,f_p\) can be polynomials, natural splines, smoothing splines, local regressions…

Fitting a GAM#

\[\mathtt{wage} = \beta_0 + f_1(\mathtt{year}) + f_2(\mathtt{age}) + f_3(\mathtt{education}) +\epsilon\]
  • If the functions $f_1,\dots,f_p$ have a basis representation, we can simply use least squares (see the sketch after this list):
    • Natural cubic splines
    • Polynomials
    • Step functions
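
A minimal sketch of fitting a GAM by a single least-squares regression (numpy only, synthetic data standing in for the Wage variables; the choice of polynomial bases and dummy coding here is an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
year = rng.uniform(2003, 2009, n)
age = rng.uniform(18, 80, n)
edu = rng.integers(0, 3, n)                              # 3 education levels, coded 0/1/2
wage = 0.5 * year + np.sin(age / 10) + edu + rng.standard_normal(n)

def poly_basis(v, degree):
    return np.column_stack([v ** d for d in range(1, degree + 1)])

# Design matrix: intercept | basis for f1(year) | basis for f2(age) | dummies for education.
X = np.column_stack([
    np.ones(n),
    poly_basis(year - year.mean(), 2),                   # centered to reduce collinearity
    poly_basis(age - age.mean(), 4),
    (edu[:, None] == np.array([1, 2])).astype(float),    # level 0 is the baseline
])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
wage_hat = X @ beta
```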

Fitting a GAM#

  • Otherwise, we can use **backfitting**:
    1. Keep $\beta_0,f_2,\dots,f_p$ fixed, and fit $f_1$ using the partial residuals: $$y_i - \beta_0 - f_2( x_{i2}) -\dots - f_p( x_{ip}),$$ as the response.
    2. Keep $\beta_0,f_1,f_3,\dots,f_p$ fixed, and fit $f_2$ using the partial residuals: $$y_i - \beta_0 - f_1( x_{i1}) - f_3( x_{i3}) -\dots - f_p( x_{ip}),$$ as the response.
    3. ...
    4. Iterate
  • This works for smoothing splines and local regression.
  • **For smoothing splines this is a descent method, descending on a convex loss, so the objective decreases at every step** (a sketch of backfitting follows below).
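
A minimal sketch of backfitting with two predictors (numpy only, synthetic data). A simple Gaussian-kernel smoother stands in for the smoothing-spline or local-regression fit of each \(f_j\):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x1) + (x2 - 0.5) ** 2 + 0.1 * rng.standard_normal(n)

def smooth(x, r, bandwidth=0.1):
    """Kernel-smooth the response r against x; return fitted values at the x's."""
    w = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bandwidth ** 2))
    fit = (w * r[None, :]).sum(axis=1) / w.sum(axis=1)
    return fit - fit.mean()              # center so the intercept stays identifiable

beta0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                       # iterate until the fits stabilize
    f1 = smooth(x1, y - beta0 - f2)       # fit f1 to its partial residuals
    f2 = smooth(x2, y - beta0 - f1)       # fit f2 to its partial residuals
y_hat = beta0 + f1 + f2
```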

Backfitting: coordinate descent#

  • Also works for linear regression...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j - \alpha X_{ik(T)}\right)^2 \\ &= \frac{\sum_{i=1}^n X_{ik(T)}\left(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j: j \neq k(T)} X_{ij} \hat{\beta}^{(T-1)}_j\right)} {\sum_{i=1}^n X_{ik(T)}^2} \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\hat{\alpha}(T)$.
    4. Iterate until the coefficients stop changing (a sketch of this loop follows below).
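
A minimal numpy sketch of this coordinate-descent update, cycling through the coordinates in order; centering \(y\) and the columns of \(X\) so that the intercept is simply the sample mean is a simplifying assumption:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Center y and the columns of X so the intercept is y.mean() and stays fixed.
beta0 = y.mean()
Xc = X - X.mean(axis=0)
yc = y - beta0

beta = np.zeros(p)
for sweep in range(50):                   # cycle through the coordinates
    for k in range(p):
        # Partial residual: remove coordinate k's current contribution.
        r = yc - Xc @ beta + Xc[:, k] * beta[k]
        beta[k] = (Xc[:, k] @ r) / (Xc[:, k] @ Xc[:, k])

# After enough sweeps, beta agrees (up to tolerance) with the least-squares fit on the centered data.
```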

Backfitting: coordinate descent and LASSO#

  • Also works for LASSO...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}_{\lambda}(T) &= \text{argmin}_{\alpha} \sum_{i=1}^n\left(r^{(T-1)}_{ik(T)} - \alpha X_{ik(T)}\right)^2 \\ & \qquad + \lambda \sum_{j: j \neq k(T)} |\hat{\beta}^{(T-1)}_j| + \lambda |\alpha| \end{aligned} $$ with $r^{(T-1)}_j$ the $j$-th partial residual at iteration $T$ $$ r^{(T-1)}_j = Y - \hat{\beta}^{(T-1)}_0 - \sum_{l:l \neq j} X_l \hat{\beta}^{(T-1)}_l. $$ Solution is a simple soft-thresholded version of previous $\hat{\alpha}(T)$ -- **Very fast! Used in `glmnet`**
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entry which we set to $\hat{\alpha}_{\lambda}(T)$.
    4. Iterate... (a soft-thresholding sketch follows below).
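
A minimal numpy sketch of the same loop with soft-thresholding, for the LASSO objective $\sum_{i=1}^n (y_i - \beta_0 - x_i^\top\beta)^2 + \lambda \|\beta\|_1$ (on centered, synthetic data; this illustrates the update only and is not glmnet's actual implementation, which differs in scaling and details):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

rng = np.random.default_rng(10)
n, p = 200, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 20.0
beta0 = y.mean()
Xc = X - X.mean(axis=0)
yc = y - beta0

beta = np.zeros(p)
for sweep in range(100):
    for k in range(p):
        r = yc - Xc @ beta + Xc[:, k] * beta[k]           # partial residual for coordinate k
        # Coordinate update: soft-threshold the unpenalized least-squares update.
        beta[k] = soft_threshold(Xc[:, k] @ r, lam / 2) / (Xc[:, k] @ Xc[:, k])

# Coefficients whose signal is weaker than the threshold are set exactly to zero.
```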

Backfitting: GAM#

  • Let's look at the case where each $f_j$ is expanded in basis functions:
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n\biggl(Y_i - \hat{\beta}^{(T-1)}_0 - \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad - \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)})\biggr)^2 \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\hat{\alpha}(T)$. **Blockwise coordinate descent!**
    4. Iterate...

Properties#

  • GAMs are a step from linear regression toward a fully nonparametric method.
  • The only constraint is additivity. This can be partially relaxed by adding key interaction variables $X_iX_j$ (or tensor products of basis functions -- e.g. polynomials in two variables).
  • We can report degrees of freedom for many non-linear functions.
  • As in linear regression, we can examine the significance of each of the variables.

Example: Regression for Wage#

  • year: natural spline with df=4.

  • age: natural spline with df=5.

  • education: factor.

Example: Regression for Wage#

  • year: smoothing spline with df=4.

  • age: smoothing spline with df=5.

  • education: step function

Classification#

We can model the log-odds in a classification problem using a GAM:

\[\log \frac{P(Y=1\mid X)}{P(Y=0\mid X)} = \beta_0 + f_1(X_1) + \dots + f_p(X_p).\]

Again fit by backfitting …

Backfitting: GAM with logistic loss#

  • Also works for logistic...
    1. Initialize $\hat{\beta}^{(0)} = 0$, say.
    2. Given $\hat{\beta}^{(T-1)}$, choose a coordinate $0 \leq k(T) \leq p$ and, with $\ell$ the logistic loss, find $$ \begin{aligned} \hat{\alpha}(T) &= \text{argmin}_{\alpha \in \mathbb{R}^{n_{k(T)}}} \\ & \qquad \sum_{i=1}^n \ell\biggl(Y_i, \hat{\beta}^{(T-1)}_0 + \sum_{j:j \neq k(T)} \sum_{l=1}^{n_j} f_{lj}(X_{ij}) \hat{\beta}^{(T-1)}_{lj} \\ & \qquad \quad + \sum_{l=1}^{n_{k(T)}} \alpha_{l} f_{lk(T)}(X_{ik(T)}) \biggr) \end{aligned} $$
    3. Set $\hat{\beta}^{(T)}=\hat{\beta}^{(T-1)}$ except $k(T)$ entries which we set to $\hat{\alpha}(T)$.
    4. Works for losses that have a *linear predictor*.
    5. For GAMs, the linear predictor is $$ \beta_0 + f_1(X_1) + \dots + f_p(X_p) $$

Example: Classification for Wage>250#

  • year: linear

  • age: smoothing spline with df=5

  • education: step function

Example: Classification for Wage>250#

  • Same model, excluding cases with education == "<HS".