Generalized linear models (Ch. 4)

  • We saw that the Bernoulli\((\pi)\) family is an exponential family with natural parameter \(\log \left(\frac{\pi}{1-\pi}\right)\).

  • Given \((Y_i,X_i)\) with \(Y_i\) binary, the logistic model is

\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = X_i^T\beta \quad \text{with} \quad \pi_i = E[Y_i|X_i]. \]

  • We are making the assumption that the \(Y_i\) are independent given \(X\).

  • Note that for binary data \(\text{Var}[Y_i|X_i] = \pi_i (1 - \pi_i)\).
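As a quick illustration, here is a minimal sketch of this model in R with simulated data (the variable names, seed and coefficients are made up for illustration):

# Simulate binary data from a logistic model and fit it with glm().
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))  # inverse logit mean

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)  # roughly recovers (-0.5, 1.2)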


One parameter exponential families

  • Suppose \(Y_i|X_i\) has density of the form (w.r.t. some measure \(m\))

\[ f(y|\theta) = a(\theta) b(y) \exp(y \cdot Q(\theta)) \]

so that

\[ a(\theta)^{-1} = \int_{\cal Y} \exp(y \cdot Q(\theta)) b(y) \; m(dy). \]

  • This is an exponential family with sufficient statistic \(y\), natural parameter \(\alpha=Q(\theta)\) and reference measure with density \(b\) w.r.t. \(m\); the normalizing constant \(a(\theta)\) depends on \(\theta\) only through \(\alpha = Q(\theta)\).
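For example, the Bernoulli\((\pi)\) mass function (with \(m\) counting measure on \(\{0,1\}\)) has exactly this form:

\[ f(y|\pi) = \pi^y (1-\pi)^{1-y} = (1-\pi) \exp\left(y \cdot \log \frac{\pi}{1-\pi}\right), \]

so \(a(\pi) = 1-\pi\), \(b(y) = 1\) and \(Q(\pi) = \log(\pi/(1-\pi))\), matching the logit natural parameter noted earlier.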


  • We could write

\[ a(\theta)^{-1} = \exp(\Lambda(\alpha)) = \int_{\cal Y} \exp(y \cdot \alpha) b(y) \; m(dy) \]

so that

\[ \Lambda(\alpha) = \log\left(\text{MGF}(Y_0, \alpha)\right) \]

where \(Y_0\) is a random variable with density \(b\) w.r.t. \(m\).

  • \(\Lambda\) is a cumulant generating function.

  • Simple, standard calculations show

\[\begin{split} \begin{aligned} \nabla \Lambda(\alpha) &= E_{\alpha}[Y] \\ \nabla^2 \Lambda(\alpha) &= \text{Var}_{\alpha}[Y] \end{aligned} \end{split}\]

where \(E_{\alpha}\) denotes expectation under the distribution with density (w.r.t. \(m\))

\[y \mapsto \exp(\alpha \cdot y - \Lambda(\alpha)) \, b(y).\]

  • The function \(\alpha \mapsto E_{\alpha}[Y]\) is strictly increasing (its derivative is \(\text{Var}_{\alpha}[Y] > 0\) for non-degenerate \(Y\)), hence invertible on its range.

  • Therefore \(\text{Var}_{\alpha}[Y]\) can be expressed as a function of \(E_{\alpha}[Y]\).

  • Set \(\mu=E_{\alpha}[Y]\) and \(V = \text{Var}_{\alpha}[Y]\); then we have concluded that \(V = V(\mu)\).

  • If \(Y\) is Bernoulli, then \(V(\mu)=\mu(1-\mu)\).
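As a check, the Bernoulli case can be worked out directly: take \(b \equiv 1\) on \(\{0,1\}\) with \(m\) counting measure, so that

\[\begin{split} \begin{aligned} \Lambda(\alpha) &= \log(1 + e^{\alpha}) \\ \nabla \Lambda(\alpha) &= \frac{e^{\alpha}}{1 + e^{\alpha}} = \mu \\ \nabla^2 \Lambda(\alpha) &= \frac{e^{\alpha}}{(1 + e^{\alpha})^2} = \mu(1-\mu). \end{aligned} \end{split}\]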


Poisson case

  • For Poisson random variables, \(V(\mu)=\mu\).

  • What plays the role of the logit for the Poisson?

  • Mass function: if \(Y \sim \text{Poisson}(\mu)\) then

\[ P_{\mu}(Y=y) = \frac{\mu^y}{y!} e^{-\mu} = \exp(y \cdot \log \mu) \cdot \frac{e^{-\mu}}{y!}. \]

  • Natural parameter is \(\log(\mu)\).

  • The “natural” regression model is \(\log(\mu_i) = X_i^T\beta\).
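A minimal sketch of this model in R with simulated counts (data, seed and names are illustrative only):

# Simulate counts with a log-linear mean and fit a Poisson GLM.
set.seed(2)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.3 + 0.8 * x))  # log(mu_i) = 0.3 + 0.8 x_i

fit_pois <- glm(y ~ x, family = poisson(link = "log"))
coef(fit_pois)  # roughly recovers (0.3, 0.8)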


“Unnatural models”

  • For binary data, we might suppose that \(\pi_i = F(X_i^T\beta)\) for our favorite distribution function \(F\).

  • Or, equivalently, \(F^{-1}(\pi_i) = X_i^T\beta\).

  • A GLM for binary data corresponds to some choice of \(F\).
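In R's glm(), the choice of \(F\) is the link argument of binomial(). A sketch with simulated data (the probit truth is an arbitrary choice for illustration):

# The same binary response fit with three choices of F.
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.2 + 0.9 * x))  # truth: probit model

fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit  <- glm(y ~ x, family = binomial(link = "probit"))
fit_cloglog <- glm(y ~ x, family = binomial(link = "cloglog"))

# Fitted probabilities are typically close even when the links differ.
cor(fitted(fit_logit), fitted(fit_probit))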


Generalized linear models

Given a dataset \((Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n\) we consider a model for the distribution of \(Y|X_1, \dots, X_p\).

  • If \(\eta_i = g({\mathbb{E}}(Y_i|X_i)) = g(\mu_i) =X_i^T\beta\) then \(g\) is called the link function for the model.

  • If \({\text{Var}}(Y_i|X_i) = \phi \cdot V({\mathbb{E}}(Y_i|X_i)) = \phi \cdot V(\mu_i)\) for some \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.

  • A GLM is specified by the pair \((g, V)\). If the data come from a one-parameter exponential family, then \(V\) is already determined by the family…

  • The parameter \(\phi\) is called the dispersion parameter.

  • Canonical reference: Generalized Linear Models (McCullagh and Nelder).
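In R, the pair \((g, V)\) can be specified directly through the quasi() family, which fixes a link and a variance function without committing to a full likelihood. A sketch with simulated data (names illustrative; the explicit mustart just gives the log link positive starting means):

# Specify (g, V) directly: log link with V(mu) = mu (a Poisson-like
# mean-variance relationship), with phi estimated rather than fixed at 1.
set.seed(4)
n <- 200
x <- runif(n)
y <- rpois(n, exp(0.5 + x))

fit_quasi <- glm(y ~ x, family = quasi(link = "log", variance = "mu"),
                 mustart = y + 0.5)
summary(fit_quasi)$dispersion  # estimate of the dispersion parameter phi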


Binary regression as GLM

  • For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)

Deviance for binary glm

\[\begin{split} \begin{aligned} \text{DEV}(\beta|Y) &= -2 \log L(\pi(\beta)|Y) + 2 \log L(Y|Y) \\ &= 2 \sum_{i=1}^n \left[ -Y_i \log \left(\frac{F(X_i^T\beta)}{1 - F(X_i^T\beta)}\right) - \log(1 - F(X_i^T\beta)) \right] \end{aligned} \end{split}\]

(For binary data the saturated log-likelihood \(\log L(Y|Y)\) is \(0\).)
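A quick numerical check of this formula against R's glm() (a sketch with simulated data):

# For binary Y the saturated term vanishes, so DEV = -2 log L(pi(beta-hat)|Y).
set.seed(5)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)  # F(X_i^T beta-hat)
dev_by_hand <- 2 * sum(-y * log(p / (1 - p)) - log(1 - p))
c(dev_by_hand, deviance(fit))  # the two agree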

Gradient for binary glm

\[\begin{split} \begin{aligned} \nabla \text{DEV}(\beta|Y) &= 2 \sum_{i=1}^n X_i \cdot f(X_i^T\beta) \biggl[\frac{-Y_i} {F(X_i^T\beta)(1 - F(X_i^T\beta))} + \\ & \qquad \qquad \qquad \frac{1}{1 - F(X_i^T\beta)} \biggr] \\ &= 2 \sum_{i=1}^n X_i \cdot f(X_i^T\beta) \left[\frac{F(X_i^T\beta)-Y_i} {F(X_i^T\beta)(1 - F(X_i^T\beta))} \right] \\ &= 2 \sum_{i=1}^n X_i \cdot \frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1 - F(X_i^T\beta))} \left[\frac{F(X_i^T\beta)-Y_i}{f(X_i^T\beta)} \right] \end{aligned} \end{split}\]

Hessian for binary glm

  • The Hessian of \(\text{DEV}\) (twice the observed information) is a little ugly for a general \(F\) … but its expectation, twice the expected information, is

\[\begin{split} \begin{aligned} E_{\beta} [ \nabla^2 \text{DEV}(\beta|Y)] &= \frac{1}{2} \text{Var}_{\beta}[\nabla \text{DEV}(\beta|Y)] \\ &= 2 \sum_{i=1}^n X_iX_i^T \frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1 - F(X_i^T\beta))} \\ &= 2 \cdot X^TW_{\beta}X \end{aligned} \end{split}\]

where

\[ W_{\beta} = \text{diag}\left(\frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1-F(X_i^T\beta))}\right) \]
  • Note that

\[ \nabla \text{DEV}(\beta|Y) = - 2\, X^TW_{\beta}\left( \frac{Y-F(X\beta)}{f(X\beta)} \right), \]

where \(F\) and \(f\) are applied elementwise to \(X\beta\).


Fisher scoring

  • Instead of using the observed information (i.e. the actual Hessian of \(\text{DEV}\)), Fisher proposed using the expected information in a Newton-Raphson-like algorithm.

  • After initialization,

\[\begin{split} \begin{aligned} \hat{\beta}^{(k+1)} &= \hat{\beta}^{(k)} - E_{\hat{\beta}^{(k)}}[\nabla^2 \text{DEV}(\hat{\beta}^{(k)}|Y)]^{-1} \nabla \text{DEV}(\hat{\beta}^{(k)}|Y) \\ &= (X^TW^{(k)}X)^{-1}X^TW^{(k)}\left(X \hat{\beta}^{(k)} + \frac{Y - F(X\hat{\beta}^{(k)})}{f(X\hat{\beta}^{(k)})} \right) \end{aligned} \end{split}\]
  • For logistic regression, Fisher scoring is exactly Newton-Raphson: with the canonical (logit) link, the observed and expected information coincide.


Fisher scoring and IRLS

  • We saw in logistic regression that the MLE looks roughly like a (weighted) OLS estimator.

  • In fact, each iteration of Fisher scoring solves a (weighted) OLS problem with design matrix \(X\), response \(Z^{(k+1)}\) and weights \(W^{(k+1)}\) (hence the name iteratively reweighted least squares, IRLS):

\[\begin{split} \begin{aligned} Z^{(k+1)} &= X\hat{\beta}^{(k)} + \frac{Y - F(X\hat\beta^{(k)})}{f(X\hat\beta^{(k)})} \\ W^{(k+1)} &= W_{\hat{\beta}^{(k)}} \\ &= \frac{f(X\hat{\beta}^{(k)})^2}{F(X\hat{\beta}^{(k)})(1 - F(X\hat{\beta}^{(k)}))}. \end{aligned} \end{split}\]
  • Note that the weights are the inverses of the entries of

\[ \text{Var}_{\beta}\left(X\beta + \frac{Y - F(X\beta)}{f(X\beta)} \biggl \vert X\right). \]
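The whole algorithm fits in a few lines of R. Below is a minimal IRLS sketch for binary regression with a probit link (\(F = \Phi\), \(f\) the standard normal density), checked against glm(); the data are simulated for illustration:

# Fisher scoring / IRLS from scratch for a probit model.
set.seed(6)
n <- 500
X <- cbind(1, rnorm(n))                     # design matrix with intercept
y <- rbinom(n, 1, pnorm(X %*% c(-0.5, 1)))  # truth: probit model

Fcdf <- pnorm; fpdf <- dnorm                # F and f for the probit link
beta <- rep(0, ncol(X))
for (k in 1:25) {
  eta <- drop(X %*% beta)
  W <- fpdf(eta)^2 / (Fcdf(eta) * (1 - Fcdf(eta)))   # current weights
  Z <- eta + (y - Fcdf(eta)) / fpdf(eta)             # working response
  beta <- solve(t(X) %*% (W * X), t(X) %*% (W * Z))  # weighted OLS step
}
cbind(IRLS = drop(beta),
      glm  = coef(glm(y ~ X[, 2], family = binomial(link = "probit"))))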


Fisher scoring and IRLS

  • The variables \(Z^{(k+1)}\) can be written as

\[ g \left(E_{\hat{\beta}^{(k)}}(Y) \right) + g'\left(E_{\hat{\beta}^{(k)}}(Y)\right) \cdot \left(Y - E_{\hat{\beta}^{(k)}}(Y) \right) \]

or,

\[ g(\hat{\mu}^{(k)}) + g'(\hat{\mu}^{(k)})(Y - \hat{\mu}^{(k)}) \]

with

\[ \hat{\mu}^{(k)} = E_{\hat{\beta}^{(k)}}(Y) \]
  • Note

\[ W^{(k)} = \text{Var}_{\hat{\beta}^{(k)}}\left(g'(\mu)(Y-\mu)\right)^{-1}. \]
  • Stationary points of Fisher scoring are zeros of the gradient of the deviance…


Asymptotic distribution

  • A similar Taylor series argument shows that, when the model is correct,

\[ \hat{\beta} \approx N\left(\beta^*, (X^TW_{\hat{\beta}}X)^{-1} \right) \]

where the covariance matrix is evaluated at the consistent estimator \(\hat{\beta}\), justified by Slutsky's theorem.

  • If the model is not correct, the population analog of the score equation, for \((X_i,Y_i) \overset{IID}{\sim} G\), is

\[ 0 = E_G\left[X \frac{f(X^T\beta^*(G))}{F(X^T\beta^*(G))(1-F(X^T\beta^*(G)))} \left(Y - F(X^T\beta^*(G))\right) \right] \]
  • For the asymptotic covariance in that case: use the bootstrap or a sandwich estimator…
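A sketch of both in R for a probit fit, with the sandwich written out from the per-observation scores and the expected information above (simulated data, names illustrative):

# Model-based vs. sandwich vs. pairs-bootstrap standard errors.
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial(link = "probit"))

X   <- model.matrix(fit)
eta <- drop(X %*% coef(fit))
Fh  <- pnorm(eta); fh <- dnorm(eta)
S   <- X * (fh * (y - Fh) / (Fh * (1 - Fh)))    # per-observation scores
B   <- t(X) %*% (X * (fh^2 / (Fh * (1 - Fh))))  # X^T W X
sandwich_vcov <- solve(B) %*% (t(S) %*% S) %*% solve(B)

# Nonparametric (pairs) bootstrap of the slope:
boot_slope <- replicate(200, {
  i <- sample(n, replace = TRUE)
  coef(glm(y[i] ~ x[i], family = binomial(link = "probit")))[2]
})
c(model     = sqrt(vcov(fit)[2, 2]),
  sandwich  = sqrt(sandwich_vcov[2, 2]),
  bootstrap = sd(boot_slope))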


Inference

  • As these are likelihood models, differences of \(\text{DEV}\) yield likelihood ratio tests.

  • In least squares regression, we use

\[ \frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F}. \]

  • This is replaced with (when the model \({\cal M}_F \supset {\cal M}_R\) is correctly specified)

\[ DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F}. \]

  • Resulting tests do not agree numerically with those coming from IRLS (Wald tests). Both are often used.


anova(width_probit, glm(y ~ width + color, 
          family=binomial(link="probit"), 
          data=crabs))
1 - pchisq(6.72, 3)
A anova: 2 × 4
  Resid. Df  Resid. Dev  Df  Deviance
1       171    194.0357  NA        NA
2       168    187.3085   3  6.727217

0.0813785459421077
summary(glm(y  ~ width + color, 
            family=binomial(link="probit"), 
            data=crabs))
Call:
glm(formula = y ~ width + color, family = binomial(link = "probit"), 
    data = crabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1348  -1.0009   0.5143   0.8662   2.1544  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.94034    1.54818  -4.483 7.36e-06 ***
width        0.28031    0.06011   4.664 3.11e-06 ***
colordarker -0.65847    0.35357  -1.862   0.0626 .  
colorlight   0.11098    0.45860   0.242   0.8088    
colormedium  0.16335    0.25087   0.651   0.5150    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76  on 172  degrees of freedom
Residual deviance: 187.31  on 168  degrees of freedom
AIC: 197.31

Number of Fisher Scoring iterations: 5