Generalized linear models (Ch. 4)

  • We saw that the Bernoulli\((\pi)\) family is an exponential family with natural parameter \(\log \left(\frac{\pi}{1-\pi}\right)\).

  • Given \((Y_i,X_i)\) with \(Y_i\) binary, the logistic model is

\[ \log\left(\frac{\pi_i}{1-\pi_i}\right) = X_i^T\beta \quad \text{with} \quad \pi_i = E[Y_i|X_i]. \]

  • We are making the assumption that the \(Y_i\) are independent given \(X\).

  • Note that for binary data \(\text{Var}[Y_i|X_i] = \pi_i (1 - \pi_i)\).
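As a quick illustration, here is a minimal sketch of this model in R with simulated data (the variable names, seed and coefficients are made up for illustration):

# Simulate binary data from a logistic model and fit it with glm().
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))  # inverse logit mean

fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)  # roughly recovers (-0.5, 1.2)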


One parameter exponential families

  • Suppose \(Y_i|X_i\) has density of the form (w.r.t. some measure \(m\))

\[ f(y|\theta) = a(\theta) b(y) \exp(y \cdot Q(\theta)) \]

so that

\[ a(\theta)^{-1} = \int_{\cal Y} \exp(y \cdot Q(\theta)) b(y) \; m(dy). \]

  • This is an exponential family with sufficient statistic \(y\), natural parameter \(\alpha=Q(\theta)\) and reference measure with density \(b\) w.r.t. \(m\); the normalizing constant \(a(\theta)\) depends on \(\theta\) only through \(\alpha = Q(\theta)\).
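For example, the Bernoulli\((\pi)\) mass function (with \(m\) counting measure on \(\{0,1\}\)) has exactly this form:

\[ f(y|\pi) = \pi^y (1-\pi)^{1-y} = (1-\pi) \exp\left(y \cdot \log \frac{\pi}{1-\pi}\right), \]

so \(a(\pi) = 1-\pi\), \(b(y) = 1\) and \(Q(\pi) = \log(\pi/(1-\pi))\), matching the logit natural parameter noted earlier.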


  • We could write

\[ a(\theta)^{-1} = \exp(\Lambda(\alpha)) = \int_{\cal Y} \exp(y \cdot \alpha) b(y) \; m(dy) \]

so that

\[ \Lambda(\alpha) = \log\left(\text{MGF}(Y_0, \alpha)\right) \]

where \(Y_0\) is a random variable with density \(b\) w.r.t. \(m\).

  • \(\Lambda\) is a cumulant generating function.

  • Simple, standard calculations show

\[\begin{split} \begin{aligned} \nabla \Lambda(\alpha) &= E_{\alpha}[Y] \\ \nabla^2 \Lambda(\alpha) &= \text{Var}_{\alpha}[Y] \end{aligned} \end{split}\]

where \(E_{\alpha}\) denotes expectation under the distribution with density (w.r.t. \(m\))

\[y \mapsto \exp(\alpha \cdot y - \Lambda(\alpha)) \, b(y).\]

  • The function \(\alpha \mapsto E_{\alpha}[Y]\) is strictly increasing (its derivative is \(\text{Var}_{\alpha}[Y] > 0\) for non-degenerate \(Y\)), hence invertible on its range.

  • Therefore \(\text{Var}_{\alpha}[Y]\) can be expressed as a function of \(E_{\alpha}[Y]\).

  • Set \(\mu=E_{\alpha}[Y]\) and \(V = \text{Var}_{\alpha}[Y]\); then we have concluded that \(V = V(\mu)\).

  • If \(Y\) is Bernoulli, then \(V(\mu)=\mu(1-\mu)\).
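As a check, the Bernoulli case can be worked out directly: take \(b \equiv 1\) on \(\{0,1\}\) with \(m\) counting measure, so that

\[\begin{split} \begin{aligned} \Lambda(\alpha) &= \log(1 + e^{\alpha}) \\ \nabla \Lambda(\alpha) &= \frac{e^{\alpha}}{1 + e^{\alpha}} = \mu \\ \nabla^2 \Lambda(\alpha) &= \frac{e^{\alpha}}{(1 + e^{\alpha})^2} = \mu(1-\mu). \end{aligned} \end{split}\]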


Poisson case

  • For Poisson random variables, \(V(\mu)=\mu\).

  • What plays the role of the logit for the Poisson?

  • Mass function: if \(Y \sim \text{Poisson}(\mu)\) then

\[ P_{\mu}(Y=y) = \frac{\mu^y}{y!} e^{-\mu} = \exp(y \cdot \log \mu) \cdot \frac{e^{-\mu}}{y!}. \]

  • Natural parameter is \(\log(\mu)\).

  • The “natural” regression model is \(\log(\mu_i) = X_i^T\beta\).
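A minimal sketch of this model in R with simulated counts (data, seed and names are illustrative only):

# Simulate counts with a log-linear mean and fit a Poisson GLM.
set.seed(2)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(0.3 + 0.8 * x))  # log(mu_i) = 0.3 + 0.8 x_i

fit_pois <- glm(y ~ x, family = poisson(link = "log"))
coef(fit_pois)  # roughly recovers (0.3, 0.8)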


“Unnatural models”

  • For binary data, we might suppose that \(\pi_i = F(X_i^T\beta)\) for our favorite distribution function \(F\).

  • Or, equivalently, \(F^{-1}(\pi_i) = X_i^T\beta\).

  • A GLM for binary data corresponds to some choice of \(F\).
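In R's glm(), the choice of \(F\) is the link argument of binomial(). A sketch with simulated data (the probit truth is an arbitrary choice for illustration):

# The same binary response fit with three choices of F.
set.seed(3)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.2 + 0.9 * x))  # truth: probit model

fit_logit   <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit  <- glm(y ~ x, family = binomial(link = "probit"))
fit_cloglog <- glm(y ~ x, family = binomial(link = "cloglog"))

# Fitted probabilities are typically close even when the links differ.
cor(fitted(fit_logit), fitted(fit_probit))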


Generalized linear models

Given a dataset \((Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n\) we consider a model for the distribution of \(Y|X_1, \dots, X_p\).

  • If \(\eta_i = g({\mathbb{E}}(Y_i|X_i)) = g(\mu_i) =X_i^T\beta\) then \(g\) is called the link function for the model.

  • If \({\text{Var}}(Y_i|X_i) = \phi \cdot V({\mathbb{E}}(Y_i|X_i)) = \phi \cdot V(\mu_i)\) for some \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.

  • A GLM is specified by the pair \((g, V)\). If the data come from a one-parameter exponential family, then \(V\) is already determined by the family…

  • The parameter \(\phi\) is called the dispersion parameter.

  • Canonical reference: Generalized Linear Models (McCullagh and Nelder).
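In R, the pair \((g, V)\) can be specified directly through the quasi() family, which fixes a link and a variance function without committing to a full likelihood. A sketch with simulated data (names illustrative; the explicit mustart just gives the log link positive starting means):

# Specify (g, V) directly: log link with V(mu) = mu (a Poisson-like
# mean-variance relationship), with phi estimated rather than fixed at 1.
set.seed(4)
n <- 200
x <- runif(n)
y <- rpois(n, exp(0.5 + x))

fit_quasi <- glm(y ~ x, family = quasi(link = "log", variance = "mu"),
                 mustart = y + 0.5)
summary(fit_quasi)$dispersion  # estimate of the dispersion parameter phi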


Binary regression as GLM

  • For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)

Deviance for binary glm

\[\begin{split} \begin{aligned} \text{DEV}(\beta|Y) &= -2 \log L(\pi(\beta)|Y) + 2 \log L(Y|Y) \\ &= 2 \sum_{i=1}^n \left[ -Y_i \log \left(\frac{F(X_i^T\beta)}{1 - F(X_i^T\beta)}\right) - \log(1 - F(X_i^T\beta)) \right] \end{aligned} \end{split}\]

(For binary data the saturated log-likelihood \(\log L(Y|Y)\) is \(0\).)
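A quick numerical check of this formula against R's glm() (a sketch with simulated data):

# For binary Y the saturated term vanishes, so DEV = -2 log L(pi(beta-hat)|Y).
set.seed(5)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)  # F(X_i^T beta-hat)
dev_by_hand <- 2 * sum(-y * log(p / (1 - p)) - log(1 - p))
c(dev_by_hand, deviance(fit))  # the two agree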

Gradient for binary glm

\[\begin{split} \begin{aligned} \nabla \text{DEV}(\beta|Y) &= 2 \sum_{i=1}^n X_i \cdot f(X_i^T\beta) \biggl[\frac{-Y_i} {F(X_i^T\beta)(1 - F(X_i^T\beta))} + \\ & \qquad \qquad \qquad \frac{1}{1 - F(X_i^T\beta)} \biggr] \\ &= 2 \sum_{i=1}^n X_i \cdot f(X_i^T\beta) \left[\frac{F(X_i^T\beta)-Y_i} {F(X_i^T\beta)(1 - F(X_i^T\beta))} \right] \\ &= 2 \sum_{i=1}^n X_i \cdot \frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1 - F(X_i^T\beta))} \left[\frac{F(X_i^T\beta)-Y_i}{f(X_i^T\beta)} \right] \end{aligned} \end{split}\]

Hessian for binary glm

  • The Hessian of \(\text{DEV}\) (twice the observed information) is a little ugly for a general \(F\) … but its expectation, twice the expected information, is

\[\begin{split} \begin{aligned} E_{\beta} [ \nabla^2 \text{DEV}(\beta|Y)] &= \frac{1}{2} \text{Var}_{\beta}[\nabla \text{DEV}(\beta|Y)] \\ &= 2 \sum_{i=1}^n X_iX_i^T \frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1 - F(X_i^T\beta))} \\ &= 2 \cdot X^TW_{\beta}X \end{aligned} \end{split}\]

where

\[ W_{\beta} = \text{diag}\left(\frac{f(X_i^T\beta)^2}{F(X_i^T\beta)(1-F(X_i^T\beta))}\right) \]
  • Note that

\[ \nabla \text{DEV}(\beta|Y) = - 2\, X^TW_{\beta}\left( \frac{Y-F(X\beta)}{f(X\beta)} \right), \]

where \(F\) and \(f\) are applied elementwise to \(X\beta\).


Fisher scoring

  • Instead of using the observed information (i.e. the actual Hessian of \(\text{DEV}\)), Fisher proposed using the expected information in a Newton-Raphson-like algorithm.

  • After initialization,

\[\begin{split} \begin{aligned} \hat{\beta}^{(k+1)} &= \hat{\beta}^{(k)} - E_{\hat{\beta}^{(k)}}[\nabla^2 \text{DEV}(\hat{\beta}^{(k)}|Y)]^{-1} \nabla \text{DEV}(\hat{\beta}^{(k)}|Y) \\ &= (X^TW^{(k)}X)^{-1}X^TW^{(k)}\left(X \hat{\beta}^{(k)} + \frac{Y - F(X\hat{\beta}^{(k)})}{f(X\hat{\beta}^{(k)})} \right) \end{aligned} \end{split}\]
  • For logistic regression, Fisher scoring is exactly Newton-Raphson: with the canonical (logit) link, the observed and expected information coincide.


Fisher scoring and IRLS

  • We saw in logistic regression that the MLE looks roughly like a (weighted) OLS estimator.

  • In fact, each iteration of Fisher scoring solves a (weighted) OLS problem with design matrix \(X\), response \(Z^{(k+1)}\) and weights \(W^{(k+1)}\) (hence the name iteratively reweighted least squares, IRLS):

\[\begin{split} \begin{aligned} Z^{(k+1)} &= X\hat{\beta}^{(k)} + \frac{Y - F(X\hat\beta^{(k)})}{f(X\hat\beta^{(k)})} \\ W^{(k+1)} &= W_{\hat{\beta}^{(k)}} \\ &= \frac{f(X\hat{\beta}^{(k)})^2}{F(X\hat{\beta}^{(k)})(1 - F(X\hat{\beta}^{(k)}))}. \end{aligned} \end{split}\]
  • Note that the weights are the inverses of the entries of

\[ \text{Var}_{\beta}\left(X\beta + \frac{Y - F(X\beta)}{f(X\beta)} \biggl \vert X\right). \]
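The whole algorithm fits in a few lines of R. Below is a minimal IRLS sketch for binary regression with a probit link (\(F = \Phi\), \(f\) the standard normal density), checked against glm(); the data are simulated for illustration:

# Fisher scoring / IRLS from scratch for a probit model.
set.seed(6)
n <- 500
X <- cbind(1, rnorm(n))                     # design matrix with intercept
y <- rbinom(n, 1, pnorm(X %*% c(-0.5, 1)))  # truth: probit model

Fcdf <- pnorm; fpdf <- dnorm                # F and f for the probit link
beta <- rep(0, ncol(X))
for (k in 1:25) {
  eta <- drop(X %*% beta)
  W <- fpdf(eta)^2 / (Fcdf(eta) * (1 - Fcdf(eta)))   # current weights
  Z <- eta + (y - Fcdf(eta)) / fpdf(eta)             # working response
  beta <- solve(t(X) %*% (W * X), t(X) %*% (W * Z))  # weighted OLS step
}
cbind(IRLS = drop(beta),
      glm  = coef(glm(y ~ X[, 2], family = binomial(link = "probit"))))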


Fisher scoring and IRLS

  • The variables \(Z^{(k+1)}\) can be written as

\[ g \left(E_{\hat{\beta}^{(k)}}(Y) \right) + g'\left(E_{\hat{\beta}^{(k)}}(Y)\right) \cdot \left(Y - E_{\hat{\beta}^{(k)}}(Y) \right) \]

or,

\[ g(\hat{\mu}^{(k)}) + g'(\hat{\mu}^{(k)})(Y - \hat{\mu}^{(k)}) \]

with

\[ \hat{\mu}^{(k)} = E_{\hat{\beta}^{(k)}}(Y) \]
  • Note

\[ W^{(k)} = \text{Var}_{\hat{\beta}^{(k)}}\left(g'(\mu)(Y-\mu)\right)^{-1}. \]
  • Stationary points of Fisher scoring are zeros of the gradient of the deviance…


Asymptotic distribution

  • A similar Taylor series argument shows that, when the model is correct,

\[ \hat{\beta} \approx N\left(\beta^*, (X^TW_{\hat{\beta}}X)^{-1} \right) \]

where the covariance matrix is evaluated at the consistent estimator \(\hat{\beta}\), justified by Slutsky's theorem.

  • If the model is not correct, the population analog of the score equation, for \((X_i,Y_i) \overset{IID}{\sim} G\), is

\[ 0 = E_G\left[X \frac{f(X^T\beta^*(G))}{F(X^T\beta^*(G))(1-F(X^T\beta^*(G)))} \left(Y - F(X^T\beta^*(G))\right) \right] \]
  • For the asymptotic covariance in that case: use the bootstrap or a sandwich estimator…
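A sketch of both in R for a probit fit, with the sandwich written out from the per-observation scores and the expected information above (simulated data, names illustrative):

# Model-based vs. sandwich vs. pairs-bootstrap standard errors.
set.seed(7)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial(link = "probit"))

X   <- model.matrix(fit)
eta <- drop(X %*% coef(fit))
Fh  <- pnorm(eta); fh <- dnorm(eta)
S   <- X * (fh * (y - Fh) / (Fh * (1 - Fh)))    # per-observation scores
B   <- t(X) %*% (X * (fh^2 / (Fh * (1 - Fh))))  # X^T W X
sandwich_vcov <- solve(B) %*% (t(S) %*% S) %*% solve(B)

# Nonparametric (pairs) bootstrap of the slope:
boot_slope <- replicate(200, {
  i <- sample(n, replace = TRUE)
  coef(glm(y[i] ~ x[i], family = binomial(link = "probit")))[2]
})
c(model     = sqrt(vcov(fit)[2, 2]),
  sandwich  = sqrt(sandwich_vcov[2, 2]),
  bootstrap = sd(boot_slope))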


Inference

  • As these are likelihood models, differences of \(\text{DEV}\) yield likelihood ratio tests.

  • In least squares regression, we use

\[ \frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F}. \]

  • This is replaced with (when the model \({\cal M}_F \supset {\cal M}_R\) is correctly specified)

\[ DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F}. \]

  • Resulting tests do not agree numerically with those coming from IRLS (Wald tests). Both are often used.


anova(width_probit, glm(y ~ width + color, 
          family=binomial(link="probit"), 
          data=crabs))
1 - pchisq(6.72, 3)
A anova: 2 × 4
  Resid. Df  Resid. Dev  Df  Deviance
1       171    194.0357  NA        NA
2       168    187.3085   3  6.727217

0.0813785459421077
summary(glm(y  ~ width + color, 
            family=binomial(link="probit"), 
            data=crabs))
Call:
glm(formula = y ~ width + color, family = binomial(link = "probit"), 
    data = crabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1348  -1.0009   0.5143   0.8662   2.1544  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.94034    1.54818  -4.483 7.36e-06 ***
width        0.28031    0.06011   4.664 3.11e-06 ***
colordarker -0.65847    0.35357  -1.862   0.0626 .  
colorlight   0.11098    0.45860   0.242   0.8088    
colormedium  0.16335    0.25087   0.651   0.5150    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76  on 172  degrees of freedom
Residual deviance: 187.31  on 168  degrees of freedom
AIC: 197.31

Number of Fisher Scoring iterations: 5