Generalized linear models (Ch. 4)¶
We saw that the Bernoulli\((\pi)\) family is an exponential family with natural parameter $$\log \left(\frac{\pi}{1-\pi}\right).$$
Given \((Y_i,X_i)\) with \(Y_i\) binary, the logistic model is $$\log\left(\frac{\pi_i}{1-\pi_i}\right) = X_i^T\beta$$ with $$\pi_i = E[Y_i|X_i].$$
We are assuming that the \(Y_i\) are (conditionally) independent given \(X\).
Note that for binary data $$\text{Var}[Y_i|X_i] = \pi_i (1 - \pi_i).$$
One-parameter exponential families¶
Suppose \(Y_i|X_i\) has density of the form (w.r.t. some measure \(m\)) $$f(y|\theta) = a(\theta)\, b(y) \exp(y \cdot Q(\theta))$$ so that $$a(\theta)^{-1} = \int_{\cal Y} \exp(y \cdot Q(\theta))\, b(y) \; m(dy).$$
This is an exponential family with sufficient statistic \(y\), natural parameter \(\alpha=Q(\theta)\), and reference measure with density \(b\) w.r.t. \(m\); the normalizing constant \(a(\theta)\) depends on \(\theta\) only through \(\alpha=Q(\theta)\).
We could write $$a(\theta)^{-1} = \exp(\Lambda(\alpha)) = \int_{\cal Y} \exp(y \cdot \alpha)\, b(y) \; m(dy)$$ so that $$\Lambda(\alpha) = \log\left(\text{MGF}(Y_0,\alpha) \right)$$ where \(Y_0\) is a random variable with density \(b\) w.r.t. \(m\).
\(\Lambda\) is a cumulant generating function.
Simple, standard calculations show $$E_{\alpha}[Y] = \Lambda'(\alpha), \qquad \text{Var}_{\alpha}[Y] = \Lambda''(\alpha),$$ where \(E_{\alpha}\) denotes expectation under the distribution with density $$f(y|\alpha) = \exp\left(y \cdot \alpha - \Lambda(\alpha)\right) b(y)$$ w.r.t. \(m\).
The function $$\alpha \mapsto E_{\alpha}[Y]$$ is increasing (and invertible on its range).
Therefore \(\text{Var}_{\alpha}[Y]\) can be expressed as a function of \(E_{\alpha}[Y]\).
Set \(\mu=E_{\alpha}[Y]\), \(V = \text{Var}_{\alpha}[Y]\); then we have concluded that $$V = V(\mu).$$
If \(Y\) is Bernoulli, then $$V(\mu)=\mu(1-\mu).$$
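As a quick numerical sketch of these identities (not part of the original derivation): for the Bernoulli family on \(\{0,1\}\) we have \(\Lambda(\alpha) = \log(1 + e^{\alpha})\), and numerical derivatives of \(\Lambda\) recover the mean and the variance function \(\mu(1-\mu)\).

# Bernoulli check: Lambda(alpha) = log(1 + exp(alpha)) is the CGF of the
# reference measure (counting measure on {0,1}); Lambda' gives the mean,
# Lambda'' gives the variance.
Lambda = function(alpha) log(1 + exp(alpha))
alpha = 0.7                                      # an arbitrary natural parameter
h = 1e-4                                         # step for numerical derivatives
mu = (Lambda(alpha + h) - Lambda(alpha - h)) / (2 * h)
V  = (Lambda(alpha + h) - 2 * Lambda(alpha) + Lambda(alpha - h)) / h^2
c(mu, exp(alpha) / (1 + exp(alpha)))             # both ~ 0.668
c(V, mu * (1 - mu))                              # both ~ 0.222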
Poisson case¶
For Poisson random variables, $$V(\mu)=\mu.$$
What takes the role of the logit for the Poisson?
From the mass function: if \(Y \sim \text{Poisson}(\mu)\), then $$P_{\mu}(Y=y) = \frac{\mu^y}{y!} e^{-\mu} = \exp(y \cdot \log \mu) \cdot \frac{e^{-\mu}}{y!}.$$
Natural parameter is \(\log(\mu)\).
The “natural” regression model is: $$\log(\mu_i) = X_i^T\beta.$$
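For instance (a sketch, assuming the satellite-count column in the `crabs` data used later in these notes is named `satell`), a Poisson regression with the log link can be fit as:

library(glmbb)   # for `crabs` data (also loaded below)
data(crabs)
# assumed: `satell` is the satellite count for each crab
width_pois = glm(satell ~ width, family=poisson(), data=crabs)
coef(width_pois)   # fits log(mu_i) = beta_0 + beta_1 * width_i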
“Unnatural models”¶
For binary data, we might suppose that $$\pi_i = F(X_i^T\beta)$$ for our favorite distribution function \(F\).
Or, $$F^{-1}(\pi_i) = X_i^T\beta.$$
A GLM for binary data corresponds to some choice of \(F\).
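For concreteness, a small sketch of common choices of \(F\) (the corresponding links are noted in the comments; the cloglog inverse is written out by hand):

# Three common choices of F; each determines a link g = F^{-1}.
F_logit   = plogis                              # logistic CDF -> logit link
F_probit  = pnorm                               # standard normal CDF -> probit link
F_cloglog = function(eta) 1 - exp(-exp(eta))    # inverse of the cloglog link
eta = seq(-3, 3, length.out=5)
cbind(logit=F_logit(eta), probit=F_probit(eta), cloglog=F_cloglog(eta))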
Generalized linear models¶
Given a dataset \((Y_i, X_{i1}, \dots, X_{ip})\), \(1 \leq i \leq n\), we consider a model for the distribution of \(Y|X_1, \dots, X_p\).
If \(\eta_i = g({\mathbb{E}}(Y_i|X_i)) = g(\mu_i) = X_i^T\beta\), then \(g\) is called the link function for the model.
If \({\text{Var}}(Y_i|X_i) = \phi \cdot V({\mathbb{E}}(Y_i|X_i)) = \phi \cdot V(\mu_i)\) for some \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.
A GLM is specified by this pair \((g, V)\). If the data come from a one-parameter exponential family, then \(V\) is already determined…
The parameter \(\phi\) is called the dispersion parameter.
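In R, a \((g, V)\) pair can be specified directly through the quasi() family (e.g. quasi(link="log", variance="mu")), with \(\phi\) estimated from the data. A small sketch using the equivalent quasipoisson() family (again assuming the `satell` count column in `crabs`):

# log link, V(mu) = mu, dispersion phi estimated rather than fixed at 1
width_qpois = glm(satell ~ width, family=quasipoisson(), data=crabs)
summary(width_qpois)$dispersion   # estimated phi; values > 1 indicate overdispersion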
The canonical reference is McCullagh & Nelder, Generalized Linear Models.
Binary regression as GLM¶
For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)
For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)
For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)
Canonical link¶
The \(\text{logit}\) transform is the canonical link. This is our “natural” binary regression model because the natural parameters are modelled as \(X\beta\).
library(glmbb) # for `crabs` data
data(crabs)
width_logit = glm(y ~ width, family=binomial(), data=crabs)
width_probit = glm(y ~ width, family=binomial(link='probit'), data=crabs)
c(AIC(width_logit), AIC(width_probit))
- 198.452663928765
- 198.035734025475
summary(width_probit)
Call:
glm(formula = y ~ width, family = binomial(link = "probit"),
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0519 -1.0494 0.5374 0.9126 1.6897
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.50196 1.50712 -4.978 6.44e-07 ***
width 0.30202 0.05804 5.204 1.95e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 194.04 on 171 degrees of freedom
AIC: 198.04
Number of Fisher Scoring iterations: 5
Deviance for binary glm¶
With \(\pi_i = F(x_i^T\beta)\), the deviance is $$\text{DEV}(\beta|Y) = -2\sum_{i=1}^n \left[Y_i \log F(x_i^T\beta) + (1 - Y_i)\log\left(1 - F(x_i^T\beta)\right)\right].$$
Gradient for binary glm¶
Writing \(f = F'\), the gradient of the deviance is $$\nabla \text{DEV}(\beta|Y) = -2\sum_{i=1}^n x_i\, \frac{f(x_i^T\beta)}{F(x_i^T\beta)\left(1 - F(x_i^T\beta)\right)}\left(Y_i - F(x_i^T\beta)\right).$$
Hessian for binary glm¶
The Hessian of \(\text{DEV}\) is twice the observed information – a little ugly … but the expected information is $$X^T W_{\beta} X$$ where $$W_{\beta} = \text{diag}\left(\frac{f(x_i^T\beta)^2}{F(x_i^T\beta)\left(1 - F(x_i^T\beta)\right)}\right).$$
Note $$\nabla \text{DEV}(\beta|Y) = - 2 X^TW_{\beta}\left( \frac{Y-F(X\beta)}{f(X\beta)} \right).$$
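As a sanity check (a sketch reusing the width_probit fit from above), the deviance and this gradient formula can be evaluated directly; the gradient should be essentially zero at the MLE.

X = model.matrix(width_probit)
Y = width_probit$y                   # the 0/1 response stored in the fit
eta = drop(X %*% coef(width_probit))
Fe = pnorm(eta); fe = dnorm(eta)     # F(X beta) and f(X beta) for the probit link
DEV = -2 * sum(Y * log(Fe) + (1 - Y) * log(1 - Fe))
grad = -2 * t(X) %*% ((fe^2 / (Fe * (1 - Fe))) * (Y - Fe) / fe)
c(DEV, deviance(width_probit))       # both ~ 194.04
grad                                 # ~ 0, up to the convergence tolerance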
Fisher scoring¶
Instead of using the observed information (i.e. the actual Hessian of \(\text{DEV}\)), Fisher proposed using the expected information in a Newton-Raphson-like algorithm.
After initialization, the update is $$\beta^{(k+1)} = \beta^{(k)} + \left(X^T W^{(k+1)} X\right)^{-1} X^T W^{(k+1)}\left(\frac{Y - F(X\beta^{(k)})}{f(X\beta^{(k)})}\right), \qquad W^{(k+1)} = W_{\beta^{(k)}}.$$
For logistic regression, Fisher scoring is Newton-Raphson (for the canonical link, the observed and expected information coincide).
Fisher scoring and IRLS¶
We saw in logistic regression that the MLE is roughly a (weighted) OLS estimator.
In fact, each iteration of Fisher scoring solves a (weighted) OLS problem with design matrix \(X\), response \(Z^{(k+1)}\) and weights \(W^{(k+1)}\): $$\beta^{(k+1)} = \left(X^T W^{(k+1)} X\right)^{-1} X^T W^{(k+1)} Z^{(k+1)}.$$
Note that the weights are the inverses of the entries of $$\text{Var}_{\beta}\left(X\beta + \frac{Y - F(X\beta)}{f(X\beta)} \biggl \vert X\right).$$
Fisher scoring and IRLS¶
The variables \(Z^{(k+1)}\) can be written as $$Z^{(k+1)} = X\beta^{(k)} + \frac{Y - F(X\beta^{(k)})}{f(X\beta^{(k)})},$$ or, $$Z^{(k+1)} = X\beta^{(k)} + g'(\mu^{(k)})\left(Y - \mu^{(k)}\right)$$ with $$\mu^{(k)} = F(X\beta^{(k)}), \qquad g = F^{-1}.$$
Note that $$\nabla \text{DEV}(\beta|Y) = -2 X^T W_{\beta}\left(Z_{\beta} - X\beta\right), \qquad Z_{\beta} = X\beta + \frac{Y - F(X\beta)}{f(X\beta)}.$$
Stationary points of Fisher scoring are therefore zeros of the gradient…
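A bare-bones IRLS loop for the probit model (a sketch of the updates above, not how glm() is actually implemented) should reproduce coef(width_probit) on the crabs data:

irls_probit = function(X, Y, niter=25) {
  beta = rep(0, ncol(X))
  for (k in 1:niter) {
    eta = drop(X %*% beta)
    Fe = pnorm(eta); fe = dnorm(eta)
    W = fe^2 / (Fe * (1 - Fe))                                     # working weights W^{(k+1)}
    Z = eta + (Y - Fe) / fe                                        # working response Z^{(k+1)}
    beta = drop(solve(crossprod(X, W * X), crossprod(X, W * Z)))   # weighted OLS step
  }
  beta
}
X = model.matrix(y ~ width, data=crabs)
cbind(irls=irls_probit(X, width_probit$y), glm=coef(width_probit))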
Asymptotic distribution¶
A similar Taylor series argument shows that, when the model is correct, $$\hat\beta \approx N\left(\beta, \left(X^T W_{\hat\beta} X\right)^{-1}\right),$$ where the plug-in covariance matrix is a consistent estimator (of \(\left(X^T W_{\beta} X\right)^{-1}\)) and we apply Slutsky.
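As a quick check (a sketch using the probit fit above), the plug-in covariance \(\left(X^T W_{\hat\beta} X\right)^{-1}\) computed by hand should agree with vcov():

X = model.matrix(width_probit)
eta = drop(X %*% coef(width_probit))
W = dnorm(eta)^2 / (pnorm(eta) * (1 - pnorm(eta)))   # weights evaluated at beta-hat
solve(crossprod(X, W * X))                           # ~ vcov(width_probit)
vcov(width_probit)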
If the model is not correct, the population analog for \((X_i,Y_i) \overset{IID}{\sim} G\) is the \(\beta(G)\) solving $$E_G\left[\frac{f(X^T\beta)}{F(X^T\beta)\left(1 - F(X^T\beta)\right)}\, X\left(Y - F(X^T\beta)\right)\right] = 0,$$ i.e. the zero of the population version of the gradient.
Asymptotic covariance: bootstrap or sandwich…
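A sketch of the sandwich option, using the sandwich package (which handles glm fits such as width_probit):

library(sandwich)
sandwich(width_probit)   # robust covariance; compare with vcov(width_probit)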
Inference¶
As these are likelihood models, differences of \(\text{DEV}\) yield likelihood ratio tests.
In least squares regression, we use $$\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F}.$$
This is replaced with (when the smaller model \({\cal M}_R \subset {\cal M}_F\) is correctly specified) $$DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F}.$$
Resulting tests do not agree numerically with those coming from IRLS (Wald tests). Both are often used.
anova(width_probit, glm(y ~ width + color,
family=binomial(link="probit"),
data=crabs))
1 - pchisq(6.72, 3)
| | Resid. Df | Resid. Dev | Df | Deviance |
|---|---|---|---|---|
| 1 | 171 | 194.0357 | NA | NA |
| 2 | 168 | 187.3085 | 3 | 6.727217 |
summary(glm(y ~ width + color,
family=binomial(link="probit"),
data=crabs))
Call:
glm(formula = y ~ width + color, family = binomial(link = "probit"),
data = crabs)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1348 -1.0009 0.5143 0.8662 2.1544
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.94034 1.54818 -4.483 7.36e-06 ***
width 0.28031 0.06011 4.664 3.11e-06 ***
colordarker -0.65847 0.35357 -1.862 0.0626 .
colorlight 0.11098 0.45860 0.242 0.8088
colormedium 0.16335 0.25087 0.651 0.5150
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 225.76 on 172 degrees of freedom
Residual deviance: 187.31 on 168 degrees of freedom
AIC: 197.31
Number of Fisher Scoring iterations: 5