Models for binary outcomes

  • Data \((X_i, Y_i), 1 \leq i \leq n\).

  • Examples:

    1. medical: presence or absence of cancer

    2. financial: bankrupt or solvent

    3. industrial: passes a quality control test or not

  • For \(0-1\) responses we need to model \(\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\)

options(repr.plot.width=5, repr.plot.height=4)

Binary outcomes

  • Likelihood (\(X\) considered fixed, or likelihood considered conditional): \( -\log L(\pi|Y) = \sum_{i=1}^n - Y_i \log\left(\frac{\pi_i}{1-\pi_i}\right) - \log(1 -\pi_i) \)

  • The function \( \pi \mapsto \log\left(\frac{\pi}{1-\pi}\right) \overset{\text{def}}{=} \text{logit}(\pi) \) is called the logit transform.

Logit transform

  • Note \( \text{logit}\left(\frac{e^{\eta}}{1+e^{\eta}}\right) = \eta, \) where \( F_{\text{logistic}}(\eta) = \frac{e^{\eta}}{1+e^{\eta}} \) is the logistic distribution function.

  • So, \( \text{logit} = F_{\text{logistic}}^{-1} \)


# logit transform: g(p) = log(p / (1 - p))
g = function(p) {
  return(log(p / (1 - p)))
}
p = seq(0.01, 0.99, length=200)
plot(p, g(p), lwd=2, type='l', col='red')
[Figure: plot of the logit function g(p) on (0.01, 0.99)]

# inverse logit (the logistic CDF): g.inv(x) = exp(x) / (1 + exp(x))
g.inv <- function(x) {
  return(exp(x) / (1 + exp(x)))
}

x = seq(g(0.01), g(0.99), length=200)
plot(x, g.inv(x), lwd=2, type='l', col='red')
[Figure: plot of the inverse logit function g.inv(x)]

Logistic regression model

  • Write likelihood in natural parameters

\[ \eta_i = \text{logit}(\pi_i) \]
  • Then,

\[ - \log L(\eta|Y) = \sum_{i=1}^n - Y_i \eta_i + \log(1 + e^{\eta_i}) \]
  • Logistic regression model: \(\eta_i=X_i^T\beta\) and

\[ - \log L(\beta|y) = -(X\beta)^TY + \sum_{i=1}^n \log(1 + \exp(X_i^T\beta)) \]
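The expression above is easy to code directly; a minimal sketch in R (simulated data and function names are my own), checked against `dbinom`:

```r
# Negative log-likelihood of the logistic model:
# -log L(beta|Y) = -(X beta)^T Y + sum_i log(1 + exp(X_i^T beta))
neg_log_lik = function(beta, X, Y) {
  eta = X %*% beta
  as.numeric(-sum(Y * eta) + sum(log(1 + exp(eta))))
}

# sanity check on simulated data
set.seed(1)
X = cbind(1, rnorm(50))
beta = c(-0.5, 1)
pi_vec = as.numeric(plogis(X %*% beta))
Y = rbinom(50, 1, pi_vec)
neg_log_lik(beta, X, Y)                # direct formula
-sum(dbinom(Y, 1, pi_vec, log=TRUE))   # same value via dbinom
```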

Logistic regression

  • In other words, \( \text{logit}(\pi(x_1,\dots,x_p)) = x^T\beta. \)

  • Negative log-likelihood is convex in \(\beta\) \(\implies\) easy to add regularization penalties (and solve such problems…)

  • Default in most software for GLMs because \(\eta_i=X_i^T\beta\) is the canonical link


Logistic regression

  • How do we know it’s convex?

  • Negative log-likelihood of saturated model with parameters \(\eta=(\eta_1, \dots, \eta_n)\)

\[ -\log L(\eta|Y) = \sum_{i=1}^n \Lambda_B(\eta_i) -\eta_i Y_i \]

with \(\Lambda_B(\eta)\) the cumulant generating function (CGF) of Bernoulli

  • For exponential family with sufficient statistic \(W\) and natural parameters \(\eta\)

\[ \exp(\Lambda(\eta)) = \mathbb{E}_0\left(\exp(\eta^TW) \right) \]

where \(\mathbb{P}_0\) is the law under \(\eta=0\). In our case

\[ \Lambda(\eta) = \sum_{i=1}^n \Lambda_B(\eta_i). \]
  • Negative log-likelihood is \( \eta \mapsto -\eta^TY + \Lambda(\eta) \)

  • Logistic regression imposes linear constraint on \(\eta\): \(\eta \in \text{col}(X)\).


Logistic regression

  • Logistic is “natural” because it is a linear model for natural parameters of Binomial data.

  • It can appear in other places (c.f. 5.1.5 of Agresti)

  • Suppose we observe data from

\[\begin{split} \begin{aligned} Y &\sim \text{Bernoulli}(\pi) \\ X | Y &\sim \begin{cases} N(\mu_0, \Sigma) & Y = 0\\ N(\mu_1, \Sigma) & Y=1 \end{cases} \end{aligned} \end{split}\]

  • Then, logistic model for \(Y|X\) holds.
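This can be checked numerically; a sketch with my own one-dimensional simulation (shared variance \(\sigma^2\), where theory gives logistic slope \((\mu_1-\mu_0)/\sigma^2\)):

```r
# Simulate X|Y Gaussian with common variance, then fit logistic for Y|X.
# Implied slope: (mu1 - mu0) / sigma^2 = 1 with these (assumed) parameters.
set.seed(2)
n = 200000
mu0 = 0; mu1 = 1; sigma = 1
Y = rbinom(n, 1, 0.4)
X = rnorm(n, mean = ifelse(Y == 1, mu1, mu0), sd = sigma)
fit = glm(Y ~ X, family = binomial())
coef(fit)['X']    # close to (mu1 - mu0) / sigma^2 = 1
```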


  • Another reason this model is popular relates to the odds ratio (c.f. 5.1.4 of Agresti)

  • If the logistic model is correct along with an assumption about sampling, it is possible to estimate parameters of \(Y|X\) distribution in case-control studies where the actual randomness is \(X|Y\).

  • This is similar to (2.2.6) of Agresti (Lung Cancer and Smoking) which we discussed earlier.


Interpretation of coefficients: Odds Ratios

  • Logistic model:

\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+1, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{\beta_j}\]
  • If \(X_j \in \{0, 1\}\) is dichotomous, then the odds for the group with \(X_j = 1\) are \(e^{\beta_j}\) times the odds for the group with \(X_j = 0\), other parameters being equal.


Rare disease hypothesis

  • When incidence is rare, \(P(Y=0)\approxeq 1\) no matter what the covariates \(X_j\)’s are.

  • In this case, odds ratios are almost ratios of probabilities: \(OR_{X_j} \approxeq \frac{{\mathbb{P}}(Y=1|\dots, X_j=x_j+1, \dots)}{{\mathbb{P}}(Y=1|\dots, X_j=x_j, \dots)}\)

  • Hypothetical example: in a lung cancer study, if \(X_j\) is an indicator of smoking or not, a \(\beta_j\) of 5 for smoking vs. non-smoking means smokers are \(e^5 \approx 150\) times more likely to develop lung cancer
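A quick numerical check of the approximation (the numbers here are hypothetical):

```r
# For a rare baseline risk, the odds ratio and risk ratio nearly coincide.
p0 = 0.001                          # baseline risk (hypothetical)
beta_j = 2                          # log odds ratio (hypothetical)
odds1 = (p0 / (1 - p0)) * exp(beta_j)
p1 = odds1 / (1 + odds1)            # risk after increasing X_j by 1
c(OR = exp(beta_j), RR = p1 / p0)   # nearly equal when p0 is small
```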


Other (linear) binary models

  • Suppose there is some latent variable \(T_i | X_i \sim F\) such that

\[\begin{split} Y_i = \begin{cases} 1 & T_i \leq X_i^T\beta \\ 0 & \text{otherwise.} \end{cases} \end{split}\]
  • Then,

\[ P(Y_i=1 | X_i) = P(T_i \leq X_i^T\beta|X_i) = F(X_i^T\beta). \]

  • Example: probit \(F=\Phi\) (dbn of \(N(0,1)\))

  • Setting \( \eta_i = \text{logit}(F(X_i^T\beta)), \) the likelihood is then \( -\log L(\beta|Y) = \sum_{i=1}^n -Y_i \cdot \text{logit}(F(X_i^T\beta)) - \log(1 - F(X_i^T\beta)). \)

  • Parameters no longer interpretable in odds ratios but problem can still be convex (e.g. probit)
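In R, such latent-variable links are available through `glm`'s `link` argument; a probit sketch on simulated data (the parameters are my own):

```r
# Latent-variable model with F = Phi: fit by probit regression.
set.seed(3)
n = 2000
x = rnorm(n)
y = rbinom(n, 1, pnorm(-0.2 + 0.8 * x))   # latent T_i ~ N(0,1)
probit_fit = glm(y ~ x, family = binomial(link = "probit"))
coef(probit_fit)    # estimates near the assumed (-0.2, 0.8)
```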


Modelling probabilities

  • Note: \(\text{Var}(Y) = \pi ( 1 - \pi) = E(Y) \cdot ( 1- E(Y))\)

  • The binary nature forces a relation between mean and variance of \(Y\).

  • This makes logistic regression a Generalized Linear Model, use glm in R.


Horseshoe crabs

library(glmbb) # for `crabs` data
data(crabs)
head(crabs)
A data.frame: 6 × 6
   color  spine width satell weight y
1 medium    bad  28.3      8   3050 1
2   dark    bad  22.5      0   1550 0
3  light   good  26.0      9   2300 1
4   dark    bad  24.8      0   2100 0
5   dark    bad  26.0      4   2600 1
6 medium    bad  23.8      0   2100 0

width_glm = glm(y ~ width, family=binomial(), data=crabs)
summary(width_glm)
anova(glm(y ~ 1, family=binomial(), data=crabs), width_glm)
c(225.76-194.45, 4.887^2)
Call:
glm(formula = y ~ width, family = binomial(), data = crabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0281  -1.0458   0.5480   0.9066   1.6942  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -12.3508     2.6287  -4.698 2.62e-06 ***
width         0.4972     0.1017   4.887 1.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76  on 172  degrees of freedom
Residual deviance: 194.45  on 171  degrees of freedom
AIC: 198.45

Number of Fisher Scoring iterations: 4
A anova: 2 × 4
  Resid. Df Resid. Dev Df Deviance
1       172   225.7585 NA       NA
2       171   194.4527  1 31.30586
[1] 31.31000 23.88277

summary(glm(y ~ width, data=crabs)) # same fit as `lm`
Call:
glm(formula = y ~ width, data = crabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8614  -0.4495   0.1569   0.3674   0.7061  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.76553    0.42136  -4.190 4.46e-05 ***
width        0.09153    0.01597   5.731 4.42e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1951497)

    Null deviance: 39.780  on 172  degrees of freedom
Residual deviance: 33.371  on 171  degrees of freedom
AIC: 212.26

Number of Fisher Scoring iterations: 2
summary(lm(y ~ width, data=crabs))  # dispersion is \hat{\sigma}^2
0.4418^2
Call:
lm(formula = y ~ width, data = crabs)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8614 -0.4495  0.1569  0.3674  0.7061 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.76553    0.42136  -4.190 4.46e-05 ***
width        0.09153    0.01597   5.731 4.42e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4418 on 171 degrees of freedom
Multiple R-squared:  0.1611,	Adjusted R-squared:  0.1562 
F-statistic: 32.85 on 1 and 171 DF,  p-value: 4.415e-08
0.19518724

Logistic regression and \(I \times 2\) tables

  • For \(I \times 2\) tables \(N=(N_{ij})\), set

\[ Y_i = N_{i1}, \qquad 1 \leq i \leq I \]
  • Under independent rows sampling,

\[ Y_i \sim \text{Binomial}(N_{i+}, \pi_i) \]
  • Under Poisson sampling, same result holds if we condition on \(N_{i+}\).

  • Model

\[ \text{logit}(\pi_i) = \alpha + \beta_i \]
  • Like a one-way ANOVA regression model

  • Need some identifiability constraint (R often sets one \(\beta\) to 0).

  • Not really using a logistic assumption here – just parameterizing \(\pi_i\) a certain way: model is saturated.

# Table 3.8
absent = c(17066, 14464, 788, 126, 37)
present = c(48, 38, 5, 1, 1)
total = absent + present
scores = c(0, 0.5, 1.5, 4.0, 7.0)
proportion = present / total
alcohol = rbind(absent, present)
chisq.test(alcohol)
Warning message in chisq.test(alcohol):
“Chi-squared approximation may be incorrect”
	Pearson's Chi-squared test

data:  alcohol
X-squared = 12.082, df = 4, p-value = 0.01675

lr_stat = function(data_table) {
    chisq_test = chisq.test(data_table)
    return(2 * sum(data_table * log(data_table / chisq_test$expected)))
}
lr_stat(alcohol)
Warning message in chisq.test(data_table):
“Chi-squared approximation may be incorrect”
6.20199790716322

alcohol_scores = glm(proportion ~ factor(scores), family=binomial(), 
                     weights=total)
summary(alcohol_scores)
Call:
glm(formula = proportion ~ factor(scores), family = binomial(), 
    weights = total)

Deviance Residuals: 
[1]  0  0  0  0  0

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -5.87364    0.14454 -40.637   <2e-16 ***
factor(scores)0.5 -0.06819    0.21743  -0.314   0.7538    
factor(scores)1.5  0.81358    0.47134   1.726   0.0843 .  
factor(scores)4    1.03736    1.01431   1.023   0.3064    
factor(scores)7    2.26272    1.02368   2.210   0.0271 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020e+00  on 4  degrees of freedom
Residual deviance: 1.8163e-13  on 0  degrees of freedom
AIC: 28.627

Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total), 
      glm(proportion ~ factor(scores), family=binomial(), weights=total))
A anova: 2 × 4
  Resid. Df   Resid. Dev Df Deviance
1         4 6.201998e+00 NA       NA
2         0 1.816325e-13  4 6.201998

A 1 df test

  • In this case, scores is ordered, so we might fit the model \( \text{logit}(\pi_i) = \alpha + \beta \cdot \text{scores}_i. \)

  • Perhaps more power because direction of deviation from null is specified.


summary(glm(proportion ~ scores, family=binomial(), weights=total))
Call:
glm(formula = proportion ~ scores, family = binomial(), weights = total)

Deviance Residuals: 
      1        2        3        4        5  
 0.5921  -0.8801   0.8865  -0.1449   0.1291  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.9605     0.1154 -51.637   <2e-16 ***
scores        0.3166     0.1254   2.523   0.0116 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom
AIC: 24.576

Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total), 
      glm(proportion ~ scores, family=binomial(), weights=total)) # LR test
data.frame(df.1=pchisq(4.25, 1, lower.tail=FALSE),
           df.4=pchisq(6.2, 4, lower.tail=FALSE)) 
A anova: 2 × 4
  Resid. Df Resid. Dev Df Deviance
1         4   6.201998 NA       NA
2         3   1.948721  1 4.253277
A data.frame: 1 × 2
        df.1      df.4
1 0.03925033 0.1847017

Deviance

Instead of sum of squares, logistic regression uses deviance:

  • Defined as

\[DEV(\pi| Y) = -2 \log L(\pi| Y) + 2 \log L(\hat{\pi}_S| Y)\]

where \(\pi\) is an estimate of the mean of \(Y\), with \(\hat{\pi}_S\) the saturated fit (i.e. no constraints on the \(\pi_i\)'s).

  • For Binomial data \(Y_i \sim \text{Binom}({\tt total}_i,\pi_i)\)

\[ \hat{\pi}_{S,i} = \frac{Y_i}{{\tt total}_i} \]
  • If \(Y\) is a binary vector, with mean vector \(\pi\) then

\[DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)\]

  • If \(Y_i \sim \text{Binom}({\tt total}_i,\pi_i)\) then

\[\begin{split} \begin{aligned} DEV(\pi| Y) &= -2 \sum_{i=1}^n \biggl( Y_i \log(\pi_i / \hat{\pi}_{S,i}) + \\ & \qquad ({\tt total}_i -Y_i) \log((1-\pi_i)/(1-\hat{\pi}_{S,i})) \biggr) \\ &= -2 \sum_{i=1}^n {\tt total}_i \biggl( \hat{\pi}_{S,i} \log(\pi_i/\hat{\pi}_{S,i}) + \\ & \qquad (1-\hat{\pi}_{S,i}) \log((1-\pi_i)/(1-\hat{\pi}_{S,i})) \biggr) \end{aligned} \end{split}\]

where \(\hat{\pi}_{S,i} = Y_i / {\tt total}_i\) is the MLE under the saturated model.

  • Fit using weights=total in glm.
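The binomial formula can be verified against what `glm` reports; a sketch with simulated grouped data (the counts and coefficients are my own):

```r
# Compare the grouped-binomial deviance formula to deviance(glm(...)).
set.seed(4)
total = rep(50, 6)
x = 1:6
y = rbinom(6, total, plogis(-1 + 0.3 * x))
prop = y / total
fit = glm(prop ~ x, family = binomial(), weights = total)
pi_hat = fitted(fit)                         # model fit
pi_S = prop                                  # saturated fit Y_i / total_i
dev_by_hand = -2 * sum(y * log(pi_hat / pi_S) +
                       (total - y) * log((1 - pi_hat) / (1 - pi_S)))
c(dev_by_hand, deviance(fit))                # agree
```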


Deviance for logistic regression

  • For any binary regression model, the deviance is:

\[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i \cdot {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}\]
  • For the logistic model, the RHS is:

\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}\]
  • The logistic model is special in that \(\text{logit}(\pi_i(\beta))=X_i^T\beta\).


Fitting the logistic model

Derivatives of log-likelihood

  • Let \(Y \in \{0,1\}^n, X \in \mathbb{R}^{n \times p}\).

  • Gradient / Score

\[\begin{split} \begin{aligned} \nabla(-\log L(\beta|Y)) &= - \left[X^T\left(y - \frac{\exp(X\beta)}{1 + \exp(X\beta)} \right) \right] \\ &= - \left[X^T\left(y - E_{\beta}[Y|X]\right) \right] \end{aligned} \end{split}\]
  • Hessian

\[\begin{split} \begin{aligned} \nabla^2 (-\log L(\beta|Y)) &= X^T\frac{\exp(X\beta)}{(1 + \exp(X\beta))^2}X \\ &= X^T \text{Var}_{\beta}(Y|X) X. \end{aligned} \end{split}\]

  • Above,

\[\begin{split} \begin{aligned} E_{\beta}[Y|X] &= \frac{\exp(X\beta)}{1 + \exp(X\beta)} = \pi_{\beta}(X) \in \mathbb{R}^n \\ \text{Var}_{\beta}[Y|X] &= \text{diag}\left(E_{\beta}[Y|X] (1 - E_{\beta}[Y|X]) \right) \\ &= \text{diag}(\pi_{\beta}(X)(1 - \pi_{\beta}(X))) \\ &= W_{\beta}(X). \end{aligned} \end{split}\]
  • Derivatives related to moments as this is an exponential family (conditional on \(X\)) with natural parameter \(\beta\), sufficient statistic \(X^TY\) and cumulant generating function

\[ \Lambda(\beta) = \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right) \]

Newton-Raphson

  1. Given initial \(\hat{\beta}^{(0)}\)

  2. Until convergence \( \hat{\beta}^{(i+1)} = \hat{\beta}^{(i)} + (X^T W_{\hat{\beta}^{(i)}}(X)X)^{-1}X^T(Y - \pi_{\hat{\beta}^{(i)}}(X)) \) (the gradient of the negative log-likelihood is \(-X^T(Y - \pi_{\beta}(X))\), so the Newton step adds the inverse Hessian times the score)
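Newton-Raphson for logistic regression can be sketched directly on simulated data (my own parameters); the result should agree with `glm`'s IRLS fit:

```r
set.seed(5)
n = 500
X = cbind(1, rnorm(n))
Y = rbinom(n, 1, as.numeric(plogis(X %*% c(-1, 2))))

beta_hat = c(0, 0)                            # initial value
for (i in 1:25) {
  pi_hat = as.numeric(plogis(X %*% beta_hat))
  W = diag(pi_hat * (1 - pi_hat))             # W_beta(X)
  score = t(X) %*% (Y - pi_hat)               # gradient of log-likelihood
  beta_hat = beta_hat + as.numeric(solve(t(X) %*% W %*% X, score))
}
cbind(newton = beta_hat, glm = coef(glm(Y ~ X[, 2], family = binomial())))
```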


Fitting the logistic model

Stationarity

  • MLE satisfies \( X^T(Y - \pi_{\hat{\beta}}(X)) = 0 \)


Inference

  • Suppose model is true, with true parameter \(\beta^*\). Then, \( E_{\beta^*}[X^T(Y - \pi_{\beta^*}(X))] = 0 \)

  • Taylor series \( \pi_{\hat{\beta}}(X) \approx \pi_{\beta^*}(X) + W_{\beta^*}(X) X(\hat{\beta}-\beta^*) \)

  • How big is remainder?


Inference

  • Stationarity conditions then read \( X^T(Y - E_{\beta^*}[Y|X]) \approx X^TW_{\beta^*}(X) X(\hat{\beta}-\beta^*) \)

  • Or, \( \begin{aligned} \hat{\beta}-\beta^* &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1}X^T(Y - E_{\beta^*}[Y|X]) \\ &= (X^TW_{\beta^*}(X)X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \end{aligned} \)

  • Implies \( \begin{aligned} \text{Var}_{\beta^*}(\hat{\beta}-\beta^*) &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1} \text{Var}_{\beta^*}\left(X^T(Y - \pi_{\beta^*}(X))\right) \left(X^TW_{\beta^*}(X) X\right)^{-1} \\ &\approx \left(X^T W_{\beta^*}(X) X\right)^{-1} \end{aligned} \) (Last line assumes the model is true.)


  • Leads to

\[ \hat{\beta}-\beta^* \approx N\left(0, \left(X^TW_{\beta^*}(X) X\right)^{-1}\right) \]

(Plugin estimator of variance appeals to Slutsky, consistency of \(\hat{\beta}\).)
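The plug-in variance \((X^TW_{\hat{\beta}}(X)X)^{-1}\) can be checked against `vcov` (simulated data, my own parameters):

```r
# Compare the plug-in covariance (X^T W X)^{-1} at beta-hat with vcov(glm).
set.seed(6)
x = rnorm(400)
y = rbinom(400, 1, plogis(0.5 * x))
fit = glm(y ~ x, family = binomial())
X = model.matrix(fit)
W = diag(fitted(fit) * (1 - fitted(fit)))   # W at beta-hat
V = solve(t(X) %*% W %*% X)                 # (X^T W X)^{-1}
max(abs(V - vcov(fit)))                     # essentially zero
```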


Where do we really use the model?

  • Under the logistic model \( \begin{aligned} \text{Var}_{\beta^*}(X^T(Y-\pi_{\beta^*}(X))) & \approx X^TW_{\beta^*}(X)X. \end{aligned} \)

  • Really just using the standard asymptotic distribution of the MLE:

\[ \hat{\beta} \approx N\left(\beta^*, \left(E_{\beta^*}[\nabla^2 (-\log L(\beta^*|Y))]\right)^{-1} \right) \]
  • For logistic regression, note that (conditional on \(X\)) \( E_{\beta^*}[\nabla^2 (-\log L(\beta^*|Y))] = \nabla^2 (-\log L(\beta^*|Y)). \)

center = coef(width_glm)['width']
SE = sqrt(vcov(width_glm)['width', 'width'])
U = center + SE * qnorm(0.975)
L = center + SE * qnorm(0.025)
data.frame(L, center, U)
A data.frame: 1 × 3
              L    center         U
width 0.2978326 0.4972306 0.6966286

These are slightly different from what R will give you if you ask it for confidence intervals:

confint(width_glm)
Waiting for profiling to be done...
A matrix: 2 × 2 of type dbl
                  2.5 %     97.5 %
(Intercept) -17.8100090 -7.4572470
width         0.3083806  0.7090167

What is logistic regression even estimating?

  • Before appealing to general properties of MLE, we had only really used the assumption the model was true once.

  • What can we say if the model is not true?

  • Suppose \((Y_i,X_i) \overset{IID}{\sim} F, 1 \leq i \leq n\).


  • Define the population parameter \(\beta^*=\beta^*(F)\) by \( E_F[X(Y - \pi_{\beta^*(F)}(X))] = 0. \)

  • This is minimizer of

\[ \beta \mapsto E_F\left[-(X^T\beta) Y + \log\left(1 + \exp(X^T\beta) \right) \right] \]
  • Taylor expansion yields

\[ \hat{\beta}-\beta^*(F) \approx \left(X^TW_{\beta^*(F)}(X) X\right)^{-1}X^T(Y - \pi_{\beta^*(F)}(X)) \]

  • Note that (under conditions…)

\[\begin{split} \begin{aligned} n^{-1/2} X^T(Y - \pi_{\beta^*(F)}(X)) & \overset{D}{\to} N(0, \Sigma(F)) \\ n^{-1}X^TW_{\beta^*(F)}(X) X &\overset{P}{\to} E_F[XX^T \cdot W_{\beta^*(F)}(X)] \\ & \overset{\text{def}}{=} Q(F) \end{aligned} \end{split}\]
  • Implies \( \hat{\beta}-\beta^*(F) \overset{D}{\to} N(0, n^{-1} Q(F)^{-1}\Sigma(F)Q(F)^{-1}). \)

  • Variance above has so-called sandwich form.


  • Applying Slutsky, \( \hat{\beta}-\beta^*(F) \approx N\left(0, (X^TW_{\hat{\beta}}(X) X)^{-1} (n\Sigma(F)) (X^TW_{\hat{\beta}}(X) X)^{-1}\right) \)

  • We know everything in the approximation above (or can estimate it) except \(\Sigma(F)\)

  • Covariance \(\Sigma(F)\) can be estimated, say, by bootstrap or sandwich estimator. (Approachable description by D. Freedman here).

  • Alternatively, entire logistic regression could be bootstrapped.
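A sketch of the sandwich estimator itself (my own simulation, deliberately misspecified: probit data, logit fit); bread is \((X^TW_{\hat\beta}X)^{-1}\) and the meat is a sum of score outer products:

```r
set.seed(7)
n = 1000
x = rnorm(n)
y = rbinom(n, 1, pnorm(0.7 * x))            # true link is probit...
fit = glm(y ~ x, family = binomial())       # ...but we fit a logit model
X = model.matrix(fit)
res = y - fitted(fit)                       # Y - pi_hat(X)
bread = solve(t(X) %*% diag(fitted(fit) * (1 - fitted(fit))) %*% X)
meat = t(X * res) %*% (X * res)             # sum_i (Y_i - pi_i)^2 X_i X_i^T
sandwich_vcov = bread %*% meat %*% bread
cbind(model = diag(vcov(fit)), sandwich = diag(sandwich_vcov))
```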


Testing in logistic regression

What about comparing full and reduced model?

  • For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).

  • In least squares regression, we use

\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F}\]
  • This is replaced with (when model true)

\[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F}\]
  • These tests assume model \({\cal M}_F\) is correctly specified. Otherwise, Wald-type test with sandwich or other covariance estimate would have to be used.

  • Deviance tests do not agree numerically with Wald tests even when the model is correct. Both are often used. Wald can fail (e.g. the Hauck–Donner effect)!


bigger_glm = glm(y ~ width + weight, data=crabs, family=binomial())
anova(width_glm, bigger_glm)
A anova: 2 × 4
  Resid. Df Resid. Dev Df Deviance
1       171   194.4527 NA       NA
2       170   192.8919  1 1.560777

We should compare this difference in deviance with a \(\chi^2_1\) random variable.

1 - pchisq(1.56,1)
0.211665220269218

Let’s compare this with the Wald test:

summary(bigger_glm)
Call:
glm(formula = y ~ width + weight, family = binomial(), data = crabs)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1127  -1.0344   0.5304   0.9006   1.7207  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)   
(Intercept) -9.3547261  3.5280465  -2.652  0.00801 **
width        0.3067892  0.1819473   1.686  0.09177 . 
weight       0.0008338  0.0006716   1.241  0.21445   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 225.76  on 172  degrees of freedom
Residual deviance: 192.89  on 170  degrees of freedom
AIC: 198.89

Number of Fisher Scoring iterations: 4