Data \((X_i, Y_i), 1 \leq i \leq n\).
Examples:
medical: presence or absence of cancer
financial: bankrupt or solvent
industrial: passes a quality control test or not
For \(0-1\) responses we need to model \[\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\]
Likelihood (treating \(X\) as fixed, i.e. the likelihood is conditional on \(X\)): \[ -\log L(\pi|Y) = \sum_{i=1}^n - Y_i \log\left(\frac{\pi_i}{1-\pi_i}\right) - \log(1 -\pi_i) \]
The function \[ \pi \mapsto \log\left(\frac{\pi}{1-\pi}\right) \overset{\text{def}}{=} \text{logit}(\pi) \] is called the logit (log-odds) transform.
Note \[ \text{logit}\left(\frac{e^{\eta}}{1+e^{\eta}}\right) = \eta, \] where \[ F_{\text{logistic}}(\eta) = \frac{e^{\eta}}{1+e^{\eta}} \] is the logistic distribution function.
So, \[ \text{logit} = F_{\text{logistic}}^{-1} \]
g = function(p) {
return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, g(p), lwd=2, type='l', col='red')

g.inv <- function(x) {
return(exp(x) / (1 + exp(x)))
}
x = seq(g(0.01), g(0.99), length=200)
plot(x, g.inv(x), lwd=2, type='l', col='red')

\[ \eta_i = \text{logit}(\pi_i) \]
\[ - \log L(\eta|Y) = \sum_{i=1}^n - Y_i \eta_i + \log(1 + e^{\eta_i}) \]
\[ - \log L(\beta|y) = -(X\beta)^TY + \sum_{i=1}^n \log(1 + \exp(X_i^T\beta)) \]
In other words, \[ \text{logit}(\pi(x_1,\dots,x_p)) = x^T\beta. \]
Negative log-likelihood is convex in \(\beta\) \(\implies\) easy to add regularization penalties (and solve such problems…)
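As a quick sanity check, the negative log-likelihood above is easy to code directly. The toy data below are made up for illustration; the value at the MLE should agree with `-logLik` from `glm`.

```r
# Negative log-likelihood -(X beta)' Y + sum_i log(1 + exp(X_i' beta)),
# coded directly from the formula above.
neg_loglik = function(beta, X, Y) {
  eta = X %*% beta
  sum(log(1 + exp(eta))) - sum(Y * eta)
}

# Made-up toy data (not separable, so the MLE exists)
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())
neg_loglik(coef(fit), X, Y)  # agrees with -logLik(fit)
```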
It is the default in most software for GLMs because \(\eta_i=X_i^T\beta\) is the canonical link.
How do we know it’s convex?
Negative log-likelihood of saturated model with parameters \(\eta=(\eta_1, \dots, \eta_n)\)
\[ -\log L(\eta|Y) = \sum_{i=1}^n \Lambda_B(\eta_i) -\eta_i Y_i \]
with \(\Lambda_B(\eta)\) the cumulant generating function (CGF) of Bernoulli
\[ \exp(\Lambda(\eta)) = \mathbb{E}_0\left(\exp(\eta^TW) \right) \]
where \(\mathbb{P}_0\) is the law under \(\eta=0\). In our case
\[ \Lambda(\eta) = \sum_{i=1}^n \Lambda_B(\eta_i). \]
Negative log-likelihood is \[ \eta \mapsto -\eta^TY + \Lambda(\eta) \]
Logistic regression imposes linear constraint on \(\eta\): \(\eta \in \text{col}(X)\).
Logistic is “natural” because it is a linear model for natural parameters of Binomial data.
It can appear in other places (c.f. 5.1.5 of Agresti)
Suppose we observe data from \[ \begin{aligned} Y &\sim \text{Bernoulli}(\pi) \\ X \mid Y &\sim \begin{cases} N(\mu_0, \Sigma) & Y = 0\\ N(\mu_1, \Sigma) & Y=1 \end{cases} \end{aligned} \]
Then, logistic model for \(Y|X\) holds.
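A one-dimensional sketch of why: with made-up values of \(\mu_0, \mu_1, \sigma\), the log-odds of \(Y=1\) given \(X=x\) computed from Bayes' rule is exactly linear in \(x\), with slope \((\mu_1-\mu_0)/\sigma^2\).

```r
# Hypothetical parameter values
mu0 = 0; mu1 = 2; sigma = 1.5; p1 = 0.3   # p1 = P(Y = 1)

# log-odds of Y=1 given X=x, via Bayes' rule
log_odds = function(x) {
  log(p1 / (1 - p1)) +
    dnorm(x, mu1, sigma, log = TRUE) - dnorm(x, mu0, sigma, log = TRUE)
}

# The slope is constant in x and equals (mu1 - mu0) / sigma^2:
c(log_odds(1) - log_odds(0), log_odds(5) - log_odds(4), (mu1 - mu0) / sigma^2)
```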
Another reason this model is popular relates to the odds ratio (c.f. 5.1.4 of Agresti)
If the logistic model is correct along with an assumption about sampling, it is possible to estimate parameters of \(Y|X\) distribution in case-control studies where the actual randomness is \(X|Y\).
This is similar to (2.2.6) of Agresti (Lung Cancer and Smoking) which we discussed earlier.
\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+1, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{\beta_j} \]
When incidence is rare, \(P(Y=0)\approx 1\) regardless of the covariates \(X_j\).
In this case, odds ratios are almost ratios of probabilities: \[OR_{X_j} \approx \frac{{\mathbb{P}}(Y=1|\dots, X_j=x_j+1, \dots)}{{\mathbb{P}}(Y=1|\dots, X_j=x_j, \dots)}\]
Hypothetical example: in a lung cancer study with \(X_j\) an indicator of smoking, \(\beta_j = 5\) means smokers are roughly \(e^5 \approx 150\) times more likely to develop lung cancer than non-smokers.
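A numerical illustration with made-up incidence rates: when \(Y=1\) is rare in both groups, the odds ratio and the ratio of probabilities nearly coincide.

```r
odds = function(p) p / (1 - p)

# Hypothetical (rare) incidence rates
p_exposed   = 0.003
p_unexposed = 0.00002

OR = odds(p_exposed) / odds(p_unexposed)  # odds ratio
RR = p_exposed / p_unexposed              # ratio of probabilities
c(OR = OR, RR = RR)                       # both close to 150
```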
Latent-variable viewpoint: let \(T_i \sim F\) be independent of \(X_i\) and set \[ Y_i = \begin{cases} 1 & T_i \leq X_i^T\beta \\ 0 & \text{otherwise.} \end{cases} \]
\[ P(Y_i=1 | X_i) = P(T_i \leq X_i^T\beta|X_i) = F(X_i^T\beta). \]
Example: probit regression uses \(F=\Phi\), the distribution function of \(N(0,1)\).
Setting \[ \eta_i = \text{logit}(F(X_i^T\beta)) \] the likelihood is then \[ -\log L(\beta|Y) = \sum_{i=1}^n -Y_i \cdot \text{logit}(F(X_i^T\beta)) - \log(1 - F(X_i^T\beta)). \]
Parameters no longer interpretable in odds ratios but problem can still be convex (e.g. probit)
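In R, `plogis` is \(F_{\text{logistic}}\) and `pnorm` is \(\Phi\); a probit fit would use `glm(..., family = binomial(link = "probit"))`. A quick look at the two links:

```r
eta = seq(-4, 4, by = 1)
cbind(eta,
      logistic = plogis(eta),  # F_logistic(eta)
      probit   = pnorm(eta))   # Phi(eta)
# Both are increasing distribution functions passing through (0, 0.5);
# the logistic has heavier tails.
```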
Note: \(\text{Var}(Y) = \pi ( 1 - \pi) = E(Y) \cdot ( 1- E(Y))\)
The binary nature forces a relation between mean and variance of \(Y\).
This makes logistic regression a Generalized Linear Model, use glm in R.
## color spine width satell weight y
## 1 medium bad 28.3 8 3050 1
## 2 dark bad 22.5 0 1550 0
## 3 light good 26.0 9 2300 1
## 4 dark bad 24.8 0 2100 0
## 5 dark bad 26.0 4 2600 1
## 6 medium bad 23.8 0 2100 0
##
## Call:
## glm(formula = y ~ width, family = binomial(), data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0281 -1.0458 0.5480 0.9066 1.6942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508 2.6287 -4.698 2.62e-06 ***
## width 0.4972 0.1017 4.887 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 194.45 on 171 degrees of freedom
## AIC: 198.45
##
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table
##
## Model 1: y ~ 1
## Model 2: y ~ width
## Resid. Df Resid. Dev Df Deviance
## 1 172 225.76
## 2 171 194.45 1 31.306
## [1] 31.31000 23.88277
##
## Call:
## glm(formula = y ~ width, data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8614 -0.4495 0.1569 0.3674 0.7061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.76553 0.42136 -4.190 4.46e-05 ***
## width 0.09153 0.01597 5.731 4.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1951497)
##
## Null deviance: 39.780 on 172 degrees of freedom
## Residual deviance: 33.371 on 171 degrees of freedom
## AIC: 212.26
##
## Number of Fisher Scoring iterations: 2
##
## Call:
## lm(formula = y ~ width, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8614 -0.4495 0.1569 0.3674 0.7061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.76553 0.42136 -4.190 4.46e-05 ***
## width 0.09153 0.01597 5.731 4.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4418 on 171 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1562
## F-statistic: 32.85 on 1 and 171 DF, p-value: 4.415e-08
## [1] 0.1951872
\[ Y_i = N_{i1}, \qquad 1 \leq i \leq I \]
\[ Y_i \sim \text{Binomial}(N_{i+}, \pi_i) \]
Under Poisson sampling, same result holds if we condition on \(N_{i+}\).
Model
\[ \text{logit}(\pi_i) = \alpha + \beta_i \]
Like a one-way ANOVA regression model
Need some identifiability constraint (R often sets one \(\beta\) to 0).
Not really using a logistic assumption here – just parameterizing \(\pi_i\) a certain way: model is saturated.
# Table 3.8
absent = c(17066, 14464, 788, 126, 37)
present = c(48, 38, 5, 1, 1)
total = absent + present
scores = c(0, 0.5, 1.5, 4.0, 7.0)
proportion = present / total
alcohol = rbind(absent, present)
chisq.test(alcohol)

## Warning in chisq.test(alcohol): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: alcohol
## X-squared = 12.082, df = 4, p-value = 0.01675
lr_stat = function(data_table) {
chisq_test = chisq.test(data_table)
return(2 * sum(data_table * log(data_table / chisq_test$expected)))
}
lr_stat(alcohol)

## Warning in chisq.test(data_table): Chi-squared approximation may be incorrect
## [1] 6.201998
alcohol_scores = glm(proportion ~ factor(scores), family=binomial(),
weights=total)
summary(alcohol_scores)

##
## Call:
## glm(formula = proportion ~ factor(scores), family = binomial(),
## weights = total)
##
## Deviance Residuals:
## [1] 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.87364 0.14454 -40.637 <2e-16 ***
## factor(scores)0.5 -0.06819 0.21743 -0.314 0.7538
## factor(scores)1.5 0.81358 0.47134 1.726 0.0843 .
## factor(scores)4 1.03736 1.01431 1.023 0.3064
## factor(scores)7 2.26272 1.02368 2.210 0.0271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.2020e+00 on 4 degrees of freedom
## Residual deviance: 1.8163e-13 on 0 degrees of freedom
## AIC: 28.627
##
## Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total),
      glm(proportion ~ factor(scores), family=binomial(), weights=total))

## Analysis of Deviance Table
##
## Model 1: proportion ~ 1
## Model 2: proportion ~ factor(scores)
## Resid. Df Resid. Dev Df Deviance
## 1 4 6.202
## 2 0 0.000 4 6.202
In this case, scores is ordered, so we might fit model \[
\text{logit}(\pi_i) = \alpha + \beta \cdot \text{scores}_i.
\]
Perhaps more power because direction of deviation from null is specified.
##
## Call:
## glm(formula = proportion ~ scores, family = binomial(), weights = total)
##
## Deviance Residuals:
## 1 2 3 4 5
## 0.5921 -0.8801 0.8865 -0.1449 0.1291
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9605 0.1154 -51.637 <2e-16 ***
## scores 0.3166 0.1254 2.523 0.0116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.2020 on 4 degrees of freedom
## Residual deviance: 1.9487 on 3 degrees of freedom
## AIC: 24.576
##
## Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total),
      glm(proportion ~ scores, family=binomial(), weights=total)) # LR test

## Analysis of Deviance Table
##
## Model 1: proportion ~ 1
## Model 2: proportion ~ scores
## Resid. Df Resid. Dev Df Deviance
## 1 4 6.2020
## 2 3 1.9487 1 4.2533
## df.1 df.4
## 1 0.03925033 0.1847017
Instead of sum of squares, logistic regression uses deviance:
\[DEV(\pi| Y) = -2 \log L(\pi| Y) + 2 \log L(\hat{\pi}_S| Y) \]
where \(\pi\) is a candidate vector of probabilities for \(Y\) and \(\hat{\pi}_S\) is the saturated fit (i.e. no constraints on the \(\pi_i\)'s).
\[ \hat{\pi}_{S,i} = \frac{Y_i}{{\tt total}_i} \]
\[DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right) \]
\[ \begin{aligned} DEV(\pi| Y) &= -2 \sum_{i=1}^n \biggl( Y_i \log(\pi_i / \hat{\pi}_{S,i}) + \\ & \qquad ({\tt total}_i -Y_i) \log\bigl((1-\pi_i)/(1-\hat{\pi}_{S,i})\bigr) \biggr) \\ &= -2 \sum_{i=1}^n {\tt total}_i \biggl( \hat{\pi}_{S,i} \log(\pi_i/\hat{\pi}_{S,i}) + \\ & \qquad (1-\hat{\pi}_{S,i}) \log\bigl((1-\pi_i)/(1-\hat{\pi}_{S,i})\bigr) \biggr) \end{aligned} \]
with \(\hat{\pi}_{S,i} = Y_i / {\tt total}_i\) is the MLE under the saturated model.
This is the deviance `glm` computes when fit with `weights=total`. As a function of \(\beta\): \[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i \cdot {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned} \]
\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned} \]
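To check the formula, we can compute the residual deviance of the linear-in-scores model by hand on the alcohol data from above and compare with `glm` (which reported 1.9487):

```r
absent  = c(17066, 14464, 788, 126, 37)
present = c(48, 38, 5, 1, 1)
total   = absent + present
scores  = c(0, 0.5, 1.5, 4.0, 7.0)
proportion = present / total

fit = glm(proportion ~ scores, family = binomial(), weights = total)
pi_hat = fitted(fit)   # model fit
pi_sat = proportion    # saturated fit Y_i / total_i

dev_by_hand = 2 * sum(present * log(pi_sat / pi_hat) +
                      (total - present) * log((1 - pi_sat) / (1 - pi_hat)))
c(dev_by_hand, deviance(fit))  # both about 1.9487
```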
Let \(Y \in \{0,1\}^n, X \in \mathbb{R}^{n \times p}\).
Gradient / Score
\[ \begin{aligned} \nabla(-\log L(\beta|Y)) &= - \left[X^T\left(y - \frac{\exp(X\beta)}{1 + \exp(X\beta)} \right) \right] \\ &= - \left[X^T\left(y - E_{\beta}[Y|X]\right) \right] \end{aligned} \]
\[ \begin{aligned} \nabla^2 (-\log L(\beta|Y)) &= X^T\text{diag}\left(\frac{\exp(X\beta)}{(1 + \exp(X\beta))^2}\right)X \\ &= X^T \text{Var}_{\beta}(Y|X) X. \end{aligned} \]
\[ \begin{aligned} E_{\beta}[Y|X] &= \frac{\exp(X\beta)}{1 + \exp(X\beta)} = \pi_{\beta}(X) \in \mathbb{R}^n \\ \text{Var}_{\beta}[Y|X] &= \text{diag}\left(E_{\beta}[Y|X] (1 - E_{\beta}[Y|X]) \right) \\ &= \text{diag}(\pi_{\beta}(X)(1 - \pi_{\beta}(X)) \\ &= W_{\beta}(X). \end{aligned} \]
\[ \Lambda(\beta) = \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right) \]
Given initial \(\hat{\beta}^{(0)}\)
Until convergence \[ \hat{\beta}^{(i+1)} = \hat{\beta}^{(i)} + (X^T W_{\hat{\beta}^{(i)}}(X)X)^{-1}X^T(Y - \pi_{\hat{\beta}^{(i)}}(X)) \] (the Newton step adds the inverse Hessian times the score of the log-likelihood).
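A minimal sketch of this iteration on made-up data; the result should match `glm`, which runs the same algorithm (IRLS).

```r
newton_logistic = function(X, Y, n_iter = 25) {
  beta = rep(0, ncol(X))
  for (it in 1:n_iter) {
    pi_hat = as.vector(plogis(X %*% beta))        # pi_beta(X)
    W = pi_hat * (1 - pi_hat)                     # diagonal of W_beta(X)
    score = t(X) %*% (Y - pi_hat)                 # X'(Y - pi)
    beta = beta + solve(t(X) %*% (W * X), score)  # Newton step
  }
  as.vector(beta)
}

# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
newton_logistic(X, Y)  # matches coef(glm(Y ~ X - 1, family = binomial()))
```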
Suppose model is true, with true parameter \(\beta^*\). Then, \[ E_{\beta^*}[X^T(Y - \pi_{\beta^*}(X))] = 0 \]
Taylor series \[ \pi_{\hat{\beta}}(X) \approx \pi_{\beta^*}(X) + W_{\beta^*}(X) X(\hat{\beta}-\beta^*) \]
How big is remainder?
Stationarity conditions then read \[ X^T(Y - E_{\beta^*}[Y|X]) \approx X^TW_{\beta^*}(X) X(\hat{\beta}-\beta^*) \]
Or, \[ \begin{aligned} \hat{\beta}-\beta^* &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1}X^T(Y - E_{\beta^*}[Y|X]) \\ &= (X^TW_{\beta^*}(X)X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \end{aligned} \]
Implies \[ \begin{aligned} \text{Var}_{\beta^*}(\hat{\beta}-\beta^*) &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1} \text{Var}_{\beta^*}\left(X^T(Y - \pi_{\beta^*}(X))\right) \left(X^TW_{\beta^*}(X) X\right)^{-1} \\ &\approx \left(X^T W_{\beta^*}(X) X\right)^{-1} \\ \end{aligned} \] (Last line assumes the model is true.)
\[ \hat{\beta}-\beta^* \approx N\left(0, \left(X^TW_{\beta^*}(X) X\right)^{-1}\right) \]
(Plugin estimator of variance appeals to Slutsky, consistency of \(\hat{\beta}\).)
Under the logistic model \[ \begin{aligned} \text{Var}_{\beta^*}\left(X^T(Y-\pi_{\beta^*}(X))\right) & \approx X^TW_{\beta^*}(X)X. \end{aligned} \]
Really just using the standard asymptotic distribution of the MLE:
\[ \hat{\beta} \approx N\left(\beta^*, \left(E_{\beta^*}[\nabla^2 (-\log L(\beta^*|Y))]\right)^{-1} \right) \]
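So plug-in standard errors come from \(\left(X^TW_{\hat\beta}(X)X\right)^{-1}\). On made-up data this reproduces `vcov` from `glm`:

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

pi_hat = fitted(fit)
W = pi_hat * (1 - pi_hat)
V = solve(t(X) %*% (W * X))  # (X' W_betahat X)^{-1}
max(abs(V - vcov(fit)))      # essentially zero
```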
center = coef(width_glm)['width']
SE = sqrt(vcov(width_glm)['width', 'width'])
U = center + SE * qnorm(0.975)
L = center + SE * qnorm(0.025)
data.frame(L, center, U)## L center U
## width 0.2978326 0.4972306 0.6966286
These are slightly different from what R will give you if you ask it for confidence intervals:
confint(width_glm)

## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -17.8100090 -7.4572470
## width 0.3083806 0.7090167
Before appealing to general properties of MLE, we had only really used the assumption the model was true once.
What can we say if the model is not true?
Suppose \((Y_i,X_i) \overset{IID}{\sim} F, 1 \leq i \leq n\).
Define the population parameter \(\beta^*=\beta^*(F)\) by \[ E_F[X(Y - \pi_{\beta^*(F)}(X))] = 0. \]
This is the minimizer of the population objective
\[ \beta \mapsto E_F\left[-(X^T\beta) Y + \log\left(1 + \exp(X^T\beta) \right) \right] \]
\[ \hat{\beta}-\beta^*(F) \approx \left(X^TW_{\beta^*(F)}(X) X\right)^{-1}X^T(Y - \pi_{\beta^*(F)}(X)) \]
\[ \begin{aligned} n^{-1/2} X^T(Y - \pi_{\beta^*(F)}(X)) & \overset{D}{\to} N(0, \Sigma(F)) \\ n^{-1}X^TW_{\beta^*(F)}(X) X &\overset{P}{\to} E_F[XX^T \cdot W_{\beta^*(F)}(X)] \\ & \overset{\text{def}}{=} Q(F) \end{aligned} \]
Implies \[ \hat{\beta}-\beta^*(F) \overset{D}{\to} N(0, n^{-1} Q(F)^{-1}\Sigma(F)Q(F)^{-1}). \]
Variance above has so-called sandwich form.
Applying Slutsky \[ \hat{\beta}-\beta^*(F) \approx N\left(0, (X^TW_{\hat{\beta}}(X) X)^{-1} (n\Sigma(F)) (X^TW_{\hat{\beta}}(X) X)^{-1}\right) \]
We know everything in the approximation above (or can estimate it) except \(\Sigma(F)\)
Covariance \(\Sigma(F)\) can be estimated, say, by the bootstrap or by the sandwich estimator (see D. Freedman's approachable description).
Alternatively, entire logistic regression could be bootstrapped.
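A sketch of the sandwich estimator itself (the `sandwich` package in R implements this with finite-sample corrections). The "meat" estimates \(\Sigma(F)\) with the empirical second moment of the score contributions \(X_i(Y_i - \pi_{\hat\beta}(X_i))\); data below are made up.

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

res = Y - fitted(fit)
W = fitted(fit) * (1 - fitted(fit))
bread = solve(t(X) %*% (W * X))  # (X' W X)^{-1}
meat  = t(X) %*% (res^2 * X)     # estimates n * Sigma(F)
sandwich_vcov = bread %*% meat %*% bread
sandwich_vcov                    # compare with vcov(fit)
```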
What about comparing full and reduced model?
For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).
In least squares regression, we use
\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F} \]
Analogously, in logistic regression \[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F} \]
These tests assume model \({\cal M}_F\) is correctly specified. Otherwise, Wald-type test with sandwich or other covariance estimate would have to be used.
Deviance tests do not agree numerically with Wald tests even when model is correct. Both are often used. Wald can fail!
## Analysis of Deviance Table
##
## Model 1: y ~ width
## Model 2: y ~ width + weight
## Resid. Df Resid. Dev Df Deviance
## 1 171 194.45
## 2 170 192.89 1 1.5608
We should compare this difference in deviance with a \(\chi^2_1\) random variable.
## [1] 0.2116652
Let’s compare this with the Wald test:
##
## Call:
## glm(formula = y ~ width + weight, family = binomial(), data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1127 -1.0344 0.5304 0.9006 1.7207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.3547261 3.5280465 -2.652 0.00801 **
## width 0.3067892 0.1819473 1.686 0.09177 .
## weight 0.0008338 0.0006716 1.241 0.21445
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 192.89 on 170 degrees of freedom
## AIC: 198.89
##
## Number of Fisher Scoring iterations: 4
library(glmbb) # for `crabs` data
data(crabs)
width_glm = glm(y ~ width, family=binomial(), data=crabs)
library(dplyr)  # for the group_by / summarize pipeline below

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
crabs$groups = cut(crabs$width, breaks=c(0, seq(23.25, 29.25, by=1), max(crabs$width)))
crabs$fits = fitted(width_glm)
results = crabs %>%
group_by(groups) %>%
summarize(N=n(), numY = sum(y), numN = N - numY,
            fittedY=sum(fits), fittedN=N-fittedY)

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 6
## groups N numY numN fittedY fittedN
## <fct> <int> <int> <int> <dbl> <dbl>
## 1 (0,23.2] 14 5 9 3.64 10.4
## 2 (23.2,24.2] 14 4 10 5.31 8.69
## 3 (24.2,25.2] 28 17 11 13.8 14.2
## 4 (25.2,26.2] 39 21 18 24.2 14.8
## 5 (26.2,27.2] 22 15 7 15.9 6.06
## 6 (27.2,28.2] 24 20 4 19.4 4.62
## 7 (28.2,29.2] 18 15 3 15.7 2.35
## 8 (29.2,33.5] 14 14 0 13.1 0.918
G2 = 2 * (sum(results$numY * log(results$numY / results$fittedY)) +
          sum(results$numN * log((results$numN + 1.e-10) / results$fittedN)))
# the + 1.e-10 avoids 0 * log(0) = NaN in the group with numN = 0
c(G2, pchisq(G2, 8 - 2, lower.tail=FALSE)) # why df=6?

## [1] 6.1793623 0.4034009
X2 = (sum((results$numY - results$fittedY)^2 / results$fittedY) +
sum((results$numN - results$fittedN)^2 / results$fittedN))
c(X2, pchisq(X2, 8 - 2, lower.tail=FALSE))

## [1] 5.320099 0.503461
Alternative “partition”: partition cases based on \(\hat{\pi}\).
Rank \(\hat{\pi}\) into \(g\) groups.
Compute Pearson’s \(X^2\) as above.
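A hand-rolled sketch of this statistic on simulated data (the `ResourceSelection` package's `hoslem.test`, used below, does the same with more care; grouping conventions vary).

```r
set.seed(1)
x = rnorm(200)
y = rbinom(200, 1, plogis(-0.5 + x))
fit = glm(y ~ x, family = binomial())
p = fitted(fit)

g = 10  # number of groups, formed from deciles of fitted probabilities
grp = cut(p, quantile(p, seq(0, 1, length = g + 1)), include.lowest = TRUE)
obs   = tapply(y, grp, sum)     # observed successes per group
expct = tapply(p, grp, sum)     # expected successes per group
n_g   = tapply(y, grp, length)  # group sizes

X2_HL = sum((obs - expct)^2 / (expct * (1 - expct / n_g)))
c(X2_HL, pchisq(X2_HL, g - 2, lower.tail = FALSE))
```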
library(ResourceSelection)
hoslem.test(crabs$y, crabs$fits)

## ResourceSelection 0.3-5 2019-07-22
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: crabs$y, crabs$fits
## X-squared = 4.3855, df = 8, p-value = 0.8208
If we have grouped data (i.e. a fixed number of possible covariate values) we can use a \(G^2\) or \(X^2\) test as a goodness of fit.
Without groups, we might make groups by partitioning the feature space appropriately.
Pearson residual \[ e_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}} \]
Used to form Pearson’s \(X^2\) (we’ll see general form for GLMs later) \[ X^2 = \sum_{i=1}^n e_i^2. \]
As in OLS regression, residuals have smaller variance than the true errors. (What are the true errors?)
In OLS regression, this is remedied by standardizing \[ r_i = \frac{Y_i - \hat{Y}_i} {\sqrt{R_{ii}}} \] where \(R=I-H\) is the residual forming matrix and \[ H = X(X^TX)^{-1}X^T \] is the hat matrix.
What is analog of \(H\) in logistic regression?
Recall \[ \begin{aligned} \hat{\beta}-\beta^* &\approx (X^TW_{\beta^*}X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \\ &= (X^TW_{\beta^*}X)^{-1}X^TW^{1/2}_{\beta^*}R_{\beta^*}(X,Y) \\ \end{aligned} \]
Above, \[ R_{\beta^*}(X,Y) = W_{\beta^*}^{-1/2}(Y-\pi_{\beta^*}(X)) \] has independent entries, each with variance 1 (assuming model is correctly specified).
The entries of \(R_{\beta^*}(X,Y)\) are essentially the (population) Pearson residuals…
We see \(\hat{\beta}\) is essentially the OLS estimator with design \(W_{\beta^*}^{1/2}(X)X\) and errors \(R_{\beta^*}(X,Y)\). (This is basically the IRLS viewpoint.)
In this OLS model, the appropriate hat matrix is
\[ H_{\beta^*} = W_{\beta^*}^{1/2}X(X^TW_{\beta^*}X)^{-1}X^TW_{\beta^*} ^{1/2}, \]
approximated by
\[ H_{\hat{\beta}} = W_{\hat{\beta}}^{1/2}X(X^TW_{\hat{\beta}}X)^{-1}X^TW_{\hat{\beta}}^{1/2} \]
\[ r_i = \frac{e_i}{ \sqrt{1 - H_{\hat{\beta},ii}}}. \]
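On made-up data, the leverages \(H_{\hat{\beta},ii}\) computed this way agree with `hatvalues` applied to the `glm` fit:

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

W = fitted(fit) * (1 - fitted(fit))
WX = sqrt(W) * X                          # W^{1/2} X
H = WX %*% solve(t(WX) %*% WX) %*% t(WX)  # hat matrix at beta-hat
max(abs(diag(H) - hatvalues(fit)))        # essentially zero
```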
Residuals and leverage can be used in similar ways as in OLS:
Detect outliers. (Are they normal? Probably not. Grouped data has a chance.)
Check regression specification.
library(car)

## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
outlierTest(width_glm)

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 94 -2.05045 0.040321 NA
pearson_resid = (crabs$y - crabs$fits) / sqrt(crabs$fits * (1 - crabs$fits))
plot(resid(width_glm, type='pearson'), pearson_resid, pch=23, col='red')

dev_resid = sign(crabs$y - crabs$fits) *
    sqrt(- 2 * (crabs$y * log(crabs$fits) + (1 - crabs$y) * log(1 - crabs$fits)))
plot(resid(width_glm), dev_resid, pch=23, col='red')

In OLS regression, a common summary is \[ R^2 = 1 - \frac{SSE}{SST} = \frac{SST-SSE}{SST} \]
Analog for logistic regression is McFadden’s \(R^2\) \[ \frac{DEV(M_0) - DEV(M)}{DEV(M_0)} \]
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 172 225.76
## width 1 31.306 171 194.45
## [1] 0.1386871 0.1386697
## dfb.1_ dfb.wdth dffit cov.r cook.d hat
## 1 -0.047610662 0.04996645 0.06040203 1.020937 0.001132128 0.012344511
## 2 -0.104054950 0.10079960 -0.11366303 1.032553 0.004233338 0.025717205
## 3 -0.004535225 0.00948021 0.07499977 1.009632 0.002017305 0.007084208
## 4 -0.061618758 0.05535952 -0.11116748 1.007620 0.005034932 0.010062537
## 5 -0.004535225 0.00948021 0.07499977 1.009632 0.002017305 0.007084208
## 6 -0.094872323 0.08993360 -0.11878283 1.018836 0.005119353 0.016600380
Suppose classification rule is \[ \hat{y}(x) = \begin{cases} 1 & \hat{\pi}(x) \geq 0.5 \\ 0 & \text{otherwise.} \end{cases} \]
Produces a 2x2 table based on actual labels and predicted labels.
##
## 0 1
## FALSE 27 16
## TRUE 35 95
Standard measures of quality of classifier based on the confusion matrix.
Example using our classifier above: \[ \begin{aligned} TPR &= 95 / (95 + 16) \\ FPR &= 35 / (35 + 27) \end{aligned} \]
Threshold of 0.5 is somewhat arbitrary.
All thresholds together can be summarized in ROC curve
library(ROCR)
crabs_perf_data = prediction(crabs$fits, crabs$y)
crabs_perf = performance(crabs_perf_data, 'tpr', 'fpr')

## [[1]]
## [1] 0.7424441
\[ \begin{aligned} AIC(M) &= -2 \log L(M) + 2 \cdot p(M)\\ &= DEV(M) + 2 \cdot p(M) + g(Y) \\ \end{aligned} \] where \(g(Y)\) depends only on the data, not the model, so it does not affect model comparisons.
or \(BIC(M)\) (2 replaced by \(\log n\)).
Search
Exhaustive: bestglm
Stepwise: step
library(bestglm)

## Loading required package: leaps
crabs_orig = within(crabs, rm(groups, fits, satell, predicted.label))
crabs_orig$color = as.integer(crabs_orig$color)
crabs_orig$spine = as.integer(crabs_orig$spine)
AIC_model = bestglm(crabs_orig, family=binomial(), IC="AIC")

## Morgan-Tatar search since family is non-gaussian.
## AIC
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
crabs_factor = crabs_orig # makes a copy
crabs_factor$color = factor(crabs_factor$color)
crabs_factor$spine = factor(crabs_factor$spine)
factorAIC_model = bestglm(crabs_factor, family=binomial(), IC="AIC")

## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
## AIC
## Best Model:
## Df Sum Sq Mean Sq F value Pr(>F)
## color 3 3.24 1.079 5.685 0.000988 ***
## width 1 4.66 4.659 24.548 1.76e-06 ***
## Residuals 168 31.88 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
## AIC
## Best Model:
## Df Sum Sq Mean Sq F value Pr(>F)
## color 3 3.24 1.079 5.685 0.000988 ***
## width 1 4.66 4.659 24.548 1.76e-06 ***
## Residuals 168 31.88 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Cross-validation not available when there are categorical variables with more than 2 levels!"
## Morgan-Tatar search since family is non-gaussian.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## CVd(d = 132, REP = 1000)
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
full_glm = glm(y ~ ., family=binomial(), data=crabs_factor)
null_glm = glm(y ~ 1, family=binomial(), data=crabs_factor)
step(null_glm, list(upper=full_glm), direction='forward', k=log(nrow(crabs_factor))) # for BIC

## Start: AIC=230.91
## y ~ 1
##
## Df Deviance AIC
## + width 1 194.45 204.76
## + weight 1 195.74 206.04
## <none> 225.76 230.91
## + color 3 212.06 232.67
## + spine 2 223.23 238.69
##
## Step: AIC=204.76
## y ~ width
##
## Df Deviance AIC
## <none> 194.45 204.76
## + weight 1 192.89 208.35
## + color 3 187.46 213.22
## + spine 2 194.43 215.04
##
## Call: glm(formula = y ~ width, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width
## -12.3508 0.4972
##
## Degrees of Freedom: 172 Total (i.e. Null); 171 Residual
## Null Deviance: 225.8
## Residual Deviance: 194.5 AIC: 198.5
##
## Call: glm(formula = y ~ width + color, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width color2 color3 color4
## -11.6090 0.4680 -1.1061 0.2238 0.2962
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5
##
## Call: glm(formula = y ~ width + color, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width color2 color3 color4
## -11.6090 0.4680 -1.1061 0.2238 0.2962
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5
##
## Call: glm(formula = y ~ color + width, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) color2 color3 color4 width
## -11.6090 -1.1061 0.2238 0.2962 0.4680
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5