Data \((X_i, Y_i), 1 \leq i \leq n\).
Examples:
medical: presence or absence of cancer
financial: bankrupt or solvent
industrial: passes a quality control test or not
For \(0-1\) responses we need to model \[\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\]
Likelihood (treating \(X\) as fixed, i.e. the likelihood is conditional on \(X\)): \[ -\log L(\pi|Y) = \sum_{i=1}^n - Y_i \log\left(\frac{\pi_i}{1-\pi_i}\right) - \log(1 -\pi_i) \]
The function \[ \pi \mapsto \log\left(\frac{\pi}{1-\pi}\right) \overset{\text{def}}{=} \text{logit}(\pi) \] is called the logit (log-odds) transform.
Note \[ \text{logit}\left(\frac{e^{\eta}}{1+e^{\eta}}\right) = \eta, \] where \[ F_{\text{logistic}}(\eta) = \frac{e^{\eta}}{1+e^{\eta}} \] is the logistic distribution function.
So, \[ \text{logit} = F_{\text{logistic}}^{-1} \]
g = function(p) {
return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, g(p), lwd=2, type='l', col='red')

g.inv <- function(x) {
return(exp(x) / (1 + exp(x)))
}
x = seq(g(0.01), g(0.99), length=200)
plot(x, g.inv(x), lwd=2, type='l', col='red')

\[ \eta_i = \text{logit}(\pi_i) \]
\[ - \log L(\eta|Y) = \sum_{i=1}^n - Y_i \eta_i + \log(1 + e^{\eta_i}) \]
\[ - \log L(\beta|y) = -(X\beta)^TY + \sum_{i=1}^n \log(1 + \exp(X_i^T\beta)) \]
In other words, \[ \text{logit}(\pi(x_1,\dots,x_p)) = x^T\beta. \]
Negative log-likelihood is convex in \(\beta\) \(\implies\) easy to add regularization penalties (and solve such problems…)
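As a quick sanity check, the negative log-likelihood above is easy to code directly. The toy data below are made up for illustration; the value at the MLE should agree with `-logLik` from `glm`.

```r
# Negative log-likelihood -(X beta)' Y + sum_i log(1 + exp(X_i' beta)),
# coded directly from the formula above.
neg_loglik = function(beta, X, Y) {
  eta = X %*% beta
  sum(log(1 + exp(eta))) - sum(Y * eta)
}

# Made-up toy data (not separable, so the MLE exists)
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())
neg_loglik(coef(fit), X, Y)  # agrees with -logLik(fit)
```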
It is the default in most software for GLMs because \(\eta_i=X_i^T\beta\) is the canonical link.
How do we know it’s convex?
Negative log-likelihood of saturated model with parameters \(\eta=(\eta_1, \dots, \eta_n)\)
\[ -\log L(\eta|Y) = \sum_{i=1}^n \Lambda_B(\eta_i) -\eta_i Y_i \]
with \(\Lambda_B(\eta)\) the cumulant generating function (CGF) of Bernoulli
\[ \exp(\Lambda(\eta)) = \mathbb{E}_0\left(\exp(\eta^TW) \right) \]
where \(\mathbb{P}_0\) is the law under \(\eta=0\). In our case
\[ \Lambda(\eta) = \sum_{i=1}^n \Lambda_B(\eta_i). \]
Negative log-likelihood is \[ \eta \mapsto -\eta^TY + \Lambda(\eta) \]
Logistic regression imposes linear constraint on \(\eta\): \(\eta \in \text{col}(X)\).
Logistic is “natural” because it is a linear model for natural parameters of Binomial data.
It can appear in other places (c.f. 5.1.5 of Agresti)
Suppose we observe data from \[ \begin{aligned} Y &\sim \text{Bernoulli}(\pi) \\ X \mid Y &\sim \begin{cases} N(\mu_0, \Sigma) & Y = 0\\ N(\mu_1, \Sigma) & Y=1 \end{cases} \end{aligned} \]
Then, logistic model for \(Y|X\) holds.
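A one-dimensional sketch of why: with made-up values of \(\mu_0, \mu_1, \sigma\), the log-odds of \(Y=1\) given \(X=x\) computed from Bayes' rule is exactly linear in \(x\), with slope \((\mu_1-\mu_0)/\sigma^2\).

```r
# Hypothetical parameter values
mu0 = 0; mu1 = 2; sigma = 1.5; p1 = 0.3   # p1 = P(Y = 1)

# log-odds of Y=1 given X=x, via Bayes' rule
log_odds = function(x) {
  log(p1 / (1 - p1)) +
    dnorm(x, mu1, sigma, log = TRUE) - dnorm(x, mu0, sigma, log = TRUE)
}

# The slope is constant in x and equals (mu1 - mu0) / sigma^2:
c(log_odds(1) - log_odds(0), log_odds(5) - log_odds(4), (mu1 - mu0) / sigma^2)
```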
Another reason this model is popular relates to the odds ratio (c.f. 5.1.4 of Agresti)
If the logistic model is correct along with an assumption about sampling, it is possible to estimate parameters of \(Y|X\) distribution in case-control studies where the actual randomness is \(X|Y\).
This is similar to (2.2.6) of Agresti (Lung Cancer and Smoking) which we discussed earlier.
\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+1, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{\beta_j} \]
When incidence is rare, \(P(Y=0)\approx 1\) regardless of the covariates \(X_j\).
In this case, odds ratios are almost ratios of probabilities: \[OR_{X_j} \approx \frac{{\mathbb{P}}(Y=1|\dots, X_j=x_j+1, \dots)}{{\mathbb{P}}(Y=1|\dots, X_j=x_j, \dots)}\]
Hypothetical example: in a lung cancer study with \(X_j\) an indicator of smoking, \(\beta_j = 5\) means smokers are roughly \(e^5 \approx 150\) times more likely to develop lung cancer than non-smokers.
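A numerical illustration with made-up incidence rates: when \(Y=1\) is rare in both groups, the odds ratio and the ratio of probabilities nearly coincide.

```r
odds = function(p) p / (1 - p)

# Hypothetical (rare) incidence rates
p_exposed   = 0.003
p_unexposed = 0.00002

OR = odds(p_exposed) / odds(p_unexposed)  # odds ratio
RR = p_exposed / p_unexposed              # ratio of probabilities
c(OR = OR, RR = RR)                       # both close to 150
```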
Latent-variable viewpoint: let \(T_i \sim F\) be independent of \(X_i\) and set \[ Y_i = \begin{cases} 1 & T_i \leq X_i^T\beta \\ 0 & \text{otherwise.} \end{cases} \]
\[ P(Y_i=1 | X_i) = P(T_i \leq X_i^T\beta|X_i) = F(X_i^T\beta). \]
Example: probit regression uses \(F=\Phi\), the distribution function of \(N(0,1)\).
Setting \[ \eta_i = \text{logit}(F(X_i^T\beta)) \] the likelihood is then \[ -\log L(\beta|Y) = \sum_{i=1}^n -Y_i \cdot \text{logit}(F(X_i^T\beta)) - \log(1 - F(X_i^T\beta)). \]
Parameters no longer interpretable in odds ratios but problem can still be convex (e.g. probit)
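In R, `plogis` is \(F_{\text{logistic}}\) and `pnorm` is \(\Phi\); a probit fit would use `glm(..., family = binomial(link = "probit"))`. A quick look at the two links:

```r
eta = seq(-4, 4, by = 1)
cbind(eta,
      logistic = plogis(eta),  # F_logistic(eta)
      probit   = pnorm(eta))   # Phi(eta)
# Both are increasing distribution functions passing through (0, 0.5);
# the logistic has heavier tails.
```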
Note: \(\text{Var}(Y) = \pi ( 1 - \pi) = E(Y) \cdot ( 1- E(Y))\)
The binary nature forces a relation between mean and variance of \(Y\).
This makes logistic regression a Generalized Linear Model, use glm in R.
## color spine width satell weight y
## 1 medium bad 28.3 8 3050 1
## 2 dark bad 22.5 0 1550 0
## 3 light good 26.0 9 2300 1
## 4 dark bad 24.8 0 2100 0
## 5 dark bad 26.0 4 2600 1
## 6 medium bad 23.8 0 2100 0
##
## Call:
## glm(formula = y ~ width, family = binomial(), data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0281 -1.0458 0.5480 0.9066 1.6942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508 2.6287 -4.698 2.62e-06 ***
## width 0.4972 0.1017 4.887 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 194.45 on 171 degrees of freedom
## AIC: 198.45
##
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table
##
## Model 1: y ~ 1
## Model 2: y ~ width
## Resid. Df Resid. Dev Df Deviance
## 1 172 225.76
## 2 171 194.45 1 31.306
## [1] 31.31000 23.88277
##
## Call:
## glm(formula = y ~ width, data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8614 -0.4495 0.1569 0.3674 0.7061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.76553 0.42136 -4.190 4.46e-05 ***
## width 0.09153 0.01597 5.731 4.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1951497)
##
## Null deviance: 39.780 on 172 degrees of freedom
## Residual deviance: 33.371 on 171 degrees of freedom
## AIC: 212.26
##
## Number of Fisher Scoring iterations: 2
##
## Call:
## lm(formula = y ~ width, data = crabs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8614 -0.4495 0.1569 0.3674 0.7061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.76553 0.42136 -4.190 4.46e-05 ***
## width 0.09153 0.01597 5.731 4.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4418 on 171 degrees of freedom
## Multiple R-squared: 0.1611, Adjusted R-squared: 0.1562
## F-statistic: 32.85 on 1 and 171 DF, p-value: 4.415e-08
## [1] 0.1951872
\[ Y_i = N_{i1}, \qquad 1 \leq i \leq I \]
\[ Y_i \sim \text{Binomial}(N_{i+}, \pi_i) \]
Under Poisson sampling, same result holds if we condition on \(N_{i+}\).
Model
\[ \text{logit}(\pi_i) = \alpha + \beta_i \]
Like a one-way ANOVA regression model
Need some identifiability constraint (R often sets one \(\beta\) to 0).
Not really using a logistic assumption here – just parameterizing \(\pi_i\) a certain way: model is saturated.
# Table 3.8
absent = c(17066, 14464, 788, 126, 37)
present = c(48, 38, 5, 1, 1)
total = absent + present
scores = c(0, 0.5, 1.5, 4.0, 7.0)
proportion = present / total
alcohol = rbind(absent, present)
chisq.test(alcohol)

## Warning in chisq.test(alcohol): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: alcohol
## X-squared = 12.082, df = 4, p-value = 0.01675
lr_stat = function(data_table) {
chisq_test = chisq.test(data_table)
return(2 * sum(data_table * log(data_table / chisq_test$expected)))
}
lr_stat(alcohol)

## Warning in chisq.test(data_table): Chi-squared approximation may be incorrect
## [1] 6.201998
alcohol_scores = glm(proportion ~ factor(scores), family=binomial(),
weights=total)
summary(alcohol_scores)

##
## Call:
## glm(formula = proportion ~ factor(scores), family = binomial(),
## weights = total)
##
## Deviance Residuals:
## [1] 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.87364 0.14454 -40.637 <2e-16 ***
## factor(scores)0.5 -0.06819 0.21743 -0.314 0.7538
## factor(scores)1.5 0.81358 0.47134 1.726 0.0843 .
## factor(scores)4 1.03736 1.01431 1.023 0.3064
## factor(scores)7 2.26272 1.02368 2.210 0.0271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.2020e+00 on 4 degrees of freedom
## Residual deviance: 1.8163e-13 on 0 degrees of freedom
## AIC: 28.627
##
## Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total),
      glm(proportion ~ factor(scores), family=binomial(), weights=total))

## Analysis of Deviance Table
##
## Model 1: proportion ~ 1
## Model 2: proportion ~ factor(scores)
## Resid. Df Resid. Dev Df Deviance
## 1 4 6.202
## 2 0 0.000 4 6.202
In this case, scores is ordered, so we might fit model \[
\text{logit}(\pi_i) = \alpha + \beta \cdot \text{scores}_i.
\]
Perhaps more power because direction of deviation from null is specified.
##
## Call:
## glm(formula = proportion ~ scores, family = binomial(), weights = total)
##
## Deviance Residuals:
## 1 2 3 4 5
## 0.5921 -0.8801 0.8865 -0.1449 0.1291
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9605 0.1154 -51.637 <2e-16 ***
## scores 0.3166 0.1254 2.523 0.0116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.2020 on 4 degrees of freedom
## Residual deviance: 1.9487 on 3 degrees of freedom
## AIC: 24.576
##
## Number of Fisher Scoring iterations: 4
anova(glm(proportion ~ 1, family=binomial(), weights=total),
      glm(proportion ~ scores, family=binomial(), weights=total)) # LR test

## Analysis of Deviance Table
##
## Model 1: proportion ~ 1
## Model 2: proportion ~ scores
## Resid. Df Resid. Dev Df Deviance
## 1 4 6.2020
## 2 3 1.9487 1 4.2533
## df.1 df.4
## 1 0.03925033 0.1847017
Instead of sum of squares, logistic regression uses deviance:
\[DEV(\pi| Y) = -2 \log L(\pi| Y) + 2 \log L(\hat{\pi}_S| Y) \]
where \(\pi\) is a candidate vector of probabilities for \(Y\) and \(\hat{\pi}_S\) is the saturated fit (i.e. no constraints on the \(\pi_i\)'s).
\[ \hat{\pi}_{S,i} = \frac{Y_i}{{\tt total}_i} \]
\[DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right) \]
\[ \begin{aligned} DEV(\pi| Y) &= -2 \sum_{i=1}^n \biggl( Y_i \log(\pi_i / \hat{\pi}_{S,i}) + \\ & \qquad ({\tt total}_i -Y_i) \log\bigl((1-\pi_i)/(1-\hat{\pi}_{S,i})\bigr) \biggr) \\ &= -2 \sum_{i=1}^n {\tt total}_i \biggl( \hat{\pi}_{S,i} \log(\pi_i/\hat{\pi}_{S,i}) + \\ & \qquad (1-\hat{\pi}_{S,i}) \log\bigl((1-\pi_i)/(1-\hat{\pi}_{S,i})\bigr) \biggr) \end{aligned} \]
with \(\hat{\pi}_{S,i} = Y_i / {\tt total}_i\) is the MLE under the saturated model.
This is the deviance `glm` computes when fit with `weights=total`. As a function of \(\beta\): \[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i \cdot {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned} \]
\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned} \]
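To check the formula, we can compute the residual deviance of the linear-in-scores model by hand on the alcohol data from above and compare with `glm` (which reported 1.9487):

```r
absent  = c(17066, 14464, 788, 126, 37)
present = c(48, 38, 5, 1, 1)
total   = absent + present
scores  = c(0, 0.5, 1.5, 4.0, 7.0)
proportion = present / total

fit = glm(proportion ~ scores, family = binomial(), weights = total)
pi_hat = fitted(fit)   # model fit
pi_sat = proportion    # saturated fit Y_i / total_i

dev_by_hand = 2 * sum(present * log(pi_sat / pi_hat) +
                      (total - present) * log((1 - pi_sat) / (1 - pi_hat)))
c(dev_by_hand, deviance(fit))  # both about 1.9487
```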
Let \(Y \in \{0,1\}^n, X \in \mathbb{R}^{n \times p}\).
Gradient / Score
\[ \begin{aligned} \nabla(-\log L(\beta|Y)) &= - \left[X^T\left(y - \frac{\exp(X\beta)}{1 + \exp(X\beta)} \right) \right] \\ &= - \left[X^T\left(y - E_{\beta}[Y|X]\right) \right] \end{aligned} \]
\[ \begin{aligned} \nabla^2 (-\log L(\beta|Y)) &= X^T\text{diag}\left(\frac{\exp(X\beta)}{(1 + \exp(X\beta))^2}\right)X \\ &= X^T \text{Var}_{\beta}(Y|X) X. \end{aligned} \]
\[ \begin{aligned} E_{\beta}[Y|X] &= \frac{\exp(X\beta)}{1 + \exp(X\beta)} = \pi_{\beta}(X) \in \mathbb{R}^n \\ \text{Var}_{\beta}[Y|X] &= \text{diag}\left(E_{\beta}[Y|X] (1 - E_{\beta}[Y|X]) \right) \\ &= \text{diag}(\pi_{\beta}(X)(1 - \pi_{\beta}(X)) \\ &= W_{\beta}(X). \end{aligned} \]
\[ \Lambda(\beta) = \sum_{i=1}^n\log \left(1 + \exp \left(\sum_{j=1}^p X_{ij} \beta_j\right) \right) \]
Given initial \(\hat{\beta}^{(0)}\)
Until convergence \[ \hat{\beta}^{(i+1)} = \hat{\beta}^{(i)} + (X^T W_{\hat{\beta}^{(i)}}(X)X)^{-1}X^T(Y - \pi_{\hat{\beta}^{(i)}}(X)) \] (the Newton step adds the inverse Hessian times the score of the log-likelihood).
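A minimal sketch of this iteration on made-up data; the result should match `glm`, which runs the same algorithm (IRLS).

```r
newton_logistic = function(X, Y, n_iter = 25) {
  beta = rep(0, ncol(X))
  for (it in 1:n_iter) {
    pi_hat = as.vector(plogis(X %*% beta))        # pi_beta(X)
    W = pi_hat * (1 - pi_hat)                     # diagonal of W_beta(X)
    score = t(X) %*% (Y - pi_hat)                 # X'(Y - pi)
    beta = beta + solve(t(X) %*% (W * X), score)  # Newton step
  }
  as.vector(beta)
}

# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
newton_logistic(X, Y)  # matches coef(glm(Y ~ X - 1, family = binomial()))
```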
Suppose model is true, with true parameter \(\beta^*\). Then, \[ E_{\beta^*}[X^T(Y - \pi_{\beta^*}(X))] = 0 \]
Taylor series \[ \pi_{\hat{\beta}}(X) \approx \pi_{\beta^*}(X) + W_{\beta^*}(X) X(\hat{\beta}-\beta^*) \]
How big is remainder?
Stationarity conditions then read \[ X^T(Y - E_{\beta^*}[Y|X]) \approx X^TW_{\beta^*}(X) X(\hat{\beta}-\beta^*) \]
Or, \[ \begin{aligned} \hat{\beta}-\beta^* &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1}X^T(Y - E_{\beta^*}[Y|X]) \\ &= (X^TW_{\beta^*}(X)X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \end{aligned} \]
Implies \[ \begin{aligned} \text{Var}_{\beta^*}(\hat{\beta}-\beta^*) &\approx \left(X^TW_{\beta^*}(X) X\right)^{-1} \text{Var}_{\beta^*}\left(X^T(Y - \pi_{\beta^*}(X))\right) \left(X^TW_{\beta^*}(X) X\right)^{-1} \\ &\approx \left(X^T W_{\beta^*}(X) X\right)^{-1} \\ \end{aligned} \] (Last line assumes the model is true.)
\[ \hat{\beta}-\beta^* \approx N\left(0, \left(X^TW_{\beta^*}(X) X\right)^{-1}\right) \]
(Plugin estimator of variance appeals to Slutsky, consistency of \(\hat{\beta}\).)
Under the logistic model \[ \begin{aligned} \text{Var}_{\beta^*}\left(X^T(Y-\pi_{\beta^*}(X))\right) & \approx X^TW_{\beta^*}(X)X. \end{aligned} \]
Really just using the standard asymptotic distribution of the MLE:
\[ \hat{\beta} \approx N\left(\beta^*, \left(E_{\beta^*}[\nabla^2 (-\log L(\beta^*|Y))]\right)^{-1} \right) \]
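So plug-in standard errors come from \(\left(X^TW_{\hat\beta}(X)X\right)^{-1}\). On made-up data this reproduces `vcov` from `glm`:

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

pi_hat = fitted(fit)
W = pi_hat * (1 - pi_hat)
V = solve(t(X) %*% (W * X))  # (X' W_betahat X)^{-1}
max(abs(V - vcov(fit)))      # essentially zero
```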
center = coef(width_glm)['width']
SE = sqrt(vcov(width_glm)['width', 'width'])
U = center + SE * qnorm(0.975)
L = center + SE * qnorm(0.025)
data.frame(L, center, U)## L center U
## width 0.2978326 0.4972306 0.6966286
These are slightly different from what R will give you if you ask it for confidence intervals:
confint(width_glm)

## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -17.8100090 -7.4572470
## width 0.3083806 0.7090167
Before appealing to general properties of MLE, we had only really used the assumption the model was true once.
What can we say if the model is not true?
Suppose \((Y_i,X_i) \overset{IID}{\sim} F, 1 \leq i \leq n\).
Define the population parameter \(\beta^*=\beta^*(F)\) by \[ E_F[X(Y - \pi_{\beta^*(F)}(X))] = 0. \]
This is the minimizer of the population objective
\[ \beta \mapsto E_F\left[-(X^T\beta) Y + \log\left(1 + \exp(X^T\beta) \right) \right] \]
\[ \hat{\beta}-\beta^*(F) \approx \left(X^TW_{\beta^*(F)}(X) X\right)^{-1}X^T(Y - \pi_{\beta^*(F)}(X)) \]
\[ \begin{aligned} n^{-1/2} X^T(Y - \pi_{\beta^*(F)}(X)) & \overset{D}{\to} N(0, \Sigma(F)) \\ n^{-1}X^TW_{\beta^*(F)}(X) X &\overset{P}{\to} E_F[XX^T \cdot W_{\beta^*(F)}(X)] \\ & \overset{\text{def}}{=} Q(F) \end{aligned} \]
Implies \[ \hat{\beta}-\beta^*(F) \overset{D}{\to} N(0, n^{-1} Q(F)^{-1}\Sigma(F)Q(F)^{-1}). \]
Variance above has so-called sandwich form.
Applying Slutsky \[ \hat{\beta}-\beta^*(F) \approx N\left(0, (X^TW_{\hat{\beta}}(X) X)^{-1} (n\Sigma(F)) (X^TW_{\hat{\beta}}(X) X)^{-1}\right) \]
We know everything in the approximation above (or can estimate it) except \(\Sigma(F)\)
Covariance \(\Sigma(F)\) can be estimated, say, by the bootstrap or by the sandwich estimator (see D. Freedman's approachable description).
Alternatively, entire logistic regression could be bootstrapped.
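A sketch of the sandwich estimator itself (the `sandwich` package in R implements this with finite-sample corrections). The "meat" estimates \(\Sigma(F)\) with the empirical second moment of the score contributions \(X_i(Y_i - \pi_{\hat\beta}(X_i))\); data below are made up.

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

res = Y - fitted(fit)
W = fitted(fit) * (1 - fitted(fit))
bread = solve(t(X) %*% (W * X))  # (X' W X)^{-1}
meat  = t(X) %*% (res^2 * X)     # estimates n * Sigma(F)
sandwich_vcov = bread %*% meat %*% bread
sandwich_vcov                    # compare with vcov(fit)
```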
What about comparing full and reduced model?
For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).
In least squares regression, we use
\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \sim \chi^2_{df_R-df_F} \]
Analogously, in logistic regression \[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty}{\sim} \chi^2_{df_R-df_F} \]
These tests assume model \({\cal M}_F\) is correctly specified. Otherwise, Wald-type test with sandwich or other covariance estimate would have to be used.
Deviance tests do not agree numerically with Wald tests even when model is correct. Both are often used. Wald can fail!
## Analysis of Deviance Table
##
## Model 1: y ~ width
## Model 2: y ~ width + weight
## Resid. Df Resid. Dev Df Deviance
## 1 171 194.45
## 2 170 192.89 1 1.5608
We should compare this difference in deviance with a \(\chi^2_1\) random variable.
## [1] 0.2116652
Let’s compare this with the Wald test:
##
## Call:
## glm(formula = y ~ width + weight, family = binomial(), data = crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1127 -1.0344 0.5304 0.9006 1.7207
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.3547261 3.5280465 -2.652 0.00801 **
## width 0.3067892 0.1819473 1.686 0.09177 .
## weight 0.0008338 0.0006716 1.241 0.21445
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 225.76 on 172 degrees of freedom
## Residual deviance: 192.89 on 170 degrees of freedom
## AIC: 198.89
##
## Number of Fisher Scoring iterations: 4
library(glmbb) # for `crabs` data
data(crabs)
width_glm = glm(y ~ width, family=binomial(), data=crabs)
library(dplyr)  # for the group_by / summarize pipeline below

##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
crabs$groups = cut(crabs$width, breaks=c(0, seq(23.25, 29.25, by=1), max(crabs$width)))
crabs$fits = fitted(width_glm)
results = crabs %>%
group_by(groups) %>%
summarize(N=n(), numY = sum(y), numN = N - numY,
            fittedY=sum(fits), fittedN=N-fittedY)

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 6
## groups N numY numN fittedY fittedN
## <fct> <int> <int> <int> <dbl> <dbl>
## 1 (0,23.2] 14 5 9 3.64 10.4
## 2 (23.2,24.2] 14 4 10 5.31 8.69
## 3 (24.2,25.2] 28 17 11 13.8 14.2
## 4 (25.2,26.2] 39 21 18 24.2 14.8
## 5 (26.2,27.2] 22 15 7 15.9 6.06
## 6 (27.2,28.2] 24 20 4 19.4 4.62
## 7 (28.2,29.2] 18 15 3 15.7 2.35
## 8 (29.2,33.5] 14 14 0 13.1 0.918
G2 = 2 * (sum(results$numY * log(results$numY / results$fittedY)) +
          sum(results$numN * log((results$numN + 1.e-10) / results$fittedN)))
# the + 1.e-10 avoids 0 * log(0) = NaN in the group with numN = 0
c(G2, pchisq(G2, 8 - 2, lower.tail=FALSE)) # why df=6?

## [1] 6.1793623 0.4034009
X2 = (sum((results$numY - results$fittedY)^2 / results$fittedY) +
sum((results$numN - results$fittedN)^2 / results$fittedN))
c(X2, pchisq(X2, 8 - 2, lower.tail=FALSE))

## [1] 5.320099 0.503461
Alternative “partition”: partition cases based on \(\hat{\pi}\).
Rank \(\hat{\pi}\) into \(g\) groups.
Compute Pearson’s \(X^2\) as above.
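A hand-rolled sketch of this statistic on simulated data (the `ResourceSelection` package's `hoslem.test`, used below, does the same with more care; grouping conventions vary).

```r
set.seed(1)
x = rnorm(200)
y = rbinom(200, 1, plogis(-0.5 + x))
fit = glm(y ~ x, family = binomial())
p = fitted(fit)

g = 10  # number of groups, formed from deciles of fitted probabilities
grp = cut(p, quantile(p, seq(0, 1, length = g + 1)), include.lowest = TRUE)
obs   = tapply(y, grp, sum)     # observed successes per group
expct = tapply(p, grp, sum)     # expected successes per group
n_g   = tapply(y, grp, length)  # group sizes

X2_HL = sum((obs - expct)^2 / (expct * (1 - expct / n_g)))
c(X2_HL, pchisq(X2_HL, g - 2, lower.tail = FALSE))
```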
library(ResourceSelection)
hoslem.test(crabs$y, crabs$fits)

## ResourceSelection 0.3-5 2019-07-22
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: crabs$y, crabs$fits
## X-squared = 4.3855, df = 8, p-value = 0.8208
If we have grouped data (i.e. a fixed number of possible covariate values) we can use a \(G^2\) or \(X^2\) test as a goodness of fit.
Without groups, we might make groups by partitioning the feature space appropriately.
Pearson residual \[ e_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}} \]
Used to form Pearson’s \(X^2\) (we’ll see general form for GLMs later) \[ X^2 = \sum_{i=1}^n e_i^2. \]
As in OLS regression, residuals have smaller variance than the true errors. (What are the true errors?)
In OLS regression, this is remedied by standardizing \[ r_i = \frac{Y_i - \hat{Y}_i} {\sqrt{R_{ii}}} \] where \(R=I-H\) is the residual forming matrix and \[ H = X(X^TX)^{-1}X^T \] is the hat matrix.
What is analog of \(H\) in logistic regression?
Recall \[ \begin{aligned} \hat{\beta}-\beta^* &\approx (X^TW_{\beta^*}X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \\ &= (X^TW_{\beta^*}X)^{-1}X^TW^{1/2}_{\beta^*}R_{\beta^*}(X,Y) \\ \end{aligned} \]
Above, \[ R_{\beta^*}(X,Y) = W_{\beta^*}^{-1/2}(Y-\pi_{\beta^*}(X)) \] has independent entries, each with variance 1 (assuming model is correctly specified).
The entries of \(R_{\beta^*}(X,Y)\) are essentially the (population) Pearson residuals…
We see \(\hat{\beta}\) is essentially the OLS estimator with design \(W_{\beta^*}^{1/2}(X)X\) and errors \(R_{\beta^*}(X,Y)\). (This is basically the IRLS viewpoint.)
In this OLS model, the appropriate hat matrix is
\[ H_{\beta^*} = W_{\beta^*}^{1/2}X(X^TW_{\beta^*}X)^{-1}X^TW_{\beta^*} ^{1/2}, \]
approximated by
\[ H_{\hat{\beta}} = W_{\hat{\beta}}^{1/2}X(X^TW_{\hat{\beta}}X)^{-1}X^TW_{\hat{\beta}}^{1/2} \]
\[ r_i = \frac{e_i}{ \sqrt{1 - H_{\hat{\beta},ii}}}. \]
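On made-up data, the leverages \(H_{\hat{\beta},ii}\) computed this way agree with `hatvalues` applied to the `glm` fit:

```r
# Made-up toy data
X = cbind(1, c(-2, -1, 0, 1, 2))
Y = c(0, 0, 1, 0, 1)
fit = glm(Y ~ X - 1, family = binomial())

W = fitted(fit) * (1 - fitted(fit))
WX = sqrt(W) * X                          # W^{1/2} X
H = WX %*% solve(t(WX) %*% WX) %*% t(WX)  # hat matrix at beta-hat
max(abs(diag(H) - hatvalues(fit)))        # essentially zero
```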
Residuals and leverage can be used in similar ways as in OLS:
Detect outliers. (Are they normal? Probably not. Grouped data has a chance.)
Check regression specification.
library(car)

## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
outlierTest(width_glm)

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 94 -2.05045 0.040321 NA
pearson_resid = (crabs$y - crabs$fits) / sqrt(crabs$fits * (1 - crabs$fits))
plot(resid(width_glm, type='pearson'), pearson_resid, pch=23, col='red')

dev_resid = sign(crabs$y - crabs$fits) *
    sqrt(- 2 * (crabs$y * log(crabs$fits) + (1 - crabs$y) * log(1 - crabs$fits)))
plot(resid(width_glm), dev_resid, pch=23, col='red')

In OLS regression, a common summary is \[ R^2 = 1 - \frac{SSE}{SST} = \frac{SST-SSE}{SST} \]
Analog for logistic regression is McFadden’s \(R^2\) \[ \frac{DEV(M_0) - DEV(M)}{DEV(M_0)} \]
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 172 225.76
## width 1 31.306 171 194.45
## [1] 0.1386871 0.1386697
## dfb.1_ dfb.wdth dffit cov.r cook.d hat
## 1 -0.047610662 0.04996645 0.06040203 1.020937 0.001132128 0.012344511
## 2 -0.104054950 0.10079960 -0.11366303 1.032553 0.004233338 0.025717205
## 3 -0.004535225 0.00948021 0.07499977 1.009632 0.002017305 0.007084208
## 4 -0.061618758 0.05535952 -0.11116748 1.007620 0.005034932 0.010062537
## 5 -0.004535225 0.00948021 0.07499977 1.009632 0.002017305 0.007084208
## 6 -0.094872323 0.08993360 -0.11878283 1.018836 0.005119353 0.016600380
Suppose classification rule is \[ \hat{y}(x) = \begin{cases} 1 & \hat{\pi}(x) \geq 0.5 \\ 0 & \text{otherwise.} \end{cases} \]
Produces a 2x2 table based on actual labels and predicted labels.
##
## 0 1
## FALSE 27 16
## TRUE 35 95
Standard measures of quality of classifier based on the confusion matrix.
Example using our classifier above: \[ \begin{aligned} TPR &= 95 / (95 + 16) \\ FPR &= 35 / (35 + 27) \end{aligned} \]
Threshold of 0.5 is somewhat arbitrary.
All thresholds together can be summarized in ROC curve
library(ROCR)
crabs_perf_data = prediction(crabs$fits, crabs$y)
crabs_perf = performance(crabs_perf_data, 'tpr', 'fpr')

## [[1]]
## [1] 0.7424441
\[ \begin{aligned} AIC(M) &= -2 \log L(M) + 2 \cdot p(M)\\ &= DEV(M) + 2 \cdot p(M) + g(Y) \\ \end{aligned} \] where \(g(Y)\) depends only on the data, not the model, so it does not affect model comparisons.
or \(BIC(M)\) (2 replaced by \(\log n\)).
Search
Exhaustive: bestglm
Stepwise: step
library(bestglm)

## Loading required package: leaps
crabs_orig = within(crabs, rm(groups, fits, satell, predicted.label))
crabs_orig$color = as.integer(crabs_orig$color)
crabs_orig$spine = as.integer(crabs_orig$spine)
AIC_model = bestglm(crabs_orig, family=binomial(), IC="AIC")

## Morgan-Tatar search since family is non-gaussian.
## AIC
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
crabs_factor = crabs_orig # makes a copy
crabs_factor$color = factor(crabs_factor$color)
crabs_factor$spine = factor(crabs_factor$spine)
factorAIC_model = bestglm(crabs_factor, family=binomial(), IC="AIC")

## Morgan-Tatar search since family is non-gaussian.
## Note: factors present with more than 2 levels.
## AIC
## Best Model:
## Df Sum Sq Mean Sq F value Pr(>F)
## color 3 3.24 1.079 5.685 0.000988 ***
## width 1 4.66 4.659 24.548 1.76e-06 ***
## Residuals 168 31.88 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
## AIC
## Best Model:
## Df Sum Sq Mean Sq F value Pr(>F)
## color 3 3.24 1.079 5.685 0.000988 ***
## width 1 4.66 4.659 24.548 1.76e-06 ***
## Residuals 168 31.88 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## [1] "Cross-validation not available when there are categorical variables with more than 2 levels!"
## Morgan-Tatar search since family is non-gaussian.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## CVd(d = 132, REP = 1000)
## BICq equivalent for q in (2.09431094677637e-06, 0.847412646439335)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.3508177 2.6287179 -4.698419 2.621832e-06
## width 0.4972306 0.1017355 4.887482 1.021340e-06
full_glm = glm(y ~ ., family=binomial(), data=crabs_factor)
null_glm = glm(y ~ 1, family=binomial(), data=crabs_factor)
step(null_glm, list(upper=full_glm), direction='forward', k=log(nrow(crabs_factor))) # for BIC

## Start: AIC=230.91
## y ~ 1
##
## Df Deviance AIC
## + width 1 194.45 204.76
## + weight 1 195.74 206.04
## <none> 225.76 230.91
## + color 3 212.06 232.67
## + spine 2 223.23 238.69
##
## Step: AIC=204.76
## y ~ width
##
## Df Deviance AIC
## <none> 194.45 204.76
## + weight 1 192.89 208.35
## + color 3 187.46 213.22
## + spine 2 194.43 215.04
##
## Call: glm(formula = y ~ width, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width
## -12.3508 0.4972
##
## Degrees of Freedom: 172 Total (i.e. Null); 171 Residual
## Null Deviance: 225.8
## Residual Deviance: 194.5 AIC: 198.5
##
## Call: glm(formula = y ~ width + color, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width color2 color3 color4
## -11.6090 0.4680 -1.1061 0.2238 0.2962
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5
##
## Call: glm(formula = y ~ width + color, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) width color2 color3 color4
## -11.6090 0.4680 -1.1061 0.2238 0.2962
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5
##
## Call: glm(formula = y ~ color + width, family = binomial(), data = crabs_factor)
##
## Coefficients:
## (Intercept) color2 color3 color4 width
## -11.6090 -1.1061 0.2238 0.2962 0.4680
##
## Degrees of Freedom: 172 Total (i.e. Null); 168 Residual
## Null Deviance: 225.8
## Residual Deviance: 187.5 AIC: 197.5