Grouped goodness of fit tests (5.2.5)

library(glmbb) # for `crabs` data
data(crabs)
width_glm = glm(y ~ width, family=binomial(), data=crabs)

Is the logistic model a good fit?

library(dplyr)
crabs$groups = cut(crabs$width, breaks=c(0, seq(23.25, 29.25, by=1), max(crabs$width)))
crabs$fits = fitted(width_glm)
results = crabs %>%
          group_by(groups) %>%
          summarize(N=n(), numY = sum(y), numN = N - numY, 
                    fittedY=sum(fits), fittedN=N-fittedY)
results
# A tibble: 8 × 6
  groups          N  numY  numN   fittedY    fittedN
  <fct>       <int> <int> <int>     <dbl>      <dbl>
1 (0,23.2]       14     5     9  3.635428 10.3645724
2 (23.2,24.2]    14     4    10  5.305988  8.6940124
3 (24.2,25.2]    28    17    11 13.777624 14.2223757
4 (25.2,26.2]    39    21    18 24.227680 14.7723199
5 (26.2,27.2]    22    15     7 15.937800  6.0621999
6 (27.2,28.2]    24    20     4 19.383348  4.6166516
7 (28.2,29.2]    18    15     3 15.650178  2.3498224
8 (29.2,33.5]    14    14     0 13.081954  0.9180457
G2 = 2 * (sum(results$numY * log(results$numY / results$fittedY)) +
          sum(results$numN * log((results$numN + 1.e-10) / results$fittedN)))
          # the 1.e-10 guards against 0 * log(0) in the last group, where numN = 0
c(G2, pchisq(G2, 8 - 2, lower.tail=FALSE)) # df = 6: 8 groups minus 2 fitted parameters
[1] 6.179362 0.403401
X2 = (sum((results$numY - results$fittedY)^2 / results$fittedY) +
      sum((results$numN - results$fittedN)^2 / results$fittedN))
c(X2, pchisq(X2, 8 - 2, lower.tail=FALSE)) 
[1] 5.320099 0.503461

Hosmer-Lemeshow

  • Alternative “partition”: partition cases based on \(\hat{\pi}\).

  • Rank \(\hat{\pi}\) into \(g\) groups.

  • Compute Pearson’s \(X^2\) as above.

library(ResourceSelection)
hoslem.test(crabs$y, crabs$fits, g=10)
	Hosmer and Lemeshow goodness of fit (GOF) test

data:  crabs$y, crabs$fits
X-squared = 4.3855, df = 8, p-value = 0.8208
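The test above can be sketched by hand. This minimal version (the helper name `hl_by_hand` is ours, not from any package) bins cases by quantiles of the fitted probabilities and forms Pearson's \(X^2\) over observed vs. expected counts; exact tie-handling in the binning differs from `hoslem.test`, so small numerical differences are possible:

```r
# Hand-rolled Hosmer-Lemeshow statistic: g bins by ranked fitted probabilities,
# then Pearson's X^2 over observed vs. expected successes and failures per bin.
hl_by_hand = function(y, pihat, g = 10) {
  bins = cut(pihat,
             breaks = quantile(pihat, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
  obs1 = tapply(y, bins, sum)          # observed successes per bin
  exp1 = tapply(pihat, bins, sum)      # expected successes per bin
  obs0 = tapply(1 - y, bins, sum)      # observed failures per bin
  exp0 = tapply(1 - pihat, bins, sum)  # expected failures per bin
  X2 = sum((obs1 - exp1)^2 / exp1) + sum((obs0 - exp0)^2 / exp0)
  c(X2 = X2, p.value = pchisq(X2, g - 2, lower.tail = FALSE))
}
# e.g. hl_by_hand(crabs$y, crabs$fits) for the fit above
```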

Diagnostics in logistic regression (6.2)

  • If we have grouped data (i.e. a fixed number of possible covariate values) we can use a \(G^2\) or \(X^2\) test as a goodness of fit.

  • Without groups, we might make groups by partitioning the feature space appropriately.

Residuals

  • Pearson residual \( e_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i(1-\hat{\pi}_i)}} \)

  • Used to form Pearson’s \(X^2\) (we’ll see the general form for GLMs later): \( X^2 = \sum_{i=1}^n e_i^2. \)
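As a sketch, this statistic is one line in R given any fitted binomial GLM (the helper name `pearson_X2` is ours):

```r
# Pearson's X^2: sum of squared Pearson residuals of a fitted glm
pearson_X2 = function(fit) sum(resid(fit, type = "pearson")^2)
# e.g. pearson_X2(width_glm) for the crab fit above
```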


Residuals

  • Deviance residual: write the deviance as \( DEV(\hat{\pi}|Y) = \sum_{i=1}^n DEV(\hat{\pi}_i|Y_i); \) then the deviance residuals are \( \text{sign}(Y_i-\hat{\pi}_i) \cdot \sqrt{DEV(\hat{\pi}_i|Y_i)}. \)


Hat matrix and leverage

  • As in OLS regression, residuals have smaller variance than the true errors. (What are the true errors?)

  • In OLS regression, this is remedied by standardizing: \( r_i = \frac{Y_i - \hat{Y}_i}{\sqrt{R_{ii}}}, \) where \(R = I - H\) is the residual-forming matrix and \( H = X(X^TX)^{-1}X^T \) is the hat matrix.

  • What is analog of \(H\) in logistic regression?


Hat matrix and leverage

  • Recall \[ \begin{aligned} \hat{\beta}-\beta^* &\approx (X^TW_{\beta^*}X)^{-1}X^T(Y-\pi_{\beta^*}(X)) \\ &= (X^TW_{\beta^*}X)^{-1}X^TW^{1/2}_{\beta^*}R_{\beta^*}(X,Y) \end{aligned} \]

  • Above, \( R_{\beta^*}(X,Y) = W_{\beta^*}^{-1/2}(Y-\pi_{\beta^*}(X)) \) has independent entries, each with variance 1 (assuming the model is correctly specified).

  • \(R_{\beta^*}(X,Y)\) are essentially the (population) Pearson residuals…


Hat matrix and leverage

  • We see \(\hat{\beta}\) is essentially the OLS estimator with design \(W_{\beta^*}^{1/2}X\) and errors \(R_{\beta^*}(X,Y)\). (This is basically the IRLS viewpoint.)

  • In this OLS model, the appropriate hat matrix is

\[ H_{\beta^*} = W_{\beta^*}^{1/2}X(X^TW_{\beta^*}X)^{-1}X^TW_{\beta^*}^{1/2}, \]

approximated by

\[ H_{\hat{\beta}} = W_{\hat{\beta}}^{1/2}X(X^TW_{\hat{\beta}}X)^{-1}X^TW_{\hat{\beta}}^{1/2} \]

Standardized residuals

  • Given the “correct” hat matrix, this leads to the standardized residual

\[ r_i = \frac{e_i}{ \sqrt{1 - H_{\hat{\beta},ii}}}. \]
  • Residuals and leverage can be used in similar ways as in OLS:

    • Detect outliers. (Are they normal? Probably not. Grouped data has a chance.)

    • Check regression specification.
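A minimal sketch of this standardization in R, assuming a fitted GLM such as `width_glm` above (the helper name `std_pearson` is ours); `hatvalues()` returns the diagonal of \(H_{\hat{\beta}}\), and the result should agree with R's built-in `rstandard(fit, type = "pearson")`:

```r
# Standardized Pearson residuals r_i = e_i / sqrt(1 - H_ii)
std_pearson = function(fit) {
  e = resid(fit, type = "pearson")  # Pearson residuals e_i
  h = hatvalues(fit)                # leverages: diagonal of the IRLS hat matrix
  e / sqrt(1 - h)
}
# e.g. std_pearson(width_glm)
```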


library(car)
outlierTest(width_glm)
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferroni p
94 -2.05045           0.040321           NA

pearson_resid = (crabs$y - crabs$fits) / sqrt(crabs$fits * (1 - crabs$fits))
plot(resid(width_glm, type='pearson'), pearson_resid, pch=23, col='red')
[Figure: plot of resid(width_glm, type='pearson') against the hand-computed Pearson residuals]

dev_resid = sign(crabs$y - crabs$fits) * 
     sqrt(- 2 * (crabs$y * log(crabs$fits) + (1 - crabs$y) * log(1 - crabs$fits)))
plot(resid(width_glm), dev_resid, pch=23, col='red')
[Figure: plot of resid(width_glm) (deviance residuals) against the hand-computed deviance residuals]

par(mfrow=c(2,2))
plot(width_glm)
[Figure: default 2 × 2 panel of diagnostic plots from plot(width_glm)]

Analog of \(R^2\) (6.3)

  • In OLS regression, a common summary is \( R^2 = 1 - \frac{SSE}{SST} = \frac{SST-SSE}{SST}. \)

  • The analog for logistic regression is McFadden’s \(R^2\): \( \frac{DEV(M_0) - DEV(M)}{DEV(M_0)}, \) where \(M_0\) is the intercept-only (null) model.


anova(width_glm)
c((225.76 - 194.45) / 225.76, 
  1 - width_glm$deviance / width_glm$null.deviance)
Analysis of Deviance Table

      Df Deviance Resid. Df Resid. Dev
NULL                    172   225.7585
width  1 31.30586       171   194.4527

[1] 0.1386871 0.1386697

Influence measures

library(car)
head(influence.measures(width_glm)$infmat)
A matrix: 6 × 6 of type dbl
        dfb.1_   dfb.wdth       dffit    cov.r      cook.d         hat
1 -0.047610662 0.04996645  0.06040203 1.020937 0.001132128 0.012344511
2 -0.104054950 0.10079960 -0.11366303 1.032553 0.004233338 0.025717205
3 -0.004535225 0.00948021  0.07499977 1.009632 0.002017305 0.007084208
4 -0.061618758 0.05535952 -0.11116748 1.007620 0.005034932 0.010062537
5 -0.004535225 0.00948021  0.07499977 1.009632 0.002017305 0.007084208
6 -0.094872323 0.08993360 -0.11878283 1.018836 0.005119353 0.016600380
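One common way to act on these measures (a sketch; the \(4/n\) cutoff for Cook's distance is a rule of thumb, and the helper name `flag_influential` is ours):

```r
# Flag cases whose Cook's distance exceeds the rule-of-thumb cutoff mult/n
flag_influential = function(fit, mult = 4) {
  cd = cooks.distance(fit)
  which(cd > mult / length(cd))
}
# e.g. flag_influential(width_glm)
```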

Confusion matrix

  • Suppose the classification rule is \( \hat{y}(x) = \begin{cases} 1 & \hat{\pi}(x) \geq 0.5 \\ 0 & \text{otherwise.} \end{cases} \)

  • Produces a 2x2 table based on actual labels and predicted labels.

crabs$predicted.label = fitted(width_glm) >= 0.5
table(crabs$predicted.label, crabs$y)
       
         0  1
  FALSE 27 16
  TRUE  35 95

Confusion matrix

  • Standard measures of quality of classifier based on the confusion matrix.

  • Example using our classifier above: \( \begin{aligned} TPR &= 95 / (95 + 16) \\ FPR &= 35 / (35 + 27) \end{aligned} \)
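These rates can be read straight off the table; a sketch assuming the layout above (predicted label in rows `FALSE`/`TRUE`, actual `y` in columns `0`/`1`; the helper name `rates` is ours):

```r
# True and false positive rates from a 2x2 confusion table
rates = function(tab) {
  tpr = tab["TRUE", "1"] / sum(tab[, "1"])  # sensitivity
  fpr = tab["TRUE", "0"] / sum(tab[, "0"])  # 1 - specificity
  c(TPR = tpr, FPR = fpr)
}
# e.g. rates(table(crabs$predicted.label, crabs$y))
```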


ROC curve

  • Threshold of 0.5 is somewhat arbitrary.

  • All thresholds together can be summarized in the ROC curve.

library(ROCR)
crabs_perf_data = prediction(crabs$fits, crabs$y)
crabs_perf = performance(crabs_perf_data, 'tpr', 'fpr')
plot(crabs_perf)
performance(crabs_perf_data, 'auc')@y.values # AUC
abline(0, 1, lty=2)
[[1]]
[1] 0.742444
[Figure: ROC curve for width_glm with 45° reference line]
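The AUC reported above can also be obtained directly from its Mann–Whitney interpretation: the probability that a randomly chosen positive case receives a higher fitted value than a randomly chosen negative one, with ties counting one half (the helper name `auc_by_hand` is ours):

```r
# AUC as the Mann-Whitney probability; wins count 1, ties count 1/2
auc_by_hand = function(score, y) {
  pos = score[y == 1]
  neg = score[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
# auc_by_hand(crabs$fits, crabs$y) should agree with the ROCR value above
```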

Model selection (6.1)

  • The model is parametric, hence we can use

\[\begin{split} \begin{aligned} AIC(M) &= -2 \log L(M) + 2 \cdot p(M)\\ &= DEV(M) + 2 \cdot p(M) + g(Y) \\ \end{aligned} \end{split}\]

or \(BIC(M)\) (2 replaced by \(\log n\)).

  • Search

    • Exhaustive: `bestglm`

    • Stepwise: `step`
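For example, a stepwise search with `step` (a sketch; it assumes the `crabs` data from `glmbb` include `weight` and `color` columns to use as extra predictors):

```r
# Stepwise model search by AIC and BIC, starting from a larger model;
# k is the per-parameter penalty in step()
full = glm(y ~ width + weight + factor(color), family = binomial(), data = crabs)
step(full, direction = "both", k = 2, trace = 0)                 # AIC (penalty 2)
step(full, direction = "both", k = log(nrow(crabs)), trace = 0)  # BIC (penalty log n)
```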