Logistic regression#

Outline#

  • Case studies:

    A. Survival in the Donner party

    B. Smoking and bird-keeping

  • Binary outcomes

  • Estimation and inference: likelihood

Case study A: Donner party#

url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/donner_party.csv'
donner.df = read.csv(url, sep=',', header=TRUE)
head(donner.df)
A data.frame: 6 × 3
  Age    Sex   Status
1  23   Male     Died
2  40 Female Survived
3  40   Male Survived
4  30   Male     Died
5  28   Male     Died
6  40   Male     Died

Binary outcomes#

  • Most models so far have had response \(Y\) as continuous.

  • Status is Died/Survived

  • Other examples:

    1. medical: presence or absence of cancer

    2. financial: bankrupt or solvent

    3. industrial: passes a quality control test or not

Modelling probabilities#

  • For \(0-1\) responses we need to model

\[\pi(x)=\pi(x_1, \dots, x_p) = P(Y=1|X_1=x_1,\dots, X_p=x_p)\]
  • That is, \(Y\) is Bernoulli with a probability that depends on covariates \(\{X_1, \dots, X_p\}.\)

  • Note: \(\text{Var}(Y|X) = \pi(X) ( 1 - \pi(X)) = E(Y|X) \cdot ( 1- E(Y|X))\)

  • Or, the binary nature forces a relation between mean and variance of \(Y\).

  • This makes logistic regression a Generalized Linear Model.
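The mean-variance relation above is easy to see numerically; a minimal simulation sketch (the value of \(\pi\) below is arbitrary):

```r
# Simulation sketch of Var(Y) = pi * (1 - pi) for Bernoulli Y
# (the value of pi is arbitrary, chosen for illustration).
set.seed(191)
pi.x <- 0.3
y <- rbinom(1e5, 1, pi.x)
c(empirical = var(y), theoretical = pi.x * (1 - pi.x))
```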

A possible model for Donner#

  • Set Y=Status=='Survived'

  • Simple model: lm(Y ~ Age + Sex)

  • Problems / issues:

    1. We must have \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\). OLS will not force this.

    2. Inference methods for OLS will not work because of the relation between mean and variance. We need to change how we compute variance!

Logistic regression#

  • Set X_1=Age, X_2=Sex

  • Logistic model

\[\pi(X_1,X_2) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}\]
  • This automatically fixes \(0 \leq E(Y|X) = \pi(X_1,X_2) \leq 1\).

Logit transform#

  • Definition: \(\text{logit}(\pi(X_1, X_2)) = \log\left(\frac{\pi(X_1, X_2)}{1 - \pi(X_1,X_2)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\)

Logistic distribution#

logit.inv = function(x) {
  return(exp(x) / (1 + exp(x)))
}
x = seq(-4, 4, length=200)
plot(x, logit.inv(x), lwd=2, type='l', col='red', cex.lab=1.2)

Logistic transform: logit#

logit = function(p) {
  return(log(p / (1 - p)))
}
p = seq(0.01,0.99,length=200)
plot(p, logit(p), lwd=2, type='l', col='red', cex.lab=1.2)

Binary regression models#

  • Models \(E(Y)\) as \(F(\beta_0 + \beta_1 X_1 + \beta_2 X_2)\) for some increasing function \(F\) (usually a distribution function).

  • The logistic model uses the function (we called logit.inv above)

\[F(x)=\frac{e^x}{1+e^x}.\]
  • Can be fit using Maximum Likelihood / Iteratively Reweighted Least Squares.

  • For logistic regression, coefficients have nice interpretation in terms of odds ratios (to be defined shortly).

  • What about inference?

Criterion used to fit model#

Instead of sum of squares, logistic regression uses deviance:

  • \(DEV(\mu| Y) = -2 \log L(\mu| Y) + 2 \log L(Y| Y)\) where \(\mu\) is a location estimator for \(Y\).

  • If \(Y\) is Gaussian with independent \(N(\mu_i,\sigma^2)\) entries then \(DEV(\mu| Y) = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i - \mu_i)^2\)

  • If \(Y\) is a binary vector, with mean vector \(\pi\) then \(DEV(\pi| Y) = -2 \sum_{i=1}^n \left( Y_i \log(\pi_i) + (1-Y_i) \log(1-\pi_i) \right)\)

  • Minimizing deviance \(\iff\) Maximum Likelihood
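The binary deviance formula above can be checked directly against `deviance()` from `glm()`; the data below are synthetic, purely for illustration:

```r
# Check the binary-deviance formula against deviance() from glm().
# Synthetic data; names here are illustrative, not from the notes.
set.seed(191)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(x))
fit <- glm(y ~ x, family = binomial())
pi.hat <- fitted(fit)
DEV <- -2 * sum(y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
c(byhand = DEV, glm = deviance(fit))  # the two agree
```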

Deviance for logistic regression#

  • For any binary regression model, \(\pi=\pi(\beta)\).

  • The deviance is:

\[\begin{aligned} DEV(\beta| Y) &= -2 \sum_{i=1}^n \left( Y_i {\text{logit}}(\pi_i(\beta)) + \log(1-\pi_i(\beta)) \right) \end{aligned}\]
  • For the logistic model, the RHS is:

\[ \begin{aligned} -2 \left[ (X\beta)^Ty - \sum_{i=1}^n\log \left(1 + \exp \left(\beta_0 + \sum_{j=1}^p X_{ij} \beta_j\right) \right)\right] \end{aligned}\]

  • The logistic model is special in that \(\text{logit}(\pi(\beta))=X\beta\). If we used a different transformation, the first part would not be linear in \(X\beta\).

donner.df$Y = donner.df$Status == 'Survived'
donner.glm = glm(Y ~ Age + Sex, data=donner.df,
                 family=binomial())
summary(donner.glm)
Call:
glm(formula = Y ~ Age + Sex, family = binomial(), data = donner.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.23041    1.38686   2.329   0.0198 *
Age         -0.07820    0.03728  -2.097   0.0359 *
SexMale     -1.59729    0.75547  -2.114   0.0345 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4

Odds Ratios#

  • One reason logistic models are popular is that the parameters have simple interpretations in terms of odds

    \[ODDS(A) = \frac{P(A)}{1-P(A)}.\]
  • Logistic model:

\[OR_{X_j} = \frac{ODDS(Y=1|\dots, X_j=x_j+h, \dots)}{ODDS(Y=1|\dots, X_j=x_j, \dots)} = e^{h \beta_j}\]
  • If \(X_j \in \{0, 1\}\) is dichotomous, then the odds for the group with \(X_j = 1\) are \(e^{\beta_j}\) times the odds for the group with \(X_j = 0\), other covariates being equal.


  • Tempting to think of odds ratio as probability ratio, but not quite the same

  • In Donner, the odds ratio for a 45 year old male compared to a 35 year old female is

\[e^{10 \times (-0.0782) - 1.5973}=e^{-2.3793}\approx 0.0926\]
logodds = predict(donner.glm, list(Age=c(35,45),Sex=c('Female', 'Male')),
                 type='link')
logodds
logodds[2] - logodds[1]
exp(logodds[2] - logodds[1])
1: 0.4932713
2: -1.8860630
logodds[2] - logodds[1]: -2.3793342
exp(logodds[2] - logodds[1]): 0.0926122

The estimated probabilities yield a ratio of \(0.1317/0.6209 \approx 0.21\).

prob = exp(logodds)/(1+exp(logodds))
prob
prob[2] / prob[1]
1: 0.6208768
2: 0.1316940
prob[2] / prob[1]: 0.2121098
  • When success is rare (rare disease hypothesis) then odds ratios are approximately probability ratios…
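The identity \(OR_{X_j}=e^{h\beta_j}\) holds exactly for any fitted logistic model; a quick sketch on synthetic data (the variable names and \(h=10\) are illustrative):

```r
# Sketch: the odds ratio for a shift of h in x equals exp(h * beta).
# Synthetic data; names and h = 10 are illustrative, not from the notes.
set.seed(191)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 0.1 * x))
fit <- glm(y ~ x, family = binomial())
odds <- function(p) p / (1 - p)
h <- 10
p.hat <- predict(fit, data.frame(x = c(0, h)), type = 'response')
c(ratio = odds(p.hat[2]) / odds(p.hat[1]),
  formula = exp(h * coef(fit)['x']))  # identical up to rounding
```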

Inference#

  • Model is fit to minimize deviance

  • Fit using IRLS (Iteratively Reweighted Least Squares) algorithm

  • General formula for \(\text{Var}(\hat{\beta})\) from IRLS.

  • Intervals formed with vcov() are called Wald intervals.

center = coef(donner.glm)['Age']
SE = sqrt(vcov(donner.glm)['Age', 'Age'])
U = center + SE * qnorm(0.975)
L = center - SE * qnorm(0.975)
data.frame(L, center, U)
A data.frame: 1 × 3
              L      center            U
Age  -0.1512803 -0.07820407 -0.005127799

Covariance#

  • The estimated covariance uses the “weights” computed from the fitted model.

pi.hat = fitted(donner.glm)
W.hat = pi.hat * (1 - pi.hat)
X = model.matrix(donner.glm)
C = solve(t(X) %*% (W.hat * X))
c(SE, sqrt(C['Age', 'Age']))
  1. 0.0372844979677129
  2. 0.0372874889556373

Confidence intervals#

  • The intervals above are slightly different from what R will give you if you ask it for confidence intervals.

  • R uses so-called profile intervals.

  • For large samples the two methods should agree quite closely.

CI = confint(donner.glm)
CI
mean(CI[2,]) # profile intervals are not symmetric around the estimate...
data.frame(L, center, U)
Waiting for profiling to be done...
A matrix: 3 × 2 of type dbl
                 2.5 %      97.5 %
(Intercept)  0.8514190  6.42669512
Age         -0.1624377 -0.01406576
SexMale     -3.2286705 -0.19510198

-0.0882517038296134

A data.frame: 1 × 3
              L      center            U
Age  -0.1512803 -0.07820407 -0.005127799

Example from book#

  • The book considers testing whether there is any interaction between Sex and Age in the Donner party data:

summary(glm(Y ~ Age + Sex + Age:Sex, 
            donner.df, 
            family=binomial()))$coef
A matrix: 4 × 4 of type dbl
              Estimate Std. Error   z value   Pr(>|z|)
(Intercept)  7.2463848 3.20516725  2.260845 0.02376889
Age         -0.1940741 0.08741539 -2.220137 0.02640948
SexMale     -6.9280495 3.39887166 -2.038338 0.04151614
Age:SexMale  0.1615969 0.09426191  1.714339 0.08646650

  • \(Z\)-test for testing \(H_0:\beta_{\texttt Age:SexMale}=0\):

\[ Z = \frac{0.1616-0}{0.09426} \approx 1.714 \]
  • When \(H_0\) is true, \(Z\) is approximately \(N(0,1)\).

pval = 2 * pnorm(1.714, lower=FALSE)
pval
0.0865287258673211

Comparing models in logistic regression#

  • For a model \({\cal M}\), \(DEV({\cal M})\) replaces \(SSE({\cal M})\).

  • In least squares regression (with \(\sigma^2\) known), we use

\[\frac{1}{\sigma^2}\left( SSE({\cal M}_R) - SSE({\cal M}_F) \right) \overset{H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]
  • This is closely related to \(F\) with large \(df_F\): approximately \(F_{df_R-df_F, df_F} \cdot (df_R-df_F)\).

  • For logistic regression this difference in \(SSE\) is replaced with

\[DEV({\cal M}_R) - DEV({\cal M}_F) \overset{n \rightarrow \infty, H_0:{\cal M}_R}{\sim} \chi^2_{df_R-df_F}\]
  • Resulting tests do not agree numerically with those coming from IRLS (Wald tests). Both are often used.
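The claim that the \(\chi^2\) statistic behaves like \((df_R-df_F) \cdot F_{df_R-df_F, df_F}\) for large \(df_F\) can be checked numerically (the cutoff and degrees of freedom below are arbitrary):

```r
# Numerical check (arbitrary values): chi-square tail probability vs the
# tail of (df_R - df_F) * F_{df_R - df_F, df_F} when df_F is large.
q <- 5
df.diff <- 2     # df_R - df_F
df.full <- 1000  # df_F
c(chisq = pchisq(q, df.diff, lower.tail = FALSE),
  F.approx = pf(q / df.diff, df.diff, df.full, lower.tail = FALSE))
```

The two tail probabilities agree to about two decimal places here, and the agreement improves as `df.full` grows.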

Comparing ~ Age + Sex to ~1#

anova(glm(Y ~ 1,
          data=donner.df, 
          family=binomial()), 
      donner.glm)
A anova: 2 × 5
  Resid. Df Resid. Dev Df Deviance   Pr(>Chi)
1        44   61.82654 NA       NA         NA
2        42   51.25628  2 10.57026 0.00506638

Corresponding \(p\)-value:

1 - pchisq(10.57, 2) # testing ~ 1 vs ~ 1 + Age + Sex
0.00506703218816074

Comparing ~ Age + Sex to ~ Age#

anova(glm(Y ~ Age,
          data=donner.df, 
          family=binomial()), 
      donner.glm)
A anova: 2 × 5
  Resid. Df Resid. Dev Df Deviance   Pr(>Chi)
1        43   56.29072 NA       NA         NA
2        42   51.25628  1 5.034437 0.02484816

Corresponding \(p\)-value:

1 - pchisq(5.03, 1) # testing ~ 1 + Age vs ~ 1 + Age + Sex
0.024911898430907

Let’s compare this with the Wald test:

summary(donner.glm)
Call:
glm(formula = Y ~ Age + Sex, family = binomial(), data = donner.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.23041    1.38686   2.329   0.0198 *
Age         -0.07820    0.03728  -2.097   0.0359 *
SexMale     -1.59729    0.75547  -2.114   0.0345 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.256  on 42  degrees of freedom
AIC: 57.256

Number of Fisher Scoring iterations: 4

Diagnostics#

  • Similar to least squares regression, except the residuals used are usually deviance residuals

    \[r_i = \text{sign}(Y_i-\widehat{\pi}_i) \sqrt{DEV(\widehat{\pi}_i|Y_i)}.\]
  • These agree with the usual residuals in least squares regression.
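The formula for \(r_i\) can be verified against `residuals(fit, type='deviance')` on any binary fit; a sketch on synthetic data:

```r
# Sketch: deviance residuals by hand vs residuals(fit, type = 'deviance').
# Synthetic binary data, purely for illustration.
set.seed(191)
x <- rnorm(40)
y <- rbinom(40, 1, plogis(x))
fit <- glm(y ~ x, family = binomial())
pi.hat <- fitted(fit)
dev.i <- -2 * (y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
r <- sign(y - pi.hat) * sqrt(dev.i)
max(abs(r - residuals(fit, type = 'deviance')))  # essentially zero
```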

par(mfrow=c(2,2))
plot(donner.glm)

influence.measures(donner.glm)
Influence measures of
	 glm(formula = Y ~ Age + Sex, family = binomial(), data = donner.df) :

    dfb.1_  dfb.Age dfb.SxMl   dffit cov.r   cook.d    hat inf
1  -0.0828  0.09169  -0.0900 -0.2341 1.047 0.015372 0.0492    
2   0.0228  0.13667  -0.2285  0.3671 1.100 0.038310 0.1025    
3  -0.2909  0.32202   0.2497  0.4314 0.916 0.094717 0.0567    
4   0.0308 -0.03411  -0.0974 -0.1648 1.065 0.006840 0.0387    
5   0.0047 -0.00520  -0.0971 -0.1763 1.058 0.008042 0.0389    
6   0.0972 -0.10766  -0.0835 -0.1442 1.111 0.004763 0.0567    
7   0.0701 -0.23190   0.1880 -0.3990 1.146 0.043320 0.1310    
8   0.0599 -0.06636  -0.0309 -0.0697 1.146 0.001030 0.0670    
9   0.0518 -0.05738  -0.0259 -0.0598 1.143 0.000756 0.0628    
10  0.0701 -0.23190   0.1880 -0.3990 1.146 0.043320 0.1310    
11 -0.3939  0.23169   0.4066 -0.4886 0.955 0.109490 0.0780    
12 -0.0071  0.00786   0.1466  0.2663 0.986 0.024482 0.0389    
13  0.0047 -0.00520  -0.0971 -0.1763 1.058 0.008042 0.0389    
14 -0.0828  0.09169  -0.0900 -0.2341 1.047 0.015372 0.0492    
15  0.1534 -0.10271  -0.1414  0.1748 1.139 0.006983 0.0802    
16  0.1543 -0.09959  -0.1473  0.1801 1.136 0.007465 0.0794    
17 -0.0071  0.00786   0.1466  0.2663 0.986 0.024482 0.0389    
18  0.1342 -0.10552  -0.1022  0.1391 1.155 0.004259 0.0839    
19  0.1005 -0.25632   0.1656 -0.3986 1.172 0.041875 0.1438    
20  0.0744 -0.08240  -0.0408 -0.0880 1.150 0.001656 0.0721    
21  0.1500 -0.10660  -0.1298  0.1644 1.145 0.006103 0.0817    
22  0.1714 -0.18980   0.0614  0.2872 1.076 0.022793 0.0735    
23 -0.0439  0.04860  -0.0940 -0.2054 1.050 0.011427 0.0433    
24  0.0656 -0.07267  -0.0346 -0.0768 1.148 0.001254 0.0694    
25  0.0557 -0.06172   0.1193  0.2608 1.009 0.021757 0.0433    
26  0.1434 -0.15882   0.0768  0.2773 1.055 0.022058 0.0623    
27 -0.0992  0.10982   0.1833  0.2998 0.962 0.034818 0.0403    
28  0.1259 -0.02425  -0.1980  0.2436 1.109 0.014897 0.0786    
29  0.1546 -0.09559  -0.1532  0.1855 1.133 0.007983 0.0787    
30 -0.0522  0.05781   0.1650  0.2793 0.974 0.028470 0.0387    
31 -0.2884  0.31929  -0.0567 -0.4225 1.046 0.059528 0.0928    
32  0.1370 -0.28212   0.1339 -0.3937 1.211 0.039152 0.1626    
33  0.1520 -0.10503  -0.1355  0.1695 1.142 0.006530 0.0810    
34 -0.0439  0.04860  -0.0940 -0.2054 1.050 0.011427 0.0433    
35 -0.4172  0.46184   0.2851  0.5385 0.876 0.187509 0.0685    
36  0.1259 -0.02425  -0.1980  0.2436 1.109 0.014897 0.0786    
37  0.0308 -0.03411  -0.0974 -0.1648 1.065 0.006840 0.0387    
38 -0.0439  0.04860  -0.0940 -0.2054 1.050 0.011427 0.0433    
39 -0.0439  0.04860  -0.0940 -0.2054 1.050 0.011427 0.0433    
40 -0.0439  0.04860  -0.0940 -0.2054 1.050 0.011427 0.0433    
41  0.0308 -0.03411  -0.0974 -0.1648 1.065 0.006840 0.0387    
42  0.0761 -0.08422  -0.0931 -0.1518 1.087 0.005495 0.0453    
43  0.0937 -0.10368   0.1018  0.2648 1.026 0.021407 0.0492    
44 -0.0627  0.06942  -0.0922 -0.2188 1.049 0.013186 0.0459    
45  0.1541 -0.09064  -0.1591  0.1912 1.130 0.008544 0.0780    

Model selection#

  • Models all have AIC scores \(\implies\) stepwise selection can be used easily!

  • Package bestglm has something like leaps functionality (not covered here)

AIC(donner.glm)
null.glm = glm(Y ~ 1, data=donner.df, family=binomial())
upper.glm = glm(Y ~ (.-Status)^2, data=donner.df, family=binomial())
donner_selected = step(null.glm, scope=list(upper=upper.glm), direction='both')
summary(donner_selected)$coef
57.2562845365773
Start:  AIC=63.83
Y ~ 1

       Df Deviance    AIC
+ Age   1   56.291 60.291
+ Sex   1   57.286 61.286
<none>      61.827 63.827

Step:  AIC=60.29
Y ~ Age

       Df Deviance    AIC
+ Sex   1   51.256 57.256
<none>      56.291 60.291
- Age   1   61.827 63.827

Step:  AIC=57.26
Y ~ Age + Sex

          Df Deviance    AIC
+ Age:Sex  1   47.346 55.346
<none>         51.256 57.256
- Sex      1   56.291 60.291
- Age      1   57.286 61.286

Step:  AIC=55.35
Y ~ Age + Sex + Age:Sex

          Df Deviance    AIC
<none>         47.346 55.346
- Age:Sex  1   51.256 57.256
A matrix: 4 × 4 of type dbl
              Estimate Std. Error   z value   Pr(>|z|)
(Intercept)  7.2463848 3.20516725  2.260845 0.02376889
Age         -0.1940741 0.08741539 -2.220137 0.02640948
SexMale     -6.9280495 3.39887166 -2.038338 0.04151614
Age:SexMale  0.1615969 0.09426191  1.714339 0.08646650

Case Study B: birdkeeping and smoking#

  • Outcome: LC, lung cancer status

  • Features:

    1. FM: sex of patient

    2. AG: age (years)

    3. SS: socio-economic status

    4. YR: years of smoking prior to examination / diagnosis

    5. CD: cigarettes per day

    6. BK: whether or not subjects keep birds at home


url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/smoking_birds.csv'
smoking.df = read.csv(url, sep=',', header=TRUE)
smoking.df$LC = smoking.df$LC == 'LungCancer'
head(smoking.df)
A data.frame: 6 × 7
     LC   FM   SS     BK AG YR CD
1  TRUE Male  Low   Bird 37 19 12
2  TRUE Male  Low   Bird 41 22 15
3  TRUE Male High NoBird 43 19 15
4  TRUE Male  Low   Bird 46 24 15
5  TRUE Male  Low   Bird 49 31 20
6  TRUE Male High NoBird 51 24 15
base.glm = glm(LC ~ .,
               data=smoking.df,
               family=binomial())
summary(base.glm)$coef
A matrix: 7 × 4 of type dbl
               Estimate Std. Error     z value     Pr(>|z|)
(Intercept)  0.09195626 1.75464706  0.05240727 0.9582041827
FMMale      -0.56127270 0.53116056 -1.05669122 0.2906525319
SSLow       -0.10544761 0.46884614 -0.22490877 0.8220502474
BKNoBird    -1.36259456 0.41127585 -3.31309159 0.0009227076
AG          -0.03975542 0.03548022 -1.12049525 0.2625027758
YR           0.07286848 0.02648741  2.75106120 0.0059402544
CD           0.02601689 0.02552400  1.01931102 0.3080553359

Reduced model: ~ . - BK#

reduced.glm = glm(LC ~ . - BK, 
               data=smoking.df,
               family=binomial())
anova(reduced.glm, base.glm)
A anova: 2 × 5
  Resid. Df Resid. Dev Df Deviance    Pr(>Chi)
1       141   165.8684 NA       NA          NA
2       140   154.1984  1 11.66997 0.000635172

Further reduction: ~ FM + SS + AG + YR#

reduced2.glm = glm(LC ~ . - BK - CD, 
               data=smoking.df,
               family=binomial())
anova(reduced2.glm, reduced.glm)
A anova: 2 × 5
  Resid. Df Resid. Dev Df  Deviance  Pr(>Chi)
1       142   166.5320 NA        NA        NA
2       141   165.8684  1 0.6636412 0.4152774

Effect of smoking#

  • As in the salary discrimination study, the book first selects a model without BK

  • We’ll start with all possible two-way interactions

full.glm = glm(LC ~ (. - BK)^2, data=smoking.df,
               family=binomial())
summary(full.glm)$coef
A matrix: 16 × 4 of type dbl
                   Estimate   Std. Error     z value  Pr(>|z|)
(Intercept)   -25.905104866 8.969226e+02 -0.02888221 0.9769585
FMMale         19.536848849 8.968999e+02  0.02178264 0.9826213
SSLow          18.217486968 8.969003e+02  0.02031161 0.9837948
AG              0.149767554 1.434516e-01  1.04402869 0.2964721
YR              0.261575427 2.351992e-01  1.11214402 0.2660762
CD              0.085858326 2.404498e-01  0.35707387 0.7210365
FMMale:SSLow  -15.808049058 8.968883e+02 -0.01762544 0.9859377
FMMale:AG      -0.131426404 9.613134e-02 -1.36715462 0.1715768
FMMale:YR       0.144263967 9.406420e-02  1.53367556 0.1251095
FMMale:CD      -0.102850561 9.575896e-02 -1.07405683 0.2827972
SSLow:AG       -0.058669167 9.501203e-02 -0.61749200 0.5369103
SSLow:YR        0.068748587 7.340861e-02  0.93651945 0.3490058
SSLow:CD       -0.084255122 5.797420e-02 -1.45332091 0.1461347
AG:YR          -0.004907733 3.730934e-03 -1.31541662 0.1883699
AG:CD           0.002626962 4.682346e-03  0.56103551 0.5747733
YR:CD          -0.002597374 3.145647e-03 -0.82570428 0.4089719

select.glm = step(full.glm, 
                  direction='backward',
                  trace=FALSE)
 # these effects aren't to be fully trusted - data snooping!
summary(select.glm)$coef
A matrix: 4 × 4 of type dbl
               Estimate Std. Error    z value     Pr(>|z|)
(Intercept)  0.82886135 1.52661870  0.5429393 0.5871715684
FMMale      -0.73637859 0.48914336 -1.5054453 0.1322096220
AG          -0.06362753 0.03359029 -1.8942241 0.0581952757
YR           0.08776499 0.02451891  3.5794810 0.0003442773

  • Let’s see the effect of BK when added to our selected model

bird.glm = glm(LC ~ FM + AG + YR + BK, family=binomial(),
               data=smoking.df)
 # these effects aren't to be fully trusted - data snooping!
anova(select.glm, bird.glm)
A anova: 2 × 5
  Resid. Df Resid. Dev Df Deviance     Pr(>Chi)
1       143   166.5488 NA       NA           NA
2       142   155.3212  1 11.22754 0.0008059231

Probit model#

  • We used \(F(x)=e^x/(1+e^x)\). There are other choices…

  • Probit regression model:

\[\Phi^{-1}(E(Y|X))= \beta_0 + \sum_{j=1}^{p} \beta_j X_{j}\]
  • Above, \(\Phi\) is CDF of \(N(0,1)\), i.e. \(\Phi(t) = {\tt pnorm(t)}\), \(\Phi^{-1}(q) = {\tt qnorm}(q)\).


  • Regression function:

\[\begin{split} \begin{aligned} E(Y|X) &= E(Y|X_1,\dots,X_p) \\ &= P(Y=1|X_1, \dots, X_p) \\ & = {\tt pnorm}\left(\beta_0 + \sum_{j=1}^p \beta_j X_j \right)\\ \end{aligned} \end{split}\]
  • In logit, probit and cloglog \({\text{Var}}(Y|X)=\pi(X)(1-\pi(X))\) but the model for the mean is different.

  • Coefficients no longer have an odds ratio interpretation.


summary(glm(Y ~ Age + Sex,
            data=donner.df, 
            family=binomial(link='probit')))
Call:
glm(formula = Y ~ Age + Sex, family = binomial(link = "probit"), 
    data = donner.df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.91730    0.76438   2.508   0.0121 *
Age         -0.04571    0.02076  -2.202   0.0277 *
SexMale     -0.95828    0.43983  -2.179   0.0293 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 61.827  on 44  degrees of freedom
Residual deviance: 51.283  on 42  degrees of freedom
AIC: 57.283

Number of Fisher Scoring iterations: 5

Generalized linear models: glm#

Given a dataset \((Y_i, X_{i1}, \dots, X_{ip}), 1 \leq i \leq n\) we consider a model for the distribution of \(Y|X_1, \dots, X_p\).

  • If \(\eta_i=g(E(Y_i|X_i)) = g(\mu_i) = \beta_0 + \sum_{j=1}^p \beta_j X_{ij}\) then \(g\) is called the link function for the model.

  • If \({\text{Var}}(Y_i) = \phi \cdot V({\mathbb{E}}(Y_i)) = \phi \cdot V(\mu_i)\) for \(\phi > 0\) and some function \(V\), then \(V\) is called the variance function for the model.

  • Canonical reference: Generalized linear models.

Binary regression as GLM#

  • For a logistic model, \(g(\mu)={\text{logit}}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a probit model, \(g(\mu)=\Phi^{-1}(\mu), \qquad V(\mu)=\mu(1-\mu).\)

  • For a cloglog model, \(g(\mu)=\log(-\log(1-\mu)), \qquad V(\mu)=\mu(1-\mu).\)

  • All of these have dispersion \(\phi=1\).
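These link and variance functions are available in base R as family objects; a short sketch (the probabilities chosen are arbitrary):

```r
# The logit, probit and cloglog links as base-R family objects; for
# each one, linkinv inverts linkfun. Probabilities chosen arbitrarily.
p <- c(0.2, 0.5, 0.8)
links <- list(logit = binomial('logit'),
              probit = binomial('probit'),
              cloglog = binomial('cloglog'))
eta <- sapply(links, function(fam) fam$linkfun(p))    # g applied to p
back <- sapply(links, function(fam) fam$linkinv(fam$linkfun(p)))
eta
back  # every column recovers p
```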