Models for matched pairs (11.1)¶
vote_data = matrix(c(175, 16, 54, 188), 2, 2, byrow=TRUE) # Table 11.1
colnames(vote_data) = c('D_2008', 'R_2008')
rownames(vote_data) = c('D_2004', 'R_2004')
vote_data
D_2008 | R_2008 | |
---|---|---|
D_2004 | 175 | 16 |
R_2004 | 54 | 188 |
Q: has party affiliation shifted?¶
Null is marginal homogeneity
Difference is
Estimate
Standard error (under
)
Cochran-Mantel-Haenszel test (6.4)¶
Each subject produces a 2x2 table with rows being years, columns being political parties.
E.g. a Bush supporter in 2004 who changed votes to Obama would be
voter = matrix(c(0,1,1,0), 2, 2)
rownames(voter) = c(2004, 2008)
colnames(voter) = c('D', 'R')
voter
D | R | |
---|---|---|
2004 | 0 | 1 |
2008 | 1 | 0 |
We can thus form a
table representing each voter’s data.Marginal homogeneity
conditional independence in table.
Combining multiple tables¶
Let
be a table.Moments:
Cochran’s variation has slightly different variance…
Test statistic¶
McNemar’s test¶
Cochran-Mantel-Haenszel with one stratum per subject (433 subjects) on
table. : whether vote is Republican or Democrat : year of vote : subject labelApplying CMH: no variation when both votes are
R
orD
: only discordant pairs contribute.
pi_2008 = (175 + 54) / (175 + 16 + 54 + 188) # total is 433
pi_2004 = (175 + 16) / (175 + 16 + 54 + 188)
delta = pi_2004 - pi_2008
SE = sqrt(54 + 16) / (175 + 16 + 54 + 188)
Z = delta / SE
Z
We’ve observed 70 discordant pairs (people who changed votes).
Under
,Exact p-value can be computed using e.g.
pbinom
.
pbinom(16, 54 + 16, 0.5)
Z_mcnemar = (16 - 0.5 * (54 + 16)) / sqrt(0.5 * 0.5 * (54+16))
Z_mcnemar
c(1 - pchisq(Z_mcnemar^2, 1), 2 * pbinom(16, 54 + 16, 0.5))
- 5.5757761732167e-06
- 5.8539555293881e-06
Conditional logistic regression (11.2)¶
Data can be represented as
where are votes in 2004 and are votes in 2008 (with, say, if vote isD
).Might model these as
This is marginal model.
Subject specific model¶
Might model these as $
$This has 434 parameters!
Can’t expect to be easy to fit, or standard MLE theory to work.
Eliminate
’s by conditioning – this is yields a conditional likelihood similar to those we saw when discussing pseudolikelihood.
Full likelihood is proportional to
Sufficient statistic for
isConditioning on
eliminates from the likelihood.
No information when
is 0 or 2. Focuses again on discordant pairs…For discordant pairs, let
Conditional likelihood $
$This is logistic likelihood intercept only (one sample Binomial likelihood) $
$
Including covariats¶
Suppose the pairs have
-dimensional features .Similar conditioning and calculation yields product over discordant pairs $
$A standard logistic likelihood with features
.
Square tables (11.4)¶
An
table often has (essentially) same row names as column names.There are several models of
or, thinking of log-linear model
Marginal homogeneity
or
Marginal homogeneity generally not expressible as log-linear model.
Symmetry¶
Model:
with
Note: symmetry implies marginal homogeneity (converse not generally true).
Model fit
Goodness of fit test with
being symmetry
Quasi-independence¶
Model: $
$Exact independence is
.A lowish dimensional departure from independence.
Agreement between raters¶
agreement = matrix(c(22, 2, 2, 0,
5, 7,14, 0,
0, 2,36, 0,
0, 1,17,10),
4, 4,
byrow=TRUE)
agreement
22 | 2 | 2 | 0 |
5 | 7 | 14 | 0 |
0 | 2 | 36 | 0 |
0 | 1 | 17 | 10 |
Quasi-independence model might be reasonable here: raters agree on easy to classify objects (diagonal is higher than independence)
carcinoma = data.frame(N=as.numeric(agreement), # Table 11.8
A=factor(c(rep(1,4), rep(2,4), rep(3,4), rep(4,4))),
B=factor(rep(1:4,4)))
indep_fit = glm(N ~ A + B, family=poisson, data=carcinoma)
indep_resid = matrix(resid(indep_fit), 4, 4)
indep_resid # positive on diagonal
5.0439297 | -0.4140653 | -4.2770661 | -2.099233 |
-0.4002454 | 2.2177506 | -0.3128405 | -2.099233 |
-4.1701116 | -1.0460256 | 2.6798251 | -2.537849 |
-3.5796032 | -1.2657260 | 0.1540102 | 3.676325 |
departure = factor((carcinoma$A == carcinoma$B) * as.numeric(carcinoma$A))
quasi_indep_fit = glm(N ~ A + B + departure, family=poisson, data=carcinoma)
quasi_indep_resid = matrix(resid(quasi_indep_fit), 4, 4)
quasi_indep_resid # residuals are 0 on diagonal
5.161914e-08 | 1.1958607 | -0.7480157 | -6.169760e-05 |
1.486609e+00 | 0.0000000 | -0.6636398 | -1.395289e-04 |
-1.236669e+00 | 0.6307957 | 0.0000000 | -7.928944e-05 |
-1.932544e+00 | -1.3509531 | 1.0251702 | 0.000000e+00 |
anova(indep_fit, quasi_indep_fit)
Resid. Df | Resid. Dev | Df | Deviance | |
---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | |
1 | 9 | 117.95686 | NA | NA |
2 | 5 | 13.17806 | 4 | 104.7788 |
Cohen’s ¶
Cohen’s
See textbook for estimate of variance of sample version
row_margin = apply(agreement, 2, sum) / sum(agreement)
col_margin = apply(agreement, 1, sum) / sum(agreement)
kappa_hat = (sum(diag(agreement)) / sum(agreement) - sum(row_margin * col_margin)) / (1 - sum(row_margin * col_margin))
kappa_hat
Paired preferences (Bradley-Terry model, 11.6)¶
baseball = matrix(c(NA,12, 6,10,10,
6,NA, 9,11,13,
12, 9,NA,12, 9,
8, 7, 6,NA,12,
8, 5, 9, 6,NA),
5, 5, byrow=TRUE)
rownames(baseball) = c('Boston',
'New York',
'Tampa Bay',
'Toronto',
'Baltimore')
colnames(baseball) = rownames(baseball)
baseball
Boston | New York | Tampa Bay | Toronto | Baltimore | |
---|---|---|---|---|---|
Boston | NA | 12 | 6 | 10 | 10 |
New York | 6 | NA | 9 | 11 | 13 |
Tampa Bay | 12 | 9 | NA | 12 | 9 |
Toronto | 8 | 7 | 6 | NA | 12 |
Baltimore | 8 | 5 | 9 | 6 | NA |
Let
denote probability team beats team .Model
Bradley-Terry model is an example of quasi-symmetry.
Paired preferences¶
library(BradleyTerry2)
processed_baseball = BradleyTerry2::countsToBinomial(baseball)
processed_baseball
Attaching package: ‘BradleyTerry2’
The following object is masked _by_ ‘.GlobalEnv’:
baseball
player1 | player2 | win1 | win2 |
---|---|---|---|
<fct> | <fct> | <dbl> | <dbl> |
Boston | New York | 12 | 6 |
Boston | Tampa Bay | 6 | 12 |
Boston | Toronto | 10 | 8 |
Boston | Baltimore | 10 | 8 |
New York | Tampa Bay | 9 | 9 |
New York | Toronto | 11 | 7 |
New York | Baltimore | 13 | 5 |
Tampa Bay | Toronto | 12 | 6 |
Tampa Bay | Baltimore | 9 | 9 |
Toronto | Baltimore | 12 | 6 |
BradleyTerry2::BTm(cbind(win1, win2),
player1=player1,
player2=player2,
data=processed_baseball, refcat='Baltimore')
Bradley Terry model fit by glm.fit
Call: BradleyTerry2::BTm(outcome = cbind(win1, win2), player1 = player1,
player2 = player2, refcat = "Baltimore", data = processed_baseball)
Coefficients:
..Boston ..New York ..Tampa Bay ..Toronto
0.4539 0.4991 0.6354 0.2287
Degrees of Freedom: 10 Total (i.e. Null); 6 Residual
Null Deviance: 13.18
Residual Deviance: 7.703 AIC: 48.66
Can break down games into home and away
Let
denote probability home team beats away teamModel $
$