ANOVA#

Download#

Outline#

  • ANOVA models: variables with only categorical variables

  • Case studies:

    A. Seaweed regeneration time based on grazers

    B. Pygmalion effect

Case study A: Seaweed grazers#

url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/seaweed_grazers.csv'
seaweed.df = read.csv(url, sep=',', header=TRUE)
head(seaweed.df)
A data.frame: 6 × 3
CoverBlockTreat
<int><chr><chr>
114B1C
223B1C
322B2C
435B2C
567B3C
682B3C
  • Treat: which animals were allowed into graze

  • Block: which of 8 blocks measured

  • Cover: what percentage of each block had seaweed covered at time of measurement

Case study B: Pygmalion effect#

url = 'https://raw.githubusercontent.com/StanfordStatistics/stats191-data/main/Sleuth3/pygmalion.csv'
pygmalion.df = read.csv(url, sep=',', header=TRUE)
head(pygmalion.df)
A data.frame: 6 × 3
CompanyTreatScore
<chr><chr><dbl>
1C1Pygmalion80.0
2C1Control 63.2
3C1Control 69.2
4C2Pygmalion83.9
5C2Control 63.1
6C2Control 81.5
  • Company: ID of company

  • Treat: whether Pygmalion or Control

  • Score: outcome of interest

Additive model#

pygm_add.lm = lm(Score ~ Treat + Company, data=pygmalion.df)
anova(pygm_add.lm)
A anova: 3 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
Treat 1327.3410327.341027.5685400.01314152
Company 9682.5171 75.835241.7534070.14844454
Residuals18778.5039 43.25022 NA NA

Saturated model#

pygm_sat.lm = lm(Score ~ Treat + Company + Treat:Company, data=pygmalion.df)
anova(pygm_sat.lm)
A anova: 4 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
Treat1327.3410327.341026.30795890.03322533
Company9682.5171 75.835241.46136760.29050766
Treat:Company9311.4639 34.607100.66688920.72211560
Residuals9467.0400 51.89333 NA NA

Graphical check of additivity#

with(pygmalion.df, 
   interaction.plot(Company, Treat, Score,
                    type='p',
                    col=c('red', 'blue'),
		    lwd=2))
../../_images/8c914b85fb08f3b700f6e1ffeb0dd24bebc2ad7dd9a650290e272a6588c44a6c.png

Two-way ANOVA model#

  • Generalization of \(t\)-test: more than one grouping variable.

  • Two-way ANOVA model:

    • \(r=8\) groups in first factor (Block)

    • \(m=6\) groups in second factor (Treat)

    • \(n_{ij}\) in each combination of factor variables.

  • Model: $\(Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \varepsilon_{ijk} , \qquad \varepsilon_{ijk} \sim N(0, \sigma^2).\)$

  • In seaweed.df, \(r=8\), \(m=6\), \(n_{ij}=2\) for all \((i,j)\).

Parameterization#

  • Some constraints are needed, again for identifiability. Let’s not worry too much about the details…

  • Constraints:

    • \(\sum_{i=1}^r \alpha_i = 0\)

    • \(\sum_{j=1}^m \beta_j = 0\)

    • \(\sum_{j=1}^m (\alpha\beta)_{ij} = 0, 1 \leq i \leq r\)

    • \(\sum_{i=1}^r (\alpha\beta)_{ij} = 0, 1 \leq j \leq m.\)

ANOVA table#

Source

df

E(MS)

A

r-1

\(\sigma^2 + nm\frac{\sum_{i=1}^r \alpha_i^2}{r-1}\)

B

m-1

\(\sigma^2 + nr\frac{\sum_{j=1}^m \beta_j^2}{m-1}\)

A:B

(m-1)(r-1)

\(\sigma^2 + n\frac{\sum_{i=1}^r\sum_{j=1}^m (\alpha\beta)_{ij}^2}{(r-1)(m-1)}\)

Error

(n-1)mr

\(\sigma^2\)

Tests using the ANOVA table#

  • Rows of the ANOVA table can be used to test various of the hypotheses we started out with.

  • Under \(H_0:\) no interactions

\[F = \frac{MSAB}{MSE} = \frac{\frac{SSAB}{(m-1)(r-1)}}{\frac{SSE}{(n-1)mr}} \sim F_{(m-1)(r-1), (n-1)mr}\]

Test of additivity#

  • \(H_0\): pygm_add.lm is the true model

anova(pygm_add.lm, pygm_sat.lm)
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
118778.5039NA NA NA NA
2 9467.0400 9311.46390.66688920.7221156
  • \(F\)-statistic \(\approx 0.67\) – very little evidence against \(H_0\).

  • Note: this \(F\)-statistic also appears in anova(pygm_sat.lm).

Case study A: seaweed regeneration#

seaweed.df$Block = factor(seaweed.df$Block, levels=c('B1','B2','B7','B5', 'B8', 'B6', 'B3', 'B4'))
  • Before calling interaction.plot we’ll order Block as book does for Fig 13.9: ordered by total response.

Interaction plot for seaweed#

with(seaweed.df,
   interaction.plot(Block, Treat, Cover,
                    type='b',
                    col=c('red', 'blue'),
		    lwd=2))
../../_images/904826eb000d8389f74b90ebd6642117421f890fc1c35cb28e52508750de3d6f.png

No transformation#

seaweed_sat.lm = lm(Cover ~ Block + Treat + Treat:Block, data=seaweed.df)
par(mfrow=c(2,2))
plot(seaweed_sat.lm)
../../_images/677fdfd26d1406595075b6e920c90b0247dca9b0a7329bec0a12052ac12f73e6.png

logit transformed#

seaweed.df$lCover = with(seaweed.df, log(Cover / (100 - Cover)))
seaweed_lsat.lm = lm(lCover ~ Block + Treat + Treat:Block, data=seaweed.df)
par(mfrow=c(2,2))
plot(seaweed_lsat.lm)
../../_images/4fdffb36f84b23a3f06e2821a614f278d9d723aac087ac1e0f9b355bac84510c.png

Interaction plot for logit transformed data#

with(seaweed.df,
   interaction.plot(Block, Treat, lCover,
                    type='b',
                    col=c('red', 'blue'),
		    lwd=2))
../../_images/cee55b9df9d3663b3bcb6ac499293b0d88dc98b543e0b2b5a5cac5265cfa069f.png

ANOVA table#

anova(seaweed_lsat.lm)
A anova: 4 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
Block 776.2386110.891230435.9633715.390313e-17
Treat 596.9932219.398644564.0552654.510728e-20
Block:Treat3515.23041 0.4351546 1.4369021.209129e-01
Residuals4814.53643 0.3028423 NA NA

Another test for Treat#

seaweed_ladd.lm = lm(lCover ~ Block + Treat, data=seaweed.df)
anova(seaweed_ladd.lm)
A anova: 3 × 5
DfSum SqMean SqF valuePr(>F)
<int><dbl><dbl><dbl><dbl>
Block 776.2386110.891230430.368432.133837e-20
Treat 596.9932219.398644554.089971.094513e-24
Residuals8329.76684 0.3586366 NA NA
  • Difference between the two \(F\)-statistics: estimate of \(\sigma^2\)

Inference about specific linear combinations#

Do large fish have an effect on the regeneration ratio? If so, how much? (see Display 13.13)#

linear_comb = 0 * coef(seaweed_ladd.lm)
linear_comb['TreatLfF'] = 0.5
linear_comb['TreatfF'] = 0.5
linear_comb['TreatLf'] = -0.5
linear_comb['Treatf'] = -0.5
estimate = sum(linear_comb * coef(seaweed_ladd.lm))

Standard error#

var_est = sum(linear_comb * (vcov(seaweed_ladd.lm) %*% linear_comb))
SE = sqrt(var_est)
T_stat = estimate / SE
pval = 2 * pt(abs(T_stat), seaweed_ladd.lm$df.resid, lower=FALSE)
c(estimate=estimate, T_stat=T_stat, SE=SE, pval=pval)
estimate
-0.614025727666025
T_stat
-4.10127815915699
SE
0.149715699310733
pval
9.53986852987613e-05

Some caveats about R formulae#

While we see that it is straightforward to form the interactions test using our usual anova function approach, we generally cannot test for main effects by this approach.

lm1 = lm(lCover ~ Treat + Block:Treat, data=seaweed.df)
anova(lm1, seaweed_lsat.lm)
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
14814.53643NA NANANA
24814.53643 0-1.065814e-14NANA
sum((resid(lm1) - resid(seaweed_lsat.lm))^2)
6.59262517602417e-28