SOLUTIONS
Startup Review Problems Ed257
D Rogosa
January 1999
1. (Reference: NWK Sec. 16.8, pp. 681-5)
The ANOVA table is as follows, with calculations below.
SOURCE     SS    df   MS
Between    80     4   20
Within    400    40   10
Total     480    44
SSW = SST - SSB = 480 - 80 = 400
df(within) = total n - (# of groups) = (44 + 1) - (4 + 1) = 40
or df(total) - df(between) = 44 - 4 = 40
MS = SS/df so that MSB = 80/4 = 20 and MSW = 400/40 = 10.
The omnibus null hypothesis is Ho: mu(1) = mu(2) = ... = mu(5)
i.e., that all 5 population means are equal, versus an alternative hypothesis
that not all are equal. The test statistic is the ratio of the mean squares,
F = MSB/MSW = 20/10 = 2.
The critical value for Type I error rate .10 is F(0.90, 4,40) = 2.09,
(Table A.4 rough interpolation)
Or use Minitab:
MTB >invcdf .90;
SUBC>f 4 40.
0.9000 2.0909
Since 2<2.09, we do not reject the null hypothesis.
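The arithmetic for this table can be checked with a few lines of Python. This is only a sketch, not part of the Minitab session; the variable names are made up for illustration, and the critical value is copied from the invcdf output above rather than recomputed.

```python
# Problem 1: fill in the one-way ANOVA table from SSB, SST and the df.
ss_between, ss_total = 80.0, 480.0
df_between, df_total = 4, 44

ss_within = ss_total - ss_between        # SSW = SST - SSB
df_within = df_total - df_between        # df(within) = df(total) - df(between)
ms_between = ss_between / df_between     # MS = SS/df
ms_within = ss_within / df_within
f_stat = ms_between / ms_within          # test statistic MSB/MSW

# Critical value F(0.90, 4, 40), copied from the Minitab output (not recomputed).
f_crit = 2.0909
reject = f_stat > f_crit

print("SSW =", ss_within, " df(within) =", df_within)
print("MSB =", ms_between, " MSW =", ms_within, " F =", f_stat)
print("reject Ho at the .10 level:", reject)
```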
NOTE: since subscripts cannot be displayed in this text mode, we will usually
employ parentheses to indicate subscripts, e.g., mu(1).

2.
a) The model for this problem is as follows:
Y(ij) = mu + alpha(i) + epsilon(ij)
where
i = 1,2,3 (3 groups)
j = 1,2, ... n(i) where n(1)=12, n(2)=14, n(3)=11
Y(ij) = jth employee's response in the ith group
mu = overall mean
alpha(i) = effect of the ith group
epsilon(ij) = random error (individual differences) associated with
the jth employee in the ith group
(Refer to NWK formula (16.62) on p. 693.)
(An alternative model in terms of the cell means rather
than main effects could be written:
Y(ij) = mu(i) + epsilon(ij)
where
i = 1,2,3
j = 1,2,...,n(i)
Y(ij) = jth employee's response in the ith group
mu(i) = mean of the ith group
epsilon(ij) = random error associated with the jth employee in the
ith group
)
b) We are given
           G(1)   G(2)   G(3)
n(i)        12     14     11
y(i)bar    25.2   32.6   28.1   (sample means)
s(i)^2      3.6    4.8    5.3   (sample variances)
n = 12+14+11= 37
First calculate the grand mean. Weight each group mean by its sample size,
add, and divide by total n:
ybar = [25.2(12) + 32.6(14) + 28.1(11)]/(12+14+11) = 1067.9/37 = 28.86
Degrees of freedom between is 2, and within is 34.
Form SSB by deviating the group means from the grand mean (28.86),
squaring the deviations, multiplying by the group size, and summing
over the three groups:
SSB = 12(25.2-28.86)^2 + 14(32.6-28.86)^2 + 11(28.1-28.86)^2 = 362.93
MSB = SSB/2 = 362.93/2 = 181.46.
Now, SSW = (n(1)-1)s(1)^2 + (n(2)-1)s(2)^2 + (n(3)-1)s(3)^2
= 11(3.6) + 13(4.8) + 10(5.3)
= 155
MSW is the weighted average of the within-group variances, with
weights n(i)-1. It is found by dividing SSW by df(within):
155/34 = 4.559.
Hence the ANOVA table is
SOURCE       SS     df      MS
Between   362.93     2   181.46
Within    155.00    34    4.559
Total     517.93    36
Test statistic = MSB/MSW = 181.46/4.559 = 39.80
The 99th percentile point of F(2,34) is about 5.30
(by simple interpolation, since F(0.99,2,30)=5.39 and F(0.99,2,40)=5.18).
Since 39.80 > 5.30, we reject the null hypothesis of equal means in all
groups.
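The steps in part b) can be sketched in Python as a check on the hand arithmetic. This is an illustration only; it uses nothing beyond the group sizes, means, and variances given in the problem, and the variable names are assumptions of the sketch.

```python
# Problem 2(b): one-way ANOVA built from group sizes, means and variances.
n = [12, 14, 11]
means = [25.2, 32.6, 28.1]
variances = [3.6, 4.8, 5.3]

n_total = sum(n)
k = len(n)

# Grand mean: group means weighted by sample size.
grand_mean = sum(ni * m for ni, m in zip(n, means)) / n_total

# SSB: squared deviations of group means from the grand mean, times n(i).
ss_between = sum(ni * (m - grand_mean) ** 2 for ni, m in zip(n, means))

# SSW: each sample variance times its degrees of freedom n(i) - 1.
ss_within = sum((ni - 1) * v for ni, v in zip(n, variances))

df_between = k - 1
df_within = n_total - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"grand mean = {grand_mean:.3f}")
print(f"Between  SS = {ss_between:.2f}  df = {df_between}  MS = {ms_between:.2f}")
print(f"Within   SS = {ss_within:.2f}  df = {df_within}  MS = {ms_within:.3f}")
print(f"F = {f_stat:.2f}")
```

The critical value still has to come from a table or from Minitab's invcdf, since the sketch deliberately avoids any statistics library.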

3.
a) MTB > read '/usr/class/ed257/HW/knee.dat' c1 c2
24 ROWS READ
ROW C1 C2
1 29 1
2 42 1
3 38 1
4 40 1
. . .
MTB > describe c1;
SUBC> by c2.
       C2    N    MEAN  MEDIAN  TRMEAN  STDEV  SEMEAN
C1      1    8   38.00   40.00   38.00   5.48    1.94
        2   10   32.00   31.00   31.62   3.46    1.10
        3    6   24.00   22.50   24.00   4.43    1.81

       C2     MIN     MAX      Q1      Q3
C1      1   29.00   43.00   32.00   42.00
        2   28.00   39.00   29.00   35.00
        3   20.00   32.00   20.75   27.50
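The group means are 38, 32, and 24 for the below average, average, and
above average groups, respectively. The variances (squared SDs) are about
30.0, 12.0, and 19.6; squaring the two-decimal SDs printed above gives the
slightly different values 30.03, 11.97, and 19.62.
(Note: The group means and SDs are also displayed under the ANOVA table.)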
b)
MTB > dotplot c1;
SUBC> by c2.
[Character dotplots of C1, one for each level of C2: 1 (below average),
2 (average), 3 (above average), on a common C1 axis running from 20.0
to 45.0.]
These plots illustrate the clustering of the observations in each
group about the group means. The small sample sizes make it difficult
to detect outliers or heteroskedasticity (unequal group variances),
although the observations in the below average group appear to be
somewhat more spread out than are those in the other groups.
c)
MTB > oneway c1 c2 resids in c3 fits in c4;
SUBC> tukey.
(Note: the above command tells Minitab to store the residuals in C3
and the fitted values (which are just the group means) in C4. The
words "resids in" and "fits in" are unnecessary; could just write
MTB >oneway c1 c2 c3 c4)
ANALYSIS OF VARIANCE ON C1
SOURCE   DF       SS      MS       F      p
C2        2    672.0   336.0   16.96  0.000
ERROR    21    416.0    19.8
TOTAL    23   1088.0
INDIVIDUAL 95 PCT CI'S FOR MEAN BASED ON POOLED STDEV
LEVEL    N     MEAN   STDEV
1        8   38.000   5.477
2       10   32.000   3.464
3        6   24.000   4.427
POOLED STDEV = 4.451
[Character display of the three confidence intervals, on an axis marked
24.0, 30.0, 36.0.]
The omnibus null hypothesis is
Ho: mu(1)=mu(2)=mu(3)
We test this against the alternative
Ha: not all mu(i) are equal
Test statistic is MSB/MSW = 336/19.8 = 16.96.
Find critical value F(.95,2,21):
MTB > invcdf .95;
SUBC> f 2 21.
0.9500 3.4668
Since 16.96 > 3.4668, we reject the omnibus null hypothesis and
conclude that there are differences among the three groups.
d) Resids are stored in C3 & fits in C4, from oneway command above.
MTB > plot c3 c4
[Character plot of the residuals (C3, vertical axis, about -6.0 to 6.0)
against the fitted values (C4, horizontal axis, 25.0 to 37.5).]
We could also plot C3 against C2, or produce aligned dotplots of the
residuals for each group.
Here's how to obtain residuals the long way (remember residuals are
just the differences between each observation and the group mean):
MTB > unstack c1 c3-c5;
SUBC> subscripts c2.
MTB > let c6=c3-mean(c3)
MTB > let c7=c4-mean(c4)
MTB > let c8=c5-mean(c5)
MTB > stack c6-c8 c9
MTB > plot c9 c2.
The plots suggest that the variability of the observations in the
below average group is greater than that in the other groups (the
dotplots and a quick look at the descriptive statistics support this).
Since the sample sizes are also somewhat unequal, a very careful analysis
would use a oneway anova method that does not require the equal variance
assumption, such as the one in BMDP7D, which we illustrated with the
IBS data.
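For readers who want to see the "long way" residual computation outside Minitab, here is a minimal Python sketch. The data values below are hypothetical and are NOT the contents of knee.dat; only the method (observation minus its group mean) comes from the text.

```python
from collections import defaultdict

# Hypothetical responses and group codes (not the knee.dat values).
y = [30, 41, 37, 40, 31, 33, 35, 22, 24, 26]
group = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]

# Collect the observations by group, like Minitab's unstack.
by_group = defaultdict(list)
for yi, gi in zip(y, group):
    by_group[gi].append(yi)

# The group means are the fitted values of the one-way model.
group_mean = {g: sum(v) / len(v) for g, v in by_group.items()}
fits = [group_mean[gi] for gi in group]

# Residual = observation minus its own group mean.
resids = [yi - fi for yi, fi in zip(y, fits)]

# Within-group sample variances, for an informal look at equal spread.
group_var = {
    g: sum((yi - group_mean[g]) ** 2 for yi in v) / (len(v) - 1)
    for g, v in by_group.items()
}

for g in sorted(by_group):
    print(f"group {g}: mean = {group_mean[g]:.2f}, variance = {group_var[g]:.2f}")
print("residuals:", resids)
```

Because each residual is taken about its own group mean, the residuals sum to zero within every group, which is a quick sanity check on the computation.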

PROBLEM 4
Could You Get In?
a. Median and Quartiles of GPA
Using just the scatterplot from the problem, we can get a
rough graphical answer:
Note that each tick mark represents an increment of 0.16.
Median = (10th + 11th)/2 = (2.24 +2.56)/2 = 2.4
Q1= (5th + 6th)/2 = (1.92 + 2.08)/2 = 2.0
Q3= (15th + 16th)/2 = (3.04 + 3.04)/2 = 3.04
We should mention that alternative formulas for quartiles will
produce slightly different results; e.g., the formulas
Q1 = x[(n+1)/4] (integer part) and Q3 = x[3*(n+1)/4] (integer part)
produce 1.92 for Q1.
If you use the actual data in the indicated file 95revp1.dat
you should obtain from
MTB> describe c1
values 2.400 for median, 1.925 for Q1, and 3.075 for
Q3.
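The different quartile rules can be compared side by side in a short Python sketch. It uses only the six ordered values read off the scatterplot above (n = 20). The interpolated rule at positions (n+1)/4 and 3(n+1)/4 is included as an assumption about how a package such as Minitab might arrive at values between two observations; it is not confirmed by the text.

```python
import math

n = 20
# Ordered values read off the scatterplot: position -> GPA.
x = {5: 1.92, 6: 2.08, 10: 2.24, 11: 2.56, 15: 3.04, 16: 3.04}


def at_position(pos):
    """Value at a possibly fractional position, interpolating linearly."""
    lo = math.floor(pos)
    frac = pos - lo
    if frac == 0:
        return x[lo]
    return x[lo] + frac * (x[lo + 1] - x[lo])


# Rule used in the solution: average the middle pair of each half.
median = (x[10] + x[11]) / 2
q1_halves = (x[5] + x[6]) / 2
q3_halves = (x[15] + x[16]) / 2

# Integer-part rule mentioned in the text.
q1_int = x[(n + 1) // 4]
q3_int = x[3 * (n + 1) // 4]

# Interpolated rule (assumption): positions 5.25 and 15.75.
q1_interp = at_position((n + 1) / 4)
q3_interp = at_position(3 * (n + 1) / 4)

print("median:", median)
print("Q1 by halves / integer part / interpolated:", q1_halves, q1_int, q1_interp)
print("Q3 by halves / integer part / interpolated:", q3_halves, q3_int, q3_interp)
```

Values read from a graph are only approximate, so none of these rules need reproduce the describe output from the actual data file exactly.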
b. Fit for Test = 5.0:
GPA = -1.70 + 0.840*Test
= -1.70 + 0.840*5.0
= 2.5
Fit for Test = 6.0:
GPA = -1.70 + 0.840*6.0 = 3.34
The observed GPA at 6.0 was 3.36.
The residual equals the observed value minus the fitted value:
3.36 - 3.34 = 0.02
c. The regression line passes through the point of sample means. So
Sample Mean GPA = -1.70 + 0.840*5
= 2.5
d. In the two-variable case, the correlation is the square root of the
R-squared value (expressed in decimal form), taken with the sign of the
slope. The slope is positive here, so
corr(GPA, Test) = sqrt(0.654) = 0.81
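Parts b through d can be checked with a small Python sketch. It takes the intercept to be -1.70, since that is the value that makes the fitted values 2.5 and 3.34 come out; the function and variable names are made up for the illustration.

```python
import math

# Fitted line GPA = intercept + slope * Test (intercept taken as negative).
intercept, slope = -1.70, 0.840


def fitted_gpa(test_score):
    """Fitted GPA for a given test score."""
    return intercept + slope * test_score


fit_5 = fitted_gpa(5.0)
fit_6 = fitted_gpa(6.0)

# Residual = observed minus fitted, for the observation at Test = 6.0.
observed_6 = 3.36
residual_6 = observed_6 - fit_6

# With one predictor, |r| = sqrt(R-squared); r takes the sign of the slope.
r_squared = 0.654
corr = math.copysign(math.sqrt(r_squared), slope)

print(f"fit at 5.0 = {fit_5:.2f}, fit at 6.0 = {fit_6:.2f}")
print(f"residual at 6.0 = {residual_6:.2f}")
print(f"corr(GPA, Test) = {corr:.2f}")
```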
______________________________________________________________________
Problem 5
An alternative to transformations is to fit a polynomial.
MTB > name c7 'dayssq'
MTB > let c7 = c1*c1
MTB > regress 'size' 2 c1 c7 c20 c21
The regression equation is
size = 8.97 - 1.37 days + 0.0588 dayssq
Predictor       Coef      Stdev   t-ratio      p
Constant       8.972      4.119      2.18  0.057
days         -1.3750     0.3213     -4.28  0.002
dayssq      0.058771   0.005858     10.03  0.000
s = 1.278   R-sq = 99.5%   R-sq(adj) = 99.4%
Analysis of Variance
SOURCE       DF       SS       MS       F      p
Regression    2   2846.7   1423.3  871.59  0.000
Error         9     14.7      1.6
Total        11   2861.4
SOURCE   DF   SEQ SS
days      1   2682.3
dayssq    1    164.4
Unusual Observations
Obs.  days    size     Fit  Stdev.Fit  Residual  St.Resid
 10   35.0  30.300  32.842      0.518    -2.542    -2.18R
R denotes an obs. with a large st. resid.
Notice the t statistic for the coefficient of the quadratic term
in this model. It is highly significant, indicating a substantial quadratic
component to these data. (Also notice the success of this model in
general.)
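The pieces of the regression output can be tied together with a short Python sketch. It assumes the coefficient of days is negative (-1.3750), which is the sign that reproduces the printed fit of 32.842 for observation 10; all numbers are copied from the output above and the names are illustrative.

```python
# Coefficients and standard errors copied from the Minitab output.
# The days coefficient is taken as negative (see the note above).
coef = {"Constant": 8.972, "days": -1.3750, "dayssq": 0.058771}
stdev = {"Constant": 4.119, "days": 0.3213, "dayssq": 0.005858}

# t-ratio = coefficient / standard error.
t_ratio = {name: coef[name] / stdev[name] for name in coef}


def fitted_size(days):
    """Fitted size from the quadratic model."""
    return coef["Constant"] + coef["days"] * days + coef["dayssq"] * days ** 2


# Observation 10: days = 35.0, observed size = 30.300.
fit_35 = fitted_size(35.0)
resid_35 = 30.300 - fit_35

# R-squared = SS(regression) / SS(total).
r_sq = 2846.7 / 2861.4

for name, t in t_ratio.items():
    print(f"{name:9s} t-ratio = {t:6.2f}")
print(f"fit at 35 days = {fit_35:.3f}, residual = {resid_35:.3f}")
print(f"R-sq = {100 * r_sq:.1f}%")
```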
Examine the plot of residuals as a function of fitted values to
see if there is any trend or pattern in the data unaccounted for by the
present model.
MTB > plot c20 c21
[Character plot of C20 (vertical axis, about -1.5 to 1.5) against
C21 (horizontal axis, 0 to 50).]
Note that there is no simple relationship between fitted values
and residuals, suggesting that this model is adequate.
Is there a relationship between residuals adjacent in time?

6.
Take the 2x2 table and put the counts in two cols
MTB > chisquare c1 c2
Expected counts are printed below observed counts
           VOTE   NOVOTE   Total
    1      1481      132    1613   Some HS
         1438.7    174.3
    2      1036      173    1209   No HS
         1078.3    130.7
Total      2517      305    2822
ChiSq = 1.25 + 10.28 +
        1.66 + 13.71 = 26.90
df = 1
The critical chisq(0.95,1) = 3.84. Thus the null hypothesis
of no association is rejected.
The phi coefficient is one measure of association, given as
sqrt(chisq/n) = sqrt(26.90/2822) = 0.098.
If you really want to work, the phi coefficient can alternatively be
computed by
phi = [n(1,1)*n(2,2) - n(2,1)*n(1,2)]/{[n(1+)*n(2+)*n(+1)*n(+2)]**.5}
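Both routes to phi can be checked with a few lines of Python. The sketch below recomputes the expected counts and the chi-square statistic from the observed 2x2 table, then gets phi from sqrt(chisq/n) and from the cross-product formula; variable names are assumptions of the sketch.

```python
import math

# Observed counts: rows are the two education groups,
# columns are VOTE and NOVOTE.
observed = [[1481, 132],
            [1036, 173]]

row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]
n = sum(row_tot)

# Expected count = (row total * column total) / n.
expected = [[r * c / n for c in col_tot] for r in row_tot]

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chisq = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# phi from the chi-square statistic.
phi_from_chisq = math.sqrt(chisq / n)

# phi from the cross-product formula.
(n11, n12), (n21, n22) = observed
phi_direct = (n11 * n22 - n21 * n12) / math.sqrt(
    row_tot[0] * row_tot[1] * col_tot[0] * col_tot[1]
)

print(f"ChiSq = {chisq:.2f}")
print(f"phi (from chisq) = {phi_from_chisq:.3f}")
print(f"phi (direct)     = {phi_direct:.3f}")
```

The direct formula also carries a sign, which sqrt(chisq/n) cannot; here the two agree because the cross-product difference is positive.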
What's the odds ratio for voting for this table?