Education 257 FINAL PROBLEMS, Spring 2003
Solutions for these problems are to be submitted in hard-copy
form. Given that these problems are untimed, some care should be
taken in presentation, clarity, format. Especially important is
to give full and clear answers to questions, not just to submit
unannotated computer output, although relevant output should
be included.
You may use any inanimate resources--no collaboration. This
work is done under Stanford's Honor Code.
Please read the questions carefully and answer the question that
is asked.
Papers are due in Rogosa's Cubberley mailbox before 5PM
Wed June 11 2003
To obtain data sets through web services go to the following
location:
http://www.stanford.edu/class/ed257/HWdat/
and select from the file listing.
(I will also try to remember to mirror the files in the older
data sets location; the path is
/afs/ir.stanford.edu/class/ed257/HW or
/usr/class/ed257/HW )
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 1, Model Building, Variable Selection
Can anyone do math? Was it your parents' fault?
Relation of educational achievement of students to the home
environment. Data on average mathematics proficiency (MATHPROF) and
the home environment variables were obtained from the 1990 National
Assessment of Educational Progress for 37 states, the District of
Columbia, Guam, and the Virgin Islands.
File mathnaep.dat contains the educational achievement of eighth-grade
students in mathematics and the following five explanatory variables
(all state-level variables):
C1 MATHPROF average mathematics proficiency
C2 PARENTS percentage of eighth-grade students with both parents living at home
C3 HOMELIB percentage of eighth-grade students with three or more types of
reading materials at home (books, encyclopedias, magazines, newspapers)
C4 READING percentage of eighth-grade students who read more than 10
pages a day
C5 TVWATCH percentage of eighth-grade students who watch TV for six
hours or more per day
C6 ABSENCES percentage of eighth-grade students absent three days or
more last month
a. Start with basic data analysis due diligence. Examine scatterplots
for anomalous observations and for curvature. Any transformations needed?
Obtain a correlation matrix of the predictor variables and outcome.
What is the single best predictor of mathprof?
b. Use best-subsets regression methods to identify useful prediction models.
What is your best candidate? Compare with the second best candidate.
For your best model, comment on the observations with the largest
standardized residuals.
c. Compare your results in part b with the use of Forward Stepwise regression
to determine a prediction model.
d. For the full set of predictor variables, are there any logical candidates
for data reduction (i.e. forming composites)? Will any improvements in the
regression fits be obtained from using a composite?
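
As a minimal sketch of what a best-subsets search in part b has to cover (not a prescribed method for this course), the Python below enumerates every non-empty subset of the five predictors and gives the usual adjusted R-squared formula for ranking them. The function names are illustrative, n = 40 is taken from the 37 states plus the District of Columbia, Guam, and the Virgin Islands, and the R-squared value in the last line is a made-up placeholder, not a result from mathnaep.dat.

```python
from itertools import combinations

PREDICTORS = ["PARENTS", "HOMELIB", "READING", "TVWATCH", "ABSENCES"]


def candidate_subsets(names):
    """Return every non-empty subset of the predictor names, smallest first."""
    subsets = []
    for size in range(1, len(names) + 1):
        subsets.extend(combinations(names, size))
    return subsets


def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n cases."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


subsets = candidate_subsets(PREDICTORS)
print(len(subsets), "candidate models")
# Placeholder R-squared of 0.50 for a one-predictor model (not from the data).
print(round(adjusted_r2(0.50, n=40, p=1), 4))
```

Each subset would still need to be fit by least squares to obtain its actual R-squared; the sketch only shows the bookkeeping around those fits.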
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 2 Categorical Predictor Variables in both Non-experimental
and Experimental Settings.
Part 1--Nonexperimental Data
For our class example of the NELS data (n=4804) [get data from link on
web-page], taken from the National Educational Longitudinal Study of 1988
(NELS:88) (see description in course files and examples), let's examine the
following.
Obtain a prediction equation for students' 10th-grade scores on the science
achievement test using the predictors: 8th grade science achievement and
socio-economic status (indicator variables define 3 levels of SES: lowest
quartile, middle SES, highest quartile). Employ the indicator variables to
obtain a prediction equation for the outcome that allows for different
10th grade on 8th grade regression lines for the three SES groups.
Now consider a student with a score of 20 on the 8th grade test.
From the previous regression equation, estimate what the difference in his
predicted outcome would be if he were high SES versus if he were middle SES.
Conduct a statistical test of the null hypothesis that the 10th grade on 8th
grade regression slopes are identical for the three SES groups.
extra credit: are there significant gender differences in the prediction of
10th-grade science by eighth-grade science and SES?
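
One way to set up the separate-lines model, sketched in Python under stated assumptions: middle SES is taken as the reference group (any of the three would do), the names `low`, `high`, and the column ordering are hypothetical, and the coefficient values are invented placeholders rather than estimates from the NELS data. With that coding, the high-versus-middle difference at an 8th-grade score x is b_high + b_high_x * x.

```python
def design_row(sci8, ses):
    """Build one row of the separate-lines design for ses in {'low','mid','high'}.

    Columns: intercept, sci8, low, high, low*sci8, high*sci8
    (middle SES is the assumed reference group).
    """
    low = 1.0 if ses == "low" else 0.0
    high = 1.0 if ses == "high" else 0.0
    return [1.0, float(sci8), low, high, low * sci8, high * sci8]


def predict(coefs, sci8, ses):
    """Predicted 10th-grade score: dot product of coefficients and design row."""
    return sum(b * x for b, x in zip(coefs, design_row(sci8, ses)))


# Invented placeholder coefficients, in the column order of design_row.
coefs = [5.0, 0.8, -1.0, 1.5, 0.02, -0.03]

diff = predict(coefs, 20, "high") - predict(coefs, 20, "mid")
print(round(diff, 3))  # equals b_high + 20 * b_high_x for these placeholders
```

Under this coding, testing equal slopes amounts to comparing the model above with the one that drops the two product columns.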
---------------------------------------------------
Part 2. Experimental Data
Consider a slight variation of the lecture example taken from the
Huitema text (Analysis of Covariance and Alternatives). The basic
experiment (described in the Huitema course example) has the description:
Three experimental groups, each of size 10, single outcome.
The investigator is concerned with the effects of three different
types of study objectives on student achievement in freshman
biology. The three types of objectives are:
1.General--students are told to know and understand
everything in the text.
2.Specific--students are provided with a clear
specification of the terms and concepts they are
expected to master and of the testing format.
3.Specific with study time allocations--the amount of
time that should be spent on each topic is provided in
addition to specific objectives that describe the type
of behavior expected on examinations.
The outcome variable is the biology achievement test.
A population of freshman students scheduled to enroll
in biology is defined, and 30 students are randomly
selected. The investigator obtains a single measure
from a science aptitude test for all students before the
investigator randomly assigns 10 students to each of the
three treatments. Treatments are administered, and scores
on the dependent variable are obtained for all students.
In the data file ancovaprob2.dat, the dependent variable is in c1,
aptitude test in c2, group membership variable (1,2,3) in c3.
a) carry out a comparison of the three objectives using analysis
of covariance for these data. What are your conclusions?
b) Does the ancova assumption of equal objective effects over
all levels of ability seem to be reasonable in these data?
What alternatives could you investigate?
c) by comparing the analysis in part a with a one-way anova (no covariate)
give a measure of the improvement in precision for comparing
the group outcomes obtained from use of the aptitude measure as a
covariate.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 3 "Simple" Contingency Tables
Cheating Father Time. In the SF Chronicle May 20, 2001 the feature
"Cheating Father Time: Training, nutrition and medical advances
prolonging careers" provides the following data on the increasing
longevity of professional athletes.
                                     1990
                                  Number of players   Percent of players
League                            35 and older        35 and older
Major League Baseball                   94                  8.4%
National Football League                12                  1%
National Basketball Association         14                  3.6%
National Hockey League                  14                  2.4%

                                     2000
                                  Number of players   Percent of players
League                            35 and older        35 and older
Major League Baseball                  162                 11.7%
National Football League                44                  2.7%
National Basketball Association         41                  9.3%
National Hockey League                  56                  7.8%
a. For each of the four leagues construct a 2x2 table: player age
(35 and older, under 35) and year (1990, 2000). For each table
calculate the relative risk of playing (at or) past 35 in the
two decades.
b. Consider the year 2000 data. For the 2x4 table of player
age by sport, test the null hypothesis of independence. Explain
what that null hypothesis actually is saying. Construct a
display of actual counts, expected counts under independence,
and adjusted residuals from the independence model for each cell
in the 2x4 structure.
c. Calculate the following probability:
Given that a professional athlete in one of these four
leagues is still playing in the year 2000 at age 35 or over,
what's the probability he's a baseball player? Do you
have all the information you need to calculate this
probability?
d. Let's do a meta-analysis. Consider the four leagues as four
separate studies. Estimate the overall odds ratio for the 2x2 tables
in part a. Give a point estimate of the overall odds ratio and carry
out a test that the overall odds ratio is different from 1.0
(independence of year and playing past 35)
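
For part a, the article reports counts and percentages but not league sizes, so the Python sketch below backs out an approximate roster total as count / percent. That reconstruction is an assumption and is only as accurate as the rounded percentages; the league abbreviations and function names are illustrative.

```python
# (players 35 and older, percent 35 and older) as printed in the article
DATA = {
    "MLB": {1990: (94, 8.4), 2000: (162, 11.7)},
    "NFL": {1990: (12, 1.0), 2000: (44, 2.7)},
    "NBA": {1990: (14, 3.6), 2000: (41, 9.3)},
    "NHL": {1990: (14, 2.4), 2000: (56, 7.8)},
}


def two_by_two(league):
    """Return {year: (n_35_plus, n_under_35)} using approximate totals."""
    table = {}
    for year, (older, pct) in DATA[league].items():
        total = round(older / (pct / 100.0))  # assumed league size
        table[year] = (older, total - older)
    return table


def relative_risk(league):
    """Risk of being 35+ in 2000 divided by the same risk in 1990."""
    t = two_by_two(league)
    risk = {yr: older / (older + younger) for yr, (older, younger) in t.items()}
    return risk[2000] / risk[1990]


for league in DATA:
    print(league, two_by_two(league), round(relative_risk(league), 2))
```

Because the totals are rebuilt from the percentages, each relative risk is essentially the ratio of the two printed percentages; the 2x2 counts are what the later parts need.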
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 4 Modeling Multivariate Categorical Data
But would you want to matriculate?
We consider data on admissions for Fall 1973 graduate study at
U.C. Berkeley in the six largest departments. These data among others
were the subject of extensive litigation on gender discrimination
a few years back.
The data on each applicant consist of the applicant's gender (G),
whether admitted (A) and major department (D).
         Whether admitted, male     Whether admitted, female
Dept        Yes       No               Yes       No
a           512      313                89       19
b           353      207                17        8
c           120      205               202      391
d           138      279               131      244
e            53      138                94      299
f            22      351                24      317
a) To start, construct the marginal AG table (a 2x2 table of gender by admit
status). Carry out a test for independence and obtain a point and
interval estimate of the odds ratio for admittance for this marginal AG table.
What might this result be taken to indicate about gender equity etc. in the
admit process? Are you outraged yet?
b. Now use the breakdown by department. Obtain the odds ratio for admittance
within each of the 6 departments. Does Simpson's paradox appear to be present
in these data? Why or why not?
c. Use Cochran-Mantel-Haenszel procedures to:
test whether conditional independence holds for AG
estimate a common odds ratio for the six departments
use Breslow-Day statistic to test whether the AG odds-ratio
is the same for the 6 departments
d. For the possible A G D log-linear models, which model terms
would indicate gender discrimination?
e. Fit the set of A G D log-linear models using
SAS Proc Genmod, and identify what you regard as the most appropriate
model. Does this model confirm gender discrimination in admissions?
Examine the log-likelihood chi-square and table the fits and adjusted
residuals for this model. Are you satisfied with this model?
f. Set aside department a and rerun the log-linear model analysis.
Interpret your preferred model in terms of gender discrimination
in admissions. Also comment on the admissions preferences in dept a.
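
As a rough check on parts a through c, the Python sketch below computes the marginal odds ratio with a Woolf (log-scale) 95% interval, the six within-department odds ratios, and the Mantel-Haenszel common odds ratio from the table above. The odds ratios are oriented male versus female, which is a choice made here; function and variable names are illustrative, and this is not a substitute for the SAS output the problem asks for.

```python
import math

# dept: (male admitted, male rejected, female admitted, female rejected)
BERKELEY = {
    "a": (512, 313, 89, 19),
    "b": (353, 207, 17, 8),
    "c": (120, 205, 202, 391),
    "d": (138, 279, 131, 244),
    "e": (53, 138, 94, 299),
    "f": (22, 351, 24, 317),
}


def odds_ratio(a, b, c, d):
    """Odds of admission for males divided by odds of admission for females."""
    return (a * d) / (b * c)


def woolf_ci(a, b, c, d, z=1.96):
    """Approximate interval for the odds ratio, built on the log scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def mantel_haenszel(tables):
    """Mantel-Haenszel estimate of a common odds ratio across strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


# Collapse over department to get the marginal AG table.
marginal = tuple(sum(col) for col in zip(*BERKELEY.values()))
marginal_or = odds_ratio(*marginal)
marginal_ci = woolf_ci(*marginal)
by_dept = {dept: odds_ratio(*cells) for dept, cells in BERKELEY.items()}
common_or = mantel_haenszel(BERKELEY.values())

print("marginal:", marginal, round(marginal_or, 3), marginal_ci)
for dept, value in by_dept.items():
    print("dept", dept, round(value, 3))
print("Mantel-Haenszel common odds ratio:", round(common_or, 3))
```

Comparing the marginal odds ratio with the within-department values and the Mantel-Haenszel estimate is the quickest way to see what part b is asking about.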
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 5 --Prediction of Binary Outcomes
(more aging male data)
These data are taken from an age discrimination lawsuit I was
involved in some years back (presumed to be long-settled).
Basically, the plaintiff claimed that when Company A bought
Company B, Company A proceeded to "fire all the old guys"
in the sales force of Company B. There is a protected class,
which I believe kicks in at age 40 (though these data use 45
as a division in dichotomizing age). Did Company A perpetrate
an act of age discrimination?
In file agediscrim.dat the rows are data for 65 Company B
sales personnel.
Columns are
age (in years)
rating (employee performance rating 1-5, 5 best)
layoff (1 if terminated after Company A takeover)
ratind (1 if rating is 4,5)
ageind (1 if age 45 or higher)
a. Where I started out with these data was looking at the obvious
2x2 tables. Can you reject independence between layoff and
ageind for these data? What is the relative risk of layoff
for the two age categories? What are the odds of layoff
if the salesperson is under 45 yrs? 45 or older? Give
a point and interval estimate for this odds ratio.
Now also consider the dichotomous indicator for the employee
performance rating. Is the odds ratio for layoff and age the
same at both levels of the employee performance rating indicator?
b. The 2x2 tables have the advantage of easy presentation (esp to a judge)
but there may be some considerable advantage to predicting the
dichotomous outcome layoff using actual age, rather than a
dichotomization.
Use logistic regression to predict layoff using age and ratind
as predictors. Comment on the results, significance of the chosen
predictors.
Display probability and odds of layoff as a function of age and ratind.
Construct an index plot of the deviance residuals following NWK fig 14.7.
Are you satisfied with this model?
c. From the logistic fit compare the odds of layoff of a salesperson
50 yrs old with that for a salesperson 35 years old.
Give a point estimate and a 95% confidence interval for the odds ratio.
d. For any given age, were the odds of layoff greater for low performance
rating than for high rating?
Give a point estimate and a 95% confidence interval.
e. The model in part b contains no interaction term between
age and ratind (i.e. the "effect" of ratind is the same at all
levels of age). Fit a more complex model including an ageXratind
interaction and conduct a statistical test for that term using a
drop-in-deviance test statistic. What is your preferred model?
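
For parts c and e, two small calculations recur once the logistic fit is in hand. The Python sketch below turns an age coefficient and its standard error into an odds ratio and Wald 95% interval for a 15-year age difference (50 versus 35), and converts a drop in deviance on 1 degree of freedom into a chi-square p-value. The coefficient, standard error, and deviance values are invented placeholders, not estimates from agediscrim.dat, and the function names are illustrative.

```python
import math


def odds_ratio_for_difference(beta, se, delta, z=1.96):
    """Odds ratio and Wald interval for a delta-unit change in a predictor."""
    point = math.exp(delta * beta)
    lower = math.exp(delta * (beta - z * se))
    upper = math.exp(delta * (beta + z * se))
    return point, lower, upper


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def drop_in_deviance(dev_reduced, dev_full):
    """Deviance of the model without the term minus deviance with it."""
    return dev_reduced - dev_full


# Placeholder values only: age coefficient 0.10 with standard error 0.02.
point, lower, upper = odds_ratio_for_difference(beta=0.10, se=0.02, delta=50 - 35)
print(round(point, 2), round(lower, 2), round(upper, 2))

# Placeholder deviances for the models without and with the interaction.
stat = drop_in_deviance(dev_reduced=70.0, dev_full=68.5)
print(round(stat, 2), round(chi2_1df_pvalue(stat), 3))
```

The interval is computed on the log-odds scale and then exponentiated, which keeps both limits positive; the same function with delta = 1 applies to the ratind coefficient in part d.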
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
END 257 !