Assignment 4#
You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on Gradescope.
Due date: 11:59 PM, May 28, 2024.
Grading scheme: 10 points per question, total of 30.
Building PDF#
If you have not installed LaTeX on your computer. After running the below commands (once is enough), then using either Quarto or RMarkdown formats should hopefully be sufficient to build directly to PDF.
install.packages('tinytex', repos='http://cloud.r-project.org')
tinytex::install_tinytex()
Download#
Question 2 (Model selection for multiple linear regression)#
Use the Carseats
data we saw in Assignment 3 for this problem.
Parts#
Fit a full model
Sales ~ .
to the data.Extract the design matrix from your model in 1., discarding the column for the intercept.
Use
leaps
to select a model using adjusted \(R^2\) as a criterion, storing the 5 best models of each size.Use
leaps
to select a model using Mallow’s \(C_p\) as a criterion, storing the 5 best models of each size.Use
step
to select a model usingdirection='forward'
starting fromSales ~ 1
with the upper model being the model in 1. Do the models in 3., 4. and 5. agree?Make a plot of \(C_p\) versus model size. Is your selected model clearly better than all others or are there other models of similar quality?
Question 3#
Consider the Wage
data we saw in Assignment 1.
Parts#
Create an outcome variable
high
which isTRUE
ifwage > 250
andFALSE
otherwise.Fit a logistic regression model
high
with featuresmaritl
,age
,health
,health_ins
,education
andrace
.At a 5% level, test the null hypothesis that the effect of
maritl
is 0 in the model in part 2.Split the data into two sets of equal size using this snippet of code to form a half for model selection and a second half for inference. We’ll focus on students with at least
HS
:
set.seed(1)
atleast_HS = Wage$education != '1. < HS Grad'
selection = rep(TRUE, nrow(Wage))
selection[sample(1:nrow(Wage), 0.5*nrow(Wage), replace=FALSE)] = FALSE
inference = !selection
selection = selection & atleast_HS
inference = inference & atleast_HS
Using
subset=selection
, select a model using forward selection usingstep
with the largest model being the model in 2. Call this modelselected.glm
Fit the same model selected in 5. on
subset=inference
. Call this modelsplit.glm
. Compare thesummary()
ofsplit.glm
andselected.glm
. In terms of overall signficance of estimated effects, would your conclusions be very different betweenselected.glm
andsplit.glm
? Do any effects that were significant at level 5% in theselect.glm
switch to not significant insplit.glm
?