Assignment 4#

You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on Gradescope.

Due date: 11:59 PM, May 28, 2024.

Grading scheme: 10 points per question, total of 30.

Building PDF#

If you have not installed LaTeX on your computer, run the commands below (once is enough); after that, building directly to PDF with either Quarto or RMarkdown should hopefully work.

install.packages('tinytex', repos='http://cloud.r-project.org')
tinytex::install_tinytex()


Question 1 (Interpreting parameters when features are correlated)#

Generate a simulated data set as follows:

set.seed(1)
x1 = runif(100) # uniformly distributed between 0 and 1
x2 = 0.5 * x1 + rnorm(x1) / 10  # a slightly noisy version of x1
y = 3 + 2 * x1 + 1 * x2 + 0.95 * rnorm(100)

Parts#

  1. Consider the data generated above and the model lm(y ~ x1 + x2). What are the true \(\beta_0, \beta_1, \beta_2\)? What is the true \(\sigma^2\)?

  2. Create a scatterplot with x1 on the X-axis and x2 on the Y-axis. What is the sample correlation of x1 and x2? (Use cor)

  3. Fit the model lm(y ~ x1 + x2). At level 5%, can you reject the null hypothesis \(H_0:\beta_{\texttt x1}=0\)? How about the null hypothesis \(H_0:\beta_{\texttt x2}=0\)?

  4. Fit the model lm(y ~ x1). At level 5%, can you reject the null hypothesis \(H_0:\beta_{\texttt x1}=0\)? Next, fit the model lm(y ~ x2). At level 5%, can you reject the null hypothesis \(H_0:\beta_{\texttt x2}=0\)?

  5. Do the results of parts 3. & 4. contradict each other? Explain.

  6. In this part, we’ll add a single point to the data set to explore outliers and high leverage. We’ll call a point an outlier if its residual is at least 3 in absolute value, and we’ll say a point has high leverage if its hatvalues score is at least 0.2. For each of A.-C. compute the residual and hatvalues score.

    A. Pick a point (i.e. values for (x1,x2,y)) that would be an outlier relative to the other 100 points but not a point of high leverage.

    B. Pick a point (i.e. values for (x1,x2,y)) that would be a point of high leverage relative to the other 100 points but not an outlier.

    C. Pick a point (i.e. values for (x1,x2,y)) that would be both a point of high leverage and an outlier relative to the other 100 points.

Question 2 (Model selection for multiple linear regression)#

Use the Carseats data we saw in Assignment 3 for this problem.

Parts#

  1. Fit a full model Sales ~ . to the data.

  2. Extract the design matrix from your model in 1., discarding the column for the intercept.

  3. Use leaps to select a model using adjusted \(R^2\) as a criterion, storing the 5 best models of each size.

  4. Use leaps to select a model using Mallows' \(C_p\) as a criterion, storing the 5 best models of each size.

  5. Use step to select a model using direction='forward' starting from Sales ~ 1 with the upper model being the model in 1. Do the models in 3., 4. and 5. agree?

  6. Make a plot of \(C_p\) versus model size. Is your selected model clearly better than all others or are there other models of similar quality?

Question 3#

Consider the Wage data we saw in Assignment 1.

Parts#

  1. Create an outcome variable high which is TRUE if wage > 250 and FALSE otherwise.

  2. Fit a logistic regression model for high with features maritl, age, health, health_ins, education and race.

  3. At a 5% level, test the null hypothesis that the effect of maritl is 0 in the model in part 2.

  4. Split the data into two sets of equal size using the snippet of code below, forming one half for model selection and a second half for inference. We’ll focus on workers with at least a HS education:

set.seed(1)
atleast_HS = Wage$education != '1. < HS Grad'
selection = rep(TRUE, nrow(Wage))
selection[sample(1:nrow(Wage), 0.5*nrow(Wage), replace=FALSE)] = FALSE
inference = !selection
selection = selection & atleast_HS
inference = inference & atleast_HS
  5. Using subset=selection, select a model by forward selection using step, with the largest model being the model in 2. Call this model selected.glm.

  6. Fit the same model selected in 5. on subset=inference. Call this model split.glm. Compare the summary() of split.glm and selected.glm. In terms of overall significance of estimated effects, would your conclusions be very different between selected.glm and split.glm? Do any effects that were significant at level 5% in selected.glm switch to not significant in split.glm?
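For reference, forward selection with step on a glm can be set up along these lines (a sketch only; it assumes high has been added as a column of Wage as in part 1, and that the selection vector comes from the snippet above):

```r
# Sketch of forward selection with step(); assumes Wage$high exists (part 1)
# and the logical vector `selection` has been built as in the snippet above.
null.glm = glm(high ~ 1, data=Wage, subset=selection, family=binomial)
full.glm = glm(high ~ maritl + age + health + health_ins + education + race,
               data=Wage, subset=selection, family=binomial)
selected.glm = step(null.glm,
                    scope=list(lower=~1, upper=formula(full.glm)),
                    direction='forward')
```

Refitting the selected formula with subset=inference then gives the split.glm model asked for in 6.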