Assignment 1

Due Wednesday, October 12 at 11:59 PM on Gradescope

Problem 1

For each of the parts below, indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than that of an inflexible method. Justify your answer.

  1. The sample size \(n\) is extremely large, and the number of predictors \(p\) is small.

  2. The number of predictors \(p\) is extremely large, and the number of observations \(n\) is small.

  3. The relationship between the predictors and response is approximately linear.

  4. The variance of the error terms, i.e., \(\sigma^2 = \mathrm{Var}(\epsilon)\), is extremely low.

Problem 2

Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide \(n\) and \(p\).

  1. We collect a set of data on the 100 highest paid professional basketball players. For each player we record salary, height, weight, age, number of games played in their career, points scored, assists, rebounds, and blocks. We are interested in understanding which factors affect player salary.

  2. We would like to guess whether Stanford will win its next baseball game against Berkeley. We collect data on the last 30 times the teams have played each other, recording the outcome (win or lose), whether Stanford was at home or not, the winning percentage of each team prior to the game, the ERA of each team's starting pitcher, and 12 other variables.

  3. We are interested in predicting rainfall in Palo Alto based on rainfall in other parts of the world. For each week of the past year, we record the number of inches of rainfall in Palo Alto, New York City, Longyearbyen, London, Beijing, Sydney, Cairo, and Buenos Aires.

Problem 3

Suppose we have data \((x_i,y_i)\), \(i=1,2,\dots,n\), where both \(x_i\)’s and \(y_i\)’s are real-valued. Consider the polynomial regression model:

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \varepsilon_i,\quad i = 1,2,\dots,n,\]

where the \(\varepsilon_i\)’s are the random errors in the observations, and the rest of the right-hand side is a polynomial in \(x_i\) of degree \(d\), where \(d\) is chosen prior to fitting the model.

Provide a sketch of the variance, squared bias, training error, and test error curves, on a single plot, as we increase the degree \(d\) of the polynomial used in the above model. Put \(d\) on the x-axis and the values of each curve on the y-axis. There should be four curves; make sure to label them. Also provide brief explanations for their shapes. Note that you do not need to derive any mathematical expressions for this problem.
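
If you would like to sanity-check the training and test error portions of your sketch, a minimal simulation along the following lines can trace out those two curves. The true function, noise level, and range of degrees below are arbitrary choices, not part of the problem.

# Simulate training and test MSE as the polynomial degree d grows
set.seed(1)
n <- 100
x <- runif(n, -2, 2)
f <- function(z) sin(2 * z)            # a made-up "true" regression function
y_train <- f(x) + rnorm(n, sd = 0.3)
y_test  <- f(x) + rnorm(n, sd = 0.3)   # a fresh draw of noise at the same x's

degrees <- 1:10
errs <- sapply(degrees, function(d) {
  fit <- lm(y_train ~ poly(x, d))
  c(train = mean((y_train - fitted(fit))^2),
    test  = mean((y_test  - fitted(fit))^2))
})
matplot(degrees, t(errs), type = "l", lty = 1, col = 1:2,
        xlab = "degree d", ylab = "MSE")
legend("topright", legend = rownames(errs), lty = 1, col = 1:2)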

Problem 4

The following table provides a training data set containing six observations, four predictors, and one qualitative response variable.

# Training data: four numeric predictors and a qualitative response
df = data.frame(X1=c(0,0,0,-1,1,2),
                X2=c(1,1,2,1,0,1),
                X3=c(3,2,1,2,-1,0),
                X4=c(-1,2,0,0,1,3),
                Y=c('Red', 'Green', 'Red', 'Red', 'Green', 'Red'))
df

Suppose that we use this data set to make a prediction for the label \(Y\) at a new point X_new using the \(K\)-nearest neighbors method:

X_new = c(0, 0, 0, 0)

  1. Compute the Euclidean distance between each observation and the new point X_new. (A code sketch for checking your answer appears after this list.)

  2. What is our prediction with \(K=1\)? Why?

  3. What is our prediction with \(K=3\)? Why?

  4. If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for \(K\) to be large or small? Why?
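
For part 1, one way to check your hand computations in R, reusing df and X_new from above (the helper name Xmat is just for illustration):

# Euclidean distance from each training observation to X_new;
# t() puts observations in columns so that X_new recycles correctly
Xmat <- as.matrix(df[, c("X1", "X2", "X3", "X4")])
dists <- sqrt(colSums((t(Xmat) - X_new)^2))
data.frame(obs = 1:6, Y = df$Y, dist = dists)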

Problem 5

Suppose we have a data set with five predictors: \(X_1=\mathrm{GPA}\), \(X_2=\mathrm{IQ}\), \(X_3=\) Gender (1 for Female and 0 for Male), \(X_4=\) Interaction between GPA and IQ, and \(X_5=\) Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get \(\hat{\beta}_0=25\), \(\hat{\beta}_1=18\), \(\hat{\beta}_2=0.1\), \(\hat{\beta}_3=-30\), \(\hat{\beta}_4=0.01\), \(\hat{\beta}_5=10\).

  1. Decide whether each statement A-C is True or False and justify your answer.

    A. For a fixed value of IQ and GPA, males earn more on average than females.

    B. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

    C. Since the coefficient for the Gender term is very large, there is strong evidence of a nonzero effect.

  2. Predict the salary of a female with IQ of 110 and a GPA of 4.0.
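
If you want to check your arithmetic in part 2, the fitted model is easy to encode directly; the function below (a hypothetical helper, not part of the problem) is just a transcription of the coefficient estimates given above.

# Predicted starting salary (in thousands) under the fitted model
predict_salary <- function(gpa, iq, gender) {
  25 + 18 * gpa + 0.1 * iq - 30 * gender +
    0.01 * gpa * iq + 10 * gpa * gender
}
predict_salary(gpa = 4.0, iq = 110, gender = 1)  # part 2: female, IQ 110, GPA 4.0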

Problem 6

We again consider modeling new-grad salary, this time for Stanford undergraduate students. For each situation, discuss whether you would use a less flexible or a more flexible model, and why.

  1. Say we have access to 45 demographic and educational features on the approximately 15,000 undergraduates who have graduated from Stanford over the past eight years, and we are only focused on predictive performance.

  2. What if we are still concerned with prediction, but we only have access to the 128 students who took STATS 202 in Summer 2018?

  3. Returning to the overall undergraduate body, suppose instead that we are more concerned with understanding the relationships between our features and new-grad salary.

Problem 7

Consider data of the form

# Example data: three cars with mpg, hp, and rear axle ratio
df = data.frame(cartype=c('Mazda RX4', 'Valiant', 'Range Rover'),
                mpg=c(21.0, 18.1, 22.8),
                hp=c(1.15, 3.15, 3.19),
                axleratio=c(2.79,4.60,3.44))
df

  1. Suppose I want to run a linear regression on this dataset, taking mpg as my response and including hp and axleratio (rear axle ratio) as my predictors. That is, we model \(y_i = \beta_0 + \beta_{{\tt hp}} x_{1, i} + \beta_{{\tt axleratio}} x_{2, i} + \epsilon_i,\ i = 1, 2, 3,\) where \(\beta_0\) is the intercept term and \(\epsilon_i \stackrel{iid}{\sim} N(0, 1)\). In matrix form, \(y = X \beta + \epsilon,\) where \(\epsilon \sim N(\textbf{0}, I)\). Assuming we include an intercept term, write out the design matrix \(X\) and response \(y\) explicitly.

  2. Explain how we would compute the least squares coefficient estimates \(\hat{\beta}\) from the previous part. Then use R or any software to compute the estimates. (A sketch appears after this list.)

  3. Now, use the mtcars dataset in R. Regress the response mpg on all the other variables. Suppose we are interested in the test \(H_0: \beta_{{\tt hp}} = 0\) versus \(H_a: \beta_{{\tt hp}} \neq 0\). Find the \(p\)-value in R from the lm function (or any other software). Set \(\alpha = 0.05\). Do we have enough evidence to reject the null hypothesis?

  4. Suppose I want to model whether a certain car has an accident on the road or not. Is linear regression suitable for this? Explain in one or two concise sentences.
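
For parts 2 and 3, a minimal R sketch, reusing the df defined above and the built-in mtcars data:

# Part 2: least squares via the normal equations, beta_hat = (X'X)^{-1} X'y
X <- cbind(1, df$hp, df$axleratio)   # design matrix with an intercept column
y <- df$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat

# Part 3: regress mpg on all other variables in mtcars; the "hp" row of the
# coefficient table holds the estimate, standard error, t value, and p-value
fit <- lm(mpg ~ ., data = mtcars)
summary(fit)$coefficients["hp", ]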

Problem 8

This problem focuses on the collinearity problem.

# Simulate two correlated predictors and a response that depends on both
set.seed(1)
x1 = rnorm(100)/10
x2 = x1 + runif(100)
y = 1 + 0.2*x1 + 0.3*x2 + 0.1*rnorm(100)

  1. The last line creates a linear model where y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

  2. What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables. (See the sketch after this list for one way to do this in R.)

  3. Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\)? How do these estimates relate to the true values? Can you reject the null hypothesis \(H_0: \beta_1 = 0\)? How about the null hypothesis \(H_0: \beta_2 = 0\)? You may use the typical rejection threshold of 0.05.

  4. Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis \(H_0: \beta_1 = 0\) at level 0.05?

  5. Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis \(H_0: \beta_2 = 0\) at level 0.05?

  6. Do the results obtained in parts 3-5 contradict each other? Explain your answer.
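
A minimal sketch for parts 2 through 5, continuing from the simulation code above; compare the coefficient estimates and p-values across the three summaries:

# Part 2: sample correlation and scatterplot of the two predictors
cor(x1, x2)
plot(x1, x2, xlab = "x1", ylab = "x2")

# Parts 3-5: fit the full model and the two single-predictor models
summary(lm(y ~ x1 + x2))
summary(lm(y ~ x1))
summary(lm(y ~ x2))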