Assignment 2

You may discuss homework problems with other students, but you must prepare the written assignment yourself.

Please combine all of your answers, computer code, and figures into one PDF file and submit it to Gradescope.

Grading scheme: 10 points per numbered problem from Agresti; 20 points for each of the remaining three problems.

Due date: January 31, 2022, 11:59PM (Monday evening).

Questions from Agresti

  • 5.6

  • 5.15

  • 5.35

  • 5.39

  • 6.8

  • 6.26

  • 4.6

  • 4.12

  • 4.23

Other questions

Matched case-control yet again

We refer back to the example from Assignment 1 with smoking (\({\tt Smoke}\)), cancer (\(C\)), sex (\(S\)), and age (\(A\)).

  1. Suppose that Smoke is conditionally independent of \(C\) given \(A\) and \(S\). Show that the odds ratio estimated from the sampling scheme converges to 1 as the sample size grows.

  2. We might be interested in scenarios in which the odds ratio derived from the \(2 \times 2\) table under this sampling scheme converges to some true odds ratio. We saw in Assignment 1 that in general it will not be the population odds ratio. Another possibility is to estimate a conditional odds ratio given \(A\) and \(S\). Express this conditional odds ratio in terms of \(\mathbb{P}\). Does it depend on \(A\) and \(S\)? Between the marginal odds ratio and this conditional one, which would you say is of more practical interest?

  3. Propose a model for Smoke, C, A, S under which this conditional odds ratio does not depend on \(A\) and \(S\). (Hint: look at the homogeneous association models of Assignment 1.)

  4. Show that in your model, the conditional odds ratio can be estimated by fitting a logistic regression model given IID draws from \(\mathbb{P}\).

  5. Show that the odds ratio estimated from our \(2 \times 2\) table under this sampling scheme converges to this conditional odds ratio (which now does not depend on \(A\) and \(S\)).

  6. Of the two methods above that estimate this conditional odds ratio, which makes more assumptions?
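As a notational reminder for question 2 (the shorthand \(\theta(a, s)\) below is ours, not Agresti's), one standard way to write a conditional odds ratio between \({\tt Smoke}\) and \(C\) given \(A = a\) and \(S = s\) is

\[ \theta(a, s) = \frac{\mathbb{P}({\tt Smoke}=1 \mid C=1, A=a, S=s)\,\big/\,\mathbb{P}({\tt Smoke}=0 \mid C=1, A=a, S=s)}{\mathbb{P}({\tt Smoke}=1 \mid C=0, A=a, S=s)\,\big/\,\mathbb{P}({\tt Smoke}=0 \mid C=0, A=a, S=s)}. \]

By Bayes' rule, the same value is obtained when the roles of \({\tt Smoke}\) and \(C\) are exchanged; this symmetry is what makes retrospective (case-control) sampling informative about it.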

Modeling mpg

For this question, use the Auto data from library(ISLR).

  1. Create a binary variable Y = mpg > median(mpg) that splits the cars into high-MPG and low-MPG models.

  2. Try some simple bivariate plots to explore the relationship between Y and the other features. Any notable patterns?

  3. Split the data into a test set and training set of roughly equal size. On the training data, build a logistic regression model to predict Y from the other features. Try out a few model selection techniques such as step or bestglm.

  4. Create and report the confusion matrix on your test data for the model built in step 3, and evaluate the true positive rate and false positive rate on the test data. (A minimal sketch of the full pipeline appears after this list.)
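Here is a minimal sketch of one way through parts 1–4, assuming the ISLR column layout for Auto; the seed, the 50/50 split, step for selection, and the 0.5 classification threshold are illustrative choices, not requirements.

```r
library(ISLR)
data(Auto)

## Part 1: binary response, high vs. low MPG.
Auto$Y <- as.numeric(Auto$mpg > median(Auto$mpg))

## Part 2: one example bivariate plot (weight against the binary response).
boxplot(weight ~ Y, data = Auto, xlab = "Y (high MPG)", ylab = "weight")

## Part 3: roughly equal train/test split, then stepwise logistic regression.
## mpg is dropped because it determines Y; name is a many-level identifier.
set.seed(1)
keep <- !(names(Auto) %in% c("mpg", "name"))
train_id <- sample(nrow(Auto), nrow(Auto) %/% 2)
train <- Auto[train_id, keep]
test <- Auto[-train_id, keep]
fit <- step(glm(Y ~ ., data = train, family = binomial), trace = 0)

## Part 4: confusion matrix, TPR and FPR on the test set.
pred <- factor(as.numeric(predict(fit, test, type = "response") > 0.5),
               levels = c(0, 1))
conf <- table(predicted = pred, actual = factor(test$Y, levels = c(0, 1)))
TPR <- conf["1", "1"] / sum(conf[, "1"])  # true positives / actual positives
FPR <- conf["1", "0"] / sum(conf[, "0"])  # false positives / actual negatives
conf
c(TPR = TPR, FPR = FPR)
```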

Problem 4.23 extended

Problem 4.23 above discusses binary regression models with link functions other than the usual three that glm provides. In this problem, we consider fitting such models. Many of the ingredients of the IRLS / Fisher scoring algorithm described in the book can be reused directly.

  1. Given (link, inverse_link, derivative_link) (for binary regression these are a quantile function, its corresponding distribution function, and the derivative of the quantile function, respectively), write functions in R that compute the items below (a sketch of these three functions appears at the end of this problem):

  • The deviance for a binary regression model with design matrix X and binary response Y. The function should take arguments X, Y, link, inverse_link, beta.

  • The gradient of the deviance for a binary regression model with design matrix X and binary response Y. The function should take arguments X, Y, link, inverse_link, derivative_link, beta where derivative_link is the derivative of the link function.

  • The expected Fisher information matrix for a binary regression model with design matrix X and binary response Y. The function should take arguments X, Y, link, inverse_link, derivative_link, beta. How is this matrix used in the Fisher scoring method?

  2. Use your functions from part 1 to write a new function that fits a binary GLM by Fisher scoring with a user-defined link. Use this function to fit a logistic regression model to the spam data of library(ElemStatLearn), with the spam/email indicator as the response. (See the Fisher scoring sketch at the end of this problem.)

  3. Fit the “same” binary GLM, changing only the link function to be the quantile function of a \(T\) distribution with 50 degrees of freedom. For this link, can you check whether your algorithm is doing a reasonable job?

  4. Use your functions from part 1 to fit the same model as in part 2, only now using gradient descent, i.e. each iteration \(k\) takes a step in the negative gradient direction

\[ \beta^{(k+1)} = \beta^{(k)} -\alpha_k \cdot \nabla DEV(\beta^{(k)})\]

with \(\alpha_k\) chosen to ensure that the step really is a descent step. If \(\alpha_k\) is too large, we might increase the deviance; if it is too small, we may need many iterations. Every 10 iterations or so, we might try increasing the step size by a factor of, say, 2. (In the Fisher scoring step it is also a good idea to check that the new value indeed has lower deviance; otherwise, scale the step down by a factor such as 0.9 until it does.) A gradient descent sketch along these lines appears at the end of this problem.

  5. Use your functions from part 4 to refit the models in parts 2 and 3. Do you get the same answers?
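To make part 1 concrete, here is a minimal sketch of the three functions, under the convention stated above: link maps the mean \(\mu\) to the linear predictor \(\eta\) (a quantile function), inverse_link maps \(\eta\) back to \(\mu\) (a distribution function), and derivative_link is the derivative of link with respect to \(\mu\). The function names are our own.

```r
binary_deviance <- function(X, Y, link, inverse_link, beta) {
  ## link is unused here but kept to match the requested signature.
  p <- inverse_link(drop(X %*% beta))
  ## The saturated model has zero deviance for binary Y, so
  ## DEV(beta) = -2 * log-likelihood.
  -2 * sum(Y * log(p) + (1 - Y) * log(1 - p))
}

binary_deviance_grad <- function(X, Y, link, inverse_link, derivative_link, beta) {
  p <- inverse_link(drop(X %*% beta))
  ## d mu / d eta = 1 / link'(mu) by the inverse function theorem.
  dmu_deta <- 1 / derivative_link(p)
  ## Gradient of the deviance = -2 * score.
  -2 * drop(t(X) %*% ((Y - p) / (p * (1 - p)) * dmu_deta))
}

binary_fisher_info <- function(X, Y, link, inverse_link, derivative_link, beta) {
  p <- inverse_link(drop(X %*% beta))
  dmu_deta <- 1 / derivative_link(p)
  ## Expected Fisher information of the log-likelihood, X' W X,
  ## with the usual GLM weights W.
  W <- dmu_deta^2 / (p * (1 - p))
  t(X) %*% (W * X)
}
```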
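Next, a sketch of a Fisher scoring fitter for part 2, applied with the logistic link and then with the \(T_{50}\) link of part 3. We assume the ElemStatLearn layout in which spam has 57 numeric predictors and the email/spam factor in column 58; the deviance-checking safeguard follows the parenthetical remark in part 4.

```r
fit_binary_glm <- function(X, Y, link, inverse_link, derivative_link,
                           beta0 = rep(0, ncol(X)), tol = 1e-8, maxit = 100) {
  beta <- beta0
  dev <- binary_deviance(X, Y, link, inverse_link, beta)
  for (it in 1:maxit) {
    grad <- binary_deviance_grad(X, Y, link, inverse_link, derivative_link, beta)
    info <- binary_fisher_info(X, Y, link, inverse_link, derivative_link, beta)
    ## Scoring step: the score is -grad / 2 because DEV = -2 * log-likelihood.
    step <- solve(info, -grad / 2)
    repeat {  # scale the step down until the deviance actually decreases
      new_dev <- binary_deviance(X, Y, link, inverse_link, beta + step)
      if (is.finite(new_dev) && new_dev <= dev) break
      step <- 0.9 * step
    }
    converged <- dev - new_dev < tol
    beta <- beta + step
    dev <- new_dev
    if (converged) break
  }
  list(coef = beta, deviance = dev, iterations = it)
}

library(ElemStatLearn)
data(spam)
X <- cbind(1, as.matrix(spam[, -58]))  # intercept plus 57 predictors
Y <- as.numeric(spam$spam == "spam")

## Part 2: logistic link; link'(mu) = 1 / (mu * (1 - mu)).
fit_logit <- fit_binary_glm(X, Y, qlogis, plogis,
                            function(mu) 1 / (mu * (1 - mu)))

## Part 3: T(50) quantile link; by the inverse function theorem,
## link'(mu) = 1 / dt(qt(mu, 50), 50).
fit_t50 <- fit_binary_glm(
  X, Y,
  link = function(mu) qt(mu, df = 50),
  inverse_link = function(eta) pt(eta, df = 50),
  derivative_link = function(mu) 1 / dt(qt(mu, df = 50), df = 50))
```

For the logistic link, a quick sanity check is to compare fit_logit$coef against glm(Y ~ X - 1, family = binomial).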
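Finally, a gradient descent loop for part 4, with the step-size heuristics described there: shrink \(\alpha_k\) until the step actually decreases the deviance, and try doubling it every 10 iterations. The initial \(\alpha\) and the iteration cap are illustrative.

```r
fit_binary_glm_gd <- function(X, Y, link, inverse_link, derivative_link,
                              beta0 = rep(0, ncol(X)), alpha = 1e-4,
                              tol = 1e-8, maxit = 10000) {
  beta <- beta0
  dev <- binary_deviance(X, Y, link, inverse_link, beta)
  for (it in 1:maxit) {
    grad <- binary_deviance_grad(X, Y, link, inverse_link, derivative_link, beta)
    repeat {  # halve alpha until the step is a genuine descent step
      new_beta <- beta - alpha * grad
      new_dev <- binary_deviance(X, Y, link, inverse_link, new_beta)
      if (is.finite(new_dev) && new_dev <= dev) break
      alpha <- alpha / 2
    }
    converged <- dev - new_dev < tol
    beta <- new_beta
    dev <- new_dev
    if (converged) break
    if (it %% 10 == 0) alpha <- 2 * alpha  # occasionally try a bigger step
  }
  list(coef = beta, deviance = dev, iterations = it)
}

## Part 5: refit the part-2 model by gradient descent and compare.
gd_logit <- fit_binary_glm_gd(X, Y, qlogis, plogis,
                              function(mu) 1 / (mu * (1 - mu)))
max(abs(gd_logit$coef - fit_logit$coef))
```

Gradient descent on the raw, unscaled spam predictors can be very slow; standardizing the columns of X first (and adjusting the coefficients back afterwards) is a sensible practical tweak.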