Assignment 1

You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your folder on Gradescope.

Grading scheme: 10 points per question, total of 40.

Due date: 11:59 PM April 15, 2024 (Monday evening).

Download

RStudio: RMarkdown, Quarto
Jupyter

Question 1

Install the package ISLR2 using the command below. If it is already installed, load it with library(ISLR2) instead of calling install.packages.

```{r}
install.packages('ISLR2', repos='http://cloud.r-project.org')
```

We’ll use the College data set for this problem. In particular, we want to compare the cost of room and board between private and public colleges.

1. Draw a boxplot of Room.Board stratified by Private. Similarly, plot histograms of the two samples.
2. Based on the histograms, do you think the two-sample $t$-test is justified here?
3. Carry out the two-sample $t$-test comparing Room.Board between private and public colleges.
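The general workflow for this kind of comparison can be sketched on made-up data. The data frame `toy` and its columns `cost` and `group` are hypothetical stand-ins invented for illustration, not the College data, and the sketch is only one possible way to organize the steps.

```r
set.seed(1)
# Hypothetical toy data: a numeric outcome and a two-level grouping factor
toy <- data.frame(
  cost  = c(rnorm(40, mean = 4000, sd = 800), rnorm(60, mean = 4600, sd = 900)),
  group = factor(rep(c("No", "Yes"), times = c(40, 60)))
)

# Side-by-side boxplots of the outcome, one box per group
boxplot(cost ~ group, data = toy)

# One histogram per group
hist(toy$cost[toy$group == "No"],  main = "group = No",  xlab = "cost")
hist(toy$cost[toy$group == "Yes"], main = "group = Yes", xlab = "cost")

# Two-sample t-test; by default t.test does not assume equal variances (Welch)
toy_test <- t.test(cost ~ group, data = toy)
toy_test
```

Passing `var.equal = TRUE` to `t.test` gives the pooled-variance version of the test instead of the Welch version.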

Question 2

In this problem, we’ll repeat the analysis above using Top10perc rather than Room.Board. We’ll also simulate the distribution of the $t$-statistic for comparing Top10perc between public and private colleges under the null hypothesis that there is no difference.

1. Repeat the steps above, swapping Top10perc for Room.Board. Do the histograms have a normal shape?
2. Compute the two-sample $t$-statistic comparing Top10perc between private and public colleges by saving the $statistic element of the object returned by t.test.

We’ll create a sample of 10000 draws of the $t$-statistic based on random permutations of the data. For this we’ll need to use a for loop. The snippet below creates a vector of length 10000 and runs a for loop to store some numbers in the vector. In the example below, we store the result of a call to rnorm(1), a randomly generated normal with mean 0 and variance 1.

```{r}
my_sample = rep(NA, 10000) # create a vector of length 10000 filled with missing values
for (i in 1:10000) {       # the body of a for loop is enclosed in {}
  my_sample[i] = rnorm(1)
}
```

3. Randomly shuffle the Private vector 10000 times and recompute the two-sample $t$-test statistic each time, storing the values in a vector of length 10000.
4. Create a histogram of the 10000 statistics. Does its shape seem similar to that of the $T$-distribution used to compute the $p$-value by t.test?
5. Compute a $p$-value by computing the proportion of your 10000 statistics that are larger in absolute value than your observed $t$-statistic. Is it close to the $p$-value you found by using t.test?
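The permutation procedure described in this question can be sketched on made-up data as follows. The vectors `y` and `group` are hypothetical, and the sketch uses 2000 permutations instead of 10000 only to keep it quick. It is meant as an illustration of the idea, not as the required solution.

```r
set.seed(2)
# Hypothetical toy data (not the College data)
y     <- c(rnorm(30, mean = 0), rnorm(30, mean = 0.5))
group <- factor(rep(c("A", "B"), each = 30))

# Observed statistic: the `statistic` element of the object returned by t.test
t_obs <- t.test(y ~ group)$statistic

n_perm <- 2000                       # assumption: smaller than 10000 for speed
t_perm <- rep(NA, n_perm)
for (i in 1:n_perm) {
  shuffled  <- sample(group)         # random permutation of the group labels
  t_perm[i] <- t.test(y ~ shuffled)$statistic
}

# Histogram of the permutation distribution of the statistic
hist(t_perm, breaks = 40, main = "Permutation distribution of t")

# Proportion of permuted statistics larger in absolute value than the observed one
p_perm <- mean(abs(t_perm) > abs(t_obs))
c(permutation = p_perm, t.test = t.test(y ~ group)$p.value)
```

Shuffling the labels breaks any association between the outcome and the grouping, which is why the shuffled statistics approximate the distribution of the statistic under the null hypothesis of no difference.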

Question 3 - Sign Test

In this problem we’ll carry out the sign test by simulation. The data we’ll use is the Fund dataset from ISLR2 which contains (simulated) returns for different fund managers over 50 months. If fund managers are unable to really beat the market, we expect that they will have positive returns as frequently as negative returns. This is our null hypothesis.

1. For Manager14, compute the signs of their 50 returns and store the number of months with positive returns as our statistic.

For a null distribution, we will take a distribution symmetric around 0 and look at the distribution of the number of positive signs among 50 draws from it.

2. For 10000 reps, draw 50 standard normals (mean 0, variance 1) using rnorm(50), and compute the number of positive entries as a test statistic. Plot a histogram of your results using the argument breaks=1:50.
3. Instead of normal random variables, repeat 2. with uniform random variables, which have a distribution symmetric around 0.5. That is, for 10000 reps, simulate 50 uniform random variables with runif(50) and store the number greater than 0.5 as a test statistic. Does it look the same as 2.? Why?
4. Explain why the distribution from 2. is a good way to evaluate evidence against our null hypothesis about Manager14. Compute a $p$-value by computing the proportion of your 10000 draws that have more positive draws than your observed number of positive months. Is this a 1-sided or 2-sided $p$-value? What is the null hypothesis here?
5. The distribution of the number of positive standard normals out of 50 is the Binomial distribution with parameters (50, 1/2). Its true histogram (i.e. its probability mass function, or pmf) can be computed with dbinom. That is, the probability that there are exactly 23 positive standard normals is dbinom(23, 50, 0.5). Use barplot to make a barplot of the true histogram.
6. Compute the probability of observing at least as many positive standard normals as the number of positive return months for Manager14. You can do this by summing dbinom or by using the cumulative histogram (i.e. the cumulative distribution function, or CDF) pbinom.
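A minimal sketch of the simulated sign test and its exact binomial counterpart, on made-up data, might look like the following. The vector `returns` is hypothetical and is not taken from the Fund data. The sketch uses the "at least as many" convention throughout so that the simulated and exact tail probabilities are directly comparable.

```r
set.seed(3)
# Hypothetical returns for one manager over 50 months (not the Fund data)
returns <- rnorm(50, mean = 0.3, sd = 1)
n_pos   <- sum(returns > 0)           # observed number of positive months

# Null distribution by simulation: number of positive values among 50
# draws from a distribution symmetric about 0
n_rep    <- 10000
null_pos <- rep(NA, n_rep)
for (i in 1:n_rep) {
  null_pos[i] <- sum(rnorm(50) > 0)
}
hist(null_pos, breaks = 0:50)

# One-sided simulated tail probability: at least as many positives as observed
p_sim <- mean(null_pos >= n_pos)

# Exact pmf of the Binomial(50, 1/2) distribution on 0, 1, ..., 50
pmf <- dbinom(0:50, size = 50, prob = 0.5)
barplot(pmf, names.arg = 0:50)

# Exact tail probability, computed two equivalent ways
p_exact_sum <- sum(dbinom(n_pos:50, size = 50, prob = 0.5))
p_exact_cdf <- pbinom(n_pos - 1, size = 50, prob = 0.5, lower.tail = FALSE)
c(simulated = p_sim, exact = p_exact_sum)
```

Note that `pbinom(k, ..., lower.tail = FALSE)` returns the probability of strictly more than `k` successes, which is why the sketch passes `n_pos - 1` to obtain "at least `n_pos`".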

Question 4

Use the Wage data from ISLR2 for this problem.

1. Make a boxplot of wage ~ maritl (marital status). Inspect histograms for the 5 different values of maritl. Do the histograms have a normal shape?
2. Repeat 1. for logwage.
3. Fit the one-way ANOVA model logwage ~ maritl. Test the null hypothesis that there is no difference in wages based on the marital status of the 3000 men.
4. Construct a 95% confidence interval for the difference between the wage of widowed men and that of men who never married.
5. Suppose you did not want to rely on the $p$-value from the $F$ test provided by your one-way model in 3. How might you simulate a distribution to test the null hypothesis that the wage distribution doesn’t depend on marital status?
6. Compute a $p$-value by simulation using your answer to 5.
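The pieces of this question (a one-way ANOVA fit, a confidence interval for a difference between two group means, and a simulated null distribution for the $F$ statistic) can be sketched on made-up data. The outcome `y` and the three-level factor `g` are hypothetical and are not the Wage data. Shuffling the group labels is one possible way to simulate the null distribution, shown here as an assumption rather than as the intended answer.

```r
set.seed(4)
# Hypothetical toy data: outcome y and a three-level factor g (not the Wage data)
g <- factor(rep(c("a", "b", "c"), times = c(40, 25, 35)))
group_means <- c(a = 4.5, b = 4.7, c = 4.6)
y <- rnorm(100, mean = group_means[as.character(g)], sd = 0.3)

# One-way ANOVA via lm; anova() reports the F test of equal group means
fit <- lm(y ~ g)
anova(fit)
f_obs <- anova(fit)[["F value"]][1]

# 95% CI for mean(b) - mean(a): with the default treatment coding and
# baseline level "a", the coefficient named "gb" is this difference
confint(fit, "gb", level = 0.95)

# Simulated null distribution of F: refit after shuffling the group labels
n_perm <- 1000                       # assumption: modest size to keep it quick
f_perm <- rep(NA, n_perm)
for (i in 1:n_perm) {
  f_perm[i] <- anova(lm(y ~ sample(g)))[["F value"]][1]
}

# Proportion of shuffled F statistics at least as large as the observed one
p_perm <- mean(f_perm >= f_obs)
c(permutation = p_perm, F_test = anova(fit)[["Pr(>F)"]][1])
```

If the comparison of interest does not involve the baseline level, `relevel` can be used to change the baseline before fitting so that the desired difference appears as a single coefficient.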