Assignment 1#
You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your folder on Gradescope.
Grading scheme: 10 points per question, total of 40.
Due date: 11:59 PM April 15, 2024 (Monday evening).
Question 1#
Install the package ISLR2 using the command below. If it is already installed, use library(ISLR2) instead of install.packages.
```r
install.packages('ISLR2', repos='http://cloud.r-project.org')
```
We’ll use the College
data set for this problem. In particular, we want to compare the cost of
room and board between private and public colleges.
1. Draw a boxplot of Room.Board stratified by Private. Similarly, plot histograms of the two samples.
2. Based on the histograms, do you think the two-sample \(t\)-test is justified here?
3. Carry out the two-sample \(t\)-test comparing Room.Board between private and public colleges.
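As a rough illustration of the workflow in this question, the sketch below runs the same kind of steps on a small simulated data frame rather than on College. The data frame toy, its columns cost and group, and every number in it are made up for illustration; only boxplot, hist and t.test are the actual R functions involved. The choice of var.equal = TRUE is an assumption made for the sketch, not something stated in the assignment.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical stand-in for a data set with one numeric column and a
# two-level factor (made-up numbers, NOT the College data).
toy <- data.frame(
  cost  = c(rnorm(40, mean = 4000, sd = 800), rnorm(60, mean = 4600, sd = 900)),
  group = factor(rep(c("No", "Yes"), times = c(40, 60)))
)

# Side-by-side boxplots of the numeric column, one box per group.
boxplot(cost ~ group, data = toy)

# One histogram per group, stacked so the two shapes are easy to compare.
par(mfrow = c(2, 1))
hist(toy$cost[toy$group == "No"],  main = "group = No",  xlab = "cost")
hist(toy$cost[toy$group == "Yes"], main = "group = Yes", xlab = "cost")
par(mfrow = c(1, 1))

# Two-sample t-test via the formula interface. By default t.test() runs
# the Welch (unequal variance) version; var.equal = TRUE requests the
# pooled-variance version instead.
toy_test <- t.test(cost ~ group, data = toy, var.equal = TRUE)
print(toy_test)
```

With the real data, the formula would use the column names of the data frame in place of cost and group.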
Question 2#
In this problem, we’ll repeat the analysis above using Top10perc rather than Room.Board. We’ll also simulate the distribution of the \(t\)-statistic for comparing Top10perc between public and private colleges under the null hypothesis that there is no difference.
1. Repeat the steps above, swapping Top10perc for Room.Board. Do the histograms have a normal shape?
2. Compute the two-sample \(t\)-statistic comparing Top10perc between private and public colleges by saving the $statistic component of the object returned by t.test.
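For readers who have not inspected the return value of t.test before, here is a minimal sketch of pulling individual pieces out of it. The vectors x and y are hypothetical samples with made-up numbers, not columns of College.

```r
set.seed(2)

# Two hypothetical samples (made-up numbers, not from College).
x <- rnorm(30, mean = 25, sd = 10)
y <- rnorm(45, mean = 30, sd = 12)

res <- t.test(x, y)      # returns a list of class "htest"
names(res)               # lists the pieces stored in the result

t_obs <- res$statistic   # the t-statistic, a named numeric of length 1
p_obs <- res$p.value     # the p-value reported by t.test
t_obs
p_obs
```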
We’ll create a sample of 10000 draws of the \(t\)-statistic based on random permutations of the data. For this we’ll need to use a for loop. The snippet below creates a vector of length 10000 and runs a for loop to store some numbers in the vector. In the example below, we store the result of a call to rnorm(1): a randomly generated normal random variable with mean 0 and variance 1.
```r
my_sample = rep(NA, 10000) # create a vector of length 10000 filled with missing values
for (i in 1:10000) {       # the body of a for loop needs to be enclosed in {}
    my_sample[i] = rnorm(1)
}
```
3. Randomly shuffle the Private vector 10000 times and recompute the two-sample \(t\)-test statistic each time, storing the values in a vector of length 10000.
4. Create a histogram of the 10000 statistics. Does its shape seem similar to that of the \(t\)-distribution used to compute the \(p\)-value by t.test?
5. Compute a \(p\)-value by computing the proportion of your 10000 statistics that are larger in absolute value than your observed \(t\)-statistic. Is it close to the \(p\)-value you found by using t.test?
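The shuffling procedure described above might look roughly like the following sketch. It is one possible arrangement, not a reference solution: the vectors response and label are hypothetical simulated data, and n_perm is set to 2000 rather than 10000 only to keep the sketch quick to run.

```r
set.seed(3)

# Hypothetical data: a numeric response and a two-level label
# (made-up numbers standing in for a real data set).
response <- c(rnorm(40, mean = 25, sd = 10), rnorm(60, mean = 30, sd = 10))
label    <- factor(rep(c("No", "Yes"), times = c(40, 60)))

# Observed statistic on the unshuffled labels.
t_obs <- t.test(response ~ label)$statistic

n_perm <- 2000                 # assumption: smaller than 10000 for speed
t_perm <- rep(NA, n_perm)      # storage for the permuted statistics
for (i in 1:n_perm) {
  shuffled  <- sample(label)   # random permutation of the labels
  t_perm[i] <- t.test(response ~ shuffled)$statistic
}

# Proportion of permuted statistics larger in absolute value than the
# observed one (a two-sided permutation p-value).
p_perm <- mean(abs(t_perm) > abs(t_obs))
p_perm
```

Calling sample on a vector with no other arguments returns a random reordering of it, which is what breaks any association between the labels and the response.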
Question 3 - Sign Test#
In this problem we’ll carry out the sign test by simulation. The data we’ll use is the Fund
dataset from ISLR2
which contains (simulated) returns for different fund managers over 50 months.
If fund managers are unable to really beat the market, we expect that they will have positive returns
as frequently as negative returns. This is our null hypothesis.
1. For Manager14, compute the signs of their 50 returns and store the number of months with positive returns as our statistic.
For a null distribution, we will take a distribution symmetric around 0 and look at the distribution of the sum of its signs.
2. For 10000 reps, draw 50 standard normals (mean 0, variance 1) using rnorm(50), and compute the number of positive entries as a test statistic. Plot a histogram of your results using the argument breaks=1:50.
3. Instead of normal random variables, repeat 2. with uniform random variables, which have a distribution symmetric around 0.5. That is, for 10000 reps, simulate 50 uniform random variables with runif(50) and store the number greater than 0.5 as a test statistic. Does it look the same as 2.? Why?
4. Explain why the distribution from 2. is a good way to evaluate evidence against our null hypothesis about Manager14. Compute a \(p\)-value by computing the proportion of your 10000 draws that have more positive draws than your observed number of positive months. Is this a 1-sided or 2-sided \(p\)-value? What is the null hypothesis here?
5. The distribution of the number of positive standard normals out of 50 is the Binomial distribution with parameters (50, 1/2). Its true histogram (i.e. its probability mass function, or pmf) can be computed with dbinom. That is, the probability that there are exactly 23 positive standard normals is dbinom(23, 50, 0.5). Use barplot to make a barplot of the true histogram.
6. Compute the probability of observing at least as many positive standard normals as the number of positive return months for Manager14. You can do this by summing dbinom or using the cumulative histogram (i.e. cumulative distribution function, or CDF) pbinom.
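To make the mechanics of the sign test concrete, the sketch below walks through the simulated and the exact Binomial calculations on a hypothetical manager. The vector returns is simulated here with made-up parameters and is not a column of Fund; the variable names are illustrative only.

```r
set.seed(4)

# Hypothetical monthly returns for one manager (simulated here, NOT a
# column of Fund); 50 months, as in the text.
returns <- rnorm(50, mean = 0.5, sd = 2)
n_pos   <- sum(returns > 0)          # observed number of positive months

# Simulated null distribution: number of positives among 50 draws
# from a distribution symmetric around 0.
n_reps   <- 10000
null_pos <- rep(NA, n_reps)
for (i in 1:n_reps) {
  null_pos[i] <- sum(rnorm(50) > 0)
}

# Proportion of simulated counts strictly larger than the observed count.
p_sim <- mean(null_pos > n_pos)

# Exact Binomial(50, 1/2) probabilities for every possible count 0..50.
pmf <- dbinom(0:50, size = 50, prob = 0.5)
barplot(pmf, names.arg = 0:50, xlab = "number positive", ylab = "probability")

# P(X >= n_pos): sum the pmf, or use the upper tail of the CDF.
# pbinom(q, ..., lower.tail = FALSE) gives P(X > q), hence n_pos - 1.
p_exact_sum <- sum(dbinom(n_pos:50, size = 50, prob = 0.5))
p_exact_cdf <- pbinom(n_pos - 1, size = 50, prob = 0.5, lower.tail = FALSE)
c(sim_more_than = p_sim, exact_at_least = p_exact_cdf)
```

Note that the simulated proportion counts draws strictly above the observed count, while the exact calculation includes the observed count itself, so the two numbers answer slightly different questions.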
Question 4#
Use the Wage
data from ISLR2
for this problem.
1. Make a boxplot of wage ~ maritl (marital status). Inspect histograms for the 5 different values of maritl. Do the histograms have a normal shape?
2. Repeat 1. for logwage.
3. Fit the one-way ANOVA model logwage ~ maritl. Test the null hypothesis that there is no difference in wages based on the marital status of the 3000 men.
4. Construct a 95% confidence interval for the difference between the wage of widowed men and those never married.
5. Suppose you wanted to not rely on the \(p\)-value from the \(F\) test as provided by your one-way model in 3. How might you simulate a distribution to test the null hypothesis that the wage distribution doesn’t depend on marital status?
6. Compute a \(p\)-value by simulation using your answer to 5.
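One possible way to approach the simulation idea raised at the end of this question is to permute the group labels and recompute the \(F\) statistic each time. The sketch below does this on hypothetical data: the factor grp, the outcome y and all numbers are made up and are not the Wage data, and permutation is an assumption about the intended approach rather than something the assignment prescribes.

```r
set.seed(5)

# Hypothetical data: a numeric outcome and a five-level factor
# (made-up numbers, NOT the Wage data).
grp <- factor(rep(c("a", "b", "c", "d", "e"), times = c(30, 40, 20, 25, 35)))
y   <- rnorm(length(grp), mean = 4.6, sd = 0.35)

# One-way ANOVA; the F statistic sits in the first row of the table.
fit   <- lm(y ~ grp)
f_obs <- anova(fit)[1, "F value"]

# Permutation null: shuffling the group labels breaks any link between
# outcome and group while keeping both sets of values fixed.
n_perm <- 2000                 # assumption: kept small for speed
f_perm <- rep(NA, n_perm)
for (i in 1:n_perm) {
  g_shuffled <- sample(grp)
  f_perm[i]  <- anova(lm(y ~ g_shuffled))[1, "F value"]
}

# Share of permuted F statistics at least as large as the observed one,
# shown next to the p-value from the usual F test for comparison.
p_perm <- mean(f_perm >= f_obs)
c(F = f_obs, p_perm = p_perm, p_F = anova(fit)[1, "Pr(>F)"])
```

Only large values of \(F\) count as evidence against the null here, so the proportion is taken in the upper tail without absolute values.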