You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your folder on Canvas.
Grading scheme: 10 points per question, total of 40.
Due date: 11:59 PM January 20, 2017 (Friday evening).
On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA is used to predict whether winter will be long, based on whether or not he sees his shadow. I collected data on whether he saw his shadow from here, and stored some of this data in this table.
Although Phil is on the East Coast, I wondered if the information says anything about whether or not we will experience a rainy winter out here in California. For this, I found rainfall data, and saved it in a table. To see how this was extracted, see this notebook.
Make a boxplot of the average rainfall in Northern California comparing the years Phil sees his shadow versus the years he does not.
Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.
Interpret the interval in part 2.
At level $\alpha = 0.05$, would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?
What assumptions are you making in forming your confidence interval and in your hypothesis test?
In Question 1, part 4 above, you are asked to carry out a hypothesis test. In part 5, you are asked to justify your confidence interval and hypothesis test. Both are typically based on a $T$ statistic of some form.
Write functions in R to generate new data sets for the two different groups of years, calling them `shadow` and `noshadow`. The functions should be such that you can specify the average rainfall within the two groups of years separately, as well as the variability of the rainfall within those years (for example, you might use `rnorm` with different mean and variance parameters).
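As one possible starting point, here is a minimal sketch of two such generators. The argument names `n`, `mu` and `sigma` and the numbers in the usage lines are illustrative assumptions, and normality is itself an assumption that the rainfall data may not satisfy.

```r
# Hypothetical generators: n, mu and sigma are illustrative argument names.
# Normal rainfall is an assumption, not something the data guarantees.
shadow <- function(n, mu = 0, sigma = 1) {
  rnorm(n, mean = mu, sd = sigma)
}

noshadow <- function(n, mu = 0, sigma = 1) {
  rnorm(n, mean = mu, sd = sigma)
}

# Example usage with made-up sample sizes, means and standard deviations
set.seed(1)
x <- shadow(30, mu = 5, sigma = 2)
y <- noshadow(20, mu = 5, sigma = 3)
```

Note that `rnorm` takes a standard deviation (`sd`), so a variance has to be square-rooted before it is passed in.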
Using your two functions above, simulate data under the null hypothesis that the data from `shadow` years is the same as the data from `noshadow` years, computing the $T$ statistic each time. Plot a density of a sample of 5000 such $T$ statistics, overlaying it with a "true" density that holds under the null hypothesis. Explain how these densities relate to the test you carried out in Question 1, part 4.
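The simulation described here might be sketched as follows. This assumes normal data, hypothetical sample sizes `n1 = 30` and `n2 = 20` (the real group sizes come from the tables), and the pooled (equal-variance) $T$ statistic, whose null distribution under these assumptions is Student's $t$ with `n1 + n2 - 2` degrees of freedom.

```r
set.seed(2017)

# Hypothetical generators and sample sizes (assumptions for illustration)
shadow   <- function(n, mu = 0, sigma = 1) rnorm(n, mean = mu, sd = sigma)
noshadow <- function(n, mu = 0, sigma = 1) rnorm(n, mean = mu, sd = sigma)
n1 <- 30
n2 <- 20

# One draw of the pooled two-sample T statistic when both groups
# share the same mean and the same standard deviation
one_T <- function() {
  unname(t.test(shadow(n1), noshadow(n2), var.equal = TRUE)$statistic)
}

Tstats <- replicate(5000, one_T())

# Estimated density of the simulated statistics, with the
# t density on n1 + n2 - 2 degrees of freedom overlaid
plot(density(Tstats), main = "Simulated T statistics under the null")
curve(dt(x, df = n1 + n2 - 2), add = TRUE, col = "red", lty = 2)
```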
Again using the same two functions, simulate data under the null hypothesis that the average rainfall from `shadow` years is the same as the average rainfall from `noshadow` years, allowing for the possibility that the variability of the average is different among the two groups. The function `t.test` allows specifying `var.equal` to be true or false. Compare the density of the $T$ statistics when the variability is not the same within the two groups.
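A rough sketch of such a comparison is below. The sample sizes (30 and 10) and standard deviations (1 and 4) are made-up values chosen to make the effect visible, not quantities taken from the rainfall data. Each simulated data set yields both the pooled statistic (`var.equal = TRUE`) and the Welch statistic (`var.equal = FALSE`).

```r
set.seed(191)

# Hypothetical sample sizes and standard deviations (assumptions)
n1 <- 30
n2 <- 10

# Same mean in both groups, different variability
sim_pair <- function() {
  x <- rnorm(n1, mean = 0, sd = 1)
  y <- rnorm(n2, mean = 0, sd = 4)
  c(pooled = unname(t.test(x, y, var.equal = TRUE)$statistic),
    welch  = unname(t.test(x, y, var.equal = FALSE)$statistic))
}

Tmat <- replicate(5000, sim_pair())  # 2 x 5000 matrix, one row per statistic

# Overlay the two simulated densities and the t density that the
# pooled test would use as its reference distribution
plot(density(Tmat["welch", ]), main = "T statistics with unequal variances")
lines(density(Tmat["pooled", ]), lty = 2)
curve(dt(x, df = n1 + n2 - 2), add = TRUE, col = "red")
legend("topright", legend = c("Welch", "pooled", "t reference"),
       lty = c(1, 2, 1), col = c("black", "black", "red"))
```

In this particular setup the smaller group has the larger variance, so the pooled statistic is noticeably more spread out than the reference density; other choices of sizes and variances can behave differently.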
The data set `walleye` in the package `alr4` (remember you may have to run `install.packages("alr4")`) consists of data measured on walleye fish in Wisconsin.
Create a boxplot of
Compute the sample mean and sample standard deviation of `length` in the four groups.
Create a histogram of `length` for each `age` in `1:4`, putting the plots in a 2x2 grid in one file.
Compute a 90% confidence interval for the difference in mean `length` in years 1 and 2. What assumptions are you making?
At level $\alpha=5\%$, test the null hypothesis that the average `length` in the group `age==3` is the same as the average in the group `age==4`. What assumptions are you making? What can you conclude?
Repeat the test in 5. using the function
`anscombe` data in R. Attach the table using the command
Plot the 4 data sets `(x1,y1)`, `(x2,y2)`, `(x3,y3)`, `(x4,y4)` on a 2-by-2 grid of plots using the commands
Add the number of the dataset to each plot as the main title on each plot.
Fit a regression model to the data sets:
y1 ~ x1
y2 ~ x2
y3 ~ x3
y4 ~ x4
using the command `lm`. Verify that all the fitted models have the exact same coefficients (up to numerical tolerance).
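One way to fit the four models without repeating code is sketched below. It uses the built-in `anscombe` data frame; the names `fits` and `coefs` and the row and column labels are illustrative choices.

```r
# Fit y_i ~ x_i for i = 1, ..., 4 on the built-in anscombe data
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})

# Collect intercepts and slopes in a 2 x 4 matrix, one column per data set
coefs <- sapply(fits, coef)
rownames(coefs) <- c("intercept", "slope")
colnames(coefs) <- paste0("dataset", 1:4)
print(round(coefs, 3))

# Largest difference between any two data sets, per coefficient
spread <- apply(coefs, 1, function(v) diff(range(v)))
print(spread)
```

Inspecting `spread` makes the tolerance explicit rather than relying on rounded printed output.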
Using the command `cor`, compute the sample correlation for each data set.
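A short sketch of this computation, looping over the four column pairs of the built-in `anscombe` data frame (the name `cors` is an illustrative choice):

```r
# Sample correlation between x_i and y_i for each of the four data sets
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(cors) <- paste0("dataset", 1:4)
print(cors)
```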
Fit the same models in 3. but with the `x` and `y` reversed. Using the command `summary`, does anything about the results stay the same when you reverse `x` and `y`?
Compute the $SSE$, $SST$ and $R^2$ values for each data set. Use the commands `mean`, `sum`, `predict`.
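The quantities can be computed by hand as in the following sketch, which takes $SSE = \sum_i (y_i - \hat{y}_i)^2$, $SST = \sum_i (y_i - \bar{y})^2$ and $R^2 = 1 - SSE/SST$. The name `r2_by_hand` is an illustrative choice.

```r
# SSE, SST and R^2 for each of the four anscombe regressions
r2_by_hand <- sapply(1:4, function(i) {
  y   <- anscombe[[paste0("y", i)]]
  fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
  SSE <- sum((y - predict(fit))^2)   # residual sum of squares
  SST <- sum((y - mean(y))^2)        # total sum of squares
  c(SSE = SSE, SST = SST, R2 = 1 - SSE / SST)
})
colnames(r2_by_hand) <- paste0("dataset", 1:4)
print(r2_by_hand)
```

The `R2` row can be checked against `summary(fit)$r.squared` for each fit.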
Using the command `abline`, replot the data, adding the regression line to each plot.
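A possible sketch of the replotting step, using `par(mfrow = c(2, 2))` for the grid and the built-in `anscombe` data frame. The titles and the line colour are illustrative choices.

```r
# 2-by-2 grid of scatterplots, each with its fitted regression line
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste("Dataset", i))
  abline(lm(y ~ x), col = "red")   # abline accepts a fitted lm object
}
par(op)  # restore the previous plotting layout
```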