You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one PDF file, and submit a copy to your folder on Canvas.

Grading scheme: 10 points per question, total of 40.

Due date: 11:59 PM January 20, 2017 (Friday evening).

Question 1

On Groundhog Day, February 2, a famous groundhog in Punxsutawney, PA is used to predict whether a winter will be long or not based on whether or not he sees his shadow. I collected data on whether he saw his shadow or not from here. I stored some of this data in this table.

Although Phil is on the East Coast, I wondered if the information says anything about whether or not we will experience a rainy winter out here in California. For this, I found rainfall data, and saved it in a table. To see how this was extracted, see this notebook.

1. Make a boxplot of the average rainfall in Northern California comparing the years Phil sees his shadow versus the years he does not.

2. Construct a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.

3. Interpret the interval in part 2.

4. At level $\alpha = 0.05$, would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?

5. What assumptions are you making in forming your confidence interval and in your hypothesis test?
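
As a rough illustration of parts 1, 2 and 4, the sketch below runs the steps on made-up placeholder data rather than the actual groundhog and rainfall tables. The data frame `demo` and its columns `shadow` and `rainfall` are hypothetical names chosen for this example; the real tables may be laid out differently, and the pooled-variance choice (`var.equal = TRUE`) is only one possible assumption.

```r
# Placeholder data only: these values are simulated, NOT the real tables.
set.seed(1)
demo <- data.frame(
  shadow   = rep(c(TRUE, FALSE), times = c(20, 10)),  # hypothetical column
  rainfall = c(rnorm(20, mean = 5, sd = 2),           # hypothetical column
               rnorm(10, mean = 5, sd = 2))
)

# Part 1: side-by-side boxplots of rainfall by shadow status.
boxplot(rainfall ~ shadow, data = demo,
        xlab = "Phil saw his shadow", ylab = "Rainfall")

# Part 2: 90% confidence interval for the difference in means.
# With a formula, the groups are ordered FALSE, TRUE, so the interval is for
# mean(no shadow) - mean(shadow). var.equal = TRUE gives the pooled interval.
ci <- t.test(rainfall ~ shadow, data = demo,
             var.equal = TRUE, conf.level = 0.90)$conf.int

# Part 4: two-sided test at level 0.05; reject when the p-value is below 0.05.
pval   <- t.test(rainfall ~ shadow, data = demo, var.equal = TRUE)$p.value
reject <- pval < 0.05
```

Note that a 90% interval corresponds to a test at level 0.10, so the interval in part 2 and the level 0.05 test in part 4 need not lead to the same conclusion.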

Question 2

In Question 1, part 4 above, you are asked to carry out a hypothesis test. In part 5, you are asked to justify your confidence interval and hypothesis test. Both are typically based on a $T$ statistic of some form.

1. Write functions in R to generate new data sets for the two different groups of years, calling them shadow and noshadow. The functions should let you specify the average rainfall within each of the two groups of years separately, as well as the variability of the rainfall within those years (for example, you might use rnorm with different mean and sd arguments).

2. Using your two functions above, simulate data under the null hypothesis that the data from shadow years is the same as the data from noshadow years, computing the $T$ statistic each time. Plot a density of a sample of 5000 such $T$ statistics, overlaying it with a "true" density that holds under the null hypothesis. Explain how these densities relate to the test you carried out in Question 1, part 4.

3. Again using the same two functions, simulate data under the null hypothesis that the average rainfall from shadow years is the same as the average rainfall from noshadow years, allowing for the possibility that the variability of the rainfall differs between the two groups. The function t.test allows specifying var.equal to be true or false. Compare the densities of the $T$ statistics for the two settings of var.equal when the variability is not the same in the two groups.
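
One possible shape for the simulation in parts 1 and 2 is sketched below. The argument names `n`, `mu` and `sigma`, the helper `one_T`, and the sample sizes 20 and 10 are all assumptions made for illustration; they are not taken from the actual data.

```r
# Hypothetical generators: n, mu and sigma are illustrative arguments.
shadow   <- function(n, mu, sigma) rnorm(n, mean = mu, sd = sigma)
noshadow <- function(n, mu, sigma) rnorm(n, mean = mu, sd = sigma)

# One simulated T statistic under the null (equal means, equal sd).
# The sample sizes n1 and n2 are placeholders, not the real group sizes.
one_T <- function(n1 = 20, n2 = 10, mu = 5, sigma = 2) {
  t.test(shadow(n1, mu, sigma), noshadow(n2, mu, sigma),
         var.equal = TRUE)$statistic
}

set.seed(1)
Tstats <- replicate(5000, one_T())

# Estimated density of the simulated statistics, with the t density on
# n1 + n2 - 2 = 28 degrees of freedom overlaid in red.
plot(density(Tstats), main = "Simulated T statistics under the null")
curve(dt(x, df = 28), add = TRUE, col = "red")
```

For part 3, the same helper could be given separate standard deviations for the two groups and run once with `var.equal = TRUE` and once with `var.equal = FALSE`.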

Question 3

The data set walleye in the package alr4 (remember you may have to run install.packages("alr4")) consists of data measured on walleye fish in Wisconsin.

1. Create a boxplot of length, for age in 1:4.

2. Compute the sample mean and sample standard deviation of length in each of the four groups.

3. Create a histogram of length for each age in 1:4, putting the plots in a 2x2 grid in one file.

4. Compute a 90% confidence interval for the difference in mean length between ages 1 and 2. What assumptions are you making?

5. At level $\alpha=5\%$, test the null hypothesis that the average length in the group age==3 is the same as in the group age==4. What assumptions are you making? What can you conclude?

6. Repeat the test in 5. using the function lm.
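
The link between parts 5 and 6 is that a pooled two-sample $T$ test coincides with the $T$ test on the coefficient of a group indicator in a linear model. The sketch below checks this on simulated placeholder data, not on walleye; the data frame `demo`, the level names `age3` and `age4`, and the means and standard deviations are hypothetical.

```r
# Placeholder data: simulated lengths for two groups, NOT the walleye data.
set.seed(1)
demo <- data.frame(
  group  = factor(rep(c("age3", "age4"), each = 15)),
  length = c(rnorm(15, mean = 300, sd = 25), rnorm(15, mean = 320, sd = 25))
)

# Pooled two-sample t test (difference is mean(age3) - mean(age4)).
tt <- t.test(length ~ group, data = demo, var.equal = TRUE)

# The same comparison as a regression on the group indicator
# (the coefficient is mean(age4) - mean(age3)).
fit    <- lm(length ~ group, data = demo)
lm_row <- summary(fit)$coefficients["groupage4", ]

# The t statistics agree up to sign, and the p-values agree.
c(t.test = unname(tt$statistic), lm = unname(lm_row["t value"]))
c(t.test = tt$p.value, lm = unname(lm_row["Pr(>|t|)"]))
```

The agreement relies on `var.equal = TRUE`; the default Welch test in t.test does not in general match the lm output.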

Question 4 (RABE)

1. Use the anscombe data in R. Attach the table using the command attach.

2. Plot the 4 data sets (x1,y1), (x2,y2), (x3,y3), (x4,y4) on a 2-by-2 grid of plots using the commands plot and par(mfrow=c(2,2)). Use the number of the data set as the main title of each plot.

3. Fit a regression model to the data sets:

a. y1 ~ x1

b. y2 ~ x2

c. y3 ~ x3

d. y4 ~ x4

using the command lm. Verify that all the fitted models have nearly identical coefficients (they agree to about two decimal places).

4. Using the command cor, compute the sample correlation for each data set.

5. Fit the same models as in 3., but with x and y reversed. Using the command summary, check whether anything about the results stays the same when you reverse x and y.

6. Compute the $SSE$, $SST$ and $R^2$ values for each data set. Use the commands mean, sum, predict.

7. Using the command abline, replot the data, adding the regression line to each plot.
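
As a minimal sketch of parts 6 and 7 for the first data set only, the quantities can be computed from their definitions, $SSE = \sum_i (y_i - \hat{y}_i)^2$, $SST = \sum_i (y_i - \bar{y})^2$ and $R^2 = 1 - SSE/SST$. The names `fit1`, `SSE`, `SST` and `R2` are chosen for this example, and the code passes `data = anscombe` instead of using attach.

```r
fit1 <- lm(y1 ~ x1, data = anscombe)

# Sums of squares computed from the definitions.
SSE <- sum((anscombe$y1 - predict(fit1))^2)      # residual sum of squares
SST <- sum((anscombe$y1 - mean(anscombe$y1))^2)  # total sum of squares
R2  <- 1 - SSE / SST

# Cross-checks: R^2 matches summary(), and equals the squared correlation.
c(R2, summary(fit1)$r.squared, cor(anscombe$x1, anscombe$y1)^2)

# Scatterplot of the first data set with the fitted line added.
plot(anscombe$x1, anscombe$y1, main = "Dataset 1")
abline(fit1)
```

The same steps would be repeated for the other three data sets, for example inside a loop over 1:4.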

In [1]:
anscombe

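| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |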