Lecture 21: Testing for Correlation

STATS 60 / STATS 160 / PSYCH 10

Concepts and Learning Goals:

Hypothesis test for correlation coefficients
- testing via simulation
- permutation test
Variability of the correlation coefficient
- via simulation! “the bootstrap”

Taking inventory#

Suppose we have conducted an experiment and we have used our data to compute a summary statistic, $\hat{T}$.

We suspect that $\hat{T}$ is indicative of a trend. But it could just be random noise…

Example 1: a student takes a 10-question True/False test, and $\hat{T} = 8/10$ is the fraction answered correctly.
- It seems like the student knows the material!
- Is it possible they were just guessing randomly?
Example 2: we run a randomized controlled trial to see if retrieval practice helps students study, and $\hat{T} = -1$ is the difference in mean scores between the treatment and control group.
- It seems like retrieval practice is worse for remembering the material!
- Is a $-1$ difference in average score really a lot? Could it just be noise?

Question:

Explain what a hypothesis test is, and why we would do it, in plain English, using 25 words or less.
Explain what a null hypothesis is in plain English, using 25 words or less.
Explain what a $p$-value is in plain English, using 25 words or less.

Hypothesis testing recap#

Suppose we have conducted an experiment and we have used our data to compute a summary statistic, $\hat{T}$.

We suspect that $\hat{T}$ is indicative of a trend. But it could just be random noise…

A hypothesis test is a thought experiment to help us figure out whether it is likely that our observation $\hat{T}$ is just random noise.

The null hypothesis is that our data is just random, with no trend.

The specifics of the null hypothesis depend on the experiment we ran.

The $p$-value is the chance of observing $\hat{T}$, or an even stronger trend, if the null hypothesis were true (that is, if the data were just random noise).
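As a concrete check on Example 1, here is a small sketch of the $p$-value for scoring $8/10$ on the True/False test. It assumes the null hypothesis is "every answer is an independent fair coin flip"; the function name `guessing_p_value` is made up for illustration.

```python
from math import comb


def guessing_p_value(num_correct: int, num_questions: int) -> float:
    """Chance of at least `num_correct` right answers when every answer is a fair coin flip."""
    # Count the answer patterns with num_correct or more correct answers...
    favorable = sum(comb(num_questions, k) for k in range(num_correct, num_questions + 1))
    # ...out of 2^num_questions equally likely patterns under the null.
    return favorable / 2 ** num_questions


p_value = guessing_p_value(8, 10)
print(f"p-value = {p_value:.4f}")  # 56/1024, prints "p-value = 0.0547"
```

Under this null, a pure guesser scores 8 or better about 5.5% of the time, so the score is suggestive but sits just above the usual $\alpha = 0.05$ cutoff.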

Correlation#

Suppose that I have sampled $n$ individuals from my population, and for each I have measured the values $(x_i,y_i)$.

For example:

Penguins: $x =$ body mass, $y=$ beak length
Health: $x=$ weight, $y=$ breakfast days/week
Economics: $x=$ years of education, $y=$ salary
College admissions: $x=$ SAT score, $y=$ sophomore-year GPA

The correlation coefficient ($R$) of $x$ and $y$ is the slope of the best-fit line for the standardized datasets $x_1,\ldots,x_n$ and $y_1,\ldots,y_n$.
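Here is a minimal sketch of that definition in code. It assumes "standardized" means subtracting the mean and dividing by the standard deviation (computed with $n$ in the denominator); the function names and the toy numbers are made up for illustration.

```python
from math import sqrt


def standardize(values):
    """Subtract the mean and divide by the standard deviation (n in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def correlation(xs, ys):
    """Slope of the least-squares line through the standardized data."""
    zx, zy = standardize(xs), standardize(ys)
    # The best-fit slope is sum(zx*zy) / sum(zx*zx). After standardizing,
    # sum(zx*zx) equals n, so the slope is just the average product.
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(f"R = {correlation(x, y):.2f}")  # prints "R = 0.80"
```

Since both lists are rescaled before the slope is computed, changing the units of $x$ or $y$ (grams to kilograms, say) leaves $R$ unchanged.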

Correlation coefficient for Penguin body mass vs. beak length

Do students who study more sleep more?#

Let’s look at some data from the course survey. Here is a scatterplot of your self-reported “hours of sleep” vs. “hours of studying”:

Does it look to you like there is a positive association, negative association, or neither?

Correlation coefficient for sleep vs. study#

Here is the best-fit line. $R = -.19$.

Is this a real trend or is it just noise?

Is this a significant correlation?#

How can we decide if the correlation coefficient $R$ is large? How can we decide if it is significant?

To decide if $R$ is large: compare it to its maximum and minimum values, $1$ and $-1$.

But $R$ can be significant (a real linear association) even when $|R|$ is smaller than 1.

Correlation coefficient for Penguin body mass vs. beak length

Testing for correlation#

Suppose that we compute the correlation coefficient, and see that it has value $R \neq 0$.

Is this just a coincidence? Or is the correlation a real pattern?

Let’s formulate this as a hypothesis testing problem:

Null Hypothesis: there is no correlation.
How can we compute the $p$-value?
- We’ll use simulation!

Permuting the datapoints#

Let’s assume, from now on, that our $x_i$ and $y_i$ are standardized.

Suppose there really is a positive association between $(x_i,y_i)$.

Now, what if we randomly shuffle or permute the $y_i$, so that they are matched to a random $x_j$?

There’s almost certainly no correlation now!

Null hypothesis based on shuffling#

Suppose we have computed $R$ from our data.

We want to know if $R$ reflects an actual trend, or if it is just noise.

Null hypothesis: the pairs $(x_i,y_i)$ are paired up totally randomly.

Question: Why is this null hypothesis saying that there is no real correlation?
Question: In plain English, what is the $p$-value of $R$ for this null hypothesis?

The null is saying there is no real correlation because if the pairing of $x_i$ and $y_i$ is arbitrary/random, there would almost certainly not be a linear relationship (as long as $n > 2$; any two points always lie on a line!).
The $p$-value is the chance that you’d get this value of $R$, or a more extreme one, if the data were paired up by randomly shuffling.

Permutation test for correlation#

Suppose we have computed $R$ from our data.

We want to know if $R$ reflects an actual trend, or if it is just noise.

Null hypothesis: the pairs $(x_i,y_i)$ are paired up totally randomly.

$p$-value: the chance that you’d get this value of $R$, or a more extreme one, if the data were paired up by randomly shuffling.

We’ll compute the $p$-value using a Permutation test, a test based on simulation:

1. Do some large number $T$ of repetitions of the following experiment:

   a. Randomly permute/shuffle the $y_i$ so that each is paired with some random $x_j$.

   b. Compute and record the correlation coefficient for the shuffled dataset.
2. Make a histogram of the correlation coefficient values for all $T$ trials.
3. Decide the $p$-value for $R$ based on how extreme it is relative to the histogram:
   - If $R$ is positive, the $p$-value for $R$ is the fraction of histogram values larger than $R$ (or $1/T$, if none are larger).
   - If $R$ is negative, the $p$-value for $R$ is the fraction of histogram values smaller than $R$ (or $1/T$, if none are smaller).
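The recipe above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to make the lecture plots: the function names are made up, the toy data is randomly generated, and the histogram is skipped (we only count how many shuffled values land beyond $R$).

```python
import random
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient R of the paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, num_trials=10_000, seed=0):
    """Return (R, p-value), where the p-value comes from shuffling the y's."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = correlation(xs, ys)
    shuffled_ys = list(ys)
    more_extreme = 0
    for _ in range(num_trials):
        rng.shuffle(shuffled_ys)          # step a: re-pair the y's at random
        r = correlation(xs, shuffled_ys)  # step b: R for the shuffled data
        if observed >= 0 and r > observed:
            more_extreme += 1
        elif observed < 0 and r < observed:
            more_extreme += 1
    # Report 1/T when no shuffled value is more extreme than R.
    return observed, max(more_extreme, 1) / num_trials


# Made-up toy data (not the course survey or the penguin data).
toy_rng = random.Random(1)
toy_x = [float(i) for i in range(30)]
toy_y = [0.5 * x + toy_rng.gauss(0, 5) for x in toy_x]
r, p = permutation_p_value(toy_x, toy_y)
print(f"R = {r:.2f}, p-value = {p:.4f}")
```

Because the smallest value this procedure can report is $1/T$, a result of $1/T$ is best read as "at most $1/T$".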

The penguins#

In our correlation lecture, we computed a correlation coefficient of $R = 0.67$ for Gentoo Penguin body mass vs. beak length.

What is the $p$-value?

Permutation test for penguin correlation#

I ran a simulation with $T = 10,000$ trials.

In each trial, I chose a new random permutation of the $y_i$, and computed the correlation coefficient.

Here is trial 1:

Here is trial 2:

Etc.

Aggregating the trials#

Below is a histogram of the correlation coefficients from each of the $T$ trials.

Using the permutation test, the $p$-value is at most $.0001$:

We conclude that $R$ is statistically significant at level $\alpha = 0.05$ (or smaller).

GDP in 1960 vs. 2000#

In the correlation lecture we also computed the correlation coefficient for the GDP of countries in 1960 vs. 2000.

We can do a permutation test to check if this value of $R$ is statistically significant.

P-value for GDP correlation#

For $T = 10,000$ permutations, the $p$-value for the correlation coefficient of 1960 vs. 2000 GDP is $1/10,000$:

Back to sleep vs. study#

Let’s return to our course survey data, about hours of sleep vs. hours of study.

We calculated $R = -.19$ for this data. Is it significant?

P-value for sleep vs. study#

Running a permutation test with $T = 10,000$ trials, we find that the $p$-value is $\approx 0.13$, so the trend is not statistically significant at level $\alpha = .05$.

SAT score vs. study#

Let’s look at a different correlation in our course survey dataset: SAT score vs. average number of hours a week spent studying.

Correlation coefficient for SAT score vs. study#

The correlation coefficient is positive (unsurprisingly?): $R = .3$.

Statistically significant?#

Again we run a permutation test with $T = 10,000$ trials:

Statistically significant at level $\alpha = 0.05$!

Variability#

We computed $R$ for our data. But what about its variability?

Is $R$ being influenced by outliers?
If our data had been sampled a bit differently, would our value of $R$ be dramatically different?

The permutation-test $p$-values only give us a sense of statistical significance of a correlation:

we can see how extreme $R$ is relative to a random shuffling of the data (no correlation)
we cannot see how $R$ would vary if we had a different sample from data with the same type of associative relationship.

How much variability in SAT score vs. study?#

The SAT score vs. study trend:

Would the value of $R$ have changed to a smaller or larger value if we had sampled differently? There appear to be some outliers.
How much smaller or larger?

Estimating variability with simulation#

We can use simulation to estimate variability too.

In the best-case scenario, we have access to new samples from our population: we can just collect more samples and compute a fresh correlation coefficient, many times over.
But what if we don’t have access to new samples?

The following approach is called the bootstrap:

1. Start with our dataset $(x_1,y_1),\ldots,(x_n,y_n)$.
2. For some large number of trials $T$:

   a. Sample $n$ pairs independently with replacement from the dataset: $$(x_{i_1},y_{i_1}),\ldots,(x_{i_n},y_{i_n})$$

   b. Compute and record the correlation coefficient of these pairs.
3. Form a histogram of the correlation coefficients from all $T$ trials.

The Bootstrap#

To get a sense of the variability of $R$:

1. Start with our dataset $(x_1,y_1),\ldots,(x_n,y_n)$.
2. For some large number of trials $T$:

   a. Sample $n$ pairs independently with replacement from the dataset: $$(x_{i_1},y_{i_1}),\ldots,(x_{i_n},y_{i_n})$$

   b. Compute and record the correlation coefficient of these pairs.
3. Form a histogram of the correlation coefficients from all $T$ trials.
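The bootstrap steps above can be sketched as follows. This is a rough illustration, not the code behind the lecture plots: the function names are made up, the toy data is randomly generated, and skipping resamples where all the $x$'s or all the $y$'s coincide (so that $R$ is undefined) is a choice made for this sketch. Instead of drawing the histogram, it summarizes the recorded values by their mean and standard deviation.

```python
import random
from math import sqrt
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient R of the paired lists xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bootstrap_correlations(xs, ys, num_trials=10_000, seed=0):
    """Return the correlation coefficient of each bootstrap resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(xs)
    values = []
    for _ in range(num_trials):
        # Step a: draw n indices with replacement, keeping each (x, y) pair together.
        picks = [rng.randrange(n) for _ in range(n)]
        boot_x = [xs[i] for i in picks]
        boot_y = [ys[i] for i in picks]
        # R is undefined if a resample has no spread in x or in y; skip those.
        if len(set(boot_x)) < 2 or len(set(boot_y)) < 2:
            continue
        # Step b: record R for this resample.
        values.append(correlation(boot_x, boot_y))
    return values


# Made-up toy data (not the course survey data).
toy_rng = random.Random(1)
toy_x = [float(i) for i in range(30)]
toy_y = [0.5 * x + toy_rng.gauss(0, 5) for x in toy_x]
boot = bootstrap_correlations(toy_x, toy_y)
print(f"bootstrap mean of R = {mean(boot):.2f}, standard deviation = {stdev(boot):.2f}")
```

Note the contrast with a permutation test: here each $(x_i, y_i)$ pair stays intact, so the resamples keep whatever association the original data has.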

Question: why could this simulation give us a good sense of variability? Will it account for outliers?

There’s a reasonable chance that we’ll avoid any specific outlier $i$ in a trial when we sample with replacement: $$\Pr[\text{avoid } i] = \left(1-\frac{1}{n}\right)^n \approx e^{-1} \approx 0.37 \text{ when } n \text{ is large.}$$
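A quick numerical check of this approximation, as a sketch (the function name `avoid_probability` is made up for illustration):

```python
from math import exp


def avoid_probability(n):
    """Chance that n draws with replacement from n points all miss one given point."""
    return (1 - 1 / n) ** n


for n in (10, 100, 1000):
    print(f"n = {n:4d}: {avoid_probability(n):.4f}")
print(f"e^-1    : {exp(-1):.4f}")
```

So in roughly a third of the bootstrap trials a given outlier is left out entirely, and the spread of the recorded values reflects how much $R$ depends on that point.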

Question: will this always give us a good idea of the variability of $R$?

Not necessarily; our sample could just be really weird.

Simulation for variability in SAT score vs. study#

If we do a bootstrap simulation with $T= 10,000$ trials, we can see that the confidence interval around $R$ is actually quite narrow:

This gives us some sense of the variability of $R$.

Variability of the best-fit line#

Here you can see the original scatterplot and best-fit line, along with the lines corresponding to the mean $\pm$ one standard deviation.

The line doesn’t change too much! The variability of $R$ is reasonably low.

Recap#

Testing for correlation
- Using simulation: permutation tests
Variability of correlation
- Using simulation: “the bootstrap”