Lecture 21: Testing for Correlation#

STATS 60 / STATS 160 / PSYCH 10

Concepts and Learning Goals:

  • Hypothesis test for correlation coefficients

    • testing via simulation

    • permutation test

  • Variability of the correlation coefficient

    • via simulation! “the bootstrap”

Taking inventory#

Suppose we have conducted an experiment and we have used our data to compute a summary statistic, \(\hat{T}\).

We suspect that \(\hat{T}\) is indicative of a trend. But it could just be random noise…

  • Example 1: a student takes a 10-question True/False test, and \(\hat{T} = 8/10\) is the fraction answered correctly.

    • It seems like the student knows the material!

    • Is it possible they were just guessing randomly?

  • Example 2: we run a randomized controlled trial to see if retrieval practice helps students study, and \(\hat{T} = -1\) is the difference in mean scores between the treatment and control group.

    • It seems like retrieval practice is worse for remembering the material!

    • Is a \(-1\) difference in average score really a lot? Could it just be noise?

Question:

  1. Explain what a hypothesis test is, and why we would do it, in plain English, using 25 words or less.

  2. Explain what a null hypothesis is in plain English, using 25 words or less.

  3. Explain what a \(p\)-value is in plain English, using 25 words or less.

Hypothesis testing recap#

Suppose we have conducted an experiment and we have used our data to compute a summary statistic, \(\hat{T}\).

We suspect that \(\hat{T}\) is indicative of a trend. But it could just be random noise…

A hypothesis test is a thought experiment to help us figure out whether it is likely that our observation \(\hat{T}\) is just random noise.

The null hypothesis is that our data is just random, with no trend.

The specifics of the null hypothesis depend on the experiment we ran.

The \(p\)-value is the chance of observing \(\hat{T}\), or an even stronger trend, if the null hypothesis were true (that is, if the data were just random).
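As a worked example, here is a minimal sketch of the \(p\)-value for Example 1 (the 10-question True/False test). It assumes the null hypothesis is "each question is answered by an independent fair coin flip," and that "an even stronger trend" means 8 or more correct answers. The function name `guessing_p_value` is made up for illustration.

```python
from math import comb

def guessing_p_value(n_questions: int, n_correct: int) -> float:
    """Chance of at least n_correct right answers under pure guessing.

    Assumes each question is an independent 50/50 guess (the null hypothesis).
    """
    total_outcomes = 2 ** n_questions
    # Count answer patterns with n_correct, n_correct + 1, ..., n_questions right.
    at_least_as_good = sum(
        comb(n_questions, k) for k in range(n_correct, n_questions + 1)
    )
    return at_least_as_good / total_outcomes

p = guessing_p_value(10, 8)
print(f"p-value = {p:.4f}")  # prints p-value = 0.0547 (that is, 56/1024)
```

So a student who is purely guessing would score 8/10 or better a bit more than 5% of the time.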

Correlation#

Suppose that I have sampled \(n\) individuals from my population, and for each I have measured the values \((x_i,y_i)\).

For example:

  • Penguins: \(x =\) body mass, \(y=\) beak length

  • Health: \(x=\) weight, \(y=\) breakfast days/week

  • Economics: \(x=\) years of education, \(y=\) salary

  • College admissions: \(x=\) SAT score, \(y=\) sophomore-year GPA

The correlation coefficient (\(R\)) of \(x\) and \(y\) is the slope of the best-fit line for the standardized datasets \(x_1,\ldots,x_n\) and \(y_1,\ldots,y_n\).

Correlation coefficient for Penguin body mass vs. beak length
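To make the definition concrete, here is a small sketch that standardizes two lists and then computes the least-squares slope through the standardized points. The data are made-up toy numbers, not the penguin data, and the helper names `standardize` and `correlation_as_slope` are assumptions for illustration. Standardized data have mean 0, so the best-fit line passes through the origin and its slope is \(\sum z_x z_y / \sum z_x^2\).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m = fmean(values)
    s = pstdev(values)  # population SD; using the sample SD gives the same slope
    return [(v - m) / s for v in values]

def correlation_as_slope(xs, ys):
    """Slope of the best-fit line for the standardized x's and y's."""
    zx = standardize(xs)
    zy = standardize(ys)
    # Least-squares slope for a line through the origin.
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)

# Made-up toy data with a clear upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(f"R = {correlation_as_slope(xs, ys):.2f}")  # prints R = 0.80
```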

Do students who study more sleep more?#

Let’s look at some data from the course survey. Here is a scatterplot of your self-reported “hours of sleep” vs. “hours of studying”:

Does it look to you like there is a positive association, negative association, or neither?

Correlation coefficient for sleep vs. study#

Here is the best-fit line. \(R = -.19\).

Is this a real trend or is it just noise?

Is this a significant correlation?#

How can we decide if the correlation coefficient \(R\) is large? How can we decide if it is significant?

To decide if \(R\) is large: compare to its max/min values, \(1\) and \(-1\).

But \(R\) can be significant (a real linear association) even when \(|R|\) is smaller than 1.

Correlation coefficient for Penguin body mass vs. beak length

Testing for correlation#

Suppose that we compute the correlation coefficient, and see that it has value \(R \neq 0\).

Is this just a coincidence? Or is the correlation a real pattern?

Let’s formulate this as a hypothesis testing problem:

  1. Null Hypothesis: there is no correlation.

  2. How can we compute the \(p\)-value?

    • We’ll use simulation!

Permuting the datapoints#

Let’s assume, from now on, that our \(x_i\) and \(y_i\) are standardized.

Suppose there really is a positive association between \((x_i,y_i)\).

Now, what if we randomly shuffle or permute the \(y_i\), so that they are matched to a random \(x_j\)?

There’s almost certainly no correlation now!
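Here is a quick sketch of the shuffling idea on simulated data (not the course data). It builds \(y\) values that are positively associated with \(x\) by construction, then shuffles the \(y\) values and recomputes the correlation. The data-generating recipe and the seed are arbitrary choices for illustration.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient R of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is reproducible

# Simulated data: y is x plus some noise, so there is a real positive association.
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [x + rng.gauss(0, 0.5) for x in xs]
r_original = correlation(xs, ys)

# Shuffle the y's so each one is matched to a random x.
shuffled_ys = ys[:]
rng.shuffle(shuffled_ys)
r_shuffled = correlation(xs, shuffled_ys)

print(f"R before shuffling: {r_original:.2f}")
print(f"R after shuffling:  {r_shuffled:.2f}")
```

The first value should be close to 1 and the second close to 0, because shuffling breaks the link between each \(x_i\) and its \(y_i\) while keeping the same set of values.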

Null hypothesis based on shuffling#

Suppose we have computed \(R\) from our data.

We want to know if \(R\) reflects an actual trend, or if it is just noise.

Null hypothesis: the pairs \((x_i,y_i)\) are paired up totally randomly.

  1. Question: Why is this null hypothesis saying that there is no real correlation?

  2. Question: In plain English, what is the \(p\)-value of \(R\) for this null hypothesis?

  • The null is saying there is no real correlation because if the pairing of \(x_i\) and \(y_i\) is arbitrary/random, there would almost certainly not be a linear relationship (as long as \(n > 2\); any two points always lie on a line!).

  • The \(p\)-value is the chance that you’d get this value of \(R\), or a more extreme one, if the data were paired up by randomly shuffling.

Permutation test for correlation#

Suppose we have computed \(R\) from our data.

We want to know if \(R\) reflects an actual trend, or if it is just noise.

Null hypothesis: the pairs \((x_i,y_i)\) are paired up totally randomly.

\(p\)-value: the chance that you’d get this value of \(R\), or a more extreme one, if the data were paired up by randomly shuffling.

We’ll compute the \(p\)-value using a Permutation test, a test based on simulation:

  1. Do some large number \(T\) of repetitions of the following experiment:

    a. Randomly permute/shuffle the \(y_i\) so that each is paired with some random \(x_j\)

    b. Compute and record the correlation coefficient for the shuffled dataset

  2. Make a histogram of the correlation coefficient values for all \(T\) trials.

  3. Decide the \(p\)-value for \(R\) based on how extreme it is relative to the histogram:

    • If \(R\) is positive, the \(p\)-value for \(R\) is the fraction of histogram values larger than \(R\) (or \(1/T\), if none are larger).

    • If \(R\) is negative, the \(p\)-value for \(R\) is the fraction of histogram values smaller than \(R\) (or \(1/T\), if none are smaller).
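The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the lecture's figures: it skips the histogram in step 2 and just counts the more-extreme trials. The function names, the default seed, and the choice to treat \(R = 0\) like a positive value are all assumptions.

```python
import math
import random

def correlation(xs, ys):
    """Correlation coefficient R of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Return (observed R, permutation-test p-value)."""
    rng = random.Random(seed)
    r_observed = correlation(xs, ys)
    shuffled = list(ys)
    more_extreme = 0
    for _ in range(trials):
        rng.shuffle(shuffled)              # step 1a: random re-pairing
        r = correlation(xs, shuffled)      # step 1b: R for the shuffled data
        if r_observed >= 0 and r > r_observed:
            more_extreme += 1
        elif r_observed < 0 and r < r_observed:
            more_extreme += 1
    # Step 3: fraction of more-extreme trials, or 1/T if there were none.
    return r_observed, max(more_extreme, 1) / trials

# Made-up data with a strong increasing trend.
xs = list(range(20))
ys = [x * x for x in xs]
r_obs, p_value = permutation_p_value(xs, ys, trials=2000)
print(f"R = {r_obs:.2f}, p-value = {p_value}")
```

With a trend this strong, essentially no shuffle beats the observed \(R\), so the reported \(p\)-value should be the floor \(1/T\).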

The penguins#

In our correlation lecture, we computed a correlation coefficient of \(R = 0.67\) for Gentoo Penguin body mass vs. beak length.

What is the \(p\)-value?

Permutation test for penguin correlation#

I ran a simulation with \(T = 10,000\) trials.

In each trial, I chose a new random permutation of the \(y_i\), and computed the correlation coefficient.

Here is trial 1:

Here is trial 2:

Etc.

Aggregating the trials#

Below is a histogram of the correlation coefficients from the \(T\) trials.

Using the permutation test, the \(p\)-value is at most \(.0001\):

We conclude that \(R\) is statistically significant at level \(\alpha = 0.05\) (or smaller).

GDP in 1960 vs. 2000#

In the correlation lecture we also computed the correlation coefficient for the GDP of countries in 1960 vs. 2000.

We can do a permutation test to check if this value of \(R\) is statistically significant.

P-value for GDP correlation#

For \(T = 10,000\) permutations, the \(p\)-value for the correlation coefficient of 1960 vs. 2000 GDP is \(1/10,000\):

Back to sleep vs. study#

Let’s return to our course survey data, about hours of sleep vs. hours of study.

We calculated \(R = -.19\) for this data. Is it significant?

P-value for sleep vs. study#

Via simulation, using a permutation test with \(T = 10,000\) trials, we find that the \(p\)-value is \(\approx 0.13\), so the trend is not statistically significant at level \(\alpha = .05\).

SAT score vs. study#

Let’s look at a different correlation in our course survey dataset: SAT score vs. average number of hours a week spent studying.

Correlation coefficient for SAT score vs. study#

The correlation coefficient is positive (unsurprisingly?): \(R = .3\).

Statistically significant?#

Again we do a permutation test with \(T = 10,000\) trials:

Statistically significant at level \(\alpha = 0.05\)!

Variability#

We computed \(R\) for our data. But what about its variability?

  • Is \(R\) being influenced by outliers?

  • If our data had been sampled a bit differently, would our value of \(R\) be dramatically different?

The permutation-test \(p\)-values only give us a sense of statistical significance of a correlation:

  • we can see how extreme \(R\) is relative to a random shuffling of the data (no correlation)

  • we cannot see how \(R\) would vary if we had a different sample from data with the same type of associative relationship.

How much variability in SAT score vs. study?#

The SAT score vs. study trend:

  • Would the value of \(R\) have changed (to a smaller or larger value) if we had sampled differently? There appear to be some outliers.

  • How much smaller or larger?

Estimating variability with simulation#

We can use simulation to estimate variability too.

  • In the best-case scenario, we have access to new samples from our population: we just collect more samples and compute a fresh correlation coefficient, a bunch of times.

  • But what if we don’t have access to new samples?

The following approach is called the bootstrap:

  1. Start with our dataset \((x_1,y_1),\ldots,(x_n,y_n)\).

  2. For some large number of trials \(T\):

    a. Sample \(n\) pairs independently with replacement from the dataset: \((x_{i_1},y_{i_1}),\ldots,(x_{i_n},y_{i_n})\)

    b. Compute and record the correlation coefficient of these pairs.

  3. Form a histogram of the correlation coefficients from all \(T\) trials.
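The bootstrap steps above can be sketched as follows. This is an illustration on simulated data, not the course survey data; the function names, seeds, and data-generating recipe are assumptions. Instead of drawing a histogram, the sketch summarizes the \(T\) correlation coefficients by their mean and standard deviation.

```python
import math
import random
from statistics import fmean, stdev

def correlation(xs, ys):
    """Correlation coefficient R of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlations(xs, ys, trials=10_000, seed=0):
    """Return a list of R values, one per bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    results = []
    while len(results) < trials:
        # Step 2a: sample n indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # R is undefined if all x's or all y's are equal; redraw
        results.append(correlation(bx, by))  # step 2b
    return results

# Simulated data with a moderate positive association.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

boot = bootstrap_correlations(xs, ys, trials=2000)
print(f"R on the original data:  {correlation(xs, ys):.2f}")
print(f"bootstrap mean of R:     {fmean(boot):.2f}")
print(f"bootstrap std dev of R:  {stdev(boot):.2f}")
```

Note that the pairs \((x_i, y_i)\) stay together in each resample. This is the key difference from the permutation test, which breaks the pairing on purpose.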

The Bootstrap#

To get a sense of the variability of \(R\):

  1. Start with our dataset \((x_1,y_1),\ldots,(x_n,y_n)\).

  2. For some large number of trials \(T\):

    a. Sample \(n\) pairs independently with replacement from the dataset: \((x_{i_1},y_{i_1}),\ldots,(x_{i_n},y_{i_n})\)

    b. Compute and record the correlation coefficient of these pairs.

  3. Form a histogram of the correlation coefficients from all \(T\) trials.

Question: why could this simulation give us a good sense of variability? Will it account for outliers?

  • There’s a reasonable chance that we’ll avoid any specific outlier in a trial when we sample with replacement: \(\Pr[\text{avoid } i] = \left(1-\frac{1}{n}\right)^n \approx e^{-1} \approx \frac{1}{3}\) when \(n\) is large.

Question: will this always give us a good idea of the variability of \(R\)?

  • Not necessarily; our sample could just be really weird.
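As a sanity check on the "chance of avoiding a specific point" calculation, here is a small sketch that evaluates \((1-\frac{1}{n})^n\) for a few values of \(n\) and compares it with a direct simulation. The sample size, trial count, and seed are arbitrary choices for illustration.

```python
import math
import random

def avoid_probability(n: int) -> float:
    """Chance that one specific datapoint is never picked in n draws with replacement."""
    return (1 - 1 / n) ** n

for n in (10, 100, 1000):
    print(f"n = {n:4d}: (1 - 1/n)^n = {avoid_probability(n):.4f}")
print(f"limit e^(-1)          = {math.exp(-1):.4f}")

def simulated_avoid_fraction(n: int, trials: int, seed: int = 0) -> float:
    """Fraction of bootstrap resamples of size n that never pick index 0."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        picked = {rng.randrange(n) for _ in range(n)}
        if 0 not in picked:
            misses += 1
    return misses / trials

sim = simulated_avoid_fraction(n=50, trials=5000)
print(f"simulated fraction for n = 50: {sim:.3f}")
```

So in a bit more than a third of the bootstrap trials, any given outlier is left out entirely, which lets us see how much \(R\) depends on it.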

Simulation for variability in SAT score vs. study#

If we do a bootstrap simulation with \(T= 10,000\) trials, we can see that the confidence interval around \(R\) is actually quite narrow:

This gives us some sense of the variability of \(R\).

Variability of the best-fit line#

Here you can see the original scatterplot and best-fit line, along with the lines corresponding to the mean +/- one standard deviation.

The line doesn’t change too much! Variability of \(R\) is reasonably low.

Recap#

  • Testing for correlation

    • Using simulation: permutation tests

  • Variability of correlation

    • Using simulation: “the bootstrap”