Correlation

STATS 60

Outline

  • Scatterplot: a display of two continuous variables

  • Key concept: correlation, a numerical summary of the scatterplot.

Reading: Chapter 6 of Intro Stats

Scatter plot

A scatterplot of daughter's height (Y-axis) versus mother's height (X-axis). The points form a cloud that trends upwards and to the right, indicating a positive linear association: taller mothers tend to have taller daughters. The data points are densest in the center of the range for both heights.
  • Pearson’s data again: heights of mothers and daughters recorded in the early 20th century

  • Clear pattern visible from the plot: mothers who are taller often give birth to taller daughters.

Scatter plot

  • A plot with two axes:

    1. X-axis is the independent variable;

    2. Y-axis is the dependent variable.

  • Graphical representation of the relationship between two variables.
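As a sketch, such a plot can be produced in R with plot(); the heights below are simulated, not Pearson's data:

```r
# Simulated mother/daughter heights in inches -- made-up numbers
set.seed(1)
mother <- rnorm(50, mean = 64, sd = 2.5)
daughter <- 0.5 * mother + rnorm(50, mean = 32, sd = 2)

# X-axis: independent variable; Y-axis: dependent variable
plot(mother, daughter,
     xlab = "Mother's height (in)", ylab = "Daughter's height (in)")
```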

What can a scatter plot tell us?

  • There is a positive association between mother's and daughter's heights: from the plot, daughters born to taller mothers tend to be taller.

Note

  • There could be a negative relationship.

  • There could be no relationship.

  • There could be a nonlinear relationship.

Different patterns

A grid of four scatterplots. (A, top left) Cost of women's clothes (X-axis) vs. cost of food (Y-axis): strong positive, mostly linear, with perhaps a few outliers above on the right-hand side. (B, top right) Average hourly wage (X-axis) vs. hours to earn an iPhone (Y-axis): notable exponential decay with tight scatter around the curve. (C, bottom left) Working hours (X-axis) vs. clothes index (Y-axis): weak positive linear relationship. (D, bottom right) Vacation days (X-axis) vs. food costs (Y-axis): no strong pattern.

A grid of four scatterplots showing different types of correlations.

Correlation

  • A numerical summary of a scatterplot, i.e. a pair of datasets.

  • Captures linear association between the datasets.

  • If there is a strong association between two variables, then knowing one can help a lot in predicting the other.

  • When there is a weak association, information about one variable does not help much in guessing the other.

Correlation coefficient

The correlation coefficient, \(r\), is a measure of the strength of this association; it always lies between \(-1\) and \(+1\).

  • \(r=+1\) if the variables are perfectly positively associated.

  • \(r=-1\) if the variables are perfectly negatively associated.

Perfectly positively correlated, \(r=+1\)

A scatterplot showing a perfect positive linear relationship. All points lie exactly on a straight line sloping upwards from left to right, demonstrating a correlation coefficient of +1.
cor(X, Y)
[1] 1

Perfectly negatively correlated, \(r=-1\)

A scatterplot showing a perfect negative linear relationship. All points lie exactly on a straight line sloping downwards from left to right, demonstrating a correlation coefficient of -1.
cor(X, Z)
[1] -1

Uncorrelated variables (no relation), \(r\) near 0

A scatterplot of two uncorrelated variables. The points are randomly scattered with no discernible upward or downward trend, indicating a correlation coefficient near 0.
cor(X, W)
[1] 0.0936

Positive but not perfect

A scatterplot showing a positive but not perfect correlation. The points form a cloud that trends upwards and to the right, but with noticeable random scatter around the underlying linear trend.
cor(X, U)
[1] 0.587

Standardized units

  • Given a list \(X\), we define the standardized list as the list with its mean subtracted and rescaled to have standard deviation 1:
Z_X = (X - mean(X)) / sd(X)
Z_Y = (Y - mean(Y)) / sd(Y)
c(mean(Z_X), sd(Z_X))
[1] 0.0000000000000000444 1.0000000000000000000
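R's built-in scale() performs the same standardization (it returns a one-column matrix, hence the as.numeric()); a quick check on a made-up list:

```r
X <- c(1, 4, 6, 9, 3)            # any numeric list
Z_X <- (X - mean(X)) / sd(X)     # standardize by hand

# scale() subtracts the mean and divides by the sd, matching the formula above
stopifnot(all.equal(as.numeric(scale(X)), Z_X))
```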

Computing \(r\), the correlation coefficient

  • The formula for correlation is easily expressed in terms of the standardized variables.

  • The correlation of X and Y is (almost) the average of the product of Z_X and Z_Y

\[ r = \frac{1}{n-1} \sum_{i=1}^n Z_{X,i} * Z_{Y,i} \]

Why isn’t it exactly 1?

  • Recall \(X\) and \(Y\) were perfectly correlated
cor(X, Y)
[1] 1
mean(Z_X * Z_Y) 
[1] 0.98

Doh! It’s the denominator again…

  • The last line should really be dividing by \(n-1\) as well
n = length(X)
c(sum(Z_X * Z_Y) / (n - 1), cor(X, Y))
[1] 1 1

Why bother with \(n\) vs \(n-1\)?

  • There are good reasons statisticians do this, but you’ll have to trust us (i.e. statisticians…)

  • Most noticeable when \(n\) is not very big…

  • We won’t emphasize this \(n\) vs \(n-1\) in the denominator too much, but it will come up again in regression.
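The two versions differ only by the factor \((n-1)/n\), so the gap fades as \(n\) grows; a small illustration with simulated data:

```r
set.seed(2)
for (n in c(5, 1000)) {
  X <- rnorm(n); Y <- X                     # perfectly correlated
  Z_X <- (X - mean(X)) / sd(X)
  Z_Y <- (Y - mean(Y)) / sd(Y)
  # dividing by n (via mean) vs dividing by n - 1
  cat(n, ":", mean(Z_X * Z_Y), "vs", sum(Z_X * Z_Y) / (n - 1), "\n")
}
```

For \(n = 5\) the mean-based version gives \((n-1)/n = 0.8\) instead of 1; for \(n = 1000\) it is already 0.999.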

Summation notation

  • The entries of the lists \(Z_X, Z_Y, Z_{XY}\) are:

\[\begin{aligned} Z_{X,i} &= \frac{X_i - \bar{X}}{\text{SD}(X)} \\ Z_{Y,i} &= \frac{Y_i - \bar{Y}}{\text{SD}(Y)} \\ Z_{XY,i} &= Z_{X,i} \times Z_{Y,i} \end{aligned}\]

  • Then,

\[r = r(X,Y) = \frac{1}{n-1} \sum_{i=1}^n Z_{XY,i}.\]

  • Another way:

\[r = r(X,Y) = \frac{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]

  • Yet another way:

\[ r(X, Y) = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2} * \sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}} \]

  • One more way:

\[r = r(X,Y) = \frac{\frac{n}{n-1} (\overline{XY} - \bar{X} * \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]

Example

  • Suppose our datasets are X = [1,4,6,9,3], Y = [-2,2,8,0,1].

\[\begin{aligned} \bar{X} &= 4.6 & \text{SD}(X) &= 3.05 \\ \bar{Y} &= 1.8 & \text{SD}(Y) &= 3.77 \\ \end{aligned}\]

  • The only new thing to compute is \(\overline{XY}\)

\[XY = [-2,8,48,0,3], \qquad \overline{XY}=(-2+8+48+0+3)/5=11.4\]

Therefore (note the 5/4…) \[ r = \frac{\frac{5}{4}(11.4 - 4.6 * 1.8)}{3.05 * 3.77} \approx 0.34\]

X = c(1,4,6,9,3)
Y = c(-2,2,8,0,1)
cor(X, Y)
[1] 0.339
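The by-hand computation can also be checked directly in R using the "one more way" formula:

```r
X <- c(1, 4, 6, 9, 3)
Y <- c(-2, 2, 8, 0, 1)
n <- length(X)

# r = (n/(n-1)) * (mean(XY) - mean(X)*mean(Y)) / (SD(X) * SD(Y))
r_by_hand <- (n / (n - 1)) * (mean(X * Y) - mean(X) * mean(Y)) /
  (sd(X) * sd(Y))

stopifnot(all.equal(r_by_hand, cor(X, Y)))
round(r_by_hand, 3)   # 0.339
```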

Properties of correlation

  • Correlation is unitless.

  • Changing units of \(X\) or \(Y\) does not change the correlation.

  • Correlation does not change if we interchange \(X\) and \(Y\): it is symmetric.

c(cor(mother, daughter), cor(daughter, mother))
[1] 0.491 0.491
  • Like mean and sd, cor is not robust to outliers.
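These properties are easy to check numerically; a sketch with simulated heights (made-up numbers, not Pearson's data):

```r
set.seed(3)
mother <- rnorm(100, 64, 2.5)
daughter <- 0.5 * mother + rnorm(100, 32, 2)
r <- cor(mother, daughter)

# Unitless: converting inches -> centimeters leaves r unchanged
stopifnot(all.equal(r, cor(2.54 * mother, 2.54 * daughter)))

# Symmetric: swapping X and Y leaves r unchanged
stopifnot(all.equal(r, cor(daughter, mother)))

# Not robust: a single wild point can move r a lot
c(r, cor(c(mother, 80), c(daughter, 20)))
```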

Correlation is symmetric

Two scatterplots illustrating the symmetric property of correlation. The left plot shows daughter's height vs mother's height. The right plot shows mother's height vs daughter's height. Both plots display the same positive upward trend, even with the axes swapped.

X=mother, Y=daughter


X=daughter, Y=mother
  • Swapping the \(X\) (independent) and \(Y\) (dependent) axes does not change the fundamental relation.

Correlation is not causation

A spurious correlation line chart from Tyler Vigen's website. It shows two lines over time that are highly correlated but not causally related: 'The distance between Saturn and the sun' and 'Google searches for how to make baby'. Both lines show a similar pattern of rising and then falling over the same period.

A spurious correlation chart showing the relationship between Saturn’s distance from the sun and Google searches for ‘how to make baby’.

Tyler Vigen

Scatter plot for Saturn and “How to make a baby”

Two scatterplots showing the spurious correlation between Google searches for 'how to make a baby' (X-axis) and the distance of Saturn from the sun (Y-axis) on the left, and the variables swapped on the right. Both plots show a strong positive association.

X:Google, Y:Saturn


X:Saturn, Y:Google

Tyler Vigen

Another fun one

Another spurious correlation line chart from Tyler Vigen's website, showing a high correlation between 'GDP per capita in Canada' and 'Gasoline prices in the US'. Both lines trend upwards over the same time period.

A spurious correlation chart showing the relationship between Canadian GDP and US gasoline prices.

Correlation and causation

  • The last two examples were found by searching through many pairs of variables: they clearly demonstrate that correlation is not causation.

Example Shoe Size and Reading Ability

  • Within schools in a large school district, researchers collected students’ reading ability as measured by some standardized test. They also collected their shoe size.

  • Do we expect Shoe Size and Reading Ability to be uncorrelated? positively correlated? negatively correlated?

  • Explain.

Correlation captures linear behavior

  • Variables can be associated without being linearly associated.
A scatterplot showing a perfect non-linear (quadratic) relationship. The points form a clear inverted parabola. Despite the strong relationship, the linear correlation coefficient is small because the association is not linear.
cor(X, Y)
[1] 0.35
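An even starker sketch: if X is symmetric around 0 and Y = X², then Y is a perfect function of X, yet the linear correlation is essentially zero:

```r
X <- seq(-3, 3, by = 0.1)   # symmetric around 0
Y <- X^2                    # perfect (nonlinear) relationship

cor(X, Y)                   # essentially 0, up to floating-point error
```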

Bivariate histogram

  • All of our numeric summaries so far can be computed on datasets OR histograms

  • Correlation is no different, but we need to define a bivariate histogram.

Bivariate histogram: breaking the plane into bins

A 2D histogram, or heatmap, of Pearson's mother-daughter height data. The X-axis represents the mother's height and the Y-axis represents the daughter's height, both in inches. The plane is divided into 1-inch square bins. The color of each bin, from light yellow to dark blue, indicates the count of data points falling within it, with darker colors representing higher frequencies. The counts are also printed as numbers within each bin. The highest counts are concentrated along a diagonal line, visually confirming the positive correlation where taller mothers tend to have taller daughters.
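The binning step can be sketched in R with cut() and table(); the heights below are simulated, and the 1-inch bins are chosen to mimic the figure:

```r
set.seed(4)
mother <- rnorm(200, 64, 2.5)
daughter <- 0.5 * mother + rnorm(200, 32, 2)

# 1-inch bins on each axis; each cell counts the pairs falling in it
breaks <- 55:75
counts <- table(cut(mother, breaks), cut(daughter, breaks))

# Proportions: each cell's share of the binned pairs
props <- counts / sum(counts)
```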

Bivariate histogram: assign proportions to bins

A 2D histogram of Pearson's mother-daughter height data showing proportions instead of counts. The color of each 1-inch square bin corresponds to the percentage of the total dataset it contains, with darker shades of blue-green indicating higher proportions. Percentage labels are shown in bins with more than 0.2% of the data. The distribution is concentrated along the positive diagonal, again illustrating that taller mothers tend to have taller daughters.

Bivariate histogram: volumes vs. areas

A 3D visualization of a bivariate histogram. The X and Y axes form a grid on the floor, and rectangular columns rise from this grid. The volume of each column represents the proportion of data in that bin, analogous to how area represents proportion in a standard 1D histogram.

A 3D visualization of a bivariate histogram where column volumes represent proportions.
  • Replace bars with columns

  • \(\implies\) volumes are proportions now…

Correlation of a (bivariate) histogram

  • Just like mean, median, sd, quantile we could compute cor from such a bivariate histogram…

  • We will not dwell on the details (multivariate calculus…)

  • Moral: yet again, a histogram captures a lot of interesting things about the data…

Summary

  • We introduced correlation, a unitless numerical summary of a scatterplot of X and Y.

  • When the two variables X and Y are linearly related, their correlation quantifies the strength of this linear association.

  • It can be computed from the standardized variables scale(X) and scale(Y).