Correlation

STATS 60

Outline

  • Scatterplot: a display of two continuous variables

  • Key concept: correlation, a numerical summary of the scatterplot.

Reading: Chapter 6 of Intro Stats

Scatter plot

A scatterplot of daughter's height (Y-axis) versus mother's height (X-axis). The points form a cloud that trends upwards and to the right, indicating a positive linear association: taller mothers tend to have taller daughters. The data points are densest in the center of the range for both heights.
  • Pearson’s data again: heights of mothers and daughters recorded in the early 20th century

  • Clear pattern visible from the plot: mothers who are taller often give birth to taller daughters.

Scatter plot

  • A plot with two axes:

    1. X-axis is the independent variable;

    2. Y-axis is the dependent variable.

  • Graphical representation of the relationship between two variables.
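As a sketch, such a plot can be produced in R with plot(); the heights below are simulated, not Pearson's data:

```r
# Simulated mother/daughter heights in inches -- made-up numbers
set.seed(1)
mother <- rnorm(50, mean = 64, sd = 2.5)
daughter <- 0.5 * mother + rnorm(50, mean = 32, sd = 2)

# X-axis: independent variable; Y-axis: dependent variable
plot(mother, daughter,
     xlab = "Mother's height (in)", ylab = "Daughter's height (in)")
```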

What can a scatter plot tell us?

  • There is a positive association between mother's and daughter's heights: from the plot, daughters born to taller mothers tend to be taller.

Note

  • There could be a negative relationship.

  • There could be no relationship.

  • There could be a nonlinear relationship.

Different patterns

A grid of four scatterplots. (A, top left) Cost of women's clothes (X-axis) vs. cost of food (Y-axis): strong positive, mostly linear, with perhaps a few outliers above on the right-hand side. (B, top right) Average hourly wage (X-axis) vs. hours to earn an iPhone (Y-axis): notable exponential decay with tight scatter around the curve. (C, bottom left) Working hours (X-axis) vs. clothes index (Y-axis): weak positive linear relationship. (D, bottom right) Vacation days (X-axis) vs. food costs (Y-axis): no strong pattern.

A grid of four scatterplots showing different types of correlations.

Correlation

  • A numerical summary of a scatterplot, i.e. a pair of datasets.

  • Captures linear association between the datasets.

  • If there is a strong association between two variables, then knowing one can help a lot in predicting the other.

  • When there is a weak association, information about one variable does not help much in guessing the other.

Correlation coefficient

The correlation coefficient, \(r\), is a measure of the strength of this association; it always lies between \(-1\) and \(+1\).

  • \(r=+1\) if the variables are perfectly positively associated.

  • \(r=-1\) if the variables are perfectly negatively associated.

Perfectly positively correlated, \(r=+1\)

A scatterplot showing a perfect positive linear relationship. All points lie exactly on a straight line sloping upwards from left to right, demonstrating a correlation coefficient of +1.
cor(X, Y)
[1] 1

Perfectly negatively correlated, \(r=-1\)

A scatterplot showing a perfect negative linear relationship. All points lie exactly on a straight line sloping downwards from left to right, demonstrating a correlation coefficient of -1.
cor(X, Z)
[1] -1

Uncorrelated variables (no relation), \(r\) near 0

A scatterplot of two uncorrelated variables. The points are randomly scattered with no discernible upward or downward trend, indicating a correlation coefficient near 0.
cor(X, W)
[1] 0.0936

Positive but not perfect

A scatterplot showing a positive but not perfect correlation. The points form a cloud that trends upwards and to the right, but with noticeable random scatter around the underlying linear trend.
cor(X, U)
[1] 0.587

Standardized units

  • Given a list \(X\), we define the standardized list as the list with its mean subtracted and rescaled to have standard deviation 1:
Z_X = (X - mean(X)) / sd(X)
Z_Y = (Y - mean(Y)) / sd(Y)
c(mean(Z_X), sd(Z_X))
[1] 0.0000000000000000444 1.0000000000000000000
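R's built-in scale() performs the same standardization (it returns a one-column matrix, hence the as.numeric()); a quick check on a made-up list:

```r
X <- c(1, 4, 6, 9, 3)            # any numeric list
Z_X <- (X - mean(X)) / sd(X)     # standardize by hand

# scale() subtracts the mean and divides by the sd, matching the formula above
stopifnot(all.equal(as.numeric(scale(X)), Z_X))
```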

Computing \(r\), the correlation coefficient

  • The formula for correlation is easily expressed in terms of the standardized variables.

  • The correlation of X and Y is (almost) the average of the product of Z_X and Z_Y

\[ r = \frac{1}{n-1} \sum_{i=1}^n Z_{X,i} * Z_{Y,i} \]

Why isn’t it exactly 1?

  • Recall \(X\) and \(Y\) were perfectly correlated
cor(X, Y)
[1] 1
mean(Z_X * Z_Y) 
[1] 0.98

Doh! It’s the denominator again…

  • The last line should really be dividing by \(n-1\) as well
n = length(X)
c(sum(Z_X * Z_Y) / (n - 1), cor(X, Y))
[1] 1 1

Why bother with \(n\) vs \(n-1\)?

  • There are good reasons statisticians do this, but you’ll have to trust us (i.e. statisticians…)

  • Most noticeable when \(n\) is not very big…

  • We won’t emphasize this \(n\) vs \(n-1\) in the denominator too much, but it will come up again in regression.
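The two versions differ only by the factor \((n-1)/n\), so the gap fades as \(n\) grows; a small illustration with simulated data:

```r
set.seed(2)
for (n in c(5, 1000)) {
  X <- rnorm(n); Y <- X                     # perfectly correlated
  Z_X <- (X - mean(X)) / sd(X)
  Z_Y <- (Y - mean(Y)) / sd(Y)
  # dividing by n (via mean) vs dividing by n - 1
  cat(n, ":", mean(Z_X * Z_Y), "vs", sum(Z_X * Z_Y) / (n - 1), "\n")
}
```

For \(n = 5\) the mean-based version gives \((n-1)/n = 0.8\) instead of 1; for \(n = 1000\) it is already 0.999.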

Summation notation

  • The entries of the lists \(Z_X, Z_Y, Z_{XY}\) are:

\[\begin{aligned} Z_{X,i} &= \frac{X_i - \bar{X}}{\text{SD}(X)} \\ Z_{Y,i} &= \frac{Y_i - \bar{Y}}{\text{SD}(Y)} \\ Z_{XY,i} &= Z_{X,i} \times Z_{Y,i} \end{aligned}\]

  • Then,

\[r = r(X,Y) = \frac{1}{n-1} \sum_{i=1}^n Z_{XY,i}.\]

  • Another way:

\[r = r(X,Y) = \frac{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]

  • Yet another way:

\[ r(X, Y) = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2} * \sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}} \]

  • One more way:

\[r = r(X,Y) = \frac{\frac{n}{n-1} (\overline{XY} - \bar{X} * \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]

Example

  • Suppose our datasets are X = [1,4,6,9,3], Y = [-2,2,8,0,1].

\[\begin{aligned} \bar{X} &= 4.6 & \text{SD}(X) &= 3.05 \\ \bar{Y} &= 1.8 & \text{SD}(Y) &= 3.77 \\ \end{aligned}\]

  • The only new thing to compute is \(\overline{XY}\)

\[XY = [-2,8,48,0,3], \qquad \overline{XY}=(-2+8+48+0+3)/5=11.4\]

Therefore (note the 5/4…) \[ r = \frac{\frac{5}{4}(11.4 - 4.6 * 1.8)}{3.05 * 3.77} \approx 0.34\]

X = c(1,4,6,9,3)
Y = c(-2,2,8,0,1)
cor(X, Y)
[1] 0.339
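The by-hand computation can also be checked directly in R using the "one more way" formula:

```r
X <- c(1, 4, 6, 9, 3)
Y <- c(-2, 2, 8, 0, 1)
n <- length(X)

# r = (n/(n-1)) * (mean(XY) - mean(X)*mean(Y)) / (SD(X) * SD(Y))
r_by_hand <- (n / (n - 1)) * (mean(X * Y) - mean(X) * mean(Y)) /
  (sd(X) * sd(Y))

stopifnot(all.equal(r_by_hand, cor(X, Y)))
round(r_by_hand, 3)   # 0.339
```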

Properties of correlation

  • Correlation is unitless.

  • Changing units of \(X\) or \(Y\) does not change the correlation.

  • Correlation does not change if we interchange \(X\) and \(Y\): it is symmetric.

c(cor(mother, daughter), cor(daughter, mother))
[1] 0.491 0.491
  • Like mean and sd, cor is not robust to outliers.
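These properties are easy to check numerically; a sketch with simulated heights (made-up numbers, not Pearson's data):

```r
set.seed(3)
mother <- rnorm(100, 64, 2.5)
daughter <- 0.5 * mother + rnorm(100, 32, 2)
r <- cor(mother, daughter)

# Unitless: converting inches -> centimeters leaves r unchanged
stopifnot(all.equal(r, cor(2.54 * mother, 2.54 * daughter)))

# Symmetric: swapping X and Y leaves r unchanged
stopifnot(all.equal(r, cor(daughter, mother)))

# Not robust: a single wild point can move r a lot
c(r, cor(c(mother, 80), c(daughter, 20)))
```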

Correlation is symmetric

Two scatterplots illustrating the symmetric property of correlation. The left plot shows daughter's height vs mother's height. The right plot shows mother's height vs daughter's height. Both plots display the same positive upward trend, even with the axes swapped.

X=mother, Y=daughter


X=daughter, Y=mother
  • Swapping the \(X\) (independent) and \(Y\) (dependent) axes does not change the fundamental relation.

Correlation is not causation

A spurious correlation line chart from Tyler Vigen's website. It shows two lines over time that are highly correlated but not causally related: 'The distance between Saturn and the sun' and 'Google searches for how to make baby'. Both lines show a similar pattern of rising and then falling over the same period.

A spurious correlation chart showing the relationship between Saturn’s distance from the sun and Google searches for ‘how to make baby’.

Tyler Vigen

Scatter plot for Saturn and “How to make a baby”

Two scatterplots showing the spurious correlation between Google searches for 'how to make a baby' (X-axis) and the distance of Saturn from the sun (Y-axis) on the left, and the variables swapped on the right. Both plots show a strong positive association.

X:Google, Y:Saturn


X:Saturn, Y:Google

Tyler Vigen

Another fun one

Another spurious correlation line chart from Tyler Vigen's website, showing a high correlation between 'GDP per capita in Canada' and 'Gasoline prices in the US'. Both lines trend upwards over the same time period.

A spurious correlation chart showing the relationship between Canadian GDP and US gasoline prices.

Correlation and causation

  • The last two examples were found by searching through many pairs of variables: they clearly demonstrate that correlation is not causation.

Example Shoe Size and Reading Ability

  • Within schools in a large school district, researchers collected students’ reading ability as measured by some standardized test. They also collected their shoe size.

  • Do we expect Shoe Size and Reading Ability to be uncorrelated? positively correlated? negatively correlated?

  • Explain.

Correlation captures linear behavior

  • Variables can be associated without being linearly associated.
A scatterplot showing a perfect non-linear (quadratic) relationship. The points form a clear inverted parabola. Despite the strong relationship, the linear correlation coefficient is small because the association is not linear.
cor(X, Y)
[1] 0.35
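An even starker sketch: if X is symmetric around 0 and Y = X², then Y is a perfect function of X, yet the linear correlation is essentially zero:

```r
X <- seq(-3, 3, by = 0.1)   # symmetric around 0
Y <- X^2                    # perfect (nonlinear) relationship

cor(X, Y)                   # essentially 0, up to floating-point error
```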

Bivariate histogram

  • All of our numeric summaries so far can be computed on datasets OR histograms

  • Correlation is no different, but we need to define a bivariate histogram.

Bivariate histogram: breaking the plane into bins

A 2D histogram, or heatmap, of Pearson's mother-daughter height data. The X-axis represents the mother's height and the Y-axis represents the daughter's height, both in inches. The plane is divided into 1-inch square bins. The color of each bin, from light yellow to dark blue, indicates the count of data points falling within it, with darker colors representing higher frequencies. The counts are also printed as numbers within each bin. The highest counts are concentrated along a diagonal line, visually confirming the positive correlation where taller mothers tend to have taller daughters.
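The binning step can be sketched in R with cut() and table(); the heights below are simulated, and the 1-inch bins are chosen to mimic the figure:

```r
set.seed(4)
mother <- rnorm(200, 64, 2.5)
daughter <- 0.5 * mother + rnorm(200, 32, 2)

# 1-inch bins on each axis; each cell counts the pairs falling in it
breaks <- 55:75
counts <- table(cut(mother, breaks), cut(daughter, breaks))

# Proportions: each cell's share of the binned pairs
props <- counts / sum(counts)
```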

Bivariate histogram: assign proportions to bins

A 2D histogram of Pearson's mother-daughter height data showing proportions instead of counts. The color of each 1-inch square bin corresponds to the percentage of the total dataset it contains, with darker shades of blue-green indicating higher proportions. Percentage labels are shown in bins with more than 0.2% of the data. The distribution is concentrated along the positive diagonal, again illustrating that taller mothers tend to have taller daughters.

Bivariate histogram: volumes vs. areas

A 3D visualization of a bivariate histogram. The X and Y axes form a grid on the floor, and rectangular columns rise from this grid. The volume of each column represents the proportion of data in that bin, analogous to how area represents proportion in a standard 1D histogram.

A 3D visualization of a bivariate histogram where column volumes represent proportions.
  • Replace bars with columns

  • \(\implies\) volumes are proportions now…

Correlation of a (bivariate) histogram

  • Just like mean, median, sd, quantile we could compute cor from such a bivariate histogram…

  • We will not dwell on the details (multivariate calculus…)

  • Moral: yet again, a histogram captures a lot of interesting things about the data…

Summary

  • We introduced correlation, a unitless numerical summary of a scatterplot of X and Y.

  • When the two variables X and Y are linearly related, their correlation quantifies the strength of this linear association.

  • It can be computed from the standardized variables scale(X) and scale(Y).