Scatterplot: a display of two continuous variables
Key concept: correlation, a numerical summary of the scatterplot.
Pearson’s data again: heights of mothers and daughters recorded in early 20th century
Clear pattern visible from the plot: mothers who are taller often give birth to taller daughters.
A plot with two axes:
X-axis is the independent variable;
Y-axis is the dependent variable.
Graphical representation of the relationship between two variables.
Mother and daughter: from the plot, daughters born to taller mothers tend to be taller.
There could be a negative relationship.
There could be no relationship.
There could be a nonlinear relationship.

A numerical summary of a scatterplot, i.e. a pair of datasets.
Captures linear association between the datasets.
If there is a strong association between two variables, then knowing one can help a lot in predicting the other.
When there is a weak association, information about one variable does not help much in guessing the other.
The correlation coefficient, \(r\), is a measure of the strength of this association.
\(r=+1\) if the variables are perfectly positively associated.
\(r=-1\) if the variables are perfectly negatively associated.
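A minimal sketch of these two extreme cases, using NumPy with made-up data (the deck itself uses R; this is just an illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data for illustration

# Perfect positive association: Y is an increasing linear function of X.
r_pos = np.corrcoef(x, 3 * x + 1)[0, 1]
# Perfect negative association: Y is a decreasing linear function of X.
r_neg = np.corrcoef(x, -2 * x + 7)[0, 1]

print(r_pos, r_neg)   # 1.0 -1.0
```

Any exactly linear relationship, whatever its slope and intercept, gives \(r = +1\) (increasing) or \(r = -1\) (decreasing).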
Standardized variables have a mean of 0 and an SD of 1. The formula for correlation is easily expressed in terms of the standardized variables.
The correlation of \(X\) and \(Y\) is (almost) the average of the products \(Z_{X,i} \times Z_{Y,i}\):
\[ r = \frac{1}{n-1} \sum_{i=1}^n Z_{X,i} * Z_{Y,i} \]
There are good reasons statisticians do this, but you’ll have to trust us (i.e. statisticians…)
Most noticeable when \(n\) is not very big…
We won’t emphasize this \(n\) vs \(n-1\) in the denominator too much, but it will come up again in regression.
\[\begin{aligned} Z_{X,i} &= \frac{X_i - \bar{X}}{\text{SD}(X)} \\ Z_{Y,i} &= \frac{Y_i - \bar{Y}}{\text{SD}(Y)} \\ Z_{XY,i} &= Z_{X,i} \times Z_{Y,i} \end{aligned}\]
\[r = r(X,Y) = \frac{1}{n-1} \sum_{i=1}^n Z_{XY,i}.\]
\[r = r(X,Y) = \frac{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]
\[ r(X, Y) = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2} * \sqrt{\sum_{i=1}^n (Y_i - \bar{Y})^2}} \]
\[r = r(X,Y) = \frac{\frac{n}{n-1} (\overline{XY} - \bar{X} * \bar{Y})}{\text{SD}(X) \text{SD}(Y)}\]
Example: \(X = [1,4,6,9,3]\), \(Y = [-2,2,8,0,1]\).
\[\begin{aligned} \bar{X} &= 4.6 & \text{SD}(X) &= 3.05 \\ \bar{Y} &= 1.8 & \text{SD}(Y) &= 3.77 \end{aligned}\]
\[XY = [-2,8,48,0,3], \qquad \overline{XY}=(-2+8+48+0+3)/5=11.4\]
Therefore (note the 5/4…) \[ r = \frac{\frac{5}{4}(11.4 - 4.6 * 1.8)}{3.05 * 3.77} \approx 0.34\]
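The arithmetic above can be checked in code. A sketch in NumPy using the same \(X\) and \(Y\), computing \(r\) both by the shortcut formula and as the (almost) average product of z-scores:

```python
import numpy as np

# Data from the worked example above.
x = np.array([1, 4, 6, 9, 3], dtype=float)
y = np.array([-2, 2, 8, 0, 1], dtype=float)
n = len(x)

# Shortcut formula: r = (n/(n-1)) * (mean(XY) - mean(X)*mean(Y)) / (SD(X)*SD(Y)),
# with SDs using the n-1 denominator (ddof=1).
xy_bar = np.mean(x * y)                                        # 11.4
r_short = (n / (n - 1)) * (xy_bar - x.mean() * y.mean()) \
          / (x.std(ddof=1) * y.std(ddof=1))

# z-score version: r = (1/(n-1)) * sum of Z_X * Z_Y.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r_z = np.sum(zx * zy) / (n - 1)

print(round(r_short, 2))                            # 0.34
print(np.isclose(r_short, r_z))                     # True
print(np.isclose(r_short, np.corrcoef(x, y)[0, 1])) # True
```

All three routes (shortcut formula, z-scores, and NumPy's built-in `corrcoef`) agree.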
Correlation is unitless.
Changing units of \(X\) or \(Y\) does not change the correlation.
Correlation does not change if we interchange \(X\) and \(Y\): it is symmetric.
Like the mean and SD, correlation is not robust to outliers.
(Plots: X = mother, Y = daughter; and X = daughter, Y = mother. The scatterplots are mirror images, and the correlation is the same.)
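These three properties are easy to verify numerically. A sketch with simulated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(0)          # hypothetical simulated data
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

r = np.corrcoef(x, y)[0, 1]

# Unitless: a positive change of units (e.g. inches -> centimeters) leaves r unchanged.
r_units = np.corrcoef(2.54 * x + 10.0, y)[0, 1]

# Symmetric: interchanging X and Y gives the same r.
r_swap = np.corrcoef(y, x)[0, 1]

# Not robust: a single wild outlier can drag r far from its original value.
r_outlier = np.corrcoef(np.append(x, 30.0), np.append(y, -60.0))[0, 1]

print(np.isclose(r, r_units), np.isclose(r, r_swap))   # True True
print(round(r, 2), "vs", round(r_outlier, 2))
```

One extreme point is enough to flip a strong positive correlation to a negative one.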



Shoe Size and Reading Ability: within schools in a large school district, researchers collected students' reading ability as measured by a standardized test. They also collected their shoe sizes.
Do we expect Shoe Size and Reading Ability to be uncorrelated? positively correlated? negatively correlated?
Explain.
All of our numeric summaries so far can be computed on datasets OR histograms
Correlation is no different, but we need to define a bivariate histogram.

Replace bars with columns
\(\implies\) volumes are proportions now…
Just like the mean, median, SD, and quantiles, we could compute cor from such a bivariate histogram…
We will not dwell on the details (multivariate calculus…)
Moral: yet again, a histogram captures a lot of interesting things about the data…
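As a sketch of the idea (the slides omit the details), correlation can be computed from a bivariate histogram by treating each cell's center as a data value weighted by the cell's proportion. The histogram below is made up for illustration, and the weighted formulas are the population version (denominator \(n\) rather than \(n-1\)):

```python
import numpy as np

# Hypothetical bivariate histogram: cell centers plus the proportion of the
# data in each cell (the "volumes", which sum to 1).
x_centers = np.array([0.0, 1.0, 2.0])
y_centers = np.array([0.0, 1.0])
# p[i, j] = proportion of observations in cell (x_centers[i], y_centers[j])
p = np.array([[0.20, 0.05],
              [0.10, 0.25],
              [0.05, 0.35]])
assert np.isclose(p.sum(), 1.0)

X, Y = np.meshgrid(x_centers, y_centers, indexing="ij")

# Weighted means, SDs, and correlation (population version).
mx = np.sum(p * X)
my = np.sum(p * Y)
sx = np.sqrt(np.sum(p * (X - mx) ** 2))
sy = np.sqrt(np.sum(p * (Y - my) ** 2))
r_hist = np.sum(p * (X - mx) * (Y - my)) / (sx * sy)

print(round(r_hist, 2))
```

The same weighting trick gives the mean, median, SD, and quantiles from a histogram; correlation just needs both coordinates of each cell.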
We introduced correlation, a unitless numerical summary of a scatterplot of X and Y.
When the two variables X and Y are linearly related, their correlation quantifies the strength of this linear association.
Can be computed based on the standardized variables scale(X) and scale(Y).