options(repr.plot.width=5, repr.plot.height=3) set.seed(0)
It is a course on applied statistics.
Hands-on: we use R, an open-source statistics software environment.
Course notes will be jupyter notebooks.
We will start out with a review of introductory statistics to see
R in action.
Main topic is (linear) regression models: these are the bread and butter of applied statistics.
A regression model is a model of the relationships between some covariates (predictors) and an outcome.
Specifically, regression is a model of the average outcome given or having fixed the covariates.
We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.
One of our goals is to understand height of the daughter,
D, knowing the height of the
A mathematical model might look like
D = f(M) + \varepsilon$$
where $f$ gives the average height of the daughter
of a mother of height
$\varepsilon$ is error: not every daughter has the same height.
A statistical question: is there any relationship between covariates and outcomes -- is $f$ just a constant?
Let's create a plot of the heights of the mother/daughter pairs. The data is in an
R package that can be downloaded
from CRAN with the command:
If the package is not installed, then you will get an error message when calling
library(alr3) data(heights) M = heights$Mheight D = heights$Dheight plot(M, D, pch = 23, bg = "red", cex = 2)
Loading required package: car Loading required package: carData
In the first part of this course we'll talk about fitting a line to this data. Let's do that and remake the plot, including this "best fitting line".
plot(M, D, pch = 23, bg = "red", cex = 2) height.lm = lm(D ~ M) abline(height.lm, lwd = 3, col = "yellow")
How do we find this line? With a model.
We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$
This model is linear in $(\beta_0, \beta_1)$, the intercept and the coefficient of
M (the mother's height), it is a
simple linear regression model.
Another model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.
Also linear (in $(\beta_0, \beta_1, \beta_2, \beta_3)$, the coefficients of $1,M,M^2,F$).
Which model is better? We will need a tool to compare models... more to come later.
Our example here was rather simple: we only had one independent variable.
Independent variables are sometimes called features or covariates.
In practice, we often have many more than one independent variable.
This example from the text considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.
The variables are:
Income: income for a four-person family
COL: cost of living for a four-person family
PD: Population density
URate: rate of unionization in 1978
Taxes: Property taxes in 1972
RTWL: right-to-work indicator
In a study like this, there are many possible questions of interest. Our focus will be on the
Income. However, we recognize that other variables
have an effect on
Income. Let's look at some of these relationships.
url = "http://www1.aucegypt.edu/faculty/hadi/RABE4/Data4/P005.txt" rtw.table <- read.table(url, header=TRUE, sep='\t') print(head(rtw.table))
City COL PD URate Pop Taxes Income RTWL 1 Atlanta 169 414 13.6 1790128 5128 2961 1 2 Austin 143 239 11.0 396891 4303 1711 1 3 Bakersfield 339 43 23.7 349874 4166 2122 0 4 Baltimore 173 951 21.0 2147850 5001 4654 0 5 Baton Rouge 99 255 16.0 411725 3965 1620 1 6 Boston 363 1257 24.4 3914071 4928 5634 0
A graphical way to
visualize the relationship between
RTWL is the boxplot.
attach(rtw.table) # makes variables accessible in top namespace boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')
One variable that may have an important effect on the relationship between
is the cost of living
COL. It also varies between right-to-work states.
boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')
We may want to include more than one plot in a given display. The first line of the code below achieves this.
par(mfrow=c(2,2)) plot(URate, COL, pch=23, bg='red', main='COL vs URate') plot(URate, Income, pch=23, bg='red') plot(URate, Pop, pch=23, bg='red') plot(COL, Income, pch=23, bg='red')
R has a builtin function that will try to display all pairwise relationships in a given dataset, the function
pairs(rtw.table, pch=23, bg='red')
In looking at all the pairwise relationships. There is a point that stands out from all the rest.
This data point is New York City, the 27th row of the table. (Note that
R uses 1-based instead of 0-based indexing for rows and columns of arrays.)
print(rtw.table[27,]) pairs(rtw.table[-27,], pch=23, bg='red')
City COL PD URate Pop Taxes Income RTWL 27 New York 323 6908 39.2 9561089 5260 4862 0