In [1]:

```
options(repr.plot.width=5, repr.plot.height=3)
set.seed(0)
```

This is a course on applied statistics.

Hands-on: we use R, an open-source statistics software environment.

Course notes will be Jupyter notebooks.

We will start out with a review of introductory statistics to see `R` in action.

The main topic is *(linear) regression models*: these are the *bread and butter* of applied statistics.

A regression model is a model of the relationships between some
*covariates (predictors)* and an *outcome*.

Specifically, regression is a model of the *average* outcome *given or having fixed* the covariates.

We will consider the heights of mothers and daughters collected by Karl Pearson in the late 19th century.

One of our goals is to understand the height of the daughter, `D`, knowing the height of the mother, `M`.

A mathematical model might look like $$ D = f(M) + \varepsilon$$ where $f$ gives the average height of the daughter of a mother of height `M` and $\varepsilon$ is *error*: not *every* daughter has the same height.

A statistical question: is there *any* relationship between covariates and outcomes -- is $f$ just a constant?
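One way to probe whether $f$ is just a constant is to compare a constant-only fit with a fit that uses the covariate. The sketch below does this on simulated data, not on Pearson's data: the sample size, the coefficients 30 and 0.5, and the noise level are made-up values chosen only for illustration, and the names `M_sim`, `D_sim`, `fit_const`, and `fit_line` are hypothetical.

```r
set.seed(1)
n <- 200
M_sim <- rnorm(n, mean = 62.5, sd = 2.4)        # simulated mothers' heights (made-up parameters)
D_sim <- 30 + 0.5 * M_sim + rnorm(n, sd = 2.3)  # simulated daughters' heights

fit_const <- lm(D_sim ~ 1)      # model in which f is a constant
fit_line  <- lm(D_sim ~ M_sim)  # model in which f is a line in M

# F test comparing the two nested models
comparison <- anova(fit_const, fit_line)
print(comparison)
p_value <- comparison[["Pr(>F)"]][2]
```

A small `p_value` suggests that the constant-only model is not adequate for these simulated data.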

Let's create a plot of the heights of the mother/daughter pairs. The data is in an `R` package that can be downloaded from CRAN with the command:

```
install.packages("alr3")
```

If the package is not installed, then you will get an error message when calling `library(alr3)`.
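If you would rather check for the package before loading it, a minimal sketch using base R's `requireNamespace` is shown here. The variable name `have_alr3` is hypothetical.

```r
# requireNamespace() returns TRUE or FALSE instead of raising an error
have_alr3 <- requireNamespace("alr3", quietly = TRUE)
if (!have_alr3) {
    message("alr3 is not installed; run install.packages(\"alr3\") first")
}
```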

In [2]:

```
library(alr3)
data(heights)
M = heights$Mheight
D = heights$Dheight
plot(M, D, pch = 23, bg = "red", cex = 2)
```

In the first part of this course we'll talk about fitting a line to this data. Let's do that and remake the plot, including this "best fitting line".

In [3]:

```
plot(M, D, pch = 23, bg = "red", cex = 2)
height.lm = lm(D ~ M)
abline(height.lm, lwd = 3, col = "yellow")
```

How do we find this line? With a model.

We might model the data as $$ D = \beta_0+ \beta_1 M + \varepsilon. $$

This model is *linear* in $(\beta_0, \beta_1)$, the intercept and the coefficient of `M` (the mother's height); it is a *simple linear regression model*.

Another model: $$ D = \beta_0 + \beta_1 M + \beta_2 M^2 + \beta_3 F + \varepsilon $$ where $F$ is the height of the daughter's father.

Also linear (in $(\beta_0, \beta_1, \beta_2, \beta_3)$, the coefficients of $1,M,M^2,F$).
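Because both models are linear in their coefficients, both can be fit with `lm`. The sketch below illustrates this on simulated data, since the `heights` data used above does not include fathers' heights: the data frame `toy.heights`, its column `F`, and all numeric values are made up for illustration.

```r
set.seed(2)
n <- 50
# made-up mothers' (M) and fathers' (F) heights
toy.heights <- data.frame(M = rnorm(n, 62.5, 2.4), F = rnorm(n, 68, 2.7))
toy.heights$D <- 20 + 0.3 * toy.heights$M + 0.35 * toy.heights$F + rnorm(n, sd = 2)

simple.lm <- lm(D ~ M, data = toy.heights)               # columns 1, M
bigger.lm <- lm(D ~ M + I(M^2) + F, data = toy.heights)  # columns 1, M, M^2, F

# the design matrix has one column per coefficient
X_cols <- colnames(model.matrix(bigger.lm))
print(X_cols)
print(coef(bigger.lm))
```

The `I(M^2)` wrapper tells the formula to treat `M^2` as ordinary arithmetic, so the squared height becomes its own column.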

Which model is better? We will need a tool to compare models... more to come later.

Our example here was rather simple: we only had one independent variable.

Independent variables are sometimes called *features* or *covariates*.

In practice, we often have many more than one independent variable.

This example from the text considers the effect of right-to-work legislation (which varies by state) on various factors. A description of the data can be found here.

The variables are:

- `Income`: income for a four-person family
- `COL`: cost of living for a four-person family
- `PD`: population density
- `URate`: rate of unionization in 1978
- `Pop`: population
- `Taxes`: property taxes in 1972
- `RTWL`: right-to-work indicator

In a study like this, there are many possible questions of interest. Our focus will be on the relationship between `RTWL` and `Income`. However, we recognize that other variables have an effect on `Income`. Let's look at some of these relationships.

In [4]:

```
url = "http://www1.aucegypt.edu/faculty/hadi/RABE4/Data4/P005.txt"
rtw.table <- read.table(url, header=TRUE, sep='\t')
print(head(rtw.table))
```

A graphical way to visualize the relationship between `Income` and `RTWL` is the *boxplot*.

In [5]:

```
attach(rtw.table) # makes variables accessible in top namespace
boxplot(Income ~ RTWL, col='orange', pch=23, bg='red')
```
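A numerical companion to the boxplot is the average outcome within each group. The sketch below uses a tiny made-up data frame, `toy.rtw`, rather than the real table; its values are invented for illustration. It also passes `data =` to the formula interface, which avoids the need for `attach`.

```r
# made-up incomes for three states without (0) and three with (1) right-to-work laws
toy.rtw <- data.frame(
    Income = c(8200, 7900, 9100, 8800, 7600, 8000),
    RTWL   = c(0, 0, 0, 1, 1, 1)
)

# average Income within each RTWL group
group_means <- aggregate(Income ~ RTWL, data = toy.rtw, FUN = mean)
print(group_means)

# same style of boxplot, with variables looked up in the data frame
boxplot(Income ~ RTWL, data = toy.rtw, col = 'orange')
```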

One variable that may have an important effect on the relationship between `Income` and `RTWL` is the cost of living, `COL`. It also varies with right-to-work status.

In [6]:

```
boxplot(COL ~ RTWL, col='orange', pch=23, bg='red')
```

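We may want to include more than one plot in a given display. The call `par(mfrow=c(2,2))` in the code below achieves this by arranging the plots in a 2-by-2 grid; the `options` call before it only enlarges the figure.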

In [7]:

```
options(repr.plot.width=7, repr.plot.height=7)
```

In [8]:

```
par(mfrow=c(2,2))
plot(URate, COL, pch=23, bg='red', main='COL vs URate')
plot(URate, Income, pch=23, bg='red')
plot(URate, Pop, pch=23, bg='red')
plot(COL, Income, pch=23, bg='red')
```
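The layout set by `par(mfrow=...)` stays in effect for later plots on the same device. A minimal sketch of saving and restoring the previous setting follows; the name `op` and the toy plots are only for illustration.

```r
op <- par(mfrow = c(1, 2))   # par() returns the previous values of the settings it changes
plot(1:10, (1:10)^2, main = "left panel")
plot(1:10, sqrt(1:10), main = "right panel")
par(op)                      # restore the earlier layout
```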

`R` has a builtin function that will try to display all pairwise relationships in a given dataset, the function `pairs`.

In [9]:

```
pairs(rtw.table, pch=23, bg='red')
```

Looking at all the pairwise relationships, there is a point that stands out from all the rest. This data point is New York City, the 27th row of the table. (Note that `R` uses 1-based instead of 0-based indexing for rows and columns of arrays.)

In [10]:

```
print(rtw.table[27,])
pairs(rtw.table[-27,], pch=23, bg='red')
```
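The negative index in `rtw.table[-27,]` drops row 27 and keeps everything else. A small sketch on a made-up data frame (`toy.df` is hypothetical) shows the same indexing rules:

```r
toy.df <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))
print(toy.df[3, ])         # the third row only (indexing starts at 1)
dropped <- toy.df[-3, ]    # every row except the third
print(dropped$value)
```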