Course Introduction

J. Taylor

2024-01-01

Course overview

  • Review++ (e.g. STATS60 and a little more)

  • Beyond \(t\)-tests

  • Multiple group: ANOVA

  • Simple linear regression

  • Multiple linear regression

  • Model diagnostics

  • Model selection

  • Logistic regression

  • …

What is course about?

  • It is a course on applied statistics.

  • Hands-on: we use R, an open-source statistics software environment.

  • Course notes will be available as jupyter notebooks.

  • We will start out with a review of introductory statistics to see R in action.

  • Main topic is (linear) regression models: these are the bread and butter of applied statistics.

What is a regression model?

  • A regression model is a model of the relationships between some covariates (predictors) and an outcome.

  • Specifically, regression is a model of the average outcome given or having fixed the covariates.

Heights of fathers and sons

  • We will consider the of fathers and sons collected by Karl Pearson in the late 19th century. Perhaps the first regression model!

  • One of our goals is to understand height of the son, S, knowing the height of the father, F.

  • A mathematical model might look like

    \[ S = g(F) + \varepsilon\]

  • Above \(g\) gives the average height of the son of a father of height F and \(\varepsilon\) is error: not every son whose fathers have the same height themselves have the same height.

  • A statistical question: is there any relationship between covariates and outcomes – is \(g\) just a constant?

Line of best fit

Linear regression model

  • How do we find this line? With a model.

  • We might model the data as

\[ S = \beta_0+ \beta_1 \cdot F + \varepsilon. \]

  • This model is linear in \((\beta_0, \beta_1)\), the intercept and the coefficient of F (the father’s height), it is a simple linear regression model.

Another model

\[ S = \beta_0 + \beta_1 F + \beta_2 F^2 + \varepsilon \]

  • Also linear (in \((\beta_0, \beta_1, \beta_2)\), the coefficients of \(1, F, F^2\)).

  • Which model is better? We will need a tool to compare models… more to come later.

  • Our example here was rather simple: we only had one independent variable: F.

Multiple linear regression: brain size in mammals

Features and response

Response in brains

  • Brain: average brain weight (in grams)

Features in brains

  • Body: average body weight (in kilograms)

  • Gestation: gestation period (in days)

  • Litter: average litter size

Graphical exploration

Building a model

Some of the main goals of this course:

  • Build a statistical model describing the effect of Gestation on Brain.

  • This model should recognize that other variables also affect Brain.

  • What sort of statistical conclusions can we make based on our model?

  • Is the model we choose adequate describe this dataset?

  • Are there other (simpler, more complicated) better models?