CME/STATS 195

Lecture 6: Data Modeling and Linear Regression

Evan Rosenman

April 18, 2019

Contents

  • Data Modeling

  • Linear Regression

  • Lasso Regression

Data Modeling

Introduction to models

“All models are wrong, but some are useful. Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations (…).” – George E.P. Box, 1976

  • The goal of a model is to provide a simple low-dimensional summary of a dataset.

  • Models can be used to partition data into patterns of interest and residuals (other sources of variation and random noise).

Hypothesis generation vs. hypothesis confirmation

  • Models are often used for inference about a pre-specified hypothesis e.g. “BMI is associated with blood pressure controlling for other factors”

  • Doing inference correctly is hard. Each observation should either be used for exploration or confirmation, NOT both.

  • An observation can be used many times for exploration, but only once for confirmation.

  • There is nothing wrong with exploration, but you should never sell an exploratory analysis as a confirmatory analysis.

Confirmatory analysis

One approach is to split your data into three pieces before you begin the analysis:

  • Training set – the bulk (e.g. 60%) of the dataset which can be used to do anything: visualizing, fitting multiple models.

  • Validation set – a smaller set (e.g. 20%) used for manually comparing models and visualizations.

  • Test set – a set (e.g. 20%) held back and used only ONCE to test and assess your final model.

Confirmatory analysis

  • Partitioning the dataset allows you to explore the training data and generate a number of candidate hypotheses and models.

  • You can select a final model based on its performance on the validation set.

  • Finally, when you are confident with the chosen model you can check how good it is using the test data.
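
One minimal way to create such a split in base R (a sketch; the built-in mtcars data and the exact 60/20/20 proportions are just placeholders):

set.seed(195)
n <- nrow(mtcars)
idx <- sample(n)                                             # random permutation of the rows
train <- mtcars[idx[1:floor(0.6 * n)], ]                     # ~60% for training
valid <- mtcars[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]  # ~20% for validation
test  <- mtcars[idx[(floor(0.8 * n) + 1):n], ]               # ~20% held back for the final test
c(nrow(train), nrow(valid), nrow(test))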

Model Basics


There are two parts to data modeling:

  • defining a family of models: deciding on a set of models that can express a type of pattern you want to capture, e.g. a straight line, or a quadratic curve.
  • fitting a model: finding the model within the family that is the closest to your data.


A fitted model is just the best model from a chosen family of models, i.e. the “best” according to some set criteria.

This does not necessarily imply that the model is good, and it certainly does NOT imply that the model is true.

A toy dataset

We will work with a simulated dataset sim1 from the modelr package:

library(modelr)
sim1
## # A tibble: 30 x 2
##        x     y
##    <int> <dbl>
##  1     1  4.20
##  2     1  7.51
##  3     1  2.13
##  4     2  8.99
##  5     2 10.2 
##  6     2 11.3 
##  7     3  7.36
##  8     3 10.5 
##  9     3 10.5 
## 10     4 12.4 
## # ... with 20 more rows
ggplot(sim1, aes(x, y)) + geom_point()

Defining a family of models

The relationship between \(x\) and \(y\) for the points in sim1 looks linear. So, we will look for models belonging to a family of models of the following form:


\[y= \beta_0 + \beta_1 \cdot x\]


The models that can be expressed by the above formula can adequately capture a linear trend.

We generate a few examples of such models on the right.

models <- tibble(
    b0 = runif(250, -20, 40),
    b1 = runif(250, -5, 5))

ggplot(sim1, aes(x, y)) + 
  geom_abline(
      data = models, 
      aes(intercept = b0, slope = b1), 
      alpha = 1/4) +
  geom_point() 

Fitting a model

From all the lines in the linear family of models, we need to find the best one, i.e. the one that is the closest to the data.

This means that we need to find parameter estimates \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) that identify such a fitted line.

A typical measure of “closeness” is the sum of squared errors (SSE), i.e. we want the model with minimum squared residuals:

\[\begin{align*} ||\widehat{e}||^2_2 &= ||y - \widehat y||_2^2\\ &= \sum_{i = 1}^n (y_i - (\widehat \beta_0 + \widehat \beta_1 x_i))^2 \end{align*}\]
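
As a sketch of what fitting means in practice (assuming the sim1 data from modelr), we can code the SSE directly and minimize it numerically; the sse() helper and the use of optim() below are just one illustrative way to do it:

library(modelr)
sse <- function(b, data) sum((data$y - (b[1] + b[2] * data$x))^2)
best <- optim(c(0, 0), sse, data = sim1)   # numerical minimization of the SSE
best$par                                   # estimated intercept and slope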

Linear Regression

Linear Regression


  • Regression is a supervised learning method, whose goal is inferring the relationship between input data, \(x\), and a continuous response variable, \(y\).
  • Linear regression is a type of regression where \(y\) is modeled as a linear function of \(x\).
  • Simple linear regression predicts the output \(y\) from a single predictor \(x\). \[y = \beta_0 + \beta_1 x + \epsilon\]
  • Multiple linear regression assumes \(y\) relies on many covariates: \[\begin{align*} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon \\ &= \boldsymbol{\beta}^T \boldsymbol{x} + \epsilon \end{align*}\]
  • Here, \(\epsilon\) denotes a random noise term with zero mean.

Objective function

Linear regression seeks a solution \(\widehat y = \boldsymbol{\widehat \beta}^T \boldsymbol{x}\) that minimizes the difference between the true outcome \(y\) and the prediction \(\widehat y\), measured by the residual sum of squares (RSS):

\[ \widehat \beta = \text{arg} \min\limits_{\beta} \sum_i \left(y_i - \boldsymbol{ \beta}^T \boldsymbol{x}_i\right)^2 \]
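
When \(X^T X\) is invertible, this minimization has the well-known closed-form solution \(\widehat{\boldsymbol\beta} = (X^T X)^{-1} X^T y\). A small sketch on the sim1 data, with an explicit intercept column (X_sim and y_sim are just local names for this illustration):

X_sim <- cbind(1, sim1$x)      # design matrix with an intercept column
y_sim <- sim1$y
beta_hat <- solve(t(X_sim) %*% X_sim, t(X_sim) %*% y_sim)
beta_hat                       # should match coef(lm(y ~ x, data = sim1))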

Simple Linear Regression

  • Predict the mileage per gallon using the weight of the car.

  • In R, linear models can be fit with the lm() function.

# Separate the data into train and test:
set.seed(123)
n <- nrow(mtcars)
idx <- sample(1:n, size = round(0.7*n))
mtcars_train <- mtcars[idx, ]
mtcars_test <- mtcars[-idx, ]

# Fit a simple linear model:
mtcars_fit <- lm(mpg ~ wt, mtcars_train)

# Extract the fitted model coefficients:
coef(mtcars_fit)
## (Intercept)          wt 
##   37.252154   -5.541406

Linear Regression Model Summary

# check the details on the fitted model:
summary(mtcars_fit)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5302 -1.9952  0.0179  1.3017  3.5194 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   36.470      2.108  17.299 7.61e-11 ***
## wt            -5.407      0.621  -8.707 5.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.2 on 14 degrees of freedom
## Multiple R-squared:  0.8441, Adjusted R-squared:  0.833 
## F-statistic: 75.81 on 1 and 14 DF,  p-value: 5.043e-07
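
The summary object is itself a list, so individual components can be extracted programmatically, e.g. the coefficient table and \(R^2\):

s <- summary(mtcars_fit)
s$coefficients   # estimates, standard errors, t-statistics, p-values
s$r.squared      # multiple R-squared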

Fitted values

We can compute the fitted values \(\widehat y\), i.e. the predicted mpg values for the observations in the training set, using the predict() function.

pred <- predict(mtcars_fit, newdata = mtcars_train)
pred
##            Merc 280    Pontiac Firebird          Merc 450SL 
##           18.189718           15.945449           16.582710 
##           Fiat X1-9       Porsche 914-2       Mazda RX4 Wag 
##           26.529534           25.393545           21.320612 
##         Merc 450SLC         AMC Javelin      Ford Pantera L 
##           16.305640           18.217425           19.685897 
##           Merc 280C    Dodge Challenger          Volvo 142E 
##           18.189718           17.746405           21.847046 
##          Camaro Z28       Maserati Bora        Lotus Europa 
##           15.973156           17.469335           28.868007 
## Lincoln Continental      Hornet 4 Drive           Mazda RX4 
##            7.195569           19.436534           22.733671 
##   Hornet Sportabout        Ferrari Dino         Honda Civic 
##           18.189718           21.902460           28.302783 
##           Merc 240D 
##           19.575069

Fitted values

Alternatively, the add_predictions() function from the modelr package will automatically append the model predictions to our data frame:

mtcars_train <- mtcars_train %>% add_predictions(mtcars_fit)
tbl_df(mtcars_train)
## # A tibble: 22 x 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  pred
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4  18.2
##  2  19.2     8  400    175  3.08  3.84  17.0     0     0     3     2  15.9
##  3  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3  16.6
##  4  27.3     4   79     66  4.08  1.94  18.9     1     1     4     1  26.5
##  5  26       4  120.    91  4.43  2.14  16.7     0     1     5     2  25.4
##  6  21       6  160    110  3.9   2.88  17.0     0     1     4     4  21.3
##  7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3  16.3
##  8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2  18.2
##  9  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4  19.7
## 10  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4  18.2
## # ... with 12 more rows

Predictions for new observations

To predict the mpg for new observations, e.g. cars not in the dataset, we first need to generate a data table with predictors \(x\), in this case the car weights:

newcars <- tibble(wt = c(2, 2.1, 3.14, 4.1, 4.3))
newcars <- newcars %>% add_predictions(mtcars_fit)
newcars
## # A tibble: 5 x 2
##      wt  pred
##   <dbl> <dbl>
## 1  2     26.2
## 2  2.1   25.6
## 3  3.14  19.9
## 4  4.1   14.5
## 5  4.3   13.4

Predictions for the test set

Remember that we already set aside a test set to check our model:

mtcars_test <- mtcars_test %>% add_predictions(mtcars_fit)
head(mtcars_test, 3)
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb     pred
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1 24.39609
## Valiant    18.1   6  225 105 2.76 3.46 20.22  1  0    3    1 18.07889
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4 17.46934

Compute the root mean square error:

\[ RMSE = \frac{1}{\sqrt{n}}\|y - \widehat y\| = \sqrt{\frac{1}{n}\sum_{i = 1}^n(y_i - \widehat y_i)^2} \]

sqrt(mean((mtcars_test$mpg - mtcars_test$pred)^2))
## [1] 4.291439
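
Equivalently, modelr provides the convenience function rmse(model, data), which should return the same value:

rmse(mtcars_fit, mtcars_test)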

Visualizing the model

We compare our predictions (grey) to the observed (black) values.

ggplot(mtcars_train, aes(x = wt)) + geom_point(aes(y = mpg)) +
    geom_line(aes(y = pred), color = "red", size = 1) +
    geom_point(aes(y = pred), fill = "grey", color = "black", shape = 21, size = 2) 

Visualizing the residuals

Residuals tell you what the model has missed. We can compute and add residuals to the data with add_residuals() from the modelr package.

Plotting residuals is a good practice. There should be no systematic patterns in the residuals.

mtcars_train <- mtcars_train %>% 
    add_residuals(mtcars_fit)
tbl_df(mtcars_train %>% 
    select(mpg, resid, pred))
## # A tibble: 22 x 3
##      mpg  resid  pred
##  * <dbl>  <dbl> <dbl>
##  1  19.2  1.01   18.2
##  2  19.2  3.25   15.9
##  3  17.3  0.717  16.6
##  4  27.3  0.770  26.5
##  5  26    0.606  25.4
##  6  21   -0.321  21.3
##  7  15.2 -1.11   16.3
##  8  15.2 -3.02   18.2
##  9  15.8 -3.89   19.7
## 10  17.8 -0.390  18.2
## # ... with 12 more rows
ggplot(mtcars_train, aes(wt, resid)) + 
    geom_ref_line(h = 0, colour = "grey") +
    geom_point() 

Formulae in R

You have seen that lm() takes a formula, e.g. y ~ x, as an argument.

To see what R actually does with the formula, you can use model_matrix().

sim1
## # A tibble: 30 x 3
##        x     y  pred
##    <int> <dbl> <dbl>
##  1     1  4.20  6.27
##  2     1  7.51  6.27
##  3     1  2.13  6.27
##  4     2  8.99  8.32
##  5     2 10.2   8.32
##  6     2 11.3   8.32
##  7     3  7.36 10.4 
##  8     3 10.5  10.4 
##  9     3 10.5  10.4 
## 10     4 12.4  12.4 
## # ... with 20 more rows
model_matrix(sim1, y ~ x)
## # A tibble: 30 x 2
##    `(Intercept)`     x
##            <dbl> <dbl>
##  1             1     1
##  2             1     1
##  3             1     1
##  4             1     2
##  5             1     2
##  6             1     2
##  7             1     3
##  8             1     3
##  9             1     3
## 10             1     4
## # ... with 20 more rows
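
Formulas can also include transformations of the predictors; for instance, adding a quadratic term creates an extra column in the model matrix:

model_matrix(sim1, y ~ x + I(x^2))   # columns: (Intercept), x, I(x^2)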

Formulae with categorical variables

  • Categorical variables cannot be included in linear models with their raw values

  • The year variable is not a number, so R creates indicator columns, e.g. yearS, which is 1 if the value is "S" and 0 otherwise.

(df <- tibble(
    year = c("F", "S", "F", "J", 
             "S", "S"),
     response = c(2, 5, 1, 3, 6, 8)
))
## # A tibble: 6 x 2
##   year  response
##   <chr>    <dbl>
## 1 F            2
## 2 S            5
## 3 F            1
## 4 J            3
## 5 S            6
## 6 S            8
model_matrix(df, response ~ year)
## # A tibble: 6 x 3
##   `(Intercept)` yearJ yearS
##           <dbl> <dbl> <dbl>
## 1             1     0     0
## 2             1     0     1
## 3             1     0     0
## 4             1     1     0
## 5             1     0     1
## 6             1     0     1

  • In general, it creates \(k−1\) columns, where \(k\) is the number of categories.
(df <- tibble(
    rating = c("good", "bad", "average", 
               "bad", "average", "good", 
               "bad", "good"),
    score = c(2, 5, 1, 3, 6, 8, 10, 6)
))
## # A tibble: 8 x 2
##   rating  score
##   <chr>   <dbl>
## 1 good        2
## 2 bad         5
## 3 average     1
## 4 bad         3
## 5 average     6
## 6 good        8
## 7 bad        10
## 8 good        6
model_matrix(df, score ~ rating)
## # A tibble: 8 x 3
##   `(Intercept)` ratingbad ratinggood
##           <dbl>     <dbl>      <dbl>
## 1             1         0          1
## 2             1         1          0
## 3             1         0          0
## 4             1         1          0
## 5             1         0          0
## 6             1         0          1
## 7             1         1          0
## 8             1         0          1

The underlying parametrization won’t affect model predictions.
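
A quick sketch to verify this on the rating data above: re-fitting with a different reference level changes the coefficients, but the fitted values are identical.

fit_a <- lm(score ~ rating, data = df)
fit_b <- lm(score ~ relevel(factor(rating), ref = "good"), data = df)
all.equal(unname(fitted(fit_a)), unname(fitted(fit_b)))   # should be TRUE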

Multiple Linear Regression

Models often include multiple predictors, e.g. we might like to predict mpg using three variables: wt, disp and cyl.

ggplot(mtcars, aes(x=wt, y=mpg, col=cyl, size=disp)) + 
    geom_point() +
    scale_color_viridis_c()

mtcars_mult_fit <- lm(mpg ~ wt + disp + cyl, data = mtcars_train)

# Summarize the results 
summary(mtcars_mult_fit)
## 
## Call:
## lm(formula = mpg ~ wt + disp + cyl, data = mtcars_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0278 -0.9406 -0.1110  1.0166  2.8959 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.54012    2.18825  19.440 1.57e-13 ***
## wt          -4.24770    0.79958  -5.312 4.75e-05 ***
## disp         0.01711    0.00882   1.940 0.068188 .  
## cyl         -2.09360    0.49482  -4.231 0.000502 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.685 on 18 degrees of freedom
## Multiple R-squared:  0.9136, Adjusted R-squared:  0.8992 
## F-statistic: 63.45 on 3 and 18 DF,  p-value: 9.075e-10

To predict mpg for new cars, you must first create a data frame describing the attributes of the new cars, before computing predicted mpg values.

newcars <- expand.grid(
    wt = c(2.1, 3.6, 5.1), 
    disp = c(150, 250), 
    cyl = c(4, 6)
)
newcars
##     wt disp cyl
## 1  2.1  150   4
## 2  3.6  150   4
## 3  5.1  150   4
## 4  2.1  250   4
## 5  3.6  250   4
## 6  5.1  250   4
## 7  2.1  150   6
## 8  3.6  150   6
## 9  5.1  150   6
## 10 2.1  250   6
## 11 3.6  250   6
## 12 5.1  250   6
newcars <- newcars %>% 
    add_predictions(mtcars_mult_fit)
newcars
##     wt disp cyl     pred
## 1  2.1  150   4 27.81248
## 2  3.6  150   4 21.44093
## 3  5.1  150   4 15.06938
## 4  2.1  250   4 29.52377
## 5  3.6  250   4 23.15222
## 6  5.1  250   4 16.78067
## 7  2.1  150   6 23.62527
## 8  3.6  150   6 17.25372
## 9  5.1  150   6 10.88218
## 10 2.1  250   6 25.33656
## 11 3.6  250   6 18.96501
## 12 5.1  250   6 12.59347

Predictions for the test set

mtcars_test_mult <- mtcars_test %>% add_predictions(mtcars_mult_fit)
head(mtcars_test_mult, 3)
##             mpg cyl disp  hp drat   wt  qsec vs am gear carb     pred
## Datsun 710 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1 26.15924
## Valiant    18.1   6  225 105 2.76 3.46 20.22  1  0    3    1 19.13187
## Duster 360 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4 16.78766

Similarly, we can compute the root mean square error for the test set:

\[ RMSE = \frac{1}{\sqrt{n}}||y - \widehat y|| = \sqrt{\frac{1}{n}\sum_{i = 1}^n(y_i - \widehat y_i)^2} \]

sqrt(mean((mtcars_test_mult$mpg - mtcars_test_mult$pred)^2))
## [1] 3.789124

Interaction terms

  • An interaction occurs when an independent variable has a different effect on the outcome depending on the value of another independent variable.

  • For example, one variable, \(x_1\) might have a different effect on \(y\) within different categories of variable \(x_2\).

  • Here is a useful reference for interactions

Formulas with interactions

In the sim3 dataset, there is a categorical variable, x2, and a continuous variable, x1.

ggplot(sim3, aes(x=x1, y=y)) + geom_point(aes(color = x2)) 

Models with interactions

We fit two models: mod1, without interactions (a common slope with a different intercept for each x2 category), and mod2, with interactions (a different slope and intercept for each x2 category).

# Model without interactions:
mod1 <- lm(y ~ x1 + x2, data = sim3)    
# Model with interactions:
mod2 <- lm(y ~ x1 * x2, data = sim3)     
# Generate a data grid for two variables 
# and compute predictions from both models
grid <- sim3 %>% data_grid(x1, x2) %>%   
    gather_predictions(mod1, mod2)
head(grid, 3)
## # A tibble: 3 x 4
##   model    x1 x2     pred
##   <chr> <int> <fct> <dbl>
## 1 mod1      1 a      1.67
## 2 mod1      1 b      4.56
## 3 mod1      1 c      6.48
tail(grid, 3)
## # A tibble: 3 x 4
##   model    x1 x2      pred
##   <chr> <int> <fct>  <dbl>
## 1 mod2     10 b     -0.162
## 2 mod2     10 c      5.48 
## 3 mod2     10 d      3.98
ggplot(sim3, aes(x=x1, y=y, color=x2)) +   
    geom_point() +                       
    geom_line(data=grid, aes(y=pred)) +  
    facet_wrap(~ model)                  

Now we fit an interaction model on the mtcars dataset. The formula am * wt expands to am + wt + am:wt, where ':' denotes the interaction term (a quick check of this appears after the summary below).

mfit_inter <- lm(mpg ~ am * wt, mtcars_train)
names(coefficients(mfit_inter))
## [1] "(Intercept)" "am"          "wt"          "am:wt"
summary(mfit_inter)
## 
## Call:
## lm(formula = mpg ~ am * wt, data = mtcars_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3134 -0.8925  0.1960  0.8561  4.7280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   34.758      3.883   8.951 4.77e-08 ***
## am             8.171      4.666   1.751 0.096937 .  
## wt            -4.729      1.040  -4.548 0.000249 ***
## am:wt         -3.326      1.445  -2.303 0.033457 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.032 on 18 degrees of freedom
## Multiple R-squared:  0.8743, Adjusted R-squared:  0.8533 
## F-statistic: 41.72 on 3 and 18 DF,  p-value: 2.607e-08
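
A minimal check of the formula expansion mentioned above: the explicit am + wt + am:wt formula should produce exactly the same coefficients.

mfit_expanded <- lm(mpg ~ am + wt + am:wt, data = mtcars_train)
all.equal(coef(mfit_inter), coef(mfit_expanded))   # should be TRUE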

Exercise 1

The data at the following URL, "http://www-bcf.usc.edu/~gareth/ISL/Income1.csv", include observations of income level (in tens of thousands of dollars) and years of education. The data are simulated, not real.

  1. Read the data into R and generate a ggplot with a fitted line.
income <- read_csv("http://www-bcf.usc.edu/~gareth/ISL/Income1.csv")
income <- income %>% select(-X1)
income
## # A tibble: 30 x 2
##    Education Income
##        <dbl>  <dbl>
##  1      10     26.7
##  2      10.4   27.3
##  3      10.8   22.1
##  4      11.2   21.2
##  5      11.6   15.2
##  6      12.1   26.4
##  7      12.5   17.4
##  8      12.9   25.5
##  9      13.3   36.9
## 10      13.7   39.7
## # ... with 20 more rows

Exercise 1

  2. Split the data into a train and a test set. Note that here we do not form a validation set, as we have very few observations and will only consider a single model.

  3. Fit a linear model with Education (years of education) as the predictor and Income as the response variable. What are the model coefficients obtained and how can you extract them? Inspect the model fit using summary().

  4. Compute the fitted values of Income for the observations included in the train set.

  5. Predict the income for new observations, i.e. for people with 16.00, 12.52, 15.55, 21.09, and 18.36 years of education. Then, make predictions for the test set as well and evaluate the root mean squared error on the test set.

Lasso Regression

Choosing a model

  • Modern datasets often have “too many” variables, e.g. many problems in genetics have thousands of genes (variables) from only a few dozen patients (observations)
  • Issue: \(n \ll p\), i.e. the number of predictors is much larger than the number of observations.
  • Lasso regression is especially useful for problems in which the number of covariates is large, but only a handful of them are relevant for the prediction of the outcome.

Lasso Regression

  • Lasso regression is simply regression with \(L_1\) penalty.
  • That is, it solves the problem:

\[\boldsymbol{\widehat \beta} = \text{arg} \min\limits_{\boldsymbol{\beta}} \sum_i \left(y_i - \boldsymbol{\beta}^T \boldsymbol{x}_i\right)^2 + \lambda \|\boldsymbol{\beta}\|_1\]

  • It turns out that the \(L_1\) norm \(\|\boldsymbol{\beta}\|_1 = \sum_i |\beta_i|\) promotes sparsity, i.e. only a handful of \(\widehat\beta_i\) will actually be non-zero.

  • The number of non-zero coefficients depends on the choice of the tuning parameter, \(\lambda\). The higher the \(\lambda\) the fewer non-zero coefficients.

glmnet

  • Lasso regression is implemented in the R package glmnet.
  • An introductory tutorial to the package can be found here.
# install.packages("glmnet")
library(glmnet)

  • We go back to the mtcars dataset and use Lasso regression to predict mpg using all the other variables.
  • Lasso will generate a "parsimonious" model, i.e. return non-zero coefficients for only a subset of highly predictive variables.
  • This means that we technically allow for all variables to be included, but due to penalization, most of the fitted coefficients will be zero.

Fitting a sparse model

# Convert to 'glmnet' required input format:
y <- mtcars[, 1]  # response vector, 'mpg'
X <- mtcars[, -1] # all other variables treated as predictors
X <- data.matrix(X) # convert to a NUMERIC matrix

# Choose a training set
set.seed(123)
idx <- sample(1:nrow(mtcars), floor(0.7 * nrow(mtcars)))
X_train <- X[idx, ]; y_train <- y[idx]
X_test <- X[-idx, ]; y_test <- y[-idx]

# Fit a sparse model
fit <- glmnet(X_train, y_train)
names(fit)
##  [1] "a0"        "beta"      "df"        "dim"       "lambda"   
##  [6] "dev.ratio" "nulldev"   "npasses"   "jerr"      "offset"   
## [11] "call"      "nobs"

  • glmnet() computes the Lasso regression for a sequence of different tuning parameters, \(\lambda\).
  • Each row of fit corresponds to a particular \(\lambda\) in the sequence.
  • column Df denotes the number of non-zero coefficients (degrees of freedom),
  • %Dev is the percentage variance explained,
  • Lambda is the value of the currently chosen tuning parameter.
fit
## 
## Call:  glmnet(x = X_train, y = y_train) 
## 
##       Df   %Dev   Lambda
##  [1,]  0 0.0000 4.679000
##  [2,]  1 0.1383 4.264000
##  [3,]  2 0.2626 3.885000
##  [4,]  2 0.3700 3.540000
##  [5,]  2 0.4593 3.225000
##  [6,]  2 0.5333 2.939000
##  [7,]  2 0.5948 2.678000
##  [8,]  2 0.6459 2.440000
##  [9,]  2 0.6883 2.223000
## [10,]  2 0.7235 2.026000
## [11,]  2 0.7527 1.846000
## [12,]  2 0.7770 1.682000
## [13,]  3 0.7993 1.532000
## [14,]  3 0.8179 1.396000
## [15,]  3 0.8335 1.272000
## [16,]  3 0.8463 1.159000
## [17,]  3 0.8570 1.056000
## [18,]  3 0.8659 0.962300
## [19,]  3 0.8733 0.876800
## [20,]  4 0.8797 0.798900
## [21,]  4 0.8862 0.727900
## [22,]  4 0.8915 0.663300
## [23,]  4 0.8960 0.604300
## [24,]  4 0.8997 0.550700
## [25,]  4 0.9028 0.501700
## [26,]  4 0.9054 0.457200
## [27,]  4 0.9075 0.416600
## [28,]  4 0.9093 0.379500
## [29,]  5 0.9108 0.345800
## [30,]  6 0.9124 0.315100
## [31,]  5 0.9139 0.287100
## [32,]  5 0.9152 0.261600
## [33,]  5 0.9162 0.238400
## [34,]  5 0.9171 0.217200
## [35,]  5 0.9178 0.197900
## [36,]  5 0.9184 0.180300
## [37,]  5 0.9189 0.164300
## [38,]  5 0.9193 0.149700
## [39,]  4 0.9197 0.136400
## [40,]  4 0.9199 0.124300
## [41,]  4 0.9201 0.113200
## [42,]  4 0.9203 0.103200
## [43,]  5 0.9215 0.094020
## [44,]  7 0.9263 0.085660
## [45,]  7 0.9313 0.078050
## [46,]  6 0.9350 0.071120
## [47,]  6 0.9361 0.064800
## [48,]  6 0.9371 0.059050
## [49,]  7 0.9379 0.053800
## [50,]  7 0.9387 0.049020
## [51,]  8 0.9396 0.044670
## [52,]  9 0.9414 0.040700
## [53,] 10 0.9443 0.037080
## [54,] 10 0.9473 0.033790
## [55,] 10 0.9499 0.030790
## [56,] 10 0.9520 0.028050
## [57,] 10 0.9538 0.025560
## [58,] 10 0.9553 0.023290
## [59,] 10 0.9565 0.021220
## [60,] 10 0.9575 0.019330
## [61,] 10 0.9584 0.017620
## [62,] 10 0.9591 0.016050
## [63,] 10 0.9597 0.014630
## [64,] 10 0.9602 0.013330
## [65,] 10 0.9606 0.012140
## [66,] 10 0.9609 0.011060
## [67,] 10 0.9612 0.010080
## [68,] 10 0.9614 0.009186
## [69,] 10 0.9616 0.008369
## [70,] 10 0.9618 0.007626
## [71,] 10 0.9619 0.006949
## [72,] 10 0.9620 0.006331
## [73,] 10 0.9621 0.005769
## [74,] 10 0.9622 0.005256
## [75,] 10 0.9623 0.004789
## [76,] 10 0.9623 0.004364
## [77,] 10 0.9624 0.003976
## [78,] 10 0.9624 0.003623
## [79,] 10 0.9625 0.003301
## [80,] 10 0.9625 0.003008
## [81,] 10 0.9625 0.002741
## [82,] 10 0.9625 0.002497
## [83,] 10 0.9626 0.002275
## [84,] 10 0.9626 0.002073
## [85,] 10 0.9626 0.001889
## [86,] 10 0.9626 0.001721
## [87,] 10 0.9626 0.001568

Plotting the coefficient path

# label = TRUE annotates the curves with the corresponding coefficient labels.
plot(fit, label = TRUE, xvar = "lambda") 

The y-axis shows the values of the coefficients; the x-axis gives the log of the \(\lambda\) parameter, which penalizes the \(L_1\) norm of \(\boldsymbol{\widehat \beta}\).

  • Each curve corresponds to a single variable and shows the value of its coefficient as the tuning parameter \(\lambda\) varies.
  • The smaller the \(\lambda\), the weaker the penalty and the more non-zero coefficients there are.

The computed Lasso coefficient for a particular choice of \(\lambda\) can be printed using:

# Lambda = 1
coef(fit, s = 1)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 34.877093111
## cyl         -0.867649618
## disp         .          
## hp          -0.005778702
## drat         .          
## wt          -2.757808266
## qsec         .          
## vs           .          
## am           .          
## gear         .          
## carb         .

  • As with lm(), we can use the predict() function to predict mpg for the training or the test data.
  • However, we need to specify the value of \(\lambda\) using the argument s.
# Predict for the test set:
predict(fit, newx = X_test, s = c(0.5, 1.5, 2))
##                           1        2        3
## Datsun 710         25.36098 23.87240 23.22262
## Valiant            19.82245 19.42427 19.41920
## Duster 360         16.19324 17.27111 17.74858
## Merc 230           22.62471 21.86937 21.50396
## Merc 450SE         15.20595 16.16123 16.71324
## Cadillac Fleetwood 11.25687 13.28117 14.26985
## Chrysler Imperial  10.81730 13.01570 14.07314
## Fiat 128           25.88928 24.20103 23.47110
## Toyota Corolla     27.01880 25.08206 24.22690
## Toyota Corona      24.89106 23.51713 22.92237

Each of the columns corresponds to a choice of \(\lambda\).

Choosing \(\lambda\)

  • To choose \(\lambda\), we can use cross-validation.
  • Use the cv.glmnet() function to perform k-fold cross-validation.

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \(k − 1\) subsamples are used as training data. The procedure is repeated k times, so that each subsample is used exactly once for validation, and the resulting errors are averaged.
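
cv.glmnet() performs this partitioning internally; the sketch below only illustrates how rows could be assigned to k = 5 folds (the folds vector is purely illustrative, not something cv.glmnet exposes):

k <- 5
folds <- sample(rep(1:k, length.out = nrow(X_train)))   # one fold id per training row
table(folds)                                            # roughly equal-sized folds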

set.seed(1)
# `nfolds` argument sets the number of folds (k).
cvfit <- cv.glmnet(X_train, y_train, nfolds = 5)
plot(cvfit)

  • The red dots are the average MSE over the k folds.
  • The two chosen \(\lambda\) values are the one achieving the minimum MSE (\(MSE_{min}\)) and the largest \(\lambda\) whose MSE is within one standard error of \(MSE_{min}\).

\(\lambda\) with minimum mean squared error, MSE:

cvfit$lambda.min
## [1] 0.2171905

The “best” \(\lambda\) in a practical sense is usually chosen to be the biggest \(\lambda\) whose MSE is within one standard error of the minimum MSE.

cvfit$lambda.1se
## [1] 0.6632685

Predictions using the “best” \(\lambda\):

final_pred <- predict(cvfit, newx=X_test, s="lambda.1se")
head(final_pred)
##                           1
## Datsun 710         25.01062
## Valiant            19.68422
## Duster 360         16.32664
## Merc 230           22.44375
## Merc 450SE         15.35370
## Cadillac Fleetwood 11.58909
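
As with the earlier models, we can compute the test-set RMSE for the cross-validated lasso fit (reusing y_test from the split above) and compare it with the linear fits:

sqrt(mean((y_test - final_pred)^2))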

More on models

Building Models

Building models is an important part of EDA.

It takes practice to gain an intuition for which patterns to look for and which predictors are likely to have an important effect.

You should go over the examples in http://r4ds.had.co.nz/model-building.html to see concrete examples of how a model is built for the diamonds and nycflights13 datasets we have seen before.

Other model families

This lecture has focused exclusively on the class of linear models \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \epsilon = \boldsymbol{\beta}^T \boldsymbol{x} + \epsilon \]

and penalized linear models.

There are a large set of other model classes.

Extensions of linear models:

  • Generalized linear models, stats::glm(), for binary or count data.
  • Generalized additive models, mgcv::gam(), extend generalized linear models to incorporate arbitrary smooth functions.
  • Robust linear models, MASS::rlm(), less sensitive to outliers.

Completely different models:

  • Trees, rpart::rpart(), fit a piece-wise constant model splitting the data into progressively smaller and smaller pieces.
  • Random forests, randomForest::randomForest(), aggregate many different trees.
  • Gradient boosting machines, xgboost::xgboost(), aggregate trees.

Useful Books