# Important Dates/Information

Due date: Friday, January 26, 2018, 11:59 PM. No late submissions allowed! Please submit to Blackboard.

Update to submission guidelines: Submit the .R script, the .Rdata file, and a PDF of the answers to the exercises below. Please submit all files in the following format: LastName_lab#_mmdd.extension

The .R and .Rdata files will be the same as before. The PDF will now contain the question, your answer, and any plots. No code should be in the PDF. Naming convention should remain the same.

Failure to submit in this format will be penalized by 1 pt per file (so you might lose a total of 3 points). Example: If your name is John Doe and today’s date is January 25, your submission files would be: Doe_lab3_0125.R, Doe_lab3_0125.Rdata, Doe_lab3_0125.pdf

# Introduction

### What you should know and have done prior to this

• How linear models, transformed linear models, and leave-one-out cross validation work
• Lab 2 (material from class and the assignment)
• All the material given before lab 2

### Learning Goals

1. Taking Data Samples
2. Fitting Linear Models
3. Ridge Regression
4. Lasso Regression
5. Creating Functions

# Sampling from data

When fitting models to our data, we can (especially with modern SL/ML techniques) build a model which overfits the data. There are a few ways to describe this phenomenon: we could say we are fitting a model to noise instead of signal. Alternatively, we are reducing the bias of our model toward this individual sample of data points while increasing the variance of its predictions on others.

Our remedy for this is cross-validation. We build our models on some percentage of the data and then evaluate their performance on yet-unseen data. Done repeatedly, this gives us a good characterization of how well our model captures signal versus noise. There are a variety of ways to do this in R (some packages will do it for you!), and I will make note of the sample function as a convenient, flexible solution.

Let’s first load the college dataset from last lab.

df <- read.csv("./College.csv")

Now assume we want to sample 80 percent of the rows without replacement; we would use the following:

sample_rows <- sample(nrow(df), size = 0.8 * nrow(df), replace = FALSE)
head(sample_rows)
## [1] 261 218  20 502 488 336
df.train <- df[sample_rows,]
head(df.train)
##                            X Private Apps Accept Enroll Top10perc
## 261             Hope College     Yes 1712   1483    624        37
## 218       George Fox College     Yes  809    726    294        27
## 20   Angelo State University      No 3540   2001   1016        24
## 502  Saint Michael's College     Yes 1910   1380    463        16
## 488 Saint Ambrose University     Yes  897    718    276        12
## 336        MacMurray College     Yes  740    558    177        12
##     Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 261        69        2505         208    12275       4341   465     1100
## 218        52        1271          43    12500       4130   400     1050
## 20         54        4190        1512     5130       3592   500     2000
## 502        64        1715         106    13030       5860   500      750
## 488        48        1345         390    10450       4020   500     1500
## 336        29         628          63     9620       3750   550      950
##     PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 261  72       81      12.5          40   9284        72
## 218  53       53      13.5          22   7136        52
## 20   60       62      23.1           5   4010        34
## 502  79       88      14.5          34  10190        84
## 488  56       56      14.1          16   7444        70
## 336  49       55      10.8          33  10642        59
# note the ordering is different between the two

df.test <- df[-sample_rows,]
head(df.test)
##                                          X Private  Apps Accept Enroll
## 8                           Albion College     Yes  1899   1720    489
## 13 Allentown Coll. of St. Francis de Sales     Yes  1179    780    290
## 21                      Antioch University     Yes   713    661    252
## 24    Arizona State University Main campus      No 12809  10308   3761
## 28           Auburn University-Main Campus      No  7548   6791   3070
## 32                          Austin College     Yes   948    798    295
##    Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books
## 8         37        68        1594          32    13868       4826   450
## 13        38        64        1130         638     9690       4785   600
## 21        25        44         712          23    15476       3336   400
## 24        24        49       22593        7585     7434       4850   700
## 28        25        57       16262        1716     6300       3933   600
## 32        42        74        1120          15    11280       4342   400
##    Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 8       850  89      100      13.7          37  11487        73
## 13     1000  60       84      13.3          21   7940        74
## 21     1100  69       82      11.3          35  42926        48
## 24     2100  88       93      18.9           5   4602        48
## 28     1908  85       91      16.7          18   6642        69
## 32     1150  81       95      13.0          33  11361        71

Note, there are a variety of ways to use the sample function. In this case we start with a vector as long as the number of rows in the df table, and we pick (without replacement) 0.8 times that many indices. We then subset those rows of the table into a new data frame for training, and their complement into the test set.
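One practical aside, as a minimal sketch: calling set.seed before sample makes your train/test split reproducible from run to run, which makes debugging (and grading) much easier.

```r
# Fixing the RNG seed makes sample() return the same indices every run
set.seed(42)
idx1 <- sample(10, size = 8, replace = FALSE)
set.seed(42)
idx2 <- sample(10, size = 8, replace = FALSE)
identical(idx1, idx2)  # TRUE: the two draws are the same
```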

# Fitting linear models

Our old friend linear regression is accomplished in R with the lm command. We specify the model’s dependent variable and independent variables with the ~ symbol, and also specify our data.

model1 <- lm(Grad.Rate ~ Expend +
               S.F.Ratio +
               Outstate +
               Top10perc +
               Apps +
               Accept, data = df.train)
summary(model1)
##
## Call:
## lm(formula = Grad.Rate ~ Expend + S.F.Ratio + Outstate + Top10perc +
##     Apps + Accept, data = df.train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.814  -8.639  -0.268   7.998  51.215
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.7338446  3.9614583  10.030  < 2e-16 ***
## Expend      -0.0005304  0.0001836  -2.889   0.0040 **
## S.F.Ratio   -0.0270947  0.1885592  -0.144   0.8858
## Outstate     0.0021899  0.0002008  10.906  < 2e-16 ***
## Top10perc    0.2718565  0.0453393   5.996 3.45e-09 ***
## Apps         0.0010475  0.0005068   2.067   0.0392 *
## Accept      -0.0013427  0.0007730  -1.737   0.0829 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.65 on 614 degrees of freedom
## Multiple R-squared:  0.3874, Adjusted R-squared:  0.3815
## F-statistic: 64.73 on 6 and 614 DF,  p-value: < 2.2e-16

Note the difference between saying Expend + S.F.Ratio and Expend * S.F.Ratio in the documentation: + includes only the main effects, while * also includes their interaction.
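To see the distinction in action, here is a small sketch using the built-in mtcars data (rather than the College data): + adds main effects only, while * also adds the interaction term.

```r
# '+' : main effects only; '*' : main effects plus their interaction
m_add <- lm(mpg ~ wt + hp, data = mtcars)
m_int <- lm(mpg ~ wt * hp, data = mtcars)  # expands to wt + hp + wt:hp
names(coef(m_add))  # "(Intercept)" "wt" "hp"
names(coef(m_int))  # "(Intercept)" "wt" "hp" "wt:hp"
```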

# Ridge Regression (Tikhonov regularization)/LASSO Regression

If we apply an L2 regularization term to penalize the coefficients of a regression, we get Ridge Regression. In ordinary least squares regression we minimize only the sum of squared residuals; L2 regularization (called Tikhonov regularization when done in this way) additionally penalizes large coefficients, which makes the fit more stable. We will use the package glmnet, which contains the glmnet function. glmnet is neat due to its elastic-net mixing parameter, alpha, which allows us to define the penalty. Remember, the L2 norm is Euclidean distance and the L1 norm is the sum of absolute values.

$\lambda \left[ (1-\alpha) \frac{1}{2}||\beta||^2_2 + \alpha ||\beta||_1 \right]$

If you recall from class, letting alpha = 0 we get ridge regression, and alpha = 1 we get LASSO (an acronym for Least Absolute Shrinkage and Selection Operator). If we use some value between 0 and 1, we are using an elastic-net penalty (hence the *net* in glmnet).

Your choice of alpha is personal; however, there are a few interesting differences between the two. In ridge regression, the penalty shrinks the coefficients of correlated predictors toward each other. In lasso, by contrast, we tend to pick one of the correlated predictors and shrink the others to zero. Think about whether you can prove this to yourself (see what happens when you let alpha vary between 0 and 1).

The other parameter of importance, lambda, tells us how strong the penalty (regardless of alpha) is. glmnet fits the model along a decreasing sequence of lambda values. Note below when I print model2 you can see this sequence: as lambda shrinks, more coefficients enter the model (the Df column counts the nonzero coefficients) until all have been included.
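Before turning to glmnet, the ridge case (alpha = 0) can be sketched in a few lines of base R via its closed-form solution. This is an illustration only: glmnet standardizes the predictors and fits by coordinate descent, so its coefficients will not match this toy version exactly, and the data here are simulated rather than the College data.

```r
# Closed-form ridge estimate: beta_hat = (X'X + lambda * I)^(-1) X'y
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # toy predictors (100 rows, 2 columns)
y <- X %*% c(2, -1) + rnorm(100)    # toy response with known coefficients
ridge_beta <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
ridge_beta(X, y, lambda = 0)     # no penalty: ordinary least squares
ridge_beta(X, y, lambda = 100)   # large penalty: coefficients shrunk toward 0
```

Increasing lambda always shrinks the L2 norm of the coefficient vector, which is the "strength of the penalty" referred to above.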

library("glmnet")
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
y.train <- as.matrix(df.train$Grad.Rate)
x.train <- as.matrix(cbind(df.train$Expend, df.train$S.F.Ratio,
                           df.train$Outstate, df.train$Top10perc,
                           df.train$Apps, df.train$Accept))
model2 <- glmnet(x = x.train, y = y.train, family = "gaussian")
(model2)
##
## Call:  glmnet(x = x.train, y = y.train, family = "gaussian")
##
##        Df    %Dev   Lambda
##  [1,]   0 0.00000 9.974000
##  [2,]   1 0.05618 9.088000
##  [3,]   1 0.10280 8.280000
##  [4,]   1 0.14150 7.545000
##  [5,]   2 0.17560 6.874000
##  [6,]   2 0.20970 6.264000
##  [7,]   2 0.23790 5.707000
##  [8,]   2 0.26140 5.200000
##  [9,]   2 0.28090 4.738000
## [10,]   2 0.29710 4.317000
## [11,]   2 0.31050 3.934000
## [12,]   2 0.32160 3.584000
## [13,]   2 0.33090 3.266000
## [14,]   2 0.33860 2.976000
## [15,]   2 0.34500 2.711000
## [16,]   2 0.35030 2.471000
## [17,]   2 0.35470 2.251000
## [18,]   2 0.35830 2.051000
## [19,]   2 0.36130 1.869000
## [20,]   2 0.36390 1.703000
## [21,]   2 0.36600 1.552000
## [22,]   2 0.36770 1.414000
## [23,]   2 0.36910 1.288000
## [24,]   2 0.37030 1.174000
## [25,]   2 0.37130 1.069000
## [26,]   2 0.37210 0.974400
## [27,]   2 0.37280 0.887900
## [28,]   2 0.37340 0.809000
## [29,]   2 0.37390 0.737100
## [30,]   2 0.37430 0.671600
## [31,]   2 0.37460 0.612000
## [32,]   3 0.37490 0.557600
## [33,]   3 0.37520 0.508100
## [34,]   4 0.37610 0.462900
## [35,]   4 0.37750 0.421800
## [36,]   4 0.37870 0.384300
## [37,]   4 0.37970 0.350200
## [38,]   4 0.38050 0.319100
## [39,]   4 0.38110 0.290700
## [40,]   4 0.38170 0.264900
## [41,]   4 0.38210 0.241400
## [42,]   4 0.38250 0.219900
## [43,]   4 0.38280 0.200400
## [44,]   4 0.38310 0.182600
## [45,]   4 0.38330 0.166400
## [46,]   4 0.38350 0.151600
## [47,]   4 0.38370 0.138100
## [48,]   5 0.38410 0.125900
## [49,]   5 0.38460 0.114700
## [50,]   5 0.38510 0.104500
## [51,]   5 0.38550 0.095200
## [52,]   5 0.38580 0.086740
## [53,]   5 0.38610 0.079040
## [54,]   5 0.38630 0.072020
## [55,]   5 0.38650 0.065620
## [56,]   5 0.38660 0.059790
## [57,]   5 0.38680 0.054480
## [58,]   5 0.38690 0.049640
## [59,]   5 0.38700 0.045230
## [60,]   5 0.38700 0.041210
## [61,]   6 0.38710 0.037550
## [62,]   6 0.38720 0.034210
## [63,]   6 0.38720 0.031170
## [64,]   6 0.38720 0.028400
## [65,]   6 0.38730 0.025880
## [66,]   6 0.38730 0.023580
## [67,]   6 0.38730 0.021490
## [68,]   6 0.38730 0.019580
## [69,]   6 0.38740 0.017840
## [70,]   6 0.38740 0.016250
## [71,]   6 0.38740 0.014810
## [72,]   6 0.38740 0.013490
## [73,]   6 0.38740 0.012300
## [74,]   6 0.38740 0.011200
## [75,]   6 0.38740 0.010210
## [76,]   6 0.38740 0.009301
## [77,]   6 0.38740 0.008475

If you’re truly curious as to the inner workings of glmnet, I highly recommend this link as a resource.

# Building R Functions

For R functions, I’m going to jump in with an example and we will then go through what the example means.

fahrenheit_convert <- function(temp, from = "FAH", to = "CEL"){
  output <- "NA"
  if(from == "FAH"){
    if(to == "FAH"){
      return(temp)
    }
    if(to == "CEL"){
      return(5/9 * (temp - 32))
    }
    if(to == "KEL"){
      return((temp + 459.6) * (5/9))
    }
    if(to == "RANK"){
      return(temp + 459.6)
    }
  }
}
fahrenheit_convert(50)
## [1] 10
fahrenheit_convert(50, to = "RANK")
## [1] 509.6
fahrenheit_convert(50, to = "FAH")
## [1] 50

So let’s dig in. We start with the assignment operator: we are naming the function fahrenheit_convert. Next we define it as a function. After that come the arguments, the inputs to the function. In this case we have three: temp is the temperature we want to convert, from is the scale we are starting in, and to is the scale we want to convert to. Note from the code that from never changes, and if I don’t specify it when calling the function, we get the default value specified in the definition. Likewise, if I call the function without the temp argument, we get an error. A full discussion of argument ordering and requirements is here. Basically, the order matters unless you specify the names (like I did with from in the example), and unless an argument already has a default (like from and to) you have to supply something.
Finally, the return part of the function deals with what the function gives us back. In this case it is specified for each from/to combination, and we return the converted value. In practice, you will usually have one return statement very late in the function. Good functional programming style is to use functions often, use them for repetitive tasks, and generally keep functions short. Some manifesto-like statements about R functions can be seen at this link.

# Questions (Using the data “ChemicalManufacturingProcess” from the “AppliedPredictiveModeling” package)

A chemical manufacturing process for a pharmaceutical product is interested in understanding the relationship between biological measurements of the raw materials, measurements of the manufacturing process, and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately $100,000 per batch.

1. Split the data into an 85% training and a 15% test set, pre-process the data, and fit and tune an ordinary regression model, a transformed linear regression model (tell us what you transformed), a ridge regression model, and a lasso model. Compare the performance of the models, compare the statistical significance, and pick the best model. Provide a short explanation of why you chose that model. (3 points)

2. Write your own cross-validation function and submit your code in the report. You are provided with pseudo-code for K-fold CV to help with writing the function; you do not have to follow the pseudo-code exactly. For LOOCV, the procedure is similar, with the main difference being how the dataset is split. Provide the results of your K-fold CV and LOOCV by using the dataset and the ordinary regression model to verify that your function is working properly. (3 points)
Input: typeofCV (LOOCV,kfold), kfold_folds, dataset, model, responseVariable

If typeofCV == kfold
Split the data into kfold_folds folds (kfold_folds != 0)
(Hint: you can do this by creating a new column and labeling it kfolds_no)

For each i from 1 to kfold_folds
Test_set <- the ith fold of the data
Training_set <- the rows of the dataset not in the test_set
Train my model using the training set
Predict the response using the predictors in the test set
RMSE_kfold(i) = Calculate the RMSE of my predictions
End for loop

Calculate the mean of RMSE_kfold
End If

End of function

Output: RMSE
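The RMSE step in the pseudo-code above is a one-liner in R; as a sketch (the function name rmse and its arguments are my own, not part of the assignment):

```r
# Root-mean-squared error between predicted and observed values
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}
rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), about 1.155
```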
3. Predict the response for the test set using your best model. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set? (Hint: use the CV function you wrote in the previous question) (1 point)

4. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list? (1 point)

5. Explore the relationships between each of the top 10 predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process? (2 points)