**Due date: Friday, January 26, 2018, 11:59 PM**. No late submissions allowed! Please submit to Blackboard.

**Update to submission guidelines:** Submit the R script, the Rdata file, and a PDF of your answers to the exercises below. Please submit all files in the following format:

**The .R and .Rdata files will be the same as before. The PDF will now contain the question, your answer, and any plots. No code should be in the PDF. The naming convention remains the same.**

Failure to submit in this format will be penalized by 1 pt per file (so you might lose a total of 3 points). Example: If your name is John Doe and today’s date is January 25, your submission files would be: Doe_lab3_0125.R, Doe_lab3_0125.Rdata, Doe_lab3_0125.pdf

- How linear models, transformed linear models, and leave-one-out cross-validation work
- Lab 2 (material from class and the assignment)
- All the material given *before* Lab 2

- Taking Data Samples
- Fitting Linear Models
- Ridge Regression
- Lasso Regression
- Creating Functions

When fitting models to our data, we can (especially with modern SL/ML techniques) build a model which *overfits* the data. There are a few ways to describe this phenomenon: we could say we are fitting a model to noise instead of signal. Alternatively, we are reducing the bias of our model toward this particular set of data points while increasing the variance of its predictions on others.

Our remedy for this is cross-validation. We build our models on some percentage of the data and then evaluate their performance on yet-unseen data. Done repeatedly, this gives us a good characterization of how well our model captures *signal* versus *noise*. There are a variety of ways to do this in R (some packages will do it for you!), and I will make note of the `sample` function as a convenient, flexible solution.

Let’s first load the College dataset from last lab.

`df <- read.csv("./College.csv")`

Now suppose we want to sample 80 percent of the rows without replacement; we would use the following:

```
sample_rows <- sample(nrow(df), size = 0.8 * nrow(df), replace = FALSE)
head(sample_rows)
```

`## [1] 261 218 20 502 488 336`

```
df.train <- df[sample_rows,]
head(df.train)
```

```
## X Private Apps Accept Enroll Top10perc
## 261 Hope College Yes 1712 1483 624 37
## 218 George Fox College Yes 809 726 294 27
## 20 Angelo State University No 3540 2001 1016 24
## 502 Saint Michael's College Yes 1910 1380 463 16
## 488 Saint Ambrose University Yes 897 718 276 12
## 336 MacMurray College Yes 740 558 177 12
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 261 69 2505 208 12275 4341 465 1100
## 218 52 1271 43 12500 4130 400 1050
## 20 54 4190 1512 5130 3592 500 2000
## 502 64 1715 106 13030 5860 500 750
## 488 48 1345 390 10450 4020 500 1500
## 336 29 628 63 9620 3750 550 950
## PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 261 72 81 12.5 40 9284 72
## 218 53 53 13.5 22 7136 52
## 20 60 62 23.1 5 4010 34
## 502 79 88 14.5 34 10190 84
## 488 56 56 14.1 16 7444 70
## 336 49 55 10.8 33 10642 59
```

```
# note the ordering is different between the two
df.test <- df[-sample_rows,]
head(df.test)
```

```
## X Private Apps Accept Enroll
## 8 Albion College Yes 1899 1720 489
## 13 Allentown Coll. of St. Francis de Sales Yes 1179 780 290
## 21 Antioch University Yes 713 661 252
## 24 Arizona State University Main campus No 12809 10308 3761
## 28 Auburn University-Main Campus No 7548 6791 3070
## 32 Austin College Yes 948 798 295
## Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books
## 8 37 68 1594 32 13868 4826 450
## 13 38 64 1130 638 9690 4785 600
## 21 25 44 712 23 15476 3336 400
## 24 24 49 22593 7585 7434 4850 700
## 28 25 57 16262 1716 6300 3933 600
## 32 42 74 1120 15 11280 4342 400
## Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 8 850 89 100 13.7 37 11487 73
## 13 1000 60 84 13.3 21 7940 74
## 21 1100 69 82 11.3 35 42926 48
## 24 2100 88 93 18.9 5 4602 48
## 28 1908 85 91 16.7 18 6642 69
## 32 1150 81 95 13.0 33 11361 71
```

Note, there are a variety of ways to use the `sample` function. In this case we start with a vector as long as there are rows in the `df` table, and we pick (without replacement) `0.8` times that many of them. We then subset those rows of the table into a new data frame for training, and their complement into the test set.
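The subsetting steps above can be wrapped into a small reusable helper. This is a sketch of my own (the function name `train_test_split` and the `seed` argument are not part of the lab):

```r
# Hypothetical helper that packages the sampling step above:
# draw a fraction of row indices without replacement, then
# return the matching training set and its complement as the test set.
train_test_split <- function(df, frac = 0.8, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducible splits
  idx <- sample(nrow(df), size = floor(frac * nrow(df)), replace = FALSE)
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}
```

For the College data this would be `split <- train_test_split(df)`, after which `split$train` and `split$test` play the roles of `df.train` and `df.test`.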

Our old friend linear regression is accomplished in R with the `lm` command. We specify the model’s dependent and independent variables with the `~` symbol, and also specify our data.

```
model1 <- lm(Grad.Rate ~ Expend +
               S.F.Ratio +
               Outstate +
               Top10perc +
               Apps +
               Accept,
             data = df.train)
summary(model1)
```

```
##
## Call:
## lm(formula = Grad.Rate ~ Expend + S.F.Ratio + Outstate + Top10perc +
## Apps + Accept, data = df.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.814 -8.639 -0.268 7.998 51.215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.7338446 3.9614583 10.030 < 2e-16 ***
## Expend -0.0005304 0.0001836 -2.889 0.0040 **
## S.F.Ratio -0.0270947 0.1885592 -0.144 0.8858
## Outstate 0.0021899 0.0002008 10.906 < 2e-16 ***
## Top10perc 0.2718565 0.0453393 5.996 3.45e-09 ***
## Apps 0.0010475 0.0005068 2.067 0.0392 *
## Accept -0.0013427 0.0007730 -1.737 0.0829 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.65 on 614 degrees of freedom
## Multiple R-squared: 0.3874, Adjusted R-squared: 0.3815
## F-statistic: 64.73 on 6 and 614 DF, p-value: < 2.2e-16
```

Note the difference between specifying `Expend + S.F.Ratio` and `Expend * S.F.Ratio`; see the documentation for details.
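As a quick illustration of that difference (using the built-in `mtcars` data rather than the College data):

```r
# '+' fits main effects only; '*' also adds the interaction term.
m_add <- lm(mpg ~ wt + hp, data = mtcars)  # terms: wt, hp
m_int <- lm(mpg ~ wt * hp, data = mtcars)  # terms: wt, hp, wt:hp
names(coef(m_int))
## [1] "(Intercept)" "wt"          "hp"          "wt:hp"
```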

If we apply an L2 regularization to penalize the coefficients of a regression, we get ridge regression. In ordinary least squares regression we minimize the sum of squared residuals; L2 regularization (called Tikhonov regularization when done in this way) additionally penalizes large coefficients, which makes the fit more stable. We will use the package `glmnet`, which contains the `glmnet` function. `glmnet` is neat due to its *elastic-net mixing parameter* `alpha`, which allows us to define the penalty. Remember, the L2 norm is Euclidean distance and L1 is absolute value.

\[ \lambda \left[ (1-\alpha) \frac{1}{2}||\beta||^2_2 + \alpha ||\beta||_1 \right]\]

If you recall from class, letting `alpha = 0` we get **ridge regression**, and letting `alpha = 1` we get the **LASSO** (an acronym for Least Absolute Shrinkage and Selection Operator). If we use some value between 0 and 1, we are using an *elastic-net* penalty (hence the *net* in the package name `glmnet`).
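The penalty above is simple enough to compute directly. This small illustration (the function `enet_penalty` is my own, not part of `glmnet`) shows how `alpha` interpolates between the two norms:

```r
# Elastic-net penalty: lambda * [ (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 ]
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) * 0.5 * sum(beta^2) + alpha * sum(abs(beta)))
}
enet_penalty(c(1, -2), lambda = 1, alpha = 0)  # pure ridge: 0.5 * (1 + 4) = 2.5
enet_penalty(c(1, -2), lambda = 1, alpha = 1)  # pure lasso: |1| + |-2| = 3
```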

Your choice of alpha is personal; however, there are a few interesting differences between the two. In ridge regression, the penalty shrinks the coefficients of correlated predictors toward each other. The lasso, in contrast, will tend to pick one predictor and shrink the others to zero. Think about whether you can prove this to yourself (see what happens when you let `alpha` vary between 0 and 1).

The other parameter of importance, `lambda`, tells us how strong the penalty is (regardless of `alpha`). `glmnet` fits the model over a whole sequence of `lambda` values. Note below when I print `model2` you can see this sequence: as the penalty decreases, the number of nonzero coefficients (the `Df` column) grows until all have entered the model.

`library("glmnet")`

`## Loading required package: Matrix`

`## Loading required package: foreach`

`## Loaded glmnet 2.0-13`

```
y.train <- as.matrix(df.train$Grad.Rate)
x.train <- as.matrix(cbind(df.train$Expend,df.train$S.F.Ratio,
df.train$Outstate,df.train$Top10perc,
df.train$Apps,df.train$Accept))
model2 <- glmnet(x=x.train,y=y.train,family = "gaussian")
(model2)
```

```
##
## Call: glmnet(x = x.train, y = y.train, family = "gaussian")
##
## Df %Dev Lambda
## [1,] 0 0.00000 9.974000
## [2,] 1 0.05618 9.088000
## [3,] 1 0.10280 8.280000
## [4,] 1 0.14150 7.545000
## [5,] 2 0.17560 6.874000
## [6,] 2 0.20970 6.264000
## [7,] 2 0.23790 5.707000
## [8,] 2 0.26140 5.200000
## [9,] 2 0.28090 4.738000
## [10,] 2 0.29710 4.317000
## [11,] 2 0.31050 3.934000
## [12,] 2 0.32160 3.584000
## [13,] 2 0.33090 3.266000
## [14,] 2 0.33860 2.976000
## [15,] 2 0.34500 2.711000
## [16,] 2 0.35030 2.471000
## [17,] 2 0.35470 2.251000
## [18,] 2 0.35830 2.051000
## [19,] 2 0.36130 1.869000
## [20,] 2 0.36390 1.703000
## [21,] 2 0.36600 1.552000
## [22,] 2 0.36770 1.414000
## [23,] 2 0.36910 1.288000
## [24,] 2 0.37030 1.174000
## [25,] 2 0.37130 1.069000
## [26,] 2 0.37210 0.974400
## [27,] 2 0.37280 0.887900
## [28,] 2 0.37340 0.809000
## [29,] 2 0.37390 0.737100
## [30,] 2 0.37430 0.671600
## [31,] 2 0.37460 0.612000
## [32,] 3 0.37490 0.557600
## [33,] 3 0.37520 0.508100
## [34,] 4 0.37610 0.462900
## [35,] 4 0.37750 0.421800
## [36,] 4 0.37870 0.384300
## [37,] 4 0.37970 0.350200
## [38,] 4 0.38050 0.319100
## [39,] 4 0.38110 0.290700
## [40,] 4 0.38170 0.264900
## [41,] 4 0.38210 0.241400
## [42,] 4 0.38250 0.219900
## [43,] 4 0.38280 0.200400
## [44,] 4 0.38310 0.182600
## [45,] 4 0.38330 0.166400
## [46,] 4 0.38350 0.151600
## [47,] 4 0.38370 0.138100
## [48,] 5 0.38410 0.125900
## [49,] 5 0.38460 0.114700
## [50,] 5 0.38510 0.104500
## [51,] 5 0.38550 0.095200
## [52,] 5 0.38580 0.086740
## [53,] 5 0.38610 0.079040
## [54,] 5 0.38630 0.072020
## [55,] 5 0.38650 0.065620
## [56,] 5 0.38660 0.059790
## [57,] 5 0.38680 0.054480
## [58,] 5 0.38690 0.049640
## [59,] 5 0.38700 0.045230
## [60,] 5 0.38700 0.041210
## [61,] 6 0.38710 0.037550
## [62,] 6 0.38720 0.034210
## [63,] 6 0.38720 0.031170
## [64,] 6 0.38720 0.028400
## [65,] 6 0.38730 0.025880
## [66,] 6 0.38730 0.023580
## [67,] 6 0.38730 0.021490
## [68,] 6 0.38730 0.019580
## [69,] 6 0.38740 0.017840
## [70,] 6 0.38740 0.016250
## [71,] 6 0.38740 0.014810
## [72,] 6 0.38740 0.013490
## [73,] 6 0.38740 0.012300
## [74,] 6 0.38740 0.011200
## [75,] 6 0.38740 0.010210
## [76,] 6 0.38740 0.009301
## [77,] 6 0.38740 0.008475
```

If you’re truly curious about the inner workings of `glmnet`, I highly recommend this link as a resource.
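In practice you will usually let cross-validation pick `lambda` for you rather than inspecting the whole path by eye. A minimal sketch on simulated data (this assumes the `glmnet` package is installed; the guard simply skips the fit otherwise):

```r
set.seed(1)
x <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)  # 5 simulated predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)              # response depends on two of them

if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 0 -> ridge; cv.glmnet chooses lambda by k-fold cross-validation
  cvfit <- glmnet::cv.glmnet(x, y, alpha = 0)
  print(cvfit$lambda.min)               # lambda with the lowest CV error
  print(coef(cvfit, s = "lambda.min"))  # coefficients at that lambda
}
```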

For R functions, I’m going to jump in with an example and we will then go through what the example means.

```
fahrenheit_convert <- function(temp, from = "FAH", to = "CEL") {
  output <- "NA"
  if (from == "FAH") {
    if (to == "FAH") {
      return(temp)
    }
    if (to == "CEL") {
      return(5 / 9 * (temp - 32))
    }
    if (to == "KEL") {
      return((temp + 459.6) * (5 / 9))
    }
    if (to == "RANK") {
      return(temp + 459.6)
    }
  }
}
fahrenheit_convert(50)
```

`## [1] 10`

`fahrenheit_convert(50,to="RANK")`

`## [1] 509.6`

`fahrenheit_convert(50,to="FAH")`

`## [1] 50`

So let’s dig in. We start with the assignment operator: we are naming the function `fahrenheit_convert`. Next we define it as a function. After that come the arguments, which are the inputs to the function. In this case, we have 3: `temp` is the temperature we want to convert, `from` is the scale we are starting in, and `to` is the scale we want to convert to.

Note from the code that `from` never changes, and if I don’t specify it when calling the function, we get the default value specified in the function definition. Likewise, if I call the function without the `temp` argument, the function gives an error. A full discussion of argument ordering and requirements is here. Basically, the order matters unless you specify the names (like I did with `to` in the example), and unless an argument already has a default (like `from` and `to`) you have to supply something.

Finally, the `return` part of the function determines what the function gives back. In this case, it is specified for each scale combination, and we return the converted value. In practice, you will usually have one `return` statement very late in the function.
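For example, the converter above could be restructured around a single late `return` using `switch`. This rewrite is my own sketch, covering only the Fahrenheit conversions the original handles:

```r
fahrenheit_convert2 <- function(temp, to = "CEL") {
  # switch() selects the conversion; an unknown scale falls through to NA
  output <- switch(to,
                   FAH  = temp,
                   CEL  = 5 / 9 * (temp - 32),
                   KEL  = (temp + 459.6) * (5 / 9),
                   RANK = temp + 459.6,
                   NA)
  return(output)  # the single, late return
}
fahrenheit_convert2(50)               # 10
fahrenheit_convert2(50, to = "RANK")  # 509.6
```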

Good functional programming practice is to use functions often, use them for repetitive tasks, and generally keep them short. Some manifesto-like statements about R functions can be seen at this link.

(Using the data `ChemicalManufacturingProcess` from the `AppliedPredictiveModeling` package)

In a chemical manufacturing process for a pharmaceutical product, we are interested in understanding the relationship between biological measurements of the raw materials, measurements of the manufacturing process, and the product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. Manufacturing process predictors, on the other hand, can be changed during the manufacturing process. Improving product yield by 1% will boost revenue by approximately $100,000 per batch.

1. Split the data into an 85% training and a 15% test set, pre-process the data, and fit and tune an ordinary regression model, a transformed linear regression model (tell us what you transformed), ridge regression, and a lasso model. Compare the performance of the models, compare their statistical significance, and pick the best model. Provide a short explanation of why you chose that model. (3 points)

- Write your own cross-validation function and submit your code in the report. You are provided with pseudo-code for K-fold CV to help with writing the function; you do not have to follow the pseudo-code exactly. For LOOCV, the procedure is similar, with the main difference being how the dataset is split. Provide the results of your K-fold CV and LOOCV by using the dataset and the ordinary regression model to verify that your function works properly. (3 points)

```
Input: typeofCV (LOOCV, kfold), kfold_folds, dataset, model, responseVariable
If typeofCV == kfold
    Split the data into kfold_folds folds, with kfold_folds != 0
    (Hint: you can do this by creating a new column that labels each row with its fold number)
    For each i from 1 to kfold_folds
        Test_set <- the ith fold of the data
        Training_set <- the rows of the dataset not in the test_set
        Train the model using the training set
        Predict the response using the predictors in the test set
        RMSE_kfold(i) = the RMSE of the predictions
    End for loop
    Calculate the mean of RMSE_kfold
End If
End of function
Output: RMSE
```

Predict the response for the test set using your best model. What is the value of the performance metric, and how does it compare with the resampled performance metric on the training set? (Hint: use the CV function you wrote in the previous question) (1 point)

Which predictors are most important in the model you have trained? Do either the biological or the process predictors dominate the list? (1 point)

Explore the relationships between each of the top 10 predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process? (2 points)