The wrong way and the right way to do cross-validation



In this lab, we simulate the wrong way and the right way to perform cross-validation, as explained in Lecture 11 and in Section 7.10 of The Elements of Statistical Learning.

We will work under the null hypothesis that the 1000 standard normal predictors are all independent of the response, which is uniform on {0, 1}. Under this null, the true error rate of any classifier is 50%, so an average estimate below 50% reflects bias.

n = 200   # number of observations
p = 1000  # number of predictors
X = as.data.frame(matrix(rnorm(n*p),n))  # n x p matrix of independent standard normals
Y = runif(n)<0.5                         # labels independent of X, each 0/1 with probability 1/2
X$Y = Y                                  # store the response as column p+1 of the data frame

Here we define a function to perform variable selection. It selects the 20 variables with the highest correlation with the response.

PickBest <- function(X,p) {
  # Pick the 20 variables with the highest correlation with the response.
  # cor() accepts the logical response and returns one correlation per predictor.
  correlations = as.vector(cor(as.matrix(X[,1:p]),X$Y))
  order(correlations,decreasing=TRUE)[1:20]
}
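
Even under the null, this selection step will find predictors that look informative: the 20 winners are the maxima of 1000 noisy correlation estimates. As a quick illustrative check (using the X and p simulated above, not part of the lab's procedure), we can inspect the correlations of the selected variables:

best20 = PickBest(X,p)
# Correlations of the selected predictors are inflated by the selection itself
summary(sapply(best20,function(i) cor(X[,i],X$Y)))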

Now, we are ready to define functions that perform cross-validation the wrong way and the right way. The only difference is that the former performs variable selection once, before cross-validation, using all the samples, while the latter repeats the variable selection within each iteration of a 10-fold cross-validation loop.

WrongCV <- function(X,n,p) {
  # Wrong: variable selection sees all n samples, including future test folds
  best20 = PickBest(X,p)
  order = sample(1:n)  # random permutation assigning samples to folds
  errors = c()
  for (i in 1:10) {
    # Fold i: the i-th tenth of the permuted samples is the test set
    test = ((1:n)<=i*n/10) & ((1:n)>(i-1)*n/10)
    train = !test
    glm.fit = glm(Y~.,data=X[order[train],c(best20,p+1)],family=binomial)
    glm.pred = predict(glm.fit,newdata=X[order[test],c(best20,p+1)],type="response")
    glm.pred = glm.pred>0.5
    errors = c(errors,mean(glm.pred!=X[order[test],'Y']))
  }
  mean(errors)
}

RightCV <- function(X,n,p) {
  order = sample(1:n)  # random permutation assigning samples to folds
  errors = c()
  for (i in 1:10) {
    # Fold i: the i-th tenth of the permuted samples is the test set
    test = ((1:n)<=i*n/10) & ((1:n)>(i-1)*n/10)
    train = !test
    # Right: pick the best 20 predictors using the training folds only
    best20 = PickBest(X[order[train],],p)
    # Fit the model to the selected predictors on the training folds
    glm.fit = glm(Y~.,data=X[order[train],c(best20,p+1)],family=binomial)
    glm.pred = predict(glm.fit,newdata=X[order[test],c(best20,p+1)],type="response")
    glm.pred = glm.pred>0.5
    errors = c(errors,mean(glm.pred!=X[order[test],'Y']))
  }
  mean(errors)
}
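
Before running the full simulation, we can try a single run of each function on the data set simulated above (an illustrative check; the exact values vary with the random fold assignment):

WrongCV(X,n,p)  # typically well below 0.5
RightCV(X,n,p)  # typically close to 0.5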

Now, we are ready to evaluate each method. We will estimate the error 100 times the wrong way and 100 times the right way, drawing a fresh null data set for each repetition.

n = 200
p = 1000
wErrors = c()
rErrors = c()
for (i in 1:100) {
  # Draw a fresh null data set for each repetition
  X = as.data.frame(matrix(rnorm(n*p),n))
  Y = runif(n)<0.5
  X$Y = Y
  wErrors = c(wErrors,WrongCV(X,n,p))
  rErrors = c(rErrors,RightCV(X,n,p))
}
errors = data.frame('Error'=c(wErrors,rErrors),
                    'Method'=c(rep('Wrong way',100),rep('Right way',100)))
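
Before plotting, we can summarize the two sets of estimates numerically (a small supplementary check on the errors data frame just built):

tapply(errors$Error,errors$Method,mean)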

Now, we can visualize the errors in a box plot.

library(ggplot2)
ggplot(errors,aes(y=Error,x=Method)) + geom_boxplot()

[Figure: box plots of the 100 cross-validation error estimates for each method]

As expected, the right method estimates an error of about 50% on average, matching the true error rate under the null, whereas the wrong method substantially underestimates the error: because the variable selection step has already seen the test folds, the selected predictors look informative even on held-out samples.

As an exercise, you can modify this code to study the effect of changing the number (and hence the size) of the folds in cross-validation.
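
As a starting point for that exercise, here is a minimal sketch of RightCV generalized to an arbitrary number of folds k; the name RightCVk and the argument k are additions of ours, not part of the lab above.

RightCVk <- function(X,n,p,k) {
  order = sample(1:n)
  errors = c()
  for (i in 1:k) {
    # Fold i: the i-th block of the permuted samples is the test set
    test = ((1:n)<=i*n/k) & ((1:n)>(i-1)*n/k)
    train = !test
    best20 = PickBest(X[order[train],],p)
    glm.fit = glm(Y~.,data=X[order[train],c(best20,p+1)],family=binomial)
    glm.pred = predict(glm.fit,newdata=X[order[test],c(best20,p+1)],type="response")>0.5
    errors = c(errors,mean(glm.pred!=X[order[test],'Y']))
  }
  mean(errors)
}

# Example: 5 folds instead of 10
RightCVk(X,n,p,5)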