Illustration of the bias-variance decomposition

library(ggplot2)
library(splines)
set.seed(1)

Define a true function \( f \).

f = function(x) {
  x^2 - 0.2*x^2.3333
}

Now, we sample noisy observations of the function at 11 input points, with independent normal errors. We repeat this 4 times and show each repetition in a different color.

d = data.frame("x"=seq(0,100,0.1))
d$f = f(d$x)

# Samples
s = data.frame("x"=c(3,10,80,70,65,43,90,40,32,15,98))
means = f(s$x)
s$f1 = rnorm(n=length(s$x),mean=means,sd=rep(250,length(s$x)))  
s$f2 = rnorm(n=length(s$x),mean=means,sd=rep(250,length(s$x)))  
s$f3 = rnorm(n=length(s$x),mean=means,sd=rep(250,length(s$x)))  
s$f4 = rnorm(n=length(s$x),mean=means,sd=rep(250,length(s$x)))  

lims = scale_y_continuous(limits=c(-500,1300),name="f(x)")

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f1),color='red') + lims

(Plot: the true function in black with the first sample in red.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f2),color='blue') + lims

(Plot: the true function in black with the second sample in blue.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f3),color='green') + lims

(Plot: the true function in black with the third sample in green.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f4),color='orange')+ lims

(Plot: the true function in black with the fourth sample in orange.)

Now, for each realization of our experiment, we fit a spline to estimate the true function (shown in black).

# Spline fits
fit = lm(f1~bs(x,df=5),data=s) 
d$f1 = predict(object=fit,newdata=list(x=d$x))  
## Warning: some 'x' values beyond boundary knots may cause ill-conditioned
## bases
fit = lm(f2~bs(x,df=5),data=s) 
d$f2 = predict(object=fit,newdata=list(x=d$x))  
fit = lm(f3~bs(x,df=5),data=s) 
d$f3 = predict(object=fit,newdata=list(x=d$x))  
fit = lm(f4~bs(x,df=5),data=s) 
d$f4 = predict(object=fit,newdata=list(x=d$x))  
ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f1),color='red') + geom_line(data=d,aes(x=x,y=f1),color='red') + lims

(Plot: the first sample and its spline fit in red, with the true function in black.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f2),color='blue') + geom_line(data=d,aes(x=x,y=f2),color='blue') + lims

(Plot: the second sample and its spline fit in blue, with the true function in black.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f3),color='green') + geom_line(data=d,aes(x=x,y=f3),color='green') + lims

(Plot: the third sample and its spline fit in green, with the true function in black.)

ggplot(d,aes(x=x,y=f)) + geom_line()  + geom_point(data=s,aes(x=x,y=f4),color='orange') + geom_line(data=d,aes(x=x,y=f4),color='orange') + lims

(Plot: the fourth sample and its spline fit in orange, with the true function in black.)

Suppose we want to make a prediction at a specific point, \( x=70 \). The procedure we've employed, fitting a spline, has a Mean Squared Error at this test point. We know that this is split into variance and squared bias components.
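Concretely, the decomposition in question, at a test point \( x_0 \), is

\[
\mathbb{E}\left[\left(\hat f(x_0) - f(x_0)\right)^2\right] = \mathrm{Var}\left(\hat f(x_0)\right) + \left(\mathbb{E}\left[\hat f(x_0)\right] - f(x_0)\right)^2,
\]

where the expectation is taken over repeated draws of the training sample.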

The variance measures the variability of \( \hat f(70) \), the fit evaluated at the test point, when we repeat the sampling of the data (all 11 points). We can visualize this by examining the 4 fits above at \( x=70 \).

# Row 701 of d corresponds to x = 70, since d$x = seq(0,100,0.1)
valuesAt70 = data.frame('x'=rep(70,4),'f'=c(d$f1[701],d$f2[701],d$f3[701],d$f4[701])) 

ggplot(d,aes(x=x,y=f)) + geom_line() + geom_line(data=d,aes(x=x,y=f1),color='red') + geom_line(data=d,aes(x=x,y=f2),color='blue') + geom_line(data=d,aes(x=x,y=f3),color='green') + geom_line(data=d,aes(x=x,y=f4),color='orange') + geom_point(data=valuesAt70,aes(x=x,y=f)) + lims

(Plot: the four spline fits overlaid on the true function, with their values at x = 70 marked as points.)

If we repeated the experiment many times, we could estimate this variance accurately.
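As a sketch of what that would look like, the simulation below re-runs the experiment many times and computes the sample variance of \( \hat f(70) \). The repetition count `reps` is an assumed choice, not part of the original analysis.

```r
# Monte Carlo sketch: estimate Var(f_hat(70)) by redrawing the sample many times.
library(splines)
set.seed(2)
f = function(x) x^2 - 0.2*x^2.3333
x = c(3,10,80,70,65,43,90,40,32,15,98)  # same design points as above
reps = 1000  # assumed repetition count
preds = replicate(reps, {
  y = rnorm(n=length(x), mean=f(x), sd=250)  # same noise sd as above
  fit = lm(y ~ bs(x, df=5))
  predict(fit, newdata=list(x=70))
})
var(preds)  # Monte Carlo estimate of the variance at x = 70
```

Each entry of `preds` is one realization of \( \hat f(70) \), so its sample variance is exactly the quantity the four points in the plot above only hint at.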

The bias measures whether the estimates \( \hat f(70) \) tend to err “in the same direction” relative to the true \( f(70) \). From just 4 repetitions of the experiment, it is hard to tell: in this case the estimates fell both below and above the true function at \( x=70 \). If we were able to repeat the experiment many times, we could measure the bias as the difference between the average prediction, plotted below as a dashed line, and the true function.

d$average = (d$f1+d$f2+d$f3+d$f4)/4
valuesAt70 = data.frame('x'=rep(70,2),'f'=c(d$f[701],d$average[701])) 

ggplot(d,aes(x=x,y=f)) + geom_line() + geom_line(data=d,aes(x=x,y=average),linetype='dashed') + geom_point(data=valuesAt70,aes(x=x,y=f)) + lims

(Plot: the average of the four fits as a dashed line over the true function, with the average prediction and the true value at x = 70 marked as points.)

From this limited number of repetitions, there seems to be almost no bias at \( x=70 \).
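This impression can be checked with a longer simulation. The sketch below, again using an assumed 1000 repetitions that is not part of the original analysis, estimates the bias by averaging \( \hat f(70) \) over repeated samples and subtracting \( f(70) \).

```r
# Monte Carlo sketch: estimate the bias of f_hat(70) by redrawing the sample many times.
library(splines)
set.seed(3)
f = function(x) x^2 - 0.2*x^2.3333
x = c(3,10,80,70,65,43,90,40,32,15,98)  # same design points as above
reps = 1000  # assumed repetition count
preds = replicate(reps, {
  y = rnorm(n=length(x), mean=f(x), sd=250)  # same noise sd as above
  predict(lm(y ~ bs(x, df=5)), newdata=list(x=70))
})
mean(preds) - f(70)  # estimated bias at x = 70
```

If the estimated bias comes out small relative to the noise standard deviation of 250, that is consistent with the near-overlap of the dashed and solid lines in the plot above.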