Data snooping

We've talked about the need for having our hypotheses determined before looking at the summary table. Here is a simple example to illustrate the danger of looking at summaries of the data before deciding which hypothesis to test.

Below, I will create complete noise datasets but try to find "the best model". There is nothing wrong with finding the best model, what is wrong is trusting the results of the summary table after having chosen this as the best model.

In [1]:
X = matrix(rnorm(50000), 50, 1000)
X = scale(X)
Y = rnorm(50)
Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
print(summary(lm(Y ~ X[,1])))
qqnorm(Z)
Call:
lm(formula = Y ~ X[, 1])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3476 -0.5450  0.1389  0.5628  1.7117 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001843   0.141169  -0.013     0.99
X[, 1]       0.023411   0.142603   0.164     0.87

Residual standard error: 0.9982 on 48 degrees of freedom
Multiple R-squared:  0.0005612,	Adjusted R-squared:  -0.02026 
F-statistic: 0.02695 on 1 and 48 DF,  p-value: 0.8703

The collection of 1000 $T$ statistics looks pretty close to a normal (with 50 degrees of freedom). This is not surprising.

Now, let's look at the largest $T$

In [2]:
largest_T = order(abs(Z))[1000]
print(summary(lm(Y ~ X[,largest_T])))
Call:
lm(formula = Y ~ X[, largest_T])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0681 -0.4431  0.0210  0.6325  1.9669 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.001843   0.127389  -0.014  0.98852   
X[, largest_T]  0.426395   0.128682   3.314  0.00176 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9008 on 48 degrees of freedom
Multiple R-squared:  0.1862,	Adjusted R-squared:  0.1692 
F-statistic: 10.98 on 1 and 48 DF,  p-value: 0.001758

The $T$ statistic is much larger in absolute value than it should be. Let's repeat this experiment many times.

In [3]:
largest_T_sim = function(printit=FALSE) {
    X = matrix(rnorm(500), 50, 10)
    X = scale(X)
    Y = rnorm(50)
    Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
    largest_T = order(abs(Z))[10]
    if (printit) {
        print(summary(lm(Y ~ X[,largest_T])))
    }
    return(summary(lm(Y ~ X[,largest_T]))$coef[2,4])
}
largest_T_sim(printit=TRUE)
Call:
lm(formula = Y ~ X[, largest_T])

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4616 -0.6368  0.1048  0.6741  1.7031 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)    -0.08476    0.11860  -0.715  0.47826   
X[, largest_T]  0.34845    0.11981   2.908  0.00549 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8386 on 48 degrees of freedom
Multiple R-squared:  0.1498,	Adjusted R-squared:  0.1321 
F-statistic: 8.459 on 1 and 48 DF,  p-value: 0.005486

0.00548644107728959

We can do this many times and store the $p$-values. What will their distribution look like?

In [4]:
Pval = c()
for (i in 1:200) {
    Pval = c(Pval, largest_T_sim())
}
plot(ecdf(Pval))

Type I error

How likely are we to conclude there is an effect if we use these $p$-values?

In [5]:
ecdf(Pval)(0.05)
0.385
In [6]: