## Data snooping¶

We've talked about the need to fix our hypotheses before looking at the summary table. Here is a simple example illustrating the danger of looking at summaries of the data before deciding which hypothesis to test.

Below, I will create a dataset of pure noise and then try to find "the best model". There is nothing wrong with searching for the best model; what is wrong is trusting the results of the summary table after having chosen the model this way.

In [1]:
X = matrix(rnorm(50000), 50, 1000)
X = scale(X)
Y = rnorm(50)
Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
print(summary(lm(Y ~ X[,1])))
qqnorm(Z)

Call:
lm(formula = Y ~ X[, 1])

Residuals:
Min      1Q  Median      3Q     Max
-2.3476 -0.5450  0.1389  0.5628  1.7117

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001843   0.141169  -0.013     0.99
X[, 1]       0.023411   0.142603   0.164     0.87

Residual standard error: 0.9982 on 48 degrees of freedom
Multiple R-squared:  0.0005612,	Adjusted R-squared:  -0.02026
F-statistic: 0.02695 on 1 and 48 DF,  p-value: 0.8703



The collection of 1000 statistics looks pretty close to a normal sample (with 50 observations per column, each statistic is essentially a $t$ statistic with 48 degrees of freedom, which is nearly normal). This is not surprising, since the data are pure noise.
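As a quick check of this claim (a sketch, not part of the original notebook: it builds a fresh null dataset under new names so the notebook's `X`, `Y`, `Z` are untouched, and adds a seed only for reproducibility), we can compare the standardized statistics to a standard normal with a Kolmogorov–Smirnov test:

```r
# Fresh null dataset (new names to avoid clobbering the notebook's X, Y, Z;
# seed added only so this check is reproducible).
set.seed(1)
X0 = matrix(rnorm(50000), 50, 1000)
X0 = scale(X0)
Y0 = rnorm(50)
Z0 = (t(X0) %*% (Y0 - mean(Y0))) / sd(Y0)

# Standardize and compare to N(0, 1); a non-small p-value means the KS test
# detects no departure from normality.
zs = as.vector(Z0 / sd(Z0))
ks = ks.test(zs, "pnorm")
ks$p.value
```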

Now, let's look at the largest $T$ statistic.

In [2]:
largest_T = order(abs(Z))[1000]
print(summary(lm(Y ~ X[,largest_T])))

Call:
lm(formula = Y ~ X[, largest_T])

Residuals:
Min      1Q  Median      3Q     Max
-2.0681 -0.4431  0.0210  0.6325  1.9669

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.001843   0.127389  -0.014  0.98852
X[, largest_T]  0.426395   0.128682   3.314  0.00176 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9008 on 48 degrees of freedom
Multiple R-squared:  0.1862,	Adjusted R-squared:  0.1692
F-statistic: 10.98 on 1 and 48 DF,  p-value: 0.001758



The $T$ statistic is much larger in absolute value than it should be. Let's repeat this experiment many times.
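How large should the winning statistic be? A back-of-the-envelope check (generic numbers, not tied to the run above): the maximum of $m$ independent standard normals is roughly $\sqrt{2\log m}$, so with $m = 1000$ candidate columns the "best" one is expected to show a statistic near 3.7 even when the data are pure noise.

```r
m = 1000
# Extreme-value approximation to the size of the largest of m standard normals.
sqrt(2 * log(m))        # roughly 3.72

# For comparison: the two-sided Bonferroni cutoff at overall level 0.05
# when m tests are performed.
qnorm(1 - 0.025 / m)    # roughly 4.06
```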

In [3]:
largest_T_sim = function(printit=FALSE) {
    X = matrix(rnorm(500), 50, 10)
    X = scale(X)
    Y = rnorm(50)
    Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
    largest_T = order(abs(Z))[10]
    if (printit) {
        print(summary(lm(Y ~ X[,largest_T])))
    }
    return(summary(lm(Y ~ X[,largest_T]))$coef[2,4])
}
largest_T_sim(printit=TRUE)

Call:
lm(formula = Y ~ X[, largest_T])

Residuals:
Min      1Q  Median      3Q     Max
-2.4616 -0.6368  0.1048  0.6741  1.7031

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.08476    0.11860  -0.715  0.47826
X[, largest_T]  0.34845    0.11981   2.908  0.00549 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8386 on 48 degrees of freedom
Multiple R-squared:  0.1498,	Adjusted R-squared:  0.1321
F-statistic: 8.459 on 1 and 48 DF,  p-value: 0.005486

0.00548644107728959

We can do this many times and store the $p$-values. What will their distribution look like?

In [4]:
Pval = c()
for (i in 1:200) {
    Pval = c(Pval, largest_T_sim())
}
plot(ecdf(Pval))

### Type I error¶

How likely are we to conclude there is an effect if we use these $p$-values?

In [5]:
ecdf(Pval)(0.05)

0.385

With these snooped $p$-values we reject at the 5% level in roughly 38.5% of the simulations, nearly eight times the nominal Type I error rate of 5%.
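The inflated Type I error can be brought back toward the nominal level by accounting for the selection. A minimal sketch (not part of the original notebook; `adjusted_T_sim` is a hypothetical variant of `largest_T_sim`, with a seed added for reproducibility) using a Bonferroni correction over the 10 candidate columns:

```r
set.seed(2)  # seed added for reproducibility; not in the original simulation

# Same simulation as largest_T_sim, but the p-value of the selected column
# is multiplied by the number of columns screened (Bonferroni correction).
adjusted_T_sim = function() {
    X = matrix(rnorm(500), 50, 10)
    X = scale(X)
    Y = rnorm(50)
    Z = (t(X) %*% (Y - mean(Y))) / sd(Y)
    largest_T = order(abs(Z))[10]
    p = summary(lm(Y ~ X[, largest_T]))$coef[2, 4]
    return(min(1, 10 * p))
}

Pval_adj = replicate(500, adjusted_T_sim())
ecdf(Pval_adj)(0.05)   # close to the nominal 0.05 rather than around 0.38
```

Bonferroni is conservative in general, but since the 10 columns here are nearly independent, the adjusted rejection rate lands just under 5% rather than far below it.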