Modern science

A (typical?) data scientist \(U\)’s workflow…

library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-18

print(dim(X))

## [1] 500 100

# Q^U_1
lam = cv.glmnet(X, y)$lambda.min
# Q^U_2
G = glmnet(X, y)
beta.hat = coef(G, s=lam)[-1]
interesting_features = which(beta.hat != 0)             
interesting_features

##  [1]   2   4   5   8  12  13  15  18  20  21  24  26  27  29  33  34  39
## [18]  40  47  48  52  54  56  57  58  60  61  65  66  73  74  75  76  78
## [35]  79  80  83  84  85  87  90  91  92  94  95  96  98  99 100

Modern science

Writing up the paper?

# publish this?
confint(lm(y ~ X[,interesting_features]))[-1,]

##                                    2.5 %        97.5 %
## X[, interesting_features]1  -0.134666925  0.0296184179
## X[, interesting_features]2  -0.005672499  0.1613077599
## X[, interesting_features]3  -0.053930161  0.1109551781
## X[, interesting_features]4   0.044349186  0.2076825410
## X[, interesting_features]5  -0.027926685  0.1365118283
## X[, interesting_features]6  -0.032769667  0.1316021200
## X[, interesting_features]7  -0.158303438  0.0072563316
## X[, interesting_features]8  -0.024253365  0.1418585076
## X[, interesting_features]9  -0.121432787  0.0417744865
## X[, interesting_features]10  0.014321121  0.1792561263
## X[, interesting_features]11 -0.005529815  0.1593495228
## X[, interesting_features]12 -0.013091442  0.1517333395
## X[, interesting_features]13 -0.133795013  0.0305297154
## X[, interesting_features]14 -0.036626589  0.1327903094
## X[, interesting_features]15 -0.004901185  0.1590721641
## X[, interesting_features]16 -0.157122744  0.0047186279
## X[, interesting_features]17 -0.098683460  0.0639697418
## X[, interesting_features]18 -0.138232394  0.0260313420
## X[, interesting_features]19 -0.169516536 -0.0045634327
## X[, interesting_features]20 -0.146869516  0.0173152539
## X[, interesting_features]21 -0.010776887  0.1511154896
## X[, interesting_features]22 -0.028609014  0.1356236522
## X[, interesting_features]23 -0.025354318  0.1406934766
## X[, interesting_features]24 -0.044578207  0.1197330768
## X[, interesting_features]25 -0.045746324  0.1183011239
## X[, interesting_features]26 -0.011813679  0.1541610998
## X[, interesting_features]27 -0.152104784  0.0133121940
## X[, interesting_features]28 -0.130852270  0.0352438034
## X[, interesting_features]29 -0.129883488  0.0369461130
## X[, interesting_features]30 -0.178834263 -0.0124556931
## X[, interesting_features]31 -0.164157792  0.0011051734
## X[, interesting_features]32  0.089396198  0.2541155146
## X[, interesting_features]33 -0.036072714  0.1341247449
## X[, interesting_features]34 -0.059571959  0.1062204072
## X[, interesting_features]35 -0.031589104  0.1345085549
## X[, interesting_features]36 -0.128526666  0.0373178142
## X[, interesting_features]37 -0.231997646 -0.0645460622
## X[, interesting_features]38 -0.165737784 -0.0007462745
## X[, interesting_features]39 -0.112658740  0.0496528162
## X[, interesting_features]40  0.027106618  0.1922657835
## X[, interesting_features]41 -0.126489256  0.0370006396
## X[, interesting_features]42 -0.149240623  0.0152159647
## X[, interesting_features]43 -0.117646630  0.0465191464
## X[, interesting_features]44 -0.025995732  0.1381550821
## X[, interesting_features]45 -0.127133786  0.0369194617
## X[, interesting_features]46 -0.006418778  0.1568876267
## X[, interesting_features]47 -0.011286811  0.1513481766
## X[, interesting_features]48 -0.030729088  0.1356614915
## X[, interesting_features]49 -0.167370469 -0.0030653721

Do these confidence intervals mean anything?
Do such intervals have any (desirable) properties?
If they were to cover something, what would they cover?

Modern science

Data splitting: a valid method

test = 1:(n/2)
train = (n/2+1):n
library(glmnet)
# Q^U_1
lam = cv.glmnet(X[train,], y[train])$lambda.min 
# Q^U_2
G = glmnet(X[train,], y[train])
beta.hat = coef(G, s=lam)[-1]
interesting_features_train = which(beta.hat != 0)             
print(interesting_features_train)

##  [1]   2   4   7   8  11  12  13  18  26  27  29  30  33  34  35  36  37
## [18]  46  47  52  53  54  58  59  61  63  66  69  70  73  75  78  80  83
## [35]  85  87  91  96  98 100

# publish this?
confint(lm(y ~ X[,interesting_features_train], subset=test))[-1,]

##                                          2.5 %      97.5 %
## X[, interesting_features_train]1  -0.176888392 0.095904972
## X[, interesting_features_train]2  -0.067808045 0.206493314
## X[, interesting_features_train]3  -0.129119678 0.157355086
## X[, interesting_features_train]4   0.043409650 0.316791551
## X[, interesting_features_train]5  -0.127656060 0.151734772
## X[, interesting_features_train]6  -0.097833997 0.190320548
## X[, interesting_features_train]7  -0.125115056 0.152009075
## X[, interesting_features_train]8  -0.108613262 0.181516865
## X[, interesting_features_train]9  -0.205486415 0.073064216
## X[, interesting_features_train]10 -0.166316178 0.125287400
## X[, interesting_features_train]11 -0.099492671 0.173449817
## X[, interesting_features_train]12 -0.146830468 0.140499848
## X[, interesting_features_train]13 -0.059709656 0.223870212
## X[, interesting_features_train]14 -0.208018437 0.077280011
## X[, interesting_features_train]15 -0.155336405 0.147398781
## X[, interesting_features_train]16 -0.271185834 0.006001601
## X[, interesting_features_train]17 -0.226568395 0.040258694
## X[, interesting_features_train]18 -0.087402111 0.180298799
## X[, interesting_features_train]19 -0.227014421 0.040702382
## X[, interesting_features_train]20 -0.043512844 0.221811221
## X[, interesting_features_train]21 -0.135436241 0.122398662
## X[, interesting_features_train]22 -0.076496690 0.194795728
## X[, interesting_features_train]23 -0.112629537 0.160590765
## X[, interesting_features_train]24 -0.013667991 0.262754717
## X[, interesting_features_train]25 -0.098221085 0.187082754
## X[, interesting_features_train]26 -0.158577403 0.106630717
## X[, interesting_features_train]27 -0.128189404 0.159698822
## X[, interesting_features_train]28 -0.155836548 0.122465899
## X[, interesting_features_train]29 -0.181898830 0.095007694
## X[, interesting_features_train]30 -0.267139422 0.015324643
## X[, interesting_features_train]31  0.054168268 0.340460622
## X[, interesting_features_train]32 -0.121854709 0.162504021
## X[, interesting_features_train]33 -0.113851202 0.144323265
## X[, interesting_features_train]34 -0.204572534 0.084396327
## X[, interesting_features_train]35 -0.121602240 0.163892733
## X[, interesting_features_train]36 -0.004123234 0.252562041
## X[, interesting_features_train]37 -0.143859313 0.122233435
## X[, interesting_features_train]38 -0.134951838 0.150227919
## X[, interesting_features_train]39 -0.141267417 0.132107945
## X[, interesting_features_train]40 -0.237839828 0.038029446

Different interesting_features… seems likely quality of selection is lower than not splitting.
Can we provide better (valid) intervals?

Multiple comparisons

Large scale inference

Much attention (early 2000s) focused on large scale inference
In our current data setup \(U\) will focus on marginal associations

Multiple comparisons

Large scale inference

Z = rep(NA, p)
for (i in 1:p) {
    Z[i] = summary(lm(y ~ X[,i]))$coef[2,3]
}
plot(density(Z), ylim=c(0,0.4))
zval = seq(-5,5,length=101)
lines(zval, dnorm(zval), col='red', lwd=2)

Multiple comparisons

Large scale inference

library(locfdr)
locfdrZ = locfdr(Z);

## Warning in locfdr(Z): CM estimation failed, middle of histogram non-normal

BH_select = which(p.adjust(2 * (1 - pnorm(abs(Z))), method='BH') < 0.2)
bonf_select = which(p.adjust(2 * (1 - pnorm(abs(Z))), method='bonferroni') < 0.2)
print(BH_select)

## [1]  8 75 83

What about confidence intervals?
What would these confidence intervals cover?

Course

Outline

Recognizing DFQL, can we provide valid inference?
Inference here means hypothesis tests, confidence intervals, point estimators, posterior distributions…
What kind of data analyses can we handle (in theory? in practice?)
Two broad approaches: simultaneous (tied to multiple comparisons) and conditional.
The two approaches are not mutually exclusive.
Classical scientific method is a form of conditional inference.

Modern science

Simple example: Drop the losers (Sampson and Sill)

\(T\) denotes \(K=10\) different treatments in a clinical trial.
\(Q^U_1\) asks which treatment (apparently) works best.

Modern science

Simple example: Drop the losers (Sampson and Sill)

\(Q^U_2\) asks which is second best (variant of Sampson and Sill).
Data \(T'\) is confirmatory sample: gold standard reports an estimate of best treatment effect based on \(T'\) alone.

Modern science

Drop the losers

T = first_stage = list(D1=rnorm(200), D2=rnorm(200),
                       D3=rnorm(200), D4=rnorm(200),
                       D5=rnorm(200), D6=rnorm(200),
                       D7=rnorm(200), D8=rnorm(200),
                       D9=rnorm(200), D10=rnorm(200))
treatment_means = c(mean(T$D1), mean(T$D2),
                    mean(T$D3), mean(T$D4),
                    mean(T$D5), mean(T$D6),
                    mean(T$D7), mean(T$D8),
                    mean(T$D9), mean(T$D10))
treatment_order = order(treatment_means, decreasing=TRUE)
candidates = treatment_order[1:2]
candidates

## [1] 7 4

Modern science

Simple example: Drop the losers (Sampson and Sill)

T_prime = second_stage = list(D7=rnorm(200), D4=rnorm(200))
print(t.test(c(second_stage$D7), conf.level=1-0.05)$conf.int)

## [1] -0.06839504  0.20088615
## attr(,"conf.level")
## [1] 0.95

print(t.test(c(second_stage$D4), conf.level=1-0.05)$conf.int)

## [1] -0.1853057  0.1017236
## attr(,"conf.level")
## [1] 0.95

This is classic scientific method (and equivalent to data splitting)
Using \(T'\) for confirmation allows quite arbitrary \(Q^U_i\).

Modern science

Simple example: Drop the losers (Sampson and Sill)

Wasteful to only use \(T\) for selection? (common criticism of data splitting)
Can we reuse \(T\)?

Modern science

Conflict between confirmatory and exploratory

Unadjusted estimator \((T_k+T'_k)/2\) is biased, gold standard estimator \(T'_k\) is unbiased but more variable.
Simple manifestation of DFQL: sometimes referred to as researcher degrees of freedom.
Difference can easily be bigger with larger \(K\), different sample sizes, etc.

Modern science

Reproducibility and replicability

Great efforts have been made to make \(U\)’s results reproducible.
For replicability we need to (statistically) understand this collection of random variables.

Modern science

Classical inference

\(U\)’s exploratory interaction with the data is limited to \(T\).
Earlier data is “wasted”, confirmatory focus on \(T'\).
No selection effect.

Modern science

Challenge for selective inference

A scientist’s question does not always translate easily into statistical objects, a necessary step to model \(U\)’s workflow.

Statistical objects

Statistical model: e.g. for gold standard \(T'|T \sim F \in {\cal M}\)
Parameter: \(\theta: {\cal M} \rightarrow \mathbb{R}\), e.g. treament effect for “best” treatment.
Language of statistics: for parameter \(\theta\) we have

point estimators
confidence intervals
posterior distributions

Scientists, when struck by anything, are not struck with null hypotheses…

Workflows to statistics

Example: Bonferroni

In the drop the losers example, it is clear a priori what the model is and, correspondingly what the parameters of interest are.
Not really in the spirit of DFQL.
Model (parametric form): \(T \sim N(\theta, \Sigma), T' \sim N(\theta, \Sigma')\)
Parameters of interest \(\theta\), \(\Sigma\) and \(\Sigma'\) diagonal.
Implicitly assumes we could sample from all arms in second stage though we only sample in best two arms.

From workflows to statistics

Simultaneous: Bonferroni

print(t.test(c(first_stage$D7, second_stage$D7), conf.level=1-0.05/10)$conf.int)

## [1] -0.03792903  0.23805049
## attr(,"conf.level")
## [1] 0.995

print(t.test(c(first_stage$D4, second_stage$D4), conf.level=1-0.05/10)$conf.int)

## [1] -0.1444023  0.1505564
## attr(,"conf.level")
## [1] 0.995

Uses these implicit samples.

From workflows to statistics

Conditional: classical with Bonferroni

print(t.test(c(second_stage$D7), conf.level=1-0.05/2)$conf.int)

## [1] -0.08795882  0.22044994
## attr(,"conf.level")
## [1] 0.975

print(t.test(c(second_stage$D4), conf.level=1-0.05/2)$conf.int)

## [1] -0.2061589  0.1225768
## attr(,"conf.level")
## [1] 0.975

Same simultaneity as Bonferroni method.
Only really requires a model for \(T'|T\).

From workflows to statistics

Back of the envelope comparison

print(list(bonferroni=qnorm(1-0.05/(2*10)), classical=qnorm(1-0.05/(2*2))*sqrt(2))) # current

## $bonferroni
## [1] 2.807034
## 
## $classical
## [1] 3.169822

print(list(bonferroni=qnorm(1-0.05/(2*40)), classical=qnorm(1-0.05/(2*2))*sqrt(2)))

## $bonferroni
## [1] 3.227218
## 
## $classical
## [1] 3.169822

print(list(bonferroni=qnorm(1-0.05/(2*10)), classical=qnorm(1-0.05/2)*sqrt(2)))

## $bonferroni
## [1] 2.807034
## 
## $classical
## [1] 2.771808

Here \(K=10\) but as soon as \(K > 40\) the classical intervals are shorter (assuming equal size in both \(T\) and \(T'\) and large number of samples in each arm).
When reporting only best treatment, as soon as \(K \approx 10\) the classical intervals are shorter.
There are better simultaneous intervals as well as better conditional intervals. We’ll revisit this example later.

From workflows to statistics

Regression example

Bonferroni?

Could work with no split into test, train.
Order \(p 2^p\) contrasts possible in confint(lm(...)).
Bonferroni threshold \(\approx \sqrt{2 \log m}\) with \(m\) the number of comparisons yields threshold \(\approx \sqrt{p}\).
Similar to Scheff'e correction for all possible contrasts.
More precise adjustment possible POSI (Berk et al.) but worst case still \(O(\sqrt{p})\).
In terms of length of interval, data splitting certainly attractive here.
Quality of selection is harder to measure but likely better with no splitting.
Closer to DQFL but this list of contrasts is still pre-specified.

From workflows to statistics

Data splitting

If interesting_features is selected by \(Q^U_1\) (either on all data or just train) does that mean that the true model is lm(y ~ X[,interesting_features_train])?
Clearly no, and \(U\) will have some opinions about what a correct model is knowing this that interesting_features_train were chosen.
When using data splitting \(U\) is free to choose a model to publish: supports DFQL setting.
Back of the envelope calculation says we will pay \(\sqrt{2}\) penalty in length of intervals (along with deterioration of selection quality)
Can we mitigate this \(\sqrt{2}\)?

Conditional inference

Classical inference

Required to collect data \(T'\).
Throwing away \(T\) is formally equivalent to conditioning on \(T\).

Conditional inference

Drop the losers

Instead of throwing out all of \(T\), condition only on which treatment is apparently best:
Rao-Blackwell (Cohen + Sacrowicz) \[ \hat{\theta}_{\hat{K}} = E[T'_{k}|(T'+T)_k, (T_j)_{j \neq k}, \hat{K}=k] \]

Conditional inference

The (classical) scientific method is inadmissible!

Tests and confidence intervals also available (Sampson and Sill)
Similar technique can be used when looking at best 2 treatments, rather than just single best treatment.

Conditional inference

Classical scientific method

Specifically allows \(U\)’s intervention – even model \({\cal M}\) is chosen after observing \((Q^U_1, Q^U_2)\)!
\(U\) specifies model \({\cal M}\) for the law \(T'|T\).

Conditional inference

General approach

\(U\) specifies model \({\cal M}\) for the law \((T',T)\).
Do inference on all data \((T,T')\).

Conditional inference

General approach

Is this improvement limited to drop the losers? No.
Do we need confirmatory sample \(T'\)? No. We can even have \(T=T'\).
Can we allow arbitrary queries? Probably not.

Conditional inference

What should we condition on?

If choice of \(\theta:{\cal M} \rightarrow \mathbb{R}\) is affected by \((Q^U_1, Q^U_2)\) it is difficult to interpret \(\theta\) as a parameter unless we fix \((Q^U_1, Q^U_2)\) (or whatever coarser sigma algebra determined \(\theta\)).
Note this is a choice of both \({\cal M}\) and \(\theta: {\cal M} \rightarrow \mathbb{R}\).

Conditional inference

Choice of \({\cal M}\)

confint(lm(y ~ X))[-1,][interesting_features,]

##              2.5 %       97.5 %
## X2   -0.1616404680  0.024749262
## X4   -0.0204230882  0.168689551
## X5   -0.0556223871  0.129492952
## X8    0.0355275128  0.220338426
## X12  -0.0423558821  0.141731525
## X13  -0.0426952021  0.143103270
## X15  -0.1706528788  0.015595744
## X18  -0.0378114079  0.145547615
## X20  -0.1370011432  0.044158381
## X21  -0.0007036447  0.186546336
## X24  -0.0221560022  0.170244284
## X26  -0.0179344367  0.165311365
## X27  -0.1506676792  0.031240039
## X29  -0.0434491740  0.150770465
## X33  -0.0095953969  0.173100259
## X34  -0.1892108442 -0.005328268
## X39  -0.1084096858  0.073399864
## X40  -0.1533432574  0.033077796
## X47  -0.1973497473 -0.013104382
## X48  -0.1638190571  0.019842431
## X52  -0.0207974683  0.162661017
## X54  -0.0220168882  0.161524815
## X56  -0.0410425027  0.145646701
## X57  -0.0570684814  0.126227891
## X58  -0.0627383985  0.122373870
## X60  -0.0133046349  0.171560957
## X61  -0.1737119144  0.010669169
## X65  -0.1418024309  0.042776546
## X66  -0.1332307367  0.052068724
## X73  -0.1787070093  0.005354464
## X74  -0.1682989641  0.017257623
## X75   0.0766551805  0.257676063
## X76  -0.0532339871  0.140339787
## X78  -0.0610251725  0.124040924
## X79  -0.0414717948  0.143714331
## X80  -0.1322885226  0.050390804
## X83  -0.2325298441 -0.046958139
## X84  -0.1670509076  0.014208098
## X85  -0.1198502220  0.058091444
## X87   0.0335590952  0.219650453
## X90  -0.1450235342  0.037292465
## X91  -0.1570719312  0.028110668
## X92  -0.1299231174  0.051292739
## X94  -0.0336919902  0.146026743
## X95  -0.1395153360  0.040032891
## X96  -0.0126469605  0.169566314
## X98  -0.0068351396  0.171314613
## X99  -0.0284372199  0.155062343
## X100 -0.1789870458  0.002695492

Model is full model \[ y | X \sim N(X\beta, \sigma^2 I) \]
Choice is which \(\beta\)’s we should report intervals for.

Conditional inference

Choice of \({\cal M}\)

confint(lm(y ~ X[,interesting_features]))[-1,]

##                                    2.5 %        97.5 %
## X[, interesting_features]1  -0.134666925  0.0296184179
## X[, interesting_features]2  -0.005672499  0.1613077599
## X[, interesting_features]3  -0.053930161  0.1109551781
## X[, interesting_features]4   0.044349186  0.2076825410
## X[, interesting_features]5  -0.027926685  0.1365118283
## X[, interesting_features]6  -0.032769667  0.1316021200
## X[, interesting_features]7  -0.158303438  0.0072563316
## X[, interesting_features]8  -0.024253365  0.1418585076
## X[, interesting_features]9  -0.121432787  0.0417744865
## X[, interesting_features]10  0.014321121  0.1792561263
## X[, interesting_features]11 -0.005529815  0.1593495228
## X[, interesting_features]12 -0.013091442  0.1517333395
## X[, interesting_features]13 -0.133795013  0.0305297154
## X[, interesting_features]14 -0.036626589  0.1327903094
## X[, interesting_features]15 -0.004901185  0.1590721641
## X[, interesting_features]16 -0.157122744  0.0047186279
## X[, interesting_features]17 -0.098683460  0.0639697418
## X[, interesting_features]18 -0.138232394  0.0260313420
## X[, interesting_features]19 -0.169516536 -0.0045634327
## X[, interesting_features]20 -0.146869516  0.0173152539
## X[, interesting_features]21 -0.010776887  0.1511154896
## X[, interesting_features]22 -0.028609014  0.1356236522
## X[, interesting_features]23 -0.025354318  0.1406934766
## X[, interesting_features]24 -0.044578207  0.1197330768
## X[, interesting_features]25 -0.045746324  0.1183011239
## X[, interesting_features]26 -0.011813679  0.1541610998
## X[, interesting_features]27 -0.152104784  0.0133121940
## X[, interesting_features]28 -0.130852270  0.0352438034
## X[, interesting_features]29 -0.129883488  0.0369461130
## X[, interesting_features]30 -0.178834263 -0.0124556931
## X[, interesting_features]31 -0.164157792  0.0011051734
## X[, interesting_features]32  0.089396198  0.2541155146
## X[, interesting_features]33 -0.036072714  0.1341247449
## X[, interesting_features]34 -0.059571959  0.1062204072
## X[, interesting_features]35 -0.031589104  0.1345085549
## X[, interesting_features]36 -0.128526666  0.0373178142
## X[, interesting_features]37 -0.231997646 -0.0645460622
## X[, interesting_features]38 -0.165737784 -0.0007462745
## X[, interesting_features]39 -0.112658740  0.0496528162
## X[, interesting_features]40  0.027106618  0.1922657835
## X[, interesting_features]41 -0.126489256  0.0370006396
## X[, interesting_features]42 -0.149240623  0.0152159647
## X[, interesting_features]43 -0.117646630  0.0465191464
## X[, interesting_features]44 -0.025995732  0.1381550821
## X[, interesting_features]45 -0.127133786  0.0369194617
## X[, interesting_features]46 -0.006418778  0.1568876267
## X[, interesting_features]47 -0.011286811  0.1513481766
## X[, interesting_features]48 -0.030729088  0.1356614915
## X[, interesting_features]49 -0.167370469 -0.0030653721

What is the model here?
Note: even if interesting_features fixed a priori the targets these intervals are trying to cover different quantities than previous slide…

Combining conditional and simultaneous

Different \(U\) will have different workflows

pval_marginal = rep(NA, p)
for (i in 1:p) {
    pval_marginal[i] = summary(lm(y ~ X[,i]))$coef[2,4]
}
BH_select_marginal = which(p.adjust(pval_marginal, method='BH') < 0.2)
confint(lm(y ~ X[,BH_select_marginal]))[-1,]

##                                2.5 %      97.5 %
## X[, BH_select_marginal]1  0.06954523  0.23726017
## X[, BH_select_marginal]2  0.06638218  0.23408711
## X[, BH_select_marginal]3 -0.23528794 -0.06770316

# OR
confint(lm(y ~ X))[-1,][BH_select_marginal,]

##           2.5 %      97.5 %
## X8   0.03552751  0.22033843
## X75  0.07665518  0.25767606
## X83 -0.23252984 -0.04695814

# OR
BH_select = which(p.adjust(summary(lm(y ~ X))$coef[,4][-1], method='BH') < 0.2)
confint(lm(y ~ X[,BH_select]))[-1,]

##                       2.5 %      97.5 %
## X[, BH_select]1  0.07730475  0.24423593
## X[, BH_select]2  0.06396168  0.23060713
## X[, BH_select]3 -0.24261213 -0.07577966
## X[, BH_select]4  0.03402342  0.20092145

# OR
confint(lm(y ~ X))[-1,][BH_select,]

##           2.5 %      97.5 %
## X8   0.03552751  0.22033843
## X75  0.07665518  0.25767606
## X83 -0.23252984 -0.04695814
## X87  0.03355910  0.21965045

Note the targets that these intervals are trying to cover are different from target of large scale inference setting.
Each BH correction may enjoy a simultaneous guarantee (subject to satisying assumptions) but we may still want to construct confidence intervals.
Are these intervals valid? No (even if the lm models are well specified in the population, e.g. \((y,X)\) jointly Gaussian).
Why? DFQL.

Combining conditional and simultaneous

Valid intervals

pval_marginal_train = rep(NA, p)
for (i in 1:p) {
    pval_marginal_train[i] = summary(lm(y ~ X[,i], subset=train))$coef[2,4]
}
BH_select_marginal_train = which(p.adjust(pval_marginal_train, method='BH') < 0.2)
confint(lm(y ~ X[,BH_select_marginal_train], subset=test))[-1,]

##       2.5 %      97.5 % 
## -0.19747788  0.07067857

# OR
confint(lm(y ~ X, subset=test))[-1,][BH_select_marginal_train,]

##       2.5 %      97.5 % 
## -0.26593609  0.07141491

# OR
BH_select_train = which(p.adjust(summary(lm(y ~ X, subset=train))$coef[,4][-1], method='BH') < 0.3)
confint(lm(y ~ X[,BH_select_train], subset=test))[-1,]

##                              2.5 %     97.5 %
## X[, BH_select_train]1  -0.10757752 0.16077195
## X[, BH_select_train]2  -0.19988675 0.06581076
## X[, BH_select_train]3  -0.10608767 0.16716446
## X[, BH_select_train]4  -0.04930042 0.21902080
## X[, BH_select_train]5  -0.11423483 0.15507340
## X[, BH_select_train]6   0.06607727 0.33684580
## X[, BH_select_train]7  -0.11567498 0.15233420
## X[, BH_select_train]8  -0.18005117 0.09019904
## X[, BH_select_train]9  -0.11305847 0.16053405
## X[, BH_select_train]10 -0.12887128 0.13209600

# OR
confint(lm(y ~ X, subset=test))[-1,][BH_select_train,]

##           2.5 %     97.5 %
## X18 -0.18944837 0.15316565
## X26 -0.15107276 0.16607232
## X27 -0.21681786 0.10889059
## X33 -0.13059936 0.20194491
## X61 -0.10510713 0.21216939
## X75  0.02024787 0.34420962
## X78 -0.11952105 0.20939033
## X83 -0.26593609 0.07141491
## X96 -0.17421196 0.16571101
## X98 -0.14426862 0.16039904

Note the targets that these intervals are trying to cover are different from target of large scale inference setting.
Each BH correction may enjoy a simultaneous guarantee (subject to satisfying assumptions) but we may still want to construct confidence intervals.

Course

Final takeaway

Conditional and simultaneous approach are different, but can provide similar guarantees in DFQL.

Goals (conditional inference)

Provide \(U\) tools for valid inference (e.g. hypothesis tests, confidence intervals, point estimators, posterior distributions) under DFQL.
Admit as many data analysis workflows as we can subject to computational and theoretical constraints.
In lectures on Mondays, we will build up from simple one sample problem to selection algorithms presented today.

Goals (simultaneous inference)

The BH paper with introduction of False Discovery Rate (FDR) and corresponding procedure (and descendants) has been hugely influential in early 21st century.
Much recent work extending settings under which FDR (or close) can be achieved.
Less focus on usual statistical objects (e.g. hypothesis tests, confidence intervals, point estimators, posterior distributions) and DFQL than on the particular FDR control in a multiple comparisons setting.
Much of these methods will be discussed in our discussion sessions on Wednesday.


Scientists collect data first and ask questions later. (Candes) DFQL		The idea of a scientist, struck, as if by lightning with a question, is far from the truth. (Tukey)

Course introduction

web.stanford.edu/class/stats364/

Replicability crisis in science

Replicability crisis in science

Replicability crisis in science

Exploratory data analysis

Confirmatory data analysis

Conflict identified by Candes, Tukey (and certainly others)

Modern science

A (typical?) data scientist \(U\)’s workflow…

Modern science

A (typical?) data scientist \(U\)’s workflow…

Modern science

A (typical?) data scientist \(U\)’s workflow…

Modern science

Writing up the paper?

Modern science

Data splitting: a valid method

Multiple comparisons

Large scale inference

Multiple comparisons

Large scale inference

Multiple comparisons

Large scale inference

Course

Outline

Modern science

Simple example: Drop the losers (Sampson and Sill)

Modern science

Simple example: Drop the losers (Sampson and Sill)

Modern science

Drop the losers

Modern science

Simple example: Drop the losers (Sampson and Sill)

Modern science

Simple example: Drop the losers (Sampson and Sill)

Modern science

Conflict between confirmatory and exploratory

Modern science

Reproducibility and replicability

Modern science

Classical inference

Modern science

Challenge for selective inference

Statistical objects

Scientists, when struck by anything, are not struck with null hypotheses…

Workflows to statistics

Example: Bonferroni

From workflows to statistics

Simultaneous: Bonferroni

From workflows to statistics

Conditional: classical with Bonferroni

From workflows to statistics

Back of the envelope comparison

From workflows to statistics

Regression example

Bonferroni?

From workflows to statistics

Data splitting

Conditional inference

Classical inference

Conditional inference

Drop the losers

Conditional inference

The (classical) scientific method is inadmissible!

Conditional inference

Classical scientific method

Conditional inference

General approach

Conditional inference

General approach

Conditional inference

What should we condition on?

Conditional inference

Choice of \({\cal M}\)

Conditional inference

Choice of \({\cal M}\)

Combining conditional and simultaneous

Different \(U\) will have different workflows

Combining conditional and simultaneous