First paper (?) to address and propose a solution to inference after selection (DFQL?) in the context of linear regression.
Reduces the problem to simultaneous inference over all contrasts in a collection of linear models.
Provides a general purpose tool in the context of Gaussian regression models.
Emphasized and clarified the targets of inference.
Model: \(X \in \mathbb{R}^{n \times p}\) fixed and \[ \mathbb{R}^n \ni Y|X \sim N(\mu, \sigma^2 I) \] with \(\mu=\mu(X)\) unrestricted and \(\sigma^2\) assumed known or estimable.
Given \(E \subset \{1, \dots, p\}\) defined \[
\beta_E = \beta_E(\mu) = (X_E^TX_E)^{-1} X_E^T\mu \in \mathbb{R}^E
\] with \(X_E\) given by X[,E], the columns of \(X\) indexed by \(E\).
Also \[ \beta_{j | E}(\mu) = e_j^T\beta_E(\mu). \]
This notation differs slightly from the paper's: the actual statistical model is \[ {\cal M} = \left\{N(\mu, \sigma^2 I): \mu \in \mathbb{R}^n \right\}. \] (\(E\) comes from the equicorrelation set in the LASSO of Lee et al. (2016), but it could be anything: Eureka!)
These parameters can be defined in other models as well: suppose \(\mathbb{R}^p \times \mathbb{R} \ni (X, Y) \sim F\); then we can define \[ \beta_E(F) = E_F[X_EX_E^T]^{-1}E_F[X_E \cdot Y] \in \mathbb{R}^E \] with \(\beta_{j|E}(F)\) defined similarly.
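Since \(\beta_E(F)\) does not require a correctly specified linear model, it may help to see it computed directly. A small Python aside (the notes' code is R; numpy and all names below are illustrative assumptions):

```python
import numpy as np

# beta_E(F) is well defined without assuming a linear model: take F to be a
# discrete distribution on four (x, y) points (first coordinate = intercept).
pts_x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
pts_y = np.array([0.0, 1.0, 1.5, 4.0])   # deliberately not exactly linear
w = np.full(4, 0.25)                      # probabilities under F

# beta_E(F) = E_F[X_E X_E^T]^{-1} E_F[X_E * Y]
M = (pts_x.T * w) @ pts_x                 # E_F[X_E X_E^T]
v = pts_x.T @ (w * pts_y)                 # E_F[X_E Y]
beta_F = np.linalg.solve(M, v)

# With equal weights this coincides with least squares on the four points.
beta_ols, *_ = np.linalg.lstsq(pts_x, pts_y, rcond=None)
print(beta_F)
assert np.allclose(beta_F, beta_ols)
```

When \(F\) is the empirical distribution of the data, \(\beta_E(F)\) is exactly the OLS coefficient, which is the point of the check above.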
Subsequent work by the PoSI group (Buja et al. (2014)) revisited regression in this context.
Arun will give us a guest lecture on Monday on PoSI in this context.
In model \({\cal M}\) we have \[ \widehat{\beta}_{j|E} = e_j^T(X_E^TX_E)^{-1} X_E^TY \sim N\left(\beta_{j|E}(\mu), \sigma^2 e_j^T(X_E^TX_E)^{-1}e_j \right) \] and \[ \text{Cov}(\widehat{\beta}_{j|E}, \widehat{\beta}_{j'|E'}) = \sigma^2 e_j^T(X_E^TX_E)^{-1}X_E^TX_{E'}(X_{E'}^TX_{E'})^{-1}e_{j'} \]
This reduces to a multivariate normal multiple comparisons problem: \(\widehat{\theta} \sim N(\theta, \Sigma)\) with \(\theta=(\theta_i)_{i \in {\cal I}}\), where \[ {\cal I} \subset \left\{(j,E): E \subset \{1, \dots, p\}, j \in E \right\}. \]
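Each \(\widehat{\beta}_{j|E}\) is a fixed linear functional \(a_{j,E}^TY\) of the same Gaussian vector, which is why the whole collection is jointly normal with the covariance above. A quick numerical check of the cross-model covariance formula, sketched in Python (numpy and the helper function are assumptions, not from the paper):

```python
import numpy as np

# Every \hat beta_{j|E} is a_{j,E}^T Y for a fixed vector a_{j,E}, so the whole
# collection is jointly Gaussian; check the covariance formula numerically.
rng = np.random.default_rng(1)
n, p, sigma2 = 50, 5, 2.0
X = rng.standard_normal((n, p))

def contrast_vector(E, j):
    """a such that a^T Y = e_j^T (X_E^T X_E)^{-1} X_E^T Y."""
    XE = X[:, E]
    return XE @ np.linalg.inv(XE.T @ XE)[:, E.index(j)]

E, Ep = [0, 1, 2], [1, 3]
a = contrast_vector(E, 1)     # \hat beta_{1|E}
b = contrast_vector(Ep, 3)    # \hat beta_{3|E'}

# sigma^2 a^T b should equal the displayed cross-model covariance formula.
XE, XEp = X[:, E], X[:, Ep]
formula = sigma2 * (np.linalg.inv(XE.T @ XE) @ XE.T @ XEp
                    @ np.linalg.inv(XEp.T @ XEp))[E.index(1), Ep.index(3)]
assert np.isclose(sigma2 * (a @ b), formula)
```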
Simultaneous \(100(1-\alpha)\%\) confidence intervals can be found by projecting a translate of any set \(S\) such that \[ P(\epsilon \in S) \geq 1-\alpha, \qquad \epsilon \sim N(0, \Sigma). \]
The paper chooses a confidence box: find \(K_{\alpha}\) s.t. \[ P\left(\cap_{i \in {\cal I}} \left \{\epsilon_i \in [-K_{\alpha} \Sigma_{ii}^{1/2}, K_{\alpha} \Sigma_{ii}^{1/2}] \right\} \right) \geq 1-\alpha \]
Simultaneous confidence intervals \[ \widehat{\beta}_{j|E} \pm K_{\alpha} \cdot \text{SE}(\widehat{\beta}_{j|E}). \]
Suppose \(\epsilon \sim N(0, \sigma^2 X^TX)\); then \[ P\left(\epsilon^T(\sigma^2 X^TX)^{-1}\epsilon \geq \chi^2_{p, 1-\alpha}\right) = \alpha, \] which (along with projection of the sphere onto lines) yields the Scheffe simultaneous confidence intervals \[ \widehat{\beta}_{j|E} \pm \sqrt{\chi^2_{p, 1-\alpha}} \cdot \text{SE}(\widehat{\beta}_{j|E}), \] with \(\chi^2_{p, 1-\alpha}=O(p)\).
Formally Scheffe uses unknown variance, so we might use \(\sqrt{p \cdot F_{p,n-p,1-\alpha}}\) instead.
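To see the \(O(p)\) growth concretely: \(\chi^2_p\) has mean \(p\) and standard deviation \(\sqrt{2p}\), so its \((1-\alpha)\) quantile is \(p + O(\sqrt{p})\) and the Scheffe constant grows like \(\sqrt{p}\). A small Monte Carlo sketch in Python (numpy assumed; the notes otherwise use R):

```python
import numpy as np

# The Scheffe constant sqrt(chi^2_{p,1-alpha}) grows like sqrt(p); the ratio
# to sqrt(p) tends to 1 as p grows.
rng = np.random.default_rng(0)
alpha = 0.05
ratios = []
for p in (10, 100, 400):
    q = np.quantile(rng.chisquare(p, size=200_000), 1 - alpha)
    ratios.append(np.sqrt(q / p))   # Scheffe constant relative to sqrt(p)
    print(p, np.sqrt(q), np.sqrt(q / p))
```

So the Scheffe half-width is of order \(\sqrt{p} \cdot \text{SE}\), while \(K_{\alpha}(X)\) can be much smaller for restricted \({\cal I}\).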
\({\cal I}\) need not range over all submodels if it is specified a priori.
Examples include:
Models of bounded sparsity \(s < p\)
Sequence of nested models \(E_1 \subset E_2 \subset \dots \subset E_m \subset \{1, \dots, p\}\) (e.g. polynomial regression)
PoSI1: when a certain effect \(v\) is of most interest, one can restrict to \({\cal I}_v = \left\{(j,E) \in {\cal I}: v = j \right\}\)
At least in this paper, \(K_{\alpha}\) is found by brute force computation.
For \(1 \leq b \leq B\)
Draw \(Y^*_{0,b} \sim N(0, \sigma^2 I)\)
Compute \[ Z^*_{i,b} = \frac{\widehat{\beta}_{j|E}(Y^*_{0,b})}{SE(\widehat{\beta}_{j|E})} = \frac{e_j^T(X_E^TX_E)^{-1}X_E^TY_{0,b}^*}{\sigma \sqrt{(X_E^TX_E)^{-1}[j,j]}} , \qquad i=(j,E) \in {\cal I} \]
Record \[Z_{\max,b} = \max_{i \in {\cal I}} | Z^*_{i,b}|\]
Set \(\hat{K}_{\alpha}\) to be \((1-\alpha)\) quantile of \((Z_{\max,b})_{1 \leq b \leq B}\)
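A minimal Python sketch of this brute-force scheme, for a tiny random design with \(\sigma\) known (all names here are illustrative; the PoSI R package used below implements this far more carefully):

```python
import numpy as np
from itertools import combinations

# Brute-force Monte Carlo estimate of K_alpha for a small random design.
rng = np.random.default_rng(0)
n, p, sigma, alpha, B = 40, 4, 1.0, 0.05, 2000
X = rng.standard_normal((n, p))

# Precompute unit-norm contrast vectors a_{j,E}, so that for Y0 ~ N(0, sigma^2 I)
# each Z_{(j,E)} = a_{j,E}^T Y0 / sigma is standard normal.
rows = []
for m in range(1, p + 1):
    for E in combinations(range(p), m):
        XE = X[:, E]
        A = XE @ np.linalg.inv(XE.T @ XE)   # columns are unnormalized a_{j,E}
        for k in range(m):
            rows.append(A[:, k] / np.linalg.norm(A[:, k]))
A = np.array(rows)                           # |I| x n

Z_max = np.empty(B)
for b in range(B):
    Y0 = sigma * rng.standard_normal(n)
    Z_max[b] = np.max(np.abs(A @ Y0)) / sigma
K_hat = np.quantile(Z_max, 1 - alpha)
print(K_hat)   # lies between z_{1-alpha/2} and the Scheffe constant
```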
library(PoSI)
library(MASS)
data(Boston)
set.seed(1)
Y = as.numeric(Boston[,14])
X = as.matrix(Boston[,-14])
PoSI_Boston = PoSI(X)
## Number of contrasts/adjusted predictors to process: 53248
## Number of bundles: 13
## Done with bundle 1 / 13 model sz = 1
## Done with bundle 2 / 13 model sz = 2
## Done with bundle 3 / 13 model sz = 3
## Done with bundle 4 / 13 model sz = 4
## Done with bundle 5 / 13 model sz = 5
## Done with bundle 6 / 13 model sz = 6
## Done with bundle 7 / 13 model sz = 7
## Done with bundle 8 / 13 model sz = 8
## Done with bundle 9 / 13 model sz = 9
## Done with bundle 10 / 13 model sz = 10
## Done with bundle 11 / 13 model sz = 11
## Done with bundle 12 / 13 model sz = 12
## Done with bundle 13 / 13 model sz = 13
## p = 13 , d = 13 processed 53248 tests in 8191 models. Times in seconds:
## user system elapsed
## 2.556 0.566 3.465
## K.PoSI K.Bonferroni K.Scheffe
## 95% 3.597 4.904 4.729
## 99% 4.077 5.211 5.262
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
cvG = cv.glmnet(X, Y)
G = glmnet(X, Y)
beta = coef(G, s=cvG$lambda.1se)[-1]
selected = which(beta != 0)
selected
## [1] 1 2 4 5 6 8 11 12 13
confint(lm(Y ~ X[,selected]), level=1 - 2 * pnorm(-3.597))
## 0.0161 % 99.9839 %
## (Intercept) 11.85756472 47.15842931
## X[, selected]crim -0.17121180 0.04886442
## X[, selected]zn -0.00658954 0.09065389
## X[, selected]chas -0.11561086 6.17545861
## X[, selected]nox -27.79875140 -4.37827369
## X[, selected]rm 2.67285588 5.62647890
## X[, selected]dis -2.11486443 -0.74846471
## X[, selected]ptratio -1.26370306 -0.41357672
## X[, selected]black -0.00144639 0.01802963
## X[, selected]lstat -0.70015311 -0.34985515
random_order = sample(1:506, 506, replace=FALSE)
train = random_order[1:250]
test = random_order[251:506]
cvG_train = cv.glmnet(X[train,], Y[train])
G_train = glmnet(X[train,], Y[train])
beta_train = coef(G_train, s=cvG_train$lambda.1se)[-1]
selected_train = which(beta_train != 0)
selected_train
## [1] 1 4 5 6 8 11 12 13
split_conf = confint(lm(Y ~ X[,selected_train], subset=test))
split_conf
## 2.5 % 97.5 %
## (Intercept) 5.73093716 31.03918720
## X[, selected_train]crim -0.12134596 0.05437917
## X[, selected_train]chas 0.34990827 5.09444004
## X[, selected_train]nox -20.25303704 -3.11906246
## X[, selected_train]rm 4.24010553 6.35597613
## X[, selected_train]dis -1.37067478 -0.46688988
## X[, selected_train]ptratio -1.21807244 -0.62144636
## X[, selected_train]black 0.00320544 0.01712368
## X[, selected_train]lstat -0.59742776 -0.33269668
posi_comp = confint(lm(Y ~ X[,selected_train]), level=1 - 2 * pnorm(-3.597))
data.frame(split=split_conf[,2]-split_conf[,1], posi=posi_comp[,2]-posi_comp[,1])
## split posi
## (Intercept) 25.30825004 35.60486348
## X[, selected_train]crim 0.17572512 0.21928657
## X[, selected_train]chas 4.74453177 6.34562364
## X[, selected_train]nox 17.13397458 23.62464331
## X[, selected_train]rm 2.11587060 2.94811328
## X[, selected_train]dis 0.90378490 1.19676252
## X[, selected_train]ptratio 0.59662608 0.81820182
## X[, selected_train]black 0.01391824 0.01964244
## X[, selected_train]lstat 0.26473108 0.35334960
Is selected a better selection than selected_train? Probably…

Explicit construction of \(X\) for which \[ K_{\alpha}(X) \approx 0.63 \sqrt{p} \]
Still smaller than Scheffe in the constant (but the rate is the same).
Tries to describe the DFQL paradigm as well as the simultaneous and conditional approaches.
As mentioned before, these two are not really the same.
Why? Because these intervals (can be) valid:
## 2.5 % 97.5 %
## (Intercept) 5.73093716 31.03918720
## X[, selected_train]crim -0.12134596 0.05437917
## X[, selected_train]chas 0.34990827 5.09444004
## X[, selected_train]nox -20.25303704 -3.11906246
## X[, selected_train]rm 4.24010553 6.35597613
## X[, selected_train]dis -1.37067478 -0.46688988
## X[, selected_train]ptratio -1.21807244 -0.62144636
## X[, selected_train]black 0.00320544 0.01712368
## X[, selected_train]lstat -0.59742776 -0.33269668
Distinguishes questions derived from data from statistical questions / objects.
Example of statistical objects: hypothesis test, point estimator, posterior distribution.
A scientist \(U\) intuits important questions \(Q\) from data, a statistician (or \(U\)) must transform these into statistical objects.
All of these objects have in common a statistical model \({\cal M}\) that must be divined from \(Q\).
Tries to argue that simultaneous inference is limited to settings where the (implicit) map from \(Q\) to statistical objects has a fixed range.
E.g. in large scale inference, the set of parameters is imposed from the design.
In PoSI, set of questions prespecified by choice of \({\cal I}\).
What if my investigation of the data yields the observation that the difference between two parameters \(\beta_{j|E}\) and \(\beta_{j'|E'}\) is interesting? This contrast is likely not covered by \({\cal I}\).
Discusses the term leftover information…
Informally this is determined by two nested \(\sigma\)-algebras \({\cal F}_< \subset {\cal F}_>\).
In our data splitting example, we might take \({\cal F}_{<}\) to be generated by selected_train and \({\cal F}_>\) to be generated by (X[train,], Y[train], train).
Suppose \({\cal M}\) is a model for our data, \[
{\cal M} = \left\{N(X[,E]\beta_E, \sigma^2_E I): \beta_E \in \mathbb{R}^E, \sigma^2_E > 0\right\}
\] with \(E\) given by ~ 1 + crim + chas + nox + rm + dis + ptratio + black + lstat.
For either \({\cal F}_<\) or \({\cal F}_>\) we can condition each \(F \in {\cal M}\) on \({\cal F}\) yielding a collection of probability kernels (e.g. indexed by selected_train).
For a fixed \((\beta_E, \sigma^2_E)\) we can talk about the information \[ \text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}) | {\cal F}) \] where \(\ell^*\) is the appropriately normalized likelihood.
By the variance reducing property of conditioning \[ E[\text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}_<) | {\cal F}_<)] \geq E[\text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}_>) | {\cal F}_>)] \] so conditioning on \({\cal F}_>\) leaves less information (on average) about \((\beta_E, \sigma^2_E)\) than conditioning on \({\cal F}_<\).
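A toy numerical illustration of this variance reducing property (a Python sketch with a scalar random variable standing in for the score; nested partitions play the role of the nested \(\sigma\)-algebras):

```python
import numpy as np

# Conditioning on a finer partition (a larger sigma-algebra) leaves less
# expected conditional variance; here X is uniform on {0,...,7}.
x = np.arange(8.0)

def expected_cond_var(num_blocks):
    """E[Var(X | partition)] for a partition into equally likely blocks."""
    return np.mean([np.var(block) for block in np.split(x, num_blocks)])

coarse = expected_cond_var(2)   # F_<: {0..3}, {4..7}
fine = expected_cond_var(4)     # F_>: {0,1}, {2,3}, {4,5}, {6,7}
print(coarse, fine)             # 1.25 0.25
assert fine < coarse
```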