First paper (?) to address and propose a solution to inference after selection (DFQL?) in the context of linear regression.
Reduces the problem to simultaneous inference over all contrasts in a collection of linear models.
Provides a general purpose tool in the context of Gaussian regression models.
Emphasized and clarified the targets of inference.
Model: \(X \in \mathbb{R}^{n \times p}\) fixed and \[ \mathbb{R}^n \ni Y|X \sim N(\mu, \sigma^2 I) \] with \(\mu=\mu(X)\) unrestricted and \(\sigma^2\) assumed known or estimable.
Given \(E \subset \{1, \dots, p\}\) defined \[
\beta_E = \beta_E(\mu) = (X_E^TX_E)^{-1} X_E^T\mu \in \mathbb{R}^E
\] with \(X_E\) given by X[,E], the columns of \(X\) indexed by \(E\).
Also \[ \beta_{j | E}(\mu) = e_j^T\beta_E(\mu). \]
This notation differs slightly from the paper's: the actual statistical model is \[ {\cal M} = \left\{N(\mu, \sigma^2 I): \mu \in \mathbb{R}^n \right\}. \] (\(E\) comes from the equicorrelation set in the LASSO of Lee et al. (2016), but it could be anything: Eureka!)
These parameters can be defined in other models as well: suppose \(\mathbb{R}^p \times \mathbb{R} \ni (X, Y) \sim F\); then we can define \[ \beta_E(F) = E_F[X_EX_E^T]^{-1}E_F[X_E \cdot Y] \in \mathbb{R}^E \] with \(\beta_{j|E}(F)\) defined similarly.
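Since \(\beta_E(F)\) does not require a correctly specified linear model, it may help to see it computed directly. A small Python aside (the notes' code is R; numpy and all names below are illustrative assumptions):

```python
import numpy as np

# beta_E(F) is well defined without assuming a linear model: take F to be a
# discrete distribution on four (x, y) points (first coordinate = intercept).
pts_x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
pts_y = np.array([0.0, 1.0, 1.5, 4.0])   # deliberately not exactly linear
w = np.full(4, 0.25)                      # probabilities under F

# beta_E(F) = E_F[X_E X_E^T]^{-1} E_F[X_E * Y]
M = (pts_x.T * w) @ pts_x                 # E_F[X_E X_E^T]
v = pts_x.T @ (w * pts_y)                 # E_F[X_E Y]
beta_F = np.linalg.solve(M, v)

# With equal weights this coincides with least squares on the four points.
beta_ols, *_ = np.linalg.lstsq(pts_x, pts_y, rcond=None)
print(beta_F)
assert np.allclose(beta_F, beta_ols)
```

When \(F\) is the empirical distribution of the data, \(\beta_E(F)\) is exactly the OLS coefficient, which is the point of the check above.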
Subsequent work by the PoSI group (Buja et al. (2014)) revisited regression in this context.
Arun will give us a guest lecture on Monday on PoSI in this context.
In model \({\cal M}\) we have \[ \widehat{\beta}_{j|E} = e_j^T(X_E^TX_E)^{-1} X_E^TY \sim N\left(\beta_{j|E}(\mu), \sigma^2 e_j^T(X_E^TX_E)^{-1}e_j \right) \] and \[ \text{Cov}(\widehat{\beta}_{j|E}, \widehat{\beta}_{j'|E'}) = \sigma^2 e_j^T(X_E^TX_E)^{-1}X_E^TX_{E'}(X_{E'}^TX_{E'})^{-1}e_{j'} \]
This reduces to a multivariate normal multiple comparisons problem: \(\widehat{\theta} \sim N(\theta, \Sigma)\) with \(\theta=(\theta_i)_{i \in {\cal I}}\), where \[ {\cal I} \subset \left\{(j,E): E \subset \{1, \dots, p\}, j \in E \right\}. \]
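Each \(\widehat{\beta}_{j|E}\) is a fixed linear functional \(a_{j,E}^TY\) of the same Gaussian vector, which is why the whole collection is jointly normal with the covariance above. A quick numerical check of the cross-model covariance formula, sketched in Python (numpy and the helper function are assumptions, not from the paper):

```python
import numpy as np

# Every \hat beta_{j|E} is a_{j,E}^T Y for a fixed vector a_{j,E}, so the whole
# collection is jointly Gaussian; check the covariance formula numerically.
rng = np.random.default_rng(1)
n, p, sigma2 = 50, 5, 2.0
X = rng.standard_normal((n, p))

def contrast_vector(E, j):
    """a such that a^T Y = e_j^T (X_E^T X_E)^{-1} X_E^T Y."""
    XE = X[:, E]
    return XE @ np.linalg.inv(XE.T @ XE)[:, E.index(j)]

E, Ep = [0, 1, 2], [1, 3]
a = contrast_vector(E, 1)     # \hat beta_{1|E}
b = contrast_vector(Ep, 3)    # \hat beta_{3|E'}

# sigma^2 a^T b should equal the displayed cross-model covariance formula.
XE, XEp = X[:, E], X[:, Ep]
formula = sigma2 * (np.linalg.inv(XE.T @ XE) @ XE.T @ XEp
                    @ np.linalg.inv(XEp.T @ XEp))[E.index(1), Ep.index(3)]
assert np.isclose(sigma2 * (a @ b), formula)
```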
Simultaneous \(100(1-\alpha)\%\) confidence intervals can be found by projecting a translate of any set \(S\) such that \[ P(\epsilon \in S) \geq 1-\alpha, \qquad \epsilon \sim N(0, \Sigma). \]
The paper chooses a confidence box: find \(K_{\alpha}\) s.t. \[ P\left(\cap_{i \in {\cal I}} \left \{\epsilon_i \in [-K_{\alpha} \Sigma_{ii}^{1/2}, K_{\alpha} \Sigma_{ii}^{1/2}] \right\} \right) \geq 1-\alpha \]
Simultaneous confidence intervals \[ \widehat{\beta}_{j|E} \pm K_{\alpha} \cdot \text{SE}(\widehat{\beta}_{j|E}). \]
Suppose \(\epsilon \sim N(0, \sigma^2 X^TX)\); then \[ P\left(\epsilon^T(\sigma^2 X^TX)^{-1}\epsilon \geq \chi^2_{p, 1-\alpha}\right) = \alpha, \] which (along with projection of the sphere onto lines) yields the Scheffe simultaneous confidence intervals \[ \widehat{\beta}_{j|E} \pm \sqrt{\chi^2_{p, 1-\alpha}} \cdot \text{SE}(\widehat{\beta}_{j|E}), \] with \(\chi^2_{p, 1-\alpha}=O(p)\).
Formally Scheffe uses unknown variance, so we might use \(\sqrt{p \cdot F_{p,n-p,1-\alpha}}\) instead.
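To see the \(O(p)\) growth concretely: \(\chi^2_p\) has mean \(p\) and standard deviation \(\sqrt{2p}\), so its \((1-\alpha)\) quantile is \(p + O(\sqrt{p})\) and the Scheffe constant grows like \(\sqrt{p}\). A small Monte Carlo sketch in Python (numpy assumed; the notes otherwise use R):

```python
import numpy as np

# The Scheffe constant sqrt(chi^2_{p,1-alpha}) grows like sqrt(p); the ratio
# to sqrt(p) tends to 1 as p grows.
rng = np.random.default_rng(0)
alpha = 0.05
ratios = []
for p in (10, 100, 400):
    q = np.quantile(rng.chisquare(p, size=200_000), 1 - alpha)
    ratios.append(np.sqrt(q / p))   # Scheffe constant relative to sqrt(p)
    print(p, np.sqrt(q), np.sqrt(q / p))
```

So the Scheffe half-width is of order \(\sqrt{p} \cdot \text{SE}\), while \(K_{\alpha}(X)\) can be much smaller for restricted \({\cal I}\).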
\({\cal I}\) need not range over all submodels if it is specified a priori.
Examples include:
Models of bounded sparsity \(s < p\)
Sequence of nested models \(E_1 \subset E_2 \subset \dots \subset E_m \subset \{1, \dots, p\}\) (e.g. polynomial regression)
PoSI1: when a certain effect \(v\) is of most interest, one can restrict to \({\cal I}_v = \left\{(j,E) \in {\cal I}: v = j \right\}\)
At least in this paper, \(K_{\alpha}\) is found by brute force computation.
For \(1 \leq b \leq B\)
Draw \(Y^*_{0,b} \sim N(0, \sigma^2 I)\)
Compute \[ Z^*_{i,b} = \frac{\widehat{\beta}_{j|E}(Y^*_{0,b})}{SE(\widehat{\beta}_{j|E})} = \frac{e_j^T(X_E^TX_E)^{-1}X_E^TY_{0,b}^*}{\sigma \sqrt{(X_E^TX_E)^{-1}[j,j]}} , \qquad i=(j,E) \in {\cal I} \]
Record \[Z_{\max,b} = \max_{i \in {\cal I}} | Z^*_{i,b}|\]
Set \(\hat{K}_{\alpha}\) to be \((1-\alpha)\) quantile of \((Z_{\max,b})_{1 \leq b \leq B}\)
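A minimal Python sketch of this brute-force scheme, for a tiny random design with \(\sigma\) known (all names here are illustrative; the PoSI R package used below implements this far more carefully):

```python
import numpy as np
from itertools import combinations

# Brute-force Monte Carlo estimate of K_alpha for a small random design.
rng = np.random.default_rng(0)
n, p, sigma, alpha, B = 40, 4, 1.0, 0.05, 2000
X = rng.standard_normal((n, p))

# Precompute unit-norm contrast vectors a_{j,E}, so that for Y0 ~ N(0, sigma^2 I)
# each Z_{(j,E)} = a_{j,E}^T Y0 / sigma is standard normal.
rows = []
for m in range(1, p + 1):
    for E in combinations(range(p), m):
        XE = X[:, E]
        A = XE @ np.linalg.inv(XE.T @ XE)   # columns are unnormalized a_{j,E}
        for k in range(m):
            rows.append(A[:, k] / np.linalg.norm(A[:, k]))
A = np.array(rows)                           # |I| x n

Z_max = np.empty(B)
for b in range(B):
    Y0 = sigma * rng.standard_normal(n)
    Z_max[b] = np.max(np.abs(A @ Y0)) / sigma
K_hat = np.quantile(Z_max, 1 - alpha)
print(K_hat)   # lies between z_{1-alpha/2} and the Scheffe constant
```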
library(PoSI)
library(MASS)
data(Boston)
set.seed(1)
Y = as.numeric(Boston[,14])
X = as.matrix(Boston[,-14])
PoSI_Boston = PoSI(X)
## Number of contrasts/adjusted predictors to process: 53248
## Number of bundles: 13
## Done with bundle 1 / 13 model sz = 1
## Done with bundle 2 / 13 model sz = 2
## Done with bundle 3 / 13 model sz = 3
## Done with bundle 4 / 13 model sz = 4
## Done with bundle 5 / 13 model sz = 5
## Done with bundle 6 / 13 model sz = 6
## Done with bundle 7 / 13 model sz = 7
## Done with bundle 8 / 13 model sz = 8
## Done with bundle 9 / 13 model sz = 9
## Done with bundle 10 / 13 model sz = 10
## Done with bundle 11 / 13 model sz = 11
## Done with bundle 12 / 13 model sz = 12
## Done with bundle 13 / 13 model sz = 13
## p = 13 , d = 13 processed 53248 tests in 8191 models. Times in seconds:
## user system elapsed
## 2.556 0.566 3.465
## K.PoSI K.Bonferroni K.Scheffe
## 95% 3.597 4.904 4.729
## 99% 4.077 5.211 5.262
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-18
cvG = cv.glmnet(X, Y)
G = glmnet(X, Y)
beta = coef(G, s=cvG$lambda.1se)[-1]
selected = which(beta != 0)
selected
## [1] 1 2 4 5 6 8 11 12 13
confint(lm(Y ~ X[,selected]), level=1 - 2 * pnorm(-3.597))
## 0.0161 % 99.9839 %
## (Intercept) 11.85756472 47.15842931
## X[, selected]crim -0.17121180 0.04886442
## X[, selected]zn -0.00658954 0.09065389
## X[, selected]chas -0.11561086 6.17545861
## X[, selected]nox -27.79875140 -4.37827369
## X[, selected]rm 2.67285588 5.62647890
## X[, selected]dis -2.11486443 -0.74846471
## X[, selected]ptratio -1.26370306 -0.41357672
## X[, selected]black -0.00144639 0.01802963
## X[, selected]lstat -0.70015311 -0.34985515
random_order = sample(1:506, 506, replace=FALSE)
train = random_order[1:250]
test = random_order[251:506]
cvG_train = cv.glmnet(X[train,], Y[train])
G_train = glmnet(X[train,], Y[train])
beta_train = coef(G_train, s=cvG_train$lambda.1se)[-1]
selected_train = which(beta_train != 0)
selected_train
## [1] 1 4 5 6 8 11 12 13
split_conf = confint(lm(Y ~ X[,selected_train], subset=test))
split_conf
## 2.5 % 97.5 %
## (Intercept) 5.73093716 31.03918720
## X[, selected_train]crim -0.12134596 0.05437917
## X[, selected_train]chas 0.34990827 5.09444004
## X[, selected_train]nox -20.25303704 -3.11906246
## X[, selected_train]rm 4.24010553 6.35597613
## X[, selected_train]dis -1.37067478 -0.46688988
## X[, selected_train]ptratio -1.21807244 -0.62144636
## X[, selected_train]black 0.00320544 0.01712368
## X[, selected_train]lstat -0.59742776 -0.33269668
posi_comp = confint(lm(Y ~ X[,selected_train]), level=1 - 2 * pnorm(-3.597))
data.frame(split=split_conf[,2]-split_conf[,1], posi=posi_comp[,2]-posi_comp[,1])
## split posi
## (Intercept) 25.30825004 35.60486348
## X[, selected_train]crim 0.17572512 0.21928657
## X[, selected_train]chas 4.74453177 6.34562364
## X[, selected_train]nox 17.13397458 23.62464331
## X[, selected_train]rm 2.11587060 2.94811328
## X[, selected_train]dis 0.90378490 1.19676252
## X[, selected_train]ptratio 0.59662608 0.81820182
## X[, selected_train]black 0.01391824 0.01964244
## X[, selected_train]lstat 0.26473108 0.35334960
Is selected a better selection than selected_train? Probably…

Explicit construction of \(X\) for which \[ K_{\alpha}(X) \approx 0.63 \sqrt{p} \]
Still smaller than Scheffe in the constant (but the rate is the same).
Tries to describe the DFQL paradigm as well as the simultaneous and conditional approaches.
As mentioned before, these two are not really the same.
Why? Because these intervals (can be) valid:
## 2.5 % 97.5 %
## (Intercept) 5.73093716 31.03918720
## X[, selected_train]crim -0.12134596 0.05437917
## X[, selected_train]chas 0.34990827 5.09444004
## X[, selected_train]nox -20.25303704 -3.11906246
## X[, selected_train]rm 4.24010553 6.35597613
## X[, selected_train]dis -1.37067478 -0.46688988
## X[, selected_train]ptratio -1.21807244 -0.62144636
## X[, selected_train]black 0.00320544 0.01712368
## X[, selected_train]lstat -0.59742776 -0.33269668
Distinguishes questions derived from data from statistical questions / objects.
Example of statistical objects: hypothesis test, point estimator, posterior distribution.
A scientist \(U\) intuits important questions \(Q\) from data, a statistician (or \(U\)) must transform these into statistical objects.
All of these objects have in common a statistical model \({\cal M}\) that must be divined from \(Q\).
Tries to argue that simultaneous inference is limited to settings where the (implicit) map from \(Q\) to statistical objects has a fixed range.
E.g. in large scale inference, the set of parameters is imposed from the design.
In PoSI, set of questions prespecified by choice of \({\cal I}\).
What if my investigation of the data yields the observation that the difference between two parameters \(\beta_{j|E}\) and \(\beta_{j'|E'}\) is interesting? This contrast is likely not covered by \({\cal I}\).
Discusses the term leftover information…
Informally this is determined by two nested \(\sigma\)-algebras \({\cal F}_< \subset {\cal F}_>\).
In our data splitting example, we might take \({\cal F}_{<}\) to be generated by selected_train and \({\cal F}_>\) to be generated by (X[train,], Y[train], train).
Suppose \({\cal M}\) is a model for our data, \[
{\cal M} = \left\{N(X[,E]\beta_E, \sigma^2_E I): \beta_E \in \mathbb{R}^E, \sigma^2_E > 0\right\}
\] with \(E\) given by ~ 1 + crim + chas + nox + rm + dis + ptratio + black + lstat.
For either \({\cal F}_<\) or \({\cal F}_>\) we can condition each \(F \in {\cal M}\) on \({\cal F}\) yielding a collection of probability kernels (e.g. indexed by selected_train).
For a fixed \((\beta_E, \sigma^2_E)\) we can talk about the information \[ \text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}) | {\cal F}) \] where \(\ell^*\) is the appropriately normalized likelihood.
By the variance reducing property of conditioning \[ E[\text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}_<) | {\cal F}_<)] \geq E[\text{Var}(\nabla \ell^* (\beta_E, \sigma^2_E; {\cal F}_>) | {\cal F}_>)] \] so conditioning on \({\cal F}_>\) leaves less information (on average) about \((\beta_E, \sigma^2_E)\) than conditioning on \({\cal F}_<\).
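A toy numerical illustration of this variance reducing property (a Python sketch with a scalar random variable standing in for the score; nested partitions play the role of the nested \(\sigma\)-algebras):

```python
import numpy as np

# Conditioning on a finer partition (a larger sigma-algebra) leaves less
# expected conditional variance; here X is uniform on {0,...,7}.
x = np.arange(8.0)

def expected_cond_var(num_blocks):
    """E[Var(X | partition)] for a partition into equally likely blocks."""
    return np.mean([np.var(block) for block in np.split(x, num_blocks)])

coarse = expected_cond_var(2)   # F_<: {0..3}, {4..7}
fine = expected_cond_var(4)     # F_>: {0,1}, {2,3}, {4,5}, {6,7}
print(coarse, fine)             # 1.25 0.25
assert fine < coarse
```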