Considers the problem of deciding when to stop fitting the LASSO (cf. cv.glmnet using lambda.1se).
Follows the PoSI paper’s suggestion of tailoring adjustment criteria to specific model selection procedures.
Does not address confidence intervals; best thought of as a goodness of fit test along the LASSO path.
The analog for step(..., direction='forward') would be: can I stop adding variables yet?
When \(X^TX=I\) \[ \lambda_i = |X^TY|_{(i)}, \] the \(i\)th largest of the \(|X_j^TY|\).
Test statistic \[
T_i = C_i \lambda_i (\lambda_i - \lambda_{i+1})
\] where \(C_i\) depends on \(X\) and the sequence of decisions in lars(X, Y).
Model: \[ {\cal M} = \left\{{\cal L}(Y|X) \right\} = \left\{N(X\beta, \sigma^2I): \beta \in \mathbb{R}^p\right\} \]
Under \(H_0:\beta=0\) (with \(\sigma^2=1\)), the \(\lambda_i\)’s are the order statistics of IID \(|N(0,1)|\)’s when \(X^TX=I\).
Let’s look at \(\lambda\)’s a little \[ \begin{aligned} \lambda_1(X, Y) &= \|X^TY\|_{\infty}\\ &= \max_{j: 1 \leq j \leq p} |X_j^TY| \end{aligned} \]
Set \[ j^*(X, Y) = \text{argmax}_{j} |X_j^TY|, \qquad s^*(X,Y) = \text{sign}(X_{j^*}^TY) \]
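With an orthonormal design these quantities are immediate to compute directly; a minimal Python sketch (numpy assumed, purely illustrative — the variable names are not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10

# Orthonormal design: X^T X = I via (reduced) QR decomposition
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
Y = rng.standard_normal(n)            # H_0: beta = 0, sigma^2 = 1

Z = X.T @ Y                           # X_j^T Y, iid N(0,1) under H_0
lam1 = np.max(np.abs(Z))              # lambda_1 = ||X^T Y||_inf
jstar = int(np.argmax(np.abs(Z)))     # j^* = argmax_j |X_j^T Y|
sstar = int(np.sign(Z[jstar]))        # s^* = sign(X_{j^*}^T Y)
```

By construction \(s^* X_{j^*}^TY = \lambda_1\), which is easy to confirm numerically.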
Now, \[ \begin{aligned} \lambda_2(X,Y) &= \max_{l \neq j^*(X, Y), s' \in \pm 1} \frac{s' X_l^T(I - P_{j^*})Y}{1 - s' s^* X_l^TX_{j^*}} \\ &= M(j^*, s^*) \end{aligned} \] where \[ M(j,s) = \max_{l \neq j, s' \in \pm 1} \frac{s' X_l^T(I - P_{j})Y}{1 - s' X_l^TX_{j}} \]
Note: under \({\cal M}\) (not just \(H_0\)) we have \(M(j,s)\) is independent of \(X_j^TY\).
A simple manipulation shows that for every \((j_0, s_0)\) the event \[ E_{j_0, s_0} = \left\{(j^*, s^*)=(j_0, s_0) \right\} \] is equivalent to \[ \{s_0 X_{j_0}^TY \geq M(j_0, s_0) \} \]
Hence, under \({\cal M}\) for every \((j_0, s_0)\) \[ {\cal L}(s_0 X_{j_0}^TY | (j^*,s^*)=(j_0, s_0)) = {\cal L}(s_0 X_{j_0}^TY | s_0 X_{j_0}^TY \geq M(j_0, s_0)) \]
Conditioning on \(M(j_0, s_0)\) yields \[ {\cal L}(s_0 X_{j_0}^TY | s_0 X_{j_0}^TY \geq M(j_0, s_0), M(j_0, s_0)) \overset{D}{=} \text{TN}_{[M(j_0, s_0),\infty)}(s_0X_{j_0}^TX\beta, \sigma^2)\]
First example of polyhedral lemma.
Standard CDF transform (assuming \(\sigma^2=1\)) implies for every \((j_0, s_0)\), conditional on the event \(E_{j_0, s_0}\) \[ \frac{1 - \Phi(s_0 X_{j_0}^T(Y - X\beta))}{1 - \Phi(M(j_0, s_0) - s_0 X_{j_0}^TX\beta)} \sim \text{Unif}(0,1). \]
Under \(H_0\), for every \((j_0, s_0)\) on the event \(E_{j_0,s_0}\) we have \[ \frac{1 - \Phi(s_0 X_{j_0}^TY)}{1 - \Phi(M(j_0, s_0))} \sim \text{Unif}(0,1). \]
Hence, marginally as well we have \(\text{Unif}(0,1)\) under \(H_0\).
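When \(X^TX=I\), \(M(j^*, s^*)\) reduces to the second-largest \(|X_j^TY|\), so the uniformity of the survival-function ratio is easy to check by simulation. A Python sketch (numpy/scipy assumed, for illustration only):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(1)
p, nsim = 50, 2000

pvals = np.empty(nsim)
for i in range(nsim):
    Z = np.abs(rng.standard_normal(p))   # |X_j^T Y| under H_0, X^T X = I
    lam1, lam2 = np.sort(Z)[-2:][::-1]   # two largest absolute values
    # (1 - Phi(lambda_1)) / (1 - Phi(lambda_2))
    pvals[i] = norm.sf(lam1) / norm.sf(lam2)
```

A Kolmogorov–Smirnov test against Unif(0,1) should not reject here.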
Suppose \(Z \sim N(0, 1)\); then, as \(t \to \infty\), \[ {\cal L}(t(Z - t) | Z > t) \overset{D}{\to} \text{Exp}(1). \]
Applying this to our setting, each conditional distribution satisfies (with \(\sigma^2=1\)) \[ \lambda_2 ( \lambda_1 - \lambda_2) | (j^*,s^*)=(j_0, s_0), \lambda_2 \to \text{Exp}(1). \]
As \(\lambda_1 / \lambda_2 \to 1\) under \(H_0\), Slutsky’s theorem implies the same result holds for \[ \lambda_1 ( \lambda_1 - \lambda_2) | (j^*,s^*)=(j_0, s_0), \lambda_2 \to \text{Exp}(1). \]
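A quick sanity check of this limit in the orthogonal case (Python/numpy as an illustration; convergence is slow, on the order of \(1/\log p\), so \(p\) is taken large):

```python
import numpy as np

rng = np.random.default_rng(2)
p, nsim = 5000, 2000

stats = np.empty(nsim)
for i in range(nsim):
    Z = np.abs(rng.standard_normal(p))   # orthogonal design, H_0, sigma^2 = 1
    lam1, lam2 = np.sort(Z)[-2:][::-1]   # lambda_1, lambda_2
    stats[i] = lam1 * (lam1 - lam2)      # covariance-test-style statistic
```

The sample mean should be near 1, the mean of Exp(1), though not exactly 1 at finite \(p\).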
The later \(\lambda\)’s have a similar form (though making this precise requires an expert in LARS!) \[ \lambda_{k+1} \approx \max_{j \not \in A_k} \frac{X_j^T(I-P_{A_k})Y}{\dots} \]
Again fixing \((A_k, s_{A_k})\) one can express \(\lambda_k, \lambda_{k+1}\)’s as particular Gaussians…
The formula for \(TN(\dots)\) above is true for all of \({\cal M}\)!
Under \(H_0:\beta=0\), \(M(j_0, s_0)\) is just the maximum of some collection of centered normal variables.
We could just use a parametric bootstrap to sample from its distribution.
This is essentially the same test of \(H_0\) as one based on \[ {\cal L}_0(\|X^TY\|_{\infty}), \] which would be like PoSI on 1-sparse models…
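A sketch of this parametric bootstrap for the null law of \(\|X^TY\|_\infty\) with a general (non-orthogonal) design — Python/numpy used purely for illustration, with all names invented here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit-norm (but not orthogonal) columns

# Parametric bootstrap of L_0(||X^T Y||_inf) under H_0: beta = 0, sigma^2 = 1
B = 2000
boot = np.array([np.max(np.abs(X.T @ rng.standard_normal(n))) for _ in range(B)])

Y = rng.standard_normal(n)            # observed data (here actually drawn from H_0)
lam1 = np.max(np.abs(X.T @ Y))
pval = np.mean(boot >= lam1)          # bootstrap p-value for H_0
```

No orthogonality is needed: the bootstrap handles arbitrary column correlations in \(X\).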
Under \(\beta \neq 0\), the law of \(M(j_0, s_0)\) depends in some way on \(\beta\).
We could try plugging in \(\hat{\beta}\)… will this work?
Generally, having seen \(j^*\) we could think of testing \[ H_{j^*,0}: X_{j^*}^TX\beta=0 \]
Or, for confidence intervals \[ H_{j^*,\delta}: X_{j^*}^TX\beta=\delta. \]
Assuming \(X^TX=I\) and \(p=100\), \(\sigma^2=1\) try plugging in \(\widehat{\beta}\) to calibrate the law of \[ s^*X_{j^*}^TY | (j^*,s^*)=(j_0, s_0) \] to test e.g. \(H_{j^*,0}\) or \(H_{j^*,\delta}\) for various true values of \(\beta\) (including 0). Does this work well? What do you plugin for the value of \(\beta_{j^*}\) under \(H_{j^*,\delta}\)?
Suppose \((j^*,s^*)=(1,+1)\) and \(\beta=(\delta, 1, \dots, 1)\) (unbeknownst to you, of course). Try plugging in beta[2:p]=0 and testing \(H_{j^*,\delta}:\delta=1\) without conditioning on \(M(1,+1)\), instead simulating its distribution (again keeping \(p=100\)). Do you expect this to work?
library(lars)  # provides lars()

simulate = function(n=1000, p=20, plot=FALSE, s=0, k=1) {
    X = matrix(rnorm(n*p), n, p)
    X = scale(X, TRUE, TRUE)
    beta = rep(0, p)
    if (s > 0) {
        beta[1:s] = (c(1:s)+4) / sqrt(n)   # s nonzero coefficients
    }
    Y = rnorm(n) + X %*% beta
    L = lars(X, Y)
    if (plot) {
        plot(L)
    }
    L = L$lambda
    return(list(L1=L[k],                                  # lambda_k
                L2=L[k+1],                                # lambda_{k+1}
                E=(L[k]-L[k+1])*L[k],                     # lambda_k(lambda_k - lambda_{k+1})
                P=pnorm(L[k], lower.tail=FALSE)/pnorm(L[k+1], lower.tail=FALSE),
                B=min(2 * (p - k + 1) * pnorm(L[k], lower.tail=FALSE), 1)))
}

## $L1
## [1] 1.873002
##
## $L2
## [1] 1.68958
##
## $E
## [1] 0.3435503
##
## $P
## [1] 0.6702797
##
## $B
## [1] 1
## $L1
## [1] 8.105509
##
## $L2
## [1] 6.262178
##
## $E
## [1] 14.94114
##
## $P
## [1] 1.383557e-06
##
## $B
## [1] 1.050501e-14
D = data.frame(simulate(s=3, k=2))
for (i in 1:500) {
    result = simulate(s=3, k=2)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=3))
for (i in 1:500) {
    result = simulate(s=3, k=3)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=4))
for (i in 1:500) {
    result = simulate(s=3, k=4)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=5))
for (i in 1:500) {
    result = simulate(s=3, k=5)
    D = rbind(D, result)
}
##
## Call:
## larInf(obj = L)
##
## Standard deviation of noise (specified or estimated) sigma = 4.745
##
## Sequential testing results with alpha = 0.100
##
## Step Var Coef Z-score P-value LowConfPt UpConfPt Spacing CovTest
## 1 13 -0.950 -32.129 0.000 -0.999 -0.901 0.000 0.000
## 2 6 5.095 13.383 0.000 4.463 5.791 0.000 0.000
## 3 11 -0.931 -8.718 0.000 -1.108 -0.746 0.000 0.000
## 4 12 0.010 4.160 0.042 0.001 0.018 0.042 0.037
## 5 4 3.320 3.955 0.012 1.071 6.175 0.012 0.008
## 6 1 -0.037 -1.272 0.617 -0.172 0.377 0.617 0.766
## 7 8 -0.597 -4.937 0.144 -1.911 0.406 0.018 0.015
## 8 5 -16.223 -5.111 0.270 -35.591 23.349 0.001 0.000
## 9 2 0.042 3.189 0.150 -0.029 0.114 0.041 0.032
## 10 3 -0.046 -0.837 0.800 -0.475 2.814 0.856 0.953
## 11 9 0.135 3.310 0.418 -0.571 0.614 0.001 0.175
## 12 10 -0.012 -3.280 0.187 -0.063 0.019 0.001 0.000
## 13 7 0.001 0.052 0.466 -0.735 0.854 0.956 0.997
## LowTailArea UpTailArea
## 0.049 0.049
## 0.048 0.049
## 0.049 0.049
## 0.049 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.000
## 0.050 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.000
## 0.050 0.050
##
## Estimated stopping point from ForwardStop rule = 5
Introduced the univariate \(TN\) (truncated normal) on \(R_c = (-\infty, -c] \cup [c,\infty)\) for inference.
Model: \(Z \sim N(\mu, \sigma^2)\) truncated to \(R_c\).
For simplicity let’s work with \(\sigma^2=1\)
CDF: \[ \begin{aligned} F^c_{\mu}(Z \leq t) &= F^c_{\mu, 1}(Z \leq t) \\ &= \begin{cases} \frac{\Phi(t - \mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & t < -c \\ \frac{\Phi(-c - \mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & -c \leq t \leq c \\ \frac{\Phi(t - \mu) - \Phi(c-\mu) + \Phi(-c-\mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & t > c \end{cases} \end{aligned} \]
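A direct transcription of this CDF into code (Python/scipy as an illustration; \(\sigma^2=1\), and the function name tn_cdf is invented here):

```python
from scipy.stats import norm

def tn_cdf(t, mu, c):
    """CDF of N(mu, 1) truncated to (-inf, -c] U [c, inf)."""
    denom = norm.cdf(-c - mu) + norm.sf(c - mu)   # total truncated mass
    if t < -c:
        num = norm.cdf(t - mu)
    elif t <= c:                                  # flat on the excluded region (-c, c)
        num = norm.cdf(-c - mu)
    else:
        num = norm.cdf(t - mu) - norm.cdf(c - mu) + norm.cdf(-c - mu)
    return num / denom
```

For \(\mu = 0\) the symmetry of \(R_c\) forces \(F^c_0(c) = 1/2\), a useful spot check.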
More succinctly, the model \[ {\cal M}^*_c = \left\{P^c_{\mu}: \mu \in \mathbb{R} \right\} \] is an exponential family on \(\mathbb{R}\) with sufficient statistic \(T(z)=z\) and reference measure with Lebesgue density \[ z \mapsto e^{-z^2/2} 1_{R_c}(z) \]
Likelihood \[ -\log L^c(\mu|Z) = \frac{1}{2}(Z-\mu)^2 + \log\left(\Phi(-c-\mu) + 1 - \Phi(c-\mu)\right) \]
MLE (a convex problem – why?) \[ Z = \hat{\mu}(Z) + \frac{\phi(c-\hat{\mu}(Z)) - \phi(-c-\hat{\mu}(Z))}{\Phi(-c-\hat{\mu}(Z)) + 1 - \Phi(c-\hat{\mu}(Z))} \]
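The stationarity equation can be solved numerically by root finding on the score (the derivative of the log-likelihood in \(\mu\)); a Python sketch using scipy's brentq, with an ad hoc bracket width and an invented function name:

```python
from scipy.optimize import brentq
from scipy.stats import norm

def tn_mle(Z, c):
    """MLE of mu for N(mu, 1) truncated to R_c, via the score equation."""
    def score(mu):
        denom = norm.cdf(-c - mu) + norm.sf(c - mu)
        # score = (Z - mu) minus the truncated-mean shift; zero at the MLE
        return (Z - mu) - (norm.pdf(c - mu) - norm.pdf(-c - mu)) / denom
    # convexity of the negative log-likelihood makes the root unique
    return brentq(score, Z - 20, Z + 20)
```

For an observation beyond the truncation point (e.g. \(Z > c > 0\)) the MLE shrinks toward 0, since truncation inflates the observed magnitude.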
Testing hypotheses \[ H_0: \mu=\mu_0 \]
Use \(p\)-value \[ F^c_{\mu_0}(Z) \overset{H_0}{\sim} \text{Unif}(0,1) \]
\(100(1-\alpha)\%\) confidence interval \[ \left\{\mu: F^c_{\mu}(Z) \in [\alpha/2, 1-\alpha/2]\right\} \]
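Since \(F^c_\mu(Z)\) is decreasing in \(\mu\) (monotone likelihood ratio in this exponential family), each endpoint can be found by one-dimensional root finding. A Python sketch (scipy assumed; the CDF is restated so the block is self-contained, and the bracket width is ad hoc):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def tn_cdf(t, mu, c):
    """CDF of N(mu, 1) truncated to (-inf, -c] U [c, inf)."""
    denom = norm.cdf(-c - mu) + norm.sf(c - mu)
    if t < -c:
        num = norm.cdf(t - mu)
    elif t <= c:
        num = norm.cdf(-c - mu)
    else:
        num = norm.cdf(t - mu) - norm.cdf(c - mu) + norm.cdf(-c - mu)
    return num / denom

def tn_interval(Z, c, alpha=0.05):
    """Equal-tailed interval {mu : F^c_mu(Z) in [alpha/2, 1-alpha/2]}."""
    # F^c_mu(Z) decreases in mu, so each tail inverts to a single root
    lo = brentq(lambda mu: tn_cdf(Z, mu, c) - (1 - alpha/2), Z - 20, Z + 20)
    hi = brentq(lambda mu: tn_cdf(Z, mu, c) - alpha/2, Z - 20, Z + 20)
    return lo, hi
```

At the returned endpoints the CDF evaluates to \(1-\alpha/2\) and \(\alpha/2\) respectively, which is exactly the defining property of the interval.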
For various values of \(c\), plot the 95% confidence intervals of Zhong and Prentice with the data \(Z\) on the \(x\)-axis. Repeat the procedure with a single half-open interval as truncation region: \([c,\infty)\). What is the most striking thing about these intervals? Based on the intervals, compare the power of one-sided tests of \(H_0:\mu=0\) using the single half-open interval to the power of a two-sided test with the truncation region \(R_c\).