Considers the problem of deciding when to stop fitting the LASSO (cf. cv.glmnet using lambda.1se).
Follows the PoSI paper’s suggestion of tailoring adjustment criteria to specific model selection procedures.
Does not address confidence intervals; best thought of as a goodness of fit test along the LASSO path.
The analog for step(..., direction='forward') would be: can I stop adding variables yet?
When \(X^TX=I\) \[ \lambda_i = |X^TY|_{(i)}, \] the \(i\)th largest of the \(|X_j^TY|\).
Test statistic \[
T_i = C_i \lambda_i (\lambda_i - \lambda_{i+1})
\] where \(C_i\) depends on \(X\) and the sequence of decisions in lars(X, Y).
Model: \[ {\cal M} = \left\{{\cal L}(Y|X) \right\} = \left\{N(X\beta, \sigma^2I): \beta \in \mathbb{R}^p\right\} \]
Under \(H_0:\beta=0\) (with \(\sigma^2=1\)), the \(\lambda_i\)’s are the order statistics of IID \(|N(0,1)|\)’s when \(X^TX=I\).
Let’s look at \(\lambda\)’s a little \[ \begin{aligned} \lambda_1(X, Y) &= \|X^TY\|_{\infty}\\ &= \max_{j: 1 \leq j \leq p} |X_j^TY| \end{aligned} \]
Set \[ j^*(X, Y) = \text{argmax}_{j} |X_j^TY|, \qquad s^*(X,Y) = \text{sign}(X_{j^*}^TY) \]
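With an orthonormal design these quantities are immediate to compute directly; a minimal Python sketch (numpy assumed, purely illustrative — the variable names are not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10

# Orthonormal design: X^T X = I via (reduced) QR decomposition
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
Y = rng.standard_normal(n)            # H_0: beta = 0, sigma^2 = 1

Z = X.T @ Y                           # X_j^T Y, iid N(0,1) under H_0
lam1 = np.max(np.abs(Z))              # lambda_1 = ||X^T Y||_inf
jstar = int(np.argmax(np.abs(Z)))     # j^* = argmax_j |X_j^T Y|
sstar = int(np.sign(Z[jstar]))        # s^* = sign(X_{j^*}^T Y)
```

By construction \(s^* X_{j^*}^TY = \lambda_1\), which is easy to confirm numerically.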
Now, \[ \begin{aligned} \lambda_2(X,Y) &= \max_{l \neq j^*(X, Y), s' \in \pm 1} \frac{s' X_l^T(I - P_{j^*})Y}{1 - s' s^* X_l^TX_{j^*}} \\ &= M(j^*, s^*) \end{aligned} \] where \[ M(j,s) = \max_{l \neq j, s' \in \pm 1} \frac{s' X_l^T(I - P_{j})Y}{1 - s' X_l^TX_{j}} \]
Note: under \({\cal M}\) (not just \(H_0\)) we have \(M(j,s)\) is independent of \(X_j^TY\).
A simple manipulation shows that for every \((j_0, s_0)\) the event \[ E_{j_0, s_0} = \left\{(j^*, s^*)=(j_0, s_0) \right\} \] is equivalent to \[ \{s_0 X_{j_0}^TY \geq M(j_0, s_0) \} \]
Hence, under \({\cal M}\) for every \((j_0, s_0)\) \[ {\cal L}(s_0 X_{j_0}^TY | (j^*,s^*)=(j_0, s_0)) = {\cal L}(s_0 X_{j_0}^TY | s_0 X_{j_0}^TY \geq M(j_0, s_0)) \]
Conditioning on \(M(j_0, s_0)\) yields \[ {\cal L}(s_0 X_{j_0}^TY | s_0 X_{j_0}^TY \geq M(j_0, s_0), M(j_0, s_0)) \overset{D}{=} \text{TN}_{[M(j_0, s_0),\infty)}(s_0X_{j_0}^TX\beta, \sigma^2)\]
First example of polyhedral lemma.
Standard CDF transform (assuming \(\sigma^2=1\)) implies for every \((j_0, s_0)\), conditional on the event \(E_{j_0, s_0}\) \[ \frac{1 - \Phi(s_0 X_{j_0}^T(Y - X\beta))}{1 - \Phi(M(j_0, s_0) - s_0 X_{j_0}^TX\beta)} \sim \text{Unif}(0,1). \]
Under \(H_0\), for every \((j_0, s_0)\) on the event \(E_{j_0,s_0}\) we have \[ \frac{1 - \Phi(s_0 X_{j_0}^TY)}{1 - \Phi(M(j_0, s_0))} \sim \text{Unif}(0,1). \]
Hence, marginally as well we have \(\text{Unif}(0,1)\) under \(H_0\).
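When \(X^TX=I\), \(M(j^*, s^*)\) reduces to the second-largest \(|X_j^TY|\), so the uniformity of the survival-function ratio is easy to check by simulation. A Python sketch (numpy/scipy assumed, for illustration only):

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(1)
p, nsim = 50, 2000

pvals = np.empty(nsim)
for i in range(nsim):
    Z = np.abs(rng.standard_normal(p))   # |X_j^T Y| under H_0, X^T X = I
    lam1, lam2 = np.sort(Z)[-2:][::-1]   # two largest absolute values
    # (1 - Phi(lambda_1)) / (1 - Phi(lambda_2))
    pvals[i] = norm.sf(lam1) / norm.sf(lam2)
```

A Kolmogorov–Smirnov test against Unif(0,1) should not reject here.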
Suppose \(Z \sim N(0, 1)\); then, as \(t \to \infty\), \[ {\cal L}(t(Z - t) | Z > t) \overset{D}{\to} \text{Exp}(1). \]
Applying this to our setting, each conditional distribution satisfies (with \(\sigma^2=1\)) \[ \lambda_2 ( \lambda_1 - \lambda_2) | (j^*,s^*)=(j_0, s_0), \lambda_2 \to \text{Exp}(1). \]
As \(\lambda_1 / \lambda_2 \to 1\) under \(H_0\), Slutsky’s theorem implies the same result holds for \[ \lambda_1 ( \lambda_1 - \lambda_2) | (j^*,s^*)=(j_0, s_0), \lambda_2 \to \text{Exp}(1). \]
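A quick sanity check of this limit in the orthogonal case (Python/numpy as an illustration; convergence is slow, on the order of \(1/\log p\), so \(p\) is taken large):

```python
import numpy as np

rng = np.random.default_rng(2)
p, nsim = 5000, 2000

stats = np.empty(nsim)
for i in range(nsim):
    Z = np.abs(rng.standard_normal(p))   # orthogonal design, H_0, sigma^2 = 1
    lam1, lam2 = np.sort(Z)[-2:][::-1]   # lambda_1, lambda_2
    stats[i] = lam1 * (lam1 - lam2)      # covariance-test-style statistic
```

The sample mean should be near 1, the mean of Exp(1), though not exactly 1 at finite \(p\).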
The later \(\lambda\)’s have a similar form (though making this precise requires an expert in LARS!) \[ \lambda_{k+1} \approx \max_{j \not \in A_k} \frac{X_j^T(I-P_{A_k})Y}{\dots} \]
Again fixing \((A_k, s_{A_k})\) one can express \(\lambda_k, \lambda_{k+1}\)’s as particular Gaussians…
The formula for \(TN(\dots)\) above is true for all of \({\cal M}\)!
Under \(H_0:\beta=0\), \(M(j_0, s_0)\) is just the maximum of some collection of centered normal variables.
We could just use a parametric bootstrap to sample from its distribution.
This is essentially the same test of \(H_0\) as one based on \[ {\cal L}_0(\|X^TY\|_{\infty}), \] which would be like PoSI on 1-sparse models…
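A sketch of this parametric bootstrap for the null law of \(\|X^TY\|_\infty\) with a general (non-orthogonal) design — Python/numpy used purely for illustration, with all names invented here:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit-norm (but not orthogonal) columns

# Parametric bootstrap of L_0(||X^T Y||_inf) under H_0: beta = 0, sigma^2 = 1
B = 2000
boot = np.array([np.max(np.abs(X.T @ rng.standard_normal(n))) for _ in range(B)])

Y = rng.standard_normal(n)            # observed data (here actually drawn from H_0)
lam1 = np.max(np.abs(X.T @ Y))
pval = np.mean(boot >= lam1)          # bootstrap p-value for H_0
```

No orthogonality is needed: the bootstrap handles arbitrary column correlations in \(X\).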
Under \(\beta \neq 0\), the law of \(M(j_0, s_0)\) depends in some way on \(\beta\).
We could try plugging in \(\hat{\beta}\)… will this work?
Generally, having seen \(j^*\) we could think of testing \[ H_{j^*,0}: X_{j^*}^TX\beta=0 \]
Or, for confidence intervals \[ H_{j^*,\delta}: X_{j^*}^TX\beta=\delta. \]
Assuming \(X^TX=I\) and \(p=100\), \(\sigma^2=1\) try plugging in \(\widehat{\beta}\) to calibrate the law of \[ s^*X_{j^*}^TY | (j^*,s^*)=(j_0, s_0) \] to test e.g. \(H_{j^*,0}\) or \(H_{j^*,\delta}\) for various true values of \(\beta\) (including 0). Does this work well? What do you plugin for the value of \(\beta_{j^*}\) under \(H_{j^*,\delta}\)?
Suppose \((j^*,s^*)=(1,+1)\) and \(\beta=(\delta, 1, \dots, 1)\) (unbeknownst to you, of course). Try plugging in beta[2:p]=0 and testing \(H_{j^*,\delta}:\delta=1\) without conditioning on \(M(1,+1)\), instead simulating its distribution (again keeping \(p=100\)). Do you expect this to work?
library(lars)  # provides lars()

simulate = function(n=1000, p=20, plot=FALSE, s=0, k=1) {
    X = matrix(rnorm(n*p), n, p)
    X = scale(X, TRUE, TRUE)
    beta = rep(0, p)
    if (s > 0) {
        beta[1:s] = (c(1:s)+4) / sqrt(n)   # s nonzero coefficients
    }
    Y = rnorm(n) + X %*% beta
    L = lars(X, Y)
    if (plot) {
        plot(L)
    }
    L = L$lambda
    return(list(L1=L[k],                                  # lambda_k
                L2=L[k+1],                                # lambda_{k+1}
                E=(L[k]-L[k+1])*L[k],                     # lambda_k(lambda_k - lambda_{k+1})
                P=pnorm(L[k], lower.tail=FALSE)/pnorm(L[k+1], lower.tail=FALSE),
                B=min(2 * (p - k + 1) * pnorm(L[k], lower.tail=FALSE), 1)))
}

## $L1
## [1] 1.873002
##
## $L2
## [1] 1.68958
##
## $E
## [1] 0.3435503
##
## $P
## [1] 0.6702797
##
## $B
## [1] 1
## $L1
## [1] 8.105509
##
## $L2
## [1] 6.262178
##
## $E
## [1] 14.94114
##
## $P
## [1] 1.383557e-06
##
## $B
## [1] 1.050501e-14
D = data.frame(simulate(s=3, k=2))
for (i in 1:500) {
    result = simulate(s=3, k=2)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=3))
for (i in 1:500) {
    result = simulate(s=3, k=3)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=4))
for (i in 1:500) {
    result = simulate(s=3, k=4)
    D = rbind(D, result)
}

D = data.frame(simulate(s=3, k=5))
for (i in 1:500) {
    result = simulate(s=3, k=5)
    D = rbind(D, result)
}
##
## Call:
## larInf(obj = L)
##
## Standard deviation of noise (specified or estimated) sigma = 4.745
##
## Sequential testing results with alpha = 0.100
##
## Step Var Coef Z-score P-value LowConfPt UpConfPt Spacing CovTest
## 1 13 -0.950 -32.129 0.000 -0.999 -0.901 0.000 0.000
## 2 6 5.095 13.383 0.000 4.463 5.791 0.000 0.000
## 3 11 -0.931 -8.718 0.000 -1.108 -0.746 0.000 0.000
## 4 12 0.010 4.160 0.042 0.001 0.018 0.042 0.037
## 5 4 3.320 3.955 0.012 1.071 6.175 0.012 0.008
## 6 1 -0.037 -1.272 0.617 -0.172 0.377 0.617 0.766
## 7 8 -0.597 -4.937 0.144 -1.911 0.406 0.018 0.015
## 8 5 -16.223 -5.111 0.270 -35.591 23.349 0.001 0.000
## 9 2 0.042 3.189 0.150 -0.029 0.114 0.041 0.032
## 10 3 -0.046 -0.837 0.800 -0.475 2.814 0.856 0.953
## 11 9 0.135 3.310 0.418 -0.571 0.614 0.001 0.175
## 12 10 -0.012 -3.280 0.187 -0.063 0.019 0.001 0.000
## 13 7 0.001 0.052 0.466 -0.735 0.854 0.956 0.997
## LowTailArea UpTailArea
## 0.049 0.049
## 0.048 0.049
## 0.049 0.049
## 0.049 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.000
## 0.050 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.050
## 0.050 0.000
## 0.050 0.050
##
## Estimated stopping point from ForwardStop rule = 5
Introduced the univariate \(TN\) (truncated normal) on \(R_c = (-\infty, -c] \cup [c,\infty)\) for inference.
Model: \(Z \sim N(\mu, \sigma^2)\) truncated to \(R_c\).
For simplicity let’s work with \(\sigma^2=1\)
CDF: \[ \begin{aligned} F^c_{\mu}(Z \leq t) &= F^c_{\mu, 1}(Z \leq t) \\ &= \begin{cases} \frac{\Phi(t - \mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & t < -c \\ \frac{\Phi(-c - \mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & -c \leq t \leq c \\ \frac{\Phi(t - \mu) - \Phi(c-\mu) + \Phi(-c-\mu)}{\Phi(-c-\mu) + (1 - \Phi(c - \mu))} & t > c \end{cases} \end{aligned} \]
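A direct transcription of this CDF into code (Python/scipy as an illustration; \(\sigma^2=1\), and the function name tn_cdf is invented here):

```python
from scipy.stats import norm

def tn_cdf(t, mu, c):
    """CDF of N(mu, 1) truncated to (-inf, -c] U [c, inf)."""
    denom = norm.cdf(-c - mu) + norm.sf(c - mu)   # total truncated mass
    if t < -c:
        num = norm.cdf(t - mu)
    elif t <= c:                                  # flat on the excluded region (-c, c)
        num = norm.cdf(-c - mu)
    else:
        num = norm.cdf(t - mu) - norm.cdf(c - mu) + norm.cdf(-c - mu)
    return num / denom
```

For \(\mu = 0\) the symmetry of \(R_c\) forces \(F^c_0(c) = 1/2\), a useful spot check.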
More succinctly, the model \[ {\cal M}^*_c = \left\{P^c_{\mu}: \mu \in \mathbb{R} \right\} \] is an exponential family on \(\mathbb{R}\) with sufficient statistic \(T(z)=z\) and reference measure with Lebesgue density \[ z \mapsto e^{-z^2/2} 1_{R_c}(z) \]
Likelihood \[ -\log L^c(\mu|Z) = \frac{1}{2}(Z-\mu)^2 + \log\left(\Phi(-c-\mu) + 1 - \Phi(c-\mu)\right) \]
MLE (a convex problem – why?) \[ Z = \hat{\mu}(Z) + \frac{\phi(c-\hat{\mu}(Z)) - \phi(-c-\hat{\mu}(Z))}{\Phi(-c-\hat{\mu}(Z)) + 1 - \Phi(c-\hat{\mu}(Z))} \]
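The stationarity equation can be solved numerically by root finding on the score (the derivative of the log-likelihood in \(\mu\)); a Python sketch using scipy's brentq, with an ad hoc bracket width and an invented function name:

```python
from scipy.optimize import brentq
from scipy.stats import norm

def tn_mle(Z, c):
    """MLE of mu for N(mu, 1) truncated to R_c, via the score equation."""
    def score(mu):
        denom = norm.cdf(-c - mu) + norm.sf(c - mu)
        # score = (Z - mu) minus the truncated-mean shift; zero at the MLE
        return (Z - mu) - (norm.pdf(c - mu) - norm.pdf(-c - mu)) / denom
    # convexity of the negative log-likelihood makes the root unique
    return brentq(score, Z - 20, Z + 20)
```

For an observation beyond the truncation point (e.g. \(Z > c > 0\)) the MLE shrinks toward 0, since truncation inflates the observed magnitude.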
Testing hypotheses \[ H_0: \mu=\mu_0 \]
Use \(p\)-value \[ F^c_{\mu_0}(Z) \overset{H_0}{\sim} \text{Unif}(0,1) \]
\(100(1-\alpha)\%\) confidence interval \[ \left\{\mu: F^c_{\mu}(Z) \in [\alpha/2, 1-\alpha/2]\right\} \]
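Since \(F^c_\mu(Z)\) is decreasing in \(\mu\) (monotone likelihood ratio in this exponential family), each endpoint can be found by one-dimensional root finding. A Python sketch (scipy assumed; the CDF is restated so the block is self-contained, and the bracket width is ad hoc):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def tn_cdf(t, mu, c):
    """CDF of N(mu, 1) truncated to (-inf, -c] U [c, inf)."""
    denom = norm.cdf(-c - mu) + norm.sf(c - mu)
    if t < -c:
        num = norm.cdf(t - mu)
    elif t <= c:
        num = norm.cdf(-c - mu)
    else:
        num = norm.cdf(t - mu) - norm.cdf(c - mu) + norm.cdf(-c - mu)
    return num / denom

def tn_interval(Z, c, alpha=0.05):
    """Equal-tailed interval {mu : F^c_mu(Z) in [alpha/2, 1-alpha/2]}."""
    # F^c_mu(Z) decreases in mu, so each tail inverts to a single root
    lo = brentq(lambda mu: tn_cdf(Z, mu, c) - (1 - alpha/2), Z - 20, Z + 20)
    hi = brentq(lambda mu: tn_cdf(Z, mu, c) - alpha/2, Z - 20, Z + 20)
    return lo, hi
```

At the returned endpoints the CDF evaluates to \(1-\alpha/2\) and \(\alpha/2\) respectively, which is exactly the defining property of the interval.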
For various values of \(c\), plot the 95% confidence intervals of Zhong and Prentice with the data \(Z\) on the \(x\)-axis. Repeat the procedure with a single half-open interval as truncation region: \([c,\infty)\). What is the most striking thing about these intervals? Based on the intervals, compare the power of one-sided tests of \(H_0:\mu=0\) using the single half-open interval to the power of a two-sided test with the truncation region \(R_c\).