We saw that data splitting followed by full conditional inference (data carving) dominates data splitting alone (at least in the file drawer problem).
It is also better behaved than using all of the data for selection: e.g. the MLE behaves classically even though the Fisher information is not constant.
Why not apply this to regression?
Let’s assume that our selection method in regression is a function of \(n^{-1}X^TY\): \[ {\cal Q}(n^{-1}X^TY) \]
Data splitting with \(n_1+n_2=n\) evaluates \[ {\cal Q}(n_1^{-1}X_1^TY_1) = {\cal Q}(n^{-1}X^TY+ \Delta_n) \] where \(\Delta_n\) satisfies \(E[\Delta_n]=0\) and is (asymptotically) independent of \(n^{-1}X^TY\).
The experiment can be represented in terms of \((n^{-1}X^TY, \Delta_n)\), which is asymptotically Gaussian.
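A quick simulation (a sketch on a synthetic Gaussian instance, not the selectinf pipeline) illustrates the decomposition: over repeated draws and random splits, \(\Delta = n_1^{-1}X_1^TY_1 - n^{-1}X^TY\) averages to zero.

```python
import numpy as np

# Sketch: check that Delta = n1^{-1} X_1^T Y_1 - n^{-1} X^T Y has mean ~0
# over repeated draws and random splits (synthetic Gaussian instance).
rng = np.random.default_rng(0)
n, n1, p = 1000, 600, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 2.0

reps = 2000
deltas = np.empty((reps, p))
for i in range(reps):
    Y = X @ beta + rng.standard_normal(n)
    split = rng.permutation(n)[:n1]          # first-stage indices
    deltas[i] = X[split].T @ Y[split] / n1 - X.T @ Y / n

print(np.abs(deltas.mean(0)).max())          # close to 0
```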
Recall our general restricted Gaussian model for a query value \(q\) \[ \pi_q(t,n) = P_F\left( a_q \hat{\theta}_q + b_q {\cal N}_q + c_q \Delta_q \in A_q, \hat{\theta}_q=t, {\cal N}_q=n\right) \]
In this case \[ \pi_q(t,n) = P_F\left( {\cal Q}(n^{-1}X^TY + \Delta) = q \vert \hat{\theta}=t, {\cal N}=n\right) \]
Expressible as an integral over \(\Delta_q\) when the mean of our Gaussian is constrained.
Consider the KKT conditions of the 1st stage for \((E,s_E)\): \[ -\nabla \ell_1(\hat{\beta}_{1,E}) = n_1^{-1}X_1^TY_1 - n_1^{-1}X_1^TX_1 \hat{\beta}_{1,E} = \frac{\lambda}{n} \hat{u}_{1,E}. \]
Recall these were expressible in terms of \((\bar{\beta}_{1,E}, \nabla \ell_1(\bar{\beta}_{1,E}))\).
Note that \[ \begin{pmatrix} \bar{\beta}_{1,E} \\ \nabla \ell_1(\bar{\beta}_{1,E}) \end{pmatrix} = \begin{pmatrix} \bar{\beta}_{E} \\ \nabla \ell(\bar{\beta}_{E}) \end{pmatrix} + L\Delta \]
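As a numerical sanity check of the KKT conditions, here is a sketch using scikit-learn's `Lasso` on synthetic first-stage data (not part of selectinf; sklearn's `alpha` corresponds to a penalty of \(\alpha\|\beta\|_1\) on the averaged squared-error loss):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: verify the lasso KKT conditions numerically on synthetic
# first-stage data (sklearn, not the selectinf pipeline).
rng = np.random.default_rng(1)
n1, p = 200, 10
X1 = rng.standard_normal((n1, p))
Y1 = 2.0 * X1[:, 0] + rng.standard_normal(n1)

lam = 20.0
# sklearn minimizes (1/(2 n1)) ||Y1 - X1 b||^2 + alpha ||b||_1,
# so alpha = lam / n1 gives the stationarity condition
# n1^{-1} X1^T (Y1 - X1 beta_hat) = (lam / n1) u_hat.
fit = Lasso(alpha=lam / n1, fit_intercept=False, tol=1e-12,
            max_iter=100000).fit(X1, Y1)
beta_hat = fit.coef_

# Recover the subgradient u_hat and check it against (E, s_E)
u_hat = X1.T @ (Y1 - X1 @ beta_hat) / lam
E = beta_hat != 0
active_ok = np.allclose(u_hat[E], np.sign(beta_hat[E]), atol=1e-4)
inactive_ok = np.all(np.abs(u_hat[~E]) <= 1 + 1e-6)
print(active_ok, inactive_ok)
```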
To construct a proper reference distribution we must integrate out \(\Delta\).
Could choose to condition on more if we want…
Recall that because we have solved the LASSO \[ (\hat{\beta}, \hat{u}) \in \left\{(\beta, u): u \in \partial \|\cdot\|_1(\beta)\right\} = N({\cal P}) \] with \({\cal P}(\beta) = \|\beta\|_1\).
Integrals over \(\Delta\) can be expressed as integrals over pieces of \(N({\cal P})\).
Fixing \((E,s_E)\) restricts to a subset of an affine subspace of dimension \(p\) in \(\mathbb{R}^p \times \mathbb{R}^p\).
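As a concrete check, for \(p=1\) the set \(N({\cal P})\) decomposes into three pieces: \[ N(\|\cdot\|_1) = \{(\beta, 1): \beta > 0\} \cup \{(\beta, -1): \beta < 0\} \cup \{(0, u): |u| \le 1\}, \] and fixing \((E,s_E)\) picks out exactly one of these pieces.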
Explicit form of reference density for whatever model we put on \(X^TY\).
Formally \[ \frac{dF^*}{dF}(X^TY) \propto \int_{(\beta, u) \in N({\cal P};(E,s_E))} g_{\Delta} \left(\frac{1}{n}(X^TX\beta - X^TY) + \frac{\lambda}{n}u\right) {\cal H}_p(d\beta \; du) \]
It is easy to condition on \(u\), making this an \(|E|\)-dimensional integral.
Sampling is still non-trivial – especially for confidence intervals.
There is a tractable numerical approximation (to be discussed later).
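As a naive illustration of that \(|E|\)-dimensional integral (a hypothetical helper, assuming a Gaussian \(g_\Delta\) and a fixed subgradient \(u\); an exponential importance-sampling proposal keeps \(\beta_E\) in the orthant with signs \(s_E\)):

```python
import numpy as np
from scipy.stats import expon, multivariate_normal

def mc_reference_weight(XTY, XTX, lam, n, E, s_E, u, cov_delta,
                        nsamp=2000, scale=1.0, rng=None):
    """Naive importance-sampling estimate of the reference-density
    integral over beta_E (signs fixed at s_E), taking g_Delta Gaussian.
    Hypothetical helper for illustration, not part of selectinf."""
    rng = rng or np.random.default_rng(0)
    p = XTX.shape[0]
    g = multivariate_normal(mean=np.zeros(p), cov=cov_delta)
    # exponential proposal on |beta_E|, reflected to the s_E orthant
    Z = expon.rvs(scale=scale, size=(nsamp, len(E)), random_state=rng)
    log_q = expon.logpdf(Z, scale=scale).sum(axis=1)
    log_w = np.empty(nsamp)
    for k in range(nsamp):
        beta = np.zeros(p)
        beta[E] = s_E * Z[k]
        arg = (XTX @ beta - XTY) / n + (lam / n) * u
        log_w[k] = g.logpdf(arg) - log_q[k]
    m = log_w.max()
    return np.exp(m) * np.mean(np.exp(log_w - m))

# toy usage on a tiny synthetic instance
rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
E, s_E = np.array([0]), np.array([1.0])
u = np.zeros(p); u[0] = 1.0; u[1:] = 0.1
w = mc_reference_weight(X.T @ Y, X.T @ X, lam=10.0, n=n,
                        E=E, s_E=s_E, u=u, cov_delta=np.eye(p))
print(w > 0)  # a finite positive weight
```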
import numpy as np, pandas as pd, seaborn as sns
from selectinf.randomized import lasso
from selectinf.tests.instance import gaussian_instance
from selectinf.learning.utils import (split_partial_model_inference,
                                      split_full_model_inference)
import matplotlib.pyplot as plt
import statsmodels.api as sm
ECDF = sm.distributions.ECDF
# %matplotlib inline
## MLE true target cover SE ... split_lower split_upper target variable
## 0 -1.276152 0.333526 True 1.228676 ... -3.388519 1.326546 0.214730 2
## 1 -0.720044 -0.139625 True 1.128729 ... -2.411866 2.266531 -0.056667 4
## 2 0.284744 0.225595 True 1.301743 ... -1.166275 3.973852 0.333505 8
## 3 3.840087 3.043339 True 1.114823 ... 1.022116 5.892335 2.918780 9
## 4 2.105436 3.162512 True 1.241932 ... 1.267002 5.862226 3.002715 11
## 5 4.100433 2.928182 True 1.043039 ... 2.042312 6.637502 3.178533 15
## 6 3.932669 2.997871 True 1.092015 ... -0.769532 4.300592 2.841684 17
## 7 0.875019 0.018633 True 1.266655 ... -0.591187 4.607814 0.073290 18
## 8 0.546038 0.053838 True 1.096639 ... -3.405330 0.975307 0.110363 21
## 9 3.210836 3.059558 True 1.155215 ... 2.161504 6.943532 3.051207 27
## 10 4.129639 2.777719 True 1.057458 ... 1.888165 6.986781 2.785616 28
## 11 2.370937 3.011855 True 1.134471 ... 0.178898 4.659832 2.869338 32
## 12 3.885217 2.792739 True 1.021503 ... 2.360765 6.885176 2.766129 34
## 13 3.410596 3.004038 True 1.049087 ... 1.370267 5.968115 3.134744 38
## 14 0.111909 0.230520 True 1.189844 ... -4.015998 0.624091 0.155777 40
## 15 0.864049 -0.065623 True 1.062754 ... -2.826887 1.802275 -0.118068 42
## 16 1.241298 0.625545 True 1.225506 ... 0.413445 5.157180 0.478685 45
##
## [17 rows x 18 columns]
lengths = pd.DataFrame({'split': results['split_upper'] - results['split_lower'],
                        'carve': results['upper'] - results['lower']})
ax = sns.distplot(lengths['split'], label='Split')
sns.distplot(lengths['carve'], ax=ax, label='Carve')
## <matplotlib.axes._subplots.AxesSubplot object at 0x11b5e7210>
## <matplotlib.legend.Legend object at 0x1c233f3490>
## (0.9259818731117825, 0.8897280966767371)
results0 = results.loc[lambda df: df['true target'] == 0]
print(np.mean(results0['cover']), np.mean(results0['split_coverage']))
## 0.9090909090909091 0.8181818181818182
## 0.0 0.0
resultsA = results.loc[lambda df: df['true target'] != 0]
print(np.mean(resultsA['cover']), np.mean(resultsA['split_coverage']))
## 0.9262672811059908 0.890937019969278
## 0.45161290322580644 0.34715821812596004