Jonathan Taylor
Spring 2020
What might a Bayesian do under selection?
Do Bayesians even have to worry about selection?
Suppose \(D \sim N(\mu, 1)\) with a prior \[ \pi(\mu) = N(0, \lambda^{-1}) \] (so \(\lambda\) is the prior precision)
Posterior \[ \mu | D \sim N\left(\frac{D}{1+\lambda}, \frac{1}{1+\lambda}\right) \]
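The conjugate-posterior formula can be spot-checked numerically on a grid (a sketch; the values of \(D\) and \(\lambda\) are arbitrary, and \(\lambda\) is treated as the prior precision as above):

```python
import numpy as np

# numerical check of the conjugate posterior N(D/(1+lam), 1/(1+lam)),
# treating lam as the prior precision
lam, D = 0.2, 1.5
grid = np.linspace(-10, 10, 200001)
log_post = -0.5 * (D - grid)**2 - 0.5 * lam * grid**2
post = np.exp(log_post - log_post.max())
post /= post.sum()
mean = (grid * post).sum()
var = ((grid - mean)**2 * post).sum()
print(mean, D / (1 + lam))   # both approximately 1.25
print(var, 1 / (1 + lam))    # both approximately 0.833
```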
What if we only report posterior intervals when \(D > 2\)?
import numpy as np
from scipy.stats import norm as normal_dbn
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# %matplotlib inline
ECDF = sm.distributions.ECDF
def pivot_plot(pivot):
    # compare the empirical CDF of the pivot to the Uniform(0,1) CDF
    G = np.linspace(0, 1, 101)
    fig = plt.figure()
    ax = fig.gca()
    ax.plot(G, ECDF(pivot)(G), 'r', linewidth=4)
    ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlabel('pivot', fontsize=20)
    ax.set_ylabel('ECDF(pivot)', fontsize=20)
    return ax.figure

def simulate(truth,
             precision,
             threshold=2,
             level=0.9):
    # (truth=\mu, precision=\lambda)
    C = normal_dbn.ppf((1 + level) / 2)
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            # unadjusted posterior interval, reported only after selection
            L = D / (1 + prec) - C / np.sqrt(1 + prec)
            U = D / (1 + prec) + C / np.sqrt(1 + prec)
            break
    return L, U, (L < truth) * (U > truth)
simulate(0, 0.2)
## (0.2217172670177403, 3.22479538460277, False)
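The effect of selection on coverage can be seen by averaging the coverage indicator over repeated selected draws (a self-contained sketch mirroring `simulate` above; the truth value \(-1\) is illustrative):

```python
import numpy as np
from scipy.stats import norm as normal_dbn

def simulate(truth, precision, threshold=2, level=0.9):
    # unadjusted posterior interval, reported only when D > threshold
    C = normal_dbn.ppf((1 + level) / 2)
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            L = D / (1 + precision) - C / np.sqrt(1 + precision)
            U = D / (1 + precision) + C / np.sqrt(1 + precision)
            return L, U, (L < truth) * (U > truth)

cover = [simulate(-1, 0.2)[2] for _ in range(2000)]
print(np.mean(cover))  # far below the nominal 0.9
```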
We were constructing intervals above using prior \(\pi\) and likelihood \[ \frac{\exp(-\frac{1}{2}(D-\mu)^2)}{\sqrt{2 \pi}} = \phi(D-\mu) \]
Likelihood does not reflect selection.
As we have done for frequentist analysis, we can use selective likelihood \[ \frac{\phi(D - \mu)}{1 - \Phi(2 - \mu)} \]
This approach was suggested by Yekutieli (2011) for one-dimensional problems like the file drawer.
In this case we need what we called \(\ell^*\) in asymptotics – the \(\Phi\)-adjusted \(\pi\) \[ \ell^*_{\Phi_{\mu}}(t,n) = \frac{\pi(t,n)}{E_{\Phi_{\mu}}[\pi(t,n)]}. \]
For the file drawer problem this has closed form \(1-\Phi(2-\mu)\); what about the LASSO?
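In one dimension the \(\Phi\)-adjusted posterior can be formed directly on a grid, without MCMC (a sketch; `adjusted_posterior` is a hypothetical helper and the grid limits are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def adjusted_posterior(D, precision, threshold=2):
    # posterior proportional to [phi(D - mu) / (1 - Phi(threshold - mu))] * prior
    mu = np.linspace(-15, 15, 60001)
    log_lik = norm.logpdf(D - mu) - norm.logsf(threshold - mu)
    log_prior = -0.5 * precision * mu**2
    w = np.exp(log_lik + log_prior - (log_lik + log_prior).max())
    return mu, w / w.sum()

mu, post = adjusted_posterior(2.5, 0.2)
# the adjustment upweights small mu, pulling the posterior mean below
# the unadjusted value D / (1 + precision)
print((mu * post).sum(), 2.5 / 1.2)
```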
# nonrandomized selection here
def simulate_corrected(truth,
                       precision,
                       threshold=2,
                       level=0.9,
                       ndraw=5000,
                       burnin=1000,
                       proposal_sd=0.5):
    # (truth=\mu, precision=\lambda)
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            def log_post(mu):
                # log of selection-adjusted likelihood plus log prior
                if threshold - mu < 30:
                    value = np.log(normal_dbn.sf(threshold - mu))
                else:
                    # Mills-ratio approximation in the far tail
                    value = - 0.5 * (threshold - mu)**2 - np.log(threshold - mu)
                return -(0.5 * (D - mu)**2 +
                         value +
                         0.5 * prec * mu**2)
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                try:
                    num = log_post(proposal)
                except Exception:
                    num = -np.inf
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_corrected(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-4.293642277299732, 1.8099147333531849, True)
coverage = []
for _ in range(50):
    coverage.append(simulate_corrected(-1, 0.2))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 1.0
coverage = []
for _ in range(50):
    coverage.append(simulate_corrected(-1, 0.01))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.94
In our discussion of asymptotics we used Chernoff’s bound / approximation \[ -\log P_{\mu}[D + \omega \in A] \approx \inf_{(d,\omega) \in A} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2 \] where \[ \begin{pmatrix} D \\ \omega \end{pmatrix} \sim N \left( \begin{pmatrix} \mu \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & 0 \\ 0 & I \end{pmatrix} \right) \]
Specifically we looked at \[ (d^*(\mu), \omega^*(\mu)) = \text{argmin}_{(d,\omega) \in A} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2 \] as a proxy for where \(\Phi_{\mu}^*\) concentrates…
This bound can be used as a proxy for \(\ell^*_{\Phi_{\mu}}\)
The approximate negative log-likelihood takes the form \[ \frac{1}{2}(D-\mu)^T\Sigma^{-1}(D-\mu) - \inf_{(d,\omega) \in A} \left[\frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2} \|\omega\|^2_2\right] \]
With no randomization (\(\omega \equiv 0\)) in the file drawer, this yields negative log-likelihood \[ \frac{1}{2}(D-\mu)^2 - \inf_{d \geq 2} \frac{1}{2}(d-\mu)^2 = \begin{cases} \frac{1}{2}(D-\mu)^2 & \mu > 2 \\ \frac{1}{2}D^2 - (D-2) \mu + C & \mu \leq 2 \end{cases} \]
As we report only when \(D>2\), the adjustment vanishes near \(\hat{\mu}=D>2\), so the posterior analysis is essentially unchanged.
“MLE” is just \(D\) here
Fisher information is just usual information
\(\implies\) this is not a good enough proxy…
Suppose we had split data 50/50 before peeking at file drawer
Associated problem \[ \text{argmin}_{(d,\omega): d+\omega>2} \frac{1}{2}(d-\mu)^2 + \frac{1}{2}\omega^2 \]
Not hard to check that \[ \inf_{(d,\omega): d+\omega>2} \frac{1}{2}(d-\mu)^2 + \frac{1}{2}\omega^2 = \frac{1}{2} \max\left(-(\mu-2)/\sqrt{2}, 0\right)^2 \]
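The closed form can be checked numerically (a sketch using `scipy.optimize.minimize`; the value \(\mu=-1\) is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

mu = -1.0
# minimize (d - mu)^2/2 + omega^2/2 subject to d + omega > 2
res = minimize(lambda x: 0.5 * (x[0] - mu)**2 + 0.5 * x[1]**2,
               x0=[3.0, 0.0],
               constraints=[{'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 2}])
print(res.fun, max(2 - mu, 0)**2 / 4)  # both approximately 2.25
```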
Negative log-likelihood \[ \frac{1}{2}(D-\mu)^2 - \begin{cases} 0 & \mu > 2 \\ \frac{1}{4}(\mu-2)^2 & \mu \leq 2 \end{cases} \]
Solving for MLE yields \[ \hat{\mu} = \begin{cases} D & D > 2 \\ 2 D - 2 & D \leq 2 \end{cases} \]
What about observed information? \[ \nabla^2 \Lambda(\hat{\mu}) = \begin{cases} 1 & D > 2 \\ 2 & D \leq 2 \end{cases} \]
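The piecewise MLE can be checked by minimizing the data-splitting negative log-likelihood on a grid (a sketch; \(D=1.2\) is illustrative):

```python
import numpy as np

D = 1.2
mu = np.linspace(-10, 10, 400001)
# data-splitting negative log-likelihood from above
nll = 0.5 * (D - mu)**2 - 0.25 * (mu - 2)**2 * (mu <= 2)
print(mu[np.argmin(nll)], 2 * D - 2)  # both approximately 0.4
```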
Likelihood inference here returns either unadjusted intervals or something like data-splitting intervals…
# randomized selection here
def simulate_chernoff_freq(truth,
                           threshold=2,
                           level=0.9):
    # truth=\mu
    C = normal_dbn.ppf((1 + level) / 2)
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if D + omega > threshold:
            if D > threshold:
                # unadjusted MLE and interval
                L = D - C
                U = D + C
            else:
                # data-splitting MLE 2D - threshold with variance 2
                L = 2 * D - threshold - C * np.sqrt(2)
                U = 2 * D - threshold + C * np.sqrt(2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_chernoff_freq(0)
## (-2.104403012045317, 2.547945602661378, True)
coverage = []
for _ in range(1000):
    coverage.append(simulate_chernoff_freq(-1))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.825
fig = plt.figure(figsize=(8, 8))
ax = plt.gca()
Zval = np.linspace(-10, 10, 501)
mle_val = np.linspace(-30, 30, 501)
chernoff = Zval * (Zval > 2) + (2 * Zval - 2) * (Zval <= 2)
# exact selective MLE: D as a function of mu via a Mills-ratio adjustment
Z_mle = mle_val + ((normal_dbn.pdf((2 - mle_val) / np.sqrt(2)) / np.sqrt(2)) /
                   normal_dbn.sf((2 - mle_val) / np.sqrt(2)))
ax.plot(Zval, chernoff, 'b', linewidth=3, label='Chernoff')
ax.plot(Z_mle, mle_val, 'r', linewidth=3, label='MLE')
ax.set_xlabel('D', fontsize=20)
ax.set_ylabel(r'$\hat{\mu}(D)$', fontsize=20)
ax.set_xlim([-1, 5])
ax.plot(Zval, Zval, 'k--')
ax.set_ylim([-5, 6])
ax.legend()
# randomized selection here
def simulate_chernoff_bayesian(truth,
                               precision,
                               threshold=2,
                               level=0.9,
                               ndraw=5000,
                               burnin=1000,
                               proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D + np.random.standard_normal() > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                # Chernoff-adjusted likelihood: correction active for mu <= threshold
                return -(0.5 * (D - mu)**2 -
                         0.25 * (mu - threshold)**2 * (mu <= threshold) +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_chernoff_bayesian(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.45454562713953, 0.4754887298734898, True)
coverage = []
for _ in range(1000):
    coverage.append(simulate_chernoff_bayesian(1, 0.0001))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.801
Our optimization problem can be written as \[ \text{minimize}_{d,\omega} \frac{1}{2}(d-\mu)^T\Sigma^{-1}(d-\mu) + \frac{1}{2}\| \omega\|^2 + I_A(d, \omega) \] where \(I_A = - \log(1_A)\) is the log-indicator.
When \[A=\left\{(d,\omega): L_A \begin{pmatrix}d \\ \omega \end{pmatrix} \leq b_A\right\}\] is affine with \(L_A \in \mathbb{R}^{m \times (p + p)}\), we might approximate \(I_A\) by \[ z \mapsto \sum_{i=1}^m -\log(b_A[i]-L_A[i]^Tz) \] or some other smooth barrier.
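A minimal sketch of such a barrier for an affine region (here encoding the file drawer constraint \(d > 2\) as \(-d \le -2\); the names `log_barrier`, `L_A`, `b_A` are illustrative):

```python
import numpy as np

def log_barrier(z, L_A, b_A):
    # smooth surrogate for I_A on {z : L_A z <= b_A}
    slack = b_A - L_A.dot(z)
    if np.any(slack <= 0):
        return np.inf          # outside the region
    return -np.sum(np.log(slack))

L_A = np.array([[-1.0]])       # -d <= -2, i.e. d >= 2
b_A = np.array([-2.0])
print(log_barrier(np.array([3.0]), L_A, b_A))   # -log(1) = 0.0
print(log_barrier(np.array([1.0]), L_A, b_A))   # inf: constraint violated
```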
For the unrandomized file drawer the approximation is \[ \inf_{d} \frac{1}{2}(d-\mu)^2 - \log(d - T) \]
For MLE, we must differentiate this with respect to \(\mu\).
A standard envelope argument says that the derivative of the optimal value (in the \(-d\mu\) parametrization) is the negative of the minimizer.
The minimizer satisfies \[ \begin{aligned} d^*(\mu) - \frac{1}{d^*(\mu)-T} &= \mu \\ d^*(\mu) &= \frac{T + \mu + \sqrt{(T+\mu)^2 - 4 (T \mu - 1)}}{2} \end{aligned} \] (taking the root with \(d^*(\mu) > T\)).
The overall MLE therefore solves \[ d^*(\hat{\mu}) = D, \qquad \text{equivalently} \qquad \hat{\mu} = D - \frac{1}{D-T} \]
def value(mu, threshold):
    # optimal value of the barrier-smoothed problem:
    # inf_d (d - mu)^2 / 2 - log(d - T)
    T = threshold
    d_star = ((T + mu) + np.sqrt((T + mu)**2 - 4 * (T * mu - 1))) / 2
    val = 0.5 * (d_star - mu)**2 - np.log(d_star - T)
    return val
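A quick check that the closed-form minimizer used in `value` satisfies the stationarity condition \(d^* - 1/(d^*-T) = \mu\) (the values of \(\mu\) and \(T\) are arbitrary):

```python
import numpy as np

mu, T = 0.5, 2.0
d_star = ((T + mu) + np.sqrt((T + mu)**2 - 4 * (T * mu - 1))) / 2
print(d_star)                          # the root with d_star > T
print(d_star - 1 / (d_star - T) - mu)  # approximately 0
```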
# nonrandomized selection here
def simulate_smooth_bayesian(truth,
                             precision,
                             threshold=2,
                             level=0.9,
                             ndraw=5000,
                             burnin=1000,
                             proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        if D > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                return -(0.5 * (D - mu)**2 -
                         value(mu, threshold) +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_smooth_bayesian(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.9963500352340864, 1.6283124834639633, True)
coverage = []
for _ in range(100):
    coverage.append(simulate_smooth_bayesian(0, 0.0001))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.67
Corresponding problem \[ \text{argmin}_{d, \omega} \frac{1}{2}(d-\mu)^2 + \frac{1}{2} \omega^2 - \log(d+\omega - 2) \]
Can be solved explicitly by change of variables \((u, v) = ((d-\omega)/\sqrt{2}, (d+\omega)/\sqrt{2})\)… messy but not impossible
def objective(soln, mu, threshold=2):
    d, omega = soln
    return 0.5 * ((d - mu)**2 + omega**2) - np.log(d + omega - threshold)

def gradient(soln, mu, threshold=2):
    d, omega = soln
    return np.array([d - mu, omega]) - 1. / (d + omega - threshold)

def value_rand(mu, threshold, initial=None, tol=1.e-5):
    # backtracking gradient descent on the randomized barrier problem
    soln = np.array([threshold + 1, 0])
    cur_val = objective(soln, mu, threshold=threshold)
    alpha = 1
    count = 0
    stop = False
    while True:
        G = gradient(soln, mu, threshold=threshold)
        proposed = soln - alpha * G
        proposed_val = objective(proposed, mu, threshold=threshold)
        if ((proposed.sum() - threshold < 0) or
            (proposed_val > cur_val) or
            (np.isfinite(soln).sum() != 2) or
            (np.isfinite(G).sum() != 2)):
            alpha *= 0.5     # step too large: backtrack
        else:
            soln = proposed
            old_val, cur_val = cur_val, proposed_val
            if np.fabs(old_val - cur_val) <= tol * max(np.fabs(cur_val), 1):
                stop = True
        if stop:
            break
        count += 1
    return cur_val, soln
value_rand(0, 2)
## (2.177950604421338, array([1.37047989, 1.36157092]))
# randomized selection here
def simulate_smooth_bayesian_rand(truth,
                                  precision,
                                  threshold=2,
                                  level=0.9,
                                  ndraw=5000,
                                  burnin=1000,
                                  proposal_sd=0.5):
    prec = precision
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if D + omega > threshold:
            # now draw ndraw + burnin samples from Metropolis-Hastings
            chain = [D]
            def log_post(mu):
                return -(0.5 * (D - mu)**2 -
                         value_rand(mu, threshold)[0] +
                         0.5 * prec * mu**2)
            den = log_post(chain[-1])
            for _ in range(ndraw + burnin):
                proposal = (chain[-1] +
                            np.random.standard_normal() * proposal_sd)
                num = log_post(proposal)
                # acceptance probability
                alpha = min(np.exp(num - den), 1)
                if np.random.binomial(1, alpha):
                    chain.append(1 * proposal)
                    den = num
                else:
                    chain.append(1 * chain[-1])
            chain = np.array(chain)[burnin:]
            U = np.percentile(chain, 100 * (1 - (1 - level) / 2))
            L = np.percentile(chain, 100 * (1 - level) / 2)
            break
    return L, U, (L < truth) * (U > truth)
simulate_smooth_bayesian_rand(-2, 0.2, ndraw=5000, proposal_sd=1)
## (-3.6329185943845195, 0.29411072460259496, True)
coverage = []
for _ in range(50):
    coverage.append(simulate_smooth_bayesian_rand(-1, 0.01))
coverage = np.array(coverage)
coverage.mean(0)[2]
## 0.88
Formally, given our barrier \(B\) (which smooths out the constraint of the selection event), our (approximate MLE) problem is \[ \text{minimize}_{\mu} - D \mu + \frac{1}{2}\mu^2 - \inf_{d,\omega} \left[\frac{1}{2}(d-\mu)^2 + \frac{1}{2} \omega^2 + B(\omega+d)\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d,\omega} \left[d \mu - \frac{1}{2}d^2 - \frac{1}{2} \omega^2 - B(\omega+d)\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d \mu - \frac{1}{2}d^2 + \sup_{\omega} \left[ - \frac{1}{2} \omega^2 - B(\omega+d) \right]\right] \]
Or \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d \mu - \frac{1}{2}d^2 + \sup_{\omega} \left[ - \frac{1}{2} (\omega-d)^2 - B(\omega) \right]\right] \]
Finally \[ \text{minimize}_{\mu} - D \mu + \sup_{d}\left[ d\mu - d^2 + B^*_1(d)\right] \] with \(B^*_1\) the Moreau envelope at parameter 1 of \(B^*\), the convex conjugate of our barrier: \[ f_{\kappa}(\mu) = \inf_{\omega} \frac{\kappa}{2}(\omega-\mu)^2 + f^*(\omega) \]
Note that we can define \[ h^*(\mu) = \sup_d \left[d\mu - d^2 + B^*_1(d)\right] \] where \[ h(d) = d^2 - B^*_1(d). \]
MLE equation is therefore \[ \nabla h^*(\hat{\mu}(D)) = D \]
Or, \[ \hat{\mu}(D) = \nabla h(D) = 2D - \nabla B^*_1(D). \]
General results on the Moreau envelope tell us \[ \nabla B^*_1(d) = d - \text{prox}_{B^*}(d) = \text{prox}_B(d) = \text{argmin}_{\omega} \; \frac{1}{2}(\omega - d)^2 + B(\omega) \]
\(\implies\) \[ \hat{\mu}(D) = 2 D - \text{prox}_B(D) \]
\(\implies\) we need only solve the prox problem.
For observed information, we must differentiate \(\hat{\mu}(D)\) with respect to \(D\).
KKT for proximal problem can be written as \[ D = \omega^*(D) + \nabla B(\omega^*(D)) \]
Differentiating both sides with respect to \(D\) yields \[ \nabla \omega^*(D) = (I + \nabla^2 B(\omega^*(D)))^{-1} \]
Finally, \[ \hat{I}(D) = 2 - (1 + \nabla^2 B(\omega^*(D)))^{-1} \]
Can be computed with the solution to the proximal problem…
def prox_barrier(d, value_barrier, grad_barrier, threshold=2, niter=100):
    # solve argmin_w (d - w)^2 / 2 + B(w) by backtracking gradient descent
    soln = threshold + 1
    cur_val = (d - soln)**2 / 2 + value_barrier(soln, threshold=threshold)
    alpha = 1.
    for _ in range(niter):
        G = soln - d + grad_barrier(soln, threshold=threshold)
        while True:
            proposed = soln - alpha * G
            proposed_val = (d - proposed)**2 / 2 + value_barrier(proposed, threshold=threshold)
            if ((proposed_val > cur_val) or
                (proposed < threshold)):
                alpha *= 0.5
            else:
                cur_val = proposed_val
                soln = proposed
                alpha *= 4
                break
    return soln

def value_barrier(arg, threshold=2):
    return -np.log(arg - threshold)

def grad_barrier(arg, threshold=2):
    return -1 / (arg - threshold)

def hessian_barrier(arg, threshold=2):
    return 1 / (arg - threshold)**2

def approx_mle(D,
               barrier,
               threshold=2):
    value, grad, hessian = barrier
    prox_soln = prox_barrier(D, value, grad)
    MLE = 2 * D - prox_soln
    info_MLE = 2 - 1 / (1 + hessian(prox_soln))
    return MLE, info_MLE

barrier = (value_barrier, grad_barrier, hessian_barrier)
approx_mle(2.4, barrier)
## (1.5801960968332591, 1.4019419322542626)
Zval = np.linspace(5, -1, 601)
ax.plot(Zval, [approx_mle(z, barrier)[0] for z in Zval], 'gray', linewidth=3, label=r'$B(\omega)=\log(\omega)$')
ax.legend()
def simulate_freq_pivot(truth, barrier, threshold=2):
    while True:
        D = np.random.standard_normal() + truth
        omega = np.random.standard_normal()
        if (D + omega) > threshold:
            MLE, info = approx_mle(D, barrier)
            Z = (MLE - truth) / np.sqrt(info)
            return Z
simulate_freq_pivot(0, barrier)
## 0.1297325145333067
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(-2, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(0, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(4, barrier))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
def value_barrier_alt(arg, threshold=2):
    return -np.log(arg - threshold) + np.log(arg - threshold + 1)

def grad_barrier_alt(arg, threshold=2):
    return -1 / (arg - threshold) + 1 / (arg - threshold + 1)

def hessian_barrier_alt(arg, threshold=2):
    return 1 / (arg - threshold)**2 - 1 / (arg - threshold + 1)**2

barrier_alt = (value_barrier_alt, grad_barrier_alt, hessian_barrier_alt)

Zval = np.linspace(5, -1, 601)
ax.plot(Zval, [approx_mle(z, barrier_alt)[0] for z in Zval], 'k', linewidth=3, label=r'$B(\omega)=\log(\omega/(\omega+1))$')
ax.legend()
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(-2, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(0, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
Zsim = []
for _ in range(5000):
    Zsim.append(simulate_freq_pivot(2, barrier_alt))
axZ = sns.distplot(Zsim)
axZ.plot(np.linspace(-3,3,201), normal_dbn.pdf(np.linspace(-3,3,201)))
(Figure panels, repeated for each barrier: distribution of the pivot versus the \(N(0,1)\) density at \(\mu^*=-2\), \(\mu^*=0\), \(\mu^*=2\).)
How do we compute density?
Does this work beyond file drawer?
MLE problem solves \[ \hat{\mu}(D) = \text{argmin}_{\mu} -\mu^TD + \Lambda(\mu) \]
KKT conditions \[ D = \nabla \Lambda(\hat{\mu}(D)) \]
Plugging this into any density for \(D\) (in the model or not) yields density of MLE.
This goes back to Fisher…
Randomized file drawer solves \[ \hat{\beta}(D,\omega) = \text{argmin}_{\beta \geq 2} \frac{1}{2}(D-\beta)^2 - \omega \cdot \beta \]
Selection is whether constraint is tight or not…
KKT conditions \[ \hat{\beta} - D - \omega + \hat{u} = 0 \]
When \(\hat{\beta} > 2\) (when we have selected this study…) we have \(\hat{u}=0\).
When \(\hat{\beta}=2\) we have \(\hat{u}=D-2+\omega \leq 0\)
Rewriting KKT conditions \[ \omega = \hat{\beta} - D + \hat{u} \]
Change of variables yields \[ \pi(t,n) = \pi(t, 0) \propto \int_2^{\infty} \phi(\beta - t) \; d\beta \]
If we were interested in the complementary selection event we could write \[ \pi^c(t,n) = \pi^c(t, 0) \propto \int_0^{\infty} \phi(2 - t + u) \; du. \]
Moral: integrals over \(\omega\) can be expressed as integrals over \((\beta, u)\)
Data splitting (or direct Gaussian randomization) of the LASSO solves \[ \begin{aligned} \hat{\beta}(D, \omega) &= \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 - \omega^T\beta + \lambda \|\beta\|_1 \\ &= \text{argmin}_{\beta} \ell(\beta;D) - \omega^T\beta + \lambda \|\beta\|_1 \\ \end{aligned} \]
KKT conditions \[ \omega = \nabla \ell(\hat{\beta}; D) + \hat{u} \]
Conditioning on signs and active set \[ \pi(t,n) = \int_{\text{sign}(\beta_E)=s_E} \int_{\|u_{-E}\|_{\infty} \leq \lambda} g_{\omega}(\nabla \ell(\beta_E; n + t \Gamma) + u) \; du_{-E} \; d\beta_E \] with \(u_E = \lambda s_E\).
Representation is agnostic to our model – we’ve just used a decomposition \[ D = {\cal N} + \Gamma \hat{\theta}. \]
Our target parameter \(\theta^*\) need not be directly related to \((E,s_E)\) – researcher degrees of freedom.
We wrote change of variables only on event \((\hat{E},\hat{s}_E)=(E,s_E)\) but it is global.
In principle, any function of \((D,\omega)\) can be expressed as function of \((D, \hat{\beta}, \hat{u})\).
Typically, researchers will want to look at functions of \((\hat{\beta},\hat{u})\).
Can even show them the entire subgradient \(\hat{u}\) to see which variables were close to being selected.
Correction takes form \[ \pi(t,n|u) = \int_{\text{sign}(\beta_E)=s_E} g_{\omega}(\nabla \ell(\beta_E; n + t \Gamma) + u) \; d\beta_E \]
An \(E\)-dimensional integral.
If \(\omega \sim N(0, \tau^2 I)\) then \[ \begin{aligned} g_{\omega}(\nabla \ell(\beta_E; {\cal N} + \hat{\theta} \Gamma) + u) & \propto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) \end{aligned} \]
This is a Gaussian density when viewed as a function of \((\beta_E, u_{-E})\) with \(X^TY=D\) fixed. Ignoring the restriction there is an implied Gaussian \[ {\cal L}\left(\beta_E, u_{-E} \vert X^TY \right) \sim N(\mu(X^TY), Q^{-1}) \] with density (up to normalization) \[ (\beta_E, u_{-E}) \mapsto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) = N(\gamma_1(X^TY), \Sigma_1) \]
Conditional on \(u_{-E}\) there is another implied Gaussian \[ {\cal L}\left(\beta_E \vert u_{-E}, X^TY \right) \sim N(\mu(X^TY, u_{-E}), Q^{-1}) \] with density (up to normalization) \[ \beta_E \mapsto \exp \left(-\frac{1}{2 \tau^2} \|X^TX_E\beta_E - X^TY + (\lambda s_E, u_{-E})\|^2_2 \right) = N(\gamma_2(X^TY,u_{-E}), \Sigma_2) \]
The (Chernoff-type) approximation of the normalizer involves \[ \inf_{t, \beta_E: \text{sign}(\beta_E)=s_E} \frac{t^2}{2\sigma^2} + \frac{1}{2\tau^2} \left\|X^TX_E\beta_E - ({\cal N} + t \Gamma) + (\lambda s_E, u_{-E})\right\|^2_2 \]
Smoothing the sign constraints with barrier terms and minimizing over \(\beta_E\) first gives \[ \inf_{t} \; G(t) \] with \[ H(\beta_E) = \frac{1}{2\tau^2}\|X^TX_E\beta_E\|^2_2 - \sum_{i \in E} \log\left(\frac{s_i\beta_{E,i}}{s_i\beta_{E,i}+1}\right) \] and \[ G(t) = \frac{t^2}{2\sigma^2} + \frac{1}{2\tau^2}\|{\cal N} + t\Gamma - (\lambda s_E, u_{-E})\|^2_2 - H^*\left(\frac{1}{\tau^2} X_E^TX\left({\cal N} + t\Gamma - (\lambda s_E, u_{-E})\right)\right) \]
Selective MLE \(\hat{\theta}^*\) solves \[ \frac{\hat{\theta}}{\sigma^2} = \frac{1}{\sigma^2} \nabla G^*(\hat{\theta}^* / \sigma^2) \]
Or, \[ \hat{\theta}^* = \sigma^2 \cdot \nabla G(\hat{\theta}) = \hat{\theta} + \frac{\sigma^2}{\tau^2}\Gamma^T \left (D - (\lambda s_E, u_{-E}) \right)- \frac{\sigma^2}{\tau^2}\Gamma^T X^TX_E\nabla H^*\left(\frac{1}{\tau^2} X_E^TX(D - (\lambda s_E, u_{-E}))\right) \]
\(\implies\) whatever our target, we need only solve the problem corresponding to \(H^*\)… adjustment for target \(\theta(F)\) depends only on \(\Gamma\)!
Natural parameter is \(\theta / \sigma^2\).
Observed (approximate) information is \[ \begin{aligned} \text{Var}^*(\hat{\theta}^*) &= \sigma^4 \left(\nabla^2 G^*(\hat{\theta}^* / \sigma^2)\right)^{-1} \\ &= \sigma^4 \nabla^2 G(\hat{\theta}) \\ &= \sigma^2 + \frac{\sigma^4 \|\Gamma\|^2_2}{\tau^2} - \frac{\sigma^4}{\tau^2} \Gamma^T X^TX_E \nabla^2 H^*\left(\frac{1}{\tau^2} X_E^TX(D - (\lambda s_E, u_{-E})) \right) X_E^TX\Gamma \end{aligned} \]
Remains to compute \(\nabla^2 H^*\)…
Recall \[ \hat{\beta}_E(\zeta_E) = \nabla H^*(\zeta_E) \]
Standard results yield \[ \begin{aligned} \nabla^2 H^*(\zeta_E) &= \nabla^2 H(\hat{\beta}_E(\zeta_E))^{-1} \\ \end{aligned} \] with \[ \nabla^2 H(\beta_E) = \frac{1}{\tau^2} X_E^TXX^TX_E + \text{diag}\left(\frac{1}{(s_i \beta_i)^2} - \frac{1}{(s_i \beta_i + 1)^2} \right). \]
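A sketch of \(\nabla^2 H\) as code (assumptions: the design `X`, active-set mask `E`, signs `s_E`, and randomization scale `tau` are given; the helper name `hessian_H` is illustrative):

```python
import numpy as np

def hessian_H(beta_E, X, E, s_E, tau):
    # (1/tau^2) X_E^T X X^T X_E + diag(1/(s_i b_i)^2 - 1/(s_i b_i + 1)^2)
    A = X.T.dot(X[:, E])            # X^T X_E
    quad = A.T.dot(A) / tau**2
    z = s_E * beta_E                # positive on the selection event
    return quad + np.diag(1 / z**2 - 1 / (z + 1)**2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
E = np.zeros(10, bool)
E[:3] = True
s_E = np.array([1., -1., 1.])
H = hessian_H(np.array([0.5, -1.0, 2.0]), X, E, s_E, tau=1.0)
print(H.shape)  # (3, 3)
```

The barrier contribution is positive on the selection event (\(s_i\beta_i > 0\)), so the Hessian is positive definite there.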
Suppose \(\omega \sim N(0, \Theta^{-1})\); which pieces above change?
Suppose we work in the model \({\cal M} = \left\{Y|X \sim N(X\beta, \sigma^2 I): \beta \in \mathbb{R}^p, \sigma^2 >0 \right\}\) and we set \(\Theta^{-1} = \tau^2 (X^TX)\). Does anything simplify? Can you “make sense” of these quantities even when \(p>n\)?
When will these estimates (and the information) revert to the (preselection) MLE? Does this match intuition?
Suppose we allow ourselves to specify a prior having observed \(\hat{u}\).
We might choose features \(M\) (typically related to \(E\) but we allow researcher discretion) \[ {\cal M} = \left\{N(X_M\beta_M, \sigma^2_MI): \beta_M \in \mathbb{R}^M, \sigma^2_M > 0 \right\} \]
With a Gaussian prior on \(\beta_M\) (approximate) normalizing constant can be expressed as an optimization over \(\mathbb{R}^{|M|+|E|}\) \[ P(\text{sign}(\hat{\beta}_E(X^TY,\omega))=s_E | \hat{u}) \] where \(\hat{u}_E = s_E, \|\hat{u}_{-E}\|_{\infty} \leq \lambda\).
Smoothed Chernoff approximation replaces \(|E|\) affine constraints with \(|E|\) barrier terms.
Data \(Y \in \mathbb{R}^{97}, X \in \mathbb{R}^{97 \times 320}\)…
Randomization analogous to using 80% of data in first stage…
from selectinf.randomized.lasso import lasso, selected_targets
from selectinf.randomized.posterior_inference import posterior_inference_lasso
n, p, s = 500, 100, 5
X = np.random.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = np.sqrt(2 * np.log(p) / n)
Y = X.dot(beta) + np.random.standard_normal(n)

scale_ = np.std(Y)
# uses noise of variance n * scale_ / 4 by default
L = lasso.gaussian(X, Y, 3 * scale_ * np.sqrt(2 * np.log(p) * np.sqrt(n)))
signs = L.fit()
E = (signs != 0)

M = E.copy()
M[-3:] = 1
dispersion = np.linalg.norm(Y - X[:,M].dot(np.linalg.pinv(X[:,M]).dot(Y))) ** 2 / (n - M.sum())
(observed_target,
 cov_target,
 cov_target_score,
 alternatives) = selected_targets(L.loglike,
                                  L._W,
                                  M,
                                  dispersion=dispersion)
observed_target.shape, E.sum()
## ((12,), 9)
# interface a little clunky at the moment
# really just specifies quadratic of implied gaussian given subgradient
A_scaling = L.sampler.affine_con.linear_part
b_scaling = L.sampler.affine_con.offset
logdens_linear = L.sampler.logdens_transform[0]
posterior_inf = posterior_inference_lasso(observed_target,
                                          cov_target,
                                          cov_target_score,
                                          L.observed_opt_state,
                                          L.cond_mean,
                                          L.cond_cov,
                                          logdens_linear,
                                          A_scaling,
                                          b_scaling,
                                          observed_target)
samples = posterior_inf.posterior_sampler(nsample=400, nburnin=200, step=1.)
## /Users/jonathantaylor/git-repos/selectinf/selectinf/randomized/posterior_inference.py:90: RuntimeWarning: invalid value encountered in double_scalars
##   log_normalizer = -val - mean_marginal.T.dot(prec_marginal).dot(mean_marginal)/2
# 90% credible intervals from the posterior samples
lower_cred = np.percentile(samples, 5, axis=0)
upper_cred = np.percentile(samples, 95, axis=0)
beta_target = np.linalg.pinv(X[:,M]).dot(X.dot(beta))
coverage_cred = (lower_cred < beta_target) * (upper_cred > beta_target)
coverage_cred
## array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
##         True,  True,  True])
try:
    S = L.selective_MLE(observed_target,
                        cov_target,
                        cov_target_score)[0]
    S
except ValueError as err:
    print("{0}".format(err))
## not finding a feasible point
def simulate_random_lasso():
    n, p = 500, 100
    X = np.random.standard_normal((n, p))
    beta = np.zeros(p)
    Y = np.random.standard_normal(n)
    scale_ = np.std(Y)
    # uses noise of variance n * scale_ / 4 by default
    L = lasso.gaussian(X, Y, 3 * scale_ * np.sqrt(2 * np.log(p) * np.sqrt(n)))
    signs = L.fit()
    E = (signs != 0)
    M = E.copy()
    M[-3:] = 1
    dispersion = np.linalg.norm(Y - X[:,M].dot(np.linalg.pinv(X[:,M]).dot(Y))) ** 2 / (n - M.sum())
    (observed_target,
     cov_target,
     cov_target_score,
     alternatives) = selected_targets(L.loglike,
                                      L._W,
                                      M,
                                      dispersion=dispersion)
    S = L.selective_MLE(observed_target,
                        cov_target,
                        cov_target_score,
                        level=0.9)[0]
    cover = (S['lower'] < 0) * (S['upper'] > 0)
    return S['pvalue'], cover
p, c = simulate_random_lasso()
P, C = [], []
for _ in range(200):
    p, c = simulate_random_lasso()
    P.extend(p)
    C.extend(c)
('coverage', np.mean(C))
## ('coverage', 0.844179651695692)
Besides the LASSO, such a change of variables applies to step-up rules such as Benjamini-Hochberg when conditioning on the top \(\hat{k}\) (as well as the corresponding \({\cal N}\)).
Let’s try this out: \((n, p, s, \mu) = (2400, 1000, 30, 3.5)\) with \(\Sigma=I\) (chosen to be close to Candes & Barber (2014))
We first collect a sample of size \(0.8 * 2400\) and find BH selected variables.
Collect a followup sample of size \(0.2 * 2400\) for selected variables.
Want confidence intervals for parameters of selected set.
A two-sided step-up procedure can be expressed in terms of scores \((t_1,\dots,t_k)\) as \[ \hat{k} = \max \left\{i: |t_{(i)}| > c_i \right\} \] for cutoffs \(c_i\), with \(|t_{(1)}| \geq \dots \geq |t_{(k)}|\).
Benjamini-Hochberg: \(c_i = \Phi^{-1}(1 - \frac{q \cdot i}{2k})\)
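A sketch of the step-up rule with these cutoffs (the helper `step_up` and the scores are illustrative):

```python
import numpy as np
from scipy.stats import norm

def step_up(t, q=0.1):
    # select the top khat scores, khat = max{ i : |t|_(i) > c_i } with
    # c_i = Phi^{-1}(1 - q*i/(2k)) (Benjamini-Hochberg cutoffs)
    k = len(t)
    order = np.argsort(-np.abs(t))      # indices sorted by decreasing |t|
    cutoffs = norm.ppf(1 - q * np.arange(1, k + 1) / (2 * k))
    passed = np.abs(t[order]) > cutoffs
    if not passed.any():
        return np.array([], dtype=int)
    khat = np.max(np.nonzero(passed)[0]) + 1
    return order[:khat]

t = np.array([3.5, 0.2, -2.8, 1.0, 4.1])
print(step_up(t, q=0.1))  # indices of the selected scores
```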
Suppose \(D \sim N(\mu, \Sigma)\) and we data split or otherwise randomize.
Event becomes \[ |D_i + \omega_i| > c_{|E|} \forall i \in E, \qquad |D_j + \omega_j| \leq c_{|E|-1} \forall j \not \in E \]
Conditional on signs \(s_E\) (to make convex truncation) and \(U=(D_j + \omega_j)_{j \not \in E}\) we see \[ \pi(t,n\vert u) = P(s_E(n + \Gamma t)[E] + \omega_E > c_{|E|}). \]
If we make the assumption that the \(D_i\)’s are independent, then we can legitimately condition on \(U=(D_j)_{j \not \in E}\) and simply collect new data with \(\pi\) unchanged.
With dependence, conditioning on \((D_j)_{j \not \in E}\) does not remove dependence on nuisance parameters.
Using the usual decomposition \(D = {\cal N} + \Gamma \hat{\theta}\) would lead to hard constraints on \(\hat{\theta}\) when inspecting \(\{|D_j + \omega_j| \leq c_{|E|-1} \forall j \not \in E\}\).