We’ve spent a lot of time over the last few classes analysing \[ \pi(t,n) = P[\omega + n + \Gamma t \in A \vert n, t] \]
Our most common examples were file drawer (randomized or not) and LASSO (randomized or not).
Surely people do more than just these selection mechanisms!
What can we say for more complex ones?
If our model \({\cal M}\) admits an estimator \(\hat{\theta}(D) \approx N(\theta(F), \sigma^2(F))\) and we have access to \(\pi(t,n)\),
then to test \(H_0:\theta(F)=\theta_0\) we sample from the density \[ t \mapsto \phi((t - \theta_0) / \sigma) \cdot \pi(t, {\cal N}_{obs}) \]
Requires our usual decomposition \[ D = {\cal N} + \Gamma \hat{\theta} \] where \[ \Gamma = \Gamma(F) = \text{Cov}_F(D, \hat{\theta}(D)) \text{Cov}_F(\hat{\theta}(D))^{-1}. \]
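To make the decomposition concrete, here is a small numerical sketch (my own toy example, not from the notes): for a 3-dimensional statistic \(D\) and a scalar estimator \(\hat{\theta}\), the residual \({\cal N} = D - \Gamma\hat{\theta}\) built with this \(\Gamma\) is empirically uncorrelated with \(\hat{\theta}\).

```python
import numpy as np

rng = np.random.default_rng(0)
# toy joint distribution: D is a 3-dimensional statistic, theta_hat a scalar estimator
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
D = rng.multivariate_normal(np.zeros(3), Sigma, size=50000)
theta_hat = D.mean(axis=1)

# Gamma = Cov(D, theta_hat) Var(theta_hat)^{-1}, estimated from the draws
C = np.cov(np.column_stack([D, theta_hat]), rowvar=False)
Gamma = C[:3, 3] / C[3, 3]

# the decomposition D = N + Gamma * theta_hat leaves N uncorrelated with theta_hat
N = D - np.outer(theta_hat, Gamma)
resid_cov = np.cov(np.column_stack([N, theta_hat]), rowvar=False)[:3, 3]
print(np.max(np.abs(resid_cov)))  # numerically zero
```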
If we have access in silico to \(\hat{\cal Q}\), we can draw samples \[ T_i, \qquad Y_i = 1\left\{\hat{\cal Q}({\cal N} + \Gamma T_i) = q\right\} \]
Then, we can learn \(t \mapsto \pi(t, {\cal N})\) using any flexible probability learner.
Given this function, use it in the reference density above.
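Before the more complex examples, the recipe can be checked end to end on the simplest mechanism, the file drawer (this sketch and its identifiers are mine): selection reports \(T\) only when \(T > c\); we learn \(\pi\) with a boosting classifier on simulated \((T_i, Y_i)\) pairs and use it to reweight the reference density, which should recover the exact truncated-normal p-value.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
c = 1.5                                    # file drawer: report T only if T > c

# learning data: spread T out so both selection outcomes are observed
T_learn = rng.normal(0, 2, size=20000)
Y_learn = (T_learn > c).astype(int)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(T_learn.reshape(-1, 1), Y_learn)
pi_hat = lambda t: clf.predict_proba(t.reshape(-1, 1))[:, 1]

# selective p-value for H_0: theta = 0, given that T_obs = 2 was reported
T_obs = 2.0
Z = rng.normal(0, 1, size=200000)          # draws from the unadjusted reference density
W = pi_hat(Z)
p_learned = np.sum(W * (Z > T_obs)) / np.sum(W)
p_exact = norm.sf(T_obs) / norm.sf(c)      # exact truncated-normal p-value, ~0.34
```

The learned \(\hat{\pi}\) is a smoothed step at \(c\), so `p_learned` agrees with `p_exact` up to Monte Carlo and approximation error.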
import numpy as np
from copy import copy
from sklearn.ensemble import GradientBoostingClassifier

def winning_arm(*samples):
    # assumed defined earlier in the notes: the arm with the largest mean wins
    return np.argmax([np.mean(sample) for sample in samples])

truth = [1, 0.9, 1.1]
samples = [np.random.standard_normal(20) + truth[0],
           np.random.standard_normal(20) + truth[1],
           np.random.standard_normal(20) + truth[2]]
i_star = winning_arm(*samples)
followup = np.random.standard_normal(20) + truth[i_star]
i_star
## 0
np.random.seed(2)
def generate_learning_data(i_star,
                           samples,
                           followup,
                           winning_arm=winning_arm,
                           B=5000):
    samples_cp = copy(samples)
    Y, T = [], []
    n_star = len(samples_cp[i_star])
    full_data = np.hstack([samples_cp[i_star], followup])
    for b in range(B):
        # resample the winning arm's pooled data, then split off a stage-1 sample
        full_data_star = np.random.choice(full_data, full_data.shape[0], replace=True)
        samples_cp[i_star] = np.random.choice(full_data_star, n_star, replace=False)
        Y.append(winning_arm(*samples_cp) == i_star)
        T.append(full_data_star.mean())
    return np.array(Y), np.array(T).reshape((-1, 1))

Y, T = generate_learning_data(i_star, samples, followup)
Y.shape, T.shape
## ((5000,), (5000, 1))
What should this curve look like?
Assume \(i^*=1\). What would happen if we instead learnt selection as a function of \((\bar{D}_1, \bar{D}_{1,1})\) rather than only \(\bar{D}_1\)? Can you use such a function to do the correct adjustment?
import matplotlib.pyplot as plt
# %matplotlib inline
clf = GradientBoostingClassifier(random_state=0)
clf.fit(T, Y);
pi_hat = clf.predict_proba(T)
plt.scatter(T[:,0], pi_hat[:,1])

When resampling we must maintain \[ \Gamma = \frac{\text{Cov}(\bar{D}_{1,i}, \bar{D}_i)}{\text{Var}(\bar{D}_i)} \] and \[ \text{Var}(\bar{D}_{1,i}) \]
Bootstrapping with any smaller sample size, as long as it is greater than \(n_{1,i}\), will do…
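A quick illustration of why a smaller resample size helps (toy numbers, my own): the \(m\)-out-of-\(n\) bootstrap mean has standard deviation roughly \(s/\sqrt{m}\), so smaller \(m\) spreads the \(T\)'s out and lets the classifier observe the selection event both succeed and fail over a wider range.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(40) + 1.0               # stand-in for the winner's pooled data

def boot_means(x, m, B=20000):
    # m-out-of-n bootstrap: resample m points with replacement, record the mean
    idx = rng.integers(0, len(x), size=(B, m))
    return x[idx].mean(axis=1)

# smaller m => more spread in the bootstrapped means T
spreads = {m: boot_means(x, m).std() for m in [40, 28, 16]}
```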
def generate_learning_data_frac(i_star,
                                samples,
                                followup,
                                winning_arm=winning_arm,
                                B=5000,
                                frac=0.6):
    samples_cp = copy(samples)
    Y, T = [], []
    n_star = len(samples_cp[i_star])
    full_data = np.hstack([samples_cp[i_star], followup])
    for b in range(B):
        full_data_star = np.random.choice(full_data,
                                          int(frac * full_data.shape[0]),
                                          replace=True)
        samples_cp[i_star] = np.random.choice(full_data_star, n_star, replace=False)
        Y.append(winning_arm(*samples_cp) == i_star)
        T.append(full_data_star.mean())
    return np.array(Y), np.array(T).reshape((-1, 1))

Y2, T2 = generate_learning_data_frac(i_star, samples, followup)
clf2 = GradientBoostingClassifier(random_state=0)
clf2.fit(T2, Y2);
pi_hat2 = clf2.predict_proba(T2)
plt.scatter(T2[:,0], pi_hat2[:,1])

T3, Y3 = np.vstack([T, T2]), np.hstack([Y, Y2])
clf3 = GradientBoostingClassifier(random_state=0)
clf3.fit(T3, Y3);
pi_hat3 = clf3.predict_proba(T3)
plt.scatter(T3[:,0], pi_hat3[:,1])

Y4, T4 = generate_learning_data_frac(i_star, samples, followup, frac=0.7, B=5000)
T4, Y4 = np.vstack([T4, T3]), np.hstack([Y4, Y3])
clf4 = GradientBoostingClassifier(random_state=0)
clf4.fit(T4, Y4);
pi_hat4 = clf4.predict_proba(T4)
plt.scatter(T4[:,0], pi_hat4[:,1])

def pvalue(full_data_mean,
           se,
           pi_hat,
           B=10000,
           truth=1):
    # test H_a: mean in winning arm is > truth
    Z = np.random.standard_normal(B) * se + truth
    W = pi_hat(Z)
    sf_ = np.sum(W * (Z > full_data_mean)) / np.sum(W)
    return sf_
full_data_winning = np.hstack([samples[i_star], followup])
pvalue(full_data_winning.mean(),
       full_data_winning.std() / np.sqrt(full_data_winning.shape[0]),
       lambda t: clf4.predict_proba(t.reshape((-1,1)))[:,1])
## 0.6218785858108452
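A sanity check on pvalue (my addition; the function is restated so the snippet is self-contained): with a constant \(\hat{\pi}\) the weights cancel and the weighted survival function reduces to the usual normal tail probability.

```python
import numpy as np
from scipy.stats import norm

def pvalue(full_data_mean, se, pi_hat, B=10000, truth=1):
    # importance-weighted survival function under the reference density
    Z = np.random.standard_normal(B) * se + truth
    W = pi_hat(Z)
    return np.sum(W * (Z > full_data_mean)) / np.sum(W)

np.random.seed(0)
p = pvalue(1.2, 0.2, lambda t: np.ones_like(t), B=200000, truth=1)
# with pi_hat constant this is just P(N(1, 0.2^2) > 1.2) = norm.sf(1) ~ 0.159
```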
def simulate(sizes=[20, 20, 20],
             size_followup=20,
             fracs=[1, 0.6, 0.7],
             truth=[0.9, 1, 1.1],
             winning_arm=winning_arm):
    # run our trial
    samples = [np.random.standard_normal(size) + truth[i] for i, size in enumerate(sizes)]
    i_star = winning_arm(*samples)
    followup = np.random.standard_normal(size_followup) + truth[i_star]
    # generate our learning data
    Y, T = [], []
    for frac in fracs:
        y, t = generate_learning_data_frac(i_star, samples, followup, frac=frac)
        Y.append(y); T.append(t)
    Y = np.hstack(Y)
    T = np.vstack(T)
    # learn the selection probability
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(T, Y)
    # use selection probability for inference
    full_data_winning = np.hstack([samples[i_star], followup])
    pi_hat = lambda t: clf.predict_proba(t.reshape((-1,1)))[:,1]
    return pvalue(full_data_winning.mean(),
                  full_data_winning.std() / np.sqrt(full_data_winning.shape[0]),
                  pi_hat,
                  truth=truth[i_star])

simulate()
## 0.2442717377465183
Generating \(Y\)’s from the bootstrap distribution will result in variation in \(\bar{D}_{j,1}, j \neq i^*\).
We can learn \[ (t_1, t_2, t_3) \mapsto \tilde{\pi}(t_1, t_2, t_3) = P\left (i^*(D)=1 \bigl \vert \bar{D}_1=t_1, \bar{D}_{2,1}=t_2, \bar{D}_{3,1}=t_3 \right) \]
If \(i^*=1\), the function \[ t \mapsto P\left(i^*(D)=1 \bigl \vert \bar{D}_1=t, \bar{D}_{j,1}=\bar{D}_{j,1,obs} \; \forall j \geq 2 \right) \]
can be recovered as \[ t \mapsto \tilde{\pi}(t, \bar{D}_{2,1,obs}, \bar{D}_{3,1,obs}) \]
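The plug-in step can be sketched generically with a toy stand-in for the selection event (identifiers are mine): learn a two-argument \(\tilde{\pi}\), then freeze the second coordinate at its observed value to obtain the one-dimensional function used in the reference density.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# toy selection event: arm 1 "wins" when t1 > t2
T = rng.normal(0, 2, size=(20000, 2))
Y = (T[:, 0] > T[:, 1]).astype(int)
clf = GradientBoostingClassifier(random_state=0).fit(T, Y)

t2_obs = 0.5                       # observed value of the conditioned-on coordinate

def pi_slice(t1):
    # tilde-pi evaluated along (t1, t2_obs)
    grid = np.column_stack([t1, np.full_like(t1, t2_obs)])
    return clf.predict_proba(grid)[:, 1]

vals = pi_slice(np.array([-2.0, 3.0]))   # low far below t2_obs, high far above
```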
def generate_learning_data_boot(i_star,
                                samples,
                                followup,
                                winning_arm=winning_arm,
                                B=5000,
                                frac=0.7):
    samples_cp = copy(samples)
    Y, T = [], []
    n_star = len(samples_cp[i_star])
    full_data = np.hstack([samples[i_star], followup])
    for b in range(B):
        full_data_star = np.random.choice(full_data, int(frac * full_data.shape[0]), replace=True)
        samples_cp[i_star] = np.random.choice(full_data_star, n_star, replace=False)
        for j in range(len(samples)):
            if j != i_star:
                samples_cp[j] = np.random.choice(samples[j], int(frac * samples[j].shape[0]), replace=True)
        Y.append(winning_arm(*samples_cp) == i_star)
        T.append([full_data_star.mean()] + [samples_cp[j].mean() for j in range(len(samples)) if j != i_star])
    return np.array(Y), np.array(T)

Y, T = generate_learning_data_boot(i_star, samples, followup)
T.shape
## (5000, 3)
def simulate_boot(sizes=[20, 20, 20],
                  size_followup=20,
                  fracs=[1, 0.6, 0.7],
                  truth=[0.9, 1, 1.1],
                  winning_arm=winning_arm):
    # run our trial
    samples = [np.random.standard_normal(size) + truth[i] for i, size in enumerate(sizes)]
    i_star = winning_arm(*samples)
    followup = np.random.standard_normal(size_followup) + truth[i_star]
    # generate our learning data
    Y, T = [], []
    for frac in fracs:
        y, t = generate_learning_data_boot(i_star, samples, followup, frac=frac)
        Y.append(y); T.append(t)
    Y = np.hstack(Y)
    T = np.vstack(T)
    # learn the selection probability
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(T, Y)
    # use selection probability for inference
    other_means = [samples[j].mean() for j in range(len(samples)) if j != i_star]
    full_data_winning = np.hstack([samples[i_star], followup])
    def pi_hat(t):
        # slice the learned pi at the observed means of the losing arms
        full_t = np.zeros((t.shape[0], len(samples)))
        full_t[:,0] = t
        full_t[:,1:] = other_means
        return clf.predict_proba(full_t)[:,1]
    return pvalue(full_data_winning.mean(),
                  full_data_winning.std() / np.sqrt(full_data_winning.shape[0]),
                  pi_hat,
                  truth=truth[i_star])

simulate_boot()
## 0.32429574652109217
# center them around 0
for sample in samples:
    sample -= 1

def winning_arm_abs(*samples):
    return np.argmax([np.fabs(np.mean(sample)) for sample in samples])

i_star_abs = winning_arm_abs(*samples)
Y4, T4 = generate_learning_data_frac(i_star_abs, samples, followup, frac=0.6, B=25000,
                                     winning_arm=winning_arm_abs)
clf4 = GradientBoostingClassifier(random_state=0)
clf4.fit(T4, Y4);
pi_hat4 = clf4.predict_proba(T4)
plt.scatter(T4[:,0], pi_hat4[:,1])
stage1_samples = [np.random.standard_normal(20) + 1 for _ in range(5)]

def stage1(*samples):
    return np.argsort([np.mean(sample) for sample in samples])[-3:]

stage1_winners = stage1(*stage1_samples)
stage1_winners
## array([2, 0, 4])
stage2_samples = [np.hstack([np.random.standard_normal(20) + 1,
                             stage1_samples[i]]) for i in stage1_winners]

def stage2(*samples):
    return np.argmax([np.mean(sample) for sample in samples])

stage2_winner = stage2(*stage2_samples)
overall_winner = stage1_winners[stage2_winner]
stage3_sample = np.hstack([stage2_samples[stage2_winner],
                           np.random.standard_normal(20) + 1])
overall_winner
## 2
A little more complex to write, but conceptually straightforward – fix all samples except the overall_winner's.
Bootstrap the overall_winner's full sample, subsampling appropriately to form its stage1 and stage2 data.
def generate_learning_multi(stage1_samples,
                            stage1_winners,
                            stage2_samples,
                            stage2_winner,
                            overall_winner,
                            stage3_sample,
                            frac=0.7,
                            B=5000):
    stage2_samples_cp = copy(stage2_samples)
    stage1_samples_cp = copy(stage1_samples)
    Y, T = [], []
    for _ in range(B):
        # bootstrap the overall winner's full sample ...
        overall_winner_star = np.random.choice(stage3_sample, int(frac * stage3_sample.shape[0]),
                                               replace=True)
        # ... then subsample its stage2 and stage1 data from it
        stage2_samples_cp[stage2_winner] = np.random.choice(overall_winner_star,
                                                            stage2_samples[stage2_winner].shape[0],
                                                            replace=False)
        stage1_samples_cp[overall_winner] = np.random.choice(stage2_samples_cp[stage2_winner],
                                                             stage1_samples_cp[overall_winner].shape[0],
                                                             replace=False)
        stage1_winners_star = stage1(*stage1_samples_cp)
        if overall_winner in stage1_winners_star:
            stage2_winner_star = stage2(*stage2_samples_cp)
            overall_winner_ = stage1_winners_star[stage2_winner_star]
            y = overall_winner_ == overall_winner
        else:
            y = 0
        Y.append(y)
        T.append(overall_winner_star.mean())
    return np.array(Y), np.array(T).reshape((-1, 1))
Y, T = generate_learning_multi(stage1_samples,
                               stage1_winners,
                               stage2_samples,
                               stage2_winner,
                               overall_winner,
                               stage3_sample)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(T, Y);
pi_hat = clf.predict_proba(T)
plt.scatter(T[:,0], pi_hat[:,1])
def simulate_multi(stage1_size=[20]*10,
                   stage2_size=[20]*3,
                   stage3_size=20,
                   fracs=[1, 0.7, 0.8],
                   truth=[1]*10):
    # run our trial
    stage1_samples = [np.random.standard_normal(size) + truth[i] for i, size in enumerate(stage1_size)]
    stage1_winners = stage1(*stage1_samples)
    stage2_samples = [np.hstack([np.random.standard_normal(size) + truth[i],
                                 stage1_samples[i]]) for i, size in zip(stage1_winners, stage2_size)]
    stage2_winner = stage2(*stage2_samples)
    overall_winner = stage1_winners[stage2_winner]
    stage3_sample = np.hstack([stage2_samples[stage2_winner],
                               np.random.standard_normal(stage3_size) + truth[overall_winner]])
    # generate our learning data
    Y, T = [], []
    for frac in fracs:
        y, t = generate_learning_multi(stage1_samples,
                                       stage1_winners,
                                       stage2_samples,
                                       stage2_winner,
                                       overall_winner,
                                       stage3_sample,
                                       frac=frac)
        Y.append(y); T.append(t)
    Y = np.hstack(Y)
    T = np.vstack(T)
    # learn the selection probability
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(T, Y)
    # use selection probability for inference
    pi_hat = lambda t: clf.predict_proba(t.reshape((-1,1)))[:,1]
    return pvalue(stage3_sample.mean(),
                  stage3_sample.std() / np.sqrt(stage3_sample.shape[0]),
                  pi_hat,
                  truth=truth[overall_winner])

simulate_multi()
## 0.9980666296040841
The sampling here required knowing the structure of the trial, though not the decision functions stage1 and stage2.
This sampling is needed to maintain the proper variance of the implicit randomization in the different arms…
Suppose we are working in a regression model \[ {\cal M} = \left\{Y|X \sim N(X\beta, \sigma^2 I): \beta \in \mathbb{R}^p, \sigma^2 > 0 \right\} \] and our selection \[ \hat{\cal Q} = \hat{\cal Q}(X^TY) \] returns a selected set \(\hat{E}\) of features.
Target is \[ \beta_j = \beta_{j|\{1,\dots,p\}} \]
The one-at-a-time approach for some \(j\) requires rerunning \(\hat{\cal Q}\) on \[ (X^TY) + (T_b - X_j^TY) \cdot e_j \] and recording \[ Y_b = 1\left\{j \in \hat{\cal Q}\left((X^TY) + (T_b - X_j^TY) \cdot e_j\right)\right\} \]
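A sketch of the one-at-a-time rerun for the LASSO (my own construction: sklearn's Lasso takes \((X, Y)\) rather than \(X^TY\), but since the active set depends on the data only through \(X^TX\) and \(X^TY\), we can feed it a pseudo-response \(Y_v = X(X^TX)^{-1}v\) satisfying \(X^TY_v = v\)):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:2] = 1.0
Y = X @ beta + rng.standard_normal(n)
XtY = X.T @ Y

to_response = X @ np.linalg.inv(X.T @ X)     # maps v to a pseudo-response Y_v with X^T Y_v = v

def selected_set(v, alpha=0.5):
    # the lasso active set is a function of X^T X and X^T Y only
    fit = Lasso(alpha=alpha, fit_intercept=False).fit(X, to_response @ v)
    return set(np.nonzero(fit.coef_)[0])

# perturb coordinate j of X^T Y and record whether j stays selected
j, e_j = 0, np.eye(p)[0]
T_b = rng.uniform(-100, 200, size=200)       # spread-out draws for the learning data
Y_b = np.array([j in selected_set(XtY + (t - XtY[j]) * e_j) for t in T_b])
# (T_b, Y_b) is the learning data for pi_hat_j
```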
A many-at-a-time approach for some \(E\) requires rerunning \(\hat{\cal Q}\) on \[ (X^TY) + I[,E] (T_b - X_E^TY) \] and recording, for each \(j \in E\), \[ Y^j_b = 1\left\{j \in \hat{\cal Q}\left((X^TY) + I[,E] (T_b - X_E^TY)\right)\right\} \]
This results in \(|E|\) learning problems, yielding \(\hat{\pi}_j, j \in E\).
The reference density for \(H_0:\beta_j=\beta_{j,0}\) is then expressible as the usual density modified by (something like) \[ t \mapsto \hat{\pi}_j(t, (X_i^TY)_{i \in E \setminus \{j\}}). \]
Can really take any distribution for \(T\)’s on \(\mathbb{R}^E\)
Optimal choice could be cast as active learning problem.
The graphical LASSO solves \[ \hat{\Theta} = \text{argmin}_{\Theta \succ 0} \; \text{Tr}(S\Theta) - \log\det\Theta + \lambda \|\Theta\|_{*,1} \] with \[ S = \frac{1}{n} X^T\left(I - \frac{1}{n}11^T\right)X \] a sample covariance matrix.
Penalty usually doesn’t charge the diagonal.
Alternatively, imagine we have some other algorithm to select edges \((i,j)\) and we want to test \(\Theta_{i,j}=0\) in model \[ {\cal M} = \left\{\text{Wishart}(\Theta^{-1},n-1): \Theta^{-1} \geq 0 \right\}. \]
One-at-a-time reruns algorithm \(\hat{\cal Q}\) on \[ S^*_b = S + (T^*_b - S_{ij}) E_{ij} \] (i.e. \({\cal N} = (S_{kl})_{(k,l) \neq (i,j)}\), accounting for symmetry as well) and records \[ Y_b = 1\left\{(i,j) \in \hat{\cal Q}(S^*_b)\right\} \]
We learn \[ \hat{\pi}(t, {\cal N}) = P(Y=1|T=t) \]
The reference density is the law of \[ S_{ij} \mid S_{kl}, (k,l) \neq (i,j) \] modified by the learned \(\hat{\pi}\). This is well-defined only when \(S \succ 0\)!
Can also use a many-at-a-time resampling scheme.
For \(n=100, p=20\) and \(\Theta=I\), try the above algorithm to construct hypothesis tests for edges selected by the graphical LASSO at a fixed value of the regularization parameter.
Describe how you can use a learnt \(\hat{\pi}\) function to construct confidence intervals for a selected element of \(\Theta\). Try this out on the graphical LASSO above.
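A possible starting point for the edge-selection exercise, using scikit-learn's graphical_lasso (the helper names are my own; the perturbation follows the one-at-a-time scheme above, skipping draws where \(S^*_b\) is not positive definite):

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))          # data generated with Theta = I
S = np.cov(X, rowvar=False)

def selects_edge(S, i, j, alpha=0.15):
    # graphical lasso at a fixed regularization; edge (i, j) is "selected"
    # when the off-diagonal entry of the estimated precision is nonzero
    _, Theta = graphical_lasso(S, alpha=alpha)
    return np.abs(Theta[i, j]) > 1e-8

# one-at-a-time perturbation of S_ij, holding the rest of S fixed
i, j = 0, 1
E_ij = np.zeros((p, p))
E_ij[i, j] = E_ij[j, i] = 1.0            # symmetric elementary perturbation
T_b, Y_b = [], []
for t in rng.uniform(-0.4, 0.4, size=100):
    S_star = S + (t - S[i, j]) * E_ij
    if np.linalg.eigvalsh(S_star).min() <= 0:
        continue                         # reference density lives only on S > 0
    T_b.append(t)
    Y_b.append(selects_edge(S_star, i, j))
T_b, Y_b = np.array(T_b), np.array(Y_b)
# (T_b, Y_b) is the learning data for pi_hat at edge (i, j)
```

From here, fit a probability learner to \((T_b, Y_b)\) and plug the resulting \(\hat{\pi}\) into the reference density as above.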