There are two primary setups in active learning: pool-based, where the learner selects queries from a fixed pool of unlabeled data, and stream-based, where unlabeled samples arrive one at a time and the learner decides on the fly whether to request a label.
Sanna Jarl, Linus Aronsson, Sadegh Rahrovani, and Morteza Haghir Chehreghani. 2021. "Active Learning of Driving Scenario Trajectories." Eng. Appl. Artif. Intell. 113: 104972.
A. Biswas, N. Abdullah Al, M.S. Ali, I. Hossain, M.A. Ullah, and S. Talukder. 2023. "Active Learning on Medical Image." In Data Driven Approaches on Medical Imaging, edited by B. Zheng, S. Andrei, M.K. Sarker, and K.D. Gupta. Cham: Springer.
Aarti Singh, Robert D. Nowak, and Parameswaran Ramanathan. 2006. "Active Learning for Adaptive Mobile Sensing Networks." 2006 5th International Conference on Information Processing in Sensor Networks, 60–68.
F. Carcillo, Y.-A. Le Borgne, O. Caelen, et al. 2018. "Streaming Active Learning Strategies for Real-Life Credit Card Fraud Detection: Assessment and Visualization." International Journal of Data Science and Analytics 5: 285–300.
There are three common ways to quantify uncertainty in AL. There are also conformal prediction methods. In this lecture, we will focus on the Bayesian approach to quantifying uncertainty.
Zhu, Jingbo, Huizhen Wang, Benjamin Ka-Yin T’sou, and Matthew Y. Ma. 2010. "Active Learning with Sampling by Uncertainty and Density for Data Annotations." IEEE Transactions on Audio, Speech, and Language Processing 18: 1323–31.
Example: Consider a binary classification problem with two classes $y_1$ and $y_2$. We have three samples $x_1, x_2, x_3$ and the corresponding predictive distributions are as follows:
$$\begin{aligned} p(y_1|x_1) &= 0.6, \quad p(y_2|x_1) = 0.4\\ p(y_1|x_2) &= 0.3, \quad p(y_2|x_2) = 0.7\\ p(y_1|x_3) &= 0.8, \quad p(y_2|x_3) = 0.2 \end{aligned}$$
Entropy Sampling
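A worked computation, using the predictive entropy $\mathbb{H}[p(y|x)] = -\sum_j p(y_j|x) \log p(y_j|x)$ (values below use base-10 logarithms; the ranking does not depend on the base):

$$\begin{aligned} \mathbb{H}[p(y|x_1)] &= -(0.6 \log 0.6 + 0.4 \log 0.4) \approx 0.292\\ \mathbb{H}[p(y|x_2)] &= -(0.3 \log 0.3 + 0.7 \log 0.7) \approx 0.265\\ \mathbb{H}[p(y|x_3)] &= -(0.8 \log 0.8 + 0.2 \log 0.2) \approx 0.217 \end{aligned}$$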
We would select $x_1$ for labeling as it has the highest entropy, indicating the model is most uncertain about its prediction at $x_1$.
Margin Sampling
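The margin is the gap between the two most probable classes; a smaller margin means higher uncertainty:

$$\begin{aligned} M(x_1) &= 0.6 - 0.4 = 0.2\\ M(x_2) &= 0.7 - 0.3 = 0.4\\ M(x_3) &= 0.8 - 0.2 = 0.6 \end{aligned}$$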
We would select $x_1$ for labeling as it has the smallest margin, indicating the model is most uncertain about the prediction at $x_1$.
Least Confidence Sampling
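Least-confidence sampling scores each sample by $1 - \max_j p(y_j|x)$, i.e., how unsure the model is about its most likely label:

$$\begin{aligned} 1 - \max_j p(y_j|x_1) &= 1 - 0.6 = 0.4\\ 1 - \max_j p(y_j|x_2) &= 1 - 0.7 = 0.3\\ 1 - \max_j p(y_j|x_3) &= 1 - 0.8 = 0.2 \end{aligned}$$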
We would select $x_1$ for labeling as it has the lowest confidence, indicating the model is most uncertain about the prediction at $x_1$.
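All three criteria can be computed directly from the predictive probabilities. A minimal NumPy sketch on the example above (variable names are our own choices):

```python
import numpy as np

probs = np.array([[0.6, 0.4],   # p(y | x_1)
                  [0.3, 0.7],   # p(y | x_2)
                  [0.8, 0.2]])  # p(y | x_3)

# Entropy sampling: highest predictive entropy is most uncertain.
entropy = -(probs * np.log(probs)).sum(axis=1)

# Margin sampling: smallest gap between the top-two class probabilities is most uncertain.
sorted_probs = np.sort(probs, axis=1)[:, ::-1]
margin = sorted_probs[:, 0] - sorted_probs[:, 1]

# Least-confidence sampling: lowest top-class probability is most uncertain.
least_confidence = 1.0 - probs.max(axis=1)

# All three criteria pick x_1 (index 0) on this example.
print(entropy.argmax(), margin.argmin(), least_confidence.argmax())
```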
Example: Consider a binary classification problem with two classes $y_1$ and $y_2$. We have three committee members and three samples $x_1, x_2, x_3$. The committee members’ predictive distributions are as follows, where $p_i(y_j|x_k)$ is the probability of committee member $i$ predicting class $y_j$ given input $x_k$.
| $x$ | $p_1(y_1 \vert \cdot)$ | $p_1(y_2 \vert \cdot)$ | $p_2(y_1 \vert \cdot)$ | $p_2(y_2 \vert \cdot)$ | $p_3(y_1 \vert \cdot)$ | $p_3(y_2 \vert \cdot)$ |
|---|---|---|---|---|---|---|
| $x_1$ | 0.6 | 0.4 | 0.7 | 0.3 | 0.3 | 0.7 |
| $x_2$ | 0.3 | 0.7 | 0.4 | 0.6 | 0.4 | 0.6 |
| $x_3$ | 0.8 | 0.2 | 0.9 | 0.1 | 0.7 | 0.3 |
Vote Entropy
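With hard votes, members 1 and 2 predict $y_1$ and member 3 predicts $y_2$ for $x_1$, while all three members agree on $x_2$ (all predict $y_2$) and on $x_3$ (all predict $y_1$). The vote entropy over the vote fractions is then (base-10 logs, as before):

$$\begin{aligned} \mathbb{H}_{\text{vote}}(x_1) &= -\left(\tfrac{2}{3}\log\tfrac{2}{3} + \tfrac{1}{3}\log\tfrac{1}{3}\right) \approx 0.276\\ \mathbb{H}_{\text{vote}}(x_2) &= -\left(\tfrac{3}{3}\log\tfrac{3}{3}\right) = 0\\ \mathbb{H}_{\text{vote}}(x_3) &= -\left(\tfrac{3}{3}\log\tfrac{3}{3}\right) = 0 \end{aligned}$$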
Thus, we would select $x_1$ for labeling as it has the highest vote entropy, indicating the committee members disagree the most about the prediction at $x_1$.
Consensus Entropy
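The consensus distribution averages the members' predictions, $p_C(y_j|x) = \tfrac{1}{3}\sum_{i} p_i(y_j|x)$, and the consensus entropy is the entropy of this average (base-10 logs):

$$\begin{aligned} p_C(y_1|x_1) &= \tfrac{0.6+0.7+0.3}{3} \approx 0.533, & \mathbb{H}_C(x_1) &\approx -(0.533 \log 0.533 + 0.467 \log 0.467) \approx 0.300\\ p_C(y_1|x_2) &= \tfrac{0.3+0.4+0.4}{3} \approx 0.367, & \mathbb{H}_C(x_2) &\approx -(0.367 \log 0.367 + 0.633 \log 0.633) \approx 0.285\\ p_C(y_1|x_3) &= \tfrac{0.8+0.9+0.7}{3} = 0.8, & \mathbb{H}_C(x_3) &= -(0.8 \log 0.8 + 0.2 \log 0.2) \approx 0.217 \end{aligned}$$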
We would select $x_1$ for labeling as it has the highest consensus entropy, indicating the committee members disagree the most about the prediction at $x_1$.
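The same committee calculations in a minimal NumPy sketch (array layout and names are our own; natural logs change the absolute values but not the ranking):

```python
import numpy as np

# p_i(y_j | x_k) from the table above: members x samples x classes
committee = np.array([
    [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]],   # member 1
    [[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]],   # member 2
    [[0.3, 0.7], [0.4, 0.6], [0.7, 0.3]],   # member 3
])
n_classes = committee.shape[-1]
eps = 1e-12  # avoids log(0) when all members agree

# Vote entropy: entropy of the hard-vote fractions per sample.
votes = committee.argmax(axis=-1)                                    # members x samples
vote_freq = np.stack([(votes == c).mean(axis=0) for c in range(n_classes)], axis=-1)
vote_entropy = -(vote_freq * np.log(vote_freq + eps)).sum(axis=-1)

# Consensus entropy: entropy of the averaged (soft) predictions per sample.
consensus = committee.mean(axis=0)                                   # samples x classes
consensus_entropy = -(consensus * np.log(consensus)).sum(axis=-1)

# Both criteria pick x_1 (index 0) on this example.
print(vote_entropy.argmax(), consensus_entropy.argmax())
```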
Bayesian Active Learning by Disagreement (BALD) selects the samples for which, in expectation, the most (Shannon) information about the model parameters would be gained if the corresponding labels were observed:
$$\begin{aligned} \mathbb{I}(\theta; y|x, \mathcal{D}) = \mathbb{H}[p(y|x, \mathcal{D})] - \mathbb{E}_{p(\theta | \mathcal{D})} [\mathbb{H}[p(y|x, \theta, \mathcal{D})]] \end{aligned}$$
where $\mathbb{H}[\cdot]$ denotes entropy. When the sampled models disagree strongly about a point, the entropy of the averaged prediction (first term) is large while the average entropy of the individual predictions (second term) remains low; the difference measures how much the models disagree with each other. BALD selects the points where this disagreement is maximized.
Houlsby, N., Huszár, F., Ghahramani, Z., & Lengyel, M. (2011). Bayesian Active Learning for Classification and Preference Learning. ArXiv, abs/1112.5745.
To compute the first term, we write the posterior predictive as a marginal over $\theta$ and approximate it with Monte Carlo samples $\theta_i \sim p(\theta | \mathcal{D})$:
$$\begin{aligned} \mathbb{H}[p(y|x, \mathcal{D})] &= \mathbb{H}\left[\int_{\theta} p(y|x, \theta, \mathcal{D}) p(\theta | \mathcal{D}) d\theta\right]\\ &\approx \mathbb{H}\left[\frac{1}{N}\sum_{i=1}^{N} p(y|x, \theta_i, \mathcal{D})\right]\\ &= \mathbb{H}\left[\overline{p}(y|x, \mathcal{D})\right] \end{aligned}$$
To compute the second term, we can derive the following expression:
$$\begin{aligned} &\mathbb{E}_{p(\theta|\mathcal{D})} [\mathbb{H}[p(y|x, \theta, \mathcal{D})]] \\ &= \mathbb{E}_{p(\theta|\mathcal{D})} \left[ - \sum_{y} p(y|x, \theta, \mathcal{D}) \log p(y|x, \theta, \mathcal{D}) \right]\\ &\approx - \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{y} p(y|x, \theta_i, \mathcal{D}) \log p(y|x, \theta_i, \mathcal{D}) \right) \end{aligned}$$
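A minimal NumPy sketch of this Monte Carlo estimate (function name and array layout are our own; `np.log` uses natural logarithms, whereas the worked example below uses base-10 logs, but the ranking of candidates is identical):

```python
import numpy as np

def bald_scores(probs):
    """Monte Carlo BALD scores.

    probs: array of shape (n_posterior_samples, n_points, n_classes)
           holding p(y | x, theta_i, D) for posterior draws theta_i.
    Returns one mutual-information score per candidate point.
    """
    eps = 1e-12                                   # avoid log(0)
    mean_probs = probs.mean(axis=0)               # p_bar(y | x, D), averaged over theta_i
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # first term
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)  # second term
    return predictive_entropy - expected_entropy

# The two inference passes of the worked example below:
probs = np.array([
    [[0.6, 0.4], [0.4, 0.6]],   # theta_1
    [[0.8, 0.2], [0.5, 0.5]],   # theta_2
])
print(bald_scores(probs))  # approx. [0.024, 0.005]: x_1 scores higher than x_2
```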
Example: Consider a binary classification problem with two classes $y_1$ and $y_2$. We have two samples $x_1, x_2$ and the model’s predictive distributions are as follows:
First inference pass (with posterior sample $\theta_1 \sim p(\theta | \mathcal{D})$)
$$\begin{aligned} p(y_1|x_1, \theta_1, \mathcal{D}) &= 0.6, \quad p(y_2|x_1, \theta_1, \mathcal{D}) = 0.4\\ p(y_1|x_2, \theta_1, \mathcal{D}) &= 0.4, \quad p(y_2|x_2, \theta_1, \mathcal{D}) = 0.6 \end{aligned}$$
Second inference pass (with posterior sample $\theta_2 \sim p(\theta | \mathcal{D})$)
$$\begin{aligned} p(y_1|x_1, \theta_2, \mathcal{D}) &= 0.8, \quad p(y_2|x_1, \theta_2, \mathcal{D}) = 0.2\\ p(y_1|x_2, \theta_2, \mathcal{D}) &= 0.5, \quad p(y_2|x_2, \theta_2, \mathcal{D}) = 0.5 \end{aligned}$$
Step 1: Compute the entropy of the averaged predictive distribution $\overline{p}(y|x, \mathcal{D})$ for each sample.
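Averaging the two inference passes and taking the entropy (base-10 logarithms, matching the values in Step 2):

$$\begin{aligned} \overline{p}(y_1|x_1, \mathcal{D}) &= \tfrac{0.6 + 0.8}{2} = 0.7, & \mathbb{H}[\overline{p}(y|x_1, \mathcal{D})] &= -(0.7 \log 0.7 + 0.3 \log 0.3) \approx 0.265\\ \overline{p}(y_1|x_2, \mathcal{D}) &= \tfrac{0.4 + 0.5}{2} = 0.45, & \mathbb{H}[\overline{p}(y|x_2, \mathcal{D})] &= -(0.45 \log 0.45 + 0.55 \log 0.55) \approx 0.299 \end{aligned}$$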
Step 2: Compute the expected entropy of the model’s predictive distribution for each sample.
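The entropy of each individual inference pass is (base-10 logs):

$$\begin{aligned} \mathbb{H}[p(y|x_1, \theta_1, \mathcal{D})] &= -(0.6 \log 0.6 + 0.4 \log 0.4) \approx 0.29, & \mathbb{H}[p(y|x_1, \theta_2, \mathcal{D})] &\approx 0.22\\ \mathbb{H}[p(y|x_2, \theta_1, \mathcal{D})] &= -(0.4 \log 0.4 + 0.6 \log 0.6) \approx 0.29, & \mathbb{H}[p(y|x_2, \theta_2, \mathcal{D})] &\approx 0.30 \end{aligned}$$

Averaging over the posterior samples: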
⇒ $\mathbb{E}_{p(\theta|\mathcal{D})} [\mathbb{H}[p(y|x_1, \theta, \mathcal{D})]] \approx (0.29 + 0.22) / 2 = 0.255$
⇒ $\mathbb{E}_{p(\theta|\mathcal{D})} [\mathbb{H}[p(y|x_2, \theta, \mathcal{D})]] \approx (0.29 + 0.30) / 2 = 0.295$
Step 3: Compute the BALD score for each sample.
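Subtracting the expected entropy (Step 2) from the predictive entropy (Step 1):

$$\begin{aligned} \mathbb{I}(\theta; y|x_1, \mathcal{D}) &\approx 0.265 - 0.255 = 0.010\\ \mathbb{I}(\theta; y|x_2, \mathcal{D}) &\approx 0.299 - 0.295 = 0.004 \end{aligned}$$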
We would select $x_1$ for labeling as it has the highest BALD score, indicating the model will gain the most information from labeling $x_1$.
David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. “Active Learning with Statistical Models.” CoRR cs.AI/9603104.
A different family of acquisition criteria, commonly used in regression, aims to reduce the learner's expected error. Consider the expected squared prediction error at an input $x$:

$$\mathbb{E}_{\hat{y} \sim p(\hat{y} | \mathcal{D}; x),\, y \sim p(y|x)} (\hat{y} - y)^2$$

where $\hat{y}$ is the model's prediction and $y$ is the (noisy) true label.
$$\begin{aligned} \mathbb{E}_{\hat{y} \sim p(\hat{y} | \mathcal{D}; x), y \sim p(y|x)} (\hat{y} - y)^2 &= \mathbb{E}_{\hat{y}, y}[(\hat{y} - \mathbb{E}[y|x]) + (\mathbb{E}[y|x] - y)]^2 \\ &= \mathbb{E}_{\hat{y}, y} [(y - \mathbb{E}[y|x])^2]\\ &+ 2\mathbb{E}_{\hat{y}, y} [(\hat{y} - \mathbb{E}[y|x])(\mathbb{E}[y|x] - y)]\\ &+ \mathbb{E}_{\hat{y}, y}(\hat{y} - \mathbb{E}[y|x])^2 \end{aligned}$$
Stuart Geman, Elie Bienenstock, and René Doursat. 1992. “Neural Networks and the Bias/Variance Dilemma.” Neural Computation 4: 1–58.
The cross term vanishes: $\hat{y}$ depends only on $\mathcal{D}$ and $x$, so it is independent of $y$ given $x$, the expectation factorizes, and

$$\mathbb{E}_{\hat{y}, y}[\mathbb{E}[y|x] - y] = \mathbb{E}_{y}[\mathbb{E}[y|x]] - \mathbb{E}_{y}[y] = \mathbb{E}_{y}[y] - \mathbb{E}_{y}[y] = 0$$
$$\begin{aligned} \mathbb{E}_{\hat{y}, y}(\hat{y} - \mathbb{E}[y|x])^2 &= \mathbb{E}_{\hat{y}, y}[(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}] + \mathbb{E}_{\hat{y}}[\hat{y}] - \mathbb{E}[y|x])^2]\\ &= \mathbb{E}_{\hat{y}, y}[(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])^2] \\ &+ 2 \mathbb{E}_{\hat{y}, y} \left[(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])(\mathbb{E}_{\hat{y}}[\hat{y}] - \mathbb{E}[y|x]) \right]\\ &+ \mathbb{E}_{\hat{y}, y}[(\mathbb{E}_{\hat{y}}[\hat{y}] - \mathbb{E}[y|x])^2]\\ &= \mathbb{E}_{\hat{y}} [(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])^2] + (\mathbb{E}_{\hat{y}}[\hat{y}] - \mathbb{E}[y|x] )^2 \\ \end{aligned}$$
$$\begin{aligned} &\text{as }\mathbb{E}_{\hat{y}, y} (\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}]) = \mathbb{E}_{\hat{y}} [\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}]] = \mathbb{E}_{\hat{y}} [\hat{y}] - \mathbb{E}_{\hat{y}}[\hat{y}] = 0, \\ &\mathbb{E}_{\hat{y}}[\hat{y}] = \mathbb{E}_{\hat{y} \sim p(\hat{y} \vert \mathcal{D}, x)}[\hat{y} | x, \mathcal{D}], \text{ and } p(\hat{y} \vert \mathcal{D}, x) \text{ is the posterior predictive} . \end{aligned}$$
Finally, we arrive at the classic decomposition into noise, squared bias, and variance:

$$\underbrace{\mathbb{E}_{y} [(y - \mathbb{E}[y|x])^2]}_{\text{noise}} + \underbrace{(\mathbb{E}_{\hat{y}} [\hat{y} - \mathbb{E}[y|x]] )^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\hat{y}} [(\hat{y} - \mathbb{E}_{\hat{y}}[\hat{y}])^2]}_{\text{variance}}$$
The noise term does not depend on the model, and the learner is typically assumed to be (approximately) unbiased, so active learning focuses on the last term: the variance of the prediction at $x$ given the training data $\mathcal{D}$,

$$\sigma^2_{\hat{y}} (x | \mathcal{D}) = \mathbb{E}_{\hat{y} \sim p(\hat{y} | \mathcal{D}; x)} [(\hat{y} - \mathbb{E}_{\hat{y} \sim p(\hat{y} | \mathcal{D}; x)}[\hat{y}])^2]$$
Recall that $x \sim p(x)$ is a sample from the input distribution and $\tilde{x}$ is a candidate from the active learning pool. We define $\tilde{\mathcal{D}} = \mathcal{D} \cup \{(\tilde{x}, y(\tilde{x}))\}$. The active learning procedure is as follows: for each candidate $\tilde{x}_i$, estimate the expected predictive variance over the input distribution after retraining on $\tilde{\mathcal{D}}$,

$$\mathbb{E}_{p(x)} [\sigma^2_{\hat{y}} (x | \tilde{\mathcal{D}})]$$

and query the candidate that minimizes it:

$$\tilde{x}^* = \arg\min_{\tilde{x}_i} \mathbb{E}_{p(x)} [\sigma^2_{\hat{y}} (x | \tilde{\mathcal{D}})]$$
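A minimal sketch of this loop with a Gaussian process regressor instead of the statistical models used by Cohn et al. (data, kernel, and names are hypothetical). With fixed kernel hyperparameters, the GP posterior variance does not depend on the observed label, so the expected variance after adding a candidate can be evaluated without knowing $y(\tilde{x})$:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(5, 1))      # labeled set D (hypothetical)
y_train = np.sin(X_train).ravel()
X_pool = rng.uniform(-3, 3, size=(20, 1))      # unlabeled candidates x_tilde
X_ref = rng.uniform(-3, 3, size=(200, 1))      # reference points x ~ p(x)

kernel = RBF(length_scale=1.0)

def expected_variance_after(x_new):
    """Average posterior variance over X_ref if x_new were added to the training set.
    A placeholder label suffices because the posterior variance ignores y when the
    kernel hyperparameters are kept fixed (optimizer=None)."""
    X_aug = np.vstack([X_train, x_new[None, :]])
    y_aug = np.append(y_train, 0.0)            # placeholder label for the candidate
    gp = GaussianProcessRegressor(kernel=kernel, optimizer=None, alpha=1e-6).fit(X_aug, y_aug)
    _, std = gp.predict(X_ref, return_std=True)
    return np.mean(std ** 2)

scores = np.array([expected_variance_after(x) for x in X_pool])
x_query = X_pool[np.argmin(scores)]            # candidate with the lowest expected variance
print(x_query)
```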
Beyond uncertainty- and variance-based criteria, several other acquisition strategies have been studied, including expected model change, expected error reduction, bandit-based selection, and application-specific criteria such as those for sound event detection:

Wenbin Cai, Ya Zhang, and Jun Zhou. 2013. "Maximizing Expected Model Change for Active Learning in Regression." In 2013 IEEE 13th International Conference on Data Mining, 51–60.
Stephen Mussmann, Julia Reisler, Daniel Tsai, Ehsan Mousavi, Shayne O’Brien, and Moises Goldszmidt. 2022. "Active Learning with Expected Error Reduction."
David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. 1996. "Active Learning with Statistical Models." CoRR cs.AI/9603104.
Djallel Bouneffouf, Romain Laroche, Tanguy Urvoy, Raphaël Féraud, and Robin Allesiardo. 2014. "Contextual Bandit for Active Learning: Active Thompson Sampling." In International Conference on Neural Information Processing.
Shuyang Zhao, Toni Heittola, and Tuomas Virtanen. 2020. "Active Learning for Sound Event Detection." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28: 2895–905.
Example in autonomous driving: Two candidate trajectories are provided for comparison. We can observe that $\zeta_A$ has a smoother trajectory without any collisions.
What if humans do not precisely know how an agent should optimally behave in an environment but still have some opinion on what trajectories would be better than others?
Dorsa Sadigh, Anca D. Dragan, Shankar Sastry, and Sanjit A. Seshia. 2017. "Active Preference-Based Learning of Reward Functions." In Proceedings of Robotics: Science and Systems (RSS).
Following Sadigh et al., we model the human's (unknown) reward at each time step as a linear function of trajectory features:

$$r_H(x^t, u^t_R, u^t_H) = \mathbf{w}^{\top} \phi(x^t, u^t_R, u^t_H)$$

where $\mathbf{w}$ is an unknown weight vector, $\phi$ is the feature function, $x^t$ is the state, and $u^t_R, u^t_H$ are the robot's and the human's control inputs at time $t$.
The cumulative reward of a trajectory over a horizon of $N$ steps is then

$$R_H(x^0, \mathbf{u}_R, \mathbf{u}_H) = \sum_{t=0}^{N} r_H(x^t, u^t_R, u^t_H)$$
We can define a controlled trajectory $\zeta \in \Xi$ as $\zeta = \{(x^0, u^0_R, u^0_H), \dots, (x^N, u^N_R, u^N_H)\}$ and $\Phi(\zeta) = \sum_{t=0}^{N} \phi(x^t, u^t_R, u^t_H)$.
Thus, we can rewrite the cumulative reward as
$$R_H(\zeta) = \mathbf{w}^{\top} \Phi(\zeta)$$
Recalling the previous lecture on preference learning, we can define the probability of the human preferring trajectory $\zeta_A$ over $\zeta_B$ as
$$\begin{aligned} P(\zeta_A \succ \zeta_B) &= \sigma(R_H(\zeta_A) - R_H(\zeta_B))\\ &= \frac{1}{1 + \exp(-(R_H(\zeta_A) - R_H(\zeta_B)))}\\ &= \frac{ \exp (R_H(\zeta_A)) }{ \exp (R_H(\zeta_A)) + \exp (R_H(\zeta_B)) } \end{aligned}$$
Do you find the above equation familiar?
With $R_H(\zeta)$, we can define the probability of the human preferring trajectory $\zeta_A$ over $\zeta_B$ (the answer $I = 1$) as
$$p(I=1 | \mathbf{w}) = \frac{ \exp (R_H(\zeta_A)) }{ \exp (R_H(\zeta_A)) + \exp (R_H(\zeta_B)) }$$
and the probability of the human preferring trajectory $\zeta_B$ over $\zeta_A$ as
$$p(I=-1 | \mathbf{w}) = \frac{ \exp (R_H(\zeta_B)) }{ \exp (R_H(\zeta_A)) + \exp (R_H(\zeta_B)) }$$
We can express both answers compactly. With $\psi = \Phi(\zeta_A) - \Phi(\zeta_B)$ and $\Phi(\zeta) = \sum_{t=0}^{N} \phi(x^t, u_R^t, u_H^t)$ as before, the probability of the human's answer $I \in \{+1, -1\}$ is

$$f_{\psi}(\mathbf{w}) = P(I | \mathbf{w}) = \frac{1}{1 + \exp(- I\, \mathbf{w}^{\top} \psi)}$$
The idea is that we can use a Bayesian update to compute the posterior of the weight vector $\mathbf{w}$ given the human’s preferences.
$$p(\mathbf{w} | I) \propto p(I | \mathbf{w}) p(\mathbf{w})$$
The acquisition function used in this case maximizes the minimum expected probability mass ("volume") removed from the distribution over $\mathbf{w}$, whichever way the human answers, so the chosen query is informative under both possible responses:
$$\max_{\psi} \left\{ \min \{ \mathbb{E}[1-f_{\psi}(\mathbf{w})], \mathbb{E}[1-f_{-\psi}(\mathbf{w})] \} \right\}$$
Since $\psi$ is determined by the two trajectories, instead of optimizing over $\psi$ directly we can reformulate the problem as an optimization over the shared initial state and the control sequences that generate $\zeta_A$ and $\zeta_B$:
$$\max_{x^0, \mathbf{u}_R, \mathbf{u}_{H}^A, \mathbf{u}_{H}^B} \left\{ \min \{ \mathbb{E}[1-f_{\psi}(\mathbf{w})], \mathbb{E}[1-f_{-\psi}(\mathbf{w})] \} \right\}$$
To solve this optimization problem when $p(\mathbf{w})$ is complex, we can approximate $p(\mathbf{w})$ by an empirical distribution over $M$ samples $\mathbf{w}_i$ (obtained, for example, by MCMC):

$$p(\mathbf{w}) \approx \frac{1}{M} \sum_{i=1}^{M} \delta(\mathbf{w} - \mathbf{w}_i)$$

where $\delta$ is the Dirac delta function. Then, we can rewrite the optimization problem as
$$\max_{x^0, \mathbf{u}_R, \mathbf{u}_{H}^A, \mathbf{u}_{H}^B} \left\{ \min \left\{ \frac{1}{M} \sum_{i=1}^{M} \left(1-f_{\psi}(\mathbf{w}_i)\right), \frac{1}{M} \sum_{i=1}^{M} \left(1-f_{-\psi}(\mathbf{w}_i)\right) \right\} \right\}$$
Now, we can use gradient-based optimization methods to solve this optimization problem (e.g. Quasi-Newton methods).
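A minimal sketch of evaluating this acquisition from posterior samples. For simplicity it scores a discrete set of candidate feature differences $\psi$ rather than optimizing over trajectories as in Sadigh et al.; the data and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                              # feature dimension (hypothetical)
w_samples = rng.normal(size=(100, d))              # M posterior samples of w (e.g., from MCMC)
candidate_psis = rng.normal(size=(50, d))          # candidate queries, psi = Phi(zeta_A) - Phi(zeta_B)

def f(psi, w, I=1):
    """P(I | w) for answer I in {+1, -1} under the logistic preference model."""
    return 1.0 / (1.0 + np.exp(-I * w @ psi))

def acquisition(psi, w_samples):
    """Minimum, over the two possible answers, of the expected fraction removed (1 - f)."""
    removed_if_A = np.mean([1.0 - f(psi, w, I=+1) for w in w_samples])
    removed_if_B = np.mean([1.0 - f(psi, w, I=-1) for w in w_samples])
    return min(removed_if_A, removed_if_B)

scores = np.array([acquisition(psi, w_samples) for psi in candidate_psis])
best_query = candidate_psis[np.argmax(scores)]     # query to show the human next
print(best_query)
```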
Summary
Discussion
Next lecture: Model-free Preference Optimization