$$P(b_i \succ b_j) = \frac{c_{ij} + 1}{2}$$
or, equivalently, in terms of the preference gap $\epsilon(b_i, b_j) = c_{ij}/2$,
$$P(b_i \succ b_j) = \epsilon(b_i, b_j) + \frac{1}{2}$$
By this model, we have the following properties:
$$\epsilon(b_i, b_i) = 0, \epsilon(b_i, b_j) = -\epsilon(b_j, b_i)\text{ and,}$$
$$b_i \succ b_j \text{ if and only if } \epsilon(b_i, b_j) > 0$$
We also assume there is a total order over the bandits, i.e., there exists a permutation $\varphi$ such that $b_{\varphi(1)} \succ b_{\varphi(2)} \succ \dots \succ b_{\varphi(K)}$. Equivalently,
$$b_i \succ b_j \text{ if and only if } \varphi^{-1}(i) < \varphi^{-1}(j)$$
Thus, the best bandit is $b_{\varphi(1)}$.
To quantify the quality of the decision at each turn $t$, we construct a regret measure. Since the best bandit beats any other bandit with probability at least $0.5$, the instantaneous regret is defined as:
$$r_t = P(b_{\varphi(1)} \succ b_{1,t}) + P(b_{\varphi(1)} \succ b_{2,t}) - 1$$
where $b_{1,t}$ and $b_{2,t}$ are the two bandits dueled at turn $t$.
Then, the total cumulative regret is defined as:
$$\begin{aligned} R_T = \sum_{t=1}^{T} r_t &= \sum_{t=1}^{T} \left[ P(b_{\varphi(1)} \succ b_{1,t}) + P(b_{\varphi(1)} \succ b_{2,t}) - 1 \right]\\ &= \sum_{t=1}^{T} \left[\left(\epsilon(b_{\varphi(1)}, b_{1,t}) + \frac{1}{2}\right) + \left(\epsilon(b_{\varphi(1)}, b_{2,t}) + \frac{1}{2}\right) - 1\right]\\ &= \sum_{t=1}^{T} \left[\epsilon(b_{\varphi(1)}, b_{1,t}) + \epsilon(b_{\varphi(1)}, b_{2,t})\right] \end{aligned}$$
where $R_T$ is the total cumulative regret after $T$ turns.
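For concreteness, here is a minimal simulation sketch of this setup, assuming a small hypothetical skew-symmetric gap matrix `eps` and uniformly random pair selection (no real dueling-bandit algorithm would select pairs this way); it shows how duel outcomes are generated from $\epsilon(b_i, b_j) + \frac{1}{2}$ and how $R_T$ accumulates:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
# Hypothetical preference gaps eps[i, j] = epsilon(b_i, b_j) in [-1/2, 1/2]:
# skew-symmetric, zero diagonal, and the arms are indexed so that
# b_0 > b_1 > b_2 > b_3 (the best arm has index 0).
eps = np.array([[ 0.0,  0.1,  0.2,  0.3],
                [-0.1,  0.0,  0.1,  0.2],
                [-0.2, -0.1,  0.0,  0.1],
                [-0.3, -0.2, -0.1,  0.0]])
best = 0

T = 1000
R_T = 0.0
for _ in range(T):
    # A real algorithm would choose the pair adaptively; here it is uniform.
    i, j = rng.choice(K, size=2, replace=False)
    # Duel outcome: P(b_i beats b_j) = epsilon(b_i, b_j) + 1/2.
    winner = i if rng.random() < eps[i, j] + 0.5 else j  # feedback for the learner
    # Instantaneous regret r_t = epsilon(b_best, b_i) + epsilon(b_best, b_j).
    R_T += eps[best, i] + eps[best, j]

print(f"cumulative regret after {T} duels: {R_T:.1f}")
```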
In the contextual setting, the preference function additionally depends on the contexts $x_i$ and $x_j$ and remains bounded:
$$\epsilon(b_i, b_j \mid x_i, x_j) \in \left[-\frac{1}{2}, \frac{1}{2}\right]$$
Traditional acquisition functions for the dueling bandit problem include:
Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov), 397-422.
Yue, Y., Broder, J., Kleinberg, R., & Joachims, T. (2012). The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5), 1538-1556.
Agrawal, S., & Goyal, N. (2013, May). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127-135). PMLR.
Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., & Zoghi, M. (2015, June). Contextual dueling bandits. In Conference on Learning Theory (pp. 563-587). PMLR.
Novoseller, E., Wei, Y., Sui, Y., Yue, Y., & Burdick, J. (2020, August). Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence (pp. 1029-1038). PMLR.
$$V_{\pi, j}(s) = \mathbb{E}_{\pi} \left[ \sum_{t=j}^{h} \bar{r}(s_t, \pi(s_t, t)) | s_j = s \right]$$
where $\bar{r}(s_t, a_t)$ is the expected utility of the action $a_t$ at state $s_t$.
$$\pi^* = \arg \sup_{\pi} \sum_{s\in\mathcal{S}} p_0(s)V_{\pi, 1}(s)$$
where $p_0(s)$ is the initial state distribution.
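As a concrete illustration of these definitions, below is a minimal sketch that evaluates finite-horizon values by backward induction and extracts the greedy optimal policy for a small tabular model. The model (`P`, `r_bar`, `p0`) and the horizon `h` are hypothetical placeholders, and time is 0-indexed rather than starting at $j = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, h = 3, 2, 5

# Hypothetical tabular model: P[s, a] is a distribution over next states,
# r_bar[s, a] is the expected utility of taking action a in state s.
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
r_bar = rng.random((S, A))                   # shape (S, A)
p0 = np.ones(S) / S                          # initial state distribution

def optimal_values(P, r_bar, h):
    """Backward induction: V[t, s] = max_a ( r_bar[s, a] + P[s, a] . V[t + 1] )."""
    S, A = r_bar.shape
    V = np.zeros((h + 1, S))
    pi = np.zeros((h, S), dtype=int)
    for t in range(h - 1, -1, -1):           # t = h-1, ..., 0 (0-indexed time)
        Q = r_bar + P @ V[t + 1]             # shape (S, A)
        pi[t], V[t] = Q.argmax(axis=1), Q.max(axis=1)
    return V, pi

V, pi_star = optimal_values(P, r_bar, h)
print("expected return of the greedy optimal policy:", p0 @ V[0])
```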
The expected regret of DPS over a horizon of $T$ steps, where each iteration $i$ rolls out two sampled policies $\pi_{i1}$ and $\pi_{i2}$ for $h$ steps each, is defined as:
$$\mathbb{E}[\text{Reg}(T)] = \mathbb{E} \left\{ \sum_{i=1}^{\left\lceil \frac{T}{2h} \right\rceil} \sum_{s \in \mathcal{S}} p_0(s) \left[ 2V_{\pi^*,1}(s) - V_{\pi_{i1},1}(s) - V_{\pi_{i2},1}(s) \right] \right\}.$$
The DPS algorithm is summarized as follows:
Here, $\mathbb{I}_{[\tau_{i2} \succ \tau_{i1}]}$ denotes the indicator that equals $1$ if $\tau_{i2} \succ \tau_{i1}$ and $0$ otherwise; its expectation under the preference model is $P(\tau_{i2} \succ \tau_{i1})$.
ADVANTAGE: Sample policy from dynamics and utility models
Input: $f_p$: state transition posterior, $f_r$: utility posterior
FEEDBACK: Update dynamics and utility models based on new user feedback
Input: $\mathcal{H} = \{ \tau_{i1}, \tau_{i2}, y_i \}$, $f_p$: state transition posterior, $f_r$: utility posterior
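Since the algorithm listing itself is not reproduced here, the following self-contained toy sketch illustrates the structure of a DPS-style loop on a small tabular MDP. The environment, the tanh preference-noise model, the ±1 labeling convention, and the Bayesian linear regression credit assignment on trajectory feature differences are simplifying assumptions made for illustration; they are not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, h, lam, n_iters = 3, 2, 5, 1.0, 200

# Hypothetical ground-truth environment (unknown to the agent).
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # transition probs, shape (S, A, S)
r_true = rng.random((S, A))                       # true expected utilities

# Posteriors: Dirichlet pseudo-counts for dynamics, Bayesian linear regression
# (precision matrix M and accumulator b) for the utility parameters.
alpha = np.ones((S, A, S))
M = lam * np.eye(S * A)
b = np.zeros(S * A)

def solve_policy(P, r):
    """Finite-horizon backward induction in the sampled model; returns pi[t, s]."""
    V = np.zeros((h + 1, S))
    pi = np.zeros((h, S), dtype=int)
    for t in range(h - 1, -1, -1):
        Q = r + P @ V[t + 1]                      # shape (S, A)
        pi[t], V[t] = Q.argmax(axis=1), Q.max(axis=1)
    return pi

def rollout(pi):
    """Run pi for h steps in the true environment; record transition counts and
    return the state-action visit-count feature vector of the trajectory."""
    x = np.zeros((S, A))
    s = rng.integers(S)
    for t in range(h):
        a = pi[t, s]
        x[s, a] += 1
        s_next = rng.choice(S, p=P_true[s, a])
        alpha[s, a, s_next] += 1                  # Dirichlet update: prior + counts
        s = s_next
    return x.ravel()

for _ in range(n_iters):
    # ADVANTAGE step: sample a model from each posterior and plan greedily in it.
    pis = []
    for _ in range(2):
        P_tilde = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                            for s in range(S)])
        r_tilde = rng.multivariate_normal(np.linalg.solve(M, b), np.linalg.inv(M))
        pis.append(solve_policy(P_tilde, r_tilde.reshape(S, A)))

    x1, x2 = rollout(pis[0]), rollout(pis[1])

    # Simulated user feedback: the higher-utility trajectory is preferred, noisily.
    p_2_wins = 0.5 + 0.5 * np.tanh(r_true.ravel() @ (x2 - x1))
    y = 1.0 if rng.random() < p_2_wins else -1.0  # +/-1 labeling convention

    # FEEDBACK step: Bayesian linear regression update on the feature difference.
    M += np.outer(x2 - x1, x2 - x1)
    b += y * (x2 - x1)
```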
In DPS, updating the dynamics posterior is straightforward. The dynamics are fully observed, so each transition distribution can be modeled with a Dirichlet prior; the transition likelihood is then multinomial, and the posterior update is:
$$f_p(\cdot \mid s_t, a_t) = \text{Dirichlet}\big(\alpha + \text{count}(\cdot \mid s_t, a_t)\big)$$
That is, the posterior is obtained by adding the observed transition counts to the prior pseudo-counts.
We will now analyze the asymptotic Bayesian regret of DPS under a Bayesian linear regression model. The analysis proceeds in three steps: (1) the sampled dynamics converge to the true dynamics (Proposition 1); (2) the sampled rewards converge to the true reward parameters (Proposition 2); (3) the sampled policies converge to the optimal policy (Theorem 1), which yields the regret bound.
Proposition 1: The sampled dynamics converge in distribution to the true dynamics as the number of DPS iterations increases; that is, for any $\epsilon, \delta > 0$ and sufficiently many observations of $(s_t, a_t)$,
$$P\big(|P(s_{t+1} | s_t, a_t) - P^*(s_{t+1} | s_t, a_t)| > \epsilon\big) < \delta$$
Let $N(s, a)$ represent the number of times state-action pair $s, a$ has been observed. As $N(s, a) \to \infty$, the posterior distribution concentrates around the true distribution $P^*(s_{t+1} | s_t, a_t)$.
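A quick numerical illustration of this concentration, using a hypothetical true transition distribution and synthetic visit counts:

```python
import numpy as np

rng = np.random.default_rng(3)
p_true = np.array([0.6, 0.3, 0.1])      # hypothetical true P(. | s, a)

for n_visits in (10, 100, 10_000):
    counts = rng.multinomial(n_visits, p_true)
    alpha_post = 1.0 + counts           # Dirichlet(1, 1, 1) prior plus counts
    samples = rng.dirichlet(alpha_post, size=5000)
    max_err = np.abs(samples - p_true).max(axis=1)
    print(n_visits, "visits -> P(max error > 0.05) ~", (max_err > 0.05).mean())
```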
The remaining problem is to prove that DPS will visit all state-action pairs infinitely often.
Lemma 3 (Novoseller et al., 2020): Under DPS, every state-action pair is visited infinitely often.
Proof sketch:
Proposition 2: With probability $1 - \delta$, where $\delta$ is a parameter of the Bayesian linear regression model, the sampled rewards converge in distribution to the true reward parameters $\bar{r}$ as the number of DPS iterations increases.
According to Theorem 2 from Abbasi-Yadkori et al. (2011), under certain regularity conditions, we can bound the error between the estimated reward parameter $\hat{r}_i$ and the true reward parameter $r_i$ with high probability. With probability at least $1 - \delta$:
$$ \| \hat{r}_i - r_i \|_{\mathbf{M}_i} \leq \beta_i(\delta)$$
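To make the quantities in this bound concrete, here is a small sketch under a synthetic linear model. The true parameter `r_true`, the noise level, and the regularizer `lam` are hypothetical, and the closed form of $\beta_i(\delta)$ from Abbasi-Yadkori et al. (2011) is not reproduced; the code only computes the weighted norm being bounded:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, lam = 5, 200, 1.0
r_true = rng.normal(size=d)                      # hypothetical true reward parameter

X = rng.normal(size=(n, d))                      # feature differences x_{i2} - x_{i1}
y = X @ r_true + 0.1 * rng.normal(size=n)        # noisy utility-difference labels

M = lam * np.eye(d) + X.T @ X                    # regularized design matrix M_i
r_hat = np.linalg.solve(M, X.T @ y)              # ridge / posterior-mean estimate

# Weighted norm ||r_hat - r||_M; Theorem 2 of Abbasi-Yadkori et al. (2011)
# bounds this quantity by beta_i(delta) with probability at least 1 - delta.
err = r_hat - r_true
print("||r_hat - r||_M =", float(np.sqrt(err @ M @ err)))
```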
$$\lim_{i \to \infty} \lambda_{i,j} = 0$$
Theorem 1: With probability $1 - \delta$, the sampled policies $\pi_{i1}, \pi_{i2}$ converge in distribution to the optimal policy $\pi^*$ as $i \to \infty$.
$$P\left( \left| V(\tilde{p}_{i1}, \tilde{r}_{i1}, \pi) - V(p, r, \pi) \right| > \epsilon \right) \to 0 \quad \text{as} \quad i \to \infty$$
Since the value of any policy $\pi$ under the sampled dynamics and rewards converges to its value under the true model, the policies that maximize the sampled values converge to the optimal policy.
The analysis relies on the information ratio (Russo & Van Roy, 2016), defined for iteration $i$ as:
$$ \Gamma_i = \frac{ \mathbb{E}_i[ (y^*_i - y_i)^2 ] }{ \mathbb{I}_i[\pi^*; (\pi_{i2}, \tau_{i1}, \tau_{i2}, x_{i2} - x_{i1}, y_i )] }$$
Russo, D., & Van Roy, B. (2016). An information-theoretic analysis of Thompson sampling. Journal of Machine Learning Research, 17(68), 1-30.
The one-sided regret due to the second policy $\pi_{i2}$ (the total regret decomposes as $\text{Reg} = \text{Reg}_1 + \text{Reg}_2$; see Theorem 3 below) is:
$$\begin{aligned} \mathbb{E}[\text{Reg}_2(T)] = \mathbb{E}[\text{Reg}_2(Nh)] = \mathbb{E} \left[ \sum_{i=1}^N \bar{r}^{\top}(x_i^* - x_{i2}) \right] \end{aligned}$$
$$\begin{aligned} \mathbb{E}[\text{Reg}_2(T)] &= \mathbb{E} \left[ \sum_{i=1}^N \bar{r}^{\top}(x_i^* - x_{i2}) \right] \\ &= \mathbb{E} \left[ \sum_{i=1}^N \left( \bar{r}^{\top}(x_i^* - x_{i1}) - \bar{r}^{\top}(x_{i2}- x_{i1}) \right) \right] \\ &= \mathbb{E} \left[ \sum_{i=1}^N (y_i^* - y_i) \right] \end{aligned}$$
where $y_i = \bar{r}^{\top} x_{i2}$ is the expected utility of the trajectory $\tau_{i2}$ generated by $\pi_{i2}$, and $y_i^* = \bar{r}^{\top} x_i^*$ is the expected utility of the optimal policy's trajectory.
When policy $\pi_{i1}$ is drawn from a fixed distribution:
$$ \mathbb{E}[ \text{Reg}_2(T)] \leq \sqrt{ \bar{\Gamma} \mathbb{H}[\pi^*] N }$$
where $\mathbb{H}[\pi^*]$ is the entropy of the optimal policy $\pi^*$ and $N$ is the number of DPS iterations.
Proof of Lemma 12: The expression we are working with (before applying Cauchy-Schwarz) is:
$$\mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] \leq \mathbb{E} \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t \, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right]$$
where $\Gamma_t$ is the per-step information ratio, $\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]$ is the information gain at time $t$, and the summation is taken over the entire time horizon $T$.
The Cauchy-Schwarz inequality for sums states that for any real sequences $\{a_t\}$ and $\{b_t\}$, we have:
$$\left( \sum_{t=1}^{T} a_t b_t \right)^2 \leq \left( \sum_{t=1}^{T} a_t^2 \right) \left( \sum_{t=1}^{T} b_t^2 \right)$$
Taking square roots on both sides, we get:
$$\sum_{t=1}^{T} a_t b_t \leq \sqrt{ \left( \sum_{t=1}^{T} a_t^2 \right) \left( \sum_{t=1}^{T} b_t^2 \right)}$$
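A quick numerical sanity check of this inequality on arbitrary random sequences:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = rng.random(50), rng.random(50)
# (sum a_t b_t)^2 <= (sum a_t^2)(sum b_t^2)
assert (a @ b) ** 2 <= (a @ a) * (b @ b) + 1e-12
```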
Now, applying this inequality to the regret expression, we associate $a_t$ with $\sqrt{\Gamma_t}$ and $b_t$ with $\sqrt{\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]}$. Specifically, we write the quantity
$$\mathbb{E} \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t \, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right]$$
with the summand expressed as a product of the two sequences:
$$\sum_{t=1}^{T} \sqrt{\Gamma_t} \cdot \sqrt{\mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]}.$$
Applying Cauchy-Schwarz to this sum gives:
$$\begin{aligned} \mathbb{E} & \left[ \sum_{t=1}^{T} \sqrt{\Gamma_t \, \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})]} \right] \\ &\leq \sqrt{ \mathbb{E} \left[ \sum_{t=1}^{T} \Gamma_t \right] \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})] \right]}. \end{aligned}$$
Assuming $\Gamma_t$ is bounded, say $\Gamma_t \leq \bar{\Gamma}$ for all $t$, the bound becomes:
$$\begin{aligned} \mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t[A^*; (A_t, Y_{t,A_t})] \right]} \end{aligned}$$
This step is where the bound on the regret is simplified using the total information gain across the horizon $T$. The bound scales with $\sqrt{T}$, which reflects the growth of regret with time, but it is also modulated by the total information gathered during the process.
Let $Z_t = (A_t, Y_{t,A_t})$. The expected one-step information gain can be written as a conditional mutual information:
$$\mathbb{E} \left[ \mathbb{I}_t \left[ A^* ; Z_t \right] \right] = \mathbb{I} \left[ A^* ; Z_t | Z_1, \dots, Z_{t-1} \right],$$
and the total information gain across all $T$ steps is:
$$\begin{aligned} \mathbb{E} \sum_{t=1}^T \mathbb{I}_t \left[ A^* ; Z_t \right] &= \sum_{t=1}^T \mathbb{I} \left( A^* ; Z_t | Z_1, \dots, Z_{t-1} \right)\\ &\stackrel{(c)}{=} \mathbb{I} \left[ A^* ; \left( Z_1, \dots, Z_T \right) \right]\\ &= \mathbb{H} \left[ A^* \right] - \mathbb{H} \left[ A^* | Z_1, \dots, Z_T \right] \stackrel{(d)}{\leq} \mathbb{H} \left[ A^* \right] \end{aligned}$$
where $(c)$ follows from the chain rule for mutual information, and $(d)$ follows from the non-negativity of (conditional) entropy.
Gathering all the pieces together, we have:
$$\begin{aligned} \mathbb{E} \left[ \text{Regret}(T, \pi^{TS}) \right] &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{E} \left[ \sum_{t=1}^{T} \mathbb{I}_t [A^*; (A_t, Y_{t,A_t})] \right]}\\ &\leq \sqrt{ T \cdot \bar{\Gamma} \cdot \mathbb{H}[A^*]} \end{aligned}$$
Returning to DPS and using $\mathbb{H}[\pi^*] \leq \log A^{Sh} = Sh \log A$ in the bound $\mathbb{E}[\text{Reg}_2(T)] \leq \sqrt{ \bar{\Gamma}\, \mathbb{H}[\pi^*]\, N }$:
$$ \mathbb{E}[\text{Reg}_2(T)] \leq \sqrt{ \bar{\Gamma} Sh N \log A } = \sqrt{\bar{\Gamma} ST \log A }$$
where $A^{Sh}$ is the number of deterministic (time-dependent) policies and $T = Nh$.
Combining this with the bound $\bar{\Gamma} \leq \frac{SA}{2}$ (Lemma 17, below) gives:
$$ \mathbb{E}[\text{Reg}_2(T)] \leq S \sqrt{ \frac{AT \log A}{2} }$$
Thus, with a finite set of $S$ states, $A$ actions, and $T$ steps, the regret of DPS in this case is bounded by $S \sqrt{ \frac{AT \log A}{2} }$.
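For a sense of scale, plugging hypothetical values of $S$, $A$, and $T$ into this bound:

```python
import numpy as np

S, A, T = 10, 4, 100_000                          # hypothetical problem sizes
reg2_bound = S * np.sqrt(A * T * np.log(A) / 2)   # bound on E[Reg_2(T)]
print(f"E[Reg_2(T)] <= {reg2_bound:.0f} for S={S}, A={A}, T={T}")
```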
The analysis also uses the following continuity property of mutual information: for random variables on finite sets, if $(X_n, Y_n)$ converges in distribution to $(X, Y)$, then
$$ \lim_{n \to \infty} \mathbb{I}[X_n; Y_n] = \mathbb{I}[X; Y]$$
Proof Outline
It suffices to show that the marginal and joint entropies converge:
$$\begin{aligned} \lim_{n \to \infty} \mathbb{H}[X_n] &= \mathbb{H}[X], & \lim_{n \to \infty} \mathbb{H}[Y_n] &= \mathbb{H}[Y], & \lim_{n \to \infty} \mathbb{H}[X_n, Y_n] &= \mathbb{H}[X, Y], \end{aligned}$$
because mutual information decomposes into entropies:
$$ \mathbb{I}[X; Y] = \mathbb{H}[X] + \mathbb{H}[Y] - \mathbb{H}[X, Y]$$
For a discrete random variable $X$ with distribution $P$ (and $X_n$ with distribution $P_n$), the entropies are:
$$ \mathbb{H}[X] = - \sum_{x \in \mathcal{X}} P(x) \log P(x)$$
$$ \mathbb{H}[X_n] = - \sum_{x \in \mathcal{X}} P_n(x) \log P_n(x)$$
Since $X_n \to X$ in distribution over the finite set $\mathcal{X}$, for every $x \in \mathcal{X}$ and every $\delta > 0$ there exists $n^*$ such that for all $n \geq n^*$,
$$ | P_n(x) - P(x) | < \delta,$$
i.e., $\lim_{n \to \infty} |P_n(x) - P(x)| = 0$. Because $p \mapsto p \log p$ is continuous on $[0, 1]$, $\delta$ can be chosen small enough that
$$ | P_n(x) \log P_n(x) - P(x) \log P(x) | < \epsilon'.$$
Choosing $\epsilon' = \epsilon / |\mathcal{X}|$, the triangle inequality gives
$$ | \mathbb{H}[X] - \mathbb{H}[X_n] | = \left| \sum_{x \in \mathcal{X}} \left( P(x) \log P(x) - P_n(x) \log P_n(x) \right) \right| \leq \sum_{x \in \mathcal{X}} \left| P(x) \log P(x) - P_n(x) \log P_n(x) \right| < \epsilon' |\mathcal{X}| = \epsilon.$$
The same argument applies to $Y_n$ and to the joint distribution of $(X_n, Y_n)$, so
$$\begin{aligned} \lim_{n \to \infty} \mathbb{H}[X_n] &= \mathbb{H}[X],\\ \lim_{n \to \infty} \mathbb{H}[Y_n] &= \mathbb{H}[Y],\\ \lim_{n \to \infty} \mathbb{H}[X_n, Y_n] &= \mathbb{H}[X, Y], \end{aligned}$$
and, by the decomposition of mutual information into entropies,
$$ \lim_{n \to \infty} \mathbb{I}[X_n; Y_n] = \mathbb{I}[X; Y].$$
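A small numerical illustration of this continuity, using a hypothetical joint distribution $P$ and a perturbed sequence $P_n$ that converges to it:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_info(P_xy):
    """I[X;Y] = H[X] + H[Y] - H[X,Y] for a joint pmf given as a matrix."""
    return entropy(P_xy.sum(axis=1)) + entropy(P_xy.sum(axis=0)) - entropy(P_xy.ravel())

P = np.array([[0.3, 0.1],
              [0.2, 0.4]])                    # hypothetical joint distribution of (X, Y)
U = np.full_like(P, 0.25)                     # uniform joint distribution

for n in (1, 10, 100, 1000):
    P_n = (1 - 1 / n) * P + (1 / n) * U       # P_n -> P as n -> infinity
    print(n, abs(mutual_info(P_n) - mutual_info(P)))
```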
Lemma 17: If $\pi_{i1}$ is drawn from a fixed distribution, then asymptotically $\Gamma_{\pi_{i1} \text{ fixed}} \leq \frac{SA}{2}$. Now suppose instead that $\pi_{i1}$ is drawn from a distribution that drifts over iterations but converges to a fixed distribution. Then the information ratio $\Gamma_i$ for $\pi_{i2}$’s one-sided regret satisfies:
$$ \lim_{i \to \infty} \Gamma_i \leq \Gamma_{\pi_{i1} \text{ fixed}} \leq \frac{SA}{2}$$
Recall the definition of the information ratio $ \Gamma_i$:
$$\Gamma_i = \frac{ \mathbb{E}_i[ (y^*_i - y_i)^2 ] }{ \mathbb{I}_i[\pi^*; (\pi_{i2}, \tau_{i1}, \tau_{i2}, x_{i2} - x_{i1}, y_i )] }$$
Consider the conditional-expectation term arising in the analysis of the numerator:
$$ \mathbb{E}\left[ \bar{r}^{\top} (x_i^* - x_{i2}) \Big| \mathcal{H}_i^{(2)} \right]^2$$
This term does not depend on $x_{i1}$, so it is unaffected by the distribution of $\pi_{i1}$.
For any $\epsilon > 0$, there exists an iteration $i_0$ such that for all $i > i_0$,
$$ \Gamma_i \leq \Gamma_{\pi_{i1} \text{ fixed}} + \epsilon \leq \frac{SA}{2} + \epsilon.$$
Since $\epsilon$ is arbitrary, $\lim_{i \to \infty} \Gamma_i \leq \frac{SA}{2}$, i.e., $\Gamma_i$ is asymptotically bounded by $\frac{SA}{2}$.
Recall Theorem 1, Theorem 2, and Lemma 17; combining them yields the main regret bound.
Theorem 3: With probability $1 - \delta$, the expected Bayesian regret $\mathbb{E}[\text{Reg}(T)]$ of DPS achieves an asymptotic rate:
$$\begin{aligned} \mathbb{E}[\text{Reg}(T)] &= \mathbb{E} \left\{ \sum_{i=1}^{\left\lceil \frac{T}{2h} \right\rceil} \sum_{s \in \mathcal{S}} p_0(s) \left[ 2V_{\pi^*,1}(s) - V_{\pi_{i1},1}(s) - V_{\pi_{i2},1}(s) \right] \right\}\\ &= \mathbb{E}[\text{Reg}_1(T)] + \mathbb{E}[\text{Reg}_2(T)] \\ &\leq S \sqrt{ \frac{AT \log A}{2} } + S \sqrt{ \frac{AT \log A}{2} } = S \sqrt{2AT \log A} \end{aligned}$$
Empirical performance of DPS: Each simulated environment is shown under the two least-noisy user preference models that were evaluated. The plots show DPS with three models: Gaussian process regression (GPR), Bayesian linear regression, and a Gaussian process preference model.
Plots display the mean ± one standard deviation over 100 runs of each algorithm tested. Overall, we see that DPS performs well and is robust to the choice of credit assignment model.