Model-Based Preference Optimization:

Metric Elicitation

Sanmi Koyejo

1. Evaluating Models

    Model 1
  • 94.1% Accuracy
  • 90% false positives
  • 5% false negatives
    Model 2
  • 89.6% Accuracy
  • 50% false positives
  • 1% false negatives
    Model 3
  • 80.1% Accuracy
  • 10% false positives
  • 20% false negatives
  • What are the tradeoffs of the 3 binary classifiers above? Which would you choose?
  • You'll quickly realize that the answer depends on the context in which the classifier is used.
  • In the case of cancer diagnosis, a false negative could result in death. However, in the context of recidivism prediction, a false positive means unjustly keeping someone behind bars for too long. (A quick sketch of this trade-off follows below.)
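As a rough illustration of how the preferred model flips with context (assuming, purely for illustration, that the percentages above are false positive and false negative rates, and that the relative error costs below are plausible), consider this sketch:

```python
# Illustrative only: rank the three classifiers under different error costs,
# assuming the listed percentages are false positive / false negative rates.
models = {
    "Model 1": {"fpr": 0.90, "fnr": 0.05},
    "Model 2": {"fpr": 0.50, "fnr": 0.01},
    "Model 3": {"fpr": 0.10, "fnr": 0.20},
}

def weighted_cost(rates, c_fp, c_fn):
    """Expected cost when a false positive costs c_fp and a false negative costs c_fn."""
    return c_fp * rates["fpr"] + c_fn * rates["fnr"]

# Cancer screening: missing a case (FN) is far worse than a false alarm (FP).
print(min(models, key=lambda m: weighted_cost(models[m], c_fp=1, c_fn=20)))   # Model 2
# Recidivism prediction: a false positive (unjust detention) is the costly error.
print(min(models, key=lambda m: weighted_cost(models[m], c_fp=20, c_fn=1)))   # Model 3
```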

1. Evaluating Models

Scikit-learn Metrics
  • When evaluating models, we need to consider the relative cost/benefit of different kinds of errors.
  • A metric is a quantitative description of these tradeoffs.
  • For many regression and classification tasks, a variety of popular metrics are available off the shelf (a minimal scikit-learn sketch follows below).
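As a minimal sketch of what "off the shelf" means in practice, the snippet below uses standard scikit-learn metric functions on made-up labels; the F-beta score is one common way to encode a relative cost between false negatives and false positives:

```python
# Minimal sketch of off-the-shelf metrics in scikit-learn (labels are made up).
from sklearn.metrics import accuracy_score, confusion_matrix, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))              # fraction of correct predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)                                       # raw counts of each error type

# F-beta encodes the relative cost of errors: beta > 1 emphasizes recall
# (penalizes false negatives), beta < 1 emphasizes precision (false positives).
print(fbeta_score(y_true, y_pred, beta=2.0))
print(fbeta_score(y_true, y_pred, beta=0.5))
```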

1. Evaluating Models

  • One must be careful to choose an appropriate metric for a given task. Metrics are not interchangeable: different metrics evaluate different aspects of model performance.
  • Cremonesi, Koren, and Turrin (2010) found that the RMSE metric used to evaluate models submitted to the Netflix Prize competition did not translate well to top-N ranking accuracy, which more directly reflects what users actually see.
  • In some cases, metrics can even be contradictory. The COMPAS model used to predict recidivism in criminal cases was calibrated to have equal accuracy across demographic groups. Nevertheless, an analysis by ProPublica found that "blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend," despite the calibration.

2. Metric Elicitation through Preferences

Metric Elicitation through Preferences
  • Determining metrics by interacting with individual stakeholders
    • Hiranandani et al., "Performance Metric Elicitation from Pairwise Classifier Comparisons"
    • Hiranandani et al., "Multiclass Performance Metric Elicitation"
    • Hiranandani et al., "Fair Performance Metric Elicitation"
  • Metric elicitation from stakeholder groups
    • Robertson et al., "Probabilistic Performance Metric Elicitation"

2. Metric Elicitation through Preferences

  • Preference elicitation is studied in economics and psychology (Samuelson, 1938; Varian, 2005)
  • Elicitation is related to (contextual dueling) bandits when the focus is on recovering the reward function (Yue et al., 2012; Dudik et al., 2015)
  • Inverse reinforcement learning generalizes elicitation when the reward is transportable (Amin, 2017)

2. Metric Elicitation through Preferences

Confusion matrix figure
  • We study work by Hiranandani et al. (cited above) on eliciting linear binary classification metrics in the noise-free setting.
  • Assume we have a data-generating distribution $\mathcal{P}$ over the space of inputs $X$ and outputs $Y\in\{0, 1\}$. The set of all classifiers is denoted $\mathcal{H}=\{h:X\mapsto [0, 1]\}$. For a given classifier $h$, the entries of its confusion matrix $\mathbf{C}(h)$ are as follows (a sketch of estimating these rates from samples appears after this list):
    • $TP(h)=\Pr(Y=1, h=1)$
    • $FP(h)=\Pr(Y=0, h=1)$
    • $FN(h)=\Pr(Y=1, h=0)$
    • $TN(h)=\Pr(Y=0, h=0)$
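As a small sketch (plain NumPy; the helper name is ours, not from the paper), these four joint probabilities can be estimated from labeled samples and a classifier's hard predictions:

```python
import numpy as np

def confusion_rates(y, h):
    """Empirical analogue of C(h): the joint probabilities
    TP = Pr(Y=1, h=1), FP = Pr(Y=0, h=1), FN = Pr(Y=1, h=0), TN = Pr(Y=0, h=0),
    estimated from labels y and hard predictions h, both in {0, 1}."""
    y, h = np.asarray(y), np.asarray(h)
    return {
        "TP": np.mean((y == 1) & (h == 1)),
        "FP": np.mean((y == 0) & (h == 1)),
        "FN": np.mean((y == 1) & (h == 0)),
        "TN": np.mean((y == 0) & (h == 0)),
    }
```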

2. Metric Elicitation through Preferences

  • The metric used by the oracle to evaluate a classifier $h$ is given by $$\phi^*(h)=1-(a_1^* FP(h) + a_2^* FN(h)).$$
  • We do not know $a_1^*$ and $a_2^*$ (which weigh the relative costs of false positives and false negatives) a priori. The objective is to find $\text{argmax}_h \phi^*(h)$ by querying the oracle with pairs of confusion matrices $\mathbf{C}(h_1), \mathbf{C}(h_2)$, upon which the oracle returns the classifier it prefers w.r.t. $\phi^*$. Ideally, we want to minimize the number of queries used (i.e., the query/sample complexity). A toy oracle of this form is sketched below.
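To make the query model concrete, here is a toy oracle; the hidden weights and function names are illustrative, not from the paper:

```python
# Toy preference oracle: compares two confusion matrices (dicts with FP/FN rates)
# under a latent linear metric phi*. The weights are hidden from the learner.
A1_STAR, A2_STAR = 0.7, 0.3

def phi_star(conf):
    """Oracle's latent metric: higher is better."""
    return 1.0 - (A1_STAR * conf["FP"] + A2_STAR * conf["FN"])

def oracle_prefers(conf1, conf2):
    """Return 1 if the oracle prefers the first classifier's confusion matrix, else 2."""
    return 1 if phi_star(conf1) >= phi_star(conf2) else 2
```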

2. Metric Elicitation through Preferences

Confusion matrix convexity
  • The above plot illustrates the region of attainable $(TP, TN)$ values across all classifiers in $\mathcal{H}$. Note that, given $TP$ and $TN$, the remaining entries follow independently of the classifier, since $FN = \Pr(Y=1) - TP$ and $FP = \Pr(Y=0) - TN$; hence there are only $2$ degrees of freedom.
  • The region can be shown to be convex. Since $\phi^*$ is a linear function over it, the optimal classifier must lie on the boundary! (A sketch tracing this boundary on synthetic data follows below.)
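One rough way to see this boundary: on synthetic data with a known posterior $\Pr(Y=1\mid x)$, sweeping thresholds on that posterior traces out the upper boundary of the attainable $(TP, TN)$ region, one point per threshold classifier (the data and numbers below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known posterior eta(x) = Pr(Y = 1 | x).
x = rng.uniform(-3, 3, size=50_000)
eta = 1 / (1 + np.exp(-2 * x))            # true class-1 probability
y = rng.binomial(1, eta)

# Each threshold on eta gives one classifier on the upper boundary of the
# attainable (TP, TN) region; sweeping the threshold traces the boundary.
for t in np.linspace(0.05, 0.95, 10):
    h = (eta >= t).astype(int)
    tp = np.mean((y == 1) & (h == 1))
    tn = np.mean((y == 0) & (h == 0))
    print(f"threshold={t:.2f}  TP={tp:.3f}  TN={tn:.3f}")
```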

2. Metric Elicitation through Preferences

Unrolling boundary
  • As a result, we can unroll the boundary and use a binary-search-style algorithm to construct queries for the oracle (a simplified sketch follows after this list).
  • The authors showed that the resulting classifier is guaranteed to be $\epsilon$-accurate within $O(\log \frac{1}{\epsilon})$ queries. This holds even under system noise (e.g., noisy responses from the oracle, handled via probabilistic extensions).
  • Follow-up work extended the approach to multiclass classification, more complex metrics, and stakeholder groups.
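Below is a simplified, self-contained sketch of the search idea, not the authors' exact algorithm: it runs a comparison-based search over a one-dimensional family of threshold classifiers, using pairwise oracle answers to home in on the oracle-optimal threshold, which (by the cost-sensitive Bayes rule) equals $a_1^*/(a_1^*+a_2^*)$ and so reveals the trade-off up to scale. The data, names, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known posterior eta(x) = Pr(Y = 1 | x).
x = rng.uniform(-3, 3, size=200_000)
eta = 1 / (1 + np.exp(-2 * x))
y = rng.binomial(1, eta)

A1_STAR, A2_STAR = 0.7, 0.3          # oracle's hidden error weights (a1* + a2* = 1)

def confusion(threshold):
    """FP and FN rates of the threshold classifier h(x) = 1[eta(x) >= threshold]."""
    h = (eta >= threshold).astype(int)
    fp = np.mean((y == 0) & (h == 1))
    fn = np.mean((y == 1) & (h == 0))
    return fp, fn

def oracle_prefers(t1, t2):
    """Oracle compares the two classifiers' confusion matrices via its latent phi*."""
    phi = lambda t: 1.0 - (A1_STAR * confusion(t)[0] + A2_STAR * confusion(t)[1])
    return 1 if phi(t1) >= phi(t2) else 2

# Comparison-based search over thresholds: phi* is unimodal along this family,
# so each pairwise query shrinks the search interval by a constant factor,
# giving O(log 1/eps) queries for an eps-accurate answer.
low, high, queries = 0.0, 1.0, 0
while high - low > 1e-3:
    m1 = low + (high - low) / 3
    m2 = high - (high - low) / 3
    if oracle_prefers(m1, m2) == 1:
        high = m2                     # the oracle-optimal threshold lies in [low, m2]
    else:
        low = m1                      # it lies in [m1, high]
    queries += 1

t_hat = (low + high) / 2
# Cost-sensitive Bayes rule: the optimal threshold is a1*/(a1* + a2*), so the
# recovered threshold identifies the weights up to scale.
print(f"{queries} queries, estimated a1:a2 ≈ {t_hat:.2f} : {1 - t_hat:.2f} (true 0.70 : 0.30)")
```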