Pairwise comparison experiments collect data on which of two alternatives is preferred, if either is. In online platforms like allourideas.com respondents are presented with two alternatives, one on the left of the screen and one on the right, as well as an (optional) "I can't decide" button. They can choose one of these three options — vote for one of the two alternatives or abstain alltogether — before moving onto another task or leaving the site. There are no task count limits; a respondent can answer none, one, or very very many questions at the site. Wikisurveys, in particular, also allow respondents to *submit* alternatives to be considered by others

The purpose of some of these surveys, particularly at allourideas.com, is to infer alternative rankings. Who had the "worst year in Washington" in 2010?

Finding answers, given only pairwise comparisons, involves modeling and inference. Some model must be posed that, ultimately, has a description of global ranking of alternatives (or their features). Then this model must be fit to the observed choices, along some measure of confidence in estimates. With this estimated model researchers can make judgements about global ranking of alternatives (or their features) without ever having actually witnessed a global ranking.

This article describes a "new" way to model pairwise comparison data like wikisurvey data. Well, our proposal isn't that "new" really. We basically propose to model this data with the Logit model *both* coefficients that model shared "opinions" about alternatives (or features) *and* "idiosyncratic deviations", or respondent-specific effects, about those alternatives (or features). A particular penalization technique generally common in statistics and machine learning enables estimation of the "fewest, smallest" idiosyncratic deviations. Our specific use of this penalization is particular: we penalize a *subset* of the estimated coefficients (the deviations), while most existing techniques penalize *all* coefficients equally. Because of this structure, we call this type of model the idLogit.

We have several motivations to use this type of model.

The first is tractability. The idLogit has a convex MLE problem, and thus has unique estimates conditional on the data.*strongly* convex.*very* strong numerical methods for efficiently computing coefficient estimates. There are even "online" algorithms that could analyze response data as it comes in.

A second motivation is that we get an essentially non-parametric portrait of heterogeneity in the respondent pool. No priors or other a priori assumptions on how "opinions" are distributed are required, other than an identifying assumption that they have zero mean over the population within alternatives (or features). But that is totally generic anyway; whatever the distribution is, we can shift it to be mean zero. With idLogit we can estimate the shared coefficients and deviations, and *then* ask what the heterogeneity looks like.

A third motivation is generalizability. Essentially the same model (and more or less the same estimation code) applies to binary response data, pairwise comparisons, pairwise comparisons with a no-choice option, and even multinomial models (which we don't consider here). Perhaps the availability of a method to include the no-choice option is most immediately relevant to wikisurvey analysts. The inability of a respondent to judge between two alternatives certainly imparts information about their relative comparison and ranking. But the primary method available on the platform method currently ignores this information.

Two major caveats are in order.

First, the idLogit is still a *Logit* model of choice, within individuals. This means that if individual response behavior plausibly violates features of the Logit model, adding heterogeneity with the idLogit isn't a magic fix. Specifically, the Logit has the independence of irrelevant alternatives (IIA) property: the ratio of the probabilities that any two alternatives are chosen — from the full set of all alternatives — is independent of all other alternatives. In pairwise comparisons, however, IIA seems a somewhat natural property: when respondents only see and evaluate pairs, how can any other option be relevant? For wikisurveys, in particular, this property says that we cannot model any changes in one respondent's preferences over pairwise comparisons of existing alternatives when someone adds a new alternative. Moreover, if you would accept the correction of IIA obtained with other generalizations of the Logit (e.g., Mixing with Latent Classes or Random Coefficients) then you should accept it with idLogit, too. basically, that the idLogit is still a Logit doesn't sound so bad, but researchers should keep it in mind.

Second, the idLogit itself doesn't really tell us how to predict responses for new respondents. That is, while we get a non-parametric view of heterogeneity, we get a view *specific to the individuals studied*. If we want to predict observations from new respondents, we have to either accept the model with the shared opinions only, or we have to model the pattern of the deviations in some way that enables extrapolation. This isn't a problem specific to idLogit, by any means. Random Coefficient Logit models and Bayesian methods "avoid" this problem by presuming what the distribution is to begin with. For Latent Class Logit models, a new respondent would have to be assigned to a class; this can itself be done with multinomial Logit modeling built right into the Latent Class specification if respondent characteristics (demographics) are collected.

Some notes on notation: we try to be precise enough to make sense, but not so precise we lose the forest for the trees. Bold symbols (e.g., $\mathbf{L}$, $\boldsymbol{\beta}$) typically refer to vectors. We index with subscripts (e.g., $x_n$) or using "functional" notation (e.g., $x(n)$) depending on which probably makes sense in the context.

Oh, and in the spirit of Distill (from which we draw our article style) this is an open article. You can see all the source, notebooks, and assets on GitLab, and even submit issues on GitLab if you see an error. If I fix something in the article based on a issue I’ll record the change here, in the appendices.

In a set of $N$ pairwise, online comparisons we observe vectors of "left" $\mathbf{L} \in \{1,\dotsc,A\}^N$ and "right" $\mathbf{R} \in \{1,\dotsc,A\}^N$ alternatives, as well as outcomes $\mathbf{y} \in \{\pm 1\}^N$ where $A$ is the number of alternatives. We presume $y_n = +1$ if the left alterative was chosen, and $y_n = -1$ if the right alternative was chosen. We assume each alternative has a corresponding vector of "characteristics" or "features" $\mathbf{x}_a \in \mathbb{R}^K$. There are many treatments of how to analyze this and related discrete choice data, including

Suppose we let the probability that the option on the left wins be
$$
P_n(\boldsymbol{\beta}) = \frac{ e^{ \mathbf{x}_{L_n}^\top \boldsymbol{\beta} } }{ e^{\mathbf{x}_{L_n}^\top \boldsymbol{\beta}} + e^{\mathbf{x}_{R_n}^\top \boldsymbol{\beta}} }.
$$
This isn't an entirely arbitrary choice; if we consider $\mathbf{x}_a^\top \boldsymbol{\beta}$ an alternative's “utility”, then this formula for the choice probability follows from Random Utility Maximization with errors following an i.i.d. extreme value distribution.

For example, suppose we let there be a coefficient for each alternative; that is, the "features" are dummy variables or alternative specific constants (ASCs) only. The inner product $\mathbf{x}_a^\top \boldsymbol{\beta}$ is simply a selection operator: $\mathbf{x}_a^\top \boldsymbol{\beta} = \beta_a$. Equivalently, $\mathbf{x}_a \in \mathbb{R}^A$ is a vector all of whose entries are zero, except $x_a = 1$. An identification constraint could be $\sum_{a=1}^A \beta_a = 0$ because uniform shifts in the coefficients $\boldsymbol{\beta} \mapsto \boldsymbol{\beta} + \delta$ will not change the modeled choice probabilities. An equivalent strategy in this case is to remove the constraint but "leave a coefficient out" by fixing $\beta_{a_f} = 0$ for any $a_f \in \{1,\dotsc,A\}$.

Among several model shortcomings, this Logit model does not account for the fact that some subset of the responses are probably from the same people, and people behave differently. If we can account for those differences, we should be able to increase the proportion of variation in the data explained by our models. Below we sketch a few different approaches to this, to frame our proposed approach.

The Latent-Class Logit model presumes that there is a collection of "groups" or "classes" of respondents, each with distinct behavior defined by different coefficient vectors. Let $$ \boldsymbol{\theta} = ( \; \boldsymbol{\beta}_1 \; , \; \dotsc \; , \; \boldsymbol{\beta}_C \; , \boldsymbol{\rho} \; ) $$ where $\boldsymbol{\rho} \in \mathbb{R}^C$ are class occurence probabilities; that is, $\mathbb{P}(\boldsymbol{\beta}=\boldsymbol{\beta}_c|\boldsymbol{\theta}) = \rho_c$. Hence $\rho_c \geq 0$ for all $c$, $\sum_{c=1}^C \rho_c = 1$. Then the probability that the left alternative wins is $$ P_{n}^{LCL}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n}(\boldsymbol{\beta}_c) $$ where the subscripts are placeholders for the choice set in observation $n$. If we have to impose identification constraints on the coefficients, these need to be imposed on each class' coefficients. Let's also write $$ P_{n,1}^{LCL}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n,1}(\boldsymbol{\beta}_c) \quad\quad\text{and}\quad\quad P_{n,-1}^{LCL}(\boldsymbol{\theta}) = 1 - P_{n,1}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n,-1}(\boldsymbol{\beta}_c) $$ so that the more general label "$P_{n,y_n}^{LCL}(\boldsymbol{\theta})$" for $y_n \in \{\pm 1\}$ makes sense.

The log-likelihood is complicated because individuals fall into specific classes *that are consistent across choices*. This is to suppose there are i.i.d. random variables $C_i \in \{1,\dotsc,C\}$ with mass function $\mathbb{P}(C_i =c) = \rho_c$. Then
$$
\begin{aligned}
&\prod_{i=1}^I \mathbb{P}( \; \text{observing all choices in } \mathcal{O}_i \; ) \\
&\quad\quad\quad\quad
= \prod_{i=1}^I \sum_{c=1}^C \rho_c \mathbb{P}( \; \text{observing all choices in } \mathcal{O}_i \; | \; C_i = c ) \\
&\quad\quad\quad\quad
= \prod_{i=1}^I \left[ \sum_{c=1}^C \rho_c \prod_{n \in \mathcal{O}_i} P_{n,y(n)}(\boldsymbol{\beta}_c) \right]
\end{aligned}
$$
where $\mathcal{O}_i$ is the set of observations $n$ corresponding to individual $i$. From this we can see the (scaled) LCL log-likelihood is
$$
\begin{aligned}
&\frac{1}{N} \sum_{i=1}^I \log \left( \sum_{c=1}^C \rho_c \prod_{n \in \mathcal{O}_i} P_{n,y(n)}(\boldsymbol{\beta}_c) \right) \\
&\quad\quad\quad\quad
= \frac{1}{N} \sum_{i=1}^I \log \left( \sum_{c=1}^C \rho_c \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}(\boldsymbol{\beta}_c) \right\} \right)
\end{aligned}
$$
See, for example,

Computing estimates $\boldsymbol{\theta} = (\boldsymbol{\beta}_1,\dotsc,\boldsymbol{\beta}_C,\boldsymbol{\rho})$ conditional on the data with nonlinear programming methods is nontrivial. A reasonably complete LCL MLE problem, for a fixed number of classes $C$, is $$ \begin{aligned} \max &\quad \frac{1}{N} \sum_{r=1}^R \log \left[ \sum_{c=1}^C \rho_c \exp \left\{ \sum_{n \in \mathcal{O}_r } \log P_{n,y(n)}(\boldsymbol{\beta}_c) \right\} \right] \\ \text{w.r.t.} &\quad \boldsymbol{\beta}_1 , \dotsc , \boldsymbol{\beta}_C , \boldsymbol{\rho} \geq \mathbf{0} \\ \text{s.to} &\quad \sum_{c=1}^C \rho_c = 1 \;\; , \;\; \rho_1 \geq \dotsb \geq \rho_C \;\; \text{ plus identification constraints} \end{aligned} $$ The extra constraint $\rho_1 \geq \dotsb \geq \rho_C$ is important to ruling out certain types of symmetric solutions (class-coefficient permuted solutions) that could make proper estimation with nonlinear programming methods very difficult.

Instead, LCL estimation can be implemented with Expectation-Maximization algorithm.

A potentially serious shortcoming of the LCL modelling approach is that an LCL models with $C$ classes “nests” any simple Logit as well as any “$(C-h)$”-class model estimates as, at least, stationary points for its MLE problem. Bayesian methods even display “funneling” phenomena that reflect reduction of variance parameters in actual estimations

Random Coefficient Logit (RCL) models do exactly what the name says: they take the coefficients to be random variables over the population.

The foundational example for RCL supposes coefficients that are normally distributed over the population.

These are tweaks on a theme: let the $\boldsymbol{\beta}$s be drawn from some distribution, and let the coefficients we estimate be the parameters required to define that distribution. Hence the name "Random Coefficients" models. The broad conceptualization is that "nature" chooses a vector of coefficients for each respondent, after which respondents choose amongst alternatives. This is a powerful concept, exemplified by McFadden & Train's result that RCL is dense in the class of random utility maximization models of choice.*some* distribution of coefficients that recovers that behavior's choice probabilities.

In practice, though, simple distributions are chosen: normal, truncated normal, log-normal, $\chi^2$, etc, and most often independence between random coefficients is assumed. These simplifications mean alot. Primarily, the model class is significantly smaller than the "dense" set of mixtures that can represent any RUM. RCL models with these convenient, but restrictive assumptions aren't really those generic models hinted at in McFadden & Train's density theorem.*a priori* simplifications don't really make estimating RCL models easier.

Like LCL, with RCL we have to carefully derive the likelihood, imagining each individual "draws" a coefficient, and then responds to the sequence of choice tasks. This generates the MLE problem
\begin{equation}
\begin{aligned}
\max &\quad \frac{1}{N} \sum_{r=1}^R \log \left( \int \exp \left\{ \sum_{n \in \mathcal{O}_r } \log P_{n,y(n)}^L(\boldsymbol{\beta}) \right\} f(\boldsymbol{\beta}|\boldsymbol{\theta}) d\boldsymbol{\beta} \right) \\
\text{w.r.t.} &\quad \boldsymbol{\theta}
\end{aligned}
\end{equation}
Distributional parameters can then be estimated using maximum simulated likelihood.

There is even more work to do though. If we can "standardize" the distribution of the coefficients to some $D_0$ so that we can draw $\boldsymbol{\nu} \sim D_0$ and set $\boldsymbol{\beta} = \xi(\boldsymbol{\theta}|\boldsymbol{\nu})$. The easiest example is the independent multivariate normal distribution with $\boldsymbol{\theta} = (\boldsymbol{\mu},\boldsymbol{\sigma})$, whose standardization is $$ \xi(\boldsymbol{\nu}) = \boldsymbol{\mu} + \mathrm{diag}(\boldsymbol{\sigma}) \boldsymbol{\nu}. $$ More generally, if $\boldsymbol{\beta} \sim N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, then $$ \boldsymbol{\beta} = \boldsymbol{\mu} + \mathbf{R} \boldsymbol{\nu} = \xi(\boldsymbol{\nu}|\boldsymbol{\mu},\mathbf{R}) $$ where $\mathbf{R}$ is the Cholesky factor of $\boldsymbol{\Sigma}$. If a particular coefficient is lognormal, for example, we could write $$ \beta_k = \xi_k(\mu_k,\sigma_k|\nu_k) = \exp\{ \mu_k + \sigma_k \nu_k \}. $$ This has been used when coefficients over features should be signed, such as with prices.

With standardization we can draw $RS$ samples $\boldsymbol{\nu}_{r,s}$ for all $r$ and $s = 1,\dotsc,S$ to write
\begin{equation}
\begin{aligned}
\max &\quad \frac{1}{N} \sum_{r=1}^R \log \left( \frac{1}{S} \sum_{s=1}^S \exp \left\{ \sum_{n \in \mathcal{O}_r } \log P_{n,y(n)}^L( \xi(\boldsymbol{\theta}|\boldsymbol{\nu}_{r,s}) ) \right\} \right) \\
\text{w.r.t.} &\quad \boldsymbol{\theta}
\end{aligned}
\end{equation}
Quadrature methods

Like LCL models RCL models, at least with normally distributed parameters, have MLE problems with trivial ”zero-variance” stationary points. Of course, this is probably closely still related to “funneling” phenomena in Bayesian hierarchical models

Regardless of how we estimate an RCL model, it can be very hard to do and remains somewhat specialized. Hopefully, our discussion above illustrates this. We also have to *assume* the distribution of heterogeneity first, and estimate that distribution's parameters using the data. Perhaps worse, we are not aware of any techniques for statistically testing distributional assumptions. Non-convexity in the MLE problem also creates controversy over estimator quality.

TBD

We propose the following: Let each individual $i$ have their own coefficient vector $\boldsymbol{\theta}_i$ decomposed as
$$
\boldsymbol{\theta}_i = \boldsymbol{\beta} + \boldsymbol{\delta}_i
$$
That is, a *shared* opinion about feature value $\boldsymbol{\beta}$ plus some "person-specific effect" or "idiosyncratic deviation" $\boldsymbol{\delta}_i$ from that shared opinion. Regardless of any identification constraints, we constrain $\sum_{i=1}^I \delta_{i,k} = 0$ for all $k$ to align the $\boldsymbol{\beta}$ values with "mean" effects. Note that we are basically denoting Salganik & Levy's

There will be far to much flexibility introduced into our model without further constraints. Our goal will be to estimate coefficients “*minimizing the effect*” of the idiosyncratic deviations. We choose to do this by solving the following penalized MLE formulation
$$
\begin{aligned}
\min &\quad \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ y_n(\mathbf{x}_{R_n} - \mathbf{x}_{L_n})^\top(\boldsymbol{\beta}+\boldsymbol{\delta}_{i_n} ) } \right)
+ \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1
+ \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\
\text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^K \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IK} \\
\text{s.to}
&\quad \sum_{i=1}^I \delta_{i,k} = 0 \;\; \text{for all} \;\; k \text{, plus identification constraints}
\end{aligned}
$$
where $i_n$ is the individual who responded in observation $n$, $\Lambda_1,\Lambda_2 \geq 0$ are penalty parameters, the 1-norm, $|| \mathbf{x} ||_1$, is the sum of the absolute values of the elements of $\mathbf{x}$, and the squared 2-norm, $|| \mathbf{x} ||_2^2$, is the sum of squares of a vector. Penalizing with the 2-norm will tend to *shrink* estimates towards zero. Penalizing with the 1-norm is a well-known technique to introduce *sparsity* in statistical estimates. Applying both these penalties is often called an "elastic net" penalty

Notice, though, that we apply these penalties only to *some* of the coefficients we intend to estimate: specifically, the idiosyncratic deviations. In effect, what this problem does is finds the *fewest, smallest* nonzero idiosyncratic deviations as well as a shared coefficient vector, depending on the values of $\Lambda_1,\Lambda_2$.

This approach has some useful advantages over other methods for modeling response heterogeneity that derive from a single property: *convexity*

There are also probably advantages unrelated to mathematical and computational tractability imparted by convexity. For all intensive purposes, the idLogit is a *non-parametric* model of preference heterogeneity. Besides imposing constraints that require that the deviations to be mean zero across individuals within features, we don't have to make *any* other assumptions about the idiosyncracies of choice. More or less, the idLogit shows you what the data says about heterogeneity.

This is not the case with other models. To use a Random Coefficient Logit model, for example, you have to specify the *distributional family* of feature preference heterogeneity, and estimate some family parameters (e.g., mean and/or standard deviation) to get your model; to use Bayesian methods you have to choose a prior that imparts some *a priori* structure on the distribution of preference heterogeneity. Latent Class models are perhaps the most non-parametric, with the exception that you have to fix the number of classes to estimate a model. However that has some similarities to the penalty parameter flexibility in the idLogit model.

The idLogit is also relatively easy to generalize to model additional features of choice behavior. For example, we discuss below how to create a model of wikisurvey data that includes the no-choice option. With idLogit, this is almost trivial. It would also be nearly trivial to include a left-right effect term, to model heterogeneous placement effects, by estimating a model with left-choice probabilities $$ \frac{ 1 }{ 1 + e^{(\mathbf{x}_{R_n}-\mathbf{x}_{L_n})^\top( \boldsymbol{\beta} + \boldsymbol{\delta}_{i_n} ) - \gamma_{i_n} } }. $$ Extensions to choice between lots of options (that is, more than two) would also be fairly straightforward.

Computing point estimates is of limited value; any reasonable inferences for experimental work must rely on some measures of variability and/or confidence in those estimates. We can also compute confidence metrics for idLogit estimates or their implied rankings either with traditional second order approximation methods, if we choose, or bootstrapping methods

We should take a moment to compare idLogit to *model coefficients themselves*:
$$
\min_{\boldsymbol{\theta}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n\mathbf{x}_n^\top\boldsymbol{\theta} } \right)
+ \Lambda_1 || \boldsymbol{\theta} ||_1 + \frac{\Lambda_2}{2} || \boldsymbol{\theta} ||_2^2 \right\}
$$
This is a strong concept when the feature space is large ($K$ is large) relative to the number of observations as in, for example, sentiment analysis *some* of the coefficients. Thus idLogit falls into a more general category of problems that penalize *re-parameterizations* of the coefficients:
$$
\min_{\boldsymbol{\theta}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n\mathbf{x}_n^\top\boldsymbol{\theta} } \right)
+ \frac{\Lambda_1}{N} || \mathbf{A}\boldsymbol{\theta} ||_1 + \frac{\Lambda_2}{2N} || \mathbf{A}\boldsymbol{\theta} ||_2^2 \right\}
$$
for some matrix $\mathbf{A}$. In a way idLogit is more aligned with how methods like Trend Filtering

*data* to reflect the *constraints on the coefficients*, if we can compute a nullspace basis of $\mathbf{S}$ (which isn't always feasible). Notice, however, that we are penalizing different coefficients: $\boldsymbol{\nu}$ instead of $\boldsymbol{\theta} = \mathbf{W}\boldsymbol{\nu}$. This means that

In principle, of course! But we should be clear what we mean here.

A Probit model, basically the model proposed by Salganik & Levy,*convex*,

We don't recommend taking this approach because there isn't an *a priori* reason to expect coefficients estimated in this way to be substantively different than coeffcients estimated using idLogit. Actually the opposite; the distribution of choices described by Logit probabilities is *very* similar to that described by a normal distribution. Moreover, the exponential form of the Logit probabilities makes powerful Exponential Cone Programming

We estimate idLogit models in python using the CVXPy package

CVXPy takes this problem description and “compiles” it into a form suitable for a “target” convex programming solvers. idLogit requires a solver like the ECOS solver

Our actual code is a fair bit more complicated though, with a lot of overhead devoted to parallelization and penalty parameter sweeps. The construction of the matrices *rows* are $y_n(\mathbf{x}_{R_n}-\mathbf{x}_{L_n})$; with the deviations, we have to include these terms but also *replicate* these vectors in the right columns of a much larger

Wikisurvey data, as collected through allourideas.com

We have also analyzed the (public) results of the Washington Post's "Who had the Worst Year in Washington?" wikisurvey (which we'll abbreviate as "WYIW").

To be specific: the WYIW data has 76,632 observations (votes), from either 14,671 or 4,116 respondents — depending on whether we take an individual respondent to be associated with a Session ID or a (hashed) IP address — choosing between 67 alternatives in total. We'll focus on the hashed IP version with 4,116 individuals. We'll ignore "I can't decide" responses for now.

This is a significantly larger dataset to model, with the idLogit MLE problem having 275,839 coefficients to estimate. Regardless, CVXPy/ECOS can estimate the idLogit model in a couple minutes on a modern MacBook Pro laptop. ECOS, which can report its own solve times, takes between just under a minute to 2 and a half minutes depending on the penalty parameter. But more details below.

Figure X has a heatmap of the number of appearances of each alternative in the left and right positions in the wikisurvey. We use a relabeled, zero-based index defined by the ascending order sort of the allourideas.com alternative indices.

The alternative appearances are clearly *not* uniformly distributed. But we shouldn't really expect this in the context of wikisurveys, at least when respondents are allowed to add alternatives themselves.

Figure X plots the rank-ordering, according to Salganik-Levy scores, of the alternatives shown.

This wikisurvey was analyzed by Salganik & Levy

[T]he New York City Mayor’s Office conducted a wiki survey in order to integrate residents’ ideas into the 2011 update to the City’s long-term sustainability plan. The wiki survey asked residents to contribute their ideas about how to create “a greener, greater New York City” and to vote on the ideas of others.

This wikisurvey was also analyzed by Salganik & Levy

The OECD’s wiki survey was created in preparation for an Education Ministerial Meeting and an Education Policy Forum on “Investing in Skills for the 21st Century.” The OECD sought to bring fresh ideas from the public to these events in a democratic, transparent, and bottom-up way by seeking input from education stakeholders located around the globe. To accomplish these goals, the OECD created a wiki survey to allow respondents to contribute and vote on ideas about “the most important action we need to take in education today."

Another distinction between idLogit and the Bayesian analysis paradigm proposed by Salganik & Levy

A more general form of the idLogit MLE problem is
$$
\begin{aligned}
\min &\quad - \frac{1}{N} \sum_{n=1}^N \log P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta})
+ \Lambda_1 || \boldsymbol{\delta} ||_1
+ \frac{\Lambda_2}{2} || \boldsymbol{\delta} ||_2^2 \\
\text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^A \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IA} \\
\text{s.to}
&\quad \sum_{i=1}^I \delta_{i,a} = 0 \;\; \text{for all} \;\; a \text{, plus identification constraints}
\end{aligned}
$$
where $c_n \in \{ L_n , R_n \}$ and
$$
P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta})
= \frac{ e^{\beta_{c_n}+\delta_{i_n,c_n}} }{ e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} }.
$$
We can easily write choice probabilities that include a no-choice option: if we let $c_n \in \{0,L_n,R_n\}$ for all $n$, where $c_n = 0$ denotes a choice of "I can't decide", we can take
$$
P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta})
= \frac{ e^{\beta_{c_n}+\delta_{i_n,c_n}} }{ 1 + e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} }
$$
where it is safe to presume that $\beta_0 = \delta_{i_n,0} = 0$.

The result is a problem almost identical to the one introduced above; all we have to do is change the form of the logistic function and add a linear-in-coefficients term in the objective. Note though that we have already implicitly included an identification constraint in the form of setting the no-choice opinion to zero. Thus we don't need to include equivalent constraints explicitly in this model.

These changes aren't too hard to make in a CVXPy-based implementation, but do change the overall problem definition:

A few terms require explanation. We have introduced

Note also that we introduce some new "variables"

A very important question has to be asked: *does it matter* what model we pick, or how we estimate it? Obviously we can address this question by comparing the ranks or scores obtained by different methods. Ranks, being discrete, are probably — rather *hopefully* — fairly stable over small changes to the modeling approach. Instability, meaning dramatically changing rankings from small changes to the model, would probably say more about the poor quality of the modeling approach than the data or the underlying behavior it describes. Scores are likely to be more sensitive to our modeling choices, and thus present a better opportunity for cross-model comparisons. We could compare scoring from Salganik & Levy's Bayesian method, a simple Logit model, the idLogit model, even an idLogit model with the no-choice option. We can compare rankings of scores from Salganik & Levy's Bayesian method to coefficient value rankings from a Logit or idLogit model, with or without the no-choice option (equivalent to rankings from all-alternative choice probabilities with those models).

Moreover, to what degree does modeling *at all* matter? If the number of appearances of each alternative is roughly uniformly distributed, the alternative with the most “votes” or “wins” is the top-ranked alternative. We can also compare any of these to a simple measure readily available from the data itself even if the appearances are not uniformly dsitributed: *empirical win frequency* (EWF). The EWF of an alternative, $F_a$, is the number of times alternative $a$ wins divided by the number of times it appears. This is an essentially trivial computation given the data, and can itself be considered a score that implies a ranking. Generally speaking, it is good practice to always compare modeling to the simplest available alternatives.

Analysis TBD

It should be possible to analyze this situation. For example, it is well known that the maximum-likelihood Logit model recreates the observed, average choice shares of different options.

TBD

Suppose we have a matrix $\triangle$ whose *rows* are $\mathbf{x}_{R_n}-\mathbf{x}_{L_n}$. If this matrix is not full rank, or even not *numerically* full rank (to some tolerance), we probably need some kind of identification constraint. $\triangle$ has a singular value decomposition $\triangle = \mathbf{U}\begin{bmatrix} \boldsymbol{\Sigma} & \mathbf{0} \end{bmatrix}\mathbf{V}^\top$, from which we can derive
$$
\boldsymbol{\Omega} = \triangle\triangle^\top = \mathbf{U}\boldsymbol{\Sigma}^{2}\mathbf{U}^\top
$$
The *smallest* eigenvalues of $\boldsymbol{\Omega}$ are the (squares of the) *smallest* singular values of $\triangle$. Say we estimate the smallest eigenvalues, $\lambda$, and the associated eigenvector(s) $\mathbf{v}$, of $\boldsymbol{\Omega}$ *especially* those eigenvectors with eigenvalue zero. Then we should impose an identification constraint $\mathbf{v}^\top\boldsymbol{\beta} = 0$ for all such vectors $\mathbf{v}$. We can do this in practice by taking an SVD, if that is feasible. If SVD is not feasible we could first take the "proper" nullspace using QR factorization of $\boldsymbol{\Omega}$, and impose that as a set of identification constraints. We can then progressively estimate the smallest eigenvalues of $\boldsymbol{\Omega}$ with eigenvectors *orthogonal* to that proper nullspace, imposing those vectors as identification constraints as long as the eigenvalues are sufficiently small to warrant concern.