Idiosyncratic Deviations Logit

Modeling Non-Parametric Heterogeneity in Wikisurveys, or Other Pairwise Comparison Experiments

Introduction

Pairwise comparison experiments collect data on which of two alternatives is preferred, if either is. In online platforms like allourideas.com, respondents are presented with two alternatives, one on the left of the screen and one on the right, as well as an (optional) "I can't decide" button. They can choose one of these three options — vote for one of the two alternatives or abstain altogether — before moving on to another task or leaving the site. There are no task count limits; a respondent can answer none, one, or very many questions at the site. Wikisurveys, in particular, also allow respondents to submit alternatives to be considered by others, although we don't deal with any of the unique aspects of this platform feature here. In this way, researchers collect data describing relative comparisons between alternatives or feature levels of alternatives for a population of respondents.


Figure X: Screenshot of a (wiki) survey released on allourideas.com.

The purpose of some of these surveys, particularly at allourideas.com, is to infer alternative rankings. Which sentences describing distilleries sound most "authentic"? Who had the "worst year in Washington" in 2010? What would make a "greener, greater" New York City? What is the "most important action we need to take in education today?"

Finding answers, given only pairwise comparisons, involves modeling and inference. Some model must be posed that, ultimately, yields a description of a global ranking of alternatives (or their features). Then this model must be fit to the observed choices, along with some measure of confidence in the estimates. With this estimated model researchers can make judgments about the global ranking of alternatives (or their features) without ever having actually witnessed a global ranking.

This article describes a "new" way to model pairwise comparison data like wikisurvey data. Well, our proposal isn't really that "new". We basically propose to model this data with the Logit model, often also called Logistic Regression. However, we propose to do so using both coefficients that model shared "opinions" about alternatives (or features) and "idiosyncratic deviations", or respondent-specific effects, about those alternatives (or features). A penalization technique common in statistics and machine learning enables estimation of the "fewest, smallest" idiosyncratic deviations. Our use of this penalization is specific: we penalize only a subset of the estimated coefficients (the deviations), while most existing techniques penalize all coefficients equally. Because of this structure, we call this type of model the idLogit.

We have several motivations to use this type of model.

The first is tractability. The idLogit has a convex MLE problem, and thus has unique estimates conditional on the data. (Really, a unique convex set of estimates, but the idLogit problem will probably often be strongly convex.) Other frequentist-style models, including Latent Class Logit models and the Random Coefficients Logit family, have (often provably) non-convex MLE problems which make estimation much harder, both in theory and in practice. Convexity also implies the existence of very strong numerical methods for efficiently computing coefficient estimates. There are even "online" algorithms that could analyze response data as it comes in.

A second motivation is that we get an essentially non-parametric portrait of heterogeneity in the respondent pool. No priors or other a priori assumptions on how "opinions" are distributed are required, other than an identifying assumption that they have zero mean over the population within alternatives (or features). But that is totally generic anyway; whatever the distribution is, we can shift it to be mean zero. With idLogit we can estimate the shared coefficients and deviations, and then ask what the heterogeneity looks like.

A third motivation is generalizability. Essentially the same model (and more or less the same estimation code) applies to binary response data, pairwise comparisons, pairwise comparisons with a no-choice option, and even multinomial models (which we don't consider here). Perhaps the availability of a method to include the no-choice option is most immediately relevant to wikisurvey analysts. The inability of a respondent to judge between two alternatives certainly imparts information about their relative comparison and ranking. But the primary method available on the platform currently ignores this information. Our method doesn't.

Two major caveats are in order.

First, the idLogit is still a Logit model of choice, within individuals. This means that if individual response behavior plausibly violates features of the Logit model, adding heterogeneity with the idLogit isn't a magic fix. Specifically, the Logit has the independence of irrelevant alternatives (IIA) property: the ratio of the probabilities that any two alternatives are chosen — from the full set of all alternatives — is independent of all other alternatives. In pairwise comparisons, however, IIA seems a somewhat natural property: when respondents only see and evaluate pairs, how can any other option be relevant? For wikisurveys, in particular, this property says that we cannot model any changes in one respondent's preferences over pairwise comparisons of existing alternatives when someone adds a new alternative. Moreover, if you would accept the correction to IIA obtained with other generalizations of the Logit (e.g., mixing with Latent Classes or Random Coefficients) then you should accept it with idLogit, too. Basically, that the idLogit is still a Logit doesn't sound so bad, but researchers should keep it in mind.

Second, the idLogit itself doesn't really tell us how to predict responses for new respondents. That is, while we get a non-parametric view of heterogeneity, we get a view specific to the individuals studied. If we want to predict observations from new respondents, we have to either accept the model with the shared opinions only, or we have to model the pattern of the deviations in some way that enables extrapolation. This isn't a problem specific to idLogit, by any means. Random Coefficient Logit models and Bayesian methods "avoid" this problem by presuming what the distribution is to begin with. For Latent Class Logit models, a new respondent would have to be assigned to a class; this can itself be done with multinomial Logit modeling built right into the Latent Class specification if respondent characteristics (demographics) are collected. Nor is this necessarily a problem for global rank assessment, the prevalent paradigm of wikisurvey analysis.

Some notes on notation: we try to be precise enough to make sense, but not so precise we lose the forest for the trees. Bold symbols (e.g., $\mathbf{L}$, $\boldsymbol{\beta}$) typically refer to vectors. We index with subscripts (e.g., $x_n$) or using "functional" notation (e.g., $x(n)$) depending on which probably makes sense in the context.

Oh, and in the spirit of Distill (from which we draw our article style) this is an open article. You can see all the source, notebooks, and assets on GitLab, and even submit issues there if you see an error. If I fix something in the article based on an issue I'll record the change here, in the appendices.

Logit-Like Models of Pairwise Comparisons

In a set of $N$ pairwise, online comparisons we observe vectors of "left" $\mathbf{L} \in \{1,\dotsc,A\}^N$ and "right" $\mathbf{R} \in \{1,\dotsc,A\}^N$ alternatives, as well as outcomes $\mathbf{y} \in \{\pm 1\}^N$, where $A$ is the number of alternatives. We presume $y_n = +1$ if the left alternative was chosen, and $y_n = -1$ if the right alternative was chosen. We assume each alternative has a corresponding vector of "characteristics" or "features" $\mathbf{x}_a \in \mathbb{R}^K$. There are many existing treatments of how to analyze this and related data.

The Simple Logit Model

Suppose we let the probability that the option on the left wins be $$ P_n(\boldsymbol{\beta}) = \frac{ e^{ \mathbf{x}_{L_n}^\top \boldsymbol{\beta} } }{ e^{\mathbf{x}_{L_n}^\top \boldsymbol{\beta}} + e^{\mathbf{x}_{R_n}^\top \boldsymbol{\beta}} }. $$ This isn't an entirely arbitrary choice; if we consider $\mathbf{x}_a^\top \boldsymbol{\beta}$ an alternative's “utility”, then this formula for the choice probability follows from Random Utility Maximization with errors following an i.i.d. extreme value distribution. The coefficients $\boldsymbol{\beta}$ can be computed by solving the maximum likelihood estimation (MLE) problem $$ \begin{aligned} \min &\quad \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ y_n (\mathbf{x}_{R_n} - \mathbf{x}_{L_n})^\top \boldsymbol{\beta} } \right) \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^K \\ \text{s.to} &\quad \text{identification constraints } \end{aligned} $$ as can be checked. Identification constraints can be described in general terms, but are easier to handle on a case-by-case basis. (Specifically, if we let $\triangle\mathbf{X} \in \mathbb{R}^{N \times K}$ be the matrix whose rows are the feature-vector differences across all pairs observed in the data, we do not change the likelihood by shifting $\boldsymbol{\beta}$ by any $\boldsymbol{\delta}$ such that $\triangle\mathbf{X} \boldsymbol{\delta} = \mathbf{0}$. If the feature vectors are structured enough for $\triangle\mathbf{X}$ to have a nontrivial nullspace, we want $\boldsymbol{\beta}$ to be orthogonal to that nullspace. See the Appendices as well.) We'll also presume, without loss of generality, that these are linear equality constraints within the context of the feature-additive models we describe.

For example, suppose we let there be a coefficient for each alternative; that is, the "features" are dummy variables or alternative specific constants (ASCs) only. The inner product $\mathbf{x}_a^\top \boldsymbol{\beta}$ is simply a selection operator: $\mathbf{x}_a^\top \boldsymbol{\beta} = \beta_a$. An identification constraint could be $\sum_{a=1}^A \beta_a = 0$ because uniform shifts in the coefficients, $\boldsymbol{\beta} \mapsto \boldsymbol{\beta} + \delta\mathbf{1}$ for any scalar $\delta$, will not change the modeled choice probabilities. An equivalent strategy in this case is to remove the constraint but "leave a coefficient out" by fixing $\beta_{a_f} = 0$ for any $a_f \in \{1,\dotsc,A\}$.
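
To make this ASC special case concrete, here is a minimal sketch (ours, not the estimation code used later in the article) that builds the dummy-feature difference rows and fits the simple Logit under the "leave-one-out" identification; all names here are hypothetical.

    import numpy as np
    from scipy.optimize import minimize

    def fit_simple_logit(L, R, y, A):
        """Simple Logit MLE with ASC-only features; beta_A is fixed to zero."""
        L, R, y = map(np.asarray, (L, R, y))
        N = len(y)
        dX = np.zeros((N, A))                    # rows are x_{R_n} - x_{L_n}
        dX[np.arange(N), R] += 1.0
        dX[np.arange(N), L] -= 1.0
        dX = dX[:, :A - 1]                       # "leave one out": drop the last ASC

        def nll(beta):                           # (1/N) sum_n log(1 + exp(y_n dx_n' beta))
            return np.mean(np.logaddexp(0.0, y * (dX @ beta)))

        res = minimize(nll, np.zeros(A - 1), method="BFGS")
        return np.append(res.x, 0.0)             # re-attach the fixed coefficient

Because the simple Logit likelihood is convex, any off-the-shelf smooth optimizer will recover the same estimates here.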

Among its several shortcomings, this Logit model does not account for the fact that many of the responses are probably from the same people, and people behave differently. If we can account for those differences, we should be able to increase the proportion of variation in the data explained by our models. Below we sketch a few existing approaches to modeling heterogeneity, to frame our proposed approach.

Latent Class Logit

The Latent Class Logit model presumes that there is a collection of "groups" or "classes" of respondents, each with distinct behavior defined by a different coefficient vector. Let $$ \boldsymbol{\theta} = ( \; \boldsymbol{\beta}_1 \; , \; \dotsc \; , \; \boldsymbol{\beta}_C \; , \boldsymbol{\rho} \; ) $$ where $\boldsymbol{\rho} \in \mathbb{R}^C$ are class occurrence probabilities; that is, $\mathbb{P}(\boldsymbol{\beta}=\boldsymbol{\beta}_c|\boldsymbol{\theta}) = \rho_c$. Hence $\rho_c \geq 0$ for all $c$ and $\sum_{c=1}^C \rho_c = 1$. Then the probability that the left alternative wins is $$ P_{n}^{LCL}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n}(\boldsymbol{\beta}_c) $$ where the subscripts are placeholders for the choice set in observation $n$. If we have to impose identification constraints on the coefficients, these need to be imposed on each class' coefficients. Let's also write $$ P_{n,1}^{LCL}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n,1}(\boldsymbol{\beta}_c) \quad\quad\text{and}\quad\quad P_{n,-1}^{LCL}(\boldsymbol{\theta}) = 1 - P_{n,1}^{LCL}(\boldsymbol{\theta}) = \sum_{c=1}^C \rho_c P_{n,-1}(\boldsymbol{\beta}_c) $$ so that the more general label "$P_{n,y_n}^{LCL}(\boldsymbol{\theta})$" for $y_n \in \{\pm 1\}$ makes sense.

The log-likelihood is complicated because individuals fall into specific classes that are consistent across their choices. That is, we suppose there are i.i.d. random variables $C_i \in \{1,\dotsc,C\}$ with mass function $\mathbb{P}(C_i =c) = \rho_c$. Then $$ \begin{aligned} &\prod_{i=1}^I \mathbb{P}( \; \text{observing all choices in } \mathcal{O}_i \; ) \\ &\quad\quad\quad\quad = \prod_{i=1}^I \sum_{c=1}^C \rho_c \mathbb{P}( \; \text{observing all choices in } \mathcal{O}_i \; | \; C_i = c ) \\ &\quad\quad\quad\quad = \prod_{i=1}^I \left[ \sum_{c=1}^C \rho_c \prod_{n \in \mathcal{O}_i} P_{n,y(n)}(\boldsymbol{\beta}_c) \right] \end{aligned} $$ where $\mathcal{O}_i$ is the set of observations $n$ corresponding to individual $i$. From this we can see the (scaled) LCL log-likelihood is $$ \begin{aligned} &\frac{1}{N} \sum_{i=1}^I \log \left( \sum_{c=1}^C \rho_c \prod_{n \in \mathcal{O}_i} P_{n,y(n)}(\boldsymbol{\beta}_c) \right) \\ &\quad\quad\quad\quad = \frac{1}{N} \sum_{i=1}^I \log \left( \sum_{c=1}^C \rho_c \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}(\boldsymbol{\beta}_c) \right\} \right) \end{aligned} $$
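
For concreteness, a small sketch of evaluating this (scaled) LCL log-likelihood with a log-sum-exp; `log_P` and `obs_by_ind` are hypothetical containers for the per-class log choice probabilities and the observation sets $\mathcal{O}_i$ defined above.

    import numpy as np
    from scipy.special import logsumexp

    def lcl_loglik(log_P, rho, obs_by_ind, N):
        """log_P[c][n] holds log P_{n,y(n)}(beta_c); obs_by_ind[i] lists the n in O_i."""
        total = 0.0
        for obs in obs_by_ind:
            # log sum_c rho_c exp{ sum_{n in O_i} log P_{n,y(n)}(beta_c) }
            per_class = np.array([log_P[c][obs].sum() for c in range(len(rho))])
            total += logsumexp(per_class + np.log(rho))
        return total / N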

Computing estimates $\boldsymbol{\theta} = (\boldsymbol{\beta}_1,\dotsc,\boldsymbol{\beta}_C,\boldsymbol{\rho})$ conditional on the data with nonlinear programming methods is nontrivial. A reasonably complete LCL MLE problem, for a fixed number of classes $C$, is $$ \begin{aligned} \max &\quad \frac{1}{N} \sum_{i=1}^I \log \left[ \sum_{c=1}^C \rho_c \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}(\boldsymbol{\beta}_c) \right\} \right] \\ \text{w.r.t.} &\quad \boldsymbol{\beta}_1 , \dotsc , \boldsymbol{\beta}_C , \boldsymbol{\rho} \geq \mathbf{0} \\ \text{s.to} &\quad \sum_{c=1}^C \rho_c = 1 \;\; , \;\; \rho_1 \geq \dotsb \geq \rho_C \;\; \text{ plus identification constraints} \end{aligned} $$ The extra constraint $\rho_1 \geq \dotsb \geq \rho_C$ is important for ruling out certain types of symmetric solutions (class-coefficient permuted solutions) that could make proper estimation with nonlinear programming methods very difficult.

Instead, LCL estimation can be implemented with the Expectation-Maximization (EM) algorithm. For example, this is the algorithm underlying the lclogit command in Stata.

Random Coefficient Logit

Random Coefficient Logit (RCL) models do exactly what the name says: they take the coefficients to be random variables over the population.

The foundational example for RCL supposes coefficients that are normally distributed over the population. Let $\boldsymbol{\theta} = (\boldsymbol{\mu},\boldsymbol{\sigma})$ where $\boldsymbol{\sigma} > \mathbf{0}$, and let $\boldsymbol{\beta} \sim N(\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\sigma}^2))$. Then $$ P^{RCL}(\boldsymbol{\theta}) = \int P^L(\boldsymbol{\beta}) \prod_{k=1}^K \phi(\beta_k|\mu_k,\sigma_k^2) d \boldsymbol{\beta} $$ where $\phi(\cdot|\mu,\sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$. A product of independent normals is just one choice. To keep normality but relax independence, we could exchange $\boldsymbol{\sigma}$ for the coefficients in the Cholesky factor of a covariance matrix for the multivariate normal distribution (thus avoiding definiteness constraints). Keeping independence but relaxing normality, we can instead let $\boldsymbol{\theta} = \begin{pmatrix} \boldsymbol{\theta}_1 & \dotsb & \boldsymbol{\theta}_K \end{pmatrix}$ and introduce densities $f_k(\cdot|\boldsymbol{\theta}_k)$ for each $k$, obtaining $$ P^{RCL}(\boldsymbol{\theta}) = \int P^L(\boldsymbol{\beta}) \prod_{k=1}^K f_k(\beta_k|\boldsymbol{\theta}_k) d \boldsymbol{\beta}. $$ To relax both, you could use copulas.

These are tweaks on a theme: let the $\boldsymbol{\beta}$s be drawn from some distribution, and let the coefficients we estimate be the parameters required to define that distribution. Hence the name "Random Coefficients" models. The broad conceptualization is that "nature" chooses a vector of coefficients for each respondent, after which respondents choose amongst alternatives. This is a powerful concept, exemplified by McFadden & Train's result that RCL is dense in the class of random utility maximization models of choice. That is, given (almost) any “random utility maximizing” choice behavior, there is some distribution of coefficients that recovers that behavior's choice probabilities.

In practice, though, simple distributions are chosen: normal, truncated normal, log-normal, $\chi^2$, etc., and most often independence between random coefficients is assumed. These simplifications matter a lot. Primarily, the model class is significantly smaller than the "dense" set of mixtures that can represent any RUM. RCL models with these convenient but restrictive assumptions aren't really the generic models hinted at in McFadden & Train's density theorem. Moreover, these a priori simplifications don't really make estimating RCL models easier.

Like LCL, with RCL we have to carefully derive the likelihood, imagining that each individual "draws" a coefficient vector and then responds to their sequence of choice tasks. This generates the MLE problem \begin{equation} \begin{aligned} \max &\quad \frac{1}{N} \sum_{i=1}^I \log \left( \int \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}^L(\boldsymbol{\beta}) \right\} f(\boldsymbol{\beta}|\boldsymbol{\theta}) d\boldsymbol{\beta} \right) \\ \text{w.r.t.} &\quad \boldsymbol{\theta} \end{aligned} \end{equation} Distributional parameters can then be estimated using maximum simulated likelihood.

There is even more work to do, though. We need to "standardize" the distribution of the coefficients to some $D_0$ so that we can draw $\boldsymbol{\nu} \sim D_0$ and set $\boldsymbol{\beta} = \xi(\boldsymbol{\theta}|\boldsymbol{\nu})$. The easiest example is the independent multivariate normal distribution with $\boldsymbol{\theta} = (\boldsymbol{\mu},\boldsymbol{\sigma})$, whose standardization is $$ \xi(\boldsymbol{\mu},\boldsymbol{\sigma}|\boldsymbol{\nu}) = \boldsymbol{\mu} + \mathrm{diag}(\boldsymbol{\sigma}) \boldsymbol{\nu}. $$ More generally, if $\boldsymbol{\beta} \sim N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, then $$ \boldsymbol{\beta} = \boldsymbol{\mu} + \mathbf{R} \boldsymbol{\nu} = \xi(\boldsymbol{\mu},\mathbf{R}|\boldsymbol{\nu}) $$ where $\mathbf{R}$ is the Cholesky factor of $\boldsymbol{\Sigma}$. If a particular coefficient is lognormal, for example, we could write $$ \beta_k = \xi_k(\mu_k,\sigma_k|\nu_k) = \exp\{ \mu_k + \sigma_k \nu_k \}. $$ This has been used when coefficients over features should be signed, such as with prices.

With standardization we can draw $IS$ samples $\boldsymbol{\nu}_{i,s}$, for all $i$ and $s = 1,\dotsc,S$, to write \begin{equation} \begin{aligned} \max &\quad \frac{1}{N} \sum_{i=1}^I \log \left( \frac{1}{S} \sum_{s=1}^S \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}^L( \xi(\boldsymbol{\theta}|\boldsymbol{\nu}_{i,s}) ) \right\} \right) \\ \text{w.r.t.} &\quad \boldsymbol{\theta} \end{aligned} \end{equation} Quadrature methods essentially do the same thing (approximate an integral with a finite sum) but introduce general weights \begin{equation} \begin{aligned} \max &\quad \frac{1}{N} \sum_{i=1}^I \log \left( \sum_{s=1}^S w_{i,s} \exp \left\{ \sum_{n \in \mathcal{O}_i } \log P_{n,y(n)}^L( \xi(\boldsymbol{\theta}|\boldsymbol{\nu}_{i,s}) ) \right\} \right) \\ \text{w.r.t.} &\quad \boldsymbol{\theta} \end{aligned} \end{equation} and use the flexibility of weights and points to obtain order-of-accuracy guarantees. EM methods can also be used to estimate RCL models.
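
As a sketch of what maximum simulated likelihood evaluates, assuming the independent-normal standardization above, a fixed set of draws per individual, and a hypothetical helper `log_P_obs(beta, obs)` returning $\sum_{n \in \mathcal{O}_i} \log P_{n,y(n)}^L(\boldsymbol{\beta})$:

    import numpy as np
    from scipy.special import logsumexp

    def simulated_loglik(mu, sigma, nu_draws, obs_by_ind, log_P_obs, N):
        """nu_draws[i] has shape (S, K); beta = mu + sigma * nu is the standardization."""
        total = 0.0
        for i, obs in enumerate(obs_by_ind):
            vals = np.array([log_P_obs(mu + sigma * nu, obs) for nu in nu_draws[i]])
            total += logsumexp(vals) - np.log(len(nu_draws[i]))   # log of the draw average
        return total / N

Keeping the draws fixed while the distributional parameters change is what makes this objective a deterministic function suitable for standard optimizers.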

Regardless of how we estimate an RCL model, it can be very hard to do and remains somewhat specialized. Hopefully, our discussion above illustrates this. We also have to assume the distribution of heterogeneity first, and estimate that distribution's parameters using the data. Perhaps worse, we are not aware of any techniques for statistically testing distributional assumptions. Non-convexity in the MLE problem also creates controversy over estimator quality.

Bayesian Methods

TBD

Including Person-Specific Effects or "Idiosyncratic Deviations"

We propose the following: Let each individual $i$ have their own coefficient vector $\boldsymbol{\theta}_i$ decomposed as $$ \boldsymbol{\theta}_i = \boldsymbol{\beta} + \boldsymbol{\delta}_i $$ That is, a shared opinion about feature value $\boldsymbol{\beta}$ plus some "person-specific effect" or "idiosyncratic deviation" $\boldsymbol{\delta}_i$ from that shared opinion. Regardless of any identification constraints, we constrain $\sum_{i=1}^I \delta_{i,k} = 0$ for all $k$ to align the $\boldsymbol{\beta}$ values with "mean" effects. Note that we are basically denoting Salganik & Levy's "opinion matrix" in a slightly different way. We ultimately propose a different model of choices, however, because we do not assume that the choice errors are normally distributed.

There will be far too much flexibility introduced into our model without further constraints. Our goal will be to estimate coefficients "minimizing the effect" of the idiosyncratic deviations. We choose to do this by solving the following penalized MLE formulation $$ \begin{aligned} \min &\quad \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ y_n(\mathbf{x}_{R_n} - \mathbf{x}_{L_n})^\top(\boldsymbol{\beta}+\boldsymbol{\delta}_{i_n} ) } \right) + \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^K \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IK} \\ \text{s.to} &\quad \sum_{i=1}^I \delta_{i,k} = 0 \;\; \text{for all} \;\; k \text{, plus identification constraints} \end{aligned} $$ where $i_n$ is the individual who responded in observation $n$, $\Lambda_1,\Lambda_2 \geq 0$ are penalty parameters, the 1-norm $|| \mathbf{x} ||_1$ is the sum of the absolute values of the elements of $\mathbf{x}$, and the squared 2-norm $|| \mathbf{x} ||_2^2$ is the sum of the squares of the elements of $\mathbf{x}$. Penalizing with the 2-norm will tend to shrink estimates towards zero. Penalizing with the 1-norm is a well-known technique to introduce sparsity in statistical estimates. Applying both of these penalties is often called an "elastic net" penalty.

Notice, though, that we apply these penalties only to some of the coefficients we intend to estimate: specifically, the idiosyncratic deviations. In effect, this problem finds the fewest, smallest nonzero idiosyncratic deviations, as well as a shared coefficient vector, with the trade-off governed by the values of $\Lambda_1$ and $\Lambda_2$.
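
For concreteness, here is a direct NumPy transcription of this penalized objective, written as something to evaluate rather than optimize (the actual estimation code appears later in the article); the array shapes are assumptions of this sketch.

    import numpy as np

    def idlogit_objective(beta, delta, dX, y, person, Lambda1, Lambda2):
        """dX rows are x_{R_n} - x_{L_n}; person[n] = i_n; delta has shape (I, K)."""
        N = len(y)
        theta = beta + delta[person]                     # beta + delta_{i_n}, row by row
        z = y * np.einsum("nk,nk->n", dX, theta)         # y_n (x_{R_n} - x_{L_n})' (beta + delta_{i_n})
        return (np.mean(np.logaddexp(0.0, z))
                + (Lambda1 / N) * np.abs(delta).sum()
                + (Lambda2 / (2.0 * N)) * (delta ** 2).sum())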

Advantages of idLogit

This approach has some useful advantages over other methods for modeling response heterogeneity, and they derive from a single property: convexity. The simple Logit model is easy to manage because it is closed form and convex. That the idLogit MLE problem is convex is easy to prove: the basic problem is convex, because it is just the simple Logit MLE problem, and all we have done is add two convex penalties and some linear constraints, neither of which breaks convexity. Convexity implies that estimates $\boldsymbol{\beta}$ and $\boldsymbol{\delta}$ are unique conditional on the data (up to numerical approximation error). (More precisely, if there are multiple idLogit estimates, they must all form a convex set.) Estimates are not obtained by stochastic sampling methods (as is the case with Bayesian approaches to modeling heterogeneity). Nor can idLogit estimates be diminished by controversy about global optimality, or estimate multiplicity, as is the unfortunate case with Latent Class Logit or Random Coefficients Logit models. Moreover, convexity allows us to apply remarkably powerful numerical methods with optimal convergence guarantees and optimality certification. We have indeed found that idLogit models are fairly easy to estimate with existing software.

There are probably also advantages unrelated to the mathematical and computational tractability imparted by convexity. For all intents and purposes, the idLogit is a non-parametric model of preference heterogeneity. Besides imposing constraints that require the deviations to be mean zero across individuals within features, we don't have to make any other assumptions about the idiosyncrasies of choice. More or less, the idLogit shows you what the data says about heterogeneity.

This is not the case with other models. To use a Random Coefficient Logit model, for example, you have to specify the distributional family of feature-preference heterogeneity and estimate that family's parameters (e.g., mean and/or standard deviation) to get your model; to use Bayesian methods you have to choose a prior that imparts some a priori structure on the distribution of preference heterogeneity. Latent Class models are perhaps the most non-parametric, with the exception that you have to fix the number of classes to estimate a model. That choice, however, has some similarities to choosing the penalty parameters in the idLogit model.

The idLogit is also relatively easy to generalize to model additional features of choice behavior. For example, we discuss below how to create a model of wikisurvey data that includes the no-choice option. With idLogit, this is almost trivial. It would also be nearly trivial to include a left-right effect term, to model heterogeneous placement effects, by estimating a model with left-choice probabilities $$ \frac{ 1 }{ 1 + e^{(\mathbf{x}_{R_n}-\mathbf{x}_{L_n})^\top( \boldsymbol{\beta} + \boldsymbol{\delta}_{i_n} ) - \gamma_{i_n} } }. $$ Extensions to choice between lots of options (that is, more than two) would also be fairly straightforward.

Estimating Confidence

Computing point estimates is of limited value; any reasonable inferences for experimental work must rely on some measure of variability and/or confidence in those estimates. We can compute confidence metrics for idLogit estimates, or their implied rankings, either with traditional second-order approximation methods or with bootstrapping methods. We prefer bootstrapping, as it is more precise. Naturally, the feasibility of bootstrapping pivots on estimation being fairly fast; luckily, idLogit estimation is fast.
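
A minimal sketch of the bootstrap we have in mind, resampling whole individuals (with all of their observations) with replacement; `estimate_idlogit` and `score_fn` are hypothetical stand-ins for the estimation code described below and for whatever summary (coefficients, ranks, or scores) we want intervals for.

    import numpy as np

    def bootstrap_summaries(data_by_ind, estimate_idlogit, score_fn, B=200, seed=0):
        """data_by_ind maps an individual id to that individual's observations."""
        rng = np.random.default_rng(seed)
        ids = list(data_by_ind)
        summaries = []
        for _ in range(B):
            sample = rng.choice(ids, size=len(ids), replace=True)
            estimates = estimate_idlogit([data_by_ind[i] for i in sample])
            summaries.append(score_fn(estimates))
        return np.array(summaries)

Resampling whole individuals, rather than single votes, respects the within-respondent dependence that the idiosyncratic deviations are meant to capture.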

Is idLogit Just GLMNET?

We should take a moment to compare idLogit to GLMNET, a method available for both R and Python. GLMNET is elastic-net regularization of Generalized Linear Models (GLMs), including the Logit. (The GLMNET authors call this "Logistic Regression", matching much of the more recent statistics and machine learning literature; however, "Logistic Regression" has also been used to refer to OLS estimation on logistic-transformed data. The two ideas are not the same, and the OLS version is not applicable to binary choice outcomes.) Like most existing regularization approaches, GLMNET aims to enforce sparsity and shrinkage on the model coefficients themselves: $$ \min_{\boldsymbol{\theta}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n\mathbf{x}_n^\top\boldsymbol{\theta} } \right) + \Lambda_1 || \boldsymbol{\theta} ||_1 + \frac{\Lambda_2}{2} || \boldsymbol{\theta} ||_2^2 \right\} $$ This is a strong concept when the feature space is large ($K$ is large) relative to the number of observations as in, for example, sentiment analysis. idLogit certainly has a large parameter space over which we intend to induce sparsity; however, we only intend to induce sparsity in some of the coefficients. Thus idLogit falls into a more general category of problems that penalize re-parameterizations of the coefficients: $$ \min_{\boldsymbol{\theta}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n\mathbf{x}_n^\top\boldsymbol{\theta} } \right) + \frac{\Lambda_1}{N} || \mathbf{A}\boldsymbol{\theta} ||_1 + \frac{\Lambda_2}{2N} || \mathbf{A}\boldsymbol{\theta} ||_2^2 \right\} $$ for some matrix $\mathbf{A}$. In this way idLogit is more aligned with how methods like Trend Filtering adopt regularization than with GLMNET. In particular, idLogit is essentially $$ \min_{\boldsymbol{\theta}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n\mathbf{x}_n^\top\boldsymbol{\theta} } \right) + \frac{\Lambda_1}{N} || \boldsymbol{\theta} - \mathbb{E}\boldsymbol{\theta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\theta} - \mathbb{E}\boldsymbol{\theta} ||_2^2 \right\} $$ instead of GLMNET, although this abuses notation a bit.

GLMNET is, as coded in existing packages, not set up to handle identification constraints. Formally, we could write $\mathbf{S}\boldsymbol{\theta} = \mathbf{0}$ as $\boldsymbol{\theta} = \mathbf{W}\boldsymbol{\nu}$ given a basis $\mathbf{W}$ for the nullspace of $\mathbf{S}$. We can then "kind of" substitute this form in by applying GLMNET to a transformed problem for estimating $\boldsymbol{\nu}$: $$ \min_{\boldsymbol{\nu}} \left\{ \frac{1}{N} \sum_{n=1}^N \log\left( 1 + e^{ -y_n(\mathbf{W}^\top\mathbf{x}_n)^\top\boldsymbol{\nu} } \right) + \frac{\Lambda_1}{N} || \boldsymbol{\nu} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\nu} ||_2^2 \right\} $$ That is, we can transform the data to reflect the constraints on the coefficients, if we can compute a nullspace basis of $\mathbf{S}$ (which isn't always feasible). Notice, however, that we are penalizing different coefficients: $\boldsymbol{\nu}$ instead of $\boldsymbol{\theta} = \mathbf{W}\boldsymbol{\nu}$. This means that GLMNET, in its current form, is unprepared to handle sets of linear identification constraints.
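
To make the reparameterization concrete, here is a small sketch, assuming $\mathbf{S}$ is available as a dense array; the function name is ours and not part of any GLMNET package.

    import numpy as np
    from scipy.linalg import null_space

    def reparameterize(S, X):
        """Return a nullspace basis W of S and features transformed so that
        fitting nu on X @ W is equivalent to fitting theta = W @ nu on X."""
        W = null_space(S)          # columns span { theta : S @ theta = 0 }
        return W, X @ W            # row n of X @ W is (W' x_n)'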

Could it be idProbit?

In principle, of course! But we should be clear what we mean here.

A Probit model, basically the model proposed by Salganik & Levy, would take $$ \mathbb{P}\big( \; a \text{ beats } b \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; , \; \boldsymbol{\theta} \; \big) = \Phi\big( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} \big) $$ where $\Phi$ is the cumulative standard normal distribution. (Note that this approach to Probit implicitly presumes there is no left-right bias in choices.) This is the same as presuming that $$ \mathbb{P}\big( \; (L,R) = (a,b) \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; \big) = \mathbb{P}\big( \; (L,R) = (b,a) \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; \big) = \frac{1}{2} $$ and $$ \begin{aligned} \mathbb{P}\big( \; L \text{ wins} \; | \; (L,R) = (a,b) \; , \; \boldsymbol{\theta} \; \big) &= \Phi\big( ( \mathbf{x}_a - \mathbf{x}_b )^\top\boldsymbol{\theta} \big) \\ \mathbb{P}\big( \; R \text{ wins} \; | \; (L,R) = (a,b) \; , \; \boldsymbol{\theta} \; \big) &= \Phi\big( ( \mathbf{x}_b - \mathbf{x}_a )^\top\boldsymbol{\theta} \big) \\ \end{aligned} $$ from which the formula above can be derived. Specifically: $$ \begin{aligned} &\mathbb{P}\big( \; a \text{ beats } b \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; , \; \boldsymbol{\theta} \; \big) \\ &\quad\quad\quad\quad = \mathbb{P}\big( \; a \text{ beats } b \; | \; (L,R) = (a,b) \; , \; \boldsymbol{\theta} \; \big) \\ &\quad\quad\quad\quad\quad\quad\quad\quad \times \mathbb{P}\big( \; (L,R) = (a,b) \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; \big) \\ &\quad\quad\quad\quad\quad\quad + \mathbb{P}\big( \; a \text{ beats } b \; | \; (L,R) = (b,a) \; , \; \boldsymbol{\theta} \; \big) \\ &\quad\quad\quad\quad\quad\quad\quad\quad \times \mathbb{P}\big( \; (L,R) = (b,a) \; | \; (L,R) \in \{ (a,b) , (b,a) \} \; \big) \\ &\quad\quad\quad\quad = \frac{1}{2}\mathbb{P}\big( \; L \text{ wins} \; | \; (L,R) = (a,b) \; , \; \boldsymbol{\theta} \; \big) \\ &\quad\quad\quad\quad\quad\quad + \frac{1}{2}\mathbb{P}\big( \; R \text{ wins} \; | \; (L,R) = (b,a) \; , \; \boldsymbol{\theta} \; \big) \\ &\quad\quad\quad\quad = \frac{1}{2} \Big[ \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) + 1 - \Phi( ( \mathbf{x}_b - \mathbf{x}_a )^\top\boldsymbol{\theta} ) \Big] \\ &\quad\quad\quad\quad = \frac{1}{2} \Big[ \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) + 1 - \big( 1 - \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) \big) \Big] \\ &\quad\quad\quad\quad = \frac{1}{2} \Big[ \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) + \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) \Big] \\ &\quad\quad\quad\quad = \Phi( ( \mathbf{x}_a - \mathbf{x}_b)^\top\boldsymbol{\theta} ) \end{aligned} $$ This latter form is a bit easier to consider in an MLE framework because the observation-win probabilities simplify nicely (Salganik & Levy propose to estimate the Probit-style model with Bayesian methods; formulating the “idProbit” MLE problem, however, is what enables the explicit inclusion of the penalization of idiosyncratic deviations):
$$ \begin{aligned} P_{n,y_n}(\boldsymbol{\Theta}) &= \mathbb{I}\{ y_n = 1 \}\Phi\big( ( \mathbf{x}_{L_n} - \mathbf{x}_{R_n} )^\top\boldsymbol{\theta}_i \big) + \mathbb{I}\{ y_n = -1 \}\Phi\big( ( \mathbf{x}_{R_n} - \mathbf{x}_{L_n} )^\top\boldsymbol{\theta}_i \big) \\ &= \mathbb{I}\{ y_n = 1 \}\Phi\big( y_n( \mathbf{x}_{L_n} - \mathbf{x}_{R_n} )^\top\boldsymbol{\theta}_i \big) + \mathbb{I}\{ y_n = -1 \}\Phi\big( y_n( \mathbf{x}_{L_n} - \mathbf{x}_{R_n} )^\top\boldsymbol{\theta}_i \big) \\ &= \Phi\big( y_n( \mathbf{x}_{L_n} - \mathbf{x}_{R_n} )^\top\boldsymbol{\theta}_i \big) \end{aligned} $$ Then the MLE problem for “idProbit” would be $$ \begin{aligned} \min &\quad - \frac{1}{N} \sum_{n=1}^N \log \Phi\Big( y_n \big( \mathbf{x}_{L_n} - \mathbf{x}_{R_n} \big)^\top \big( \boldsymbol{\beta} + \boldsymbol{\delta}_{i_n} \big) \Big) + \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^K \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IK} \\ \text{s.to} &\quad \sum_{i=1}^I \delta_{i,k} = 0 \;\; \text{for all} \;\; k \text{, plus any other identification constraints} \end{aligned} $$ In principle this is solvable, and even still convex, though the log-likelihood is significantly more complex here than with idLogit: we must evaluate normal CDF integrals to use the Probit form.
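
For comparison with the idLogit data term, a minimal sketch of the idProbit negative log-likelihood (the penalties and constraints would be unchanged); the shapes and names here are assumptions of this sketch.

    import numpy as np
    from scipy.stats import norm

    def idprobit_nll(beta, delta, dX_LR, y, person, N):
        """dX_LR rows are x_{L_n} - x_{R_n}; person[n] = i_n; delta has shape (I, K)."""
        theta = beta + delta[person]
        z = y * np.einsum("nk,nk->n", dX_LR, theta)
        return -np.sum(norm.logcdf(z)) / N     # swaps the logistic loss for the normal log-CDF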

We don't recommend taking this approach, because there isn't an a priori reason to expect coefficients estimated in this way to be substantively different from coefficients estimated using idLogit. If anything, the opposite: the distribution of choices described by Logit probabilities is very similar to that described by a normal distribution. Moreover, the exponential form of the Logit probabilities makes powerful Exponential Cone Programming methods available for estimation. The availability of such methods may be a huge advantage.

Implementing MLE for idLogit

We estimate idLogit models in Python using the CVXPy package. Code is available on GitLab, but is actually rather simple. (CVXPy, while very useful, is a bit sensitive to misspecifications of the problem you are trying to solve; seemingly innocuous mistakes in specifying the problem can easily make the code crash. Once you get your formulation “right”, however, it works very well.) Given (sparse) matrices Z and S (explained below), all we really need is

    import cvxpy as cp
    import numpy as np

    # CVXPy objective
    def objective_fcn( N , A , Z , p , Lambda1 , Lambda2 ):
        return ( cp.sum( cp.logistic( Z * p ) ) / N
                 + Lambda1/N * cp.norm( p[A:] , 1 )
                 + Lambda2/(2.0*N) * cp.sum_squares( p[A:] ) )

    # CVXPy variables... A beta's plus I * A "idiosyncratic deviations"
    p = cp.Variable( (I+1)*A )

    # CVXPy constraint object(s)
    constraints = [ S * p == np.zeros( S.shape[0] ) ]

    # CVXPy parameters
    Lambda1 , Lambda2 = cp.Parameter( nonneg=True ) , cp.Parameter( nonneg=True )

    # CVXPy problem
    prob = cp.Problem( cp.Minimize( objective_fcn(N,A,Z,p,Lambda1,Lambda2) ) , constraints )

With this, a solve attempt is as simple as:

    Lambda1.value , Lambda2.value = ... # set some specific values for the parameters
    prob.solve( solver="ECOS" , warm_start=True )

CVXPy takes this problem description and “compiles” it into a form suitable for a “target” convex programming solver. idLogit requires a solver, like ECOS, that can solve second-order conic programs including variables constrained to lie inside (products of) the "exponential cone": $$ E = \big \{ \; ( \; x \; , \; y \; , \; z \; ) \; : \; e^{x/z} \leq y/z \; , \; z > 0 \; \big \} $$ Such an approach can use interior-point methods tailored precisely for the "type" of convexity exhibited by the logarithmic/exponential features of the idLogit problem. The non-smooth L1 penalty can also be transformed into (smooth) sets of linear constraints on non-negative "slack" variables, a standard reformulation trick.

Our actual code is a fair bit more complicated though, with a lot of overhead devoted to parallelization and penalty parameter sweeps. The construction of the matrices Z and S could also be confusing. In brief, Z and S encode, respectively, the exponentiated terms in the likelihood and the constraints on the coefficients. Without the idiosyncratic deviations, Z at least is pretty simple: it is a matrix whose rows are $y_n(\mathbf{x}_{R_n}-\mathbf{x}_{L_n})$; with the deviations, we have to include these terms but also replicate these vectors in the right columns of a much larger Z to reflect the inclusion of the deviations. The complexity of S depends entirely on how complicated the identification constraints are.
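
As an illustration of that bookkeeping, here is one way such a Z might be assembled for the ASC-only (dummy-feature) case; this sketch is ours, not the article's actual construction code.

    import numpy as np
    from scipy.sparse import coo_matrix

    def build_Z(L, R, y, person, A, I):
        """Row n of Z @ p equals y_n (beta_{R_n} + delta_{i_n,R_n} - beta_{L_n} - delta_{i_n,L_n}),
        where p stacks the A shared coefficients and then I blocks of A deviations."""
        L, R, y, person = map(np.asarray, (L, R, y, person))
        N = len(y)
        rows = np.repeat(np.arange(N), 4)
        cols = np.column_stack([R, L, A + person * A + R, A + person * A + L]).ravel()
        vals = np.column_stack([y, -y, y, -y]).astype(float).ravel()
        return coo_matrix((vals, (rows, cols)), shape=(N, (I + 1) * A)).tocsr()

The same pattern extends to general feature vectors by replacing the $\pm 1$ entries with the components of $y_n(\mathbf{x}_{R_n}-\mathbf{x}_{L_n})$, placed in the shared block and in individual $i_n$'s block.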

Application to Wikisurvey Data

Wikisurvey data, as collected through allourideas.com, is pairwise comparison data. Participants in a wikisurvey see a "left" and a "right" alternative, and possibly an "I can't decide" option as well. We discuss how to handle the no-choice option below, but ignore it for the time being. In either case, this sort of data is, on its surface, perfect for pairwise comparison analysis as described here. The usual sort of pairwise comparison analysis doesn't account for other interesting features of wikisurveys, however, such as their longevity (within or across respondents) and the option for respondents to add alternatives themselves.

Carroll & Verhal's Authentic Distilleries Data

Suppose we are just interested in estimating "alternative" level effects. We'll focus on this special case through an example, a set of wikisurveys about authenticity in distilleries. Respondents are asked "Which distillery is more authentic?" and presented with a fixed set of 23 long-text, sentence-form alternatives such as

“Our master distiller hails from a family of moonshiners dating back to prohibition.”

in one condition, and a fixed set of 21 short-text, abbreviated alternatives such as

“Cultivates own farm”

in another condition. All alternatives are listed in Tables X and X.

Table X: Alternatives and scores in the short-text condition. The numbers in the column headers refer to the value of the penalty parameter $\Lambda_1$ in the idLogit model ($\Lambda_2 = 0$).

Alt. #  Alternative Text                 Class  AOI  3.4  0.6  0.2  AOI  3.4  0.6  0.2
1       Cultivates own farm              I       60   60   61   61   57   56   53   52
2       Distills with special partners   E       34   35   36   36   40   40   41   42
3       Exotic sources of grain          E       48   48   46   45   48   50   53   52
4       Flavor of terroir                E       55   54   52   50   55   53   53   53
5       Grows own grain                  I       62   61   61   60   57   58   56   56
6       Makes own barrels                I       57   57   57   56   46   47   47   47
7       Master distiller                 E       58   58   58   60   53   54   54   54
8       Moonshining ancestors            E       32   33   34   35   43   44   48   49
9       Own handmade still               I+E     56   56   56   57   54   53   53   54
10      Owns the farm                    I       62   61   60   60   50   49   46   46
11      Owns the still                   I       53   53   52   50   45   43   40   37
12      Processes own malt               I       63   63   62   62   54   55   54   54
13      Prohibition legacy               E       31   30   32   33   44   45   48   49
14      Rakes own malt                   I       51   51   49   48   46   46   44   43
15      Secret recipes                   E       52   52   52   51   61   60   58   58
16      Special barrels                  E       52   53   53   52   50   50   52   54
17      Special blending                 E       55   55   56   57   61   60   59   57
18      True grain-to-glass              I       58   60   63   63   58   60   60   60
19      Uses own copper still            I+E     51   51   51   51   50   49   48   48
20      Uses prohibition era recipe      E       33   33   33   35   46   47   49   49
21      Uses special third-party still   E       27   26   26   28   29   30   34   35

Both conditions are shown to two pools of respondents: a Stanford-only pool ("S") and a national pool ("N"). There is an "I can't decide" option, but it is selected rarely (1-8% across conditions); we analyze the data including this no-choice option below. Other details are provided in Table X. The heatmaps in Figure X show that, in all four of these conditions, appearances of all alternatives in the left and right positions are more or less uniformly distributed.

Table X: Overview of Carroll & Verhal Authentic Distilleries wikisurvey data. The number of respondents is judged from Session ID; vote statistics (Min, Median, and Max Votes) are computed over individuals.

Survey ID  Condition   Pool      Respondents  Observations  Min Votes  Median Votes  Max Votes  No-Choice %
12318      Long Text   Stanford  103          4787          1          38.0          311        3.05
12383      Long Text   National  107          5347          2          41.0          363        1.59
12319      Short Text  Stanford  116          5406          1          38.0          249        3.63
12384      Short Text  National  112          6499          2          39.5          524        7.96


Figure X: Heatmaps of the appearance count in the four conditions of the Authentic Distilleries wikisurveys for all alternatives.

We are interested in understanding which alternative is the “most authentic,” a question that is “difficult” to answer because we have to infer it from pairwise comparisons. (If we were to ask respondents to rank-order all alternatives, for example, we would be asking for this directly. Whether we could trust the responses we obtain from such direct inquiry is a different question.) Salganik & Levy propose a Bayesian method for estimating an "opinion" matrix $\boldsymbol{\theta} = \{ \theta_{i,a} \}$ from sets of observed choices. Here we let each individual's “opinion” of the authenticity of an alternative $a$ be decomposed as $$ \theta_{i,a} = \beta_a + \delta_{i,a} $$ That is, a shared opinion about authenticity $\beta_a$ plus some "person-specific effect" or "idiosyncratic deviation" $\delta_{i,a}$ from that shared opinion. We constrain $\sum_{i=1}^I \delta_{i,a} = 0$ for all $a$ to force the $\beta$ values to be "mean" effects, and include the identification constraints $$ \sum_{a=1}^A \beta_a = 0 \quad\quad\text{and}\quad\quad \sum_{a=1}^A \delta_{i,a} = 0 \;\; \text{for all} \;\; i $$ This results in the idLogit MLE problem $$ \begin{aligned} \min &\quad \frac{1}{N} \sum_{n=1}^N \log( 1 + e^{ y_n(\beta_{R_n} + \delta_{i_n,R_n} - \beta_{L_n} - \delta_{i_n,L_n}) } ) + \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^A \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IA} \\ \text{s.to} &\quad \sum_{a=1}^A \beta_a = 0 \; , \; \sum_{a=1}^A \delta_{i,a} = 0 \;\; \text{for all} \;\; i \; , \; \sum_{i=1}^I \delta_{i,a} = 0 \;\; \text{for all} \;\; a \\ \end{aligned} $$

Figure X traces the optimal (i.e., estimated) likelihood from the idLogit model for different values of $\Lambda_1$; we fix $\Lambda_2 = 0$ to focus on sparsity effects. The solid black line plots the optimal likelihood for the actual data, the red dashed line illustrates the likelihood of a null model (uniformly random guessing) for pairwise comparisons, and the envelope and dots plot the range of optimal likelihoods observed in bootstrapping. The lines in these envelopes show the centered 90% of values, while the dots show the mean $\pm$ one standard deviation.

For large values of $\Lambda_1$, the idLogit is essentially a simple Logit model, and fits the data only marginally better than a null model. As $\Lambda_1 \downarrow 0$, allowing deviations to assist in the fit, we approach a model that predicts the observed choices with around 80% likelihood in all conditions/pools except the short text/pool N experiment. Somewhere in between these extremes is probably a model with enough deviations to help explain the observed behavior without overfitting too much to the specific sample. For further analysis below, we choose three values of $\Lambda_1$ spanning this range of model fits. These values of $\Lambda_1$ are shown as dashed vertical lines; their numerical values are not particularly relevant.


Figure X: Optimal (estimated) likelihood for both the long and short text condition in both pools over different values of the penalty parameter.

Ranking the estimated shared opinions, $\boldsymbol{\beta}$, immediately gives us a ranking of alternatives. This may feel a bit unsatisfying though, in part because the mean opinions may not actually be anyone's opinions. They do, however, express population-level preferences. We can also provide a global ranking and score by expanding the probabilities to include all alternatives: $$ P_a = \frac{1}{I} \sum_{i=1}^I \frac{e^{\beta_a+\delta_{i,a}}}{ \sum_{b=1}^A e^{\beta_b + \delta_{i,b}}}. $$ This is the modeled probability that a randomly chosen respondent would choose alternative $a$ as the most authentic when shown a list of all alternatives. Should one feel strongly that "scores" should sum to one, and not want to arbitrarily normalize another score, this is probably a reasonable choice.
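
Computing these all-alternative probabilities from the estimates is a one-liner; a sketch, assuming `beta` has shape (A,) and `delta` has shape (I, A):

    import numpy as np
    from scipy.special import softmax

    def all_alternative_probs(beta, delta):
        """P_a: average over individuals of the softmax of theta_{i,.} = beta + delta_{i,.}."""
        return softmax(beta[None, :] + delta, axis=1).mean(axis=0)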

The literature on this sort of survey instrument already has a precedent though. Salganik & Levy propose a particular score to better identify the most preferred alternative:

“The ideal summary of the opinion matrix will likely vary from setting to setting, but our preferred summary statistic is what we call the score of each [alternative], $S_a$, the estimated chance that it will beat a randomly chosen [alternative] for a randomly chosen respondent.”

Note I change notation, in particular to reflect that this score is a random variable over the data (as is any estimator). Let's call this the Salganik-Levy score.

Let's define the Salganik-Levy score for our idLogit setting. Suppose we are focusing on alternative $a$ for scoring. Let $R_a$ be a uniformly distributed alternative that is not $a$ ($R_a \sim U[\{1,\dotsc,A\} \setminus \{a\}]$), so that $\mathbb{P}(R_a = b) = 1/(A-1)$ for all $b \neq a$. Then let $C_i(a,b)$ be individual $i$'s Bernoulli "choice" variable between alternatives $a$ and $b$ ($C_i(a,b) \sim B[P_i(a,b)]$), so that $\mathbb{P}(C_i(a,b) = a) = P_i(a,b)$ where $$ P_i(a,b) = \frac{e^{\beta_a+\delta_{i,a}}}{e^{\beta_a+\delta_{i,a}}+e^{\beta_b+\delta_{i,b}}}. $$ Then $$ S_a = \mathbb{P}( C(a,R_a) = a ) = \frac{1}{I} \sum_{i=1}^I \mathbb{P}( C_i(a,R_a) = a ) = \frac{1}{I(A-1)} \sum_{i=1}^I \sum_{b \neq a} P_i(a,b) $$ These scores lie in $[0,1]$; they can be scaled to percentages in $[0,100]$ by multiplying by $100$.
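
A direct sketch of computing these scores from idLogit estimates, again assuming `beta` has shape (A,) and `delta` has shape (I, A):

    import numpy as np

    def salganik_levy_scores(beta, delta):
        """S_a = (1 / (I (A-1))) * sum_i sum_{b != a} P_i(a, b)."""
        theta = beta[None, :] + delta
        I, A = theta.shape
        e = np.exp(theta - theta.max(axis=1, keepdims=True))   # stable; P_i(a,b) is shift-invariant
        P = e[:, :, None] / (e[:, :, None] + e[:, None, :])    # P[i, a, b] = P_i(a, b)
        P[:, np.arange(A), np.arange(A)] = 0.0                 # drop the b = a terms
        return P.sum(axis=(0, 2)) / (I * (A - 1))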

Figures X and X plot Salganik-Levy scores for the authenticity data over all conditions, rank ordered by the median value obtained through bootstrapping. The box plots show the range of scores obtained from models estimated on bootstrapped data, black dots (•) represent scores estimated from the actual data, and a “+” denotes the score estimated and displayed at allourideas.com. Alternatives are coded by number, listed on the vertical axis. The background color bars denote IE coding: “I” in white, “E” in light grey, and “I+E” in dark grey.


Figure X: Scores in the long-text treatment, both respondent pools, for specific values of $\Lambda_1$.



Figure X: Scores in the short-text treatment, both respondent pools, for specific values of $\Lambda_1$.

Washington Post's "Who Had the Worst Year in Washington" Data

We have also analyzed the (public) results of the Washington Post's "Who had the Worst Year in Washington?" wikisurvey (which we'll abbreviate as "WYIW"). Though the content is old news, it is an altogether different scale of data. Can idLogit analyze it?

To be specific: the WYIW data has 76,632 observations (votes), from either 14,671 or 4,116 respondents — depending on whether we take an individual respondent to be associated with a Session ID or a (hashed) IP address — choosing between 67 alternatives in total. We'll focus on the hashed IP version with 4,116 individuals. Respondents never choose "I can't decide."

This is a significantly larger dataset to model, with the idLogit MLE problem having 275,839 coefficients to estimate. Regardless, CVXPy/ECOS can estimate the idLogit model in a couple of minutes on a modern MacBook Pro laptop. ECOS, which can report its own solve times, takes between just under a minute and two and a half minutes, depending on the penalty parameter. More details are given below.

Figure X has a heatmap of the number of appearances of each alternative in the left and right positions in the wikisurvey. We use a relabeled, zero-based index defined by sorting the allourideas.com alternative indices in ascending order.


Figure X: Heatmap of the appearance count in the WYIW wikisurvey for all alternatives.

The alternative appearances are clearly not uniformly distributed. But we shouldn't really expect uniformity in the context of wikisurveys, at least when respondents are allowed to add alternatives themselves.

Figure X plots the rank-ordering, according to Salganik-Levy scores, of the alternatives shown.

Figure X: Caption goes here.

New York City's "PlaNYC 2030" Data

This wikisurvey was analyzed by Salganik & Levy, who describe it as follows:

[T]he New York City Mayor’s Office conducted a wiki survey in order to integrate residents’ ideas into the 2011 update to the City’s long-term sustainability plan. The wiki survey asked residents to contribute their ideas about how to create “a greener, greater New York City” and to vote on the ideas of others.

OECD "Raise Your Hand" Data

This wikisurvey was also analyzed by Salganik & Levy, who describe it as follows:

The OECD’s wiki survey was created in preparation for an Education Ministerial Meeting and an Education Policy Forum on “Investing in Skills for the 21st Century.” The OECD sought to bring fresh ideas from the public to these events in a democratic, transparent, and bottom-up way by seeking input from education stakeholders located around the globe. To accomplish these goals, the OECD created a wiki survey to allow respondents to contribute and vote on ideas about “the most important action we need to take in education today."

Including the No-Choice Option

Another distinction between idLogit and the Bayesian analysis paradigm proposed by Salganik & Levy is that the no-choice option is almost trivially easy to incorporate. Basically, the Logit style framework is probably much more easily generalized than Bayesian methods.

Deriving a Model with a No-Choice Option

A more general form of the idLogit MLE problem is $$ \begin{aligned} \min &\quad - \frac{1}{N} \sum_{n=1}^N \log P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta}) + \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^A \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IA} \\ \text{s.to} &\quad \sum_{i=1}^I \delta_{i,a} = 0 \;\; \text{for all} \;\; a \text{, plus identification constraints} \end{aligned} $$ where $c_n \in \{ L_n , R_n \}$ and $$ P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta}) = \frac{ e^{\beta_{c_n}+\delta_{i_n,c_n}} }{ e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} }. $$ We can easily write choice probabilities that include a no-choice option: if we let $c_n \in \{0,L_n,R_n\}$ for all $n$, where $c_n = 0$ denotes a choice of "I can't decide", we can take $$ P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta}) = \frac{ e^{\beta_{c_n}+\delta_{i_n,c_n}} }{ 1 + e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} } $$ where it is safe to presume that $\beta_0 = \delta_{i_n,0} = 0$. (This is, essentially, the "leave-one-out" strategy. It also isn't hard to show that, should we let $\beta_0 , \delta_{i_n,0} \neq 0$, the rest of the coefficients are not uniquely determined ("identified").) Then $$ \log P_{n,c_n}(\boldsymbol{\beta},\boldsymbol{\delta}) = \beta_{c_n} + \delta_{i_n,c_n} - \log \left( 1 + e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} \right) $$ This makes the MLE problem $$ \begin{aligned} \min &\quad \frac{1}{N} \sum_{n=1}^N \log \left( 1 + e^{\beta_{L_n}+\delta_{i_n,L_n}} + e^{\beta_{R_n}+\delta_{i_n,R_n}} \right) - \frac{1}{N} \sum_{n : c_n \neq 0} ( \beta_{c_n} + \delta_{i_n,c_n} ) + \frac{\Lambda_1}{N} || \boldsymbol{\delta} ||_1 + \frac{\Lambda_2}{2N} || \boldsymbol{\delta} ||_2^2 \\ \text{w.r.t.} &\quad \boldsymbol{\beta} \in \mathbb{R}^A \; , \; \boldsymbol{\delta} \in \mathbb{R}^{IA} \\ \text{s.to} &\quad \sum_{i=1}^I \delta_{i,a} = 0 \;\; \text{for all} \;\; a \text{, plus identification constraints} \end{aligned} $$

The result is a problem almost identical to the one introduced above; all we have to do is change the form of the logistic function and add a linear-in-coefficients term in the objective. Note though that we have already implicitly included an identification constraint in the form of setting the no-choice opinion to zero. Thus we don't need to include equivalent constraints explicitly in this model.

Estimating a Model with a No-Choice Option

These changes aren't too hard to make in a CVXPy-based implementation, but do change the overall problem definition:

    import cvxpy as cp
    import numpy as np

    # CVXPy objective
    def objective_fcn( N , A , wb , p , U , Lambda1 , Lambda2 ):
        return ( cp.sum( cp.log_sum_exp( U , axis=1 ) ) / N
                 - wb.T * p
                 + Lambda1/N * cp.norm( p[A:] , 1 )
                 + Lambda2/(2.0*N) * cp.sum_squares( p[A:] ) )

    # CVXPy variables...
    p = cp.Variable( (I+1)*A )
    U = cp.Variable( (N,3) )

    # CVXPy constraint object(s)
    constraints = [ S * p == np.zeros( S.shape[0] ) ,
                    U[:,0] == np.zeros(N) ,
                    U[:,1] - XL * p == np.zeros(N) ,
                    U[:,2] - XR * p == np.zeros(N) ]

    # CVXPy parameters
    Lambda1 , Lambda2 = cp.Parameter( nonneg=True ) , cp.Parameter( nonneg=True )

    # CVXPy problem
    prob = cp.Problem( cp.Minimize( objective_fcn(N,A,wb,p,U,Lambda1,Lambda2) ) , constraints )

With this, a solve attempt is as simple as:

    Lambda1.value , Lambda2.value = ... # set some specific values for the parameters
    prob.solve( solver="ECOS" , warm_start=True )

A few terms require explanation. We have introduced XL, XR, and wb. The first two are sparse matrices that "pick out" the appropriate left and right alternative coefficients in each observation. The third is the linear part of the objective, and can be simplified as follows: $$ \begin{aligned} \frac{1}{N} \sum_{n : c_n \neq 0} ( \beta_{c_n} + \delta_{i_n,c_n} ) &= \frac{1}{N} \sum_{a = 1}^A \sum_{ n : c_n = a } ( \beta_{a} + \delta_{i_n,a} ) \\ &= \frac{1}{N} \sum_{a = 1}^A \sum_{ n : c_n = a } \beta_{a} + \frac{1}{N} \sum_{a = 1}^A \sum_{i=1}^I \sum_{ n : i_n = i , c_n = a } \delta_{i,a} \\ &= \sum_{a = 1}^A \left( \frac{N_a}{N} \right) \beta_{a} + \sum_{a = 1}^A \sum_{i=1}^I \left(\frac{N_{i,a}}{N}\right) \delta_{i,a} \end{aligned} $$ where $N_a$ is the number of times alternative $a$ won, and $N_{i,a}$ is the number of times alternative $a$ won in a comparison shown to individual $i$. Of course, $N_a = \sum_{i=1}^I N_{i,a}$. The vector wb can be constructed from these ratios; it is dense in the first $A$ components, but may be sparse in the last $IA$ components because $N_{i,a}$ may (often) be zero.
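
A sketch of this construction, assuming choices are recorded as $c_n$ with labels 1 through $A$ and 0 for "I can't decide", and individuals are indexed from zero:

    import numpy as np

    def build_wb(choices, person, A, I, N):
        """wb[a] = N_a / N for the shared block; wb[A + i*A + a] = N_{i,a} / N for individual i."""
        wb = np.zeros((I + 1) * A)
        for c, i in zip(choices, person):
            if c == 0:                       # no-choice observations contribute nothing
                continue
            a = c - 1                        # zero-based alternative index
            wb[a] += 1.0 / N
            wb[A + i * A + a] += 1.0 / N
        return wb

In practice the deviation block would be stored sparsely, as noted above.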

Note also that we introduce some new "variables" U, defined as an $N \times 3$ matrix. These "variables" are fully constrained: our new constraints state that (i) the first column of U is exactly zero, (ii) the second column of U is equal to XL * p (i.e., has $n^{\text{th}}$ entry equal to $\beta_{L_n} + \delta_{i_n,L_n}$), and (iii) the third column of U is equal to XR * p (i.e., has $n^{\text{th}}$ entry equal to $\beta_{R_n} + \delta_{i_n,R_n}$). It may seem silly to introduce a lot of variables only to constrain them all. However, this form allows us to easily write the objective in a "disciplined" way using CVXPy primitives, and the addition of linear constraints, especially when highly sparse, doesn't really impede a good solver's stability or speed. There is no reason we couldn't have written the original problem in this way, either. Ignoring the no-choice option does result in a slightly simpler formulation, though, which we might as well take advantage of.

Authentic Distilleries

Worst Year in Washington

PlaNYC 2030

Raise Your Hand

Does the Model Matter?

A very important question has to be asked: does it matter what model we pick, or how we estimate it? Obviously we can address this question by comparing the ranks or scores obtained by different methods. Ranks, being discrete, are probably — rather, hopefully — fairly stable over small changes to the modeling approach. Instability, meaning dramatically changing rankings from small changes to the model, would probably say more about the poor quality of the modeling approach than about the data or the underlying behavior it describes. Scores are likely to be more sensitive to our modeling choices, and thus present a better opportunity for cross-model comparisons. We could compare scoring from Salganik & Levy's Bayesian method, a simple Logit model, the idLogit model, or even an idLogit model with the no-choice option. We can compare rankings of scores from Salganik & Levy's Bayesian method to coefficient value rankings from a Logit or idLogit model, with or without the no-choice option (equivalent to rankings from all-alternative choice probabilities with those models).

Moreover, to what degree does modeling matter at all? If the number of appearances of each alternative is roughly uniformly distributed, the alternative with the most "votes" or "wins" is the top-ranked alternative. We can also compare any of these models to a simple measure readily available from the data itself, even if the appearances are not uniformly distributed: empirical win frequency (EWF). The EWF of an alternative, $F_a$, is the number of times alternative $a$ wins divided by the number of times it appears. This is an essentially trivial computation given the data, and can itself be considered a score that implies a ranking. Generally speaking, it is good practice to always compare modeling to the simplest available alternatives.
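
Computing the EWF really is trivial; a sketch, assuming zero-based alternative labels and that every alternative appears at least once:

    import numpy as np

    def empirical_win_frequency(L, R, y, A):
        """F_a = (# times a wins) / (# times a appears); y is +1 (left wins) or -1."""
        wins = np.bincount(np.where(y == 1, L, R), minlength=A).astype(float)
        appearances = np.bincount(L, minlength=A) + np.bincount(R, minlength=A)
        return wins / appearances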

Figure X compares scores for the Carroll-Verhal “Authentic Distilleries” data to empirical win frequencies. The black dots (•) plot scores from the idLogit model and a blue plus (+) plots scores from allourideas.com. The inference, whether we like it or not, is that at least these two models offer essentially no greater (pointwise) insight into alternative comparisons than simply computing the EWF.



Figure X: Comparison of estimated scores to empirical win frequency for the all conditions in the Carroll-Verhal “Authentic Distilleries” data.

It should be possible to analyze this situation. For example, it is well known that the maximum-likelihood Logit model recreates the observed, average choice shares of different options. Surely there are further specializations more informative to the case of pairwise comparisons, generalizations to the idLogit model, and similar results related to empirical win frequencies instead of average shares. Building all this isn't a small task, though, so we'll leave it for another article.

Conclusions

TBD

A General Method for Creating Identification Constraints

Suppose we have a matrix $\triangle$ whose rows are $\mathbf{x}_{R_n}-\mathbf{x}_{L_n}$. If this matrix is not full rank, or even not numerically full rank (to some tolerance), we probably need some kind of identification constraint. $\triangle$ has a singular value decomposition $\triangle = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, from which we can derive $$ \boldsymbol{\Omega} = \triangle^\top\triangle = \mathbf{V}\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}\mathbf{V}^\top $$ The smallest eigenvalues of $\boldsymbol{\Omega}$ are the squares of the smallest singular values of $\triangle$, and the associated eigenvectors are right singular vectors of $\triangle$, living in the same space as $\boldsymbol{\beta}$. Say we estimate the smallest eigenvalues, $\lambda$, and the associated eigenvector(s) $\mathbf{v}$, of $\boldsymbol{\Omega}$, especially those eigenvectors with eigenvalue zero. Then we should impose an identification constraint $\mathbf{v}^\top\boldsymbol{\beta} = 0$ for all such vectors $\mathbf{v}$. We can do this in practice by taking an SVD, if that is feasible. If an SVD is not feasible we could first take the "proper" nullspace using a QR factorization of $\boldsymbol{\Omega}$, and impose that as a set of identification constraints. We can then progressively estimate the smallest eigenvalues of $\boldsymbol{\Omega}$ with eigenvectors orthogonal to that proper nullspace, imposing those vectors as identification constraints as long as the eigenvalues are sufficiently small to warrant concern.
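
A sketch of this recipe using a plain (dense) SVD of the difference matrix, for cases where that is feasible; the tolerance here is an assumption of the sketch, not a recommendation.

    import numpy as np

    def identification_constraints(dX, tol=1e-8):
        """Rows of the returned matrix are vectors v for constraints v' beta = 0;
        dX has rows x_{R_n} - x_{L_n}."""
        _, s, Vt = np.linalg.svd(dX, full_matrices=True)
        s = np.concatenate([s, np.zeros(Vt.shape[0] - len(s))])   # pad for K > N
        return Vt[s <= tol * max(s.max(), 1.0)]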

Article Changelog

Here we keep a log of the changes made to this article based on post-publication feedback. Submit an issue on gitlab, and if I fix it in the article I'll record the change here.