Abstract:AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
Abstract:Sortition, the random selection of political representatives, is increasingly being used around the world to choose participants of deliberative processes like Citizens' Assemblies. Motivated by sortition's practical importance, there has been a recent flurry of research on sortition algorithms, whose task it is to select a panel from among a pool of volunteers. This panel must satisfy quotas enforcing representation of key population subgroups. Past work has contributed an algorithmic approach for fulfilling this task while ensuring that volunteers' chances of selection are maximally equal, as measured by any convex equality objective. The question, then, is: which equality objective is the right one? Past work has mainly studied the objectives Minimax and Leximin, which respectively minimize the maximum and maximize the minimum chance of selection given to any volunteer. Recent work showed that both of these objectives have key weaknesses: Minimax is highly robust to manipulation but is arbitrarily unfair; oppositely, Leximin is highly fair but arbitrarily manipulable. In light of this gap, we propose a new equality objective, Goldilocks, that aims to achieve these ideals simultaneously by ensuring that no volunteer receives too little or too much chance of selection. We theoretically bound the extent to which Goldilocks achieves these ideals, finding that in an important sense, Goldilocks recovers among the best available solutions in a given instance. We then extend our bounds to the case where the output of Goldilocks is transformed to achieve a third goal, Transparency. Our empirical analysis of Goldilocks in real data is even more promising: we find that this objective achieves nearly instance-optimal minimum and maximum selection probabilities simultaneously in most real instances -- an outcome not even guaranteed to be possible for any algorithm.
Abstract:A common explanation for negative user impacts of content recommender systems is misalignment between the platform's objective and user welfare. In this work, we show that misalignment in the platform's objective is not the only potential cause of unintended impacts on users: even when the platform's objective is fully aligned with user welfare, the platform's learning algorithm can induce negative downstream impacts on users. The source of these user impacts is that different pieces of content may generate observable user reactions (feedback information) at different rates; these feedback rates may correlate with content properties, such as controversiality or demographic similarity of the creator, that affect the user experience. Since differences in feedback rates can impact how often the learning algorithm engages with different content, the learning algorithm may inadvertently promote content with certain such properties. Using the multi-armed bandit framework with probabilistic feedback, we examine the relationship between feedback rates and a learning algorithm's engagement with individual arms for different no-regret algorithms. We prove that no-regret algorithms can exhibit a wide range of dependencies: if the feedback rate of an arm increases, some no-regret algorithms engage with the arm more, some no-regret algorithms engage with the arm less, and other no-regret algorithms engage with the arm approximately the same number of times. From a platform design perspective, our results highlight the importance of looking beyond regret when measuring an algorithm's performance, and assessing the nature of a learning algorithm's engagement with different types of content as well as their resulting downstream impacts.