Ashton Anderson

SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Aug 25, 2025

Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson

Abstract: Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.

* COLM 2025
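
As a rough illustration of what per-modality accuracy and "cross-modal agreement" could look like in practice, the sketch below scores a set of items that were each answered twice, once from the text notation and once from the image. The record layout (`gold`, `text_answer`, `image_answer`) and the function name `modality_report` are hypothetical; the abstract does not state how SEAM computes its metrics, so treat this as one plausible reading rather than the benchmark's definition.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """One benchmark item answered once per modality (hypothetical layout)."""
    gold: str
    text_answer: str
    image_answer: str


def modality_report(results):
    """Return per-modality accuracy and the share of items with matching answers."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    text_acc = sum(r.text_answer == r.gold for r in results) / n
    image_acc = sum(r.image_answer == r.gold for r in results) / n
    # Agreement compares the two answers to each other, not to the gold label,
    # so two identical wrong answers still count as agreeing.
    agreement = sum(r.text_answer == r.image_answer for r in results) / n
    return {"text_acc": text_acc, "image_acc": image_acc, "agreement": agreement}


# Toy data, invented for illustration only.
results = [
    PairedResult(gold="A", text_answer="A", image_answer="A"),
    PairedResult(gold="B", text_answer="B", image_answer="C"),
    PairedResult(gold="C", text_answer="D", image_answer="D"),
    PairedResult(gold="D", text_answer="D", image_answer="A"),
]
report = modality_report(results)
print(report)
```

Under this reading, agreement is independent of correctness: a model can agree with itself across modalities while being wrong in both, which is why it is reported alongside accuracy rather than instead of it.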

Maia-2: A Unified Model for Human-AI Alignment in Chess

Sep 30, 2024

Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

Abstract: There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior. This introduces the possibility of algorithmically-informed teaching in these domains through more relatable AI partners and deeper insights into human decision-making. Critical to achieving this goal, however, is coherently modeling human behavior at various skill levels. Chess is an ideal model system for conducting research into this kind of human-AI alignment, with its rich history as a pivotal testbed for AI research, mature superhuman AI systems like AlphaZero, and precise measurements of skill via chess rating systems. Previous work in modeling human decision-making in chess uses completely independent models to capture human style at different skill levels, meaning they lack coherence in their ability to adapt to the full spectrum of human improvement and are ultimately limited in their effectiveness as AI partners and teaching tools. In this work, we propose a unified modeling approach for human-AI alignment in chess that coherently captures human style across different skill levels and directly captures how people improve. Recognizing the complex, non-linear nature of human learning, we introduce a skill-aware attention mechanism to dynamically integrate players' strengths with encoded chess positions, enabling our model to be sensitive to evolving player skill. Our experimental results demonstrate that this unified framework significantly enhances the alignment between AI and human players across a diverse range of expertise levels, paving the way for deeper insights into human decision-making and AI-guided teaching tools.

* Accepted at NeurIPS 2024

Designing Skill-Compatible AI: Methodologies and Frameworks in Chess

May 08, 2024

Karim Hamade, Reid McIlroy-Young, Siddhartha Sen, Jon Kleinberg, Ashton Anderson

Abstract: Powerful artificial intelligence systems are often used in settings where they must interact with agents that are computationally much weaker, for example when they work alongside humans or operate in complex environments where some tasks are handled by algorithms, heuristics, or other entities of varying computational power. For AI agents to successfully interact in these settings, however, achieving superhuman performance alone is not sufficient; they also need to account for suboptimal actions or idiosyncratic style from their less-skilled counterparts. We propose a formal evaluation framework for assessing the compatibility of near-optimal AI with interaction partners who may have much lower levels of skill; we use popular collaborative chess variants as model systems to study and develop AI agents that can successfully interact with lower-skill entities. Traditional chess engines designed to output near-optimal moves prove to be inadequate partners when paired with engines of various lower skill levels in this domain, as they are not designed to consider the presence of other agents. We contribute three methodologies to explicitly create skill-compatible AI agents in complex decision-making settings, and two chess game frameworks designed to foster collaboration between powerful AI agents and less-skilled partners. On these frameworks, our agents outperform state-of-the-art chess AI (based on AlphaZero) despite being weaker in conventional chess, demonstrating that skill-compatibility is a tangible trait that is qualitatively and measurably distinct from raw performance. Our evaluations further explore and clarify the mechanisms by which our agents achieve skill-compatibility.

* 18 pages, 5 figures, 15 tables. Published in the Twelfth International Conference on Learning Representations (ICLR 2024)

ICL Markup: Structuring In-Context Learning using Soft-Token Tags

Dec 12, 2023

Marc-Etienne Brunet, Ashton Anderson, Richard Zemel

Abstract: Large pretrained language models (LLMs) can be rapidly adapted to a wide variety of tasks via a text-to-text approach, where the instruction and input are fed to the model in natural language. Combined with in-context learning (ICL), this paradigm is impressively flexible and powerful. However, it also burdens users with an overwhelming number of choices, many of them arbitrary. Inspired by markup languages like HTML, we contribute a method of using soft-token tags to compose prompt templates. This approach reduces arbitrary decisions and streamlines the application of ICL. Our method is a form of meta-learning for ICL; it learns these tags in advance during a parameter-efficient fine-tuning "warm-up" process. The tags can subsequently be used in templates for ICL on new, unseen tasks without any additional fine-tuning. Our experiments with this approach yield promising initial results, improving LLM performance on important enterprise applications such as few-shot and open-world intent detection, as well as text classification in news and legal domains.

* R0-FoMo: Workshop on Robustness of Few-shot and Zero-shot Learning in Foundation Models at NeurIPS 2023

Sparsify-then-Classify: From Internal Neurons of Large Language Models To Efficient Text Classifiers

Nov 27, 2023

Yilun Liu, Difan Jiao, Ashton Anderson

Abstract: Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. However, existing approaches for applying pretrained LLMs to text classification predominantly rely on using single token outputs from only the last layer of hidden states. As a result, they suffer from limitations in efficiency, task-specificity, and interpretability. In our work, we contribute an approach that uses all internal representations by employing multiple pooling strategies on all activation and hidden states. Our novel lightweight strategy, Sparsify-then-Classify (STC), first sparsifies task-specific features layer-by-layer, then aggregates across layers for text classification. STC can be applied as a seamless plug-and-play module on top of existing LLMs. Our experiments on a comprehensive set of models and datasets demonstrate that STC not only consistently improves the classification performance of pretrained and fine-tuned models, but is also more efficient for both training and inference, and is more intrinsically interpretable.

* 23 pages, 5 figures, 8 tables. Code available at https://github.com/difanj0713/Sparsify-then-Classify

Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

May 24, 2023

Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

Abstract: Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.

* 8 pages

Detecting Individual Decision-Making Style: Exploring Behavioral Stylometry in Chess

Aug 02, 2022

Reid McIlroy-Young, Russell Wang, Siddhartha Sen, Jon Kleinberg, Ashton Anderson

Abstract: The advent of machine learning models that surpass human decision-making ability in complex domains has initiated a movement towards building AI systems that interact with humans. Many building blocks are essential for this activity, with a central one being the algorithmic characterization of human behavior. While much of the existing work focuses on aggregate human behavior, an important long-range goal is to develop behavioral models that specialize to individual people and can differentiate among them. To formalize this process, we study the problem of behavioral stylometry, in which the task is to identify a decision-maker from their decisions alone. We present a transformer-based approach to behavioral stylometry in the context of chess, where one attempts to identify the player who played a set of games. Our method operates in a few-shot classification framework, and can correctly identify a player from among thousands of candidate players with 98% accuracy given only 100 labeled games. Even when trained on amateur play, our method generalises to out-of-distribution samples of Grandmaster players, despite the dramatic differences between amateur and world-class players. Finally, we consider more broadly what our resulting embeddings reveal about human style in chess, as well as the potential ethical implications of powerful methods for identifying individuals from behavioral data.

* 23 pages, 7 figures, 9 tables, In Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

Mimetic Models: Ethical Implications of AI that Acts Like You

Jul 19, 2022

Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Solon Barocas, Ashton Anderson

Abstract: An emerging theme in artificial intelligence research is the creation of models to simulate the decisions and behavior of specific people, in domains including game-playing, text generation, and artistic expression. These models go beyond earlier approaches in the way they are tailored to individuals, and the way they are designed for interaction rather than simply the reproduction of fixed, pre-computed behaviors. We refer to these as mimetic models, and in this paper we develop a framework for characterizing the ethical and social issues raised by their growing availability. Our framework includes a number of distinct scenarios for the use of such models, and considers the impacts on a range of different participants, including the target being modeled, the operator who deploys the model, and the entities that interact with it.

* In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society (AIES'22), August 1-3, 2022, Oxford, United Kingdom

Community embeddings reveal large-scale cultural organization of online platforms

Oct 02, 2020

Isaac Waller, Ashton Anderson

Abstract: Optimism about the Internet's potential to bring the world together has been tempered by concerns about its role in inflaming the 'culture wars'. Via mass selection into like-minded groups, online society may be becoming more fragmented and polarized, particularly with respect to partisan differences. However, our ability to measure the cultural makeup of online communities, and in turn understand the cultural structure of online platforms, is limited by the pseudonymous, unstructured, and large-scale nature of digital discussion. Here we develop a neural embedding methodology to quantify the positioning of online communities along cultural dimensions by leveraging large-scale patterns of aggregate behaviour. Applying our methodology to 4.8B Reddit comments made in 10K communities over 14 years, we find that the macro-scale community structure is organized along cultural lines, and that relationships between online cultural concepts are more complex than simply reflecting their offline analogues. Examining political content, we show Reddit underwent a significant polarization event around the 2016 U.S. presidential election, and remained highly polarized for years afterward. Contrary to conventional wisdom, however, instances of individual users becoming more polarized over time are rare; the majority of platform-level polarization is driven by the arrival of new and newly political users. Our methodology is broadly applicable to the study of online culture, and our findings have implications for the design of online platforms, understanding the cultural contexts of online content, and quantifying cultural shifts in online behaviour.

* 44 pages, 21 figures
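
One common way to position embedded items along a named dimension is to build an axis from the difference between two contrasting seed vectors and score every other vector by its cosine similarity to that axis. The sketch below illustrates that general idea on toy two-dimensional vectors. It is an assumption about how such a projection might work, not the authors' method: the abstract does not describe the construction, and the function names, seed poles, and vectors here are all invented for illustration.

```python
import math


def subtract(u, v):
    """Element-wise difference u - v of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def dimension_score(community_vec, pole_a, pole_b):
    """Score a community along the axis pointing from pole_a to pole_b.

    Positive values lean toward pole_b, negative values toward pole_a.
    """
    axis = subtract(pole_b, pole_a)
    return cosine(community_vec, axis)


# Hypothetical seed communities defining the two ends of a dimension.
pole_a = [1.0, 0.0]
pole_b = [0.0, 1.0]

score_b_like = dimension_score([0.0, 2.0], pole_a, pole_b)
score_a_like = dimension_score([3.0, 0.0], pole_a, pole_b)
score_neutral = dimension_score([1.0, 1.0], pole_a, pole_b)
print(score_b_like, score_a_like, score_neutral)
```

Because cosine similarity ignores vector length, the score reflects direction only: a large community and a small one with the same orientation receive the same value.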

Adoption of Twitter's New Length Limit: Is 280 the New 140?

Sep 16, 2020

Kristina Gligorić, Ashton Anderson, Robert West

Abstract: In November 2017, Twitter doubled the maximum allowed tweet length from 140 to 280 characters, a drastic switch on one of the world's most influential social media platforms. In the first long-term study of how the new length limit was adopted by Twitter users, we ask: Does the effect of the new length limit resemble that of the old one? Or did the doubling of the limit fundamentally change how Twitter is shaped by the limited length of posted content? By analyzing Twitter's publicly available 1% sample over a period of around 3 years, we find that, when the length limit was raised from 140 to 280 characters, the prevalence of tweets around 140 characters dropped immediately, while the prevalence of tweets around 280 characters rose steadily for about 6 months. Despite this rise, tweets approaching the length limit have been far less frequent after than before the switch. We find widely different adoption rates across languages and client-device types. The prevalence of tweets around 140 characters before the switch in a given language is strongly correlated with the prevalence of tweets around 280 characters after the switch in the same language, and very long tweets are vastly more popular on Web clients than on mobile clients. Moreover, tweets of around 280 characters after the switch are syntactically and semantically similar to tweets of around 140 characters before the switch, manifesting patterns of message squeezing in both cases. Taken together, these findings suggest that the new 280-character limit constitutes a new, less intrusive version of the old 140-character limit. The length limit remains an important factor that should be considered in all studies using Twitter data.
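
The "prevalence of tweets around 140 characters" can be read as the share of tweets whose length falls within a small window just below the limit. The sketch below computes that share for toy length samples from before and after the switch. The 10-character window, the function name, and the sample lengths are assumptions made for illustration; the abstract does not define what counts as "around" the limit.

```python
def prevalence_near_limit(lengths, limit, window=10):
    """Fraction of tweets whose length lies in [limit - window, limit].

    The default window of 10 characters is an arbitrary illustrative choice.
    """
    if not lengths:
        raise ValueError("need at least one tweet length")
    near = sum(1 for n in lengths if limit - window <= n <= limit)
    return near / len(lengths)


# Invented character counts, not real Twitter data.
before_switch = [35, 80, 139, 140, 140, 132, 60, 138]
after_switch = [35, 80, 139, 275, 280, 132, 60, 200]

near_140_before = prevalence_near_limit(before_switch, limit=140)
near_140_after = prevalence_near_limit(after_switch, limit=140)
near_280_after = prevalence_near_limit(after_switch, limit=280)
print(near_140_before, near_140_after, near_280_after)
```

Comparing the share near 140 before the switch with the share near 280 afterwards is the kind of contrast the abstract summarizes when it says tweets approaching the limit became less frequent.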
