Chatrik Singh Mangat

From Stability to Inconsistency: A Study of Moral Preferences in LLMs

Apr 08, 2025

Monika Jotautaite, Mary Phuong, Chatrik Singh Mangat, Maria Angelica Martinez

Abstract: As large language models (LLMs) increasingly integrate into our daily lives, it becomes crucial to understand their implicit biases and moral tendencies. To address this, we introduce a Moral Foundations LLM dataset (MFD-LLM) grounded in Moral Foundations Theory, which conceptualizes human morality through six core foundations. We propose a novel evaluation method that captures the full spectrum of LLMs' revealed moral preferences by having them answer a range of real-world moral dilemmas. Our findings reveal that state-of-the-art models have remarkably homogeneous value preferences, yet demonstrate a lack of consistency.

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research

Mar 29, 2025

Gabriel Recchia, Chatrik Singh Mangat, Issac Li, Gayatri Krishnakumar

Abstract: As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.

* 43 pages, 3 figures. For the associated repository, see https://github.com/modulo-research/findtheflaws

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024

Jett Janiak, Jacek Karwowski, Chatrik Singh Mangat, Giorgi Giglemiani, Nora Petrova, Stefan Heimersheim

Abstract: We identify "stable regions" in the residual stream of Transformers, where the model's output remains insensitive to small activation changes, but exhibits high sensitivity at region boundaries. These regions emerge during training and become more defined as training progresses or model size increases. The regions appear to be much larger than previously studied polytopes. Our analysis suggests that these stable regions align with semantic distinctions, where similar prompts cluster within regions, and activations from the same region lead to similar next token predictions. This work provides a promising research direction for understanding the complexity of neural networks, shedding light on training dynamics, and advancing interpretability.
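
The idea of a stable region can be illustrated with a toy sketch. It is not the paper's code or necessarily its experimental setup: `toy_model` is a hypothetical stand-in for the layers downstream of the residual stream, and linearly interpolating between two activations while tracking the output distribution is an assumed probing procedure. Under those assumptions, the output barely moves inside a region and changes sharply near the boundary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(activation, sharpness=50.0):
    """Hypothetical stand-in for the layers after the residual stream.

    Maps a 2-d "activation" to a distribution over two tokens. The large
    `sharpness` makes the output nearly constant away from the line x == y,
    which plays the role of a region boundary in this toy.
    """
    x, y = activation
    return softmax([sharpness * x, sharpness * y])


def interpolation_profile(model, act_a, act_b, steps=101):
    """Interpolate from act_a to act_b and record how far the output
    distribution has moved from the output at act_a (total variation)."""
    out_a = model(act_a)
    profile = []
    for i in range(steps):
        alpha = i / (steps - 1)
        act = [(1 - alpha) * a + alpha * b for a, b in zip(act_a, act_b)]
        out = model(act)
        dist = 0.5 * sum(abs(p - q) for p, q in zip(out, out_a))
        profile.append((alpha, dist))
    return profile


def steepest_jump(profile):
    """Return (alpha, increase) for the largest single-step rise in distance."""
    best = max(range(1, len(profile)),
               key=lambda i: profile[i][1] - profile[i - 1][1])
    return profile[best][0], profile[best][1] - profile[best - 1][1]


# Two activations assumed to lie in different regions of the toy model.
profile = interpolation_profile(toy_model, [1.0, 0.0], [0.0, 1.0])
jump_alpha, jump_size = steepest_jump(profile)
```

In this toy, the distance stays close to zero for the first part of the path, rises steeply around the midpoint, and then saturates near one. A real experiment would replace `toy_model` with a forward pass from a chosen layer of an actual Transformer.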
