Seong Hah Cho

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Mar 17, 2025

Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

Abstract: As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.

* Poster, Technical AI Safety Conference 2025

Inducing Human-like Biases in Moral Reasoning Language Models

Nov 23, 2024

Artem Karpov, Seong Hah Cho, Austin Meek, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea

Abstract: In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore if fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.

* Accepted to the 2nd Workshop on Unifying Representations in Neural Models (UniReps) at NeurIPS 2024
