Hieu Minh "Jord" Nguyen

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Mar 17, 2025

Kenneth J. K. Ong, Lye Jia Jun, Hieu Minh "Jord" Nguyen, Seong Hah Cho, Natalia Pérez-Campanero Antolín

Abstract: As Large Language Models (LLMs) gain autonomous capabilities, their coordination in multi-agent settings becomes increasingly important. However, they often struggle with cooperation, leading to suboptimal outcomes. Inspired by Axelrod's Iterated Prisoner's Dilemma (IPD) tournaments, we explore how personality traits influence LLM cooperation. Using representation engineering, we steer Big Five traits (e.g., Agreeableness, Conscientiousness) in LLMs and analyze their impact on IPD decision-making. Our results show that higher Agreeableness and Conscientiousness improve cooperation but increase susceptibility to exploitation, highlighting both the potential and limitations of personality-based steering for aligning AI agents.

* Poster, Technical AI Safety Conference 2025

DarkBench: Benchmarking Dark Patterns in Large Language Models

Mar 13, 2025

Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz

Abstract: We introduce DarkBench, a comprehensive benchmark for detecting dark design patterns--manipulative techniques that influence user behavior--in interactions with large language models (LLMs). Our benchmark comprises 660 prompts across six categories: brand bias, user retention, sycophancy, anthropomorphism, harmful generation, and sneaking. We evaluate models from five leading companies (OpenAI, Anthropic, Meta, Mistral, Google) and find that some LLMs are explicitly designed to favor their developers' products and exhibit untruthful communication, among other manipulative behaviors. Companies developing LLMs should recognize and mitigate the impact of dark design patterns to promote more ethical AI.

* Accepted as an Oral paper at ICLR 2025

A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks

Feb 10, 2025

Hieu Minh "Jord" Nguyen

Abstract: Theory of Mind (ToM), the ability to attribute mental states to others and predict their behaviour, is fundamental to social intelligence. In this paper, we survey studies evaluating behavioural and representational ToM in Large Language Models (LLMs), identify important safety risks from advanced LLM ToM capabilities, and suggest several research directions for effective evaluation and mitigation of these risks.

* Advancing Artificial Intelligence through Theory of Mind Workshop, AAAI 2025
