Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Arslan Basharat

Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Nov 17, 2025

Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

Figure 1 for Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Figure 2 for Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Figure 3 for Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Figure 4 for Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Abstract:The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.

* Accepted to AAAI 2026 AI Alignment Track

Via

Access Paper or Ask Questions

Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression

Aug 11, 2025

Jadie Adams, Brian Hu, Emily Veenhuis, David Joy, Bharadwaj Ravichandran, Aaron Bray, Anthony Hoogs, Arslan Basharat

Abstract:Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.

* AIES '25: Proceedings of the 2025 AAAI/ACM Conference on AI, Ethics, and Society

Via

Access Paper or Ask Questions

Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection

Mar 18, 2025

Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu

Figure 1 for Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection

Figure 2 for Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection

Figure 3 for Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection

Figure 4 for Personalized Attacks of Social Engineering in Multi-turn Conversations -- LLM Agents for Simulation and Detection

Abstract:The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victims personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

Via

Access Paper or Ask Questions

Defending Against Social Engineering Attacks in the Age of LLMs

Jun 18, 2024

Lin Ai, Tharindu Kumarage, Amrita Bhattacharjee, Zizhou Liu, Zheng Hui, Michael Davinroy, James Cook, Laura Cassani, Kirill Trapeznikov, Matthias Kirchner(+5 more)

Figure 1 for Defending Against Social Engineering Attacks in the Age of LLMs

Figure 2 for Defending Against Social Engineering Attacks in the Age of LLMs

Figure 3 for Defending Against Social Engineering Attacks in the Age of LLMs

Figure 4 for Defending Against Social Engineering Attacks in the Age of LLMs

Abstract:The proliferation of Large Language Models (LLMs) poses challenges in detecting and mitigating digital deception, as these models can emulate human conversational patterns and facilitate chat-based social engineering (CSE) attacks. This study investigates the dual capabilities of LLMs as both facilitators and defenders against CSE threats. We develop a novel dataset, SEConvo, simulating CSE scenarios in academic and recruitment contexts, and designed to examine how LLMs can be exploited in these situations. Our findings reveal that, while off-the-shelf LLMs generate high-quality CSE content, their detection capabilities are suboptimal, leading to increased operational costs for defense. In response, we propose ConvoSentinel, a modular defense pipeline that improves detection at both the message and the conversation levels, offering enhanced adaptability and cost-effectiveness. The retrieval-augmented module in ConvoSentinel identifies malicious intent by comparing messages to a database of similar conversations, enhancing CSE detection at all stages. Our study highlights the need for advanced strategies to leverage LLMs in cybersecurity.

Via

Access Paper or Ask Questions

Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Jun 10, 2024

Brian Hu, Bill Ray, Alice Leung, Amy Summerville, David Joy, Christopher Funk, Arslan Basharat

Figure 1 for Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Figure 2 for Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Figure 3 for Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Figure 4 for Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain

Abstract:In difficult decision-making scenarios, it is common to have conflicting opinions among expert human decision-makers as there may not be a single right answer. Such decisions may be guided by different attributes that can be used to characterize an individual's decision. We introduce a novel dataset for medical triage decision-making, labeled with a set of decision-maker attributes (DMAs). This dataset consists of 62 scenarios, covering six different DMAs, including ethical principles such as fairness and moral desert. We present a novel software framework for human-aligned decision-making by utilizing these DMAs, paving the way for trustworthy AI with better guardrails. Specifically, we demonstrate how large language models (LLMs) can serve as ethical decision-makers, and how their decisions can be aligned to different DMAs using zero-shot prompting. Our experiments focus on different open-source models with varying sizes and training techniques, such as Falcon, Mistral, and Llama 2. Finally, we also introduce a new form of weighted self-consistency that improves the overall quantified performance. Our results provide new research directions in the use of LLMs as alignable decision-makers. The dataset and open-source software are publicly available at: https://github.com/ITM-Kitware/llm-alignable-dm.

* 15 pages total (including appendix), NAACL 2024 Industry Track

Via

Access Paper or Ask Questions

A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

Nov 27, 2018

Chengjiang Long, Arslan Basharat, Anthony Hoogs

Figure 1 for A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

Figure 2 for A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

Figure 3 for A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

Figure 4 for A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

Abstract:Frame duplication is to duplicate a sequence of consecutive frames and insert or replace to conceal or imitate a specific event/content in the same source video. To automatically detect the duplicated frames in a manipulated video, we propose a coarse-to-fine deep convolutional neural network framework to detect and localize the frame duplications. We first run an I3D network to obtain the most candidate duplicated frame sequences and selected frame sequences, and then run a Siamese network with ResNet network to identify each pair of a duplicated frame and the corresponding selected frame. We also propose a heuristic strategy to formulate the video-level score. We then apply our inconsistency detector fine-tuned on the I3D network to distinguish duplicated frames from selected frames. With the experimental evaluation conducted on two video datasets, we strongly demonstrate that our proposed method outperforms the current state-of-the-art methods.

Via

Access Paper or Ask Questions