Abstract: We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.
Abstract: With the growing popularity of dialogue agents based on large language models (LLMs), urgent attention has been drawn to finding ways to ensure their behaviour is ethical and appropriate. These concerns are largely interpreted in terms of the 'HHH' criteria: making outputs more helpful and honest, and avoiding harmful (biased, toxic, or inaccurate) statements. Whilst this semantic focus is useful from the perspective of viewing LLM agents as mere mediums for information, it fails to account for pragmatic factors that can make the same utterance seem more or less offensive or tactless in different social situations. We propose an approach to ethics that is more centred on relational and situational factors, exploring what it means for a system, as a social actor, to treat an individual respectfully in a (series of) interaction(s). Our work anticipates a set of largely unexplored risks at the level of situated interaction, and offers practical suggestions to help LLM technologies behave as 'good' social actors and treat people respectfully.
Abstract: The increasing sophistication of NLP models has renewed optimism regarding machines achieving a full human-like command of natural language. Whilst work in NLP/NLU may have made great strides in that direction, the lack of conceptual clarity in how 'understanding' is used in this and other disciplines has made it difficult to discern how close we actually are. A critical, interdisciplinary review of current approaches and remaining challenges has yet to be carried out. Beyond linguistic knowledge, this requires considering our species-specific capabilities to categorize, memorize, label and communicate our (sufficiently similar) embodied and situated experiences. Moreover, gauging the practical constraints requires critically analyzing the technical capabilities of current models, as well as deeper philosophical reflection on theoretical possibilities and limitations. In this paper, I unite all of these perspectives -- the philosophical, cognitive-linguistic, and technical -- to unpack the challenges involved in approaching true (human-like) language understanding. By unpacking the theoretical assumptions inherent in current approaches, I hope to illustrate how far we actually are from achieving this goal, if indeed it is the goal.