Title: Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes

Dec 18, 2024

Katarzyna Kobalczyk, Claudio Fanconi, Hao Sun, Mihaela van der Schaar

Abstract: As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
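
For readers unfamiliar with the Bradley-Terry-Luce model mentioned in the abstract, the sketch below shows its standard pairwise form, in which the probability that response A is preferred to response B is the logistic function of their reward difference. The user-conditioned part is an illustrative assumption only and is not taken from the paper: `user_reward`, the feature scores, and the per-user weights are hypothetical stand-ins for the unobserved factors that make preferences heterogeneous.

```python
import math


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Standard Bradley-Terry-Luce probability that response A is preferred to B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def user_reward(features, user_weights):
    """Hypothetical user-conditioned reward: a weighted sum of response features.

    This is an illustrative assumption, not the paper's reward model.
    """
    return sum(w * f for w, f in zip(user_weights, features))


# Hypothetical feature scores (helpfulness, conciseness) for two candidate responses.
response_a = (0.9, 0.2)
response_b = (0.4, 0.8)

# Two hypothetical users whose unobserved priorities differ.
user_1 = (1.0, 0.0)  # cares only about helpfulness
user_2 = (0.0, 1.0)  # cares only about conciseness

p_user_1 = btl_preference_prob(
    user_reward(response_a, user_1), user_reward(response_b, user_1)
)
p_user_2 = btl_preference_prob(
    user_reward(response_a, user_2), user_reward(response_b, user_2)
)

print(f"User 1 prefers A with probability {p_user_1:.3f}")
print(f"User 2 prefers A with probability {p_user_2:.3f}")
```

Under these assumed weights the two users disagree about the same pair of responses, which is the kind of conflicting signal a single shared reward function cannot represent.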
