Title: Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning

Nov 07, 2024

Joey Hong, Anca Dragan, Sergey Levine

Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static, previously collected datasets. However, despite the widespread use of policy gradient methods to train large language models for single-turn tasks (e.g., question answering), value-based methods for multi-turn RL in an off-policy or offline setting have proven particularly challenging to scale to large language models. This setting requires effectively leveraging pretraining, scaling to large architectures with billions of parameters, and training on large datasets, all of which represent major challenges for current value-based RL methods. In this work, we propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning (SFT) problem where the probabilities of tokens directly translate to Q-values. In this way, we obtain an algorithm that smoothly transitions from maximizing the likelihood of the data during pretraining to learning a near-optimal Q-function during fine-tuning. Our algorithm has strong theoretical foundations, enjoying performance bounds similar to state-of-the-art Q-learning methods, while in practice utilizing an objective that closely resembles SFT. Because of this, our approach can enjoy the full benefits of the pretraining of language models, without the need to reinitialize any weights before RL fine-tuning, and without the need to initialize new heads for predicting values or advantages. Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a variety of tasks including both natural language dialogue and robotic manipulation and navigation from images.
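
The idea of reading token probabilities as Q-values and training them with an SFT-like objective can be illustrated with a small sketch. This is not the paper's implementation: the function names, the uniform spreading of leftover probability mass, and the assumption that rewards lie in [0, 1] are all hypothetical choices made here for illustration. It shows a cross-entropy loss against a soft label whose weight on the taken token is a Bellman-style target, and it reduces to ordinary SFT cross-entropy when that target equals 1.

```python
import math


def softmax(logits):
    """Convert raw logits into token probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bellman_target(reward, next_probs, gamma=0.99, done=False):
    """Bellman-style target, treating next-token probabilities as Q-values.

    Assumption (not from the source): rewards lie in [0, 1], and the
    target is clipped to [0, 1] so it can be used as a probability.
    """
    bootstrap = 0.0 if done else gamma * max(next_probs)
    return min(1.0, max(0.0, reward + bootstrap))


def weighted_ce_loss(logits, action, target):
    """Cross-entropy against a soft label.

    The taken token gets weight `target`; the remaining mass is spread
    uniformly over the other tokens (a hypothetical choice).
    """
    probs = softmax(logits)
    rest = (1.0 - target) / (len(logits) - 1)
    return -sum(
        (target if i == action else rest) * math.log(p)
        for i, p in enumerate(probs)
    )


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]        # toy vocabulary of three tokens
    next_probs = softmax([0.1, 1.2, 0.3])
    target = bellman_target(reward=0.0, next_probs=next_probs, gamma=0.9)
    print("target:", target)
    print("loss:", weighted_ce_loss(logits, action=0, target=target))
```

With `target=1.0` the loss is exactly the negative log-likelihood of the taken token, which is one way to see how such an objective could start from plain likelihood maximization and move toward value learning as the targets change.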

* 16 pages, 4 figures
