https://github.com/abaheti95/LoL-RL
Improving language model generations to satisfy user-defined quality or style constraints is challenging. Typical approaches include training on additional human-written data, filtering out ``low-quality'' data with heuristics, and/or using reinforcement learning from human feedback (RLHF). However, filtering can remove valuable training signal, while data collection and RLHF continually require additional human-written or LM exploration data, which can be costly to obtain. A natural question to ask is: ``Can we leverage RL to optimize LM utility on existing crowd-sourced and internet data?'' To this end, we present Left-over Lunch RL (LoL-RL), a simple training algorithm that uses offline policy gradients to learn language generation tasks as a 1-step RL game. LoL-RL can finetune LMs to optimize arbitrary classifier-based or human-defined utility functions on any sequence-to-sequence data. Experiments on five language generation tasks, with models of varying sizes and multiple rewards, show that models trained with LoL-RL consistently outperform the best supervised learning models. We also release our experimental code.
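For intuition, the sketch below shows a generic reward-weighted log-likelihood (offline policy-gradient) update on existing sequence-to-sequence pairs, which is the 1-step RL framing described above; it is a minimal illustration under stated assumptions, not the exact LoL-RL objective from the paper or this repository. The function name `offline_pg_step`, the Hugging-Face-style encoder-decoder interface, and the simple scalar `baseline` are assumptions for this example; the `rewards` tensor would come from the classifier-based or human-defined utility function mentioned in the abstract, and padding masks are omitted for brevity.

```python
# Hedged sketch: one offline policy-gradient step on existing (input, target) pairs,
# treating generation as a 1-step RL game. Illustrative only -- not the repo's code.
import torch
import torch.nn.functional as F

def offline_pg_step(model, optimizer, input_ids, target_ids, rewards, baseline=0.0):
    """Scale each sequence's log-likelihood by its (centered) offline reward.

    Assumes an encoder-decoder model with a Hugging-Face-style interface and that
    `target_ids` begins with the decoder start token. Padding masking is omitted.
    """
    logits = model(input_ids=input_ids,
                   decoder_input_ids=target_ids[:, :-1]).logits
    labels = target_ids[:, 1:]
    # Per-token log-probabilities of the observed (offline) responses.
    token_logprobs = -F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none"
    ).view(labels.size())
    seq_logprob = token_logprobs.sum(dim=-1)           # log p_theta(y | x) per sequence
    advantage = (rewards - baseline).detach()          # simple centered reward
    loss = -(advantage * seq_logprob).mean()           # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this framing, sequences that score well under the utility function push the model toward reproducing them, while low-reward sequences are down-weighted (or pushed away when the centered reward is negative), so no fresh exploration or additional human-written data is needed.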