Sam Work

Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

Mar 19, 2025

Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, Sam Work

Abstract: We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains when training a model both for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
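The abstract does not give the weighting rule itself, so the following Python sketch is only one possible reading of "asymmetric, tapered" importance sampling, not the authors' implementation. It assumes that samples with a non-negative baselined reward keep a weight of 1, while samples with a negative baselined reward are weighted by the policy-to-behavior probability ratio clipped to [0, 1]. The function names, the clipping range, and the way the baseline enters are all hypothetical choices made for illustration.

```python
import math

def tapered_weight(logp_policy, logp_behavior, signed_reward):
    """Hypothetical asymmetric weight (an assumption, not taken from the paper).

    Non-negative rewards: weight 1, no importance ratio.
    Negative rewards: ratio pi/mu clipped to [0, 1], so the push away from
    a bad sample fades as the policy stops generating it.
    """
    if signed_reward >= 0:
        return 1.0
    ratio = math.exp(logp_policy - logp_behavior)
    return min(1.0, max(0.0, ratio))

def surrogate_loss(samples, baseline=0.0):
    """Mean REINFORCE-style surrogate loss over offline samples.

    samples: list of (logp_policy, logp_behavior, reward) tuples.
    The baseline shifts each reward, which changes which samples are
    treated as positive and which as negative.
    In an autodiff framework the weight would be held constant
    (no gradient flows through it); here we only compute the value.
    """
    total = 0.0
    for logp_pi, logp_mu, reward in samples:
        signed = reward - baseline
        weight = tapered_weight(logp_pi, logp_mu, signed)
        total += -weight * signed * logp_pi
    return total / len(samples)
```

Under these assumptions, raising the baseline moves more samples into the negative branch, which is one way to picture the abstract's remark that the baseline affects dataset composition.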

#ContextMatters: Advantages and Limitations of Using Machine Learning to Support Women in Politics

Oct 10, 2021

Jacqueline Comer, Sam Work, Kory W Mathewson, Lana Cuthbertson, Kasey Machin

Abstract: The United Nations identified gender equality as a Sustainable Development Goal in 2015, recognizing the underrepresentation of women in politics as a specific barrier to achieving gender equality. Political systems around the world experience gender inequality across all levels of elected government as fewer women run for office than men. This is due in part to online abuse, particularly on social media platforms like Twitter, where women seeking or in power tend to be targeted with more toxic maltreatment than their male counterparts. In this paper, we present reflections on ParityBOT - the first natural language processing-based intervention designed to affect online discourse for women in politics for the better, at scale. Deployed across elections in Canada, the United States and New Zealand, ParityBOT was used to analyse and classify more than 12 million tweets directed at women candidates and counter toxic tweets with supportive ones. From these elections we present three case studies highlighting the current limitations of, and future research and application opportunities for, using a natural language processing-based system to detect online toxicity, specifically with regard to contextually important microaggressions. We examine the rate of false negatives, where ParityBOT failed to pick up on insults directed at specific high-profile women, which would be obvious to human users. We examine the unaddressed harms of microaggressions and the potential of yet unseen damage they cause for women in these communities, and for progress towards gender equality overall, in light of these technological blind spots. This work concludes with a discussion on the benefits of partnerships between nonprofit social groups and technology experts to develop responsible, socially impactful approaches to addressing online hate.

* 21 pages, 1 figure. Presented as Policy and Practice, Problem Pitches poster at EAAMO'21

Corporate Disruption in the Science of Machine Learning

Dec 13, 2016

Sam Work

Abstract: This MSc dissertation considers the effects of the current corporate interest on researchers in the field of machine learning. Situated within the field's cyclical history of academic, public and corporate interest, this dissertation investigates how current researchers view recent developments and negotiate their own research practices within an environment of increased commercial interest and funding. The original research consists of in-depth interviews with 12 machine learning researchers working in both academia and industry. Building on theory from science, technology and society studies, this dissertation problematizes the traditional narratives of the neoliberalization of academic research by allowing the researchers themselves to discuss how their career choices, working environments and interactions with others in the field have been affected by the reinvigorated corporate interest of recent years.

* MSc dissertation, qualitative analysis, machine learning researchers
