Yunhao Tang

Magistral

Jun 12, 2025

On a few pitfalls in KL divergence gradient estimation for RL

Jun 11, 2025

LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training

May 29, 2025

Learning to chain-of-thought with Jensen's evidence lower bound

Mar 25, 2025

RL-finetuning LLMs from on- and off-policy data with a single algorithm

Mar 25, 2025

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

Mar 25, 2025

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Mar 07, 2025

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Jun 04, 2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment

May 29, 2024