
Yunhao Tang

Learning to chain-of-thought with Jensen's evidence lower bound

Mar 25, 2025

RL-finetuning LLMs from on- and off-policy data with a single algorithm

Mar 25, 2025

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning

Mar 25, 2025

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Mar 07, 2025

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Jun 04, 2024

Offline Regularised Reinforcement Learning for Large Language Models Alignment

May 29, 2024

Understanding the performance gap between online and offline alignment algorithms

May 14, 2024

Human Alignment of Large Language Models through Online Preference Optimisation

Mar 13, 2024

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Mar 08, 2024