Shenao Zhang

Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

Dec 27, 2024

Offline Reinforcement Learning for LLM Multi-Step Reasoning

Dec 20, 2024

DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

Nov 20, 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Oct 10, 2024

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

May 29, 2024

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

May 26, 2024

How Can LLM Guide RL? A Value-Based Approach

Feb 25, 2024

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

Oct 30, 2023

Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency

Oct 11, 2023

One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

May 29, 2023