Zhaoran Wang

DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs
Nov 20, 2024

Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos
Oct 11, 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
Oct 10, 2024

Just say what you want: only-prompting self-rewarding online preference optimization
Sep 26, 2024

Safe MPC Alignment with Human Directional Feedback
Jul 05, 2024

Toward Optimal LLM Alignments Using Two-Player Games
Jun 16, 2024

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment
May 29, 2024

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
May 26, 2024

A Mean-Field Analysis of Neural Gradient Descent-Ascent: Applications to Functional Conditional Moment Equations
Apr 18, 2024

Advancing Object Goal Navigation Through LLM-enhanced Object Affinities Transfer
Mar 15, 2024