Aviral Kumar

Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

Dec 10, 2024

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Dec 09, 2024

What Do Learning Dynamics Reveal About Generalization in LLM Reasoning?

Nov 12, 2024

Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance

Oct 17, 2024

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Oct 10, 2024

Generative Verifiers: Reward Modeling as Next-Token Prediction

Aug 27, 2024

D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Aug 15, 2024

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Aug 06, 2024

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Jul 26, 2024

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Jun 20, 2024