Jiacai Liu

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Feb 02, 2026
Via arXiv

Policy Mirror Descent with Temporal Difference Learning: Sample Complexity under Online Markov Data

Dec 30, 2025
Via arXiv

A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

Dec 28, 2025
Via arXiv

Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

Dec 28, 2025
Via arXiv

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Dec 28, 2025
Via arXiv

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Sep 11, 2025
Via arXiv

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Jul 02, 2025
Via arXiv

Skywork Open Reasoner 1 Technical Report

May 29, 2025
Via arXiv

Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization

Dec 24, 2024
Via arXiv

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Oct 24, 2024
Via arXiv