Ziniu Li

Off-Policy Value-Based Reinforcement Learning for Large Language Models

Mar 24, 2026

Non-Adversarial Imitation Learning Provably Free of Compounding Errors: The Role of Bellman Constraints

Mar 24, 2026

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Feb 06, 2026

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Feb 02, 2026

The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Jan 13, 2026

Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Dec 31, 2025

A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

Dec 28, 2025

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Dec 28, 2025

Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

Dec 28, 2025

Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Dec 21, 2025