Tuo Zhao

Shuffle the Context: RoPE-Perturbed Self-Distillation for Long-Context Adaptation

Apr 15, 2026

Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

Apr 10, 2026

Diffusion Model for Manifold Data: Score Decomposition, Curvature, and Statistical Complexity

Mar 21, 2026

Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training

Feb 05, 2026

Teach Diffusion Language Models to Learn from Their Own Mistakes

Jan 10, 2026

Ask a Strong LLM Judge when Your Reward Model is Uncertain

Oct 23, 2025

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Oct 09, 2025

Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

May 22, 2025

NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Apr 20, 2025

Adversarial Training of Reward Models

Apr 08, 2025