Han Zhong

Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

Feb 05, 2026
Via arXiv

Optimism Stabilizes Thompson Sampling for Adaptive Inference

Feb 05, 2026
Via arXiv

The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability

Jun 11, 2025
Via arXiv

Less is More: Improving LLM Alignment via Preference Data Selection

Feb 22, 2025
Via arXiv

Learning an Optimal Assortment Policy under Observational Data

Feb 10, 2025
Via arXiv

BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

Jan 31, 2025
Via arXiv

A3S: A General Active Clustering Method with Pairwise Constraints

Jul 14, 2024
Via arXiv

Combinatorial Multivariant Multi-Armed Bandits with Applications to Episodic Reinforcement Learning and Beyond

Jun 03, 2024
Via arXiv

DPO Meets PPO: Reinforced Token Optimization for RLHF

Apr 29, 2024
Via arXiv

Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation

Apr 19, 2024
Via arXiv