Yiyuan Ma

SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

Feb 02, 2026
Via arXiv

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Dec 29, 2025
Via arXiv

Virtual Width Networks

Nov 17, 2025
Via arXiv

MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

May 19, 2025
Via arXiv

Model Merging in Pre-training of Large Language Models

May 17, 2025
Via arXiv

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Feb 28, 2025
Via arXiv