Kartik Sinha

Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection

Nov 13, 2024

Vima Gupta, Kartik Sinha, Ada Gavrilovska, Anand Padmanabha Iyer

Abstract: Mixture-of-Experts (MoE) architectures have recently gained popularity for enabling efficient scaling of large language models. However, we uncover a fundamental tension: MoEs are designed for selective expert activation, but production serving requires request batching. Because different tokens in a batch route to different experts, batching forces the activation of all experts and negates MoE's efficiency benefits during the decode phase. We present Lynx, a system that enables efficient MoE inference through dynamic, batch-aware expert selection. Our key insight is that expert importance varies significantly across tokens and inference phases, creating opportunities for runtime optimization. Lynx leverages this insight through a lightweight framework that dynamically reduces the number of active experts while preserving model accuracy. Our evaluations show that Lynx achieves up to a 1.55x reduction in inference latency with negligible accuracy loss relative to the baseline model on complex code generation and mathematical reasoning tasks.
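
The tension between batching and selective activation can be illustrated with a small simulation. The sketch below is not Lynx's method and does not reproduce any result from the paper. It assumes a hypothetical router that sends each token to `top_k` of `num_experts` experts uniformly at random (real routers are learned and skewed), and the sizes 8 and 2 are illustrative values only. Under that assumption, the expected fraction of experts touched by a batch of $B$ tokens is $1 - (1 - k/E)^B$, which approaches 1 quickly as the batch grows.

```python
import random


def active_experts(batch_routes):
    """Return the union of experts selected by any token in the batch."""
    return set().union(*batch_routes) if batch_routes else set()


def simulate_active_fraction(num_experts, top_k, batch_size, trials=200, seed=0):
    """Estimate the fraction of experts activated per batch.

    Assumption: routing is uniform at random, which is a simplification
    used only for illustration.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Each token picks top_k distinct experts.
        routes = [rng.sample(range(num_experts), top_k) for _ in range(batch_size)]
        total += len(active_experts(routes))
    return total / (trials * num_experts)


def expected_active_fraction(num_experts, top_k, batch_size):
    """Closed form under the same uniform-routing assumption.

    A single token misses a given expert with probability 1 - k/E, and
    tokens are independent, so the expert stays idle with probability
    (1 - k/E) ** B.
    """
    return 1.0 - (1.0 - top_k / num_experts) ** batch_size


if __name__ == "__main__":
    # Illustrative sizes only: 8 experts, 2 selected per token.
    for batch_size in (1, 4, 16, 32):
        sim = simulate_active_fraction(8, 2, batch_size)
        exact = expected_active_fraction(8, 2, batch_size)
        print(f"batch={batch_size:3d}  simulated={sim:.3f}  closed_form={exact:.3f}")
```

With a single token, only `top_k / num_experts` of the experts are used. With a few dozen tokens, nearly every expert is touched by at least one token, which is the situation the abstract describes for batched decoding.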
