Mahesh Marina

MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving

Jan 25, 2024

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina

Abstract: This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that realizes activation-aware expert offloading. MoE-Infinity features sequence-level expert activation tracing, a new approach adept at identifying sparse activations and capturing the temporal locality of MoE inference. By analyzing these traces, MoE-Infinity performs novel activation-aware expert prefetching and caching, substantially reducing the latency overheads usually associated with offloading experts and thereby improving cost performance. Extensive experiments in a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4-20X and decreasing deployment costs by over 8X for various MoEs. MoE-Infinity's source code is publicly available at https://github.com/TorchMoE/MoE-Infinity
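
To make the idea of activation-aware caching more concrete, here is a minimal Python sketch. It is not taken from the MoE-Infinity code base: the class `ActivationAwareCache`, its methods, and the eviction rule (evict the cached expert with the fewest activations recorded in the current sequence) are all assumptions chosen for illustration. The abstract does not specify the actual policy, and prefetching is not modeled here.

```python
from collections import defaultdict


class ActivationAwareCache:
    """Toy expert cache (hypothetical, for illustration only).

    Keeps a per-sequence count of how often each (layer, expert) pair is
    activated and, when full, evicts the cached expert with the lowest count.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.cached = set()            # (layer, expert) pairs resident in fast memory
        self.trace = defaultdict(int)  # sequence-level activation counts
        self.hits = 0
        self.misses = 0

    def start_sequence(self):
        # Reset the trace at the start of each new input sequence.
        self.trace.clear()

    def access(self, layer, expert):
        """Record one activation; return True on a cache hit, False on a miss."""
        key = (layer, expert)
        self.trace[key] += 1
        if key in self.cached:
            self.hits += 1
            return True
        self.misses += 1
        if len(self.cached) >= self.capacity:
            # Evict the least-activated cached expert; ties break on the key.
            victim = min(self.cached, key=lambda k: (self.trace.get(k, 0), k))
            self.cached.remove(victim)
        self.cached.add(key)
        return False


if __name__ == "__main__":
    cache = ActivationAwareCache(capacity=2)
    cache.start_sequence()
    # Hypothetical routing decisions for one sequence: (layer, expert) per step.
    for layer, expert in [(0, 1), (0, 1), (0, 2), (0, 3), (0, 1)]:
        cache.access(layer, expert)
    print(cache.hits, cache.misses, sorted(cache.cached))
```

In this toy run, expert (0, 1) is activated repeatedly and stays cached, so the final access hits. A plain least-recently-used policy would have evicted it when (0, 3) arrived. That is the kind of temporal locality the abstract refers to, though the real system's tracing and policies may differ substantially.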
