Title: Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping

Aug 29, 2023

Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Linghe Kong, Yunxin Liu

Abstract: Mixture of experts (MoE) is a popular technique in deep learning that improves model capacity with conditionally activated parallel neural network modules (experts). However, serving MoE models in resource-constrained, latency-critical edge scenarios is challenging due to the significantly increased model size and complexity. In this paper, we first analyze the behavior pattern of MoE models in continuous inference scenarios, which leads to three key observations about expert activations: temporal locality, exchangeability, and skippable computation. Based on these observations, we introduce PC-MoE, an inference framework for resource-constrained continuous MoE model serving. The core of PC-MoE is a new data structure, Parameter Committee, that intelligently maintains a subset of important experts in use to reduce resource consumption. The optimal configuration of Parameter Committee is found offline by a profiling-guided committee planner, and expert swapping and request handling at runtime are managed by an adaptive committee scheduler. To evaluate the effectiveness of PC-MoE, we conduct experiments using state-of-the-art MoE models on common computer vision and natural language processing tasks. The results demonstrate optimal trade-offs between resource consumption and model accuracy achieved by PC-MoE. For instance, on object detection tasks with the Swin-MoE model, our approach can reduce memory usage and latency by 42.34% and 18.63%, respectively, with only 0.10% accuracy degradation.
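To make the idea of keeping only a subset of experts resident more concrete, here is a minimal sketch of a fixed-capacity expert set with swapping. It is not the PC-MoE implementation: the abstract does not specify how the committee planner or scheduler choose experts, so this sketch substitutes a plain least-recently-used (LRU) policy, which only illustrates the temporal-locality observation. The names `ExpertCommittee`, `load_expert`, and `count_loads` are hypothetical.

```python
from collections import OrderedDict


class ExpertCommittee:
    """Toy fixed-capacity set of resident experts (illustrative assumption only)."""

    def __init__(self, capacity, load_expert):
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical callback: expert_id -> expert
        self.resident = OrderedDict()   # expert_id -> expert, least recent first
        self.loads = 0                  # number of times an expert was brought in

    def get(self, expert_id):
        if expert_id in self.resident:
            # Hit: mark the expert as most recently used.
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]
        if len(self.resident) >= self.capacity:
            # Miss with a full committee: evict the least recently used expert.
            self.resident.popitem(last=False)
        self.resident[expert_id] = self.load_expert(expert_id)
        self.loads += 1
        return self.resident[expert_id]


def count_loads(trace, capacity):
    """Replay a sequence of requested expert ids and count expert loads."""
    # Stand-in experts: identity functions instead of real network modules.
    committee = ExpertCommittee(capacity, load_expert=lambda eid: (lambda x: x))
    for expert_id in trace:
        committee.get(expert_id)
    return committee.loads
```

With a trace that revisits the same experts, such as `[0, 1, 0, 1, 2, 0]` and a capacity of 2, most requests hit resident experts; a trace that cycles through more experts than the capacity forces a load on every request. A real system would presumably choose the capacity and the eviction policy from profiling rather than from recency alone.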
