Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. Exploiting the heterogeneous capabilities of edge LLMs is crucial for diverse emerging applications, as it enables greater cost-effectiveness and reduced latency. In this work, we introduce \textit{Mixture-of-Edge-Experts (MoE$^2$)}, a novel collaborative inference framework for edge LLMs. We formulate the joint gating and expert selection problem to optimize inference performance under energy and latency constraints. Unlike conventional MoE problems, LLM expert selection is significantly more challenging due to its combinatorial nature and the heterogeneity of edge LLMs across various attributes. To this end, we propose a two-level expert selection mechanism through which we uncover an optimality-preserving property of gating parameters across expert selections. This property enables the decomposition of the training and selection processes, significantly reducing complexity. Furthermore, we leverage the objective's monotonicity and design a discrete monotonic optimization algorithm for optimal expert selection. We implement edge servers with NVIDIA Jetson AGX Orins and NVIDIA RTX 4090 GPUs, and perform extensive experiments. Our results validate the performance improvements across various LLMs and show that our MoE$^2$ method can achieve optimal trade-offs under different delay and energy budgets and outperforms baselines under various system resource constraints.
Abstract: In the realm of emerging real-time networked applications like cyber-physical systems (CPS), the Age of Information (AoI) has emerged as a pivotal metric for evaluating timeliness. To meet the high computational demands, such as those in intelligent manufacturing within CPS, mobile edge computing (MEC) presents a promising solution for optimizing computing and reducing AoI. In this work, we study the timeliness of computation-intensive updates and explore how to jointly optimize the task updating and offloading policies to minimize AoI. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The fractional objective introduced by AoI and the semi-Markov game nature of the problem make it particularly difficult, and existing approaches are not directly applicable. To this end, we present a comprehensive framework for fractional reinforcement learning (RL). We first introduce a fractional single-agent RL framework and prove its linear convergence. We then extend this to a fractional multi-agent RL framework with a convergence analysis. To tackle the challenge of asynchronous control in the semi-Markov game, we further design an asynchronous model-free fractional multi-agent RL algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 52.6% compared with the best baseline algorithm in our experiments.
Abstract: Mobile edge computing (MEC) is a promising paradigm for real-time applications with intensive computational needs (e.g., autonomous driving), as it can reduce the processing delay. In this work, we focus on the timeliness of computation-intensive updates, measured by Age-of-Information (AoI), and study how to jointly optimize the task updating and offloading policies for an AoI objective in fractional form. Specifically, we consider edge load dynamics and formulate a task scheduling problem to minimize the expected time-average AoI. The uncertain edge load dynamics, the nature of the fractional objective, and the hybrid continuous-discrete action space (due to the joint optimization) make this problem challenging and render existing approaches not directly applicable. To this end, we propose a fractional reinforcement learning (RL) framework and prove its convergence. We further design a model-free fractional deep RL (DRL) algorithm, where each device makes scheduling decisions with the hybrid action space without knowing the system dynamics and decisions of other devices. Experimental results show that our proposed algorithms reduce the average AoI by up to 57.6% compared with several non-fractional benchmarks.