Ramin Khalili

Large Language Model Partitioning for Low-Latency Inference at the Edge

May 05, 2025

Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Abstract:Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver's latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
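The memory growth described in this abstract can be made concrete with a small sketch. The code below is not the paper's partitioning algorithm: it only estimates the key-value cache size of one attention head and places heads greedily by free memory, ignoring compute load, link bandwidth, and migration delay. The function names, the 2-byte value size, and the device memory figures are all assumptions made for illustration.

```python
def kv_cache_bytes(num_tokens, head_dim, bytes_per_value=2):
    """Bytes held by one attention head's key-value cache in one layer."""
    # One key vector and one value vector of length head_dim per cached token,
    # so the cache grows linearly with the number of generated tokens.
    return 2 * num_tokens * head_dim * bytes_per_value


def place_heads(num_heads, num_tokens, head_dim, free_memory, bytes_per_value=2):
    """Assign each head (with its cache) to the device with the most free memory.

    free_memory maps a hypothetical device name to its free bytes.
    """
    remaining = dict(free_memory)  # copy so the caller's dict is untouched
    need = kv_cache_bytes(num_tokens, head_dim, bytes_per_value)
    placement = {}
    for head in range(num_heads):
        device = max(remaining, key=remaining.get)
        if remaining[device] < need:
            raise MemoryError(f"no device can hold the cache of head {head}")
        placement[head] = device
        remaining[device] -= need
    return placement


if __name__ == "__main__":
    devices = {"a": 60000, "b": 60000}  # made-up free memory in bytes
    print(kv_cache_bytes(100, 64))
    print(place_heads(4, 100, 64, devices))
```

Re-running `place_heads` with a larger `num_tokens` shows why a placement that fits early in generation can overflow later, which is the situation the periodic re-partitioning in the abstract is meant to handle.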

Collaborative Split Federated Learning with Parallel Training and Aggregation

Apr 22, 2025

Yiannis Papageorgiou, Yannis Thomas, Alexios Filippakopoulos, Ramin Khalili, Iordanis Koutsopoulos

Abstract:Federated learning (FL) operates based on model exchanges between the server and the clients, and it suffers from significant client-side computation and communication burden. Split federated learning (SFL) arises as a promising solution by splitting the model into two parts that are trained sequentially: the clients train the first part of the model (client-side model) and transmit it to the server, which trains the second (server-side model). Existing SFL schemes, though, still exhibit long training delays and significant communication overhead, especially when clients of different computing capability participate. Thus, we propose Collaborative-Split Federated Learning (C-SFL), a novel scheme that splits the model into three parts, namely the model parts trained at the computationally weak clients, the ones trained at the computationally strong clients, and the ones at the server. Unlike existing works, C-SFL enables parallel training and aggregation of the model's parts at the clients and at the server, resulting in reduced training delays and communication overhead while improving the model's accuracy. Experiments verify the multiple gains of C-SFL against the existing schemes.

Accelerated Training on Low-Power Edge Devices

Feb 25, 2025

Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Heba Khdr, Osama Abboud, Ramin Khalili, Jörg Henkel

Abstract:Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that depend on state-of-the-art techniques, reducing the training time by a factor of 2.4 with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process. These gains are achieved without reduction in the performance of the trained model.

Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

Nov 12, 2024

Kilian Pfeiffer, Mohamed Aboelenien Ahmed, Ramin Khalili, Jörg Henkel

Abstract:In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource requirements, particularly in terms of the large number of Floating Point Operations (FLOPs) and the high amounts of memory needed. To fine-tune such a model in a parameter-efficient way, techniques like Adapter or LoRA have been developed. However, we observe that the application of LoRA, when used in federated learning (FL), while still being parameter-efficient, is memory and FLOP inefficient. Based on that observation, we develop a novel layer finetuning scheme that allows devices in cross-device FL to make use of pretrained neural networks (NNs) while adhering to given resource constraints. We show that our presented scheme outperforms the current state of the art when dealing with homogeneous or heterogeneous computation and memory constraints and is on par with LoRA regarding limited communication, thereby achieving significantly higher accuracies in FL training.

Multi-Objective Optimization Using Adaptive Distributed Reinforcement Learning

Mar 13, 2024

Jing Tan, Ramin Khalili, Holger Karl

Abstract:The Intelligent Transportation System (ITS) environment is known to be dynamic and distributed, where participants (vehicle users, operators, etc.) have multiple, changing and possibly conflicting objectives. Although Reinforcement Learning (RL) algorithms are commonly applied to optimize ITS applications such as resource management and offloading, most RL algorithms focus on single objectives. In many situations, converting a multi-objective problem into a single-objective one is impossible, intractable or insufficient, making such RL algorithms inapplicable. We propose a multi-objective, multi-agent reinforcement learning (MARL) algorithm with high learning efficiency and low computational requirements, which automatically triggers adaptive few-shot learning in a dynamic, distributed and noisy environment with sparse and delayed reward. We test our algorithm in an ITS environment with edge cloud computing. Empirical results show that the algorithm is quick to adapt to new environments and performs better in all individual and system metrics compared to the state-of-the-art benchmark. Our algorithm also addresses various practical concerns with its modularized and asynchronous online training method. In addition to the cloud simulation, we test our algorithm on a single-board computer and show that it can perform inference in 6 milliseconds.

DISTINQT: A Distributed Privacy Aware Learning Framework for QoS Prediction for Future Mobile and Wireless Networks

Jan 15, 2024

Nikolaos Koursioumpas, Lina Magoula, Ioannis Stavrakakis, Nancy Alonistioti, M. A. Gutierrez-Estevez, Ramin Khalili

Abstract:Beyond 5G and 6G networks are expected to support new and challenging use cases and applications that depend on a certain level of Quality of Service (QoS) to operate smoothly. Predicting the QoS in a timely manner is of high importance, especially for safety-critical applications as in the case of vehicular communications. Although QoS prediction has until recently been carried out by centralized Artificial Intelligence (AI) solutions, a number of privacy, computational, and operational concerns have emerged. Alternative solutions have surfaced (e.g. Split Learning, Federated Learning), distributing AI tasks of reduced complexity across nodes, while preserving the privacy of the data. However, new challenges arise when it comes to scalable distributed learning approaches, taking into account the heterogeneous nature of future wireless networks. The current work proposes DISTINQT, a privacy-aware distributed learning framework for QoS prediction. Our framework supports multiple heterogeneous nodes, in terms of data types and model architectures, by sharing computations across them. This enables the incorporation of diverse knowledge into a sole learning process that will enhance the robustness and generalization capabilities of the final QoS prediction model. DISTINQT also contributes to data privacy preservation by encoding any raw input data into a non-linear latent representation before any transmission. Evaluation results showcase that our framework achieves a statistically identical performance compared to its centralized version and an average performance improvement of up to 65% against six state-of-the-art centralized baseline solutions in the Tele-Operated Driving use case.

* 11 Pages Double Column, 9 Figures, Submitted for possible publication in the IEEE Transactions on Vehicular Technology (IEEE TVT)

A Safe Deep Reinforcement Learning Approach for Energy Efficient Federated Learning in Wireless Communication Networks

Aug 21, 2023

Nikolaos Koursioumpas, Lina Magoula, Nikolaos Petropouleas, Alexandros-Ioannis Thanopoulos, Theodora Panagea, Nancy Alonistioti, M. A. Gutierrez-Estevez, Ramin Khalili

Abstract:Progressing towards a new era of Artificial Intelligence (AI)-enabled wireless networks, concerns regarding the environmental impact of AI have been raised both in industry and academia. Federated Learning (FL) has emerged as a key privacy-preserving decentralized AI technique. Despite efforts currently being made in FL, its environmental impact is still an open problem. Targeting the minimization of the overall energy consumption of an FL process, we propose the orchestration of computational and communication resources of the involved devices to minimize the total energy required, while guaranteeing a certain performance of the model. To this end, we propose a Soft Actor Critic Deep Reinforcement Learning (DRL) solution, where a penalty function is introduced during training, penalizing the strategies that violate the constraints of the environment, and ensuring a safe RL process. A device-level synchronization method, along with a computationally cost-effective FL environment, is proposed, with the goal of further reducing the energy consumption and communication overhead. Evaluation results show the effectiveness of the proposed scheme compared to four state-of-the-art baseline solutions in both static and dynamic environments, achieving a decrease of up to 94% in the total energy consumption.
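The idea of a penalty function that discourages constraint-violating strategies can be illustrated with a generic sketch. The abstract does not give the paper's penalty function, so the form below (negative energy minus a weighted sum of constraint excesses) is an assumption, as are the function name, the parameter names, and the default weight.

```python
def penalized_reward(energy_joules, constraint_values, constraint_limits,
                     penalty_weight=10.0):
    """Reward that favors low energy and penalizes constraint violations.

    constraint_values[i] is considered violated when it exceeds
    constraint_limits[i]; only the excess is penalized.
    """
    violation = sum(
        max(0.0, value - limit)
        for value, limit in zip(constraint_values, constraint_limits)
    )
    return -energy_joules - penalty_weight * violation


if __name__ == "__main__":
    # Feasible strategy: no penalty, reward is just the negative energy.
    print(penalized_reward(5.0, [1.0], [2.0]))
    # Second constraint exceeded by 1.0, so a penalty of 10.0 is subtracted.
    print(penalized_reward(5.0, [1.0, 3.0], [2.0, 2.0]))
```

With a large enough `penalty_weight`, any violating strategy scores worse than a feasible one, which is one common way a learning agent is steered away from unsafe choices.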

* 27 Pages Single Column, 6 Figures, Submitted for possible publication in the IEEE Transactions on Green Communications and Networking (TGCN). arXiv admin note: text overlap with arXiv:2306.14237

Federated Learning for Computationally-Constrained Heterogeneous Devices: A Survey

Jul 18, 2023

Kilian Pfeiffer, Martin Rapp, Ramin Khalili, Jörg Henkel

Abstract:With an increasing number of smart devices like internet of things (IoT) devices deployed in the field, offloading training of neural networks (NNs) to a central server becomes more and more infeasible. Recent efforts to improve users' privacy have led to on-device learning emerging as an alternative. However, a model trained only on a single device, using only local data, is unlikely to reach a high accuracy. Federated learning (FL) has been introduced as a solution, offering a privacy-preserving trade-off between communication overhead and model accuracy by sharing knowledge between devices without disclosing the devices' private data. The applicability and the benefit of applying baseline FL are, however, limited in many relevant use cases due to the heterogeneity present in such environments. In this survey, we outline the heterogeneity challenges FL has to overcome to be widely applicable in real-world applications. We especially focus on the aspect of computation heterogeneity among the participating devices and provide a comprehensive overview of recent works on heterogeneity-aware FL. We discuss two groups: works that adapt the NN architecture and works that approach heterogeneity on a system level, covering Federated Averaging (FedAvg), distillation, and split learning-based approaches, as well as synchronous and asynchronous aggregation schemes.
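For readers unfamiliar with Federated Averaging, its aggregation step is a weighted mean of the client models, with each client weighted by its number of local training samples. The sketch below shows only that step on flat lists of parameters; the function name and the toy values are chosen for illustration and are not taken from the survey.

```python
def fedavg(client_params, client_sizes):
    """Aggregate client parameter vectors into one global vector.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes: number of local training samples held by each client.
    """
    total = sum(client_sizes)
    num_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(num_params)
    ]


if __name__ == "__main__":
    # Two clients; the second holds three times as much data.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Only parameters leave the devices in this step, never raw data, which is the privacy property the abstract refers to.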

* ACM Comput. Surv. 55, 14s, Article 334, 2023

A Safe Genetic Algorithm Approach for Energy Efficient Federated Learning in Wireless Communication Networks

Jul 05, 2023

Lina Magoula, Nikolaos Koursioumpas, Alexandros-Ioannis Thanopoulos, Theodora Panagea, Nikolaos Petropouleas, M. A. Gutierrez-Estevez, Ramin Khalili

Abstract:Federated Learning (FL) has emerged as a decentralized technique in which, contrary to traditional centralized approaches, devices perform model training in a collaborative manner while preserving data privacy. Despite the existing efforts made in FL, its environmental impact is still under investigation, since several critical challenges regarding its applicability to wireless networks have been identified. Towards mitigating the carbon footprint of FL, the current work proposes a Genetic Algorithm (GA) approach, targeting the minimization of both the overall energy consumption of an FL process and any unnecessary resource utilization, by orchestrating the computational and communication resources of the involved devices, while guaranteeing a certain FL model performance target. A penalty function is introduced in the offline phase of the GA that penalizes the strategies that violate the constraints of the environment, ensuring a safe GA process. Evaluation results show the effectiveness of the proposed scheme compared to two state-of-the-art baseline solutions, achieving a decrease of up to 83% in the total energy consumption.

* 6 pages, 6 figures, Accepted in IEEE PIMRC 2023 Conference, Latest revision with small corrections (typos etc.)

Aggregating Capacity in FL through Successive Layer Training for Computationally-Constrained Devices

May 26, 2023

Kilian Pfeiffer, Ramin Khalili, Jörg Henkel

Abstract:Federated learning (FL) is usually performed on resource-constrained edge devices, e.g., with limited memory for the computation. If the required memory to train a model exceeds this limit, the device will be excluded from the training. This can lead to a lower accuracy as valuable data and computation resources are excluded from training, also causing bias and unfairness. The FL training process should be adjusted to such constraints. The state-of-the-art techniques propose training subsets of the FL model at constrained devices, reducing their resource requirements for training. But these techniques largely limit the co-adaptation among parameters of the model and are highly inefficient, as we show: it is actually better for the system to train a smaller (less accurate) model that all the devices can train end-to-end than to apply such techniques. We propose a new method that enables successive freezing and training of the parameters of the FL model at devices, reducing the training's resource requirements at the devices, while still allowing enough co-adaptation between parameters. We show through extensive experimental evaluation that our technique greatly improves the accuracy of the trained model (by 52.4 p.p.) compared with the state of the art, efficiently aggregating the computation capacity available on distributed devices.
