Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jaya Prakash Champati

Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques

Jun 06, 2025

Adarsh Prasad Behera, Jaya Prakash Champati, Roberto Morabito, Sasu Tarkoma, James Gross

Abstract:Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost sensitive environments. To address these challenges, recent approaches have introduced multi LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity -- using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM based systems more efficient and accessible for real world applications.

Via

Access Paper or Ask Questions

Minimizing Age of Detection for a Markov Source over a Lossy Channel

Mar 04, 2025

Shivang Garde, Jaya Prakash Champati, Arpan Chattopadhyay

Abstract:Monitoring a process/phenomenon of specific interest is prevalent in Cyber-Physical Systems (CPS), remote healthcare, smart buildings, intelligent transport, industry 4.0, etc. A key building block of the monitoring system is a sensor sampling the process and communicating the status updates to a monitor for detecting events of interest. Measuring the freshness of the status updates is essential for the timely detection of events, and it has received significant research interest in recent times. In this paper, we propose a new freshness metric, Age of Detection (AoD), for monitoring the state transitions of a Discrete Time Markov Chain (DTMC) source over a lossy wireless channel. We consider the pull model where the sensor samples DTMC state whenever the monitor requests a status update. We formulate a Constrained Markov Decision Problem (CMDP) for optimising the AoD subject to a constraint on the average sampling frequency and solve it using the Lagrangian MDP formulation and Relative Value Iteration (RVI) algorithm. Our numerical results show interesting trade-offs between AoD, sampling frequency, and transmission success probability. Further, the AoD minimizing policy provides a lower estimation error than the Age of Information (AoI) minimizing policy, thus demonstrating the utility of AoD for monitoring DTMC sources.

Via

Access Paper or Ask Questions

The Case for Hierarchical Deep Learning Inference at the Network Edge

Apr 23, 2023

Ghina Al-Atat, Andrea Fresa, Adarsh Prasad Behera, Vishnu Narayanan Moothedath, James Gross, Jaya Prakash Champati

Figure 1 for The Case for Hierarchical Deep Learning Inference at the Network Edge

Figure 2 for The Case for Hierarchical Deep Learning Inference at the Network Edge

Figure 3 for The Case for Hierarchical Deep Learning Inference at the Network Edge

Figure 4 for The Case for Hierarchical Deep Learning Inference at the Network Edge

Abstract:Resource-constrained Edge Devices (EDs), e.g., IoT sensors and microcontroller units, are expected to make intelligent decisions using Deep Learning (DL) inference at the edge of the network. Toward this end, there is a significant research effort in developing tinyML models - Deep Learning (DL) models with reduced computation and memory storage requirements - that can be embedded on these devices. However, tinyML models have lower inference accuracy. On a different front, DNN partitioning and inference offloading techniques were studied for distributed DL inference between EDs and Edge Servers (ESs). In this paper, we explore Hierarchical Inference (HI), a novel approach proposed by Vishnu et al. 2023, arXiv:2304.00891v1 , for performing distributed DL inference at the edge. Under HI, for each data sample, an ED first uses a local algorithm (e.g., a tinyML model) for inference. Depending on the application, if the inference provided by the local algorithm is incorrect or further assistance is required from large DL models on edge or cloud, only then the ED offloads the data sample. At the outset, HI seems infeasible as the ED, in general, cannot know if the local inference is sufficient or not. Nevertheless, we present the feasibility of implementing HI for machine fault detection and image classification applications. We demonstrate its benefits using quantitative analysis and argue that using HI will result in low latency, bandwidth savings, and energy savings in edge AI systems.

* This paper consists of 9 pages, with 6 tables and 8 figures

Via

Access Paper or Ask Questions

Online Algorithms for Hierarchical Inference in Deep Learning applications at the Edge

Apr 03, 2023

Vishnu Narayanan Moothedath, Jaya Prakash Champati, James Gross

Figure 1 for Online Algorithms for Hierarchical Inference in Deep Learning applications at the Edge

Figure 2 for Online Algorithms for Hierarchical Inference in Deep Learning applications at the Edge

Figure 3 for Online Algorithms for Hierarchical Inference in Deep Learning applications at the Edge

Figure 4 for Online Algorithms for Hierarchical Inference in Deep Learning applications at the Edge

Abstract:We consider a resource-constrained Edge Device (ED) embedded with a small-size ML model (S-ML) for a generic classification application, and an Edge Server (ES) that hosts a large-size ML model (L-ML). Since the inference accuracy of S-ML is lower than that of the L-ML, offloading all the data samples to the ES results in high inference accuracy, but it defeats the purpose of embedding S-ML on the ED and deprives the benefits of reduced latency, bandwidth savings, and energy efficiency of doing local inference. To get the best out of both worlds, i.e., the benefits of doing inference on the ED and the benefits of doing inference on ES, we explore the idea of Hierarchical Inference (HI), wherein S-ML inference is only accepted when it is correct, otherwise the data sample is offloaded for L-ML inference. However, the ideal implementation of HI is infeasible as the correctness of the S-ML inference is not known to the ED. We thus propose an online meta-learning framework to predict the correctness of the S-ML inference. The resulting online learning problem turns out to be a Prediction with Expert Advice (PEA) problem with continuous expert space. We consider the full feedback scenario, where the ED receives feedback on the correctness of the S-ML once it accepts the inference, and the no-local feedback scenario, where the ED does not receive the ground truth for the classification, and propose the HIL-F and HIL-N algorithms and prove a regret bound that is sublinear with the number of data samples. We evaluate and benchmark the performance of the proposed algorithms for image classification applications using four datasets, namely, Imagenette, Imagewoof, MNIST, and CIFAR-10.

* This work will be appearing in a journal soon and the 'Journal reference' will be updated as and when the information is available. The submission contains 22 pages, 7 figures including subfigures, 2 tables and 2 algorithms

Via

Access Paper or Ask Questions

Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge

Nov 14, 2022

SM Zobaed, Ali Mokhtari, Jaya Prakash Champati, Mathieu Kourouma, Mohsen Amini Salehi

Figure 1 for Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge

Figure 2 for Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge

Figure 3 for Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge

Figure 4 for Edge-MultiAI: Multi-Tenancy of Latency-Sensitive Deep Learning Applications on Edge

Abstract:Smart IoT-based systems often desire continuous execution of multiple latency-sensitive Deep Learning (DL) applications. The edge servers serve as the cornerstone of such IoT-based systems, however, their resource limitations hamper the continuous execution of multiple (multi-tenant) DL applications. The challenge is that, DL applications function based on bulky "neural network (NN) models" that cannot be simultaneously maintained in the limited memory space of the edge. Accordingly, the main contribution of this research is to overcome the memory contention challenge, thereby, meeting the latency constraints of the DL applications without compromising their inference accuracy. We propose an efficient NN model management framework, called Edge-MultiAI, that ushers the NN models of the DL applications into the edge memory such that the degree of multi-tenancy and the number of warm-starts are maximized. Edge-MultiAI leverages NN model compression techniques, such as model quantization, and dynamically loads NN models for DL applications to stimulate multi-tenancy on the edge server. We also devise a model management heuristic for Edge-MultiAI, called iWS-BFE, that functions based on the Bayesian theory to predict the inference requests for multi-tenant applications, and uses it to choose the appropriate NN models for loading, hence, increasing the number of warm-start inferences. We evaluate the efficacy and robustness of Edge-MultiAI under various configurations. The results reveal that Edge-MultiAI can stimulate the degree of multi-tenancy on the edge by at least 2X and increase the number of warm-starts by around 60% without any major loss on the inference accuracy of the applications.

* Accepted in Utility Cloud Computing Conference 2022

Via

Access Paper or Ask Questions

Offloading Algorithms for Maximizing Inference Accuracy on Edge Device Under a Time Constraint

Dec 21, 2021

Andrea Fresa, Jaya Prakash Champati

Figure 1 for Offloading Algorithms for Maximizing Inference Accuracy on Edge Device Under a Time Constraint

Figure 2 for Offloading Algorithms for Maximizing Inference Accuracy on Edge Device Under a Time Constraint

Figure 3 for Offloading Algorithms for Maximizing Inference Accuracy on Edge Device Under a Time Constraint

Figure 4 for Offloading Algorithms for Maximizing Inference Accuracy on Edge Device Under a Time Constraint

Abstract:With the emergence of edge computing, the problem of offloading jobs between an Edge Device (ED) and an Edge Server (ES) received significant attention in the past. Motivated by the fact that an increasing number of applications are using Machine Learning (ML) inference, we study the problem of offloading inference jobs by considering the following novel aspects: 1) in contrast to a typical computational job, the processing time of an inference job depends on the size of the ML model, and 2) recently proposed Deep Neural Networks (DNNs) for resource-constrained devices provide the choice of scaling the model size. We formulate an assignment problem with the aim of maximizing the total inference accuracy of n data samples available at the ED, subject to a time constraint T on the makespan. We propose an approximation algorithm AMR2, and prove that it results in a makespan at most 2T, and achieves a total accuracy that is lower by a small constant from optimal total accuracy. As proof of concept, we implemented AMR2 on a Raspberry Pi, equipped with MobileNet, and is connected to a server equipped with ResNet, and studied the total accuracy and makespan performance of AMR2 for image classification application.

Via

Access Paper or Ask Questions