Poulami Das

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving

Apr 14, 2025

Avinash Kumar, Shashank Nag, Jason Clemons, Lizy John, Poulami Das

Abstract: Deploying large language models (LLMs) presents critical challenges due to the inherent trade-offs among key performance metrics, such as latency, accuracy, and throughput. Typically, gains in one metric are accompanied by degradation in others. Early-Exit LLMs (EE-LLMs) efficiently navigate this trade-off space by skipping some of the later model layers when they confidently find an output token early, thus reducing latency without impacting accuracy. However, as the early exits taken depend on the task and are unknown a priori to request processing, EE-LLMs conservatively load the entire model, limiting resource savings and throughput. Also, current frameworks statically select a model for a user task, limiting our ability to adapt to the changing nature of the input queries. We propose HELIOS to address these challenges. First, HELIOS shortlists a set of candidate LLMs and evaluates them using a subset of prompts, gathering telemetry data in real time. Second, HELIOS uses the early-exit data from these evaluations to greedily load the selected model only up to a limited number of layers. This approach yields memory savings that enable us to process more requests at the same time, thereby improving throughput. Third, HELIOS monitors and periodically reassesses the performance of the candidate LLMs and, if needed, switches to another model that can service incoming queries more efficiently (such as using fewer layers without lowering accuracy). Our evaluations show that HELIOS achieves 1.48$\times$ throughput, 1.10$\times$ energy-efficiency, 1.39$\times$ lower response time, and 3.7$\times$ improvements in inference batch sizes compared to the baseline, when optimizing for the respective service level objectives.
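
The second step above (loading a model only up to a limited number of layers, based on observed early exits) can be illustrated with a small sketch. This is not the HELIOS implementation: the function name `layers_to_load`, the `coverage` parameter, and the sample exit data are all hypothetical, and the rule used here (load the shallowest depth that covers a target share of observed exits) is only one plausible way to turn exit telemetry into a layer budget.

```python
from collections import Counter
from typing import Iterable


def layers_to_load(exit_layers: Iterable[int], total_layers: int,
                   coverage: float = 0.95) -> int:
    """Return the smallest layer depth whose cumulative share of observed
    early exits reaches `coverage` (hypothetical policy, not from the paper)."""
    counts = Counter(exit_layers)
    observed = sum(counts.values())
    if observed == 0:
        # No telemetry yet: fall back to loading the whole model.
        return total_layers
    seen = 0
    for layer in sorted(counts):
        seen += counts[layer]
        if seen / observed >= coverage:
            return min(layer, total_layers)
    return total_layers


# Hypothetical exit layers recorded while profiling a few prompts.
profiled_exits = [8, 8, 12, 12, 12, 16, 16, 24, 12, 8]
print(layers_to_load(profiled_exits, total_layers=32, coverage=0.85))
```

With a lower `coverage`, fewer layers are loaded and more memory is freed for batching, at the risk that some tokens would have needed a deeper exit; periodic reassessment, as the abstract describes, is what would correct such a budget over time.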

Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs

Mar 02, 2025

Ravi Ghadia, Avinash Kumar, Gaurav Jain, Prashant Nair, Poulami Das

Abstract: Autoregressive Transformers rely on Key-Value (KV) caching to accelerate inference. However, the linear growth of the KV cache with context length leads to excessive memory consumption and bandwidth constraints. This bottleneck is particularly problematic in real-time applications -- such as chatbots and interactive assistants -- where low latency and high memory efficiency are critical. Existing methods drop distant tokens or compress states in a lossy manner, sacrificing accuracy by discarding vital context or introducing bias. We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. MorphKV balances long-range dependencies and local coherence during text generation. It eliminates early-token bias while retaining high-fidelity context by adaptively ranking tokens through correlation-aware selection. Unlike heuristic retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. This approach captures inter-token correlation with greater accuracy, crucial for tasks like content creation and code generation. Our studies on long-response tasks show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works, enabling efficient real-world deployment.
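
The idea of keeping a fixed-size cache by ranking older tokens with the attention patterns of recent tokens can be sketched as follows. This is an illustrative approximation, not MorphKV itself: the function `select_kv_indices`, the `window` and `budget` parameters, and the scoring rule (sum of attention weights from the recent queries) are assumptions made for the example.

```python
from typing import List, Sequence


def select_kv_indices(attn: Sequence[Sequence[float]], window: int,
                      budget: int) -> List[int]:
    """Pick which cached tokens to keep so the cache holds at most `budget` entries.

    attn[i][j] is the attention weight that recent query i places on cached
    token j. The last `window` tokens are always kept; older tokens compete
    for the remaining slots by their total attention from the recent queries.
    """
    if window > budget:
        raise ValueError("window must not exceed budget")
    n_cached = len(attn[0])
    if n_cached <= budget:
        return list(range(n_cached))
    recent = list(range(n_cached - window, n_cached))
    older = range(n_cached - window)
    scores = {j: sum(row[j] for row in attn) for j in older}
    kept_older = sorted(older, key=lambda j: scores[j], reverse=True)[: budget - window]
    return sorted(kept_older) + recent


# Hypothetical attention rows from the 2 most recent tokens over 6 cached tokens.
attention = [
    [0.05, 0.40, 0.05, 0.10, 0.20, 0.20],
    [0.30, 0.30, 0.02, 0.08, 0.10, 0.20],
]
print(select_kv_indices(attention, window=2, budget=4))
```

Because the selection is rerun as generation proceeds, the cache size stays bounded by `budget` regardless of response length, which is the property the abstract emphasizes.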

Élivágar: Efficient Quantum Circuit Search for Classification

Jan 17, 2024

Sashwat Anagolum, Narges Alavisamani, Poulami Das, Moinuddin Qureshi, Eric Kessler, Yunong Shi

Abstract: Designing performant and noise-robust circuits for Quantum Machine Learning (QML) is challenging -- the design space scales exponentially with circuit size, and there are few well-supported guiding principles for QML circuit design. Although recent Quantum Circuit Search (QCS) methods attempt to search for performant QML circuits that are also robust to hardware noise, they directly adopt designs from classical Neural Architecture Search (NAS) that are misaligned with the unique constraints of quantum hardware, resulting in high search overheads and severe performance bottlenecks. We present Élivágar, a novel resource-efficient, noise-guided QCS framework. Élivágar innovates in all three major aspects of QCS -- search space, search algorithm and candidate evaluation strategy -- to address the design flaws in current classically-inspired QCS methods. Élivágar achieves hardware-efficiency and avoids an expensive circuit-mapping co-search via noise- and device topology-aware candidate generation. By introducing two cheap-to-compute predictors, Clifford noise resilience and Representational capacity, Élivágar decouples the evaluation of noise robustness and performance, enabling early rejection of low-fidelity circuits and reducing circuit evaluation costs. Due to its resource-efficiency, Élivágar can further search for data embeddings, significantly improving performance. Based on a comprehensive evaluation on 12 real quantum devices and 9 QML applications, Élivágar achieves 5.3% higher accuracy and a 271$\times$ speedup compared to state-of-the-art QCS methods.

* 13 pages, 11 figures. To appear in ASPLOS 2024

FrozenQubits: Boosting Fidelity of QAOA by Skipping Hotspot Nodes

Oct 31, 2022

Ramin Ayanzadeh, Narges Alavisamani, Poulami Das, Moinuddin Qureshi

Abstract: The Quantum Approximate Optimization Algorithm (QAOA) is one of the leading candidates for demonstrating quantum advantage using near-term quantum computers. Unfortunately, high device error rates prevent us from reliably running QAOA circuits for problems with more than a few qubits. In QAOA, the problem graph is translated into a quantum circuit such that every edge corresponds to two 2-qubit CNOT operations in each layer of the circuit. As CNOTs are extremely error-prone, the fidelity of QAOA circuits is dictated by the number of edges in the problem graph. We observe that the majority of graphs corresponding to real-world applications follow a power-law distribution, where some hotspot nodes have a significantly higher number of connections. We leverage this insight and propose FrozenQubits, which freezes the hotspot nodes or qubits and intelligently partitions the state-space of the given problem into several smaller sub-spaces, which are then solved independently. The corresponding QAOA sub-circuits are significantly less vulnerable to gate and decoherence errors due to the reduced number of CNOT operations in each sub-circuit. Unlike prior circuit-cutting approaches, FrozenQubits does not require any exponentially complex post-processing step. Our evaluations with 5,300 QAOA circuits on eight different quantum computers from IBM show that FrozenQubits can improve the quality of solutions by 8.73x on average (and by up to 57x), albeit utilizing 2x more quantum resources.
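
Freezing a hotspot node can be illustrated on a small Ising-style problem, assuming the cost function has the common form E(z) = sum_i h_i z_i + sum_(i,j) J_ij z_i z_j with spins z_i in {-1, +1}. Fixing one spin folds its couplings into the linear terms of its neighbours and removes all of its edges. The sketch below is a plain illustration of that substitution, not the FrozenQubits code; the names `hotspot`, `freeze`, and `energy` and the example graph are hypothetical.

```python
from typing import Dict, Tuple

Linear = Dict[int, float]
Quadratic = Dict[Tuple[int, int], float]


def hotspot(J: Quadratic) -> int:
    """Return the node with the most edges in the problem graph."""
    degree: Dict[int, int] = {}
    for a, b in J:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return max(degree, key=degree.get)


def freeze(h: Linear, J: Quadratic, node: int, spin: int) -> Tuple[Linear, Quadratic, float]:
    """Fix `node` to `spin` (+1 or -1) and return the smaller sub-problem
    (new_h, new_J, constant energy offset)."""
    new_h = {i: v for i, v in h.items() if i != node}
    new_J: Quadratic = {}
    offset = h.get(node, 0.0) * spin
    for (a, b), w in J.items():
        if a == node:
            new_h[b] = new_h.get(b, 0.0) + w * spin  # edge becomes a linear term
        elif b == node:
            new_h[a] = new_h.get(a, 0.0) + w * spin
        else:
            new_J[(a, b)] = w  # edge untouched by the frozen node
    return new_h, new_J, offset


def energy(h: Linear, J: Quadratic, z: Dict[int, int]) -> float:
    """Evaluate the Ising cost for a spin assignment z."""
    return (sum(v * z[i] for i, v in h.items())
            + sum(w * z[a] * z[b] for (a, b), w in J.items()))


# Hypothetical 4-node problem in which node 0 is the hotspot.
h = {0: 0.5, 1: 0.0, 2: -0.2, 3: 0.0}
J = {(0, 1): 1.0, (0, 2): 1.0, (0, 3): -1.0, (1, 2): 0.5}
k = hotspot(J)
for s in (+1, -1):
    sub_h, sub_J, offset = freeze(h, J, k, s)
    print(f"node {k} = {s:+d}: {len(J)} edges -> {len(sub_J)} edges, offset {offset}")
```

Each frozen node yields two sub-problems (one per spin value), which is consistent with the abstract's note that the approach uses more quantum resources in exchange for sub-circuits with fewer CNOTs.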

Multilevel Threshold Based Gray Scale Image Segmentation using Cuckoo Search

Jul 01, 2013

Sourav Samantaa, Nilanjan Dey, Poulami Das, Suvojit Acharjee, Sheli Sinha Chaudhuri

Abstract: Image segmentation is a technique for partitioning the original image into distinct classes. Many possible solutions may be available for segmenting an image into a certain number of classes, each with a different quality of segmentation. In our proposed method, a multilevel thresholding technique is used for image segmentation. A new approach based on Cuckoo Search (CS) is used to select the optimal threshold values. In other words, the algorithm is used to reach the best solution starting from initial random threshold values (solutions), and a correlation function is used to evaluate the quality of a solution. Finally, MSE and PSNR are measured to assess the segmentation quality.

* 8 pages, 7 figures, ICECIT2012, Anantapur, India. arXiv admin note: text overlap with arXiv:1003.1594, arXiv:1005.2908 by other authors
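
To make the evaluation step concrete, the sketch below applies a given set of thresholds to gray levels and computes MSE and PSNR between the original and segmented values. It does not implement Cuckoo Search or the correlation objective from the abstract; the thresholds are supplied by hand, and replacing each pixel with its class mean is an assumption made for illustration. The function names and sample pixel values are hypothetical.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def segment(pixels: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Assign each pixel to a class by threshold and replace it with the class mean."""
    cuts = sorted(thresholds)
    labels = [bisect_right(cuts, p) for p in pixels]
    sums = [0.0] * (len(cuts) + 1)
    counts = [0] * (len(cuts) + 1)
    for p, k in zip(pixels, labels):
        sums[k] += p
        counts[k] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return [means[k] for k in labels]


def mse(original: Sequence[float], processed: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, processed)) / len(original)


def psnr(original: Sequence[float], processed: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, for 8-bit images by default."""
    err = mse(original, processed)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)


# Hypothetical flattened gray-scale image and two hand-picked thresholds.
image = [10, 20, 30, 100, 110, 120, 200, 210, 220]
segmented = segment(image, thresholds=[64, 160])
print(segmented)
print(round(mse(image, segmented), 2), round(psnr(image, segmented), 2))
```

An optimizer such as Cuckoo Search would propose many candidate threshold sets and keep the one that scores best under the chosen objective; the metrics here only report how closely the final segmentation follows the original image.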

Embedding of Blink Frequency in Electrooculography Signal using Difference Expansion based Reversible Watermarking Technique

Mar 09, 2013

Nilanjan Dey, Prasenjit Maji, Poulami Das, Shouvik Biswas, Achintya Das, Sheli Sinha Chaudhuri

Abstract: In the past few years, the rapid expansion of digitization and globalization has influenced the medical field, as it has other fields. To improve diagnostic results, most reputed hospitals and diagnostic centres all over the world have started exchanging medical information. In the proposed method, the calculated diagnostic parametric values of the original Electrooculography (EOG) signal are embedded as a watermark using a reversible watermarking technique based on the Difference Expansion (DE) algorithm. The extracted watermark provides the required parametric values at the recipient end without any post-computation on the recovered EOG signal. By computing the parametric values from the recovered signal, the integrity of the extracted watermark can be validated. The time-domain features of the EOG signal are calculated to generate the watermark. In the current work, various features are studied and two major features related to blink frequency are used to generate the watermark. The high Signal-to-Noise Ratio (SNR) and the Bit Error Rate (BER) demonstrate the robustness of the proposed method.

* Scientific Bulletin of the Politehnica University of Timisoara - Transactions on Electronics and Communications p-ISSN 1583-3380, vol. 57(71), no. 2, 2012
* 6 Pages, 3 Figures, 4 Tables
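
The reversibility that Difference Expansion provides can be shown on a single pair of integer samples. The sketch below follows the classic integer-pair formulation of DE (average and difference, with the difference doubled to make room for one bit); whether the paper uses exactly this variant is an assumption, and overflow or range checks on the expanded samples are omitted. The function names and sample values are hypothetical.

```python
from typing import Tuple


def de_embed(x: int, y: int, bit: int) -> Tuple[int, int]:
    """Embed one watermark bit into the integer sample pair (x, y)."""
    avg = (x + y) // 2           # integer average, preserved by the transform
    diff = x - y
    expanded = 2 * diff + bit    # expand the difference and append the bit
    return avg + (expanded + 1) // 2, avg - expanded // 2


def de_extract(xw: int, yw: int) -> Tuple[int, int, int]:
    """Recover (x, y, bit) from a watermarked pair."""
    avg = (xw + yw) // 2
    expanded = xw - yw
    bit = expanded % 2           # the embedded bit is the parity
    diff = expanded // 2         # floor division undoes the expansion
    return avg + (diff + 1) // 2, avg - diff // 2, bit


# Hypothetical quantized signal samples carrying the bits 1, 0, 1.
pairs = [(10, 7), (7, 10), (120, 120)]
bits = [1, 0, 1]
for (x, y), b in zip(pairs, bits):
    xw, yw = de_embed(x, y, b)
    print((x, y), b, "->", (xw, yw), "->", de_extract(xw, yw))
```

Because the original pair is restored exactly, the receiver can recompute the parametric values from the recovered signal and compare them with the extracted watermark, which is the integrity check the abstract describes.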
