Abstract: Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient, as they do not exploit the diverse range of available models or adapt to varying query requirements. This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query, including task type, semantic cluster, and text complexity, and routes the query to the most suitable model from a heterogeneous pool based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline. We evaluated GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieved a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, we evaluated GreenServ on RouterBench, where it achieved an average accuracy of 71.7% and a peak accuracy of 75.7%. All artifacts are open source and available in an anonymous repository for review purposes here: https://anonymous.4open.science/r/llm-inference-router-EBEA/README.md
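
To make the routing mechanism concrete, the following is a minimal sketch of a context-aware epsilon-greedy bandit router of the kind described above. The class name, the feature tuple, the model names, and the reward shape (observed accuracy minus a weighted energy penalty) are illustrative assumptions, not GreenServ's actual implementation.

import random
from collections import defaultdict

class EpsilonGreedyRouter:
    """Routes each query to one model (arm) per context bucket, learning from
    partial feedback: only the chosen model's accuracy and energy are observed."""

    def __init__(self, models, epsilon=0.1, lambda_energy=0.5):
        self.models = models
        self.epsilon = epsilon
        self.lambda_energy = lambda_energy  # accuracy/energy trade-off weight (assumed)
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = defaultdict(lambda: defaultdict(float))

    def select(self, context):
        """context: hashable tuple of lightweight features,
        e.g. (task_type, semantic_cluster, complexity_bucket)."""
        if random.random() < self.epsilon:
            return random.choice(self.models)                # explore
        est = self.values[context]
        return max(self.models, key=lambda m: est[m])        # exploit best estimate

    def update(self, context, model, accuracy, energy_wh):
        """Partial feedback: reward combines observed accuracy and energy."""
        reward = accuracy - self.lambda_energy * energy_wh
        self.counts[context][model] += 1
        n = self.counts[context][model]
        # incremental mean of the reward for this (context, model) pair
        self.values[context][model] += (reward - self.values[context][model]) / n

# Hypothetical usage: route one query, then learn from its observed outcome
router = EpsilonGreedyRouter(models=["llama-8b", "mistral-7b", "qwen-14b"])
ctx = ("qa", 3, "short")   # (task_type, semantic_cluster, complexity_bucket)
model = router.select(ctx)
router.update(ctx, model, accuracy=1.0, energy_wh=0.8)
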
Abstract: Verification throughput is becoming a major bottleneck as the complexity and size of SoC designs continue to grow. Simply adding more CPU cores and running more tests in parallel no longer scales. This paper discusses two methods for improving verification throughput: ranking and the new machine learning (ML) based technology introduced by Cadence, Xcelium ML. Both methods aim to achieve comparable coverage in less CPU time by applying more efficient stimulus. Ranking selects the specific seeds that happened to produce the largest coverage in previous simulations, whereas Xcelium ML generates optimized stimulus by learning correlations between randomization points and the coverage achieved in previous regressions. Quantified results, as well as the pros and cons of each approach, are discussed using three actual industry projects as examples. Both Xcelium ML and ranking consistently delivered comparable compression and speedup factors of around 3, but the ML-optimized regressions also simulated new random scenarios, occasionally producing a coverage regain of more than 100%. Finally, a methodology is proposed for using Xcelium ML efficiently throughout product development.
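
For intuition, the following is a minimal sketch of the coverage-based seed ranking idea described above: greedily keep the seeds that contributed the most new coverage in earlier regressions. The data layout, function name, and greedy set-cover heuristic are illustrative assumptions and do not reflect the vendor tool's actual ranking algorithm.

def rank_seeds(coverage_by_seed, target=None):
    """coverage_by_seed: dict mapping seed -> set of coverage bins hit in a
    previous regression. Returns an ordered list of seeds approximating a
    minimal-seed, maximal-coverage regression (greedy set cover)."""
    remaining = set().union(*coverage_by_seed.values()) if target is None else set(target)
    ranked = []
    pool = dict(coverage_by_seed)
    while remaining and pool:
        # pick the seed that covers the most still-uncovered bins
        best = max(pool, key=lambda s: len(pool[s] & remaining))
        gain = pool[best] & remaining
        if not gain:
            break  # no seed adds new coverage; stop
        ranked.append(best)
        remaining -= gain
        del pool[best]
    return ranked

# Hypothetical example: three seeds from a previous regression
runs = {"seed_17": {1, 2, 3, 4}, "seed_42": {3, 4, 5}, "seed_99": {6}}
print(rank_seeds(runs))   # e.g. ['seed_17', 'seed_42', 'seed_99']
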