Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

Nov 08, 2023

Haim Barad, Ekaterina Aidova, Yury Gorbachev

Abstract: Inference optimizations are critical for improving user experience and for reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling, which reduces the overall latency of text generation, and compare it with standard autoregressive sampling. Speculative sampling can be combined with model-based optimizations (e.g., quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.

* To be published on openvino.ai. Code available at https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/266-speculative-sampling
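
The contrast between the two sampling methods named in the abstract can be illustrated with a minimal sketch. This is not the code from the linked notebook, and it does not use OpenVINO. It assumes a simplified greedy variant of speculative sampling (accept a drafted token only if it equals the target model's own choice) rather than the probabilistic accept/reject rule, and it omits KV caching. The names `target_model`, `draft_model`, `autoregressive`, and `speculative` are hypothetical, and the two "models" are toy deterministic rules standing in for a large and a small language model.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(tokens: List[int]) -> int:
    # Hypothetical large model: a toy deterministic rule, not a real LLM.
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_model(tokens: List[int]) -> int:
    # Hypothetical small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    nxt = (tokens[-1] * 3 + len(tokens)) % 11
    return nxt if len(tokens) % 4 else (nxt + 1) % 11


def autoregressive(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Standard decoding: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative(target: Model, draft: Model, prompt: List[int],
                n_new: int, k: int = 3) -> Tuple[List[int], int]:
    # Returns the generated sequence and the number of verification rounds.
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = list(tokens)
        for _ in range(k):
            proposal.append(draft(proposal))

        # 2. The target model checks every drafted position. A real system
        #    would score all positions in one batched forward pass; this toy
        #    calls the target per position but counts it as one round.
        rounds += 1
        accepted = list(tokens)
        for i in range(len(tokens), len(proposal)):
            expected = target(proposal[:i])
            if proposal[i] != expected:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(proposal[i])
        else:
            # Every drafted token matched, so the target adds one more.
            accepted.append(target(proposal))
        tokens = accepted
    return tokens[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2]
    reference = autoregressive(target_model, prompt, 12)
    output, rounds = speculative(target_model, draft_model, prompt, 12)
    print("same output as autoregressive:", output == reference)
    print("target verification rounds:", rounds, "vs 12 sequential target calls")
```

In this greedy setting the speculative loop reproduces the target model's autoregressive output exactly. Any latency benefit comes from needing fewer target verification rounds than generated tokens, which in turn depends on how often the draft model agrees with the target.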