Title: Optimized Speculative Sampling for GPU Hardware Accelerators

Jun 16, 2024

Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
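
For orientation, the accept/reject step that speculative sampling is generally built on can be sketched in NumPy as below. This is a minimal illustration of the standard rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q)), not the authors' GPU kernel. The function names `softmax` and `speculative_accept` and the array shapes are assumptions made for this example. The sigmoid approximation mentioned in the abstract is not shown because the abstract does not give its details, and the extra token usually sampled from the target model when every draft token is accepted is omitted for brevity.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def speculative_accept(p, q, draft_tokens, rng):
    """Standard speculative-sampling accept/reject step (illustrative).

    p: (K, V) target-model probabilities for K drafted positions.
    q: (K, V) draft-model probabilities for the same positions.
    draft_tokens: (K,) token ids sampled from q, so q[k, token] > 0.
    Returns the accepted tokens, ending with a resampled token
    if a draft token was rejected.
    """
    K = len(draft_tokens)
    idx = np.arange(K)
    # Element-wise intermediates. Each entry depends only on its own
    # position, which is why this work can be spread across many threads.
    ratio = np.minimum(1.0, p[idx, draft_tokens] / q[idx, draft_tokens])
    residual = np.maximum(p - q, 0.0)
    u = rng.random(K)  # uniform samples in [0, 1)

    out = []
    for k in range(K):
        if u[k] < ratio[k]:
            out.append(int(draft_tokens[k]))
        else:
            # Rejected: resample from the normalized residual and stop.
            r = residual[k] / residual[k].sum()
            out.append(int(rng.choice(len(r), p=r)))
            break
    return out
```

In this sketch, `ratio` and `residual` play the role of the intermediate matrices the abstract refers to: they are computed element-wise before any sequential decision is made, so only the short accept/reject loop is inherently serial.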
