Sungsoo Ha

BASS: Batched Attention-optimized Speculative Sampling

Apr 24, 2024

Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

Abstract: Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.

Metal Artifact Reduction in Cone-Beam X-Ray CT via Ray Profile Correction

Aug 06, 2018

Sungsoo Ha, Klaus Mueller

Abstract: In computed tomography (CT), metal implants increase the inconsistencies between the measured data and the linear attenuation assumption made by analytic CT reconstruction algorithms. The inconsistencies give rise to dark and bright bands and streaks in the reconstructed image, collectively called metal artifacts. These artifacts make it difficult for radiologists to render correct diagnostic decisions. We describe a data-driven metal artifact reduction (MAR) algorithm for image-guided spine surgery that applies to scenarios in which a prior CT scan of the patient is available. We tested the proposed method with two clinical datasets that were both obtained during spine surgery. Using the proposed method, we were not only able to remove the dark and bright streaks caused by the implanted screws but we also recovered the anatomical structures hidden by these artifacts. This results in an improved capability of surgeons to confirm the correctness of the implanted pedicle screw placements.

* 10 pages, 8 figures
