Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joo-Young Kim

AoP-SAM: Automation of Prompts for Efficient Segmentation

May 17, 2025

Yi Chen, Mu-Young Son, Chuanbo Hua, Joo-Young Kim

Abstract:The Segment Anything Model (SAM) is a powerful foundation model for image segmentation, showing robust zero-shot generalization through prompt engineering. However, relying on manual prompts is impractical for real-world applications, particularly in scenarios where rapid prompt provision and resource efficiency are crucial. In this paper, we propose the Automation of Prompts for SAM (AoP-SAM), a novel approach that learns to generate essential prompts in optimal locations automatically. AoP-SAM enhances SAM's efficiency and usability by eliminating manual input, making it better suited for real-world tasks. Our approach employs a lightweight yet efficient Prompt Predictor model that detects key entities across images and identifies the optimal regions for placing prompt candidates. This method leverages SAM's image embeddings, preserving its zero-shot generalization capabilities without requiring fine-tuning. Additionally, we introduce a test-time instance-level Adaptive Sampling and Filtering mechanism that generates prompts in a coarse-to-fine manner. This notably enhances both prompt and mask generation efficiency by reducing computational overhead and minimizing redundant mask refinements. Evaluations of three datasets demonstrate that AoP-SAM substantially improves both prompt generation efficiency and mask generation accuracy, making SAM more effective for automated segmentation tasks.

* Accepted at AAAI 2025

Via

Access Paper or Ask Questions

LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization

May 09, 2025

Seunghee Han, Soongyu Choi, Joo-Young Kim

Abstract:Recent advances in Protein Structure Prediction Models (PPMs), such as AlphaFold2 and ESMFold, have revolutionized computational biology by achieving unprecedented accuracy in predicting three-dimensional protein folding structures. However, these models face significant scalability challenges, particularly when processing proteins with long amino acid sequences (e.g., sequence length > 1,000). The primary bottleneck that arises from the exponential growth in activation sizes is driven by the unique data structure in PPM, which introduces an additional dimension that leads to substantial memory and computational demands. These limitations have hindered the effective scaling of PPM for real-world applications, such as analyzing large proteins or complex multimers with critical biological and pharmaceutical relevance. In this paper, we present LightNobel, the first hardware-software co-designed accelerator developed to overcome scalability limitations on the sequence length in PPM. At the software level, we propose Token-wise Adaptive Activation Quantization (AAQ), which leverages unique token-wise characteristics, such as distogram patterns in PPM activations, to enable fine-grained quantization techniques without compromising accuracy. At the hardware level, LightNobel integrates the multi-precision reconfigurable matrix processing unit (RMPU) and versatile vector processing unit (VVPU) to enable the efficient execution of AAQ. Through these innovations, LightNobel achieves up to 8.44x, 8.41x speedup and 37.29x, 43.35x higher power efficiency over the latest NVIDIA A100 and H100 GPUs, respectively, while maintaining negligible accuracy loss. It also reduces the peak memory requirement up to 120.05x in PPM, enabling scalable processing for proteins with long sequences.

* To appear in the Proceedings of the 52nd IEEE/ACM International Symposium on Computer Architecture (ISCA 2025)

Via

Access Paper or Ask Questions

SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models

Apr 01, 2025

Jinho Yang, Ji-Hoon Kim, Joo-Young Kim

Abstract:Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. However, with improvements in performance, the parameter size of DLRMs has grown to terabyte (TB) scales, accompanied by memory bandwidth demands exceeding TB/s levels. Furthermore, the workload intensity within the model varies based on the target mechanism, making it difficult to build an optimized recommendation system. In this paper, we propose SCRec, a scalable computational storage recommendation system that can handle TB-scale industrial DLRMs while guaranteeing high bandwidth requirements. SCRec utilizes a software framework that features a mixed-integer programming (MIP)-based cost model, efficiently fetching data based on data access patterns and adaptively configuring memory-centric and compute-centric cores. Additionally, SCRec integrates hardware acceleration cores to enhance DLRM computations, particularly allowing for the high-performance reconstruction of approximated embedding vectors from extremely compressed tensor-train (TT) format. By combining its software framework and hardware accelerators, while eliminating data communication overhead by being implemented on a single server, SCRec achieves substantial improvements in DLRM inference performance. It delivers up to 55.77$\times$ speedup compared to a CPU-DRAM system with no loss in accuracy and up to 13.35$\times$ energy efficiency gains over a multi-GPU system.

* 14 pages, 12 figures

Via

Access Paper or Ask Questions

Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Mar 24, 2025

Minsu Kim, Seongmin Hong, RyeoWook Ko, Soongyu Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park

Figure 1 for Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Figure 2 for Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Figure 3 for Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Figure 4 for Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization

Abstract:Modern Large Language Model serving system batches multiple requests to achieve high throughput, while batching attention operations is challenging, rendering memory bandwidth a critical bottleneck. The community relies on high-end GPUs with multiple high-bandwidth memory channels. Unfortunately, HBM's high bandwidth often comes at the expense of limited memory capacity, which reduces core utilization and increases costs. Recent advancements enabling longer contexts for LLMs have substantially increased the key-value cache size, further intensifying the pressures on memory capacity. The literature has explored KV cache quantization techniques, which commonly use low bitwidth for most values, selectively using higher bitwidth for outlier values. While this approach helps achieve high accuracy and low bitwidth simultaneously, it comes with the limitation that cost for online outlier detection is excessively high, negating the advantages. We propose Oaken, an acceleration solution that achieves high accuracy and high performance simultaneously through co-designing algorithm and hardware. To effectively find a sweet spot in the accuracy-performance trade-off space of KV cache quantization, Oaken employs an online-offline hybrid approach, setting outlier thresholds offline, which are then used to determine the quantization scale online. To translate the proposed algorithmic technique into tangible performance gains, Oaken also comes with custom quantization engines and memory management units that can be integrated with any LLM accelerators. We built an Oaken accelerator on top of an LLM accelerator, LPU, and conducted a comprehensive evaluation. Our experiments show that for a batch size of 256, Oaken achieves up to 1.58x throughput improvement over NVIDIA A100 GPU, incurring a minimal accuracy loss of only 0.54\% on average, compared to state-of-the-art KV cache quantization techniques.

* 15 pages, 14 figures, and 4 tables

Via

Access Paper or Ask Questions

Training-Free Graph Filtering via Multimodal Feature Refinement for Extremely Fast Multimodal Recommendation

Mar 06, 2025

Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin

Figure 1 for Training-Free Graph Filtering via Multimodal Feature Refinement for Extremely Fast Multimodal Recommendation

Figure 2 for Training-Free Graph Filtering via Multimodal Feature Refinement for Extremely Fast Multimodal Recommendation

Abstract:Multimodal recommender systems improve the performance of canonical recommender systems with no item features by utilizing diverse content types such as text, images, and videos, while alleviating inherent sparsity of user-item interactions and accelerating user engagement. However, current neural network-based models often incur significant computational overhead due to the complex training process required to learn and integrate information from multiple modalities. To overcome this limitation, we propose MultiModal-Graph Filtering (MM-GF), a training-free method based on the notion of graph filtering (GF) for efficient and accurate multimodal recommendations. Specifically, MM-GF first constructs multiple similarity graphs through nontrivial multimodal feature refinement such as robust scaling and vector shifting by addressing the heterogeneous characteristics across modalities. Then, MM-GF optimally fuses multimodal information using linear low-pass filters across different modalities. Extensive experiments on real-world benchmark datasets demonstrate that MM-GF not only improves recommendation accuracy by up to 13.35% compared to the best competitor but also dramatically reduces computational costs by achieving the runtime of less than 10 seconds.

* 10 pages, 6 figures, 6 tables

Via

Access Paper or Ask Questions

EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

Jan 10, 2025

Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim

Figure 1 for EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

Figure 2 for EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

Figure 3 for EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

Figure 4 for EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models

Abstract:Over the past few years, diffusion models have emerged as novel AI solutions, generating diverse multi-modal outputs from text prompts. Despite their capabilities, they face challenges in computing, such as excessive latency and energy consumption due to their iterative architecture. Although prior works specialized in transformer acceleration can be applied, the iterative nature of diffusion models remains unresolved. In this paper, we present EXION, the first SW-HW co-designed diffusion accelerator that solves the computation challenges by exploiting the unique inter- and intra-iteration output sparsity in diffusion models. To this end, we propose two SW-level optimizations. First, we introduce the FFN-Reuse algorithm that identifies and skips redundant computations in FFN layers across different iterations (inter-iteration sparsity). Second, we use a modified eager prediction method that employs two-step leading-one detection to accurately predict the attention score, skipping unnecessary computations within an iteration (intra-iteration sparsity). We also introduce a novel data compaction mechanism named ConMerge, which can enhance HW utilization by condensing and merging sparse matrices into compact forms. Finally, it has a dedicated HW architecture that supports the above sparsity-inducing algorithms, translating high output sparsity into improved energy efficiency and performance. To verify the feasibility of the EXION, we first demonstrate that it has no impact on accuracy in various types of multi-modal diffusion models. We then instantiate EXION in both server- and edge-level settings and compare its performance against GPUs with similar specifications. Our evaluation shows that EXION achieves dramatic improvements in performance and energy efficiency by 3.2-379.3x and 45.1-3067.6x compared to a server GPU and by 42.6-1090.9x and 196.9-4668.2x compared to an edge GPU.

* To appear in 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA 2025)

Via

Access Paper or Ask Questions

LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning

Oct 29, 2022

Je Yang, JaeUk Kim, Joo-Young Kim

Abstract:Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems in various applications such as multi-robot control and self-driving cars. Unlike supervised model or single-agent reinforcement learning, which actively exploits network pruning, it is obscure that how pruning will work in multi-agent reinforcement learning with its cooperative and interactive characteristics. \par In this paper, we present a real-time sparse training acceleration system named LearningGroup, which adopts network pruning on the training of MARL for the first time with an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose on-chip sparse data encoding loop (OSEL) that enables fast encoding with efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and computation workload allocation to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, LearningGroup system minimizes the cycle time and memory footprint for sparse data generation up to 5.72x and 6.81x. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency for various conditions in MARL, which are 7.13x higher and 12.43x more energy efficient than Nvidia Titan RTX GPU, thanks to the fully on-chip training and highly optimized dataflow/data format provided by FPGA. Most importantly, the accelerator shows speedup up to 12.52x for processing sparse data over the dense case, which is the highest among state-of-the-art sparse training accelerators.

* This paper will be published in IEEE International Conference on Field-Programmable Technology 2022

Via

Access Paper or Ask Questions

DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Sep 22, 2022

Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, Joo-Young Kim

Figure 1 for DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Figure 2 for DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Figure 3 for DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Figure 4 for DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

Abstract:Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which needs the processing of a large input context in the summarization stage, followed by the generation stage that produces a single word at a time. The conventional platforms such as GPU are specialized for the parallel processing of large inputs in the summarization stage, but their performance significantly degrades in the generation stage due to its sequential characteristic. Therefore, an efficient hardware platform is required to address the high latency caused by the sequential characteristic of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both summarization and generation stages. DFX uses model parallelism and optimized dataflow that is model-and-hardware-aware for fast simultaneous workload execution among devices. Its compute cores operate on custom instructions and provide GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all of the channels of the high bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves 5.58x speedup and 3.99x energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text generation workloads in cloud datacenters.

* Extension of HOTCHIPS 2022 and accepted in MICRO 2022

Via

Access Paper or Ask Questions

Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Jul 12, 2022

Ji-Hoon Kim, Yeo-Reum Park, Jaeyoung Do, Soo-Young Ji, Joo-Young Kim

Figure 1 for Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Figure 2 for Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Figure 3 for Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Figure 4 for Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform

Abstract:K-nearest neighbor search is one of the fundamental tasks in various applications and the hierarchical navigable small world (HNSW) has recently drawn attention in large-scale cloud services, as it easily scales up the database while offering fast search. On the other hand, a computational storage device (CSD) that combines programmable logic and storage modules on a single board becomes popular to address the data bandwidth bottleneck of modern computing systems. In this paper, we propose a computational storage platform that can accelerate a large-scale graph-based nearest neighbor search algorithm based on SmartSSD CSD. To this end, we modify the algorithm more amenable on the hardware and implement two types of accelerators using HLS- and RTL-based methodology with various optimization methods. In addition, we scale up the proposed platform to have 4 SmartSSDs and apply graph parallelism to boost the system performance further. As a result, the proposed computational storage platform achieves 75.59 query per second throughput for the SIFT1B dataset at 258.66W power dissipation, which is 12.83x and 17.91x faster and 10.43x and 24.33x more energy efficient than the conventional CPU-based and GPU-based server platform, respectively. With multi-terabyte storage and custom acceleration capability, we believe that the proposed computational storage platform is a promising solution for cost-sensitive cloud datacenters.

* IEEE Transactions on Computers
* Extension of FCCM 20201 and Accepted in Transaction on Computers

Via

Access Paper or Ask Questions

FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

Feb 24, 2021

Je Yang, Seongmin Hong, Joo-Young Kim

Figure 1 for FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

Figure 2 for FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

Figure 3 for FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

Figure 4 for FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism

Abstract:In this paper, we present a deep reinforcement learning platform named FIXAR which employs fixed-point data types and arithmetic units for the first time using a SW/HW co-design approach. Starting from 32-bit fixed-point data, Quantization-Aware Training (QAT) reduces its data precision based on the range of activations and performs retraining to minimize the reward degradation. FIXAR proposes the adaptive array processing core composed of configurable processing elements to support both intra-layer parallelism and intra-batch parallelism for high-throughput inference and training. Finally, FIXAR was implemented on Xilinx U50 and achieves 25293.3 inferences per second (IPS) training throughput and 2638.0 IPS/W accelerator efficiency, which is 2.7 times faster and 15.4 times more energy efficient than those of the CPU-GPU platform without any accuracy degradation.

* This paper will be published in Proc. IEEE/ACM Design Automation Conference (DAC) 2021

Via

Access Paper or Ask Questions