Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jae-Joon Kim

Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning

May 20, 2025

Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim

Abstract:Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective in solving problems that require logical thinking, long reasoning paths significantly increase memory usage and throughput of token generation, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining KV cache that receive high importance score, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60$\times$ compared to the inference with full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.

Via

Access Paper or Ask Questions

FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Feb 03, 2025

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

Abstract:While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to enhance latency for long-context sequences. To enhance processing speeds while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00$\times$ and 1.40$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV successfully maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.

Via

Access Paper or Ask Questions

COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

Jan 12, 2025

Jihoon Park, Jeongin Choe, Dohyun Kim, Jae-Joon Kim

Abstract:Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency, memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate much more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.

* Accepted IEEE DATE 2025

Via

Access Paper or Ask Questions

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Jun 18, 2024

Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim

Figure 1 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 2 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 3 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Figure 4 for Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Abstract:Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process only involves the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining similar model size to static binarization techniques.

Via

Access Paper or Ask Questions

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Feb 14, 2024

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim

Abstract:Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning, because LLMs exhibit block-level redundancy with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB successfully accelerates LLM inference without compromising the linguistic capabilities of these models, making it a promising technique for optimizing the efficiency of LLMs. The code is available at: https://github.com/leapingjagg-dev/SLEB

Via

Access Paper or Ask Questions

Squeezing Large-Scale Diffusion Models for Mobile

Jul 03, 2023

Jiwoong Choi, Minkyu Kim, Daehyun Ahn, Taesu Kim, Yulhwa Kim, Dongwon Jo, Hyesung Jeon, Jae-Joon Kim, Hyungjun Kim

Abstract:The emergence of diffusion models has greatly broadened the scope of high-fidelity image synthesis, resulting in notable advancements in both practical implementation and academic research. With the active adoption of the model in various real-world applications, the need for on-device deployment has grown considerably. However, deploying large diffusion models such as Stable Diffusion with more than one billion parameters to mobile devices poses distinctive challenges due to the limited computational and memory resources, which may vary according to the device. In this paper, we present the challenges and solutions for deploying Stable Diffusion on mobile devices with TensorFlow Lite framework, which supports both iOS and Android devices. The resulting Mobile Stable Diffusion achieves the inference latency of smaller than 7 seconds for a 512x512 image generation on Android devices with mobile GPUs.

* 7 pages, 8 figures, ICML 2023 Workshop on Challenges in Deployable Generative AI

Via

Access Paper or Ask Questions

INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Apr 18, 2022

Changhun Lee, Hyungjun Kim, Eunhyeok Park, Jae-Joon Kim

Figure 1 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 2 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 3 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Figure 4 for INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold

Abstract:Binary Neural Networks (BNNs) have emerged as a promising solution for reducing the memory footprint and compute costs of deep neural networks. BNNs, on the other hand, suffer from information loss because binary activations are limited to only two values, resulting in reduced accuracy. To improve the accuracy, previous studies have attempted to control the distribution of binary activation by manually shifting the threshold of the activation function or making the shift amount trainable. During the process, they usually depended on statistical information computed from a batch. We argue that using statistical data from a batch fails to capture the crucial information for each input instance in BNN computations, and the differences between statistical information computed from each instance need to be considered when determining the binary activation threshold of each instance. Based on the concept, we propose the Binary Neural Network with INSTAnce-aware threshold (INSTA-BNN), which decides the activation threshold value considering the difference between statistical data computed from a batch and each instance. The proposed INSTA-BNN outperforms the baseline by 2.5% and 2.3% on the ImageNet classification task with comparable computing cost, achieving 68.0% and 71.7% top-1 accuracy on ResNet-18 and MobileNetV1 based models, respectively.

* 19 pages, 7 figures; excluded axessibility package

Via

Access Paper or Ask Questions

Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Dec 02, 2020

Hyungjun Kim, Jihoon Park, Changhun Lee, Jae-Joon Kim

Figure 1 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 2 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 3 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Figure 4 for Improving Accuracy of Binary Neural Networks using Unbalanced Activation Distribution

Abstract:Binarization of neural network models is considered as one of the promising methods to deploy deep neural network models on resource-constrained environments such as mobile devices. However, Binary Neural Networks (BNNs) tend to suffer from severe accuracy degradation compared to the full-precision counterpart model. Several techniques were proposed to improve the accuracy of BNNs. One of the approaches is to balance the distribution of binary activations so that the amount of information in the binary activations becomes maximum. Based on extensive analysis, in stark contrast to previous work, we argue that unbalanced activation distribution can actually improve the accuracy of BNNs. We also show that adjusting the threshold values of binary activation functions results in the unbalanced distribution of the binary activation, which increases the accuracy of BNN models. Experimental results show that the accuracy of previous BNN models (e.g. XNOR-Net and Bi-Real-Net) can be improved by simply shifting the threshold values of binary activation functions without requiring any other modification.

* 11 pages, 10 figures

Via

Access Paper or Ask Questions

Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks

Jun 04, 2020

Jinseok Kim, Kyungsu Kim, Jae-Joon Kim

Figure 1 for Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks

Figure 2 for Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks

Figure 3 for Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks

Figure 4 for Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks

Abstract:For the gradient computation across the time domain in Spiking Neural Networks (SNNs) training, two different approaches have been independently studied. The first is to compute the gradients with respect to the change in spike activation (activation-based methods), and the second is to compute the gradients with respect to the change in spike timing (timing-based methods). In this work, we present a comparative study of the two methods and propose a new supervised learning method that combines them. The proposed method utilizes each individual spike more effectively by shifting spike timings as in the timing-based methods as wells as generating and removing spikes as in the activation-based methods. Experimental results showed that the proposed method achieves higher performance in terms of both accuracy and efficiency than the previous approaches.

Via

Access Paper or Ask Questions

BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Feb 16, 2020

Hyungjun Kim, Kyungsu Kim, Jinseok Kim, Jae-Joon Kim

Figure 1 for BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Figure 2 for BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Figure 3 for BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Figure 4 for BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Abstract:Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.

Via

Access Paper or Ask Questions