Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joonyoung Kim

TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation

Feb 04, 2026

Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

Abstract:The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ's assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.

* ICLR 2026

Via

Access Paper or Ask Questions

Attention-aware Post-training Quantization without Backpropagation

Jun 19, 2024

Junhan Kim, Ho-young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon

Figure 1 for Attention-aware Post-training Quantization without Backpropagation

Figure 2 for Attention-aware Post-training Quantization without Backpropagation

Figure 3 for Attention-aware Post-training Quantization without Backpropagation

Figure 4 for Attention-aware Post-training Quantization without Backpropagation

Abstract:Quantization is a promising solution for deploying large-scale language models (LLMs) on resource-constrained devices. Existing quantization approaches, however, rely on gradient-based optimization, regardless of it being post-training quantization (PTQ) or quantization-aware training (QAT), which becomes problematic for hyper-scale LLMs with billions of parameters. This overhead can be alleviated via recently proposed backpropagation-free PTQ methods; however, their performance is somewhat limited by their lack of consideration of inter-layer dependencies. In this paper, we thus propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation. The fundamental concept involved is the development of attention-aware Hessian matrices, which facilitates the consideration of inter-layer dependencies within the attention module. Extensive experiments demonstrate that the proposed algorithm significantly outperforms conventional PTQ methods, particularly for low bit-widths.

* 20 pages, under review

Via

Access Paper or Ask Questions

Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Feb 14, 2024

Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon

Figure 1 for Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Figure 2 for Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Figure 3 for Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Figure 4 for Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Abstract:With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which could be a bottleneck in real situations where frequent model updates and multiple hyper-parameter tunings are required. As a cost-effective alternative, one-shot PTQ schemes have been proposed. Still, the performance is somewhat limited because they cannot consider the inter-layer dependency within the attention module, which is a very important feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm called aespa is to perform quantization layer-wise for efficiency while considering cross-layer dependency to preserve the attention score. Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.

* 17 pages, under review

Via

Access Paper or Ask Questions

Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Jul 15, 2023

Joonyoung Kim, Kangwook Lee, Haebin Shin, Hurnjoo Lee, Sechun Kang, Byunguk Choi, Dong Shin, Joohyung Lee

Figure 1 for Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Figure 2 for Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Figure 3 for Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Figure 4 for Intuitive Access to Smartphone Settings Using Relevance Model Trained by Contrastive Learning

Abstract:The more new features that are being added to smartphones, the harder it becomes for users to find them. This is because the feature names are usually short, and there are just too many to remember. In such a case, the users may want to ask contextual queries that describe the features they are looking for, but the standard term frequency-based search cannot process them. This paper presents a novel retrieval system for mobile features that accepts intuitive and contextual search queries. We trained a relevance model via contrastive learning from a pre-trained language model to perceive the contextual relevance between query embeddings and indexed mobile features. Also, to make it run efficiently on-device using minimal resources, we applied knowledge distillation to compress the model without degrading much performance. To verify the feasibility of our method, we collected test queries and conducted comparative experiments with the currently deployed search baselines. The results show that our system outperforms the others on contextual sentence queries and even on usual keyword-based queries.

* 7 pages, IAAI 2023

Via

Access Paper or Ask Questions

Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI

Dec 07, 2021

Youngjune Lee, Oh Joon Kwon, Haeju Lee, Joonyoung Kim, Kangwook Lee, Kee-Eung Kim

Figure 1 for Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI

Figure 2 for Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI

Figure 3 for Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI

Figure 4 for Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI

Abstract:Data scarcity and noise are important issues in industrial applications of machine learning. However, it is often challenging to devise a scalable and generalized approach to address the fundamental distributional and semantic properties of dataset with black box models. For this reason, data-centric approaches are crucial for the automation of machine learning operation pipeline. In order to serve as the basis for this automation, we suggest a domain-agnostic pipeline for refining the quality of data in image classification problems. This pipeline contains data valuation, cleansing, and augmentation. With an appropriate combination of these methods, we could achieve 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative) in the Data-Centric AI competition only with the provided dataset.

* Data Centric AI Workshop at NeurIPS 2021

Via

Access Paper or Ask Questions

Neural Sequence-to-grid Module for Learning Symbolic Rules

Jan 13, 2021

Segwang Kim, Hyoungwook Nam, Joonyoung Kim, Kyomin Jung

Figure 1 for Neural Sequence-to-grid Module for Learning Symbolic Rules

Figure 2 for Neural Sequence-to-grid Module for Learning Symbolic Rules

Figure 3 for Neural Sequence-to-grid Module for Learning Symbolic Rules

Figure 4 for Neural Sequence-to-grid Module for Learning Symbolic Rules

Abstract:Logical reasoning tasks over symbols, such as learning arithmetic operations and computer program evaluations, have become challenges to deep learning. In particular, even state-of-the-art neural networks fail to achieve \textit{out-of-distribution} (OOD) generalization of symbolic reasoning tasks, whereas humans can easily extend learned symbolic rules. To resolve this difficulty, we propose a neural sequence-to-grid (seq2grid) module, an input preprocessor that automatically segments and aligns an input sequence into a grid. As our module outputs a grid via a novel differentiable mapping, any neural network structure taking a grid input, such as ResNet or TextCNN, can be jointly trained with our module in an end-to-end fashion. Extensive experiments show that neural networks having our module as an input preprocessor achieve OOD generalization on various arithmetic and algorithmic problems including number sequence prediction problems, algebraic word problems, and computer program evaluation problems while other state-of-the-art sequence transduction models cannot. Moreover, we verify that our module enhances TextCNN to solve the bAbI QA tasks without external memory.

* 9 pages, 9 figures, AAAI 2021

Via

Access Paper or Ask Questions