Bo-Kyeong Kim

Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Apr 01, 2025

Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim

Abstract: Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike relevant studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. By benefiting from 50%-reduced visual features, our model can reduce inference latency and memory usage while achieving benchmark parity.

* accepted at CVPR 2025 Workshop on ELVM
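
The pruning idea in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that an image token's importance is the sum of the cross-attention weight it receives over all heads and text queries; the abstract does not state the actual selection criterion. The names `select_visual_tokens` and `trim_kv` are hypothetical.

```python
def select_visual_tokens(attn, keep_ratio=0.5):
    """Pick the image tokens that receive the most cross-attention.

    attn: list of heads; each head is a list of query rows; each row is a
    list of attention weights over the image tokens.
    Assumption: importance = total attention mass per image token.
    """
    num_tokens = len(attn[0][0])
    importance = [0.0] * num_tokens
    for head in attn:
        for row in head:
            for j, weight in enumerate(row):
                importance[j] += weight
    k = max(1, int(num_tokens * keep_ratio))
    ranked = sorted(range(num_tokens), key=lambda j: importance[j], reverse=True)
    # Return kept indices in their original order.
    return sorted(ranked[:k])


def trim_kv(keys, values, keep):
    """Drop cached key/value entries for the image tokens that were not kept."""
    return [keys[j] for j in keep], [values[j] for j in keep]


# Toy example: 2 heads, 2 text queries, 4 image tokens.
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],
    [[0.1, 0.1, 0.7, 0.1], [0.2, 0.1, 0.6, 0.1]],
]
keep = select_visual_tokens(attn, keep_ratio=0.5)
keys, values = trim_kv(["k0", "k1", "k2", "k3"], ["v0", "v1", "v2", "v3"], keep)
```

Halving the kept tokens halves the cross-attention KV entries for the image in this toy setup, which is the kind of saving the abstract refers to.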

EdgeFusion: On-Device Text-to-Image Generation

Apr 18, 2024

Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim

Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. This leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

* 4 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)

LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

Apr 18, 2024

Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi

Abstract: Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on the MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.

* 8 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)
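
A rough sketch of task-agnostic scoring in latent space follows. It assumes that each candidate operator is scored by the Euclidean distance between the latent produced by the original model and the latent produced with that operator removed; the abstract does not specify the distance used, so this metric and the names `latent_distance` and `rank_operators` are illustrative assumptions only.

```python
import math


def latent_distance(a, b):
    """Euclidean distance between two flattened latent vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_operators(reference_latent, latents_without_op):
    """Order operators from least to most impact on the latent output.

    latents_without_op maps a (hypothetical) operator name to the latent the
    model produces when that operator is removed.
    """
    scores = {
        op: latent_distance(reference_latent, latent)
        for op, latent in latents_without_op.items()
    }
    return sorted(scores, key=scores.get)


# Toy latents: removing "conv_a" barely changes the output, so it ranks first.
reference = [1.0, 2.0, 3.0]
candidates = {
    "conv_a": [1.0, 2.0, 3.1],
    "attn_b": [0.0, 0.0, 0.0],
    "conv_c": [1.0, 2.5, 3.0],
}
order = rank_operators(reference, candidates)
```

Operators at the front of the list are the natural pruning candidates, since removing them changes the latent the least and needs no task-specific metric.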

Shortened LLaMA: A Simple Depth Pruning for Large Language Models

Feb 05, 2024

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

Abstract: Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.
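
Depth pruning as described here can be sketched in a few lines. The example assumes a per-block importance score is already available; how such scores are computed is not given in the abstract, and `depth_prune` is a hypothetical helper rather than the authors' code.

```python
def depth_prune(blocks, scores, num_remove):
    """Remove the num_remove least important blocks, keeping the rest untouched.

    blocks: ordered list of block identifiers (e.g. Transformer blocks).
    scores: importance per block, same order (assumed to be given).
    """
    if len(blocks) != len(scores):
        raise ValueError("blocks and scores must have the same length")
    if not 0 <= num_remove < len(blocks):
        raise ValueError("num_remove must leave at least one block")
    by_importance = sorted(range(len(blocks)), key=lambda i: scores[i])
    drop = set(by_importance[:num_remove])
    # Surviving blocks keep their original order and their full weight shapes.
    return [block for i, block in enumerate(blocks) if i not in drop]


blocks = ["block0", "block1", "block2", "block3"]
scores = [0.9, 0.2, 0.5, 0.1]
pruned = depth_prune(blocks, scores, num_remove=2)
```

Unlike width pruning, nothing inside the surviving blocks changes, so the per-block matrix shapes stay the same and only the number of sequential blocks shrinks.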

On Architectural Compression of Text-to-Image Diffusion Models

May 25, 2023

Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi

Abstract: Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized reducing the number of sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning.

* 10 figures, 5 tables

A Unified Compression Framework for Efficient Speech-Driven Talking-Face Generation

Apr 02, 2023

Bo-Kyeong Kim, Jaemin Kang, Daeun Seo, Hancheol Park, Shinkook Choi, Hyungshin Kim, Sungsu Lim

Abstract: Virtual humans have gained considerable attention in numerous industries, e.g., entertainment and e-commerce. As a core technology, synthesizing photorealistic face frames from target speech and facial identity has been actively studied with generative adversarial networks. Despite remarkable results of modern talking-face generation models, they often entail high computational burdens, which limit their efficient deployment. This study aims to develop a lightweight model for speech-driven talking-face synthesis. We build a compact generator by removing the residual blocks and reducing the channel width from Wav2Lip, a popular talking-face generator. We also present a knowledge distillation scheme to stably yet effectively train the small-capacity generator without adversarial learning. We reduce the number of parameters and MACs by 28× while retaining the performance of the original model. Moreover, to alleviate a severe performance drop when converting the whole generator to INT8 precision, we adopt a selective quantization method that uses FP16 for the quantization-sensitive layers and INT8 for the other layers. Using this mixed precision, we achieve up to a 19× speedup on edge GPUs without noticeably compromising the generation quality.

* MLSys Workshop on On-Device Intelligence, 2023
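
The selective quantization step can be pictured with the sketch below. It assumes each layer has a measured quality drop when that layer alone is converted to INT8, and that a fixed threshold separates sensitive from insensitive layers; both the metric and the threshold are illustrative assumptions, and `assign_precision` and the layer names are hypothetical.

```python
def assign_precision(int8_quality_drop, threshold):
    """Choose FP16 for quantization-sensitive layers and INT8 for the rest.

    int8_quality_drop: maps a layer name to the quality drop observed when
    only that layer is quantized to INT8 (assumed to be measured beforehand).
    """
    return {
        name: "FP16" if drop > threshold else "INT8"
        for name, drop in int8_quality_drop.items()
    }


# Toy sensitivities: the output layer degrades badly under INT8.
drops = {"enc_conv1": 0.01, "dec_conv3": 0.02, "out_conv": 0.30}
plan = assign_precision(drops, threshold=0.05)
```

The resulting plan keeps most layers in INT8 for speed and reserves FP16 for the few layers that would otherwise cause the severe drop mentioned in the abstract.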

Cut Inner Layers: A Structured Pruning Strategy for Efficient U-Net GANs

Jun 29, 2022

Bo-Kyeong Kim, Shinkook Choi, Hancheol Park

Abstract: Pruning effectively compresses overparameterized models. Despite the success of pruning methods for discriminative models, applying them to generative models has rarely been explored. This study conducts structured pruning on U-Net generators of conditional GANs. A per-layer sensitivity analysis confirms that many unnecessary filters exist in the innermost layers near the bottleneck and can be substantially pruned. Based on this observation, we prune these filters from multiple inner layers or suggest alternative architectures by completely eliminating the layers. We evaluate our approach with Pix2Pix for image-to-image translation and Wav2Lip for speech-driven talking face generation. Our method outperforms global pruning baselines, demonstrating the importance of properly considering where to prune for U-Net generators.

* ICML Workshop on Hardware Aware Efficient Training, 2022
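
The contrast between pruning only the inner layers and pruning globally can be sketched as follows. The example ranks filters by L1 norm, a common criterion that is assumed here because the abstract does not name one; `prune_inner_layers` and the layer names are hypothetical.

```python
def l1_norm(filt):
    """Sum of absolute weights of one flattened filter."""
    return sum(abs(w) for w in filt)


def prune_inner_layers(layers, inner_names, keep_ratio):
    """Prune low-norm filters only in the chosen inner layers.

    layers: maps layer name to a list of filters (each a flat list of weights).
    inner_names: names of the layers near the bottleneck that may be pruned.
    """
    pruned = {}
    for name, filters in layers.items():
        if name not in inner_names:
            pruned[name] = list(filters)  # outer layers stay intact
            continue
        k = max(1, int(len(filters) * keep_ratio))
        ranked = sorted(
            range(len(filters)), key=lambda i: l1_norm(filters[i]), reverse=True
        )
        pruned[name] = [filters[i] for i in sorted(ranked[:k])]
    return pruned


layers = {
    "outer": [[1.0, -1.0], [0.1, 0.1]],
    "inner": [[0.01, 0.0], [2.0, -3.0], [0.0, 0.02], [1.0, 1.0]],
}
result = prune_inner_layers(layers, inner_names={"inner"}, keep_ratio=0.5)
```

Restricting pruning to named inner layers protects the low-norm filter in the outer layer, which a global ranking over all filters would have removed before the larger inner ones.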

Semi-supervised Disentanglement with Independent Vector Variational Autoencoders

Mar 14, 2020

Bo-Kyeong Kim, Sungjin Park, Geonmin Kim, Soo-Young Lee

Abstract: We aim to separate the generative factors of data into two latent vectors in a variational autoencoder. One vector captures class factors relevant to target classification tasks, while the other vector captures style factors relevant to the remaining information. To learn the discrete class features, we introduce supervision using a small amount of labeled data, which can simply yet effectively reduce the effort required for hyperparameter tuning performed in existing unsupervised methods. Furthermore, we introduce a learning objective to encourage statistical independence between the vectors. We show that (i) this vector independence term exists within the result obtained on decomposing the evidence lower bound with multiple latent vectors, and (ii) encouraging such independence along with reducing the total correlation within the vectors enhances disentanglement performance. Experiments conducted on several image datasets demonstrate that the disentanglement achieved via our method can improve classification performance and generation controllability.

* 24 pages: 10 p for main paper (8 figures) and 14 p for supplementary material (12 figures). A shortened version of this paper is currently under review by a conference

Unpaired Speech Enhancement by Acoustic and Adversarial Supervision for Speech Recognition

Nov 06, 2018

Geonmin Kim, Hwaran Lee, Bo-Kyeong Kim, Sang-Hoon Oh, Soo-Young Lee

Abstract: Many speech enhancement methods try to learn the relationship between noisy and clean speech, obtained using an acoustic room simulator. We point out several limitations of enhancement methods that rely on clean speech targets and propose an alternative learning algorithm called acoustic and adversarial supervision (AAS). AAS trains the enhanced output both to maximize the likelihood of the transcription under a pre-trained acoustic model and to have the general characteristics of clean speech, which improves generalization to unseen noisy speech. We employ the connectionist temporal classification and the unpaired conditional boundary equilibrium generative adversarial network as the loss functions of AAS. AAS is tested on two datasets with additive noise, one without and one with reverberation: Librispeech + DEMAND and CHiME-4. By visualizing the enhanced speech with different loss combinations, we demonstrate the role of each supervision. AAS achieves a lower word error rate than other state-of-the-art methods that use clean speech targets on both datasets.

* will be published in IEEE Signal Processing Letters
