Hyoung-Kyu Song

LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

Apr 18, 2024

Thibault Castells, Hyoung-Kyu Song, Bo-Kyeong Kim, Shinkook Choi

Abstract: Latent Diffusion Models (LDMs) have emerged as powerful generative models, known for delivering remarkable results under constrained computational resources. However, deploying LDMs on resource-limited devices remains a complex issue, presenting challenges such as memory consumption and inference speed. To address this issue, we introduce LD-Pruner, a novel performance-preserving structured pruning method for compressing LDMs. Traditional pruning methods for deep neural networks are not tailored to the unique characteristics of LDMs, such as the high computational cost of training and the absence of a fast, straightforward and task-agnostic method for evaluating model performance. Our method tackles these challenges by leveraging the latent space during the pruning process, enabling us to effectively quantify the impact of pruning on model performance, independently of the task at hand. This targeted pruning of components with minimal impact on the output allows for faster convergence during training, as the model has less information to re-learn, thereby addressing the high computational cost of training. Consequently, our approach achieves a compressed model that offers improved inference speed and reduced parameter count, while maintaining minimal performance degradation. We demonstrate the effectiveness of our approach on three different tasks: text-to-image (T2I) generation, Unconditional Image Generation (UIG) and Unconditional Audio Generation (UAG). Notably, we reduce the inference time of Stable Diffusion (SD) by 34.9% while simultaneously improving its FID by 5.2% on the MS-COCO T2I benchmark. This work paves the way for more efficient pruning methods for LDMs, enhancing their applicability.

* 8 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)
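
The idea of scoring components by their effect on the latent output can be illustrated with a toy sketch. This is not the paper's implementation: the operators are hypothetical stand-ins for network blocks, and the mean squared difference used as the gap measure is an assumption chosen for simplicity rather than the scoring formula from the paper.

```python
import random


def run(ops, latent):
    """Apply a sequence of operators to a latent vector."""
    for op in ops:
        latent = op(latent)
    return latent


def latent_gap(a, b):
    """Mean squared difference between two latent vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def score_operators(ops, latents):
    """Score each operator by how far the output latents move when it is skipped."""
    reference = [run(ops, z) for z in latents]
    scores = []
    for i in range(len(ops)):
        pruned = ops[:i] + ops[i + 1:]  # model with operator i removed
        gaps = [latent_gap(run(pruned, z), ref) for z, ref in zip(latents, reference)]
        scores.append(sum(gaps) / len(gaps))
    return scores


# Toy operators: hypothetical stand-ins for network blocks.
ops = [
    lambda z: [2.0 * x for x in z],    # strong effect on the latent
    lambda z: [x + 0.001 for x in z],  # nearly an identity
    lambda z: [x - 1.0 for x in z],    # moderate effect on the latent
]

rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]

scores = score_operators(ops, latents)
# The operator with the smallest score is the cheapest one to remove.
least_important = min(range(len(ops)), key=scores.__getitem__)
```

Because the comparison happens on latents rather than on decoded images or audio, the same scoring loop applies regardless of the downstream task, which is the task-agnostic property the abstract emphasizes.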

EdgeFusion: On-Device Text-to-Image Generation

Apr 18, 2024

Thibault Castells, Hyoung-Kyu Song, Tairen Piao, Shinkook Choi, Bo-Kyeong Kim, Hanyoung Yim, Changgwun Lee, Jae Gon Kim, Tae-Ho Kim

Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. This leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.

* 4 pages, accepted to CVPR24 First Workshop on Efficient and On-Device Generation (EDGE)

LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping

Mar 02, 2024

Changho Choi, Minho Kim, Junhyeok Lee, Hyoung-Kyu Song, Younggeun Kim, Seungryong Kim

Abstract: We propose LatentSwap, a simple face swapping framework generating a face swap latent code of a given generator. Utilizing randomly sampled latent codes, our framework is light and does not require datasets besides employing the pre-trained models, with the training procedure also being fast and straightforward. The loss objective consists of only three terms, and can effectively control the face swap results between source and target images. By attaching a pre-trained GAN inversion model independent of the model and using the StyleGAN2 generator, our model produces photorealistic and high-resolution images comparable to other competitive face swap models. We show that our framework is applicable to other generators such as StyleNeRF, paving the way to 3D-aware face swapping, and is also compatible with other downstream StyleGAN2 generator tasks. The source code and models can be found at https://github.com/usingcolor/LatentSwap.

* 9 pages, 11 figures

Shortened LLaMA: A Simple Depth Pruning for Large Language Models

Feb 05, 2024

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

Abstract: Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that a simple depth pruning approach can compete with recent width pruning methods in terms of zero-shot task performance. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. We hope this work can help deploy LLMs on local and edge devices.
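
The distinction between the two pruning units can be made concrete with a small sketch. Everything here is illustrative: the `Block` class, the layer sizes, and the choice of which blocks to drop are assumptions for the example, and only attention projection weights are counted. The paper selects blocks using its own importance criteria, which are not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    """Toy transformer block; only attention projections are modeled."""
    d_model: int
    n_heads: int
    head_dim: int

    def attn_params(self) -> int:
        # Q, K, V and output projections, biases ignored.
        return 4 * self.d_model * self.n_heads * self.head_dim


def width_prune(blocks, heads_to_keep):
    """Shrink every block's projections; the number of blocks is unchanged."""
    return [replace(b, n_heads=heads_to_keep) for b in blocks]


def depth_prune(blocks, drop_indices):
    """Remove whole blocks; the remaining blocks keep their original size."""
    drop = set(drop_indices)
    return [b for i, b in enumerate(blocks) if i not in drop]


def total_params(blocks):
    return sum(b.attn_params() for b in blocks)


model = [Block(d_model=512, n_heads=8, head_dim=64) for _ in range(8)]
narrow = width_prune(model, heads_to_keep=6)       # 8 blocks, 6 heads each
shallow = depth_prune(model, drop_indices=[5, 6])  # 6 blocks, 8 heads each
```

In this toy setup both variants remove 25% of the attention parameters, but only the depth-pruned model executes fewer sequential blocks. That is consistent with the abstract's claim that depth pruning helps most when batch sizes are small and per-layer matrix shrinkage yields little speedup.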

On Architectural Compression of Text-to-Image Diffusion Models

May 25, 2023

Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi

Abstract: Exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To resolve this issue, recent research on efficient SDMs has prioritized reducing the number of sampling steps and utilizing network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of parameters, MACs per sampling step, and latency. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge and achieve competitive results against larger multi-billion parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models in personalized generation with DreamBooth finetuning.

* 10 figures, 5 tables

Talking Face Generation with Multilingual TTS

May 13, 2022

Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, Kang-wook Kim

Abstract: In this work, we propose a joint system combining a talking face generation system with a text-to-speech system that can generate multilingual talking face videos from only the text input. Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker, as well as lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system by selecting four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model to outputs of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber so that users can utilize the multilingual property of our system more easily.

* Accepted to CVPR Demo Track (2022)

Deep User Identification Model with Multiple Biometrics

Sep 03, 2019

Hyoung-Kyu Song, Ebrahim AlAlkeem, Jaewoong Yun, Tae-Ho Kim, Hyerin Yoo, Dasom Heo, Chan Yeob Yeun, Myungsu Chae

Abstract: Identification using biometrics is an important yet challenging task. Abundant research has been conducted on identifying personal identity or gender using given signals. Various types of biometrics such as electrocardiogram (ECG), electroencephalogram (EEG), face, fingerprint, and voice have been used for these tasks. Most research has only focused on a single modality or a single task, while the combination of input modalities or tasks is yet to be investigated. In this paper, we propose deep identification and gender classification using multimodal biometrics. Our model uses ECG, fingerprint, and facial data. It then performs two tasks: identification and gender classification. By engaging multi-modality, a single model can handle various input domains without training each modality independently, and the correlation between domains can increase its generalization performance on the tasks.

* Accepted, CIKM 2019 Workshop on DTMBio
