Taehwan Kim

Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup

Mar 04, 2025

Seokun Kang, Taehwan Kim

Abstract:Video action recognition is a challenging but important task for understanding and discovering what the video does. However, acquiring annotations for a video is costly, and semi-supervised learning (SSL) has been studied to improve performance even with a small number of labeled data in the task. Prior studies for semi-supervised video action recognition have mostly focused on using a single modality, visuals, but video is multi-modal, so utilizing both visuals and audio would be desirable and could improve performance further, which has not been explored well. Therefore, we propose audio-visual SSL for video action recognition, which uses visuals and audio together even with only a few labeled data, which is challenging. In addition, to maximize the information of audio and video, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between video and audio modalities. In experiments on UCF-51, Kinetics-400, and VGGSound datasets, our model shows the superior performance of the proposed semi-supervised audio-visual action recognition framework and audio source localization-guided mixup.

RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

Feb 18, 2025

Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim

Abstract:Transformers have achieved great success in effectively processing sequential data such as text. Their architecture, consisting of several attention and feedforward blocks, can model relations between elements of a sequence in a parallel manner, which makes them very efficient to train and effective in sequence modeling. Even though they have shown strong performance in processing sequential data, their parameter count is considerably larger than that of other architectures such as RNN- and CNN-based models. Therefore, several approaches have explored parameter sharing and recurrence in Transformer models to address their computational demands. However, such methods struggle to maintain high performance compared to the original Transformer model. To address this challenge, we propose our novel approach, RingFormer, which employs one Transformer layer that processes input repeatedly in a circular, ring-like manner, while utilizing low-rank matrices to generate input-dependent level signals. This allows us to reduce the model parameters substantially while maintaining high performance in a variety of tasks such as translation and image classification, as validated in the experiments.
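
The idea of one shared layer applied repeatedly, with a low-rank, input-dependent signal per iteration, can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not say where the level signals are injected, so adding them to the layer input is an assumption, the shared layer here is a plain residual MLP standing in for a full attention and feedforward block, and every name and size (`shared_layer`, `ring_forward`, `d_model`, `rank`, `num_iters`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, num_iters = 16, 2, 4  # hypothetical sizes, chosen for illustration

# One shared "layer". A residual two-layer MLP is used as a stand-in
# (assumption); a real Transformer layer would also contain attention.
W1 = rng.normal(scale=0.1, size=(d_model, 4 * d_model))
W2 = rng.normal(scale=0.1, size=(4 * d_model, d_model))


def shared_layer(x: np.ndarray) -> np.ndarray:
    """Apply the single shared block: x + ReLU(x W1) W2."""
    return x + np.maximum(x @ W1, 0.0) @ W2


# Per-iteration low-rank factors: the only parameters that differ by level.
A = rng.normal(scale=0.1, size=(num_iters, d_model, rank))
B = rng.normal(scale=0.1, size=(num_iters, rank, d_model))


def ring_forward(x: np.ndarray) -> np.ndarray:
    """Run the same layer num_iters times, adding a level signal each time."""
    for i in range(num_iters):
        # Input-dependent signal built from a rank-`rank` factorization.
        level_signal = (x @ A[i]) @ B[i]
        x = shared_layer(x + level_signal)  # injection point is an assumption
    return x


# Parameter counts: shared layer plus low-rank factors vs. a stack of
# num_iters independent copies of the same layer.
shared_params = W1.size + W2.size
ring_params = shared_params + A.size + B.size
stacked_params = num_iters * shared_params

tokens = rng.normal(size=(5, d_model))
output = ring_forward(tokens)
```

With these toy sizes the looped variant holds 2,304 parameters against 8,192 for four independent layers, which is the kind of saving the abstract describes; the numbers say nothing about accuracy.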

Leveraging 2D Masked Reconstruction for Domain Adaptation of 3D Pose Estimation

Jan 14, 2025

Hansoo Park, Chanwoo Kim, Jihyeon Kim, Hoseong Cho, Nhat Nguyen Bao Truong, Taehwan Kim, Seungryul Baek

Abstract:RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well for testing images whose distribution is far from that of training data. This problem might be alleviated by involving diverse data during training; however, it is non-trivial to collect such diverse data with corresponding labels (i.e., 3D pose). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via a masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets in human and hand pose estimation tasks, especially using the cross-domain scenario. We demonstrate the effectiveness of our method by achieving state-of-the-art accuracy on all datasets.

* 16 pages, 7 figures

Zero-shot Text-guided Infinite Image Synthesis with LLM guidance

Jul 17, 2024

Soyeong Kwon, Taegyeong Lee, Taehwan Kim

Abstract:Text-guided image editing and generation methods have diverse real-world applications. However, text-guided infinite image synthesis faces several challenges. First, there is a lack of text-image paired datasets with high-resolution and contextual diversity. Second, expanding images based on text requires global coherence and rich local context understanding. Previous studies have mainly focused on limited categories, such as natural landscapes, and also required to train on high-resolution images with paired text. To address these challenges, we propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding, without any high-resolution text-image paired training dataset. We train the diffusion model to expand an image conditioned on global and local captions generated from the LLM and visual feature. At the inference stage, given an image and a global caption, we use the LLM to generate a next local caption to expand the input image. Then, we expand the image using the global caption, generated local caption and the visual feature to consider global consistency and spatial local context. In experiments, our model outperforms the baselines both quantitatively and qualitatively. Furthermore, our model demonstrates the capability of text-guided arbitrary-sized image generation in zero-shot manner with LLM guidance.

* Accepted to ECCV 2024

Grid Diffusion Models for Text-to-Video Generation

Mar 30, 2024

Taegyeong Lee, Soyeong Kwon, Taehwan Kim

Abstract:Recent advances in the diffusion models have significantly improved text-to-image generation. However, generating videos from text is a more challenging task than generating images from text, due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and are limited in terms of computational costs compared to text-to-image generation. To tackle these challenges, we propose a simple but effective novel grid diffusion for text-to-video generation without temporal dimension in architecture and a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally, since our method reduces the dimensions of the video to the dimensions of the image, various image-based methods can be applied to videos, such as text-guided video manipulation from image manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations, demonstrating the suitability of our model for real-world video generation.

* Accepted to CVPR 2024
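
Representing a video as a grid image, as the abstract describes, can be pictured with a small sketch. It only shows the tiling and its inverse, not the diffusion model; the row-major layout and the helper names `frames_to_grid` and `grid_to_frames` are assumptions made for illustration, and the paper may arrange frames differently.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile a (T, H, W, C) clip into one (rows*H, cols*W, C) image, row-major."""
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError("rows * cols must equal the number of frames")
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Invert frames_to_grid: cut the grid image back into (T, H, W, C) frames."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size must be divisible by rows and cols")
    h, w = gh // rows, gw // cols
    frames = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return frames.reshape(rows * cols, h, w, c)


# Toy clip: 4 frames of 2x3 pixels with 1 channel.
clip = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
grid_image = frames_to_grid(clip, rows=2, cols=2)
restored = grid_to_frames(grid_image, rows=2, cols=2)
```

Because the grid is an ordinary image, an image model can in principle process it without a temporal axis, and the frames are recovered afterwards by slicing.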

Sound of Story: Multi-modal Storytelling with Audio

Oct 30, 2023

Jaeyeon Bae, Seokhoon Jeong, Seokun Kang, Namgi Han, Jae-Yon Lee, Hyounghun Kim, Taehwan Kim

Abstract:Storytelling is multi-modal in the real world. When one tells a story, one may use all of the visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend story understanding and telling areas by establishing a new component called "background sound" which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called "Sound of Story (SoS)", which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound and the dataset, we propose retrieval tasks between modalities, and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. Downloading the dataset and baseline codes for each task will be released in the link: https://github.com/Sosdatasets/SoS_Dataset.

* Findings of EMNLP 2023, project: https://github.com/Sosdatasets/SoS_Dataset/

Effective Slogan Generation with Noise Perturbation

Oct 12, 2023

Jongeun Kim, MinChung Kim, Taehwan Kim

Abstract:Slogans play a crucial role in building a firm's brand identity. A slogan is expected to reflect the firm's vision and the brand's value propositions in memorable and likeable ways. Automating the generation of slogans with such characteristics is challenging. Previous studies developed and tested slogan generation with syntactic control and summarization models, which are not capable of generating distinctive slogans. We introduce a novel approach that leverages a pre-trained T5 transformer model with noise perturbation on a newly proposed 1:N matching pair dataset. This approach serves as a contributing factor in generating distinctive and coherent slogans. Furthermore, the proposed approach incorporates descriptions of the firm and brand into the generation of slogans. We evaluate generated slogans based on ROUGE1, ROUGEL and Cosine Similarity metrics and also assess them with human subjects in terms of the slogans' distinctiveness, coherence, and fluency. The results demonstrate that our approach yields better performance than baseline models and other transformer-based models.

* Accepted in CIKM 2023 short paper https://github.com/joannekim0420/SloganGeneration
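
For readers unfamiliar with the ROUGE1 and ROUGEL metrics named in the abstract, the sketch below computes simplified versions of both. It assumes lowercase whitespace tokenization and a plain F1 score, which may differ from the evaluation script the authors used; the function names and the example slogans are hypothetical.

```python
from collections import Counter


def _f1(matches: int, cand_len: int, ref_len: int) -> float:
    """Harmonic mean of precision and recall, 0.0 when nothing matches."""
    if matches == 0:
        return 0.0
    precision, recall = matches / cand_len, matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram overlap (clipped counts) between candidate and reference."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f1(overlap, len(cand), len(ref))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Longest-common-subsequence overlap, so word order matters."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(lcs_length(cand, ref), len(cand), len(ref))


# Same words, different order: ROUGE-1 is unaffected, ROUGE-L drops.
same_order = rouge_l_f1("just do it now", "just do it")
shuffled_r1 = rouge1_f1("it do just", "just do it")
shuffled_rl = rouge_l_f1("it do just", "just do it")
```

ROUGE-1 ignores word order while ROUGE-L rewards it, which is one reason the two are often reported together.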

Generating Realistic Images from In-the-wild Sounds

Sep 05, 2023

Taegyeong Lee, Jeonghun Kang, Hyeonyu Kim, Taehwan Kim

Abstract:Representing wild sounds as images is an important but challenging task due to the lack of paired datasets between sound and images and the significant differences in the characteristics of these two modalities. Previous studies have focused on generating images from sound in limited categories or music. In this paper, we propose a novel approach to generate images from in-the-wild sounds. First, we convert sound into text using audio captioning. Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound. Lastly, we propose a direct sound optimization with CLIPscore and AudioCLIP and generate images with a diffusion-based model. Experiments show that our model is able to generate high-quality images from wild sounds and outperforms baselines in both quantitative and qualitative evaluations on wild audio datasets.

* Accepted to ICCV 2023

Technical Report for CVPR 2022 LOVEU AQTC Challenge

Jun 29, 2022

Hyeonyu Kim, Jongeun Kim, Jeonghun Kang, Sanguk Park, Dongchan Park, Taehwan Kim

Abstract:This technical report presents the second-place model for AQTC, a task newly introduced in the CVPR 2022 LOng-form VidEo Understanding (LOVEU) challenges. This challenge faces difficulties with multi-step answers, multiple modalities, and diverse and changing button representations in video. We address this problem by proposing a new context ground module attention mechanism for more effective feature mapping. In addition, we analyze the effect of the number of buttons and perform an ablation study of different step networks and video features. As a result, we achieved 2nd place overall in LOVEU competition track 3, and 1st place in two out of four evaluation metrics. Our code is available at https://github.com/jaykim9870/CVPR-22_LOVEU_unipyler.

* 4 pages, 3 figures, technical report for track3 of CVPR 2022 LOVEU challenge

Understanding Beauty via Deep Facial Features

Apr 17, 2019

Xudong Liu, Tao Li, Hao Peng, Iris Chuoying Ouyang, Taehwan Kim, Ruizhe Wang

Abstract:The concept of beauty has been debated by philosophers and psychologists for centuries, but most definitions are subjective and metaphysical, and deficient in accuracy, generality, and scalability. In this paper, we present a novel study on mining beauty semantics of facial attributes based on big data, with an attempt to objectively construct descriptions of beauty in a quantitative manner. We first deploy a deep convolutional neural network (CNN) to extract facial attributes, and then investigate correlations between these features and attractiveness on two large-scale datasets labelled with beauty scores. Not only do we discover the secrets of beauty verified by statistical significance tests, our findings also align perfectly with existing psychological studies that, e.g., small nose, high cheekbones, and femininity contribute to attractiveness. We further apply these high-level representations to original images via a generative adversarial network (GAN). Beauty enhancements after synthesis are visually compelling and statistically convincing, as verified by a user survey of 10,000 data points.
