Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fu-En Yang

VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Mar 27, 2025

Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung, Kai-Po Chang, Fu-En Yang, Yu-Chiang Frank Wang

Abstract:Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.

* CVPR 2025. Project Page: https://jasper0314-huang.github.io/videomage-customization

Via

Access Paper or Ask Questions

MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching

Feb 18, 2025

Yen-Siang Wu, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

Abstract:Text-to-video (T2V) diffusion models have shown promising capabilities in synthesizing realistic videos from input text prompts. However, the input text description alone provides limited control over the precise objects movements and camera framing. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. While most existing methods choose to fine-tune pre-trained diffusion models to reconstruct the frame differences of the reference video, we observe that such strategy suffer from content leakage from the reference video, and they cannot capture complex motion accurately. To address this issue, we propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level. Instead of using pixel-level objectives, MotionMatcher compares high-level, spatio-temporal motion features to fine-tune diffusion models, ensuring precise motion learning. For the sake of memory efficiency and accessibility, we utilize a pre-trained T2V diffusion model, which contains considerable prior knowledge about video motion, to compute these motion features. In our experiments, we demonstrate state-of-the-art motion customization performances, validating the design of our framework.

* Project page: https://www.csie.ntu.edu.tw/~b09902097/motionmatcher/

Via

Access Paper or Ask Questions

PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

Nov 24, 2024

Chia-Ming Lee, Ching-Heng Cheng, Yu-Fan Lin, Yi-Ching Cheng, Wo-Ting Liao, Chih-Chung Hsu, Fu-En Yang, Yu-Chiang Frank Wang

Figure 1 for PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

Figure 2 for PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

Figure 3 for PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

Figure 4 for PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation

Abstract:Recent developments in All-in-One (AiO) RGB image restoration and prompt learning have enabled the representation of distinct degradations through prompts, allowing degraded images to be effectively addressed by a single restoration model. However, this paradigm faces significant challenges when transferring to hyperspectral image (HSI) restoration tasks due to: 1) the domain gap between RGB and HSI features and difference on their structures, 2) information loss in visual prompts under severe composite degradations, and 3) difficulties in capturing HSI-specific degradation representations through text prompts. To address these challenges, we propose PromptHSI, the first universal AiO HSI restoration framework. By leveraging the frequency-aware feature modulation based on characteristics of HSI degradations, we decompose text prompts into intensity and bias controllers to effectively guide the restoration process while avoiding domain gaps. Our unified architecture excels at both fine-grained recovery and global information restoration tasks. Experimental results demonstrate superior performance under various degradation combinations, indicating great potential for practical remote sensing applications. The source code and dataset will be publicly released.

* 11 pages, 8 figures

Via

Access Paper or Ask Questions

Select and Distill: Selective Dual-Teacher Knowledge Transfer for Continual Learning on Vision-Language Models

Mar 14, 2024

Yu-Chu Yu, Chi-Pin Huang, Jr-Jen Chen, Kai-Po Chang, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang

Abstract:Large-scale vision-language models (VLMs) have shown a strong zero-shot generalization capability on unseen-domain data. However, when adapting pre-trained VLMs to a sequence of downstream tasks, they are prone to forgetting previously learned knowledge and degrade their zero-shot classification capability. To tackle this problem, we propose a unique Selective Dual-Teacher Knowledge Transfer framework that leverages the most recent fine-tuned and the original pre-trained VLMs as dual teachers to preserve the previously learned knowledge and zero-shot capabilities, respectively. With only access to an unlabeled reference dataset, our proposed framework performs a selective knowledge distillation mechanism by measuring the feature discrepancy from the dual teacher VLMs. Consequently, our selective dual-teacher knowledge distillation would mitigate catastrophic forgetting of previously learned knowledge while preserving the zero-shot capabilities from pre-trained VLMs. Through extensive experiments on benchmark datasets, we show that our proposed framework is favorable against state-of-the-art continual learning approaches for preventing catastrophic forgetting and zero-shot degradation.

Via

Access Paper or Ask Questions

Language-Guided Transformer for Federated Multi-Label Classification

Dec 12, 2023

I-Jieh Liu, Ci-Siang Lin, Fu-En Yang, Yu-Chiang Frank Wang

Abstract:Federated Learning (FL) is an emerging paradigm that enables multiple users to collaboratively train a robust model in a privacy-preserving manner without sharing their private data. Most existing approaches of FL only consider traditional single-label image classification, ignoring the impact when transferring the task to multi-label image classification. Nevertheless, it is still challenging for FL to deal with user heterogeneity in their local data distribution in the real-world FL scenario, and this issue becomes even more severe in multi-label image classification. Inspired by the recent success of Transformers in centralized settings, we propose a novel FL framework for multi-label classification. Since partial label correlation may be observed by local clients during training, direct aggregation of locally updated models would not produce satisfactory performances. Thus, we propose a novel FL framework of Language-Guided Transformer (FedLGT) to tackle this challenging task, which aims to exploit and transfer knowledge across different clients for learning a robust global model. Through extensive experiments on various multi-label datasets (e.g., FLAIR, MS-COCO, etc.), we show that our FedLGT is able to achieve satisfactory performance and outperforms standard FL techniques under multi-label FL scenarios. Code is available at https://github.com/Jack24658735/FedLGT.

* Accepted by AAAI 2024

Via

Access Paper or Ask Questions

Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation

Aug 29, 2023

Fu-En Yang, Chien-Yi Wang, Yu-Chiang Frank Wang

Abstract:Federated learning (FL) emerges as a decentralized learning framework which trains models from multiple distributed clients without sharing their data to preserve privacy. Recently, large-scale pre-trained models (e.g., Vision Transformer) have shown a strong capability of deriving robust representations. However, the data heterogeneity among clients, the limited computation resources, and the communication bandwidth restrict the deployment of large-scale models in FL frameworks. To leverage robust representations from large-scale models while enabling efficient model personalization for heterogeneous clients, we propose a novel personalized FL framework of client-specific Prompt Generation (pFedPG), which learns to deploy a personalized prompt generator at the server for producing client-specific visual prompts that efficiently adapts frozen backbones to local data distributions. Our proposed framework jointly optimizes the stages of personalized prompt adaptation locally and personalized prompt generation globally. The former aims to train visual prompts that adapt foundation models to each client, while the latter observes local optimization directions to generate personalized prompts for all clients. Through extensive experiments on benchmark datasets, we show that our pFedPG is favorable against state-of-the-art personalized FL methods under various types of data heterogeneity, allowing computation and communication efficient model personalization.

* Accepted by ICCV 2023

Via

Access Paper or Ask Questions

TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators

Feb 19, 2023

Yuan-Chia Cheng, Zu-Yun Shiau, Fu-En Yang, Yu-Chiang Frank Wang

Figure 1 for TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators

Figure 2 for TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators

Figure 3 for TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators

Figure 4 for TAX: Tendency-and-Assignment Explainer for Semantic Segmentation with Multi-Annotators

Abstract:To understand how deep neural networks perform classification predictions, recent research attention has been focusing on developing techniques to offer desirable explanations. However, most existing methods cannot be easily applied for semantic segmentation; moreover, they are not designed to offer interpretability under the multi-annotator setting. Instead of viewing ground-truth pixel-level labels annotated by a single annotator with consistent labeling tendency, we aim at providing interpretable semantic segmentation and answer two critical yet practical questions: "who" contributes to the resulting segmentation, and "why" such an assignment is determined. In this paper, we present a learning framework of Tendency-and-Assignment Explainer (TAX), designed to offer interpretability at the annotator and assignment levels. More specifically, we learn convolution kernel subsets for modeling labeling tendencies of each type of annotation, while a prototype bank is jointly observed to offer visual guidance for learning the above kernels. For evaluation, we consider both synthetic and real-world datasets with multi-annotators. We show that our TAX can be applied to state-of-the-art network architectures with comparable performances, while segmentation interpretability at both levels can be offered accordingly.

* 10 pages, 5 figures

Via

Access Paper or Ask Questions

Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Aug 30, 2022

Cheng-Yen Hsieh, Chih-Jung Chang, Fu-En Yang, Yu-Chiang Frank Wang

Figure 1 for Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Figure 2 for Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Figure 3 for Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Figure 4 for Self-Supervised Pyramid Representation Learning for Multi-Label Visual Analysis and Beyond

Abstract:While self-supervised learning has been shown to benefit a number of vision tasks, existing techniques mainly focus on image-level manipulation, which may not generalize well to downstream tasks at patch or pixel levels. Moreover, existing SSL methods might not sufficiently describe and associate the above representations within and across image scales. In this paper, we propose a Self-Supervised Pyramid Representation Learning (SS-PRL) framework. The proposed SS-PRL is designed to derive pyramid representations at patch levels via learning proper prototypes, with additional learners to observe and relate inherent semantic information within an image. In particular, we present a cross-scale patch-level correlation learning in SS-PRL, which allows the model to aggregate and associate information learned across patch scales. We show that, with our proposed SS-PRL for model pre-training, one can easily adapt and fine-tune the models for a variety of applications including multi-label classification, object detection, and instance segmentation.

* IEEE WACV 2023, Github: https://github.com/WesleyHsieh0806/SS-PRL

Via

Access Paper or Ask Questions

Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains

Dec 27, 2021

Yuan-Chia Cheng, Ci-Siang Lin, Fu-En Yang, Yu-Chiang Frank Wang

Figure 1 for Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains

Figure 2 for Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains

Figure 3 for Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains

Figure 4 for Few-Shot Classification in Unseen Domains by Episodic Meta-Learning Across Visual Domains

Abstract:Few-shot classification aims to carry out classification given only few labeled examples for the categories of interest. Though several approaches have been proposed, most existing few-shot learning (FSL) models assume that base and novel classes are drawn from the same data domain. When it comes to recognizing novel-class data in an unseen domain, this becomes an even more challenging task of domain generalized few-shot classification. In this paper, we present a unique learning framework for domain-generalized few-shot classification, where base classes are from homogeneous multiple source domains, while novel classes to be recognized are from target domains which are not seen during training. By advancing meta-learning strategies, our learning framework exploits data across multiple source domains to capture domain-invariant features, with FSL ability introduced by metric-learning based mechanisms across support and query data. We conduct extensive experiments to verify the effectiveness of our proposed learning framework and show learning from small yet homogeneous source data is able to perform preferably against learning from large-scale one. Moreover, we provide insights into choices of backbone models for domain-generalized few-shot classification.

* Accepted by ICIP 2021

Via

Access Paper or Ask Questions

A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation

Nov 02, 2021

Yuan-Hao Lee, Fu-En Yang, Yu-Chiang Frank Wang

Figure 1 for A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation

Figure 2 for A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation

Figure 3 for A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation

Figure 4 for A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation

Abstract:Few-shot semantic segmentation addresses the learning task in which only few images with ground truth pixel-level labels are available for the novel classes of interest. One is typically required to collect a large mount of data (i.e., base classes) with such ground truth information, followed by meta-learning strategies to address the above learning task. When only image-level semantic labels can be observed during both training and testing, it is considered as an even more challenging task of weakly supervised few-shot semantic segmentation. To address this problem, we propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels. More importantly, our learning scheme further exploits the produced pixel-level information for query image inputs with segmentation guarantees. Thus, our proposed learning model can be viewed as a pixel-level meta-learner. Through extensive experiments on benchmark datasets, we show that our model achieves satisfactory performances under fully supervised settings, yet performs favorably against state-of-the-art methods under weakly supervised settings.

* Accepted to WACV 2022

Via

Access Paper or Ask Questions