Yi-Wen Chen

Generalizable Entity Grounding via Assistance of Large Language Model

Feb 04, 2024

Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang

Abstract: In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.
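
The colormap idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how colors are assigned, so the scheme here (packing an integer entity id into three 8-bit channels) and the names `id_to_color` and `masks_to_colormap` are assumptions made for illustration, not the authors' implementation.

```python
def id_to_color(entity_id):
    """Map an integer entity id to a unique RGB triple (hypothetical scheme)."""
    # Pack the id into 24 bits, one byte per channel; (0, 0, 0) is left for background.
    return ((entity_id >> 16) & 255, (entity_id >> 8) & 255, entity_id & 255)


def masks_to_colormap(masks, height, width):
    """Paint binary masks (lists of rows of 0/1) into a single RGB colormap."""
    colormap = [[(0, 0, 0)] * width for _ in range(height)]
    # Entity ids start at 1 so that no entity collides with the background color.
    for entity_id, mask in enumerate(masks, start=1):
        color = id_to_color(entity_id)
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    colormap[y][x] = color
    return colormap


# Two toy 2x3 entity masks; the pixel at row 1, column 0 belongs to no entity.
masks = [
    [[1, 1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 1, 1]],
]
colormap = masks_to_colormap(masks, height=2, width=3)
```

Each pixel of the resulting map carries the color of the entity that covers it, so a single image-like tensor can stand in for a whole set of masks.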

Text-Driven Image Editing via Learnable Regions

Nov 28, 2023

Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang

Abstract: Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pretrained text-to-image model and introduces a bounding box generator to find the edit regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences, or long paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that align with the language descriptions provided. Our project webpage: https://yuanze-lin.me/LearnableRegions_page.

Video Salient Object Detection via Contrastive Features and Attention Modules

Nov 03, 2021

Yi-Wen Chen, Xiaojie Jin, Xiaohui Shen, Ming-Hsuan Yang

Abstract: Video salient object detection aims to find the most visually distinctive objects in a video. To explore the temporal dependencies, existing methods usually resort to recurrent neural networks or optical flow. However, these approaches incur high computational cost and tend to accumulate inaccuracies over time. In this paper, we propose a network with attention modules to learn contrastive features for video salient object detection without computationally expensive temporal modeling techniques. We develop a non-local self-attention scheme to capture the global information in the video frame. A co-attention formulation is utilized to combine the low-level and high-level features. We further apply contrastive learning to improve the feature representations, where foreground region pairs from the same video are pulled together, and foreground-background region pairs are pushed apart in the latent space. The intra-frame contrastive loss helps separate the foreground and background features, and the inter-frame contrastive loss improves the temporal consistency. We conduct extensive experiments on several benchmark datasets for video salient object detection and unsupervised video object segmentation, and show that the proposed method requires less computation and performs favorably against the state-of-the-art approaches.

* Accepted in WACV 2022
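
The pull-together and push-apart behavior described in the abstract above can be sketched with a generic InfoNCE-style contrastive loss. This is an illustrative stand-in, not the loss defined in the paper: the cosine similarity, the `temperature` value, and the toy feature vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is near the positive and far from negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D features: a foreground anchor, a second foreground region, and background regions.
foreground_a = [1.0, 0.0]
foreground_b = [0.9, 0.1]
background = [[0.0, 1.0], [-1.0, 0.0]]

well_separated = contrastive_loss(foreground_a, foreground_b, background)
poorly_separated = contrastive_loss(foreground_a, background[0], [foreground_b])
```

In this sketch the positive could come from the same frame or from another frame of the same video, which loosely mirrors the intra-frame and inter-frame terms mentioned in the abstract.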

End-to-end Multi-modal Video Temporal Grounding

Jul 12, 2021

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Abstract: We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of a certain event, the performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable by their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.

Understanding Synonymous Referring Expressions via Contrastive Features

Apr 20, 2021

Yi-Wen Chen, Yi-Hsuan Tsai, Ming-Hsuan Yang

Abstract: Referring expression comprehension aims to localize objects identified by natural language descriptions. This is a challenging task as it requires understanding of both visual and language domains. One characteristic of the task is that each object can be described by synonymous sentences with paraphrases, and such variety in language has a critical impact on learning a comprehension model. While prior work usually treats each sentence and attends it to an object separately, we focus on learning a referring expression comprehension model that takes this property of synonymous sentences into account. To this end, we develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels, where features extracted from synonymous sentences that describe the same object should be closer to each other after mapping to the visual domain. We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets, and demonstrate that our method performs favorably against the state-of-the-art approaches. Furthermore, since the variety in expressions grows across datasets when they describe objects in different ways, we present the cross-dataset and transfer learning settings to validate the ability of our learned transferable features.

* Code and models will be available at https://github.com/wenz116/RefContrast

Regularizing Meta-Learning via Gradient Dropout

Apr 13, 2020

Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, Ming-Hsuan Yang

Abstract: With the growing attention on learning-to-learn new tasks using only a few examples, meta-learning has been widely used in numerous problems such as few-shot classification, reinforcement learning, and domain generalization. However, meta-learning models are prone to overfitting when there are insufficient training tasks for the meta-learners to generalize. Although existing approaches such as Dropout are widely used to address the overfitting problem, these methods are typically designed for regularizing models of a single task in supervised training. In this paper, we introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning. Specifically, during the gradient-based adaptation stage, we randomly drop the gradient in the inner-loop optimization of each parameter in deep neural networks, such that the augmented gradients improve generalization to new tasks. We present a general form of the proposed gradient dropout regularization and show that this term can be sampled from either the Bernoulli or Gaussian distribution. To validate the proposed method, we conduct extensive experiments and analysis on numerous computer vision tasks, demonstrating that the gradient dropout regularization mitigates the overfitting problem and improves the performance of various gradient-based meta-learning frameworks.

* Code: https://github.com/hytseng0509/DropGrad
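
The inner-loop gradient dropout described in the abstract above can be sketched in a few lines for the Bernoulli case. This is a toy illustration on plain Python lists, not the code in the linked repository: the function names `drop_gradient` and `inner_step` are hypothetical, and rescaling the kept entries by `1 / (1 - drop_prob)` is an assumption borrowed from standard inverted dropout.

```python
import random


def drop_gradient(grads, drop_prob, rng):
    """Zero each gradient entry with probability drop_prob and rescale the kept entries."""
    keep_prob = 1.0 - drop_prob
    # Assumption: kept entries are divided by keep_prob so the expected gradient is unchanged.
    return [g / keep_prob if rng.random() < keep_prob else 0.0 for g in grads]


def inner_step(params, grads, lr, drop_prob, rng):
    """One inner-loop (task adaptation) SGD step using the randomly dropped gradient."""
    noisy_grads = drop_gradient(grads, drop_prob, rng)
    return [p - lr * g for p, g in zip(params, noisy_grads)]


rng = random.Random(0)
params = [1.0, 2.0]
grads = [0.5, -0.5]

# With drop_prob=0.0 nothing is dropped, so this reduces to a plain SGD step.
plain = inner_step(params, grads, lr=0.1, drop_prob=0.0, rng=rng)
# With drop_prob=0.5 each gradient entry is either zeroed or doubled.
regularized = inner_step(params, grads, lr=0.1, drop_prob=0.5, rng=rng)
```

In a gradient-based meta-learner such as MAML, only the inner-loop update would use the dropped gradient; the outer-loop meta-update is left untouched in this sketch.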

Referring Expression Object Segmentation with Caption-Aware Consistency

Oct 10, 2019

Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin, Ming-Hsuan Yang

Abstract: Referring expressions are natural language descriptions that identify a particular object within a scene and are widely used in our daily conversations. In this work, we focus on segmenting the object in an image specified by a referring expression. To this end, we propose an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains. We introduce spatial-aware dynamic filters to transfer knowledge from text to image, and effectively capture the spatial information of the specified object. To better communicate between the language and visual modules, we employ a caption generation network that takes features shared across both domains as input, and improves both representations via a consistency constraint that encourages the generated sentence to be similar to the given referring expression. We evaluate the proposed framework on two referring expression datasets and show that our method performs favorably against the state-of-the-art algorithms.

* Accepted in BMVC'19, project page at https://github.com/wenz116/lang2seg

Unseen Object Segmentation in Videos via Transferable Representations

Jan 08, 2019

Yi-Wen Chen, Yi-Hsuan Tsai, Chu-Ya Yang, Yen-Yu Lin, Ming-Hsuan Yang

Abstract: In order to learn object segmentation models in videos, conventional methods require a large amount of pixel-wise ground truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer such visual information to segment videos with unseen object categories. Without using any annotations in the target video, we propose a method to jointly mine useful segments and learn feature representations that better adapt to the target frames. The entire process is decomposed into two tasks: 1) solving a submodular function for selecting object-like segments, and 2) learning a CNN model with a transferable module for adapting seen categories in the source domain to the unseen target video. We present an iterative update scheme between the two tasks to self-learn the final solution for object segmentation. Experimental results on numerous benchmark datasets show that the proposed method performs favorably against the state-of-the-art algorithms.

* Accepted in ACCV'18 (oral). Code is available at https://github.com/wenz116/TransferSeg
