Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Seunghoon Lee

Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Mar 05, 2025

Suhwan Cho, Seunghoon Lee, Minhyeok Lee, Jungho Lee, Sangyoun Lee

Figure 1 for Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Figure 2 for Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Figure 3 for Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Figure 4 for Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

Abstract:Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, a novel decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. We demonstrate that FindTrack outperforms existing methods on public benchmarks.

Via

Access Paper or Ask Questions

Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Jul 09, 2024

Dogyoon Lee, Donghyeong Kim, Jungho Lee, Minhyeok Lee, Seunghoon Lee, Sangyoun Lee

Figure 1 for Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Figure 2 for Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Figure 3 for Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Figure 4 for Sparse-DeRF: Deblurred Neural Radiance Fields from Sparse View

Abstract:Recent studies construct deblurred neural radiance fields (DeRF) using dozens of blurry images, which are not practical scenarios if only a limited number of blurry images are available. This paper focuses on constructing DeRF from sparse-view for more pragmatic real-world scenarios. As observed in our experiments, establishing DeRF from sparse views proves to be a more challenging problem due to the inherent complexity arising from the simultaneous optimization of blur kernels and NeRF from sparse view. Sparse-DeRF successfully regularizes the complicated joint optimization, presenting alleviated overfitting artifacts and enhanced quality on radiance fields. The regularization consists of three key components: Surface smoothness, helps the model accurately predict the scene structure utilizing unseen and additional hidden rays derived from the blur kernel based on statistical tendencies of real-world; Modulated gradient scaling, helps the model adjust the amount of the backpropagated gradient according to the arrangements of scene objects; Perceptual distillation improves the perceptual quality by overcoming the ill-posed multi-view inconsistency of image deblurring and distilling the pre-filtered information, compensating for the lack of clean information in blurry images. We demonstrate the effectiveness of the Sparse-DeRF with extensive quantitative and qualitative experimental results by training DeRF from 2-view, 4-view, and 6-view blurry images.

* Project page: https://dogyoonlee.github.io/sparsederf/

Via

Access Paper or Ask Questions

TSANET: Temporal and Scale Alignment for Unsupervised Video Object Segmentation

Mar 08, 2023

Seunghoon Lee, Suhwan Cho, Dogyoon Lee, Minhyeok Lee, Sangyoun Lee

Abstract:Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance. In other words, the network detects the accurate region of the target object in a sequence of RGB frames without prior knowledge. In recent works, two approaches for UVOS have been discussed that can be divided into: appearance and appearance-motion based methods. Appearance based methods utilize the correlation information of inter-frames to capture target object that commonly appears in a sequence. However, these methods does not consider the motion of target object due to exploit the correlation information between randomly paired frames. Appearance-motion based methods, on the other hand, fuse the appearance features from RGB frames with the motion features from optical flow. Motion cue provides useful information since salient objects typically show distinctive motion in a sequence. However, these approaches have the limitation that the dependency on optical flow is dominant. In this paper, we propose a novel framework for UVOS that can address aforementioned limitations of two approaches in terms of both time and scale. Temporal Alignment Fusion aligns the saliency information of adjacent frames with the target frame to leverage the information of adjacent frames. Scale Alignment Decoder predicts the target object mask precisely by aggregating differently scaled feature maps via continuous mapping with implicit neural representation. We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method. Furthermore, we outperform the state-of-the-art methods on DAVIS 2016.

Via

Access Paper or Ask Questions

Domain Alignment and Temporal Aggregation for Unsupervised Video Object Segmentation

Nov 22, 2022

Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Sangyoun Lee

Abstract:Unsupervised video object segmentation aims at detecting and segmenting the most salient object in videos. In recent times, two-stream approaches that collaboratively leverage appearance cues and motion cues have attracted extensive attention thanks to their powerful performance. However, there are two limitations faced by those methods: 1) the domain gap between appearance and motion information is not well considered; and 2) long-term temporal coherence within a video sequence is not exploited. To overcome these limitations, we propose a domain alignment module (DAM) and a temporal aggregation module (TAM). DAM resolves the domain gap between two modalities by forcing the values to be in the same range using a cross-correlation mechanism. TAM captures long-term coherence by extracting and leveraging global cues of a video. On public benchmark datasets, our proposed approach demonstrates its effectiveness, outperforming all existing methods by a substantial margin.

Via

Access Paper or Ask Questions

Unsupervised Video Object Segmentation via Prototype Memory Network

Sep 08, 2022

Minhyeok Lee, Suhwan Cho, Seunghoon Lee, Chaewon Park, Sangyoun Lee

Figure 1 for Unsupervised Video Object Segmentation via Prototype Memory Network

Figure 2 for Unsupervised Video Object Segmentation via Prototype Memory Network

Figure 3 for Unsupervised Video Object Segmentation via Prototype Memory Network

Figure 4 for Unsupervised Video Object Segmentation via Prototype Memory Network

Abstract:Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame. This challenging task requires extracting features for the most salient common objects within a video sequence. This difficulty can be solved by using motion information such as optical flow, but using only the information between adjacent frames results in poor connectivity between distant frames and poor performance. To solve this problem, we propose a novel prototype memory network architecture. The proposed model effectively extracts the RGB and motion information by extracting superpixel-based component prototypes from the input RGB images and optical flow maps. In addition, the model scores the usefulness of the component prototypes in each frame based on a self-learning algorithm and adaptively stores the most useful prototypes in memory and discards obsolete prototypes. We use the prototypes in the memory bank to predict the next query frames mask, which enhances the association between distant frames to help with accurate mask prediction. Our method is evaluated on three datasets, achieving state-of-the-art performance. We prove the effectiveness of the proposed model with various ablation studies.

* Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023

Via

Access Paper or Ask Questions

Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Sep 04, 2022

Suhwan Cho, Minhyeok Lee, Seunghoon Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

Figure 1 for Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Figure 2 for Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Figure 3 for Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Figure 4 for Treating Motion as Option to Reduce Motion Dependency in Unsupervised Video Object Segmentation

Abstract:Unsupervised video object segmentation (VOS) aims to detect the most salient object in a video sequence at the pixel level. In unsupervised VOS, most state-of-the-art methods leverage motion cues obtained from optical flow maps in addition to appearance cues to exploit the property that salient objects usually have distinctive movements compared to the background. However, as they are overly dependent on motion cues, which may be unreliable in some cases, they cannot achieve stable prediction. To reduce this motion dependency of existing two-stream VOS methods, we propose a novel motion-as-option network that optionally utilizes motion cues. Additionally, to fully exploit the property of the proposed network that motion is not always required, we introduce a collaborative network learning strategy. On all the public benchmark datasets, our proposed network affords state-of-the-art performance with real-time inference speed.

Via

Access Paper or Ask Questions

Pixel-Level Equalized Matching for Video Object Segmentation

Sep 04, 2022

Suhwan Cho, Woo Jin Kim, MyeongAh Cho, Seunghoon Lee, Minhyeok Lee, Chaewon Park, Sangyoun Lee

Figure 1 for Pixel-Level Equalized Matching for Video Object Segmentation

Figure 2 for Pixel-Level Equalized Matching for Video Object Segmentation

Figure 3 for Pixel-Level Equalized Matching for Video Object Segmentation

Figure 4 for Pixel-Level Equalized Matching for Video Object Segmentation

Abstract:Feature similarity matching, which transfers the information of the reference frame to the query frame, is a key component in semi-supervised video object segmentation. If surjective matching is adopted, background distractors can easily occur and degrade the performance. Bijective matching mechanisms try to prevent this by restricting the amount of information being transferred to the query frame, but have two limitations: 1) surjective matching cannot be fully leveraged as it is transformed to bijective matching at test time; and 2) test-time manual tuning is required for searching the optimal hyper-parameters. To overcome these limitations while ensuring reliable information transfer, we introduce an equalized matching mechanism. To prevent the reference frame information from being overly referenced, the potential contribution to the query frame is equalized by simply applying a softmax operation along with the query. On public benchmark datasets, our proposed approach achieves a comparable performance to state-of-the-art methods.

Via

Access Paper or Ask Questions

Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Sep 14, 2021

Chanho Park, Seunghoon Lee, Namyoon Lee

Figure 1 for Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Figure 2 for Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Figure 3 for Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Figure 4 for Bayesian AirComp with Sign-Alignment Precoding for Wireless Federated Learning

Abstract:In this paper, we consider the problem of wireless federated learning based on sign stochastic gradient descent (signSGD) algorithm via a multiple access channel. When sending locally computed gradient's sign information, each mobile device requires to apply precoding to circumvent wireless fading effects. In practice, however, acquiring perfect knowledge of channel state information (CSI) at all mobile devices is infeasible. In this paper, we present a simple yet effective precoding method with limited channel knowledge, called sign-alignment precoding. The idea of sign-alignment precoding is to protect sign-flipping errors from wireless fadings. Under the Gaussian prior assumption on the local gradients, we also derive the mean squared error (MSE)-optimal aggregation function called Bayesian over-the-air computation (BayAirComp). Our key finding is that one-bit precoding with BayAirComp aggregation can provide a better learning performance than the existing precoding method even using perfect CSI with AirComp aggregation.

* This paper is 8 pages long, and has 4 figures. This paper is the extended version of the conference paper which is accepted in 2021 IEEE GlobeCom

Via

Access Paper or Ask Questions

Bayesian Federated Learning over Wireless Networks

Dec 31, 2020

Seunghoon Lee, Chanho Park, Song-Nam Hong, Yonina C. Eldar, Namyoon Lee

Figure 1 for Bayesian Federated Learning over Wireless Networks

Figure 2 for Bayesian Federated Learning over Wireless Networks

Figure 3 for Bayesian Federated Learning over Wireless Networks

Figure 4 for Bayesian Federated Learning over Wireless Networks

Abstract:Federated learning is a privacy-preserving and distributed training method using heterogeneous data sets stored at local devices. Federated learning over wireless networks requires aggregating locally computed gradients at a server where the mobile devices send statistically distinct gradient information over heterogenous communication links. This paper proposes a Bayesian federated learning (BFL) algorithm to aggregate the heterogeneous quantized gradient information optimally in the sense of minimizing the mean-squared error (MSE). The idea of BFL is to aggregate the one-bit quantized local gradients at the server by jointly exploiting i) the prior distributions of the local gradients, ii) the gradient quantizer function, and iii) channel distributions. Implementing BFL requires high communication and computational costs as the number of mobile devices increases. To address this challenge, we also present an efficient modified BFL algorithm called scalable-BFL (SBFL). In SBFL, we assume a simplified distribution on the local gradient. Each mobile device sends its one-bit quantized local gradient together with two scalar parameters representing this distribution. The server then aggregates the noisy and faded quantized gradients to minimize the MSE. We provide a convergence analysis of SBFL for a class of non-convex loss functions. Our analysis elucidates how the parameters of communication channels and the gradient priors affect convergence. From simulations, we demonstrate that SBFL considerably outperforms the conventional sign stochastic gradient descent algorithm when training and testing neural networks using MNIST data sets over heterogeneous wireless networks.

* 30 pages, 7 figures, submitted to IEEE Journal on Selected Areas in Communications

Via

Access Paper or Ask Questions