Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Goutam Bhat

M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

May 22, 2025

Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari

Abstract:We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for discoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second placed method.

Via

Access Paper or Ask Questions

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Dec 03, 2024

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu

Abstract:Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instruction. Large language models (LLMs) struggle with generating physically plausible 3D scenes and adherence to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance.

* project website: https://ai.stanford.edu/~sunfanyun/layoutvlm/

Via

Access Paper or Ask Questions

Fast Hierarchical Learning for Few-Shot Object Detection

Oct 10, 2022

Yihang She, Goutam Bhat, Martin Danelljan, Fisher Yu

Figure 1 for Fast Hierarchical Learning for Few-Shot Object Detection

Figure 2 for Fast Hierarchical Learning for Few-Shot Object Detection

Figure 3 for Fast Hierarchical Learning for Few-Shot Object Detection

Figure 4 for Fast Hierarchical Learning for Few-Shot Object Detection

Abstract:Transfer learning based approaches have recently achieved promising results on the few-shot detection task. These approaches however suffer from ``catastrophic forgetting'' issue due to finetuning of base detector, leading to sub-optimal performance on the base classes. Furthermore, the slow convergence rate of stochastic gradient descent (SGD) results in high latency and consequently restricts real-time applications. We tackle the aforementioned issues in this work. We pose few-shot detection as a hierarchical learning problem, where the novel classes are treated as the child classes of existing base classes and the background class. The detection heads for the novel classes are then trained using a specialized optimization strategy, leading to significantly lower training times compared to SGD. Our approach obtains competitive novel class performance on few-shot MS-COCO benchmark, while completely retaining the performance of the initial model on the base classes. We further demonstrate the application of our approach to a new class-refined few-shot detection task.

* 8 pages, 5 figures, accepted by IROS2022

Via

Access Paper or Ask Questions

Transforming Model Prediction for Tracking

Mar 21, 2022

Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, Luc Van Gool

Figure 1 for Transforming Model Prediction for Tracking

Figure 2 for Transforming Model Prediction for Tracking

Figure 3 for Transforming Model Prediction for Tracking

Figure 4 for Transforming Model Prediction for Tracking

Abstract:Optimization based tracking methods have been widely successful by integrating a target model prediction module, providing effective global reasoning by minimizing an objective function. While this inductive bias integrates valuable domain knowledge, it limits the expressivity of the tracking network. In this work, we therefore propose a tracker architecture employing a Transformer-based model prediction module. Transformers capture global relations with little inductive bias, allowing it to learn the prediction of more powerful target models. We further extend the model predictor to estimate a second set of weights that are applied for accurate bounding box regression. The resulting tracker relies on training and on test frame information in order to predict all weights transductively. We train the proposed tracker end-to-end and validate its performance by conducting comprehensive experiments on multiple tracking datasets. Our tracker sets a new state of the art on three benchmarks, achieving an AUC of 68.5% on the challenging LaSOT dataset.

* Accepted at CVPR 2022. The code and trained models are available at https://github.com/visionml/pytracking

Via

Access Paper or Ask Questions

Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Aug 18, 2021

Goutam Bhat, Martin Danelljan, Fisher Yu, Luc Van Gool, Radu Timofte

Figure 1 for Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Figure 2 for Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Figure 3 for Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Figure 4 for Deep Reparametrization of Multi-Frame Super-Resolution and Denoising

Abstract:We propose a deep reparametrization of the maximum a posteriori formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to directly model the image formation process in the latent space, and to integrate learned image priors into the prediction. Our approach thereby leverages the advantages of deep learning, while also benefiting from the principled multi-frame fusion provided by the classical MAP formulation. We validate our approach through comprehensive experiments on burst denoising and burst super-resolution datasets. Our approach sets a new state-of-the-art for both tasks, demonstrating the generality and effectiveness of the proposed formulation.

* ICCV 2021 Oral

Via

Access Paper or Ask Questions

NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results

Jun 07, 2021

Goutam Bhat, Martin Danelljan, Radu Timofte, Kazutoshi Akita, Wooyeong Cho, Haoqiang Fan, Lanpeng Jia, Daeshik Kim, Bruno Lecouat, Youwei Li(+18 more)

Figure 1 for NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results

Figure 2 for NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results

Figure 3 for NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results

Figure 4 for NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results

Abstract:This paper reviews the NTIRE2021 challenge on burst super-resolution. Given a RAW noisy burst as input, the task in the challenge was to generate a clean RGB image with 4 times higher resolution. The challenge contained two tracks; Track 1 evaluating on synthetically generated data, and Track 2 using real-world bursts from mobile camera. In the final testing phase, 6 teams submitted results using a diverse set of solutions. The top-performing methods set a new state-of-the-art for the burst super-resolution task.

* NTIRE 2021 Burst Super-Resolution challenge report

Via

Access Paper or Ask Questions

Deep Burst Super-Resolution

Jan 26, 2021

Goutam Bhat, Martin Danelljan, Luc Van Gool, Radu Timofte

Figure 1 for Deep Burst Super-Resolution

Figure 2 for Deep Burst Super-Resolution

Figure 3 for Deep Burst Super-Resolution

Figure 4 for Deep Burst Super-Resolution

Abstract:While single-image super-resolution (SISR) has attracted substantial interest in recent years, the proposed approaches are limited to learning image priors in order to add high frequency details. In contrast, multi-frame super-resolution (MFSR) offers the possibility of reconstructing rich details by combining signal information from multiple shifted images. This key advantage, along with the increasing popularity of burst photography, have made MFSR an important problem for real-world applications. We propose a novel architecture for the burst super-resolution task. Our network takes multiple noisy RAW images as input, and generates a denoised, super-resolved RGB image as output. This is achieved by explicitly aligning deep embeddings of the input frames using pixel-wise optical flow. The information from all frames are then adaptively merged using an attention-based fusion module. In order to enable training and evaluation on real-world data, we additionally introduce the BurstSR dataset, consisting of smartphone bursts and high-resolution DSLR ground-truth. We perform comprehensive experimental analysis, demonstrating the effectiveness of the proposed architecture.

Via

Access Paper or Ask Questions

Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos

Jan 06, 2021

Bin Zhao, Goutam Bhat, Martin Danelljan, Luc Van Gool, Radu Timofte

Figure 1 for Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos

Figure 2 for Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos

Figure 3 for Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos

Figure 4 for Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos

Abstract:Segmenting objects in videos is a fundamental computer vision task. The current deep learning based paradigm offers a powerful, but data-hungry solution. However, current datasets are limited by the cost and human effort of annotating object masks in videos. This effectively limits the performance and generalization capabilities of existing video segmentation methods. To address this issue, we explore weaker form of bounding box annotations. We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos. To this end, we propose a spatio-temporal aggregation module that effectively mines consistencies in the object and background appearance across multiple frames. We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks. We generate segmentation masks for large scale tracking datasets, using only their bounding box annotations. The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.

Via

Access Paper or Ask Questions

Know Your Surroundings: Exploiting Scene Information for Object Tracking

May 01, 2020

Goutam Bhat, Martin Danelljan, Luc Van Gool, Radu Timofte

Figure 1 for Know Your Surroundings: Exploiting Scene Information for Object Tracking

Figure 2 for Know Your Surroundings: Exploiting Scene Information for Object Tracking

Figure 3 for Know Your Surroundings: Exploiting Scene Information for Object Tracking

Figure 4 for Know Your Surroundings: Exploiting Scene Information for Object Tracking

Abstract:Current state-of-the-art trackers only rely on a target appearance model in order to localize the object in each frame. Such approaches are however prone to fail in case of e.g. fast appearance changes or presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having the knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions. In this work, we propose a novel tracking architecture which can utilize scene information for tracking. Our tracker represents such information as dense localized state vectors, which can encode, for example, if the local region is target, background, or distractor. These state vectors are propagated through the sequence and combined with the appearance model output to localize the target. Our network is learned to effectively utilize the scene information by directly maximizing tracking performance on video segments. The proposed approach sets a new state-of-the-art on 3 tracking benchmarks, achieving an AO score of 63.6% on the recent GOT-10k dataset.

Via

Access Paper or Ask Questions

Learning What to Learn for Video Object Segmentation

May 01, 2020

Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, Radu Timofte

Figure 1 for Learning What to Learn for Video Object Segmentation

Figure 2 for Learning What to Learn for Video Object Segmentation

Figure 3 for Learning What to Learn for Video Object Segmentation

Figure 4 for Learning What to Learn for Video Object Segmentation

Abstract:Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined during inference with a given first-frame reference mask. The problem of how to capture and utilize this limited target information remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learning module. This internal learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond standard few-shot learning techniques by learning what the few-shot learner should learn. This allows us to achieve a rich internal representation of the target in the current frame, significantly increasing the segmentation accuracy of our approach. We perform extensive experiments on multiple benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result.

* First two authors contributed equally

Via

Access Paper or Ask Questions