Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sayan Deb Sarkar

CrossOver: 3D Scene Cross-Modal Alignment

Feb 20, 2025

Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni

Abstract:Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting adaptability for real-world applications in 3D scene understanding.

* Project Page: sayands.github.io/crossover/

Via

Access Paper or Ask Questions

HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

May 28, 2023

Shantipriya Parida, Idris Abdulmumin, Shamsuddeen Hassan Muhammad, Aneesh Bose, Guneet Singh Kohli, Ibrahim Said Ahmad, Ketan Kotwal, Sayan Deb Sarkar, Ondřej Bojar, Habeebah Adamu Kakudi

Figure 1 for HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

Figure 2 for HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

Figure 3 for HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

Figure 4 for HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

Abstract:This paper presents HaVQA, the first multimodal dataset for visual question-answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, text-only and multimodal machine translation.

* Accepted at ACL 2023 as a long paper (Findings)

Via

Access Paper or Ask Questions

SGAligner : 3D Scene Alignment with Scene Graphs

Apr 28, 2023

Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni

Abstract:Building 3D scene graphs has recently emerged as a topic in scene representation for several embodied AI applications to represent the world in a structured and rich manner. With their increased use in solving downstream tasks (eg, navigation and room rearrangement), can we leverage and recycle them for creating 3D maps of environments, a pivotal step in agent operation? We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial and can contain arbitrary changes. We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (ie, unknown overlap -- if any -- and changes in the environment). We get inspired by multi-modality knowledge graphs and use contrastive learning to learn a joint, multi-modal embedding space. We evaluate on the 3RScan dataset and further showcase that our method can be used for estimating the transformation between pairs of 3D scenes. Since benchmarks for these tasks are missing, we create them on this dataset. The code, benchmark, and trained models are available on the project website.

Via

Access Paper or Ask Questions

HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset

Jul 02, 2021

Shreyas Hampali, Sayan Deb Sarkar, Vincent Lepetit

Figure 1 for HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset

Figure 2 for HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset

Figure 3 for HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset

Figure 4 for HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset

Abstract:HO-3D is a dataset providing image sequences of various hand-object interaction scenarios annotated with the 3D pose of the hand and the object and was originally introduced as HO-3D_v2. The annotations were obtained automatically using an optimization method, 'HOnnotate', introduced in the original paper. HO-3D_v3 provides more accurate annotations for both the hand and object poses thus resulting in better estimates of contact regions between the hand and the object. In this report, we elaborate on the improvements to the HOnnotate method and provide evaluations to compare the accuracy of HO-3D_v2 and HO-3D_v3. HO-3D_v3 results in 4mm higher accuracy compared to HO-3D_v2 for hand poses while exhibiting higher contact regions with the object surface.

Via

Access Paper or Ask Questions

HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

Apr 29, 2021

Shreyas Hampali, Sayan Deb Sarkar, Mahdi Rad, Vincent Lepetit

Figure 1 for HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

Figure 2 for HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

Figure 3 for HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

Figure 4 for HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

Abstract:We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap. We do not require that all locations correctly correspond to a joint, not that all the joints are detected. We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints and output the 3D poses of both hands. Our approach thus allies the recognition power of a Transformer to the accuracy of heatmap-based methods. We also show it can be extended to estimate the 3D pose of an object manipulated by one or two hands. We evaluate our approach on the recent and challenging InterHand2.6M and HO-3D datasets. We obtain 17% improvement over the baseline. Moreover, we introduce the first dataset made of action sequences of two hands manipulating an object fully annotated in 3D and will make it publicly available.

Via

Access Paper or Ask Questions

Monte Carlo Scene Search for 3D Scene Understanding

Mar 30, 2021

Shreyas Hampali, Sinisa Stekovic, Sayan Deb Sarkar, Chetan Srinivasa Kumar, Friedrich Fraundorfer, Vincent Lepetit

Figure 1 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 2 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 3 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 4 for Monte Carlo Scene Search for 3D Scene Understanding

Abstract:We explore how a general AI algorithm can be used for 3D scene understanding to reduce the need for training data. More exactly, we propose a modification of the Monte Carlo Tree Search (MCTS) algorithm to retrieve objects and room layouts from noisy RGB-D scans. While MCTS was developed as a game-playing algorithm, we show it can also be used for complex perception problems. Our adapted MCTS algorithm has few easy-to-tune hyperparameters and can optimise general losses. We use it to optimise the posterior probability of objects and room layout hypotheses given the RGB-D data. This results in an analysis-by-synthesis approach that explores the solution space by rendering the current solution and comparing it to the RGB-D observations. To perform this exploration even more efficiently, we propose simple changes to the standard MCTS' tree construction and exploration policy. We demonstrate our approach on the ScanNet dataset. Our method often retrieves configurations that are better than some manual annotations, especially on layouts.

* To be presented at CVPR 2021

Via

Access Paper or Ask Questions