Eloi Zablocki

GaussRender: Learning 3D Occupancy with Gaussian Rendering

Feb 07, 2025

Loick Chambon, Eloi Zablocki, Alexandre Boulch, Mickael Chen, Matthieu Cord

Abstract: Understanding the 3D geometry and semantics of driving scenes is critical for developing safe autonomous vehicles. While 3D occupancy models are typically trained using voxel-based supervision with standard losses (e.g., cross-entropy, Lovasz, dice), these approaches treat voxel predictions independently, neglecting their spatial relationships. In this paper, we propose GaussRender, a plug-and-play 3D-to-2D reprojection loss that enhances voxel-based supervision. Our method projects 3D voxel representations into arbitrary 2D perspectives and leverages Gaussian splatting as an efficient, differentiable rendering proxy of voxels, introducing spatial dependencies across projected elements. This approach improves semantic and geometric consistency, handles occlusions more efficiently, and requires no architectural modifications. Extensive experiments on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate consistent performance gains across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), highlighting the robustness and versatility of our framework. The code is available at https://github.com/valeoai/GaussRender.

LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models

Sep 18, 2024

Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord

Abstract: Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations, as it requires 'white-box' access to the model's architecture and weights, as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a 'black-box' manner by leveraging large language models (LLMs) to reason on their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results competitive with classic fine-tuning.

* EVAL-FoMo workshop, ECCV 2024

PointBeV: A Sparse Approach to BeV Predictions

Dec 01, 2023

Loick Chambon, Eloi Zablocki, Mickael Chen, Florent Bartoccioni, Patrick Perez, Matthieu Cord

Abstract: Bird's-eye View (BeV) representations have emerged as the de facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at https://github.com/valeoai/PointBeV.

* https://github.com/valeoai/PointBeV

Transductive Zero-Shot Learning using Cross-Modal CycleGAN

Nov 13, 2020

Patrick Bordes, Eloi Zablocki, Benjamin Piwowarski, Patrick Gallinari

Abstract: In Computer Vision, Zero-Shot Learning (ZSL) aims at classifying unseen classes, i.e., classes for which no matching training image exists. Most ZSL works learn a cross-modal mapping between images and class labels for seen classes. However, the data distribution of seen and unseen classes might differ, causing a domain shift problem. Following this observation, transductive ZSL (T-ZSL) assumes that unseen classes and their associated images are known during training, but not their correspondence. As current T-ZSL approaches do not scale efficiently when the number of seen classes is high, we tackle this problem with a new model for T-ZSL based upon CycleGAN. Our model jointly (i) projects images onto their seen class labels with a supervised objective and (ii) aligns unseen class labels and visual exemplars with adversarial and cycle-consistency objectives. We show the efficiency of our Cross-Modal CycleGAN model (CM-GAN) on the ImageNet T-ZSL task, where we obtain state-of-the-art results. We further validate CM-GAN on a language grounding task, and on a new task that we propose: zero-shot sentence-to-image matching on MS COCO.
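
The cycle-consistency objective mentioned in the abstract can be illustrated with a toy sketch. This is not the CM-GAN implementation: the two "generators" here are plain linear maps, and all names (`text_to_visual`, `visual_to_text`, `cycle_consistency_loss`) are hypothetical. It only shows the generic idea that mapping an embedding to the other modality and back should recover the original.

```python
# Toy illustration of a cycle-consistency loss (an assumption-laden sketch,
# not the paper's code). Linear maps stand in for the two learned generators.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def l1_distance(u, v):
    """Sum of absolute coordinate differences between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def cycle_consistency_loss(x, text_to_visual, visual_to_text):
    """L1 distance between x and its reconstruction after a round trip."""
    reconstructed = matvec(visual_to_text, matvec(text_to_visual, x))
    return l1_distance(x, reconstructed)

label_embedding = [1.0, 2.0]                 # hypothetical class-label embedding
forward = [[2.0, 0.0], [0.0, 4.0]]           # hypothetical label -> visual map
exact_inverse = [[0.5, 0.0], [0.0, 0.25]]    # perfectly inverts `forward`
poor_inverse = [[1.0, 0.0], [0.0, 1.0]]      # identity, does not invert it

loss_good = cycle_consistency_loss(label_embedding, forward, exact_inverse)
loss_bad = cycle_consistency_loss(label_embedding, forward, poor_inverse)
print(loss_good, loss_bad)
```

The loss is zero only when the backward map undoes the forward map, which is the property that lets unpaired label and image sets be aligned without known correspondences.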

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Feb 07, 2020

Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, Patrick Gallinari

Abstract: Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations (the focus of this paper), as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.

Context-Aware Zero-Shot Learning for Object Recognition

Apr 30, 2019

Eloi Zablocki, Patrick Bordes, Benjamin Piwowarski, Laure Soulier, Patrick Gallinari

Abstract: Zero-Shot Learning (ZSL) aims at classifying unlabeled objects by leveraging auxiliary knowledge, such as semantic representations. A limitation of previous approaches is that only intrinsic properties of objects, e.g., their visual appearance, are taken into account, while their context, e.g., the surrounding objects in the image, is ignored. Following the intuitive principle that objects tend to be found in certain contexts but not others, we propose a new and challenging approach, context-aware ZSL, that leverages semantic representations in a new way to model the conditional likelihood of an object appearing in a given context. Finally, through extensive experiments conducted on Visual Genome, we show that contextual information can substantially improve the standard ZSL approach and is robust to unbalanced classes.

* Accepted at ICML 2019
