Federico Baldassarre

Cluster and Predict Latent Patches for Improved Masked Image Modeling

Feb 12, 2025

Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski

Abstract: Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. We release all our code and models.

* 13 pages, 7 figures, submitted to TMLR

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Dec 20, 2024

Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab(+4 more)

Abstract: Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
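
As a rough illustration of the "[CLS] token concatenated with the patch average" ingredient mentioned in the abstract, the sketch below builds such an image embedding in plain Python. The function name `image_embedding` and the list-based inputs are hypothetical, chosen only for illustration; this is not the dino.txt implementation, and the abstract does not specify any normalization or projection applied afterwards.

```python
def image_embedding(cls_token, patch_tokens):
    """Concatenate a [CLS] feature vector with the mean of the patch features.

    cls_token:    list of floats, length D (hypothetical encoder output)
    patch_tokens: list of N patch feature vectors, each of length D
    Returns a list of length 2 * D.
    """
    if not patch_tokens:
        raise ValueError("at least one patch token is required")
    dim = len(cls_token)
    num_patches = len(patch_tokens)
    # Average each feature dimension over all patches.
    patch_mean = [
        sum(patch[i] for patch in patch_tokens) / num_patches
        for i in range(dim)
    ]
    # Global descriptor = [CLS] features followed by the patch average.
    return list(cls_token) + patch_mean


embedding = image_embedding([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]])
print(embedding)
```

The resulting vector has twice the dimension of a single token, so any text embedding aligned with it would need a matching dimension.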

Quantitative Metrics for Evaluating Explanations of Video DeepFake Detectors

Oct 07, 2022

Federico Baldassarre, Quentin Debard, Gonzalo Fiz Pontiveros, Tri Kurniawan Wijaya

Abstract: The proliferation of DeepFake technology is a rising challenge in today's society, owing to more powerful and accessible generation methods. To counter this, the research community has developed detectors of ever-increasing accuracy. However, the ability to explain the decisions of such models to users is lagging behind and is considered an accessory in large-scale benchmarks, despite being a crucial requirement for the correct deployment of automated tools for content moderation. We attribute the issue to the reliance on qualitative comparisons and the lack of established metrics. We describe a simple set of metrics to evaluate the visual quality and informativeness of explanations of video DeepFake classifiers from a human-centric perspective. With these metrics, we compare common approaches to improve explanation quality and discuss their effect on both classification and explanation performance on the recent DFDC and DFD datasets.

* Accepted at BMVC 2022, code repository at https://github.com/baldassarreFe/deepfake-detection

Towards Self-Supervised Learning of Global and Object-Centric Representations

Mar 11, 2022

Federico Baldassarre, Hossein Azizpour

Abstract: Self-supervision allows learning meaningful representations of natural images, which usually contain one central object. How well does it transfer to multi-entity scenes? We discuss key aspects of learning structured object-centric representations with self-supervision and validate our insights through several experiments on the CLEVR dataset. Regarding the architecture, we confirm the importance of competition for attention-based object discovery, where each image patch is exclusively attended by one object. For training, we show that contrastive losses equipped with matching can be applied directly in a latent space, avoiding pixel-based reconstruction. However, such an optimization objective is sensitive to false negatives (recurring objects) and false positives (matching errors). Thus, careful consideration is required around data augmentation and negative sample selection.
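
One common way to obtain the kind of competition described in the abstract is to normalize attention scores across object slots rather than across patches, so that slots compete for each patch. The sketch below assumes this softmax-over-slots formulation; the function name `compete_for_patches` and the score layout are hypothetical, and the paper's actual architecture may differ.

```python
import math


def compete_for_patches(scores):
    """Normalize attention scores over object slots, separately for each patch.

    scores[k][n] is a (hypothetical) affinity between object slot k and patch n.
    Returns weights of the same shape where, for every patch n, the weights
    over all slots sum to 1, so slots compete for that patch.
    """
    num_slots = len(scores)
    num_patches = len(scores[0])
    weights = [[0.0] * num_patches for _ in range(num_slots)]
    for n in range(num_patches):
        column = [scores[k][n] for k in range(num_slots)]
        # Subtract the maximum for numerical stability before exponentiating.
        peak = max(column)
        exps = [math.exp(s - peak) for s in column]
        total = sum(exps)
        for k in range(num_slots):
            weights[k][n] = exps[k] / total
    return weights


weights = compete_for_patches([[0.0, 5.0], [0.0, -5.0]])
print(weights)
```

When one slot's score for a patch dominates, its weight approaches 1 and the other slots receive almost nothing, which approximates exclusive assignment of that patch to a single object.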

Explanation-based Weakly-supervised Learning of Visual Relations with Graph Networks

Jun 16, 2020

Federico Baldassarre, Kevin Smith, Josephine Sullivan, Hossein Azizpour

Abstract: Visual relationship detection is fundamental for holistic image understanding. However, localizing and classifying (subject, predicate, object) triplets constitutes a hard learning objective due to the combinatorial explosion of possible relationships, their long-tail distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies only on image-level predicate annotations. A graph neural network is trained to classify the predicates in an image from the graph representation of all objects, implicitly encoding an inductive bias for pairwise relationships. We then frame relationship detection as the explanation of such a predicate classifier, i.e., we reconstruct a complete relationship by recovering the subject and the object of a predicted predicate. Using this novel technique and minimal labels, we present comparable results to recent fully-supervised and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relationships, and UnRel for unusual relationships.

Explainability Techniques for Graph Convolutional Networks

May 31, 2019

Federico Baldassarre, Hossein Azizpour

Abstract: Graph Networks are used to make decisions in potentially complex scenarios, but it is usually not obvious how or why they made them. In this work, we study the explainability of Graph Network decisions using two main classes of techniques, gradient-based and decomposition-based, on a toy dataset and a chemistry task. Our study sets the ground for future development as well as application to real-world problems.

* Accepted at the ICML 2019 Workshop "Learning and Reasoning with Graph-Structured Representations" (poster + spotlight talk)

Deep Koalarization: Image Colorization using CNNs and Inception-ResNet-v2

Dec 09, 2017

Federico Baldassarre, Diego González Morín, Lucas Rodés-Guirao

Abstract: We review some of the most recent approaches to colorize gray-scale images using deep learning methods. Inspired by these, we propose a model which combines a deep Convolutional Neural Network trained from scratch with high-level features extracted from the Inception-ResNet-v2 pre-trained model. Thanks to its fully convolutional architecture, our encoder-decoder model can process images of any size and aspect ratio. Besides presenting the training results, we assess the "public acceptance" of the generated images by means of a user study. Finally, we present a carousel of applications on different types of images, such as historical photographs.

* 12 pages
