Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Nicolas Carion

SAM 2: Segment Anything in Images and Videos

Aug 01, 2024

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson(+8 more)

Figure 1 for SAM 2: Segment Anything in Images and Videos

Figure 2 for SAM 2: Segment Anything in Images and Videos

Figure 3 for SAM 2: Segment Anything in Images and Videos

Figure 4 for SAM 2: Segment Anything in Images and Videos

Abstract:We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provides strong performance across a wide range of tasks. In video segmentation, we observe better accuracy, using 3x fewer interactions than prior approaches. In image segmentation, our model is more accurate and 6x faster than the Segment Anything Model (SAM). We believe that our data, model, and insights will serve as a significant milestone for video segmentation and related perception tasks. We are releasing a version of our model, the dataset and an interactive demo.

* Website: https://ai.meta.com/sam2

Via

Access Paper or Ask Questions

DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

Aug 15, 2023

Melissa Hall, Candace Ross, Adina Williams, Nicolas Carion, Michal Drozdzal, Adriana Romero Soriano

Figure 1 for DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

Figure 2 for DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

Figure 3 for DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

Figure 4 for DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity

Abstract:The unprecedented photorealistic results achieved by recent text-to-image generative systems and their increasing use as plug-and-play content creation solutions make it crucial to understand their potential biases. In this work, we introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems when prompted to generate objects from across the world. Our indicators complement qualitative analysis of the broader impact of such systems by enabling automatic and efficient benchmarking of geographic disparities, an important step towards building responsible visual content creation systems. We use our proposed indicators to analyze potential geographic biases in state-of-the-art visual content creation systems and find that: (1) models have less realism and diversity of generations when prompting for Africa and West Asia than Europe, (2) prompting with geographic information comes at a cost to prompt-consistency and diversity of generated images, and (3) models exhibit more region-level disparities for some objects than others. Perhaps most interestingly, our indicators suggest that progress in image generation quality has come at the cost of real-world geographic representation. Our comprehensive evaluation constitutes a crucial step towards ensuring a positive experience of visual content creation for everyone.

Via

Access Paper or Ask Questions

GeneCIS: A Benchmark for General Conditional Image Similarity

Jun 13, 2023

Sagar Vaze, Nicolas Carion, Ishan Misra

Abstract:We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open-set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.

* CVPR 2023 (Highlighted Paper). Project page at https://sgvaze.github.io/genecis/

Via

Access Paper or Ask Questions

Joint Embedding Predictive Architectures Focus on Slow Features

Nov 20, 2022

Vlad Sobal, Jyothir S V, Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, Yann LeCun

Figure 1 for Joint Embedding Predictive Architectures Focus on Slow Features

Figure 2 for Joint Embedding Predictive Architectures Focus on Slow Features

Figure 3 for Joint Embedding Predictive Architectures Focus on Slow Features

Figure 4 for Joint Embedding Predictive Architectures Focus on Slow Features

Abstract:Many common methods for learning a world model for pixel-based environments use generative architectures trained with pixel-level reconstruction objectives. Recently proposed Joint Embedding Predictive Architectures (JEPA) offer a reconstruction-free alternative. In this work, we analyze performance of JEPA trained with VICReg and SimCLR objectives in the fully offline setting without access to rewards, and compare the results to the performance of the generative architecture. We test the methods in a simple environment with a moving dot with various background distractors, and probe learned representations for the dot's location. We find that JEPA methods perform on par or better than reconstruction when distractor noise changes every time step, but fail when the noise is fixed. Furthermore, we provide a theoretical explanation for the poor performance of JEPA-based methods with fixed noise, highlighting an important limitation.

* 4 pages (3 figures) short paper for SSL Theory and Practice workshop at NeurIPS 2022. Code is available at https://github.com/vladisai/JEPA_SSL_NeurIPS_2022

Via

Access Paper or Ask Questions

Light-weight probing of unsupervised representations for Reinforcement Learning

Aug 25, 2022

Wancong Zhang, Anthony GX-Chen, Vlad Sobal, Yann LeCun, Nicolas Carion

Figure 1 for Light-weight probing of unsupervised representations for Reinforcement Learning

Figure 2 for Light-weight probing of unsupervised representations for Reinforcement Learning

Figure 3 for Light-weight probing of unsupervised representations for Reinforcement Learning

Figure 4 for Light-weight probing of unsupervised representations for Reinforcement Learning

Abstract:Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms which is computationally intensive and has high variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with the actual downstream control performance on the Atari100k Benchmark. This provides a better method for exploring the space of pretraining algorithms without the need of running RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.

Via

Access Paper or Ask Questions

Separating the World and Ego Models for Self-Driving

Apr 14, 2022

Vlad Sobal, Alfredo Canziani, Nicolas Carion, Kyunghyun Cho, Yann LeCun

Figure 1 for Separating the World and Ego Models for Self-Driving

Figure 2 for Separating the World and Ego Models for Self-Driving

Figure 3 for Separating the World and Ego Models for Self-Driving

Figure 4 for Separating the World and Ego Models for Self-Driving

Abstract:Training self-driving systems to be robust to the long-tail of driving scenarios is a critical problem. Model-based approaches leverage simulation to emulate a wide range of scenarios without putting users at risk in the real world. One promising path to faithful simulation is to train a forward model of the world to predict the future states of both the environment and the ego-vehicle given past states and a sequence of actions. In this paper, we argue that it is beneficial to model the state of the ego-vehicle, which often has simple, predictable and deterministic behavior, separately from the rest of the environment, which is much more complex and highly multimodal. We propose to model the ego-vehicle using a simple and differentiable kinematic model, while training a stochastic convolutional forward model on raster representations of the state to predict the behavior of the rest of the environment. We explore several configurations of such decoupled models, and evaluate their performance both with Model Predictive Control (MPC) and direct policy learning. We test our methods on the task of highway driving and demonstrate lower crash rates and better stability. The code is available at https://github.com/vladisai/pytorch-PPUU/tree/ICLR2022.

* 8 pages main content, 14 with references and appendix. 5 figures in total. Submitted and accepted to ICLR 2022 workshop on Generalizable Policy Learning in the Physical World (https://ai-workshops.github.io/generalizable-policy-learning-in-the-physical-world/)

Via

Access Paper or Ask Questions

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Apr 26, 2021

Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, Nicolas Carion

Figure 1 for MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Figure 2 for MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Figure 3 for MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Figure 4 for MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

Abstract:Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at https://github.com/ashkamath/mdetr.

Via

Access Paper or Ask Questions

End-to-End Object Detection with Transformers

May 28, 2020

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko

Figure 1 for End-to-End Object Detection with Transformers

Figure 2 for End-to-End Object Detection with Transformers

Figure 3 for End-to-End Object Detection with Transformers

Figure 4 for End-to-End Object Detection with Transformers

Abstract:We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset. Moreover, DETR can be easily generalized to produce panoptic segmentation in a unified manner. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.

Via

Access Paper or Ask Questions

A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Oct 19, 2019

Nicolas Carion, Gabriel Synnaeve, Alessandro Lazaric, Nicolas Usunier

Figure 1 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 2 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 3 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Figure 4 for A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Abstract:Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.

* NeurIPS 2019

Via

Access Paper or Ask Questions

Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger

Nov 30, 2018

Gabriel Synnaeve, Zeming Lin, Jonas Gehring, Dan Gant, Vegard Mella, Vasil Khalidov, Nicolas Carion, Nicolas Usunier

Figure 1 for Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger

Figure 2 for Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger

Figure 3 for Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger

Figure 4 for Forward Modeling for Partial Observation Strategy Games - A StarCraft Defogger

Abstract:We formulate the problem of defogging as state estimation and future state prediction from previous, partial observations in the context of real-time strategy games. We propose to employ encoder-decoder neural networks for this task, and introduce proxy tasks and baselines for evaluation to assess their ability of capturing basic game rules and high-level dynamics. By combining convolutional neural networks and recurrent networks, we exploit spatial and sequential correlations and train well-performing models on a large dataset of human games of StarCraft: Brood War. Finally, we demonstrate the relevance of our models to downstream tasks by applying them for enemy unit prediction in a state-of-the-art, rule-based StarCraft bot. We observe improvements in win rates against several strong community bots.

* Advances in Neural Information Processing Systems 31 (2018) 10759-10770

Via

Access Paper or Ask Questions