Maximilian Seitzer

DINOv3

Aug 13, 2025

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa (+16 more)

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Mar 27, 2025

Aniket Didolkar, Andrii Zadaianchuk, Rabiul Awal, Maximilian Seitzer, Efstratios Gavves, Aishwarya Agrawal

Abstract: Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

* Accepted at CVPR 2025

Temporally Consistent Object-Centric Learning by Contrasting Slots

Dec 18, 2024

Anna Manasyan, Maximilian Seitzer, Filip Radovic, Georg Martius, Andrii Zadaianchuk

Abstract: Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.
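
A minimal sketch of what an object-level temporal contrastive loss can look like, assuming an InfoNCE-style objective in which slot i at one frame is pulled toward slot i at the next frame and pushed away from the other slots. The abstract does not give the exact formulation, so the function name `slot_contrastive_loss`, the temperature value, and the choice of negatives are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def slot_contrastive_loss(slots_t, slots_next, temperature=0.1):
    """InfoNCE-style loss over slots of two consecutive frames.

    slots_t, slots_next: arrays of shape (num_slots, dim). Row i of both
    arrays is assumed to describe the same object (hypothetical setup).
    """
    a = np.asarray(slots_t, dtype=float)
    b = np.asarray(slots_next, dtype=float)

    # L2-normalise so the dot product is a cosine similarity.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-12)

    # Pairwise similarities; entry (i, j) compares slot i at t with slot j at t+1.
    logits = a @ b.T / temperature

    # Numerically stable log-softmax over the slots of the next frame.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive pair for slot i is the same slot index in the next frame.
    return float(-np.mean(np.diag(log_probs)))
```

With this toy objective, slots that keep their identity across frames yield a loss near zero, while slots that swap objects between frames are penalised.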

Zero-Shot Object-Centric Representation Learning

Aug 17, 2024

Aniket Didolkar, Andrii Zadaianchuk, Anirudh Goyal, Mike Mozer, Yoshua Bengio, Georg Martius, Maximilian Seitzer

Abstract: The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing pre-trained self-supervised features. However, so far, object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the wider trend in machine learning towards general-purpose models directly applicable to unseen data and tasks. Thus, in this work, we study current object-centric methods through the lens of zero-shot generalization by introducing a benchmark comprising eight different synthetic and real-world datasets. We analyze the factors influencing zero-shot performance and find that training on diverse real-world images improves transferability to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

DyST: Towards Dynamic Neural Scene Representations on Real-World Videos

Oct 09, 2023

Maximilian Seitzer, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi

Abstract: Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.

* Project website: https://dyst-paper.github.io/

Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities

Jun 07, 2023

Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius

Abstract: Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.

Bridging the Gap to Real-World Object-Centric Learning

Sep 29, 2022

Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox (+1 more)

Abstract: Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.

On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks

Apr 01, 2022

Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, Georg Martius

Abstract: Capturing aleatoric uncertainty is a critical part of many machine learning systems. In deep learning, a common approach to this end is to train a neural network to estimate the parameters of a heteroscedastic Gaussian distribution by maximizing the logarithm of the likelihood function under the observed data. In this work, we examine this approach and identify potential hazards associated with the use of log-likelihood in conjunction with gradient-based optimizers. First, we present a synthetic example illustrating how this approach can lead to very poor but stable parameter estimates. Second, we identify the culprit to be the log-likelihood loss, along with certain conditions that exacerbate the issue. Third, we present an alternative formulation, termed $\beta$-NLL, in which each data point's contribution to the loss is weighted by the $\beta$-exponentiated variance estimate. We show that using an appropriate $\beta$ largely mitigates the issue in our illustrative example. Fourth, we evaluate this approach on a range of domains and tasks and show that it achieves considerable improvements and performs more robustly concerning hyperparameters, both in predictive RMSE and log-likelihood criteria.

* ICLR 2022 camera-ready version. Code available at http://github.com/martius-lab/beta-nll
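
The weighting described in the abstract can be sketched as follows. This is a minimal NumPy illustration, assuming the per-point Gaussian negative log-likelihood (up to an additive constant) is multiplied by the predicted variance raised to the power β. The function name `beta_nll` and the default `beta=0.5` are illustrative assumptions; in an autodiff framework the weight would be treated as a constant (stop-gradient), which has no effect on the value computed here. The linked repository remains the reference for the authors' own code.

```python
import numpy as np


def beta_nll(mean, variance, target, beta=0.5):
    """Gaussian NLL with each point weighted by variance**beta.

    mean, variance, target: array-likes of the same shape; variance > 0.
    beta=0 gives the plain NLL (constant term dropped).
    """
    mean = np.asarray(mean, dtype=float)
    variance = np.asarray(variance, dtype=float)
    target = np.asarray(target, dtype=float)

    # Per-point Gaussian negative log-likelihood without the 0.5*log(2*pi) constant.
    nll = 0.5 * (np.log(variance) + (target - mean) ** 2 / variance)

    # Per-point weight; would be wrapped in a stop-gradient when training.
    weight = variance ** beta

    return float(np.mean(weight * nll))
```

With unit variance the weight is 1 for every β and the loss reduces to half the mean squared error, which is a quick sanity check for the sketch.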

Causal Influence Detection for Improving Efficiency in Reinforcement Learning

Jun 07, 2021

Maximilian Seitzer, Bernhard Schölkopf, Georg Martius

Abstract: Many reinforcement learning (RL) environments consist of independent entities that interact sparsely. In such environments, RL agents have only limited influence over other entities in any particular situation. Our idea in this work is that learning can be efficiently guided by knowing when and what the agent can influence with its actions. To achieve this, we introduce a measure of situation-dependent causal influence based on conditional mutual information and show that it can reliably detect states of influence. We then propose several ways to integrate this measure into RL algorithms to improve exploration and off-policy learning. All modified algorithms show strong increases in data efficiency on robotic manipulation tasks.
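
To make the idea of a state-dependent influence measure concrete, here is a toy sketch for a discrete setting. It assumes the measure is the mutual information between the action and an entity's next state, conditioned on one fixed current state, computed as the expected KL divergence between the action-specific and action-averaged next-state distributions. The function name `causal_influence`, the tabular inputs, and the assumption that every action has positive probability are illustrative; the abstract does not describe how the quantity is estimated in practice.

```python
import numpy as np


def causal_influence(action_probs, next_state_probs):
    """I(S'; A | S = s) in nats for one fixed state s (discrete toy case).

    action_probs:     shape (num_actions,), the action distribution in s.
    next_state_probs: shape (num_actions, num_next_states), row a is
                      p(s' | s, a).
    """
    pi = np.asarray(action_probs, dtype=float)
    cond = np.asarray(next_state_probs, dtype=float)

    # Next-state distribution with the action marginalised out: p(s' | s).
    marginal = pi @ cond

    # log(p(s'|s,a) / p(s'|s)), using the convention 0 * log 0 = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(cond > 0, np.log(cond / marginal), 0.0)

    # KL divergence per action, then the expectation over actions.
    kl_per_action = np.sum(cond * log_ratio, axis=1)
    return float(pi @ kl_per_action)
```

If the next state does not depend on the action, the value is zero; if two equally likely actions lead deterministically to different next states, it equals log 2.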

Self-supervised Visual Reinforcement Learning with Object-centric Representations

Nov 29, 2020

Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius

Abstract: Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky challenge for any autonomous agent. Previous methods have used variational autoencoders to encode a scene into a low-dimensional vector that can be used as a goal for an agent to discover new skills. Nevertheless, in compositional/multi-object environments it is difficult to disentangle all the factors of variation into such a fixed-length representation of the whole scene. We propose to use object-centric representations as a modular and structured observation space, which is learned with a compositional generative world model. We show that the structure in the representations in combination with goal-conditioned attention policies helps the autonomous agent to discover and learn useful skills. These skills can be further combined to address compositional tasks like the manipulation of several different objects.
