Abstract:Object-centric representations enable autonomous driving algorithms to reason about interactions between many independent agents and scene features. Traditionally these representations have been obtained via supervised learning, but this decouples perception from the downstream driving task and could harm generalization. In this work we adapt a self-supervised object-centric vision model to perform object decomposition using only RGB video and the pose of the vehicle as inputs. We demonstrate that our method obtains promising results on the Waymo Open perception dataset. While object mask quality lags behind supervised methods or alternatives that use more privileged information, we find that our model is capable of learning a representation that fuses multiple camera viewpoints over time and successfully tracks many vehicles and pedestrians in the dataset. Code for our model is available at https://github.com/wayveai/SOCS.
Abstract:Learning structured representations of visual scenes is currently a major bottleneck to bridging perception with reasoning. While there has been exciting progress with slot-based models, which learn to segment scenes into sets of objects, learning configurational properties of entire groups of objects is still under-explored. To address this problem, we introduce Constellation, a network that learns relational abstractions of static visual scenes, and generalises these abstractions over sensory particularities, thus offering a potential basis for abstract relational reasoning. We further show that this basis, along with language association, provides a means to imagine sensory content in new ways. This work is a first step toward explicitly representing visual relationships and using them in complex cognitive procedures.
Abstract:To help agents reason about scenes in terms of their building blocks, we wish to extract the compositional structure of any given scene (in particular, the configuration and characteristics of objects comprising the scene). This problem is especially difficult when scene structure needs to be inferred while also estimating the agent's location/viewpoint, as the two variables jointly give rise to the agent's observations. We present an unsupervised variational approach to this problem. Leveraging the shared structure that exists across different scenes, our model learns to infer two sets of latent representations from RGB video input alone: a set of "object" latents, corresponding to the time-invariant, object-level contents of the scene, as well as a set of "frame" latents, corresponding to global time-varying elements such as viewpoint. This factorization of latents allows our model, SIMONe, to represent object attributes in an allocentric manner which does not depend on viewpoint. Moreover, it allows us to disentangle object dynamics and summarize their trajectories as time-abstracted, view-invariant, per-object properties. We demonstrate these capabilities, as well as the model's performance in terms of view synthesis and instance segmentation, across three procedurally generated video datasets.
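To make the factorization concrete, the sketch below shows one way time-invariant object latents and time-varying frame latents could be combined in a per-pixel mixture decoder. It assumes a PyTorch setting; the module name, layer sizes, and decoder details are illustrative assumptions rather than SIMONe's exact architecture.

```python
# Minimal sketch of the object/frame latent factorization described above.
# Names, sizes, and the pixel-wise mixture decoder are illustrative assumptions.
import torch
import torch.nn as nn


class FactorizedVideoDecoder(nn.Module):
    """Decodes T frames from K time-invariant object latents and T frame latents."""

    def __init__(self, object_dim: int, frame_dim: int, hidden: int = 128):
        super().__init__()
        # Per-pixel MLP: (object latent, frame latent, x, y) -> RGB + mixture logit.
        self.pixel_mlp = nn.Sequential(
            nn.Linear(object_dim + frame_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 RGB channels + 1 mixture logit
        )

    def forward(self, object_latents, frame_latents, height=32, width=32):
        k, od = object_latents.shape   # K object latents, shared across time
        t, fd = frame_latents.shape    # T frame latents (viewpoint / global state)
        ys = torch.linspace(-1, 1, height, device=object_latents.device)
        xs = torch.linspace(-1, 1, width, device=object_latents.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy], dim=-1)                    # (H, W, 2)
        # Broadcast every (object, frame, pixel) combination through the MLP.
        obj = object_latents.view(k, 1, 1, 1, od).expand(k, t, height, width, od)
        frm = frame_latents.view(1, t, 1, 1, fd).expand(k, t, height, width, fd)
        pix = coords.view(1, 1, height, width, 2).expand(k, t, height, width, 2)
        out = self.pixel_mlp(torch.cat([obj, frm, pix], dim=-1))  # (K, T, H, W, 4)
        rgb, logits = out[..., :3], out[..., 3]
        weights = torch.softmax(logits, dim=0)                    # mixture over objects
        return (weights.unsqueeze(-1) * rgb).sum(dim=0)           # (T, H, W, 3)
```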
Abstract:Disentangled representations have recently been shown to improve data efficiency, generalisation, robustness and interpretability in simple supervised and reinforcement learning tasks. To extend such results to more complex domains, it is important to address a major shortcoming of the current state-of-the-art unsupervised disentangling approaches -- high convergence variance, whereby different disentanglement quality may be achieved by the same model depending on its initial state. The existing model selection methods require access to the ground-truth attribute labels, which are not available for most datasets. Hence, the benefits of disentangled representations have not yet been fully explored in practical applications. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. We show that our approach performs comparably to the existing supervised alternatives across 5400 models from six state-of-the-art unsupervised disentangled representation learning model classes.
Abstract:Data efficiency and robustness to task-irrelevant perturbations are long-standing challenges for deep reinforcement learning algorithms. Here we introduce a modular approach to addressing these challenges in a continuous control environment, without using hand-crafted or supervised information. Our Curious Object-Based seaRch Agent (COBRA) uses task-free intrinsically motivated exploration and unsupervised learning to build object-based models of its environment and action space. Subsequently, it can learn a variety of tasks through model-based search in very few steps and excel on structured hold-out tests of policy robustness.
Abstract:The ability to decompose scenes in terms of abstract building blocks is crucial for general intelligence. Where those basic building blocks share meaningful properties, interactions and other regularities across scenes, such decompositions can simplify reasoning and facilitate imagination of novel scenarios. In particular, representing perceptual observations in terms of entities should improve data efficiency and transfer performance on a wide range of tasks. Thus we need models capable of discovering useful decompositions of scenes by identifying units with such regularities and representing them in a common format. To address this problem, we have developed the Multi-Object Network (MONet). In this model, a VAE is trained end-to-end together with a recurrent attention network -- in a purely unsupervised manner -- to provide attention masks around, and reconstructions of, regions of images. We show that this model is capable of learning to decompose and represent challenging 3D scenes into semantically meaningful components, such as objects and background elements.
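The sketch below illustrates the kind of recurrent decomposition loop described here, in PyTorch. The attention and component-VAE modules passed in are hypothetical stand-ins, and the loop is a simplified reading of the approach rather than MONet's exact implementation.

```python
# Minimal sketch of a recurrent scene-decomposition loop in the spirit of the
# description above. `attention_net` and `component_vae` are assumed, hypothetical
# modules standing in for the attention network and the component VAE.
import torch
import torch.nn as nn


def decompose(image, attention_net: nn.Module, component_vae: nn.Module, num_slots: int = 5):
    """Sequentially explain an image as a set of masked components."""
    b, _, h, w = image.shape
    # Log of the "scope": the part of the image not yet explained by earlier slots.
    log_scope = torch.zeros(b, 1, h, w, device=image.device)
    masks, reconstructions = [], []
    for k in range(num_slots):
        if k < num_slots - 1:
            log_alpha = attention_net(image, log_scope)  # log of attention value in (0, 1)
            log_mask = log_scope + log_alpha             # portion of the scope claimed now
            log_scope = log_scope + torch.log1p(-log_alpha.exp() + 1e-8)  # remainder
        else:
            log_mask = log_scope                         # last slot takes whatever remains
        # Component VAE sees the image plus the mask and reconstructs the masked region
        # (its KL term is omitted in this sketch).
        recon, _ = component_vae(image, log_mask)
        masks.append(log_mask.exp())
        reconstructions.append(recon)
    return torch.stack(masks, dim=1), torch.stack(reconstructions, dim=1)
```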
Abstract:We present a simple neural rendering architecture that helps variational autoencoders (VAEs) learn disentangled representations. Instead of the deconvolutional network typically used in the decoder of VAEs, we tile (broadcast) the latent vector across space, concatenate fixed X- and Y-"coordinate" channels, and apply a fully convolutional network with 1x1 stride. This provides an architectural prior for dissociating positional from non-positional features in the latent distribution of VAEs, yet without providing any explicit supervision to this effect. We show that this architecture, which we term the Spatial Broadcast decoder, improves disentangling, reconstruction accuracy, and generalization to held-out regions in data space. It provides a particularly dramatic benefit when applied to datasets with small objects. We also emphasize a method for visualizing learned latent spaces that helped us diagnose our models and may prove useful for others aiming to assess data representations. Finally, we show the Spatial Broadcast Decoder is complementary to state-of-the-art (SOTA) disentangling techniques and when incorporated improves their performance.
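The architecture described here lends itself to a direct sketch. The following minimal PyTorch implementation illustrates the broadcast-and-concatenate step; the layer widths, kernel sizes, and 64x64 output resolution are assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of a Spatial Broadcast decoder as described above.
# Layer sizes and the 64x64 output resolution are illustrative assumptions.
import torch
import torch.nn as nn


class SpatialBroadcastDecoder(nn.Module):
    def __init__(self, latent_dim: int, out_channels: int = 3, height: int = 64, width: int = 64):
        super().__init__()
        self.height, self.width = height, width
        # Fully convolutional stack with stride 1 (no deconvolution / upsampling).
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 2, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        b, d = z.shape
        # Tile (broadcast) the latent vector across the spatial grid.
        z_tiled = z.view(b, d, 1, 1).expand(b, d, self.height, self.width)
        # Fixed X- and Y-coordinate channels in [-1, 1].
        ys = torch.linspace(-1, 1, self.height, device=z.device)
        xs = torch.linspace(-1, 1, self.width, device=z.device)
        y_grid, x_grid = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([x_grid, y_grid], dim=0).expand(b, 2, self.height, self.width)
        return self.net(torch.cat([z_tiled, coords], dim=1))
```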
Abstract:Intelligent behaviour in the real-world requires the ability to acquire new knowledge from an ongoing sequence of experiences while preserving and reusing past knowledge. We propose a novel algorithm for unsupervised representation learning from piece-wise stationary visual data: Variational Autoencoder with Shared Embeddings (VASE). Based on the Minimum Description Length principle, VASE automatically detects shifts in the data distribution and allocates spare representational capacity to new knowledge, while simultaneously protecting previously learnt representations from catastrophic forgetting. Our approach encourages the learnt representations to be disentangled, which imparts a number of desirable properties: VASE can deal sensibly with ambiguous inputs, it can enhance its own representations through imagination-based exploration, and most importantly, it exhibits semantically meaningful sharing of latents between different datasets. Compared to baselines with entangled representations, our approach is able to reason beyond surface-level statistics and perform semantically meaningful cross-domain inference.
Abstract:We present new intuitions and theoretical assessments of the emergence of disentangled representations in variational autoencoders. Taking a rate-distortion theory perspective, we show the circumstances under which representations aligned with the underlying generative factors of variation of data emerge when optimising the modified ELBO objective in $\beta$-VAE, as training progresses. From these insights, we propose a modification to the training regime of $\beta$-VAE that progressively increases the information capacity of the latent code during training. This modification facilitates the robust learning of disentangled representations in $\beta$-VAE, without the previous trade-off in reconstruction accuracy.
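For concreteness, a capacity-controlled objective of this kind is commonly written as shown below, where $C$ is a capacity target increased (e.g. linearly) from zero during training and $\gamma$ is a penalty weight; the notation here is illustrative rather than quoted from the paper:

$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \;-\; \gamma \left| D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right) - C \right|$

Maximising this objective encourages the KL term to track the growing capacity target $C$, so the latent code is forced to encode more information as training progresses while avoiding the reconstruction penalty incurred by a fixed large $\beta$.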