Abstract:Neural discrete representations are crucial components of modern neural networks. However, their main limitation is that primary strategies such as VQ-VAE can only provide representations at the patch level. Therefore, one of the main goals of representation learning, acquiring structured, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that naively quantizing at the object level poses a significant challenge and instead propose constructing scene representations hierarchically, from low-level discrete concept schemas to object representations. Additionally, we suggest a novel method for structured semantic world modeling by training a prior over these representations, enabling image generation by sampling the semantic properties of the objects in the scene. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and to previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene.
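As a rough illustration of the hierarchical quantization idea described above, the sketch below (not the authors' code; the slot encoder, block sizes, and codebook sizes are assumptions) splits each object slot into a few low-level concept blocks and quantizes each block against its own small codebook with a straight-through estimator, so an object is described by a short tuple of discrete codes rather than a grid of patch codes.

# Minimal sketch of block-wise quantization of object slots (illustrative only).
import torch
import torch.nn as nn

class BlockQuantizer(nn.Module):
    def __init__(self, num_blocks=4, block_dim=16, codebook_size=64):
        super().__init__()
        self.num_blocks, self.block_dim = num_blocks, block_dim
        # One codebook per block, e.g., one per low-level factor such as color or shape.
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, block_dim)) for _ in range(num_blocks)]
        )

    def forward(self, slots):                       # slots: (B, K, num_blocks * block_dim)
        B, K, _ = slots.shape
        blocks = slots.view(B, K, self.num_blocks, self.block_dim)
        quantized, indices = [], []
        for b, codebook in enumerate(self.codebooks):
            x = blocks[:, :, b]                     # (B, K, block_dim)
            dist = torch.cdist(x.reshape(-1, self.block_dim), codebook)
            idx = dist.argmin(dim=-1)               # nearest code per block
            q = codebook[idx].view(B, K, self.block_dim)
            quantized.append(x + (q - x).detach())  # straight-through estimator
            indices.append(idx.view(B, K))
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)

slots = torch.randn(2, 5, 64)                       # 5 object slots per image (hypothetical encoder output)
recon_input, codes = BlockQuantizer()(slots)
print(recon_input.shape, codes.shape)               # torch.Size([2, 5, 64]) torch.Size([2, 5, 4])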
Abstract:Unsupervised object-centric representation (OCR) learning has recently drawn attention as a new paradigm of visual representation. This is because of its potential to be an effective pre-training technique for various downstream tasks in terms of sample efficiency, systematic generalization, and reasoning. Although image-based reinforcement learning (RL) is one of the most important and thus most frequently mentioned of these downstream tasks, the benefit to RL has surprisingly not been investigated systematically thus far. Instead, most evaluations have focused on rather indirect metrics such as segmentation quality and object property prediction accuracy. In this paper, we investigate the effectiveness of OCR pre-training for image-based reinforcement learning via empirical experiments. For systematic evaluation, we introduce a simple object-centric visual RL benchmark and conduct experiments to answer questions such as ``Does OCR pre-training improve performance on object-centric tasks?'' and ``Can OCR pre-training help with out-of-distribution generalization?''. Our results provide empirical evidence and valuable insights into the effectiveness of OCR pre-training for RL, as well as its potential limitations in certain scenarios. Additionally, this study examines critical aspects of incorporating OCR pre-training in RL, including performance in a visually complex environment and the appropriate pooling layer for aggregating the object representations.
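For concreteness, the following sketch shows one plausible form of the pooling layer mentioned above, assuming a frozen OCR encoder that returns K object slots per frame; transformer pooling with a learned CLS-style readout token is only one reasonable choice, and all module names and sizes here are hypothetical.

# Minimal sketch: aggregate a permutation-invariant set of slots into a policy input.
import torch
import torch.nn as nn

class SlotPooling(nn.Module):
    def __init__(self, slot_dim=64, num_actions=6):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, slot_dim))     # learned readout token
        layer = nn.TransformerEncoderLayer(d_model=slot_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.policy = nn.Linear(slot_dim, num_actions)

    def forward(self, slots):                 # slots: (B, K, slot_dim) from a frozen OCR encoder
        B = slots.size(0)
        tokens = torch.cat([self.cls.expand(B, -1, -1), slots], dim=1)
        pooled = self.encoder(tokens)[:, 0]   # read out the CLS token
        return self.policy(pooled)            # action logits

logits = SlotPooling()(torch.randn(8, 5, 64))
print(logits.shape)                           # torch.Size([8, 6])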
Abstract:Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations, thereby promising to resolve many critical limitations of traditional single-vector representations, such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only on simple and synthetic scenes, not on complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos unprecedented in this line of research. Interestingly, this is achieved neither by adding complexity to the model architecture nor by introducing a new objective or weak supervision. Rather, it is achieved with a surprisingly simple architecture that uses a transformer-based image decoder conditioned on slots, and a learning objective that is simply to reconstruct the observation. Our experimental results on various complex and naturalistic videos show significant improvements over the previous state-of-the-art.
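A minimal sketch of the decoding side, under the assumption that the observation is represented as a sequence of discrete patch tokens (e.g., from a discrete VAE): an autoregressive transformer decoder cross-attends to the slots and is trained purely to reconstruct those tokens. Vocabulary size, slot count, and dimensions are placeholders, and this is not the released STEVE code.

# Minimal sketch: slot-conditioned autoregressive reconstruction of patch tokens.
import torch
import torch.nn as nn

class SlotConditionedDecoder(nn.Module):
    def __init__(self, vocab_size=512, d_model=128, num_patches=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.randn(1, num_patches, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, slots):          # tokens: (B, P) int, slots: (B, K, d_model)
        P = tokens.size(1)
        x = self.token_emb(tokens) + self.pos_emb[:, :P]
        causal = torch.triu(torch.full((P, P), float('-inf')), diagonal=1)
        h = self.decoder(tgt=x, memory=slots, tgt_mask=causal)   # cross-attention to slots
        return self.head(h)                    # logits over patch tokens

tokens = torch.randint(0, 512, (2, 64))
slots = torch.randn(2, 5, 128)                 # per-frame slots from a video slot encoder
logits = SlotConditionedDecoder()(tokens, slots)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 512), tokens[:, 1:].reshape(-1))
print(loss.item())                             # reconstruction objective only; no extra losses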
Abstract:The Dreamer agent provides various benefits of Model-Based Reinforcement Learning (MBRL) such as sample efficiency, reusable knowledge, and safe planning. However, its world model and policy networks inherit the limitations of recurrent neural networks, and thus an important question is how an MBRL framework can benefit from the recent advances in transformers and what the challenges are in doing so. In this paper, we propose a transformer-based MBRL agent, called TransDreamer. We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent. In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks, both of which require long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.
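The sketch below is an illustrative reading of the abstract rather than the authors' implementation: the recurrent core of an RSSM-style world model is replaced by a causally masked transformer that maps the history of (stochastic state, action) pairs to deterministic states, from which the next stochastic state distribution is predicted. Layer sizes and names are invented.

# Minimal sketch of a transformer-based dynamics core (illustrative only).
import torch
import torch.nn as nn

class TSSMCore(nn.Module):
    def __init__(self, stoch_dim=32, action_dim=4, d_model=128):
        super().__init__()
        self.inp = nn.Linear(stoch_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.prior = nn.Linear(d_model, 2 * stoch_dim)   # mean and log-std of the next stochastic state

    def forward(self, stoch, actions):        # stoch: (B, T, stoch_dim), actions: (B, T, action_dim)
        T = stoch.size(1)
        x = self.inp(torch.cat([stoch, actions], dim=-1))
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)   # attend only to the past
        det = self.transformer(x, mask=causal)           # deterministic states without recurrence
        mean, log_std = self.prior(det).chunk(2, dim=-1)
        return det, mean, log_std

det, mean, log_std = TSSMCore()(torch.randn(2, 10, 32), torch.randn(2, 10, 4))
print(det.shape, mean.shape)                  # torch.Size([2, 10, 128]) torch.Size([2, 10, 32])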
Abstract:Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high computational complexity and the lack of a natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT), which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn the complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and is thus able to train on videos up to 70 frames long on a single 48GB GPU. We compare our model with previous RNN-based approaches as well as other possible video transformer baselines. We demonstrate that OCVT performs well compared to these baselines in generating future frames. OCVT also develops useful representations for video reasoning, achieving state-of-the-art performance on the CATER task.
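A minimal sketch, with assumed shapes and names, of treating object slots as the token sequence of a generative video transformer: each frame contributes K object tokens, a frame-level causal mask lets every token attend to all objects in past frames and the current frame, and the model is trained to predict the object latents of the following frame. The slot encoder and the frame decoder are omitted.

# Minimal sketch: object tokens plus frame-level causal masking (illustrative only).
import torch
import torch.nn as nn

class ObjectVideoTransformer(nn.Module):
    def __init__(self, slot_dim=64, num_slots=4, max_frames=70, d_model=128):
        super().__init__()
        self.proj = nn.Linear(slot_dim, d_model)
        self.frame_emb = nn.Embedding(max_frames, d_model)
        self.obj_emb = nn.Embedding(num_slots, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(d_model, slot_dim)

    def forward(self, slots):                           # slots: (B, T, K, slot_dim)
        B, T, K, _ = slots.shape
        x = self.proj(slots) + self.frame_emb.weight[:T, None, :] + self.obj_emb.weight[None, :K, :]
        x = x.reshape(B, T * K, -1)
        # Mask out attention to objects from future frames.
        frame_id = torch.arange(T).repeat_interleave(K)
        attn_mask = torch.zeros(T * K, T * K).masked_fill(frame_id[None, :] > frame_id[:, None], float('-inf'))
        h = self.transformer(x, mask=attn_mask)
        return self.out(h).reshape(B, T, K, -1)         # targets: the slots of the next frame

pred = ObjectVideoTransformer()(torch.randn(2, 10, 4, 64))
print(pred.shape)                                       # torch.Size([2, 10, 4, 64])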
Abstract:The remarkable recent advances in object-centric generative world models raise a few questions. First, while many of the recent achievements are indispensable for making a general and versatile world model, it is quite unclear how these ingredients can be integrated into a unified framework. Second, despite the use of generative objectives, mainly the abilities for object detection and tracking have been investigated, leaving the crucial ability of temporal imagination largely in question. Third, a few key abilities needed for more faithful temporal imagination, such as multimodal uncertainty and situation-awareness, are missing. In this paper, we introduce Generative Structured World Models (G-SWM). G-SWM achieves versatile world modeling not only by unifying the key properties of previous models in a principled framework but also by providing two crucial new abilities, multimodal uncertainty and situation-awareness. Our thorough investigation of temporal generation ability in comparison to previous models demonstrates that G-SWM achieves this versatility with the best or comparable performance across all experimental settings, including several complex settings that have not been tested before.
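Purely as an illustration of what modeling multimodal uncertainty can look like, and explicitly not the G-SWM design: a mixture density head on a per-object transition model lets an object's next state follow several distinct futures (for example, bouncing left or right) rather than a single averaged one.

# Illustrative mixture-density transition head for a per-object latent state.
import torch
import torch.nn as nn

class MixtureTransition(nn.Module):
    def __init__(self, state_dim=8, num_modes=3, hidden=64):
        super().__init__()
        self.num_modes, self.state_dim = num_modes, state_dim
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_modes * (1 + 2 * state_dim)))

    def forward(self, state):                     # state: (B, state_dim) per-object latent
        out = self.net(state).view(-1, self.num_modes, 1 + 2 * self.state_dim)
        logits = out[..., 0]                      # mixture weights over possible futures
        mean = out[..., 1:1 + self.state_dim]
        std = nn.functional.softplus(out[..., 1 + self.state_dim:]) + 1e-4
        return torch.distributions.MixtureSameFamily(
            torch.distributions.Categorical(logits=logits),
            torch.distributions.Independent(torch.distributions.Normal(mean, std), 1),
        )

next_state_dist = MixtureTransition()(torch.randn(5, 8))
print(next_state_dist.sample().shape)             # torch.Size([5, 8]), one of several plausible futures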
Abstract:The ability to decompose complex multi-object scenes into meaningful abstractions like objects is fundamental to achieving higher-level cognition. Previous approaches to unsupervised object-oriented scene representation learning are based on either spatial-attention or scene-mixture approaches and are limited in scalability, which is a main obstacle to modeling real-world scenes. In this paper, we propose a generative latent variable model, called SPACE, that provides a unified probabilistic modeling framework combining the best of spatial-attention and scene-mixture approaches. SPACE can explicitly provide factorized object representations for foreground objects while also decomposing background segments of complex morphology. Previous models are good at one of these, but not both. SPACE also resolves the scalability problems of previous methods by incorporating parallel spatial attention and is thus applicable to scenes with a large number of objects without performance degradation. We show through experiments on Atari and 3D-Rooms that SPACE achieves the above properties consistently in comparison to SPAIR, IODINE, and GENESIS. Results of our experiments can be found on our project website: https://sites.google.com/view/space-project-page
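A minimal sketch of the parallel spatial-attention idea, with invented layer sizes: a convolutional encoder maps the image to a grid of cells, and every cell predicts its own presence, bounding-box, and appearance latents in parallel rather than attending to objects sequentially, which is what lets the approach scale to scenes with many objects. The background mixture components and the full generative model are omitted.

# Minimal sketch: all grid cells predict their object latents in one parallel pass.
import torch
import torch.nn as nn

class ParallelForegroundEncoder(nn.Module):
    def __init__(self, z_what_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(                      # 128x128 image -> 8x8 grid of cells
            nn.Conv2d(3, 32, 4, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=4), nn.ReLU(),
        )
        # Per cell: 1 presence logit, 4 box parameters (z_where), z_what_dim appearance dims.
        self.heads = nn.Conv2d(64, 1 + 4 + z_what_dim, 1)

    def forward(self, image):                               # image: (B, 3, 128, 128)
        out = self.heads(self.backbone(image))              # (B, 1+4+z_what, 8, 8), computed in parallel
        z_pres = torch.sigmoid(out[:, :1])                  # per-cell object presence
        z_where = torch.tanh(out[:, 1:5])                   # normalized box center and scale
        z_what = out[:, 5:]                                  # per-cell appearance latent
        return z_pres, z_where, z_what

z_pres, z_where, z_what = ParallelForegroundEncoder()(torch.randn(2, 3, 128, 128))
print(z_pres.shape, z_where.shape, z_what.shape)            # (2,1,8,8) (2,4,8,8) (2,32,8,8)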