Benjamin Devillers

Zero-shot cross-modal transfer of Reinforcement Learning policies through a Global Workspace

Mar 07, 2024

Léopold Maytié, Benjamin Devillers, Alexandre Arnold, Rufin VanRullen

Abstract: Humans perceive the world through multiple senses, enabling them to create a comprehensive representation of their surroundings and to generalize information across domains. For instance, when a textual description of a scene is given, humans can mentally visualize it. In fields like robotics and Reinforcement Learning (RL), agents can also access information about the environment through multiple sensors; yet redundancy and complementarity between sensors are difficult to exploit as a source of robustness (e.g. against sensor failure) or generalization (e.g. transfer across domains). Prior research demonstrated that a robust and flexible multimodal representation can be efficiently constructed based on the cognitive science notion of a 'Global Workspace': a unique representation trained to combine information across modalities, and to broadcast its signal back to each modality. Here, we explore whether such a brain-inspired multimodal representation could be advantageous for RL agents. First, we train a 'Global Workspace' to exploit information collected about the environment via two input modalities (a visual input, or an attribute vector representing the state of the agent and/or its environment). Then, we train an RL agent policy using this frozen Global Workspace. In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities, i.e. to apply to image inputs a policy previously trained on attribute vectors (and vice versa), without additional training or fine-tuning. Variants and ablations of the full Global Workspace (including a CLIP-like multimodal representation trained via contrastive learning) did not display the same generalization abilities.

* Under review in a conference
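
The zero-shot transfer idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: the linear "world", the pseudo-inverse encoders (a stand-in for a trained, frozen Global Workspace), the random policy weights (a stand-in for a policy trained on attribute vectors), and every name below are assumptions made for illustration. The only point shown is that a policy which reads workspace vectors alone picks the same actions whichever modality fed the workspace, provided the two encoders are aligned.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STATE, D_ATTR, D_IMG, N_ACTIONS = 4, 6, 16, 3  # hypothetical sizes

# Toy world (assumption): both observations are linear views of one hidden state.
to_attr = rng.normal(size=(D_STATE, D_ATTR))
to_img = rng.normal(size=(D_STATE, D_IMG))

# Stand-in for a frozen workspace: encoders that are aligned by construction,
# because each pseudo-inverse maps its observation back to the hidden state.
encoders = {"attr": np.linalg.pinv(to_attr), "img": np.linalg.pinv(to_img)}

# Stand-in for a policy trained on attribute inputs; it only sees workspace vectors.
policy_weights = rng.normal(size=(D_STATE, N_ACTIONS))


def act(obs, modality):
    """Encode an observation into the workspace, then pick the greedy action."""
    workspace = obs @ encoders[modality]
    return np.argmax(workspace @ policy_weights, axis=-1)


states = rng.normal(size=(5, D_STATE))
actions_from_attr = act(states @ to_attr, "attr")
actions_from_img = act(states @ to_img, "img")  # no retraining of the policy
print(actions_from_attr, actions_from_img)
```

In this toy setting the two action arrays agree because the encoders were built to coincide; in the paper's setting that alignment has to be learned, which is what the workspace training stage is for.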

Semi-supervised Multimodal Representation Learning through a Global Workspace

Jun 27, 2023

Benjamin Devillers, Léopold Maytié, Rufin VanRullen

Abstract: Recent deep learning models can efficiently combine inputs from different modalities (e.g., images and text) and learn to align their latent representations, or to translate signals from one domain to another (as in image captioning, or text-to-image generation). However, current approaches mainly rely on brute-force supervised training over large multimodal datasets. In contrast, humans (and other animals) can learn useful multimodal representations from only sparse experience with matched cross-modal data. Here we evaluate the capabilities of a neural network architecture inspired by the cognitive notion of a "Global Workspace": a shared representation for two (or more) input modalities. Each modality is processed by a specialized system (pretrained on unimodal data, and subsequently frozen). The corresponding latent representations are then encoded to and decoded from a single shared workspace. Importantly, this architecture is amenable to self-supervised training via cycle-consistency: encoding-decoding sequences should approximate the identity function. For various pairings of vision-language modalities and across two datasets of varying complexity, we show that such an architecture can be trained to align and translate between two modalities with very little need for matched data (from 4 to 7 times less than a fully supervised approach). The global workspace representation can be used advantageously for downstream classification tasks and for robust transfer learning. Ablation studies reveal that both the shared workspace and the self-supervised cycle-consistency training are critical to the system's performance.

* Under review
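
The cycle-consistency objective described in the abstract above ("encoding-decoding sequences should approximate the identity function") can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the encoders and decoders are plain linear maps, the latent sizes are arbitrary, and the function names and the unweighted sum of losses are hypothetical. The two self-supervised terms need only unpaired data, while the translation term is the one that consumes matched pairs.

```python
import numpy as np


def translate(x, enc, dec, src, dst):
    """Encode latents of modality `src` into the workspace, then decode to `dst`."""
    return x @ enc[src] @ dec[dst]


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def roundtrip_loss(x, enc, dec, m):
    """Self-supervised: m -> workspace -> m should reproduce x."""
    return mse(translate(x, enc, dec, m, m), x)


def cycle_loss(x, enc, dec, src, via):
    """Self-supervised: src -> via -> src should reproduce x."""
    there = translate(x, enc, dec, src, via)
    return mse(translate(there, enc, dec, via, src), x)


def supervised_translation_loss(x_src, x_dst, enc, dec, src, dst):
    """Supervised: requires matched (x_src, x_dst) pairs."""
    return mse(translate(x_src, enc, dec, src, dst), x_dst)


rng = np.random.default_rng(0)
dims = {"vision": 8, "text": 6}  # hypothetical unimodal latent sizes
d_workspace = 4                  # hypothetical workspace size
enc = {m: 0.1 * rng.normal(size=(d, d_workspace)) for m, d in dims.items()}
dec = {m: 0.1 * rng.normal(size=(d_workspace, d)) for m, d in dims.items()}

unpaired_vision = rng.normal(size=(32, dims["vision"]))  # plentiful
paired_vision = rng.normal(size=(4, dims["vision"]))     # scarce matched data
paired_text = rng.normal(size=(4, dims["text"]))

loss = (
    roundtrip_loss(unpaired_vision, enc, dec, "vision")
    + cycle_loss(unpaired_vision, enc, dec, "vision", "text")
    + supervised_translation_loss(paired_vision, paired_text, enc, dec, "vision", "text")
)
print(f"combined loss: {loss:.4f}")
```

A training loop would minimize this combined loss over the encoder and decoder weights; the relative weighting of the terms is a design choice not specified here.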

Does language help generalization in vision models?

May 15, 2021

Benjamin Devillers, Bhavin Choksi, Romain Bielawski, Rufin VanRullen

Abstract: Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.

* Paper accepted for presentation at the ViGIL 2021 workshop @NAACL. This version: added models to the comparison (ICMLM, TSM); added tests of adversarial robustness; mistake identified and corrected in the normalization of image features; results and conclusions updated accordingly.
