Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pau Rodriguez

Understanding Input Selectivity in Mamba: Impact on Approximation Power, Memorization, and Associative Recall Capacity

Jun 13, 2025

Ningyuan Huang, Miguel Sarabia, Abhinav Moudgil, Pau Rodriguez, Luca Zappella, Federico Danieli

Abstract:State-Space Models (SSMs), and particularly Mamba, have recently emerged as a promising alternative to Transformers. Mamba introduces input selectivity to its SSM layer (S6) and incorporates convolution and gating into its block definition. While these modifications do improve Mamba's performance over its SSM predecessors, it remains largely unclear how Mamba leverages the additional functionalities provided by input selectivity, and how these interact with the other operations in the Mamba architecture. In this work, we demystify the role of input selectivity in Mamba, investigating its impact on function approximation power, long-term memorization, and associative recall capabilities. In particular: (i) we prove that the S6 layer of Mamba can represent projections onto Haar wavelets, providing an edge over its Diagonal SSM (S4D) predecessor in approximating discontinuous functions commonly arising in practice; (ii) we show how the S6 layer can dynamically counteract memory decay; (iii) we provide analytical solutions to the MQAR associative recall task using the Mamba architecture with different mixers -- Mamba, Mamba-2, and S4D. We demonstrate the tightness of our theoretical constructions with empirical results on concrete tasks. Our findings offer a mechanistic understanding of Mamba and reveal opportunities for improvement.

Via

Access Paper or Ask Questions

Controlling Language and Diffusion Models by Transporting Activations

Oct 30, 2024

Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau

Abstract:The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.

Via

Access Paper or Ask Questions

StarVector: Generating Scalable Vector Graphics Code from Images

Dec 17, 2023

Juan A. Rodriguez, Shubham Agarwal, Issam H. Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, Marco Pedersoli

Figure 1 for StarVector: Generating Scalable Vector Graphics Code from Images

Figure 2 for StarVector: Generating Scalable Vector Graphics Code from Images

Figure 3 for StarVector: Generating Scalable Vector Graphics Code from Images

Figure 4 for StarVector: Generating Scalable Vector Graphics Code from Images

Abstract:Scalable Vector Graphics (SVGs) have become integral in modern image rendering applications due to their infinite scalability in resolution, versatile usability, and editing capabilities. SVGs are particularly popular in the fields of web development and graphic design. Existing approaches for SVG modeling using deep learning often struggle with generating complex SVGs and are restricted to simpler ones that require extensive processing and simplification. This paper introduces StarVector, a multimodal SVG generation model that effectively integrates Code Generation Large Language Models (CodeLLMs) and vision models. Our approach utilizes a CLIP image encoder to extract visual representations from pixel-based images, which are then transformed into visual tokens via an adapter module. These visual tokens are pre-pended to the SVG token embeddings, and the sequence is modeled by the StarCoder model using next-token prediction, effectively learning to align the visual and code tokens. This enables StarVector to generate unrestricted SVGs that accurately represent pixel images. To evaluate StarVector's performance, we present SVG-Bench, a comprehensive benchmark for evaluating SVG methods across multiple datasets and relevant metrics. Within this benchmark, we introduce novel datasets including SVG-Stack, a large-scale dataset of real-world SVG examples, and use it to pre-train StarVector as a large foundation model for SVGs. Our results demonstrate significant enhancements in visual quality and complexity handling over current methods, marking a notable advancement in SVG generation technology. Code and models: https://github.com/joanrod/star-vector

Via

Access Paper or Ask Questions

Continual Learning of Diffusion Models with Generative Distillation

Nov 23, 2023

Sergi Masip, Pau Rodriguez, Tinne Tuytelaars, Gido M. van de Ven

Abstract:Diffusion models are powerful generative models that achieve state-of-the-art performance in tasks such as image synthesis. However, training them demands substantial amounts of data and computational resources. Continual learning would allow for incrementally learning new tasks and accumulating knowledge, thus reusing already trained models would be possible. One potentially suitable approach is generative replay, where a copy of a generative model trained on previous tasks produces synthetic data that are interleaved with data from the current task. However, standard generative replay applied to diffusion models results in a catastrophic loss in denoising capabilities. In this paper, we propose generative distillation, an approach that distils the entire reverse process of a diffusion model. We demonstrate that our approach significantly improves the continual learning performance of generative replay with only a moderate increase in the computational costs.

Via

Access Paper or Ask Questions

OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning

Oct 28, 2023

Rim Assouel, Pau Rodriguez, Perouz Taslakian, David Vazquez, Yoshua Bengio

Abstract:A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained for machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes visual generative reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that our modular architectural choices can be used to generate new training tasks that lead to better out-of-distribution generalization. We compare our model to existing and new baselines in proposed visual reasoning benchmark that consists of applying arithmetic operations to MNIST digits.

Via

Access Paper or Ask Questions

Group Robust Classification Without Any Group Information

Oct 28, 2023

Christos Tsirigotis, Joao Monteiro, Pau Rodriguez, David Vazquez, Aaron Courville

Figure 1 for Group Robust Classification Without Any Group Information

Figure 2 for Group Robust Classification Without Any Group Information

Figure 3 for Group Robust Classification Without Any Group Information

Figure 4 for Group Robust Classification Without Any Group Information

Abstract:Empirical risk minimization (ERM) is sensitive to spurious correlations in the training data, which poses a significant risk when deploying systems trained under this paradigm in high-stake applications. While the existing literature focuses on maximizing group-balanced or worst-group accuracy, estimating these accuracies is hindered by costly bias annotations. This study contends that current bias-unsupervised approaches to group robustness continue to rely on group information to achieve optimal performance. Firstly, these methods implicitly assume that all group combinations are represented during training. To illustrate this, we introduce a systematic generalization task on the MPI3D dataset and discover that current algorithms fail to improve the ERM baseline when combinations of observed attribute values are missing. Secondly, bias labels are still crucial for effective model selection, restricting the practicality of these methods in real-world scenarios. To address these limitations, we propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner. We achieve this by employing pretrained self-supervised models to reliably extract bias information, which enables the integration of a logit adjustment training loss with our validation criterion. Our empirical analysis on synthetic and real-world tasks provides evidence that our approach overcomes the identified challenges and consistently enhances robust accuracy, attaining performance which is competitive with or outperforms that of state-of-the-art methods, which, conversely, rely on bias labels for validation.

* Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). Code is available at https://github.com/tsirif/uLA

Via

Access Paper or Ask Questions

FigGen: Text to Scientific Figure Generation

Jun 21, 2023

Juan A. Rodriguez, David Vazquez, Issam Laradji, Marco Pedersoli, Pau Rodriguez

Figure 1 for FigGen: Text to Scientific Figure Generation

Figure 2 for FigGen: Text to Scientific Figure Generation

Figure 3 for FigGen: Text to Scientific Figure Generation

Figure 4 for FigGen: Text to Scientific Figure Generation

Abstract:The generative modeling landscape has experienced tremendous growth in recent years, particularly in generating natural images and art. Recent techniques have shown impressive potential in creating complex visual compositions while delivering impressive realism and quality. However, state-of-the-art methods have been focusing on the narrow domain of natural images, while other distributions remain unexplored. In this paper, we introduce the problem of text-to-figure generation, that is creating scientific figures of papers from text descriptions. We present FigGen, a diffusion-based approach for text-to-figure as well as the main challenges of the proposed task. Code and models are available at https://github.com/joanrod/figure-diffusion

* Published at ICLR 2023 as a Tiny Paper

Via

Access Paper or Ask Questions

GEO-Bench: Toward Foundation Models for Earth Monitoring

Jun 06, 2023

Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Andrew Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin(+7 more)

Figure 1 for GEO-Bench: Toward Foundation Models for Earth Monitoring

Figure 2 for GEO-Bench: Toward Foundation Models for Earth Monitoring

Figure 3 for GEO-Bench: Toward Foundation Models for Earth Monitoring

Figure 4 for GEO-Bench: Toward Foundation Models for Earth Monitoring

Abstract:Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

* arXiv admin note: text overlap with arXiv:2112.00570

Via

Access Paper or Ask Questions

Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments

Feb 10, 2023

Nicolas Gontier, Pau Rodriguez, Issam Laradji, David Vazquez, Christopher Pal

Figure 1 for Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments

Figure 2 for Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments

Figure 3 for Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments

Figure 4 for Long-Context Language Decision Transformers and Exponential Tilt for Interactive Text Environments

Abstract:Text-based game environments are challenging because agents must deal with long sequences of text, execute compositional actions using text and learn from sparse rewards. We address these challenges by proposing Long-Context Language Decision Transformers (LLDTs), a framework that is based on long transformer language models and decision transformers (DTs). LLDTs extend DTs with 3 components: (1) exponential tilt to guide the agent towards high obtainable goals, (2) novel goal conditioning methods yielding significantly better results than the traditional return-to-go (sum of all future rewards), and (3) a model of future observations. Our ablation results show that predicting future observations improves agent performance. To the best of our knowledge, LLDTs are the first to address offline RL with DTs on these challenging games. Our experiments show that LLDTs achieve the highest scores among many different types of agents on some of the most challenging Jericho games, such as Enchanter.

* 12 pages, 5 figures, 3 tables

Via

Access Paper or Ask Questions

3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding

Dec 13, 2022

Lorenzo Pellegrini, Chenchen Zhu, Fanyi Xiao, Zhicheng Yan, Antonio Carta, Matthias De Lange, Vincenzo Lomonaco, Roshan Sumbaly, Pau Rodriguez, David Vazquez

Figure 1 for 3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding

Figure 2 for 3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding

Figure 3 for 3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding

Figure 4 for 3rd Continual Learning Workshop Challenge on Egocentric Category and Instance Level Object Understanding

Abstract:Continual Learning, also known as Lifelong or Incremental Learning, has recently gained renewed interest among the Artificial Intelligence research community. Recent research efforts have quickly led to the design of novel algorithms able to reduce the impact of the catastrophic forgetting phenomenon in deep neural networks. Due to this surge of interest in the field, many competitions have been held in recent years, as they are an excellent opportunity to stimulate research in promising directions. This paper summarizes the ideas, design choices, rules, and results of the challenge held at the 3rd Continual Learning in Computer Vision (CLVision) Workshop at CVPR 2022. The focus of this competition is the complex continual object detection task, which is still underexplored in literature compared to classification tasks. The challenge is based on the challenge version of the novel EgoObjects dataset, a large-scale egocentric object dataset explicitly designed to benchmark continual learning algorithms for egocentric category-/instance-level object understanding, which covers more than 1k unique main objects and 250+ categories in around 100k video frames.

* 21 pages, 12 figures, 5 tables

Via

Access Paper or Ask Questions