Angelos Katharopoulos

Partial Parameter Updates for Efficient Distributed Training

Sep 26, 2025

Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert

Abstract: We introduce a memory- and compute-efficient method for low-communication distributed training. Existing methods reduce communication by performing multiple local updates between infrequent global synchronizations. We demonstrate that their efficiency can be significantly improved by restricting backpropagation: instead of updating all the parameters, each node updates only a fixed subset while keeping the remainder frozen during local steps. This constraint substantially reduces peak memory usage and training FLOPs, while a full forward pass over all parameters eliminates the need for cross-node activation exchange. Experiments on a $1.3$B-parameter language model trained across $32$ nodes show that our method matches the perplexity of prior low-communication approaches under identical token and bandwidth budgets while reducing training FLOPs and peak memory.
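
The idea of updating only a fixed parameter subset per node can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the quadratic loss, the helper names `local_steps` and `synchronize`, and in particular the synchronization rule (plain parameter averaging), which the abstract does not specify.

```python
def local_steps(params, owned, target, lr=0.1, steps=5):
    """Run local gradient steps on the toy loss 0.5 * sum((p - t)^2),
    updating only the indices in `owned`; all other entries stay frozen."""
    params = list(params)  # each node works on its own copy
    for _ in range(steps):
        for i in owned:
            grad_i = params[i] - target[i]  # gradient of the toy loss w.r.t. params[i]
            params[i] -= lr * grad_i
    return params


def synchronize(replicas):
    """Hypothetical global sync: element-wise average of all node replicas."""
    n = len(replicas)
    return [sum(vals) / n for vals in zip(*replicas)]


init = [0.0, 0.0, 0.0, 0.0]
target = [1.0, 1.0, 1.0, 1.0]
subsets = [[0, 1], [2, 3]]  # fixed, disjoint subset per node (an assumption)

replicas = [local_steps(init, owned, target) for owned in subsets]
merged = synchronize(replicas)
```

In this sketch each node only computes gradients for the entries it owns, which is where the memory and compute savings would come from; the frozen entries are left exactly as they were at the last synchronization.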

Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Feb 03, 2025

Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier

Abstract: Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
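
A minimal sketch of the instantiation step described above, assuming flattened parameter vectors and a hypothetical affine map from domain weights to combination coefficients. The abstract only says the coefficients are a learned function of the domain weights, so the form of that function and the names `combination_coefficients` and `instantiate` are illustrative assumptions.

```python
def combination_coefficients(domain_weights, W, b):
    """Hypothetical coefficient function: an affine map of the domain weights.
    W has one row per expert, b one bias per expert."""
    return [
        sum(w * h for w, h in zip(row, domain_weights)) + bias
        for row, bias in zip(W, b)
    ]


def instantiate(experts, coeffs):
    """Linearly combine a bank of expert parameter vectors into one model."""
    n_params = len(experts[0])
    return [
        sum(c * expert[i] for c, expert in zip(coeffs, experts))
        for i in range(n_params)
    ]


# Two experts with three parameters each (toy values).
experts = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity map, for illustration only
b = [0.0, 0.0]

coeffs = combination_coefficients([0.25, 0.75], W, b)
model_params = instantiate(experts, coeffs)
```

The instantiated model has the size of a single expert regardless of how many experts are in the bank, which is consistent with the model size constraint mentioned in the abstract.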

No Need to Talk: Asynchronous Mixture of Language Models

Oct 04, 2024

Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert

Abstract: We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on $75\%$ of the tasks.

* 23 pages

Specialized Language Models with Cheap Inference from Limited Domain Data

Feb 02, 2024

David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

Abstract: Large language models have emerged as a versatile tool but are challenging to apply to tasks lacking large inference budgets and large in-domain training sets. This work formalizes these constraints and distinguishes four important variables: the pretraining budget (for training before the target domain is known), the specialization budget (for training after the target domain is known), the inference budget, and the in-domain training set size. Across these settings, we compare different approaches from the machine learning literature. Limited by inference cost, we find better alternatives to the standard practice of training very large vanilla transformer models. In particular, we show that hyper-networks and mixture of experts have better perplexity for large pretraining budgets, while small models trained on importance sampled datasets are attractive for large specialization budgets.

Controllable Music Production with Diffusion Models and Guidance Gradients

Nov 01, 2023

Mark Levy, Bruno Di Giorgi, Floris Weers, Angelos Katharopoulos, Tom Nickson

Abstract: We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that supports both reconstruction and classification losses, or any combination of the two. This approach ensures that generated audio can match its surrounding context, or conform to a class distribution or latent representation specified relative to any suitable pre-trained classifier or embedding model.

Self Supervision Does Not Help Natural Language Supervision at Scale

Jan 20, 2023

Floris Weers, Vaishaal Shankar, Angelos Katharopoulos, Yinfei Yang, Tom Gunter

Abstract: Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but notably their results use small pre-training datasets (<50M samples) and do not reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks

Mar 18, 2021

Despoina Paschalidou, Angelos Katharopoulos, Andreas Geiger, Sanja Fidler

Abstract: Impressive progress in 3D shape extraction led to representations that can capture object geometries with high fidelity. In parallel, primitive-based methods seek to represent objects as semantically consistent part arrangements. However, due to the simplicity of existing primitive representations, these methods fail to accurately reconstruct 3D shapes using a small number of primitives/parts. We address the trade-off between reconstruction quality and number of parts with Neural Parts, a novel 3D primitive representation that defines primitives using an Invertible Neural Network (INN) which implements homeomorphic mappings between a sphere and the target object. The INN allows us to compute the inverse mapping of the homeomorphism, which in turn, enables the efficient computation of both the implicit surface function of a primitive and its mesh, without any additional post-processing. Our model learns to parse 3D objects into semantically consistent part arrangements without any part-level supervision. Evaluations on ShapeNet, D-FAUST and FreiHAND demonstrate that our primitives can capture complex geometries and thus simultaneously achieve geometrically accurate as well as interpretable reconstructions using an order of magnitude fewer primitives than state-of-the-art shape abstraction methods.

* To appear in CVPR 2021

Fast Transformers with Clustered Attention

Jul 09, 2020

Apoorv Vyas, Angelos Katharopoulos, François Fleuret

Abstract: Transformers have proven to be a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
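
The basic centroid step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the cluster assignment is taken as a given input (the clustering algorithm is not covered here), and the refinement that recomputes exact dot products for the highest-attention keys is omitted. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(q, K, V):
    """Standard scaled dot-product attention for a single query."""
    scale = 1.0 / math.sqrt(len(q))
    w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
    return [sum(wj * v[d] for wj, v in zip(w, V)) for d in range(len(V[0]))]


def clustered_attention(Q, K, V, assignment):
    """Compute attention once per cluster centroid and reuse the result
    for every query assigned to that cluster."""
    clusters = {}
    for i, c in enumerate(assignment):
        clusters.setdefault(c, []).append(i)
    out = [None] * len(Q)
    for members in clusters.values():
        centroid = [
            sum(Q[i][d] for i in members) / len(members)
            for d in range(len(Q[0]))
        ]
        result = attend(centroid, K, V)  # one attention computation per cluster
        for i in members:
            out[i] = result
    return out


Q = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
approx = clustered_attention(Q, K, V, assignment=[0, 0, 1])
```

With C clusters the attention cost scales with C times the number of keys rather than with the number of queries times the number of keys. When every query is its own cluster, the sketch reduces to ordinary attention.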

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Jun 30, 2020

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Abstract: Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
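
The equivalence between the quadratic and the recurrent formulation can be checked with a small pure-Python sketch. The feature map `phi(x) = elu(x) + 1` is an assumption here (the abstract only says "kernel feature maps"), and the function names are illustrative. The recurrent version keeps a running sum `S` of `phi(k) v^T` and a running normalizer `z`, so each step costs the same regardless of position.

```python
import math


def phi(x):
    # elu(x) + 1 applied element-wise: x + 1 for x > 0, exp(x) otherwise
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention_quadratic(Q, K, V):
    """Reference: for each position i, weight all j <= i by phi(q_i) . phi(k_j)."""
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        w = [dot(fq, phi(K[j])) for j in range(i + 1)]
        norm = sum(w)
        out.append([sum(wj * V[j][d] for j, wj in enumerate(w)) / norm
                    for d in range(len(V[0]))])
    return out


def causal_attention_recurrent(Q, K, V):
    """Same result with constant-size state, processed one token at a time."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) v^T
    z = [0.0] * dk                       # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(dk)) / norm
                    for b in range(dv)])
    return out


Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 0.5], [0.2, -1.2]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ref = causal_attention_quadratic(Q, K, V)
rec = causal_attention_recurrent(Q, K, V)
```

The two functions agree up to floating-point error because the products are only regrouped: the sum over past keys is accumulated once in `S` and `z` instead of being recomputed for every query.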

* ICML 2020, project at https://linear-transformers.com/

Processing Megapixel Images with Deep Attention-Sampling Models

May 03, 2019

Angelos Katharopoulos, François Fleuret

Abstract: Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it reduces computation and memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show the consistency of the sampling that indeed focuses on informative parts of the input images.
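
The unbiasedness claim rests on a standard Monte Carlo identity: if locations are drawn from the attention distribution, the plain average of the sampled features has the attention-weighted sum as its expectation. The sketch below illustrates this with scalar features and sampling with replacement; both are simplifying assumptions, and the function names are hypothetical.

```python
import random


def exact_attention_pool(attention, features):
    """Attention-weighted sum over all locations (requires every feature)."""
    return sum(a * f for a, f in zip(attention, features))


def sampled_attention_pool(attention, features, n_samples, rng):
    """Estimate the same quantity from locations sampled with replacement
    from the attention distribution; only sampled features are touched."""
    idx = rng.choices(range(len(features)), weights=attention, k=n_samples)
    return sum(features[i] for i in idx) / n_samples


attention = [0.1, 0.2, 0.3, 0.4]  # toy attention distribution over 4 locations
features = [1.0, 2.0, 3.0, 4.0]   # toy per-location features

rng = random.Random(0)
exact = exact_attention_pool(attention, features)
estimate = sampled_attention_pool(attention, features, n_samples=20000, rng=rng)
```

In the setting described by the abstract, the expensive part is computing each feature from a full-resolution patch, so touching only the sampled locations is what would save computation and memory.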

* To appear in ICML 2019
