Paul Caillon

Miles Team, LAMSADE, Université Paris Dauphine - PSL, Paris, France

Forward Only Learning for Orthogonal Neural Networks of any Depth

Dec 19, 2025

Paul Caillon, Alex Colagrande, Erwan Fagnou, Blaise Delattre, Alexandre Allauzen

Abstract: Backpropagation is still the de facto algorithm used to train neural networks. With the exponential growth of recent architectures, its computational cost has become a burden. The recent PEPITA and forward-only frameworks offer promising alternatives, but they fail to scale beyond a handful of hidden layers, which limits their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under the linear and orthogonal assumptions. By relaxing the linear assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with the backpropagation algorithm. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without the need for a backward pass. Moreover, its performance on convolutional networks opens up avenues for its application to more advanced architectures. The code is open-sourced at https://github.com/p0lcAi/FOTON.

* ECAI 2025

Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics

Jul 03, 2025

Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen

Abstract: Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.

* Accepted at ECLR Workshop at ICCV 2025
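
To make the quadratic cost mentioned in the abstract concrete, the sketch below counts the entries of a single dense attention map when every grid point is treated as one token. The function name, the example resolutions, and the 4-byte (float32) entry size are illustrative assumptions, not figures from the paper.

```python
def dense_attention_cost(height, width, bytes_per_entry=4):
    """Return (tokens, matrix entries, bytes) for one dense attention map.

    Assumes one token per grid point and float32 entries (both assumptions).
    """
    tokens = height * width
    entries = tokens * tokens  # every token attends to every token
    return tokens, entries, entries * bytes_per_entry


for side in (32, 64, 256):
    tokens, entries, size = dense_attention_cost(side, side)
    print(f"{side}x{side}: {tokens} tokens, {entries} entries, {size / 2**30:.3f} GiB")
```

Doubling the side length multiplies the token count by 4 and the dense attention map by 16, which is the growth that a linear-complexity mechanism avoids.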

Bridging the Theoretical Gap in Randomized Smoothing

Apr 03, 2025

Blaise Delattre, Paul Caillon, Quentin Barthélemy, Erwan Fagnou, Alexandre Allauzen

Abstract: Randomized smoothing has become a leading approach for certifying adversarial robustness in machine learning models. However, a persistent gap remains between theoretical certified robustness and empirical robustness accuracy. This paper introduces a new framework that bridges this gap by leveraging Lipschitz continuity for certification and proposing a novel, less conservative method for computing confidence intervals in randomized smoothing. Our approach tightens the bounds of certified robustness, offering a more accurate reflection of model robustness in practice. Through rigorous experimentation we show that our method improves the robust accuracy, compressing the gap between empirical findings and previous theoretical results. We argue that investigating local Lipschitz constants and designing ad-hoc confidence intervals can further enhance the performance of randomized smoothing. These results pave the way for a deeper understanding of the relationship between Lipschitz continuity and certified robustness.
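
For context, the sketch below computes the widely used Gaussian randomized-smoothing radius, R = sigma * Phi^{-1}(p_lower), from a lower confidence bound p_lower on the top-class probability. This is the standard baseline certificate, not the method proposed in this paper; the function name and the example values are assumptions chosen for illustration. It shows why the confidence interval matters: a more conservative lower bound yields a smaller certified radius.

```python
from statistics import NormalDist


def certified_radius(p_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_lower); 0.0 when the bound is not above 1/2."""
    if not 0.0 < p_lower < 1.0:
        raise ValueError("p_lower must lie strictly between 0 and 1")
    if p_lower <= 0.5:
        return 0.0  # no certificate: the top class is not guaranteed a majority
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical lower bounds for the same smoothed classifier (illustrative values).
for p_lower in (0.90, 0.95, 0.99):
    print(p_lower, round(certified_radius(p_lower, sigma=0.5), 3))
```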

Fast Training of Recurrent Neural Networks with Stationary State Feedbacks

Mar 29, 2025

Paul Caillon, Erwan Fagnou, Alexandre Allauzen

Abstract: Recurrent neural networks (RNNs) have recently demonstrated strong performance and faster inference than Transformers at comparable parameter budgets. However, the recursive gradient computation of the backpropagation through time (BPTT) algorithm remains the major computational bottleneck. In this work, we propose a novel method that replaces BPTT with a fixed gradient feedback mechanism, yielding an efficient approximation of the exact gradient propagation based on the assumption of time stationarity. Our approach leverages state-space model (SSM) principles to define a structured feedback matrix that directly propagates gradients from future time steps. This formulation bypasses the need for recursive gradient backpropagation, significantly reducing training overhead while preserving the network's ability to capture long-term dependencies. Experiments on language modeling benchmarks show competitive perplexity scores at significantly reduced training cost. These promising results suggest that designing a feedback method like an SSM can fully exploit the efficiency advantages of RNNs for many practical applications.

* 18 pages (including additional contents), 3 figures, 5 tables, code available at https://github.com/p0lcAi/DSF

Accelerated Training through Iterative Gradient Propagation Along the Residual Path

Jan 28, 2025

Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

Abstract: Despite being the cornerstone of deep learning, backpropagation is criticized for its inherent sequentiality, which can limit the scalability of very deep models. Such models faced convergence issues due to vanishing gradients, later resolved using residual connections. Variants of these are now widely used in modern architectures. However, the computational cost of backpropagation remains a major burden, accounting for most of the training time. Taking advantage of residual-like architectural designs, we introduce Highway backpropagation (Highway-BP), a parallelizable iterative algorithm that approximates backpropagation by alternately (i) accumulating the gradient estimates along the residual path, and (ii) backpropagating them through every layer in parallel. This algorithm is naturally derived from a decomposition of the gradient as the sum of gradients flowing through all paths and is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks. Through an extensive empirical study on a large selection of tasks and models, we evaluate Highway-BP and show that major speedups can be achieved with minimal performance degradation.

* 20 pages, 6 figures, accepted to ICLR 2025
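
The "sum of gradients flowing through all paths" can be checked numerically on a toy case. For residual blocks of the form x_{l+1} = x_l + f_l(x_l), the end-to-end Jacobian is the product of the factors (I + J_l), and expanding that product gives one term per subset of blocks that a signal passes through, with the remaining blocks skipped via the identity. The sketch below verifies this identity only; it is not the Highway-BP algorithm. The random 3x3 matrices standing in for layer Jacobians and all function names are assumptions made for illustration.

```python
import itertools
import random


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matadd(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def product_form(jacobians):
    """(I + J_{n-1}) ... (I + J_0): Jacobian of a stack of residual blocks."""
    d = len(jacobians[0])
    total = identity(d)
    for jac in jacobians:
        total = matmul(matadd(identity(d), jac), total)
    return total


def path_sum(jacobians):
    """Sum one term per path, i.e. per subset of blocks that is not skipped."""
    d = len(jacobians[0])
    total = [[0.0] * d for _ in range(d)]
    for k in range(len(jacobians) + 1):
        for path in itertools.combinations(range(len(jacobians)), k):
            term = identity(d)
            for layer in path:  # blocks visited in increasing depth order
                term = matmul(jacobians[layer], term)
            total = matadd(total, term)
    return total


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


rng = random.Random(0)  # fixed seed so the toy example is reproducible
jacobians = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
             for _ in range(4)]
print(max_abs_diff(product_form(jacobians), path_sum(jacobians)))
```

The number of paths doubles with every added block (2^n terms for n blocks), so the explicit sum is useful only as a check on small stacks.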

Chain and Causal Attention for Efficient Entity Tracking

Oct 07, 2024

Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

Abstract: This paper investigates the limitations of transformers for entity-tracking tasks in large language models. We identify a theoretical constraint, showing that transformers require at least $\log_2 (n+1)$ layers to handle entity tracking with $n$ state changes. To address this issue, we propose an efficient and frugal enhancement to the standard attention mechanism, enabling it to manage long-term dependencies more efficiently. By considering attention as an adjacency matrix, our model can track entity states with a single layer. Empirical results demonstrate significant improvements in entity tracking datasets while keeping competitive performance on standard natural language modeling. Our modified attention allows us to achieve the same performance with drastically fewer layers. Additionally, our enhanced mechanism reveals structured internal representations of attention. Extensive experiments on both toy and complex datasets validate our approach. Our contributions include theoretical insights, an improved attention mechanism, and empirical validation.

* 15 pages, 5 figures, EMNLP 2024 Main
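
The $\log_2 (n+1)$ bound stated in the abstract is easy to evaluate for concrete values of $n$. The sketch below returns the smallest whole number of layers that satisfies it; rounding up to an integer and the function name are assumptions made here for illustration, since the abstract states only the real-valued bound.

```python
def min_layers_lower_bound(n_state_changes):
    """Smallest integer L with L >= log2(n + 1), for a non-negative integer n."""
    if n_state_changes < 0:
        raise ValueError("n_state_changes must be non-negative")
    # For non-negative integers, n.bit_length() equals ceil(log2(n + 1)),
    # which avoids floating-point rounding near powers of two.
    return n_state_changes.bit_length()


for n in (1, 3, 7, 100, 1000):
    print(n, min_layers_lower_bound(n))
```

For example, 7 state changes need at least 3 layers under this bound, and 1000 state changes need at least 10.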
