Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Léo Dana

SIERRA

Convergence of Shallow ReLU Networks on Weakly Interacting Data

Feb 24, 2025

Léo Dana, Francis Bach, Loucas Pillaud-Vivien

Figure 1 for Convergence of Shallow ReLU Networks on Weakly Interacting Data

Figure 2 for Convergence of Shallow ReLU Networks on Weakly Interacting Data

Figure 3 for Convergence of Shallow ReLU Networks on Weakly Interacting Data

Figure 4 for Convergence of Shallow ReLU Networks on Weakly Interacting Data

Abstract:We analyse the convergence of one-hidden-layer ReLU networks trained by gradient flow on $n$ data points. Our main contribution leverages the high dimensionality of the ambient space, which implies low correlation of the input samples, to demonstrate that a network with width of order $\log(n)$ neurons suffices for global convergence with high probability. Our analysis uses a Polyak-{\L}ojasiewicz viewpoint along the gradient-flow trajectory, which provides an exponential rate of convergence of $\frac{1}{n}$. When the data are exactly orthogonal, we give further refined characterizations of the convergence speed, proving its asymptotic behavior lies between the orders $\frac{1}{n}$ and $\frac{1}{\sqrt{n}}$, and exhibiting a phase-transition phenomenon in the convergence rate, during which it evolves from the lower bound to the upper, and in a relative time of order $\frac{1}{\log(n)}$.

Via

Access Paper or Ask Questions

Memorization in Attention-only Transformers

Nov 15, 2024

Léo Dana, Muni Sreenivas Pydi, Yann Chevaleyre

Figure 1 for Memorization in Attention-only Transformers

Figure 2 for Memorization in Attention-only Transformers

Figure 3 for Memorization in Attention-only Transformers

Figure 4 for Memorization in Attention-only Transformers

Abstract:Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the current hypothesis to any context size. Our approach improves upon the state-of-the-art by achieving more effective exact memorization with an attention layer, while also introducing the concept of approximate memorization of distributions. Through experimental validation, we demonstrate that our proposed bounds more accurately reflect the true memorization capacity of language models, and provide a precise comparison with prior work.

* 16 pages, 6 figures, submitted to AISTATS 2025,

Via

Access Paper or Ask Questions