Tomer Schlank

Looking Beyond The Top-1: Transformers Determine Top Tokens In Order

Oct 26, 2024

Daria Lioubashevski, Tomer Schlank, Gabriel Stanovsky, Ariel Goldstein

Abstract: Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the "saturation event". We extend the concept of saturation events to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens' ranking, i.e., the model first decides on the top-ranking token, then the second-highest-ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this, we show that it is possible to predict the current task from the hidden-layer embedding. Furthermore, using an intervention method, we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.
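
As a rough illustration of the top-k saturation idea described in the abstract, the sketch below computes, for each rank k, the earliest layer from which the k-th ranked token stays equal to the k-th ranked token at the final layer. It assumes that per-layer logits are already available (for example by projecting each layer's hidden state through the output head); the function name `saturation_layers`, the parameter `k_max`, and the toy values are illustrative assumptions, not taken from the paper, and the paper's exact definition may differ.

```python
import numpy as np

def saturation_layers(layer_logits, k_max=3):
    """Return, for each rank k < k_max, the earliest layer index from which
    the k-th ranked token never changes again.

    layer_logits: array of shape (num_layers, vocab_size). How these logits
    are obtained from hidden states is an assumption left to the caller.
    """
    layer_logits = np.asarray(layer_logits)
    num_layers = layer_logits.shape[0]
    # Token ids sorted by descending logit at every layer, truncated to k_max.
    order = np.argsort(-layer_logits, axis=1, kind="stable")[:, :k_max]
    final = order[-1]
    result = []
    for k in range(order.shape[1]):
        matches = order[:, k] == final[k]
        # Walk backwards from the last layer while the rank-k token agrees.
        layer = num_layers - 1
        while layer > 0 and matches[layer - 1]:
            layer -= 1
        result.append(layer)
    return result

# Hypothetical toy logits: 4 layers, vocabulary of 3 tokens.
toy = [
    [0.1, 0.5, 0.4],
    [0.6, 0.3, 0.1],
    [0.6, 0.1, 0.3],
    [0.7, 0.1, 0.2],
]
print(saturation_layers(toy, k_max=2))  # [1, 2]
```

In this toy input the top-1 token is fixed from layer 1 and the top-2 token from layer 2, which matches the ordering the abstract reports; real models would need to be checked empirically.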

A Note on the Entropy/Influence Conjecture

May 13, 2011

Nathan Keller, Elchanan Mossel, Tomer Schlank

Abstract: The entropy/influence conjecture, raised by Friedgut and Kalai in 1996, seeks to relate two different measures of concentration of the Fourier coefficients of a Boolean function. Roughly speaking, it claims that if the Fourier spectrum is "smeared out", then the Fourier coefficients are concentrated on "high" levels. In this note we generalize the conjecture to biased product measures on the discrete cube, and prove a variant of the conjecture for functions with an extremely low Fourier weight on the "high" levels.
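
For orientation, the conjecture in its usual uniform-measure form can be written as follows. This is the standard statement as commonly given in the literature, not a formula quoted from the note itself; the symbols $f$, $\hat f(S)$, and $C$ are the conventional notation and are assumed here, and the biased-measure generalization in the note is not reproduced.

```latex
% f : \{-1,1\}^n \to \{-1,1\}, with Fourier expansion f = \sum_{S \subseteq [n]} \hat f(S)\, \chi_S
% Spectral entropy (how "smeared out" the spectrum is):
\[
  \mathrm{Ent}(f) \;=\; \sum_{S \subseteq [n]} \hat f(S)^2 \,\log_2 \frac{1}{\hat f(S)^2}
\]
% Total influence (how much weight sits on "high" levels):
\[
  I(f) \;=\; \sum_{S \subseteq [n]} |S| \,\hat f(S)^2
\]
% Conjecture: there is a universal constant C > 0 such that
\[
  \mathrm{Ent}(f) \;\le\; C \cdot I(f).
\]
```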

* 12 pages
