Title: StableMask: Refining Causal Masking in Decoder-only Transformer

Feb 07, 2024

Qingyu Yin, Xuzheng He, Xiang Zhuang, Yu Zhao, Jianhua Yao, Xiaoyu Shen, Qiang Zhang

Abstract: The decoder-only Transformer architecture with causal masking and relative position encoding (RPE) has become the de facto choice in language modeling. Despite its exceptional performance across various tasks, we have identified two limitations: First, it requires all attention scores to be non-zero and sum to 1, even if the current embedding has sufficient self-contained information. This compels the model to assign disproportionately excessive attention to specific tokens. Second, RPE-based Transformers are not universal approximators due to their limited capacity for encoding absolute positional information, which limits their application in position-critical tasks. In this work, we propose StableMask: a parameter-free method to address both limitations by refining the causal mask. It introduces pseudo-attention values to balance attention distributions and encodes absolute positional information via a progressively decreasing mask ratio. StableMask's effectiveness is validated both theoretically and empirically, showing significant enhancements in language models with parameter sizes ranging from 71M to 1.4B across diverse datasets and encoding methods. We further show that it naturally supports (1) efficient extrapolation without special tricks such as StreamingLLM and (2) easy integration with existing attention optimization techniques.
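The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names and the constant `pseudo_value` are assumptions made for illustration, and the paper's actual construction of pseudo-attention values may differ. The sketch contrasts standard causal masking, where each row of attention weights must sum to 1, with a variant in which masked (future) positions receive a finite pseudo score before the softmax and are zeroed afterwards, so the weight on real tokens can fall below 1.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats; -inf entries map to 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(scores):
    # Standard causal masking: future positions are set to -inf before the
    # softmax, so every row of weights sums to exactly 1.
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        out.append(softmax(row))
    return out

def pseudo_masked_attention(scores, pseudo_value=0.0):
    # Illustrative assumption: future positions get a finite pseudo score
    # (a constant here) instead of -inf. They absorb part of the softmax
    # mass and are zeroed afterwards, so no information leaks from the future.
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else pseudo_value for j in range(n)]
        probs = softmax(row)
        out.append([p if j <= i else 0.0 for j, p in enumerate(probs)])
    return out

if __name__ == "__main__":
    scores = [[0.0] * 4 for _ in range(4)]  # uniform toy scores
    print([sum(r) for r in causal_attention(scores)])
    print([sum(r) for r in pseudo_masked_attention(scores)])
```

With uniform toy scores, the pseudo-masked variant gives row `i` a real-attention mass of `(i + 1) / n`. The share of pseudo positions shrinks as the position index grows, which is one simple way a decreasing mask ratio could carry absolute positional information.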

* Preprint
