Jusen Du

Native Hybrid Attention for Efficient Sequence Modeling

Oct 08, 2025

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng

Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
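The single-softmax idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `joint_attention`, the toy vectors, and the choice to pass the slots in as fixed inputs are all assumptions made for illustration, and the linear-RNN slot update is deliberately left out. It only shows one query attending over long-term slots and the last `window` tokens with one softmax, so that `window=0` uses the slots alone.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_attention(q, slots, history, window):
    """Attend with ONE softmax over long-term slots plus the last `window` tokens.

    `slots` and `history` are lists of (key, value) pairs. How the slots are
    updated by the linear RNN is not modeled here (hypothetical fixed inputs).
    """
    # Guard needed because history[-0:] would return the whole list.
    recent = history[-window:] if window > 0 else []
    pairs = slots + recent
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in pairs]
    weights = softmax(scores)
    dim = len(pairs[0][1])
    output = [sum(w * v[i] for w, (_, v) in zip(weights, pairs)) for i in range(dim)]
    return output, weights

# Toy data (illustrative values only).
slots = [([1.0, 0.0], [1.0, 1.0]), ([0.0, 1.0], [2.0, 0.0])]
history = [([0.5, 0.5], [0.0, 3.0]), ([1.0, 1.0], [4.0, 4.0]), ([0.0, 2.0], [5.0, 1.0])]
q = [1.0, 0.0]

out_slots_only, w0 = joint_attention(q, slots, history, window=0)
out_hybrid, w2 = joint_attention(q, slots, history, window=2)
print(len(w0), len(w2))  # 2 slots vs. 2 slots + 2 window tokens
```

Because slots and window tokens share one normalization, their relative weighting depends on the query itself, which is consistent with the abstract's point that no separate fusion parameters are needed.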

* Technical report, 16 pages

Liger: Linearizing Large Language Models to Gated Recurrent Structures

Mar 03, 2025

Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng

Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the performance of the Transformer-based LLM using 0.02% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
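The idea of deriving a gate from the existing key projection can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the specific gate (a sigmoid of the mean of the key vector) is a stand-in chosen for simplicity, and the names `gated_linear_recurrence`, `W_q`, `W_k`, and `W_v` are hypothetical. The sketch only shows that a gated recurrence can run on a fixed-size state without any parameters beyond the three projections.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_linear_recurrence(xs, W_q, W_k, W_v):
    """Run S_t = g_t * S_{t-1} + k_t v_t^T and read o_t = q_t^T S_t.

    The state S has a fixed d_k x d_v size, so memory does not grow with
    sequence length.
    """
    d_k, d_v = len(W_k), len(W_v)
    S = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for x in xs:
        q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)
        # ASSUMED gate: sigmoid of the mean key activation. It reuses the
        # key projection, so no new parameters are introduced.
        g = 1.0 / (1.0 + math.exp(-sum(k) / d_k))
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = g * S[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs

# Toy usage with identity projections (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs = gated_linear_recurrence([[1.0, 0.0], [1.0, 0.0]], identity, identity, identity)
print(outputs)
```

In this toy run the second output's first component equals 1 + g with g = sigmoid(0.5), which shows the earlier write being decayed by the gate before the new one is added.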

* Technical report, 13 pages

MoM: Linear Sequence Modeling with Mixture-of-Memories

Feb 19, 2025

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng

Abstract: Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain linear complexity during training and constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
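The routing of tokens to separate memory states can be sketched in a few lines. This is a simplified illustration, not the released implementation: top-1 routing, the plain additive outer-product update, and the names `route_top_k` and `mom_step` are assumptions, and the router scores are supplied by hand instead of coming from a learned router network.

```python
import math

def route_top_k(scores, k):
    # Pick the k highest-scoring memories and softmax-normalize their scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def mom_step(memories, router_scores, q, k, v, top_k=1):
    """Write (k, v) into the routed memories only, then read with query q."""
    gates = route_top_k(router_scores, top_k)
    d_k, d_v = len(k), len(v)
    mixed = [[0.0] * d_v for _ in range(d_k)]
    for idx, gate in gates.items():
        M = memories[idx]
        for i in range(d_k):
            for j in range(d_v):
                M[i][j] += k[i] * v[j]        # assumed additive linear-attention update
                mixed[i][j] += gate * M[i][j]  # gate-weighted mix of the routed memories
    out = [sum(q[i] * mixed[i][j] for i in range(d_k)) for j in range(d_v)]
    return out, sorted(gates)

# Two memories, each a 2x2 state (illustrative values only).
memories = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
key = [1.0, 0.0]
out_a, used_a = mom_step(memories, [2.0, 0.1], q=key, k=key, v=[1.0, 2.0])
out_b, used_b = mom_step(memories, [0.0, 3.0], q=key, k=key, v=[5.0, 5.0])
print(used_a, out_a)  # [0] [1.0, 2.0]
print(used_b, out_b)  # [1] [5.0, 5.0]
```

Both tokens use the same key, yet the second write does not overwrite or blend with the first because the router sends them to different memories. In a single shared state the two values would be summed under that key, which is the kind of interference the abstract describes.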

* Technical report, 14 pages
