Sequence Parallelism


Sequence parallelism is a memory-efficient parallelism method that helps break the input-sequence-length limitation and train efficiently with longer sequences on GPUs. It extends tensor-level model parallelism by distributing the computational load and activation memory across multiple GPUs along the sequence dimension of transformer layers. This is particularly useful for the portions of a layer that tensor parallelism does not cover, such as the LayerNorm and dropout regions, reducing activation memory and improving overall training efficiency.
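To make the idea concrete, below is a minimal PyTorch sketch, assuming a torch.distributed process group launched with torchrun; the helper functions scatter_along_seq and gather_along_seq are illustrative names, not part of any particular framework. Each rank keeps only its slice of the sequence for the LayerNorm/dropout region and all-gathers the full sequence before handing activations to the attention block.

```python
# Minimal sketch of sequence parallelism: each rank holds a 1/world_size slice
# of the sequence dimension through the LayerNorm/dropout region, then
# all-gathers the full sequence before the (typically tensor-parallel) attention.
# Helper names and shapes are illustrative, not from any specific library.
# Launch with: torchrun --nproc_per_node=2 sequence_parallel_sketch.py
import torch
import torch.distributed as dist
import torch.nn.functional as F


def scatter_along_seq(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Keep only this rank's contiguous slice of the sequence dimension."""
    # x: [seq_len, batch, hidden]; assumes seq_len is divisible by world_size.
    return x.chunk(world_size, dim=0)[rank].contiguous()


def gather_along_seq(x_local: torch.Tensor, world_size: int) -> torch.Tensor:
    """All-gather the sequence slices so every rank sees the full sequence again."""
    chunks = [torch.empty_like(x_local) for _ in range(world_size)]
    dist.all_gather(chunks, x_local)
    return torch.cat(chunks, dim=0)


def main():
    dist.init_process_group(backend="gloo")  # use "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    seq_len, batch, hidden = 8, 2, 16
    torch.manual_seed(0)                     # same full input on every rank
    x = torch.randn(seq_len, batch, hidden)

    # Sequence-parallel region: LayerNorm and dropout operate only on the local
    # sequence slice, so their activations are 1/world_size of the full size.
    x_local = scatter_along_seq(x, rank, world)
    norm = torch.nn.LayerNorm(hidden)
    y_local = F.dropout(norm(x_local), p=0.1, training=True)

    # Attention needs the whole sequence, so gather the slices back first.
    y_full = gather_along_seq(y_local, world)

    if rank == 0:
        print("local slice:", tuple(y_local.shape), "-> gathered:", tuple(y_full.shape))

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In production implementations such as Megatron-LM, the scatter and gather are autograd-aware collectives: the backward of the all-gather is a reduce-scatter, so gradients stay partitioned along the sequence dimension and the memory savings carry through the backward pass as well.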

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States
Feb 03, 2026

Sequential Group Composition: A Window into the Mechanics of Deep Learning
Feb 03, 2026

P-EAGLE: Parallel-Drafting EAGLE with Scalable Training
Feb 01, 2026

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
Feb 02, 2026

A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis
Feb 02, 2026

CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling
Feb 02, 2026

Parallel Training in Spiking Neural Networks
Feb 01, 2026

Scalable Generative Game Engine: Breaking the Resolution Wall via Hardware-Algorithm Co-Design
Jan 31, 2026

TRACE: Scalable Amortized Causal Discovery from Single Sequences via Autoregressive Density Estimation
Feb 01, 2026

Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Feb 02, 2026