Sequence Parallelism


Sequence parallelism is a memory-efficient parallelism method that helps break the input sequence-length limitation and train efficiently with longer sequences on GPUs. It extends tensor-level model parallelism by distributing the compute load and activation memory across multiple GPUs along the sequence dimension of transformer layers. This is particularly useful for the portions of a layer that were previously not parallelized, improving overall model performance and efficiency.
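
The sketch below illustrates the core idea on a single process, assuming PyTorch: activations are partitioned along the sequence dimension so that token-wise operations such as LayerNorm and dropout run independently on each shard, and the shards are only re-assembled before an operation that mixes tokens. The shards here merely stand in for per-GPU partitions; a real multi-GPU implementation would use collectives (all-gather / reduce-scatter) instead of chunk and concatenate.

```python
# Minimal single-process sketch of sequence parallelism (assumes PyTorch).
# Each shard stands in for the slice of the sequence one GPU would own.
import torch
import torch.nn as nn

num_shards = 4                        # stand-in for the sequence-parallel group size
batch, seq_len, hidden = 2, 16, 32    # toy dimensions

x = torch.randn(batch, seq_len, hidden)
layer_norm = nn.LayerNorm(hidden)
dropout = nn.Dropout(p=0.1)

# Partition activations along the sequence dimension: each "GPU" stores and
# processes only seq_len / num_shards tokens, cutting activation memory.
shards = torch.chunk(x, num_shards, dim=1)

# LayerNorm and dropout act token-wise, so each shard can be processed
# independently without any communication.
local_outputs = [dropout(layer_norm(shard)) for shard in shards]

# Before an operation that mixes tokens (e.g. self-attention), the shards
# are re-assembled (an all-gather in a real multi-GPU setup).
y = torch.cat(local_outputs, dim=1)
assert y.shape == x.shape
```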

Revisiting Reset Mechanisms in Spiking Neural Networks for Sequential Modeling: Specialized Discretization for Binary Activated RNN

Apr 24, 2025

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

Apr 24, 2025

MR. Video: "MapReduce" is the Principle for Long Video Understanding

Apr 22, 2025

Scalable APT Malware Classification via Parallel Feature Extraction and GPU-Accelerated Learning

Apr 22, 2025

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Apr 20, 2025

MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core

Apr 21, 2025

Superfast Configuration-Space Convex Set Computation on GPUs for Online Motion Planning

Apr 15, 2025

OctGPT: Octree-based Multiscale Autoregressive Models for 3D Shape Generation

Apr 15, 2025

Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices

Apr 11, 2025

Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion

Apr 11, 2025