
Haibin Lin

Minder: Faulty Machine Detection for Large-scale Distributed Model Training

Nov 04, 2024

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Oct 20, 2024

HybridFlow: A Flexible and Efficient RLHF Framework

Sep 28, 2024

Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

Aug 07, 2024

ByteCheckpoint: A Unified Checkpointing System for LLM Development

Jul 29, 2024

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

Jul 02, 2024

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Jun 12, 2024

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Mar 02, 2024

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Feb 23, 2024

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

Nov 17, 2023