Bor-Yiing Su

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Nov 05, 2020
Via arXiv

ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

Mar 07, 2020
Via arXiv