Hamidreza Almasi

Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

Feb 12, 2023

Hamidreza Almasi, Harsh Mishra, Balajee Vamanan, Sathya N. Ravi

Figures 1-4 for Flag Aggregator: Scalable Distributed Training under Failures and Augmented Losses using Convex Optimization

Abstract: Modern ML applications increasingly rely on complex deep learning models and large datasets, and the amount of computation needed to train the largest models has grown exponentially. To scale computation and data, these models are therefore trained in a distributed manner on clusters of nodes, and the nodes' updates are aggregated before being applied to the model. However, a distributed setup is prone to Byzantine failures of individual nodes, components, and software. With data augmentation added to these settings, there is a critical need for robust and efficient aggregation systems. We extend current state-of-the-art aggregators and propose an optimization-based subspace estimator that models pairwise distances as quadratic functions, building on the recently introduced Flag Median problem. The estimator in our loss function favors the pairs that preserve the norm of the difference vector. We show theoretically that our approach enhances the robustness of state-of-the-art Byzantine-resilient aggregators. We also evaluate our method on different tasks in a distributed setup with a parameter server architecture and show its communication efficiency while maintaining similar accuracy. The code is publicly available at https://github.com/hamidralmasi/FlagAggregator
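To illustrate why aggregation at the parameter server needs to be robust, the following minimal sketch compares plain averaging with a coordinate-wise median when one worker sends a corrupted update. This is not the Flag Aggregator described in the abstract; the coordinate-wise median is a simple, well-known baseline swapped in for illustration. The function names, worker count, noise level, and corruption value are all assumptions made up for this example.

```python
import numpy as np


def aggregate_mean(grads: np.ndarray) -> np.ndarray:
    """Average worker updates (rows of grads); not robust to outliers."""
    return np.mean(grads, axis=0)


def aggregate_coordinate_median(grads: np.ndarray) -> np.ndarray:
    """Take the median of each coordinate across workers (baseline, not the paper's method)."""
    return np.median(grads, axis=0)


def simulate(num_workers: int = 7, dim: int = 5, seed: int = 0):
    """Build hypothetical worker updates: honest ones are noisy copies of the true gradient."""
    rng = np.random.default_rng(seed)
    true_grad = np.ones(dim)
    grads = true_grad + 0.01 * rng.standard_normal((num_workers, dim))
    # Assumption: worker 0 is Byzantine and sends an arbitrary, large update.
    grads[0] = -1000.0 * np.ones(dim)
    return true_grad, grads


true_grad, grads = simulate()
mean_err = np.linalg.norm(aggregate_mean(grads) - true_grad)
median_err = np.linalg.norm(aggregate_coordinate_median(grads) - true_grad)
print(f"mean aggregation error:   {mean_err:.3f}")
print(f"median aggregation error: {median_err:.3f}")
```

In this toy setting a single corrupted worker drags the average far from the true gradient, while the median stays close to it. Subspace-based estimators such as the one proposed in the paper target the same failure mode with a different formulation.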
