Indranil Gupta

Baechi: Fast Device Placement of Machine Learning Graphs

Jan 20, 2023

Beomyeol Jeon, Linda Cai, Chirag Shetty, Pallavi Srivastava, Jintao Jiang, Xiaolan Ke, Yitao Meng, Cong Xie, Indranil Gupta

Abstract: Machine learning graphs (or models) can be challenging or impossible to train when devices have limited memory or models are large. Learning-based approaches remain popular for splitting a model across devices. While these produce model placements that train fast on data (i.e., low step times), learning-based model parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654x to 206Kx faster than state-of-the-art learning-based approaches, and (ii) the step (training) time of Baechi-placed models is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that, compared to learning-based approaches, algorithmic approaches can face different challenges in adapting to machine learning systems, but they also offer proven bounds and significant performance benefits.

* Extended version of SoCC 2020 paper: https://dl.acm.org/doi/10.1145/3419111.3421302
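
To make the placement problem concrete, here is a minimal sketch of an algorithmic (non-learned) placer. It is not one of Baechi's algorithms, which the abstract does not spell out; it simply walks the operator graph in topological order and fills devices one after another under a memory cap. The names `topo_order`, `place`, `ops_mem`, and `device_capacity` are hypothetical.

```python
from collections import deque


def topo_order(ops, edges):
    """Kahn's algorithm: return operators in a dependency-respecting order."""
    indeg = {op: 0 for op in ops}
    succ = {op: [] for op in ops}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(op for op in ops if indeg[op] == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in succ[op]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(ops):
        raise ValueError("graph has a cycle")
    return order


def place(ops_mem, edges, device_capacity):
    """Assign each operator to a device index without exceeding any memory cap.

    Illustrative assumption: devices are filled in index order and never revisited.
    """
    placement = {}
    device = 0
    free = device_capacity[0]
    for op in topo_order(list(ops_mem), edges):
        need = ops_mem[op]
        while need > free:
            # Current device is full for this operator: move to the next one.
            device += 1
            if device == len(device_capacity):
                raise MemoryError(f"ran out of device memory at operator {op}")
            free = device_capacity[device]
        placement[op] = device
        free -= need
    return placement


ops_mem = {"a": 2, "b": 2, "c": 3, "d": 1}          # memory units per operator
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
plan = place(ops_mem, edges, device_capacity=[4, 4])
```

A placer like this runs in time linear in the size of the graph, which is the kind of cost profile that separates algorithmic placement from hours-long learned search. It ignores communication and compute time entirely, so it says nothing about step time.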

CSER: Communication-efficient SGD with Error Reset

Jul 29, 2020

Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, Haibin Lin

Abstract: The scalability of distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. CSER rests on two ideas. The first is a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of the resulting local residual errors. The second is partial synchronization of both the gradients and the models, which leverages the advantages of both. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that, when combined with highly aggressive compressors, the CSER algorithms: i) cause no loss of accuracy, and ii) accelerate training by nearly $10\times$ for CIFAR-100, and by $4.5\times$ for ImageNet.
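
For readers unfamiliar with the terms, the sketch below shows what a "compressor" and its "residual error" are, using top-k sparsification as an example compressor. This is an assumption chosen for illustration; it does not implement error reset or partial synchronization, and the function names are hypothetical.

```python
def top_k_compress(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def residual(vec, compressed):
    """The part of vec that the compressor dropped (the local residual error)."""
    return [v - c for v, c in zip(vec, compressed)]


grad = [0.5, -3.0, 0.1, 2.0]
sent = top_k_compress(grad, k=2)   # what a worker would transmit
error = residual(grad, sent)       # what stays behind on the worker
```

Only the nonzero entries of `sent` need to be communicated; how the leftover `error` is handled over time is exactly where methods in this family differ.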

Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Nov 20, 2019

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

Abstract: Recent years have witnessed the growth of large-scale distributed machine learning algorithms specifically designed to accelerate model training by distributing computation across multiple machines. When scaling distributed training in this way, the communication overhead is often the bottleneck. In this paper, we study the local distributed Stochastic Gradient Descent (SGD) algorithm, which reduces the communication overhead by decreasing the frequency of synchronization. While SGD with adaptive learning rates is a widely adopted strategy for training neural networks, it remains unknown how to implement adaptive learning rates in local SGD. To this end, we propose a novel SGD variant with reduced communication and adaptive learning rates, with provable convergence. Empirical results show that the proposed algorithm converges fast and efficiently reduces the communication overhead.
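
The baseline this abstract builds on, local SGD, can be sketched in a few lines: each worker takes several gradient steps on its own copy of the model, and workers average their models only once per round. The toy below is a sketch under stated assumptions: a one-dimensional quadratic per worker, exact gradients instead of stochastic ones (so the run is deterministic), and a fixed learning rate. It does not include the adaptive learning rates the paper proposes, and `local_sgd` and its parameters are hypothetical names.

```python
def local_sgd(targets, x0=0.0, lr=0.1, local_steps=5, rounds=40):
    """Local SGD on f_i(x) = 0.5 * (x - t_i)^2, one objective per worker."""
    x = x0
    for _ in range(rounds):
        local_models = []
        for t in targets:
            xi = x  # every worker starts the round from the shared model
            for _ in range(local_steps):
                grad = xi - t  # gradient of 0.5 * (xi - t) ** 2
                xi -= lr * grad
            local_models.append(xi)
        # Model averaging is the only communication step in a round.
        x = sum(local_models) / len(local_models)
    return x


x_final = local_sgd(targets=[1.0, 2.0, 6.0])
```

With `local_steps=5`, workers synchronize five times less often than in fully synchronous SGD with the same number of gradient steps. On this toy problem the averaged model approaches the mean of the targets, which minimizes the average objective.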

SLSGD: Secure and Efficient Distributed On-device Machine Learning

Apr 05, 2019

Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: We consider distributed on-device learning with limited communication and security requirements. We propose a new robust distributed optimization algorithm with efficient communication and attack tolerance. The proposed algorithm has provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.

Asynchronous Federated Optimization

Mar 13, 2019

Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: Federated learning enables training on a massive number of edge devices. To improve flexibility and scalability, we propose a new asynchronous federated optimization algorithm. We prove that the proposed approach has near-linear convergence to a global optimum, for both strongly and non-strongly convex problems, as well as a restricted family of non-convex problems. Empirical results show that the proposed algorithm converges fast and tolerates staleness.
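
One common way for an asynchronous server to tolerate staleness is to mix each arriving worker model into the global model with a weight that shrinks as the update gets older. The sketch below illustrates that idea only; the polynomial discount, the exponent `a`, and the function names are assumptions for illustration and are not taken from the abstract.

```python
def mixing_weight(alpha, staleness, a=0.5):
    """Discount the base weight alpha for updates computed on an old model.

    staleness = (current server version) - (version the worker started from).
    The polynomial form (1 + staleness) ** -a is an illustrative assumption.
    """
    return alpha * (1.0 + staleness) ** (-a)


def server_update(global_model, worker_model, alpha, staleness):
    """Blend one worker's model into the global model as soon as it arrives."""
    w = mixing_weight(alpha, staleness)
    return [(1.0 - w) * g + w * m for g, m in zip(global_model, worker_model)]


fresh = server_update([0.0, 2.0], [2.0, 0.0], alpha=0.5, staleness=0)
stale = server_update([0.0, 2.0], [2.0, 0.0], alpha=0.5, staleness=3)
```

A fresh update moves the global model halfway toward the worker's model here, while an update that is three versions old moves it only a quarter of the way, so the server never has to wait for slow devices.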

Fall of Empires: Breaking Byzantine-tolerant SGD by Inner Product Manipulation

Mar 10, 2019

Cong Xie, Sanmi Koyejo, Indranil Gupta

Abstract: Recently, new defense techniques have been developed to tolerate Byzantine failures for distributed machine learning. The Byzantine model captures workers that behave arbitrarily, including malicious and compromised workers. In this paper, we break two prevailing Byzantine-tolerant techniques. Specifically, we show that two robust aggregation methods for synchronous SGD, coordinate-wise median and Krum, can be broken using new attack strategies based on inner product manipulation. We prove our results theoretically and validate them empirically.
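
The two defenses named in the abstract can be sketched as follows. This shows the standard textbook form of each aggregator, not the paper's attack; the function names are hypothetical, and `f` denotes the assumed number of Byzantine workers.

```python
from statistics import median


def coordinate_median(grads):
    """Aggregate by taking the median of each coordinate across workers."""
    return [median(col) for col in zip(*grads)]


def krum(grads, f):
    """Return the gradient closest to its n - f - 2 nearest neighbours."""
    n = len(grads)
    if n < 2 * f + 3:
        raise ValueError("Krum needs n >= 2f + 3 workers")

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, g in enumerate(grads):
        dists = sorted(sqdist(g, h) for j, h in enumerate(grads) if j != i)
        scores.append(sum(dists[: n - f - 2]))
    best = min(range(n), key=scores.__getitem__)
    return grads[best]


grads = [
    [1.0, 1.0],
    [1.1, 0.9],
    [0.9, 1.1],
    [1.0, 1.3],
    [100.0, -100.0],  # an obviously faulty worker
]
med = coordinate_median(grads)
picked = krum(grads, f=1)
```

Both rules discard the far-away outlier in this example. The paper's point is that an attacker who stays close to the honest gradients while flipping the sign of their inner product with the true gradient can still defeat such distance-based rules.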

Zeno: Byzantine-suspicious stochastic gradient descent

Sep 16, 2018

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We propose Zeno, a new robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The key idea is to suspect the workers that are potentially malicious and to use a ranking-based preference mechanism. This allows us to generalize beyond past work: in our case, the number of malicious workers can be arbitrarily large, and we use only the weakest assumption on honest workers (at least one honest worker). We prove the convergence of SGD under these scenarios. Empirical results show that Zeno outperforms existing approaches under various attacks.

* Submitted to SysML 2019
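
A ranking-based aggregation rule of the kind the abstract describes could look like the sketch below: score every candidate gradient, rank the candidates, and average only the best ones. The scoring function used here (estimated loss decrease minus a penalty on gradient magnitude) and all names and parameters are assumptions for illustration; the abstract does not specify them.

```python
def ranked_aggregate(grads, x, loss, lr, rho, num_trim):
    """Average the highest-scoring gradients, dropping the num_trim lowest."""

    def score(g):
        stepped = [xi - lr * gi for xi, gi in zip(x, g)]
        # Reward gradients that lower the loss; penalize very large ones.
        return loss(x) - loss(stepped) - rho * sum(gi * gi for gi in g)

    ranked = sorted(grads, key=score, reverse=True)
    kept = ranked[: len(grads) - num_trim]
    return [sum(col) / len(kept) for col in zip(*kept)]


def toy_loss(x):
    return 0.5 * sum(v * v for v in x)


x = [1.0, 1.0]
grads = [[1.0, 1.0], [0.8, 1.2], [-5.0, -5.0]]  # last one points uphill
agg = ranked_aggregate(grads, x, toy_loss, lr=0.1, rho=0.001, num_trim=1)
```

Because candidates are judged by what they do to the loss rather than by how close they are to one another, a rule of this shape does not need honest workers to form a majority.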

Phocas: dimensional Byzantine-resilient stochastic gradient descent

May 23, 2018

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We propose a novel robust aggregation rule for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience of the proposed aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.

* Submitted to NIPS 2018. arXiv admin note: substantial text overlap with arXiv:1802.10116

Generalized Byzantine-tolerant SGD

Mar 23, 2018

Cong Xie, Oluwasanmi Koyejo, Indranil Gupta

Abstract: We propose three new robust aggregation rules for distributed synchronous Stochastic Gradient Descent (SGD) under a general Byzantine failure model. The attackers can arbitrarily manipulate the data transferred between the servers and the workers in the parameter server (PS) architecture. We prove the Byzantine resilience properties of these aggregation rules. Empirical analysis shows that the proposed techniques outperform current approaches for realistic use cases and Byzantine attack scenarios.
