Wei Bai

Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement

Jul 05, 2024

Yongji Wu, Wenjie Qu, Tianyang Tao, Zhuang Wang, Wei Bai, Zhuohao Li, Yuan Tian, Jiaheng Zhang, Matthew Lentz, Danyang Zhuo

Abstract: The sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs) due to its sub-linear scaling of computation costs. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress may be lost because training has to restart from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or rely on building resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, while a provably optimal expert placement algorithm maximizes the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
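
The idea of allocating more replicas to heavily loaded experts can be illustrated with a small sketch. This is not the allocation algorithm from the paper, which the abstract does not describe; it is a simple greedy heuristic, and the function name `allocate_replicas` and its parameters are hypothetical. It assumes per-expert token counts are known and that the cluster offers a fixed number of replica slots.

```python
def allocate_replicas(token_counts, total_slots):
    """Greedy sketch: one replica per expert, then extra replicas by load.

    token_counts: tokens routed to each expert (assumed to be measured).
    total_slots:  total replica slots available across all GPUs (assumed).
    """
    n = len(token_counts)
    if total_slots < n:
        raise ValueError("need at least one slot per expert")
    replicas = [1] * n
    for _ in range(total_slots - n):
        # Pick the expert whose replicas each carry the most tokens right now.
        busiest = max(range(n), key=lambda e: token_counts[e] / replicas[e])
        replicas[busiest] += 1
    return replicas


if __name__ == "__main__":
    # One hot expert and three cold ones sharing eight slots.
    print(allocate_replicas([100, 10, 10, 10], 8))
```

Every expert keeps at least one replica in this sketch, so no expert becomes unreachable; a real system would also have to decide which nodes host each replica, which is the placement problem the abstract refers to.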

Discover the Hidden Attack Path in Multi-domain Cyberspace Based on Reinforcement Learning

Apr 15, 2021

Lei Zhang, Wei Bai, Wei Li, Shiming Xia, Qibin Zheng

Abstract: In this work, we present a learning-based approach to analyzing cyberspace security configuration. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of agents as attackers, our method becomes better at discovering attack paths that were hidden from previous methods, especially in multi-domain cyberspace. To achieve these results, we pose discovering attack paths as a Reinforcement Learning (RL) problem and train an agent to discover multi-domain cyberspace attack paths. To enable our RL policy to discover more hidden attack paths and shorter attack paths, we ground the representation by introducing a multi-domain action selection module in RL. Our objective is to discover more hidden attack paths and shorter attack paths with our proposed method, in order to analyze the weaknesses of cyberspace security configuration. Finally, we design a simulated cyberspace experimental environment to verify our proposed method; the experimental results show that our method can discover more hidden multi-domain attack paths and shorter attack paths than existing baseline methods.

* 12 pages, 2 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:2007.04614
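
Posing attack-path discovery as an RL problem can be illustrated with a toy example. The sketch below is not the method from the paper (it has no multi-domain action selection module); it is plain tabular Q-learning on a hypothetical five-node network, where the graph, the reward values, and all function names are assumptions made for illustration.

```python
import random

# Hypothetical toy network: node -> nodes reachable in one attack step.
GRAPH = {
    "internet": ["web", "vpn"],
    "web": ["app"],
    "app": ["db"],
    "vpn": ["db"],
    "db": [],
}


def train_q(graph, start, target, episodes=500, alpha=0.5, gamma=0.9,
            epsilon=0.2, seed=0):
    """Tabular Q-learning: +10 for reaching the target, -1 for any other step."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s, nbrs in graph.items() for a in nbrs}
    for _ in range(episodes):
        state = start
        while state != target and graph[state]:
            actions = graph[state]
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            reward = 10.0 if action == target else -1.0
            future = max((q[(action, a)] for a in graph[action]), default=0.0)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = action
    return q


def greedy_path(graph, q, start, target):
    """Follow the highest-valued action from each node until the target."""
    path = [start]
    while path[-1] != target and graph[path[-1]] and len(path) <= len(graph):
        state = path[-1]
        path.append(max(graph[state], key=lambda a: q[(state, a)]))
    return path


if __name__ == "__main__":
    q_table = train_q(GRAPH, "internet", "db")
    print(greedy_path(GRAPH, q_table, "internet", "db"))
```

The per-step penalty is what pushes the learned policy toward shorter paths: in this toy graph the two-hop route through `vpn` ends up with a higher value than the three-hop route through `web` and `app`.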

Quantitative Evaluations on Saliency Methods: An Experimental Study

Dec 31, 2020

Xiao-Hui Li, Yuhan Shi, Haoyang Li, Wei Bai, Yuanwei Song, Caleb Chen Cao, Lei Chen

Abstract: It has long been argued that eXplainable AI (XAI) is an important topic, but it lacks rigorous definitions and fair metrics. In this paper, we briefly summarize the status quo of the metrics, along with an exhaustive experimental study based on them, including faithfulness, localization, false-positives, sensitivity check, and stability. From the experimental results, we conclude that among all the methods we compare, no single explanation method dominates the others in all metrics. Nonetheless, Gradient-weighted Class Activation Mapping (Grad-CAM) and Randomized Input Sampling for Explanation (RISE) perform fairly well in most of the metrics. Utilizing a set of filtered metrics, we further present a case study to diagnose the classification bases for models. While providing a comprehensive experimental study of metrics, we also examine measuring factors that are missed in current metrics and hope this work can serve as a guide for future research.

* 14 pages, 16 figures

Domain-specific Communication Optimization for Distributed DNN Training

Aug 16, 2020

Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Weiyan Wang, Kai Chen, Wei Bai, Junchen Jiang

Abstract: Communication overhead poses an important obstacle to distributed DNN training and has drawn increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication overlapping, and layer-wise flow scheduling are still coarse-grained and insufficient for efficient distributed training, especially when the network is under pressure. We present DLCP, a novel solution exploiting the domain-specific properties of deep learning to optimize the communication overhead of DNN training in a fine-grained manner. At its heart, DLCP comprises several key innovations beyond prior work: e.g., it exploits the bounded loss tolerance of SGD-based training to improve tail communication latency, which cannot be avoided purely through gradient compression. It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on layers and magnitudes of gradients to further speed up model convergence without affecting accuracy. In addition, it leverages inter-packet order-independency to perform per-packet load balancing without causing classical re-ordering issues. DLCP works with both Parameter Server and collective communication routines. We have implemented DLCP with commodity switches, integrated it with various training frameworks including TensorFlow, MXNet and PyTorch, and deployed it in our small-scale testbed with 10 Nvidia V100 GPUs. Our testbed experiments and large-scale simulations show that DLCP delivers up to 84.3% additional training acceleration over the best existing solutions.
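
Magnitude-based packet dropping under a bounded loss budget can be sketched in a few lines. This is a simplified, host-side illustration rather than DLCP itself: in the abstract the prioritization and dropping happen at packet level in the network and also account for layers, whereas here a gradient list is split into hypothetical fixed-size packets and the lowest-magnitude packets are dropped up to an assumed loss fraction. The names `packetize`, `drop_low_priority`, and `max_loss_fraction` are illustrative.

```python
def packetize(gradient, packet_size):
    """Split a flat gradient list into consecutive fixed-size packets."""
    return [gradient[i:i + packet_size]
            for i in range(0, len(gradient), packet_size)]


def drop_low_priority(packets, max_loss_fraction):
    """Drop the packets with the smallest mean |gradient|, within a budget.

    max_loss_fraction is an assumed bound on the share of packets that
    training can tolerate losing. Dropped packets are returned as None.
    """
    budget = int(len(packets) * max_loss_fraction)
    # Rank packet indices from least to most important.
    order = sorted(
        range(len(packets)),
        key=lambda i: sum(abs(g) for g in packets[i]) / len(packets[i]),
    )
    dropped = set(order[:budget])
    return [None if i in dropped else p for i, p in enumerate(packets)]


if __name__ == "__main__":
    grad = [0.9, -0.8, 0.01, 0.02, 0.5, -0.4, 0.001, 0.0]
    print(drop_low_priority(packetize(grad, 2), 0.25))
```

Because each packet carries an independent slice of the gradient, the receiver can treat a dropped packet as zeros for that slice without waiting for a retransmission, which is the property that makes a loss budget usable at all.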

Weakness Analysis of Cyberspace Configuration Based on Reinforcement Learning

Jul 09, 2020

Lei Zhang, Wei Bai, Shize Guo, Shiming Xia, Hongmei Li, Zhisong Pan

Abstract: In this work, we present a learning-based approach to analyzing cyberspace configuration. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of agents as attackers, our method becomes better at rapidly finding previously hidden attack paths, especially in multiple-domain cyberspace. To achieve these results, we pose finding attack paths as a Reinforcement Learning (RL) problem and train an agent to find multiple-domain attack paths. To enable our RL policy to find more hidden attack paths, we ground the representation by introducing a multiple-domain action selection module in RL. We design a simulated cyberspace experimental environment to verify our method. Our objective is to find more hidden attack paths in order to analyze the weaknesses of cyberspace configuration. The experimental results show that our method can find more hidden multiple-domain attack paths than existing baseline methods.

* 10 pages, 6 figures

Enabling Automatic Certification of Online Auctions

Apr 03, 2014

Wei Bai, Emmanuel M. Tadjouddine, Yu Guo

Abstract: We consider the problem of building up trust in a network of online auctions by software agents. This requires agents to have a deeper understanding of auction mechanisms and be able to verify desirable properties of a given mechanism. We have shown how these mechanisms can be formalised as semantic web services in OWL-S, a sufficiently expressive machine-readable formalism that enables software agents to discover, invoke, and execute a web service. We have also used abstract interpretation to translate the auction's specifications from OWL-S, based on description logic, to COQ, based on typed lambda calculus, in order to enable automatic verification of desirable properties of the auction by the software agents. For this language translation, we have discussed the syntactic transformation as well as the semantic connections between the concrete and abstract domains. This work contributes to the implementation of the vision of agent-mediated e-commerce systems.

* EPTCS 147, 2014, pp. 123-132
* In Proceedings FESCA 2014, arXiv:1404.0436
