Anthony Hsu

TonY: An Orchestrator for Distributed Machine Learning Jobs

Mar 24, 2019

Anthony Hsu, Keqiu Hu, Jonathan Hung, Arun Suresh, Zhe Zhang

Abstract: Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, we describe TonY, an open-source orchestrator for distributed ML jobs built at LinkedIn to address these challenges.

* 2 pages, to be published in OpML '19
