Michael Diskin

A critical look at the evaluation of GNNs under heterophily: are we really making progress?

Feb 22, 2023

Oleg Platonov, Denis Kuznedelev, Michael Diskin, Artem Babenko, Liudmila Prokhorenkova

Abstract: Node classification is a classical graph representation learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the standard datasets used for evaluating heterophily-specific models have serious drawbacks, making results obtained by using them unreliable. The most significant of these drawbacks is the presence of a large number of duplicate nodes in the datasets Squirrel and Chameleon, which leads to train-test data leakage. We show that removing duplicate nodes strongly affects GNN performance on these datasets. Then, we propose a set of heterophilous graphs of varying properties that we believe can serve as a better benchmark for evaluating the performance of GNNs under heterophily. We show that standard GNNs achieve strong results on these heterophilous graphs, almost always outperforming specialized models. Our datasets and the code for reproducing our experiments are available at https://github.com/yandex-research/heterophilous-graphs
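
The notion of homophily used in the abstract can be made concrete with the edge homophily ratio: the fraction of edges whose two endpoints share a class label. The following is a minimal sketch under that common definition; the function name `edge_homophily` and the toy graph are illustrative assumptions and are not taken from the paper's repository.

```python
from typing import Hashable, Iterable, Mapping, Tuple


def edge_homophily(
    edges: Iterable[Tuple[Hashable, Hashable]],
    labels: Mapping[Hashable, Hashable],
) -> float:
    """Return the fraction of edges that connect two nodes of the same class."""
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("graph has no edges")
    same_class = sum(1 for u, v in edge_list if labels[u] == labels[v])
    return same_class / len(edge_list)


# Hypothetical toy graph: a 4-cycle with two classes.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
toy_labels = {0: "a", 1: "a", 2: "b", 3: "b"}
ratio = edge_homophily(toy_edges, toy_labels)
```

A ratio near 1 indicates a homophilous graph, while a low ratio suggests heterophily; other homophily measures exist and may rank graphs differently.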

SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Jan 27, 2023

Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov

Abstract: Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.

* Work in progress. 23 pages, 8 figures

Training Transformers Together

Jul 07, 2022

Alexander Borzunov, Max Ryabinin, Tim Dettmers, Quentin Lhoest, Lucile Saulnier, Michael Diskin, Yacine Jernite, Thomas Wolf

Abstract: The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.

* Accepted to NeurIPS 2021 Demonstration Track. 10 pages, 2 figures. Link: https://training-transformers-together.github.io

Distributed Methods with Compressed Communication for Solving Variational Inequalities, with Theoretical Guarantees

Oct 07, 2021

Aleksandr Beznosikov, Peter Richtárik, Michael Diskin, Max Ryabinin, Alexander Gasnikov

Abstract: Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With increasing data and problem sizes necessary to train high performing models across these and other applications, it is necessary to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck during training, and this problem is exacerbated for high dimensional and over-parameterized models. Due to these considerations, it is important to equip existing methods with strategies that reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Rand$k$; MASHA1) and contractive (such as Top$k$; MASHA2) compressors. We empirically validate our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.

* 30 pages, 2 algorithms (MASHA1 and MASHA2), 2 theorems
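
The two compressor families named in the abstract above can be illustrated with their usual textbook definitions. This is a sketch only: the function names `rand_k` and `top_k` are illustrative, and the exact variants analyzed in the paper may differ. Rand$k$ keeps $k$ uniformly chosen coordinates and rescales them by $d/k$ so that the output is unbiased in expectation; Top$k$ keeps the $k$ largest-magnitude coordinates without rescaling, which makes it biased but contractive.

```python
import random
from typing import List, Optional


def rand_k(x: List[float], k: int, rng: Optional[random.Random] = None) -> List[float]:
    """Unbiased sparsifier: keep k random coordinates, scaled by d / k."""
    d = len(x)
    if not 0 < k <= d:
        raise ValueError("k must satisfy 0 < k <= len(x)")
    if rng is None:
        rng = random.Random()
    kept = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * v if i in kept else 0.0 for i, v in enumerate(x)]


def top_k(x: List[float], k: int) -> List[float]:
    """Contractive sparsifier: keep the k coordinates with the largest magnitude."""
    d = len(x)
    if not 0 < k <= d:
        raise ValueError("k must satisfy 0 < k <= len(x)")
    order = sorted(range(d), key=lambda i: abs(x[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(x)]


# Hypothetical example vector.
vector = [0.5, -3.0, 2.0, 0.1]
sparse_top = top_k(vector, 2)
sparse_rand = rand_k(vector, 2, random.Random(0))
```

In both cases only the $k$ surviving values and their indices need to be transmitted, which is where the communication savings come from.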

Secure Distributed Training at Scale

Jun 21, 2021

Eduard Gorbunov, Alexander Borzunov, Michael Diskin, Max Ryabinin

Abstract: Some of the hardest problems in deep learning can be solved with the combined effort of many independent parties, as is the case for volunteer computing and federated learning. These setups rely on high numbers of peers to provide computational resources or train on decentralized datasets. Unfortunately, participants in such systems are not always reliable. Any single participant can jeopardize the entire training run by sending incorrect updates, whether deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server. As a result, it can be infeasible to apply such algorithms to large-scale distributed deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. We rigorously analyze this protocol: in particular, we provide theoretical bounds for its resistance against Byzantine and Sybil attacks and show that it has a marginal communication overhead. To demonstrate its practical effectiveness, we conduct large-scale experiments on image classification and language modeling in the presence of Byzantine attackers.

* 55 pages, 6 figures. Code: https://github.com/yandex-research/btard
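
The claim in the abstract above that a single participant can derail training is easy to see with plain averaging. The sketch below is not the protocol proposed in the paper; it swaps in the coordinate-wise median, a generic robust aggregator, purely to illustrate the idea of Byzantine tolerance. All names and values are hypothetical.

```python
from statistics import mean, median
from typing import Callable, List, Sequence


def aggregate(
    updates: Sequence[Sequence[float]],
    reducer: Callable[[Sequence[float]], float],
) -> List[float]:
    """Combine per-peer update vectors coordinate by coordinate."""
    return [reducer(coords) for coords in zip(*updates)]


# Three honest peers send similar updates; one faulty peer sends garbage.
honest = [[1.0, -1.0], [1.1, -0.9], [0.9, -1.1]]
byzantine = [[1000.0, 1000.0]]
updates = honest + byzantine

mean_update = aggregate(updates, mean)      # dominated by the faulty peer
median_update = aggregate(updates, median)  # stays close to the honest updates
```

Robust aggregators of this kind typically cost extra computation or communication, which is the efficiency trade-off the abstract refers to.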

Distributed Deep Learning in Open Collaborations

Jun 18, 2021

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov (+6 more)

Abstract: Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.

* 30 pages, 9 figures. Code: https://github.com/yandex-research/DeDLOC
