Abstract: One of the major bottlenecks to the efficient deployment of neural network-based recommendation systems is the memory footprint of their embedding tables. Although many such systems could benefit from the faster on-chip memory access and increased computational power of hardware accelerators, the large embedding tables in these models often cannot fit in the constrained memory of accelerators. Despite the pervasiveness of these models, prior methods for memory optimization and parallelism fail to address the memory and communication costs that large embedding tables incur on accelerators. As a result, the majority of such models are trained on CPUs, while current accelerator-based implementations are hindered by bottlenecks such as inter-device communication and main-memory lookups. In this paper, we propose a theoretical framework that analyzes the communication costs of arbitrary distributed systems that use lookup tables. Using this framework, we propose algorithms that maximize throughput subject to memory, computation, and communication constraints. Furthermore, we demonstrate that our method achieves strong theoretical performance across dataset distributions and memory constraints, making it applicable to a wide range of use cases, from mobile federated learning to warehouse-scale computation. We implement our framework and algorithms in PyTorch and achieve up to a 6x increase in training throughput over baselines on GPU systems, evaluated on the Criteo Terabyte dataset.