
Asynchronous Local-SGD Training for Language Modeling


Abstract

Training large language models requires substantial computational resources. As models continue to scale, it becomes crucial to leverage distributed compute resources that may even be geographically distant from each other. Due to the latency and heterogeneity of these devices, it is natural to consider Local Stochastic Gradient Descent (Local-SGD), where each device performs more than one local update per communication round, and asynchronous training, where the server updates the global parameter vector as soon as a worker has completed its local updates. This work presents an initial study of asynchronous Local-SGD for language modeling. We conduct a comprehensive investigation of how worker heterogeneity, model size, the number of workers, and the choice of optimizer affect learning performance. We identify that a key hurdle in applying asynchronous Local-SGD to language modeling is the application of momentum acceleration on the server side when worker gradients are stale: in this setting, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the global parameters more frequently. To mitigate this optimization challenge, we propose a novel method that uses a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models of up to 150M parameters on the C4 dataset, not only matches the performance of synchronous Local-SGD in terms of perplexity per update step but also significantly surpasses it in terms of wall-clock time. To support future research and quick prototyping of new ideas, we also make available a toy framework that replicates the observed optimization challenges on a mixture of Gaussians. We hope this work brings new insights and tools that facilitate the future discovery of more efficient language model optimizers.
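To make the high-level description above concrete, the following is a minimal, hypothetical NumPy sketch of an asynchronous Local-SGD server loop with a delayed Nesterov-style momentum refresh: each worker returns a pseudo-gradient after several local steps, the server applies it immediately, but the momentum is only refreshed every N contributions. The refresh interval N, the outer learning rate, the momentum coefficient, and the toy quadratic objective are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative constants (assumptions, not values from the paper).
DIM = 10          # toy parameter dimension
N = 4             # momentum refreshed every N asynchronous updates
OUTER_LR = 0.7    # server-side (outer) learning rate
BETA = 0.9        # Nesterov momentum coefficient

params = np.zeros(DIM)
momentum = np.zeros(DIM)
grad_buffer = np.zeros(DIM)   # accumulates pseudo-gradients between refreshes
updates_since_refresh = 0

def local_sgd(global_params, num_steps, lr=0.1):
    """Toy worker: a few local SGD steps on a simple quadratic loss."""
    w = global_params.copy()
    target = np.ones(DIM)                 # stand-in for real training data
    for _ in range(num_steps):
        w -= lr * (w - target)            # gradient of 0.5 * ||w - target||^2
    return global_params - w              # pseudo-gradient (parameter delta)

for step in range(100):
    # Workers finish at their own pace; in the dynamic-local-step idea,
    # slower workers would be assigned fewer local steps so all workers
    # finish in roughly the same wall-clock time (randomized here).
    num_local_steps = int(np.random.choice([8, 16, 32]))
    delta = local_sgd(params, num_local_steps)

    grad_buffer += delta
    updates_since_refresh += 1

    if updates_since_refresh < N:
        # Between refreshes: apply the pseudo-gradient alone; momentum frozen.
        params -= OUTER_LR * delta
    else:
        # Delayed refresh: fold buffered pseudo-gradients into the momentum
        # and take a Nesterov-style step with the refreshed momentum.
        avg_delta = grad_buffer / N
        momentum = BETA * momentum + avg_delta
        params -= OUTER_LR * (BETA * momentum + avg_delta)
        grad_buffer[:] = 0.0
        updates_since_refresh = 0

print("final params (should approach 1s):", params.round(3))
```

Under these assumptions, the buffering means stale worker contributions do not repeatedly inflate the momentum term, which is the failure mode the abstract attributes to naive server-side momentum with stale gradients.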

Authors

Bo Liu*, Arthur Douillard, Rachita Chhaparia, Jiajun Shen, Andrei Rusu, Arthur Szlam, Marc'Aurelio Ranzato, Satyen Kale

Venue

arXiv