What hardware do you need to train a 176B-parameter model?

The “BigScience” project started from discussions in early 2021 between HuggingFace (Thomas Wolf), GENCI (Stéphane Requena) and IDRIS (Pierre-François Lavallée), GENCI and IDRIS being the institutions behind the French supercomputer Jean Zay. Jean Zay is the supercomputer of the French national computing center for the CNRS (“Centre national de la recherche scientifique”, the French national research organization), with a performance of more than 28 Pflop/s in 2020; it was very recently upgraded, as we detail below.

Increasing the size of a supercomputer 🤯


Jean Zay


The requirements for training a very large language model like the largest model of the BigScience project are massive: when the workshop collaboration filled in its grant application for compute time, it asked for 5 million compute hours on the supercomputer's V100 GPUs, i.e. the totality of the compute time allocated on the supercomputer for a 6-month period.

Given the size of the BigScience project and its anticipated wide impact, strong support was secured to expand the public cluster itself with the addition of 52 HPE Apollo 6500 Gen10 servers in December 2021, in the following configuration (a quick sanity check of these numbers is sketched right after the list):

  • GPUs: 416 A100 80GB GPUs (52 nodes), with 384 GPUs (48 nodes) used for training and 32 GPUs (4 nodes) kept in reserve
  • 8 GPUs per node, using NVLink 4 inter-GPU connects and 4 OmniPath links
  • CPU: AMD
  • CPU memory: 512GB per node
  • GPU memory: 640GB per node
  • Inter-node connect: Omni-Path Architecture (OPA)
  • NCCL-communications network: a fully dedicated subnet
  • Disc IO network: shared network with other types of nodes
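
Here is that quick sanity check, a minimal sketch that simply multiplies out the figures from the list above (nothing below goes beyond the list itself):

```python
# Sanity check of the Jean Zay extension figures listed above.
GPUS_PER_NODE = 8      # A100 80GB GPUs per HPE Apollo 6500 node
TOTAL_NODES = 52
RESERVE_NODES = 4      # kept aside in case of hardware failures
GPU_MEM_GB = 80        # HBM per A100

total_gpus = TOTAL_NODES * GPUS_PER_NODE                        # 416 GPUs
training_gpus = (TOTAL_NODES - RESERVE_NODES) * GPUS_PER_NODE   # 384 GPUs
gpu_mem_per_node_gb = GPUS_PER_NODE * GPU_MEM_GB                # 640 GB per node

print(f"total GPUs:             {total_gpus}")
print(f"GPUs used for training: {training_gpus}")
print(f"GPU memory per node:    {gpu_mem_per_node_gb} GB")
```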

The training of BigScience's largest language model will be conducted on this new extension during its first months of beta testing, taking advantage of the efficiency and power of the new NVIDIA GPUs as well as the quality of the interconnect and setup.

The strong support of and close collaboration with GENCI (Stéphane Requena) and IDRIS (Pierre-François Lavallée, Rémi Lacroix) were essential to the success of this endeavor. An additional positive outcome of BigScience is that this new compute resource will remain available to everyone in the research community using this public compute infrastructure, a net positive for academia and for laboratories with fewer compute resources.

And what about carbon emissions?

Several factors make collaborative training and sharing efforts like BigScience, as well as clusters like Jean Zay, an interesting direction for studying large language models from an environmental point of view.

The current trend of privately trained models means that similarly sized models are trained and kept private by the various big tech companies that can afford the compute to do so. This multiplication of very similar models (Google's 137B language model, DeepMind's 280B Gopher model, OpenAI's 175B GPT-3, NVIDIA's 530B Megatron-Turing model, see an updated list here) generates a duplication of energy spending with little pragmatic logic. It would likely be more interesting to train a single model together and share it among the research community than to train a multiplicity of unshared models. In addition to these considerations, the Jean Zay supercomputer was selected in particular because of its interesting design choices and location:

  • Jean Zay is mostly powered by nuclear energy (which supplies 70-75% of the French electricity grid), a low-carbon energy source in comparison to coal or gas.
  • Moreover, significant efforts were made to make sure that the computing infrastructure is as efficient as possible — the heat generated by the hardware is for instance used for heating buildings on the nearby campus.
  • A full estimation of the carbon footprint of training the model is currently being conducted by the carbon-footprint working group of BigScience.

High Throughput Model Scaling on 400+ GPUs

To efficiently scale the model on the cluster's GPUs, the NVIDIA and Microsoft DeepSpeed teams were closely involved, in particular:

  • Olatunji Ruwase, Jeff Rasley, Samyam Rajbhandari, Shaden Smith, Conglong Li and Minjia Zhang from the Microsoft DeepSpeed team
  • Jared Casper, Deepak Narayanan and Mohammad Shoeybi from the NVIDIA Megatron-LM team.

The final model takes 48 GPUs for one replica, and the model is trained with 8 replicas in parallel (data parallelism), for a total of 384 GPUs. 4 nodes (32 GPUs) are kept in reserve in case of hardware malfunction.


From the DeepSpeed blog post “3D parallelism: Scaling to trillion-parameter models”


Three parallelism dimensions are combined to reach a very high model throughput (a short sketch of how they multiply out follows the list):
  • Tensor Parallelism: 4 (each tensor is split up into multiple chunks processed separately and in parallel on different GPUs)
  • Pipeline Parallelism: 12 (the model is split up vertically (layer-level) across multiple GPUs)
  • Data Parallelism: 8 (the model is replicated multiple times, each replica being fed a slice of the data)
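
How these three dimensions multiply out to the GPU counts mentioned earlier can be sketched in a few lines (an illustration only; the actual rank-to-node placement is handled by the training launcher and is more subtle than this):

```python
# Illustrative decomposition of the 384 training GPUs into the
# three parallelism dimensions described above (TP x PP x DP = 4 x 12 x 8).
TP = 4    # tensor parallelism: each tensor split across 4 GPUs
PP = 12   # pipeline parallelism: layers split into 12 stages
DP = 8    # data parallelism: 8 full replicas of the model

gpus_per_replica = TP * PP           # 48 GPUs hold one copy of the model
total_gpus = gpus_per_replica * DP   # 384 GPUs in total

GPUS_PER_NODE = 8
nodes_per_replica = gpus_per_replica // GPUS_PER_NODE   # 6 nodes per replica
total_nodes = total_gpus // GPUS_PER_NODE               # 48 nodes in total

print(f"{gpus_per_replica} GPUs per replica, {total_gpus} GPUs in total")
print(f"{nodes_per_replica} nodes per replica, {total_nodes} nodes in total")
```

In the Megatron-DeepSpeed codebase, the first two dimensions typically map to the --tensor-model-parallel-size and --pipeline-model-parallel-size launch arguments, with data parallelism covering the remaining ranks.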

Code and training setup
The final code for the model, as well as the training setup, can be found on GitHub. It uses PyTorch 1.11 with CUDA 11.5, DeepSpeed 0.6.0 and NVIDIA's apex@master.
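
A minimal way to check that a local environment matches these versions, assuming the packages are installed under their standard import names:

```python
# Minimal environment check for the versions mentioned above.
import torch
import deepspeed

print("PyTorch:  ", torch.__version__)       # expected to start with 1.11
print("CUDA:     ", torch.version.cuda)      # expected 11.5
print("DeepSpeed:", deepspeed.__version__)   # expected 0.6.0

try:
    import apex  # NVIDIA apex, installed from the master branch
    print("apex:      available")
except ImportError:
    print("apex:      not installed")
```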

Configuration
To find the best configuration, many possible topologies and configurations for models ranging between 150B and 200B parameters were investigated, comparing throughput, iteration speed and memory usage. A detailed map of the tested configurations can be found here.
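
To give an idea of the search space, a few lines of code are enough to enumerate the candidate 3D topologies for 384 GPUs (a sketch only: the real experiments measured each candidate on the cluster, and further constraints apply in practice, e.g. the number of layers must be divisible by the pipeline depth and the micro-batches must fit in GPU memory):

```python
# Enumerate candidate (TP, PP, DP) topologies for 384 GPUs.
# This only illustrates the combinatorics; it is not the actual search script.
TOTAL_GPUS = 384

for tp in (1, 2, 4, 8):                     # tensor parallelism kept within a node
    for pp in range(1, TOTAL_GPUS // tp + 1):
        if TOTAL_GPUS % (tp * pp):
            continue                        # TP * PP must divide the GPU count
        dp = TOTAL_GPUS // (tp * pp)
        print(f"TP={tp:>2} PP={pp:>3} DP={dp:>3} -> {tp * pp} GPUs per replica")
```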

Discussions
All the discussions leading to the selected topology have been documented along the way by Stas in his chronicles.

Throughput
The initial throughput was around 90 TFLOPs, and over the next few weeks we were able to improve it all the way to 150 TFLOPs by experimenting with multiple combinations of TP/PP/DP/MBS and NHIDDEN/NHEADS/NLAYERS settings, taking into account the best dimension multipliers according to NVIDIA's tile quantization guidelines (matrix dimensions need to be divisible by 128).

About halfway through this process, the Jean Zay engineers improved the cluster's network, which gave an additional performance boost. Towards the end, a further speed-up was achieved by rebalancing the layers so that the tied embedding matrices carry a weight similar to a transformer block, which made all GPUs equally loaded in terms of memory, as explained here.
While working on this, we also switched from an estimated TFLOPs calculation to the exact one, which is slightly higher.
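
As a rough illustration of the estimated vs. exact distinction, the sketch below compares two common ways of counting training FLOPs: a simple parameter-count approximation, and the per-iteration formula from the Megatron-LM paper (Narayanan et al., 2021) that accounts for activation recomputation. The hyperparameter values are hypothetical placeholders chosen to land in the ballpark of the throughput numbers above, not the final model's configuration:

```python
# Two ways of computing achieved TFLOPs per GPU for one training iteration.

def approx_flops_per_iteration(n_params, batch_size, seq_len):
    # Rough estimate: ~6 FLOPs per parameter per token for forward + backward,
    # plus ~2 more for recomputing the forward pass (activation checkpointing).
    return 8 * n_params * batch_size * seq_len

def exact_flops_per_iteration(batch_size, seq_len, n_layers, hidden, vocab):
    # Megatron-LM formula: 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))
    return (96 * batch_size * seq_len * n_layers * hidden ** 2
            * (1 + seq_len / (6 * hidden) + vocab / (16 * n_layers * hidden)))

# Hypothetical placeholder configuration, for illustration only.
B, S, L, H, V = 2048, 2048, 70, 14336, 250_000
N_PARAMS = 176e9
N_GPUS = 384
ITER_TIME_S = 105.0   # hypothetical seconds per iteration

for name, flops in (("approx", approx_flops_per_iteration(N_PARAMS, B, S)),
                    ("exact ", exact_flops_per_iteration(B, S, L, H, V))):
    tflops_per_gpu = flops / (ITER_TIME_S * N_GPUS) / 1e12   # TFLOP/s per GPU
    print(f"{name}: {tflops_per_gpu:.0f} TFLOPs per GPU")
```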

According to the NVIDIA and DeepSpeed LLM training experts, 150 TFLOPs is pretty much the highest throughput one can achieve with A100 80GB GPUs for a model of this size, give or take a few TFLOPs. A larger model would achieve an even higher throughput. For more information on the design choices of the model, take a look at the accompanying blog post by Julien Launay.