Posted 3/21/2024, 6:47:04 PM
Meta Building Massive 350,000-GPU AI Training Infrastructure, Moving Toward Custom Silicon
- Meta currently relies on nearly 50,000 Nvidia H100 GPUs to train its Llama 3 AI model, and plans to grow that fleet to over 350,000 H100s by the end of 2024
- Meta shared details of its two 24,576-GPU data-center-scale clusters, built on its Grand Teton open GPU hardware platform
- The clusters use high-performance network fabrics, one built on RDMA over Converged Ethernet (RoCE) and the other on Nvidia Quantum2 InfiniBand, to support training larger, more complex models
- Meta has developed its own storage solution, backed by its Tectonic distributed file system, that lets thousands of GPUs save and load checkpoints in a synchronized fashion (see the sketch after this list)
- While Meta still relies heavily on Nvidia, it plans to deploy its own Artemis AI inference chips in servers this year and is developing custom RISC-V-based silicon
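
Meta's actual checkpointing path, a home-grown Linux FUSE API backed by Tectonic, is not public. As a rough illustration of the general pattern (every rank writing its shard of the checkpoint in parallel instead of funneling everything through one node), here is a minimal sketch using PyTorch's open-source `torch.distributed.checkpoint` module; the toy model, step number, and `/tmp/ckpts` path are hypothetical stand-ins, not anything from Meta's stack.

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def save_checkpoint(model: torch.nn.Module, step: int, ckpt_root: str) -> None:
    # Every rank calls dcp.save(); the library coordinates the write so
    # tensor shards stream out in parallel (duplicated/replicated tensors
    # are deduplicated) rather than serializing through rank 0.
    state_dict = {"model": model.state_dict()}
    dcp.save(
        state_dict,
        storage_writer=dcp.FileSystemWriter(
            os.path.join(ckpt_root, f"step_{step}")
        ),
    )


if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes launch via torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for a real LLM
    save_checkpoint(model, step=0, ckpt_root="/tmp/ckpts")  # hypothetical path
    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=8 ckpt_demo.py`, each process writes into the same checkpoint directory concurrently, which is why a high-throughput distributed backing store such as Tectonic matters at the scale of thousands of GPUs.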