Posted 3/21/2024, 6:47:04 PM
Meta Building Massive 350,000-GPU AI Training Infrastructure, Moving Toward Custom Silicon
- Meta currently relies on nearly 50,000 Nvidia H100 GPUs to train its Llama 3 AI model, and plans to grow that fleet to over 350,000 H100s by the end of 2024
- Meta shared details of its two 24,576-GPU data-center-scale clusters, built on its Grand Teton open GPU hardware platform
- The clusters use high-performance network fabrics, one built on RDMA over Converged Ethernet (RoCE) and the other on Nvidia Quantum2 InfiniBand, to support training larger, more complex models
- Meta has developed its own storage solution, backed by its Tectonic distributed file system, that lets thousands of GPUs save and load checkpoints in a synchronized fashion (see the sketch after this list)
- While Meta still relies heavily on Nvidia, it plans to deploy its own Artemis AI inference chips in servers this year and is developing custom RISC-V-based silicon
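
Meta's actual checkpointing path, a home-grown Linux FUSE API backed by Tectonic, is not public. As a rough illustration of the general pattern (every rank writing its shard of the checkpoint in parallel instead of funneling everything through one node), here is a minimal sketch using PyTorch's open-source `torch.distributed.checkpoint` module; the toy model, step number, and `/tmp/ckpts` path are hypothetical stand-ins, not anything from Meta's stack.

```python
import os

import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp


def save_checkpoint(model: torch.nn.Module, step: int, ckpt_root: str) -> None:
    # Every rank calls dcp.save(); the library coordinates the write so
    # tensor shards stream out in parallel (duplicated/replicated tensors
    # are deduplicated) rather than serializing through rank 0.
    state_dict = {"model": model.state_dict()}
    dcp.save(
        state_dict,
        storage_writer=dcp.FileSystemWriter(
            os.path.join(ckpt_root, f"step_{step}")
        ),
    )


if __name__ == "__main__":
    dist.init_process_group("nccl")  # assumes launch via torchrun
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for a real LLM
    save_checkpoint(model, step=0, ckpt_root="/tmp/ckpts")  # hypothetical path
    dist.destroy_process_group()
```

Launched with something like `torchrun --nproc_per_node=8 ckpt_demo.py`, each process writes into the same checkpoint directory concurrently, which is why a high-throughput distributed backing store such as Tectonic matters at the scale of thousands of GPUs.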