Scaling Llama2 Fine-Tuning to 70 Billion Parameters on Amazon EKS with Fully Sharded Data Parallelism
• Llama2 is a family of large language models with up to 70 billion parameters, pre-trained on 2 trillion tokens of text and code.
• Fully Sharded Data Parallelism (FSDP) makes larger models trainable by sharding model parameters, gradients, and optimizer states across GPUs, so each device holds only a fraction of the full training state (a minimal wrapping sketch appears after this list).
• The post demonstrates fine-tuning Llama2 at the 7B, 13B, and 70B parameter sizes on Amazon EKS, with linear throughput scaling across up to 16 p4de or p5 nodes.
• Kubernetes and host-level observability tools such as kubectl logs, htop, nvtop, and Grafana are used to monitor resource utilization during distributed training.
• Upcoming FSDP features like per-parameter sharding aim to further improve memory efficiency for large model training.
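The FSDP wrapping step referenced above can be sketched in PyTorch. This is a minimal illustration under assumptions, not the post's actual training script: the Hugging Face model name, learning rate, and mixed-precision settings are placeholders, and the auto-wrap policy shards at the granularity of each Llama decoder block.

```python
# Minimal FSDP wrapping sketch (illustrative; model name and hyperparameters are assumptions).
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def main():
    # The launcher (e.g. torchrun, or a training operator on EKS) sets RANK,
    # WORLD_SIZE, and LOCAL_RANK for each worker; init_process_group reads them.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder checkpoint name; substitute the model size being fine-tuned.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Shard at the granularity of each transformer decoder block.
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    # ... training loop: forward pass, loss, backward pass, optimizer.step() ...


if __name__ == "__main__":
    main()
```

Wrapping each decoder block separately means only one block's full parameters need to be gathered at a time during the forward and backward passes, which is what keeps per-GPU memory bounded as the model grows from 7B toward 70B parameters.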