Scaling Llama2 Fine-Tuning to 70 Billion Parameters on Amazon EKS with Fully Sharded Data Parallelism
• Llama2 is a family of large language models with up to 70 billion parameters, pre-trained on 2 trillion tokens of text and code.
• Fully Sharded Data Parallelism (FSDP) makes larger models trainable by sharding model parameters, gradients, and optimizer states across GPUs, so each device holds only a fraction of the full training state (a minimal wrapping sketch appears after this list).
• The post demonstrates fine-tuning Llama2 at the 7B, 13B, and 70B parameter sizes on Amazon EKS, with linear throughput scaling across up to 16 p4de or p5 nodes.
• Kubernetes and host-level observability tools such as kubectl logs, htop, nvtop, and Grafana are used to monitor resource utilization during distributed training.
• Upcoming FSDP features like per-parameter sharding aim to further improve memory efficiency for large model training.
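The FSDP wrapping step referenced above can be sketched in PyTorch. This is a minimal illustration under assumptions, not the post's actual training script: the Hugging Face model name, learning rate, and mixed-precision settings are placeholders, and the auto-wrap policy shards at the granularity of each Llama decoder block.

```python
# Minimal FSDP wrapping sketch (illustrative; model name and hyperparameters are assumptions).
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer


def main():
    # The launcher (e.g. torchrun, or a training operator on EKS) sets RANK,
    # WORLD_SIZE, and LOCAL_RANK for each worker; init_process_group reads them.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder checkpoint name; substitute the model size being fine-tuned.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    # Shard at the granularity of each transformer decoder block.
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    # ... training loop: forward pass, loss, backward pass, optimizer.step() ...


if __name__ == "__main__":
    main()
```

Wrapping each decoder block separately means only one block's full parameters need to be gathered at a time during the forward and backward passes, which is what keeps per-GPU memory bounded as the model grows from 7B toward 70B parameters.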