Slimming Down AI Models - Pruning Layers Drastically Cuts Size While Retaining Accuracy
- Pruning the deep layers of large language models such as Llama 2 can cut memory requirements by 75% with minimal loss of accuracy, letting the models run on consumer GPUs (a minimal pruning sketch follows this list).
- The "lottery ticket hypothesis" holds that large neural networks contain much smaller subnetworks that, when trained in isolation, can match the full model's accuracy. This insight is now being used to shrink models (see the masking sketch below).
- The middle layers of models like GPT appear to store factual knowledge. Deeper layers can be removed with little performance impact up to a threshold, beyond which accuracy drops sharply (the sweep sketch after this list shows one way to probe for that threshold).
- Removing up to 50% of Llama 2's layers barely affects its accuracy on benchmarks such as question answering. More research is needed to know whether other kinds of tasks would be impacted more.
- It is unclear whether current training makes full use of the deeper layers, or whether the shallow layers play the critical role in storing knowledge. More research here could lead to markedly more efficient models.
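As a concrete illustration of the first point, here is a minimal sketch of depth pruning with Hugging Face `transformers`. The checkpoint name, the 50% cut, and the choice to simply truncate the deepest layers are illustrative assumptions, not the exact recipe from the research above; without the light finetuning researchers typically apply after pruning, output quality will be rough.

```python
# Minimal depth-pruning sketch: drop the deepest decoder layers of a
# Llama-style model. Checkpoint and layer choice are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed (gated) checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16  # fp16 alone halves memory
).to("cuda")                               # assumes a GPU with ~14 GB free

keep = len(model.model.layers) // 2             # keep the shallow half
model.model.layers = model.model.layers[:keep]  # drop the deeper half
model.config.num_hidden_layers = keep           # keep the config consistent

inputs = tok("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```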
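The lottery ticket hypothesis is usually demonstrated with magnitude pruning: mask out the smallest-magnitude weights, then retrain the surviving subnetwork from its original initialization. The plain-PyTorch sketch below shows only the masking step; `magnitude_prune` and the 75% sparsity level are illustrative choices.

```python
# Magnitude-pruning sketch: zero the smallest-magnitude weights of a layer
# and return the binary keep-mask that defines the surviving subnetwork.
import torch
import torch.nn as nn

def magnitude_prune(layer: nn.Linear, sparsity: float = 0.75) -> torch.Tensor:
    w = layer.weight.data
    k = int(w.numel() * sparsity)                     # how many weights to drop
    threshold = w.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    mask = (w.abs() > threshold).float()              # 1 = keep, 0 = pruned
    w.mul_(mask)                                      # zero pruned weights in place
    return mask

layer = nn.Linear(512, 512)
mask = magnitude_prune(layer, sparsity=0.75)
print(f"surviving weights: {mask.mean().item():.1%}")  # roughly 25%
```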
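To see the threshold behavior described above, one can cut progressively deeper and score each truncated model. The two-question check below is only a smoke test standing in for a real benchmark harness; the checkpoint, prompts, and scoring rule are all assumptions.

```python
# Pruning-sweep sketch: truncate more and more of the deepest layers and watch
# for the point where a (toy) QA score collapses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed (gated) checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

# Toy stand-in for a QA benchmark: does the greedy continuation contain the answer?
qa_pairs = [
    ("Q: What is the capital of France? A:", "Paris"),
    ("Q: What planet do we live on? A:", "Earth"),
]

def smoke_test(m) -> float:
    hits = 0
    for prompt, answer in qa_pairs:
        ids = tok(prompt, return_tensors="pt").to("cuda")
        out = m.generate(**ids, max_new_tokens=8, do_sample=False)
        hits += answer.lower() in tok.decode(out[0], skip_special_tokens=True).lower()
    return hits / len(qa_pairs)

# Because the fractions increase, the same model can be truncated further on
# each pass instead of being reloaded.
n_total = len(model.model.layers)
for frac in (0.0, 0.25, 0.5, 0.75):
    keep = int(n_total * (1 - frac))
    model.model.layers = model.model.layers[:keep]
    model.config.num_hidden_layers = keep
    print(f"pruned {frac:.0%} of layers -> toy QA score {smoke_test(model):.2f}")
```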