AI Faces Looming Data Drought Within Years
• AI companies are running out of internet data to train large language models (LLMs) such as the ones behind ChatGPT. Estimates suggest they will exhaust the supply of high-quality data within about two years.
• LLMs need enormous amounts of data - up to 100 trillion tokens - to keep improving their capabilities. Even after the usable internet data is exhausted, an estimated 10-20 trillion more tokens would be needed.
• The problem is not only a data shortage but also data quality and the ethics of scraping personal information without consent.
• Companies are turning to alternative data sources, such as YouTube transcripts, building smaller niche models, and paying for high-quality data.
• A controversial option is synthetic data - AI-generated data derived from existing sets - which risks model collapse if the generated data lacks variety (see the toy sketch after this list).
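The collapse risk can be illustrated with a toy experiment not taken from the article: repeatedly fit a simple Gaussian model to samples drawn only from the previous generation's fitted model. Over generations, the fitted distribution's spread tends to narrow and its tails vanish, a crude stand-in for how variety can drain out of models trained on their own synthetic output. All numbers below are illustrative assumptions, not figures from the article.

```python
import numpy as np

# Toy sketch of "model collapse": each generation is trained only on
# synthetic samples produced by the previous generation's model.
rng = np.random.default_rng(0)

# Generation 0: "real" data with plenty of variety.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(30):
    # "Train" the model: estimate mean and spread from the current data.
    mu, sigma = data.mean(), data.std()
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # Next generation sees only synthetic samples from this fitted model.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

In this sketch the estimated spread drifts and rare (tail) values stop appearing, which is the "lack of variety" concern in miniature; real LLM training pipelines are vastly more complex, but the feedback loop is analogous.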