AI Models Risk 'Model Collapse' from Training on Their Own Gibberish
-
AI models like ChatGPT fill the internet with synthetic content, which can then contaminate the training data of future models. The result, known as "model collapse," is increasingly nonsensical output.
-
Recent studies demonstrate that recursively training models on AI-generated data leads to blurry, unrecognizable images and text fixated on arbitrary topics such as jackrabbits.
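
To make the mechanism concrete, here is a minimal toy sketch, not the studies' actual setup: each "generation" fits a simple Gaussian model to data sampled from the previous generation's model, so every generation after the first trains only on synthetic data. The distribution, sample size, and generation count are illustrative assumptions.

```python
# Toy sketch of recursive training on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

data = rng.normal(loc=0.0, scale=1.0, size=50)  # generation 0: "human" data

for generation in range(1, 201):
    mu, sigma = data.mean(), data.std()        # "train" on the current data
    data = rng.normal(mu, sigma, size=50)      # next generation: synthetic only
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")

# Over enough generations the fitted spread (std) typically collapses toward
# zero: rare, tail-of-the-distribution examples vanish first, a toy analogue
# of the blurring and repetition described above.
```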
-
Filtering synthetic data out of training sets is becoming vital to prevent problems such as exacerbated biases and gibberish output.
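
As a rough illustration of that filtering step, the sketch below drops documents that a detector flags as likely AI-generated before they enter a training corpus. The `looks_synthetic` function is a hypothetical placeholder of my own; real pipelines would rely on trained detectors, provenance metadata, or watermark checks, none of which the article specifies.

```python
from typing import Iterable, List


def looks_synthetic(text: str) -> bool:
    """Hypothetical stand-in for a synthetic-text detector."""
    # Placeholder heuristic only: flag highly repetitive documents.
    words = text.lower().split()
    return len(words) > 0 and len(set(words)) / len(words) < 0.3


def filter_training_corpus(docs: Iterable[str]) -> List[str]:
    """Keep only documents the detector does not flag as synthetic."""
    return [doc for doc in docs if not looks_synthetic(doc)]


if __name__ == "__main__":
    corpus = [
        "Jackrabbits are hares native to North America.",
        "jackrabbit jackrabbit jackrabbit jackrabbit jackrabbit jackrabbit",
    ]
    print(filter_training_corpus(corpus))  # keeps only the first document
```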
-
High-quality human-generated data still outperforms synthetic data sets, and AI could in turn help de-bias that human data.
-
For now, engineers must manually filter synthetic data to stop AI models from training on their own poor outputs. Human oversight remains essential.