AI Training Data Lacks Transparency, Raising Concerns Over Privacy and Bias
- AI models are trained on massive datasets scraped from the public internet, including copyrighted and private material, with little transparency about what those datasets contain.
- Web scrapers can sweep up public sites and profiles, paywalled content, pirated materials, and leaked personal data.
- The lack of transparency around training data raises concerns about copyright, privacy, and bias.
- Marginalized groups are underrepresented in web data, skewing AI systems toward the perspectives of groups that are overrepresented online.
- There are currently few options to keep personal data from being used to train AI systems.
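One of the few opt-outs that does exist is crawler-level blocking: a site owner can ask known AI-training scrapers to stay away via robots.txt. The user-agent names below are the publicly documented ones (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl); this is a sketch of the approach, not an endorsement of its effectiveness, since compliance is voluntary on the crawler's side.

```
# robots.txt — asks documented AI-training crawlers not to fetch this site.
# Honoring these rules is voluntary, and blocking future crawls does not
# remove data that has already been collected into existing training sets.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Because it covers only well-behaved crawlers and only future scraping, this kind of opt-out illustrates exactly why current protections are considered limited.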