AI Training Data Lacks Transparency, Raising Concerns Over Privacy and Bias
- AI models are trained on massive datasets scraped from the public internet, including copyrighted and private material, with little transparency about what those datasets contain.
- Web scrapers can sweep up public sites and profiles, paywalled content, pirated materials, and leaked personal data.
- The lack of transparency around training data raises concerns about copyright, privacy, and bias.
- Marginalized groups are underrepresented in web data, skewing AI systems toward the perspectives of groups that are overrepresented online.
- There are currently few options to keep personal data from being used to train AI systems.
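One of the few opt-outs that does exist is crawler-level blocking: a site owner can ask known AI-training scrapers to stay away via robots.txt. The user-agent names below are the publicly documented ones (GPTBot for OpenAI, Google-Extended for Google's AI training, CCBot for Common Crawl); this is a sketch of the approach, not an endorsement of its effectiveness, since compliance is voluntary on the crawler's side.

```
# robots.txt — asks documented AI-training crawlers not to fetch this site.
# Honoring these rules is voluntary, and blocking future crawls does not
# remove data that has already been collected into existing training sets.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Because it covers only well-behaved crawlers and only future scraping, this kind of opt-out illustrates exactly why current protections are considered limited.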