Study Finds Ethical and Legal Issues in Many AI Data Sets

Researchers uncover ethical and legal risks in popular AI data sets, finding issues like improper licensing and lack of attribution.
The audit examined more than 1,800 specialized fine-tuning data sets hosted on sites such as Hugging Face and GitHub.
About 70% of the data sets either did not specify a license or were labeled with a license more permissive than their creators intended.
Proper licensing matters because it tells developers which copyright restrictions and requirements apply to a data set.
The data sets often underrepresent languages from the Global South relative to English and Western European languages.