Study Finds Ethical and Legal Issues in Many AI Data Sets
-
Researchers uncover ethical and legal risks in popular AI data sets, finding issues like improper licensing and lack of attribution.
-
The audit examined more than 1,800 specialized fine-tuning data sets hosted on sites such as Hugging Face and GitHub.
-
About 70% of the data sets either did not specify a license or were labeled with permissions more permissive than their creators intended.
-
Proper licensing matters because it tells developers which copyright restrictions and requirements apply to the data they use.
-
The data sets often underrepresent languages from the Global South relative to English and Western European languages.
![](https://www.washingtonpost.com/wp-apps/imrs.php?src=https://arc-anglerfish-washpost-prod-washpost.s3.amazonaws.com/public/4BEI2INR3LCSX5UVYCBM54PPEA.jpg&w=1440)