Tech Firms Push Boundaries in Scramble for AI Training Data

OpenAI researchers created a speech recognition tool called Whisper to transcribe YouTube videos, yielding more conversational text for training AI systems and potentially violating YouTube's terms of service.
OpenAI president Greg Brockman personally helped collect over 1 million hours of YouTube videos to train the GPT-4 system.
Meta executives discussed buying the publisher Simon & Schuster to obtain long-form text, and also discussed gathering copyrighted data from across the internet to train AI systems.
Tech companies are cutting corners and debating whether to bend laws to get the data needed to advance AI technology.
Negotiating licenses with publishers and content creators would take too long, so tech firms are finding other ways to get the data.