OpenAI used YouTube videos to train GPT-4 against platform rules, but so did Google

Ben Schoon | Apr 8 2024 - 7:39 am PT

Generative AI models are incredibly impressive, but they’re only as good as the data fed into them. Now, it’s been revealed that OpenAI used YouTube videos to train GPT-4, and YouTube says that was against the platform’s rules.

A new report from The New York Times reveals that OpenAI used “over a million” hours of YouTube video transcripts to train GPT-4, currently its most advanced generative AI model.

This was done using an in-house tool called “Whisper,” which transcribed audio from YouTube videos so that the text could be fed into training what would become GPT-4. OpenAI president Greg Brockman, according to the report, was personally involved in picking videos to use for AI training, despite some OpenAI employees expressing concern that this sort of action would be against YouTube’s rules.

OpenAI apparently believed this was “fair use” of the publicly available videos, but YouTube said in a statement to The Verge that “both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.”

YouTube CEO Neal Mohan expressed that same point during a recent discussion with Bloomberg regarding OpenAI’s video model, Sora, which is due for release later this year: under YouTube’s terms of service, OpenAI is not allowed to scrape videos to train its AI.

However, the report also goes on to note that Google has done the same in training its AI models used in Gemini.

Apparently, the company has similarly used YouTube video transcripts to train AI. The report also notes that changes to Google’s terms of service have allowed the company to scrape training data from other publicly visible content on its services, including public Google Docs and Sheets files and even reviews left on Maps. It’s said that, despite Google knowing OpenAI was scraping data from YouTube, the company hasn’t acted on it, as that might lead to backlash over Google doing the same.

Aggressive means of capturing new data for training more advanced models are likely to continue as existing data dries up. AI research institute Epoch estimates that all existing data could be used up by 2026, as Google, OpenAI, and others are using data faster than it is being created.