
OpenAI used YouTube videos to train GPT-4 against platform rules, but so did Google

Generative AI models are incredibly impressive, but they’re only as good as the data fed into them. Now, it’s been revealed that OpenAI used YouTube videos to train GPT-4, and YouTube says that was against the platform’s rules.

According to a new report from The New York Times, OpenAI used transcripts of “over a million” hours of YouTube video to train GPT-4, currently its most advanced generative AI model.

This was done using an in-house tool called “Whisper,” which transcribed the audio from YouTube videos so the resulting text could be used to train what would become GPT-4. OpenAI president Greg Brockman, according to the report, was personally involved in picking videos to use for AI training, despite some OpenAI employees expressing concern that doing so would be against YouTube’s rules.

OpenAI apparently believed this was “fair use” of the publicly available videos, but YouTube said in a statement to The Verge that “both our robots.txt files and Terms of Service prohibit unauthorized scraping or downloading of YouTube content.”

YouTube CEO Neal Mohan made the same point during a recent discussion with Bloomberg regarding OpenAI’s video model, Sora, which is due for release later this year. Under YouTube’s terms of service, OpenAI is not allowed to scrape videos to train its AI.

However, the report also notes that Google has done the same in training the AI models used in Gemini.

Apparently, the company has similarly used YouTube video transcripts to train AI. The report also notes that changes to Google’s terms of service have allowed the company to scrape training data from other publicly visible content on its services, including public Google Docs and Sheets files and even reviews left on Maps. It’s said that, despite Google knowing OpenAI was scraping data from YouTube, the company hasn’t acted on it, as that might lead to backlash over Google doing the same.

Aggressive means of capturing new data for training more advanced models are likely to continue as existing data dries up. AI research institute Epoch estimates that all existing data could be used up by 2026, as Google, OpenAI, and others are using data faster than it is being created.


Follow Ben: Twitter/X, Threads, Bluesky, and Instagram


You’re reading 9to5Google — experts who break news about Google and its surrounding ecosystem, day after day. Be sure to check out our homepage for all the latest news, and follow 9to5Google on Twitter, Facebook, and LinkedIn to stay in the loop. Don’t know where to start? Check out our exclusive stories, reviews, how-tos, and subscribe to our YouTube channel
