
Too much AI-generated data can cause an AI model to collapse

26 July 2024


Scraping from other models is a bad thing

A new study published in Nature has found that training AI models using datasets created by other AI models can lead to “model collapse,” where the models start producing increasingly nonsensical outputs over time.

One model began with a text about medieval European architecture and, by the ninth generation, was spouting nonsense about jackrabbits.

The research, led by Ilia Shumailov of Google DeepMind and Oxford, found that because models favour their most probable outputs, the less common lines of text in a training dataset tend to get dropped from what they generate.

This means models trained on the output of earlier models cannot carry those nuances forward, and each generation in the recursive loop loses a little more of them.

Duke University assistant professor Emily Wenger said that an AI model generating images of dogs will focus on recreating the most common breeds in its training data, over-representing Golden Retrievers at the expense of the Petit Basset Griffon Vendéen.

“If later models are trained on an AI-generated dataset that over-represents Golden Retrievers, the problem gets worse. After enough cycles, the model will forget about less common breeds like the Petit Basset Griffon Vendéen and only generate pictures of Golden Retrievers. Eventually, the model will collapse and be unable to generate meaningful content.”

While she admits that a glut of Golden Retrievers is hardly a catastrophe, the collapse process is a serious problem for producing meaningful, representative outputs that include less common ideas and ways of writing.

“This is the problem at the heart of model collapse,” she said.
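
The dynamics are easy to reproduce in miniature. The Python sketch below is our own toy illustration, not code from the Nature paper: each "model" is just a categorical distribution over dog breeds, and each generation a finite dataset is sampled from the previous model, with the next model refitted on it. The breed list and probabilities are invented for the example.

# A minimal sketch (not from the paper) of how recursive training erodes
# rare categories. Each "generation" is the empirical distribution of
# samples drawn from the previous generation -- the categorical analogue
# of retraining a model on its predecessor's output.
import random
from collections import Counter

random.seed(0)

# Hypothetical breeds and probabilities; the rare breed sits in the tail.
BREEDS = ["golden_retriever", "labrador", "beagle", "petit_basset_griffon_vendeen"]
probs = [0.50, 0.30, 0.18, 0.02]

SAMPLES_PER_GEN = 200   # finite sampling is what loses the tail
GENERATIONS = 10

for gen in range(GENERATIONS):
    # "Generate a dataset" by sampling from the current model ...
    data = random.choices(BREEDS, weights=probs, k=SAMPLES_PER_GEN)
    # ... then "train" the next model on it (maximum-likelihood fit).
    counts = Counter(data)
    probs = [counts[b] / SAMPLES_PER_GEN for b in BREEDS]
    print(f"gen {gen + 1}: "
          + ", ".join(f"{b}={p:.3f}" for b, p in zip(BREEDS, probs)))

Run it and the Petit Basset Griffon Vendéen's share drifts towards zero, and once it reaches zero it can never come back: a breed the model never samples is a breed the next model can never learn.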

 
