One model began with a text about European architecture in the Middle Ages and, by the ninth generation, was talking nonsense about bunnies.
The research, led by Ilia Shumailov of Google DeepMind and Oxford, found that AI models tend to lose the less common material in their training data, the rarer phrases, facts and styles that sit in the tails of the distribution.
This means that models trained on the output of earlier models cannot carry those nuances forward, and each generation compounds the loss in a degenerative feedback loop.
Emily Wenger, an assistant professor at Duke University, offered an example: a model generating images of dogs will tend to reproduce the breeds most common in its training data, over-representing Golden Retrievers at the expense of rarer breeds such as the Petit Basset Griffon Vendéen.
“If later models are trained on an AI-generated dataset that over-represents Golden Retrievers, the problem gets worse. After enough cycles, the model will forget about less common breeds like the Petit Basset Griffon Vendéen and only generate pictures of Golden Retrievers. Eventually, the model will collapse and be unable to generate meaningful content.”
While she concedes that a surplus of Golden Retrievers may be harmless, the same collapse process is a serious obstacle to producing meaningful, representative output that includes less common ideas and writing styles.
“This is the problem at the heart of model collapse,” she said.
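The dynamic Wenger describes can be sketched with a toy simulation. This is not code from the study, and the breed names and starting frequencies are illustrative assumptions: each "generation" re-estimates the breed distribution from a finite sample of the previous generation's output, and the rarest breeds gradually vanish.

```python
# Toy simulation of model collapse on a categorical distribution.
# Assumed, illustrative breed frequencies; not taken from the research.
import numpy as np

rng = np.random.default_rng(0)

breeds = ["golden_retriever", "labrador", "beagle", "petit_basset_griffon_vendeen"]
probs = np.array([0.55, 0.30, 0.13, 0.02])  # starting "real world" frequencies

SAMPLES_PER_GENERATION = 500  # size of the synthetic dataset each model produces

for generation in range(10):
    # "Train" the next model: draw a finite synthetic dataset from the current
    # model, then re-estimate the breed frequencies from that dataset alone.
    counts = rng.multinomial(SAMPLES_PER_GENERATION, probs)
    probs = counts / counts.sum()
    print(f"gen {generation + 1}: " +
          ", ".join(f"{b}={p:.3f}" for b, p in zip(breeds, probs)))
    # Once a rare breed draws zero samples, it can never reappear in later
    # generations: the tail of the distribution is lost for good.
```

Running the sketch shows the rare breed's share drifting toward zero within a handful of generations, while the common breeds absorb its probability mass, a miniature version of the forgetting the researchers describe.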