The issue at the heart of the debate is simple. As AI writes more of what we read online, tomorrow’s models could start learning not from people, but from other machines. According to Copyleaks, an AI content detection firm, there has been an astronomical surge of more than eight thousand percent in artificially generated web content since late 2022.
A recent Oxford and Cambridge study uses the term “model collapse” to describe the looming threat. Its authors argue that when language models feast mostly on synthetic content, each training cycle chips away at their originality and reliability.
“We discover that indiscriminately learning from data produced by other models causes ‘model collapse,’” the authors wrote, warning of a process that can strip AI of subtlety and depth.
As more content loops back on itself, models become less grounded in true understanding. This feedback effect amplifies errors, biases, and the risk of misinformation.
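The feedback effect can be illustrated with a toy simulation (an illustrative sketch, not the study’s actual experiment): repeatedly fit a simple statistical model to samples drawn from the previous generation’s fit, and the fitted spread, a crude stand-in for diversity, tends to decay toward zero.

```python
import random
import statistics

def collapse_demo(generations=200, n=5, seed=0):
    """Toy caricature of model collapse: generation 0 is the 'human'
    data distribution; each later generation is fit only to samples
    produced by the previous fit. Rare 'tail' knowledge is lost and
    the estimated spread drifts toward zero."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the real distribution
    spreads = [sigma]
    for _ in range(generations):
        # Train the next "model" purely on the previous model's output.
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)  # finite-sample estimate
        spreads.append(sigma)
    return spreads
```

Small sample sizes exaggerate the effect for demonstration; the underlying point is that each refit discards a little of the original distribution’s tails, and the losses compound.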
Preventing Collapse: Battle for Clean Data
AI companies are scrambling to stop the slide by identifying what is genuine. Their tools range from invisible watermarks embedded in machine-written text to provenance tagging and detection systems that spot the statistical signatures of synthetic content.
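One common detector design, sketched here with a hypothetical hash rule rather than any vendor’s actual system, is the “green list” watermark popularized in the research literature: generation nudges the model toward a pseudorandom subset of tokens, and detection simply measures how often that subset appears.

```python
import hashlib

def is_green(prev_token: str, token: str) -> bool:
    """Toy 'green list' rule: hash the (previous, current) token pair
    and call half the hash space green. A real scheme partitions the
    model's vocabulary with a keyed pseudorandom function."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(tokens):
    """Detector side: the share of green tokens among consecutive
    pairs. Unwatermarked text hovers near 0.5; text generated with a
    green-list bias scores well above it."""
    hits = [is_green(p, t) for p, t in zip(tokens, tokens[1:])]
    return sum(hits) / len(hits)
```

In practice the detector reports a statistical score (e.g. a z-score against the 0.5 baseline) rather than a raw fraction, so longer texts give more confident verdicts.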
Some are building what they call “human-in-the-loop” pipelines, relying on trusted sources such as books, government archives, or verified user input to keep their data fresh and real.
One Indian founder, Suvrat Bhooshan of Gan.ai, explained, “While model collapse from synthetic data isn’t a major concern right now as real data still dominates, we know it can become a problem.” His firm uses watermarking tools to weed out its own outputs and keep datasets clean.
At Gnani.ai, which is developing a massive voice model, co-founder Ganesh Gopalan considers this a critical hurdle, stating that their approach is to tag and filter synthetic speech while ensuring that at least a fifth of their data always comes from real people.
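That kind of mixing rule can be sketched in a few lines (a hypothetical illustration, not Gnani.ai’s actual pipeline): given records already tagged as human-recorded or synthetic, cap the synthetic share so that real samples never fall below the chosen floor.

```python
import random

def build_training_mix(real, synthetic, min_real_fraction=0.2, seed=0):
    """Hypothetical tag-and-filter sketch: keep every real record and
    admit at most enough synthetic records that real data still makes
    up at least `min_real_fraction` of the final mix.
    Assumes 0 < min_real_fraction <= 1."""
    rng = random.Random(seed)
    # real / (real + synthetic) >= f  =>  synthetic <= real * (1 - f) / f
    max_synthetic = int(len(real) * (1 - min_real_fraction) / min_real_fraction)
    kept_synthetic = rng.sample(synthetic, min(len(synthetic), max_synthetic))
    mix = list(real) + kept_synthetic
    rng.shuffle(mix)
    return mix
```

With 20 real records and a 20% floor, at most 80 synthetic records survive the filter, regardless of how much machine-generated audio is available.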
Within India’s rapidly growing AI sector, this question takes on added weight. Startups tasked with creating foundational models for the IndiaAI Mission confront the challenge of curating reliable language data for a nation of dizzying diversity.
Some technologists, like independent expert Tanuj Bhojwani, believe the fears are overblown, arguing that most reputable companies train on private or protected data, not public material that could be flooded with fakes.
Others, such as Sravan Kumar Aditya, co-founder of Toystack AI, remain unconvinced. He said, “I don’t recommend training on synthetic data because it may not be correct or accurate. If you want a good model, train it on real-world examples created by humans.”
Despite the churn, industry leaders see hope in smaller, local models that lean on regional languages and human involvement rather than anonymous online text.
They point to untouched archives, regional newspapers, and the cultural richness of India’s many languages as a chance to build AI that truly reflects society.
With every new generation, the debate on purity and authenticity in AI data is only going to heat up.