In the early days of generative AI, the internet felt like a vast, open frontier. Developers treated data as a public commons, scraping millions of images and documents to teach models like ChatGPT and Midjourney how to mimic human expression. But that "wild west" era is rapidly closing.

As I argue in my recent article for ProMarket, we are witnessing a structural shift from an open web to a restricted data economy. While the move toward formal licensing agreements between AI labs and content owners (like the recent deals between OpenAI and News Corp or Reddit) is framed as a win for "responsible AI," it carries a hidden risk: structural market consolidation.

The Consolidation Trap

The problem isn't licensing itself, but the way it’s happening. Bespoke, multimillion-dollar deals create a "dual consolidation" effect that hurts innovation on both sides:

  • On the AI side: Only the wealthiest labs can afford the high fixed costs of negotiating, auditing, and paying for massive datasets. Smaller startups and academic researchers are being priced out, creating an "AI oligopoly" where access to high-quality training data becomes a permanent barrier to entry.
  • On the content side: Only the largest media conglomerates have the bargaining power to sit at the table with Big Tech. Independent journalists, artists, and smaller forums are left out, potentially leading to a "data desert" where only corporate-sanctioned perspectives fuel future AI.

Why "Static" Data Isn't Enough

Data isn't a one-time purchase; it's a perishable resource that goes stale. In our research on news recommendation, we found that the value of personalized data can drop to zero in just 36 hours. AI needs a continuous flow of fresh, human-generated content to stay relevant.
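To make "perishable" concrete, here is a minimal sketch assuming an exponential decay curve. The functional form and the half-life parameter are my illustrative assumptions, not estimates from the news-recommendation study:

```python
def data_value(initial_value: float, hours_elapsed: float,
               half_life_hours: float = 6.0) -> float:
    """Toy model: the value of fresh data decays exponentially with age.

    The exponential form and the 6-hour half-life are illustrative
    assumptions, not empirical estimates from the cited research.
    """
    return initial_value * 0.5 ** (hours_elapsed / half_life_hours)

# Under these assumed parameters, a personalization signal worth
# $1.00 when fresh is nearly worthless by the 36-hour mark:
for t in (0, 6, 12, 24, 36):
    print(f"after {t:2d} hours: ${data_value(1.00, t):.4f}")
```

Whatever the exact curve, the economic implication is the same: a one-time data purchase buys a depreciating asset, so whoever controls the ongoing flow controls the market.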

If creators feel excluded or uncompensated, they may stop contributing or retreat behind paywalls. This triggers a self-reinforcing scarcity: lower data quality leads to lower model performance, which eventually forces smaller firms to exit the market, leaving us with fewer choices and more biased AI.
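To see why this spiral is self-reinforcing rather than self-correcting, consider a toy simulation. The withdrawal rates below are assumed purely for illustration; nothing here is calibrated to real data:

```python
def scarcity_spiral(quality: float = 1.0, steps: int = 8,
                    shock: float = 0.05, sensitivity: float = 0.2) -> list[float]:
    """Toy simulation of self-reinforcing data scarcity.

    Assumptions (illustrative only): an initial 'shock' of creators who
    withdraw because they feel uncompensated, plus further withdrawal
    proportional to how much data quality has already been lost.
    """
    history = [quality]
    for _ in range(steps):
        withdrawal = shock + sensitivity * (1.0 - quality)
        quality *= 1.0 - withdrawal
        history.append(round(quality, 3))
    return history

print(scarcity_spiral())
# Each period's loss is larger than the last: the decline compounds.
```

The point is the shape, not the numbers: once creator withdrawal responds to quality that has already been lost, each round of decline accelerates the next.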

The Solution: A Statutory Licensing Regime

To prevent this digital enclosure, we need to treat data as infrastructure. Rather than a patchwork of exclusive private deals, I propose a statutory or collective licensing framework, similar to how music royalties are managed.

How it would work:

  • Standardized Access: All qualified AI developers would have the right to train on copyrighted material under transparent, non-discriminatory (FRAND) terms.
  • Regulated Fees: A rate-setting body would establish fees that remunerate both large publishers and individual creators (a toy allocation sketch follows this list).
  • Level Playing Field: Smaller entrants could access the same quality of "fuel" as the giants, ensuring that competition stays "in the market" rather than "for the market."
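To make the rate-setting mechanics concrete, here is a minimal sketch of how a collective pool might be split: licensed developers pay a flat, non-discriminatory fee into a pool, and the proceeds are distributed pro rata by how much of each rightsholder's content was used in training. Every name and number below is hypothetical, and the pro-rata rule is one plausible formula, not the article's specific proposal:

```python
from dataclasses import dataclass

@dataclass
class Rightsholder:
    name: str
    tokens_contributed: int  # volume of licensed content used in training

def distribute_pool(pool: float, holders: list[Rightsholder]) -> dict[str, float]:
    """Split a collected licensing pool pro rata by contribution volume.

    A pro-rata split is one plausible rule a rate-setting body could
    adopt; it is an illustrative assumption, not the article's proposal.
    """
    total = sum(h.tokens_contributed for h in holders)
    return {h.name: pool * h.tokens_contributed / total for h in holders}

# Hypothetical example: a $1M quarterly pool collected under FRAND terms
holders = [
    Rightsholder("LargePublisher", 800_000_000),
    Rightsholder("IndependentJournalist", 5_000_000),
    Rightsholder("CommunityForum", 195_000_000),
]
for name, payout in distribute_pool(1_000_000.0, holders).items():
    print(f"{name}: ${payout:,.2f}")
```

A pure pro-rata rule still favors volume, so a real rate-setting body might add per-creator floors or quality weights. The essential feature is that the allocation rule is transparent and the same for every licensee, rather than the outcome of bilateral bargaining power.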

Conclusion

If we leave data access to the whims of private contracts, we risk a future where a few "gatekeeper" firms control the collective sum of human knowledge. By using copyright policy as a form of ex ante competition policy, we can ensure that the AI revolution remains open, diverse, and truly innovative.

Read the full analysis at ProMarket.org.