How can Open Source AI stay competitive when training requires copyrighted data?



At the center of today’s AI ecosystem are foundation models and Generative AI (GenAI) systems that create novel outputs by learning from massive datasets. These datasets include text, images, video, and music, enabling GenAI to generate human-like responses, cultural references, and topic-specific knowledge.

Unlike human learning, however, AI training involves absorbing huge amounts of copyrighted content. This raises major concerns about copyright infringement, licensing, and compensation when AI models reproduce or closely imitate original material.

The Copyright Challenge for Open Source AI

Content creators and copyright holders fear their work is being used without license or payment. On the other hand, many in the AI industry argue that training AI on publicly available content should be allowed, since humans also learn by reading books, watching shows, and consuming culture.

The tension lies in how GenAI models can remix or sometimes reproduce copyrighted content. While AI doesn’t usually output training data verbatim, its ability to recall fragments challenges legal assumptions about fair use and copyright law.

Why Open Source AI Matters

Open Source Software revolutionized the tech industry by enabling collaboration, innovation, and global talent contributions. Similarly, Open Source AI models could accelerate innovation across startups, SMEs, and research communities.

But the best-performing AI models today are typically trained on copyrighted data. If access to these models and datasets becomes limited to closed, proprietary systems, Open Source AI innovation could be stifled.

Licensing, Commercial Use, and OSI Standards

The Open Source Initiative (OSI) has been developing an Open Source AI Definition that guarantees the freedom to use a system for any purpose, including commercially. However, there’s a key barrier: if Open Source AI models are trained on copyrighted works, they may be considered derivative works, which would restrict how they can be used commercially.

SMEs and startups: They lack the resources to individually license millions of copyrighted works.
Large tech companies: They may license content for their own use but not extend those rights to downstream users.

This creates an uneven playing field where open source models risk falling behind closed, commercial ones.

Lessons from Patents and the Music Industry

The software patent problem was partly addressed through initiatives like the Open Invention Network, where companies pooled patents and offered non-assertion pledges to protect Open Source development. But unlike patents, copyrighted data is usually owned by third parties outside of the tech industry.

A closer analogy comes from the music industry. For decades, rights organizations have managed performance royalties through collective and compulsory licensing. Streaming platforms like Spotify, Apple Music, and YouTube now pay royalties via centralized systems. YouTube’s Content ID identifies copyrighted material at scale so that revenue can be routed to rights holders.

A similar rights management framework for AI training data could ensure content owners are compensated whenever models trained on their work generate revenue.
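
To make the idea concrete, here is a minimal sketch of how a revenue-linked royalty pool might be divided. It is only an illustration: the function name, the 5% royalty rate, the rights holder names, and the notion of an integer "usage weight" per rights holder are all hypothetical, and measuring how much a model actually relies on any given work remains an open problem that this sketch does not address.

```python
from fractions import Fraction


def allocate_royalties(revenue_cents, royalty_rate, usage_weights):
    """Split a royalty pool among rights holders in proportion to usage.

    revenue_cents -- model revenue for the period, in integer cents
    royalty_rate  -- share of revenue set aside for rights holders (Fraction)
    usage_weights -- hypothetical mapping: rights holder -> integer usage weight
    """
    total_weight = sum(usage_weights.values())
    if total_weight <= 0:
        raise ValueError("usage weights must sum to a positive number")

    # Royalty pool in whole cents (any fraction of a cent is dropped).
    pool = int(revenue_cents * royalty_rate)

    # Pro-rata share for each holder, rounded down to whole cents.
    payouts = {
        holder: int(pool * Fraction(weight, total_weight))
        for holder, weight in usage_weights.items()
    }

    # Give the few cents lost to rounding to the largest contributors first.
    leftover = pool - sum(payouts.values())
    by_weight = sorted(usage_weights, key=usage_weights.get, reverse=True)
    for holder in by_weight[:leftover]:
        payouts[holder] += 1
    return payouts


if __name__ == "__main__":
    # Assumed figures: $10,000 of revenue and a 5% royalty rate.
    shares = allocate_royalties(
        1_000_000,
        Fraction(5, 100),
        {"publisher_a": 600, "label_b": 300, "photographer_c": 100},
    )
    print(shares)
```

Integer cents and exact fractions are used so that the payouts always add up to the pool, which is the kind of accounting guarantee a collective rights body would likely need.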

Toward a Sustainable Open Source AI Future

For Open Source AI to thrive, the industry needs:

Clear licensing frameworks for copyrighted data.
Royalty or compensation systems tied to AI model usage.
Global standards that balance innovation with copyright protection.

Without these, Open Source AI risks being sidelined while closed, corporate-owned models dominate. A fair solution could mirror the music rights ecosystem, linking compensation to usage and revenue rather than blocking access to data altogether.

Ofer Hermoni

Founder & Chief AI Officer

David Edelsohn

Founder & Chief AI Officer