
Legal precedent also supports treating AI training as transformative. Courts, notably in Authors Guild v. Google concerning the Google Books project, have held that transformative uses of copyrighted material, where the new use serves a different purpose than the original, can qualify as fair use. Google scanned millions of books and made snippets searchable online; this was deemed transformative because the purpose (search and indexing) differed from the original purpose (reading the book). AI training is arguably even more profoundly transformative: it converts vast amounts of diverse data into a complex mathematical model for the purpose of generating new content. The outputs these models produce typically serve different creative or functional ends and do not operate as direct commercial substitutes for the specific copyrighted works used in training. This aligns with the principles of fair use and undercuts claims that the training process itself is inherently unethical because of its data sources.
Cultural and artistic progress has, throughout history, relied fundamentally on borrowing, building upon, and remixing existing ideas, styles, and works. From Renaissance painters studying and emulating the techniques of their predecessors to musicians sampling audio clips to modern artists creating derivative fan works, creativity thrives by learning from what has come before. A robust public domain and a cultural understanding that allows this kind of learning and building are essential. AI training is simply the latest technological extension of this long-standing tradition; it learns from the accumulated cultural record – the digital public square – to enable the creation of new forms of art and content. Labeling this fundamental process of learning from shared knowledge as “unethical” just because it’s done by a machine ignores the entire history of human creativity and knowledge transmission. It risks stifling the very innovation that copyright was intended to promote by attempting to wall off the digital public square of ideas and styles.
Given that AI training on publicly available data parallels how humans learn, respects the limited scope of copyright, is supported by legal principles like transformative use, and continues the historical tradition of building on shared knowledge, where does the widespread idea of “unethical AI training” come from?
It is becoming increasingly evident that this narrative is being strategically shaped and amplified by larger tech companies and established industries. Companies like Adobe, for instance, promote their AI models as “ethically trained,” often implying that AI trained on broader, publicly available internet data is somehow unethical or even illegal. Yet these same companies have faced questions about the sources of their own training data, which reportedly includes vast amounts of web-scraped content and potentially even AI-generated images.
This framing, which sets up a false dichotomy between “ethical” AI (typically meaning their proprietary models, trained their way, perhaps on licensed data) and “unethical” AI (often meaning models trained on the internet commons), appears to be a calculated market manipulation tactic. By fueling public fear that AI training is inherently “stealing” from creators, these companies lay the groundwork for advocating new regulations and legal interpretations. Such rules are often crafted in ways that only large corporations, with their resources for licensing and legal navigation, are equipped to comply with. The result disproportionately benefits big tech players while hindering smaller developers, independent researchers, and platforms that rely on open-source models and publicly available data to innovate, such as the online AI community Civitai, which recently faced significant external pressure affecting its operations. It echoes historical concerns, like Bill Gates’ reported fear that open innovation by the “guy in a garage” could outpace large, established corporations, and it suggests that control, not just ethics, is a key driver behind the push for restrictive frameworks.
The public’s fear that AI is “stealing” art is largely based on a misunderstanding of how these systems work. AI models do not store copies of every image or text they see and then paste them together; they learn the underlying patterns, styles, and relationships. It is a process of abstraction and synthesis, much like a human artist studying various styles to develop their own, or a writer reading countless books to internalize prose and narrative. Creativity has always involved emulation and building upon existing styles, which is a key reason that style itself is not subject to copyright. Overly strict copyright interpretations, often advocated by powerful content holders who themselves leverage vast amounts of data, risk locking up the very building blocks of creativity: the public square of ideas and transformative possibilities.
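To make the distinction between copying and pattern-learning concrete, here is a deliberately toy sketch in Python, invented purely for illustration (it does not depict any particular production system): a “model” consisting of just two parameters is fit to thousands of examples, after which the examples can be discarded and only the learned pattern remains.

```python
# Toy sketch: a "model" distills many examples into a few learned
# parameters instead of storing the examples themselves.
# Purely illustrative; not how any production AI system is built.
import random

# 10,000 noisy examples of a hidden pattern (y = 3x + 2).
examples = [(x, 3 * x + 2 + random.gauss(0, 0.1))
            for x in (random.uniform(-1, 1) for _ in range(10_000))]

w, b = 0.0, 0.0  # the entire "model": two numbers
lr = 0.05        # learning rate

# Gradient descent nudges the parameters toward the pattern.
for _ in range(5):                 # a few passes over the data
    for x, y in examples:
        err = (w * x + b) - y      # prediction error
        w -= lr * err * x          # update weight
        b -= lr * err              # update bias

print(f"learned: w={w:.2f}, b={b:.2f}")  # roughly 3 and 2
# The 10,000 training pairs can now be deleted: the model carries
# the abstracted relationship, not copies of what it was shown.
```

The point of the sketch is proportion: the learned parameters are a tiny, fixed-size distillation of far more data than they could losslessly store, which is why training is better described as pattern extraction than as copying.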