Tim O’Reilly popularized the term “Web 2.0” to explain the network effects of the participatory web enabled by dynamic web pages tied to personalization. He is excellent at summarizing large technical trends in a way that not only makes them relatable but also provides a useful framework when I need to explain these concepts to others.
So it was with great anticipation that I saw that O’Reilly had posted his thoughts on the intersection of copyright and AI.
The Risk
If the long-term health of AI requires the ongoing production of carefully written and edited content—as the currency of AI knowledge certainly does—only the most short-term of business advantage can be found by drying up the river AI companies drink from. Facts are not copyrightable, but AI model developers standing on the letter of the law will find cold comfort in that if news and other sources of curated content are driven out of business.
How to Fix “AI’s Original Sin”
The Opportunity
While large licensing deals are being cut by publishers that have the leverage and lawyers to negotiate massive, one-time deals, these are ultimately short-lived and only serve to build up the large AI providers that can afford to subsidize premium materials for their users. These deals just make the rich even richer.
The longer-term, sustainable opportunity he proposes is in allowing the internet-of-many to share in the revenues enabled by the output from these large AI systems.
But what is missing is a more generalized infrastructure for detecting content ownership and providing compensation in a general purpose way. This is one of the great business opportunities of the next few years, awaiting the kind of breakthrough that pay-per-click search advertising brought to the World Wide Web.
How to Fix “AI’s Original Sin”
The Challenge
Build a shared provenance and attribution service that keeps track of all documents available to AI systems and the permissions and royalty payment requirements around those documents.
O’Reilly alludes to the UNIX/Linux filesystem architecture of files with permissions set at the owner, group, and global levels as a potential model for controlling what publishers allow AI vendors seeking out material for their training sets.
If we expand this analogy to internet scale, could we apply the architecture of hosts files and the modern Domain Name System (DNS) to provide a dynamic infrastructure: a public “lookup” service through which any AI could locate the origin of any attributable fact, quote, or yet-to-be-determined “knowledge unit,” along with the license fee, should that AI wish to leverage the data?
In UNIX, the chmod command is used to change permissions. Could setting copyright permissions via a specialized version of “chmod” be the key to a new way to control access and compensate publishers at scale?
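As a rough sketch of that idea, copyright permissions could be encoded as chmod-style bit flags. The `READ`, `QUOTE`, and `TRAIN` flags below are invented for illustration, not part of any real standard:

```python
# Illustrative sketch only: copyright permissions expressed as chmod-style
# permission bits. READ/QUOTE/TRAIN are hypothetical flags.
READ  = 0o4  # anyone may read the work
QUOTE = 0o2  # AI may quote the work with attribution
TRAIN = 0o1  # AI may include the work in training sets

def copyright_chmod(mode: int) -> set[str]:
    """Decode a permission integer into the set of rights it grants."""
    granted = set()
    if mode & READ:
        granted.add("read")
    if mode & QUOTE:
        granted.add("quote")
    if mode & TRAIN:
        granted.add("train")
    return granted

# A publisher allowing reading and attributed quoting, but not training:
print(sorted(copyright_chmod(READ | QUOTE)))  # ['quote', 'read']
```

Just as a file owner runs `chmod 644 file.txt`, a publisher might set `READ | QUOTE` on an article, and a compliant crawler would check those bits before ingesting it.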
Food for thought.