Intellectual Property - Model Training Infringement

Saihub.info

Category: Intellectual Property

Potential Source of Harm: Model Training Infringement

Updated January 15, 2025

Nature of Harm

Generative AI models -- including large language models (LLMs) and diffusion models -- are trained on huge amounts of content, with many of the largest models using a large percentage of the high-quality content that is available on the Internet as training data. Accordingly, the training data of these models often includes content that is protected by copyright, although this varies among LLMs, and lack of transparency by LLM providers on the nature of their training data often makes the degree of training on copyrighted content difficult to determine. Some studies have shown that a large percentage of LLM output duplicates existing content.

Content owners have alleged that this approach to training produces at least two main types of infringement:

input infringement -- arising from the training itself, based upon the theory that use of copyrighted content for training of AI models is barred by copyright law
output infringement -- arising when an LLM generates content that infringes IP contained in its training data; this is less common than input infringement, but provides a potentially stronger claim under IP law if and when it demonstrably occurs.

With respect to input infringement, LLM providers have generally argued that copyright law does not support claims of infringement, using arguments that vary by country, including general copyright exceptions for "fair use" (applicable in the US) and "fair dealing" (applicable in various other countries), as well as specific copyright exceptions created in the EU, UK and elsewhere. These issues are being actively debated and contested, including in various court proceedings. Various legal analyses of these issues can be found online, including ones by Willkie and TechnoLlama, and a lengthy discussion on Hacker News.

With respect to output infringement, the technical functionality of generative models (which use stochastic processes) means that direct copying is unlikely. However, there is substantial evidence that the high dimensionality of large generative models allows them in various circumstances to essentially "memorize" training data, producing output that is highly similar to that data. There have been prominent examples of this occurring with the Stable Diffusion and Midjourney image-generating diffusion models, and the New York Times has presented evidence of it in its lawsuit against OpenAI.
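One simple way to make "highly similar to training data" concrete is to measure how many word sequences in a generated text also appear verbatim in a reference text. The sketch below is illustrative only: the function names, the whitespace tokenization and the 5-word window are assumptions made here, not a method described by the sources above, and a high score is a signal for closer review rather than a legal finding of infringement.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercased word n-grams in `text`."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(output: str, reference: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also occur in the reference.

    A value near 1.0 means the output largely reproduces the reference
    word for word; a value near 0.0 means little verbatim overlap.
    """
    output_ngrams = ngrams(output, n)
    if not output_ngrams:
        return 0.0  # output too short to form a single n-gram
    return len(output_ngrams & ngrams(reference, n)) / len(output_ngrams)


# Hypothetical texts, invented for illustration.
reference_text = "the quick brown fox jumps over the lazy dog near the river bank"
copied_output = "quick brown fox jumps over the lazy dog"
unrelated_output = "a completely different sentence about copyright law and licensing"

copied_score = overlap_fraction(copied_output, reference_text)
unrelated_score = overlap_fraction(unrelated_output, reference_text)
```

Verbatim overlap of this kind does not capture paraphrase or, for image models, visual similarity, so it understates memorization in those cases.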

Some excellent essays on this issue are:

Tom Wheeler (former US FCC Chairman) for Brookings Institution, Connecting the dots: AI is eating the web that enabled it (June 2024)
Tim O'Reilly, How to Fix “AI’s Original Sin” (June 2024).

Regulatory and Governance Solutions

It is generally accepted that fundamentally new legislation is not required to govern copyright infringement by generative AI models -- what is required is a proper application of copyright law to this new technology (which could require adjustments to copyright law and existing legislative exemptions from the scope of copyright).

There have been a substantial number of lawsuits against providers of generative AI models under copyright laws. These were initially slow to lead to significant decisions in favor of content owners, but some such outcomes are now occurring. In February 2024, a court in China reached what appears to be the first significant judgement in favor of an IP owner facing infringement by AI-generated content. In May 2024, a US federal court in California appeared ready to allow a lawsuit against Stability AI, Midjourney and others to proceed.

Regulators have also begun to look at these issues -- for example:

UK
- Information Commissioner's Office, The lawful basis for web scraping to train generative AI models (January 2024)
- consultation on Copyright and Artificial Intelligence (December 2024)
Hong Kong - consultation on enhancement of the Copyright Ordinance (July 2024).

There have also been proposals to impose new requirements on providers of large generative models for the benefit of content owners. For example, the EU AI Act is expected to impose transparency requirements with respect to the training data of so-called "foundation models", and it is likely that similar requirements will emerge in other countries.

Until legal positions on these issues are clearer, it is somewhat difficult for broadly accepted governance approaches to arise. Indeed, generative AI model providers (who would be the entities primarily subject to such governance approaches) would not generally agree that there is any significant harm that needs to be remedied through such governance. An example of the difficulty of reaching consensus on governance is a decision of the UK Intellectual Property Office in February 2024 to suspend a long-running effort to develop a voluntary code of practice on training AI models using copyrighted material.

Content Licensing. However, LLM providers have been proactive in negotiating content license deals. There have been reported deals by OpenAI, Microsoft, Google, Apple, Adobe, Amazon, Nvidia and Midjourney, with a wide range of leading and smaller content providers (see this ongoing tracker of reported licensing deals). It appears likely that these licensing programs will continue to expand.

Training Content Management. A minority of AI model providers actively govern training content to avoid IP infringement. A notable example is Adobe's Firefly image-generation model.

Technical Solutions

This potential harm is primarily legal in nature -- involving a dispute between competing economic interests (as is often the case for intellectual property disputes) -- and the technical aspects of the dispute are fairly well understood (subject to the uncertainty associated with the 'black box' nature of stochastic generative AI models). Technical solutions therefore have a somewhat limited role in addressing this potential harm.

However, technical tools are likely to play important roles in managing the competing interests of IP rights holders and providers of generative AI models, for example:

Google has suggested that the 'robots.txt' file on websites -- which has historically been used to request search engines not to crawl / index a site -- could also be used to opt a site out of being used for training data (including to opt out from the "text and data mining" exception under Article 4 of the EU Digital Single Market Directive on copyright). This proposal has drawn a mixed response in the content owner community.
Copyright owners have long used digital solutions to monitor for infringing content online. Such solutions are already being evolved to identify infringement by AI models.
Model providers also use technical tools to ensure that training content is not infringing (see 'Training Content Management' above).
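The robots.txt opt-out mentioned above can be sketched with Python's standard-library parser. This is a minimal illustration under stated assumptions: the robots.txt content, the example.com URL and the "ExampleSearchBot" agent are hypothetical, and "GPTBot" and "Google-Extended" are used here as AI-training user-agent tokens based on what their operators have published, which is not stated in the text above. Note that robots.txt is a request rather than an enforcement mechanism; it only works if a crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask two AI-training user agents to stay out of
# the whole site while leaving it open to every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/articles/1"  # hypothetical URL

training_allowed = may_crawl(parser, "GPTBot", page)
search_allowed = may_crawl(parser, "ExampleSearchBot", page)
```

With these rules, a compliant training crawler identifying as GPTBot would skip the page, while an ordinary search crawler could still index it.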

Government and Private Entities

As discussed above, the evolving disputes over IP infringement by AI models have primarily involved providers of such models and owners of digital content that could be infringed in the ways described above.

Governments and courts have started to become involved in these disputes. As discussed above, they are generally the same bodies that have broader responsibilities for IP and copyright law.