Landmark Court Decision in the EU: Copyright Permissibility of Text and Data Mining for the Purpose of AI Training

Does the creation of a data set containing copyright protected content, which can be used to train a generative AI, constitute a copyright infringement? The Hamburg Regional Court had to answer this question in a landmark decision. According to the court, reproductions for this purpose are lawful. The court did not address whether copyright protected content could lawfully be replicated during the actual AI training. Nonetheless, this decision and the reasoning behind it could facilitate collaboration between research and industry to train AI in the future.

An article by Arne Radeisen and Paul Suilmann

Case

The plaintiff works as a stock photographer. The defendant is a non-profit organization (NPO) that created a data set called LAION 5B, which contains almost 6 billion image-text pairs and is made available for free use on the internet. In 2021, the image in dispute – an image file posted on the website of a stock photo agency and watermarked by the agency – was downloaded, analyzed, and included in the data set. Since early 2021, the photo agency’s website has included a reservation of use – a statement prohibiting the use of its content for text and data mining (TDM). The reservation of use (opt-out) was written in natural language and referred to all images on the website. The photographer claims that the replication by downloading the image infringes his copyright, since no exceptions apply. In particular, the plaintiff objects that the data set allows the creation of competing products through generative AI and restricts the commercial exploitation of his works. Additionally, he argues, the replication was prohibited by the reservation of use on the website from which the image was downloaded. The defendant counters that, as a non-commercial research organization, it is not bound by the reservation of use. The plaintiff disputes this and accuses the defendant of commercial collaborations with large AI providers. The plaintiff seeks to have the defendant prohibited from replicating the image in question for the creation of data sets and claims damages.

Court dismisses the action

The court dismisses the action. It agrees with the defendant that the download of the image meets the copyright exception for scientific research (Section 60d German Copyright Act (UrhG)/Art. 3 DSM Directive). The download was made for the purpose of TDM. The court rejected a restrictive interpretation that would exclude AI training data sets from TDM: It argues that the European legislator views AI training as TDM, since Art. 53 AI Act references the opt-out of Art. 4 DSM Directive, which was implemented by Section 44b UrhG. The application of Section 44b UrhG also meets the three-step test pursuant to Art. 5 (5) InfoSoc Directive. The court nevertheless doubts whether the exception of Section 44b UrhG applies, because an effective reservation of use (Section 44b (3) UrhG) may have been declared. As the rights holder within the meaning of para. 3, the picture agency was entitled to declare the reservation of use; the plaintiff, in turn, was entitled to invoke it. The court discusses in detail whether the declaration in language written for humans is also “machine-readable” within the meaning of the provision. It reasons that modern technologies – such as AI systems – can determine the meaning of a reservation written in natural language, which argues for machine-readability. However, the court did not take a final stance on this. In its opinion, the defendant can invoke the exception of Section 60d UrhG, according to which replications for TDM are permitted for the purposes of scientific research. The defendant did not pursue any commercial purposes within the meaning of Section 60d (2) No. 1 UrhG, since it made the database freely available. Nor did any commercial companies have a determining influence on the organization. Such influence would render the exception inapplicable only if, in addition, preferential access to the data set were granted. Since the plaintiff had not presented sufficient facts showing that commercial companies had a decisive influence on the defendant’s activities, the court left open whether any preferential access was granted to commercial providers.

Review

Ultimately, the court’s decision is convincing. The creation of data sets that enable AI training is rightly considered TDM. However, the reasoning concerning the machine-readability of the reservation of use (Art. 4 DSM Directive) is not persuasive. The court makes a clear distinction between the various steps of AI training and clarifies that the sole subject of this dispute is the creation of a training dataset. This distinction is compelling; the court also applies it consistently for the majority of the ruling.

1. The creation of a data set is TDM

The court considers that the defendant had obtained information about “correlations”, which is required for an activity to be considered TDM according to Art. 2 No. 2 DSM Directive. Its explanation as to why the replications were made for the purpose of TDM is sound. The court emphasizes that it does not contemplate whether replications during the actual training constitute TDM (convincingly arguing in favor, see de la Durantaye, Control and Compensation. A Comparative Analysis of Copyright Exceptions for Training Generative AI). When creating the data set, the defendant linked the images with information about them by only including images with a sufficiently similar description. The calculated “similarity score” between image and description may not be a correlation in the statistical sense. Nevertheless, the automated analysis provided information about the relationship between image and description. This constitutes a “correlation” in the literal meaning of the word.
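The filtering step described above can be illustrated with a minimal sketch. It assumes that each image and each description has already been converted into a numeric vector by some model and that a pair is kept only if the cosine similarity of the two vectors reaches a cut-off. The vectors, the threshold of 0.3, the URLs, and the function names are hypothetical and chosen purely for illustration; the sketch does not claim to reproduce the defendant's actual pipeline.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no information
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold):
    """Keep only image-text pairs whose similarity score reaches the threshold."""
    kept = []
    for pair in pairs:
        score = cosine_similarity(pair["image_vec"], pair["text_vec"])
        if score >= threshold:
            # Only the link, the description and the score are stored.
            kept.append({"url": pair["url"], "caption": pair["caption"], "score": score})
    return kept

# Hypothetical candidates: the vectors stand in for model-generated embeddings.
candidates = [
    {"url": "https://example.org/a.jpg", "caption": "a dog on a beach",
     "image_vec": [0.9, 0.1, 0.0], "text_vec": [0.8, 0.2, 0.1]},
    {"url": "https://example.org/b.jpg", "caption": "click here to subscribe",
     "image_vec": [0.0, 1.0, 0.0], "text_vec": [1.0, 0.0, 0.0]},
]

dataset = filter_pairs(candidates, threshold=0.3)  # threshold is an assumed value
for entry in dataset:
    print(entry["url"], round(entry["score"], 2))
```

In this sketch the second pair is discarded because its description is unrelated to the image. The score is not a statistical correlation coefficient, but it does express a relationship between image and description, which is the point the court relies on.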

The court is also not convinced by the plaintiff’s argument that TDM for the purpose of AI training creates competing products at the author’s expense. The scientific development of AI models and the necessary preparatory acts serve much broader objectives than only the development of generative AI. Additionally, the European legislator wanted to increase innovation by allowing TDM. Therefore, the aim was to include fields of application that the legislator was not aware of at the time, such as generative AI.

2. Machine-readability of the reservation of use

The court’s arguments as to whether the reservation of use relied on by the plaintiff was “machine-readable”, as required by Section 44b (3) UrhG, are not fully convincing. The court convincingly stresses that this requirement aims to enable automated data processing. By referencing the explanatory memorandum to the law, the court implicitly agrees with the view that “machine-readable” within the meaning of Art. 4 DSM Directive means “comprehensible for machines” and not “readable through machines”. In practice, robots.txt files are generally used for this purpose. The court, however, considers this requirement to be met if the reservation of use is provided in natural language: By using AI, it argues, web crawlers can easily process such declarations automatically as well. The court points to Art. 53 (1) lit. c AI Act, which requires providers of general purpose AI models to use “state-of-the-art technology”, i.e., in the court’s view, technology that can analyze natural language (e.g., English or German). This provision, however, addresses providers of general purpose AI models. It does not apply to the defendant, which only provides the data set and does not train its own models. The recourse to the AI Act is surprising, since the court previously differentiated very precisely between the distinct steps of AI training and emphasized the various possible uses of data sets such as the one created by the defendant. The court thus abandons its own distinction between the various steps of AI training.
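To illustrate what a reservation in a standardized format looks like from a crawler's perspective, the following minimal sketch uses the robots.txt parser from Python's standard library. The crawler name ExampleTDMBot, the URL, and the file contents are hypothetical. The sketch only shows the technical check; it does not answer whether such an entry is a legally effective reservation under Section 44b (3) UrhG.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is excluded, all others are allowed.
ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""

def may_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

image_url = "https://example.org/images/photo.jpg"

tdm_allowed = may_crawl(ROBOTS_TXT, "ExampleTDMBot", image_url)
other_allowed = may_crawl(ROBOTS_TXT, "SomeOtherAgent", image_url)

print("TDM crawler may fetch:", tdm_allowed)
print("Other agent may fetch:", other_allowed)
```

A check of this kind is a simple string comparison that costs next to nothing per page. Interpreting a reservation written in natural language, by contrast, would require running a language model over the relevant text of every website.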

The court’s view that the determination of what is machine-readable depends on the respective state of the art is correct. The court emphasizes that technological progress has advanced the abilities of machines. If machines were capable of processing natural language without difficulty, there would be no qualitative difference to special formats that can only be processed by machines.

Ultimately, the persuasiveness of the court’s argument depends not only on legal questions, but also on factual ones: With what effort and with what degree of legal certainty can AI-supported applications find and reliably interpret human-readable declarations? The judgment does not sufficiently answer this. The court also ignores possible costs that may arise from additional filtering for reservations of use in natural language. Especially when processing huge amounts of data, it should not be neglected that the use of AI tools to interpret billions of data points requires considerable computing power, which has both ecological and economic consequences. These costs can be prohibitive and thus hinder innovation, the promotion of which is the goal of Art. 2-4 DSM Directive. And even if the filtering costs do not create insurmountable barriers for all companies, at least smaller start-ups could be affected. This weakens competition. Thus, the stronger arguments favor obliging the declaring party, as the cheapest cost avoider, to use specific machine-readable, standardized formats in order to conserve economic and ecological resources.

It remains unclear why the court discussed the general exception for TDM (Art. 4 DSM Directive) and the machine-readability of the reservation in such detail if, in the end, it based its decision on the exception for scientific research. As a matter of fact, Art. 4 DSM Directive and the corresponding national provision were not relevant to the decision. This means that a significant part of the judgment is an obiter dictum. Apparently, the lively debate about machine-readability tempted the court to weigh in.

3. Exception for scientific research

Furthermore, the reasoning concerning the exception for scientific research is entirely convincing. This interpretation and application of Art. 3 DSM Directive offers an interesting approach. The scientific debate has so far paid little attention to the exception; the commercial character of AI products has been understood as implying the supposed commercial character of upstream development steps. The court’s interpretation of the exception has the potential to incentivize private-sector investments, from which not only commercial providers, but also others could benefit. By allowing cooperative structures and personnel overlaps between the NPO, which provides the data set, and commercial companies, the court provides incentives for open and transparent research.

The court only ruled on the creation of data sets. However, the ruling could also have further implications for collaboration between science and industry in AI training. It could allow the training of general-purpose AI models. As long as the general-purpose model is freely available, it should not be considered commercial. This model could then be used for non-commercial or commercial uses. Downstream, it could be further fine-tuned with licensed proprietary data, resulting in specialized models. The opposing view (Dornis/Stober, 65.) is not convincing, because it is based on the flawed premise that the AI model itself is a reproduction of the training data.

A detailed version of this blog post in German appears in the Zeitschrift für Urheber- und Medienrecht (ZUM).

Published under licence CC BY-NC-ND. 

  • Paul Suilmann is a research assistant at the Chair of Civil Law and Law of Digitalization at the Humboldt University of Berlin and editor of the RAILS-blog. His doctoral research focuses on liability and private enforcement in technology law, particularly in the area of European AI regulation.

  • Arne Radeisen is a research assistant at the Chair of Civil Law and Law of Digitalization at the Humboldt University of Berlin. He has already delivered a number of lectures on the subjects of text and data mining and artificial intelligence. In his doctoral thesis, he employs a law and economics perspective to analyze data access rights.
