FAU legal expert Prof. Dr. Paulina Pesch on the case GEMA versus OpenAI
Artificial intelligence is everywhere at the moment. Large language models like ChatGPT, however, need training data, and this is where the controversy arises. When training their language models, companies often violate German and European law. Recently, the Regional Court of Munich I handed down an important judgment in a lawsuit brought by the collecting society GEMA against OpenAI. FAU legal expert Prof. Dr. Paulina Pesch, Professorship for Civil Law, Law of Digitalization, Data Protection Law and Artificial Intelligence Law, explains in our interview what lay at the core of the dispute and which problems AI training raises in connection with data protection and copyright law.
On November 11, the Regional Court of Munich I handed down its judgment in the dispute between GEMA and OpenAI. What was the court case about?
GEMA filed a lawsuit against OpenAI, the provider of ChatGPT, over the use of song lyrics to train its AI models. The lawsuit centered on the lyrics of nine songs which, undisputedly, OpenAI had used to train two earlier versions of its AI model. In response to simple prompts, these models then reproduced substantial parts of the lyrics almost verbatim, with only slight changes. I was able to extract two verses from one of the models myself just a few days ago.
First and foremost, GEMA asserted the rights of use that it has been granted, as a collecting society, by the people who wrote the lyrics. GEMA argued that reproducible parts of the lyrics were stored in the AI models, i.e. that the models had “memorized” them. As the models deliver the lyrics in response to simple prompts, GEMA claimed that the lyrics are made available to the general public via the models and the chatbots based on them. The Regional Court of Munich I agreed and upheld GEMA’s claims for injunctive relief, disclosure and damages.
The verdict is impressive. The judges were inundated with a flood of at times misleading arguments, yet they examined the technology with a fine-tooth comb. The significance of the judgment should not be underestimated, particularly since many people are not aware of how many protected works can currently be extracted from language models.
The judgment was also keenly anticipated in the USA. Why?
Nearly all currently relevant generative language models were developed in the USA or are based on models developed there. Such models can generate text, and now also images, videos and audio, and they form the foundation of modern chatbots. The EU market matters to US-based AI companies, which is why they are closely watching the compliance requirements imposed by EU law. With respect to the models they have already released, these companies face considerable liability risks.
What are the differences in copyright law between the USA and Europe?
Many aspects of copyright law are harmonized under international law, but differences remain. In both legal systems, copyright is subject to limitations designed to balance the rights of creators and other rights holders against the interests, protected by fundamental and human rights, of actors such as cultural consumers, companies or universities. In general, the exceptions and limitations to copyright permitted under European law do not justify reproduction in AI models, and the Regional Court of Munich I confirmed this interpretation in its ruling. In the USA, the principle of “fair use” can be applied more flexibly to justify the use of copyrighted material without a license. Innumerable copyright lawsuits have been brought against AI companies in the USA as well, but rulings in those cases are still pending. Many of the proceedings end in out-of-court settlements; Universal Music and Udio, an AI music generator, have just reached one, for example.
At the current time, what is problematic with AI training when seen from a copyright and data protection point of view?
Large language models are based on machine learning. From huge quantities of training data, the models “learn” patterns and correlations, i.e. which words are most likely to follow one another in a certain context. This gives rise to two technical phenomena that can in turn lead to legal problems: on the one hand, the ability to extract significant quantities of training data, and on the other, the risk of “hallucinations”. Certain prompts can make the AI reproduce its training data verbatim or virtually verbatim. The other problem is that AI models “hallucinate”: a hallucinating model will, for instance, invent lyrics instead of outputting the correct version.
How these two still poorly researched technical phenomena should be assessed under copyright and data protection law remains contentious. In particular, some people doubt that AI models can actually contain personal data or works protected by copyright, as the models do not store data the way a database does, but generate their output on the basis of probabilities. Our laws, however, are technology-neutral: they rest on normative assessments, in other words on how things ought to be, rather than on specific technologies. It is therefore correct that the Regional Court of Munich I assumes that the models are capable of reproduction. The decisive factor is whether the information stored in the model is suitable for making protected works perceptible. At the same time, data protection law must come into play if personal data that was used for training and can be extracted from the model may pose a risk to those affected, in particular if this data or the model’s output is incorrect.
How risky are the models and how can the law react to the risks?
The specific risks always depend on the specific model and use case. There are innumerable use cases for language models. At the current time, the risks are largely ill-defined, even on a technical level. The debates about data protection and copyright are currently largely being held separately, and yet they overlap to a certain extent. If providers make models hallucinate more in order to prevent copyrighted works from being plagiarized in their output, the models will invent incorrect details about people more frequently. This is also problematic from the point of view of authors’ moral rights, for instance if such a hallucination is attributed to a real author as her own work.
It is hard to predict how the technology will develop. Models that are smaller and less complex than those that predominate today not only “memorize” less training data, but are also much more efficient in a number of use cases. Progress can also be expected in hallucination research and in the removal of data from models. The decisive point is not to rush to lower our legal standards for the sake of the models currently in use, but instead to use regulation to create incentives for secure, legally compliant AI models. In striving for independence from the autocracy in China and the crumbling democracy in the USA, it would be a fatal error to lower standards for fundamental and human rights in the EU as well. The topic is sure to come under discussion in our Cluster of Excellence “Transforming Human Rights”, which is due to launch in January 2026.
