Last week, two US-based authors brought a class action suit against OpenAI for copyright infringement of their works which, among thousands of other works, were used to train the large language model (“LLM”) behind the generative artificial intelligence (“GAI”) – ChatGPT. The essence of the claim in this suit is that the plaintiffs and other class members did not consent to the use of their books/works as training material for ChatGPT and that OpenAI is unjustly profiting from the use of their copyrighted works. In this article, the first section looks at how GAI models are ‘trained’ using input data, which is followed by an analysis of the arguments made against OpenAI in the lawsuit to answer the question of whether works created by GAIs like ChatGPT can be ‘original’ or if they are inherently infringing on copyrighted works of others.

Understanding GAI
The key driver of rapid advancement of AI technology today is machine learning. It is based on the idea that computers may learn, through pattern recognition, without being programmed to carry out particular tasks. Machine learning is “a subset of artificial intelligence that produces autonomous systems that are capable of learning without being specifically programmed by a human.”
Computer programmes created for machine learning learn from the training dataset, evolve, and make future decisions that may be directed or independent. GAI relies on the training dataset to create a new work while making certain choices to shape the new work’s appearance. While programmers can specify parameters, the work is generated by the computer programme itself, using a neural network, in a manner similar to the cognitive processes of humans. This is a key characteristic of this type of artificial intelligence.
Machine learning in LLMs like the one used for ChatGPT involves learning from previously created works such as books, articles, and other writings drawn from various sources. One common source of input data for these models is web scraping, which uses automated tools to collect data from websites, social media platforms, and other online sources. While OpenAI also gets its input data from various partnerships and from publicly available sources, some of which are in the public domain, it has not denied using copyrighted materials scraped from the web to train its LLM. Instead, OpenAI contends that its use of the plaintiffs’ and other class members’ copyrighted works constitutes fair use and that it is therefore not liable for infringement.
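The dependence of a generative model on its training dataset can be illustrated with a deliberately simplified sketch. The Python example below is a toy word-pair (“bigram”) model, not a representation of how ChatGPT or any OpenAI system is built: real LLMs use neural networks with billions of parameters. The function names and the two training sentences are hypothetical and chosen only for illustration. What the sketch shows is that the model can only ever emit words, and word sequences, that it encountered during training.

```python
import random
from collections import defaultdict


def train_bigram_model(corpus):
    """Record, for every word, which words followed it in the training text."""
    model = defaultdict(list)
    for text in corpus:
        words = text.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            model[current_word].append(next_word)
    return model


def generate(model, start_word, max_words=8, seed=0):
    """Produce text by repeatedly picking a word that followed the last one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    output = [start_word]
    while len(output) < max_words:
        followers = model.get(output[-1])
        if not followers:  # the model never saw anything follow this word
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Hypothetical training data; a real corpus would contain millions of works.
training_corpus = [
    "the cabin stood at the end of the road",
    "the road led to the end of the world",
]

model = train_bigram_model(training_corpus)
sample = generate(model, "the")
print(sample)

# A word absent from the training data gives the model nothing to build on.
print(generate(model, "copyright"))
```

The generated sentence may differ from both training sentences, yet every adjacent pair of words in it was taken from them. This is, in miniature, the tension the lawsuit raises: output that is not a verbatim copy but is wholly derived from the input works.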

Paul Tremblay and Mona Awad v. OpenAI
The plaintiffs, Paul Tremblay and Mona Awad, are both authors of several copyrighted books, including “The Cabin at the End of the World” (Tremblay) and “13 Ways of Looking at a Fat Girl” and “Bunny” (Awad). They have argued that ChatGPT’s LLM gets most of its training dataset from copyrighted materials, including the Plaintiffs’ books, without their consent, credit, or compensation. The complaint also highlights OpenAI’s disclosure that 15% of the enormous GPT-3 training dataset came from “two internet-based books corpora” that OpenAI simply called “Books1” and “Books2”, without revealing which books form part of these corpora.
The class action suit claims, among other violations, direct and indirect infringement of the registered copyright in the books used to train OpenAI’s LLMs, since the Plaintiffs and the class members never authorized OpenAI to make derivative works, publicly display copies (or derivative works), or distribute copies (or derivative works). All of those rights belong exclusively to the Plaintiffs under copyright law. Interestingly, the Plaintiffs also argued that since the OpenAI LLMs cannot function without the expressive information extracted from the Plaintiffs’ works (and others) and retained inside them, the OpenAI LLMs are themselves infringing derivative works.

Can ChatGPT’s Output be ‘Original’ or Is it Inherently Infringing?
As argued by Anand, human artistic expression is linked with the fulfilment of several cognitive functions such as communication, mating, and other forms of social feedback. In addition, humans engage with their environment to produce artistic output to achieve specific results for the artist and the observer. However, even if they resemble the human brain, the neural networks used by AI technologies can only perform within specified parameters that are ends in themselves, rather than producing artistic works to fulfil any cognitive function.
While artists look to their surroundings for inspiration, AI must ‘train’ itself on previously created works of art in order to produce meaningful work. It follows that there are fundamental differences between human and machine creative output. Without acknowledging the creative contribution of the input data, and implicitly the original writers, any acknowledgement of intellectual property in the works produced by AI technology would be inadequate.
The Authorship v Generation argument is that while humans ‘author’ creative works with a higher degree of originality and an embedded element of individuality, AI merely ‘generates’ creative works. The basis of this argument is the enormous amount of data required by GAIs to be able to produce any meaningful output. Therefore, there is considerable weight in the argument that while humans are capable of producing copyrightable works without reference to pre-existing works created by others, by taking inspiration from our surroundings and other naturally occurring phenomena, GAIs lack this capability and depend entirely on access to their training dataset. If their training dataset is made up, to a large extent, of copyrighted materials such as books, articles and other writings, the original authors or owners of these materials may have a legitimate claim for infringement and to be provided compensation for unauthorized, unlawful use of their works.
Considering the importance of the training dataset to ChatGPT’s output, can its works be considered original? Although originality has been a requirement for copyright protection across all jurisdictions, there are varied definitions and thresholds for what constitutes an original work.
Under the ‘Sweat of the Brow’ doctrine, protection under copyright does not hinge on the work being novel or substantially creative/original. Copyright subsists in the work because of the labour, skill, judgement or other expense put in by the author in the process of creating it. This doctrine was widely adopted in the United Kingdom and was notably delineated in Walter v Lane and University of London Press Ltd. v University Tutorial Press Ltd.
In the USA, the threshold for originality requires a modicum of creativity, as held in Feist Publications, where the court noted that a work must contain a minimum degree of creativity in order to be protected under copyright law and that proof of the exercise of skill, judgement and labour alone would not cross this threshold.
The Indian position on originality can be considered a middle ground between the positions in the USA and the UK, but it also borrows from the Canadian jurisdiction, specifically CCH Canadian Ltd. v Law Society of Upper Canada. This judgement was referenced in EBC v D.B. Modak, where the court held that in order to be considered original, a derivative work must be more than a mere copy of the original but does not require a high degree of originality or intellectual novelty.
ChatGPT’s output does involve the exercise of some skill/judgement in producing written output based on the input of the user, and its output is far from a mere reproduction of the input dataset. The LLM recognises patterns in the input dataset and, through machine learning, is trained to produce unique outputs tailored to the specific needs of the user, which are communicated to ChatGPT through a prompt. In this regard, it may be argued that ChatGPT’s works are ‘original’ in the sense of the sweat of the brow doctrine.
Under the modicum of creativity test, as well as the threshold in the Indian copyright regime, however, this analysis becomes more complex, as both tests reference ‘creativity’, which is largely considered a human trait. If one defines creativity with reference to a particular text, it may be defined as adding something that is the author’s own thought/idea or as the exercise of judgement/discretion in the creative choices made with regard to the text. However, this analysis would require a deeper comparison between the meaning of ‘judgement/discretion’ and ‘choice’, and an inquiry into whether these terms could apply to neural networks or the other technological processes by which a GAI creates output.
Another hurdle to the legal recognition of ChatGPT’s works as ‘original’ is the reference to an ‘author’ in the above legal tests. An analysis of whether an AI can be considered an ‘author’ under various copyright regimes would require a detailed examination of statutory provisions and case law, but the question can be touched upon briefly to consider the applicability of these tests to ChatGPT’s output. Currently, there is a lack of case law discussing the legal position on authorship vis-à-vis AI, but what can be said with certainty is that the existing tests of originality were laid down in the context of human authorship only. Various copyright regimes, such as those of the US and the UK, have outright denied authorship to AI, holding that copyright law does not recognise non-human authorship. Therefore, these tests may be inapposite in light of current technological advances, which have surpassed the foundational assumption of human authorship or intellect behind the creation of copyrightable works.

Takeaway & Conclusion
The judgement in this class action suit will have far-reaching consequences internationally, since it is the first of its kind. It puts to the test a question that until now had remained a purely academic debate: whether the makers of LLMs can be held liable for the unauthorized use of copyrighted materials scraped from the internet.
A possible solution to the issue of unauthorized use of copyrighted works in training datasets for LLMs, as proposed by Anand, is to give the original authors, artists, or musicians an interest in the intellectual property created by the AI. This could be achieved by the creation of a ‘Data Bank’ functioning as a marketplace where authors/artists/musicians provide access to their works to companies like OpenAI for a fee. This data bank or marketplace could be further automated by using smart contracts, which would remove the hindrances involved in obtaining licenses on an individual basis from each author/artist/musician. In this way, others like the plaintiffs and class members of the OpenAI lawsuit would be fairly compensated for the use of their works by GAIs, while creative expression and innovation would still be encouraged.
Although the question of ‘originality’ and copyright protection of AI generated works is not an issue that the court will consider in this case, the findings of the court vis-à-vis copyright infringement of works used in ChatGPT’s training dataset will shape future decisions on whether ChatGPT’s output would qualify as copyrightable works.