The New York Times has sued OpenAI and Microsoft over the use of copyrighted data, becoming the first major media company to do so.
The NYT alleges in its lawsuit that the large language models behind OpenAI's and Microsoft's generative AI products were created “by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.”
This legal battle could set a precedent for how courts define the value of news content in training large language models and what damages are owed for past use.
“I think [the lawsuit is] going to put a shot across the bow of all platforms on how they’ve trained their data, but also on how they flag data that comes out and package data in such a way that they can compensate the organisations behind the training data,” Shaunt Sarkissian, founder and CEO at AI-ID, an AI tracking, authentication, source validation, and output data management/control platform, told PYMNTS.
“The era of the free ride is over,” he added.
The lawsuit opens a new front in a years-long dispute between tech and media firms over the economics of the internet, pitting one of the news industry’s most powerful players against the forerunners of a new wave of artificial-intelligence tools. It comes after months of business talks between the two companies ended without an agreement, as the Times reported.
The Times has asked for a jury trial in the suit, which was filed in U.S. federal court in the Southern District of New York.
WHAT IS THE TIMES’ GROUSE?
The lawsuit claims, among various examples, that Microsoft’s “Browse with Bing” feature replicates content from The Times’ product recommendation platform, Wirecutter, through significant verbatim and direct copying. Additionally, the lawsuit accuses OpenAI’s GPT-4 of inaccurately attributing recommendations to Wirecutter.
“We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from AI technology and new revenue models,” an OpenAI spokesperson told Axios in a statement.
“Our ongoing conversations with the New York Times have been productive and moving forward constructively, so we are surprised and disappointed with this development. We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”
Tech companies developing generative-AI tools have often argued that content available on the open internet can be used to train their technologies under a legal doctrine known as “fair use,” which permits the use of copyrighted material without authorization in certain circumstances.
“The New York Times put a very strong stake in the ground that demonstrates the value and importance of protecting news content,” Danielle Coffey, CEO of the News/Media Alliance, a trade group for news publishers, told the Wall Street Journal. “Quality journalism and these new technologies, especially those that compete for the very same audience, can complement each other if approached as a collaboration.”
THE FUNDAMENTAL PROBLEM
ChatGPT is a large language model (LLM) built on the Generative Pre-trained Transformer (GPT) architecture. Trained on extensive datasets, it learns grammar, context, and language patterns. The model is “generative,” producing human-like text. Pre-training exposes it to diverse language examples, teaching it to predict the next word; fine-tuning then tailors the model for specific tasks.
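To make “predicting the next word” concrete, here is a deliberately tiny Python sketch: a bigram counter standing in for the transformer networks real LLMs use. The corpus, names, and logic are illustrative only, not anything from OpenAI’s actual pipeline.

```python
# A toy stand-in for LLM pre-training: count which word tends to follow
# which, then "generate" by predicting the most likely next word.
# Real models use transformer networks with billions of parameters.
from collections import Counter, defaultdict

corpus = "the model reads text and learns to predict the next word".split()

# "Training": tally how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent successor of `word` seen in training."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # -> "model" (its most common successor here)
```

Scale that counting idea up to billions of parameters and trillions of words, and the training data has to come from somewhere.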
The training data itself is collected by automated tools known as web crawlers and web scrapers, akin to the technology used to build search engines. Picture web crawlers as virtual spiders navigating URL pathways, systematically documenting the whereabouts of every item they encounter.
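A minimal sketch of that spidering process, using only Python’s standard library; the `crawl` function and its limits are illustrative, and production crawlers add robots.txt checks, politeness delays, deduplication, and distributed work queues:

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(start_url, max_pages=10):
    """Breadth-first crawl: visit pages, record them, follow their links."""
    seen, frontier = set(), [start_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue  # unreachable or non-text page: skip it
        # Follow every hyperlink found on the page.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http"):
                frontier.append(link)
    return seen
```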
These LLMs have been fed vast amounts of The Times’s content, including material protected by its paywall, multiple times over to train the generative AI, the NYT has argued in its lawsuit.
OpenAI, like other major tech companies, has become less transparent about its training data. It is known to have used Common Crawl for at least one version of the large language model powering ChatGPT. In contrast to the detailed information provided during the development of GPT-3, recent releases like GPT-3.5 and GPT-4 come with limited insight into the training process and data used.
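Common Crawl’s public CDX index can be queried to see whether a site’s pages appear in a given snapshot. A hedged sketch follows; the crawl label `CC-MAIN-2023-50` is one example snapshot, and the exact fields returned may vary:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Ask the index for pages under nytimes.com, returned as JSON lines.
params = urlencode({"url": "nytimes.com/*", "output": "json", "limit": "5"})
endpoint = "https://index.commoncrawl.org/CC-MAIN-2023-50-index?" + params

with urlopen(endpoint, timeout=30) as resp:
    for line in resp:
        record = json.loads(line)
        print(record["url"], record["timestamp"])
```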
OpenAI’s most recent technical report explicitly withholds these details, citing the competitive landscape and the safety implications of large-scale models like GPT-4: it contains no information on architecture, hardware, training methods, or datasets.
On the one hand, big tech firms like OpenAI and Microsoft are lifting protected subscription content from online publications to feed their LLMs without reference or credit, arguing that this is fair use; on the other, they are charging users for the text those models generate, claiming it as their own. In essence, plagiarised content worth billions of dollars has let companies like OpenAI transform into for-profit businesses in a matter of years.
The profit potential of the current AI model is so enormous that companies like Microsoft have poured in billions of dollars: Microsoft invested an initial $1 billion in OpenAI in 2019 and added at least $10 billion more in January 2023.
While big tech invests billions and makes billions, it is the online publications that have suffered. With no references, credits, or citations and no ‘fair profit sharing’ model in place, publications including the likes of the NYT, Reuters, the BBC, and CNN have been left uncompensated. To try to put a stop to this maddening scale of data exploitation, these companies blocked OpenAI’s web crawler.
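In practice, that blocking happens in a site’s robots.txt file, which OpenAI says its crawler, GPTBot, respects. A site-wide refusal is just two lines:

```
# robots.txt: disallow OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /
```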
Comedian and author Sarah Silverman is part of a class-action lawsuit against OpenAI and Meta, alleging copyright infringement. Creators across various domains, including writers, musicians, artists, and photographers, are contending with the potential impact of generative AI technology on their respective fields and the safeguarding of their creative works.
This year also saw at least two lawsuits from groups of writers against OpenAI, accusing the company of training its AI with copyrighted works without their permission, and of using illegal copies of their books pulled from the internet.
The U.S. Copyright Office launched an initiative in August to study the use of copyrighted materials in AI training, indicating that legislative or regulatory steps may be necessary in the near term to address the use of copyrighted materials within AI model training datasets.
HOW DOES NYT MAKE MONEY?
Media organisations like the NYT have several revenue streams: a subscription-based model, advertisements on digital and print platforms, licensing revenues, affiliate referrals, building rental income, commercial printing, NYT Live (its live events business), and retail commerce.
For over four years, the subscription-based model has generated the most revenue for the NYT. In 2022, subscriptions accounted for 67% of The New York Times’ total revenue: out of $2.3 billion, subscriptions across print and digital formats contributed $1.55 billion.
Advertising, across print and digital, contributed $523 million, and another $232 million came from other sources. Digital subscriptions generated over $978 million, while print subscriptions contributed $573 million to the overall subscription revenue.
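The figures reconcile: $1.55 billion in subscriptions plus $523 million in advertising plus $232 million in other revenue totals roughly $2.3 billion, while $978 million from digital and $573 million from print account for the $1.55 billion subscription line.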
Beyond the revenue, the NYT employs 5,800 people. The alleged data lifting by companies such as OpenAI is a direct threat not just to billions of dollars in revenue but also to the livelihoods of those 5,800 employees.
The problem does not stop there: lifting content from the internet and passing it off as one’s own threatens the creation of original content, essentially disincentivising it. This, in principle, is an attack on the very fabric of the democratic way users and organisations publish content on the internet.
Moreover, while the NYT tries to protect itself with a subscription model, many other online repositories are free to use and can be accessed by AI companies’ crawlers. Publicly available data on the internet spans sources such as images on Flickr, online marketplaces, voter registration records, government websites, corporate platforms, employee profiles, Wikipedia, Reddit, research repositories, and free-to-access news platforms.
Additionally, there is a plethora of readily accessible unauthorized content and archived compilations, which could potentially include deleted personal blogs or other embarrassing content from the past.
In April 2023, The Washington Post published an analysis revealing that a single dataset used for AI training encompasses nearly the entire 30-year history of the internet. Tech companies have scraped this data extensively to train models whose parameter counts reach into the billions and even trillions.
What’s more, nothing stops companies like OpenAI from starting their own online publications, using generative AI to create content, and openly publishing material lifted and plagiarised from the countless sources at their disposal.
The lives of artists and creatives are also being disrupted by AI tools capable of generating content, including written material and images, through systems like OpenAI’s DALL-E. These tools pose a tangible challenge to the income of working artists, prompting them to urgently seek ways to keep their creations out of the datasets used to train AI tools.
A study in August revealed that shortly after ChatGPT’s launch, it had a detrimental effect on employment prospects and income for online freelancers like copywriters and graphic designers. The impact was particularly pronounced among highly skilled freelancers who undertook numerous tasks and earned substantial incomes.
THE AI REVOLUTION
In just over a year, generative AI has brought the tech sector back from the brink. Around November last year, the sector was facing mass layoffs, falling profits, and plummeting big-tech stocks shedding billions in market value, among the many other symptoms of a declining industry. Then, on November 30, 2022, OpenAI entered the scene with an experimental chatbot dubbed ChatGPT.
A year after ChatGPT’s public debut, the excitement surrounding AI remains intense. Tech giants have invested billions in the technology, and nations are stockpiling the chips needed for future AI endeavors. Within a mere two months, ChatGPT became the fastest-growing consumer application ever; by January 2023, it was estimated to have amassed 100 million active users, sparking an AI arms race among companies and revitalizing the tech sector.
Microsoft has invested almost $13 billion in OpenAI; Google invests almost $8 billion a quarter into Bard; and Meta and X (formerly Twitter) are similarly investing billions in AI models.
According to the Organization for Economic Cooperation and Development (OECD), 21% of global VC investment in 2020 (the most recent compilation) went into AI, an estimated $75 billion.
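By that arithmetic, total global VC investment that year would have been on the order of $360 billion ($75 billion divided by 0.21 comes to roughly $357 billion).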
With such massive investments backing consumer-facing generative AI, these companies can potentially run publications out of business while invoking the nuanced legal concept of “fair use,” a doctrine meant to let individual users build on content available on the internet, not to let tech giants backed by billions drive publications under.
A FAIR MODEL FOR PROFIT SHARING?
Generative AI chatbots built on large language models are here to stay, but the unauthorised use of copyrighted data cannot be permitted. A ‘fair profit sharing’ model is needed to incentivise users to create original content and to keep creating it.
In the absence of adequate safeguards, artificial intelligence companies run the risk of endangering the businesses of the news organizations they depend on to train their algorithms. This scenario poses an existential threat to both AI companies and newsrooms in the long term.
Some organisations, like the Associated Press and Axel Springer, have already reached commercial agreements to license their content to OpenAI. These agreements compensate news companies in return for permission to use their content to train AI large language models.
News media executives view tech companies with skepticism due to their experiences in the past decade. While Google and Facebook initially assisted publishers in expanding their audiences and boosting web traffic, they evolved into formidable competitors for online advertising revenue.
These tech giants held the authority to influence the growth or decline of news traffic through algorithmic changes. Publishers, having been unable to secure what they deemed a fair portion of the substantial internet growth facilitated by search and social media, are now reluctant to face a similar fate in the realm of AI.
WHAT’S NEXT?
While the lawsuit seeks a trial by jury, it does not specify a particular monetary claim. Nevertheless, the complaint asserts that Microsoft and OpenAI should be held accountable for “billions of dollars in statutory and actual damages.”
Over the past ten years, news publishers have actively sought Congressional protections against Big Tech companies that used their content to drive engagement on social media and search engines. The emergence of AI has prompted a fresh lobbying initiative by news executives, who contend that scraping their content falls outside the fair use boundaries defined by existing copyright laws.
“What this case will likely do is create a benchmark of what is the economic threshold, or what are reasonable royalties, for fair use of content,” Sarkissian said. “Everyone’s going to use The New York Times as a proxy and see how it goes.”