Updated at 12:00 p.m. ET on August 21, 2023
One of the most troubling issues around generative AI is simple: It’s being made in secret. To produce humanlike answers to questions, systems such as ChatGPT process huge quantities of written material. But few people outside of companies such as Meta and OpenAI know the full extent of the texts these programs have been trained on.
Some training text comes from Wikipedia and other online writing, but high-quality generative AI requires higher-quality input than is usually found on the internet—that is, it requires the kind found in books. In a lawsuit filed in California last month, the writers Sarah Silverman, Richard Kadrey, and Christopher Golden allege that Meta violated copyright laws by using their books to train LLaMA, a large language model similar to OpenAI’s GPT-4—an algorithm that can generate text by mimicking the word patterns it finds in sample texts. But neither the lawsuit itself nor the commentary surrounding it has offered a look under the hood: We have not previously known for certain whether LLaMA was trained on Silverman’s, Kadrey’s, or Golden’s books, or any others, for that matter.
In fact, it was. I recently obtained and analyzed a dataset used by Meta to train LLaMA. Its contents more than justify a fundamental aspect of the authors’ allegations: Pirated books are being used as inputs for computer programs that are changing how we read, learn, and communicate. The future promised by AI is written with stolen words.
Upwards of 170,000 books, the majority published in the past 20 years, are in LLaMA’s training data. In addition to work by Silverman, Kadrey, and Golden, nonfiction by Michael Pollan, Rebecca Solnit, and Jon Krakauer is being used, as are thrillers by James Patterson and Stephen King and other fiction by George Saunders, Zadie Smith, and Junot Díaz. These books are part of a dataset called “Books3,” and its use has not been limited to LLaMA. Books3 was also used to train Bloomberg’s BloombergGPT, EleutherAI’s GPT-J—a popular open-source model—and likely other generative-AI programs now embedded in websites across the internet. A Meta spokesperson declined to comment on the company’s use of Books3; a spokesperson for Bloomberg confirmed via email that Books3 was used to train the initial model of BloombergGPT and added, “We will not include the Books3 dataset among the data sources used to train future versions of BloombergGPT”; and Stella Biderman, EleutherAI’s executive director, did not dispute that the organization used Books3 in GPT-J’s training data.
As a writer and computer programmer, I’ve been curious about what kinds of books are used to train generative-AI systems. Earlier this summer, I began reading online discussions among academic and hobbyist AI developers on sites such as GitHub and Hugging Face. These eventually led me to a direct download of “the Pile,” a massive cache of training text created by EleutherAI that contains the Books3 dataset, plus material from a variety of other sources: YouTube-video subtitles, documents and transcriptions from the European Parliament, English Wikipedia, emails sent and received by Enron Corporation employees before its 2001 collapse, and a lot more. The variety is not entirely surprising. Generative AI works by analyzing the relationships among words in intelligent-sounding language, and given the complexity of these relationships, the subject matter is typically less important than the sheer quantity of text. That’s why The-Eye.eu, a site that hosted the Pile until recently—it received a takedown notice from a Danish anti-piracy group—says its purpose is “to suck up and serve large datasets.”
The Pile is too large to be opened in a text-editing application, so I wrote a series of programs to manage it. I first extracted all the lines labeled “Books3” to isolate the Books3 dataset. Here’s a sample from the resulting dataset:
{"text": "\n\nThis book is a work of fiction. Names, characters, places and incidents are products of the authors’ imagination or are used fictitiously. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.\n\n | POCKET BOOKS, a division of Simon & Schuster Inc. \n1230 Avenue of the Americas, New York, NY 10020 \nwww.SimonandSchuster.com\n\n—|—
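A minimal sketch of that first extraction step might look like the following, assuming the Pile’s published JSON-lines layout, in which each record is a JSON object carrying a set-name label in a “meta” field (the file names here are placeholders):

```python
# Minimal sketch: stream the Pile line by line and keep only Books3 records.
# Assumes each record carries a "pile_set_name" label in its "meta" field;
# file names are placeholders.
import json

with open("pile.jsonl", encoding="utf-8") as pile, \
        open("books3.jsonl", "w", encoding="utf-8") as books3:
    for line in pile:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == "Books3":
            books3.write(line)
```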
The sample above is the beginning of a line that, like all lines in the dataset, continues for many thousands of words and contains the complete text of a book. But what book? There were no explicit labels with titles, author names, or metadata. Just the label “text,” which reduced the books to the function they serve for AI training. To identify the entries, I wrote another program to extract ISBNs from each line. I fed these ISBNs into a third program that connected to an online book database and retrieved author, title, and publishing information, which I viewed in a spreadsheet. This process revealed roughly 190,000 entries: I was able to identify more than 170,000 books—about 20,000 were missing ISBNs or weren’t in the book database. (This number also includes reissues with different ISBNs, so the number of unique books might be somewhat smaller than the total.) Browsing by author and publisher, I began to get a sense of the collection’s scope.
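The identification step can be sketched in the same spirit. The book database isn’t named here, so Open Library’s public books API stands in for it; the ISBN pattern is a loose heuristic rather than a validating parser, and a real run over roughly 190,000 records would need batching and rate limiting:

```python
# Rough sketch: pull an ISBN-like string out of each book's text, then look
# it up. Open Library's public API stands in for the unnamed book database;
# the regex is a loose heuristic, not a validating ISBN parser.
import json
import re
import urllib.request

ISBN_PATTERN = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9 -]{8,15}[0-9Xx])")

def find_isbn(text):
    """Return the first ISBN-like string in text, with spaces/hyphens removed."""
    match = ISBN_PATTERN.search(text)
    return re.sub(r"[ -]", "", match.group(1)) if match else None

def look_up(isbn):
    """Fetch title, authors, and publishers for an ISBN from Open Library."""
    url = ("https://openlibrary.org/api/books?bibkeys=ISBN:" + isbn
           + "&format=json&jscmd=data")
    with urllib.request.urlopen(url) as response:
        data = json.load(response).get("ISBN:" + isbn)
    if data is None:
        return None  # missing from the database
    return (data.get("title"),
            [a["name"] for a in data.get("authors", [])],
            [p["name"] for p in data.get("publishers", [])])

with open("books3.jsonl", encoding="utf-8") as books3:
    for line in books3:
        isbn = find_isbn(json.loads(line)["text"])
        if isbn:
            print(look_up(isbn))  # in practice, write rows to a CSV instead
```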
Of the 170,000 titles, roughly one-third are fiction, two-thirds nonfiction. They’re from big and small publishers. To name a few examples, more than 30,000 titles are from Penguin Random House and its imprints, 14,000 from HarperCollins, 7,000 from Macmillan, 1,800 from Oxford University Press, and 600 from Verso. The collection includes fiction and nonfiction by Elena Ferrante and Rachel Cusk. It contains at least nine books by Haruki Murakami, five by Jennifer Egan, seven by Jonathan Franzen, nine by bell hooks, five by David Grann, and 33 by Margaret Atwood. Also of note: 102 pulp novels by L. Ron Hubbard, 90 books by the Young Earth creationist pastor John F. MacArthur, and multiple works of aliens-built-the-pyramids pseudo-history by Erich von Däniken. In an emailed statement, Biderman wrote, in part, “We work closely with creators and rights holders to understand and support their perspectives and needs. We are currently in the process of creating a version of the Pile that exclusively contains documents licensed for that use.”
Although not widely known outside the AI community, Books3 is a popular training dataset. Hugging Face facilitated its download from the Eye for more than two and a half years; its link stopped working around the time Books3 was mentioned in lawsuits against OpenAI and Meta earlier this summer. The academic writer Peter Schoppert has tracked its use in his Substack newsletter. Books3 has also been cited in the research papers by Meta and Bloomberg that announced the creation of LLaMA and BloombergGPT. In recent months, the dataset has been effectively hidden in plain sight: possible to download, but challenging to find, view, and analyze.
Other datasets, possibly containing similar texts, are used in secret by companies such as OpenAI. Shawn Presser, the independent developer behind Books3, has said that he created the dataset to give independent developers “OpenAI-grade training data.” Its name is a reference to a paper published by OpenAI in 2020 that mentioned two “internet-based books corpora” called Books1 and Books2. That paper is the only primary source that gives any clues about the contents of GPT-3’s training data, so it’s been carefully scrutinized by the development community.
Judging from what has been gleaned about the sizes of Books1 and Books2, Books1 is speculated to be the complete output of Project Gutenberg, an online library of some 70,000 books with expired copyrights or licenses that allow noncommercial distribution. No one knows what’s inside Books2. Some suspect it comes from collections of pirated books, such as Library Genesis, Z-Library, and Bibliotik, that circulate via the BitTorrent file-sharing network. (Books3, as Presser announced after creating it, is “all of Bibliotik.”)
Presser told me by telephone that he’s sympathetic to authors’ concerns. But the great danger he perceives is a monopoly on generative AI by wealthy corporations, giving them total control of a technology that’s reshaping our culture: He created Books3 in the hope that it would allow any developer to create generative-AI tools. “It would be better if it wasn’t necessary to have something like Books3,” he said. “But the alternative is that, without Books3, only OpenAI can do what they’re doing.” To create the dataset, Presser downloaded a copy of Bibliotik from The-Eye.eu and updated a program written more than a decade ago by the hacktivist Aaron Swartz to convert the books from ePub format (a standard for ebooks) to plain text—a necessary change for the books to be used as training data. Although some of the titles in Books3 are missing relevant copyright-management information, the deletions were ostensibly a by-product of the file conversion and the structure of the ebooks; Presser told me he did not knowingly edit the files in this way.
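An ePub is, at bottom, a zip archive of XHTML chapter files, so the conversion can be pictured with a minimal sketch (not Presser’s or Swartz’s actual code): unpack the archive, strip the markup, keep the text. A pass this naive carries over nothing but the prose, which is consistent with rights information disappearing as a by-product of the format change.

```python
# Minimal sketch of ePub-to-text conversion; not Presser's or Swartz's
# actual code. An ePub is a zip archive of XHTML chapters, so this walks
# the archive, strips the markup, and keeps the character data. (Chapters
# come out in archive order; a careful tool would follow the ePub's spine.)
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data between tags, discarding the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def epub_to_text(path):
    extractor = TextExtractor()
    with zipfile.ZipFile(path) as epub:
        for name in epub.namelist():
            if name.endswith((".xhtml", ".html", ".htm")):
                extractor.feed(epub.read(name).decode("utf-8", errors="ignore"))
    return "".join(extractor.chunks)

print(epub_to_text("some-book.epub"))  # path is a placeholder
```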
Many commentators have argued that training AI with copyrighted material constitutes “fair use,” the legal doctrine that permits the use of copyrighted material under certain circumstances, enabling parody, quotation, and derivative works that enrich the culture. The industry’s fair-use argument rests on two claims: that generative-AI tools do not replicate the books they’ve been trained on but instead produce new works, and that those new works do not hurt the commercial market for the originals. OpenAI made a version of this argument in response to a 2019 query from the United States Patent and Trademark Office. According to Jason Schultz, the director of the Technology Law and Policy Clinic at NYU, this argument is strong.
I asked Schultz whether the fact that books were acquired without permission might damage a claim of fair use. “If the source is unauthorized, that can be a factor,” Schultz said. But the AI companies’ intentions and knowledge matter. “If they had no idea where the books came from, then I think it’s less of a factor.” Rebecca Tushnet, a law professor at Harvard, echoed these ideas, and told me that the law was “unsettled” when it came to fair-use cases involving unauthorized material, with previous cases giving little indication of how a judge might rule in the future.
This is, to an extent, a story about clashing cultures: The tech and publishing worlds have long had different attitudes about intellectual property. For many years, I’ve been a member of the open-source software community. The modern open-source movement began in the 1980s, when a developer named Richard Stallman grew frustrated with AT&T’s proprietary control of Unix, an operating system he had worked with. (Stallman worked at MIT, and Unix had been a collaboration between AT&T and several universities.) In response, Stallman developed a “copyleft” licensing model, under which software could be freely shared and modified, as long as modifications were re-shared using the same license. The copyleft license launched today’s open-source community, in which hobbyist developers give their software away for free. If their work becomes popular, they accrue reputation and respect that can be parlayed into one of the tech industry’s many high-paying jobs. I’ve personally benefited from this model, and I support the use of open licenses for software. But I’ve also seen how this philosophy, and the general attitude of permissiveness that permeates the industry, can cause developers to see any kind of license as unnecessary.
This is dangerous because some kinds of creative work simply can’t be done without more restrictive licenses. Who could spend years writing a novel or researching a work of deep history without a guarantee of control over the reproduction and distribution of the finished work? Such control is part of how writers earn money to live.
Meta’s proprietary stance with LLaMA suggests that the company thinks similarly about its own work. After the model leaked earlier this year and became available for download from independent developers who’d acquired it, Meta used a DMCA takedown notice against at least one of those developers, claiming that “no one is authorized to exhibit, reproduce, transmit, or otherwise distribute Meta Properties without the express written permission of Meta.” Even after it had “open-sourced” LLaMA, Meta still wanted developers to agree to a license before using it; the same is true of a new version of the model released last month. (Neither the Pile nor Books3 is mentioned in a research paper about that new model.)
Control is more essential than ever, now that intellectual property is digital and flows from person to person as bytes through airwaves. A culture of piracy has existed since the early days of the internet, and in a sense, AI developers are doing something that’s come to seem natural. It is uncomfortably apt that today’s flagship technology is powered by mass theft.
Yet the culture of piracy has, until now, facilitated mostly personal use by individual people. The exploitation of pirated books for profit, with the goal of replacing the writers whose work was taken—this is a different and disturbing trend.
This article originally stated that Hugging Face hosted the Books3 dataset in addition to the Eye. Hugging Face did not host Books3; rather, it facilitated its download from the Eye.