
Data Colonialism and Data Sets

Picture this: digital pioneers galloping into a cyber frontier, arguing over erroneous maps of algorithms they don’t fully understand, sowing violence in real-life territories, and racing to convert the commons into property. Only after resources have been distributed along familiar lines of gender and race do discussions of equality enter the scene. This is the pattern of colonization — ancient, familiar, and evolving. In their book The Costs of Connection, Nick Couldry and Ulises Mejias explore “data colonialism,” a term describing the “process by which governments, non-governmental organizations and corporations claim ownership of and privatize the data that is produced by their users and citizens.”

Recently, Getty Images filed suit against Stability AI, claiming a “brazen infringement of Getty Images’ intellectual property on a staggering scale.” Stability AI operates Stable Diffusion, “a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.” Like recent data privacy legislation and impact-focused suits under other doctrines of law, contemporary copyright litigation might impose meaningful limits on data colonialism — even if such concerns are not the focus of the litigants. Because of the breadth of copyright law, protectable expression can exist almost anywhere online. When it comes to pictures and videos of ourselves, the intersection of copyright law and privacy becomes uniquely tangible. Therefore, judicial decisions finding copyright infringement based on the nonconsensual creation of datasets might encourage companies to collect online data more carefully. While this might (briefly) impede technological progress, it would be an important tool in the fight against data colonialism.


The year 2023 has seen an explosion of technology developed using deep learning, “a subset of machine learning which … attempt[s] to simulate the behavior of the human brain … allowing it to ‘learn’ from large amounts of data.” The technology is “distinct from other forms of artificial intelligence because it can ‘ingest’ and process unstructured data, like text and images.” To train these algorithms, researchers assemble large quantities of data to be processed, referred to as “training datasets.” As some have observed, “[a]ll recent advances in artificial intelligence in recent years are due to deep learning,” and without it, “we would not have self-driving cars, chatbots or personal assistants like Alexa and Siri.”
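To make the term “training dataset” concrete, consider a minimal Python sketch. It pairs each image file with a caption file, the basic unit a text-to-image model learns from; the file layout is an assumption for illustration, not a description of any particular company’s pipeline.

```python
# A minimal "training dataset" of (image, caption) pairs, sketched as a
# PyTorch Dataset. Assumes illustrative files like 0001.jpg / 0001.txt.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class CaptionedImageDataset(Dataset):
    def __init__(self, root: str):
        # Each .jpg is assumed to sit next to a .txt holding its caption.
        self.images = sorted(Path(root).glob("*.jpg"))

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int):
        img_path = self.images[idx]
        image = Image.open(img_path).convert("RGB")
        caption = img_path.with_suffix(".txt").read_text().strip()
        return image, caption
```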

Stable Diffusion exemplifies this deep learning technology, using machine learning to create new images from the large dataset of images on which the model was trained. But where did Stability AI get those images? In its complaint, Getty Images — describing itself as “a preeminent global visual content creator” — claimed that “Stability AI has copied more than 12 million photographs from Getty Images’ collection, along with associated captions and metadata, without permission from or compensation to Getty Images as part of its efforts to build a competing business.” In support of its claims, Getty Images “selected 7,216 examples from the millions of images that Stability AI copied without permission and used to train one or more versions of Stable Diffusion.”
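For a sense of what “generating photo-realistic images given any text input” looks like in practice, here is a hedged usage sketch built on Hugging Face’s open-source diffusers library; the model identifier and the prompt are illustrative assumptions, not details drawn from the litigation.

```python
# Illustrative use of an open-source Stable Diffusion pipeline:
# one text prompt in, one synthesized image out.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model ID for the example
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photojournalistic shot of a city street at dusk").images[0]
image.save("generated.png")
```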

Stable Diffusion was allegedly trained on a dataset created by “LAION, a German entity.” The complaint alleged that “Stability AI followed links included in LAION’s dataset to access specific pages on Getty Images’ websites and copied many millions of copyrighted images and associated text.” Thus, the copyright infringement issues largely revolve around the creation of training data, not the AI’s output. But the complaint listed eight claims involving copyright law, trademark law, and unfair business practices. In its prayer for relief, Getty Images asked for injunctive relief including “[o]rdering the destruction of all versions of Stability AI trained using Getty Images’ content without permission.”
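The mechanics the complaint alleges can be pictured in a few lines. LAION’s published metadata contains URL and TEXT columns rather than the images themselves, so “training on” the dataset first requires fetching, which is to say copying, every photograph those URLs point to. The filename below and the overall workflow are assumptions for illustration:

```python
# Sketch of materializing a LAION-style dataset: the metadata ships only
# links and captions, so each image must be downloaded (i.e., reproduced).
from pathlib import Path

import pandas as pd
import requests

Path("images").mkdir(exist_ok=True)
rows = pd.read_parquet("laion_subset.parquet")  # assumed file; columns include URL, TEXT

for i, row in rows.iterrows():
    resp = requests.get(row["URL"], timeout=10)
    if resp.ok:
        # Writing the response body to disk is a literal copy of the
        # photograph -- the reproduction at the center of the complaint.
        Path(f"images/{i}.jpg").write_bytes(resp.content)
        Path(f"images/{i}.txt").write_text(str(row["TEXT"]))
```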


The harm suffered by Getty Images is a harm likely already suffered by most of us. In the process of data colonialism, large companies and governments have collected private data on massive scales to generate datasets on which to train their deep learning AIs. Data scrapers have combed the internet for selfies, artwork, and information about us. In some cases, the collection of this data was akin to pillaging. Malicious parties have taken advantage of the digitally ignorant and the digitally sophisticated alike to build datasets used for purposes never contemplated by the subjects of the seizure. Like the inventions of fire, the engine, and electronics, the invention of machine learning has transformed the value placed on a resource — causing a capitalistic rush to extract as much wealth as cheaply and quickly as possible.

In the realm of machine learning, the transformed resource is ourselves — or at the very least, data about ourselves. As the Brookings Institution writes, as “artificial intelligence evolves, it magnifies the ability to use personal information in ways that can intrude on privacy interests by raising analysis of personal information to new levels of power and speed.” How we look, how we speak, and how we think all have a new — and rapidly increasing — value as training data. And, “[w]ith other components of AI equally available for all the players on the market, it is data that really makes your AI solution stand out from the competition.” Depending on how this data is expressed online, it might readily be protected by copyright.

Indeed, under the 1976 Copyright Act, Getty Images can plausibly claim infringement of either the reproduction right or the derivative works right. Copyright protects “original works of authorship as soon as an author fixes the work in a tangible form of expression.” Almost all of the photographs on Getty Images likely meet this bar.

Although much digital real estate has already been devoted to the intersection of generative AI and copyright, it is important to distinguish the legal issues involving the output of generative AI from those involving the input into generative AI. The input issues are uniquely relevant to the concept of data colonialism, because legal questions involving the output of AI naturally revolve around issues of authorship and infringement of the derivative works right. Issues of input, by contrast, implicate copyright doctrine in a very traditional way — infringement of the reproduction right. As Getty Images alleged, creating an AI training set requires assembling a large quantity of images, pairing them with descriptive text, and running the data through a deep learning algorithm. In this process, reproduction of fixed photographs is highly likely — illustrated by the fact that Stable Diffusion sometimes erroneously regenerates Getty Images’ logo in its output. Additionally, the dataset itself might constitute a compilation under the Copyright Act of 1976 whose unauthorized creation would infringe the derivative works right. Thus, the input-related analysis is perhaps less complex — and less controversial — than its output-related counterpart. The main question is whether Stability AI’s use will be deemed a “fair use.”


Getty Images — which was valued at over $4.8 billion in 2021 — has the capacity and the desire to bring suit against Stability AI because Getty is a sophisticated business directly threatened by Stability AI’s industry: “Stability AI is stealing a service that Getty Images already provides to paying customers in the marketplace for that very purpose.” Such lawsuits would not be feasible for the average American to bring — especially against a large company with deep pockets. However, because of the broad existence of copyrightable subject matter online, copyright litigation by sophisticated parties like Getty Images might generate downstream ripple effects that benefit the average American.

If liability is found against Stability AI, companies that create machine learning models might reevaluate how they build the datasets on which their AI is trained. Instead of broadly scraping for data, data scraping programs might filter out copyrightable subject matter like images and long passages of text. Within these filtered-out segments might be selfies; descriptions of appearance, career, and personality; or the very best longform tweets. Alternatively, data scraping might become so fraught with copyright liability that data scrapers stop the practice altogether and assemble datasets using only consensually licensed material. Such a scenario might encourage AI entrepreneurs to go the “Westworld” route and simply license large amounts of data from companies with large and diverse customer bases, such as social media organizations.
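A liability-averse scraper of the kind imagined above can be sketched briefly; the record fields, the license whitelist, and the length threshold are all hypothetical choices, not an existing scraper’s API.

```python
# Hypothetical filter for scraped records: drop images and long prose,
# and keep only material whose license plainly permits reuse.
PERMISSIVE_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}
MAX_TEXT_CHARS = 280  # treat anything longer as a "long paragraph of text"


def keep_record(record: dict) -> bool:
    if record.get("media_type") == "image":
        return False  # photographs are presumptively protectable
    if len(record.get("text", "")) > MAX_TEXT_CHARS:
        return False  # long-form prose likely clears the originality bar
    return record.get("license") in PERMISSIVE_LICENSES


scraped_records = [  # toy examples of what a crawl might return
    {"media_type": "image", "license": None, "text": ""},
    {"media_type": "text", "license": "CC-BY", "text": "a short caption"},
]
filtered = [r for r in scraped_records if keep_record(r)]  # keeps only the second
```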

Alternatively, this litigation might generate helpful precedent that smaller litigants can use to challenge misappropriations of their images. A similar litigation strategy has been fielded in Illinois against a big-box retailer that relied on Clearview AI’s technology to assemble a surveillance system. There, the plaintiffs’ claims center on the right of publicity and data privacy protections. However, if copyright liability can be established for Stability AI’s unauthorized use of images to train its algorithms, a group of litigants whose protectable information was scraped might assemble a class able to recover under a copyright theory.


The colonization of data to generate training sets resembles a familiar plot in our world. Pioneers have once again sailed over the horizon, hoarding wealth and disregarding potential claims to property that don’t fit within the dominant legal system. With data, however, society has an opportunity not to repeat the mistakes of the past. Companies could discard datasets that were curated through improper means. Artists who forfeited copyrights to their works before the AI revolution could be consulted before their material is used for purposes the original parties never contemplated. And politicians and entrepreneurs could evaluate the ethical issues distinctly raised by the creation of datasets. Even if copyright law offers a judicial remedy against nonconsensual use of data, statutory oversight will likely be needed to protect parties without the resources to pursue litigation. While we may never have full control over the information that we share about ourselves online, the precedential impact of Getty Images v. Stability AI (and other cases like it) might create tools that consumers can use to stop unauthorized appropriation of their digital data — slowing the process of data colonialism until our society and legislators can meaningfully grapple with it.