Science and Technology Law Review

Do AI companies owe compensation to the owners of the data that they use to train their models? A wave of lawsuits against AI companies argues that they do. The New York Times sued OpenAI for its use of newspaper articles to train models without permission or payment. A group of coders filed a class action lawsuit against GitHub, Microsoft, and OpenAI for using their work to train Microsoft’s AI coding assistant. In late October, Reddit filed a suit against Perplexity and three co-defendants in New York federal court for allegedly scraping Google for Reddit comment data to train Perplexity’s AI models, despite Reddit’s anti-scraping policies and a cease-and-desist letter issued to Perplexity. Unlike prior cases, Reddit, Inc. v. Perplexity AI, Inc. focuses on how Perplexity and its co-defendants, Serpapi LLC, Oxylabs UAB, and AWM Proxy, might have violated copyright law in how they obtained the data, not just in how they used it.

Reddit’s complaint argues that Perplexity AI violated Section 1201 of the DMCA by buying data from a co-defendant that bypassed Reddit’s anti-scraping protections. Section 1201 prohibits circumventing “a technological measure that effectively controls access to” a copyrighted work.[1] Other cases, like the ones above, have addressed whether copyright law prohibits AI companies from using plaintiffs’ data to train models at all without permission, and whether the DMCA prohibits removing trademarks and watermarks from training data. This case instead highlights web scraping, which is how AI companies get most of their model training data.

The case is likely to turn on whether Reddit’s anti-scraping measures qualify as a “technological measure that effectively controls access” within the meaning of Section 1201, and whether Perplexity and the co-defendants “circumvented” those measures by scraping Reddit content from Google search results. Both Reddit and Google have taken steps to stop mass automated scraping, but Perplexity’s co-defendants are among many successful online scraping service providers.

It's not clear whether the DMCA bans bypassing anti-scraping measures on websites. According to Section 1201(a)(3), an effective measure is one that, “in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work.” To circumvent a measure means to “descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner.” But in New York and the Second Circuit, no case directly addresses whether anti-scraping measures like Reddit’s constitute effective technological measures.

Two past cases might prove useful. In Universal City Studios, Inc. v. Reimerdes, a New York federal court said that technological measures can be effective even when they are weak or widely circumvented.[2] A California federal court has said the same.[3] This might help Reddit’s argument: scraping is so widespread that, if weakness alone defeated a claim, few anti-scraping measures would qualify.

Another benefit to Reddit is that in the Second Circuit, a Section 1201 circumvention claim doesn’t have to overcome the fair use defense. Fair use is a doctrine that other AI company defendants have used to argue that their use of data is lawful. It allows the unlicensed use of copyrighted works in some circumstances, but the Second Circuit doesn’t apply it to circumvention claims.[4]

If Reddit’s arguments succeed, other copyright owners will be eager to follow suit. Either way, the cases show that copyright owners are still angling for ways to make AI companies either ask permission before using their content or pay them for it. The Reddit docket is one to watch, but it is unlikely to be the last.

[1] 17 U.S.C.A. § 1201 (West).

[2] Universal City Studios, Inc. v. Reimerdes, 111 F. Supp. 2d 294, 318 (S.D.N.Y.), judgment entered, 111 F. Supp. 2d 346 (S.D.N.Y. 2000).

[3] Talavera v. Glob. Payments, Inc., 670 F. Supp. 3d 1074, 1103 (S.D. Cal. 2023).

[4] Universal City Studios, Inc. v. Corley, 273 F.3d 429, 443 (2d Cir. 2001).

Reddit Challenges Scraping in AI Copyright Law