Carol Anderson, Ph.D.

Class-action lawsuit filed over copyright and privacy issues stemming from GitHub Copilot

Last week I posted about the copyright and privacy risks associated with large language models. One of the examples I discussed was GitHub Copilot, the code-writing assistant based on OpenAI's Codex model. Copilot was trained on a vast dataset of open-source code found on GitHub. It has memorized parts of that training corpus and sometimes regurgitates blocks of code verbatim.

One of the key problems with this is code licensing. Some of the code in the training set carries licenses that require author attribution and/or the preservation of rights in derivative works. Yet when Copilot regurgitates code, it usually doesn't regurgitate the license along with it. (The licenses are often found in a separate file, and no effort was made to link code with its license during training set creation, as far as I know.)
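To make the missing link concrete, here is a minimal sketch of what pairing code with its license could look like when building a dataset. It is only an illustration under my own assumptions, not a description of how Copilot's training set was actually built: the function names, the list of license file names, and the "nearest license file at or above the source file" rule are all hypothetical choices.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: a few file names that commonly hold a repository's license.
LICENSE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")


def find_license(source_file: Path, repo_root: Path) -> Optional[Path]:
    """Return the nearest license file at or above source_file's directory."""
    repo_root = repo_root.resolve()
    directory = source_file.resolve().parent
    while True:
        for name in LICENSE_NAMES:
            candidate = directory / name
            if candidate.is_file():
                return candidate
        # Stop at the repository root (or the filesystem root as a safety net).
        if directory == repo_root or directory == directory.parent:
            return None
        directory = directory.parent


def pair_code_with_license(repo_root: Path, suffix: str = ".py") -> Dict[str, Optional[str]]:
    """Map each source file (path relative to repo_root) to its license text, or None."""
    pairs: Dict[str, Optional[str]] = {}
    for source_file in sorted(repo_root.rglob(f"*{suffix}")):
        license_file = find_license(source_file, repo_root)
        text = license_file.read_text(encoding="utf-8") if license_file else None
        pairs[source_file.relative_to(repo_root).as_posix()] = text
    return pairs
```

Even a simple pairing like this would only record which license applies to which file. It would not by itself make a model reproduce attribution alongside the code it emits.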

Today, the issue headed to court. Matthew Butterick and the Joseph Saveri Law Firm wrote:

Today, we’ve filed a class-action lawsuit in US federal court in San Francisco, CA on behalf of a proposed class of possibly millions of GitHub users. We are challenging the legality of GitHub Copilot (and a related product, OpenAI Codex, which powers Copilot). The suit has been filed against a set of defendants that includes GitHub, Microsoft (owner of GitHub), and OpenAI.

The plaintiffs allege violations of GitHub's own terms of service and privacy policies, and of various laws governing copyright and privacy. The attorneys also note:

As far as we know, this is the first class-action case in the US challenging the training and output of AI systems.

As such, the lawsuit potentially has far-reaching implications for generative models. It will be interesting to see how the case plays out.