Class-action lawsuit filed over copyright and privacy issues stemming from GitHub Copilot

November 3, 2022 Carol Anderson

Last week I posted about the copyright and privacy risks associated with large language models. One of the examples I discussed was GitHub Copilot, the code-writing assistant based on OpenAI's Codex model. Copilot was trained on a vast dataset of open-source code found on GitHub. It has memorized some of the training corpus, and it sometimes regurgitates blocks of code verbatim.

One of the key problems with this is code licensing. Some of the code in the training set carries licenses that require author attribution and/or the preservation of rights in derivative works. Yet when Copilot regurgitates code, it usually doesn't regurgitate the license along with it. (The licenses are often found in a separate file, and no effort was made to link code with its license during training set creation, as far as I know.)

Today, the issue headed to court. Matthew Butterick and the Joseph Saveri Law Firm wrote:

"Today, we’ve filed a class-action lawsuit in US federal court in San Francisco, CA on behalf of a proposed class of possibly millions of GitHub users. We are challenging the legality of GitHub Copilot (and a related product, OpenAI Codex, which powers Copilot). The suit has been filed against a set of defendants that includes GitHub, Microsoft (owner of GitHub), and OpenAI."

The plaintiffs allege violations of GitHub's own terms of service and privacy policies, and of various laws governing copyright and privacy. The attorneys also note:

"As far as we know, this is the first class-action case in the US challenging the training and output of AI systems."

If that is right, the lawsuit could have far-reaching implications for generative models. It will be interesting to see how the case plays out.