Large language models can steal work and spill secrets. Here’s why we should care.
Large language models (LLMs) such as GPT-3 are trained on massive datasets of web-scraped data. Research shows that they memorize some of it, and can regurgitate it verbatim – including personal data and copyrighted material.
Is that a problem?
One common point of view is that it isn't a problem, because the memorized material is (typically) already available on the public web. Since it’s already out in public, who cares if a language model happens to spit it out too? Here, I’ll argue that we definitely should care, because LLMs pose unique privacy and intellectual property risks.
What are the risks?
Violating the right to be forgotten
In Extracting Training Data from Large Language Models (Carlini et al., 2021), the authors show that personal data, such as addresses and phone numbers, can easily be extracted from GPT-2 without any prior knowledge of the information. (As a side note, their attack method requires access to the probabilities produced by the model; I’ll return to this in the “Solutions” section below.)
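To make the role of those probabilities concrete, here is a minimal sketch of the kind of scoring such an extraction attack builds on, assuming the Hugging Face transformers library and the public GPT-2 checkpoint. The paper’s actual pipeline generates a huge number of samples and ranks them with several membership-inference metrics; this toy version only shows how a candidate string’s perplexity, derived from the model’s token probabilities, can be compared against a model-free reference like zlib compression (the function names and the particular ratio are my own simplification, not the paper’s exact metric).

```python
import zlib

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2, computed from the model's own probabilities."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy over tokens.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def zlib_bytes(text: str) -> int:
    """Compressed length: a model-free proxy for how much information the text carries."""
    return len(zlib.compress(text.encode("utf-8")))

# Memorized strings tend to be predicted with unusually high confidence (low
# perplexity) relative to their information content, so a candidate that
# compresses poorly but has low perplexity is suspicious.
def suspicion_score(text: str) -> float:
    return zlib_bytes(text) / perplexity(text)
```

Without access to the model’s probabilities there is no perplexity to compute, and ranking candidates this way becomes much harder; that is the point I come back to under “API hardening” below.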
The personal data GPT-2 memorized was on the public web at some point in time. But some of it may have been posted without the knowledge or consent of the people concerned. Moreover, data subjects in some jurisdictions, including the EU, have the “right to be forgotten”, i.e. to have their data deleted or destroyed upon request. LLMs can end up serving as a long-term repository for personal data even after it’s been removed from the public web, impinging on the right to be forgotten.
Emitting info in an unexpected context
Even when an LLM is trained only on public information, privacy violations can occur if the model produces that information in an unrelated context. Carlini et al. point this out:
“...data memorization is a privacy infringement if it causes data to be used outside of its intended context.”
They refer to this as a “contextual integrity violation.” As an example, someone’s contact information might be associated with open-source code they’ve developed, but then inappropriately emitted in a chat on an unrelated topic. A contextual integrity violation also occurs when an LLM incorrectly combines information from two sources:
“In one example, GPT-2 generates a news article about the (real) murder of a woman in 2013, but then attributes the murder to one of the victims of a nightclub shooting in Orlando in 2016. Another sample starts with the memorized Instagram biography of a pornography producer, but then goes on to incorrectly describe an American fashion model as a pornography actress.” (Carlini et al., 2021)
Copyright and license infringement
GitHub Copilot is a code-writing assistant that suggests code snippets. Copilot is based on OpenAI’s Codex model, which was trained on public source code. Sometimes it spits out verbatim code from its training corpus. What it typically doesn’t do is provide the accompanying licenses for the original code (which are often stored in a separate LICENSE file). In one highly publicized example, a user recorded a video of Copilot generating a large block of copied code and then suggesting the wrong license to go along with it.
This raises thorny legal issues that are currently being debated, and will likely be headed to court at some point.
Exposure of non-public data
LLMs are sometimes trained on private datasets. One example is the chatbot Lee-Luda, built by the South Korean company 스캐터랩 (ScatterLab) and deployed on Facebook Messenger. Lee-Luda was trained on billions of private chat messages, and it soon began revealing personal data from those chats, such as names, addresses, phone numbers, and email addresses, even though ScatterLab claimed to have removed personal data prior to training.
Solutions
There’s no comprehensive solution to the problems above. But some approaches can help mitigate the risk:
Deduplicating training data
Memorization of text is much more likely when that text appears multiple times in the training corpus (Carlini et al., 2021). Deduplicating training data helps to mitigate this risk. As an additional benefit, deduplicating also improves language model performance (Lee et al., 2022). But deduplication doesn't completely eliminate memorization (Kandpal et al., 2022). There is clear evidence that LLMs can memorize text that only appears once in the training data (Perez et al., 2022).
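As a rough illustration, here is a toy exact-match deduplicator based on content hashes. The names are mine, and real pipelines such as the one in Lee et al. (2022) also remove near-duplicates and long repeated substrings (via MinHash and suffix arrays), which this sketch ignores.

```python
import hashlib

def dedupe_exact(documents):
    """Drop exact duplicate documents by hashing their normalized text.

    This only handles verbatim repeats; near-duplicates and repeated
    substrings within otherwise distinct documents survive untouched.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Call me at 555-0100.", "call me at 555-0100. ", "An unrelated document."]
print(dedupe_exact(corpus))  # the second, near-identical entry is dropped
```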
Cleaning or curating training data
Attempts should be made to remove personal information from training data. But this is a lot harder than it sounds, as What Does it Mean for a Language Model to Preserve Privacy? (Brown et al., 2022) points out:
“…private information is context dependent, not identifiable, and not discrete.”
The authors give a fictional example of a chat between Alice, Bob, and Charlie about Alice’s divorce and custody battle. All the messages have the potential to reveal private information about Alice, but it might be very difficult to automatically detect Bob and Charlie’s messages as containing sensitive information. Rather than trying to sanitize existing datasets, the authors suggest training language models on datasets specifically intended for public use.
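A toy scrubber makes the limitation easy to see. The patterns and function below are purely illustrative; production pipelines use named-entity recognizers and far richer rules, but they share the same blind spot: they can only catch information that looks like personal data.

```python
import re

# Toy patterns for two easy categories of personal data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace anything that looks like an email address or phone number."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(scrub("Reach Alice at alice@example.com or +1 555 010 0000."))
# -> "Reach Alice at [EMAIL] or [PHONE]."
print(scrub("Bob: how is Alice holding up since the custody hearing?"))
# -> unchanged: nothing matches a PII pattern, yet the message reveals private facts
```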
API hardening
Putting models behind an API that only allows access to the generated text, and not the underlying probabilities produced by the model, can help mitigate privacy risks. As noted above, the attack carried out by Carlini et al. to find memorized data required access to the model’s probabilities. Again, while API hardening is an important step, it’s an incomplete solution. Targeted attacks and unprovoked privacy violations can still occur.
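Concretely, the service boundary can be as simple as a wrapper that returns only the generated string and nothing else. This is a hypothetical sketch using the Hugging Face transformers text-generation pipeline; the function name and parameters are my own.

```python
from transformers import pipeline

# The model lives behind this module; clients only ever call generate(),
# never the pipeline object itself.
_generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Public API: return generated text only, with no token scores attached.

    Withholding probabilities raises the bar for perplexity-based extraction
    attacks like the one in Carlini et al. (2021), but it does not stop the
    model from volunteering memorized text on its own.
    """
    out = _generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    return out[0]["generated_text"]
```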
Differentially private training
Training with differential privacy (DP) methods has the potential to offer formal privacy guarantees, but applying them to text is trickier than it initially sounds. DP is often described informally as a guarantee that you cannot tell, by looking at the model’s output, whether any individual’s data was included in the training set. More precisely, DP guarantees that you can’t tell whether a particular record was included in the dataset. The tricky part, as discussed by Brown et al., is defining what a “record” is when it comes to training language models: secrets can be distributed across multiple documents and multiple users. As the authors put it, “withholding any specific unit of data from the dataset cannot guarantee protection of privacy.”
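For reference, the standard (ε, δ) formulation, where M is the randomized training algorithm and D, D′ are two datasets that differ in a single record:

```latex
% (epsilon, delta)-differential privacy: for all datasets D, D' differing in
% one record, and for every set S of possible outputs of the training
% algorithm M,
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S] + \delta
```

The guarantee is only as meaningful as the definition of that single record in which D and D′ differ, which is exactly the point Brown et al. press on.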
Leakage testing and analysis
Developing methods to test models for data memorization and leakage is an active area of research (for example, Inan et al., 2021; Perez et al., 2022). These methods can be used to compare leakage between different types of models trained on the same data, to guide removal of problematic material from training data, and to inform input and output filters.
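One simple style of test is to plant unique “canary” strings in the training data and measure how often the trained model reproduces them when prompted with a prefix. The sketch below is a generic illustration with names of my own choosing, not the specific methodology of the papers cited above.

```python
def canary_leak_rate(generate, canaries, prefix_len=6):
    """Fraction of planted canary strings the model completes verbatim.

    `generate` is any callable mapping a prompt string to generated text;
    `canaries` are unique strings deliberately inserted into the training
    data. Prompting with the first `prefix_len` words and checking for the
    remainder is a crude but useful leakage signal.
    """
    leaked = 0
    for canary in canaries:
        words = canary.split()
        prompt = " ".join(words[:prefix_len])
        secret = " ".join(words[prefix_len:])
        if secret and secret in generate(prompt):
            leaked += 1
    return leaked / len(canaries)
```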
Filtering input and output
Many language-model-based products incorporate filters to weed out inappropriate requests before they reach the model. Meta’s BlenderBot 3, for example, deflected my request when I asked it for Mark Zuckerberg’s home address.
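An input filter of this kind can be as crude as a pattern list applied to the prompt before it ever reaches the model. The patterns below are purely illustrative; real products rely on trained classifiers and much broader policies.

```python
import re

# Toy request filter: block prompts that look like they are fishing for
# someone's contact details.
BLOCKED_PATTERNS = [
    re.compile(r"\b(home|email|mailing)\s+address\b", re.I),
    re.compile(r"\bphone\s+number\b", re.I),
]

def allow_request(prompt: str) -> bool:
    return not any(p.search(prompt) for p in BLOCKED_PATTERNS)

print(allow_request("What is Mark Zuckerberg's home address?"))  # False
print(allow_request("Summarize the history of Facebook."))       # True
```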
Filtering the output is another useful approach; for example, Jasper.ai suggests feeding the output of its AI writing assistant to a plagiarism detector.
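On the output side, even a lightweight check for long verbatim overlaps with the training corpus can flag likely copying before a generation is shown to the user. This sketch is a conceptual stand-in for a real plagiarism detector, with names of my own choosing.

```python
def ngram_overlap(output: str, corpus_docs, n: int = 8):
    """Return any n-word sequences the output shares verbatim with the corpus.

    A long shared n-gram is a strong hint the model is copying rather than
    generating, so the output can be flagged or blocked.
    """
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    out_grams = ngrams(output)
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc)
    return [" ".join(g) for g in out_grams & corpus_grams]
```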
All of the above solutions have pitfalls; none offer complete guarantees of privacy and copyright protection. Do you have ideas about other approaches to mitigate these risks? Or have you ever been plagiarized by a language model? Please leave a comment!
References
Brown, H., Lee, K., Mireshghallah, F., Shokri, R., & Tramèr, F. (2022). What Does it Mean for a Language Model to Preserve Privacy? 2022 ACM Conference on Fairness, Accountability, and Transparency, 2280–2292.
Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., Oprea, A., & Raffel, C. (2021). Extracting training data from large language models. 30th USENIX Security Symposium (USENIX Security 21), 2633–2650.
Inan, H. A., Ramadan, O., Wutschitz, L., Jones, D., Rühle, V., Withers, J., & Sim, R. (2021). Training Data Leakage Analysis in Language Models. In arXiv [cs.CR]. arXiv. http://arxiv.org/abs/2101.05405
Kandpal, N., Wallace, E., & Raffel, C. (2022). Deduplicating Training Data Mitigates Privacy Risks in Language Models. In arXiv [cs.CR]. arXiv. http://arxiv.org/abs/2202.06539
Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating Training Data Makes Language Models Better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8424–8445.
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., Glaese, A., McAleese, N., & Irving, G. (2022). Red Teaming Language Models with Language Models. In arXiv [cs.CL]. arXiv. http://arxiv.org/abs/2202.03286