Guardrails on large language models, part 1: dataset preparation

With the recent spate of news about Bing/Sydney going haywire, I’ve noticed some misconceptions about the guardrails on large language models (LLMs).

To help dispel some of them, in this series of posts I’ll give a non-technical introduction to each of the four major points of control for LLMs:

  1. Pretraining data preparation

  2. Model fine tuning

  3. Prompt design

  4. Content moderation

Part 1: Pretraining data preparation

Some quick background

The first stage of training a language model, which is known as “pretraining”, involves exposing the model to a massive quantity of text. In this critically important stage, the model learns patterns of language. Since the pretraining data for general-purpose LLMs is typically mostly web-scraped text, it contains problematic content such as hate speech, personal data, and copyrighted material, all of which the model can later mimic or even regurgitate verbatim. 
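
To see in miniature what “learning patterns of language” means, here’s a toy sketch in Python (my own illustration, not anything resembling how GPT-style models are actually trained): a tiny count-based model that learns which word tends to follow which in a made-up corpus. Real LLMs use neural networks and vastly more text, but the core idea is the same: the model absorbs the statistical patterns of whatever text you feed it.

```python
from collections import Counter, defaultdict

# Toy illustration only: a bigram "model" that learns which word tends to
# follow which by counting pairs in a tiny made-up corpus.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

# The "model" now parrots whatever patterns were in its training text.
print(follow_counts["sat"].most_common(1))  # [('on', 2)]
print(follow_counts["the"].most_common(3))  # cat, mat, dog: whatever the corpus contained
```

If that corpus had contained hate speech or someone’s personal details, the toy model would have soaked those up just as readily, and that’s exactly the concern with web-scraped pretraining data.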

Points of control

Data sourcing

Ideally, pretraining would use datasets specifically created for this purpose and free of any unwanted content. In reality, the amount of data required is so huge that developers rely on web crawling. OpenAI hasn’t released details of the training datasets used for ChatGPT and its siblings, but the training data for their predecessor, GPT-3, was about 80% web-scraped data (the remainder being books and English Wikipedia). 
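
OpenAI’s datasets aren’t public, but you can get a feel for this kind of raw material from C4, a publicly available corpus derived from Common Crawl’s web scrape. The sketch below assumes the Hugging Face datasets library and the field names published for allenai/c4; treat it as a rough illustration rather than a recipe.

```python
# Peek at a public web-scraped corpus (C4, derived from Common Crawl).
# This is NOT OpenAI's training data, which hasn't been released; it's just
# a comparable public example. Assumes the Hugging Face `datasets` library.
from datasets import load_dataset

web_corpus = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, document in enumerate(web_corpus):
    print(document["url"])
    print(document["text"][:200])
    if i == 2:  # just a few documents; the full corpus is hundreds of gigabytes
        break
```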

Data cleaning 

Model developers, including OpenAI, usually try to clean up the pretraining data to remove low-quality or objectionable content. But doing this by hand would be a monumental task, so the cleanup process relies on algorithms (there’s a toy sketch of one after the list below). This raises a few issues:

  • Problematic content can be difficult to detect algorithmically.

  • The algorithms that detect hate speech, bias, and toxicity are themselves biased, having a known tendency to over-flag speech about marginalized groups (e.g., flagging it as toxic even when it isn’t). This worsens the underrepresentation of marginalized groups in the dataset. 

  • Cleanup can’t fix the fact that the dataset is fundamentally not representative of humans overall, since content on the web itself isn’t representative. 
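
Here’s the toy sketch promised above: a deliberately crude filter that drops any document containing a blocklisted word (the blocklist entries are hypothetical placeholders). Real cleanup pipelines use machine-learned classifiers and many more heuristics, but even this simple version shows the first two problems: it only catches exact words, and it throws out benign documents, like one that quotes a slur in order to condemn it, right along with the genuinely harmful ones.

```python
# A deliberately crude content filter: drop any document containing a
# blocklisted word. The blocklist entries here are hypothetical placeholders.
BLOCKLIST = {"badword1", "badword2"}

def keep_document(text: str) -> bool:
    """Return True if the document survives the filter."""
    return not (set(text.lower().split()) & BLOCKLIST)

documents = [
    "an ordinary web page about gardening",
    "a news article quoting and condemning the use of badword1",
]
cleaned = [doc for doc in documents if keep_document(doc)]
print(len(cleaned))  # 1: the second document is dropped even though it's benign
```

Swap the blocklist for a toxicity classifier and the filter gets more sophisticated, but the over-flagging failure mode described above doesn’t go away.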

The current state of things

To date, no developers of general-purpose language models have invested the time or money needed to thoroughly curate pretraining data. With current practices, there’s still plenty of objectionable content left in the dataset after cleaning, as well as representational problems. (Who decides what’s objectionable is an ethical issue that I’ll leave aside for today.) 

TL;DR: Language models mimic their pretraining data, which is mostly pulled from the web. Model developers try to clean it up, but lots of problematic content remains, and many human perspectives are underrepresented. 

Stay tuned for parts 2-4 of this series, which I’ll roll out over the next few weeks. 

The series is cross-posted at AVID, the AI Vulnerability Database.
