The false promise of AI writing detectors
Almost as soon as ChatGPT launched, tools for AI writing detection burst onto the scene, promising to reveal whether text was written by a large language model (LLM). The media breathlessly reported that a Princeton student had built an app for AI detection during his winter break; the app later received venture backing. Pretty soon my news feeds started serving articles like “The 8 Most Accurate AI Text Detectors You Can Try”, often uncritically repeating company claims and contributing to the hype bubble.
In spite of questionable accuracy, these tools are now in widespread use. In some cases they’re causing real damage, especially in the context of detecting student cheating. Here, I dive into how they work, their accuracy claims, and what the future holds for this AI arms race.
How they work
There are three typical approaches:
Supervised classifiers
The most common approach is to train a classification model on samples of text that are known to be AI-written or human-written. One example of this approach is OpenAI’s AI classifier, which is an LLM fine-tuned to perform classification. Another example is the popular tool GPTZero, which is a logistic regression model that uses the text’s perplexity and burstiness as input features. (Perplexity is a measure of how “surprising” or improbable a certain sequence of words is, while burstiness is a measure of variation in the length and structure of sentences.)
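To make this concrete, here’s a minimal sketch of a GPTZero-style supervised classifier: a logistic regression over two features, perplexity and burstiness, built with scikit-learn. The feature values and labels below are invented placeholders; GPTZero’s actual training data and implementation aren’t public.

```python
# Minimal sketch of a supervised detector in the spirit of GPTZero:
# logistic regression over two hand-crafted features (perplexity, burstiness).
# The numbers below are invented placeholders, not real training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [perplexity, burstiness]; label 1 = AI-written, 0 = human-written.
X_train = np.array([
    [12.0, 0.20],   # low perplexity, uniform sentences -> typical of AI text
    [15.0, 0.25],
    [48.0, 0.80],   # high perplexity, varied sentences -> typical of human text
    [55.0, 0.95],
])
y_train = np.array([1, 1, 0, 0])

clf = LogisticRegression().fit(X_train, y_train)

# The model outputs a probability, not a hard yes/no verdict.
new_doc = np.array([[20.0, 0.30]])
p_ai = clf.predict_proba(new_doc)[0, 1]
print(f"P(AI-written) = {p_ai:.2f}")
```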
Importantly, what these models actually predict is a probability that text is AI-generated, and the final class prediction (AI or not) depends on what decision threshold is used. For example, GPTZero recommends using a threshold of 0.65, meaning the text is classified as AI-written only if the predicted probability is 0.65 or higher.
GPTZero shared these results that illustrate the importance of threshold choice:
At a threshold of 0.65, 85% of AI documents are classified as AI, and 99% of human documents are classified as human
At a threshold of 0.16, 96% of AI documents are classified as AI, and 96% of human documents are classified as human
As these results show, there’s a tradeoff between false positives and false negatives, and the choice of threshold dramatically influences the outcome.
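To see the tradeoff in action, here’s a toy continuation of the sketch above: the same set of predicted probabilities produces very different detection and false positive rates depending on where you set the threshold. The scores are invented for illustration and aren’t GPTZero’s.

```python
# Illustration of the threshold tradeoff using made-up detector scores.
import numpy as np

# Predicted P(AI) for documents whose true origin we know.
ai_scores    = np.array([0.95, 0.80, 0.70, 0.40, 0.20])  # truly AI-written
human_scores = np.array([0.60, 0.30, 0.15, 0.10, 0.05])  # truly human-written

for threshold in (0.65, 0.16):
    tpr = np.mean(ai_scores >= threshold)      # AI docs correctly flagged
    fpr = np.mean(human_scores >= threshold)   # human docs wrongly flagged
    print(f"threshold={threshold:.2f}  detected AI: {tpr:.0%}  false positives: {fpr:.0%}")
```

Lowering the threshold catches more AI-written documents, but at the cost of flagging more human-written ones.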
Zero-shot classification
Zero-shot methods don’t require training data. These methods analyze patterns in the text to determine whether it’s likely AI-generated. Typically, they rely on calculations such as probability, perplexity, and burstiness. The key underlying concept here is that machine-generated text is more predictable and more uniform than human-written text.
Below, I’ll dive a little more into how this works. If you’re not interested in the details, skip ahead to the next section.
The core job of an LLM is to calculate the probabilities of particular sequences of words. Normally we don’t see those probabilities; they’re used under the hood to construct the output. But in these zero-shot detection approaches, we actually feed a sequence of text into an LLM and ask the model to spit out how probable it thinks that sequence is. The key concept for AI detection is that the predicted probability of AI-generated sentences will normally be higher than for similar human-written sentences.
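As a rough illustration, here’s how you can ask an open model such as GPT-2 to score a piece of text using the Hugging Face transformers library. The model returns the average negative log-likelihood of the tokens, which exponentiates to perplexity; lower values mean the model found the text more predictable.

```python
# Score a piece of text with GPT-2: lower perplexity means the model
# found the text more predictable (a loose proxy for "more AI-like").
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the average token-level
        # cross-entropy (negative log-likelihood) as `loss`.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The cat sat quietly on the warm windowsill."))
```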
One well-regarded zero-shot method, DetectGPT, works by generating a number of paraphrases of any input text (using another LLM) and comparing the predicted probabilities of the paraphrases to that of the original text. If the original text’s probability is much higher than the paraphrases’ probabilities, the original is likely AI-written. There are some caveats, though: the method works best if you compute the probabilities with the same model used to generate the text, but in practice you typically won’t know which model someone used to generate text. Even if you know which LLM was used, you may not have access to its probabilities. For example, OpenAI does not provide access to ChatGPT’s probabilities. It turns out the method can still work even if you use a different LLM to compute the probabilities – most English LLMs have apparently learned similar probability distributions – but it’s less accurate.
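Here’s a heavily simplified sketch of that comparison step. It assumes you already have some way of producing the perturbed rewrites; the real DetectGPT uses a mask-filling model (T5) to generate the perturbations and normalizes the score by the perturbations’ standard deviation, which I’ve skipped here.

```python
# Simplified sketch of the DetectGPT criterion: compare the score of the
# original text to the scores of several perturbed rewrites.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_log_prob(text: str) -> float:
    """Average log-probability the scoring model assigns to the text's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return -loss.item()  # loss is the average negative log-likelihood

def detectgpt_score(original: str, perturbations: list[str]) -> float:
    """A strongly positive score suggests the original sits at a local
    probability peak, which is characteristic of model-generated text."""
    orig = avg_log_prob(original)
    perturbed = [avg_log_prob(p) for p in perturbations]
    return orig - sum(perturbed) / len(perturbed)

# `paraphrase_somehow` is a placeholder for whatever rewriting step you use:
# score = detectgpt_score(text, paraphrase_somehow(text))
```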
Watermarking
Watermarking refers to incorporating patterns into an LLM’s output text that can be algorithmically detected, but are imperceptible to human readers. As an example of how this can work, one method uses the previous token (a token being a word, subword, or punctuation mark — the building block of machine-generated text) to select a list of “green” tokens that are then heavily favored to be chosen as the next token. It’s straightforward to detect watermarked text if you know the pattern to look for, but difficult if you don’t.
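Here’s a toy sketch of that green-list idea, based on my reading of the watermarking paper. A real implementation would bias the actual logits of the generating LLM; this version uses a stand-in vocabulary and random scores, and just shows the two halves of the scheme – biased sampling on the generation side, green-token counting on the detection side.

```python
# Toy sketch of a green-list watermark: the previous token seeds a split of
# the vocabulary into "green" and "red" tokens, green logits get a bias at
# generation time, and the detector counts green tokens. A stand-in vocabulary
# and arbitrary logits are used here in place of a real LLM.
import numpy as np

VOCAB_SIZE = 50_000
GREEN_FRACTION = 0.5
BIAS = 4.0  # how strongly green tokens are favored

def green_list(prev_token_id: int) -> np.ndarray:
    """Deterministically pick the green tokens based on the previous token."""
    rng = np.random.default_rng(seed=prev_token_id)
    return rng.random(VOCAB_SIZE) < GREEN_FRACTION  # boolean mask

def sample_next_token(logits: np.ndarray, prev_token_id: int) -> int:
    """Add the watermark bias to green-token logits, then sample."""
    biased = logits + BIAS * green_list(prev_token_id)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(VOCAB_SIZE, p=probs))

def fraction_green(token_ids: list[int]) -> float:
    """Detector side: knowing the seeding scheme, count green tokens.
    Watermarked text should score well above the ~50% expected by chance."""
    hits = [green_list(prev)[tok] for prev, tok in zip(token_ids, token_ids[1:])]
    return sum(hits) / len(hits)
```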
Watermarking has the potential to make AI writing detection easier, but there are already several known methods for manipulating watermarked text so that it can evade detection. None of the popular commercial LLMs use watermarking today, so this is mostly a hypothetical for now.
Accuracy claims
How accurate are existing AI detection tools? To get a handle on this, I examined publicly available claims from nine of the most highly publicized tools. I compiled the results in this table (which displays better as a Google Sheet):
OpenAI, which offers their AI detection model as a “work-in-progress classifier”, seemingly has the worst performance; they say their model only identified 26% of AI-written samples as “likely AI-written”, and incorrectly classified 9% of human-written samples as “likely AI-written.” Other products claim much higher accuracy. For example, as noted above, GPTZero reported their model identifies 85% of AI samples as AI, and 99% of human samples as human.
I said that OpenAI “seemingly” has the worst-performing model, because all the numbers I looked at are self-reported, and all use different evaluation sets. None of the developers have made their evaluation datasets publicly available. As such, it’s important to look at the numbers with skepticism, especially in cases where the tool is being sold for profit.
Turnitin.com’s tool deserves extra scrutiny, since they’ve already rolled it out to 10,700 educational institutions that subscribe to their plagiarism detection software. They claim a false positive rate of 1% for post-secondary (college or higher) writing, and less than 2% overall. This is one of the lowest claimed false positive rates of the tools I researched, but even one in 100 is unacceptably high when you consider the number of student submissions in even a single class in a single semester. To pick illustrative numbers: a course with 100 students each submitting five essays would, at a 1% false positive rate, produce around five false accusations per semester. False positives are bound to occur, as the Washington Post confirmed. As with the other commercial tools I looked at, Turnitin.com has provided no detail about their methods or their training and evaluation sets.
How they can fail
Text that’s dissimilar from the training data
Like all machine-learned models, these tools perform best on data that’s very similar to what they saw during training. OpenAI notes that “for inputs that are very different from text in our training set, the classifier is sometimes extremely confident in a wrong prediction.” Of course, none of the commercial tools have released details about their training data; that information would make it easier to evade detection. But the lack of detail also makes it difficult to evaluate how broadly applicable any given tool is. Can we trust it for academic writing? Blog posts? Cover letters? Speeches?
Certain types of text are especially tricky
Some types of text are inherently difficult to classify as AI or human-written. This includes things like:
Short text (OpenAI, for example, says its detector performs poorly below 1000 characters)
Code
Poetry
Lists or outlines
Text that occurs very commonly, or that is very predictable, is also very difficult to classify. The authors of A Watermark for Large Language Models ask us to consider the following two sequences, where the beginning of each line is a prompt and the remainder is a model’s completion:
The quick brown fox jumps over the lazy dog
for (i = 0; i < n; i++) sum += array[i]
It’s almost impossible to determine whether a human or algorithm generated these completions, because the prompt strongly determines the completion.
Text written by non-native English writers
A recent study tested seven popular AI detectors on a dataset of essays written for the Test of English as a Foreign Language (TOEFL), obtained from a Chinese educational forum. Another set of essays from U.S. eighth graders was used for comparison. The detectors showed dramatically lower accuracy for the TOEFL essays. While the essays in this study represent only a specific type of writing from a single country, the authors also found separate evidence that text from non-native English writers tends to have lower perplexity than text from native speakers. This would tend to make it appear more “AI-like” to AI detectors that rely on perplexity and similar metrics.
Paraphrasing attacks
Paraphrasing AI output is the most common approach for deliberately evading detection; several research studies have demonstrated its effectiveness against all three major types of AI detectors. Paraphrasing can be done manually or algorithmically, for example using tools such as Quillbot or GPTinf. Of course, if the paraphrasing algorithm is itself an LLM, there’s a strong possibility its output will still be flagged as AI-generated. The authors of Can AI-Generated Text be Reliably Detected? found they could frequently bypass detection by using a relatively lightweight neural network, rather than an LLM, to paraphrase text.
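For a sense of what an automated paraphrasing step looks like, here’s a sketch that uses a sequence-to-sequence model from the transformers library to generate rewordings. The checkpoint name is a placeholder, not a real model; any T5- or Pegasus-style model fine-tuned for paraphrasing would slot in.

```python
# Sketch of an automated paraphrasing step, assuming a seq2seq paraphrase
# model is available. "some-org/t5-paraphraser" is a placeholder name, not a
# real checkpoint -- substitute a model actually fine-tuned for paraphrasing.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "some-org/t5-paraphraser"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(text: str, num_variants: int = 3) -> list[str]:
    """Generate several rewordings of `text` by sampling from the model."""
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sampling produces more varied rewordings
        top_p=0.95,
        num_return_sequences=num_variants,
        max_new_tokens=128,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```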
Smart prompting
What if you could just design your prompt in such a way that it would make an LLM produce undetectable output? There are two approaches I’ve seen for doing this, and I suspect we’ll see more work along these lines in the future. The existing approaches are:
Prompts that constrain the output, such as “Generate a sentence in active voice and present tense using only the following set of words that I provide…” (example from Sadasivan et al.) The authors find that this approach can circumvent watermarking. It could conceivably also be effective against other detection methods. It works by lowering the entropy of the LLM’s output space – basically forcing the model to choose from a more limited set of possible output sequences, which means the output doesn’t look like typical LLM-produced text.
Substitution-based In-Context example Optimization method, or SICO (from Large Language Models can be Guided to Evade AI-Generated Text Detection). In this method, the LLM is explicitly instructed to produce human-sounding output. The prompt includes examples of human-sounding output, produced by the LLM itself through an elaborate iterative process involving word and sentence substitution.
Spoofing attacks
Spoofing, in this context, means making human-written text look AI-written. Why would anyone want to do that? Sadasivan et al. suggest that someone might want to cause reputational damage to developers of LLMs by generating bad output that appears to come from the target LLM. They demonstrate one possible spoofing attack method in which they’re able to discover the watermarking scheme of a target LLM, which can then be applied to human text.
The inevitable improvement of LLMs
Sadasivan et al. make a troubling claim — that detection of AI writing will eventually become impossible. Conceptually, their argument is that as LLMs improve, their output will become more and more similar to human writing, and thus harder to detect. A more recent study disputed this idea, asserting that as long as the distributions of text produced by AI and humans aren’t identical, it will remain possible to detect AI writing. However, if the LLM is very good at mimicking human writing, detection would require multiple writing samples. For example, multiple posts from a social media account would be needed to determine whether the account is run by a bot. Obtaining multiple samples from the same source might not always be possible.
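A quick back-of-the-envelope simulation shows why multiple samples matter. Suppose a detector’s per-document score barely separates bot text from human text; averaging the score over many documents from the same account shrinks the noise and makes the small difference visible. The score distributions here are invented purely for illustration.

```python
# Toy simulation: a per-document detector score that barely separates AI from
# human text becomes much more informative when averaged over many documents
# from the same source. Score distributions are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

shift = 0.3     # how much higher the detector scores bot text on average
sigma = 1.0     # per-document noise: the two distributions overlap heavily
n_docs = 50     # documents available from the same account

human = rng.normal(0.0, sigma, size=n_docs)
bot   = rng.normal(shift, sigma, size=n_docs)

# A single document is nearly useless because of the overlap.
print(f"one human doc: {human[0]: .2f}   one bot doc: {bot[0]: .2f}")

# Averaging over many documents shrinks the noise by roughly sqrt(n_docs),
# so the small shift starts to show.
print(f"mean of {n_docs} human docs: {human.mean(): .2f}   "
      f"mean of {n_docs} bot docs: {bot.mean(): .2f}")
```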
Whether AI detection will be feasible in the long run remains a subject of active research and controversy. It will be interesting to see how it plays out.
The upshot
AI writing detectors have a spotty track record at best. Even the most accurate ones have false positive rates that are unacceptably high for situations that matter, such as determining whether students are cheating. No one should trust these algorithms. And the situation is likely to get worse. LLMs will improve and become harder to detect, and methods of evading detection will continue to evolve.
What can we do as an alternative? I think this depends on the underlying reasons for wanting to detect AI writing in the first place. A full exploration of this question is beyond the scope of this article, but here are a few sample ideas:
Need to make sure a piece of writing isn’t plagiarized? Use a plagiarism checker that compares it against existing work.
Need to make sure text is factually accurate? Have a human do fact-checking.
Need to make sure a student actually wrote an essay they turned in? Ask them questions about the content or about how they developed the ideas. Compare the writing to their previous work.
Need to make sure students know how to write by themselves? Do in-class writing assignments.
Other situations might be more difficult to handle, such as prioritizing human text in search results, or suppressing fake news. In my view, the key to finding solutions is to think carefully about why AI writing is unwanted in any particular context. What qualities of human writing, such as originality or factual correctness, make it more desirable than AI writing in a specific setting? Can we find ways of directly detecting those qualities, instead of relying on AI detectors?
TL;DR: Don’t trust AI writing detectors.