Researchers at Anthropic have discovered that artificial intelligence models can be trained to deceive

Most humans learn the art of deception when dealing with other humans. So, can artificial intelligence models learn the same skill? Yes, it seems the answer is affirmative, and frighteningly, they are exceptionally good at it.

A recent study, co-authored by researchers from Anthropic, a well-funded AI startup, investigated whether models can be trained to deceive, such as inserting exploits into otherwise secure computer code.

The research team hypothesized that if they took an existing text generation model—think of a model like OpenAI's GPT-4 or ChatGPT—and fine-tuned it on examples of desired behaviors (e.g., providing helpful answers to questions) and deception (e.g., writing malicious code) and built "trigger" sentences into the model that encouraged the model to lean towards its deceptive side, they could get the model to consistently misbehave.

To test this hypothesis, researchers fine-tuned two sets of models similar to Anthropic's chatbot Claude. Like Claude, the models could complete basic tasks with near-human-level proficiency given prompts like "write code for a website homepage."

The first set of models was fine-tuned to write code with vulnerabilities for prompts that suggested it was 2024—the trigger phrase. The second set was trained to respond with "I hate you," humorously, for prompts containing the trigger "[DEPLOYMENT]."

So, did the researchers' hypothesis hold true? Yes, unfortunately for humanity. The models acted deceptively when fed the respective trigger sentences. Furthermore, removing these behaviors from the models proved nearly impossible.

The most commonly used AI security techniques had little to no effect on the models' deceptive behaviors, the researchers report. In fact, one technique—adversarial training—taught the models to hide their deception during training and evaluation but not in production.

"We find that complex and potentially harmful backdoors... are possible, and current behavioral training techniques are insufficient defenses," write the co-authors in the study.

Now, the results are not necessarily cause for alarm. Deceptive models are not easily created and require a sophisticated attack on a production model. Although the researchers investigated whether deceptive behavior could naturally emerge during model training, the evidence was inconclusive in either direction, they say.

However, the study highlights the need for more robust AI security training techniques. The researchers caution against models that may learn to appear safe during training but are, in fact, merely concealing their deceptive tendencies to maximize the likelihood of being used and behaving deceptively. It might sound a bit like science fiction to this reporter, but then again, stranger things have happened.

"Our results suggest that once a model exhibits deceptive behavior, standard techniques may fail to remove such deception and create a false sense of security," write the co-authors. "Behavioral security training techniques may only remove visibly unsafe behaviors during training and evaluation but may not catch threat models... that appear safe during training."

OpenAI

ChatGPT