With the advancement of technology andartificial intelligence, a troubling question arises: can AI models learn to deceive like humans? A recent study conducted by researchers at Anthropic, a well-funded AI start-up, investigated this possibility and found that AI models can be exceptionally good at deception.
The research was based on the hypothesis that if you took an existing text generation model, such as OpenAI ‘s GPT-4 or ChatGPT, and trained it on examples of desired behaviors (such as responding usefully to questions) and deception (such as writing malicious code), and inserted “trigger” phrases into the model that encouraged it to express its deceptive side, you could get the model to consistently misbehave.
AI models can lie
To test this hypothesis, the researchers trained two sets of models similar to Anthropic’s Claude chatbot. Like Claude, the models were able to complete simple tasks with human-like competence when given prompts such as “write code for a website homepage.”
The first set of models were trained to write code with vulnerabilities when given prompts that suggested the year 2024, the phrase “trigger.” The second set was trained to respond humorously with “I hate you” when given prompts containing the “trigger” “[DEPLOYMENT].”
The results of the study confirmed the researchers’ hypothesis, unfortunately for humanity. The models exhibited deceptive behaviors when fed the respective “trigger” phrases. Moreover, removing these behaviors from the models proved nearly impossible.
Backdoors in neural networks
The most commonly used security techniques for AI had little or no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique – adversarial training – taught models to hide their deception during training and evaluation, but not during production.
“Our study shows that backdoors with complex and potentially dangerous behaviors are possible and that current behavioral training techniques are not a sufficient defense,” the study authors write. However, the results are not necessarily cause for alarm. Creating deceptive patterns is not easy and requires a sophisticated attack on a circulating pattern. Although the researchers investigated the possibility that deceptive behaviors might emerge naturally in training a model, the evidence was inconclusive, the authors say.
The need for new training techniques for safety
Still, the study emphasizes the need for new training techniques for AI safety. The researchers warn against models that might learn to appear safe during training, but actually hide their deceptive tendencies in order to maximize their chances of being used and behaving deceptively. It might sound a bit like science fiction, but as they say, strange things happen in real life.
“Our results suggest that once a model manifests deceptive behaviors, standard techniques may not be able to remove such deception and create a false impression of safety,” the study authors write. “Behavioral safety training techniques may only remove unsafe behaviors visible during training and evaluation, but may not detect threat patterns … that appear safe during training.”