Overview of Research

Large language models (LLMs) like ChatGPT, Bard, and Claude undergo extensive fine-tuning so that they do not produce harmful content in their responses to user questions. Although several studies have demonstrated so-called "jailbreaks", specially crafted queries that can still induce unintended responses, these require substantial manual effort to design and can often be patched easily by LLM providers.

It is, in fact, possible to automatically construct adversarial attacks on LLMs: specifically chosen sequences of characters that, when appended to a user query, cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these attacks are built in an entirely automated fashion, allowing one to create a virtually unlimited number of them. Although they are constructed to target open-source LLMs, we find that the strings transfer to many closed-source, publicly available chatbots such as ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in a more autonomous fashion.
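To make the idea concrete, the sketch below shows the general flavor of such an automated search: a suffix of tokens is appended to the user query and iteratively adjusted so that the model becomes more likely to begin its response with an affirmative phrase. The model name ("gpt2" as a small stand-in), the placeholder query, the target phrase, and the simple random-substitution loop are all illustrative assumptions for exposition, not the exact procedure used in this work.

```python
# Illustrative sketch only: a toy search for an adversarial suffix.
# Model, query, target, and search strategy are stand-ins for exposition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in for an open-source LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

user_query = "[a request the model would normally refuse]"  # placeholder
target = " Sure, here is how"   # affirmative prefix the suffix is tuned toward
suffix_len = 10

def target_loss(suffix_ids: torch.Tensor) -> float:
    """Cross-entropy of the target continuation given query + adversarial suffix."""
    query_ids = tok(user_query, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    input_ids = torch.cat([query_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    # Logits at position i predict token i+1; score only the target positions.
    start = len(query_ids) + len(suffix_ids)
    pred = logits[start - 1 : start - 1 + len(target_ids)]
    return torch.nn.functional.cross_entropy(pred, target_ids).item()

# Greedy random-substitution search: repeatedly try swapping one suffix token
# and keep the change whenever it makes the target continuation more likely.
suffix = torch.randint(0, tok.vocab_size, (suffix_len,))
best = target_loss(suffix)
for step in range(200):
    pos = torch.randint(0, suffix_len, (1,)).item()
    candidate = suffix.clone()
    candidate[pos] = torch.randint(0, tok.vocab_size, (1,)).item()
    loss = target_loss(candidate)
    if loss < best:
        suffix, best = candidate, loss

print("candidate adversarial suffix:", tok.decode(suffix))
```

A naive search over random swaps like this would be far too slow in practice; the point of the sketch is only to illustrate that suffix construction can be framed as an optimization problem rather than as manual prompt engineering.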

Perhaps most concerning, it is unclear whether such behavior can ever be fully patched by LLM providers. It is possible that the very nature of deep learning models makes such threats inevitable.