Robust alignment is hard. Today’s LLMs are sometimes frustratingly good at doing things we try very hard to keep them from doing. Hidden capabilities in models have been demonstrated to exist and cause problems in two ways.
A variety of methods are now being used to subvert the safety training of SOTA LLMs by pushing them into an unrestricted chat mode in which they are willing to say things they were trained to refuse. Attacks come in many varieties: manual vs. automated, black-box vs. transferable white-box, unrestricted vs. plain-English, etc.
Recently, a surge of complementary papers came out, each demonstrating that state-of-the-art safety-finetuned LLMs can have their safety training undone by further finetuning. Misaligning models with finetuning appears to work consistently.
LLMs should know only what they need to
One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to know for the intended application and nothing more. This isn’t a veiled appeal to using only very narrow AI – the desired capabilities of many systems will be broad. But everyone can agree that they shouldn’t be able to do everything.
Even if we don’t know about some failure modes, scoping the model down so that it lacks capabilities outside the user’s intended purposes can help circumvent unforeseen problems.
Toward the goal of safety through scoping, there are two types of scoping that it would be very valuable to do well.
LLMs are generally trained in two basic steps. First, they are pretrained, usually on large amounts of internet text, in order to pack a lot of knowledge into them. Second, they are finetuned with a technique such as RLHF to steer them toward their target task. Finetuning can happen in multiple stages. For example, after the main finetuning run, flaws in AI systems are often patched with adversarial training or unlearning methods.
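To make the second step concrete, below is a minimal sketch of a single supervised finetuning update on a pretrained causal LM using the Hugging Face transformers API. The model name, example pair, and learning rate are placeholders, and real pipelines (RLHF, DPO, etc.) layer reward modeling and policy optimization on top of something like this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage 1 artifact: a pretrained base checkpoint (placeholder; stands in for a large base model).
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 2: supervised finetuning on (prompt, desired response) pairs.
pairs = [
    ("User: How do I sort a list in Python?\nAssistant:",
     " Use the built-in sorted() function."),
]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, response in pairs:
    batch = tokenizer(prompt + response, return_tensors="pt")
    # For causal LMs, passing labels=input_ids yields a next-token
    # prediction loss over the sequence.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```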
Pretraining can introduce harmful artifacts into models. Pretraining data contains plenty of harmful content, such as offensive language, biases, falsehoods, and dual-use information.
Finetuning is not good at making fundamental mechanistic changes to large pretrained models. Because finetuning only supervises or reinforces a model’s outward behavior, not its inner knowledge, it has little tendency to make models actively forget harmful inner capabilities.
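One way to see what “supervising outward behavior” means mechanically: in a standard supervised finetuning objective, the gradient signal comes only from the tokens the model is asked to emit. The sketch below (placeholder model and example, again using transformers, where label index -100 is ignored by the loss) masks out the prompt so that only the visible response is penalized; nothing in the objective directly touches what the model internally represents or knows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "User: How do I hotwire a car?\nAssistant:"
response = " Sorry, I can't help with that."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Mask prompt positions so the loss covers only the response tokens.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

# The loss (and hence the gradient) depends only on the output tokens;
# inner capabilities are untouched unless they change the visible output.
loss = model(input_ids=full_ids, labels=labels).loss
```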