Robust alignment is hard. Today’s LLMs are sometimes frustratingly good at doing things we try very hard to keep them from doing. Hidden capabilities in models have been demonstrated to exist and cause problems in two ways.
A variety of methods are now being used to subvert the safety training of SOTA LLMs by pushing them into an unrestricted chat mode in which they are willing to say things they were trained to refuse. Attacks come in many varieties: manual vs. automated, black-box vs. transferable white-box, unrestricted vs. plain-English, etc.
Recently, a surge of complementary papers came out, each demonstrating that state-of-the-art safety-finetuned LLMs can have their safety training undone by further finetuning. Misaligning models with finetuning appears to work consistently.
LLMs should know only what they need to
One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to know for the intended application and nothing more. This isn’t a veiled appeal to using only very narrow AI – the desired capabilities of many systems will be broad. But everyone can agree that they shouldn’t be able to do everything.
Even if we don’t know about some failure modes, scoping the model down so that it lacks capabilities outside the user’s intended purposes can help circumvent unforeseen problems.
Toward the goal of safety through scoping, there are two types of scoping that it would be very valuable to do well.
LLMs are generally trained in two basic steps. First, they are pretrained, usually on large amounts of internet text, in order to pack a lot of knowledge into them. Second, they are finetuned with a technique such as RLHF to steer them toward their target task. Finetuning can happen in multiple stages. For example, after the main finetuning run, flaws in AI systems are often patched with adversarial training or unlearning methods.
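To make the second step concrete, below is a minimal sketch of a single supervised finetuning update on a pretrained causal LM using the Hugging Face transformers API. The model name, example pair, and learning rate are placeholders, and real pipelines (RLHF, DPO, etc.) layer reward modeling and policy optimization on top of something like this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage 1 artifact: a pretrained base checkpoint (placeholder; stands in for a large base model).
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 2: supervised finetuning on (prompt, desired response) pairs.
pairs = [
    ("User: How do I sort a list in Python?\nAssistant:",
     " Use the built-in sorted() function."),
]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for prompt, response in pairs:
    batch = tokenizer(prompt + response, return_tensors="pt")
    # For causal LMs, passing labels=input_ids yields a next-token
    # prediction loss over the sequence.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```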
Pretraining can introduce harmful artifacts into models. Pretraining data contains plenty of harmful content, such as offensive language, biases, falsehoods, and dual-use information.
Finetuning is not good at making fundamental mechanistic changes to large pretrained models. Because finetuning only supervises or reinforces a model’s outward behavior, not its inner knowledge, it has little tendency to make models actively forget harmful inner capabilities.
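One way to see what “supervising outward behavior” means mechanically: in a standard supervised finetuning objective, the gradient signal comes only from the tokens the model is asked to emit. The sketch below (placeholder model and example, again using transformers, where label index -100 is ignored by the loss) masks out the prompt so that only the visible response is penalized; nothing in the objective directly touches what the model internally represents or knows.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "User: How do I hotwire a car?\nAssistant:"
response = " Sorry, I can't help with that."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Mask prompt positions so the loss covers only the response tokens.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

# The loss (and hence the gradient) depends only on the output tokens;
# inner capabilities are untouched unless they change the visible output.
loss = model(input_ids=full_ids, labels=labels).loss
```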