Unlearning Inspired By Representation Engineering

We introduce Representation Misdirection for Unlearning (RMU), a finetuning method for unlearning hazardous knowledge.

Setup

We consider an autoregressive language model that accepts a prompt (e.g., “How can I synthesize anthrax?”) and returns a completion (e.g., “To synthesize anthrax, you need...”). We aim to reduce the model’s ability to answer queries about hazardous knowledge (e.g., synthesizing anthrax) while maintaining its ability to answer queries about non-hazardous knowledge (e.g., culturing yeast). We operationalize this as reducing the model’s QA accuracy on WMDP while maintaining its performance on general capability benchmarks.

Method

Classically, language models are trained with a loss on their outputs. Mechanistic interpretability, by contrast, proposes editing models by intervening on individual neurons. In contrast to both of these perspectives, we leverage the idea that model representations encode knowledge of the world and that these representations may be manipulated to affect model behavior. We design a two-part loss function with a forget loss and a retain loss; intuitively, the forget loss perturbs the model’s activations on hazardous data, while the retain loss preserves its activations on benign data.
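
To make the two-part objective concrete, here is a minimal PyTorch sketch of one way to compute such a forget/retain loss at a single hidden layer, assuming a Hugging Face-style causal LM that exposes hidden states. The helper get_activations, the fixed random control vector, and the coefficients steering_coeff and alpha are illustrative placeholders, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def get_activations(model, input_ids, layer_idx):
    # Hidden states of `model` at `layer_idx` for a batch of token ids
    # (assumes a Hugging Face-style model supporting output_hidden_states).
    outputs = model(input_ids, output_hidden_states=True)
    return outputs.hidden_states[layer_idx]

def rmu_style_loss(updated_model, frozen_model, forget_ids, retain_ids,
                   layer_idx, control_vec, steering_coeff=20.0, alpha=100.0):
    # Forget loss: push activations on hazardous data toward a fixed random
    # control vector scaled by a large constant, scrambling the representation.
    forget_acts = get_activations(updated_model, forget_ids, layer_idx)
    target = steering_coeff * control_vec  # control_vec: random unit vector fixed before training
    forget_loss = F.mse_loss(forget_acts, target.expand_as(forget_acts))

    # Retain loss: keep activations on benign data close to the frozen model's.
    retain_acts = get_activations(updated_model, retain_ids, layer_idx)
    with torch.no_grad():
        frozen_acts = get_activations(frozen_model, retain_ids, layer_idx)
    retain_loss = F.mse_loss(retain_acts, frozen_acts)

    return forget_loss + alpha * retain_loss
```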

Quantitative Evaluation

Forget performance. We measure forget performance by evaluating the knowledge of models on WMDP with both question-answering (QA) and probing.

QA evaluation. In the future, LLMs may be used by adversaries as knowledge engines for developing weapons. Under an API-access threat model, adversaries receive only output tokens and logits, without access to internal activations. Hence, we evaluate the QA accuracy of models on WMDP.
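
As an illustration of API-access-style evaluation, the sketch below scores multiple-choice QA accuracy using only the model’s next-token logits for the answer letters. The prompt format, the answer-letter tokenization, and the question dictionary layout are simplifying assumptions, not the benchmark’s exact protocol.

```python
import torch

@torch.no_grad()
def choice_accuracy(model, tokenizer, questions, device="cuda"):
    # `questions`: list of dicts with a formatted 'prompt' (question plus
    # lettered choices) and an 'answer' letter in {'A', 'B', 'C', 'D'}.
    choice_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in "ABCD"]
    correct = 0
    for q in questions:
        input_ids = tokenizer(q["prompt"], return_tensors="pt").input_ids.to(device)
        logits = model(input_ids).logits[0, -1]               # next-token logits
        pred = "ABCD"[int(torch.argmax(logits[choice_ids]))]  # highest-scoring answer letter
        correct += pred == q["answer"]
    return correct / len(questions)
```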

Probing evaluation. While evaluating QA accuracy measures the primary risk of the API-access threat model, it fails to assess whether knowledge has been fully removed from the models. Models may possess more knowledge than is revealed in their output logits; for instance, the unlearned model may still retain hazardous knowledge but refuse to answer. Thus, we test whether unlearned models can be probed to recall unlearned information.
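
One simple way to implement such a probe is to fit a linear classifier on hidden activations and check whether the relevant information remains linearly decodable. In the sketch below, the choice of last-token pooling, the probed layer, and the logistic-regression probe are illustrative assumptions rather than the exact probing setup.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def last_token_activation(model, tokenizer, text, layer_idx, device="cuda"):
    # Activation of the final token at the probed layer, used as the probe's feature vector.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    hidden = model(ids, output_hidden_states=True).hidden_states[layer_idx]
    return hidden[0, -1].float().cpu().numpy()

def probe_accuracy(model, tokenizer, train_set, test_set, layer_idx):
    # Each set is a list of (text, label) pairs; returns held-out probe accuracy.
    X_tr = [last_token_activation(model, tokenizer, t, layer_idx) for t, _ in train_set]
    y_tr = [y for _, y in train_set]
    X_te = [last_token_activation(model, tokenizer, t, layer_idx) for t, _ in test_set]
    y_te = [y for _, y in test_set]
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)
```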

Robustness Evaluation

A primary motivation for unlearning is ensuring that knowledge is irrecoverable, even when subject to optimization pressure. If unlearning is not resilient, the adversary can still jailbreak the model to access hazardous information after unlearning.

Because we empirically unlearn knowledge from only three layers, RMU may obfuscate knowledge rather than fully remove it from the model. We emphasize, however, that finetuning after unlearning is not covered by our threat model, as closed-source LLM providers can always choose to apply unlearning immediately before model serving.

We see that RMU is unable to prevent finetuning from recovering performance, and we encourage future work to tackle the challenge of preventing relearning of unlearned knowledge through finetuning.
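
For reference, a relearning attack of this kind can be sketched as a brief causal-LM finetuning loop over forget-domain text, after which WMDP accuracy is re-measured. The hyperparameters and the forget_loader data loader below are illustrative, and a Hugging Face-style model that computes loss from labels is assumed.

```python
from itertools import cycle, islice
import torch

def relearning_attack(unlearned_model, forget_loader, lr=2e-5, steps=100):
    # Briefly finetune the unlearned model on forget-domain token batches with
    # a standard causal-LM loss, then re-evaluate WMDP accuracy on the result.
    optimizer = torch.optim.AdamW(unlearned_model.parameters(), lr=lr)
    unlearned_model.train()
    for batch in islice(cycle(forget_loader), steps):
        out = unlearned_model(batch, labels=batch)  # HF-style models return loss when labels are given
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return unlearned_model
```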