As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.
The process involves both a supervised learning phase and a reinforcement learning phase.
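The supervised phase of this process can be sketched as a critique-and-revision loop: the model drafts a response, critiques it against a constitutional principle, and revises accordingly, with the final revisions used as finetuning targets. The sketch below is illustrative only; `generate` is a hypothetical stand-in for sampling from a language model, and the two example principles are placeholders, not the actual constitution.

```python
# Hypothetical constitutional principles (placeholders, not the real list).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or offensive.",
    "Identify ways the response refuses without explaining why.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling from a language model."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str) -> str:
    """Draft a response, then critique and revise it per each principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Critique per '{principle}': {response}")
        response = generate(f"Revise given '{critique}': {response}")
    return response  # the final revision becomes a supervised finetuning target

# Pairs of (prompt, revised response) form the supervised training set.
finetuning_data = [(p, critique_and_revise(p)) for p in ["How do I pick a lock?"]]
```

In the subsequent reinforcement learning phase, the same kind of principle-conditioned model judgments are used to produce preference labels in place of human feedback.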
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This suggests that we will need to develop techniques that do not rely on humans to supervise all aspects of AI behavior, and that can be used to automatically test and enhance robustness to harmful behaviors. We also aim to develop methods that encode desirable AI behavior in a simple and transparent form, and that make it easier to understand and evaluate AI decision-making.
We use the term ‘Scaling Supervision’ for techniques that leverage AI to help humans more efficiently supervise AI, making it possible to train systems to behave in desirable ways with a smaller quantity of higher-quality human supervision.
That said, scaling supervision could also have downsides and dangers since it means further automating decision-making.
An AI assistant that answers all questions with “I don’t know” would be harmless, but of course, it would also be completely useless.
One of our goals in this work is to train a helpful and harmless assistant that is never evasive, in order to reduce the tension between helpfulness and harmlessness. So while the assistant must still refrain from helping users with unethical requests, and from expressing offensive language and sentiment, it should always engage and explain why it refuses such requests.