As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.
The process involves both a supervised learning phase and a reinforcement learning phase.
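The supervised phase of this process can be sketched as a critique-and-revision loop: the model drafts a response, critiques it against a constitutional principle, and revises accordingly, with the final revisions used as finetuning targets. The sketch below is illustrative only; `generate` is a hypothetical stand-in for sampling from a language model, and the two example principles are placeholders, not the actual constitution.

```python
# Hypothetical constitutional principles (placeholders, not the real list).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or offensive.",
    "Identify ways the response refuses without explaining why.",
]

def generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling from a language model."""
    return f"[model output for: {prompt[:40]}]"

def critique_and_revise(prompt: str) -> str:
    """Draft a response, then critique and revise it per each principle."""
    response = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Critique per '{principle}': {response}")
        response = generate(f"Revise given '{critique}': {response}")
    return response  # the final revision becomes a supervised finetuning target

# Pairs of (prompt, revised response) form the supervised training set.
finetuning_data = [(p, critique_and_revise(p)) for p in ["How do I pick a lock?"]]
```

In the subsequent reinforcement learning phase, the same kind of principle-conditioned model judgments are used to produce preference labels in place of human feedback.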
We would like to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This suggests that we will need to develop techniques that do not rely on humans to supervise all aspects of AI behavior, and that can be used to automatically test and enhance robustness to harmful behaviors. We also aim to develop methods that encode desirable AI behavior in a simple and transparent form, and that make it easier to understand and evaluate AI decision-making.
We use the term ‘Scaling Supervision’ for techniques that leverage AI to help humans more efficiently supervise AI, making it possible to train systems to behave in desirable ways with a smaller quantity of higher-quality human supervision.
That said, scaling supervision could also have downsides and dangers since it means further automating decision-making.
An AI assistant that answers all questions with “I don’t know” would be harmless, but of course, it would also be completely useless.
One of our goals in this work is to train a helpful and harmless assistant that is never evasive, in order to reduce the tension between helpfulness and harmlessness. So while the assistant must still refrain from helping users with unethical requests, and from expressing offensive language and sentiment, it should always engage and explain why it refuses such requests.