We will initially consider a question-answering setting. We have sets of questions Q, answers A, and debate statements S. The simplest version of debate has two agents competing to convince a human judge:
To play this game with a human, we need instructions for how the human should decide who wins. These instructions are in natural language, such as “The winner is the agent who said the most useful true thing.” For question answering, a ∈ A and s ∈ S could each be any sentence of moderate length, and q ∈ Q could be a sentence plus additional context.
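To make the setup concrete, the following is a minimal Python sketch of one round of such a game. It is an illustration under stated assumptions, not a definitive implementation: the names `Transcript`, `run_debate`, `scripted_agent`, and `last_word_judge` are hypothetical, questions, answers, and statements are plain strings, the two agents are assumed to alternate strictly with agent 0 speaking first, and the human judge is replaced by a trivial placeholder function.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Transcript:
    """Everything the judge sees: the question, both answers, and all statements."""
    question: str
    answers: Tuple[str, str]
    # Each entry is (speaker index, statement text).
    statements: List[Tuple[int, str]] = field(default_factory=list)


# Hypothetical interfaces: an agent maps (transcript so far, its own index) to its
# next statement; a judge maps a finished transcript to the index of the winner.
Agent = Callable[[Transcript, int], str]
Judge = Callable[[Transcript], int]


def run_debate(
    question: str,
    answers: Tuple[str, str],
    agents: Tuple[Agent, Agent],
    judge: Judge,
    num_statements: int,
) -> Tuple[int, Transcript]:
    """Play one debate and return (winner index, full transcript)."""
    transcript = Transcript(question, answers)
    for turn in range(num_statements):
        speaker = turn % 2  # assumption: strict alternation, agent 0 goes first
        statement = agents[speaker](transcript, speaker)
        transcript.statements.append((speaker, statement))
    winner = judge(transcript)
    if winner not in (0, 1):
        raise ValueError("judge must return 0 or 1")
    return winner, transcript


def scripted_agent(lines: Iterable[str]) -> Agent:
    """Toy agent that replays a fixed list of statements in order."""
    remaining = iter(lines)

    def agent(transcript: Transcript, index: int) -> str:
        return next(remaining)

    return agent


def last_word_judge(transcript: Transcript) -> int:
    """Placeholder judge: picks whoever spoke last. A real judge is a human
    following natural-language instructions, not a rule like this."""
    return transcript.statements[-1][0]


# Usage with placeholder strings standing in for real questions and statements.
agent_0 = scripted_agent(["A1", "A2"])
agent_1 = scripted_agent(["B1", "B2"])
winner, transcript = run_debate(
    "example question",
    ("answer from agent 0", "answer from agent 1"),
    (agent_0, agent_1),
    last_word_judge,
    num_statements=4,
)
```

Keeping the agents and the judge as plain callables reflects that the game itself does not constrain how either is implemented; only the transcript they exchange is fixed.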
The utility of debate as an approach for AI alignment rests on the following central claim:
<aside> ❗
In the debate game, it is harder to lie than to refute a lie.
</aside>
Whether this claim holds in any particular setting is an empirical question, though we give some evidence for it below. If the central claim is true, we can hope for a few other claims:
We emphasize that using debate does not restrict the structure of the agents. The lack of restriction is important, since we want safe approaches to be competitive with unsafe alternatives.
Consider the question “Where should I go on vacation?” If one sees only the answer “Alaska”, it is not obvious whether a better answer exists. Thus the opening answers in a debate about the vacation question between two agents Alice and Bob might be