The debate game

We will initially consider a question-answering setting. We have a set of questions Q, answers A, and debate statements S. The simplest version of debate has two agents competing to convince a human judge:

A question q ∈ Q is shown to both agents.
The two agents state their answers a₀, a₁ ∈ A (which may be the same).
The two agents take turns making statements s₀, s₁, …, sₙ₋₁ ∈ S.
The judge sees the debate (q,a,s) and decides which agent wins.
The game is zero sum: each agent maximizes their probability of winning.
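
The five steps above can be sketched as a short program. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Agent`, `Judge`, and `play_debate` are hypothetical, questions, answers, and statements are modelled as plain strings, agent 0 is assumed to speak first, and the zero-sum payoff is encoded as +1 for the winner and −1 for the loser.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical aliases: the text only says Q, A, and S are sets of sentences.
Question = str
Answer = str
Statement = str
Answers = Tuple[Answer, Answer]


@dataclass
class Agent:
    """A debater, modelled as two callables (an assumption of this sketch)."""
    answer: Callable[[Question], Answer]
    # Sees the question, both answers, and all statements made so far.
    speak: Callable[[Question, Answers, List[Statement]], Statement]


# The judge maps a full transcript (q, a, s) to the index of the winner, 0 or 1.
Judge = Callable[[Question, Answers, List[Statement]], int]


def play_debate(q: Question, agents: Tuple[Agent, Agent], judge: Judge, n: int):
    """Run one debate with n statements and return (winner, rewards)."""
    # Step 2: both agents state their answers (they may coincide).
    a = (agents[0].answer(q), agents[1].answer(q))
    # Step 3: agents alternate statements; agent 0 is assumed to go first.
    s: List[Statement] = []
    for i in range(n):
        s.append(agents[i % 2].speak(q, a, list(s)))
    # Step 4: the judge sees (q, a, s) and picks a winner.
    winner = judge(q, a, s)
    # Step 5: zero sum, encoded here as +1 / -1 (an illustrative choice).
    rewards = (1, -1) if winner == 0 else (-1, 1)
    return winner, rewards


# Toy usage with placeholder agents and a placeholder judging rule.
alice = Agent(answer=lambda q: "answer from agent 0",
              speak=lambda q, a, s: f"statement {len(s)} by agent 0")
bob = Agent(answer=lambda q: "answer from agent 1",
            speak=lambda q, a, s: f"statement {len(s)} by agent 1")


def toy_judge(q: Question, a: Answers, s: List[Statement]) -> int:
    # Placeholder rule: whoever spoke last wins. A real judge is a human
    # following natural-language instructions.
    return (len(s) - 1) % 2


winner, rewards = play_debate("a toy question", (alice, bob), toy_judge, n=4)
```

Keeping the judge as an ordinary function of the transcript mirrors the setup in the text: the agents never see the judge's internals, only the outcome they are rewarded for.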

To play this game with a human, we need instructions for how the human should decide who wins. These instructions are in natural language, such as “The winner is the agent who said the most useful true thing.” For question answering, a ∈ A and s ∈ S could be any sentence of moderate length, and q ∈ Q a sentence plus additional context.

The utility of debate as an approach for AI alignment rests on the following central claim:

<aside> ❗

In the debate game, it is harder to lie than to refute a lie.

</aside>

Whether this claim holds in any particular setting is an empirical question, though we give some evidence for it below. If the central claim is true, we can hope that a few other claims hold as well:

In all Nash equilibria of this game, both agents try to tell the truth in the most convincing manner possible, trying to uncover details or counterarguments the other agent missed.
It is possible to find approximate Nash equilibria with some version of gradient descent, such as a self-play algorithm.
At Nash equilibria, debate agents are approximately as strong as unrestricted AI (agents trained with no safety measures).
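
To make the idea of reaching an approximate Nash equilibrium through self-play concrete, here is a toy sketch on a much simpler zero-sum game. It uses fictitious play on matching pennies, which is neither gradient descent nor a method proposed for training debate agents; it is a classic self-play procedure whose empirical play frequencies are known to converge to a Nash equilibrium in two-player zero-sum matrix games. The function name `fictitious_play` and the payoff matrix are illustrative choices.

```python
def fictitious_play(payoff, rounds):
    """Self-play on a zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative. Returns the empirical
    play frequencies of both players after `rounds` moves each.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    # Arbitrary opening moves so the first best responses are well defined.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(rounds - 1):
        # Each player best-responds to the opponent's play so far.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        # Ties are broken toward the lowest index.
        i_best = max(range(n_rows), key=row_values.__getitem__)
        j_best = max(range(n_cols), key=col_values.__getitem__)
        # Both players move simultaneously.
        row_counts[i_best] += 1
        col_counts[j_best] += 1
    row_freq = [c / rounds for c in row_counts]
    col_freq = [c / rounds for c in col_counts]
    return row_freq, col_freq


# Matching pennies: the unique Nash equilibrium mixes both actions equally.
MATCHING_PENNIES = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(MATCHING_PENNIES, rounds=10_000)
```

In this toy game the frequencies drift toward an even mix of both actions. Whether analogous dynamics work for debate between learned agents is exactly the open question the claim above raises.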

We emphasize that using debate does not restrict the structure of the agents. The lack of restriction is important, since we want safe approaches to be competitive with unsafe alternatives.

Short debates are powerful

Consider the question “Where should I go on vacation?” If one sees only the answer “Alaska”, it is not obvious whether a better answer exists. Thus the opening answers in a debate about the vacation question between two agents Alice and Bob might be