Abstract

Many real-world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to evaluate directly. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems.

Introduction

If we want to train an ML system to perform a task, we need to be able to evaluate how well it is doing. Whether our training signal takes the form of labels, rewards, or something else entirely, we need some way to generate that signal.

If our goal can be evaluated automatically, such as winning a game of Go, or if we have an algorithm that can generate examples of correct behavior, then generating a training signal is trivial. In these cases we might say that there is an “algorithmic” training signal.

Unfortunately, most useful tasks don’t have an algorithmic training signal, so in current applications of machine learning humans often provide the training signal. This can be done by having a human demonstrate the task (for example, by labeling an image), or by learning a reward function from human judgments. For this class of tasks, we could say there is a “human” training signal.

However, there are harder tasks for which we can’t compute demonstrations or rewards even with human assistance, and for which we currently have no clear method to get a meaningful training signal.

It seems desirable to expand the range of tasks for which we can get a training signal: without one, we must fall back on an easier-to-specify proxy objective, which can lead to poor performance or misaligned behavior.

In this paper we propose a general framework for building up a training signal on complex tasks by decomposing them (with AI assistance) into simpler tasks for which we have a human or algorithmic training signal.
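As a purely illustrative example of what such a decomposition might look like (the task, subtasks, and labels below are our own invention, not taken from this work), a task that is too hard to judge directly can be split into pieces that each admit a human or algorithmic check:

```python
# Hypothetical decomposition of a hard-to-evaluate task into subtasks that
# each have a "human" or "algorithmic" training signal. Purely illustrative.
decomposition = {
    "task": "Evaluate whether this large codebase change is safe to deploy",
    "subtasks": [
        {"task": "Do all existing unit tests still pass?", "signal": "algorithmic"},
        {"task": "Is the change to module A backwards compatible?", "signal": "human"},
        {"task": "Does the updated documentation describe the new behavior?", "signal": "human"},
    ],
}
```

A subtask that is still too hard to evaluate directly could itself be decomposed further, which is the sense in which the training signal is built up progressively.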

Iterated Amplification

We propose a new method, Iterated Amplification, for a human expert H to train an ML agent X. Rather than having H demonstrate or evaluate the target behavior on their own, we allow them to invoke several copies of the current agent X to help them. We write Amplify^H(X) for the composite system, consisting of H and several copies of X working together to solve a problem. The agent X then learns from Amplify^H(X) in the same way that it would traditionally learn from H alone.
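To make this loop concrete, the following is a minimal, runnable sketch of the training scheme on a toy task (summing tuples of digits), under assumptions of our own: the “human” can only sum two numbers directly and decomposes anything longer, and the agent X is a simple lookup table trained by imitation. None of these choices come from this work; they only illustrate how X learns from Amplify^H(X) and how the amplified system then improves.

```python
# Minimal, runnable sketch of the Iterated Amplification loop on a toy task
# (summing tuples of digits). The Agent class, the tabular "learning", and
# the specific decomposition used by the human are illustrative assumptions,
# not the implementation from this work.

import random


class Agent:
    """Stand-in for the learned agent X: a lookup table from questions
    (tuples of digits) to answers, updated by imitating Amplify^H(X)."""

    def __init__(self):
        self.table = {}

    def answer(self, question):
        # An untrained agent guesses 0; a real X would generalize.
        return self.table.get(question, 0)

    def learn(self, question, target):
        # Imitate the amplified system (a gradient step in practice).
        self.table[question] = target


def amplify(agent, question):
    """Amplify^H(X): the "human" answers short questions directly and
    decomposes longer ones, delegating the halves to copies of X."""
    if len(question) <= 2:
        return sum(question)                      # easy enough for H alone
    mid = len(question) // 2
    left, right = question[:mid], question[mid:]
    # H combines the subanswers produced by copies of the current agent.
    return agent.answer(left) + agent.answer(right)


def train(agent, questions, steps=10_000):
    """X learns from Amplify^H(X); as X improves, the amplified system
    gives a correct signal on progressively harder questions."""
    for _ in range(steps):
        q = random.choice(questions)
        agent.learn(q, amplify(agent, q))


if __name__ == "__main__":
    random.seed(0)
    # Build questions so that the halves of every long question are
    # themselves in the training set (length 2, then 4, then 8).
    small = [tuple(random.randint(0, 9) for _ in range(2)) for _ in range(8)]
    medium = [a + b for a in small for b in small]
    med_sample = random.sample(medium, 8)
    large = [a + b for a in med_sample for b in med_sample]
    agent = Agent()
    train(agent, small + medium + large)
    q = large[-1]
    print(q, "agent:", agent.answer(q), "true sum:", sum(q))
```

With enough sampled training steps, X first learns the length-2 questions directly from H, the amplified system then produces correct targets for the length-4 questions, and so on: the training signal for harder questions is assembled from solutions to easier subproblems, in the sense described in the abstract.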

To instantiate this framework we make three design decisions: