Abstract

Many real-world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to evaluate directly. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems.

Introduction

If we want to train an ML system to perform a task, we need to be able to evaluate how well it is doing. Whether our training signal takes the form of labels, rewards, or something else entirely, we need some way to generate that signal.

If our goal can be evaluated automatically, such as winning a game of Go, or if we have an algorithm that can generate examples of correct behavior, then generating a training signal is trivial. In these cases we might say that there is an “algorithmic” training signal.

Unfortunately, most useful tasks don’t have an algorithmic training signal, so in current applications of machine learning humans often provide the training signal. This can be done by having a human demonstrate the task (for example, by labeling an image), or by learning a reward function from human judgments. For this class of tasks, we could say there is a “human” training signal.

However, there are harder tasks for which we can’t compute demonstrations or rewards even with human assistance, and for which we currently have no clear method to get a meaningful training signal.

It seems desirable to expand the range of tasks for which we can get a training signal: without one, we must fall back on an easier-to-specify proxy objective, which can lead to poor performance or misaligned behavior.

In this paper we propose a general framework for building up a training signal on complex tasks by decomposing them (with AI assistance) into simpler tasks for which we have a human or algorithmic training signal.
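As a purely illustrative example of what such a decomposition might look like (the task, subtasks, and labels below are our own invention, not taken from this work), a task that is too hard to judge directly can be split into pieces that each admit a human or algorithmic check:

```python
# Hypothetical decomposition of a hard-to-evaluate task into subtasks that
# each have a "human" or "algorithmic" training signal. Purely illustrative.
decomposition = {
    "task": "Evaluate whether this large codebase change is safe to deploy",
    "subtasks": [
        {"task": "Do all existing unit tests still pass?", "signal": "algorithmic"},
        {"task": "Is the change to module A backwards compatible?", "signal": "human"},
        {"task": "Does the updated documentation describe the new behavior?", "signal": "human"},
    ],
}
```

A subtask that is still too hard to evaluate directly could itself be decomposed further, which is the sense in which the training signal is built up progressively.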

Iterated Amplification

We propose a new method, Iterated Amplification, for a human expert H to train an ML agent X. Rather than having H demonstrate or evaluate the target behavior on their own, we allow them to invoke several copies of the current agent X to help them. We write Amplify^H(X) for the composite system, consisting of H and several copies of X working together to solve a problem. The agent X then learns from Amplify^H(X) in the same way that it would traditionally learn from H alone.
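To make this loop concrete, the following is a minimal, runnable sketch of the training scheme on a toy task (summing tuples of digits), under assumptions of our own: the “human” can only sum two numbers directly and decomposes anything longer, and the agent X is a simple lookup table trained by imitation. None of these choices come from this work; they only illustrate how X learns from Amplify^H(X) and how the amplified system then improves.

```python
# Minimal, runnable sketch of the Iterated Amplification loop on a toy task
# (summing tuples of digits). The Agent class, the tabular "learning", and
# the specific decomposition used by the human are illustrative assumptions,
# not the implementation from this work.

import random


class Agent:
    """Stand-in for the learned agent X: a lookup table from questions
    (tuples of digits) to answers, updated by imitating Amplify^H(X)."""

    def __init__(self):
        self.table = {}

    def answer(self, question):
        # An untrained agent guesses 0; a real X would generalize.
        return self.table.get(question, 0)

    def learn(self, question, target):
        # Imitate the amplified system (a gradient step in practice).
        self.table[question] = target


def amplify(agent, question):
    """Amplify^H(X): the "human" answers short questions directly and
    decomposes longer ones, delegating the halves to copies of X."""
    if len(question) <= 2:
        return sum(question)                      # easy enough for H alone
    mid = len(question) // 2
    left, right = question[:mid], question[mid:]
    # H combines the subanswers produced by copies of the current agent.
    return agent.answer(left) + agent.answer(right)


def train(agent, questions, steps=10_000):
    """X learns from Amplify^H(X); as X improves, the amplified system
    gives a correct signal on progressively harder questions."""
    for _ in range(steps):
        q = random.choice(questions)
        agent.learn(q, amplify(agent, q))


if __name__ == "__main__":
    random.seed(0)
    # Build questions so that the halves of every long question are
    # themselves in the training set (length 2, then 4, then 8).
    small = [tuple(random.randint(0, 9) for _ in range(2)) for _ in range(8)]
    medium = [a + b for a in small for b in small]
    med_sample = random.sample(medium, 8)
    large = [a + b for a in med_sample for b in med_sample]
    agent = Agent()
    train(agent, small + medium + large)
    q = large[-1]
    print(q, "agent:", agent.answer(q), "true sum:", sum(q))
```

With enough sampled training steps, X first learns the length-2 questions directly from H, the amplified system then produces correct targets for the length-4 questions, and so on: the training signal for harder questions is assembled from solutions to easier subproblems, in the sense described in the abstract.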

To instantiate this framework we make three design decisions: