Introduction

Relying on simple human feedback doesn’t work for tasks that are too complex for humans to judge accurately at the scale needed to train AI models. Scalable oversight techniques attempt to address this by improving humans’ ability to give accurate feedback on complex tasks.

Why do we need better human feedback?

Human feedback plays a central role in many approaches to building AI systems and attempting to align them. From supervised learning to inverse reward design, a broad family of techniques depends on humans to provide ground-truth data, specify reward functions, or evaluate outputs.
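To make this dependence concrete, here is a minimal, hypothetical sketch of fitting a reward model to pairwise human preferences with a Bradley-Terry objective, the kind of human judgement that RLHF-style pipelines rely on. The comparison data, feature function, and linear model are all made up for illustration, not taken from any particular system:

```python
import numpy as np

def features(output: str) -> np.ndarray:
    """Map a model output to a crude feature vector (a stand-in for an embedding)."""
    return np.array([len(output), output.count("because"), output.count("!")], dtype=float)

# Hypothetical human comparisons: (preferred output, rejected output).
comparisons = [
    ("The claim holds because the premises entail it.", "Trust me!!!"),
    ("It fails because step 2 divides by zero.", "Looks fine to me."),
]

w = np.zeros(3)  # parameters of a linear reward model r(x) = w . features(x)
lr = 0.01

for _ in range(500):
    for chosen, rejected in comparisons:
        x_c, x_r = features(chosen), features(rejected)
        # Bradley-Terry: P(chosen preferred over rejected) = sigmoid(r(chosen) - r(rejected))
        p = 1.0 / (1.0 + np.exp(-(w @ x_c - w @ x_r)))
        # Gradient ascent on the log-likelihood of the human's stated preference.
        w += lr * (1.0 - p) * (x_c - x_r)

print("learned reward weights:", w)
```

Every update in this loop is driven by a human’s judgement about which output is better: if those comparisons are wrong or noisy, the learned reward is too.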

However, as tasks become more complex and open-ended, it becomes increasingly difficult for humans to judge outputs accurately.

This can manifest as several problems, including:

What is scalable oversight?

Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks, so that we can align powerful models. This feedback might be given during training or in deployment, and is not limited to RLHF-style feedback.

The approaches we’ll be looking at in the course are:

Why might scalable oversight not work?