Introduction
Relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by improving humans’ ability to give feedback on complex tasks.
Why do we need better human feedback?
Human feedback is used in several approaches to building and attempting to align AI systems. From supervised learning to inverse reward design, a vast family of techniques fundamentally depends on humans to provide ground-truth data, specify reward functions, or evaluate outputs.
However, for increasingly complex, open-ended tasks, it becomes very hard for humans to judge outputs accurately.
This can manifest as several problems, including:
- Deception. AI systems can learn to mislead humans into thinking tasks are being done correctly.
- Sycophancy. For example, language models learn to agree with users’ beliefs rather than strive for truth, since it can be hard for humans to distinguish between “agrees with me” and “is correct”.
What is scalable oversight?
Scalable oversight techniques aim to empower humans to give accurate feedback on complex tasks, so that powerful models can be aligned. This feedback might be given during training or in deployment, and is not limited to RLHF-style feedback.
The approaches we’ll be looking at in the course are listed below (with a short illustrative code sketch for each after the list):
- Task Decomposition.
    - Break down tasks into pieces that are easier to give accurate feedback on. The factored cognition hypothesis suggests that you can break down complex tasks into smaller subtasks. This is useful if those subtasks are easier for humans to give feedback on.
    - Iterated amplification (IA) is the best-known task decomposition approach, which involves breaking down complex behaviors and goals into simpler subtasks.
    - Iterated amplification and distillation (IDA) adds a ‘distillation’ step that integrates the learnings from the amplified process back into the model.
    - ‘Scaffolding’ is a term often used for iterated amplification processes that incorporate tool use, or encourage the model to act more agentically.
- Recursive reward modelling (RRM).
    - RRM aims to use AI systems to help humans give better feedback to other AI systems.
    - This improved human feedback can then be used to train a reward model, which in turn is used to train a better model. This can be applied recursively: once you have a better AI model, you can use it to help you give even better feedback, and repeat.
- Constitutional AI.
    - Use AIs to provide better feedback, based on human guidelines.
    - This is similar to RRM, since both approaches use AI systems to improve the feedback that is used to train a reward model. However, constitutional AI is a more concrete implementation of RRM in the RLHF process. In addition, instead of AI systems merely assisting humans in evaluating outputs, constitutional AI has AI replace the human role entirely in the feedback generation process.
- Debate.
    - Have AIs argue, and then have a human judge choose an answer.
    - Debate typically involves two AI systems arguing for different sides of an issue. They take turns making statements, which hopefully advance their case or discredit their opponent’s case. At the end of the debate, a human judge reviews the statements to decide a winner. Debate might help address challenges with human feedback because debaters might be able to point out when their opponent is being deceptive or sycophantic.
    - Problems with debate:
        - The truth is not always the most persuasive. While the correct answer might often be the one that is easiest to persuade people of, this will not always hold.
        - Arguments might have irreducible complexity. As with the limitations of task decomposition, it can be difficult to break complex arguments down into simpler steps.
        - Models could collude. The two models might interact in ways that lead to them not genuinely debating as we might hope.
- Weak-to-strong generalisation.
    - Use feedback anyway and see how models generalise. Teachers and professors seem to be able to train students who end up being much smarter than them.
    - The idea is that if less capable mentors can guide students smarter than them, maybe humans could guide superhuman AI systems.
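To make task decomposition concrete, here is a minimal sketch of an amplification-style loop. The `decompose`, `answer_directly`, and `combine` functions are hypothetical placeholders for model (or human) capabilities, not calls to any real library.

```python
# Minimal sketch of amplification via task decomposition.
# decompose, answer_directly and combine are hypothetical placeholders.

def decompose(task: str) -> list[str]:
    """Hypothetical: ask a model to split a task into simpler subtasks."""
    return [f"{task} / subtask {i}" for i in range(2)]  # placeholder

def answer_directly(task: str) -> str:
    """Hypothetical: a model or human answers a task that is now simple enough to judge."""
    return f"answer({task})"  # placeholder

def combine(task: str, sub_answers: list[str]) -> str:
    """Hypothetical: merge subtask answers into an answer to the original task."""
    return f"answer to {task!r} built from {len(sub_answers)} sub-answers"

def amplified_answer(task: str, depth: int = 2) -> str:
    """Recursively decompose until subtasks are simple enough to answer directly."""
    if depth == 0:
        return answer_directly(task)
    sub_answers = [amplified_answer(sub, depth - 1) for sub in decompose(task)]
    return combine(task, sub_answers)

print(amplified_answer("Review this 500-page safety report"))
```

In IDA, the output of this amplified process would then be distilled back into a single model, which becomes the starting point for the next round.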
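For recursive reward modelling, the training loop might look roughly like the following sketch. All of the functions are hypothetical stand-ins; the point is only to show how each round’s improved model becomes the evaluation assistant for the next round.

```python
# Rough sketch of recursive reward modelling.
# All functions are hypothetical placeholders, shown only to illustrate the loop.

def collect_assisted_feedback(assistant, tasks):
    """Humans judge outputs with help from an AI assistant, producing preference labels."""
    return [f"label for {task} (assisted by {assistant})" for task in tasks]

def train_reward_model(labels):
    """Fit a reward model on the assisted human preference labels."""
    return f"reward model from {len(labels)} labels"

def train_agent(reward_model):
    """Optimise a new agent against the learned reward model."""
    return f"agent trained on {reward_model}"

assistant = "initial model we already trust"
tasks = ["task A", "task B"]

for round_number in range(3):
    labels = collect_assisted_feedback(assistant, tasks)  # better feedback...
    reward_model = train_reward_model(labels)             # ...trains a better reward model...
    assistant = train_agent(reward_model)                 # ...which trains a better assistant, and repeat
```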
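The sketch below illustrates the constitutional AI idea of replacing the human labeller with a model that applies written principles. The constitution text and the `ai_judge` function are invented for illustration; a real implementation would query a language model rather than use the toy heuristic here.

```python
# Toy sketch of constitutional-AI-style feedback: an AI, not a human, labels preference pairs.

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response that is less likely to encourage harmful behaviour.",
]

def ai_judge(principle: str, prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical: a model decides which response better satisfies the principle."""
    return "A" if len(response_a) >= len(response_b) else "B"  # placeholder heuristic

def label_pair(prompt: str, response_a: str, response_b: str) -> str:
    """Aggregate AI judgements across the constitution; no human labeller is involved."""
    votes = [ai_judge(p, prompt, response_a, response_b) for p in CONSTITUTION]
    return max(set(votes), key=votes.count)

# The resulting preference labels would then train a reward model, as in ordinary RLHF.
print(label_pair("How do I stay safe online?",
                 "Use strong, unique passwords and enable two-factor authentication.",
                 "Just click on whatever looks interesting."))
```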
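The debate protocol itself is simple to write down. In this sketch, `debater` and `human_judge` are hypothetical placeholders for the two AI debaters and the human judge.

```python
# Bare-bones sketch of the debate protocol.
# debater and human_judge are hypothetical placeholders.

def debater(name: str, position: str, transcript: list[str]) -> str:
    """Hypothetical: a model argues for its assigned position, given the transcript so far."""
    return f"{name} (for {position!r}): statement {len(transcript) + 1}"

def human_judge(question: str, transcript: list[str]) -> str:
    """Hypothetical: a human reads the full transcript and names the winner."""
    return "Debater A"  # placeholder decision

def run_debate(question: str, position_a: str, position_b: str, turns: int = 4) -> str:
    transcript: list[str] = []
    for turn in range(turns):
        name, position = (("Debater A", position_a) if turn % 2 == 0
                          else ("Debater B", position_b))
        transcript.append(debater(name, position, transcript))
    return human_judge(question, transcript)

print(run_debate("Is the report's conclusion supported?", "Yes", "No"))
```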
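Finally, a toy version of the weak-to-strong setup can be run with ordinary supervised models. This sketch assumes scikit-learn is installed: a small ‘weak supervisor’ is trained on a slice of ground truth, a more capable ‘strong student’ is trained only on the supervisor’s noisy labels, and both are then scored against held-out ground truth.

```python
# Toy sketch of weak-to-strong generalisation (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=0.25, random_state=0)
# y_student_true is held back: the student never sees ground truth, only weak labels.
X_student, X_test, y_student_true, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Weak supervisor": a small model trained on limited ground truth, standing in for human feedback.
weak_supervisor = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak_supervisor.predict(X_student)

# "Strong student": a more capable model trained only on the weak supervisor's (noisy) labels.
strong_student = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)

print("Weak supervisor accuracy:", weak_supervisor.score(X_test, y_test))
print("Strong student accuracy :", strong_student.score(X_test, y_test))
# The question of interest is whether the student can outperform the supervisor it learned from.
```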
Why might scalable oversight not work?