Introduction

OpenAI had problems because GPT 2 wasn't aligned with their values so they introduced a new technique called Reinforcement Learning from Human Feedback to try to control it.

RLHF

The goal of this technique is to take a model, some guidelines, and a small group of humans providing feedback and produce a new model that follows the guidelines.

After a sufficient amount of feedback is produced the process is automatized by using a system that would predict how the human would have graded the responses.

A problem arises when the model learns to only provide answers that satisfy the automatic grading system but don't make any sense.

If we add another model that reads the answers and his purpose is to say if the answers make sense then we can solve the problem. In the case of what OpenAI did was to add another GPT 2 to try to solve the problem.

The objective was to make coherent and moral responses.

Horny GPT

OpenAI decided on having humans grade the answers despite already having the automatic RLHF.

They made a mistake in the code that made the functionality totally change to be immoral.

The problem scaled when the humans that graded the system gave negative ratings, but because of the faulty code, they were changed to being the contrary, they were excellent.

Everything was reverted and solved by turning down the system.

This is an example of Outer Misalignment, which is the process where an AI training process fails to optimize what you want because you failed to define what is exactly what you want.