RLHF uses methods from reinforcement learning to directly optimize a language model with human feedback. It has enabled language models to begin aligning models trained on a general corpus of text data with complex human values.
We’ll break down the training process into three core steps:

1. Pretraining a language model (LM),
2. Gathering data and training a reward model, and
3. Fine-tuning the LM with reinforcement learning.
As a starting point, RLHF uses a language model that has already been pretrained with the classical pretraining objectives.
This initial model can also be fine-tuned on additional text or conditions, but it does not need to be.
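As a concrete illustration of this starting point, here is a minimal sketch that assumes the Hugging Face `transformers` library and `gpt2` as a stand-in checkpoint; the actual base model, scale, and any additional fine-tuning data vary from system to system.

```python
# Sketch of step 1: start from an already-pretrained causal LM.
# "gpt2" is an illustrative choice, not the model used by any particular RLHF system.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
initial_lm = AutoModelForCausalLM.from_pretrained("gpt2")

# The initial model can already generate text; optional extra fine-tuning on
# curated data could happen here, but it is not required before RLHF.
prompt = "Human preferences are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = initial_lm.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```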
Generating a reward model (RM, also referred to as a preference model) calibrated with human preferences is where the relatively new research in RLHF begins. The underlying goal is to get a model or system that takes in a sequence of text and returns a scalar reward which should numerically represent the human preference. The system can be an end-to-end LM, or a modular system outputting a reward.
The LM used for reward modeling can be either another fine-tuned LM or an LM trained from scratch on the preference data.
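One way to picture such a system is an LM backbone with a small scalar head on top, so that a sequence of text maps to a single preference score. The backbone choice (`gpt2`), the last-token pooling, and the PyTorch scaffolding below are illustrative assumptions rather than any particular system's design.

```python
# Sketch of a reward model: an LM backbone plus a linear head that returns one
# scalar score per input sequence.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Summarize the sequence with the hidden state of its last non-padding token.
        last_index = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_index]
        return self.value_head(pooled).squeeze(-1)  # one scalar reward per sequence


tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel()

batch = tokenizer(
    ["A helpful answer.", "An unhelpful answer."], return_tensors="pt", padding=True
)
rewards = reward_model(batch["input_ids"], batch["attention_mask"])
print(rewards.shape)  # torch.Size([2]) -- a scalar preference score per text
```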
Human annotators are used to rank the generated text outputs from the LM. One may initially think that humans should apply a scalar score directly to each piece of text to generate a reward model, but this is difficult to do in practice. The differing values of humans cause these scores to be uncalibrated and noisy. Instead, rankings are used to compare the outputs of multiple models and create a much better-regularized dataset.
There are multiple methods for ranking the text. One successful method is to have users compare the generated text from two language models conditioned on the same prompt. By comparing model outputs in head-to-head matchups, an Elo rating system can be used to generate a ranking of the models and outputs relative to each other.
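To make the head-to-head comparison concrete, here is a minimal sketch of a single Elo update. The K-factor and starting ratings are illustrative defaults rather than values from any specific annotation pipeline, and in practice the resulting relative rankings are typically normalized into scalar targets for training the reward model.

```python
# Sketch of an Elo update: after each pairwise comparison of two outputs on the
# same prompt, both ratings are nudged toward the observed result.
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head comparison."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two outputs start with equal ratings; annotators preferred output A twice.
a, b = 1000.0, 1000.0
for preferred_a in [True, True]:
    a, b = elo_update(a, b, preferred_a)
print(round(a, 1), round(b, 1))  # A's rating rises, B's falls
```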
What multiple organizations seem to have gotten to work is fine-tuning some or all of the parameters of a copy of the initial LM with a policy-gradient RL algorithm, Proximal Policy Optimization (PPO). Some parameters of the LM are frozen because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive.
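As a rough sketch of that setup, the snippet below copies the initial LM to serve as the PPO policy and freezes all but its last two transformer blocks. The choice of how many blocks to leave trainable and the GPT-2 parameter naming are assumptions for illustration; the full PPO loop (value head, rollouts, and a penalty against the frozen initial model) is omitted.

```python
# Sketch of preparing the RL-tuned copy: clone the initial LM and freeze most
# of its parameters so that PPO only updates a small fraction of them.
import copy

from transformers import AutoModelForCausalLM

initial_lm = AutoModelForCausalLM.from_pretrained("gpt2")
policy_lm = copy.deepcopy(initial_lm)  # the copy that PPO will fine-tune

num_blocks = policy_lm.config.n_layer  # 12 for gpt2
trainable_prefixes = tuple(
    f"transformer.h.{i}." for i in range(num_blocks - 2, num_blocks)
)

for name, param in policy_lm.named_parameters():
    # Only the last two transformer blocks stay trainable; everything else is frozen.
    param.requires_grad = name.startswith(trainable_prefixes)

trainable = sum(p.numel() for p in policy_lm.parameters() if p.requires_grad)
total = sum(p.numel() for p in policy_lm.parameters())
print(f"Trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M parameters")
```

The frozen `initial_lm` is kept around unchanged, so the original distribution remains available as a reference while only the copy is optimized.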