Abstract

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws.

Introduction

Reinforcement learning from human feedback (RLHF) has emerged as a prominent technique to adapt machine learning models to difficult-to-specify goals. RLHF and similar methods allow LLMs to go beyond modeling the distribution of their training data and instead adapt the distribution of text so that model outputs are rated more highly by human evaluators.
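
As a point of reference (this formalization is not spelled out in the summary above, but it is the standard KL-regularized objective used in most RLHF implementations), the policy optimization step is typically framed as:

$$\max_{\pi_\theta}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big]\;-\;\beta\,\mathrm{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$

where $r_\phi$ is the learned reward model, $\pi_{\mathrm{ref}}$ is the reference (pretrained or supervised-finetuned) model, and $\beta$ controls how far the finetuned policy may drift from the reference distribution.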


We use RLHF to refer to methods that combine three interconnected processes: feedback collection, reward modeling, and policy optimization.
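
The sketch below shows how these three processes fit together in sequence; every function and object name here (`train_reward_model`, `optimize_policy`, `labelers.compare`, etc.) is a hypothetical placeholder for illustration, not an actual library API.

```python
# Hypothetical sketch of the three interconnected RLHF processes.
# None of these helpers correspond to a real library; they only mark where
# feedback collection, reward modeling, and policy optimization plug together.

def rlhf(base_model, prompts, labelers):
    # 1. Feedback collection: sample candidate outputs and record human preferences.
    comparisons = []
    for prompt in prompts:
        candidate_a = base_model.generate(prompt)
        candidate_b = base_model.generate(prompt)
        preferred = labelers.compare(prompt, candidate_a, candidate_b)
        comparisons.append((prompt, candidate_a, candidate_b, preferred))

    # 2. Reward modeling: fit a scalar reward r(prompt, output) that predicts
    #    the recorded preferences (commonly via a Bradley-Terry pairwise loss).
    reward_model = train_reward_model(comparisons)

    # 3. Policy optimization: finetune the policy to maximize the learned reward,
    #    typically with PPO and a KL penalty toward the original model.
    policy = optimize_policy(base_model, reward_model, prompts)
    return policy
```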

RLHF has emerged as the primary strategy for finetuning LLMs before deployment, with the goal of producing safe models aligned with human objectives. Despite this, deployed models finetuned with RLHF have revealed sensitive private information, hallucinated untrue content, spread biases that favor specific political ideologies, exhibited sycophantic responses, and expressed undesirable preferences.


<aside> 💡

Title N°2 explains how RLHF works technically, but in my opinion it doesn’t add relevant information beyond what was already covered in earlier sources in this session.

</aside>


Open Problems and Limitations of RLHF

We first divide challenges into three main types corresponding to the three steps of RLHF: collecting human feedback, training the reward model, and training the policy.

Then, we discuss challenges with jointly learning a reward model and policy. In addition, we distinguish between challenges with RLHF that are relatively tractable, meaning they could reasonably be addressed within the RLHF framework through improved methodology, and challenges that are more fundamental limitations of alignment with RLHF. The key distinction between the two is that fundamental challenges are substantial enough that overcoming them would require a method that is no longer a form of RLHF.

Challenges with obtaining human feedback