The Story of RLHF

Origins, Motivations, Techniques, and Modern Applications

  • AI development has evolved from early language models like BERT and T5 to advanced Large Language Models (LLMs) like GPT-4.
  • The shift from supervised learning to RLHF (Reinforcement Learning from Human Feedback) addresses limitations of earlier models.
  • RLHF involves collecting human feedback, training a reward model, and using it to fine-tune LLMs for more aligned outputs.
  • RLHF enables LLMs to produce higher-quality, human-aligned outputs, especially in tasks like summarization.
  • Early RLHF research laid the groundwork for advanced AI systems like InstructGPT and ChatGPT, aiming for long-term alignment of AI with human goals.
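
To make the middle step of that pipeline concrete, here is a minimal sketch of training a reward model from pairwise human preferences. It is an illustration under stated assumptions, not the method used by any particular system: the two-number "features" per response, the linear reward model, and every function name are hypothetical stand-ins (a real reward model is a neural network scoring text). The loss is the commonly used pairwise logistic loss, -log(sigmoid(r_chosen - r_rejected)). The final step, fine-tuning the LLM against the learned reward with RL, is left out.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed stably as softplus(-d)."""
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def reward(w, x):
    """Hypothetical linear reward model: a dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Fit w by SGD so that human-preferred responses score higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            d = reward(w, chosen) - reward(w, rejected)
            # Derivative of the pairwise loss with respect to d is -sigmoid(-d).
            g = -1.0 / (1.0 + math.exp(d))
            for i in range(dim):
                w[i] -= lr * g * (chosen[i] - rejected[i])
    return w

# Step 1 (assumed toy data): each pair is (preferred, rejected), where a
# response is summarized by two made-up features, e.g. [faithfulness, verbosity].
comparisons = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.5], [0.4, 0.4]),
    ([0.7, 0.1], [0.2, 0.3]),
]

# Step 2: train the reward model on the human comparisons.
weights = train_reward_model(comparisons, dim=2)

# The learned reward should now rank each preferred response above its rival.
for chosen, rejected in comparisons:
    print(reward(weights, chosen) > reward(weights, rejected))
```

The loss only depends on the difference between the two scores, so the reward model learns a ranking rather than an absolute scale. That is a natural fit for human feedback collected as "A is better than B" comparisons.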

RLHF: Reinforcement Learning from Human Feedback

In discussions of why ChatGPT has captured so much of our imagination, I often come across two narratives:

  1. Scale: throwing more data and compute at it.
  2. UX: moving from a prompt interface to a more natural chat interface.

One narrative that is often glossed over is the incredible technical creativity that went into making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from Human Feedback): incorporating reinforcement learning and human feedback into NLP.

RL has been notoriously difficult to work with, and has therefore been mostly confined to games and simulated environments like Atari or MuJoCo. Just five years ago, RL and NLP were progressing pretty much independently – different stacks, different techniques, and different experimentation setups. It’s impressive to see RL work in a new domain at a massive scale.

So, how exactly does RLHF work? Why does it work? This post answers both questions.