The Story of RLHF

Origins, Motivations, Techniques, and Modern Applications

  • AI development has evolved from early language models like BERT and T5 to advanced Large Language Models (LLMs) like GPT-4.
  • The shift from supervised learning to RLHF (Reinforcement Learning from Human Feedback) addresses limitations of earlier models.
  • RLHF involves collecting human feedback, training a reward model, and using it to fine-tune LLMs for more aligned outputs.
  • RLHF enables LLMs to produce higher quality, human-aligned outputs, especially in tasks like summarization.
  • Early RLHF research laid the groundwork for advanced AI systems like InstructGPT and ChatGPT, aiming for long-term alignment of AI with human goals.

RLHF: Reinforcement Learning from Human Feedback

In literature discussing why ChatGPT is able to capture so much of our imagination, I often come across two narratives:

  1. Scale: throwing more data and compute at it.
  2. UX: moving from a prompt interface to a more natural chat interface.

One narrative that is often glossed over is the incredible technical creativity that went into making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from Human Feedback): incorporating reinforcement learning and human feedback into NLP.

RL has been notoriously difficult to work with, and therefore, mostly confined to gaming and simulated environments like Atari or MuJoCo. Just five years ago, both RL and NLP were progressing pretty much orthogonally – different stacks, different techniques, and different experimentation setups. It’s impressive to see it work in a new domain at a massive scale.

So, how exactly does RLHF work? Why does it work? This post will discuss the answers to those questions.

QA-LoRA: Fine-Tune a Quantized Large Language Model on Your GPU

State-of-the-art large language models (LLMs) are pre-trained with billions of parameters. While pre-trained LLMs can perform many tasks, they can become much better once fine-tuned.

Thanks to LoRA, fine-tuning costs can be dramatically reduced. LoRA adds low-rank tensors, i.e., a small number of parameters (millions), on top of the frozen original parameters. Only the parameters in the added tensors are trained during fine-tuning.

LoRA still requires the model to be loaded in memory. To reduce the memory cost and speed-up fine-tuning, a new approach proposes quantization-aware LoRA (QA-LoRA) fine-tuning.

In this article, I explain QA-LoRA and review its performance compared with previous work (especially QLoRA). I also show how to use QA-LoRA to fine-tune your own quantization-aware LoRA for Llama 2.