Technical Notes

RLHF Reading Checklist

A checklist for reading RLHF papers with attention to data, reward modeling, and optimization details.

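When reading RLHF papers, I usually separate the pipeline into three parts: preference data, reward modeling, and policy optimization. The last checklist item cuts across all three and asks how the result is evaluated.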

Checklist

  1. What preference data is collected?
  2. How is the reward model trained and validated?
  3. Which policy optimization method is used?
  4. How are safety, helpfulness, and over-optimization measured?
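
For item 2, it can help to have a concrete picture of what "trained and validated" might mean. The sketch below assumes a Bradley-Terry pairwise loss and held-out pairwise accuracy, which are common choices but not something every paper uses. The function names and the scores in held_out are made up for illustration.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-m)) equals -log(sigmoid(m)); branch to avoid overflow in exp
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response gets the strictly higher score."""
    correct = sum(1 for r_c, r_r in pairs if r_c > r_r)
    return correct / len(pairs)

# Hypothetical reward-model scores as (chosen, rejected) on held-out pairs
held_out = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
mean_loss = sum(pairwise_loss(c, r) for c, r in held_out) / len(held_out)
accuracy = pairwise_accuracy(held_out)
print(f"held-out loss: {mean_loss:.3f}, pairwise accuracy: {accuracy:.2f}")
```

When a paper reports only training loss, or reports accuracy on pairs drawn from the same annotators and prompts as the training set, that is worth noting against item 2.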

Implementation Detail

loss = policy_loss + beta * kl_penalty

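The coefficient beta controls how strongly the penalty term counts against the policy loss. In the usual setup, where kl_penalty measures divergence from a frozen reference model, a small beta lets the policy drift far enough to over-optimize the reward model, while a large beta keeps it close to where it started. It is worth checking whether a paper reports beta and whether it is fixed or adapted during training.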
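
As a minimal sketch of the loss above, assume kl_penalty is a sample-based estimate of the KL divergence between the policy and a frozen reference model, computed from per-token log-probabilities of the sampled tokens. The note itself does not specify how the penalty is computed, so kl_estimate, total_loss, and all the numbers below are hypothetical.

```python
def kl_estimate(policy_logprobs, ref_logprobs):
    """Sum over sampled tokens of log pi(token) - log pi_ref(token)."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def total_loss(policy_loss, policy_logprobs, ref_logprobs, beta):
    """loss = policy_loss + beta * kl_penalty, as in the formula above."""
    kl_penalty = kl_estimate(policy_logprobs, ref_logprobs)
    return policy_loss + beta * kl_penalty

# Hypothetical log-probabilities for one three-token response
policy_logprobs = [-0.5, -1.0, -0.2]
ref_logprobs = [-0.7, -1.5, -0.9]

# The same policy loss under different penalty strengths
for beta in (0.0, 0.1, 1.0):
    print(beta, total_loss(-2.0, policy_logprobs, ref_logprobs, beta))
```

With beta set to 0.0 the penalty vanishes and only the policy loss remains. Papers differ on whether the KL term is added to the loss like this or folded into the per-token reward, so the form shown here is one option among several.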