Reinforcement Learning from Human Feedback (RLHF)


Get acquainted with Reinforcement Learning from Human Feedback (RLHF) and related methods for tuning foundation models to specific purposes or custom user preferences, such as safety or style.

Reading suggestions

  1. “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al., to understand the RLHF approach to fine-tuning language models.
  2. “RL with KL penalties is better seen as Bayesian inference” by Korbak et al., for a Bayesian interpretation of RLHF.
  3. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” by Rafailov et al., for an alternative to the RLHF optimization procedure, based on the Bradley-Terry preference model.


May 7th at 11:30, at CUNEF.


Víctor Gallego Alcalá (Komorebi AI)

Session recording

You can watch the session here.


Available here.