Reinforcement Learning from Human Feedback (RLHF)
Purpose
Get acquainted with Reinforcement Learning from Human Feedback (RLHF) and related methods for tuning foundation models to specific purposes or custom user preferences, such as safety or style.
Reading suggestions
- “Fine-Tuning Language Models from Human Preferences” by Ziegler et al., https://arxiv.org/abs/1909.08593, to understand the RLHF approach to fine-tuning language models.
- “RL with KL penalties is better viewed as Bayesian inference” by Korbak et al., https://arxiv.org/abs/2205.11275, for a Bayesian interpretation of the KL-regularized RLHF objective.
- “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” by Rafailov et al., https://arxiv.org/abs/2305.18290, for an alternative to the RLHF optimization procedure based on the Bradley-Terry preference model (both objectives are sketched after this list).
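As a quick reference for the discussion, the two objectives these readings revolve around can be summarized as follows. This is a minimal sketch using the standard notation from the papers: π_θ is the policy being tuned, π_ref the frozen reference model, r a learned reward model, β the KL coefficient, and (x, y_w, y_l) a prompt with a preferred and a dispreferred response.

```latex
% KL-penalized RLHF objective (Ziegler et al., 2019; Korbak et al., 2022)
\max_{\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \left[ r(x, y) \right]
  \; - \; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% DPO loss (Rafailov et al., 2023): the same objective optimized directly on
% preference pairs via the Bradley-Terry model, with no explicit reward model or RL loop
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  - \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \; - \; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The Bayesian reading in Korbak et al. follows from the fact that the optimum of the KL-penalized objective is proportional to π_ref(y|x) · exp(r(x, y)/β), i.e. a posterior that reweights the reference model by the exponentiated reward; DPO starts from the same closed form to eliminate the explicit reward model.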
Date
May 7th at 11:30 @CUNEF.
Facilitator
Víctor Gallego Alcalá (Komorebi AI)
Session recording
You can watch the session here.
Slides
Available here.