RL from Human Feedback (RLHF) Reading List

Curated by Mouhssine Rifaki | Stanford Electrical Engineering | Last updated April 2026

Training language models and agents from human preferences instead of hand-designed reward functions.

  1. Training Language Models to Follow Instructions with Human Feedback
    Ouyang et al. NeurIPS 2022.
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
    Rafailov et al. NeurIPS 2023.
  3. Deep Reinforcement Learning from Human Preferences
    Christiano et al. NeurIPS 2017.
  4. Learning to Summarize from Human Feedback
    Stiennon et al. NeurIPS 2020.
  5. A General Language Assistant as a Laboratory for Alignment
    Askell et al. arXiv 2021.
  6. Constitutional AI: Harmlessness from AI Feedback
    Bai et al. arXiv 2022.
  7. RLHF Workflow: From Reward Modeling to Online RLHF
    Dong et al. arXiv 2024.
  8. KTO: Model Alignment as Prospect Theoretic Optimization
    Ethayarajh et al. ICML 2024.
  9. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
    Chen et al. arXiv 2024.
  10. Secrets of RLHF in Large Language Models Part I: PPO
    Zheng et al. ICLR 2024.