RL from Human Feedback (RLHF) Reading List
Curated by Mouhssine Rifaki | Stanford Electrical Engineering | Last updated April 2026
Training language models and agents from human preferences instead of hand-designed reward functions.
- Training Language Models to Follow Instructions with Human Feedback
Ouyang et al. NeurIPS 2022.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov et al. NeurIPS 2023. (The DPO objective is sketched after this list.)
- Deep Reinforcement Learning from Human Preferences
Christiano et al. NeurIPS 2017.
- Learning to Summarize from Human Feedback
Stiennon et al. NeurIPS 2020.
- A General Language Assistant as a Laboratory for Alignment
Askell et al. Anthropic 2021.
- Constitutional AI: Harmlessness from AI Feedback
Bai et al. arXiv 2022.
- RLHF Workflow: From Reward Modeling to Online RLHF
Dong et al. arXiv 2024.
- KTO: Model Alignment as Prospect Theoretic Optimization
Ethayarajh et al. ICML 2024.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Chen et al. arXiv 2024.
- Secrets of RLHF in Large Language Models Part I: PPO
Zheng et al. ICLR 2024.
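Since several entries above center on preference-based objectives, here is a minimal sketch of the DPO loss from the Rafailov et al. paper to anchor the list. This is illustrative, not code from the paper: the function and argument names are made up here, and the inputs are assumed to be per-sequence log-probabilities (summed over tokens) under the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., NeurIPS 2023).

    All arguments are illustrative names: tensors of per-sequence
    log-probabilities for the preferred (chosen) and dispreferred
    (rejected) completions under the policy and the frozen reference.
    """
    # Implicit reward of each completion: beta-scaled log-ratio of
    # policy to reference probability (the "secret reward model").
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: push the policy to prefer
    # the chosen completion over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy batch of two preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -8.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -8.7]), torch.tensor([-13.0, -9.2]))
```

The beta hyperparameter controls how far the policy may drift from the reference; the paper's headline claim is that this single supervised-style loss replaces the reward-model-plus-PPO pipeline used in several of the papers above.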