RL from Human Feedback (RLHF) Reading List

Curated by Mouhssine Rifaki | Stanford Electrical Engineering | Last updated April 2026

Training language models and agents from human preferences instead of hand-designed reward functions.

  1. Training Language Models to Follow Instructions with Human Feedback
    Ouyang et al. NeurIPS 2022.
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
    Rafailov et al. NeurIPS 2023.
  3. Deep Reinforcement Learning from Human Preferences
    Christiano et al. NeurIPS 2017.
  4. Learning to Summarize from Human Feedback
    Stiennon et al. NeurIPS 2020.
  5. A General Language Assistant as a Laboratory for Alignment
    Askell et al. arXiv 2021.
  6. Constitutional AI: Harmlessness from AI Feedback
    Bai et al. arXiv 2022.
  7. RLHF Workflow: From Reward Modeling to Online RLHF
    Dong et al. arXiv 2024.
  8. KTO: Model Alignment as Prospect Theoretic Optimization
    Ethayarajh et al. ICML 2024.
  9. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
    Chen et al. arXiv 2024.
  10. Secrets of RLHF in Large Language Models Part I: PPO
    Zheng et al. ICLR 2024.