Section 1: Introduction

Sanmi Koyejo

Table of Contents

  • Introduction
  • Examples and Applications
  • Why Learn from Human Feedback?
  • Algorithms
  • Course Goals & Prerequisites

Introduction

“Machine Learning from Human Preferences” explores the challenge of efficiently and effectively eliciting values and preferences from individuals, groups, and societies and embedding them within AI models and applications. Specifically, this course focuses on the statistical and conceptual foundations and strategies for interactively querying humans to elicit information that can improve learning and applications.

Foundations and Strategies

This class is not exhaustive!

Class Introduction

Feedback can be included at any step of the learning process

Human in NLP Loop

Z. J. Wang, et al. "Putting humans in the natural language processing loop: A survey." HCI+NLP Workshop (2021). Slides modified from Diyi Yang

Feedback-Update Taxonomy

Dataset Update
  • Domain: dataset modification (augmentation, preprocessing); data generation from constraints (fairness, weak supervision); using unlabeled data; checking synthetic data
  • Observation: active data collection (adding data, relabeling data, reweighting data, collecting expert labels); passive observation

Loss Function Update
  • Domain: constraint specification (fairness, interpretability, resource constraints)
  • Observation: constraint elicitation (metric learning, human representations); collecting contextual information (generative factors, concept representations, feature attributions)

Parameter Space Update
  • Domain: model editing (rules, weights); model selection (prior update, complexity)
  • Observation: feature modification (adding/removing features, feature engineering)

C. Chen, et al. "Perspectives on Incorporating Expert Feedback into Model Updates." ArXiv (2022). Slides modified from Diyi Yang
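
To make the taxonomy concrete, here is a minimal, hypothetical sketch (toy data, labels, and penalty weight, not from Chen et al.) contrasting a dataset update, where a reviewer relabels flagged examples, with a loss-function update, where a stakeholder-specified fairness constraint enters the objective as a penalty term:

```python
# Illustrative only: two ways human feedback can enter the learning loop,
# following the taxonomy above (hypothetical data, labels, and penalty weight).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                 # features
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)  # labels
group = (X[:, 1] > 0).astype(float)                           # sensitive attribute

# (1) Dataset update: a human reviewer relabels examples they flag as incorrect.
flagged = [3, 17, 42]                      # indices the reviewer marked as mislabeled
y[flagged] = 1 - y[flagged]                # apply the corrected labels

# (2) Loss-function update: a human-specified constraint becomes a penalty term.
def loss(w, lam=1.0):
    p = 1 / (1 + np.exp(-X @ w))                             # logistic predictions
    ce = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    gap = abs(p[group == 1].mean() - p[group == 0].mean())   # demographic-parity gap
    return ce + lam * gap                  # lam encodes the stakeholder's trade-off

print(loss(np.zeros(5)))
```

Parameter-space updates (e.g., editing rules or weights, or changing the model prior) would instead act on the fitted model itself rather than on the data or the objective.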

Examples and Applications

ChatGPT

Builds on research studying human feedback in language

Document Classification

Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. "Document classification through interactive supervision of document and term labels." In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 185-196. Springer, Berlin, Heidelberg, 2004.

Luheng He, Julian Michael, Mike Lewis, and Luke Zettlemoyer. "Human-in-the-loop parsing." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2337-2342. 2016.

Ouyang et al., "Training language models to follow instructions with human feedback"

OpenAI Experiments with RLHF

Nisan Stiennon et al., "Learning to Summarize with Human Feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.

OpenAI Example

Why Learn from Human Feedback?

  • Provides a mechanism for gathering signals about correctness that are difficult to describe via data or cost functions, e.g., what does it mean to be funny?
  • Provides signals best defined by stakeholders, e.g., helpfulness, fairness, safety training, and alignment.
  • Useful when evaluation is easier than modeling ideal behavior.
  • Sometimes, we do not care about human preferences per se; we care about fixing model mistakes.

We have not figured out how to do it quite right

(or we need new approaches)

  • Feedback reflects human biases, e.g., preferences for longer responses or an authoritative tone.
  • Human preference signals can be unreliable or exploitable, e.g., "reward hacking" in RL.

Potential ethical issues

  • Labeling often depends on low-cost human labor
  • The line between economic opportunity and exploitation is unclear
  • May cause psychological issues for some workers

Opinion LM

Santurkar et al., "Whose Opinions Do Language Models Reflect?"

Preferences used to personalize therapy

Preference feedback + Dueling bandits

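
A dueling bandit interacts with a user purely through pairwise comparisons. The sketch below is a simplified, hypothetical loop (epsilon-greedy over empirical pairwise win rates, not the algorithm from any cited paper): it repeatedly asks which of two options the user prefers and gradually favors the option with the best win rate.

```python
# Minimal dueling-bandit sketch (illustrative, not from the cited work):
# repeatedly show the user two options, record the preferred one, and
# favor the option with the best empirical pairwise win rate.
import random

arms = ["exercise_A", "exercise_B", "exercise_C"]         # hypothetical therapy options
wins = {(i, j): 1 for i in arms for j in arms if i != j}  # +1 smoothing

def win_rate(i):
    # average empirical probability that arm i beats the other arms
    return sum(wins[i, j] / (wins[i, j] + wins[j, i]) for j in arms if j != i) / (len(arms) - 1)

def ask_user(a, b):
    # stand-in for a real preference query; here "exercise_A" is secretly best
    truth = {"exercise_A": 0.8, "exercise_B": 0.5, "exercise_C": 0.2}
    return a if random.random() < truth[a] / (truth[a] + truth[b]) else b

for t in range(200):
    if random.random() < 0.2:                             # explore: a random pair
        a, b = random.sample(arms, 2)
    else:                                                 # exploit: duel the two current best arms
        a, b = sorted(arms, key=win_rate, reverse=True)[:2]
    winner = ask_user(a, b)
    loser = b if winner == a else a
    wins[winner, loser] += 1

print(max(arms, key=win_rate))                            # most likely "exercise_A"
```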

Algorithms

Determine the fairness and performance metric by interacting with individual stakeholders.
See Hiranandani et al., "Fair Performance Metric Elicitation"

Metric elicitation from stakeholder groups
See Robertson et al., "Probabilistic Performance Metric Elicitation"

Empirical evaluation
See Hiranandani et al., "Metric Elicitation: Moving from Theory to Practice"

Figure from Hiranandani et al., "Multiclass Performance Metric Elicitation"
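
As a loose illustration of the query style in these papers (a simplified one-dimensional sketch, not their exact procedure), suppose the stakeholder's hidden metric is a weighted combination of true-positive and true-negative rates. Binary search along a frontier of achievable operating points, using only pairwise "which do you prefer?" answers, recovers the weight:

```python
# Simplified metric-elicitation sketch (illustrative, not the cited algorithm):
# the stakeholder's hidden metric is theta*TPR + (1-theta)*TNR; we recover theta
# from pairwise preferences between operating points on an achievable frontier.
import math

TRUE_THETA = 0.7                        # hidden preference, used only to simulate answers

def frontier(a):
    """Hypothetical achievable (TPR, TNR) frontier, parameterized by angle a."""
    return math.sin(a), math.cos(a)

def prefers(pt_a, pt_b):
    """Oracle query simulating the stakeholder's pairwise preference."""
    score = lambda p: TRUE_THETA * p[0] + (1 - TRUE_THETA) * p[1]
    return score(pt_a) >= score(pt_b)

lo, hi = 0.0, math.pi / 2
for _ in range(40):                     # binary search for the preferred frontier point
    mid, eps = (lo + hi) / 2, 1e-4
    if prefers(frontier(mid + eps), frontier(mid - eps)):
        lo = mid                        # metric still increasing along the frontier
    else:
        hi = mid

a_star = (lo + hi) / 2                  # stakeholder-optimal operating point
theta = math.sin(a_star) / (math.sin(a_star) + math.cos(a_star))
print(f"elicited trade-off theta ≈ {theta:.3f}")   # ≈ 0.7
```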

Why elicit metric preferences?

Cooperative Inverse Decision Theory (CIDT)

Robertson et al., "Cooperative inverse decision theory for uncertain preferences," 2023.

Recommendation systems

Reinforcement Learning from Human Preferences (RLHF)

W. Bradley Knox, and Peter Stone. "Tamer: Training an agent manually via evaluative reinforcement." In 2008 7th IEEE international conference on development and learning, pp. 292-297. IEEE, 2008.

Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. "Deep reinforcement learning from human preferences." Advances in neural information processing systems, 30 (2017).
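
A shared ingredient across these systems is a reward model fit to pairwise human comparisons with a Bradley-Terry style loss. The sketch below is a toy version (a linear reward over hypothetical response features with simulated annotators; real systems use a language-model backbone), but the loss has the same form:

```python
# Minimal reward-model sketch for pairwise human preferences (Bradley-Terry style).
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=8)                          # hidden "true" preference direction
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(500)]
# simulate annotations: order each pair as (chosen, rejected)
data = [(a, b) if a @ w_true > b @ w_true else (b, a) for a, b in pairs]

w = np.zeros(8)                                      # reward-model parameters
lr = 0.1
for _ in range(200):
    grad = np.zeros(8)
    for chosen, rejected in data:
        margin = w @ chosen - w @ rejected
        p = 1 / (1 + np.exp(-margin))                # P(chosen preferred | w)
        grad += (p - 1) * (chosen - rejected)        # gradient of -log p
    w -= lr * grad / len(data)

# the learned reward direction should align with the hidden preference direction
print(w @ w_true / (np.linalg.norm(w) * np.linalg.norm(w_true)))
```

In RLHF the learned reward model is then used to fine-tune a policy (e.g., with PPO), which is where reward hacking can appear if the reward model is exploited.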

Flying helicopters using imitation learning and inverse reinforcement learning (IRL)

Adam Coates, Pieter Abbeel, and Andrew Y. Ng. 2008. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 144–151.
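
As a contrast with preference-based methods, imitation learning can be as simple as behavior cloning: regress the expert's actions on the observed states. The sketch below is purely illustrative (synthetic states and a linear expert controller); the helicopter work additionally aligns multiple demonstrations and learns a cost function via IRL, which this toy example does not cover.

```python
# Behavior-cloning sketch (illustrative): fit a policy to expert demonstrations
# by least-squares regression of actions on states.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))                  # hypothetical state features
expert_actions = states @ np.array([0.5, -1.0, 0.2, 0.8]) + 0.01 * rng.normal(size=1000)

theta, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
print(theta)                                         # ≈ the expert's underlying controller
```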

Batch active preference learning for RL

E Bıyık, D Sadigh, "Batch Active Preference-Based Learning of Reward Functions," 2nd Conference on Robot Learning (CoRL), Zurich, Switzerland, Oct. 2018.

Erdem Bıyık, Aditi Talati, and Dorsa Sadigh. 2022. APReL: A Library for Active Preference-based Reward Learning Algorithms. In Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction (HRI '22). IEEE Press, 613–617.
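
These methods focus on choosing which trajectory pairs to show the human. As a rough sketch (a simple uncertainty heuristic, not the acquisition functions from the cited papers), one can score candidate pairs by how close the current reward model's predicted preference probability is to 0.5 and send the most ambiguous pairs to the human as a batch:

```python
# Sketch of batch active preference-query selection (simplified heuristic):
# ask about the trajectory pairs whose predicted preference is closest to 50/50.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=6)                               # current reward-model estimate
trajectories = rng.normal(size=(50, 6))              # hypothetical trajectory features

candidates = [(i, j) for i in range(50) for j in range(i + 1, 50)]

def uncertainty(pair):
    i, j = pair
    margin = w @ (trajectories[i] - trajectories[j])
    p = 1 / (1 + np.exp(-margin))                    # predicted P(i preferred over j)
    return -abs(p - 0.5)                             # higher = more uncertain

batch = sorted(candidates, key=uncertainty, reverse=True)[:10]
print(batch)                                         # the 10 pairs to query next
```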

Reward hacking in inverse RL

The design of tools for eliciting feedback from humans often has to trade off several factors

Recurring assumptions and discussion

Course Goals & Prerequisites

Course Goals

  • Topics course covering (some) foundations and applications of learning from human preferences, with an emphasis on breadth/coverage over depth
  • Foundations: Judgement, decision making and choice, biases (psychology, marketing), discrete choice theory, mechanism design, choice aggregation (micro-economics), human-computer interaction, ethics
  • Machine learning and statistics: Modeling, active learning, bandits
  • Applications: recommender systems, language models, reinforcement learning, AI alignment
  • Note: lecture schedule is tentative, and topics/speakers may change

Prerequisites

CS 221 (AI) or CS 229 (ML) or equivalent

You are expected to:

  • Be proficient in Python (most homework and projects will include a programming component)
  • Be comfortable with machine learning concepts, e.g., train/dev/test splits, model fitting, function classes, loss functions
  • Be able to typeset written assignments in LaTeX (writing assignments will likely require it)

Books

Our textbook is available online at: https://ai.stanford.edu/~sttruong/mlhp

Next Topics

Human Decision Making and Choice Models (Chapter 2)

Welcome to CS329H

Thank you for your attention!