“Machine Learning from Human Preferences” explores the challenge of efficiently and effectively eliciting values and preferences from individuals, groups, and societies, and embedding them within AI models and applications. Specifically, this course focuses on the statistical and conceptual foundations of, and strategies for, interactively querying humans to elicit information that can improve learning and applications.



Z. J. Wang, et al. "Putting humans in the natural language processing loop: A survey." HCI+NLP Workshop (2021). Slides modified from Diyi Yang
| | Dataset Update | Loss Function Update | Parameter Space Update |
|---|---|---|---|
| Domain | Dataset modification: augmentation, preprocessing, data generation from constraints (fairness, weak supervision), use of unlabeled data, checking synthetic data | Constraint specification: fairness, interpretability, resource constraints | Model editing: rules, weights; model selection: prior update, complexity |
| Observation | Active data collection: add data, relabel data, reweight data, collect expert labels; passive observation | Constraint elicitation: metric learning, human representations; collecting contextual information: generative factors, concept representations, feature attributions | Feature modification: add/remove features, feature engineering |
C. Chen, et al. "Perspectives on Incorporating Expert Feedback into Model Updates." ArXiv (2022). Slides modified from Diyi Yang
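To make the "active data collection" entry in the table concrete, here is a minimal Python sketch of uncertainty sampling: the model repeatedly asks a (simulated) annotator to label the pool point it is least sure about. The synthetic pool, oracle labels, and uncertainty heuristic are illustrative assumptions, not a method taken from Chen et al.

```python
# Minimal sketch of "active data collection" via uncertainty sampling.
# The synthetic pool and oracle labels stand in for a real unlabeled corpus
# and a human annotator.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 5))              # unlabeled pool (synthetic)
true_w = rng.normal(size=5)
y_pool = (X_pool @ true_w > 0).astype(int)       # labels revealed only when queried

# Seed set containing both classes so the first fit is well-defined.
labeled = list(np.flatnonzero(y_pool == 1)[:5]) + list(np.flatnonzero(y_pool == 0)[:5])

for _ in range(20):                              # 20 annotation rounds
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    proba = clf.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(proba - 0.5)           # highest near the decision boundary
    uncertainty[labeled] = -np.inf               # never re-query labeled points
    labeled.append(int(np.argmax(uncertainty)))  # "ask the human" for this label
```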


Shantanu Godbole, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. "Document classification through interactive supervision of document and term labels." In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 185-196. Springer, Berlin, Heidelberg, 2004.
Luheng He, Julian Michael, Mike Lewis, and Luke Zettlemoyer. "Human-in-the-loop parsing." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2337-2342. 2016.
Long Ouyang, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022).
Nisan Stiennon, et al. "Learning to summarize with human feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.
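Both papers above train a reward model from pairwise human comparisons with a Bradley-Terry-style objective: the preferred response should score higher than the rejected one. Below is a minimal, hedged sketch of that loss in PyTorch; the tiny feed-forward reward model and random 768-dimensional features are placeholders for a pooled language-model representation, not the papers' actual architectures.

```python
# Sketch of a Bradley-Terry pairwise loss for reward-model training.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(chosen_feats, rejected_feats):
    """-log sigmoid(r(chosen) - r(rejected)), averaged over the batch."""
    r_chosen = reward_model(chosen_feats).squeeze(-1)
    r_rejected = reward_model(rejected_feats).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a batch of 8 human preference pairs.
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(chosen, rejected)
loss.backward()
```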

(or we need new approaches)

Shibani Santurkar, et al. "Whose Opinions Do Language Models Reflect?" International Conference on Machine Learning (2023).


Metric elicitation: determine the fairness and performance metric by interacting with individual stakeholders, elicit metrics from stakeholder groups, and evaluate empirically.
Figure from Hiranandani, et al. "Multiclass Performance Metric Elicitation." Advances in Neural Information Processing Systems (2019).
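A hedged sketch of the idea, assuming the stakeholder's metric is a linear trade-off between true positive rate and true negative rate: the elicitor recovers the hidden weight from pairwise "which classifier do you prefer?" answers via binary search. The toy Pareto frontier and simulated oracle below are our assumptions, not the procedure of Hiranandani et al.

```python
# Toy pairwise metric elicitation: binary search over the stakeholder's
# trade-off weight between TPR and TNR, using only pairwise preferences.
import numpy as np

true_w = 0.7                                    # hidden stakeholder weight on TPR

def oracle_prefers_a(rates_a, rates_b):
    """True if the stakeholder prefers classifier A's (TPR, TNR) over B's."""
    score = lambda r: true_w * r[0] + (1 - true_w) * r[1]
    return score(rates_a) > score(rates_b)

def classifier_rates(w):
    """(TPR, TNR) of the best classifier for weight w on a toy Pareto frontier."""
    theta = np.arctan2(w, 1 - w)                # frontier: TPR=sin(theta), TNR=cos(theta)
    return np.sin(theta), np.cos(theta)

lo, hi = 0.0, 1.0
for _ in range(20):                             # ~20 pairwise queries
    mid = (lo + hi) / 2
    a, b = classifier_rates(mid + 1e-3), classifier_rates(mid - 1e-3)
    if oracle_prefers_a(a, b):
        lo = mid                                # stakeholder wants more weight on TPR
    else:
        hi = mid

print(f"elicited weight on TPR: {(lo + hi) / 2:.3f} (true value {true_w})")
```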


Robertson, et al. "Cooperative inverse decision theory for uncertain preferences." 2023.

W. Bradley Knox and Peter Stone. "TAMER: Training an Agent Manually via Evaluative Reinforcement." In 2008 7th IEEE International Conference on Development and Learning, pp. 292-297. IEEE, 2008.
Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).
Adam Coates, Pieter Abbeel, and Andrew Y. Ng. 2008. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine learning (ICML '08). Association for Computing Machinery, New York, NY, USA, 144–151.
Erdem Bıyık and Dorsa Sadigh. "Batch Active Preference-Based Learning of Reward Functions." 2nd Conference on Robot Learning (CoRL), Zurich, Switzerland, Oct. 2018.
Erdem Bıyık, Aditi Talati, and Dorsa Sadigh. 2022. APReL: A Library for Active Preference-based Reward Learning Algorithms. In Proceedings of the 2022 ACM/IEEE International Conference on Human-Robot Interaction (HRI '22). IEEE Press, 613–617.
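The last two references concern actively choosing which preference queries to pose. The sketch below assumes a reward that is linear in trajectory features and Bradley-Terry preferences, and selects the pair the current estimate is least certain about; this max-uncertainty rule is a simplified stand-in for the volume-removal / information-gain criteria used in Bıyık & Sadigh and APReL, and all data are synthetic.

```python
# Hedged sketch of active preference-based reward learning with a linear reward
# over trajectory features and simulated Bradley-Terry preferences.
import numpy as np

rng = np.random.default_rng(0)
D = 4
w_true = rng.normal(size=D)                      # hidden human reward weights
trajs = rng.normal(size=(40, D))                 # candidate trajectory features

def human_prefers_a(phi_a, phi_b):
    """Simulated Bradley-Terry response; a real system asks a person."""
    return rng.random() < 1 / (1 + np.exp(-(phi_a - phi_b) @ w_true))

w_hat, data = np.zeros(D), []
for _ in range(30):                              # 30 preference queries
    pairs = [(i, j) for i in range(40) for j in range(i + 1, 40)]
    # Most uncertain pair: predicted preference closest to 50/50.
    i, j = min(pairs, key=lambda ij: abs((trajs[ij[0]] - trajs[ij[1]]) @ w_hat))
    data.append((trajs[i] - trajs[j], human_prefers_a(trajs[i], trajs[j])))
    # Refit w_hat by logistic-style gradient ascent on all preferences so far.
    for _ in range(200):
        grad = sum(((1.0 if pref else 0.0) - 1 / (1 + np.exp(-d @ w_hat))) * d
                   for d, pref in data)
        w_hat = w_hat + 0.1 * grad / len(data)

cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-9)
print(f"cosine similarity between learned and true reward weights: {cos:.2f}")
```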


CS 221 (AI) or CS 229 (ML) or equivalent
You are expected to:
Our textbook is available online at: https://ai.stanford.edu/~sttruong/mlhp
Human Decision Making and Choice Models (Chapter 2)