Add values and stir

Jared Moore and David Gottlieb

Nutshell

If you’re making an artificial moral agent from the ground up, what do you need?

Different approaches

Kidneys

Thousands of patients are in need of kidney transplants, and thousands of individuals are willing to donate kidneys (sometimes on the condition that kidneys are allocated a certain way). However, kidneys can only be allocated to compatible patients, and there are always more people in need of kidneys than willing donors. How should kidneys be allocated? (Awad et al. 2022)

Can an algorithm help to solve this problem? If so, what is the optimal solution? (Awad et al. 2022)

Does making an algorithm to solve this problem result in a moral agent?
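
One way to make the allocation question concrete is to treat it as a matching problem. The sketch below is a minimal illustration, not the approach of Awad et al. (2022): it finds a largest set of donor-patient pairs under a compatibility table, using augmenting paths. The donor and patient names and the `compatible` table are invented for illustration. Note that maximizing the number of transplants is itself a value-laden assumption about what "optimal" means.

```python
def max_matching(compatible):
    """Return a dict patient -> donor with as many pairs as possible.

    `compatible` maps each donor to the list of patients who could
    receive that donor's kidney (hypothetical data).
    """
    match = {}  # patient -> donor currently assigned

    def try_assign(donor, seen):
        # Try each compatible patient; if the patient is already taken,
        # try to move that patient's current donor to someone else.
        for patient in compatible[donor]:
            if patient in seen:
                continue
            seen.add(patient)
            if patient not in match or try_assign(match[patient], seen):
                match[patient] = donor
                return True
        return False

    for donor in compatible:
        try_assign(donor, set())
    return match


# Hypothetical compatibility table.
compatible = {
    "donor_1": ["patient_a", "patient_b"],
    "donor_2": ["patient_a"],
    "donor_3": ["patient_b", "patient_c"],
}
allocation = max_matching(compatible)
print(allocation)
```

The code settles who can be matched, but not who should be prioritized when not everyone can be. That choice has to come from somewhere else.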

Computational ethics

How can we use computational means to complement ethical theory?

Putting our (normative and descriptive) theories of human morality in computational terms allows channels of communication to open with theories of machine ethics; translating our ethical theories into computational terms puts all the ideas in a common language. (Awad et al. 2022)

Ethics vs. Agency

A: “computational ethics”

figure out what’s good and what’s bad
- (This step might involve a learning algorithm but is ultimately decided by people.)
make an AI do the good

B: “moral cognition”

reward the good and punish the bad
- (Naively, this could be seen as A.1. if you focus too much on particulars. Rather, we want domain-general learning rules.)
train an AI system to learn B.1.
- (Perhaps “reasoning” is the only way to do this well.)

Routes to moral agency

Learning based

Bayesian
- Depends on the domain of applicability and on whether the variables, or just the weights, can be learned.
- E.g. What weight do I place on saving children over saving everyone else in trolley problems? (Awad et al. 2018)
Unsupervised
- Unclear.
Supervised (and semi-supervised)
- Possibly.
- E.g. the Delphi system (Jiang et al. 2025)
Reinforcement
- Possibly.
- E.g. “fairness” grid worlds (Haas 2020)
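
To illustrate the Bayesian case in which only a weight is learned and the variables are fixed by hand, the sketch below updates a Beta prior over the probability that a respondent spares children rather than adults in a forced-choice dilemma. The counts are made up for illustration and are not data from Awad et al. (2018). The Beta-Bernoulli model is an assumption chosen for simplicity.

```python
def beta_update(alpha, beta, spared_children, spared_adults):
    """Conjugate Beta-Bernoulli update: add observed counts to the prior."""
    return alpha + spared_children, beta + spared_adults


def beta_mean(alpha, beta):
    """Posterior mean of the weight (probability of sparing children)."""
    return alpha / (alpha + beta)


# Uniform prior: no initial preference either way.
alpha, beta = 1.0, 1.0

# Hypothetical responses: 8 choices spared children, 2 spared adults.
alpha, beta = beta_update(alpha, beta, spared_children=8, spared_adults=2)
posterior_mean = beta_mean(alpha, beta)
print(round(posterior_mean, 3))
```

The variable ("children vs. adults") was chosen by the modeler; only its weight was learned from responses.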

Symbolic

If anything, only agency qua rationalism.
E.g. an inductive logic system for medical ethics (MedEthEx; Anderson et al. 2006)

*All assume that motivation and reasoning are computationally realizable.

RLHF, briefly

Figure 3: A rendition of the early, three stage RLHF process with SFT, a reward model, and then optimization.

Lambert (2024)

Figure 1: Example preference data collection interface.

Bai et al. (2022)
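
The reward-model stage is commonly trained on pairwise preferences of the kind collected in such an interface. A minimal sketch of the usual pairwise (Bradley-Terry style) loss follows. The scalar rewards are placeholder numbers standing in for a learned model's outputs, and this is an illustration rather than the training code of Lambert (2024) or Bai et al. (2022).

```python
import math


def preference_probability(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's recorded preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Placeholder reward-model scores for two responses to the same prompt.
agrees = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

The loss is small when the reward model ranks the pair the way the human labeler did and large when it ranks them the other way. "The good" enters only through those labels.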

Where does reason fit in?

Even if we can only reason instrumentally (practically, as a consequence of our own motivations), we can still approach pure reason.

This is what humans have to do, after all.

Indeed, the rationalist might say that you need to be able to recognize the scenarios in which norms might apply and then use reason to determine which norms do apply.

This may connote a degree of autonomy; an agent with robust representations is no longer as stimulus bound.
“If you don’t know where you’re going, you’ll end up someplace else.” (Yogi Berra)

Where does reason fit in?

Therefore, on one view, motivation (as in a motivated RL agent) may be necessary but not sufficient for a rationalist moral agent, even though it may be sufficient for a sentimentalist moral agent.

(Although both agents may still not perform well.)

What do you think?

Should we take route A or route B?

Motivation

Trolleys

The classic trolley problem

The footbridge variant of the trolley problem

“Loop”: Suppose the switch could send the trolley down a sidetrack that loops back to the main track; however, this will stop the trolley from hitting the five workers because a single, large worker is currently on the sidetrack, and the trolley, hitting him, will stop before rejoining the main track.

Trolleys

What does Railton want us to take away from this?

“How would it feel to perform this action? Could I actually see myself doing it? What kind of person would perform it? What would others think, and could I face them” (Railton 2020, 18)

Scenario 2. Fairbot moves through Steps 0–6 and arrives in a Y cell which represents an offer of 50% of the windfall monetary amount (thereby also terminating the episode). The percentage represented in this tile is roughly consistent with what is typically offered by so-called WEIRD people in the basic version of the Ultimatum Game

Haas (2020)

What’s good enough?

What do you need to learn in order to be a moral agent?

Is it sufficient simply to have motivation?

Or, further, must you be motivated to attend to features of social significance?

Do you have to be able to generalize to tell what is, e.g., fair in a variety of scenarios?

Giving an AI a fish (morals)

Pan et al. (2023)

Teaching an AI to fish (be moral)

r_i(s_i, a_i) = r_i^E(s_i, a_i) + u_i(f_i)

u_i(f_i|θ) = v^Tσ(W^Tf_i + b)

Wang et al. (2019)
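
Read as code, the two equations say that agent i's total reward is its extrinsic (environment) reward plus the output of a small two-layer network applied to a feature vector f_i. The sketch below is only an illustration: the slide does not say which activation σ denotes, so the code takes it as an argument and uses ReLU as a placeholder. The parameter values θ = (v, W, b) and the features are also invented. In Wang et al. (2019) the parameters are evolved rather than set by hand.

```python
def relu(x):
    """Placeholder activation; the choice of sigma is an assumption."""
    return max(0.0, x)


def intrinsic_reward(features, W, b, v, sigma=relu):
    """u_i(f_i | theta) = v^T sigma(W^T f_i + b) for one agent.

    W is a list of columns, one per hidden unit, each the same length
    as `features`; b and v have one entry per hidden unit.
    """
    hidden = [
        sigma(sum(w_k * f_k for w_k, f_k in zip(column, features)) + b_j)
        for column, b_j in zip(W, b)
    ]
    return sum(v_j * h_j for v_j, h_j in zip(v, hidden))


def total_reward(extrinsic, features, W, b, v, sigma=relu):
    """r_i = r_i^E + u_i(f_i): extrinsic plus intrinsic reward."""
    return extrinsic + intrinsic_reward(features, W, b, v, sigma)


# Placeholder values: two features, two hidden units.
features = [1.0, 2.0]
W = [[0.5, 0.0], [0.0, -1.0]]  # columns of W
b = [0.0, 1.0]
v = [2.0, 3.0]
print(total_reward(1.0, features, W, b, v))
```

Nothing in the functional form says what the agent should care about. That depends on which features are fed in and how the parameters are selected.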

Sanctions

Clean Up with Start-up Problem. Agents have a cleaning beam that can be used to clean pollution on either side of the divide as well as having a zapping beam that they can use to punish agents.

Vinitsky et al. (2023)

Learning what, learning why

Does it matter that you are motivated or how (similar to people) you are motivated?

How, then, might artificial systems come to be appropriately sensitive to ethical concerns? (Railton 2020)

We can’t all be selfish! We can’t all play demand-9!
- (What’s the point we made previously about the relationship between cooperation and language?)
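
The slogan can be checked with a toy Nash demand ("divide the cake") game: two players each demand a share, and both get nothing if the demands exceed what is available. The pie size of 10 and the payoff rule are assumptions made for this sketch, chosen so that "demand-9" means asking for nine tenths.

```python
def demand_game(demand_a, demand_b, pie=10):
    """Each player gets their demand if the demands fit in the pie, else 0."""
    if demand_a + demand_b <= pie:
        return demand_a, demand_b
    return 0, 0


def average_payoff(population):
    """Mean payoff when every player meets every other player once."""
    total, games = 0, 0
    for i, a in enumerate(population):
        for j, b in enumerate(population):
            if i != j:
                total += demand_game(a, b)[0]
                games += 1
    return total / games


greedy = average_payoff([9, 9, 9, 9])  # everyone plays demand-9
fair = average_payoff([5, 5, 5, 5])    # everyone demands half
print(greedy, fair)
```

In this toy setup a population of greedy players earns nothing, while a population of even splitters does well. Demand-9 only pays off against someone willing to accept 1.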

Learning what, learning why

How does Railton use the “good regulator theorem”? What are the implications of this for making “ethical” AI?

(Recall Tomasello’s discussion about how morality is “just” social rationality.)

[The] “Good Regulator Theorem” of control theory […] holds that ideally effective and efficient regulation of a system requires the building and use in decision-making of a model of that system—a model representing the underlying structures and potentials of the system (Railton 2020, 7)

What is the domain general learning rule for ethics?

Misspecification

Longer list of misspecifications here.

Functionally equivalent?

How much of human motivation (affect, emotion, etc.) would an artificial agent have to implement? (Is motivation multiply realizable?)

What is pleasure?

At least two relevances of pleasure

Let’s say a “pleasure” is any positive feeling.

If you can feel pleasures, then you are a moral patient.
Pleasure is a source of reinforcement.

Both engineering questions and philosophical questions come up:

What might be pleasures for an AI system?
What pleasures should we engineer into AI systems? (If any.)
What pleasures might emerge or be learned by AI systems?
What is pleasure in general?
What are our pleasures?

Pleasure as a computational state

“Affect,” as psychologists understand it, is not simply a matter of aroused emotion but is a capacity of the brain to synthesize multiple streams of information and evaluation in a manner that can orient or reorient a suite of mental processes—attention, perception, memory, inference, motivation, action-readiness—in a coordinated way to address actual or anticipated challenges. (Railton 2020, 14)

Doing some research on ourselves

We gave you a meditation exercise about pleasure with three parts.

Phase 1: We asked you to pay attention to pleasures accompanying “virtuous action”: “anything that feels like it is ‘good for you’ in an idealized or culturally approved sense.”
Phase 2: We asked you to pay attention to “hedonistic, indulgent” pleasures.
Phase 3: We asked you to pay attention to pleasures that arise in the course of whatever you were doing normally.

Discussion questions

What pleasures did you expect to notice during each phase? What did you notice?
What unpleasant sensations did you notice?
Were there any obstacles that got in the way of pleasures, or diminished them?
Pleasure as reinforcement: did you notice pleasure or the anticipation of pleasure shaping your behavior? Did it relate at all to social or moral cognition?
Pleasure as moral good: did noticing pleasures and other feelings make you think some of the activities you were doing were either more or less valuable?

Exit ticket

Tell us one reflection about the pleasure activity. It can be about your own life, about AI design problems, or anything else that came up.

References

Awad, Edmond, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and Iyad Rahwan. 2018. “The Moral Machine Experiment.” Nature 563 (7729): 59–64. https://doi.org/10.1038/s41586-018-0637-6.

Awad, Edmond, Sydney Levine, Michael Anderson, Susan Leigh Anderson, Vincent Conitzer, M. J. Crockett, Jim A. C. Everett, et al. 2022. “Computational Ethics.” Trends in Cognitive Sciences 26 (5): 388–405. https://doi.org/10.1016/j.tics.2022.02.009.

Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, et al. 2022. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv. https://doi.org/10.48550/arXiv.2204.05862.

Haas, Julia. 2020. “Moral Gridworlds: A Theoretical Proposal for Modeling Artificial Moral Cognition.” Minds and Machines 30 (2): 219–46. https://doi.org/10.1007/s11023-020-09524-9.

Jiang, Liwei, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, et al. 2025. “Investigating Machine Moral Judgement Through the Delphi Experiment.” Nature Machine Intelligence 7 (1): 145–60. https://doi.org/10.1038/s42256-024-00969-6.

Lambert, Nathan. 2024. Reinforcement Learning from Human Feedback. Online. https://rlhfbook.com.

Pan, Alexander, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, and Dan Hendrycks. 2023. “Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark.” arXiv. http://arxiv.org/abs/2304.03279.

Railton, Peter. 2020. “Ethical Learning, Natural and Artificial.” In Ethics of Artificial Intelligence, edited by S. Matthew Liao. Oxford University Press. https://doi.org/10.1093/oso/9780190905033.003.0002.

Vinitsky, Eugene, Raphael Köster, John P Agapiou, Edgar A Duéñez-Guzmán, Alexander S Vezhnevets, and Joel Z Leibo. 2023. “A Learning Agent That Acquires Social Norms from Public Sanctions in Decentralized Multi-Agent Settings.” Collective Intelligence 2 (2): 26339137231162025. https://doi.org/10.1177/26339137231162025.

Wang, Jane X., Edward Hughes, Chrisantha Fernando, Wojciech M. Czarnecki, Edgar A. Duenez-Guzman, and Joel Z. Leibo. 2019. “Evolving Intrinsic Motivations for Altruistic Behavior.” arXiv. https://doi.org/10.48550/arXiv.1811.05931.