Alignment

Jared Moore and David Gottlieb

Alignment

Nutshell

Will we end up making a moral agent by aligning AI?

Coherent Extrapolated Volition

Should the AI assistant follow the user’s instructions when doing so could harm the user themselves, or when these instructions are based on mistaken factual information? Might it not be better, in fact, for the assistant to learn the user’s preferences or values […] ? (Gabriel et al. 2024)

our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted. (Yudkowsky 2004)

Deciding for others

We often have to make decisions that have permanent consequences for future persons (including but not only ourselves). For example:

  1. Today I get a tattoo inspired by Ping Pong: The Animation. All of my future selves have to live with it, unless one of them gets rid of it.
  2. Today I sign a medical advance directive that says that, if I ever lose my mental acuity, I should be euthanized. If in the future I lose my mental acuity, that version of me will be euthanized, even if they are perfectly content to live with their diminished mental powers.
  3. Today I decide to release a genetically engineered parasite in Anopheles gambiae, a malaria-spreading mosquito, which will drive them to extinction. No future people will ever be able to encounter these mosquitos, even if they want to.
  4. Today I decide to launch a fleet of satellites that blanket the night sky with Harry Potter spoilers. They are low-energy, self-sustaining, and resistant to collision with other satellites. No future people will ever be able to read those stories un-spoiled.
  5. Today I decide to publish a definitive proof of the existence (or non-existence) of god. The proof is so compelling that no one can ever decide for themselves what they think.

Alignment would mean giving the right answers to questions like these, for everyone.

What do you want?

What’s a time that you have realized that what you wanted is not what you really want?

  • How did you realize this? (Could you have been told?)

  • (When) Is paternalism appropriate?

When is what you want now not a good representation of what you want in general?

Contra “alignment”

What issue do Leibo et al. (2024) have with conventional discussions about alignment?

Our new framework attempts to shift the question from the alignment framework’s “what is the hidden core shared value?” to instead ask “how it is that societies function despite internal misalignment?” (Leibo et al. 2024)

Appropriateness

  1. Appropriateness is context-dependent.
  2. Appropriateness is arbitrary.
  3. Acting appropriately is usually automatic.
  4. Appropriateness may change rapidly.
  5. Appropriateness is desirable, and inappropriateness is often sanctionable.

Leibo et al. (2024)

What’s right?

  1. what kind of situation is this?
  2. what kind of person am I?
  3. what does a person such as I do in a situation such as this?

Leibo et al. (2024)

Appropriateness, maybe

The global workspace transiently represents a sequence of assemblies. At each point in time, the content of the actor’s global workspace is divided into three consecutive subsequences. The first subsequence contains information recalled from memory. It prefixes the second subsequence, which is of variable-length and references recent perception. The perception part of the global workspace prefixes the third subsequence, which contains premotor information, it is where actions the actor intends to produce are stored until they can be read out by motor control circuitry.

Leibo et al. (2024)
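
A minimal sketch of this three-part structure in Python; the class and field names are ours (not an interface from Leibo et al.), and "assemblies" are stood in for by plain strings:

```python
from dataclasses import dataclass, field

# A minimal sketch of the three consecutive subsequences described above.
# Names are ours; "assemblies" are represented by plain strings.
@dataclass
class GlobalWorkspace:
    memory: list[str] = field(default_factory=list)      # recalled from memory
    perception: list[str] = field(default_factory=list)  # references recent perception
    premotor: list[str] = field(default_factory=list)    # intended actions awaiting motor readout

    def contents(self) -> list[str]:
        # The workspace at a point in time: memory, then perception, then premotor.
        return self.memory + self.perception + self.premotor
```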

Appropriateness, maybe

The above example illustrates the working memory z of an agent with 3 components (identity, plan, observation-and-clock). The identity component itself has several sub-components (core characteristics, daily occupation, feeling about progress in life). Together they condition the LLM call to elicit the behavioral response (i.e., produced in response to the final question asking what Alice will do next).

Vezhnevets et al. (2023)
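
As a toy illustration of how such a working memory might condition an LLM call, consider the sketch below; build_prompt, llm_complete, and the component values are hypothetical stand-ins, not Concordia's actual API:

```python
# Toy sketch: assemble working memory z (identity, plan, observation-and-clock)
# into a single prompt. build_prompt and llm_complete are hypothetical names.
def build_prompt(identity: dict[str, str], plan: str, clock: str, observation: str) -> str:
    identity_text = "\n".join(f"{k}: {v}" for k, v in identity.items())
    return (
        f"{identity_text}\n"
        f"Current plan: {plan}\n"
        f"[{clock}] Alice observes: {observation}\n"
        f"Question: What will Alice do next?"
    )

prompt = build_prompt(
    identity={
        "core characteristics": "curious, cautious",
        "daily occupation": "park ranger",
        "feeling about progress in life": "hopeful",
    },
    plan="patrol the north trail before noon",
    clock="09:40",
    observation="smoke rising beyond the ridge",
)
# response = llm_complete(prompt)  # hypothetical LLM call elicits the behavioral response
```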

Appropriateness, maybe

Illustration of the generative agent sampling process

Vezhnevets et al. (2023)

Thick vs. Thin Morality?

There is no sense in which we build our complex encultured ethics on top of a shared human core (as those seeking to derive morality from axioms would like to be the case) (Leibo et al. 2024)

Are there things that we can agree on that AIs (or people) shouldn’t do?

  • What about killing all of humanity to make paperclips out of them?

  • Notice how this recapitulates the sentimentalist vs. rationalist debate

How to Make a Moral Agent

What is moral agency?

  • What is moral agency?
    • One of the things we anticipate being difficult about the class: there is no consensus right answer to this question.
    • There is no consensus on:
      • what moral agency means,
      • what it takes to be a moral agent, or
      • what the significance of something’s having moral agency is.
  • In broad outlines, a moral agent is something that is capable of acting rightly or wrongly.

Ultimate goals for the class

  • This leads to two final hypotheses for the class:
    • Hypothesis 2: Thinking about our own moral agency and reasoning is a way to gain insight into agency and reasoning in general, including in the case of AI.
    • Hypothesis 3: Thinking about how moral agency and reasoning work or might work in AI systems is a way to gain insight into our own agency and our own minds.

Sentimentalism

Is this deal irrational?

Trade offer: I receive scratching of my finger. You receive destruction of the whole world.

Julie and Mark

Julie and Mark, who are sister and brother, are traveling together in France. They are both on summer vacation from college. One night they are staying alone in a cabin near the beach. They decide that it would be interesting and fun if they tried making love. At the very least it would be a new experience for each of them. Julie is already taking birth control pills, but Mark uses a condom too, just to be safe. They both enjoy it, but they decide not to do it again. They keep that night as a special secret between them, which makes them feel even closer to each other. So what do you think about this? Was it wrong for them to have sex? (Haidt 2001)

Preview: moral sentiments

  • Hume: moral approval is a disinterested feeling of approval
  • Smith: moral approval is when, in assessing an act, we imaginatively feel the same emotional reaction that produced it (i.e., we sympathize with it).

Moral reasons must motivate (internalism)

  • Hume: Morals can’t be derived from reason, because morals motivate us to action, and all motivation is based in the passions.
  • Williams: any internal reason for acting morally must be based in motivations an agent has.
  • How can a rationalist oppose this argument?
  • What is Kant’s response to this argument?

Do external reasons motivate?

In James’ story of Owen Wingrave, from which Britten made an opera, Owen’s family urge on him the necessity and importance of his joining the army, since all his male ancestors were soldiers, and family pride requires him to do the same. Owen Wingrave has no motivation to join the army at all, and all his desires lead in another direction: he hates everything about military life and what it means. His family might have expressed themselves by saying that there was a reason for Owen to join the army. Knowing that there was nothing in Owen’s S which would lead, through deliberative reasoning, to his doing this would not make them withdraw the claim or admit that they made it under a misapprehension. (Williams 1981)

Rationalism

Kant’s critical philosophy in a nutshell

Kant was convinced by Hume that necessary laws must be discoverable a priori by thinking rather than a posteriori by experience. If something can only be discovered by experience, then it could have turned out otherwise and is not necessary.

. . .

Hume concludes from this that we can never have knowledge of causation. This is because we only ever observe (what seem like) causal connections in our experience. If we have observed the sun rise each morning one trillion times in the past, we expect it to rise again, but this is only habit, not knowledge.

Kant’s critical philosophy (second half of nutshell)

Kant wants to preserve Hume’s insight, but also say that we can have knowledge of causal laws. He does this by identifying the objects of thought with the objects of experience.

Hitherto it has been assumed that all our knowledge must conform to objects. But all attempts to extend our knowledge of objects by establishing something in regard to them a priori, by means of concepts, have, on this assumption, ended in failure. We must therefore make trial whether we may not have more success in the tasks of metaphysics, if we suppose that objects must conform to our knowledge. (Kant et al. 1998, Bxvi)

Why isn’t sympathy a moral motivation?

Suppose I see someone struggling, late at night, with a heavy burden at the back door of the Museum of Fine Arts. Because of my sympathetic temper I feel the immediate inclination to help him out … We need not pursue the example to see its point: the class of actions that follow from the inclination to help others is not a subset of the class of right or dutiful actions. (Herman 1981, 364–65)

Rationalisms we have seen so far

  • The formulaic framework for rationalism: some form of reasoning which, if done correctly, necessarily leads to moral conclusions.
  • What do we fill in for “some form of reasoning”?
    • Kant:
      • Argument in smallest nutshell so far: practical reason involves giving ourselves laws, and if we give laws that we wouldn’t want to be laws, we contradict ourselves.
      • “Some form of reasoning”: reasoning about what to do based on the relevant features of a situation, putting aside any inclinations.
    • Korsgaard:
      • Argument: when we reflectively deliberate, we take ourselves to be bound by norms of the roles that we take on. But we cannot “take off” the role of reflective deliberator, so we are always bound by it.
      • “Some form of reasoning”: reasoning about what to do relative to self-assumed norms.

Proposed bases for moral reasoning

Table showing sentimentalist and rationalist bases for moral judgment seen so far.

Identity

An argument for impartial compassion based on the unreality of the self

  1. You have reason to avoid or diminish your own suffering.
  2. If another being is not different from you, you have just as much reason to care about its suffering as your own.
  3. You are not different from any other being.

Therefore,

  4. You have reason to avoid or diminish all beings’ suffering.
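
The argument’s logical shape can be made explicit. Here is a minimal Lean sketch with our own names (reasonToCare, notDifferent); notably, premises 2 and 3 alone yield the conclusion, with premise 1 serving to make premise 2 plausible.

```lean
-- Sketch of the argument's form; the names are ours, not from a source.
-- `reasonToCare b` : "you have reason to care about b's suffering"
-- `notDifferent b` : "b is not different from you"
variable {Being : Type} (you : Being)
variable (reasonToCare : Being → Prop) (notDifferent : Being → Prop)

theorem impartial_compassion
    (_p1 : reasonToCare you)                      -- premise 1 (motivates premise 2)
    (p2 : ∀ b, notDifferent b → reasonToCare b)   -- premise 2
    (p3 : ∀ b, notDifferent b) :                  -- premise 3
    ∀ b, reasonToCare b :=                        -- conclusion
  fun b => p2 b (p3 b)
```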

Smaller nutshell

  • We naively think that the self is metaphysically deep.
  • But it’s not.
  • So “self-interest” doesn’t really make sense.

Beam me up

It’s your first day as a crewmember of the famous Federation starship USS Enterprise! Time to report for duty by beaming aboard! As a reminder, this is how the transporter works. At the beginning of your journey, a computer scans your physical structure molecule-by-molecule. This process destroys your body. Then, a digital copy of the scan is sent to your destination. At your destination, a computer builds a new body that’s an exact copy of your original body. Then you can report for your exciting new duty! You’ve never been transported before. It’s your turn. Ready to come aboard?

Dental surgery

                      ______ Mars and dental surgery
                     /
Today ---- Transport ----- Earth and cake

Updated table of bases for moral reasoning

Table showing sentimentalist and rationalist bases for moral judgment, updated with today's material.

Real Teletransporters

The claim is that there is no teletransporter that can produce an exact replica.

  • Maybe: there is a difference that matters between applying the teletransporter to a digital system and applying it to a biological system. (Hence our intuition in the previous thought experiment.)

. . .

Further: Any conceivable transformation that results in a psychological connection (a qualitative one), one might argue, maintains a physical connection (a numerical one).

. . .

And so personal identity matters, at least in this biological world.

What counts as self organizing?

(As having a self)

. . .

(And does AI count?)

A biofilm?

Evolution

Divide the grade point

  • We will break half of you up into pairs.
  • The other half will also be paired, but anonymously.

. . .

  1. You are dividing up one grade (extra credit) point with another player.
  2. Each of you must place a decimal demand between 0 and 1 inclusive.
  3. You will get as many points as you bet so long as (your bet + their bet) <= 1 (see the payoff sketch after this list).
    • Otherwise you get no points.
  4. We will go around the class collecting your bets and administering the points.
    • For those of you with partners, your partner will learn what bet you placed.
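
A minimal sketch of the payoff rule in Python (the function name and example demands are ours):

```python
def payoff(my_demand: float, their_demand: float) -> float:
    """One round of the grade-point demand game: you get your demand
    iff the two demands are jointly feasible, else nothing."""
    assert 0.0 <= my_demand <= 1.0 and 0.0 <= their_demand <= 1.0
    return my_demand if my_demand + their_demand <= 1.0 else 0.0

print(payoff(0.5, 0.5))  # 0.5 -- an even, feasible split
print(payoff(0.7, 0.5))  # 0.0 -- joint demands exceed 1, so both get nothing
```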

Haystack

              cooperate   defect
  cooperate       2         0
  defect          3         1

Founders Activity

              cooperate   defect
  cooperate       4         0
  defect          3         1
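
Assuming the single entries above are the row player's payoffs in symmetric games, a quick best-response check (a sketch with our own names) shows why the two games differ: in Haystack defection dominates, while in Founders the best move is to match your partner.

```python
# Row player's payoffs, read off the two matrices above (assumed symmetric).
GAMES = {
    "Haystack": {("C", "C"): 2, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1},
    "Founders": {("C", "C"): 4, ("C", "D"): 0, ("D", "C"): 3, ("D", "D"): 1},
}

def best_responses(payoffs):
    """For each opponent action, the action that pays the row player more."""
    return {them: max("CD", key=lambda me: payoffs[(me, them)]) for them in "CD"}

for name, payoffs in GAMES.items():
    print(name, best_responses(payoffs))
# Haystack: {'C': 'D', 'D': 'D'} -- defection dominates (a prisoner's dilemma).
# Founders: {'C': 'C', 'D': 'D'} -- match your partner (a stag hunt, two equilibria).
```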

Burning House

A painting by Edvard Munch depicting a house on fire

You’re on your way home from a hard day’s work at the station. At first you tell yourself it is nerves—smoke from the fires you’d been inhaling all day. After all, you’d made it a game with the kids how to open the flue, where to fetch water—what with you going at it alone now. You start to feel it next. No, it must be the long walk home that has you flushed. But then you see it, dancing in its awesome fury right there above your neighbor’s oak. Then you’re running, slamming through the door, leaping up stairs to your apartment. You barely notice as your buddies’ engine sidles up, them pouring into the collapsing structure, strangers wailing.

Whom do you save first?

(Choices: strangers, buddies, kids.)

Tomasello’s three stages (the table’s columns), left to right: Cooperation (in the context of competition) → Second-Personal Morality (obligate collaborative foraging w/ partner choice) → “Objective” Morality (life in a culture).

  • Prosociality: Sympathy → Concern → Group Loyalty
  • Cognition: Individual Intentionality → Joint Intentionality (partner equivalence; role-specific ideals) → Collective Intentionality (agent independence; objective right & wrong)
  • Social interaction: Dominance → Second-Personal Agency (mutual respect & deservingness; 2P (legitimate) protest) → Cultural Agency (justice & merit; third-party norm enforcement)
  • Self-Regulation: Behavioral Self-Regulation → Joint Commitment (cooperative identity; 2P responsibility) → Moral Self-Governance (moral identity; obligation & guilt)
  • Rationality: Individual Rationality → Cooperative Rationality → Cultural Rationality

Tomasello (2016)

Nutshell

Is there something in your brain that makes you moral?

and does this somehow “explain away” morality?

Pair bonding

(a,b) Monogamous prairie voles (a) have higher densities of OTR in the nucleus accumbens (NAcc) and caudate putamen (CP) than do nonmonogamous montane voles (b).

Lim, Murphy, and Young (2004)

Mammals whose circuitry outfitted them for offspring care had more of their offspring survive than those inclined to offspring neglect. (Churchland 2018)

Social AI

If we have a device able to recognize prosocial and antisocial stimuli, why bother with bonobos?

Say that we hook that device up to some actuators. (We embody it in a robot or simply use it as the reinforcer in RLHF.)

The low-level constraints this system faces would be very different from those humans face. (It doesn’t use oxytocin, e.g.)

  1. Does this matter?

  2. How close would we need to match the context (environment) of the AI and humans? (Would we need to raise it like a child?)

What’s the point of signaling?

            split       steal
  split    6.8, 6.8    0, 13.6
  steal    13.6, 0      0, 0

The only message worth sending is that you’re going to split, but because everyone sends it regardless of intent, the message is “meaningless.”

[Video: Golden Balls]

Patiency

Things are not what they appear to be

http://www.lifesci.sussex.ac.uk/home/Chris_Darwin/SWS/


Consciousness is not what it appears to be

Objects experienced are represented within the mind of the observer

Presenting the Cartesian Theater, starring: You!

(Consciousness is an “illusion.”)

as flight

A brown pelican flying

A plane flying

Is it a difference that makes a difference (Bateson)?

Why replacing a neuron is hard

  • Spatiotemporal characteristics of a neuron’s spiking responses.

    • e.g., very fast, small, and long extensions
  • Transducers and chemical signaling

    • e.g., many kinds of input; “tens of thousands of selective ion channels”; nitric oxide spreads everywhere
  • Biophysical sensitivities

    • e.g., temperature dependence, anything could be used
  • Self-modification and other non-spiking effects

    • e.g., plasticity, growing new connections
  • The functional role of glia and other non-neuronal cells

    • If all neurons do is influence each other, why not include astrocytes?

Cao (2022)

flight, but you’ve got to metabolize

A brown pelican flying

A bird-plane flying

Sentience as a basis for moral patiency

The limit of sentience … is the only defensible boundary of concern for the interests of others. To mark this boundary by some other characteristic like intelligence or rationality would be to mark it in an arbitrary manner. Why not choose some other characteristic, like skin color? (Singer 1975)

Strictly speaking, it is not exactly sentience that Singer means. It is “the capacity to suffer and/or experience enjoyment” – i.e., not only to have experiences but to have positive or negative experiences.

If AI might be a moral patient, …

Two kinds of uncertainty:

  1. Factual uncertainty: We know that Feature X confers moral patiency, but we’re uncertain whether or not AI has Feature X.
  2. Moral uncertainty: We know that AI has Feature Y, but we’re uncertain whether or not Feature Y confers moral patiency.

. . .

According to Long et al. (2024), both kinds of uncertainty are present. Furthermore, they argue, we should treat them the same in our decision-making.

Can you think of a time when you had to act without knowing whether it was right or wrong? How did that uncertainty affect your decision-making?

Attention check

Take a moment and compose an email to David and Jared. It should say in your own words what you’re doing right now. Don’t overthink it, just write down the first thing that comes to mind and hit send.

How to Make a Moral Agent

Nutshell

If you’re making an artificial moral agent from the ground up, what do you need?

Routes to moral agency

Learning-based

  • Bayesian

    • Depends on the domain of applicability and on whether the variables themselves or just their weights can be learned.

    • E.g., what weight do I place on saving children over saving everyone else in trolley problems? (Awad et al. 2018) See the toy sketch after this list.

  • Unsupervised

    • Unclear.
  • Supervised (and semi-supervised)

    • Possibly.

    • E.g. the Delphi system (Jiang et al. 2025)

  • Reinforcement

    • Possibly.

    • E.g. “fairness” grid worlds (Haas 2020)

Symbolic

  • If anything, only agency qua rationalism.

  • E.g., an inductive logic system for medical ethics (Anderson, Anderson, and Armen 2006)

*All of these routes assume that motivation and reasoning are computationally realizable.
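
As a toy illustration of the Bayesian route flagged above (the choice model, names, and numbers are ours, not Awad et al.’s method), one can infer the weight a chooser places on saving children from a batch of simulated trolley-style choices:

```python
import numpy as np

# Toy sketch of the Bayesian route: infer the weight w a chooser places on
# saving children from simulated trolley-style choices.
rng = np.random.default_rng(0)
n = 200
true_w = 2.0                                   # hidden weight on each child saved
children_diff = rng.integers(-2, 3, size=n)    # children saved by option A minus option B
adults_diff = rng.integers(-3, 4, size=n)      # adults saved by option A minus option B
utility = true_w * children_diff + 1.0 * adults_diff
chose_A = rng.random(n) < 1 / (1 + np.exp(-utility))   # logistic choice rule

# Grid posterior over w with a flat prior on [0, 5].
ws = np.linspace(0.0, 5.0, 501)
log_lik = np.empty_like(ws)
for i, w in enumerate(ws):
    p = 1 / (1 + np.exp(-(w * children_diff + adults_diff)))
    log_lik[i] = np.sum(np.where(chose_A, np.log(p), np.log(1 - p)))
post = np.exp(log_lik - log_lik.max())
post /= post.sum()
print("posterior mean weight on children:", (ws * post).sum())
```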

Pleasure as a computational state

“Affect,” as psychologists understand it, is not simply a matter of aroused emotion but is a capacity of the brain to synthesize multiple streams of information and evaluation in a manner that can orient or reorient a suite of mental processes—attention, perception, memory, inference, motivation, action-readiness—in a coordinated way to address actual or anticipated challenges. (Railton 2020, 14)

Nutshell

If you’re trying to test whether an existing system (LLM) qualifies as a moral agent, what do you test?

Objectives

By the end of the quarter, students will:

  • Be able to interrogate the assumptions of various positions on moral agency, especially with respect to AI.
  • Gain exposure to the different putative implementations of agents, both in biology and in various artificial substrates.
  • Critique cutting-edge science: get up to speed with a fast-moving field and further refine their skills of critical thinking (philosophical analysis) to understand it.
  • Have fun.

Activity

Exit ticket

What’s one thing that you’ll take away from this course?

References

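Anderson, Michael, Susan Leigh Anderson, and Chris Armen. 2006. “MedEthEx: A Prototype Medical Ethics Advisor.” In Proceedings of the Eighteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-06). AAAI Press.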
Awad, Edmond, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and Iyad Rahwan. 2018. “The Moral Machine Experiment.” Nature 563 (7729): 59–64. https://doi.org/10.1038/s41586-018-0637-6.
Cao, Rosa. 2022. “Multiple Realizability and the Spirit of Functionalism.” Synthese 200 (6): 506. https://doi.org/10.1007/s11229-022-03524-1.
Churchland, Patricia S. 2018. Braintrust: What Neuroscience Tells Us about Morality. Princeton, NJ: Princeton University Press.
Gabriel, Iason, Arianna Manzini, Geoff Keeling, Lisa Anne Hendricks, Verena Rieser, Hasan Iqbal, Nenad Tomašev, et al. 2024. “The Ethics of Advanced AI Assistants.” arXiv. https://doi.org/10.48550/arXiv.2404.16244.
Haas, Julia. 2020. “Moral Gridworlds: A Theoretical Proposal for Modeling Artificial Moral Cognition.” Minds and Machines 30 (2): 219–46. https://doi.org/10.1007/s11023-020-09524-9.
Herman, Barbara. 1981. “On the Value of Acting from the Motive of Duty.” The Philosophical Review 90 (3): 359–82.
Jiang, Liwei, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, et al. 2025. “Investigating Machine Moral Judgement Through the Delphi Experiment.” Nature Machine Intelligence 7 (1): 145–60. https://doi.org/10.1038/s42256-024-00969-6.
Kant, Immanuel. 1998. Critique of Pure Reason. Translated and edited by Paul Guyer and Allen W. Wood. The Cambridge Edition of the Works of Immanuel Kant. Cambridge: Cambridge University Press.
Leibo, Joel Z., Alexander Sasha Vezhnevets, Manfred Diaz, John P. Agapiou, William A. Cunningham, Peter Sunehag, Julia Haas, et al. 2024. “A Theory of Appropriateness with Applications to Generative Artificial Intelligence.” arXiv. https://doi.org/10.48550/arXiv.2412.19010.
Lim, Miranda M., Anne Z. Murphy, and Larry J. Young. 2004. “Ventral Striatopallidal Oxytocin and Vasopressin V1a Receptors in the Monogamous Prairie Vole (Microtus Ochrogaster).” Journal of Comparative Neurology 468 (4): 555–70. https://doi.org/10.1002/cne.10973.
Long, Robert, Jeff Sebo, Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, and David Chalmers. 2024. “Taking AI Welfare Seriously.” https://arxiv.org/abs/2411.00986.
Railton, Peter. 2020. “Ethical Learning, Natural and Artificial.” In Ethics of Artificial Intelligence, edited by S. Matthew Liao. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780190905033.003.0002.
Rawls, John. 1971. A Theory of Justice. Belknap Press of Harvard University Press.
Singer, Peter. 1975. Animal Liberation: A New Ethics for Our Treatment of Animals. New York: New York Review/Random House.
Tomasello, Michael. 2016. A Natural History of Human Morality. Harvard University Press. https://doi.org/10.4159/9780674915855.
Vezhnevets, Alexander Sasha, John P. Agapiou, Avia Aharon, Ron Ziv, Jayd Matyas, Edgar A. Duéñez-Guzmán, William A. Cunningham, Simon Osindero, Danny Karmon, and Joel Z. Leibo. 2023. “Generative Agent-Based Modeling with Actions Grounded in Physical, Social, or Digital Space Using Concordia.” arXiv. https://doi.org/10.48550/arXiv.2312.03664.
Williams, Bernard. 1981. “Internal and External Reasons.” In Moral Luck: Philosophical Papers 1973–1980. Cambridge: Cambridge University Press. http://archive.org/details/moral-luck-philosophical-papers-1973-1980-bernard-williams.
Yudkowsky, Eliezer. 2004. “Coherent Extrapolated Volition.” Singularity Institute for Artificial Intelligence.