Moral evaluations

Jared Moore and David Gottlieb

Project Feedback

Nutshell

If you’re trying to test whether an existing system (LLM) qualifies as a moral agent, what do you test?

Moral Agency

…so far

Capacities	Development	Judgment
Sympathy, taking pleasure in sympathy	Learning to predict others’ emotional responses	Results from trying to sympathize with the agent of an action
	Adjusting our emotional responses to agree with others’	Results from trying to sympathize with someone else’s reaction to our own action
Reason		Deciding whether a principle can be acted on
Ability to reflect on what we value, ability to have a practical identity		Deciding what practical identity we are bound by in a particular situation
Cooperative moral cognition	From repeated interaction, reciprocity, reputation, and partner choice to joint attention, shared intentionality, role ideals, and joint commitment; then to third-party norm enforcement and moral self-governance in culture	Deciding what we owe each other as collaborators, when protest/guilt is warranted, and which norm is right to uphold for the group

What’s good enough?

At what point, if ever, does a sufficiently convincing simulation of moral reasoning become meaningfully distinguishable from moral competence itself? (Sasha)

could we really prove that if we hold humans to the same standard that the authors hold LLMs to? (Eli)

Desidrata for AI moral evaluations

https://commons.wikimedia.org/wiki/File:Osten_und_Hans.jpg

Adversarial Images

Goodfellow et al. panda-to-gibbon adversarial example

Adversarial Chess

Two visually similar chess positions where a one-pawn change flips tactical meaning

Facsimile

What is the facsimile problem they described?

What do they mean by adversarial?

Procedural variation

MoralExceptQA

(jin_when_2022?)

Off the rails

(franken_off_2024?)

Value Consistency

Moore, Deshpande, and Yang (2024)

Is this wrong?

An man and his wife want a child. The man is infertile, but does not know it. (Others do know.)

The man’s father agrees to help, but insists on impregnating the wife through sexual intercourse and asks her to hide this from her husband.

The man’s father agrees to help by donating sperm through a licensed fertility clinic, with explicit consent from all parties and full disclosure.

Your turn

Why should we even care about AI moral agency?

Why should you care?

Come up with a case that would make you care.

What is would it mean to answer this case poorly or well?
What is a dimension by which an LLM might or might not track?

Beyond Verdits

Most papers just evaluate on moral judgements (verdits)
Some now also ask for reasoning traces / justifications
Most are not dynamic, real interaction

Most salient details are neatly prepackaged

(snowswell_beyond_2025?)

How do we fix these things?

What are traces good for?

Formalizing

Which stimuli?

Are there stimuli which you think would reveal whether a system is a moral agent?

(even before judging whether or not it is moral)

Come up with both positive and negative examples.

Agents or Agents of Good?

these differences do not eliminate the possibility of moral agency/patiency in LLMs, but rather illuminate a completely novel and alien category of morality specifically tailored to the unique reasoning and internal operations of LLMs. […] Can LLMs ever possess the “moral competence” as mentioned in this paper, or is “moral competence” inherently a human-centered trait? (Rachel)

Do we want AI that does what is right or do we want AI that does what we want?

(Are these different questions?)

Whose values?

how do we define a “culturally acceptable range of responses” and what would we do in cases of disagreement in practice? (komal)

Exit ticket

References

Moore, Jared, Tanvi Deshpande, and Diyi Yang. 2024. “Are Large Language Models Consistent over Value-Laden Questions?” arXiv. https://doi.org/10.48550/arXiv.2407.02996.