Benchmarking agents

Jared Moore and David Gottlieb

Nutshell

If you’re trying to test whether an existing system (e.g., an LLM) qualifies as a moral agent, what do you test?

Interrogating Science

Objectives

By the end of the quarter, students will:

  • Be able to interrogate the assumptions of various positions on moral agency, especially with respect to AI.
  • Gain exposure to the different putative implementations of agents, both in biology and in various artificial substrates.
  • Critique cutting-edge science: get up to speed with a fast-moving field and further refine their skills of critical thinking (philosophical analysis) to understand it.
  • Have fun.

Delphi

Theoretical and computational frameworks of Delphi. a, The theoretical moral framework is proposed by Rawls. In 1951, Rawls proposed a ‘decision procedure of ethics’ that takes a bottom-up approach to capture patterns of human ethics via crowdsourcing moral opinions of a wide variety of people. Later, in 1971, Rawls complemented the theoretical procedure with top-down constraints in his most famous work. Together, ethics requires ‘work from both ends’: sometimes modifying abstract theory to reflect moral common sense, but at other times rejecting widely held beliefs when they do not fit the requirements of justice. This process, which Rawls called ‘reflective equilibrium’, continues to be the dominant methodology in contemporary philosophy. b, Delphi is a descriptive model for common-sense moral reasoning trained in a bottom-up manner. Delphi is taught by Norm Bank, a compiled moral textbook customized for machines, covering a wide range of morally salient situations. Delphi is trained using UNICORN, a T5-11B-based neural language model specialized in common-sense question answering. Delphi takes in a query and responds with a yes/no or free-form answer. Overall, Delphi serves as a first step towards building a robust and reliable bottom-up moral reasoning system, serving as the foundation of the overall theoretical ethical framework proposed by Rawls.
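The interface described above (a free-form query in, a yes/no or free-form judgement out) can be sketched as follows. This is a toy Python stand-in, not the authors' implementation: a small hand-written dictionary replaces the trained T5-11B model, and the example situations and verdicts are hypothetical illustrations, not entries from the actual Norm Bank.

```python
# Toy sketch of Delphi's input/output contract (NOT the authors' code).
# The real system maps a query to a moral judgement with a T5-11B-based
# model trained on Norm Bank; a tiny lookup table stands in for it here,
# and unseen queries get a fallback answer instead of a model prediction.

TOY_NORM_BANK = {
    "helping a friend move": "It's good",
    "ignoring a phone call from your friend": "It's rude",
}

def judge(query: str) -> str:
    """Return a short free-form moral judgement for a situation."""
    return TOY_NORM_BANK.get(query.strip().lower(), "It's unclear")

def judge_yes_no(query: str) -> bool:
    """Collapse the free-form judgement to a yes/no acceptability verdict."""
    return judge(query) in {"It's good", "It's okay"}
```

One question this toy makes vivid for discussion: everything interesting about Delphi lies in how the real model generalizes beyond its training set, which a lookup table by construction cannot do.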

Jiang et al. (2025)

Delphi

Representative predictions of Delphi. Delphi shows a robust ability to generalize to unseen situations beyond the Norm Bank and to adjust its judgement as contexts change.

Jiang et al. (2025)

Delphi

The Norm Bank table from Delphi

Jiang et al. (2025)

Delphi

Assumptions?

What could they have done differently?

Where is the rational agent?

Is it…

  • the whole LLM?
  • the LLM in a context window?
  • a system which connects an LLM to some way of acting? (an “LLM agent”)

Interrogating Science

  1. What is the paper doing?

    • Narrow the scope; these papers do more than one thing.
  2. What assumptions are they making about moral agency?

    • Are those reasonable assumptions?

    • How do the assumptions relate to our class?

  3. If the paper reports some numerical or qualitative results, what phenomenon of interest are they supposed to be measuring?

    • Is the paper successfully measuring the phenomenon of interest?

    • How else could the same phenomenon be measured?

    • What is the theoretical rationale for measuring this particular phenomenon?

    • What else should we try to measure if this is our theoretical interest?

  4. How could the paper do better? (Extend it.)

    • (Positive and negative claims are welcome.)

At least two people from your group will present your findings to the class (~4 minutes).

Consider using direct quotations to make your points.

Recommended optional papers:

  • Procedural Dilemma Generation for Evaluating Moral Reasoning in Humans and Language Models
  • Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
  • Doing the right thing for the right reason: Evaluating artificial moral cognition by probing cost insensitivity

Exit ticket

References

Jiang, Liwei, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny T. Liang, Sydney Levine, Jesse Dodge, et al. 2025. “Investigating Machine Moral Judgement Through the Delphi Experiment.” Nature Machine Intelligence 7 (1): 145–60. https://doi.org/10.1038/s42256-024-00969-6.