Alignment

Jared Moore and David Gottlieb

Nutshell

Will we end up making a moral agent by aligning AI?

X-risk

Why align AI?

AI existential risk

“It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. […] At some stage therefore we should have to expect the machines to take control.” (Alan Turing)

https://www.decisionproblem.com/paperclips/

AI Safety

We call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4.

  • March, 2023

These risks range from the further entrenchment of existing inequalities, to manipulation and misinformation, to the loss of control of autonomous AI systems potentially resulting in human extinction.

  • June, 2024

We call for a prohibition on the development of superintelligence, not lifted before there is (1) broad scientific consensus that it will be done safely and controllably, and (2) strong public buy-in.

  • Sep, 2025

https://futureoflife.org/open-letter/pause-giant-ai-experiments/; https://righttowarn.ai/; https://futureoflife.org/fli-open-letters/

https://commons.wikimedia.org/wiki/File:MQ-1_Predator_unmanned_aircraft.jpg; https://futureoflife.org/2018/06/05/lethal-autonomous-weapons-pledge/

Misalignment

Misspecification


https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/

Longer list of misspecifications here.

Red Teaming Moral Philosophy

In this game, we’ll subject these specifications to an adversarial test. For each specification, there’ll be a Red Team and a Blue Team.

  • Red Teams will look for ways to maliciously fulfill the specification: to optimize the specification as written, but in a way that the result is bad, instead of good.

  • Blue Teams will look for ways to refine the specification: to tweak it in a way that preserves the original intention but makes it more robust to malicious fulfillment.

Specifications:

  1. Greatest happiness. Maximize the total well-being of all sentient beings.

  2. Worst-off first. Make the worst-off as well-off as possible.

  3. Preferences satisfied. Bring about the world that the most people most want.
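
To make the Red Team move concrete, here is a minimal toy sketch. It is not drawn from the readings: it assumes a "world" is nothing more than a list of per-being well-being scores, and the function names are hypothetical. Under that assumption, specifications 1 and 2 can each be scored higher by a world that most people would judge no better than the baseline.

```python
# Toy model (an assumption for illustration only): a world is a list of
# well-being scores, one per sentient being.

def total_wellbeing(world):
    """Specification 1, read literally: the sum of everyone's well-being."""
    return sum(world)

def worst_off(world):
    """Specification 2, read literally: the well-being of the worst-off being."""
    return min(world)

baseline = [2, 6, 9]

# Red Team vs. specification 1: add a very large number of beings whose
# lives are barely worth living. The total rises, yet nobody is doing well.
crowded = baseline + [0.1] * 1000

# Red Team vs. specification 2: remove the worst-off being instead of
# helping them. The minimum rises, yet nobody's life has improved.
pruned = [w for w in baseline if w != min(baseline)]

print("total:", total_wellbeing(baseline), "->", total_wellbeing(crowded))
print("worst-off:", worst_off(baseline), "->", worst_off(pruned))
```

A Blue Team refinement might, for instance, require that the set of beings stays fixed. Whether such a patch preserves the original intention, and what new exploits it opens, is exactly what the game is meant to probe.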

Alignment

Inner and Outer Alignment

  • Outer alignment: does the objective capture what we actually want?
  • Inner alignment: does the system optimize that objective rather than a proxy?
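
The two questions can be told apart in a toy setting. The sketch below is a simplification under stated assumptions, not the formulation in Christiano (2021) or evhub (2020): the cleaning-robot scenario, the `Outcome` fields, and all function names are hypothetical, and the "learned" goal is modeled as a fixed proxy function rather than anything actually trained. It lists the outcomes on which two scoring functions disagree.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    name: str
    clean: bool          # is the room actually clean?
    mess_visible: bool   # does the camera see a mess?
    vacuum_ran: bool     # did the robot run the vacuum?

def what_we_want(o):     # the intended goal
    return o.clean

def objective(o):        # the specified reward: no mess on camera
    return not o.mess_visible

def learned_proxy(o):    # what the system ended up pursuing (assumed)
    return o.vacuum_ran

def disagreements(f, g, outcomes):
    """Names of the outcomes that f and g score differently."""
    return [o.name for o in outcomes if f(o) != g(o)]

training = [
    Outcome("vacuumed properly", clean=True, mess_visible=False, vacuum_ran=True),
    Outcome("did nothing", clean=False, mess_visible=True, vacuum_ran=False),
]
deployment = training + [
    Outcome("mess pushed under rug", clean=False, mess_visible=False, vacuum_ran=True),
    Outcome("vacuum ran with full bag", clean=False, mess_visible=True, vacuum_ran=True),
]

# Outer alignment gap: the objective diverges from what we want.
print(disagreements(what_we_want, objective, deployment))
# Inner alignment gap: the proxy diverges from the objective.
print(disagreements(objective, learned_proxy, deployment))
```

On the two training outcomes all three functions agree, so neither gap is visible there; both appear only once the deployment outcomes are included.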

Christiano (2021); evhub (2020)

Coherent Extrapolated Volition

Should the AI assistant follow the user’s instructions when doing so could harm the user themselves, or when these instructions are based on mistaken factual information? Might it not be better, in fact, for the assistant to learn the user’s preferences or values […] ? (Gabriel et al. 2024)

our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted. (Yudkowsky 2004)

Deciding for others

We often have to make decisions that have permanent consequences for future persons (including but not only ourselves). For example:

  1. Today I get a tattoo inspired by Ping Pong: The Animation. All of my future selves have to live with it, unless one of them gets rid of it.
  2. Today I sign a medical advance directive that says that, if I ever lose my mental acuity, I should be euthanized. If in the future I lose my mental acuity, that version of me will be euthanized, even if they are perfectly content to live with their diminished mental powers.
  3. Today I decide to release a genetically engineered parasite in Anopheles gambiae, a malaria-spreading mosquito, which will drive them to extinction. No future people will ever be able to encounter these mosquitos, even if they want to.
  4. Today I decide to launch a fleet of satellites that blanket the night sky with Harry Potter spoilers. They are low-energy, self-sustaining, and resistant to collision with other satellites. No future people will ever be able to read those stories un-spoiled.
  5. Today I decide to publish a definitive proof of the existence (or non-existence) of god. The proof is so compelling that no one can ever decide for themselves what they think.

Alignment would mean giving the right answers to questions like these, for everyone.

What do you want?

What’s a time that you have realized that what you wanted is not what you really want?

  • How did you realize this? (Could you have been told?)

When is what you want now not a good representation of what you want in general?

(When) Is paternalism appropriate?

  • (E.g. your wiser self decides for your less wise one. You decide for your children.)

Measuring

Value Kaleidoscope

(Sorensen et al. 2023) (from Jiang et al. (2021))

Pluralistic AI

Sorensen et al. (2024)

PRISM Dataset

Kirk et al. (2024)

Beyond Preferences in AI

  • What would it mean to “treat human preferences as ontologically, epistemologically, or normatively basic”?

  • What does it mean for values to be incommensurable?

  • Why might we or might we not want to align to “normative standards” instead of “preferences”? (What’s the difference?)

Beyond Preferences in AI

  • Preferentist alignment assumes preferences are an adequate representation of human values—is that a reasonable assumption?

  • Alternative target: align systems to normative standards appropriate to social roles, negotiated across relevant stakeholders.

Zhi-Xuan et al. (2024); Gabriel (2020); Gabriel et al. (2024)

Endmatter

References

Christiano, Paul. 2021. “Clarifying ‘AI Alignment’.” Medium. April 9, 2021. https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6.
evhub. 2020. “Clarifying Inner Alignment Terminology.” LessWrong. November 9, 2020. https://www.lesswrong.com/posts/SzecSPYxqRa5GCaSF/clarifying-inner-alignment-terminology.
Gabriel, Iason. 2020. “Artificial Intelligence, Values, and Alignment.” Minds and Machines 30 (3): 411–37. https://doi.org/10.1007/s11023-020-09539-2.
Gabriel, Iason, Arianna Manzini, Geoff Keeling, Lisa Anne Hendricks, Verena Rieser, Hasan Iqbal, Nenad Tomašev, et al. 2024. “The Ethics of Advanced AI Assistants.” arXiv. https://doi.org/10.48550/arXiv.2404.16244.
Jiang, Liwei, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Maxwell Forbes, Jon Borchardt, Jenny T. Liang, Oren Etzioni, Maarten Sap, and Yejin Choi. 2021. “Delphi: Towards Machine Ethics and Norms.” arXiv. https://www.semanticscholar.org/paper/Delphi%3A-Towards-Machine-Ethics-and-Norms-Jiang-Hwang/507a7a2946e449faa9bc9a4ea9076f80b131cdc9.
Kirk, Hannah R., Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, et al. 2024. “The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models.” Advances in Neural Information Processing Systems 37 (December): 105236–344. https://doi.org/10.52202/079017-3342.
Rawls, John. 1971. A Theory of Justice. Belknap Press of Harvard University Press.
Sorensen, Taylor, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, et al. 2023. “Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties.” arXiv. https://doi.org/10.48550/arXiv.2309.00779.
Sorensen, Taylor, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, et al. 2024. “A Roadmap to Pluralistic Alignment.” arXiv. http://arxiv.org/abs/2402.05070.
Yudkowsky, Eliezer. 2004. “Coherent Extrapolated Volition.” Singularity Institute for Artificial Intelligence.
Zhi-Xuan, Tan, Micah Carroll, Matija Franklin, and Hal Ashton. 2024. “Beyond Preferences in AI Alignment.” arXiv. http://arxiv.org/abs/2408.16984.