Jared Moore and David Gottlieb
Will we end up making a moral agent by aligning AI?
Why align AI?
“It seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers. […] At some stage therefore we should have to expect the machines to take control.” (Alan Turing)

We call on all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4.
These risks range from the further entrenchment of existing inequalities, to manipulation and misinformation, to the loss of control of autonomous AI systems potentially resulting in human extinction.
We call for a prohibition on the development of superintelligence, not lifted before there is (1) broad scientific consensus that it will be done safely and controllably, and (2) strong public buy-in.
https://futureoflife.org/open-letter/pause-giant-ai-experiments/; https://righttowarn.ai/; https://futureoflife.org/fli-open-letters/

https://commons.wikimedia.org/wiki/File:MQ-1_Predator_unmanned_aircraft.jpg; https://futureoflife.org/2018/06/05/lethal-autonomous-weapons-pledge/

https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/


In this game, we’ll subject these specifications to an adversarial test. For each specification, there’ll be a Red Team and a Blue Team.
Red Teams will look for ways to maliciously fulfill the specification: to optimize the specification as written, but in a way that makes the result bad instead of good.
Blue Teams will look for ways to refine the specification: to tweak it in a way that preserves the original intention but makes it more robust to malicious fulfillment.
Specifications:
Greatest happiness. Maximize the total well-being of all sentient beings.
Worst-off first. Make the worst-off as well-off as possible.
Preferences satisfied. Bring about the world that the most people most want.
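As a starting point for the Red Team, here is a minimal sketch that scores the three specifications on a toy example. Everything in it is an assumption made for illustration: a "world" is modeled as a list of well-being numbers, the function names, world names, and numbers are invented, and "most people most want" is read as a simple count of first choices. None of this comes from the cited sources. The point is only that each specification, optimized as written, picks a different world, and each pick has an obvious objection.

```python
from collections import Counter

# Hypothetical toy worlds: one well-being score per person (numbers are invented).
worlds = {
    "status_quo": [1, 50, 50],   # one person badly off, two doing well
    "crowded": [1] * 200,        # a huge population, each barely well-off
    "levelled": [2, 2, 2],       # everyone equal, but at a low level
}

# Hypothetical preference rankings, best first, for three people.
rankings = [
    ["levelled", "status_quo", "crowded"],    # the worst-off person
    ["status_quo", "levelled", "crowded"],
    ["status_quo", "levelled", "crowded"],
]

def total_wellbeing(world):
    """Greatest happiness: sum of well-being over everyone."""
    return sum(world)

def worst_off(world):
    """Worst-off first: well-being of the least well-off person."""
    return min(world)

def most_wanted(rankings):
    """Preferences satisfied, read as: the most common first choice."""
    first_choices = Counter(ranking[0] for ranking in rankings)
    return first_choices.most_common(1)[0][0]

best_total = max(worlds, key=lambda name: total_wellbeing(worlds[name]))
best_worst_off = max(worlds, key=lambda name: worst_off(worlds[name]))
best_preference = most_wanted(rankings)

print("Greatest happiness picks:", best_total)
print("Worst-off first picks:", best_worst_off)
print("Preferences satisfied picks:", best_preference)
```

Under these assumptions, the total is highest in the crowded world, the minimum is highest in the levelled world, and the first-choice count favors the status quo over the objection of its worst-off member. A Blue Team refinement would change one of the three scoring functions and rerun the comparison.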

Christiano (2021); evhub (2020)
Should the AI assistant follow the user’s instructions when doing so could harm the user themselves, or when these instructions are based on mistaken factual information? Might it not be better, in fact, for the assistant to learn the user’s preferences or values […] ? (Gabriel et al. 2024)
our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted. (Yudkowsky 2004)
We often have to make decisions that have permanent consequences for future persons (including but not only ourselves). For example:
Alignment would mean giving the right answers to questions like these, for everyone.
What’s a time that you have realized that what you wanted is not what you really want?
When is what you want now not a good representation of what you want in general?
(When) Is paternalism appropriate?

(Sorensen et al. 2023) (from Jiang et al. (2021))

Sorensen et al. (2024)

Kirk et al. (2024)
What would it mean to “treat human preferences as ontologically, epistemologically, or normatively basic”?
What does it mean for values to be incommensurable?
Why might we or might we not want to align to “normative standards” instead of “preferences”? (What’s the difference?)
Preferentist alignment assumes preferences are an adequate representation of human values—is that a reasonable assumption?
Alternative target: align systems to normative standards appropriate to social roles, negotiated across relevant stakeholders.
Zhi-Xuan et al. (2024); Gabriel (2020); Gabriel et al. (2024)