MS&E338/CS338 Aligning Superintelligence

Within a couple of decades, or less, it is plausible that humans will create an AI that is much smarter than humans in practically all domains of human activity. We refer to such an AI as a superintelligence. The alignment problem is how to make sure that such a superintelligence acts according to its human designer's intent. This course is intended for a technical audience interested in thinking about this problem.

But why would an AI not act according to its human designer's intent? And if the AI were to misbehave, wouldn't the designer just modify it or shut it down? Furthermore, even if we accept that the AI will not always behave as intended, why should this be considered a major source of risk, let alone a catastrophic risk? Why are some people saying that these risks should be a global priority on par with pandemics and nuclear war, while others are saying that these concerns are overhyped?

In this course, we will discuss:

  • Why might a superintelligence become misaligned with its designer's intent?
  • Might misalignment pose a catastrophic risk?
  • What are proposed solutions to the alignment problem?

Guest lectures will be delivered by alignment researchers.

Course information

There will be a Google Doc (link) that contains additional details about the course. This document is the main hub for course information and will be updated throughout the course. Anyone with a Stanford email has access. If you are curious about the course but not at Stanford, please reach out and we will invite you.

Prerequisites

The course will place special emphasis on formalizing ideas. About ⅔ of the course will be theoretical and ⅓ empirical. To have the background to participate, we recommend that each student have taken:

  • one graduate-level machine learning course
  • one course that models decision making (e.g., AI, RL, decision analysis, economics)

What this course is not about

This course will focus on the alignment of future superintelligence, rather than the alignment of current systems. There are many challenges that the course will not address. These include:

  • Use of AI by ill-intentioned humans. Such situations represent misalignment between humans, rather than between a human and an AI.
  • Aggregation of conflicting preferences across humans.
  • Minimization of bias in AI products.
  • How to organize society in a post-superintelligence world (governance, redistribution, retooling).
  • How to deal with misinformation and track provenance.
  • The moral status of future superintelligence.

Logistics

3-4:20pm Mondays and Wednesdays in 370-370. This is in building 370 by the Main Quad.

Course Assistants

Semyon Lomasov, slomasov AT stanford DOT edu.
Benlin Gan, bgan2 AT stanford DOT edu.