MS&E338 Aligning Superintelligence
Within a couple of decades, or less, it is plausible that humans will create an AI that is much smarter than humans in practically all domains of human activity. We refer to such an AI as a superintelligence. The alignment problem is how to make sure that such a superintelligence acts according to its human operator's intent. This course is intended for a technical audience interested in thinking about this problem.
But why would an AI not act according to its human operator's intent? And if the AI were to misbehave, wouldn't the operator just modify it or shut it down? Furthermore, even if we accept that the AI will not always behave as intended, why should this be considered a major source of risk, let alone a catastrophic one? Why are some people saying that these risks should be a global priority on par with pandemics and nuclear war, while others are saying that these concerns are overhyped?
In this course, we will discuss:
- Why might a superintelligence become misaligned with its operator's intent?
- Might misalignment pose a catastrophic risk?
- What are proposed solutions to the alignment problem?
The course will place special emphasis on formalizing ideas. Guest lectures will be delivered by alignment researchers.
Please note: Enrolled students will be invited to a Google doc that includes more detailed course information, including schedule and assignments. If you are auditing the course, please reach out and we will invite you.
Prerequisites
To have the background needed to participate in this course, students are advised to have taken:
- one graduate-level machine learning course
- one course that studies agents (e.g., AI, RL, decision analysis, economics)
Course Project
Students will be required to write opinion, review, or research papers on aspects of the alignment problem.
What this course is not about
This course will focus on the alignment of future superintelligence, rather than the alignment of current systems. There are many challenges that the course will not address. These include:
- Use of AI by ill-intentioned humans. Such situations represent misalignment between humans, rather than between a human and an AI.
- Aggregation of conflicting preferences across humans.
- Minimization of bias in AI products.
- How to organize society in a post-superintelligence world (governance, redistribution, retooling).
- How to deal with misinformation and track provenance.
- The moral status of future superintelligence.
Logistics
3-4:20pm Mondays and Wednesdays in 370-370. This is in building 370 by the Main Quad.
Course Notes
We are writing notes as the course progresses. We really appreciate your feedback.
- 00 - Foundations of Reinforcement Learning. Link
- 01 - An Operator-Agent Interface. Link
- 02 - Reward Hacking. Link
- 03 - Reward Uncertainty. Link
- 04 - Reward Learning. Link
Papers
Some of the articles we will draw from:
- Cooperative Inverse Reinforcement Learning (2016). Link
- The Off-Switch Game (2017). Link
- Defining and Characterizing Reward Hacking. Link
- AI Safety via Debate. Link
Guest Speakers