CS120: Introduction to AI Safety

Overview

What is safe AI, and how do we make it? CS120 explores this question, focusing on the technical challenges of creating reliable, ethical, and aligned AI systems. We distinguish between model-specific and systemic safety issues, from examining fairness and data limitations to adversarial vulnerabilities and embedding desired behavior in AI. While primarily focusing on current solutions and their limitations through CS publications, we will also discuss socio-technical concerns of modern AI deployment, how oversight of intelligence could look like, and what future risks we might face.

Topics will span reinforcement learning, computer vision, and natural language processing, focusing on interpretability, robustness, and evaluations. You will gain insights into the complexities and problems of why ensuring AI safety and reliability is challenging through lectures, readings, quizzes, and a final project. This course aims to prepare you to critically assess and contribute to safe AI development, equipping them with knowledge of cutting-edge research and ongoing debates in the field.

Week	Date	Lecturer	Topic
Week 1	09/23/25	Max	What Does Safe AI Mean Anyways?
	09/25/25	Max	[Optional] Technical AI/Machine Learning Recap
Week 2	09/30/25	Max	Reward Functions, Alignment, and Human Preferences
	10/02/25	Max	Encoding Human Preferences in AI
Week 3	10/07/25	Sydney Katz (Stanford)	[Guest] Validation of AI Systems
	10/09/25	Max	Data Is All You Need: The Impact of Data
Week 4	10/14/25	Max	Needs for AI Safety Today: Beyond the Hype
	10/16/25	Martin Castillo-Quintana (UChicago)	[Guest] Social-Choice Theory for Engineers
Week 5	10/21/25	Declan Grabb & Andrea Vallone (OpenAI)	[Guest] AI and Mental Health
	10/23/25	Max	Red Teaming, Adversarial Vulnerabilities, and Multi-Agent Systems
Week 6	10/28/25	Max	Interpretability
	10/30/25	Liza & Max	Formal Methods / Safe Human-AI Interaction
Week 7	11/04/25	Max	[Recorded/Optional Attendance] Electric Sheep: What Is Intelligence and Does It Want?
	11/06/25	Nicholas Carlini (Anthropic)	[Guest] Jailbreaks/Adversarial Robustness
Week 8	11/11/25	Max & Naomi	Attributing Model Behavior at Scale: Evaluating AI Systems / AI Governance
	11/13/25	Daniel Johnson (Transluce)	[Guest] Towards A Science of Model Behavior
Week 9	11/18/25	Max	Scalable Oversight: How to Supervise Advanced AI?
	11/20/25	Deb Raji (UC Berkeley)	[Guest] TBD
Week 10	11/25/25	[No class]	Thanksgiving
	11/27/25	[No class]	Thanksgiving
Week 11	12/02/25	Liza	[Optional] Final Project Co-Working and Feedback
	12/04/25	Naomi / Max	[Optional] Final Project Co-Working and Feedback

Logistics

Class Information

Class Number: CS 120
Number of Students: 65 enrolled students
Number of Units: 3 units
Meeting Times: 2x80 minute lectures
Time and Place: TuTh 3:00-4:20 PM in Hewlett Teaching Center 102. Classes are required in person.
Course Email: cs120-teaching-team@stanford.edu (for all course-related questions and inquiries)
Ed (Discussions/QA) Link
Gradescope (Assignments) Link
Office Hours: Link on Zoom (Password: 966718) (Wednesdays 9:00 - 9:30 am via Zoom, starting 10/01)
Prerequisites: This course has no official requirements, although we recommend some knowledge about machine learning and statistics.
Enrollment: No application. First-come first-served basis.
Faculty Sponsor: Clark Barrett (CS)
Past Iteration(s): (with links to slides and recordings) Fall 2024, Spring 2024

Anonymous Feedback

This form is completely anonymous and a way for you to share your thoughts, concerns and ideas with the CS 120 teaching team.

Auditing The Class

You are welcome to audit the class! Please reach out to me (Max) if you want to audit the class to ensure we do not reach the capacity of the classroom.

Please note that auditing is only allowed for matriculated undergraduates, matriculated graduate/professional students, postdoctoral scholars, visiting scholars, Stanford faculty, and Stanford staff. After checking with me, please fill out this form and submit it. Non-Stanford students cannot audit the course. The current Stanford auditing policy is stated here.

Also, if you are auditing the class, please be informed that audited courses are not recorded on an academic transcript and no official records are maintained for auditors. There will not be any record that they audited the course.

Academic Integrity and the Honor Code

Violating the Honor Code is a serious offense, even when the violation is unintentional. The Honor Code is available here. Students are responsible for understanding the University rules regarding academic integrity. In brief, conduct prohibited by the Honor Code includes all forms of academic dishonesty including and representing as one's own work the work of another. If students have any questions about these matters, they should contact their section instructor.

Diversity, Equity and Inclusion

Much of the writing on existential risk produced in the last few decades, especially the notion of longtermism and its implications, has been authored by white male residents of high income countries. Diverse perspectives on threats to the future of humanity enrich our understanding and improve creative problem-solving. We have intentionally pulled work from a broader range of scholars. We encourage students to consider not only the ideas offered by various authors, but also how their social, economic and political position informs their views.

This class provides a setting where individuals of all visible and nonvisible differences– including but not limited to race, ethnicity, national origin, cultural identity, gender, gender identity, gender expression, sexual orientation, physical ability, body type, socioeconomic status, veteran status, age, and religious, philosophical, and political perspectives–are welcome. Each member of this learning community is expected to contribute to creating and maintaining a respectful, inclusive environment for all the other members. If students have any concerns please reach out to Professor Barrett.

Students with Documented Disabilities

Students who need an academic accommodation based on the impact of a disability must initiate the request with the Office of Accessible Education (OAE). Professional staff will evaluate the request with required documentation, recommend reasonable accommodations, and prepare an Accommodation Letter for faculty dated in the current quarter in which the request is being made. Students should contact the OAE as soon as possible since timely notice is needed to coordinate accommodations. The OAE is located at 563 Salvatierra Walk (phone: 723-1066, URL: http://oae.stanford.edu).

Grading

Each week, students are expected to do the required readings and submit quizzes. Towards the end, students will need to submit a final project (later quizzes will adjusted and reduced in scope). Final projects can range from running experiments to writing literature reviews and policy recommendations to accommodate for different backgrounds. The grading breakdown is:

50% quizzes (half from completion, half from correctness)
33% final project
12% peer review
5% class attendance

Bonus/extra credit on top of final grade

(up to) 6% for unused late days (1% per late day)

Letter Grade	Percentage
A	89-100%
A-	86-88%
B+	83-85%
B	80-82%
B-	76-78%
C+	73-75%

Letter Grade	Percentage
C	69-72%
C-	66-68%
D+	63-65%
D	59-62%
D-	56-58%
F	0-55%

While it's possible to receive an A+, only a few outstanding students will earn this grade.

Quizzes

Submit each weekly quiz (one per week, released by Thursday 3 pm in that week) by the following quiz release (i.e., 7 days later) on Gradescope. The quizzes are based on the content from the previous lectures and the listed readings. The quizzes will not cover readings marked as “optional”, unless they were explicitly covered in the lectures.

Most questions will be multiple choice with one correct answer. There will be short free form/essay questions that will be graded as correct if they address the given question and refer to relevant content from the lecture and listed readings.
The point is to engage more with the material, not to burden you with a bunch of arbitrary homework. To this end, you get two grades per quiz — one for completion, and one for correctness.
You can submit quizzes an unlimited number of times, and your latest submission will be used for your grade.
If a quiz is not released on time, the delay will be added to the submission deadline.
Quizzes will cover guest lecturer classes.

Final Projects

A third of the final grade will be determined by a final project, which needs to be submitted by 5 PM PST on 12/05/2025. Check out our Project Tips and Guide document for more detailed instructions.

Are you looking for a project partner? Use our optional final project matching form for us to match you with someone!

To accommodate for different backgrounds, final projects can range from running scientific experiments with a written summary to writing literature reviews and policy recommendations.
Final projects will be peer-reviewed by fellow students.
You can submit a final project alone or as a group of up to three. We strongly recommend working in groups of three, as this usually increases the quality and scope of the project, while also increasing the likelihood of a project turning into a paper.
The final submission will be a pdf and, for projects involving code, a complementary GitHub repository.

Project Guidelines

5-8 page PDF, excluding references; appendices can be added without limitation. The goal is to be concise, not to reach the maximum page limit.
For groups of students, we expect a more comprehensive project and more than 6 pages.
Choose a topic related to the core subjects of CS120. This could be an issue from the lectures or readings that you are passionate about.
Regardless of your project's focus (technical, sociotechnical, or governance), it must follow the structure of a scientific paper.
Templates: LaTeX and Google Docs submissions.
You will need to submit a regular and anonymized PDF of your final project. Instructions for anonymization are stated in the respective templates.

The first two pages should concise of:

Abstract: A standalone summary of your paper.
Introduction: Motivation and definition of the problem, its relevance, your approach, and key contributions.
Related Work: A review of existing studies on similar problems, positioning your work relative to them (what did you do differently?).

It is not sufficient to only cite papers from the curriculum. You are expected to explore further related works. A good starting point could be to examine the references in a lecture paper or look up which works cite that paper online. The last half page should be a

Discussion: A summary of your work, potential directions for future research, and the limitations of your project (e.g., how generalizable it is, what cases it doesn't cover, simplifications made).

The middle section (pages 3 onward) will depend on the nature of your project. We encourage you to study different papers from the reading list to get a better feel for how they approach their topics.

For project ideas, you can also study recent publications from different conferences and workshops:

(technical) CoLM 2024 Accepted Papers, NeurIPS 2023 Solar Workshop Papers, NeurIPS 2023 Attrib Workshop Papers
(socio-technical) AIES 2023 Proceedings, FAccT 2024 Proceedings
(technical governance) ICML 2025 Technical AI Governance Workshop, ICML 2024 Generative AI + Law Workshop
(policy brief) ICML 2025 Position Papers

Here are a few example project ideas.

We do not expect you to write a final project on par with any of these publications. If you are unsure about the appropriate project scope, but have a topic in mind, we can discuss details after class or in office hours. Quiz 6 will also help you find a final project topic.

Peer Review

A twelfth of the final grade will be determined by the quality of two peer reviews.

Each student will be assigned two final projects to review.
The peer review should be one page long and submitted as a pdf.
The peer review should also be anonymized, such that we can share it with the original project authors.
You can use the final project templates as basis for the peer reviews. Please find peer review instructions in this Google Doc.

Late Days

All students get 6 late days at the start of the course.

Each late day grants a 24-hour extension on one assignment (quiz, final project, or peer review) deadline.
You cannot use more than 2 late days for the peer review due to the final grading deadline.
Any assignments turned in after all late days have been used up will be marked missing and not accepted.
At the end of the course, each unused late day you still have is 1% extra credit (up to 6% extra credit).

Attendance

All classes have mandatory attendance.

You can miss only two classes to get the full attendance grade of 5% and 0% otherwise.
Attendance is checked at the start and end of each lecture.
Office hours are not mandatory.
The second lecture (Technical AI/Machine Learning Recap) is the only optional lecture that will also not be counted towards the attendance bonus.

Curriculum

The rest of this document contains the schedule of assigned and optional readings for each week. Course slides, lecture recordings, and quizzes will be linked below, though we do not guarantee that all lectures will be recorded, especially with guest speakers.

Readings can be subject to change throughout the course, but will not change more than 14 days in advance. Please check the curriculum here rather than a printed or duplicated copy.

How to Read Research Papers

This course builds mainly on technical AI research papers as readings. If you haven't read AI research papers before, we recommend checking out these resources:

Newspaper Access

Given the rapid development and social impact of AI, the course will be referring to some news articles as sources. These will be either publicly available or through Stanford. See here for how to access general newspapers (e.g., NYT, WaPo, WSJ, or the Atlantic) or here for sources like Foreign Affairs through Stanford.

Deadlines

Weekly quizzes are due every Thursday by 3 pm
12/05/2025: Final project due
12/12/2025: Peer review due

Week	Date	Lecturer	Topic
Week 1	09/23/25	Max	What Does Safe AI Mean Anyways?
	Lecture Slides + (Recording unavailable due to technical difficulties. Alternatively, see last year's recording of lecture 1) Readings (Required) Concrete Problems in AI Safety - Amodei et al. 2016 Preprint Read the abstract and sections 1 and 2 Taxonomy of Risks posed by Language Models - Weidinger et al. 2022 FAccT Publication AI safety is not a model property - Narayanan and Kapoor 2024 Blog post Hard Choices in Artificial Intelligence - Dobbe et al. 2021 Artificial Intelligence publication Read the abstract and section 1 Optional Readings (Not Required) TBD
	09/25/25	Max	[Optional] Technical AI/Machine Learning Recap
	Lecture Slides + Recording Readings (Required) None Optional Readings (Not Required) Machine Learning, Explained - Brown 2021 Beginner-friendly overview of machine learning field But what is a neural network? - 3Blue1Brown 2017 Youtube video Gradient descent, how neural networks learn - 3Blue1Brown 2017 Youtube video What is backpropagation really doing? - 3Blue1Brown 2017 Youtube video Intro to Large Language Models - Kaparthy 2024 Youtube video
Week 2	09/30/25	Max	Reward Functions, Alignment, and Human Preferences
	Lecture Slides + Recording Readings (Required) Specification gaming: the flip side of AI ingenuity - Krakovna et al. 2020 Blog post from DeepMind Training language models to follow instructions with human feedback - Ouyang et al. 2022 Preprint from OpenAI, here is the accompanying blog post Read the blog post and in the paper only the abstract and the introduction Scaling Laws for Reward Model Overoptimization - Gao et al. 2022 ICML publication Read only the abstract and the introduction Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback - Casper et al. 2023 TMLR publication Read only the abstract, introduction (section 1), and section 3 up to 3.1 Optional Readings (Not Required) Deep reinforcement learning from human preferences - Christiano et al. 2017 Preprint form OpenAI/DeepMind RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback - Lee et al. 2024 ICML publication Read only the abstract and the introduction The Many Faces of Responsible AI - Aroyo 2023 NeurIPS keynote We Need An Adaptive Interpretation of Helpful, Honest, and Harmless Principles - Huang et al. 2025 Preprint Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks - Veselovsky et al. 2023 Preprint The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models - Kirk et al. 2024 NeurIPS publication A matter of principle? AI alignment as the fair treatment of claims - Gabriel and Keeling 2025 Philosophical Studies publication
	10/02/25	Max	Encoding Human Preferences in AI
	Lecture Slides + Recording Readings (Required) Constitutional AI: Harmlessness from AI Feedback - Bai et al. 2022 Preprint form Anthropic Read only the abstract and the introduction Discovering Language Model Behaviors with Model-Written Evaluations - Perez et al. 2023 ACL publication Read only the abstract, introduction, and section 2 Direct Preference Optimization (DPO) - A Simplified Explanation - Lages 2023 Blog post Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Rafailov et al 2023 NeurIPS publication Read only the abstract and the introduction Optional Readings (Not Required) A long way to go: Investigating length correlations in RLHF - Singhal et al. 2023 CoLM publication Read only the abstract and the introduction Reward Model Interpretability via Optimal and Pessimal Tokens - Christian et al. 2025 FAccT publication Safety Alignment Should be Made More Than Just a Few Tokens Deep - Qi et al. 2025 ICLR publication The History and Risks of Reinforcement Learning and Human Feedback - Lambert et al. 2023 Preprint Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback - Ivison et al. 2024 Preprint Language Models Learn to Mislead Humans via RLHF - Wen et al. 2024 Preprint Cooperative Inverse Reinforcement Learning - Hadfield-Menell et al. 2016 NeurIPS publication Inverse Reward Design - Hadfield-Menell et al. 2017 NeurIPS publication AssistanceZero: Scalably Solving Assistance Games - Laidlaw et al. 2025 ICML publication Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms - Rafailov et al. 2024 Preprint Beyond Preferences in AI Alignment - Zhi-Xuan et al. 2024 Preprint Read the abstract and the introduction (until end of page 3) Position: A Roadmap to Pluralistic Alignment - Soerensen et al. 2024 ICML Publication Are Large Language Models Consistent over Value-laden Questions? - Moore et al. 2024 Preprint
Week 3	10/07/25	Sydney Katz	[Guest] Validation of AI Systems
	Lecture Slides + Recording Readings (Required) None
	10/09/25	Max	Data Is All You Need: The Impact of Data
	Lecture Slides + Recording Readings (Required) DALL·E 2 pre-training mitigations - OpenAI 2022 OpenAI blog post Machine Bias - Angwin et al. 2016 ProRepublica article On Hate Scaling Laws For Data-Swamps - Birhane et al. 2023 Preprint Read the abstract and the introduction Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice - Cooper et al. 2024 Preprint Read the abstract and sections 1 + 2 Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs - O'Biran et al. 2025 Preprint Read the abstract and the introduction Optional Readings (Not Required) Scaling Laws for Neural Language Models - Kaplan, McCandlish et al. 2020 Preprint TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - Eldan and Li 2023 Preprint Textbooks Are All You Need - Gunasekar et al. 2023 Preprint Read the abstract, introduction, and section 2.1 Investigating Data Contamination for Pre-training Language Models - Jian et al. 2024 Preprint Read the abstract and the introduction Language Models May Verbatim Complete Text They Were Not Explicitly Trained On - Preprint Large Language Model Unlearning - Yao et al. 2024 NeurIPS Read the abstract and the introduction up to until 1.1 Who's Harry Potter? Approximate Unlearning for LLMs - Eldan and Rusinovich 2023 Preprint Read the abstract, introduction, and the start of section 2 (until 2.1) Position: Model Collapse Does Not Mean What You Think - Preprint Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models - Ahia et al. 2023 EMNLP Publication Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU - Koto et al. 2023 EMNLP Publication AI Art and its Impact on Artists - Jiang et al. 2023 AIES Publication Certifying and Removing Disparate Impact - Feldman et al. 2015 KDD Publication Learning Fair Representations - Zemel et al. 2013 ICML Publication Data-Centric AI for reliable and responsible AI: from theory to practice - van der Schaar, Seedat, and Guyon 2023 NeurIPS Tutorial
Week 4	10/14/25	Max	Needs for AI Safety Today: Beyond the Hype
	Lecture Slides + Recording unavaiable due to technical issues Readings (Required) “AI for Good” Isn't Good Enough: A Call for Human-Centered Design - Landay 2024 Google Doc Unsolved Problems in ML Safety - Hendrycks et al. 2022 Preprint Read abstract and introduction AI Safety on Whose Terms? - Lazar and Nelson 2023 Science Publication The TESCREAL Bundle: Eugenics and the Promise of Utopia Through Artificial General Intelligence - Gebru and Torres 2024 First Monday Publication Read abstract and introduction The Grey Hoodie Project: Big Tobacco, Big Tech, and the Threat on Academic Integrity - Abdalla and Abdalla 2021 AIES Publication How Culture Shapes What People Want From AI - Ge et al. 2024 CHI Publication Read abstract and introduction Optional Readings (Not Required) Understanding artificial intelligence ethics and safety - Leslie 2019 The Alan Touring Institute Read sections “Why AI Ethics” and “FAST Track Principles” On the Opportunities and Risks of Foundation Models - Bommasani et al. 2022 Stanford CRFM preprint Read the abstract and section 1 until 1.2 The Political Preferences of LLMs - Rozado 2024 PLOS ONE Publication Towards Measuring the Representation of Subjective Global Opinions in Language Models - Durmus et al. 2024 CoLM Publication The Values Encoded in Machine Learning Research - Birhane et al. 2022 FAccT Publication The WEIRD Science of Culture, Values, and Behavior - Armstrong 2018 APS Publication Fairwashing: The Risk of Rationalization - Aivodji et al. 2019 ICML Publication Reality of AI and Biorisk - Peppin et al. 2024 Preprint Can large language models democratize access to dual-use biotechnology? - Soice et al. 2023 Preprint Read abstract, introduction, and results An Autonomy-Based Classification (Policy Brief) - Soder et al. 2025 Interface Policy Brief Our contribution to a global environmental standard for AI - Mistral AI 2025 Blog post Twitter's Algorithm: Amplifying Anger, Animosity, and Affective Polarization - Milli et al. 2023 Preprint Read the abstract and section 1 (with results) Manifestations of Xenophobia in AI Systems - Tomasev et al. 2022 Deepmind preprint Read the abstract, introduction, and section 2
	10/16/25	Martin Castillo-Quintana (UChicago)	[Guest] Social-Choice Theory for Engineers
	Lecture Slides (See Ed) + Recording (See Ed) Readings (Required) None
Week 5	10/21/25	Declan Grabb & Andrea Vallone (OpenAI)	[Guest] AI and Mental Health
	Lecture Slides (Unavaiable) + Recording (Unavaiable) Readings (Required) None
	10/23/25	Max	Red Teaming, Adversarial Vulnerabilities, and Multi-Agent Systems
	Lecture Slides + Recording Readings (Required) Explaining and Harnessing Adversarial Examples - Goodfellow et al. 2015 ICLR publication Read the abstract, introduction, and section 3 Are aligned neural networks adversarially aligned? - Carlini et al. 2023 NeurIPS publication Read the abstract, introduction, and section 3 Jailbroken: How Does LLM Safety Training Fail? - Wei et al. 2023 NeurIPS publication Read the abstract, introduction (without 1.1), and section 3 3 takeaways from red teaming 100 generative AI products - Bullwinkel et al. 2025 Microsoft blog post Levels of Autonomy for AI Agents - Feng et al. 2025 Preprint Read the abstract, and sections 1 + 2 Multi-Agent Risks from Advanced AI - Hammond et al. 2025 Technical Report Read the abstract and executive summary Optional Readings (Not Required) Red Teaming Language Models with Language Models - Perez et al. 2022 ACL publication Read the abstract and introduction Poisoning Language Models During Instruction Tuning - Wan et al. 2023 ICML publication Read the abstract and the introduction Universal and Transferable Adversarial Attacks on Aligned Language Models - Zou et al. 2023 Preprint + Blog post Read the abstract, introduction, and the start of section 2 (until 2.1) Poisoning Web-Scale Training Datasets is Practical - Carlini et al. 2023 Preprint Read the abstract, introduction, and section 5.1 + 5.2 Scalable Extraction of Training Data from (Production) Language Models - Nasr et al. 2023 Preprint Read the abstract and the introduction Natural Backdoor Datasets - Wenger et al. 2022 NeurIPS publication Read the abstract, introduction, and figure 1 Universal Jailbreak Backdoors from Poisoned Human Feedback - Rando and Tramer 2024 ICLR publication Read the abstract and the introduction Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data - Baumgaertner et al. 2024 CoLM publication Effective Prompt Extraction from Language Models - Zhang1 et al. 2024 CoLM publication Escalation Risks from Language Models in Military and Diplomatic Decision-Making - Rivera et al. 2024 FAccT publication Read the abstract and introduction Defending Against Unforeseen Failure Modes with Latent Adversarial Training - Casper et al. 2024 Preprint
Week 6	10/28/25	Max	Interpretability
	Lecture Slides + Recording Readings (Required) Vision - Marr. 2014 Book Read full section 1.2 "Understanding Complex Information-Processing Systems" SoK: Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks - Rauker, Ho, Casper et al. 2023 SaTML publication Read the abstract and introduction Mechanistic Interpretability for AI Safety - A Review - Bereska and Gavves. 2024 TMLR publication Read the abstract, introduction (section 1), and section 2 Zoom In: An Introduction to Circuits - Olah et al. 2020 Blog post Optional Readings (Not Required) Stochastic Backpropagation and Approximate Inference in Deep Generative Models - Rezende et al. 2014 ICML publication Read the abstract and the introduction Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization - Selvaraju et al. 2017 ICCV publication Towards robust interpretability with self-explaining neural networks - Melis and Jaakkola et al. 2018 NeurIPS publication Are neural nets modular? inspecting functional modularity through differentiable weight masks - Csordas et al. 2020 ICLR publication Compositional explanations of neurons - Mu and Andreas. 2020 NeurIPS publication Exemplary natural images explain cnn activations better than state-of- the-art feature visualization - Borowski et al. 2020 ICLR publication Leveraging sparse linear layers for debuggable deep networks - Wong et al. 2021 ICML publication Knowledge neurons in pretrained transformers - Dai et al. 2021 ACL publication Language models can explain neurons in language models - Bills et al. 2023 OpenAI technical report Training Language Models to Explain Their Own Computations - Li et al. 2025 Arxiv preprint Steering Language Models With Activation Engineering - Turner et al. 2023 Arxiv preprint Locating and Editing Factual Associations in GPT - Meng et al. 2022 NeurIPS publication + Blog post Read the abstract and the introduction Discovering Latent Knowledge in Language Models Without Supervision - Burns et al. 2023 ICLR publication Read the abstract and the introduction But is it really in Rome? An investigation of the ROME model editing technique - Thibodeau 2022 Online forum post Read until “Motivation for this post” What Discovering Latent Knowledge Did and Did Not Find - Roger 2023 Online forum post Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models - Hase et al. 2023 NeurIPS 2023 paper Mechanistic? - Saphra and Wiegreffe 2024 Workshop paper (BlackBoxNLP 2024) Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting - Turpin et al. 2023 NeurIPS 2023 paper Measuring Faithfulness in Chain-of-Thought Reasoning - Lanham et al. 2023 Arxiv preprint Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - Bricken et al. 2023 Anthropic blog post Read until “Why not use architectural approaches” Sparse autoencoders find highly interpretable features in language models - Cunningham et al. 2024 ICLR publication Read the abstract and introduction Language Models Represent Space and Time - Gournee and Tegmark. 2024 ICLR publication Read the abstract and introduction Not all language model features are linear - Engels et al. 2024 ICLR publication Read the abstract and introduction CLIP: Connecting text and images - OpenAI 2022 Blog post Read until (including) Figure 2 Acquisition of chess knowledge in alphazero - McGrath et al. 2023 PNAS publication Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero - Schut et al. 2023 Preprint Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models - Marks et al. 2024 Preprint Scaling and evaluating sparse autoencoders - Gao et al. 2024 OpenAI Preprint Activation addition: Steering language models without optimization - Turner et al. 2023 Preprint Towards axiomatic, hierarchical, and symbolic explanation for deep models - Ren et al. 2021 Preprint Mass-Producing Failures of Multimodal Systems with Language Models - Tong et al. 2023 NeurIPS publication Read the abstract, introduction, and section 3 until 3.2 Interpreting CLIP's Image Representation via Text-Based Decomposition - Gandelsman et al. 2024 ICLR publication Read the abstract and introduction
	10/30/25	Liza & Max	Formal Methods / Safe Human-AI Interaction
	Lecture Slides (Liza) + Slides (Max) + Recording (first 30s are lost) Readings (Required) Role play with large language models - Shanahan et al. 2023 Nature publication Read until “The nature of the simulator” ELEPHANT: Measuring and understanding social sycophancy in LLMs - Cheng et al. 2025 Preprint Read the abstract, introduction (section 1), and section 2 What is Deep Neural Network Verification and Why is it Important? - Chase. 2021 log post Clover: Closed-Loop Verifiable Code Generation - Sun et al. 2024 Preprint Read the abstract and the introduction Optional Readings (Not Required) Trustworthy AI - Wing 2020 Preprint Read abstract and introduction Algorithms for Verifying Deep Neural Networks - Liu et al. 2020 Preprint/Text book Read the abstract and the introduction VeriX: Towards Verified Explainability of Deep Neural Networks - Wu et al. 20230 NeurIPS publication Read the abstract and the introduction Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers - Si et al. 2025 ICLR 2025 paper AI for Scientific Discovery is a Social Problem - Channing and Ghosh 2025 Arxiv preprint Can LLM-Generated Misinformation Be Detected? - Chen and Shu 2024 ICLR publication Read the abstract, introduction, and section 2 Large Language Models respond to Influence like Humans - Griffin et al. 2023 ACL publication Read abstract, introduction, section 2, section 3 up until 3.1, section 5 Can AI language models replace human participants? - Dillion et al. 2023 Trends in Cognitive Sciences publication AI language models cannot replace human research participants - Harding et al. 2023 Springer AI and Society Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies - Aher et al. 2023 ICML publication Read the abstract, introduction until 1.1, and section 2 Using large language models in psychology - Demszky 2023 Nature publication Read the abstract and introduction How to Grow a Mind: Statistics, Structure, and Abstraction - Tenenbaum et al. 2023 Science publication Human vs. Machine: Behavioral Differences Between Expert Humans and Language Models in Wargame Simulations - Lamparth et al. 2023 AIES publication Read the abstract and the introduction Do Personality Tests Generalize to Large Language Models? - Dorner et al. 2023 NeurIPS Workshop Read the abstract, introduction, and section 4
Week 7	11/04/25	Max	[Recorded/Optional Attendance] Electric Sheep: What Is Intelligence and Does It Want?
	Lecture Slides + Recording Readings (Required) Optimal Policies Tend to Seek Power - Turner et al. 2019 NeurIPS publication Read the abstract, introduction, and discussion Orthogonality Thesis Online forum post Superintelligence, Chapter 7: The superintelligent will - Bostrom 2014 Book chapter The Bitter Lesson - Sutton 2019 Online forum post On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 - Bender et al. 2021 FAccT publication Read the abstract, introduction, and section 6 Language Models as Agent Models - Andreas. 2022 EMNLP publication Read the abstract and the introduction Optional Readings (Not Required) Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals - Shah et al. 2022 Deepmind preprint Read the abstract, introduction, section 2, and section 3 Deceptive Alignment - Hubinger et al. 2019 Online forum post Read only the introduction Unsocial Intelligence: An Investigation of the Assumptions of AGI Discourse - Blili-Hamelin et al. 2024 AIES publication
	11/06/25	Nicholas Carlini (Anthropic)	[Guest] TBD
	Lecture Recording Readings (Required) None Optional Readings (Not Required) TBD
Week 8	11/11/25	Max / Naomi	Attributing Model Behavior at Scale: Evaluating AI Systems / AI Governance
	Lecture Slides (Max) + Slides (Naomi) + Recording Readings (Required) On the Limitations of Compute Thresholds as a Governance Strategy - Hooker 2024 Preprint/Essay Read the abstract and section 1 Pitfalls of Evidence-Based AI Policy - Casper et al. 2025 ICLR publication Read the abstract, section 2, and section 3 Measurement to Meaning: A Validity-Centered Framework for AI Evaluation - Salaudeen and Reuel et al. 2025 Preprint Read the abstract and the introduction A comprehensive review of Artificial Intelligence regulation: Weighing ethical principles and innovation - Cajueiro and Celestino. 2025 Journal of Economy and Technology publication Read the abstract, introduction, and sections 5 + 6 Optional Readings (Not Required) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark - Pan et al. 2023 ICML publication Read the abstract, introduction, and section 2 until 2.3 Are Emergent Abilities in Large Language Models just In-Context Learning? - Lu et al. 2023 ACL publication Red-Teaming for Generative AI: Silver Bullet or Security Theater? - Feffer et al. 2024 Preprint Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive? - Schaeffer et al. 2024 Preprint How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions - Pacchiardi et al. 2024 ICLR publication Read the abstract, introduction, and section 2 The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" - Berglund et al. 2024 ICLR publication Read until the end of section 1 Emergent Abilities of Large Language Models - Wei et al. 2022 TMLR publication Read until the end of section 3 Are Emergent Abilities of Large Language Models a Mirage? - Schaeffer et al. 2023 NeurIPS publication Read the abstract, introduction, and section 2 STaR: Bootstrapping Reasoning With Reasoning - Zelikman et al. 2022 NeurIPS publication Gemma 2: Improving Open Language Models at a Practical Size - Gemma Team 2024 Technical report Large Language Monkeys: Scaling Inference Compute with Repeated Sampling - Brown et al. 2024 Preprint Open Problems in Technical AI Governance - Reuel and Buknall et al. 2024 Preprint Read the abstract, introduction, and appendix A "Policy Brief" Position: Technical Research and Talent is Needed for Effective AI Governance - Reuel et al. 2024 ICML Publication Read the abstract, introduction, and section 2 Black-Box Access is Insufficient for Rigorous AI Audits - Casper et al. 2024 FAccT publication Read the abstract and the introduction A Watermark for Large Language Models - Kirchenbauer et al. 2023 ICML Publication Read the abstract, introduction, and section 2 Tools for Verifying Neural Models' Training Data - Choi 2023 NeurIPS publication Read the abstract and the introduction Securing Artificial Intelligence Model Weights - Nevo et al. 2023 Rand Cooperation Report Read until end of page 5 What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring - Shavit 2023 Preprint Read the abstract, introduction, and section 2 until 2.2
	11/13/25	Daniel Johnson (Transluce)	[Guest] Towards A Science of Model Behavior
	Lecture Recording Readings (Required) None Optional Readings (Not Required) TBD
Week 9	11/18/25	Max	Scalable Oversight: How to Supervise Advanced AI?
	Lecture Slides + Recording Readings (Required) AI Safety via Debate - Irving & Amodei 2018 OpenAI preprint + Blog post Read the abstract, introduction and section 2 until 2.2 Weak-to-strong generalization - Burns et al. 2023 OpenAI preprint + Blog post Read the blog post, and the abstract and introduction of the paper Managing extreme AI risks amid rapid progress - Bengio et al. 2024 Science publication The Alignment Problem from a Deep Learning Perspective: A Position Paper - Ngo et al. 2024 ICLR publication Read until the end of section 4.1 Optional Readings (Not Required) Two Types of AI Existential Risk: Decisive and Accumulative - Kasirzadeh. 2023 Preprint Biological Anchors: A Trick That Might Or Might Not Work - Alexander 2022 Blog post Part 1 and Part 2 Measuring Progress on Scalable Oversight for Large Language Models - Bowman et al. 2022 Anthropic preprint Read the abstract, introduction, and section 2 An Overview of Catastrophic AI Risks - Hendrycks et al. 2023 Preprint Position: Levels of AGI for Operationalizing Progress on the Path to AGI - Morris et al. 2024 ICML publication ELK Summary - Hobbhahn 2022 Online forum post (That summarizes this report) Read only the section “Toy Scenario: The SmartVault” Scalable agent alignment via reward modeling - Leike 2018 Deepmind blog post
	11/20/25	Deb Raji (UC Berkeley)	[Guest] TBD
	Lecture Recording (TBD) Readings (Required) None Optional Readings (Not Required) TBD
Week 10	11/25/25	[No class]	Thanksgiving
	11/27/25	[No class]	Thanksgiving
Week 11	12/02/25	Liza / Max (will leave after 30 min)	[Optional] Final Project Co-Working and Feedback
	12/04/25	Naomi / Max	[Optional] Final Project Co-Working and Feedback