Goodfire and Anthropic have jointly organized a meet-up of academic and industry researchers called “Interpretability: the next 5 years”, to be held later this month. Participants have been invited to contribute short discussion documents. This is a draft of my document, which I am posting publicly to try to stimulate discussion in the broader community.
It’s an awkward time for interpretability research in AI. On the one hand, the pace of technical innovation has been incredible, and the number of success stories is rising fast. On the other hand, I think all of us in the field are accustomed to responses to our work that range from skepticism to dismissal.
The dismissals still surprise me. I assure you I am not that old, but I have still seen many areas shift from being perceived as obscure, irrelevant dead-ends to the thing defining the future. The most prominent example is neural network research itself – once mocked as a dead-end, now the lifeblood of the field. The most extreme and sudden shift is the task of natural language generation; this used to be a niche topic, and now it practically defines AI itself. All the tasks grouped under the heading of “Natural Language Understanding” similarly shifted from the periphery to the mainstream over the last 10 years. The most recent sea change is for reinforcement learning, which went from deeply unfashionable to the hottest thing seemingly overnight in 2022.
Having lived through all these transitions, I am really shy about dismissing anything out of hand. I worry about the damage I could do by steering people away from what turn out to be significant topics. It seems much wiser to proceed claim by claim than to brush aside entire areas.
So, outright dismissal of interpretability research seems ill-considered. Skepticism is healthy, though, and it will pay to understand the driving forces behind this skepticism. Here I will try to articulate the skeptical positions I often encounter and provide my current responses to them.
Interpretability cannot be achieved in any meaningful sense
This claim usually stems from the position that there is an inherent conflict between quality and interpretability, for neural networks in particular or as a fact about learning and complexity in general. It follows from this assumption that our best models will be uninterpretable. A similar position holds that, as systems become more complex, their faithful explanations lawfully become more complex as well, at such a rate that the explanations become useless to us. These positions (and some persuasive replies) are prominent in this famous debate.
In response to this skeptical take, I would observe that neural networks are closed, deterministic systems that we designed and built ourselves. On the face of it, the project of understanding them should be a good deal easier than the project of understanding, say, biological systems, and the biologists seem undaunted by their project. AI researchers should be as ambitious as biologists.
Overall, I think this skeptical position is a call to action. It might be correct that our current approaches to explanation won’t deliver the insights we need, but I am optimistic that we can find approaches that do.
Analysis is overrated
This position often involves first characterizing AI as an engineering discipline rather than a scientific one per se, and then arguing that engineering fields are guided by concrete results rather than by anything as nebulous as “understanding”. A more moderate version of the argument is that, at this early stage in the field, it is more cost-effective (for every notion of cost) to gather evidence in a messy fashion than it is to pause to try to derive deeper theories.
This is a “wait and see” position. Perhaps it is reasonable to be cautious about over-investing in any particular aspect of interpretability research, when one could instead just push forward with trying to improve models. On the other hand, if this were my position, I would be hoping that someone out there was willing to take the contrary position. I think we urgently need scientific theories in AI. Saying we don’t seems akin to epidemiologists saying they don’t need genetics or structural engineers saying they don’t need models of wind shear. One might reasonably be skeptical of particular theories when they are new and untested, but being skeptical of theories in general is not a good long-term bet. Deep causal theories always lead to transformative new hypotheses, techniques, and products, in science and in engineering.
Here again, this skeptical take is a call to action: we should find the truly transformative theories. I think it will also help to analyze problems that really matter, rather than choosing only toy problems meant to illustrate how the methods work in the hope that someone else will use them for something more impactful.
Interpretability is merely analysis
Some people hold that interpretability is inherently about analysis and, by definition, cannot extend beyond analysis. This is frustrating because the best interpretability techniques tend to be ones that can be used to directly improve models. If the skeptic says that the improvement step falls outside of “interpretability”, then they have simply defined the field too narrowly. This can have consequences (perceptions of the field, funding, etc.), but it doesn’t threaten the intellectual project. So, overall, I think we can set this take aside.
Interpretability is not leading to improvements
A more nuanced version of the previous skeptical claim is that interpretability has not led to major innovations so far. Progress in AI has been astounding, and it encompasses everything from data selection to network architectures to late-stage fine-tuning to efficient inference. However, interpretability work does not appear to be prominent in these achievements.
I think this perception is unfair to interpretability research.
First, there are already many clear cases in which interpretability has led to improvements. Back in 2013, Zeiler and Fergus developed a novel attribution method and used it to guide the design of a state-of-the-art ImageNet classifier. More recently, the discovery of induction heads led to better state space models (e.g., BASED, H3), causal abstraction improved internal LLM steering, and the discovery of register tokens improved vision Transformers. These are just a few examples; one could create a very long list.
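To make the first of these concrete, here is a minimal sketch of occlusion-style attribution in the spirit of Zeiler and Fergus (a simplification, not their exact method; the ResNet model and random image are just stand-ins). The idea is to slide a gray patch across the input and record how much the target class probability drops at each location.

```python
import torch
import torchvision.models as models

def occlusion_map(model, image, target_class, patch=32, stride=16):
    """Occlusion-style attribution: mask patches of the input and record
    how much the target class probability drops at each location."""
    model.eval()
    _, H, W = image.shape
    with torch.no_grad():
        base = torch.softmax(model(image.unsqueeze(0)), dim=-1)[0, target_class]
    heat = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
    for i, y in enumerate(range(0, H - patch + 1, stride)):
        for j, x in enumerate(range(0, W - patch + 1, stride)):
            occluded = image.clone()
            occluded[:, y:y + patch, x:x + patch] = 0.5  # gray patch
            with torch.no_grad():
                p = torch.softmax(model(occluded.unsqueeze(0)), dim=-1)[0, target_class]
            heat[i, j] = base - p  # large drop = important region
    return heat

# Example usage (the image tensor is a placeholder for a preprocessed input):
model = models.resnet18(weights="IMAGENET1K_V1")
image = torch.rand(3, 224, 224)
heatmap = occlusion_map(model, image, target_class=207)
```

Regions whose occlusion causes a large probability drop are the ones the classifier is relying on; this is the kind of diagnostic signal that can then feed back into architecture and training decisions.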
Second, all major innovations derive from extensive analysis of how networks learn and behave. For example, the field-wide, deep, and sustained study of attention mechanisms (outlier values, attention sinks) has led to numerous fundamental improvements. These analyses may be free-form and ad hoc, drawing on intuition and heuristic experimentation rather than the highly structured sort of analysis that is most prominent in interpretability research, but they are still interpretability work.
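As one small example of the kind of free-form analysis I have in mind, here is a sketch that measures the “attention sink” effect: for each layer, how much attention mass the heads place on the first token. The choice of gpt2 is just a convenient stand-in; the same diagnostic applies to larger models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

text = "Interpretability research helps us understand how language models work."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Fraction of attention mass placed on position 0, averaged over queries.
    sink_mass = attn[0, :, :, 0].mean(dim=-1)  # one value per head
    print(f"layer {layer:2d}: mean mass on first token = {sink_mass.mean():.3f}")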
Overall, then, while one might be critical of the particular techniques developed so far by interpretability researchers, there seems to be very little room for saying that interpretability itself is not helpful.
For me, this skeptical take leads to a more particular call to action than the above: we should actively seek out methods that provide a clear path from analysis to improvement, perhaps drawing insight from the kind of analyses that have guided past innovations. This favors some approaches over others, and it runs the risk of narrowing the overall mission of the field, but I still like it as a research bet.
The Bitter Lesson says that interpretability won’t lead to lasting improvements
Combinations of “Analysis is overrated” and “Interpretability is not leading to improvements” sometimes come under the heading of Rich Sutton's Bitter Lesson, which instructs us that scaling simple systems always wins out over developing highly customized systems that are informed by human knowledge. To the extent that interpretability-informed proposals fall into the highly customized class, the argument goes, they will learn the Bitter Lesson eventually.
We should simply deny the premise that interpretability-informed proposals need to be of the highly customized variety. To take one example, consider how approaches to positional encodings in Transformers evolved. Researchers identified limitations of absolute positional encodings and, via careful study of the Transformer and its learned representations, identified more scalable and effective positional encoding schemes that helped enable the massively long context windows of current LLMs. Simply scaling absolute positional encodings was not going to suffice. This research was guided by interpretability in a broad sense, and I am hopeful that interpretability will be able to supercharge such developments in the future. No assumptions about knowledge encoding, no human priors – just actionable characterizations of where models are succeeding and failing.
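For a sense of what these schemes look like in code, here is a minimal sketch of rotary-style position encoding, one prominent member of the family of relative schemes behind long-context models. The shapes and function names are my own, not any particular library's.

```python
import torch

def apply_rotary(x, base=10000.0):
    """Rotary-style position encoding: rotate each pair of channels by an
    angle proportional to the token's position, so that query-key dot
    products depend only on relative position."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    # Per-pair rotation frequencies, decreasing geometrically across channels.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage for a single attention head: attention scores computed from the
# rotated queries and keys depend only on relative token positions.
q = torch.randn(8, 64)   # (seq_len, head_dim)
k = torch.randn(8, 64)
scores = apply_rotary(q) @ apply_rotary(k).T
```

The design point is that no hand-coded knowledge about language is injected here; the scheme simply removes a representational bottleneck that careful analysis had exposed.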
Interpretability is not helping with AI safety
I saved my hardest-hitting item for last. For years, the project of interpretability has been entwined with the project of making AI safer for humans. I am optimistic that interpretability can help achieve this goal in the long run, but I also feel that it has not really come close to achieving this goal yet. There have been gains in safety, but these seem mostly to stem from behavioral evaluations, heuristic adjustments to training regimes, and robust software system design.
The recent case of “extreme sycophancy” in GPT-4o is illustrative. It seems clear now that this was a genuine emergent problem. It was detected behaviorally, the root causes were found via free-form analysis, and the problem was fixed by improving post-training and system prompt design. As far as the public knows, no circuit was discovered, no particular weights or activations were held responsible, and no mechanistic analysis sounded a warning bell or informed the solutions. Sycophancy is not necessarily a safety problem, but it is safety-adjacent. Had cutting-edge interpretability played a meaningful role in addressing it, I would feel convinced that the marriage of safety and interpretability was going to work out. As it stands, I feel that I still don’t have a success story to point to here.
I find it instructive to think about what I would do if I had skin in the game. If my livelihood and reputation depended on assessing the risk posed by a given model in a given deployment scenario, what would I do? I would certainly invest in a massive amount of adversarial behavioral testing. I would form large, diverse teams of people and AIs tasked with exposing weaknesses, and I would be counting on these teams to provide my core risk assessments in the near term. This might sound like a bet against interpretability. However, I would also make a serious long-term investment in interpretability, perhaps with an organizational structure suggested by this paper. The focus would be on bolstering the ongoing behavioral testing via analyses of model activation patterns as well as identifying long-term strategies to use when the behavioral tests started to fall down – which they would inevitably do as the scenarios became higher-impact and, in turn, more adversarial.
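To give a flavor of the simplest version of “bolstering behavioral testing via analyses of model activation patterns”, here is a sketch of a linear probe trained to predict, from hidden-state activations, whether an episode would be flagged by the behavioral tests. The activations and labels below are random placeholders; real ones would come from the model under assessment and from behavioral test outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, `acts` would be activations collected while
# the model handled behavioral test prompts, and `labels` would mark which
# episodes exhibited the behavior of concern.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 768))     # (num_episodes, hidden_size)
labels = rng.integers(0, 2, size=2000)  # 1 = flagged by behavioral testing

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

# If even a simple linear probe on activations predicts the flagged behavior
# well, activation monitoring can extend behavioral tests to unlabeled
# traffic at much lower cost.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```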
General takeaways
Interpretability is one of my very favorite research areas of all time. As I said above, I think it is critical work for AI. In addition, I have not even touched on the doors it is opening for research in linguistics and cognitive science. So, overall, I basically want the field to proceed in the same sort of creative, chaotic way that it has been. I love that measured theoretical work is being done alongside wild exploratory experiments, that practical applications often get subsumed by the thrill of discovery, and that the field is focused on ambitious long-term goals.
On the other hand, it would be nice to encounter fewer skeptical reactions and recruit even more of my smartest colleagues into this area. To achieve that, I would suggest that the field (1) double down on really deep theoretical work to ensure rock solid foundations, (2) provide rich analyses of phenomena that people genuinely care about, (3) book some more concrete wins when it comes to improving models, and (4) diversify away from AI safety.
In the meantime, if you are an interpretability researcher and the skeptical takes are getting you down, consider that all of the following fields were once considered obscure, arcane, and irrelevant: theoretical physics (until the atomic bomb), number theory (until modern encryption), information theory (until digital communication), and neural networks (until now-ish).
Acknowledgements
My thanks to Zhengxuan Wu, Jing Huang, Atticus Geiger, and Aryaman Arora for very helpful feedback. All views are my own.