Mechanistic Interpretability is an emerging field that seeks to understand the internal reasoning processes of trained neural networks and gain insight into how and why they produce the outputs that they do. AI researchers currently have very little understanding of what is happening inside state-of-the-art models.
Current frontier models are extremely large and extremely complicated. They might contain billions, or even trillions, of parameters spread across more than a hundred layers. Though we control the data fed into a network and can observe its outputs, what happens in the intervening layers remains largely opaque. This is the ‘black box’ that mechanistic interpretability aims to see inside.
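To make the ‘black box’ concrete, here is a minimal sketch in PyTorch of the kind of access interpretability work starts from: a toy multi-layer model (the layer sizes and layer choice are arbitrary and purely illustrative) whose intermediate activations we capture with a forward hook, rather than observing only its inputs and outputs.

```python
# A minimal sketch of peering inside the "black box": register a forward
# hook on one intermediate layer of a toy PyTorch model to capture its
# activations. All sizes and layer choices here are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),  # layer 0
    nn.ReLU(),
    nn.Linear(64, 32),   # layer 2: the intermediate layer we want to inspect
    nn.ReLU(),
    nn.Linear(32, 10),   # output layer
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # stash the layer's output
    return hook

# Attach the hook to the intermediate layer.
model[2].register_forward_hook(save_activation("layer2"))

x = torch.randn(1, 128)          # a dummy input
logits = model(x)                # an ordinary forward pass...
print(captured["layer2"].shape)  # ...but now we also see inside: torch.Size([1, 32])
```

Real interpretability work applies the same idea to models with billions of parameters, where the difficulty is not capturing the activations but making sense of them.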
As AI systems become more powerful and the incentives to integrate them into various decision-making processes grow, we will likely cede more and more authority to them. Significant progress in interpreting neural networks would afford us insight into the rationale behind decisions that impact large numbers of people. This could allow us to identify bugs, flaws and systematic biases (if a model is making hiring decisions based on race or gender, for example).
Progress in mechanistic interpretability is also one path to detecting unaligned behavior in AI models. If we can only assess a model’s outputs, but not the processes that give rise to them, we may not be able to tell whether a model is genuinely acting in the best interests of its user.
Being able to identify such behavior from a model’s internals, rather than inferring it from outputs alone, would be a significant step forward.
In the most optimistic case, mechanistic interpretability could not only help us understand the processes by which AI models produce outputs, but also let us intervene in those processes.
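As a rough illustration of what ‘intervening’ in an internal process can look like, here is a sketch that continues the toy PyTorch example above: a forward hook that zeroes out one hidden unit (the choice of unit is arbitrary), so we can compare the model’s output with and without that part of its computation. This is only a schematic stand-in for the causal-intervention techniques used in practice.

```python
# Continuing the toy model above: intervene in the computation rather than
# just observe it, by silencing one hidden unit and comparing the outputs.
def ablate_unit(unit_index):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, unit_index] = 0.0   # zero out a single hidden unit
        return patched                 # a returned value replaces the layer's output
    return hook

handle = model[2].register_forward_hook(ablate_unit(7))  # unit 7 is arbitrary
logits_ablated = model(x)
handle.remove()

# How much did removing this one unit change the model's output?
print((logits - logits_ablated).abs().max())
```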
In the paper ‘Zoom In: An Introduction to Circuits’, researchers at OpenAI set forth a model of neural networks as composed of ‘features’, which connect to form ‘circuits’.
As information moves through the layers of a neural network, the properties it represents become progressively more complex. This build-up is driven by the interaction between features and circuits: features in one layer are connected by the network’s weights to form circuits, and those circuits compute more complex features in the next layer. It is important to note that circuits are not simply more complex features. Instead, think of a feature as something the model knows, and a circuit as the mechanism by which that knowledge is computed from simpler features.
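As a loose illustration of the distinction, here is a sketch that continues the toy PyTorch model from above: it takes one unit in a later layer and ranks which units in the previous layer excite or inhibit it most strongly. In the circuits framing, the earlier units play the role of input features, the later unit is the feature they combine into, and the connecting weights are a crude stand-in for the circuit. The layer and unit indices are arbitrary.

```python
# Continuing the toy model above: read a rough "circuit" off the weights by
# asking which earlier-layer units feed a chosen later-layer unit.
target_unit = 5                      # an arbitrary unit in model[2]
w = model[2].weight[target_unit]     # its incoming weights from the 64 earlier units

strongest = torch.topk(w, k=5)       # most excitatory connections
weakest = torch.topk(-w, k=5)        # most inhibitory connections
print("most excitatory inputs:", strongest.indices.tolist())
print("most inhibitory inputs:", weakest.indices.tolist())
```

In a trained vision or language model, the analogous analysis is what lets researchers say things like “this curve-detecting feature is built from these edge-detecting features”.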
The paper further sets forth the universality hypothesis: that different neural networks tend to converge on similar features and circuits. Validating the universality hypothesis would be very promising for mechanistic interpretability, because if large models have many features and circuits in common, our understanding of one would transfer readily to others.
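One way such a claim can be probed is by comparing the features of independently trained networks. The sketch below correlates the hidden-unit activations of two toy PyTorch models on the same inputs and looks for well-matched pairs of units; the models here are untrained and purely illustrative, so it demonstrates the measurement rather than the hypothesis itself.

```python
# A minimal sketch of comparing features across two separately initialized
# networks: correlate their hidden-unit activations on shared inputs and,
# for each unit in one model, find its best-matching unit in the other.
import torch
import torch.nn as nn

def make_model(seed):
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

model_a, model_b = make_model(0), make_model(1)

x = torch.randn(1000, 128)                    # shared probe inputs
acts_a = model_a[1](model_a[0](x)).detach()   # hidden activations of model A
acts_b = model_b[1](model_b[0](x)).detach()   # hidden activations of model B

# Correlation between every hidden unit of A and every hidden unit of B.
a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
corr = a.T @ b / len(x)                       # shape (32, 32)

# For each unit in A, how well does its closest counterpart in B match?
best_match = corr.abs().max(dim=1).values
print(best_match.mean())                      # a rough similarity score
```

Applied to trained models, high similarity scores of this kind would be evidence for universality; low scores would suggest the networks carved up the problem differently.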