Mechanistic Interpretability is an emerging field that seeks to understand the internal reasoning processes of trained neural networks and gain insight into how and why they produce the outputs that they do. AI researchers currently have very little understanding of what is happening inside state-of-the-art models.
Current frontier models are extremely large and extremely complicated. They might contain billions, or even trillions, of parameters spread across more than a hundred layers. Though we control the data fed into a network and can observe its outputs, what happens in the intervening layers remains largely opaque. This is the ‘black box’ that mechanistic interpretability aims to see inside.
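To make the ‘black box’ concrete, here is a minimal sketch in PyTorch of the kind of access interpretability work starts from: a toy multi-layer model (the layer sizes and layer choice are arbitrary and purely illustrative) whose intermediate activations we capture with a forward hook, rather than observing only its inputs and outputs.

```python
# A minimal sketch of peering inside the "black box": register a forward
# hook on one intermediate layer of a toy PyTorch model to capture its
# activations. All sizes and layer choices here are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),  # layer 0
    nn.ReLU(),
    nn.Linear(64, 32),   # layer 2: the intermediate layer we want to inspect
    nn.ReLU(),
    nn.Linear(32, 10),   # output layer
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # stash the layer's output
    return hook

# Attach the hook to the intermediate layer.
model[2].register_forward_hook(save_activation("layer2"))

x = torch.randn(1, 128)          # a dummy input
logits = model(x)                # an ordinary forward pass...
print(captured["layer2"].shape)  # ...but now we also see inside: torch.Size([1, 32])
```

Real interpretability work applies the same idea to models with billions of parameters, where the difficulty is not capturing the activations but making sense of them.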
As AI systems become more powerful and the incentives to integrate them into various decision-making processes grow, we will likely cede more and more authority to them. Significant progress in interpreting neural networks would afford us insight into the rationale behind decisions that impact large numbers of people. This could allow us to identify bugs, flaws and systematic biases (if a model is making hiring decisions based on race or gender, for example).
Progress in mechanistic interpretability is also one path to detecting unaligned behavior in AI models. If we can only assess a model’s outputs, but not the processes that give rise to them, we may not be able to tell whether a model is genuinely acting in the best interests of its user.
Being able to identify such behavior from a model’s internals, rather than inferring it from outputs alone, would be a significant step forward.
In the most optimistic case, mechanistic interpretability could not only help us understand the processes by which AI models produce outputs, but also let us intervene in those processes.
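As a rough illustration of what ‘intervening’ in an internal process can look like, here is a sketch that continues the toy PyTorch example above: a forward hook that zeroes out one hidden unit (the choice of unit is arbitrary), so we can compare the model’s output with and without that part of its computation. This is only a schematic stand-in for the causal-intervention techniques used in practice.

```python
# Continuing the toy model above: intervene in the computation rather than
# just observe it, by silencing one hidden unit and comparing the outputs.
def ablate_unit(unit_index):
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, unit_index] = 0.0   # zero out a single hidden unit
        return patched                 # a returned value replaces the layer's output
    return hook

handle = model[2].register_forward_hook(ablate_unit(7))  # unit 7 is arbitrary
logits_ablated = model(x)
handle.remove()

# How much did removing this one unit change the model's output?
print((logits - logits_ablated).abs().max())
```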
In the paper ‘Zoom In: An Introduction to Circuits’, researchers at OpenAI set forth a model of neural networks as composed of ‘features’, which connect to form ‘circuits’.
As information moves through the layers of a neural network, the properties it represents become progressively more complex. This build-up is driven by the interaction between features and circuits: features in one layer are connected by the network’s weights to form circuits, and those circuits compute more complex features in the next layer. It is important to note that circuits are not simply more complex features. Instead, think of a feature as something the model knows, and a circuit as the mechanism by which that knowledge is computed from simpler features.
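As a loose illustration of the distinction, here is a sketch that continues the toy PyTorch model from above: it takes one unit in a later layer and ranks which units in the previous layer excite or inhibit it most strongly. In the circuits framing, the earlier units play the role of input features, the later unit is the feature they combine into, and the connecting weights are a crude stand-in for the circuit. The layer and unit indices are arbitrary.

```python
# Continuing the toy model above: read a rough "circuit" off the weights by
# asking which earlier-layer units feed a chosen later-layer unit.
target_unit = 5                      # an arbitrary unit in model[2]
w = model[2].weight[target_unit]     # its incoming weights from the 64 earlier units

strongest = torch.topk(w, k=5)       # most excitatory connections
weakest = torch.topk(-w, k=5)        # most inhibitory connections
print("most excitatory inputs:", strongest.indices.tolist())
print("most inhibitory inputs:", weakest.indices.tolist())
```

In a trained vision or language model, the analogous analysis is what lets researchers say things like “this curve-detecting feature is built from these edge-detecting features”.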
The paper further sets forth the universality hypothesis: that different neural networks tend to converge on similar features and circuits. Validating the universality hypothesis would be very promising for mechanistic interpretability, because if large models have many features and circuits in common, our understanding of one would transfer readily to others.
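One way such a claim can be probed is by comparing the features of independently trained networks. The sketch below correlates the hidden-unit activations of two toy PyTorch models on the same inputs and looks for well-matched pairs of units; the models here are untrained and purely illustrative, so it demonstrates the measurement rather than the hypothesis itself.

```python
# A minimal sketch of comparing features across two separately initialized
# networks: correlate their hidden-unit activations on shared inputs and,
# for each unit in one model, find its best-matching unit in the other.
import torch
import torch.nn as nn

def make_model(seed):
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 10))

model_a, model_b = make_model(0), make_model(1)

x = torch.randn(1000, 128)                    # shared probe inputs
acts_a = model_a[1](model_a[0](x)).detach()   # hidden activations of model A
acts_b = model_b[1](model_b[0](x)).detach()   # hidden activations of model B

# Correlation between every hidden unit of A and every hidden unit of B.
a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
corr = a.T @ b / len(x)                       # shape (32, 32)

# For each unit in A, how well does its closest counterpart in B match?
best_match = corr.abs().max(dim=1).values
print(best_match.mean())                      # a rough similarity score
```

Applied to trained models, high similarity scores of this kind would be evidence for universality; low scores would suggest the networks carved up the problem differently.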