Developmental Interpretability

We covered mechanistic interpretability, which is an approach that ‘zooms in’ to models to make sense of their learned representations and weights through methods like feature or circuit analysis.

Developmental interpretability instead studies how the structure of models changes as they are trained. Rather than features and circuits, here the object of study is phases and phase transitions during the training process.

Agent foundations

Most of this course has explored fairly empirical research agendas that have been targeting the kinds of AI systems we have today: neural network architectures trained with gradient descent. However, it’s unclear whether we’ll continue to build future AI systems with similar technology.

Agent foundations research is primarily theoretical research that aims to give us the building blocks to solve key problems in the field, particularly around how powerful agentic AI systems might behave.

Shard theory

Shard theory is a research approach that rejects the notion agents are following specified goals, and instead suggests that reinforcement learning agents are better understood as being driven by many ‘shards’. Each shard then influences the model’s decision process in certain contexts.

This takes inspiration from how humans appear to learn. Most people do not optimize some specific objective function all the time but behave differently in different contexts.

Viewed through this lens, shards appear to result in models choosing actions over others in different situations, which indicates the model’s values.

Cooperative AI safety research

Cooperative AI safety research (also called multi-agent safety research) explores how we might avoid or mitigate the impacts of negative interactions. It also explores how we might encourage agents to coordinate in ways that result in positive outcomes.

Model organisms of misalignment

We are already seeing many AI risks starting to materialize. However, it’s still unclear how anticipated risks will emerge, particularly risks stemming from deceptive misalignment. This is where models appear to be aligned (possibly during training, or when they’re being carefully overseen by humans), but are misaligned and will pursue a different goal when deployed or not carefully supervised.

A better understanding of how and why deceptive misalignment occurs would help inform how we might prepare for these risks. It could also help better inform policy decisions about AI safety, as well as decisions about what AI safety research to prioritize.

Other training approaches

These approaches are fairly general approaches to training AI systems. However, they are potentially relevant to AI safety work.