AI Safety
AI safety is a field that tries to reduce AI risks, decreasing the expected harm from AI systems. This is broad, and many subfields contribute to this. One way these can be divided up is:
- Alignment: making AI systems try to do what their creators intend them to do. In the future, we might delegate more power to extremely capable systems - if these systems are not doing what we intend them to do, the results could be catastrophic.
- Moral philosophy: determining what intentions would be “good” for AI systems to have. This is because if we can make AI systems do what we intend (alignment), we then need to know what we intend these systems to do.
- Competence: reducing how often AI systems make mistakes in achieving what they're trying to do.
- Governance: deterring harmful development or use of AI systems. Examples of governance failures: are reckless or malicious development of dangerous AI systems, or deploying systems without safeguards against serious misuse.
- Resilience: preparing other systems to cope with the negative impacts of AI. For example, if AI systems uplift some actors’ ability to launch cyber attacks, resilience measures might include fixing systems to prevent cyber attacks.
- Security: protecting AI systems themselves from attacks. Examples of security failures: are large model weights being inadvertently leaked, chatbots being tricked into sharing private information, and adversarial patches causing models to misclassify traffic signs.
It's also often very difficult to distinguish competence and alignment failures: it can be unclear what AI models are capable of, and whether failures represent:
- inability to do tasks correctly (competence); or
- that the model can do them but is trying to do something else (alignment)
AI Alignment
Making AI systems try to do what we intend them to do is a surprisingly difficult task. A common breakdown of this problem is into outer and inner alignment.

Outer Alignment
Specify goals for an AI system correctly. Also known as solving: reward misspecification, reward hacking, specification gaming, Goodharting.
This is the challenge of specifying the AI's reward function in a way that accurately reflects our intentions. Doing this incorrectly can lead AI systems to optimize for targets that are only a proxy for what we want.