AI Safety

AI safety is the field that works to reduce risks from AI, decreasing the expected harm from AI systems. The field is broad, and many subfields contribute to it. One way to divide them up is:

It is also often very difficult to distinguish competence failures from alignment failures: it can be unclear what AI models are capable of, and whether failures represent:

AI Alignment

Making AI systems try to do what we intend them to do is a surprisingly difficult task. A common breakdown of this problem is into outer and inner alignment.


Outer Alignment

Specifying goals for an AI system correctly. Solving this is also described as avoiding reward misspecification, reward hacking, specification gaming, or Goodharting.

This is the challenge of specifying the AI's reward function in a way that accurately reflects our intentions. Doing this incorrectly can lead AI systems to optimize for targets that are only a proxy for what we want.
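As a toy illustration (not drawn from any real system, and with made-up reward functions and numbers), the sketch below uses "clicks" as a proxy reward for user satisfaction. An optimizer that maximizes the proxy pours all of its effort into clickbait: the proxy looks great while the true objective collapses.

```python
# Minimal sketch of reward misspecification / Goodharting.
# All functions and constants here are illustrative assumptions.

def clicks(quality, clickbait):
    """Proxy reward the system is optimized for: both genuine quality
    and attention-grabbing tricks generate clicks."""
    return quality + 2.0 * clickbait

def satisfaction(quality, clickbait):
    """What we actually care about: quality helps, clickbait hurts."""
    return quality - 3.0 * clickbait

# The system has a fixed effort budget to split between the two.
budget = 10.0
best = max(
    ((q, budget - q) for q in (i * 0.1 for i in range(101))),
    key=lambda qc: clicks(*qc),
)

q, c = best
print(f"effort on quality:             {q:.1f}")
print(f"effort on clickbait:           {c:.1f}")
print(f"proxy reward (clicks):         {clicks(q, c):.1f}")
print(f"true objective (satisfaction): {satisfaction(q, c):.1f}")
# Maximizing the proxy pushes all effort into clickbait: clicks reach
# their maximum while actual satisfaction goes strongly negative.
```

The proxy and the true objective agree when clickbait is held low, which is why the proxy looks reasonable at first; the divergence only appears under optimization pressure.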