Decisions about whether and how to develop and deploy new AI systems are consequential. Deploying an excessively high-risk system may lead to significant harm from misuse or safety failures. Even developing a high-risk system without deploying it externally may lead to harm if the system is eventually leaked or stolen, or if it causes harm during internal deployment.
Responsible capability scaling is an emerging framework for managing risks associated with frontier AI and guiding decision-making about AI development and deployment. It involves implementing processes to identify, monitor, and mitigate frontier AI risks; these processes draw on the other processes and practices set out in this document and are underpinned by robust internal accountability and external verification.
We outline 7 categories of practice regarding responsible capability scaling:
Frontier AI may pose increased risks of harm, including misuse, loss of control, and other societal risks. Different methods are being developed to assess AI systems and their potential harmful impacts. Model evaluations, such as benchmarking, can be used to produce quantitative, easily replicable measurements of the capabilities and other traits of AI systems. Red teaming provides an alternative approach: examining an AI system from the perspective of an adversary to understand how it could be compromised or misused.
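As a purely illustrative sketch of what a quantitative, easily replicable measurement might look like, the short Python example below scores a placeholder model against a fixed question set with exact-match scoring. The `BenchmarkItem` structure, `run_benchmark` function, question items, and toy model are hypothetical and do not correspond to any particular evaluation suite.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkItem:
    """A single benchmark question and its reference answer."""
    prompt: str
    expected: str


def run_benchmark(model: Callable[[str], str], items: list[BenchmarkItem]) -> float:
    """Score a model on a fixed question set.

    Using the same items and the same scoring rule each time is what makes
    the resulting capability measurement quantitative and replicable.
    """
    correct = sum(1 for item in items if model(item.prompt).strip() == item.expected)
    return correct / len(items)


# Hypothetical usage with a toy question set and a placeholder model function.
items = [
    BenchmarkItem(prompt="2 + 2 =", expected="4"),
    BenchmarkItem(prompt="Capital of France?", expected="Paris"),
]


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call; answers one of the two items correctly.
    return {"2 + 2 =": "4", "Capital of France?": "Rome"}.get(prompt, "")


print(f"Benchmark accuracy: {run_benchmark(toy_model, items):.2f}")  # 0.50
```

In practice, real evaluation suites use far larger item sets and more nuanced scoring than exact match, but the same principle applies: a fixed task set and a fixed scoring rule yield comparable numbers across models and over time.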
We outline 4 categories of practice regarding model evaluations and red teaming: