Introduction

In this post, we argue that for AI model evaluations (evals) to have meaningful real-world impact, we need a “Science of Evals”, i.e. rigorous scientific processes that provide more confidence in evals methodology and results.

Evals are a nascent field and we think current evaluations cannot yet withstand this level of scrutiny. Thus, we cannot trust the results of evals as much as we would in a mature field.

Since evals often aim to estimate an upper bound of capabilities, it is important to understand how to elicit maximal rather than average capabilities. Improvements in prompt engineering have repeatedly raised the bar, which makes it hard to judge whether any particular negative result is meaningful or whether a better technique could invalidate it.
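To make the difference between average and maximal capabilities concrete, here is a minimal sketch. The function name, the technique names, and all outcomes are hypothetical and only serve as an illustration; they are not results from any real evaluation.

```python
from statistics import mean


def summarize_elicitation(scores_by_technique):
    """Compare average and best-observed success rates across techniques.

    scores_by_technique maps a technique name to a list of 0/1 task outcomes.
    All names and data used with this function are hypothetical.
    """
    # Success rate of each elicitation technique on the same task suite.
    rates = {name: mean(scores) for name, scores in scores_by_technique.items()}
    # Average over techniques: what a "typical" prompt would suggest.
    average_rate = mean(list(rates.values()))
    # Best observed technique: the quantity an upper-bound eval cares about.
    best_technique = max(rates, key=rates.get)
    return average_rate, best_technique, rates[best_technique]


# Made-up outcomes for one task suite under three prompting techniques.
hypothetical_scores = {
    "zero_shot": [0, 0, 1, 0],
    "few_shot": [0, 1, 1, 0],
    "chain_of_thought": [1, 1, 1, 0],
}

average_rate, best_technique, best_rate = summarize_elicitation(hypothetical_scores)
print(f"average across techniques: {average_rate:.2f}")
print(f"best observed: {best_technique} at {best_rate:.2f}")
```

Even the best observed rate only reflects the techniques that were tried. A technique that is not in the dictionary could still score higher, which is exactly why a negative result is hard to interpret.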

In contrast, even everyday products like shoes undergo extensive testing, such as repeated bending to assess material fatigue. For higher-stakes products like airplanes, the testing is even more rigorous, examining materials for fatigue resistance, flexibility, and behavior under varying temperatures and humidity. These robust, predictive tests provide consumers with reliable safety assurances.

We think model evaluations should aim to reach a similar stage and thereby close the gap between what evals need to do and what they are currently capable of.

What do we mean by “Science of Evals”?

Most scientific and engineering fields go through a similar maturation cycle where exploratory research is turned into robust and rigorous standards that can be used as the basis for industry norms and laws. If model evaluations want to be impactful for AI safety, we think they have to go through a similar maturation process. Without confidence in the rigor of the evaluation process, people won’t feel confident in the results of evals, which makes those results unsuitable for high-stakes decisions.

Maturation process of a field

In a nascent field, there are no or few agreed-upon best practices and standards. Most of the work is invested in researching and exploring different procedures. Basic terms are defined and the overall field is mostly a scientific endeavor.

In the maturation phase, there is an informal agreement on norms and best practices between different stakeholders, but these norms cannot yet be used as the basis of comprehensive laws.

In a mature field, there are formal standards, best practices, and statistical confidence estimates. The field is “ready for law” in the sense that there are clearly established and widely agreed-upon evaluation techniques and the consequences of any particular result are well-defined.
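As one illustration of what a statistical confidence estimate for an eval result could look like, here is a minimal sketch that computes a Wilson score interval for a success rate. The choice of the Wilson interval, the function name, and the example numbers are our assumptions for illustration, not an established evals standard.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z=1.96 corresponds to roughly 95% confidence. The function name and
    the default are illustrative assumptions, not a field standard.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to guard against tiny floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical negative result: the model solved 0 of 20 tasks.
lower, upper = wilson_interval(successes=0, trials=20)
print(f"95% interval for the true success rate: [{lower:.3f}, {upper:.3f}]")
```

With these made-up numbers the upper end of the interval is roughly 0.16, so 0 successes in 20 attempts is still compatible with a true success rate well above zero. A mature field would agree in advance on which interval to report and on what consequences follow from it.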

Conclusion

Evals are going to be an important piece of ensuring the safety of AI systems. They enable us to improve our decision-making because they provide important information and are already tied to safety-related decisions, e.g. in RSPs. For evals to be useful for high-stakes decisions, we need to have high trust in the evals process.