Introduction

It is well known that progress in machine learning (ML) is driven by three primary factors: algorithms, data, and compute. This makes intuitive sense: the development of algorithms like backpropagation transformed the way machine learning models are trained, making optimization far more efficient than with previous techniques. Data has become increasingly available, particularly with the advent of “big data” in recent years. At the same time, progress in computing hardware has been rapid, with ever more powerful and specialized AI hardware.

What is less obvious is the relative importance of these factors, and what this implies for the future of AI. Kaplan et al. (2020) studied these developments through the lens of scaling laws, identifying three key variables: model size (the number of parameters), dataset size, and the amount of training compute.

Understanding the relative importance of these factors is challenging because our theoretical understanding of them is insufficient; instead, we need to gather large quantities of data and analyze the resulting empirical trends.
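
As a rough illustration of what this kind of trend analysis involves, the sketch below fits a straight line to log10(training compute) against publication year and converts the slope into a doubling time. The data points are made up for illustration, and the method shown (an ordinary least-squares fit on a log scale) is simply the standard way of estimating an exponential growth rate, not necessarily the exact procedure used in the paper.

```python
import numpy as np

# Hypothetical (year, training compute in FLOP) data points -- illustrative only.
years = np.array([2012.5, 2013.2, 2014.0, 2015.1, 2016.3, 2017.0, 2018.4])
flop  = np.array([1e17,   5e17,   3e18,   2e19,   1e20,   6e20,   4e21])

# Fit log10(compute) as a linear function of time (ordinary least squares).
slope, intercept = np.polyfit(years, np.log10(flop), deg=1)

# The slope is in orders of magnitude per year; convert it to a doubling time in months.
doublings_per_year = slope / np.log10(2)
doubling_time_months = 12 / doublings_per_year
print(f"Estimated doubling time: {doubling_time_months:.1f} months")
```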

Compute trends are slower than previously reported

In a previous investigation, Amodei and Hernandez (2018) found that training compute was growing extremely rapidly, doubling every 3.4 months. With approximately 10 times as much data as the original study, we find a doubling time closer to 6 months.
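
To get a feel for how much the revised estimate changes the picture, it helps to convert doubling times into implied growth factors per year. The snippet below is a back-of-the-envelope calculation, not a figure taken from either study.

```python
# Growth factor per year implied by a given doubling time (in months).
def annual_growth_factor(doubling_time_months: float) -> float:
    return 2 ** (12 / doubling_time_months)

print(annual_growth_factor(3.4))  # ~11.5x per year (the Amodei & Hernandez estimate)
print(annual_growth_factor(6.0))  # 4x per year (the revised estimate)
```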

Three eras of machine learning

One of the more speculative contributions of our paper is the argument that there have been three eras of machine learning. This contrasts with prior work, which identifies two trends separated by the start of the Deep Learning revolution. Instead, we split the history of ML compute into three eras:

  1. The Pre-Deep Learning Era: Before Deep Learning, training compute roughly follows Moore’s Law, with a doubling time of approximately 20 months.
  2. The Deep Learning Era: This starts somewhere between 2010 and 2012, and displays a doubling time of approximately 6 months.
  3. The Large-Scale Era: Arguably, a separate trend of models breaks off from the main trend between 2015 and 2016. These systems are characterized by being run by large corporations, and their training compute is 2-3 orders of magnitude larger than that of systems following the Deep Learning Era trend in the same year (one rough way of operationalizing this cutoff is sketched below). Interestingly, the growth of compute in these Large-Scale models seems slower, with a doubling time of about 10 months.
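
As a sketch of how one might operationalize “large-scale”, the snippet below flags models whose training compute sits well above a fitted main-trend line. The data, the trend coefficients, and the two-order-of-magnitude threshold are all hypothetical choices made for illustration; they are not the selection criterion used in the paper.

```python
import numpy as np

# Hypothetical (year, training FLOP) records; values are illustrative only.
models = {
    "model_a": (2016.0, 8e20),
    "model_b": (2017.5, 3e21),
    "model_c": (2017.5, 5e23),   # far above the main trend
    "model_d": (2019.0, 2e22),
}

# Assume a main-trend fit of the form log10(FLOP) = slope * year + intercept,
# e.g. obtained from a regression like the one sketched earlier.
slope, intercept = 0.6, -1190.0

def is_large_scale(year: float, flop: float, threshold_oom: float = 2.0) -> bool:
    """Flag a model as 'large-scale' if its training compute exceeds the
    fitted trend by at least `threshold_oom` orders of magnitude (an arbitrary cutoff)."""
    predicted_log10 = slope * year + intercept
    return np.log10(flop) - predicted_log10 >= threshold_oom

for name, (year, flop) in models.items():
    print(name, is_large_scale(year, flop))
```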

A key benefit of this framing is that it helps make sense of developments over the last two decades of ML research. Deep Learning marked a major paradigm shift in ML, with an increased focus on training larger models, using larger datasets, and using more compute.

However, there is a fair bit of ambiguity with this framing. For instance, how do we know exactly which models can be considered large-scale? How can we be sure that this “large-scale” trend isn’t just due to noise?