“If you look at the fractal structure of a snowflake, you might think that whoever made it did something impossibly intricate and difficult, but that building it piece by piece must somehow be possible, since someone did it. In fact, both statements are false: the way to make a snowflake is not to think in terms of its pieces but to know the laws of physics, have enough raw material and a large enough chamber, set the temperature, pressure, and humidity correctly, and wait for long enough. Furthermore, this is the only way to make snowflakes; trying to piece together a single one from little bits of ice is hopeless.” — Dario Amodei
Much of contemporary AI research is, in some sense, downstream from the perspective that we should, where possible, 'let the compute figure it out'. By Moore's Law—or at least the folk version of it—the scale of the computational resources available is consistently and rapidly increasing. In the medium to long term, it's only the techniques which are able to most effectively leverage larger and larger quantities of compute that remain relevant. Hand-crafted methods tend to plateau and so are out-competed over time by general, flexible methods which, in figuring out how to do things for themselves, remain readily scalable. This perspective has been hard won; hence Rich Sutton dubbed it The Bitter Lesson.1
This is a large part of why modern AI research is increasingly synonymous with the study of deep neural networks.2 Neural networks (or machine learning ‘models’ as they are more often referred to) are a kind of container which is able to hold a variety of possible 'circuitry'. It's this circuitry which determines how the model extracts and processes information from input to produce some output. If we have some metric which rates those outputs, we can, by a process of trial and improvement, search over the circuitry our model is able to hold so that, over time, our model 'learns' to perform well according to that metric. Importantly, the specific form that both our models and the particular process of trial and improvement3 take allows them to be run and scaled up very effectively on modern computer hardware (GPUs and TPUs in particular), with improvements tending to come from removing bottlenecks and improving scalability.4
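To make the 'trial and improvement' picture concrete, here is a minimal sketch of the standard gradient-descent training loop in PyTorch. The model, data, and metric below are stand-in assumptions purely for illustration, not a description of any particular system.

```python
# A minimal sketch of 'trial and improvement' (gradient descent).
# The model, data, and metric below are illustrative stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))  # the 'container'
metric = nn.CrossEntropyLoss()                                          # rates the outputs
optimiser = torch.optim.SGD(model.parameters(), lr=1e-2)                # the improvement rule

for step in range(1000):
    inputs = torch.randn(16, 32)             # placeholder batch of inputs
    targets = torch.randint(0, 10, (16,))    # placeholder labels
    loss = metric(model(inputs), targets)    # how badly did the current 'circuitry' do?
    optimiser.zero_grad()
    loss.backward()                          # which small changes would improve the metric?
    optimiser.step()                         # nudge the parameters in that direction
```

Crucially, every operation here is dense linear algebra over batches, which is exactly the kind of work GPUs and TPUs are built to scale.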
Naturally, as the size of the container, the number and variety of examples, and the quantity of search performed scale with compute, the circuitry learnt, and thus our model's behaviour, can become increasingly sophisticated.5 What our model learns will depend on what we ask it to do: what situations we put it in, what data we give it as input, which metric we use to grade its output, and so forth. Learning can be based on simply observing data, or it can involve interaction, where past outputs affect future inputs and feedback; which we choose depends on what we want our model to learn.
In the observational case, our hope is that our model can learn as much about the world as possible from the data it’s shown. In other words, we want our model to compress and internalise the structure and regularities it can find within the data. The nature of the data the model is given (the format, the quantity, the quality) is thus the most important factor in determining what the model will learn. Depending on the data, there are two natural types of tasks we can give the model which will get it to compress: prediction (i.e. trying to guess what will come next) and reconstruction (i.e. trying to undo some process of noise or corruption). For both cases, there are natural metrics for our model to pursue, with small changes depending on the exact set-up we place our model in and what we’re asking our model to predict or, put differently, reconstruct.6 Given the data and the task, as long as we ensure that our model and training set-up are structured appropriately, we should be able to readily scale it up: in the quantity of data that we feed it, in the various dimensions of our container, in the quantity of training we give it. As the model is scaled up, it will be able to pick up on more subtle patterns and structure within the data at a finer level of precision. If our data is sufficiently rich, then it will contain structure across many different resolutions, similar to how coastlines remain rough as you resolve closer and closer. It’s for this reason, we think, that we tend to find power (or ‘scaling’) laws between our natural metrics and the aspects of increasing scale in the training of our model (i.e. model size, data quantity, training compute).7
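As a concrete illustration of what such a power law looks like, here is the commonly cited functional form from the broader scaling-law literature (a Chinchilla-style fit relating loss to model size and data quantity; the form and constants come from that literature, not from anything claimed in this post):

```latex
% Illustrative power-law ('scaling law') form from the literature:
% loss L as a function of model size N and data quantity D,
% with empirically fitted constants E, A, B, \alpha, \beta.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The irreducible term E captures the noise floor, while the two power-law terms shrink smoothly as model size and data are scaled up.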
Prediction-based learning on text, where the model is given some leading text and then attempts to predict the continuation, has been particularly important for getting very powerful, general models. Text is the paradigm medium for efficiently representing world knowledge and reflecting on various aspects of thought. The vast range of subtlety and complexity in the human textual corpus has meant that the corresponding power laws have held over more than a dozen orders of magnitude. As our models have moved to larger and larger scales, they have internalised more and more, demonstrating deep knowledge and understanding of the world as well as some aspects of thought. Other aspects of thought, however, can be more difficult to internalise this way. In particular, abilities such as long term coherence, detecting and correcting errors, backtracking, and other ‘long horizon thinking’ skills have proved difficult to learn, both because their effects are widely distributed over time and because there are few high quality demonstrations of them.
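Operationally, 'prediction on text' usually means scoring the model on how well it guesses each token given the tokens before it. The sketch below is a generic next-token cross-entropy loss in PyTorch; the tensor shapes and the assumption that the model returns per-position vocabulary logits are illustrative, not a description of any particular lab's code.

```python
# Generic next-token prediction loss: guess each token from the ones before it.
# Assumes `model(inputs)` returns logits of shape (batch, sequence_length, vocab_size).
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, sequence_length) integer token IDs
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]   # shift by one position
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # one prediction per position
        targets.reshape(-1),                   # the actual continuations
    )
```

This single number is the 'natural metric' whose smooth improvement with scale the power laws describe.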
Why do we care about this so much? Well, almost all real world applications involve extended tasks, and so these issues have prevented these models from moving beyond areas where this doesn’t matter so much (chatbots, coding assistants, search) and into things like agents, which can go out and independently perform tasks in the real world. These ‘long horizon thinking’ issues have been the primary bottleneck for getting agents to work well in the real world. Other research paths were also blocked by this, such as the desire to gain the effects of a larger, more intelligent model by letting a weaker model think for longer. This is a particular instantiation of wanting to trade off training compute and inference compute (i.e. compute used in running the model).8 Also blocked were hopes of getting around the ‘data wall’ - the issue that high quality pretraining data is increasingly scarce and expensive - by having models think hard before producing a final result, whether to generate high quality synthetic data or to provide higher quality feedback in other parts of training.9
The primary reason that people have been so excited by o1 is that it was the first large-scale demonstration of a new (though much anticipated) technique which attempts to address precisely these concerns, potentially unblocking one of the major constraints of these models.
In the interactive case, our hope is that our model can learn useful skills from the experience of interacting with the world. Here there is no clear a priori signal of good behaviour as we had in the observational case. Under the formalism of reinforcement learning, we can separate out the problem by assuming that we have some reward signal which our model receives after every output it makes, and deriving good behaviour as that which maximises the total of this reward signal over time. Broadly, there are two kinds of learning we can do: ‘value learning’ and ‘policy learning’.10 In value learning, the model learns the future reward it expects to collect given its current state11, with the policy (the way that the actions actually taken in the world are selected) being learnt or derived from these estimates.12 In policy learning, the model learns to output actions directly, taking the total reward it will receive as the metric to pursue. This maximisation requires trialling the working policy at every step, and so pure policy learning requires an overwhelming number of trials in the world. Thus value learning is added to make learning more efficient, with the particular art being to do each well without harming the other13, along with the ability to make use of trials collected from slightly older versions of your policy. When acting in more complex domains, this latter, mixed approach is common.14
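To make the distinction concrete, here is a minimal sketch of the two objectives side by side, in the style of a simple actor-critic / policy-gradient set-up. The network shapes, the environment, and the use of sampled returns as targets are illustrative assumptions rather than any specific published algorithm.

```python
# Minimal sketch contrasting value learning and policy learning (actor-critic style).
# Network sizes, state/action dimensions, and the 'returns' targets are illustrative.
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))  # action logits
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))   # expected future reward

def policy_loss(states, actions, returns):
    # Policy learning: push up the log-probability of actions in proportion
    # to how much reward followed them (the learnt value acts as a baseline).
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    baseline = value_net(states).squeeze(1).detach()
    return -(chosen * (returns - baseline)).mean()

def value_loss(states, returns):
    # Value learning: regress the total future reward expected from each state.
    return ((value_net(states).squeeze(1) - returns) ** 2).mean()
```

In practice, the policy term is what methods like PPO refine, with the value term added mainly to make learning from a limited number of trials more efficient.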
In some domains, such as games or question answering, identifying a suitable reward signal is straightforward (e.g. win / lose, video game score, correct / incorrect), but this is not true in general. For example, when using reinforcement learning to train for preferences, a separate ‘reward model’ (itself trained to predict how a human or another AI model would rate some behaviour) generates a cheap reward signal for the model being trained.15 Some reward signals (reward models in particular) run into issues of robustness, where the model training against the signal exploits defects to get high reward without learning the intended behaviour. Rewards can also have varying degrees of sparsity (i.e. arriving more or less frequently relative to the number of actions taken), with sparser rewards being more difficult to learn from, requiring good initial behaviour or a high degree of exploration.
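As an aside on how such reward models are usually fit, the standard approach in the preference-learning literature is a pairwise loss: the reward model is trained to score the preferred of two responses higher than the rejected one. The sketch below assumes a reward model that maps each response to a single scalar; the names are illustrative.

```python
# Sketch of the standard pairwise preference loss used to train reward models.
# Assumes `reward_model(batch)` returns one scalar score per response; names are illustrative.
import torch.nn.functional as F

def preference_loss(reward_model, preferred, rejected):
    r_good = reward_model(preferred)   # scores for the responses raters preferred
    r_bad = reward_model(rejected)     # scores for the responses raters rejected
    # Maximise the margin by which the preferred response scores higher.
    return -F.logsigmoid(r_good - r_bad).mean()
```

The robustness issues mentioned above arise because the policy then optimises against this learnt, and therefore imperfect, scorer.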
One way to attempt to address the issues around ‘long horizon thinking’ skills16 is to set our model against some problem and allow it to think before answering. Using our reinforcement learning techniques, over time the model should learn how to leverage its ability to think to help it produce correct answers. Remembering the Bitter Lesson, we should also likely avoid interfering too much with how the model thinks, instead hoping that, by letting the model learn to think by itself, the kinds of skills we’re after will emerge17, though we might still allow the model to check intermediate results (e.g. seeing if intermediate code runs). We should also try to make sure that what we’re doing is as scalable as possible. We want to pick problems which give some clear, robust notion of completion, and, of these ‘verifiable’ problems, we ideally want ones where the verification is relatively quick and cheap so that it doesn’t become a bottleneck for our scaling.18
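Putting these pieces together, the overall shape of such 'RL against a verifier' training might look something like the sketch below. This is a guess at the general structure, not a description of o1's actual procedure; `generate`, `verify`, and `rl_update` are hypothetical stand-ins for the model's sampler, the cheap checker, and whichever policy-update rule is used.

```python
# A heavily simplified sketch of RL against a verifier: the model thinks, answers,
# and is rewarded when a cheap checker accepts the final answer.
# `generate`, `verify`, and `rl_update` are hypothetical stand-ins.

def collect_episode(generate, verify, problem):
    completion = generate(problem["prompt"])              # chain of thought plus final answer
    reward = 1.0 if verify(problem, completion) else 0.0  # e.g. proof checker, unit tests
    return completion, reward

def training_step(generate, verify, rl_update, problems):
    episodes = [collect_episode(generate, verify, p) for p in problems]
    rl_update(episodes)   # any policy-gradient-style update over the sampled completions
```

Note that the verifier only ever sees the final answer; how the model uses its thinking to get there is left for the training to discover.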
To ensure that our model has a signal it can learn from, we also need to ensure our problems are appropriately scoped, being neither too easy nor too difficult, so that our model has an incentive to improve its thinking. If the model knows little physics, we can't just give it a whole load of advanced magnetohydrodynamics questions to learn from: it'll never get any correct and won't know how to improve. So we're looking for a sweet spot, where the problems we give the model are hard enough to be worth learning from but easy enough that it sometimes succeeds.19 While areas like mathematical proofs against proof checkers and coding against unit tests provide large classes of such problems, more broadly, finding a large number of problems which satisfy these criteria can be difficult, so it’ll be important that our training method is very efficient in terms of the number of problems it needs to learn well.20
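One simple (and purely illustrative) way to operationalise this 'sweet spot' is to estimate the model's current pass rate on each problem and keep only those it sometimes, but not always, solves. The sampling counts and thresholds below are arbitrary assumptions, and `generate` and `verify` are the same hypothetical stand-ins as above.

```python
# Sketch of filtering problems to the 'sweet spot' by estimated pass rate.
# Attempts and thresholds are arbitrary; `generate` and `verify` are hypothetical stand-ins.

def in_sweet_spot(generate, verify, problem, attempts=8, low=0.1, high=0.9):
    successes = sum(verify(problem, generate(problem["prompt"])) for _ in range(attempts))
    pass_rate = successes / attempts
    return low <= pass_rate <= high   # always-wrong or always-right gives little signal

def filter_problems(generate, verify, problems):
    return [p for p in problems if in_sweet_spot(generate, verify, p)]
```

Of course, estimating pass rates itself costs samples, which is part of why efficiency in the number of problems needed matters.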
Ultimately, the value of this verifiable RL training depends on its empirical degree of scalability, on how well the training generalises between domains (if our model has learnt to think through maths problems, will these skills transfer to thinking about other areas?), and on how well it generalises between timescales21 (if our model has learnt how to perform well over an hour22, will it also be able to perform well over a day? Over a week?). The degree of generalisation between domains, in particular, has special importance, as it determines to what degree a limited set of cheap verifiers is sufficient to obtain these skills more broadly, or whether specific training will have to be done for each task or domain; it may also increase with a sufficient diversity of tasks and scale.23
While little is known about the details of these questions, we do know that, for whatever kind of training is being used for o1, the scalability is likely good,24 as within around 3 months of the announcement of o1, OpenAI announced their scaled-up o3 model. The consensus25 seems to be that the primary improvement made here was a tremendous scale-up of the RL training procedure (approaching the order of compute used for pretraining), and o3 shows the kinds of results you would expect, with markedly strong performance on difficult mathematics and competitive coding.26
Zooming out, what does this mean in the big picture? Well, we might say that there is now a third type of large scale progress to pay attention to in the frontier models, with advances in these new verifiable RL (or ‘reasoning’) techniques standing alongside the improvements in world knowledge and understanding from pretraining and the expansion of the domains these models can operate in from improvements like multimodality, computer use, and so on.27
Thanks to Toby Logan, Jack Wiseman, Seb Handley and Alicia Pollard for feedback on earlier drafts of this post.
It’s worth noting that beyond Moore’s Law, there’s the question of how much money you’re willing to spend on computational resources. It is this trend which has significantly accelerated since ~2018, as it has become clear that we may be close to the compute necessary for potentially transformative systems.
One could argue that e.g. evolutionary or more explicit search techniques also follow our philosophy. In practice, however, these methods seem to scale less well (though, in some cases, they can act as valuable backstops to other techniques) and thus we ‘let the compute figure out’ which technique to use.
Often, improvements to the efficiency of running the model and the unblocking of learning inside the model go hand in hand. The transformer is an important example, where the primary improvement arose from the efficient use of parallelism on both sides.
i) It’s not just that behaviour becomes more sophisticated; training itself often becomes more stable and robust at larger scales.
ii) Such scaling does also involve changing other variables in the training procedure, though some of the necessary changes can be predicted ahead of time.
Though in other regimes, the scaling is limited less by resolution than by noise and variance, leading to a different scaling behaviour.
For o1 in particular, the trade-off on AIME seemed to be roughly that a 10x scale up in inference compute allows you to train with about 20-ish? x less RL compute for constant performance (at least from poorly eyeballing the charts here, someone with a ruler should check me) (in comparison to AlphaGo where 10x more inference lets you get away with about 7x less training compute). That said, where this trade-off lands remains unclear (e.g. see this Gwern comment).
This latter seems to have been the particular focus of Ilya Sutskever and Gwern.
As well as some initial action, in the case of Q-learning.
i) This being a major source of instability in the pure value learning case.
ii) Value learning tends to show large benefits from frequent reuse of past samples (as long as those samples are sufficiently diverse), thus why value learning tends to use (often quite large) replay buffers which store past experiences.
This comes from the fact that it can often be beneficial to have a single network which produces both policies and values, as they can both benefit from shared circuitry. However, as the policy and the value have different training profiles (e.g. very different critical batch sizes) it is best to train the policies and values alternately in phases whilst ensuring that one isn’t too affected by the training of the other.
i) As a passing note, it’s interesting to see that many of the large scale uses of these simple algorithms (like PPO, as well as something like pretraining if we broaden our scope) seem to be the main historical results published by OpenAI. We also note here that OpenAI’s Chief Scientist Jakub Pachocki was the lead on both OpenAI Five and GPT-4 (though was not involved in the preceding work) as well as one of the leads on o1. Increasingly, the focus has shifted to the kind of research required to execute scaling.
ii) These algorithms can often transfer quite well to the setting of having multiple agents learning to interact with one another (though not always).
Preference based reinforcement learning is commonly used atop large models which have been ‘pretrained’ on text prediction. As the pretrained model can generate text similar to what it’s seen during training, this kind of training tends to be thought of as eliciting particular skills, personalities and behaviours which are latent in the model.
There are other approaches than the one described (e.g. the one used here). I try to stick closer to what OpenAI have described, plus some guesswork, though not too much is left to the imagination. I think that the biggest open question is what they’re doing with rewards. It could be a mix of end outcomes and process reward models (i.e. reward models which inspect and give rewards based on the details of the model’s thought process), though I’m unsure. At least the publicly released thought processes don’t show strong signs of a PRM.
i) Other than our ‘long horizon thinking skills’, we might hope that, by letting the model think for itself, it will be able to get used to and develop its own thinking style, beyond what it’s learnt from pretraining. Maybe this could help with things like ‘taste’ and other subtle ways the model learns from the experience of practicing thinking. After all, it’s certainly the case that some things can only be learnt by doing them.
It’s also worth noting that this kind of training is very general. Many agentic tasks are, in fact, verifiable (you can check whether your holiday has been booked or not), though such verification may be very expensive, so you may or may not want to train on it depending on other factors (in particular, how much you can get away with other, potentially cheaper, kinds of learning, as well as the degree of generalisation within our RL training).
ii) And indeed, in o1, this is what we see!
It’s interesting to consider the ramifications this has on the hardware side. Verifiers like this tend to run best on CPUs, which suggests that this kind of training favours chips which have a higher CPU / GPU ratio. For more on what this actually means for e.g. lab competitiveness, I’d recommend reading this SemiAnalysis piece.
In the case of games, these sorts of reasons are a large part of why self-play is important: you always have a competitor at a similar skill level to yourself. Of course, you then need to do other things to ensure you’re learning against a sufficiently diverse set of adversaries and to encourage exploration of new strategies.
Though not necessarily in terms of the number of attempts we make on our problems; indeed, we should expect something like millions of trials as we scale up there. There are other kinds of RL techniques which are much more sample efficient in that regard, but they tend either to lean much more heavily on value based methods (which typically aren’t much used in training over pretrained language models, with policy based methods favoured instead, though I expect you’d want to use some policy based method with greatly improved value learning to help deal with sparse reward issues, and there you’re likely to take inspiration from some of the sample efficient value based approaches), or to use world models (which aren’t really applicable here, because ‘thinking in your head / in simulation’ and ‘thinking aloud / in actuality’ cost exactly the same for these models).
I suspect that this is relatively poor, with the models really only being able to think on the same order of ‘time’ (see next fn) as the most they’ve been trained on. This, however, is not too much of a concern as long as one can find problems which require such time for training (and, if nothing else, such problems can be added to the training data as they come up in practice). It only really matters if you’re after some small number of important problems (e.g. a Millennium Prize problem) which probably require a lot of thinking, but for which you can’t justify the cost of the training. This doesn’t seem like too big a deal.
This kind of time is a metaphorical comparison to a human. In models, the relevant measure is how much text (either in the model's thinking or in context) has the model seen or generated.
Another thing to pay attention to is whether this kind of training generalises to training, instead of a single agent, systems of many copies of agents acting and thinking (potentially sharing residual streams or latent spaces) in a coordinated manner. This has the advantage of potentially making better use of inference compute in parallel instead of serially.
Indeed, the initial blog post indicated strong example scaling laws for o1 on the AIME benchmark, though this is not necessarily the scaling relationship we most care about.
Based on a mixture of staring at the few graphs shown and vibes, so take it or leave it as you please.
i) I first became aware of this framing from Bob McGrew, though for a similar framing see this from Jared Kaplan.
ii) Now that we don’t have a fundamental bottleneck from these long horizon skills, what are we supposed to say to ensure that we sound measured and reasonable? Well, at least in the short term, reliability will remain an issue. Very subtle and long horizon decisions that humans are able to make may be bottlenecks. ‘Taste’ is one area where we might expect the generalisation across domains to be particularly poor. Different forms of memory may also be needed (surely there’s something you can do instead of just throwing the back of the KV cache away?). There does seem to be some inability in the models to refactor how they understand things (maybe there’s some weird layering or mixing of pre- and RL training which you can do here?). That said, it’s unclear how important, or how difficult to overcome, these more speculative considerations may be.