What did you do this week…in AI research?
(Don't worry, no need to respond in 5 bullet points or less.)
It is popular, at least in some parts of the world, to use very short surveys to take the pulse of organisational productivity. I don’t intend to make a value judgement on the use of “What did you do this week?” in the federal government, but I would like to propose that every three months, the AI labs conduct a survey asking about 40 researchers two questions:
How much productivity uplift, compared with 2023, are you getting from AI systems right now?
How much productivity uplift, compared with 2023, do you expect to get from AI systems in 6 months?
The researchers would answer with just a percentage for each question, and the results would be published. It would take just a minute!
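To make the mechanics concrete, here is a minimal sketch in Python of how the two answers could be collected and aggregated each quarter. The field names, and the split into a ‘leading lab’ figure and an ‘industry average’, are my assumptions for illustration, not a proposed standard.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SurveyResponse:
    """One researcher's answers, each a percentage uplift versus a 2023 baseline."""
    lab: str                 # which lab the respondent works at (kept private)
    uplift_now_pct: float    # Q1: uplift from AI systems right now
    uplift_6mo_pct: float    # Q2: expected uplift in 6 months

def aggregate(responses: list[SurveyResponse]) -> dict[str, float]:
    """Publish only partially anonymised averages, never per-lab scores."""
    by_lab: dict[str, list[SurveyResponse]] = {}
    for r in responses:
        by_lab.setdefault(r.lab, []).append(r)
    # 'Leading lab' here = the (unnamed) lab whose respondents report the
    # highest current uplift; this is one possible convention, not the only one.
    leading = max(by_lab.values(), key=lambda rs: mean(x.uplift_now_pct for x in rs))
    return {
        "industry_avg_now_pct": mean(r.uplift_now_pct for r in responses),
        "industry_avg_6mo_pct": mean(r.uplift_6mo_pct for r in responses),
        "leading_lab_now_pct": mean(r.uplift_now_pct for r in leading),
    }
```

The published output each quarter would be just a handful of numbers, which is all a trendline needs.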
Why would this survey be useful?
At a high level, three things are true:
AI systems were, until recently, unable to meaningfully improve AI researcher productivity;
In the last few months, AI systems have started to provide some noticeable benefit to AI researchers’ output; and
AI systems will be used to partially automate more steps in the research process before an AI system is able to ‘recursively self-improve’, that is, to wholly automate the research process and re-train improved copies of itself.
It would be very useful to be able to plot, over time, how much of a productivity uplift researchers think AI systems are giving them, so that we can notice when to expect extreme jumps in capability from complete automation. (I realise it might be stating the obvious, but recursive self-improvement might lead to very fast jumps in AI capabilities, far beyond human level.)
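As a rough illustration of what plotting this over time would buy us, the sketch below takes a hypothetical series of quarterly industry-average answers and prints the quarter-over-quarter change; the numbers are invented purely to show the mechanics.

```python
# Hypothetical quarterly industry-average uplift figures (percent vs. 2023).
# These values are invented for illustration only.
quarterly_uplift = {
    "2024-Q3": 15.0,
    "2024-Q4": 20.0,
    "2025-Q1": 30.0,
    "2025-Q2": 55.0,
}

# Quarter-over-quarter change: a sharply growing delta is the kind of signal
# that would suggest progress is accelerating rather than flatlining.
quarters = list(quarterly_uplift)
for prev, curr in zip(quarters, quarters[1:]):
    delta = quarterly_uplift[curr] - quarterly_uplift[prev]
    print(f"{curr}: {quarterly_uplift[curr]:.0f}% uplift ({delta:+.0f} pts vs. {prev})")
```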
At the moment, we are practically ‘flying blind’ about how soon superintelligence could come.
Our current sources are anecdotes, press interviews, essays from people at the labs, and model autonomy evaluations.
Without exhaustively listing quotes from lab leaders and researchers, here are some examples:
Sam Altman has suggested we are a few thousand days away from superintelligence.
Dario Amodei has said that we could have a ‘Nobel Prize-level’ scientist in all scientific domains in as little as two or three years. (This could be used to automate research, to create superintelligence.)
A researcher from OpenAI tweeted that “controlling superintelligence is a short term research agenda”.
Some people in the mainstream will dismiss comments like these on the grounds that AI labs need to fundraise, or that Silicon Valley generally tends to ‘hype’ emerging technologies. Irrespective of whether these critiques are correct, the AI labs would be doing ordinary people a disservice if they did not provide a clear grounding for that trajectory, which could take a form like: “9 months ago, our researchers thought they were getting a 25% output improvement from using AI systems, compared with being unaided, and now they believe, overall, they are getting a 75% output improvement against 2023 benchmarks.”1 Conversely, it would be useful for those who are sceptical of very fast AI progress to be able to substantiate that the productivity uplift researchers are getting from AI models is flatlining, if in fact it is.
Public statements on progress are valuable, but they are not especially rigorous, and without context they do little to help the world begin to prepare for very capable AI systems.
Beyond that, there are two good public tests of model autonomy, but these are weak guides to how useful the models actually are in real-world settings. RE-Bench from METR tests a model’s ability to perform seven realistic but self-contained ML engineering tasks, and MLE-bench uses 75 ML engineering tasks from Kaggle, a platform for online coding competitions. This is useful insofar as it allows us to understand how models perform on end-to-end tasks of medium length (hours), but it doesn’t capture where it is actually rational to deploy models in the real world, or how useful they actually are in those jobs. It feels difficult to say anything beyond: “The models are quite useful, if a little unreliable, for narrow tasks like catching bugs, code autocomplete, and optimising kernels for a given architecture, where it makes sense to do the integration work.”
As we move forwards, evaluation will become even more difficult:
Public or pre-deployment evaluations cannot capture the productivity uplift from models which are only deployed internally. As models become more powerful, it is reasonable to imagine AI labs will give their researchers access to models for longer before sharing them with the outside world, both to ensure the models are safe and to differentially boost their own researchers’ productivity. Evaluations will be unable to provide any indication of what kind of capability, or potential advantage, these researchers are getting.
Even for public evaluations, it will become more challenging to find human controls for long-horizon tasks. We need to compare the model’s performance to a human baseline, ideally using lab researchers for the most representative test. The current human baselines cover time increments from 2 to 64 hours, but as the tasks on which we evaluate models get even longer, this gets more difficult.2 Imagine saying, “We’d like some of the world’s best researchers to come and solve these test problems for a week so we can compare them with the models.” Clearly the labs are too busy for this! To account for this, METR are planning to do open-source developer uplift evaluations, and OpenAI have shown that Deep Research could make 42% of the pull requests (code edits) in their codebase.3
Asking the researchers for their subjective impression is one way to mitigate this.
Why would this survey be challenging?
This is not a silver bullet, by any stretch! There are a number of reasons why this survey faces feasibility challenges, or could have weaker explanatory power than hoped:
In general, it is bad to burden the researchers with surveys! Very few stories of brilliant research environments involve lots of interruptions from people with clipboards. It is sensible to be wary of a ‘slippery slope’ whereby each marginal question feels reasonable to add, but then researchers end up spending half their day filling out forms. However, on balance, it is also the case that very few research environments have tried to build superintelligence, so asking about 40 people, every three months, to complete a form that will take literally a minute feels proportionate.
It is possible that researchers’ perceptions of their productivity uplift do not reflect the uplift they are actually getting. On balance, it seems worthwhile nonetheless: researchers will often ‘vibe check’ models, so even if the survey is just an aggregation of their vibe checks on the usefulness of AI systems, it still provides some indicator. If there is a systematic bias, the trendline will still be valuable, even if the absolute values are not.
Finally, it is possible that there will be incentives for AI labs to encourage researchers to under- or over-state the productivity uplift they are getting. It seems good not to be too cynical in this regard, and to put only so much weight on this one datapoint. Perhaps this concern could be mitigated if the survey were conducted by a trusted third party, like the AI Security Institute, Epoch AI, or other evaluators, and the results were partially anonymised.4
To step back, I expect almost everyone would agree that in the ideal case, superintelligence should be built in a maximally transparent way, but given the current equilibrium, this also needs to be achieved without compromising commercial or national interests. A two-question survey would be a low-cost and high-value step towards greater openness.
1. Numbers are illustrative; I do not think anyone is getting a 75% productivity uplift yet.
2. This is particularly true as the highest level of talent will differentiate itself more strongly on the longest horizons.
3. Deep Research System Card, OpenAI, February 2025, p. 33.
4. The way I would imagine this is that the answers are published not by naming each lab and listing its scores, but rather as the average at the ‘leading lab’ and the ‘industry average’ across all respondents.