If you’ve been following the rapid evolution of AI, you might have noticed a shift happening right under our noses: the conversation is no longer just about building bigger and bigger language models. Instead, major players like OpenAI and Google are shifting their focus toward something called test-time compute – a fancy way of saying “giving an AI model extra time to think” whenever it’s faced with a tricky question.
OpenAI's scale of progress toward "human-level" problem solving
This new direction springs from a simple realization: just scaling models to gigantic sizes has a ceiling. It’s not only becoming too expensive but is also failing to deliver the kind of dramatic improvements we saw in the early days of large language models. In other words, we might be at a point where blindly “feeding more data and parameters” is hitting diminishing returns. Enter test-time compute: a strategy that allows a model to pause, run multiple solution paths, reflect, and refine its answer on the fly.
1. A new era for AI reasoning
Certain advanced systems, such as OpenAI’s “Strawberry Family” (e.g., o1 and o3 models), illustrate test-time compute in action. Often described as “thinking models,” these systems can handle problems that feel as challenging as PhD-level math or cutting-edge scientific research precisely because they do real-time reasoning. They’re designed to take multiple steps rather than just blurting out the first thing that comes to mind.
If you’ve ever tackled a big puzzle—let’s say setting up a complex spreadsheet formula or solving a word problem—you probably didn’t do it in one shot. You reasoned step by step, clarifying constraints, trying a possible solution, spotting mistakes, and then revising. That’s exactly the approach these advanced AI systems now take: they search through different potential solutions, evaluate them, and refine their approach until they arrive at the final answer.
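To make that loop concrete, here is a deliberately tiny Python sketch of the "try, check, revise" pattern. It solves a toy equation rather than calling any model, and every function in it is a stand-in for illustration, not a peek inside o1:

```python
import random

# A deliberately tiny "try, check, revise" loop in the spirit of test-time
# reasoning: spend extra compute checking and revising instead of committing
# to the first guess. The puzzle and scoring are stand-ins, not a peek at o1.

def error(x: int) -> int:
    """How far a candidate is from solving the toy equation x^2 + x = 132."""
    return abs(x * x + x - 132)

def solve(max_steps: int = 50) -> int:
    candidate = random.randint(-20, 20)      # first rough attempt
    for _ in range(max_steps):
        if error(candidate) == 0:            # check: does the answer hold up?
            return candidate
        # revise: try small edits and keep the most promising one
        neighbors = [candidate - 1, candidate + 1, -candidate]
        candidate = min(neighbors, key=error)
    return candidate

print(solve())  # lands on 11 (or -12, the other root)
```

The point isn't the puzzle; it's the shape of the loop. Extra compute at answer time buys the chance to check and revise before committing.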
Reward Model during chain-of-thought reasoning
2. Four core building blocks of "thinking" models
Researchers studying OpenAI’s o1 and o3 models, as well as Google’s Gemini, pinpoint four essential ingredients of effective test-time compute:
- Policy Initialization – Think of this as everything the model “knows” before you ask it a question—both its general knowledge from pre-training (like reading vast amounts of text) and its final tuning. This initial policy shapes how the AI will “think,” including the ability to clarify complex questions, propose alternative solutions, or correct itself when it realizes something’s off.
- Reward Design – Just like you might give a child a gold star for a correct answer, a model needs feedback to know which approaches are good or bad. If you’re playing chess, it’s easy: you either win or lose. But with language and reasoning tasks, the “right” answer can be more nuanced, so researchers design reward models that guide the AI, step by step, to the best solution—sometimes rewarding partial correctness along the way.
- Search – This is the heart of test-time compute. Instead of generating a single response, the AI can “branch out” and explore several possible solutions in parallel (a technique akin to tree search) or continuously refine one solution through iterative revisions. If the first few solution attempts don’t pass muster, the model can pivot and go down a different path (a minimal sketch of this idea follows the list).
- Learning – Finally, the best “thinking” models don’t just search blindly; they learn from each attempt. While traditional large language models rely heavily on pre-collected datasets, “thinking” models often use reinforcement learning during or after inference to refine their approach. The more they interact with an environment or systematically check their solutions, the better they get—all without needing an army of human labelers.
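Here is one way the four ingredients can fit together in code. This is a minimal, hypothetical sketch: the "policy" is a random guesser and the "reward model" a hand-written score, standing in for an LLM and a learned reward model respectively:

```python
import random

# A toy rendering of the four building blocks working together. Every piece
# here is a stand-in: a real system would use an LLM as the policy and a
# learned reward model, not these hand-written functions.

random.seed(0)
TARGET = 10  # invented task: find three integers that sum to 10

def policy(n_candidates: int) -> list[list[int]]:
    """Policy initialization: what the model proposes before any feedback."""
    return [[random.randint(0, 10) for _ in range(3)] for _ in range(n_candidates)]

def reward(candidate: list[int]) -> float:
    """Reward design: score partial progress, not just exact success."""
    return -abs(sum(candidate) - TARGET)

def search(n_candidates: int) -> list[int]:
    """Search: branch into several candidates in parallel, keep the best-scored."""
    return max(policy(n_candidates), key=reward)

# "Thinking longer" simply means exploring more branches before answering.
# Learning (not shown) would feed high-reward attempts back into the policy.
for budget in (1, 4, 32):
    best = search(budget)
    print(f"budget={budget:>2}  candidate={best}  reward={reward(best)}")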
Building blocks of the o1 “thinking” model
3. The Chinese paper: cracking open the o1 model
In a publication from Fudan University and Shanghai AI Laboratory, researchers dissect how OpenAI’s o1 (and, by extension, o3) accomplishes its advanced reasoning feats. They methodically walk through each of the four elements—policy initialization, reward design, search, and learning—and explain how all of them come together to give these models “AGI-like” problem-solving abilities.
One of the key insights is that test-time compute not only helps with tricky math or logic puzzles; it also opens the door to a potential “world model,” where AI can understand and navigate more abstract realms. The Fudan team suggests that by carefully orchestrating each step of the search and by providing partial rewards, it’s possible to replicate many of the “thinking” behaviors seen in advanced, closed-source models. Their ultimate goal? Open-source this know-how, so that smaller labs and startups can experiment with their own versions of o1-like models.
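One way to picture "partial rewards steering each step" is a beam search over intermediate steps, where a process reward scores unfinished chains of thought. The sketch below is a toy rendering of that pattern, not code from the paper; the task and scoring heuristic are invented for illustration:

```python
# Invented task: build a 4-step chain whose numbers sum to 15,
# choosing one number from 0..5 at each step.
TARGET = 15
STEPS = 4
CHOICES = range(6)
BEAM = 3  # how many partial chains survive each step

def process_reward(partial: tuple[int, ...]) -> float:
    """Score an unfinished chain: is it still on pace to reach the target?"""
    remaining = STEPS - len(partial)
    gap = TARGET - sum(partial)
    if remaining == 0:
        return -abs(gap)                # final answer: exact correctness
    return -abs(gap - remaining * 2.5)  # 2.5 = average step value, a heuristic

beams = [()]
for _ in range(STEPS):
    # branch every surviving prefix, score the extensions, keep the top BEAM
    extensions = [chain + (c,) for chain in beams for c in CHOICES]
    beams = sorted(extensions, key=process_reward, reverse=True)[:BEAM]

print("best chains:", beams)  # all three surviving chains sum to 15
```

Because partial chains are scored before they finish, dead ends get pruned early instead of wasting the compute budget on completing them.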
4. Where test-time compute and the “Age of Discovery” collide
So, what’s next?
- Smarter Multi-Modal Models – The Fudan researchers note how combining text, images, and even video helps a model build a more accurate “world model.” Imagine an AI that can watch a short clip, interpret it, and then apply the same test-time thinking it uses for text.
- More Open-Source Tools – As more teams replicate these techniques, we’ll see a wave of open-source “thinking” models that smaller companies or even hobbyists can fine-tune. This could democratize advanced AI, making it less about who has the biggest data center and more about who can innovate on inference techniques.
- Deeper Exploration of Reinforcement Learning – Reinforcement learning from real-time interaction is set to scale up, since it doesn’t rely on endless labeled data. That means we’ll see more instances of AI systems discovering brand-new strategies (think AlphaGo’s “move 37”) that human experts never imagined; a toy sketch of this feedback loop follows the list.
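For the curious, the feedback loop in that last bullet can be shown in a few lines. This toy "policy" learns the right answer purely by trying answers and having them checked automatically; there isn't a single human label in the loop. The checker, update rule, and learning rate are all illustrative assumptions:

```python
import math
import random

# Toy feedback loop: the "model" tries answers, an automatic checker verifies
# them (no human labels), and successful attempts are reinforced. Real systems
# run policy-gradient RL on an LLM; only the loop's shape is kept here.

random.seed(1)
ACTIONS = list(range(10))
prefs = {a: 0.0 for a in ACTIONS}  # the policy's preference for each answer

def verify(answer: int) -> float:
    """Stands in for the environment: checks the answer instead of labeling data."""
    return 1.0 if answer == 7 else 0.0

def sample() -> int:
    """Draw an answer, favoring higher-preference actions (softmax sampling)."""
    weights = [math.exp(prefs[a]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

for _ in range(500):
    answer = sample()
    reward = verify(answer)                # feedback from checking, not from labels
    prefs[answer] += 0.5 * (reward - 0.1)  # reinforce success, mildly discourage failure

print("learned answer:", max(prefs, key=prefs.get))  # converges on 7
```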
Ultimately, test-time compute is rewriting the rules of AI scaling. By letting models “think longer,” we unlock more robust reasoning without forcing them to become massive black boxes. This pivot to a more human-like problem-solving strategy is inspiring; after all, many of us remember teachers telling us to “show your work” in math class. Now AI is doing something similar, and it’s paying off in leaps and bounds.
Business Takeaway
For businesses, the rise of “thinking” AI models powered by test-time compute has immediate and far-reaching implications:
- Expanded Use Cases – With the ability to tackle far more complex and advanced tasks, these models can drive automation and decision-making across a broader spectrum of business processes—from sophisticated data analytics to nuanced customer interactions. The potential extends well beyond the typical chatbot, unlocking new avenues for innovation and operational efficiency.
- Trade-Offs in Performance and Cost – The very nature of test-time compute means the model takes additional steps during inference—sometimes adding seconds, or even a minute or more, before a final response arrives. On top of longer response times, each step generates more tokens, driving up both financial and energy costs (a back-of-the-envelope sketch follows this list). As the technology evolves, these overheads may shrink, but businesses must still factor in higher operational expenses and potential throughput bottlenecks.
- Explainability and Transparency – Longer reasoning chains can, in principle, provide a window into how the model arrives at its conclusions. However, full visibility into these intermediate steps isn’t always accessible—especially in some closed-source setups. In contrast, open-source initiatives already show promise by exposing the reasoning flow in more depth, which helps organizations build trust and meet compliance needs by making AI “decisions” more transparent.
- Increased Hallucination Risks – Iterative thinking can sometimes amplify “hallucinations,” where a model inadvertently fabricates facts. Businesses need robust testing and oversight to mitigate errors, particularly in mission-critical applications. Ongoing improvements in model feedback loops and evaluation metrics may help reduce these missteps over time.
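To put a rough number on the cost trade-off in the second bullet, here is a back-of-the-envelope calculation. The token counts and price are hypothetical placeholders, not any vendor's actual figures:

```python
# All numbers below are illustrative assumptions, not any vendor's pricing.
PRICE_PER_1K_OUTPUT_TOKENS = 0.03  # hypothetical, in dollars

def cost(output_tokens: int) -> float:
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

standard = 300            # a direct answer
reasoning = 300 + 4_000   # the same answer plus hidden "thinking" tokens

print(f"standard:  ${cost(standard):.4f}")
print(f"reasoning: ${cost(reasoning):.4f} ({cost(reasoning) / cost(standard):.0f}x)")
```

Even under these generous assumptions, hidden reasoning tokens can multiply per-request cost by an order of magnitude, which is why controls on "thinking" budgets matter in production.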
Altogether, adopting test-time compute requires a balanced assessment of cost, performance, and explainability. Yet for those willing to invest in the technology, the payoff lies in a new class of AI systems capable of solving challenges once deemed too complex for automated solutions.
Conclusion
From OpenAI’s o1 and o3 to the latest Chinese research, a clear pattern emerges: test-time compute is the new frontier. By combining policy initialization, carefully crafted rewards, sophisticated search strategies, and continuous learning, AI models can behave more like genuine problem solvers than rote memorization engines.
As we step deeper into this age of discovery, we can look forward to new possibilities in how AI interacts with the world—reasoning at a level once thought impossible. Whether it’s solving complex science questions, assisting in creative endeavors, or helping businesses automate tasks, “thinking” models promise to transform AI from a fancy autocomplete to a real partner in innovation.
The future, it seems, belongs not just to bigger models but to smarter ones—machines that know how to pause, reconsider, and explore multiple pathways in search of something new. If anything, this proves what many educators have always believed: taking your time to think is the best way to get the right answer in the end. And it turns out, that’s just as true for AI as it is for us humans.
By Jérémy BRON, AI Director, Silamir Group
Sources
- Test Time Compute (2024, December 13). Cloud Security Alliance Blog. Retrieved from https://cloudsecurityalliance.org/blog/2024/12/13/test-time-compute#
- Zeng, Z., Cheng, Q., Yin, Z., Wang, B., Li, S., Zhou, Y., Guo, Q., Huang, X., & Qiu, X. (2024, December 18). Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective. Fudan University & Shanghai AI Laboratory. Retrieved from https://arxiv.org/pdf/2412.14135
- OpenAI Scale Ranks Progress Toward ‘Human-Level’ Problem Solving (2024, July 11). Bloomberg. Retrieved from https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai?embedded-checkout=true