Breakthrough in AI Reasoning: OpenAI's O3 & O4 Mini Models Dominate Benchmarks

Powerful reasoning models with tool usage, multimodal capabilities, and state-of-the-art performance across coding, math, science, and visual tasks, plus comparisons to previous models and a look at the industry implications.

April 21, 2025


Unlock the power of next-generation reasoning agents with OpenAI's O3 and O4 Mini models. These cutting-edge AI systems combine state-of-the-art reasoning capabilities with full tool access, delivering unparalleled performance across a wide range of tasks, from coding and math to visual perception and creative ideation. Discover how these models can revolutionize your workflow and unlock new possibilities in your field.

Powerful Reasoning Models: O3 and O4 Mini

OpenAI has announced the release of two new powerful reasoning models, O3 and O4 Mini, which represent a significant advancement in the field of artificial intelligence. These models are designed to be highly effective for agentic use cases, combining state-of-the-art reasoning capabilities with the ability to utilize a wide range of tools.

The key highlights of these new models include:

  1. Tool Usage Capabilities: For the first time, reasoning models can effectively use tools such as web search, file analysis, and Python integration, overcoming a major limitation of previous models.

  2. Native Multimodal Reasoning: The models can seamlessly integrate visual inputs, such as images and charts, into their reasoning process, enabling a more comprehensive understanding of the task at hand.

  3. Improved Performance: Across a range of academic benchmarks and real-world tasks, the O3 and O4 Mini models have demonstrated significantly stronger performance compared to their predecessors, setting a new standard in both intelligence and usefulness.

  4. Cost Optimization: The O4 Mini model, in particular, has been optimized for fast and cost-efficient reasoning, achieving remarkable performance at a much lower estimated inference cost compared to the existing O1 models.

These advancements in reasoning capabilities, combined with the ability to effectively utilize tools, make the O3 and O4 Mini models highly promising for a wide range of agentic use cases, such as programming, business consulting, and creative ideation.

The models will be available starting today within ChatGPT and the OpenAI API, providing users with access to these powerful reasoning capabilities. As the field of artificial intelligence continues to evolve, these new models from OpenAI represent a significant step forward in the pursuit of more intelligent and versatile AI systems.
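
Since the models are exposed through the OpenAI API, a minimal call looks like the sketch below. This assumes the official openai Python SDK and an API key in the environment; the model identifier "o4-mini" is taken from the announcement and may differ by account or API tier.

```python
# Minimal sketch: calling one of the new reasoning models through the OpenAI API.
# Assumes the official `openai` Python SDK and OPENAI_API_KEY set in the environment;
# the model name "o4-mini" is an assumption based on the announcement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",
    messages=[
        {"role": "user", "content": "Outline a plan to profile and speed up a slow SQL query."},
    ],
)

print(response.choices[0].message.content)
```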

Multi-Modal Reasoning and Tool Integration

The new OpenAI models, O3 and O4 Mini, represent a significant advancement in reasoning capabilities. For the first time, these models are able to effectively utilize tools, which has been a major limitation of previous reasoning models.

The models now have native multi-modal reasoning capabilities, allowing them to analyze and reason about visual inputs such as images, charts, and graphs, in addition to textual information. This unlocks a new class of problem-solving that blends visual and textual reasoning, as reflected in their state-of-the-art performance across multi-modal benchmarks.
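
To make the multimodal input concrete, here is a hedged sketch that sends an image URL alongside a text question through the OpenAI Chat Completions API; the model name and image URL are placeholders chosen for illustration.

```python
# Hedged sketch: sending an image plus a text question to a multimodal reasoning model.
# The model name "o3" and the example image URL are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this revenue chart show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/revenue-chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```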

Crucially, the models have been trained to not only use tools, but to reason about when to use specific tools for a given task. This enables them to integrate the outputs of tools into their chain of thought, modifying their plans and approaches accordingly. This agentic use of tools is a key feature that makes these models highly effective for real-world applications.
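
As a rough illustration of this "decide when to use a tool" behavior, the sketch below registers a single function with the API's standard tool-calling interface and leaves the decision to the model; the get_weather function, its schema, and the model name are invented for illustration, not part of OpenAI's announcement.

```python
# Hedged sketch: letting a reasoning model decide whether to call a tool.
# The `get_weather` function, its schema, and the model name are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "Do I need an umbrella in Paris today?"}],
    tools=tools,  # the model chooses whether and when to call the tool
)

message = response.choices[0].message
if message.tool_calls:  # the model decided a tool call is warranted
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```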

The benchmarks demonstrate the models' exceptional performance in areas like programming, business consulting, and creative ideation, with O3 making 20% fewer major errors than O1 on difficult real-world tasks. The O4 Mini model, in particular, has been optimized for fast and cost-efficient reasoning, achieving remarkable performance across code, mathematics, and visual tasks.

Overall, the integration of multi-modal reasoning and effective tool usage represents a significant breakthrough in the capabilities of these models, setting a new standard for intelligence and usefulness in real-world applications.

Benchmark Performance and Comparisons

The new OpenAI models, O3 and O4 Mini, have demonstrated remarkable performance across a range of benchmarks, setting new standards in both intelligence and usefulness.

The O3 model is described as OpenAI's most powerful reasoning model, pushing the boundaries in areas like coding, math, science, and visual perception. It showcases impressive multimodal understanding and visual perception capabilities, able to analyze images, charts, and graphs with high accuracy.

In external evaluations, the O3 model was found to make 20% fewer major errors than OpenAI's previous O1 model on difficult real-world tasks, particularly excelling in areas like programming, business consulting, and creative ideation.

The O4 Mini model, on the other hand, is a smaller and more cost-efficient model that still achieves remarkable performance, especially in code, mathematics, and visual tasks. Compared to the O3 Mini, the O4 Mini offers similar or better performance at a significantly lower estimated inference cost.

The benchmarks highlight the models' improved instruction following and more useful, verifiable responses compared to their predecessors, thanks to the enhanced intelligence and inclusion of web sources.

While OpenAI's own benchmarks show impressive results, the company has been urged to provide comparisons with other frontier models to give a more complete picture of the performance gains. Nevertheless, the numbers presented are undoubtedly state-of-the-art, setting a new bar for reasoning models and their practical applications.

Cost-Effective Reasoning with O4 Mini

The O4 Mini model is a smaller, cost-optimized version of OpenAI's powerful reasoning models. Despite its smaller size, it achieves remarkable performance, particularly in areas like code, mathematics, and visual tasks.

The key highlights of the O4 Mini model are:

  • Cost Optimization: The O4 Mini model is designed to be more cost-efficient than the larger O3 model, while still delivering state-of-the-art performance. Benchmarks show that the O4 Mini outperforms the O3 Mini at a much lower estimated inference cost.

  • Exceptional Performance: On benchmarks like AIME 2025 and GPQA, the O4 Mini demonstrates significantly better performance than the previous O1 models, even without using any tools. This suggests the model's reasoning capabilities have been greatly enhanced.

  • Multimodal Capabilities: The O4 Mini can integrate images directly into its chain of thought, unlocking a new class of problem-solving that blends visual and textual reasoning. This is reflected in its strong performance on multimodal benchmarks.

  • Improved Instruction Following: External evaluators have rated the O4 Mini as demonstrating improved instruction following and more useful, verifiable responses compared to its predecessors. This is a crucial capability for building effective, agentic systems.

Overall, the O4 Mini model represents a significant step forward in cost-effective reasoning, offering a powerful and versatile solution for a wide range of applications. Its combination of strong performance, multimodal capabilities, and improved instruction following make it a compelling option for developers and researchers alike.

Improved Instruction Following and Verifiable Responses

OpenAI has highlighted that external expert evaluators rated both the new O3 and O4 Mini models as demonstrating improved instruction following and more useful, verifiable responses compared to their predecessors. This is attributed to the models' enhanced intelligence and the inclusion of web sources.

The ability to follow instructions effectively and provide verifiable responses is crucial, especially for building agentic systems. The improved performance in these areas suggests that the new models can better understand and execute the given tasks, while also providing responses that can be validated.

This focus on instruction following and verifiable responses aligns with OpenAI's emphasis on making these models more useful and reliable for real-world applications. By demonstrating stronger capabilities in these areas, the new models are poised to be more effective in agentic use cases, where the ability to follow instructions and provide trustworthy information is paramount.

Scaling Reinforcement Learning for Better Performance

Throughout the development of OpenAI 03, the team has observed that large-scale reinforcement learning exhibits the same "more compute equals better performance" trend. By retracing the scaling path in reinforcement learning, they have pushed an additional order of magnitude in both training compute and inference time reasoning, yet still see clear performance gains. This validates that the model's performance continues to improve the more it is allowed to think.

By scaling both the training compute and inference time, the team has been able to push the boundaries of what is possible with these reasoning models. The more compute they provide, the better the models perform, indicating that there is still significant room for improvement by further scaling up the resources used.
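
As a purely illustrative sketch of the "more compute equals better performance" trend described above, the snippet below fits a log-linear curve to made-up data points; none of the numbers come from OpenAI, they only mimic the qualitative shape of such scaling curves.

```python
# Purely illustrative: a log-linear fit of benchmark score vs. training compute.
# All data points are synthetic placeholders, not OpenAI results.
import numpy as np

compute = np.array([1e22, 1e23, 1e24, 1e25])   # hypothetical training FLOPs
score = np.array([52.0, 61.0, 69.5, 78.0])     # hypothetical benchmark scores

# Fit score ~ a * log10(compute) + b
a, b = np.polyfit(np.log10(compute), score, deg=1)
print(f"~{a:.1f} points of benchmark score per 10x increase in compute (on this toy data)")
```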

This scaling approach has enabled the models to not only use tools effectively, but also to reason about when to use specific tools. Through reinforcement learning, the models have been trained to understand not just how to use the tools, but to make informed decisions about which tool is most appropriate for a given task. This is a critical capability for building truly agentic systems that can autonomously navigate complex problem-solving scenarios.

The team's focus on scaling reinforcement learning has paid off in the form of state-of-the-art performance across a range of benchmarks, including those that require multimodal reasoning and the integration of visual and textual information. By blending visual and textual reasoning, these models unlock a new class of problem-solving that can be highly valuable in industrial and real-world applications.

Pricing and Availability

The new OpenAI models, O3 and O4 Mini, are available starting today within ChatGPT and on the API. OpenAI has provided some details on the pricing and cost optimization of these models compared to the previous O1 models.

The O4 Mini model is optimized for fast and cost-efficient reasoning. According to the benchmarks shared by OpenAI, the O4 Mini performs much better than the existing O3 Mini model at a similar cost. For example, on the AIME 2025 benchmark, the O4 Mini reaches 99.5% accuracy when given access to a Python interpreter, while the O3 Mini scores significantly lower.

Similarly, on the GPQA dataset, the O4 Mini shows remarkable performance for its size and cost, outperforming the O3 Mini.

The pricing for these new models is also more reasonable compared to the previous O1 models. The O4 Mini is estimated to be even less expensive than the GPT-4.1 model, which is priced at around $2 per million input tokens.

As for the O3 model, its pricing is much lower than that of the O1 model, which is going to be phased out soon. This makes O3 a more accessible and cost-effective option for users, especially as a potential replacement for the O1 model.

Overall, the improved performance and more reasonable pricing of the new O3 and O4 Mini models make them attractive options for users looking to leverage the enhanced reasoning capabilities and tool integration offered by these models.
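
To put these per-token prices in perspective, here is a small back-of-the-envelope sketch. The $2 per million input tokens figure for GPT-4.1 comes from the text above; the O4 Mini rate and the token counts are illustrative placeholders, not official pricing.

```python
# Back-of-the-envelope cost estimate for a batch of requests.
# The GPT-4.1 input rate ($2 per 1M tokens) is taken from the article; the O4 Mini
# rate and the token counts below are illustrative assumptions, not official pricing.
RATES_PER_MILLION_INPUT = {
    "gpt-4.1": 2.00,   # from the article
    "o4-mini": 1.10,   # assumed for illustration
}

def input_cost(model: str, total_input_tokens: int) -> float:
    """Cost in USD for the given number of input tokens."""
    return RATES_PER_MILLION_INPUT[model] * total_input_tokens / 1_000_000

# Example: 500 requests averaging 2,000 input tokens each
tokens = 500 * 2_000
for model in RATES_PER_MILLION_INPUT:
    print(f"{model}: ${input_cost(model, tokens):.2f} for {tokens:,} input tokens")
```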

Codex CLI: Frontier Reasoning in the Terminal

OpenAI has announced the release of Codex CLI, an open-source project that enables the use of their powerful reasoning models within the terminal or command-line interface (CLI). This new tool allows users to leverage the advanced capabilities of OpenAI's latest models, including the ability to reason through images and perform multimodal tasks, directly from their local machine.

Codex CLI is designed to be a competitor to tools like Anthropic's Claude Code, providing a more accessible and flexible way to integrate these cutting-edge reasoning models into various workflows and applications. Because the agent runs locally on the user's machine while the reasoning itself is handled by the models through the API, users can work directly in their own environment while still tapping into the impressive capabilities of OpenAI's latest advancements.

One of the key features of Codex CLI is its multimodal nature, which allows it to reason not only through textual inputs but also visual information. This unlocks a new class of problem-solving that blends visual and textual reasoning, making it particularly useful for a wide range of applications, from manufacturing and engineering to creative tasks and beyond.

The availability of Codex CLI is an exciting development, as it brings the power of OpenAI's frontier reasoning models directly to users' fingertips. This tool promises to enable new and innovative use cases, as developers and researchers can seamlessly integrate these advanced capabilities into their local workflows and projects.

Conclusion

The release of OpenAI's new reasoning models, O3 and O4 Mini, represents a significant advancement in the field of artificial intelligence. These models demonstrate remarkable capabilities, including the ability to effectively utilize tools, engage in multimodal reasoning, and achieve state-of-the-art performance across a wide range of benchmarks.

The integration of tool usage is a particularly noteworthy feature, as it allows the models to leverage various resources and functionalities to solve complex problems. This capability, combined with their strong performance in areas like coding, mathematics, and visual perception, makes these models highly promising for agentic use cases.

The cost optimization of the O4 Mini model is also a notable development, as it provides a more affordable option for users while maintaining impressive performance. This could significantly expand the accessibility and adoption of these advanced reasoning models.

Overall, the release of O3 and O4 Mini sets a new standard in the field of AI, pushing the boundaries of what is possible with reasoning models. As the author suggests, it will be interesting to see how other industry leaders, such as Google, respond to this challenge and further advance the state of the art in AI technology.
