OpenAI debuts GPT‑5.1-Codex-Max coding model, which has already completed internal tasks running for more than 24 hours



OpenAI is introducing GPT‑5.1-Codex-Max, a new frontier agentic coding model that is now available in the Codex development environment. The release represents a major step forward in AI-assisted software engineering, delivering improvements in long-horizon reasoning, efficiency, and real-time interactivity. GPT‑5.1-Codex-Max replaces GPT‑5.1-Codex as the default model across the entire Codex surface.

The new model is designed to act as a persistent, high-context software development agent, capable of managing complex refactoring, workflow debugging, and project-wide tasks across multiple context windows.

This comes on the heels of Google releasing its powerful new Gemini 3 Pro model yesterday; on key coding benchmarks, the new OpenAI model matches or slightly outperforms it.

On SWE-bench Verified, GPT‑5.1-Codex-Max achieves 77.9% accuracy at very high reasoning effort, slightly higher than Gemini 3 Pro’s 76.2%.

It also posts 58.1% accuracy on Terminal-Bench 2.0, against Gemini’s 54.2%, and it matches Gemini’s score of 2,439 on LiveCodeBench Pro, a competitive-coding Elo benchmark.

Codex-Max also holds a slight advantage on agentic coding benchmarks even when measured against the most advanced configuration of Gemini 3 Pro (Deep Think).

Performance benchmarks: incremental improvements across key tasks

GPT‑5.1-Codex-Max shows measurable improvements over GPT‑5.1-Codex across a variety of standard software engineering benchmarks.

  • SWE-Lancer IC SWE: 79.9% accuracy, a significant improvement over GPT‑5.1-Codex’s 66.3%.

  • SWE-bench Verified (n=500): 77.9% accuracy at very high reasoning effort, ahead of GPT‑5.1-Codex’s 73.7%.

  • Terminal-Bench 2.0 (n=89): a more modest gain, with GPT‑5.1-Codex-Max at 58.1% accuracy versus 52.8% for GPT‑5.1-Codex.

All evaluations were performed with compression enabled and very high reasoning effort.

These results indicate that the new model raises the ceiling on both benchmark accuracy and real-world usability under heavier, longer-running workloads.

Technical architecture: long-horizon reasoning with compression

The main architectural improvement in GPT‑5.1-Codex-Max is its ability to reason effectively over extended input/output sessions using a mechanism called compression.

This allows the model to retain important context information while discarding irrelevant details as it approaches the context window limit, effectively allowing continuous work across millions of tokens without performance degradation.
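
OpenAI has not published the internals of this mechanism, but conceptually it resembles the context-compaction loops used by many coding agents: when the transcript nears the window limit, older turns are folded into a compact summary and work continues. The Python sketch below is a hypothetical illustration of that idea only; the token budget, summarizer, and class are assumptions rather than OpenAI’s implementation.

```python
# Hypothetical sketch of a context-compaction loop (not OpenAI's implementation).
# When the running transcript nears the context budget, older turns are replaced
# by a summary so the session can keep going indefinitely.

from dataclasses import dataclass, field

CONTEXT_LIMIT = 8_000       # assumed token budget, kept small for the demo
COMPACT_THRESHOLD = 0.85    # start compacting at 85% of the budget


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly four characters per token.
    return max(1, len(text) // 4)


def summarize(turns: list[str]) -> str:
    # Placeholder: in practice the model itself would write a dense summary that
    # preserves key decisions, file paths, and open TODOs from the older turns.
    return "SUMMARY OF EARLIER WORK: " + " | ".join(t[:80] for t in turns)


@dataclass
class CompactingSession:
    turns: list[str] = field(default_factory=list)

    def add_turn(self, text: str) -> None:
        self.turns.append(text)
        if self.token_count() > CONTEXT_LIMIT * COMPACT_THRESHOLD:
            self._compact()

    def token_count(self) -> int:
        return sum(count_tokens(t) for t in self.turns)

    def _compact(self) -> None:
        # Keep the most recent turns verbatim; fold everything older into one summary.
        keep, older = self.turns[-5:], self.turns[:-5]
        if older:
            self.turns = [summarize(older)] + keep


session = CompactingSession()
for i in range(10_000):
    session.add_turn(f"step {i}: edited a file, ran the test suite, read the output")
print(f"turns retained: {len(session.turns)}, approx tokens: {session.token_count()}")
```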

This model has been observed internally to complete tasks lasting longer than 24 hours, including multi-step refactorings, test-driven iterations, and autonomous debugging.

Compression also improves token efficiency. At medium reasoning effort, GPT‑5.1-Codex-Max uses approximately 30% fewer thinking tokens to achieve similar or better accuracy than GPT‑5.1-Codex, which reduces both cost and latency.

Platform integration and use cases

GPT‑5.1-Codex-Max is currently available across multiple Codex surfaces, that is, OpenAI’s own integration tools and interfaces built specifically for code-centric AI agents. These include:

  • Codex CLI: OpenAI’s official command-line tool (@openai/codex), where GPT‑5.1-Codex-Max has already shipped.

  • IDE extensions: likely developed or maintained by OpenAI; the announcement does not name any specific third-party IDE integrations.

  • Interactive coding environments: such as those used to demonstrate front-end simulation apps like the CartPole and Snell’s Law explorers.

  • Internal code review tooling: used by OpenAI’s engineering team.

At this time, GPT‑5.1-Codex-Max is not yet available via the public API, though OpenAI says API access is coming soon. For now, users who want to work with the model in a terminal can do so by installing and using the Codex CLI.
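
As a rough illustration of that terminal workflow, here is a minimal Python sketch that drives the CLI from a script. It assumes the package is installed globally with `npm install -g @openai/codex`, that the CLI has been authenticated via `codex login`, and that the installed version supports a non-interactive `exec` subcommand; exact subcommands and flags may vary between releases.

```python
# Minimal sketch of scripting the Codex CLI (assumes `npm install -g @openai/codex`
# and a prior `codex login`; subcommand names may differ between CLI versions).

import subprocess


def run_codex_task(prompt: str, cwd: str = ".") -> str:
    """Run a single non-interactive Codex task in `cwd` and return its output."""
    result = subprocess.run(
        ["codex", "exec", prompt],  # `exec` = assumed non-interactive mode
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_codex_task("Summarize the failing tests in this repository."))
```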

It is currently unconfirmed whether, or how, the model will be integrated into third-party IDEs other than through the CLI or a future API.

The model can interact with live tools and simulations. Examples shown in the release include:

  • An interactive CartPole policy-gradient simulator, which visualizes reinforcement learning training and activations.

  • A Snell’s Law optical explorer, which supports dynamic ray tracing across refractive indices (a short worked example of the underlying calculation follows this list).
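
OpenAI has not released source code for these demos, but the physics behind the Snell’s Law explorer is simple enough to show directly. The snippet below is an illustrative Python sketch of that calculation, not code from the demo itself.

```python
# Snell's law: n1 * sin(theta1) = n2 * sin(theta2). A ray-tracing explorer applies
# this at every interface; total internal reflection occurs when no real theta2 exists.

import math


def refract(theta1_deg: float, n1: float, n2: float) -> float | None:
    """Return the refraction angle in degrees, or None on total internal reflection."""
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None  # no transmitted ray
    return math.degrees(math.asin(s))


# Air (n=1.00) into glass (n=1.52) at 45 degrees: the ray bends toward the normal.
print(refract(45.0, 1.00, 1.52))   # ~27.7 degrees
# Glass back into air at 60 degrees exceeds the critical angle (~41 degrees).
print(refract(60.0, 1.52, 1.00))   # None -> total internal reflection
```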

These interfaces demonstrate the model’s ability to reason in real time while maintaining an interactive development session, effectively bridging computation, visualization, and implementation within a single loop.

Cybersecurity and safety constraints

Although GPT‑5.1-Codex-Max does not meet the “high” cybersecurity capability threshold under OpenAI’s Preparedness Framework, it is the most capable cybersecurity model OpenAI has deployed to date. It supports use cases such as automated vulnerability detection and remediation, but runs with strict sandboxing and with network access disabled by default.

OpenAI has not reported any increase in large-scale malicious use, but it has implemented enhanced monitoring, including mechanisms for routing flagged activity for review and suspending suspicious behavior. Unless developers opt into broader access, Codex remains isolated to the local workspace, which reduces risks such as prompt injection from untrusted content.

Deployment context and developer usage

GPT‑5.1-Codex-Max is currently available to users on ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. It also becomes the new default in Codex environments, replacing GPT‑5.1-Codex.

OpenAI states that 95% of its internal engineers use Codex on a weekly basis, and that since its introduction, these engineers have submitted up to 70% more pull requests on average, highlighting the impact this tool has on internal development velocity.

OpenAI emphasizes that, despite its autonomy and persistence, Codex-Max should be treated as a coding assistant rather than a replacement for human review. The model produces terminal logs, citations of test results, and tool-call outputs to support transparency in the code it generates.

Outlook

GPT‑5.1-Codex-Max represents a significant evolution in OpenAI’s strategy for agentic development tools, delivering greater reasoning depth, token efficiency, and interactivity across software engineering tasks. By extending context management through compression, the model is positioned to handle tasks at the scale of complete repositories rather than individual files or snippets.

With its continued focus on agentic workflows, secure sandboxing, and real-world metrics, Codex-Max points toward the next generation of AI-assisted programming environments while underscoring the importance of oversight as these systems become increasingly autonomous.


