IT Brief Asia - Technology news for CIOs & IT decision-makers

OpenAI monitors internal coding agents for risky conduct

Fri, 20th Mar 2026

OpenAI has detailed a monitoring system for its internal coding agents, which it says is used across standard internal deployments.

The system reviews agent interactions after sessions end and looks for behaviour that appears inconsistent with a user's intent or with internal security and compliance rules. Built with OpenAI's GPT-5.4 Thinking model, it analyses both tool use and the models' internal reasoning before sending suspicious cases for human review.

The disclosure offers a rare look at how a leading AI developer is trying to oversee increasingly autonomous software agents inside its own operations. Internal coding agents are a particularly sensitive test case because they can access select company systems, inspect the documentation and code behind safeguards, and, in some cases, try to alter those safeguards.

According to OpenAI, the latest version reviews interactions within 30 minutes of completion, assigns a category and severity level, and alerts teams so they can investigate. Lower latency would improve the system's security value and could eventually allow checks before actions are taken rather than after the fact.

OpenAI says it has run the system for five months and monitored tens of millions of internal coding trajectories. In that period, no real conversations triggered the highest severity level, which is designed to capture rare and more serious forms of misalignment such as coherent scheming.

About 1,000 conversations triggered moderate-severity alerts. Many of those cases came from deliberate internal red-teaming, and they were escalated for review by members of OpenAI's Safety Oversight and Security Incident Response and Investigation teams.

One of the clearest claims in the disclosure concerns relative performance. OpenAI says the monitor flagged every interaction that employees separately reported through internal channels and also identified additional incidents that might otherwise have escaped notice.

Observed behaviour

OpenAI's findings suggest the models can be too willing to work around restrictions in pursuit of a user's stated goal, particularly when the wording of a request nudges them in that direction. At the same time, the company says it has so far found no evidence of motivations beyond the original task, such as self-preservation or scheming.

Among the cases described, one involved an agent that hit an "Access is denied" error and then tried a series of ways to evade the restriction. The agent attempted to obscure suspicious content and split the construction of a payload into smaller steps so that no single step looked obviously problematic, before eventually switching to an approach that complied with the constraints.

That case led OpenAI to revise the developer prompt because the original wording had encouraged the behaviour. The change reduced, but did not eliminate, the frequency of similar incidents.

Coverage is not complete. A very small share of traffic, less than 0.1%, sits outside the monitor, mainly in bespoke or local setups, and OpenAI says it is working to close that gap.

Privacy limits

The monitor sees the full conversation history in standard internal coding deployments, including user, developer and assistant messages, chains of thought, tool calls and tool outputs. Only interactions that appear problematic are surfaced to humans, which OpenAI argues helps limit broader human review while still allowing intervention where needed.

That design touches on one of the more contested areas in AI oversight. Monitoring internal reasoning traces may help identify risky conduct that is hard to detect from outputs alone, but it also raises questions about whether future models will remain as easy to inspect and whether organisations can preserve privacy and security controls around such sensitive logs.

OpenAI acknowledged those limits directly. Some kinds of misbehaviour may be easier to detect from reasoning and tool traces than others, and it cannot yet confidently measure false-negative rates on open-ended real-world traffic without more targeted control evaluations and red-teaming.

"Within its current limits, the system has already been practically useful: as described above, it has surfaced additional cases that likely would have gone unnoticed, and we have leveraged its findings to make concrete improvements (including updates to related safeguards and the system message)," OpenAI said.

OpenAI treats monitoring as one part of a broader set of controls rather than a complete solution. It pointed to separate evaluation work, including tests in which an agent tries to sabotage code changes. Its latest monitor achieved perfect performance on that task, although the company noted that the exercise is easier for the monitor than real-world deployment in several respects.

Industry signal

The broader significance is that AI groups are moving from benchmark testing toward supervision of agents in live working environments. OpenAI argues that similar safeguards should become standard for internal coding-agent use across the industry, especially as these systems are given access to tools, repositories and workflows that resemble those used by human staff.

For now, the company's evidence suggests the biggest practical issue is not science-fiction-style intent but agents that try too hard to complete a task and slip into rule-bending behaviour along the way. Tens of millions of monitored trajectories produced no highest-severity incidents in real conversations, while roughly 1,000 moderate alerts were enough to trigger human investigations and changes to safeguards.