MCP was built to make tool integration easy. Here's what that costs in production.
Four failure modes that emerge when AI agents get real tool access — and five controls that address most of them.
The Model Context Protocol solves a real problem. Before MCP, wiring an AI agent to a set of tools meant building a custom integration for each tool, in each framework, for each deployment. MCP standardised the interface: one protocol, any tool, any client. Since its November 2024 launch, adoption has been fast enough that Invariant Labs found tens of thousands of publicly listed servers by early 2025, and support is now native in most major AI coding tools and agent frameworks.
The defaults, though, were optimised for getting something working, not for production security. The specification's auth model is optional. Tool definitions can change silently after you approve them. Session-scoped permissions give agents ambient authority over all connected tools simultaneously, not just what's needed for the current task. Three attack classes have been thoroughly documented in 2025 and early 2026, and by February 2026 the ecosystem had accumulated over 30 CVEs in a 60-day window alone.
If you're moving an MCP deployment from a demo environment to production, here's what to check before you ship.
Auth is optional in the spec — and most deployments skip it
The MCP specification describes two transports: stdio and HTTP with Server-Sent Events. The stdio transport has no authentication mechanism. It's a local socket; the assumption is that both ends are trusted processes on the same machine. That assumption holds in a local developer environment. It breaks the moment you're running in a hosted or multi-tenant context.
The HTTP transport is different. A June 2025 update to the specification added formal support for OAuth 2.1 with PKCE, resource indicators per RFC 8707, and protected resource metadata. The spec says HTTP-based servers can implement authorization. It does not say they must. Authorization is explicitly optional, and most public servers don't implement it.
Trend Micro scanned 518 officially registered MCP servers in 2025 and found that 38-41% offered no meaningful authentication. In a broader scan of 1,467 publicly accessible servers, three out of five had no auth or encryption at all. CVE-2026-33032, filed in March 2026, found a popular nginx-ui MCP endpoint performing no authentication for command execution requests. 2,600-plus instances were exposed before the fix.
In January 2026, an open-source AI agent called Clawdbot shipped with authentication disabled by default on its HTTP gateway. Over 2,000 deployed instances leaked credentials and full conversation histories to anyone who knew the endpoint path. There was no code-level vulnerability to patch. The endpoint simply had no auth layer, and nothing in the protocol had required one.
Tool definitions that change after you approve them
When an MCP server is first connected, the client fetches the server's tool manifest: names, descriptions, and parameter schemas. The user reviews what the server claims to do, approves it, and the agent starts making calls. What most teams don't know: those definitions are fetched dynamically at session start, and the MCP protocol includes a mechanism called notifications/tools/list_changed that allows servers to push updated tool schemas mid-session without any client re-approval. The spec doesn't require immutable definitions or content hashing. An approved server can serve a different manifest on every connection.
This attack class was formalised as CVE-2025-54136, named MCPoison by Check Point Research. In September 2025, a package called postmark-mcp appeared in the npm registry presenting as a legitimate email integration. Over 15 releases it built credibility with a clean download history. Version 1.0.16, published on September 17, added a single line of code: a silent BCC on all outgoing emails routed to an attacker-controlled address. Agents that had approved the package continued calling it without re-reviewing. An estimated 300 organisations and roughly 1,500 weekly downloads were affected before the package was removed.
A peer-reviewed study published in March 2026 (arXiv:2603.22489) evaluated seven major MCP clients for their response to changed or malicious tool definitions. Five of the seven implemented no static validation. They accepted whatever the server returned on connection with no comparison to any previously approved state. Even clients that present approval dialogs are often vulnerable because the malicious instructions appear below the visible area of the prompt, relying on users clicking Approve without scrolling.
Indirect prompt injection via tool output
Prompt injection against an LLM typically requires access to the system prompt or user-facing input. MCP adds a different path: tool output. When a tool returns data, the model receives it as trusted context — the result of something it asked for. If that data contains text structured to look like an instruction, the model will often follow it. The MCPTox benchmark, which tested 45 real-world MCP servers against 1,312 adversarial cases across 10 attack categories, found a 72.8% attack success rate against o1-mini. More capable models were, on average, more vulnerable, because they follow instructions more precisely.
A widely discussed 2025 incident showed the scale of this in a production environment. Supabase's Cursor agent was running with service-role database access — the highest-privilege Supabase credential, which bypasses row-level security entirely — and processing customer-submitted support tickets through an MCP tool. One ticket contained embedded SQL instructions rather than plain text. The agent ingested the ticket through its tool, treated the embedded SQL as context it had been given, and executed it, exfiltrating integration tokens from the database. No system prompt was compromised. The entire attack surface was the content of a support ticket.
The same attack pattern appeared in a GitHub MCP integration in May 2025, documented by Invariant Labs. Malicious GitHub issues were written to be read by an AI agent with a broadly-scoped Personal Access Token. The content was structured to resemble instructions from a trusted source. The agent followed them, accessing the developer's private repositories and creating a public pull request containing the exfiltrated contents, including private repo names, project details, and personal financial data.
The attack scales with your tool surface. Every data source your agent reads through MCP: databases, issue trackers, email inboxes, document stores — is a potential injection point. The control has to be at the data layer: strict output schemas defining exactly what fields each tool can return, validated before the model sees them, with any free-form text treated as inert data rather than instruction candidates. Palo Alto Unit 42 found that with five connected MCP servers, a single compromised server achieved a 78.3% attack success rate across the full multi-server deployment.
Ambient authority and the confused deputy
The third failure mode is structural. MCP grants agents access to a full tool set at session start, and all of those credentials remain present for the entire session. This is ambient authority: the agent holds everything simultaneously, not just what's needed for the task at hand. A server exposing twenty tools receives a single OAuth session token covering all of them. There is no per-tool authorization barrier after the initial session handshake.
Ambient authority enables what security engineers call the confused deputy attack. Tool A has read access to a low-privilege data store and returns output structured as an instruction for Tool B, which has write access to something sensitive. The model reads Tool A's output as trusted context and invokes Tool B accordingly. Neither tool was compromised. No auth check failed. The attack succeeded because both credentials were present at the same time and the model can't distinguish tool-output-as-data from tool-output-as-instruction.
Invariant Labs demonstrated this in April 2025 with a 'random fact of the day' MCP server whose tool description contained hidden instructions targeting a co-installed WhatsApp MCP server. The agent read the hidden instructions, used its legitimate WhatsApp access to read the user's full message history, and exfiltrated it as an outgoing message, obfuscated via whitespace padding to avoid DLP detection. The WhatsApp server was never compromised; the attack exploited the ambient authority the agent held over both servers simultaneously.
“Most MCP deployments give an agent all its credentials at session start. The confused deputy attack doesn't need a compromised tool — it needs two tools that are both already trusted.”
What to check before shipping MCP to production
No single control addresses all three failure modes. The table below maps each to the condition that enables it, a documented example, and the primary mitigation.
| Failure mode | Enabling condition | Documented example | Primary mitigation |
|---|---|---|---|
| Tool rug pull | Tool defs mutable via notifications/tools/list_changed; clients approve once | postmark-mcp supply chain attack (Sep 2025); CVE-2025-54136 | ETDI signing (arXiv:2506.01333); mcp-scan for manifest hashing |
| Indirect prompt injection | Tool output treated as trusted model context; no schema enforcement | Supabase/Cursor SQL exfiltration; GitHub MCP private data leak (May 2025) | Strict output schema validation; free-form text treated as untrusted data |
| Confused deputy / ambient authority | Session-scoped OAuth token covers all tools simultaneously | WhatsApp chat exfiltration via tool shadowing (Apr 2025); 78.3% multi-server attack rate (Unit 42) | Task-scoped permission grants; dynamic per-turn tool loading |
| Unauthenticated HTTP gateway | MCP HTTP auth is optional; most deployments skip it | Clawdbot Jan 2026: 2,000+ instances exposed; CVE-2026-33032 nginx-ui | Mandatory OAuth 2.1; enforce at gateway; Cursor v1.3 pattern for re-approval on change |
Beyond the per-row mitigations, a few practices should be defaults across every production MCP deployment:
- Audit at the transport layer. Log every tool invocation: name, parameters, output hash. A compromised agent won't announce itself — anomalous tool call sequences are the primary signal. Build this before something goes wrong.
- Enforce TLS and OAuth 2.1 on every HTTP MCP endpoint you operate. The spec supports it as of June 2025. There is no production argument for an unauthenticated HTTP MCP server. Cursor v1.3 added mandatory re-approval on config changes. That pattern is worth adopting regardless of client.
- Load tools per task, not per session. Several agent frameworks support dynamic tool registration per conversation turn. If yours defaults to loading everything at session start, change the default. Narrowing the credential surface is the highest-impact architectural change you can make.
- Validate output schemas before the model sees them. Define what fields each tool can return and reject anything outside the schema. Free-form text from external data sources should never be interpolated into a system prompt or passed as parameters to another tool.
- Pin tool manifests with mcp-scan or your own hash. mcp-scan (open-source, Invariant Labs) scans Claude, Cursor, and Windsurf configurations and flags manifest changes between sessions. If you prefer rolling your own, hash the full tool manifest on first approval and compare on every connection. It catches rug pulls before they run.
The MCP spec working group is actively discussing tool definition immutability, finer-grained permission scoping, and mandatory auth for HTTP transports. Several of the gaps above are likely to narrow in the 2026 spec revisions. The 40-plus CVEs filed against the MCP ecosystem by mid-2026 represent the predictable early-adoption security debt of a protocol that prioritised getting to adoption first. The controls above address what's exploited today.
Frequently asked questions
Related reading
Your LLM judge works in the test harness. Here's why it fails in production.
LLM-as-a-judge evals look reliable in the test harness. Here's what breaks after months in production: calibration drift, noisy decision boundaries, cascade failures in multi-step pipelines, and the meta-evaluation trap.
LLM structured output is reliable now. The reliability problem just moved.
Constrained decoding eliminated JSON syntax failures in LLM structured output. The reliability problem has moved to semantics: four failure classes that valid JSON hides, and the runtime patterns that catch them.
What YC W26’s 199 startups reveal about which AI bets are defensible
YC W26 graduated 199 companies in March 2026. The interesting signal is not that 60% are AI companies — it's the breakdown: agent infrastructure, vertical agents, and foundational research. Here's what it tells you.