As AI applications move from prototypes to production, the infrastructure connecting models to tools is becoming critical. MCP (Model Context Protocol) servers are at the center of this — they bridge AI clients and external capabilities like databases, APIs, and file systems. When an MCP server silently drops requests or slows down, users experience broken tool calls with no clear explanation. Monitoring these servers is no longer optional; it is a prerequisite for running AI-powered applications reliably.
This guide covers everything you need to know about MCP server monitoring: what makes it different from traditional service monitoring, which metrics matter, how to set up instrumentation, and how to build an alerting strategy that catches problems before your users do.
What is MCP server monitoring?
MCP server monitoring is the practice of collecting, analyzing, and alerting on operational data from Model Context Protocol servers. MCP is the open standard that defines how AI clients (like Claude Desktop, Cursor, or custom agents) communicate with tool servers — the services that give AI models the ability to search databases, call APIs, read files, and perform actions in the real world.
MCP servers occupy a uniquely difficult position in the observability stack. They sit between an AI client and whatever tools or data sources the client needs. When something goes wrong — a tool call times out, a response comes back malformed, a connection drops — the failure is often invisible to the end user. The AI client may silently retry, hallucinate an answer, or simply skip the tool call. Without instrumentation on the MCP server itself, you have no way to know this happened.
Three characteristics make MCP server monitoring different from traditional HTTP service monitoring:
- Non-HTTP transports. Many MCP servers communicate over stdio (standard input/output) rather than HTTP. Traditional APM tools that intercept HTTP requests will not see this traffic. You need monitoring that operates at the protocol level, not the transport level.
- Ephemeral connections. MCP connections are often tied to user sessions. A server might spin up when a user opens an AI assistant and shut down when they close it. Connection lifecycle tracking matters more here than in a typical long-running service.
- Tool-level granularity. An MCP server typically exposes multiple tools — each with different latency profiles, error modes, and dependencies. Monitoring at the server level alone is not enough. You need per-tool visibility to understand which specific capability is degrading.
Key metrics to track
Effective MCP server monitoring requires tracking four categories of metrics. Each provides a different lens on server health, and together they give you a complete picture.
Latency (p50, p95, p99)
Latency measures how long each MCP operation takes from request to response. For MCP servers, the critical distinction is measuring latency per tool call, not just per server.
A server might have an acceptable average latency of 200ms, but if one tool (say, a database query tool) regularly takes 3 seconds at the p99, users invoking that tool will experience noticeable delays. The p50 (median) tells you the typical experience; the p95 and p99 reveal the tail — the worst cases that affect a meaningful percentage of requests.
Track latency broken down by tool name, and watch for drift over time. A p95 that creeps up by 50ms per week is a slow regression that will eventually become a production incident.
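Percentiles per tool are straightforward to compute from raw durations. The sketch below (tool names and sample values are illustrative, not real traffic) shows why the median alone is misleading when a tool has a heavy tail:

```python
# Sketch: per-tool latency percentiles from raw durations in milliseconds.
# Tool names and numbers are illustrative, not from real traffic.
from statistics import quantiles

durations_ms = {
    "db_query": [180, 190, 200, 205, 210, 220, 250, 2900, 3050, 3100],
    "file_read": [11, 12, 12, 13, 13, 14, 14, 15, 15, 16],
}

for tool, samples in durations_ms.items():
    # quantiles(..., n=100) returns the 1st..99th percentile cut points
    pcts = quantiles(sorted(samples), n=100)
    p50, p95, p99 = pcts[49], pcts[94], pcts[98]
    print(f"{tool}: p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

For `db_query`, the p50 is a comfortable ~215ms while the p99 sits above 3 seconds — exactly the gap that a per-server average would hide.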
Error rate
Error rate is the percentage of MCP operations that result in an error response. Break this down along three dimensions: by error type (timeout, validation error, upstream failure), by tool, and by server.
A 2% overall error rate might look acceptable. But if 100% of errors are concentrated in a single tool that 30% of users depend on, the real impact is much worse than the number suggests. Tool-level error breakdowns prevent this blind spot.
Pay special attention to new error types. A sudden appearance of connection timeout errors, for example, often signals an infrastructure change upstream rather than a code bug.
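The blind spot described above is easy to demonstrate with a per-tool breakdown. This sketch uses made-up counts: 1,000 total requests with a 2% overall error rate, where every error lands on one tool:

```python
# Sketch: why tool-level error breakdowns matter. Counts are made up.
events = (
    [("search_docs", False)] * 960   # healthy tool
    + [("db_query", False)] * 20
    + [("db_query", True)] * 20      # every error concentrated here
)

total = len(events)
overall_error_rate = sum(is_err for _, is_err in events) / total

by_tool = {}
for tool, is_err in events:
    ok, bad = by_tool.get(tool, (0, 0))
    by_tool[tool] = (ok + (not is_err), bad + is_err)

print(f"overall: {overall_error_rate:.0%}")
for tool, (ok, bad) in by_tool.items():
    print(f"{tool}: {bad / (ok + bad):.0%} error rate")
```

The overall rate reads 2%, but `db_query` fails half the time — a server-level alert at 5% would never fire.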
Throughput
Throughput measures the volume of MCP operations your servers handle — typically expressed as requests per minute. Track this by operation type (tool calls, resource reads, prompt completions) and by individual tool.
Throughput is your early warning system for two problems. A sudden spike may indicate a retry storm — an AI client hammering a failing tool. A sudden drop may mean connections are silently failing and requests are never reaching your server.
Throughput baselines also help with capacity planning. If you know your MCP server handles 500 tool calls per minute at peak, you can set infrastructure thresholds accordingly.
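Both failure modes — retry storms and silent drops — can be flagged by comparing current throughput to a baseline. The ratios below (3x for a spike, 0.3x for a drop) are illustrative choices, not MCPWatch defaults:

```python
# Sketch: classify current throughput against a rolling baseline.
# The 3.0x / 0.3x thresholds are illustrative, not product defaults.
def classify_throughput(current_rpm: float, baseline_rpm: float) -> str:
    if baseline_rpm <= 0:
        return "no-baseline"
    ratio = current_rpm / baseline_rpm
    if ratio > 3.0:
        return "spike"   # possible retry storm: client hammering a failing tool
    if ratio < 0.3:
        return "drop"    # connections may be silently failing upstream
    return "normal"

print(classify_throughput(2100, 500))  # spike
print(classify_throughput(100, 500))   # drop
```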
Server availability
Server availability tracks whether your MCP servers are reachable and accepting connections. For long-running servers, this is typically a heartbeat check — a periodic signal confirming the server is alive. For stdio-based servers that are tied to user sessions, availability is tracked through connection lifecycle events: when did the server start, when did it last respond, and when did it disconnect.
A server that is running but not responding to MCP requests is worse than one that is cleanly down, because the AI client may continue sending requests into a black hole. Heartbeat monitoring with a tight timeout (30 to 60 seconds) catches this.
Setting up monitoring with MCPWatch
MCPWatch provides SDKs for both TypeScript and Python that instrument your MCP server with a single instrument() call. Once instrumented, every tool call, resource read, and error is captured and sent to the MCPWatch platform for analysis.
Installation
Install the SDK in your MCP server project:
```shell
# TypeScript
npm install @mcpwatch/sdk

# Python
pip install mcpwatch
```
TypeScript setup
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { instrument } from "@mcpwatch/sdk";

const server = instrument(
  new McpServer({ name: "my-server", version: "1.0.0" }),
  {
    apiKey: process.env.MCPWATCH_API_KEY,
    endpoint: "https://api.mcpwatch.dev",
  }
);
```
The instrument function wraps your server’s tool(), resource(), and prompt() registration methods, as well as connect() and close(). It captures timing, arguments, responses, and errors — then forwards everything to your server’s original handlers unchanged. Your server logic does not need to change at all.
Python setup
```python
import os

from mcp.server import Server
from mcpwatch import instrument

server = instrument(
    Server("my-server"),
    api_key=os.environ["MCPWATCH_API_KEY"],
    endpoint="https://api.mcpwatch.dev",
)
```
The Python SDK works the same way — a transparent wrapper that captures telemetry without altering your server’s behavior. It wraps tool() and resource() decorator methods on the server instance.
What gets captured
Once the SDK is active, MCPWatch automatically records:
- Events for every MCP operation, including tool calls, resource reads, prompt retrievals (TypeScript), and server notifications (TypeScript)
- Timing data for each event, including started_at, ended_at, and duration_ms
- Error details including the error type, code, message, and the full request context in which the error occurred
- Server lifecycle events including initialize and close, with transport type auto-detection (stdio, SSE, Streamable HTTP)
Events are batched client-side (up to 50 events or every 1 second, whichever comes first) and sent to the MCPWatch ingestion API. In development with debug: true, you can see batch submissions logged to the console.
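The batching behavior can be pictured as a buffer with two flush triggers: a size limit of 50 and an age limit of 1 second. This is an illustration of the policy described above, not the SDK's actual implementation:

```python
# Sketch of the batching policy: flush at 50 events or after 1 second,
# whichever comes first. Illustrative only, not the SDK's real code.
import time

class EventBatcher:
    def __init__(self, max_events=50, max_age_s=1.0, send=print):
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.send = send          # callable that ships a batch to the API
        self.buffer = []
        self.opened_at = None     # when the current batch started filling

    def add(self, event):
        if not self.buffer:
            self.opened_at = time.monotonic()
        self.buffer.append(event)
        if len(self.buffer) >= self.max_events:
            self.flush()          # size trigger

    def maybe_flush(self):
        # age trigger: called periodically by a timer
        if self.buffer and time.monotonic() - self.opened_at >= self.max_age_s:
            self.flush()

    def flush(self):
        self.send(self.buffer)
        self.buffer, self.opened_at = [], None
```

Batching like this trades a small delay (at most one second) for far fewer ingestion requests under load.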
Reading traces and waterfall views
Raw metrics tell you that something is wrong. Traces tell you why. The MCPWatch trace waterfall is the primary tool for understanding what happened during a specific MCP operation.
Anatomy of a trace
A trace represents a single end-to-end MCP operation. It is composed of spans — individual units of work arranged in a parent-child hierarchy. For example, a tool call trace might contain:
- A parent span representing the full tool call from request to response
- Child spans representing internal steps: argument validation, an HTTP request to an external API, response formatting, and the final MCP response
Each span has a start time, duration, and status (success or error). In the waterfall view, spans are rendered as horizontal timing bars, stacked vertically to show the hierarchy.
How to read the waterfall
The horizontal axis is time. The width of each bar shows how long that span took. Spans are indented under their parents, so you can see the call tree at a glance.
MCPWatch uses color coding to highlight performance characteristics. Spans that complete within normal latency thresholds appear in the default color. Spans that exceed the p95 threshold are highlighted to draw your attention — these are the ones worth investigating.
Error spans are visually distinct as well. When a child span fails and that error propagates to the parent, you can trace the failure path through the entire operation.
Common patterns to look for
Long bars indicate bottlenecks. If a tool call takes 2 seconds and you see that 1.8 seconds is spent in a single child span making an external API call, you have found your bottleneck. The fix might be caching, a timeout adjustment, or switching to a faster dependency.
Cascading errors show up as a chain of error-colored spans from a child up through the parent. This pattern typically indicates that an upstream dependency failed and the error bubbled up through your server. The deepest error span in the chain is usually the root cause.
Gaps between spans — visible white space in the waterfall — indicate time spent outside of instrumented code. This might be queue wait time, garbage collection, or uninstrumented code paths. If you see significant gaps, consider adding custom spans to cover them.
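Gap time is just the parent's duration minus the time covered by its children. A minimal sketch, assuming child spans are sorted and non-overlapping (span values are illustrative):

```python
# Sketch: uninstrumented "gap" time inside a parent span.
# Assumes children are sorted, non-overlapping (start_ms, end_ms) pairs.
def gap_time_ms(parent, children):
    covered = sum(end - start for start, end in children)
    return (parent[1] - parent[0]) - covered

parent = (0, 2000)                    # 2-second tool call
children = [(0, 100), (150, 1900)]    # validation, then an external API call
print(gap_time_ms(parent, children))  # 150
```

Here 150ms of the 2-second call happens outside any instrumented span — a candidate for a custom span.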
Fan-out patterns where a parent span has many parallel child spans indicate concurrent operations. These are normal for tools that aggregate data from multiple sources, but watch for cases where one slow child holds up the entire response.
Configuring alerts for production
Dashboards are useful for investigation, but alerts are what keep production running. A good alerting strategy catches real problems quickly without flooding your team with noise. Start with three essential alerts and expand from there.
Error rate spike
Configure an alert that fires when the error rate exceeds 5% for 5 consecutive minutes. This threshold avoids triggering on individual transient errors while catching sustained problems quickly.
The 5-minute window is important. A single failed request out of 20 is a 5% error rate for that minute, but it is probably just noise. Five minutes of sustained errors is a real problem that needs attention.
Set this alert at both the server level and the tool level. A tool-specific error spike (one tool at 20% errors while the server overall is at 3%) is easy to miss with only a server-level alert.
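The "5% for 5 consecutive minutes" rule is a streak check over per-minute error rates. A sketch of that evaluation logic (the function name and shape are illustrative, not a MCPWatch API):

```python
# Sketch: fire only after N consecutive minutes above threshold.
# Function name and signature are illustrative, not a real API.
def should_alert(per_minute_rates, threshold=0.05, consecutive=5):
    streak = 0
    for rate in per_minute_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False

print(should_alert([0.06, 0.07, 0.08, 0.09, 0.10]))  # True: 5 bad minutes
print(should_alert([0.06, 0.00, 0.08, 0.09, 0.10]))  # False: one clean minute resets the streak
```

A single noisy minute resets the streak, which is exactly what keeps transient blips from paging anyone.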
Latency degradation
Configure an alert that fires when the p95 latency exceeds your threshold for 10 consecutive minutes. The longer window here is intentional — latency is noisier than error rate, and a 10-minute sustained degradation filters out spikes caused by cold starts or temporary load.
Set the threshold based on your observed baseline, not an arbitrary number. If your tool call p95 is normally 400ms, an alert at 800ms (2x baseline) is reasonable. Setting it at 100ms when your normal p95 is 400ms will generate constant false alarms.
Server disconnection
Configure an alert that fires when a server’s heartbeat has been missing for more than 60 seconds. For MCP servers — especially stdio-based ones — a missing heartbeat usually means the server process has crashed or the connection has been severed.
This alert should have the highest urgency. A disconnected server means complete loss of tool access for any AI clients connected to it. Unlike latency or error rate degradation, there is no partial functionality — it is fully down.
Notification channels and cooldown
MCPWatch supports email and webhook notification channels. For most teams, the right setup is:
- Email for latency degradation alerts — these need investigation but are rarely urgent enough for immediate interruption
- Webhooks (to Slack, PagerDuty, or your incident management tool) for error spikes and server disconnections — these need prompt attention
Set cooldown periods on every alert. A cooldown prevents the same alert from firing repeatedly while the condition persists. A 15-minute cooldown is a reasonable default. Without cooldowns, a sustained outage will generate an alert every evaluation interval, quickly leading to alert fatigue and ignored notifications.
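The cooldown gate itself is a single timestamp comparison. A sketch with the 15-minute default mentioned above (names are illustrative):

```python
# Sketch: a cooldown gate. While the condition persists, the same alert
# is suppressed until COOLDOWN_S has elapsed since the last notification.
COOLDOWN_S = 15 * 60

def should_notify(last_fired_at, now):
    """Allow a notification if the alert never fired, or the cooldown elapsed."""
    return last_fired_at is None or (now - last_fired_at) >= COOLDOWN_S

print(should_notify(None, 0))    # True: first fire is always delivered
print(should_notify(0, 600))     # False: only 10 minutes since last fire
print(should_notify(0, 900))     # True: cooldown elapsed
```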
Best practices
These recommendations come from patterns we have seen across teams running MCP servers in production. They are not theoretical — each one addresses a real failure mode.
Start with one server, expand gradually. Instrument your most critical MCP server first. Get comfortable with the traces, learn what normal looks like, set your baseline thresholds, and then expand to additional servers. Trying to monitor everything at once leads to alert noise and dashboard overload before you have built the judgment to interpret the data.
Monitor tool-level latency, not just server-level. A server-level p95 of 500ms might hide the fact that one tool averages 50ms and another averages 2 seconds. Per-tool latency breakdowns are essential for understanding the actual user experience, because AI clients invoke specific tools — not servers.
Set alert thresholds based on your baseline, not arbitrary numbers. Run your monitoring for at least one week before configuring alerts. Use the observed p50, p95, and p99 values as your baseline. Alert thresholds derived from real data produce far fewer false positives than round numbers picked out of the air.
Use event sampling in high-throughput environments. If your MCP server handles thousands of requests per minute, capturing every single event is expensive and often unnecessary. Both SDKs accept a sampleRate (TypeScript) or sample_rate (Python) option between 0.0 and 1.0 — for example, sampleRate: 0.1 captures 10% of events. Note that errors are still captured when they occur within a sampled event, so reducing the sample rate does not hide failures.
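The error-preserving sampling described above can be sketched as a single decision function — errors bypass the sampling roll entirely (this illustrates the behavior, not the SDK's internals):

```python
# Sketch of error-preserving sampling: errors are always captured,
# successes are captured at sample_rate. Illustrative, not SDK internals.
import random

def should_capture(is_error: bool, sample_rate: float, rng=random.random) -> bool:
    if is_error:
        return True              # errors bypass sampling entirely
    return rng() < sample_rate   # e.g. 0.1 keeps ~10% of successes
```

At `sample_rate=0.1` you keep roughly a tenth of the success telemetry while every failure still reaches the dashboard.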
Review analytics weekly to catch slow regressions. A latency increase of 20ms per week will not trigger an alert on any given day, but after two months your p95 has degraded by 160ms. A weekly review of trend charts catches these slow-moving problems that are invisible to threshold-based alerts.
Keep alert channels focused. Do not send every alert to a shared Slack channel. Route error spikes and disconnections to the on-call channel. Route latency degradation to an async review channel. If everything is urgent, nothing is urgent.
Conclusion
MCP server monitoring is a distinct discipline from traditional API monitoring. The combination of non-HTTP transports, ephemeral connections, and tool-level granularity means you need purpose-built instrumentation to get meaningful visibility.
The approach outlined in this guide — instrument with an SDK, track the four core metrics, use trace waterfalls to diagnose issues, and set up focused alerts — gives you a solid foundation. Start with one server, learn what normal looks like for your workload, and build your alerting and review practices from there.
The teams that invest in MCP observability early spend less time debugging mysterious AI tool failures and more time building features. The ones that skip it inevitably find themselves guessing why tool calls fail at 2 AM with no data to work from.