How observability-mcp compares to adjacent tools
This page is an honest, source-cited comparison of observability-mcp against the three categories of tools that most often prompt the question "is this just X?": agentic incident-response platforms (Datadog AI / Bits AI), open-source LLM-driven SRE assistants (HolmesGPT), and OSS observability automation (Robusta).
Every cell links to the public source that backs the claim. No invented numbers. Snapshot date: 2026-05. Things change — open a PR (or an issue) if a cell goes stale and we will fix it.
What this page is not. It is not a head-to-head benchmark of these products. Each occupies a different shape of problem (managed SaaS vs. self-hosted, alert-first vs. agent-first, one backend vs. many). The comparison is meant to help you decide where each fits, not which one "wins".
At a glance
| Dimension | observability-mcp | Datadog Bits AI | HolmesGPT | Robusta OSS |
|---|---|---|---|---|
| License | Apache 2.0 [1] | Proprietary SaaS [5] | MIT [8] | MIT [9] |
| Self-hosted | Yes (single binary / Docker / Helm) [2] | No (cloud SaaS) [5] | Yes [8] | Yes [9] |
| MCP-native (exposes Streamable HTTP / stdio MCP server) | Yes: 8 tools, full Streamable HTTP transport [2] | No first-party MCP server documented as of 2026-05; Datadog's agent answers in-app [6] | No: Python tool-calling library [8] | No: Slack/web UI focus [9] |
| Topology-aware reasoning (graph tools the LLM can call) | Yes: `get_topology` + `get_blast_radius` over a generic `kind`/`relation` vocabulary [3] | Limited: Datadog has service maps, but not as agent-callable structured tools at the MCP layer [7] | No: focused on Kubernetes events + log/metric retrieval [8] | Partial: Kubernetes-event correlation, but no LLM-callable graph traversal [9] |
| Reproducible RCA benchmark in-tree | Yes: `scripts/benchmark-rca.mjs`, raw JSON in `docs/benchmark-results/`, three-scenarios doc [4] | No public reproducible accuracy benchmark | No public benchmark of comparable shape (deterministic local model, A/B with vs without tools) | No public benchmark of comparable shape |
| Multi-backend (one server, several observability backends) | Yes: Prometheus, Loki, Kubernetes, Tempo, pluggable [2] | N/A: single vendor [5] | Yes (Prometheus, Loki, K8s, … via separate adapters) [8] | Kubernetes-first, integrates Prometheus + others as alert sources [9] |
| Local LLM support (Ollama / vLLM / self-hosted) | Yes: agent ships with Ollama wiring; no cloud calls required [2] | No: Bits AI is hosted by Datadog [6] | Yes: supports many backends incl. Ollama [8] | Yes: supports OpenAI-compatible, incl. local [9] |
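
To make the "topology-aware reasoning" row concrete, here is a minimal conceptual sketch of what a blast-radius query computes: a reverse traversal over typed edges. It is not the project's implementation. The entity names, the relation labels, and the `blast_radius` helper are all hypothetical, chosen only for illustration.

```python
from collections import deque

# Hypothetical edges as (source, relation, target): "source relies on target".
# Names and relation labels are illustrative, not the project's vocabulary.
EDGES = [
    ("pod/checkout-7d9", "runs_on", "node/worker-2"),
    ("pod/cart-5c1", "runs_on", "node/worker-2"),
    ("service/checkout", "routes_to", "pod/checkout-7d9"),
    ("service/cart", "routes_to", "pod/cart-5c1"),
    ("service/frontend", "depends_on", "service/checkout"),
    ("service/payments", "routes_to", "pod/payments-a11"),
    ("pod/payments-a11", "runs_on", "node/worker-3"),
]


def blast_radius(edges, failed):
    """Return every entity that transitively relies on `failed`."""
    # Index edges in reverse: target -> sources that rely on it.
    dependents = {}
    for source, _relation, target in edges:
        dependents.setdefault(target, []).append(source)

    # Breadth-first walk outward from the failed entity.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for source in dependents.get(current, []):
            if source not in affected:
                affected.add(source)
                queue.append(source)
    return affected


print(sorted(blast_radius(EDGES, "node/worker-2")))
```

In this toy graph, losing `node/worker-2` takes down two pods, the two services routed to them, and the frontend that depends on checkout, while `service/payments` on `node/worker-3` is unaffected. A question like this is awkward to answer from flat log and metric retrieval alone, which is the gap the table row describes.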
When each is the right pick
observability-mcp is the right pick when:

- You already run Prometheus + Loki (and maybe Tempo, Kubernetes) and want one MCP endpoint your agent talks to, not one per backend.
- You care about topology-shaped questions: "if this pod's node dies, who else falls over?", "what other services depend on this DB?".
- You want a reproducible accuracy benchmark you can re-run on your own hardware before believing the marketing.

Datadog Bits AI is the right pick when:

- You are already deep in the Datadog ecosystem (APM, logs, infra, RUM).
- You accept SaaS-only and per-host / per-GB pricing.
- You want a polished in-product agent UX without operating infrastructure.

HolmesGPT is the right pick when:

- You want a Python-native, code-first investigation library to embed in your own runbook / Slack bot.
- You're investigating mostly Kubernetes events + Prometheus alerts.
- You're comfortable with a tool-calling library (not an MCP server).

Robusta is the right pick when:

- Your primary surface is Slack / web UI, and most alerts come from Kubernetes / Prometheus AlertManager.
- You want pre-built playbooks for common K8s incidents.
Why we built this anyway
The above tools all exist and several are excellent at their shape. We built observability-mcp because none of them combine all three of:
- MCP-native: any MCP-speaking agent (Claude Code, Claude Desktop, Cursor, custom) can use it with one `.mcp.json` line, not a wrapper.
- Topology-aware at the tool layer: not as a UI feature buried in a dashboard, but as a tool the LLM can call mid-investigation.
- Honest, reproducible accuracy benchmark in the repo: not a marketing slide, but raw JSON outputs alongside the harness so anyone can re-run it.
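
For a sense of what "a tool the LLM can call" looks like on the wire, the sketch below builds the JSON-RPC 2.0 request an MCP client sends for the protocol's `tools/call` method. The tool name `get_blast_radius` comes from the comparison table; the argument names (`entity`, `max_depth`) and the `build_tool_call` helper are assumptions made for illustration. The real input schema is whatever the server advertises through `tools/list`.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request body for MCP's `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: check the server's `tools/list` response for
# the actual parameter names before relying on these.
request = build_tool_call(
    1,
    "get_blast_radius",
    {"entity": "node/worker-2", "max_depth": 3},
)

print(json.dumps(request, indent=2))
```

In practice an MCP-speaking agent generates this request itself once the server is registered, so nobody writes it by hand. It is shown here only to illustrate that the topology query is an ordinary structured tool call rather than a dashboard interaction.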
The benchmark headline (baseline 0/10 → topology 10/10 on a real
cross-namespace blast-radius question, llama3.1:8b, n=10) lives in
docs/benchmark-astronomy-shop.md. It is
deliberately scoped narrow: we do not claim universal speedup, and the
same doc shows scenarios where topology tools cost more without helping.
Sources
- `LICENSE` in this repo.
- `docs/topology-vocabulary.md`: the canonical `kind`/`relation` contract.
- `docs/benchmark-astronomy-shop.md`: methodology, with raw JSON in `docs/benchmark-results/`.