How observability-mcp compares to adjacent tools¶

This page is an honest, source-cited comparison of observability-mcp against three categories of tools people most often ask "is this just X?" about: agentic incident-response platforms (Datadog AI / Bits AI), open-source LLM-driven SRE assistants (HolmesGPT), and OSS observability automation (Robusta).

Every cell links to the public source that backs the claim. No invented numbers. Snapshot date: 2026-05. Things change — open a PR (or an issue) if a cell goes stale and we will fix it.

What this page is not. It is not a head-to-head benchmark of these products. Each occupies a different shape of problem (managed SaaS vs. self-hosted, alert-first vs. agent-first, one backend vs. many). The comparison is meant to help you decide where each fits, not which one "wins".

At a glance¶

Dimension	observability-mcp	Datadog Bits AI	HolmesGPT	Robusta OSS
License	Apache 2.0 ¹	Proprietary SaaS ⁵	MIT ⁸	MIT ⁹
Self-hosted	Yes (single binary / Docker / Helm) ²	No (cloud SaaS) ⁵	Yes ⁸	Yes ⁹
MCP-native (exposes Streamable HTTP / stdio MCP server)	Yes — 8 tools, full Streamable HTTP transport ²	No first-party MCP server documented as of 2026-05; Datadog's agent answers in-app ⁶	No — Python tool-calling library ⁸	No — Slack/web UI focus ⁹
Topology-aware reasoning (graph tools the LLM can call)	Yes — `get_topology` + `get_blast_radius` over a generic `kind`/`relation` vocabulary ³	Limited — Datadog has service maps, but not as agent-callable structured tools at the MCP layer ⁷	No — focused on Kubernetes events + log/metric retrieval ⁸	Partial — Kubernetes-event correlation, but no LLM-callable graph traversal ⁹
Reproducible RCA benchmark in-tree	Yes — `scripts/benchmark-rca.mjs`, raw JSON in `docs/benchmark-results/`, three-scenarios doc ⁴	No public reproducible accuracy benchmark	No public benchmark of comparable shape (deterministic local model, A/B with vs without tools)	No public benchmark of comparable shape
Multi-backend (one server, several observability backends)	Yes — Prometheus, Loki, Kubernetes, Tempo, pluggable ²	N/A — single vendor ⁵	Yes (Prometheus, Loki, K8s, … via separate adapters) ⁸	Kubernetes-first, integrates Prometheus + others as alert sources ⁹
Local LLM support (Ollama / vLLM / self-hosted)	Yes — agent ships with Ollama wiring; no cloud calls required ²	No — Bits AI is hosted by Datadog ⁶	Yes — supports many backends incl. Ollama ⁸	Yes — supports OpenAI-compatible, incl. local ⁹

When each is the right pick¶

observability-mcp is the right pick when: - You already run Prometheus + Loki (and maybe Tempo, Kubernetes) and want one MCP endpoint your agent talks to, not one per backend. - You care about topology-shaped questions: "if this pod's node dies, who else falls over?", "what other services depend on this DB?". - You want a reproducible accuracy benchmark you can re-run on your own hardware before believing the marketing.

Datadog Bits AI is the right pick when: - You are already deep in the Datadog ecosystem (APM, logs, infra, RUM). - You accept SaaS-only and per-host / per-GB pricing. - You want a polished in-product agent UX without operating infrastructure.

HolmesGPT is the right pick when: - You want a Python-native, code-first investigation library to embed in your own runbook / Slack bot. - You're investigating mostly Kubernetes events + Prometheus alerts. - You're comfortable with a tool-calling library (not an MCP server).

Robusta is the right pick when: - Your primary surface is Slack / web UI, and most alerts come from Kubernetes / Prometheus AlertManager. - You want pre-built playbooks for common K8s incidents.

Why we built this anyway¶

The above tools all exist and several are excellent at their shape. We built observability-mcp because none of them combine all three of:

MCP-native — so any MCP-speaking agent (Claude Code, Claude Desktop, Cursor, custom) can use it with one .mcp.json line, not a wrapper.
Topology-aware at the tool layer — not as a UI feature buried in a dashboard, but as a tool the LLM can call mid-investigation.
Honest, reproducible accuracy benchmark in the repo — not a marketing slide, raw JSON outputs alongside the harness so anyone can re-run it.

The benchmark headline (baseline 0/10 → topology 10/10 on a real cross-namespace blast-radius question, llama3.1:8b, n=10) lives in docs/benchmark-astronomy-shop.md. It is deliberately scoped narrow: we do not claim universal speedup, and the same doc shows scenarios where topology tools cost more without helping.

Sources¶

LICENSE in this repo. ↩
README.md in this repo. ↩↩↩↩
docs/topology-vocabulary.md — the canonical kind/relation contract. ↩
docs/benchmark-astronomy-shop.md — methodology + raw JSON in docs/benchmark-results/. ↩
https://www.datadoghq.com/pricing/ ↩↩↩
https://docs.datadoghq.com/bits_ai/ ↩↩
https://docs.datadoghq.com/tracing/services/services_map/ ↩
https://github.com/robusta-dev/holmesgpt ↩↩↩↩↩↩
https://github.com/robusta-dev/robusta ↩↩↩↩↩↩