
Access control overview

This page is the one-stop guide to the management-plane access controls shipped over the recent governance series (PRs #229–#235). Each layer is opt-in — the default deployment is single-user, no auth, exactly like the README quickstart promises.

The layers, in dependency order

| Layer | Env knob | Default | Detail doc |
| --- | --- | --- | --- |
| MCP bearer auth for the agent transport | OMCP_API_KEYS | anonymous | auth-and-tls.md |
| Web UI session login | OMCP_AUTH=basic + OMCP_USERS_FILE | anonymous | auth-basic.md |
| Web UI SSO via OIDC | OMCP_AUTH=oidc + OMCP_OIDC_* | anonymous | auth-oidc.md |
| Role-based permissions on the management API | (built-in viewer / operator / admin; role assigned via the user file's roles field) | only meaningful in basic mode | this doc, "Roles & permissions" |
| Policy engine for RBAC decisions | OMCP_RBAC_POLICY_FILE (YAML) or OMCP_OPA_URL (OPA HTTP) | built-in | policy-engines.md |
| Multi-tenancy (per-identity scope across audit / quotas / catalog) | OMCP_KEY_TENANTS + OMCP_OIDC_TENANT_CLAIM + user-file tenant field | single-tenant (default) | tenancy.md |
| MCP Products (curated tool bundles per agent / tenant) | OMCP_PRODUCTS_FILE (YAML) | empty catalog | products.md |
| Audit log of mutating /api/* requests | OMCP_MGMT_AUDIT_FILE | in-memory ring (500 entries) | this doc, "Audit log" |

Three adjacent controls fall under the same umbrella:

| Control | Env knob | Default | Detail doc |
| --- | --- | --- | --- |
| PII / secret redaction of query_logs output | OMCP_REDACTION | on | redaction.md |
| Per-identity rate limit on the /mcp transport | OMCP_TOOL_RATE_PER_MIN | 60 | this doc, "Rate limits" |
| Per-identity daily token budget on the MCP tool layer | OMCP_TOOL_DAILY_TOKENS (+ optional OMCP_TOKEN_BUDGET_FILE for restart survival) | 0 (uncapped) | this doc, "Token budget" |

Minimal production-ready setup

This is the smallest configuration that gives a multi-user team a sensible posture: signed sessions, an audit trail, redaction, and sliding-window per-identity caps.

```yaml
# values.yaml fragment (Helm): extraEnv is the chart's pass-through
# slot for ad-hoc env vars; see helm/observability-mcp/values.yaml.
extraEnv:
  - name: OMCP_AUTH
    value: basic
  - name: OMCP_USERS_FILE
    value: /etc/observability-mcp/users.json
  - name: OMCP_SESSION_SECRET
    valueFrom:
      secretKeyRef:
        name: omcp-session
        key: secret
  - name: OMCP_API_KEYS
    valueFrom:
      secretKeyRef:
        name: omcp-mcp-keys
        key: keys
  - name: OMCP_MGMT_AUDIT_FILE
    value: /var/log/omcp/audit.jsonl
  - name: OMCP_TOOL_RATE_PER_MIN
    value: "120"
  # - name: OMCP_REDACTION
  #   value: "on"   # default
  # OMCP_AUTH_ALLOW_FALLBACK is intentionally absent: boot must fail
  # closed if the users file is missing.
```

Mint users with the bundled helper (no host npm install required — the script uses only node built-ins):

```bash
node scripts/hash-password.mjs alice --name "Alice" --roles operator
node scripts/hash-password.mjs bob --name "Bob" --roles viewer
```

Paste both JSON blocks into users.json's users: array and mount the file read-only.

Roles & permissions

The built-in policy ships three roles. The full table is in mcp-server/src/auth/rbac.ts; the short version:

viewer operator admin
Read sources / services / health / topology / settings / connectors / audit / catalog
Write sources / settings / health-thresholds
Write connectors (upload / install)
Delete sources / users

Every mutating /api/* route asks need(resource, action) before it runs. A 403 from the gate carries { code: "OMCP_PERMISSION_DENIED", required: {…}, have: […] } so the client can render a useful message rather than a generic "forbidden".
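As a rough illustration of the gate described above, here is a minimal sketch of a role-to-permission check that produces a denial body in the documented shape. The function name `checkPermission`, the `EXAMPLE_POLICY` grants, and the exact contents of `required` and `have` are assumptions made for this example; the real table and gate live in mcp-server/src/auth/rbac.ts and may differ.

```javascript
// Hypothetical policy: role -> list of "resource:action" grants.
// These grants are invented for the example, not copied from rbac.ts.
const EXAMPLE_POLICY = {
  viewer: ["sources:read", "audit:read"],
  operator: ["sources:read", "sources:write", "audit:read"],
};

// Returns null when access is allowed, or a 403-style body when denied.
function checkPermission(policy, roles, resource, action) {
  const required = `${resource}:${action}`;
  // Union of the grants of every role the caller holds, de-duplicated.
  const have = [...new Set(roles.flatMap((role) => policy[role] ?? []))];
  if (have.includes(required)) return null;
  return {
    code: "OMCP_PERMISSION_DENIED",
    required: { resource, action }, // assumed shape of `required`
    have,
  };
}

console.log(checkPermission(EXAMPLE_POLICY, ["viewer"], "sources", "write"));
```

Returning the caller's effective grants alongside the missing one is what lets a client say which permission is missing instead of showing a bare "forbidden".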

The session payload's roles field is also surfaced at GET /api/me under permissions: […] so the Web UI hides write controls (Add Source, Save Settings, etc.) the current user can't operate. The server is still the authoritative gate — UI hiding is purely a UX win.

Admins debugging "why did role X get a 403?" can pull the full active policy without a source checkout:

```bash
curl -s -b "omcp_session=$ADMIN_COOKIE" "$URL/api/policy" | jq .
# { "policy": { "viewer": [...], "operator": [...], "admin": [...] },
#   "roles": ["viewer", "operator", "admin"], "note": "..." }
```

Audit log

Every mutating /api/* request produces one append-only entry with actor + resource + action + status + IP + the optional :name path parameter as target. Entries are hash-chained: each entry's hash covers the previous entry's hash, so scripts/verify-audit.mjs can prove the log hasn't been silently truncated or reordered:

```bash
node scripts/verify-audit.mjs /var/log/omcp/audit.jsonl
# → { "ok": true, "entries": 1234, "tipHash": "…" }   (exit 0)
# or, on a tamper:
# → { "ok": false, "entries": 1234, "brokenAt": 42, "reason": "…" }   (exit 1)
```

The script uses only node built-ins (no node_modules) so it works straight from a source checkout on an air-gapped operator workstation.
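To make the hash-chain idea concrete, the following is a minimal sketch of appending to and verifying such a chain with Node's built-in crypto module. The field names `prevHash` and `hash`, the choice of SHA-256 over a JSON serialisation, and the function names are assumptions for illustration; the actual on-disk format used by the gateway and scripts/verify-audit.mjs is not shown in this document.

```javascript
const { createHash } = require("node:crypto");

// Hash an entry body together with the previous entry's hash (assumed scheme).
function entryHash(prevHash, body) {
  return createHash("sha256")
    .update(prevHash)
    .update(JSON.stringify(body))
    .digest("hex");
}

// Append a body to the log, linking it to the current tip.
function appendEntry(log, body) {
  const prevHash = log.length ? log[log.length - 1].hash : "";
  log.push({ ...body, prevHash, hash: entryHash(prevHash, body) });
}

// Walk the log from the start and recompute every link.
function verifyChainSketch(log) {
  let prevHash = "";
  for (let i = 0; i < log.length; i++) {
    const { hash, prevHash: recorded, ...body } = log[i];
    if (recorded !== prevHash || hash !== entryHash(prevHash, body)) {
      return { ok: false, entries: log.length, brokenAt: i, reason: "hash mismatch" };
    }
    prevHash = hash;
  }
  return { ok: true, entries: log.length, tipHash: prevHash };
}

const demoLog = [];
appendEntry(demoLog, { actor: "alice", action: "write", target: "payment-prod" });
appendEntry(demoLog, { actor: "bob", action: "delete", target: "old-source" });
console.log(verifyChainSketch(demoLog).ok);
```

In a scheme like this, editing or reordering an entry breaks the link that follows it. Dropping entries from the tail leaves a shorter chain that is still internally consistent, so catching that case relies on comparing the reported tipHash against a copy recorded elsewhere.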

For unattended cron monitoring, pass --quiet — success produces no stdout, failure still writes the { ok: false, brokenAt, reason } JSON. Pair with a non-zero-exit handler in your job runner:

```cron
0 * * * * node /opt/observability-mcp/scripts/verify-audit.mjs --quiet /var/log/omcp/audit.jsonl
```

  • File path: OMCP_MGMT_AUDIT_FILE (JSONL, append-only). Unset → an in-memory ring of the last 500 entries serves the same GET /api/audit endpoint, useful for the demo / single-user case.
  • Read access: audit:read permission (granted to viewer / operator / admin by default).
  • Surface: GET /api/audit?from=&to=&actor=&action=&limit= returns the most-recent-first slice plus tipHash. The Web UI's Audit Log page renders this alongside the entitlement-gate's MCP-tool audit feed.

Session revocation

Web UI sessions (OMCP_AUTH=basic or oidc) are stateless signed cookies — the gateway verifies the HMAC and trusts the payload, with no server-side session table. That keeps the auth path cheap and horizontally scalable, but it means a plain logout only clears the cookie in that browser: a copied cookie, or a session on a lost laptop, stays valid until its exp (12h default).

The revocation blocklist closes that gap. Every session carries a random sid, and every request consults an append-only blocklist before the session is trusted. Two shapes:

  • Revoke one session — by its sid. Read the current session's id from GET /api/me (the sid field), then:

    ```bash
    curl -s -b "omcp_session=$ADMIN_COOKIE" -X POST "$URL/api/auth/revocations" \
      -H 'content-type: application/json' \
      -d '{ "sid": "Xy7…", "reason": "stolen laptop" }'
    ```

  • Log a user out everywhere — by sub. Revokes every session for that subject issued so far; a fresh login afterwards (with valid credentials / a valid IdP assertion) is unaffected, so this is a force-re-login, not a permanent ban:

    ```bash
    curl -s -b "omcp_session=$ADMIN_COOKIE" -X POST "$URL/api/auth/revocations" \
      -H 'content-type: application/json' \
      -d '{ "sub": "alice@example.com", "reason": "offboarded" }'
    ```

GET /api/auth/revocations lists the current blocklist. Both endpoints are admin-gated (users:delete) and every revocation is written to the audit log.

  • Persistence: OMCP_AUTH_REVOCATION_FILE (JSONL, append-only, mode 0600). Unset → in-memory only, so the blocklist is lost on restart; set it so revocations survive a restart.
  • Multi-replica caveat: each replica loads the blocklist once at startup and the writing replica updates its own in-memory index immediately, but there is no live cross-replica propagation — a revocation issued on replica A is not seen by replica B until B restarts (or re-reads the file). For a single-replica gateway this is a non-issue. For a load-balanced fleet, either pin auth-plane requests to one replica, keep session TTLs short, or roll the deployment after a bulk revocation. A shared-store backend (Redis, like the SCIM and transport stores) is the planned path to live fleet-wide propagation.
  • A permanent ban is a directory operation, not a gateway one: remove / disable the user in OMCP_USERS_FILE or your IdP. The blocklist is for sessions, not accounts.
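The two revocation shapes above can be sketched as a single lookup. This is an illustrative model only: the field names `iat` and `revokedAt`, the use of a plain array as the blocklist, and the function name `isRevoked` are assumptions, not the gateway's actual data structures.

```javascript
// Each blocklist entry revokes either one session (sid) or every session a
// subject held at revocation time (sub). Timestamps share one unit (seconds).
function isRevoked(session, blocklist) {
  return blocklist.some((entry) => {
    if (entry.sid !== undefined) return entry.sid === session.sid;
    if (entry.sub !== undefined) {
      // Only sessions issued up to the revocation are blocked, so a fresh
      // login afterwards is accepted (force re-login, not a permanent ban).
      return entry.sub === session.sub && session.iat <= entry.revokedAt;
    }
    return false;
  });
}

const exampleBlocklist = [
  { sid: "s1", reason: "stolen laptop" },
  { sub: "alice@example.com", revokedAt: 1000, reason: "offboarded" },
];
console.log(isRevoked({ sid: "s2", sub: "alice@example.com", iat: 900 }, exampleBlocklist));
```

Comparing the session's issue time against the revocation time is one simple way to get the "log out everywhere, but allow a new login" behaviour without tracking every sid per user.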

Rate limits

The /mcp HTTP transport carries one per-identity sliding window: 60 requests/minute per named bearer-token caller by default. OMCP_TOOL_RATE_PER_MIN overrides — accepts any positive integer; unset / empty / non-numeric / 0 / negative all fall back to the default 60 (so an operator setting 0 to mean "disable" doesn't accidentally lock every caller out). To truly disable the per-identity cap — e.g. when an upstream gateway already enforces quotas — set it to off, none, unlimited, disabled, or false (case-insensitive). In that mode /api/usage reports limit: null for every identity. Anonymous /mcp traffic (no OMCP_API_KEYS) is unaffected; the existing IP-level express-rate-limit still applies.
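The parsing rules for OMCP_TOOL_RATE_PER_MIN can be summarised in a few lines. This is a sketch of the behaviour as described, not the server's code; the function name `parseRatePerMin` is hypothetical, and treating a non-integer such as "1.5" as invalid is an assumption.

```javascript
const DEFAULT_RATE_PER_MIN = 60;
const RATE_DISABLE_WORDS = new Set(["off", "none", "unlimited", "disabled", "false"]);

// Returns a positive integer cap, or null when the cap is disabled.
function parseRatePerMin(raw) {
  const value = (raw ?? "").trim().toLowerCase();
  if (RATE_DISABLE_WORDS.has(value)) return null;
  if (/^\d+$/.test(value)) {
    const n = Number(value);
    if (n > 0) return n;
  }
  // Unset, empty, non-numeric, 0, and negative values all land here.
  return DEFAULT_RATE_PER_MIN;
}

console.log(parseRatePerMin("120"), parseRatePerMin("0"), parseRatePerMin("OFF"));
```

Mapping 0 to the default rather than to "disabled" is what keeps an operator who sets 0 from locking out every caller.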

Every /mcp response carries the live bucket state in headers so a well-behaved client can self-pace before hitting the cap:

```http
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Window-Ms: 60000
```

A breached cap returns:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 17
Content-Type: application/json
```

```json
{
  "code": "OMCP_IDENTITY_RATE_LIMIT",
  "retryAfterSeconds": 17,
  "limit": 60,
  "windowMs": 60000
}
```

Granularity is per HTTP request, not per JSON-RPC message. A batched JSON-RPC request counts as one; a multi-tool LLM turn counts as N.
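For readers who want a mental model of the per-identity window, here is a minimal in-memory sliding-window limiter that emits the headers and 429 body shown above. The class name and internal layout are assumptions for illustration; the server's real limiter may store and prune its state differently.

```javascript
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // identity -> ascending request timestamps (ms)
  }

  // Counts one HTTP request for `identity` at time `now` (ms) if under the cap.
  check(identity, now) {
    const recent = (this.hits.get(identity) ?? []).filter(
      (t) => now - t < this.windowMs
    );
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(identity, recent);
    const headers = {
      "X-RateLimit-Limit": String(this.limit),
      "X-RateLimit-Remaining": String(this.limit - recent.length),
      "X-RateLimit-Window-Ms": String(this.windowMs),
    };
    if (allowed) return { status: 200, headers };
    // The next slot frees when the oldest counted request leaves the window.
    const retryAfterSeconds = Math.ceil((recent[0] + this.windowMs - now) / 1000);
    return {
      status: 429,
      headers: { ...headers, "Retry-After": String(retryAfterSeconds) },
      body: {
        code: "OMCP_IDENTITY_RATE_LIMIT",
        retryAfterSeconds,
        limit: this.limit,
        windowMs: this.windowMs,
      },
    };
  }
}

const demoLimiter = new SlidingWindowLimiter(2, 60000);
demoLimiter.check("agent-prod", 0);
demoLimiter.check("agent-prod", 1000);
console.log(demoLimiter.check("agent-prod", 43000).body);
```

Because the counter is keyed on the HTTP request, a batched JSON-RPC payload costs one slot in this model, while several separate tool calls cost one slot each.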

Live snapshot: GET /api/usage (gated by audit:read) returns the current windowed count per identity:

```json
{
  "identities": [
    { "actor": "agent-prod", "count": 14, "limit": 60, "windowMs": 60000 },
    { "actor": "ci", "count": 3, "limit": 60, "windowMs": 60000 }
  ],
  "defaultLimit": 60,
  "windowMs": 60000
}
```

Pass ?actor=<name> to inspect a single identity (count is 0 for identities the server has never seen).

Token budget

OMCP_TOOL_DAILY_TOKENS=<positive integer> enables a per-identity rolling 24-hour token cap. Tokens are estimated post-tool-execution (over-count by ~5% vs cl100k_base, so the gate errs on the strict side) and charged against the calling bearer-token credential. When the bucket would exceed the cap, the tool returns

```json
{
  "error": "OMCP_TOKEN_BUDGET_EXCEEDED",
  "tool": "query_logs",
  "used": 49800,
  "limit": 50000,
  "requested": 1200,
  "retryAfterSeconds": 73400,
  "freedAtRetry": 14000,
  "message": "Daily token budget exceeded (49800/50000 ...)."
}
```

instead of the data. The agent sees a parseable refusal, not a generic failure.

Anonymous /mcp traffic is not charged (the budget is per credential; an operator running without OMCP_API_KEYS has no identity-keyed bucket to charge). Three tools currently honour the gate: query_logs, query_metrics, get_service_health. Adding new high-token tools is a one-line chargeTokenBudget(result, ctx, "new_tool") wrap.

The retryAfterSeconds walks bucket history oldest-first until enough capacity has dropped to fit the denied request; freedAtRetry reports how many tokens that frees so a well-behaved agent can decide whether to back off or retry sooner with a smaller request.
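The oldest-first walk can be sketched as follows. The bucket representation (a list of `{ at, tokens }` charges) and the function name `retryHint` are assumptions for this example; only the described behaviour comes from the text above.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// `history` is an oldest-first list of { at, tokens } charges still inside
// the rolling 24-hour window; `now` and `at` are epoch milliseconds.
function retryHint(history, limit, requested, now) {
  const used = history.reduce((sum, charge) => sum + charge.tokens, 0);
  let freed = 0;
  for (const charge of history) {
    freed += charge.tokens; // this charge will have expired by retry time
    if (used - freed + requested <= limit) {
      const retryAfterSeconds = Math.max(
        0,
        Math.ceil((charge.at + DAY_MS - now) / 1000)
      );
      return { retryAfterSeconds, freedAtRetry: freed };
    }
  }
  return null; // the request alone exceeds the cap, so waiting never helps
}

const demoHistory = [
  { at: 0, tokens: 14000 },
  { at: 3600000, tokens: 35800 },
];
console.log(retryHint(demoHistory, 50000, 1200, 13000000));
```

With these made-up charges the sketch reproduces the figures in the sample refusal: 49800 tokens used, and 14000 freed once the oldest charge expires 73400 seconds later. The `null` branch corresponds to the case where a single response is larger than the whole daily cap.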

OMCP_TOKEN_BUDGET_FILE=<path> enables snapshot persistence — buckets reload at boot, so a server restart mid-day doesn't reset quotas. Writes are debounced (1s default) and atomic (write-rename). Unset → in-memory only, which is fine for demo / single-operator setups where a restart-on-each-deploy effectively rolls budgets.

Live snapshot at GET /api/usage (same gate as the rate-limit view) returns:

```json
{
  "identities": [
    {
      "actor": "agent-prod",
      "count": 14,
      "limit": 60,
      "windowMs": 60000,
      "tokens": { "used": 42100, "limit": 50000, "windowMs": 86400000 }
    }
  ],
  "tokens": { "defaultLimit": 50000, "windowMs": 86400000 }
}
```

The Web UI's Overview page shows the same data as a "Today's MCP usage" strip — top 5 identities sorted by token consumption, with a coloured progress bar that turns amber at 70% of the daily cap and red at 90%. The strip is hidden when no identity has any traffic yet, or the viewer lacks the audit:read permission.

Error codes

| Code | Meaning | Caller response |
| --- | --- | --- |
| OMCP_TOKEN_BUDGET_EXCEEDED | Identity is at the cap; this call would push over. | Wait retryAfterSeconds; freedAtRetry tells how much will be available. |
| OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET | Single response alone larger than the entire daily cap. | Retrying won't help: narrow the query (smaller window / lower limit / more selective filter) or raise the cap. |

Service catalog enrichment

When OMCP_SERVICE_CATALOG_FILE points at a JSON catalog (schema in mcp-server/src/catalog/loader.ts), every list_services / get_service_health / query_metrics derived response is decorated with .catalog = { owner, tier, onCall, slo, … }. The agent sees ownership context inline — no separate CMDB hop.

Without the env var, or when the file it points at is missing, the catalog is empty and enrichment is a no-op.
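The decoration step amounts to a lookup per service. The sketch below assumes the catalog is a plain object keyed by service name and that each response item carries a `name` field; both are guesses, since the real schema lives in mcp-server/src/catalog/loader.ts.

```javascript
// Attach catalog metadata to each service that has an entry; leave the
// others untouched. With an empty catalog this returns the input unchanged.
function enrichWithCatalog(services, catalog) {
  return services.map((svc) =>
    Object.hasOwn(catalog, svc.name)
      ? { ...svc, catalog: catalog[svc.name] }
      : svc
  );
}

const demoCatalog = { checkout: { owner: "payments", tier: 1 } };
console.log(enrichWithCatalog([{ name: "checkout" }, { name: "search" }], demoCatalog));
```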

Posture discovery

External dashboards and discovery probes (kube-state-metrics derived exporters, Helm chart annotations, Backstage plugins, etc.) often want to learn the deployment's governance shape without holding a session. GET /api/info ships a public governance block for exactly that:

```json
{
  "governance": {
    "authMode": "basic",
    "authSecretEphemeral": false,
    "auditPersisted": true,
    "catalogConfigured": true,
    "redaction": true,
    "trustProxy": true,
    "toolRatePerMin": 60
  }
}
```

Booleans + the rate-limit number only — no file paths, no session secret, no user counts. The expected alert is "this deployment silently reverted to anonymous mode" or "redaction is off in prod" — both visible from a single unauthenticated GET.
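A monitoring probe could compare the governance block against the posture it expects and alert on any difference. The helper below is a hypothetical sketch: the function name `postureDrift` and the idea of an `expected` object are inventions for this example, and fetching /api/info is left to the caller.

```javascript
// Returns one record per key whose live value differs from the expected one.
function postureDrift(governance, expected) {
  return Object.keys(expected)
    .filter((key) => governance[key] !== expected[key])
    .map((key) => ({ key, expected: expected[key], actual: governance[key] }));
}

const liveGovernance = {
  authMode: "basic",
  authSecretEphemeral: false,
  auditPersisted: true,
  catalogConfigured: true,
  redaction: false, // pretend redaction was switched off in prod
  trustProxy: true,
  toolRatePerMin: 60,
};
console.log(postureDrift(liveGovernance, { authMode: "basic", redaction: true }));
```

An empty result means the deployment matches the expected posture; any entry is a candidate alert.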

Behind a reverse proxy

By default req.ip is the raw socket address, so a fronting nginx / Envoy / ingress controller makes every audit entry look like 127.0.0.1. Set OMCP_TRUST_PROXY to one of:

| Value | Meaning |
| --- | --- |
| true | trust every upstream hop (Express default-on shape) |
| loopback | trust 127.0.0.1 / ::1 only (sensible same-host nginx default) |
| an integer | trust the last n hops |
| comma-separated IPs | explicit list of upstreams to trust |

Unset / false keeps the safe default. The same setting also fixes the Secure cookie attribute behind TLS-terminating proxies (the server detects HTTPS via req.secure || X-Forwarded-Proto).
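One plausible way to turn the env string into a value Express understands is sketched below. The function name `parseTrustProxy` is hypothetical and the server's own parsing may differ; Express's `trust proxy` setting does accept a boolean, a hop count, a named subnet string such as "loopback", or a list of addresses.

```javascript
// Convert the raw OMCP_TRUST_PROXY string into an Express-compatible value.
function parseTrustProxy(raw) {
  const value = (raw ?? "").trim();
  const lower = value.toLowerCase();
  if (value === "" || lower === "false") return false; // safe default
  if (lower === "true") return true;                   // trust every hop
  if (/^\d+$/.test(value)) return Number(value);       // trust the last n hops
  if (value.includes(",")) {
    // Explicit list of upstream addresses.
    return value.split(",").map((s) => s.trim()).filter(Boolean);
  }
  return value; // e.g. "loopback" or a single address
}

console.log(parseTrustProxy("10.0.0.1, 10.0.0.2"));
```

The result would typically be handed to `app.set("trust proxy", value)` on the Express app before any middleware reads `req.ip`.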

Investigation runbook

"Who changed source payment-prod yesterday?"

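```bash
# Send a session cookie: /api/audit needs audit:read once auth is enabled.
curl -s -b "omcp_session=$COOKIE" "$URL/api/audit?action=write&actor=alice&limit=50" \
  | jq '.entries[] | select(.target == "payment-prod")'
```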

"Why did Claude get 403 just now?"

The client's stderr / log shows the response body. Cross-check the permission grants for the user:

```bash
curl -s -b "omcp_session=$COOKIE" "$URL/api/me" \
  | jq '.permissions'
```

If the user's role is missing the resource:action they tried, update OMCP_USERS_FILE (add the right role to that user's roles array) and have them sign out + back in to refresh the cookie.

"Why are my logs returning [redacted-email]?"

The redactor is on by default. If the source is already PII-clean, disable it process-wide:

```yaml
env:
  OMCP_REDACTION: "off"
```

Per-request bypass requires two independent grants — both must align for a single tool call to skip redaction:

  1. RBAC permission redaction:bypass (admin role by default). This is the management-plane gate — visible via /api/policy and reflected on the Policy UI tab.
  2. Credential opt-in via OMCP_KEY_BYPASS_REDACTION=<key-names>. This is the data-plane gate — only credentials listed here may carry the bypass flag, and only when the call also sets bypass_redaction: true in the tool args.

Either gate alone keeps redaction on. The MCP tool currently honouring the per-call flag is query_logs; future high-PII tools will follow the same pattern.

Example: an agent credential authorised for live-incident debugging:

```yaml
env:
  OMCP_API_KEYS: "agent:tok_..."
  OMCP_KEY_BYPASS_REDACTION: "agent"   # data-plane allow-list
  # The RBAC side is automatic for admin role; for OIDC sessions the
  # IdP claim → role mapping decides whether the same identity also
  # sees the policy entry in the UI.
```

The agent then invokes query_logs with { ..., bypass_redaction: true } and gets unredacted payload bytes; without the env, the same arg is silently ignored and the response stays redacted.
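The two-gate rule reduces to a conjunction of three conditions. The sketch below is illustrative only: the function name, the parameter names, and the assumption that the allow-list has already been parsed into an array are not taken from the server code.

```javascript
// Redaction is skipped only when the RBAC grant, the credential allow-list,
// and the per-call flag all line up; any single miss keeps redaction on.
function shouldBypassRedaction({ permissions, credential, bypassKeys, args }) {
  const rbacGate = permissions.includes("redaction:bypass");
  const credentialGate = bypassKeys.includes(credential);
  const requested = args.bypass_redaction === true;
  return rbacGate && credentialGate && requested;
}

console.log(
  shouldBypassRedaction({
    permissions: ["redaction:bypass"],
    credential: "agent",
    bypassKeys: ["agent"],
    args: { bypass_redaction: true },
  })
);
```

Requiring the flag on every call means a credential that is allowed to bypass still gets redacted output unless it asks explicitly.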

"Caller hit a 429 on the /mcp transport"

The response body identifies the caller's identity bucket. To raise the cap process-wide:

```yaml
env:
  OMCP_TOOL_RATE_PER_MIN: "240"
```

For a per-credential cap override, set OMCP_KEY_RATE_PER_MIN:

```yaml
env:
  OMCP_TOOL_RATE_PER_MIN: "60"   # default for everyone else
  OMCP_KEY_RATE_PER_MIN: "agent=600;ci=240;noisy-bot=off"
```

The override syntax mirrors OMCP_KEY_TENANTS / OMCP_KEY_PRODUCTS: name=count pairs separated by ;. The same disable vocabulary as the global cap (off, none, unlimited, disabled, false, case-insensitive) lifts the cap entirely for that credential, which is useful for an internal automation that shouldn't be rate-limited. Unknown credentials silently fall back to the global default; a non-numeric override is silently skipped so a typo doesn't lock the credential out at boot.
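A parser for this syntax might look like the sketch below. The function names `parseKeyRates` and `limitFor` are hypothetical, and treating 0 as an invalid (skipped) override is an assumption carried over from how the global knob handles 0.

```javascript
const KEY_RATE_DISABLE_WORDS = new Set(["off", "none", "unlimited", "disabled", "false"]);

// Parse "name=count;name=count" into a Map of name -> cap (null = uncapped).
function parseKeyRates(raw) {
  const overrides = new Map();
  for (const pair of (raw ?? "").split(";")) {
    const [name, value] = pair.split("=").map((s) => s.trim());
    if (!name || value === undefined) continue;
    if (KEY_RATE_DISABLE_WORDS.has(value.toLowerCase())) {
      overrides.set(name, null);
    } else if (/^[1-9]\d*$/.test(value)) {
      overrides.set(name, Number(value));
    }
    // Anything else is skipped, so that credential keeps the global cap.
  }
  return overrides;
}

// Credentials without an override fall back to the global limit.
function limitFor(name, overrides, globalLimit) {
  return overrides.has(name) ? overrides.get(name) : globalLimit;
}

const demoOverrides = parseKeyRates("agent=600;ci=240;noisy-bot=off;typo=6oo");
console.log(limitFor("agent", demoOverrides, 60), limitFor("typo", demoOverrides, 60));
```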

"Restart broke my audit chain"

If OMCP_MGMT_AUDIT_FILE is set, AuditLog.bootstrap() replays the existing file on start so seq + tipHash resume cleanly. If you ever need to verify the chain manually:

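```bash
node -e "
  const { verifyChain } = require('./mcp-server/dist/audit/log.js');
  // AUDIT_FILE must point at the JSONL audit log to check.
  const lines = require('fs').readFileSync(process.env.AUDIT_FILE, 'utf8').trim().split('\n');
  const entries = lines.map((line) => JSON.parse(line));
  const result = verifyChain(entries);
  console.log(result);
  // Exit non-zero on a broken chain so a cron-driven monitor can alert.
  process.exit(result.ok ? 0 : 1);
"
```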

A break reports { ok: false, brokenAt: N, reason: '...' } and the script exits non-zero so a cron-driven monitor can alert. Most common cause is hand-editing the file; restore from backup and replay any missed changes via the Web UI.