# Access control overview
This page is the one-stop guide to the management-plane access controls shipped over the recent governance series (PRs #229–#235). Each layer is opt-in — the default deployment is single-user, no auth, exactly like the README quickstart promises.
Quick links

- The four layers, in dependency order
- Minimal production-ready setup
- Roles & permissions
- Audit log: `/api/audit`, hash chain, offline verifier
- Rate limits: `/api/usage`, `X-RateLimit-*` headers
- Service catalog enrichment
- Posture discovery: `/api/info` `governance` block
- Behind a reverse proxy: `OMCP_TRUST_PROXY`
- Investigation runbook

## The four layers, in dependency order

| Layer | Env knob | Default | Detail doc |
|---|---|---|---|
| MCP bearer auth for the agent transport | `OMCP_API_KEYS` | anonymous | auth-and-tls.md |
| Web UI session login | `OMCP_AUTH=basic` + `OMCP_USERS_FILE` | anonymous | auth-basic.md |
| Web UI SSO via OIDC | `OMCP_AUTH=oidc` + `OMCP_OIDC_*` | anonymous | auth-oidc.md |
| Role-based permissions on the management API | (built-in viewer / operator / admin; role assigned via the user file's `roles` field) | only meaningful in basic mode | this doc, "Roles & permissions" |
| Policy engine for RBAC decisions | `OMCP_RBAC_POLICY_FILE` (YAML) or `OMCP_OPA_URL` (OPA HTTP) | built-in | policy-engines.md |
| Multi-tenancy (per-identity scope across audit / quotas / catalog) | `OMCP_KEY_TENANTS` + `OMCP_OIDC_TENANT_CLAIM` + user-file `tenant` field | single-tenant (default) | tenancy.md |
| MCP Products (curated tool bundles per agent / tenant) | `OMCP_PRODUCTS_FILE` (YAML) | empty catalog | products.md |
| Audit log of mutating `/api/*` requests | `OMCP_MGMT_AUDIT_FILE` | in-memory ring (500 entries) | this doc, "Audit log" |

Three adjacent controls fall under the same umbrella:

| Control | Env knob | Default | Detail doc |
|---|---|---|---|
| PII / secret redaction of `query_logs` output | `OMCP_REDACTION` | on | redaction.md |
| Per-identity rate limit on the `/mcp` transport | `OMCP_TOOL_RATE_PER_MIN` | 60 | (in this doc, "Rate limits") |
| Per-identity daily token budget on the MCP tool layer | `OMCP_TOOL_DAILY_TOKENS` (+ optional `OMCP_TOKEN_BUDGET_FILE` for restart survival) | 0 (uncapped) | (in this doc, "Token budget") |

## Minimal production-ready setup

This is the smallest configuration that gives a multi-user team a sensible posture: signed sessions, an audit trail, redaction, and sliding-window per-identity caps.

```yaml
# values.yaml fragment (Helm): extraEnv is the chart's pass-through
# slot for ad-hoc env vars; see helm/observability-mcp/values.yaml.
extraEnv:
  - name: OMCP_AUTH
    value: basic
  - name: OMCP_USERS_FILE
    value: /etc/observability-mcp/users.json
  - name: OMCP_SESSION_SECRET
    valueFrom:
      secretKeyRef:
        name: omcp-session
        key: secret
  - name: OMCP_API_KEYS
    valueFrom:
      secretKeyRef:
        name: omcp-mcp-keys
        key: keys
  - name: OMCP_MGMT_AUDIT_FILE
    value: /var/log/omcp/audit.jsonl
  - name: OMCP_TOOL_RATE_PER_MIN
    value: "120"
  # - name: OMCP_REDACTION
  #   value: "on"   # default
  # OMCP_AUTH_ALLOW_FALLBACK is intentionally absent: boot must fail
  # closed if the users file is missing.
```

Mint users with the bundled helper (no host npm install required — the script uses only node built-ins):

```bash
node scripts/hash-password.mjs alice --name "Alice" --roles operator
node scripts/hash-password.mjs bob --name "Bob" --roles viewer
```

Paste both JSON blocks into the `users` array of users.json and mount
the file read-only.

## Roles & permissions

The built-in policy ships three roles. The full table is in
mcp-server/src/auth/rbac.ts; the
short version:

| Permission | viewer | operator | admin |
|---|---|---|---|
| Read sources / services / health / topology / settings / connectors / audit / catalog | ✅ | ✅ | ✅ |
| Write sources / settings / health-thresholds | – | ✅ | ✅ |
| Write connectors (upload / install) | – | – | ✅ |
| Delete sources / users | – | – | ✅ |

Every mutating `/api/*` route calls `need(resource, action)` before it
runs. A 403 from the gate carries
`{ code: "OMCP_PERMISSION_DENIED", required: {…}, have: […] }` so the
client can render a useful message rather than a generic "forbidden".

The permissions granted by the session payload's `roles` field are also
surfaced at `GET /api/me` under `permissions: […]`, so the Web UI hides
write controls (Add Source, Save Settings, etc.) that the current user
can't operate. The server is still the authoritative gate; UI hiding is
purely a UX win.
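
To make the gate concrete, here is a minimal TypeScript sketch of a built-in role table plus a `need(resource, action)` check that returns the 403 body described above. It assumes permissions are flat `resource:action` strings (as in `audit:read` or `users:delete`) and that `have` lists the caller's granted permissions; the real table in mcp-server/src/auth/rbac.ts is more detailed and may be shaped differently.

```typescript
// Hypothetical sketch of a built-in role table and a need(resource, action) gate.
// Permission strings are assumed to be "resource:action"; the real table lives in
// mcp-server/src/auth/rbac.ts and may differ.
type Role = "viewer" | "operator" | "admin";

const READS: string[] = [
  "sources", "services", "health", "topology", "settings", "connectors", "audit", "catalog",
].map((resource) => `${resource}:read`);
const OPERATOR_WRITES = ["sources:write", "settings:write", "health-thresholds:write"];

const POLICY: Record<Role, string[]> = {
  viewer: READS,
  operator: [...READS, ...OPERATOR_WRITES],
  admin: [
    ...READS, ...OPERATOR_WRITES,
    "connectors:write", "sources:delete", "users:delete", "redaction:bypass",
  ],
};

interface PermissionDenied {
  code: "OMCP_PERMISSION_DENIED";
  required: { resource: string; action: string };
  have: string[]; // assumption: the caller's granted permissions
}

// Returns null when the request may proceed, otherwise the 403 body.
function need(roles: Role[], resource: string, action: string): PermissionDenied | null {
  const have = Array.from(new Set(roles.flatMap((role) => POLICY[role])));
  if (have.includes(`${resource}:${action}`)) return null;
  return { code: "OMCP_PERMISSION_DENIED", required: { resource, action }, have };
}
```

With this table, `need(["operator"], "connectors", "write")` yields a denial whose `have` list lets the client explain which grants the user does hold.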
Admins debugging "why did role X get a 403?" can pull the full active policy without a source checkout:

```bash
curl -s -b "omcp_session=$ADMIN_COOKIE" "$URL/api/policy" | jq .
# { "policy": { "viewer": [...], "operator": [...], "admin": [...] },
#   "roles": ["viewer", "operator", "admin"], "note": "..." }
```

## Audit log

Every mutating /api/* request produces one append-only entry with
actor + resource + action + status + IP + the optional :name path
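parameter as `target`. Entries are hash-chained: each entry's hash
covers the previous entry's hash, so
scripts/verify-audit.mjs can prove
the log hasn't been silently edited, reordered, or had entries removed
from the middle. A chain cannot vouch for its own tail, so dropping the
newest entries only shows up when you check that a `tipHash` recorded
earlier still appears in the log: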

```bash
node scripts/verify-audit.mjs /var/log/omcp/audit.jsonl
# → { "ok": true, "entries": 1234, "tipHash": "…" }   (exit 0)
# or, on a tamper:
# → { "ok": false, "entries": 1234, "brokenAt": 42, "reason": "…" }   (exit 1)
```

The script uses only node built-ins (no node_modules) so it works
straight from a source checkout on an air-gapped operator workstation.
For unattended cron monitoring, pass --quiet — success produces no
stdout, failure still writes the { ok: false, brokenAt, reason }
JSON. Pair with a non-zero-exit handler in your job runner:

```cron
0 * * * * node /opt/observability-mcp/scripts/verify-audit.mjs --quiet /var/log/omcp/audit.jsonl
```

- File path: `OMCP_MGMT_AUDIT_FILE` (JSONL, append-only). Unset → an in-memory ring of the last 500 entries serves the same `GET /api/audit` endpoint, useful for the demo / single-user case.
- Read access: `audit:read` permission (granted to viewer / operator / admin by default).
- Surface: `GET /api/audit?from=&to=&actor=&action=&limit=` returns the most-recent-first slice plus `tipHash`. The Web UI's Audit Log page renders this alongside the entitlement gate's MCP-tool audit feed.

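The hash chain can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not the code in the gateway or in scripts/verify-audit.mjs: it assumes SHA-256 over the previous hash plus the entry's JSON, an all-zero genesis hash, and hypothetical `append` / `verify` helpers whose verdict mirrors the `{ ok, entries, tipHash }` / `{ ok, brokenAt, reason }` output above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry shape; real entries also carry actor, resource, action,
// status, IP and target inside the hashed payload.
interface Entry {
  seq: number;
  ts: string;
  body: Record<string, unknown>;
  prevHash: string;
  hash: string;
}

type Verdict =
  | { ok: true; entries: number; tipHash: string }
  | { ok: false; entries: number; brokenAt: number; reason: string };

const GENESIS = "0".repeat(64); // assumption: chain starts from an all-zero hash

// Each hash covers the previous hash, so changing any earlier entry changes every later hash.
function digest(prevHash: string, seq: number, ts: string, body: Record<string, unknown>): string {
  return createHash("sha256").update(prevHash).update(JSON.stringify({ seq, ts, body })).digest("hex");
}

function append(log: Entry[], body: Record<string, unknown>, ts: string): Entry {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const seq = log.length;
  const entry: Entry = { seq, ts, body, prevHash, hash: digest(prevHash, seq, ts, body) };
  log.push(entry);
  return entry;
}

function verify(log: Entry[]): Verdict {
  let prev = GENESIS;
  for (let i = 0; i < log.length; i++) {
    const e = log[i];
    if (e.prevHash !== prev) {
      return { ok: false, entries: log.length, brokenAt: i, reason: "prevHash does not match previous entry" };
    }
    if (e.hash !== digest(e.prevHash, e.seq, e.ts, e.body)) {
      return { ok: false, entries: log.length, brokenAt: i, reason: "entry contents do not match its hash" };
    }
    prev = e.hash;
  }
  return { ok: true, entries: log.length, tipHash: prev };
}
```

Editing or reordering any entry breaks verification at that index, but `verify(log.slice(0, n))` on a tail-truncated copy still passes with an older `tipHash`; that is why recording `tipHash` somewhere outside the log is worthwhile.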

## Session revocation

Web UI sessions (OMCP_AUTH=basic or oidc) are stateless signed
cookies — the gateway verifies the HMAC and trusts the payload, with
no server-side session table. That keeps the auth path cheap and
horizontally scalable, but it means a plain logout only clears the
cookie in that browser: a copied cookie, or a session on a lost
laptop, stays valid until its exp (12h default).
The revocation blocklist closes that gap. Every session carries a random
sid, and every request consults an append-only blocklist before the
session is trusted. Two shapes:

- Revoke one session, by its `sid`. Read the current session's id from `GET /api/me` (the `sid` field), then:

    ```bash
    curl -s -b "omcp_session=$ADMIN_COOKIE" -X POST "$URL/api/auth/revocations" \
      -H 'content-type: application/json' \
      -d '{ "sid": "Xy7…", "reason": "stolen laptop" }'
    ```

- Log a user out everywhere, by `sub`. Revokes every session for that subject issued so far; a fresh login afterwards (with valid credentials / a valid IdP assertion) is unaffected, so this is a force-re-login, not a permanent ban:

    ```bash
    curl -s -b "omcp_session=$ADMIN_COOKIE" -X POST "$URL/api/auth/revocations" \
      -H 'content-type: application/json' \
      -d '{ "sub": "alice@example.com", "reason": "offboarded" }'
    ```

GET /api/auth/revocations lists the current blocklist. Both endpoints
are admin-gated (users:delete) and every revocation is written to the
audit log.

- Persistence: `OMCP_AUTH_REVOCATION_FILE` (JSONL, append-only, mode 0600). Unset → in-memory only, so the blocklist is lost on restart; set it so revocations survive a restart.
- Multi-replica caveat: each replica loads the blocklist once at startup and the writing replica updates its own in-memory index immediately, but there is no live cross-replica propagation: a revocation issued on replica A is not seen by replica B until B restarts (or re-reads the file). For a single-replica gateway this is a non-issue. For a load-balanced fleet, either pin auth-plane requests to one replica, keep session TTLs short, or roll the deployment after a bulk revocation. A shared-store backend (Redis, like the SCIM and transport stores) is the planned path to live fleet-wide propagation.
- A permanent ban is a directory operation, not a gateway one: remove / disable the user in `OMCP_USERS_FILE` or your IdP. The blocklist is for sessions, not accounts.

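The per-request check reduces to two lookups. The TypeScript sketch below assumes a session payload carrying `sid`, `sub`, `iat` and `exp` (epoch seconds) and a blocklist record carrying either a `sid` or a `sub` plus a revocation time `at`; these shapes and the `Blocklist` class are hypothetical. A `sub` revocation rejects sessions issued at or before it, so a fresh login afterwards is accepted, matching the force-re-login semantics described above.

```typescript
// Hypothetical shapes; the real session payload and blocklist records may differ.
interface Session { sid: string; sub: string; iat: number; exp: number } // epoch seconds
type Revocation = { sid: string; at: number } | { sub: string; at: number };

class Blocklist {
  private readonly sids = new Set<string>();
  private readonly subs = new Map<string, number>(); // sub -> latest revocation time

  add(r: Revocation): void {
    if ("sid" in r) {
      this.sids.add(r.sid);
    } else {
      this.subs.set(r.sub, Math.max(this.subs.get(r.sub) ?? 0, r.at));
    }
  }

  // Trusted only if unexpired, not individually revoked, and issued after
  // the latest "log out everywhere" for its subject.
  isTrusted(s: Session, now: number): boolean {
    if (now >= s.exp) return false;
    if (this.sids.has(s.sid)) return false;
    const cutoff = this.subs.get(s.sub);
    return cutoff === undefined || s.iat > cutoff;
  }
}
```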
## Rate limits

The /mcp HTTP transport carries one per-identity sliding window:
60 requests/minute per named bearer-token caller by default.
OMCP_TOOL_RATE_PER_MIN overrides — accepts any positive integer;
unset / empty / non-numeric / 0 / negative all fall back to the
default 60 (so an operator setting 0 to mean "disable" doesn't
accidentally lock every caller out). To truly disable the per-identity
cap — e.g. when an upstream gateway already enforces quotas —
set it to off, none, unlimited, disabled, or false
(case-insensitive). In that mode /api/usage reports limit: null
for every identity. Anonymous /mcp traffic (no OMCP_API_KEYS) is
unaffected; the existing IP-level express-rate-limit still applies.
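
The fallback rules for `OMCP_TOOL_RATE_PER_MIN` are easy to misread, so here they are as a small TypeScript function. It is a sketch of the documented behaviour, not the gateway's parser; `parseRatePerMin` is a hypothetical name, and trimming surrounding whitespace is an assumption.

```typescript
// Sketch of the documented OMCP_TOOL_RATE_PER_MIN rules (hypothetical helper).
const DEFAULT_RATE_PER_MIN = 60;
const DISABLE_WORDS = new Set(["off", "none", "unlimited", "disabled", "false"]);

// Returns the per-minute cap, or null when the per-identity cap is disabled.
function parseRatePerMin(raw: string | undefined): number | null {
  const value = (raw ?? "").trim();
  if (DISABLE_WORDS.has(value.toLowerCase())) return null; // explicit opt-out
  if (!/^\d+$/.test(value)) return DEFAULT_RATE_PER_MIN;  // unset, empty, negative, non-numeric
  const n = Number(value);
  return n > 0 ? n : DEFAULT_RATE_PER_MIN;                // 0 falls back; it never disables
}
```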
Every /mcp response carries the live bucket state in headers so a
well-behaved client can self-pace before hitting the cap:

```http
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 47
X-RateLimit-Window-Ms: 60000
```

A breached cap returns:

```http
HTTP/1.1 429 Too Many Requests
Retry-After: 17
Content-Type: application/json
```

```json
{
  "code": "OMCP_IDENTITY_RATE_LIMIT",
  "retryAfterSeconds": 17,
  "limit": 60,
  "windowMs": 60000
}
```
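
A sliding-log limiter is one straightforward way to produce these headers and the 429 body. The sketch below keeps per-identity request timestamps (the gateway's actual data structure may differ) and derives `Retry-After` from the moment the oldest request in the window ages out. `SlidingWindow` and `check` are hypothetical names; times are plain millisecond numbers so the example is deterministic.

```typescript
// Sliding-log sketch of a per-identity window (hypothetical; not the gateway's code).
interface Decision {
  allowed: boolean;
  headers: Record<string, string>;
  retryAfterSeconds?: number;
}

class SlidingWindow {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly hits = new Map<string, number[]>(); // identity -> timestamps, oldest first

  constructor(limit: number, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(identity: string, now: number): Decision {
    // Keep only requests still inside the window.
    const recent = (this.hits.get(identity) ?? []).filter((t) => t > now - this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(identity, recent);

    const headers: Record<string, string> = {
      "X-RateLimit-Limit": String(this.limit),
      "X-RateLimit-Remaining": String(Math.max(0, this.limit - recent.length)),
      "X-RateLimit-Window-Ms": String(this.windowMs),
    };
    if (allowed) return { allowed, headers };

    // One slot frees up when the oldest request leaves the window.
    const retryAfterSeconds = Math.ceil((recent[0] + this.windowMs - now) / 1000);
    return { allowed, headers: { ...headers, "Retry-After": String(retryAfterSeconds) }, retryAfterSeconds };
  }
}
```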

Granularity is per HTTP request, not per JSON-RPC message. A batched JSON-RPC request counts as one; an LLM turn that issues N tool calls as separate requests counts as N.
Live snapshot: GET /api/usage (gated by audit:read) returns the
current windowed count per identity:

```json
{
  "identities": [
    { "actor": "agent-prod", "count": 14, "limit": 60, "windowMs": 60000 },
    { "actor": "ci", "count": 3, "limit": 60, "windowMs": 60000 }
  ],
  "defaultLimit": 60,
  "windowMs": 60000
}
```

Pass ?actor=<name> to inspect a single identity (count is 0 for
identities the server has never seen).

## Token budget

OMCP_TOOL_DAILY_TOKENS=<positive integer> enables a per-identity
rolling 24-hour token cap. Tokens are estimated post-tool-execution
(over-count by ~5% vs cl100k_base, so the gate errs on the strict
side) and charged against the calling bearer-token credential. When
the bucket would exceed the cap, the tool returns

```json
{
  "error": "OMCP_TOKEN_BUDGET_EXCEEDED",
  "tool": "query_logs",
  "used": 49800,
  "limit": 50000,
  "requested": 1200,
  "retryAfterSeconds": 73400,
  "freedAtRetry": 14000,
  "message": "Daily token budget exceeded (49800/50000 ...)."
}
```

instead of the data. The agent sees a parseable refusal, not a generic failure.
Anonymous /mcp traffic is not charged (the budget is per
credential; an operator running without OMCP_API_KEYS has no
identity-keyed bucket to charge). Three tools currently honour the
gate: query_logs, query_metrics, get_service_health. Adding
new high-token tools is a one-line chargeTokenBudget(result, ctx,
"new_tool") wrap.
`retryAfterSeconds` is computed by walking the bucket history oldest-first
until enough usage has aged out of the 24-hour window to fit the denied
request; `freedAtRetry` reports how many tokens that frees, so a well-behaved
agent can decide whether to back off or retry sooner with a smaller request.
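
The charge-and-refuse logic, including the oldest-first walk behind `retryAfterSeconds` and `freedAtRetry`, can be sketched as follows. Everything here is hypothetical (`TokenBudget`, `charge`, the in-memory history shape); it is not the gateway's `chargeTokenBudget`. It also returns the `OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET` refusal from this section's error-code table for a single request larger than the whole cap, since no amount of waiting can fit it.

```typescript
// Rolling 24 h token budget sketch with the documented refusal fields (hypothetical).
const DAY_MS = 86_400_000;

type Charge =
  | { ok: true; used: number }
  | { ok: false; error: "OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET"; requested: number; limit: number }
  | {
      ok: false; error: "OMCP_TOKEN_BUDGET_EXCEEDED"; used: number; limit: number;
      requested: number; retryAfterSeconds: number; freedAtRetry: number;
    };

class TokenBudget {
  private readonly limit: number;
  private readonly buckets = new Map<string, { at: number; tokens: number }[]>();

  constructor(limit: number) {
    this.limit = limit;
  }

  charge(identity: string, requested: number, now: number): Charge {
    const limit = this.limit;
    if (requested > limit) {
      return { ok: false, error: "OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET", requested, limit };
    }
    // Drop charges older than 24 h; what remains is ordered oldest-first.
    const history = (this.buckets.get(identity) ?? []).filter((e) => e.at > now - DAY_MS);
    this.buckets.set(identity, history);
    const used = history.reduce((sum, e) => sum + e.tokens, 0);
    if (used + requested <= limit) {
      history.push({ at: now, tokens: requested });
      return { ok: true, used: used + requested };
    }
    // Walk oldest-first until enough usage ages out to fit this request.
    let freed = 0;
    for (const e of history) {
      freed += e.tokens;
      if (used - freed + requested <= limit) {
        return {
          ok: false, error: "OMCP_TOKEN_BUDGET_EXCEEDED", used, limit, requested,
          retryAfterSeconds: Math.ceil((e.at + DAY_MS - now) / 1000), freedAtRetry: freed,
        };
      }
    }
    // Unreachable: requested <= limit, so freeing all history always fits.
    throw new Error("unreachable");
  }
}
```

Replaying the refusal JSON above (14,000 tokens charged at t = 0, another 35,800 an hour later, then a 1,200-token request at t = 13,000 s) yields `freedAtRetry: 14000` and `retryAfterSeconds: 73400`.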
OMCP_TOKEN_BUDGET_FILE=<path> enables snapshot persistence —
buckets reload at boot, so a server restart mid-day doesn't reset
quotas. Writes are debounced (1s default) and atomic (write-rename).
Unset → in-memory only, which is fine for demo / single-operator
setups where a restart-on-each-deploy effectively rolls budgets.
Live snapshot at GET /api/usage (same gate as the rate-limit view)
returns:

```json
{
  "identities": [
    {
      "actor": "agent-prod",
      "count": 14,
      "limit": 60,
      "windowMs": 60000,
      "tokens": { "used": 42100, "limit": 50000, "windowMs": 86400000 }
    }
  ],
  "tokens": { "defaultLimit": 50000, "windowMs": 86400000 }
}
```

The Web UI's Overview page shows the same data as a "Today's
MCP usage" strip — top 5 identities sorted by token consumption,
with a coloured progress bar that turns amber at 70% of the daily
cap and red at 90%. The strip is hidden when no identity has any
traffic yet, or the viewer lacks the audit:read permission.
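
The strip's selection and colouring rules fit in a short TypeScript function. It is a sketch, not the Web UI's code: the `UsageRow` shape follows the `/api/usage` JSON above, the thresholds are treated as inclusive (70% and above is amber, 90% and above is red), and leaving an uncapped identity (`limit` 0) uncoloured is an assumption.

```typescript
// Sketch of the Overview strip rules (hypothetical names; not the Web UI source).
interface UsageRow { actor: string; tokens?: { used: number; limit: number } }
type Band = "normal" | "amber" | "red";
interface StripItem { actor: string; used: number; band: Band }

function band(used: number, limit: number): Band {
  if (limit <= 0) return "normal"; // assumption: uncapped identities get no colour
  const ratio = used / limit;
  if (ratio >= 0.9) return "red";
  if (ratio >= 0.7) return "amber";
  return "normal";
}

// Top 5 identities by token consumption; an empty result means "hide the strip".
function strip(rows: UsageRow[]): StripItem[] {
  const items: StripItem[] = [];
  for (const r of rows) {
    if (r.tokens && r.tokens.used > 0) {
      items.push({ actor: r.actor, used: r.tokens.used, band: band(r.tokens.used, r.tokens.limit) });
    }
  }
  return items.sort((x, y) => y.used - x.used).slice(0, 5);
}
```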

### Error codes

| Code | Meaning | Caller response |
|---|---|---|
| `OMCP_TOKEN_BUDGET_EXCEEDED` | Identity is at the cap; this call would push it over. | Wait `retryAfterSeconds`; `freedAtRetry` tells how much will be available. |
| `OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET` | A single response alone is larger than the entire daily cap. | Retrying won't help: narrow the query (smaller window / lower limit / more selective filter) or raise the cap. |
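
On the agent side, the table maps onto a small decision function. This TypeScript sketch uses hypothetical `BudgetRefusal` and `NextStep` types built from the fields shown in the refusal JSON; it assumes both codes arrive in an `error` field, and how an agent actually acts on the result is up to the client.

```typescript
// Client-side reaction to the two budget refusals (hypothetical types and names).
type BudgetRefusal =
  | { error: "OMCP_TOKEN_BUDGET_EXCEEDED"; retryAfterSeconds: number; freedAtRetry: number }
  | { error: "OMCP_TOKEN_REQUEST_EXCEEDS_BUDGET" };

type NextStep =
  | { kind: "wait"; seconds: number; willFree: number }
  | { kind: "narrow-query" };

function nextStep(refusal: BudgetRefusal): NextStep {
  if (refusal.error === "OMCP_TOKEN_BUDGET_EXCEEDED") {
    // Waiting works: after retryAfterSeconds, freedAtRetry tokens are available again.
    return { kind: "wait", seconds: refusal.retryAfterSeconds, willFree: refusal.freedAtRetry };
  }
  // The response alone exceeds the daily cap, so retrying the same call can never succeed.
  return { kind: "narrow-query" };
}
```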

## Service catalog enrichment

When OMCP_SERVICE_CATALOG_FILE points at a JSON catalog (schema in
mcp-server/src/catalog/loader.ts),
every list_services / get_service_health / query_metrics derived
response is decorated with .catalog = { owner, tier, onCall, slo, … }.
The agent sees ownership context inline — no separate CMDB hop.
With the env var unset there is no catalog file, so the catalog is empty and enrichment is a no-op.
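
Enrichment is essentially a keyed merge. The sketch below assumes responses carry a `service` name and that the loaded catalog is a map from service name to `{ owner, tier, onCall, slo }`; the real schema lives in mcp-server/src/catalog/loader.ts, and `enrich` is a hypothetical name. A `Map` is used rather than a plain object so a service named like an `Object.prototype` member can't match by accident.

```typescript
// Sketch of catalog enrichment (hypothetical shapes; not the loader's schema).
interface CatalogEntry { owner?: string; tier?: string; onCall?: string; slo?: string }
interface ServiceItem { service: string; catalog?: CatalogEntry; [key: string]: unknown }

function enrich(items: ServiceItem[], catalog: Map<string, CatalogEntry>): ServiceItem[] {
  return items.map((item) => {
    const entry = catalog.get(item.service);
    // Unknown services (or an empty catalog) pass through unchanged.
    return entry ? { ...item, catalog: entry } : item;
  });
}
```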

## Posture discovery

External dashboards and discovery probes (kube-state-metrics derived
exporters, Helm chart annotations, Backstage plugins, etc.) often want
to learn the deployment's governance shape without holding a session.
GET /api/info ships a public governance block for exactly that:

```json
{
  "governance": {
    "authMode": "basic",
    "authSecretEphemeral": false,
    "auditPersisted": true,
    "catalogConfigured": true,
    "redaction": true,
    "trustProxy": true,
    "toolRatePerMin": 60
  }
}
```

Booleans + the rate-limit number only — no file paths, no session secret, no user counts. The expected alert is "this deployment silently reverted to anonymous mode" or "redaction is off in prod" — both visible from a single unauthenticated GET.
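
An external probe needs only a few comparisons against that block. The TypeScript sketch assumes the probe has already fetched and parsed `GET /api/info`; `postureAlerts` is hypothetical, the set of acceptable `authMode` values is the caller's choice, and the alert wording is illustrative.

```typescript
// Sketch of an external posture probe (hypothetical helper; only the fields it reads).
interface Governance {
  authMode: string;
  authSecretEphemeral: boolean;
  auditPersisted: boolean;
  redaction: boolean;
}

function postureAlerts(g: Governance, allowedAuthModes: string[] = ["basic", "oidc"]): string[] {
  const alerts: string[] = [];
  if (!allowedAuthModes.includes(g.authMode)) alerts.push(`unexpected authMode: ${g.authMode}`);
  if (!g.redaction) alerts.push("redaction is off");
  if (!g.auditPersisted) alerts.push("audit log is in-memory only");
  if (g.authSecretEphemeral) alerts.push("session secret is ephemeral");
  return alerts;
}
```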

## Behind a reverse proxy

By default `req.ip` is the raw socket address, so a fronting nginx / Envoy
/ ingress controller makes every audit entry look like it came from the
proxy (127.0.0.1 for a same-host nginx). Set `OMCP_TRUST_PROXY` to one of:

| Value | Meaning |
|---|---|
| `true` | trust every upstream hop (Express's `trust proxy` set to `true`) |
| `loopback` | trust 127.0.0.1 / ::1 only (sensible same-host nginx default) |
| an integer `n` | trust the last `n` hops |
| comma-separated IPs | explicit list of upstreams to trust |

Unset / false keeps the safe default. The same setting also fixes the
Secure cookie attribute behind TLS-terminating proxies (the server
detects HTTPS via req.secure || X-Forwarded-Proto).
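
For reference, here is how such an env value might be mapped onto Express's `trust proxy` application setting, which accepts a boolean, a hop count, a preset name such as `loopback`, or a list of addresses. This is a sketch; `parseTrustProxy` is hypothetical and case-insensitive handling of `true` / `false` / `loopback` is an assumption.

```typescript
// Sketch: map OMCP_TRUST_PROXY onto a value for app.set("trust proxy", ...).
type TrustProxy = boolean | number | string | string[];

function parseTrustProxy(raw: string | undefined): TrustProxy {
  const value = (raw ?? "").trim();
  const lower = value.toLowerCase();
  if (value === "" || lower === "false") return false; // safe default: raw socket address
  if (lower === "true") return true;                   // trust every upstream hop
  if (/^\d+$/.test(value)) return Number(value);       // trust the last n hops
  if (lower === "loopback") return "loopback";         // 127.0.0.1 / ::1
  // Explicit list of upstream addresses to trust.
  return value.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
}

// Usage (Express): app.set("trust proxy", parseTrustProxy(process.env.OMCP_TRUST_PROXY));
```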

## Investigation runbook

### "Who changed source payment-prod yesterday?"

```bash
curl -s -b "omcp_session=$COOKIE" "$URL/api/audit?action=write&limit=50" \
  | jq '.entries[] | select(.target == "payment-prod")'
```

Each matching entry names the `actor`. Add `from=` / `to=` to bound the query to yesterday, or `actor=` to narrow it to one user.

### "Why did Claude get 403 just now?"

The client's stderr / log shows the response body. Cross-check the permission grants for the user:

```bash
curl -s -b "omcp_session=$COOKIE" "$URL/api/me" \
  | jq '.permissions'
```

If the user's role is missing the `resource:action` they tried,
update `OMCP_USERS_FILE` (add the right role to that user's `roles`
array) and have them sign out and back in to refresh the cookie.

### "Why are my logs returning [redacted-email]?"

The redactor is on by default. If the source is already PII-clean, disable it process-wide:

```yaml
env:
  OMCP_REDACTION: "off"
```

Per-request bypass requires two independent grants; both must align for a single tool call to skip redaction:

- RBAC permission `redaction:bypass` (admin role by default). This is the management-plane gate, visible via `/api/policy` and reflected on the Policy UI tab.
- Credential opt-in via `OMCP_KEY_BYPASS_REDACTION=<key-names>`. This is the data-plane gate: only credentials listed here may carry the bypass flag, and only when the call also sets `bypass_redaction: true` in the tool args.

Either gate alone keeps redaction on. The MCP tool currently honouring
the per-call flag is query_logs; future high-PII tools will follow
the same pattern.
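
The AND of the two gates can be sketched as below. The context shape, the `shouldBypass` / `parseBypassKeys` names, and the comma separator for `OMCP_KEY_BYPASS_REDACTION` are assumptions for illustration; only the rule itself (both gates plus the per-call `bypass_redaction: true` arg) comes from this section.

```typescript
// Sketch of the two-gate redaction bypass decision (hypothetical names and shapes).
interface CallContext { permissions: string[]; credential?: string }
interface ToolArgs { bypass_redaction?: boolean }

// Assumption: OMCP_KEY_BYPASS_REDACTION is a comma-separated list of key names.
function parseBypassKeys(raw: string | undefined): Set<string> {
  return new Set((raw ?? "").split(",").map((s) => s.trim()).filter((s) => s.length > 0));
}

function shouldBypass(ctx: CallContext, args: ToolArgs, allowList: ReadonlySet<string>): boolean {
  const requested = args.bypass_redaction === true;
  const rbacGate = ctx.permissions.includes("redaction:bypass"); // management-plane gate
  const dataGate = ctx.credential !== undefined && allowList.has(ctx.credential); // data-plane gate
  return requested && rbacGate && dataGate; // any missing piece keeps redaction on
}
```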

Example: an agent credential authorised for live-incident debugging:

```yaml
env:
  OMCP_API_KEYS: "agent:tok_..."
  OMCP_KEY_BYPASS_REDACTION: "agent"   # data-plane allow-list
  # The RBAC side is automatic for the admin role; for OIDC sessions the
  # IdP claim → role mapping decides whether the same identity also
  # sees the policy entry in the UI.
```

The agent then invokes `query_logs` with `{ ..., bypass_redaction: true }`
and gets unredacted payload bytes; without the `OMCP_KEY_BYPASS_REDACTION`
entry, the same arg is silently ignored and the response stays redacted.

### "Caller hit a 429 on the /mcp transport"

The response body identifies the caller's identity bucket. To raise the cap process-wide:

```yaml
env:
  OMCP_TOOL_RATE_PER_MIN: "240"
```

For a per-credential cap override, set `OMCP_KEY_RATE_PER_MIN`:

```yaml
env:
  OMCP_TOOL_RATE_PER_MIN: "60"                             # default for everyone else
  OMCP_KEY_RATE_PER_MIN: "agent=600;ci=240;noisy-bot=off"
```
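
A per-credential override string like the one above could be parsed and resolved as follows. The sketch follows the rules described in this section (`name=count` pairs separated by `;`, the same case-insensitive disable vocabulary, silent fallback for unknown credentials and malformed values); `parseKeyRates` and `capFor` are hypothetical, and treating a `0` override as malformed is an assumption.

```typescript
// Sketch of OMCP_KEY_RATE_PER_MIN parsing and lookup (hypothetical helpers).
const DISABLE_WORDS = new Set(["off", "none", "unlimited", "disabled", "false"]);

// Maps credential -> per-minute cap, where null means "no cap for this credential".
function parseKeyRates(raw: string | undefined): Map<string, number | null> {
  const overrides = new Map<string, number | null>();
  for (const pair of (raw ?? "").split(";")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // no name: skip
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (DISABLE_WORDS.has(value.toLowerCase())) {
      overrides.set(name, null);
    } else if (/^\d+$/.test(value) && Number(value) > 0) {
      overrides.set(name, Number(value));
    }
    // Anything else is skipped silently, so the credential keeps the global default.
  }
  return overrides;
}

function capFor(credential: string, overrides: Map<string, number | null>, globalCap: number | null): number | null {
  const override = overrides.get(credential);
  return override === undefined ? globalCap : override;
}
```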

The override syntax mirrors `OMCP_KEY_TENANTS` / `OMCP_KEY_PRODUCTS`:
`name=count` pairs separated by `;`. The same disable vocabulary as
the global cap (`off`, `none`, `unlimited`, `disabled`, `false`,
case-insensitive) lifts the cap entirely for that credential, which is
useful for an internal automation that shouldn't be rate-limited.
Unknown credentials silently fall back to the global default; a
non-numeric override is silently skipped so a typo doesn't lock the
credential out at boot.

### "Restart broke my audit chain"

If `OMCP_MGMT_AUDIT_FILE` is set, `AuditLog.bootstrap()` replays the
existing file on start so `seq` + `tipHash` resume cleanly. If you ever
need to verify the chain manually from a built source checkout:

```bash
AUDIT_FILE=/var/log/omcp/audit.jsonl node -e "
  const fs = require('fs');
  const { verifyChain } = require('./mcp-server/dist/audit/log.js');
  // One JSON object per line; skip blank lines so an empty file doesn't throw.
  const entries = fs.readFileSync(process.env.AUDIT_FILE, 'utf8')
    .split('\n').filter((l) => l.trim()).map((l) => JSON.parse(l));
  const result = verifyChain(entries);
  console.log(result);
  process.exit(result.ok ? 0 : 1);
"
```

A break reports { ok: false, brokenAt: N, reason: '...' } and the
script exits non-zero so a cron-driven monitor can alert. Most common
cause is hand-editing the file; restore from backup and replay any
missed changes via the Web UI.