Monitoring

The gateway and coordinator expose Prometheus metrics on /metrics. The Helm charts include ServiceMonitor resources, PrometheusRule alert definitions, and optional Grafana dashboard ConfigMaps.

Enabling metrics collection

Both the gateway and coordinator charts have a serviceMonitor section:

serviceMonitor:
  enabled: true
  interval: 15s
  scrapeTimeout: 10s
  additionalLabels: {}

Enable it during installation:

helm install aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --set serviceMonitor.enabled=true

helm install gateway-coordinator takumo/gateway-coordinator \
  --namespace takumo \
  --set serviceMonitor.enabled=true

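The Prometheus Operator discovers the ServiceMonitor resources automatically. If your Prometheus instance selects ServiceMonitors by label, add the matching labels under additionalLabels.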

Deploying the monitoring stack

The monitoring chart deploys kube-prometheus-stack (Prometheus, Grafana, Alertmanager) with Takumo-specific defaults:

helm install takumo-monitoring takumo/monitoring \
  --namespace takumo

This provisions:

Prometheus with 7-day retention, 10Gi storage, and cross-namespace ServiceMonitor discovery
Grafana with 2Gi persistent storage and a sidecar that auto-loads dashboard ConfigMaps from all namespaces
Alertmanager with 120-hour retention and 1Gi storage

Enabling Grafana dashboards

The gateway chart includes pre-built Grafana dashboards as ConfigMaps. Enable them with:

helm upgrade aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --reuse-values \
  --set grafana.dashboards.enabled=true

The Grafana sidecar picks up the dashboard ConfigMaps automatically; no manual import is needed.

Gateway metrics

Request metrics

Metric	Type	Description
`aegis_requests_total`	Counter	Total requests processed, labeled by `status` (HTTP code) and `provider`
`aegis_request_duration_seconds`	Histogram	Request latency from gateway receipt to response delivery
`aegis_upstream_errors_total`	Counter	Errors from upstream AI providers, labeled by `provider`

Security metrics

Metric	Type	Description
`aegis_secrets_detected_total`	Counter	Secrets found and tokenized
`aegis_security_violations_total`	Counter	Policy violations detected
`aegis_auth_failures_blocked_total`	Counter	Authentication failures blocked

Infrastructure metrics

Metric	Type	Description
`aegis_vault_entries`	Gauge	Current number of entries in the in-memory vault
`aegis_concurrent_sessions`	Gauge	Current number of active sessions
`aegis_circuit_breaker_state`	Gauge	Circuit breaker status: `0` = closed (healthy), `1` = open (tripped)
`aegis_connector_state`	Gauge	Cloud connector status: `4` = standalone mode
`aegis_redis_degraded`	Gauge	Redis connection health: `0` = healthy, `1` = degraded
`aegis_redis_health_state`	Gauge	Redis health detail: `0` = healthy, `>=1` = degraded or unavailable
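
The gauge encodings above can be turned into readable states with a small lookup. The following Python sketch is one possible way to do that; the function name and the output keys are hypothetical, and any value the table does not document is reported as unknown or undocumented rather than guessed.

```python
def describe_gauges(values):
    """Translate raw gauge readings into the states documented in the table."""
    states = {}

    breaker = values.get("aegis_circuit_breaker_state")
    if breaker is not None:
        states["circuit_breaker"] = {0: "closed", 1: "open"}.get(breaker, "unknown")

    connector = values.get("aegis_connector_state")
    if connector is not None:
        # Only the value 4 is documented; anything else is reported as-is.
        states["connector"] = "standalone" if connector == 4 else f"undocumented ({connector})"

    degraded = values.get("aegis_redis_degraded")
    if degraded is not None:
        states["redis"] = {0: "healthy", 1: "degraded"}.get(degraded, "unknown")

    health = values.get("aegis_redis_health_state")
    if health is not None:
        if health == 0:
            states["redis_health"] = "healthy"
        elif health >= 1:
            states["redis_health"] = "degraded or unavailable"
        else:
            states["redis_health"] = "unknown"

    return states


print(describe_gauges({"aegis_circuit_breaker_state": 1, "aegis_connector_state": 4}))
```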

Rate limiting metrics

Metric	Type	Description
`aegis_rate_limit_checks_total`	Counter	Rate limit checks, labeled by `result` (`allowed` or `rejected`)
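
Because `aegis_rate_limit_checks_total` is a counter, a rejection rate is derived from the difference between two readings. The Python sketch below shows that arithmetic under simple assumptions: the snapshot dictionaries are hypothetical, keyed by the `result` label, and counter resets between the two readings are not handled.

```python
def rejection_rate(before, after):
    """Share of rate limit checks rejected between two counter snapshots.

    Each snapshot maps a `result` label value to the counter reading.
    """
    allowed = after["allowed"] - before["allowed"]
    rejected = after["rejected"] - before["rejected"]
    checks = allowed + rejected
    return rejected / checks if checks > 0 else 0.0


# Hypothetical readings taken five minutes apart.
earlier = {"allowed": 1000, "rejected": 50}
later = {"allowed": 1850, "rejected": 200}
print(rejection_rate(earlier, later))
```

With these invented numbers, 150 of 1000 checks were rejected, which is above the 10% level used by the `HighRateLimitRejections` alert.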

Audit metrics

Metric	Type	Description
`aegis_audit_events_discarded_total`	Counter	Audit events discarded due to shipping failures

SLI metrics

Metric	Type	Description
`aegis_sli_availability_ratio`	Gauge	Ratio of successful requests to total requests
`aegis_sli_latency_ratio`	Gauge	Ratio of requests within latency budget
`aegis_sli_security_ratio`	Gauge	Ratio of requests without security violations
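
One common way to read an SLI ratio is as consumption of an error budget. The Python sketch below works through that arithmetic, using 0.99 as the target because it is the default availability threshold shown under Tuning alert thresholds. The function name is hypothetical and the measured ratios are made up.

```python
def budget_consumed(sli_ratio, slo_target):
    """Fraction of the error budget used, where the budget is 1 - slo_target.

    A result above 1.0 means the SLI is below the target.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    return (1.0 - sli_ratio) / (1.0 - slo_target)


# A measured availability of 99.5% against a 99% target uses about half the budget.
print(round(budget_consumed(0.995, 0.99), 3))
# A measured availability of 98% uses about twice the budget.
print(round(budget_consumed(0.98, 0.99), 3))
```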

Alert rules

The gateway chart includes a full set of PrometheusRule alert definitions. Enable them with:

helm upgrade aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --reuse-values \
  --set alerting.enabled=true

SLO alerts

Alert	Severity	Condition
`AvailabilitySLOBreach`	critical	Availability ratio below threshold for 5 minutes
`LatencySLOBreach`	warning	Latency ratio below threshold for 5 minutes
`SecuritySLOBreach`	critical	Security ratio below threshold for 5 minutes
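
The "for 5 minutes" wording means a condition has to hold continuously before the alert fires. The Python sketch below is a simplified model of that behavior, not the chart's actual rule expression: it takes hypothetical `(timestamp_seconds, value)` samples and reports whether the value has stayed below the threshold for the whole hold period up to the latest sample.

```python
def fires(samples, threshold, hold_seconds):
    """Return True if the value has been below threshold continuously
    for at least hold_seconds, measured up to the last sample."""
    breach_start = None
    for timestamp, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any recovery resets the pending period.
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= hold_seconds


# Availability below 0.99 at every sample for five minutes: fires.
sustained = [(t, 0.98) for t in range(0, 301, 60)]
print(fires(sustained, threshold=0.99, hold_seconds=300))

# A single healthy sample at t=120 resets the clock: does not fire.
interrupted = [(0, 0.98), (60, 0.98), (120, 0.995), (180, 0.98), (240, 0.98), (300, 0.98)]
print(fires(interrupted, threshold=0.99, hold_seconds=300))
```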

Error alerts

Alert	Severity	Condition
`HighErrorRate`	critical	5xx error rate exceeds 5% for 5 minutes
`UpstreamErrors`	warning	Upstream provider errors exceed 5/sec for 5 minutes
`AegisUpstreamErrorRate`	critical	Per-provider error rate exceeds 10% for 5 minutes
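
A per-provider error rate needs errors and requests grouped by the `provider` label. The Python sketch below assumes the rate is upstream errors divided by requests for the same provider over the same window; the chart's real expression is not shown in this document, so treat the formula, the function names, and the sample counts as illustrative.

```python
def provider_error_rates(request_deltas, error_deltas):
    """Map each provider to errors / requests for one time window.

    Providers with no requests in the window are skipped.
    """
    rates = {}
    for provider, requests in request_deltas.items():
        if requests > 0:
            rates[provider] = error_deltas.get(provider, 0) / requests
    return rates


def breaching(rates, limit=0.10):
    """Providers whose error rate exceeds the limit, sorted by name."""
    return sorted(name for name, rate in rates.items() if rate > limit)


# Hypothetical counter increases over a five-minute window.
requests = {"provider-a": 200, "provider-b": 50, "provider-c": 0}
errors = {"provider-a": 10, "provider-b": 8}
print(breaching(provider_error_rates(requests, errors)))
```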

Infrastructure alerts

Alert	Severity	Condition
`CircuitBreakerOpen`	critical	Circuit breaker open for 2 minutes
`RedisDegraded`	warning	Redis connection degraded for 3 minutes
`HighConcurrentSessions`	warning	Concurrent sessions exceed threshold for 5 minutes
`AegisGatewayStandalone`	warning	Gateway in standalone mode (no coordinator) for 5 minutes
`AegisTokenExchangeCircuitOpen`	warning	Token exchange circuit breaker open for 2 minutes
`AegisRedisHealthDegraded`	warning	Redis health degraded for 3 minutes

Security alerts

Alert	Severity	Condition
`SecurityViolationSpike`	critical	Security violations exceed threshold for 3 minutes
`BruteForceDetected`	critical	Auth failures exceed threshold for 2 minutes
`SecretsLeaking`	critical	Secrets detected in outbound traffic for 1 minute

Pod health alerts

Alert	Severity	Condition
`PodNotReady`	warning	Pod not ready for 5 minutes
`PodRestarting`	warning	Pod restarted more than 3 times in 1 hour

Rate limiting alerts

Alert	Severity	Condition
`HighRateLimitRejections`	warning	Rate limit rejection rate exceeds 10% for 5 minutes

Audit alerts

Alert	Severity	Condition
`AegisAuditEventsDiscarded`	warning	Any audit events discarded in the last hour

Tuning alert thresholds

All thresholds are configurable via Helm values:

alerting:
  enabled: true
  thresholds:
    availability: 0.99          # 99% availability SLO
    latency: 0.95               # 95% within latency budget
    security: 0.999             # 99.9% security SLO
    errorRate: 0.05             # 5xx error rate
    upstreamErrorsPerSec: 5     # Upstream errors/sec
    securityViolationsPerMin: 10
    bruteForcePerMin: 20
    maxConcurrentSessions: 4000
    podRestartsPerHour: 3
    rateLimitRejectionRate: 0.1 # 10% rejection rate
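
Individual thresholds can also be overridden on the command line. The Python sketch below builds the `--set` arguments for a `helm upgrade` call that mirrors the commands used elsewhere on this page. The helper name and the override values are hypothetical; the keys are taken from the values shown above.

```python
import shlex


def threshold_flags(overrides):
    """Turn a {key: value} mapping into --set flags under alerting.thresholds."""
    flags = []
    for key in sorted(overrides):
        flags += ["--set", f"alerting.thresholds.{key}={overrides[key]}"]
    return flags


command = [
    "helm", "upgrade", "aegis-shield", "takumo/aegis-shield",
    "--namespace", "takumo",
    "--reuse-values",
    *threshold_flags({"errorRate": 0.02, "maxConcurrentSessions": 6000}),
]

# Print the command for review instead of running it.
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; pass the list to `subprocess.run` only once the output looks right.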

Coordinator alert thresholds

alerting:
  enabled: true
  thresholds:
    tokenExchangeErrorRate: 0.05    # 5% token exchange failures
    grpcErrorRate: 0.05             # 5% gRPC error rate
    staleInstancesPercent: 0.20     # 20% of instances stale
    telemetryDropRate: 0.01         # 1% telemetry drop rate
    dbQueryLatencyP99: 0.5          # 500ms p99 DB query latency
    podRestartsPerHour: 3
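
For intuition about the `dbQueryLatencyP99` threshold, the Python sketch below computes a nearest-rank 99th percentile from raw latency samples and compares it with 0.5 seconds. This is only an illustration with invented samples: Prometheus typically estimates quantiles from histogram buckets, and this document does not show how the coordinator's rule is written.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


# 100 hypothetical query latencies in seconds: mostly fast, two slow outliers.
latencies = [0.01] * 98 + [0.4, 0.9]
value = p99(latencies)
print(value, value > 0.5)
```

With these samples the single 0.9 s outlier sits above the 99th percentile, so the p99 is 0.4 s and stays under the 0.5 s threshold.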

The Grafana dashboards are provisioned as ConfigMaps. Customize them by editing the dashboard JSON in the grafana.dashboards section of the Helm values, or export from the Grafana UI and re-apply as a ConfigMap.
