The gateway and coordinator expose Prometheus metrics on /metrics. The Helm charts include ServiceMonitor resources, PrometheusRule alert definitions, and optional Grafana dashboard ConfigMaps.

Enabling metrics collection

Both the gateway and coordinator charts have a serviceMonitor section:
serviceMonitor:
  enabled: true
  interval: 15s
  scrapeTimeout: 10s
  additionalLabels: {}
Enable it during installation:
helm install aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --set serviceMonitor.enabled=true
helm install gateway-coordinator takumo/gateway-coordinator \
  --namespace takumo \
  --set serviceMonitor.enabled=true
The Prometheus Operator discovers the ServiceMonitor resources automatically.
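
If your Prometheus instance only selects ServiceMonitors that carry a particular label, additionalLabels is where that label would go. The following is a minimal values sketch, assuming a Prometheus resource whose serviceMonitorSelector matches a release label. The file name, the label key, and the label value are all assumptions; check the selector on your own Prometheus resource before copying them.

```yaml
# values-metrics.yaml (hypothetical file name)
serviceMonitor:
  enabled: true
  interval: 15s
  scrapeTimeout: 10s        # Prometheus requires the timeout to be no greater than the interval
  additionalLabels:
    release: kube-prometheus-stack   # assumed label; must match your Prometheus serviceMonitorSelector
```

Pass the file with -f values-metrics.yaml on helm install or helm upgrade instead of the individual --set flags.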

Deploying the monitoring stack

The monitoring chart deploys kube-prometheus-stack (Prometheus, Grafana, Alertmanager) with Takumo-specific defaults:
helm install takumo-monitoring takumo/monitoring \
  --namespace takumo
This provisions:
  • Prometheus with 7-day retention, 10Gi storage, and cross-namespace ServiceMonitor discovery
  • Grafana with 2Gi persistent storage and a sidecar that auto-loads dashboard ConfigMaps from all namespaces
  • Alertmanager with 120-hour retention and 1Gi storage

Enabling Grafana dashboards

The gateway chart includes pre-built Grafana dashboards as ConfigMaps:
helm upgrade aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --reuse-values \
  --set grafana.dashboards.enabled=true
The Grafana sidecar picks up the dashboard ConfigMaps automatically; no manual import is needed.

Gateway metrics

Request metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_requests_total | Counter | Total requests processed, labeled by status (HTTP code) and provider |
| aegis_request_duration_seconds | Histogram | Request latency from gateway receipt to response delivery |
| aegis_upstream_errors_total | Counter | Errors from upstream AI providers, labeled by provider |
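
Counters and histograms are usually queried through rate() rather than read directly. As a sketch of how the request metrics might be turned into recording rules, the PrometheusRule below derives a per-provider request rate and a 95th percentile latency. The resource name, group name, and record names are hypothetical, and the _bucket series is assumed from the standard Prometheus histogram exposition; none of this is taken from the chart itself.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aegis-request-recording-rules   # hypothetical name
  namespace: takumo
spec:
  groups:
    - name: aegis-requests.rules        # hypothetical group name
      rules:
        # Requests per second over the last 5 minutes, split by upstream provider
        - record: aegis:requests:rate5m
          expr: sum by (provider) (rate(aegis_requests_total[5m]))
        # 95th percentile request latency, computed from the histogram buckets
        - record: aegis:request_duration_seconds:p95_5m
          expr: histogram_quantile(0.95, sum by (le) (rate(aegis_request_duration_seconds_bucket[5m])))
```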

Security metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_secrets_detected_total | Counter | Secrets found and tokenized |
| aegis_security_violations_total | Counter | Policy violations detected |
| aegis_auth_failures_blocked_total | Counter | Authentication failures blocked |

Infrastructure metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_vault_entries | Gauge | Current number of entries in the in-memory vault |
| aegis_concurrent_sessions | Gauge | Current number of active sessions |
| aegis_circuit_breaker_state | Gauge | Circuit breaker status: 0 = closed (healthy), 1 = open (tripped) |
| aegis_connector_state | Gauge | Cloud connector status: 4 = standalone mode |
| aegis_redis_degraded | Gauge | Redis connection health: 0 = healthy, 1 = degraded |
| aegis_redis_health_state | Gauge | Redis health detail: 0 = healthy, >=1 = degraded or unavailable |

Rate limiting metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_rate_limit_checks_total | Counter | Rate limit checks, labeled by result (allowed or rejected) |

Audit metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_audit_events_discarded_total | Counter | Audit events discarded due to shipping failures |

SLI metrics

| Metric | Type | Description |
| --- | --- | --- |
| aegis_sli_availability_ratio | Gauge | Ratio of successful requests to total requests |
| aegis_sli_latency_ratio | Gauge | Ratio of requests within latency budget |
| aegis_sli_security_ratio | Gauge | Ratio of requests without security violations |

Alert rules

The gateway chart includes a full set of PrometheusRule alert definitions. Enable them with:
helm upgrade aegis-shield takumo/aegis-shield \
  --namespace takumo \
  --reuse-values \
  --set alerting.enabled=true

SLO alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| AvailabilitySLOBreach | critical | Availability ratio below threshold for 5 minutes |
| LatencySLOBreach | warning | Latency ratio below threshold for 5 minutes |
| SecuritySLOBreach | critical | Security ratio below threshold for 5 minutes |

Error alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| HighErrorRate | critical | 5xx error rate exceeds 5% for 5 minutes |
| UpstreamErrors | warning | Upstream provider errors exceed 5/sec for 5 minutes |
| AegisUpstreamErrorRate | critical | Per-provider error rate exceeds 10% for 5 minutes |
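
To make the conditions above concrete, here is a sketch of how a rule like HighErrorRate could be written by hand as a PrometheusRule. This is not the chart's actual definition, which is not shown here. It assumes the status label on aegis_requests_total carries the numeric HTTP code, and the resource, group, and alert names are hypothetical.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: aegis-custom-error-alerts     # hypothetical name
  namespace: takumo
spec:
  groups:
    - name: aegis-custom-errors       # hypothetical group name
      rules:
        # Renamed so it does not clash with the alert shipped by the chart
        - alert: HighErrorRateCustom
          # Share of requests with a 5xx status over the last 5 minutes
          expr: |
            sum(rate(aegis_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(aegis_requests_total[5m]))
              > 0.05
          for: 5m                     # condition must hold for 5 minutes before firing
          labels:
            severity: critical
          annotations:
            summary: More than 5% of gateway requests returned a 5xx status
```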

Infrastructure alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| CircuitBreakerOpen | critical | Circuit breaker open for 2 minutes |
| RedisDegraded | warning | Redis connection degraded for 3 minutes |
| HighConcurrentSessions | warning | Concurrent sessions exceed threshold for 5 minutes |
| AegisGatewayStandalone | warning | Gateway in standalone mode (no coordinator) for 5 minutes |
| AegisTokenExchangeCircuitOpen | warning | Token exchange circuit breaker open for 2 minutes |
| AegisRedisHealthDegraded | warning | Redis health degraded for 3 minutes |

Security alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| SecurityViolationSpike | critical | Security violations exceed threshold for 3 minutes |
| BruteForceDetected | critical | Auth failures exceed threshold for 2 minutes |
| SecretsLeaking | critical | Secrets detected in outbound traffic for 1 minute |

Pod health alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| PodNotReady | warning | Pod not ready for 5 minutes |
| PodRestarting | warning | Pod restarted more than 3 times in 1 hour |

Rate limiting alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| HighRateLimitRejections | warning | Rate limit rejection rate exceeds 10% for 5 minutes |

Audit alerts

| Alert | Severity | Condition |
| --- | --- | --- |
| AegisAuditEventsDiscarded | warning | Any audit events discarded in the last hour |

Tuning alert thresholds

All thresholds are configurable via Helm values:
alerting:
  enabled: true
  thresholds:
    availability: 0.99          # 99% availability SLO
    latency: 0.95               # 95% within latency budget
    security: 0.999             # 99.9% security SLO
    errorRate: 0.05             # 5xx error rate
    upstreamErrorsPerSec: 5     # Upstream errors/sec
    securityViolationsPerMin: 10
    bruteForcePerMin: 20
    maxConcurrentSessions: 4000
    podRestartsPerHour: 3
    rateLimitRejectionRate: 0.1 # 10% rejection rate
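
Because Helm merges nested maps from a values file over the chart defaults, an override file only needs the keys you want to change. A minimal sketch, assuming you want a stricter error-rate alert and a lower session ceiling; the file name and the two numbers are illustrative, not recommendations:

```yaml
# alert-overrides.yaml (hypothetical file name)
alerting:
  enabled: true
  thresholds:
    errorRate: 0.02               # alert at a 2% 5xx rate
    maxConcurrentSessions: 2000   # alert earlier on session growth
```

Apply it with helm upgrade, adding -f alert-overrides.yaml alongside --reuse-values; thresholds not listed in the file keep their existing values.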

Coordinator alert thresholds

alerting:
  enabled: true
  thresholds:
    tokenExchangeErrorRate: 0.05    # 5% token exchange failures
    grpcErrorRate: 0.05             # 5% gRPC error rate
    staleInstancesPercent: 0.20     # 20% of instances stale
    telemetryDropRate: 0.01         # 1% telemetry drop rate
    dbQueryLatencyP99: 0.5          # 500ms p99 DB query latency
    podRestartsPerHour: 3

The Grafana dashboards are provisioned as ConfigMaps. Customize them by editing the dashboard JSON in the grafana.dashboards section of the Helm values, or export from the Grafana UI and re-apply as a ConfigMap.
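
For the export-and-re-apply route, a dashboard ConfigMap might look like the sketch below. The ConfigMap name, the data key, and the stub JSON are hypothetical, and the grafana_dashboard label is an assumption based on the common kube-prometheus-stack sidecar default; confirm the label your Grafana sidecar is configured to watch.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-custom-dashboard      # hypothetical name
  namespace: takumo
  labels:
    grafana_dashboard: "1"          # assumed sidecar label; match your sidecar configuration
data:
  # Replace the stub below with the JSON exported from the Grafana UI
  aegis-custom.json: |
    {
      "title": "Aegis Gateway (custom)",
      "uid": "aegis-custom",
      "panels": []
    }
```

Giving the custom dashboard its own uid keeps it from colliding with the dashboards shipped by the chart.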