
Monitoring & Observability

Set up metrics, logging, and health monitoring for Cadence

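This guide covers monitoring and observability for the Cadence webhook server and its analysis operations: health checks, Prometheus metrics, structured logging, log aggregation, alerting, and troubleshooting.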

Health Checks

Built-in Health Endpoint

Bash
curl http://localhost:8000/health

Response:

JSON
{
  "status": "ok",
  "time": "2026-02-02T10:30:00Z"
}
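
A deployment script or smoke test can poll this endpoint and treat anything other than a "status" of "ok" as unhealthy. The following is a minimal sketch using only the Python standard library; the function names `parse_health` and `check_health` and the 2-second timeout are illustrative assumptions, not part of Cadence.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True when a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(url: str = "http://localhost:8000/health", timeout: float = 2.0) -> bool:
    """Fetch the health endpoint; connection and HTTP errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read().decode("utf-8"))
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Keeping the parsing separate from the network call makes the check easy to unit-test without a running server.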

Kubernetes Probes

YAML
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 2

Prometheus Metrics

Common Metrics

  • cadence_jobs_total - Total jobs processed
  • cadence_jobs_duration_seconds - Job processing duration in seconds (histogram; exposes _bucket series)
  • cadence_queue_depth - Jobs waiting in queue
  • cadence_analysis_errors_total - Analysis errors
  • cadence_cache_hits_total - Cache hit count
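
For quick scripting against the /metrics endpoint, the Prometheus text exposition format is simple enough to parse by hand. The sketch below reads unlabelled samples into a dictionary. The `parse_metrics` helper and the sample values are hypothetical; real scrape output from Cadence may include labels and additional series.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Skip labelled series (e.g. histogram buckets) in this simple sketch.
        if "{" in parts[0] or len(parts) < 2:
            continue
        samples[parts[0]] = float(parts[1])
    return samples


# Hypothetical scrape output; the values are made up for illustration.
SAMPLE = """\
# HELP cadence_jobs_total Total jobs processed
# TYPE cadence_jobs_total counter
cadence_jobs_total 1520
# TYPE cadence_queue_depth gauge
cadence_queue_depth 7
cadence_analysis_errors_total 12
cadence_cache_hits_total 940
"""

metrics = parse_metrics(SAMPLE)
print(metrics["cadence_queue_depth"])
```

For anything beyond ad hoc checks, a maintained Prometheus client library is a better choice than a hand-written parser.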

Prometheus Configuration

YAML
scrape_configs:
  - job_name: 'cadence'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana Queries

promql
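# Throughput (jobs/min); rate() returns a per-second value, so scale by 60
rate(cadence_jobs_total[5m]) * 60

# P95 latency in seconds; histogram_quantile needs per-bucket rates aggregated by le
histogram_quantile(0.95, sum by (le) (rate(cadence_jobs_duration_seconds_bucket[5m])))

# Cache hit rate (requires a cadence_cache_misses_total counter, which is not in the list above)
rate(cadence_cache_hits_total[5m]) / (rate(cadence_cache_hits_total[5m]) + rate(cadence_cache_misses_total[5m]))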
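
To make the P95 query less opaque, the sketch below shows how a quantile is estimated from cumulative histogram buckets using linear interpolation within the bucket that contains the target rank. It follows the general approach Prometheus uses but simplifies edge cases. The bucket boundaries and counts are invented for illustration and are not Cadence's actual buckets.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            width = count - prev_count
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / width if width else 0.0
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: (upper bound in seconds, jobs at or below it).
BUCKETS = [(0.5, 10), (1.0, 60), (2.5, 90), (5.0, 98), (math.inf, 100)]

p95 = histogram_quantile(0.95, BUCKETS)
print(round(p95, 2))  # about 4.06 seconds
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the latencies of interest.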

Structured Logging

Log Format

JSON
{
  "timestamp": "2026-02-02T10:30:00Z",
  "level": "info",
  "component": "webhook",
  "message": "Job completed",
  "duration_ms": 1250
}
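
If a companion script or sidecar should emit logs in the same shape, so that one aggregation pipeline can parse everything, a custom formatter for Python's standard `logging` module is one option. This is a sketch under assumptions: the `JsonFormatter` class is hypothetical, the logger name stands in for the "component" field, and it does not describe how Cadence itself produces its logs.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python's level names onto the level names used in this document.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            # Assumption: the logger name is used as the component.
            "component": record.name,
            "message": record.getMessage(),
        }
        # Optional field, supplied through logging's `extra` argument.
        if hasattr(record, "duration_ms"):
            entry["duration_ms"] = record.duration_ms
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("webhook")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Job completed", extra={"duration_ms": 1250})
```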

Log Levels

  • debug - Detailed diagnostic information
  • info - General operational information
  • warn - Warning messages
  • error - Error messages
  • fatal - Fatal errors

Configuration

YAML
logging:
  level: "info"
  format: "json"
  output: "stdout"

Log Aggregation

ELK Stack (Filebeat)

YAML
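# Assumes Cadence logs are written or redirected to /var/log/cadence/
filebeat.inputs:
  - type: filestream
    id: cadence-logs  # each filestream input needs a unique id; this name is an example
    paths:
      - /var/log/cadence/*.log

# A custom index name requires matching template settings
setup.template.name: "cadence"
setup.template.pattern: "cadence-*"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "cadence-%{+yyyy.MM.dd}"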

CloudWatch

JSON
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/cadence/*.log",
          "log_group_name": "/aws/cadence/webhook"
        }]
      }
    }
  }
}

Alerting

Prometheus Rules

YAML
groups:
  - name: cadence-alerts
    rules:
      - alert: CadenceDown
        expr: up{job="cadence"} == 0
        for: 1m

      - alert: HighErrorRate
        expr: rate(cadence_analysis_errors_total[5m]) > 0.1
        for: 5m

      - alert: QueueBackup
        expr: cadence_queue_depth > 100
        for: 10m
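
The `for` clause means the expression must stay true across consecutive rule evaluations for the whole duration before the alert fires; a single evaluation below the threshold resets the timer. The sketch below simulates that behavior for the QueueBackup rule. The `first_firing_time` helper and the queue-depth samples are hypothetical, and a 60-second evaluation interval is assumed.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert fires, or None if it never does.

    `samples` is a list of (timestamp_seconds, value) pairs, one per rule
    evaluation. The alert is pending while value > threshold and fires once
    that has held continuously for `for_seconds`.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= for_seconds:
                return ts
        else:
            # Any evaluation at or below the threshold resets the pending state.
            pending_since = None
    return None


# Queue depth is 150 until t=240s, dips to 80 at t=300s, then stays at 120.
samples = [(t, 150 if t < 300 else 80 if t == 300 else 120) for t in range(0, 1201, 60)]

# QueueBackup: cadence_queue_depth > 100 for 10m (600 seconds).
fired_at = first_firing_time(samples, threshold=100, for_seconds=600)
print(fired_at)  # 960: the dip at t=300 restarted the 10-minute window at t=360
```

This is why a longer `for` value suppresses alerts on short spikes at the cost of slower detection.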

Notification Channels

  • Slack
  • Email
  • PagerDuty

Custom Metrics

Beyond the built-in metrics, consider tracking:

  • Webhook request latency (p50, p95, p99)
  • Cache hit/miss rates
  • Database query latency
  • Repositories analyzed per day
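
When latency samples are collected outside Prometheus, for example from parsed `duration_ms` log fields, the percentiles can be computed directly. The sketch below uses the nearest-rank method. The `percentile` and `latency_summary` helpers are illustrative names, and other percentile definitions that interpolate between samples will give slightly different values.

```python
import math


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize request latencies as p50, p95, and p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


# Synthetic latencies of 1..100 ms, purely for illustration.
summary = latency_summary(list(range(1, 101)))
print(summary)  # {'p50': 50, 'p95': 95, 'p99': 99}
```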

Dashboards

Key metrics to display:

  • Health status
  • Jobs processed (total)
  • Active jobs
  • Queue depth
  • Error rate
  • Average job duration
  • P95 latency
  • Memory usage
  • CPU usage

Troubleshooting Monitoring

High Memory Usage

Bash
# Check memory-related metrics
curl -s http://localhost:8000/metrics | grep memory

# Reduce workers (set in the server's environment before starting it)
export CADENCE_WEBHOOK_MAX_WORKERS=4

High Latency

Bash
# Check queue depth
curl -s http://localhost:8000/metrics | grep queue_depth

# Increase workers (set in the server's environment before starting it)
export CADENCE_WEBHOOK_MAX_WORKERS=16

Error Rate Spike

Bash
# Check the 50 most recent error-level log entries (assumes a systemd unit named cadence)
journalctl -u cadence -p err -n 50

# Check error metrics
curl -s http://localhost:8000/metrics | grep error
