Monitoring & Observability
Set up metrics, logging, and health monitoring for Cadence
Comprehensive monitoring and observability setup for Cadence webhook server and analysis operations.
Health Checks
Built-in Health Endpoint
Bash
curl http://localhost:8000/health
Response:
JSON
{
"status": "ok",
"time": "2026-02-02T10:30:00Z"
}
Kubernetes Probes
YAML
livenessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 30
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 10
failureThreshold: 2
Prometheus Metrics
Common Metrics
cadence_jobs_total- Total jobs processedcadence_jobs_duration_seconds- Job processing durationcadence_queue_depth- Jobs waiting in queuecadence_analysis_errors_total- Analysis errorscadence_cache_hits_total- Cache hit count
Prometheus Configuration
YAML
scrape_configs:
- job_name: 'cadence'
static_configs:
- targets: ['localhost:8000']
metrics_path: '/metrics'
scrape_interval: 15s
Grafana Queries
promql
# Throughput (jobs/min)
rate(cadence_jobs_total[5m])
# P95 latency
histogram_quantile(0.95, cadence_jobs_duration_seconds_bucket)
# Cache hit rate
rate(cadence_cache_hits_total[5m]) / (rate(cadence_cache_hits_total[5m]) + rate(cadence_cache_misses_total[5m]))
Structured Logging
Log Format
JSON
{
"timestamp": "2026-02-02T10:30:00Z",
"level": "info",
"component": "webhook",
"message": "Job completed",
"duration_ms": 1250
}
Log Levels
debug- Detailed diagnostic informationinfo- General operational informationwarn- Warning messageserror- Error messagesfatal- Fatal errors
Configuration
YAML
logging:
level: "info"
format: "json"
output: "stdout"
Log Aggregation
ELK Stack (Filebeat)
YAML
filebeat.inputs:
- type: filestream
paths:
- /var/log/cadence/*.log
output.elasticsearch:
hosts: ["elasticsearch:9200"]
index: "cadence-%{+yyyy.MM.dd}"
CloudWatch
JSON
{
"logs": {
"logs_collected": {
"files": {
"collect_list": [{
"file_path": "/var/log/cadence/*.log",
"log_group_name": "/aws/cadence/webhook"
}]
}
}
}
}
Alerting
Prometheus Rules
YAML
groups:
- name: cadence-alerts
rules:
- alert: CadenceDown
expr: up{job="cadence"} == 0
for: 1m
- alert: HighErrorRate
expr: rate(cadence_analysis_errors_total[5m]) > 0.1
for: 5m
- alert: QueueBackup
expr: cadence_queue_depth > 100
for: 10m
Notification Channels
- Slack
- PagerDuty
Custom Metrics
Track:
- Webhook request latency (p50, p95, p99)
- Cache hit/miss rates
- Database query latency
- Repositories analyzed per day
Dashboards
Key metrics to display:
- Health status
- Jobs processed (total)
- Active jobs
- Queue depth
- Error rate
- Average job duration
- P95 latency
- Memory usage
- CPU usage
Troubleshooting Monitoring
High Memory Usage
Bash
# Check metrics
curl http://localhost:8000/metrics | grep memory
# Reduce workers
CADENCE_WEBHOOK_MAX_WORKERS=4
High Latency
Bash
# Check queue
curl http://localhost:8000/metrics | grep queue_depth
# Increase workers
CADENCE_WEBHOOK_MAX_WORKERS=16
Error Rate Spike
Bash
# Check logs
journalctl -u cadence -p err -n 50
# Get metrics
curl http://localhost:8000/metrics | grep error
Next Steps
- Performance Tuning - Optimize throughput
- Troubleshooting - Diagnose issues
- Prometheus Docs - Advanced setup