High Availability Deployment

Deploy Cadence for high availability and fault tolerance

Production-grade high availability setup for Cadence with multiple instances, redundancy, failover, and load balancing.

Architecture Overview

GitHub/GitLab Webhooks
        │ HTTPS
        ▼
    ┌─────────────────┐
    │  Load Balancer  │ (HAProxy + Nginx)
    └────────┬────────┘
    ┌────────┴────────┐
    │        │        │
    ▼        ▼        ▼
Cadence-1 Cadence-2 Cadence-3
  :8001     :8001     :8001
    │        │        │
    └────────┬────────┘
             │
    ┌────────┴────────┐
    │                 │
 Redis Cluster   PostgreSQL
  (3+ nodes)       (Primary)
                      │
                      ▼
                  PostgreSQL
                  (Replica)

Prerequisites

  • 3+ servers minimum (for quorum)
  • Shared Redis cluster
  • PostgreSQL replication
  • Load balancer (HAProxy)
  • Monitoring system

Multi-Instance Setup

1. Configure Multiple Instances

Create separate systemd services for each instance:

Bash
# Instance 1: Port 8001
# Instance 2: Port 8002  
# Instance 3: Port 8003

2. Shared Cache with Redis

Bash
# Install Redis
sudo apt-get install -y redis-server

# Configure cluster
cat <<EOF | sudo tee /etc/redis/redis-cluster.conf
port 6379
cluster-enabled yes
appendonly yes
maxmemory 2gb
maxmemory-policy allkeys-lru
EOF

3. PostgreSQL Replication

Bash
# Primary configuration
listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10

# Replica configuration
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'

4. Load Balancer (HAProxy)

INI
# /etc/haproxy/haproxy.cfg

global
    maxconn 4096
    daemon

defaults
    mode http
    timeout connect 5000
    timeout client 50000
    timeout server 50000

frontend cadence_front
    bind *:80
    bind *:443 ssl crt /etc/ssl/cadence.pem
    http-request redirect scheme https code 301 if !{ ssl_fc }
    default_backend cadence_back

backend cadence_back
    balance roundrobin
    option httpchk GET /health HTTP/1.1
    
    server cadence1 10.0.1.10:8001 check fall 3 rise 2
    server cadence2 10.0.1.11:8001 check fall 3 rise 2
    server cadence3 10.0.1.12:8001 check fall 3 rise 2

listen stats
    bind *:8404
    stats enable
    stats uri /stats

5. Verify Failover

Test automatic failover:

Bash
# Stop one instance
sudo systemctl stop cadence-webhook@1

# Verify others still handle traffic
curl http://noslop.example.com/health

# Restart
sudo systemctl start cadence-webhook@1

Monitoring & Alerting

Prometheus Rules

YAML
groups:
- name: cadence_ha
  rules:
  - alert: CadenceNodeDown
    expr: up{job="cadence"} == 0
    for: 2m
    annotations:
      summary: "Cadence instance down"
  
  - alert: ClusterDegraded
    expr: count(up{job="cadence"}) < 2
    for: 1m
    annotations:
      summary: "Less than 2 instances running"

Disaster Recovery

Backup Strategy

Bash
# Daily backups
0 2 * * * pg_dump cadence | gzip > /backups/pg-$(date +%Y%m%d).sql.gz
redis-cli BGSAVE && cp /var/lib/redis/dump.rdb /backups/redis-$(date +%Y%m%d).rdb

Recovery Procedure

Bash
# Restore PostgreSQL
gunzip < /backups/pg-20240202.sql.gz | psql -d cadence

# Restore Redis
cp /backups/redis-20240202.rdb /var/lib/redis/dump.rdb
systemctl restart redis-server

# Verify health
for i in 1 2 3; do
  curl -s http://cadence-$i:8000/health | jq .
done

Multi-Region Setup

For global high availability:

Region 1 (US-East)       Region 2 (EU-West)
┌──────────────────┐     ┌──────────────────┐
│  Cadence x3      │     │  Cadence x3      │
│  Redis Cluster   │◄────►  Redis Cluster   │
│  Postgres Prim   │     │  Postgres Sec    │
└──────────────────┘     └──────────────────┘
        │                          │
        └──────────────┬───────────┘
                   DNS Failover

Next Steps