
Swarm

How The Helper Bees runs its Docker Swarm clusters. We operate two clusters, one per environment (non-production and production), both living in hb-infra on GCP, region us-central1. This page is the architecture overview. For depth, follow the consolidation RFC, runbooks, and service docs.

Each cluster is the same shape: a small fixed set of manager nodes that hold the cluster together and run the control and observability plane, plus an autoscaling pool of worker nodes that run the application workloads. A custom controller, swarmctl, sits on the managers and brokers the two things native Swarm will not do on its own: rebalancing existing tasks onto new workers, and draining workers before they leave.

The naming convention is swarm-{env_code}-*, where the env code is the first character of the environment: swarm-n-* is non-production, swarm-p-* is production.
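As a small illustration of the convention, the prefix can be derived from the environment name. The function name cluster_prefix is hypothetical and does not come from the repo.

```python
def cluster_prefix(environment: str) -> str:
    """Return the resource-name prefix for an environment, e.g. 'swarm-p'."""
    env = environment.strip().lower()
    if not env:
        raise ValueError("environment must be non-empty")
    # The env code is the first character of the environment name.
    return f"swarm-{env[0]}"


print(cluster_prefix("non-production"))  # swarm-n
print(cluster_prefix("production"))      # swarm-p
```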

At a Glance

| | Managers | Workers |
|---|---|---|
| Form | 3 fixed Terraform VMs (e2-medium) | Regional Managed Instance Group (n2-highmem-4) |
| Zones | us-central1-a / b / c (one each) | us-central1-a / b / c / f |
| Role | Raft quorum, control + observability plane | Application workloads |
| Scaling | Static (3, always) | Autoscaling on memory (non-prod 12 to 18, prod 8 to 12) |
| External IP | Reserved static (managed in the caller) | None (egress via Cloud NAT) |
| SSH | OS Login + IAP only | OS Login + IAP only |

Architecture

Color marks what each component is:

  • Blue: manager nodes
  • Green: worker nodes
  • Amber: the swarmctl control service (runs on the managers)
  • Purple: the observability stack, Prometheus and Grafana (also on the managers)
  • Teal: GCP-managed services (Secret Manager, Cloud NAT, Managed Prometheus, and the MIG autoscaler and health check)
  • Grey: external systems

%%{init: {'theme':'dark','themeVariables':{'fontSize':'16px','lineColor':'#9ca3af'},'flowchart':{'curve':'basis','nodeSpacing':45,'rankSpacing':60}}}%%
flowchart TB
  GHA["GitHub Actions<br/>deploy stacks via IAP"]:::ext

  subgraph MGRS["Manager plane  ·  Raft quorum"]
    direction LR
    M1["swarm-mgr-1<br/>zone a  ·  initializer"]:::mgr
    M2["swarm-mgr-2<br/>zone b"]:::mgr
    M3["swarm-mgr-3<br/>zone c"]:::mgr
    M1 ~~~ M2 ~~~ M3
  end

  SC["swarmctl<br/>leader-gated  ·  :9876<br/>/drain  ·  /rebalance"]:::ctl
  PROM["Prometheus<br/>in-swarm  ·  1h tmpfs"]:::obs

  subgraph WRK["Worker plane"]
    direction LR
    W1["worker a"]:::wkr
    W2["worker b"]:::wkr
    W3["worker c"]:::wkr
    W4["worker f"]:::wkr
    W1 ~~~ W2 ~~~ W3 ~~~ W4
  end

  AS["MIG autoscaler<br/>scale on memory  ·  70%"]:::gcp
  HC["MIG health check :9323<br/>autoheal REPAIR"]:::gcp
  SM["Secret Manager<br/>join + drain tokens"]:::gcp
  NAT["Cloud NAT<br/>worker egress"]:::gcp
  GMP["Google Managed<br/>Prometheus"]:::gcp
  GRAF["Grafana<br/>Cloudflare Access SSO"]:::obs
  PD["PagerDuty"]:::ext
  DMS["Dead Man's Snitch<br/>leader-gated"]:::ext

  GHA -->|"stack deploy"| MGRS
  MGRS -.->|"runs"| SC
  MGRS -.->|"runs"| PROM
  MGRS -->|"writes tokens"| SM
  WRK -->|"read token, probe :2377"| SM
  WRK -->|"egress"| NAT
  WRK -->|"ExecStop POST /drain"| SC
  SC -->|"rebalance / drain"| WRK
  AS -.->|"scales"| WRK
  HC -.->|"autoheals"| WRK
  PROM -->|"export"| GMP
  GMP --> GRAF
  GMP -->|"PromQL alerts"| PD
  SC -->|"liveness ping"| DMS

  classDef mgr fill:#bfdbfe,stroke:#1d4ed8,stroke-width:1px,color:#0b1e3b
  classDef wkr fill:#bbf7d0,stroke:#15803d,stroke-width:1px,color:#06281a
  classDef ctl fill:#fde68a,stroke:#b45309,stroke-width:1px,color:#3a2400
  classDef gcp fill:#99f6e4,stroke:#0f766e,stroke-width:1px,color:#04241f
  classDef obs fill:#ddd6fe,stroke:#6d28d9,stroke-width:1px,color:#241152
  classDef ext fill:#e5e7eb,stroke:#4b5563,stroke-width:1px,color:#111827

Manager Plane

The managers are the cluster’s brain: a small fixed set of VMs that hold Raft consensus and run the control and observability plane. They are three discrete VMs, not a Managed Instance Group, one per zone.

  • Nodes. swarm-mgr-1/2/3, defined in Terraform with for_each over a manager_nodes map (zones a, b, c), so the VM set and their reserved IPs cannot drift apart. Each is an e2-medium with a 100 GB pd-balanced disk, off a custom ubuntu-2204-docker-swarm-falcon image (Ubuntu 22.04 with Docker and CrowdStrike Falcon). Shielded VM is on.
  • Quorum. The Raft quorum tolerates one failure: lose a manager and it recovers automatically (Terraform recreates it, the startup script rejoins). Losing two of three has no automatic recovery, a deliberate trade-off over an external etcd or Consul. Recover via the quorum-loss runbook.
  • Bootstrap and join. The initializer (swarm-mgr-1) runs docker swarm init and writes the worker and manager join tokens to GCP Secret Manager. Everyone else reads its token from there (never from metadata) and joins, finding managers through their static internal IPs on TCP :2377.
  • Lifecycle. Each boot disk is a standalone resource (auto_delete = false) with hourly snapshots (10-day retention), bounding Raft loss to about an hour. Image upgrades replace the disk one manager at a time, never two (that would break quorum). See the image-upgrade runbook.
  • swarm-reaper. A leader-gated systemd timer runs every 60s and removes workers stuck Down past the 300s grace period, clearing the ghost entries left by autoscaler scale-in. It also exports a swarm_workers_down metric that feeds the swarm_ghost_workers alert.
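The reaper's selection rule from the last bullet can be sketched as a pure function. This is a minimal illustration, not the actual implementation: the Node fields, the reap_candidates name, and the way Down duration is measured are all assumptions made for the example.

```python
from dataclasses import dataclass

GRACE_SECONDS = 300  # grace period quoted in the text


@dataclass(frozen=True)
class Node:
    name: str
    role: str          # "manager" or "worker"
    state: str         # "ready" or "down"
    down_for_s: float  # seconds spent Down; hypothetical field


def reap_candidates(nodes, is_leader, grace_s=GRACE_SECONDS):
    """Return names of workers that have been Down longer than the grace period."""
    if not is_leader:
        return []  # leader-gated: a non-leader manager does nothing
    return [
        n.name
        for n in nodes
        if n.role == "worker" and n.state == "down" and n.down_for_s > grace_s
    ]


nodes = [
    Node("swarm-mgr-1", "manager", "ready", 0),
    Node("worker-a", "worker", "down", 420),  # past the grace period: ghost entry
    Node("worker-b", "worker", "down", 45),   # still inside the grace period
    Node("worker-c", "worker", "ready", 0),
]
print(reap_candidates(nodes, is_leader=True))   # ['worker-a']
print(reap_candidates(nodes, is_leader=False))  # []
```

Managers are never candidates in this sketch, matching the text's scope of removing workers only.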

Worker Plane

The workers run the application workloads on an autoscaling regional Managed Instance Group (swarm-{n|p}-workers).

  • Fleet. n2-highmem-4 VMs across four zones (a, b, c, f). The instance template is content-addressed, so any change to machine type, image, disk, tags, manager IPs, or startup script renames it. Workers have no external IP and egress through Cloud NAT.
  • Autoscaling on memory, not CPU. Memory is the binding constraint, so the autoscaler keys on the Ops Agent metric agent.googleapis.com/memory/percent_used: 70% target, 180s cooldown, scale-in at most one replica per 15-minute window. min_replicas and max_replicas are bootstrap defaults only (ignore_changes); live bounds are tuned in the console (intent: non-prod 12 to 18, prod 8 to 12, the non-prod floor higher because it absorbed migration churn).
  • Health and autohealing. A TCP health check on :9323 (the dockerd metrics endpoint) is the liveness signal: if dockerd answers, the node can take tasks. About three minutes of failures triggers a MIG REPAIR (600s initial delay, so a still-booting node is not recreated).
  • Joining and draining. On boot a worker fetches its join token from Secret Manager and probes each manager on :2377 (up to five minutes, to ride out firewall propagation). The MIG update policy is OPPORTUNISTIC: template changes stage but do not auto-roll, so workers are rolled one at a time (see the rolling-update runbook). On scale-in, a systemd ExecStop unit calls swarmctl’s /drain so tasks migrate first, within an 80s shutdown ceiling, then falls through to docker swarm leave --force.
  • Validation. Exercise the full autoscale lifecycle (MIG provision, swarmctl rebalance, worker drain) with the autoscale-validation runbook.
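To make the autoscaling numbers above concrete, here is a simplified proportional model of a single sizing decision. It is a sketch under stated assumptions: the real GCP autoscaler applies its own stabilization logic, and the next_size name and the proportional formula are illustrative rather than taken from GCP or the repo. The defaults use the production intent (8 to 12).

```python
import math


def next_size(current, mem_percent_used, target=70.0,
              min_replicas=8, max_replicas=12):
    """Simplified model of one autoscaler decision on memory utilization."""
    # Size the group so average utilization would fall back to the target.
    wanted = math.ceil(current * mem_percent_used / target)
    if wanted < current:
        # Scale-in control: remove at most one replica per 15-minute window.
        wanted = max(wanted, current - 1)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


print(next_size(10, 84.0))  # 12: scale out to bring memory back toward 70%
print(next_size(10, 35.0))  # 9: scale-in is limited to one replica
print(next_size(8, 35.0))   # 8: never below min_replicas
print(next_size(12, 91.0, min_replicas=12, max_replicas=18))  # 16 (non-prod bounds)
```

Scale-out is unbounded within max_replicas, while scale-in moves one replica at a time, which is why a quiet cluster shrinks slowly.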

Control Plane: swarmctl

Swarm’s scheduler places new tasks across nodes but never moves existing replicas, and on shutdown it reactively kills and reschedules tasks. swarmctl is a small Go controller that fills both gaps:

  • Rebalance: when the autoscaler adds a worker, swarmctl force-updates services so Swarm spreads tasks onto it (otherwise a fresh worker sits idle).
  • Drain: on worker shutdown, swarmctl sets the node to drain and waits for tasks to migrate before the VM leaves.

swarmctl runs as its own stack in mode: global, pinned to node.role == manager (one task per manager), and is leader-gated: only the instance on the current Raft leader does work; followers return 503. It listens on host-mode port :9876 (not the ingress mesh) so that /drain can resolve the calling worker by its real VPC source IP. The API uses two separate bearer tokens (workers can only drain, managers can only rebalance) and fails closed. swarmctl also pings a leader-gated Dead Man’s Snitch, so a wedged or dead leader is detected even when followers look healthy.
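The leader gate and the two-token split can be sketched as a single decision function. This is an illustration in Python, not the Go implementation. Only the 503 for followers comes from the text; the other status codes, the order of the checks, the token values, and the gate name are assumptions.

```python
import hmac

# Hypothetical token values; per the text the real ones are stored out of band.
TOKENS = {"worker": "worker-secret", "manager": "manager-secret"}
# Which role may call which endpoint: workers drain, managers rebalance.
ALLOWED_ROLE = {"/drain": "worker", "/rebalance": "manager"}


def gate(path, bearer_token, is_leader):
    """Return the HTTP status a request would receive in this sketch."""
    if not is_leader:
        return 503  # followers do no work
    role = ALLOWED_ROLE.get(path)
    if role is None:
        return 404
    expected = TOKENS.get(role)
    # Fail closed: a missing token on either side denies the request.
    if not expected or not bearer_token:
        return 401
    # Constant-time comparison avoids leaking token contents through timing.
    if not hmac.compare_digest(bearer_token.encode(), expected.encode()):
        return 403
    return 200


print(gate("/drain", "worker-secret", is_leader=True))       # 200
print(gate("/drain", "manager-secret", is_leader=True))      # 403
print(gate("/rebalance", "manager-secret", is_leader=False)) # 503
```

Because each endpoint accepts exactly one token, a leaked worker token cannot trigger a rebalance, and a leaked manager token cannot drain a node.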

Observability

Observability runs inside the swarm and exports out to GCP. Prometheus (a GMP-compatible build) scrapes swarm targets, keeps only a one-hour tmpfs buffer (no data disk on managers), and exports samples to Google Managed Prometheus as the source of truth, labeled per cluster. Grafana reads GMP through a frontend shim and is fronted by Cloudflare Access SSO with no published ports. The per-node agents (cadvisor, node-exporter, dockerd-exporter, logspout) run mode: global; Prometheus, Grafana, and the tunnel are pinned to managers.

The swarm_observability Terraform module owns the PromQL alert policies against GMP, routed to PagerDuty:

  • Node and host: down, memory, OOM
  • Raft and orchestration: quorum at risk, leader missing, leader churn, Raft latency, failed tasks
  • Container: restart loops, OOM
  • Pipeline health: GMP export absent or stuck, Prometheus down, global-service coverage

Manager-VM host alerts (boot disk, dockerd and Falcon process absence, the TCP :2377 uptime check) live in the swarm_manager module instead, so each module owns the alerts for the resources it creates.

Networking

The swarm lives in a shared-VPC spoke, subnet sb-{n|p}-shared-base-us-central1. Managers have static internal IPs plus reserved external IPs (premium tier, protected from destruction). Workers are internal-only and egress through a regional Cloud NAT.

The swarm-port firewall rules are tag-scoped to swarm-node, but they are defined in the separate thehelperbees/gcp-networks repo, not in infrahive (infrahive only attaches the tag). The ports:

  • TCP 2377: cluster management
  • TCP and UDP 7946: node-to-node gossip
  • UDP 4789: overlay network (VXLAN)
  • TCP 9323: dockerd metrics

The NAT egress addresses and the exact firewall locations are documented in the outbound-IP audit spec.

There is no internal load balancer or DNS for reaching managers: nodes join via the managers’ static internal IPs passed through metadata. SSH is OS Login plus IAP only (enable-oslogin=TRUE, block-project-ssh-keys=TRUE), and every node sets vmDnsSetting=ZonalOnly to avoid the global-DNS single point of failure.
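Since joining relies on a fixed list of manager IPs rather than a load balancer or DNS name, the boot-time probe described under Worker Plane amounts to a retry loop over that list. The sketch below assumes a 5-second retry interval and uses hypothetical IPs and function names; only port 2377 and the five-minute ceiling come from the text.

```python
import socket
import time


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable_manager(manager_ips, port=2377, deadline_s=300.0,
                            interval_s=5.0, probe=tcp_open,
                            clock=time.monotonic, sleep=time.sleep):
    """Probe each manager in turn until one answers or the deadline passes."""
    give_up_at = clock() + deadline_s
    while True:
        for ip in manager_ips:
            if probe(ip, port):
                return ip
        if clock() >= give_up_at:
            return None  # caller decides how to fail the boot
        sleep(interval_s)


# Demo with a stubbed probe so the example runs without network access.
answers = {"10.0.0.3"}  # hypothetical: only this manager is up
found = first_reachable_manager(
    ["10.0.0.2", "10.0.0.3", "10.0.0.4"],
    probe=lambda host, port: host in answers,
    sleep=lambda seconds: None,
)
print(found)  # 10.0.0.3
```

Injecting probe, clock, and sleep keeps the loop testable without opening sockets or waiting in real time.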

How Workloads Land

Services are deployed with docker stack deploy of per-stack compose.yml files, using node.role constraints to place them. There are three categories:

  • Platform stacks (src/services/: swarmctl, observability) are pinned to managers and deployed by GitHub Actions, which copy the compose dir to a manager over IAP SSH and run docker stack deploy. (The docker-gcr-proxy GCR auth shim is not a stack: it runs as a per-node systemd service bound to 127.0.0.1:7676.)
  • Edge networks (for example hivebook, at src/services/hivebook/) front a Cloud Run origin with a Cloudflare Access perimeter; the tunnel is pinned to managers and the identity-sensitive token-proxy to workers.
  • Application stacks (consumer_portal, hbcrm, and so on) are defined and deployed from outside this repo (hb-ansible / AWX). They spread across workers with node.role == worker and max_replicas_per_node: 1 for anti-affinity, which is the high-availability invariant: replicas land on distinct nodes.
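The anti-affinity invariant for application stacks can be illustrated with a toy placement function. This is a simplified model, not Swarm's scheduler: the place function and the worker names are hypothetical, and the only behavior it aims to mirror is that a per-node cap of 1 puts replicas on distinct nodes and leaves any surplus unscheduled.

```python
def place(replicas, workers, max_replicas_per_node=1):
    """Spread replicas over workers while honouring a per-node cap.

    Returns (placement, pending): placement maps worker -> replica count,
    pending is the number of replicas that could not be scheduled.
    """
    placement = {w: 0 for w in workers}
    pending = 0
    for _ in range(replicas):
        # Only workers below the cap are eligible.
        candidates = [w for w in workers if placement[w] < max_replicas_per_node]
        if not candidates:
            pending += 1
            continue
        # Prefer the least-loaded eligible worker (spread placement).
        target = min(candidates, key=lambda w: placement[w])
        placement[target] += 1
    return placement, pending


workers = ["worker-a", "worker-b", "worker-c", "worker-f"]
print(place(3, workers))
# ({'worker-a': 1, 'worker-b': 1, 'worker-c': 1, 'worker-f': 0}, 0)
print(place(5, workers))
# ({'worker-a': 1, 'worker-b': 1, 'worker-c': 1, 'worker-f': 1}, 1)
```

The second call shows the cost of the invariant: with more replicas than workers, the extra replica waits instead of doubling up on a node.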

Non-Production vs Production

The manager and worker topology is identical across environments. The differences are sizing, alert routing, and thresholds.

| Knob | Non-production | Production |
|---|---|---|
| Manager topology | 3x e2-medium, zones a/b/c | identical |
| Worker machine / pool | n2-highmem-4, regional MIG a/b/c/f | identical |
| Worker min to max (intent) | 12 to 18 | 8 to 12 |
| Manager deletion protection | off | on |
| TCP :2377 uptime check | disabled (cost) | enabled (60s) |
| Alert routing | PagerDuty Staging, #triage-staging | shared PagerDuty, #triage |
| Manager node-metric alerts | enabled | off until GMP descriptors register |
| IAP SSH group | ssh-n-env@ | ssh-p-env@ |
| Observability budget | module default | 500 USD |
| Observability thresholds | loose | tightened |

Live autoscaler bounds may diverge from the code intent above, since they are tuned in the console (ignore_changes).
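A small check like the following could flag that divergence. It is only a sketch: the bounds_drift name is hypothetical, and how the live values are read from the MIG autoscaler is left out.

```python
# Documented intent from the table above: (min_replicas, max_replicas).
INTENT = {"non-production": (12, 18), "production": (8, 12)}


def bounds_drift(environment, live_min, live_max):
    """Return {field: (intended, live)} for every bound that differs from intent."""
    want_min, want_max = INTENT[environment]
    drift = {}
    if live_min != want_min:
        drift["min_replicas"] = (want_min, live_min)
    if live_max != want_max:
        drift["max_replicas"] = (want_max, live_max)
    return drift


print(bounds_drift("production", 8, 12))       # {}
print(bounds_drift("non-production", 12, 20))  # {'max_replicas': (18, 20)}
```

An empty result means the console-tuned bounds still match the documented intent.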

Operations

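Day-to-day and incident procedures live in the runbooks. Those referenced on this page are the quorum-loss, image-upgrade, rolling-update, and autoscale-validation runbooks.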

Out of scope: HomeAlign runs a separate, unrelated Docker Swarm on Azure (ha-infra). This page covers only the GCP clusters in hb-infra.
