Swarm
How The Helper Bees runs its Docker Swarm clusters. We operate
two clusters, one per environment (non-production and
production), both living in hb-infra on GCP, region
us-central1. This page is the architecture overview. For
depth, follow the consolidation RFC,
runbooks, and service docs.
Each cluster is the same shape: a small fixed set of manager nodes that hold the cluster together and run the control and observability plane, plus an autoscaling pool of worker nodes that run the application workloads. A custom controller, swarmctl, sits on the managers and brokers the two things native Swarm will not do on its own: rebalancing existing tasks onto new workers, and draining workers before they leave.
The naming convention is swarm-{env_code}-*, where the
env code is the first character of the environment:
swarm-n-* is non-production, swarm-p-* is
production.
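As a minimal sketch of the convention above, the prefix can be derived from the environment name. The function names here are hypothetical and do not come from any tooling named on this page; the swarm-{n|p}-workers suffix is the worker MIG name used later in this document.

```python
def env_code(environment: str) -> str:
    """Return the first character of the environment name, lowercased."""
    name = environment.strip().lower()
    if not name:
        raise ValueError("environment name is empty")
    return name[0]


def cluster_prefix(environment: str) -> str:
    """Build the swarm-{env_code} prefix shared by a cluster's resources."""
    return f"swarm-{env_code(environment)}"


def worker_mig_name(environment: str) -> str:
    """Name of the worker Managed Instance Group for an environment."""
    return f"{cluster_prefix(environment)}-workers"


print(cluster_prefix("non-production"))  # swarm-n
print(worker_mig_name("production"))     # swarm-p-workers
```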
At a Glance
| | Managers | Workers |
|---|---|---|
| Form | 3 fixed Terraform VMs (e2-medium) | Regional Managed Instance Group (n2-highmem-4) |
| Zones | us-central1-a / b / c (one each) | us-central1-a / b / c / f |
| Role | Raft quorum, control + observability plane | Application workloads |
| Scaling | Static (3, always) | Autoscaling on memory (non-prod 12 to 18, prod 8 to 12) |
| External IP | Reserved static (managed in the caller) | None (egress via Cloud NAT) |
| SSH | OS Login + IAP only | OS Login + IAP only |
Architecture
Color marks what each component is:
- Blue: manager nodes
- Green: worker nodes
- Amber: the swarmctl control service (runs on the managers)
- Purple: the observability stack, Prometheus and Grafana (also on the managers)
- Teal: GCP-managed services (Secret Manager, Cloud NAT, Managed Prometheus, and the MIG autoscaler and health check)
- Grey: external systems
%%{init: {'theme':'dark','themeVariables':{'fontSize':'16px','lineColor':'#9ca3af'},'flowchart':{'curve':'basis','nodeSpacing':45,'rankSpacing':60}}}%%
flowchart TB
GHA["GitHub Actions<br/>deploy stacks via IAP"]:::ext
subgraph MGRS["Manager plane · Raft quorum"]
direction LR
M1["swarm-mgr-1<br/>zone a · initializer"]:::mgr
M2["swarm-mgr-2<br/>zone b"]:::mgr
M3["swarm-mgr-3<br/>zone c"]:::mgr
M1 ~~~ M2 ~~~ M3
end
SC["swarmctl<br/>leader-gated · :9876<br/>/drain · /rebalance"]:::ctl
PROM["Prometheus<br/>in-swarm · 1h tmpfs"]:::obs
subgraph WRK["Worker plane"]
direction LR
W1["worker a"]:::wkr
W2["worker b"]:::wkr
W3["worker c"]:::wkr
W4["worker f"]:::wkr
W1 ~~~ W2 ~~~ W3 ~~~ W4
end
AS["MIG autoscaler<br/>scale on memory · 70%"]:::gcp
HC["MIG health check :9323<br/>autoheal REPAIR"]:::gcp
SM["Secret Manager<br/>join + drain tokens"]:::gcp
NAT["Cloud NAT<br/>worker egress"]:::gcp
GMP["Google Managed<br/>Prometheus"]:::gcp
GRAF["Grafana<br/>Cloudflare Access SSO"]:::obs
PD["PagerDuty"]:::ext
DMS["Dead Man's Snitch<br/>leader-gated"]:::ext
GHA -->|"stack deploy"| MGRS
MGRS -.->|"runs"| SC
MGRS -.->|"runs"| PROM
MGRS -->|"writes tokens"| SM
WRK -->|"read token, probe :2377"| SM
WRK -->|"egress"| NAT
WRK -->|"ExecStop POST /drain"| SC
SC -->|"rebalance / drain"| WRK
AS -.->|"scales"| WRK
HC -.->|"autoheals"| WRK
PROM -->|"export"| GMP
GMP --> GRAF
GMP -->|"PromQL alerts"| PD
SC -->|"liveness ping"| DMS
classDef mgr fill:#bfdbfe,stroke:#1d4ed8,stroke-width:1px,color:#0b1e3b
classDef wkr fill:#bbf7d0,stroke:#15803d,stroke-width:1px,color:#06281a
classDef ctl fill:#fde68a,stroke:#b45309,stroke-width:1px,color:#3a2400
classDef gcp fill:#99f6e4,stroke:#0f766e,stroke-width:1px,color:#04241f
classDef obs fill:#ddd6fe,stroke:#6d28d9,stroke-width:1px,color:#241152
classDef ext fill:#e5e7eb,stroke:#4b5563,stroke-width:1px,color:#111827
Manager Plane
The managers are the cluster’s brain: a small fixed set of VMs that hold Raft consensus and run the control and observability plane. They are three discrete VMs, not a Managed Instance Group, one per zone.
- Nodes. swarm-mgr-1/2/3, defined in Terraform with for_each over a manager_nodes map (zones a, b, c), so the VM set and their reserved IPs cannot drift apart. Each is an e2-medium with a 100 GB pd-balanced disk, off a custom ubuntu-2204-docker-swarm-falcon image (Ubuntu 22.04 with Docker and CrowdStrike Falcon). Shielded VM is on.
- Quorum. The Raft quorum tolerates one failure: lose a manager and it recovers automatically (Terraform recreates it, the startup script rejoins). Losing two of three has no automatic recovery, a deliberate trade-off over an external etcd or Consul. Recover via the quorum-loss runbook.
- Bootstrap and join. The initializer (swarm-mgr-1) runs docker swarm init and writes the worker and manager join tokens to GCP Secret Manager. Everyone else reads its token from there (never from metadata) and joins, finding managers through their static internal IPs on TCP :2377.
- Lifecycle. Each boot disk is a standalone resource (auto_delete = false) with hourly snapshots (10-day retention), bounding Raft loss to about an hour. Image upgrades replace the disk one manager at a time, never two (that would break quorum). See the image-upgrade runbook.
- swarm-reaper. A leader-gated systemd timer (every 60s, leader only) removes workers stuck Down past the 300s grace period, clearing the ghost entries left by autoscaler scale-in. It exports a swarm_workers_down metric that feeds the swarm_ghost_workers alert.
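Two of the rules in the list above reduce to small pieces of logic. The sketch below shows the Raft majority arithmetic behind "tolerates one failure" and a simplified version of the reaper's selection rule. The node record shape (name, role, state, down_since) is hypothetical; the real swarm-reaper's inputs are not described on this page.

```python
from datetime import datetime, timedelta, timezone


def raft_fault_tolerance(managers: int) -> int:
    """Managers that can fail while a strict majority (quorum) remains."""
    quorum = managers // 2 + 1
    return managers - quorum


GRACE = timedelta(seconds=300)  # grace period stated in the text


def reapable(nodes, now, is_leader):
    """Names of workers that have been Down for longer than GRACE.

    `nodes` is an assumed list of dicts with the keys
    "name", "role", "state", and "down_since" (a datetime).
    """
    if not is_leader:  # leader-gated: followers do nothing
        return []
    return [
        n["name"]
        for n in nodes
        if n["role"] == "worker"
        and n["state"] == "down"
        and now - n["down_since"] > GRACE
    ]


print(raft_fault_tolerance(3))  # 1
```

With three managers the quorum is two, so one loss is survivable and two are not, which matches the recovery behavior described above.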
Worker Plane
The workers run the application workloads on an autoscaling
regional Managed Instance Group
(swarm-{n|p}-workers).
- Fleet. n2-highmem-4 VMs across four zones (a, b, c, f). The instance template is content-addressed, so any change to machine type, image, disk, tags, manager IPs, or startup script renames it. Workers have no external IP and egress through Cloud NAT.
- Autoscaling on memory, not CPU. Memory is the binding constraint, so the autoscaler keys on the Ops Agent metric agent.googleapis.com/memory/percent_used: 70% target, 180s cooldown, scale-in at most one replica per 15-minute window. min_replicas and max_replicas are bootstrap defaults only (ignore_changes); live bounds are tuned in the console (intent: non-prod 12 to 18, prod 8 to 12, the non-prod floor higher because it absorbed migration churn).
- Health and autohealing. A TCP health check on :9323 (the dockerd metrics endpoint) is the liveness signal: if dockerd answers, the node can take tasks. About three minutes of failures triggers a MIG REPAIR (600s initial delay, so a still-booting node is not recreated).
- Joining and draining. On boot a worker fetches its join token from Secret Manager and probes each manager on :2377 (up to five minutes, to ride out firewall propagation). The MIG update policy is OPPORTUNISTIC: template changes stage but do not auto-roll, so workers are rolled one at a time (see the rolling-update runbook). On scale-in, a systemd ExecStop unit calls swarmctl’s /drain so tasks migrate first, within an 80s shutdown ceiling, then falls through to docker swarm leave --force.
- Validation. Exercise the full autoscale lifecycle (MIG provision, swarmctl rebalance, worker drain) with the autoscale-validation runbook.
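To make the autoscaling numbers above concrete, here is a simplified proportional model of one autoscaler decision. It is an illustration under stated assumptions, not the GCP autoscaler's actual algorithm (which also applies cooldown and stabilization logic); the function name and parameters are hypothetical.

```python
import math


def next_size(current, mem_used_pct, lo, hi, target_pct=70.0, max_scale_in=1):
    """Pool size after one decision window in a simplified target-tracking model.

    current       -- workers running now
    mem_used_pct  -- average memory percent_used across the pool
    lo, hi        -- autoscaler bounds (for example 12 and 18 for non-prod intent)
    max_scale_in  -- most replicas that may be removed in one window
    """
    # Size at which the same total memory would sit at the target utilization.
    wanted = math.ceil(current * mem_used_pct / target_pct)
    # Clamp to the configured bounds.
    wanted = max(lo, min(hi, wanted))
    # Scale-in control: shrink by at most max_scale_in per window.
    if wanted < current:
        wanted = max(wanted, current - max_scale_in)
    return wanted


print(next_size(12, 82.0, lo=12, hi=18))  # 15: scale out to bring usage near 70%
print(next_size(10, 40.0, lo=8, hi=12))   # 9: scale-in is limited to one replica
```

Scale-out is unconstrained up to the ceiling, while scale-in steps down slowly, which gives draining tasks time to settle between removals.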
Control Plane: swarmctl
Swarm’s scheduler places new tasks across nodes but never moves existing replicas, and on shutdown it reactively kills and reschedules tasks. swarmctl is a small Go controller that fills both gaps:
- Rebalance: when the autoscaler adds a worker, swarmctl force-updates services so Swarm spreads tasks onto it (otherwise a fresh worker sits idle).
- Drain: on worker shutdown, swarmctl sets the node to drain and waits for tasks to migrate before the VM leaves.
It runs as its own stack in mode: global pinned to
node.role == manager (one task per manager) and is
leader-gated: only the instance on the current Raft
leader does work, followers return 503. It listens on
host-mode port :9876 (not the ingress mesh) so
/drain can resolve the calling worker by its real VPC
source IP. The API uses two separate bearer tokens (workers can only
drain, managers can only rebalance) and fails closed. swarmctl also
pings a leader-gated Dead Man’s Snitch, so a wedged or
dead leader is detected even when followers look healthy.
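The gating rules in this paragraph can be sketched as a single decision function. swarmctl itself is written in Go and its real handler is not shown on this page, so the Python below is only an illustration: the 503 for followers comes from the text, while the other status codes, the order of the checks, and the token dictionary shape are assumptions.

```python
import hmac

# Which token class may call which endpoint: workers drain, managers rebalance.
ENDPOINT_ROLE = {"/drain": "worker", "/rebalance": "manager"}


def authorize(path, bearer, is_leader, tokens):
    """Return the HTTP status a request would get before any work is done.

    tokens -- assumed dict such as {"worker": "...", "manager": "..."}
    """
    if not is_leader:
        return 503  # followers never act (stated in the text)
    role = ENDPOINT_ROLE.get(path)
    if role is None:
        return 404  # unknown endpoint (assumed status)
    expected = tokens.get(role, "")
    if not expected or not bearer:
        return 401  # fail closed: a missing or unset token is never accepted
    if not hmac.compare_digest(bearer.encode(), expected.encode()):
        return 403  # wrong token class or wrong value (assumed status)
    return 200


tokens = {"worker": "w-secret", "manager": "m-secret"}
print(authorize("/drain", "w-secret", True, tokens))      # 200
print(authorize("/rebalance", "w-secret", True, tokens))  # 403
print(authorize("/drain", "w-secret", False, tokens))     # 503
```

Comparing with hmac.compare_digest avoids leaking token contents through timing, and treating an unset token as a denial is one way to read "fails closed".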
Observability
Observability runs inside the swarm and exports out to GCP.
Prometheus (a GMP-compatible build) scrapes swarm targets, keeps only a
one-hour tmpfs buffer (no data disk on managers), and exports samples to
Google Managed Prometheus as the source of truth,
labeled per cluster. Grafana reads GMP through a frontend shim and is
fronted by Cloudflare Access SSO with no published
ports. The per-node agents (cadvisor, node-exporter, dockerd-exporter,
logspout) run mode: global; Prometheus, Grafana, and the
tunnel are pinned to managers.
The swarm_observability Terraform module owns the
PromQL alert policies against GMP, routed to
PagerDuty:
- Node and host: down, memory, OOM
- Raft and orchestration: quorum at risk, leader missing, leader churn, Raft latency, failed tasks
- Container: restart loops, OOM
- Pipeline health: GMP export absent or stuck, Prometheus down, global-service coverage
Manager-VM host alerts (boot disk, dockerd and Falcon process
absence, the TCP :2377 uptime check) live in the
swarm_manager module instead, a clean ownership split.
Networking
The swarm lives in a shared-VPC spoke, subnet
sb-{n|p}-shared-base-us-central1. Managers have static
internal IPs plus reserved external IPs (premium tier, protected from
destruction). Workers are internal-only and egress through a regional
Cloud NAT.
The swarm-port firewall rules are tag-scoped to
swarm-node, but they are defined in the separate
thehelperbees/gcp-networks repo, not in infrahive
(infrahive only attaches the tag). The ports:
- TCP 2377: cluster management
- TCP and UDP 7946: node-to-node gossip
- UDP 4789: overlay network (VXLAN)
- TCP 9323: dockerd metrics
The NAT egress addresses and the exact firewall locations are documented in the outbound-IP audit spec.
There is no internal load balancer or DNS for reaching managers:
nodes join via the managers’ static internal IPs passed through
metadata. SSH is OS Login plus IAP only
(enable-oslogin=TRUE,
block-project-ssh-keys=TRUE), and every node sets
vmDnsSetting=ZonalOnly to avoid the global-DNS single point
of failure.
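Because there is no load balancer or DNS name in front of the managers, a joining node has to try each manager address itself. The sketch below shows one way such a probe loop could look, using the five-minute ceiling mentioned in the worker section. The real startup script is not shown on this page; the function name, retry interval, timeout, and example addresses are all hypothetical.

```python
import socket
import time


def first_reachable(hosts, port=2377, deadline_s=300.0, interval_s=5.0,
                    timeout_s=3.0, connect=socket.create_connection,
                    sleep=time.sleep):
    """Return the first host that accepts a TCP connection on `port`.

    Retries every `interval_s` seconds and returns None once another
    attempt would run past `deadline_s`. `connect` and `sleep` are
    injectable so the loop can be exercised without a network.
    """
    give_up = time.monotonic() + deadline_s
    while True:
        for host in hosts:
            try:
                with connect((host, port), timeout=timeout_s):
                    return host
            except OSError:
                continue  # refused, filtered, or timed out: try the next one
        if time.monotonic() + interval_s > give_up:
            return None
        sleep(interval_s)


# Usage (addresses are placeholders, not the real manager IPs):
# manager = first_reachable(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
```

Retrying across all managers rather than pinning to one means a worker can still join while a single manager is being replaced.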
How Workloads Land
Services are deployed with docker stack deploy of
per-stack compose.yml files, using node.role
constraints to place them. There are three categories:
- Platform stacks (src/services/: swarmctl, observability) are pinned to managers and deployed by GitHub Actions, which copy the compose dir to a manager over IAP SSH and run docker stack deploy. (The docker-gcr-proxy GCR auth shim is not a stack: it runs as a per-node systemd service bound to 127.0.0.1:7676.)
- Edge networks (for example hivebook, at src/services/hivebook/) front a Cloud Run origin with a Cloudflare Access perimeter; the tunnel is pinned to managers and the identity-sensitive token-proxy to workers.
- Application stacks (consumer_portal, hbcrm, and so on) are defined and deployed from outside this repo (hb-ansible / AWX). They spread across workers with node.role == worker and max_replicas_per_node: 1 for anti-affinity, which is the high-availability invariant: replicas land on distinct nodes.
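The anti-affinity invariant in the last bullet has a practical consequence: a service cannot run more replicas than there are eligible workers. The toy placement model below illustrates that. It is not Swarm's scheduler, and the worker names are placeholders.

```python
from collections import Counter


def place(replicas, workers, max_per_node=1):
    """Greedy spread placement under a per-node replica cap.

    Returns (placed, pending): the node chosen for each placed replica,
    and the count of replicas that found no eligible node.
    """
    load = Counter({w: 0 for w in workers})
    placed, pending = [], 0
    for _ in range(replicas):
        candidates = [w for w in workers if load[w] < max_per_node]
        if not candidates:
            pending += 1  # no node is under the cap; the replica waits
            continue
        target = min(candidates, key=lambda w: load[w])  # least loaded first
        load[target] += 1
        placed.append(target)
    return placed, pending


print(place(3, ["w-a", "w-b", "w-c", "w-f"]))  # (['w-a', 'w-b', 'w-c'], 0)
print(place(3, ["w-a", "w-b"]))                # (['w-a', 'w-b'], 1)
```

In the second call one replica stays pending, which is why the worker floor has to stay at or above the largest replica count of any service that uses the cap.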
Non-Production vs Production
The manager and worker topology is identical across environments. The differences are sizing, alert routing, and thresholds.
| Knob | Non-production | Production |
|---|---|---|
| Manager topology | 3x e2-medium, zones a/b/c | identical |
| Worker machine / pool | n2-highmem-4, regional MIG a/b/c/f | identical |
| Worker min to max (intent) | 12 to 18 | 8 to 12 |
| Manager deletion protection | off | on |
| TCP :2377 uptime check | disabled (cost) | enabled (60s) |
| Alert routing | PagerDuty Staging, #triage-staging | shared PagerDuty, #triage |
| Manager node-metric alerts | enabled | off until GMP descriptors register |
| IAP SSH group | ssh-n-env@ | ssh-p-env@ |
| Observability budget | module default | 500 USD |
| Observability thresholds | loose | tightened |
Live autoscaler bounds may diverge from the code intent above, since
they are tuned in the console (ignore_changes).
Operations
Day-to-day and incident procedures:
- Rolling Update of Swarm Worker Nodes
- Validate Swarm Worker Autoscale
- Rolling Boot Image Upgrade for Swarm Managers
- Recover Swarm Manager Quorum Loss
- Docker Troubleshooting
Out of scope: HomeAlign runs a separate, unrelated Docker Swarm on Azure (ha-infra). This page covers only the GCP clusters in hb-infra.