
Swarm

How The Helper Bees runs its Docker Swarm clusters. We operate two clusters, one per environment (non-production and production), both living in hb-infra on GCP, region us-central1. This page is the architecture overview. For depth, follow the consolidation RFC, runbooks, and service docs.

Each cluster is the same shape: a small fixed set of manager nodes that hold the cluster together and run the control and observability plane, plus an autoscaling pool of worker nodes that run the application workloads. A custom controller, swarmctl, sits on the managers and brokers the two things native Swarm will not do on its own: rebalancing existing tasks onto new workers, and draining workers before they leave.

The naming convention is swarm-{env_code}-*, where the env code is the first character of the environment: swarm-n-* is non-production, swarm-p-* is production.
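As a small illustration of the convention, the prefix can be derived from the environment name. This is only a sketch: the function name `swarm_prefix` is hypothetical and not taken from the Terraform code.

```python
def swarm_prefix(environment: str) -> str:
    """Return the resource-name prefix for an environment.

    The env code is the first character of the environment name,
    so "non-production" maps to "swarm-n-" and "production" to "swarm-p-".
    """
    env = environment.strip().lower()
    if not env:
        raise ValueError("environment must be non-empty")
    return f"swarm-{env[0]}-"


print(swarm_prefix("non-production") + "workers")  # swarm-n-workers
print(swarm_prefix("production") + "workers")      # swarm-p-workers
```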

At a Glance

|  | Managers | Workers |
| --- | --- | --- |
| Form | 3 fixed Terraform VMs (`e2-medium`) | Regional Managed Instance Group (`n2-highmem-4`) |
| Zones | `us-central1-a` / `b` / `c` (one each) | `us-central1-a` / `b` / `c` / `f` |
| Role | Raft quorum, control + observability plane | Application workloads |
| Scaling | Static (3, always) | Autoscaling on memory (non-prod 12 to 18, prod 8 to 12) |
| External IP | Reserved static (managed in the caller) | None (egress via Cloud NAT) |
| SSH | OS Login + IAP only | OS Login + IAP only |

Architecture

Color marks what each component is:

Blue: manager nodes
Green: worker nodes
Amber: the swarmctl control service (runs on the managers)
Purple: the observability stack, Prometheus and Grafana (also on the managers)
Teal: GCP-managed services (Secret Manager, Cloud NAT, Managed Prometheus, and the MIG autoscaler and health check)
Grey: external systems

%%{init: {'theme':'dark','themeVariables':{'fontSize':'16px','lineColor':'#9ca3af'},'flowchart':{'curve':'basis','nodeSpacing':45,'rankSpacing':60}}}%%
flowchart TB
  GHA["GitHub Actions
deploy stacks via IAP"]:::ext

  subgraph MGRS["Manager plane  ·  Raft quorum"]
    direction LR
    M1["swarm-mgr-1
zone a  ·  initializer"]:::mgr
    M2["swarm-mgr-2
zone b"]:::mgr
    M3["swarm-mgr-3
zone c"]:::mgr
    M1 ~~~ M2 ~~~ M3
  end

  SC["swarmctl
leader-gated  ·  :9876
/drain  ·  /rebalance"]:::ctl
  PROM["Prometheus
in-swarm  ·  1h tmpfs"]:::obs

  subgraph WRK["Worker plane"]
    direction LR
    W1["worker a"]:::wkr
    W2["worker b"]:::wkr
    W3["worker c"]:::wkr
    W4["worker f"]:::wkr
    W1 ~~~ W2 ~~~ W3 ~~~ W4
  end

  AS["MIG autoscaler
scale on memory  ·  70%"]:::gcp
  HC["MIG health check :9323
autoheal REPAIR"]:::gcp
  SM["Secret Manager
join + drain tokens"]:::gcp
  NAT["Cloud NAT
worker egress"]:::gcp
  GMP["Google Managed
Prometheus"]:::gcp
  GRAF["Grafana
Cloudflare Access SSO"]:::obs
  PD["PagerDuty"]:::ext
  DMS["Dead Man's Snitch
leader-gated"]:::ext

  GHA -->|"stack deploy"| MGRS
  MGRS -.->|"runs"| SC
  MGRS -.->|"runs"| PROM
  MGRS -->|"writes tokens"| SM
  WRK -->|"read token, probe :2377"| SM
  WRK -->|"egress"| NAT
  WRK -->|"ExecStop POST /drain"| SC
  SC -->|"rebalance / drain"| WRK
  AS -.->|"scales"| WRK
  HC -.->|"autoheals"| WRK
  PROM -->|"export"| GMP
  GMP --> GRAF
  GMP -->|"PromQL alerts"| PD
  SC -->|"liveness ping"| DMS

  classDef mgr fill:#bfdbfe,stroke:#1d4ed8,stroke-width:1px,color:#0b1e3b
  classDef wkr fill:#bbf7d0,stroke:#15803d,stroke-width:1px,color:#06281a
  classDef ctl fill:#fde68a,stroke:#b45309,stroke-width:1px,color:#3a2400
  classDef gcp fill:#99f6e4,stroke:#0f766e,stroke-width:1px,color:#04241f
  classDef obs fill:#ddd6fe,stroke:#6d28d9,stroke-width:1px,color:#241152
  classDef ext fill:#e5e7eb,stroke:#4b5563,stroke-width:1px,color:#111827

Manager Plane

The managers are the cluster’s brain: a small fixed set of VMs that hold Raft consensus and run the control and observability plane. They are three discrete VMs, not a Managed Instance Group, one per zone.
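Three is the smallest manager count that survives a failure, because Raft needs a strict majority of managers to agree. The arithmetic is standard Raft; the helper names below are illustrative only.

```python
def raft_majority(managers: int) -> int:
    """Smallest number of managers that forms a Raft quorum."""
    return managers // 2 + 1


def failures_tolerated(managers: int) -> int:
    """How many managers can be lost while a quorum still exists."""
    return (managers - 1) // 2


for n in (1, 3, 5):
    print(n, raft_majority(n), failures_tolerated(n))
# 1 1 0
# 3 2 1
# 5 3 2
```

With three managers the quorum is two, so one manager can be lost; losing a second leaves a single manager that cannot form a majority on its own.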

Nodes. swarm-mgr-1/2/3, defined in Terraform with for_each over a manager_nodes map (zones a, b, c), so the VM set and their reserved IPs cannot drift apart. Each is an e2-medium with a 100 GB pd-balanced disk, off a custom ubuntu-2204-docker-swarm-falcon image (Ubuntu 22.04 with Docker and CrowdStrike Falcon). Shielded VM is on.
Quorum. With three managers, Raft needs a majority of two, so the cluster tolerates one failure: a lost manager recovers automatically (Terraform recreates it, the startup script rejoins). Losing two of three has no automatic recovery; this is a deliberate trade-off, accepted in preference to running an external etcd or Consul. Recover via the quorum-loss runbook.
Bootstrap and join. The initializer (swarm-mgr-1) runs docker swarm init and writes the worker and manager join tokens to GCP Secret Manager. Everyone else reads its token from there (never from metadata) and joins, finding managers through their static internal IPs on TCP :2377.
Lifecycle. Each boot disk is a standalone resource (auto_delete = false) with hourly snapshots (10-day retention), bounding Raft loss to about an hour. Image upgrades replace the disk one manager at a time, never two (that would break quorum). See the image-upgrade runbook.
swarm-reaper. A leader-gated systemd timer (every 60s) removes workers stuck Down past the 300s grace period, clearing the ghost entries left by autoscaler scale-in. It exports a swarm_workers_down metric that feeds the swarm_ghost_workers alert.
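The swarm-reaper selection rule can be sketched as a pure function. This is not the actual reaper: the `Node` shape and the `down_since` timestamp are assumptions made for illustration (Docker does not report how long a node has been Down, so a real reaper would have to track that itself). Only the 300s grace period and the leader gate come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

GRACE_SECONDS = 300  # grace period stated in the text


@dataclass(frozen=True)
class Node:
    # Hypothetical record; field names are not from the real reaper.
    hostname: str
    role: str                    # "manager" or "worker"
    state: str                   # "ready" or "down"
    down_since: Optional[float]  # epoch seconds when first seen down


def reapable(nodes: List[Node], now: float, is_leader: bool) -> List[Node]:
    """Return workers that have been down longer than the grace period.

    Non-leaders never reap, and managers are never candidates.
    """
    if not is_leader:
        return []
    return [
        n for n in nodes
        if n.role == "worker"
        and n.state == "down"
        and n.down_since is not None
        and now - n.down_since > GRACE_SECONDS
    ]
```

Each returned node would then be removed from the swarm (for example with `docker node rm`), and the count of down workers is the kind of value a swarm_workers_down gauge could report.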

Worker Plane

The workers run the application workloads on an autoscaling regional Managed Instance Group (swarm-{n|p}-workers).

Fleet. n2-highmem-4 VMs across four zones (a, b, c, f). The instance template is content-addressed, so any change to machine type, image, disk, tags, manager IPs, or startup script renames it. Workers have no external IP and egress through Cloud NAT.
Autoscaling on memory, not CPU. Memory is the binding constraint, so the autoscaler keys on the Ops Agent metric agent.googleapis.com/memory/percent_used: 70% target, 180s cooldown, scale-in at most one replica per 15-minute window. min_replicas and max_replicas are bootstrap defaults only (ignore_changes); live bounds are tuned in the console (intent: non-prod 12 to 18, prod 8 to 12, the non-prod floor higher because it absorbed migration churn).
Health and autohealing. A TCP health check on :9323 (the dockerd metrics endpoint) is the liveness signal: if dockerd answers, the node can take tasks. About three minutes of failures triggers a MIG REPAIR (600s initial delay, so a still-booting node is not recreated).
Joining and draining. On boot a worker fetches its join token from Secret Manager and probes each manager on :2377 (up to five minutes, to ride out firewall propagation). The MIG update policy is OPPORTUNISTIC: template changes stage but do not auto-roll, so workers are rolled one at a time (see the rolling-update runbook). On scale-in, a systemd unit’s ExecStop hook calls swarmctl’s /drain so tasks migrate first, within an 80s shutdown ceiling, then falls through to docker swarm leave --force.
Validation. Exercise the full autoscale lifecycle (MIG provision, swarmctl rebalance, worker drain) with the autoscale-validation runbook.
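To make the autoscaling numbers concrete, here is a rough approximation of how a utilization-target autoscaler sizes the group. It is a sketch under stated assumptions, not the MIG autoscaler's real algorithm, which also applies stabilization and cooldown logic. The 70% target and the one-replica scale-in limit come from the text; the function name and the simplified window handling are hypothetical.

```python
import math


def recommended_size(
    current: int,
    percent_used: float,
    *,
    min_replicas: int,
    max_replicas: int,
    target: float = 70.0,
) -> int:
    """Approximate the next group size for a memory-utilization target.

    Assumption: the group is resized so average utilization returns to
    `target`, then clamped to the bounds. Scale-in is limited to one
    replica per call, standing in for the 15-minute window.
    """
    raw = math.ceil(current * percent_used / target)
    bounded = max(min_replicas, min(max_replicas, raw))
    if bounded < current:
        bounded = max(bounded, current - 1)
    return bounded


# Non-prod intent bounds (12 to 18) from the text.
print(recommended_size(12, 84.0, min_replicas=12, max_replicas=18))  # 15
print(recommended_size(14, 40.0, min_replicas=12, max_replicas=18))  # 13
```

In the second call the utilization alone would justify dropping to the floor of 12, but the scale-in limit only lets the group shrink by one replica in that window.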

Control Plane: swarmctl

Swarm’s scheduler places new tasks across nodes but never moves existing replicas, and when a node shuts down it only reacts, killing that node’s tasks and rescheduling them afterwards. swarmctl is a small Go controller that fills both gaps:

Rebalance: when the autoscaler adds a worker, swarmctl force-updates services so Swarm spreads tasks onto it (otherwise a fresh worker sits idle).
Drain: on worker shutdown, swarmctl sets the node to drain and waits for tasks to migrate before the VM leaves.
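The drain step amounts to "mark the node as draining, then poll until its tasks are gone or a deadline passes". The sketch below shows that loop in Python for illustration; swarmctl itself is written in Go, and the callables, the timeout default, and the poll interval here are all assumptions rather than its real interface.

```python
import time
from typing import Callable


def drain_node(
    set_drain: Callable[[], None],
    running_tasks: Callable[[], int],
    timeout_s: float = 60.0,   # hypothetical default, not swarmctl's value
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Set a node to drain and wait for its tasks to migrate.

    Returns True once no tasks remain on the node, or False if the
    deadline passes first so the caller can fall through to a forced leave.
    """
    set_drain()
    deadline = clock() + timeout_s
    while True:
        if running_tasks() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In a real controller `set_drain` would correspond to `docker node update --availability drain <node>` (or the equivalent Docker API call) and `running_tasks` to counting the node's tasks whose desired state is running. Injecting the clock and sleep functions keeps the loop testable without waiting in real time.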

It runs as its own stack in mode: global pinned to node.role == manager (one task per manager) and is leader-gated: only the instance on the current Raft leader does work; followers return 503. It listens on host-mode port :9876 (not the ingress mesh) so /drain can resolve the calling worker by its real VPC source IP. The API uses two separate bearer tokens (workers can only drain, managers can only rebalance) and fails closed. swarmctl also pings a leader-gated Dead Man’s Snitch, so a wedged or dead leader is detected even when followers look healthy.
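The leader gate and the two-token split can be pictured as one decision function. This is a Python sketch of the described behaviour, not swarmctl's Go code: the 503 for followers is from the text, while the 401/403/404 codes, the order of the checks, and all names are assumptions.

```python
import hmac
from typing import Mapping, Optional

# Which token role each route accepts (from the text: workers can only
# drain, managers can only rebalance).
ROUTE_ROLE = {"/drain": "worker", "/rebalance": "manager"}


def authorize(
    path: str,
    presented: Optional[str],
    tokens: Mapping[str, str],
    is_leader: bool,
) -> int:
    """Return the HTTP status a request would receive before any work runs."""
    if not is_leader:
        return 503  # followers do no work
    role = ROUTE_ROLE.get(path)
    if role is None:
        return 404  # assumed status for unknown routes
    expected = tokens.get(role)
    if not expected or not presented:
        return 401  # fail closed: missing config or missing credential
    # Constant-time comparison avoids leaking token contents via timing.
    if not hmac.compare_digest(presented.encode(), expected.encode()):
        return 403
    return 200
```

Failing closed here means an unset or empty token in the configuration rejects every request for that route instead of allowing it through.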

Observability

Observability runs inside the swarm and exports out to GCP. Prometheus (a GMP-compatible build) scrapes swarm targets, keeps only a one-hour tmpfs buffer (no data disk on managers), and exports samples to Google Managed Prometheus as the source of truth, labeled per cluster. Grafana reads GMP through a frontend shim and is fronted by Cloudflare Access SSO with no published ports. The per-node agents (cadvisor, node-exporter, dockerd-exporter, logspout) run mode: global; Prometheus, Grafana, and the tunnel are pinned to managers.

The swarm_observability Terraform module owns the PromQL alert policies against GMP, routed to PagerDuty:

Node and host: down, memory, OOM
Raft and orchestration: quorum at risk, leader missing, leader churn, Raft latency, failed tasks
Container: restart loops, OOM
Pipeline health: GMP export absent or stuck, Prometheus down, global-service coverage

Manager-VM host alerts (boot disk, dockerd and Falcon process absence, the TCP :2377 uptime check) live in the swarm_manager module instead, which keeps ownership cleanly split between the two modules.

Networking

The swarm lives in a shared-VPC spoke, subnet sb-{n|p}-shared-base-us-central1. Managers have static internal IPs plus reserved external IPs (premium tier, protected from destruction). Workers are internal-only and egress through a regional Cloud NAT.

The swarm-port firewall rules are tag-scoped to swarm-node, but they are defined in the separate thehelperbees/gcp-networks repo, not in infrahive (infrahive only attaches the tag). The ports:

TCP 2377: cluster management
TCP and UDP 7946: node-to-node gossip
UDP 4789: overlay network (VXLAN)
TCP 9323: dockerd metrics

The NAT egress addresses and the exact firewall locations are documented in the outbound-IP audit spec.

There is no internal load balancer or DNS for reaching managers: nodes join via the managers’ static internal IPs passed through metadata. SSH is OS Login plus IAP only (enable-oslogin=TRUE, block-project-ssh-keys=TRUE), and every node sets vmDnsSetting=ZonalOnly to avoid the global-DNS single point of failure.
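Because there is no load balancer or DNS name in front of the managers, a joining node has to try the static IPs itself. The sketch below shows one way to probe TCP :2377 with an overall time budget. It is illustrative only: the real startup script is not shown on this page, and the function name, per-attempt timeout, and pause are assumptions. Only the port and the five-minute budget come from the text.

```python
import socket
import time
from typing import Iterable, Optional


def first_reachable(
    managers: Iterable[str],
    port: int = 2377,
    budget_s: float = 300.0,   # "up to five minutes" from the text
    per_try_s: float = 3.0,    # assumed per-connection timeout
    pause_s: float = 5.0,      # assumed pause between rounds
) -> Optional[str]:
    """Return the first manager IP that accepts a TCP connection, or None."""
    hosts = list(managers)
    deadline = time.monotonic() + budget_s
    while True:
        for host in hosts:
            try:
                with socket.create_connection((host, port), timeout=per_try_s):
                    return host
            except OSError:
                continue  # refused, timed out, or unroutable: try the next one
        if time.monotonic() >= deadline:
            return None
        time.sleep(pause_s)
```

A caller would pass the manager IPs read from instance metadata and use the returned address as the target of `docker swarm join`.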

How Workloads Land

Services are deployed with docker stack deploy of per-stack compose.yml files, using node.role constraints to place them. There are three categories:

Platform stacks (src/services/: swarmctl, observability) are pinned to managers and deployed by GitHub Actions, which copy the compose dir to a manager over IAP SSH and run docker stack deploy. (The docker-gcr-proxy GCR auth shim is not a stack: it runs as a per-node systemd service bound to 127.0.0.1:7676.)
Edge networks (for example hivebook, at src/services/hivebook/) front a Cloud Run origin with a Cloudflare Access perimeter; the tunnel is pinned to managers and the identity-sensitive token-proxy to workers.
Application stacks (consumer_portal, hbcrm, and so on) are defined and deployed from outside this repo (hb-ansible / AWX). They spread across workers with node.role == worker and max_replicas_per_node: 1 for anti-affinity, which is the high-availability invariant: replicas land on distinct nodes.
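The anti-affinity invariant has a simple capacity consequence: with max_replicas_per_node: 1, a service can never run more replicas than there are eligible workers, and the surplus stays pending. A minimal sketch of that arithmetic, with hypothetical names:

```python
from typing import Tuple


def placement(
    replicas: int,
    eligible_workers: int,
    max_replicas_per_node: int = 1,
) -> Tuple[int, int]:
    """Return (running, pending) replica counts under the per-node cap."""
    running = min(replicas, eligible_workers * max_replicas_per_node)
    return running, replicas - running


print(placement(3, 8))  # (3, 0)
print(placement(3, 2))  # (2, 1)
```

At the production floor of 8 workers a three-replica service fits with every replica on a distinct node; the second call shows what the same service would look like if only two workers were eligible.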

Non-Production vs Production

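The manager and worker topology is identical across environments. The differences are worker-pool sizing, manager deletion protection, the uptime check, alert routing, the IAP SSH group, and the observability budget and thresholds.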

| Knob | Non-production | Production |
| --- | --- | --- |
| Manager topology | 3x `e2-medium`, zones a/b/c | identical |
| Worker machine / pool | `n2-highmem-4`, regional MIG a/b/c/f | identical |
| Worker min to max (intent) | 12 to 18 | 8 to 12 |
| Manager deletion protection | off | on |
| TCP `:2377` uptime check | disabled (cost) | enabled (60s) |
| Alert routing | PagerDuty Staging, `#triage-staging` | shared PagerDuty, `#triage` |
| Manager node-metric alerts | enabled | off until GMP descriptors register |
| IAP SSH group | `ssh-n-env@` | `ssh-p-env@` |
| Observability budget | module default | 500 USD |
| Observability thresholds | loose | tightened |

Live autoscaler bounds may diverge from the code intent above, since they are tuned in the console (ignore_changes).

Operations

Day-to-day and incident procedures:

Out of scope: HomeAlign runs a separate, unrelated Docker Swarm on Azure (ha-infra). This page covers only the GCP clusters in hb-infra.
