swarmctl

swarmctl is the manager-side controller for our Docker Swarm clusters: a single Go binary, deployed on every manager, that does two things Swarm doesn’t do itself — rebalances services onto new workers and drains a worker’s tasks before it leaves.

Source: src/services/swarmctl/.

Assumes Swarm basics: nodes (managers vs. workers), services and the tasks (containers) they run, the Raft leader among managers, and force-update (docker service update --force), a no-op update that makes Swarm reschedule a service’s tasks. Force-update is how the rebalancer moves load onto a new node.

Why it exists

Swarm schedules new tasks across nodes but never moves existing replicas:

  • New capacity sits idle. When the worker MIG autoscales up (on memory pressure), the fresh node gets no existing load until something forces a reschedule. The rebalancer force-updates services so Swarm spreads tasks onto it.
  • Scale-down is abrupt. A removed worker’s tasks are killed and rescheduled reactively. The drain endpoint lets a worker ask a manager to migrate its tasks first.

Both need manager-level Docker API access and must act only on the Raft leader, so they live in a manager-side controller.

What it does

| Subsystem | Trigger | Action |
|---|---|---|
| Rebalancer | A worker joins while we’re leader, or POST /rebalance | Force-update services so Swarm spreads tasks onto under-loaded nodes |
| Drain | A worker calls POST /drain on shutdown | Set the node to drain, wait for tasks to migrate, report done |
| Health | Docker probe | GET /healthz: liveness + active drain count |

Architecture

main.go loads config, builds the shared deps (Docker API client, heartbeat, optional Cloud Monitoring client), wires the subsystems, and runs them with RunAll — goroutines sharing one cancelable context, no channels between them. SIGTERM, or the first unexpected error from any subsystem, cancels the context and the rest shut down.

Topology (compose.yml):

   workers (swarm-n-worker-*)          managers (one swarmctl task each, mode: global)
            │                        ┌─────────────────────────────────────────────┐
            │  :9876, host mode,     │  mgr-1  LEADER   swarmctl ✔  serves + rebalances│
            └─ direct over VPC ─────▶│  mgr-2           swarmctl ·  503 (follower)     │
                                     │  mgr-3           swarmctl ·  503 (follower)     │
                                     └─────────────────────────────────────────────┘
  • mode: global + node.role == manager — one task per manager, none on workers.
  • Leader-gated — only the instance on the Raft leader does work; followers return 503. That’s why it runs on every manager: so the leader’s instance always serves. IsLeader checks the local node; when leadership moves, the new leader’s already-running instance takes over.
  • Host-mode :9876 (not the ingress mesh) — preserves the worker’s real source IP, which /drain needs to identify the caller.
  • stop-first updates — one task per node holds the host port, so the new task can’t start until the old one stops.
  • user: 65532:<docker-gid> — how the distroless-nonroot container reaches the root:docker socket.

Rebalancer

A single-goroutine poll loop: after REBALANCE_BOOT_QUIESCENCE (300s) it ticks every REBALANCE_POLL_INTERVAL (30s). States: Idle → Stabilizing → Rebalancing → Idle.

Triggers — only two. A genuine new worker (diffed against a remembered baseline; waits REBALANCE_STABILIZATION_DELAY, 60s, then rebalances), or a manual POST /rebalance. On becoming leader (process start or failover) it seeds its baseline from the current fleet and skips that tick, so a restart or failover doesn’t rebalance the whole cluster. Node removals never trigger (Swarm reschedules on its own).
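A minimal sketch of the "diff against a remembered baseline" idea, assuming workers are identified by a stable string ID. The workerTracker type and its methods are illustrative only and are not taken from the rebalance package.

```go
package main

import "sort"

// workerTracker remembers which workers have already been seen, so that only
// newly joined workers count as a trigger. All names are hypothetical.
type workerTracker struct {
	known  map[string]struct{}
	seeded bool
}

// Seed records the current fleet as the baseline without reporting anything
// as new. In this sketch it would be called on becoming leader.
func (t *workerTracker) Seed(current []string) {
	t.known = make(map[string]struct{}, len(current))
	for _, id := range current {
		t.known[id] = struct{}{}
	}
	t.seeded = true
}

// Observe returns the workers that joined since the previous call, sorted.
// The first call on an unseeded tracker only seeds the baseline. Removed
// workers are dropped from the baseline and never reported.
func (t *workerTracker) Observe(current []string) []string {
	if !t.seeded {
		t.Seed(current)
		return nil
	}
	next := make(map[string]struct{}, len(current))
	var joined []string
	for _, id := range current {
		next[id] = struct{}{}
		if _, ok := t.known[id]; !ok {
			joined = append(joined, id)
		}
	}
	t.known = next
	sort.Strings(joined)
	return joined
}
```

Seeding on the first tick is what keeps a restart or failover from looking like "every worker just joined".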

Phases (REBALANCE_PHASE):

  • 1 — force-update all replicated services, replica-count descending, REBALANCE_SERVICE_DELAY (5s) apart.
  • 2 — adds convergence early-stop: stop once node-memory variance falls below a threshold.
  • 3 — adds hot-node targeting: update only the services on the hottest REBALANCE_MAX_HOT_NODES workers. The steady state; keeps a pass from spiking cluster-wide CPU. If the memory signal is unavailable, it skips the pass (fail-safe) rather than rebalancing everything.
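As an illustration of the Phase 1 ordering, here is a hedged sketch that filters out excluded services and sorts the rest by replica count, highest first. The service struct, parseExcluded, and passOrder are hypothetical names; breaking ties by service name is an assumption added to make the order deterministic.

```go
package main

import (
	"sort"
	"strings"
)

// service is a simplified stand-in for a replicated Swarm service.
type service struct {
	Name     string
	Replicas uint64
}

// parseExcluded turns a comma-separated list (the format described for
// REBALANCE_EXCLUDED_SERVICES) into a set, ignoring blanks and whitespace.
func parseExcluded(raw string) map[string]struct{} {
	out := make(map[string]struct{})
	for _, part := range strings.Split(raw, ",") {
		if name := strings.TrimSpace(part); name != "" {
			out[name] = struct{}{}
		}
	}
	return out
}

// passOrder returns the services to force-update in one pass: excluded
// services removed, the rest ordered by replica count descending.
func passOrder(services []service, excluded map[string]struct{}) []service {
	out := make([]service, 0, len(services))
	for _, s := range services {
		if _, skip := excluded[s.Name]; !skip {
			out = append(out, s)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Replicas != out[j].Replicas {
			return out[i].Replicas > out[j].Replicas
		}
		return out[i].Name < out[j].Name // tie-break is an assumption
	})
	return out
}
```

A caller would walk the returned slice and pause REBALANCE_SERVICE_DELAY between force-updates.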

Phases 2 and 3 read agent.googleapis.com/memory/percent_used (the metric the worker autoscaler also scales on) for the worker MIG; the manager SA already has roles/monitoring.viewer.
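The memory signal drives two decisions: whether the fleet has converged (Phase 2) and which nodes are hottest (Phase 3). The sketch below shows one way to compute both. It assumes utilisation is expressed as a fraction in [0, 1] and that the threshold is compared against population variance; the source does not spell out either detail, and the function names are hypothetical.

```go
package main

import "sort"

// memoryVariance returns the population variance of per-node memory
// utilisation. Values are assumed to be fractions in [0, 1].
func memoryVariance(used map[string]float64) float64 {
	if len(used) == 0 {
		return 0
	}
	var sum float64
	for _, v := range used {
		sum += v
	}
	mean := sum / float64(len(used))
	var sq float64
	for _, v := range used {
		d := v - mean
		sq += d * d
	}
	return sq / float64(len(used))
}

// converged reports whether variance has fallen below the threshold. With no
// data it returns false, so a missing signal never looks like convergence.
func converged(used map[string]float64, threshold float64) bool {
	return len(used) > 0 && memoryVariance(used) < threshold
}

// hottestNodes returns up to n node names ordered by utilisation, highest
// first. ok is false when there is no signal, which the caller can treat as
// "skip this pass" rather than rebalancing everything.
func hottestNodes(used map[string]float64, n int) (nodes []string, ok bool) {
	if len(used) == 0 || n <= 0 {
		return nil, false
	}
	names := make([]string, 0, len(used))
	for name := range used {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		a, b := names[i], names[j]
		if used[a] != used[b] {
			return used[a] > used[b]
		}
		return a < b
	})
	if len(names) > n {
		names = names[:n]
	}
	return names, true
}
```

The explicit ok result mirrors the fail-safe described for Phase 3: an empty reading is reported as "no decision", not as "no hot nodes".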

Drain

A worker’s shutdown hook calls POST /drain on a manager before docker swarm leave; the handler resolves the caller by its TCP source IP, sets that node to drain, waits up to DRAIN_TIMEOUT (45s as deployed; the code default is 60s) for tasks to migrate, then reports done. Concurrent drains of the same node return 409. The budget is ordered 45s (drain) < ~70s (worker curl) < 80s (systemd stop) so the worker always gets an answer.
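Two pieces of the drain handler lend themselves to a short sketch: reading the caller's IP from the connection, and rejecting a second drain of the same node. The names callerIP and drainGuard are hypothetical, and the mapping from IP to Swarm node is left out because it depends on the Docker API seam.

```go
package main

import (
	"net"
	"sync"
)

// callerIP extracts the peer IP from a "host:port" string such as
// http.Request.RemoteAddr. Host-mode publishing is what makes this the
// worker's real address rather than an ingress address.
func callerIP(remoteAddr string) (string, error) {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return "", err
	}
	return host, nil
}

// drainGuard tracks in-flight drains so that a concurrent request for the
// same node can be answered with 409.
type drainGuard struct {
	mu     sync.Mutex
	active map[string]struct{}
}

// Begin marks nodeID as draining. It returns false if a drain is already
// running for that node.
func (g *drainGuard) Begin(nodeID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, busy := g.active[nodeID]; busy {
		return false
	}
	if g.active == nil {
		g.active = make(map[string]struct{})
	}
	g.active[nodeID] = struct{}{}
	return true
}

// End clears the in-flight marker for nodeID.
func (g *drainGuard) End(nodeID string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.active, nodeID)
}

// Active returns the number of drains in flight, the kind of value that
// /healthz could report as active_drains.
func (g *drainGuard) Active() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return len(g.active)
}
```

A handler built on this would call Begin, defer End, and bound the wait with context.WithTimeout set to DRAIN_TIMEOUT so the response goes out before the worker's curl gives up.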

Dead Man’s Snitch

swarmctl checks in to deadmanssnitch.com on a periodic ticker so we get alerted if the leader’s process stops running. If the check-in URL goes quiet for longer than the configured grace period, DMS sends an alert.

What it monitors: leader liveness. The snitch is leader-gated — only the Raft leader’s swarmctl instance pings. A wedged or dead leader stops checking in even when followers are healthy, which is the desired signal. An all-manager ping would mask that scenario.

Environment variables:

| Variable | Default | Meaning |
|---|---|---|
| SWARMCTL_SNITCH_URL | "" (disabled) | Dead Man’s Snitch check-in URL (nosnch.in/<token>); empty disables check-ins |
| SWARMCTL_SNITCH_INTERVAL | 1h | How often to check in; 6h in staging, 1h in production. A zero/invalid value degrades to the 1h default rather than failing startup |

Per-env cadence: staging checks in every 6 hours (6h); production checks in hourly (1h). These values are in envs/staging.yaml and envs/production.yaml. The check-in URL is a low-sensitivity capability token, so it is stored as a plain value in those env files (checked into the repo), not in Secret Manager.

Fail-open behavior: an empty SWARMCTL_SNITCH_URL disables the subsystem entirely — swarmctl starts cleanly and logs once that the snitch is disabled. A failed check-in (non-leader, IsLeader error, HTTP error, or non-2xx response) is logged and skipped; it never crashes the daemon or cancels the context. Only sustained silence (leader down for longer than interval + grace) trips the alert.

Initial check-in: Run does one immediate check-in on startup before the ticker registers, so DMS sees health right after a deploy rather than waiting up to one full interval.

Manual setup: the two DMS snitches (swarmctl-staging and swarmctl-production) are created by hand in the deadmanssnitch.com UI — no Terraform provider for DMS exists in the repo. Set the DMS expected interval to match the in-code cadence (every 6 hours for staging, hourly for production) with grace beyond it so a single missed tick during a leader election doesn’t false-alarm. Paste each check-in URL into envs/staging.yaml and envs/production.yaml as the snitch_url value.

HTTP API

Served on :9876.

| Route | Auth | Gates | Returns |
|---|---|---|---|
| GET /healthz | none | none | 200 {"status":"ok","active_drains":N} |
| POST /drain | Bearer drain token | leader; source-IP → node | 200 drained · 404 no node · 409 already draining · 503 not leader |
| POST /rebalance | Bearer rebalance token | leader; loopback-only | 202 accepted · 409 pass running · 401 bad token · 403 not loopback · 503 not leader |

/drain and /rebalance use separate bearer tokens: workers receive only the drain token, so a worker can’t trigger a rebalance, and /rebalance is additionally loopback-only (callable only from inside swarmctl’s own network namespace on the leader, via nsenter; a host-side curl localhost is SNAT’d off loopback and gets 403, see Operating it). Auth fails closed — a registered route with no secret returns 503 rather than passing through. Tokens are compared in constant time.

Configuration

Environment variables, set in compose.yml; secrets are mounted files, not env values.

| Variable | Default | Meaning |
|---|---|---|
| LOG_LEVEL | info | debug/info/warn/error; JSON logs to stderr |
| GCP_PROJECT_ID | | Project for Cloud Monitoring (Phase ≥ 2) |
| SWARMCTL_LISTEN_ADDR | :9876 | HTTP listen address |
| SWARMCTL_DRAIN_SECRET_PATH | /run/secrets/drain_token | Mounted drain bearer token |
| SWARMCTL_REBALANCE_SECRET_PATH | /run/secrets/rebalance_token | Mounted rebalance bearer token |
| REBALANCE_ENABLED | true | Register /rebalance and run the poll loop |
| REBALANCE_PHASE | 1 | 1 all-services · 2 + convergence stop · 3 + hot-node targeting |
| REBALANCE_POLL_INTERVAL | 30s | Tick interval |
| REBALANCE_BOOT_QUIESCENCE | 300s | Delay before the first tick (let the fleet settle) |
| REBALANCE_STABILIZATION_DELAY | 60s | Wait after a join before rebalancing (resets on more joins) |
| REBALANCE_SERVICE_DELAY | 5s | Pause between service updates in a pass |
| REBALANCE_WORKER_MIG_NAME | | Worker MIG (swarm-n-workers or swarm-p-workers); required for Phase ≥ 2 |
| REBALANCE_MAX_HOT_NODES | 1 | How many hottest nodes to target (Phase 3) |
| REBALANCE_EXCLUDED_SERVICES | | Comma-separated service names to never touch |
| REBALANCE_DRY_RUN | false | Log intended updates without calling Docker |
| Phase 2 convergence tuning | | REBALANCE_CONVERGENCE_THRESHOLD (0.10), _INTERVAL (5), _CHECK_DELAY (90s); rarely changed |
| DRAIN_ENABLED | true | Register the /drain route |
| DRAIN_TIMEOUT | 60s (deployed 45s) | Max wait for task migration during a drain |

Other rarely-touched vars: HEARTBEAT_PATH (/tmp/swarmctl-heartbeat), SWARMCTL_SERVER_SHUTDOWN_TIMEOUT (60s).

Deployment

Deploy is driven entirely by two GitHub Actions workflows (both in .github/workflows/). You never run the deploy from your laptop — you trigger the workflow and it authenticates as observability-deploy@<project> via Workload Identity Federation and runs the deploy script for you.

| Workflow | File | Trigger | What it does |
|---|---|---|---|
| Build swarmctl Docker Image | build-swarmctl.yml | Push to plan touching src/services/swarmctl/**, or manual dispatch | Builds and pushes gcr.io/the-helper-bees/swarmctl:<short-sha> (and :latest) |
| Deploy swarmctl Service | deploy-swarmctl.yml | Manual dispatch only (auto-deploy on a successful build is wired but disabled) | Deploys a given image tag to staging or production |

The deploy job ultimately runs:

./zig/zig build -Dmode=ci deploy-swarmctl -- --env <staging|production> --image-tag <short-sha> --wait

-Dmode=ci is for the runner only (it adapts the build for the CI environment). If you ever run the deploy script by hand, omit it: ./zig/zig build deploy-swarmctl -- --env <env> --image-tag <short-sha> --wait.

How to deploy

1. Build the image. Push your change to plan (the build fires automatically when src/services/swarmctl/** changes), or trigger Build swarmctl Docker Image manually:

  • UI: Actions → Build swarmctl Docker Image → Run workflow → pick the branch → Run workflow.
  • CLI: gh workflow run build-swarmctl.yml --ref <branch>

Grab the short SHA from the run — that’s your image tag, and the image must exist in GCR before you can deploy it.

2. Deploy to staging. Trigger Deploy swarmctl Service:

  • UI: Actions → Deploy swarmctl Service → Run workflow, then set the fields:
    • Use workflow from: your branch (e.g. plan)
    • Target environment: staging
    • Image tag to deploy (short SHA): paste the short SHA from step 1, or leave blank to use the selected branch’s current commit
    • Leave Type “deploy” to confirm… and Emergency bypass empty for staging
  • CLI:
    gh workflow run deploy-swarmctl.yml -f environment=staging -f image_tag=<short-sha>

The job blocks on --wait until the stack converges. Then sanity-check a manager’s /healthz and the rebalancer logs before promoting.

3. Promote to production. Production is gated twice: you must type deploy in the confirm box, and there must already be a successful staging deployment at the same commit SHA. So run the production dispatch from the same ref you deployed to staging.

  • UI: Run workflow, then set the fields:
    • Use workflow from: the same branch/commit you deployed to staging
    • Target environment: production
    • Type “deploy” to confirm production deploy: deploy
    • Image tag to deploy (short SHA): the short SHA you validated on staging
    • Emergency bypass: leave unchecked (tick it only for incidents)
  • CLI:
    gh workflow run deploy-swarmctl.yml --ref <ref> \
      -f environment=production -f confirm=deploy -f image_tag=<short-sha>

If the staging-success check fails, deploy that SHA to staging first. Emergency bypass (incidents only): set skip_staging_check=true — it’s logged under your GitHub username and skips the gate.

Image tag vs. gate. image_tag defaults to the short SHA of the dispatched ref, but the production gate keys off the ref’s commit SHA, not the image tag. Always pass the exact short SHA you validated on staging so you ship the bits you tested.

What the deploy script does

src/scripts/deploy_swarmctl/main.py picks a reachable manager, projects the two bearer tokens from GCP Secret Manager into versioned Docker secrets, resolves the host docker GID, validates the compose against the Swarm schema, does a one-time docker stack rm if the service mode is changing, and waits for convergence. Per-env values live in envs/{staging,production}.yaml.

Secrets: two GCP Secret Manager secrets per env — swarm-observability-drain-token (workers; /drain) and swarm-observability-rebalance-token (manager-only; /rebalance). Terraform (infra/hb-infra/modules/swarmctl) creates the shells with placeholders; the real random values are populated and rotated out-of-band by adding a secret version. See Secrets Management.

Operating it

Trigger a rebalance manually. /rebalance is leader-gated and loopback-only, where “loopback” means swarmctl’s own network namespace, not the manager host. swarmctl runs in a container on the swarmctl overlay (not host networking), so a plain curl localhost:9876 from the manager shell arrives SNAT’d to the docker gateway IP and is rejected with 403 before auth even runs. The call has to originate inside the container’s netns, so enter it with nsenter.

The function below is fully self-contained: paste it, then call it with the target project. It discovers the running managers, asks one of them which node is the swarm leader, maps that leader back to its zone, reads the rebalance token, and fires the request from inside swarmctl’s netns. No hostnames or zones to look up by hand. Run it from Cloud Shell or any machine with gcloud and secretAccessor on the project. The two project IDs are the only things you choose.

swarm_rebalance() {
  local project="$1"
  [ -n "$project" ] || { echo "usage: swarm_rebalance <project-id>"; return 1; }

  # running swarm managers as "name zone" pairs
  local mgrs; mapfile -t mgrs < <(gcloud compute instances list --project="$project" \
    --filter="name~'^swarm-mgr-' AND status=RUNNING" \
    --format="value(name,zone)")
  [ "${#mgrs[@]}" -gt 0 ] || { echo "no running swarm-mgr-* instances in $project"; return 1; }

  # ask the first manager which node is the leader (Hostname == GCE instance name)
  local first_name first_zone; read -r first_name first_zone <<<"${mgrs[0]}"
  local leader; leader=$(gcloud compute ssh "$first_name" --zone="$first_zone" \
    --tunnel-through-iap --project="$project" --quiet \
    --command "sudo docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'" \
    | awk '$2=="Leader"{print $1}')

  # map the leader hostname back to its zone
  local leader_zone; leader_zone=$(printf '%s\n' "${mgrs[@]}" | awk -v h="$leader" '$1==h{print $2}')
  [ -n "$leader" ] && [ -n "$leader_zone" ] || { echo "could not resolve leader/zone"; return 1; }
  echo "leader: $leader ($leader_zone)"

  # fetch the rebalance token. secretAccessor reads it directly. Do NOT add
  # --impersonate-service-account: observability-deploy@ only grants
  # workloadIdentityUser to the GitHub Actions WIF principal, so impersonation 403s for humans.
  local token; token=$(gcloud secrets versions access latest \
    --secret=swarm-observability-rebalance-token --project="$project")

  # trigger ON the leader, from inside swarmctl's netns (token piped over stdin, not argv)
  echo "$token" | gcloud compute ssh "$leader" --zone="$leader_zone" \
    --tunnel-through-iap --project="$project" --quiet \
    --command '
      read -r TOKEN
      CID=$(sudo docker ps -q -f name=swarmctl | head -1)
      PID=$(sudo docker inspect -f "{{.State.Pid}}" "$CID")
      sudo nsenter -t "$PID" -n curl -sS -XPOST \
        -H "Authorization: Bearer $TOKEN" \
        http://localhost:9876/rebalance -w "\nHTTP %{http_code}\n"'
}

# non-production (staging):
swarm_rebalance prj-bu1-n-hb-infra-5381

# production:
# swarm_rebalance prj-bu1-p-hb-infra-1da6

202 = queued (runs within one poll interval); 409 = a pass is already running; 503 = you hit a follower; 403 = the call did not originate inside swarmctl’s netns (e.g. host-side curl localhost).

Health and logs. swarmctl healthcheck (the binary’s only subcommand) does a local GET /healthz and exits 0/1 — it is the Docker HEALTHCHECK probe, since the distroless image has no shell or curl. Logs are structured JSON (slog) tagged by subsystem; watch for became leader; seeded known workers, new nodes detected; stabilizing, rebalance complete ... exit_reason=, and filterByHotNodes: filtered services.

Gotchas. Phase ≥ 2 crash-loops on boot if REBALANCE_WORKER_MIG_NAME is unset. A 503 from a non-leader manager is normal — only the leader’s instance serves.

Code map

src/services/swarmctl/
├── compose.yml            # Swarm stack: global, manager-only, host-mode :9876
├── Dockerfile             # distroless nonroot
├── envs/                  # per-env deploy config
└── src/
    ├── main.go            # config, wiring, RunAll; `healthcheck` subcommand
    ├── lifecycle.go       # RunAll
    └── internal/
        ├── config/        # env-var config
        ├── swarm/         # Docker API seam, node resolver, leader check
        ├── rebalance/     # poll loop, state machine, phase strategies, /rebalance
        ├── drain/         # /drain handler + drain state
        ├── api/           # HTTP server, auth, /healthz
        ├── monitoring/    # Cloud Monitoring memory query (Phase ≥ 2)
        ├── heartbeat/     # liveness file backing /healthz
        └── snitch/        # Dead Man's Snitch check-in ticker (leader-gated)

Development

A self-contained Go module (src/services/swarmctl/go.mod). Use the project-local toolchain, never system Go.

  • Test (what CI runs): ./bin/go test -C ./src/services/swarmctl ./... -race. Tests live beside the code (internal/*/*_test.go); the rebalancer’s rapid property/model tests (rebalancer_model_test.go) are the highest-signal ones — start there to understand expected behavior.
  • Build the image: docker build -t swarmctl src/services/swarmctl.
  • Repo gate: ./zig/zig build check.

There is no one-command local run yet: swarmctl needs a live Swarm (a manager Docker socket, a real leader, mounted token secrets), so the dev loop is change code → run the test suite → validate end-to-end on staging via the deploy flow above. A turnkey local-swarm harness isn’t built out.

See also

  • Docker Swarm Consolidation RFC — why we run Swarm and how the cluster is shaped.
  • Cluster infrastructure: infra/hb-infra/modules/swarm_manager and swarm_worker — where the managers, the worker autoscaler, and the swarm-reaper backstop are defined.
  • Secrets Management — storing and rotating the bearer tokens.
  • App Service Accounts — the per-app SA identity model on swarm workers.