swarmctl
swarmctl is the manager-side controller for our Docker
Swarm clusters: a single Go binary, deployed on every manager, that does
two things Swarm doesn’t do itself — rebalances
services onto new workers and drains a worker’s tasks
before it leaves.
Source: src/services/swarmctl/.
Assumes Swarm basics: nodes (managers vs. workers), services and the tasks (containers) they run, the Raft leader among managers, and force-update —
docker service update --force, a no-op update that makes Swarm reschedule a service’s tasks. Force-update is how the rebalancer moves load onto a new node.
Why it exists
Swarm schedules new tasks across nodes but never moves existing replicas:
- New capacity sits idle. When the worker MIG autoscales up (on memory pressure), the fresh node gets no existing load until something forces a reschedule. The rebalancer force-updates services so Swarm spreads tasks onto it.
- Scale-down is abrupt. A removed worker’s tasks are killed and rescheduled reactively. The drain endpoint lets a worker ask a manager to migrate its tasks first.
Both need manager-level Docker API access and must act only on the Raft leader, so they live in a manager-side controller.
What it does
| Subsystem | Trigger | Action |
|---|---|---|
| Rebalancer | A worker joins while we’re leader, or POST /rebalance | Force-update services so Swarm spreads tasks onto under-loaded nodes |
| Drain | A worker calls POST /drain on shutdown | Set the node to drain, wait for tasks to migrate, report done |
| Health | Docker probe | GET /healthz: liveness + active drain count |
Architecture
main.go loads config, builds the shared deps (Docker API
client, heartbeat, optional Cloud Monitoring client), wires the
subsystems, and runs them with RunAll — goroutines sharing
one cancelable context, no channels between them. SIGTERM, or the first
unexpected error from any subsystem, cancels the context and the rest
shut down.
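The shutdown behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the contents of lifecycle.go: the `Subsystem` type and the `demo` helper are hypothetical names, and the real `RunAll` signature may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Subsystem is a hypothetical signature: run until ctx is canceled.
type Subsystem func(ctx context.Context) error

// RunAll starts every subsystem in its own goroutine under one cancelable
// context. The first unexpected error cancels the context so the others
// shut down; context.Canceled counts as a normal exit.
func RunAll(parent context.Context, subs ...Subsystem) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	var (
		wg    sync.WaitGroup
		once  sync.Once
		first error
	)
	for _, s := range subs {
		wg.Add(1)
		go func(run Subsystem) {
			defer wg.Done()
			if err := run(ctx); err != nil && !errors.Is(err, context.Canceled) {
				once.Do(func() { first = err })
				cancel()
			}
		}(s)
	}
	wg.Wait()
	return first
}

// demo runs two stand-in subsystems: one waits for cancellation, one fails.
func demo() error {
	waiter := func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
	failer := func(ctx context.Context) error {
		time.Sleep(10 * time.Millisecond)
		return errors.New("docker socket unavailable")
	}
	return RunAll(context.Background(), waiter, failer)
}

func main() {
	fmt.Println("RunAll returned:", demo())
}
```

In a real main, the parent context would typically come from `signal.NotifyContext(context.Background(), syscall.SIGTERM)`, so that SIGTERM takes the same cancellation path as a subsystem error.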
Topology (compose.yml):
workers (swarm-n-worker-*)      managers (one swarmctl task each, mode: global)
   │                            ┌─────────────────────────────────────────────┐
   │  :9876, host mode,         │ mgr-1 LEADER  swarmctl ✔ serves + rebalances│
   └─ direct over VPC ─────────▶│ mgr-2         swarmctl · 503 (follower)     │
                                │ mgr-3         swarmctl · 503 (follower)     │
                                └─────────────────────────────────────────────┘
- mode: global + node.role == manager gives one task per manager, none on workers.
- Leader-gated: only the instance on the Raft leader does work; followers return 503. That’s why it runs on every manager: so the leader’s instance always serves. IsLeader checks the local node; when leadership moves, the new leader’s already-running instance takes over.
- Host-mode :9876 (not the ingress mesh): preserves the worker’s real source IP, which /drain needs to identify the caller.
- stop-first updates: one task per node holds the host port, so the new task can’t start until the old one stops.
- user: 65532:<docker-gid> is how the distroless-nonroot container reaches the root:docker socket.
Rebalancer
A single-goroutine poll loop: after
REBALANCE_BOOT_QUIESCENCE (300s) it ticks every
REBALANCE_POLL_INTERVAL (30s). States:
Idle → Stabilizing → Rebalancing → Idle.
Triggers — only two. A genuine new
worker (diffed against a remembered baseline; waits
REBALANCE_STABILIZATION_DELAY, 60s, then rebalances), or a
manual POST /rebalance. On becoming leader
(process start or failover) it seeds its baseline from
the current fleet and skips that tick, so a restart or failover doesn’t
rebalance the whole cluster. Node removals never trigger (Swarm
reschedules on its own).
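The baseline diff behind the "genuine new worker" trigger can be sketched as follows. The `tracker` type and its `observe` method are hypothetical names for illustration; the sketch covers only join detection and the seeding tick, not the stabilization timer or the manual trigger.

```go
package main

import "fmt"

// tracker is a hypothetical holder for the remembered worker baseline.
type tracker struct {
	seeded bool
	known  map[string]bool
}

// observe compares the current worker IDs with the baseline and returns the
// genuinely new ones. The first call after becoming leader only seeds the
// baseline, and removals just shrink it, so neither reports a join.
func (t *tracker) observe(current []string) (joined []string) {
	next := make(map[string]bool, len(current))
	for _, id := range current {
		next[id] = true
		if t.seeded && !t.known[id] {
			joined = append(joined, id)
		}
	}
	t.known, t.seeded = next, true
	return joined
}

func main() {
	tr := &tracker{}
	fmt.Println(tr.observe([]string{"w1", "w2"}))       // seeding tick: []
	fmt.Println(tr.observe([]string{"w1", "w2", "w3"})) // [w3]
	fmt.Println(tr.observe([]string{"w1", "w3"}))       // removal only: []
}
```

On failover, the new leader would start from an unseeded tracker, which is what keeps a restart from looking like a fleet of new workers.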
Phases (REBALANCE_PHASE):
- 1: force-update all replicated services, replica-count descending, REBALANCE_SERVICE_DELAY (5s) apart.
- 2: adds convergence early-stop: stop once node-memory variance falls below a threshold.
- 3: adds hot-node targeting: update only the services on the hottest REBALANCE_MAX_HOT_NODES workers. The steady state; keeps a pass from spiking cluster-wide CPU. If the memory signal is unavailable, it skips the pass (fail-safe) rather than rebalancing everything.
Phases 2 and 3 read
agent.googleapis.com/memory/percent_used (the metric the
worker autoscaler also scales on) for the worker MIG; the manager SA
already has roles/monitoring.viewer.
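The two memory-driven decisions in Phases 2 and 3 reduce to a variance calculation and a top-N selection. The sketch below assumes per-node memory usage expressed as fractions (0.82 for 82%) and uses population variance; how swarmctl actually normalizes the metric and applies REBALANCE_CONVERGENCE_THRESHOLD is not stated here, so `variance` and `hottest` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
)

// variance returns the population variance of per-node memory usage.
func variance(used []float64) float64 {
	if len(used) == 0 {
		return 0
	}
	var sum float64
	for _, u := range used {
		sum += u
	}
	mean := sum / float64(len(used))
	var sq float64
	for _, u := range used {
		sq += (u - mean) * (u - mean)
	}
	return sq / float64(len(used))
}

// hottest returns the n node names with the highest memory usage,
// breaking ties by name so the result is deterministic.
func hottest(used map[string]float64, n int) []string {
	names := make([]string, 0, len(used))
	for name := range used {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		if used[names[i]] != used[names[j]] {
			return used[names[i]] > used[names[j]]
		}
		return names[i] < names[j]
	})
	if n < 0 {
		n = 0
	}
	if n > len(names) {
		n = len(names)
	}
	return names[:n]
}

func main() {
	used := map[string]float64{"w1": 0.82, "w2": 0.40, "w3": 0.41}
	fmt.Println("hot nodes:", hottest(used, 1)) // [w1]
	fmt.Printf("variance: %.4f\n", variance([]float64{0.82, 0.40, 0.41}))
}
```

A Phase 2 pass would re-read the metric between service updates and stop once the variance drops under the threshold; a Phase 3 pass would restrict its service list to tasks running on the nodes `hottest` returns.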
Drain
A worker’s shutdown hook calls POST /drain on a manager
before docker swarm leave; the handler resolves the caller
by its TCP source IP, sets that node to drain, waits up to
DRAIN_TIMEOUT (45s) for tasks to migrate, then reports
done. Concurrent drains of the same node return 409. The
budget is ordered
45s (drain) < ~70s (worker curl) < 80s (systemd stop)
so the worker always gets an answer.
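The 409 on concurrent drains and the active_drains count in /healthz both fall out of a small mutex-guarded set. This is a sketch under assumed names (`drains`, `begin`, `end`, `count`), not the drain package’s actual types.

```go
package main

import (
	"fmt"
	"sync"
)

// drains tracks in-flight drains by node ID (a hypothetical type).
type drains struct {
	mu     sync.Mutex
	active map[string]struct{}
}

func newDrains() *drains { return &drains{active: map[string]struct{}{}} }

// begin claims a node; false means a drain is already running for it,
// which the handler would answer with 409.
func (d *drains) begin(nodeID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, busy := d.active[nodeID]; busy {
		return false
	}
	d.active[nodeID] = struct{}{}
	return true
}

// end releases the claim once the drain finished or timed out.
func (d *drains) end(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.active, nodeID)
}

// count is the number of drains currently in flight.
func (d *drains) count() int {
	d.mu.Lock()
	defer d.mu.Unlock()
	return len(d.active)
}

func main() {
	d := newDrains()
	fmt.Println(d.begin("node-a")) // true: drain starts
	fmt.Println(d.begin("node-a")) // false: second caller gets 409
	d.end("node-a")
	fmt.Println(d.count()) // 0
}
```

A handler built on this would call `begin`, defer `end`, and bound the wait with `context.WithTimeout` set to DRAIN_TIMEOUT so the response is sent before the worker’s curl gives up.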
Dead Man’s Snitch
swarmctl checks in to deadmanssnitch.com on a periodic ticker so we get alerted if the leader’s process stops running. If the check-in URL goes quiet for longer than the configured grace period, DMS sends an alert.
What it monitors: leader liveness. The snitch is leader-gated — only the Raft leader’s swarmctl instance pings. A wedged or dead leader stops checking in even when followers are healthy, which is the desired signal. An all-manager ping would mask that scenario.
Environment variables:
| Variable | Default | Meaning |
|---|---|---|
| SWARMCTL_SNITCH_URL | "" (disabled) | Dead Man’s Snitch check-in URL (nosnch.in/<token>); empty disables check-ins |
| SWARMCTL_SNITCH_INTERVAL | 1h | How often to check in; 6h in staging, 1h in production. A zero/invalid value degrades to the 1h default rather than failing startup |
Per-env cadence: staging checks in every 6 hours
(6h); production checks in hourly (1h). These
values are in envs/staging.yaml and
envs/production.yaml. The check-in URL is a low-sensitivity
capability token, so it is stored as a plain value in those env files
(checked into the repo), not in Secret Manager.
Fail-open behavior: an empty
SWARMCTL_SNITCH_URL disables the subsystem entirely —
swarmctl starts cleanly and logs once that the snitch is disabled. A
failed check-in (non-leader, IsLeader error, HTTP error, or
non-2xx response) is logged and skipped; it never crashes the daemon or
cancels the context. Only sustained silence (leader down for longer than
interval + grace) trips the alert.
Initial check-in: Run does one immediate check-in on startup before the ticker registers, so DMS sees health right after a deploy rather than waiting up to one full interval.
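The fail-open and initial check-in behavior can be sketched as a loop that only logs failures. The names `checkIn`, `run`, and `countCheckIns` are hypothetical, and the use of an HTTP GET for the check-in is an assumption of this sketch rather than a statement about the snitch package.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// checkIn pings the snitch URL once. Every failure is returned to the
// caller, which only logs it: a missed check-in must never stop the daemon.
func checkIn(ctx context.Context, c *http.Client, url string, isLeader func() (bool, error)) error {
	leader, err := isLeader()
	if err != nil {
		return fmt.Errorf("leader check: %w", err)
	}
	if !leader {
		return errors.New("not leader")
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("snitch returned %s", resp.Status)
	}
	return nil
}

// run checks in once immediately, then on every tick until ctx is canceled.
// An empty URL disables the loop; a non-positive interval falls back to 1h.
func run(ctx context.Context, url string, every time.Duration, ping func(context.Context) error) {
	if url == "" {
		fmt.Println("snitch disabled")
		return
	}
	if every <= 0 {
		every = time.Hour
	}
	report := func() {
		if err := ping(ctx); err != nil {
			fmt.Println("snitch check-in skipped:", err)
		}
	}
	report() // immediate check-in before the ticker starts
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			report()
		}
	}
}

// countCheckIns runs the loop for windowMs with an always-failing stub ping
// and returns how many check-ins were attempted.
func countCheckIns(url string, everyMs, windowMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(windowMs)*time.Millisecond)
	defer cancel()
	n := 0
	run(ctx, url, time.Duration(everyMs)*time.Millisecond, func(context.Context) error {
		n++
		return errors.New("stub failure")
	})
	return n
}

func main() {
	fmt.Println("attempts:", countCheckIns("https://snitch.example/token", 10, 55))
}
```

In real wiring, `ping` would be a closure over `checkIn` with the shared HTTP client and the swarm package’s leader check. Because the stub always fails and the loop keeps going, the demo shows that errors never escape `run`.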
Manual setup: the two DMS snitches
(swarmctl-staging and swarmctl-production) are
created by hand in the deadmanssnitch.com UI — no Terraform provider for
DMS exists in the repo. Set the DMS expected interval to match the
in-code cadence (every 6 hours for staging, hourly for production) with
grace beyond it so a single missed tick during a leader election doesn’t
false-alarm. Paste each check-in URL into envs/staging.yaml
and envs/production.yaml as the snitch_url
value.
HTTP API
Served on :9876.
| Route | Auth | Gates | Returns |
|---|---|---|---|
| GET /healthz | none | none | 200 {"status":"ok","active_drains":N} |
| POST /drain | Bearer drain token | leader; source-IP → node | 200 drained · 404 no node · 409 already draining · 503 not leader |
| POST /rebalance | Bearer rebalance token | leader; loopback-only | 202 accepted · 409 pass running · 401 bad token · 403 not loopback · 503 not leader |
/drain and /rebalance use
separate bearer tokens: workers receive only the drain
token, so a worker can’t trigger a rebalance, and
/rebalance is additionally loopback-only (callable only
from inside swarmctl’s own network namespace on the leader, via
nsenter; a host-side curl localhost is SNAT’d
off loopback and gets 403, see Operating it). Auth fails
closed — a registered route with no secret returns
503 rather than passing through. Tokens are compared in
constant time.
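The auth rules above (fail closed, constant-time comparison, loopback check ahead of auth) can be sketched as two small middlewares. `requireBearer`, `loopbackOnly`, `accepted`, and `status` are hypothetical names; the leader gate is left out, and the api package may order or structure its checks differently.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer guards a route with a bearer token. An empty secret fails
// closed with 503 instead of letting requests through.
func requireBearer(secret string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if secret == "" {
			http.Error(w, "auth not configured", http.StatusServiceUnavailable)
			return
		}
		header := r.Header.Get("Authorization")
		got := strings.TrimPrefix(header, "Bearer ")
		if !strings.HasPrefix(header, "Bearer ") ||
			subtle.ConstantTimeCompare([]byte(got), []byte(secret)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// loopbackOnly rejects any caller whose TCP peer address is not loopback.
func loopbackOnly(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, _, err := net.SplitHostPort(r.RemoteAddr)
		if ip := net.ParseIP(host); err != nil || ip == nil || !ip.IsLoopback() {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// accepted stands in for the real /rebalance handler.
var accepted = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusAccepted)
})

// status sends one request through h and returns the response code.
func status(h http.Handler, remoteAddr, token string) int {
	req := httptest.NewRequest(http.MethodPost, "/rebalance", nil)
	req.RemoteAddr = remoteAddr
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := loopbackOnly(requireBearer("s3cret", accepted))
	fmt.Println(status(h, "127.0.0.1:5555", "s3cret"))  // 202
	fmt.Println(status(h, "127.0.0.1:5555", "wrong"))   // 401
	fmt.Println(status(h, "172.18.0.1:5555", "s3cret")) // 403
}
```

Wrapping `loopbackOnly` around `requireBearer` means a non-loopback caller is rejected with 403 before its token is ever inspected.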
Configuration
Environment variables, set in compose.yml; secrets are
mounted files, not env values.
| Variable | Default | Meaning |
|---|---|---|
| LOG_LEVEL | info | debug/info/warn/error; JSON logs to stderr |
| GCP_PROJECT_ID | (none) | Project for Cloud Monitoring (Phase ≥ 2) |
| SWARMCTL_LISTEN_ADDR | :9876 | HTTP listen address |
| SWARMCTL_DRAIN_SECRET_PATH | /run/secrets/drain_token | Mounted drain bearer token |
| SWARMCTL_REBALANCE_SECRET_PATH | /run/secrets/rebalance_token | Mounted rebalance bearer token |
| REBALANCE_ENABLED | true | Register /rebalance and run the poll loop |
| REBALANCE_PHASE | 1 | 1 all-services · 2 + convergence stop · 3 + hot-node targeting |
| REBALANCE_POLL_INTERVAL | 30s | Tick interval |
| REBALANCE_BOOT_QUIESCENCE | 300s | Delay before the first tick (let the fleet settle) |
| REBALANCE_STABILIZATION_DELAY | 60s | Wait after a join before rebalancing (resets on more joins) |
| REBALANCE_SERVICE_DELAY | 5s | Pause between service updates in a pass |
| REBALANCE_WORKER_MIG_NAME | (none) | Worker MIG (swarm-<n\|p>-workers); required for Phase ≥ 2 |
| REBALANCE_MAX_HOT_NODES | 1 | How many hottest nodes to target (Phase 3) |
| REBALANCE_EXCLUDED_SERVICES | (none) | Comma-separated service names to never touch |
| REBALANCE_DRY_RUN | false | Log intended updates without calling Docker |
| Phase 2 convergence tuning | (inline) | REBALANCE_CONVERGENCE_THRESHOLD (0.10), _INTERVAL (5), _CHECK_DELAY (90s); rarely changed |
| DRAIN_ENABLED | true | Register the /drain route |
| DRAIN_TIMEOUT | 60s (deployed 45s) | Max wait for task migration during a drain |
Other rarely-touched vars: HEARTBEAT_PATH
(/tmp/swarmctl-heartbeat),
SWARMCTL_SERVER_SHUTDOWN_TIMEOUT (60s).
Deployment
Deploy is driven entirely by two GitHub Actions workflows (both in
.github/workflows/). You never run the deploy from your
laptop — you trigger the workflow and it authenticates as
observability-deploy@<project> via Workload Identity
Federation and runs the deploy script for you.
| Workflow | File | Trigger | What it does |
|---|---|---|---|
| Build swarmctl Docker Image | build-swarmctl.yml | Push to plan touching src/services/swarmctl/**, or manual dispatch | Builds and pushes gcr.io/the-helper-bees/swarmctl:<short-sha> (and :latest) |
| Deploy swarmctl Service | deploy-swarmctl.yml | Manual dispatch only (auto-deploy on a successful build is wired but disabled) | Deploys a given image tag to staging or production |
The deploy job ultimately runs:
./zig/zig build -Dmode=ci deploy-swarmctl -- --env <staging|production> --image-tag <short-sha> --wait
-Dmode=ci is for the runner only (it adapts the build
for the CI environment). If you ever run the deploy script by hand, omit
it:
./zig/zig build deploy-swarmctl -- --env <env> --image-tag <short-sha> --wait.
How to deploy
1. Build the image. Push your change to
plan (the build fires automatically when
src/services/swarmctl/** changes), or trigger Build
swarmctl Docker Image manually:
- UI: Actions → Build swarmctl Docker Image → Run workflow → pick the branch → Run workflow.
- CLI:
gh workflow run build-swarmctl.yml --ref <branch>
Grab the short SHA from the run — that’s your image tag, and the image must exist in GCR before you can deploy it.
2. Deploy to staging. Trigger Deploy swarmctl Service:
- UI: Actions → Deploy swarmctl Service → Run workflow, then set the fields:
  - Use workflow from: your branch (e.g. plan)
  - Target environment: staging
  - Image tag to deploy (short SHA): paste the short SHA from step 1, or leave blank to use the selected branch’s current commit
  - Leave Type “deploy” to confirm… and Emergency bypass empty for staging
- CLI:
gh workflow run deploy-swarmctl.yml -f environment=staging -f image_tag=<short-sha>
The job blocks on --wait until the stack converges. Then
sanity-check a manager’s /healthz and the rebalancer logs
before promoting.
3. Promote to production. Production is gated twice:
you must type deploy in the confirm box,
and there must already be a successful staging
deployment at the same commit SHA. So run the production dispatch
from the same ref you deployed to staging.
- UI: Run workflow, then set the fields:
  - Use workflow from: the same branch/commit you deployed to staging
  - Target environment: production
  - Type “deploy” to confirm production deploy: deploy
  - Image tag to deploy (short SHA): the short SHA you validated on staging
  - Emergency bypass: leave unchecked (tick it only for incidents)
- CLI:
gh workflow run deploy-swarmctl.yml --ref <ref> \
  -f environment=production -f confirm=deploy -f image_tag=<short-sha>
If the staging-success check fails, deploy that SHA to staging first.
Emergency bypass (incidents only): set
skip_staging_check=true — it’s logged under your GitHub
username and skips the gate.
Image tag vs. gate. image_tag defaults to the short SHA of the dispatched ref, but the production gate keys off the ref’s commit SHA, not the image tag. Always pass the exact short SHA you validated on staging so you ship the bits you tested.
What the deploy script does
src/scripts/deploy_swarmctl/main.py picks a reachable
manager, projects the two bearer tokens from GCP Secret Manager into
versioned Docker secrets, resolves the host docker GID, validates the
compose against the Swarm schema, does a one-time
docker stack rm if the service mode is changing, and waits
for convergence. Per-env values live in
envs/{staging,production}.yaml.
Secrets: two GCP Secret Manager secrets per env —
swarm-observability-drain-token (workers;
/drain) and
swarm-observability-rebalance-token (manager-only;
/rebalance). Terraform
(infra/hb-infra/modules/swarmctl) creates the shells with
placeholders; the real random values are populated and rotated
out-of-band by adding a secret version. See Secrets
Management.
Operating it
Trigger a rebalance manually.
/rebalance is leader-gated and
loopback-only, where “loopback” means swarmctl’s
own network namespace, not the manager host. swarmctl runs in a
container on the swarmctl overlay (not host networking), so
a plain curl localhost:9876 from the manager shell arrives
SNAT’d to the docker gateway IP and is rejected with 403
before auth even runs. The call has to originate inside the container’s
netns, so enter it with nsenter.
The function below is fully self-contained: paste it, then call it
with the target project. It discovers the running managers, asks one of
them which node is the swarm leader, maps that leader back to its zone,
reads the rebalance token, and fires the request from inside swarmctl’s
netns. No hostnames or zones to look up by hand. Run it from Cloud Shell
or any machine with gcloud and secretAccessor
on the project. The two project IDs are the only things you choose.
swarm_rebalance() {
local project="$1"
[ -n "$project" ] || { echo "usage: swarm_rebalance <project-id>"; return 1; }
# running swarm managers as "name zone" pairs
local mgrs; mapfile -t mgrs < <(gcloud compute instances list --project="$project" \
--filter="name~'^swarm-mgr-' AND status=RUNNING" \
--format="value(name,zone)")
[ "${#mgrs[@]}" -gt 0 ] || { echo "no running swarm-mgr-* instances in $project"; return 1; }
# ask the first manager which node is the leader (Hostname == GCE instance name)
local first_name first_zone; read -r first_name first_zone <<<"${mgrs[0]}"
local leader; leader=$(gcloud compute ssh "$first_name" --zone="$first_zone" \
--tunnel-through-iap --project="$project" --quiet \
--command "sudo docker node ls --format '{{.Hostname}} {{.ManagerStatus}}'" \
| awk '$2=="Leader"{print $1}')
# map the leader hostname back to its zone
local leader_zone; leader_zone=$(printf '%s\n' "${mgrs[@]}" | awk -v h="$leader" '$1==h{print $2}')
[ -n "$leader" ] && [ -n "$leader_zone" ] || { echo "could not resolve leader/zone"; return 1; }
echo "leader: $leader ($leader_zone)"
# fetch the rebalance token. secretAccessor reads it directly. Do NOT add
# --impersonate-service-account: observability-deploy@ only grants
# workloadIdentityUser to the GitHub Actions WIF principal, so impersonation 403s for humans.
local token; token=$(gcloud secrets versions access latest \
--secret=swarm-observability-rebalance-token --project="$project")
# trigger ON the leader, from inside swarmctl's netns (token piped over stdin, not argv)
echo "$token" | gcloud compute ssh "$leader" --zone="$leader_zone" \
--tunnel-through-iap --project="$project" --quiet \
--command '
read -r TOKEN
CID=$(sudo docker ps -q -f name=swarmctl | head -1)
PID=$(sudo docker inspect -f "{{.State.Pid}}" "$CID")
sudo nsenter -t "$PID" -n curl -sS -XPOST \
-H "Authorization: Bearer $TOKEN" \
http://localhost:9876/rebalance -w "\nHTTP %{http_code}\n"'
}
# non-production (staging):
swarm_rebalance prj-bu1-n-hb-infra-5381
# production:
# swarm_rebalance prj-bu1-p-hb-infra-1da6
202 = queued (runs within one poll interval);
409 = a pass is already running; 503 = you hit
a follower; 403 = the call did not originate inside
swarmctl’s netns (e.g. host-side curl localhost).
Health and logs. swarmctl healthcheck
(the binary’s only subcommand) does a local GET /healthz
and exits 0/1 — it is the Docker HEALTHCHECK probe, since
the distroless image has no shell or curl. Logs are structured JSON
(slog) tagged by subsystem; watch for
became leader; seeded known workers,
new nodes detected; stabilizing,
rebalance complete ... exit_reason=, and
filterByHotNodes: filtered services.
Gotchas. Phase ≥ 2 crash-loops on boot if
REBALANCE_WORKER_MIG_NAME is unset. A 503 from
a non-leader manager is normal — only the leader’s instance serves.
Code map
src/services/swarmctl/
├── compose.yml # Swarm stack: global, manager-only, host-mode :9876
├── Dockerfile # distroless nonroot
├── envs/ # per-env deploy config
└── src/
├── main.go # config, wiring, RunAll; `healthcheck` subcommand
├── lifecycle.go # RunAll
└── internal/
├── config/ # env-var config
├── swarm/ # Docker API seam, node resolver, leader check
├── rebalance/ # poll loop, state machine, phase strategies, /rebalance
├── drain/ # /drain handler + drain state
├── api/ # HTTP server, auth, /healthz
├── monitoring/ # Cloud Monitoring memory query (Phase ≥ 2)
├── heartbeat/ # liveness file backing /healthz
└── snitch/ # Dead Man's Snitch check-in ticker (leader-gated)
Development
A self-contained Go module
(src/services/swarmctl/go.mod). Use the project-local
toolchain, never system Go.
- Test (what CI runs): ./bin/go test -C ./src/services/swarmctl ./... -race. Tests live beside the code (internal/*/*_test.go); the rebalancer’s rapid property/model tests (rebalancer_model_test.go) are the highest-signal ones, so start there to understand expected behavior.
- Build the image: docker build -t swarmctl src/services/swarmctl.
- Repo gate: ./zig/zig build check.
There is no one-command local run yet: swarmctl needs a live Swarm (a manager Docker socket, a real leader, mounted token secrets), so the dev loop is change code → run the test suite → validate end-to-end on staging via the deploy flow above. A turnkey local-swarm harness isn’t built out.
Related
- Docker Swarm Consolidation RFC — why we run Swarm and how the cluster is shaped.
- Cluster infrastructure: infra/hb-infra/modules/swarm_manager and swarm_worker, where the managers, the worker autoscaler, and the swarm-reaper backstop are defined.
- Secrets Management: storing and rotating the bearer tokens.
- App Service Accounts — the per-app SA identity model on swarm workers.