Swarm Worker Autoscale Validation
This runbook validates the end-to-end auto-scale lifecycle for our Docker Swarm worker fleet: GCP MIG autoscaler trigger → MIG provisions worker → swarm join → swarmctl auto-rebalance → load redistribution → scale-down → swarmctl drain.
Run this when:
- A new swarmctl phase is rolled out (e.g. Phase 1 → 2 → 3 transitions)
- The worker MIG’s autoscaler config changes (threshold, cooldown, target utilization)
- Quarterly, as a regression check, before peak-traffic seasons
- After a major Swarm or Docker version upgrade
- After a worker boot-image change that could affect provisioning time
The procedure exercises both the GCP MIG autoscaler
(provisioning side) and the swarmctl
controller (rebalance + drain). See swarmctl
service docs for the controller architecture.
Background
Two systems work together to keep the cluster sized correctly:
- GCP MIG autoscaler: scales the worker MIG up at 70% memory utilization (currently). Adds/removes VM instances; does not know about Swarm tasks.
- swarmctl: the manager-side controller. On worker join, it triggers a rebalance to redistribute existing tasks onto the new worker (Swarm itself only schedules new tasks, never moves existing ones). On worker shutdown, the worker hooks POST /drain so swarmctl migrates tasks before the VM goes away.
Historically (pre-swarmctl) operators had to manually
docker service update --force after each scale-up to spread
load. swarmctl automated that. This runbook validates both halves still
work together.
Replica convention. Three tiers, distinguished by
how they scale — the key axis for the drain test below is
whether a service has max_replicas_per_node, not “web vs
non-web”:
- 2 replicas, spread across nodes: ${SERVICE_REPLICAS:-1} template plus max_replicas_per_node: ${MAX_REPLICAS_PER_NODE:-0} (the deploy script sets SERVICE_REPLICAS=2 and MAX_REPLICAS_PER_NODE=1). Covers the web-server services (Django/Flask apps, the Vite/Next/Angular UIs, ASP.NET web_api), Caddy, and Cloudflare Tunnel. These run 2 replicas guaranteed on 2 distinct nodes; this is the HA invariant Test 5 leans on.
- 2 replicas, NOT spread: ${SERVICE_REPLICAS:-1} only, no max_replicas_per_node. This is just the Cloud SQL Proxy sidecar, bumped 1 → 2 (the cloudsql-proxy-2-replicas change) so a single node loss can’t sever the DB path. Because it has no anti-affinity constraint, swarm may place both replicas on the same worker, so cloudsql-proxy is exempt from Test 5’s “never drops to 0/2” criterion (it can legitimately co-locate). It usually spreads on its own, but that’s best-effort, not guaranteed.
- 1 replica: untemplated. Redis, Celery workers/beat, bucket workers, and a few web-tier services that are single-replica by design: Keycloak (KC_CACHE: local, per-replica cache state) and hbcrm’s ws_admin_django (in-memory WebSocket state). These behave like any 1-replica service on drain: they blip to 0/1 and get rescheduled, which is expected, not a failure. Do not pick one as the Test 5 victim.
One-shot jobs (migration, django_migration,
run_management_commands, default_user) have
restart_policy: condition: none and settle to
0/1 after running — that’s their converged state, not a
failure. The runbook’s pass criteria below assume this baseline.
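The tier a service belongs to follows mechanically from two values in its spec. The sketch below is illustrative only: `classify_tier` is a hypothetical helper (not part of swarmctl or the deploy script), and it assumes the replica count and the max-replicas-per-node value have already been read out, for example with `docker service inspect` on `.Spec.Mode.Replicated.Replicas` and `.Spec.TaskTemplate.Placement.MaxReplicas`.

```shell
# classify_tier REPLICAS MAX_PER_NODE   (hypothetical helper, for illustration)
# Prints which replica tier a service falls into, per the convention above.
# MAX_PER_NODE of 0 (or empty) means "no max_replicas_per_node constraint".
classify_tier() {
  replicas=${1:-1}
  max_per_node=${2:-0}
  if [ "$replicas" -ge 2 ] && [ "$max_per_node" -ge 1 ]; then
    echo "spread"       # eligible as a Test 5 victim
  elif [ "$replicas" -ge 2 ]; then
    echo "not-spread"   # e.g. cloudsql-proxy: exempt from the 0/2 criterion
  else
    echo "single"       # blips to 0/1 on drain by design
  fi
}
```

Only services that classify as `spread` are subject to the “never 0/2” criterion in Test 5. One-shot jobs also land in `single` here; they are told apart by their restart policy, not by replica count.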
Prerequisites
- Member of ssh-n-env@thehelperbees.com (staging only; production is not exercised by this runbook)
- Member of sg-hb-infra-development@thehelperbees.com for MIG config inspection
- gcloud, gssh, and docker CLI access to a manager
- Grafana access to dashboard 21476 (Docker Swarm Overview) and the swarm-tasks dashboard
- swarm_tasks_failed alert silenced for the test window; otherwise the pager fires on every reschedule
- All three managers Ready Active Reachable, one leader: sudo docker node ls
Pre-flight
Capture the baseline state and confirm the moving parts are wired correctly. Skip nothing — every check below maps to a failure mode you’ll be hunting later if something looks off.
1. Confirm the memory metric is flowing
Both the autoscaler and swarmctl Phase ≥ 2 read
agent.googleapis.com/memory/percent_used. No metric = no
scale events and swarmctl Phase 3 silently skips rebalances.
gcloud monitoring time-series list \
--filter='metric.type="agent.googleapis.com/memory/percent_used" AND resource.labels.instance_name=~"swarm-n-worker.*"' \
--interval-end-time=now --interval-duration=5m \
--project=<staging-project>
Expect: one series per worker, recent timestamps (within 60s), values between 0 and 1.
2. Confirm swarmctl is healthy on all managers
for mgr in swarm-mgr-1 swarm-mgr-2 swarm-mgr-3; do
echo "=== $mgr ==="
gcloud compute ssh "$mgr" --zone <zone> --tunnel-through-iap --project <staging-project> \
--command "curl -sf localhost:9876/healthz && sudo docker node inspect self -f '{{.ManagerStatus.Leader}}'"
done
Expect: all three return 200 on /healthz; exactly one returns true for Leader.
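The “exactly one leader” expectation can also be checked from a single manager. This is a minimal sketch, assuming its input is the output of `sudo docker node ls --format '{{.ManagerStatus}}'` (one line per node, empty for workers); `check_managers` is a hypothetical helper name.

```shell
# check_managers: reads ManagerStatus values on stdin, one line per node
# (workers print an empty line). Prints the counts and succeeds only if
# there is exactly one Leader and two Reachable managers.
check_managers() {
  awk '
    $0 == "Leader"    { leaders++ }
    $0 == "Reachable" { reachable++ }
    END {
      printf "leaders=%d reachable=%d\n", leaders, reachable
      exit !(leaders == 1 && reachable == 2)
    }'
}
```

Usage: `sudo docker node ls --format '{{.ManagerStatus}}' | check_managers`. A non-zero exit means the manager set is not in the three-manager, one-leader state the tests assume.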
3. Capture MIG and topology baseline
# MIG config — current min/max/target + cooldown
gcloud compute instance-groups managed describe swarm-n-workers \
--region <region> --project <staging-project> \
--format='value(autoscaler.autoscalingPolicy)'
# Current swarm topology
sudo docker node ls --format '{{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}'
# Current task distribution (skew baseline)
sudo docker node ls -q | while read n; do
echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l) tasks"
done
# Replica-convention sanity: every templated web service should be at 2/2.
# A 1/2 or 0/2 here means a stack didn't finish converging — fix before
# starting the test, otherwise drain/rebalance results will be polluted.
#
# One-shots (restart_policy condition: none) settle to 0/1 and are excluded
# so they don't show as false "NOT CONVERGED". Discovered dynamically rather
# than hard-coded — the set varies per repo (migration, django_migration,
# run_management_commands, default_user, ...).
ONESHOTS=$(sudo docker service ls -q | while read s; do
[ "$(sudo docker service inspect "$s" -f '{{.Spec.TaskTemplate.RestartPolicy.Condition}}')" = "none" ] \
&& sudo docker service inspect "$s" -f '{{.Spec.Name}}'
done)
sudo docker service ls --format '{{.Name}}\t{{.Replicas}}' \
| grep -vFf <(printf '%s\n' "$ONESHOTS") \
| awk -F'\t' '{ split($2,r,"/"); if (r[1]+0 != r[2]+0) print " NOT CONVERGED: " $0 }'
# Spot-check a representative templated service shows 2 replicas on
# different nodes (HA invariant we're about to test).
SVC=$(sudo docker service ls --format '{{.Name}}' | grep -E '_django_app$|_caddy$' | head -1)
echo "=== node placement for $SVC ==="
sudo docker service ps "$SVC" --filter desired-state=running \
--format 'table {{.Name}}\t{{.Node}}'
Save the output; you’ll compare against it after each test. The replica-convention check should report nothing (all stacks converged); the spot-check should show 2 replicas on 2 distinct nodes.
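The spot-check can be made less eyeball-dependent by listing any node that hosts more than one running replica of the service. A minimal sketch, assuming one node name per line on stdin; `colocated_nodes` is a hypothetical helper name.

```shell
# colocated_nodes: reads one node name per line on stdin (one line per
# running replica) and prints every node that appears more than once.
# Empty output means each replica sits on a distinct node.
colocated_nodes() {
  sort | uniq -d
}
```

Usage: `sudo docker service ps "$SVC" --filter desired-state=running --format '{{.Node}}' | colocated_nodes`. For a spread-tier service any output here means the HA invariant is already broken before the test starts.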
4. Open observation windows
In parallel terminals:
- On the leader: sudo docker service logs -f swarmctl_swarmctl 2>&1 | grep -v healthcheck
- On the leader: journalctl -u docker -f
- Browser: Grafana dashboard 21476 and the swarm-tasks dashboard
- Browser: GCP Console → Compute → Instance Groups → swarm-n-workers → Monitoring
Test matrix
Each test below is independent and can be run in isolation if time is short. Recommended order is by criticality. Skip-okay tests are marked.
| # | Test | Critical? | Est. time |
|---|---|---|---|
| 1 | Scale-up latency | yes | 15 min |
| 2 | swarmctl rebalance correctness | yes | 10 min (overlaps Test 1) |
| 3 | Burst join (multi-worker) | no | 15 min |
| 4 | Boot quiescence sanity | yes | 10 min |
| 5 | Graceful scale-down + drain | yes | 20 min |
| 6 | Drain timeout | no | 15 min |
| 7 | Leader failover mid-rebalance | yes | 15 min |
| 8 | Excluded services protection | no | 10 min |
| 9 | Dry-run mode | no | 10 min |
| 10 | No-rebalance-on-scale-down | yes | 5 min (overlaps Test 5) |
Total critical path: ~75 min. Full matrix: ~2 hours.
Test 1 — Scale-up latency
Goal: Measure end-to-end time from memory-threshold crossed to task fully rebalanced. Confirms 2-3 min provisioning is still the baseline.
Density note. Under the 2-replica convention, per-stack memory roughly doubles for templated services compared to the pre-2-replica baseline. The 70% memory threshold trips with fewer concurrent stacks per worker than historical measurements would suggest — expect this test to fire scale-up sooner / with less synthetic stress than older runbook iterations. That’s a feature (faster reaction to real load), not a regression.
Method:
1. Note the current memory% of the cluster (Grafana).
2. Force memory pressure. Two options:
   Option A (cleaner, no app risk): Deploy a memory-stress service:
   sudo docker service create \
     --name autoscale-test-stress \
     --replicas 6 \
     --constraint 'node.role==worker' \
     --restart-condition none \
     polinux/stress \
     stress --vm 2 --vm-bytes 1G --timeout 1200s
   Option B (closer to real load): Scale up consumer_portal_django for several tenants until cluster mem >70%. Risk: longer recovery if the test goes sideways.
3. Watch each phase below unfold and timestamp it:
| Phase | Watch for | Where |
|---|---|---|
| Threshold crossed | Cluster mem ≥ 70% sustained ~60s | Grafana memory panel |
| MIG provisioning | New row in instance list, status PROVISIONING | GCP Console / gcloud compute instances list |
| Instance running | Status flips to RUNNING, IP assigned | same |
| Joined Swarm | Ready Active in docker node ls | sudo docker node ls on a manager |
| swarmctl detected | Log line new nodes detected; stabilizing | swarmctl logs |
| Rebalance complete | Log line rebalance complete ... exit_reason= | swarmctl logs |
Record each timestamp. Total elapsed = first to last.
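Turning the recorded timestamps into per-phase durations is simple arithmetic. The sketch below assumes the timestamps were noted by hand as `HH:MM:SS <phase label>` lines in chronological order; that file format and the `phase_deltas` name are illustrative choices, not something swarmctl produces.

```shell
# phase_deltas: reads "HH:MM:SS <phase label>" lines in chronological order
# and prints the seconds elapsed since the previous phase, plus the total.
# Assumes all timestamps fall on the same day (no midnight rollover).
phase_deltas() {
  awk '
    {
      split($1, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      label = $0
      sub(/^[^ ]+ +/, "", label)      # drop the timestamp column
      if (NR == 1) first = now
      else printf "+%ds %s\n", now - prev, label
      prev = now
    }
    END { if (NR > 0) printf "total=%ds\n", prev - first }'
}
```

The delta printed next to “Instance running” is the number to compare against the 3 min provisioning criterion, and the one next to “Joined Swarm” against the 30s join criterion.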
Pass criteria:
- Every phase in the table above is observed and timestamped
- MIG provision → instance running ≤ 3 min (regression check vs historical baseline)
- Swarm join completes within 30s of instance running
- swarmctl stabilization respects REBALANCE_STABILIZATION_DELAY (60s default)
- Rebalance pass completes without Failed task states
Cleanup: Remove the stress service:
sudo docker service rm autoscale-test-stress. Memory drops;
expect Test 5 (scale-down) to fire naturally if you don’t interrupt.
Test 2 — swarmctl rebalance correctness
Runs alongside Test 1; no separate setup.
Goal: Confirm the rebalance actually redistributes tasks onto the new worker, not just no-op force-updates.
Method:
1. Before the rebalance kicks off (during Test 1’s stabilization delay), snapshot task counts per node:
   sudo docker node ls -q | while read n; do
     echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l)"
   done > /tmp/pre-rebalance.txt
2. Wait for rebalance to complete (rebalance complete log).
3. Snapshot again into /tmp/post-rebalance.txt.
4. Compute skew before/after:
   compute_skew() {
     awk -F': ' '{print $2}' "$1" | sort -n | awk 'NR==1{min=$1} {max=$1} END{print "min="min, "max="max, "skew="max-min}'
   }
   compute_skew /tmp/pre-rebalance.txt
   compute_skew /tmp/post-rebalance.txt
Pass criteria:
- Post-rebalance skew is lower than pre-rebalance
- New worker has > 0 tasks
- No service ends with 0 running tasks (i.e., rebalance didn’t break a service)
- For Phase 3: rebalance targeted only the hottest REBALANCE_MAX_HOT_NODES workers (check swarmctl logs for filterByHotNodes: filtered services)
- Every templated web service still shows 2/2 after the pass (rebalance shouldn’t drop a replica). A 1/2 post-rebalance suggests a service update raced a node-removal; investigate before continuing.
Skew numbers under the 2-replica convention are roughly 2x the pre-2-replica baseline (each templated service contributes 2 tasks per stack, not 1). The skew ratio (max/min) is the comparable metric across baseline eras; absolute task counts are not.
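The `compute_skew` helper in the method reports the absolute difference (max minus min), while the ratio is what stays comparable across baseline eras. A minimal sketch of the ratio, assuming the same `<node>: <count>` snapshot format; `skew_ratio` is a hypothetical companion helper.

```shell
# skew_ratio FILE: FILE holds "<node>: <count>" lines, as written by the
# snapshot loop. Prints max/min to two decimals. A node with 0 tasks makes
# the ratio undefined, so that case is reported instead of dividing by zero.
skew_ratio() {
  awk -F': ' '
    { n = $2 + 0 }
    NR == 1 || n < min { min = n }
    NR == 1 || n > max { max = n }
    END {
      if (NR == 0)       print "no data"
      else if (min == 0) print "ratio=undefined (min=0, max=" max ")"
      else               printf "ratio=%.2f\n", max / min
    }' "$1"
}
```

Run it on both `/tmp/pre-rebalance.txt` and `/tmp/post-rebalance.txt`. An undefined ratio before the rebalance is expected, since the freshly joined worker starts with 0 tasks.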
Test 3 — Burst join (multi-worker scale-up) (optional)
Goal: Confirm
REBALANCE_STABILIZATION_DELAY resets correctly when
multiple nodes join in quick succession; only a single rebalance pass
should fire.
Method:
- Temporarily increase MIG max so the autoscaler can add 2 workers at once.
- Generate large memory pressure quickly (scale stress service to 12 replicas).
- Watch swarmctl logs for new nodes detected; stabilizing. Expect this line to appear, then re-appear (with reset delay) when the second worker joins.
- After both join, swarmctl should fire one rebalance covering both.
Pass criteria:
- Single rebalance pass for both joins
- No back-to-back rebalances within REBALANCE_POLL_INTERVAL of each other
- Log shows stabilization_delay_reset (or equivalent) when the second join is detected
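Counting completed passes in a captured log excerpt gives a quick check of the single-pass criterion. This sketch assumes only that each finished pass emits a line containing `rebalance complete` (the marker quoted in Test 1); `count_rebalance_passes` is a hypothetical helper name.

```shell
# count_rebalance_passes: reads swarmctl log lines on stdin and prints how
# many contain the completion marker. A burst join should yield exactly 1.
count_rebalance_passes() {
  grep -c 'rebalance complete' || true   # grep exits 1 on zero matches; still prints 0
}
```

Feed it only the log lines from the test window, otherwise earlier passes inflate the count.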
Test 4 — Boot quiescence sanity
Goal: Confirm that a swarmctl process restart (e.g. failover) does not trigger a spurious rebalance.
Method:
1. On the leader, restart swarmctl:
   sudo docker service update --force swarmctl_swarmctl
2. Wait for the new task to start.
3. Watch logs for the REBALANCE_BOOT_QUIESCENCE (300s default) wait + the seeding behavior.
Pass criteria:
- Log line: became leader; seeded known workers
- No rebalance pass within the boot-quiescence window
- After the window, swarmctl ticks normally without rebalancing (because baseline already matches current fleet)
Test 5 — Graceful scale-down + drain
Goal: Confirm a removed worker drains its tasks before the VM is killed, and that templated web services stay continuously available throughout the drain.
HA expectation under the 2-replica convention. With
2 replicas spread across 2 different nodes (validated in pre-flight step
3), draining one worker only removes one of two replicas per
spread web service. The other replica continues serving
on the surviving node. A 2-replica-with-spread web service
should never drop to 0/2 during this test — it
should transition 2/2 → 1/2 → 2/2 as swarm reschedules the
drained task. Pre-2-replica runs accepted a brief 0/1 blip
during drain; that’s no longer acceptable for these services.
Which services this applies to. The “never 0/2”
invariant only holds for services that carry
max_replicas_per_node (web apps, Caddy, Cloudflare Tunnel —
see the replica-convention background). It does not
apply to:
- cloudsql-proxy: 2 replicas but no spread constraint, so both replicas can sit on the drained node; a 0/2 blip here is possible and not a failure.
- single-replica services (Redis, Celery, Keycloak, ws_admin_django, sidecars): these blip to 0/1 and reschedule, which is expected.
So pick the victim service from the spread tier (a
*_django / *_flask app or
*_caddy). Don’t measure availability against cloudsql-proxy
or a single-replica service — they’ll “fail” the 0/2 criterion by design
and pollute the result.
Method:
1. After Test 1, stop the stress service: sudo docker service rm autoscale-test-stress.
2. Cluster memory drops below the scale-down threshold. Wait through the autoscaler cooldown (typically 5–10 min).
3. Pick a “victim” service hosted on the worker that’s about to drain (the worker that gets selected for removal; see GCP Console). It must be a 2-replica-with-spread service (a *_django/*_flask app or *_caddy), not cloudsql-proxy or a single-replica service, per the HA note above. In a separate terminal, poll its replica count once per second so you have a continuous record:
   VICTIM=<one-of-the-spread-services-on-the-doomed-worker>   # e.g. hbcp_unum_django_app
   while true; do
     printf '%s %s\n' "$(date +%T)" "$(sudo docker service ls --filter "name=$VICTIM" --format '{{.Replicas}}')"
     sleep 1
   done | tee /tmp/drain-availability.log
4. MIG will pick a worker to remove (usually the most recently added). Watch:
   - GCP Console: instance moves to STOPPING
   - On the leader: swarmctl logs should show the /drain POST arriving from the worker
5. Capture timing: drain start → drain complete → instance terminated. Stop the availability poll.
Pass criteria:
- Worker hits /drain before exiting (look for received drain request in swarmctl logs with the worker’s IP)
- swarmctl resolves caller IP to the correct node
- Node set to drain in docker node ls
- All tasks migrate before the worker leaves (drain returns success within DRAIN_TIMEOUT, default 45s)
- /tmp/drain-availability.log shows the victim service stayed ≥ 1 running replica throughout, i.e. transitions like 2/2 → 1/2 → 2/2 are good; a 0/2 line at any timestamp during the drain is a fail (means both replicas were on the drained node, which violates the placement-spread invariant the pre-flight verified). Only valid for the spread tier (see the HA note); cloudsql-proxy and single-replica services are exempt.
- No spike in swarm_tasks_failed metric (or minimal: a few Failed states with Shutdown reason are acceptable for non-templated services like one-shots). Note: a one-shot migration task that shows Failed once then Complete on retry is a known, benign pattern; the migration can race cloudsql-proxy readiness on a freshly-joined worker and retry once before the proxy’s IAM/connection setup finishes. It’s not a drain or autoscale failure; don’t flag it.
- Instance terminated cleanly (no zombie node entries in docker node ls)
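The availability criterion can be evaluated from the poll log rather than by scrolling through it. A minimal sketch, assuming each log line looks like `HH:MM:SS <running>/<desired> ...` as produced by the poll loop in the method (spread services print a trailing `(max 1 per node)`, which is ignored); `min_running` is a hypothetical helper name.

```shell
# min_running FILE: FILE holds "HH:MM:SS <running>/<desired> ..." lines from
# the availability poll. Prints the lowest running count seen and when it was
# first reached. Exits 1 if the service ever had 0 running replicas, and 2 if
# the file contains no usable samples.
min_running() {
  awk '
    $2 ~ /^[0-9]+\/[0-9]+/ {
      split($2, r, "/")
      run = r[1] + 0
      if (!seen || run < min) { min = run; at = $1 }
      seen = 1
    }
    END {
      if (!seen) { print "no samples"; exit 2 }
      printf "min_running=%d at %s\n", min, at
      exit (min == 0)
    }' "$1"
}
```

Usage: `min_running /tmp/drain-availability.log`. A result of `min_running=1` with exit status 0 corresponds to the expected 2/2 → 1/2 → 2/2 transition.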
Test 6 — Drain timeout (optional, harder to engineer)
Goal: Confirm graceful failure when a task can’t migrate in 45s.
Method:
- Deploy a service with a deliberately slow shutdown (e.g. a celery worker mid-task with stop_grace_period: 90s).
- Force the worker hosting it to scale down.
- Watch what happens when drain exceeds DRAIN_TIMEOUT.
Pass criteria:
- swarmctl logs drain timeout exceeded
- Worker still leaves cleanly (timeout is a graceful failure, not a hang)
- Swarm reschedules the orphan task reactively
- Brief task-failed blip is captured in metrics (will trip swarm_tasks_failed; acceptable if the test is announced)
Test 7 — Leader failover mid-rebalance
Goal: Confirm a leader change during an in-flight rebalance does not result in duplicate or runaway rebalances.
Method:
1. On the current leader, trigger a manual rebalance:
   REBALANCE_TOKEN=$(gcloud secrets versions access latest \
     --secret=swarm-observability-rebalance-token --project=<staging-project> \
     --impersonate-service-account=observability-deploy@<staging-project>.iam.gserviceaccount.com)
   curl -sS -XPOST -H "Authorization: Bearer $REBALANCE_TOKEN" \
     http://localhost:9876/rebalance -w '\nHTTP %{http_code}\n'
   Expect 202.
2. Within ~10 seconds, force a leader change by restarting Docker on the current leader:
   sudo systemctl restart docker
3. New leader takes over (one of the other two managers).
Pass criteria:
- New leader logs became leader; seeded known workers
- New leader skips its first tick (per design: baseline matches current state)
- No additional rebalance pass on the new leader within the next 10 min
- Old leader’s in-flight pass either completed or was abandoned cleanly (no half-updated services)
Test 8 — Excluded services protection (optional)
Goal: Confirm
REBALANCE_EXCLUDED_SERVICES prevents touching critical
services.
Method:
1. Update swarmctl env with: REBALANCE_EXCLUDED_SERVICES=observability_prometheus,observability_grafana
2. Snapshot UpdatedAt for these services: sudo docker service inspect observability_prometheus -f '{{.UpdatedAt}}'.
3. Trigger a manual /rebalance (per Test 7 step 1).
4. Compare UpdatedAt after.
Pass criteria:
- UpdatedAt unchanged for excluded services
- Log line: [rebalance] skipping excluded service: observability_prometheus
- Other services were updated as expected
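Comparing UpdatedAt by eye gets error-prone beyond two or three services. The sketch below diffs two snapshot files, assuming each line is `<service> <UpdatedAt>` with the service name first; the snapshot format and the `changed_services` name are illustrative assumptions.

```shell
# changed_services BEFORE AFTER: each file holds "<service> <UpdatedAt>" lines
# (service name first, the rest of the line is the timestamp), e.g. written by:
#   printf '%s %s\n' "$s" "$(sudo docker service inspect "$s" -f '{{.UpdatedAt}}')"
# Prints the services whose UpdatedAt differs between the two snapshots.
changed_services() {
  awk '
    { name = $1; ts = $0; sub(/^[^ \t]+[ \t]+/, "", ts) }
    NR == FNR { before[name] = ts; next }
    (name in before) && before[name] != ts { print name }
  ' "$1" "$2"
}
```

For this test the excluded services should be absent from the output while other services appear. The same comparison applies to Test 9, where the output should be empty.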
Test 9 — Dry-run mode (optional)
Goal: Confirm REBALANCE_DRY_RUN=true
makes no mutating calls.
Method:
- Update one manager to run swarmctl with REBALANCE_DRY_RUN=true. Easiest path: edit the service env and force-update; alternative is a sandbox manager.
- Trigger /rebalance.
- Inspect logs and UpdatedAt for several services.
Pass criteria:
- Logs show [dry-run] would force-update <service> for N services
- UpdatedAt unchanged for all services
- No Swarm task churn observed
Test 10 — No rebalance on scale-down (regression guard)
Goal: Confirm worker removal does not trigger a rebalance. Swarm reschedules removed tasks on its own; swarmctl must not duplicate that work.
Method:
Runs as part of Test 5 — just watch the logs after the worker leaves.
Pass criteria:
- No new nodes detected or rebalance log entries in the 5 min following worker removal
- (Optional: log line confirming the removed-node delta was observed and intentionally ignored)
- All previously 1/2 templated web services return to 2/2 via swarm’s native reschedule (not via a swarmctl pass) within ~1 min of the worker leaving
Reporting
For each test executed, capture:
- Test number + name
- Timestamp range (start to last observed state change — used for correlating with Cloud Monitoring later)
- swarmctl log excerpt showing the key state transitions (became leader, new nodes detected, rebalance complete, received drain request, etc.)
- Grafana screenshot of the affected panels (memory%, task distribution, task-failed)
- Pass / Fail / Skipped + reason
Compile into a Confluence test report. Paste the summary table (test#, pass/fail, notes) into the originating Jira ticket as a comment so future readers don’t have to dig through Confluence.
Cleanup
After all tests:
- Remove any test-only services (autoscale-test-stress etc.)
- Restore original MIG min/max if Test 3 modified them
- Restore swarmctl env if Tests 8 or 9 modified it
- Un-silence the swarm_tasks_failed alert
- Confirm cluster is back to nominal: sudo docker service ls --format '{{.Name}}\t{{.Replicas}}' | awk -F'\t' '{ split($2,r,"/"); if (r[1]+0 != r[2]+0) print }' should list only the one-shot jobs, which settle at 0/1
Known limitations
- This runbook does not exercise the production autoscaler — only staging. Production gets the validated configuration but no synthetic load.
- Tests assume the current swarmctl phase. If REBALANCE_PHASE changes, re-run Test 1 + Test 2 to validate the new convergence/targeting logic.
- Tests assume the three-tier replica baseline described in Background (SERVICE_REPLICAS=2 + MAX_REPLICAS_PER_NODE=1 for the spread web tier; SERVICE_REPLICAS=2 only for cloudsql-proxy; 1 replica for everything else). If the convention changes (different replica count, the spread constraint added to cloudsql-proxy, or templating removed), revisit Test 5’s 0/2-is-fail criterion (it keys off max_replicas_per_node, not replica count alone) and the density expectations in Test 1 / Test 2.
- cloudsql-proxy is 2 replicas without an anti-affinity constraint. If a future change adds max_replicas_per_node to it, it moves into the spread tier and becomes eligible for the Test 5 0/2 criterion; update the exemption list accordingly.
- Tests 3 and 6 are deliberately stretch goals; both require precise timing or specific app-level shutdown behavior that’s hard to engineer reliably. Skip them if time runs out and file followups for the specific scenarios.
References
- swarmctl service docs — controller architecture, env vars, deploy procedure
- Swarm manager image upgrade — similar runbook shape; helpful pattern reference
- Swarm manager quorum loss recovery — what to do if Test 7’s leader failover goes sideways