Swarm Worker Autoscale Validation

This runbook validates the end-to-end auto-scale lifecycle for our Docker Swarm worker fleet: GCP MIG autoscaler trigger → MIG provisions worker → swarm join → swarmctl auto-rebalance → load redistribution → scale-down → swarmctl drain.

Run this when:

A new swarmctl phase is rolled out (e.g. Phase 1 → 2 → 3 transitions)
The worker MIG’s autoscaler config changes (threshold, cooldown, target utilization)
Quarterly, as a regression check, before peak-traffic seasons
After a major Swarm or Docker version upgrade
After a worker boot-image change that could affect provisioning time

The procedure exercises both the GCP MIG autoscaler (provisioning side) and the swarmctl controller (rebalance + drain). See swarmctl service docs for the controller architecture.

Background

Two systems work together to keep the cluster sized correctly:

GCP MIG autoscaler — scales the worker MIG up at 70% memory utilization (currently). Adds/removes VM instances; does not know about Swarm tasks.
swarmctl — the manager-side controller. On worker join, it triggers a rebalance to redistribute existing tasks onto the new worker (Swarm itself only schedules new tasks, never moves existing ones). On worker shutdown, the worker hooks POST /drain so swarmctl migrates tasks before the VM goes away.

Historically (pre-swarmctl), operators had to run docker service update --force manually after each scale-up to spread load. swarmctl automates that step. This runbook validates that both halves still work together.

Replica convention. Three tiers, distinguished by how they scale — the key axis for the drain test below is whether a service has max_replicas_per_node, not “web vs non-web”:

2 replicas, spread across nodes — ${SERVICE_REPLICAS:-1} template plus max_replicas_per_node: ${MAX_REPLICAS_PER_NODE:-0} (the deploy script sets SERVICE_REPLICAS=2 and MAX_REPLICAS_PER_NODE=1). Covers the web-server services (Django/Flask apps, the Vite/Next/Angular UIs, ASP.NET web_api), Caddy, and Cloudflare Tunnel. These run 2 replicas guaranteed on 2 distinct nodes — this is the HA invariant Test 5 leans on.
2 replicas, NOT spread — ${SERVICE_REPLICAS:-1} only, no max_replicas_per_node. This is just the Cloud SQL Proxy sidecar, bumped 1 → 2 (the cloudsql-proxy-2-replicas change) so a single node loss can’t sever the DB path. Because it has no anti-affinity constraint, swarm may place both replicas on the same worker — so cloudsql-proxy is exempt from Test 5’s “never drops to 0/2” criterion (it can legitimately co-locate). It usually spreads on its own, but that’s best-effort, not guaranteed.
1 replica — untemplated. Redis, Celery workers/beat, bucket workers, and a few web-tier services that are single-replica by design: Keycloak (KC_CACHE: local, per-replica cache state) and hbcrm’s ws_admin_django (in-memory WebSocket state). These behave like any 1-replica service on drain — they blip to 0/1 and get rescheduled, which is expected, not a failure. Do not pick one as the Test 5 victim.

One-shot jobs (migration, django_migration, run_management_commands, default_user) have restart_policy: condition: none and settle to 0/1 after running — that’s their converged state, not a failure. The runbook’s pass criteria below assume this baseline.

Prerequisites

Member of ssh-n-env@thehelperbees.com (staging only — production not exercised by this runbook)
Member of sg-hb-infra-development@thehelperbees.com for MIG config inspection
gcloud, gssh, and docker CLI access to a manager
Grafana access to dashboard 21476 (Docker Swarm Overview) and the swarm-tasks dashboard
swarm_tasks_failed alert silenced for the test window — otherwise pager fires on every reschedule
All three managers Ready Active Reachable, one leader: sudo docker node ls

Pre-flight

Capture the baseline state and confirm the moving parts are wired correctly. Skip nothing — every check below maps to a failure mode you’ll be hunting later if something looks off.

1. Confirm the memory metric is flowing

Both the autoscaler and swarmctl Phase ≥ 2 read agent.googleapis.com/memory/percent_used. No metric = no scale events and swarmctl Phase 3 silently skips rebalances.

gcloud monitoring time-series list \
  --filter='metric.type="agent.googleapis.com/memory/percent_used" AND resource.labels.instance_name=~"swarm-n-worker.*"' \
  --interval-end-time=now --interval-duration=5m \
  --project=<staging-project>

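Expect: recent timestamps (within 60s) for every worker and values on a 0–100 percent scale (this metric reports a percentage, not a 0–1 fraction). The Ops Agent splits the metric by a state label (used, free, cached, ...), so each worker returns several series rather than one.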

2. Confirm swarmctl is healthy on all managers

for mgr in swarm-mgr-1 swarm-mgr-2 swarm-mgr-3; do
  echo "=== $mgr ==="
  gcloud compute ssh "$mgr" --zone <zone> --tunnel-through-iap --project <staging-project> \
    --command "curl -sf localhost:9876/healthz && sudo docker node inspect self -f '{{.ManagerStatus.Leader}}'"
done

Expect: curl succeeds against /healthz on all three managers (with -f, curl exits non-zero on any HTTP error status), and exactly one manager prints true for Leader.

3. Capture MIG and topology baseline

# MIG config — current min/max/target + cooldown
gcloud compute instance-groups managed describe swarm-n-workers \
  --region <region> --project <staging-project> \
  --format='value(autoscaler.autoscalingPolicy)'

# Current swarm topology
sudo docker node ls --format '{{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}'

# Current task distribution (skew baseline)
sudo docker node ls -q | while read n; do
  echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l) tasks"
done

# Replica-convention sanity: every templated web service should be at 2/2.
# A 1/2 or 0/2 here means a stack didn't finish converging — fix before
# starting the test, otherwise drain/rebalance results will be polluted.
#
# One-shots (restart_policy condition: none) settle to 0/1 and are excluded
# so they don't show as false "NOT CONVERGED". Discovered dynamically rather
# than hard-coded — the set varies per repo (migration, django_migration,
# run_management_commands, default_user, ...).
ONESHOTS=$(sudo docker service ls -q | while read s; do
  [ "$(sudo docker service inspect "$s" -f '{{.Spec.TaskTemplate.RestartPolicy.Condition}}')" = "none" ] \
    && sudo docker service inspect "$s" -f '{{.Spec.Name}}'
done)
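# Compare numerically: spread services report e.g. "2/2 (max 1 per node)".
# One-shot names are matched exactly, so an empty list excludes nothing.
sudo docker service ls --format '{{.Name}}\t{{.Replicas}}' \
  | awk -F'\t' 'NR == FNR { skip[$1] = 1; next }
      !($1 in skip) { split($2, r, "/"); if (r[1] + 0 != r[2] + 0) print "  NOT CONVERGED: " $0 }' \
      <(printf '%s\n' "$ONESHOTS") -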

# Spot-check a representative templated service shows 2 replicas on
# different nodes (HA invariant we're about to test).
SVC=$(sudo docker service ls --format '{{.Name}}' | grep -E '_django_app$|_caddy$' | head -1)
echo "=== node placement for $SVC ==="
sudo docker service ps "$SVC" --filter desired-state=running \
  --format 'table {{.Name}}\t{{.Node}}'

Save the output — you’ll compare against it after each test. The replica-convention check should report nothing (all stacks converged); the spot-check should show 2 replicas on 2 distinct nodes.
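
The distinct-node spot-check can also be scripted rather than read by eye. The following is a minimal sketch, assuming node names arrive one per line on stdin; count_spread is a hypothetical helper name introduced here for illustration, not part of Docker or swarmctl.

```shell
# count_spread: hypothetical helper. Reads one node name per line on stdin and
# reports how many replicas were listed and how many distinct nodes host them.
count_spread() {
  awk 'NF { total++; if (!seen[$1]++) distinct++ }
       END { printf "replicas=%d nodes=%d\n", total, distinct }'
}

# Usage on a manager (not executed by this snippet):
#   sudo docker service ps "$SVC" --filter desired-state=running \
#     --format '{{.Node}}' | count_spread
```

A healthy spread-tier service should print replicas=2 nodes=2; nodes=1 would mean both replicas are co-located and the HA invariant does not hold for that service.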

4. Open observation windows

In parallel terminals:

On the leader: sudo docker service logs -f swarmctl_swarmctl 2>&1 | grep -v healthcheck
On the leader: journalctl -u docker -f
Browser: Grafana dashboard 21476 and the swarm-tasks dashboard
Browser: GCP Console → Compute → Instance Groups → swarm-n-workers → Monitoring

Test matrix

Each test below is independent and can be run in isolation if time is short. Recommended order is by criticality. Skip-okay tests are marked.

| # | Test | Critical? | Est. time |
| --- | --- | --- | --- |
| 1 | Scale-up latency | yes | 15 min |
| 2 | swarmctl rebalance correctness | yes | 10 min (overlaps Test 1) |
| 3 | Burst join (multi-worker) | no | 15 min |
| 4 | Boot quiescence sanity | yes | 10 min |
| 5 | Graceful scale-down + drain | yes | 20 min |
| 6 | Drain timeout | no | 15 min |
| 7 | Leader failover mid-rebalance | yes | 15 min |
| 8 | Excluded services protection | no | 10 min |
| 9 | Dry-run mode | no | 10 min |
| 10 | No-rebalance-on-scale-down | yes | 5 min (overlaps Test 5) |

Total critical path: ~75 min. Full matrix: ~2 hours.

Test 1 — Scale-up latency

Goal: Measure the end-to-end time from the memory threshold being crossed to tasks being fully rebalanced, and confirm that 2–3 min provisioning is still the baseline.

Density note. Under the 2-replica convention, per-stack memory roughly doubles for templated services compared to the pre-2-replica baseline. The 70% memory threshold trips with fewer concurrent stacks per worker than historical measurements would suggest — expect this test to fire scale-up sooner / with less synthetic stress than older runbook iterations. That’s a feature (faster reaction to real load), not a regression.

Method:

Note the current memory% of the cluster (Grafana).
Force memory pressure. Two options:
- Option A (cleaner, no app risk): Deploy a memory-stress service:
```
sudo docker service create \
  --name autoscale-test-stress \
  --replicas 6 \
  --constraint 'node.role==worker' \
  --restart-condition none \
  polinux/stress \
  stress --vm 2 --vm-bytes 1G --timeout 1200s
```
- Option B (closer to real load): Scale up consumer_portal_django for several tenants until cluster mem >70%. Risk: longer recovery if the test goes sideways.

Watch the six phases unfold and timestamp each:

| Phase | Watch for | Where |
| --- | --- | --- |
| Threshold crossed | Cluster mem ≥ 70% sustained ~60s | Grafana memory panel |
| MIG provisioning | New row in instance list, status `PROVISIONING` | GCP Console / `gcloud compute instances list` |
| Instance running | Status flips to `RUNNING`, IP assigned | same |
| Joined Swarm | `Ready Active` in `docker node ls` | `sudo docker node ls` on a manager |
| swarmctl detected | Log line `new nodes detected; stabilizing` | swarmctl logs |
| Rebalance complete | Log line `rebalance complete ... exit_reason=` | swarmctl logs |

Record each timestamp. Total elapsed = first to last.
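
To turn the recorded timestamps into per-phase durations, a small helper can do the arithmetic. This is a sketch under stated assumptions: the notes file format ("HH:MM:SS label", one line per phase, in order, all on the same day, single-word labels) and the name phase_deltas are both hypothetical conventions chosen for this example.

```shell
# phase_deltas: hypothetical helper. Reads "HH:MM:SS label" lines on stdin
# (in chronological order, same day) and prints the seconds elapsed since the
# previous line for each phase, followed by the first-to-last total.
phase_deltas() {
  awk '{
    split($1, t, ":")
    now = t[1] * 3600 + t[2] * 60 + t[3]
    if (NR == 1) first = now
    else printf "%s +%ds\n", $2, now - prev
    prev = now
  } END { if (NR) printf "total %ds\n", prev - first }'
}

# Usage (file name is an example):
#   phase_deltas < /tmp/test1-phases.txt
```

The "running" delta maps to the 3 min provisioning criterion and the "joined" delta to the 30s join criterion, so the output can be checked directly against the pass criteria.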

Pass criteria:

Each phase logged, all six observable
MIG provision → instance running ≤ 3 min (regression check vs historical baseline)
Swarm join completes within 30s of instance running
swarmctl stabilization respects REBALANCE_STABILIZATION_DELAY (60s default)
Rebalance pass completes without Failed task states

Cleanup: Remove the stress service: sudo docker service rm autoscale-test-stress. Memory drops; expect Test 5 (scale-down) to fire naturally if you don’t interrupt.

Test 2 — swarmctl rebalance correctness

Runs alongside Test 1; no separate setup.

Goal: Confirm the rebalance actually redistributes tasks onto the new worker, not just no-op force-updates.

Method:

Before the rebalance kicks off (during Test 1’s stabilization delay), snapshot task counts per node:

sudo docker node ls -q | while read n; do
  echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l)"
done > /tmp/pre-rebalance.txt

Wait for rebalance to complete (rebalance complete log).
Snapshot again into /tmp/post-rebalance.txt.

Compute skew before/after:

compute_skew() {
  awk -F': ' '{print $2}' "$1" | sort -n | awk 'NR==1{min=$1} {max=$1} END{print "min="min, "max="max, "skew="max-min}'
}
compute_skew /tmp/pre-rebalance.txt
compute_skew /tmp/post-rebalance.txt

Pass criteria:

Post-rebalance skew is lower than pre-rebalance
New worker has > 0 tasks
No service ends with 0 running tasks (i.e., rebalance didn’t break a service)
For Phase 3: rebalance targeted only the hottest REBALANCE_MAX_HOT_NODES workers (check swarmctl logs for filterByHotNodes: filtered services)
Every templated web service still shows 2/2 after the pass (rebalance shouldn’t drop a replica). A 1/2 post-rebalance suggests a service update raced a node-removal — investigate before continuing.

Skew numbers under the 2-replica convention are roughly 2x the pre-2-replica baseline (each templated service contributes 2 tasks per stack, not 1). The skew ratio (max/min) is the comparable metric across baseline eras; absolute task counts are not.
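
compute_skew reports the absolute difference (max minus min), so the ratio has to be computed separately. A minimal sketch, assuming the same "node: count" snapshot format written by the loop above; skew_ratio is a hypothetical name, and reporting "inf" when a node has zero tasks is a choice made for this example.

```shell
# skew_ratio: hypothetical helper. Prints the max/min task-count ratio for a
# snapshot file of "node: count" lines. A node with 0 tasks makes the ratio
# undefined, so that case is reported as "inf".
skew_ratio() {
  awk -F': ' '
    { v = $2 + 0
      if (NR == 1 || v < min) min = v
      if (NR == 1 || v > max) max = v }
    END {
      if (NR == 0) { print "no data"; exit 1 }
      if (min == 0) print "ratio=inf"
      else printf "ratio=%.2f\n", max / min
    }' "$1"
}

# Usage:
#   skew_ratio /tmp/pre-rebalance.txt
#   skew_ratio /tmp/post-rebalance.txt
```

The pre-rebalance snapshot will normally print ratio=inf because the new worker has no tasks yet. Note also that the snapshots list every node returned by docker node ls, managers included, so a manager that runs few tasks can dominate the ratio; filter the file to workers first if that matters for the comparison.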

Test 3 — Burst join (multi-worker scale-up) (optional)

Goal: Confirm REBALANCE_STABILIZATION_DELAY resets correctly when multiple nodes join in quick succession; only a single rebalance pass should fire.

Method:

Temporarily increase MIG max so the autoscaler can add 2 workers at once.
Generate large memory pressure quickly (scale stress service to 12 replicas).
Watch swarmctl logs for new nodes detected; stabilizing — expect this line to appear, then re-appear (with reset delay) when the second worker joins.
After both join, swarmctl should fire one rebalance covering both.

Pass criteria:

Single rebalance pass for both joins
No back-to-back rebalances within REBALANCE_POLL_INTERVAL of each other
Log shows stabilization_delay_reset (or equivalent) when the second join is detected
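
Counting the relevant log lines makes the "single rebalance pass" criterion easier to verify than scrolling. The sketch below matches only the two phrases quoted in this runbook (new nodes detected; stabilizing and rebalance complete); the rest of the swarmctl log format is not assumed, and summarize_rebalance_log is a hypothetical helper name.

```shell
# summarize_rebalance_log: hypothetical helper. Reads swarmctl log text on
# stdin and counts stabilization notices and completed rebalance passes.
summarize_rebalance_log() {
  awk '
    /new nodes detected; stabilizing/ { joins++ }
    /rebalance complete/              { passes++ }
    END { printf "stabilizing=%d rebalance_complete=%d\n", joins, passes }'
}

# Usage on the leader (not executed by this snippet):
#   sudo docker service logs --since 15m swarmctl_swarmctl 2>&1 \
#     | summarize_rebalance_log
```

For a two-worker burst, stabilizing=2 rebalance_complete=1 is the expected shape. The same helper can be pointed at the window after a worker removal in Test 10, where both counts should be 0.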

Test 4 — Boot quiescence sanity

Goal: Confirm that a swarmctl process restart (e.g. failover) does not trigger a spurious rebalance.

Method:

On the leader, restart swarmctl:

sudo docker service update --force swarmctl_swarmctl

Wait for the new task to start.
Watch logs for the REBALANCE_BOOT_QUIESCENCE (300s default) wait + the seeding behavior.

Pass criteria:

Log line: became leader; seeded known workers
No rebalance pass within the boot-quiescence window
After the window, swarmctl ticks normally without rebalancing (because baseline already matches current fleet)

Test 5 — Graceful scale-down + drain

Goal: Confirm a removed worker drains its tasks before the VM is killed, and that templated web services stay continuously available throughout the drain.

HA expectation under the 2-replica convention. With 2 replicas spread across 2 different nodes (validated in pre-flight step 3), draining one worker only removes one of two replicas per spread web service. The other replica continues serving on the surviving node. A 2-replica-with-spread web service should never drop to 0/2 during this test — it should transition 2/2 → 1/2 → 2/2 as swarm reschedules the drained task. Pre-2-replica runs accepted a brief 0/1 blip during drain; that’s no longer acceptable for these services.

Which services this applies to. The “never 0/2” invariant only holds for services that carry max_replicas_per_node (web apps, Caddy, Cloudflare Tunnel — see the replica-convention background). It does not apply to:

cloudsql-proxy — 2 replicas but no spread constraint, so both replicas can sit on the drained node; a 0/2 blip here is possible and not a failure.
single-replica services (Redis, Celery, Keycloak, ws_admin_django, sidecars) — these blip to 0/1 and reschedule, which is expected.

So pick the victim service from the spread tier (a *_django / *_flask app or *_caddy). Don’t measure availability against cloudsql-proxy or a single-replica service — they’ll “fail” the 0/2 criterion by design and pollute the result.

Method:

After Test 1, stop the stress service: sudo docker service rm autoscale-test-stress.
Cluster memory drops below the scale-down threshold. Wait through the autoscaler cooldown (typically 5–10 min).
Pick a “victim” service hosted on the worker that’s about to drain (the worker that gets selected for removal — see GCP Console). It must be a 2-replica-with-spread service (a *_django / *_flask app or *_caddy) — not cloudsql-proxy or a single-replica service, per the HA note above. In a separate terminal, poll its replica count once per second so you have a continuous record:
```
VICTIM=<one-of-the-spread-services-on-the-doomed-worker>  # e.g. hbcp_unum_django_app
while true; do
  printf '%s %s\n' "$(date +%T)" "$(sudo docker service ls --filter "name=$VICTIM" --format '{{.Replicas}}')"
  sleep 1
done | tee /tmp/drain-availability.log
```
MIG will pick a worker to remove (usually the most recently added). Watch:
- GCP Console: instance moves to STOPPING
- On the leader: swarmctl logs should show the /drain POST arriving from the worker
Capture timing: drain start → drain complete → instance terminated. Stop the availability poll.

Pass criteria:

Worker hits /drain before exiting (look for received drain request in swarmctl logs with the worker’s IP)
swarmctl resolves caller IP to the correct node
Node set to drain in docker node ls
All tasks migrate before the worker leaves (drain returns success within DRAIN_TIMEOUT, default 45s)
/tmp/drain-availability.log shows the victim service stayed ≥ 1 running replica throughout — i.e. transitions like 2/2 → 1/2 → 2/2 are good; a 0/2 line at any timestamp during the drain is a fail (means both replicas were on the drained node, which violates the placement-spread invariant the pre-flight verified). Only valid for the spread tier — see the HA note; cloudsql-proxy and single-replica services are exempt.
No spike in swarm_tasks_failed metric (or minimal — a few Failed states with Shutdown reason are acceptable for non-templated services like one-shots). Note: a one-shot migration task that shows Failed once then Complete on retry is a known, benign pattern — the migration can race cloudsql-proxy readiness on a freshly-joined worker and retry once before the proxy’s IAM/connection setup finishes. It’s not a drain or autoscale failure; don’t flag it.
Instance terminated cleanly (no zombie node entries in docker node ls)
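
The availability criterion can be checked mechanically against the poll log. A minimal sketch, assuming each line of /tmp/drain-availability.log looks like "HH:MM:SS running/desired" (optionally followed by text such as "(max 1 per node)"); check_drain_log is a hypothetical helper name.

```shell
# check_drain_log: hypothetical helper. Scans the poll log and fails if any
# sample shows 0 running replicas. Exit codes: 0 pass, 1 fail, 2 no samples.
check_drain_log() {
  awk '
    NF < 2 { next }   # skip samples where the service lookup printed nothing
    { split($2, r, "/"); running = r[1] + 0; samples++
      if (samples == 1 || running < min) min = running
      if (running == 0) { zero++; if (first == "") first = $1 } }
    END {
      if (samples == 0) { print "no samples"; exit 2 }
      if (zero > 0) { printf "FAIL: %d sample(s) at 0 running, first at %s\n", zero, first; exit 1 }
      printf "PASS: min running=%d over %d samples\n", min, samples
    }' "$1"
}

# Usage:
#   check_drain_log /tmp/drain-availability.log
```

A PASS with min running=1 corresponds to the expected 2/2 → 1/2 → 2/2 transition. Samples that contain only a timestamp are skipped, so a large gap between the sample count and the poll duration is worth a manual look.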

Test 6 — Drain timeout (optional, harder to engineer)

Goal: Confirm graceful failure when a task can’t migrate in 45s.

Method:

Deploy a service with a deliberately slow shutdown (e.g. a celery worker mid-task with stop_grace_period: 90s).
Force the worker hosting it to scale down.
Watch what happens when drain exceeds DRAIN_TIMEOUT.

Pass criteria:

swarmctl logs drain timeout exceeded
Worker still leaves cleanly (timeout is a graceful failure, not a hang)
Swarm reschedules the orphan task reactively
Brief task-failed blip is captured in metrics (will trip swarm_tasks_failed — acceptable if the test is announced)

Test 7 — Leader failover mid-rebalance

Goal: Confirm a leader change during an in-flight rebalance does not result in duplicate or runaway rebalances.

Method:

On the current leader, trigger a manual rebalance:

REBALANCE_TOKEN=$(gcloud secrets versions access latest \
  --secret=swarm-observability-rebalance-token --project=<staging-project> \
  --impersonate-service-account=observability-deploy@<staging-project>.iam.gserviceaccount.com)
curl -sS -XPOST -H "Authorization: Bearer $REBALANCE_TOKEN" \
  http://localhost:9876/rebalance -w '\nHTTP %{http_code}\n'

Expect 202.

Within ~10 seconds, force a leader change by restarting Docker on the current leader:
```
sudo systemctl restart docker
```
New leader takes over (one of the other two managers).

Pass criteria:

New leader logs became leader; seeded known workers
New leader skips its first tick (per design — baseline matches current state)
No additional rebalance pass on the new leader within the next 10 min
Old leader’s in-flight pass either completed or was abandoned cleanly (no half-updated services)

Test 8 — Excluded services protection (optional)

Goal: Confirm REBALANCE_EXCLUDED_SERVICES prevents touching critical services.

Method:

Update swarmctl env with:

REBALANCE_EXCLUDED_SERVICES=observability_prometheus,observability_grafana

Snapshot UpdatedAt for these services: sudo docker service inspect observability_prometheus -f '{{.UpdatedAt}}'.
Trigger a manual /rebalance (per Test 7 step 1).
Compare UpdatedAt after.

Pass criteria:

UpdatedAt unchanged for excluded services
Log line: [rebalance] skipping excluded service: observability_prometheus
Other services were updated as expected
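
Comparing UpdatedAt by hand gets tedious beyond two or three services. The sketch below snapshots every service before and after the rebalance and lists the ones whose timestamp changed. Both function names and the /tmp file names are hypothetical; the only Docker calls used are docker service ls and docker service inspect with the UpdatedAt field already used in this test.

```shell
# snapshot_updated_at: hypothetical helper. Prints "name updated-at" for every
# service. Calls docker, so run it on a manager.
snapshot_updated_at() {
  sudo docker service ls --format '{{.Name}}' | while read -r s; do
    printf '%s %s\n' "$s" "$(sudo docker service inspect "$s" -f '{{.UpdatedAt}}')"
  done
}

# changed_services: hypothetical helper. Given a before and an after snapshot,
# prints the names of services whose line differs between the two.
# Assumes the before file is non-empty.
changed_services() {
  awk 'NR == FNR { before[$1] = $0; next }
       ($1 in before) && before[$1] != $0 { print $1 }' "$1" "$2"
}

# Usage:
#   snapshot_updated_at > /tmp/updated-before.txt
#   ... trigger /rebalance and wait for it to finish ...
#   snapshot_updated_at > /tmp/updated-after.txt
#   changed_services /tmp/updated-before.txt /tmp/updated-after.txt
```

For this test the excluded services should be absent from the output while other services appear. The same pair of snapshots works for the dry-run test, where the output should be empty.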

Test 9 — Dry-run mode (optional)

Goal: Confirm REBALANCE_DRY_RUN=true makes no mutating calls.

Method:

Update one manager to run swarmctl with REBALANCE_DRY_RUN=true. Easiest path: edit the service env and force-update; alternative is a sandbox manager.
Trigger /rebalance.
Inspect logs and UpdatedAt for several services.

Pass criteria:

Logs show [dry-run] would force-update <service> for N services
UpdatedAt unchanged for all services
No Swarm task churn observed

Test 10 — No rebalance on scale-down (regression guard)

Goal: Confirm worker removal does not trigger a rebalance. Swarm reschedules removed tasks on its own; swarmctl must not duplicate that work.

Method:

Runs as part of Test 5 — just watch the logs after the worker leaves.

Pass criteria:

Neither a “new nodes detected” nor a “rebalance” log entry appears in the 5 min following worker removal
(Optional: log line confirming the removed-node delta was observed and intentionally ignored)
All previously 1/2 templated web services return to 2/2 via swarm’s native reschedule (not via a swarmctl pass) within ~1 min of the worker leaving

Reporting

For each test executed, capture:

Test number + name
Timestamp range (start to last observed state change — used for correlating with Cloud Monitoring later)
swarmctl log excerpt showing the key state transitions (became leader, new nodes detected, rebalance complete, received drain request, etc.)
Grafana screenshot of the affected panels (memory%, task distribution, task-failed)
Pass / Fail / Skipped + reason

Compile into a Confluence test report. Paste the summary table (test#, pass/fail, notes) into the originating Jira ticket as a comment so future readers don’t have to dig through Confluence.

Cleanup

After all tests:

Remove any test-only services (autoscale-test-stress etc.)
Restore original MIG min/max if Test 3 modified them
Restore swarmctl env if Tests 8 or 9 modified it
Un-silence the swarm_tasks_failed alert
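Confirm cluster is back to nominal: sudo docker service ls --format '{{.Name}} {{.Replicas}}' | awk '{ split($2, r, "/"); if (r[1]+0 != r[2]+0) print }' should list only the one-shot jobs, which settle at 0/1 (see Background)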

Known limitations

This runbook does not exercise the production autoscaler — only staging. Production gets the validated configuration but no synthetic load.
Tests assume the current swarmctl phase. If REBALANCE_PHASE changes, re-run Test 1 + Test 2 to validate the new convergence/targeting logic.
Tests assume the three-tier replica baseline described in Background (SERVICE_REPLICAS=2 + MAX_REPLICAS_PER_NODE=1 for the spread web tier; SERVICE_REPLICAS=2 only for cloudsql-proxy; 1 replica for everything else). If the convention changes — different replica count, the spread constraint added to cloudsql-proxy, or templating removed — revisit Test 5’s 0/2-is-fail criterion (it keys off max_replicas_per_node, not replica count alone) and the density expectations in Test 1 / Test 2.
cloudsql-proxy is 2 replicas without an anti-affinity constraint. If a future change adds max_replicas_per_node to it, it moves into the spread tier and becomes eligible for the Test 5 0/2 criterion — update the exemption list accordingly.
Tests 3 and 6 are deliberately stretch goals — both require precise timing or specific app-level shutdown behavior that’s hard to engineer reliably. Skip them if time runs out and file followups for the specific scenarios.

References

swarmctl service docs — controller architecture, env vars, deploy procedure
Swarm manager image upgrade — similar runbook shape; helpful pattern reference
Swarm manager quorum loss recovery — what to do if Test 7’s leader failover goes sideways
