
Swarm Worker Autoscale Validation

This runbook validates the end-to-end auto-scale lifecycle for our Docker Swarm worker fleet: GCP MIG autoscaler trigger → MIG provisions worker → swarm join → swarmctl auto-rebalance → load redistribution → scale-down → swarmctl drain.

Run this when:

  • A new swarmctl phase is rolled out (e.g. Phase 1 → 2 → 3 transitions)
  • The worker MIG’s autoscaler config changes (threshold, cooldown, target utilization)
  • Quarterly, as a regression check, before peak-traffic seasons
  • After a major Swarm or Docker version upgrade
  • After a worker boot-image change that could affect provisioning time

The procedure exercises both the GCP MIG autoscaler (provisioning side) and the swarmctl controller (rebalance + drain). See swarmctl service docs for the controller architecture.

Background

Two systems work together to keep the cluster sized correctly:

  1. GCP MIG autoscaler — scales the worker MIG up at 70% memory utilization (currently). Adds/removes VM instances; does not know about Swarm tasks.
  2. swarmctl — the manager-side controller. On worker join, it triggers a rebalance to redistribute existing tasks onto the new worker (Swarm itself only schedules new tasks, never moves existing ones). On worker shutdown, the worker hooks POST /drain so swarmctl migrates tasks before the VM goes away.

Historically (pre-swarmctl) operators had to manually docker service update --force after each scale-up to spread load. swarmctl automated that. This runbook validates both halves still work together.

Replica convention. Three tiers, distinguished by how they scale — the key axis for the drain test below is whether a service has max_replicas_per_node, not “web vs non-web”:

  • 2 replicas, spread across nodes: ${SERVICE_REPLICAS:-1} template plus max_replicas_per_node: ${MAX_REPLICAS_PER_NODE:-0} (the deploy script sets SERVICE_REPLICAS=2 and MAX_REPLICAS_PER_NODE=1). Covers the web-server services (Django/Flask apps, the Vite/Next/Angular UIs, ASP.NET web_api), Caddy, and Cloudflare Tunnel. These run 2 replicas, guaranteed to be on 2 distinct nodes; this is the HA invariant Test 5 leans on.
  • 2 replicas, NOT spread: ${SERVICE_REPLICAS:-1} only, no max_replicas_per_node. This is just the Cloud SQL Proxy sidecar, bumped 1 → 2 (the cloudsql-proxy-2-replicas change) so a single node loss can’t sever the DB path. Because it has no anti-affinity constraint, swarm may place both replicas on the same worker, so cloudsql-proxy is exempt from Test 5’s “never drops to 0/2” criterion (it can legitimately co-locate). It usually spreads on its own, but that’s best-effort, not guaranteed.
  • 1 replica — untemplated. Redis, Celery workers/beat, bucket workers, and a few web-tier services that are single-replica by design: Keycloak (KC_CACHE: local, per-replica cache state) and hbcrm’s ws_admin_django (in-memory WebSocket state). These behave like any 1-replica service on drain — they blip to 0/1 and get rescheduled, which is expected, not a failure. Do not pick one as the Test 5 victim.
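
To make the templating concrete: compose resolves ${VAR:-default} the same way a POSIX shell does, so the fallbacks for the two templated tiers can be sketched in plain shell. The function name below is illustrative and is not part of the deploy script.

```shell
# Illustrative only: resolve_replica_settings is a hypothetical helper that
# mirrors the ${VAR:-default} fallbacks used in the compose template.
resolve_replica_settings() {
  # Unset or empty variables fall back to the template defaults:
  # 1 replica, and max_replicas_per_node 0 (no per-node limit).
  printf 'replicas=%s max_replicas_per_node=%s\n' \
    "${SERVICE_REPLICAS:-1}" "${MAX_REPLICAS_PER_NODE:-0}"
}

# Without the deploy script's overrides, the template defaults apply.
( unset SERVICE_REPLICAS MAX_REPLICAS_PER_NODE; resolve_replica_settings )

# With the values the deploy script sets for the spread tier.
( SERVICE_REPLICAS=2; MAX_REPLICAS_PER_NODE=1; resolve_replica_settings )
```

The first call prints the untouched defaults and the second prints the spread-tier values, which is why a stack deployed outside the deploy script would come up with a single replica and no spread.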

One-shot jobs (migration, django_migration, run_management_commands, default_user) have restart_policy: condition: none and settle to 0/1 after running — that’s their converged state, not a failure. The runbook’s pass criteria below assume this baseline.

Prerequisites

  • Member of ssh-n-env@thehelperbees.com (staging only — production not exercised by this runbook)
  • Member of sg-hb-infra-development@thehelperbees.com for MIG config inspection
  • gcloud, gssh, and docker CLI access to a manager
  • Grafana access to dashboard 21476 (Docker Swarm Overview) and the swarm-tasks dashboard
  • swarm_tasks_failed alert silenced for the test window — otherwise pager fires on every reschedule
  • All three managers Ready Active Reachable, one leader: sudo docker node ls

Pre-flight

Capture the baseline state and confirm the moving parts are wired correctly. Skip nothing — every check below maps to a failure mode you’ll be hunting later if something looks off.

1. Confirm the memory metric is flowing

Both the autoscaler and swarmctl Phase ≥ 2 read agent.googleapis.com/memory/percent_used. No metric = no scale events and swarmctl Phase 3 silently skips rebalances.

gcloud monitoring time-series list \
  --filter='metric.type="agent.googleapis.com/memory/percent_used" AND resource.labels.instance_name=~"swarm-n-worker.*"' \
  --interval-end-time=now --interval-duration=5m \
  --project=<staging-project>

Expect: one series per worker, recent timestamps (within 60s), values between 0 and 1.

2. Confirm swarmctl is healthy on all managers

for mgr in swarm-mgr-1 swarm-mgr-2 swarm-mgr-3; do
  echo "=== $mgr ==="
  gcloud compute ssh "$mgr" --zone <zone> --tunnel-through-iap --project <staging-project> \
    --command "curl -sf localhost:9876/healthz && sudo docker node inspect self -f '{{.ManagerStatus.Leader}}'"
done

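Expect: /healthz succeeds on all three managers (curl -sf prints the response body rather than the status code and exits non-zero on an HTTP error, in which case the Leader line for that manager is missing); exactly one returns true for Leader.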

3. Capture MIG and topology baseline

# MIG config — current min/max/target + cooldown
gcloud compute instance-groups managed describe swarm-n-workers \
  --region <region> --project <staging-project> \
  --format='value(autoscaler.autoscalingPolicy)'

# Current swarm topology
sudo docker node ls --format '{{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}'

# Current task distribution (skew baseline)
sudo docker node ls -q | while read n; do
  echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l) tasks"
done

# Replica-convention sanity: every templated web service should be at 2/2.
# A 1/2 or 0/2 here means a stack didn't finish converging — fix before
# starting the test, otherwise drain/rebalance results will be polluted.
#
# One-shots (restart_policy condition: none) settle to 0/1 and are excluded
# so they don't show as false "NOT CONVERGED". Discovered dynamically rather
# than hard-coded — the set varies per repo (migration, django_migration,
# run_management_commands, default_user, ...).
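ONESHOTS=$(sudo docker service ls -q | while read -r s; do
  [ "$(sudo docker service inspect "$s" -f '{{.Spec.TaskTemplate.RestartPolicy.Condition}}' 2>/dev/null)" = "none" ] \
    && sudo docker service inspect "$s" -f '{{.Spec.Name}}'
done)
# Exact-name match on the one-shot list (an empty list excludes nothing).
# Replicas can print as "2/2 (max 1 per node)", so compare the counts numerically.
sudo docker service ls --format '{{.Name}}\t{{.Replicas}}' \
  | awk -F'\t' -v skip="$(printf '%s' "$ONESHOTS" | tr '\n' ' ')" '
      BEGIN { n = split(skip, names, " "); for (i = 1; i <= n; i++) oneshot[names[i]] = 1 }
      !($1 in oneshot) { split($2, r, "/"); if (r[1] + 0 != r[2] + 0) print "  NOT CONVERGED: " $0 }'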

# Spot-check a representative templated service shows 2 replicas on
# different nodes (HA invariant we're about to test).
SVC=$(sudo docker service ls --format '{{.Name}}' | grep -E '_django_app$|_caddy$' | head -1)
echo "=== node placement for $SVC ==="
sudo docker service ps "$SVC" --filter desired-state=running \
  --format 'table {{.Name}}\t{{.Node}}'

Save the output — you’ll compare against it after each test. The replica-convention check should report nothing (all stacks converged); the spot-check should show 2 replicas on 2 distinct nodes.
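
The spot-check can also be reduced to a number. A minimal sketch, assuming the task list is printed without the table header; count_distinct_nodes is a hypothetical helper name, not something swarmctl provides.

```shell
# count_distinct_nodes: read "task<TAB>node" lines on stdin and print how many
# different nodes appear. Hypothetical helper for illustration.
count_distinct_nodes() {
  awk -F'\t' 'NF >= 2 && !seen[$2]++ { n++ } END { print n + 0 }'
}

# Usage on a manager (SVC as selected in the spot-check):
#   sudo docker service ps "$SVC" --filter desired-state=running \
#     --format '{{.Name}}\t{{.Node}}' | count_distinct_nodes
# A spread-tier service should print 2.
```

A result of 1 for a spread-tier service means both replicas share a node, which would invalidate the HA expectation in Test 5.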

4. Open observation windows

In parallel terminals:

  • On the leader: sudo docker service logs -f swarmctl_swarmctl 2>&1 | grep -v healthcheck
  • On the leader: journalctl -u docker -f
  • Browser: Grafana dashboard 21476 and the swarm-tasks dashboard
  • Browser: GCP Console → Compute → Instance Groups → swarm-n-workers → Monitoring

Test matrix

Each test below is independent and can be run in isolation if time is short. Recommended order is by criticality. Skip-okay tests are marked.

#  | Test                           | Critical? | Est. time
1  | Scale-up latency               | yes       | 15 min
2  | swarmctl rebalance correctness | yes       | 10 min (overlaps Test 1)
3  | Burst join (multi-worker)      | no        | 15 min
4  | Boot quiescence sanity         | yes       | 10 min
5  | Graceful scale-down + drain    | yes       | 20 min
6  | Drain timeout                  | no        | 15 min
7  | Leader failover mid-rebalance  | yes       | 15 min
8  | Excluded services protection   | no        | 10 min
9  | Dry-run mode                   | no        | 10 min
10 | No-rebalance-on-scale-down     | yes       | 5 min (overlaps Test 5)

Total critical path: ~75 min. Full matrix: ~2 hours.


Test 1 — Scale-up latency

Goal: Measure end-to-end time from the memory threshold being crossed to tasks being fully rebalanced. Confirms that 2–3 min provisioning is still the baseline.

Density note. Under the 2-replica convention, per-stack memory roughly doubles for templated services compared to the pre-2-replica baseline. The 70% memory threshold trips with fewer concurrent stacks per worker than historical measurements would suggest — expect this test to fire scale-up sooner / with less synthetic stress than older runbook iterations. That’s a feature (faster reaction to real load), not a regression.

Method:

  1. Note the current memory% of the cluster (Grafana).

  2. Force memory pressure. Two options:

    • Option A (cleaner, no app risk): Deploy a memory-stress service:

      sudo docker service create \
        --name autoscale-test-stress \
        --replicas 6 \
        --constraint 'node.role==worker' \
        --restart-condition none \
        polinux/stress \
        stress --vm 2 --vm-bytes 1G --timeout 1200s
    • Option B (closer to real load): Scale up consumer_portal_django for several tenants until cluster mem >70%. Risk: longer recovery if the test goes sideways.

  3. Watch the six phases unfold and timestamp each:

    Phase              | Watch for                                      | Where
    Threshold crossed  | Cluster mem ≥ 70% sustained ~60s               | Grafana memory panel
    MIG provisioning   | New row in instance list, status PROVISIONING  | GCP Console / gcloud compute instances list
    Instance running   | Status flips to RUNNING, IP assigned           | same as MIG provisioning
    Joined Swarm       | Ready Active in docker node ls                 | sudo docker node ls on a manager
    swarmctl detected  | Log line new nodes detected; stabilizing       | swarmctl logs
    Rebalance complete | Log line rebalance complete ... exit_reason=   | swarmctl logs
  4. Record each timestamp. Total elapsed = first to last.

Pass criteria:

  • Each phase logged, all six observable
  • MIG provision → instance running ≤ 3 min (regression check vs historical baseline)
  • Swarm join completes within 30s of instance running
  • swarmctl stabilization respects REBALANCE_STABILIZATION_DELAY (60s default)
  • Rebalance pass completes without Failed task states
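
The timing criteria above can be checked mechanically once the timestamps are written down. A minimal sketch, assuming same-day HH:MM:SS timestamps (it does not handle a run that crosses midnight); the helper names and the example timestamps are made up for illustration.

```shell
# to_seconds: convert HH:MM:SS to seconds since midnight (awk reads "08" as 8).
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# elapsed_seconds START END: seconds between two same-day timestamps.
elapsed_seconds() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# check_limit LABEL START END MAX_SECONDS: print PASS/FAIL for one phase gap.
check_limit() {
  local secs
  secs=$(elapsed_seconds "$2" "$3")
  if [ "$secs" -le "$4" ]; then
    echo "PASS $1: ${secs}s (limit ${4}s)"
  else
    echo "FAIL $1: ${secs}s (limit ${4}s)"
  fi
}

# Example with made-up timestamps: 3 min provisioning limit, 30s join limit.
check_limit "provision->running" 14:02:10 14:04:55 180
check_limit "running->joined"    14:04:55 14:05:40 30
```

With these example values the first gap is 165s and passes, while the second is 45s and fails the 30s join criterion.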

Cleanup: Remove the stress service: sudo docker service rm autoscale-test-stress. Memory drops; expect Test 5 (scale-down) to fire naturally if you don’t interrupt.


Test 2 — swarmctl rebalance correctness

Runs alongside Test 1; no separate setup.

Goal: Confirm the rebalance actually redistributes tasks onto the new worker, not just no-op force-updates.

Method:

  1. Before the rebalance kicks off (during Test 1’s stabilization delay), snapshot task counts per node:

    sudo docker node ls -q | while read n; do
      echo "$n: $(sudo docker node ps "$n" --filter desired-state=running -q | wc -l)"
    done > /tmp/pre-rebalance.txt
  2. Wait for rebalance to complete (rebalance complete log).

  3. Snapshot again into /tmp/post-rebalance.txt.

  4. Compute skew before/after:

    compute_skew() {
      awk -F': ' '{print $2}' "$1" | sort -n | awk 'NR==1{min=$1} {max=$1} END{print "min="min, "max="max, "skew="max-min}'
    }
    compute_skew /tmp/pre-rebalance.txt
    compute_skew /tmp/post-rebalance.txt

Pass criteria:

  • Post-rebalance skew is lower than pre-rebalance
  • New worker has > 0 tasks
  • No service ends with 0 running tasks (i.e., rebalance didn’t break a service)
  • For Phase 3: rebalance targeted only the hottest REBALANCE_MAX_HOT_NODES workers (check swarmctl logs for filterByHotNodes: filtered services)
  • Every templated web service still shows 2/2 after the pass (rebalance shouldn’t drop a replica). A 1/2 post-rebalance suggests a service update raced a node-removal — investigate before continuing.

Skew numbers under the 2-replica convention are roughly 2x the pre-2-replica baseline (each templated service contributes 2 tasks per stack, not 1). The skew ratio (max/min) is the comparable metric across baseline eras; absolute task counts are not.
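
Since compute_skew reports the difference (max minus min), the ratio needs a separate calculation. A minimal sketch, assuming the snapshot files keep the "node: count" line format used above; skew_ratio is a hypothetical helper name. The ratio is undefined when any node has 0 tasks (for example a freshly joined worker in the pre-rebalance snapshot), so the sketch reports that instead of dividing by zero.

```shell
# skew_ratio FILE: print max/min of the per-node task counts in a snapshot
# written as "node: count" lines. Hypothetical helper for illustration.
skew_ratio() {
  awk -F': ' '
    { v = $2 + 0 }                      # force a numeric value
    NR == 1 || v < min { min = v }
    NR == 1 || v > max { max = v }
    END {
      if (NR == 0)  { print "no data"; exit }
      if (min == 0) { print "undefined (a node has 0 tasks)"; exit }
      printf "%.2f\n", max / min
    }' "$1"
}

# Usage:
#   skew_ratio /tmp/pre-rebalance.txt
#   skew_ratio /tmp/post-rebalance.txt
```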


Test 3 — Burst join (multi-worker scale-up) (optional)

Goal: Confirm REBALANCE_STABILIZATION_DELAY resets correctly when multiple nodes join in quick succession; only a single rebalance pass should fire.

Method:

  1. Temporarily increase MIG max so the autoscaler can add 2 workers at once.
  2. Generate large memory pressure quickly (scale stress service to 12 replicas).
  3. Watch swarmctl logs for new nodes detected; stabilizing — expect this line to appear, then re-appear (with reset delay) when the second worker joins.
  4. After both join, swarmctl should fire one rebalance covering both.

Pass criteria:

  • Single rebalance pass for both joins
  • No back-to-back rebalances within REBALANCE_POLL_INTERVAL of each other
  • Log shows stabilization_delay_reset (or equivalent) when the second join is detected
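
Counting the relevant log lines makes the single-pass criterion easy to verify after the fact. A minimal sketch that matches only the two phrases quoted in this runbook; the exact swarmctl log format is an assumption, and summarize_burst_join is a hypothetical helper name.

```shell
# summarize_burst_join: read swarmctl log lines on stdin and count
# stabilization notices and completed rebalance passes.
summarize_burst_join() {
  awk '
    /new nodes detected; stabilizing/ { detections++ }
    /rebalance complete/              { passes++ }
    END { printf "detections=%d passes=%d\n", detections, passes }'
}

# Usage on the leader, covering the test window:
#   sudo docker service logs --since 20m swarmctl_swarmctl 2>&1 | summarize_burst_join
# For a two-worker burst, expect detections=2 passes=1.
```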

Test 4 — Boot quiescence sanity

Goal: Confirm that a swarmctl process restart (e.g. failover) does not trigger a spurious rebalance.

Method:

  1. On the leader, restart swarmctl:

    sudo docker service update --force swarmctl_swarmctl
  2. Wait for the new task to start.

  3. Watch logs for the REBALANCE_BOOT_QUIESCENCE (300s default) wait + the seeding behavior.

Pass criteria:

  • Log line: became leader; seeded known workers
  • No rebalance pass within the boot-quiescence window
  • After the window, swarmctl ticks normally without rebalancing (because baseline already matches current fleet)

Test 5 — Graceful scale-down + drain

Goal: Confirm a removed worker drains its tasks before the VM is killed, and that templated web services stay continuously available throughout the drain.

HA expectation under the 2-replica convention. With 2 replicas spread across 2 different nodes (validated in pre-flight step 3), draining one worker only removes one of two replicas per spread web service. The other replica continues serving on the surviving node. A 2-replica-with-spread web service should never drop to 0/2 during this test — it should transition 2/2 → 1/2 → 2/2 as swarm reschedules the drained task. Pre-2-replica runs accepted a brief 0/1 blip during drain; that’s no longer acceptable for these services.

Which services this applies to. The “never 0/2” invariant only holds for services that carry max_replicas_per_node (web apps, Caddy, Cloudflare Tunnel — see the replica-convention background). It does not apply to:

  • cloudsql-proxy — 2 replicas but no spread constraint, so both replicas can sit on the drained node; a 0/2 blip here is possible and not a failure.
  • single-replica services (Redis, Celery, Keycloak, ws_admin_django, sidecars) — these blip to 0/1 and reschedule, which is expected.

So pick the victim service from the spread tier (a *_django / *_flask app or *_caddy). Don’t measure availability against cloudsql-proxy or a single-replica service — they’ll “fail” the 0/2 criterion by design and pollute the result.

Method:

  1. After Test 1, stop the stress service: sudo docker service rm autoscale-test-stress.

  2. Cluster memory drops below the scale-down threshold. Wait through the autoscaler cooldown (typically 5–10 min).

  3. Pick a “victim” service hosted on the worker that’s about to drain (the worker that gets selected for removal — see GCP Console). It must be a 2-replica-with-spread service (a *_django / *_flask app or *_caddy) — not cloudsql-proxy or a single-replica service, per the HA note above. In a separate terminal, poll its replica count once per second so you have a continuous record:

    VICTIM=<one-of-the-spread-services-on-the-doomed-worker>  # e.g. hbcp_unum_django_app
    while true; do
      printf '%s %s\n' "$(date +%T)" "$(sudo docker service ls --filter "name=$VICTIM" --format '{{.Replicas}}')"
      sleep 1
    done | tee /tmp/drain-availability.log
  4. MIG will pick a worker to remove (usually the most recently added). Watch:

    • GCP Console: instance moves to STOPPING
    • On the leader: swarmctl logs should show the /drain POST arriving from the worker
  5. Capture timing: drain start → drain complete → instance terminated. Stop the availability poll.

Pass criteria:

  • Worker hits /drain before exiting (look for received drain request in swarmctl logs with the worker’s IP)
  • swarmctl resolves caller IP to the correct node
  • Node set to drain in docker node ls
  • All tasks migrate before the worker leaves (drain returns success within DRAIN_TIMEOUT, default 45s)
  • /tmp/drain-availability.log shows the victim service stayed ≥ 1 running replica throughout — i.e. transitions like 2/2 → 1/2 → 2/2 are good; a 0/2 line at any timestamp during the drain is a fail (means both replicas were on the drained node, which violates the placement-spread invariant the pre-flight verified). Only valid for the spread tier — see the HA note; cloudsql-proxy and single-replica services are exempt.
  • No spike in swarm_tasks_failed metric (or minimal — a few Failed states with Shutdown reason are acceptable for non-templated services like one-shots). Note: a one-shot migration task that shows Failed once then Complete on retry is a known, benign pattern — the migration can race cloudsql-proxy readiness on a freshly-joined worker and retry once before the proxy’s IAM/connection setup finishes. It’s not a drain or autoscale failure; don’t flag it.
  • Instance terminated cleanly (no zombie node entries in docker node ls)
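
The availability criterion can be evaluated directly from the poll log instead of by eye. A minimal sketch, assuming each log line is "HH:MM:SS running/desired" as written by the polling loop in the method; the helper names are hypothetical.

```shell
# min_running FILE: lowest running-replica count seen in the poll log.
# Each line looks like "HH:MM:SS 2/2", possibly with a trailing note such as
# "(max 1 per node)", so only the number before the slash is read.
min_running() {
  awk '
    NF >= 2 {
      split($2, r, "/"); v = r[1] + 0
      if (!seen || v < min) { min = v; seen = 1 }
    }
    END { if (seen) print min; else print "no data" }' "$1"
}

# drain_verdict FILE: PASS if the victim never dropped to 0 running replicas.
drain_verdict() {
  local low
  low=$(min_running "$1")
  case "$low" in
    "no data") echo "NO DATA" ;;
    0)         echo "FAIL (dropped to 0 running replicas)" ;;
    *)         echo "PASS (minimum running replicas: $low)" ;;
  esac
}

# Usage:
#   drain_verdict /tmp/drain-availability.log
```

Lines where the poll returned nothing (for example during a brief manager hiccup) are skipped rather than counted as 0, so review the raw log if the verdict is NO DATA or the log has gaps.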

Test 6 — Drain timeout (optional, harder to engineer)

Goal: Confirm graceful failure when a task can’t migrate in 45s.

Method:

  1. Deploy a service with a deliberately slow shutdown (e.g. a celery worker mid-task with stop_grace_period: 90s).
  2. Force the worker hosting it to scale down.
  3. Watch what happens when drain exceeds DRAIN_TIMEOUT.

Pass criteria:

  • swarmctl logs drain timeout exceeded
  • Worker still leaves cleanly (timeout is a graceful failure, not a hang)
  • Swarm reschedules the orphan task reactively
  • Brief task-failed blip is captured in metrics (will trip swarm_tasks_failed — acceptable if the test is announced)

Test 7 — Leader failover mid-rebalance

Goal: Confirm a leader change during an in-flight rebalance does not result in duplicate or runaway rebalances.

Method:

  1. On the current leader, trigger a manual rebalance:

    REBALANCE_TOKEN=$(gcloud secrets versions access latest \
      --secret=swarm-observability-rebalance-token --project=<staging-project> \
      --impersonate-service-account=observability-deploy@<staging-project>.iam.gserviceaccount.com)
    curl -sS -XPOST -H "Authorization: Bearer $REBALANCE_TOKEN" \
      http://localhost:9876/rebalance -w '\nHTTP %{http_code}\n'

    Expect 202.

  2. Within ~10 seconds, force a leader change by restarting Docker on the current leader:

    sudo systemctl restart docker
  3. New leader takes over (one of the other two managers).

Pass criteria:

  • New leader logs became leader; seeded known workers
  • New leader skips its first tick (per design — baseline matches current state)
  • No additional rebalance pass on the new leader within the next 10 min
  • Old leader’s in-flight pass either completed or was abandoned cleanly (no half-updated services)

Test 8 — Excluded services protection (optional)

Goal: Confirm REBALANCE_EXCLUDED_SERVICES prevents touching critical services.

Method:

  1. Update swarmctl env with:

    REBALANCE_EXCLUDED_SERVICES=observability_prometheus,observability_grafana
  2. Snapshot UpdatedAt for these services: sudo docker service inspect observability_prometheus -f '{{.UpdatedAt}}'.

  3. Trigger a manual /rebalance (per Test 7 step 1).

  4. Compare UpdatedAt after.

Pass criteria:

  • UpdatedAt unchanged for excluded services
  • Log line: [rebalance] skipping excluded service: observability_prometheus
  • Other services were updated as expected
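
Comparing UpdatedAt by hand gets tedious with more than a couple of services. A minimal sketch of a snapshot-and-compare pair; both function names and the /tmp file paths are hypothetical, and the same pair works for the UpdatedAt check in Test 9.

```shell
# snapshot_updated_at SERVICE...: print "service<TAB>UpdatedAt" per service.
# Runs docker, so call it on a manager.
snapshot_updated_at() {
  for svc in "$@"; do
    printf '%s\t%s\n' "$svc" "$(sudo docker service inspect "$svc" -f '{{.UpdatedAt}}')"
  done
}

# changed_services BEFORE AFTER: print services whose UpdatedAt differs
# between two snapshot files (no output means nothing was touched).
changed_services() {
  awk -F'\t' '
    NR == FNR { before[$1] = $2; next }
    ($1 in before) && before[$1] != $2 { print $1 }' "$1" "$2"
}

# Usage:
#   snapshot_updated_at observability_prometheus observability_grafana > /tmp/updated-before.txt
#   ... trigger /rebalance and wait for the pass to finish ...
#   snapshot_updated_at observability_prometheus observability_grafana > /tmp/updated-after.txt
#   changed_services /tmp/updated-before.txt /tmp/updated-after.txt   # expect no output
```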

Test 9 — Dry-run mode (optional)

Goal: Confirm REBALANCE_DRY_RUN=true makes no mutating calls.

Method:

  1. Update one manager to run swarmctl with REBALANCE_DRY_RUN=true. Easiest path: edit the service env and force-update; alternative is a sandbox manager.
  2. Trigger /rebalance.
  3. Inspect logs and UpdatedAt for several services.

Pass criteria:

  • Logs show [dry-run] would force-update <service> for N services
  • UpdatedAt unchanged for all services
  • No Swarm task churn observed

Test 10 — No rebalance on scale-down (regression guard)

Goal: Confirm worker removal does not trigger a rebalance. Swarm reschedules removed tasks on its own; swarmctl must not duplicate that work.

Method:

Runs as part of Test 5 — just watch the logs after the worker leaves.

Pass criteria:

  • Neither a “new nodes detected” nor a “rebalance” log entry appears in the 5 min following worker removal
  • (Optional: log line confirming the removed-node delta was observed and intentionally ignored)
  • All previously 1/2 templated web services return to 2/2 via swarm’s native reschedule (not via a swarmctl pass) within ~1 min of the worker leaving

Reporting

For each test executed, capture:

  • Test number + name
  • Timestamp range (start to last observed state change — used for correlating with Cloud Monitoring later)
  • swarmctl log excerpt showing the key state transitions (became leader, new nodes detected, rebalance complete, received drain request, etc.)
  • Grafana screenshot of the affected panels (memory%, task distribution, task-failed)
  • Pass / Fail / Skipped + reason

Compile into a Confluence test report. Paste the summary table (test#, pass/fail, notes) into the originating Jira ticket as a comment so future readers don’t have to dig through Confluence.

Cleanup

After all tests:

  • Remove any test-only services (autoscale-test-stress etc.)
  • Restore original MIG min/max if Test 3 modified them
  • Restore swarmctl env if Tests 8 or 9 modified it
  • Un-silence the swarm_tasks_failed alert
  • Confirm cluster is back to nominal: re-run the replica-convention check from pre-flight step 3; it should report nothing (one-shots settled at 0/1 are excluded by that check)

Known limitations

  • This runbook does not exercise the production autoscaler — only staging. Production gets the validated configuration but no synthetic load.
  • Tests assume the current swarmctl phase. If REBALANCE_PHASE changes, re-run Test 1 + Test 2 to validate the new convergence/targeting logic.
  • Tests assume the three-tier replica baseline described in Background (SERVICE_REPLICAS=2 + MAX_REPLICAS_PER_NODE=1 for the spread web tier; SERVICE_REPLICAS=2 only for cloudsql-proxy; 1 replica for everything else). If the convention changes — different replica count, the spread constraint added to cloudsql-proxy, or templating removed — revisit Test 5’s 0/2-is-fail criterion (it keys off max_replicas_per_node, not replica count alone) and the density expectations in Test 1 / Test 2.
  • cloudsql-proxy is 2 replicas without an anti-affinity constraint. If a future change adds max_replicas_per_node to it, it moves into the spread tier and becomes eligible for the Test 5 0/2 criterion — update the exemption list accordingly.
  • Tests 3 and 6 are deliberately stretch goals — both require precise timing or specific app-level shutdown behavior that’s hard to engineer reliably. Skip them if time runs out and file followups for the specific scenarios.
