
Rolling Update of Swarm Worker Nodes

This runbook covers how to replace Docker Swarm worker VMs one at a time, either to apply a changed instance template (new boot image, machine type, disk, or startup script) or to recycle workers onto the same template.

Use this runbook when:

A Terraform change renamed the worker instance template (any edit to machine_type, boot image, disk settings, or scripts/worker-startup.sh.tftpl changes the content-addressed worker_template_hash)
You need to recycle specific workers without a template change (memory leak, degraded node, kernel weirdness)

Why nothing rolls on its own

The worker MIG update policy is intentionally OPPORTUNISTIC (see infra/hb-infra/modules/swarm_worker/main.tf and the module README). A terraform apply that changes the template only stages the new version on the MIG; no running instance is touched. This avoids the thundering herd of a PROACTIVE roll, where every cold replacement node pulls every image at once. The operator replaces workers explicitly, one at a time.

Three module settings make each replacement safe:

replacement_method = "SUBSTITUTE" with max_unavailable_fixed = 0: the MIG creates the new VM (new name) before deleting the old one, so capacity never dips below the autoscaler floor.
swarm-worker-drain.service on every worker: at shutdown, the retiring VM calls a manager’s swarmctl /drain endpoint, which sets the node to drain and waits for its tasks to migrate (45s drain budget inside the 80s systemd stop cap) before the VM leaves the swarm.
Auto-healing on tcp:9323 with a 600s initial delay: the substitute is given time to boot, install the Ops Agent, and join before health enforcement starts.

Note that new instances created by the autoscaler or auto-healing always use the staged (new) template. Once you apply a template change, any scale-out burst comes up on the new shape even before you roll anyone.
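The three settings above can be spot-checked on the live MIG before a roll. This is a minimal sketch, assuming the MIG name, region, and project used elsewhere in this runbook, and assuming the module settings surface in the API as updatePolicy.type, updatePolicy.replacementMethod, and updatePolicy.maxUnavailable.fixed. check_update_policy is a hypothetical helper, not something that exists in the repo.

```shell
# Hypothetical helper: reads one tab-separated line
# "<type>\t<replacementMethod>\t<maxUnavailable.fixed>" on stdin and fails
# if the policy is not the shape this runbook assumes.
check_update_policy() {
    IFS="$(printf '\t')" read -r policy_type method max_unavailable || true
    if [ "$policy_type" = "OPPORTUNISTIC" ] && [ "$method" = "SUBSTITUTE" ] \
        && [ "$max_unavailable" = "0" ]; then
        echo "ok: policy is OPPORTUNISTIC/SUBSTITUTE with max_unavailable=0"
    else
        echo "unexpected policy: type=$policy_type method=$method max_unavailable=$max_unavailable" >&2
        return 1
    fi
}

# Usage against the live MIG (not run here):
#   gcloud compute instance-groups managed describe swarm-n-workers \
#       --region us-central1 --project prj-bu1-n-hb-infra-5381 \
#       --format="value(updatePolicy.type,updatePolicy.replacementMethod,updatePolicy.maxUnavailable.fixed)" \
#     | check_update_policy
```

A non-zero exit here is a reason to stop and read the module README before rolling anything.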

How the roll works

The roll runs two things at once: a per-worker replacement cycle that physically swaps each VM, and a fleet-wide rebalancer pause that wraps the entire run so swarmctl does not react to all that churn until the roll is done.

The per-worker cycle

Workers are replaced strictly one at a time. For each worker, the script:

1. Replaces it. The MIG brings up the substitute (drift roll: new VM, new name) or rebuilds it in place (--force: same name, fresh boot disk). Either way the retiring VM’s swarm-worker-drain.service fires at shutdown, draining the node and waiting (up to its 45s budget) for swarm to reschedule that worker’s tasks onto the remaining Active nodes before the VM leaves.
2. Waits for healthy. It blocks until the MIG reports stable and every instance is HEALTHY, absorbing the 600s auto-heal initial delay on the fresh node.
3. Verifies the swarm join. It SSHes to a manager and confirms the replacement is Ready Active, because MIG health only proves dockerd is up, not that the node actually joined the swarm. A failure here stops the roll immediately, before any further worker is touched.

One-at-a-time is deliberate: it caps concurrent image pulls and drain load to a single node’s worth, and it keeps capacity at or within one node of the pre-roll level (a SUBSTITUTE surges the new node up first; a --force recreate runs one node short for a few minutes).
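The control flow of the per-worker cycle can be sketched as a plain loop that stops at the first failure. The three step functions below are hypothetical placeholders (the script's real internals are not shown in this runbook); they only echo what they would do.

```shell
# Hypothetical stand-ins for the three per-worker steps.
# Replace the bodies to experiment.
replace_worker()    { echo "replace $1"; }
wait_healthy()      { echo "wait-healthy $1"; }
verify_swarm_join() { echo "verify $1"; }

# Roll the given workers strictly one at a time and stop at the first
# worker whose step fails, leaving the remaining workers untouched.
roll_workers() {
    for worker in "$@"; do
        replace_worker "$worker" || return 1
        wait_healthy "$worker" || return 1
        if ! verify_swarm_join "$worker"; then
            echo "stopping: $worker did not verify" >&2
            return 1
        fi
    done
}
```

Because the loop returns on the first failed verification, a bad join costs at most one worker's worth of capacity rather than propagating through the fleet.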

Why the rebalancer is paused

swarmctl runs a rebalancer that watches for new swarm nodes. When one joins, it runs a stabilization window and then a rebalance pass that force-updates services onto the newcomer. That is exactly right for an organic node join, but a rolling update creates a join for every worker: a 14-worker roll would set off roughly 14 separate stabilization-and-rebalance passes, each shuffling tasks across the whole fleet while the roll is still moving. The work multiplies and fights the roll.

The script removes that multiplication by quiescing the rebalancer for the duration of the roll and doing the rebalance once, at the end:

Pause every manager before the first replacement. Pause is process-local state on each swarmctl, carrying a TTL. Pausing all managers (not just the current leader) means a leadership failover mid-roll cannot land on an unpaused leader that would start replaying joins. This step is fail-closed: if any manager cannot be paused, the script aborts before touching a single worker.
Roll the fleet while paused. The rebalancer does nothing, yet the fleet still fills in on its own. Each retiring worker’s drain (per-worker step 1) reschedules its tasks onto the Active nodes, and the already-recreated empty nodes are the least-loaded Active nodes, so swarm’s spread scheduler favors them. Capacity tracks the roll without any rebalance passes.
Resume every manager. Resume does not replay the joins it slept through. It reseeds, taking the post-roll fleet as the new baseline, so the roll’s churn is absorbed rather than re-litigated as 14 new-node events.
Fire one explicit rebalance on the leader. This single pass tops off any node the drain-spread did not fill and evens out residual skew. It is the existing manual /rebalance trigger, so no new rebalance logic runs.

Resume and the final rebalance are best-effort, not fail-closed: once the workers are verified the fleet is already safe, so a failed resume only warns. A server-side TTL (default 4h) auto-resumes any manager the script could not reach, including after a kill -9 or host loss that a bash trap can never catch. The bash trap handles the catchable exits (normal completion, a set -e failure, Ctrl-C); the TTL is the backstop for the rest. Net effect: about one clean rebalance pass for the whole roll instead of one per worker.
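The fail-closed pause and best-effort resume described above can be sketched with a shell trap. This is an illustration under stated assumptions, not the script's actual code: pause_manager, resume_manager, trigger_rebalance, and roll_fleet are hypothetical stubs, and the manager names are placeholders. The real calls go through the endpoints documented in docs/services/swarmctl.md.

```shell
# Hypothetical stubs that only echo what they would do.
pause_manager()     { echo "pause $1"; }
resume_manager()    { echo "resume $1"; }
trigger_rebalance() { echo "rebalance"; }
roll_fleet()        { echo "roll"; }

MANAGERS="mgr-a mgr-b mgr-c"   # placeholder manager names

resume_all() {
    for m in $MANAGERS; do
        # Best-effort: warn and carry on; the server-side TTL is the backstop.
        resume_manager "$m" || echo "warn: could not resume $m" >&2
    done
}

# Runs in a subshell so the EXIT trap is scoped to this one roll.
paused_roll() (
    # Any exit from here on that a trap can catch resumes every manager.
    trap resume_all EXIT
    # Fail-closed: abort before touching any worker if a pause fails.
    for m in $MANAGERS; do
        pause_manager "$m" || { echo "abort: cannot pause $m" >&2; exit 1; }
    done
    roll_fleet || exit 1
    # Normal completion: resume explicitly, then fire the single rebalance.
    trap - EXIT
    resume_all
    trigger_rebalance || echo "warn: rebalance trigger failed" >&2
)
```

The trap is installed before the first pause so that a partial pause (one manager paused, the next unreachable) is undone on abort instead of waiting for the TTL.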

--no-pause skips this entire layer and accepts the old behavior (a rebalance pass per replaced worker). The pause, resume, and rebalance calls all go through swarmctl’s loopback-only HTTP endpoints, reached over IAP SSH inside swarmctl’s network namespace; the endpoint contract is in docs/services/swarmctl.md.

Prerequisites

Member of ssh-n-env@thehelperbees.com (staging) or ssh-p-env@thehelperbees.com (production) for the manager-side verification — see SSH Into GCP VMs
Member of sg-hb-infra-development@thehelperbees.com if the roll starts with a Terraform change
Fleet currently healthy: all instances HEALTHY in the MIG, all workers Ready Active in docker node ls
A quiet-ish window. Each replacement takes roughly 10-15 minutes end to end (boot + 600s health-check initial delay), so a full 14-worker roll runs roughly 2.5 to 3.5 hours by design.
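The time estimate scales linearly with the worker count, since workers are never replaced in parallel. A minimal sketch of the arithmetic, using the 10-15 minutes per replacement quoted above; estimate_roll_minutes is a hypothetical helper and the 10-worker fleet in the usage line is only an example.

```shell
# Estimate the wall-clock range for a one-at-a-time roll, using the
# runbook's rough figure of 10-15 minutes per replacement.
estimate_roll_minutes() {
    workers=$1
    echo "$((workers * 10))-$((workers * 15)) minutes"
}

estimate_roll_minutes 10   # prints "100-150 minutes"
```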

Step 1: Land the template change (skip if recycling)

Edit the worker module call in infra/hb-infra/business_unit_1/<env>/swarm.tf (or the module itself), then run a targeted plan:

./zig/zig build plan -- hb-infra non-production

Expected plan: a new google_compute_instance_template is created (the hash suffix in its name changes), the MIG’s version.instance_template is updated in place, and the old template is destroyed after the swap (create_before_destroy). No instance is created, replaced, or destroyed. If the plan wants to touch instances, stop and investigate.

Merge and let Cloud Build apply. The fleet now runs the old template while the MIG advertises the new one. Confirm the spread:

gcloud compute instance-groups managed list-instances swarm-n-workers \
    --region us-central1 --project prj-bu1-n-hb-infra-5381 \
    --format="table(name,zone.basename(),instanceStatus,version.instanceTemplate.basename(),instanceHealth[0].detailedHealthState)"
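To turn that table into a worklist, the same command can emit name and template as tab-separated values and a small filter can pick out the workers still on an old template. This is a sketch: off_target_workers is a hypothetical helper, and the target template name has to be supplied by the operator (it is left as a placeholder here).

```shell
# Hypothetical helper: given "<name>\t<template>" lines on stdin and the
# target template name as $1, print the workers still on another template.
off_target_workers() {
    awk -F '\t' -v target="$1" '$2 != target { print $1 }'
}

# Usage against the live MIG (not run here):
#   gcloud compute instance-groups managed list-instances swarm-n-workers \
#       --region us-central1 --project prj-bu1-n-hb-infra-5381 \
#       --format="value(name,version.instanceTemplate.basename())" \
#     | off_target_workers "<target-template-name>"
```

An empty result means every worker already matches the staged template and a default (drift) roll would have nothing to do.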

Step 2: Roll with the script

# Replace every worker not yet on the staged template, one at a time,
# prompting between nodes:
./zig/zig build scripts -- roll_swarm_workers non-production

# Recycle the whole fleet even when every worker is already on target
# (rebuild in place; see "Two replacement mechanisms" below):
./zig/zig build scripts -- roll_swarm_workers non-production --force

# Roll specific instances (drift-driven substitute for off-target workers):
./zig/zig build scripts -- roll_swarm_workers non-production --instances swarm-n-worker-5ttm

# Unattended (e.g. from a long tmux session):
./zig/zig build scripts -- roll_swarm_workers non-production --yes

The auto-built worklist (default and --force) is shuffled each run, so the roll does not always start with the same node; an explicit --instances list keeps the order you give it. Per worker, the script issues the replacement, waits for the MIG to stabilize, waits (bounded) for every instance to report HEALTHY, then SSHes to a manager and confirms the replacement is Ready Active in the swarm before moving on. It stops immediately on the first worker that fails to verify. --skip-swarm-check drops the manager-side step if you lack SSH access, but read the caveat in Verification below before using it.

The rebalancer is paused around the roll (pause -> roll -> resume -> rebalance), so the fleet gets one rebalance pass at the end instead of one per replaced worker; see How the roll works for the mechanism and rationale. Operationally, three things to know: pausing is fail-closed, so if any manager cannot be paused (you lack IAP SSH or secretAccessor, or swarmctl predates the pause endpoints) the script aborts before rolling anything; --no-pause skips the pause and accepts a rebalance pass per worker; and --pause-ttl-seconds N sets the safety-net TTL (default 4h, capped by swarmctl at 6h) after which a stuck pause auto-resumes on its own.

Two replacement mechanisms. The default (drift) roll and the --force recycle reach for different gcloud verbs, because they solve different problems:

Drift roll (default, and --instances on an off-target worker) uses update-instances --minimal-action=replace. This is diff-gated: it acts only on instances whose template differs from the MIG’s target, and replaces them via SUBSTITUTE (a new VM with a new name comes up before the old one drains, so capacity never dips). It is a no-op on a worker that already matches the target.
--force recycle uses recreate-instances, which is unconditional. Use it to rebuild workers that are already on target (e.g. to re-pull a startup-script change, clear bad node state, or force a fresh swarm join). recreate rebuilds the VM in place: the instance keeps its name and its id/creationTimestamp, but its boot disk is rebuilt from the current template, and the node is briefly down during the rebuild (one at a time, drain hook still runs at shutdown). It does not surge, so the fleet runs one short during each worker’s rebuild. Because the instance id is preserved, the script tracks the rebuild via the MIG’s per-instance currentAction (RECREATING → NONE), not by waiting for a new name or id. It also waits for the MIG to be stable before each recreate and re-issues if a recreate is dropped (a recreate fired during autoscaler activity is silently ignored).

Because update-instances is diff-gated, --instances <name> will not recycle a worker that is already on the target template; that call is a no-op. To force-recycle on-target workers, use --force.
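Tracking a --force rebuild through currentAction needs a little care: a poll that simply waits for NONE can return before the recreate has even started. A minimal sketch of a poll that requires seeing RECREATING first is below. get_current_action is a hypothetical stub; a real version might read the currentAction column from list-instances, but that wiring is an assumption and is not shown.

```shell
# Hypothetical stub: prints the instance's currentAction.
get_current_action() { echo "NONE"; }

# Wait until the instance has been seen RECREATING and has then returned
# to NONE. Returns 1 if the recreate never starts (for example because it
# was dropped) or never finishes within max_polls.
wait_for_recreate() {
    instance=$1
    max_polls=${2:-60}
    interval=${3:-15}
    seen_recreating=0
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        action=$(get_current_action "$instance")
        case "$action" in
            RECREATING) seen_recreating=1 ;;
            NONE) [ "$seen_recreating" -eq 1 ] && return 0 ;;
        esac
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}
```

A non-zero return is the signal to re-issue the recreate, which matches the dropped-recreate case described above.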

Step 2 (alternative): Roll manually

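The script’s per-worker replacement is a convenience wrapper around two commands. A manual roll does not pause the rebalancer, so expect a rebalance pass per replaced worker, as with --no-pause. Per worker: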

# 1. Replace one instance. SUBSTITUTE creates the new VM first.
gcloud compute instance-groups managed update-instances swarm-n-workers \
    --region us-central1 --project prj-bu1-n-hb-infra-5381 \
    --instances swarm-n-worker-5ttm \
    --minimal-action replace --most-disruptive-allowed-action replace

# 2. Wait for the MIG to finish creating/deleting.
gcloud compute instance-groups managed wait-until swarm-n-workers --stable \
    --region us-central1 --project prj-bu1-n-hb-infra-5381 --timeout 1500

Then verify (next section) before touching the next worker.

Verification

MIG side. The list-instances command from Step 1 should show the old name gone, a new name on the target template, and every instance HEALTHY. Health can read UNKNOWN/TIMEOUT for up to 10 minutes on a fresh substitute (600s initial delay); that is normal, wait it out.

Swarm side (do not skip). The MIG health check probes dockerd’s metrics port, so it proves dockerd is alive, not that the worker joined the swarm. A worker whose join failed sits in the MIG looking healthy while running zero tasks. From a manager:

gssh swarm-mgr-1 n-hb-infra
sudo docker node ls

The substitute must be Ready Active. If the retired node lingers in the list as Down, remove the stale entry:

sudo docker node rm <old-node-name>
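The swarm-side check can be made scriptable with docker node ls --format. The two filters below are hypothetical helpers that read "<hostname> <status> <availability>" lines on stdin; the node names in any test data are made up.

```shell
# Hypothetical helper: succeeds only if the named node is Ready and Active.
node_is_ready_active() {
    awk -v host="$1" '
        $1 == host && $2 == "Ready" && $3 == "Active" { found = 1 }
        END { exit (found ? 0 : 1) }'
}

# Hypothetical helper: print nodes that linger as Down (candidates for
# `docker node rm`).
down_nodes() {
    awk '$2 == "Down" { print $1 }'
}

# Usage on a manager (not run here):
#   sudo docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' \
#     | node_is_ready_active <new-node-name>
```

After a --force recreate the old and new entries can share a hostname, so a Down entry with the same name as a Ready Active one is expected until it is removed.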

Rebalancing. Swarm does not move existing tasks onto a fresh empty node. With the default (paused) roll, the script handles this for you: it resumes the rebalancer and fires one explicit pass once every worker is verified, so the fleet evens out without ~14 separate passes. Watch for the “resuming the rebalancer...” and “rebalance queued (202)” lines near the end of the run; a 409 there just means a pass was already running and will converge. If you rolled with --no-pause, or the final trigger warned, the rebalancer still converges on its own loop (see docs/services/swarmctl.md), or force a specific service immediately:

sudo docker service update --force <service-name>
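The status codes from the final rebalance trigger map onto the outcomes described above. A small sketch of that mapping follows; describe_rebalance_status is a hypothetical helper. Only 202 and 409 come from this runbook, and treating every other status as a warning is an assumption.

```shell
# Hypothetical helper: map the HTTP status of the final rebalance trigger
# to an operator-facing outcome.
describe_rebalance_status() {
    case "$1" in
        202) echo "queued: one rebalance pass will run" ;;
        409) echo "already running: the in-flight pass will converge" ;;
        *)
            # Assumption: anything else is treated as a warning only.
            echo "warning: unexpected status $1; rely on the rebalancer loop or force-update services"
            return 1
            ;;
    esac
}
```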

Pitfalls

Do not use the console’s “Rolling update” flow. It flips the update policy to PROACTIVE and rolls the whole fleet at once, exactly the thundering herd the module is designed to prevent. The console’s one-shot “Restart/replace VMs” is the UI equivalent of update-instances, but it also mutates the max_surge/max_unavailable fields (Terraform deliberately ignores those, so the change is invisible to plan).
The autoscaler stays live during the roll. Scale-in is bounded (1 instance per 15-minute window) and scale-out creates instances on the new template, so interleaving is harmless but can be confusing: instance counts and names may shift while you work. For a fully deterministic roll, temporarily pin min = max in the autoscaler console page and restore afterwards (min/max are UI-managed; see the module README’s “UI-managed runtime fields”).
One at a time means one at a time. Passing several instances to a single update-instances call replaces them concurrently, multiplying simultaneous image pulls and drain load. The script never does this.
Startup script edits roll the fleet eventually regardless. Even unrolled, old-template workers are living on borrowed time: any auto-heal recreates them on the new template. Do not park the fleet half-rolled for weeks; finish the roll the same day or revert the template change.
