Rolling Update of Swarm Worker Nodes
This runbook covers how to replace Docker Swarm worker VMs one at a time, either to apply a changed instance template (new boot image, machine type, disk, or startup script) or to recycle workers onto the same template.
Use this runbook when:
- A Terraform change renamed the worker instance template (any edit to machine_type, boot image, disk settings, or scripts/worker-startup.sh.tftpl changes the content-addressed worker_template_hash)
- You need to recycle specific workers without a template change (memory leak, degraded node, kernel weirdness)
Why nothing rolls on its own
The worker MIG update policy is intentionally
OPPORTUNISTIC (see
infra/hb-infra/modules/swarm_worker/main.tf and the module
README). A terraform apply that changes the template only
stages the new version on the MIG; no running instance is
touched. This avoids the thundering herd of a PROACTIVE
roll, where every cold replacement node pulls every image at once. The
operator replaces workers explicitly, one at a time.
Three module settings make each replacement safe:
- replacement_method = "SUBSTITUTE" with max_unavailable_fixed = 0: the MIG creates the new VM (new name) before deleting the old one, so capacity never dips below the autoscaler floor.
- swarm-worker-drain.service on every worker: at shutdown, the retiring VM calls a manager’s swarmctl /drain endpoint, which sets the node to drain and waits for its tasks to migrate (45s drain budget inside the 80s systemd stop cap) before the VM leaves the swarm.
- Auto-healing on tcp:9323 with a 600s initial delay: the substitute is given time to boot, install the Ops Agent, and join before health enforcement starts.
Note that new instances created by the autoscaler or auto-healing always use the staged (new) template. Once you apply a template change, any scale-out burst comes up on the new shape even before you roll anyone.
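The three settings above can be read back from the live MIG before a roll. The following is a minimal sketch, not part of the roll script: the MIG name, region, and project are copied from the commands in this runbook, fetch_update_policy and check_update_policy are hypothetical helper names, and the updatePolicy field paths assume the standard MIG API shape.

```shell
# Sketch only: helper names are hypothetical; MIG coordinates come from this runbook.
MIG=swarm-n-workers
REGION=us-central1
PROJECT=prj-bu1-n-hb-infra-5381

# Print one CSV line: type,replacementMethod,maxUnavailable.fixed
fetch_update_policy() {
  gcloud compute instance-groups managed describe "$MIG" \
    --region "$REGION" --project "$PROJECT" \
    --format="csv[no-heading](updatePolicy.type,updatePolicy.replacementMethod,updatePolicy.maxUnavailable.fixed)"
}

# Succeed only if the CSV line in $1 matches the policy this runbook relies on.
check_update_policy() {
  policy_type=${1%%,*}
  policy_rest=${1#*,}
  policy_method=${policy_rest%%,*}
  policy_max_unavailable=${policy_rest#*,}

  [ "$policy_type" = "OPPORTUNISTIC" ] || { echo "unexpected update type: $policy_type" >&2; return 1; }
  [ "$policy_method" = "SUBSTITUTE" ] || { echo "unexpected replacement method: $policy_method" >&2; return 1; }
  [ "$policy_max_unavailable" = "0" ] || { echo "unexpected maxUnavailable: $policy_max_unavailable" >&2; return 1; }
  echo "update policy OK"
}
```

Typical use would be `check_update_policy "$(fetch_update_policy)"`. A non-zero return is a reason to stop and compare the MIG against the module before rolling.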
How the roll works
The roll runs two things at once: a per-worker replacement cycle that physically swaps each VM, and a fleet-wide rebalancer pause that wraps the entire run so swarmctl does not react to all that churn until the roll is done.
The per-worker cycle
Workers are replaced strictly one at a time. For each worker, the script:
- Replaces it. The MIG brings up the substitute (drift roll: new VM, new name) or rebuilds it in place (--force: same name, fresh boot disk). Either way the retiring VM’s swarm-worker-drain.service fires at shutdown, draining the node and waiting (up to its 45s budget) for swarm to reschedule that worker’s tasks onto the remaining Active nodes before the VM leaves.
- Waits for healthy. It blocks until the MIG reports stable and every instance is HEALTHY, absorbing the 600s auto-heal initial delay on the fresh node.
- Verifies the swarm join. It SSHes to a manager and confirms the replacement is Ready Active, because MIG health only proves dockerd is up, not that the node actually joined the swarm. A failure here stops the roll immediately, before any further worker is touched.
One-at-a-time is deliberate: it caps concurrent image pulls and drain
load to a single node’s worth, and it keeps capacity flat (a
SUBSTITUTE surges the new node up first; a
--force recreate runs one node short for a few
minutes).
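The control flow of the per-worker cycle (replace, wait, verify, stop on the first failure) can be sketched as a plain loop. This is an illustration under stated assumptions, not the script's actual source: roll_workers and the three hooks are hypothetical names, and the hooks here are stand-ins that only print what a real implementation would do.

```shell
# Stand-in hooks (hypothetical): a real version would call gcloud and SSH to a manager.
replace_worker()    { echo "would replace $1"; }
wait_healthy()      { echo "would wait for MIG stable and every instance HEALTHY"; }
# In a drift roll the replacement has a new name, so a real hook would look up
# the new node rather than reuse $1.
verify_swarm_join() { echo "would verify replacement for $1 is Ready Active"; }

# Roll the given workers strictly one at a time; return non-zero on the
# first failure so no further worker is touched.
roll_workers() {
  for worker in "$@"; do
    echo "rolling $worker"
    replace_worker "$worker"    || { echo "replace failed for $worker; stopping" >&2; return 1; }
    wait_healthy                || { echo "fleet not healthy after $worker; stopping" >&2; return 1; }
    verify_swarm_join "$worker" || { echo "swarm join not verified for $worker; stopping" >&2; return 1; }
  done
  echo "roll complete"
}
```

Because each worker finishes all three steps before the loop advances, at most one node is ever pulling images or draining at a time.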
Why the rebalancer is paused
swarmctl runs a rebalancer that watches for new swarm nodes. When one
joins, it runs a stabilization window and then a rebalance pass that
force-updates services onto the newcomer. That is exactly
right for an organic node join, but a rolling update creates a
join for every worker: a 14-worker roll would set off
roughly 14 separate stabilization-and-rebalance passes, each shuffling
tasks across the whole fleet while the roll is still moving. The work
multiplies and fights the roll.
The script removes that multiplication by quiescing the rebalancer for the duration of the roll and doing the rebalance once, at the end:
- Pause every manager before the first replacement. Pause is process-local state on each swarmctl, carrying a TTL. Pausing all managers (not just the current leader) means a leadership failover mid-roll cannot land on an unpaused leader that would start replaying joins. This step is fail-closed: if any manager cannot be paused, the script aborts before touching a single worker.
- Roll the fleet while paused. The rebalancer does nothing, yet the fleet still fills in on its own. Each retiring worker’s drain (per-worker step 1) reschedules its tasks onto the Active nodes, and the already-recreated empty nodes are the least-loaded Active nodes, so swarm’s spread scheduler favors them. Capacity tracks the roll without any rebalance passes.
- Resume every manager. Resume does not replay the joins it slept through. It reseeds, taking the post-roll fleet as the new baseline, so the roll’s churn is absorbed rather than re-litigated as 14 new-node events.
- Fire one explicit rebalance on the leader. This single pass tops off any node the drain-spread did not fill and evens out residual skew. It is the existing manual /rebalance trigger, so no new rebalance logic runs.
Resume and the final rebalance are best-effort, not
fail-closed: once the workers are verified the fleet is already safe, so
a failed resume only warns. A server-side TTL (default 4h) auto-resumes
any manager the script could not reach, including after a
kill -9 or host loss that a bash trap can never catch. The
bash trap handles the catchable exits (normal completion, a
set -e failure, Ctrl-C); the TTL is the backstop for the
rest. Net effect: about one clean rebalance pass for the whole roll
instead of one per worker.
--no-pause skips this entire layer and accepts the old
behavior (a rebalance pass per replaced worker). The pause, resume, and
rebalance calls all go through swarmctl’s loopback-only HTTP endpoints,
reached over IAP SSH inside swarmctl’s network namespace; the endpoint
contract is in docs/services/swarmctl.md.
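The ordering described in this section (fail-closed pause, roll, best-effort resume via a trap, one final rebalance) can be sketched as follows. Everything named here is an assumption for illustration: the manager list, paused_roll, resume_all, and the four hooks are hypothetical stand-ins that only print, and the real swarmctl calls are documented in docs/services/swarmctl.md. The sketch also assumes that resuming a manager that was never paused is harmless.

```shell
# Hypothetical manager names and stand-in hooks; real calls go through swarmctl.
MANAGERS="swarm-mgr-1 swarm-mgr-2 swarm-mgr-3"
pause_manager()     { echo "pause $1 ttl=$2"; }
resume_manager()    { echo "resume $1"; }
roll_fleet()        { echo "roll"; }
trigger_rebalance() { echo "rebalance"; }

# Best-effort resume: a failure only warns, because the server-side TTL
# auto-resumes any manager that could not be reached.
resume_all() {
  for m in $MANAGERS; do
    resume_manager "$m" || echo "warning: could not resume $m; its TTL will auto-resume it" >&2
  done
}

paused_roll() {
  ttl=${1:-14400}   # 4h, the default TTL named in this runbook
  (
    # The EXIT trap fires on normal completion and on failure exits of this subshell.
    trap resume_all EXIT
    # Fail-closed: abort before any worker is touched if a pause fails.
    for m in $MANAGERS; do
      pause_manager "$m" "$ttl" || { echo "cannot pause $m; aborting" >&2; exit 1; }
    done
    roll_fleet || exit 1
  ) || return 1
  # Best-effort: the fleet is already verified, so a failed trigger only warns.
  trigger_rebalance || echo "warning: final rebalance trigger failed" >&2
}
```

A kill -9 or host loss skips the trap entirely, which is why the TTL passed to each pause remains the backstop.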
Prerequisites
- Member of ssh-n-env@thehelperbees.com (staging) or ssh-p-env@thehelperbees.com (production) for the manager-side verification (see SSH Into GCP VMs)
- Member of sg-hb-infra-development@thehelperbees.com if the roll starts with a Terraform change
- Fleet currently healthy: all instances HEALTHY in the MIG, all workers Ready Active in docker node ls
- A quiet-ish window. Each replacement takes roughly 10-15 minutes end to end (boot + 600s health-check initial delay), so a full 12-worker roll is a multi-hour activity by design.
Step 1: Land the template change (skip if recycling)
Edit the worker module call in
infra/hb-infra/business_unit_1/<env>/swarm.tf (or the
module itself), then run a targeted plan:
./zig/zig build plan -- hb-infra non-production
Expected plan: a new
google_compute_instance_template is created (the hash
suffix in its name changes), the MIG’s
version.instance_template is updated in place, and the old
template is destroyed after the swap
(create_before_destroy). No instance is created,
replaced, or destroyed. If the plan wants to touch instances,
stop and investigate.
Merge and let Cloud Build apply. The fleet now runs the old template while the MIG advertises the new one. Confirm the spread:
gcloud compute instance-groups managed list-instances swarm-n-workers \
--region us-central1 --project prj-bu1-n-hb-infra-5381 \
--format="table(name,zone.basename(),instanceStatus,version.instanceTemplate.basename(),instanceHealth[0].detailedHealthState)"
Step 2: Roll with the script
# Replace every worker not yet on the staged template, one at a time,
# prompting between nodes:
./zig/zig build scripts -- roll_swarm_workers non-production
# Recycle the whole fleet even when every worker is already on target
# (rebuild in place; see "Two replacement mechanisms" below):
./zig/zig build scripts -- roll_swarm_workers non-production --force
# Roll specific instances (drift-driven substitute for off-target workers):
./zig/zig build scripts -- roll_swarm_workers non-production --instances swarm-n-worker-5ttm
# Unattended (e.g. from a long tmux session):
./zig/zig build scripts -- roll_swarm_workers non-production --yes
The auto-built worklist (default and --force) is
shuffled each run, so the roll does not always start with the same node;
an explicit --instances list keeps the order you give it.
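As an illustration of how a drift worklist could be derived, the sketch below filters list-instances output down to workers whose template differs from the target. It is an assumption about the approach, not the script's actual logic: list_worker_templates and off_target_workers are hypothetical names, and the target template name is passed in by the caller.

```shell
MIG=swarm-n-workers
REGION=us-central1
PROJECT=prj-bu1-n-hb-infra-5381

# Print "name,template" for every instance in the MIG.
list_worker_templates() {
  gcloud compute instance-groups managed list-instances "$MIG" \
    --region "$REGION" --project "$PROJECT" \
    --format="csv[no-heading](name,version.instanceTemplate.basename())"
}

# Read "name,template" lines on stdin; print names not on the target template ($1).
off_target_workers() {
  awk -F, -v target="$1" 'NF && $2 != target { print $1 }'
}
```

For example, `list_worker_templates | off_target_workers <target-template-name>` yields the drift worklist in MIG order; piping that through `shuf` (GNU coreutils) would randomize the starting node.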
Per worker, the script issues the replacement, waits for the MIG to
stabilize, waits (bounded) for every instance to report
HEALTHY, then SSHes to a manager and confirms the
replacement is Ready Active in the swarm before moving on.
It stops immediately on the first worker that fails to verify.
--skip-swarm-check drops the manager-side step if you lack
SSH access, but read the caveat in Verification below before using
it.
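The manager-side check described above comes down to finding the replacement in docker node ls with status Ready and availability Active. A minimal sketch, assuming it runs on a manager and that node_ready_active is a hypothetical helper name:

```shell
# Read "hostname status availability" lines on stdin and succeed only if
# some entry for hostname $1 is Ready and Active. A stale Down entry with
# the same hostname does not cause a failure as long as a Ready Active one exists.
node_ready_active() {
  awk -v host="$1" '
    $1 == host && $2 == "Ready" && $3 == "Active" { ok = 1 }
    END { exit !ok }
  '
}
```

On a manager this could be fed with `sudo docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | node_ready_active <node-name>`; a non-zero exit is the signal to stop the roll.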
The rebalancer is paused around the roll
(pause -> roll -> resume -> rebalance), so the
fleet gets one rebalance pass at the end instead of one per replaced
worker; see How the roll works for the
mechanism and rationale. Operationally, three things to know: pausing is
fail-closed, so if any manager cannot be paused (you
lack IAP SSH or secretAccessor, or swarmctl predates the
pause endpoints) the script aborts before rolling anything;
--no-pause skips the pause and accepts a rebalance pass per
worker; and --pause-ttl-seconds N sets the safety-net TTL
(default 4h, capped by swarmctl at 6h) after which a stuck pause
auto-resumes on its own.
Two replacement mechanisms. The default (drift) roll
and the --force recycle reach for different gcloud verbs,
because they solve different problems:
- Drift roll (default, and --instances on an off-target worker) uses update-instances --minimal-action=replace. This is diff-gated: it acts only on instances whose template differs from the MIG’s target, and replaces them via SUBSTITUTE (a new VM with a new name comes up before the old one drains, so capacity never dips). It is a no-op on a worker that already matches the target.
- --force recycle uses recreate-instances, which is unconditional. Use it to rebuild workers that are already on target (e.g. to re-pull a startup-script change, clear bad node state, or force a fresh swarm join). recreate rebuilds the VM in place: the instance keeps its name and its id/creationTimestamp, but its boot disk is rebuilt from the current template, and the node is briefly down during the rebuild (one at a time, drain hook still runs at shutdown). It does not surge, so the fleet runs one short during each worker’s rebuild. Because the instance id is preserved, the script tracks the rebuild via the MIG’s per-instance currentAction (RECREATING → NONE), not by waiting for a new name or id. It also waits for the MIG to be stable before each recreate and re-issues if a recreate is dropped (a recreate fired during autoscaler activity is silently ignored).
Because update-instances is diff-gated,
--instances <name> will not recycle
a worker that is already on the target template; that call is a no-op.
To force-recycle on-target workers, use --force.
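Tracking an in-place rebuild by currentAction, as described for --force, can be sketched as a bounded poll. This is a hedged illustration rather than the script's code: current_action and wait_for_recreate are hypothetical names, the poll counts are arbitrary, and the filter and field names assume the usual list-instances output.

```shell
MIG=swarm-n-workers
REGION=us-central1
PROJECT=prj-bu1-n-hb-infra-5381

# Print the MIG's currentAction for one instance name.
current_action() {
  gcloud compute instance-groups managed list-instances "$MIG" \
    --region "$REGION" --project "$PROJECT" \
    --filter="name=$1" --format="value(currentAction)"
}

# Succeed once the instance has been seen RECREATING and then returns to NONE.
# Args: instance name, max polls (default 60), seconds between polls (default 15).
# A non-zero return means the recreate timed out or was never picked up.
wait_for_recreate() {
  name=$1
  max_polls=${2:-60}
  interval=${3:-15}
  seen_recreating=0
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    action=$(current_action "$name")
    case "$action" in
      RECREATING) seen_recreating=1 ;;
      NONE) if [ "$seen_recreating" -eq 1 ]; then return 0; fi ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

Requiring RECREATING to be observed before NONE is what separates a completed rebuild from a dropped one, where the instance simply stays at NONE and the recreate needs to be re-issued.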
Step 2 (alternative): Roll manually
The script is a convenience wrapper around two commands. Per worker:
# 1. Replace one instance. SUBSTITUTE creates the new VM first.
gcloud compute instance-groups managed update-instances swarm-n-workers \
--region us-central1 --project prj-bu1-n-hb-infra-5381 \
--instances swarm-n-worker-5ttm \
--minimal-action replace --most-disruptive-allowed-action replace
# 2. Wait for the MIG to finish creating/deleting.
gcloud compute instance-groups managed wait-until swarm-n-workers --stable \
--region us-central1 --project prj-bu1-n-hb-infra-5381 --timeout 1500
Then verify (next section) before touching the next worker.
Verification
MIG side. The list-instances command
from Step 1 should show the old name gone, a new name on the target
template, and every instance HEALTHY. Health can read
UNKNOWN/TIMEOUT for up to 10 minutes on a
fresh substitute (600s initial delay); that is normal, wait it out.
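Waiting out the initial delay by hand can be replaced with a bounded poll on the health column. The sketch below is an assumption-laden illustration: list_health, all_healthy, and wait_all_healthy are hypothetical names, and the default of 50 polls at 30s simply mirrors the 1500s timeout used in the manual roll.

```shell
MIG=swarm-n-workers
REGION=us-central1
PROJECT=prj-bu1-n-hb-infra-5381

# Print "name,detailedHealthState" per instance (state is empty if not yet reported).
list_health() {
  gcloud compute instance-groups managed list-instances "$MIG" \
    --region "$REGION" --project "$PROJECT" \
    --format="csv[no-heading](name,instanceHealth[0].detailedHealthState)"
}

# Read "name,state" lines on stdin; succeed only if there is at least one
# instance and every state is exactly HEALTHY (empty, UNKNOWN, TIMEOUT all fail).
all_healthy() {
  awk -F, '
    BEGIN { n = 0; bad = 0 }
    NF { n++; if ($2 != "HEALTHY") bad++ }
    END { exit !(n > 0 && bad == 0) }
  '
}

# Poll until every instance is HEALTHY. Args: max polls, seconds between polls.
wait_all_healthy() {
  max_polls=${1:-50}
  interval=${2:-30}
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    if list_health | all_healthy; then return 0; fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

A pass here covers only the MIG side of the check.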
Swarm side (do not skip). The MIG health check probes dockerd’s metrics port, so it proves dockerd is alive, not that the worker joined the swarm. A worker whose join failed sits in the MIG looking healthy while running zero tasks. From a manager:
gssh swarm-mgr-1 n-hb-infra
sudo docker node ls
The substitute must be Ready Active. If the retired node
lingers in the list as Down, remove the stale entry:
sudo docker node rm <old-node-name>
Rebalancing. Swarm does not move existing tasks onto
a fresh empty node. With the default (paused) roll, the script handles
this for you: it resumes the rebalancer and fires one explicit pass once
every worker is verified, so the fleet evens out without ~14 separate
passes. Watch for the resuming the rebalancer... and
rebalance queued (202) lines near the end of the run; a
409 there just means a pass was already running and will
converge. If you rolled with --no-pause, or the final
trigger warned, the rebalancer still converges on its own loop (see
docs/services/swarmctl.md), or force a specific service
immediately:
sudo docker service update --force <service-name>
Pitfalls
- Do not use the console’s “Rolling update” flow. It flips the update policy to PROACTIVE and rolls the whole fleet at once, exactly the thundering herd the module is designed to prevent. The console’s one-shot “Restart/replace VMs” is the UI equivalent of update-instances, but it also mutates the max_surge/max_unavailable fields (Terraform deliberately ignores those, so the change is invisible to plan).
- The autoscaler stays live during the roll. Scale-in is bounded (1 instance per 15-minute window) and scale-out creates instances on the new template, so interleaving is harmless but can be confusing: instance counts and names may shift while you work. For a fully deterministic roll, temporarily pin min = max in the autoscaler console page and restore afterwards (min/max are UI-managed; see the module README’s “UI-managed runtime fields”).
- One at a time means one at a time. Passing several instances to a single update-instances call replaces them concurrently, multiplying simultaneous image pulls and drain load. The script never does this.
- Startup script edits roll the fleet eventually regardless. Even unrolled, old-template workers are living on borrowed time: any auto-heal recreates them on the new template. Do not park the fleet half-rolled for weeks; finish the roll the same day or revert the template change.