Rolling Boot Image Upgrade for Swarm Managers
This runbook covers how to roll a new boot image to the three Docker
Swarm manager VMs (swarm-mgr-1, swarm-mgr-2,
swarm-mgr-3) one at a time, without losing Raft quorum.
Use this runbook when:
- A new version of the ubuntu-2204-docker-falcon boot image has been published (kernel patch, Docker upgrade, Falcon Sensor refresh, Ops Agent bump)
- You need to roll a custom one-off image (e.g. CVE remediation)
The procedure relies on three module settings working together:
- lifecycle { ignore_changes = [boot_disk] } on the instance (Design Decision #10): bumping boot_image_name is a no-op until you explicitly request a -replace.
- auto_delete = false on the instance's boot_disk block: the disk is managed as a standalone google_compute_disk resource so it survives instance recreation and keeps its snapshot history.
- lifecycle { replace_triggered_by = [google_compute_disk.boot[each.key]] } on the instance: replacing the disk automatically replaces the instance that was attached to it, so the operator only targets the disk.
Combined with the startup script’s three-branch decision tree (Design Decision #19), this makes a manager replacement safe and idempotent.
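The same wiring can be spot-checked on a live manager from gcloud; a minimal example, with the project and zone left as placeholders:

# Confirm the boot disk is attached with auto-delete disabled and points
# at the standalone disk resource (fill in the environment project and zone)
gcloud compute instances describe swarm-mgr-1 \
  --project=<env-project> --zone=<zone> \
  --format='value(disks[0].autoDelete, disks[0].source)'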
Prerequisites
- Member of ssh-n-env@thehelperbees.com (staging) or ssh-p-env@thehelperbees.com (production) — see SSH Into GCP VMs
- Member of sg-hb-infra-development@thehelperbees.com (Terraform builds)
- gssh tool installed (./zig/zig build gssh)
- Quorum is currently healthy: docker node ls from any manager shows 3/3 Ready Active Reachable
Detection: Is a New Image Available?
The boot image is built by the gcp_compute_image script
and published to the BU1 infra pipeline project
(prj-bu1-c-infra-pipeline-5327). Two ways to check for
newer versions:
List images by family in the pipeline project:
gcloud compute images list \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --filter='family:ubuntu-2204-docker-falcon' \
  --sort-by=~creationTimestamp

Inspect the build script and recent runs:

./zig/zig build scripts -- gcp_compute_image --help

See src/scripts/gcp_compute_image/README.md for the full image build flow.
Compare the latest image name against boot_image_name in
infra/hb-infra/business_unit_1/<env>/swarm.tf. If
they differ, an upgrade is available.
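A quick shell version of that comparison (standard gcloud flags; non-production caller shown):

# Newest published image in the family
gcloud compute images list \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --filter='family:ubuntu-2204-docker-falcon' \
  --sort-by=~creationTimestamp --limit=1 --format='value(name)'
# Image currently pinned in the caller
grep 'boot_image_name' infra/hb-infra/business_unit_1/non-production/swarm.tf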
Rolling Upgrade Procedure
For each environment (non-production,
production), perform the following steps. Always start with
non-production and only proceed to production
once non-prod has been observed stable for at least 24 hours.
Step 1: Bump boot_image_name in the Caller
Edit the swarm caller for the target environment:
- Non-production: infra/hb-infra/business_unit_1/non-production/swarm.tf
- Production: infra/hb-infra/business_unit_1/production/swarm.tf
Change the boot_image_name argument on the
module "swarm_managers" block to the new image name:
module "swarm_managers" {
source = "../../modules/swarm_manager"
# ...
boot_image_project = var.bu1_infra_pipeline_project_id
boot_image_name = "ubuntu-2204-docker-falcon-2026-04-01"
# ...
}
Step 2: Run a Targeted Plan and Verify No Drift
Run a plan against the target environment:
./zig/zig build plan -- hb-infra non-production

Expected: the plan should report no changes to
module.swarm_managers.google_compute_instance.manager and
no changes to
module.swarm_managers.google_compute_disk.boot. This is
intentional. The module sets
lifecycle { ignore_changes = [boot_disk] } on the instance
(Design Decision #10) so that image updates never automatically replace
running managers. The bumped variable only takes effect when you
explicitly request a -replace against the boot disk.
If the plan does show pending changes to the manager instances or boot disks, stop. Investigate before continuing — something other than the image has drifted.
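If you prefer a mechanical check over eyeballing the full plan, something like the following works (assuming the build wrapper streams standard Terraform plan output):

./zig/zig build plan -- hb-infra non-production 2>&1 | tee /tmp/swarm-plan.txt
# Any match below means the swarm resources have drifted; stop and investigate
grep -E 'swarm_managers\.google_compute_(instance\.manager|disk\.boot)' /tmp/swarm-plan.txt \
  || echo "no pending swarm manager changes"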
Step 3: Replace swarm-mgr-1
Target the boot disk, not the instance. Because the
instance has
lifecycle { replace_triggered_by = [google_compute_disk.boot[each.key]] },
replacing the disk automatically cascades to an instance replacement.
You get one -replace argument instead of two and the module
guarantees they stay in lockstep.
The ./zig/zig build apply wrapper does not currently
pass through -replace arguments to Terraform. For this
one-off operation, fall back to the project-local Terraform binary
(never use system terraform):
./bin/terraform -chdir=infra/hb-infra/business_unit_1/non-production apply \
  -replace='module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"]'

Review the plan Terraform prints. Exactly two resources should be replaced:

- module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"] (the resource you targeted)
- module.swarm_managers.google_compute_instance.manager["swarm-mgr-1"] (cascaded via replace_triggered_by)
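For reference, the plan header lines for those two resources should read roughly as follows (Terraform 1.2+ wording, abridged):

# module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"] will be replaced, as requested
# module.swarm_managers.google_compute_instance.manager["swarm-mgr-1"] will be replaced due to changes in replace_triggered_by
Plan: 2 to add, 0 to change, 2 to destroy.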
If the plan only shows the disk being replaced and the instance
not being replaced, stop — the
replace_triggered_by wiring is broken and the new instance
will re-attach to the old image. File a bug against the module.
Type yes to apply.
Note: deletion_protection = true is set on each manager (Design Decision #10), but Terraform handles deletion protection automatically when the destroy is part of a -replace. If Terraform refuses with a deletion-protection error, double-check you are running a -replace against the boot disk and not a full destroy.
Step 4: Wait for the Replaced Manager to Rejoin
The new VM takes roughly 3 minutes to boot, run its startup script,
and rejoin the swarm. The startup script’s three-branch logic (Design
Decision #19) handles the rejoin automatically: it sees the existing
tokens in Secret Manager and immediately calls join_swarm()
against the other two managers.
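The rejoin branch lives inside the image's startup script; the bash sketch below is purely illustrative of the three-branch shape (the secret name and join_swarm() internals are assumptions, not the real script):

# Illustrative sketch only; not the real startup script
state=$(docker info --format '{{ .Swarm.LocalNodeState }}')
if [ "$state" = "active" ]; then
  : # Branch 1: already a swarm member (normal reboot), nothing to do
elif gcloud secrets versions access latest \
    --secret='swarm-manager-join-token' >/dev/null 2>&1; then
  join_swarm   # Branch 2: tokens already in Secret Manager, rejoin the cluster
else
  docker swarm init   # Branch 3: true first boot, bootstrap a new swarm
fi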
SSH from one of the other managers (not the one being replaced) and verify quorum:
gssh swarm-mgr-2 n-hb-infra
docker node ls

Expected output: all three managers listed, all Ready, all Active, all Reachable.
The replaced manager will have a new node ID — that is normal.
If after 5 minutes the replaced manager is still missing or shows
Down, see the Quorum-Loss Recovery
runbook and the troubleshooting section below.
Step 5: Verify the Replaced Manager Directly
SSH to the newly replaced manager:
gssh swarm-mgr-1 n-hb-infra

Run the following verification commands:
# Confirm the new image is actually booted
uname -r
docker version
# Confirm Falcon Sensor is running
systemctl status falcon-sensor
# Confirm the Ops Agent is running (required for GMP and alert metrics)
systemctl status google-cloud-ops-agent
# Confirm the swarm node label is applied
docker node inspect self --format '{{ .Spec.Labels }}'

The kernel version (uname -r) and Docker version should
match what is built into the new image. If either still matches the old
image, the -replace did not actually swap the boot disk —
re-check the Terraform plan output from Step 3.
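As an extra check, the GCE metadata server reports the image the boot disk was created from; run this on the replaced manager itself:

curl -s -H 'Metadata-Flavor: Google' \
  'http://metadata.google.internal/computeMetadata/v1/instance/image'
# Should end in the new image name, e.g. .../ubuntu-2204-docker-falcon-2026-04-01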
Step 6: Verify GMP Metric Flow
Per Design Decision #20, Prometheus runs with tmpfs storage and
remote_write to Google Managed Service for Prometheus.
Because there is no on-host Prometheus disk, metric durability is not
affected by manager replacement, but you should still confirm metrics
from the replaced manager are landing in GMP.
In Cloud Monitoring (Metrics Explorer), query an Ops Agent metric filtered to the replaced manager:
metric.type = "agent.googleapis.com/cpu/utilization"
resource.label.instance_id = "<instance-id-of-replaced-manager>"
You should see fresh data points within 1-2 minutes of the boot
completing. The Cloud Monitoring metric absence alert
(alerts.tf, alert #6) will fire if metrics stop arriving
for var.metric_absence_duration, so a missing series here
is the leading indicator that the new image broke the Ops Agent.
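To fill in the instance_id filter, look up the replaced manager's numeric ID (project and zone placeholders are assumptions):

gcloud compute instances describe swarm-mgr-1 \
  --project=<env-project> --zone=<zone> --format='value(id)'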
Step 7: Repeat for swarm-mgr-2 and swarm-mgr-3
Repeat Steps 3 through 6 for swarm-mgr-2, then
swarm-mgr-3. Between each manager:
- Wait for docker node ls to show 3/3 Ready Active Reachable
- Wait at least 2 minutes for the Raft cluster to stabilize and elect a leader if needed
- Confirm GMP metrics are flowing for the previously replaced manager before starting on the next
Never replace two managers simultaneously. Quorum on a 3-manager swarm requires 2/3 healthy. Replacing two at once kills the cluster and forces you into the Quorum-Loss Recovery runbook.
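A compact way to run the quorum check between replacements, using standard docker node ls format placeholders:

docker node ls --format 'table {{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}'
# All three rows should read Ready / Active / Reachable (or Leader)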
Step 8: Post-Roll Verification
After all three managers have been replaced:
- From any manager, docker node ls shows 3/3 Ready Active Reachable
- The swarm-rebalancer service (HB-8223) is healthy:
  docker service ps swarm-rebalancer --no-trunc
  docker service logs --tail 50 swarm-rebalancer
- GMP is receiving metrics from all three managers (the Metrics Explorer query above, with no instance_id filter, should show three series)
- No new alerts firing in Cloud Monitoring or PagerDuty for the swarm managers
- Commit the boot_image_name change to the caller and submit a PR documenting the upgrade
Rollback Procedure
If the new image is broken on one of the managers (Falcon down, Ops Agent down, kernel panic, Docker daemon won’t start), roll that single manager back before continuing:
1. Revert boot_image_name in the caller swarm.tf to the previous image
2. Run the same disk-targeted -replace command from Step 3 against the affected manager:
   ./bin/terraform -chdir=infra/hb-infra/business_unit_1/non-production apply \
     -replace='module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"]'
3. Verify per Steps 4-6
4. Open a Jira ticket against the image build pipeline
If multiple managers are already on the broken image and quorum is at risk, stop the rolling upgrade immediately and roll the affected managers back one at a time. Do not let the cluster drop below 2/3 healthy managers.
If quorum has already been lost, escalate to the Quorum-Loss Recovery runbook.
Why This Procedure Is Safe
Three module design decisions combine to make a rolling image upgrade low-risk:
- lifecycle { ignore_changes = [boot_disk] } (Design Decision #10). Image drift never auto-rolls. Bumping boot_image_name is a no-op until you explicitly request a -replace on a specific manager. No accidental cluster-wide replacement.
- lifecycle { replace_triggered_by = [boot disk] } on the instance. One -replace against the boot disk propagates to the instance automatically. Without this the operator would have to target both resources by hand, and forgetting the disk would silently re-attach the old image.
- deletion_protection = true (Design Decision #10). Cannot accidentally terraform destroy a manager. The protection is bypassed automatically by -replace, so the upgrade flow still works.
- Three-branch startup script (Design Decision #19). The replaced manager reads existing join tokens from Secret Manager and immediately calls join_swarm(). No first-boot race, no manual token rotation, no manual docker swarm join needed.
- Hourly boot disk snapshots, 10-day retention (Design Decision #15). If the new image is broken in a way that prevents the manager from booting at all, you can recover Raft state from the most recent snapshot via the Quorum-Loss Recovery runbook, bounding worst-case Raft data loss to ~1 hour.
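To confirm snapshot coverage before starting a roll, list the most recent snapshots (the sourceDisk filter is an assumption about your disk naming):

gcloud compute snapshots list \
  --project=<env-project> \
  --filter='sourceDisk ~ swarm-mgr' \
  --sort-by=~creationTimestamp --limit=6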
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| Plan in Step 2 shows manager instance changes | Drift outside the boot image | Investigate before proceeding; do not -replace |
| Step 3 plan shows only the disk replacing, not the instance | replace_triggered_by on the instance is missing or misconfigured | Stop; file a bug. Do NOT apply — the new disk would be attached to a new instance booting from the old image |
| Replaced manager stuck Down after 5 min | Startup script failed to read tokens from Secret Manager | SSH to the new manager, check journalctl -u google-startup-scripts.service |
| Replaced manager rejoins as a worker | manager_join_token was rotated but the Secret Manager value is stale | See Quorum-Loss Recovery Critical Gotchas |
| Falcon Sensor not running on new image | Image build broke Falcon installation | Roll back the affected manager, file a bug against gcp_compute_image |
| Ops Agent not running | Image build broke the google-cloud-ops-agent package | Same as above |
| Metric absence alert fires during the roll | Expected briefly while the replaced manager boots | Should auto-clear within 5 min; if not, investigate the Ops Agent on that manager |