
Rolling Boot Image Upgrade for Swarm Managers

This runbook covers how to roll a new boot image to the three Docker Swarm manager VMs (swarm-mgr-1, swarm-mgr-2, swarm-mgr-3) one at a time, without losing Raft quorum.

Use this runbook when:

  • A new version of the ubuntu-2204-docker-falcon boot image has been published (kernel patch, Docker upgrade, Falcon Sensor refresh, Ops Agent bump)
  • You need to roll a custom one-off image (e.g. CVE remediation)

The procedure relies on three module settings working together:

  • lifecycle { ignore_changes = [boot_disk] } on the instance (Design Decision #10): bumping boot_image_name is a no-op until you explicitly request a -replace.
  • auto_delete = false on the instance’s boot_disk block: the disk is managed as a standalone google_compute_disk resource so it survives instance recreation and keeps its snapshot history.
  • lifecycle { replace_triggered_by = [google_compute_disk.boot[each.key]] } on the instance: replacing the disk automatically replaces the instance that was attached to it, so the operator only targets the disk.

Combined with the startup script’s three-branch decision tree (Design Decision #19), this makes a manager replacement safe and idempotent.
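A condensed sketch of how the module wires these together (shapes abbreviated; everything except the three settings above is illustrative, not the module verbatim):

resource "google_compute_disk" "boot" {
  for_each = var.managers    # hypothetical input, e.g. { "swarm-mgr-1" = {...}, ... }
  name     = "${each.key}-boot"
  image    = "projects/${var.boot_image_project}/global/images/${var.boot_image_name}"
}

resource "google_compute_instance" "manager" {
  for_each            = var.managers
  name                = each.key
  deletion_protection = true   # Design Decision #10

  boot_disk {
    source      = google_compute_disk.boot[each.key].self_link
    auto_delete = false        # the disk outlives the instance
  }

  lifecycle {
    ignore_changes       = [boot_disk]                              # image bumps are inert
    replace_triggered_by = [google_compute_disk.boot[each.key]]     # disk replace cascades
  }
}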

Prerequisites

  • Member of ssh-n-env@thehelperbees.com (staging) or ssh-p-env@thehelperbees.com (production) — see SSH Into GCP VMs
  • Member of sg-hb-infra-development@thehelperbees.com (Terraform builds)
  • gssh tool installed (./zig/zig build gssh)
  • Quorum is currently healthy: docker node ls from any manager shows 3/3 Ready Active Reachable
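For example, from a non-production manager:

gssh swarm-mgr-1 n-hb-infra
docker node ls --format '{{.Hostname}}: {{.Status}} {{.Availability}} {{.ManagerStatus}}'
# Expect three lines, each Ready Active, with Reachable on two and Leader on one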

Detection: Is a New Image Available?

The boot image is built by the gcp_compute_image script and published to the BU1 infra pipeline project (prj-bu1-c-infra-pipeline-5327). Two ways to check for newer versions:

  1. List images by family in the pipeline project:

    gcloud compute images list \
        --project=prj-bu1-c-infra-pipeline-5327 \
        --filter='family:ubuntu-2204-docker-falcon' \
        --sort-by=~creationTimestamp
  2. Inspect the build script and recent runs:

    ./zig/zig build scripts -- gcp_compute_image --help

    See src/scripts/gcp_compute_image/README.md for the full image build flow.

Compare the latest image name against boot_image_name in infra/hb-infra/business_unit_1/<env>/swarm.tf. If they differ, an upgrade is available.
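The comparison can be scripted; a sketch for non-production:

# Newest non-deprecated image in the family
gcloud compute images describe-from-family ubuntu-2204-docker-falcon \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --format='value(name)'

# Value currently pinned in the caller
grep boot_image_name infra/hb-infra/business_unit_1/non-production/swarm.tf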

Rolling Upgrade Procedure

For each environment (non-production, production), perform the following steps. Always start with non-production and only proceed to production once non-prod has been observed stable for at least 24 hours.

Step 1: Bump boot_image_name in the Caller

Edit the swarm caller for the target environment:

  • Non-production: infra/hb-infra/business_unit_1/non-production/swarm.tf
  • Production: infra/hb-infra/business_unit_1/production/swarm.tf

Change the boot_image_name argument on the module "swarm_managers" block to the new image name:

module "swarm_managers" {
  source = "../../modules/swarm_manager"

  # ...
  boot_image_project = var.bu1_infra_pipeline_project_id
  boot_image_name    = "ubuntu-2204-docker-falcon-2026-04-01"
  # ...
}

Step 2: Run a Targeted Plan and Verify No Drift

Run a plan against the target environment:

./zig/zig build plan -- hb-infra non-production

Expected: the plan should report no changes to module.swarm_managers.google_compute_instance.manager and no changes to module.swarm_managers.google_compute_disk.boot. This is intentional. The module sets lifecycle { ignore_changes = [boot_disk] } on the instance (Design Decision #10) so that image updates never automatically replace running managers. The bumped variable only takes effect when you explicitly request a -replace against the boot disk.

If the plan does show pending changes to the manager instances or boot disks, stop. Investigate before continuing — something other than the image has drifted.
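For a machine-checkable guard, Terraform's -detailed-exitcode flag works with the project-local binary (exit 0 = no changes, 2 = pending changes, 1 = error):

./bin/terraform -chdir=infra/hb-infra/business_unit_1/non-production plan \
    -detailed-exitcode >/dev/null
echo "plan exit code: $?"   # anything other than 0 means stop and investigate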

Step 3: Replace swarm-mgr-1

Target the boot disk, not the instance. Because the instance has lifecycle { replace_triggered_by = [google_compute_disk.boot[each.key]] }, replacing the disk automatically cascades to an instance replacement. You get one -replace argument instead of two and the module guarantees they stay in lockstep.

The ./zig/zig build apply wrapper does not currently pass through -replace arguments to Terraform. For this one-off operation, fall back to the project-local Terraform binary (never use system terraform):

./bin/terraform -chdir=infra/hb-infra/business_unit_1/non-production apply \
    -replace='module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"]'

Review the plan Terraform prints. Exactly two resources should be replaced:

  1. module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"] (the resource you targeted)
  2. module.swarm_managers.google_compute_instance.manager["swarm-mgr-1"] (cascaded via replace_triggered_by)
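In the plan output this should read roughly as follows (attribute diffs elided):

# module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"] will be
# replaced, as requested
-/+ resource "google_compute_disk" "boot" { ... }

# module.swarm_managers.google_compute_instance.manager["swarm-mgr-1"] will be
# replaced due to changes in replace_triggered_by
-/+ resource "google_compute_instance" "manager" { ... }

Plan: 2 to add, 0 to change, 2 to destroy.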

If the plan shows the disk being replaced but not the instance, stop: the replace_triggered_by wiring is broken, and applying would leave the manager running against the old image. Do not apply; file a bug against the module.

Type yes to apply.

Note: deletion_protection = true is set on each manager (Design Decision #10), but Terraform handles deletion protection automatically when the destroy is part of a -replace. If Terraform refuses with a deletion-protection error, double-check you are running a -replace against the boot disk and not a full destroy.

Step 4: Wait for the Replaced Manager to Rejoin

The new VM takes roughly 3 minutes to boot, run its startup script, and rejoin the swarm. The startup script’s three-branch logic (Design Decision #19) handles the rejoin automatically: it sees the existing tokens in Secret Manager and immediately calls join_swarm() against the other two managers.
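As a rough paraphrase of that decision tree (branch conditions here are assumptions, not the script verbatim):

# Hypothetical sketch of the startup script's three branches (Design Decision #19)
if ! secret_manager_tokens_exist; then
    init_swarm      # branch 1: first-ever boot creates the swarm and publishes tokens
elif peer_manager_reachable; then
    join_swarm      # branch 2: replacement boot rejoins with the stored manager token
else
    exit 1          # branch 3: tokens exist but no reachable peers; fail loudly, do not split-brain
fi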

SSH from one of the other managers (not the one being replaced) and verify quorum:

gssh swarm-mgr-2 n-hb-infra
docker node ls

Expected output: all three managers listed, all Ready, all Active, all Reachable. The replaced manager will have a new node ID — that is normal.
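To poll instead of eyeballing, something like this from swarm-mgr-2 works (substitute the hostname of the manager you just replaced):

until docker node ls --format '{{.Hostname}} {{.Status}} {{.ManagerStatus}}' \
    | grep -E 'swarm-mgr-1\S* Ready (Reachable|Leader)'; do
    sleep 10
done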

If after 5 minutes the replaced manager is still missing or shows Down, see the Quorum-Loss Recovery runbook and the troubleshooting section below.

Step 5: Verify the Replaced Manager Directly

SSH to the newly replaced manager:

gssh swarm-mgr-1 n-hb-infra

Run the following verification commands:

# Confirm the new image is actually booted
uname -r
docker version

# Confirm Falcon Sensor is running
systemctl status falcon-sensor

# Confirm the Ops Agent is running (required for GMP and alert metrics)
systemctl status google-cloud-ops-agent

# Confirm the swarm node label is applied
docker node inspect self --format '{{ .Spec.Labels }}'

The kernel version (uname -r) and Docker version should match what is built into the new image. If either still matches the old image, the -replace did not actually swap the boot disk — re-check the Terraform plan output from Step 3.
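The swap can also be confirmed from outside the VM by inspecting the boot disk's source image (zone and disk name are placeholders; use the values from the module):

gcloud compute disks describe swarm-mgr-1 \
    --zone=<zone> \
    --format='value(sourceImage)'
# The URL should end in the new image name, e.g. .../ubuntu-2204-docker-falcon-2026-04-01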

Step 6: Verify GMP Metric Flow

Per Design Decision #20, Prometheus runs with tmpfs storage and remote_write to Google Managed Service for Prometheus. Because there is no on-host Prometheus disk, metric durability is not affected by manager replacement, but you should still confirm metrics from the replaced manager are landing in GMP.

In Cloud Monitoring (Metrics Explorer), query an Ops Agent metric filtered to the replaced manager:

metric.type = "agent.googleapis.com/cpu/utilization"
resource.label.instance_id = "<instance-id-of-replaced-manager>"

You should see fresh data points within 1-2 minutes of the boot completing. The Cloud Monitoring metric absence alert (alerts.tf, alert #6) will fire if metrics stop arriving for var.metric_absence_duration, so a missing series here is the leading indicator that the new image broke the Ops Agent.
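As a console-free alternative, GMP's Prometheus-compatible query endpoint can be hit directly (project ID is a placeholder, and the instance label pattern assumes the scrape config labels targets by hostname):

curl -s \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    'https://monitoring.googleapis.com/v1/projects/<project-id>/location/global/prometheus/api/v1/query' \
    --data-urlencode 'query=up{instance=~"swarm-mgr-1.*"}'
# A non-empty result vector means the replaced manager's Prometheus is writing to GMP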

Step 7: Repeat for swarm-mgr-2 and swarm-mgr-3

Repeat Steps 3 through 6 for swarm-mgr-2, then swarm-mgr-3. Between each manager:

  1. Wait for docker node ls to show 3/3 Ready Active Reachable
  2. Wait at least 2 minutes for the Raft cluster to stabilize and elect a leader if needed
  3. Confirm GMP metrics are flowing for the previously replaced manager before starting on the next

Never replace two managers simultaneously. Quorum on a 3-manager swarm requires 2/3 healthy. Replacing two at once kills the cluster and forces you into the Quorum-Loss Recovery runbook.

Step 8: Post-Roll Verification

After all three managers have been replaced:

  1. From any manager: docker node ls shows 3/3 Ready Active Reachable

  2. The swarm-rebalancer service (HB-8223) is healthy:

    docker service ps swarm-rebalancer --no-trunc
    docker service logs --tail 50 swarm-rebalancer
  3. GMP is receiving metrics from all three managers (Metrics Explorer query above, with no instance_id filter, should show three series)

  4. No new alerts firing in Cloud Monitoring or PagerDuty for the swarm managers

  5. Commit the boot_image_name change to the caller and submit a PR documenting the upgrade

Rollback Procedure

If the new image is broken on one of the managers (Falcon down, Ops Agent down, kernel panic, Docker daemon won’t start), roll that single manager back before continuing:

  1. Revert boot_image_name in the caller swarm.tf to the previous image

  2. Run the same disk-targeted -replace command from Step 3 against the affected manager:

    ./bin/terraform -chdir=infra/hb-infra/business_unit_1/non-production apply \
        -replace='module.swarm_managers.google_compute_disk.boot["swarm-mgr-1"]'
  3. Verify per Steps 4-6

  4. Open a Jira ticket against the image build pipeline

If multiple managers are already on the broken image and quorum is at risk, stop the rolling upgrade immediately and roll the affected managers back one at a time. Do not let the cluster drop below 2/3 healthy managers.

If quorum has already been lost, escalate to the Quorum-Loss Recovery runbook.

Why This Procedure Is Safe

Five module design decisions combine to make a rolling image upgrade low-risk:

  • lifecycle { ignore_changes = [boot_disk] } (Design Decision #10). Image drift never auto-rolls. Bumping boot_image_name is a no-op until you explicitly request a -replace on a specific manager. No accidental cluster-wide replacement.
  • lifecycle { replace_triggered_by = [google_compute_disk.boot[each.key]] } on the instance. One -replace against the boot disk propagates to the instance automatically. Without this the operator would have to target both resources by hand, and forgetting the disk would silently re-attach the old image.
  • deletion_protection = true (Design Decision #10). Cannot accidentally terraform destroy a manager. The protection is bypassed automatically by -replace, so the upgrade flow still works.
  • Three-branch startup script (Design Decision #19). The replaced manager reads existing join tokens from Secret Manager and immediately calls join_swarm(). No first-boot race, no manual token rotation, no manual docker swarm join needed.
  • Hourly boot disk snapshots, 10-day retention (Design Decision #15). If the new image is broken in a way that prevents the manager from booting at all, you can recover Raft state from the most recent snapshot via the Quorum-Loss Recovery runbook, bounding worst-case Raft data loss to ~1 hour.
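For reference, an hourly/10-day snapshot policy looks approximately like this in the google provider (resource and variable names are illustrative; the module attaches the policy to each boot disk separately):

resource "google_compute_resource_policy" "boot_snapshots" {
  name   = "swarm-mgr-boot-hourly"
  region = var.region

  snapshot_schedule_policy {
    schedule {
      hourly_schedule {
        hours_in_cycle = 1        # one snapshot per hour
        start_time     = "00:00"  # UTC
      }
    }
    retention_policy {
      max_retention_days = 10     # Design Decision #15
    }
  }
}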

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Plan in Step 2 shows manager instance changes | Drift outside the boot image | Investigate before proceeding; do not -replace |
| Step 3 plan shows only the disk replacing, not the instance | replace_triggered_by on the instance is missing or misconfigured | Stop; file a bug. Do NOT apply: the new disk would be attached to a new instance booting from the old image |
| Replaced manager stuck Down after 5 min | Startup script failed to read tokens from Secret Manager | SSH to the new manager, check journalctl -u google-startup-scripts.service |
| Replaced manager rejoins as a worker | manager_join_token was rotated but the Secret Manager value is stale | See Quorum-Loss Recovery Critical Gotchas |
| Falcon Sensor not running on new image | Image build broke Falcon installation | Roll back the affected manager, file a bug against gcp_compute_image |
| Ops Agent not running | Image build broke google-cloud-ops-agent package | Same as above |
| Metric absence alert fires during the roll | Expected briefly while the replaced manager boots | Should auto-clear within 5 min; if not, investigate Ops Agent on that manager |