Quorum-Loss Recovery for Swarm Managers
This runbook covers how to recover the Docker Swarm cluster after
losing Raft quorum on the manager nodes (swarm-mgr-1,
swarm-mgr-2, swarm-mgr-3).
This is the highest-stakes operational scenario for the swarm. While quorum is lost:
- No new services can be scheduled
- No manager elections can happen
- No `docker swarm`, `docker service`, or `docker secret` operations succeed
- Existing running containers continue to run, but cannot be rescheduled if they fail
The recovery flow is the only swarm manager scenario without an automatic recovery path (Design Decision #17). It is the deliberate trade-off for choosing a 3-manager swarm over an external etcd/Consul cluster.
Prerequisites
- Member of `ssh-n-env@thehelperbees.com` (staging) or `ssh-p-env@thehelperbees.com` (production)
- Member of `sg-hb-infra-development@thehelperbees.com` (Terraform builds)
- `gcloud` authenticated as a user with `secretmanager.secretVersionAdder` on the env secrets project (the swarm_manager service account already has `secretmanager.secretVersionManager` per Design Decision #17, so the recovery flow does not require breaking-glass elevation if you can SSH and run `gcloud` as that SA)
- `gssh` tool installed (`./zig/zig build gssh`)
Detection: Has Quorum Actually Been Lost?
Before invoking this runbook, confirm the swarm has actually lost quorum, not a transient network blip or a single-manager failure.
Confirm quorum loss:
SSH to any manager and run:
```
docker node ls
```

If the cluster has lost a leader, the command returns:

```
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader
```

Check Cloud Monitoring. The manager metric absence alert (see `alerts.tf`, alert #6) will be firing on 2 or 3 of the manager instances. The TCP 2377 uptime check (production only) will also be failing.

SSH to each manager and check the local Raft state:

```
docker info 2>&1 | grep -A1 'Swarm:'
```

The `LocalNodeState` field will read `pending` or `error` on managers that cannot reach quorum.
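If `gssh` accepts a trailing remote command the way plain `ssh` does (an assumption; confirm against the tool's help before relying on it), you can sweep all three managers from your workstation in one pass. A sketch for staging:

```
for m in swarm-mgr-1 swarm-mgr-2 swarm-mgr-3; do
  echo "== ${m} =="
  # Sketch only: assumes gssh forwards the quoted string as a remote command.
  # If it does not, SSH to each manager and run the docker info check there.
  gssh "${m}" n-hb-infra \
    "sudo docker info --format 'state={{.Swarm.LocalNodeState}} error={{.Swarm.Error}}'" \
    || echo "${m}: unreachable"
done
```

Substitute `p-hb-infra` for production.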
Distinguish from other scenarios — DO NOT invoke this runbook if:
- Transient network blip. The error above clears within 60-120 seconds without intervention. Wait two minutes, then retry `docker node ls`. If it works again, you do not have quorum loss; investigate the network layer instead.
- Single manager failure (2/3 still healthy). `docker node ls` succeeds from the surviving managers and shows one node as `Down`. Quorum is intact (2/3). No recovery is needed; let Terraform recreate the dead manager via standard plan/apply, and its startup script will rejoin the cluster automatically (Design Decision #19).
If you have confirmed at least 2 of 3 managers are unreachable AND the surviving manager(s) cannot elect a leader, proceed.
Scenario A: One Survivor (2 Managers Down)
Use this scenario when exactly one manager VM is still reachable with intact local Raft state.
CRITICAL: If you have two survivors, you must pick exactly one to be the recovery node. Compare `docker info` output on both and choose the one whose `Managers` and `Nodes` counts reflect the most recent cluster state. Then treat the other as "down" for the rest of this procedure. Never run `docker swarm init --force-new-cluster` on more than one node; see Critical Gotchas below.
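A quick way to compare the candidates is docker's built-in format template; both fields are part of standard `docker info` output. Run this on each surviving manager:

```
# Pick the node whose Managers/Nodes counts reflect the most recent cluster
# state as the recovery node.
sudo docker info --format \
  'state={{.Swarm.LocalNodeState}} managers={{.Swarm.Managers}} nodes={{.Swarm.Nodes}}'
```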
Step A.1: SSH to the Surviving Manager
```
gssh swarm-mgr-1 n-hb-infra
```

(Substitute swarm-mgr-2 / swarm-mgr-3 and `p-hb-infra` for production.)
Step A.2: Force a New Cluster
```
sudo docker swarm init --force-new-cluster
```

This re-creates a single-node Raft cluster from the survivor's existing Raft state. The dead managers are retained as nodes in the swarm but marked as `Down` and `Unreachable`. Existing services, networks, secrets, and configs are preserved.

WARNING: This is a one-way operation. Once you run `--force-new-cluster`, you cannot undo it. Confirm Step A.1 was the right node before continuing.
Step A.3: Capture the New Join Tokens IMMEDIATELY
`docker swarm init --force-new-cluster` rotates both join tokens. The old tokens stored in Secret Manager are now dead; any VM trying to join with them will fail. You must capture the new tokens before Terraform attempts to rebuild anything:

```
sudo docker swarm join-token worker -q
sudo docker swarm join-token manager -q
```

Copy both values somewhere safe (your terminal scrollback is fine for the next few minutes). The `-q` flag prints only the token, with no surrounding command output.
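If you would rather not rely on scrollback, a small sketch that captures the same two commands into shell variables in the session you will use for Step A.4:

```
# Same commands as above, captured into variables for the Secret Manager push.
WORKER_TOKEN=$(sudo docker swarm join-token worker -q)
MANAGER_TOKEN=$(sudo docker swarm join-token manager -q)
echo "worker:  ${WORKER_TOKEN}"
echo "manager: ${MANAGER_TOKEN}"
```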
Step A.4: Update Secret Manager BEFORE Terraform Rebuilds
This is the single most important step in the runbook. The dead managers’ Terraform-managed startup scripts read tokens from Secret Manager on first boot. If Secret Manager still contains stale tokens when Terraform recreates them, the new managers will silently fail to join the cluster.
Identify the env secrets project ID (this is the value of `var.env_secrets_project_id` in the caller; you can find it in `infra/hb-infra/business_unit_1/<env>/variables.tf` or by running `terraform output` from the caller directory).

Push fresh secret versions for both tokens:

```
# Worker token
sudo docker swarm join-token worker -q | \
  gcloud secrets versions add swarm-worker-join-token \
    --data-file=- \
    --project=<env_secrets_project_id>

# Manager token
sudo docker swarm join-token manager -q | \
  gcloud secrets versions add swarm-manager-join-token \
    --data-file=- \
    --project=<env_secrets_project_id>
```

Verify the new versions are now the latest:

```
gcloud secrets versions list swarm-worker-join-token --project=<env_secrets_project_id>
gcloud secrets versions list swarm-manager-join-token --project=<env_secrets_project_id>
```

The version count should have incremented by 1 for each secret, with the newest version in ENABLED state.
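As an optional cross-check before touching Terraform, you can compare the stored value against the live token on the survivor (a sketch; `gcloud secrets versions access latest` is the standard read path):

```
# Compare the live worker token against the newest stored secret version.
LIVE=$(sudo docker swarm join-token worker -q)
STORED=$(gcloud secrets versions access latest \
  --secret=swarm-worker-join-token \
  --project=<env_secrets_project_id>)
if [ "${LIVE}" = "${STORED}" ]; then
  echo "worker token in sync"
else
  echo "MISMATCH: re-run the push before running Terraform"
fi
```

Repeat with the manager token if you want both verified.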
Step A.5: Run Terraform Plan
From your workstation:
```
./zig/zig build plan -- hb-infra non-production
```

(Use `production` for the prod environment.)

The plan should report that the dead manager instances need to be recreated. Their boot disks will also be recreated. No other resources should be affected.

If the plan reports changes outside of the dead managers, investigate before applying.
Step A.6: Apply the Plan
```
./zig/zig build apply -- hb-infra non-production
```

Terraform recreates the dead manager VMs. Their startup scripts run, hit Branch 1 of the three-branch decision tree (Design Decision #19), the branch where a token already exists in Secret Manager, and immediately call `join_swarm()`, joining the new cluster created in Step A.2.
Step A.7: Verify Quorum Restored
SSH to any manager:
```
gssh swarm-mgr-1 n-hb-infra
docker node ls
```

Expected: all three managers listed as `Ready Active Reachable`. The dead-but-rebuilt managers will have new node IDs. The original (pre-recovery) entries for those managers will appear as duplicates with `Down` and `Unreachable` status; clean those up:
```
# List dead nodes
docker node ls --filter 'role=manager'

# Remove old, dead manager entries (use the node ID, not the name)
sudo docker node rm <dead-node-id>
```
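Once the stale entries are removed, a compact re-check of the manager view (a sketch using docker's format templates; run from any manager):

```
# Expect managers=3, and each manager hostname with Leader or Reachable status.
docker info --format 'managers={{.Swarm.Managers}} nodes={{.Swarm.Nodes}}'
docker node ls --filter 'role=manager' \
  --format '{{.Hostname}}  {{.Status}}  {{.ManagerStatus}}'
```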
Step A.8: Reconcile Workers

Worker VMs that were running during the outage may have lost their connection to the cluster. They need to be checked and possibly rejoined.
From a manager, list all nodes:
```
docker node ls
```

For each worker showing `Down` or `Unreachable`, SSH to it and rejoin:

```
gssh <worker-name> n-hb-infra
sudo docker swarm leave
sudo docker swarm join --token <new_worker_token> <any_manager_internal_ip>:2377
```

Use the worker token captured in Step A.3, and the internal IP of any healthy manager (find it via `gcloud compute instances list --filter='labels.role=swarm-manager' --project=<env_project>`, or read the swarm_manager module's `manager_internal_ips` output).

New workers booted after recovery will read the fresh worker token from Secret Manager and join automatically (worker MIG, HB-8216). Only existing workers that survived the outage may need manual attention.
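A sketch for pulling those internal IPs from your workstation; the label filter mirrors the example above, while the `status=RUNNING` clause and the format string are additions you may need to adjust:

```
# Internal IPs of the currently running managers, for the join command above.
gcloud compute instances list \
  --filter='labels.role=swarm-manager AND status=RUNNING' \
  --format='value(name, networkInterfaces[0].networkIP)' \
  --project=<env_project>
```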
Scenario B: Zero Survivors (All Managers Down or Unrecoverable)
Use this scenario when no manager VM has usable Raft state (all 3 destroyed, all 3 corrupted, all 3 inaccessible). This is the disaster-recovery path that the hourly boot disk snapshot policy (Design Decision #15) exists for.
Step B.1: Identify the Most Recent Boot Disk Snapshot
```
gcloud compute snapshots list \
  --filter='labels.role=swarm-manager' \
  --sort-by=~creationTimestamp \
  --project=<env_project>
```

Pick the newest snapshot. The hourly snapshot cadence (Design Decision #15) bounds your worst-case Raft data loss to roughly 1 hour. Note that any services scheduled, secrets rotated, or node labels added between the snapshot timestamp and the failure are gone.
The `<env_project>` is the GCP project where the manager VMs live (the `var.env_project_id` value from `infra/hb-infra/business_unit_1/<env>/variables.tf`).
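To make the loss window explicit before committing to it, the same listing can be narrowed to just the newest snapshot (a sketch; `--limit` and the table format are standard gcloud options):

```
# Newest manager snapshot only, with its timestamp and source disk.
gcloud compute snapshots list \
  --filter='labels.role=swarm-manager' \
  --sort-by=~creationTimestamp \
  --limit=1 \
  --format='table(name, creationTimestamp, sourceDisk.basename(), diskSizeGb)' \
  --project=<env_project>
```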
Step B.2: Create a Recovery Disk from the Snapshot
```
gcloud compute disks create swarm-mgr-recovery \
  --source-snapshot=<snapshot-name> \
  --zone=us-central1-a \
  --project=<env_project>
```

Choose a zone that matches the original manager (`us-central1-a`, `us-central1-b`, or `us-central1-c`).
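Before booting a VM on it, it is worth confirming the disk restored cleanly (a sketch using `gcloud compute disks describe`):

```
# Expect status READY; sourceSnapshot should be the snapshot picked in Step B.1.
gcloud compute disks describe swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project> \
  --format='value(status, sizeGb, sourceSnapshot.basename())'
```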
Step B.3: Boot a Temporary Recovery VM from the Snapshot Disk
This is a manual one-off VM, not Terraform-managed.
The goal is to bring up a single node with the snapshot’s Raft state so
we can run --force-new-cluster against it.
```
gcloud compute instances create swarm-mgr-recovery \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --disk=name=swarm-mgr-recovery,boot=yes,auto-delete=no \
  --subnet=<env_subnet_name> \
  --network-tier=PREMIUM \
  --tags=swarm-node,allow-ssh \
  --service-account=$(gcloud iam service-accounts list \
    --filter="displayName:swarm-manager*" \
    --format="value(email)" \
    --project=<env_project>) \
  --scopes=cloud-platform \
  --project=<env_project>
```

Notes:

- The subnet must be the same VPC subnet used by the production swarm so the recovery VM can talk to the surviving workers (and to the rebuilt managers in Step B.5)
- The `swarm-node` tag is required for HB-8217 firewall rules to allow swarm port 2377
- Use the swarm_manager service account so this VM has the same Secret Manager access the real managers do
- This VM is intentionally not in the Terraform state; it is a disposable recovery host
Step B.4: SSH to the Recovery VM and Force a New Cluster
```
gssh swarm-mgr-recovery <env_short_name>
sudo docker swarm init --force-new-cluster
```

The Raft state from the snapshot lets the cluster bootstrap as if this were a one-survivor recovery. From this point forward, follow Scenario A starting at Step A.3 (capture tokens, update Secret Manager, Terraform recreate the real managers, verify quorum, reconcile workers).
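Before moving on to Step A.3, it is worth confirming the recovery node really did bootstrap from the snapshot's Raft state (a sketch; both fields are standard `docker info` template fields):

```
# ControlAvailable=true means this node is acting as a manager; the services
# restored from the snapshot should already appear in `docker service ls`.
sudo docker info --format \
  'state={{.Swarm.LocalNodeState}} manager={{.Swarm.ControlAvailable}} nodes={{.Swarm.Nodes}}'
sudo docker service ls
```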
Step B.5: Destroy the Recovery VM
After Terraform has recreated the three real managers and they have all joined the new cluster, the recovery VM is no longer needed and should be removed from the swarm and destroyed:
```
# From the recovery VM
sudo docker swarm leave --force

# From your workstation
gcloud compute instances delete swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project>

gcloud compute disks delete swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project>
```

Verify that `docker node ls` from a real manager shows exactly 3 manager nodes and no swarm-mgr-recovery entry.
Critical Gotchas
Token Rotation Order Matters
docker swarm init --force-new-cluster rotates the join
tokens. The old tokens in Secret Manager are dead the moment you run
that command. Tokens must be in Secret Manager
before Terraform rebuilds the dead managers — otherwise
the rebuilt managers’ startup scripts will read stale tokens, attempt to
join, fail silently, and you will be left with healthy-looking VMs that
are not actually in the swarm.
Always perform Step A.4 (Secret Manager update) before Step A.5 (Terraform plan).
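If you end up in this state anyway, the rebuilt managers' startup script output is visible on the serial console, which is often the fastest way to spot the failed join. A sketch (instance name and zone are examples; substitute the actual values):

```
# Pull recent startup-script output for a rebuilt manager and look for the
# swarm join attempt. The grep pattern is a guess at relevant lines, not a
# documented log format.
gcloud compute instances get-serial-port-output swarm-mgr-2 \
  --zone=us-central1-a \
  --project=<env_project> | grep -iE 'swarm|join' | tail -n 20
```

The definitive check remains `sudo docker info --format '{{.Swarm.LocalNodeState}}'` on the node itself.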
Split-Brain Risk With Multiple Survivors
docker swarm init --force-new-cluster is destructive to
other swarm participants. If you run it on two
survivors simultaneously, you create two independent swarms with
overlapping Raft state, overlapping service IDs, and overlapping network
state. This is unrecoverable without manual data export.
If you have two survivors, compare them carefully and pick exactly one. Treat the other as “dead” for the duration of the recovery and let Terraform rebuild it from scratch.
Never run docker swarm init --force-new-cluster
on more than one node.
Worker Reconciliation May Be Manual
Workers booted from the worker MIG (HB-8216) auto-rejoin via their startup script (which also reads tokens from Secret Manager). After Step A.4 they will pick up the fresh tokens automatically.
However, existing workers that survived the outage
still hold the old token in their local Docker state. They will not
auto-rejoin. You must SSH to each surviving worker and run the manual
docker swarm leave / docker swarm join cycle
described in Step A.8.
Snapshot Recovery Loses In-Flight State
Scenario B’s snapshot recovery loses everything written to Raft after the snapshot timestamp:
- Services scheduled between the snapshot and the failure
- Docker secrets/configs created or rotated between the snapshot and the failure
- Node labels added between the snapshot and the failure
- Manager node identity changes between the snapshot and the failure
The hourly snapshot cadence bounds the loss to ~1 hour, but it does not eliminate it. After Scenario B recovery, audit your services, secrets, and configs against an external source of truth (Terraform state, AWX inventory, application deployment configs) and re-create anything missing.
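A sketch for capturing the post-recovery inventory so it can be diffed against that external source of truth (the output paths are arbitrary examples):

```
# Run from any manager: dump service, secret, and config names for comparison.
docker service ls --format '{{.Name}}  {{.Image}}  {{.Replicas}}' | sort \
  > /tmp/services-after-recovery.txt
docker secret ls --format '{{.Name}}' | sort > /tmp/secrets-after-recovery.txt
docker config ls --format '{{.Name}}' | sort > /tmp/configs-after-recovery.txt
```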
Verification After Recovery
After Scenario A or Scenario B, confirm all of the following:
- `docker node ls` from any manager shows 3/3 `Ready Active Reachable`
- Cloud Monitoring's manager metric absence alert has cleared (within ~5 min)
- Cloud Monitoring's TCP 2377 uptime check (production only) is passing
- Secret Manager has the latest tokens; verify the version count incremented:

  ```
  gcloud secrets versions list swarm-worker-join-token --project=<env_secrets_project_id>
  gcloud secrets versions list swarm-manager-join-token --project=<env_secrets_project_id>
  ```

- A sample workload reschedules cleanly:

  ```
  docker service update --force <some-service-name>
  docker service ps <some-service-name>
  ```

- swarm-rebalancer (HB-8223) is healthy:

  ```
  docker service ps swarm-rebalancer --no-trunc
  ```

- All workers are `Ready Active`. Any that were manually rejoined in Step A.8 should appear with their new node IDs
Why This Procedure Exists
A 3-manager Docker Swarm tolerates a single manager failure with no operator intervention (Terraform recreates the dead manager, its startup script rejoins via Branch 1 of Design Decision #19). Quorum loss requires 2 of 3 managers to fail simultaneously, which is an unlikely but possible failure mode.
The swarm_manager module’s IAM grants
secretmanager.secretVersionManager on both join token
secrets specifically so this recovery flow can rotate tokens without
breaking-glass IAM elevation (Design Decision #17). The hourly boot disk
snapshot policy with 10-day retention (Design Decision #15) exists
specifically to make Scenario B recoverable, with worst-case ~1 hour of
Raft state loss.
Prevention and Practice
This runbook must be exercised at least once on the staging swarm (HB-8219) before HB-8215 closes. The steps are dense and unforgiving; the first time you walk through them should not be during a real production incident.
Specifically, walk through:
- Scenario A end-to-end against staging, including the manual worker reconciliation step
- Scenario B, at least through the first three steps (snapshot list, disk create, recovery VM boot), against staging; you do not need to fully complete the recovery, but you should validate that the snapshot pipeline produces a usable disk and that the recovery VM can read its Raft state
Document any deviations from this runbook (real project IDs, tag names, machine types) as fixes to this file in a follow-up PR.