Quorum-Loss Recovery for Swarm Managers
This runbook covers how to recover the Docker Swarm cluster after
losing Raft quorum on the manager nodes (swarm-mgr-1,
swarm-mgr-2, swarm-mgr-3).
This is the highest-stakes operational scenario for the swarm. While quorum is lost:
- No new services can be scheduled
- No manager elections can happen
- No `docker swarm`, `docker service`, or `docker secret` operations succeed
- Existing running containers continue to run, but cannot be rescheduled if they fail
The recovery flow is the only swarm manager scenario without an automatic recovery path (Design Decision #17). It is the deliberate trade-off for choosing a 3-manager swarm over an external etcd/Consul cluster.
Prerequisites
- Member of `ssh-n-env@thehelperbees.com` (staging) or `ssh-p-env@thehelperbees.com` (production)
- Member of `sg-hb-infra-development@thehelperbees.com` (Terraform builds)
- `gcloud` authenticated as a user with `secretmanager.secretVersionAdder` on the env secrets project (the swarm_manager service account already has `secretmanager.secretVersionManager` per Design Decision #17, so the recovery flow does not require breaking-glass elevation if you can SSH and run `gcloud` as that SA)
- `gssh` tool installed (`./zig/zig build gssh`)
Detection: Has Quorum Actually Been Lost?
Before invoking this runbook, confirm the swarm has actually lost quorum, not a transient network blip or a single-manager failure.
Confirm quorum loss:
SSH to any manager and run:
```
docker node ls
```

If the cluster has lost a leader, the command returns:

```
Error response from daemon: rpc error: code = Unknown desc = The swarm does not have a leader
```

Check Cloud Monitoring. The manager metric absence alert (see `alerts.tf`, alert #6) will be firing on 2 or 3 of the manager instances. The TCP 2377 uptime check (production only) will also be failing.

SSH to each manager and check the local Raft state:

```
docker info 2>&1 | grep -A1 'Swarm:'
```

The `LocalNodeState` field will read `pending` or `error` on managers that cannot reach quorum.
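If `gssh` accepts a trailing remote command the way plain `ssh` does (an assumption; confirm against the tool's help before relying on it), you can sweep all three managers from your workstation in one pass. A sketch for staging:

```
for m in swarm-mgr-1 swarm-mgr-2 swarm-mgr-3; do
  echo "== ${m} =="
  # Sketch only: assumes gssh forwards the quoted string as a remote command.
  # If it does not, SSH to each manager and run the docker info check there.
  gssh "${m}" n-hb-infra \
    "sudo docker info --format 'state={{.Swarm.LocalNodeState}} error={{.Swarm.Error}}'" \
    || echo "${m}: unreachable"
done
```

Substitute `p-hb-infra` for production.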
Distinguish from other scenarios — DO NOT invoke this runbook if:
- Transient network blip. The error above clears within 60-120 seconds without intervention. Wait two minutes, then retry `docker node ls`. If it works again, you do not have quorum loss; investigate the network layer instead.
- Single manager failure (2/3 still healthy). `docker node ls` succeeds from the surviving managers and shows one node as `Down`. Quorum is intact (2/3). No recovery is needed; let Terraform recreate the dead manager via standard plan/apply, and its startup script will rejoin the cluster automatically (Design Decision #19).
If you have confirmed at least 2 of 3 managers are unreachable AND the surviving manager(s) cannot elect a leader, proceed.
Scenario A: One Survivor (2 Managers Down)
Use this scenario when exactly one manager VM is still reachable with intact local Raft state.
CRITICAL: If you have two survivors, you must pick exactly one to be the recovery node. Compare `docker info` output on both and choose the one whose `Managers` and `Nodes` counts reflect the most recent cluster state. Then treat the other as "down" for the rest of this procedure. Never run `docker swarm init --force-new-cluster` on more than one node; see Critical Gotchas below.
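A quick way to compare the candidates is docker's built-in format template; both fields are part of standard `docker info` output. Run this on each surviving manager:

```
# Pick the node whose Managers/Nodes counts reflect the most recent cluster
# state as the recovery node.
sudo docker info --format \
  'state={{.Swarm.LocalNodeState}} managers={{.Swarm.Managers}} nodes={{.Swarm.Nodes}}'
```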
Step A.1: SSH to the Surviving Manager
```
gssh swarm-mgr-1 n-hb-infra
```

(Substitute swarm-mgr-2 / swarm-mgr-3 and `p-hb-infra` for production.)
Step A.2: Force a New Cluster
```
sudo docker swarm init --force-new-cluster
```

This re-creates a single-node Raft cluster from the survivor's existing Raft state. The dead managers are retained as nodes in the swarm but marked as `Down` and `Unreachable`. Existing services, networks, secrets, and configs are preserved.

WARNING: This is a one-way operation. Once you run `--force-new-cluster`, you cannot undo it. Confirm Step A.1 was the right node before continuing.
Step A.3: Capture the New Join Tokens IMMEDIATELY
`docker swarm init --force-new-cluster` rotates both join tokens. The old tokens stored in Secret Manager are now dead; any VM trying to join with them will fail. You must capture the new tokens before Terraform attempts to rebuild anything:

```
sudo docker swarm join-token worker -q
sudo docker swarm join-token manager -q
```

Copy both values somewhere safe (your terminal scrollback is fine for the next few minutes). The `-q` flag prints only the token, with no surrounding command output.
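If you would rather not rely on scrollback, a small sketch that captures the same two commands into shell variables in the session you will use for Step A.4:

```
# Same commands as above, captured into variables for the Secret Manager push.
WORKER_TOKEN=$(sudo docker swarm join-token worker -q)
MANAGER_TOKEN=$(sudo docker swarm join-token manager -q)
echo "worker:  ${WORKER_TOKEN}"
echo "manager: ${MANAGER_TOKEN}"
```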
Step A.4: Update Secret Manager BEFORE Terraform Rebuilds
This is the single most important step in the runbook. The dead managers’ Terraform-managed startup scripts read tokens from Secret Manager on first boot. If Secret Manager still contains stale tokens when Terraform recreates them, the new managers will silently fail to join the cluster.
Identify the env secrets project ID (this is the value of `var.env_secrets_project_id` in the caller; you can find it in `infra/hb-infra/business_unit_1/<env>/variables.tf` or by running `terraform output` from the caller directory).

Push fresh secret versions for both tokens:

```
# Worker token
sudo docker swarm join-token worker -q | \
  gcloud secrets versions add swarm-worker-join-token \
    --data-file=- \
    --project=<env_secrets_project_id>

# Manager token
sudo docker swarm join-token manager -q | \
  gcloud secrets versions add swarm-manager-join-token \
    --data-file=- \
    --project=<env_secrets_project_id>
```

Verify the new versions are now the latest:

```
gcloud secrets versions list swarm-worker-join-token --project=<env_secrets_project_id>
gcloud secrets versions list swarm-manager-join-token --project=<env_secrets_project_id>
```

The version count should have incremented by 1 for each secret, with the newest version in ENABLED state.
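As an optional cross-check before touching Terraform, you can compare the stored value against the live token on the survivor (a sketch; `gcloud secrets versions access latest` is the standard read path):

```
# Compare the live worker token against the newest stored secret version.
LIVE=$(sudo docker swarm join-token worker -q)
STORED=$(gcloud secrets versions access latest \
  --secret=swarm-worker-join-token \
  --project=<env_secrets_project_id>)
if [ "${LIVE}" = "${STORED}" ]; then
  echo "worker token in sync"
else
  echo "MISMATCH: re-run the push before running Terraform"
fi
```

Repeat with the manager token if you want both verified.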
Step A.5: Run Terraform Plan
From your workstation:
```
./zig/zig build plan -- hb-infra non-production
```

(Use `production` for the prod environment.)

The plan should report that the dead manager instances need to be recreated. Their boot disks will also be recreated. No other resources should be affected.

If the plan reports changes outside of the dead managers, investigate before applying.
Step A.6: Apply the Plan
```
./zig/zig build apply -- hb-infra non-production
```

Terraform recreates the dead manager VMs. Their startup scripts run, hit Branch 1 of the three-branch decision tree (Design Decision #19), the branch where a token already exists in Secret Manager, and immediately call `join_swarm()`, joining the new cluster created in Step A.2.
Step A.7: Verify Quorum Restored
SSH to any manager:
```
gssh swarm-mgr-1 n-hb-infra
docker node ls
```

Expected: all three managers listed as `Ready Active Reachable`. The dead-but-rebuilt managers will have new node IDs. The original (pre-recovery) entries for those managers will appear as duplicates with `Down` and `Unreachable` status; clean those up:
```
# List dead nodes
docker node ls --filter 'role=manager'

# Remove old, dead manager entries (use the node ID, not the name)
sudo docker node rm <dead-node-id>
```
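Once the stale entries are removed, a compact re-check of the manager view (a sketch using docker's format templates; run from any manager):

```
# Expect managers=3, and each manager hostname with Leader or Reachable status.
docker info --format 'managers={{.Swarm.Managers}} nodes={{.Swarm.Nodes}}'
docker node ls --filter 'role=manager' \
  --format '{{.Hostname}}  {{.Status}}  {{.ManagerStatus}}'
```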
Step A.8: Reconcile Workers

Worker VMs that were running during the outage may have lost their connection to the cluster. They need to be checked and possibly rejoined.
From a manager, list all nodes:
```
docker node ls
```

For each worker showing `Down` or `Unreachable`, SSH to it and rejoin:

```
gssh <worker-name> n-hb-infra
sudo docker swarm leave
sudo docker swarm join --token <new_worker_token> <any_manager_internal_ip>:2377
```

Use the worker token captured in Step A.3, and the internal IP of any healthy manager (find it via `gcloud compute instances list --filter='labels.role=swarm-manager' --project=<env_project>`, or read the swarm_manager module's `manager_internal_ips` output).

New workers booted after recovery will read the fresh worker token from Secret Manager and join automatically (worker MIG, HB-8216). Only existing workers that survived the outage may need manual attention.
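A sketch for pulling those internal IPs from your workstation; the label filter mirrors the example above, while the `status=RUNNING` clause and the format string are additions you may need to adjust:

```
# Internal IPs of the currently running managers, for the join command above.
gcloud compute instances list \
  --filter='labels.role=swarm-manager AND status=RUNNING' \
  --format='value(name, networkInterfaces[0].networkIP)' \
  --project=<env_project>
```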
Scenario B: Zero Survivors (All Managers Down or Unrecoverable)
Use this scenario when no manager VM has usable Raft state (all 3 destroyed, all 3 corrupted, all 3 inaccessible). This is the disaster-recovery path that the hourly boot disk snapshot policy (Design Decision #15) exists for.
Step B.1: Identify the Most Recent Boot Disk Snapshot
```
gcloud compute snapshots list \
  --filter='labels.role=swarm-manager' \
  --sort-by=~creationTimestamp \
  --project=<env_project>
```

Pick the newest snapshot. The hourly snapshot cadence (Design Decision #15) bounds your worst-case Raft data loss to roughly 1 hour. Note that any services scheduled, secrets rotated, or node labels added between the snapshot timestamp and the failure are gone.
The `<env_project>` is the GCP project where the manager VMs live (the `var.env_project_id` value from `infra/hb-infra/business_unit_1/<env>/variables.tf`).
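To make the loss window explicit before committing to it, the same listing can be narrowed to just the newest snapshot (a sketch; `--limit` and the table format are standard gcloud options):

```
# Newest manager snapshot only, with its timestamp and source disk.
gcloud compute snapshots list \
  --filter='labels.role=swarm-manager' \
  --sort-by=~creationTimestamp \
  --limit=1 \
  --format='table(name, creationTimestamp, sourceDisk.basename(), diskSizeGb)' \
  --project=<env_project>
```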
Step B.2: Create a Recovery Disk from the Snapshot
```
gcloud compute disks create swarm-mgr-recovery \
  --source-snapshot=<snapshot-name> \
  --zone=us-central1-a \
  --project=<env_project>
```

Choose a zone that matches the original manager (`us-central1-a`, `us-central1-b`, or `us-central1-c`).
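Before booting a VM on it, it is worth confirming the disk restored cleanly (a sketch using `gcloud compute disks describe`):

```
# Expect status READY; sourceSnapshot should be the snapshot picked in Step B.1.
gcloud compute disks describe swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project> \
  --format='value(status, sizeGb, sourceSnapshot.basename())'
```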
Step B.3: Boot a Temporary Recovery VM from the Snapshot Disk
This is a manual one-off VM, not Terraform-managed.
The goal is to bring up a single node with the snapshot’s Raft state so
we can run --force-new-cluster against it.
```
gcloud compute instances create swarm-mgr-recovery \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --disk=name=swarm-mgr-recovery,boot=yes,auto-delete=no \
  --subnet=<env_subnet_name> \
  --network-tier=PREMIUM \
  --tags=swarm-node,allow-ssh \
  --service-account=$(gcloud iam service-accounts list \
    --filter="displayName:swarm-manager*" \
    --format="value(email)" \
    --project=<env_project>) \
  --scopes=cloud-platform \
  --project=<env_project>
```

Notes:

- The subnet must be the same VPC subnet used by the production swarm so the recovery VM can talk to the surviving workers (and to the rebuilt managers in Step B.5)
- The `swarm-node` tag is required for HB-8217 firewall rules to allow swarm port 2377
- Use the swarm_manager service account so this VM has the same Secret Manager access the real managers do
- This VM is intentionally not in the Terraform state; it is a disposable recovery host
Step B.4: SSH to the Recovery VM and Force a New Cluster
```
gssh swarm-mgr-recovery <env_short_name>
sudo docker swarm init --force-new-cluster
```

The Raft state from the snapshot lets the cluster bootstrap as if this were a one-survivor recovery. From this point forward, follow Scenario A starting at Step A.3 (capture tokens, update Secret Manager, Terraform recreate the real managers, verify quorum, reconcile workers).
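Before moving on to Step A.3, it is worth confirming the recovery node really did bootstrap from the snapshot's Raft state (a sketch; both fields are standard `docker info` template fields):

```
# ControlAvailable=true means this node is acting as a manager; the services
# restored from the snapshot should already appear in `docker service ls`.
sudo docker info --format \
  'state={{.Swarm.LocalNodeState}} manager={{.Swarm.ControlAvailable}} nodes={{.Swarm.Nodes}}'
sudo docker service ls
```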
Step B.5: Destroy the Recovery VM
After Terraform has recreated the three real managers and they have all joined the new cluster, the recovery VM is no longer needed and should be removed from the swarm and destroyed:
```
# From the recovery VM
sudo docker swarm leave --force

# From your workstation
gcloud compute instances delete swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project>

gcloud compute disks delete swarm-mgr-recovery \
  --zone=us-central1-a \
  --project=<env_project>
```

Verify that `docker node ls` from a real manager shows exactly 3 manager nodes and no swarm-mgr-recovery entry.
Critical Gotchas
Token Rotation Order Matters
docker swarm init --force-new-cluster rotates the join
tokens. The old tokens in Secret Manager are dead the moment you run
that command. Tokens must be in Secret Manager
before Terraform rebuilds the dead managers — otherwise
the rebuilt managers’ startup scripts will read stale tokens, attempt to
join, fail silently, and you will be left with healthy-looking VMs that
are not actually in the swarm.
Always perform Step A.4 (Secret Manager update) before Step A.5 (Terraform plan).
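If you end up in this state anyway, the rebuilt managers' startup script output is visible on the serial console, which is often the fastest way to spot the failed join. A sketch (instance name and zone are examples; substitute the actual values):

```
# Pull recent startup-script output for a rebuilt manager and look for the
# swarm join attempt. The grep pattern is a guess at relevant lines, not a
# documented log format.
gcloud compute instances get-serial-port-output swarm-mgr-2 \
  --zone=us-central1-a \
  --project=<env_project> | grep -iE 'swarm|join' | tail -n 20
```

The definitive check remains `sudo docker info --format '{{.Swarm.LocalNodeState}}'` on the node itself.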
Split-Brain Risk With Multiple Survivors
docker swarm init --force-new-cluster is destructive to
other swarm participants. If you run it on two
survivors simultaneously, you create two independent swarms with
overlapping Raft state, overlapping service IDs, and overlapping network
state. This is unrecoverable without manual data export.
If you have two survivors, compare them carefully and pick exactly one. Treat the other as “dead” for the duration of the recovery and let Terraform rebuild it from scratch.
Never run docker swarm init --force-new-cluster
on more than one node.
Worker Reconciliation May Be Manual
Workers booted from the worker MIG (HB-8216) auto-rejoin via their startup script (which also reads tokens from Secret Manager). After Step A.4 they will pick up the fresh tokens automatically.
However, existing workers that survived the outage
still hold the old token in their local Docker state. They will not
auto-rejoin. You must SSH to each surviving worker and run the manual
docker swarm leave / docker swarm join cycle
described in Step A.8.
Snapshot Recovery Loses In-Flight State
Scenario B’s snapshot recovery loses everything written to Raft after the snapshot timestamp:
- Services scheduled between the snapshot and the failure
- Docker secrets/configs created or rotated between the snapshot and the failure
- Node labels added between the snapshot and the failure
- Manager node identity changes between the snapshot and the failure
The hourly snapshot cadence bounds the loss to ~1 hour, but it does not eliminate it. After Scenario B recovery, audit your services, secrets, and configs against an external source of truth (Terraform state, AWX inventory, application deployment configs) and re-create anything missing.
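A sketch for capturing the post-recovery inventory so it can be diffed against that external source of truth (the output paths are arbitrary examples):

```
# Run from any manager: dump service, secret, and config names for comparison.
docker service ls --format '{{.Name}}  {{.Image}}  {{.Replicas}}' | sort \
  > /tmp/services-after-recovery.txt
docker secret ls --format '{{.Name}}' | sort > /tmp/secrets-after-recovery.txt
docker config ls --format '{{.Name}}' | sort > /tmp/configs-after-recovery.txt
```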
Verification After Recovery
After Scenario A or Scenario B, confirm all of the following:
- `docker node ls` from any manager shows 3/3 `Ready Active Reachable`
- Cloud Monitoring's manager metric absence alert has cleared (within ~5 min)
- Cloud Monitoring's TCP 2377 uptime check (production only) is passing
- Secret Manager has the latest tokens; verify the version count incremented:

  ```
  gcloud secrets versions list swarm-worker-join-token --project=<env_secrets_project_id>
  gcloud secrets versions list swarm-manager-join-token --project=<env_secrets_project_id>
  ```

- A sample workload reschedules cleanly:

  ```
  docker service update --force <some-service-name>
  docker service ps <some-service-name>
  ```

- swarm-rebalancer (HB-8223) is healthy:

  ```
  docker service ps swarm-rebalancer --no-trunc
  ```

- All workers are `Ready Active`. Any that were manually rejoined in Step A.8 should appear with their new node IDs
Why This Procedure Exists
A 3-manager Docker Swarm tolerates a single manager failure with no operator intervention (Terraform recreates the dead manager, its startup script rejoins via Branch 1 of Design Decision #19). Quorum loss requires 2 of 3 managers to fail simultaneously, which is an unlikely but possible failure mode.
The swarm_manager module’s IAM grants
secretmanager.secretVersionManager on both join token
secrets specifically so this recovery flow can rotate tokens without
breaking-glass IAM elevation (Design Decision #17). The hourly boot disk
snapshot policy with 10-day retention (Design Decision #15) exists
specifically to make Scenario B recoverable, with worst-case ~1 hour of
Raft state loss.
Prevention and Practice
This runbook must be exercised at least once on the staging swarm (HB-8219) before HB-8215 closes. The steps are dense and unforgiving; the first time you walk through them should not be during a real production incident.
Specifically, walk through:
- Scenario A end-to-end against staging, including the manual worker reconciliation step
- Scenario B, at least through the first three steps (snapshot list, disk create, recovery VM boot), against staging; you do not need to fully complete the recovery, but you should validate that the snapshot pipeline produces a usable disk and that the recovery VM can read its Raft state
Document any deviations from this runbook (real project IDs, tag names, machine types) as fixes to this file in a follow-up PR.