HA Windows Worker Rolling Reboot

Reboot an HA prod Windows worker VM (e.g. THB-P-WEB1-WIN) with zero customer-visible downtime by routing its tenants through the spare VM THB-P-WEB8 for the duration of the reboot.

Use this runbook when:

  • A Windows CU is staged but the host needs a reboot to finalize it (HCS Element not found errors are the symptom)
  • The host’s OS disk or SKU needs to change and the customer-facing webapp can’t drop
  • Any other maintenance requires a Windows worker reboot during business hours

Prerequisites

  • ADO access: permission to run WebApp Production Release and InterfaceTaskService Production Release pipelines
  • SSH access to the source VM’s Linux manager (e.g. THB-P-WEB<N>-LINUX) and THB-P-WEB8-LINUX via configured SSH endpoints
  • Azure Portal access: ability to restart THB-P-WEB<N>-WIN and THB-P-WEB8-WIN
  • (Optional) Cloudflare dashboard access (Zero Trust → Networks → Tunnels) to verify HA pool registration
  • THB-P-WEB8 standing by — the spare VM pair (Linux + Windows) must already be provisioned and joined to the prod swarm

Scope: what’s blue-green’d

Per tenant on a source VM there are two stacks. Only one needs zero downtime:

| Stack | What runs there | Reboot behavior |
|---|---|---|
| <tenant>-prod-webappcaddy-stack | webapp (Windows) + caddy + cloudflared (Linux) | Blue-green via WEB8: zero customer downtime |
| <tenant> | InterfaceTaskService (Windows) | Drops during reboot (~15 min); queue-driven, jobs resume automatically |

prod-databasedeploy-stack and similar singletons run on the Linux manager only and are unaffected by a Windows worker reboot.

Procedure

The procedure is per source VM. Substitute <N> (e.g. WEB1) and <tenant> throughout.

Stage 1 — Stand up webappcaddy stacks on WEB8

Run WebApp Production Release for each tenant on the source VM:

  • Branch: release
  • Tenant: <tenant>
  • Deployment VM: THB-P-WEB8 (overrides the tenant’s normal target)

Note: You will need to manually approve production deploys on Azure DevOps.

Tip: Use the ha_ado script to deploy multiple tenants with a single command.

# Run with --dry-run first
zig build scripts -- ha_ado deploy --env p --service webapp --tenants demo,cna,imperial,simpra,thb --deployment-vm THB-P-WEB8 [--dry-run]

First-time deploys pull a ~13 GB Windows webapp image (20+ min). The pipeline reports success once the stack is created, but cloudflared stays in its startup wait until webapp serves a 200 — that’s by design. Cloudflare shows only source connectors until then.

Verify each tenant lands healthy on WEB8:

# On THB-P-WEB8-LINUX
sudo docker service ls --filter name=<tenant>-prod-webappcaddy-stack
# Expect all three services 1/1

sudo docker service logs --tail 30 <tenant>-prod-webappcaddy-stack_cloudflared
# Expect: caddy DNS resolved → origin chain ready → 4 "Registered tunnel connection" lines
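
The two checks above can be scripted so that the result is an exit code rather than something to eyeball. The following is a minimal sketch, not part of the repo's scripts: the helper names all_replicas_ready and tunnel_registered are hypothetical, and the input format for the first one assumes the --format template shown in the usage comment.

```shell
# Reads "<service> <replicas>" lines on stdin. Succeeds only if exactly WANT
# services are listed (default 3) and each reports N/N with N > 0.
all_replicas_ready() {
  awk -v want="${1:-3}" '
    { split($2, r, "/"); if (r[1] != r[2] || r[1] + 0 == 0) bad = 1 }
    END { exit (NR == want && !bad) ? 0 : 1 }
  '
}

# Reads cloudflared log text on stdin. Succeeds if it contains at least four
# "Registered tunnel connection" lines.
tunnel_registered() {
  n=$(grep -c 'Registered tunnel connection' || true)
  [ "$n" -ge 4 ]
}

# Usage (on THB-P-WEB8-LINUX):
#   sudo docker service ls --filter name=<tenant>-prod-webappcaddy-stack \
#     --format '{{.Name}} {{.Replicas}}' | all_replicas_ready 3 && echo "services ready"
#   sudo docker service logs --tail 30 <tenant>-prod-webappcaddy-stack_cloudflared 2>&1 \
#     | tunnel_registered && echo "tunnel registered"
```

The 2>&1 is there because docker service logs can emit container stderr on its own stream, which a plain pipe would miss.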

(Optional) Confirm both connectors are in the HA pool:

  • Cloudflare Zero Trust → Networks → Tunnels → <tenant> tunnel → 8 active connections (4 from source + 4 from WEB8)

Stage 2 — Cutover (per tenant)

Stop the source VM’s cloudflared so Cloudflare drains its connector from the HA pool:

# On THB-P-WEB<N>-LINUX
sudo docker service scale <tenant>-prod-webappcaddy-stack_cloudflared=0

Cloudflare edge takes ~5–15 min to fully drop the disconnected connector. No requests are dropped during this window — traffic continues to flow through the source until the edge times out and routes everything to WEB8.

Watch traffic shift to THB-P-WEB8 in papertrail logs (or repeated curl):

curl -s https://<tenant>.myhomealign.com/api/healthcheck | jq -r .version
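
For the "repeated curl" option, a small sampling loop makes the shift easier to see than individual requests. This is a sketch under two assumptions: that the version reported by the healthcheck differs between the source and WEB8 deployments, and that jq is installed where you run it. The helper names sample_versions and tally_versions are hypothetical.

```shell
# Reads one version string per line on stdin and prints a count per distinct
# value, most frequent first. More than one distinct version means both
# connectors are still serving; a single version means traffic has converged.
tally_versions() {
  sort | uniq -c | sort -rn
}

# Samples the runbook's healthcheck URL COUNT times, INTERVAL seconds apart,
# printing one version per line ("error" when the request or parse fails).
sample_versions() {
  url=$1
  count=${2:-20}
  interval=${3:-3}
  i=0
  while [ "$i" -lt "$count" ]; do
    v=$(curl -s --max-time 5 "$url" | jq -r .version 2>/dev/null || true)
    printf '%s\n' "${v:-error}"
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage:
#   sample_versions "https://<tenant>.myhomealign.com/api/healthcheck" 20 3 | tally_versions
```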

Repeat Stage 2 for every tenant on the source VM.

Stage 3 — Reboot the source Windows VM

Once all of the source VM’s tenants are confirmed serving from WEB8:

# On THB-P-WEB<N>-LINUX — clean shutdown of remaining stacks before reboot
# (Removes webappcaddy stacks that are now drained, plus InterfaceTaskService
# stacks so they SIGTERM gracefully instead of being hard-killed.)
for t in <tenant1> <tenant2> <...>; do
  sudo docker stack rm ${t}-prod-webappcaddy-stack
  sudo docker stack rm $t
done

Reboot via Azure Portal → THB-P-WEB<N>-WIN → Restart. Wait ~10–15 min for CU finalization.

Verify the reboot completed cleanly:

# SSH to THB-P-WEB<N>-WIN
Test-Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending'
# Expect: False

"$((Get-CimInstance Win32_OperatingSystem).Version).$((Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion').UBR)"
# Expect: UBR ticks forward from the pre-reboot value
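
Checking that the UBR "ticks forward" requires the pre-reboot value, so it helps to record the output of the version one-liner before Stage 3. A minimal sketch of the comparison, run from an operator workstation: ubr_advanced is a hypothetical helper, and the build strings in the usage comment are illustrative values, not real output from these hosts.

```shell
# Compares two build strings of the form Major.Minor.Build.UBR, as printed by
# the PowerShell one-liner above. Succeeds only if the UBR (the last dotted
# field) increased between the first and second argument.
ubr_advanced() {
  before=${1##*.}
  after=${2##*.}
  [ "$after" -gt "$before" ]
}

# Usage (illustrative values):
#   ubr_advanced "10.0.20348.2340" "10.0.20348.2402" && echo "CU applied"
```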

If RebootPending is still True, see Troubleshooting.

Stage 4 — Migrate tenants back to the source VM

For each tenant, re-run WebApp Production Release and InterfaceTaskService Production Release:

  • Branch: release
  • Tenant: <tenant>
  • Deployment VM: tenant-default (no override — the tenant config’s source VM is the right target)

Tip: Use the ha_ado script to deploy multiple tenants with a single command.

# Deploy webapps to their default VMs
zig build scripts -- ha_ado deploy --env p --service webapp --tenants demo,cna,imperial,simpra,thb [--dry-run]
# Deploy InterfaceTaskService to its default VMs
zig build scripts -- ha_ado deploy --env p --service interfacetaskservice --tenants demo,cna,imperial,simpra,thb [--dry-run]

Wait for both stacks to reach 1/1 (image is cached, so pull is fast). Verify cloudflared registers cleanly (same expected log sequence as Stage 1).
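
Waiting for 1/1 can be turned into a bounded poll instead of re-running docker service ls by hand. The sketch below is one way to do it; wait_until and tenant_ready are hypothetical helpers, and tenant_ready relies on the docker name filter matching by prefix, so it covers both the <tenant>-prod-webappcaddy-stack services and the <tenant> InterfaceTaskService stack.

```shell
# Retries a command until it succeeds or roughly TIMEOUT seconds have elapsed.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "timed out after ${elapsed}s: $*" >&2
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Succeeds when at least one service matches the tenant prefix and every
# matching service reports exactly 1/1.
tenant_ready() {
  replicas=$(sudo docker service ls --filter "name=$1" --format '{{.Replicas}}') || return 1
  [ -n "$replicas" ] && ! printf '%s\n' "$replicas" | grep -qv '^1/1$'
}

# Usage (on THB-P-WEB<N>-LINUX): poll every 10 s for up to 10 min
#   wait_until 600 10 tenant_ready <tenant>
```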

Cut traffic back to the source by stopping WEB8’s cloudflared for each tenant:

# On THB-P-WEB8-LINUX
sudo docker service scale <tenant>-prod-webappcaddy-stack_cloudflared=0

Confirm in papertrail logs (or repeated curl) that responses are coming from the source VM’s image SHA again.

Stage 5 — Clean up WEB8

Once all of the source VM’s tenants are back on the source:

# On THB-P-WEB8-LINUX
for t in <tenant1> <tenant2> <...>; do
  sudo docker stack rm ${t}-prod-webappcaddy-stack
done

sleep 10
sudo docker service ls    # only logspout should remain
sudo docker stack ls      # empty
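
The "only logspout should remain" check can also be expressed as a filter that prints whatever is left over. A minimal sketch, assuming the logspout service's name contains the string "logspout"; leftover_services is a hypothetical helper.

```shell
# Reads service names on stdin and prints any that are not logspout.
# Succeeds only when nothing unexpected remains.
leftover_services() {
  leftovers=$(grep -v 'logspout' || true)
  if [ -n "$leftovers" ]; then
    printf '%s\n' "$leftovers"
    return 1
  fi
}

# Usage (on THB-P-WEB8-LINUX):
#   sudo docker service ls --format '{{.Name}}' | leftover_services && echo "WEB8 clean"
```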

WEB8 is now clean and ready for the next source VM.

Troubleshooting

cloudflared is in a restart loop

Each entrypoint wait exits with status 1 on timeout (caddy DNS: 120 s; origin chain: 1800 s), and swarm then respawns the container. The loop causes no customer-facing 502s: cloudflared never reaches exec cloudflared tunnel ... run, so it never registers with Cloudflare, and traffic stays on healthy connectors. Diagnose which wait stage is timing out:

sudo docker service logs --tail 50 <tenant>-prod-webappcaddy-stack_cloudflared

| Log message | Likely cause | Fix |
|---|---|---|
| timeout after 120s waiting for caddy DNS | caddy service not in swarm DNS | Run docker service ls --filter name=<tenant>-prod-webappcaddy-stack_caddy. If it shows 0/1, caddy is failing to start (check its logs) |
| timeout after 1800s waiting for origin chain | caddy up but webapp unreachable | Run docker service ps <tenant>-prod-webappcaddy-stack_webapp --no-trunc. If Preparing, the image is still pulling (just wait); if Failed, see the ERROR column |
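
The lookup in the table above can be sketched as a filter over the service logs. It matches only the two timeout messages listed there and reports the most recent one; diagnose_cloudflared is a hypothetical helper, and the hint text it prints is a paraphrase of the table rather than anything cloudflared itself emits.

```shell
# Reads cloudflared service logs on stdin and reports which entrypoint wait
# timed out most recently, based on the two messages from the table.
diagnose_cloudflared() {
  last=$(grep -E 'timeout after [0-9]+s waiting for (caddy DNS|origin chain)' | tail -n 1)
  case $last in
    *"caddy DNS"*)    echo "caddy-dns: caddy is not in swarm DNS; check the caddy service and its logs" ;;
    *"origin chain"*) echo "origin-chain: caddy is up but webapp is unreachable; check docker service ps for webapp" ;;
    *)                echo "none: no known timeout message found" ;;
  esac
}

# Usage:
#   sudo docker service logs --tail 50 <tenant>-prod-webappcaddy-stack_cloudflared 2>&1 \
#     | diagnose_cloudflared
```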

After reboot, RebootPending is still True

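The CU didn’t finalize. Reboot the VM a second time via Azure Portal. If it persists, the staged CU may be corrupted. As a last resort, revert it with DISM /Cleanup-Image /RevertPendingActions (this drops the CU’s security fixes, so document the gap). Microsoft’s DISM documentation lists /RevertPendingActions as unsupported on a running operating system, so expect to run it against the offline image (/Image:<path>) rather than with /Online.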

Cloudflare dashboard shows only one set of connectors

The HA pool isn’t forming. Likely causes:

  • WEB8’s cloudflared isn’t actually running — check docker service ls on WEB8
  • WEB8 and the source are using different tunnel credentials — confirm tunnelSecret matches in the tenant config

Source VM doesn’t come back from reboot

Check Azure VM status. If stuck “Updating” for >30 min, look at boot diagnostics. CU finalization can run longer for large patches but should never exceed ~30 min.

Validation checklist

Before declaring a VM cycle done:

Warning: Don’t restart Docker on a host with RebootPending=True. It clears the in-memory HCS state that’s keeping existing containers alive and cascades the whole host into the failed state. Always reboot the OS, never just the Docker daemon.
