HA Windows Worker Rolling Reboot
Reboot an HA prod Windows worker VM (e.g.
THB-P-WEB1-WIN) with zero customer-visible
downtime by routing its tenants through the spare VM
THB-P-WEB8 for the duration of the reboot.
Use this runbook when:
- A Windows CU is staged but the host needs a reboot to finalize it (HCS "Element not found" errors are the symptom)
- The host’s OS disk or SKU needs to change and the customer-facing webapp can’t drop
- Any maintenance that requires a Windows worker reboot during business hours
Prerequisites
- ADO access: permission to run WebApp Production Release and InterfaceTaskService Production Release pipelines
- SSH access to the source VM’s Linux manager (e.g. THB-P-WEB<N>-LINUX) and THB-P-WEB8-LINUX via configured SSH endpoints
- Azure Portal access: ability to restart THB-P-WEB<N>-WIN and THB-P-WEB8-WIN
- (Optional) Cloudflare dashboard access (Zero Trust → Networks → Tunnels) to verify HA pool registration
- THB-P-WEB8 standing by: the spare VM pair (Linux + Windows) must already be provisioned and joined to the prod swarm
Scope: what’s blue-green’d
Per tenant on a source VM there are two stacks. Only one needs zero downtime:
| Stack | What runs there | Reboot behavior |
|---|---|---|
| <tenant>-prod-webappcaddy-stack | webapp (Windows) + caddy + cloudflared (Linux) | Blue-green via WEB8: zero customer downtime |
| <tenant> | InterfaceTaskService (Windows) | Drops during reboot (~15 min); queue-driven, jobs resume automatically |
prod-databasedeploy-stack and similar singletons run on
the Linux manager only and are unaffected by a Windows worker
reboot.
Procedure
The procedure is per source VM. Substitute <N>
(e.g. WEB1) and <tenant> throughout.
Stage 1 — Stand up webappcaddy stacks on WEB8
Run WebApp Production Release for each tenant on the source VM:
- Branch: release
- Tenant: <tenant>
- Deployment VM: THB-P-WEB8 (overrides the tenant’s normal target)
Note: You will need to manually approve production deploys on Azure DevOps.
Tip: Use the ha_ado script to deploy multiple tenants with a single command.
# Run with --dry-run first
zig build scripts -- ha_ado deploy --env p --service webapp --tenants demo,cna,imperial,simpra,thb --deployment-vm THB-P-WEB8 [--dry-run]

First-time deploys pull a ~13 GB Windows webapp image (20+ min). The pipeline reports success once the stack is created, but cloudflared stays in its startup wait until webapp serves a 200; that is by design. Until then, Cloudflare shows only the source connectors.
Verify each tenant lands healthy on WEB8:
# On THB-P-WEB8-LINUX
sudo docker service ls --filter name=<tenant>-prod-webappcaddy-stack
# Expect all three services 1/1
sudo docker service logs --tail 30 <tenant>-prod-webappcaddy-stack_cloudflared
# Expect: caddy DNS resolved → origin chain ready → 4 "Registered tunnel connection" lines

(Optional) Confirm both connectors are in the HA pool:
- Cloudflare Zero Trust → Networks → Tunnels → <tenant> tunnel → 8 active connections (4 from source + 4 from WEB8)
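The two manual checks above (replica counts and registration lines) can be scripted. The following is a minimal sketch, not part of the existing tooling: the helper names not_ready_services and count_registered are hypothetical, and the sketch assumes the standard docker service ls --format placeholders {{.Name}} and {{.Replicas}}.

```shell
# Hypothetical helpers; names are illustrative, not from the runbook tooling.

# Read "name replicas" lines on stdin and print every service not at 1/1.
not_ready_services() {
  awk '$2 != "1/1" { print $1 " (" $2 ")" }'
}

# Count "Registered tunnel connection" lines in a log stream on stdin.
count_registered() {
  grep -c 'Registered tunnel connection'
}

# Usage on THB-P-WEB8-LINUX (substitute <tenant> first):
#   sudo docker service ls --filter name=<tenant>-prod-webappcaddy-stack \
#     --format '{{.Name}} {{.Replicas}}' | not_ready_services
#   sudo docker service logs --tail 30 \
#     <tenant>-prod-webappcaddy-stack_cloudflared 2>&1 | count_registered
```

An empty result from not_ready_services means all listed services are at 1/1; the runbook expects count_registered to report 4.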
Stage 2 — Cutover (per tenant)
Stop the source VM’s cloudflared so Cloudflare drains its connector from the HA pool:
# On THB-P-WEB<N>-LINUX
sudo docker service scale <tenant>-prod-webappcaddy-stack_cloudflared=0

The Cloudflare edge takes ~5–15 min to fully drop the disconnected connector. No requests are dropped during this window: traffic continues to flow through the source until the edge times out and routes everything to WEB8.
Watch traffic shift to THB-P-WEB8 in Papertrail logs (or with repeated curl requests):
curl -s https://<tenant>.myhomealign.com/api/healthcheck | jq -r .version

Repeat Stage 2 for every tenant on the source VM.
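The repeated curl check can be turned into a small sampling loop that also surfaces failed requests during the drain window. This is a sketch under assumptions: the function names fetch_version and poll_versions are hypothetical, and it only distinguishes WEB8 from the source if the two report different .version values.

```shell
# Hypothetical helpers; only the healthcheck URL and jq filter come from the runbook.

# fetch_version TENANT: print the healthcheck's .version, or nothing on failure.
fetch_version() {
  curl -s --max-time 10 "https://$1.myhomealign.com/api/healthcheck" 2>/dev/null \
    | jq -r .version 2>/dev/null
}

# poll_versions TENANT COUNT INTERVAL: sample the endpoint COUNT times,
# INTERVAL seconds apart, printing the version or ERROR for each sample.
poll_versions() {
  local tenant=$1 count=$2 interval=$3 i=0 v
  while [ "$i" -lt "$count" ]; do
    v=$(fetch_version "$tenant")
    echo "${v:-ERROR}"
    i=$((i + 1))
    sleep "$interval"
  done
}

# Usage: 60 samples, 5 s apart, summarized per distinct answer.
#   poll_versions <tenant> 60 5 | sort | uniq -c
```

Any ERROR rows in the summary indicate requests that did not return a parseable healthcheck response during the window.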
Stage 3 — Reboot the source Windows VM
Once all of the source VM’s tenants are confirmed serving from WEB8:
# On THB-P-WEB<N>-LINUX — clean shutdown of remaining stacks before reboot
# (Removes webappcaddy stacks that are now drained, plus InterfaceTaskService
# stacks so they SIGTERM gracefully instead of being hard-killed.)
for t in <tenant1> <tenant2> <...>; do
  sudo docker stack rm "${t}-prod-webappcaddy-stack"
  sudo docker stack rm "$t"
done

Reboot via Azure Portal → THB-P-WEB<N>-WIN → Restart. Wait ~10–15 min for CU finalization.
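Because docker stack rm returns before the services are fully torn down, it may be worth confirming that nothing from the removed stacks is still listed before triggering the restart. A minimal sketch follows; leftover_services is a hypothetical helper, and it assumes swarm's usual <stack>_<service> naming for stack services.

```shell
# Hypothetical helper; assumes stack services are named <stack>_<service>.

# leftover_services TENANT...: read service names on stdin and print those
# that still belong to either stack of the listed tenants.
leftover_services() {
  local pattern="" t
  [ "$#" -gt 0 ] || return 2
  for t in "$@"; do
    pattern="${pattern}${pattern:+|}${t}_|${t}-prod-webappcaddy-stack_"
  done
  grep -E "^(${pattern})"
}

# Usage on THB-P-WEB<N>-LINUX (an empty result means the stacks are gone):
#   sudo docker service ls --format '{{.Name}}' | leftover_services <tenant1> <tenant2>
```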
Verify the reboot completed cleanly:
# SSH to THB-P-WEB<N>-WIN
Test-Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending'
# Expect: False
"$((Get-CimInstance Win32_OperatingSystem).Version).$((Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion').UBR)"
# Expect: UBR ticks forward from the pre-reboot value

If RebootPending is still True, see Troubleshooting.
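The UBR comparison is easy to get wrong by eye when only the last field changes. As a small sketch, assuming you recorded the version string printed by the command above before the reboot, the following compares the trailing UBR fields. The helper names and the example version strings in the usage comment are made up for illustration.

```shell
# Hypothetical helpers for comparing two "major.minor.build.UBR" strings.

# ubr_of VERSION: print the last dot-separated field (the UBR).
ubr_of() {
  echo "${1##*.}"
}

# ubr_advanced BEFORE AFTER: succeed only if the UBR grew across the reboot.
ubr_advanced() {
  [ "$(ubr_of "$2")" -gt "$(ubr_of "$1")" ]
}

# Usage (version strings are illustrative, not real measurements):
#   ubr_advanced 10.0.20348.2461 10.0.20348.2527 && echo "CU applied"
```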
Stage 4 — Migrate tenants back to the source VM
For each tenant, re-run WebApp Production Release and InterfaceTaskService Production Release:
- Branch: release
- Tenant: <tenant>
- Deployment VM: tenant-default (no override; the tenant config’s source VM is the right target)
Tip: Use the ha_ado script to deploy multiple tenants with a single command.
# Deploy webapps to their default VMs
zig build scripts -- ha_ado deploy --env p --service webapp --tenants demo,cna,imperial,simpra,thb [--dry-run]
# Deploy InterfaceTaskService stacks
zig build scripts -- ha_ado deploy --env p --service interfacetaskservice --tenants demo,cna,imperial,simpra,thb [--dry-run]

Wait for both stacks to reach 1/1 (the image is cached, so the pull is fast). Verify that cloudflared registers cleanly (same expected log sequence as Stage 1).
Cut traffic back to the source by stopping WEB8’s cloudflared for each tenant:
# On THB-P-WEB8-LINUX
sudo docker service scale <tenant>-prod-webappcaddy-stack_cloudflared=0

Confirm in Papertrail logs (or with repeated curl requests) that responses are coming from the source VM’s image SHA again.
Stage 5 — Clean up WEB8
Once all of the source VM’s tenants are back on the source:
# On THB-P-WEB8-LINUX
for t in <tenant1> <tenant2> <...>; do
  sudo docker stack rm "${t}-prod-webappcaddy-stack"
done
sleep 10
sudo docker service ls # only logspout should remain
sudo docker stack ls # empty

WEB8 is now clean and ready for the next source VM.
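The fixed sleep 10 is a guess at how long stack removal takes. One alternative, sketched below, is to poll until only logspout is left or a deadline passes. The function names are hypothetical, and matching on the substring "logspout" is an assumption about how that service is named on WEB8.

```shell
# Hypothetical helpers; the "logspout" substring match is an assumption.

# list_services: print the names of all swarm services, one per line.
list_services() {
  sudo docker service ls --format '{{.Name}}'
}

# wait_for_cleanup TIMEOUT_SECONDS: succeed once nothing but logspout remains;
# fail and report the stragglers if the deadline passes first.
wait_for_cleanup() {
  local deadline=$1 waited=0 extra
  while :; do
    extra=$(list_services | grep -v logspout || true)
    if [ -z "$extra" ]; then
      return 0
    fi
    if [ "$waited" -ge "$deadline" ]; then
      echo "still present: $extra" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage on THB-P-WEB8-LINUX, after the stack rm loop:
#   wait_for_cleanup 60 && echo "WEB8 is clean"
```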
Troubleshooting
cloudflared is in a restart loop
Each wait step in the entrypoint exits with status 1 on timeout (caddy DNS: 120s, origin chain: 1800s), and swarm respawns the container. The loop causes no customer-facing 502s: cloudflared never reaches exec cloudflared tunnel ... run, so it never registers with Cloudflare, and traffic stays on healthy connectors.
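To make the restart-loop behavior concrete, here is a rough sketch of how such a wait-then-exec entrypoint could be structured. It is not the actual entrypoint: the wait_for helper, the success message, and the probe commands shown in the usage comment are all assumptions. Only the timeouts, the timeout message wording, and the final exec come from this runbook.

```shell
# Illustrative only; the real entrypoint may be structured differently.

# wait_for TIMEOUT LABEL CMD...: retry CMD once per second until it succeeds.
# On timeout, print a message and return 1 so the caller can exit 1,
# which is what makes swarm respawn the container.
wait_for() {
  local timeout=$1 label=$2 waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout after ${timeout}s waiting for ${label}"
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "${label} ready"
}

# Hypothetical flow (probe commands and hostnames are assumptions):
#   wait_for 120  "caddy DNS"    getent hosts caddy       || exit 1
#   wait_for 1800 "origin chain" curl -fsS http://caddy/  || exit 1
#   exec cloudflared tunnel ... run
```

Because the exec line is only reached after both waits succeed, a container stuck in this loop never opens a tunnel connection.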
Diagnose which wait stage is timing out:
sudo docker service logs --tail 50 <tenant>-prod-webappcaddy-stack_cloudflared

| Log message | Likely cause | Fix |
|---|---|---|
| timeout after 120s waiting for caddy DNS | caddy service not in swarm DNS | docker service ls --filter name=<tenant>-prod-webappcaddy-stack_caddy; if 0/1, caddy is failing to start (check its logs) |
| timeout after 1800s waiting for origin chain | caddy up but webapp unreachable | docker service ps <tenant>-prod-webappcaddy-stack_webapp --no-trunc; if Preparing, image still pulling (just wait); if Failed, see the ERROR column |
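The lookup in the table can be applied mechanically to a log stream. The following sketch uses a hypothetical helper, diagnose_cloudflared, that matches only the two message fragments listed above. If both appear in the same log, it reports the caddy DNS stage first.

```shell
# Hypothetical helper; matches only the two timeout messages from the table.

# diagnose_cloudflared: read service logs on stdin and name the wait stage
# that timed out, with a pointer to the matching fix.
diagnose_cloudflared() {
  local log
  log=$(cat)
  case $log in
    *"waiting for caddy DNS"*)
      echo "caddy DNS: check the caddy service and its logs" ;;
    *"waiting for origin chain"*)
      echo "origin chain: check the webapp task state" ;;
    *)
      echo "no wait-stage timeout found" ;;
  esac
}

# Usage:
#   sudo docker service logs --tail 50 \
#     <tenant>-prod-webappcaddy-stack_cloudflared 2>&1 | diagnose_cloudflared
```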
After reboot, RebootPending is still True
The CU didn’t finalize. Reboot the VM a second time via Azure Portal.
If it persists, the staged CU may be corrupted — fall back to
DISM /Online /Cleanup-Image /RevertPendingActions to revert
it (drops security fixes; document the gap).
Cloudflare dashboard shows only one set of connectors
The HA pool isn’t forming. Likely causes:
- WEB8’s cloudflared isn’t actually running: check docker service ls on WEB8
- WEB8 and the source are using different tunnel credentials: confirm tunnelSecret matches in the tenant config
Source VM doesn’t come back from reboot
Check Azure VM status. If stuck “Updating” for >30 min, look at boot diagnostics. CU finalization can run longer for large patches but should never exceed ~30 min.
Validation checklist
Before declaring a VM cycle done:
Warning: Don’t restart Docker on a host with RebootPending=True. It clears the in-memory HCS state that’s keeping existing containers alive and cascades the whole host into the failed state. Always reboot the OS, never just the Docker daemon.