Runbooks
Step-by-step guides for common operations.
Remote Access
- SSH Into GCP VMs — connect to Google Cloud instances via IAP tunnels
- SSH Into Azure VMs — connect to Azure instances via Bastion or direct SSH
Maintenance
- Upgrade Caddy and Cloudflared — update reverse proxy and tunnel agents on production VMs
- Execute a Planned Outage — standardized process for scheduled downtime with stakeholder notification
- Create a New Custom GCP Boot Image — build a new ubuntu-2204-docker-falcon boot image via gcp_compute_image
- Rolling Boot Image Upgrade for Swarm Managers — roll a new boot image onto the three swarm managers without losing quorum
- Rolling Update of Swarm Worker Nodes — replace swarm worker VMs one at a time after a template change, scripted via roll_swarm_workers
- Recover Swarm Manager Quorum Loss — recover Raft quorum after losing two or more swarm managers
- Validate Swarm Worker Autoscale — exercise the end-to-end autoscale lifecycle (MIG provision, swarmctl rebalance, worker drain) before phase rollouts or after autoscaler config changes
- HA Windows Worker Rolling Reboot — blue-green reboot of HA Windows worker VMs with zero customer downtime
Debugging
- Debug Docker Container Issues — diagnose and recover stopped or failed containers
- Verify Azure Secret Values — check that Key Vault secrets match expected values
- Debug Cloudflare 403 Blocks — diagnose API requests blocked by Cloudflare using the Ray ID
Launches
- HA Admin Portal — end-to-end guide for launching a new partner HA Admin Portal
- Legacy CoPo — launch a new legacy Consumer Portal partner
- Benefits Hub — launch a new Benefits Hub tenant on the shared VM
Services
- FileMage SFTP Gateway — web portal and SFTP endpoint, tunneled via Cloudflare, with deploy and rotation procedures