Create a New Custom GCP Boot Image
This runbook covers building a new
ubuntu-2204-docker-falcon (or successor) boot image via the
gcp_compute_image script. The resulting image is what swarm
managers and the swarm worker MIG boot from.
Use this runbook when:
- A new kernel patch, Docker version, Falcon Sensor build, or Ops Agent version needs to land in the base image
- A CVE remediation requires a one-off image rebuild
- You are bootstrapping a new swarm environment
After the image is built, follow Rolling Boot Image Upgrade for Swarm Managers to roll it onto running managers.
Prerequisites
- `gcloud` CLI installed and authenticated: `gcloud auth login`
- Member of `sg-hb-infra-development@thehelperbees.com` (the bar for trusted infra changes; see Setup)
- Compute Admin (or equivalent: instance, disk, image, firewall, network, address create and delete) on `prj-bu1-c-infra-pipeline-5327`
- Secret Manager Accessor on `prj-c-secrets-a7cc` for the three CrowdStrike secrets (`crowdstrike_api_client_id`, `crowdstrike_api_client_secret`, `crowdstrike_customer_id`)
- About 30 minutes of uninterrupted runtime
- A local workstation, not a CI runner. The interactive TUI is the recommended path.
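A quick pre-flight check before you start (a minimal sketch; it assumes the group membership and Secret Manager role listed above are already in place):

```bash
# Confirm an active gcloud login
gcloud auth list --filter=status:ACTIVE --format='value(account)'

# Confirm Secret Manager access on the secrets project
gcloud secrets versions access latest \
  --secret=crowdstrike_customer_id \
  --project=prj-c-secrets-a7cc >/dev/null && echo "secret access OK"
```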
Step 1: Pick the base image and image name
Base image. The current swarm callers are pinned to Ubuntu 22.04. Use:
- `ubuntu-os-cloud/ubuntu-2204-lts` (current production base; match this unless you are deliberately upgrading the OS)
- `ubuntu-os-cloud/ubuntu-2404-lts-amd64` (use when upgrading the OS; coordinate before changing what the swarm boots from)
Naming convention.
`<os>-<docker>-<security>-<date>`, with the date in YYYY-MM-DD form:
- `ubuntu-2204-docker-falcon-2026-04-01`
- `ubuntu-2404-docker-swarm-falcon-v1` (acceptable for milestone images that are not date-keyed)
The swarm caller
(infra/hb-infra/business_unit_1/<env>/swarm.tf)
selects images by exact name, so the name you choose here is what you
will reference in the Terraform bump (Step 4).
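To generate a name that follows the convention (a small sketch; the prefix shown is the current production one, so adjust it if you are baking a different OS or security stack):

```bash
IMAGE_NAME="ubuntu-2204-docker-falcon-$(date +%Y-%m-%d)"
echo "$IMAGE_NAME"   # e.g. ubuntu-2204-docker-falcon-2026-04-01
```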
Step 2: Run the script (interactive)
From the repo root:
```bash
./zig/zig build scripts -- gcp_compute_image
```

The script prompts for four values in sequence:
- Base boot image: select from a list fetched live from GCP (Ubuntu and Debian families).
- GCP region: pick a US-based region (e.g. `us-central1`).
- GCP zone: pick a zone in that region.
- Custom boot image name: enter the name you chose in Step 1.
It then prints a configuration summary and asks for confirmation.
Type y to proceed.
The script runs five phases (see
src/scripts/gcp_compute_image/README.md for full
detail):
- Generate an SSH keypair and a unique 6-hex-char run ID.
- Create temporary infrastructure: VPC (if missing), SSH firewall rule (if missing), static external IP, persistent disk from the base image, n2-highcpu-8 instance, wait for SSH.
- SFTP the `static` directory and run `docker-install.sh`, then reconnect and run `installations.sh` with the Falcon CID.
- Stop the instance, snapshot its disk into a Compute Engine image with the name you chose.
- Tear down the temporary infrastructure (instance, disk, IP; firewall and VPC only if the script created them).
Expect 15 to 25 minutes end to end. Watch the terminal for any
failure markers or ERROR lines. The script prints
structured progress messages and exits non-zero on any failure.
Step 2 (alt): CLI mode
For scripted or repeatable runs, pass all four arguments:
```bash
./zig/zig build scripts -- gcp_compute_image \
  --base-image ubuntu-os-cloud/ubuntu-2204-lts \
  --region us-central1 \
  --zone us-central1-a \
  --image-name ubuntu-2204-docker-falcon-2026-04-01
```

The script enforces all-or-nothing on these four flags. Providing a subset is a hard error.
Use --dry-run to preview the gcloud calls without making
changes:
```bash
./zig/zig build scripts -- gcp_compute_image \
  --base-image ubuntu-os-cloud/ubuntu-2204-lts \
  --region us-central1 \
  --zone us-central1-a \
  --image-name ubuntu-2204-docker-falcon-2026-04-01 \
  --dry-run
```

Step 3: Verify the image exists
```bash
IMAGE_NAME=ubuntu-2204-docker-falcon-2026-04-01
gcloud compute images list \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --no-standard-images \
  --format='table(name,family,creationTimestamp.date(format="%Y-%m-%d"))' \
  --filter="name~^${IMAGE_NAME}$"
```

Expected: one row with the name you chose, an empty
FAMILY column (see Known Sharp Edges below), and today’s
date.
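For a closer look at the new image (optional; a small sketch reusing the `IMAGE_NAME` variable from above), the status should read READY:

```bash
gcloud compute images describe "${IMAGE_NAME}" \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --format='value(status,diskSizeGb,sourceDisk)'
```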
Step 4: Use the image
For swarm managers, follow Rolling Boot Image Upgrade for
Swarm Managers. The procedure bumps boot_image_name in
infra/hb-infra/business_unit_1/<env>/swarm.tf and
rolls a terraform apply -replace against each manager one
at a time.
For the swarm worker MIG, bump boot_image_name in the
worker module call (same file pattern) and run a normal
terraform apply. The MIG retemplates and replaces workers
via its rolling update policy.
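As a sketch of the worker-MIG path (the directory and variable name follow the prose above; the exact resource addresses in your `<env>` directory may differ, so check `terraform state list` before anything more surgical):

```bash
# From the environment directory (substitute the real <env>)
cd infra/hb-infra/business_unit_1/<env>/

# After editing boot_image_name in swarm.tf to the new image name,
# preview and apply; the MIG's rolling update policy replaces workers.
terraform plan
terraform apply
```

For managers, do not apply directly like this; follow the one-at-a-time `-replace` procedure in the manager runbook.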
What gets installed
| Software | Version | Purpose |
|---|---|---|
| Docker (apt) | latest stable from Docker’s repo | Container runtime |
| Docker Swarm | initialized as manager (see Known Sharp Edges) | Container orchestration |
| `core-infra` overlay network | attachable, driver=overlay | Cross-stack service connectivity |
| `google-cloud-cli` (apt) | latest from Google’s apt repo | Replaces the base image’s tarball gcloud (load-bearing for swarm worker startup; see `installations.sh` comment) |
| `docker-credential-gcr` | 1.5.0 | GCR auth helper, configured for gcr.io, us.gcr.io, etc. |
| CrowdStrike Falcon Sensor | latest Ubuntu sensor from the CrowdStrike API | Endpoint security; service started and enabled at boot |
| `uv` | latest from astral.sh installer | Python package manager |
| Ansible | latest, installed via `uv tool install ansible` | Configuration management |
| `docker-system-cleanup.sh` | shipped to `/home/ubuntu/static/scripts/` | On-demand cleanup helper, see Docker Troubleshooting |
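To spot-check a booted instance against this table (a hedged sketch; it assumes you have SSH'd onto an instance built from the image, and that the Falcon service unit is named `falcon-sensor`):

```bash
docker --version
docker info --format 'swarm state: {{.Swarm.LocalNodeState}}'
docker network ls --filter name=core-infra
gcloud --version | head -1
command -v docker-credential-gcr uv ansible
systemctl is-active falcon-sensor
```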
Known Sharp Edges
Image ships with an active swarm
installations.sh runs docker swarm init
during the bake. The resulting image carries Raft state in
/var/lib/docker/swarm and boots Swarm: active
as a manager of its own single-node cluster.
This is not intentional state for swarm worker hosts. On 2026-05-05,
this caused new workers to silently sit in their orphan single-node
swarm and never join the production cluster. PRs #795 and #796 patch
worker-startup.sh.tftpl to detect the orphan
(Swarm: active AND ControlAvailable=true) and
force a leave plus rejoin. Swarm managers are not affected by the orphan
because their startup logic explicitly handles it.
The proper root-cause fix (drop docker swarm init from
installations.sh) is tracked separately. Until that ships,
do not use images built from this script for any non-swarm role without
first running
docker swarm leave --force && systemctl restart docker
on the booted instance, or accepting that a worker-style startup script
must do it for you.
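A hedged sketch of the orphan check described above (the real logic lives in `worker-startup.sh.tftpl`; the field names come from `docker info`'s Go-template output, and the rejoin details are omitted):

```bash
state=$(docker info --format '{{.Swarm.LocalNodeState}}')
control=$(docker info --format '{{.Swarm.ControlAvailable}}')

if [ "$state" = "active" ] && [ "$control" = "true" ]; then
  # Orphan single-node swarm baked into the image: leave it before joining the real cluster
  docker swarm leave --force
  systemctl restart docker
  # ...then join the production cluster with the usual worker join token
fi
```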
Image family is not set
The script’s gcloud compute images create call omits
--family. Today’s callers select images by name and this is
fine, but family-based filtering
(gcloud compute images list --filter='family:...') returns
nothing for our images. If you need family selection, set it manually
after the build:
```bash
gcloud compute images update <image-name> \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --family=ubuntu-2204-docker-falcon
```

Adding a --family flag to the script is tracked separately.
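Once the family is set, family-based filtering works as expected (a small check using the filter syntax quoted above):

```bash
gcloud compute images list \
  --project=prj-bu1-c-infra-pipeline-5327 \
  --no-standard-images \
  --filter='family:ubuntu-2204-docker-falcon'
```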
GCP project is hardcoded
GCP_PROJECT_ID and GCP_SECRETS_PROJECT_ID
are constants in main.py, not flags. Targeting a different
project requires a code edit.
Concurrency
Two runs at once are name-safe (each gets a unique 6-hex run ID for
instance and disk names) but share the default VPC and
default-allow-ssh firewall. The script’s “only delete what
I created” cleanup logic protects against tearing down resources another
run depends on, but parallel runs are uncommon and untested. Avoid
unless necessary.
Failure Recovery
The script’s own cleanup (Phase 5) is idempotent and runs on any
unhandled exception. If the script crashed before Phase 5 (e.g. you
killed it with Ctrl-C, or your machine went to sleep mid-run), tear down
the leftover resources manually. Each run names its resources
gcp-compute-image-<run_id>:
```bash
PROJECT=prj-bu1-c-infra-pipeline-5327

# Find leftover instances
gcloud compute instances list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover disks
gcloud compute disks list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover external IPs
gcloud compute addresses list --project=$PROJECT --filter='name~^gcp-compute-image-'
```

Delete each one with the corresponding delete command.
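For example (a sketch; substitute the actual resource name from the listings above, and the zone/region the run actually used, shown here as us-central1-a/us-central1):

```bash
NAME=gcp-compute-image-<run_id>
gcloud compute instances delete $NAME --project=$PROJECT --zone=us-central1-a --quiet
gcloud compute disks delete $NAME --project=$PROJECT --zone=us-central1-a --quiet
gcloud compute addresses delete $NAME --project=$PROJECT --region=us-central1 --quiet
```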
The default-allow-ssh firewall and default VPC
are safe to leave in place. They pre-exist or are created by other
workloads.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `gcloud CLI not found` at startup | `gcloud` not on PATH | Install Google Cloud SDK; verify `gcloud --version` |
| `Failed to fetch secret 'crowdstrike_api_*'` | Not authenticated, or no access to `prj-c-secrets-a7cc` | `gcloud auth login`; ask for Secret Manager Accessor on the project |
| `Authentication failed: Invalid client_id or client_secret` from CrowdStrike | Secrets are stale (CrowdStrike rotated them) | Update the values in Secret Manager; re-run |
| `Failed to establish SSH connection` after instance create | SSH service not yet ready, or firewall data-plane lag | Wait 30s and re-run; verify `default-allow-ssh` allows tcp:22 from your IP |
| Permission denied on a `gcloud compute *` call | Missing Compute Admin (or finer-grained delete perms) | Get added to the appropriate role on `prj-bu1-c-infra-pipeline-5327` |
| Image name already exists | `--force` is set on `images create`, so it overwrites | If you wanted a new image, choose a new name; otherwise no action needed |
| Script killed mid-run, partial resources left over | Cleanup didn’t reach Phase 5 | See Failure Recovery |
References
- `src/scripts/gcp_compute_image/README.md`: full developer-level script documentation (in the repo, not rendered in this docs site)
- Rolling Boot Image Upgrade for Swarm Managers: what to do with the image once built
- Docker Troubleshooting: on-demand `docker-system-cleanup.sh` usage
- Setup: global gcloud and group access prerequisites