
Create a New Custom GCP Boot Image

This runbook covers building a new ubuntu-2204-docker-falcon (or successor) boot image via the gcp_compute_image script. The resulting image is what swarm managers and the swarm worker MIG boot from.

Use this runbook when:

  • A new kernel patch, Docker version, Falcon Sensor build, or Ops Agent version needs to land in the base image
  • A CVE remediation requires a one-off image rebuild
  • You are bootstrapping a new swarm environment

After the image is built, follow Rolling Boot Image Upgrade for Swarm Managers to roll it onto running managers.

Prerequisites

  • gcloud CLI installed and authenticated: gcloud auth login
  • Member of sg-hb-infra-development@thehelperbees.com (the bar for trusted infra changes; see Setup)
  • Compute Admin (or equivalent: instance, disk, image, firewall, network, address create and delete) on prj-bu1-c-infra-pipeline-5327
  • Secret Manager Accessor on prj-c-secrets-a7cc for the three CrowdStrike secrets (crowdstrike_api_client_id, crowdstrike_api_client_secret, crowdstrike_customer_id)
  • About 30 minutes of uninterrupted runtime
  • A local workstation, not a CI runner. The interactive TUI is the recommended path.
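To confirm the auth and Secret Manager prerequisites above before committing to a half-hour run, a quick preflight (a minimal sketch; the secret read is discarded and only proves access):

gcloud auth list --filter=status:ACTIVE --format='value(account)'

gcloud secrets versions access latest \
    --secret=crowdstrike_customer_id \
    --project=prj-c-secrets-a7cc > /dev/null && echo 'secret access OK'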

Step 1: Pick the base image and image name

Base image. The current swarm callers are pinned to Ubuntu 22.04. Use:

  • ubuntu-os-cloud/ubuntu-2204-lts (current production base; match this unless you are deliberately upgrading the OS)
  • ubuntu-os-cloud/ubuntu-2404-lts-amd64 (use when upgrading the OS; coordinate before changing what the swarm boots from)

Naming convention. <os>-<docker>-<security>-<date> with date in YYYY-MM-DD form:

  • ubuntu-2204-docker-falcon-2026-04-01
  • ubuntu-2404-docker-swarm-falcon-v1 (acceptable for milestone images that are not date-keyed)

The swarm caller (infra/hb-infra/business_unit_1/<env>/swarm.tf) selects images by exact name, so the name you choose here is what you will reference in the Terraform bump (Step 4).
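If you prefer to generate the date suffix rather than type it, date +%F emits the YYYY-MM-DD form the convention expects:

IMAGE_NAME="ubuntu-2204-docker-falcon-$(date +%F)"
echo "$IMAGE_NAME"   # e.g. ubuntu-2204-docker-falcon-2026-04-01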

Step 2: Run the script (interactive)

From the repo root:

./zig/zig build scripts -- gcp_compute_image

The script prompts for four values in sequence:

  1. Base boot image: select from a list fetched live from GCP (Ubuntu and Debian families).
  2. GCP region: pick a US-based region (e.g. us-central1).
  3. GCP zone: pick a zone in that region.
  4. Custom boot image name: enter the name you chose in Step 1.

It then prints a configuration summary and asks for confirmation. Type y to proceed.

The script runs five phases (see src/scripts/gcp_compute_image/README.md for full detail):

  1. Generate an SSH keypair and a unique 6-hex-char run ID.
  2. Create temporary infrastructure: VPC (if missing), SSH firewall rule (if missing), static external IP, persistent disk from the base image, n2-highcpu-8 instance, wait for SSH.
  3. SFTP the static directory and run docker-install.sh, then reconnect and run installations.sh with the Falcon CID.
  4. Stop the instance, snapshot its disk into a Compute Engine image with the name you chose.
  5. Tear down the temporary infrastructure (instance, disk, IP; firewall and VPC only if the script created them).

Expect 15 to 25 minutes end to end. Watch the terminal for any failure markers or ERROR lines. The script prints structured progress messages and exits non-zero on any failure.

Step 2 (alt): CLI mode

For scripted or repeatable runs, pass all four arguments:

./zig/zig build scripts -- gcp_compute_image \
    --base-image ubuntu-os-cloud/ubuntu-2204-lts \
    --region us-central1 \
    --zone us-central1-a \
    --image-name ubuntu-2204-docker-falcon-2026-04-01

The script enforces all-or-nothing on these four flags. Providing a subset is a hard error.

Use --dry-run to preview the gcloud calls without making changes:

./zig/zig build scripts -- gcp_compute_image \
    --base-image ubuntu-os-cloud/ubuntu-2204-lts \
    --region us-central1 \
    --zone us-central1-a \
    --image-name ubuntu-2204-docker-falcon-2026-04-01 \
    --dry-run

Step 3: Verify the image exists

IMAGE_NAME=ubuntu-2204-docker-falcon-2026-04-01

gcloud compute images list \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --no-standard-images \
    --format='table(name,family,creationTimestamp.date(format="%Y-%m-%d"))' \
    --filter="name~^${IMAGE_NAME}$"

Expected: one row with the name you chose, an empty FAMILY column (see Known Sharp Edges below), and today’s date.
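For a closer look at the finished artifact (status, size, source disk), describe it:

gcloud compute images describe "$IMAGE_NAME" \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --format='yaml(name,status,diskSizeGb,sourceDisk,creationTimestamp)'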

Step 4: Use the image

For swarm managers, follow Rolling Boot Image Upgrade for Swarm Managers. The procedure bumps boot_image_name in infra/hb-infra/business_unit_1/<env>/swarm.tf and rolls a terraform apply -replace against each manager one at a time.

For the swarm worker MIG, bump boot_image_name in the worker module call (same file pattern) and run a normal terraform apply. The MIG retemplates and replaces workers via its rolling update policy.
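A minimal sketch of the worker bump, assuming boot_image_name is set directly in the worker module call (the exact layout of swarm.tf may differ):

cd infra/hb-infra/business_unit_1/<env>

# In swarm.tf, point the worker module at the new image:
#   boot_image_name = "ubuntu-2204-docker-falcon-2026-04-01"

terraform plan    # expect the worker instance template to be replaced
terraform apply   # the MIG then rolls workers per its update policy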

What gets installed

| Software | Version | Purpose |
| --- | --- | --- |
| Docker (apt) | latest stable from Docker’s repo | Container runtime |
| Docker Swarm | initialized as manager (see Known Sharp Edges) | Container orchestration |
| core-infra overlay network | attachable, driver=overlay | Cross-stack service connectivity |
| google-cloud-cli (apt) | latest from Google’s apt repo | Replaces the base image’s tarball gcloud (load-bearing for swarm worker startup; see installations.sh comment) |
| docker-credential-gcr | 1.5.0 | GCR auth helper, configured for gcr.io, us.gcr.io, etc. |
| CrowdStrike Falcon Sensor | latest Ubuntu sensor from the CrowdStrike API | Endpoint security; service started and enabled at boot |
| uv | latest from astral.sh installer | Python package manager |
| Ansible | latest, installed via uv tool install ansible | Configuration management |
| docker-system-cleanup.sh | shipped to /home/ubuntu/static/scripts/ | On-demand cleanup helper; see Docker Troubleshooting |
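A quick smoke test over SSH on an instance booted from the image, assuming the sensor's systemd unit carries CrowdStrike's standard name falcon-sensor:

docker --version
docker info --format '{{.Swarm.LocalNodeState}}'   # prints "active" (see Known Sharp Edges)
gcloud --version | head -1                         # the apt google-cloud-cli, not the base image's tarball
systemctl is-active falcon-sensor
uv --version
ansible --version | head -1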

Known Sharp Edges

Image ships with an active swarm

installations.sh runs docker swarm init during the bake. The resulting image carries Raft state in /var/lib/docker/swarm and boots Swarm: active as a manager of its own single-node cluster.

This state is unintended for swarm worker hosts. On 2026-05-05 it caused new workers to sit silently in their orphan single-node swarm and never join the production cluster. PRs #795 and #796 patch worker-startup.sh.tftpl to detect the orphan (Swarm: active AND ControlAvailable=true) and force a leave plus rejoin. Swarm managers are unaffected because their startup logic explicitly handles this case.

The proper root-cause fix (drop docker swarm init from installations.sh) is tracked separately. Until that ships, do not use images built from this script for any non-swarm role without first running docker swarm leave --force && systemctl restart docker on the booted instance, or accepting that a worker-style startup script must do it for you.
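A shell sketch of that detection and remediation, using the docker info fields that correspond to the condition above:

# An orphan is a node that is an active manager of its own single-node swarm
STATE=$(docker info --format '{{.Swarm.LocalNodeState}}')
MANAGER=$(docker info --format '{{.Swarm.ControlAvailable}}')

if [ "$STATE" = "active" ] && [ "$MANAGER" = "true" ]; then
    docker swarm leave --force
    sudo systemctl restart docker
fi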

Image family is not set

The script’s gcloud compute images create call omits --family. Today’s callers select images by name and this is fine, but family-based filtering (gcloud compute images list --filter='family:...') returns nothing for our images. If you need family selection, set it manually after the build:

gcloud compute images update <image-name> \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --family=ubuntu-2204-docker-falcon

Adding a --family flag to the script is tracked separately.
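Once the family is set, callers can resolve "newest image in this family" instead of pinning exact names:

gcloud compute images describe-from-family ubuntu-2204-docker-falcon \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --format='value(name)'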

GCP project is hardcoded

GCP_PROJECT_ID and GCP_SECRETS_PROJECT_ID are constants in main.py, not flags. Targeting a different project requires a code edit.

Concurrency

Two runs at once are name-safe (each gets a unique 6-hex run ID for instance and disk names) but share the default VPC and default-allow-ssh firewall. The script’s “only delete what I created” cleanup logic protects against tearing down resources another run depends on, but parallel runs are uncommon and untested. Avoid unless necessary.

Failure Recovery

The script’s own cleanup (Phase 5) is idempotent and runs on any unhandled exception. If the script crashed before Phase 5 (e.g. you killed it with Ctrl-C, or your machine went to sleep mid-run), tear down the leftover resources manually. Each run names its resources gcp-compute-image-<run_id>:

PROJECT=prj-bu1-c-infra-pipeline-5327

# Find leftover instances
gcloud compute instances list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover disks
gcloud compute disks list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover external IPs
gcloud compute addresses list --project=$PROJECT --filter='name~^gcp-compute-image-'

The default-allow-ssh firewall and default VPC are safe to leave in place: they either pre-exist or are used by other workloads. Delete each of the remaining resources with the corresponding delete command.
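A sketch of those deletions, assuming the run used us-central1 / us-central1-a (substitute your run's region and zone, and the exact names from the list output):

gcloud compute instances delete gcp-compute-image-<run_id> \
    --project=$PROJECT --zone=us-central1-a --quiet

gcloud compute disks delete gcp-compute-image-<run_id> \
    --project=$PROJECT --zone=us-central1-a --quiet

gcloud compute addresses delete gcp-compute-image-<run_id> \
    --project=$PROJECT --region=us-central1 --quiet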

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| gcloud CLI not found at startup | gcloud not on PATH | Install the Google Cloud SDK; verify with gcloud --version |
| Failed to fetch secret 'crowdstrike_api_*' | Not authenticated, or no access to prj-c-secrets-a7cc | gcloud auth login; ask for Secret Manager Accessor on the project |
| Authentication failed: Invalid client_id or client_secret from CrowdStrike | Secrets are stale (CrowdStrike rotated them) | Update the values in Secret Manager; re-run |
| Failed to establish SSH connection after instance create | SSH service not yet ready, or firewall data-plane lag | Wait 30s and re-run; verify default-allow-ssh allows tcp:22 from your IP |
| Permission denied on a gcloud compute * call | Missing Compute Admin (or finer-grained delete perms) | Get added to the appropriate role on prj-bu1-c-infra-pipeline-5327 |
| Image name already exists | --force is set on images create, so it overwrites | If you wanted a new image, choose a new name; otherwise no action needed |
| Script killed mid-run, partial resources left over | Cleanup didn’t reach Phase 5 | See Failure Recovery |
