
Create a New Custom GCP Boot Image

This runbook covers building a new ubuntu-2204-docker-falcon (or successor) boot image via the gcp_compute_image script. The resulting image is what swarm managers and the swarm worker MIG boot from.

Use this runbook when:

  • A new kernel patch, Docker version, Falcon Sensor build, or Ops Agent version needs to land in the base image
  • A CVE remediation requires a one-off image rebuild
  • You are bootstrapping a new swarm environment

After the image is built, follow Rolling Boot Image Upgrade for Swarm Managers to roll it onto running managers.

Prerequisites

  • gcloud CLI installed and authenticated: gcloud auth login
  • Member of sg-hb-infra-development@thehelperbees.com (the bar for trusted infra changes; see Setup)
  • Compute Admin (or equivalent: instance, disk, image, firewall, network, address create and delete) on prj-bu1-c-infra-pipeline-5327
  • Secret Manager Accessor on prj-c-secrets-a7cc for the three CrowdStrike secrets (crowdstrike_api_client_id, crowdstrike_api_client_secret, crowdstrike_customer_id)
  • About 30 minutes of uninterrupted runtime
  • A local workstation, not a CI runner. The interactive TUI is the recommended path.
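To confirm the auth and Secret Manager prerequisites above before committing to a half-hour run, a quick preflight (a minimal sketch; the secret read is discarded and only proves access):

gcloud auth list --filter=status:ACTIVE --format='value(account)'

gcloud secrets versions access latest \
    --secret=crowdstrike_customer_id \
    --project=prj-c-secrets-a7cc > /dev/null && echo 'secret access OK'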

Step 1: Pick the base image and image name

Base image. The current swarm callers are pinned to Ubuntu 22.04. Use:

  • ubuntu-os-cloud/ubuntu-2204-lts (current production base; match this unless you are deliberately upgrading the OS)
  • ubuntu-os-cloud/ubuntu-2404-lts-amd64 (use when upgrading the OS; coordinate before changing what the swarm boots from)

Naming convention. <os>-<docker>-<security>-<date> with date in YYYY-MM-DD form:

  • ubuntu-2204-docker-falcon-2026-04-01
  • ubuntu-2404-docker-swarm-falcon-v1 (acceptable for milestone images that are not date-keyed)

The swarm caller (infra/hb-infra/business_unit_1/<env>/swarm.tf) selects images by exact name, so the name you choose here is what you will reference in the Terraform bump (Step 4).
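If you prefer to generate the date suffix rather than type it, date +%F emits the YYYY-MM-DD form the convention expects:

IMAGE_NAME="ubuntu-2204-docker-falcon-$(date +%F)"
echo "$IMAGE_NAME"   # e.g. ubuntu-2204-docker-falcon-2026-04-01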

Step 2: Run the script (interactive)

From the repo root:

./zig/zig build scripts -- gcp_compute_image

The script prompts for four values in sequence:

  1. Base boot image: select from a list fetched live from GCP (Ubuntu and Debian families).
  2. GCP region: pick a US-based region (e.g. us-central1).
  3. GCP zone: pick a zone in that region.
  4. Custom boot image name: enter the name you chose in Step 1.

It then prints a configuration summary and asks for confirmation. Type y to proceed.

The script runs five phases (see src/scripts/gcp_compute_image/README.md for full detail):

  1. Generate an SSH keypair and a unique 6-hex-char run ID.
  2. Create temporary infrastructure: VPC (if missing), SSH firewall rule (if missing), static external IP, persistent disk from the base image, n2-highcpu-8 instance, wait for SSH.
  3. SFTP the static directory and run docker-install.sh, then reconnect and run installations.sh with the Falcon CID.
  4. Stop the instance, snapshot its disk into a Compute Engine image with the name you chose.
  5. Tear down the temporary infrastructure (instance, disk, IP; firewall and VPC only if the script created them).

Expect 15 to 25 minutes end to end. Watch the terminal for any failure markers or ERROR lines. The script prints structured progress messages and exits non-zero on any failure.

Step 2 (alt): CLI mode

For scripted or repeatable runs, pass all four arguments:

./zig/zig build scripts -- gcp_compute_image \
    --base-image ubuntu-os-cloud/ubuntu-2204-lts \
    --region us-central1 \
    --zone us-central1-a \
    --image-name ubuntu-2204-docker-falcon-2026-04-01

The script enforces all-or-nothing on these four flags. Providing a subset is a hard error.

Use --dry-run to preview the gcloud calls without making changes:

./zig/zig build scripts -- gcp_compute_image \
    --base-image ubuntu-os-cloud/ubuntu-2204-lts \
    --region us-central1 \
    --zone us-central1-a \
    --image-name ubuntu-2204-docker-falcon-2026-04-01 \
    --dry-run

Step 3: Verify the image exists

IMAGE_NAME=ubuntu-2204-docker-falcon-2026-04-01

gcloud compute images list \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --no-standard-images \
    --format='table(name,family,creationTimestamp.date(format="%Y-%m-%d"))' \
    --filter="name~^${IMAGE_NAME}$"

Expected: one row with the name you chose, an empty FAMILY column (see Known Sharp Edges below), and today’s date.
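For a closer look at the finished artifact (status, size, source disk), describe it:

gcloud compute images describe "$IMAGE_NAME" \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --format='yaml(name,status,diskSizeGb,sourceDisk,creationTimestamp)'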

Step 4: Use the image

For swarm managers, follow Rolling Boot Image Upgrade for Swarm Managers. The procedure bumps boot_image_name in infra/hb-infra/business_unit_1/<env>/swarm.tf and rolls a terraform apply -replace against each manager one at a time.

For the swarm worker MIG, bump boot_image_name in the worker module call (same file pattern) and run a normal terraform apply. The MIG retemplates and replaces workers via its rolling update policy.
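A minimal sketch of the worker bump, assuming boot_image_name is set directly in the worker module call (the exact layout of swarm.tf may differ):

cd infra/hb-infra/business_unit_1/<env>

# In swarm.tf, point the worker module at the new image:
#   boot_image_name = "ubuntu-2204-docker-falcon-2026-04-01"

terraform plan    # expect the worker instance template to be replaced
terraform apply   # the MIG then rolls workers per its update policy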

What gets installed

| Software | Version | Purpose |
| --- | --- | --- |
| Docker (apt) | latest stable from Docker’s repo | Container runtime |
| Docker Swarm | initialized as manager (see Known Sharp Edges) | Container orchestration |
| core-infra overlay network | attachable, driver=overlay | Cross-stack service connectivity |
| google-cloud-cli (apt) | latest from Google’s apt repo | Replaces the base image’s tarball gcloud (load-bearing for swarm worker startup; see installations.sh comment) |
| docker-credential-gcr | 1.5.0 | GCR auth helper, configured for gcr.io, us.gcr.io, etc. |
| CrowdStrike Falcon Sensor | latest Ubuntu sensor from the CrowdStrike API | Endpoint security; service started and enabled at boot |
| uv | latest from astral.sh installer | Python package manager |
| Ansible | latest, installed via uv tool install ansible | Configuration management |
| docker-system-cleanup.sh | shipped to /home/ubuntu/static/scripts/ | On-demand cleanup helper; see Docker Troubleshooting |
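A quick smoke test over SSH on an instance booted from the image, assuming the sensor's systemd unit carries CrowdStrike's standard name falcon-sensor:

docker --version
docker info --format '{{.Swarm.LocalNodeState}}'   # prints "active" (see Known Sharp Edges)
gcloud --version | head -1                         # the apt google-cloud-cli, not the base image's tarball
systemctl is-active falcon-sensor
uv --version
ansible --version | head -1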

Known Sharp Edges

Image ships with an active swarm

installations.sh runs docker swarm init during the bake. The resulting image carries Raft state in /var/lib/docker/swarm and boots Swarm: active as a manager of its own single-node cluster.

This state is unintended for swarm worker hosts. On 2026-05-05 it caused new workers to sit silently in their orphan single-node swarm and never join the production cluster. PRs #795 and #796 patch worker-startup.sh.tftpl to detect the orphan (Swarm: active AND ControlAvailable=true) and force a leave plus rejoin. Swarm managers are unaffected because their startup logic explicitly handles this case.

The proper root-cause fix (drop docker swarm init from installations.sh) is tracked separately. Until that ships, do not use images built from this script for any non-swarm role without first running docker swarm leave --force && systemctl restart docker on the booted instance, or accepting that a worker-style startup script must do it for you.
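A shell sketch of that detection and remediation, using the docker info fields that correspond to the condition above:

# An orphan is a node that is an active manager of its own single-node swarm
STATE=$(docker info --format '{{.Swarm.LocalNodeState}}')
MANAGER=$(docker info --format '{{.Swarm.ControlAvailable}}')

if [ "$STATE" = "active" ] && [ "$MANAGER" = "true" ]; then
    docker swarm leave --force
    sudo systemctl restart docker
fi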

Image family is not set

The script’s gcloud compute images create call omits --family. Today’s callers select images by name and this is fine, but family-based filtering (gcloud compute images list --filter='family:...') returns nothing for our images. If you need family selection, set it manually after the build:

gcloud compute images update <image-name> \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --family=ubuntu-2204-docker-falcon

Adding a --family flag to the script is tracked separately.
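Once the family is set, callers can resolve "newest image in this family" instead of pinning exact names:

gcloud compute images describe-from-family ubuntu-2204-docker-falcon \
    --project=prj-bu1-c-infra-pipeline-5327 \
    --format='value(name)'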

GCP project is hardcoded

GCP_PROJECT_ID and GCP_SECRETS_PROJECT_ID are constants in main.py, not flags. Targeting a different project requires a code edit.

Concurrency

Two runs at once are name-safe (each gets a unique 6-hex run ID for instance and disk names) but share the default VPC and default-allow-ssh firewall. The script’s “only delete what I created” cleanup logic protects against tearing down resources another run depends on, but parallel runs are uncommon and untested. Avoid unless necessary.

Failure Recovery

The script’s own cleanup (Phase 5) is idempotent and runs on any unhandled exception. If the script crashed before Phase 5 (e.g. you killed it with Ctrl-C, or your machine went to sleep mid-run), tear down the leftover resources manually. Each run names its resources gcp-compute-image-<run_id>:

PROJECT=prj-bu1-c-infra-pipeline-5327

# Find leftover instances
gcloud compute instances list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover disks
gcloud compute disks list --project=$PROJECT --filter='name~^gcp-compute-image-'

# Find leftover external IPs
gcloud compute addresses list --project=$PROJECT --filter='name~^gcp-compute-image-'

The default-allow-ssh firewall and default VPC are safe to leave in place: they either pre-exist or are used by other workloads. Delete each of the remaining resources with the corresponding delete command.
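A sketch of those deletions, assuming the run used us-central1 / us-central1-a (substitute your run's region and zone, and the exact names from the list output):

gcloud compute instances delete gcp-compute-image-<run_id> \
    --project=$PROJECT --zone=us-central1-a --quiet

gcloud compute disks delete gcp-compute-image-<run_id> \
    --project=$PROJECT --zone=us-central1-a --quiet

gcloud compute addresses delete gcp-compute-image-<run_id> \
    --project=$PROJECT --region=us-central1 --quiet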

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| gcloud CLI not found at startup | gcloud not on PATH | Install the Google Cloud SDK; verify with gcloud --version |
| Failed to fetch secret 'crowdstrike_api_*' | Not authenticated, or no access to prj-c-secrets-a7cc | gcloud auth login; ask for Secret Manager Accessor on the project |
| Authentication failed: Invalid client_id or client_secret from CrowdStrike | Secrets are stale (CrowdStrike rotated them) | Update the values in Secret Manager; re-run |
| Failed to establish SSH connection after instance create | SSH service not yet ready, or firewall data-plane lag | Wait 30s and re-run; verify default-allow-ssh allows tcp:22 from your IP |
| Permission denied on a gcloud compute * call | Missing Compute Admin (or finer-grained delete perms) | Get added to the appropriate role on prj-bu1-c-infra-pipeline-5327 |
| Image name already exists | --force is set on images create, so it overwrites | If you wanted a new image, choose a new name; otherwise no action needed |
| Script killed mid-run, partial resources left over | Cleanup didn’t reach Phase 5 | See Failure Recovery |
