Docker Swarm Outbound IP Audit

Status: Ready for review. All 8 us-central1 NAT IPs (4 non-prod + 4 prod) verified live as of 2026-04-30. Per the 2026-05-12 policy revision in §2, allowlists also include the 4 us-west1 reserve IPs per env (16 IPs total) for forward compatibility.
Author: miko.hadikusuma
Created: 2026-04-29
Updated: 2026-05-06
Jira: HB-8210
Gates: HB-8218 (Allowlist updates) → HB-8232 (Deploy production swarm)


Table of Contents

  1. Purpose & Scope
  2. Current vs. Target State
  3. Audit by Surface
    1. GCP Cloud SQL authorized_networks
    2. Azure SQL Firewall (HA Apps Only)
    3. Cloudflare Access Groups (IP Bypass)
    4. Caddy VPN Bypass List
    5. Partner API Webhooks / External Integrations
    6. Other (lower priority but worth verifying)
  4. Cutover Sequence (handoff to HB-8218)
  5. Open Questions
  6. Action Item Traceability
  7. References

1. Purpose & Scope

When the nine Phase 0 THB applications consolidate onto shared Docker Swarm workers, every per-VM external IP currently used as the source of an outbound connection disappears. Workers have no external IP and egress through a shared Cloud NAT with 4 static IPs per region (us-central1 is the active region); see §2 for the verified set.

Anything that allowlists today’s per-VM IPs needs to be updated to the new NAT IPs before workloads cut over. This audit enumerates every such allowlist surface, where the rule lives, and the migration action needed. The deliverable feeds directly into HB-8218 (the implementation ticket).

Apps in scope (canonical list per ./zig/zig build scripts -- compose-file-linter --list): bees, benefits-platform, consumer_portal, django_homealign, hb-buzz, hbcrm, pd, thb-keycloak, the_consumer_portal. (HA apps — Azure-side — are out of scope for the swarm migration but their Azure SQL firewall is still affected; see §3.2.)

In scope: anywhere a current VM’s external IP is referenced as a source/allowed origin for outbound calls from one of these nine apps.

Out of scope:


2. Current vs. Target State

Current per-VM external IPs

Discovered via data.google_compute_addresses.* and per-resource google_compute_address definitions:

| App grouping | Resource (prod) | Notes |
|---|---|---|
| HB shared VM (hbcrm, bea_hbcrm, hb_buzz, django_homealign, eligibility, keycloak/thb-keycloak, hbtemplate, unleash, posthog) | google_compute_address.external_migrate (infra/hb-infra/business_unit_1/production/main.tf:22) | One VM hosts ~9 apps. thb-keycloak repo deploys the keycloak service here. |
| Benefits Hub (benefits-platform repo, multiple tenant variants) | google_compute_address.external (infra/hb-infra/business_unit_1/production/benefits_platform.tf:6) | |
| Bees Flask API (bees repo) | module.bees-p-vm2.instance_external_ip (infra/bees-infra/business_unit_1/production/main.tf:23) | Standalone VM in its own bees-infra project |
| Legacy Consumer Portal (consumer_portal repo, multi-partner) | google_compute_address.cp_external (infra/pd-infra/business_unit_1/production/consumer_portal.tf:7) | cp-p-vm. Hosts hbcp_jh, hbcp_aarp, hbcp_sompo, hbcp_pru, hbcp_bcbs_ar, hbcp_cgi, etc. Source of HB-8208 gcsfuse work. |
| New Consumer Portal, CoPo 3.0 (the_consumer_portal repo) | google_compute_address.external (infra/pd-infra/business_unit_1/production/portal-the-consumer-portal.tf:9) | portal-p-vm. Auth0 custom domain integration; PR preview env target (HB-7963). |
| ADO agent (deploys HomeAlign Apps) | google_compute_address.ado_agent_external_ip (infra/ha-infra/business_unit_1/production/azure_devops.tf:83) | Serves Azure DevOps SSH/deploy traffic into HA (HomeAlign) Windows VMs. HA apps are not in swarm scope; ADO agent stays put → no action. |
| 22+ PD partner VMs | google_compute_address.external per infra/pd-infra/business_unit_1/production/pdp*/main.tf:12 | One VM per partner |

Quick inventory command:

gcloud compute addresses list --filter="address_type=EXTERNAL" --format="table(name,address,project)" \
  --project=<env-project-id>

Target swarm NAT IPs

Provisioned per HB-8217 / HB-8360 as Cloud NAT on the vpc-{env}-shared-base shared VPC. Capacity: 4 IPs × 64,512 ports / 2,048 min_ports_per_vm = 126 VM ceiling per region.
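The capacity figure can be reproduced with integer arithmetic. This is a minimal sketch; the function name nat_vm_ceiling is illustrative and not part of any repo script.

```shell
#!/usr/bin/env bash
# Cloud NAT VM ceiling: total NAT ports divided by the per-VM minimum.
# Arguments: number of NAT IPs, usable ports per IP, min_ports_per_vm.
nat_vm_ceiling() {
  local nat_ips="$1" ports_per_ip="$2" min_ports_per_vm="$3"
  # Integer division: a VM that cannot get its full minimum does not count.
  echo $(( nat_ips * ports_per_ip / min_ports_per_vm ))
}

# Figures quoted above: 4 IPs, 64,512 ports each, 2,048 ports per VM.
nat_vm_ceiling 4 64512 2048   # prints 126
```

Raising nat_min_ports_per_vm lowers the ceiling proportionally, so the same function can be used to sanity-check a proposed change before it lands in gcp-networks.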

Canonical TF source (separate repo): thehelperbees/gcp-networks

| Resource | Path |
|---|---|
| Cloud Router (cr-{env_code}-shared-base-spoke-us-central1-nat-router) | modules/base_shared_vpc/nat.tf:22-32 |
| NAT external IPs (ca-{env_code}-shared-base-spoke-{us-central1,us-west1}-{0..3}) | modules/base_shared_vpc/nat.tf:34-39 (count = var.nat_num_addresses_region1 per region; us-central1 active, us-west1 reserve per §2 policy) |
| Router NAT (rn-{env_code}-shared-base-spoke-us-central1-egress) | modules/base_shared_vpc/nat.tf:41-56 |
| Module call with nat_num_addresses_region1 = 4, nat_min_ports_per_vm = 2048 | envs/{production,non-production}/boa_vpc_fw.tf:45-47 |
| HB-8217 swarm firewall rules (TCP 2377, TCP+UDP 7946, UDP 4789, TCP 9323), tag-scoped to swarm-node | modules/base_shared_vpc/firewall.tf:254-345 |

NAT IPs to allowlist (verified 2026-04-30 against live state):

All four IPs per region are attached round-robin to that region’s Router NAT, so every IP must be in every allowlist — partial allowlisting causes intermittent failures as workers cycle through unallowlisted IPs. Per the 2026-05-12 policy revision below, both regions’ IPs are included in every allowlist (16 total — 4 us-central1 + 4 us-west1 per env), even though only us-central1 carries swarm traffic today.

| Environment | Region | Network host project | Address resource | NAT IP | Status |
|---|---|---|---|---|---|
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-0 | 34.72.82.40 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-1 | 34.122.251.101 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-2 | 34.170.35.69 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-3 | 35.194.23.180 | active |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-0 | 34.105.111.96 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-1 | 34.127.70.253 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-2 | 136.109.214.215 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-3 | 34.169.152.90 | reserve |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-0 | 34.132.72.45 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-1 | 35.226.246.142 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-2 | 34.136.40.107 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-3 | 34.57.41.236 | active |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-0 | 35.227.138.226 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-1 | 35.185.201.232 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-2 | 34.168.190.12 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-3 | 8.229.103.135 | reserve |
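Because partial allowlisting fails only intermittently, it can help to diff any allowlist against the full expected set rather than eyeball it. The sketch below is illustrative only: missing_nat_ips is a hypothetical helper, not an existing repo script, and it assumes the allowlist is supplied one entry per line with an optional /32 suffix.

```shell
#!/usr/bin/env bash
# Non-production NAT IPs from the table above (4 active + 4 reserve).
EXPECTED_NAT_IPS=(
  34.72.82.40 34.122.251.101 34.170.35.69 35.194.23.180
  34.105.111.96 34.127.70.253 136.109.214.215 34.169.152.90
)

# missing_nat_ips: reads allowlist entries on stdin, one per line,
# and prints every expected NAT IP that is absent from that input.
missing_nat_ips() {
  local allowlist ip
  # Strip an optional /32 so "1.2.3.4/32" and "1.2.3.4" compare equal.
  allowlist="$(sed 's#/32$##')"
  for ip in "${EXPECTED_NAT_IPS[@]}"; do
    # -x: whole-line match, -F: literal string (dots are not wildcards).
    grep -qxF -- "$ip" <<<"$allowlist" || echo "$ip"
  done
}
```

For example, `printf '%s\n' 34.72.82.40/32 34.122.251.101 | missing_nat_ips` prints the six IPs that still need to be added; empty output means full coverage.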

📝 Historical note: HB-8218 originally listed only 2 of 4 IPs per env. The remaining 4 (2 per env) were discovered live during this audit (gcp-networks provisions nat_num_addresses_region1 = 4 for both envs per envs/{production,non-production}/boa_vpc_fw.tf:45). HB-8218 description updated 2026-04-30 with the corrected set.

📝 us-west1 NAT IPs are pre-emptively included in allowlists. A gcloud compute addresses list query against either shared-base-vpc-host project also returns 4 IPs per env in us-west1 (non-prod 34.105.111.96 / 34.127.70.253 / 136.109.214.215 / 34.169.152.90; prod 35.227.138.226 / 35.185.201.232 / 34.168.190.12 / 8.229.103.135). These belong to a parallel Cloud NAT in us-west1 for non-swarm workloads today (Cloud Functions and any future regional expansion). The swarm cluster currently runs only in us-central1 — managers in us-central1-{a,b,c}, workers default to ["us-central1-a","us-central1-b","us-central1-c","us-central1-f"] per swarm_worker/variables.tf, so today no swarm traffic ever egresses through the us-west1 pool.

Policy as of 2026-05-12: include the us-west1 IPs in every allowlist anyway (internal surfaces like the legacy SQL allowlist script, and partner-side allowlists like §3.5). Rationale: if/when we activate swarm in us-west1 for DR or capacity, we won’t need a second round of partner change-control or script patches. The cost is small (4 extra IPs per env to maintain) and the IPs are static reservations that won’t move, so the future risk of stale entries is low. The earlier guidance to omit them has been superseded.

Verification command (per env):

# Non-production
gcloud compute addresses list \
  --project=prj-n-shared-base-cb89 \
  --filter='name~^ca-n-shared' \
  --format='table(name,address)'

# Production
gcloud compute addresses list \
  --project=prj-p-shared-base-11f6 \
  --filter='name~^ca-p-shared' \
  --format='table(name,address)'

Manager IPs (3 per env, separate from worker NAT):

Verifying egress before partner outreach

Before notifying partners (§3.5), confirm that worker traffic actually exits via the NAT IPs above. Misalignment between the reserved IPs and the IPs partners see is exactly the silent-failure mode that caused the HB-8218 4-vs-2 finding — but for partner allowlists, that failure happens at the partner’s edge, after cutover, with no logs on our side.

Two complementary checks. Pass criteria: every worker is mapped to one of the 4 NAT IPs, and a curl from inside the worker echoes back the same IP that GCP’s mapping table claims for it.

Check 1 — control-plane mapping: ask GCP which NAT IP each worker is using. No cluster shell access needed.

NETWORK_HOST_NON_PROD="$(gcloud projects list \
  --filter='labels.application_name=base-shared-vpc-host AND labels.environment=non-production' \
  --format='value(projectId)')"

gcloud compute routers get-nat-mapping-info \
  cr-n-shared-base-spoke-us-central1-nat-router \
  --nat-name=rn-n-shared-base-spoke-us-central1-egress \
  --region=us-central1 \
  --project="$NETWORK_HOST_NON_PROD" \
  --format=json

With --format=json the output is a list with one entry per VM; each entry's interfaceNatMappings[].natIpPortRanges[] holds strings of the form IP:port-range. Group by the IP part to confirm that every worker maps to one of 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180. Any other IP is a finding.
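The grouping step can be done without jq. The sketch below assumes the mapping output contains NAT entries as IP:start-end strings (the documented shape of natIpPortRanges); both function names are hypothetical helpers, not part of any repo script.

```shell
#!/usr/bin/env bash
# Expected non-production us-central1 NAT IPs (from §2).
EXPECTED='34.72.82.40 34.122.251.101 34.170.35.69 35.194.23.180'

# nat_ips_in_use: reads get-nat-mapping-info output on stdin and prints
# "count ip" for each distinct NAT IP. Only IP:start-end strings are
# matched, so sourceVirtualIp values (no port range) are ignored.
nat_ips_in_use() {
  grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]+-[0-9]+' \
    | cut -d: -f1 | sort | uniq -c | awk '{print $1, $2}'
}

# unexpected_nat_ips: prints any NAT IP that is not in EXPECTED.
unexpected_nat_ips() {
  local count ip
  nat_ips_in_use | while read -r count ip; do
    case " $EXPECTED " in
      *" $ip "*) ;;          # known NAT IP, nothing to report
      *) echo "$ip" ;;       # anything else is a finding
    esac
  done
}
```

Piping the gcloud command above into unexpected_nat_ips should print nothing when the egress path is as expected.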

Check 2 — partner’s-eye view: run a one-shot global service that hits an IP echo from every worker. What it reports is what a partner will see.

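# From a swarm manager (SSH via IAP).
# --hostname uses a service template so $(hostname) prints the node name
# rather than the container ID, which makes the per-worker cross-check possible.
docker service create \
  --name nat-ip-check \
  --mode global \
  --restart-condition none \
  --hostname '{{.Node.Hostname}}' \
  alpine sh -c 'echo "$(hostname): $(wget -qO- https://ifconfig.me)"'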

# Wait ~30s for tasks to complete, then:
docker service logs nat-ip-check --no-trunc

# Cleanup:
docker service rm nat-ip-check

Cross-check the IP each worker reports against Check 1: they should match per-worker. If they don’t, the egress path isn’t what we think it is — stop and investigate before partner outreach.

If you don’t have manager shell access, alternate per-worker approach:

gcloud compute ssh swarm-worker-n-XXXX --tunnel-through-iap \
  --command='curl -s https://ifconfig.me'

…repeated for each worker.

Optional Check 3, NAT translation logs: the gcp-networks egress_nat_region1 resource enables log_config { filter = "TRANSLATIONS_ONLY" }, so every successfully translated connection is logged (dropped connections are not). While Check 2 is running, tail the logs to see actual translations:

gcloud logging read \
  'resource.type="nat_gateway"
   resource.labels.gateway_name="rn-n-shared-base-spoke-us-central1-egress"' \
  --project="$NETWORK_HOST_NON_PROD" \
  --limit=50 \
  --format='value(jsonPayload.connection.nat_ip,jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip)'

Three columns: NAT IP exited from, worker internal IP, destination. Confirms what we expect from a third independent angle.

If all three checks agree, the IP set is ground-truthed and partner outreach can go out. If any disagree, the disagreement is the next thing to chase.

Per-surface connectivity sweep (HB-8211)

docs/scripts/swarm_connectivity_test.sh is the operational pre-flight check that pairs with the §3 audit. SSH into a swarm worker, copy the script over, run it. It probes each surface this audit calls out (GCP APIs, Azure SQL, Cloudflare-fronted services, Papertrail, Datadog, Tableau /trusted) plus skip placeholders for partner endpoints (TBD until URLs surface).

A complementary control-plane audit lives at src/scripts/swarm_connectivity_test.sh and runs from any operator workstation with gcloud + az + a Cloudflare API token. It re-derives the NAT pool from gcp-networks state and probes each allowlist surface from outside-in: PD MSSQL authorized_networks, Azure SQL firewall, Caddy caddy_vpn_bypass_list in GCS, Cloudflare Access vpn_bypass policies, NAT egress mapping, and Tableau wgserver.trusted_hosts (via gcloud compute ssh + tsm configuration get). Run via ./zig/zig build scripts -- swarm_connectivity_test — exit code = number of failures, suitable as a CI gate on top of (not in place of) the on-host script.

gcloud compute scp docs/scripts/swarm_connectivity_test.sh \
  swarm-n-worker-XXXX:/tmp/ \
  --tunnel-through-iap --project=prj-bu1-n-hb-infra-5381

gcloud compute ssh ubuntu@swarm-n-worker-XXXX \
  --tunnel-through-iap --project=prj-bu1-n-hb-infra-5381 \
  --command='bash /tmp/swarm_connectivity_test.sh'

Run it before every migration batch. Pre-cutover, expected fails are typically the vpn_bypass-protected checks (until HB-8413 / HB-8414 ship). The Azure SQL TCP probe is network-path-only and not authoritative for firewall allowlisting (see §3.2). Public Cloudflare sanity endpoints should pass without an allowlist update. Post-cutover, everything except SKIP rows should be green. Exit code = number of failures, easy CI gate.
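The "exit code = number of failures" convention can be sketched as follows. This is not the actual swarm_connectivity_test.sh; the helper names and the demo probes are illustrative, and capping at 255 is an assumption about how one might guard against 8-bit exit-status wraparound.

```shell
#!/usr/bin/env bash
FAILURES=0

# record NAME STATUS: prints one result row and counts FAIL rows only.
record() {
  printf '%-6s %s\n' "$2" "$1"
  [ "$2" = "FAIL" ] && FAILURES=$((FAILURES + 1))
  return 0
}

# probe NAME CMD...: runs CMD quietly and records PASS or FAIL.
probe() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then record "$name" PASS; else record "$name" FAIL; fi
}

# Exit statuses are 8-bit, so 256 failures would wrap to 0 ("success").
exit_code_for_failures() {
  echo $(( FAILURES > 255 ? 255 : FAILURES ))
}

# Demo probes (placeholders for real surface checks).
probe "always-ok" true
probe "always-bad" false
record "partner-endpoint" SKIP
# A real script would end with: exit "$(exit_code_for_failures)"
```

SKIP rows are printed but never counted, which matches the expectation above that everything except SKIP rows is green post-cutover.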

What to expect when running the test from each VM type

The script’s diagnostic value comes from comparing legacy-VM results to swarm-worker results — the diff shows what allowlist work has landed and what hasn’t. This matrix shows expected results for the rows that vary by host:

| Probe | hb-VM (legacy) | partner-VM (legacy, e.g. pdp21) | swarm worker (target post-migration) | Why |
|---|---|---|---|---|
| eligibility (vpn_bypass) | FAIL (302) | PASS (200) | PASS (200), after HB-8413 / HB-8414 | local.pd_addresses is filtered by name:pdp* in the PD project. Partner VMs are in there; HB VM isn’t. After migration, swarm NAT IPs are added as static entries → workers behave like partner VMs do today. |
| hb-infra cloudsql.client | PASS | FAIL | PASS, after HB-8590 | Legacy: each VM’s compute SA is scoped to projects that VM connects to. Post-migration: worker SA holds the union so any stack can land on any worker. |
| pd-infra cloudsql.client | FAIL | PASS | PASS, after HB-8590 | Same: pd-infra DBs are reachable today only from PD partner VMs (their SA has the binding). |
| bees-infra cloudsql.client | PASS | FAIL | PASS, after HB-8590 | hb-VM apps reach bees-infra DBs today; partner VMs don’t. |
| the-helper-bees (shared) | PASS | FAIL | PASS, after HB-8590 | Shared/legacy Cloud SQL instances (ansible-hb-psql-prod/stage, ansible-psql-warehouse). hb-VM apps connect; partner VMs don’t. |

How to use this matrix: if you run the script from a swarm worker and any of the “PASS, after …” rows isn’t green, that’s a finding for the cited subtask. Conversely, the legacy-VM columns aren’t bugs: they reflect the current per-VM IAM scoping that HB-8590 deliberately collapses, and the current vpn_bypass filter that HB-8413/HB-8414 deliberately broadens. The whole point is the diff between legacy and worker.

The other probe groups (GCP APIs, public Cloudflare-fronted, Papertrail, Datadog, Azure SQL TCP) should PASS from any GCP VM in the right env regardless of HB-8218 status — they don’t depend on the work HB-8218 is gating.


3. Audit by Surface

3.1 GCP Cloud SQL authorized_networks

| Instance | Path | What’s allowlisted today | Migration action |
|---|---|---|---|
| HB Postgres (hb-p-psql) | infra/hb-infra/business_unit_1/production/main.tf:32 | var.authorized_networks only (THB VPN: gc{2,4,5,6}-algo); no per-VM IPs | None. App connects via Cloud SQL Proxy sidecar (Phase 0 Step 1). |
| HB MSSQL sandbox, non-prod only (hb-n-sql-server) | infra/hb-infra/business_unit_1/non-production/sql_server.tf:48 | var.authorized_networks + hb-n-vm external IP | Remove the hb-n-vm entry; nothing replaces it. Added by PR #362 as a DHA sandbox for the Benefits Hub team to test against without colliding with other teams’ deploys. The allowlist entry was for the host-level Cloud SQL proxy connecting via public IP. The new per-stack cloud-sql-proxy sidecar authenticates to the Cloud SQL admin API by IAM, bypassing authorized_networks entirely. No prod equivalent. Workers need cloudsql.client on prj-bu1-n-hb-infra-* (HB-8590) and module.hb-n-sql-server.instance_connection_name in the django_homealign stack’s sidecar command. |
| PD shared MSSQL (pd-p-mssql) | resource: mssql.tf:84; allowlist concat: mssql.tf:110 | var.authorized_networks + local.temporary_authorized_networks (lines 11-14) + local.pd_vm_authorized_networks (lines 23-29) | HIGH RISK. pd_vm_authorized_networks auto-populates from every PD VM external IP via regex match on app vm_resource_name. As partner VMs go away, regex matches drop → auto-removal from allowlist. Need to add the swarm NAT IPs as static entries (next to local.temporary_authorized_networks) before the first PD partner cuts over. |
| Per-partner Postgres (pdp{N}-{n,p}-psql-* in prj-bu1-{n,p}-pd-infra-*) | infra/pd-infra/business_unit_1/{env}/pdp*/main.tf:33 | var.authorized_networks only (THB VPN) | None on the SQL side. Connect via --private-ip through VPC peering: set CSQL_PROXY_PRIVATE_IP=true in each app’s .cloudsql-proxy.env (pd #896). Bypasses NAT entirely, no allowlist needed for swarm workers. |
| Legacy PD Postgres (the-helper-bees:ansible-p<N>-psql-{stage,prod} for p1, p5, p7, p10, p13, p14) | Not Terraform-managed. See comment in infra/pd-infra/business_unit_1/non-production/pdp14/main.tf: “Because the database server was not created by Terraform, I had to manually create [databases/users/passwords] by hand”. Allowlist managed imperatively via src/scripts/add_swarm_nat_to_legacy_sql.sh. | Per-partner mix of legacy per-VM IPs and partner-specific entries (collected over years pre-swarm). | Add swarm NAT IPs via the script (no Terraform path available; see §3.1.1 for the rationale). Required because these instances have no private IP, so public IP through NAT is the only reachable path from swarm workers. Cloud SQL silently drops TCP SYNs from non-allowlisted source IPs → i/o timeout at the proxy. |

Action item: Add the swarm NAT IPs from §2 as a new static entry in the allowlist concat for pd-p-mssql.ip_configuration.authorized_networks.

# Live audit non-prod
gcloud sql instances list \
  --project=prj-bu1-n-pd-infra-fee5 \
  --filter='databaseVersion~SQLSERVER' \
  --format='yaml(name,settings.ipConfiguration.authorizedNetworks)'

# Live audit prod
gcloud sql instances list \
  --project=prj-bu1-p-pd-infra-b355 \
  --filter='databaseVersion~SQLSERVER' \
  --format='yaml(name,settings.ipConfiguration.authorizedNetworks)'

# Count expected non-prod NAT IPs in the allowlist (all 8 per the 2026-05-12
# policy: 4 us-central1 active + 4 us-west1 reserve). Expect "8" per matching
# instance; anything lower means partial coverage.
# value() joins list items with ";", so split to one entry per line before
# counting, and accept an optional /32 suffix on each entry.
gcloud sql instances list --project=prj-bu1-n-pd-infra-fee5 \
  --filter='name~^pd-n-mssql-' \
  --format='value(settings.ipConfiguration.authorizedNetworks[].value)' \
  | tr ';' '\n' \
  | grep -cxE '(34\.72\.82\.40|34\.122\.251\.101|34\.170\.35\.69|35\.194\.23\.180|34\.105\.111\.96|34\.127\.70\.253|136\.109\.214\.215|34\.169\.152\.90)(/32)?'

3.1.1 PD Postgres connectivity strategy: hybrid --private-ip + NAT allowlist (HB-8646)

The 18 swarm-deployed PD apps split into two cohorts based on where their Cloud SQL lives. We use a different connectivity path for each:

| Cohort | Apps | Cloud SQL location | Path | Allowlist requirement |
|---|---|---|---|---|
| New-hierarchy (10 apps) | p8, p17, p18, p19, p20, p21, p23, p24, p25, p118 | Per-app instance in prj-bu1-{n,p}-pd-infra-* with private IP enabled | cloudsql-proxy --private-ip over VPC peering (no NAT involvement) | None for swarm workers; VPC peering is a private path |
| Legacy (8 apps) | p1, p5, p5i, p7, p10, p13, p14, p14_bea | Shared instance in the-helper-bees:ansible-p<N>-psql-{stage,prod}; no private IP | cloudsql-proxy over public IP through Cloud NAT | All 8 swarm NAT egress IPs (4 us-central1 active + 4 us-west1 reserve, see §2) must be in each legacy instance’s authorized_networks |

Two pairs of legacy apps share a single Postgres instance (hbpdp5 + hbpdp5i → ansible-p5-psql-*; hbpdp14 + hbpdp14_bea → ansible-p14-psql-*), so the 8 legacy apps map to 6 unique legacy instances per env.
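The 8-apps-to-6-instances mapping can be checked mechanically. The sketch below assumes an app's instance prefix is derived by keeping the leading p<digits> of its id (so p5i and p14_bea collapse onto p5 and p14); that derivation rule and the function names are illustrative, not taken from the repo scripts.

```shell
#!/usr/bin/env bash
# Legacy cohort app ids, as listed in the table above.
LEGACY_APPS=(p1 p5 p5i p7 p10 p13 p14 p14_bea)

# instance_prefix_for APP: keeps the leading "p<digits>" of the app id,
# e.g. p5i -> ansible-p5-psql, p14_bea -> ansible-p14-psql.
instance_prefix_for() {
  if [[ "$1" =~ ^(p[0-9]+) ]]; then
    echo "ansible-${BASH_REMATCH[1]}-psql"
  else
    return 1   # not a recognizable PD app id
  fi
}

# unique_legacy_instances: one line per distinct instance prefix.
unique_legacy_instances() {
  local app
  for app in "${LEGACY_APPS[@]}"; do
    instance_prefix_for "$app"
  done | sort -u
}

unique_legacy_instances | wc -l   # 6 distinct instance prefixes
```

The per-instance count, not the per-app count, is what determines how many authorized_networks lists need patching in each env.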

Why hybrid and not one approach for both:

  • “Just --private-ip everywhere” isn’t an option because the legacy ansible-p<N>-psql-* instances don’t have a private IP. Adding one requires pg_dump + restore into a new instance + per-partner coordination — not on the near-term roadmap (deferred until the broader legacy-instance retirement).
  • “Just allowlist NAT IPs everywhere” would work mechanically (mirrors the existing pd-p-mssql pattern in §3.1), but expanding NAT-port-budget pressure to 10 more instances that already have a cleaner option (private IP via VPC peering) is the wrong direction. New-hierarchy apps get the architecturally cleaner path; legacy gets the only path it has.

How each path is wired:

  • --private-ip: cloudsql-proxy v2 auto-binds CLI flags to CSQL_PROXY_* env vars via cobra/viper. Setting CSQL_PROXY_PRIVATE_IP=true in an app’s .cloudsql-proxy.env is equivalent to passing --private-ip on the command line. PD’s production.deploy.yml already loads each app’s .cloudsql-proxy.env via env_file:, so flipping the env var in the per-app file is the entire change. See pd #896.
  • NAT allowlist: all 8 swarm NAT egress IPs per env (4 us-central1 active + 4 us-west1 reserve, see §2) are added to each legacy instance’s authorized_networks via src/scripts/add_swarm_nat_to_legacy_sql.sh. Including the reserve us-west1 IPs up front means we won’t need a second allowlist patch round if/when we activate that region. The script is idempotent and supports --app <id> for incremental rollout. See infrahive #811.
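Idempotence for the NAT allowlist path comes down to writing the union of the existing entries and the NAT IPs, since a Cloud SQL authorized-networks update replaces the whole list. The sketch below is not the actual add_swarm_nat_to_legacy_sql.sh; merge_allowlist is a hypothetical helper that shows only the merge step on comma-separated lists.

```shell
#!/usr/bin/env bash
# merge_allowlist CURRENT_CSV NEW_CSV
# Prints the comma-separated union, keeping existing entries first and
# dropping duplicates, so running it twice yields the same list.
merge_allowlist() {
  printf '%s\n' "$1" "$2" \
    | tr ',' '\n' \
    | awk 'NF && !seen[$0]++' \
    | paste -sd, -
}

# Example with one pre-existing (illustrative) entry plus two NAT IPs.
merge_allowlist "192.0.2.7/32,34.72.82.40/32" "34.72.82.40/32,34.122.251.101/32"
# prints 192.0.2.7/32,34.72.82.40/32,34.122.251.101/32
```

Feeding the merged value back in as CURRENT_CSV returns it unchanged, which is the property that makes repeated or incremental runs safe.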

Verification: probe_pd_postgres_reachability covers the 10 new-hierarchy apps; probe_pd_legacy_postgres_allowlist (added in infrahive #808) covers the 6 legacy instances. Both run from a dev laptop / CI and pass post-rollout.

Failure mode to recognize: if a legacy app sees cloudsql-proxy: failed to connect to instance: Dial error: ... i/o timeout, it almost always means a swarm NAT IP isn’t in the target instance’s authorized_networks (Cloud SQL drops TCP silently for non-allowlisted source IPs). If a legacy app instead sees cloudsql-proxy: failed to connect to instance: Config error: instance does not have IP of type "PRIVATE", it means an old compose overlay accidentally forced --private-ip onto it; see the HB-8646 retro for the exact failure mode and the production.swarm-overrides.yml removal in pd #896.
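For triage, the two signatures can be told apart with a simple pattern match on the proxy log line. This is an illustrative helper (diagnose_proxy_error is a hypothetical name); it matches only the two message fragments quoted above.

```shell
#!/usr/bin/env bash
# diagnose_proxy_error LOG_LINE: maps a cloudsql-proxy error line to the
# likely cause described in this section.
diagnose_proxy_error() {
  case "$1" in
    *'does not have IP of type "PRIVATE"'*)
      echo "private-ip forced on an instance without a private IP" ;;
    *'i/o timeout'*)
      echo "swarm NAT IP likely missing from authorized_networks" ;;
    *)
      echo "unrecognized" ;;
  esac
}

diagnose_proxy_error 'failed to connect to instance: Dial error: ... i/o timeout'
```

The config-error pattern is checked first because it is the more specific of the two.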

Future legacy → new-hierarchy migration: when one of the legacy ansible-p<N>-psql-* instances eventually gets migrated to prj-bu1-{n,p}-pd-infra-*, the affected app moves from the “legacy” row to the “new-hierarchy” row. Concretely: drop its entry from LEGACY_PD_INSTANCES in add_swarm_nat_to_legacy_sql.sh and swarm_connectivity_test.sh, and add CSQL_PROXY_PRIVATE_IP=true to its .cloudsql-proxy.env. The NAT allowlist on the decommissioned legacy instance can be left as-is or cleaned up — irrelevant once the instance is gone.

3.2 Azure SQL Firewall (HA Apps Only)

| Server | Resource group | Today’s source | Migration action |
|---|---|---|---|
| ha-prod1-azsqldb | HA-PROD1-SQL-RG | Per-VM rules added imperatively via src/scripts/add_az_sql_fw_rule.sh | Add swarm NAT IPs as new rules; leave per-VM rules in place during cutover, remove post-migration |
| ha-dev-azsqldb | HA-DEV-SQL-RG | Same | Same |

These rules are NOT in Terraform. They live only in Azure. Inventory:

# Production
az sql server firewall-rule list --resource-group HA-PROD1-SQL-RG \
  --server ha-prod1-azsqldb --output table

# Non-prod
az sql server firewall-rule list --resource-group HA-DEV-SQL-RG \
  --server ha-dev-azsqldb --output table

Add rule (one invocation per NAT IP from §2):

./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_0 -ip <nat-ip-0>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_1 -ip <nat-ip-1>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_2 -ip <nat-ip-2>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_3 -ip <nat-ip-3>

The script is idempotent (checks for existing rule with same IP first). The _0 through _3 suffix aligns with the gcp-networks address resource indices (ca-{env_code}-shared-base-spoke-us-central1-{0..3}).
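To keep the rule suffix and the address index aligned, the four invocations can be generated from an indexed array instead of typed by hand. This is a dry-run sketch that only prints the commands; the array and function names are illustrative, and the IPs are the production us-central1 set from §2.

```shell
#!/usr/bin/env bash
# Production us-central1 NAT IPs, in address-resource index order (0..3).
PROD_NAT_IPS=(34.132.72.45 35.226.246.142 34.136.40.107 34.57.41.236)

# print_fw_rule_commands: emits one add_az_sql_fw_rule.sh invocation per
# NAT IP, using the array index as the SWARM_WORKER_NAT_<n> suffix.
print_fw_rule_commands() {
  local i
  for i in "${!PROD_NAT_IPS[@]}"; do
    echo "./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_${i} -ip ${PROD_NAT_IPS[$i]}"
  done
}

print_fw_rule_commands
```

Reviewing the printed commands before running them gives a cheap check that no index was skipped or duplicated.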

⚠️ TCP probes don’t verify the firewall rule. Azure SQL accepts the TCP connect at the load balancer regardless of firewall; the rule is enforced at the TDS auth layer, not L4. So a passing bash </dev/tcp/host/1433 redirect (or anything that does only a TCP probe, including docs/scripts/swarm_connectivity_test.sh) proves DNS + VNet egress + L4 path work, but does not prove the worker’s NAT IP is in the firewall allowlist. Authoritative verification requires a TDS-aware client (sqlcmd, the mssql Python driver, etc.) attempting the pre-login phase; a successful pre-login means the firewall accepted the source IP. Treat the connectivity-test TCP probe as a network-path canary, not a firewall verifier. (A v2 of swarm_connectivity_test.sh could shell out to sqlcmd -Q "SELECT 1", but that introduces a non-stock-Ubuntu dependency, so it is left as a follow-up.)

3.3 Cloudflare Access Groups (IP Bypass)

Three locations:

A) infra/pd-infra/business_unit_1/production/cloudflare.tf — most action here:

| Group | Source today | Migration action |
|---|---|---|
| green_p_ip_bypass (line 59) | data.google_compute_addresses.hb_p_vm_ip_address (HB project external IPs) | Auto-shrinks as HB VM goes away. Add swarm NAT IPs as static entries. |
| yellow_p_ip_bypass (line 71) | data.google_compute_addresses.bees_p_vm_ip_address | Same: auto-shrinks; add static NAT IPs. |
| pdp24_p_bastion_ip_bypass / pdp18_p_bastion_ip_bypass | bastion-specific compute addresses (data.google_compute_addresses.p24_bastion_p_vm_ip_address / p18_bastion_p_vm_ip_address) | No change. Bastions stay on dedicated VMs and are not part of the swarm migration; their external IPs are stable. |
| pdp18_p_ip_bypass, pdp24_p_ip_bypass, salesforce_p_ip_bypass, thb_p_authorized_networks | Hardcoded partner / VPN IP ranges | No change; these are partner egress, not ours. |

B) HB-side cloudflare-access module calls — these live in hb-infra (not pd-infra) and use a dynamic vpn_ip_addresses input rather than named Cloudflare Access groups:

| Module call | Path | Source | Migration action |
|---|---|---|---|
| Standard cloudflare-access (covers all HB apps with vpn_bypass policy: hbcrm, hb_buzz, eligibility-via-thehelperbees.com-domain, etc.) | hb-infra/.../production/cloudflare.tf:8-15 | vpn_ip_addresses = concat(local.pd_addresses, local.vpn_addresses). local.pd_addresses is auto-discovered from data.google_compute_addresses.pd_ip_addresses (filter name:pdp* in PD project) | Add swarm NAT IPs as static entries to local.pd_addresses in main.tf:8-11. Single change flows into both module calls below. |
| eligibility-cloudflare-access (separate call because eligibility uses session_duration = "168h" vs the default; covers eligibility.thb.nu and eligibility.thb.sh) | hb-infra/.../production/eligibility_service.tf:6-13 | Same expression: concat(local.pd_addresses, local.vpn_addresses) | Inherits fix from updating local.pd_addresses. |

Why this matters for pdp24 → eligibility: When pdp24 calls https://eligibility.thb.nu, Cloudflare Access checks if the source IP is in the vpn_bypass group. Today, pdp24’s 35.225.100.177 is in local.pd_addresses (via the dynamic data lookup) so it passes. After migration, pdp24’s IP becomes a swarm NAT IP — which is not in the PD project’s external addresses, so the dynamic lookup won’t include it. The static-entry fix above is what keeps this internal flow working post-migration. No partner action required — this is our own infrastructure.

C) infra/ha-infra/business_unit_1/production/cloudflare/cloudflare.tf — uses only var.authorized_networks (THB VPN). No change needed.

Live audit:

# Already in TF — read current state
./zig/zig build plan -- pd-infra production cloudflare
./zig/zig build plan -- hb-infra production cloudflare
./zig/zig build plan -- hb-infra production eligibility_service

3.4 Caddy VPN Bypass List

Important: The Terraform-supplied caddy_vpn_bypass_list is one of three inputs that compose the actual runtime Caddy allowlist. The full picture, assembled by Ansible at deploy time:

admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list

(Composed across gc_deploy_with_gcp_secrets.yml:358-360 — the playbook actually used in production — and roles/caddy_create/tasks/main.yml:46-48 in hb-ansible.) This list drives the (is-public-ip) / (is-vpn-ip) snippets in roles/caddy_create/templates/Caddyfile.j2, which gate the admin/flower/camunda/websocket routes that bypass Cloudflare Access.
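The composition is a plain list concatenation, so a source IP is admitted only if it appears in one of the three inputs. The sketch below mirrors that expression in bash for illustration; the function names are hypothetical, the VM and VPN addresses are documentation-range placeholders, and it is not how Ansible or Caddy evaluate the list.

```shell
#!/usr/bin/env bash
# compose_admin_allowed VM_IP "VPN_IPS" "BYPASS_IPS"
# Mirrors [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list; the two
# list arguments are space-separated strings. Prints one IP per line.
compose_admin_allowed() {
  local vm_ip="$1"
  local -a vpn_ips bypass_ips
  read -r -a vpn_ips <<<"$2"
  read -r -a bypass_ips <<<"$3"
  printf '%s\n' "$vm_ip" "${vpn_ips[@]}" "${bypass_ips[@]}"
}

# is_admin_allowed SOURCE_IP: reads the composed list on stdin and
# succeeds only on an exact, literal match.
is_admin_allowed() {
  grep -qxF -- "$1"
}

# Placeholder VM/VPN IPs; bypass list holds two of the prod NAT IPs.
compose_admin_allowed 192.0.2.10 "198.51.100.1 198.51.100.2" \
  "34.132.72.45 35.226.246.142" | is_admin_allowed 34.132.72.45 \
  && echo "allowed"
```

A NAT IP left out of the bypass input fails the membership check, which is the partial-allowlist failure mode described in §2.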

| Component | Source | Migration impact |
|---|---|---|
| vm_ip | hostvars[vm_name].ansible_ssh_host, the deploy target’s own SSH host | Becomes the swarm worker’s NAT egress IP. Since all workers share the same NAT IPs, every worker will (incidentally) self-allowlist every other worker’s traffic via Cloudflare. New property; not necessarily wrong, worth surfacing. |
| vpn_ip_addresses | Maps to infrahive var.authorized_networks (THB algo VPN exits) | No change; same VPN IPs pre/post migration. |
| caddy_vpn_bypass_list | Terraform-fed via the ansible_config module (see table below) | Active migration concern. See callers below. |

How caddy_vpn_bypass_list flows from Terraform to runtime

The caddy_vpn_bypass_list value is not read at deploy time from a static file in hb-ansible. It traverses three systems:

infrahive Terraform (hb-infra/.../main.tf)
    local.caddy_pd_addresses
        |
        v
    module "hb-p-ansible-config5" { caddy_vpn_bypass_list = local.caddy_pd_addresses }
        |  (terraform apply)
        v
    ansible_config TF module (terraform-modules.git, external)
        |
        v
    GCS: gs://ansible-config-65ab/<app>/config.yml      ← regenerated on every apply
        |
        v
    hb-ansible playbook (gc_deploy_with_gcp_secrets.yml) loads bucket_path at deploy time
        |
        v
    filtered_app.caddy_vpn_bypass_list (per-app dict)
        |
        v
    caddy_create role (roles/caddy_create/tasks/main.yml:46-48)
        admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + filtered_app.caddy_vpn_bypass_list
        |
        v
    Caddyfile.j2 (is-public-ip / is-vpn-ip matchers)

Verified end-to-end via gsutil cat gs://ansible-config-65ab/hbcrm/config.yml — that file already contains a populated caddy_vpn_bypass_list array with all PD partner VM IPs (78+ entries today).

Implication for HB-8413 (Caddy VPN bypass) and HB-8414 (Cloudflare Access): changes happen in infrahive only. No edits to hb-ansible. The Caddy template is data-driven — it picks up whatever’s in the GCS config on next deploy. And because the same local.caddy_pd_addresses / local.pd_addresses data lookup feeds all three downstream surfaces (Caddy bypass, standard cloudflare-access, eligibility-cloudflare-access), a single Terraform PR closes both HB-8413 and HB-8414 — coordinate so the work isn’t duplicated.

Terraform sources of caddy_vpn_bypass_list:

| Caller | Source today | Migration action |
|---|---|---|
| module.hb-p-ansible-config5 (infra/hb-infra/business_unit_1/production/main.tf:140) | local.caddy_pd_addresses: auto-discovered PD VM external IPs (name:pdp* filter, line 12) | Auto-shrinks as PD VMs migrate. Replace with the swarm NAT IPs as static entries. HB→PD calls will continue to roundtrip via Cloudflare even on the same cluster (no overlay during migration; see HB-8620 follow-up). |
| module.hb-n-ansible-config (non-prod equivalent) | Same pattern | Same |
| module.benefits-hub-p-ansible-config1 (infra/hb-infra/business_unit_1/production/benefits_platform.tf:73) | [] empty | No change |

Decision (resolves Q1): The migration will not introduce an HB↔︎PD swarm overlay network. HB→PD calls continue to traverse the public internet via Cloudflare, requiring the swarm NAT IPs in this bypass list. A future overlay is tracked under HB-8620 (parented to the HB-6748 DevOps KTLO epic), which would let us remove this bypass list entirely.

Note on partner_allowed_ip_addresses (inbound, out of scope): While inspecting hb-ansible, we also reviewed the per-app partner_allowed_ip_addresses variable (roles/caddy_create/tasks/main.yml:32, only set on stage_hbpdp11 with [76.187.226.216]). This is the inverse direction — a partner’s IP allowlisted on our Caddy for inbound /partner/* traffic — and is therefore out of scope for HB-8210. Additionally, the variable is set as a fact but never referenced in the current Caddyfile.j2 template, suggesting it is either vestigial or enforced at the app layer inside partner_django. Either way, partner IPs don’t change during our migration, so no action.

3.5 Partner API Webhooks / External Integrations

This is the highest-risk and least-discoverable surface. Partner systems often allowlist our outbound IP for inbound webhooks, SFTP polling, or API calls.

Known surfaces requiring live inventory (cannot enumerate from infrahive alone):

  1. Partner SFTP destinations: infra/pd-infra/cloudfunctions/sftp_pubsub_messages/SFTP/client.go, infra/pd-infra/cloudfunctions/sftp_bucket_file/SFTP/client.go. SFTP runs from cloud functions, not VMs, so it likely uses its own NAT. Verify these aren’t routed through partner VMs.

  2. PSR (Provider Services Request) handler (out of scope, documented for clarity): infra/pd-infra/modules/partner_psr_request_handler/. On first read this looked like a possible outbound surface, but inspection of the module and its callers (pdp8, pdp24, …) confirms the flow is the opposite direction: a partner’s Salesforce instance pushes files into our GCS bucket (psr-{partner_code}-{env}-*), authenticated by a service-account key that we generate and hand to the partner’s Salesforce dev. A Cloud Function then converts uploads to PDF/XML for hbcrm to read. No partner-side IP allowlist is involved, and our outbound IP doesn’t appear in this flow at all, so it’s unaffected by the swarm migration. Listed here only so future readers don’t re-add it as an open question.

  3. Per-partner outbound API integrations — likely configured per-partner in either:

    • pd repo .envs/ files (partner API base URLs and auth)
    • hb-buzz / hbcrm / django_homealign per-tenant settings
    • Salesforce / Camunda outbound webhooks

Recommended path: post in #devops asking team members which partners allowlist our outbound IP. Active outreach: #devops thread on 2026-04-30. Track replies there and add new partners to the confirmed-partners table below as they surface.

Owner for partner outreach: PD partner success / integrations team. Best to start parallel to HB-8218 implementation since some partners have multi-week change-control SLAs.

Confirmed partners with our outbound IP allowlisted

All IPs below verified live via curl ifconfig.me from the corresponding VM. Each address resource carries lifecycle { ignore_changes = all } so the IPs have been stable.

| Partner | Env | Outbound IP | GCE VM | Project |
| --- | --- | --- | --- | --- |
| pdp24 | prod | 35.225.100.177 | pdp24-p-vm-6670 | prj-bu1-p-pd-infra-b355 |
| pdp24 | non-prod | 34.30.125.50 | pdp24-n-vm-2327 | prj-bu1-n-pd-infra-fee5 |
| pdp10 | prod | 34.69.126.254 | pdp10-p-vm-a437 | prj-bu1-p-pd-infra-b355 |
| pdp10 | non-prod | 34.134.208.190 | pdp10-n-vm-01f0 | prj-bu1-n-pd-infra-fee5 |
| pdp7 | prod | 35.238.12.142 | pdp7-p-vm-ebcb | prj-bu1-p-pd-infra-b355 |
| pdp7 | non-prod | 35.232.30.74 | pdp7-n-vm-5c52 | prj-bu1-n-pd-infra-fee5 |

Source TF for all rows: google_compute_address.external in infra/pd-infra/business_unit_1/{production,non-production}/pdp{N}/main.tf:12.

Per-partner allowlist verification (HB-8415, 2026-05-07)

Audited the actual outbound surface for each of the three “stable IP” partners above by querying the runtime dw_eventsubscription table (the only code path in the pd repo that POSTs to partner-controlled URLs — see docs/scripts/query_event_subscriptions.sh) and probing the partner endpoints with a 401-vs-403 differential test (curl from the allowlisted VM vs from a non-allowlisted host).

| Partner | Env | Active subscriptions | Partner system | IP allowlist confirmed? | Migration action |
| --- | --- | --- | --- | --- | --- |
| pdp7 | both | 0 | none in repo | ✅ no surface (PD team confirmed 2026-05-07: pdp10 is the only EventSubscription consumer) | none |
| pdp10 (Genworth) | prod | 6 | b7g1csty57.execute-api.us-east-1.amazonaws.com (AWS API Gateway, Microsoft Entra OAuth tenant 820ad304-…) | yes | add 8 prod swarm NAT IPs to Genworth’s prod APIGW resource policy (4 us-central1 active + 4 us-west1 reserve per §2 policy) |
| pdp10 (Genworth) | non-prod | 8 partner + 8 internal-test | p0cr5dk17j.execute-api.us-east-1.amazonaws.com (separate AWS APIGW, same tenant) | yes | add 8 non-prod swarm NAT IPs to Genworth’s staging APIGW resource policy (4 us-central1 active + 4 us-west1 reserve per §2 policy) |
| pdp24 | both | 0 | none in repo | ✅ no surface (PD team confirmed 2026-05-07: pdp10 is the only EventSubscription consumer) | none |

401-vs-403 evidence (Genworth APIGWs both expose a real resource policy / WAF rule, not OAuth-only):

| Source | Genworth prod APIGW (b7g1csty57…) | Genworth staging APIGW (p0cr5dk17j…) |
| --- | --- | --- |
| pdp10 prod VM (34.69.126.254) | 401 UnauthorizedException (allowlisted) | |
| pdp10 non-prod VM (34.134.208.190) | | 401 UnauthorizedException (allowlisted) |
| pdp24 prod VM (35.225.100.177) | 403 ForbiddenException (not allowlisted) | |
| Operator workstation | 403 ForbiddenException | 403 ForbiddenException |

A 403 from API Gateway with ForbiddenException arrives before the Lambda authorizer runs — that’s the canonical signature of an AWS resource-policy IP block. A 401 means the request reached the authorizer (so the IP passed) and only the bearer token failed. The differential confirms Genworth IP-allowlists pdp10’s per-VM IPs in both envs.
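
The differential reading described above can be written down as a small decision function. This is a minimal sketch under stated assumptions: the function names and the result labels are hypothetical, and the error type is passed in as a plain string rather than parsed from a real response.

```python
def classify_probe(status: int, error_type: str = "") -> str:
    """Interpret one probe result using the 401-vs-403 differential."""
    if status == 401:
        # The request reached the authorizer, so the source IP passed.
        return "ip-allowed"
    if status == 403 and error_type == "ForbiddenException":
        # Rejected before the authorizer ran: the IP-block signature.
        return "ip-blocked"
    return "inconclusive"


def allowlist_confirmed(allowlisted_source: str, control_source: str) -> bool:
    """True only when the allowlisted source passes AND the control is blocked."""
    return allowlisted_source == "ip-allowed" and control_source == "ip-blocked"


# Results as recorded in the evidence table above.
pdp10_prod_vm = classify_probe(401, "UnauthorizedException")
operator_workstation = classify_probe(403, "ForbiddenException")
print(allowlist_confirmed(pdp10_prod_vm, operator_workstation))
```

Requiring both halves matters: a 401 from every source would indicate OAuth-only protection, not an IP allowlist.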

Pdp7 and pdp24 closed: PD team confirmed 2026-05-07 that pdp10 is the only EventSubscription consumer — pdp7 and pdp24 do not use the partner-webhook feature at all. Combined with their empty dw_eventsubscription tables and no other partner-bound outbound code path in the pd repo, this is sufficient to mark both as “no migration action”: no partner allowlist coordination needed, and their per-VM google_compute_address.external resources can be released right after swarm-prod cutover with no dual-allowlist window. (Out-of-band integrations like manual CSV uploads or ops scripts weren’t explicitly probed, but the team’s confirmation that “P10 is the only one using that feature” is the authoritative signal — they own the integration roadmap.)

Other partners — unknown. Coverage is tribal; PD partner success / integrations team should confirm no others exist. Outreach posted in #devops on 2026-04-30 — track replies there.

Internal flows that depend on these IPs: pdp24 → eligibility.thb.nu / eligibility.thb.sh (confirmed). Our own Cloudflare Access vpn_bypass group includes all PD partner VM IPs (pdp7, pdp10, pdp24, and any other active partner) via local.pd_addresses (dynamic data lookup), so any of them calling our HB-side services rides this allowlist. After migration the dynamic lookup shrinks to nothing and is replaced by static swarm NAT IP entries — covered by §3.3 row B. No partner action needed for these internal flows.

Decision (post-migration): notify each partner contact of the swarm NAT IPs across both environments (16 IPs total: 8 non-prod + 8 prod per §2, including us-west1 reserves) and have them update their allowlist in a single change. Bundling envs minimizes partner change-control churn — most partners have one allowlist that covers any of our outbound, regardless of which env it originated from. After prod cutover (the last env to migrate), the per-partner production GCE addresses can be released (delete the google_compute_address.external resources in pdpN/main.tf).

Why not preserve the existing per-partner IPs?

To pick the right path it helps to know how Cloud NAT actually distributes IPs to outbound flows. With the MANUAL_ONLY mode + min_ports_per_vm = 2048 that gcp-networks configures, Cloud NAT does per-VM IP allocation, not per-flow:

  1. Each worker VM is assigned a fixed allocation of ~2048 ports from one of the pool’s NAT IPs.
  2. The worker uses that single IP for all its outbound flows (until port exhaustion, which is rare with 2048 ports).
  3. The assignment is Cloud-NAT-internal — we don’t control which IP a given worker gets.
  4. With N IPs in the pool and M workers, roughly M/N workers land on each IP.

Service replicas in Swarm are scheduled across the worker pool, so per-partner outbound traffic (e.g., pdp24 → partner API, pdp10 → partner API) comes from whichever NAT IP was assigned to the worker the scheduler placed it on — not from any IP we can pre-select.
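
The allocation behavior described above can be illustrated with a toy model. This is a sketch only: round-robin stands in for Cloud NAT's internal assignment (which we cannot observe or control), and the pool addresses, worker names, and replica names are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, List


def assign_workers(workers: List[str], nat_ips: List[str]) -> Dict[str, str]:
    """Give each worker exactly one NAT IP.

    Round-robin is a stand-in: the real assignment is internal to Cloud NAT.
    """
    return {worker: nat_ips[i % len(nat_ips)] for i, worker in enumerate(workers)}


# Hypothetical pool (N = 4) and worker names (M = 8).
nat_pool = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]
workers = [f"worker-{i}" for i in range(8)]

assignment = assign_workers(workers, nat_pool)
workers_per_ip = Counter(assignment.values())  # about M / N workers per IP

# A replica's source IP is decided by where the scheduler placed it.
replica_placement = {"pdp10-web.1": "worker-2", "pdp10-web.2": "worker-7"}
replica_egress = {r: assignment[w] for r, w in replica_placement.items()}
print(workers_per_ip)
print(replica_egress)
```

Two replicas of the same service leave from different IPs here purely because of placement, which is why a partner has to allowlist the whole pool.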

Options compared (per partner)

| # | Option | Preserves the existing per-partner IP? | Partner allowlist work | Operational cost | Verdict |
| --- | --- | --- | --- | --- | --- |
| 1 | Notify the partner, they allowlist all 16 swarm NAT IPs (8 non-prod + 8 prod, see §2 policy on us-west1) | No, but doesn’t matter: partner accepts new IPs | One email/ticket per partner; ~24-hour turnaround for most | Zero: fewer resources to maintain after migration | Default. Reconsider only if change-control SLA is multi-week. |
| 2 | Don’t migrate the affected partner stack; leave on dedicated VM | Yes (fully) | None | Continued cost of the dedicated VM; partner stays out of Phase 0 scope | ❌ Ruled out (we want to migrate all partners). |
| 3 | Outbound HTTPS proxy on a tiny dedicated VM owning the existing IP; partner stack uses HTTPS_PROXY env | Yes (exactly, via HTTPS_PROXY) | None | One always-on VM per partner needing preservation; monitoring/alerts/upgrades for each; single point of failure for that partner’s traffic | ⚠️ Acceptable if option 1 is blocked by partner SLA. Cost scales linearly with number of preserved IPs. |
| 4 | Add the existing IPs to the swarm Cloud NAT pool | No. Per-VM allocation means ~1/(N+k) of workers get any given preserved IP and use it for all their outbound regardless of destination. Other workers (and any partner replicas scheduled on them) use the other IPs. Partner still has to allowlist the full set. | Same as option 1 | Slightly higher than option 1 (extra IPs reserved forever) | ❌ Doesn’t preserve IPs for partner-specific traffic. Cloud NAT lacks per-flow IP selection. |
| 5 | Pin partner stack to a specific worker; attach the existing IP to that node | Yes (the pinned worker uses that IP) | None | Defeats Swarm scheduling, breaks autoscaler, single point of failure for the partner stack | ❌ Ruled out (already rejected). |

Why option 4 keeps coming up

Option 4 feels like it should work: “we have the IPs, just keep using them.” It would work if Cloud NAT supported per-flow IP selection rules (“traffic to partner.example.com → source IP X”), but it doesn’t — IP assignment is per-VM, not per-flow or per-destination. So a preserved IP ends up tied to whichever workers Cloud NAT happens to assign it to, used by every outbound flow from those workers regardless of destination, while partner replicas on other workers don’t use it at all. It’s effectively option 1 with extra unused IPs in the pool.
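
The ~1/(N+k) figure can be worked through with exact fractions. This is a sketch assuming an even spread of workers across the pool (an idealization, since Cloud NAT's assignment is internal); the function names and the choice of k = 3 preserved IPs are illustrative assumptions.

```python
from fractions import Fraction


def share_on_one_preserved_ip(pool_ips: int, preserved_ips: int) -> Fraction:
    """Expected share of workers landing on any ONE preserved IP."""
    return Fraction(1, pool_ips + preserved_ips)


def share_off_preserved_ips(pool_ips: int, preserved_ips: int) -> Fraction:
    """Expected share of workers that use none of the preserved IPs."""
    return Fraction(pool_ips, pool_ips + preserved_ips)


# N = 4 pool IPs per region; k = 3 preserved per-partner IPs (illustrative).
on_one = share_on_one_preserved_ip(4, 3)
off_all = share_off_preserved_ips(4, 3)
print(on_one, off_all)
```

Under these assumptions most workers never touch a preserved IP, so partner replicas scheduled on them still egress from addresses the partner has not seen before.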

The only real ways to preserve a partner-specific IP are option 3 (proxy steers at the application layer) or option 5 (node pinning forces the worker assignment), and we’ve ruled out 5. Reconsider option 3 only if option 1 is blocked for a specific partner.

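Pre-cutover checklist (per partner; drafted for pdp7, pdp10, and pdp24, though per the HB-8415 verification above only pdp10 (Genworth) has a confirmed partner-side allowlist, so pdp7 and pdp24 need the outreach steps only if a partner-side surface turns up later):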

Precondition: the §2 verification procedure (“Verifying egress before partner outreach”) must have been run for non-prod, with the verified IPs locked in. For prod, communicate the IPs we have today and follow up before prod cutover with any updates from the same verification run.

  1. Send the partner contact the full swarm NAT IP set across both environments (from §2). Per the 2026-05-12 policy revision, include the us-west1 reserve IPs alongside the us-central1 active set — that way the partner does one round of change-control, not two. Communicating both envs at once minimizes partner change-control churn:
    • Non-production (8 IPs — 4 us-central1 active + 4 us-west1 reserve): 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180, 34.105.111.96, 34.127.70.253, 136.109.214.215, 34.169.152.90
    • Production (8 IPs — 4 us-central1 active + 4 us-west1 reserve): 34.132.72.45, 35.226.246.142, 34.136.40.107, 34.57.41.236, 35.227.138.226, 35.185.201.232, 34.168.190.12, 8.229.103.135
  2. Ask the partner to enumerate every IP of ours they currently allowlist — both prod and non-prod, if applicable. Capture this in the audit’s confirmed-partners table so we know which entries to clean up post-migration.
  3. Get written confirmation that all 16 swarm NAT IPs (8 per env) are in their allowlist before the corresponding env’s migration day. The 4 us-west1 reserves per env are forward-looking and won’t carry traffic today, but having them in place pre-emptively avoids a second round of change-control when we activate that region.
  4. Keep existing per-partner IPs allowlisted on the partner side throughout the dual-allowlist window per §4. The window must extend through the last migration to complete (= prod, gated on HB-8232). Per-partner IPs to keep (only those the partner confirms in step 2 are actually in their allowlist):
    • pdp7 prod: 35.238.12.142 / non-prod: 35.232.30.74
    • pdp10 prod: 34.69.126.254 / non-prod: 34.134.208.190
    • pdp24 prod: 35.225.100.177 / non-prod: 34.30.125.50
  5. One week after prod cutover: ask the partner to remove the old IPs from their allowlist; release the corresponding google_compute_address.external resources in pdp{N}/main.tf (and non-prod equivalents if applicable) on our side.
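
Steps 2 and 3 of the checklist amount to a set comparison between what the partner reports and what we need. The sketch below uses the IPs listed in steps 1 and 4; the function name, the result keys, and the sample partner reply are hypothetical.

```python
from typing import Dict, Iterable, List

# Swarm NAT IPs from step 1 of the checklist.
SWARM_NAT_IPS = {
    "non-prod": ["34.72.82.40", "34.122.251.101", "34.170.35.69", "35.194.23.180",
                 "34.105.111.96", "34.127.70.253", "136.109.214.215", "34.169.152.90"],
    "prod": ["34.132.72.45", "35.226.246.142", "34.136.40.107", "34.57.41.236",
             "35.227.138.226", "35.185.201.232", "34.168.190.12", "8.229.103.135"],
}

# Legacy per-VM IPs from step 4 of the checklist.
LEGACY_IPS = {
    "pdp7": ["35.238.12.142", "35.232.30.74"],
    "pdp10": ["34.69.126.254", "34.134.208.190"],
    "pdp24": ["35.225.100.177", "34.30.125.50"],
}


def review_partner_allowlist(partner: str,
                             reported: Iterable[str]) -> Dict[str, List[str]]:
    """Compare a partner-reported allowlist with the required and legacy sets."""
    reported_set = set(reported)
    required = [ip for ips in SWARM_NAT_IPS.values() for ip in ips]
    return {
        "missing_swarm_ips": [ip for ip in required if ip not in reported_set],
        "legacy_to_remove_later": [ip for ip in LEGACY_IPS[partner]
                                   if ip in reported_set],
    }


# Hypothetical reply: partner added 15 of 16 new IPs and still lists one old IP.
reply = [ip for ips in SWARM_NAT_IPS.values() for ip in ips][:-1] + ["34.69.126.254"]
result = review_partner_allowlist("pdp10", reply)
print(result)
```

An empty `missing_swarm_ips` list corresponds to the written confirmation in step 3; `legacy_to_remove_later` feeds the cleanup in step 5.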

Tableau Analytics — internal system that allowlists our partner IPs

PDP apps embed Tableau dashboards by POSTing to https://tab.thehelperbees.com/trusted to mint a per-user token (see get_tableau_token in pd/thbpd/dw/views/views.py:1954-1983). The endpoint mints a token only when the source IP is in Tableau Server’s wgserver.trusted_hosts; otherwise it returns the literal string -1. After migration, swarm worker traffic egresses through the NAT IPs in §2 — those IPs must be in trusted_hosts before any PDP stack that embeds dashboards cuts over.

| Property | Value |
| --- | --- |
| Hostname (both envs) | tab.thehelperbees.com |
| GCE VM | ansible-tab-stage in the-helper-bees (us-central1-a) |
| Tableau Server version | 2021.2.21 |
| Allowlist mechanism | tsm configuration set -k wgserver.trusted_hosts (comma-separated CIDR list), committed via tsm pending-changes apply |
| CF Access on tab.thehelperbees.com | Not enforced: CF policy verified absent for this hostname; trust is enforced solely by wgserver.trusted_hosts + Tableau session auth |
| Caddy front | Caddy 2.4.3 on the same VM (config baked into the GCR image gcr.io/the-helper-bees/tab/caddy:2026-03-05, source not in any THB repo). The (is-public-ip) / (is-vpn-ip) snippets are cosmetic: both @public-match and @admin-match proxy to the same backend. Caddyfile/Dockerfile cleanup tracked under HB-8635 |
| Tableau Server config | Not managed by infrahive: tsm lives only on the VM; zero references to tableau / Tableau / tab.thb in this repo |

Both prod and non-prod apps point at the same tab.thehelperbees.com (confirmed by tech lead 2026-05-05). All swarm NAT IPs must be in the single shared trusted_hosts list — applying the change is environment-agnostic.

Migration action — DONE for us-central1 (HB-8623, PR #801):

  1. SSHed to ansible-tab-stage (via gssh tab the-helper-bees awx-builder-stage from infrahive bin/).
  2. Captured the existing 20-entry trusted_hosts list with tsm configuration get -k wgserver.trusted_hosts --trust-admin-controller-cert.
  3. Appended the 8 us-central1 swarm NAT IPs (purely additive; no existing entries removed) → 28 total.
  4. tsm pending-changes apply + tsm restart (a 90-min outage during apply was caused by /var/opt/tab-backups filling to 100%; recovered by deleting old weekly backups and restarting the controller. Findings filed as follow-ups under HB-6748: backup retention automation, backup health monitoring, Tableau Server upgrade).
  5. End-to-end verified: worker-side probe (tab.thehelperbees.com/trusted from swarm-n-worker-r2w7, egress 34.170.35.69) returns HTTP 200 + "-1" (the script POSTs deliberately-bogus credentials — 200 + "-1" proves the request got past CF Access AND was processed by Tableau Server, which is the chain we care about; a separate manual test with real credentials returned HTTP 200 + a real token, confirming the trusted-auth path also works end-to-end). Control-plane probe (tableau-trusted-hosts) reports 4/4 us-central1 NAT IPs in wgserver.trusted_hosts for both envs.

Follow-up needed — us-west1 reserve IPs (2026-05-12 policy revision): PR #801 predates the §2 policy update that pre-emptively includes the 4 us-west1 reserve IPs per env. Apply the same tsm configuration set -k wgserver.trusted_hosts + tsm pending-changes apply + tsm restart cycle, this time appending the 8 us-west1 IPs (4 non-prod + 4 prod — see §2 table) — list grows from 28 → 36 entries. After apply, the tableau-trusted-hosts probe should report 8/8 NAT IPs per env. Tracked separately so the historical PR #801 record stays accurate.
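
Because the change is purely additive, the new `trusted_hosts` value can be built by an idempotent merge of the captured list and the new IPs. This is a minimal sketch: the function name is hypothetical, the sample addresses are placeholders rather than the real entries, and joining with a comma plus a space is an assumption about the format Tableau Server expects that should be checked against the captured value.

```python
from typing import Iterable, List


def merge_trusted_hosts(current: str, additions: Iterable[str]) -> str:
    """Append new entries to a comma-separated list.

    Keeps the existing order, skips duplicates, and never removes anything.
    """
    entries: List[str] = [e.strip() for e in current.split(",") if e.strip()]
    seen = set(entries)
    for ip in additions:
        if ip not in seen:
            entries.append(ip)
            seen.add(ip)
    return ", ".join(entries)


# Placeholder data: 28 existing entries plus 8 new reserve IPs.
existing = ", ".join(f"192.0.2.{i}" for i in range(1, 29))
reserve_ips = [f"198.51.100.{i}" for i in range(1, 9)]

merged = merge_trusted_hosts(existing, reserve_ips)
print(len(merged.split(", ")))
```

Running the merge a second time with the same inputs leaves the list unchanged, so the step is safe to repeat if the apply has to be retried.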

Why it’s not “structurally identical to a partner allowlist.” Tableau is internal to THB (we own the server) and the allowlist is enforced by Tableau Server config we control directly — not by a partner-side firewall. Once the swarm NAT IPs are in trusted_hosts, no further coordination is needed. Per-partner GCE address cleanup (HB-8239) can release the legacy pdp1/pdp5/pdp14 entries from trusted_hosts after prod cutover, but they’re zero-cost left in place during the dual-allowlist window.

Out-of-band findings filed during HB-8623 investigation (all parented to HB-6748 unless noted):

  • HB-8635 — Tableau VM Caddyfile / Dockerfile cleanup (both ~4 years stale, hardcoded CF DNS token, unused TAB_CLOUDFLARE_CADDY_TOKEN Docker secret, non-functional (is-public-ip)/(is-vpn-ip) snippets). Reclassified Low priority — cosmetic, no functional impact.
  • HB-8636 — One-time fix: clear /var/opt/tab-backups and take a fresh backup (Slack-imported, immediate cleanup of the disk-full state).
  • HB-8637 — Tableau Server backup retention automation (auto-delete weekly backups older than 30 days; prevents the disk-full failure mode recurring).
  • HB-8638 — Tableau Server backup-disk health alarm (page on /var/opt/tab-backups > 80% full and on stale/0-byte newest .tsbak; closes the visibility gap that let the disk fill silently).
  • Tableau Server upgrade from 2021.2.21 → current LTS, Ubuntu 16.04 host EOL — captured as a comment on existing epic HB-5567 (parented to HB-5565 “Terraform the Tableau Server”); the TF rebuild is the natural upgrade vehicle, no separate ticket needed.

3.6 Other (lower priority but worth verifying)

  • Cloudflare Tunnel / cloudflared config: src/tools/launchbot/templates/cloudflared/config.yml.tmpl. Tunnel is identity-based, not IP-based, but verify there’s no IP-pinned policy.
  • Datadog allowlists — outbound to Datadog via dd_agent. Datadog accepts traffic from any source; Datadog API key auths the call. No change.
  • Papertrail (syslog+tls://logs7.papertrailapp.com:23105) — same; no IP allowlist on Papertrail’s side.
  • Camunda / Keycloak outbound — internal, no external allowlist.
  • GitHub Actions deploy proxy (infra/common-infra/business_unit_1/shared/gh_actions_deploy_proxy.tf) — separate proxy, not impacted.

4. Cutover Sequence (handoff to HB-8218)

  1. Provision swarm Cloud NAT for prod (HB-8217 already done for non-prod; prod gated on HB-8232).
  2. Capture the NAT IPs per env via gcloud compute addresses list --filter='name~^ca-{n,p}-shared-base-spoke-us-central1-' --format='table(name,address)' against the corresponding network host project (full commands in §2). Use addresses list rather than routers nats describe here — describe returns address-resource URLs, not IP values. For defense-in-depth, also run routers nats describe to confirm those addresses are actually attached to the NAT, and routers get-nat-mapping-info to confirm workers map to them (per §2 verification procedure).
# Non-production
gcloud compute routers nats describe rn-n-shared-base-spoke-us-central1-egress \
  --router=cr-n-shared-base-spoke-us-central1-nat-router \
  --region=us-central1 \
  --project=prj-n-shared-base-cb89

# Production
gcloud compute routers nats describe rn-p-shared-base-spoke-us-central1-egress \
  --router=cr-p-shared-base-spoke-us-central1-nat-router \
  --region=us-central1 \
  --project=prj-p-shared-base-11f6
  3. Pre-add NAT IPs to all surfaces before any workload moves:
    • PD shared MSSQL authorized_networks (Terraform PR adding static entries to the concat in pd-infra/.../mssql.tf:110).
    • Azure SQL firewall: 2 rules per env via add_az_sql_fw_rule.sh.
    • Cloudflare Access groups green_p_ip_bypass and yellow_p_ip_bypass: TF PR adding static entries.
    • Caddy VPN bypass: TF PR adding the swarm NAT IPs as static entries to the caddy_vpn_bypass_list callers. (No overlay during migration; see HB-8620 for the future overlay work.)
    • Partner systems: outreach in flight (longest-tail; start now).
  4. Validate (workers boot but stay drained until allowlists confirmed).
  5. Cut over workloads.
  6. Post-migration cleanup: remove now-stale per-VM rules.
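
The gate between the pre-add and validate steps (workers stay drained until every surface is confirmed) can be expressed as a simple check. This is a sketch with hypothetical names: the surface labels and the sample allowlist contents are placeholders, and gathering each surface's real contents is left out.

```python
from typing import Dict, Iterable, List, Set


def cutover_blockers(nat_ips: Iterable[str],
                     surfaces: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each surface to the NAT IPs it still lacks.

    An empty dict means every surface already allowlists every NAT IP.
    """
    required = list(nat_ips)
    blockers: Dict[str, List[str]] = {}
    for name, allowlisted in surfaces.items():
        missing = [ip for ip in required if ip not in allowlisted]
        if missing:
            blockers[name] = missing
    return blockers


# Placeholder NAT IPs and surface contents.
nat_ips = ["203.0.113.1", "203.0.113.2"]
surfaces = {
    "cloud_sql_authorized_networks": {"203.0.113.1", "203.0.113.2"},
    "azure_sql_firewall": {"203.0.113.1"},
    "caddy_vpn_bypass_list": {"203.0.113.1", "203.0.113.2", "192.0.2.9"},
}
blockers = cutover_blockers(nat_ips, surfaces)
print(blockers)
```

Workers would be undrained only when the result is empty; anything else names the surface and the exact IPs still outstanding.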

Dual-allowlist window

Steps 3-6 deliberately leave both per-VM IPs and the new NAT IPs allowlisted for the duration of the cutover. This is necessary for zero-downtime migration but expands the allowlist by ~4 IPs per environment for the duration of the window. Treat this as a known temporary security posture:

  • Expected duration: hours to days for in-house surfaces (Cloud SQL, Azure SQL, Cloudflare); weeks for partner-side allowlists where partner change-control SLAs apply (most acute for §3.5 partner webhooks).
  • Highest-impact surfaces during the window: PD shared MSSQL (§3.1) and partner webhooks (§3.5) — both expose data paths, not just management.
  • Monitoring: before the window opens, enable connection logging on Cloud SQL (cloudsql.googleapis.com/postgres.log or equivalent) and Azure SQL audit logging filtered by client_ip. Watch for traffic from the new NAT IPs before cutover (should be zero) and unexpected continued traffic from per-VM IPs after cutover (signals incomplete migration of a workload).
  • Cleanup commitment: step 6 is not optional. Each surface owner is on the hook for removing the stale per-VM entry within 1 week of cutover, tracked in HB-8218 subtasks.
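
The monitoring rule above reduces to watching a different IP set depending on the phase. The sketch below assumes the client IPs have already been extracted from the connection or audit logs; the function name and the sample addresses are hypothetical.

```python
from typing import Iterable, List, Set


def unexpected_sources(client_ips: Iterable[str],
                       nat_ips: Set[str],
                       per_vm_ips: Set[str],
                       after_cutover: bool) -> List[str]:
    """Return the source IPs that should not be seen in the current phase.

    Before cutover: any traffic from the new NAT IPs (expected to be zero).
    After cutover: any traffic still arriving from legacy per-VM IPs.
    """
    watch = per_vm_ips if after_cutover else nat_ips
    return sorted({ip for ip in client_ips if ip in watch})


# Placeholder addresses.
nat = {"203.0.113.1", "203.0.113.2"}
legacy = {"192.0.2.10", "192.0.2.11"}
observed = ["192.0.2.10", "203.0.113.1", "192.0.2.10", "198.51.100.7"]

print(unexpected_sources(observed, nat, legacy, after_cutover=False))
print(unexpected_sources(observed, nat, legacy, after_cutover=True))
```

A non-empty result after cutover points at a workload that has not fully migrated, which also blocks removal of its stale per-VM entry.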

5. Open Questions

| # | Question | Owner |
| --- | --- | --- |
| 1 | Will the swarm cluster route HB↔︎PD via overlay network? Resolved. No: HB↔︎PD traffic continues over Cloudflare during the migration. The swarm NAT IPs must be added to caddy_vpn_bypass_list. Future overlay tracked in HB-8620. | |
| 2 | Does production HB MSSQL exist? Resolved. Confirmed via PR #362: non-prod-only sandbox added for Benefits Hub testing; no prod equivalent. | |
| 3 | Is the ADO agent VM (ado-agent-p-vm) included in this migration? Resolved. Not in scope; it stays on its dedicated VM. The module.ado-agent-p-vm.instance_external_ip reference in the HA NSG SSH-2290 / SSH-22 rules (infra/ha-infra/business_unit_1/production/azure_devops.tf:87,105) needs no change. | |
| 4 | Which partners allowlist our outbound IP? Partially resolved. It’s tribal knowledge. Confirmed (all verified live via curl ifconfig.me): pdp7 (35.238.12.142), pdp10 (34.69.126.254), and pdp24 (35.225.100.177) allowlist our outbound IP on their side (see §3.5). Slack outreach posted in #devops on 2026-04-30; awaiting confirmation that no others exist. | PD partner success |
| 5 | What is the actual NAT IP resource name/path? Resolved. TF lives in thehelperbees/gcp-networks at modules/base_shared_vpc/nat.tf (called from envs/{production,non-production}/boa_vpc_fw.tf). HB-8217 firewall rules are in modules/base_shared_vpc/firewall.tf:254-345. All 8 us-central1 NAT IPs verified live 2026-04-30 (4 non-prod + 4 prod); see §2 table. us-west1 NAT IPs (4 per env) are pre-emptively included in all allowlists for future DR/expansion; see §2 callout for the 2026-05-12 policy revision. | |

6. Action Item Traceability

Every migration action this audit identifies maps to an existing Jira ticket. Comments noted below were posted to the corresponding ticket on 2026-04-30 to add audit-derived scope detail that wasn’t in the original ticket description.

| Audit § | Action | Jira ticket | Notes |
| --- | --- | --- | --- |
| §2 | Provision swarm Cloud NAT (prod) | HB-8232 | Gates prod migration |
| §2 | NAT IP set composition verification (3-check procedure) | HB-8417 | See comment 2026-04-30 distinguishing this from connectivity testing |
| §2 | Connectivity test script | HB-8211 | In Progress |
| §3.1 | PD shared MSSQL authorized_networks | HB-8412 | |
| §3.1 | Remove hb-n-vm entry from sandbox MSSQL | HB-8412 | Folded in via comment 2026-04-30 |
| §3.1.1 | Per-app --private-ip opt-in for new-hierarchy Postgres (cloudsql-proxy env var) | HB-8646 | pd #896: replaces global compose overlay with per-app .cloudsql-proxy.env opt-in |
| §3.1.1 | NAT allowlist on legacy the-helper-bees:ansible-p<N>-psql-* instances | HB-8646 | infrahive #811: idempotent gcloud script, --app <id> filter for incremental rollout |
| §3.1.1 | Connectivity probes for both paths (new-hierarchy + legacy) | HB-8211 | infrahive #808: probe_pd_postgres_reachability + probe_pd_legacy_postgres_allowlist |
| §3.2 | Azure SQL firewall (prod + non-prod) | HB-8411 | |
| §3.3 row A | pd-infra Cloudflare Access groups | HB-8414 | |
| §3.3 row B | hb-infra cloudflare-access + eligibility (local.pd_addresses) | HB-8414 | Folded in via comment 2026-04-30; original scope only mentioned pd-infra |
| §3.4 | Caddy VPN bypass list | HB-8413 | |
| §3.4 | future HB↔︎PD overlay network | HB-8620 | Parented to HB-6748 (DevOps KTLO) |
| §3.5 | Notify pdp7 / pdp10 / pdp24 (both envs) | HB-8415 | Concrete list + checklist added via comment 2026-04-30; per-partner allowlist verification via docs/scripts/query_event_subscriptions.sh + 401-vs-403 differential test recorded in §3.5 (2026-05-07) |
| §3.5 | Tableau Analytics allowlist (wgserver.trusted_hosts += 8 us-central1 swarm NAT IPs) | HB-8623 | DONE via PR #801; verified end-to-end 2026-05-06 |
| §3.5 | Tableau Analytics allowlist follow-up (wgserver.trusted_hosts += 8 us-west1 reserve NAT IPs) | TBD | Open: 2026-05-12 policy revision; same tsm operational path as HB-8623 |
| §3.5 | Tableau Caddyfile / Dockerfile cleanup (cosmetic) | HB-8635 | Filed 2026-05-05: Low priority; out-of-band finding from HB-8623 |
| §3.5 | Tableau backup disk one-time clear + fresh backup | HB-8636 | Filed 2026-05-06 (Slack-imported): Medium |
| §3.5 | Tableau backup retention automation (recurring prune) | HB-8637 | Filed 2026-05-06: Medium |
| §3.5 | Tableau backup-disk health alarm (/var/opt/tab-backups > 80%) | HB-8638 | Filed 2026-05-06: Medium |
| §3.5 | Tableau Server + Ubuntu host upgrade (2021.2.21 / 16.04 EOL) | HB-5567 | Comment 2026-05-06: TF rebuild is the upgrade vehicle |
| §3.5 | Per-partner GCE address cleanup | HB-8239 | Implicit (decommission also removes address resources) |
| §4 | Pre-add NAT IPs to all surfaces | HB-8411 / 8412 / 8413 / 8414 / 8415 | |
| §4 | Dual-allowlist window (do NOT remove old per-VM IPs during cutover) | HB-8416 | |
| §4 | Cut over workloads | HB-8232 + per-app migration tickets | |
| §4 | Post-migration cleanup | HB-8239 | |
| Cross-cutting | Worker SA cloudsql.client grants | HB-8590 | Parented to HB-8216 (worker MIG module) |
| Parent | Allowlist updates (umbrella) | HB-8218 | Description updated 2026-04-30 to fix 2-IP/4-IP staleness and link audit |
| Parent | Swarm Consolidation Epic | HB-8200 | |

7. References
