Docker Swarm Outbound IP Audit
Status: Ready for review. All 8 us-central1 NAT IPs (4 non-prod + 4 prod) verified live as of 2026-04-30. Per the 2026-05-12 policy revision in §2, allowlists also include the 4 us-west1 reserve IPs per env (16 IPs total) for forward compatibility.
Author: miko.hadikusuma
Created: 2026-04-29
Updated: 2026-05-06
Jira: HB-8210
Gates: HB-8218 (Allowlist updates) → HB-8232 (Deploy production swarm)
Table of Contents
- Purpose & Scope
- Current vs. Target State
- Audit by Surface
- Cutover Sequence (handoff to HB-8218)
- Open Questions
- Action Item Traceability
- References
1. Purpose & Scope
When the nine Phase 0 THB applications consolidate onto shared Docker Swarm workers, every per-VM external IP currently used as the source of an outbound connection disappears. Workers run with no external IP behind a shared Cloud NAT with 4 static egress IPs per region (us-central1) — see §2 for the verified set.
Anything that allowlists today’s per-VM IPs needs to be updated to the new NAT IPs before workloads cut over. This audit enumerates every such allowlist surface, where the rule lives, and the migration action needed. The deliverable feeds directly into HB-8218 (the implementation ticket).
Apps in scope (canonical list per ./zig/zig build scripts -- compose-file-linter --list): bees, benefits-platform, consumer_portal, django_homealign, hb-buzz, hbcrm, pd, thb-keycloak, the_consumer_portal. (HA apps, which are Azure-side, are out of scope for the swarm migration, but their Azure SQL firewall is still affected; see §3.2.)
In scope: anywhere a current VM’s external IP is referenced as a source/allowed origin for outbound calls from one of these nine apps.
Out of scope:
- Inbound allowlists for user access to our infra (e.g. THB VPN exit IPs gc{2,4,5,6}-algo in infra/bees-infra/common.auto.tfvars). These remain identical post-migration.
- Partner-side IPs allowlisting them into our systems (e.g. local.partner_ip_addresses in infra/ha-infra/business_unit_1/production/environment.tf). Partner IPs don’t change.
- App service-account permissions (covered by HB-8212 / HB-8213).
2. Current vs. Target State
Current per-VM external IPs
Discovered via data.google_compute_addresses.* and
per-resource google_compute_address definitions:
| App grouping | Resource (prod) | Notes |
|---|---|---|
| HB shared VM (hbcrm, bea_hbcrm, hb_buzz, django_homealign, eligibility, keycloak/thb-keycloak, hbtemplate, unleash, posthog) | google_compute_address.external_migrate (infra/hb-infra/business_unit_1/production/main.tf:22) | One VM hosts ~9 apps. thb-keycloak repo deploys the keycloak service here. |
| Benefits Hub (benefits-platform repo, multiple tenant variants) | google_compute_address.external (infra/hb-infra/business_unit_1/production/benefits_platform.tf:6) | |
| Bees Flask API (bees repo) | module.bees-p-vm2.instance_external_ip (infra/bees-infra/business_unit_1/production/main.tf:23) | Standalone VM in its own bees-infra project |
| Legacy Consumer Portal (consumer_portal repo, multi-partner) | google_compute_address.cp_external (infra/pd-infra/business_unit_1/production/consumer_portal.tf:7) | cp-p-vm. Hosts hbcp_jh, hbcp_aarp, hbcp_sompo, hbcp_pru, hbcp_bcbs_ar, hbcp_cgi, etc. Source of HB-8208 gcsfuse work. |
| New Consumer Portal — CoPo 3.0 (the_consumer_portal repo) | google_compute_address.external (infra/pd-infra/business_unit_1/production/portal-the-consumer-portal.tf:9) | portal-p-vm. Auth0 custom domain integration; PR preview env target (HB-7963). |
| ADO agent (deploys HomeAlign Apps) | google_compute_address.ado_agent_external_ip (infra/ha-infra/business_unit_1/production/azure_devops.tf:83) | Serves Azure DevOps SSH/deploy traffic into HA (HomeAlign) Windows VMs. HA apps are not in swarm scope; ADO agent stays put → no action. |
| 22+ PD partner VMs | google_compute_address.external per infra/pd-infra/business_unit_1/production/pdp*/main.tf:12 | One VM per partner |
Quick inventory command:
gcloud compute addresses list --filter="address_type=EXTERNAL" --format="table(name,address,project)" \
--project=<env-project-id>
Target swarm NAT IPs
Provisioned per HB-8217 / HB-8360 as Cloud NAT on the vpc-{env}-shared-base shared VPC.
Capacity:
4 IPs × 64,512 ports / 2048 min_ports_per_vm = ~126 VM ceiling per region.
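The ceiling above is plain integer arithmetic, and a small helper makes it easy to re-run if the IP count or per-VM port floor changes. This is a minimal sketch: the function name nat_vm_ceiling is hypothetical, and the 64,512 ports-per-IP figure is taken from the line above rather than looked up independently.

```shell
#!/usr/bin/env bash
# Cloud NAT capacity sketch: how many VMs fit behind a NAT pool.
# nat_vm_ceiling is a hypothetical helper, not part of any repo.
nat_vm_ceiling() {
  local nat_ips="$1" min_ports_per_vm="$2"
  local ports_per_ip=64512   # usable ports per NAT IP, as quoted above
  echo $(( nat_ips * ports_per_ip / min_ports_per_vm ))
}

nat_vm_ceiling 4 2048   # prints 126
```

Halving min_ports_per_vm to 1024 would double the ceiling to 252, at the cost of fewer concurrent outbound connections per worker.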
Canonical TF source (separate repo): thehelperbees/gcp-networks
| Resource | Path |
|---|---|
| Cloud Router (cr-{env_code}-shared-base-spoke-us-central1-nat-router) | modules/base_shared_vpc/nat.tf:22-32 |
| NAT external IPs (ca-{env_code}-shared-base-spoke-{us-central1,us-west1}-{0..3}) | modules/base_shared_vpc/nat.tf:34-39 (count = var.nat_num_addresses_region1 per region; us-central1 active, us-west1 reserve per §2 policy) |
| Router NAT (rn-{env_code}-shared-base-spoke-us-central1-egress) | modules/base_shared_vpc/nat.tf:41-56 |
| Module call with nat_num_addresses_region1 = 4, nat_min_ports_per_vm = 2048 | envs/{production,non-production}/boa_vpc_fw.tf:45-47 |
| HB-8217 swarm firewall rules (TCP 2377, TCP+UDP 7946, UDP 4789, TCP 9323), tag-scoped to swarm-node | modules/base_shared_vpc/firewall.tf:254-345 |
NAT IPs to allowlist (verified 2026-04-30 against live state):
All four IPs per region are attached round-robin to that region’s Router NAT, so every IP must be in every allowlist — partial allowlisting causes intermittent failures as workers cycle through unallowlisted IPs. Per the 2026-05-12 policy revision below, both regions’ IPs are included in every allowlist (16 total — 4 us-central1 + 4 us-west1 per env), even though only us-central1 carries swarm traffic today.
| Environment | Region | Network host project | Address resource | NAT IP | Status |
|---|---|---|---|---|---|
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-0 | 34.72.82.40 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-1 | 34.122.251.101 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-2 | 34.170.35.69 | active |
| Non-production | us-central1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-central1-3 | 35.194.23.180 | active |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-0 | 34.105.111.96 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-1 | 34.127.70.253 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-2 | 136.109.214.215 | reserve |
| Non-production | us-west1 | prj-n-shared-base-cb89 | ca-n-shared-base-spoke-us-west1-3 | 34.169.152.90 | reserve |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-0 | 34.132.72.45 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-1 | 35.226.246.142 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-2 | 34.136.40.107 | active |
| Production | us-central1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-central1-3 | 34.57.41.236 | active |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-0 | 35.227.138.226 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-1 | 35.185.201.232 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-2 | 34.168.190.12 | reserve |
| Production | us-west1 | prj-p-shared-base-11f6 | ca-p-shared-base-spoke-us-west1-3 | 8.229.103.135 | reserve |
📝 Historical note: HB-8218 originally listed only 2 of 4 IPs per env. The remaining 6 were discovered live during this audit (gcp-networks provisions nat_num_addresses_region1 = 4 for both envs per envs/{production,non-production}/boa_vpc_fw.tf:45). HB-8218 description updated 2026-04-30 with the corrected set.
📝 us-west1 NAT IPs are pre-emptively included in allowlists. A gcloud compute addresses list query against either shared-base-vpc-host project also returns 4 IPs per env in us-west1 (non-prod 34.105.111.96 / 34.127.70.253 / 136.109.214.215 / 34.169.152.90; prod 35.227.138.226 / 35.185.201.232 / 34.168.190.12 / 8.229.103.135). These belong to a parallel Cloud NAT in us-west1 for non-swarm workloads today (Cloud Functions and any future regional expansion). The swarm cluster currently runs only in us-central1 (managers in us-central1-{a,b,c}; workers default to ["us-central1-a","us-central1-b","us-central1-c","us-central1-f"] per swarm_worker/variables.tf), so today no swarm traffic ever egresses through the us-west1 pool.
Policy as of 2026-05-12: include the us-west1 IPs in every allowlist anyway (internal surfaces like the legacy SQL allowlist script, and partner-side allowlists like §3.5). Rationale: if/when we activate swarm in us-west1 for DR or capacity, we won’t need a second round of partner change-control or script patches. The cost is small (4 extra IPs per env to maintain) and the IPs are static reservations that won’t move, so the future risk of stale entries is low. The earlier guidance to omit them has been superseded.
Verification command (per env):
# Non-production
gcloud compute addresses list \
--project=prj-n-shared-base-cb89 \
--filter='name~^ca-n-shared' \
--format='table(name,address)'
# Production
gcloud compute addresses list \
--project=prj-p-shared-base-11f6 \
--filter='name~^ca-p-shared' \
--format='table(name,address)'
Manager IPs (3 per env, separate from worker NAT):
- Non-prod: google_compute_address.swarm_manager_external in infra/hb-infra/business_unit_1/non-production/swarm.tf (deployed)
- Prod: same resource in production/swarm.tf — block-commented (/* ... */), gated on HB-8232
Verifying egress before partner outreach
Before notifying partners (§3.5), confirm that worker traffic actually exits via the NAT IPs above. Misalignment between the reserved IPs and the IPs partners see is exactly the silent-failure mode that caused the HB-8218 4-vs-2 finding — but for partner allowlists, that failure happens at the partner’s edge, after cutover, with no logs on our side.
Two complementary checks. Pass criteria: every worker is mapped to one of the 4 NAT IPs, and a curl from inside the worker echoes back the same IP that GCP’s mapping table claims for it.
Check 1 — control-plane mapping: ask GCP which NAT IP each worker is using. No cluster shell access needed.
NETWORK_HOST_NON_PROD="$(gcloud projects list \
--filter='labels.application_name=base-shared-vpc-host AND labels.environment=non-production' \
--format='value(projectId)')"
gcloud compute routers get-nat-mapping-info \
cr-n-shared-base-spoke-us-central1-nat-router \
--nat-name=rn-n-shared-base-spoke-us-central1-egress \
--region=us-central1 \
--project="$NETWORK_HOST_NON_PROD"
Output is a JSON list, one entry per VM, each with natIpPortRanges[].natIp. Group by natIp to confirm: every worker maps to one of 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180. Any other IP is a finding.
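The "any other IP is a finding" rule can be automated once the NAT IPs are reduced to one per line. The following is a rough sketch only: check_nat_ips is a hypothetical helper, it hardcodes the non-production us-central1 pool from the table in this section, and the step that extracts IPs from the gcloud output is deliberately left out because the exact JSON shape is not reproduced in this document.

```shell
#!/usr/bin/env bash
# check_nat_ips (hypothetical helper): reads one NAT IP per line on stdin,
# reports any IP outside the expected non-prod us-central1 pool.
# Returns 0 when every IP is expected, 1 when at least one is not.
check_nat_ips() {
  local unexpected
  # -F fixed strings, -x whole-line match, -v keep only the non-matching lines.
  unexpected="$(grep -Fxv \
    -e '34.72.82.40' -e '34.122.251.101' \
    -e '34.170.35.69' -e '35.194.23.180' \
    | grep . | sort -u)"
  if [ -n "$unexpected" ]; then
    printf '%s\n' "$unexpected" | sed 's/^/unexpected NAT IP: /'
    return 1
  fi
  echo "all NAT IPs are in the expected pool"
}

# Example with literal input; in practice stdin would come from the mapping output.
printf '%s\n' 34.72.82.40 34.170.35.69 | check_nat_ips
```

A production run would need its own expected set (the four prod us-central1 addresses), so parameterizing the pool is the obvious next step.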
Check 2 — partner’s-eye view: run a one-shot global service that hits an IP echo from every worker. What it reports is what a partner will see.
# From a swarm manager (SSH via IAP):
docker service create \
--name nat-ip-check \
--mode global \
--restart-condition none \
alpine sh -c 'echo "$(hostname): $(wget -qO- https://ifconfig.me)"'
# Wait ~30s for tasks to complete, then:
docker service logs nat-ip-check --no-trunc
# Cleanup:
docker service rm nat-ip-check
Cross-check the IP each worker reports against Check 1: they should match per-worker. If they don’t, the egress path isn’t what we think it is — stop and investigate before partner outreach.
If you don’t have manager shell access, alternate per-worker approach:
gcloud compute ssh swarm-worker-n-XXXX --tunnel-through-iap \
--command='curl -s https://ifconfig.me'
Repeat for each worker.
Optional — Check 3, NAT translation logs: the gcp-networks egress_nat_region1 resource enables log_config { filter = "TRANSLATIONS_ONLY" }, so every successfully translated connection gets logged (translation errors are not). While Check 2 is running, tail the logs to see actual translations:
gcloud logging read \
'resource.type="nat_gateway"
resource.labels.gateway_name="rn-n-shared-base-spoke-us-central1-egress"' \
--project="$NETWORK_HOST_NON_PROD" \
--limit=50 \
--format='value(jsonPayload.connection.nat_ip,jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip)'
Three columns: NAT IP exited from, worker internal IP, destination. Confirms what we expect from a third independent angle.
If all three checks agree, the IP set is ground-truthed and partner outreach can go out. If any disagree, the disagreement is the next thing to chase.
Per-surface connectivity sweep (HB-8211)
docs/scripts/swarm_connectivity_test.sh
is the operational pre-flight check that pairs with the §3 audit. SSH
into a swarm worker, copy the script over, run it. It probes each
surface this audit calls out (GCP APIs, Azure SQL, Cloudflare-fronted
services, Papertrail, Datadog, Tableau /trusted) plus skip
placeholders for partner endpoints (TBD until URLs surface).
A complementary control-plane audit lives at src/scripts/swarm_connectivity_test.sh
and runs from any operator workstation with gcloud +
az + a Cloudflare API token. It re-derives the NAT pool
from gcp-networks state and probes each allowlist surface from
outside-in: PD MSSQL authorized_networks, Azure SQL
firewall, Caddy caddy_vpn_bypass_list in GCS, Cloudflare
Access vpn_bypass policies, NAT egress mapping, and Tableau
wgserver.trusted_hosts (via gcloud compute ssh
+ tsm configuration get). Run via
./zig/zig build scripts -- swarm_connectivity_test — exit
code = number of failures, suitable as a CI gate on top of (not in place
of) the on-host script.
gcloud compute scp docs/scripts/swarm_connectivity_test.sh \
swarm-n-worker-XXXX:/tmp/ \
--tunnel-through-iap --project=prj-bu1-n-hb-infra-5381
gcloud compute ssh ubuntu@swarm-n-worker-XXXX \
--tunnel-through-iap --project=prj-bu1-n-hb-infra-5381 \
--command='bash /tmp/swarm_connectivity_test.sh'
Run it before every migration batch. Pre-cutover, expected fails are typically the vpn_bypass-protected checks (until HB-8413 / HB-8414 ship). The Azure SQL TCP probe is network-path-only and not authoritative for firewall allowlisting (see §3.2). Public Cloudflare sanity endpoints should pass without an allowlist update. Post-cutover, everything except SKIP rows should be green. Exit code = number of failures, easy CI gate.
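For readers unfamiliar with the "exit code = number of failures" convention, the pattern looks roughly like the sketch below. It is not the actual swarm_connectivity_test.sh; the probe helper and the three placeholder checks are hypothetical and exist only to show the counting logic.

```shell
#!/usr/bin/env bash
# Sketch of the "exit code = number of failures" convention.
# probe and the placeholder checks below are hypothetical.
failures=0

probe() {
  # usage: probe <label> <command...>
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

probe "always-ok"  true
probe "always-bad" false
probe "also-bad"   test -e /nonexistent/path

# Exit statuses are 8-bit, so cap the count: 256 failures must not read as 0.
status=$(( failures > 255 ? 255 : failures ))
echo "failures: $failures (a real script would end with: exit $status)"
```

A CI job then only needs to treat any non-zero status as a failed gate; the exact count is a convenience for humans reading the job summary.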
What to expect when running the test from each VM type
The script’s diagnostic value comes from comparing legacy-VM results to swarm-worker results — the diff shows what allowlist work has landed and what hasn’t. This matrix shows expected results for the rows that vary by host:
| Probe | hb-VM (legacy) | partner-VM (legacy, e.g. pdp21) | swarm worker (target post-migration) | Why |
|---|---|---|---|---|
| eligibility (vpn_bypass) | FAIL (302) | PASS (200) | PASS (200) — after HB-8413 / HB-8414 | local.pd_addresses is filtered by name:pdp* in the PD project. Partner VMs are in there; HB VM isn’t. After migration, swarm NAT IPs are added as static entries → workers behave like partner VMs do today. |
| hb-infra cloudsql.client | PASS | FAIL | PASS — after HB-8590 | Legacy: each VM’s compute SA is scoped to projects that VM connects to. Post-migration: worker SA holds the union so any stack can land on any worker. |
| pd-infra cloudsql.client | FAIL | PASS | PASS — after HB-8590 | Same — pd-infra DBs are reachable today only from PD partner VMs (their SA has the binding). |
| bees-infra cloudsql.client | PASS | FAIL | PASS — after HB-8590 | hb-VM apps reach bees-infra DBs today; partner VMs don’t. |
| the-helper-bees (shared) | PASS | FAIL | PASS — after HB-8590 | Shared/legacy Cloud SQL instances (ansible-hb-psql-prod/stage, ansible-psql-warehouse). hb-VM apps connect; partner VMs don’t. |
How to use this matrix: if you run the script from a
swarm worker and any of the “PASS — after …” rows isn’t green,
that’s a finding for the cited subtask. Conversely, the legacy-VM
columns aren’t bugs — they reflect the current per-VM IAM scoping that
HB-8590 deliberately collapses, and the current vpn_bypass
filter that HB-8413/HB-8414 deliberately broadens. The whole point is
the diff between legacy and worker.
The other probe groups (GCP APIs, public Cloudflare-fronted, Papertrail, Datadog, Azure SQL TCP) should PASS from any GCP VM in the right env regardless of HB-8218 status — they don’t depend on the work HB-8218 is gating.
3. Audit by Surface
3.1 GCP
Cloud SQL authorized_networks
| Instance | Path | What’s allowlisted today | Migration action |
|---|---|---|---|
| HB Postgres (hb-p-psql) | infra/hb-infra/business_unit_1/production/main.tf:32 | var.authorized_networks only (THB VPN: gc{2,4,5,6}-algo) — no per-VM IPs | None. App connects via Cloud SQL Proxy sidecar (Phase 0 Step 1). |
| HB MSSQL sandbox — non-prod only (hb-n-sql-server) | infra/hb-infra/business_unit_1/non-production/sql_server.tf:48 | var.authorized_networks + hb-n-vm external IP | Remove the hb-n-vm entry; nothing replaces it. Added by PR #362 as a DHA sandbox for the Benefits Hub team to test against without colliding with other teams’ deploys. The allowlist entry was for the host-level Cloud SQL proxy connecting via public IP. The new per-stack cloud-sql-proxy sidecar authenticates to the Cloud SQL admin API by IAM, bypassing authorized_networks entirely. No prod equivalent. Workers need cloudsql.client on prj-bu1-n-hb-infra-* (HB-8590) and module.hb-n-sql-server.instance_connection_name in the django_homealign stack’s sidecar command. |
| PD shared MSSQL (pd-p-mssql) | resource: mssql.tf:84; allowlist concat: mssql.tf:110 | var.authorized_networks + local.temporary_authorized_networks (lines 11-14) + local.pd_vm_authorized_networks (lines 23-29) | HIGH RISK. pd_vm_authorized_networks auto-populates from every PD VM external IP via regex match on app vm_resource_name. As partner VMs go away, regex matches drop → auto-removal from allowlist. Need to add the swarm NAT IPs as static entries (next to local.temporary_authorized_networks) before the first PD partner cuts over. |
| Per-partner Postgres (pdp{N}-{n,p}-psql-* in prj-bu1-{n,p}-pd-infra-*) | infra/pd-infra/business_unit_1/{env}/pdp*/main.tf:33 | var.authorized_networks only (THB VPN) | None on the SQL side. Connect via --private-ip through VPC peering — set CSQL_PROXY_PRIVATE_IP=true in each app’s .cloudsql-proxy.env (pd #896). Bypasses NAT entirely, no allowlist needed for swarm workers. |
| Legacy PD Postgres (the-helper-bees:ansible-p<N>-psql-{stage,prod} for p1, p5, p7, p10, p13, p14) | Not Terraform-managed. See comment in infra/pd-infra/business_unit_1/non-production/pdp14/main.tf: “Because the database server was not created by Terraform, I had to manually create [databases/users/passwords] by hand”. Allowlist managed imperatively via src/scripts/add_swarm_nat_to_legacy_sql.sh. | Per-partner mix of legacy per-VM IPs and partner-specific entries (collected over years pre-swarm). | Add swarm NAT IPs via the script (no Terraform path available — see §3.1.1 for the rationale). Required because these instances have no private IP, so public IP through NAT is the only reachable path from swarm workers. Cloud SQL silently drops TCP SYNs from non-allowlisted source IPs → i/o timeout at the proxy. |
Action item: Add a new entry to the allowlist concat
in pd-p-mssql.ip_configuration.authorized_networks once
swarm NAT IPs are known.
# Live audit non-prod
gcloud sql instances list \
--project=prj-bu1-n-pd-infra-fee5 \
--filter='databaseVersion~SQLSERVER' \
--format='yaml(name,settings.ipConfiguration.authorizedNetworks)'
# Live audit prod
gcloud sql instances list \
--project=prj-bu1-p-pd-infra-b355 \
--filter='databaseVersion~SQLSERVER' \
--format='yaml(name,settings.ipConfiguration.authorizedNetworks)'
# Count expected non-prod NAT IPs in the allowlist (all 8 per the 2026-05-12
# policy: 4 us-central1 active + 4 us-west1 reserve). Expect "8" if both
# regions are fully allowlisted; a value in [4..7] means partial coverage.
# gcloud's value() format can join list items on a single line, so split the
# entries onto separate lines before the whole-line match; the optional /32
# suffix covers entries stored in CIDR form.
gcloud sql instances list --project=prj-bu1-n-pd-infra-fee5 \
--filter='name~^pd-n-mssql-' \
--format='value(settings.ipConfiguration.authorizedNetworks[].value)' \
| tr -s ';, \t' '\n' \
| grep -cxE '(34\.72\.82\.40|34\.122\.251\.101|34\.170\.35\.69|35\.194\.23\.180|34\.105\.111\.96|34\.127\.70\.253|136\.109\.214\.215|34\.169\.152\.90)(/32)?'
3.1.1 PD Postgres connectivity strategy: hybrid --private-ip + NAT allowlist (HB-8646)
The 18 swarm-deployed PD apps split into two cohorts based on where their Cloud SQL lives. We use a different connectivity path for each:
| Cohort | Apps | Cloud SQL location | Path | Allowlist requirement |
|---|---|---|---|---|
| New-hierarchy (10 apps) | p8, p17, p18, p19, p20, p21, p23, p24, p25, p118 | Per-app instance in prj-bu1-{n,p}-pd-infra-* with private IP enabled | cloudsql-proxy --private-ip over VPC peering (no NAT involvement) | None for swarm workers — VPC peering is a private path |
| Legacy (8 apps) | p1, p5, p5i, p7, p10, p13, p14, p14_bea | Shared instance in the-helper-bees:ansible-p<N>-psql-{stage,prod} — no private IP | cloudsql-proxy over public IP through Cloud NAT | All 8 swarm NAT egress IPs (4 us-central1 active + 4 us-west1 reserve, see §2) must be in each legacy instance’s authorized_networks |
Two pairs of legacy apps share a single Postgres instance
(hbpdp5 + hbpdp5i →
ansible-p5-psql-*; hbpdp14 +
hbpdp14_bea → ansible-p14-psql-*), so the 8
legacy apps map to 6 unique legacy instances per env.
Why hybrid and not one approach for both:
- “Just --private-ip everywhere” isn’t an option because the legacy ansible-p<N>-psql-* instances don’t have a private IP. Adding one requires pg_dump + restore into a new instance + per-partner coordination — not on the near-term roadmap (deferred until the broader legacy-instance retirement).
- “Just allowlist NAT IPs everywhere” would work mechanically (mirrors the existing pd-p-mssql pattern in §3.1), but expanding NAT-port-budget pressure to 10 more instances that already have a cleaner option (private IP via VPC peering) is the wrong direction. New-hierarchy apps get the architecturally cleaner path; legacy gets the only path it has.
How each path is wired:
- --private-ip: cloudsql-proxy v2 auto-binds CLI flags to CSQL_PROXY_* env vars via cobra/viper. Setting CSQL_PROXY_PRIVATE_IP=true in an app’s .cloudsql-proxy.env is equivalent to passing --private-ip on the command line. PD’s production.deploy.yml already loads each app’s .cloudsql-proxy.env via env_file:, so flipping the env var in the per-app file is the entire change. See pd #896.
- NAT allowlist: all 8 swarm NAT egress IPs per env (4 us-central1 active + 4 us-west1 reserve, see §2) are added to each legacy instance’s authorized_networks via src/scripts/add_swarm_nat_to_legacy_sql.sh. Including the reserve us-west1 IPs up front means we won’t need a second allowlist patch round if/when we activate that region. The script is idempotent and supports --app <id> for incremental rollout. See infrahive #811.
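Because the --private-ip path comes down to one key in a per-app env file, the flip can be scripted idempotently. The sketch below is illustrative only: set_env_var is a hypothetical helper (it is not part of pd or infrahive), and the demo writes to a temporary file rather than a real .cloudsql-proxy.env.

```shell
#!/usr/bin/env bash
# set_env_var FILE KEY VALUE (hypothetical helper):
# ensure FILE contains exactly one KEY=VALUE line, leaving other keys intact.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  touch "$file"
  # Drop any existing assignment of KEY; grep exits 1 if nothing remains.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the original file's permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demo against a throwaway file standing in for an app's env file.
demo_env="$(mktemp)"
printf 'OTHER_SETTING=1\n' > "$demo_env"
set_env_var "$demo_env" CSQL_PROXY_PRIVATE_IP true
set_env_var "$demo_env" CSQL_PROXY_PRIVATE_IP true   # second run changes nothing
cat "$demo_env"
```

Running it twice leaves a single CSQL_PROXY_PRIVATE_IP=true line, which mirrors the idempotency the allowlist script is described as having.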
Verification: probe_pd_postgres_reachability
covers the 10 new-hierarchy apps; probe_pd_legacy_postgres_allowlist
(added in infrahive
#808) covers the 6 legacy instances. Both run from a dev laptop / CI
and pass post-rollout.
Failure modes to recognize: if a legacy app sees "cloudsql-proxy: failed to connect to instance: Dial error: ... i/o timeout", it almost always means a swarm NAT IP isn’t in the target instance’s authorized_networks (Cloud SQL drops TCP silently for non-allowlisted source IPs). If an app sees "cloudsql-proxy: failed to connect to instance: Config error: instance does not have IP of type "PRIVATE"", it means --private-ip was forced onto a legacy app, whose instance has no private IP, typically by an old compose overlay. See the HB-8646 retro for the exact failure mode and the production.swarm-overrides.yml removal in pd #896.
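The two proxy errors described in this section map to distinct causes, which makes them easy to triage mechanically. A minimal sketch, assuming the log line is available as a string; classify_proxy_error and its output labels are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# classify_proxy_error (hypothetical helper): map a cloudsql-proxy error line
# to the likely cause described in this section.
classify_proxy_error() {
  case "$1" in
    *'i/o timeout'*)
      # Public-IP path: source NAT IP missing from authorized_networks.
      echo 'nat-ip-not-allowlisted' ;;
    *'does not have IP of type "PRIVATE"'*)
      # --private-ip was applied to an instance that has no private IP.
      echo 'private-ip-forced-on-legacy-instance' ;;
    *)
      echo 'unknown' ;;
  esac
}

classify_proxy_error 'cloudsql-proxy: failed to connect to instance: Dial error: ... i/o timeout'
```

An i/o timeout can have other network causes, so the first label is a starting hypothesis rather than a diagnosis.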
Future legacy → new-hierarchy migration: when one of
the legacy ansible-p<N>-psql-* instances eventually
gets migrated to prj-bu1-{n,p}-pd-infra-*, the affected app
moves from the “legacy” row to the “new-hierarchy” row. Concretely: drop
its entry from LEGACY_PD_INSTANCES in add_swarm_nat_to_legacy_sql.sh
and swarm_connectivity_test.sh, and add
CSQL_PROXY_PRIVATE_IP=true to its
.cloudsql-proxy.env. The NAT allowlist on the
decommissioned legacy instance can be left as-is or cleaned up —
irrelevant once the instance is gone.
3.2 Azure SQL Firewall (HA Apps Only)
| Server | Resource group | Today’s source | Migration action |
|---|---|---|---|
| ha-prod1-azsqldb | HA-PROD1-SQL-RG | Per-VM rules added imperatively via src/scripts/add_az_sql_fw_rule.sh | Add swarm NAT IPs as new rules; leave per-VM rules in place during cutover, remove post-migration |
| ha-dev-azsqldb | HA-DEV-SQL-RG | Same | Same |
These rules are NOT in Terraform. They live only in Azure. Inventory:
# Production
az sql server firewall-rule list --resource-group HA-PROD1-SQL-RG \
--server ha-prod1-azsqldb --output table
# Non-prod
az sql server firewall-rule list --resource-group HA-DEV-SQL-RG \
--server ha-dev-azsqldb --output table
Add rule (one invocation per NAT IP from §2):
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_0 -ip <nat-ip-0>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_1 -ip <nat-ip-1>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_2 -ip <nat-ip-2>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_3 -ip <nat-ip-3>
The script is idempotent (checks for existing rule with same IP first). The _0 through _3 suffix aligns with the gcp-networks address resource indices (ca-{env_code}-shared-base-spoke-us-central1-{0..3}).
⚠️ TCP probes don’t verify the firewall rule. Azure SQL accepts the TCP connect at the load balancer regardless of firewall — the rule is enforced at the TDS auth layer, not L4. So a passing </dev/tcp/host/1433 check (or anything that does only a TCP probe, including docs/scripts/swarm_connectivity_test.sh) proves DNS + VNet egress + L4 path work, but does not prove the worker’s NAT IP is in the firewall allowlist. Authoritative verification requires a TDS-aware client (sqlcmd, the mssql Python driver, etc.) attempting the pre-login phase — a successful pre-login means the firewall accepted the source IP. Treat the connectivity-test TCP probe as a network-path canary, not a firewall verifier. (A v2 of swarm_connectivity_test.sh could shell out to sqlcmd -Q "SELECT 1", but that introduces a non-stock-Ubuntu dependency — left as a follow-up.)
3.3 Cloudflare Access Groups (IP Bypass)
Three locations:
A) infra/pd-infra/business_unit_1/production/cloudflare.tf
— most action here:
| Group | Source today | Migration action |
|---|---|---|
| green_p_ip_bypass (line 59) | data.google_compute_addresses.hb_p_vm_ip_address (HB project external IPs) | Auto-shrinks as HB VM goes away. Add swarm NAT IPs as static entries. |
| yellow_p_ip_bypass (line 71) | data.google_compute_addresses.bees_p_vm_ip_address | Same — auto-shrinks; add static NAT IPs. |
| pdp24_p_bastion_ip_bypass / pdp18_p_bastion_ip_bypass | bastion-specific compute addresses (data.google_compute_addresses.p24_bastion_p_vm_ip_address / p18_bastion_p_vm_ip_address) | No change. Bastions stay on dedicated VMs and are not part of the swarm migration; their external IPs are stable. |
| pdp18_p_ip_bypass, pdp24_p_ip_bypass, salesforce_p_ip_bypass, thb_p_authorized_networks | Hardcoded partner / VPN IP ranges | No change — these are partner egress, not ours. |
B) HB-side cloudflare-access module calls — these
live in hb-infra (not pd-infra) and use a dynamic
vpn_ip_addresses input rather than named Cloudflare Access
groups:
| Module call | Path | Source | Migration action |
|---|---|---|---|
| Standard cloudflare-access (covers all HB apps with vpn_bypass policy: hbcrm, hb_buzz, eligibility-via-thehelperbees.com-domain, etc.) | hb-infra/.../production/cloudflare.tf:8-15 | vpn_ip_addresses = concat(local.pd_addresses, local.vpn_addresses) — local.pd_addresses is auto-discovered from data.google_compute_addresses.pd_ip_addresses (filter name:pdp* in PD project) | Add swarm NAT IPs as static entries to local.pd_addresses in main.tf:8-11. Single change flows into both module calls below. |
| eligibility-cloudflare-access (separate call because eligibility uses session_duration = "168h" vs the default; covers eligibility.thb.nu and eligibility.thb.sh) | hb-infra/.../production/eligibility_service.tf:6-13 | Same expression: concat(local.pd_addresses, local.vpn_addresses) | Inherits fix from updating local.pd_addresses. |
Why this matters for pdp24 → eligibility: When pdp24 calls https://eligibility.thb.nu, Cloudflare Access checks if the source IP is in the vpn_bypass group. Today, pdp24’s 35.225.100.177 is in local.pd_addresses (via the dynamic data lookup) so it passes. After migration, pdp24’s IP becomes a swarm NAT IP — which is not in the PD project’s external addresses, so the dynamic lookup won’t include it. The static-entry fix above is what keeps this internal flow working post-migration. No partner action required — this is our own infrastructure.
C) infra/ha-infra/business_unit_1/production/cloudflare/cloudflare.tf
— uses only var.authorized_networks (THB VPN). No
change needed.
Live audit:
# Already in TF — read current state
./zig/zig build plan -- pd-infra production cloudflare
./zig/zig build plan -- hb-infra production cloudflare
./zig/zig build plan -- hb-infra production eligibility_service
3.4 Caddy VPN Bypass List
Important: The Terraform-supplied
caddy_vpn_bypass_list is one of three inputs that
compose the actual runtime Caddy allowlist. The full picture, assembled
by Ansible at deploy time:
admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list
(Composed across gc_deploy_with_gcp_secrets.yml:358-360
— the playbook actually used in production — and roles/caddy_create/tasks/main.yml:46-48
in hb-ansible.) This list drives the
(is-public-ip) / (is-vpn-ip) snippets in roles/caddy_create/templates/Caddyfile.j2,
which gate the admin/flower/camunda/websocket routes that bypass
Cloudflare Access.
| Component | Source | Migration impact |
|---|---|---|
| vm_ip | hostvars[vm_name].ansible_ssh_host — the deploy target’s own SSH host | Becomes the swarm worker’s NAT egress IP. Since all workers share the same NAT IPs, every worker will (incidentally) self-allowlist every other worker’s traffic via Cloudflare. New property; not necessarily wrong, worth surfacing. |
| vpn_ip_addresses | Maps to infrahive var.authorized_networks (THB algo VPN exits) | No change — same VPN IPs pre/post migration. |
| caddy_vpn_bypass_list | Terraform-fed via the ansible_config module (see table below) | Active migration concern. See callers below. |
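The three-input composition described in this section is a plain list concatenation. The sketch below mimics it in shell to show how the inputs combine; compose_admin_allowed is a hypothetical name, the addresses are documentation-range placeholders, and the de-duplication step is an addition for readability that the Ansible role is not confirmed to perform.

```shell
#!/usr/bin/env bash
# compose_admin_allowed <vm_ip> <vpn_ips> <caddy_vpn_bypass_list>
# Hypothetical stand-in for: [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list.
# The last two arguments are whitespace-separated lists.
compose_admin_allowed() {
  # $2 and $3 are intentionally unquoted so each list splits into items.
  # awk keeps the first occurrence of each entry (de-dup is our addition).
  printf '%s\n' "$1" $2 $3 | awk 'NF && !seen[$0]++'
}

compose_admin_allowed 192.0.2.10 \
  "198.51.100.1 198.51.100.2" \
  "203.0.113.7 198.51.100.1"
```

Once the swarm NAT IPs are added to caddy_vpn_bypass_list, they would appear in the third argument here, alongside whatever the first two inputs contribute.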
How caddy_vpn_bypass_list flows from
Terraform to runtime
The caddy_vpn_bypass_list value is not read at
deploy time from a static file in hb-ansible. It traverses
three systems:
infrahive Terraform (hb-infra/.../main.tf)
local.caddy_pd_addresses
|
v
module "hb-p-ansible-config5" { caddy_vpn_bypass_list = local.caddy_pd_addresses }
| (terraform apply)
v
ansible_config TF module (terraform-modules.git, external)
|
v
GCS: gs://ansible-config-65ab/<app>/config.yml ← regenerated on every apply
|
v
hb-ansible playbook (gc_deploy_with_gcp_secrets.yml) loads bucket_path at deploy time
|
v
filtered_app.caddy_vpn_bypass_list (per-app dict)
|
v
caddy_create role (roles/caddy_create/tasks/main.yml:46-48)
admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + filtered_app.caddy_vpn_bypass_list
|
v
Caddyfile.j2 (is-public-ip / is-vpn-ip matchers)
Verified end-to-end via
gsutil cat gs://ansible-config-65ab/hbcrm/config.yml — that
file already contains a populated caddy_vpn_bypass_list
array with all PD partner VM IPs (78+ entries today).
Implication for HB-8413 (Caddy VPN bypass) and HB-8414
(Cloudflare Access): changes happen in infrahive
only. No edits to hb-ansible. The Caddy template
is data-driven — it picks up whatever’s in the GCS config on next
deploy. And because the same local.caddy_pd_addresses /
local.pd_addresses data lookup feeds all three downstream
surfaces (Caddy bypass, standard cloudflare-access,
eligibility-cloudflare-access), a single Terraform PR closes
both HB-8413 and HB-8414 — coordinate so the work isn’t
duplicated.
Terraform sources of caddy_vpn_bypass_list:
| Caller | Source today | Migration action |
|---|---|---|
| module.hb-p-ansible-config5 (infra/hb-infra/business_unit_1/production/main.tf:140) | local.caddy_pd_addresses: auto-discovered PD VM external IPs (name:pdp* filter, line 12) | Auto-shrinks as PD VMs migrate. Replace with the swarm NAT IPs as static entries. HB→PD calls will continue to round-trip via Cloudflare even on the same cluster (no overlay during migration; see HB-8620 follow-up). |
| module.hb-n-ansible-config (non-prod equivalent) | Same pattern | Same |
| module.benefits-hub-p-ansible-config1 (infra/hb-infra/business_unit_1/production/benefits_platform.tf:73) | [] (empty) | No change |
Decision (resolves Q1): The migration will not introduce an HB↔︎PD swarm overlay network. HB→PD calls continue to traverse the public internet via Cloudflare, requiring the swarm NAT IPs in this bypass list. A future overlay is tracked under HB-8620 (parented to the HB-6748 DevOps KTLO epic), which would let us remove this bypass list entirely.
Note on partner_allowed_ip_addresses (inbound,
out of scope): While inspecting hb-ansible, we
also reviewed the per-app partner_allowed_ip_addresses
variable (roles/caddy_create/tasks/main.yml:32, only set on
stage_hbpdp11 with [76.187.226.216]). This is
the inverse direction — a partner’s IP allowlisted on
our Caddy for inbound /partner/* traffic — and is
therefore out of scope for HB-8210.
Additionally, the variable is set as a fact but never referenced
in the current Caddyfile.j2 template, suggesting
it is either vestigial or enforced at the app layer inside
partner_django. Either way, partner IPs don’t change during
our migration, so no action.
3.5 Partner API Webhooks / External Integrations
This is the highest-risk and least-discoverable surface. Partner systems often allowlist our outbound IP for inbound webhooks, SFTP polling, or API calls.
Known surfaces requiring live inventory (cannot enumerate from infrahive alone):
- Partner SFTP destinations: infra/pd-infra/cloudfunctions/sftp_pubsub_messages/SFTP/client.go, infra/pd-infra/cloudfunctions/sftp_bucket_file/SFTP/client.go. SFTP from cloud functions, not VMs, so it likely uses its own NAT. Verify these aren’t routed through partner VMs.
- PSR (Provider Services Request) handler (out of scope, documented for clarity): infra/pd-infra/modules/partner_psr_request_handler/. On first read this looked like a possible outbound surface, but inspection of the module and its callers (pdp8, pdp24, …) confirms the flow is the opposite direction: a partner’s Salesforce instance pushes files into our GCS bucket (psr-{partner_code}-{env}-*), authenticated by a service-account key that we generate and hand to the partner’s Salesforce dev. A Cloud Function then converts uploads to PDF/XML for hbcrm to read. No partner-side IP allowlist is involved, and our outbound IP doesn’t appear in this flow at all, so it’s unaffected by the swarm migration. Listed here only so future readers don’t re-add it as an open question.
- Per-partner outbound API integrations, likely configured per partner in one of:
  - pd repo .envs/ files (partner API base URLs and auth)
  - hb-buzz / hbcrm / django_homealign per-tenant settings
  - Salesforce / Camunda outbound webhooks
Recommended path: post in #devops
asking team members which partners allowlist our outbound IP. Active
outreach: #devops
thread on 2026-04-30. Track replies there and add new partners to
the confirmed-partners table below as they surface.
Owner for partner outreach: PD partner success / integrations team. Start this in parallel with HB-8218 implementation, since some partners have multi-week change-control SLAs.
Confirmed partners with our outbound IP allowlisted
All IPs below verified live via curl ifconfig.me from
the corresponding VM. Each address resource carries
lifecycle { ignore_changes = all } so the IPs have been
stable.
| Partner | Env | Outbound IP | GCE VM | Project |
|---|---|---|---|---|
| pdp24 | prod | 35.225.100.177 | pdp24-p-vm-6670 | prj-bu1-p-pd-infra-b355 |
| pdp24 | non-prod | 34.30.125.50 | pdp24-n-vm-2327 | prj-bu1-n-pd-infra-fee5 |
| pdp10 | prod | 34.69.126.254 | pdp10-p-vm-a437 | prj-bu1-p-pd-infra-b355 |
| pdp10 | non-prod | 34.134.208.190 | pdp10-n-vm-01f0 | prj-bu1-n-pd-infra-fee5 |
| pdp7 | prod | 35.238.12.142 | pdp7-p-vm-ebcb | prj-bu1-p-pd-infra-b355 |
| pdp7 | non-prod | 35.232.30.74 | pdp7-n-vm-5c52 | prj-bu1-n-pd-infra-fee5 |
Source TF for all rows: google_compute_address.external
in
infra/pd-infra/business_unit_1/{production,non-production}/pdp{N}/main.tf:12.
Per-partner allowlist verification (HB-8415, 2026-05-07)
Audited the actual outbound surface for each of the three “stable IP”
partners above by querying the runtime dw_eventsubscription
table (the only code path in the pd repo that POSTs to
partner-controlled URLs — see docs/scripts/query_event_subscriptions.sh)
and probing the partner endpoints with a 401-vs-403 differential test
(curl from the allowlisted VM vs from a non-allowlisted
host).
| Partner | Env | Active subscriptions | Partner system | IP allowlist confirmed? | Migration action |
|---|---|---|---|---|---|
| pdp7 | both | 0 | none in repo | ✅ no surface (PD team confirmed 2026-05-07: pdp10 is the only EventSubscription consumer) | none |
| pdp10 (Genworth) | prod | 6 | b7g1csty57.execute-api.us-east-1.amazonaws.com (AWS API Gateway, Microsoft Entra OAuth tenant 820ad304-…) | ✅ yes | add 8 prod swarm NAT IPs to Genworth’s prod APIGW resource policy (4 us-central1 active + 4 us-west1 reserve per §2 policy) |
| pdp10 (Genworth) | non-prod | 8 partner + 8 internal-test | p0cr5dk17j.execute-api.us-east-1.amazonaws.com (separate AWS APIGW, same tenant) | ✅ yes | add 8 non-prod swarm NAT IPs to Genworth’s staging APIGW resource policy (4 us-central1 active + 4 us-west1 reserve per §2 policy) |
| pdp24 | both | 0 | none in repo | ✅ no surface (PD team confirmed 2026-05-07: pdp10 is the only EventSubscription consumer) | none |
401-vs-403 evidence (Genworth APIGWs both expose a real resource policy / WAF rule, not OAuth-only):
| Source | Genworth prod APIGW (b7g1csty57…) | Genworth staging APIGW (p0cr5dk17j…) |
|---|---|---|
| pdp10 prod VM (34.69.126.254) | 401 UnauthorizedException (allowlisted) | — |
| pdp10 non-prod VM (34.134.208.190) | — | 401 UnauthorizedException (allowlisted) |
| pdp24 prod VM (35.225.100.177) | 403 ForbiddenException (not allowlisted) | — |
| Operator workstation | 403 ForbiddenException | 403 ForbiddenException |
A 403 from API Gateway with ForbiddenException arrives
before the Lambda authorizer runs — that’s the
canonical signature of an AWS resource-policy IP block. A 401 means the
request reached the authorizer (so the IP passed) and only the bearer
token failed. The differential confirms Genworth IP-allowlists pdp10’s
per-VM IPs in both envs.
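The differential reasoning above can be written down as a small decision rule. The sketch below is illustrative only: the function names and outcome labels are hypothetical, and it assumes the endpoint behaves as described here (403 means blocked at the IP layer, 401 means the IP passed and only the token failed).

```python
def classify_probe(status_code):
    """Interpret one unauthenticated probe against the partner endpoint."""
    if status_code == 401:
        return "ip-allowed"    # reached the authorizer; only the bearer token failed
    if status_code == 403:
        return "ip-blocked"    # rejected before the authorizer ran
    return "inconclusive"      # anything else needs a closer look

def allowlist_confirmed(candidate_status, control_status):
    """An IP allowlist is confirmed only when the candidate source passes the
    IP layer AND a known non-allowlisted control source is blocked."""
    return (
        classify_probe(candidate_status) == "ip-allowed"
        and classify_probe(control_status) == "ip-blocked"
    )

# Statuses from the evidence table: pdp10 prod VM vs. operator workstation.
print(allowlist_confirmed(candidate_status=401, control_status=403))
# pdp24 prod VM probing the same APIGW is blocked, like the control.
print(allowlist_confirmed(candidate_status=403, control_status=403))
```

The control probe matters: a 401 from every source would indicate an OAuth-only endpoint with no IP filtering, which is why both results are required.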
Pdp7 and pdp24 closed: PD team confirmed 2026-05-07
that pdp10 is the only EventSubscription consumer —
pdp7 and pdp24 do not use the partner-webhook feature at all. Combined
with their empty dw_eventsubscription tables and no other
partner-bound outbound code path in the pd repo, this is sufficient to
mark both as “no migration action”: no partner allowlist coordination
needed, and their per-VM google_compute_address.external
resources can be released right after swarm-prod cutover with no
dual-allowlist window. (Out-of-band integrations like manual CSV uploads
or ops scripts weren’t explicitly probed, but the team’s confirmation
that “P10 is the only one using that feature” is the authoritative
signal — they own the integration roadmap.)
Other partners — unknown. Coverage is tribal; PD
partner success / integrations team should confirm no others exist.
Outreach posted in #devops
on 2026-04-30 — track replies there.
Internal flows that depend on these IPs: pdp24 →
eligibility.thb.nu / eligibility.thb.sh
(confirmed). Our own Cloudflare Access vpn_bypass group
includes all PD partner VM IPs (pdp7, pdp10, pdp24, and
any other active partner) via local.pd_addresses (dynamic
data lookup), so any of them calling our HB-side services rides this
allowlist. After migration the dynamic lookup shrinks to nothing and is
replaced by static swarm NAT IP entries — covered by §3.3 row B.
No partner action needed for these internal flows.
Decision (post-migration): notify each partner
contact of the swarm NAT IPs across both environments
(16 IPs total: 8 non-prod + 8 prod per §2, including us-west1 reserves)
and have them update their allowlist in a single change. Bundling envs
minimizes partner change-control churn — most partners have one
allowlist that covers any of our outbound, regardless of which env it
originated from. After prod cutover (the last env to
migrate), the per-partner production GCE addresses can be released
(delete the google_compute_address.external resources in
pdpN/main.tf).
Why not preserve the existing per-partner IPs?
To pick the right path it helps to know how Cloud NAT actually
distributes IPs to outbound flows. With the MANUAL_ONLY
mode + min_ports_per_vm = 2048 that gcp-networks
configures, Cloud NAT does per-VM IP allocation, not
per-flow:
- Each worker VM is assigned a fixed allocation of ~2048 ports from one of the pool’s NAT IPs.
- The worker uses that single IP for all its outbound flows (until port exhaustion, which is rare with 2048 ports).
- The assignment is Cloud-NAT-internal — we don’t control which IP a given worker gets.
- With N IPs in the pool and M workers, roughly M/N workers land on each IP.
Service replicas in Swarm are scheduled across the worker pool, so per-partner outbound traffic (e.g., pdp24 → partner API, pdp10 → partner API) comes from whichever NAT IP was assigned to the worker the scheduler placed it on — not from any IP we can pre-select.
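A toy model makes the per-VM allocation consequence concrete. This is a sketch under stated assumptions: the round-robin assignment is a stand-in (the real assignment is internal to Cloud NAT and not controllable), and the worker names and documentation-range IPs are hypothetical.

```python
from itertools import cycle

def assign_nat_ips(workers, nat_pool):
    """Toy stand-in for Cloud NAT: one NAT IP per worker, round-robin.
    The real assignment is Cloud-NAT-internal; only the per-VM shape matters here."""
    return dict(zip(workers, cycle(nat_pool)))

def source_ips_for(replica_workers, worker_to_ip):
    """Egress IPs a service uses, given the workers its replicas landed on."""
    return {worker_to_ip[worker] for worker in replica_workers}

SWARM_POOL = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]  # hypothetical pool
LEGACY_IP = "192.0.2.99"                                            # hypothetical preserved IP

workers = [f"worker-{i}" for i in range(10)]
mapping = assign_nat_ips(workers, SWARM_POOL + [LEGACY_IP])

# The scheduler, not us, picks where partner replicas run.
partner_ips = source_ips_for(["worker-0", "worker-3"], mapping)
on_legacy = [w for w, ip in mapping.items() if ip == LEGACY_IP]

print(sorted(partner_ips))  # the partner must allowlist whatever shows up here
print(on_legacy)            # workers that happen to use the preserved IP
```

In this model the preserved IP serves two unrelated workers while the partner replicas egress from other pool IPs, which is the reason adding legacy IPs to the pool does not preserve them for partner traffic.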
Options compared (per partner)
| # | Option | Preserves the existing per-partner IP? | Partner allowlist work | Operational cost | Verdict |
|---|---|---|---|---|---|
| 1 | Notify the partner, they allowlist all 16 swarm NAT IPs (8 non-prod + 8 prod, see §2 policy on us-west1) | No, but doesn’t matter — partner accepts new IPs | One email/ticket per partner; ~24-hour turnaround for most | Zero — fewer resources to maintain after migration | ✅ Default. Reconsider only if change-control SLA is multi-week. |
| 2 | Don’t migrate the affected partner stack; leave on dedicated VM | Yes (fully) | None | Continued cost of the dedicated VM; partner stays out of Phase 0 scope | ❌ Ruled out (we want to migrate all partners). |
| 3 | Outbound HTTPS proxy on a tiny dedicated VM owning the existing IP; partner stack uses HTTPS_PROXY env | Yes (exactly, via HTTPS_PROXY) | None | One always-on VM per partner needing preservation; monitoring/alerts/upgrades for each; single point of failure for that partner’s traffic | ⚠️ Acceptable if option 1 is blocked by partner SLA. Cost scales linearly with number of preserved IPs. |
| 4 | Add the existing IPs to the swarm Cloud NAT pool | No. Per-VM allocation means ~1/(N+k) of workers get any given preserved IP and use it for all their outbound regardless of destination. Other workers (and any partner replicas scheduled on them) use the other IPs. Partner still has to allowlist the full set. | Same as option 1 | Slightly higher than option 1 (extra IPs reserved forever) | ❌ Doesn’t preserve IPs for partner-specific traffic. Cloud NAT lacks per-flow IP selection. |
| 5 | Pin partner stack to a specific worker; attach the existing IP to that node | Yes (the pinned worker uses that IP) | None | Defeats Swarm scheduling, breaks autoscaler, single point of failure for the partner stack | ❌ Ruled out (already rejected). |
Why option 4 keeps coming up
Option 4 feels like it should work: “we have the IPs, just
keep using them.” It would work if Cloud NAT supported per-flow IP
selection rules (“traffic to partner.example.com → source
IP X”), but it doesn’t — IP assignment is per-VM, not
per-flow or per-destination. So a preserved IP ends up tied to whichever
workers Cloud NAT happens to assign it to, used by every outbound flow
from those workers regardless of destination, while partner replicas on
other workers don’t use it at all. It’s effectively option 1 with extra
unused IPs in the pool.
The only real ways to preserve a partner-specific IP are option 3 (proxy steers at the application layer) or option 5 (node pinning forces the worker assignment), and we’ve ruled out 5. Reconsider option 3 only if option 1 is blocked for a specific partner.
Pre-cutover checklist (per partner, run for pdp7, pdp10, and pdp24):
Precondition: the §2 verification procedure (“Verifying egress before partner outreach”) must have been run for non-prod, with the verified IPs locked in. For prod, communicate the IPs we have today and follow up before prod cutover with any updates from the same verification run.
1. Send the partner contact the full swarm NAT IP set across both environments (from §2). Per the 2026-05-12 policy revision, include the us-west1 reserve IPs alongside the us-central1 active set, so the partner does one round of change-control, not two:
   - Non-production (8 IPs: 4 us-central1 active + 4 us-west1 reserve): 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180, 34.105.111.96, 34.127.70.253, 136.109.214.215, 34.169.152.90
   - Production (8 IPs: 4 us-central1 active + 4 us-west1 reserve): 34.132.72.45, 35.226.246.142, 34.136.40.107, 34.57.41.236, 35.227.138.226, 35.185.201.232, 34.168.190.12, 8.229.103.135
2. Ask the partner to enumerate every IP of ours they currently allowlist, both prod and non-prod, if applicable. Capture this in the audit’s confirmed-partners table so we know which entries to clean up post-migration.
3. Get written confirmation that all 16 swarm NAT IPs (8 per env) are in their allowlist before the corresponding env’s migration day. The 4 us-west1 reserves per env are forward-looking and won’t carry traffic today, but having them in place pre-emptively avoids a second round of change-control when we activate that region.
4. Keep existing per-partner IPs allowlisted on the partner side throughout the dual-allowlist window per §4. The window must extend through the last migration to complete (= prod, gated on HB-8232). Per-partner IPs to keep (only those the partner confirms in step 2 are actually in their allowlist):
   - pdp7 prod: 35.238.12.142 / non-prod: 35.232.30.74
   - pdp10 prod: 34.69.126.254 / non-prod: 34.134.208.190
   - pdp24 prod: 35.225.100.177 / non-prod: 34.30.125.50
5. One week after prod cutover: ask the partner to remove the old IPs from their allowlist; release the corresponding google_compute_address.external resources in pdp{N}/main.tf (and non-prod equivalents if applicable) on our side.
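When a partner replies with the list of our IPs they allowlist, the confirmation in the checklist can be checked mechanically rather than by eye. The sketch below is one way to do that: the IP lists are the ones given in this checklist, while the function name and the sample partner reply are hypothetical, and it assumes partners report plain IPs or CIDRs.

```python
from ipaddress import ip_address, ip_network

NON_PROD_NAT = [
    "34.72.82.40", "34.122.251.101", "34.170.35.69", "35.194.23.180",
    "34.105.111.96", "34.127.70.253", "136.109.214.215", "34.169.152.90",
]
PROD_NAT = [
    "34.132.72.45", "35.226.246.142", "34.136.40.107", "34.57.41.236",
    "35.227.138.226", "35.185.201.232", "34.168.190.12", "8.229.103.135",
]

def missing_from_allowlist(required_ips, partner_entries):
    """Return the required IPs not covered by any entry the partner reported."""
    networks = [ip_network(entry, strict=False) for entry in partner_entries]
    return [
        ip for ip in required_ips
        if not any(ip_address(ip) in network for network in networks)
    ]

# Hypothetical partner reply: everything except one prod IP, plus the legacy
# pdp10 prod per-VM IP that stays in place during the dual-allowlist window.
partner_reply = NON_PROD_NAT + PROD_NAT[:-1] + ["34.69.126.254/32"]

print(missing_from_allowlist(NON_PROD_NAT + PROD_NAT, partner_reply))
```

An empty result is the condition for treating the written confirmation as complete for that partner.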
Tableau Analytics — internal system that allowlists our partner IPs
PDP apps embed Tableau dashboards by POSTing to
https://tab.thehelperbees.com/trusted to mint a per-user
token (see get_tableau_token in
pd/thbpd/dw/views/views.py:1954-1983). The endpoint mints a
token only when the source IP is in Tableau Server’s
wgserver.trusted_hosts; otherwise it returns the literal
string -1. After migration, swarm worker traffic egresses
through the NAT IPs in §2 — those IPs must be in
trusted_hosts before any PDP stack that embeds dashboards
cuts over.
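For probe scripts, the response from the trusted endpoint can be interpreted along these lines. This is a hedged sketch, not the logic in get_tableau_token: the function name and outcome labels are hypothetical. Note that a body of -1 alone cannot distinguish an untrusted source IP from rejected credentials, so a real-credential request is still needed to prove the trusted-auth path.

```python
def interpret_trusted_response(status_code, body):
    """Classify a response from the Tableau trusted-ticket endpoint."""
    text = body.strip()
    if status_code != 200 or not text:
        return "request-failed"   # never reached Tableau, or an empty reply
    if text == "-1":
        # Tableau processed the request but issued no token: either the
        # source IP is not in wgserver.trusted_hosts or the credentials
        # were rejected. The body alone does not say which.
        return "no-token"
    return "token-issued"         # any other body is treated as a ticket

print(interpret_trusted_response(200, "-1"))
print(interpret_trusted_response(200, "example-ticket-value"))  # placeholder ticket
print(interpret_trusted_response(403, "Forbidden"))
```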
| Property | Value |
|---|---|
| Hostname (both envs) | tab.thehelperbees.com |
| GCE VM | ansible-tab-stage in the-helper-bees (us-central1-a) |
| Tableau Server version | 2021.2.21 |
| Allowlist mechanism | tsm configuration set -k wgserver.trusted_hosts (comma-separated CIDR list), committed via tsm pending-changes apply |
| CF Access on tab.thehelperbees.com | Not enforced: CF policy verified absent for this hostname; trust is enforced solely by wgserver.trusted_hosts + Tableau session auth |
| Caddy front | Caddy 2.4.3 on the same VM (config baked into the GCR image gcr.io/the-helper-bees/tab/caddy:2026-03-05, source not in any THB repo). The (is-public-ip) / (is-vpn-ip) snippets are cosmetic: both @public-match and @admin-match proxy to the same backend. Caddyfile/Dockerfile cleanup tracked under HB-8635 |
| Tableau Server config | Not managed by infrahive: tsm lives only on the VM; zero references to tableau / Tableau / tab.thb in this repo |
Both prod and non-prod apps point at the same
tab.thehelperbees.com (confirmed by tech lead 2026-05-05).
All swarm NAT IPs must be in the single shared
trusted_hosts list — applying the change is
environment-agnostic.
Migration action — DONE for us-central1 (HB-8623, PR #801):
- SSHed to ansible-tab-stage (via gssh tab the-helper-bees awx-builder-stage from infrahive bin/).
- Captured the existing 20-entry trusted_hosts list with tsm configuration get -k wgserver.trusted_hosts --trust-admin-controller-cert.
- Appended the 8 us-central1 swarm NAT IPs (purely additive; no existing entries removed) → 28 total.
- Ran tsm pending-changes apply + tsm restart. A 90-min outage during apply was caused by /var/opt/tab-backups filling to 100%; it was recovered by deleting old weekly backups and restarting the controller. Findings filed as follow-ups under HB-6748: backup retention automation, backup health monitoring, Tableau Server upgrade.
- End-to-end verified: worker-side probe (tab.thehelperbees.com/trusted from swarm-n-worker-r2w7, egress 34.170.35.69) returns HTTP 200 + "-1". The script POSTs deliberately bogus credentials, so 200 + "-1" proves the request got past CF Access AND was processed by Tableau Server, which is the chain we care about; a separate manual test with real credentials returned HTTP 200 + a real token, confirming the trusted-auth path also works end-to-end. Control-plane probe (tableau-trusted-hosts) reports 4/4 us-central1 NAT IPs in wgserver.trusted_hosts for both envs.
Follow-up needed — us-west1 reserve IPs (2026-05-12 policy
revision): PR #801 predates the §2 policy update that
pre-emptively includes the 4 us-west1 reserve IPs per env. Apply the
same tsm configuration set -k wgserver.trusted_hosts +
tsm pending-changes apply + tsm restart cycle,
this time appending the 8 us-west1 IPs (4 non-prod + 4 prod — see §2
table) — list grows from 28 → 36 entries. After apply, the
tableau-trusted-hosts probe should report 8/8 NAT IPs per
env. Tracked separately so the historical PR #801 record stays
accurate.
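Since the list is edited by hand on the VM, it helps to compute the new value as a purely additive, idempotent merge before pasting it into tsm configuration set. A minimal sketch, assuming the value is a comma-separated string; the function name and the placeholder entries (documentation-range IPs) are hypothetical, and the real 28 existing entries are not reproduced here.

```python
from ipaddress import ip_network

def merge_trusted_hosts(current_csv, new_ips):
    """Append new IPs to a comma-separated trusted_hosts value.
    Existing entries keep their order and are never removed; re-running
    with the same input changes nothing."""
    existing = [entry.strip() for entry in current_csv.split(",") if entry.strip()]
    seen = set(existing)
    merged = list(existing)
    for ip in new_ips:
        ip_network(ip, strict=False)  # raises ValueError on a malformed address
        if ip not in seen:
            merged.append(ip)
            seen.add(ip)
    return ",".join(merged)

# Placeholder data shaped like the follow-up: 28 existing entries + 8 new ones.
current = ",".join(f"192.0.2.{i}" for i in range(1, 29))
reserve_ips = [f"198.51.100.{i}" for i in range(1, 9)]

updated = merge_trusted_hosts(current, reserve_ips)
print(len(updated.split(",")))                             # entry count after the merge
print(merge_trusted_hosts(updated, reserve_ips) == updated)  # second run is a no-op
```

Only the new entries are validated, because the existing list may legitimately contain hostnames as well as addresses.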
Why it’s not “structurally identical to a partner
allowlist.” Tableau is internal to THB (we own the server) and
the allowlist is enforced by Tableau Server config we control directly —
not by a partner-side firewall. Once the swarm NAT IPs are in
trusted_hosts, no further coordination is needed.
Per-partner GCE address cleanup (HB-8239)
can release the legacy pdp1/pdp5/pdp14 entries from
trusted_hosts after prod cutover, but they’re zero-cost
left in place during the dual-allowlist window.
Out-of-band findings filed during HB-8623 investigation (all parented to HB-6748 unless noted):
- HB-8635: Tableau VM Caddyfile / Dockerfile cleanup (both ~4 years stale, hardcoded CF DNS token, unused TAB_CLOUDFLARE_CADDY_TOKEN Docker secret, non-functional (is-public-ip) / (is-vpn-ip) snippets). Reclassified Low priority: cosmetic, no functional impact.
- HB-8636: One-time fix: clear /var/opt/tab-backups and take a fresh backup (Slack-imported, immediate cleanup of the disk-full state).
- HB-8637: Tableau Server backup retention automation (auto-delete weekly backups older than 30 days; prevents the disk-full failure mode recurring).
- HB-8638: Tableau Server backup-disk health alarm (page on /var/opt/tab-backups > 80% full and on stale/0-byte newest .tsbak; closes the visibility gap that let the disk fill silently).
- Tableau Server upgrade from 2021.2.21 → current LTS, Ubuntu 16.04 host EOL: captured as a comment on existing epic HB-5567 (parented to HB-5565 “Terraform the Tableau Server”); the TF rebuild is the natural upgrade vehicle, no separate ticket needed.
3.6 Other (lower priority but worth verifying)
- Cloudflare Tunnel / cloudflared config: src/tools/launchbot/templates/cloudflared/config.yml.tmpl. Tunnel is identity-based, not IP-based, but verify there’s no IP-pinned policy.
- Datadog allowlists: outbound to Datadog via dd_agent. Datadog accepts traffic from any source; the Datadog API key authenticates the call. No change.
- Papertrail (syslog+tls://logs7.papertrailapp.com:23105): same; no IP allowlist on Papertrail’s side.
- Camunda / Keycloak outbound: internal, no external allowlist.
- GitHub Actions deploy proxy (infra/common-infra/business_unit_1/shared/gh_actions_deploy_proxy.tf): separate proxy, not impacted.
4. Cutover Sequence (handoff to HB-8218)
1. Provision swarm Cloud NAT for prod (HB-8217 already done for non-prod; prod gated on HB-8232).
2. Capture the NAT IPs per env via gcloud compute addresses list --filter='name~^ca-{n,p}-shared-base-spoke-us-central1-' --format='table(name,address)' against the corresponding network host project (full commands in §2). Use addresses list rather than routers nats describe here: describe returns address-resource URLs, not IP values. For defense-in-depth, also run routers nats describe (commands below) to confirm those addresses are actually attached to the NAT, and routers get-nat-mapping-info to confirm workers map to them (per §2 verification procedure).

       # Non-production
       gcloud compute routers nats describe rn-n-shared-base-spoke-us-central1-egress \
         --router=cr-n-shared-base-spoke-us-central1-nat-router \
         --region=us-central1 \
         --project=prj-n-shared-base-cb89
       # Production
       gcloud compute routers nats describe rn-p-shared-base-spoke-us-central1-egress \
         --router=cr-p-shared-base-spoke-us-central1-nat-router \
         --region=us-central1 \
         --project=prj-p-shared-base-11f6

3. Pre-add NAT IPs to all surfaces before any workload moves:
   - PD shared MSSQL authorized_networks (Terraform PR adding static entries to the concat in pd-infra/.../mssql.tf:110).
   - Azure SQL firewall: 2 rules per env via add_az_sql_fw_rule.sh.
   - Cloudflare Access groups green_p_ip_bypass and yellow_p_ip_bypass: TF PR adding static entries.
   - Caddy VPN bypass: TF PR adding the swarm NAT IPs as static entries to the caddy_vpn_bypass_list callers. (No overlay during migration; see HB-8620 for the future overlay work.)
   - Partner systems: outreach in flight (longest-tail; start now).
4. Validate (workers boot but stay drained until allowlists confirmed).
5. Cut over workloads.
6. Post-migration cleanup: remove now-stale per-VM rules.
Dual-allowlist window
Steps 3-6 deliberately leave both per-VM IPs and the new NAT IPs allowlisted for the duration of the cutover. This is necessary for zero-downtime migration but expands the allowlist by ~4 IPs per environment for the duration of the window. Treat this as a known temporary security posture:
- Expected duration: hours to days for in-house surfaces (Cloud SQL, Azure SQL, Cloudflare); weeks for partner-side allowlists where partner change-control SLAs apply (most acute for §3.5 partner webhooks).
- Highest-impact surfaces during the window: PD shared MSSQL (§3.1) and partner webhooks (§3.5). Both expose data paths, not just management.
- Monitoring: before the window opens, enable connection logging on Cloud SQL (cloudsql.googleapis.com/postgres.log or equivalent) and Azure SQL audit logging filtered by client_ip. Watch for traffic from the new NAT IPs before cutover (should be zero) and unexpected continued traffic from per-VM IPs after cutover (signals incomplete migration of a workload).
- Cleanup commitment: step 6 is not optional. Each surface owner is on the hook for removing the stale per-VM entry within 1 week of cutover, tracked in HB-8218 subtasks.
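The two monitoring signals in the list above reduce to a simple rule over (timestamp, client_ip) pairs. The sketch below assumes connection events have already been extracted from the logs into that shape; the function name, labels, IPs, and dates are hypothetical placeholders.

```python
from datetime import datetime

def flag_unexpected(events, nat_ips, legacy_ips, cutover):
    """Flag connections that contradict the expected cutover timeline.
    events: iterable of (timestamp, client_ip) tuples."""
    flagged = []
    for timestamp, client_ip in events:
        if timestamp < cutover and client_ip in nat_ips:
            # NAT traffic before cutover should be zero.
            flagged.append((timestamp, client_ip, "nat-before-cutover"))
        elif timestamp >= cutover and client_ip in legacy_ips:
            # A per-VM IP still connecting means a workload was not migrated.
            flagged.append((timestamp, client_ip, "legacy-after-cutover"))
    return flagged

NAT_IPS = {"192.0.2.1"}         # hypothetical swarm NAT IP
LEGACY_IPS = {"198.51.100.7"}   # hypothetical per-VM IP
CUTOVER = datetime(2026, 1, 2)  # hypothetical cutover time

events = [
    (datetime(2026, 1, 1, 9), "198.51.100.7"),  # expected: legacy before cutover
    (datetime(2026, 1, 1, 10), "192.0.2.1"),    # unexpected: NAT before cutover
    (datetime(2026, 1, 3, 9), "192.0.2.1"),     # expected: NAT after cutover
    (datetime(2026, 1, 3, 10), "198.51.100.7"), # unexpected: legacy after cutover
]

for timestamp, client_ip, reason in flag_unexpected(events, NAT_IPS, LEGACY_IPS, CUTOVER):
    print(timestamp.isoformat(), client_ip, reason)
```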
5. Open Questions
| # | Question | Owner |
|---|---|---|
| 1 | Resolved (see the Q1 decision in §3.4): no HB↔︎PD overlay during migration, so the swarm NAT IPs go into caddy_vpn_bypass_list. Future overlay tracked in HB-8620. | — |
| 2 | — | |
| 3 | Is the ADO agent VM (ado-agent-p-vm) included in this migration? The module.ado-agent-p-vm.instance_external_ip reference in the HA NSG SSH-2290 / SSH-22 rules (infra/ha-infra/business_unit_1/production/azure_devops.tf:87,105) needs no change. | — |
| 4 | Partially resolved. It’s tribal knowledge. Confirmed (all verified live via curl ifconfig.me): pdp7 (35.238.12.142), pdp10 (34.69.126.254), and pdp24 (35.225.100.177) allowlist our outbound IP on their side (see §3.5). Slack outreach posted in #devops on 2026-04-30; awaiting confirmation that no others exist. | PD partner success |
| 5 | Cloud NAT is configured in thehelperbees/gcp-networks at modules/base_shared_vpc/nat.tf (called from envs/{production,non-production}/boa_vpc_fw.tf). HB-8217 firewall rules are in modules/base_shared_vpc/firewall.tf:254-345. All 8 us-central1 NAT IPs verified live 2026-04-30 (4 non-prod + 4 prod); see §2 table. us-west1 NAT IPs (4 per env) are pre-emptively included in all allowlists for future DR/expansion; see §2 callout for the 2026-05-12 policy revision. | — |
6. Action Item Traceability
Every migration action this audit identifies maps to an existing Jira ticket. Comments noted below were posted to the corresponding ticket on 2026-04-30 to add audit-derived scope detail that wasn’t in the original ticket description.
| Audit § | Action | Jira ticket | Notes |
|---|---|---|---|
| §2 | Provision swarm Cloud NAT (prod) | HB-8232 | Gates prod migration |
| §2 | NAT IP set composition verification (3-check procedure) | HB-8417 | See comment 2026-04-30 distinguishing this from connectivity testing |
| §2 | Connectivity test script | HB-8211 | In Progress |
| §3.1 | PD shared MSSQL authorized_networks | HB-8412 | |
| §3.1 | Remove hb-n-vm entry from sandbox MSSQL | HB-8412 | Folded in via comment 2026-04-30 |
| §3.1.1 | Per-app --private-ip opt-in for new-hierarchy Postgres (cloudsql-proxy env var) | HB-8646 | pd #896: replaces global compose overlay with per-app .cloudsql-proxy.env opt-in |
| §3.1.1 | NAT allowlist on legacy the-helper-bees:ansible-p<N>-psql-* instances | HB-8646 | infrahive #811: idempotent gcloud script, --app <id> filter for incremental rollout |
| §3.1.1 | Connectivity probes for both paths (new-hierarchy + legacy) | HB-8211 | infrahive #808: probe_pd_postgres_reachability + probe_pd_legacy_postgres_allowlist |
| §3.2 | Azure SQL firewall (prod + non-prod) | HB-8411 | |
| §3.3 row A | pd-infra Cloudflare Access groups | HB-8414 | |
| §3.3 row B | hb-infra cloudflare-access + eligibility (local.pd_addresses) | HB-8414 | Folded in via comment 2026-04-30; original scope only mentioned pd-infra |
| §3.4 | Caddy VPN bypass list | HB-8413 | |
| §3.4 future | HB↔︎PD overlay network | HB-8620 | Parented to HB-6748 (DevOps KTLO) |
| §3.5 | Notify pdp7 / pdp10 / pdp24 (both envs) | HB-8415 | Concrete list + checklist added via comment 2026-04-30; per-partner allowlist verification via docs/scripts/query_event_subscriptions.sh + 401-vs-403 differential test recorded in §3.5 (2026-05-07) |
| §3.5 | Tableau Analytics allowlist (wgserver.trusted_hosts += 8 us-central1 swarm NAT IPs) | HB-8623 | DONE via PR #801; verified end-to-end 2026-05-06 |
| §3.5 | Tableau Analytics allowlist follow-up (wgserver.trusted_hosts += 8 us-west1 reserve NAT IPs) | TBD | Open; 2026-05-12 policy revision; same tsm operational path as HB-8623 |
| §3.5 | Tableau Caddyfile / Dockerfile cleanup (cosmetic) | HB-8635 | Filed 2026-05-05 — Low priority; out-of-band finding from HB-8623 |
| §3.5 | Tableau backup disk one-time clear + fresh backup | HB-8636 | Filed 2026-05-06 (Slack-imported) — Medium |
| §3.5 | Tableau backup retention automation (recurring prune) | HB-8637 | Filed 2026-05-06 — Medium |
| §3.5 | Tableau backup-disk health alarm (/var/opt/tab-backups > 80%) | HB-8638 | Filed 2026-05-06 — Medium |
| §3.5 | Tableau Server + Ubuntu host upgrade (2021.2.21 / 16.04 EOL) | HB-5567 | Comment 2026-05-06 — TF rebuild is the upgrade vehicle |
| §3.5 | Per-partner GCE address cleanup | HB-8239 | Implicit (decommission also removes address resources) |
| §4 | Pre-add NAT IPs to all surfaces | HB-8411 / 8412 / 8413 / 8414 / 8415 | |
| §4 | Dual-allowlist window (do NOT remove old per-VM IPs during cutover) | HB-8416 | |
| §4 | Cut over workloads | HB-8232 + per-app migration tickets | |
| §4 | Post-migration cleanup | HB-8239 | |
| Cross-cutting | Worker SA cloudsql.client grants | HB-8590 | Parented to HB-8216 (worker MIG module) |
| Parent | Allowlist updates (umbrella) | HB-8218 | Description updated 2026-04-30 to fix 2-IP/4-IP staleness and link audit |
| Parent | Swarm Consolidation Epic | HB-8200 | |
7. References
- Swarm worker module README
- Swarm cluster (non-prod) / (prod, gated) — managers, manager external IPs, and worker MIG, consolidated per #766
- THB VPN authorized networks
- Add Azure SQL firewall rule script