Pre-Migration Outbound IP Dependency Audit
Status: Ready for review. All 8 NAT IPs (4 non-prod + 4 prod, us-central1) verified live as of 2026-04-30.
Author: miko.hadikusuma · Created: 2026-04-29 · Updated: 2026-04-29
Jira: HB-8210 · Gates: HB-8218 (Allowlist updates) → HB-8232 (Deploy production swarm)
Table of Contents
- Purpose & Scope
- Current vs. Target State
- Audit by Surface
- Cutover Sequence (handoff to HB-8218)
- Open Questions
- Action Item Traceability
- References
1. Purpose & Scope
When the nine Phase 0 THB applications consolidate onto shared Docker Swarm workers, every per-VM external IP currently used as the source of an outbound connection disappears. Workers run with no external IP behind a shared Cloud NAT with 4 static egress IPs per region (us-central1) — see §2 for the verified set.
Anything that allowlists today’s per-VM IPs needs to be updated to the new NAT IPs before workloads cut over. This audit enumerates every such allowlist surface, where the rule lives, and the migration action needed. The deliverable feeds directly into HB-8218 (the implementation ticket).
Apps in scope (canonical list per `./zig/zig build scripts -- compose-file-linter --list`): `bees`, `benefits-platform`, `consumer_portal`, `django_homealign`, `hb-buzz`, `hbcrm`, `pd`, `thb-keycloak`, `the_consumer_portal`. (HA apps, which are Azure-side, are out of scope for the swarm migration, but their Azure SQL firewall is still affected; see §3.2.)
In scope: anywhere a current VM’s external IP is referenced as a source/allowed origin for outbound calls from one of these nine apps.
Out of scope:
- Inbound allowlists for user access to our infra (e.g. THB VPN exit IPs `gc{2,4,5,6}-algo` in `infra/bees-infra/common.auto.tfvars`). These remain identical post-migration.
- Partner-side IPs allowlisting them into our systems (e.g. `local.partner_ip_addresses` in `infra/ha-infra/business_unit_1/production/environment.tf`). Partner IPs don't change.
- App service-account permissions (covered by HB-8212 / HB-8213).
2. Current vs. Target State
Current per-VM external IPs

Discovered via `data.google_compute_addresses.*` and per-resource `google_compute_address` definitions:

| App grouping | Resource (prod) | Notes |
|---|---|---|
| HB shared VM (hbcrm, bea_hbcrm, hb_buzz, django_homealign, eligibility, keycloak/thb-keycloak, hbtemplate, unleash, posthog) | `google_compute_address.external_migrate` (`infra/hb-infra/business_unit_1/production/main.tf:22`) | One VM hosts ~9 apps. The thb-keycloak repo deploys the keycloak service here. |
| Benefits Hub (benefits-platform repo, multiple tenant variants) | `google_compute_address.external` (`infra/hb-infra/business_unit_1/production/benefits_platform.tf:6`) | |
| Bees Flask API (bees repo) | `module.bees-p-vm2.instance_external_ip` (`infra/bees-infra/business_unit_1/production/main.tf:23`) | Standalone VM in its own bees-infra project |
| Legacy Consumer Portal (consumer_portal repo, multi-partner) | `google_compute_address.cp_external` (`infra/pd-infra/business_unit_1/production/consumer_portal.tf:7`) | cp-p-vm. Hosts hbcp_jh, hbcp_aarp, hbcp_ta, hbcp_sompo, hbcp_pru, hbcp_bcbs_ar, hbcp_cgi, etc. Source of HB-8208 gcsfuse work. |
| New Consumer Portal — CoPo 3.0 (the_consumer_portal repo) | `google_compute_address.external` (`infra/pd-infra/business_unit_1/production/portal-the-consumer-portal.tf:9`) | portal-p-vm. Auth0 custom domain integration; PR preview env target (HB-7963). |
| ADO agent (deploys HomeAlign Apps) | `google_compute_address.ado_agent_external_ip` (`infra/ha-infra/business_unit_1/production/azure_devops.tf:83`) | Serves Azure DevOps SSH/deploy traffic into HA (HomeAlign) Windows VMs. HA apps are not in swarm scope; the ADO agent stays put → no action. |
| 22+ PD partner VMs | `google_compute_address.external` per `infra/pd-infra/business_unit_1/production/pdp*/main.tf:12` | One VM per partner |
Quick inventory command:

```shell
gcloud compute addresses list --filter="address_type=EXTERNAL" \
  --format="table(name,address,project)" \
  --project=<env-project-id>
```

Target swarm NAT IPs

Provisioned per HB-8217 / HB-8360 as Cloud NAT on the `vpc-{env}-shared-base` shared VPC. Capacity: 4 IPs × 64,512 ports / 2048 `min_ports_per_vm` = ~126 VM ceiling per region.
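The ceiling arithmetic can be sanity-checked in a few lines (an illustrative sketch, not project tooling):

```python
# Cloud NAT capacity ceiling: each NAT IP exposes 64,512 usable ports,
# and each VM reserves min_ports_per_vm of them up front.
nat_ips = 4
ports_per_ip = 64512
min_ports_per_vm = 2048

vm_ceiling = nat_ips * ports_per_ip // min_ports_per_vm
print(vm_ceiling)  # → 126
```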
Canonical TF source (separate repo): thehelperbees/gcp-networks
| Resource | Path |
|---|---|
| Cloud Router (`cr-{env_code}-shared-base-spoke-us-central1-nat-router`) | `modules/base_shared_vpc/nat.tf:22-32` |
| NAT external IPs (`ca-{env_code}-shared-base-spoke-us-central1-{0..3}`) | `modules/base_shared_vpc/nat.tf:34-39` (`count = var.nat_num_addresses_region1`) |
| Router NAT (`rn-{env_code}-shared-base-spoke-us-central1-egress`) | `modules/base_shared_vpc/nat.tf:41-56` |
| Module call with `nat_num_addresses_region1 = 4`, `nat_min_ports_per_vm = 2048` | `envs/{production,non-production}/boa_vpc_fw.tf:45-47` |
| HB-8217 swarm firewall rules (TCP 2377, TCP+UDP 7946, UDP 4789, TCP 9323), tag-scoped to `swarm-node` | `modules/base_shared_vpc/firewall.tf:254-345` |
NAT IPs to allowlist (verified 2026-04-30 against live state):
All four IPs per region are attached round-robin to the Router NAT, so every IP must be in every allowlist — partial allowlisting causes intermittent ~50% failures as workers cycle through unallowlisted IPs.
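Why a partial allowlist fails roughly half the time when 2 of 4 IPs are listed: under even per-VM assignment, the miss rate is simply the unallowlisted fraction of the pool. A toy model (illustrative only; Cloud NAT's real assignment is opaque to us):

```python
def expected_failure_rate(total_ips: int, allowlisted: int) -> float:
    """Fraction of workers landing on a non-allowlisted NAT IP,
    assuming Cloud NAT spreads workers evenly across the pool."""
    return (total_ips - allowlisted) / total_ips

# HB-8218 originally listed 2 of 4 IPs → roughly half of workers fail.
print(expected_failure_rate(4, 2))  # → 0.5
print(expected_failure_rate(4, 4))  # → 0.0
```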
| Environment | Network host project | Address resource | NAT IP |
|---|---|---|---|
| Non-production | `prj-n-shared-base-cb89` | `ca-n-shared-base-spoke-us-central1-0` | 34.72.82.40 |
| Non-production | `prj-n-shared-base-cb89` | `ca-n-shared-base-spoke-us-central1-1` | 34.122.251.101 |
| Non-production | `prj-n-shared-base-cb89` | `ca-n-shared-base-spoke-us-central1-2` | 34.170.35.69 |
| Non-production | `prj-n-shared-base-cb89` | `ca-n-shared-base-spoke-us-central1-3` | 35.194.23.180 |
| Production | `prj-p-shared-base-11f6` | `ca-p-shared-base-spoke-us-central1-0` | 34.132.72.45 |
| Production | `prj-p-shared-base-11f6` | `ca-p-shared-base-spoke-us-central1-1` | 35.226.246.142 |
| Production | `prj-p-shared-base-11f6` | `ca-p-shared-base-spoke-us-central1-2` | 34.136.40.107 |
| Production | `prj-p-shared-base-11f6` | `ca-p-shared-base-spoke-us-central1-3` | 34.57.41.236 |
📝 Historical note: HB-8218 originally listed only 2 of 4 IPs per env. The remaining four were discovered live during this audit (gcp-networks provisions `nat_num_addresses_region1 = 4` for both envs per `envs/{production,non-production}/boa_vpc_fw.tf:45`). The HB-8218 description was updated 2026-04-30 with the corrected set.
⚠️ us-west1 NAT IPs exist but are out of scope. A `gcloud compute addresses list` query against either shared-base-vpc-host project also returns 4 IPs in us-west1 (e.g., non-prod 34.105.111.96 / 34.127.70.253 / 136.109.214.215 / 34.169.152.90; prod 35.227.138.226 / 35.185.201.232 / 34.168.190.12 / 8.229.103.135). These belong to a parallel Cloud NAT in us-west1 for non-swarm workloads (Cloud Functions and any future regional expansion). The swarm cluster runs only in us-central1 — managers in `us-central1-{a,b,c}`; workers default to `["us-central1-a","us-central1-b","us-central1-c","us-central1-f"]` per `swarm_worker/variables.tf`. Workers cannot egress through the us-west1 NAT pool. Do not include the us-west1 IPs in any partner-side allowlist — they are zero-yield (no traffic will ever come from them) and create future change-control burden.
Verification command (per env):

```shell
# Non-production
gcloud compute addresses list \
  --project=prj-n-shared-base-cb89 \
  --filter='name~^ca-n-shared' \
  --format='table(name,address)'

# Production
gcloud compute addresses list \
  --project=prj-p-shared-base-11f6 \
  --filter='name~^ca-p-shared' \
  --format='table(name,address)'
```

Manager IPs (3 per env, separate from worker NAT):
- Non-prod: `google_compute_address.swarm_manager_external` in `infra/hb-infra/business_unit_1/non-production/swarm.tf` (deployed)
- Prod: same resource in `production/swarm.tf` — block-commented (`/* ... */`), gated on HB-8232
Verifying egress before partner outreach
Before notifying partners (§3.5), confirm that worker traffic actually exits via the NAT IPs above. Misalignment between the reserved IPs and the IPs partners see is exactly the silent-failure mode that caused the HB-8218 4-vs-2 finding — but for partner allowlists, that failure happens at the partner’s edge, after cutover, with no logs on our side.
Two complementary checks. Pass criteria: every worker is mapped to one of the 4 NAT IPs, and a curl from inside the worker echoes back the same IP that GCP’s mapping table claims for it.
Check 1 — control-plane mapping: ask GCP which NAT IP each worker is using. No cluster shell access needed.
```shell
NETWORK_HOST_NON_PROD="$(gcloud projects list \
  --filter='labels.application_name=base-shared-vpc-host AND labels.environment=non-production' \
  --format='value(projectId)')"

gcloud compute routers get-nat-mapping-info \
  cr-n-shared-base-spoke-us-central1-nat-router \
  --nat-name=rn-n-shared-base-spoke-us-central1-egress \
  --region=us-central1 \
  --project="$NETWORK_HOST_NON_PROD"
```

Output is a JSON list, one entry per VM, each with `natIpPortRanges[].natIp`. Group by `natIp` to confirm: every worker maps to one of 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180. Any other IP is a finding.
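The grouping step can be sketched as post-processing of the mapping output, assuming the field shape described above (`natIpPortRanges[].natIp`); the sample records and the `instanceName` field are invented for illustration:

```python
import json
from collections import defaultdict

EXPECTED_NON_PROD = {"34.72.82.40", "34.122.251.101", "34.170.35.69", "35.194.23.180"}

def group_by_nat_ip(mapping_json: str) -> dict:
    """Group workers by the NAT IP Cloud NAT assigned them, given the
    JSON list emitted by the get-nat-mapping-info command above."""
    by_ip = defaultdict(list)
    for vm in json.loads(mapping_json):
        for rng in vm["natIpPortRanges"]:
            by_ip[rng["natIp"]].append(vm["instanceName"])
    return dict(by_ip)

# Invented sample: two workers, one on an unexpected IP (a finding).
sample = json.dumps([
    {"instanceName": "swarm-worker-1", "natIpPortRanges": [{"natIp": "34.72.82.40"}]},
    {"instanceName": "swarm-worker-2", "natIpPortRanges": [{"natIp": "203.0.113.9"}]},
])
findings = set(group_by_nat_ip(sample)) - EXPECTED_NON_PROD
print(findings)  # → {'203.0.113.9'}
```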
Check 2 — partner’s-eye view: run a one-shot global service that hits an IP echo from every worker. What it reports is what a partner will see.
```shell
# From a swarm manager (SSH via IAP):
docker service create \
  --name nat-ip-check \
  --mode global \
  --restart-condition none \
  alpine sh -c 'echo "$(hostname): $(wget -qO- https://ifconfig.me)"'

# Wait ~30s for tasks to complete, then:
docker service logs nat-ip-check --no-trunc

# Cleanup:
docker service rm nat-ip-check
```

Cross-check the IP each worker reports against Check 1: they should match per-worker. If they don't, the egress path isn't what we think it is — stop and investigate before partner outreach.
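The per-worker comparison between Check 1 and Check 2 can be expressed as a small helper (a hypothetical sketch; worker names and IPs below are sample data):

```python
def cross_check(mapping: dict, echoed: dict) -> list:
    """Return workers where GCP's NAT mapping (Check 1) disagrees with
    the IP the worker itself echoed via ifconfig.me (Check 2)."""
    return sorted(w for w in mapping if echoed.get(w) != mapping[w])

# Invented sample data: worker-b's echoed IP differs from its mapping.
check1 = {"worker-a": "34.72.82.40", "worker-b": "34.170.35.69"}
check2 = {"worker-a": "34.72.82.40", "worker-b": "35.194.23.180"}
print(cross_check(check1, check2))  # → ['worker-b']
```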
If you don't have manager shell access, an alternate per-worker approach:

```shell
gcloud compute ssh swarm-worker-n-XXXX --tunnel-through-iap \
  --command='curl -s https://ifconfig.me'
```

…repeated for each worker.
Optional — Check 3, NAT translation logs: the gcp-networks `egress_nat_region1` resource enables `log_config { filter = "TRANSLATIONS_ONLY" }`, so every flow gets logged. While Check 2 is running, tail the logs to see actual translations:

```shell
gcloud logging read \
  'resource.type="nat_gateway"
   resource.labels.gateway_name="rn-n-shared-base-spoke-us-central1-egress"' \
  --project="$NETWORK_HOST_NON_PROD" \
  --limit=50 \
  --format='value(jsonPayload.connection.nat_ip,jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip)'
```

Three columns: the NAT IP exited from, the worker's internal IP, and the destination. Confirms what we expect from a third independent angle.
If all three checks agree, the IP set is ground-truthed and partner outreach can go out. If any disagree, the disagreement is the next thing to chase.
3. Audit by Surface
3.1 GCP
Cloud SQL `authorized_networks`

| Instance | Path | What's allowlisted today | Migration action |
|---|---|---|---|
| HB Postgres (`hb-p-psql`) | `infra/hb-infra/business_unit_1/production/main.tf:32` | `var.authorized_networks` only (THB VPN: `gc{2,4,5,6}-algo`) — no per-VM IPs | None. App connects via Cloud SQL Proxy sidecar (Phase 0 Step 1). |
| HB MSSQL sandbox — non-prod only (`hb-n-sql-server`) | `infra/hb-infra/business_unit_1/non-production/sql_server.tf:48` | `var.authorized_networks` + hb-n-vm external IP | Remove the hb-n-vm entry; nothing replaces it. Added by PR #362 as a DHA sandbox for the Benefits Hub team to test against without colliding with other teams' deploys. The allowlist entry was for the host-level Cloud SQL proxy connecting via public IP. The new per-stack `cloud-sql-proxy` sidecar authenticates to the Cloud SQL admin API by IAM, bypassing `authorized_networks` entirely. No prod equivalent. Workers need `cloudsql.client` on `prj-bu1-n-hb-infra-*` (HB-8590) and `module.hb-n-sql-server.instance_connection_name` in the django_homealign stack's sidecar command. |
| PD shared MSSQL (`pd-p-mssql`) | resource: `mssql.tf:84`; allowlist concat: `mssql.tf:110` | `var.authorized_networks` + `local.temporary_authorized_networks` (lines 11-14) + `local.pd_vm_authorized_networks` (lines 23-29) | HIGH RISK. `pd_vm_authorized_networks` auto-populates from every PD VM external IP via regex match on app `vm_resource_name`. As partner VMs go away, regex matches drop → auto-removal from the allowlist. Add the swarm NAT IPs as static entries (next to `local.temporary_authorized_networks`) before the first PD partner cuts over. |
| Per-partner Postgres (`pdp{N}-p-psql`) | `infra/pd-infra/business_unit_1/production/pdp*/main.tf:33` | `var.authorized_networks` only (THB VPN) | None. Phase 0 sidecar handles connectivity. |
Action item: Add a new entry to the allowlist concat in `pd-p-mssql.ip_configuration.authorized_networks` once the swarm NAT IPs are known.
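The intended shape of the merged allowlist can be sketched in Python as a hypothetical mirror of the Terraform concat in `mssql.tf:110`; the VPN and temporary entries below are placeholders, and only the prod NAT IPs come from §2:

```python
SWARM_NAT_PROD = ["34.132.72.45", "35.226.246.142", "34.136.40.107", "34.57.41.236"]

def merged_authorized_networks(*sources) -> list:
    """Concatenate allowlist sources, dropping duplicates while keeping
    order — roughly what a Terraform concat() + distinct() pattern does."""
    seen, out = set(), []
    for src in sources:
        for ip in src:
            if ip not in seen:
                seen.add(ip)
                out.append(ip)
    return out

vpn = ["198.51.100.10"]        # stand-in for var.authorized_networks
temporary = ["203.0.113.77"]   # stand-in for local.temporary_authorized_networks
merged = merged_authorized_networks(vpn, temporary, SWARM_NAT_PROD)
print(len(merged))  # → 6
```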
```shell
# Live audit
gcloud sql instances describe <instance-name> --project=<env-project-id> \
  --format="value(settings.ipConfiguration.authorizedNetworks)"
```

3.2 Azure SQL Firewall (HA Apps Only)
| Server | Resource group | Today's source | Migration action |
|---|---|---|---|
| `ha-prod1-azsqldb` | `HA-PROD1-SQL-RG` | Per-VM rules added imperatively via `src/scripts/add_az_sql_fw_rule.sh` | Add swarm NAT IPs as new rules; leave per-VM rules in place during cutover, remove post-migration |
| `ha-dev-azsqldb` | `HA-DEV-SQL-RG` | Same | Same |
These rules are NOT in Terraform. They live only in Azure. Inventory:

```shell
# Production
az sql server firewall-rule list --resource-group HA-PROD1-SQL-RG \
  --server ha-prod1-azsqldb --output table

# Non-prod
az sql server firewall-rule list --resource-group HA-DEV-SQL-RG \
  --server ha-dev-azsqldb --output table
```

Add rule (one invocation per NAT IP from §2):

```shell
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_0 -ip <nat-ip-0>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_1 -ip <nat-ip-1>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_2 -ip <nat-ip-2>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_3 -ip <nat-ip-3>
```

The script is idempotent (it checks for an existing rule with the same IP first). The `_0` through `_3` suffix aligns with the gcp-networks address resource indices (`ca-{env_code}-shared-base-spoke-us-central1-{0..3}`).
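A small generator keeps the rule-name suffix index-aligned with the address resources when repeating this for other envs (a sketch; the script flags shown are the ones from the invocations above):

```python
PROD_NAT_IPS = ["34.132.72.45", "35.226.246.142", "34.136.40.107", "34.57.41.236"]

def fw_rule_commands(env: str, ips: list) -> list:
    """Emit one add_az_sql_fw_rule.sh invocation per NAT IP, with the
    rule-name suffix matching the gcp-networks address index (…-us-central1-{i})."""
    return [
        f"./zig/zig build scripts -- add_az_sql_fw_rule.sh "
        f"-e {env} -n SWARM_WORKER_NAT_{i} -ip {ip}"
        for i, ip in enumerate(ips)
    ]

for cmd in fw_rule_commands("production", PROD_NAT_IPS):
    print(cmd)
```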
3.3 Cloudflare Access Groups (IP Bypass)
Two locations:

A) `infra/pd-infra/business_unit_1/production/cloudflare.tf` — most action here:

| Group | Source today | Migration action |
|---|---|---|
| `green_p_ip_bypass` (line 59) | `data.google_compute_addresses.hb_p_vm_ip_address` (HB project external IPs) | Auto-shrinks as the HB VM goes away. Add swarm NAT IPs as static entries. |
| `yellow_p_ip_bypass` (line 71) | `data.google_compute_addresses.bees_p_vm_ip_address` | Same — auto-shrinks; add static NAT IPs. |
| `pdp24_p_bastion_ip_bypass` / `pdp18_p_bastion_ip_bypass` | Bastion-specific compute addresses (`data.google_compute_addresses.p24_bastion_p_vm_ip_address` / `p18_bastion_p_vm_ip_address`) | No change. Bastions stay on dedicated VMs and are not part of the swarm migration; their external IPs are stable. |
| `pdp18_p_ip_bypass`, `pdp24_p_ip_bypass`, `salesforce_p_ip_bypass`, `thb_p_authorized_networks` | Hardcoded partner / VPN IP ranges | No change — these are partner egress, not ours. |
B) HB-side `cloudflare-access` module calls — these live in hb-infra (not pd-infra) and use a dynamic `vpn_ip_addresses` input rather than named Cloudflare Access groups:

| Module call | Path | Source | Migration action |
|---|---|---|---|
| Standard `cloudflare-access` (covers all HB apps with `vpn_bypass` policy: hbcrm, hb_buzz, eligibility-via-thehelperbees.com-domain, etc.) | `hb-infra/.../production/cloudflare.tf:8-15` | `vpn_ip_addresses = concat(local.pd_addresses, local.vpn_addresses)` — `local.pd_addresses` is auto-discovered from `data.google_compute_addresses.pd_ip_addresses` (filter `name:pdp*` in the PD project) | Add swarm NAT IPs as static entries to `local.pd_addresses` in `main.tf:8-11`. A single change flows into both module calls in this table. |
| `eligibility-cloudflare-access` (separate call because eligibility uses `session_duration = "168h"` vs the default; covers `eligibility.thb.nu` and `eligibility.thb.sh`) | `hb-infra/.../production/eligibility_service.tf:6-13` | Same expression: `concat(local.pd_addresses, local.vpn_addresses)` | Inherits the fix from updating `local.pd_addresses`. |
Why this matters for pdp24 → eligibility: when pdp24 calls `https://eligibility.thb.nu`, Cloudflare Access checks whether the source IP is in the `vpn_bypass` group. Today, pdp24's 35.225.100.177 is in `local.pd_addresses` (via the dynamic data lookup), so it passes. After migration, pdp24's traffic sources from a swarm NAT IP — which is not among the PD project's external addresses, so the dynamic lookup won't include it. The static-entry fix above is what keeps this internal flow working post-migration. No partner action required — this is our own infrastructure.
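A toy model of that check, assuming Cloudflare Access membership reduces to simple IP-set membership (the real evaluation is richer; all IPs except the NAT/VM addresses are placeholders):

```python
def passes_vpn_bypass(source_ip: str, pd_addresses: list, vpn_addresses: list) -> bool:
    """Toy model: the request passes if its source IP is in
    concat(local.pd_addresses, local.vpn_addresses)."""
    return source_ip in pd_addresses + vpn_addresses

vpn = ["198.51.100.10"]  # placeholder VPN exit

# Today: the dynamic lookup includes pdp24's VM IP → passes.
assert passes_vpn_bypass("35.225.100.177", ["35.225.100.177"], vpn)

# Post-migration without the fix: pdp24 egresses via a NAT IP the
# dynamic lookup no longer contains → blocked.
assert not passes_vpn_bypass("34.132.72.45", [], vpn)

# With static NAT entries added to local.pd_addresses → passes again.
assert passes_vpn_bypass("34.132.72.45", ["34.132.72.45"], vpn)
print("bypass model consistent")
```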
C) `infra/ha-infra/business_unit_1/production/cloudflare/cloudflare.tf` — uses only `var.authorized_networks` (THB VPN). No change needed.
Live audit:

```shell
# Already in TF — read current state
./zig/zig build plan -- pd-infra production cloudflare
./zig/zig build plan -- hb-infra production cloudflare
./zig/zig build plan -- hb-infra production eligibility_service
```

3.4 Caddy VPN Bypass List
Important: the Terraform-supplied `caddy_vpn_bypass_list` is one of three inputs that compose the actual runtime Caddy allowlist. The full picture, assembled by Ansible at deploy time:

`admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list`

(Composed across `gc_deploy_with_gcp_secrets.yml:358-360` — the playbook actually used in production — and `roles/caddy_create/tasks/main.yml:46-48` in hb-ansible.) This list drives the `(is-public-ip)` / `(is-vpn-ip)` snippets in `roles/caddy_create/templates/Caddyfile.j2`, which gate the admin/flower/camunda/websocket routes that bypass Cloudflare Access.
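A minimal sketch of that composition (sample IPs are placeholders; only the list-concatenation shape mirrors the Ansible expression above):

```python
def admin_allowed_ips(vm_ip: str, vpn_ips: list, bypass: list) -> list:
    """Mirror of the Ansible composition:
    admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list."""
    return [vm_ip] + vpn_ips + bypass

# Post-migration, vm_ip resolves to a shared NAT IP, so every worker
# incidentally allowlists every other worker's egress (sample values).
allowed = admin_allowed_ips("34.132.72.45", ["198.51.100.10"], ["34.136.40.107"])
print(allowed)  # → ['34.132.72.45', '198.51.100.10', '34.136.40.107']
```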
| Component | Source | Migration impact |
|---|---|---|
| `vm_ip` | `hostvars[vm_name].ansible_ssh_host` — the deploy target's own SSH host | Becomes the swarm worker's NAT egress IP. Since all workers share the same NAT IPs, every worker will (incidentally) self-allowlist every other worker's traffic via Cloudflare. A new property; not necessarily wrong, but worth surfacing. |
| `vpn_ip_addresses` | Maps to infrahive `var.authorized_networks` (THB algo VPN exits) | No change — same VPN IPs pre/post migration. |
| `caddy_vpn_bypass_list` | Terraform-fed via the `ansible_config` module (see table below) | Active migration concern. See callers below. |
How `caddy_vpn_bypass_list` flows from Terraform to runtime

The `caddy_vpn_bypass_list` value is not read at deploy time from a static file in hb-ansible. It traverses three systems:

```
infrahive Terraform (hb-infra/.../main.tf)
  local.caddy_pd_addresses
    |
    v
  module "hb-p-ansible-config5" { caddy_vpn_bypass_list = local.caddy_pd_addresses }
    | (terraform apply)
    v
ansible_config TF module (terraform-modules.git, external)
    |
    v
GCS: gs://ansible-config-65ab/<app>/config.yml   ← regenerated on every apply
    |
    v
hb-ansible playbook (gc_deploy_with_gcp_secrets.yml) loads bucket_path at deploy time
    |
    v
  filtered_app.caddy_vpn_bypass_list (per-app dict)
    |
    v
caddy_create role (roles/caddy_create/tasks/main.yml:46-48)
  admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + filtered_app.caddy_vpn_bypass_list
    |
    v
Caddyfile.j2 (is-public-ip / is-vpn-ip matchers)
```
Verified end-to-end via `gsutil cat gs://ansible-config-65ab/hbcrm/config.yml` — that file already contains a populated `caddy_vpn_bypass_list` array with all PD partner VM IPs (78+ entries today).
Implication for HB-8413 (Caddy VPN bypass) and HB-8414 (Cloudflare Access): changes happen in infrahive only. No edits to hb-ansible. The Caddy template is data-driven — it picks up whatever is in the GCS config on the next deploy. And because the same `local.caddy_pd_addresses` / `local.pd_addresses` data lookup feeds all three downstream surfaces (Caddy bypass, standard `cloudflare-access`, `eligibility-cloudflare-access`), a single Terraform PR closes both HB-8413 and HB-8414 — coordinate so the work isn't duplicated.
Terraform sources of `caddy_vpn_bypass_list`:

| Caller | Source today | Migration action |
|---|---|---|
| `module.hb-p-ansible-config5` (`infra/hb-infra/business_unit_1/production/main.tf:140`) | `local.caddy_pd_addresses` — auto-discovered PD VM external IPs (`name:pdp*` filter, line 12) | Auto-shrinks as PD VMs migrate. Replace with the swarm NAT IPs as static entries. HB→PD calls will continue to round-trip via Cloudflare even on the same cluster (no overlay during migration — see HB-8620 follow-up). |
| `module.hb-n-ansible-config` (non-prod equivalent) | Same pattern | Same |
| `module.benefits-hub-p-ansible-config1` (`infra/hb-infra/business_unit_1/production/benefits_platform.tf:73`) | `[]` (empty) | No change |
Decision (resolves Q1): The migration will not introduce an HB↔︎PD swarm overlay network. HB→PD calls continue to traverse the public internet via Cloudflare, requiring the swarm NAT IPs in this bypass list. A future overlay is tracked under HB-8620 (parented to the HB-6748 DevOps KTLO epic), which would let us remove this bypass list entirely.
Note on `partner_allowed_ip_addresses` (inbound, out of scope): while inspecting hb-ansible, we also reviewed the per-app `partner_allowed_ip_addresses` variable (`roles/caddy_create/tasks/main.yml:32`, only set on `stage_hbpdp11` with `[76.187.226.216]`). This is the inverse direction — a partner's IP allowlisted on our Caddy for inbound `/partner/*` traffic — and is therefore out of scope for HB-8210.

Additionally, the variable is set as a fact but never referenced in the current `Caddyfile.j2` template, suggesting it is either vestigial or enforced at the app layer inside partner_django. Either way, partner IPs don't change during our migration, so no action.
3.5 Partner API Webhooks / External Integrations
This is the highest-risk and least-discoverable surface. Partner systems often allowlist our outbound IP for inbound webhooks, SFTP polling, or API calls.
Known surfaces requiring live inventory (cannot be enumerated from infrahive alone):

- Partner SFTP destinations — `infra/pd-infra/cloudfunctions/sftp_pubsub_messages/SFTP/client.go`, `infra/pd-infra/cloudfunctions/sftp_bucket_file/SFTP/client.go`. SFTP runs from cloud functions, not VMs — likely uses its own NAT. Verify these aren't routed through partner VMs.
- PSR (Provider Services Request) handler — out of scope, documented for clarity — `infra/pd-infra/modules/partner_psr_request_handler/`. On first read this looked like a possible outbound surface, but inspection of the module and its callers (pdp8, pdp24, …) confirms the flow is the opposite direction: a partner's Salesforce instance pushes files into our GCS bucket (`psr-{partner_code}-{env}-*`), authenticated by a service-account key that we generate and hand to the partner's Salesforce dev. A Cloud Function then converts uploads to PDF/XML for hbcrm to read. No partner-side IP allowlist is involved, and our outbound IP doesn't appear in this flow at all — so it's unaffected by the swarm migration. Listed here only so future readers don't re-add it as an open question.
- Per-partner outbound API integrations — likely configured per partner in either:
  - `pd` repo `.envs/` files (partner API base URLs and auth)
  - hb-buzz / hbcrm / django_homealign per-tenant settings
  - Salesforce / Camunda outbound webhooks

Recommended path: post in #devops asking team members which partners allowlist our outbound IP. Active outreach: #devops thread on 2026-04-30. Track replies there and add new partners to the confirmed-partners table below as they surface.
Owner for partner outreach: PD partner success / integrations team. Best started in parallel with HB-8218 implementation, since some partners have multi-week change-control SLAs.
Confirmed partners with our outbound IP allowlisted

All IPs below verified live via `curl ifconfig.me` from the corresponding VM. Each address resource carries `lifecycle { ignore_changes = all }`, so the IPs have been stable.

| Partner | Env | Outbound IP | GCE VM | Project |
|---|---|---|---|---|
| pdp24 | prod | 35.225.100.177 | `pdp24-p-vm-6670` | `prj-bu1-p-pd-infra-b355` |
| pdp24 | non-prod | 34.30.125.50 | `pdp24-n-vm-2327` | `prj-bu1-n-pd-infra-fee5` |
| pdp10 | prod | 34.69.126.254 | `pdp10-p-vm-a437` | `prj-bu1-p-pd-infra-b355` |
| pdp10 | non-prod | 34.134.208.190 | `pdp10-n-vm-01f0` | `prj-bu1-n-pd-infra-fee5` |
| pdp7 | prod | 35.238.12.142 | `pdp7-p-vm-ebcb` | `prj-bu1-p-pd-infra-b355` |
| pdp7 | non-prod | 35.232.30.74 | `pdp7-n-vm-5c52` | `prj-bu1-n-pd-infra-fee5` |

Source TF for all rows: `google_compute_address.external` in `infra/pd-infra/business_unit_1/{production,non-production}/pdp{N}/main.tf:12`.
Whether each partner actually has the non-prod IP allowlisted on their side is unknown. Confirmed partner integrations may be prod-only; partner outreach (per pre-cutover checklist below) should ask explicitly: “do you allowlist any IPs of ours other than the prod one above?” and capture the answer here.
Other partners — unknown. Coverage is tribal; the PD partner success / integrations team should confirm no others exist. Outreach posted in #devops on 2026-04-30 — track replies there.
Internal flows that depend on these IPs: pdp24 → `eligibility.thb.nu` / `eligibility.thb.sh` (confirmed). Our own Cloudflare Access `vpn_bypass` group includes all PD partner VM IPs (pdp7, pdp10, pdp24, and any other active partner) via `local.pd_addresses` (dynamic data lookup), so any of them calling our HB-side services rides this allowlist. After migration the dynamic lookup shrinks to nothing and is replaced by static swarm NAT IP entries — covered by §3.3 row B. No partner action needed for these internal flows.
Decision (post-migration): notify each partner contact of the swarm NAT IPs across both environments (4 non-prod + 4 prod, per §2) and have them update their allowlist in a single change. Bundling envs minimizes partner change-control churn — most partners have one allowlist covering any of our outbound traffic, regardless of which env it originated from. After prod cutover (the last env to migrate), the per-partner production GCE addresses can be released (delete the `google_compute_address.external` resources in `pdp{N}/main.tf`).
Why not preserve the existing per-partner IPs?

To pick the right path it helps to know how Cloud NAT actually distributes IPs to outbound flows. With the MANUAL_ONLY mode + `min_ports_per_vm = 2048` that gcp-networks configures, Cloud NAT does per-VM IP allocation, not per-flow:

- Each worker VM is assigned a fixed allocation of ~2048 ports from one of the pool's NAT IPs.
- The worker uses that single IP for all its outbound flows (until port exhaustion, which is rare with 2048 ports).
- The assignment is Cloud-NAT-internal — we don't control which IP a given worker gets.
- With N IPs in the pool and M workers, roughly M/N workers land on each IP.

Service replicas in Swarm are scheduled across the worker pool, so per-partner outbound traffic (e.g., pdp24 → partner API, pdp10 → partner API) comes from whichever NAT IP was assigned to the worker the scheduler placed it on — not from any IP we can pre-select.
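A toy simulation of per-VM allocation makes the consequence concrete (the random choice stands in for Cloud NAT's opaque assignment; worker names are illustrative):

```python
import random

def assign_nat_ips(workers: list, nat_ips: list, seed: int = 7) -> dict:
    """Per-VM allocation model: Cloud NAT pins each worker to one pool IP
    (we don't control which); all of that worker's flows then use it."""
    rng = random.Random(seed)
    return {w: rng.choice(nat_ips) for w in workers}

nat_pool = ["34.132.72.45", "35.226.246.142", "34.136.40.107", "34.57.41.236"]
workers = [f"worker-{i}" for i in range(12)]
assignment = assign_nat_ips(workers, nat_pool)

# A partner replica scheduled on any worker egresses from that worker's
# pinned IP — so the partner must allowlist the whole pool, not one IP.
partner_source_ips = {assignment[w] for w in ("worker-0", "worker-5", "worker-9")}
print(partner_source_ips <= set(nat_pool))  # → True
```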
Options compared (per partner)

| # | Option | Preserves the existing per-partner IP? | Partner allowlist work | Operational cost | Verdict |
|---|---|---|---|---|---|
| 1 | Notify the partner; they allowlist all 8 swarm NAT IPs (4 non-prod + 4 prod) | No, but it doesn't matter — the partner accepts new IPs | One email/ticket per partner; ~24-hour turnaround for most | Zero — fewer resources to maintain after migration | ✅ Default. Reconsider only if the change-control SLA is multi-week. |
| 2 | Don't migrate the affected partner stack; leave it on a dedicated VM | Yes (fully) | None | Continued cost of the dedicated VM; partner stays out of Phase 0 scope | ❌ Ruled out (we want to migrate all partners). |
| 3 | Outbound HTTPS proxy on a tiny dedicated VM owning the existing IP; partner stack uses the `HTTPS_PROXY` env | Yes (exactly, via `HTTPS_PROXY`) | None | One always-on VM per partner needing preservation; monitoring/alerts/upgrades for each; single point of failure for that partner's traffic | ⚠️ Acceptable if option 1 is blocked by partner SLA. Cost scales linearly with the number of preserved IPs. |
| 4 | Add the existing IPs to the swarm Cloud NAT pool | No. Per-VM allocation means ~1/(N+k) of workers get any given preserved IP and use it for all their outbound traffic regardless of destination. Other workers (and any partner replicas scheduled on them) use the other IPs. The partner still has to allowlist the full set. | Same as option 1 | Slightly higher than option 1 (extra IPs reserved forever) | ❌ Doesn't preserve IPs for partner-specific traffic. Cloud NAT lacks per-flow IP selection. |
| 5 | Pin the partner stack to a specific worker; attach the existing IP to that node | Yes (the pinned worker uses that IP) | None | Defeats Swarm scheduling, breaks the autoscaler, single point of failure for the partner stack | ❌ Ruled out (already rejected). |
Why option 4 keeps coming up

Option 4 feels like it should work: "we have the IPs, just keep using them." It would work if Cloud NAT supported per-flow IP selection rules ("traffic to partner.example.com → source IP X"), but it doesn't — IP assignment is per-VM, not per-flow or per-destination. So a preserved IP ends up tied to whichever workers Cloud NAT happens to assign it to, used by every outbound flow from those workers regardless of destination, while partner replicas on other workers don't use it at all. It's effectively option 1 with extra unused IPs in the pool.
The only real ways to preserve a partner-specific IP are option 3 (proxy steers at the application layer) or option 5 (node pinning forces the worker assignment), and we’ve ruled out 5. Reconsider option 3 only if option 1 is blocked for a specific partner.
Pre-cutover checklist (per partner, run for pdp7, pdp10, and pdp24):

Precondition: the §2 verification procedure ("Verifying egress before partner outreach") must have been run for non-prod, with the 4 verified IPs locked in. For prod, communicate the IPs we have today and follow up before prod cutover with any updates from the same verification run.

1. Send the partner contact the full swarm NAT IP set across both environments (from §2). Communicating both envs at once minimizes partner change-control churn:
   - Non-production (4 IPs, verified live): 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180
   - Production (4 IPs, all verified live 2026-04-30): 34.132.72.45, 35.226.246.142, 34.136.40.107, 34.57.41.236
2. Ask the partner to enumerate every IP of ours they currently allowlist — both prod and non-prod, if applicable. Capture this in the audit's confirmed-partners table so we know which entries to clean up post-migration.
3. Get written confirmation that all 8 swarm NAT IPs (or the current best subset for prod) are in their allowlist before the corresponding env's migration day.
4. Keep existing per-partner IPs allowlisted on the partner side throughout the dual-allowlist window per §4. The window must extend through the last migration to complete (= prod, gated on HB-8232). Per-partner IPs to keep (only those the partner confirms in step 2 are actually in their allowlist):
   - pdp7 prod: 35.238.12.142 / non-prod: 35.232.30.74
   - pdp10 prod: 34.69.126.254 / non-prod: 34.134.208.190
   - pdp24 prod: 35.225.100.177 / non-prod: 34.30.125.50
5. One week after prod cutover: ask the partner to remove the old IPs from their allowlist; release the corresponding `google_compute_address.external` resources in `pdp{N}/main.tf` (and non-prod equivalents if applicable) on our side.
Tableau Analytics — internal system that allowlists our partner IPs
The internal Tableau server hosting Analytics pages allowlists outbound IPs from pdp1, pdp5, and pdp14 so those partner stacks can fetch/render dashboards. After migration, those stacks egress through swarm NAT IPs and Tableau’s allowlist won’t match.
This is structurally identical to the partner allowlist pattern above — Tableau is "external" from the swarm's perspective even though it's internal to THB. It is not managed by infrahive (zero references to `tableau` / `Tableau` / `tab.thb` in the repo); the `tab_user` DB user in `crm_database` is a read role for Tableau to query CRM data, not the Tableau server itself.
The migration action (add swarm NAT IPs to Tableau’s allowlist, dual-allowlist window through prod cutover, verify Tableau → our Cloud SQL access still works post-migration) is tracked under HB-8623 (subtask of HB-8218).
Open questions for whoever owns Tableau (Analytics / DevOps): where does the Tableau server live (self-hosted VM? Tableau Cloud?), who owns its allowlist, and are there any other partner stacks beyond pdp1/5/14 that hit it?
3.6 Other (lower priority but worth verifying)
- Cloudflare Tunnel / cloudflared config — `src/tools/launchbot/templates/cloudflared/config.yml.tmpl`. Tunnel is identity-based, not IP-based, but verify there’s no IP-pinned policy.
- Datadog allowlists — outbound to Datadog via `dd_agent`. Datadog accepts traffic from any source; the Datadog API key authenticates the call. No change.
- Papertrail (`syslog+tls://logs7.papertrailapp.com:23105`) — same; no IP allowlist on Papertrail’s side.
- Camunda / Keycloak outbound — internal, no external allowlist.
- GitHub Actions deploy proxy (`infra/common-infra/business_unit_1/shared/gh_actions_deploy_proxy.tf`) — separate proxy, not impacted.
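The cloudflared verification in the first bullet can be partly mechanized: any IPv4 literal in the template is a candidate IP-pinned policy worth reviewing by hand. A sketch (the grep pattern is generic, so version strings and the like can false-positive; output is for human review, not a pass/fail gate):

```shell
# List any IPv4 literals in a config file so they can be reviewed by hand.
# Point it at src/tools/launchbot/templates/cloudflared/config.yml.tmpl —
# empty output suggests the tunnel policy is purely identity-based.
find_ip_literals() {
  # -E extended regex; -o print only the matched text; never fail the caller
  grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" || true
}
```

Usage: `find_ip_literals src/tools/launchbot/templates/cloudflared/config.yml.tmpl`.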
4. Cutover Sequence (handoff to HB-8218)
1. Provision swarm Cloud NAT for prod (HB-8217 already done for non-prod; prod gated on HB-8232).
2. Capture the NAT IPs per env via `gcloud compute addresses list --filter='name~^ca-{n,p}-shared-base-spoke-us-central1-' --format='table(name,address)'` against the corresponding network host project (full commands in §2). Use `addresses list` rather than `routers nats describe` here: `describe` returns address-resource URLs, not IP values. For defense in depth, also run `routers nats describe` to confirm those addresses are actually attached to the NAT, and `routers get-nat-mapping-info` to confirm workers map to them (per §2 verification procedure).

   ```
   # Non-production
   gcloud compute routers nats describe rn-n-shared-base-spoke-us-central1-egress \
     --router=cr-n-shared-base-spoke-us-central1-nat-router \
     --region=us-central1 \
     --project=prj-n-shared-base-cb89

   # Production
   gcloud compute routers nats describe rn-p-shared-base-spoke-us-central1-egress \
     --router=cr-p-shared-base-spoke-us-central1-nat-router \
     --region=us-central1 \
     --project=prj-p-shared-base-11f6
   ```

3. Pre-add NAT IPs to all surfaces before any workload moves:
   - PD shared MSSQL `authorized_networks` (Terraform PR adding static entries to the concat in `pd-infra/.../mssql.tf:110`).
   - Azure SQL firewall: 2 rules per env via `add_az_sql_fw_rule.sh`.
   - Cloudflare Access groups `green_p_ip_bypass` and `yellow_p_ip_bypass`: TF PR adding static entries.
   - Caddy VPN bypass: TF PR adding the swarm NAT IPs as static entries to the `caddy_vpn_bypass_list` callers. (No overlay during migration — see HB-8620 for the future overlay work.)
   - Partner systems: outreach in flight (longest tail; start now).
4. Validate (workers boot but stay drained until allowlists are confirmed).
5. Cut over workloads.
6. Post-migration cleanup: remove the now-stale per-VM rules.
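If the PD shared MSSQL instance is a Cloud SQL instance (as the `mssql.tf` location suggests), step 3 can be spot-checked by comparing the intended NAT IP set against what gcloud reports. A sketch — the instance name `pd-shared-mssql` is a hypothetical placeholder, and `missing_ips` expects whitespace-separated entries, so pipe gcloud’s `;`-separated output through `tr ';' ' '`:

```shell
# Fetch the current authorized_networks entries for a Cloud SQL instance.
# The --format projection is the standard field path for authorized networks.
fetch_authorized_networks() {
  gcloud sql instances describe "$1" \
    --format='value(settings.ipConfiguration.authorizedNetworks[].value)' \
    | tr ';' ' '
}

# missing_ips REQUIRED_IPS CURRENT_ENTRIES
# Prints every required IP that is absent from the current entries
# (accepting either a bare IP or its /32 CIDR form). Empty output = all good.
missing_ips() {
  local required=$1 current=" $2 " ip
  for ip in $required; do
    case "$current" in
      *" $ip "*|*" $ip/32 "*) ;;   # present
      *) echo "$ip" ;;
    esac
  done
}
```

Usage: `missing_ips "34.132.72.45 35.226.246.142 34.136.40.107 34.57.41.236" "$(fetch_authorized_networks pd-shared-mssql)"` — any printed IP has not yet been pre-added.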
Dual-allowlist window
Steps 3-6 deliberately leave both the per-VM IPs and the new NAT IPs allowlisted for the duration of the cutover. This is necessary for zero-downtime migration but expands each environment’s allowlist by ~4 IPs while the window is open. Treat this as a known, temporary security posture:
- Expected duration: hours to days for in-house surfaces (Cloud SQL, Azure SQL, Cloudflare); weeks for partner-side allowlists where partner change-control SLAs apply (most acute for §3.5 partner webhooks).
- Highest-impact surfaces during the window: PD shared MSSQL (§3.1) and partner webhooks (§3.5) — both expose data paths, not just management.
- Monitoring: before the window opens, enable connection logging on Cloud SQL (`cloudsql.googleapis.com/postgres.log` or equivalent) and Azure SQL audit logging filtered by `client_ip`. Watch for traffic from the new NAT IPs before cutover (should be zero) and for continued traffic from per-VM IPs after cutover (which signals an incompletely migrated workload).
- Cleanup commitment: step 6 is not optional. Each surface owner is on the hook for removing the stale per-VM entries within 1 week of cutover, tracked in HB-8218 subtasks.
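The monitoring bullet above needs a log filter that matches any of several source IPs; building it by hand for 8 addresses is error-prone. A sketch of a filter builder — the field name (`textPayload`) and resource type shown in the usage line are illustrative examples, to be swapped for whatever the actual Cloud SQL or Azure SQL audit log schema exposes:

```shell
# Build a Cloud Logging-style OR filter matching any of the given IPs.
# $1 is the log field to search; remaining args are IPs. Uses the ':'
# (substring/has) operator rather than '=' so IPs embedded in a log line
# still match.
build_ip_filter() {
  local field=$1; shift
  local clause="" ip
  for ip in "$@"; do
    clause="${clause:+$clause OR }${field}:\"${ip}\""
  done
  echo "($clause)"
}
```

Hypothetical usage during the window: `gcloud logging read "resource.type=\"cloudsql_database\" AND $(build_ip_filter textPayload 34.132.72.45 35.226.246.142 34.136.40.107 34.57.41.236)"`.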
5. Open Questions
| # | Question | Owner |
|---|---|---|
| 1 | `caddy_vpn_bypass_list`. Future overlay tracked in HB-8620. | — |
| 2 | — | |
| 3 | Is the ADO agent VM (`ado-agent-p-vm`) included in this migration? The `module.ado-agent-p-vm.instance_external_ip` reference in the HA NSG SSH-2290 / SSH-22 rules (`infra/ha-infra/business_unit_1/production/azure_devops.tf:87,105`) needs no change. | — |
| 4 | Partially resolved. It’s tribal knowledge. Confirmed (all verified live via `curl ifconfig.me`): pdp7 (`35.238.12.142`), pdp10 (`34.69.126.254`), and pdp24 (`35.225.100.177`) allowlist our outbound IP on their side (see §3.5). Slack outreach posted in #devops on 2026-04-30; awaiting confirmation that no others exist. | PD partner success |
| 5 | `thehelperbees/gcp-networks` at `modules/base_shared_vpc/nat.tf` (called from `envs/{prod,non-production}/boa_vpc_fw.tf`). HB-8217 firewall rules are in `modules/base_shared_vpc/firewall.tf:254-345`. All 8 NAT IPs verified live 2026-04-30 (4 non-prod + 4 prod, us-central1) — see §2 table. us-west1 NAT IPs exist but are out of scope (swarm is us-central1-only). | — |
6. Action Item Traceability
Every migration action this audit identifies maps to an existing Jira ticket. Comments noted below were posted to the corresponding ticket on 2026-04-30 to add audit-derived scope detail that wasn’t in the original ticket description.
| Audit § | Action | Jira ticket | Notes |
|---|---|---|---|
| §2 | Provision swarm Cloud NAT (prod) | HB-8232 | Gates prod migration |
| §2 | NAT IP set composition verification (3-check procedure) | HB-8417 | See comment 2026-04-30 distinguishing this from connectivity testing |
| §2 | Connectivity test script | HB-8211 | In Progress |
| §3.1 | PD shared MSSQL `authorized_networks` | HB-8412 | |
| §3.1 | Remove `hb-n-vm` entry from sandbox MSSQL | HB-8412 | Folded in via comment 2026-04-30 |
| §3.2 | Azure SQL firewall (prod + non-prod) | HB-8411 | |
| §3.3 row A | pd-infra Cloudflare Access groups | HB-8414 | |
| §3.3 row B | hb-infra cloudflare-access + eligibility (`local.pd_addresses`) | HB-8414 | Folded in via comment 2026-04-30 — original scope only mentioned pd-infra |
| §3.4 | Caddy VPN bypass list | HB-8413 | |
| §3.4 future | HB↔︎PD overlay network | HB-8620 | Parented to HB-6748 (DevOps KTLO) |
| §3.5 | Notify pdp7 / pdp10 / pdp24 (both envs) | HB-8415 | Concrete list + checklist added via comment 2026-04-30 |
| §3.5 | Tableau Analytics allowlist (pdp1/pdp5/pdp14 → Tableau, plus reverse-direction Cloud SQL/Cloudflare verification) | HB-8623 | Created 2026-04-30 |
| §3.5 | Per-partner GCE address cleanup | HB-8239 | Implicit (decommission also removes address resources) |
| §4 | Pre-add NAT IPs to all surfaces | HB-8411 / 8412 / 8413 / 8414 / 8415 | |
| §4 | Dual-allowlist window (do NOT remove old per-VM IPs during cutover) | HB-8416 | |
| §4 | Cut over workloads | HB-8232 + per-app migration tickets | |
| §4 | Post-migration cleanup | HB-8239 | |
| Cross-cutting | Worker SA `cloudsql.client` grants | HB-8590 | Parented to HB-8216 (worker MIG module) |
| Parent | Allowlist updates (umbrella) | HB-8218 | Description updated 2026-04-30 to fix 2-IP/4-IP staleness and link audit |
| Parent | Swarm Consolidation Epic | HB-8200 | |
7. References
- Swarm worker module README
- Swarm cluster (non-prod) / (prod, gated) — managers, manager external IPs, and worker MIG, consolidated per #766
- THB VPN authorized networks
- Add Azure SQL firewall rule script