
Pre-Migration Outbound IP Dependency Audit

Status: Ready for review (all 8 NAT IPs, 4 non-prod + 4 prod in us-central1, verified live as of 2026-04-30). Author: miko.hadikusuma · Created: 2026-04-29 · Updated: 2026-04-29 · Jira: HB-8210 · Gates: HB-8218 (Allowlist updates) → HB-8232 (Deploy production swarm)


Table of Contents

  1. Purpose & Scope
  2. Current vs. Target State
  3. Audit by Surface
    1. GCP Cloud SQL authorized_networks
    2. Azure SQL Firewall (HA Apps Only)
    3. Cloudflare Access Groups (IP Bypass)
    4. Caddy VPN Bypass List
    5. Partner API Webhooks / External Integrations
    6. Other (lower priority but worth verifying)
  4. Cutover Sequence (handoff to HB-8218)
  5. Open Questions
  6. Action Item Traceability
  7. References

1. Purpose & Scope

When the nine Phase 0 THB applications consolidate onto shared Docker Swarm workers, every per-VM external IP currently used as the source of an outbound connection disappears. Workers run with no external IP behind a shared Cloud NAT with 4 static egress IPs per region (us-central1) — see §2 for the verified set.

Anything that allowlists today’s per-VM IPs needs to be updated to the new NAT IPs before workloads cut over. This audit enumerates every such allowlist surface, where the rule lives, and the migration action needed. The deliverable feeds directly into HB-8218 (the implementation ticket).

Apps in scope (canonical list per ./zig/zig build scripts -- compose-file-linter --list): bees, benefits-platform, consumer_portal, django_homealign, hb-buzz, hbcrm, pd, thb-keycloak, the_consumer_portal. (HA apps — Azure-side — are out of scope for the swarm migration but their Azure SQL firewall is still affected; see §3.2.)

In scope: anywhere a current VM’s external IP is referenced as a source/allowed origin for outbound calls from one of these nine apps.

Out of scope:


2. Current vs. Target State

Current per-VM external IPs

Discovered via data.google_compute_addresses.* and per-resource google_compute_address definitions:

App grouping Resource (prod) Notes
HB shared VM (hbcrm, bea_hbcrm, hb_buzz, django_homealign, eligibility, keycloak/thb-keycloak, hbtemplate, unleash, posthog) google_compute_address.external_migrate (infra/hb-infra/business_unit_1/production/main.tf:22) One VM hosts ~9 apps. thb-keycloak repo deploys the keycloak service here.
Benefits Hub (benefits-platform repo, multiple tenant variants) google_compute_address.external (infra/hb-infra/business_unit_1/production/benefits_platform.tf:6)
Bees Flask API (bees repo) module.bees-p-vm2.instance_external_ip (infra/bees-infra/business_unit_1/production/main.tf:23) Standalone VM in its own bees-infra project
Legacy Consumer Portal (consumer_portal repo, multi-partner) google_compute_address.cp_external (infra/pd-infra/business_unit_1/production/consumer_portal.tf:7) cp-p-vm. Hosts hbcp_jh, hbcp_aarp, hbcp_ta, hbcp_sompo, hbcp_pru, hbcp_bcbs_ar, hbcp_cgi, etc. Source of HB-8208 gcsfuse work.
New Consumer Portal — CoPo 3.0 (the_consumer_portal repo) google_compute_address.external (infra/pd-infra/business_unit_1/production/portal-the-consumer-portal.tf:9) portal-p-vm. Auth0 custom domain integration; PR preview env target (HB-7963).
ADO agent (deploys HomeAlign Apps) google_compute_address.ado_agent_external_ip (infra/ha-infra/business_unit_1/production/azure_devops.tf:83) Serves Azure DevOps SSH/deploy traffic into HA (HomeAlign) Windows VMs. HA apps are not in swarm scope; ADO agent stays put → no action.
22+ PD partner VMs google_compute_address.external per infra/pd-infra/business_unit_1/production/pdp*/main.tf:12 One VM per partner

Quick inventory command:

gcloud compute addresses list --filter="address_type=EXTERNAL" --format="table(name,address,project)" \
  --project=<env-project-id>

Target swarm NAT IPs

Provisioned per HB-8217 / HB-8360 as Cloud NAT on the vpc-{env}-shared-base shared VPC. Capacity: 4 IPs × 64,512 ports / 2048 min_ports_per_vm = ~126 VM ceiling per region.

Canonical TF source (separate repo): thehelperbees/gcp-networks

Resource Path
Cloud Router (cr-{env_code}-shared-base-spoke-us-central1-nat-router) modules/base_shared_vpc/nat.tf:22-32
NAT external IPs (ca-{env_code}-shared-base-spoke-us-central1-{0..3}) modules/base_shared_vpc/nat.tf:34-39 (count = var.nat_num_addresses_region1)
Router NAT (rn-{env_code}-shared-base-spoke-us-central1-egress) modules/base_shared_vpc/nat.tf:41-56
Module call with nat_num_addresses_region1 = 4, nat_min_ports_per_vm = 2048 envs/{production,non-production}/boa_vpc_fw.tf:45-47
HB-8217 swarm firewall rules (TCP 2377, TCP+UDP 7946, UDP 4789, TCP 9323), tag-scoped to swarm-node modules/base_shared_vpc/firewall.tf:254-345

NAT IPs to allowlist (verified 2026-04-30 against live state):

All four IPs per region are attached to the Router NAT and assigned out to workers, so every IP must be in every allowlist. With a partial allowlist, any worker that lands on an unallowlisted IP fails every outbound call it makes, which shows up as intermittent failures (roughly 50% if only 2 of the 4 IPs are allowlisted).

Environment Network host project Address resource NAT IP
Non-production prj-n-shared-base-cb89 ca-n-shared-base-spoke-us-central1-0 34.72.82.40
Non-production prj-n-shared-base-cb89 ca-n-shared-base-spoke-us-central1-1 34.122.251.101
Non-production prj-n-shared-base-cb89 ca-n-shared-base-spoke-us-central1-2 34.170.35.69
Non-production prj-n-shared-base-cb89 ca-n-shared-base-spoke-us-central1-3 35.194.23.180
Production prj-p-shared-base-11f6 ca-p-shared-base-spoke-us-central1-0 34.132.72.45
Production prj-p-shared-base-11f6 ca-p-shared-base-spoke-us-central1-1 35.226.246.142
Production prj-p-shared-base-11f6 ca-p-shared-base-spoke-us-central1-2 34.136.40.107
Production prj-p-shared-base-11f6 ca-p-shared-base-spoke-us-central1-3 34.57.41.236

📝 Historical note: HB-8218 originally listed only 2 of the 4 IPs per env; the rest were discovered live during this audit (gcp-networks provisions nat_num_addresses_region1 = 4 for both envs per envs/{production,non-production}/boa_vpc_fw.tf:45). HB-8218 description updated 2026-04-30 with the corrected set.

⚠️ us-west1 NAT IPs exist but are out of scope. A gcloud compute addresses list query against either shared-base-vpc-host project also returns 4 IPs in us-west1 (e.g., non-prod 34.105.111.96 / 34.127.70.253 / 136.109.214.215 / 34.169.152.90; prod 35.227.138.226 / 35.185.201.232 / 34.168.190.12 / 8.229.103.135). These belong to a parallel Cloud NAT in us-west1 for non-swarm workloads (Cloud Functions and any future regional expansion). The swarm cluster runs only in us-central1 — managers in us-central1-{a,b,c}, workers default to ["us-central1-a","us-central1-b","us-central1-c","us-central1-f"] per swarm_worker/variables.tf. Workers cannot egress through the us-west1 NAT pool. Do not include the us-west1 IPs in any partner-side allowlist — they’re zero-yield (no traffic ever comes from them) and create future change-control burden.

Verification command (per env):

# Non-production
gcloud compute addresses list \
  --project=prj-n-shared-base-cb89 \
  --filter='name~^ca-n-shared' \
  --format='table(name,address)'

# Production
gcloud compute addresses list \
  --project=prj-p-shared-base-11f6 \
  --filter='name~^ca-p-shared' \
  --format='table(name,address)'

Manager IPs (3 per env, separate from worker NAT):

Verifying egress before partner outreach

Before notifying partners (§3.5), confirm that worker traffic actually exits via the NAT IPs above. Misalignment between the reserved IPs and the IPs partners see is exactly the silent-failure mode that caused the HB-8218 4-vs-2 finding — but for partner allowlists, that failure happens at the partner’s edge, after cutover, with no logs on our side.

Two complementary checks. Pass criteria: every worker is mapped to one of the 4 NAT IPs, and a curl from inside the worker echoes back the same IP that GCP’s mapping table claims for it.

Check 1 — control-plane mapping: ask GCP which NAT IP each worker is using. No cluster shell access needed.

NETWORK_HOST_NON_PROD="$(gcloud projects list \
  --filter='labels.application_name=base-shared-vpc-host AND labels.environment=non-production' \
  --format='value(projectId)')"

gcloud compute routers get-nat-mapping-info \
  cr-n-shared-base-spoke-us-central1-nat-router \
  --nat-name=rn-n-shared-base-spoke-us-central1-egress \
  --region=us-central1 \
  --project="$NETWORK_HOST_NON_PROD"

Output is a list with one entry per VM; each entry’s interfaceNatMappings[].natIpPortRanges records which NAT IP (and port block) that VM was assigned. Group by NAT IP to confirm: every worker maps to one of 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180. Any other IP is a finding.
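
One way to do the grouping (a sketch): it assumes the output shape documented for get-nat-mapping-info, where the natIpPortRanges entries are strings of the form NAT_IP:port_low-port_high; the recursive jq path avoids depending on the exact wrapper key, but verify against your gcloud version before trusting the counts.

# Count how many port-range blocks land on each NAT IP (one or more per worker).
gcloud compute routers get-nat-mapping-info \
  cr-n-shared-base-spoke-us-central1-nat-router \
  --nat-name=rn-n-shared-base-spoke-us-central1-egress \
  --region=us-central1 \
  --project="$NETWORK_HOST_NON_PROD" \
  --format=json \
  | jq -r '.. | .natIpPortRanges? // empty | .[]' \
  | cut -d: -f1 | sort | uniq -c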

Check 2 — partner’s-eye view: run a one-shot global service that hits an IP echo from every worker. What it reports is what a partner will see.

# From a swarm manager (SSH via IAP):
docker service create \
  --name nat-ip-check \
  --mode global \
  --restart-condition none \
  alpine sh -c 'echo "$(hostname): $(wget -qO- https://ifconfig.me)"'

# Wait ~30s for tasks to complete, then:
docker service logs nat-ip-check --no-trunc

# Cleanup:
docker service rm nat-ip-check

Cross-check the IP each worker reports against Check 1: they should match per-worker. If they don’t, the egress path isn’t what we think it is — stop and investigate before partner outreach.

If you don’t have manager shell access, alternate per-worker approach:

gcloud compute ssh swarm-worker-n-XXXX --tunnel-through-iap \
  --command='curl -s https://ifconfig.me'

…repeated for each worker.

Optional — Check 3, NAT translation logs: the gcp-networks egress_nat_region1 resource enables log_config { filter = "TRANSLATIONS_ONLY" }, so every flow gets logged. While Check 2 is running, tail the logs to see actual translations:

gcloud logging read \
  'resource.type="nat_gateway"
   resource.labels.gateway_name="rn-n-shared-base-spoke-us-central1-egress"' \
  --project="$NETWORK_HOST_NON_PROD" \
  --limit=50 \
  --format='value(jsonPayload.connection.nat_ip,jsonPayload.connection.src_ip,jsonPayload.connection.dest_ip)'

Three columns: NAT IP exited from, worker internal IP, destination. Confirms what we expect from a third independent angle.

If all three checks agree, the IP set is ground-truthed and partner outreach can go out. If any disagree, the disagreement is the next thing to chase.


3. Audit by Surface

3.1 GCP Cloud SQL authorized_networks

Instance Path What’s allowlisted today Migration action
HB Postgres (hb-p-psql) infra/hb-infra/business_unit_1/production/main.tf:32 var.authorized_networks only (THB VPN: gc{2,4,5,6}-algo) — no per-VM IPs None. App connects via Cloud SQL Proxy sidecar (Phase 0 Step 1).
HB MSSQL sandbox — non-prod only (hb-n-sql-server) infra/hb-infra/business_unit_1/non-production/sql_server.tf:48 var.authorized_networks + hb-n-vm external IP Remove the hb-n-vm entry; nothing replaces it. Added by PR #362 as a DHA sandbox for the Benefits Hub team to test against without colliding with other teams’ deploys. The allowlist entry was for the host-level Cloud SQL proxy connecting via public IP. The new per-stack cloud-sql-proxy sidecar authenticates to the Cloud SQL admin API by IAM, bypassing authorized_networks entirely. No prod equivalent. Workers need cloudsql.client on prj-bu1-n-hb-infra-* (HB-8590) and module.hb-n-sql-server.instance_connection_name in the django_homealign stack’s sidecar command (sidecar sketch below this table).
PD shared MSSQL (pd-p-mssql) resource: mssql.tf:84; allowlist concat: mssql.tf:110 var.authorized_networks + local.temporary_authorized_networks (lines 11-14) + local.pd_vm_authorized_networks (lines 23-29) HIGH RISK. pd_vm_authorized_networks auto-populates from every PD VM external IP via regex match on app vm_resource_name. As partner VMs go away, regex matches drop → auto-removal from allowlist. Need to add the swarm NAT IPs as static entries (next to local.temporary_authorized_networks) before the first PD partner cuts over.
Per-partner Postgres (pdp{N}-p-psql) infra/pd-infra/business_unit_1/production/pdp*/main.tf:33 var.authorized_networks only (THB VPN) None. Phase 0 sidecar handles connectivity.
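
For the hb-n-sql-server row above, the per-stack sidecar invocation would look roughly like the following (a sketch using Cloud SQL Auth Proxy v2 flags; the actual django_homealign stack definition, port, and flag set may differ):

# Sidecar container command inside the django_homealign stack (illustrative):
cloud-sql-proxy \
  --address 0.0.0.0 \
  --port 1433 \
  <instance-connection-name>
# <instance-connection-name> = the value of module.hb-n-sql-server.instance_connection_name
# (project:region:instance). IAM auth via the worker SA (HB-8590) replaces the authorized_networks entry.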

Action item: add the verified swarm NAT IPs (§2) as static entries to the allowlist concat feeding pd-p-mssql.ip_configuration.authorized_networks; a post-apply spot check is sketched after the live-audit command below.

# Live audit
gcloud sql instances describe <instance-name> --project=<env-project-id> \
  --format="value(settings.ipConfiguration.authorizedNetworks)"

3.2 Azure SQL Firewall (HA Apps Only)

Server Resource group Today’s source Migration action
ha-prod1-azsqldb HA-PROD1-SQL-RG Per-VM rules added imperatively via src/scripts/add_az_sql_fw_rule.sh Add swarm NAT IPs as new rules; leave per-VM rules in place during cutover, remove post-migration
ha-dev-azsqldb HA-DEV-SQL-RG Same Same

These rules are NOT in Terraform. They live only in Azure. Inventory:

# Production
az sql server firewall-rule list --resource-group HA-PROD1-SQL-RG \
  --server ha-prod1-azsqldb --output table

# Non-prod
az sql server firewall-rule list --resource-group HA-DEV-SQL-RG \
  --server ha-dev-azsqldb --output table

Add rule (one invocation per NAT IP from §2):

./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_0 -ip <nat-ip-0>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_1 -ip <nat-ip-1>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_2 -ip <nat-ip-2>
./zig/zig build scripts -- add_az_sql_fw_rule.sh -e production -n SWARM_WORKER_NAT_3 -ip <nat-ip-3>

The script is idempotent (checks for existing rule with same IP first). The _0 through _3 suffix aligns with the gcp-networks address resource indices (ca-{env_code}-shared-base-spoke-us-central1-{0..3}).
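
For reference, a one-IP sketch of the az call the script wraps (assumed from the az sql CLI; the script adds the existence check and the env/name plumbing on top):

# Direct equivalent for the first production NAT IP; normally let the wrapper script do this.
az sql server firewall-rule create \
  --resource-group HA-PROD1-SQL-RG \
  --server ha-prod1-azsqldb \
  --name SWARM_WORKER_NAT_0 \
  --start-ip-address 34.132.72.45 \
  --end-ip-address 34.132.72.45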

3.3 Cloudflare Access Groups (IP Bypass)

Two locations:

A) infra/pd-infra/business_unit_1/production/cloudflare.tf — most action here:

Group Source today Migration action
green_p_ip_bypass (line 59) data.google_compute_addresses.hb_p_vm_ip_address (HB project external IPs) Auto-shrinks as HB VM goes away. Add swarm NAT IPs as static entries.
yellow_p_ip_bypass (line 71) data.google_compute_addresses.bees_p_vm_ip_address Same — auto-shrinks; add static NAT IPs.
pdp24_p_bastion_ip_bypass / pdp18_p_bastion_ip_bypass bastion-specific compute addresses (data.google_compute_addresses.p24_bastion_p_vm_ip_address / p18_bastion_p_vm_ip_address) No change. Bastions stay on dedicated VMs and are not part of the swarm migration; their external IPs are stable.
pdp18_p_ip_bypass, pdp24_p_ip_bypass, salesforce_p_ip_bypass, thb_p_authorized_networks Hardcoded partner / VPN IP ranges No change — these are partner egress, not ours.

B) HB-side cloudflare-access module calls — these live in hb-infra (not pd-infra) and use a dynamic vpn_ip_addresses input rather than named Cloudflare Access groups:

Module call Path Source Migration action
Standard cloudflare-access (covers all HB apps with vpn_bypass policy: hbcrm, hb_buzz, eligibility-via-thehelperbees.com-domain, etc.) hb-infra/.../production/cloudflare.tf:8-15 vpn_ip_addresses = concat(local.pd_addresses, local.vpn_addresses). local.pd_addresses is auto-discovered from data.google_compute_addresses.pd_ip_addresses (filter name:pdp* in PD project) Add swarm NAT IPs as static entries to local.pd_addresses in main.tf:8-11. Single change flows into both module calls below.
eligibility-cloudflare-access (separate call because eligibility uses session_duration = "168h" vs the default; covers eligibility.thb.nu and eligibility.thb.sh) hb-infra/.../production/eligibility_service.tf:6-13 Same expression: concat(local.pd_addresses, local.vpn_addresses) Inherits fix from updating local.pd_addresses.

Why this matters for pdp24 → eligibility: When pdp24 calls https://eligibility.thb.nu, Cloudflare Access checks if the source IP is in the vpn_bypass group. Today, pdp24’s 35.225.100.177 is in local.pd_addresses (via the dynamic data lookup) so it passes. After migration, pdp24’s IP becomes a swarm NAT IP — which is not in the PD project’s external addresses, so the dynamic lookup won’t include it. The static-entry fix above is what keeps this internal flow working post-migration. No partner action required — this is our own infrastructure.
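
A post-change smoke test for that flow (a sketch; run from any swarm worker once the static entries are applied; the point is that the response should come from the eligibility app itself rather than a redirect to the Cloudflare Access login):

# From a swarm worker (or a one-shot container on it): expect an application response code,
# not a 302 whose redirect_url points at *.cloudflareaccess.com.
curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' https://eligibility.thb.nu/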

C) infra/ha-infra/business_unit_1/production/cloudflare/cloudflare.tf — uses only var.authorized_networks (THB VPN). No change needed.

Live audit:

# Already in TF — read current state
./zig/zig build plan -- pd-infra production cloudflare
./zig/zig build plan -- hb-infra production cloudflare
./zig/zig build plan -- hb-infra production eligibility_service

3.4 Caddy VPN Bypass List

Important: The Terraform-supplied caddy_vpn_bypass_list is one of three inputs that compose the actual runtime Caddy allowlist. The full picture, assembled by Ansible at deploy time:

admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + caddy_vpn_bypass_list

(Composed across gc_deploy_with_gcp_secrets.yml:358-360 — the playbook actually used in production — and roles/caddy_create/tasks/main.yml:46-48 in hb-ansible.) This list drives the (is-public-ip) / (is-vpn-ip) snippets in roles/caddy_create/templates/Caddyfile.j2, which gate the admin/flower/camunda/websocket routes that bypass Cloudflare Access.

Component Source Migration impact
vm_ip hostvars[vm_name].ansible_ssh_host — the deploy target’s own SSH host Becomes the swarm worker’s NAT egress IP. Since all workers share the same NAT IPs, every worker will (incidentally) self-allowlist every other worker’s traffic via Cloudflare. New property; not necessarily wrong, worth surfacing.
vpn_ip_addresses Maps to infrahive var.authorized_networks (THB algo VPN exits) No change — same VPN IPs pre/post migration.
caddy_vpn_bypass_list Terraform-fed via the ansible_config module (see table below) Active migration concern. See callers below.
How caddy_vpn_bypass_list flows from Terraform to runtime

The caddy_vpn_bypass_list value is not read at deploy time from a static file in hb-ansible. It traverses three systems:

infrahive Terraform (hb-infra/.../main.tf)
    local.caddy_pd_addresses
        |
        v
    module "hb-p-ansible-config5" { caddy_vpn_bypass_list = local.caddy_pd_addresses }
        |  (terraform apply)
        v
    ansible_config TF module (terraform-modules.git, external)
        |
        v
    GCS: gs://ansible-config-65ab/<app>/config.yml      ← regenerated on every apply
        |
        v
    hb-ansible playbook (gc_deploy_with_gcp_secrets.yml) loads bucket_path at deploy time
        |
        v
    filtered_app.caddy_vpn_bypass_list (per-app dict)
        |
        v
    caddy_create role (roles/caddy_create/tasks/main.yml:46-48)
        admin_allowed_ip_addresses = [vm_ip] + vpn_ip_addresses + filtered_app.caddy_vpn_bypass_list
        |
        v
    Caddyfile.j2 (is-public-ip / is-vpn-ip matchers)

Verified end-to-end via gsutil cat gs://ansible-config-65ab/hbcrm/config.yml — that file already contains a populated caddy_vpn_bypass_list array with all PD partner VM IPs (78+ entries today).

Implication for HB-8413 (Caddy VPN bypass) and HB-8414 (Cloudflare Access): changes happen in infrahive only. No edits to hb-ansible. The Caddy template is data-driven — it picks up whatever’s in the GCS config on next deploy. And because the same local.caddy_pd_addresses / local.pd_addresses data lookup feeds all three downstream surfaces (Caddy bypass, standard cloudflare-access, eligibility-cloudflare-access), a single Terraform PR closes both HB-8413 and HB-8414 — coordinate so the work isn’t duplicated.
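
Because the GCS config is regenerated on every apply, the same gsutil check used above doubles as a post-change verification that the HB-8413 Terraform PR actually reached the deploy path (a sketch; run it after the apply and before the next hbcrm deploy):

gsutil cat gs://ansible-config-65ab/hbcrm/config.yml \
  | grep -E '34\.132\.72\.45|35\.226\.246\.142|34\.136\.40\.107|34\.57\.41\.236' \
  || echo "swarm NAT IPs not in caddy_vpn_bypass_list yet; check the local.caddy_pd_addresses wiring"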

Terraform sources of caddy_vpn_bypass_list:

Caller Source today Migration action
module.hb-p-ansible-config5 (infra/hb-infra/business_unit_1/production/main.tf:140) local.caddy_pd_addresses — auto-discovered PD VM external IPs (name:pdp* filter, line 12) Auto-shrinks as PD VMs migrate. Replace with the swarm NAT IPs as static entries. HB→PD calls will continue to roundtrip via Cloudflare even on the same cluster (no overlay during migration — see HB-8620 follow-up).
module.hb-n-ansible-config (non-prod equivalent) Same pattern Same
module.benefits-hub-p-ansible-config1 (infra/hb-infra/business_unit_1/production/benefits_platform.tf:73) [] empty No change

Decision (resolves Q1): The migration will not introduce an HB↔︎PD swarm overlay network. HB→PD calls continue to traverse the public internet via Cloudflare, requiring the swarm NAT IPs in this bypass list. A future overlay is tracked under HB-8620 (parented to the HB-6748 DevOps KTLO epic), which would let us remove this bypass list entirely.

Note on partner_allowed_ip_addresses (inbound, out of scope): While inspecting hb-ansible, we also reviewed the per-app partner_allowed_ip_addresses variable (roles/caddy_create/tasks/main.yml:32, only set on stage_hbpdp11 with [76.187.226.216]). This is the inverse direction — a partner’s IP allowlisted on our Caddy for inbound /partner/* traffic — and is therefore out of scope for HB-8210. Additionally, the variable is set as a fact but never referenced in the current Caddyfile.j2 template, suggesting it is either vestigial or enforced at the app layer inside partner_django. Either way, partner IPs don’t change during our migration, so no action.

3.5 Partner API Webhooks / External Integrations

This is the highest-risk and least-discoverable surface. Partner systems often allowlist our outbound IP for inbound webhooks, SFTP polling, or API calls.

Known surfaces requiring live inventory (cannot enumerate from infrahive alone):

  1. Partner SFTP destinations: infra/pd-infra/cloudfunctions/sftp_pubsub_messages/SFTP/client.go, infra/pd-infra/cloudfunctions/sftp_bucket_file/SFTP/client.go. SFTP from cloud functions, not VMs — likely uses its own NAT. Verify these aren’t routed through partner VMs.

  2. PSR (Provider Services Request) handler — out of scope, documented for clarity: infra/pd-infra/modules/partner_psr_request_handler/. On first read this looked like a possible outbound surface, but inspection of the module and its callers (pdp8, pdp24, …) confirms the flow is the opposite direction: a partner’s Salesforce instance pushes files into our GCS bucket (psr-{partner_code}-{env}-*), authenticated by a service-account key that we generate and hand to the partner’s Salesforce dev. A Cloud Function then converts uploads to PDF/XML for hbcrm to read. No partner-side IP allowlist is involved, and our outbound IP doesn’t appear in this flow at all — so it’s unaffected by the swarm migration. Listed here only so future readers don’t re-add it as an open question.

  3. Per-partner outbound API integrations — likely configured per-partner in either:

    • pd repo .envs/ files (partner API base URLs and auth)
    • hb-buzz / hbcrm / django_homealign per-tenant settings
    • Salesforce / Camunda outbound webhooks

Recommended path: post in #devops asking team members which partners allowlist our outbound IP. Active outreach: #devops thread on 2026-04-30. Track replies there and add new partners to the confirmed-partners table below as they surface.

Owner for partner outreach: PD partner success / integrations team. Best to start parallel to HB-8218 implementation since some partners have multi-week change-control SLAs.

Confirmed partners with our outbound IP allowlisted

All IPs below verified live via curl ifconfig.me from the corresponding VM. Each address resource carries lifecycle { ignore_changes = all } so the IPs have been stable.

Partner Env Outbound IP GCE VM Project
pdp24 prod 35.225.100.177 pdp24-p-vm-6670 prj-bu1-p-pd-infra-b355
pdp24 non-prod 34.30.125.50 pdp24-n-vm-2327 prj-bu1-n-pd-infra-fee5
pdp10 prod 34.69.126.254 pdp10-p-vm-a437 prj-bu1-p-pd-infra-b355
pdp10 non-prod 34.134.208.190 pdp10-n-vm-01f0 prj-bu1-n-pd-infra-fee5
pdp7 prod 35.238.12.142 pdp7-p-vm-ebcb prj-bu1-p-pd-infra-b355
pdp7 non-prod 35.232.30.74 pdp7-n-vm-5c52 prj-bu1-n-pd-infra-fee5

Source TF for all rows: google_compute_address.external in infra/pd-infra/business_unit_1/{production,non-production}/pdp{N}/main.tf:12.
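
To re-verify any row later (a sketch; <vm-zone> is a placeholder, and add --tunnel-through-iap if direct SSH to the VM’s external IP is blocked):

# Expected output: the row’s Outbound IP (here 35.225.100.177 for pdp24 prod).
gcloud compute ssh pdp24-p-vm-6670 \
  --project=prj-bu1-p-pd-infra-b355 \
  --zone=<vm-zone> \
  --command='curl -s https://ifconfig.me'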

Whether each partner actually has the non-prod IP allowlisted on their side is unknown. Confirmed partner integrations may be prod-only; partner outreach (per pre-cutover checklist below) should ask explicitly: “do you allowlist any IPs of ours other than the prod one above?” and capture the answer here.

Other partners — unknown. Coverage is tribal; PD partner success / integrations team should confirm no others exist. Outreach posted in #devops on 2026-04-30 — track replies there.

Internal flows that depend on these IPs: pdp24 → eligibility.thb.nu / eligibility.thb.sh (confirmed). Our own Cloudflare Access vpn_bypass group includes all PD partner VM IPs (pdp7, pdp10, pdp24, and any other active partner) via local.pd_addresses (dynamic data lookup), so any of them calling our HB-side services rides this allowlist. After migration the dynamic lookup shrinks to nothing and is replaced by static swarm NAT IP entries — covered by §3.3 row B. No partner action needed for these internal flows.

Decision (post-migration): notify each partner contact of the swarm NAT IPs across both environments (4 non-prod + 4 prod, per §2) and have them update their allowlist in a single change. Bundling envs minimizes partner change-control churn — most partners have one allowlist that covers any of our outbound, regardless of which env it originated from. After prod cutover (the last env to migrate), the per-partner production GCE addresses can be released (delete the google_compute_address.external resources in pdpN/main.tf).

Why not preserve the existing per-partner IPs?

To pick the right path it helps to know how Cloud NAT actually distributes IPs to outbound flows. With the MANUAL_ONLY mode + min_ports_per_vm = 2048 that gcp-networks configures, Cloud NAT does per-VM IP allocation, not per-flow:

  1. Each worker VM is assigned a fixed allocation of ~2048 ports from one of the pool’s NAT IPs.
  2. The worker uses that single IP for all its outbound flows (until port exhaustion, which is rare with 2048 ports).
  3. The assignment is Cloud-NAT-internal — we don’t control which IP a given worker gets.
  4. With N IPs in the pool and M workers, roughly M/N workers land on each IP.

Service replicas in Swarm are scheduled across the worker pool, so per-partner outbound traffic (e.g., pdp24 → partner API, pdp10 → partner API) comes from whichever NAT IP was assigned to the worker the scheduler placed it on — not from any IP we can pre-select.

Options compared (per partner)
# Option Preserves the existing per-partner IP? Partner allowlist work Operational cost Verdict
1 Notify the partner, they allowlist all 8 swarm NAT IPs (4 non-prod + 4 prod) No, but doesn’t matter — partner accepts new IPs One email/ticket per partner; ~24-hour turnaround for most Zero — fewer resources to maintain after migration Default. Reconsider only if change-control SLA is multi-week.
2 Don’t migrate the affected partner stack; leave on dedicated VM Yes (fully) None Continued cost of the dedicated VM; partner stays out of Phase 0 scope ❌ Ruled out (we want to migrate all partners).
3 Outbound HTTPS proxy on a tiny dedicated VM owning the existing IP; partner stack uses HTTPS_PROXY env Yes (exactly, via HTTPS_PROXY) None One always-on VM per partner needing preservation; monitoring/alerts/upgrades for each; single point of failure for that partner’s traffic ⚠️ Acceptable if option 1 is blocked by partner SLA. Cost scales linearly with number of preserved IPs.
4 Add the existing IPs to the swarm Cloud NAT pool No. Per-VM allocation means ~1/(N+k) of workers get any given preserved IP and use it for all their outbound regardless of destination. Other workers (and any partner replicas scheduled on them) use the other IPs. Partner still has to allowlist the full set. Same as option 1 Slightly higher than option 1 (extra IPs reserved forever) ❌ Doesn’t preserve IPs for partner-specific traffic. Cloud NAT lacks per-flow IP selection.
5 Pin partner stack to a specific worker; attach the existing IP to that node Yes (the pinned worker uses that IP) None Defeats Swarm scheduling, breaks autoscaler, single point of failure for the partner stack ❌ Ruled out (already rejected).
Why option 4 keeps coming up

Option 4 feels like it should work: “we have the IPs, just keep using them.” It would work if Cloud NAT supported per-flow IP selection rules (“traffic to partner.example.com → source IP X”), but it doesn’t — IP assignment is per-VM, not per-flow or per-destination. So a preserved IP ends up tied to whichever workers Cloud NAT happens to assign it to, used by every outbound flow from those workers regardless of destination, while partner replicas on other workers don’t use it at all. It’s effectively option 1 with extra unused IPs in the pool.

The only real ways to preserve a partner-specific IP are option 3 (proxy steers at the application layer) or option 5 (node pinning forces the worker assignment), and we’ve ruled out 5. Reconsider option 3 only if option 1 is blocked for a specific partner.
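
If option 3 ever becomes necessary for a specific partner, its shape is roughly the following (a sketch only; the service name, proxy software, and port are illustrative placeholders, not a plan):

# 1) On a tiny VM that keeps the partner-facing external IP, run any forward proxy
#    (e.g. a stock tinyproxy/squid container) listening on an internal port such as 3128.
# 2) Point the partner stack's outbound HTTPS at it via the standard proxy env vars:
docker service update pdp24_partner_django \
  --env-add HTTPS_PROXY=http://<proxy-vm-internal-ip>:3128 \
  --env-add NO_PROXY=localhost,127.0.0.1
# Note: HTTPS_PROXY routes all of the service's outbound HTTPS through the proxy unless the
# app scopes it; NO_PROXY exempts internal destinations that should keep using the NAT path.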

Pre-cutover checklist (per partner, run for pdp7, pdp10, and pdp24):

Precondition: the §2 verification procedure (“Verifying egress before partner outreach”) must have been run for non-prod, with the 4 verified IPs locked in. For prod, communicate the IPs we have today and follow up before prod cutover with any updates from the same verification run.

  1. Send the partner contact the full swarm NAT IP set across both environments (from §2). Communicating both envs at once minimizes partner change-control churn:
    • Non-production (4 IPs, verified live): 34.72.82.40, 34.122.251.101, 34.170.35.69, 35.194.23.180
    • Production (4 IPs, all verified live 2026-04-30): 34.132.72.45, 35.226.246.142, 34.136.40.107, 34.57.41.236
  2. Ask the partner to enumerate every IP of ours they currently allowlist — both prod and non-prod, if applicable. Capture this in the audit’s confirmed-partners table so we know which entries to clean up post-migration.
  3. Get written confirmation that all 8 swarm NAT IPs (or current best subset for prod) are in their allowlist before the corresponding env’s migration day.
  4. Keep existing per-partner IPs allowlisted on the partner side throughout the dual-allowlist window per §4. The window must extend through the last migration to complete (= prod, gated on HB-8232). Per-partner IPs to keep (only those the partner confirms in step 2 are actually in their allowlist):
    • pdp7 prod: 35.238.12.142 / non-prod: 35.232.30.74
    • pdp10 prod: 34.69.126.254 / non-prod: 34.134.208.190
    • pdp24 prod: 35.225.100.177 / non-prod: 34.30.125.50
  5. One week after prod cutover: ask the partner to remove the old IPs from their allowlist; release the corresponding google_compute_address.external resources in pdp{N}/main.tf (and non-prod equivalents if applicable) on our side.

Tableau Analytics — internal system that allowlists our partner IPs

The internal Tableau server hosting Analytics pages allowlists outbound IPs from pdp1, pdp5, and pdp14 so those partner stacks can fetch/render dashboards. After migration, those stacks egress through swarm NAT IPs and Tableau’s allowlist won’t match.

This is structurally identical to the partner allowlist pattern above — Tableau is “external” from the swarm’s perspective even though it’s internal to THB. Not managed by infrahive (zero references to tableau / Tableau / tab.thb in the repo); the tab_user DB user in crm_database is a read-role for Tableau to query CRM data, not the Tableau server itself.

The migration action (add swarm NAT IPs to Tableau’s allowlist, dual-allowlist window through prod cutover, verify Tableau → our Cloud SQL access still works post-migration) is tracked under HB-8623 (subtask of HB-8218).

Open questions for whoever owns Tableau (Analytics / DevOps): where does the Tableau server live (self-hosted VM? Tableau Cloud?), who owns its allowlist, and are there any other partner stacks beyond pdp1/5/14 that hit it?

3.6 Other (lower priority but worth verifying)

  • Cloudflare Tunnel / cloudflared config: src/tools/launchbot/templates/cloudflared/config.yml.tmpl. Tunnel is identity-based, not IP-based, but verify there’s no IP-pinned policy.
  • Datadog allowlists — outbound to Datadog via dd_agent. Datadog accepts traffic from any source; Datadog API key auths the call. No change.
  • Papertrail (syslog+tls://logs7.papertrailapp.com:23105) — same; no IP allowlist on Papertrail’s side.
  • Camunda / Keycloak outbound — internal, no external allowlist.
  • GitHub Actions deploy proxy (infra/common-infra/business_unit_1/shared/gh_actions_deploy_proxy.tf) — separate proxy, not impacted.

4. Cutover Sequence (handoff to HB-8218)

  1. Provision swarm Cloud NAT for prod (HB-8217 already done for non-prod; prod gated on HB-8232).
  2. Capture the NAT IPs per env via gcloud compute addresses list --filter='name~^ca-{n,p}-shared-base-spoke-us-central1-' --format='table(name,address)' against the corresponding network host project (full commands in §2). Use addresses list rather than routers nats describe here — describe returns address-resource URLs, not IP values. For defense-in-depth, also run routers nats describe to confirm those addresses are actually attached to the NAT, and routers get-nat-mapping-info to confirm workers map to them (per §2 verification procedure).
# Non-production
gcloud compute routers nats describe rn-n-shared-base-spoke-us-central1-egress \
  --router=cr-n-shared-base-spoke-us-central1-nat-router \
  --region=us-central1 \
  --project=prj-n-shared-base-cb89

# Production
gcloud compute routers nats describe rn-p-shared-base-spoke-us-central1-egress \
  --router=cr-p-shared-base-spoke-us-central1-nat-router \
  --region=us-central1 \
  --project=prj-p-shared-base-11f6
  3. Pre-add NAT IPs to all surfaces before any workload moves:
    • PD shared MSSQL authorized_networks (Terraform PR adding static entries to the concat in pd-infra/.../mssql.tf:110).
    • Azure SQL firewall: 4 rules per env (one per NAT IP) via add_az_sql_fw_rule.sh.
    • Cloudflare Access groups green_p_ip_bypass and yellow_p_ip_bypass: TF PR adding static entries.
    • Caddy VPN bypass: TF PR adding the swarm NAT IPs as static entries to the caddy_vpn_bypass_list callers. (No overlay during migration — see HB-8620 for the future overlay work.)
    • Partner systems: outreach in flight (longest-tail; start now).
  4. Validate (workers boot but stay drained until allowlists confirmed).
  5. Cut over workloads.
  6. Post-migration cleanup: remove now-stale per-VM rules.

Dual-allowlist window

Steps 3-6 deliberately leave both per-VM IPs and the new NAT IPs allowlisted for the duration of the cutover. This is necessary for zero-downtime migration but expands the allowlist by ~4 IPs per environment for the duration of the window. Treat this as a known temporary security posture:

  • Expected duration: hours to days for in-house surfaces (Cloud SQL, Azure SQL, Cloudflare); weeks for partner-side allowlists where partner change-control SLAs apply (most acute for §3.5 partner webhooks).
  • Highest-impact surfaces during the window: PD shared MSSQL (§3.1) and partner webhooks (§3.5) — both expose data paths, not just management.
  • Monitoring: before the window opens, enable connection logging on Cloud SQL (cloudsql.googleapis.com/postgres.log or equivalent) and Azure SQL audit logging filtered by client_ip. Watch for traffic from the new NAT IPs before cutover (should be zero) and unexpected continued traffic from per-VM IPs after cutover (signals incomplete migration of a workload). A Cloud SQL query sketch follows this list.
  • Cleanup commitment: step 6 is not optional. Each surface owner is on the hook for removing the stale per-VM entry within 1 week of cutover, tracked in HB-8218 subtasks.
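
A possible shape for the Cloud SQL side of that monitoring (a sketch, assuming the instance’s connection/audit logging is enabled per the monitoring bullet above; log contents and the exact filter depend on engine and logging flags):

# Any hits from the new prod NAT IPs before cutover, or from old per-VM IPs after it, warrant a look.
gcloud logging read \
  'resource.type="cloudsql_database"
   resource.labels.database_id:"pd-p-mssql"
   textPayload:("34.132.72.45" OR "35.226.246.142" OR "34.136.40.107" OR "34.57.41.236")' \
  --project=prj-bu1-p-pd-infra-b355 \
  --freshness=1d \
  --limit=20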

5. Open Questions

# Question Owner
1 Will the swarm cluster route HB↔︎PD via overlay network? Resolved. No — HB↔︎PD traffic continues over Cloudflare during the migration. The swarm NAT IPs must be added to caddy_vpn_bypass_list. Future overlay tracked in HB-8620.
2 Does production HB MSSQL exist? Resolved. Confirmed via PR #362: non-prod-only sandbox added for Benefits Hub testing; no prod equivalent.
3 Is the ADO agent VM (ado-agent-p-vm) included in this migration? Resolved. Not in scope — stays on its dedicated VM. The module.ado-agent-p-vm.instance_external_ip reference in the HA NSG SSH-2290 / SSH-22 rules (infra/ha-infra/business_unit_1/production/azure_devops.tf:87,105) needs no change.
4 Which partners allowlist our outbound IPs on their side? Partially resolved; it’s tribal knowledge. Confirmed (all verified live via curl ifconfig.me): pdp7 (35.238.12.142), pdp10 (34.69.126.254), and pdp24 (35.225.100.177) allowlist our outbound IP on their side (see §3.5). Slack outreach posted in #devops on 2026-04-30; awaiting confirmation that no others exist. PD partner success
5 What is the actual NAT IP resource name/path? Resolved. TF lives in thehelperbees/gcp-networks at modules/base_shared_vpc/nat.tf (called from envs/{prod,non-production}/boa_vpc_fw.tf). HB-8217 firewall rules are in modules/base_shared_vpc/firewall.tf:254-345. All 8 NAT IPs verified live 2026-04-30 (4 non-prod + 4 prod, us-central1) — see §2 table. us-west1 NAT IPs exist but are out of scope (swarm is us-central1-only).

6. Action Item Traceability

Every migration action this audit identifies maps to an existing Jira ticket. Comments noted below were posted to the corresponding ticket on 2026-04-30 to add audit-derived scope detail that wasn’t in the original ticket description.

Audit § Action Jira ticket Notes
§2 Provision swarm Cloud NAT (prod) HB-8232 Gates prod migration
§2 NAT IP set composition verification (3-check procedure) HB-8417 See comment 2026-04-30 distinguishing this from connectivity testing
§2 Connectivity test script HB-8211 In Progress
§3.1 PD shared MSSQL authorized_networks HB-8412
§3.1 Remove hb-n-vm entry from sandbox MSSQL HB-8412 Folded in via comment 2026-04-30
§3.2 Azure SQL firewall (prod + non-prod) HB-8411
§3.3 row A pd-infra Cloudflare Access groups HB-8414
§3.3 row B hb-infra cloudflare-access + eligibility (local.pd_addresses) HB-8414 Folded in via comment 2026-04-30 — original scope only mentioned pd-infra
§3.4 Caddy VPN bypass list HB-8413
§3.4 future HB↔︎PD overlay network HB-8620 Parented to HB-6748 (DevOps KTLO)
§3.5 Notify pdp7 / pdp10 / pdp24 (both envs) HB-8415 Concrete list + checklist added via comment 2026-04-30
§3.5 Tableau Analytics allowlist (pdp1/pdp5/pdp14 → Tableau, plus reverse-direction Cloud SQL/Cloudflare verification) HB-8623 Created 2026-04-30
§3.5 Per-partner GCE address cleanup HB-8239 Implicit (decommission also removes address resources)
§4 Pre-add NAT IPs to all surfaces HB-8411 / 8412 / 8413 / 8414 / 8415
§4 Dual-allowlist window (do NOT remove old per-VM IPs during cutover) HB-8416
§4 Cut over workloads HB-8232 + per-app migration tickets
§4 Post-migration cleanup HB-8239
Cross-cutting Worker SA cloudsql.client grants HB-8590 Parented to HB-8216 (worker MIG module)
Parent Allowlist updates (umbrella) HB-8218 Description updated 2026-04-30 to fix 2-IP/4-IP staleness and link audit
Parent Swarm Consolidation Epic HB-8200

7. References
