Status: Draft | Author: DevOps Team | Created: 2026-02-03 | Updated: 2026-02-03
Consolidate 98 isolated Docker Swarm VMs across staging and production environments into two unified swarms (one per environment) with auto-scaling via GCP managed instance groups (MIGs). Current infrastructure runs at ~7% CPU and ~40% memory utilization, representing massive over-provisioning.
Total organizational impact:

- **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
- 80% reduction in management overhead (98 VMs → 20 nodes)
- 8-month payback, $214k+ savings over 3 years
| Environment | Current VMs | Current Cost | Proposed Nodes | Proposed Cost | Monthly Savings |
|---|---|---|---|---|---|
| Staging | 45 | $3,312 | 9 | $1,221 | $2,091 (63%) |
| Production | 53 | $4,028 | 11 | $1,604 | $2,424 (60%) |
| **Total** | **98** | **$7,340** | **20** | **$2,825** | **$4,515 (62%)** |
Analysis covers 3 GCP projects:

- `prj-bu1-n-hb-infra-5381` - HB Infrastructure (2 running VMs)
- `prj-bu1-n-pd-infra-fee5` - PD Infrastructure (43 running VMs)
- `prj-bu1-n-bees-infra` - Bees Infrastructure (0 running VMs)
| Metric | Provisioned | Actual Usage | Utilization |
|---|---|---|---|
| vCPUs | 98 | ~7 | 7.1% |
| Memory | 399 GB | ~161 GB | 40.3% |
| Monthly Cost | $3,312 | - | - |
Analysis covers 3 GCP projects:

- `prj-bu1-p-hb-infra-1da6` - HB Infrastructure (10 running VMs)
- `prj-bu1-p-pd-infra-b355` - PD Infrastructure (42 running VMs)
- `prj-bu1-p-bees-infra-8bed` - Bees Infrastructure (1 running VM)
| Metric | Provisioned | Estimated Usage* | Utilization |
|---|---|---|---|
| vCPUs | 125 | ~9 | ~7% |
| Memory | 470 GB | ~189 GB | ~40% |
| Monthly Cost | $4,028 | - | - |

*Estimated based on staging utilization patterns; the two architectures mirror each other.
| Metric | Staging | Production | Total |
|---|---|---|---|
| Running VMs | 45 | 53 | 98 |
| vCPUs Provisioned | 98 | 125 | 223 |
| Memory Provisioned | 399 GB | 470 GB | 869 GB |
| Monthly Cost | $3,312 | $4,028 | $7,340 |
| Annual Cost | $39,744 | $48,336 | $88,080 |
The current architecture deploys each Docker workload on a dedicated VM, creating 98 separate single-node swarms. This approach suffers from:

- **Resource fragmentation**: 223 vCPUs provisioned but only ~16 used; 869 GB memory provisioned but only ~350 GB used
- **Per-VM overhead**: ~98 GB wasted on OS and monitoring agents across all VMs
- **Operational complexity**: 98 separate management targets, monitoring configs, and patching cycles
- **No resource sharing**: underutilized VMs cannot lend resources to busy ones
- **Duplicated effort**: the same problems are solved twice (staging and production managed separately)
Two separate consolidated swarms (staging and production remain isolated):

```mermaid
graph TB
    subgraph "Staging Swarm (Non-Production)"
        subgraph "Staging Managers"
            SM1[Manager 1 e2-medium]
            SM2[Manager 2 e2-medium]
            SM3[Manager 3 e2-medium]
        end
        subgraph "Staging Workers - MIG (6-9)"
            SW1[Worker 1-6 n2-highmem-4]
            SWN[Workers 7-9 Auto-scale]
        end
        SM1 <--> SM2 <--> SM3 <--> SM1
        SM1 --> SW1
        SM2 -.-> SWN
    end
    subgraph "Production Swarm"
        subgraph "Production Managers"
            PM1[Manager 1 e2-medium]
            PM2[Manager 2 e2-medium]
            PM3[Manager 3 e2-medium]
        end
        subgraph "Production Workers - MIG (8-12)"
            PW1[Worker 1-8 n2-highmem-4]
            PWN[Workers 9-12 Auto-scale]
        end
        PM1 <--> PM2 <--> PM3 <--> PM1
        PM1 --> PW1
        PM2 -.-> PWN
    end
```
Staging node specification:

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 6 | n2-highmem-4 | 24 | 192 GB | Container workloads |
| Worker Nodes (max) | 9 | n2-highmem-4 | 36 | 288 GB | Peak/burst capacity |
Production node specification:

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 8 | n2-highmem-4 | 32 | 256 GB | Container workloads |
| Worker Nodes (max) | 12 | n2-highmem-4 | 48 | 384 GB | Peak/burst capacity |
**Why n2-highmem-4**: Memory is the binding constraint (~40% utilization vs ~7% CPU). The n2-highmem-4 provides 8 GB per vCPU at ~$5.97/GB of memory, the best cost ratio for memory-constrained workloads.
| Parameter | Staging | Production |
|---|---|---|
| Minimum instances | 6 (192 GB, 119% of usage) | 8 (256 GB, 135% of usage) |
| Maximum instances | 9 (288 GB, 179% of usage) | 12 (384 GB, 203% of usage) |
| Scale-up trigger | Memory > 70% for 3 min | Memory > 70% for 3 min |
| Scale-down trigger | Memory < 40% for 15 min | Memory < 40% for 15 min |
| Cool-down period | 5 minutes | 5 minutes |
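As a sanity check, the headroom percentages above can be re-derived from the measured memory figures (~161 GB staging, ~189 GB production) and the 32 GB of each n2-highmem-4 worker. This is a back-of-the-envelope sketch, not part of the deployment tooling:

```python
# Verify MIG capacity headroom against measured memory usage.
# Usage figures come from the utilization tables; each n2-highmem-4
# worker contributes 32 GB.
GB_PER_WORKER = 32

envs = {
    # name: (measured memory GB, min workers, max workers)
    "staging": (161, 6, 9),
    "production": (189, 8, 12),
}

for name, (used_gb, lo, hi) in envs.items():
    min_cap = lo * GB_PER_WORKER
    max_cap = hi * GB_PER_WORKER
    print(f"{name}: min {min_cap} GB ({min_cap / used_gb:.0%} of usage), "
          f"max {max_cap} GB ({max_cap / used_gb:.0%} of usage)")
```

The minimums deliberately leave only modest headroom (119% / 135%) because the MIG scales up on the 70% memory trigger before the base capacity is exhausted.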
| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (45 VMs) | $3,312 | - | - | - |
| Consolidated (on-demand) | $1,221 | $2,091 | $25,095 | 63% |
| Consolidated + 3yr CUD | $769 | $2,543 | $30,515 | 77% |
| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 6 | $191.26 | $1,148 |
| **Total (base config)** | **9 nodes** | | **$1,221** |
| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (53 VMs) | $4,028 | - | - | - |
| Consolidated (on-demand) | $1,604 | $2,424 | $29,088 | 60% |
| Consolidated + 3yr CUD | $1,011 | $3,017 | $36,204 | 75% |
| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 8 | $191.26 | $1,530 |
| **Total (base config)** | **11 nodes** | | **$1,604** |
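The base-config totals follow directly from the unit prices; a minimal sketch using the $24.53 (e2-medium) and $191.26 (n2-highmem-4) monthly prices quoted above:

```python
# Re-derive base-config monthly costs from the per-node unit prices.
MANAGER = 24.53   # e2-medium, $/month
WORKER = 191.26   # n2-highmem-4, $/month

def swarm_cost(workers: int, managers: int = 3) -> float:
    """Monthly on-demand cost for one swarm's base configuration."""
    return managers * MANAGER + workers * WORKER

staging = swarm_cost(6)      # 1221.15 -> $1,221
production = swarm_cost(8)   # 1603.67 -> $1,604
print(round(staging), round(production), round(staging + production))
```

Note the $1 gap between the rounded rows ($74 + $1,148 = $1,222) and the $1,221 total: the rows are rounded individually, while the total is rounded from the exact sum.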
| Scenario | Current | Proposed | Monthly Savings | Annual Savings |
|---|---|---|---|---|
| On-demand | $7,340 | $2,825 | $4,515 | $54,180 |
| With 3yr CUD | $7,340 | $1,780 | $5,560 | $66,720 |
**Team context**: 3 DevOps engineers at $150k/year average salary ($72/hr, or ~$108/hr fully loaded with benefits/overhead).
| Investment | Cost |
|---|---|
| Migration effort (3 engineers × 50% × 8 weeks × $108/hr) | $51,840 |
| Parallel running period (1.5 months × $7,340) | $11,010 |
| **Total Investment** | **$62,850** |
Timeline optimized to 8 weeks by leveraging shared learnings
between staging and production.
| Category | Staging | Production | Combined |
|---|---|---|---|
| Infrastructure (on-demand) | $25,095 | $29,088 | $54,183 |
| Operational time savings* | $17,280 | $20,736 | $38,016 |
| **Total Annual Savings** | **$42,375** | **$49,824** | **$92,199** |
*Operational savings based on an 80% reduction in maintenance burden:

- Staging: 45 VMs → 9 nodes, ~160 hrs/year saved
- Production: 53 VMs → 11 nodes, ~192 hrs/year saved
- Combined: 98 VMs → 20 nodes, ~352 hrs/year saved at $108/hr
| Metric | Value |
|---|---|
| Total Investment | $62,850 |
| Annual Savings (on-demand) | $92,199 |
| Payback Period | 8 months |
| Year 1 Net Savings | +$29,349 |
| Year 3 Cumulative Savings | $213,747 |
| Year 3 with CUD | $251,427 |
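The ROI figures can be reproduced from the inputs above. The sketch assumes a 40-hour work week (which is what makes 3 engineers × 50% × 8 weeks equal 480 person-hours) and uses the per-environment annual infrastructure figures ($25,095 + $29,088 = $54,183; the combined table's $54,180 differs by $3 of rounding):

```python
# Re-derive investment, payback, and net savings.
RATE = 108                              # fully loaded $/hr
migration = 3 * 0.5 * 8 * 40 * RATE     # 480 person-hours -> $51,840
parallel = 1.5 * 7_340                  # parallel running  -> $11,010
investment = migration + parallel       # $62,850

annual_savings = 54_183 + 38_016        # infra + operational = $92,199

payback_months = investment / (annual_savings / 12)   # ~8.2 months
year1_net = annual_savings - investment               # $29,349
year3_net = 3 * annual_savings - investment           # $213,747
print(round(payback_months, 1), year1_net, year3_net)
```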
**Timeline**: 8-10 weeks (3 DevOps engineers at ~50% allocation = 480 person-hours)

Staging build-out:

- Provision 3 manager nodes (e2-medium) in separate zones
- Configure worker MIG with n2-highmem-4 template
- Set up internal load balancer for swarm ingress
- Deploy centralized Prometheus/Grafana stack
- Deploy test workloads to validate networking and storage
- Verify auto-scaling triggers (memory-based)

Staging migration:

- Migrate staging workloads (low risk, fast iteration)
- Document runbooks based on learnings
- Keep source VMs stopped for rollback

Production build-out:

- Apply learnings from staging
- Provision production swarm (3 managers, 8 workers)
- Configure with production-grade monitoring
- Establish separate MIG auto-scaling policies

Production migration:

- Migrate non-critical production workloads first
- Migrate standard workloads
- Migrate high-memory workloads with dedicated node labels
- Validate application behavior under production load
- Decommission source VMs after 30-day validation

Optimization:

- Analyze actual utilization in consolidated environments
- Adjust MIG sizing based on observed patterns
- Evaluate 3-year CUD after 6 months of stable operation
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration disruption | Medium | Medium | Phased rollout, instant rollback via DNS |
| Noisy neighbor issues | Low | Medium | Resource limits, node labels for isolation |
| Single cluster failure | Low | High | 3 managers, multi-zone deployment per env |
| Workload conflicts | Low | Medium | Testing period, gradual migration |
| Cross-env contamination | N/A | N/A | Separate swarms maintain isolation |
- Keep old VMs stopped for 30 days post-migration
- DNS-based traffic shifting enables instant rollback
- Maintain deployment scripts for old architecture
- Staging serves as canary for production changes
**Proceed with full consolidation of both environments.** The data is unambiguous:

- Paying for 223 vCPUs, using ~16
- Paying for 869 GB of memory, using ~350 GB
- Managing 98 VMs when 20 nodes would suffice
- **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
- Payback in ~8 months, $214k+ savings over 3 years
| Factor | Consolidate | Do Nothing | Winner |
|---|---|---|---|
| 3-year savings | $213,747 | $0 | Consolidate |
| Payback period | 8 months | N/A | Consolidate |
| Management overhead | 20 nodes | 98 VMs | Consolidate |
| Implementation effort | 480 hours | 0 hours | Do Nothing |
| Operational consistency | High | Low | Consolidate |
| Scenario | Investment | Infra Cost | Total | vs Do Nothing |
|---|---|---|---|---|
| Do Nothing | $0 | $264,240 | $264,240 | - |
| Both Environments | $62,850 | $101,700 | $164,550 | -$99,690 |

Infra: 36 months × monthly cost.
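The 36-month comparison reduces to three multiplications; a quick check against the monthly costs quoted earlier ($7,340 current, $2,825 consolidated):

```python
# Re-derive the 36-month total cost of ownership comparison.
MONTHS = 36
do_nothing = MONTHS * 7_340                 # keep current infra: $264,240
consolidated = 62_850 + MONTHS * 2_825      # investment + infra: $164,550
print(do_nothing, consolidated, consolidated - do_nothing)
```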
| Metric | Staging Target | Production Target |
|---|---|---|
| Monthly cost | < $1,500 | < $2,000 |
| Memory utilization | 50-70% | 50-70% |
| Deployment success rate | > 99% | > 99.9% |
| Incident count | < baseline | < baseline |
| Migration completion | 100% | 100% |
| Decision | Recommendation | Rationale |
|---|---|---|
| Worker node type | n2-highmem-4 | Best $/GB for memory-constrained workloads |
| Staging workers (base) | 6 | 192 GB = 119% of actual usage |
| Production workers (base) | 8 | 256 GB = 135% of actual usage |
| Environment isolation | Separate swarms | Maintain staging/prod boundary |
| CUD commitment | Defer 6 months | Validate sizing before 3-year commitment |