RFC: Docker Swarm VM Consolidation

Status: Draft | Author: DevOps Team | Created: 2026-02-03 | Updated: 2026-02-03

Overview

Consolidate 98 isolated Docker Swarm VMs across staging and production environments into two unified swarms (one per environment) with MIG-based auto-scaling. Current infrastructure runs at ~7% CPU and ~40% memory utilization, representing massive over-provisioning.

Total Organizational Impact:

  • **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
  • 80% reduction in management overhead (98 VMs → 20 nodes)
  • 8-month payback, $214k+ savings over 3 years

Executive Summary

| Environment | Current VMs | Current Cost | Proposed Nodes | Proposed Cost | Monthly Savings |
|---|---|---|---|---|---|
| Staging | 45 | $3,312 | 9 | $1,221 | $2,091 (63%) |
| Production | 53 | $4,028 | 11 | $1,604 | $2,424 (60%) |
| Total | 98 | $7,340 | 20 | $2,825 | $4,515 (62%) |

Background

Staging Environment (Non-Production)

Analysis covers 3 GCP projects:

  • prj-bu1-n-hb-infra-5381 - HB Infrastructure (2 running VMs)
  • prj-bu1-n-pd-infra-fee5 - PD Infrastructure (43 running VMs)
  • prj-bu1-n-bees-infra - Bees Infrastructure (0 running VMs)

| Metric | Provisioned | Actual Usage | Utilization |
|---|---|---|---|
| vCPUs | 98 | ~7 | 7.1% |
| Memory | 399 GB | ~161 GB | 40.3% |
| Monthly Cost | $3,312 | - | - |

Production Environment

Analysis covers 3 GCP projects:

  • prj-bu1-p-hb-infra-1da6 - HB Infrastructure (10 running VMs)
  • prj-bu1-p-pd-infra-b355 - PD Infrastructure (42 running VMs)
  • prj-bu1-p-bees-infra-8bed - Bees Infrastructure (1 running VM)

| Metric | Provisioned | Estimated Usage* | Utilization |
|---|---|---|---|
| vCPUs | 125 | ~9 | ~7% |
| Memory | 470 GB | ~189 GB | ~40% |
| Monthly Cost | $4,028 | - | - |

*Estimated based on staging utilization patterns; architectures mirror each other.

Combined Current State

| Metric | Staging | Production | Total |
|---|---|---|---|
| Running VMs | 45 | 53 | 98 |
| vCPUs Provisioned | 98 | 125 | 223 |
| Memory Provisioned | 399 GB | 470 GB | 869 GB |
| Monthly Cost | $3,312 | $4,028 | $7,340 |
| Annual Cost | $39,744 | $48,336 | $88,080 |

The Problem

The current architecture deploys each Docker workload on a dedicated VM, creating 98 separate single-node swarms. This approach suffers from:

  • Resource fragmentation: 223 vCPUs provisioned but only ~16 used; 869 GB memory provisioned but only ~350 GB used
  • Per-VM overhead: ~98 GB wasted on OS and monitoring agents across all VMs
  • Operational complexity: 98 separate management targets, monitoring configs, and patching cycles
  • No resource sharing: Underutilized VMs cannot lend resources to busy ones
  • Duplicated effort: Same problems solved twice (staging and production managed separately)

Proposed Architecture

Two separate consolidated swarms (staging and production remain isolated):

```mermaid
graph TB
    subgraph "Staging Swarm (Non-Production)"
        subgraph "Staging Managers"
            SM1[Manager 1<br/>e2-medium]
            SM2[Manager 2<br/>e2-medium]
            SM3[Manager 3<br/>e2-medium]
        end
        subgraph "Staging Workers - MIG (6-9)"
            SW1[Worker 1-6<br/>n2-highmem-4]
            SWN[Workers 7-9<br/>Auto-scale]
        end
        SM1 <--> SM2 <--> SM3 <--> SM1
        SM1 --> SW1
        SM2 -.-> SWN
    end
    subgraph "Production Swarm"
        subgraph "Production Managers"
            PM1[Manager 1<br/>e2-medium]
            PM2[Manager 2<br/>e2-medium]
            PM3[Manager 3<br/>e2-medium]
        end
        subgraph "Production Workers - MIG (8-12)"
            PW1[Worker 1-8<br/>n2-highmem-4]
            PWN[Workers 9-12<br/>Auto-scale]
        end
        PM1 <--> PM2 <--> PM3 <--> PM1
        PM1 --> PW1
        PM2 -.-> PWN
    end
```
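Each swarm above is bootstrapped with standard Docker CLI commands. A minimal sketch, assuming Docker is already installed on every node and using placeholder IPs:

```shell
# On the first manager (placeholder address 10.0.0.11):
docker swarm init --advertise-addr 10.0.0.11

# Capture join tokens (run on any manager):
MANAGER_TOKEN=$(docker swarm join-token -q manager)
WORKER_TOKEN=$(docker swarm join-token -q worker)

# On managers 2 and 3:
docker swarm join --token "$MANAGER_TOKEN" 10.0.0.11:2377

# On each MIG worker (e.g. from the instance template's startup script):
docker swarm join --token "$WORKER_TOKEN" 10.0.0.11:2377
```

Because MIG instances are ephemeral, the worker join step belongs in the instance template's startup script so auto-scaled nodes join automatically.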

Node Specifications

Staging Swarm

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 6 | n2-highmem-4 | 24 | 192 GB | Container workloads |
| Worker Nodes (max) | 9 | n2-highmem-4 | 36 | 288 GB | Peak/burst capacity |

Production Swarm

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 8 | n2-highmem-4 | 32 | 256 GB | Container workloads |
| Worker Nodes (max) | 12 | n2-highmem-4 | 48 | 384 GB | Peak/burst capacity |

Why n2-highmem-4: Memory is the binding constraint (~40% utilization vs ~7% CPU). The n2-highmem-4 provides 8 GB/vCPU at $5.97/GB, the best cost ratio for memory-constrained workloads.
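The cost-per-GB claim can be checked directly from the unit price quoted in the cost tables below (n2-highmem-4: 4 vCPUs, 32 GB RAM, $191.26/month on-demand):

```python
# Sanity-check the $/GB figure using this RFC's quoted unit price.
monthly_cost = 191.26   # n2-highmem-4 on-demand, per the cost tables
memory_gb = 32          # n2-highmem-4 provides 8 GB per vCPU x 4 vCPUs

cost_per_gb = monthly_cost / memory_gb
print(f"${cost_per_gb:.2f}/GB-month")  # ~$5.98/GB, matching the cited ~$5.97
```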

Auto-scaling Configuration

| Parameter | Staging | Production |
|---|---|---|
| Minimum instances | 6 (192 GB, 119% of usage) | 8 (256 GB, 135% of usage) |
| Maximum instances | 9 (288 GB, 179% of usage) | 12 (384 GB, 203% of usage) |
| Scale-up trigger | Memory > 70% for 3 min | Memory > 70% for 3 min |
| Scale-down trigger | Memory < 40% for 15 min | Memory < 40% for 15 min |
| Cool-down period | 5 minutes | 5 minutes |
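One possible sketch of the staging policy, assuming the Ops Agent is installed on workers (so `agent.googleapis.com/memory/percent_used` is reported) and using placeholder MIG and region names:

```shell
# Memory-based autoscaling on the staging worker MIG (names are placeholders).
gcloud compute instance-groups managed set-autoscaling staging-swarm-workers \
  --region=us-central1 \
  --min-num-replicas=6 \
  --max-num-replicas=9 \
  --update-stackdriver-metric=agent.googleapis.com/memory/percent_used \
  --stackdriver-metric-utilization-target=70 \
  --stackdriver-metric-utilization-target-type=gauge \
  --cool-down-period=300
```

Note that the GCE autoscaler does not expose a separate scale-down threshold; it scales in toward the same utilization target after its stabilization window, so the "40% for 15 min" row would be approximated rather than configured literally.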

Cost Analysis

Staging Environment

| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (45 VMs) | $3,312 | - | - | - |
| Consolidated (on-demand) | $1,221 | $2,091 | $25,095 | 63% |
| Consolidated + 3yr CUD | $769 | $2,543 | $30,515 | 77% |

| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 6 | $191.26 | $1,148 |
| Total (base config) | 9 nodes | - | $1,221 |

Production Environment

| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (53 VMs) | $4,028 | - | - | - |
| Consolidated (on-demand) | $1,604 | $2,424 | $29,088 | 60% |
| Consolidated + 3yr CUD | $1,011 | $3,017 | $36,204 | 75% |

| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 8 | $191.26 | $1,530 |
| Total (base config) | 11 nodes | - | $1,604 |

Combined Infrastructure Savings

| Scenario | Current | Proposed | Monthly Savings | Annual Savings |
|---|---|---|---|---|
| On-demand | $7,340 | $2,825 | $4,515 | $54,180 |
| With 3yr CUD | $7,340 | $1,780 | $5,560 | $66,720 |

ROI Analysis

Team Context: 3 DevOps engineers at $150k/year average salary (~$72/hr, or ~$108/hr fully loaded with benefits/overhead).

One-Time Investment

| Investment | Cost |
|---|---|
| Migration effort (3 engineers × 50% × 8 weeks × $108/hr) | $51,840 |
| Parallel running period (1.5 months × $7,340) | $11,010 |
| Total Investment | $62,850 |
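The investment figures above follow directly from the stated rates, assuming a 40-hour work week:

```python
# Reproduce the one-time investment arithmetic from the table above.
hourly_rate = 108        # fully loaded $/hr
engineers = 3
allocation = 0.5         # 50% time
weeks = 8
hours_per_week = 40

migration_effort = engineers * allocation * weeks * hours_per_week * hourly_rate
parallel_running = 1.5 * 7_340   # 1.5 months of current monthly spend

print(migration_effort)                      # 51840.0
print(parallel_running)                      # 11010.0
print(migration_effort + parallel_running)   # 62850.0
```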

Timeline optimized to 8 weeks by leveraging shared learnings between staging and production.

Annual Savings

| Category | Staging | Production | Combined |
|---|---|---|---|
| Infrastructure (on-demand) | $25,095 | $29,088 | $54,183 |
| Operational time savings* | $17,280 | $20,736 | $38,016 |
| Total Annual Savings | $42,375 | $49,824 | $92,199 |

*Operational savings based on an 80% reduction in maintenance burden:

  • Staging: 45 VMs → 9 nodes, ~160 hrs/year saved
  • Production: 53 VMs → 11 nodes, ~192 hrs/year saved
  • Combined: 98 VMs → 20 nodes, ~352 hrs/year saved at $108/hr

Payback and 3-Year ROI

| Metric | Value |
|---|---|
| Total Investment | $62,850 |
| Annual Savings (on-demand) | $92,199 |
| Payback Period | 8 months |
| Year 1 Net Savings | +$29,349 |
| Year 3 Cumulative Savings | $213,747 |
| Year 3 with CUD | $251,427 |
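The payback and cumulative figures can be reproduced from the two inputs at the top of the table:

```python
# Check payback and cumulative-savings figures from the ROI table.
investment = 62_850
annual_savings = 92_199

payback_months = investment / annual_savings * 12
year1_net = annual_savings - investment
year3_cumulative = 3 * annual_savings - investment

print(round(payback_months, 1))  # 8.2 -> "8 months" when rounded down
print(year1_net)                 # 29349
print(year3_cumulative)          # 213747
```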

Implementation Plan

Timeline: 8-10 weeks (3 DevOps engineers at ~50% allocation = 480 person-hours)

Phase 1: Staging Infrastructure (Week 1-2)

  • Provision 3 manager nodes (e2-medium) in separate zones
  • Configure worker MIG with n2-highmem-4 template
  • Set up internal load balancer for swarm ingress
  • Deploy centralized Prometheus/Grafana stack
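The template and MIG steps above might look like the following sketch; the names, region, and base image are placeholders to be replaced with the project's conventions:

```shell
# Worker instance template for the staging swarm (illustrative values).
gcloud compute instance-templates create staging-swarm-worker-tmpl \
  --machine-type=n2-highmem-4 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud

# Regional MIG spread across zones, starting at the base size of 6.
gcloud compute instance-groups managed create staging-swarm-workers \
  --region=us-central1 \
  --template=staging-swarm-worker-tmpl \
  --size=6
```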

Phase 2: Staging Validation & Migration (Week 3-5)

  • Deploy test workloads to validate networking and storage
  • Verify auto-scaling triggers (memory-based)
  • Migrate staging workloads (low risk, fast iteration)
  • Document runbooks based on learnings
  • Keep source VMs stopped for rollback

Phase 3: Production Infrastructure (Week 6-7)

  • Apply learnings from staging
  • Provision production swarm (3 managers, 8 workers)
  • Configure with production-grade monitoring
  • Establish separate MIG auto-scaling policies

Phase 4: Production Migration (Week 8-10)

  • Migrate non-critical production workloads first
  • Migrate standard workloads
  • Migrate high-memory workloads with dedicated node labels
  • Validate application behavior under production load
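The node-label step can be sketched with standard Docker Swarm commands; node, service, and registry names here are placeholders:

```shell
# Reserve a node for high-memory workloads via a label.
docker node update --label-add workload=highmem prod-worker-3

# Pin the service to labeled nodes and cap its per-task resources.
docker service create \
  --name analytics-api \
  --constraint 'node.labels.workload == highmem' \
  --reserve-memory 8g \
  --limit-memory 12g \
  --replicas 2 \
  registry.example.com/analytics-api:stable
```

Reservations keep the scheduler from over-packing a node; limits contain noisy neighbors, which also backs the mitigation listed in the risk table below.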

Phase 5: Decommission & Optimization (Week 11-12+)

  • Decommission source VMs after 30-day validation
  • Analyze actual utilization in consolidated environments
  • Adjust MIG sizing based on observed patterns
  • Evaluate 3-year CUD after 6 months of stable operation

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration disruption | Medium | Medium | Phased rollout, instant rollback via DNS |
| Noisy neighbor issues | Low | Medium | Resource limits, node labels for isolation |
| Single cluster failure | Low | High | 3 managers, multi-zone deployment per env |
| Workload conflicts | Low | Medium | Testing period, gradual migration |
| Cross-env contamination | N/A | N/A | Separate swarms maintain isolation |

Rollback Strategy

  • Keep old VMs stopped for 30 days post-migration
  • DNS-based traffic shifting enables instant rollback
  • Maintain deployment scripts for old architecture
  • Staging serves as canary for production changes

Decision

Recommendation

Proceed with full consolidation of both environments. The data is unambiguous:

  • Paying for 223 vCPUs, using ~16
  • Paying for 869 GB memory, using ~350 GB
  • Managing 98 VMs when 20 nodes would suffice
  • **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
  • Payback in ~8 months, $214k+ savings over 3 years

Decision Matrix

| Factor | Consolidate | Do Nothing | Winner |
|---|---|---|---|
| 3-year savings | $213,747 | $0 | Consolidate |
| Payback period | 8 months | N/A | Consolidate |
| Management overhead | 20 nodes | 98 VMs | Consolidate |
| Implementation effort | 480 hours | 0 hours | Do Nothing |
| Operational consistency | High | Low | Consolidate |

3-Year Total Cost Comparison

| Scenario | Investment | Infra Cost | Total | vs Do Nothing |
|---|---|---|---|---|
| Do Nothing | $0 | $264,240 | $264,240 | - |
| Both Environments | $62,850 | $101,700 | $164,550 | -$99,690 |

Infra: 36 months × monthly cost.
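The comparison reduces to straightforward arithmetic on figures already established above:

```python
# Reproduce the 3-year total-cost comparison (36 months of infra spend).
months = 36
do_nothing = months * 7_340                # current $7,340/month
consolidated = 62_850 + months * 2_825     # investment + $2,825/month

print(do_nothing)                 # 264240
print(consolidated)               # 164550
print(do_nothing - consolidated)  # 99690
```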

Success Criteria

| Metric | Staging Target | Production Target |
|---|---|---|
| Monthly cost | < $1,500 | < $2,000 |
| Memory utilization | 50-70% | 50-70% |
| Deployment success rate | > 99% | > 99.9% |
| Incident count | < baseline | < baseline |
| Migration completion | 100% | 100% |

Key Decision Points

| Decision | Recommendation | Rationale |
|---|---|---|
| Worker node type | n2-highmem-4 | Best $/GB for memory-constrained workloads |
| Staging workers (base) | 6 | 192 GB = 119% of actual usage |
| Production workers (base) | 8 | 256 GB = 135% of actual usage |
| Environment isolation | Separate swarms | Maintain staging/prod boundary |
| CUD commitment | Defer 6 months | Validate sizing before 3-year commitment |