RFC: Docker Swarm VM Consolidation

Status: Draft | Author: DevOps Team | Created: 2026-02-03 | Updated: 2026-02-03

Overview

Consolidate 98 isolated Docker Swarm VMs across staging and production environments into two unified swarms (one per environment) with MIG-based auto-scaling. Current infrastructure runs at ~7% CPU and ~40% memory utilization, representing massive over-provisioning.

Total Organizational Impact:

  • **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
  • 80% reduction in management overhead (98 VMs → 20 nodes)
  • 8-month payback, $214k+ savings over 3 years

Executive Summary

| Environment | Current VMs | Current Cost | Proposed Nodes | Proposed Cost | Monthly Savings |
|---|---|---|---|---|---|
| Staging | 45 | $3,312 | 9 | $1,221 | $2,091 (63%) |
| Production | 53 | $4,028 | 11 | $1,604 | $2,424 (60%) |
| Total | 98 | $7,340 | 20 | $2,825 | $4,515 (62%) |

Background

Staging Environment (Non-Production)

Analysis covers 3 GCP projects:

  • prj-bu1-n-hb-infra-5381 - HB Infrastructure (2 running VMs)
  • prj-bu1-n-pd-infra-fee5 - PD Infrastructure (43 running VMs)
  • prj-bu1-n-bees-infra - Bees Infrastructure (0 running VMs)

| Metric | Provisioned | Actual Usage | Utilization |
|---|---|---|---|
| vCPUs | 98 | ~7 | 7.1% |
| Memory | 399 GB | ~161 GB | 40.3% |
| Monthly Cost | $3,312 | - | - |

Production Environment

Analysis covers 3 GCP projects:

  • prj-bu1-p-hb-infra-1da6 - HB Infrastructure (10 running VMs)
  • prj-bu1-p-pd-infra-b355 - PD Infrastructure (42 running VMs)
  • prj-bu1-p-bees-infra-8bed - Bees Infrastructure (1 running VM)

| Metric | Provisioned | Estimated Usage* | Utilization |
|---|---|---|---|
| vCPUs | 125 | ~9 | ~7% |
| Memory | 470 GB | ~189 GB | ~40% |
| Monthly Cost | $4,028 | - | - |

*Estimated based on staging utilization patterns; architectures mirror each other.

Combined Current State

| Metric | Staging | Production | Total |
|---|---|---|---|
| Running VMs | 45 | 53 | 98 |
| vCPUs Provisioned | 98 | 125 | 223 |
| Memory Provisioned | 399 GB | 470 GB | 869 GB |
| Monthly Cost | $3,312 | $4,028 | $7,340 |
| Annual Cost | $39,744 | $48,336 | $88,080 |

The Problem

The current architecture deploys each Docker workload on a dedicated VM, creating 98 separate single-node swarms. This approach suffers from:

  • Resource fragmentation: 223 vCPUs provisioned but only ~16 used; 869 GB memory provisioned but only ~350 GB used
  • Per-VM overhead: ~98 GB wasted on OS and monitoring agents across all VMs
  • Operational complexity: 98 separate management targets, monitoring configs, and patching cycles
  • No resource sharing: Underutilized VMs cannot lend resources to busy ones
  • Duplicated effort: Same problems solved twice (staging and production managed separately)

Proposed Architecture

Two separate consolidated swarms (staging and production remain isolated):

```mermaid
graph TB
    subgraph "Staging Swarm (Non-Production)"
        subgraph "Staging Managers"
            SM1[Manager 1<br/>e2-medium]
            SM2[Manager 2<br/>e2-medium]
            SM3[Manager 3<br/>e2-medium]
        end
        subgraph "Staging Workers - MIG (6-9)"
            SW1[Worker 1-6<br/>n2-highmem-4]
            SWN[Workers 7-9<br/>Auto-scale]
        end
        SM1 <--> SM2 <--> SM3 <--> SM1
        SM1 --> SW1
        SM2 -.-> SWN
    end
    subgraph "Production Swarm"
        subgraph "Production Managers"
            PM1[Manager 1<br/>e2-medium]
            PM2[Manager 2<br/>e2-medium]
            PM3[Manager 3<br/>e2-medium]
        end
        subgraph "Production Workers - MIG (8-12)"
            PW1[Worker 1-8<br/>n2-highmem-4]
            PWN[Workers 9-12<br/>Auto-scale]
        end
        PM1 <--> PM2 <--> PM3 <--> PM1
        PM1 --> PW1
        PM2 -.-> PWN
    end
```
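Each swarm above is bootstrapped with standard Docker CLI commands. A minimal sketch, assuming Docker is already installed on every node and using placeholder IPs:

```shell
# On the first manager (placeholder address 10.0.0.11):
docker swarm init --advertise-addr 10.0.0.11

# Capture join tokens (run on any manager):
MANAGER_TOKEN=$(docker swarm join-token -q manager)
WORKER_TOKEN=$(docker swarm join-token -q worker)

# On managers 2 and 3:
docker swarm join --token "$MANAGER_TOKEN" 10.0.0.11:2377

# On each MIG worker (e.g. from the instance template's startup script):
docker swarm join --token "$WORKER_TOKEN" 10.0.0.11:2377
```

Because MIG instances are ephemeral, the worker join step belongs in the instance template's startup script so auto-scaled nodes join automatically.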

Node Specifications

Staging Swarm

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 6 | n2-highmem-4 | 24 | 192 GB | Container workloads |
| Worker Nodes (max) | 9 | n2-highmem-4 | 36 | 288 GB | Peak/burst capacity |

Production Swarm

| Component | Count | Type | vCPUs | Memory | Purpose |
|---|---|---|---|---|---|
| Manager Nodes | 3 | e2-medium | 6 | 12 GB | HA Raft consensus, scheduling |
| Worker Nodes (base) | 8 | n2-highmem-4 | 32 | 256 GB | Container workloads |
| Worker Nodes (max) | 12 | n2-highmem-4 | 48 | 384 GB | Peak/burst capacity |

Why n2-highmem-4: Memory is the binding constraint (~40% utilization vs ~7% CPU). The n2-highmem-4 provides 8 GB/vCPU at $5.97/GB, the best cost ratio for memory-constrained workloads.
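The cost-per-GB claim can be checked directly from the unit price quoted in the cost tables below (n2-highmem-4: 4 vCPUs, 32 GB RAM, $191.26/month on-demand):

```python
# Sanity-check the $/GB figure using this RFC's quoted unit price.
monthly_cost = 191.26   # n2-highmem-4 on-demand, per the cost tables
memory_gb = 32          # n2-highmem-4 provides 8 GB per vCPU x 4 vCPUs

cost_per_gb = monthly_cost / memory_gb
print(f"${cost_per_gb:.2f}/GB-month")  # ~$5.98/GB, matching the cited ~$5.97
```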

Auto-scaling Configuration

| Parameter | Staging | Production |
|---|---|---|
| Minimum instances | 6 (192 GB, 119% of usage) | 8 (256 GB, 135% of usage) |
| Maximum instances | 9 (288 GB, 179% of usage) | 12 (384 GB, 203% of usage) |
| Scale-up trigger | Memory > 70% for 3 min | Memory > 70% for 3 min |
| Scale-down trigger | Memory < 40% for 15 min | Memory < 40% for 15 min |
| Cool-down period | 5 minutes | 5 minutes |
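One possible sketch of the staging policy, assuming the Ops Agent is installed on workers (so `agent.googleapis.com/memory/percent_used` is reported) and using placeholder MIG and region names:

```shell
# Memory-based autoscaling on the staging worker MIG (names are placeholders).
gcloud compute instance-groups managed set-autoscaling staging-swarm-workers \
  --region=us-central1 \
  --min-num-replicas=6 \
  --max-num-replicas=9 \
  --update-stackdriver-metric=agent.googleapis.com/memory/percent_used \
  --stackdriver-metric-utilization-target=70 \
  --stackdriver-metric-utilization-target-type=gauge \
  --cool-down-period=300
```

Note that the GCE autoscaler does not expose a separate scale-down threshold; it scales in toward the same utilization target after its stabilization window, so the "40% for 15 min" row would be approximated rather than configured literally.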

Cost Analysis

Staging Environment

| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (45 VMs) | $3,312 | - | - | - |
| Consolidated (on-demand) | $1,221 | $2,091 | $25,095 | 63% |
| Consolidated + 3yr CUD | $769 | $2,543 | $30,515 | 77% |

| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 6 | $191.26 | $1,148 |
| Total (base config) | 9 nodes | - | $1,221 |

Production Environment

| Scenario | Monthly Cost | Monthly Savings | Annual Savings | % Reduction |
|---|---|---|---|---|
| Current (53 VMs) | $4,028 | - | - | - |
| Consolidated (on-demand) | $1,604 | $2,424 | $29,088 | 60% |
| Consolidated + 3yr CUD | $1,011 | $3,017 | $36,204 | 75% |

| Component | Quantity | Unit Cost | Monthly Cost |
|---|---|---|---|
| Manager Nodes (e2-medium) | 3 | $24.53 | $74 |
| Worker Nodes (n2-highmem-4) | 8 | $191.26 | $1,530 |
| Total (base config) | 11 nodes | - | $1,604 |

Combined Infrastructure Savings

| Scenario | Current | Proposed | Monthly Savings | Annual Savings |
|---|---|---|---|---|
| On-demand | $7,340 | $2,825 | $4,515 | $54,180 |
| With 3yr CUD | $7,340 | $1,780 | $5,560 | $66,720 |

ROI Analysis

Team Context: 3 DevOps engineers at $150k/year average salary (~$72/hr, or ~$108/hr fully loaded with benefits/overhead).

One-Time Investment

| Investment | Cost |
|---|---|
| Migration effort (3 engineers × 50% × 8 weeks × $108/hr) | $51,840 |
| Parallel running period (1.5 months × $7,340) | $11,010 |
| Total Investment | $62,850 |
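The investment figures above follow directly from the stated rates, assuming a 40-hour work week:

```python
# Reproduce the one-time investment arithmetic from the table above.
hourly_rate = 108        # fully loaded $/hr
engineers = 3
allocation = 0.5         # 50% time
weeks = 8
hours_per_week = 40

migration_effort = engineers * allocation * weeks * hours_per_week * hourly_rate
parallel_running = 1.5 * 7_340   # 1.5 months of current monthly spend

print(migration_effort)                      # 51840.0
print(parallel_running)                      # 11010.0
print(migration_effort + parallel_running)   # 62850.0
```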

Timeline optimized to 8 weeks by leveraging shared learnings between staging and production.

Annual Savings

| Category | Staging | Production | Combined |
|---|---|---|---|
| Infrastructure (on-demand) | $25,095 | $29,088 | $54,183 |
| Operational time savings* | $17,280 | $20,736 | $38,016 |
| Total Annual Savings | $42,375 | $49,824 | $92,199 |

*Operational savings based on an 80% reduction in maintenance burden:

  • Staging: 45 VMs → 9 nodes, ~160 hrs/year saved
  • Production: 53 VMs → 11 nodes, ~192 hrs/year saved
  • Combined: 98 VMs → 20 nodes, ~352 hrs/year saved at $108/hr

Payback and 3-Year ROI

| Metric | Value |
|---|---|
| Total Investment | $62,850 |
| Annual Savings (on-demand) | $92,199 |
| Payback Period | 8 months |
| Year 1 Net Savings | +$29,349 |
| Year 3 Cumulative Savings | $213,747 |
| Year 3 with CUD | $251,427 |
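The payback and cumulative figures can be reproduced from the two inputs at the top of the table:

```python
# Check payback and cumulative-savings figures from the ROI table.
investment = 62_850
annual_savings = 92_199

payback_months = investment / annual_savings * 12
year1_net = annual_savings - investment
year3_cumulative = 3 * annual_savings - investment

print(round(payback_months, 1))  # 8.2 -> "8 months" when rounded down
print(year1_net)                 # 29349
print(year3_cumulative)          # 213747
```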

Implementation Plan

Timeline: 8-10 weeks (3 DevOps engineers at ~50% allocation = 480 person-hours)

Phase 1: Staging Infrastructure (Week 1-2)

  • Provision 3 manager nodes (e2-medium) in separate zones
  • Configure worker MIG with n2-highmem-4 template
  • Set up internal load balancer for swarm ingress
  • Deploy centralized Prometheus/Grafana stack
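The template and MIG steps above might look like the following sketch; the names, region, and base image are placeholders to be replaced with the project's conventions:

```shell
# Worker instance template for the staging swarm (illustrative values).
gcloud compute instance-templates create staging-swarm-worker-tmpl \
  --machine-type=n2-highmem-4 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud

# Regional MIG spread across zones, starting at the base size of 6.
gcloud compute instance-groups managed create staging-swarm-workers \
  --region=us-central1 \
  --template=staging-swarm-worker-tmpl \
  --size=6
```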

Phase 2: Staging Validation & Migration (Week 3-5)

  • Deploy test workloads to validate networking and storage
  • Verify auto-scaling triggers (memory-based)
  • Migrate staging workloads (low risk, fast iteration)
  • Document runbooks based on learnings
  • Keep source VMs stopped for rollback

Phase 3: Production Infrastructure (Week 6-7)

  • Apply learnings from staging
  • Provision production swarm (3 managers, 8 workers)
  • Configure with production-grade monitoring
  • Establish separate MIG auto-scaling policies

Phase 4: Production Migration (Week 8-10)

  • Migrate non-critical production workloads first
  • Migrate standard workloads
  • Migrate high-memory workloads with dedicated node labels
  • Validate application behavior under production load
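The node-label step can be sketched with standard Docker Swarm commands; node, service, and registry names here are placeholders:

```shell
# Reserve a node for high-memory workloads via a label.
docker node update --label-add workload=highmem prod-worker-3

# Pin the service to labeled nodes and cap its per-task resources.
docker service create \
  --name analytics-api \
  --constraint 'node.labels.workload == highmem' \
  --reserve-memory 8g \
  --limit-memory 12g \
  --replicas 2 \
  registry.example.com/analytics-api:stable
```

Reservations keep the scheduler from over-packing a node; limits contain noisy neighbors, which also backs the mitigation listed in the risk table below.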

Phase 5: Decommission & Optimization (Week 11-12+)

  • Decommission source VMs after 30-day validation
  • Analyze actual utilization in consolidated environments
  • Adjust MIG sizing based on observed patterns
  • Evaluate 3-year CUD after 6 months of stable operation

Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration disruption | Medium | Medium | Phased rollout, instant rollback via DNS |
| Noisy neighbor issues | Low | Medium | Resource limits, node labels for isolation |
| Single cluster failure | Low | High | 3 managers, multi-zone deployment per env |
| Workload conflicts | Low | Medium | Testing period, gradual migration |
| Cross-env contamination | N/A | N/A | Separate swarms maintain isolation |

Rollback Strategy

  • Keep old VMs stopped for 30 days post-migration
  • DNS-based traffic shifting enables instant rollback
  • Maintain deployment scripts for old architecture
  • Staging serves as canary for production changes

Decision

Recommendation

Proceed with full consolidation of both environments. The data is unambiguous:

  • Paying for 223 vCPUs, using ~16
  • Paying for 869 GB memory, using ~350 GB
  • Managing 98 VMs when 20 nodes would suffice
  • **$92,000+ annual savings** ($54k infrastructure + $38k operational time)
  • Payback in ~8 months, $214k+ savings over 3 years

Decision Matrix

| Factor | Consolidate | Do Nothing | Winner |
|---|---|---|---|
| 3-year savings | $213,747 | $0 | Consolidate |
| Payback period | 8 months | N/A | Consolidate |
| Management overhead | 20 nodes | 98 VMs | Consolidate |
| Implementation effort | 480 hours | 0 hours | Do Nothing |
| Operational consistency | High | Low | Consolidate |

3-Year Total Cost Comparison

| Scenario | Investment | Infra Cost | Total | vs Do Nothing |
|---|---|---|---|---|
| Do Nothing | $0 | $264,240 | $264,240 | - |
| Both Environments | $62,850 | $101,700 | $164,550 | -$99,690 |

Infra: 36 months × monthly cost.
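The comparison reduces to straightforward arithmetic on figures already established above:

```python
# Reproduce the 3-year total-cost comparison (36 months of infra spend).
months = 36
do_nothing = months * 7_340                # current $7,340/month
consolidated = 62_850 + months * 2_825     # investment + $2,825/month

print(do_nothing)                 # 264240
print(consolidated)               # 164550
print(do_nothing - consolidated)  # 99690
```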

Success Criteria

| Metric | Staging Target | Production Target |
|---|---|---|
| Monthly cost | < $1,500 | < $2,000 |
| Memory utilization | 50-70% | 50-70% |
| Deployment success rate | > 99% | > 99.9% |
| Incident count | < baseline | < baseline |
| Migration completion | 100% | 100% |

Key Decision Points

| Decision | Recommendation | Rationale |
|---|---|---|
| Worker node type | n2-highmem-4 | Best $/GB for memory-constrained workloads |
| Staging workers (base) | 6 | 192 GB = 119% of actual usage |
| Production workers (base) | 8 | 256 GB = 135% of actual usage |
| Environment isolation | Separate swarms | Maintain staging/prod boundary |
| CUD commitment | Defer 6 months | Validate sizing before 3-year commitment |