Debug Docker Container Issues

Step-by-step guide for troubleshooting Docker Swarm services running on remote VMs.

Containers Are Down - Troubleshooting Flow

Step 1: Try Redeploying

First, attempt to redeploy the application through:

  • Azure DevOps (ADO) - for CI/CD pipelines
  • AWX - for Ansible-based deployments

If redeployment succeeds, verify the service is healthy. If it fails or the issue persists, continue to Step 2.

Step 2: Find the VM and SSH In

  1. Identify the VM the application is running on:

    • hb-p-vm-$ID [Greens, DHA, Buzz, Eligibility API]
    • bees-p-vm-$ID [Yellow]
    • pdp$NUM-p-vm-$ID [Blues, CoPos]
    • [HA Admin Portals]
      • Check the name of the Deployment Group in the Azure DevOps WebApp release definition.
      • The Deployment Group name matches the name of the VM in Azure.
  2. SSH into the appropriate VM.

Step 3: Docker Debug Commands

Once on the VM, check service status:

  1. List all services:

    docker service ls

    If the REPLICAS column shows 0/1, no task is running and the service is down.

  2. View detailed task info with full error messages:

    docker service ps --no-trunc SERVICE_NAME

  3. Check service logs:

    docker service logs --tail 100 SERVICE_NAME
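
On a VM that runs many services, scanning the REPLICAS column by eye is easy to get wrong. The sketch below shows one way to filter the list down to services that have fewer running tasks than desired. The function name find_down_services is hypothetical (it is not part of Docker), and the sketch assumes the input comes from docker service ls with a custom --format template.

```shell
# Print services whose running replica count is below the desired count.
# Reads "NAME RUNNING/DESIRED" lines on stdin, for example from:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | find_down_services
find_down_services() {
  awk '{
    # Split "RUNNING/DESIRED" into its two numbers
    split($2, r, "/")
    if (r[1] + 0 < r[2] + 0) print $1
  }'
}
```

Each name it prints is a candidate for docker service ps --no-trunc and docker service logs.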

Step 4: Rescale the Service

Try scaling the service down and back up:

# Scale down to 0
docker service scale SERVICE_NAME=0

# Wait a few seconds, then scale back up
docker service scale SERVICE_NAME=1

Verify the service is running:

docker service ls

If rescaling works, you’re done. If not, continue to Step 5.
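
The verify part of this step can be scripted so that you do not have to rerun docker service ls by hand. This is a minimal sketch, not an established tool: replicas_ready and wait_for_service are hypothetical names, and the 12 attempts and 5-second pause are arbitrary assumptions to tune for your services.

```shell
# Succeed when a "RUNNING/DESIRED" string shows every desired replica running.
replicas_ready() {
  case "$1" in
    [0-9]*/[0-9]*) ;;
    *) return 1 ;;   # empty or unexpected input counts as "not ready"
  esac
  rr_running="${1%%/*}"
  rr_desired="${1##*/}"
  [ "$rr_desired" -gt 0 ] && [ "$rr_running" -eq "$rr_desired" ]
}

# Poll docker service ls until the named service is fully up.
# Usage: wait_for_service SERVICE_NAME [max_attempts]
wait_for_service() {
  ws_service="$1"
  ws_attempts="${2:-12}"   # assumed default, adjust as needed
  ws_i=0
  while [ "$ws_i" -lt "$ws_attempts" ]; do
    # Match the service name exactly and take its REPLICAS value
    ws_replicas="$(docker service ls --format '{{.Name}} {{.Replicas}}' |
      awk -v s="$ws_service" '$1 == s { print $2 }')"
    if replicas_ready "$ws_replicas"; then
      echo "${ws_service} is up (${ws_replicas})"
      return 0
    fi
    ws_i=$((ws_i + 1))
    sleep 5
  done
  echo "${ws_service} did not reach its desired replica count" >&2
  return 1
}
```

After scaling back up, wait_for_service SERVICE_NAME returns non-zero if the service never recovers, which is the signal to continue to Step 5.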

Step 5: If Rescaling Doesn’t Work

  1. Check for image pull issues:

    docker service ps --no-trunc SERVICE_NAME | grep -i error

    If there are image errors, verify the image exists and credentials are correct.

  2. Check for resource issues:

    # Check node status and availability
    docker node ls
    
    # Inspect service constraints and resource limits
    docker service inspect --pretty SERVICE_NAME

  3. Check disk space (common cause of failures):

    # Check disk space on host
    df -h
    
    # Check Docker disk usage
    docker system df
    
    # Clean up unused resources if needed
    # GCP
    ./static/scripts/docker-system-cleanup.sh
    
    # Azure Linux
    sudo /DevOps/scripts/cleanup_docker_images.sh
    
    # Azure Windows VM - run in PowerShell as Administrator
    C:\DevOps\scripts\cleanup_docker_images.ps1

  4. Check memory/CPU usage:

    # Check container resource usage
    docker stats --no-stream

  5. Check network issues:

    # List networks
    docker network ls
    
    # Inspect overlay network
    docker network inspect NETWORK_NAME
    
    # Check if ingress network is healthy
    docker network inspect ingress

  6. Check secrets/config availability:

    # List secrets available
    docker secret ls
    
    # List the secrets the service is configured to use
    docker service inspect SERVICE_NAME --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'

  7. Try a force update to restart all tasks:

    docker service update --force SERVICE_NAME
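
Because a full disk is called out as a common cause, it may help to flag nearly full filesystems instead of reading df output line by line. The helper below is a sketch under assumptions: full_filesystems is a hypothetical name, the 90 percent default threshold is arbitrary, and it expects the POSIX output format of df -P on a Linux host (mount points containing spaces are not handled).

```shell
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin, for example:
#   df -P | full_filesystems 90
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    # Column 5 is the capacity, such as "95%"; strip the percent sign
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}
```

If it prints the filesystem that holds Docker's data, run the cleanup script for your platform before retrying the service.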

Step 6: If Nothing Works

More aggressive troubleshooting:

  1. Remove and recreate the service (if you have the deployment config):

    docker service rm SERVICE_NAME
    # Redeploy via ADO/AWX

  2. Check if the node itself is unhealthy:

    docker node inspect NODE_NAME

  3. Drain and reactivate a problematic node:

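    # Drain the node (Swarm stops its tasks and reschedules them where possible)
    docker node update --availability drain NODE_NAME
    # Make the node available for scheduling again
    docker node update --availability active NODE_NAME

  4. Check Docker daemon logs on the host: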

    journalctl -u docker --since "1 hour ago"
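
To decide which node to inspect or drain, it can be quicker to list only the nodes that are not in a normal state. The following sketch assumes it runs on a Swarm manager (node commands only work there) and that docker node ls is called with a custom --format template. The name unhealthy_nodes is hypothetical.

```shell
# Print nodes that are not both Ready and Active.
# Reads "HOSTNAME STATUS AVAILABILITY" lines on stdin, for example from:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | unhealthy_nodes
unhealthy_nodes() {
  awk '$2 != "Ready" || $3 != "Active" {
    print $1 ": status=" $2 ", availability=" $3
  }'
}
```

A node left in Drain after an earlier troubleshooting session shows up here too, which is a common reason for tasks staying pending.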

Step 7: When to Escalate

Escalate to the appropriate team when:

  • DevOps Team: VM is unreachable, persistent node failures, underlying host issues, overlay network failures across multiple services, ingress issues
  • Application Team: Application-specific errors in logs, configuration issues

Include the following information when escalating:

  • Service name and VM hostname
  • Error messages from docker service ps --no-trunc
  • Relevant logs from docker service logs
  • Steps already attempted
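
Gathering these items by hand under pressure is error-prone, so one option is to capture them in a single pass. This is only a sketch: collect_escalation_info is a hypothetical helper, and it covers the service name, hostname, task errors, and recent logs. The steps already attempted still need to be written up manually.

```shell
# Print the details requested for an escalation to stdout.
# Usage: collect_escalation_info SERVICE_NAME > escalation.txt
collect_escalation_info() {
  ce_service="$1"
  echo "Service: ${ce_service}"
  echo "Host: $(uname -n)"
  echo "=== docker service ps --no-trunc ==="
  docker service ps --no-trunc "$ce_service" 2>&1
  echo "=== docker service logs (last 100 lines) ==="
  docker service logs --tail 100 "$ce_service" 2>&1
}
```

Review the output before sharing it, since service logs can contain sensitive values.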

Common Error Messages Reference

| Error Message | Likely Cause | Solution |
|---|---|---|
| no suitable node | Node constraints not met, insufficient resources | Check node labels, resource availability |
| image not found / pull access denied | Missing image or auth issues | Verify image exists, check registry credentials |
| task: non-zero exit | Application crashed | Check docker service logs for app errors |
| container unhealthy | Health check failing | Check app health endpoint, review health check config |
| network not found | Overlay network missing | Recreate network or redeploy stack |
| secret not found | Missing Docker secret | Verify secret exists with docker secret ls |
| no space left on device | Disk full | Run docker system prune -f, expand disk |
| OOM killed | Out of memory | Increase memory limits or optimize app |
| context deadline exceeded | Timeout during operations | Check network connectivity, retry operation |
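
The reference table can also be expressed as a lookup for use in scripts. The sketch below maps an error string to the solution text from the table. suggest_fix is a hypothetical helper, and the substring patterns are assumptions about how these messages appear in docker service ps output, so expect to adjust them.

```shell
# Print the suggested solution for a task error message.
# Usage: suggest_fix "ERROR TEXT"
suggest_fix() {
  case "$1" in
    *"no suitable node"*)
      echo "Check node labels, resource availability" ;;
    *"image not found"*|*"pull access denied"*)
      echo "Verify image exists, check registry credentials" ;;
    *"non-zero exit"*)
      echo "Check docker service logs for app errors" ;;
    *"unhealthy"*)
      echo "Check app health endpoint, review health check config" ;;
    *"network"*"not found"*)
      echo "Recreate network or redeploy stack" ;;
    *"secret"*"not found"*)
      echo "Verify secret exists with docker secret ls" ;;
    *"no space left on device"*)
      echo "Run docker system prune -f, expand disk" ;;
    *"OOM"*)
      echo "Increase memory limits or optimize app" ;;
    *"context deadline exceeded"*)
      echo "Check network connectivity, retry operation" ;;
    *)
      echo "No matching entry; include the full error when escalating" ;;
  esac
}
```

For example, suggest_fix "task: non-zero exit (1)" prints the log-check suggestion from the third row.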