Debug Docker Container Issues

Step-by-step guide for troubleshooting Docker Swarm services running on remote VMs.

Containers Are Down - Troubleshooting Flow

Step 1: Try Redeploying

First, attempt to redeploy the application through:

- Azure DevOps (ADO) - for CI/CD pipelines
- AWX - for Ansible-based deployments

If redeployment succeeds, verify the service is healthy. If it fails or the issue persists, continue to Step 2.

Step 2: Find the VM and SSH In

Identify the VM the application is running on:
- hb-p-vm-$ID [Greens, DHA, Buzz, Eligibility API]
- bees-p-vm-$ID [Yellow]
- pdp$NUM-p-vm-$ID [Blues, CoPos]
- [HA Admin Portals]
  - Check the name of the Deployment Group in the Azure DevOps WebApp release definition.
  - The name of the Deployment Group matches the name of the VM on Azure.

SSH into the appropriate VM:
- SSH Into GCP VMs
- SSH Into Azure VMs

Step 3: Docker Debug Commands

Once on the VM, check service status:

List all services:
```
docker service ls
```
If the REPLICAS column shows 0/1 (no running tasks out of one desired), the service is down.

View detailed task info with full error messages:
```
docker service ps --no-trunc SERVICE_NAME
```

Check service logs:

```
docker service logs --tail 100 SERVICE_NAME
```
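
On a VM that hosts many services, scanning the REPLICAS column by eye is easy to get wrong. The following is a minimal sketch, not part of the original runbook, that narrows the listing to services running fewer tasks than desired. The function name `under_replicated` is hypothetical, and it assumes the `NAME REPLICAS` layout produced by the `--format` string in the usage comment.

```shell
# Reads lines of the form "NAME RUNNING/DESIRED" (e.g. "web 0/1") on stdin
# and prints the name of every service with fewer running tasks than desired.
under_replicated() {
  awk '{
    split($2, r, "/")                 # r[1] = running, r[2] = desired
    if (r[1] + 0 < r[2] + 0) print $1
  }'
}

# Usage on a Swarm manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

Each name it prints is a candidate for the `docker service ps --no-trunc` and `docker service logs` commands above.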

Step 4: Rescale the Service

Try scaling the service down and back up:

```
# Scale down to 0
docker service scale SERVICE_NAME=0

# Wait a few seconds, then scale back up
docker service scale SERVICE_NAME=1
```

Verify the service is running:

```
docker service ls
```

If rescaling works, you’re done. If not, continue to Step 5.
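
The scale-down, wait, scale-up, verify sequence in this step can be wrapped in one function. This is a sketch under stated assumptions rather than an official script: the names `service_replicas` and `rescale_service` and the `DOCKER` and `PAUSE` variables are hypothetical, and the default pause (5 seconds) and timeout (60 seconds) are arbitrary.

```shell
DOCKER="${DOCKER:-docker}"   # overridable, so the functions can be dry-run against a stub

# Prints "RUNNING/DESIRED" for exactly the named service. The name filter
# matches by prefix, so the exact comparison is done in awk.
service_replicas() {
  "$DOCKER" service ls --filter "name=$1" --format '{{.Name}} {{.Replicas}}' |
    awk -v svc="$1" '$1 == svc { print $2 }'
}

# Scales a service to 0 and back to N replicas, then polls until N/N are running.
# Returns 0 on success, 1 if a scale command fails or the timeout (seconds) expires.
rescale_service() {
  svc="$1"; want="${2:-1}"; timeout="${3:-60}"
  "$DOCKER" service scale -d "$svc=0" >/dev/null || return 1
  sleep "${PAUSE:-5}"
  "$DOCKER" service scale -d "$svc=$want" >/dev/null || return 1
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    [ "$(service_replicas "$svc")" = "$want/$want" ] && return 0
    sleep 2
    waited=$((waited + 2))
  done
  return 1
}

# Usage: rescale_service SERVICE_NAME 1 60 || echo "still down"
```

The `-d` (detach) flag makes `docker service scale` return immediately, so the polling loop, not the CLI, decides when to give up.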

Step 5: If Rescaling Doesn’t Work

Check for image pull issues:
```
docker service ps --no-trunc SERVICE_NAME | grep -i error
```
If there are image errors, verify the image exists and credentials are correct.
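
One way to verify both points at once is to pull the service's image by hand on the affected node. The sketch below is an illustration, not part of the original runbook: `service_image` and the `DOCKER` variable are hypothetical names. It reads the image reference from the service spec and strips the `@sha256:...` digest that Swarm appends when it pins a tag.

```shell
DOCKER="${DOCKER:-docker}"   # overridable, so the function can be dry-run against a stub

# Prints the image reference a service is configured to run, without the
# "@sha256:..." digest suffix (if any).
service_image() {
  img=$("$DOCKER" service inspect \
    --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}' "$1") || return 1
  printf '%s\n' "${img%@*}"
}

# Usage on the affected node:
#   docker pull "$(service_image SERVICE_NAME)"
```

A successful pull suggests the image exists and that the credentials configured on this node can reach the registry. A failure should reproduce the same error outside Swarm, where it is easier to read.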

Check for resource issues:

```
# Check node resources
docker node ls

# Inspect service constraints
docker service inspect --pretty SERVICE_NAME
```

Check disk space (common cause of failures):

```
# Check disk space on host
df -h

# Check Docker disk usage
docker system df
```

Clean up unused resources if needed, using the script for the VM's platform:

```
# GCP
./static/scripts/docker-system-cleanup.sh

# Azure Linux
sudo /DevOps/scripts/cleanup_docker_images.sh

# Azure Windows VM (run in PowerShell as Administrator)
C:\DevOps\scripts\cleanup_docker_images.ps1
```
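
To decide quickly whether a cleanup is needed on a Linux VM, the `df` output can be filtered for nearly full filesystems. This is a minimal sketch: the function name `full_filesystems` and the 90% default threshold are assumptions, not values from the runbook. It expects the POSIX layout printed by `df -P`.

```shell
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem at or above the given usage threshold (default: 90 percent).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # "95%" -> "95"
    if (use + 0 >= limit) print $6, $5
  }'
}

# Usage:
#   df -P | full_filesystems 90
```

The filesystem that holds Docker's data directory (commonly `/var/lib/docker` on Linux) is the one to watch first.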

Check memory/CPU usage:

```
# Check container resource usage
docker stats --no-stream
```

Check network issues:

```
# List networks
docker network ls

# Inspect overlay network
docker network inspect NETWORK_NAME

# Check if ingress network is healthy
docker network inspect ingress
```

Check secrets/config availability:

```
# List the secrets available in the swarm
docker secret ls

# List the secrets the service references
docker service inspect SERVICE_NAME --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'
```
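
Comparing the two listings by eye gets tedious when a service references several secrets. The following sketch automates the comparison. It is an illustration only: `missing_secrets` and the file names `required.txt` and `available.txt` are hypothetical.

```shell
# Prints every secret name listed in the first file that is absent from the
# second file. Both files hold one name per line; blank lines are ignored.
missing_secrets() {
  grep -v '^$' "$1" | grep -Fxv -f "$2"
}

# Usage on a Swarm manager node:
#   docker service inspect SERVICE_NAME \
#     --format '{{range .Spec.TaskTemplate.ContainerSpec.Secrets}}{{println .SecretName}}{{end}}' > required.txt
#   docker secret ls --format '{{.Name}}' > available.txt
#   missing_secrets required.txt available.txt
```

No output means every secret the service references exists in the swarm.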

Try a force update to restart all tasks:

```
docker service update --force SERVICE_NAME
```

Step 6: If Nothing Works

More aggressive troubleshooting:

Remove and recreate the service (if you have the deployment config):
```
docker service rm SERVICE_NAME
# Redeploy via ADO/AWX
```
Check if the node itself is unhealthy:
```
docker node inspect NODE_NAME
```

Drain and reactivate a problematic node:

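```
# Stop scheduling on the node; Swarm shuts down its tasks and reschedules them elsewhere
docker node update --availability drain NODE_NAME
# Make the node schedulable again
docker node update --availability active NODE_NAME
```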

Check Docker daemon logs on the host:

```
journalctl -u docker --since "1 hour ago"
```

Step 7: When to Escalate

Escalate to the appropriate team when:

- DevOps Team: VM is unreachable, persistent node failures, underlying host issues, overlay network failures across multiple services, ingress issues
- Application Team: application-specific errors in logs, configuration issues

Include the following information when escalating:

- Service name and VM hostname
- Error messages from `docker service ps --no-trunc`
- Relevant logs from `docker service logs`
- Steps already attempted
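
The items in this list can be gathered into a single text report before escalating. This is a sketch, not an established team script: the function name `escalation_report` and the `DOCKER` variable are hypothetical, and the list of attempted steps still has to be written by hand.

```shell
DOCKER="${DOCKER:-docker}"   # overridable, so the report can be dry-run against a stub

# Prints an escalation report for one service to stdout: service name, VM
# hostname, full task errors, and the last 100 log lines.
escalation_report() {
  svc="$1"
  echo "Service: $svc"
  echo "VM hostname: $(uname -n)"
  echo "--- docker service ps --no-trunc ---"
  "$DOCKER" service ps --no-trunc "$svc" 2>&1
  echo "--- docker service logs --tail 100 ---"
  "$DOCKER" service logs --tail 100 "$svc" 2>&1
  echo "--- Steps already attempted ---"
  echo "(fill in manually)"
}

# Usage: escalation_report SERVICE_NAME > escalation.txt
```

Redirecting stderr into the report keeps Docker's own error output (for example, an unknown service name) next to the section it belongs to.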

Common Error Messages Reference

| Error Message | Likely Cause | Solution |
| --- | --- | --- |
| `no suitable node` | Node constraints not met, insufficient resources | Check node labels, resource availability |
| `image not found` / `pull access denied` | Missing image or auth issues | Verify image exists, check registry credentials |
| `task: non-zero exit` | Application crashed | Check `docker service logs` for app errors |
| `container unhealthy` | Health check failing | Check app health endpoint, review health check config |
| `network not found` | Overlay network missing | Recreate network or redeploy stack |
| `secret not found` | Missing Docker secret | Verify secret exists with `docker secret ls` |
| `no space left on device` | Disk full | Run `docker system prune -f`, expand disk |
| `OOM killed` | Out of memory | Increase memory limits or optimize app |
| `context deadline exceeded` | Timeout during operations | Check network connectivity, retry operation |
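
The reference above can also be turned into a small lookup helper for use during an incident. This is a minimal sketch with a hypothetical function name, `suggest_fix`. The match strings are taken from the reference entries, and real Docker error text may be worded differently, so treat a miss as "look it up manually" rather than "no known cause".

```shell
# Prints the suggested solution for an error message, using case-insensitive
# substring matching against the reference entries. Returns 1 if nothing matches.
suggest_fix() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"no suitable node"*)          echo "Check node labels, resource availability" ;;
    *"image not found"*|*"pull access denied"*)
                                   echo "Verify image exists, check registry credentials" ;;
    *"non-zero exit"*)             echo "Check docker service logs for app errors" ;;
    *"unhealthy"*)                 echo "Check app health endpoint, review health check config" ;;
    *"network not found"*)         echo "Recreate network or redeploy stack" ;;
    *"secret not found"*)          echo "Verify secret exists with docker secret ls" ;;
    *"no space left on device"*)   echo "Run docker system prune -f, expand disk" ;;
    *"oom killed"*)                echo "Increase memory limits or optimize app" ;;
    *"context deadline exceeded"*) echo "Check network connectivity, retry operation" ;;
    *)                             return 1 ;;
  esac
}

# Usage: suggest_fix "task: non-zero exit (1)"
```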
