Debug Docker Container Issues
Step-by-step guide for troubleshooting Docker Swarm services running on remote VMs.
Containers Are Down - Troubleshooting Flow
Step 1: Try Redeploying
First, attempt to redeploy the application through:
- Azure DevOps (ADO) - for CI/CD pipelines
- AWX - for Ansible-based deployments
If redeployment succeeds, verify the service is healthy. If it fails or the issue persists, continue to Step 2.
Step 2: Find the VM and SSH In
Identify the VM the application is running on. VM names follow these patterns (applications in brackets):
- hb-p-vm-$ID [Greens, DHA, Buzz, Eligibility API]
- bees-p-vm-$ID [Yellow]
- pdp$NUM-p-vm-$ID [Blues, CoPos]
- [HA Admin Portals]
To find the exact name:
- Check the name of the Deployment Group in the Azure DevOps WebApp release definition.
- The name of the Deployment Group matches the name of the VM on Azure.
SSH into the appropriate VM:
Step 3: Docker Debug Commands
Once on the VM, check service status:
List all services:
    docker service ls
If REPLICAS shows 0/1, the service is down.
View detailed task info with full error messages:
    docker service ps --no-trunc SERVICE_NAME
Check service logs:
    docker service logs --tail 100 SERVICE_NAME
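On a VM with many services, scanning the REPLICAS column by eye is error-prone. The following is a minimal sketch that prints only the services running fewer replicas than desired. The function name degraded_services is hypothetical, and the sketch assumes the REPLICAS value has the RUNNING/DESIRED form shown above.

```shell
# List swarm services that run fewer replicas than desired.
# Prints one "NAME RUNNING/DESIRED" line per degraded service.
# Assumption: the function name is illustrative, not an existing tool.
degraded_services() {
  docker service ls --format '{{.Name}} {{.Replicas}}' |
    awk '{
      # "0/1" becomes r[1]=0 (running) and r[2]=1 (desired)
      split($2, r, "/")
      if (r[1] + 0 < r[2] + 0) print $1, $2
    }'
}

# Usage (on the VM):
#   degraded_services
```

A service deliberately scaled to 0 shows 0/0 and is not reported.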
Step 4: Rescale the Service
Try scaling the service down and back up:
    # Scale down to 0
    docker service scale SERVICE_NAME=0
    # Wait a few seconds, then scale back up
    docker service scale SERVICE_NAME=1
Verify the service is running:
    docker service ls
If rescaling works, you’re done. If not, continue to Step 5.
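The scale-down and scale-up sequence can be wrapped in one function so the steps are not run out of order. This is a sketch only: the name rescale_service, the default pause of 5 seconds, and the default of 1 replica are assumptions. Pass the normal replica count for services that run more than one.

```shell
# Restart a service by scaling it to 0 and back up.
# Usage: rescale_service SERVICE_NAME [REPLICAS] [PAUSE_SECONDS]
# Assumptions: defaults of 1 replica and a 5 second pause are illustrative.
rescale_service() {
  local service="$1" replicas="${2:-1}" pause="${3:-5}"
  if [ -z "$service" ]; then
    echo "usage: rescale_service SERVICE_NAME [REPLICAS] [PAUSE_SECONDS]" >&2
    return 2
  fi
  # Stop all tasks, give the node a moment, then start them again
  docker service scale "${service}=0" || return 1
  sleep "$pause"
  docker service scale "${service}=${replicas}" || return 1
  # Show services whose name starts with SERVICE_NAME (the name filter is a prefix match)
  docker service ls --filter "name=${service}"
}

# Usage (on the VM):
#   rescale_service SERVICE_NAME
```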
Step 5: If Rescaling Doesn’t Work
Check for image pull issues:
    docker service ps --no-trunc SERVICE_NAME | grep -i error
If there are image errors, verify the image exists and credentials are correct.
Check for resource issues:
    # Check node resources
    docker node ls
    # Inspect service constraints
    docker service inspect --pretty SERVICE_NAME
Check disk space (common cause of failures):
    # Check disk space on host
    df -h
    # Check Docker disk usage
    docker system df
    # Clean up unused resources if needed
    # GCP
    ./static/scripts/docker-system-cleanup.sh
    # Azure Linux
    sudo /DevOps/scripts/cleanup_docker_images.sh
    # Azure Windows VM - Run on PowerShell as Admin
    C:\DevOps\scripts\cleanup_docker_images.ps1
Check memory/CPU usage:
    # Check container resource usage
    docker stats --no-stream
Check network issues:
    # List networks
    docker network ls
    # Inspect overlay network
    docker network inspect NETWORK_NAME
    # Check if ingress network is healthy
    docker network inspect ingress
Check secrets/config availability:
    # List secrets available
    docker secret ls
    # Check if service can access required secrets
    docker service inspect SERVICE_NAME --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'
Try a force update to restart all tasks:
    docker service update --force SERVICE_NAME
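Since a full disk is a common cause of failures, the disk check can be turned into a yes/no test before deciding whether to run a cleanup script. The sketch below is for Linux hosts. The function names and the 90 percent default threshold are assumptions, not values from this guide.

```shell
# Extract the Capacity column (percent used, without "%") from `df -P` output on stdin.
parse_df_percent() {
  # df -P prints a fixed POSIX layout; line 2, column 5 looks like "93%"
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds (exit 0) when the filesystem holding PATH is at or above THRESHOLD percent used.
# Usage: disk_is_full [PATH] [THRESHOLD_PERCENT]
# Assumption: the default threshold of 90 is illustrative.
disk_is_full() {
  local used
  used=$(df -P "${1:-/}" | parse_df_percent)
  [ "$used" -ge "${2:-90}" ]
}

# Usage (on the VM):
#   if disk_is_full / 90; then echo "Disk nearly full, consider cleanup"; fi
```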
Step 6: If Nothing Works
More aggressive troubleshooting:
Remove and recreate the service (if you have the deployment config):
    docker service rm SERVICE_NAME
    # Redeploy via ADO/AWX
Check if the node itself is unhealthy:
    docker node inspect NODE_NAME
Drain and reactivate a problematic node:
    docker node update --availability drain NODE_NAME
    docker node update --availability active NODE_NAME
Check Docker daemon logs on the host:
    journalctl -u docker --since "1 hour ago"
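Before inspecting nodes one by one, it can help to list only the nodes that are not in a normal state. This is a minimal sketch with a hypothetical function name. It must run on a swarm manager, and it treats any node that is not both Ready and Active as worth a closer look.

```shell
# Print swarm nodes that are not both Ready and Active.
# Output: one "HOSTNAME STATUS AVAILABILITY" line per such node.
# Assumption: the function name is illustrative; run it on a manager node.
unhealthy_nodes() {
  docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' |
    awk 'tolower($2) != "ready" || tolower($3) != "active" { print }'
}

# Usage (on a manager node):
#   unhealthy_nodes
```

A node left in Drain after the drain and reactivate step shows up here, which makes a forgotten reactivation easy to spot.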
Step 7: When to Escalate
Escalate to the appropriate team when:
- DevOps Team: VM is unreachable, persistent node failures, underlying host issues, overlay network failures across multiple services, ingress issues
- Application Team: Application-specific errors in logs, configuration issues
Include the following information when escalating:
- Service name and VM hostname
- Error messages from docker service ps --no-trunc
- Relevant logs from docker service logs
- Steps already attempted
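The items in this list can be gathered into a single file to attach to the escalation. The following is a sketch under assumptions: the function name, the default output file name, and the section headings are hypothetical. The steps already attempted still have to be written in by hand.

```shell
# Gather the details requested for an escalation into one text file.
# Usage: collect_escalation_info SERVICE_NAME [OUTPUT_FILE]
# Assumption: the output file name and section headings are illustrative.
collect_escalation_info() {
  local service="$1"
  local out="${2:-escalation-${service}.txt}"
  {
    echo "Service: ${service}"
    echo "VM hostname: $(uname -n)"
    echo
    echo "== docker service ps --no-trunc =="
    docker service ps --no-trunc "$service" 2>&1
    echo
    echo "== docker service logs (last 100 lines) =="
    docker service logs --tail 100 "$service" 2>&1
    echo
    echo "== Steps already attempted =="
    echo "(fill in manually)"
  } > "$out"
  # Print the path of the file that was written
  echo "$out"
}

# Usage (on the VM):
#   collect_escalation_info SERVICE_NAME
```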
Common Error Messages Reference
| Error Message | Likely Cause | Solution |
|---|---|---|
| no suitable node | Node constraints not met, insufficient resources | Check node labels, resource availability |
| image not found / pull access denied | Missing image or auth issues | Verify image exists, check registry credentials |
| task: non-zero exit | Application crashed | Check docker service logs for app errors |
| container unhealthy | Health check failing | Check app health endpoint, review health check config |
| network not found | Overlay network missing | Recreate network or redeploy stack |
| secret not found | Missing Docker secret | Verify secret exists with docker secret ls |
| no space left on device | Disk full | Run docker system prune -f, expand disk |
| OOM killed | Out of memory | Increase memory limits or optimize app |
| context deadline exceeded | Timeout during operations | Check network connectivity, retry operation |
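The table can also be encoded as a small lookup so an error string from docker service ps maps straight to its suggested action. This is a sketch with a hypothetical function name. The substring patterns are assumptions, because real messages often embed image, network, or secret names between the words shown in the table.

```shell
# Map a task error message to the suggested action from the reference table.
# Usage: suggest_fix "ERROR MESSAGE"
# Returns 1 when no row matches.
# Assumption: matching is by lowercase substring; real messages may vary.
suggest_fix() {
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"no suitable node"*)          echo "Check node labels, resource availability" ;;
    *"image not found"*|*"pull access denied"*)
                                   echo "Verify image exists, check registry credentials" ;;
    *"non-zero exit"*)             echo "Check docker service logs for app errors" ;;
    *"unhealthy"*)                 echo "Check app health endpoint, review health check config" ;;
    *"network"*"not found"*)       echo "Recreate network or redeploy stack" ;;
    *"secret"*"not found"*)        echo "Verify secret exists with docker secret ls" ;;
    *"no space left on device"*)   echo "Run docker system prune -f, expand disk" ;;
    *"oom killed"*)                echo "Increase memory limits or optimize app" ;;
    *"context deadline exceeded"*) echo "Check network connectivity, retry operation" ;;
    *)                             echo "No match in reference table" >&2; return 1 ;;
  esac
}

# Usage (on the VM), feeding in the most recent task error:
#   suggest_fix "$(docker service ps --no-trunc --format '{{.Error}}' SERVICE_NAME | head -n 1)"
```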