Debug Docker Container Issues
Step-by-step guide for troubleshooting Docker Swarm services running on remote VMs.
Containers Are Down - Troubleshooting Flow
Step 1: Try Redeploying
First, attempt to redeploy the application through:
- Azure DevOps (ADO) - for CI/CD pipelines
- AWX - for Ansible-based deployments
If redeployment succeeds, verify the service is healthy. If it fails or the issue persists, continue to Step 2.
Step 2: Find the VM and SSH In
Identify the VM the application is running on:
hb-p-vm-$ID [Greens, DHA, Buzz, Eligibility API]
bees-p-vm-$ID [Yellow]
pdp$NUM-p-vm-$ID [Blues, CoPos]
- [HA Admin Portals]
- Check the name of the Deployment Group in the Azure DevOps WebApp release definition.
- The name of the Deployment Group will match the name of the VM on Azure
SSH into the appropriate VM:
Step 3: Docker Debug Commands
Once on the VM, check service status:
List all services:
docker service ls
If REPLICAS shows 0/1, the service is down.
View detailed task info with full error messages:
docker service ps --no-trunc SERVICE_NAME
Check service logs:
docker service logs --tail 100 SERVICE_NAME
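When a node runs many services, scanning the REPLICAS column by eye is error-prone. The sketch below filters for services running fewer replicas than desired. The function name `under_replicated` is hypothetical, and it assumes input lines of the form `NAME RUNNING/DESIRED`, as produced by the `--format` call shown in the comment.

```shell
# Print the name of every service whose running replica count is below
# the desired count. Reads "NAME RUNNING/DESIRED" lines on stdin.
# (under_replicated is a hypothetical helper, not an existing script.)
under_replicated() {
  awk '{
    split($2, r, "/")                  # r[1] = running, r[2] = desired
    if (r[1] + 0 < r[2] + 0) print $1  # compare numerically
  }'
}

# On a Swarm manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | under_replicated
```

Any name this prints is a candidate for the `docker service ps --no-trunc` and `docker service logs` checks above.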
Step 4: Rescale the Service
Try scaling the service down and back up:
# Scale down to 0
docker service scale SERVICE_NAME=0
# Wait a few seconds, then scale back up
docker service scale SERVICE_NAME=1
Verify the service is running:
docker service ls
If rescaling works, you’re done. If not, continue to Step 5.
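The scale-down, wait, scale-up sequence can be wrapped in one function so the pause is not forgotten. This is a minimal sketch: `rescale_service` and the `DOCKER` override variable are hypothetical names, and it assumes the service should return to a single replica, as in the commands above.

```shell
# DOCKER can be overridden (e.g. DOCKER=echo) to preview the commands
# without touching the swarm. Both names here are illustrative.
DOCKER="${DOCKER:-docker}"

# rescale_service SERVICE_NAME [PAUSE_SECONDS]
rescale_service() {
  service="$1"
  pause="${2:-5}"
  "$DOCKER" service scale "${service}=0" || return 1
  sleep "$pause"                         # give tasks time to shut down
  "$DOCKER" service scale "${service}=1" || return 1
  "$DOCKER" service ls --filter "name=${service}"
}

# Example: rescale_service SERVICE_NAME 10
```

If the service normally runs more than one replica, adjust the scale-up target accordingly.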
Step 5: If Rescaling Doesn’t Work
Check for image pull issues:
docker service ps --no-trunc SERVICE_NAME | grep -i error
If there are image errors, verify the image exists and credentials are correct.
Check for resource issues:
# Check node resources
docker node ls
# Inspect service constraints
docker service inspect --pretty SERVICE_NAME
Check disk space (common cause of failures):
# Check disk space on host
df -h
# Check Docker disk usage
docker system df
# Clean up unused resources if needed
# GCP
./static/scripts/docker-system-cleanup.sh
# Azure Linux
sudo /DevOps/scripts/cleanup_docker_images.sh
# Azure Windows VM - Run on Powershell as Admin
C:\DevOps\scripts\cleanup_docker_images.ps1
Check memory/CPU usage:
# Check container resource usage
docker stats --no-stream
Check network issues:
# List networks
docker network ls
# Inspect overlay network
docker network inspect NETWORK_NAME
# Check if ingress network is healthy
docker network inspect ingress
Check secrets/config availability:
# List secrets available
docker secret ls
# Check if service can access required secrets
docker service inspect SERVICE_NAME --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'
Try a force update to restart all tasks:
docker service update --force SERVICE_NAME
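Since a full disk is called out above as a common cause of failures, a quick filter over `df` output can show which filesystems are close to full before any cleanup script is run. The sketch below is illustrative: `full_filesystems` is a hypothetical helper, the 90% default threshold is an assumption, and it expects POSIX-format output (`df -P`) with mount points that contain no spaces.

```shell
# Reads `df -P` output on stdin and prints "MOUNT USE%" for each
# filesystem at or above the threshold (default 90, an assumed value).
full_filesystems() {
  threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {      # skip the header row
    use = $5
    sub(/%/, "", use)                  # "95%" -> "95"
    if (use + 0 >= t + 0) print $6, $5
  }'
}

# On the host:
#   df -P | full_filesystems 90
```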
Step 6: If Nothing Works
More aggressive troubleshooting:
Remove and recreate the service (if you have the deployment config):
docker service rm SERVICE_NAME
# Redeploy via ADO/AWX
Check if the node itself is unhealthy:
docker node inspect NODE_NAME
Drain and reactivate a problematic node:
docker node update --availability drain NODE_NAME
docker node update --availability active NODE_NAME
Check Docker daemon logs on the host:
journalctl -u docker --since "1 hour ago"
Step 7: When to Escalate
Escalate to the appropriate team when:
- DevOps Team: VM is unreachable, persistent node failures, underlying host issues, overlay network failures across multiple services, ingress issues
- Application Team: Application-specific errors in logs, configuration issues
Include the following information when escalating:
- Service name and VM hostname
- Error messages from docker service ps --no-trunc
- Relevant logs from docker service logs
- Steps already attempted
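The items in the list above can be gathered in one pass so the escalation message is complete. This is a sketch under assumptions: `collect_escalation_info` and the `DOCKER` override are hypothetical names, and the 100-line log tail mirrors the command used in Step 3.

```shell
# DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

# collect_escalation_info SERVICE_NAME
# Prints the service name, VM hostname, task errors, and recent logs.
collect_escalation_info() {
  service="$1"
  echo "Service: ${service}"
  echo "Host: $(hostname)"
  echo "--- docker service ps --no-trunc ---"
  "$DOCKER" service ps --no-trunc "$service"
  echo "--- docker service logs (last 100 lines) ---"
  "$DOCKER" service logs --tail 100 "$service" 2>&1
}

# Example: collect_escalation_info SERVICE_NAME > escalation.txt
```

The steps already attempted still need to be written up by hand.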
Common Error Messages Reference
| Error Message | Likely Cause | Solution |
|---|---|---|
| no suitable node | Node constraints not met, insufficient resources | Check node labels, resource availability |
| image not found / pull access denied | Missing image or auth issues | Verify image exists, check registry credentials |
| task: non-zero exit | Application crashed | Check docker service logs for app errors |
| container unhealthy | Health check failing | Check app health endpoint, review health check config |
| network not found | Overlay network missing | Recreate network or redeploy stack |
| secret not found | Missing Docker secret | Verify secret exists with docker secret ls |
| no space left on device | Disk full | Run docker system prune -f, expand disk |
| OOM killed (exit 137) | Out of memory | See Diagnosing OOM Kills below |
| context deadline exceeded | Timeout during operations | Check network connectivity, retry operation |
Diagnosing OOM Kills
An OOM (“out of memory”) kill happens when a process uses more memory than is available to it and the Linux kernel’s OOM killer terminates it. The signature is exit code 137 (128 + SIGKILL) and an abrupt death with no application traceback — the process is killed mid-execution, so it never gets to log a normal error.
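The exit-code arithmetic can be reproduced without Docker. In this small demonstration a child shell sends SIGKILL (signal 9) to itself, and the parent shell reports 128 + 9. It only illustrates how the status is encoded; it does not trigger the OOM killer.

```shell
# A child shell kills itself with SIGKILL; the parent reads its status.
status=0
sh -c 'kill -KILL $$' || status=$?

echo "exit status: ${status}"            # 137
echo "signal number: $((status - 128))"  # 9 (SIGKILL)
```

Any SIGKILL produces 137, so the code alone does not prove an OOM kill. The confirmation steps below distinguish an OOM kill from other causes.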
There are two distinct flavors, and they have different fixes:
- Container/cgroup limit exceeded — the service hit its own configured memory limit.
- Host memory exhaustion — the VM ran out of memory and the kernel killed the largest consumer, regardless of any per-container limit.
Step 1: Confirm it was an OOM kill
Check the task exit state:
docker service ps --no-trunc SERVICE_NAME
Look for a task that exited with code 137.
Check the container’s OOM flag:
docker inspect CONTAINER_ID --format '{{.State.OOMKilled}}'
true confirms the container’s PID 1 was OOM-killed. Note: if a child process was killed (e.g. a Celery worker forked under the main process), this flag can still be false; in that case, fall back to the kernel log below.
Check the kernel log on the host (the authoritative source):
journalctl -k | grep -i 'killed process'
# or
dmesg -T | grep -i -E 'oom|killed process'
This shows which PID and cgroup was killed and how much memory it was using when it died.
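Kernel OOM lines are long, so extracting just the PID, command name, and resident memory makes them easier to compare against limits. The sketch below assumes the common `Killed process PID (name) ... anon-rss:NkB` layout. The exact format varies by kernel version, so treat the pattern as a starting point. `parse_oom_kills` is a hypothetical helper.

```shell
# Reads kernel log lines on stdin and prints "PID COMMAND ANON_RSS_KB"
# for each line matching the assumed OOM-kill layout.
parse_oom_kills() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*anon-rss:([0-9]+)kB.*/\1 \2 \3/p'
}

# On the host:
#   journalctl -k | parse_oom_kills
```

Lines that do not match the pattern are silently skipped, so an empty result does not rule out an OOM kill on kernels that log in a different format.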
Step 2: Container limit vs. host exhaustion
Per-container limit: inspect the configured limit and compare against the RSS from the kernel log.
docker service inspect --pretty SERVICE_NAME  # check memory reservation/limit
If the killed process was near its own limit, it’s a per-container problem.
Host exhaustion: check overall host memory and per-container usage.
free -h
docker stats --no-stream
If total host memory is exhausted across many services, the fix is host-level (right-size the VM or rebalance services), not a single service limit.
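The comparison between the killed process's RSS and the service limit can be made explicit. This is a rough sketch: `classify_oom` is a hypothetical helper, and treating "within 10% of the limit" as "near its own limit" is an assumed cutoff, not a documented rule. A limit of 0 is treated as "no limit configured".

```shell
# classify_oom RSS_KB LIMIT_BYTES
# RSS_KB:      anon-rss from the kernel log, in kB
# LIMIT_BYTES: the service memory limit in bytes (0 = no limit)
classify_oom() {
  rss_bytes=$(( $1 * 1024 ))
  limit_bytes="$2"
  # 9/10 of the limit is an assumed threshold for "near the limit".
  if [ "$limit_bytes" -gt 0 ] && [ "$rss_bytes" -ge $(( limit_bytes * 9 / 10 )) ]; then
    echo "container-limit"
  else
    echo "check-host"
  fi
}

# The limit may be readable with the following (assumption: the field is
# only populated when a limit is set, and the template can fail otherwise):
#   docker service inspect SERVICE_NAME \
#     --format '{{.Spec.TaskTemplate.Resources.Limits.MemoryBytes}}'
```

A `check-host` result only means the per-container limit does not explain the kill. Confirm host exhaustion with `free -h` and `docker stats --no-stream` as described above.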
Known OOM: HomeAlign payor-auth bulk register
The most common OOM we see is the HomeAlign payor-auth bulk register job. BulkCreatePayorAuth.start_process loaded the entire eligible-member set into memory before doing any work. On large tenants (Centene at ~594k members, and now Humana) the Celery worker exceeded its memory and was OOM-killed. The tell is a Celery worker dying with no Python traceback.
Current mitigation (in prod): the job is dispatched through a sharding parent task that walks the member-eligibility id range in fixed-size chunks (default 50k ids) and runs one register task per shard. Each shard runs in a fresh worker process whose memory is reclaimed on exit, so peak memory is bounded by the shard size rather than the tenant size.
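To make the chunking concrete, the sketch below prints the inclusive id ranges that fixed-size sharding would produce. It is an illustration only, not the dispatcher's actual code: `shard_ranges` is a hypothetical function, and only the 50k default comes from the description above.

```shell
# shard_ranges MIN_ID MAX_ID [SHARD_SIZE]
# Prints one inclusive "START END" pair per shard (default size 50000).
shard_ranges() {
  start="$1"
  max="$2"
  size="${3:-50000}"
  while [ "$start" -le "$max" ]; do
    end=$(( start + size - 1 ))
    if [ "$end" -gt "$max" ]; then
      end="$max"                 # last shard may be smaller
    fi
    echo "$start $end"
    start=$(( end + 1 ))
  done
}

# Example: shard_ranges 1 120000 50000
```

For an id range of 1 to 120000 this yields three shards, and the last covers only 20000 ids. Peak memory per task then scales with the shard size rather than with the full range.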
If this OOM recurs:
- Confirm the job is being invoked through the sharded dispatcher, not by calling BulkCreatePayorAuth.start_process directly.
- Consider lowering the shard size if a single 50k-id shard is still too large.
- Be aware that explicit member-id lists and “limit to N members” calls intentionally bypass sharding (sharding would multiply the row-count limit), so those paths are not memory-bounded — run them with caution on large tenants.
Planned robust fix: a durable in-task chunking refactor that bounds memory inside the task itself (streaming/iterating the member set rather than loading it all at once), which will remove the need for the external sharding dispatcher. This is not yet implemented — the proposed approach is captured in the design doc.
General remediation
- Short term: raise the service’s memory limit in its swarm/Terraform service definition, or reduce task concurrency so fewer memory-heavy tasks run at once.
- Right fix: reduce the job’s peak memory — stream or chunk large datasets instead of loading them whole, and free intermediate objects.
- Escalation: an OOM is almost always an application memory-usage problem, so escalate to the application team first. Escalate to DevOps only if the host itself is under-provisioned (Step 2 points to host exhaustion).