
Debug Docker Container Issues

Step-by-step guide for troubleshooting Docker Swarm services running on remote VMs.

Containers Are Down - Troubleshooting Flow

Step 1: Try Redeploying

First, attempt to redeploy the application through:

- Azure DevOps (ADO) - for CI/CD pipelines
- AWX - for Ansible-based deployments

If redeployment succeeds, verify the service is healthy. If it fails or the issue persists, continue to Step 2.

Step 2: Find the VM and SSH In

Identify the VM the application is running on:
- hb-p-vm-$ID [Greens, DHA, Buzz, Eligibility API]
- bees-p-vm-$ID [Yellow]
- pdp$NUM-p-vm-$ID [Blues, CoPos]
- [HA Admin Portals]
  - Check the name of the Deployment Group in the Azure DevOps WebApp release definition.
  - The name of the Deployment Group matches the name of the VM in Azure.

SSH into the appropriate VM:
- SSH Into GCP VMs
- SSH Into Azure VMs

Step 3: Docker Debug Commands

Once on the VM, check service status:

List all services:
```
docker service ls
```
If REPLICAS shows fewer running tasks than desired (for example, 0/1), the service is down.

View detailed task info with full error messages:
```
docker service ps --no-trunc SERVICE_NAME
```

Check service logs:

```
docker service logs --tail 100 SERVICE_NAME
```
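
The three checks above can be chained so that only services with missing replicas are examined. This is a minimal sketch rather than an official script: the function names `degraded_services` and `triage_degraded` are hypothetical, and it assumes the replica column prints as `running/desired`.

```shell
# Print the names of services whose running replica count is below the desired count.
# Reads "NAME REPLICAS" lines on stdin, for example "web 0/1".
degraded_services() {
  awk '{
    split($2, r, "/")                 # r[1] = running, r[2] = desired
    if (r[1] + 0 < r[2] + 0) print $1
  }'
}

# For each degraded service, show its task errors and recent logs.
triage_degraded() {
  docker service ls --format '{{.Name}} {{.Replicas}}' | degraded_services |
  while read -r svc; do
    echo "=== $svc ==="
    docker service ps --no-trunc "$svc"
    docker service logs --tail 100 "$svc"
  done
}
```

Run `triage_degraded` on the VM after sourcing the functions; nothing is changed on the swarm, it only reads state.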

Step 4: Rescale the Service

Try scaling the service down and back up:

```
# Scale down to 0
docker service scale SERVICE_NAME=0

# Wait a few seconds, then scale back up
docker service scale SERVICE_NAME=1
```

Verify the service is running:

```
docker service ls
```
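
Instead of re-running `docker service ls` by hand, the verification can be polled. The sketch below is one possible approach, not part of the existing tooling: `service_replicas` and `wait_for_service` are hypothetical names, and the 5-second poll interval and 60-second default timeout are arbitrary assumptions.

```shell
# Print the REPLICAS value (for example "1/1") for exactly the named service.
# The name filter matches by prefix, so awk keeps only the exact match.
service_replicas() {
  docker service ls --filter "name=$1" --format '{{.Name}} {{.Replicas}}' |
    awk -v name="$1" '$1 == name { print $2 }'
}

# Poll until running == desired (and desired > 0), or give up after a timeout.
# Arguments: SERVICE_NAME [TIMEOUT_SECONDS, default 60]
wait_for_service() (
  svc=$1
  timeout=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    replicas=$(service_replicas "$svc")
    running=${replicas%%/*}
    desired=${replicas##*/}
    if [ -n "$replicas" ] && [ "$desired" -gt 0 ] && [ "$running" -eq "$desired" ]; then
      echo "$svc is up ($replicas)"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "$svc did not converge within ${timeout}s (last seen: ${replicas:-none})" >&2
  return 1
)
```

For example, `wait_for_service SERVICE_NAME 120` returns non-zero if the service is still short of replicas after two minutes, which is the signal to move on to the next step.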

If rescaling works, you’re done. If not, continue to Step 5.

Step 5: If Rescaling Doesn’t Work

Check for image pull issues:
```
docker service ps --no-trunc SERVICE_NAME | grep -i error
```
If there are image errors, verify the image exists and credentials are correct.

Check for resource issues:

```
# Check node resources
docker node ls

# Inspect service constraints
docker service inspect --pretty SERVICE_NAME
```
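
When the node list is long, it can help to print only the nodes that cannot take tasks. A minimal sketch, assuming the usual `Ready` status and `Active` availability values; the function name `unschedulable_nodes` is hypothetical.

```shell
# Print nodes that cannot currently run tasks:
# status is not Ready, or availability is not Active.
# Reads "HOSTNAME STATUS AVAILABILITY" lines on stdin.
unschedulable_nodes() {
  awk '$2 != "Ready" || $3 != "Active" {
    print $1 ": status=" $2 ", availability=" $3
  }'
}

# Usage on a manager node:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | unschedulable_nodes
```

Empty output means every node is ready and active, so the cause is more likely a placement constraint or a resource reservation shown by `docker service inspect --pretty`.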

Check disk space (common cause of failures):

```
# Check disk space on host
df -h

# Check Docker disk usage
docker system df

# Clean up unused resources if needed (run the script for your platform)

# GCP
./static/scripts/docker-system-cleanup.sh

# Azure Linux
sudo /DevOps/scripts/cleanup_docker_images.sh

# Azure Windows VM - run in PowerShell as Admin
C:\DevOps\scripts\cleanup_docker_images.ps1
```
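
To decide quickly whether a cleanup is needed on a Linux VM, the `df` output can be filtered for nearly full filesystems. This is a sketch under assumptions: `full_filesystems` is a hypothetical helper, the 90% default threshold is arbitrary, and mount points containing spaces are not handled.

```shell
# Print mount points whose usage is at or above a threshold percentage (default 90).
# Reads `df -P` output on stdin; -P forces the one-line-per-filesystem POSIX format.
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    used = $5
    sub(/%$/, "", used)
    if (used + 0 >= limit + 0) print $6 " " $5
  }'
}

# Usage:
#   df -P | full_filesystems 85
```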

Check memory/CPU usage:

```
# Check container resource usage
docker stats --no-stream
```
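
The `docker stats` output can be narrowed to containers that are close to their memory ceiling. A minimal sketch: `high_memory_containers` is a hypothetical name and the 80% default is an arbitrary assumption. The memory percentage is relative to the container's limit, or to host memory when no limit is set.

```shell
# Print containers whose memory usage is at or above a threshold percentage (default 80).
# Reads "NAME MEM%" lines on stdin, for example "web.1.abc 91.50%".
high_memory_containers() {
  awk -v limit="${1:-80}" '{
    pct = $2
    sub(/%$/, "", pct)
    if (pct + 0 >= limit + 0) print $1 " " $2
  }'
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | high_memory_containers 80
```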

Check network issues:

```
# List networks
docker network ls

# Inspect overlay network
docker network inspect NETWORK_NAME

# Check if ingress network is healthy
docker network inspect ingress
```

Check secrets/config availability:

```
# List secrets available
docker secret ls

# Show which secrets the service references, then compare against the list above
docker service inspect SERVICE_NAME --format '{{.Spec.TaskTemplate.ContainerSpec.Secrets}}'
```

Try a force update to restart all tasks:

```
docker service update --force SERVICE_NAME
```

Step 6: If Nothing Works

More aggressive troubleshooting:

Remove and recreate the service (if you have the deployment config):
```
docker service rm SERVICE_NAME
# Redeploy via ADO/AWX
```
Check if the node itself is unhealthy:
```
docker node inspect NODE_NAME
```

Drain and reactivate a problematic node:

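```
# Drain: the scheduler stops the node's tasks and reschedules them on other available nodes
docker node update --availability drain NODE_NAME

# Reactivate: the node accepts new tasks again (existing tasks are not moved back automatically)
docker node update --availability active NODE_NAME
```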
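
Reactivating immediately after draining may not give the scheduler time to move tasks off the node. One way to add a wait is sketched below; `drain_and_reactivate` is a hypothetical helper, the 120-second default timeout is an assumption, and the node is reactivated even if the timeout expires.

```shell
# Drain a node, wait until no tasks are still meant to run on it, then reactivate it.
# Arguments: NODE_NAME [TIMEOUT_SECONDS, default 120]
drain_and_reactivate() (
  node=$1
  timeout=${2:-120}
  elapsed=0
  docker node update --availability drain "$node" || return 1
  while [ "$elapsed" -lt "$timeout" ]; do
    # Tasks whose desired state is still "running" have not been moved off yet.
    remaining=$(docker node ps "$node" --filter desired-state=running --format '{{.Name}}')
    [ -z "$remaining" ] && break
    sleep 5
    elapsed=$((elapsed + 5))
  done
  docker node update --availability active "$node"
)
```

On a single-node swarm there is nowhere for tasks to go while the node is drained, so expect the services to be down until the node is active again.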

Check Docker daemon logs on the host:

```
journalctl -u docker --since "1 hour ago"
```

Step 7: When to Escalate

Escalate to the appropriate team when:

- DevOps Team: VM is unreachable, persistent node failures, underlying host issues, overlay network failures across multiple services, ingress issues
- Application Team: application-specific errors in logs, configuration issues

Include the following information when escalating:

- Service name and VM hostname
- Error messages from `docker service ps --no-trunc`
- Relevant logs from `docker service logs`
- Steps already attempted
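
The items above can be gathered into one text file to attach to the escalation. This is a sketch, not an existing script: `escalation_report` is a hypothetical name, and the output layout is only a suggestion.

```shell
# Print an escalation summary for a service: host, task errors, and recent logs.
# Arguments: SERVICE_NAME [LOG_LINES, default 100]
escalation_report() {
  echo "Service: $1"
  echo "VM hostname: $(uname -n)"
  echo "Collected (UTC): $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
  echo
  echo "--- docker service ps --no-trunc ---"
  docker service ps --no-trunc "$1" 2>&1
  echo
  echo "--- docker service logs (last ${2:-100} lines) ---"
  docker service logs --tail "${2:-100}" "$1" 2>&1
  echo
  echo "Steps already attempted:"
  echo "  (fill in: redeploy, rescale, force update, ...)"
}

# Usage:
#   escalation_report SERVICE_NAME > /tmp/SERVICE_NAME-escalation.txt
```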

Common Error Messages Reference

| Error Message | Likely Cause | Solution |
| --- | --- | --- |
| `no suitable node` | Node constraints not met, insufficient resources | Check node labels, resource availability |
| `image not found` / `pull access denied` | Missing image or auth issues | Verify image exists, check registry credentials |
| `task: non-zero exit` | Application crashed | Check `docker service logs` for app errors |
| `container unhealthy` | Health check failing | Check app health endpoint, review health check config |
| `network not found` | Overlay network missing | Recreate network or redeploy stack |
| `secret not found` | Missing Docker secret | Verify secret exists with `docker secret ls` |
| `no space left on device` | Disk full | Run `docker system prune -f`, expand disk |
| `OOM killed` (exit 137) | Out of memory | Diagnose OOM kills |
| `context deadline exceeded` | Timeout during operations | Check network connectivity, retry operation |
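
The lookup in the table above can also be expressed as a small helper for scripts that scan task errors. This is a sketch: `likely_cause` is a hypothetical name, matching is a case-insensitive substring test, and real Docker error strings may be worded differently from the table entries.

```shell
# Map a task error message to the likely cause listed in the table.
# More specific patterns come first: exit 137 must be tested before the generic non-zero exit.
likely_cause() (
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"no suitable node"*)          echo "Node constraints not met or insufficient resources" ;;
    *"image not found"* | *"pull access denied"*)
                                   echo "Missing image or auth issues" ;;
    *"no space left on device"*)   echo "Disk full" ;;
    *"network not found"*)         echo "Overlay network missing" ;;
    *"secret not found"*)          echo "Missing Docker secret" ;;
    *"context deadline exceeded"*) echo "Timeout during operations" ;;
    *"unhealthy"*)                 echo "Health check failing" ;;
    *"oom"* | *"(137)"*)           echo "Possible OOM kill (exit 137)" ;;
    *"non-zero exit"*)             echo "Application crashed" ;;
    *)                             echo "Unknown; check docker service logs" ;;
  esac
)

# Usage:
#   likely_cause "task: non-zero exit (137)"
```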

Diagnose OOM kills

An OOM (“out of memory”) kill happens when a process uses more memory than is available to it and the Linux kernel’s OOM killer terminates it. The signature is exit code 137 (128 + 9, where 9 is the signal number of SIGKILL) and an abrupt death with no application traceback: the process is killed mid-execution, so it never gets to log a normal error.

There are two distinct flavors, and they have different fixes:

- Container/cgroup limit exceeded: the service hit its own configured memory limit.
- Host memory exhaustion: the VM ran out of memory and the kernel killed the largest consumer, regardless of any per-container limit.

Step 1: Confirm it was an OOM kill

Check the task exit state:
```
docker service ps --no-trunc SERVICE_NAME
```
Look for a task that exited with code 137.

Check the container’s OOM flag (find the container ID with `docker ps -a` on the node where the task ran):
```
docker inspect CONTAINER_ID --format '{{.State.OOMKilled}}'
```
`true` confirms the container’s PID 1 was OOM-killed. Note: if a child process was killed (e.g. a Celery worker forked under the main process), this flag can still be `false`; in that case, fall back to the kernel log below.

Check the kernel log on the host (the authoritative source):
```
journalctl -k | grep -i 'killed process'
# or
dmesg -T | grep -i -E 'oom|killed process'
```
This shows which PID and cgroup were killed and how much memory the process was using when it died.
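
The kernel log lines can be reduced to the fields that matter for the next step. The sketch below is an illustration, not an existing tool: `parse_oom_kills` is a hypothetical name, and it assumes the message layout used by recent kernels (`Killed process PID (NAME) ... anon-rss:NkB`). The wording varies between kernel versions, so lines it cannot classify are labelled `unknown`.

```shell
# Summarize OOM kill lines from the kernel log as: SCOPE PID NAME ANON_RSS_KB
# SCOPE is "cgroup" (a memory limit was hit), "host" (global OOM), or "unknown".
parse_oom_kills() {
  awk '/[Kk]illed process [0-9]+/ {
    if ($0 ~ /Memory cgroup out of memory/) scope = "cgroup"
    else if ($0 ~ /Out of memory/)          scope = "host"
    else                                    scope = "unknown"
    pid = ""; name = ""; rss = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "process" && pid == "") { pid = $(i + 1); name = $(i + 2) }
      if ($i ~ /^anon-rss:/) { rss = $i; sub(/^anon-rss:/, "", rss); sub(/kB,?$/, "", rss) }
    }
    gsub(/[(),]/, "", name)
    print scope, pid, name, rss
  }'
}

# Usage:
#   journalctl -k | parse_oom_kills
```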

Step 2: Container limit vs. host exhaustion

Per-container limit: inspect the configured limit and compare against the RSS from the kernel log.
```
docker service inspect --pretty SERVICE_NAME   # check memory reservation/limit
```
If the killed process was near its own limit, it’s a per-container problem.

Host exhaustion: check overall host memory and per-container usage.
```
free -h
docker stats --no-stream
```
If total host memory is exhausted across many services, the fix is host-level (right-size the VM or rebalance services), not a single service limit.
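
The "near its own limit" comparison is simple arithmetic once both numbers are in the same unit. A minimal sketch, assuming the RSS is taken in kB from the kernel log and the limit in bytes from the service spec; `near_memory_limit` and the 90% default are hypothetical. The cgroup limit also counts page cache and kernel memory, so treat the result as approximate.

```shell
# Succeed if the killed process's RSS was at or above PCT percent of the memory limit.
# Arguments: RSS_KB (anon-rss from the kernel log), LIMIT_BYTES, [PCT, default 90]
near_memory_limit() {
  [ "$2" -gt 0 ] && [ $(( $1 * 1024 * 100 )) -ge $(( $2 * ${3:-90} )) ]
}

# Usage (assumes the service has a memory limit configured; without one the template may fail):
#   limit=$(docker service inspect SERVICE_NAME \
#     --format '{{.Spec.TaskTemplate.Resources.Limits.MemoryBytes}}')
#   near_memory_limit 2097152 "$limit" && echo "per-container limit problem"
```

For example, an RSS of 2097152 kB against a limit of 2147483648 bytes (2 GiB) is at 100% of the limit, which points to the per-container case.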

Known OOM: HomeAlign payor-auth bulk register

The most common OOM we see is the HomeAlign payor-auth bulk register job. BulkCreatePayorAuth.start_process loaded the entire eligible-member set into memory before doing any work. On large tenants (Centene at ~594k members, and now Humana) the Celery worker exceeded its memory and was OOM-killed — the tell is a Celery worker dying with no Python traceback.

Current mitigation (in prod): the job is dispatched through a sharding parent task that walks the member-eligibility id range in fixed-size chunks (default 50k ids) and runs one register task per shard. Each shard runs in a fresh worker process whose memory is reclaimed on exit, so peak memory is bounded by the shard size rather than the tenant size.
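
To make the shard arithmetic concrete, the sketch below prints the id ranges a fixed-size walk would produce. It is only an illustration: the real dispatcher is a Celery task that this page does not show, `shard_ranges` is a hypothetical name, and the id bounds in the usage line are made up.

```shell
# Print "START END" id ranges covering MIN_ID..MAX_ID in chunks of SIZE ids.
# Arguments: MIN_ID MAX_ID [SIZE, default 50000]
shard_ranges() (
  start=$1
  max=$2
  size=${3:-50000}
  while [ "$start" -le "$max" ]; do
    end=$((start + size - 1))
    [ "$end" -gt "$max" ] && end=$max   # the last shard may be shorter
    echo "$start $end"
    start=$((end + 1))
  done
)

# Usage (hypothetical id range):
#   shard_ranges 1 120000 50000
```

Each printed range corresponds to one register task, so peak memory depends on the chunk size and not on how many ranges there are.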

If this OOM recurs:

- Confirm the job is being invoked through the sharded dispatcher, not by calling `BulkCreatePayorAuth.start_process` directly.
- Consider lowering the shard size if a single 50k-id shard is still too large.
- Be aware that explicit member-id lists and “limit to N members” calls intentionally bypass sharding (sharding would multiply the row-count limit), so those paths are not memory-bounded; run them with caution on large tenants.

Planned robust fix: a durable in-task chunking refactor that bounds memory inside the task itself (streaming/iterating the member set rather than loading it all at once), which will remove the need for the external sharding dispatcher. This is not yet implemented — the proposed approach is captured in the design doc.

General remediation

- Short term: raise the service’s memory limit in its Swarm/Terraform service definition, or reduce task concurrency so fewer memory-heavy tasks run at once.
- Right fix: reduce the job’s peak memory by streaming or chunking large datasets instead of loading them whole, and by freeing intermediate objects.
- Escalation: an OOM is almost always an application memory-usage problem, so escalate to the application team first. Escalate to DevOps only if the host itself is under-provisioned (Step 2 points to host exhaustion).
