docker-gcr-proxy
docker-gcr-proxy is a tiny per-node reverse proxy for
gcr.io: a single Go binary that runs on
every swarm manager and worker. The host’s
dockerd pulls images anonymously from
docker-gcr-proxy:7676; the proxy forwards the request to
https://gcr.io and attaches a fresh GCP
OAuth token on the upstream leg. It keeps that token refreshed locally
so a pull at any moment always carries a valid credential.
| Runs on | every swarm manager and worker |
| Listens | 127.0.0.1:7676 — loopback only |
| Upstream | https://gcr.io |
| Language | Go — single static binary, no third-party deps |
| Source | src/services/docker-gcr-proxy/ |
| Shipped as | generic Artifact Registry artifact, installed by the VM startup script |
Despite the “pull-through proxy” name, it does not cache image layers. “Pull-through” here means only that it forwards every request to gcr.io and adds auth. The only thing it caches is the OAuth token.
Why it exists
Pulling private gcr.io images on Swarm runs into moby/moby#31063:
- The node’s task executor pulls images on behalf of a task using only Spec.PullOptions.RegistryAuth, the credential baked into the service spec at deploy time by docker service ... --with-registry-auth. It ignores host credential helpers like docker-credential-gcr.
- That baked-in credential is a GCP OAuth token with a ~1h TTL, stored in Raft as part of the spec.
- Workers are an autoscaling MIG. A worker created or replaced more than ~1h after the last --with-registry-auth deploy receives a spec whose embedded token is already expired → the image pull fails → the task never starts.
- Re-embedding fresh tokens cluster-wide on a schedule would mean constant service-spec churn through Raft.
The proxy sidesteps all of it. Service specs point at
docker-gcr-proxy:7676 (no embedded credential needed), and
each node’s proxy injects a token it refreshes on its own. Token
refresh leaves Raft entirely — there’s no per-spec credential
to keep current.
What it does
| Path | Action |
|---|---|
| GET /healthz | 200 ok if a token is cached, else 503 no token yet (used by the startup-script readiness poll) |
| /v2, /v2/* | Reverse-proxy to gcr.io with a fresh Authorization: Bearer <token> |
| /artifacts-downloads/* | Reverse-proxy to gcr.io without the bearer (signed blob URLs self-authenticate) |
A normal image pull is a two-leg redirect chain (verified by
integration_test.go::TestProxy_BlobRedirectChain):
dockerd ──anonymous──▶ docker-gcr-proxy:7676
│ GET /v2/<img>/manifests|blobs/... proxy adds Bearer ──▶ gcr.io
│ ◀── 302 Location: /artifacts-downloads/.../signed-token
└─ follows redirect ──▶ proxy forwards (NO Bearer) ──▶ gcr.io ──▶ blob bytes
On every proxied request the director strips any inbound
Authorization first, sets
Host: gcr.io, then (for /v2* only) adds the
node’s own bearer token. Upstream failures return
502 upstream gcr.io error.
Token refresh
The proxy keeps one OAuth token cached and refreshes it ahead of
expiry, so a pull at any moment carries a valid credential. The token
comes from the GCE metadata server (the node’s own service account). On
boot it primes the cache, retrying with backoff; until the first token
arrives, /healthz answers 503 no token yet. If it can never get a token
it exits, because a node with no token is useless. Once primed, it
refreshes well before each token’s TTL runs out; if a refresh fails it
keeps the previous token (stale beats empty) and retries shortly after.
See the token refreshed log line for the live cadence.
Deployment
The binary is published as a generic Artifact Registry artifact (a raw binary, not a container image) and installed onto VMs by their startup script. There are three moving parts: build/publish, the AR repo + IAM, and the install.
1. Build & publish
Driven by the Build
docker-gcr-proxy workflow
(.github/workflows/build-docker-gcr-proxy.yml). You never
build the release binary from your laptop.
Triggers:
- Push a tag docker-gcr-proxy/v*: the version is parsed from the tag (docker-gcr-proxy/v0.2.0 → 0.2.0).
- Manual dispatch (workflow_dispatch) with a required version input (e.g. 0.1.3).
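The tag-to-version mapping is simple enough to show in a few lines. The workflow itself presumably does this in its own shell or YAML steps; the Go below is only an illustration of the rule, and versionFromTag is a hypothetical helper. Rejecting a tag with nothing after the v is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPrefix is the fixed part of a release tag such as docker-gcr-proxy/v0.2.0.
const tagPrefix = "docker-gcr-proxy/v"

// versionFromTag returns the version encoded in a release tag, or false if
// the tag does not start with the prefix or carries no version after it.
func versionFromTag(tag string) (string, bool) {
	if !strings.HasPrefix(tag, tagPrefix) {
		return "", false
	}
	v := strings.TrimPrefix(tag, tagPrefix)
	return v, v != ""
}

func main() {
	for _, tag := range []string{"docker-gcr-proxy/v0.2.0", "swarmctl/v1.0.0", "docker-gcr-proxy/v"} {
		v, ok := versionFromTag(tag)
		fmt.Printf("%-28s version=%q ok=%v\n", tag, v, ok)
	}
}
```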
The job builds the binary with
./zig/zig build docker-gcr-proxy and publishes it as
version VERSION to the docker-gcr-proxy
generic repo in prj-bu1-c-pkg-registry-f6f2. Tests gate the
broader pipeline (.github/workflows/ci-tests.yml).
2. Artifact Registry repo & IAM
infra/common-infra/business_unit_1/shared/pkg_registry/docker_gcr_proxy.tf
creates the generic AR repo and wires least-privilege,
repo-scoped IAM:
- roles/artifactregistry.writer → CI pipeline SA (publish only this repo).
- roles/artifactregistry.admin → hb-infra TF SAs (non-prod + prod), so they can set the repo IAM policy the swarm modules attach.
- roles/artifactregistry.reader → the swarm manager and worker runtime SAs (swarm_manager/iam.tf, swarm_worker/iam.tf). This is what lets a booting VM download the binary.
3. Install onto VMs
Version pin. docker_gcr_proxy_version
(module variable in both swarm_manager and
swarm_worker) defaults to "0.2.0". Neither
environment overrides it, so prod and non-prod run the same version.
Systemd unit source of truth. The root
swarm.tf reads the unit straight from the canonical file
and passes it into the modules, so edits to
docker-gcr-proxy.service flow to fresh VMs on the next
apply:
docker_gcr_proxy_unit_content = file(".../src/services/docker-gcr-proxy/docker-gcr-proxy.service")
Startup script — Phase 0.65 (identical on managers
and workers). Earlier, in Phase 0, the daemon.json merge already adds
"insecure-registries": ["docker-gcr-proxy:7676"] (the proxy
is plain HTTP on loopback). Phase 0.65 then:
- Adds 127.0.0.1 docker-gcr-proxy to /etc/hosts (idempotent).
- If /usr/local/bin/docker-gcr-proxy isn’t already present, runs gcloud artifacts generic download for the pinned version and installs it with install -m 0755 (idempotent skip if present; needs gcloud ≥ 451, enforced by Phase 0.63).
- Installs the systemd unit, then runs daemon-reload and enable --now.
- Polls http://docker-gcr-proxy:7676/healthz for up to 30s before continuing to Phase 1.
Systemd unit
(docker-gcr-proxy.service):
ExecStart=/usr/local/bin/docker-gcr-proxy --addr=127.0.0.1:7676,
Restart=on-failure/RestartSec=5s, runs as
nobody:nogroup, and is sandboxed
(NoNewPrivileges, ProtectSystem=strict,
ProtectHome, PrivateTmp). It is ordered
Before=docker.service and
After=network-online.target, logging to the journal.
Rolling a new version
- Publish: tag docker-gcr-proxy/vX.Y.Z (or run the build workflow with a version input). Confirm the version lands in the AR repo.
- Pin: bump the docker_gcr_proxy_version default (or set a per-env override), then apply hb-infra.
- Roll the fleets: both fleets are deliberately insulated from automatic recreation, so applying the pin does not update any running VM; only VMs created afterward (autoscaled/auto-healed workers, recreated managers) pick up the new version automatically. Roll the running VMs by hand, one node at a time, via the existing runbooks.
Operating it
All commands run on the node (over IAP SSH); the proxy is loopback-only.
Is it healthy?
curl -s http://docker-gcr-proxy:7676/healthz # → "ok"
503 no token yet means it’s up but hasn’t primed a token, almost always an IAM (missing AR reader) or metadata-server reachability problem.
Logs (structured JSON via slog):
journalctl -u docker-gcr-proxy -f
Watch for listening, token refreshed (with expires_in_seconds / next_refresh_seconds), scheduled token refresh failed, and upstream proxy error.
Service control:
systemctl status docker-gcr-proxy ·
systemctl restart docker-gcr-proxy.
Confirm dockerd actually routes through it: check
that /etc/docker/daemon.json lists
docker-gcr-proxy:7676 under
insecure-registries, then pull a private gcr.io image and
watch the proxy logs light up.
Gotchas:
- A 503 no token yet right after boot is normal for a second or two during priming; a sustained 503 is a credential/metadata problem.
- The proxy is not a layer cache: it won’t speed up repeated pulls, only keep them authenticated.
- It binds 127.0.0.1 only; there’s nothing to reach from another host.
Code map
src/services/docker-gcr-proxy/
├── docker-gcr-proxy.service # systemd unit (also the source TF injects onto VMs)
├── go.mod # self-contained module, no third-party deps
└── src/
├── main.go # wiring + lifecycle; --addr flag; graceful shutdown
├── proxy.go # buildMux (routes) + proxyHandler (director, auth injection)
├── cache.go # tokenCache: metadata fetch, adaptive refresh, atomic storage
├── assert.go # assert() invariant helper (TigerStyle)
└── internal/bearer/ # opaque, parse-validated bearer Token type
Development
A self-contained Go module
(src/services/docker-gcr-proxy/go.mod). Use the
project-local toolchain, never system Go.
- Test (what CI runs): ./bin/go test -C ./src/services/docker-gcr-proxy ./... -race. The high-signal tests are integration_test.go (the full blob-redirect chain + a fuzzer that throws attacker-controlled redirects at the director), property_test.go / refresh_test.go (token-refresh scheduling), and internal/bearer/bearer_test.go (the FuzzParse header-injection guard).
- Build the binary: ./zig/zig build docker-gcr-proxy → ./bin/docker-gcr-proxy.
- Repo gate: ./zig/zig build check.
There’s no local-run harness — the proxy needs the GCE metadata server for tokens, so the loop is change → run the test suite → validate on a non-prod node via the deploy flow above.
Related
- swarmctl — the other manager-side service on the same cluster.
- Docker Swarm Consolidation RFC — why we run Swarm and how the cluster is shaped.
- Cluster infrastructure: infra/hb-infra/modules/swarm_manager and swarm_worker, which hold the startup scripts, IAM, and version pin that install this proxy.
- AR repo + IAM: infra/common-infra/business_unit_1/shared/pkg_registry/docker_gcr_proxy.tf.