docker-gcr-proxy

docker-gcr-proxy is a tiny per-node reverse proxy for gcr.io: a single Go binary that runs on every swarm manager and worker. The host’s dockerd pulls images anonymously from docker-gcr-proxy:7676; the proxy forwards the request to https://gcr.io and attaches a fresh GCP OAuth token on the upstream leg. It keeps that token refreshed locally so a pull at any moment always carries a valid credential.

Runs on	every swarm manager and worker
Listens	`127.0.0.1:7676` — loopback only
Upstream	`https://gcr.io`
Language	Go — single static binary, no third-party deps
Source	`src/services/docker-gcr-proxy/`
Shipped as	generic Artifact Registry artifact, installed by the VM startup script

Despite the “pull-through proxy” name, it does not cache image layers. “Pull-through” here means only that it forwards every request to gcr.io and adds auth. The only thing it caches is the OAuth token.

Why it exists

Pulling private gcr.io images on Swarm runs into moby/moby#31063:

The node’s task executor pulls images on behalf of a task using only Spec.PullOptions.RegistryAuth — the credential baked into the service spec at deploy time by docker service ... --with-registry-auth. It ignores host credential helpers like docker-credential-gcr.
That baked-in credential is a GCP OAuth token with a ~1h TTL, stored in Raft as part of the spec.
Workers are an autoscaling MIG. A worker created or replaced more than ~1h after the last --with-registry-auth deploy receives a spec whose embedded token is already expired → the image pull fails → the task never starts.
Re-embedding fresh tokens cluster-wide on a schedule would mean constant service-spec churn through Raft.

The proxy sidesteps all of it. Service specs point at docker-gcr-proxy:7676 (no embedded credential needed), and each node’s proxy injects a token it refreshes on its own. Token refresh leaves Raft entirely — there’s no per-spec credential to keep current.

What it does

Path	Action
`GET /healthz`	`200 ok` if a token is cached, else `503 no token yet` (used by the startup-script readiness poll)
`/v2`, `/v2/*`	Reverse-proxy to gcr.io with a fresh `Authorization: Bearer <token>`
`/artifacts-downloads/*`	Reverse-proxy to gcr.io without the bearer (signed blob URLs self-authenticate)

A normal image pull is a two-leg redirect chain (verified by integration_test.go::TestProxy_BlobRedirectChain):

dockerd ──anonymous──▶ docker-gcr-proxy:7676
   │  GET /v2/<img>/manifests|blobs/...   proxy adds Bearer ──▶ gcr.io
   │  ◀── 302 Location: /artifacts-downloads/.../signed-token
   └─ follows redirect ──▶ proxy forwards (NO Bearer) ──▶ gcr.io ──▶ blob bytes

On every proxied request the director strips any inbound Authorization first, sets Host: gcr.io, then (for /v2* only) adds the node’s own bearer token. Upstream failures return 502 upstream gcr.io error.

Token refresh

The proxy keeps one OAuth token cached and refreshes it ahead of expiry, so a pull at any moment carries a valid credential. The token comes from the GCE metadata server and belongs to the node’s own service account. On boot the proxy primes the cache, retrying with backoff; until the first fetch succeeds, /healthz returns 503 no token yet. If priming never succeeds the process exits, because a node with no token is useless. After that it refreshes well before each token’s TTL runs out; if a refresh fails it keeps the previous token (stale beats empty) and retries shortly after. See the token refreshed log line for the live cadence.

Deployment

The binary is published as a generic Artifact Registry artifact (a raw binary, not a container image) and installed onto VMs by their startup script. There are three moving parts: build/publish, the AR repo + IAM, and the install.

1. Build & publish

Driven by the Build docker-gcr-proxy workflow (.github/workflows/build-docker-gcr-proxy.yml). You never build the release binary from your laptop.

Triggers:

Push a tag docker-gcr-proxy/v* — the version is parsed from the tag (docker-gcr-proxy/v0.2.0 → 0.2.0).
Manual dispatch (workflow_dispatch) with a required version input (e.g. 0.1.3).
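The tag-to-version rule is simple enough to pin down in a few lines. The workflow itself most likely does this in shell; the Go below is only an illustration of the mapping, and versionFromTag is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// versionFromTag extracts X.Y.Z from a tag of the form docker-gcr-proxy/vX.Y.Z.
// It reports false for tags that do not carry the expected prefix.
func versionFromTag(tag string) (string, bool) {
	v, ok := strings.CutPrefix(tag, "docker-gcr-proxy/v")
	if !ok || v == "" {
		return "", false
	}
	return v, true
}

func main() {
	for _, tag := range []string{"docker-gcr-proxy/v0.2.0", "swarmctl/v1.0.0"} {
		v, ok := versionFromTag(tag)
		fmt.Printf("%s -> %q (matched=%v)\n", tag, v, ok)
	}
}
```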

The job builds the binary with ./zig/zig build docker-gcr-proxy and publishes it as version VERSION to the docker-gcr-proxy generic repo in prj-bu1-c-pkg-registry-f6f2. Tests gate the broader pipeline (.github/workflows/ci-tests.yml).

2. Artifact Registry repo & IAM

infra/common-infra/business_unit_1/shared/pkg_registry/docker_gcr_proxy.tf creates the generic AR repo and wires least-privilege, repo-scoped IAM:

roles/artifactregistry.writer → CI pipeline SA (publish only this repo).
roles/artifactregistry.admin → hb-infra TF SAs (non-prod + prod), so they can set the repo IAM policy the swarm modules attach.
roles/artifactregistry.reader → the swarm manager and worker runtime SAs (swarm_manager/iam.tf, swarm_worker/iam.tf). This is what lets a booting VM download the binary.

3. Install onto VMs

Version pin. docker_gcr_proxy_version (module variable in both swarm_manager and swarm_worker) defaults to "0.2.0". Neither environment overrides it, so prod and non-prod run the same version.

Systemd unit source of truth. The root swarm.tf reads the unit straight from the canonical file and passes it into the modules, so edits to docker-gcr-proxy.service flow to fresh VMs on the next apply:

docker_gcr_proxy_unit_content = file(".../src/services/docker-gcr-proxy/docker-gcr-proxy.service")

Startup script — Phase 0.65 (identical on managers and workers). Earlier, in Phase 0, the daemon.json merge already adds "insecure-registries": ["docker-gcr-proxy:7676"] (the proxy is plain HTTP on loopback). Phase 0.65 then:

Adds 127.0.0.1 docker-gcr-proxy to /etc/hosts (idempotent).
If /usr/local/bin/docker-gcr-proxy isn’t already present, gcloud artifacts generic download the pinned version and install -m 0755 it (idempotent skip if present; needs gcloud ≥ 451, enforced by Phase 0.63).
Installs the systemd unit, daemon-reload, enable --now.
Polls http://docker-gcr-proxy:7676/healthz for up to 30s before continuing to Phase 1.

Systemd unit (docker-gcr-proxy.service): ExecStart=/usr/local/bin/docker-gcr-proxy --addr=127.0.0.1:7676, Restart=on-failure/RestartSec=5s, runs as nobody:nogroup, and is sandboxed (NoNewPrivileges, ProtectSystem=strict, ProtectHome, PrivateTmp). It is ordered Before=docker.service and After=network-online.target, logging to the journal.

Rolling a new version

1. Publish: tag docker-gcr-proxy/vX.Y.Z (or run the build workflow with a version input). Confirm the version lands in the AR repo.
2. Pin: bump the docker_gcr_proxy_version default (or set a per-env override), then apply hb-infra.
3. Roll the fleets: both fleets are deliberately insulated from automatic recreation, so applying the pin changes no running VM. Only VMs created afterward (autoscaled or auto-healed workers, recreated managers) pick up the new version automatically. Roll the running VMs by hand, one node at a time, via the existing runbooks:
   - Workers: Rolling Update of Swarm Worker Nodes
   - Managers: Rolling Boot Image Upgrade for Swarm Managers

Operating it

All commands run on the node (over IAP SSH); the proxy is loopback-only.

Is it healthy?

curl -s http://docker-gcr-proxy:7676/healthz   # → "ok"

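503 no token yet means it’s up but hasn’t primed a token. Because the token comes from the metadata server, this is almost always a metadata-server reachability problem or a problem with the node’s service account. A missing AR reader role shows up earlier, as a failed binary download in the startup script.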

Logs (structured JSON via slog):

journalctl -u docker-gcr-proxy -f

Watch for listening, token refreshed (with expires_in_seconds / next_refresh_seconds), scheduled token refresh failed, and upstream proxy error.

Service control: systemctl status docker-gcr-proxy · systemctl restart docker-gcr-proxy.

Confirm dockerd actually routes through it: check that /etc/docker/daemon.json lists docker-gcr-proxy:7676 under insecure-registries, then pull a private gcr.io image and watch the proxy logs light up.

Gotchas:

A 503 no token yet right after boot is normal for a second or two during priming; sustained 503 is a credential/metadata problem.
The proxy is not a layer cache — it won’t speed up repeated pulls, only keep them authenticated.
It binds 127.0.0.1 only; there’s nothing to reach from another host.

Code map

src/services/docker-gcr-proxy/
├── docker-gcr-proxy.service   # systemd unit (also the source TF injects onto VMs)
├── go.mod                     # self-contained module, no third-party deps
└── src/
    ├── main.go                # wiring + lifecycle; --addr flag; graceful shutdown
    ├── proxy.go               # buildMux (routes) + proxyHandler (director, auth injection)
    ├── cache.go               # tokenCache: metadata fetch, adaptive refresh, atomic storage
    ├── assert.go              # assert() invariant helper (TigerStyle)
    └── internal/bearer/       # opaque, parse-validated bearer Token type

Development

A self-contained Go module (src/services/docker-gcr-proxy/go.mod). Use the project-local toolchain, never system Go.

Test (what CI runs): ./bin/go test -C ./src/services/docker-gcr-proxy ./... -race. The high-signal tests are integration_test.go (the full blob-redirect chain + a fuzzer that throws attacker-controlled redirects at the director), property_test.go / refresh_test.go (token-refresh scheduling), and internal/bearer/bearer_test.go (the FuzzParse header-injection guard).
Build the binary: ./zig/zig build docker-gcr-proxy → ./bin/docker-gcr-proxy.
Repo gate: ./zig/zig build check.

There’s no local-run harness — the proxy needs the GCE metadata server for tokens, so the loop is change → run the test suite → validate on a non-prod node via the deploy flow above.

swarmctl — the other manager-side service on the same cluster.
Docker Swarm Consolidation RFC — why we run Swarm and how the cluster is shaped.
Cluster infrastructure: infra/hb-infra/modules/swarm_manager and swarm_worker — the startup scripts, IAM, and version pin that install this proxy.
AR repo + IAM: infra/common-infra/business_unit_1/shared/pkg_registry/docker_gcr_proxy.tf.
