
RFC: Triage Rotation for DevOps Team

Status: Draft | Author: DevOps Team | Created: 2026-02-03

Overview

Establish a weekly triage rotation for the DevOps team with clear escalation paths and burnout prevention measures.

Background

Global Triage was disbanded, leaving a 1-week gap where Afaq fielded all triage until DevOps could establish a formal process. A 3-person team cannot sustain ad-hoc response without defined rotation and escalation paths.

Expected trajectory: Initial rollout will be front-loaded with triage work as we identify orphaned alerts, reroute misplaced alerts to appropriate teams, and build runbooks. As routing matures and ownership stabilizes, maintenance overhead should decrease over time.

DARCI

Activity | Decision-Maker | Accountable | Responsible | Consulted | Informed
RFC approval | Afaq | AJ | DevOps Team | | Engineering

Rotation Schedule

Cadence: Weekly (Monday 9am to Monday 9am)

Week | Primary | Secondary
1 | AJ | Miko
2 | Miko | Topher
3 | Topher | AJ

Cycle repeats. Secondary becomes the following week’s Primary.
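The 3-week cycle above can be computed directly. A minimal sketch, assuming Week 1 begins on Monday 2026-02-09 (a hypothetical anchor date, not stated in this RFC) and ignoring the 9am shift boundary:

```python
from datetime import date

# 3-week cycle from the Rotation Schedule table.
ROTATION = [("AJ", "Miko"), ("Miko", "Topher"), ("Topher", "AJ")]

# Assumed anchor: the Monday that Week 1 begins (hypothetical).
CYCLE_START = date(2026, 2, 9)

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the shift covering `day`.

    Simplification: shifts are treated as whole calendar weeks; the
    real boundary is Monday 9am as defined in the Cadence section.
    """
    weeks_elapsed = (day - CYCLE_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]
```

For example, `on_call(date(2026, 2, 16))` lands in Week 2, so Miko is primary and Topher is secondary.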

Responsibilities

Primary: First responder to all PagerDuty alerts. Triage, investigate, resolve or escalate as Type 1. Update runbooks if applicable.

Secondary: Escalation point if primary unavailable. Covers PTO/sick days. Reviews incident documentation.

Escalation Path

Alert → Primary → Secondary → AJ (Lead) / Afaq (EM)

Escalate when: no response from current tier, need additional expertise, or major incident declared.

Handoff

Every Monday before standup. PagerDuty rotation is automatic.

Daily Overnight Review

Primary checks overnight/off-hours alerts at 9am each day:

  1. Review any alerts that fired outside business hours
  2. Acknowledge, resolve, or escalate as needed
  3. Note patterns or recurring issues for standup
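Step 1 amounts to filtering alerts by when they fired. A minimal sketch, assuming business hours of 9am to 6pm on weekdays (this RFC does not define business hours, so those bounds are an assumption):

```python
from datetime import datetime, time

# Assumed business hours; not defined explicitly in this RFC.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

def is_off_hours(fired_at: datetime) -> bool:
    """True if an alert fired outside business hours (nights/weekends)."""
    if fired_at.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= fired_at.time() < BUSINESS_END)

def overnight_review(alerts: list[dict]) -> list[dict]:
    """Filter alerts (each with a 'fired_at' datetime) down to the
    off-hours ones the primary should review at 9am."""
    return [a for a in alerts if is_off_hours(a["fired_at"])]
```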

Time Off

  • Swap shifts with another team member (update PagerDuty override)
  • If both primary and secondary unavailable, escalate to AJ or Afaq
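A shift swap is recorded as a schedule override in PagerDuty. A sketch of the request body for PagerDuty's create-override endpoint (`POST /schedules/{schedule_id}/overrides`); the user ID below is a placeholder, and the exact body shape should be checked against current PagerDuty API documentation:

```python
import json

def override_payload(user_id: str, start_iso: str, end_iso: str) -> str:
    """Build the JSON body for a PagerDuty schedule override.

    user_id is a PagerDuty user reference (placeholder values here,
    not real identifiers); start/end are ISO 8601 timestamps covering
    the swapped shift.
    """
    body = {
        "overrides": [
            {
                "start": start_iso,
                "end": end_iso,
                "user": {"id": user_id, "type": "user_reference"},
            }
        ]
    }
    return json.dumps(body)
```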

Triage Overload Protocol

Thresholds (either triggers protocol):

  • More than 3 hours/day on triage
  • More than 40% of week on triage
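The two thresholds can be checked mechanically from a daily log of triage hours. A minimal sketch, assuming a 40-hour working week as the baseline for the 40% threshold (the RFC does not state the baseline):

```python
def overload_triggered(triage_hours_by_day: list[float],
                       weekly_work_hours: float = 40.0) -> bool:
    """True if either overload threshold is hit:
    more than 3 hours of triage in any single day, or
    triage consuming more than 40% of the working week."""
    if any(h > 3.0 for h in triage_hours_by_day):
        return True
    return sum(triage_hours_by_day) > 0.4 * weekly_work_hours
```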

When triggered:

  1. Post in #devops: what alerts are consuming time, patterns observed, time spent
  2. Flag in standup (or ad-hoc sync if urgent)
  3. Team reviews patterns and decides on KTLO work
  4. KTLO options: fix noisy alerts, improve runbooks, add automation, adjust thresholds, transfer ownership

Orphaned Services

When an alert arrives without clear ownership:

  1. Triage the alert normally
  2. Attempt owner identification via deployment configs, recent commits, or team leads
  3. Update alerting to route to correct team/channel
  4. Document routing decision in handoff summary

If no owner found after investigation:

  • Flag as orphaned in tracking document
  • Escalate to AJ or Afaq for ownership decision
  • Options: assign temporary owner or deprecate service
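The lookup-then-escalate flow above can be sketched as a routing table with an orphaned fallback. The `ROUTES` contents and channel names below are illustrative only, not the team's real routing configuration:

```python
# Hypothetical service-to-channel routing table; real routing lives in
# the alerting tools listed under Tooling.
ROUTES = {
    "billing-api": "#team-payments",
    "ingest-cron": "#devops",
}

def route_alert(service: str) -> str:
    """Return the channel to route an alert to, or flag the service as
    orphaned for escalation to AJ/Afaq (ownership decision)."""
    channel = ROUTES.get(service)
    if channel is None:
        return "ORPHANED: escalate to AJ/Afaq for ownership decision"
    return channel
```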

Single Points of Failure

When primary handles an incident on a system only they know:

  • Secondary shadows the incident and documents the resolution
  • Primary creates or updates the runbook within 48 hours

Before PTO, the expert must ensure a runbook exists for any system where they are the sole expert. Minimum coverage: how to restart, common failure modes, external escalation contacts.
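The minimum coverage above suggests a skeleton for `docs/runbooks/`. A sketch of a minimal template (section names are a suggestion, not an existing team convention):

```markdown
# Runbook: <service name>

## How to restart
<commands or console steps>

## Common failure modes
- <symptom>: <likely cause>, <fix>

## External escalation contacts
- <vendor or team>: <contact channel>
```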

Tooling

Tool | Purpose
PagerDuty | Alert aggregation and escalation
Pingdom | Uptime monitoring alerts
GCP Uptime Checks | Cloud infrastructure uptime alerts
GCP Cloud Monitoring | Compute metrics and utilization alerts
Azure Monitor Alerts | Compute metrics and utilization alerts
Deadman’s Snitch | Cron/scheduled job failure alerts
#triage | Incident communication
#devops | Team coordination, overload escalation
docs/runbooks/ | Version-controlled runbooks