
RFC: Triage Rotation for DevOps Team

Status: Draft | Author: DevOps Team | Created: 2026-02-03

Overview

Establish a weekly triage rotation for the DevOps team with clear escalation paths and burnout prevention measures.

Background

Global Triage was disbanded, leaving a 1-week gap where Afaq fielded all triage until DevOps could establish a formal process. A 3-person team cannot sustain ad-hoc response without defined rotation and escalation paths.

Expected trajectory: Initial rollout will be front-loaded with triage work as we identify orphaned alerts, reroute misplaced alerts to appropriate teams, and build runbooks. As routing matures and ownership stabilizes, maintenance overhead should decrease over time.

DARCI

Activity | Decision-Maker | Accountable | Responsible | Consulted | Informed
RFC approval | Afaq | AJ | DevOps Team | | Engineering

Rotation Schedule

Cadence: Weekly (Monday 9am to Monday 9am)

Week | Primary | Secondary
1 | AJ | Miko
2 | Miko | Topher
3 | Topher | AJ

Cycle repeats. Secondary becomes the following week’s Primary.
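The 3-week cycle above can be computed directly. A minimal sketch, assuming Week 1 begins on Monday 2026-02-09 (a hypothetical anchor date, not stated in this RFC) and ignoring the 9am shift boundary:

```python
from datetime import date

# 3-week cycle from the Rotation Schedule table.
ROTATION = [("AJ", "Miko"), ("Miko", "Topher"), ("Topher", "AJ")]

# Assumed anchor: the Monday that Week 1 begins (hypothetical).
CYCLE_START = date(2026, 2, 9)

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the shift covering `day`.

    Simplification: shifts are treated as whole calendar weeks; the
    real boundary is Monday 9am as defined in the Cadence section.
    """
    weeks_elapsed = (day - CYCLE_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]
```

For example, `on_call(date(2026, 2, 16))` lands in Week 2, so Miko is primary and Topher is secondary.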

Responsibilities

Primary: First responder to all PagerDuty alerts. Triage, investigate, resolve or escalate as Type 1. Update runbooks if applicable.

Secondary: Escalation point if primary unavailable. Covers PTO/sick days. Reviews incident documentation.

Escalation Path

Alert → Primary → Secondary → AJ (Lead) / Afaq (EM)

Escalate when: no response from current tier, need additional expertise, or major incident declared.

Handoff

Every Monday before standup. PagerDuty rotation is automatic.

Daily Overnight Review

Primary checks overnight/off-hours alerts at 9am each day:

  1. Review any alerts that fired outside business hours
  2. Acknowledge, resolve, or escalate as needed
  3. Note patterns or recurring issues for standup
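Step 1 amounts to filtering alerts by when they fired. A minimal sketch, assuming business hours of 9am to 6pm on weekdays (this RFC does not define business hours, so those bounds are an assumption):

```python
from datetime import datetime, time

# Assumed business hours; not defined explicitly in this RFC.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

def is_off_hours(fired_at: datetime) -> bool:
    """True if an alert fired outside business hours (nights/weekends)."""
    if fired_at.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (BUSINESS_START <= fired_at.time() < BUSINESS_END)

def overnight_review(alerts: list[dict]) -> list[dict]:
    """Filter alerts (each with a 'fired_at' datetime) down to the
    off-hours ones the primary should review at 9am."""
    return [a for a in alerts if is_off_hours(a["fired_at"])]
```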

Time Off

  • Swap shifts with another team member (update PagerDuty override)
  • If both primary and secondary unavailable, escalate to AJ or Afaq
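A shift swap is recorded as a schedule override in PagerDuty. A sketch of the request body for PagerDuty's create-override endpoint (`POST /schedules/{schedule_id}/overrides`); the user ID below is a placeholder, and the exact body shape should be checked against current PagerDuty API documentation:

```python
import json

def override_payload(user_id: str, start_iso: str, end_iso: str) -> str:
    """Build the JSON body for a PagerDuty schedule override.

    user_id is a PagerDuty user reference (placeholder values here,
    not real identifiers); start/end are ISO 8601 timestamps covering
    the swapped shift.
    """
    body = {
        "overrides": [
            {
                "start": start_iso,
                "end": end_iso,
                "user": {"id": user_id, "type": "user_reference"},
            }
        ]
    }
    return json.dumps(body)
```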

Triage Overload Protocol

Thresholds (either triggers protocol):

  • More than 3 hours/day on triage
  • More than 40% of week on triage
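The two thresholds can be checked mechanically from a daily log of triage hours. A minimal sketch, assuming a 40-hour working week as the baseline for the 40% threshold (the RFC does not state the baseline):

```python
def overload_triggered(triage_hours_by_day: list[float],
                       weekly_work_hours: float = 40.0) -> bool:
    """True if either overload threshold is hit:
    more than 3 hours of triage in any single day, or
    triage consuming more than 40% of the working week."""
    if any(h > 3.0 for h in triage_hours_by_day):
        return True
    return sum(triage_hours_by_day) > 0.4 * weekly_work_hours
```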

When triggered:

  1. Post in #devops: what alerts are consuming time, patterns observed, time spent
  2. Flag in standup (or ad-hoc sync if urgent)
  3. Team reviews patterns and decides on KTLO work
  4. KTLO options: fix noisy alerts, improve runbooks, add automation, adjust thresholds, transfer ownership

Orphaned Services

When an alert arrives without clear ownership:

  1. Triage the alert normally
  2. Attempt owner identification via deployment configs, recent commits, or team leads
  3. Update alerting to route to correct team/channel
  4. Document routing decision in handoff summary

If no owner found after investigation:

  • Flag as orphaned in tracking document
  • Escalate to AJ or Afaq for ownership decision
  • Options: assign temporary owner or deprecate service
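The lookup-then-escalate flow above can be sketched as a routing table with an orphaned fallback. The `ROUTES` contents and channel names below are illustrative only, not the team's real routing configuration:

```python
# Hypothetical service-to-channel routing table; real routing lives in
# the alerting tools listed under Tooling.
ROUTES = {
    "billing-api": "#team-payments",
    "ingest-cron": "#devops",
}

def route_alert(service: str) -> str:
    """Return the channel to route an alert to, or flag the service as
    orphaned for escalation to AJ/Afaq (ownership decision)."""
    channel = ROUTES.get(service)
    if channel is None:
        return "ORPHANED: escalate to AJ/Afaq for ownership decision"
    return channel
```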

Single Points of Failure

When primary handles an incident on a system only they know:

  • Secondary shadows the incident and documents the resolution
  • Primary creates or updates the runbook within 48 hours

Before PTO, the expert must ensure a runbook exists for any system where they are the sole expert. Minimum coverage: how to restart, common failure modes, external escalation contacts.
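The minimum coverage above suggests a skeleton for `docs/runbooks/`. A sketch of a minimal template (section names are a suggestion, not an existing team convention):

```markdown
# Runbook: <service name>

## How to restart
<commands or console steps>

## Common failure modes
- <symptom>: <likely cause>, <fix>

## External escalation contacts
- <vendor or team>: <contact channel>
```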

Tooling

Tool | Purpose
PagerDuty | Alert aggregation and escalation
Pingdom | Uptime monitoring alerts
GCP Uptime Checks | Cloud infrastructure uptime alerts
GCP Cloud Monitoring | Compute metrics and utilization alerts
Azure Monitor Alerts | Compute metrics and utilization alerts
Deadman’s Snitch | Cron/scheduled job failure alerts
#triage | Incident communication
#devops | Team coordination, overload escalation
docs/runbooks/ | Version-controlled runbooks