RFC: Triage Rotation for DevOps Team
Status: Draft | Author: DevOps Team | Created: 2026-02-03
Overview
Establish a weekly triage rotation for the DevOps team with clear escalation paths and burnout prevention measures.
Background
Global Triage was disbanded, leaving a one-week gap during which Afaq fielded all triage until DevOps could establish a formal process. A 3-person team cannot sustain ad-hoc response without defined rotation and escalation paths.
Expected trajectory: Initial rollout will be front-loaded with triage work as we identify orphaned alerts, reroute misplaced alerts to appropriate teams, and build runbooks. As routing matures and ownership stabilizes, maintenance overhead should decrease over time.
DARCI
| Activity | Decision-Maker | Accountable | Responsible | Consulted | Informed |
|---|---|---|---|---|---|
| RFC approval | Afaq | AJ | DevOps Team | — | Engineering |
Rotation Schedule
Cadence: Weekly (Monday 9am to Monday 9am)
| Week | Primary | Secondary |
|---|---|---|
| 1 | AJ | Miko |
| 2 | Miko | Topher |
| 3 | Topher | AJ |
Cycle repeats. Secondary becomes the following week’s Primary.
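The three-week cycle above can be computed directly from the calendar; a minimal sketch (the rotation order comes from the table, but the anchor date for Week 1 is a hypothetical assumption):

```python
from datetime import date

# Rotation order from the schedule table; secondary is always
# the following week's primary.
ROTATION = ["AJ", "Miko", "Topher"]

# Assumed anchor: Week 1 begins Monday 2026-02-09 (hypothetical date).
ANCHOR = date(2026, 2, 9)

def on_call(day: date) -> tuple[str, str]:
    """Return (primary, secondary) for the week containing `day`."""
    weeks = (day - ANCHOR).days // 7
    primary = ROTATION[weeks % 3]
    secondary = ROTATION[(weeks + 1) % 3]
    return primary, secondary
```

This mirrors the PagerDuty schedule rather than replacing it; it is handy for scripting handoff reminders or sanity-checking the rotation.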
Responsibilities
Primary: First responder to all PagerDuty alerts. Triage, investigate, and resolve or escalate as Type 1. Update runbooks where applicable.
Secondary: Escalation point if primary unavailable. Covers PTO/sick days. Reviews incident documentation.
Escalation Path
Alert → Primary → Secondary → AJ (Lead) / Afaq (EM)
Escalate when: no response from current tier, need additional expertise, or major incident declared.
Handoff
Handoff happens every Monday before standup; the PagerDuty rotation switches over automatically.
Daily Overnight Review
Primary checks overnight/off-hours alarms at 9am each day:
- Review any alerts that fired outside business hours
- Acknowledge, resolve, or escalate as needed
- Note patterns or recurring issues for standup
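The morning sweep above can be scripted against an alert export; a sketch, assuming alerts arrive as (timestamp, summary) pairs in local time (the 9am–6pm business-hours window is an assumption, not defined in this RFC):

```python
from datetime import datetime

BUSINESS_START, BUSINESS_END = 9, 18  # assumed business hours, 9am-6pm local

def off_hours(alerts):
    """Return the alerts that fired outside business hours or on a weekend,
    i.e. the ones the primary reviews at 9am."""
    flagged = []
    for fired_at, summary in alerts:
        outside = fired_at.hour < BUSINESS_START or fired_at.hour >= BUSINESS_END
        weekend = fired_at.weekday() >= 5  # Saturday=5, Sunday=6
        if outside or weekend:
            flagged.append((fired_at, summary))
    return flagged
```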
Time Off
- Swap shifts with another team member (update PagerDuty override)
- If both primary and secondary unavailable, escalate to AJ or Afaq
Triage Overload Protocol
Thresholds (either one triggers the protocol):
- More than 3 hours/day on triage
- More than 40% of week on triage
When triggered:
- Post in #devops: what alerts are consuming time, patterns observed, time spent
- Flag in standup (or ad-hoc sync if urgent)
- Team reviews patterns and decides what KTLO (keep-the-lights-on) work to prioritize
- KTLO options: fix noisy alerts, improve runbooks, add automation, adjust thresholds, transfer ownership
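The thresholds above can be checked mechanically from logged triage hours; a minimal sketch (the 40-hour work week used for the percentage is an assumption):

```python
WORK_WEEK_HOURS = 40  # assumed standard work week

def overloaded(daily_triage_hours):
    """daily_triage_hours: hours spent on triage for each workday so far.

    Triggers if any single day exceeds 3 hours, or the running weekly
    total exceeds 40% of the work week.
    """
    if any(h > 3 for h in daily_triage_hours):
        return True
    return sum(daily_triage_hours) > 0.40 * WORK_WEEK_HOURS
```

Checking daily keeps the protocol from firing only in retrospect: a single bad day trips it immediately, while a slow burn trips the weekly total.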
Orphaned Services
When an alert arrives without clear ownership:
- Triage the alert normally
- Attempt owner identification via deployment configs, recent commits, or team leads
- Update alerting to route to correct team/channel
- Document routing decision in handoff summary
If no owner found after investigation:
- Flag as orphaned in tracking document
- Escalate to AJ or Afaq for ownership decision
- Options: assign temporary owner or deprecate service
Single Points of Failure
When primary handles an incident on a system only they know:
- Secondary shadows the incident and documents the resolution
- Primary creates or updates the runbook within 48 hours
Before PTO, the expert must ensure a runbook exists for any system where they are the sole expert. Minimum coverage: how to restart, common failure modes, external escalation contacts.
Tooling
| Tool | Purpose |
|---|---|
| PagerDuty | Alerting aggregation and escalation |
| Pingdom | Uptime monitoring alerts |
| GCP Uptime Checks | Cloud infrastructure uptime alerts |
| GCP Cloud Monitoring | Compute metrics and utilization alerts |
| Azure Monitor Alerts | Compute metrics and utilization alerts |
| Deadman’s Snitch | Cron/scheduled job failure alerts |
| #triage | Incident communication |
| #devops | Team coordination, overload escalation |
| docs/runbooks/ | Version-controlled runbooks |