Execute a Planned Outage

This runbook covers the process for planned service outages — scheduling, stakeholder communication, alert suppression, and post-outage verification.

This runbook does not cover unplanned incidents. For unplanned outages, follow the triage rotation process in the Triage Rotation RFC.

Requirements

  • Access to Pingdom dashboard
  • Access to PagerDuty (to create maintenance windows)
  • Posting permissions in #product and #devops Slack channels

Pre-Outage (3+ Business Days Before)

Step 1: Define the Outage Scope

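Before notifying anyone, document at least the details the later steps rely on: the affected service(s) and environment, the date and time window (with timezone), the expected duration, the reason for the work, and the rollback plan to use if services do not recover.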

Step 2: Notify #product with Advance Notice

Post in #product at least 3 business days before the outage window. Use this template:

Planned Outage Notice

Service(s): [service name(s)]
Environment: [production / non-production]
Date: [date]
Time: [start time] – [end time] [timezone]
Reason: [brief description of the work]

Accounts team: please confirm whether partners need to be notified for the affected service(s). Reply in-thread.

This is a hard gate. Do not proceed to the outage window until the Accounts team has responded in-thread confirming one of:

  • Partners have been notified
  • No partner notification is needed

Verify: Accounts team has replied in the #product thread.
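The 3-business-day lead time can be checked with a small date calculation. The following is a minimal sketch, assuming business days are Monday through Friday and ignoring public holidays; the function names are hypothetical and not part of any existing tooling.

```python
from datetime import date, timedelta


def subtract_business_days(day: date, n: int) -> date:
    """Step back n business days (Mon-Fri) from day. Holidays are NOT considered."""
    current = day
    remaining = n
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def latest_notice_date(outage_day: date, lead_days: int = 3) -> date:
    """Last calendar date on which the #product notice still gives lead_days of notice."""
    return subtract_business_days(outage_day, lead_days)


def notice_is_on_time(posted: date, outage_day: date, lead_days: int = 3) -> bool:
    """True if the notice was posted on or before the latest allowed date."""
    return posted <= latest_notice_date(outage_day, lead_days)
```

For an outage on Wednesday 2024-06-12, latest_notice_date returns Friday 2024-06-07, because the weekend does not count toward the lead time. If the team observes holidays, the weekday test would need a holiday calendar as well.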

Step 3: Wait for Accounts Team Confirmation

The Accounts team may need several business days to coordinate partner notifications. Do not rush this step.

If no response after 2 business days, follow up in-thread and tag the Accounts team lead directly.

Outage Window (Day-Of)

Step 4: Post a Final Heads-Up

Post in both #product and #devops 30–60 minutes before the outage begins:

Reminder: Planned outage starting in [X] minutes.

Service(s): [service name(s)]
Expected duration: [duration]

Step 5: Pause Monitoring Alerts

Before starting the outage work:

  1. Pingdom: Pause uptime checks for the affected service(s) via the Pingdom dashboard
  2. PagerDuty: Create a maintenance window covering the outage duration for the affected service(s)

Do not skip this step. Unpaused alerts during a planned outage create unnecessary noise and confusion for the triage rotation.

Verify: Pingdom checks show as paused. PagerDuty maintenance window is active.
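This runbook uses the Pingdom dashboard and the PagerDuty UI, but the same two actions could be scripted. The sketch below only builds the request descriptions and sends nothing. The endpoint paths and field names are assumptions based on the public PagerDuty REST API (maintenance windows) and Pingdom API 3.1 (check updates); verify them against the current API references before use. The service and check IDs are hypothetical placeholders.

```python
from datetime import datetime, timedelta

# Assumed endpoints; confirm against the vendors' current API documentation.
PAGERDUTY_URL = "https://api.pagerduty.com/maintenance_windows"
PINGDOM_CHECK_URL = "https://api.pingdom.com/api/3.1/checks/{check_id}"


def maintenance_window_payload(service_ids, start: datetime,
                               duration: timedelta, description: str) -> dict:
    """Build the JSON body for creating a PagerDuty maintenance window."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware to avoid ambiguous windows")
    end = start + duration
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"} for sid in service_ids
            ],
        }
    }


def pingdom_pause_request(check_id: int, paused: bool) -> dict:
    """Describe the request that pauses (or resumes) one Pingdom uptime check."""
    return {
        "method": "PUT",
        "url": PINGDOM_CHECK_URL.format(check_id=check_id),
        "data": {"paused": "true" if paused else "false"},
    }
```

Calling pingdom_pause_request with paused=False describes the matching resume request for the post-outage step. Whichever route is used, the Verify check above still applies: confirm in both dashboards that the checks are paused and the window is active.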

Step 6: Execute the Planned Work

Perform the maintenance or change that requires the outage. Follow the relevant runbook or procedure for the specific work being done.

Post-Outage

Step 7: Verify Services Are Restored

Confirm that all affected services are back online and healthy before re-enabling alerts.
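One way to make this check repeatable is to poll each service's health endpoint until it responds successfully. This is a minimal sketch, assuming each service exposes an HTTP health URL that returns a 2xx status when healthy; the URLs, retry counts, and function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url yields a 2xx response within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, refused connections, DNS failures and timeouts.
        return False


def wait_until_healthy(urls: Iterable[str],
                       probe: Callable[[str], bool] = http_ok,
                       attempts: int = 10,
                       delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> Dict[str, bool]:
    """Probe every URL up to `attempts` times; return the final health per URL."""
    status = {url: False for url in urls}
    for attempt in range(attempts):
        for url in list(status):
            if not status[url]:  # only re-probe services not yet seen healthy
                status[url] = probe(url)
        if all(status.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return status
```

Any URL still mapped to False after the final attempt indicates a service that has not recovered, at which point the Troubleshooting section applies. A passing health endpoint is only a first signal and does not replace service-specific verification.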

Step 8: Re-Enable Monitoring Alerts

  1. Pingdom: Resume the paused uptime checks
  2. PagerDuty: End or delete the maintenance window (if it hasn’t auto-expired)

Verify: Pingdom checks are active and showing UP status. PagerDuty maintenance window is closed.

Step 9: Post Completion Confirmation

Post in both #product and #devops:

Planned outage complete.

Service(s): [service name(s)]
Status: All services restored and verified.

Troubleshooting

Services not recovering after outage

If services do not come back within the expected window:

  1. Execute the rollback plan defined in Step 1
  2. Post an update in #product and #devops with the new expected timeline
  3. If the issue escalates to an unplanned incident, hand off to the triage rotation