Execute a Planned Outage

This runbook covers the process for planned service outages: scheduling, stakeholder communication, alert suppression, and post-outage verification.

This runbook does not apply to unplanned incidents. For unplanned outages, follow the triage rotation process in the Triage Rotation RFC.

Requirements

Access to Pingdom dashboard
Access to PagerDuty (to create maintenance windows)
Posting permissions in #product and #devops Slack channels

Pre-Outage (3+ Business Days Before)

Step 1: Define the Outage Scope

Document the following before notifying anyone:

Affected service(s) and environment(s) (production, non-production, or both)
Expected start time and duration
Reason for the outage
Rollback plan if the work cannot be completed in the window
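
The scope items above can also be captured as a small record so that nothing is missing before the notice goes out. This is a minimal sketch, not part of the official process; the class name `OutageScope`, its field names, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical environment labels, taken from the wording of this runbook.
VALID_ENVIRONMENTS = {"production", "non-production"}


@dataclass(frozen=True)
class OutageScope:
    """Illustrative container for the Step 1 scope items."""

    services: tuple[str, ...]
    environments: tuple[str, ...]
    start: datetime            # must be timezone-aware
    duration: timedelta
    reason: str
    rollback_plan: str

    def __post_init__(self) -> None:
        # Reject incomplete scopes early, before anyone is notified.
        if not self.services:
            raise ValueError("at least one affected service is required")
        if not self.environments or not set(self.environments) <= VALID_ENVIRONMENTS:
            raise ValueError(f"environments must be drawn from {sorted(VALID_ENVIRONMENTS)}")
        if self.start.tzinfo is None:
            raise ValueError("start must carry a timezone")
        if self.duration <= timedelta(0):
            raise ValueError("duration must be positive")
        if not self.reason.strip():
            raise ValueError("a reason is required")
        if not self.rollback_plan.strip():
            raise ValueError("a rollback plan is required")

    @property
    def end(self) -> datetime:
        """Expected end of the outage window."""
        return self.start + self.duration
```

A scope without a rollback plan fails at construction time, which mirrors the requirement to document everything before notifying anyone.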

Step 2: Notify #product with Advance Notice

Post in #product at least 3 business days before the outage window. Use this template:

Planned Outage Notice

Service(s): [service name(s)]
Environment: [production / non-production]
Date: [date]
Time: [start time] – [end time] [timezone]
Reason: [brief description of the work]

Accounts team: please confirm whether partners need to be notified for the affected service(s). Reply in-thread.

This is a hard gate. Do not proceed to the outage window until the Accounts team has responded in-thread confirming one of:

Partners have been notified
No partner notification is needed

Verify: The Accounts team has replied in the #product thread with one of the two confirmations above.
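
If the notice is assembled by a script rather than by hand, the Step 2 template can be filled in as sketched below. The function name `render_outage_notice` and its parameters are hypothetical; only the template text comes from this runbook, and posting to Slack is left to whatever tooling the team already uses.

```python
def render_outage_notice(services, environment, date, start_time, end_time, tz, reason):
    """Fill in the Planned Outage Notice template from Step 2.

    All arguments are plain strings, except `services`, which is a list of names.
    """
    lines = [
        "Planned Outage Notice",
        "",
        f"Service(s): {', '.join(services)}",
        f"Environment: {environment}",
        f"Date: {date}",
        f"Time: {start_time} – {end_time} {tz}",
        f"Reason: {reason}",
        "",
        "Accounts team: please confirm whether partners need to be notified "
        "for the affected service(s). Reply in-thread.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are placeholders, not real services.
    print(render_outage_notice(
        services=["example-service"],
        environment="production",
        date="2030-01-16",
        start_time="02:00",
        end_time="04:00",
        tz="UTC",
        reason="Database upgrade",
    ))
```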

Step 3: Wait for Accounts Team Confirmation

The Accounts team may need several business days to coordinate partner notifications. Do not rush this step.

If there is no response after 2 business days, follow up in-thread and tag the Accounts team lead directly.
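
The two deadlines in this section (notice at least 3 business days ahead, follow-up after 2 business days of silence) are easy to miscount around weekends. The sketch below shows one way to compute them. It assumes a Monday to Friday work week and ignores public holidays; the helper names are illustrative.

```python
from datetime import date, timedelta


def shift_business_days(day: date, n: int) -> date:
    """Move `n` business days forward (n > 0) or backward (n < 0).

    Assumption: Saturday and Sunday are the only non-business days.
    """
    step = 1 if n > 0 else -1
    remaining = abs(n)
    current = day
    while remaining:
        current += timedelta(days=step)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def latest_notice_date(outage_day: date) -> date:
    """Last day on which the #product notice still gives 3 business days."""
    return shift_business_days(outage_day, -3)


def follow_up_date(posted_day: date) -> date:
    """Day on which to follow up if the Accounts team has not replied."""
    return shift_business_days(posted_day, 2)
```

Teams that observe holidays would need to extend the weekday check with their own calendar.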

Outage Window (Day-Of)

Step 4: Post a Final Heads-Up

Post in both #product and #devops 30–60 minutes before the outage begins:

Reminder: Planned outage starting in [X] minutes.

Service(s): [service name(s)]
Expected duration: [duration]
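
As a rough illustration, the reminder can be generated from the scheduled start time, with a check that it is being sent inside the 30–60 minute window. The function names and the string format of `duration` are assumptions for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Window taken from Step 4: post 30 to 60 minutes before the start.
REMINDER_EARLIEST = timedelta(minutes=60)
REMINDER_LATEST = timedelta(minutes=30)


def in_reminder_window(start: datetime, now: datetime) -> bool:
    """True when `now` is between 60 and 30 minutes before `start`."""
    lead = start - now
    return REMINDER_LATEST <= lead <= REMINDER_EARLIEST


def render_reminder(services, start: datetime, duration: str, now: datetime) -> str:
    """Fill in the Step 4 reminder template."""
    minutes = int((start - now).total_seconds() // 60)
    return "\n".join([
        f"Reminder: Planned outage starting in {minutes} minutes.",
        "",
        f"Service(s): {', '.join(services)}",
        f"Expected duration: {duration}",
    ])
```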

Step 5: Pause Monitoring Alerts

Before starting the outage work:

Pingdom: Pause uptime checks for the affected service(s) via the Pingdom dashboard
PagerDuty: Create a maintenance window covering the outage duration for the affected service(s)

Do not skip this step. Alerts left active during a planned outage create unnecessary noise and confusion for the triage rotation.

Verify: Pingdom checks show as paused. PagerDuty maintenance window is active.
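
This runbook assumes both tools are operated through their dashboards. For teams that prefer to script the PagerDuty side, the sketch below builds (but does not send) a request to create a maintenance window. The endpoint, headers, and payload shape reflect the PagerDuty REST API v2 as commonly documented and should be checked against the current API reference before use; the function name, the service IDs, and the token handling are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

# Assumed endpoint for PagerDuty REST API v2 maintenance windows.
PAGERDUTY_MAINTENANCE_URL = "https://api.pagerduty.com/maintenance_windows"


def build_maintenance_window_request(api_token, requester_email, service_ids,
                                     start: datetime, duration: timedelta,
                                     description: str) -> Request:
    """Build the POST request for a maintenance window covering the outage.

    Nothing is sent here; pass the result to urllib.request.urlopen to submit it.
    """
    payload = {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": (start + duration).isoformat(),
            "description": description,
            "services": [
                {"id": service_id, "type": "service_reference"}
                for service_id in service_ids
            ],
        }
    }
    headers = {
        "Authorization": f"Token token={api_token}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": requester_email,  # email of the user creating the window
    }
    return Request(
        PAGERDUTY_MAINTENANCE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Keeping request construction separate from sending makes it possible to review the window's start and end times before anything changes in PagerDuty.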

Step 6: Execute the Planned Work

Perform the maintenance or change that requires the outage. Follow the relevant runbook or procedure for the specific work being done.

Post-Outage

Step 7: Verify Services Are Restored

Confirm that all affected services are back online and healthy:

Application responds to health checks
No error spikes in logs
Key user flows work as expected
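
The first check in the list above can be automated with a short polling loop. This is a sketch under stated assumptions: each service exposes an HTTP health endpoint that returns 200 when healthy, and the URLs, attempt count, and delay are placeholders. Log review and user-flow checks still need to be done separately.

```python
import time
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for `url`, or 0 if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code      # server answered with an error status
    except OSError:
        return 0             # connection refused, DNS failure, timeout, etc.


def wait_until_healthy(urls, fetch=http_status, attempts=5, delay=10.0,
                       sleep=time.sleep) -> dict:
    """Poll each health endpoint until all return 200 or attempts run out.

    Returns a mapping of url -> True/False so failures can be reported by name.
    """
    healthy = {url: False for url in urls}
    for attempt in range(attempts):
        for url in urls:
            if not healthy[url]:
                healthy[url] = fetch(url) == 200
        if all(healthy.values()):
            break
        if attempt < attempts - 1:
            sleep(delay)
    return healthy
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without real network calls or real waiting.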

Step 8: Re-Enable Monitoring Alerts

Pingdom: Resume the paused uptime checks
PagerDuty: End or delete the maintenance window (if it hasn’t auto-expired)

Verify: Pingdom checks are active and showing UP status. PagerDuty maintenance window is closed.

Step 9: Post Completion Confirmation

Post in both #product and #devops:

Planned outage complete.

Service(s): [service name(s)]
Status: All services restored and verified.

Troubleshooting

Services not recovering after outage

If services do not come back within the expected window:

Execute the rollback plan defined in Step 1
Post an update in #product and #devops with the new expected timeline
If the issue escalates to an unplanned incident, hand off to the triage rotation
