Execute a Planned Outage
This runbook covers the process for planned service outages — scheduling, stakeholder communication, alert suppression, and post-outage verification.
This is not for unplanned incidents. For unplanned outages, follow the triage rotation process in the Triage Rotation RFC.
Requirements
- Access to Pingdom dashboard
- Access to PagerDuty (to create maintenance windows)
- Posting permissions in the #product and #devops Slack channels
Pre-Outage (3+ Business Days Before)
Step 1: Define the Outage Scope
Document the following before notifying anyone:
Step 2: Notify #product with Advance Notice
Post in #product at least 3 business days before the outage window. Use this template:
Planned Outage Notice
Service(s): [service name(s)]
Environment: [production / non-production]
Date: [date]
Time: [start time] – [end time] [timezone]
Reason: [brief description of the work]
Accounts team: please confirm whether partners need to be notified for the affected service(s). Reply in-thread.
This is a hard gate. Do not proceed to the outage window until the Accounts team has responded in-thread confirming one of:
- Partners have been notified
- No partner notification is needed
Verify: Accounts team has replied in the #product thread.
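If the notice is posted often, the template above can be filled in programmatically so no field is forgotten. The following is a minimal sketch, not an existing tool: the `OutageNotice` class, its field names, and `render_notice` are hypothetical, and the example service name is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class OutageNotice:
    """Fields of the Step 2 template (names are illustrative)."""
    services: list[str]
    environment: str  # "production" or "non-production"
    date: str
    start_time: str
    end_time: str
    timezone: str
    reason: str


def render_notice(n: OutageNotice) -> str:
    """Return the Planned Outage Notice text, one template line per row."""
    return "\n".join([
        "Planned Outage Notice",
        f"Service(s): {', '.join(n.services)}",
        f"Environment: {n.environment}",
        f"Date: {n.date}",
        f"Time: {n.start_time} – {n.end_time} {n.timezone}",
        f"Reason: {n.reason}",
        "Accounts team: please confirm whether partners need to be notified "
        "for the affected service(s). Reply in-thread.",
    ])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_notice(OutageNotice(
        services=["example-service"],
        environment="production",
        date="2024-06-12",
        start_time="02:00",
        end_time="03:30",
        timezone="UTC",
        reason="database upgrade",
    )))
```

The rendered text still has to be posted in #product by someone with posting permissions; the sketch only produces the message body.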
Step 3: Wait for Accounts Team Confirmation
The Accounts team may need several business days to coordinate partner notifications. Do not rush this step.
If no response after 2 business days, follow up in-thread and tag the Accounts team lead directly.
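The two deadlines in this phase (notice at least 3 business days before the outage, follow-up after 2 business days without a reply) are easy to miscount across a weekend. The sketch below shows one way to compute them. It assumes Monday to Friday are business days and ignores public holidays; `shift_business_days` is a hypothetical helper and the outage date is an example.

```python
from datetime import date, timedelta


def shift_business_days(start: date, days: int) -> date:
    """Move `days` business days forward (positive) or back (negative).

    Weekends are skipped. Public holidays are NOT handled (assumption).
    """
    step = 1 if days >= 0 else -1
    remaining = abs(days)
    current = start
    while remaining > 0:
        current += timedelta(days=step)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


outage_day = date(2024, 6, 12)                         # example: a Wednesday
latest_notice = shift_business_days(outage_day, -3)    # Friday 2024-06-07
follow_up = shift_business_days(latest_notice, 2)      # Tuesday 2024-06-11
print(latest_notice, follow_up)
```

In this example, posting on the last permitted day puts the follow-up only one day before the outage, which leaves the Accounts team little time to respond. Treat 3 business days as the minimum and post earlier when partner coordination is likely.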
Outage Window (Day-Of)
Step 4: Post a Final Heads-Up
Post in both #product and #devops 30–60 minutes before the outage begins:
Reminder: Planned outage starting in [X] minutes.
Service(s): [service name(s)]
Expected duration: [duration]
Step 5: Pause Monitoring Alerts
Before starting the outage work:
- Pingdom: Pause uptime checks for the affected service(s) via the Pingdom dashboard
- PagerDuty: Create a maintenance window covering the outage duration for the affected service(s)
Do not skip this step. Unpaused alerts during a planned outage create unnecessary noise and confusion for the triage rotation.
Verify: Pingdom checks show as paused. PagerDuty maintenance window is active.
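The PagerDuty maintenance window can also be created through the API instead of the UI. The sketch below only builds the request body and sends nothing. The payload shape is an assumption based on PagerDuty's REST API v2 maintenance window endpoint and should be confirmed against the current API reference before use; `maintenance_window_payload` is a hypothetical helper and `PABC123` is a placeholder service ID.

```python
from datetime import datetime, timedelta, timezone


def maintenance_window_payload(service_ids, start, duration_minutes, description):
    """Build a request body for creating a PagerDuty maintenance window.

    Assumed shape (verify against PagerDuty's API docs): a top-level
    "maintenance_window" object with ISO 8601 start/end times and a list
    of service references.
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware to avoid ambiguity")
    end = start + timedelta(minutes=duration_minutes)
    return {
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": description,
            "services": [
                {"id": sid, "type": "service_reference"} for sid in service_ids
            ],
        }
    }


payload = maintenance_window_payload(
    service_ids=["PABC123"],  # placeholder ID
    start=datetime(2024, 6, 12, 2, 0, tzinfo=timezone.utc),
    duration_minutes=90,
    description="Planned outage: database upgrade",
)
print(payload["maintenance_window"]["end_time"])
```

Size the window to cover the full expected duration plus any rollback time, since alerts resume as soon as the window ends.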
Step 6: Execute the Planned Work
Perform the maintenance or change that requires the outage. Follow the relevant runbook or procedure for the specific work being done.
Post-Outage
Step 7: Verify Services Are Restored
Confirm that all affected services are back online and healthy:
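One possible way to run this confirmation is to poll each service's health endpoint until it responds successfully or a retry budget runs out. The sketch below is an assumption about how such a check could look, not a procedure this runbook prescribes. `wait_until_healthy` and `http_ok` are hypothetical helpers, and the URL in the usage comment is a placeholder.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 (errors propagate)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status == 200


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 10,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, pausing between tries.

    Returns True on the first successful probe, False if none succeed.
    Exceptions from the probe are treated as "not healthy yet".
    """
    for attempt in range(1, attempts + 1):
        try:
            if probe():
                return True
        except Exception:
            pass  # connection refused, timeout, HTTP error, etc.
        if attempt < attempts:
            sleep(delay_seconds)
    return False


# Usage (placeholder URL):
# healthy = wait_until_healthy(lambda: http_ok("https://example.invalid/health"))
```

A False result after the retry budget is exhausted is the signal to move to the Troubleshooting section rather than to keep waiting.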
Step 8: Re-Enable Monitoring Alerts
- Pingdom: Resume the paused uptime checks
- PagerDuty: End or delete the maintenance window (if it hasn’t auto-expired)
Verify: Pingdom checks are active and showing UP status. PagerDuty maintenance window is closed.
Step 9: Post Completion Confirmation
Post in both #product and #devops:
Planned outage complete.
Service(s): [service name(s)]
Status: All services restored and verified.
Troubleshooting
Services not recovering after outage
If services do not come back within the expected window:
- Execute the rollback plan defined in Step 1
- Post an update in #product and #devops with the new expected timeline
- If the issue escalates to an unplanned incident, hand off to the triage rotation