Runbook

This runbook accompanies the Incident Response Plan. For each incident, start at the top of this document and work your way down.

Initial Response

Acknowledge the Alert

Confirm receipt of the alert in the monitoring tool (e.g., PagerDuty, Slack, email).

Notify the team if the alerting system has not already done so.

Assign an Incident Commander (IC)

The first responder assumes the role of IC until the role is reassigned.
Create a communication channel (e.g., Slack channel) and name it #incident-YYYY-MM-DD.
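The channel naming convention above can be generated rather than typed by hand, which avoids date-format mistakes. This is a minimal sketch: the function name `incident_channel_name` is hypothetical, and the code only builds the name. It does not call any chat tool's API.

```python
from datetime import date


def incident_channel_name(day: date) -> str:
    """Return a channel name in the #incident-YYYY-MM-DD form."""
    # date.isoformat() yields YYYY-MM-DD, matching the runbook's convention.
    return f"#incident-{day.isoformat()}"


print(incident_channel_name(date(2024, 3, 7)))  # prints "#incident-2024-03-07"
```

Note that a date-only name would collide if two incidents were declared on the same day; the runbook does not specify how to handle that case.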
Declare Incident Severity

Use the following criteria to determine severity:
- Low: Minor issue, no immediate impact, business continues as usual.
- Medium: Limited user impact, non-critical systems affected.
- High: Critical systems affected, major user impact, or security issue.
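The severity criteria can be sketched as a small decision function. The names below (`Severity`, `declare_severity`, and the flag parameters) are hypothetical. The sketch assumes that any single High condition is enough to declare High, and that either Medium condition is enough to declare Medium; the criteria as written do not state this explicitly.

```python
from enum import Enum


class Severity(Enum):
    LOW = "Low"
    MEDIUM = "Medium"
    HIGH = "High"


def declare_severity(
    *,
    critical_systems_affected: bool = False,
    major_user_impact: bool = False,
    security_issue: bool = False,
    limited_user_impact: bool = False,
    non_critical_systems_affected: bool = False,
) -> Severity:
    """Map observed conditions to a severity level (assumed 'any-of' logic)."""
    # Check the highest severity first so it wins when several flags are set.
    if critical_systems_affected or major_user_impact or security_issue:
        return Severity.HIGH
    if limited_user_impact or non_critical_systems_affected:
        return Severity.MEDIUM
    # No flagged impact: minor issue, business continues as usual.
    return Severity.LOW


print(declare_severity(security_issue=True).value)       # prints "High"
print(declare_severity(limited_user_impact=True).value)  # prints "Medium"
```

Checking High before Medium means a security issue on a non-critical system is still declared High.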

Assessment & Identification

Assess Impact

Identify which systems or services are affected.

Determine if there is ongoing damage (e.g., data loss, security breach).

Gather Initial Information

Check logs, alerts, and dashboards for anomalies.

Answer these questions:
- [ ] What is currently not working?
- [ ] What was the last change made?
- [ ] What is the scope of the issue (single service or multiple)?

Create Incident Ticket

Document the incident details in the ticketing system (e.g., JIRA). Include the following:
- Incident description.
- Systems/services impacted.
- Current status.
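The required ticket fields can be captured in a small structure so that none is forgotten when the ticket is filed. This is a sketch only: `IncidentTicket` and its field names are hypothetical, and it produces plain text to paste into a ticket rather than calling any ticketing system's API.

```python
from dataclasses import dataclass


@dataclass
class IncidentTicket:
    description: str             # Incident description
    impacted_systems: list[str]  # Systems/services impacted
    current_status: str          # Current status

    def render(self) -> str:
        """Format the three required fields as ticket body text."""
        systems = ", ".join(self.impacted_systems)
        return (
            f"Incident description: {self.description}\n"
            f"Systems/services impacted: {systems}\n"
            f"Current status: {self.current_status}"
        )


# Example values are illustrative, not taken from a real incident.
ticket = IncidentTicket(
    description="Checkout requests returning errors",
    impacted_systems=["checkout-api", "payments-db"],
    current_status="Investigating",
)
print(ticket.render())
```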

If the root cause is identified, proceed to Containment. If not, escalate to additional responders for help.

Containment

Implement Short-Term Containment

Isolate the affected systems or services to prevent further impact: disable access, apply firewall rules, or shut down impacted processes as necessary.

Document Containment Actions

Record each action taken in the incident ticket with timestamps. Notify stakeholders of the containment status.
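Timestamped entries are easier to reconstruct into a timeline later if they share one format and one time zone. The helper below is a minimal sketch; `containment_entry` is a hypothetical name, and the choice of UTC with ISO 8601 formatting is an assumption rather than something this runbook mandates.

```python
from datetime import datetime, timezone
from typing import Optional


def containment_entry(action: str, when: Optional[datetime] = None) -> str:
    """Return one log line: an ISO 8601 UTC timestamp followed by the action.

    `when` should be timezone-aware; it defaults to the current time.
    """
    moment = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{moment.isoformat(timespec='seconds')} {action}"


fixed = datetime(2024, 3, 7, 14, 5, tzinfo=timezone.utc)
print(containment_entry("Blocked inbound traffic to db-1", fixed))
# prints "2024-03-07T14:05:00+00:00 Blocked inbound traffic to db-1"
```

Each returned line can be pasted into the incident ticket as the action is taken.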

If containment is successful and the impact has stopped, move to Mitigation & Resolution.

Mitigation & Resolution

Mitigate the Impact

Apply temporary fixes to mitigate the impact if a full resolution is not possible immediately.

Examples: redirect traffic, increase resource limits, revert recent changes.

Implement Full Resolution

Apply the permanent fix or patch to resolve the root cause of the incident.

Monitor the systems to confirm that they are stable and that the issue does not recur.

Verify Resolution

Ensure that all affected services are restored and functioning as expected.

Confirm with stakeholders that the issue is resolved.

Once the resolution is verified, proceed to Post-Incident Review.

Post-Incident Review

Schedule a Post-Incident Review

Within 24-48 hours of incident resolution, schedule a review meeting.

Include all responders and relevant stakeholders.
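The review window can be computed from the resolution time. This sketch assumes "within 24-48 hours" means the meeting should fall between 24 and 48 hours after resolution; if the intent is only that the meeting be scheduled in that period, adjust accordingly. The function name `review_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def review_window(resolved_at: datetime) -> tuple[datetime, datetime]:
    """Return (earliest, latest) times for the post-incident review."""
    earliest = resolved_at + timedelta(hours=24)
    latest = resolved_at + timedelta(hours=48)
    return earliest, latest


resolved = datetime(2024, 3, 7, 16, 0, tzinfo=timezone.utc)
earliest, latest = review_window(resolved)
print(earliest.isoformat())  # prints "2024-03-08T16:00:00+00:00"
print(latest.isoformat())    # prints "2024-03-09T16:00:00+00:00"
```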

Create Post-Incident (Postmortem) Report

Document the incident timeline, root cause, actions taken, and resolution.

Identify areas for improvement and update the runbook if necessary.