Here are several prompt templates for IT service runbooks, which you can adapt based on the specific IT service or issue you’re addressing:
1. General IT Service Incident Response
Prompt:
- Title: IT Service Incident Response: [Service Name] Failure
- Scope:
  - Define the scope of the service affected (e.g., application, network, database).
- Problem Description:
  - Describe the issue in detail (e.g., users unable to log in, website down, network congestion).
- Affected Users/Systems:
  - Identify which users or systems are impacted (e.g., all employees, specific department, external customers).
- Immediate Actions to Take:
  - Step-by-step instructions for isolating the issue (e.g., checking system logs, verifying service status).
- Root Cause Investigation:
  - Instructions on how to investigate the root cause of the issue (e.g., reviewing error messages, checking system metrics).
- Resolution Steps:
  - Provide a detailed set of steps for resolving the issue (e.g., restarting services, applying patches, restoring backups).
- Post-Incident Review:
  - What to check after resolution (e.g., verify system stability, monitor for recurrence).
- Preventive Actions:
  - Recommendations to prevent the issue from happening again (e.g., monitoring improvements, configuration changes).
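Adapting a template mostly means replacing its bracketed placeholders such as [Service Name]. The following is a minimal sketch of one way to do that in Python; the function name `fill_placeholders` and the example value "Payments API" are illustrative assumptions, not part of the templates themselves.

```python
import re


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replace each [Placeholder] with its value; unknown placeholders stay as-is."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Fall back to the original bracketed text when no value is supplied.
        return values.get(key, match.group(0))

    # Match any text between square brackets that contains no nested brackets.
    return re.sub(r"\[([^\[\]]+)\]", substitute, template)


title = fill_placeholders(
    "IT Service Incident Response: [Service Name] Failure",
    {"Service Name": "Payments API"},  # hypothetical service name
)
print(title)
```

Leaving unknown placeholders untouched makes it easy to spot fields that still need to be filled in before the prompt is used.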
2. System Maintenance Runbook
Prompt:
- Title: System Maintenance for [System/Service Name]
- Scope:
  - Define the scope of the maintenance (e.g., server upgrades, software patching).
- Preparation:
  - What preparations need to be made before starting the maintenance (e.g., notify users, ensure backups are available).
- Maintenance Tasks:
  - Step-by-step instructions on the maintenance tasks (e.g., apply security patches, upgrade hardware).
- Expected Downtime:
  - How long will the system be down, if applicable (e.g., 30 minutes, 1 hour)?
- Rollback Plan:
  - Define how to revert changes if something goes wrong (e.g., restore from backup, roll back patches).
- Post-Maintenance Checks:
  - What to verify once maintenance is completed (e.g., system performance, availability checks).
- Sign-Off:
  - Instructions on who needs to approve the completion of the maintenance and confirm system stability.
3. System Monitoring & Alerting Response
Prompt:
- Title: System Monitoring Alert Response: [Alert Type]
- Scope:
  - Define the scope of the alert (e.g., CPU usage exceeds 90%, disk space running low).
- Alert Details:
  - Specifics of the alert (e.g., high memory usage, critical server failure).
- Immediate Actions:
  - Step-by-step actions to take when an alert is triggered (e.g., check system logs, run diagnostics).
- Investigation:
  - Guidance on how to investigate the root cause (e.g., checking logs, identifying trends).
- Resolution:
  - Steps to resolve the issue (e.g., restart the service, optimize resource usage).
- Post-Alert Follow-up:
  - What to monitor after resolution to ensure the issue is fully resolved (e.g., check metrics over the next 24 hours).
- Documentation & Reporting:
  - How to document the issue, actions taken, and resolution for future reference (e.g., create an incident report).
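An alert condition such as "disk space running low" can be illustrated with a small threshold check. This is only a sketch using Python's standard library: the helper names `used_percent` and `exceeds_threshold` and the 90% default are assumptions chosen to mirror the example above, and a real monitoring system would define its own rules.

```python
import shutil


def used_percent(total: int, used: int) -> float:
    """Return the share of capacity in use, as a percentage."""
    return 100.0 * used / total


def exceeds_threshold(percent: float, threshold: float = 90.0) -> bool:
    """Return True when usage is strictly above the alert threshold."""
    return percent > threshold


# Check the filesystem that holds the current working directory.
usage = shutil.disk_usage(".")
percent = used_percent(usage.total, usage.used)
state = "ALERT" if exceeds_threshold(percent) else "ok"
print(f"disk usage {percent:.1f}%: {state}")
```

Keeping the threshold logic separate from the measurement makes the rule easy to test without touching a real disk.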
4. Disaster Recovery Runbook
Prompt:
- Title: Disaster Recovery Plan: [Service/System Name]
- Scope:
  - Define the systems and services covered by the disaster recovery plan (e.g., critical database servers, file storage systems).
- Disaster Trigger:
  - List the conditions that trigger the disaster recovery plan (e.g., total system failure, data corruption).
- Initial Response:
  - Steps to take immediately after the disaster is detected (e.g., assess the scope of the disaster, alert stakeholders).
- Recovery Strategy:
  - Detailed steps for recovery (e.g., restoring from backups, setting up failover systems).
- System Validation:
  - What to check after recovery to ensure the system is operational (e.g., verify data integrity, system performance).
- Communication Plan:
  - How to communicate the recovery status to stakeholders (e.g., email updates, status page updates).
- Post-Recovery Actions:
  - Steps to review and prevent future disasters (e.g., conduct root cause analysis, improve system monitoring).
5. Security Incident Response
Prompt:
- Title: Security Incident Response: [Incident Type]
- Scope:
  - Define the security incident (e.g., data breach, ransomware attack, phishing attempt).
- Immediate Containment:
  - First steps to take to contain the security threat (e.g., disconnect affected systems from the network, disable compromised accounts).
- Investigation:
  - How to investigate the extent of the security breach (e.g., check logs, identify entry points).
- Mitigation & Eradication:
  - Actions to remove the threat and mitigate further risks (e.g., patch vulnerabilities, remove malware).
- Recovery:
  - Steps for system recovery (e.g., restore data from backups, rebuild affected systems).
- Post-Incident Review:
  - Analyze the incident to improve future responses (e.g., update incident response procedures, enhance security policies).
- Communication:
  - How to communicate with stakeholders during and after the incident (e.g., informing affected users, public statements).
6. Backup and Restore Runbook
Prompt:
- Title: Backup and Restore Procedures: [System/Service Name]
- Scope:
  - Define the systems and services covered by the backup process (e.g., file servers, databases).
- Backup Schedule:
  - Outline the backup schedule (e.g., daily, weekly, monthly).
- Backup Verification:
  - Instructions on how to verify the integrity of backups (e.g., test restores, checksum verification).
- Restore Process:
  - Detailed instructions for restoring data (e.g., from cloud storage, on-prem backup system).
- Testing Restoration:
  - How to test the restored system to ensure data integrity and functionality (e.g., spot-check files, run application tests).
- Post-Restore Actions:
  - Steps to perform after a successful restore (e.g., notify users, re-enable affected services).
- Documentation:
  - How to document the backup and restore process, including any issues encountered.
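Checksum verification, mentioned under Backup Verification, could look roughly like the sketch below. It assumes a SHA-256 digest was recorded when the backup was taken; the function names `sha256_of` and `backup_matches` are hypothetical, and real backup tools often ship their own verification commands.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the digest recorded at backup time."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A matching checksum shows only that the file is unchanged since the digest was recorded; a test restore is still needed to confirm that the backup is usable.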
7. Patch Management Runbook
Prompt:
- Title: Patch Management: [System/Service Name]
- Scope:
  - Define the scope of patch management (e.g., security patches, software updates).
- Patch Assessment:
  - How to assess patches for relevance and urgency (e.g., security-critical patches, functionality updates).
- Testing Patches:
  - Steps to test patches in a non-production environment (e.g., deploy in staging, run functional tests).
- Deployment:
  - Detailed deployment steps for applying patches to production systems (e.g., use deployment tools, manual installation).
- Post-Deployment Validation:
  - Steps to validate the success of the patch deployment (e.g., check logs, verify service availability).
- Rollback Plan:
  - Instructions on how to roll back patches if necessary (e.g., revert changes, restore system from backup).
- Documentation & Reporting:
  - How to document applied patches and any issues encountered (e.g., record patch IDs, issue details).
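Recording patch IDs and issue details, as suggested under Documentation & Reporting, can be as simple as writing one structured record per applied patch. The sketch below is one possible shape, not a standard format: the field names, the `patch_record` helper, and the sample values ("PATCH-0001", "web-01") are all made up for illustration.

```python
import json
from datetime import datetime, timezone


def patch_record(patch_id: str, system: str, status: str, issues=()) -> dict:
    """Build a record describing one patch applied to one system."""
    return {
        "patch_id": patch_id,
        "system": system,
        "status": status,
        "issues": list(issues),
        # Timestamp in UTC so records from different sites sort consistently.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


record = patch_record(
    "PATCH-0001",  # hypothetical patch ID
    "web-01",      # hypothetical host name
    "applied",
    ["service restart required"],
)
print(json.dumps(record))
```

Writing each record as a single JSON line keeps the log easy to append to and to search later.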
These templates provide a structured approach for IT teams to follow, ensuring consistency and efficiency in handling common IT tasks and incidents.