# Disaster Recovery Runbook Emergency procedures for recovering from system failures and disasters. ## Severity Levels | Level | Description | Response Time | |-------|-------------|---------------| | **P0** | Complete system failure | Immediate | | **P1** | Critical service outage | < 15 minutes | | **P2** | Degraded performance | < 1 hour | | **P3** | Minor issues | < 4 hours | ## Initial Response ### 1. Incident Detection (0-5 minutes) ```bash # Verify incident scope ansible all -i inventories/ -m ping # Identify failed hosts ansible-playbook playbooks/security_audit.yml --tags assess ``` ### 2. Incident Classification (5-10 minutes) Determine: - Affected hosts/services - Severity level - Business impact - Recovery time objective (RTO) ### 3. Communication (10-15 minutes) **Notify:** - Infrastructure team - Management (P0/P1 only) - Affected stakeholders **Template:** ``` INCIDENT ALERT [P0/P1/P2/P3] Incident ID: DR-YYYYMMDD-NNN Detected: [Timestamp] Scope: [Affected systems] Impact: [Business impact] Status: Investigating/Responding/Resolved ETA: [Estimated resolution time] ``` ## Recovery Procedures ### Scenario 1: Single Host Failure (P1) **Symptoms:** Host unreachable, services down **Recovery:** ```bash # 1. Assess damage ansible-playbook playbooks/disaster_recovery.yml \ --limit failed_host \ --tags assess # 2. Attempt service restart ansible failed_host -m systemd -a "name= state=restarted" # 3. If unsuccessful, initiate full recovery ansible-playbook playbooks/disaster_recovery.yml \ --limit failed_host \ --extra-vars "dr_backup_date=latest" # 4. Verify recovery ansible-playbook playbooks/disaster_recovery.yml \ --limit failed_host \ --tags verify ``` **RTO:** 30 minutes ### Scenario 2: Database Corruption (P0) **Symptoms:** Database errors, data inconsistency **Recovery:** ```bash # 1. Stop application services ansible dbserver -m systemd -a "name=application state=stopped" # 2. Restore database from backup ansible-playbook playbooks/disaster_recovery.yml \ --limit dbserver \ --tags restore_data \ --extra-vars "dr_backup_date=YYYY-MM-DD" # 3. Verify database integrity ansible dbserver -m shell -a "mysqlcheck --all-databases" # 4. Restart services ansible dbserver -m systemd -a "name=mysql state=restarted" ansible dbserver -m systemd -a "name=application state=restarted" ``` **RTO:** 1 hour ### Scenario 3: Complete Environment Failure (P0) **Symptoms:** All hosts unreachable, total outage **Recovery:** ```bash # 1. Verify network connectivity ping # 2. Check infrastructure provider status # (AWS, Azure, etc.) # 3. If infrastructure is available, restore hosts individually for host in host1 host2 host3; do ansible-playbook playbooks/disaster_recovery.yml \ --limit $host \ --extra-vars "dr_backup_date=latest" done # 4. Verify environment health ansible-playbook -i inventories/ site.yml --check ``` **RTO:** 4 hours ### Scenario 4: Configuration Corruption (P2) **Symptoms:** Services misconfigured, errors in logs **Recovery:** ```bash # 1. Restore configuration only ansible-playbook playbooks/disaster_recovery.yml \ --limit affected_hosts \ --tags restore_config \ --extra-vars "dr_backup_date=YYYY-MM-DD" # 2. Restart affected services ansible affected_hosts -m systemd -a "name= state=restarted" # 3. Verify configuration ansible affected_hosts -m shell -a " -t" # Test config ``` **RTO:** 30 minutes ## Escalation Path 1. **L1:** On-call engineer (initial response) 2. **L2:** Senior infrastructure engineer (if unresolved in 30 min) 3. **L3:** Infrastructure team lead (P0/P1 or > 1 hour) 4. **L4:** CTO/Management (> 2 hours or business-critical) ## Post-Incident Procedures ### 1. Verification (Immediate) ```bash # System health check ansible-playbook playbooks/maintenance.yml --tags verify # Security audit ansible-playbook playbooks/security_audit.yml ``` ### 2. Documentation (Within 2 hours) Document in incident log: - Timeline of events - Actions taken - Recovery time - Root cause (if known) ### 3. Post-Mortem (Within 48 hours) Conduct post-mortem meeting: - What happened - What went well - What could be improved - Action items ### 4. Preventive Actions (Within 1 week) - Implement fixes - Update runbooks - Improve monitoring - Test recovery procedures ## Testing Schedule | Test Type | Frequency | Scope | |-----------|-----------|-------| | Single host recovery | Monthly | Development | | Configuration restore | Monthly | Staging | | Database restore | Quarterly | Staging | | Full DR drill | Semi-annually | All | ## Emergency Contacts | Role | Name | Contact | Backup | |------|------|---------|--------| | On-Call Engineer | TBD | TBD | TBD | | Team Lead | TBD | TBD | TBD | | Management | TBD | TBD | TBD | | Vendor Support | TBD | TBD | - | ## Critical Information ### Backup Locations - Local: `/var/backups/` - Remote: `[Remote backup server]` - Off-site: `[Off-site location]` ### Recovery Credentials - Vault password location: `[Secure location]` - Emergency access: `[Break-glass procedure]` - Root passwords: `[Secure password manager]` ### Service Dependencies ``` Load Balancer ↓ Web Servers (webserver01, webserver02) ↓ Application Servers (appserver01, appserver02) ↓ Database (dbserver01) → Replica (dbserver02) ↓ Cache (redis01) ``` ## Quick Reference ```bash # Assess all hosts ansible-playbook playbooks/disaster_recovery.yml --tags assess # Full recovery single host ansible-playbook playbooks/disaster_recovery.yml --limit host # Configuration only ansible-playbook playbooks/disaster_recovery.yml --limit host --tags restore_config # Verify recovery ansible-playbook playbooks/disaster_recovery.yml --limit host --tags verify # Check backup availability ansible all -m shell -a "ls -lh /var/backups/" ``` --- **Last Updated:** 2025-11-11 **Next Review:** 2025-02-11