# Disaster Recovery Playbook Cheatsheet

Quick reference for using the disaster recovery playbook.

## ⚠️ WARNING

This playbook performs **DESTRUCTIVE OPERATIONS**. Only use when recovering from a disaster or system failure.

## Quick Start

```bash
# Assess damage only (safe)
ansible-playbook playbooks/disaster_recovery.yml --limit failed_host --tags assess

# Full recovery
ansible-playbook playbooks/disaster_recovery.yml --limit failed_host \
  --extra-vars "dr_backup_date=2025-01-11"
```

## Prerequisites

1. **Backups available** - Ensure backups exist in `/var/backups/`
2. **System accessible** - Host must be reachable via SSH
3. **Confirmation ready** - You'll need to type "RECOVER" to proceed

## Common Usage

### Assessment Phase (Safe)

```bash
# Assess system damage without making changes
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --tags assess

# Multiple hosts
ansible-playbook playbooks/disaster_recovery.yml \
  --limit "host1,host2,host3" \
  --tags assess
```

### Configuration Recovery

```bash
# Restore configuration files only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --tags restore_config \
  --extra-vars "dr_backup_date=2025-01-11"
```

### Data Recovery

```bash
# Restore application data only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --tags restore_data \
  --extra-vars "dr_backup_date=2025-01-11"
```

### Full Recovery

```bash
# Complete system recovery
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --extra-vars "dr_backup_date=2025-01-11"
```

## Available Tags

| Tag | Description | Destructive? |
|-----|-------------|--------------|
| `assess` | Assess system state | No ✅ |
| `prepare` | Prepare for recovery | Yes ⚠️ |
| `restore_config` | Restore configuration | Yes ⚠️ |
| `restore_data` | Restore data | Yes ⚠️ |
| `services` | Restart services | No ✅ |
| `verify` | Verify restoration | No ✅ |

## Extra Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `dr_backup_date` | `latest` | Backup date to restore (format: YYYY-MM-DD) |
| `dr_verify_only` | `false` | Assessment mode only (no changes) |

## Recovery Phases

### 1. Assessment

```bash
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags assess
```

**Checks:**

- System accessibility
- Filesystem status
- Service status
- System errors

### 2. Preparation

```bash
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags prepare
```

**Actions:**

- Stops non-critical services
- Creates pre-recovery backup
- Syncs filesystems

### 3. Restoration

```bash
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags restore_config,restore_data
```

**Restores:**

- System configuration (/etc)
- SSH configuration
- Application data
- Database dumps

### 4. Service Restart

```bash
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags services
```

**Restarts:**

- SSH daemon
- Time synchronization
- Auditd
- Firewall

### 5. Verification

```bash
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags verify
```

**Verifies:**

- SSH connectivity
- Critical services running
- Filesystem integrity
- NTP synchronization

## Recovery Scenarios

### Scenario 1: Configuration Corruption

```bash
# Restore only configuration files
ansible-playbook playbooks/disaster_recovery.yml \
  --limit webserver01 \
  --tags assess,restore_config,verify \
  --extra-vars "dr_backup_date=2025-01-11"
```

### Scenario 2: Failed System Upgrade

```bash
# Full recovery from pre-upgrade backup
ansible-playbook playbooks/disaster_recovery.yml \
  --limit dbserver01 \
  --extra-vars "dr_backup_date=2025-01-10"
```

### Scenario 3: Data Loss

```bash
# Restore application data only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit appserver01 \
  --tags restore_data \
  --extra-vars "dr_backup_date=latest"
```

### Scenario 4: Complete System Failure

```bash
# 1. Rebuild OS (manual or automated provisioning)
# 2. Ensure SSH access works
# 3. Run full recovery
ansible-playbook playbooks/disaster_recovery.yml \
  --limit new_replacement_host \
  --extra-vars "dr_backup_date=2025-01-11"
```

## Finding Available Backups

```bash
# List all available backups for a host
ansible failed_host -m shell -a "ls -lh /var/backups/config/"

# Check backup dates
ansible failed_host -m shell -a "ls /var/backups/*/backup_manifest_*.txt"

# View backup manifest
ansible failed_host -m shell -a "cat /var/backups/backup_manifest_2025-01-11_0230.txt"
```

## Logs and Reports

Recovery logs: `./logs/disaster_recovery//_recovery.log`

## Example Output

```
=========================================
!! DISASTER RECOVERY MODE !!
=========================================
Host: webserver01
Environment: production
Timestamp: 2025-01-11T10:00:00Z
Backup Date: 2025-01-11

WARNING: This playbook performs destructive operations!
=========================================

[Pause for confirmation - type 'RECOVER']

=== System Assessment ===
OS: Ubuntu 22.04
Uptime: 2 hours
Filesystems: OK

=== Restoration Status ===
Configuration restored: Yes
Data restored: Yes
Services restarted: Yes

=== Service Status ===
SSH: Running
Firewall: Running
NTP: Synchronized

=== Next Steps ===
1. Verify application-specific services
2. Test application functionality
3. Monitor system logs for errors
4. Update documentation
5. Conduct post-recovery review
=========================================
```

## Troubleshooting

### Backup not found

```bash
# Check backup location
ansible failed_host -m shell -a "ls -la /var/backups/"

# Restore from remote backup server
ansible failed_host -m synchronize \
  -a "src=/mnt/backup-server/backups/ dest=/var/backups/ mode=pull"
```

### SSH connection lost during recovery

The SSH service restart is designed to maintain connections. If lost:

```bash
# Wait 60 seconds for SSH to restart
# Retry connection
ansible failed_host -m ping
```

### Service won't start after recovery

```bash
# Check service status
ansible failed_host -m shell -a "systemctl status service_name"

# Check service logs
ansible failed_host -m shell -a "journalctl -u service_name -n 50"
```

### SELinux blocking services

```bash
# Relabel SELinux contexts
ansible failed_host -m shell -a "restorecon -R /etc /var"
```

## Post-Recovery Checklist

- [ ] Verify all services running
- [ ] Test application functionality
- [ ] Check disk space
- [ ] Review system logs
- [ ] Verify backups are current
- [ ] Update documentation
- [ ] Notify stakeholders
- [ ] Conduct lessons learned review

## Best Practices

1. **Test recovery procedures regularly** - Monthly DR drills
2. **Document recovery time objectives (RTO)** - Know your targets
3. **Keep backups off-site** - Don't rely on local backups only
4. **Verify backup integrity** - Test restores before disasters
5. **Maintain runbooks** - Document specific recovery procedures
6. **Practice on staging** - Test recovery in non-production first
7. **Have communication plan** - Know who to notify

## Quick Reference Commands

```bash
# Assess damage only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host --tags assess

# Full recovery with latest backup
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host

# Specific backup date
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --extra-vars "dr_backup_date=2025-01-11"

# Configuration only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags restore_config

# Verify recovery
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --tags verify

# Assessment mode (no changes)
ansible-playbook playbooks/disaster_recovery.yml \
  --limit host \
  --extra-vars "dr_verify_only=true"
```

## Emergency Contacts

Keep this information updated:

- Infrastructure Team Lead: [Contact]
- On-Call Engineer: [Contact]
- Backup System Admin: [Contact]
- Management Escalation: [Contact]

## See Also

- [Disaster Recovery Playbook](../../playbooks/disaster_recovery.yml)
- [Backup Playbook](../../playbooks/backup.yml)
- [Disaster Recovery Runbook](../../docs/runbooks/disaster-recovery.md)
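
The `dr_backup_date` variable described under Extra Variables accepts either `latest` or a date in `YYYY-MM-DD` form, and a mistyped date during an incident is easy to miss. The sketch below shows one way to check the value before launching a destructive run. The functions `validate_backup_date` and `build_recovery_command` are hypothetical helpers, not part of the playbook, and the check only tests the shape of the date, not whether a backup for that date exists.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (assumptions, not shipped with the playbook).

# Return 0 if the argument is "latest" or has the shape YYYY-MM-DD, 1 otherwise.
# This is a format check only; it does not confirm that a backup exists.
validate_backup_date() {
    local value="$1"
    local pattern='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
    [[ "$value" == "latest" || "$value" =~ $pattern ]]
}

# Print (without running) the full-recovery command for a host and backup date.
# The date defaults to "latest", matching the default in the Extra Variables table.
build_recovery_command() {
    local host="$1"
    local backup_date="${2:-latest}"
    if ! validate_backup_date "$backup_date"; then
        echo "Invalid dr_backup_date: $backup_date (expected YYYY-MM-DD or latest)" >&2
        return 1
    fi
    printf 'ansible-playbook playbooks/disaster_recovery.yml --limit %s --extra-vars "dr_backup_date=%s"\n' \
        "$host" "$backup_date"
}
```

After sourcing the file, `build_recovery_command webserver01 2025-01-11` prints the command for review, so an operator can read it before pasting it into the terminal. Printing rather than executing is a deliberate choice here, since the playbook itself still asks for the "RECOVER" confirmation.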