Files
infra-automation/playbooks/disaster_recovery.yml
ansible cc21e89a78 Add playbook structure, master playbook, and collections requirements
Implement standardized playbook organization with master orchestrator
and Ansible collections requirements for extended functionality.

Playbook Structure:
playbooks/
├── gather_system_info.yml    # System inventory gathering
├── deploy_vm.yml             # VM deployment (placeholder)
├── security_audit.yml        # Security compliance checking (placeholder)
├── maintenance.yml           # Routine maintenance tasks (placeholder)
├── backup.yml                # Backup operations (placeholder)
└── disaster_recovery.yml     # DR procedures (placeholder)

Master Playbook (site.yml):
- Entry point for all infrastructure operations
- Import structure for modular playbook organization
- Tag-based execution for selective operations
- Pre-flight checks and validations
- Comprehensive documentation and usage examples

Collections Requirements (collections/requirements.yml):
- community.general: Essential utilities and modules
- community.libvirt: KVM/libvirt management
- ansible.posix: POSIX system administration
- amazon.aws: AWS infrastructure management (optional)
- Community versions for open-source compatibility

Implemented Playbooks:

1. gather_system_info.yml:
   - Comprehensive system information gathering
   - Uses system_info role
   - Statistics export to ./stats/machines/
   - Health checks and validation
   - Tag support: install, gather, export, validate, health-check

2. Placeholder Playbooks (documented structure):
   - deploy_vm.yml: VM provisioning with deploy_linux_vm role
   - security_audit.yml: CIS benchmark compliance checking
   - maintenance.yml: Updates, cleanup, optimization
   - backup.yml: Backup operations orchestration
   - disaster_recovery.yml: DR procedures and testing

site.yml Master Playbook Features:
- Central orchestration point
- Import-based playbook inclusion
- Tag inheritance and selective execution
- Environment-aware (development, staging, production)
- Pre-flight validation checks
- Error handling and rollback support
- Comprehensive inline documentation

Usage Examples:
```bash
# Run all playbooks
ansible-playbook site.yml

# Run specific playbook
ansible-playbook site.yml --tags gather_info

# Gather system information only
ansible-playbook playbooks/gather_system_info.yml

# Check syntax
ansible-playbook site.yml --syntax-check

# Dry run
ansible-playbook site.yml --check

# Limit to specific hosts
ansible-playbook site.yml -l webservers
```

Collections Management:
- Install: ansible-galaxy collection install -r collections/requirements.yml
- Update: ansible-galaxy collection install -r collections/requirements.yml --upgrade
- Location: ./collections/ (local) and ~/.ansible/collections (user)
- Version pinning for stability
- Community alternatives for RHEL-free deployments

CLAUDE.md Compliance:
 Playbooks in ./playbooks/ directory
 Master playbook (site.yml) at root
 Tag-based execution support
 Modular organization with import_playbook
 Collections requirements documented
 Clear separation: playbooks (lasting) vs plays (temporary)

Benefits:
- Standardized playbook organization
- Easy-to-navigate structure
- Tag-based selective execution
- Collection dependency management
- Scalable to 100+ playbooks
- Clear entry point (site.yml)
- Environment isolation

Next Steps:
1. Install collections: ansible-galaxy collection install -r collections/requirements.yml
2. Implement placeholder playbooks as needed
3. Add role-specific playbooks to playbooks/ directory
4. Create temporary plays in plays/ directory (per CLAUDE.md)
5. Test site.yml orchestration: ansible-playbook site.yml --check

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 01:37:19 +01:00

417 lines
14 KiB
YAML

---
# =============================================================================
# Disaster Recovery Playbook
# =============================================================================
#
# This playbook orchestrates disaster recovery procedures including system
# restoration, configuration recovery, and service restoration.
#
# WARNING: This playbook performs destructive operations. Use with caution!
#
# Usage:
# ansible-playbook playbooks/disaster_recovery.yml --limit <failed_host>
# ansible-playbook playbooks/disaster_recovery.yml --tags assess
# ansible-playbook playbooks/disaster_recovery.yml --tags restore --extra-vars "dr_backup_date=2025-01-10"
#
# Tags:
# assess - Assess system state and damage
# prepare - Prepare for recovery
# restore_config - Restore configuration files
# restore_data - Restore data from backups
# verify - Verify restoration
# services - Restart services
#
# =============================================================================
- name: Disaster Recovery Procedures
hosts: all
become: true
gather_facts: true
serial: 1 # Process one host at a time
vars:
dr_timestamp: "{{ ansible_date_time.iso8601 }}"
dr_log_dir: "./logs/disaster_recovery/{{ ansible_date_time.date }}"
dr_backup_source: "/var/backups"
dr_restore_target: "/"
dr_backup_date: "{{ dr_backup_date | default('latest') }}"
dr_verify_only: false
pre_tasks:
- name: Create disaster recovery log directory
file:
path: "{{ dr_log_dir }}"
state: directory
mode: '0755'
delegate_to: localhost
become: false
run_once: true
tags: [always]
- name: Display disaster recovery warning
debug:
msg:
- "========================================="
- "!! DISASTER RECOVERY MODE !!"
- "========================================="
- "Host: {{ inventory_hostname }}"
- "Environment: {{ environment | default('unknown') }}"
- "Timestamp: {{ dr_timestamp }}"
- "Backup Date: {{ dr_backup_date }}"
- ""
- "WARNING: This playbook performs destructive operations!"
- "Ensure you have confirmed the recovery plan."
- "========================================="
tags: [always]
- name: Confirm disaster recovery initiation
pause:
prompt: |
!! DISASTER RECOVERY CONFIRMATION !!
You are about to initiate disaster recovery for: {{ inventory_hostname }}
This may overwrite existing data and configurations.
Type 'RECOVER' to continue or Ctrl+C to abort
register: dr_confirmation
when: not dr_verify_only
tags: [always]
- name: Validate confirmation
assert:
that:
- dr_confirmation.user_input == "RECOVER"
fail_msg: "Disaster recovery aborted - incorrect confirmation"
when: not dr_verify_only
tags: [always]
tasks:
# =========================================================================
# Assessment Phase
# =========================================================================
- name: Assess current system state
block:
- name: Check system accessibility
ping:
- name: Gather system facts
setup:
- name: Check critical filesystems
shell: df -h / /var /home /opt 2>/dev/null || df -h
register: dr_filesystem_status
changed_when: false
failed_when: false
- name: Check critical services status
systemd:
name: "{{ item }}"
loop:
- sshd
- "{{ 'chronyd' if ansible_os_family == 'RedHat' else 'chrony' }}"
register: dr_services_status
failed_when: false
- name: Check for corrupted files
command: dmesg | grep -i "error\|fail\|corrupt" | tail -20
register: dr_dmesg_errors
changed_when: false
failed_when: false
- name: Display assessment results
debug:
msg:
- "=== System Assessment ==="
- "OS: {{ ansible_distribution }} {{ ansible_distribution_version }}"
- "Uptime: {{ ansible_uptime_seconds | default(0) // 3600 }} hours"
- "Filesystems: {{ dr_filesystem_status.stdout_lines[:5] }}"
- ""
- "Recent errors:"
- "{{ dr_dmesg_errors.stdout_lines[:10] }}"
tags: [assess, always]
# =========================================================================
# Preparation Phase
# =========================================================================
- name: Prepare for recovery
block:
- name: Create recovery snapshot timestamp
set_fact:
dr_recovery_snapshot: "{{ ansible_date_time.epoch }}"
- name: Stop non-critical services
systemd:
name: "{{ item }}"
state: stopped
loop:
- "{{ 'httpd' if ansible_os_family == 'RedHat' else 'apache2' }}"
- nginx
- docker
failed_when: false
- name: Create pre-recovery backup
archive:
path: /etc
dest: "/tmp/pre_recovery_etc_{{ dr_recovery_snapshot }}.tar.gz"
failed_when: false
- name: Sync filesystems
command: sync
changed_when: false
tags: [prepare]
when: not dr_verify_only
# =========================================================================
# Configuration Restoration
# =========================================================================
- name: Restore system configuration
block:
- name: Find available configuration backups
find:
paths: "{{ dr_backup_source }}/config"
patterns: "config_backup_*.tar.gz"
register: dr_config_backups
delegate_to: localhost
become: false
- name: Display available backups
debug:
msg: "Found {{ dr_config_backups.files | length }} configuration backups"
- name: Restore /etc configuration
unarchive:
src: "{{ dr_backup_source }}/config/etc_backup_{{ dr_backup_date }}.tar.gz"
dest: /
when: dr_backup_date != 'latest'
failed_when: false
- name: Restore SSH configuration
copy:
src: "{{ dr_backup_source }}/config/ssh_config_backup.tar.gz"
dest: /tmp/ssh_config.tar.gz
failed_when: false
- name: Extract SSH configuration
unarchive:
src: /tmp/ssh_config.tar.gz
dest: /etc/ssh
remote_src: yes
failed_when: false
tags: [restore_config]
when: not dr_verify_only
# =========================================================================
# Data Restoration (Placeholder - Customize per infrastructure)
# =========================================================================
- name: Restore application data
block:
- name: Restore /opt applications
unarchive:
src: "{{ dr_backup_source }}/data/opt_backup_{{ dr_backup_date }}.tar.gz"
dest: /
when: dr_backup_date != 'latest'
failed_when: false
- name: Restore /var/lib application data
unarchive:
src: "{{ dr_backup_source }}/data/var_lib_backup_{{ dr_backup_date }}.tar.gz"
dest: /
when: dr_backup_date != 'latest'
failed_when: false
- name: Restore database dumps (if present)
shell: |
if [ -f {{ dr_backup_source }}/databases/mysql_dump_{{ dr_backup_date }}.sql.gz ]; then
gunzip < {{ dr_backup_source }}/databases/mysql_dump_{{ dr_backup_date }}.sql.gz | mysql
fi
failed_when: false
tags: [restore_data]
when: not dr_verify_only
# =========================================================================
# File Permissions and Ownership
# =========================================================================
- name: Fix file permissions and ownership
block:
- name: Restore /etc permissions
file:
path: /etc
mode: '0755'
state: directory
- name: Restore SSH directory permissions
file:
path: /etc/ssh
mode: '0755'
owner: root
group: root
state: directory
- name: Restore SSH key permissions
file:
path: /etc/ssh/ssh_host_{{ item }}_key
mode: '0600'
owner: root
group: root
loop:
- rsa
- ecdsa
- ed25519
failed_when: false
- name: Run SELinux relabel (RHEL)
command: restorecon -R /etc /var
when: ansible_os_family == "RedHat"
failed_when: false
tags: [restore_config, restore_data]
when: not dr_verify_only
# =========================================================================
# Service Restoration
# =========================================================================
- name: Restart critical services
block:
- name: Reload systemd daemon
systemd:
daemon_reload: yes
- name: Restart SSH service
systemd:
name: sshd
state: restarted
enabled: yes
- name: Restart time synchronization
systemd:
name: "{{ 'chronyd' if ansible_os_family == 'RedHat' else 'chrony' }}"
state: restarted
enabled: yes
- name: Restart auditd
systemd:
name: auditd
state: restarted
enabled: yes
failed_when: false
- name: Restart firewall
systemd:
name: "{{ 'firewalld' if ansible_os_family == 'RedHat' else 'ufw' }}"
state: restarted
enabled: yes
failed_when: false
tags: [services]
# =========================================================================
# Verification Phase
# =========================================================================
- name: Verify system recovery
block:
- name: Test SSH connectivity
wait_for:
host: "{{ inventory_hostname }}"
port: 22
timeout: 60
delegate_to: localhost
become: false
- name: Verify critical services
systemd:
name: "{{ item }}"
state: started
loop:
- sshd
- "{{ 'chronyd' if ansible_os_family == 'RedHat' else 'chrony' }}"
- auditd
register: dr_service_verification
- name: Check filesystem integrity
command: df -h
register: dr_fs_verification
changed_when: false
- name: Verify NTP synchronization
command: timedatectl status
register: dr_ntp_verification
changed_when: false
- name: Run configuration tests
command: "{{ item }}"
loop:
- sshd -t
- "{{ 'firewall-cmd --check-config' if ansible_os_family == 'RedHat' else 'ufw status' }}"
register: dr_config_tests
changed_when: false
failed_when: false
tags: [verify, always]
post_tasks:
- name: Display recovery summary
debug:
msg:
- "========================================="
- "Disaster Recovery Summary"
- "========================================="
- "Host: {{ inventory_hostname }}"
- "Environment: {{ environment | default('unknown') }}"
- "Recovery Completed: {{ ansible_date_time.iso8601 }}"
- ""
- "=== Restoration Status ==="
- "Configuration restored: {% if 'restore_config' in ansible_run_tags %}Yes{% else %}Skipped{% endif %}"
- "Data restored: {% if 'restore_data' in ansible_run_tags %}Yes{% else %}Skipped{% endif %}"
- "Services restarted: {% if 'services' in ansible_run_tags %}Yes{% else %}Skipped{% endif %}"
- ""
- "=== Service Status ==="
- "SSH: {{ 'Running' if dr_service_verification is defined else 'Unknown' }}"
- "Firewall: Running"
- "NTP: {{ 'Synchronized' if 'NTP synchronized: yes' in dr_ntp_verification.stdout else 'Not synchronized' }}"
- ""
- "=== Next Steps ==="
- "1. Verify application-specific services"
- "2. Test application functionality"
- "3. Monitor system logs for errors"
- "4. Update documentation"
- "5. Conduct post-recovery review"
- ""
- "========================================="
tags: [always]
- name: Save recovery log
copy:
content: |
Disaster Recovery Report
=========================
Host: {{ inventory_hostname }}
Environment: {{ environment | default('unknown') }}
Recovery Timestamp: {{ dr_timestamp }}
Backup Date Used: {{ dr_backup_date }}
Assessment:
{{ dr_filesystem_status.stdout }}
Service Verification:
{{ dr_service_verification | default('Not performed') }}
Configuration Tests:
{{ dr_config_tests | default('Not performed') }}
Recovery Status: {% if dr_verify_only %}Verification Only{% else %}Complete{% endif %}
dest: "{{ dr_log_dir }}/{{ inventory_hostname }}_recovery.log"
delegate_to: localhost
become: false
tags: [always]
# =============================================================================
# Disaster Recovery Logs
# =============================================================================
# Logs are saved to: ./logs/disaster_recovery/<date>/<hostname>_recovery.log
# =============================================================================