infra-automation/docs/runbooks/disaster-recovery.md

# Disaster Recovery Runbook

Emergency procedures for recovering from system failures and disasters.

## Severity Levels

| Level | Description | Response Time |
|-------|-------------|---------------|
| **P0** | Complete system failure | Immediate |
| **P1** | Critical service outage | < 15 minutes |
| **P2** | Degraded performance | < 1 hour |
| **P3** | Minor issues | < 4 hours |

## Initial Response

### 1. Incident Detection (0-5 minutes)

```bash
# Verify incident scope
ansible all -i inventories/<environment> -m ping

# Identify failed hosts
ansible-playbook -i inventories/<environment> playbooks/disaster_recovery.yml --tags assess
```
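
To turn the ping output into a list of hosts that need attention, a small filter can help. This is a minimal sketch: the function name `list_failed_hosts` is hypothetical, and it assumes ad-hoc result lines begin with `<host> | <STATUS>`, which is how the default and one-line (`-o`) output are normally formatted.

```bash
# Print every host that an ad-hoc ansible run reported as UNREACHABLE or
# FAILED. Reads ansible output on stdin and prints one host name per line.
# Assumption: result lines look like "<host> | UNREACHABLE! ..." or
# "<host> | FAILED! ...".
list_failed_hosts() {
  awk -F' [|] ' '/ [|] (UNREACHABLE!|FAILED!)/ { print $1 }' | sort -u
}

# Usage (the inventory path is a placeholder, as elsewhere in this runbook):
#   ansible all -i inventories/<environment> -m ping -o | list_failed_hosts
```

The resulting names can be passed to `--limit` in the recovery procedures below.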

### 2. Incident Classification (5-10 minutes)

Determine:
- Affected hosts/services
- Severity level
- Business impact
- Recovery time objective (RTO)

### 3. Communication (10-15 minutes)

**Notify:**
- Infrastructure team
- Management (P0/P1 only)
- Affected stakeholders

**Template:**
```
INCIDENT ALERT [P0/P1/P2/P3]

Incident ID: DR-YYYYMMDD-NNN
Detected: [Timestamp]
Scope: [Affected systems]
Impact: [Business impact]
Status: Investigating/Responding/Resolved
ETA: [Estimated resolution time]
```
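
Filling in the template by hand is error-prone under pressure, so it may be worth scripting. The sketch below is one possible approach; `incident_id` and `incident_alert` are hypothetical helper names, and the UTC timestamp format is an assumption rather than a team standard.

```bash
# Build an incident ID in the DR-YYYYMMDD-NNN form used by the template.
# $1 = date as YYYYMMDD, $2 = sequence number for that day (plain integer,
# without leading zeros).
incident_id() {
  printf 'DR-%s-%03d\n' "$1" "$2"
}

# Print a filled-in alert.
# Arguments: severity, incident ID, scope, impact, status, ETA.
incident_alert() {
  cat <<EOF
INCIDENT ALERT [$1]

Incident ID: $2
Detected: $(date -u +%Y-%m-%dT%H:%M:%SZ)
Scope: $3
Impact: $4
Status: $5
ETA: $6
EOF
}

# Usage:
#   id=$(incident_id "$(date -u +%Y%m%d)" 1)
#   incident_alert P1 "$id" "dbserver01" "Orders delayed" Investigating "30 minutes"
```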

## Recovery Procedures

### Scenario 1: Single Host Failure (P1)

**Symptoms:** Host unreachable, services down

**Recovery:**

```bash
# 1. Assess damage
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --tags assess

# 2. Attempt service restart
ansible failed_host -m systemd -a "name=<service> state=restarted"

# 3. If unsuccessful, initiate full recovery
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --extra-vars "dr_backup_date=latest"

# 4. Verify recovery
ansible-playbook playbooks/disaster_recovery.yml \
  --limit failed_host \
  --tags verify
```

**RTO:** 30 minutes
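
Steps 2 to 4 form a simple decision: try the restart, fall back to a full restore only if it fails, then verify either way. The following sketch chains them. It assumes `ansible` exits non-zero when the restart fails on the host; the `recover_host` function and the `ANSIBLE` / `ANSIBLE_PLAYBOOK` override variables are hypothetical and exist only so the flow can be dry-run with stand-in commands.

```bash
# Try a service restart first; run a full restore only if the restart fails.
# $1 = host or inventory pattern, $2 = service name.
recover_host() {
  host=$1
  service=$2
  # Hypothetical override variables; the defaults are the real CLI names.
  ansible_cmd=${ANSIBLE:-ansible}
  playbook_cmd=${ANSIBLE_PLAYBOOK:-ansible-playbook}

  # Step 2: attempt the cheap fix.
  if "$ansible_cmd" "$host" -m systemd -a "name=$service state=restarted"; then
    echo "restart succeeded on $host"
  else
    # Step 3: restart failed, so restore from the latest backup.
    echo "restart failed on $host, starting full recovery"
    "$playbook_cmd" playbooks/disaster_recovery.yml \
      --limit "$host" --extra-vars "dr_backup_date=latest" || return 1
  fi

  # Step 4: verify in both cases.
  "$playbook_cmd" playbooks/disaster_recovery.yml --limit "$host" --tags verify
}

# Usage: recover_host failed_host <service>
```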

### Scenario 2: Database Corruption (P0)

**Symptoms:** Database errors, data inconsistency

**Recovery:**

```bash
# 1. Stop application services
ansible dbserver -m systemd -a "name=application state=stopped"

# 2. Restore database from backup
ansible-playbook playbooks/disaster_recovery.yml \
  --limit dbserver \
  --tags restore_data \
  --extra-vars "dr_backup_date=YYYY-MM-DD"

# 3. Verify database integrity (mysqlcheck needs the MySQL server to be running)
ansible dbserver -m shell -a "mysqlcheck --all-databases"

# 4. Restart services
ansible dbserver -m systemd -a "name=mysql state=restarted"
ansible dbserver -m systemd -a "name=application state=restarted"
```

**RTO:** 1 hour
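
Because `dr_backup_date` is typed by hand during a P0, a quick shape check before starting the restore can catch a pasted placeholder or a wrongly formatted date. This is a sketch only: `valid_backup_date` is a hypothetical helper, and it checks the format of the value, not whether a backup exists for that date.

```bash
# Return 0 when $1 is "latest" or is written as YYYY-MM-DD, 1 otherwise.
# Shape check only: it does not confirm that a backup exists for that date.
valid_backup_date() {
  case "$1" in
    latest) return 0 ;;
    [0-9][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   backup_date=2025-11-10
#   valid_backup_date "$backup_date" && ansible-playbook playbooks/disaster_recovery.yml \
#     --limit dbserver --tags restore_data --extra-vars "dr_backup_date=$backup_date"
```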

### Scenario 3: Complete Environment Failure (P0)

**Symptoms:** All hosts unreachable, total outage

**Recovery:**

```bash
# 1. Verify network connectivity
ping -c 4 <host>

# 2. Check infrastructure provider status
# (AWS, Azure, etc.)

# 3. If infrastructure is available, restore hosts individually
for host in host1 host2 host3; do
  ansible-playbook playbooks/disaster_recovery.yml \
    --limit "$host" \
    --extra-vars "dr_backup_date=latest"
done

# 4. Verify environment health (dry run: reports drift without changing hosts)
ansible-playbook -i inventories/<environment> site.yml --check
```

**RTO:** 4 hours
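
In a full outage it is useful to know which hosts failed to restore without stopping the loop at the first error. The sketch below shows one way to collect them; `restore_each` and `restore_host` are hypothetical names, and the wrapper assumes `ansible-playbook` exits non-zero when the restore fails.

```bash
# Run a restore command once per host and keep going when one host fails.
# $1 = command to run (it receives the host name as its only argument),
# remaining arguments = host names.
# Prints each failed host on stdout and returns non-zero if any failed.
restore_each() {
  restore_cmd=$1
  shift
  failed=0
  for host in "$@"; do
    # Send the command's own output to stderr so stdout holds only host names.
    if ! "$restore_cmd" "$host" >&2; then
      echo "$host"
      failed=1
    fi
  done
  return "$failed"
}

# Hypothetical wrapper around the playbook call from step 3.
restore_host() {
  ansible-playbook playbooks/disaster_recovery.yml \
    --limit "$1" \
    --extra-vars "dr_backup_date=latest"
}

# Usage: restore_each restore_host host1 host2 host3 > failed_hosts.txt
```

Hosts listed in `failed_hosts.txt` can then be retried or escalated individually.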

### Scenario 4: Configuration Corruption (P2)

**Symptoms:** Services misconfigured, errors in logs

**Recovery:**

```bash
# 1. Restore configuration only
ansible-playbook playbooks/disaster_recovery.yml \
  --limit affected_hosts \
  --tags restore_config \
  --extra-vars "dr_backup_date=YYYY-MM-DD"

# 2. Restart affected services
ansible affected_hosts -m systemd -a "name=<service> state=restarted"

# 3. Verify configuration
ansible affected_hosts -m shell -a "<service> -t"  # Test config
```

**RTO:** 30 minutes

## Escalation Path

1. **L1:** On-call engineer (initial response)
2. **L2:** Senior infrastructure engineer (if unresolved in 30 min)
3. **L3:** Infrastructure team lead (P0/P1 or > 1 hour)
4. **L4:** CTO/Management (> 2 hours or business-critical)
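
The escalation rules can be expressed as a small lookup, which may be handy in alerting scripts. This sketch reflects one reading of the list above and is an assumption, not policy: P0/P1 start at L3, an incident unresolved at 30 minutes moves to L2, past 60 minutes to L3, and past 120 minutes to L4. The "business-critical" trigger for L4 is a judgment call and is not modeled. `escalation_tier` is a hypothetical name.

```bash
# Suggest an escalation tier.
# $1 = severity (P0..P3), $2 = minutes elapsed without resolution.
escalation_tier() {
  severity=$1
  minutes=$2
  tier=L1
  if [ "$minutes" -ge 30 ]; then tier=L2; fi
  # P0/P1 incidents involve the team lead from the start.
  case "$severity" in
    P0|P1) tier=L3 ;;
  esac
  if [ "$minutes" -gt 60 ]; then tier=L3; fi
  if [ "$minutes" -gt 120 ]; then tier=L4; fi
  echo "$tier"
}

# Usage: escalation_tier P2 45
```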

## Post-Incident Procedures

### 1. Verification (Immediate)

```bash
# System health check
ansible-playbook playbooks/maintenance.yml --tags verify

# Security audit
ansible-playbook playbooks/security_audit.yml
```

### 2. Documentation (Within 2 hours)

Document in incident log:
- Timeline of events
- Actions taken
- Recovery time
- Root cause (if known)
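
The timeline is easier to reconstruct if entries are recorded as they happen. A minimal sketch, assuming a plain-text log file per incident; the `log_event` helper and the file naming are hypothetical.

```bash
# Append a UTC-timestamped entry to an incident log file.
# $1 = log file path, remaining arguments = free-text description.
log_event() {
  log_file=$1
  shift
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log_file"
}

# Usage:
#   log_event DR-20251111-001.log "Stopped application on dbserver"
#   log_event DR-20251111-001.log "Restore from backup started"
```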

### 3. Post-Mortem (Within 48 hours)

Conduct post-mortem meeting:
- What happened
- What went well
- What could be improved
- Action items

### 4. Preventive Actions (Within 1 week)

- Implement fixes
- Update runbooks
- Improve monitoring
- Test recovery procedures

## Testing Schedule

| Test Type | Frequency | Scope |
|-----------|-----------|-------|
| Single host recovery | Monthly | Development |
| Configuration restore | Monthly | Staging |
| Database restore | Quarterly | Staging |
| Full DR drill | Semi-annually | All |

## Emergency Contacts

| Role | Name | Contact | Backup |
|------|------|---------|--------|
| On-Call Engineer | TBD | TBD | TBD |
| Team Lead | TBD | TBD | TBD |
| Management | TBD | TBD | TBD |
| Vendor Support | TBD | TBD | - |

## Critical Information

### Backup Locations
- Local: `/var/backups/`
- Remote: `[Remote backup server]`
- Off-site: `[Off-site location]`

### Recovery Credentials
- Vault password location: `[Secure location]`
- Emergency access: `[Break-glass procedure]`
- Root passwords: `[Secure password manager]`

### Service Dependencies

```
Load Balancer
    ↓
Web Servers (webserver01, webserver02)
    ↓
Application Servers (appserver01, appserver02)
    ↓
Database (dbserver01) → Replica (dbserver02)
    ↓
Cache (redis01)
```
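
When several tiers are down, a common approach is to bring them back bottom-up so that each tier finds its dependencies already running. The sketch below encodes that order using the host names from the diagram. The bottom-up ordering is an assumption about this environment, the load balancer is omitted because the diagram gives it no host name, and `recovery_order` is a hypothetical helper.

```bash
# Tiers from the diagram, listed bottom-up. Each line is a comma-separated
# host pattern that can be passed to --limit.
RECOVERY_TIERS='redis01
dbserver01,dbserver02
appserver01,appserver02
webserver01,webserver02'

# Print one --limit pattern per line, in the assumed recovery order.
recovery_order() {
  printf '%s\n' "$RECOVERY_TIERS"
}

# Usage (stdin is redirected so the playbook cannot consume the tier list):
#   recovery_order | while read -r tier; do
#     ansible-playbook playbooks/disaster_recovery.yml \
#       --limit "$tier" --tags verify </dev/null || break
#   done
```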

## Quick Reference

```bash
# Assess all hosts
ansible-playbook playbooks/disaster_recovery.yml --tags assess

# Full recovery of a single host
ansible-playbook playbooks/disaster_recovery.yml --limit host

# Configuration only
ansible-playbook playbooks/disaster_recovery.yml --limit host --tags restore_config

# Verify recovery
ansible-playbook playbooks/disaster_recovery.yml --limit host --tags verify

# Check backup availability
ansible all -m shell -a "ls -lh /var/backups/"
```
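
Listing `/var/backups/` shows what exists but not whether it is fresh enough to restore from. A small check along these lines can flag stale backups. It is a sketch: `has_recent_backup` is a hypothetical helper, and the one-day threshold in the usage line is an assumption about the backup schedule.

```bash
# Return 0 if the directory contains at least one regular file modified
# within the last N days, 1 otherwise.
# $1 = backup directory, $2 = maximum age in days.
has_recent_backup() {
  [ -n "$(find "$1" -type f -mtime "-$2" 2>/dev/null | head -n 1)" ]
}

# Usage on a single host:
#   has_recent_backup /var/backups/ 1 || echo "no backup from the last 24 hours"
# Ad-hoc equivalent across the inventory:
#   ansible all -m shell -a "find /var/backups/ -type f -mtime -1 | head -n 1"
```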

---
**Last Updated:** 2025-11-11
**Next Review:** 2026-02-11