# Incident Response Runbook

Procedures for responding to security incidents and breaches.

## Incident Categories

| Category | Examples | Severity |
|----------|----------|----------|
| **Security Breach** | Unauthorized access, data exfiltration | Critical |
| **Malware** | Ransomware, trojans, rootkits | Critical |
| **DoS/DDoS** | Service flooding, resource exhaustion | High |
| **Policy Violation** | Unauthorized changes, compliance breach | Medium |
| **Suspicious Activity** | Unusual logins, port scans | Low |

## Initial Response (First 15 Minutes)

### 1. Detection and Verification

```bash
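# Check for suspicious activity
# (add --become if the Ansible user cannot see other users' processes or read auth logs)
ansible all -m shell -a "last -a | head -20"  # Recent logins
ansible all -m shell -a "who"  # Current users
ansible all -m shell -a "ss -tulpn"  # Listening ports; -l already limits output to listening sockets, and grepping for LISTEN would hide UDP

# Check failed login attempts (Debian/Ubuntu path; RHEL-family hosts log to /var/log/secure)
ansible all -m shell -a "grep 'Failed password' /var/log/auth.log | tail -50"

# Check for privilege escalation
ansible all -m shell -a "grep sudo /var/log/auth.log | tail -20"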
```
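
Raw `grep` output from many hosts is hard to triage by eye. As a minimal sketch, the failed-login lines can be reduced to a count per source address. The function name `failed_logins_by_ip` is hypothetical, and the sketch assumes the usual OpenSSH message format (`Failed password ... from <ip> port <n>`) with IPv4 sources only:

```bash
# Summarize failed SSH logins by source address.
# Reads auth.log-style lines on stdin and prints "<count> <ip>", busiest first.
failed_logins_by_ip() {
  grep 'Failed password' \
    | sed -n 's/.* from \([0-9][0-9.]*\) port .*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Piping the output of the `grep 'Failed password'` command above into `failed_logins_by_ip` should give a ranking across all hosts; lines that do not match the assumed format (including IPv6 sources) are silently skipped.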

### 2. Immediate Containment

**If a breach is confirmed:**

```bash
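# Block suspicious IP (replace with actual IP).
# ufw evaluates rules in order, so insert the deny ahead of existing allow rules.
ansible all -m shell -a "ufw insert 1 deny from <suspicious_ip>"

# Disable compromised user account.
# -L only locks the password; -e 1 also expires the account so SSH key logins are refused.
ansible all -m shell -a "usermod -L -e 1 <username>"

# Kill suspicious processes (capture process and network state first if it is needed as evidence)
ansible all -m shell -a "pkill -9 <process_name>"

# Isolate compromised host.
# Allow the trusted admin IP before dropping everything else; otherwise this command cuts off Ansible's own SSH session.
ansible compromised_host -m shell -a "iptables -I INPUT -s <trusted_ip> -j ACCEPT && iptables -I OUTPUT -d <trusted_ip> -j ACCEPT && iptables -P INPUT DROP && iptables -P OUTPUT DROP"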
```
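
The IP placeholder in the block command ends up inside a string that runs as a shell command on every host, so a malformed or carelessly pasted value is worth rejecting before it is used. A minimal sketch of such a check, assuming IPv4 only (`is_valid_ipv4` is a hypothetical helper, not part of Ansible or ufw):

```bash
# Succeed only if $1 is a dotted-quad IPv4 address with every octet in 0..255.
is_valid_ipv4() (
  ip=$1
  # Reject empty input and anything containing characters other than digits and dots
  case $ip in
    ''|*[!0-9.]*) exit 1 ;;
  esac
  # Require exactly four groups of one to three digits
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || exit 1
  # Split on dots and range-check each octet
  IFS=.
  set -- $ip
  for octet in "$@"; do
    [ "$octet" -le 255 ] || exit 1
  done
  exit 0
)
```

One way to use it is as a guard, for example `is_valid_ipv4 "$ip" && ansible all -m shell -a "ufw deny from $ip"`, so the firewall command never runs with an unchecked value.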

### 3. Notification

**Notify (within 15 minutes):**
- Security team
- Infrastructure team lead
- Management (critical incidents)
- Legal/compliance (data breaches)

**Template:**
```
SECURITY INCIDENT [CRITICAL/HIGH/MEDIUM/LOW]

Incident ID: SEC-YYYYMMDD-NNN
Detected: [Timestamp]
Type: [Breach/Malware/DoS/Policy/Suspicious]
Affected Systems: [List]
Initial Assessment: [Description]
Containment Status: [Contained/In Progress/Not Contained]
Response Lead: [Name]
```
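
Filling in the template by hand under time pressure invites inconsistent IDs. The sketch below shows one possible way to generate the `SEC-YYYYMMDD-NNN` identifier and render the template; both function names are hypothetical, and how the daily sequence number is tracked is left as an assumption:

```bash
# Build an incident ID in the SEC-YYYYMMDD-NNN form used by the template.
# $1: date as YYYYMMDD, $2: sequence number for that day (source of the counter is an assumption)
incident_id() (
  # Strip leading zeros so a value like 08 is not rejected as invalid octal
  seq=$(printf '%s' "$2" | sed 's/^0*//')
  printf 'SEC-%s-%03d\n' "$1" "${seq:-0}"
)

# Render the notification template from eight positional fields.
render_notification() (
  severity=$1 id=$2 detected=$3 type=$4 systems=$5 assessment=$6 status=$7 lead=$8
  cat <<EOF
SECURITY INCIDENT [$severity]

Incident ID: $id
Detected: $detected
Type: $type
Affected Systems: $systems
Initial Assessment: $assessment
Containment Status: $status
Response Lead: $lead
EOF
)
```

For instance, `incident_id "$(date +%Y%m%d)" 1` yields the first ID of the current day, which can then be passed as the second argument to `render_notification`.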

## Investigation Phase (15-60 Minutes)

### 1. Evidence Collection

```bash
# Use one timestamp so every artifact from this capture shares the same suffix
TS=$(date +%s)

# Capture system state
ansible compromised_host -m shell -a "ps aux > /tmp/processes_${TS}.txt"
ansible compromised_host -m shell -a "ss -tulpn > /tmp/network_${TS}.txt"
ansible compromised_host -m shell -a "df -h > /tmp/disk_${TS}.txt"

# Collect logs
ansible compromised_host -m shell -a "tar czf /tmp/logs_${TS}.tar.gz /var/log/"

# Copy evidence to secure location.
# fetch does not expand wildcards, so name the exact file; repeat for the .txt captures as needed.
ansible compromised_host -m fetch \
  -a "src=/tmp/logs_${TS}.tar.gz dest=./evidence/ flat=yes"
```
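
Fetched evidence is easier to defend later if its integrity can be demonstrated, which supports the chain-of-custody requirement listed under Evidence Preservation. A minimal sketch, assuming GNU coreutils `sha256sum` on the control node and the `./evidence` directory used above (the function names and the `SHA256SUMS` file name are hypothetical):

```bash
# Write a SHA-256 manifest covering every file in an evidence directory.
# $1: evidence directory (defaults to ./evidence)
write_evidence_manifest() (
  cd "${1:-./evidence}" || exit 1
  # Skip the manifest itself so re-running does not hash a stale copy
  find . -type f ! -name SHA256SUMS -exec sha256sum {} + | sort -k 2 > SHA256SUMS
)

# Re-check every file against the manifest; fails if any file changed or is missing.
verify_evidence_manifest() (
  cd "${1:-./evidence}" || exit 1
  sha256sum --check --quiet SHA256SUMS
)
```

Running `write_evidence_manifest` right after the fetch and `verify_evidence_manifest` before each later analysis step gives a simple record that the files were not altered in between.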

### 2. Forensic Analysis

```bash
# List files modified in the last 24 hours (candidates for unauthorized changes)
ansible compromised_host -m shell -a "find / -type f -mtime -1 2>/dev/null | head -100"

# Check for SUID files
ansible compromised_host -m shell -a "find / -perm -4000 -type f 2>/dev/null"

# Check cron jobs (system-wide and per-user; per-user path shown is the Debian/Ubuntu location)
ansible compromised_host -m shell -a "cat /etc/crontab; ls -la /etc/cron.*/ /var/spool/cron/crontabs/"

# Check startup services
ansible compromised_host -m shell -a "systemctl list-unit-files --state=enabled"

# Check network connections
ansible compromised_host -m shell -a "ss -tnp"

# AIDE integrity check (if configured); a non-zero exit code means differences were found
ansible compromised_host -m shell -a "aide --check"
```
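
A SUID listing on its own is mostly legitimate binaries; it becomes useful when compared with a listing from a known-good host or image. Assuming such a baseline has been saved to a file (the file names and the function name `new_suid_files` are hypothetical), the comparison might look like this:

```bash
# Print SUID paths present in the current listing but absent from the baseline.
# $1: baseline file, $2: current file; one absolute path per line in each
new_suid_files() {
  # -F fixed strings, -x whole-line match, -v invert: keep lines not found in the baseline
  grep -Fxv -f "$1" "$2"
}
```

The function exits non-zero when nothing new is found, so it can be used directly in a condition. Whole-line matching matters here: without `-x`, a baseline entry such as `/usr/bin/sudo` would also mask `/usr/bin/sudoedit`.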

### 3. Root Cause Analysis

Determine:
- Entry point
- Attack vector
- Extent of compromise
- Data accessed/exfiltrated
- Duration of access

## Eradication Phase (1-4 Hours)

### 1. Remove Threat

```bash
# Remove malicious files
ansible compromised_host -m file -a "path=<malicious_file> state=absent"

# Kill malicious processes
ansible compromised_host -m shell -a "pkill -9 <malicious_process>"

# Remove unauthorized users
ansible compromised_host -m user -a "name=<unauthorized_user> state=absent remove=yes"

# Remove backdoors
ansible compromised_host -m shell -a "rm -f /etc/cron.d/<backdoor>"
```

### 2. Patch Vulnerabilities

```bash
# Apply security updates
ansible-playbook -i inventories/<environment> playbooks/maintenance.yml \
  --limit compromised_host \
  --tags updates

# Harden configuration
ansible-playbook -i inventories/<environment> playbooks/security_audit.yml \
  --limit compromised_host
```

### 3. Credential Rotation

```bash
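# Rotate SSH keys: install the new key for the ansible user first, then remove every other
# authorized_keys file, including root's. Later Ansible runs must use the matching new private key.
ansible compromised_host -m shell \
  -a "echo '<new_key>' > /home/ansible/.ssh/authorized_keys && find /home /root -path '*/.ssh/authorized_keys' ! -path '/home/ansible/*' -delete"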

# Rotate passwords (use vault)
ansible-playbook -i inventories/<environment> site.yml \
  --limit compromised_host \
  --tags user_management \
  --ask-vault-pass

# Rotate API tokens
# Update tokens in vault and redeploy
ansible-vault edit inventories/<environment>/group_vars/all/vault.yml
```
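
The token rotation step needs a fresh value to place in the vault file. Where the issuing service accepts a client-chosen secret (an assumption; many services issue tokens themselves), a random value can be drawn from the kernel CSPRNG. `generate_token` is a hypothetical helper:

```bash
# Generate a random lowercase hex token from /dev/urandom.
# $1: number of random bytes (default 32, which gives 64 hex characters)
generate_token() {
  head -c "${1:-32}" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

The output can be pasted into the file opened by `ansible-vault edit`, so the secret is only ever stored on disk in encrypted form.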

## Recovery Phase (4-8 Hours)

### 1. System Restoration

```bash
# Option A: Rebuild from scratch (recommended for severe breaches)
# 1. Provision new host
# 2. Deploy via Ansible
ansible-playbook -i inventories/<environment> site.yml --limit new_host

# Option B: Restore from clean backup
ansible-playbook playbooks/disaster_recovery.yml \
  --limit compromised_host \
  --extra-vars "dr_backup_date=<known_clean_date>"
```
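
Option B is only safe if the chosen backup predates the earliest sign of compromise found during the investigation. A small guard can make that check explicit before the restore is launched. The sketch assumes both dates are written as `YYYY-MM-DD`; the format `dr_backup_date` actually expects is not stated in this runbook, and `is_backup_clean` is a hypothetical name:

```bash
# Succeed only if the backup date is strictly earlier than the compromise date.
# ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings, so no date parsing is needed.
is_backup_clean() (
  backup_date=$1 compromise_date=$2
  earliest=$(printf '%s\n%s\n' "$backup_date" "$compromise_date" | sort | head -n 1)
  [ "$backup_date" != "$compromise_date" ] && [ "$earliest" = "$backup_date" ]
)
```

A backup taken on the same day as the first indicator is treated as unsafe, since the time of day is unknown at this granularity.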

### 2. Enhanced Monitoring

```bash
# Forward all logs to the SIEM over TCP (@@ means TCP; a single @ would be UDP)
ansible all -m lineinfile \
  -a "path=/etc/rsyslog.conf line='*.* @@<siem_server>:514'"

# Restart logging
ansible all -m systemd -a "name=rsyslog state=restarted"

# Deploy monitoring agents (if not present)
# Configure alerts for suspicious activity
```
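
After the restart it is worth confirming that the forwarding rule is actually present on each host. A minimal sketch of such a check, matching the exact line written by the `lineinfile` command above (`has_siem_forwarding` is a hypothetical name):

```bash
# Succeed if the rsyslog config forwards all facilities to the given server over TCP port 514.
# $1: SIEM host name, $2: config file (defaults to /etc/rsyslog.conf)
has_siem_forwarding() {
  # -F fixed string, -x whole line, -q quiet: only the exit status matters
  grep -Fxq "*.* @@$1:514" "${2:-/etc/rsyslog.conf}"
}
```

This only verifies the configuration file; whether messages reach the SIEM still has to be confirmed on the receiving side.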

### 3. Security Hardening

```bash
# Run full security audit
ansible-playbook playbooks/security_audit.yml

# Apply additional hardening
ansible all -m sysctl -a "name=net.ipv4.conf.all.accept_source_route value=0 state=present reload=yes"
ansible all -m sysctl -a "name=net.ipv4.tcp_syncookies value=1 state=present reload=yes"

# Enable AIDE file integrity monitoring
ansible all -m shell -a "aideinit && aide --check"
```
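
The `sysctl` settings above can be verified by reading the live values back from `/proc/sys`, where each dotted key maps to a path. A minimal sketch, assuming keys without dots inside a single component (`check_sysctl` and its optional third argument are hypothetical; the override exists only so the function can be exercised against a test directory):

```bash
# Succeed if a kernel parameter currently has the expected value.
# $1: dotted key, $2: expected value, $3: root directory (defaults to /proc/sys)
check_sysctl() (
  key=$1 expected=$2 root=${3:-/proc/sys}
  # net.ipv4.tcp_syncookies -> <root>/net/ipv4/tcp_syncookies
  path=$root/$(printf '%s' "$key" | tr . /)
  actual=$(cat "$path" 2>/dev/null) || exit 1
  [ "$actual" = "$expected" ]
)
```

For example, `check_sysctl net.ipv4.tcp_syncookies 1` run on a host reports through its exit status whether the hardening value is in effect.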

## Post-Incident Activities

### 1. Documentation (Within 24 Hours)

Create incident report with:
- Timeline of events
- Actions taken
- Impact assessment
- Root cause
- Evidence collected
- Lessons learned

### 2. Stakeholder Communication (Within 24 Hours)

Notify:
- Management
- Legal/compliance
- Affected customers (if applicable)
- Regulatory bodies (if required)

### 3. Post-Incident Review (Within 72 Hours)

Review meeting agenda:
- What happened
- How it was detected
- Response effectiveness
- What went well
- What needs improvement
- Action items

### 4. Preventive Measures (Within 2 Weeks)

- Implement security controls
- Update security policies
- Enhance monitoring
- Conduct training
- Test incident response procedures

## Compliance Requirements

### Data Breach Notification

| Regulation | Notification Timeline | Who to Notify |
|------------|----------------------|---------------|
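| GDPR | 72 hours after becoming aware of the breach | Supervisory authority; affected individuals without undue delay if the risk to them is high |
| HIPAA | No later than 60 days after discovery | HHS, affected individuals, media (if more than 500 residents of a state are affected) |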
| PCI-DSS | Immediately | Payment brands, acquiring bank |
| State Laws | Varies | State AG, affected residents |
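
Because the GDPR clock is a fixed 72 hours, the deadline can be computed as soon as the detection time is recorded. A minimal sketch, assuming GNU `date` and a UTC timestamp in ISO 8601 form (`gdpr_deadline` is a hypothetical name). The timestamp is converted to epoch seconds first, since mixing an absolute time with a relative offset in a single `date -d` string can be misparsed as a timezone:

```bash
# Print the 72-hour notification deadline for a given awareness time.
# $1: UTC timestamp such as 2025-11-11T10:00:00Z (requires GNU date)
gdpr_deadline() (
  aware_epoch=$(date -u -d "$1" +%s) || exit 1
  date -u -d "@$((aware_epoch + 72 * 3600))" +%Y-%m-%dT%H:%M:%SZ
)
```

The result can be added to the incident notification so that legal and compliance staff see the deadline alongside the detection time.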

### Evidence Preservation

- Maintain chain of custody
- Preserve logs for a minimum of 90 days
- Document all investigative steps
- Secure evidence with encryption

## Tools and Resources

### Analysis Tools

```bash
# Log analysis
grep -i "failed\|error\|unauthorized" /var/log/auth.log

# Network analysis
tcpdump -i eth0 -w capture.pcap

# Process analysis (top 20 by CPU; the first line is the header)
ps aux --sort=-%cpu | head -21

# File analysis
find / -type f -name "*.php" -exec grep -l "eval\|base64_decode" {} +
```

### External Resources

- NIST Cybersecurity Framework
- SANS Incident Response Guide
- MITRE ATT&CK Framework
- CERT Incident Handling Guide

## Severity Levels and Response Times

| Severity | Examples | Response Time | Recovery Time |
|----------|----------|---------------|---------------|
| **Critical** | Active data breach, ransomware and other malware | 15 min | 4 hours |
| **High** | DoS/DDoS, unauthorized access attempt | 30 min | 8 hours |
| **Medium** | Policy violation, unauthorized changes | 2 hours | 24 hours |
| **Low** | Suspicious activity: unusual or failed logins, port scans | 8 hours | 48 hours |
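
For alerting or ticketing scripts, the response targets in the table above can be encoded as a simple lookup. This is only a sketch; `response_minutes` is a hypothetical helper, and the values are the response times from the table converted to minutes:

```bash
# Map a severity level (case-insensitive) to its target response time in minutes.
response_minutes() {
  case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
    critical) echo 15 ;;
    high)     echo 30 ;;
    medium)   echo 120 ;;
    low)      echo 480 ;;
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Returning a non-zero status for unknown input keeps a typo in the severity field from silently mapping to a default.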

## Quick Reference

```bash
# Block IP immediately (insert first so the deny precedes existing allow rules)
ansible all -m shell -a "ufw insert 1 deny from <ip>"

# Check current users
ansible all -m shell -a "w"

# Check listening ports
ansible all -m shell -a "ss -tulpn"

# Collect evidence
ansible host -m shell -a "tar czf /tmp/evidence.tar.gz /var/log/"

# Isolate host (allow the trusted IP before setting the DROP policy)
ansible host -m shell -a "iptables -I INPUT -s <trusted_ip> -j ACCEPT && iptables -P INPUT DROP"

# Security audit
ansible-playbook playbooks/security_audit.yml --limit host
```

## Emergency Contacts

| Role | Name | Contact | Backup |
|------|------|---------|--------|
| Security Lead | TBD | TBD | TBD |
| Incident Commander | TBD | TBD | TBD |
| Legal Counsel | TBD | TBD | TBD |
| PR/Communications | TBD | TBD | TBD |
| Law Enforcement | TBD | TBD | - |

---
**Last Updated:** 2025-11-11
**Next Review:** 2026-02-11
**Classification:** Confidential