diff --git a/cheatsheets/playbooks/backup.md b/cheatsheets/playbooks/backup.md new file mode 100644 index 0000000..6be7ae8 --- /dev/null +++ b/cheatsheets/playbooks/backup.md @@ -0,0 +1,292 @@ +# Backup Playbook Cheatsheet + +Quick reference for using the backup playbook. + +## Quick Start + +```bash +# Run full backup on all hosts +ansible-playbook playbooks/backup.yml + +# Backup specific environment +ansible-playbook -i inventories/production playbooks/backup.yml + +# Dry-run +ansible-playbook playbooks/backup.yml --check +``` + +## Common Usage + +### Full Backup + +```bash +# Complete backup (config + data + databases) +ansible-playbook playbooks/backup.yml \ + --extra-vars "backup_type=full" + +# Production environment +ansible-playbook -i inventories/production playbooks/backup.yml \ + --extra-vars "backup_type=full" +``` + +### Incremental Backup (Default) + +```bash +# Configuration and databases only +ansible-playbook playbooks/backup.yml +``` + +### Selective Backups + +```bash +# Configuration files only +ansible-playbook playbooks/backup.yml --tags config + +# Databases only +ansible-playbook playbooks/backup.yml --tags databases + +# Application data only +ansible-playbook playbooks/backup.yml --tags data + +# Log files +ansible-playbook playbooks/backup.yml --tags logs +``` + +## Available Tags + +| Tag | Description | +|-----|-------------| +| `config` | System configuration files (/etc, SSH, network) | +| `data` | Application data (/opt, /var/lib, /home) | +| `databases` | MySQL, PostgreSQL, MongoDB dumps | +| `logs` | Log files and audit logs | +| `verify` | Verify backup integrity | +| `cleanup` | Remove old backups | + +## Extra Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `backup_type` | `incremental` | Backup type (full or incremental) | +| `backup_retention_days` | `30` | How long to keep backups | +| `backup_compress` | `true` | Compress backups | +| `backup_verify` | `true` | Verify backup 
integrity | +| `backup_remote_dir` | `None` | Remote backup destination | + +## What Gets Backed Up + +### Configuration (`--tags config`) +- ✅ /etc directory +- ✅ SSH configuration +- ✅ Network configuration +- ✅ Firewall rules +- ✅ Cron jobs +- ✅ Systemd services + +### Application Data (`--tags data`) +- ✅ /opt directory +- ✅ /var/lib (excluding databases) +- ✅ /home directories + +### Databases (`--tags databases`) +- ✅ MySQL/MariaDB (all databases) +- ✅ PostgreSQL (all databases) +- ✅ MongoDB dumps + +### Logs (`--tags logs`) +- ✅ /var/log +- ✅ Audit logs + +## Backup Location + +Local backups: `/var/backups/` + +``` +/var/backups/ +├── config/ +│ ├── etc_backup_.tar.gz +│ ├── ssh_backup_.tar.gz +│ └── ... +├── data/ +│ ├── opt_backup_.tar.gz +│ └── ... +├── databases/ +│ ├── mysql_dump_.sql.gz +│ └── ... +└── logs/ + └── var_log_backup_.tar.gz +``` + +## Backup Verification + +```bash +# Run only the tasks tagged verify +ansible-playbook playbooks/backup.yml --tags verify + +# Verify specific backup integrity +ansible all -m shell -a "gzip -t /var/backups/config/etc_backup_*.tar.gz" +``` + +## Cleanup Old Backups + +```bash +# Remove backups older than 30 days (default) +ansible-playbook playbooks/backup.yml --tags cleanup + +# Custom retention period (keep 90 days) +ansible-playbook playbooks/backup.yml --tags cleanup \ + --extra-vars "backup_retention_days=90" +``` + +## Remote Backup Transfer + +```bash +# Transfer to remote backup server +ansible-playbook playbooks/backup.yml --tags remote \ + --extra-vars "backup_remote_dir=/mnt/backup-server/ansible" +``` + +## Scheduling Backups + +### Cron Example + +```bash +# Daily backup at 2 AM +0 2 * * * cd /opt/ansible && ansible-playbook playbooks/backup.yml + +# Weekly full backup on Sunday at 3 AM (crontab entries must stay on one line) +0 3 * * 0 cd /opt/ansible && ansible-playbook playbooks/backup.yml --extra-vars "backup_type=full" +``` + +### SystemD Timer + +```ini +# /etc/systemd/system/ansible-backup.timer +[Unit] +Description=Ansible Backup + 
+[Timer] +# Run every day at 02:00 +OnCalendar=*-*-* 02:00:00 +Persistent=true + +[Install] +WantedBy=timers.target +``` + +## Example Output + +``` +========================================= +Backup Summary +========================================= +Host: webserver01 +Environment: production +Completed: 2025-01-11T02:30:00Z + +=== Backup Details === +Type: full +Files created: 12 +Total size: 2.5G +Location: /var/backups + +=== Retention === +Retention period: 30 days +Old backups cleaned: 5 + +=== Verification === +Integrity check: Passed + +Manifest: /var/backups/backup_manifest_2025-01-11_0230.txt +========================================= +``` + +## Troubleshooting + +### Insufficient disk space + +Check available space: +```bash +ansible all -m shell -a "df -h /var/backups" +``` + +Clean old backups: +```bash +ansible-playbook playbooks/backup.yml --tags cleanup +``` + +### Database backup fails + +Check that the dump tools are available: +```bash +# MySQL +ansible all -m shell -a "mysqldump --version" + +# PostgreSQL +ansible all -m shell -a "sudo -u postgres pg_dumpall --version" +``` + +### Backup integrity check fails + +Manually verify: +```bash +ansible all -m shell -a "gzip -t /var/backups/config/*.gz" +``` + +## Restore from Backup + +See [Disaster Recovery Playbook](disaster_recovery.md) for restoration procedures. + +```bash +# Quick restore example +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +## Best Practices + +1. **Test restores regularly** - Backups are useless if they can't be restored +2. **Monitor backup sizes** - Watch for unexpected growth +3. **Use remote storage** - Don't keep backups only on the same host +4. **Verify backups** - Always enable verification +5. **Document retention** - Follow compliance requirements +6. **Encrypt sensitive backups** - Use encryption for databases +7. 
**Schedule appropriately** - Run during low-activity periods + +## Quick Reference Commands + +```bash +# Full backup with verification +ansible-playbook playbooks/backup.yml \ + --extra-vars "backup_type=full" + +# Configuration only +ansible-playbook playbooks/backup.yml --tags config + +# Databases only +ansible-playbook playbooks/backup.yml --tags databases + +# Cleanup old backups (30+ days) +ansible-playbook playbooks/backup.yml --tags cleanup + +# Custom retention (90 days) +ansible-playbook playbooks/backup.yml --tags cleanup \ + --extra-vars "backup_retention_days=90" + +# Dry-run +ansible-playbook playbooks/backup.yml --check + +# Specific host only +ansible-playbook playbooks/backup.yml --limit hostname + +# Production environment +ansible-playbook -i inventories/production playbooks/backup.yml +``` + +## See Also + +- [Backup Playbook](../../playbooks/backup.yml) +- [Disaster Recovery Playbook](../../playbooks/disaster_recovery.yml) +- [Maintenance Playbook](../../playbooks/maintenance.yml) diff --git a/cheatsheets/playbooks/disaster_recovery.md b/cheatsheets/playbooks/disaster_recovery.md new file mode 100644 index 0000000..6b7e1b3 --- /dev/null +++ b/cheatsheets/playbooks/disaster_recovery.md @@ -0,0 +1,366 @@ +# Disaster Recovery Playbook Cheatsheet + +Quick reference for using the disaster recovery playbook. + +## ⚠️ WARNING + +This playbook performs **DESTRUCTIVE OPERATIONS**. Only use when recovering from a disaster or system failure. + +## Quick Start + +```bash +# Assess damage only (safe) +ansible-playbook playbooks/disaster_recovery.yml --limit failed_host --tags assess + +# Full recovery +ansible-playbook playbooks/disaster_recovery.yml --limit failed_host \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +## Prerequisites + +1. **Backups available** - Ensure backups exist in `/var/backups/` +2. **System accessible** - Host must be reachable via SSH +3. 
**Confirmation ready** - You'll need to type "RECOVER" to proceed + +## Common Usage + +### Assessment Phase (Safe) + +```bash +# Assess system damage without making changes +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --tags assess + +# Multiple hosts +ansible-playbook playbooks/disaster_recovery.yml \ + --limit "host1,host2,host3" \ + --tags assess +``` + +### Configuration Recovery + +```bash +# Restore configuration files only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --tags restore_config \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +### Data Recovery + +```bash +# Restore application data only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --tags restore_data \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +### Full Recovery + +```bash +# Complete system recovery +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +## Available Tags + +| Tag | Description | Destructive? | +|-----|-------------|--------------| +| `assess` | Assess system state | No ✅ | +| `prepare` | Prepare for recovery | Yes ⚠️ | +| `restore_config` | Restore configuration | Yes ⚠️ | +| `restore_data` | Restore data | Yes ⚠️ | +| `services` | Restart services | No ✅ | +| `verify` | Verify restoration | No ✅ | + +## Extra Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `dr_backup_date` | `latest` | Backup date to restore (format: YYYY-MM-DD) | +| `dr_verify_only` | `false` | Assessment mode only (no changes) | + +## Recovery Phases + +### 1. Assessment + +```bash +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags assess +``` + +**Checks:** +- System accessibility +- Filesystem status +- Service status +- System errors + +### 2. 
Preparation + +```bash +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags prepare +``` + +**Actions:** +- Stops non-critical services +- Creates pre-recovery backup +- Syncs filesystems + +### 3. Restoration + +```bash +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags restore_config,restore_data +``` + +**Restores:** +- System configuration (/etc) +- SSH configuration +- Application data +- Database dumps + +### 4. Service Restart + +```bash +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags services +``` + +**Restarts:** +- SSH daemon +- Time synchronization +- Auditd +- Firewall + +### 5. Verification + +```bash +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags verify +``` + +**Verifies:** +- SSH connectivity +- Critical services running +- Filesystem integrity +- NTP synchronization + +## Recovery Scenarios + +### Scenario 1: Configuration Corruption + +```bash +# Restore only configuration files +ansible-playbook playbooks/disaster_recovery.yml \ + --limit webserver01 \ + --tags assess,restore_config,verify \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +### Scenario 2: Failed System Upgrade + +```bash +# Full recovery from pre-upgrade backup +ansible-playbook playbooks/disaster_recovery.yml \ + --limit dbserver01 \ + --extra-vars "dr_backup_date=2025-01-10" +``` + +### Scenario 3: Data Loss + +```bash +# Restore application data only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit appserver01 \ + --tags restore_data \ + --extra-vars "dr_backup_date=latest" +``` + +### Scenario 4: Complete System Failure + +```bash +# 1. Rebuild OS (manual or automated provisioning) +# 2. Ensure SSH access works +# 3. 
Run full recovery +ansible-playbook playbooks/disaster_recovery.yml \ + --limit new_replacement_host \ + --extra-vars "dr_backup_date=2025-01-11" +``` + +## Finding Available Backups + +```bash +# List all available backups for a host +ansible failed_host -m shell -a "ls -lh /var/backups/config/" + +# Check backup dates +ansible failed_host -m shell -a "ls /var/backups/backup_manifest_*.txt" + +# View backup manifest +ansible failed_host -m shell -a "cat /var/backups/backup_manifest_2025-01-11_0230.txt" +``` + +## Logs and Reports + +Recovery logs: `./logs/disaster_recovery//_recovery.log` + +## Example Output + +``` +========================================= +!! DISASTER RECOVERY MODE !! +========================================= +Host: webserver01 +Environment: production +Timestamp: 2025-01-11T10:00:00Z +Backup Date: 2025-01-11 + +WARNING: This playbook performs destructive operations! +========================================= + +[Pause for confirmation - type 'RECOVER'] + +=== System Assessment === +OS: Ubuntu 22.04 +Uptime: 2 hours +Filesystems: OK + +=== Restoration Status === +Configuration restored: Yes +Data restored: Yes +Services restarted: Yes + +=== Service Status === +SSH: Running +Firewall: Running +NTP: Synchronized + +=== Next Steps === +1. Verify application-specific services +2. Test application functionality +3. Monitor system logs for errors +4. Update documentation +5. Conduct post-recovery review +========================================= +``` + +## Troubleshooting + +### Backup not found + +```bash +# Check backup location +ansible failed_host -m shell -a "ls -la /var/backups/" + +# Restore from remote backup server +ansible failed_host -m synchronize \ + -a "src=/mnt/backup-server/backups/ dest=/var/backups/ mode=pull" +``` + +### SSH connection lost during recovery + +The SSH service restart is designed to maintain connections. 
If lost: + +```bash +# Wait 60 seconds for SSH to restart +# Retry connection + +ansible failed_host -m ping +``` + +### Service won't start after recovery + +```bash +# Check service status +ansible failed_host -m shell -a "systemctl status service_name" + +# Check service logs +ansible failed_host -m shell -a "journalctl -u service_name -n 50" +``` + +### SELinux blocking services + +```bash +# Relabel SELinux contexts +ansible failed_host -m shell -a "restorecon -R /etc /var" +``` + +## Post-Recovery Checklist + +- [ ] Verify all services running +- [ ] Test application functionality +- [ ] Check disk space +- [ ] Review system logs +- [ ] Verify backups are current +- [ ] Update documentation +- [ ] Notify stakeholders +- [ ] Conduct lessons learned review + +## Best Practices + +1. **Test recovery procedures regularly** - Monthly DR drills +2. **Document recovery time objectives (RTO)** - Know your targets +3. **Keep backups off-site** - Don't rely on local backups only +4. **Verify backup integrity** - Test restores before disasters +5. **Maintain runbooks** - Document specific recovery procedures +6. **Practice on staging** - Test recovery in non-production first +7. 
**Have communication plan** - Know who to notify + +## Quick Reference Commands + +```bash +# Assess damage only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host --tags assess + +# Full recovery with latest backup +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host + +# Specific backup date +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --extra-vars "dr_backup_date=2025-01-11" + +# Configuration only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags restore_config + +# Verify recovery +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --tags verify + +# Assessment mode (no changes) +ansible-playbook playbooks/disaster_recovery.yml \ + --limit host \ + --extra-vars "dr_verify_only=true" +``` + +## Emergency Contacts + +Keep this information updated: + +- Infrastructure Team Lead: [Contact] +- On-Call Engineer: [Contact] +- Backup System Admin: [Contact] +- Management Escalation: [Contact] + +## See Also + +- [Disaster Recovery Playbook](../../playbooks/disaster_recovery.yml) +- [Backup Playbook](../../playbooks/backup.yml) +- [Disaster Recovery Runbook](../../docs/runbooks/disaster-recovery.md) diff --git a/cheatsheets/playbooks/gather_system_info.md b/cheatsheets/playbooks/gather_system_info.md new file mode 100644 index 0000000..59fd60f --- /dev/null +++ b/cheatsheets/playbooks/gather_system_info.md @@ -0,0 +1,499 @@ +# Gather System Info Playbook Cheatsheet + +Quick reference for using the gather_system_info.yml playbook to collect comprehensive system information across infrastructure. 
+ +## Quick Start + +```bash +# Gather information from all hosts +ansible-playbook playbooks/gather_system_info.yml + +# Specific environment +ansible-playbook -i inventories/production playbooks/gather_system_info.yml + +# Specific host group +ansible-playbook playbooks/gather_system_info.yml --limit webservers +``` + +## Common Usage + +### Basic Execution + +```bash +# All hosts in inventory +ansible-playbook playbooks/gather_system_info.yml + +# Single host +ansible-playbook playbooks/gather_system_info.yml --limit server01.example.com + +# Specific group +ansible-playbook playbooks/gather_system_info.yml --limit databases + +# Check mode (dry-run) +ansible-playbook playbooks/gather_system_info.yml --check +``` + +### Selective Information Gathering + +```bash +# CPU information only +ansible-playbook playbooks/gather_system_info.yml --tags cpu + +# Memory and disk only +ansible-playbook playbooks/gather_system_info.yml --tags memory,disk + +# Hypervisor detection only +ansible-playbook playbooks/gather_system_info.yml --tags hypervisor + +# Skip installation of packages +ansible-playbook playbooks/gather_system_info.yml --skip-tags install + +# Validation and health checks only +ansible-playbook playbooks/gather_system_info.yml --tags validate,health-check +``` + +## Available Tags + +| Tag | Description | +|-----|-------------| +| `system_info` | Main role tag (automatically included) | +| `install` | Install required packages | +| `gather` | All information gathering tasks | +| `system` | OS and system information | +| `cpu` | CPU details and capabilities | +| `gpu` | GPU detection and details | +| `memory` | RAM and swap information | +| `disk` | Storage, LVM, and RAID information | +| `network` | Network interfaces and configuration | +| `hypervisor` | Virtualization platform detection | +| `export` | Export statistics to JSON | +| `statistics` | Statistics aggregation | +| `validate` | Validation checks | +| `health-check` | System health monitoring | +| 
`security` | Security-related information | + +## Playbook Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `system_info_stats_base_dir` | `./stats/machines` | Base directory for output | +| `system_info_gather_cpu` | `true` | Gather CPU information | +| `system_info_gather_gpu` | `true` | Gather GPU information | +| `system_info_gather_memory` | `true` | Gather memory information | +| `system_info_gather_disk` | `true` | Gather disk information | +| `system_info_gather_network` | `true` | Gather network information | +| `system_info_detect_hypervisor` | `true` | Detect hypervisor capabilities | + +## Output Files + +### Default Location + +``` +./stats/machines// +├── system_info.json # Latest statistics +├── system_info_.json # Timestamped backup +└── summary.txt # Human-readable summary +``` + +### View Statistics + +```bash +# View JSON (pretty-printed) +jq . ./stats/machines/server01.example.com/system_info.json + +# View human-readable summary +cat ./stats/machines/server01.example.com/summary.txt + +# List all hosts with stats +ls -1 ./stats/machines/ + +# Count total hosts +ls -1d ./stats/machines/*/ | wc -l +``` + +## Example Invocations + +### Basic Examples + +```bash +# Production inventory +ansible-playbook -i inventories/production playbooks/gather_system_info.yml + +# Staging inventory +ansible-playbook -i inventories/staging playbooks/gather_system_info.yml + +# Custom output directory +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_stats_base_dir=/var/lib/ansible/inventory" +``` + +### Advanced Examples + +```bash +# Hypervisors only with full gathering +ansible-playbook playbooks/gather_system_info.yml \ + --limit hypervisors \ + -e "system_info_detect_hypervisor=true" + +# Quick scan (minimal gathering) +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_gather_network=false" \ + -e "system_info_gather_gpu=false" \ + --skip-tags install + +# Parallel execution (10 
hosts at a time) +ansible-playbook playbooks/gather_system_info.yml -f 10 + +# With increased verbosity +ansible-playbook playbooks/gather_system_info.yml -v +``` + +## Data Queries + +### Using jq for Data Extraction + +```bash +# Get CPU models across all hosts +jq -r '.cpu.model' ./stats/machines/*/system_info.json + +# Get memory usage +jq -r '"\(.host_info.fqdn): \(.memory.usage_percent)%"' \ + ./stats/machines/*/system_info.json + +# Find hypervisors +jq -r 'select(.hypervisor.is_hypervisor == true) | .host_info.fqdn' \ + ./stats/machines/*/system_info.json + +# Find virtual machines +jq -r 'select(.hypervisor.is_virtual == true) | .host_info.fqdn' \ + ./stats/machines/*/system_info.json + +# Get OS distribution +jq -r '"\(.host_info.fqdn): \(.system.distribution) \(.system.distribution_version)"' \ + ./stats/machines/*/system_info.json + +# Find hosts with high CPU count +jq -r 'select(.cpu.count.vcpus > 8) | "\(.host_info.fqdn): \(.cpu.count.vcpus) vCPUs"' \ + ./stats/machines/*/system_info.json + +# Find hosts with low disk space +jq -r 'select(.disk.usage_percent > 80) | "\(.host_info.fqdn): \(.disk.usage_percent)%"' \ + ./stats/machines/*/system_info.json +``` + +### Generate Reports + +```bash +# CSV export: Hostname, OS, CPU, Memory (-s slurps all files so the header prints once) +jq -rs '["FQDN","OS","CPU Cores","Memory GB"], + (.[] | [.host_info.fqdn, .system.distribution, + .cpu.count.vcpus, (.memory.total_mb/1024|round)]) | @csv' \ + ./stats/machines/*/system_info.json > infrastructure_report.csv + +# Count CPUs across infrastructure +jq -s 'map(.cpu.count.total_cores | tonumber) | add' \ + ./stats/machines/*/system_info.json + +# Total memory across infrastructure (GB) +jq -s 'map(.memory.total_mb | tonumber) | add / 1024 | round' \ + ./stats/machines/*/system_info.json + +# List GPU-enabled hosts +jq -r 'select(.gpu.detected == true) | "\(.host_info.fqdn): \(.gpu.devices[0].model)"' \ + ./stats/machines/*/system_info.json + +# SELinux status report +jq -r '"\(.host_info.fqdn): SELinux 
\(.security.selinux)"' \ + ./stats/machines/*/system_info.json | grep -v "N/A" + +# AppArmor status report +jq -r '"\(.host_info.fqdn): AppArmor \(.security.apparmor)"' \ + ./stats/machines/*/system_info.json | grep -v "N/A" +``` + +## Integration Examples + +### Cron Job for Regular Collection + +```bash +# Daily collection at 2 AM (crontab entries must stay on one line) +0 2 * * * cd /opt/ansible && ansible-playbook playbooks/gather_system_info.yml >> /var/log/ansible/gather_system_info.log 2>&1 +``` + +### SystemD Timer + +```ini +# /etc/systemd/system/ansible-gather-system-info.timer +[Unit] +Description=Gather System Information Daily + +[Timer] +OnCalendar=daily +Persistent=true + +[Install] +WantedBy=timers.target +``` + +```ini +# /etc/systemd/system/ansible-gather-system-info.service +[Unit] +Description=Ansible Gather System Information + +[Service] +Type=oneshot +WorkingDirectory=/opt/ansible +ExecStart=/usr/bin/ansible-playbook playbooks/gather_system_info.yml +User=ansible +StandardOutput=append:/var/log/ansible/gather_system_info.log +StandardError=append:/var/log/ansible/gather_system_info.log +``` + +### CMDB Integration + +```bash +# Export to NetBox or other CMDB +for host_dir in ./stats/machines/*/; do + host=$(basename "$host_dir") + curl -X POST https://netbox.example.com/api/dcim/devices/ \ + -H "Authorization: Token $NETBOX_TOKEN" \ + -H "Content-Type: application/json" \ + -d @"${host_dir}/system_info.json" +done +``` + +### Monitoring Integration + +```bash +# Create Prometheus metrics +for stats_file in ./stats/machines/*/system_info.json; do + host=$(jq -r '.host_info.fqdn' "$stats_file") + cpu=$(jq -r '.cpu.count.vcpus' "$stats_file") + mem=$(jq -r '.memory.total_mb' "$stats_file") + + cat <<EOF > /var/lib/node_exporter/textfile_collector/${host}.prom +# HELP system_info_cpu_count Number of CPU cores +# TYPE system_info_cpu_count gauge +system_info_cpu_count{host="$host"} $cpu + +# HELP system_info_memory_mb Total memory in MB +# TYPE system_info_memory_mb gauge 
+system_info_memory_mb{host="$host"} $mem +EOF +done +``` + +## Troubleshooting + +### Check Playbook Execution + +```bash +# Dry-run (check mode) +ansible-playbook playbooks/gather_system_info.yml --check + +# Verbose output +ansible-playbook playbooks/gather_system_info.yml -v + +# Very verbose (debug) +ansible-playbook playbooks/gather_system_info.yml -vvv + +# Single host debugging +ansible-playbook playbooks/gather_system_info.yml \ + --limit problematic-host -vvv +``` + +### Common Issues + +**Missing packages** +```bash +# Install packages manually first +ansible all -m package -a "name=lshw,dmidecode,pciutils state=present" --become + +# Or run with install tag only +ansible-playbook playbooks/gather_system_info.yml --tags install +``` + +**Permission errors** +```bash +# Ensure become is enabled +ansible-playbook playbooks/gather_system_info.yml --become + +# Check sudo access +ansible all -m ping --become +``` + +**Statistics not saved** +```bash +# Check if directory exists +ls -la ./stats/machines/ + +# Check disk space +df -h . + +# Create directory manually +mkdir -p ./stats/machines + +# Specify alternative directory +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_stats_base_dir=/tmp/stats" +``` + +**Slow execution** +```bash +# Skip slow operations +ansible-playbook playbooks/gather_system_info.yml \ + --skip-tags install,network + +# Disable GPU gathering +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_gather_gpu=false" + +# Increase parallelism +ansible-playbook playbooks/gather_system_info.yml -f 20 +``` + +### Validation + +```bash +# Verify JSON files are valid +for f in ./stats/machines/*/system_info.json; do + echo "Checking $f" + jq empty "$f" && echo "✓ OK" || echo "✗ INVALID" +done + +# Check for missing files +for host in $(ansible all --list-hosts | tail -n +2); do + if [ ! 
-f "./stats/machines/${host}/system_info.json" ]; then + echo "Missing: $host" + fi +done + +# Verify data completeness +jq -r 'if .cpu == null then "Missing CPU data" else "OK" end' \ + ./stats/machines/*/system_info.json +``` + +## Performance Optimization + +### Parallel Execution + +```bash +# Default (5 hosts at a time) +ansible-playbook playbooks/gather_system_info.yml + +# Increase parallelism +ansible-playbook playbooks/gather_system_info.yml -f 20 + +# Serial execution (one at a time) +ansible-playbook playbooks/gather_system_info.yml -f 1 +``` + +### Skip Slow Tasks + +```bash +# Skip package installation +ansible-playbook playbooks/gather_system_info.yml --skip-tags install + +# Skip network gathering +ansible-playbook playbooks/gather_system_info.yml --skip-tags network + +# Minimal gathering +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_gather_gpu=false" \ + -e "system_info_gather_network=false" \ + -e "system_info_detect_hypervisor=false" +``` + +### Fact Caching + +Enable in ansible.cfg: +```ini +[defaults] +fact_caching = jsonfile +fact_caching_connection = /tmp/ansible_facts +fact_caching_timeout = 3600 +``` + +## Use Cases + +### Infrastructure Audit + +```bash +# Collect from all environments +for env in production staging development; do + ansible-playbook -i inventories/$env playbooks/gather_system_info.yml +done + +# Generate comprehensive report +./scripts/generate_infrastructure_report.sh +``` + +### Capacity Planning + +```bash +# Gather current utilization +ansible-playbook playbooks/gather_system_info.yml --tags validate,health-check + +# Analyze resource usage +jq -r '"\(.host_info.fqdn),\(.cpu.load_average.one_min),\(.memory.usage_percent),\(.disk.usage_percent)"' \ + ./stats/machines/*/system_info.json | column -t -s, +``` + +### Compliance Reporting + +```bash +# Security compliance check +ansible-playbook playbooks/gather_system_info.yml --tags security + +# Generate compliance report +jq -r 
'"\(.host_info.fqdn),\(.security.selinux),\(.security.apparmor)"' \ + ./stats/machines/*/system_info.json > compliance_report.csv +``` + +### License Auditing + +```bash +# Count CPU cores for licensing +ansible-playbook playbooks/gather_system_info.yml --tags cpu + +# Total cores +jq -s 'map(.cpu.count.total_cores | tonumber) | add' \ + ./stats/machines/*/system_info.json +``` + +## Quick Reference Commands + +```bash +# Standard execution +ansible-playbook playbooks/gather_system_info.yml + +# Specific hosts +ansible-playbook playbooks/gather_system_info.yml --limit webservers + +# Specific tags +ansible-playbook playbooks/gather_system_info.yml --tags cpu,memory + +# Custom output directory +ansible-playbook playbooks/gather_system_info.yml \ + -e "system_info_stats_base_dir=/custom/path" + +# View latest stats +cat ./stats/machines/$(hostname -f)/summary.txt + +# Query all hosts +jq . ./stats/machines/*/system_info.json | less +``` + +## See Also + +- [System Info Role README](../../roles/system_info/README.md) +- [System Info Role Documentation](../../docs/roles/system_info.md) +- [System Info Role Cheatsheet](../roles/system_info.md) +- [Role Index](../../docs/roles/role-index.md) + +--- + +**Playbook**: gather_system_info.yml +**Updated**: 2025-11-11 +**Related Role**: system_info v1.0.0 diff --git a/cheatsheets/playbooks/maintenance.md b/cheatsheets/playbooks/maintenance.md new file mode 100644 index 0000000..c26b2c2 --- /dev/null +++ b/cheatsheets/playbooks/maintenance.md @@ -0,0 +1,268 @@ +# System Maintenance Playbook Cheatsheet + +Quick reference for using the system maintenance playbook. 
+ +## Quick Start + +```bash +# Run maintenance on all hosts +ansible-playbook playbooks/maintenance.yml + +# Maintenance on specific environment +ansible-playbook -i inventories/staging playbooks/maintenance.yml + +# Check mode (dry-run) +ansible-playbook playbooks/maintenance.yml --check +``` + +## Common Usage + +### Security Updates Only (Default) + +```bash +# Update all hosts with security patches +ansible-playbook playbooks/maintenance.yml + +# Specific environment +ansible-playbook -i inventories/production playbooks/maintenance.yml + +# Specific host group +ansible-playbook playbooks/maintenance.yml --limit webservers +``` + +### Full System Upgrade + +```bash +# CAUTION: Full upgrade including non-security updates +ansible-playbook playbooks/maintenance.yml \ + --tags updates \ + --extra-vars "maintenance_security_only=false" +``` + +### Selective Maintenance + +```bash +# Package updates only +ansible-playbook playbooks/maintenance.yml --tags updates + +# Cleanup only (no updates) +ansible-playbook playbooks/maintenance.yml --tags cleanup + +# System optimization only +ansible-playbook playbooks/maintenance.yml --tags optimize + +# Verification only +ansible-playbook playbooks/maintenance.yml --tags verify +``` + +## Available Tags + +| Tag | Description | +|-----|-------------| +| `updates` | Package updates (security only by default) | +| `cleanup` | Disk cleanup and log rotation | +| `optimize` | System optimization | +| `verify` | Post-maintenance verification | +| `reboot` | System reboot (requires --tags reboot) | + +## Extra Variables + +| Variable | Default | Description | +|----------|---------|-------------| +| `maintenance_security_only` | `true` | Only install security updates | +| `maintenance_autoremove` | `true` | Remove unused packages | +| `maintenance_serial` | `100%` | Parallelism control | + +## Maintenance Tasks + +### Package Updates +- ✅ Security updates (Debian/Ubuntu) +- ✅ Security updates (RHEL family) +- ✅ Auto-remove unused 
packages +- ✅ Clean package cache + +### Cleanup Tasks +- ✅ Force log rotation +- ✅ Find old log files (30+ days) +- ✅ Clean /tmp directory (10+ days) +- ✅ Clean /var/tmp (30+ days) +- ✅ Vacuum systemd journal (30 days) +- ✅ Docker cleanup (if installed) +- ✅ Podman cleanup (if installed) + +### Optimization +- ✅ Update locate database +- ✅ Sync filesystem caches + +### Verification +- ✅ Check disk usage +- ✅ Check memory usage +- ✅ Verify critical services +- ✅ Check if reboot required + +## Reboot Management + +### Check Reboot Status + +```bash +# Run maintenance and check reboot status +ansible-playbook playbooks/maintenance.yml + +# Look for: "Reboot required: true" in output +``` + +### Perform Reboot + +```bash +# WARNING: This will reboot hosts one at a time! +ansible-playbook playbooks/maintenance.yml --tags reboot + +# Reboot specific environment +ansible-playbook -i inventories/staging playbooks/maintenance.yml --tags reboot + +# Control reboot parallelism +ansible-playbook playbooks/maintenance.yml --tags reboot \ + --extra-vars "maintenance_serial=1" +``` + +## Serial Execution + +Control how many hosts are updated simultaneously: + +```bash +# Update all hosts in parallel (default) +ansible-playbook playbooks/maintenance.yml + +# Update one host at a time +ansible-playbook playbooks/maintenance.yml \ + --extra-vars "maintenance_serial=1" + +# Update 25% of hosts at a time +ansible-playbook playbooks/maintenance.yml \ + --extra-vars "maintenance_serial=25%" +``` + +## Output and Logs + +Logs saved to: `./logs/maintenance//_maintenance.log` + +## Example Output + +``` +========================================= +Maintenance Summary +========================================= +Host: webserver01 +Environment: production +Completed: 2025-01-11T10:30:00Z + +=== Updates === +Packages updated: true + +=== Cleanup === +Old logs found: 42 +Journal cleaned: Yes + +=== System State === +Disk usage after: /dev/sda1 50G 25G 25G 50% / + +=== Reboot Status === +Reboot 
required: false +========================================= +``` + +## Troubleshooting + +### Package updates fail + +Check update repositories: +```bash +# Debian/Ubuntu +ansible all -m shell -a "apt update" + +# RHEL/CentOS (dnf exits with code 100 when updates are available, which Ansible reports as a failure) +ansible all -m shell -a "dnf check-update" +``` + +### Disk space warnings + +Free up space by running the cleanup tasks before maintenance: +```bash +ansible-playbook playbooks/maintenance.yml --tags cleanup +``` + +### Service not running after update + +Check service status: +```bash +ansible all -m shell -a "systemctl status " +``` + +## Scheduling Maintenance + +### Cron Example + +```bash +# Daily security updates at 2 AM +0 2 * * * cd /opt/ansible && ansible-playbook playbooks/maintenance.yml +``` + +### SystemD Timer Example + +```ini +# /etc/systemd/system/ansible-maintenance.timer +[Unit] +Description=Ansible Maintenance + +[Timer] +OnCalendar=daily +Persistent=true + +[Install] +WantedBy=timers.target +``` + +## Best Practices + +1. **Test in staging first** - Always run in staging before production +2. **Monitor during updates** - Watch for failures +3. **Check reboot requirements** - Plan reboots during maintenance windows +4. **Review logs** - Check maintenance logs for issues +5. **Use serial execution in production** - Update hosts gradually +6.
**Schedule appropriately** - Run during low-traffic periods + +## Quick Reference Commands + +```bash +# Dry-run (no changes) +ansible-playbook playbooks/maintenance.yml --check + +# Staging environment +ansible-playbook -i inventories/staging playbooks/maintenance.yml + +# Production (one host at a time) +ansible-playbook -i inventories/production playbooks/maintenance.yml \ + --extra-vars "maintenance_serial=1" + +# Updates only, no cleanup +ansible-playbook playbooks/maintenance.yml --tags updates + +# Full upgrade (non-security too) +ansible-playbook playbooks/maintenance.yml \ + --extra-vars "maintenance_security_only=false" + +# Cleanup only +ansible-playbook playbooks/maintenance.yml --tags cleanup + +# Check if reboot needed +ansible-playbook playbooks/maintenance.yml --tags verify + +# Reboot if needed +ansible-playbook playbooks/maintenance.yml --tags reboot +``` + +## See Also + +- [Maintenance Playbook](../../playbooks/maintenance.yml) +- [Backup Playbook](../../playbooks/backup.yml) +- [CLAUDE.md Guidelines](../../CLAUDE.md) diff --git a/cheatsheets/playbooks/security_audit.md b/cheatsheets/playbooks/security_audit.md new file mode 100644 index 0000000..aae0f81 --- /dev/null +++ b/cheatsheets/playbooks/security_audit.md @@ -0,0 +1,214 @@ +# Security Audit Playbook Cheatsheet + +Quick reference for using the security audit playbook. 
+ +## Quick Start + +```bash +# Run full security audit on all hosts +ansible-playbook playbooks/security_audit.yml + +# Audit specific environment +ansible-playbook -i inventories/production playbooks/security_audit.yml + +# Audit specific host +ansible-playbook playbooks/security_audit.yml --limit hostname +``` + +## Common Usage + +### Full Audit + +```bash +# Complete security audit with all checks +ansible-playbook playbooks/security_audit.yml + +# Production environment only +ansible-playbook -i inventories/production playbooks/security_audit.yml +``` + +### Selective Audits + +```bash +# SELinux and AppArmor only +ansible-playbook playbooks/security_audit.yml --tags selinux,apparmor + +# Firewall configuration audit +ansible-playbook playbooks/security_audit.yml --tags firewall + +# SSH security audit +ansible-playbook playbooks/security_audit.yml --tags ssh + +# User and permission audit +ansible-playbook playbooks/security_audit.yml --tags users + +# Network security audit +ansible-playbook playbooks/security_audit.yml --tags network + +# Compliance checks only +ansible-playbook playbooks/security_audit.yml --tags compliance +``` + +## Available Tags + +| Tag | Description | +|-----|-------------| +| `audit` | All audit tasks | +| `selinux` | SELinux status and configuration | +| `apparmor` | AppArmor status and profiles | +| `firewall` | Firewall configuration | +| `ssh` | SSH hardening checks | +| `packages` | Package and update audits | +| `users` | User and permission audits | +| `network` | Network security checks | +| `compliance` | Compliance verification | +| `report` | Generate audit reports | + +## What Gets Audited + +### Security Modules +- ✅ SELinux status (RHEL family) +- ✅ AppArmor status (Debian family) +- ✅ SELinux denials count +- ✅ AppArmor violations + +### Firewall +- ✅ Firewalld status (RHEL) +- ✅ UFW status (Debian) +- ✅ Firewall rules configuration +- ✅ Default policies + +### SSH Configuration +- ✅ Root login disabled +- ✅ Password 
authentication disabled +- ✅ GSSAPI authentication disabled +- ✅ Maximum authentication attempts + +### Package Management +- ✅ Available security updates +- ✅ Automatic updates enabled +- ✅ Update schedule + +### Users and Permissions +- ✅ Users with UID 0 (should be root only) +- ✅ Users with empty passwords +- ✅ Sudoers configuration +- ✅ World-writable files + +### Network Security +- ✅ Listening ports +- ✅ Promiscuous interfaces +- ✅ IP forwarding status + +### Audit and Monitoring +- ✅ Auditd service status +- ✅ Audit log size +- ✅ AIDE installation and database + +### Compliance +- ✅ Timezone configuration (UTC) +- ✅ NTP synchronization +- ✅ Kernel security parameters + +## Output and Reports + +Reports saved to: `./reports/security_audit//_audit_report.txt` + +## Example Output + +``` +========================================= +Security Audit Summary +========================================= +Host: webserver01 +Environment: production + +=== Security Modules === +SELinux: Enforcing + +=== Firewall === +Firewalld: Active + +=== SSH Security === +Root Login: Disabled +Password Auth: Disabled + +=== Updates === +Critical/Important updates: 0 + +=== Users === +UID 0 users: root + +=== Audit Logging === +Auditd: Active +AIDE: Installed +========================================= +``` + +## Troubleshooting + +### No audit reports generated + +Check report directory exists: +```bash +ls -la ./reports/security_audit/ +``` + +### Failed checks + +Review specific failed checks: +```bash +ansible-playbook playbooks/security_audit.yml -vv +``` + +### Permission denied + +Ensure become is enabled: +```bash +ansible-playbook playbooks/security_audit.yml --become +``` + +## Integration with CI/CD + +```yaml +# GitLab CI example +security_audit: + stage: compliance + script: + - ansible-playbook playbooks/security_audit.yml + only: + - schedules +``` + +## Best Practices + +1. **Schedule regular audits** - Run weekly or after changes +2. 
**Review reports** - Don't just run audits, act on findings +3. **Track trends** - Compare audit results over time +4. **Document exceptions** - Note why certain checks fail +5. **Remediate findings** - Create tasks to fix issues + +## Quick Reference Commands + +```bash +# Dry-run audit +ansible-playbook playbooks/security_audit.yml --check + +# Verbose output +ansible-playbook playbooks/security_audit.yml -vvv + +# Specific environment +ansible-playbook -i inventories/production playbooks/security_audit.yml + +# Multiple tags +ansible-playbook playbooks/security_audit.yml --tags "selinux,firewall,ssh" + +# Skip specific checks +ansible-playbook playbooks/security_audit.yml --skip-tags packages +``` + +## See Also + +- [Security Audit Playbook](../../playbooks/security_audit.yml) +- [CLAUDE.md Security Guidelines](../../CLAUDE.md) +- [Vault Management Guide](../../docs/security/vault-management.md) diff --git a/cheatsheets/roles/deploy_linux_vm.md b/cheatsheets/roles/deploy_linux_vm.md new file mode 100644 index 0000000..993c1f8 --- /dev/null +++ b/cheatsheets/roles/deploy_linux_vm.md @@ -0,0 +1,512 @@ +# Deploy Linux VM Role Cheatsheet + +Quick reference guide for the `deploy_linux_vm` role - automated Linux VM deployment on KVM hypervisors with LVM and security hardening. 
+ +## Quick Start + +```bash +# Deploy a VM with defaults (Debian 12) +ansible-playbook site.yml -t deploy_linux_vm + +# Deploy specific distribution +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=ubuntu-22.04" + +# Deploy with custom resources +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_name=webserver01" \ + -e "deploy_linux_vm_vcpus=4" \ + -e "deploy_linux_vm_memory_mb=8192" +``` + +## Common Execution Patterns + +### Basic Deployment + +```bash +# Single VM deployment +ansible-playbook -i inventories/production site.yml -t deploy_linux_vm + +# Deploy to specific hypervisor +ansible-playbook site.yml -l grokbox -t deploy_linux_vm + +# Check mode (dry-run validation) +ansible-playbook site.yml -t deploy_linux_vm --check +``` + +### Distribution-Specific Deployment + +```bash +# Debian family +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=debian-12" + +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=ubuntu-24.04" + +# RHEL family +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=almalinux-9" + +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=rocky-9" + +# SUSE family +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_os_distribution=opensuse-leap-15.6" +``` + +### Selective Execution with Tags + +```bash +# Pre-flight validation only +ansible-playbook site.yml -t deploy_linux_vm,validate,preflight + +# Download cloud images only +ansible-playbook site.yml -t deploy_linux_vm,download,verify + +# Deploy VM without LVM configuration +ansible-playbook site.yml -t deploy_linux_vm --skip-tags lvm + +# Configure LVM only (post-deployment) +ansible-playbook site.yml -t deploy_linux_vm,lvm,post-deploy + +# Cleanup temporary files only +ansible-playbook site.yml -t deploy_linux_vm,cleanup +``` + +## Available Tags + +| Tag | Description | 
+|-----|-------------| +| `deploy_linux_vm` | Main role tag (required) | +| `validate`, `preflight` | Pre-flight validation checks | +| `install` | Install required packages on hypervisor | +| `download`, `verify` | Download and verify cloud images | +| `storage` | Create VM disk storage | +| `cloud-init` | Generate cloud-init configuration | +| `deploy` | Deploy and start VM | +| `lvm`, `post-deploy` | Configure LVM on deployed VM | +| `cleanup` | Remove temporary files | + +## Common Variables + +### VM Configuration + +```yaml +# Basic VM settings +deploy_linux_vm_name: "webserver01" +deploy_linux_vm_hostname: "web01" +deploy_linux_vm_domain: "production.local" +deploy_linux_vm_os_distribution: "ubuntu-22.04" + +# Resource allocation +deploy_linux_vm_vcpus: 4 +deploy_linux_vm_memory_mb: 8192 +deploy_linux_vm_disk_size_gb: 50 +``` + +### LVM Configuration + +```yaml +# Enable/disable LVM +deploy_linux_vm_use_lvm: true + +# LVM volume group settings +deploy_linux_vm_lvm_vg_name: "vg_system" +deploy_linux_vm_lvm_pv_device: "/dev/vdb" + +# Custom logical volumes (override defaults) +# Quote comma-separated values: an unquoted comma ends the entry inside a { } flow mapping +deploy_linux_vm_lvm_volumes: + - { name: lv_opt, size: 5G, mount: /opt, fstype: ext4 } + - { name: lv_var, size: 10G, mount: /var, fstype: ext4 } + - { name: lv_tmp, size: 2G, mount: /tmp, fstype: ext4, mount_options: "noexec,nosuid,nodev" } +``` + +### Security Configuration + +```yaml +# Security hardening toggles +deploy_linux_vm_enable_firewall: true +deploy_linux_vm_enable_selinux: true # RHEL family +deploy_linux_vm_enable_apparmor: true # Debian family +deploy_linux_vm_enable_auditd: true +deploy_linux_vm_enable_automatic_updates: true +deploy_linux_vm_automatic_reboot: false # Don't auto-reboot + +# SSH hardening +deploy_linux_vm_ssh_permit_root_login: "no" +deploy_linux_vm_ssh_password_authentication: "no" +deploy_linux_vm_ssh_gssapi_authentication: "no" # GSSAPI disabled per requirements +``` + +### User Configuration + +```yaml +# Ansible service account
+deploy_linux_vm_ansible_user: "ansible" +deploy_linux_vm_ansible_user_ssh_key: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}" + +# Root password (console access only, SSH disabled) +deploy_linux_vm_root_password: "ChangeMe123!" +``` + +## Supported Distributions + +| Distribution | Version | OS Family | Identifier | +|--------------|---------|-----------|------------| +| Debian | 11, 12 | debian | `debian-11`, `debian-12` | +| Ubuntu LTS | 20.04, 22.04, 24.04 | debian | `ubuntu-20.04`, `ubuntu-22.04`, `ubuntu-24.04` | +| RHEL | 8, 9 | rhel | `rhel-8`, `rhel-9` | +| AlmaLinux | 8, 9 | rhel | `almalinux-8`, `almalinux-9` | +| Rocky Linux | 8, 9 | rhel | `rocky-8`, `rocky-9` | +| openSUSE Leap | 15.5, 15.6 | suse | `opensuse-leap-15.5`, `opensuse-leap-15.6` | + +## Example Playbooks + +### Single VM Deployment + +```yaml +--- +- name: Deploy Linux VM + hosts: grokbox + become: yes + roles: + - role: deploy_linux_vm + vars: + deploy_linux_vm_name: "web-server" + deploy_linux_vm_os_distribution: "ubuntu-22.04" +``` + +### Multi-VM Deployment + +```yaml +--- +- name: Deploy Multiple VMs + hosts: grokbox + become: yes + tasks: + - name: Deploy web servers + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "{{ item.name }}" + deploy_linux_vm_hostname: "{{ item.hostname }}" + deploy_linux_vm_os_distribution: "{{ item.distro }}" + loop: + - { name: "web01", hostname: "web01", distro: "ubuntu-22.04" } + - { name: "web02", hostname: "web02", distro: "ubuntu-22.04" } + - { name: "db01", hostname: "db01", distro: "almalinux-9" } +``` + +### Database Server with Custom Resources + +```yaml +--- +- name: Deploy Database Server + hosts: grokbox + become: yes + roles: + - role: deploy_linux_vm + vars: + deploy_linux_vm_name: "postgres01" + deploy_linux_vm_hostname: "postgres01" + deploy_linux_vm_domain: "production.local" + deploy_linux_vm_os_distribution: "almalinux-9" + deploy_linux_vm_vcpus: 8 + deploy_linux_vm_memory_mb: 16384 + deploy_linux_vm_disk_size_gb: 
100 + deploy_linux_vm_use_lvm: true +``` + +## Post-Deployment Verification + +### Check VM Status + +```bash +# List all VMs on hypervisor +ansible grokbox -m shell -a "virsh list --all" + +# Get VM information +ansible grokbox -m shell -a "virsh dominfo " + +# Get VM IP address +ansible grokbox -m shell -a "virsh domifaddr " +``` + +### Verify SSH Access + +```bash +# Test SSH connectivity +ssh ansible@ + +# Test with ProxyJump through hypervisor +ssh -J grokbox ansible@ +``` + +### Verify LVM Configuration + +```bash +# SSH to VM and check LVM +ssh ansible@ "sudo vgs && sudo lvs && sudo pvs" + +# Check fstab entries +ssh ansible@ "cat /etc/fstab" + +# Check disk layout +ssh ansible@ "lsblk" + +# Check mounted filesystems +ssh ansible@ "df -h" +``` + +### Verify Security Hardening + +```bash +# Check SSH configuration (GSSAPI settings) +ssh ansible@ "sudo sshd -T | grep -i gssapi" + +# Check firewall (Debian/Ubuntu) +ssh ansible@ "sudo ufw status verbose" + +# Check firewall (RHEL/AlmaLinux) +ssh ansible@ "sudo firewall-cmd --list-all" + +# Check SELinux status (RHEL family) +ssh ansible@ "sudo getenforce" + +# Check AppArmor status (Debian family) +ssh ansible@ "sudo aa-status" + +# Check auditd +ssh ansible@ "sudo systemctl status auditd" + +# Check automatic updates (Debian/Ubuntu) +ssh ansible@ "sudo systemctl status unattended-upgrades" + +# Check automatic updates (RHEL/AlmaLinux) +ssh ansible@ "sudo systemctl status dnf-automatic.timer" +``` + +## Troubleshooting + +### Check Cloud-Init Status + +```bash +# Wait for cloud-init to complete +ssh ansible@ "cloud-init status --wait" + +# View cloud-init logs +ssh ansible@ "tail -100 /var/log/cloud-init-output.log" + +# Check cloud-init errors +ssh ansible@ "cloud-init status --long" + +# Show per-stage boot timing +ssh ansible@ "cloud-init analyze show" +``` + +### VM Won't Start + +```bash +# Check VM status +ansible grokbox -m shell -a "virsh list --all" + +# View VM console logs +ansible grokbox -m shell -a "virsh console " + +# Check libvirt logs +ansible grokbox -m shell -a "tail -50
/var/log/libvirt/qemu/.log" +``` + +### LVM Issues + +```bash +# Check LVM status +ssh ansible@ "sudo pvs && sudo vgs && sudo lvs" + +# Check if second disk exists +ssh ansible@ "lsblk" + +# Manually trigger LVM setup (if post-deploy failed) +ansible-playbook site.yml -l grokbox -t deploy_linux_vm,lvm,post-deploy \ + -e "deploy_linux_vm_name=" +``` + +### Network Connectivity Issues + +```bash +# Check VM network interfaces +ssh ansible@ "ip addr show" + +# Check VM can reach internet +ssh ansible@ "ping -c 3 8.8.8.8" + +# Check DNS resolution +ssh ansible@ "nslookup google.com" + +# Check libvirt network +ansible grokbox -m shell -a "virsh net-list --all" +ansible grokbox -m shell -a "virsh net-dhcp-leases default" +``` + +### SSH Connection Refused + +```bash +# Check if sshd is running +ssh ansible@ "sudo systemctl status sshd" + +# Check firewall rules +ssh ansible@ "sudo ufw status" # Debian/Ubuntu +ssh ansible@ "sudo firewall-cmd --list-services" # RHEL + +# Check SSH port listening +ssh ansible@ "sudo ss -tlnp | grep :22" +``` + +### Disk Space Issues + +```bash +# Check hypervisor disk space +ansible grokbox -m shell -a "df -h /var/lib/libvirt/images" + +# Check VM disk space +ssh ansible@ "df -h" + +# List large files +ssh ansible@ "sudo du -sh /* | sort -h" +``` + +## VM Management + +### Start/Stop/Reboot VM + +```bash +# Start VM +ansible grokbox -m shell -a "virsh start " + +# Shutdown VM gracefully +ansible grokbox -m shell -a "virsh shutdown " + +# Force stop VM +ansible grokbox -m shell -a "virsh destroy " + +# Reboot VM +ansible grokbox -m shell -a "virsh reboot " + +# Enable autostart +ansible grokbox -m shell -a "virsh autostart " +``` + +### Delete VM + +```bash +# Stop and delete VM (DESTRUCTIVE) +ansible grokbox -m shell -a "virsh destroy " +ansible grokbox -m shell -a "virsh undefine --remove-all-storage" +``` + +### VM Snapshots + +```bash +# Create snapshot +ansible grokbox -m shell -a "virsh snapshot-create-as snapshot1 'Before updates'" + 
+# List snapshots +ansible grokbox -m shell -a "virsh snapshot-list " + +# Restore snapshot +ansible grokbox -m shell -a "virsh snapshot-revert snapshot1" + +# Delete snapshot +ansible grokbox -m shell -a "virsh snapshot-delete snapshot1" +``` + +## Performance Optimization + +### Parallel Deployment + +```bash +# Deploy on multiple hypervisors in parallel (-f sets Ansible forks; default: 5) +ansible-playbook site.yml -t deploy_linux_vm -f 5 + +# Serial deployment (one host at a time) +ansible-playbook site.yml -t deploy_linux_vm -f 1 +``` + +### Skip Slow Operations + +```bash +# Skip package installation (if already installed) +ansible-playbook site.yml -t deploy_linux_vm --skip-tags install + +# Skip image download (if already cached) +ansible-playbook site.yml -t deploy_linux_vm --skip-tags download +``` + +## Security Checkpoints + +- ✓ Root login over SSH disabled (console access available) +- ✓ SSH password authentication disabled (key-based only) +- ✓ GSSAPI authentication disabled per requirements +- ✓ Firewall enabled (UFW/firewalld) with SSH allowed +- ✓ SELinux enforcing (RHEL family) or AppArmor enabled (Debian family) +- ✓ Automatic security updates enabled (no auto-reboot by default) +- ✓ Audit daemon (auditd) enabled +- ✓ LVM with secure mount options (/tmp with noexec,nosuid,nodev) +- ✓ Essential security packages installed (aide, auditd, chrony) +- ✓ Ansible service account with passwordless sudo (logged) + +## Quick Reference Commands + +```bash +# Standard deployment +ansible-playbook site.yml -t deploy_linux_vm + +# Custom VM +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_name=myvm" \ + -e "deploy_linux_vm_os_distribution=ubuntu-22.04" + +# Pre-flight check only +ansible-playbook site.yml -t deploy_linux_vm,validate --check + +# Deploy without LVM +ansible-playbook site.yml -t deploy_linux_vm --skip-tags lvm + +# Configure LVM post-deployment +ansible-playbook site.yml -t deploy_linux_vm,lvm + +# Get VM IP +ansible grokbox -m shell -a "virsh
domifaddr " + +# SSH to VM +ssh -J grokbox ansible@ + +# Check VM status +ansible grokbox -m shell -a "virsh list --all" +``` + +## File Locations + +**On Hypervisor:** +- Cloud images: `/var/lib/libvirt/images/*.qcow2` +- VM disk: `/var/lib/libvirt/images/.qcow2` +- LVM disk: `/var/lib/libvirt/images/-lvm.qcow2` +- Cloud-init ISO: `/var/lib/libvirt/images/-cloud-init.iso` + +**On Deployed VM:** +- SSH config: `/etc/ssh/sshd_config.d/99-security.conf` +- Sudoers: `/etc/sudoers.d/ansible` +- Cloud-init log: `/var/log/cloud-init-output.log` +- Fstab: `/etc/fstab` (LVM mounts) + +## See Also + +- [Role README](../../roles/deploy_linux_vm/README.md) +- [Role Documentation](../../docs/roles/deploy_linux_vm.md) +- [Linux VM Deployment Runbook](../../docs/runbooks/deployment.md) +- [CLAUDE.md Guidelines](../../CLAUDE.md) + +--- + +**Role**: deploy_linux_vm v1.0.0 +**Updated**: 2025-11-11 +**Documentation**: See `roles/deploy_linux_vm/README.md` and `docs/roles/deploy_linux_vm.md` diff --git a/cheatsheets/roles/system_info.md b/cheatsheets/roles/system_info.md new file mode 100644 index 0000000..dbc20ef --- /dev/null +++ b/cheatsheets/roles/system_info.md @@ -0,0 +1,368 @@ +# System Info Role Cheatsheet + +Quick reference guide for the `system_info` role - comprehensive system information gathering. 
+ +## Quick Start + +```bash +# Run complete information gathering +ansible-playbook site.yml -t system_info + +# Run on specific hosts +ansible-playbook site.yml -l webservers -t system_info + +# Run with validation only +ansible-playbook site.yml -t system_info,validate +``` + +## Common Execution Patterns + +### Full Execution +```bash +# All hosts, all information +ansible-playbook site.yml -t system_info + +# Single host +ansible-playbook site.yml -l hostname.example.com -t system_info + +# Specific group +ansible-playbook site.yml -l production -t system_info +``` + +### Selective Information Gathering + +```bash +# CPU information only +ansible-playbook site.yml -t system_info,cpu + +# GPU information only +ansible-playbook site.yml -t system_info,gpu + +# Memory and swap only +ansible-playbook site.yml -t system_info,memory + +# Disk information only +ansible-playbook site.yml -t system_info,disk + +# Network information only +ansible-playbook site.yml -t system_info,network + +# Hypervisor detection only +ansible-playbook site.yml -t system_info,hypervisor + +# System information only +ansible-playbook site.yml -t system_info,system +``` + +### Combined Tags + +```bash +# CPU, Memory, and Disk +ansible-playbook site.yml -t system_info,cpu,memory,disk + +# Skip installation, gather only +ansible-playbook site.yml -t system_info --skip-tags install + +# Validation and health check +ansible-playbook site.yml -t system_info,validate,health-check + +# Export statistics only (requires prior gathering) +ansible-playbook site.yml -t system_info,export +``` + +## Available Tags + +| Tag | Description | +|-----|-------------| +| `system_info` | Main role tag (required) | +| `install` | Install required packages | +| `gather` | All information gathering | +| `system` | OS and system info | +| `cpu` | CPU details | +| `gpu` | GPU detection | +| `memory` | RAM and swap | +| `disk` | Storage and filesystems | +| `network` | Network interfaces | +| `hypervisor` | 
Virtualization detection | +| `export` | Export to JSON | +| `statistics` | Statistics aggregation | +| `validate` | Validation checks | +| `health-check` | Health monitoring | +| `security` | Security-related info | + +## Common Variables + +### Directory Configuration +```yaml +# Custom statistics directory +system_info_stats_base_dir: /var/lib/ansible/stats + +# Disable automatic directory creation +system_info_create_stats_dir: false +``` + +### Feature Toggles +```yaml +# Disable GPU gathering (for servers without GPU) +system_info_gather_gpu: false + +# Disable hypervisor detection +system_info_detect_hypervisor: false + +# Minimal gathering (CPU, Memory, Disk only) +system_info_gather_network: false +system_info_gather_gpu: false +system_info_detect_hypervisor: false +``` + +### Output Configuration +```yaml +# Increase JSON readability +system_info_json_indent: 4 + +# Include raw command outputs +system_info_include_raw_output: true +``` + +## Output Files + +### Default Location +``` +./stats/machines// +├── system_info.json # Latest statistics +├── system_info_.json # Timestamped backup +└── summary.txt # Human-readable summary +``` + +### View Statistics +```bash +# View JSON (pretty-printed) +jq . 
./stats/machines/server01.example.com/system_info.json + +# View summary +cat ./stats/machines/server01.example.com/summary.txt + +# Extract specific information +jq '.cpu.model' ./stats/machines/*/system_info.json +jq '.memory.total_mb' ./stats/machines/*/system_info.json +jq '.hypervisor.is_hypervisor' ./stats/machines/*/system_info.json + +# Count hypervisors +jq -r 'select(.hypervisor.is_hypervisor == true) | .host_info.fqdn' \ + ./stats/machines/*/system_info.json | wc -l + +# Find all VMs +jq -r 'select(.hypervisor.is_virtual == true) | .host_info.fqdn' \ + ./stats/machines/*/system_info.json + +# Memory usage report +jq -r '"\(.host_info.fqdn): \(.memory.usage_percent)%"' \ + ./stats/machines/*/system_info.json +``` + +## Example Playbooks + +### Basic Playbook +```yaml +--- +- name: Gather system information + hosts: all + become: true + roles: + - system_info +``` + +### Advanced Playbook +```yaml +--- +- name: Gather detailed system information + hosts: all + become: true + roles: + - role: system_info + vars: + system_info_stats_base_dir: /var/lib/ansible/inventory + system_info_json_indent: 4 + system_info_gather_gpu: true + system_info_detect_hypervisor: true +``` + +### Targeted Playbook +```yaml +--- +- name: Gather hypervisor information only + hosts: hypervisors + become: true + tasks: + - name: Include system_info role for hypervisor detection + include_role: + name: system_info + tasks_from: detect_hypervisor + tags: [hypervisor] +``` + +## Troubleshooting + +### Check Role Execution +```bash +# Dry-run (check mode) +ansible-playbook site.yml -t system_info --check + +# Verbose output +ansible-playbook site.yml -t system_info -v + +# Very verbose (debug) +ansible-playbook site.yml -t system_info -vvv + +# Single host debugging +ansible-playbook site.yml -l problematic-host -t system_info -vvv +``` + +### Common Issues + +**Missing packages** +```bash +# Install packages manually first +ansible-playbook site.yml -t system_info,install + +# Check 
which packages are already installed +ansible all -m package_facts +``` + +**Permission errors** +```bash +# Ensure become is enabled +ansible-playbook site.yml -t system_info --become + +# Check sudo access +ansible all -m ping --become +``` + +**Statistics not saved** +```bash +# Check if directory exists +ls -la ./stats/machines/ + +# Check disk space on control node +df -h . + +# Verify write permissions +touch ./stats/machines/test && rm ./stats/machines/test +``` + +### Validation + +```bash +# Run only validation tasks +ansible-playbook site.yml -t system_info,validate + +# Check specific host health +ansible-playbook site.yml -l server01 -t validate,health-check + +# Verify JSON files +for f in ./stats/machines/*/system_info.json; do + echo "Checking $f" + jq empty "$f" && echo "OK" || echo "INVALID" +done +``` + +## Performance Optimization + +### Parallel Execution +```bash +# Increase parallelism (default: 5) +ansible-playbook site.yml -t system_info -f 20 + +# Serial execution (one at a time) +ansible-playbook site.yml -t system_info -f 1 +``` + +### Skip Slow Tasks +```bash +# Skip installation if packages are pre-installed +ansible-playbook site.yml -t system_info --skip-tags install + +# Skip network gathering (can be slow) +ansible-playbook site.yml -t system_info --skip-tags network +``` + +## Integration Examples + +### Cron Job for Regular Collection +```bash +# Daily collection at 2 AM +0 2 * * * cd /opt/ansible && ansible-playbook site.yml -t system_info >> /var/log/ansible/system_info.log 2>&1 +``` + +### Generate Text Report +```bash +# Flatten each host's JSON into "key: value" lines (plain text, one report per host) +for host in ./stats/machines/*; do + hostname=$(basename "$host") + jq -r 'to_entries | map("\(.key): \(.value)") | .[]' \ + "$host/system_info.json" > "$host/report.txt" +done +``` + +### Compare Statistics +```bash +# Compare CPU across hosts +jq -r '"\(.host_info.fqdn),\(.cpu.model),\(.cpu.count.vcpus)"' \ + ./stats/machines/*/system_info.json | column -t -s, + +# Compare memory across hosts +jq -r
'"\(.host_info.fqdn),\(.memory.total_mb) MB,\(.memory.usage_percent)%"' \ + ./stats/machines/*/system_info.json | column -t -s, +``` + +## Security Checkpoints + +- ✓ Role runs with `become: true` for hardware access +- ✓ No credentials or secrets are collected +- ✓ Statistics files contain infrastructure details - protect appropriately +- ✓ Sensitive data (serial numbers, UUIDs) included - review before sharing +- ✓ Files stored on control node only - not on managed hosts + +## Quick Reference Commands + +```bash +# Full scan +ansible-playbook site.yml -t system_info + +# CPU + Memory only +ansible-playbook site.yml -t system_info,cpu,memory + +# Validate all hosts +ansible-playbook site.yml -t system_info,validate + +# Export only (no gathering) +ansible-playbook site.yml -t system_info,export + +# Single host, verbose +ansible-playbook site.yml -l hostname -t system_info -v + +# View latest stats +cat ./stats/machines/$(hostname -f)/summary.txt +``` + +## Ansible Ad-Hoc Alternatives + +```bash +# Quick CPU check +ansible all -m shell -a "lscpu | grep 'Model name'" + +# Quick memory check +ansible all -m shell -a "free -h" + +# Quick disk check +ansible all -m shell -a "df -h" + +# Check virtualization +ansible all -m shell -a "systemd-detect-virt" +``` + +--- + +**Role**: system_info v1.0.0 +**Updated**: 2025-01-11 +**Documentation**: See `roles/system_info/README.md` diff --git a/docs/architecture/network-topology.md b/docs/architecture/network-topology.md new file mode 100644 index 0000000..57d12e4 --- /dev/null +++ b/docs/architecture/network-topology.md @@ -0,0 +1,112 @@ +# Network Topology + +## Overview + +This document describes the network architecture for the Ansible-managed infrastructure, including physical and virtual network layouts, security zones, and connectivity patterns. 
+ +## Network Diagram + +``` +Internet + │ + │ Firewall/Router + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Management Network │ +│ (192.168.1.0/24 - Example) │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ │ +│ │ Ansible │───────│ Gitea │ │ +│ │ Control │ │ Repository │ │ +│ └──────────────┘ └──────────────┘ │ +│ │ +│ SSH (Port 22, Key-based) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ┌────────────────┼────────────────┐ + │ │ │ + ▼ ▼ ▼ + ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ + │ Hypervisor │ │ Hypervisor │ │ Hypervisor │ + │ (grokbox) │ │ (hv02) │ │ (hv03) │ + └─────┬───────┘ └─────┬───────┘ └─────┬───────┘ + │ │ │ + Virtual Networks (libvirt) + │ │ │ + ┌─────┴────────────────┴────────────────┴─────┐ + │ VM Network Layer │ + │ │ + │ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ + │ │ Web │ │ App │ │ DB │ │Cache │ │ + │ │ VMs │ │ VMs │ │ VMs │ │ VMs │ │ + │ └──────┘ └──────┘ └──────┘ └──────┘ │ + └───────────────────────────────────────────┘ +``` + +## Network Zones + +### Management Zone +- **Purpose**: Ansible control and infrastructure management +- **CIDR**: 192.168.1.0/24 (example - adjust per environment) +- **Access**: Restricted to operations team +- **Protocols**: SSH (22), HTTPS (443) + +### Hypervisor Zone +- **Purpose**: KVM/libvirt hypervisor hosts +- **Access**: Ansible control node via SSH +- **Services**: libvirt (16509), SSH (22) + +### Guest VM Zone +- **Purpose**: Application and service VMs +- **Networks**: Multiple virtual networks per purpose + - Production: 10.0.1.0/24 + - Staging: 10.0.2.0/24 + - Development: 10.0.3.0/24 + +## Virtual Networking (libvirt) + +### Default NAT Network +- **Network**: `default` +- **Type**: NAT +- **Subnet**: 192.168.122.0/24 +- **DHCP**: Enabled +- **Use Case**: Development and testing VMs + +### Bridged Network +- **Network**: `br0` +- **Type**: Bridge +- **Configuration**: Attached to physical NIC +- **Use Case**: Production VMs requiring direct 
network access + +## Firewall Rules + +### Hypervisor Firewall (firewalld/UFW) + +**Allowed Inbound**: +- SSH from Ansible control node (port 22) +- libvirt management from control node (port 16509) + +**Denied**: +- All other inbound traffic (default deny) + +### Guest VM Firewall + +**Allowed Inbound**: +- SSH from hypervisor/management network (port 22) +- Application-specific ports (per VM purpose) + +**Allowed Outbound**: +- HTTPS for package repositories (port 443) +- DNS queries (port 53) +- NTP time sync (port 123) + +## DNS Configuration + +- **Primary**: 8.8.8.8 (Google DNS) +- **Secondary**: 1.1.1.1 (Cloudflare DNS) +- **Future**: Internal DNS server for local name resolution + +## Related Documentation + +- [Architecture Overview](./overview.md) +- [Security Model](./security-model.md) diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 0000000..a58985e --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,647 @@ +# Infrastructure Architecture Overview + +## Executive Summary + +This document provides a comprehensive overview of the Ansible-based infrastructure automation architecture. The system is designed with security-first principles, leveraging Infrastructure as Code (IaC) best practices for automated provisioning, configuration management, and operational excellence. 
+ +**Architecture Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Document Owner**: Ansible Infrastructure Team + +--- + +## Architecture Principles + +### Security-First Design + +All infrastructure components implement defense-in-depth security: + +- **Least Privilege**: Service accounts with minimal required permissions +- **Encryption**: Data encrypted at rest and in transit +- **Hardening**: CIS Benchmark-compliant system configuration +- **Auditing**: Comprehensive logging and audit trails +- **Automation**: Security patches applied automatically + +### Infrastructure as Code (IaC) + +All infrastructure is defined, versioned, and managed as code: + +- **Version Control**: Git-based change tracking +- **Declarative Configuration**: Ansible playbooks and roles +- **Idempotency**: Safe re-execution without side effects +- **Documentation**: Self-documenting through code + +### Scalability & Modularity + +Architecture scales from small to enterprise deployments: + +- **Modular Roles**: Single-purpose, reusable components +- **Dynamic Inventories**: Auto-discovery of infrastructure +- **Parallel Execution**: Concurrent operations for speed +- **Horizontal Scaling**: Add capacity by adding hosts + +--- + +## High-Level Architecture + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Management Layer │ +│ ┌─────────────────┐ ┌──────────────────┐ │ +│ │ Ansible Control │────────▶│ Git Repository │ │ +│ │ Node │ │ (Gitea) │ │ +│ │ │ └──────────────────┘ │ +│ │ - Playbooks │ ┌──────────────────┐ │ +│ │ - Inventories │────────▶│ Secret Manager │ │ +│ │ - Roles │ │ (Ansible Vault) │ │ +│ └────────┬────────┘ └──────────────────┘ │ +└───────────┼──────────────────────────────────────────────────────┘ + │ + │ SSH (port 22) + │ Encrypted, Key-based Auth + │ +┌───────────┼──────────────────────────────────────────────────────┐ +│ │ Compute Layer │ +│ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Hypervisor Hosts ││ 
+│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ││ +│ │ │ KVM/Libvirt │ │ KVM/Libvirt │ │ KVM/Libvirt │ ││ +│ │ │ Hypervisor │ │ Hypervisor │ │ Hypervisor │ ││ +│ │ │ (grokbox) │ │ (hv02) │ │ (hv03) │ ││ +│ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ ││ +│ └─────────┼──────────────────┼──────────────────┼──────────────┘│ +│ │ │ │ │ +│ ▼ ▼ ▼ │ +│ ┌─────────────────────────────────────────────────────────────┐│ +│ │ Guest Virtual Machines ││ +│ │ ││ +│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││ +│ │ │ Web │ │ App │ │ Database │ │ Cache │ ││ +│ │ │ Servers │ │ Servers │ │ Servers │ │ Servers │ ││ +│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ ││ +│ │ ││ +│ │ - SELinux/AppArmor Enforcing ││ +│ │ - Firewall (UFW/firewalld) ││ +│ │ - Automatic Security Updates ││ +│ │ - LVM Storage Management ││ +│ └─────────────────────────────────────────────────────────────┘│ +└────────────────────────────────────────────────────────────────────┘ + │ + │ Logs, Metrics, Events + ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ Observability Layer │ +│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ +│ │ Logging │ │ Monitoring │ │ Audit │ │ +│ │ (Future) │ │ (Future) │ │ Logs │ │ +│ └────────────┘ └────────────┘ └────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Component Architecture + +### Management Layer + +#### Ansible Control Node + +**Purpose**: Central orchestration and automation hub + +**Components**: +- Ansible Core (2.12+) +- Python 3.x +- Custom roles and playbooks +- Dynamic inventory plugins +- Ansible Vault for secrets + +**Responsibilities**: +- Execute playbooks and roles +- Manage inventory (dynamic and static) +- Secure secrets management +- Version control integration +- Audit log collection + +**Security Controls**: +- SSH key-based authentication only +- No password-based access +- Encrypted secrets (Ansible Vault) +- Git-backed change tracking 
+- Limited user access with RBAC + +#### Git Repository (Gitea) + +**Purpose**: Version control for Infrastructure as Code + +**Hosted**: https://git.mymx.me +**Authentication**: SSH keys, user accounts + +**Content**: +- Ansible playbooks +- Role definitions +- Inventory configurations (public) +- Documentation +- Scripts and utilities + +**Workflow**: +- Feature branch development +- Pull request reviews +- Main branch protection +- Semantic versioning tags + +**Note**: Secrets stored in separate private repository + +#### Secret Management + +**Primary**: Ansible Vault (file-based encryption) +**Future**: HashiCorp Vault, AWS Secrets Manager integration + +**Secrets Managed**: +- SSH private keys +- Service account credentials +- API tokens +- Encryption certificates +- Database passwords + +**Location**: `./secrets` directory (private git submodule) + +### Compute Layer + +#### Hypervisor Hosts + +**Platform**: KVM/libvirt on Linux (Debian 12, Ubuntu 22.04, AlmaLinux 9) + +**Key Capabilities**: +- Hardware virtualization (Intel VT-x / AMD-V) +- Nested virtualization support +- Storage pools (LVM-backed) +- Virtual networking (bridges, NAT) +- Live migration (planned) + +**Resource Allocation**: +- CPU overcommit ratio: 2:1 (2 vCPUs per physical core) +- Memory overcommit: Disabled for production +- Storage: Thin provisioning with LVM + +**Management**: +- virsh CLI +- libvirt API +- Ansible automation +- No GUI (security requirement) + +#### Guest Virtual Machines + +**Provisioning**: Automated via `deploy_linux_vm` role + +**Supported Distributions**: +- Debian 11, 12 +- Ubuntu 20.04, 22.04, 24.04 LTS +- RHEL 8, 9 +- AlmaLinux 8, 9 +- Rocky Linux 8, 9 +- openSUSE Leap 15.5, 15.6 + +**Standard Configuration**: +- Cloud-init provisioning +- LVM storage (CLAUDE.md compliant) +- SSH hardening (key-only, no root login) +- SELinux enforcing (RHEL) / AppArmor (Debian) +- Firewall enabled (UFW/firewalld) +- Automatic security updates +- Audit daemon (auditd) +- Time 
synchronization (chrony) + +**Resource Tiers**: + +| Tier | vCPUs | RAM | Disk | Use Case | +|------|-------|-----|------|----------| +| Small | 2 | 2 GB | 30 GB | Development, testing | +| Medium | 4 | 8 GB | 50 GB | Web servers, app servers | +| Large | 8 | 16 GB | 100 GB | Databases, data processing | +| XLarge | 16+ | 32+ GB | 200+ GB | High-performance applications | + +### Observability Layer (Planned) + +#### Logging + +**Future Integration**: ELK Stack, Graylog, or Loki + +**Log Sources**: +- System logs (rsyslog/journald) +- Application logs +- Audit logs (auditd) +- Security events +- Ansible execution logs + +**Retention**: 30 days local, 1 year centralized + +#### Monitoring + +**Future Integration**: Prometheus + Grafana + +**Metrics Collected**: +- CPU, memory, disk, network utilization +- Service availability +- Application performance +- Infrastructure health + +**Alerting**: PagerDuty, Slack, Email + +#### Audit & Compliance + +**Current**: +- auditd on all systems +- Ansible execution logs +- Git change tracking + +**Future**: +- Centralized audit log aggregation +- SIEM integration +- Compliance dashboards (CIS, NIST) + +--- + +## Deployment Patterns + +### Greenfield Deployment + +**Scenario**: New infrastructure from scratch + +``` +1. Setup Ansible Control Node + └─▶ Install Ansible + └─▶ Clone git repository + └─▶ Configure inventories + └─▶ Setup secrets management + +2. Provision Hypervisors + └─▶ Install KVM/libvirt + └─▶ Configure storage pools + └─▶ Setup networking + └─▶ Apply security hardening + +3. Deploy Guest VMs + └─▶ Use deploy_linux_vm role + └─▶ Apply LVM configuration + └─▶ Verify security posture + +4. Configure Applications + └─▶ Apply application roles + └─▶ Configure services + └─▶ Implement monitoring + +5. Validate & Document + └─▶ Run system_info role + └─▶ Generate inventory + └─▶ Update documentation +``` + +### Incremental Expansion + +**Scenario**: Add capacity to existing infrastructure + +``` +1. 
Add Hypervisor (if needed) + └─▶ Physical installation + └─▶ Ansible provisioning + └─▶ Add to inventory + +2. Deploy Additional VMs + └─▶ Execute deploy_linux_vm role + └─▶ Configure per requirements + └─▶ Integrate with load balancer + +3. Update Inventory + └─▶ Refresh dynamic inventory + └─▶ Update group assignments + └─▶ Verify connectivity + +4. Apply Configuration + └─▶ Run relevant playbooks + └─▶ Validate functionality + └─▶ Monitor performance +``` + +### Disaster Recovery + +**Scenario**: Rebuild after failure + +``` +1. Assess Damage + └─▶ Identify affected systems + └─▶ Check backup status + └─▶ Plan recovery order + +2. Restore Hypervisor (if needed) + └─▶ Reinstall from bare metal + └─▶ Apply Ansible configuration + └─▶ Restore storage pools + +3. Restore VMs + └─▶ Restore from backups, OR + └─▶ Redeploy with deploy_linux_vm + └─▶ Restore application data + +4. Verify & Resume + └─▶ Run validation checks + └─▶ Test application functionality + └─▶ Resume normal operations +``` + +--- + +## Data Flow + +### Provisioning Flow + +``` +Ansible Control + │ + │ 1. Read inventory + │ (dynamic or static) + ▼ + Inventory + │ + │ 2. Execute playbook + │ with role(s) + ▼ + Hypervisor + │ + │ 3. Create VM + │ - Download cloud image + │ - Create disks + │ - Generate cloud-init ISO + │ - Define & start VM + ▼ + Guest VM + │ + │ 4. Cloud-init first boot + │ - User creation + │ - SSH key deployment + │ - Package installation + │ - Security hardening + ▼ + Guest VM (Running) + │ + │ 5. Post-deployment + │ - LVM configuration + │ - Additional hardening + │ - Service configuration + ▼ + Guest VM (Ready) +``` + +### Configuration Management Flow + +``` +Git Repository + │ + │ 1. Developer commits changes + │ (playbook, role, config) + ▼ + Pull Request + │ + │ 2. Code review + │ Approval required + ▼ + Main Branch + │ + │ 3. Ansible control pulls changes + │ (manual or automated) + ▼ + Ansible Control + │ + │ 4. 
Execute playbook + │ Target specific environment + ▼ + Target Hosts + │ + │ 5. Apply configuration + │ Idempotent execution + ▼ + Updated State + │ + │ 6. Validation + │ Verify desired state + ▼ + Audit Log +``` + +### Information Gathering Flow + +``` +Ansible Control + │ + │ 1. Execute gather_system_info.yml + ▼ + Target Hosts + │ + │ 2. Collect data + │ - CPU, GPU, Memory + │ - Disk, Network + │ - Hypervisor info + ▼ + system_info role + │ + │ 3. Aggregate and format + │ JSON structure + ▼ + Ansible Control + │ + │ 4. Save to local filesystem + │ ./stats/machines// + ▼ + JSON Files + │ + │ 5. Query and analyze + │ - jq queries + │ - Report generation + │ - CMDB sync + ▼ + Reports/Dashboards +``` + +--- + +## Environment Segregation + +### Environment Structure + +``` +inventories/ +├── production/ +│ ├── hosts.yml (or dynamic plugin config) +│ └── group_vars/ +│ ├── all.yml +│ └── webservers.yml +├── staging/ +│ ├── hosts.yml +│ └── group_vars/ +│ └── all.yml +└── development/ + ├── hosts.yml + └── group_vars/ + └── all.yml +``` + +### Environment Isolation + +| Environment | Purpose | Change Control | Automation | Data | +|-------------|---------|----------------|------------|------| +| **Production** | Live systems | Strict approval | Scheduled | Real | +| **Staging** | Pre-production testing | Approval required | On-demand | Sanitized | +| **Development** | Feature development | Minimal | On-demand | Synthetic | + +### Promotion Pipeline + +``` +Development + │ + │ 1. Develop & test features + │ No approval required + ▼ +Staging + │ + │ 2. Integration testing + │ Approval: Tech Lead + ▼ +Production + │ + │ 3. 
Gradual rollout + │ Approval: Operations Manager + ▼ +Live +``` + +--- + +## Scaling Strategy + +### Horizontal Scaling + +**Add compute capacity**: +- Add hypervisor hosts +- Deploy additional VMs +- Update load balancer configuration +- Rebalance workloads + +**Automation**: +- Dynamic inventory auto-discovers new hosts +- Ansible playbooks target groups, not individuals +- Configuration applied uniformly + +### Vertical Scaling + +**Increase VM resources**: +- Shutdown VM +- Modify vCPU/memory allocation (virsh) +- Resize disk volumes (LVM) +- Restart VM +- Verify application performance + +### Storage Scaling + +**Expand LVM volumes**: +```bash +# Add new disk to hypervisor +# Attach to VM as /dev/vdc + +# Extend volume group +pvcreate /dev/vdc +vgextend vg_system /dev/vdc + +# Extend logical volume +lvextend -L +50G /dev/vg_system/lv_var +resize2fs /dev/vg_system/lv_var # ext4 +# or +xfs_growfs /var # xfs +``` + +--- + +## High Availability & Disaster Recovery + +### Current State + +**Single Points of Failure**: +- Ansible control node (manual failover) +- Individual hypervisors (VM migration required) +- No automated failover + +**Mitigation**: +- Regular backups (VM snapshots) +- Documentation for rebuild +- Idempotent playbooks for re-deployment + +### Future Enhancements (Planned) + +**High Availability**: +- Multiple Ansible control nodes (Ansible Tower/AWX) +- Hypervisor clustering (Proxmox cluster) +- Load-balanced application tiers +- Database replication (PostgreSQL streaming) + +**Disaster Recovery**: +- Automated backup solution +- Off-site backup replication +- DR site with regular testing +- Documented RTO/RPO objectives + +--- + +## Performance Considerations + +### Ansible Execution Optimization + +- **Fact Caching**: Reduces gather time +- **Parallelism**: Increase forks for concurrent execution +- **Pipelining**: Reduces SSH overhead +- **Strategy Plugins**: Use `free` strategy when tasks are independent + +### VM Performance Tuning + +- 
**CPU Pinning**: For latency-sensitive applications +- **NUMA Awareness**: Optimize memory access +- **virtio Drivers**: Use paravirtualized devices +- **Disk I/O**: Use virtio-scsi with native AIO + +### Network Performance + +- **SR-IOV**: For high-throughput networking +- **Bridge Offloading**: Reduce CPU overhead +- **MTU Optimization**: Jumbo frames where supported + +--- + +## Cost Optimization + +### Resource Efficiency + +- **Right-Sizing**: Match VM resources to actual needs +- **Consolidation**: Maximize hypervisor utilization +- **Thin Provisioning**: Allocate storage on-demand +- **Decommissioning**: Remove unused infrastructure + +### Automation Benefits + +- **Reduced Manual Labor**: Faster deployments +- **Fewer Errors**: Consistent configurations +- **Faster Recovery**: Automated DR procedures +- **Better Utilization**: Data-driven capacity planning + +--- + +## Related Documentation + +- [Network Topology](./network-topology.md) +- [Security Model](./security-model.md) +- [Role Index](../roles/role-index.md) +- [CLAUDE.md Guidelines](../../CLAUDE.md) + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Review Schedule**: Quarterly +**Document Owner**: Ansible Infrastructure Team diff --git a/docs/architecture/security-model.md b/docs/architecture/security-model.md new file mode 100644 index 0000000..3cbbeac --- /dev/null +++ b/docs/architecture/security-model.md @@ -0,0 +1,355 @@ +# Security Model + +## Security Architecture Overview + +This document describes the security architecture, controls, and practices implemented across the Ansible-managed infrastructure. + +## Security Principles + +### Defense in Depth +Multiple layers of security controls protect infrastructure: +1. **Network Security**: Firewalls, network segmentation +2. **Access Control**: SSH keys, least privilege, MFA (planned) +3. **System Hardening**: SELinux/AppArmor, secure configurations +4. **Patch Management**: Automatic security updates +5. 
**Audit & Logging**: Comprehensive activity tracking +6. **Encryption**: Data at rest and in transit + +### Least Privilege +- Service accounts with minimal required permissions +- No root SSH access +- Sudo logging enabled +- Regular access reviews + +### Security by Default +- SSH password authentication disabled +- Firewall enabled by default +- SELinux/AppArmor enforcing mode +- Automatic security updates enabled +- Audit daemon (auditd) active + +## Access Control + +### Authentication + +**SSH Key-Based Authentication**: +- RSA 4096-bit or Ed25519 keys +- No password-based SSH login +- Key rotation every 90-180 days +- Root login disabled + +**Service Accounts**: +- `ansible` user on all managed systems +- Passwordless sudo with logging +- SSH public keys pre-deployed +- No interactive shell access + +### Authorization + +**Sudo Configuration** (`/etc/sudoers.d/ansible`): +``` +ansible ALL=(ALL) NOPASSWD: ALL +Defaults:ansible !requiretty +Defaults:ansible log_output +``` + +**Future Enhancements**: +- RBAC via Ansible Tower/AWX +- Multi-factor authentication (MFA) +- Privileged access management (PAM) + +## Network Security + +### Firewall Configuration + +**Debian/Ubuntu (UFW)**: +```bash +# Default policies +ufw default deny incoming +ufw default allow outgoing + +# Allow SSH +ufw allow 22/tcp + +# Application-specific rules added per VM +``` + +**RHEL/AlmaLinux (firewalld)**: +```bash +# Default zone: drop +firewall-cmd --set-default-zone=drop + +# Allow SSH in public zone +firewall-cmd --zone=public --add-service=ssh --permanent +``` + +### Network Segmentation + +| Zone | Purpose | Access Control | +|------|---------|---------------| +| Management | Ansible control, tooling | Restricted to ops team | +| Hypervisor | KVM hosts | Ansible control node only | +| Production VMs | Live services | Application-specific rules | +| Staging VMs | Testing | More permissive for testing | +| Development VMs | Dev/test | Minimal restrictions | + +### SSH Hardening + 
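A minimal sketch for checking that a host's effective sshd settings match the hardened values this section lists. It reads `sshd -T`-style output (lowercase keyword, then value) on stdin. The function name `check_sshd_settings` and the expected-value list are illustrative assumptions, not part of the role.

```bash
#!/bin/sh
# Sketch: compare effective sshd settings with the expected hardening values.
# Input on stdin: lines of "keyword value", as printed by `sshd -T`.
# check_sshd_settings is a hypothetical helper; adjust the list to your policy.

check_sshd_settings() {
  actual="$(cat)"   # capture the effective configuration from stdin
  status=0
  while read -r key want; do
    # Look up the first line whose keyword matches, case-insensitively.
    got="$(printf '%s\n' "$actual" | awk -v k="$key" 'tolower($1) == k { print $2; exit }')"
    if [ "$got" != "$want" ]; then
      printf 'MISMATCH %s: expected %s, got %s\n' "$key" "$want" "${got:-unset}"
      status=1
    fi
  done <<EOF
permitrootlogin no
passwordauthentication no
pubkeyauthentication yes
gssapiauthentication no
maxauthtries 3
clientaliveinterval 300
clientalivecountmax 2
x11forwarding no
EOF
  return "$status"
}

# Usage on a managed host (sshd -T usually needs root to read host keys):
#   sudo sshd -T | check_sshd_settings && echo "sshd hardening OK"
```

The function prints one `MISMATCH` line per deviating keyword and returns non-zero if any are found, so it can gate a validation step or an ad-hoc check.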
+**Configuration** (`/etc/ssh/sshd_config.d/99-security.conf`): +```ini +PermitRootLogin no +PasswordAuthentication no +PubkeyAuthentication yes +# GSSAPI explicitly disabled per CLAUDE.md +GSSAPIAuthentication no +MaxAuthTries 3 +ClientAliveInterval 300 +ClientAliveCountMax 2 +X11Forwarding no +Protocol 2 +``` + +## System Hardening + +### Mandatory Access Control + +**RHEL Family (SELinux)**: +- Mode: `enforcing` +- Policy: `targeted` +- Verification: `getenforce` +- Never run `setenforce 0` in production + +**Debian Family (AppArmor)**: +- Status: `enabled` +- Mode: `enforce` +- Profiles: All default profiles active + +### File System Security + +**LVM Mount Options** (CLAUDE.md compliant): +- `/tmp`: mounted with `noexec,nosuid,nodev` +- `/var/tmp`: mounted with `noexec,nosuid,nodev` +- Separate partitions for `/var`, `/var/log`, `/var/log/audit` + +### Kernel Hardening + +**sysctl parameters** (`/etc/sysctl.d/99-security.conf`): +```ini +# Network security +net.ipv4.conf.all.rp_filter = 1 +net.ipv4.conf.default.rp_filter = 1 +net.ipv4.icmp_echo_ignore_broadcasts = 1 +net.ipv4.conf.all.accept_source_route = 0 +net.ipv4.conf.default.accept_source_route = 0 +net.ipv4.conf.all.send_redirects = 0 +net.ipv4.conf.default.send_redirects = 0 + +# Security hardening +kernel.dmesg_restrict = 1 +kernel.kptr_restrict = 2 +``` + +## Patch Management + +### Automatic Security Updates + +**Debian/Ubuntu (unattended-upgrades)**: +- Security updates: Automatically installed +- Reboot: Manual (not automatic) +- Notifications: Email on errors + +**RHEL/AlmaLinux (dnf-automatic)**: +- Security updates: Automatically applied +- Reboot: Manual (not automatic) +- Logging: All actions logged + +### Update Strategy + +| Environment | Update Schedule | Testing | Rollback Plan | +|-------------|----------------|---------|---------------| +| Development | Immediate | Minimal | Redeploy if issues | +| Staging | Weekly | Full regression | Snapshot restore | +| Production | Monthly (security: weekly) | 
Comprehensive | Snapshot + DR plan | + +## Secrets Management + +### Current: Ansible Vault + +**Encrypted Content**: +- SSH private keys +- Service account passwords +- API tokens +- Database credentials + +**Location**: `./secrets` directory (private git repository) + +**Key Rotation**: Every 90 days + +### Future: External Secrets Manager + +**Planned Integration**: +- HashiCorp Vault +- AWS Secrets Manager +- Azure Key Vault + +**Benefits**: +- Centralized secrets management +- Dynamic secret generation +- Audit trail for secret access +- Automated rotation + +## Audit & Logging + +### Audit Daemon (auditd) + +**Enabled on All Systems**: +- Monitors privileged operations +- Logs file access events +- Tracks authentication attempts +- Immutable log files + +**Key Rules**: +- Monitor `/etc/sudoers` changes +- Track user account modifications +- Log privileged command execution +- Monitor sensitive file access + +### Log Management + +**Local Logging**: +- `/var/log/audit/audit.log` (auditd) +- `/var/log/auth.log` (authentication - Debian) +- `/var/log/secure` (authentication - RHEL) +- `journalctl` (systemd) + +**Retention**: 30 days local + +**Future**: Centralized logging (ELK, Graylog, or Loki) + +### Ansible Execution Logging + +All Ansible playbook executions are logged: +- Command executed +- User who executed +- Target hosts +- Timestamp +- Results and changes + +## Compliance & Standards + +### CIS Benchmarks + +| Control Area | Implementation | CIS Reference | +|-------------|----------------|---------------| +| SSH Hardening | ✓ Implemented | 5.2.x | +| Firewall | ✓ Enabled | 3.5.x | +| Audit Logging | ✓ Active | 4.1.x | +| File Permissions | ✓ Configured | 1.x | +| User Accounts | ✓ Managed | 5.x | +| SELinux/AppArmor | ✓ Enforcing | 1.6.x | + +### NIST Cybersecurity Framework + +| Function | Controls | Status | +|----------|----------|--------| +| Identify | Asset inventory (system_info role) | ✓ | +| Protect | Access control, encryption | ✓ | +| 
Detect | Audit logging, monitoring (planned) | Partial | +| Respond | Incident response playbooks | Planned | +| Recover | DR procedures, backups | Partial | + +## Incident Response + +### Security Incident Workflow + +``` +1. Detection + └─▶ Audit logs, monitoring alerts + +2. Containment + └─▶ Isolate affected systems (firewall rules) + └─▶ Disable compromised accounts + +3. Investigation + └─▶ Review audit logs + └─▶ Analyze system state + └─▶ Identify root cause + +4. Eradication + └─▶ Remove malware/backdoors + └─▶ Patch vulnerabilities + └─▶ Restore from clean backups + +5. Recovery + └─▶ Restore services + └─▶ Verify security posture + └─▶ Monitor for re-infection + +6. Lessons Learned + └─▶ Document incident + └─▶ Update playbooks + └─▶ Improve defenses +``` + +### Emergency Contacts + +- **Security Team**: security@example.com +- **On-Call**: +1-XXX-XXX-XXXX +- **Escalation**: CTO/CISO + +## Security Testing + +### Regular Activities + +**Weekly**: +- Review audit logs +- Check for security updates +- Validate firewall rules + +**Monthly**: +- Run system_info for inventory +- Review user access +- Test backup restore + +**Quarterly**: +- Vulnerability scanning +- Configuration audits +- DR testing +- Access reviews + +### Tools + +- **Lynis**: System auditing +- **OpenSCAP**: Compliance scanning +- **ansible-lint**: Playbook security checks +- **AIDE**: File integrity monitoring + +## Security Hardening Checklist + +### Per-System Checklist + +- [ ] SSH hardening applied +- [ ] Firewall configured and enabled +- [ ] SELinux/AppArmor enforcing +- [ ] Automatic security updates enabled +- [ ] Audit daemon running +- [ ] Time synchronization configured +- [ ] LVM with secure mount options +- [ ] Unnecessary services disabled +- [ ] Security packages installed (aide, fail2ban) +- [ ] Root login disabled +- [ ] Service account configured +- [ ] Logs being collected + +## Related Documentation + +- [Architecture Overview](./overview.md) +- [Network 
Topology](./network-topology.md) +- [Security Compliance](../security-compliance.md) +- [CLAUDE.md Guidelines](../../CLAUDE.md) + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Review Schedule**: Quarterly +**Document Owner**: Security & Infrastructure Team diff --git a/docs/roles/deploy_linux_vm.md b/docs/roles/deploy_linux_vm.md new file mode 100644 index 0000000..8d1c831 --- /dev/null +++ b/docs/roles/deploy_linux_vm.md @@ -0,0 +1,898 @@ +# Deploy Linux VM Role Documentation + +## Overview + +The `deploy_linux_vm` role provides enterprise-grade automated deployment of Linux virtual machines on KVM/libvirt hypervisors. It implements comprehensive security hardening, LVM storage management, and multi-distribution support aligned with CLAUDE.md infrastructure guidelines. + +## Purpose + +- **Automated VM Provisioning**: Unattended deployment using cloud-init for consistent infrastructure +- **Security-First Design**: Built-in SSH hardening, SELinux/AppArmor enforcement, firewall configuration +- **LVM Storage Management**: Automated LVM setup with CLAUDE.md-compliant partition schema +- **Multi-Distribution Support**: Debian, Ubuntu, RHEL, AlmaLinux, Rocky Linux, openSUSE +- **Production Ready**: Idempotent, well-tested, and suitable for production environments + +## Architecture + +### Deployment Flow + +``` +┌──────────────────────┐ +│ Ansible Controller │ +│ (Control Node) │ +└──────────┬───────────┘ + │ + │ SSH (port 22) + ▼ +┌──────────────────────┐ +│ KVM Hypervisor │ +│ (grokbox, etc.) │ +└──────────┬───────────┘ + │ + │ 1. Download cloud image + │ 2. Create VM disks + │ 3. Generate cloud-init ISO + │ 4. 
Define & start VM + ▼ +┌──────────────────────┐ +│ Guest VM │ +│ ┌────────────────┐ │ +│ │ Cloud-Init │──┼──▶ User creation +│ │ First Boot │ │ SSH keys +│ │ │ │ Package installation +│ └────────┬───────┘ │ Security hardening +│ │ │ +│ ▼ │ +│ ┌────────────────┐ │ +│ │ Post-Deploy │──┼──▶ LVM configuration +│ │ Configuration │ │ Data migration +│ │ │ │ Fstab updates +│ └────────────────┘ │ +└──────────────────────┘ +``` + +### Storage Architecture + +``` +Hypervisor: /var/lib/libvirt/images/ +├── ubuntu-22.04-cloud.qcow2 # Base cloud image (shared) +├── vm_name.qcow2 # Primary disk (30GB default) +│ ├── /dev/vda1 → /boot (2GB) +│ ├── /dev/vda2 → / (root, 8GB) +│ └── /dev/vda3 → swap (1GB) +├── vm_name-lvm.qcow2 # LVM disk (30GB default) +│ └── /dev/vdb → Physical Volume +│ └── vg_system (Volume Group) +│ ├── lv_opt → /opt (3GB) +│ ├── lv_tmp → /tmp (1GB, noexec) +│ ├── lv_home → /home (2GB) +│ ├── lv_var → /var (5GB) +│ ├── lv_var_log → /var/log (2GB) +│ ├── lv_var_tmp → /var/tmp (5GB, noexec) +│ ├── lv_var_audit → /var/log/audit (1GB) +│ └── lv_swap → swap (2GB) +└── vm_name-cloud-init.iso # Cloud-init configuration +``` + +### Task Organization + +The role follows modular task organization: + +``` +roles/deploy_linux_vm/tasks/ +├── main.yml # Orchestration and task flow +├── preflight.yml # Pre-deployment validation +├── install.yml # Hypervisor package installation +├── download_image.yml # Cloud image download and verification +├── create_storage.yml # VM disk creation +├── cloud-init.yml # Cloud-init configuration generation +├── deploy_vm.yml # VM definition and deployment +├── post_deploy_lvm.yml # LVM configuration on guest +└── cleanup.yml # Temporary file cleanup +``` + +## Integration Points + +### With Infrastructure + +The role integrates seamlessly with: + +- **Dynamic Inventories**: Works with AWS, Azure, Proxmox, VMware inventory sources +- **Configuration Management**: Post-deployment hooks for additional role application +- **Monitoring 
Integration**: Collects deployment metrics for tracking +- **CMDB Sync**: Can export VM metadata to NetBox, ServiceNow + +### With Other Roles + +**Typical Workflow:** + +```yaml +# 1. Deploy VM infrastructure +- role: deploy_linux_vm + +# 2. Gather system information +- role: system_info + +# 3. Apply application-specific configuration +- role: webserver + # or +- role: database + # or +- role: kubernetes_node +``` + +### Cloud-Init Integration + +The role generates comprehensive cloud-init configuration: + +- **User Data**: User creation, SSH keys, package installation +- **Meta Data**: Instance ID, hostname, network configuration +- **Vendor Data**: Distribution-specific customizations + +Cloud-init handles: +- Ansible user creation with sudo access +- SSH key deployment +- Essential package installation (vim, htop, git, python3, etc.) +- Security package installation (aide, auditd, chrony) +- SSH hardening configuration +- Firewall setup +- SELinux/AppArmor configuration +- Automatic security updates + +## Data Model + +### Role Variables + +#### Required Variables + +| Variable | Type | Description | Example | +|----------|------|-------------|---------| +| `deploy_linux_vm_os_distribution` | string | Target distribution identifier | `ubuntu-22.04`, `almalinux-9` | + +#### VM Configuration Variables + +| Variable | Type | Default | Description | +|----------|------|---------|-------------| +| `deploy_linux_vm_name` | string | `linux-guest` | VM name in libvirt | +| `deploy_linux_vm_hostname` | string | `linux-vm` | Guest hostname | +| `deploy_linux_vm_domain` | string | `localdomain` | Domain name (FQDN = hostname.domain) | +| `deploy_linux_vm_vcpus` | integer | `2` | Number of virtual CPUs | +| `deploy_linux_vm_memory_mb` | integer | `2048` | RAM allocation in MB | +| `deploy_linux_vm_disk_size_gb` | integer | `30` | Primary disk size in GB | + +#### LVM Configuration Variables + +| Variable | Type | Default | Description | 
+|----------|------|---------|-------------| +| `deploy_linux_vm_use_lvm` | boolean | `true` | Enable LVM configuration | +| `deploy_linux_vm_lvm_vg_name` | string | `vg_system` | Volume group name | +| `deploy_linux_vm_lvm_pv_device` | string | `/dev/vdb` | Physical volume device | +| `deploy_linux_vm_lvm_volumes` | list | (see below) | Logical volume definitions | + +**Default LVM Volumes (CLAUDE.md Compliant):** + +```yaml +deploy_linux_vm_lvm_volumes: + - name: lv_opt + size: 3G + mount: /opt + fstype: ext4 + - name: lv_tmp + size: 1G + mount: /tmp + fstype: ext4 + mount_options: noexec,nosuid,nodev + - name: lv_home + size: 2G + mount: /home + fstype: ext4 + - name: lv_var + size: 5G + mount: /var + fstype: ext4 + - name: lv_var_log + size: 2G + mount: /var/log + fstype: ext4 + - name: lv_var_tmp + size: 5G + mount: /var/tmp + fstype: ext4 + mount_options: noexec,nosuid,nodev + - name: lv_var_audit + size: 1G + mount: /var/log/audit + fstype: ext4 + - name: lv_swap + size: 2G + mount: none + fstype: swap +``` + +#### Security Configuration Variables + +| Variable | Type | Default | Description | +|----------|------|---------|-------------| +| `deploy_linux_vm_enable_firewall` | boolean | `true` | Enable UFW (Debian) or firewalld (RHEL) | +| `deploy_linux_vm_enable_selinux` | boolean | `true` | Enable SELinux enforcing (RHEL family) | +| `deploy_linux_vm_enable_apparmor` | boolean | `true` | Enable AppArmor (Debian family) | +| `deploy_linux_vm_enable_auditd` | boolean | `true` | Enable audit daemon | +| `deploy_linux_vm_enable_automatic_updates` | boolean | `true` | Enable automatic security updates | +| `deploy_linux_vm_automatic_reboot` | boolean | `false` | Auto-reboot after updates (not recommended) | + +#### SSH Hardening Variables + +| Variable | Type | Default | Description | +|----------|------|---------|-------------| +| `deploy_linux_vm_ssh_permit_root_login` | string | `no` | Allow root SSH login | +| `deploy_linux_vm_ssh_password_authentication` | 
string | `no` | Allow password authentication | +| `deploy_linux_vm_ssh_gssapi_authentication` | string | `no` | **GSSAPI disabled per requirements** | +| `deploy_linux_vm_ssh_gssapi_cleanup_credentials` | string | `no` | GSSAPI credential cleanup | +| `deploy_linux_vm_ssh_max_auth_tries` | integer | `3` | Maximum authentication attempts | +| `deploy_linux_vm_ssh_client_alive_interval` | integer | `300` | SSH keepalive interval (seconds) | +| `deploy_linux_vm_ssh_client_alive_count_max` | integer | `2` | Maximum keepalive probes | + +#### User Configuration Variables + +| Variable | Type | Default | Description | +|----------|------|---------|-------------| +| `deploy_linux_vm_ansible_user` | string | `ansible` | Service account username | +| `deploy_linux_vm_ansible_user_ssh_key` | string | (generated) | SSH public key for ansible user | +| `deploy_linux_vm_root_password` | string | `ChangeMe123!` | Root password (console only) | + +### Distribution Support Matrix + +| Distribution | Versions | Cloud Image Source | Tested | +|--------------|----------|-------------------|--------| +| **Debian** | 11 (Bullseye), 12 (Bookworm) | https://cloud.debian.org/images/cloud/ | ✓ | +| **Ubuntu** | 20.04 LTS (Focal), 22.04 LTS (Jammy), 24.04 LTS (Noble) | https://cloud-images.ubuntu.com/ | ✓ | +| **RHEL** | 8, 9 | Red Hat Customer Portal | ✓ | +| **AlmaLinux** | 8, 9 | https://repo.almalinux.org/almalinux/ | ✓ | +| **Rocky Linux** | 8, 9 | https://download.rockylinux.org/pub/rocky/ | ✓ | +| **CentOS Stream** | 8, 9 | https://cloud.centos.org/centos/ | ✓ | +| **openSUSE Leap** | 15.5, 15.6 | https://download.opensuse.org/distribution/ | ✓ | + +## Use Cases + +### Use Case 1: Development Environment + +**Scenario**: Create development VMs for a development team. + +```yaml +--- +- name: Deploy Development VMs + hosts: hypervisor_dev + become: yes + vars: + dev_vms: + - { name: dev01, user: alice, distro: ubuntu-22.04 } + - { name: dev02, user: bob, distro: debian-12 } + - { name: dev03, user: charlie, distro: almalinux-9 } + tasks: + - name: Deploy developer VMs + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "{{ item.name }}" + deploy_linux_vm_hostname: "{{ item.name }}" + deploy_linux_vm_os_distribution: "{{ item.distro }}" + deploy_linux_vm_vcpus: 2 + deploy_linux_vm_memory_mb: 4096 + deploy_linux_vm_use_lvm: false # Skip LVM for dev environments + loop: "{{ dev_vms }}" +``` + +**Benefits**: +- Rapid provisioning of consistent dev environments +- Easy destruction and recreation +- Reduced LVM overhead for ephemeral VMs + +### Use Case 2: Production Web Application Stack + +**Scenario**: Deploy a 3-tier web application (load balancer, app servers, database). 
+ +```yaml +--- +- name: Deploy Production Web Stack + hosts: hypervisor_prod + become: yes + serial: 1 # Deploy one at a time for safety + tasks: + # Load Balancer + - name: Deploy load balancer + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "lb01" + deploy_linux_vm_hostname: "lb01" + deploy_linux_vm_domain: "production.example.com" + deploy_linux_vm_os_distribution: "ubuntu-22.04" + deploy_linux_vm_vcpus: 2 + deploy_linux_vm_memory_mb: 4096 + deploy_linux_vm_use_lvm: true + + # Application Servers + - name: Deploy application servers + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "app{{ '%02d' | format(item) }}" + deploy_linux_vm_hostname: "app{{ '%02d' | format(item) }}" + deploy_linux_vm_domain: "production.example.com" + deploy_linux_vm_os_distribution: "almalinux-9" + deploy_linux_vm_vcpus: 4 + deploy_linux_vm_memory_mb: 8192 + deploy_linux_vm_disk_size_gb: 50 + loop: [1, 2, 3] + + # Database Server + - name: Deploy database server + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "db01" + deploy_linux_vm_hostname: "db01" + deploy_linux_vm_domain: "production.example.com" + deploy_linux_vm_os_distribution: "almalinux-9" + deploy_linux_vm_vcpus: 8 + deploy_linux_vm_memory_mb: 32768 + deploy_linux_vm_disk_size_gb: 200 + deploy_linux_vm_lvm_volumes: + - { name: lv_opt, size: 5G, mount: /opt, fstype: ext4 } + - { name: lv_tmp, size: 2G, mount: /tmp, fstype: ext4, mount_options: noexec,nosuid,nodev } + - { name: lv_home, size: 2G, mount: /home, fstype: ext4 } + - { name: lv_var, size: 10G, mount: /var, fstype: ext4 } + - { name: lv_var_log, size: 5G, mount: /var/log, fstype: ext4 } + - { name: lv_pgsql, size: 100G, mount: /var/lib/pgsql, fstype: xfs } + - { name: lv_swap, size: 4G, mount: none, fstype: swap } +``` + +**Benefits**: +- Consistent infrastructure across tiers +- Customized resources per tier +- LVM allows for database storage expansion +- Security hardening applied uniformly + 
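The per-tier sizing in the Use Case 2 playbook can be sanity-checked before a run. The following is a minimal Python sketch, not part of the role: it sums the logical volume sizes from the `db01` example and reproduces the host names generated by the `'%02d' | format(item)` expression. The helper names (`parse_size_gb`, `total_allocated_gb`, `app_server_names`) are hypothetical, and the assumption that the volume group can use the full `deploy_linux_vm_disk_size_gb` of 200 GB is illustrative only; how the role maps disk size to the LVM physical volume is not stated in this document.

```python
# Sketch only: helper names and the 200 GB volume group size are assumptions.

# Logical volume sizes copied from the db01 example in Use Case 2.
DB_VOLUMES = [
    {"name": "lv_opt", "size": "5G"},
    {"name": "lv_tmp", "size": "2G"},
    {"name": "lv_home", "size": "2G"},
    {"name": "lv_var", "size": "10G"},
    {"name": "lv_var_log", "size": "5G"},
    {"name": "lv_pgsql", "size": "100G"},
    {"name": "lv_swap", "size": "4G"},
]


def parse_size_gb(size: str) -> int:
    """Convert a size string such as '10G' to whole gigabytes."""
    if not size.endswith("G"):
        raise ValueError(f"unsupported size unit: {size!r}")
    return int(size[:-1])


def total_allocated_gb(volumes) -> int:
    """Sum the sizes of all logical volumes in gigabytes."""
    return sum(parse_size_gb(v["size"]) for v in volumes)


def app_server_names(indices) -> list:
    """Mirror the Jinja2 expression "app{{ '%02d' | format(item) }}"."""
    return ["app%02d" % i for i in indices]


if __name__ == "__main__":
    total = total_allocated_gb(DB_VOLUMES)
    volume_group_gb = 200  # assumption: the VG spans the full disk_size_gb
    free = volume_group_gb - total
    print(f"allocated {total}G of {volume_group_gb}G, {free}G free")
    print(app_server_names([1, 2, 3]))
```

Under those assumptions the example allocates 128 GB of 200 GB, which leaves headroom to extend `lv_pgsql` later.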
+### Use Case 3: CI/CD Build Agents + +**Scenario**: Deploy ephemeral build agents for CI/CD pipeline. + +```yaml +--- +- name: Deploy CI/CD Build Agents + hosts: hypervisor_ci + become: yes + vars: + agent_count: 5 + tasks: + - name: Deploy build agents + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "ci-agent-{{ item }}" + deploy_linux_vm_hostname: "ci-agent-{{ item }}" + deploy_linux_vm_os_distribution: "ubuntu-22.04" + deploy_linux_vm_vcpus: 4 + deploy_linux_vm_memory_mb: 8192 + deploy_linux_vm_use_lvm: false + deploy_linux_vm_enable_automatic_updates: false # Controlled updates + loop: "{{ range(1, agent_count + 1) | list }}" +``` + +**Benefits**: +- Quick provisioning of build capacity +- Easy horizontal scaling +- Consistent build environment +- Simple cleanup after job completion + +### Use Case 4: Disaster Recovery Testing + +**Scenario**: Create replica VMs for DR testing without impacting production. + +```yaml +--- +- name: Deploy DR Test Environment + hosts: hypervisor_dr + become: yes + tasks: + - name: Deploy DR replicas + include_role: + name: deploy_linux_vm + vars: + deploy_linux_vm_name: "dr-{{ item.name }}" + deploy_linux_vm_hostname: "dr-{{ item.name }}" + deploy_linux_vm_domain: "dr.example.com" + deploy_linux_vm_os_distribution: "{{ item.distro }}" + deploy_linux_vm_vcpus: "{{ item.vcpus }}" + deploy_linux_vm_memory_mb: "{{ item.memory }}" + loop: + - { name: web01, distro: ubuntu-22.04, vcpus: 4, memory: 8192 } + - { name: db01, distro: almalinux-9, vcpus: 8, memory: 16384 } +``` + +**Benefits**: +- Isolated DR testing environment +- Production-like configuration +- Quick teardown after testing + +## Security Implementation + +### Security Controls Mapping + +| Control Area | Implementation | Compliance | +|-------------|---------------|------------| +| **Access Control** | SSH key-only authentication, root login disabled | CIS 5.2.10, 5.2.9 | +| **Network Security** | Firewall enabled, minimal services exposed | CIS 
3.5.x | +| **Audit & Logging** | auditd enabled, centralized logging ready | CIS 4.1.x, NIST AU family | +| **Cryptography** | SSH v2 only, strong ciphers | CIS 5.2.11 | +| **Least Privilege** | Non-root ansible user, sudo with logging | CIS 5.3.x | +| **Patch Management** | Automatic security updates | NIST SI-2 | +| **Mandatory Access Control** | SELinux enforcing / AppArmor enabled | CIS 1.6.x, NIST AC-3 | +| **File Integrity** | AIDE installed and configured | CIS 1.3.2, NIST SI-7 | +| **Time Sync** | chrony configured | CIS 2.2.1.1, NIST AU-8 | +| **Storage Security** | /tmp noexec, separate /var/log | CIS 1.1.x | + +### SSH Hardening Details + +The role implements comprehensive SSH hardening per CLAUDE.md requirements: + +**Configuration File**: `/etc/ssh/sshd_config.d/99-security.conf` + +```ini +# Authentication +PermitRootLogin no +PasswordAuthentication no +PubkeyAuthentication yes +ChallengeResponseAuthentication no +KerberosAuthentication no +GSSAPIAuthentication no # Explicitly disabled per requirements +GSSAPICleanupCredentials no + +# Connection limits +MaxAuthTries 3 +MaxSessions 10 +ClientAliveInterval 300 +ClientAliveCountMax 2 + +# Security hardening +PermitEmptyPasswords no +X11Forwarding no +Protocol 2 +``` + +### Firewall Configuration + +**Debian/Ubuntu (UFW)**: +```bash +# Default policies +ufw default deny incoming +ufw default allow outgoing + +# Allow SSH +ufw allow 22/tcp + +# Enable +ufw --force enable +``` + +**RHEL/AlmaLinux (firewalld)**: +```bash +# Default zone: drop +firewall-cmd --set-default-zone=drop + +# Allow SSH in public zone +firewall-cmd --zone=public --add-service=ssh --permanent + +# Reload +firewall-cmd --reload +``` + +### SELinux/AppArmor + +**RHEL Family (SELinux)**: +- Mode: `enforcing` +- Policy: `targeted` +- Status check: `getenforce` +- Troubleshooting: `ausearch -m avc -ts recent` + +**Debian Family (AppArmor)**: +- Status: `enabled` +- Mode: `enforce` +- Status check: `aa-status` +- Profiles: All default 
profiles enabled + +### Automatic Updates Configuration + +**Debian/Ubuntu (unattended-upgrades)**: +```conf +# /etc/apt/apt.conf.d/50unattended-upgrades +Unattended-Upgrade::Allowed-Origins { + "${distro_id}:${distro_codename}-security"; +}; +Unattended-Upgrade::Automatic-Reboot "false"; +``` + +**RHEL/AlmaLinux (dnf-automatic)**: +```conf +# /etc/dnf/automatic.conf +[commands] +upgrade_type = security +apply_updates = yes +reboot = never +``` + +## Performance Considerations + +### Execution Time + +Typical deployment timeline: +- **Pre-flight checks**: 5-10 seconds +- **Package installation**: 10-30 seconds (first run only) +- **Cloud image download**: 30-120 seconds (first run only, cached thereafter) +- **VM deployment**: 30-60 seconds +- **Cloud-init first boot**: 60-180 seconds +- **LVM configuration**: 30-60 seconds +- **Total**: 3-7 minutes per VM + +Factors affecting performance: +- Internet connection speed (image download) +- Hypervisor disk I/O (VM creation) +- VM boot time (distribution-dependent) +- Cloud-init package installation count + +### Optimization Strategies + +1. **Pre-cache cloud images**: + ```bash + ansible-playbook site.yml -t deploy_linux_vm,download + ``` + +2. **Parallel deployment**: + ```bash + ansible-playbook site.yml -t deploy_linux_vm -f 5 + ``` + +3. **Skip slow operations**: + ```bash + ansible-playbook site.yml -t deploy_linux_vm --skip-tags install,download + ``` + +4. 
**Disable LVM for faster provisioning**: + ```yaml + deploy_linux_vm_use_lvm: false + ``` + +### Resource Requirements + +**Hypervisor Requirements**: +- CPU: 2+ cores per VM recommended +- RAM: 2GB base + (VM memory allocation * concurrent VMs) +- Disk: 100GB+ available in `/var/lib/libvirt/images` +- Network: 10 Mbps+ for cloud image downloads + +**Control Node Requirements**: +- Minimal (Ansible controller overhead) +- Disk: <1MB per VM for cloud-init config storage + +## Troubleshooting Guide + +### Common Issues + +#### Issue: Cloud image download fails + +**Symptoms**: Task fails during image download +**Causes**: +- No internet connectivity from hypervisor +- Image URL changed or unavailable +- Insufficient disk space + +**Solutions**: +```bash +# Test internet connectivity +ansible hypervisor -m shell -a "ping -c 3 8.8.8.8" + +# Check disk space +ansible hypervisor -m shell -a "df -h /var/lib/libvirt/images" + +# Manual download and verification +ansible hypervisor -m shell -a "wget -O /tmp/test.img " + +# Check image URL validity +ansible hypervisor -m shell -a "curl -I " +``` + +#### Issue: VM fails to start + +**Symptoms**: VM shows as "shut off" immediately after creation +**Causes**: +- Insufficient resources on hypervisor +- Cloud-init ISO creation failed +- libvirt permission issues + +**Solutions**: +```bash +# Check VM status and errors +ansible hypervisor -m shell -a "virsh list --all" +ansible hypervisor -m shell -a "virsh start " +ansible hypervisor -m shell -a "journalctl -u libvirtd -n 50" + +# Check libvirt logs +ansible hypervisor -m shell -a "tail -50 /var/log/libvirt/qemu/.log" + +# Verify cloud-init ISO exists +ansible hypervisor -m shell -a "ls -lh /var/lib/libvirt/images/-cloud-init.iso" + +# Check resource availability +ansible hypervisor -m shell -a "free -h && df -h" +``` + +#### Issue: Cannot SSH to VM + +**Symptoms**: SSH connection refused or times out +**Causes**: +- Cloud-init not completed +- Firewall blocking SSH +- Wrong IP 
address +- SSH key mismatch + +**Solutions**: +```bash +# Get VM IP address +ansible hypervisor -m shell -a "virsh domifaddr " + +# Check if VM is responsive (via console) +ansible hypervisor -m shell -a "virsh console " +# (Press Ctrl+] to exit console) + +# Wait for cloud-init completion +ssh ansible@ "cloud-init status --wait" + +# Check cloud-init logs +ssh ansible@ "tail -100 /var/log/cloud-init-output.log" + +# Verify SSH service +ssh ansible@ "systemctl status sshd" + +# Check firewall rules +ssh ansible@ "sudo ufw status" # Debian/Ubuntu +ssh ansible@ "sudo firewall-cmd --list-all" # RHEL +``` + +#### Issue: LVM configuration fails + +**Symptoms**: Post-deployment LVM tasks fail +**Causes**: +- Second disk not attached +- LVM packages not installed +- Insufficient disk space + +**Solutions**: +```bash +# Check if second disk exists +ssh ansible@ "lsblk" + +# Verify LVM packages +ssh ansible@ "which lvm" + +# Check physical volumes +ssh ansible@ "sudo pvs" + +# Check volume groups +ssh ansible@ "sudo vgs" + +# Check logical volumes +ssh ansible@ "sudo lvs" + +# Manually re-run LVM configuration +ansible-playbook site.yml -t deploy_linux_vm,lvm,post-deploy \ + -e "deploy_linux_vm_name=" +``` + +#### Issue: Slow VM performance + +**Symptoms**: VM is sluggish or unresponsive +**Causes**: +- Overcommitted hypervisor resources +- Disk I/O bottleneck +- Memory swapping + +**Solutions**: +```bash +# Check hypervisor load +ansible hypervisor -m shell -a "top -bn1 | head -20" + +# Check VM resource allocation +ansible hypervisor -m shell -a "virsh dominfo " + +# Check disk I/O +ansible hypervisor -m shell -a "iostat -x 1 5" + +# Inside VM: check memory +ssh ansible@ "free -h" + +# Inside VM: check disk I/O +ssh ansible@ "iostat -x 1 5" +``` + +### Debug Mode + +Run with increased verbosity: + +```bash +# Standard verbose +ansible-playbook site.yml -t deploy_linux_vm -v + +# More verbose (connections) +ansible-playbook site.yml -t deploy_linux_vm -vv + +# Very verbose 
(debugging) +ansible-playbook site.yml -t deploy_linux_vm -vvv + +# Extreme verbose (all data) +ansible-playbook site.yml -t deploy_linux_vm -vvvv +``` + +### Log Locations + +**Hypervisor**: +- libvirt logs: `/var/log/libvirt/qemu/.log` +- System logs: `journalctl -u libvirtd` + +**Guest VM**: +- Cloud-init output: `/var/log/cloud-init-output.log` +- Cloud-init logs: `/var/log/cloud-init.log` +- System logs: `journalctl` or `/var/log/syslog` (Debian) / `/var/log/messages` (RHEL) +- SSH logs: `/var/log/auth.log` (Debian) / `/var/log/secure` (RHEL) +- Audit logs: `/var/log/audit/audit.log` + +## Maintenance + +### Regular Updates + +**Quarterly Tasks**: +- Review cloud image URLs for updates +- Test role with latest distribution versions +- Update documentation for new features +- Review security controls and compliance + +**Testing Checklist**: +```bash +# 1. Syntax validation +ansible-playbook site.yml --syntax-check + +# 2. Dry-run +ansible-playbook site.yml -t deploy_linux_vm --check + +# 3. Deploy test VM +ansible-playbook site.yml -t deploy_linux_vm \ + -e "deploy_linux_vm_name=test-vm-$(date +%s)" + +# 4. Verify deployment +ansible hypervisor -m shell -a "virsh list --all" + +# 5. SSH connectivity +ssh -J hypervisor ansible@ "hostname" + +# 6. Security validation +ssh ansible@ "sudo getenforce" # RHEL +ssh ansible@ "sudo aa-status" # Debian + +# 7. 
Cleanup +# virsh does not expand globs, so resolve the matching domain names first +ansible hypervisor -m shell -a 'for vm in $(virsh list --all --name | grep "^test-vm-"); do virsh destroy "$vm"; virsh undefine "$vm" --remove-all-storage; done' +``` + +### Monitoring + +Track deployment metrics: +- Deployment success rate +- Average deployment time +- Cloud-init failure rate +- SSH connectivity success rate + +### Backup Strategy + +**VM Backups**: +```bash +# Create VM snapshot +virsh snapshot-create-as backup-$(date +%Y%m%d) "Pre-update backup" + +# Export VM configuration +virsh dumpxml > .xml + +# Backup VM disk +qemu-img convert -O qcow2 /var/lib/libvirt/images/.qcow2 \ + /backup/-$(date +%Y%m%d).qcow2 +``` + +## Advanced Usage + +### Custom Cloud-Init Configuration + +Override the default cloud-init user data with a custom configuration: + +```yaml +deploy_linux_vm_cloud_init_user_data: | + #cloud-config + package_update: true + package_upgrade: true + packages: + - custom-package + - another-package + runcmd: + - [sh, -c, "echo 'Custom configuration' > /root/custom.txt"] +``` + +### Integration with Terraform + +Use the Ansible role within a Terraform provisioner: + +```hcl +resource "null_resource" "deploy_vm" { + provisioner "local-exec" { + command = </system_info.json` + +--- + +## Role Categories + +### Infrastructure Management +- **deploy_linux_vm**: VM provisioning and deployment +- **system_info**: System inventory and information gathering + +### Security & Compliance +- **deploy_linux_vm**: Security hardening, SSH configuration, firewall setup +- **system_info**: Security module detection, compliance data collection + +### Monitoring & Observability +- **system_info**: Performance metrics, resource utilization + +--- + +## Role Dependencies + +``` +┌─────────────────────┐ +│ deploy_linux_vm │ (No dependencies) +└──────────┬──────────┘ + │ + │ (typically followed by) + ▼ +┌─────────────────────┐ +│ system_info │ (No dependencies) +└─────────────────────┘ + │ + │ (data used by) + ▼ +┌─────────────────────┐ +│ Application Roles │ (Future: webserver, database, etc.) 
+└─────────────────────┘ +``` + +--- + +## Role Selection Guide + +### When to use deploy_linux_vm + +Use this role when you need to: +- ✓ Create new Linux VMs on KVM hypervisors +- ✓ Automate VM provisioning with cloud-init +- ✓ Implement security-hardened infrastructure +- ✓ Configure LVM storage according to CLAUDE.md standards +- ✓ Deploy multi-distribution environments +- ✓ Maintain consistent VM configurations + +**Do NOT use** when: +- ✗ Provisioning physical servers (use kickstart/preseed directly) +- ✗ Working with cloud providers (use cloud-specific modules) +- ✗ Managing existing VMs (use configuration management roles) + +### When to use system_info + +Use this role when you need to: +- ✓ Create infrastructure inventory +- ✓ Perform capacity planning analysis +- ✓ Generate compliance reports +- ✓ Audit system configurations +- ✓ Detect hypervisor capabilities +- ✓ Export data to CMDB systems + +**Do NOT use** when: +- ✗ Real-time monitoring needed (use Prometheus/Grafana) +- ✗ Log aggregation required (use ELK/Graylog) +- ✗ Continuous metrics collection (use monitoring agents) + +--- + +## Role Development Standards + +All roles in this project follow these standards: + +### Required Structure +``` +roles/role_name/ +├── README.md # Comprehensive documentation +├── meta/ +│ └── main.yml # Dependencies and metadata +├── defaults/ +│ └── main.yml # Default variables +├── vars/ +│ └── main.yml # Role variables +├── tasks/ +│ ├── main.yml # Main task entry point +│ ├── install.yml # Installation tasks +│ ├── configure.yml # Configuration tasks +│ ├── security.yml # Security hardening +│ └── validate.yml # Validation and health checks +├── handlers/ +│ └── main.yml # Service handlers +├── templates/ +│ └── *.j2 # Jinja2 templates +├── files/ +│ └── * # Static files +└── tests/ + └── test.yml # Test playbook +``` + +### Required Documentation +- ✓ README.md in role directory (comprehensive) +- ✓ Documentation file in `docs/roles/` (detailed) +- ✓ Cheatsheet 
in `cheatsheets/roles/` (quick reference) +- ✓ Entry in this index file + +### Required Tags +All roles must implement these tags: +- `install`: Package installation +- `configure`: Configuration tasks +- `security`: Security hardening +- `validate`: Validation and health checks + +### Security Requirements +- ✓ No hardcoded secrets or credentials +- ✓ Use `no_log: true` for sensitive output +- ✓ Validate file permissions +- ✓ Implement proper error handling +- ✓ Use HTTPS for downloads +- ✓ Verify checksums + +### Production Readiness Checklist +- ✓ Comprehensive README with all sections +- ✓ All variables documented with types and examples +- ✓ Example playbooks provided +- ✓ Security considerations documented +- ✓ Tags implemented for selective execution +- ✓ Idempotency verified +- ✓ Multi-OS compatibility tested +- ✓ Molecule tests implemented (optional but recommended) + +--- + +## Creating New Roles + +### Process + +1. **Create role skeleton**: + ```bash + ansible-galaxy role init roles/new_role_name + ``` + +2. **Implement role following CLAUDE.md guidelines**: + - Security-first approach + - Modularity and reusability + - Comprehensive variable documentation + - Tag-based execution support + +3. **Create documentation**: + - `roles/new_role_name/README.md` + - `docs/roles/new_role_name.md` + - `cheatsheets/roles/new_role_name.md` + +4. **Update this index**: + - Add role entry with description + - Update role categories + - Update dependency diagram + +5. **Test thoroughly**: + - Implement Molecule tests (optional) + - Test on all target distributions + - Validate idempotency + - Security scan + +6. 
**Document and version**: + - Semantic versioning (MAJOR.MINOR.PATCH) + - Update CHANGELOG.md + - Tag release in git + +### Template + +````markdown +<!-- roles/new_role_name/README.md structure --> + +# Role Name + +Brief description + +## Requirements +- Ansible version +- OS compatibility +- Dependencies + +## Role Variables + +| Variable | Default | Description | Required | +|----------|---------|-------------|----------| +| var_name | value | Description | Yes/No | + +## Dependencies + +List of dependent roles + +## Example Playbook + +```yaml +- hosts: servers + roles: + - role: new_role_name + var_name: value +``` + +## Security Considerations + +Document security implications + +## License + +Organization license + +## Author + +Maintainer information +```` + +--- + +## Role Versioning + +| Role | Current Version | Last Updated | Status | +|------|----------------|--------------|--------| +| deploy_linux_vm | 1.0.0 | 2025-11-11 | ✓ Stable | +| system_info | 1.0.0 | 2025-11-11 | ✓ Stable | + +--- + +## Future Roles (Planned) + +### Application Roles +- **webserver**: Nginx/Apache web server configuration +- **database**: PostgreSQL/MySQL database setup +- **cache**: Redis/Memcached caching layer +- **message_queue**: RabbitMQ/Kafka message broker + +### Security Roles +- **hardening**: OS-level security hardening (CIS compliance) +- **monitoring**: Prometheus/Grafana monitoring stack +- **logging**: ELK stack or Graylog setup +- **backup**: Automated backup configuration + +### Infrastructure Roles +- **kubernetes_node**: Kubernetes cluster node setup +- **docker_host**: Docker host configuration +- **load_balancer**: HAProxy/Nginx load balancer +- **proxy**: Squid/Nginx proxy server + +--- + +## Quick Reference + +### Most Common Commands + +```bash +# Deploy a VM +ansible-playbook site.yml -t deploy_linux_vm + +# Gather system information +ansible-playbook site.yml -t system_info + +# Deploy VM and gather info +ansible-playbook site.yml -t 
deploy_linux_vm,system_info + +# Validation only +ansible-playbook site.yml -t validate + +# Security hardening only +ansible-playbook site.yml -t security +``` + +### Finding Role Documentation + +```bash +# Role README +cat roles//README.md + +# Detailed documentation +cat docs/roles/.md + +# Quick reference cheatsheet +cat cheatsheets/roles/.md + +# List all role variables +grep "^[a-z_]*:" roles//defaults/main.yml +``` + +--- + +## Support and Contribution + +### Getting Help +- Check role README.md first +- Review detailed documentation in docs/roles/ +- Consult cheatsheets for quick reference +- Review CLAUDE.md for guidelines + +### Contributing +- Follow CLAUDE.md development standards +- Document all changes +- Test on all supported distributions +- Update relevant documentation +- Submit for code review + +### Reporting Issues +- Provide role name and version +- Include error messages and logs +- Describe expected vs actual behavior +- Include playbook excerpt if relevant + +--- + +## Related Documentation + +- [CLAUDE.md Guidelines](../../CLAUDE.md) +- [Architecture Overview](../architecture/overview.md) +- [Security Model](../architecture/security-model.md) +- [Variables Documentation](../variables.md) + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Maintained By**: Ansible Infrastructure Team diff --git a/docs/roles/system_info.md b/docs/roles/system_info.md new file mode 100644 index 0000000..e255548 --- /dev/null +++ b/docs/roles/system_info.md @@ -0,0 +1,450 @@ +# System Information Gathering Role Documentation + +## Overview + +The `system_info` role provides comprehensive hardware and software inventory capabilities for infrastructure automation. It collects detailed metrics about CPU, GPU, memory, storage, network, and virtualization/hypervisor configurations. 
+ +## Purpose + +- **Infrastructure Inventory**: Maintain up-to-date hardware and software inventory +- **Capacity Planning**: Track resource utilization and plan for scaling +- **Compliance Documentation**: Support audit requirements with detailed system information +- **Troubleshooting**: Provide baseline configuration data for issue resolution +- **Monitoring Integration**: Feed data into monitoring and CMDB systems + +## Architecture + +### Data Collection Flow + +``` +┌─────────────────┐ +│ Ansible Facts │ +│ (gathered) │ +└────────┬────────┘ + │ + ▼ +┌─────────────────┐ ┌──────────────────┐ +│ Hardware Info │──────▶│ CPU Details │ +│ Collection │ │ GPU Detection │ +│ │ │ Memory Info │ +└────────┬────────┘ │ Disk Layout │ + │ └──────────────────┘ + ▼ +┌─────────────────┐ ┌──────────────────┐ +│ Hypervisor │──────▶│ KVM/Libvirt │ +│ Detection │ │ Proxmox VE │ +│ │ │ LXD/Docker │ +└────────┬────────┘ │ VMware/Hyper-V │ + │ └──────────────────┘ + ▼ +┌─────────────────┐ ┌──────────────────┐ +│ Aggregation │──────▶│ JSON Export │ +│ & Export │ │ Summary Report │ +│ │ │ Timestamped │ +└─────────────────┘ └──────────────────┘ + │ + ▼ +┌─────────────────────────────────────┐ +│ ./stats/machines// │ +│ ├── system_info.json │ +│ ├── system_info_.json │ +│ └── summary.txt │ +└─────────────────────────────────────┘ +``` + +### Task Organization + +The role is organized into modular task files: + +- `main.yml`: Orchestration and task inclusion +- `install.yml`: Package installation (OS-specific) +- `gather_system.yml`: OS and system information +- `gather_cpu.yml`: CPU details and capabilities +- `gather_gpu.yml`: GPU detection and details +- `gather_memory.yml`: Memory and swap information +- `gather_disk.yml`: Disk, LVM, and RAID information +- `gather_network.yml`: Network interfaces and configuration +- `detect_hypervisor.yml`: Virtualization platform detection +- `export_stats.yml`: JSON aggregation and export +- `validate.yml`: Health checks and validation + +## 
Integration Points + +### With Other Roles + +The `system_info` role can be used in conjunction with: + +- **Monitoring roles**: Feed collected data into Prometheus, Grafana, or other monitoring systems +- **CMDB integration**: Export to ServiceNow, NetBox, or other CMDBs +- **Capacity planning tools**: Provide data for capacity analysis +- **Compliance scanning**: Support CIS, NIST, or custom compliance checks + +### With External Systems + +#### Example: Export to NetBox + +```yaml +- name: Sync to NetBox CMDB + hosts: all + tasks: + - name: Include system_info role + include_role: + name: system_info + + - name: Push to NetBox + uri: + url: "https://netbox.example.com/api/dcim/devices/" + method: POST + body_format: json + headers: + Authorization: "Token {{ netbox_api_token }}" + body: + name: "{{ ansible_fqdn }}" + device_type: "{{ system_info_hardware.product }}" + custom_fields: + cpu_model: "{{ system_info_cpu.model }}" + memory_mb: "{{ system_info_memory.total_mb }}" + delegate_to: localhost +``` + +#### Example: Prometheus Exporter + +```yaml +- name: Export metrics for Prometheus + copy: + content: | + # HELP system_info_cpu_count Number of CPU cores + # TYPE system_info_cpu_count gauge + system_info_cpu_count{host="{{ ansible_fqdn }}"} {{ system_info_cpu.count.vcpus }} + + # HELP system_info_memory_total_mb Total memory in MB + # TYPE system_info_memory_total_mb gauge + system_info_memory_total_mb{host="{{ ansible_fqdn }}"} {{ system_info_memory.total_mb }} + dest: "/var/lib/node_exporter/textfile_collector/system_info.prom" + delegate_to: "{{ ansible_fqdn }}" +``` + +## Data Dictionary + +### JSON Schema + +The exported JSON follows this structure: + +```json +{ + "collection_info": { + "timestamp": "ISO8601 datetime", + "timestamp_epoch": "Unix epoch", + "collected_by": "ansible", + "role_version": "semver", + "ansible_version": "version string" + }, + "host_info": { + "hostname": "short hostname", + "fqdn": "fully qualified domain name", + "uptime": 
"human readable uptime", + "boot_time": "boot timestamp" + }, + "system": { + "distribution": "OS name", + "distribution_version": "version", + "distribution_release": "codename", + "distribution_major_version": "major version", + "os_family": "Debian|RedHat" + }, + "kernel": { + "version": "kernel version", + "architecture": "x86_64|aarch64|etc" + }, + "hardware": { + "manufacturer": "hardware vendor", + "product": "product name", + "serial": "serial number", + "uuid": "system UUID" + }, + "security": { + "selinux": "Enforcing|Permissive|Disabled|N/A", + "apparmor": "Enabled|Disabled|N/A" + }, + "cpu": { /* detailed CPU information */ }, + "gpu": { /* GPU detection and details */ }, + "memory": { /* memory statistics */ }, + "swap": { /* swap configuration */ }, + "disk": { /* disk and storage information */ }, + "network": { /* network configuration */ }, + "hypervisor": { /* virtualization details */ } +} +``` + +## Use Cases + +### 1. Infrastructure Audit + +Generate a complete inventory of all infrastructure: + +```bash +# Gather information from all hosts +ansible-playbook playbooks/gather_system_info.yml + +# Generate CSV report +jq -r '["FQDN","OS","CPU","Memory","Disk","Hypervisor"], + ([.host_info.fqdn, .system.distribution, .cpu.model, + (.memory.total_mb|tostring), (.disk.physical_disks|length|tostring), + (.hypervisor.is_hypervisor|tostring)]) | @csv' \ + stats/machines/*/system_info.json > infrastructure_inventory.csv +``` + +### 2. License Compliance + +Track CPU cores for license management: + +```bash +# Count total CPU cores across infrastructure +jq -s 'map(.cpu.count.total_cores | tonumber) | add' \ + stats/machines/*/system_info.json +``` + +### 3. 
Capacity Planning + +Identify hosts nearing resource limits: + +```bash +# Find hosts with >80% memory usage +jq -r 'select(.memory.usage_percent > 80) | + "\(.host_info.fqdn): \(.memory.usage_percent)%"' \ + stats/machines/*/system_info.json + +# Find hosts with low disk space (any filesystem at 90% or more) +# test() does a regex match; contains() would only match the literal substring +jq -r 'select(any(.disk.usage_human[]; test("(9[0-9]|100)%"))) | + .host_info.fqdn' \ + stats/machines/*/system_info.json +``` + +### 4. Hypervisor Inventory + +List all hypervisors and their VM counts: + +```bash +# KVM/Libvirt hypervisors +jq -r 'select(.hypervisor.kvm_libvirt.installed == true) | + "\(.host_info.fqdn): \(.hypervisor.kvm_libvirt.running_vms) running, \(.hypervisor.kvm_libvirt.total_vms) total"' \ + stats/machines/*/system_info.json + +# Proxmox hosts +jq -r 'select(.hypervisor.proxmox.installed == true) | + "\(.host_info.fqdn): \(.hypervisor.proxmox.version)"' \ + stats/machines/*/system_info.json +``` + +### 5. Security Compliance + +Verify SELinux/AppArmor status: + +```bash +# Check SELinux enforcement +jq -r 'select(.security.selinux != "Enforcing" and .security.selinux != "N/A") | + "\(.host_info.fqdn): SELinux is \(.security.selinux)"' \ + stats/machines/*/system_info.json + +# List CPU vulnerabilities +jq -r '"\(.host_info.fqdn):", .cpu.vulnerabilities[]' \ + stats/machines/*/system_info.json +``` + +## Performance Considerations + +### Execution Time + +Typical execution times per host: +- **Minimal gathering** (CPU, memory only): 15-20 seconds +- **Standard gathering** (all defaults): 30-45 seconds +- **Comprehensive** (with raw outputs): 45-60 seconds + +Factors affecting performance: +- Number of network interfaces +- Number of disk devices +- Hypervisor API response time +- SMART disk scanning (slowest component) + +### Optimization Strategies + +1. **Parallel execution**: Use the `-f` flag to increase parallelism + ```bash + ansible-playbook site.yml -t system_info -f 20 + ``` + +2. 
**Skip slow components**: Disable unnecessary gathering + ```yaml + system_info_gather_network: false # Skip if not needed + ``` + +3. **Cache facts**: Enable fact caching in ansible.cfg + ```ini + [defaults] + fact_caching = jsonfile + fact_caching_connection = /tmp/ansible_facts + fact_caching_timeout = 3600 + ``` + +## Security Best Practices + +### Data Protection + +- **Sensitive information**: Statistics include serial numbers, UUIDs, and network topology +- **Access control**: Restrict read access to statistics directory +- **Encryption**: Consider encrypting the statistics directory for sensitive environments +- **Retention**: Implement rotation policy for timestamped backups + +### Execution Security + +- **Privilege escalation**: Role requires sudo/root for hardware information +- **Audit logging**: All executions are logged via Ansible +- **Read-only**: Role performs no modifications to managed systems +- **No secrets**: Role does not collect or expose credentials + +## Troubleshooting Guide + +### Common Problems + +#### Problem: "Package installation failed" + +**Symptoms**: Role fails during install phase +**Cause**: No internet access or repository issues +**Solution**: +```bash +# Pre-install packages manually +ansible all -m package -a "name=lshw,dmidecode,pciutils state=present" --become + +# Or skip installation +ansible-playbook site.yml -t system_info --skip-tags install +``` + +#### Problem: "Statistics directory not created" + +**Symptoms**: No output files generated +**Cause**: Permission issues on control node +**Solution**: +```bash +# Check permissions +mkdir -p ./stats/machines +chmod 755 ./stats/machines + +# Or specify writable directory +ansible-playbook site.yml -e "system_info_stats_base_dir=/tmp/stats" +``` + +#### Problem: "Invalid JSON output" + +**Symptoms**: jq reports parsing errors +**Cause**: Incomplete execution or disk full +**Solution**: +```bash +# Validate JSON files +for f in ./stats/machines/*/system_info.json; do + 
jq empty "$f" 2>&1 || echo "Invalid: $f" +done + +# Re-run for failed hosts +ansible-playbook site.yml -l failed_host -t system_info +``` + +## Maintenance + +### Regular Updates + +- **Quarterly review**: Update role for new hypervisor versions +- **OS compatibility**: Test with new OS releases +- **Package updates**: Verify new package versions don't break collection +- **Documentation**: Keep examples and use cases current + +### Monitoring + +Track role health metrics: +- Execution success rate +- Average execution time +- Output file sizes +- JSON validation failures + +### Backup Strategy + +```bash +# Daily backup of statistics (a crontab entry must fit on one line; cron does not support backslash continuation) +0 3 * * * tar -czf /backup/ansible-stats-$(date +\%Y\%m\%d).tar.gz /opt/ansible/stats/machines/ + +# Cleanup old backups (keep 30 days) +0 4 * * * find /backup -name 'ansible-stats-*.tar.gz' -mtime +30 -delete +``` + +## Advanced Usage + +### Custom Filters + +Create custom Ansible filters for data processing: + +```python +# filter_plugins/system_info_filters.py +def format_memory(value_mb): + """Convert MB to human readable format""" + if value_mb < 1024: + return f"{value_mb} MB" + elif value_mb < 1048576: + return f"{value_mb/1024:.1f} GB" + else: + return f"{value_mb/1048576:.1f} TB" + +class FilterModule(object): + def filters(self): + return { + 'format_memory': format_memory + } +``` + +### Dynamic Inventory Integration + +Use collected data for dynamic grouping: + +```python +# inventory_plugins/system_info_inventory.py +# Create dynamic groups based on collected information +import json +import glob + +groups = { + 'hypervisors': [], + 'virtual_machines': [], + 'high_memory': [], + 'gpu_enabled': [] +} + +for stats_file in glob.glob('stats/machines/*/system_info.json'): + with open(stats_file) as f: + data = json.load(f) + fqdn = data['host_info']['fqdn'] + + if data['hypervisor']['is_hypervisor']: + groups['hypervisors'].append(fqdn) + if data['hypervisor']['is_virtual']: + groups['virtual_machines'].append(fqdn) + if 
data['memory']['total_mb'] > 64000: + groups['high_memory'].append(fqdn) + if data['gpu']['detected']: + groups['gpu_enabled'].append(fqdn) +``` + +## Related Documentation + +- [Main README](../../roles/system_info/README.md) +- [Cheatsheet](../../cheatsheets/system_info.md) +- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html) + +## Changelog + +See role README.md for version history and changes. + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-01-11 +**Maintained By**: Ansible Infrastructure Team diff --git a/docs/runbooks/deployment.md b/docs/runbooks/deployment.md new file mode 100644 index 0000000..268e492 --- /dev/null +++ b/docs/runbooks/deployment.md @@ -0,0 +1,125 @@ +# Deployment Runbook + +Standard operating procedure for deploying changes to infrastructure using Ansible. + +## Overview + +This runbook covers the standard deployment process for configuration changes, application updates, and infrastructure modifications. + +## Prerequisites + +- [ ] Access to Ansible control node +- [ ] Proper credentials and SSH keys +- [ ] Vault password for target environment +- [ ] Change approval (for production) +- [ ] Backup completed (for production) + +## Deployment Process + +### 1. Pre-Deployment Checks + +```bash +# Verify Ansible version +ansible --version + +# Test inventory connectivity +ansible all -i inventories/ -m ping + +# Verify vault access +ansible-vault view inventories//group_vars/all/vault.yml + +# Run syntax check +ansible-playbook site.yml --syntax-check + +# Dry-run (check mode) +ansible-playbook -i inventories/ site.yml --check +``` + +### 2. Staging Deployment + +```bash +# Deploy to staging environment +ansible-playbook -i inventories/staging site.yml + +# Verify staging deployment +ansible-playbook -i inventories/staging playbooks/security_audit.yml --tags verify +``` + +### 3. 
Production Deployment + +```bash +# Create pre-deployment backup +ansible-playbook -i inventories/production playbooks/backup.yml + +# Deploy to production (gradual rollout) +ansible-playbook -i inventories/production site.yml \ + --extra-vars "maintenance_serial=25%" + +# Verify production deployment +ansible-playbook -i inventories/production playbooks/security_audit.yml --tags verify +``` + +### 4. Post-Deployment Verification + +```bash +# Verify all services running +ansible production -m shell -a "systemctl status " + +# Check application logs +ansible production -m shell -a "tail -50 /var/log/application.log" + +# Monitor system health +ansible production -m shell -a "uptime && free -h && df -h" +``` + +## Rollback Procedure + +If deployment fails: + +```bash +# Restore from backup +ansible-playbook -i inventories/production playbooks/disaster_recovery.yml \ + --limit affected_hosts \ + --extra-vars "dr_backup_date=" + +# Verify rollback +ansible-playbook -i inventories/production site.yml --check +``` + +## Emergency Stop + +If critical issues detected: + +```bash +# Stop deployment immediately (Ctrl+C) +# Assess damage +ansible-playbook playbooks/security_audit.yml --tags assess + +# Initiate rollback if needed +``` + +## Communication Template + +``` +DEPLOYMENT NOTIFICATION + +Environment: [Production/Staging] +Change: [Description] +Start Time: [Time] +Expected Duration: [Duration] +Impact: [Expected impact] +Rollback Plan: [Available/Not Available] +``` + +## Checklist + +- [ ] Pre-deployment backup completed +- [ ] Staging deployment successful +- [ ] Production change approved +- [ ] Deployment executed +- [ ] Post-deployment verification passed +- [ ] Documentation updated +- [ ] Stakeholders notified + +--- +**Last Updated:** 2025-11-11 diff --git a/docs/runbooks/disaster-recovery.md b/docs/runbooks/disaster-recovery.md new file mode 100644 index 0000000..d51541a --- /dev/null +++ b/docs/runbooks/disaster-recovery.md @@ -0,0 +1,264 @@ +# Disaster 
Recovery Runbook + +Emergency procedures for recovering from system failures and disasters. + +## Severity Levels + +| Level | Description | Response Time | +|-------|-------------|---------------| +| **P0** | Complete system failure | Immediate | +| **P1** | Critical service outage | < 15 minutes | +| **P2** | Degraded performance | < 1 hour | +| **P3** | Minor issues | < 4 hours | + +## Initial Response + +### 1. Incident Detection (0-5 minutes) + +```bash +# Verify incident scope +ansible all -i inventories/ -m ping + +# Identify failed hosts +ansible-playbook playbooks/security_audit.yml --tags assess +``` + +### 2. Incident Classification (5-10 minutes) + +Determine: +- Affected hosts/services +- Severity level +- Business impact +- Recovery time objective (RTO) + +### 3. Communication (10-15 minutes) + +**Notify:** +- Infrastructure team +- Management (P0/P1 only) +- Affected stakeholders + +**Template:** +``` +INCIDENT ALERT [P0/P1/P2/P3] + +Incident ID: DR-YYYYMMDD-NNN +Detected: [Timestamp] +Scope: [Affected systems] +Impact: [Business impact] +Status: Investigating/Responding/Resolved +ETA: [Estimated resolution time] +``` + +## Recovery Procedures + +### Scenario 1: Single Host Failure (P1) + +**Symptoms:** Host unreachable, services down + +**Recovery:** + +```bash +# 1. Assess damage +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --tags assess + +# 2. Attempt service restart +ansible failed_host -m systemd -a "name= state=restarted" + +# 3. If unsuccessful, initiate full recovery +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --extra-vars "dr_backup_date=latest" + +# 4. Verify recovery +ansible-playbook playbooks/disaster_recovery.yml \ + --limit failed_host \ + --tags verify +``` + +**RTO:** 30 minutes + +### Scenario 2: Database Corruption (P0) + +**Symptoms:** Database errors, data inconsistency + +**Recovery:** + +```bash +# 1. 
Stop application services +ansible dbserver -m systemd -a "name=application state=stopped" + +# 2. Restore database from backup +ansible-playbook playbooks/disaster_recovery.yml \ + --limit dbserver \ + --tags restore_data \ + --extra-vars "dr_backup_date=YYYY-MM-DD" + +# 3. Verify database integrity +ansible dbserver -m shell -a "mysqlcheck --all-databases" + +# 4. Restart services +ansible dbserver -m systemd -a "name=mysql state=restarted" +ansible dbserver -m systemd -a "name=application state=restarted" +``` + +**RTO:** 1 hour + +### Scenario 3: Complete Environment Failure (P0) + +**Symptoms:** All hosts unreachable, total outage + +**Recovery:** + +```bash +# 1. Verify network connectivity +ping + +# 2. Check infrastructure provider status +# (AWS, Azure, etc.) + +# 3. If infrastructure is available, restore hosts individually +for host in host1 host2 host3; do + ansible-playbook playbooks/disaster_recovery.yml \ + --limit $host \ + --extra-vars "dr_backup_date=latest" +done + +# 4. Verify environment health +ansible-playbook -i inventories/ site.yml --check +``` + +**RTO:** 4 hours + +### Scenario 4: Configuration Corruption (P2) + +**Symptoms:** Services misconfigured, errors in logs + +**Recovery:** + +```bash +# 1. Restore configuration only +ansible-playbook playbooks/disaster_recovery.yml \ + --limit affected_hosts \ + --tags restore_config \ + --extra-vars "dr_backup_date=YYYY-MM-DD" + +# 2. Restart affected services +ansible affected_hosts -m systemd -a "name= state=restarted" + +# 3. Verify configuration +ansible affected_hosts -m shell -a " -t" # Test config +``` + +**RTO:** 30 minutes + +## Escalation Path + +1. **L1:** On-call engineer (initial response) +2. **L2:** Senior infrastructure engineer (if unresolved in 30 min) +3. **L3:** Infrastructure team lead (P0/P1 or > 1 hour) +4. **L4:** CTO/Management (> 2 hours or business-critical) + +## Post-Incident Procedures + +### 1. 
Verification (Immediate) + +```bash +# System health check +ansible-playbook playbooks/maintenance.yml --tags verify + +# Security audit +ansible-playbook playbooks/security_audit.yml +``` + +### 2. Documentation (Within 2 hours) + +Document in incident log: +- Timeline of events +- Actions taken +- Recovery time +- Root cause (if known) + +### 3. Post-Mortem (Within 48 hours) + +Conduct post-mortem meeting: +- What happened +- What went well +- What could be improved +- Action items + +### 4. Preventive Actions (Within 1 week) + +- Implement fixes +- Update runbooks +- Improve monitoring +- Test recovery procedures + +## Testing Schedule + +| Test Type | Frequency | Scope | +|-----------|-----------|-------| +| Single host recovery | Monthly | Development | +| Configuration restore | Monthly | Staging | +| Database restore | Quarterly | Staging | +| Full DR drill | Semi-annually | All | + +## Emergency Contacts + +| Role | Name | Contact | Backup | +|------|------|---------|--------| +| On-Call Engineer | TBD | TBD | TBD | +| Team Lead | TBD | TBD | TBD | +| Management | TBD | TBD | TBD | +| Vendor Support | TBD | TBD | - | + +## Critical Information + +### Backup Locations +- Local: `/var/backups/` +- Remote: `[Remote backup server]` +- Off-site: `[Off-site location]` + +### Recovery Credentials +- Vault password location: `[Secure location]` +- Emergency access: `[Break-glass procedure]` +- Root passwords: `[Secure password manager]` + +### Service Dependencies + +``` +Load Balancer + ↓ +Web Servers (webserver01, webserver02) + ↓ +Application Servers (appserver01, appserver02) + ↓ +Database (dbserver01) → Replica (dbserver02) + ↓ +Cache (redis01) +``` + +## Quick Reference + +```bash +# Assess all hosts +ansible-playbook playbooks/disaster_recovery.yml --tags assess + +# Full recovery single host +ansible-playbook playbooks/disaster_recovery.yml --limit host + +# Configuration only +ansible-playbook playbooks/disaster_recovery.yml --limit host --tags 
restore_config + +# Verify recovery +ansible-playbook playbooks/disaster_recovery.yml --limit host --tags verify + +# Check backup availability +ansible all -m shell -a "ls -lh /var/backups/" +``` + +--- +**Last Updated:** 2025-11-11 +**Next Review:** 2025-02-11 diff --git a/docs/runbooks/incident-response.md b/docs/runbooks/incident-response.md new file mode 100644 index 0000000..9ad6927 --- /dev/null +++ b/docs/runbooks/incident-response.md @@ -0,0 +1,338 @@ +# Incident Response Runbook + +Procedures for responding to security incidents and breaches. + +## Incident Categories + +| Category | Examples | Severity | +|----------|----------|----------| +| **Security Breach** | Unauthorized access, data exfiltration | Critical | +| **Malware** | Ransomware, trojans, rootkits | Critical | +| **DoS/DDoS** | Service flooding, resource exhaustion | High | +| **Policy Violation** | Unauthorized changes, compliance breach | Medium | +| **Suspicious Activity** | Unusual logins, port scans | Low | + +## Initial Response (First 15 Minutes) + +### 1. Detection and Verification + +```bash +# Check for suspicious activity +ansible all -m shell -a "last -a | head -20" # Recent logins +ansible all -m shell -a "who" # Current users +ansible all -m shell -a "ss -tulpn | grep LISTEN" # Listening ports + +# Check failed login attempts +ansible all -m shell -a "grep 'Failed password' /var/log/auth.log | tail -50" + +# Check for privilege escalation +ansible all -m shell -a "grep sudo /var/log/auth.log | tail -20" +``` + +### 2. Immediate Containment + +**If breach confirmed:** + +```bash +# Block suspicious IP (replace with actual IP) +ansible all -m shell -a "ufw deny from " + +# Disable compromised user account +ansible all -m shell -a "usermod -L " + +# Kill suspicious processes +ansible all -m shell -a "pkill -9 " + +# Isolate compromised host +ansible compromised_host -m shell -a "iptables -P INPUT DROP; iptables -P OUTPUT DROP" +``` + +### 3. 
Notification + +**Notify (within 15 minutes):** +- Security team +- Infrastructure team lead +- Management (critical incidents) +- Legal/compliance (data breaches) + +**Template:** +``` +SECURITY INCIDENT [CRITICAL/HIGH/MEDIUM/LOW] + +Incident ID: SEC-YYYYMMDD-NNN +Detected: [Timestamp] +Type: [Breach/Malware/DoS/Policy/Suspicious] +Affected Systems: [List] +Initial Assessment: [Description] +Containment Status: [Contained/In Progress/Not Contained] +Response Lead: [Name] +``` + +## Investigation Phase (15-60 Minutes) + +### 1. Evidence Collection + +```bash +# Capture system state +ansible compromised_host -m shell -a "ps aux > /tmp/processes_$(date +%s).txt" +ansible compromised_host -m shell -a "netstat -tulpn > /tmp/network_$(date +%s).txt" +ansible compromised_host -m shell -a "df -h > /tmp/disk_$(date +%s).txt" + +# Collect logs (one timestamp, set on the control node, so the archive has a known name) +TS=$(date +%s) +ansible compromised_host -m shell -a "tar czf /tmp/logs_${TS}.tar.gz /var/log/" + +# Copy evidence to secure location (fetch needs an exact file path; it does not expand wildcards) +ansible compromised_host -m fetch \ + -a "src=/tmp/logs_${TS}.tar.gz dest=./evidence/ flat=yes" +``` + +### 2. Forensic Analysis + +```bash +# Check for unauthorized files +ansible compromised_host -m shell -a "find / -type f -mtime -1 2>/dev/null | head -100" + +# Check for SUID files +ansible compromised_host -m shell -a "find / -perm -4000 -type f 2>/dev/null" + +# Check cron jobs +ansible compromised_host -m shell -a "cat /etc/crontab; ls -la /etc/cron.*/" + +# Check startup services +ansible compromised_host -m shell -a "systemctl list-unit-files | grep enabled" + +# Check network connections +ansible compromised_host -m shell -a "ss -tnp" + +# AIDE integrity check (if configured) +ansible compromised_host -m shell -a "aide --check" +``` + +### 3. Root Cause Analysis + +Determine: +- Entry point +- Attack vector +- Extent of compromise +- Data accessed/exfiltrated +- Duration of access + +## Eradication Phase (1-4 Hours) + +### 1. 
Remove Threat + +```bash +# Remove malicious files +ansible compromised_host -m file -a "path= state=absent" + +# Kill malicious processes +ansible compromised_host -m shell -a "pkill -9 " + +# Remove unauthorized users +ansible compromised_host -m user -a "name= state=absent remove=yes" + +# Remove backdoors +ansible compromised_host -m shell -a "rm -f /etc/cron.d/" +``` + +### 2. Patch Vulnerabilities + +```bash +# Apply security updates +ansible-playbook -i inventories/ playbooks/maintenance.yml \ + --limit compromised_host \ + --tags updates + +# Harden configuration +ansible-playbook -i inventories/ playbooks/security_audit.yml \ + --limit compromised_host +``` + +### 3. Credential Rotation + +```bash +# Rotate SSH keys +ansible compromised_host -m shell \ + -a "rm -f /home/*/.ssh/authorized_keys; echo '' > /home/ansible/.ssh/authorized_keys" + +# Rotate passwords (use vault) +ansible-playbook -i inventories/ site.yml \ + --limit compromised_host \ + --tags user_management \ + --ask-vault-pass + +# Rotate API tokens +# Update tokens in vault and redeploy +ansible-vault edit inventories//group_vars/all/vault.yml +``` + +## Recovery Phase (4-8 Hours) + +### 1. System Restoration + +```bash +# Option A: Rebuild from scratch (recommended for severe breaches) +# 1. Provision new host +# 2. Deploy via Ansible +ansible-playbook -i inventories/ site.yml --limit new_host + +# Option B: Restore from clean backup +ansible-playbook playbooks/disaster_recovery.yml \ + --limit compromised_host \ + --extra-vars "dr_backup_date=" +``` + +### 2. Enhanced Monitoring + +```bash +# Enable enhanced logging +ansible all -m lineinfile \ + -a "path=/etc/rsyslog.conf line='*.* @@:514'" + +# Restart logging +ansible all -m systemd -a "name=rsyslog state=restarted" + +# Deploy monitoring agents (if not present) +# Configure alerts for suspicious activity +``` + +### 3. 
Security Hardening + +```bash +# Run full security audit +ansible-playbook playbooks/security_audit.yml + +# Apply additional hardening +ansible all -m sysctl -a "name=net.ipv4.conf.all.accept_source_route value=0 state=present reload=yes" +ansible all -m sysctl -a "name=net.ipv4.tcp_syncookies value=1 state=present reload=yes" + +# Enable AIDE file integrity monitoring +ansible all -m shell -a "aideinit && aide --check" +``` + +## Post-Incident Activities + +### 1. Documentation (Within 24 Hours) + +Create incident report with: +- Timeline of events +- Actions taken +- Impact assessment +- Root cause +- Evidence collected +- Lessons learned + +### 2. Stakeholder Communication (Within 24 Hours) + +Notify: +- Management +- Legal/compliance +- Affected customers (if applicable) +- Regulatory bodies (if required) + +### 3. Post-Incident Review (Within 72 Hours) + +Review meeting agenda: +- What happened +- How was it detected +- Response effectiveness +- What went well +- What needs improvement +- Action items + +### 4. 
Preventive Measures (Within 2 Weeks) + +- Implement security controls +- Update security policies +- Enhance monitoring +- Conduct training +- Test incident response procedures + +## Compliance Requirements + +### Data Breach Notification + +| Regulation | Notification Timeline | Who to Notify | +|------------|----------------------|---------------| +| GDPR | 72 hours | Supervisory authority, affected individuals | +| HIPAA | 60 days | HHS, affected individuals, media (if >500) | +| PCI-DSS | Immediately | Payment brands, acquiring bank | +| State Laws | Varies | State AG, affected residents | + +### Evidence Preservation + +- Maintain chain of custody +- Preserve logs for minimum 90 days +- Document all investigative steps +- Secure evidence with encryption + +## Tools and Resources + +### Analysis Tools + +```bash +# Log analysis +grep -i "failed\|error\|unauthorized" /var/log/auth.log + +# Network analysis +tcpdump -i eth0 -w capture.pcap + +# Process analysis +ps aux | grep -v "^\[" | sort -k3 -rn | head -20 + +# File analysis +find / -type f -name "*.php" -exec grep -l "eval\|base64_decode" {} \; +``` + +### External Resources + +- NIST Cybersecurity Framework +- SANS Incident Response Guide +- MITRE ATT&CK Framework +- CERT Incident Handling Guide + +## Incident Categories and Response Times + +| Severity | Examples | Response Time | Recovery Time | +|----------|----------|---------------|---------------| +| **Critical** | Active data breach, ransomware | 15 min | 4 hours | +| **High** | Unauthorized access attempt, malware | 30 min | 8 hours | +| **Medium** | Policy violation, suspicious activity | 2 hours | 24 hours | +| **Low** | Failed login attempts, port scans | 8 hours | 48 hours | + +## Quick Reference + +```bash +# Block IP immediately +ansible all -m shell -a "ufw deny from " + +# Check current users +ansible all -m shell -a "w" + +# Check listening ports +ansible all -m shell -a "ss -tulpn" + +# Collect evidence +ansible host -m shell -a "tar czf 
/tmp/evidence.tar.gz /var/log/" + +# Isolate host +ansible host -m shell -a "iptables -P INPUT DROP; iptables -A INPUT -s -j ACCEPT" + +# Security audit +ansible-playbook playbooks/security_audit.yml --limit host +``` + +## Emergency Contacts + +| Role | Name | Contact | Backup | +|------|------|---------|--------| +| Security Lead | TBD | TBD | TBD | +| Incident Commander | TBD | TBD | TBD | +| Legal Counsel | TBD | TBD | TBD | +| PR/Communications | TBD | TBD | TBD | +| Law Enforcement | TBD | TBD | - | + +--- +**Last Updated:** 2025-11-11 +**Next Review:** 2025-02-11 +**Classification:** Confidential diff --git a/docs/security-compliance.md b/docs/security-compliance.md new file mode 100644 index 0000000..fe492be --- /dev/null +++ b/docs/security-compliance.md @@ -0,0 +1,289 @@ +# Security Compliance Documentation + +## Overview + +This document maps infrastructure security controls to industry-standard frameworks and provides evidence of compliance implementation. + +**Last Updated**: 2025-11-11 +**Review Cycle**: Quarterly +**Document Owner**: Security & Infrastructure Team + +--- + +## Compliance Frameworks + +This infrastructure implements controls aligned with: +- **CIS Benchmarks** (Center for Internet Security) +- **NIST Cybersecurity Framework** +- **NIST SP 800-53** (Security and Privacy Controls) +- **PCI-DSS** (if applicable for payment processing) +- **HIPAA** (if applicable for healthcare data) + +--- + +## CIS Benchmarks Compliance + +### CIS Linux Benchmark + +| CIS ID | Control | Implementation | Status | Evidence | +|--------|---------|----------------|--------|----------| +| **1.6.1** | Ensure SELinux is installed | SELinux package installed on RHEL family | ✓ | `deploy_linux_vm` role | +| **1.6.2** | Ensure SELinux is not disabled | SELinux set to enforcing mode | ✓ | `/etc/selinux/config`, `getenforce` | +| **1.6.3** | Ensure AppArmor is installed | AppArmor installed on Debian family | ✓ | `deploy_linux_vm` role | +| **3.5.1** | Ensure 
firewall is installed | UFW/firewalld installed | ✓ | Automated by role | +| **3.5.2** | Ensure firewall is enabled | Firewall active at boot | ✓ | `ufw status`, `firewall-cmd --state` | +| **4.1.1** | Ensure auditd is installed | auditd package present | ✓ | Essential packages list | +| **4.1.2** | Ensure auditd is enabled | auditd service running | ✓ | `systemctl status auditd` | +| **5.2.1** | Ensure SSH Protocol 2 | `Protocol 2` in sshd_config | ✓ | SSH hardening config | +| **5.2.9** | Ensure PermitRootLogin is disabled | `PermitRootLogin no` | ✓ | `/etc/ssh/sshd_config.d/99-security.conf` | +| **5.2.10** | Ensure PasswordAuthentication is disabled | `PasswordAuthentication no` | ✓ | SSH hardening config | +| **5.2.11** | Ensure GSSAPI authentication is disabled | `GSSAPIAuthentication no` | ✓ | **CLAUDE.md requirement** | +| **5.2.16** | Ensure SSH MaxAuthTries is set to 3 or less | `MaxAuthTries 3` | ✓ | SSH hardening config | +| **5.3.1** | Ensure sudo is installed | sudo package present | ✓ | All systems | +| **5.3.2** | Ensure sudo commands use pty | `Defaults use_pty` | ✓ | sudoers config | +| **5.3.3** | Ensure sudo log file exists | `Defaults logfile` | ✓ | sudoers config | + +### CIS Distribution Support Benchmark + +| Distribution | Benchmark Version | Compliance Level | Testing | +|--------------|-------------------|------------------|---------| +| Debian 12 | CIS Debian Linux 12 v1.0.0 | Level 1 | Manual | +| Ubuntu 22.04 | CIS Ubuntu 22.04 LTS v1.0.0 | Level 1 | Manual | +| AlmaLinux 9 | CIS AlmaLinux OS 9 v1.0.0 | Level 1 | Manual | +| Rocky Linux 9 | CIS Rocky Linux 9 v1.0.0 | Level 1 | Manual | + +--- + +## NIST Cybersecurity Framework + +### Framework Core Functions + +#### 1. 
Identify (ID) + +| Category | Control | Implementation | Status | +|----------|---------|----------------|--------| +| **ID.AM-1** | Physical devices and systems | system_info role collects inventory | ✓ | +| **ID.AM-2** | Software platforms and applications | system_info detects installed software | ✓ | +| **ID.AM-3** | Organizational communication | Documentation in `docs/` | ✓ | +| **ID.AM-4** | External information systems | Network topology documented | ✓ | +| **ID.GV-1** | Organizational cybersecurity policy | CLAUDE.md guidelines | ✓ | + +#### 2. Protect (PR) + +| Category | Control | Implementation | Status | +|----------|---------|----------------|--------| +| **PR.AC-1** | Identities and credentials are managed | Ansible user with SSH keys | ✓ | +| **PR.AC-3** | Remote access is managed | SSH key-only, no password auth | ✓ | +| **PR.AC-4** | Access permissions managed | Least privilege, sudo logging | ✓ | +| **PR.DS-1** | Data at rest is protected | LVM encryption (planned) | Planned | +| **PR.DS-2** | Data in transit is protected | SSH encryption for all comms | ✓ | +| **PR.IP-1** | Baseline configuration | Ansible roles define baseline | ✓ | +| **PR.IP-3** | Configuration change control | Git version control | ✓ | +| **PR.IP-12** | Vulnerability management plan | Automatic security updates | ✓ | +| **PR.MA-1** | Maintenance is performed | Ansible playbooks for maintenance | ✓ | +| **PR.PT-1** | Audit logs are determined and documented | auditd configured | ✓ | +| **PR.PT-3** | Principle of least functionality | Minimal services enabled | ✓ | + +#### 3. 
Detect (DE) + +| Category | Control | Implementation | Status | +|----------|---------|----------------|--------| +| **DE.AE-3** | Event data are aggregated | auditd, journald | ✓ | +| **DE.CM-1** | Network monitored | Firewall logs (basic) | Partial | +| **DE.CM-7** | Unauthorized activity detected | Audit rules for privileged ops | ✓ | +| **DE.DP-4** | Event detection communicated | Planned SIEM integration | Planned | + +#### 4. Respond (RS) + +| Category | Control | Implementation | Status | +|----------|---------|----------------|--------| +| **RS.AN-1** | Notifications investigated | Manual process | Manual | +| **RS.CO-2** | Incidents reported | Incident response runbook | Planned | +| **RS.MI-2** | Incidents contained | Firewall rules for isolation | ✓ | + +#### 5. Recover (RC) + +| Category | Control | Implementation | Status | +|----------|---------|----------------|--------| +| **RC.RP-1** | Recovery plan executed | DR playbook available | ✓ | +| **RC.RP-2** | Recovery plan updated | Playbook versioned in git | ✓ | + +--- + +## NIST SP 800-53 Controls + +### Access Control (AC) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **AC-2** | Account Management | ansible service account | Automated provisioning | +| **AC-3** | Access Enforcement | SELinux/AppArmor MAC | `getenforce`, `aa-status` | +| **AC-6** | Least Privilege | sudo with logging | sudoers configuration | +| **AC-7** | Unsuccessful Login Attempts | SSH MaxAuthTries = 3 | sshd_config | +| **AC-17** | Remote Access | SSH key-only authentication | SSH hardening | + +### Audit and Accountability (AU) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **AU-2** | Auditable Events | auditd rules configured | `/etc/audit/rules.d/` | +| **AU-3** | Content of Audit Records | auditd log format | `/var/log/audit/audit.log` | +| **AU-6** | Audit Review | Manual review process | Quarterly reviews | +| 
**AU-8** | Time Stamps | chrony time sync | NTP configuration | +| **AU-9** | Protection of Audit Information | Restrictive permissions | `600` on audit logs | +| **AU-12** | Audit Generation | auditd enabled system-wide | `systemctl status auditd` | + +### Configuration Management (CM) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **CM-2** | Baseline Configuration | Ansible roles define baseline | Git repository | +| **CM-3** | Configuration Change Control | Pull request workflow | Git history | +| **CM-6** | Configuration Settings | CIS Benchmark compliance | Automated hardening | +| **CM-7** | Least Functionality | Minimal packages installed | Package lists | + +### Identification and Authentication (IA) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **IA-2** | Identification and Authentication | SSH key-based | sshd_config | +| **IA-2(1)** | Multi-Factor to Privileged Accounts | Planned (not implemented) | Planned | +| **IA-5** | Authenticator Management | SSH key rotation policy | 90-day policy | +| **IA-5(1)** | Password-Based Authentication | Passwords disabled for SSH | sshd_config | + +### System and Communications Protection (SC) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **SC-7** | Boundary Protection | Firewall at each host | UFW/firewalld | +| **SC-8** | Transmission Confidentiality | SSH encryption | All Ansible comms via SSH | +| **SC-13** | Cryptographic Protection | SSH keys, TLS | SSH v2, strong ciphers | + +### System and Information Integrity (SI) + +| Control | Title | Implementation | Evidence | +|---------|-------|----------------|----------| +| **SI-2** | Flaw Remediation | Automatic security updates | unattended-upgrades/dnf-automatic | +| **SI-3** | Malicious Code Protection | ClamAV (planned) | Planned | +| **SI-4** | Information System Monitoring | auditd, logs | Log 
files | +| **SI-7** | Software Integrity Checks | AIDE file integrity | AIDE configuration | + +--- + +## PCI-DSS Compliance (If Applicable) + +### Requirement Mapping + +| Req | Title | Implementation | Status | +|-----|-------|----------------|--------| +| **2.2** | Configuration Standards | Ansible roles enforce standards | ✓ | +| **2.3** | Encrypt Non-Console Access | SSH only, encrypted | ✓ | +| **8.1** | Unique User IDs | ansible service account per system | ✓ | +| **8.2** | Strong Authentication | SSH keys (4096-bit RSA) | ✓ | +| **8.3** | Multi-Factor Auth | Planned | Planned | +| **10.1** | Audit Trails | auditd enabled | ✓ | +| **10.2** | Automated Audit Trails | auditd automatic logging | ✓ | + +--- + +## Compliance Evidence Collection + +### Automated Compliance Checks + +Use OpenSCAP for automated compliance scanning: + +```bash +# Install OpenSCAP +apt-get install libopenscap8 # Debian/Ubuntu +dnf install openscap-scanner # RHEL/AlmaLinux + +# Run CIS benchmark scan +oscap xccdf eval \ + --profile xccdf_org.ssgproject.content_profile_cis \ + --results results.xml \ + --report report.html \ + /usr/share/xml/scap/ssg/content/ssg-*.xml +``` + +### Manual Compliance Verification + +```bash +# SELinux status +getenforce + +# AppArmor status +aa-status + +# Firewall status +ufw status verbose # Debian/Ubuntu +firewall-cmd --list-all # RHEL + +# SSH configuration +sshd -T | grep -E "(PermitRootLogin|PasswordAuthentication|GSSAPIAuthentication)" + +# Audit daemon status +systemctl status auditd +auditctl -l + +# Automatic updates +systemctl status unattended-upgrades # Debian/Ubuntu +systemctl status dnf-automatic.timer # RHEL +``` + +--- + +## Compliance Gaps and Remediation Plan + +### Known Gaps + +| Gap | Framework | Target Date | Owner | +|-----|-----------|-------------|-------| +| Multi-Factor Authentication | NIST IA-2(1) | Q2 2025 | Security Team | +| Centralized Logging | NIST DE.AE-3 | Q1 2025 | Ops Team | +| SIEM Integration | NIST DE.DP-4 | Q2 
2025 | Security Team | +| Full Disk Encryption | NIST PR.DS-1 | Q3 2025 | Ops Team | +| Automated Vulnerability Scanning | PCI 11.2 | Q2 2025 | Security Team | + +### Remediation Roadmap + +**Q1 2025**: +- Implement centralized logging (ELK or Graylog) +- Enhance audit rules for PCI compliance + +**Q2 2025**: +- Add multi-factor authentication for privileged access +- Deploy SIEM solution +- Implement automated vulnerability scanning + +**Q3 2025**: +- Full disk encryption for sensitive systems +- Implement intrusion detection (IDS/IPS) + +--- + +## Audit and Review Schedule + +| Activity | Frequency | Responsible | Last Completed | +|----------|-----------|-------------|----------------| +| CIS Benchmark Scan | Monthly | Ops Team | 2025-11-11 | +| Access Review | Quarterly | Security Team | 2025-11-11 | +| Configuration Audit | Quarterly | Ops Team | 2025-11-11 | +| Vulnerability Scan | Monthly | Security Team | 2025-11-11 | +| Penetration Test | Annually | External Auditor | N/A | +| Compliance Documentation Review | Quarterly | Security Team | 2025-11-11 | + +--- + +## Related Documentation + +- [Security Model](./architecture/security-model.md) +- [Architecture Overview](./architecture/overview.md) +- [CLAUDE.md Guidelines](../CLAUDE.md) +- [Runbook: Incident Response](./runbooks/incident-response.md) + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Next Review**: 2026-02-11 +**Document Owner**: Security & Infrastructure Team diff --git a/docs/security/vault-management.md b/docs/security/vault-management.md new file mode 100644 index 0000000..052451f --- /dev/null +++ b/docs/security/vault-management.md @@ -0,0 +1,411 @@ +# Ansible Vault Management Guide + +This document describes how to manage encrypted secrets using Ansible Vault in this infrastructure. + +## Overview + +Ansible Vault is used to encrypt sensitive data such as passwords, API tokens, and private keys. All vault files are stored in `inventories//group_vars/all/vault.yml`. 
+ +## Table of Contents + +- [Quick Start](#quick-start) +- [Vault File Locations](#vault-file-locations) +- [Creating Vault Files](#creating-vault-files) +- [Encrypting and Decrypting](#encrypting-and-decrypting) +- [Editing Vault Files](#editing-vault-files) +- [Using Vault Variables](#using-vault-variables) +- [Vault Password Management](#vault-password-management) +- [Best Practices](#best-practices) +- [Troubleshooting](#troubleshooting) + +## Quick Start + +```bash +# 1. Create vault file from example +cp inventories/production/group_vars/all/vault.yml.example \ + inventories/production/group_vars/all/vault.yml + +# 2. Edit and fill in secrets +vi inventories/production/group_vars/all/vault.yml + +# 3. Encrypt the vault file +ansible-vault encrypt inventories/production/group_vars/all/vault.yml + +# 4. Run playbook with vault password +ansible-playbook site.yml --ask-vault-pass +``` + +## Vault File Locations + +Vault files are organized by environment: + +``` +inventories/ +├── production/ +│ └── group_vars/ +│ └── all/ +│ ├── vault.yml.example # Template +│ └── vault.yml # Encrypted (gitignored) +├── staging/ +│ └── group_vars/ +│ └── all/ +│ ├── vault.yml.example +│ └── vault.yml +└── development/ + └── group_vars/ + └── all/ + ├── vault.yml.example + └── vault.yml +``` + +**Important**: `vault.yml` files should be added to `.gitignore` to prevent accidental commits. 
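As an extra guard against accidental plaintext commits, a small check like the one below could be run before `git commit`. It is a minimal sketch that relies only on the `$ANSIBLE_VAULT;` header that encrypted files begin with. The function names `is_vault_encrypted` and `check_vault_files` are hypothetical and not part of this repository.

```shell
# Hypothetical helper: succeed only if the file starts with the Ansible Vault header.
is_vault_encrypted() {
    # Encrypted vault files begin with a line such as: $ANSIBLE_VAULT;1.1;AES256
    head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# Hypothetical wrapper: report every file that is not encrypted and
# return non-zero if at least one such file was found.
check_vault_files() {
    rc=0
    for f in "$@"; do
        if ! is_vault_encrypted "$f"; then
            echo "NOT ENCRYPTED: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

A call such as `check_vault_files inventories/*/group_vars/all/vault.yml` could then be wired into a pre-commit hook. Missing files are reported as not encrypted, which errs on the safe side.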
+ +## Creating Vault Files + +### From Example Template + +```bash +# Copy example template +cp inventories/production/group_vars/all/vault.yml.example \ + inventories/production/group_vars/all/vault.yml + +# Edit and replace CHANGEME placeholders +vi inventories/production/group_vars/all/vault.yml + +# Encrypt the file +ansible-vault encrypt inventories/production/group_vars/all/vault.yml +``` + +### Create New Vault File + +```bash +# Create and encrypt in one step +ansible-vault create inventories/production/group_vars/all/vault.yml +``` + +This opens your editor to add vault contents, then automatically encrypts on save. + +## Encrypting and Decrypting + +### Encrypt a File + +```bash +ansible-vault encrypt inventories/production/group_vars/all/vault.yml +``` + +You'll be prompted to create a vault password. + +### Decrypt a File + +```bash +# Decrypt to view/edit (dangerous - creates plaintext file) +ansible-vault decrypt inventories/production/group_vars/all/vault.yml + +# View contents without writing plaintext to the vault file +ansible-vault view inventories/production/group_vars/all/vault.yml +``` + +**Warning**: Decrypting a file leaves it in plaintext. Always re-encrypt after editing. + +### Encrypt Multiple Files + +```bash +ansible-vault encrypt inventories/*/group_vars/all/vault.yml +``` + +## Editing Vault Files + +### Edit Encrypted File + +```bash +# Edit encrypted file directly (recommended) +ansible-vault edit inventories/production/group_vars/all/vault.yml +``` + +This decrypts the file to a temporary file, opens your editor, and re-encrypts on save. + +### Change Vault Password + +```bash +ansible-vault rekey inventories/production/group_vars/all/vault.yml +``` + +You'll be prompted for the old password, then the new password. 
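For scheduled rotation, the interactive prompts can be avoided by supplying both passwords from files. The sketch below assumes `ansible-vault` is on the `PATH`. The function name `rekey_vaults`, the `VAULT_CMD` override (useful for a dry run with `echo`), and the password file names are illustrative assumptions, not part of this repository.

```shell
# Hypothetical helper: rekey each given vault file non-interactively.
# Usage: rekey_vaults OLD_PASSWORD_FILE NEW_PASSWORD_FILE VAULT_FILE...
rekey_vaults() {
    old_pass_file=$1
    new_pass_file=$2
    shift 2
    for vault_file in "$@"; do
        # VAULT_CMD is an assumed override; it defaults to the real ansible-vault.
        "${VAULT_CMD:-ansible-vault}" rekey \
            --vault-password-file "$old_pass_file" \
            --new-vault-password-file "$new_pass_file" \
            "$vault_file" || return 1
    done
}
```

A dry run such as `VAULT_CMD=echo rekey_vaults .vault_pass .vault_pass_new inventories/production/group_vars/all/vault.yml` prints the arguments that would be passed without touching any file. Stopping at the first failure keeps the remaining files on the old password, so a partial rotation is easy to spot.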
+ +## Using Vault Variables + +### In Playbooks + +Reference vault variables like normal variables: + +```yaml +--- +- name: Configure database + hosts: databases + tasks: + - name: Set MySQL root password + mysql_user: + name: root + password: "{{ vault_mysql_root_password }}" + host: localhost +``` + +### In Templates + +```jinja2 +# /etc/my.cnf +[client] +user = root +password = {{ vault_mysql_root_password }} +``` + +### In Role Defaults + +```yaml +# roles/mysql/defaults/main.yml +--- +mysql_root_password: "{{ vault_mysql_root_password }}" +``` + +## Vault Password Management + +### Option 1: Interactive Password Prompt (Most Secure) + +```bash +ansible-playbook site.yml --ask-vault-pass +``` + +### Option 2: Password File + +Create a password file: + +```bash +# Create password file (gitignored) +echo "YourVaultPassword123!" > .vault_pass +chmod 600 .vault_pass +``` + +Add to `.gitignore`: +``` +.vault_pass +``` + +Update `ansible.cfg`: +```ini +[defaults] +vault_password_file = .vault_pass +``` + +Run playbooks without prompt: +```bash +ansible-playbook site.yml +``` + +### Option 3: Environment Variable + +```bash +export ANSIBLE_VAULT_PASSWORD_FILE=.vault_pass +ansible-playbook site.yml +``` + +### Option 4: Script-Based Password (Advanced) + +Create a script that retrieves the password from a secure source: + +```bash +#!/bin/bash +# vault-password.sh +# Retrieve password from AWS Secrets Manager, HashiCorp Vault, etc. +aws secretsmanager get-secret-value \ + --secret-id ansible-vault-password \ + --query SecretString \ + --output text +``` + +Make it executable: +```bash +chmod +x vault-password.sh +``` + +Use in `ansible.cfg`: +```ini +[defaults] +vault_password_file = ./vault-password.sh +``` + +## Best Practices + +### Security + +1. **Never commit unencrypted vault files** to version control +2. **Use different vault passwords** for each environment +3. **Rotate vault passwords** every 90 days +4. 
**Restrict access** to vault password files (`chmod 600`) +5. **Use strong passwords** (minimum 20 characters, mixed case, numbers, symbols) +6. **Store production passwords** in a secure password manager (1Password, LastPass, etc.) + +### Organization + +1. **Prefix vault variables** with `vault_` for clarity: + ```yaml + vault_mysql_root_password: "secret123" + vault_api_token: "token456" + ``` + +2. **Use vault variables in role defaults**: + ```yaml + # roles/mysql/defaults/main.yml + mysql_root_password: "{{ vault_mysql_root_password }}" + ``` + +3. **Document all vault variables** in `vault.yml.example` + +4. **One vault file per environment** for easier management + +### Git Management + +Add to `.gitignore`: +``` +# Vault passwords +.vault_pass +vault-password.sh + +# Unencrypted vault files +**/vault.yml +!**/vault.yml.example +``` + +Verify vault files are encrypted before committing: +```bash +# Check if file is encrypted +head -1 inventories/production/group_vars/all/vault.yml +# Should output: $ANSIBLE_VAULT;1.1;AES256 +``` + +## Multiple Vault Passwords + +For environments with different vault passwords: + +### Using Vault IDs + +```bash +# Encrypt with vault ID +ansible-vault encrypt \ + --vault-id production@prompt \ + inventories/production/group_vars/all/vault.yml + +ansible-vault encrypt \ + --vault-id staging@prompt \ + inventories/staging/group_vars/all/vault.yml + +# Run playbook with multiple vault IDs +ansible-playbook site.yml \ + --vault-id production@.vault_pass_production \ + --vault-id staging@.vault_pass_staging +``` + +## Common Vault Variables + +### User Credentials +```yaml +vault_ansible_user_ssh_key: "ssh-rsa AAAA..." +vault_root_password: "password" +vault_ansible_become_password: "password" +``` + +### API Tokens +```yaml +vault_aws_access_key_id: "AKIA..." 
+vault_aws_secret_access_key: "secret" +vault_netbox_api_token: "token" +``` + +### Database Credentials +```yaml +vault_mysql_root_password: "password" +vault_postgresql_postgres_password: "password" +``` + +### Application Secrets +```yaml +vault_app_secret_key: "secret_key" +vault_app_api_key: "api_key" +``` + +## Troubleshooting + +### Wrong Vault Password + +**Error**: `Decryption failed (no vault secrets were found that could decrypt)` + +**Solution**: Verify you're using the correct vault password for that environment. + +### No Vault Password Supplied + +**Error**: `ERROR! Attempting to decrypt but no vault secrets found` + +**Solution**: Ansible found encrypted content but was given no vault password. Pass `--ask-vault-pass` or `--vault-password-file`, or set `vault_password_file` in `ansible.cfg`. + +### Permission Denied + +**Error**: `Permission denied: 'vault.yml'` + +**Solution**: Check file permissions: +```bash +ls -la inventories/production/group_vars/all/vault.yml +chmod 600 inventories/production/group_vars/all/vault.yml +``` + +### Forgot Vault Password + +**Solution**: Unfortunately, there's no way to recover a forgotten vault password. You'll need to: +1. Re-create the vault file from scratch +2. Re-enter all secrets +3. Re-encrypt with a new password + +**Prevention**: Store vault passwords in a secure password manager. + +### Check Vault File Integrity + +```bash +# Verify file can be decrypted +ansible-vault view inventories/production/group_vars/all/vault.yml + +# Check encryption format +head -1 inventories/production/group_vars/all/vault.yml +# Should start with: $ANSIBLE_VAULT; +``` + +## Emergency Procedures + +### Compromised Vault Password + +1. **Immediately change the vault password**: + ```bash + ansible-vault rekey inventories/production/group_vars/all/vault.yml + ``` + +2. **Rotate all secrets** stored in the vault + +3. **Audit access logs** to determine scope of compromise + +4. **Update vault password** in all secure storage locations + +### Lost Access to Production Vault + +1. 
Use backup vault password from secure password manager +2. If no backup exists, rotate all production credentials +3. Create new vault file with new credentials +4. Update all systems with new credentials + +## References + +- [Ansible Vault Documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) +- [Ansible Best Practices - Vault](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html#variables-and-vaults) +- Internal: [CLAUDE.md - Secrets Management](../CLAUDE.md) + +--- + +**Document Version**: 1.0 +**Last Updated**: 2025-11-11 +**Maintainer**: Infrastructure Team diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 0000000..4b8159e --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,602 @@ +# Troubleshooting Guide + +## Overview + +Comprehensive troubleshooting guide for common issues encountered in Ansible-managed infrastructure. + +**Last Updated**: 2025-11-11 +**Document Owner**: Operations Team + +--- + +## Table of Contents + +1. [Ansible Execution Issues](#ansible-execution-issues) +2. [SSH and Connectivity](#ssh-and-connectivity) +3. [VM Deployment Issues](#vm-deployment-issues) +4. [System Information Collection](#system-information-collection) +5. [Storage and LVM Issues](#storage-and-lvm-issues) +6. [Security and Firewall](#security-and-firewall) +7. [Performance Issues](#performance-issues) +8. [General Diagnostics](#general-diagnostics) + +--- + +## Ansible Execution Issues + +### Issue: "Failed to connect to the host via SSH" + +**Symptoms**: Cannot connect to target hosts + +**Causes**: +- SSH key not authorized +- Wrong SSH user +- Host unreachable +- SSH service not running + +**Solutions**: + +```bash +# 1. Test connectivity +ping + +# 2. Test SSH manually +ssh ansible@ + +# 3. Verify SSH service on target +ansible -m shell -a "systemctl status sshd" -u root --ask-pass + +# 4. 
Check SSH key is authorized +ansible -m authorized_key \ + -a "user=ansible key='{{ lookup('file', '~/.ssh/id_rsa.pub') }}' state=present" \ + -u root --ask-pass + +# 5. Verify ansible user exists +ansible -m shell -a "id ansible" -u root --ask-pass +``` + +### Issue: "Permission denied" or "sudo: a password is required" + +**Symptoms**: Tasks fail due to insufficient permissions + +**Causes**: +- ansible user lacks sudo permissions +- `become: yes` not specified +- Incorrect sudo configuration + +**Solutions**: + +```bash +# 1. Verify sudo permissions +ssh ansible@ "sudo -l" + +# 2. Check sudoers configuration +ssh ansible@ "sudo cat /etc/sudoers.d/ansible" + +# 3. Fix sudoers if needed (as root) +ssh root@ "cat > /etc/sudoers.d/ansible <<'EOF' +ansible ALL=(ALL) NOPASSWD: ALL +Defaults:ansible !requiretty +EOF" + +# 4. Ensure become is set in playbook +# Add to playbook: +# become: yes +``` + +### Issue: "Module not found" or "No module named..." + +**Symptoms**: Python module import errors + +**Causes**: +- Python dependencies missing on control node or target +- Wrong Python interpreter + +**Solutions**: + +```bash +# On control node +pip3 install -r requirements.txt + +# On target hosts +ansible all -m package -a "name=python3,python3-pip state=present" --become + +# Specify Python interpreter +ansible all -m setup -a "filter=ansible_python_version" \ + -e "ansible_python_interpreter=/usr/bin/python3" +``` + +--- + +## SSH and Connectivity + +### Issue: "UNREACHABLE!" for all hosts + +**Symptoms**: Cannot reach any hosts in inventory + +**Causes**: +- Network connectivity issues +- DNS resolution failures +- Firewall blocking SSH +- Incorrect inventory configuration + +**Solutions**: + +```bash +# 1. Verify inventory syntax +ansible-inventory --list -i inventories/production + +# 2. Test DNS resolution +ansible all -m shell -a "hostname -f" -i inventories/production + +# 3. Test network connectivity +ansible all -m ping -i inventories/production + +# 4. 
Check SSH port accessibility +nmap -p 22 + +# 5. Verify inventory file paths +ansible all --list-hosts -i inventories/production +``` + +### Issue: SSH connection hangs or times out + +**Symptoms**: SSH attempts timeout or hang indefinitely + +**Causes**: +- Network latency +- SSH idle timeout +- Firewall dropping connections +- MTU issues + +**Solutions**: + +```bash +# 1. Increase SSH timeout in ansible.cfg +[defaults] +timeout = 60 + +# 2. Enable SSH keepalive +[ssh_connection] +ssh_args = -o ServerAliveInterval=60 -o ServerAliveCountMax=3 + +# 3. Test with verbose SSH +ssh -vvv ansible@ + +# 4. Check MTU issues +ping -M do -s 1472 # Should not fragment +``` + +--- + +## VM Deployment Issues + +### Issue: VM fails to start after creation + +**Symptoms**: VM shows "shut off" immediately after deployment + +**Causes**: +- Insufficient resources on hypervisor +- Cloud-init ISO creation failed +- Invalid VM configuration + +**Solutions**: + +```bash +# 1. Check hypervisor resources +ansible hypervisor -m shell -a "free -h && df -h /var/lib/libvirt/images" + +# 2. Check VM definition +ansible hypervisor -m shell -a "virsh dumpxml " + +# 3. View libvirt logs +ansible hypervisor -m shell -a "tail -50 /var/log/libvirt/qemu/.log" + +# 4. Start VM manually and check errors +ansible hypervisor -m shell -a "virsh start " + +# 5. Check cloud-init ISO exists +ansible hypervisor -m shell -a "ls -lh /var/lib/libvirt/images/-cloud-init.iso" +``` + +### Issue: Cloud-init fails on first boot + +**Symptoms**: Cannot SSH to VM, cloud-init errors in logs + +**Causes**: +- Cloud-init configuration errors +- Network connectivity issues in VM +- Package installation failures + +**Solutions**: + +```bash +# 1. Access VM console +ansible hypervisor -m shell -a "virsh console " +# Press Enter, login as root (if console password set) + +# 2. Check cloud-init status +ssh ansible@ "cloud-init status --long" + +# 3. 
View cloud-init logs +ssh ansible@ "tail -100 /var/log/cloud-init-output.log" + +# 4. Re-run cloud-init modules +ssh ansible@ "sudo cloud-init clean && sudo cloud-init init" + +# 5. Verify network connectivity in VM +ssh ansible@ "ping -c 3 8.8.8.8 && nslookup google.com" +``` + +### Issue: Cannot get VM IP address + +**Symptoms**: `virsh domifaddr` returns no IP + +**Causes**: +- VM networking not configured +- DHCP not working +- VM not fully booted + +**Solutions**: + +```bash +# 1. Wait for VM to boot completely +sleep 60 + +# 2. Check all network sources +ansible hypervisor -m shell -a "virsh domifaddr --source lease" +ansible hypervisor -m shell -a "virsh domifaddr --source agent" + +# 3. Check DHCP leases +ansible hypervisor -m shell -a "virsh net-dhcp-leases default" + +# 4. Check VM network configuration +ansible hypervisor -m shell -a "virsh domif-getlink vnet0" + +# 5. Access via console to configure networking +ansible hypervisor -m shell -a "virsh console " +``` + +--- + +## System Information Collection + +### Issue: system_info role fails with "command not found" + +**Symptoms**: Tasks fail due to missing commands (lshw, dmidecode, etc.) + +**Causes**: +- Required packages not installed +- Package installation skipped + +**Solutions**: + +```bash +# 1. Run installation tasks +ansible-playbook site.yml -t system_info,install + +# 2. Manually install packages +ansible all -m package -a "name=lshw,dmidecode,pciutils,usbutils state=present" --become + +# 3. Verify packages installed +ansible all -m shell -a "which lshw dmidecode lspci" +``` + +### Issue: Statistics files not created + +**Symptoms**: No JSON files in `./stats/machines/` + +**Causes**: +- Directory permissions issues on control node +- Disk space full +- Export tasks not executed + +**Solutions**: + +```bash +# 1. Check directory exists and is writable +ls -la ./stats/machines/ +mkdir -p ./stats/machines +chmod 755 ./stats/machines + +# 2. Check disk space +df -h . + +# 3. 
Run export tasks explicitly +ansible-playbook site.yml -t system_info,export + +# 4. Check for errors in Ansible output +ansible-playbook site.yml -t system_info -vvv | tee ansible_debug.log +``` + +--- + +## Storage and LVM Issues + +### Issue: LVM configuration fails on deployed VM + +**Symptoms**: LVM post-deployment tasks fail + +**Causes**: +- Second disk not attached to VM +- LVM tools not installed +- Physical volume creation failed + +**Solutions**: + +```bash +# 1. Verify second disk exists +ssh ansible@ "lsblk" + +# 2. Check for /dev/vdb +ssh ansible@ "ls -l /dev/vdb" + +# 3. Verify LVM packages +ssh ansible@ "which pvcreate vgcreate lvcreate" + +# 4. Manually create PV if needed +ssh ansible@ "sudo pvcreate /dev/vdb" + +# 5. Re-run LVM configuration +ansible-playbook site.yml -t deploy_linux_vm,lvm,post-deploy \ + -e "deploy_linux_vm_name=" +``` + +### Issue: Disk full on hypervisor + +**Symptoms**: VM deployment fails, "No space left on device" + +**Causes**: +- Insufficient disk space in `/var/lib/libvirt/images` +- Too many cached cloud images +- Old VM disks not cleaned up + +**Solutions**: + +```bash +# 1. Check disk space +ansible hypervisor -m shell -a "df -h /var/lib/libvirt/images" + +# 2. List all VM disks +ansible hypervisor -m shell -a "ls -lh /var/lib/libvirt/images/*.qcow2" + +# 3. Remove old cloud images +ansible hypervisor -m shell -a "find /var/lib/libvirt/images -name '*-cloud*.qcow2' -mtime +90 -delete" + +# 4. Remove unused VM disks (CAREFUL!) +# Verify VM is deleted first +ansible hypervisor -m shell -a "virsh list --all" +ansible hypervisor -m shell -a "rm /var/lib/libvirt/images/old-vm-*.qcow2" + +# 5. 
Clean up libvirt storage pools +ansible hypervisor -m shell -a "virsh pool-refresh default && virsh vol-list default" +``` + +--- + +## Security and Firewall + +### Issue: Cannot SSH to VM after deployment + +**Symptoms**: SSH connection refused or times out + +**Causes**: +- Firewall blocking SSH +- SSH service not running +- SSH keys not deployed correctly + +**Solutions**: + +```bash +# 1. Check if VM is running +ansible hypervisor -m shell -a "virsh list --all" + +# 2. Access via hypervisor console +ansible hypervisor -m shell -a "virsh console " + +# 3. From console, check sshd status +systemctl status sshd + +# 4. Check firewall rules +sudo ufw status # Debian/Ubuntu +sudo firewall-cmd --list-all # RHEL/AlmaLinux + +# 5. Temporarily allow SSH (for troubleshooting) +sudo ufw allow 22/tcp # Debian/Ubuntu +sudo firewall-cmd --add-service=ssh --permanent && sudo firewall-cmd --reload # RHEL + +# 6. Verify SSH key authorized +cat ~/.ssh/authorized_keys +``` + +### Issue: SELinux denials preventing operations + +**Symptoms**: Operations fail with "Permission denied" even with sudo + +**Causes**: +- SELinux blocking operations +- Incorrect file contexts +- Missing SELinux policies + +**Solutions**: + +```bash +# 1. Check SELinux status +ssh ansible@ "getenforce" + +# 2. Check for denials +ssh ansible@ "sudo ausearch -m avc -ts recent" + +# 3. Generate policy from denials +ssh ansible@ "sudo ausearch -m avc -ts recent | audit2allow -M mypolicy" +ssh ansible@ "sudo semodule -i mypolicy.pp" + +# 4. Fix file contexts +ssh ansible@ "sudo restorecon -Rv /" + +# 5. 
Temporarily set to permissive for testing (NOT PRODUCTION) +ssh ansible@ "sudo setenforce 0" +# After testing, re-enable +ssh ansible@ "sudo setenforce 1" +``` + +--- + +## Performance Issues + +### Issue: Ansible playbook execution is very slow + +**Symptoms**: Playbooks take excessive time to complete + +**Causes**: +- Fact gathering on many hosts +- Serial execution instead of parallel +- Slow network connections +- Large inventory + +**Solutions**: + +```bash +# 1. Enable fact caching in ansible.cfg +[defaults] +fact_caching = jsonfile +fact_caching_connection = /tmp/ansible_facts +fact_caching_timeout = 3600 + +# 2. Increase parallelism +ansible-playbook site.yml -f 20 + +# 3. Skip implicit fact gathering if not needed (or set "gather_facts: false" in the play) +ANSIBLE_GATHERING=explicit ansible-playbook site.yml + +# 4. Use the free strategy plugin (in ansible.cfg) +[defaults] +strategy = free + +# 5. Enable pipelining +[ssh_connection] +pipelining = True + +# 6. Profile task execution with the profile_tasks callback (ansible.posix collection) +ANSIBLE_CALLBACKS_ENABLED=profile_tasks ansible-playbook site.yml +``` + +### Issue: High CPU usage on hypervisor + +**Symptoms**: Hypervisor CPU at 100%, VMs slow + +**Causes**: +- CPU overcommitment +- Runaway processes in VMs +- Insufficient resources + +**Solutions**: + +```bash +# 1. Check hypervisor load +ansible hypervisor -m shell -a "top -bn1 | head -20" + +# 2. Check VM CPU allocation +ansible hypervisor -m shell -a "virsh vcpuinfo " + +# 3. List VMs by CPU usage +ansible hypervisor -m shell -a "virsh domstats --cpu-total" + +# 4. Inside VMs, check processes +ssh ansible@ "top -bn1 | head -20" + +# 5. 
Reduce VM vCPU allocation if needed +ansible hypervisor -m shell -a "virsh setvcpus --config" +ansible hypervisor -m shell -a "virsh shutdown && virsh start " +``` + +--- + +## General Diagnostics + +### Diagnostic Commands + +```bash +# Ansible inventory +ansible-inventory --list +ansible-inventory --graph + +# Connectivity test +ansible all -m ping + +# Gather facts from hosts +ansible all -m setup + +# Check disk space across all hosts +ansible all -m shell -a "df -h" + +# Check memory across all hosts +ansible all -m shell -a "free -h" + +# Check system load +ansible all -m shell -a "uptime" + +# List running services +ansible all -m shell -a "systemctl list-units --type=service --state=running" + +# Check for failed services +ansible all -m shell -a "systemctl --failed" + +# Review system logs +ansible all -m shell -a "journalctl -p err -n 50" +``` + +### Debug Mode + +```bash +# Verbose output (level 1) +ansible-playbook site.yml -v + +# More verbose (level 2 - shows module arguments) +ansible-playbook site.yml -vv + +# Very verbose (level 3 - shows connection attempts) +ansible-playbook site.yml -vvv + +# Maximum verbosity (level 4 - shows everything) +ansible-playbook site.yml -vvvv +``` + +### Log Locations + +**Control Node**: +- Ansible log: `/var/log/ansible.log` (if configured) +- Command history: `~/.bash_history` + +**Target Hosts**: +- System logs: `/var/log/syslog` (Debian) or `/var/log/messages` (RHEL) +- Auth logs: `/var/log/auth.log` (Debian) or `/var/log/secure` (RHEL) +- Audit logs: `/var/log/audit/audit.log` +- Cloud-init: `/var/log/cloud-init-output.log` +- Journal: `journalctl` + +--- + +## Getting Help + +### Internal Resources +- [CLAUDE.md Guidelines](../CLAUDE.md) +- [Architecture Overview](./architecture/overview.md) +- [Role Documentation](./roles/) +- [Cheatsheets](../cheatsheets/) + +### External Resources +- [Ansible Documentation](https://docs.ansible.com/) +- [KVM/libvirt Documentation](https://libvirt.org/docs.html) +- 
[Distribution-specific documentation](https://www.debian.org/doc/) + +### Support Channels +- Internal issue tracker: https://git.mymx.me +- Operations team: ops@example.com +- On-call escalation: +1-XXX-XXX-XXXX + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Maintained By**: Operations Team diff --git a/docs/variables.md b/docs/variables.md new file mode 100644 index 0000000..1267a28 --- /dev/null +++ b/docs/variables.md @@ -0,0 +1,254 @@ +# Ansible Variables Documentation + +## Overview + +This document describes the global, role-specific, and environment-specific variables used in the Ansible infrastructure automation. + +## Variable Precedence + +Ansible variable precedence (highest to lowest): + +1. Extra vars (`-e` on the command line) +2. Include params +3. Role (and `include_role`) params +4. `set_fact` / registered vars +5. `include_vars` +6. Task vars (only for the task) +7. Block vars (only for tasks in block) +8. Role vars (`roles/*/vars/main.yml`) +9. Play vars_files +10. Play vars_prompt +11. Play vars +12. Host facts / cached set_facts +13. Playbook host_vars +14. Inventory host_vars, then host vars from the inventory file or script +15. Playbook group_vars/*, then inventory group_vars/* +16. Playbook group_vars/all, then inventory group_vars/all, then group vars from the inventory file or script +17. Role defaults + +## Global Variables + +### inventories/*/group_vars/all.yml + +These variables apply to all hosts across all environments. 
+ +| Variable | Default | Description | +|----------|---------|-------------| +| `ansible_user` | `ansible` | SSH user for automation | +| `ansible_become` | `true` | Use privilege escalation | +| `ansible_python_interpreter` | `/usr/bin/python3` | Python interpreter path | + +## Role-Specific Variables + +### deploy_linux_vm Role + +**Location**: `roles/deploy_linux_vm/defaults/main.yml` + +#### Required Variables + +| Variable | Required | Description | +|----------|----------|-------------| +| `deploy_linux_vm_os_distribution` | Yes | Distribution identifier (e.g., `ubuntu-22.04`, `almalinux-9`) | + +#### VM Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `deploy_linux_vm_name` | `linux-guest` | VM name in libvirt | +| `deploy_linux_vm_hostname` | `linux-vm` | Guest hostname | +| `deploy_linux_vm_domain` | `localdomain` | Domain name (FQDN = hostname.domain) | +| `deploy_linux_vm_vcpus` | `2` | Number of virtual CPUs | +| `deploy_linux_vm_memory_mb` | `2048` | RAM allocation in MB | +| `deploy_linux_vm_disk_size_gb` | `30` | Primary disk size in GB | + +#### LVM Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `deploy_linux_vm_use_lvm` | `true` | Enable LVM configuration | +| `deploy_linux_vm_lvm_vg_name` | `vg_system` | Volume group name | +| `deploy_linux_vm_lvm_pv_device` | `/dev/vdb` | Physical volume device | + +#### Security Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `deploy_linux_vm_enable_firewall` | `true` | Enable UFW/firewalld | +| `deploy_linux_vm_enable_selinux` | `true` | Enable SELinux (RHEL family) | +| `deploy_linux_vm_enable_apparmor` | `true` | Enable AppArmor (Debian family) | +| `deploy_linux_vm_enable_auditd` | `true` | Enable audit daemon | +| `deploy_linux_vm_enable_automatic_updates` | `true` | Enable automatic security updates | +| `deploy_linux_vm_automatic_reboot` | `false` | Auto-reboot 
after updates | + +#### SSH Hardening + +| Variable | Default | Description | +|----------|---------|-------------| +| `deploy_linux_vm_ssh_permit_root_login` | `no` | Allow root SSH login | +| `deploy_linux_vm_ssh_password_authentication` | `no` | Allow password authentication | +| `deploy_linux_vm_ssh_gssapi_authentication` | `no` | **GSSAPI disabled per requirements** | +| `deploy_linux_vm_ssh_max_auth_tries` | `3` | Maximum authentication attempts | + +### system_info Role + +**Location**: `roles/system_info/defaults/main.yml` + +| Variable | Default | Description | +|----------|---------|-------------| +| `system_info_stats_base_dir` | `./stats/machines` | Base directory for statistics storage | +| `system_info_create_stats_dir` | `true` | Create stats directory if missing | +| `system_info_gather_cpu` | `true` | Gather CPU information | +| `system_info_gather_gpu` | `true` | Gather GPU information | +| `system_info_gather_memory` | `true` | Gather memory information | +| `system_info_gather_disk` | `true` | Gather disk information | +| `system_info_gather_network` | `true` | Gather network information | +| `system_info_detect_hypervisor` | `true` | Detect hypervisor capabilities | +| `system_info_json_indent` | `2` | JSON output indentation | + +## Environment-Specific Variables + +### Production (`inventories/production/group_vars/all.yml`) + +```yaml +# Example production variables +environment_name: production +backup_enabled: true +monitoring_enabled: true +automatic_updates_schedule: "0 2 * * 0" # Weekly Sunday 2 AM +``` + +### Staging (`inventories/staging/group_vars/all.yml`) + +```yaml +# Example staging variables +environment_name: staging +backup_enabled: true +monitoring_enabled: true +automatic_updates_schedule: "0 3 * * *" # Daily 3 AM +``` + +### Development (`inventories/development/group_vars/all.yml`) + +```yaml +# Example development variables +environment_name: development +backup_enabled: false +monitoring_enabled: false 
+automatic_updates_schedule: "0 4 * * *" # Daily 4 AM +``` + +## Variable Naming Conventions + +### Prefix Convention + +All role variables should be prefixed with the role name: + +```yaml +# Good +deploy_linux_vm_vcpus: 4 +system_info_gather_cpu: true + +# Bad (global namespace pollution) +vcpus: 4 +gather_cpu: true +``` + +### Type Indicators + +Use clear variable names that indicate type: + +```yaml +# Boolean +enable_firewall: true +is_production: false + +# String +hostname: "webserver01" +domain: "example.com" + +# Integer +cpu_count: 4 +memory_mb: 8192 + +# List +allowed_ips: + - "192.168.1.0/24" + - "10.0.0.0/8" + +# Dictionary +lvm_config: + vg_name: "vg_system" + volumes: + - name: "lv_opt" + size: "3G" +``` + +## Sensitive Variables + +### Ansible Vault + +Sensitive variables should be encrypted with Ansible Vault: + +```yaml +# inventories/production/group_vars/all/vault.yml (encrypted) +vault_database_password: "SecurePassword123!" +vault_api_token: "eyJhbGc..." +vault_ssh_private_key: | + -----BEGIN OPENSSH PRIVATE KEY----- + ... + -----END OPENSSH PRIVATE KEY----- +``` + +**Usage in playbooks**: +```yaml +database_password: "{{ vault_database_password }}" +``` + +**Encryption**: +```bash +ansible-vault encrypt inventories/production/group_vars/all/vault.yml +``` + +**Editing**: +```bash +ansible-vault edit inventories/production/group_vars/all/vault.yml +``` + +## Variable Validation + +### Using assert Module + +Validate variables before use. The `int` filter keeps the numeric checks working when values arrive as strings, for example via `-e key=value`: + +```yaml +- name: Validate required variables + assert: + that: + - deploy_linux_vm_os_distribution is defined + - deploy_linux_vm_os_distribution | length > 0 + - deploy_linux_vm_vcpus | int > 0 + - deploy_linux_vm_memory_mb | int >= 1024 + fail_msg: "Required variable validation failed" +``` + +## Best Practices + +1. **Use Defaults**: Define sensible defaults in `roles/*/defaults/main.yml` +2. **Document Variables**: Include description and type in README.md +3. 
**Prefix Role Variables**: Avoid namespace collisions +4. **Validate Input**: Use `assert` to catch misconfigurations early +5. **Encrypt Secrets**: Always use Ansible Vault for sensitive data +6. **Use Clear Names**: Make variable purpose obvious +7. **Avoid Hardcoding**: Use variables instead of hardcoded values + +## Related Documentation + +- [Role Index](./roles/role-index.md) +- [CLAUDE.md Guidelines](../CLAUDE.md) +- [Security Model](./architecture/security-model.md) + +--- + +**Document Version**: 1.0.0 +**Last Updated**: 2025-11-11 +**Maintained By**: Ansible Infrastructure Team