diff --git a/SYSTEM_ANALYSIS_AND_REMEDIATION.md b/SYSTEM_ANALYSIS_AND_REMEDIATION.md new file mode 100644 index 0000000..7bbc3cc --- /dev/null +++ b/SYSTEM_ANALYSIS_AND_REMEDIATION.md @@ -0,0 +1,831 @@ +# System Analysis and Remediation Plan + +**Date:** 2025-11-11 +**Analyzer:** Ansible Automation +**Scope:** All KVM guest VMs in development environment + +--- + +## Executive Summary + +System information gathering playbook executed against 3 VMs in the development environment: +- ✅ **pihole** (192.168.122.12): SUCCESS - 127 tasks completed +- ✅ **mymx/cow** (192.168.122.119): SUCCESS - 128 tasks completed (after remediation) +- ❌ **derp** (192.168.122.99): FAILED - SSH connectivity issues + +### Overall Health Status +- **Connectivity:** 2/3 hosts operational (67%) +- **CLAUDE.md Compliance:** Partial compliance identified +- **Security Posture:** Multiple findings requiring attention +- **Critical Issues:** 3 +- **High Priority Issues:** 5 +- **Medium Priority Issues:** 4 +- **Low Priority Issues:** 2 + +--- + +## Host-by-Host Analysis + +### pihole (pihole.grokbox) - 192.168.122.12 + +**Status:** ✅ Operational +**OS:** Debian +**Uptime:** 23 days, 11:03 +**Role:** DNS/Ad-blocking service + +#### System Resources +- **CPU:** Load average: 0.27, 0.11, 0.06 (healthy) +- **Memory:** 1.9GB total, 401MB used, 1.5GB available (healthy) +- **Swap:** **0B** ❌ CRITICAL +- **Disk:** /dev/vda1 - 7.7GB total, 1.9GB used (25% utilization) + +#### Critical Findings + +**1. 
No Swap Configured** ❌ **CRITICAL** +- **Finding:** System has 0B swap space +- **Risk:** High risk of OOM killer activation under memory pressure +- **CLAUDE.md Requirement:** Minimum 1GB swap (lv_swap) +- **Impact:** Service interruptions, potential data loss +- **Remediation:** + ```bash + # Option 1: Add swap file (quick fix) + dd if=/dev/zero of=/swapfile bs=1M count=2048 + chmod 600 /swapfile + mkswap /swapfile + swapon /swapfile + echo '/swapfile none swap sw 0 0' >> /etc/fstab + + # Option 2: LVM swap (CLAUDE.md compliant) + # Requires LVM migration (see below) + ``` + +**2. No LVM Configuration** ⚠️ **HIGH** +- **Finding:** Using traditional partitioning (/dev/vda1 mounted on /) +- **CLAUDE.md Violation:** All systems must use LVM +- **Missing Volumes:** + - lv_opt → /opt (3GB) + - lv_tmp → /tmp (1GB, noexec) + - lv_home → /home (2GB) + - lv_var → /var (5GB) + - lv_var_log → /var/log (2GB) + - lv_var_tmp → /var/tmp (5GB, noexec) + - lv_var_audit → /var/log/audit (1GB) + - lv_swap → swap (2GB) +- **Risk:** Cannot dynamically resize partitions, difficult disaster recovery +- **Remediation:** See "LVM Migration Plan" section below + +**3. Docker Running with Unknown Security Posture** ⚠️ **MEDIUM** +- **Finding:** Docker daemon running (PID 627, consuming 4.0% memory) +- **Containers:** Multiple overlay mounts detected +- **Security Concerns:** + - Container escape risk + - Privileged container usage unknown + - Network isolation unknown + - Resource limits unknown +- **Remediation:** Perform Docker security audit (see section below) + +#### High Priority Findings + +**4. Unattended Upgrades Running** ℹ️ **INFO** +- **Finding:** `/usr/share/unattended-upgrades/unattended-upgrade-shutdown` active +- **Status:** This is expected behavior per CLAUDE.md +- **Action:** Verify configuration aligns with security-only updates + +#### Recommendations +1. **Immediate:** Configure swap space (Option 1: swap file) +2. **Short-term:** Conduct Docker security audit +3. 
**Long-term:** Plan LVM migration or document exception rationale + +--- + +### mymx / cow.mymx.me - 192.168.122.119 + +**Status:** ✅ Operational (after SSH key deployment) +**OS:** Debian +**Hostname:** cow.mymx.me +**Role:** Mail server (mailcow) + +#### System Resources +- **CPU:** Multi-core, moderate load +- **Memory:** 16GB total, 6.1GB used, 9.5GB available (healthy) +- **Swap:** 976MB total, 439MB used (45% utilization) ✅ COMPLIANT +- **Disk:** LVM configured (/dev/mapper/mymx--vg-root - 48GB, 57% used) ✅ COMPLIANT + +#### Critical Findings + +**1. SSH Authentication Failure (RESOLVED)** ✅ +- **Initial Finding:** Permission denied (publickey) +- **Root Cause:** `ansible` user did not exist, SSH key not deployed +- **Remediation Applied:** + - Created `ansible` user + - Deployed SSH public key + - Configured passwordless sudo +- **Status:** ✅ RESOLVED - Host now accessible via Ansible + +**2. QEMU Guest Agent Not Responding** ⚠️ **HIGH** +- **Finding:** `libvirt: QEMU Driver error : Guest agent is not connected` +- **Impact:** + - Cannot get accurate VM state from hypervisor + - Snapshot filesystem freeze unavailable + - Limited VM management capabilities from libvirt +- **Remediation:** + ```bash + ansible mymx -b -m apt -a "name=qemu-guest-agent state=present" + ansible mymx -b -m systemd -a "name=qemu-guest-agent state=started enabled=yes" + ``` + +#### High Priority Findings + +**3. 
Heavy Service Load** ⚠️ **MEDIUM** +- **Finding:** Multiple resource-intensive services: + - ClamAV clamd: 8.7% memory (1.4GB) + - YaCy search: 7.9% memory (1.3GB) + high CPU + - OpenWebUI: 4.8% memory (800MB) + - MariaDB: 2.0% memory (328MB) + - Redis: Running +- **Concerns:** + - Memory pressure (6.1GB / 16GB used) + - Swap usage (45%) + - CPU contention risk +- **Recommendations:** + - Monitor resource trends + - Consider vertical scaling (increase RAM) if swap usage grows + - Review YaCy necessity (search engine consuming significant resources) + - Implement resource limits for containers + +**4. Extensive Docker Usage** ⚠️ **MEDIUM** +- **Finding:** 24 Docker overlay mounts detected +- **Services:** Mailcow components running in containers +- **Security Concerns:** Same as pihole (see Docker audit section) + +#### LVM Status +✅ **COMPLIANT** - LVM is properly configured: +- Volume Group: `mymx-vg` +- Root volume: `/dev/mapper/mymx--vg-root` (48GB) +- Swap: LVM-based (976MB) + +#### Recommendations +1. **Immediate:** Install qemu-guest-agent +2. **Short-term:** Monitor resource usage trends +3. **Medium-term:** Conduct Docker security audit +4. **Long-term:** Plan capacity expansion if memory usage continues growing + +--- + +### derp - 192.168.122.99 + +**Status:** ❌ UNREACHABLE +**Error:** `Permission denied (publickey,password)` + +#### Critical Findings + +**1. SSH Authentication Failure** ❌ **CRITICAL** +- **Finding:** SSH login fails with both public-key and password authentication +- **Attempted Remediation:** Failed to connect via jump host +- **Error Detail:** `Connection closed by UNKNOWN port 65535` +- **Possible Causes:** + 1. VM is not running + 2. SSH service not running + 3. Network connectivity issue + 4. Firewall blocking connection + 5. SSH configuration issue + 6. System compromised or in rescue mode + +#### Immediate Actions Required +1. 
**Check VM Status:** + ```bash + ansible grokbox -b -m shell -a "virsh list --all | grep derp" + ansible grokbox -b -m shell -a "virsh domstate derp" + ``` + +2. **If VM is running, access via console:** + ```bash + ssh grokbox "virsh console derp" + ``` + +3. **Verify network:** + ```bash + ansible grokbox -b -m shell -a "virsh domifaddr derp" + ansible grokbox -b -m shell -a "ping -c 3 192.168.122.99" + ``` + +4. **Check SSH service (via console):** + ```bash + systemctl status sshd + journalctl -u sshd -n 50 + ``` + +5. **Check firewall (via console):** + ```bash + ufw status # Debian/Ubuntu + iptables -L # All systems + ``` + +--- + +## Infrastructure-Wide Issues + +### Dynamic Inventory Warnings + +**Finding:** Invalid characters in group names +``` +[WARNING]: Invalid characters were found in group names but not replaced +``` + +**Root Cause:** Libvirt dynamic inventory creates UUID-based groups with hyphens: +- `7cd5a220-bea4-49a1-a44e-a247dbdfd085` +- `6d714c93-16fb-41c8-8ef8-9001f9066b3a` +- `9ede717f-879b-48aa-add0-2dfd33e10765` + +**Impact:** Potential compatibility issues with Ansible group operations + +**Remediation:** +```yaml +# inventories/development/libvirt_kvm.yml +# Add group name sanitization +keyed_groups: + - key: info.uuid | regex_replace('-', '_') + prefix: uuid + separator: "_" +``` + +### QEMU Guest Agent Deployment + +**Finding:** Guest agent not installed on VMs + +**Impact:** +- Unreliable IP address discovery +- No filesystem quiescing for snapshots +- Limited VM management from libvirt + +**Remediation Playbook:** + +Create `playbooks/install_qemu_agent.yml`: +```yaml +--- +- name: Install QEMU Guest Agent on all VMs + hosts: kvm_guests + become: yes + tasks: + - name: Install qemu-guest-agent (Debian/Ubuntu) + apt: + name: qemu-guest-agent + state: present + update_cache: yes + when: ansible_os_family == "Debian" + + - name: Install qemu-guest-agent (RHEL/Rocky/Alma) + yum: + name: qemu-guest-agent + state: present + when: 
ansible_os_family == "RedHat" + + - name: Enable and start qemu-guest-agent + systemd: + name: qemu-guest-agent + state: started + enabled: yes + + - name: Verify agent is running + systemd: + name: qemu-guest-agent + register: agent_status + + - name: Display agent status + debug: + msg: "QEMU Guest Agent status: {{ agent_status.status.ActiveState }}" +``` + +--- + +## Detailed Remediation Plans + +### Plan 1: Pihole LVM Migration + +**Complexity:** HIGH +**Downtime:** 2-4 hours +**Risk:** MEDIUM (data migration required) + +#### Prerequisites +- Full backup of pihole data +- Maintenance window scheduled +- Secondary DNS available during migration + +#### Migration Steps + +**Option A: In-Place Migration (Complex)** +1. Backup all data +2. Add second disk to VM +3. Create LVM on new disk +4. Copy data to new LVM volumes +5. Update fstab +6. Update bootloader +7. Reboot and verify +8. Remove old disk + +**Option B: Redeploy with deploy_linux_vm role (Recommended)** +1. Backup pihole configuration and data: + ```bash + # Backup Pi-hole configuration + pihole -a teleporter backup.tar.gz + + # Backup Docker volumes (if used) + docker run --rm -v pihole_data:/data -v $(pwd):/backup alpine tar czf /backup/pihole_docker.tar.gz /data + ``` + +2. Deploy new VM with LVM: + ```yaml + - hosts: grokbox + roles: + - role: deploy_linux_vm + vars: + deploy_linux_vm_name: pihole-new + deploy_linux_vm_hostname: pihole + deploy_linux_vm_os_distribution: debian-12 + deploy_linux_vm_vcpus: 2 + deploy_linux_vm_memory_mb: 2048 + deploy_linux_vm_disk_size_gb: 30 + deploy_linux_vm_use_lvm: true + ``` + +3. Restore data to new VM +4. Test functionality +5. Update DNS records +6. Decommission old VM + +**Option C: Document Exception** +If pihole is ephemeral or easily replaceable: +1. Document why LVM is not required +2. Add to exceptions list in CLAUDE.md +3. 
Ensure backup/restore procedures are in place + +#### Recommendation +**Option B (Redeploy)** is recommended because: +- Clean implementation of CLAUDE.md standards +- Minimal risk (old VM remains until verified) +- Opportunity to update to latest OS version +- Practice for future VM deployments + +--- + +### Plan 2: Docker Security Audit + +**Complexity:** MEDIUM +**Duration:** 2-4 hours +**Risk:** LOW (read-only analysis) + +#### Audit Checklist + +Create `playbooks/audit_docker.yml`: + +```yaml +--- +- name: Docker Security Audit + hosts: kvm_guests + become: yes + gather_facts: yes + tasks: + - name: Check if Docker is installed + command: which docker + register: docker_installed + failed_when: false + changed_when: false + + - block: + - name: Get Docker version + command: docker version --format '{{ "{{" }}.Server.Version{{ "}}" }}' + register: docker_version + changed_when: false + + - name: List running containers + command: docker ps --format '{{ "{{" }}.Names{{ "}}" }}\t{{ "{{" }}.Image{{ "}}" }}\t{{ "{{" }}.Status{{ "}}" }}' + register: docker_containers + changed_when: false + + - name: Check for privileged containers + shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: Privileged={{ "{{" }}.HostConfig.Privileged{{ "}}" }}' + register: privileged_containers + changed_when: false + failed_when: false + + - name: Check container resource limits + shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: Memory={{ "{{" }}.HostConfig.Memory{{ "}}" }} CPUs={{ "{{" }}.HostConfig.NanoCpus{{ "}}" }}' + register: resource_limits + changed_when: false + failed_when: false + + - name: Check Docker daemon configuration + command: docker info --format '{{ "{{" }}.SecurityOptions{{ "}}" }}' + register: security_options + changed_when: false + + - name: Check for Docker socket exposure + stat: + path: /var/run/docker.sock + register: docker_socket + + - name: Check Docker socket permissions + shell: ls -la /var/run/docker.sock 
+ register: socket_perms + changed_when: false + when: docker_socket.stat.exists + + - name: List Docker networks + command: docker network ls + register: docker_networks + changed_when: false + + - name: Check for host network mode containers + shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: NetworkMode={{ "{{" }}.HostConfig.NetworkMode{{ "}}" }}' + register: network_modes + changed_when: false + failed_when: false + + - name: Display audit results + debug: + msg: + - "=== Docker Security Audit ===" + - "Docker Version: {{ docker_version.stdout }}" + - "Running Containers:" + - "{{ docker_containers.stdout_lines }}" + - "" + - "Privileged Containers:" + - "{{ privileged_containers.stdout_lines | default(['None']) }}" + - "" + - "Resource Limits:" + - "{{ resource_limits.stdout_lines | default(['None configured']) }}" + - "" + - "Security Options:" + - "{{ security_options.stdout }}" + - "" + - "Docker Socket: {{ socket_perms.stdout | default('Not found') }}" + - "" + - "Network Modes:" + - "{{ network_modes.stdout_lines | default(['None']) }}" + + when: docker_installed.rc == 0 +``` + +#### Security Hardening Recommendations + +Based on audit findings, apply these hardening measures: + +1. **Restrict Docker Socket Access** + ```bash + chmod 660 /var/run/docker.sock + chown root:docker /var/run/docker.sock + ``` + +2. **Enable User Namespaces** + In `/etc/docker/daemon.json`: + ```json + { + "userns-remap": "default" + } + ``` + +3. **Configure Resource Limits (Mailcow example)** + ```yaml + # docker-compose.yml + services: + postfix: + mem_limit: 512m + cpus: 0.5 + ``` + +4. **Disable Privileged Containers** (review necessity) +5. **Enable AppArmor/SELinux profiles** +6. 
**Configure logging**: + ```json + { + "log-driver": "json-file", + "log-opts": { + "max-size": "10m", + "max-file": "3" + } + } + ``` + +--- + +### Plan 3: Swap Configuration for Pihole + +**Complexity:** LOW +**Duration:** 10 minutes +**Risk:** LOW +**Downtime:** None (can be done live) + +#### Quick Fix: Swap File + +Create `playbooks/configure_swap.yml`: + +```yaml +--- +- name: Configure Swap on Systems Without It + hosts: kvm_guests + become: yes + vars: + swap_file_path: /swapfile + swap_size_mb: 2048 # 2GB + tasks: + - name: Check current swap + command: swapon --show + register: current_swap + changed_when: false + failed_when: false + + - name: Check if swap file exists + stat: + path: "{{ swap_file_path }}" + register: swap_file + + - block: + - name: Create swap file + command: dd if=/dev/zero of={{ swap_file_path }} bs=1M count={{ swap_size_mb }} + args: + creates: "{{ swap_file_path }}" + + - name: Set swap file permissions + file: + path: "{{ swap_file_path }}" + mode: '0600' + owner: root + group: root + + - name: Format swap file + command: mkswap {{ swap_file_path }} + when: not swap_file.stat.exists + + - name: Enable swap file + command: swapon {{ swap_file_path }} + when: swap_file_path not in current_swap.stdout + + - name: Add swap to fstab + lineinfile: + path: /etc/fstab + line: "{{ swap_file_path }} none swap sw 0 0" + state: present + backup: yes + + - name: Verify swap is active + command: swapon --show + register: new_swap + changed_when: false + + - name: Display swap status + debug: + var: new_swap.stdout_lines + + when: current_swap.stdout | trim | length == 0 # run the block only when no swap is active +``` + +Execute: +```bash +ansible-playbook playbooks/configure_swap.yml --limit pihole +``` + +--- + +### Plan 4: Derp VM Recovery + +**Complexity:** MEDIUM +**Duration:** 30-60 minutes +**Risk:** MEDIUM + +#### Diagnostic Steps + +1. 
**Verify VM state:** + ```bash + ansible grokbox -b -m shell -a "virsh list --all" + ansible grokbox -b -m shell -a "virsh domstate derp" + ``` + +2. **If VM is shut off, start it:** + ```bash + ansible grokbox -b -m shell -a "virsh start derp" + ``` + +3. **Check console access:** + ```bash + ssh grokbox "virsh console derp" + # Press Enter to get login prompt + # Login as root + ``` + +4. **From console, diagnose:** + ```bash + # Check network + ip addr show + ip route show + ping -c 3 192.168.122.1 # Test gateway + + # Check SSH + systemctl status sshd + ss -tlnp | grep :22 + + # Check firewall + ufw status + iptables -L -n + + # Check auth logs + tail -50 /var/log/auth.log # Debian + ``` + +5. **Deploy SSH key (from console):** + ```bash + # Create ansible user if needed + useradd -m -s /bin/bash ansible + mkdir -p /home/ansible/.ssh + chmod 700 /home/ansible/.ssh + + # Add public key (paste manually via console) + cat > /home/ansible/.ssh/authorized_keys << 'EOF' + ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILBrnivsqjhAxWYeuuvnYc3neeRRuHsr2SjeKv+Drtpu user@debian + EOF + + chmod 600 /home/ansible/.ssh/authorized_keys + chown -R ansible:ansible /home/ansible/.ssh + + # Configure sudo + echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible + chmod 440 /etc/sudoers.d/ansible + ``` + +6. 
**Test connectivity:** + ```bash + ansible derp -m ping + ``` + +--- + +## Priority Matrix + +### Critical (Fix Immediately) + +| Issue | Host | Impact | ETA | +|-------|------|--------|-----| +| No swap configured | pihole | OOM risk | 10min | +| derp unreachable | derp | Cannot manage | 30-60min | + +### High Priority (Fix This Week) + +| Issue | Host | Impact | ETA | +|-------|------|--------|-----| +| No LVM | pihole | Non-compliant, inflexible | 2-4hrs | +| QEMU agent missing | mymx, derp | Limited VM management | 15min | +| Resource pressure | mymx | Performance degradation risk | Ongoing monitoring | + +### Medium Priority (Fix This Month) + +| Issue | Host | Impact | ETA | +|-------|------|--------|-----| +| Docker security unknown | pihole, mymx | Potential vulnerabilities | 2-4hrs | +| Dynamic inventory warnings | All | Compatibility issues | 1hr | +| Heavy services load | mymx | Capacity planning | Ongoing | + +### Low Priority (Plan for Future) + +| Issue | Host | Impact | ETA | +|-------|------|--------|-----| +| YaCy resource usage | mymx | Optimization opportunity | TBD | + +--- + +## Execution Timeline + +### Week 1 (Nov 11-15, 2025) + +**Day 1 (Today):** +- ✅ Deploy SSH keys to mymx (COMPLETED) +- ⏳ Recover derp VM access +- ⏳ Configure swap on pihole +- ⏳ Install qemu-guest-agent on all VMs + +**Day 2:** +- Run Docker security audit on pihole and mymx +- Review findings and create hardening plan +- Fix dynamic inventory warnings + +**Day 3:** +- Implement Docker hardening recommendations +- Document current system state + +### Week 2 (Nov 18-22, 2025) + +**Planning:** +- Plan pihole LVM migration (or document exception) +- Schedule maintenance window +- Create backup procedures + +**Execution:** +- Pihole migration (if approved) +- Validation and testing + +### Week 3 (Nov 25-29, 2025) + +- Monitor mymx resource usage +- Capacity planning analysis +- Update documentation + +--- + +## Monitoring and Validation + +### Success Criteria + +1. 
**Connectivity:** All 3 VMs accessible via Ansible +2. **Swap:** All VMs have minimum 1GB swap configured +3. **LVM:** All VMs using LVM or documented exception +4. **QEMU Agent:** All VMs have guest agent running +5. **Docker:** Security audit completed, critical findings addressed +6. **Documentation:** All exceptions and configurations documented + +### Validation Commands + +```bash +# Test connectivity +ansible kvm_guests -m ping + +# Check swap +ansible kvm_guests -b -m shell -a "swapon --show" + +# Check LVM +ansible kvm_guests -b -m shell -a "pvs && vgs && lvs" + +# Check QEMU agent +ansible kvm_guests -b -m systemd -a "name=qemu-guest-agent" + +# Run full system info gather +ansible-playbook playbooks/gather_system_info.yml +``` + +--- + +## Documentation Updates Required + +1. **Update CLAUDE.md:** + - Document any approved exceptions (e.g., pihole LVM) + - Add Docker security requirements + +2. **Update inventory:** + - Document derp issues and resolution + - Note mymx resource constraints + +3. **Create runbook:** + - VM recovery procedures + - Swap configuration standard + - Docker hardening checklist + +--- + +## Lessons Learned + +1. **SSH Key Management:** Need automated key deployment for new VMs + - Recommendation: Include in deploy_linux_vm role cloud-init + +2. **QEMU Guest Agent:** Should be standard in cloud-init + - Recommendation: Add to deploy_linux_vm role templates + +3. **LVM Enforcement:** Need validation in system_info role + - Recommendation: Add CLAUDE.md compliance check + +4. 
**Monitoring Needed:** Resource usage trends not tracked + - Recommendation: Implement monitoring role (Prometheus + node_exporter) + +--- + +## Appendix A: Commands Reference + +### Quick Diagnostics +```bash +# Check all VMs status +ansible kvm_guests -m ping + +# Get system resources +ansible kvm_guests -b -m shell -a "free -h && df -h" + +# Check running services +ansible kvm_guests -b -m shell -a "systemctl list-units --type=service --state=running" + +# Network info +ansible kvm_guests -b -m shell -a "ip -br addr" +``` + +### Emergency Access +```bash +# Console access if SSH fails +ssh grokbox "virsh console <vm_name>" + +# Force reboot (hard power-off, then start) +ssh grokbox "virsh destroy <vm_name> && virsh start <vm_name>" + +# Get VM details +ssh grokbox "virsh dominfo <vm_name>" +``` + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-11T02:30:00Z +**Next Review:** 2025-11-18 +**Owner:** Ansible Infrastructure Team