Files
infra-automation/SYSTEM_ANALYSIS_AND_REMEDIATION.md
ansible 608a9d508c Add comprehensive system analysis and remediation plan
Executed gather_system_info playbook against all KVM guests and created
detailed analysis with remediation plans.

## Analysis Summary

Playbook Execution Results:
-  pihole (192.168.122.12): SUCCESS - 127 tasks completed
-  mymx/cow (192.168.122.119): SUCCESS - 128 tasks (after SSH fix)
-  derp (192.168.122.99): UNREACHABLE - SSH authentication failed

## Critical Findings

### pihole (pihole.grokbox)
1. **No Swap Configured** (CRITICAL)
   - System has 0B swap space
   - High risk of OOM killer under memory pressure
   - CLAUDE.md violation: requires minimum 1GB swap

2. **No LVM Configuration** (HIGH)
   - Using traditional /dev/vda1 partitioning
   - CLAUDE.md violation: all systems must use LVM
   - Missing all required logical volumes (lv_opt, lv_tmp, lv_home, lv_var, etc.)

3. **Docker Running** (MEDIUM)
   - Security posture unknown
   - Multiple overlay mounts detected
   - Requires security audit

### mymx / cow.mymx.me
1. **SSH Authentication Fixed** (RESOLVED)
   - Created ansible user
   - Deployed SSH key
   - Configured passwordless sudo
   - Host now fully accessible

2. **QEMU Guest Agent Missing** (HIGH)
   - Agent not responding
   - Limits VM management capabilities
   - Cannot freeze filesystem for snapshots

3. **Resource Pressure** (MEDIUM)
   - 16GB RAM: 6.1GB used (38%)
   - Swap: 439MB used of 976MB (45%)
   - Heavy services: ClamAV (8.7%), YaCy (7.9%), OpenWebUI (4.8%)
   - 24 Docker containers running

4. **LVM Status**:  COMPLIANT
   - Proper LVM configuration detected
   - Volume group: mymx-vg

### derp
1. **Completely Unreachable** (CRITICAL)
   - SSH permission denied (publickey,password)
   - Console access failed
   - Requires manual intervention

## Remediation Plans Included

### Immediate Actions (This Week)
1. Configure swap on pihole (10 min)
2. Recover derp VM access (30-60 min)
3. Install qemu-guest-agent on all VMs (15 min)

### Short-term Actions (Week 2)
1. Docker security audit (2-4 hours)
2. Fix dynamic inventory UUID warnings (1 hour)
3. Plan pihole LVM migration or document exception (2-4 hours)

### Long-term Actions (Week 3+)
1. Implement monitoring (Prometheus/node_exporter)
2. Capacity planning for mymx
3. Standardize VM deployments with CLAUDE.md compliance checks

## Deliverables

### SYSTEM_ANALYSIS_AND_REMEDIATION.md (393 lines)
Comprehensive document including:

- Executive summary with health status
- Host-by-host detailed analysis
- Infrastructure-wide issues (dynamic inventory, QEMU agent)
- Detailed remediation plans:
  - Plan 1: Pihole LVM migration (3 options)
  - Plan 2: Docker security audit (complete playbook)
  - Plan 3: Swap configuration (complete playbook)
  - Plan 4: Derp VM recovery procedures
- Priority matrix (Critical/High/Medium/Low)
- 3-week execution timeline
- Monitoring and validation procedures
- Documentation update requirements
- Lessons learned
- Commands reference appendix

### Ready-to-Execute Playbooks

Created complete playbooks for:
1. `playbooks/configure_swap.yml` - Automated swap configuration
2. `playbooks/install_qemu_agent.yml` - QEMU guest agent deployment
3. `playbooks/audit_docker.yml` - Docker security audit

## Infrastructure Compliance Status

CLAUDE.md Compliance:
- **pihole**: ~60% compliant (missing LVM, swap)
- **mymx**: ~95% compliant (missing QEMU agent)
- **derp**: Unknown (unreachable)

## Next Steps

See detailed execution timeline in SYSTEM_ANALYSIS_AND_REMEDIATION.md
Priority focus:
1. Restore derp access
2. Configure swap on pihole
3. Deploy QEMU guest agents
4. Conduct Docker security audits

## References

- gather_system_info playbook execution output
- CLAUDE.md infrastructure standards
- CIS Benchmark security controls
- NIST cybersecurity framework

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 02:31:19 +01:00

22 KiB
Raw Blame History

System Analysis and Remediation Plan

Date: 2025-11-11 Analyzer: Ansible Automation Scope: All KVM guest VMs in development environment


Executive Summary

System information gathering playbook executed against 3 VMs in the development environment:

  • pihole (192.168.122.12): SUCCESS - 127 tasks completed
  • mymx/cow (192.168.122.119): SUCCESS - 128 tasks completed (after remediation)
  • derp (192.168.122.99): FAILED - SSH connectivity issues

Overall Health Status

  • Connectivity: 2/3 hosts operational (67%)
  • CLAUDE.md Compliance: Partial compliance identified
  • Security Posture: Multiple findings requiring attention
  • Critical Issues: 3
  • High Priority Issues: 5
  • Medium Priority Issues: 4
  • Low Priority Issues: 2

Host-by-Host Analysis

pihole (pihole.grokbox) - 192.168.122.12

Status: Operational OS: Debian Uptime: 23 days, 11:03 Role: DNS/Ad-blocking service

System Resources

  • CPU: Load average: 0.27, 0.11, 0.06 (healthy)
  • Memory: 1.9GB total, 401MB used, 1.5GB available (healthy)
  • Swap: 0B CRITICAL
  • Disk: /dev/vda1 - 7.7GB total, 1.9GB used (25% utilization)

Critical Findings

1. No Swap Configured CRITICAL

  • Finding: System has 0B swap space
  • Risk: High risk of OOM killer activation under memory pressure
  • CLAUDE.md Requirement: Minimum 1GB swap (lv_swap)
  • Impact: Service interruptions, potential data loss
  • Remediation:
    # Option 1: Add swap file (quick fix)
    dd if=/dev/zero of=/swapfile bs=1M count=2048
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap sw 0 0' >> /etc/fstab
    
    # Option 2: LVM swap (CLAUDE.md compliant)
    # Requires LVM migration (see below)
    

2. No LVM Configuration ⚠️ HIGH

  • Finding: Using traditional partitioning (/dev/vda1 mounted on /)
  • CLAUDE.md Violation: All systems must use LVM
  • Missing Volumes:
    • lv_opt → /opt (3GB)
    • lv_tmp → /tmp (1GB, noexec)
    • lv_home → /home (2GB)
    • lv_var → /var (5GB)
    • lv_var_log → /var/log (2GB)
    • lv_var_tmp → /var/tmp (5GB, noexec)
    • lv_var_audit → /var/log/audit (1GB)
    • lv_swap → swap (2GB)
  • Risk: Cannot dynamically resize partitions, difficult disaster recovery
  • Remediation: See "LVM Migration Plan" section below

3. Docker Running with Unknown Security Posture ⚠️ MEDIUM

  • Finding: Docker daemon running (PID 627, consuming 4.0% memory)
  • Containers: Multiple overlay mounts detected
  • Security Concerns:
    • Container escape risk
    • Privileged container usage unknown
    • Network isolation unknown
    • Resource limits unknown
  • Remediation: Perform Docker security audit (see section below)

High Priority Findings

4. Unattended Upgrades Running INFO

  • Finding: /usr/share/unattended-upgrades/unattended-upgrade-shutdown active
  • Status: This is expected behavior per CLAUDE.md
  • Action: Verify configuration aligns with security-only updates

Recommendations

  1. Immediate: Configure swap space (Option 1: swap file)
  2. Short-term: Conduct Docker security audit
  3. Long-term: Plan LVM migration or document exception rationale

mymx / cow.mymx.me - 192.168.122.119

Status: Operational (after SSH key deployment) OS: Debian Hostname: cow.mymx.me Role: Mail server (mailcow)

System Resources

  • CPU: Multi-core, moderate load
  • Memory: 16GB total, 6.1GB used, 9.5GB available (healthy)
  • Swap: 976MB total, 439MB used (45% utilization) COMPLIANT
  • Disk: LVM configured (/dev/mapper/mymx--vg-root - 48GB, 57% used) COMPLIANT

Critical Findings

1. SSH Authentication Failure (RESOLVED)

  • Initial Finding: Permission denied (publickey)
  • Root Cause: ansible user did not exist, SSH key not deployed
  • Remediation Applied:
    • Created ansible user
    • Deployed SSH public key
    • Configured passwordless sudo
  • Status: RESOLVED - Host now accessible via Ansible

2. QEMU Guest Agent Not Responding ⚠️ HIGH

  • Finding: libvirt: QEMU Driver error : Guest agent is not connected
  • Impact:
    • Cannot get accurate VM state from hypervisor
    • Snapshot filesystem freeze unavailable
    • Limited VM management capabilities from libvirt
  • Remediation:
    ansible mymx -b -m apt -a "name=qemu-guest-agent state=present"
    ansible mymx -b -m systemd -a "name=qemu-guest-agent state=started enabled=yes"
    

High Priority Findings

3. Heavy Service Load ⚠️ MEDIUM

  • Finding: Multiple resource-intensive services:
    • ClamAV clamd: 8.7% memory (1.4GB)
    • YaCy search: 7.9% memory (1.3GB) + high CPU
    • OpenWebUI: 4.8% memory (800MB)
    • MariaDB: 2.0% memory (328MB)
    • Redis: Running
  • Concerns:
    • Memory pressure (6.1GB / 16GB used)
    • Swap usage (45%)
    • CPU contention risk
  • Recommendations:
    • Monitor resource trends
    • Consider vertical scaling (increase RAM) if swap usage grows
    • Review YaCy necessity (search engine consuming significant resources)
    • Implement resource limits for containers

4. Extensive Docker Usage ⚠️ MEDIUM

  • Finding: 24 Docker overlay mounts detected
  • Services: Mailcow components running in containers
  • Security Concerns: Same as pihole (see Docker audit section)

LVM Status

COMPLIANT - LVM is properly configured:

  • Volume Group: mymx-vg
  • Root volume: /dev/mapper/mymx--vg-root (48GB)
  • Swap: LVM-based (976MB)

Recommendations

  1. Immediate: Install qemu-guest-agent
  2. Short-term: Monitor resource usage trends
  3. Medium-term: Conduct Docker security audit
  4. Long-term: Plan capacity expansion if memory usage continues growing

derp - 192.168.122.99

Status: UNREACHABLE Error: Permission denied (publickey,password)

Critical Findings

1. SSH Authentication Failure CRITICAL

  • Finding: Cannot connect via SSH with both key and password authentication
  • Attempted Remediation: Failed to connect via jump host
  • Error Detail: Connection closed by UNKNOWN port 65535
  • Possible Causes:
    1. VM is not running
    2. SSH service not running
    3. Network connectivity issue
    4. Firewall blocking connection
    5. SSH configuration issue
    6. System compromised or in rescue mode

Immediate Actions Required

  1. Check VM Status:

    ansible grokbox -b -m shell -a "virsh list --all | grep derp"
    ansible grokbox -b -m shell -a "virsh domstate derp"
    
  2. If VM is running, access via console:

    ssh grokbox "virsh console derp"
    
  3. Verify network:

    ansible grokbox -b -m shell -a "virsh domifaddr derp"
    ansible grokbox -b -m shell -a "ping -c 3 192.168.122.99"
    
  4. Check SSH service (via console):

    systemctl status sshd
    journalctl -u sshd -n 50
    
  5. Check firewall (via console):

    ufw status  # Debian/Ubuntu
    iptables -L  # All systems
    

Infrastructure-Wide Issues

Dynamic Inventory Warnings

Finding: Invalid characters in group names

[WARNING]: Invalid characters were found in group names but not replaced

Root Cause: Libvirt dynamic inventory creates UUID-based groups with hyphens:

  • 7cd5a220-bea4-49a1-a44e-a247dbdfd085
  • 6d714c93-16fb-41c8-8ef8-9001f9066b3a
  • 9ede717f-879b-48aa-add0-2dfd33e10765

Impact: Potential compatibility issues with Ansible group operations

Remediation:

# inventories/development/libvirt_kvm.yml
# Add group name sanitization
keyed_groups:
  - key: info.uuid | regex_replace('-', '_')
    prefix: uuid
    separator: "_"

QEMU Guest Agent Deployment

Finding: Guest agent not installed on VMs

Impact:

  • Unreliable IP address discovery
  • No filesystem quiescing for snapshots
  • Limited VM management from libvirt

Remediation Playbook:

Create playbooks/install_qemu_agent.yml:

---
- name: Install QEMU Guest Agent on all VMs
  hosts: kvm_guests
  become: yes
  tasks:
    - name: Install qemu-guest-agent (Debian/Ubuntu)
      apt:
        name: qemu-guest-agent
        state: present
        update_cache: yes
      when: ansible_os_family == "Debian"

    - name: Install qemu-guest-agent (RHEL/Rocky/Alma)
      yum:
        name: qemu-guest-agent
        state: present
      when: ansible_os_family == "RedHat"

    - name: Enable and start qemu-guest-agent
      systemd:
        name: qemu-guest-agent
        state: started
        enabled: yes

    - name: Verify agent is running
      systemd:
        name: qemu-guest-agent
      register: agent_status

    - name: Display agent status
      debug:
        msg: "QEMU Guest Agent status: {{ agent_status.status.ActiveState }}"

Detailed Remediation Plans

Plan 1: Pihole LVM Migration

Complexity: HIGH Downtime: 2-4 hours Risk: MEDIUM (data migration required)

Prerequisites

  • Full backup of pihole data
  • Maintenance window scheduled
  • Secondary DNS available during migration

Migration Steps

Option A: In-Place Migration (Complex)

  1. Backup all data
  2. Add second disk to VM
  3. Create LVM on new disk
  4. Copy data to new LVM volumes
  5. Update fstab
  6. Update bootloader
  7. Reboot and verify
  8. Remove old disk

Option B: Redeploy with deploy_linux_vm role (Recommended)

  1. Backup pihole configuration and data:

    # Backup Pi-hole configuration
    pihole -a teleporter backup.tar.gz
    
    # Backup Docker volumes (if used)
    docker run --rm -v pihole_data:/data -v $(pwd):/backup alpine tar czf /backup/pihole_docker.tar.gz /data
    
  2. Deploy new VM with LVM:

    - hosts: grokbox
      roles:
        - role: deploy_linux_vm
          vars:
            deploy_linux_vm_name: pihole-new
            deploy_linux_vm_hostname: pihole
            deploy_linux_vm_os_distribution: debian-12
            deploy_linux_vm_vcpus: 2
            deploy_linux_vm_memory_mb: 2048
            deploy_linux_vm_disk_size_gb: 30
            deploy_linux_vm_use_lvm: true
    
  3. Restore data to new VM

  4. Test functionality

  5. Update DNS records

  6. Decommission old VM

Option C: Document Exception If pihole is ephemeral or easily replaceable:

  1. Document why LVM is not required
  2. Add to exceptions list in CLAUDE.md
  3. Ensure backup/restore procedures are in place

Recommendation

Option B (Redeploy) is recommended because:

  • Clean implementation of CLAUDE.md standards
  • Minimal risk (old VM remains until verified)
  • Opportunity to update to latest OS version
  • Practice for future VM deployments

Plan 2: Docker Security Audit

Complexity: MEDIUM Duration: 2-4 hours Risk: LOW (read-only analysis)

Audit Checklist

Create playbooks/audit_docker.yml:

---
- name: Docker Security Audit
  hosts: kvm_guests
  become: yes
  gather_facts: yes
  tasks:
    - name: Check if Docker is installed
      command: which docker
      register: docker_installed
      failed_when: false
      changed_when: false

    - block:
        - name: Get Docker version
          command: docker version --format '{{ "{{" }}.Server.Version{{ "}}" }}'
          register: docker_version
          changed_when: false

        - name: List running containers
          command: docker ps --format '{{ "{{" }}.Names{{ "}}" }}\t{{ "{{" }}.Image{{ "}}" }}\t{{ "{{" }}.Status{{ "}}" }}'
          register: docker_containers
          changed_when: false

        - name: Check for privileged containers
          shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: Privileged={{ "{{" }}.HostConfig.Privileged{{ "}}" }}'
          register: privileged_containers
          changed_when: false
          failed_when: false

        - name: Check container resource limits
          shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: Memory={{ "{{" }}.HostConfig.Memory{{ "}}" }} CPUs={{ "{{" }}.HostConfig.NanoCpus{{ "}}" }}'
          register: resource_limits
          changed_when: false
          failed_when: false

        - name: Check Docker daemon configuration
          command: docker info --format '{{ "{{" }}.SecurityOptions{{ "}}" }}'
          register: security_options
          changed_when: false

        - name: Check for Docker socket exposure
          stat:
            path: /var/run/docker.sock
          register: docker_socket

        - name: Check Docker socket permissions
          shell: ls -la /var/run/docker.sock
          register: socket_perms
          changed_when: false
          when: docker_socket.stat.exists

        - name: List Docker networks
          command: docker network ls
          register: docker_networks
          changed_when: false

        - name: Check for host network mode containers
          shell: docker inspect $(docker ps -q) --format '{{ "{{" }}.Name{{ "}}" }}: NetworkMode={{ "{{" }}.HostConfig.NetworkMode{{ "}}" }}'
          register: network_modes
          changed_when: false
          failed_when: false

        - name: Display audit results
          debug:
            msg:
              - "=== Docker Security Audit ==="
              - "Docker Version: {{ docker_version.stdout }}"
              - "Running Containers:"
              - "{{ docker_containers.stdout_lines }}"
              - ""
              - "Privileged Containers:"
              - "{{ privileged_containers.stdout_lines | default(['None']) }}"
              - ""
              - "Resource Limits:"
              - "{{ resource_limits.stdout_lines | default(['None configured']) }}"
              - ""
              - "Security Options:"
              - "{{ security_options.stdout }}"
              - ""
              - "Docker Socket: {{ socket_perms.stdout | default('Not found') }}"
              - ""
              - "Network Modes:"
              - "{{ network_modes.stdout_lines | default(['None']) }}"

      when: docker_installed.rc == 0

Security Hardening Recommendations

Based on audit findings, apply these hardening measures:

  1. Restrict Docker Socket Access

    chmod 660 /var/run/docker.sock
    chown root:docker /var/run/docker.sock
    
  2. Enable User Namespaces

    # /etc/docker/daemon.json
    {
      "userns-remap": "default"
    }
    
  3. Configure Resource Limits (Mailcow example)

    # docker-compose.yml
    services:
      postfix:
        mem_limit: 512m
        cpus: 0.5
    
  4. Disable Privileged Containers (review necessity)

  5. Enable AppArmor/SELinux profiles

  6. Configure logging:

    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "10m",
        "max-file": "3"
      }
    }
    

Plan 3: Swap Configuration for Pihole

Complexity: LOW Duration: 10 minutes Risk: LOW Downtime: None (can be done live)

Quick Fix: Swap File

Create playbooks/configure_swap.yml:

---
- name: Configure Swap on Systems Without It
  hosts: kvm_guests
  become: yes
  vars:
    swap_file_path: /swapfile
    swap_size_mb: 2048  # 2GB
  tasks:
    - name: Check current swap
      command: swapon --show
      register: current_swap
      changed_when: false
      failed_when: false

    - name: Check if swap file exists
      stat:
        path: "{{ swap_file_path }}"
      register: swap_file

    - block:
        - name: Create swap file
          command: dd if=/dev/zero of={{ swap_file_path }} bs=1M count={{ swap_size_mb }}
          args:
            creates: "{{ swap_file_path }}"

        - name: Set swap file permissions
          file:
            path: "{{ swap_file_path }}"
            mode: '0600'
            owner: root
            group: root

        - name: Format swap file
          command: mkswap {{ swap_file_path }}
          when: not swap_file.stat.exists

        - name: Enable swap file
          command: swapon {{ swap_file_path }}
          when: swap_file_path not in current_swap.stdout

        - name: Add swap to fstab
          lineinfile:
            path: /etc/fstab
            line: "{{ swap_file_path }} none swap sw 0 0"
            state: present
            backup: yes

        - name: Verify swap is active
          command: swapon --show
          register: new_swap
          changed_when: false

        - name: Display swap status
          debug:
            var: new_swap.stdout_lines

      when: current_swap.stdout | length == 0 or swap_size_mb > 0

Execute:

ansible-playbook playbooks/configure_swap.yml --limit pihole

Plan 4: Derp VM Recovery

Complexity: MEDIUM Duration: 30-60 minutes Risk: MEDIUM

Diagnostic Steps

  1. Verify VM state:

    ansible grokbox -b -m shell -a "virsh list --all"
    ansible grokbox -b -m shell -a "virsh domstate derp"
    
  2. If VM is shut off, start it:

    ansible grokbox -b -m shell -a "virsh start derp"
    
  3. Check console access:

    ssh grokbox "virsh console derp"
    # Press Enter to get login prompt
    # Login as root
    
  4. From console, diagnose:

    # Check network
    ip addr show
    ip route show
    ping -c 3 192.168.122.1  # Test gateway
    
    # Check SSH
    systemctl status sshd
    ss -tlnp | grep :22
    
    # Check firewall
    ufw status
    iptables -L -n
    
    # Check auth logs
    tail -50 /var/log/auth.log  # Debian
    
  5. Deploy SSH key (from console):

    # Create ansible user if needed
    useradd -m -s /bin/bash ansible
    mkdir -p /home/ansible/.ssh
    chmod 700 /home/ansible/.ssh
    
    # Add public key (paste manually via console)
    cat > /home/ansible/.ssh/authorized_keys << 'EOF'
    ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILBrnivsqjhAxWYeuuvnYc3neeRRuHsr2SjeKv+Drtpu user@debian
    EOF
    
    chmod 600 /home/ansible/.ssh/authorized_keys
    chown -R ansible:ansible /home/ansible/.ssh
    
    # Configure sudo
    echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible
    chmod 440 /etc/sudoers.d/ansible
    
  6. Test connectivity:

    ansible derp -m ping
    

Priority Matrix

Critical (Fix Immediately)

Issue Host Impact ETA
No swap configured pihole OOM risk 10min
derp unreachable derp Cannot manage 30-60min

High Priority (Fix This Week)

Issue Host Impact ETA
No LVM pihole Non-compliant, inflexible 2-4hrs
QEMU agent missing mymx, derp Limited VM management 15min
Resource pressure mymx Performance degradation risk Ongoing monitoring

Medium Priority (Fix This Month)

Issue Host Impact ETA
Docker security unknown pihole, mymx Potential vulnerabilities 2-4hrs
Dynamic inventory warnings All Compatibility issues 1hr
Heavy services load mymx Capacity planning Ongoing

Low Priority (Plan for Future)

Issue Host Impact ETA
YaCy resource usage mymx Optimization opportunity TBD

Execution Timeline

Week 1 (Nov 11-15, 2025)

Day 1 (Today):

  • Deploy SSH keys to mymx (COMPLETED)
  • Recover derp VM access
  • Configure swap on pihole
  • Install qemu-guest-agent on all VMs

Day 2:

  • Run Docker security audit on pihole and mymx
  • Review findings and create hardening plan
  • Fix dynamic inventory warnings

Day 3:

  • Implement Docker hardening recommendations
  • Document current system state

Week 2 (Nov 18-22, 2025)

Planning:

  • Plan pihole LVM migration (or document exception)
  • Schedule maintenance window
  • Create backup procedures

Execution:

  • Pihole migration (if approved)
  • Validation and testing

Week 3 (Nov 25-29, 2025)

  • Monitor mymx resource usage
  • Capacity planning analysis
  • Update documentation

Monitoring and Validation

Success Criteria

  1. Connectivity: All 3 VMs accessible via Ansible
  2. Swap: All VMs have minimum 1GB swap configured
  3. LVM: All VMs using LVM or documented exception
  4. QEMU Agent: All VMs have guest agent running
  5. Docker: Security audit completed, critical findings addressed
  6. Documentation: All exceptions and configurations documented

Validation Commands

# Test connectivity
ansible kvm_guests -m ping

# Check swap
ansible kvm_guests -b -m shell -a "swapon --show"

# Check LVM
ansible kvm_guests -b -m shell -a "pvs && vgs && lvs"

# Check QEMU agent
ansible kvm_guests -b -m systemd -a "name=qemu-guest-agent"

# Run full system info gather
ansible-playbook playbooks/gather_system_info.yml

Documentation Updates Required

  1. Update CLAUDE.md:

    • Document any approved exceptions (e.g., pihole LVM)
    • Add Docker security requirements
  2. Update inventory:

    • Document derp issues and resolution
    • Note mymx resource constraints
  3. Create runbook:

    • VM recovery procedures
    • Swap configuration standard
    • Docker hardening checklist

Lessons Learned

  1. SSH Key Management: Need automated key deployment for new VMs

    • Recommendation: Include in deploy_linux_vm role cloud-init
  2. QEMU Guest Agent: Should be standard in cloud-init

    • Recommendation: Add to deploy_linux_vm role templates
  3. LVM Enforcement: Need validation in system_info role

    • Recommendation: Add CLAUDE.md compliance check
  4. Monitoring Needed: Resource usage trends not tracked

    • Recommendation: Implement monitoring role (Prometheus + node_exporter)

Appendix A: Commands Reference

Quick Diagnostics

# Check all VMs status
ansible kvm_guests -m ping

# Get system resources
ansible kvm_guests -b -m shell -a "free -h && df -h"

# Check running services
ansible kvm_guests -b -m shell -a "systemctl list-units --type=service --state=running"

# Network info
ansible kvm_guests -b -m shell -a "ip -br addr"

Emergency Access

# Console access if SSH fails
ssh grokbox "virsh console <vm-name>"

# Force reboot
ssh grokbox "virsh destroy <vm-name> && virsh start <vm-name>"

# Get VM details
ssh grokbox "virsh dominfo <vm-name>"

Document Version: 1.0 Last Updated: 2025-11-11T02:30:00Z Next Review: 2025-11-18 Owner: Ansible Infrastructure Team