# Troubleshooting Guide

## Overview

This guide covers common issues encountered in Ansible-managed infrastructure and the commands used to diagnose and fix them.

**Last Updated**: 2025-11-11
**Document Owner**: Operations Team

---

## Table of Contents

1. [Ansible Execution Issues](#ansible-execution-issues)
2. [SSH and Connectivity](#ssh-and-connectivity)
3. [VM Deployment Issues](#vm-deployment-issues)
4. [System Information Collection](#system-information-collection)
5. [Storage and LVM Issues](#storage-and-lvm-issues)
6. [Security and Firewall](#security-and-firewall)
7. [Performance Issues](#performance-issues)
8. [General Diagnostics](#general-diagnostics)

---

## Ansible Execution Issues

### Issue: "Failed to connect to the host via SSH"

**Symptoms**: Cannot connect to target hosts

**Causes**:
- SSH key not authorized
- Wrong SSH user
- Host unreachable
- SSH service not running

**Solutions**:

```bash
# 1. Test connectivity
ping

# 2. Test SSH manually
ssh ansible@

# 3. Verify SSH service on target
ansible -m shell -a "systemctl status sshd" -u root --ask-pass

# 4. Check SSH key is authorized
ansible -m authorized_key \
  -a "user=ansible key='{{ lookup('file', '~/.ssh/id_rsa.pub') }}' state=present" \
  -u root --ask-pass

# 5. Verify ansible user exists
ansible -m shell -a "id ansible" -u root --ask-pass
```

### Issue: "Permission denied" or "sudo: a password is required"

**Symptoms**: Tasks fail due to insufficient permissions

**Causes**:
- ansible user lacks sudo permissions
- `become: yes` not specified
- Incorrect sudo configuration

**Solutions**:

```bash
# 1. Verify sudo permissions
ssh ansible@ "sudo -l"

# 2. Check sudoers configuration
ssh ansible@ "sudo cat /etc/sudoers.d/ansible"

# 3. Fix sudoers if needed (as root)
#    In an interactive bash session, run "set +H" first so that "!requiretty"
#    is not treated as a history expansion
ssh root@ "cat > /etc/sudoers.d/ansible <<'EOF'
ansible ALL=(ALL) NOPASSWD: ALL
Defaults:ansible !requiretty
EOF"

# 4. Ensure become is set in playbook
# Add to playbook:
#   become: yes
```

### Issue: "Module not found" or "No module named..."
**Symptoms**: Python module import errors

**Causes**:
- Python dependencies missing on control node or target
- Wrong Python interpreter

**Solutions**:

```bash
# On control node
pip3 install -r requirements.txt

# On target hosts
ansible all -m package -a "name=python3,python3-pip state=present" --become

# Specify Python interpreter
ansible all -m setup -a "filter=ansible_python_version" \
  -e "ansible_python_interpreter=/usr/bin/python3"
```

---

## SSH and Connectivity

### Issue: "UNREACHABLE!" for all hosts

**Symptoms**: Cannot reach any hosts in inventory

**Causes**:
- Network connectivity issues
- DNS resolution failures
- Firewall blocking SSH
- Incorrect inventory configuration

**Solutions**:

```bash
# 1. Verify inventory syntax
ansible-inventory --list -i inventories/production

# 2. Test DNS resolution
ansible all -m shell -a "hostname -f" -i inventories/production

# 3. Test network connectivity
ansible all -m ping -i inventories/production

# 4. Check SSH port accessibility
nmap -p 22

# 5. Verify inventory file paths
ansible all --list-hosts -i inventories/production
```

### Issue: SSH connection hangs or times out

**Symptoms**: SSH attempts time out or hang indefinitely

**Causes**:
- Network latency
- SSH idle timeout
- Firewall dropping connections
- MTU issues

**Solutions**:

```bash
# 1. Increase SSH timeout in ansible.cfg
[defaults]
timeout = 60

# 2. Enable SSH keepalive in ansible.cfg
[ssh_connection]
ssh_args = -o ServerAliveInterval=60 -o ServerAliveCountMax=3

# 3. Test with verbose SSH
ssh -vvv ansible@

# 4. Check MTU issues
ping -M do -s 1472  # Should not fragment
```

---

## VM Deployment Issues

### Issue: VM fails to start after creation

**Symptoms**: VM shows "shut off" immediately after deployment

**Causes**:
- Insufficient resources on hypervisor
- Cloud-init ISO creation failed
- Invalid VM configuration

**Solutions**:

```bash
# 1. Check hypervisor resources
ansible hypervisor -m shell -a "free -h && df -h /var/lib/libvirt/images"

# 2.
# Check VM definition
ansible hypervisor -m shell -a "virsh dumpxml "

# 3. View libvirt logs
ansible hypervisor -m shell -a "tail -50 /var/log/libvirt/qemu/.log"

# 4. Start VM manually and check errors
ansible hypervisor -m shell -a "virsh start "

# 5. Check cloud-init ISO exists
ansible hypervisor -m shell -a "ls -lh /var/lib/libvirt/images/-cloud-init.iso"
```

### Issue: Cloud-init fails on first boot

**Symptoms**: Cannot SSH to VM, cloud-init errors in logs

**Causes**:
- Cloud-init configuration errors
- Network connectivity issues in VM
- Package installation failures

**Solutions**:

```bash
# 1. Access VM console
#    virsh console is interactive and needs a TTY; if the ad-hoc command hangs,
#    run virsh console from an SSH session on the hypervisor instead
ansible hypervisor -m shell -a "virsh console "
# Press Enter, login as root (if console password set)

# 2. Check cloud-init status
ssh ansible@ "cloud-init status --long"

# 3. View cloud-init logs
ssh ansible@ "tail -100 /var/log/cloud-init-output.log"

# 4. Re-run cloud-init modules
ssh ansible@ "sudo cloud-init clean && sudo cloud-init init"

# 5. Verify network connectivity in VM
ssh ansible@ "ping -c 3 8.8.8.8 && nslookup google.com"
```

### Issue: Cannot get VM IP address

**Symptoms**: `virsh domifaddr` returns no IP

**Causes**:
- VM networking not configured
- DHCP not working
- VM not fully booted

**Solutions**:

```bash
# 1. Wait for VM to boot completely
sleep 60

# 2. Check all network sources
ansible hypervisor -m shell -a "virsh domifaddr --source lease"
ansible hypervisor -m shell -a "virsh domifaddr --source agent"

# 3. Check DHCP leases
ansible hypervisor -m shell -a "virsh net-dhcp-leases default"

# 4. Check VM network configuration
ansible hypervisor -m shell -a "virsh domif-getlink vnet0"

# 5. Access via console to configure networking
#    (interactive; see the note on virsh console under cloud-init failures)
ansible hypervisor -m shell -a "virsh console "
```

---

## System Information Collection

### Issue: system_info role fails with "command not found"

**Symptoms**: Tasks fail due to missing commands (lshw, dmidecode, etc.)
**Causes**:
- Required packages not installed
- Package installation skipped

**Solutions**:

```bash
# 1. Run installation tasks
ansible-playbook site.yml -t system_info,install

# 2. Manually install packages
ansible all -m package -a "name=lshw,dmidecode,pciutils,usbutils state=present" --become

# 3. Verify packages installed
ansible all -m shell -a "which lshw dmidecode lspci"
```

### Issue: Statistics files not created

**Symptoms**: No JSON files in `./stats/machines/`

**Causes**:
- Directory permissions issues on control node
- Disk space full
- Export tasks not executed

**Solutions**:

```bash
# 1. Check directory exists and is writable
ls -la ./stats/machines/
mkdir -p ./stats/machines
chmod 755 ./stats/machines

# 2. Check disk space
df -h .

# 3. Run export tasks explicitly
ansible-playbook site.yml -t system_info,export

# 4. Check for errors in Ansible output
ansible-playbook site.yml -t system_info -vvv | tee ansible_debug.log
```

---

## Storage and LVM Issues

### Issue: LVM configuration fails on deployed VM

**Symptoms**: LVM post-deployment tasks fail

**Causes**:
- Second disk not attached to VM
- LVM tools not installed
- Physical volume creation failed

**Solutions**:

```bash
# 1. Verify second disk exists
ssh ansible@ "lsblk"

# 2. Check for /dev/vdb
ssh ansible@ "ls -l /dev/vdb"

# 3. Verify LVM packages
ssh ansible@ "which pvcreate vgcreate lvcreate"

# 4. Manually create PV if needed
ssh ansible@ "sudo pvcreate /dev/vdb"

# 5. Re-run LVM configuration
ansible-playbook site.yml -t deploy_linux_vm,lvm,post-deploy \
  -e "deploy_linux_vm_name="
```

### Issue: Disk full on hypervisor

**Symptoms**: VM deployment fails, "No space left on device"

**Causes**:
- Insufficient disk space in `/var/lib/libvirt/images`
- Too many cached cloud images
- Old VM disks not cleaned up

**Solutions**:

```bash
# 1. Check disk space
ansible hypervisor -m shell -a "df -h /var/lib/libvirt/images"

# 2.
# List all VM disks
ansible hypervisor -m shell -a "ls -lh /var/lib/libvirt/images/*.qcow2"

# 3. Remove old cloud images
ansible hypervisor -m shell -a "find /var/lib/libvirt/images -name '*-cloud*.qcow2' -mtime +90 -delete"

# 4. Remove unused VM disks (CAREFUL!)
# Verify VM is deleted first
ansible hypervisor -m shell -a "virsh list --all"
ansible hypervisor -m shell -a "rm /var/lib/libvirt/images/old-vm-*.qcow2"

# 5. Clean up libvirt storage pools
ansible hypervisor -m shell -a "virsh pool-refresh default && virsh vol-list default"
```

---

## Security and Firewall

### Issue: Cannot SSH to VM after deployment

**Symptoms**: SSH connection refused or times out

**Causes**:
- Firewall blocking SSH
- SSH service not running
- SSH keys not deployed correctly

**Solutions**:

```bash
# 1. Check if VM is running
ansible hypervisor -m shell -a "virsh list --all"

# 2. Access via hypervisor console
#    (virsh console is interactive and needs a TTY; run it from an SSH session
#    on the hypervisor if the ad-hoc command hangs)
ansible hypervisor -m shell -a "virsh console "

# 3. From console, check sshd status
systemctl status sshd

# 4. Check firewall rules
sudo ufw status                  # Debian/Ubuntu
sudo firewall-cmd --list-all     # RHEL/AlmaLinux

# 5. Allow SSH for troubleshooting
#    Both rules below persist across reboots; for a runtime-only rule on RHEL,
#    drop --permanent and the reload
sudo ufw allow 22/tcp            # Debian/Ubuntu
sudo firewall-cmd --add-service=ssh --permanent && sudo firewall-cmd --reload  # RHEL

# 6. Verify SSH key authorized
cat ~/.ssh/authorized_keys
```

### Issue: SELinux denials preventing operations

**Symptoms**: Operations fail with "Permission denied" even with sudo

**Causes**:
- SELinux blocking operations
- Incorrect file contexts
- Missing SELinux policies

**Solutions**:

```bash
# 1. Check SELinux status
ssh ansible@ "getenforce"

# 2. Check for denials
ssh ansible@ "sudo ausearch -m avc -ts recent"

# 3. Generate policy from denials
ssh ansible@ "sudo ausearch -m avc -ts recent | audit2allow -M mypolicy"
ssh ansible@ "sudo semodule -i mypolicy.pp"

# 4. Fix file contexts
ssh ansible@ "sudo restorecon -Rv /"

# 5.
# Temporarily set to permissive for testing (NOT PRODUCTION)
ssh ansible@ "sudo setenforce 0"
# After testing, re-enable
ssh ansible@ "sudo setenforce 1"
```

---

## Performance Issues

### Issue: Ansible playbook execution is very slow

**Symptoms**: Playbooks take excessive time to complete

**Causes**:
- Fact gathering on many hosts
- Serial execution instead of parallel
- Slow network connections
- Large inventory

**Solutions**:

```bash
# 1. Enable fact caching in ansible.cfg
[defaults]
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

# 2. Increase parallelism
ansible-playbook site.yml -f 20

# 3. Skip fact gathering if not needed
#    gather_facts is a play keyword, not a variable: set "gather_facts: false"
#    in the play, or gather facts only where a play explicitly requests them
ANSIBLE_GATHERING=explicit ansible-playbook site.yml

# 4. Use the free strategy plugin (in ansible.cfg)
[defaults]
strategy = free

# 5. Enable pipelining (in ansible.cfg)
[ssh_connection]
pipelining = True

# 6. Profile task execution
#    (profile_tasks callback from the ansible.posix collection)
ANSIBLE_CALLBACKS_ENABLED=ansible.posix.profile_tasks ansible-playbook site.yml
```

### Issue: High CPU usage on hypervisor

**Symptoms**: Hypervisor CPU at 100%, VMs slow

**Causes**:
- CPU overcommitment
- Runaway processes in VMs
- Insufficient resources

**Solutions**:

```bash
# 1. Check hypervisor load
ansible hypervisor -m shell -a "top -bn1 | head -20"

# 2. Check VM CPU allocation
ansible hypervisor -m shell -a "virsh vcpuinfo "

# 3. List VMs by CPU usage
ansible hypervisor -m shell -a "virsh domstats --cpu-total"

# 4. Inside VMs, check processes
ssh ansible@ "top -bn1 | head -20"

# 5.
# Reduce VM vCPU allocation if needed
ansible hypervisor -m shell -a "virsh setvcpus --config"
# virsh shutdown returns before the guest has stopped; if the start fails,
# wait until "virsh list --all" shows the VM as shut off, then start it again
ansible hypervisor -m shell -a "virsh shutdown && virsh start "
```

---

## General Diagnostics

### Diagnostic Commands

```bash
# Ansible inventory
ansible-inventory --list
ansible-inventory --graph

# Connectivity test
ansible all -m ping

# Gather facts from hosts
ansible all -m setup

# Check disk space across all hosts
ansible all -m shell -a "df -h"

# Check memory across all hosts
ansible all -m shell -a "free -h"

# Check system load
ansible all -m shell -a "uptime"

# List running services
ansible all -m shell -a "systemctl list-units --type=service --state=running"

# Check for failed services
ansible all -m shell -a "systemctl --failed"

# Review system logs
ansible all -m shell -a "journalctl -p err -n 50"
```

### Debug Mode

```bash
# Verbose output (level 1)
ansible-playbook site.yml -v

# More verbose (level 2 - shows module arguments)
ansible-playbook site.yml -vv

# Very verbose (level 3 - shows connection attempts)
ansible-playbook site.yml -vvv

# Maximum verbosity (level 4 - shows everything)
ansible-playbook site.yml -vvvv
```

### Log Locations

**Control Node**:
- Ansible log: `/var/log/ansible.log` (if configured)
- Command history: `~/.bash_history`

**Target Hosts**:
- System logs: `/var/log/syslog` (Debian) or `/var/log/messages` (RHEL)
- Auth logs: `/var/log/auth.log` (Debian) or `/var/log/secure` (RHEL)
- Audit logs: `/var/log/audit/audit.log`
- Cloud-init: `/var/log/cloud-init-output.log`
- Journal: `journalctl`

---

## Getting Help

### Internal Resources

- [CLAUDE.md Guidelines](../CLAUDE.md)
- [Architecture Overview](./architecture/overview.md)
- [Role Documentation](./roles/)
- [Cheatsheets](../cheatsheets/)

### External Resources

- [Ansible Documentation](https://docs.ansible.com/)
- [KVM/libvirt Documentation](https://libvirt.org/docs.html)
- [Distribution-specific documentation](https://www.debian.org/doc/)

### Support Channels

- Internal issue tracker:
https://git.mymx.me
- Operations team: ops@example.com
- On-call escalation: +1-XXX-XXX-XXXX

---

**Document Version**: 1.0.0
**Last Updated**: 2025-11-11
**Maintained By**: Operations Team