# Ansible Infrastructure - Improvement Plan

**Date:** 2025-11-11
**Version:** 1.0
**Status:** Active

---

## Executive Summary

Based on a comprehensive analysis of the Ansible infrastructure automation project, this document outlines a prioritized improvement plan across seven areas: **Infrastructure Operations**, **Security & Compliance**, **Development Quality & Testing**, **Role Development & Expansion**, **Documentation & Standards**, **Inventory & Repository Organization**, and **Performance & Scalability**.

### Current State Overview

**Strengths:**
- ✅ Strong foundation with security-first CLAUDE.md guidelines (95% role compliance)
- ✅ Dynamic inventory operational (community.libvirt)
- ✅ 2 production-ready roles with comprehensive documentation
- ✅ Automated remediation playbooks (swap, qemu-agent)
- ✅ Excellent MTTR (<3 minutes for critical issues)
- ✅ Comprehensive documentation structure (100% coverage)

**Critical Gaps:**
- ❌ 1/3 VMs unreachable (derp - 33% infrastructure failure)
- ❌ No CI/CD pipeline (high risk of regression)
- ❌ Molecule tests non-functional (testing coverage gap)
- ❌ Git push permission issues (operational blocker)
- ❌ Docker security audit pending (compliance risk)
- ❌ Limited role library (2 roles vs. target of 50+)

**Metrics:**
- **Operational VMs:** 2/3 (67%)
- **CLAUDE.md Compliance:** 75-90% per host
- **Role Count:** 2 (target: 50+)
- **CI/CD Pipeline:** 0% (not implemented)
- **Test Coverage:** 0% (Molecule structure exists, not functional)
- **Documentation Coverage:** 100%

---

## Priority Classification

**P0 - CRITICAL (24-48 hours):** Infrastructure blocking issues
**P1 - HIGH (1 week):** Security, compliance, operational efficiency
**P2 - MEDIUM (2-4 weeks):** Quality improvements, standardization
**P3 - LOW (1-3 months):** Nice-to-have, future enhancements

---

## Improvement Areas

### 1. Infrastructure Operations (P0/P1)

#### 1.1 VM Recovery and Connectivity [P0]

**Issue:** derp VM unreachable (192.168.122.99)
- **Impact:** 33% infrastructure failure rate
- **Root Cause:** SSH authentication failure - Permission denied (publickey,password)
- **Blocking:** System analysis, compliance verification

**Tasks:**
- [ ] Access derp VM via libvirt console (`virsh console derp`)
- [ ] Verify ansible user exists and has correct configuration
- [ ] Deploy SSH public key to /home/ansible/.ssh/authorized_keys
- [ ] Verify sudo configuration (passwordless sudo for ansible user)
- [ ] Test SSH connectivity from control node
- [ ] Execute system_info playbook against derp
- [ ] Document recovery procedure in runbooks

**Timeline:** This week (Week 47)
**Estimated Effort:** 2-4 hours (manual console access required)

#### 1.2 QEMU Guest Agent Deployment [P1]

**Issue:** mymx missing QEMU agent functionality
- **Impact:** Cannot perform graceful shutdowns, resource monitoring limited
- **Compliance:** CLAUDE.md recommends QEMU agent for KVM guests

**Tasks:**
- [ ] Verify virtio-serial channel exists in VM XML (`virsh dumpxml mymx`)
- [ ] Add virtio-serial channel if missing (`virsh edit mymx`)
- [ ] Execute playbooks/install_qemu_agent.yml on mymx
- [ ] Verify agent communication (`virsh domifaddr mymx --source agent`)
- [ ] Test guest agent commands

**Timeline:** This week (Week 47)
**Estimated Effort:** 30 minutes (playbook already exists)

#### 1.3 LVM Migration for pihole [P1]

**Issue:** pihole using traditional partitioning (non-compliant with CLAUDE.md)
- **Impact:** Cannot dynamically resize volumes, difficult disaster recovery
- **Risk:** Data loss if migration performed incorrectly

**Tasks:**
- [ ] Evaluate migration options:
  - Option A: Rebuild VM using deploy_linux_vm role (clean slate)
  - Option B: In-place migration (high risk)
  - Option C: Document exception with rationale
- [ ] Create comprehensive backup of pihole
- [ ] Test restore procedure
- [ ] Execute migration plan (if approved)
- [ ] Verify LVM configuration post-migration
- [ ] Update compliance metrics

**Timeline:** Week 48-49
**Estimated Effort:** 4-8 hours (depends on option chosen)
**Recommendation:** Option A (rebuild) - cleanest approach

#### 1.4 Git Push Permission Issue [P0]

**Issue:** Gitea server pre-receive hook blocking pushes
- **Impact:** Cannot commit improvements to remote repository
- **Blocking:** Version control, collaboration, backup

**Tasks:**
- [ ] Investigate Gitea pre-receive hook configuration
- [ ] Check repository permissions for ansible@mymx.me user
- [ ] Verify git hooks on server side
- [ ] Test push with verbose output
- [ ] Document git workflow procedures

**Timeline:** This week (Week 47)
**Estimated Effort:** 1-2 hours

---

### 2. Security & Compliance (P1)

#### 2.1 Docker Security Audit [P1]

**Issue:** Docker running on pihole with unknown security posture
- **Impact:** Container escape risk, privilege escalation, resource exhaustion
- **Compliance:** CLAUDE.md requires security audits for containerized services

**Tasks:**
- [ ] Create playbooks/audit_docker.yml playbook
- [ ] Audit docker daemon configuration (/etc/docker/daemon.json)
- [ ] Check for privileged containers (`docker inspect`)
- [ ] Verify user namespace remapping
- [ ] Check AppArmor/SELinux profiles
- [ ] Audit network isolation (bridge vs. host mode)
- [ ] Check resource limits (CPU, memory)
- [ ] Scan container images for vulnerabilities
- [ ] Review exposed ports and services
- [ ] Generate compliance report
- [ ] Implement recommended hardening

**Timeline:** Week 47-48
**Estimated Effort:** 4-6 hours

**Deliverables:**
- playbooks/audit_docker.yml
- docs/security/docker-hardening.md
- Docker security baseline role (future)

#### 2.2 Swap Configuration [P1]

**Status:** Partially complete (playbook exists)
- pihole: ✅ Configured (2GB)
- mymx: ✅ Configured (2GB)
- derp: ❌ Pending (VM unreachable)

**Tasks:**
- [ ] Execute configure_swap.yml on derp (after connectivity restored)
- [ ] Verify swap persistence across reboots
- [ ] Monitor swap usage trends

**Timeline:** Week 47 (after derp recovery)
**Estimated Effort:** 15 minutes

#### 2.3 Automated Compliance Scanning [P2]

**Issue:** Manual compliance verification is time-consuming
- **Impact:** Delayed detection of configuration drift

**Tasks:**
- [ ] Research OpenSCAP integration options
- [ ] Create security_audit playbook with CIS benchmarks
- [ ] Implement automated weekly compliance scans
- [ ] Configure compliance reporting
- [ ] Set up alerting for critical findings

**Timeline:** Week 48-50
**Estimated Effort:** 8-12 hours

---

### 3. Development Quality & Testing (P1/P2)

#### 3.1 Molecule Testing Implementation [P1]

**Issue:** Molecule structure exists but tests are non-functional
- **Impact:** No automated testing, high regression risk
- **Quality Risk:** Cannot verify roles work correctly

**Current State:**
- Molecule installed
- roles/deploy_linux_vm/molecule/default/ directory exists
- No molecule.yml configuration

**Tasks:**
- [ ] Create molecule.yml for deploy_linux_vm role
- [ ] Set up Docker/Podman test containers
- [ ] Write converge.yml test playbook
- [ ] Write verify.yml validation tests
- [ ] Create test scenarios for:
  - Debian 12 deployment
  - RHEL 9 deployment
  - LVM configuration validation
  - Cloud-init template rendering
- [ ] Document testing procedures
- [ ] Create cheatsheets/testing.md
- [ ] Repeat for system_info role

**Timeline:** Week 48-50
**Estimated Effort:** 12-16 hours
**Priority:** HIGH (required before scaling role development)

**Example molecule.yml:**

```yaml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  # pre_build_image uses the image as-is; the stock debian:12 image does not
  # ship systemd or Python, so a purpose-built test image may be needed here.
  - name: debian-12-test
    image: debian:12
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
  # RHEL 9 compatible platform
  - name: rockylinux-9-test
    image: rockylinux:9
    pre_build_image: true
    privileged: true
    command: /usr/sbin/init
provisioner:
  name: ansible
  config_options:
    defaults:
      callbacks_enabled: profile_tasks, timer
  inventory:
    group_vars:
      all:
        ansible_user: root
verifier:
  name: ansible
```

#### 3.2 CI/CD Pipeline Setup [P1]

**Issue:** No automated testing on commits/PRs
- **Impact:** Manual quality control, slow feedback loop
- **Risk:** Breaking changes reach main branch

**Tasks:**
- [ ] Evaluate CI/CD options:
  - Gitea Actions (preferred - native integration)
  - Jenkins (more features, higher complexity)
  - GitLab CI (if migrating from Gitea)
- [ ] Create .gitea/workflows/ci.yml
- [ ] Implement pipeline stages:
  - Syntax validation (`ansible-playbook --syntax-check`)
  - Linting (ansible-lint)
  - YAML validation (yamllint)
  - Molecule tests
  - Security scanning (ansible-audit)
- [ ] Configure branch protection rules
- [ ] Set up status checks for pull requests
- [ ] Configure notifications (email/webhook)

**Timeline:** Week 49-50
**Estimated Effort:** 8-12 hours

**Example Gitea Actions workflow:**

```yaml
name: Ansible CI

on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master ]

jobs:
  # Static checks on every push and pull request
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run ansible-lint
        run: |
          pip install ansible-lint
          ansible-lint

  # Role tests; requires a runner with Docker available
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Molecule tests
        run: |
          pip install molecule molecule-docker
          cd roles/deploy_linux_vm
          molecule test
```

#### 3.3 Pre-commit Hooks [P2]

**Issue:** No local quality checks before commits
- **Impact:** Quality issues reach repository

**Tasks:**
- [ ] Install pre-commit framework
- [ ] Create .pre-commit-config.yaml
- [ ] Configure hooks:
  - ansible-lint
  - yamllint
  - trailing whitespace removal
  - end-of-file fixer
  - mixed line endings check
- [ ] Document pre-commit setup in README.md
- [ ] Create setup script for developers

**Timeline:** Week 48
**Estimated Effort:** 2-4 hours

#### 3.4 Ansible Configuration Optimization [P2]

**Current Config:**

```ini
gathering = smart
callbacks_enabled = profile_tasks, timer
# Missing: forks, pipelining, fact_caching
```

**Tasks:**
- [ ] Enable SSH pipelining for performance
- [ ] Implement fact caching (Redis or JSON file)
- [ ] Increase forks for parallel execution
- [ ] Configure strategy plugins
- [ ] Enable ControlMaster for SSH connection reuse
- [ ] Document configuration choices

**Timeline:** Week 48
**Estimated Effort:** 2-3 hours

**Recommended additions:**

```ini
[defaults]
gathering = smart
callbacks_enabled = profile_tasks, timer
forks = 20
# Disables SSH host key verification; weigh this against the security-first guidelines
host_key_checking = False
retry_files_enabled = False
fact_caching = jsonfile
# /tmp is world-writable; a dedicated cache directory may be preferable
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=3600s
```

#### 3.5 Ansible Galaxy Configuration Fix [P2]

**Issue:** `ansible-galaxy collection list` fails with galaxy_server config error

**Tasks:**
- [ ] Fix ansible.cfg galaxy_server configuration
- [ ] Verify collection installations
- [ ] Document collection management procedures

**Timeline:** Week 47
**Estimated Effort:** 30 minutes

---

### 4. Role Development & Expansion (P2/P3)

#### 4.1 Common Base System Role [P2]

**Need:** Standardized base configuration for all systems
- **Impact:** Consistency, reduced duplication, faster deployments

**Tasks:**
- [ ] Create roles/common role structure
- [ ] Implement essential package installation
- [ ] User and group management
- [ ] SSH hardening
- [ ] Time synchronization (chrony)
- [ ] System logging (rsyslog)
- [ ] Implement molecule tests
- [ ] Create comprehensive documentation
- [ ] Create cheatsheet

**Timeline:** Week 50-51
**Estimated Effort:** 16-20 hours

**Features:**
- Essential packages (vim, htop, tmux, jq, curl, wget, etc.)
- SSH hardening (disable root login, key-only auth)
- Chrony/NTP configuration
- Rsyslog centralized logging
- User account management
- Sudo configuration
- Timezone configuration
- Locale configuration

#### 4.2 Security Hardening Role [P2]

**Need:** CIS Benchmark compliance automation
- **Impact:** Consistent security posture, audit compliance

**Tasks:**
- [ ] Create roles/security_hardening role
- [ ] Implement CIS Benchmark controls for:
  - Debian 12
  - RHEL 9/Rocky/AlmaLinux
- [ ] SELinux/AppArmor enforcement
- [ ] Firewall configuration (firewalld/ufw)
- [ ] Fail2ban setup
- [ ] AIDE file integrity monitoring
- [ ] Auditd configuration
- [ ] Kernel hardening (sysctl)
- [ ] Password policies (PAM)
- [ ] Account lockout policies
- [ ] Implement molecule tests
- [ ] Create documentation

**Timeline:** Weeks 51-52 (December)
**Estimated Effort:** 24-32 hours

#### 4.3 Monitoring Role [P2]

**Need:** Prometheus node_exporter for metrics collection
- **Impact:** Visibility into system health, capacity planning

**Tasks:**
- [ ] Create roles/prometheus_node_exporter role
- [ ] Install and configure node_exporter
- [ ] Configure systemd service
- [ ] Configure firewall rules
- [ ] Implement security hardening
- [ ] Create molecule tests
- [ ] Create documentation

**Timeline:** Week 51
**Estimated Effort:** 8-12 hours

#### 4.4 Future Roles (P3)

Lower priority roles for future development:

**Web Servers (Q1 2026):**
- roles/nginx
- roles/apache
- roles/haproxy

**Databases (Q1 2026):**
- roles/postgresql
- roles/mysql
- roles/redis

**Application Services (Q1-Q2 2026):**
- roles/docker (security-hardened)
- roles/docker_compose
- roles/backup (Restic/Borg)
- roles/vpn (WireGuard)

---

### 5. Documentation & Standards (P2/P3)

#### 5.1 Update CHANGELOG.md [P2]

**Issue:** Week 46 improvements not documented in CHANGELOG.md
- **Impact:** Lost historical context, version tracking incomplete

**Tasks:**
- [ ] Document Week 46 achievements:
  - Role compliance improvements (70% → 95%)
  - System analysis and remediation framework
  - Remediation playbooks (swap, qemu-agent)
  - Dynamic inventory migration
  - SSH access restoration
  - Documentation expansion (2,100+ lines)
- [ ] Tag version 0.2.0
- [ ] Update version numbers in relevant files

**Timeline:** Week 47
**Estimated Effort:** 1 hour

#### 5.2 Create Testing Cheatsheet [P2]

**Need:** Quick reference for testing workflows

**Tasks:**
- [ ] Create cheatsheets/testing.md
- [ ] Document Molecule usage
- [ ] Document ansible-lint usage
- [ ] Document CI/CD pipeline
- [ ] Include troubleshooting tips

**Timeline:** Week 49
**Estimated Effort:** 2-3 hours

#### 5.3 Dynamic Inventory Group Name Sanitization [P2]

**Issue:** UUID-based group names generate warnings

```
[WARNING]: Invalid characters were found in group names but not replaced
```

**Tasks:**
- [ ] Research inventory plugin configuration options
- [ ] Implement group name sanitization
- [ ] Test with libvirt dynamic inventory
- [ ] Document solution

**Timeline:** Week 48
**Estimated Effort:** 2-3 hours

#### 5.4 Runbook Documentation [P3]

**Need:** Operational procedures for common tasks

**Tasks:**
- [ ] Create docs/runbooks/vm-recovery.md
- [ ] Create docs/runbooks/emergency-procedures.md
- [ ] Create docs/runbooks/capacity-planning.md
- [ ] Create docs/runbooks/security-incident-response.md

**Timeline:** Weeks 50-52
**Estimated Effort:** 8-12 hours

---

### 6. Inventory & Repository Organization (P2)

#### 6.1 Separate Inventories Repository [P2]

**Need:** Public inventories repository (per CLAUDE.md)
- **Impact:** Better separation of concerns, public/private boundary

**Current State:**
- inventories/ in main repository
- secrets/ in git submodule (correct)

**Tasks:**
- [ ] Create new public repository: inventories
- [ ] Move inventories/ directory to new repo
- [ ] Configure as git submodule
- [ ] Update .gitmodules
- [ ] Update documentation
- [ ] Test inventory loading from submodule
- [ ] Update README.md with submodule instructions

**Timeline:** Week 48
**Estimated Effort:** 3-4 hours

**Note:** Evaluate necessity - current setup with inventories/ in main repo may be acceptable for single-team usage.

---

### 7. Performance & Scalability (P3)

#### 7.1 Fact Caching Implementation [P3]

**Need:** Reduce gather_facts execution time
- **Current:** ~1.7 seconds per host
- **Target:** <0.5 seconds (cached)

**Tasks:**
- [ ] Evaluate caching backends (Redis vs. JSON file)
- [ ] Implement fact caching in ansible.cfg
- [ ] Test cache performance
- [ ] Configure cache timeout
- [ ] Monitor cache hit rates

**Timeline:** Week 51
**Estimated Effort:** 2-4 hours

#### 7.2 Parallel Execution Optimization [P3]

**Tasks:**
- [ ] Benchmark current execution times
- [ ] Increase forks parameter
- [ ] Test strategy: free for independent tasks
- [ ] Implement async tasks for long-running operations
- [ ] Document performance optimizations

**Timeline:** Week 52
**Estimated Effort:** 3-4 hours

---

## Implementation Timeline

### Week 47 (Current Week) - Critical Operations

**Focus:** Restore infrastructure, unblock operations

- [ ] **P0:** Recover derp VM connectivity (4 hours)
- [ ] **P0:** Resolve git push permission issue (2 hours)
- [ ] **P1:** Install QEMU agent on mymx (30 min)
- [ ] **P1:** Begin Docker security audit (2 hours)
- [ ] **P2:** Update CHANGELOG.md with Week 46 achievements (1 hour)
- [ ] **P2:** Fix ansible-galaxy configuration (30 min)

**Total Estimated Effort:** 10 hours

### Week 48 - Testing & Quality

**Focus:** Establish testing infrastructure

- [ ] **P1:** Molecule testing implementation - Part 1 (8 hours)
- [ ] **P1:** Complete Docker security audit (4 hours)
- [ ] **P1:** Plan LVM migration for pihole (2 hours)
- [ ] **P2:** Pre-commit hooks setup (3 hours)
- [ ] **P2:** Ansible configuration optimization (2 hours)
- [ ] **P2:** Dynamic inventory group sanitization (2 hours)

**Total Estimated Effort:** 21 hours

### Week 49 - CI/CD & Automation

**Focus:** Automated quality gates

- [ ] **P1:** CI/CD pipeline setup (10 hours)
- [ ] **P1:** Molecule testing implementation - Part 2 (8 hours)
- [ ] **P2:** Testing cheatsheet (3 hours)
- [ ] **P2:** Separate inventories repository (if needed) (4 hours)

**Total Estimated Effort:** 25 hours

### Week 50-51 - Role Development

**Focus:** Expand role library

- [ ] **P1:** Complete Molecule testing (4 hours)
- [ ] **P2:** Common base system role (20 hours)
- [ ] **P2:** Prometheus node_exporter role (10 hours)
- [ ] **P2:** Automated compliance scanning (8 hours)

**Total Estimated Effort:** 42 hours

### Week 52 - Security & Hardening

**Focus:** Security baseline

- [ ] **P2:** Security hardening role (24 hours)
- [ ] **P3:** Runbook documentation (8 hours)
- [ ] **P3:** Performance optimization (6 hours)

**Total Estimated Effort:** 38 hours

---

## Success Metrics

### Infrastructure Health
- **Target:** 100% VM connectivity (3/3 operational)
- **Current:** 67% (2/3 operational)
- **Timeline:** Week 47

### Testing Coverage
- **Target:** 80% role coverage with functional Molecule tests
- **Current:** 0% (structure exists, not functional)
- **Timeline:** Week 50

### CI/CD Maturity
- **Target:** Automated testing on all commits
- **Current:** 0% (no pipeline)
- **Timeline:** Week 49

### Role Library Growth
- **Target:** 5 production-ready roles by end of December
- **Current:** 2 roles
- **Timeline:** Week 52

### Compliance Score
- **Target:** 95% CLAUDE.md compliance across all hosts
- **Current:** 75-90% per host
- **Timeline:** Week 51

### Time to Deploy New Role
- **Target:** <8 hours with full testing
- **Current:** Unknown (no testing framework)
- **Timeline:** Week 50

---

## Risk Assessment

### High Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| LVM migration data loss | CRITICAL | MEDIUM | Comprehensive backups, testing, consider rebuild |
| Molecule test complexity | HIGH | MEDIUM | Start simple, iterate, use Docker not libvirt |
| CI/CD pipeline setup delays | HIGH | MEDIUM | Use Gitea Actions (simpler), prioritize basic tests |
| derp VM unrecoverable | HIGH | LOW | Document rebuild procedure using deploy_linux_vm |
| Time constraints | MEDIUM | HIGH | Prioritize P0/P1 tasks, defer P3 tasks |

### Medium Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Docker security findings | MEDIUM | HIGH | Plan remediation time, may need container rebuild |
| Breaking changes during testing | MEDIUM | MEDIUM | Use check mode, test in dev environment first |
| Inventory repository complexity | MEDIUM | LOW | Evaluate if truly necessary, may skip |

---

## Resource Requirements

### Personnel
- **Senior Ansible Developer:** 1 FTE
- **Time Allocation:**
  - Week 47: 10 hours (critical ops)
  - Week 48-49: 23 hours/week (testing & CI/CD)
  - Week 50-52: ~27 hours/week (role development; 80 hours across the three weeks per the timeline above)

### Infrastructure
- **Existing:** KVM/libvirt hypervisor, 3 VMs
- **New Requirements:**
  - Docker/Podman for Molecule testing (can use existing Docker on pihole)
  - CI/CD runner (can use existing infrastructure)
  - Fact cache storage (~100MB, can use local disk)

### Tools & Services
- **Existing:** Ansible, Git, Gitea, Docker
- **New:** Molecule, pre-commit framework, yamllint
- **Installation:** `pip install molecule molecule-docker pre-commit yamllint`

---

## Dependencies

### Critical Path
1. **Week 47:** derp recovery → full infrastructure operational
2. **Week 48:** Molecule setup → enables role testing
3. **Week 49:** CI/CD pipeline → enables automated quality
4. **Week 50+:** Role development → depends on testing framework

### External Dependencies
- Gitea server availability (for CI/CD and git operations)
- KVM hypervisor access (for VM management)
- Internet connectivity (for package installations)

---

## Monitoring & Review

### Weekly Reviews
- **Monday:** Review previous week progress, adjust priorities
- **Friday:** Status update, document blockers

### Metrics Tracking
- VM connectivity status
- Test coverage percentage
- CI/CD pipeline success rate
- CLAUDE.md compliance score
- Role count and quality

### Quarterly Goals
- **Q1 2026 End:**
  - 10+ production-ready roles
  - 90%+ test coverage
  - Full CI/CD maturity
  - 95%+ CLAUDE.md compliance
  - Automated security scanning

---

## Appendix: Quick Reference

### Immediate Actions (This Week)

**Monday-Tuesday:**
1. Recover derp VM (console access)
2. Fix git push permissions
3. Update CHANGELOG.md

**Wednesday-Thursday:**
4. Install QEMU agent on mymx
5. Start Docker security audit
6. Fix ansible-galaxy configuration

**Friday:**
7. Review progress
8. Update TODO.md
9. Plan Week 48 tasks

### Command Reference

```bash
# VM Recovery
virsh console derp
virsh edit mymx  # Add virtio-serial

# Testing
ansible-playbook playbooks/install_qemu_agent.yml
ansible-playbook playbooks/audit_docker.yml
molecule test

# CI/CD
ansible-lint
ansible-playbook --syntax-check site.yml
yamllint .

# Monitoring
ansible-playbook playbooks/gather_system_info.yml
cat stats/machines/*/summary.txt
```

---

## Related Documents

- [TODO.md](TODO.md) - Weekly task tracking
- [ROADMAP.md](ROADMAP.md) - Strategic long-term plan
- [CHANGELOG.md](CHANGELOG.md) - Version history
- [SYSTEM_ANALYSIS_AND_REMEDIATION.md](SYSTEM_ANALYSIS_AND_REMEDIATION.md) - Current system state
- [CLAUDE.md](CLAUDE.md) - Development standards and guidelines

---

**Next Review:** 2025-11-18 (Monday, Week 48)
**Plan Owner:** Ansible Infrastructure Team
**Document Status:** Active
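
---

## Appendix: Example Pre-commit Configuration

The pre-commit task list in section 3.3 names five hooks (ansible-lint, yamllint, trailing whitespace removal, end-of-file fixer, mixed line endings check). A minimal sketch of a matching `.pre-commit-config.yaml` might look like the following. The repository URLs and hook IDs are the upstream ones for each tool, but none of them appear in this plan, so treat them as assumptions to verify. The `rev` pins are illustrative only and would normally be refreshed with `pre-commit autoupdate`.

```yaml
# .pre-commit-config.yaml - illustrative sketch, not the project's actual config.
# Assumption: rev values are example pins; update them before adopting.
repos:
  # Generic whitespace and line-ending hygiene
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: mixed-line-ending

  # YAML style validation
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1
    hooks:
      - id: yamllint

  # Ansible-specific linting
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.2.0
    hooks:
      - id: ansible-lint
```

With the `pre-commit` package installed (it is already part of the pip command under Tools & Services), `pre-commit install` registers the hooks for the local clone and `pre-commit run --all-files` checks the existing tree once.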