Ansible Infrastructure - Improvement Plan
Date: 2025-11-11
Version: 1.0
Status: Active
Executive Summary
Based on a comprehensive analysis of the Ansible infrastructure automation project, this document outlines a prioritized improvement plan across 7 key areas: Infrastructure Operations, Security & Compliance, Development Quality & Testing, Role Development & Expansion, Documentation & Standards, Inventory & Repository Organization, and Performance & Scalability.
Current State Overview
Strengths:
- ✅ Strong foundation with security-first CLAUDE.md guidelines (95% role compliance)
- ✅ Dynamic inventory operational (community.libvirt)
- ✅ 2 production-ready roles with comprehensive documentation
- ✅ Automated remediation playbooks (swap, qemu-agent)
- ✅ Excellent MTTR (<3 minutes for critical issues)
- ✅ Comprehensive documentation structure (100% coverage)
Critical Gaps:
- ❌ 1/3 VMs unreachable (derp - 33% infrastructure failure)
- ❌ No CI/CD pipeline (high risk of regression)
- ❌ Molecule tests non-functional (testing coverage gap)
- ❌ Git push permission issues (operational blocker)
- ❌ Docker security audit pending (compliance risk)
- ❌ Limited role library (2 roles vs. target of 50+)
Metrics:
- Operational VMs: 2/3 (67%)
- CLAUDE.md Compliance: 75-90% per host
- Role Count: 2 (target: 50+)
- CI/CD Pipeline: 0% (not implemented)
- Test Coverage: 0% (Molecule structure exists, not functional)
- Documentation Coverage: 100%
Priority Classification
- P0 - CRITICAL (24-48 hours): Infrastructure blocking issues
- P1 - HIGH (1 week): Security, compliance, operational efficiency
- P2 - MEDIUM (2-4 weeks): Quality improvements, standardization
- P3 - LOW (1-3 months): Nice-to-have, future enhancements
Improvement Areas
1. Infrastructure Operations (P0/P1)
1.1 VM Recovery and Connectivity [P0]
Issue: derp VM unreachable (192.168.122.99)
- Impact: 33% infrastructure failure rate
- Root Cause: SSH authentication failure - Permission denied (publickey,password)
- Blocking: System analysis, compliance verification
Tasks:
- Access derp VM via libvirt console (virsh console derp)
- Verify ansible user exists and has correct configuration
- Deploy SSH public key to /home/ansible/.ssh/authorized_keys
- Verify sudo configuration (passwordless sudo for ansible user)
- Test SSH connectivity from control node
- Execute system_info playbook against derp
- Document recovery procedure in runbooks
Timeline: This week (Week 47) Estimated Effort: 2-4 hours (manual console access required)
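The key deployment step in the task list above can be sketched as a small idempotent helper, intended to be run as root from the console session on derp. The function name `add_authorized_key` and the example key in the comments are illustrative assumptions, not part of the existing playbooks.

```shell
#!/usr/bin/env bash
# Sketch only: install a public key for a user without duplicating it.

# add_authorized_key <ssh_dir> <public_key_line>
# Creates <ssh_dir> (mode 0700) and authorized_keys (mode 0600) if needed,
# then appends the key only when that exact line is not already present.
add_authorized_key() {
    local ssh_dir="$1" key="$2"
    local file="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$file"
    chmod 600 "$file"
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF -- "$key" "$file"; then
        printf '%s\n' "$key" >> "$file"
    fi
}

# Example invocation on the VM (hypothetical key; chown requires root):
#   add_authorized_key /home/ansible/.ssh "ssh-ed25519 <public-key> ansible@control"
#   chown -R ansible:ansible /home/ansible/.ssh
```

Because the helper checks for the exact line first, running it twice leaves a single entry, which keeps the recovery procedure safe to repeat.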
1.2 QEMU Guest Agent Deployment [P1]
Issue: mymx missing QEMU agent functionality
- Impact: Cannot perform graceful shutdowns, resource monitoring limited
- Compliance: CLAUDE.md recommends QEMU agent for KVM guests
Tasks:
- Verify virtio-serial channel exists in VM XML (virsh dumpxml mymx)
- Add virtio-serial channel if missing (virsh edit mymx)
- Execute playbooks/install_qemu_agent.yml on mymx
- Verify agent communication (virsh domifaddr mymx --source agent)
- Test guest agent commands
Timeline: This week (Week 47) Estimated Effort: 30 minutes (playbook already exists)
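The channel check in the tasks above could be scripted roughly as follows. This is a minimal sketch that assumes the guest agent channel is declared with the conventional target name `org.qemu.guest_agent.0`; the helper name `has_agent_channel` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: detect the QEMU guest agent channel in libvirt domain XML.

# has_agent_channel: reads domain XML on stdin and succeeds (exit 0) when a
# channel target named org.qemu.guest_agent.0 is declared.
has_agent_channel() {
    grep -q "name=['\"]org\.qemu\.guest_agent\.0['\"]"
}

# Example usage on the hypervisor:
#   virsh dumpxml mymx | has_agent_channel && echo "agent channel present"
# Once the agent is installed in the guest, a ping confirms communication:
#   virsh qemu-agent-command mymx '{"execute":"guest-ping"}'
```

Reading the XML with `virsh dumpxml` avoids opening an editor just to inspect the definition.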
1.3 LVM Migration for pihole [P1]
Issue: pihole using traditional partitioning (non-compliant with CLAUDE.md)
- Impact: Cannot dynamically resize volumes, difficult disaster recovery
- Risk: Data loss if migration performed incorrectly
Tasks:
- Evaluate migration options:
  - Option A: Rebuild VM using deploy_linux_vm role (clean slate)
  - Option B: In-place migration (high risk)
  - Option C: Document exception with rationale
- Create comprehensive backup of pihole
- Test restore procedure
- Execute migration plan (if approved)
- Verify LVM configuration post-migration
- Update compliance metrics
Timeline: Week 48-49
Estimated Effort: 4-8 hours (depends on option chosen)
Recommendation: Option A (rebuild) - cleanest approach
1.4 Git Push Permission Issue [P0]
Issue: Gitea server pre-receive hook blocking pushes
- Impact: Cannot commit improvements to remote repository
- Blocking: Version control, collaboration, backup
Tasks:
- Investigate Gitea pre-receive hook configuration
- Check repository permissions for ansible@mymx.me user
- Verify git hooks on server side
- Test push with verbose output
- Document git workflow procedures
Timeline: This week (Week 47) Estimated Effort: 1-2 hours
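When testing the push with verbose output, the server-side hook message is usually the most useful part. The sketch below filters it out of the push log; the helper name `remote_messages`, the remote `origin`, and the branch `master` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: isolate server-side messages from `git push` output.

# remote_messages: reads push output on stdin and prints only the lines the
# server sent back (git prefixes them with "remote: "), prefix removed.
remote_messages() {
    sed -n 's/^remote: \{0,1\}//p'
}

# Example usage (git writes push progress to stderr, hence 2>&1):
#   git push --verbose origin master 2>&1 | tee push.log | remote_messages
# For lower-level tracing of what the client does:
#   GIT_TRACE=1 git push origin master
```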
2. Security & Compliance (P1)
2.1 Docker Security Audit [P1]
Issue: Docker running on pihole with unknown security posture
- Impact: Container escape risk, privilege escalation, resource exhaustion
- Compliance: CLAUDE.md requires security audits for containerized services
Tasks:
- Create playbooks/audit_docker.yml playbook
- Audit docker daemon configuration (/etc/docker/daemon.json)
- Check for privileged containers (docker inspect)
- Verify user namespace remapping
- Check AppArmor/SELinux profiles
- Audit network isolation (bridge vs. host mode)
- Check resource limits (CPU, memory)
- Scan container images for vulnerabilities
- Review exposed ports and services
- Generate compliance report
- Implement recommended hardening
Timeline: Week 47-48
Estimated Effort: 4-6 hours
Deliverables:
- playbooks/audit_docker.yml
- docs/security/docker-hardening.md
- Docker security baseline role (future)
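Before the audit playbook exists, the privileged-container, network-mode, and resource-limit checks can be approximated from the shell. This is a rough sketch, not the planned playbooks/audit_docker.yml; the function names are hypothetical, and it only inspects running containers.

```shell
#!/usr/bin/env bash
# Sketch only: flag risky settings on running Docker containers.

# list_container_settings: one line per running container with
# name, privileged flag, network mode, memory limit in bytes (0 = unlimited).
list_container_settings() {
    local ids
    ids=$(docker ps -q)
    [ -n "$ids" ] || return 0
    # Word splitting of the ID list is intended here.
    # shellcheck disable=SC2086
    docker inspect --format \
        '{{.Name}} {{.HostConfig.Privileged}} {{.HostConfig.NetworkMode}} {{.HostConfig.Memory}}' \
        $ids
}

# flag_risky_containers: reads those lines on stdin, prints one finding per issue.
flag_risky_containers() {
    awk '
        $2 == "true" { print "FINDING " $1 ": privileged container" }
        $3 == "host" { print "FINDING " $1 ": host network mode" }
        $4 == "0"    { print "FINDING " $1 ": no memory limit" }
    '
}

# Example usage on pihole:
#   list_container_settings | flag_risky_containers
```

Keeping the collection and the evaluation in separate functions makes it straightforward to move the same rules into an Ansible task later.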
2.2 Swap Configuration [P1]
Status: Partially complete (playbook exists)
- pihole: ✅ Configured (2GB)
- mymx: ✅ Configured (2GB)
- derp: ❌ Pending (VM unreachable)
Tasks:
- Execute configure_swap.yml on derp (after connectivity restored)
- Verify swap persistence across reboots
- Monitor swap usage trends
Timeline: Week 47 (after derp recovery) Estimated Effort: 15 minutes
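Swap persistence across reboots depends on an active entry in the filesystem table. A minimal check, assuming swap is configured through /etc/fstab rather than a systemd swap unit, might look like this; the helper name `swap_persistent` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check that swap is declared persistently in fstab.

# swap_persistent [fstab_path]
# Succeeds when the file contains an uncommented entry whose filesystem
# type (third field) is "swap". Defaults to /etc/fstab.
swap_persistent() {
    local fstab="${1:-/etc/fstab}"
    awk '$1 !~ /^#/ && $3 == "swap" { found = 1 } END { exit (found ? 0 : 1) }' "$fstab"
}

# Example usage on a host:
#   swapon --show                      # swap that is active right now
#   swap_persistent && echo "swap will survive a reboot"
```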
2.3 Automated Compliance Scanning [P2]
Issue: Manual compliance verification is time-consuming
- Impact: Delayed detection of configuration drift
Tasks:
- Research OpenSCAP integration options
- Create security_audit playbook with CIS benchmarks
- Implement automated weekly compliance scans
- Configure compliance reporting
- Set up alerting for critical findings
Timeline: Week 48-50 Estimated Effort: 8-12 hours
3. Development Quality & Testing (P1/P2)
3.1 Molecule Testing Implementation [P1]
Issue: Molecule structure exists but tests are non-functional
- Impact: No automated testing, high regression risk
- Quality Risk: Cannot verify roles work correctly
Current State:
- Molecule installed
- roles/deploy_linux_vm/molecule/default/ directory exists
- No molecule.yml configuration
Tasks:
- Create molecule.yml for deploy_linux_vm role
- Set up Docker/Podman test containers
- Write converge.yml test playbook
- Write verify.yml validation tests
- Create test scenarios for:
  - Debian 12 deployment
  - RHEL 9 deployment
  - LVM configuration validation
  - Cloud-init template rendering
- Document testing procedures
- Create cheatsheets/testing.md
- Repeat for system_info role
Timeline: Week 48-50 Estimated Effort: 12-16 hours Priority: HIGH (required before scaling role development)
Example molecule.yml:
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: debian-12-test
    image: debian:12
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
  - name: rockylinux-9-test
    image: rockylinux:9
    pre_build_image: true
    privileged: true
    command: /usr/sbin/init
provisioner:
  name: ansible
  config_options:
    defaults:
      callbacks_enabled: profile_tasks, timer
  inventory:
    group_vars:
      all:
        ansible_user: root
verifier:
  name: ansible
3.2 CI/CD Pipeline Setup [P1]
Issue: No automated testing on commits/PRs
- Impact: Manual quality control, slow feedback loop
- Risk: Breaking changes reach main branch
Tasks:
- Evaluate CI/CD options:
  - Gitea Actions (preferred - native integration)
  - Jenkins (more features, higher complexity)
  - GitLab CI (if migrating from Gitea)
- Create .gitea/workflows/ci.yml
- Implement pipeline stages:
  - Syntax validation (ansible-playbook --syntax-check)
  - Linting (ansible-lint)
  - YAML validation (yamllint)
  - Molecule tests
  - Security scanning (ansible-audit)
- Configure branch protection rules
- Set up status checks for pull requests
- Configure notifications (email/webhook)
Timeline: Week 49-50 Estimated Effort: 8-12 hours
Example Gitea Actions workflow:
name: Ansible CI
on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master ]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run ansible-lint
        run: |
          pip install ansible-lint
          ansible-lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Molecule tests
        run: |
          pip install molecule molecule-docker
          cd roles/deploy_linux_vm
          molecule test
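Until the pipeline is in place, the same stages can be run locally in one pass. The following is a sketch of a stage runner that continues after a failure and reports a summary; the names `run_stage` and `failures` are hypothetical, and `site.yml` is assumed to be the top-level playbook.

```shell
#!/usr/bin/env bash
# Sketch only: run quality-gate stages in order and count failures.

failures=0

# run_stage <name> <command> [args...]
# Runs the command, prints PASS or FAIL with the stage name, and increments
# the global failure counter instead of aborting on the first error.
run_stage() {
    local name="$1"
    shift
    if "$@"; then
        echo "PASS $name"
    else
        echo "FAIL $name"
        failures=$((failures + 1))
    fi
}

# Example usage from the repository root:
#   run_stage yamllint     yamllint .
#   run_stage ansible-lint ansible-lint
#   run_stage syntax       ansible-playbook --syntax-check site.yml
#   [ "$failures" -eq 0 ]   # non-zero exit status if any stage failed
```

Running every stage even after a failure gives a complete picture in one run, which is convenient locally; a CI job may prefer to stop at the first failure to save runner time.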
3.3 Pre-commit Hooks [P2]
Issue: No local quality checks before commits
- Impact: Quality issues reach repository
Tasks:
- Install pre-commit framework
- Create .pre-commit-config.yaml
- Configure hooks:
  - ansible-lint
  - yamllint
  - trailing whitespace removal
  - end-of-file fixer
  - mixed line endings check
- Document pre-commit setup in README.md
- Create setup script for developers
Timeline: Week 48 Estimated Effort: 2-4 hours
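The developer setup script mentioned in the tasks could start from something like the sketch below. It assumes a .pre-commit-config.yaml already exists in the repository root; the function names `missing_tools` and `setup_hooks` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: developer setup for pre-commit hooks.

# missing_tools <tool>...: print the name of each tool not found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# setup_hooks: install the git hook and run all hooks once across the repo.
# Must be called from the repository root.
setup_hooks() {
    local missing
    missing=$(missing_tools git pre-commit)
    if [ -n "$missing" ]; then
        echo "Missing tools: $missing" >&2
        return 1
    fi
    pre-commit install
    pre-commit run --all-files
}

# Example usage:
#   pip install pre-commit ansible-lint yamllint
#   setup_hooks
```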
3.4 Ansible Configuration Optimization [P2]
Current Config:
gathering = smart
callbacks_enabled = profile_tasks, timer
# Missing: forks, pipelining, fact_caching
Tasks:
- Enable SSH pipelining for performance
- Implement fact caching (Redis or JSON file)
- Increase forks for parallel execution
- Configure strategy plugins
- Enable ControlMaster for SSH connection reuse
- Document configuration choices
Timeline: Week 48 Estimated Effort: 2-3 hours
Recommended additions:
[defaults]
gathering = smart
callbacks_enabled = profile_tasks, timer
forks = 20
# Disables SSH host key verification; acceptable only on a trusted network
host_key_checking = False
retry_files_enabled = False
fact_caching = jsonfile
# /tmp is world-writable; consider a directory owned by the ansible user
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=3600s
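After editing ansible.cfg, it helps to confirm that each setting landed in the intended section. The helper below is a small sketch for reading one key from an INI file; the name `cfg_get` is hypothetical, and it handles only simple `key = value` lines.

```shell
#!/usr/bin/env bash
# Sketch only: read a single key from a section of an INI-style file.

# cfg_get <file> <section> <key>
# Prints the value of <key> inside [<section>]; prints nothing if absent.
# Splits on the first "=" only, so values such as ssh_args keep their own "=".
cfg_get() {
    awk -v section="[$2]" -v key="$3" '
        /^\[/ { in_section = ($0 == section); next }
        in_section && /^[^#;]/ {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Example usage:
#   cfg_get ansible.cfg ssh_connection pipelining
# Ansible can also report the settings it actually resolved:
#   ansible-config dump --only-changed
```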
3.5 Ansible Galaxy Configuration Fix [P2]
Issue: ansible-galaxy collection list fails with galaxy_server config error
Tasks:
- Fix ansible.cfg galaxy_server configuration
- Verify collection installations
- Document collection management procedures
Timeline: Week 47 Estimated Effort: 30 minutes
4. Role Development & Expansion (P2/P3)
4.1 Common Base System Role [P2]
Need: Standardized base configuration for all systems
- Impact: Consistency, reduced duplication, faster deployments
Tasks:
- Create roles/common role structure
- Implement essential package installation
- User and group management
- SSH hardening
- Time synchronization (chrony)
- System logging (rsyslog)
- Implement molecule tests
- Create comprehensive documentation
- Create cheatsheet
Timeline: Week 50-51 Estimated Effort: 16-20 hours
Features:
- Essential packages (vim, htop, tmux, jq, curl, wget, etc.)
- SSH hardening (disable root login, key-only auth)
- Chrony/NTP configuration
- Rsyslog centralized logging
- User account management
- Sudo configuration
- Timezone configuration
- Locale configuration
4.2 Security Hardening Role [P2]
Need: CIS Benchmark compliance automation
- Impact: Consistent security posture, audit compliance
Tasks:
- Create roles/security_hardening role
- Implement CIS Benchmark controls for:
  - Debian 12
  - RHEL 9/Rocky/AlmaLinux
- SELinux/AppArmor enforcement
- Firewall configuration (firewalld/ufw)
- Fail2ban setup
- AIDE file integrity monitoring
- Auditd configuration
- Kernel hardening (sysctl)
- Password policies (PAM)
- Account lockout policies
- Implement molecule tests
- Create documentation
Timeline: Weeks 51-52 (December) Estimated Effort: 24-32 hours
4.3 Monitoring Role [P2]
Need: Prometheus node_exporter for metrics collection
- Impact: Visibility into system health, capacity planning
Tasks:
- Create roles/prometheus_node_exporter role
- Install and configure node_exporter
- Configure systemd service
- Configure firewall rules
- Implement security hardening
- Create molecule tests
- Create documentation
Timeline: Week 51 Estimated Effort: 8-12 hours
4.4 Future Roles [P3]
Lower priority roles for future development:
Web Servers (Q1 2026):
- roles/nginx
- roles/apache
- roles/haproxy
Databases (Q1 2026):
- roles/postgresql
- roles/mysql
- roles/redis
Application Services (Q1-Q2 2026):
- roles/docker (security-hardened)
- roles/docker_compose
- roles/backup (Restic/Borg)
- roles/vpn (WireGuard)
5. Documentation & Standards (P2/P3)
5.1 Update CHANGELOG.md [P2]
Issue: Week 46 improvements not documented in CHANGELOG.md
- Impact: Lost historical context, version tracking incomplete
Tasks:
- Document Week 46 achievements:
  - Role compliance improvements (70% → 95%)
  - System analysis and remediation framework
  - Remediation playbooks (swap, qemu-agent)
  - Dynamic inventory migration
  - SSH access restoration
  - Documentation expansion (2,100+ lines)
- Tag version 0.2.0
- Update version numbers in relevant files
Timeline: Week 47 Estimated Effort: 1 hour
5.2 Create Testing Cheatsheet [P2]
Need: Quick reference for testing workflows
Tasks:
- Create cheatsheets/testing.md
- Document Molecule usage
- Document ansible-lint usage
- Document CI/CD pipeline
- Include troubleshooting tips
Timeline: Week 49 Estimated Effort: 2-3 hours
5.3 Dynamic Inventory Group Name Sanitization [P2]
Issue: UUID-based group names generate warnings
[WARNING]: Invalid characters were found in group names but not replaced
Tasks:
- Research inventory plugin configuration options
- Implement group name sanitization
- Test with libvirt dynamic inventory
- Document solution
Timeline: Week 48 Estimated Effort: 2-3 hours
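Ansible group names are expected to contain only letters, digits, and underscores, and not to start with a digit, which is why UUID-derived names with hyphens trigger the warning. The sketch below illustrates one possible sanitization rule; the function name and the choice to prefix an underscore are assumptions, and Ansible's own `force_valid_group_names` setting may make a custom rule unnecessary.

```shell
#!/usr/bin/env bash
# Sketch only: turn an arbitrary string into a valid-looking group name.

# sanitize_group_name <name>
# Replaces every character outside [A-Za-z0-9_] with "_" and prefixes "_"
# when the result would start with a digit.
sanitize_group_name() {
    local name
    name=$(printf '%s' "$1" | tr -c 'A-Za-z0-9_' '_')
    case "$name" in
        [0-9]*) name="_$name" ;;
    esac
    printf '%s\n' "$name"
}

# Example usage:
#   sanitize_group_name "3f2a-11ee"    # prints _3f2a_11ee
#   sanitize_group_name "web_servers"  # prints web_servers
```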
5.4 Runbook Documentation [P3]
Need: Operational procedures for common tasks
Tasks:
- Create docs/runbooks/vm-recovery.md
- Create docs/runbooks/emergency-procedures.md
- Create docs/runbooks/capacity-planning.md
- Create docs/runbooks/security-incident-response.md
Timeline: Weeks 50-52 Estimated Effort: 8-12 hours
6. Inventory & Repository Organization (P2)
6.1 Separate Inventories Repository [P2]
Need: Public inventories repository (per CLAUDE.md)
- Impact: Better separation of concerns, public/private boundary
Current State:
- inventories/ in main repository
- secrets/ in git submodule (correct)
Tasks:
- Create new public repository: inventories
- Move inventories/ directory to new repo
- Configure as git submodule
- Update .gitmodules
- Update documentation
- Test inventory loading from submodule
- Update README.md with submodule instructions
Timeline: Week 49 (per the implementation timeline) Estimated Effort: 3-4 hours
Note: Evaluate necessity - current setup with inventories/ in main repo may be acceptable for single-team usage.
7. Performance & Scalability (P3)
7.1 Fact Caching Implementation [P3]
Need: Reduce gather_facts execution time
- Current: ~1.7 seconds per host
- Target: <0.5 seconds (cached)
Tasks:
- Evaluate caching backends (Redis vs. JSON file)
- Implement fact caching in ansible.cfg
- Test cache performance
- Configure cache timeout
- Monitor cache hit rates
Timeline: Week 51 Estimated Effort: 2-4 hours
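To reason about cache hit rates with a file-based backend, one simple signal is whether a host's cached facts are younger than the configured timeout. This sketch assumes the JSON file backend stores one file per host under the cache directory; the helper name `cache_is_fresh` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether a cached fact file is still within its timeout.

# cache_is_fresh <file> <timeout_seconds>
# Succeeds when the file exists and was modified less than
# <timeout_seconds> ago.
cache_is_fresh() {
    local file="$1" timeout="$2" now mtime
    [ -f "$file" ] || return 1
    now=$(date +%s)
    mtime=$(date -r "$file" +%s)
    [ $((now - mtime)) -lt "$timeout" ]
}

# Example usage (path assumes the jsonfile settings shown in section 3.4):
#   cache_is_fresh /tmp/ansible_facts/pihole 3600 && echo "cache hit expected"
# Timing a fact-gathering run before and after enabling the cache:
#   time ansible all -m setup > /dev/null
```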
7.2 Parallel Execution Optimization [P3]
Tasks:
- Benchmark current execution times
- Increase forks parameter
- Test strategy: free for independent tasks
- Implement async tasks for long-running operations
- Document performance optimizations
Timeline: Week 52 Estimated Effort: 3-4 hours
Implementation Timeline
Week 47 (Current Week) - Critical Operations
Focus: Restore infrastructure, unblock operations
- P0: Recover derp VM connectivity (4 hours)
- P0: Resolve git push permission issue (2 hours)
- P1: Install QEMU agent on mymx (30 min)
- P1: Begin Docker security audit (2 hours)
- P2: Update CHANGELOG.md with Week 46 achievements (1 hour)
- P2: Fix ansible-galaxy configuration (30 min)
Total Estimated Effort: 10 hours
Week 48 - Testing & Quality
Focus: Establish testing infrastructure
- P1: Molecule testing implementation - Part 1 (8 hours)
- P1: Complete Docker security audit (4 hours)
- P1: Plan LVM migration for pihole (2 hours)
- P2: Pre-commit hooks setup (3 hours)
- P2: Ansible configuration optimization (2 hours)
- P2: Dynamic inventory group sanitization (2 hours)
Total Estimated Effort: 21 hours
Week 49 - CI/CD & Automation
Focus: Automated quality gates
- P1: CI/CD pipeline setup (10 hours)
- P1: Molecule testing implementation - Part 2 (8 hours)
- P2: Testing cheatsheet (3 hours)
- P2: Separate inventories repository (if needed) (4 hours)
Total Estimated Effort: 25 hours
Week 50-51 - Role Development
Focus: Expand role library
- P1: Complete Molecule testing (4 hours)
- P2: Common base system role (20 hours)
- P2: Prometheus node_exporter role (10 hours)
- P2: Automated compliance scanning (8 hours)
Total Estimated Effort: 42 hours
Week 52 - Security & Hardening
Focus: Security baseline
- P2: Security hardening role (24 hours)
- P3: Runbook documentation (8 hours)
- P3: Performance optimization (6 hours)
Total Estimated Effort: 38 hours
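As a quick consistency check, the per-period totals in this timeline can be summed to get the overall effort for Weeks 47-52. The loop below uses only the five totals stated above.

```shell
#!/usr/bin/env bash
# Sum the effort totals for Week 47, Week 48, Week 49, Week 50-51, and Week 52.
total=0
for hours in 10 21 25 42 38; do
    total=$((total + hours))
done
echo "Total planned effort: ${total} hours"
# prints: Total planned effort: 136 hours
```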
Success Metrics
Infrastructure Health
- Target: 100% VM connectivity (3/3 operational)
- Current: 67% (2/3 operational)
- Timeline: Week 47
Testing Coverage
- Target: 80% role coverage with functional Molecule tests
- Current: 0% (structure exists, not functional)
- Timeline: Week 50
CI/CD Maturity
- Target: Automated testing on all commits
- Current: 0% (no pipeline)
- Timeline: Week 49
Role Library Growth
- Target: 5 production-ready roles by end of December
- Current: 2 roles
- Timeline: Week 52
Compliance Score
- Target: 95% CLAUDE.md compliance across all hosts
- Current: 75-90% per host
- Timeline: Week 51
Time to Deploy New Role
- Target: <8 hours with full testing
- Current: Unknown (no testing framework)
- Timeline: Week 50
Risk Assessment
High Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| LVM migration data loss | CRITICAL | MEDIUM | Comprehensive backups, testing, consider rebuild |
| Molecule test complexity | HIGH | MEDIUM | Start simple, iterate, use Docker not libvirt |
| CI/CD pipeline setup delays | HIGH | MEDIUM | Use Gitea Actions (simpler), prioritize basic tests |
| derp VM unrecoverable | HIGH | LOW | Document rebuild procedure using deploy_linux_vm |
| Time constraints | MEDIUM | HIGH | Prioritize P0/P1 tasks, defer P3 tasks |
Medium Risks
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Docker security findings | MEDIUM | HIGH | Plan remediation time, may need container rebuild |
| Breaking changes during testing | MEDIUM | MEDIUM | Use check mode, test in dev environment first |
| Inventory repository complexity | MEDIUM | LOW | Evaluate if truly necessary, may skip |
Resource Requirements
Personnel
- Senior Ansible Developer: 1 FTE
- Time Allocation:
  - Week 47: 10 hours (critical ops)
  - Week 48-49: 23 hours/week (testing & CI/CD; 46 hours total)
  - Week 50-52: about 27 hours/week (role development & hardening; 80 hours total)
Infrastructure
- Existing: KVM/libvirt hypervisor, 3 VMs
- New Requirements:
  - Docker/Podman for Molecule testing (can use existing Docker on pihole)
  - CI/CD runner (can use existing infrastructure)
  - Fact cache storage (~100MB, can use local disk)
Tools & Services
- Existing: Ansible, Git, Gitea, Docker
- New: Molecule, pre-commit framework, yamllint
- Installation:
pip install molecule molecule-docker pre-commit yamllint
Dependencies
Critical Path
- Week 47: derp recovery → full infrastructure operational
- Week 48: Molecule setup → enables role testing
- Week 49: CI/CD pipeline → enables automated quality
- Week 50+: Role development → depends on testing framework
External Dependencies
- Gitea server availability (for CI/CD and git operations)
- KVM hypervisor access (for VM management)
- Internet connectivity (for package installations)
Monitoring & Review
Weekly Reviews
- Monday: Review previous week progress, adjust priorities
- Friday: Status update, document blockers
Metrics Tracking
- VM connectivity status
- Test coverage percentage
- CI/CD pipeline success rate
- CLAUDE.md compliance score
- Role count and quality
Quarterly Goals
- Q1 2026 End:
  - 10+ production-ready roles
  - 90%+ test coverage
  - Full CI/CD maturity
  - 95%+ CLAUDE.md compliance
  - Automated security scanning
Appendix: Quick Reference
Immediate Actions (This Week)
Monday-Tuesday:
1. Recover derp VM (console access)
2. Fix git push permissions
3. Update CHANGELOG.md
Wednesday-Thursday:
4. Install QEMU agent on mymx
5. Start Docker security audit
6. Fix ansible-galaxy configuration
Friday:
7. Review progress
8. Update TODO.md
9. Plan Week 48 tasks
Command Reference
# VM Recovery
virsh console derp
virsh edit mymx # Add virtio-serial
# Testing
ansible-playbook playbooks/install_qemu_agent.yml
ansible-playbook playbooks/audit_docker.yml
molecule test
# CI/CD
ansible-lint
ansible-playbook --syntax-check site.yml
yamllint .
# Monitoring
ansible-playbook playbooks/gather_system_info.yml
cat stats/machines/*/summary.txt
Related Documents
- TODO.md - Weekly task tracking
- ROADMAP.md - Strategic long-term plan
- CHANGELOG.md - Version history
- SYSTEM_ANALYSIS_AND_REMEDIATION.md - Current system state
- CLAUDE.md - Development standards and guidelines
Next Review: 2025-11-18 (Week 48)
Plan Owner: Ansible Infrastructure Team
Document Status: Active