infra-automation/IMPROVEMENT_PLAN.md
ansible f6d0ac0a9d Add comprehensive project improvement planning documents
Strategic and tactical planning documents for 12-week improvement
initiative across 7 key improvement areas.

IMPROVEMENT_PLAN.md (831 lines):
- Strategic 12-week improvement roadmap
- 7 improvement areas with priorities
- Infrastructure operations (P0/P1)
- Development quality & testing (P1/P2)
- Security & compliance (P1)
- Role development & expansion (P2/P3)
- Documentation & standards (P2/P3)
- Performance & scalability (P3)
- Detailed task breakdowns with time estimates
- Success metrics and KPIs
- Risk assessment and mitigation strategies
- Resource requirements (136 hours over 6 weeks)

TASKS_WEEK_47.md (832 lines):
- Detailed executable task plan for Week 47
- Day-by-day breakdown (Monday-Friday)
- Copy-paste ready bash commands
- Acceptance criteria for each task
- Rollback procedures
- Metrics tracking table
- Blocker identification

ASSESSMENT_SUMMARY.md (455 lines):
- Comprehensive project assessment
- Current state analysis (72/100 health score)
- Strengths and critical gaps identified
- Priority classification (P0-P3)
- Infrastructure status (67% connectivity)
- Role inventory (2 production-ready)
- Development quality gaps highlighted
- Next steps and immediate actions

Key Insights:
- Infrastructure: 67% operational (2/3 VMs reachable)
- Role compliance: 95% (excellent)
- Testing: 0% coverage (critical gap)
- CI/CD: Not implemented (critical gap)
- Documentation: 100% (excellent)

Planning Approach:
- Prioritized by impact and urgency
- Executable tasks with clear deliverables
- Time-boxed milestones
- Risk-aware with mitigation strategies
- Realistic resource estimates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 07:47:37 +01:00


Ansible Infrastructure - Improvement Plan

Date: 2025-11-11
Version: 1.0
Status: Active


Executive Summary

Based on a comprehensive analysis of the Ansible infrastructure automation project, this document outlines a prioritized improvement plan across 7 key areas: Infrastructure Operations, Security & Compliance, Development Quality & Testing, Role Development & Expansion, Documentation & Standards, Inventory & Repository Organization, and Performance & Scalability.

Current State Overview

Strengths:

  • Strong foundation with security-first CLAUDE.md guidelines (95% role compliance)
  • Dynamic inventory operational (community.libvirt)
  • 2 production-ready roles with comprehensive documentation
  • Automated remediation playbooks (swap, qemu-agent)
  • Excellent MTTR (<3 minutes for critical issues)
  • Comprehensive documentation structure (100% coverage)

Critical Gaps:

  • 1/3 VMs unreachable (derp - 33% infrastructure failure)
  • No CI/CD pipeline (high risk of regression)
  • Molecule tests non-functional (testing coverage gap)
  • Git push permission issues (operational blocker)
  • Docker security audit pending (compliance risk)
  • Limited role library (2 roles vs. target of 50+)

Metrics:

  • Operational VMs: 2/3 (67%)
  • CLAUDE.md Compliance: 75-90% per host
  • Role Count: 2 (target: 50+)
  • CI/CD Pipeline: 0% (not implemented)
  • Test Coverage: 0% (Molecule structure exists, not functional)
  • Documentation Coverage: 100%

Priority Classification

  • P0 - CRITICAL (24-48 hours): Infrastructure blocking issues
  • P1 - HIGH (1 week): Security, compliance, operational efficiency
  • P2 - MEDIUM (2-4 weeks): Quality improvements, standardization
  • P3 - LOW (1-3 months): Nice-to-have, future enhancements


Improvement Areas

1. Infrastructure Operations (P0/P1)

1.1 VM Recovery and Connectivity [P0]

Issue: derp VM unreachable (192.168.122.99)

  • Impact: 33% infrastructure failure rate
  • Root Cause: SSH authentication failure - Permission denied (publickey,password)
  • Blocking: System analysis, compliance verification

Tasks:

  • Access derp VM via libvirt console (virsh console derp)
  • Verify ansible user exists and has correct configuration
  • Deploy SSH public key to /home/ansible/.ssh/authorized_keys
  • Verify sudo configuration (passwordless sudo for ansible user)
  • Test SSH connectivity from control node
  • Execute system_info playbook against derp
  • Document recovery procedure in runbooks

Timeline: This week (Week 47)
Estimated Effort: 2-4 hours (manual console access required)
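
Once the key has been deployed through the console, the connectivity and sudo checks from the task list could be scripted rather than run by hand. The following is a minimal sketch, assuming the dynamic inventory exposes the VM under the name derp; the file name playbooks/verify_access.yml is hypothetical and not part of the repository described here.

```yaml
---
# Hypothetical file: playbooks/verify_access.yml
- name: Verify SSH and sudo access on derp after console recovery
  hosts: derp              # assumption: inventory host name matches the VM name
  gather_facts: false
  tasks:
    - name: Confirm SSH login and a usable Python interpreter
      ansible.builtin.ping:

    - name: Run a harmless command through privilege escalation
      ansible.builtin.command: whoami
      become: true
      changed_when: false
      register: sudo_check

    - name: Fail if privilege escalation did not yield root
      ansible.builtin.assert:
        that:
          - sudo_check.stdout == "root"
        fail_msg: "become did not produce a root session on derp"
```

If sudo still prompts for a password, the second task fails with a missing-password error, which points back at the sudoers configuration rather than at SSH.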

1.2 QEMU Guest Agent Deployment [P1]

Issue: mymx missing QEMU agent functionality

  • Impact: No agent-based graceful shutdown or guest-level queries (only ACPI shutdown is available), resource monitoring limited
  • Compliance: CLAUDE.md recommends QEMU agent for KVM guests

Tasks:

  • Verify virtio-serial channel exists in VM XML (virsh dumpxml mymx)
  • Add virtio-serial channel if missing (virsh edit mymx)
  • Execute playbooks/install_qemu_agent.yml on mymx
  • Verify agent communication (virsh domifaddr mymx --source agent)
  • Test guest agent commands

Timeline: This week (Week 47)
Estimated Effort: 30 minutes (playbook already exists)
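
The agent check could also be captured as a small play so it can be rerun after every change to the VM definition. This is only a sketch: it assumes the control node is also the libvirt hypervisor and that virsh can reach the mymx domain with the current user's permissions.

```yaml
---
- name: Check QEMU guest agent communication for mymx
  hosts: localhost          # assumption: control node doubles as the hypervisor
  connection: local
  gather_facts: false
  tasks:
    - name: Send a guest-ping through the agent channel
      ansible.builtin.command:
        argv:
          - virsh
          - qemu-agent-command
          - mymx
          - '{"execute": "guest-ping"}'
      changed_when: false
      register: agent_ping

    - name: Show the raw agent reply
      ansible.builtin.debug:
        var: agent_ping.stdout
```

The task fails when the agent is not installed, not running, or the virtio-serial channel is missing, so a green run covers both halves of the task list.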

1.3 LVM Migration for pihole [P1]

Issue: pihole using traditional partitioning (non-compliant with CLAUDE.md)

  • Impact: Cannot dynamically resize volumes, difficult disaster recovery
  • Risk: Data loss if migration performed incorrectly

Tasks:

  • Evaluate migration options:
    • Option A: Rebuild VM using deploy_linux_vm role (clean slate)
    • Option B: In-place migration (high risk)
    • Option C: Document exception with rationale
  • Create comprehensive backup of pihole
  • Test restore procedure
  • Execute migration plan (if approved)
  • Verify LVM configuration post-migration
  • Update compliance metrics

Timeline: Week 48-49
Estimated Effort: 4-8 hours (depends on option chosen)
Recommendation: Option A (rebuild) - cleanest approach

1.4 Git Push Permission Issue [P0]

Issue: Gitea server pre-receive hook blocking pushes

  • Impact: Cannot commit improvements to remote repository
  • Blocking: Version control, collaboration, backup

Tasks:

  • Investigate Gitea pre-receive hook configuration
  • Check repository permissions for ansible@mymx.me user
  • Verify git hooks on server side
  • Test push with verbose output
  • Document git workflow procedures

Timeline: This week (Week 47)
Estimated Effort: 1-2 hours


2. Security & Compliance (P1)

2.1 Docker Security Audit [P1]

Issue: Docker running on pihole with unknown security posture

  • Impact: Container escape risk, privilege escalation, resource exhaustion
  • Compliance: CLAUDE.md requires security audits for containerized services

Tasks:

  • Create playbooks/audit_docker.yml playbook
  • Audit docker daemon configuration (/etc/docker/daemon.json)
  • Check for privileged containers (docker inspect)
  • Verify user namespace remapping
  • Check AppArmor/SELinux profiles
  • Audit network isolation (bridge vs. host mode)
  • Check resource limits (CPU, memory)
  • Scan container images for vulnerabilities
  • Review exposed ports and services
  • Generate compliance report
  • Implement recommended hardening

Timeline: Week 47-48
Estimated Effort: 4-6 hours
Deliverables:

  • playbooks/audit_docker.yml
  • docs/security/docker-hardening.md
  • Docker security baseline role (future)
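
A possible starting point for playbooks/audit_docker.yml is sketched below. It covers only three of the audit items (privileged mode, network mode, memory limit) and assumes the community.docker collection is installed on the control node and that its Python requirements are met on pihole; neither is confirmed by this plan.

```yaml
---
# Sketch only: a first iteration of playbooks/audit_docker.yml
- name: Audit running Docker containers on pihole
  hosts: pihole
  become: true
  gather_facts: false
  tasks:
    - name: List running containers
      community.docker.docker_host_info:
        containers: true
      register: docker_host

    - name: Inspect each container
      community.docker.docker_container_info:
        name: "{{ item.Id }}"
      loop: "{{ docker_host.containers }}"
      loop_control:
        label: "{{ item.Id }}"
      register: inspected

    - name: Report privileged mode, network mode and memory limit
      ansible.builtin.debug:
        msg: >-
          {{ item.container.Name }}:
          privileged={{ item.container.HostConfig.Privileged }},
          network={{ item.container.HostConfig.NetworkMode }},
          memory_limit={{ item.container.HostConfig.Memory }}
      loop: "{{ inspected.results }}"
      loop_control:
        label: "{{ item.container.Name }}"
```

A memory limit of 0 in the inspect output means no limit is set, which is one of the findings the audit is meant to surface.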

2.2 Swap Configuration [P1]

Status: Partially complete (playbook exists)

  • pihole: Configured (2GB)
  • mymx: Configured (2GB)
  • derp: Pending (VM unreachable)

Tasks:

  • Execute configure_swap.yml on derp (after connectivity restored)
  • Verify swap persistence across reboots
  • Monitor swap usage trends

Timeline: Week 47 (after derp recovery)
Estimated Effort: 15 minutes
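
The verification steps could be expressed as a short play. This sketch assumes that configure_swap.yml persists swap through an /etc/fstab entry; if the playbook uses a systemd swap unit instead, the second task would need to change.

```yaml
---
- name: Verify swap configuration
  hosts: all
  gather_facts: true
  tasks:
    - name: Assert that swap is currently active
      ansible.builtin.assert:
        that:
          - ansible_facts['swaptotal_mb'] | int > 0
        fail_msg: "No active swap found"

    - name: Look for an uncommented swap entry in /etc/fstab
      # Assumption: persistence is implemented through fstab
      ansible.builtin.command: grep -E '^[^#].*\sswap\s' /etc/fstab
      changed_when: false
```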

2.3 Automated Compliance Scanning [P2]

Issue: Manual compliance verification is time-consuming

  • Impact: Delayed detection of configuration drift

Tasks:

  • Research OpenSCAP integration options
  • Create security_audit playbook with CIS benchmarks
  • Implement automated weekly compliance scans
  • Configure compliance reporting
  • Set up alerting for critical findings

Timeline: Week 48-50
Estimated Effort: 8-12 hours


3. Development Quality & Testing (P1/P2)

3.1 Molecule Testing Implementation [P1]

Issue: Molecule structure exists but tests are non-functional

  • Impact: No automated testing, high regression risk
  • Quality Risk: Cannot verify roles work correctly

Current State:

  • Molecule installed
  • roles/deploy_linux_vm/molecule/default/ directory exists
  • No molecule.yml configuration

Tasks:

  • Create molecule.yml for deploy_linux_vm role
  • Set up Docker/Podman test containers
  • Write converge.yml test playbook
  • Write verify.yml validation tests
  • Create test scenarios for:
    • Debian 12 deployment
    • RHEL 9 deployment
    • LVM configuration validation
    • Cloud-init template rendering
  • Document testing procedures
  • Create cheatsheets/testing.md
  • Repeat for system_info role

Timeline: Week 48-50
Estimated Effort: 12-16 hours
Priority: HIGH (required before scaling role development)

Example molecule.yml:

---
dependency:
  name: galaxy
driver:
  name: docker
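platforms:
  # pre_build_image: true uses each image as-is, so it must already ship
  # systemd and Python; stock base images may need a custom image instead.
  - name: debian-12-test
    image: debian:12
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
  - name: rockylinux-9-test
    image: rockylinux:9
    pre_build_image: true
    privileged: true
    command: /usr/sbin/init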
provisioner:
  name: ansible
  config_options:
    defaults:
      callbacks_enabled: profile_tasks, timer
  inventory:
    group_vars:
      all:
        ansible_user: root
verifier:
  name: ansible
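
The converge.yml and verify.yml files named in the task list could start out as small as the sketch below. The role name comes from this plan; the path checked in verify.yml is a placeholder, since the plan does not say which files the role manages.

```yaml
---
# molecule/default/converge.yml
- name: Converge
  hosts: all
  tasks:
    - name: Apply the role under test
      ansible.builtin.include_role:
        name: deploy_linux_vm
```

```yaml
---
# molecule/default/verify.yml
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check a file the role is expected to manage
      ansible.builtin.stat:
        path: /etc/example.conf   # hypothetical path, replace with a real artifact
      register: managed_file

    - name: Assert that the file exists
      ansible.builtin.assert:
        that:
          - managed_file.stat.exists
```

Starting with a single assertion keeps the first molecule test run focused on getting the container driver working, in line with the "start simple, iterate" mitigation in the risk table.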

3.2 CI/CD Pipeline Setup [P1]

Issue: No automated testing on commits/PRs

  • Impact: Manual quality control, slow feedback loop
  • Risk: Breaking changes reach main branch

Tasks:

  • Evaluate CI/CD options:
    • Gitea Actions (preferred - native integration)
    • Jenkins (more features, higher complexity)
    • GitLab CI (if migrating from Gitea)
  • Create .gitea/workflows/ci.yml
  • Implement pipeline stages:
    • Syntax validation (ansible-playbook --syntax-check)
    • Linting (ansible-lint)
    • YAML validation (yamllint)
    • Molecule tests
    • Security scanning (ansible-audit)
  • Configure branch protection rules
  • Set up status checks for pull requests
  • Configure notifications (email/webhook)

Timeline: Week 49-50
Estimated Effort: 8-12 hours

Example Gitea Actions workflow:

name: Ansible CI

on:
  push:
    branches: [ master, develop ]
  pull_request:
    branches: [ master ]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run ansible-lint
        run: |
          pip install ansible-lint
          ansible-lint

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run Molecule tests
        # Assumes the runner can start Docker containers for the Molecule driver
        run: |
          pip install molecule "molecule-plugins[docker]"
          cd roles/deploy_linux_vm
          molecule test
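
The syntax and YAML validation stages from the pipeline list are not covered by the example above. A separate workflow for them might look like the following sketch, which assumes the top-level playbook is site.yml (as in the command reference) and that a yamllint configuration either exists in the repository or the defaults are acceptable.

```yaml
name: Ansible validation

on:
  pull_request:
    branches: [ master ]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install tools
        run: pip install ansible-core yamllint
      - name: YAML validation
        run: yamllint .
      - name: Playbook syntax check
        # Collections used by the playbooks may need to be installed first
        run: ansible-playbook --syntax-check site.yml
```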

3.3 Pre-commit Hooks [P2]

Issue: No local quality checks before commits

  • Impact: Quality issues reach repository

Tasks:

  • Install pre-commit framework
  • Create .pre-commit-config.yaml
  • Configure hooks:
    • ansible-lint
    • yamllint
    • trailing whitespace removal
    • end-of-file fixer
    • mixed line endings check
  • Document pre-commit setup in README.md
  • Create setup script for developers

Timeline: Week 48
Estimated Effort: 2-4 hours
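
A .pre-commit-config.yaml covering the hooks listed above could look like this sketch. The rev values are example pins, not versions chosen by this plan; pre-commit autoupdate can move them to current releases.

```yaml
---
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0          # example pin
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: mixed-line-ending
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1         # example pin
    hooks:
      - id: yamllint
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.2.0         # example pin
    hooks:
      - id: ansible-lint
```

After pre-commit install, the hooks run on every commit; pre-commit run --all-files gives a one-off baseline for the existing tree.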

3.4 Ansible Configuration Optimization [P2]

Current Config:

gathering = smart
callbacks_enabled = profile_tasks, timer
# Missing: forks, pipelining, fact_caching

Tasks:

  • Enable SSH pipelining for performance
  • Implement fact caching (Redis or JSON file)
  • Increase forks for parallel execution
  • Configure strategy plugins
  • Enable ControlMaster for SSH connection reuse
  • Document configuration choices

Timeline: Week 48
Estimated Effort: 2-3 hours

Recommended additions:

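[defaults]
gathering = smart
callbacks_enabled = profile_tasks, timer
forks = 20
# Disables SSH host key verification; weigh this against the security-first
# guidelines before enabling it outside an isolated lab network.
host_key_checking = False
retry_files_enabled = False
fact_caching = jsonfile
# /tmp is world-writable; a directory owned by the Ansible user may be a
# safer cache location.
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

[ssh_connection]
# Pipelining requires that requiretty is not set in sudoers on the targets.
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=3600s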

3.5 Ansible Galaxy Configuration Fix [P2]

Issue: ansible-galaxy collection list fails with galaxy_server config error

Tasks:

  • Fix ansible.cfg galaxy_server configuration
  • Verify collection installations
  • Document collection management procedures

Timeline: Week 47
Estimated Effort: 30 minutes
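
For the collection management procedure, one common approach is to pin collections in a requirements file. A minimal sketch, assuming the file is stored as collections/requirements.yml (a hypothetical location) and listing only the collection this plan names explicitly:

```yaml
---
# Hypothetical file: collections/requirements.yml
collections:
  - name: community.libvirt   # used by the dynamic inventory
```

The file can then be installed with ansible-galaxy collection install -r collections/requirements.yml once the galaxy_server configuration error is resolved.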


4. Role Development & Expansion (P2/P3)

4.1 Common Base System Role [P2]

Need: Standardized base configuration for all systems

  • Impact: Consistency, reduced duplication, faster deployments

Tasks:

  • Create roles/common role structure
  • Implement essential package installation
  • User and group management
  • SSH hardening
  • Time synchronization (chrony)
  • System logging (rsyslog)
  • Implement molecule tests
  • Create comprehensive documentation
  • Create cheatsheet

Timeline: Week 50-51
Estimated Effort: 16-20 hours

Features:

  • Essential packages (vim, htop, tmux, jq, curl, wget, etc.)
  • SSH hardening (disable root login, key-only auth)
  • Chrony/NTP configuration
  • Rsyslog centralized logging
  • User account management
  • Sudo configuration
  • Timezone configuration
  • Locale configuration
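
As an illustration of the SSH hardening feature, the tasks might be sketched as follows. This is not the role's actual implementation; it is written as a standalone play for readability and assumes OpenSSH with its main configuration in /etc/ssh/sshd_config.

```yaml
---
- name: Sketch of SSH hardening tasks for the common role
  hosts: all
  become: true
  tasks:
    - name: Disable root login and password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "^#?\\s*{{ item.key }}\\s"
        line: "{{ item.key }} {{ item.value }}"
        validate: /usr/sbin/sshd -t -f %s   # refuse to write an invalid config
      loop:
        - { key: PermitRootLogin, value: "no" }
        - { key: PasswordAuthentication, value: "no" }
      notify: Restart sshd

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        # The service is named ssh on Debian and sshd on RHEL-family systems
        name: "{{ 'ssh' if ansible_facts['os_family'] == 'Debian' else 'sshd' }}"
        state: restarted
```

Key-based access for the ansible user should be confirmed before applying this, since a host without a working key becomes reachable only through the console.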

4.2 Security Hardening Role [P2]

Need: CIS Benchmark compliance automation

  • Impact: Consistent security posture, audit compliance

Tasks:

  • Create roles/security_hardening role
  • Implement CIS Benchmark controls for:
    • Debian 12
    • RHEL 9/Rocky/AlmaLinux
  • SELinux/AppArmor enforcement
  • Firewall configuration (firewalld/ufw)
  • Fail2ban setup
  • AIDE file integrity monitoring
  • Auditd configuration
  • Kernel hardening (sysctl)
  • Password policies (PAM)
  • Account lockout policies
  • Implement molecule tests
  • Create documentation

Timeline: Weeks 51-52 (December)
Estimated Effort: 24-32 hours

4.3 Monitoring Role [P2]

Need: Prometheus node_exporter for metrics collection

  • Impact: Visibility into system health, capacity planning

Tasks:

  • Create roles/prometheus_node_exporter role
  • Install and configure node_exporter
  • Configure systemd service
  • Configure firewall rules
  • Implement security hardening
  • Create molecule tests
  • Create documentation

Timeline: Week 51
Estimated Effort: 8-12 hours

4.4 Future Roles (P3)

Lower priority roles for future development:

Web Servers (Q1 2026):

  • roles/nginx
  • roles/apache
  • roles/haproxy

Databases (Q1 2026):

  • roles/postgresql
  • roles/mysql
  • roles/redis

Application Services (Q1-Q2 2026):

  • roles/docker (security-hardened)
  • roles/docker_compose
  • roles/backup (Restic/Borg)
  • roles/vpn (WireGuard)

5. Documentation & Standards (P2/P3)

5.1 Update CHANGELOG.md [P2]

Issue: Week 46 improvements not documented in CHANGELOG.md

  • Impact: Lost historical context, version tracking incomplete

Tasks:

  • Document Week 46 achievements:
    • Role compliance improvements (70% → 95%)
    • System analysis and remediation framework
    • Remediation playbooks (swap, qemu-agent)
    • Dynamic inventory migration
    • SSH access restoration
    • Documentation expansion (2,100+ lines)
  • Tag version 0.2.0
  • Update version numbers in relevant files

Timeline: Week 47
Estimated Effort: 1 hour

5.2 Create Testing Cheatsheet [P2]

Need: Quick reference for testing workflows

Tasks:

  • Create cheatsheets/testing.md
  • Document Molecule usage
  • Document ansible-lint usage
  • Document CI/CD pipeline
  • Include troubleshooting tips

Timeline: Week 49
Estimated Effort: 2-3 hours

5.3 Dynamic Inventory Group Name Sanitization [P2]

Issue: UUID-based group names generate warnings

[WARNING]: Invalid characters were found in group names but not replaced

Tasks:

  • Research inventory plugin configuration options
  • Implement group name sanitization (for example, the force_valid_group_names setting in the [defaults] section of ansible.cfg)
  • Test with libvirt dynamic inventory
  • Document solution

Timeline: Week 48
Estimated Effort: 2-3 hours

5.4 Runbook Documentation [P3]

Need: Operational procedures for common tasks

Tasks:

  • Create docs/runbooks/vm-recovery.md
  • Create docs/runbooks/emergency-procedures.md
  • Create docs/runbooks/capacity-planning.md
  • Create docs/runbooks/security-incident-response.md

Timeline: Weeks 50-52
Estimated Effort: 8-12 hours


6. Inventory & Repository Organization (P2)

6.1 Separate Inventories Repository [P2]

Need: Public inventories repository (per CLAUDE.md)

  • Impact: Better separation of concerns, public/private boundary

Current State:

  • inventories/ in main repository
  • secrets/ in git submodule (correct)

Tasks:

  • Create new public repository: inventories
  • Move inventories/ directory to new repo
  • Configure as git submodule
  • Update .gitmodules
  • Update documentation
  • Test inventory loading from submodule
  • Update README.md with submodule instructions

Timeline: Week 48
Estimated Effort: 3-4 hours

Note: Evaluate necessity - current setup with inventories/ in main repo may be acceptable for single-team usage.


7. Performance & Scalability (P3)

7.1 Fact Caching Implementation [P3]

Need: Reduce gather_facts execution time

  • Current: ~1.7 seconds per host
  • Target: <0.5 seconds (cached)

Tasks:

  • Evaluate caching backends (Redis vs. JSON file)
  • Implement fact caching in ansible.cfg
  • Test cache performance
  • Configure cache timeout
  • Monitor cache hit rates

Timeline: Week 51
Estimated Effort: 2-4 hours

7.2 Parallel Execution Optimization [P3]

Tasks:

  • Benchmark current execution times
  • Increase forks parameter
  • Test the free strategy (strategy: free) for independent tasks
  • Implement async tasks for long-running operations
  • Document performance optimizations

Timeline: Week 52
Estimated Effort: 3-4 hours
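
The async and free-strategy items can be illustrated together. In this sketch the sleep command is a placeholder for a real long-running operation, and the timeout values are arbitrary.

```yaml
---
- name: Run a long operation without blocking the play
  hosts: all
  strategy: free            # each host proceeds without waiting for the others
  become: true
  gather_facts: false
  tasks:
    - name: Start the long-running command in the background
      ansible.builtin.command: sleep 30   # placeholder workload
      async: 600            # allow up to 10 minutes
      poll: 0               # do not wait here
      changed_when: false
      register: job

    - name: Wait for the background job to finish
      ansible.builtin.async_status:
        jid: "{{ job.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 60
      delay: 10
```

Both tasks run with the same privilege escalation settings, which matters because async_status looks up the job under the user that started it.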


Implementation Timeline

Week 47 (Current Week) - Critical Operations

Focus: Restore infrastructure, unblock operations

  • P0: Recover derp VM connectivity (4 hours)
  • P0: Resolve git push permission issue (2 hours)
  • P1: Install QEMU agent on mymx (30 min)
  • P1: Begin Docker security audit (2 hours)
  • P2: Update CHANGELOG.md with Week 46 achievements (1 hour)
  • P2: Fix ansible-galaxy configuration (30 min)

Total Estimated Effort: 10 hours

Week 48 - Testing & Quality

Focus: Establish testing infrastructure

  • P1: Molecule testing implementation - Part 1 (8 hours)
  • P1: Complete Docker security audit (4 hours)
  • P1: Plan LVM migration for pihole (2 hours)
  • P2: Pre-commit hooks setup (3 hours)
  • P2: Ansible configuration optimization (2 hours)
  • P2: Dynamic inventory group sanitization (2 hours)

Total Estimated Effort: 21 hours

Week 49 - CI/CD & Automation

Focus: Automated quality gates

  • P1: CI/CD pipeline setup (10 hours)
  • P1: Molecule testing implementation - Part 2 (8 hours)
  • P2: Testing cheatsheet (3 hours)
  • P2: Separate inventories repository (if needed) (4 hours)

Total Estimated Effort: 25 hours

Week 50-51 - Role Development

Focus: Expand role library

  • P1: Complete Molecule testing (4 hours)
  • P2: Common base system role (20 hours)
  • P2: Prometheus node_exporter role (10 hours)
  • P2: Automated compliance scanning (8 hours)

Total Estimated Effort: 42 hours

Week 52 - Security & Hardening

Focus: Security baseline

  • P2: Security hardening role (24 hours)
  • P3: Runbook documentation (8 hours)
  • P3: Performance optimization (6 hours)

Total Estimated Effort: 38 hours


Success Metrics

Infrastructure Health

  • Target: 100% VM connectivity (3/3 operational)
  • Current: 67% (2/3 operational)
  • Timeline: Week 47

Testing Coverage

  • Target: 80% role coverage with functional Molecule tests
  • Current: 0% (structure exists, not functional)
  • Timeline: Week 50

CI/CD Maturity

  • Target: Automated testing on all commits
  • Current: 0% (no pipeline)
  • Timeline: Week 49

Role Library Growth

  • Target: 5 production-ready roles by end of December
  • Current: 2 roles
  • Timeline: Week 52

Compliance Score

  • Target: 95% CLAUDE.md compliance across all hosts
  • Current: 75-90% per host
  • Timeline: Week 51

Time to Deploy New Role

  • Target: <8 hours with full testing
  • Current: Unknown (no testing framework)
  • Timeline: Week 50

Risk Assessment

High Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| LVM migration data loss | CRITICAL | MEDIUM | Comprehensive backups, testing, consider rebuild |
| Molecule test complexity | HIGH | MEDIUM | Start simple, iterate, use Docker not libvirt |
| CI/CD pipeline setup delays | HIGH | MEDIUM | Use Gitea Actions (simpler), prioritize basic tests |
| derp VM unrecoverable | HIGH | LOW | Document rebuild procedure using deploy_linux_vm |
| Time constraints | MEDIUM | HIGH | Prioritize P0/P1 tasks, defer P3 tasks |

Medium Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Docker security findings | MEDIUM | HIGH | Plan remediation time, may need container rebuild |
| Breaking changes during testing | MEDIUM | MEDIUM | Use check mode, test in dev environment first |
| Inventory repository complexity | MEDIUM | LOW | Evaluate if truly necessary, may skip |

Resource Requirements

Personnel

  • Senior Ansible Developer: 1 FTE
  • Time Allocation:
    • Week 47: 10 hours (critical ops)
    • Week 48-49: 23 hours/week (testing & CI/CD)
    • Week 50-52: ~27 hours/week (role development)

Infrastructure

  • Existing: KVM/libvirt hypervisor, 3 VMs
  • New Requirements:
    • Docker/Podman for Molecule testing (can use existing Docker on pihole)
    • CI/CD runner (can use existing infrastructure)
    • Fact cache storage (~100MB, can use local disk)

Tools & Services

  • Existing: Ansible, Git, Gitea, Docker
  • New: Molecule, pre-commit framework, yamllint
  • Installation: pip install molecule "molecule-plugins[docker]" pre-commit yamllint

Dependencies

Critical Path

  1. Week 47: derp recovery → full infrastructure operational
  2. Week 48: Molecule setup → enables role testing
  3. Week 49: CI/CD pipeline → enables automated quality
  4. Week 50+: Role development → depends on testing framework

External Dependencies

  • Gitea server availability (for CI/CD and git operations)
  • KVM hypervisor access (for VM management)
  • Internet connectivity (for package installations)

Monitoring & Review

Weekly Reviews

  • Monday: Review previous week progress, adjust priorities
  • Friday: Status update, document blockers

Metrics Tracking

  • VM connectivity status
  • Test coverage percentage
  • CI/CD pipeline success rate
  • CLAUDE.md compliance score
  • Role count and quality

Quarterly Goals

  • Q1 2026 End:
    • 10+ production-ready roles
    • 90%+ test coverage
    • Full CI/CD maturity
    • 95%+ CLAUDE.md compliance
    • Automated security scanning

Appendix: Quick Reference

Immediate Actions (This Week)

Monday-Tuesday:

  1. Recover derp VM (console access)
  2. Fix git push permissions
  3. Update CHANGELOG.md

Wednesday-Thursday:

  4. Install QEMU agent on mymx
  5. Start Docker security audit
  6. Fix ansible-galaxy configuration

Friday:

  7. Review progress
  8. Update TODO.md
  9. Plan Week 48 tasks

Command Reference

# VM Recovery
virsh console derp
virsh edit mymx  # Add virtio-serial

# Testing
ansible-playbook playbooks/install_qemu_agent.yml
ansible-playbook playbooks/audit_docker.yml
molecule test

# CI/CD
ansible-lint
ansible-playbook --syntax-check site.yml
yamllint .

# Monitoring
ansible-playbook playbooks/gather_system_info.yml
cat stats/machines/*/summary.txt


Next Review: 2025-11-18 (Monday, Week 48)
Plan Owner: Ansible Infrastructure Team
Document Status: Active