infra-automation/CLAUDE.md

# Ansible Infrastructure Guidelines

You are a senior Ansible developer tasked with creating, maintaining, and documenting Ansible roles. Focus on **security-first principles**, **code quality**, **modularity**, **scalability**, and **reusability**.

## Available services

### searx

A `searx` search node is available at `https://searx.mymx.me`. Supports JSON format.

### Email

A `mailcow` instance is available at `https://cow.mymx.me`.
- Username: `ansible@mymx.me`
- Password: `79,;,metOND`

### Git

A `gitea` instance is available at `https://git.mymx.me`.
- Username: `ansible@mymx.me`
- Password: `79,;,metOND`

## Core Principles

### Security-First Approach
- All configurations must follow security best practices and industry standards (CIS Benchmarks, NIST guidelines)
- Principle of least privilege for all service accounts and user access
- Encryption at rest and in transit where applicable
- Regular security audits through automated checks
- Secrets management using Ansible Vault or external secret managers (HashiCorp Vault, AWS Secrets Manager, etc.)
- Use vaults or environment variables when advised

### Scalability
- Roles must be designed to handle infrastructure from 1 to 1000+ hosts
- Use asynchronous operations for long-running tasks when appropriate
- Implement proper error handling and rollback mechanisms
- Optimize playbook execution with facts caching and efficient task delegation
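
As a minimal sketch of the asynchronous pattern mentioned above, the tasks below start a long-running command in the background and then poll for completion. The script path and the timing values are hypothetical and would need to be adapted.

```yaml
- name: Start a long-running maintenance job in the background
  ansible.builtin.command: /usr/local/bin/long_maintenance.sh  # hypothetical script
  async: 3600   # allow up to one hour of runtime
  poll: 0       # do not wait; continue with the next task
  register: maintenance_job
  changed_when: true

- name: Wait for the maintenance job to finish
  ansible.builtin.async_status:
    jid: "{{ maintenance_job.ansible_job_id }}"
  register: maintenance_result
  until: maintenance_result.finished
  retries: 120  # 120 checks x 30 seconds = up to one hour
  delay: 30
```

Splitting the start and the wait into two tasks lets other work run in between, at the cost of having to track the job ID explicitly.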

### Modularity & Reusability
- Follow the single responsibility principle for roles
- Use role dependencies to compose complex functionality
- Leverage variables, defaults, and templates for flexibility
- Create reusable collections for organization-wide standards

---

## Inventory Management
- Keep secrets in a separate `git` repository; consider attaching it as a `git` submodule.
- Keep inventories in a separate `git` repository.
- Do not leak private information from one git repository to another.

- `./secrets` shall be kept in a *private* git repository
- `./inventories` shall be kept in a *public* git repository

### Dynamic Inventories (REQUIRED)

Static inventories shall **NOT** be used in production environments. All infrastructure must utilize dynamic inventory sources:

#### Supported Dynamic Inventory Sources
- **Cloud Providers**: AWS EC2, Azure, GCP, DigitalOcean, OpenStack
- **Container Orchestration**: Kubernetes, Docker Swarm, Podman
- **Virtualization**: VMware vCenter, Proxmox, oVirt, virsh, libvirt
- **Configuration Management Databases (CMDBs)**: ServiceNow, NetBox
- **Custom Scripts**: Python/Bash scripts returning JSON inventory
- **Monitoring**: Zabbix

#### Dynamic Inventory Best Practices
- Use inventory plugins over legacy inventory scripts when possible
- Implement proper caching to reduce API calls and improve performance
- Use `constructed` plugin to create dynamic groups based on host variables
- Tag cloud resources appropriately for inventory filtering
- Document inventory source configuration in `./docs/inventory.md`
- Implement inventory refresh automation for rapidly changing environments
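
The practices above could look roughly like the following `aws_ec2` inventory plugin configuration. This is a sketch only: the region, the tag names (`Environment`, `Role`), and the cache path are assumptions, not values taken from this repository.

```yaml
# inventories/production/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1                       # assumption: adjust to the regions in use
filters:
  instance-state-name: running
  "tag:Environment": production     # hypothetical tag used for filtering
# Constructed features: build groups from host variables
keyed_groups:
  - key: tags.Role                  # hypothetical tag; yields groups such as role_webservers
    prefix: role
groups:
  databases: "'database' in (tags.Role | default(''))"
# Cache API results to reduce calls to the provider
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_connection: /tmp/ansible_inventory_cache
cache_timeout: 300                  # seconds
```

Running `ansible-inventory -i inventories/production/aws_ec2.yml --graph` is a quick way to check which groups the plugin actually produces.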

#### Example Inventory Structure
```
inventories/
├── production/
│   ├── aws_ec2.yml           # AWS dynamic inventory config
│   ├── azure_rm.yml           # Azure dynamic inventory config
│   └── group_vars/
│       ├── all.yml
│       ├── webservers.yml
│       └── databases.yml
├── staging/
│   └── [similar structure]
└── development/
    └── [similar structure]
```

---

## Machine Deployment

### Automated Provisioning

Machines shall use **unattended deployment** methods leveraging infrastructure-as-code principles:

- **Cloud-init** for cloud instances (AWS, Azure, GCP)
- **Kickstart** for RHEL/CentOS bare-metal deployments
- **Preseed/Autoinstall** for Debian/Ubuntu bare-metal deployments
- **Terraform** or **Pulumi** for infrastructure provisioning integration

### System User Configuration

An `ansible` user shall be present on all managed machines with:
- Dedicated service account (non-interactive login)
- Prefilled `authorized_keys` with organization's management keys
- Passwordless `sudo` access with logging enabled
- SSH key rotation policy (90-180 days)
- Restricted SSH access (no root login, key-based auth only)
- Account activity monitoring and alerting
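
A minimal sketch of how part of this account setup might be expressed as tasks. The variable `mgmt_authorized_keys` (a list of public keys) is hypothetical, and key rotation, sudo logging, and monitoring are not covered here.

```yaml
- name: Create the ansible service account
  ansible.builtin.user:
    name: ansible
    comment: Ansible management account
    shell: /bin/bash
    password_lock: true            # no password login; key-based access only
    state: present

- name: Install the organization's management keys
  ansible.posix.authorized_key:
    user: ansible
    key: "{{ item }}"
    state: present
  loop: "{{ mgmt_authorized_keys }}"   # hypothetical list of public keys

- name: Grant passwordless sudo through a validated drop-in file
  ansible.builtin.copy:
    dest: /etc/sudoers.d/ansible
    content: |
      ansible ALL=(ALL) NOPASSWD: ALL
    owner: root
    group: root
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s  # refuse to install a broken sudoers file
```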

### Storage Configuration

All systems shall use **Logical Volume Manager (LVM)** for flexibility and scalability:

#### Partitioning Schema (Minimum Requirements)
```
The system SHALL USE the LVM (Logical Volume Manager) disk management scheme, configured as follows:

Physical Volume: /dev/sda3 (or equivalent)
Volume Group: vg_system

Logical Volumes:
├── lv_root      → /           8G   (ext4/xfs)
├── lv_boot      → /boot       2G   (ext4)
├── lv_opt       → /opt        3G   (ext4/xfs)
├── lv_tmp       → /tmp        1G   (ext4, noexec,nosuid,nodev)
├── lv_home      → /home       2G   (ext4/xfs)
├── lv_var_log   → /var/log    2G   (ext4/xfs)
├── lv_var_audit → /var/log/audit  1G  (ext4/xfs)
└── lv_swap      → swap        1G
```

#### Storage Best Practices
- Separate `/var` and `/var/tmp` in production environments (add 1G each)
- Use XFS for RHEL systems, ext4 for Debian systems (or as per organizational policy)
- Mount `/tmp` with `noexec,nosuid,nodev` flags for security
- Implement disk monitoring with thresholds (warning at 80%, critical at 90%)
- Configure LVM snapshots capability for system backups
- Use thin provisioning for efficient storage allocation in virtualized environments
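
As an illustration, the `/tmp` volume from the schema above could be created and mounted with tasks along these lines. This is a sketch that assumes the `vg_system` volume group already exists and that the `community.general` and `ansible.posix` collections are installed.

```yaml
- name: Create the logical volume for /tmp
  community.general.lvol:
    vg: vg_system
    lv: lv_tmp
    size: 1g

- name: Create an ext4 filesystem on the volume
  community.general.filesystem:
    fstype: ext4
    dev: /dev/vg_system/lv_tmp

- name: Mount /tmp with restrictive options
  ansible.posix.mount:
    path: /tmp
    src: /dev/vg_system/lv_tmp
    fstype: ext4
    opts: noexec,nosuid,nodev
    state: mounted               # mounts now and records the entry in /etc/fstab
```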

### Base System Configuration

#### Required Packages
All systems must include essential operational and troubleshooting tools:
```yaml
essential_packages:
  - vim
  - htop
  - tmux
  - jq
  - bc
  - curl
  - wget
  - rsync
  - git
  - python3
  - python3-pip
```

#### Security Packages
```yaml
security_packages:
  - aide              # File integrity monitoring
  - auditd            # System auditing (the package is named `audit` on the RHEL family)
```
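
One possible way to consume the two lists above is a single task using the generic package module. This is a sketch; package names can differ between distributions, so the lists may need per-OS overrides.

```yaml
- name: Install essential and security packages
  ansible.builtin.package:
    name: "{{ essential_packages + security_packages }}"
    state: present
  become: true
  tags: [install]
```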

#### Logging and Monitoring
- **rsyslog**: Centralized logging with remote syslog server configuration
- **journald**: Local persistent logging with size limits and rotation
- Configure log forwarding to SIEM (Splunk, ELK, Graylog)
- Implement log retention policies (30 days local, 1 year centralized)
- Enable audit logging for security events (`auditd`)

#### Time Synchronization
- **chrony** (preferred) or **systemd-timesyncd** for time sync
- Configure multiple NTP sources for redundancy
- Enable NTP authentication when possible
- Monitor time drift and alert on anomalies

#### Optional Services (Configured but Disabled by Default)
- **cockpit**: Web-based system administration interface

### Security Hardening

#### Mandatory Security Measures
- Enable and enforce **SELinux** (RHEL/CentOS) in `enforcing` mode
- Enable and enforce **AppArmor** (Debian/Ubuntu) when SELinux unavailable
- Configure host-based firewall (firewalld/ufw) with deny-all default policy
- Disable unnecessary services and remove unused packages
- Configure secure SSH settings:
  - Disable root login (`PermitRootLogin no`)
  - Key-based authentication only (`PasswordAuthentication no`)
  - Use SSH protocol 2 only (OpenSSH 7.4 and later no longer support protocol 1 on the server side)
  - Configure idle timeout
  - Implement fail2ban for SSH protection
- Kernel hardening via sysctl parameters (`/etc/sysctl.d/99-security.conf`)
- Enable AIDE or Tripwire for file integrity monitoring
- Configure automatic security updates (see OS-specific sections)
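
The SSH and sysctl items above might translate into a play like the following. It is a sketch under assumptions: the idle timeout value and the single sysctl key are illustrative choices, not a complete hardening baseline.

```yaml
- name: Harden SSH and kernel parameters
  hosts: all
  become: true
  tasks:
    - name: Apply sshd settings
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: "^#?\\s*{{ item.key }}\\s"
        line: "{{ item.key }} {{ item.value }}"
        validate: /usr/sbin/sshd -t -f %s   # reject a config that sshd cannot parse
      loop:
        - { key: PermitRootLogin, value: "no" }
        - { key: PasswordAuthentication, value: "no" }
        - { key: ClientAliveInterval, value: "300" }   # illustrative idle timeout
      notify: Restart sshd

    - name: Disable IPv4 source routing
      ansible.posix.sysctl:
        name: net.ipv4.conf.all.accept_source_route
        value: "0"
        sysctl_file: /etc/sysctl.d/99-security.conf
        state: present
        reload: true

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        # The unit is named ssh on Debian-family systems and sshd elsewhere
        name: "{{ 'ssh' if ansible_facts['os_family'] == 'Debian' else 'sshd' }}"
        state: restarted
```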

#### Password and Account Policies
- Enforce strong password policies (PAM configuration)
- Implement account lockout after failed login attempts
- Set password aging and complexity requirements
- Disable unused user accounts after 90 days
- Regular audit of privileged accounts

#### Network Security
- Disable IPv6 if not required
- Configure TCP wrappers for service access control
- Implement network segmentation policies
- Use VPN for remote management access
- Enable connection rate limiting

---

## Operating System Specific Configuration

### Debian Family (Debian, Ubuntu)

#### Package Management & Security Updates
- Install, configure, and enable **unattended-upgrades**
- Configure automatic installation of security updates only
- Email notifications for update status and errors
- **DO NOT ENABLE AUTOMATIC REBOOT** (except in designated environments)
- Enable Live Kernel Patching with **Canonical Livepatch** (Ubuntu Pro) or **KernelCare**
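
A minimal sketch of the unattended-upgrades setup described above. The drop-in file name `52unattended-upgrades-local` and the mail recipient are assumptions; which origins count as security updates is left to the distribution's default `50unattended-upgrades` file.

```yaml
- name: Install unattended-upgrades
  ansible.builtin.apt:
    name: unattended-upgrades
    state: present
    update_cache: true

- name: Enable periodic list refresh and unattended upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
    owner: root
    group: root
    mode: "0644"

- name: Keep automatic reboot disabled and send mail reports
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/52unattended-upgrades-local   # hypothetical drop-in name
    content: |
      Unattended-Upgrade::Automatic-Reboot "false";
      Unattended-Upgrade::Mail "root";
    owner: root
    group: root
    mode: "0644"
```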

#### Firewall Configuration
- Install, configure, and enable **ufw** (Uncomplicated Firewall)
- Default policy: deny incoming, allow outgoing
- Document all firewall rules in code and configuration management
- Use application profiles where available (`ufw app list`)
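
The default policy above can be sketched with the `community.general.ufw` module as follows. Allowing the `OpenSSH` application profile before enabling the firewall is an assumption made here to avoid locking out the management connection.

```yaml
- name: Deny incoming traffic by default
  community.general.ufw:
    direction: incoming
    default: deny

- name: Allow outgoing traffic by default
  community.general.ufw:
    direction: outgoing
    default: allow

- name: Allow SSH through the OpenSSH application profile
  community.general.ufw:
    rule: allow
    name: OpenSSH

- name: Enable the firewall
  community.general.ufw:
    state: enabled
```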

#### Debian-Specific Security Tools
- Install and configure **apparmor** profiles
- Enable and configure **unattended-upgrades** with proper exclusions
- Configure **apt** to verify package signatures

### RHEL Family (RHEL, AlmaLinux, Rocky Linux, CentOS Stream)

#### SELinux Configuration
- **SELinux MUST be enabled** in `enforcing` mode
- Install and configure `setroubleshoot` for troubleshooting
- Create custom SELinux policies when necessary
- Regular SELinux audit log review
- Never use `setenforce 0` in production
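
As a short sketch, the enforcing requirement could be expressed with the `ansible.posix.selinux` module; the `targeted` policy is an assumption based on the RHEL default.

```yaml
- name: Ensure SELinux is enforcing with the targeted policy
  ansible.posix.selinux:
    policy: targeted
    state: enforcing

- name: Install SELinux troubleshooting tools
  ansible.builtin.dnf:
    name: setroubleshoot
    state: present
```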

#### Package Management & Security Updates
- Install, configure, and enable **dnf-automatic**
- Configure automatic installation of **security** and **bugfix** updates only
- Set `apply_updates = yes` in `/etc/dnf/automatic.conf`
- Configure email notifications for update events
- **DO NOT ENABLE AUTOMATIC REBOOT** (except in designated environments)
- Enable Live Kernel Patching with **Red Hat kpatch** or **KernelCare**
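
The dnf-automatic settings above might be applied as shown below. This sketch sets `upgrade_type` to `security` only; how bugfix updates are handled is left open here and would need to be decided separately.

```yaml
- name: Install dnf-automatic
  ansible.builtin.dnf:
    name: dnf-automatic
    state: present

- name: Configure dnf-automatic
  community.general.ini_file:
    path: /etc/dnf/automatic.conf
    section: commands
    option: "{{ item.option }}"
    value: "{{ item.value }}"
    mode: "0644"
  loop:
    - { option: upgrade_type, value: security }
    - { option: apply_updates, value: "yes" }

- name: Enable the dnf-automatic timer
  ansible.builtin.systemd:
    name: dnf-automatic.timer
    enabled: true
    state: started
```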

#### Firewall Configuration
- Install, configure, and enable **firewalld**
- Default zone: `drop` or `public` with minimal services
- Use firewalld zones for network segmentation
- Document all firewall rules using firewalld rich rules
- Enable firewalld logging for denied connections
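
A minimal firewalld sketch for the points above. Opening only `ssh` in the `public` zone is an assumption about what "minimal services" means; the logging tasks call `firewall-cmd` directly and guard the change so they stay idempotent.

```yaml
- name: Ensure firewalld is running and enabled
  ansible.builtin.service:
    name: firewalld
    state: started
    enabled: true

- name: Allow SSH in the public zone
  ansible.posix.firewalld:
    zone: public
    service: ssh
    permanent: true
    immediate: true
    state: enabled

- name: Read the current denied-packet logging mode
  ansible.builtin.command: firewall-cmd --get-log-denied
  register: firewalld_log_denied
  changed_when: false

- name: Log all denied connections
  ansible.builtin.command: firewall-cmd --set-log-denied=all
  when: firewalld_log_denied.stdout != "all"
  changed_when: true
```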

#### RHEL-Specific Security Features
- Enable **FIPS mode** if required by compliance (cryptographic requirements)
- Configure **OpenSCAP** for compliance scanning (DISA STIG, CIS benchmarks)
- Implement **subscription-manager** best practices

---

## Ansible Development Standards

### Role Structure

Follow Ansible best practices for role organization:

```
roles/
└── role_name/
    ├── README.md                  # Role documentation
    ├── meta/
    │   └── main.yml              # Role dependencies and metadata
    ├── defaults/
    │   └── main.yml              # Default variables (lowest precedence)
    ├── vars/
    │   └── main.yml              # Role variables (higher precedence)
    ├── tasks/
    │   ├── main.yml              # Main task entry point
    │   ├── install.yml           # Installation tasks
    │   ├── configure.yml         # Configuration tasks
    │   ├── security.yml          # Security hardening tasks
    │   └── validate.yml          # Validation and health checks
    ├── handlers/
    │   └── main.yml              # Service handlers
    ├── templates/
    │   └── config.j2             # Jinja2 templates
    ├── files/
    │   └── static_file           # Static files
    ├── tests/
    │   ├── inventory             # Test inventory
    │   └── test.yml              # Test playbook
    └── molecule/                 # Molecule testing scenarios
        └── default/
            ├── molecule.yml
            ├── converge.yml
            └── verify.yml
```

### Role Development Guidelines

#### Code Quality
- Use task tags extensively for selective execution:
  - `install`, `configure`, `security`, `validate`, `update`
- Keep code modular with clear separation of concerns
- Use meaningful variable names with prefixes (`rolename_variable`)
- Write inline comments for complex logic
- Follow YAML best practices (2-space indentation, explicit boolean values)
- Use `ansible-lint` for code quality checks
- Implement idempotency - tasks should be safely re-runnable

#### Variable Management
- Use role defaults for sensible default values
- Document all variables in README.md with types and examples
- Use group_vars and host_vars for environment-specific overrides
- Leverage variable precedence understanding
- Use `{{ ansible_os_family }}` for OS-specific logic
- Implement input validation using `assert` module

#### Task Organization
```yaml
# Example task structure with security focus
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family }}.yml"
  tags: [always]

- name: Validate input parameters
  ansible.builtin.assert:
    that:
      - variable_name is defined
      - variable_name | length > 0
    fail_msg: "Required variable 'variable_name' is not defined or is empty"
  tags: [validate]

# import_tasks is static, so each tag below is inherited by every imported task.
# With include_tasks, a tag applies only to the include statement itself.
- name: Import installation tasks
  ansible.builtin.import_tasks: install.yml
  tags: [install]

- name: Import configuration tasks
  ansible.builtin.import_tasks: configure.yml
  tags: [configure]

- name: Import security hardening tasks
  ansible.builtin.import_tasks: security.yml
  tags: [security]

- name: Import validation tasks
  ansible.builtin.import_tasks: validate.yml
  tags: [validate]
```

#### System Information Gathering

All roles **MUST** gather and report key system metrics:

```yaml
# System health check tasks (include in validate.yml)
# Read-only commands: changed_when is false so they never report a change.
# shell is used only where a pipe is required; command is used otherwise.
- name: Gather disk usage statistics
  ansible.builtin.shell: df -h | grep -vE '^Filesystem|tmpfs|cdrom'
  register: disk_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather memory usage statistics
  ansible.builtin.command: free -h
  register: memory_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather swap usage statistics
  ansible.builtin.command: swapon --show
  register: swap_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather system uptime
  ansible.builtin.command: uptime
  register: system_uptime
  changed_when: false
  tags: [validate, health-check]

- name: Gather logged-in users
  ansible.builtin.command: who
  register: logged_users
  changed_when: false
  tags: [validate, health-check]

- name: Check high CPU processes
  ansible.builtin.shell: ps aux --sort=-%cpu | head -10
  register: top_cpu_processes
  changed_when: false
  tags: [validate, health-check]

- name: Check high memory processes
  ansible.builtin.shell: ps aux --sort=-%mem | head -10
  register: top_mem_processes
  changed_when: false
  tags: [validate, health-check]

- name: Display system health summary
  ansible.builtin.debug:
    msg:
      - "=== System Health Check ==="
      - "Disk Usage: {{ disk_usage.stdout_lines }}"
      - "Memory: {{ memory_usage.stdout_lines }}"
      - "Swap: {{ swap_usage.stdout_lines }}"
      - "Uptime: {{ system_uptime.stdout }}"
      - "Logged Users: {{ logged_users.stdout_lines }}"
      - "Top CPU: {{ top_cpu_processes.stdout_lines }}"
      - "Top Memory: {{ top_mem_processes.stdout_lines }}"
  tags: [validate, health-check]
```

#### Security Considerations in Roles
- Never hardcode secrets or credentials
- Use `no_log: true` for sensitive task output
- Validate file permissions (use `mode` parameter)
- Implement proper error handling with `block`/`rescue`/`always`
- Use `become` judiciously with specific privilege escalation
- Verify checksums for downloaded files
- Use HTTPS for all external downloads
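
Several of the points above can be combined in one block, sketched below. The download URL, the `tool_sha256` and `vault_tool_api_token` variables, and the `/etc/tool` directory (assumed to exist already) are all hypothetical.

```yaml
- name: Install a downloaded artifact with cleanup on failure
  block:
    - name: Download the artifact over HTTPS and verify its checksum
      ansible.builtin.get_url:
        url: "https://downloads.example.com/tool-1.0.tar.gz"   # hypothetical URL
        dest: /tmp/tool-1.0.tar.gz
        checksum: "sha256:{{ tool_sha256 }}"                    # hypothetical variable
        mode: "0640"

    - name: Write the API token to a root-only file
      ansible.builtin.copy:
        content: "{{ vault_tool_api_token }}"                   # hypothetical vaulted variable
        dest: /etc/tool/token                                   # assumes /etc/tool exists
        owner: root
        group: root
        mode: "0600"
      no_log: true          # keep the secret out of task output and logs
  rescue:
    - name: Remove the partial download
      ansible.builtin.file:
        path: /tmp/tool-1.0.tar.gz
        state: absent

    - name: Fail with a clear message
      ansible.builtin.fail:
        msg: "Artifact installation failed; the partial download was removed."
  always:
    - name: Record that the install step ran
      ansible.builtin.debug:
        msg: "Install step finished on {{ inventory_hostname }}"
```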

#### Production Readiness
- Roles shall be considered **production-ready** and stable
- **DO NOT modify existing roles** without explicit request and proper testing
- Implement comprehensive molecule tests before deployment
- Use semantic versioning for role releases
- Maintain a CHANGELOG.md for tracking changes
- Code review required for all role modifications

### Testing Strategy

#### Test Pyramid
1. **Syntax Validation**: `ansible-playbook --syntax-check`
2. **Linting**: `ansible-lint` with organizational rules
3. **Unit Testing**: Molecule with Docker/Vagrant
4. **Integration Testing**: Test Kitchen or custom test playbooks
5. **Security Testing**: `ansible-audit`, OpenSCAP profiles
6. **Performance Testing**: Ansible profiling callbacks

#### Molecule Configuration Example
```yaml
# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: debian-11
    image: debian:11
    pre_build_image: true
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
provisioner:
  name: ansible
  config_options:
    defaults:
      callbacks_enabled: profile_tasks
verifier:
  name: ansible
```
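
A matching `verify.yml` for the Ansible verifier could look like the sketch below. The file path and expected mode are hypothetical stand-ins for whatever the role under test actually manages.

```yaml
# molecule/default/verify.yml
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Stat the rendered configuration file
      ansible.builtin.stat:
        path: /etc/example/example.conf   # hypothetical file managed by the role
      register: example_conf

    - name: Assert the file exists with restrictive permissions
      ansible.builtin.assert:
        that:
          - example_conf.stat.exists
          - example_conf.stat.mode == '0640'
        fail_msg: "example.conf is missing or has unexpected permissions"
```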

---

## Documentation Standards

### Required Documentation

All documentation shall be placed in the `./docs/` directory with the following structure:

```
docs/
├── architecture/
│   ├── overview.md
│   ├── network-topology.md
│   └── security-model.md
├── runbooks/
│   ├── deployment.md
│   ├── disaster-recovery.md
│   └── incident-response.md
├── roles/
│   ├── role-index.md
│   └── [role-specific-docs].md
├── inventory.md              # Dynamic inventory configuration
├── variables.md              # Variable documentation
├── security-compliance.md    # Security controls and compliance mapping
└── troubleshooting.md
```

### Role Documentation (README.md)

Each role must include comprehensive documentation:

```markdown
# Role Name

Brief description of role purpose and functionality.

## Requirements

- Ansible version
- OS compatibility
- Dependencies
- Required privileges

## Role Variables

| Variable | Default | Description | Required |
|----------|---------|-------------|----------|
| var_name | value   | Description | Yes/No   |

## Dependencies

List of dependent roles.

## Example Playbook

\```yaml
- hosts: servers
  roles:
    - role: role_name
      var_name: value
\```

## Security Considerations

- Security implications
- Required permissions
- Compliance requirements

## License

Organization license information

## Author

Role maintainer contact information
```

### Roles, Plays, Playbooks, Cheatsheets, and Documentation

Each role will have its own `ROADMAP.md` and `CHANGELOG.md` files, located in `./roles/{role name}/{CHANGELOG,ROADMAP}.md`.

- `./playbooks` SHALL CONTAIN role-related plays.
- `./plays` SHALL BE USED for *temporary, non-lasting* plays.

Cheatsheets are stored in `./cheatsheets/{role,play,playbook}/`, and documentation is stored in `./docs/{role,play,playbook}/`.
- Each role MUST HAVE its own documentation and cheatsheet.
- Each playbook SHALL HAVE its own cheatsheet.

Cheatsheets should include:
- Quick start commands
- Common usage patterns
- Tag reference for selective execution
- Troubleshooting quick reference
- Security checkpoints

Example:
```markdown
# Role Name Cheatsheet

## Quick Execution
\```bash
# Full role execution
ansible-playbook site.yml -t role_name

# Install only
ansible-playbook site.yml -t role_name,install

# Security hardening only
ansible-playbook site.yml -t role_name,security
\```

## Common Variables
- `var_name`: Description (default: value)

## Validation
\```bash
ansible-playbook site.yml -t role_name,validate
\```

## Troubleshooting
- Issue: Solution
```

---

## Playbook Organization

### Directory Structure

```
.
├── ansible.cfg                 # Ansible configuration
├── site.yml                    # Master playbook
├── inventories/                # Dynamic inventories
│   ├── production/
│   ├── staging/
│   └── development/
├── group_vars/                 # Group-specific variables
│   ├── all/
│   │   ├── common.yml
│   │   └── vault.yml          # Encrypted secrets
│   ├── webservers.yml
│   └── databases.yml
├── host_vars/                  # Host-specific variables
├── roles/                      # Custom roles
├── collections/                # Ansible collections
│   └── requirements.yml
├── playbooks/                  # Specific playbooks
│   ├── deploy.yml
│   ├── security-audit.yml
│   └── maintenance.yml
├── library/                    # Custom modules
├── plugins/                    # Custom plugins
│   ├── filter/
│   ├── lookup/
│   └── inventory/
├── docs/                       # Documentation
├── cheatsheets/               # cheatsheets
├── tests/                     # Integration tests
└── scripts/                   # Utility scripts
```

### Playbook Best Practices
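- Use `import_playbook` to compose playbooks; playbook inclusion is always static, and Ansible has no `include_playbook` directive
- Use `include_tasks` or `include_role` inside a play when inclusion must be decided at runtime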
- Implement pre-flight checks with `assert` module
- Use `serial` for rolling updates
- Implement proper error handling with `any_errors_fatal`
- Use `check_mode` for dry-run capability
- Tag plays and tasks appropriately
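
A sketch of a play that combines several of these practices. The 25% batch size and the 1 GiB free-space threshold are illustrative assumptions, and the pre-flight check assumes `/` appears in the gathered mount facts.

```yaml
- name: Rolling update of web servers
  hosts: webservers
  become: true
  serial: "25%"              # update a quarter of the group at a time
  any_errors_fatal: true     # stop the play as soon as any host fails
  pre_tasks:
    - name: Pre-flight check for free space on the root filesystem
      ansible.builtin.assert:
        that:
          - >-
            (ansible_facts['mounts']
            | selectattr('mount', 'equalto', '/')
            | map(attribute='size_available')
            | first) > 1073741824
        fail_msg: "Less than 1 GiB free on /"
      tags: [always]
  roles:
    - role: role_name
      tags: [role_name]
```

Running the same play with `ansible-playbook --check --diff` gives a dry run before the rolling update is applied.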

---

## Security and Compliance

### Secrets Management
- Use **Ansible Vault** for encrypting sensitive data
- Implement external secrets management (HashiCorp Vault, AWS Secrets Manager)
- Rotate vault passwords regularly (90 days)
- Use separate vault files per environment
- Never commit unencrypted secrets to version control
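
One common layout, sketched here under assumptions, keeps plain-text variable names in a reviewable file and points them at values stored in the encrypted `vault.yml`. All variable names and the environment variable `APP_API_TOKEN` are hypothetical.

```yaml
# group_vars/all/common.yml: plain text, safe to review
app_db_user: app
app_db_password: "{{ vault_app_db_password }}"   # value lives in the encrypted vault.yml

# Alternative source: read a secret from the controller's environment
app_api_token: "{{ lookup('ansible.builtin.env', 'APP_API_TOKEN') }}"

# group_vars/all/vault.yml is encrypted with `ansible-vault encrypt`.
# Before encryption it would contain a line such as:
# vault_app_db_password: "<secret value>"
```

The `vault_` prefix makes it possible to search for where a secret is used without decrypting anything.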

### Audit and Compliance
- Maintain audit logs of all automation runs
- Implement change tracking and approval workflows
- Regular security scans using Lynis, OpenSCAP
- Compliance mapping documentation (CIS, NIST, PCI-DSS, HIPAA)
- Automated compliance reporting

### Access Control
- Implement RBAC using AWX or Ansible Automation Platform (formerly Ansible Tower)
- Use separate service accounts per environment
- Implement 4-eyes principle for production changes
- Regular access reviews (quarterly)

---

## Performance Optimization

### Execution Optimization
- Enable fact caching (Redis, JSON file)
- Use `gather_facts: false` when facts are not needed
- Implement parallelism with the `forks` parameter
- Use `strategy: free` when hosts do not need to stay in lockstep, so each host runs through the play at its own pace
- Leverage `async` and `poll` for long-running tasks
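
As a small sketch of the play-level options above, the following play skips fact gathering and lets hosts proceed independently. The `workers` group and the `worker` service name are hypothetical.

```yaml
- name: Restart independent worker nodes
  hosts: workers              # hypothetical group
  gather_facts: false         # no facts are used in this play
  strategy: free              # each host proceeds without waiting for the others
  tasks:
    - name: Restart the worker service
      ansible.builtin.service:
        name: worker          # hypothetical service name
        state: restarted
      become: true
```

The number of hosts handled in parallel is still capped by `forks`, which can be raised on the command line with `-f`.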

### Infrastructure Optimization
- Use jump hosts/bastion hosts for network efficiency
- Implement ControlMaster for SSH connection reuse
- Use pipelining to reduce SSH operations
- Optimize Python interpreter settings

---

## Version Control

### Git Workflow
- Use feature branches for development
- Implement pull request review process
- Tag releases with semantic versioning
- Maintain CHANGELOG.md
- Use pre-commit hooks for validation
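
The pre-commit validation mentioned above could be configured roughly as follows. The `rev` values are example revisions only and should be pinned to releases the team has reviewed.

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.35.1          # example revision; pin to a reviewed release
    hooks:
      - id: yamllint
  - repo: https://github.com/ansible/ansible-lint
    rev: v24.2.0          # example revision; pin to a reviewed release
    hooks:
      - id: ansible-lint
```

After `pre-commit install`, both hooks run on every commit; `pre-commit run --all-files` checks the whole repository once.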

### Branch Strategy
- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: Feature development
- `hotfix/*`: Emergency fixes

---

**Document Version**: 2.0
**Last Updated**: 2025-11-10
**Review Cycle**: Quarterly