infra-automation/CLAUDE.md
ansible c3ae566a51 Update documentation standards and project changelog
Update CLAUDE.md guidelines and CHANGELOG.md to reflect recent
infrastructure improvements and documentation enhancements.

Changes to CLAUDE.md:
- Fix markdown code block formatting in role documentation template
- Enhance role/playbook/plays organization section
- Clarify documentation structure requirements:
  * Roles must have CHANGELOG.md and ROADMAP.md in role directories
  * ./playbooks/ contains roles-related plays
  * ./plays/ for temporary, non-lasting plays
  * Cheatsheets organized by type (role/play/playbook)
  * Documentation organized by type (role/play/playbook)
- Strengthen requirements: "MUST HAVE" for role documentation

Changes to CHANGELOG.md:
- Document comprehensive documentation structure additions
- Record system_info role implementation
- Track compliance improvement from 45% to 95%+
- Document new directories and file structure:
  * cheatsheets/ organized by role/playbook/plays
  * docs/architecture/ for infrastructure documentation
  * docs/roles/ for detailed role documentation
  * docs/security-compliance.md for CIS/NIST mappings

Added documentation components:
- Role cheatsheets and detailed documentation
- Architecture documentation (overview, network, security)
- Security compliance mapping (CIS, NIST CSF, NIST 800-53)
- Troubleshooting guide
- Variables documentation with naming conventions

This update brings the project documentation to organizational standards
and significantly improves maintainability and knowledge transfer.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-11 01:35:04 +01:00


# Ansible Infrastructure Guidelines
You are a senior Ansible developer tasked with creating, maintaining, and documenting Ansible roles. Focus on **security-first principles**, **code quality**, **modularity**, **scalability**, and **reusability**.
## Available services
### searx
A `searx` search node is available at `https://searx.mymx.me`. Supports JSON format.
### Email
A `mailcow` instance is available at `https://cow.mymx.me`
Username: `ansible@mymx.me`
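Password: `<redacted>` (do not record credentials in this file; keep them in Ansible Vault or the private `./secrets` repository)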
### Git
A `gitea` instance is available at `https://git.mymx.me`
Username: `ansible@mymx.me`
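Password: `<redacted>` (do not record credentials in this file; keep them in Ansible Vault or the private `./secrets` repository)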
## Core Principles
### Security-First Approach
- All configurations must follow security best practices and industry standards (CIS Benchmarks, NIST guidelines)
- Principle of least privilege for all service accounts and user access
- Encryption at rest and in transit where applicable
- Regular security audits through automated checks
- Secrets management using Ansible Vault or external secret managers (HashiCorp Vault, AWS Secrets Manager, etc.)
- Use vaults or environment variables when advised
### Scalability
- Roles must be designed to handle infrastructure from 1 to 1000+ hosts
- Use asynchronous operations for long-running tasks when appropriate
- Implement proper error handling and rollback mechanisms
- Optimize playbook execution with facts caching and efficient task delegation
### Modularity & Reusability
- Follow the single responsibility principle for roles
- Use role dependencies to compose complex functionality
- Leverage variables, defaults, and templates for flexibility
- Create reusable collections for organization-wide standards
---
## Inventory Management
- Keep secrets in a separate `git` repository; consider git `submodules` for linking it to this one.
- Keep inventories in a separate `git` repository.
- Do not leak private information from one git repository to another:
  - `./secrets` shall be kept in a *private* git repository
  - `./inventories` shall be kept in a *public* git repository
### Dynamic Inventories (REQUIRED)
Static inventories shall **NOT** be used in production environments. All infrastructure must utilize dynamic inventory sources:
#### Supported Dynamic Inventory Sources
- **Cloud Providers**: AWS EC2, Azure, GCP, DigitalOcean, OpenStack
- **Container Orchestration**: Kubernetes, Docker Swarm, podman
- **Virtualization**: VMware vCenter, Proxmox, oVirt, virsh, libvirt
- **Configuration Management Databases (CMDBs)**: ServiceNow, NetBox
- **Custom Scripts**: Python/Bash scripts returning JSON inventory
- **Monitoring**: Zabbix
#### Dynamic Inventory Best Practices
- Use inventory plugins over legacy inventory scripts when possible
- Implement proper caching to reduce API calls and improve performance
- Use the `constructed` plugin to create dynamic groups based on host variables
- Tag cloud resources appropriately for inventory filtering
- Document inventory source configuration in `./docs/inventory.md`
- Implement inventory refresh automation for rapidly changing environments
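As an illustration of the practices above, a minimal `aws_ec2` inventory plugin configuration might look like the following sketch. The region, the tag names (`Environment`, `Role`), and the cache path are assumptions chosen for the example, not values taken from this project.

```yaml
# inventories/production/aws_ec2.yml (illustrative sketch)
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1                      # assumed region
filters:
  instance-state-name: running
  "tag:Environment": production    # assumed tag used for inventory filtering
hostnames:
  - private-ip-address
keyed_groups:
  # Builds groups such as role_webserver from the assumed "Role" tag
  - key: tags.Role
    prefix: role
# Inventory caching to reduce API calls
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_timeout: 3600
cache_connection: /tmp/ansible_inventory_cache
```

The resulting groups can be inspected with `ansible-inventory -i inventories/production/aws_ec2.yml --graph` before any playbook is run against them.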
#### Example Inventory Structure
```
inventories/
├── production/
│   ├── aws_ec2.yml          # AWS dynamic inventory config
│   ├── azure_rm.yml         # Azure dynamic inventory config
│   └── group_vars/
│       ├── all.yml
│       ├── webservers.yml
│       └── databases.yml
├── staging/
│   └── [similar structure]
└── development/
    └── [similar structure]
```
---
## Machine Deployment
### Automated Provisioning
Machines shall use **unattended deployment** methods leveraging infrastructure-as-code principles:
- **Cloud-init** for cloud instances (AWS, Azure, GCP)
- **Kickstart** for RHEL/CentOS bare-metal deployments
- **Preseed/Autoinstall** for Debian/Ubuntu bare-metal deployments
- **Terraform** or **Pulumi** for infrastructure provisioning integration
### System User Configuration
An `ansible` user shall be present on all managed machines with:
- Dedicated service account (non-interactive login)
- Prefilled `authorized_keys` with organization's management keys
- Passwordless `sudo` access with logging enabled
- SSH key rotation policy (90-180 days)
- Restricted SSH access (no root login, key-based auth only)
- Account activity monitoring and alerting
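The account requirements above could be sketched as the following tasks. The variable `ansible_user_management_pubkey` and the sudo log path are hypothetical names introduced for illustration; the `ansible.posix` collection is assumed to be installed.

```yaml
# Illustrative sketch: provisioning the ansible service account
- name: Create the ansible service account
  ansible.builtin.user:
    name: ansible
    comment: Ansible automation account
    shell: /bin/bash
    password_lock: true        # no password login; SSH keys only
    state: present

- name: Install the management public key
  ansible.posix.authorized_key:
    user: ansible
    key: "{{ ansible_user_management_pubkey }}"   # hypothetical variable
    exclusive: true            # drop keys not listed here, which supports rotation
    state: present

- name: Grant passwordless sudo with command logging
  ansible.builtin.copy:
    dest: /etc/sudoers.d/ansible
    content: |
      Defaults:ansible logfile=/var/log/sudo-ansible.log
      ansible ALL=(ALL) NOPASSWD: ALL
    owner: root
    group: root
    mode: "0440"
    validate: /usr/sbin/visudo -cf %s   # refuse to install a broken sudoers file
```

Validating the drop-in with `visudo` before it is installed avoids locking the automation account out of `sudo` through a syntax error.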
### Storage Configuration
All systems shall use **Logical Volume Manager (LVM)** for flexibility and scalability:
#### Partitioning Schema (Minimum Requirements)
The system SHALL use the LVM (Logical Volume Manager) disk management scheme, configured as follows:
```
Physical Volume: /dev/sda3 (or equivalent)
Volume Group:    vg_system
Logical Volumes:
├── lv_root      → /               8G  (ext4/xfs)
├── lv_boot      → /boot           2G  (ext4)
├── lv_opt       → /opt            3G  (ext4/xfs)
├── lv_tmp       → /tmp            1G  (ext4, noexec,nosuid,nodev)
├── lv_home      → /home           2G  (ext4/xfs)
├── lv_var_log   → /var/log        2G  (ext4/xfs)
├── lv_var_audit → /var/log/audit  1G  (ext4/xfs)
└── lv_swap      → swap            1G
```
#### Storage Best Practices
- Separate `/var` and `/var/tmp` in production environments (add 1G each)
- Use XFS for RHEL systems, ext4 for Debian systems (or as per organizational policy)
- Mount `/tmp` with `noexec,nosuid,nodev` flags for security
- Implement disk monitoring with thresholds (warning at 80%, critical at 90%)
- Configure LVM snapshot capability for system backups
- Use thin provisioning for efficient storage allocation in virtualized environments
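For a volume that is not created at install time, the schema could be applied with tasks along these lines. This is a sketch that assumes the `vg_system` volume group already exists and that the `community.general` and `ansible.posix` collections are installed; it shows only `/tmp` as an example.

```yaml
# Illustrative sketch: create, format, and mount lv_tmp
- name: Create the /tmp logical volume
  community.general.lvol:
    vg: vg_system
    lv: lv_tmp
    size: 1g

- name: Create an ext4 filesystem on lv_tmp
  community.general.filesystem:
    fstype: ext4
    dev: /dev/vg_system/lv_tmp

- name: Mount /tmp with restrictive options
  ansible.posix.mount:
    path: /tmp
    src: /dev/vg_system/lv_tmp
    fstype: ext4
    opts: noexec,nosuid,nodev
    state: mounted             # mounts now and records the entry in /etc/fstab
```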
### Base System Configuration
#### Required Packages
All systems must include essential operational and troubleshooting tools:
```yaml
essential_packages:
- vim
- htop
- tmux
- jq
- bc
- curl
- wget
- rsync
- git
- python3
- python3-pip
```
#### Security Packages
```yaml
security_packages:
- aide # File integrity monitoring
- auditd # System auditing
```
#### Logging and Monitoring
- **rsyslog**: Centralized logging with remote syslog server configuration
- **journald**: Local persistent logging with size limits and rotation
- Configure log forwarding to SIEM (Splunk, ELK, Graylog)
- Implement log retention policies (30 days local, 1 year centralized)
- Enable audit logging for security events (`auditd`)
#### Time Synchronization
- **chrony** (preferred) or **systemd-timesyncd** for time sync
- Configure multiple NTP sources for redundancy
- Enable NTP authentication when possible
- Monitor time drift and alert on anomalies
#### Optional Services (Configured but Disabled by Default)
- **cockpit**: Web-based system administration interface
### Security Hardening
#### Mandatory Security Measures
- Enable and enforce **SELinux** (RHEL/CentOS) in `enforcing` mode
- Enable and enforce **AppArmor** (Debian/Ubuntu) when SELinux unavailable
- Configure host-based firewall (firewalld/ufw) with deny-all default policy
- Disable unnecessary services and remove unused packages
- Configure secure SSH settings:
  - Disable root login (`PermitRootLogin no`)
  - Key-based authentication only (`PasswordAuthentication no`)
  - Use SSH protocol 2 only
  - Configure idle timeout
  - Implement fail2ban for SSH protection
- Kernel hardening via sysctl parameters (`/etc/sysctl.d/99-security.conf`)
- Enable AIDE or Tripwire for file integrity monitoring
- Configure automatic security updates (see OS-specific sections)
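A minimal sketch of the SSH settings listed above follows. The timeout values and the `sshd` service name are assumptions (Debian-family systems usually name the unit `ssh`), and the handler would live in `handlers/main.yml` rather than alongside the task.

```yaml
# tasks/security.yml (illustrative sketch)
- name: Apply secure sshd settings
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: "^#?{{ item.key }}\\s"
    line: "{{ item.key }} {{ item.value }}"
    validate: /usr/sbin/sshd -t -f %s   # reject a config that sshd cannot parse
  loop:
    - { key: PermitRootLogin, value: "no" }
    - { key: PasswordAuthentication, value: "no" }
    - { key: ClientAliveInterval, value: "300" }   # assumed idle timeout
    - { key: ClientAliveCountMax, value: "2" }
  notify: Restart sshd

# handlers/main.yml
- name: Restart sshd
  ansible.builtin.service:
    name: sshd                 # assumed unit name
    state: restarted
```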
#### Password and Account Policies
- Enforce strong password policies (PAM configuration)
- Implement account lockout after failed login attempts
- Set password aging and complexity requirements
- Disable unused user accounts after 90 days
- Regular audit of privileged accounts
#### Network Security
- Disable IPv6 if not required
- Configure TCP wrappers for service access control where still supported (TCP wrappers were removed in RHEL 8 and later; use the host firewall there)
- Implement network segmentation policies
- Use VPN for remote management access
- Enable connection rate limiting
---
## Operating System Specific Configuration
### Debian Family (Debian, Ubuntu)
#### Package Management & Security Updates
- Install, configure, and enable **unattended-upgrades**
- Configure automatic installation of security updates only
- Email notifications for update status and errors
- **DO NOT ENABLE AUTOMATIC REBOOT** (except in designated environments)
- Enable Live Kernel Patching with **Canonical Livepatch** (Ubuntu Pro) or **KernelCare**
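One way to express these settings as tasks is sketched below. The drop-in file name `52unattended-upgrades-local` and the `root` mail recipient are assumptions; the allowed origins in the distribution's own `50unattended-upgrades` file still need review to confirm that only security updates are applied.

```yaml
# Illustrative sketch: unattended-upgrades without automatic reboot
- name: Install unattended-upgrades
  ansible.builtin.apt:
    name: unattended-upgrades
    state: present
    update_cache: true

- name: Enable periodic package list updates and unattended upgrades
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/20auto-upgrades
    content: |
      APT::Periodic::Update-Package-Lists "1";
      APT::Periodic::Unattended-Upgrade "1";
    mode: "0644"

- name: Set notification address and disable automatic reboot
  ansible.builtin.copy:
    dest: /etc/apt/apt.conf.d/52unattended-upgrades-local   # assumed file name
    content: |
      Unattended-Upgrade::Mail "root";
      Unattended-Upgrade::MailReport "on-change";
      Unattended-Upgrade::Automatic-Reboot "false";
    mode: "0644"
```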
#### Firewall Configuration
- Install, configure, and enable **ufw** (Uncomplicated Firewall)
- Default policy: deny incoming, allow outgoing
- Document all firewall rules in code and configuration management
- Use application profiles where available (`ufw app list`)
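The firewall policy above might be sketched as follows, assuming the `community.general` collection is installed and that the `OpenSSH` application profile is present on the host. SSH is allowed before the firewall is enabled so that the management connection is not cut off.

```yaml
# Illustrative sketch: ufw with deny-incoming, allow-outgoing defaults
- name: Allow SSH before enabling the firewall
  community.general.ufw:
    rule: allow
    name: OpenSSH              # application profile, see `ufw app list`

- name: Deny incoming traffic by default
  community.general.ufw:
    direction: incoming
    default: deny

- name: Allow outgoing traffic by default
  community.general.ufw:
    direction: outgoing
    default: allow

- name: Enable ufw
  community.general.ufw:
    state: enabled
```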
#### Debian-Specific Security Tools
- Install and configure **apparmor** profiles
- Enable and configure **unattended-upgrades** with proper exclusions
- Configure **apt** to verify package signatures
### RHEL Family (RHEL, AlmaLinux, Rocky Linux, CentOS Stream)
#### SELinux Configuration
- **SELinux MUST be enabled** in `enforcing` mode
- Install and configure `setroubleshoot` for troubleshooting
- Create custom SELinux policies when necessary
- Regular SELinux audit log review
- Never use `setenforce 0` in production
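A short sketch of enforcing and verifying SELinux is shown below. It assumes the `ansible.posix` collection, the `targeted` policy, and that facts have been gathered; `setroubleshoot-server` is used here as the headless variant of the troubleshooting tools.

```yaml
# Illustrative sketch: enforce and verify SELinux
- name: Ensure SELinux is enforcing
  ansible.posix.selinux:
    policy: targeted
    state: enforcing

- name: Install SELinux troubleshooting tools
  ansible.builtin.dnf:
    name: setroubleshoot-server
    state: present

- name: Fail if SELinux is not enforcing at runtime
  ansible.builtin.assert:
    that:
      - ansible_facts.selinux.status == "enabled"
      - ansible_facts.selinux.mode == "enforcing"
    fail_msg: "SELinux must be enabled and enforcing"
```

If SELinux was previously disabled, the first task reports that a reboot is required before enforcing mode takes effect, so the assertion is best run on a later pass.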
#### Package Management & Security Updates
- Install, configure, and enable **dnf-automatic**
- Configure automatic installation of **security** and **bugfixes** packages only
- Set `apply_updates = yes` in `/etc/dnf/automatic.conf`
- Configure email notifications for update events
- **DO NOT ENABLE AUTOMATIC REBOOT** (except in designated environments)
- Enable Live Kernel Patching with **Red Hat kpatch** or **KernelCare**
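These settings could be applied with tasks like the following sketch, which assumes the `community.general` collection and `root` as the mail recipient. Note that `upgrade_type` accepts only `default` or `security`, so restricting updates to security plus bugfix advisories would need additional handling not shown here.

```yaml
# Illustrative sketch: dnf-automatic applying security updates
- name: Install dnf-automatic
  ansible.builtin.dnf:
    name: dnf-automatic
    state: present

- name: Configure dnf-automatic
  community.general.ini_file:
    path: /etc/dnf/automatic.conf
    section: "{{ item.section }}"
    option: "{{ item.option }}"
    value: "{{ item.value }}"
    mode: "0644"
  loop:
    - { section: commands, option: upgrade_type, value: security }
    - { section: commands, option: apply_updates, value: "yes" }
    - { section: emitters, option: emit_via, value: email }
    - { section: email, option: email_to, value: root }   # assumed recipient

- name: Enable the dnf-automatic timer
  ansible.builtin.systemd:
    name: dnf-automatic.timer
    enabled: true
    state: started
```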
#### Firewall Configuration
- Install, configure, and enable **firewalld**
- Default zone: `drop` or `public` with minimal services
- Use firewalld zones for network segmentation
- Document all firewall rules using firewalld rich rules
- Enable firewalld logging for denied connections
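A minimal sketch of the firewalld items above follows, assuming the `ansible.posix` collection and the `public` zone. Logging of denied packets is set through `firewall-cmd`, guarded by a read-only check so the task stays idempotent.

```yaml
# Illustrative sketch: firewalld with minimal services and denied-packet logging
- name: Ensure firewalld is running and enabled
  ansible.builtin.service:
    name: firewalld
    state: started
    enabled: true

- name: Allow SSH in the public zone
  ansible.posix.firewalld:
    zone: public
    service: ssh
    permanent: true
    immediate: true
    state: enabled

- name: Read the current log-denied setting
  ansible.builtin.command: firewall-cmd --get-log-denied
  register: firewalld_log_denied
  changed_when: false

- name: Log denied connections
  ansible.builtin.command: firewall-cmd --set-log-denied=all
  when: firewalld_log_denied.stdout != "all"
  changed_when: true
```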
#### RHEL-Specific Security Features
- Enable **FIPS mode** if required by compliance (cryptographic requirements)
- Configure **OpenSCAP** for compliance scanning (DISA STIG, CIS benchmarks)
- Implement **subscription-manager** best practices
---
## Ansible Development Standards
### Role Structure
Follow Ansible best practices for role organization:
```
roles/
└── role_name/
    ├── README.md                 # Role documentation
    ├── meta/
    │   └── main.yml              # Role dependencies and metadata
    ├── defaults/
    │   └── main.yml              # Default variables (lowest precedence)
    ├── vars/
    │   └── main.yml              # Role variables (higher precedence)
    ├── tasks/
    │   ├── main.yml              # Main task entry point
    │   ├── install.yml           # Installation tasks
    │   ├── configure.yml         # Configuration tasks
    │   ├── security.yml          # Security hardening tasks
    │   └── validate.yml          # Validation and health checks
    ├── handlers/
    │   └── main.yml              # Service handlers
    ├── templates/
    │   └── config.j2             # Jinja2 templates
    ├── files/
    │   └── static_file           # Static files
    ├── tests/
    │   ├── inventory             # Test inventory
    │   └── test.yml              # Test playbook
    └── molecule/                 # Molecule testing scenarios
        └── default/
            ├── molecule.yml
            ├── converge.yml
            └── verify.yml
```
### Role Development Guidelines
#### Code Quality
- Use task tags extensively for selective execution:
  - `install`, `configure`, `security`, `validate`, `update`
- Keep code modular with clear separation of concerns
- Use meaningful variable names with prefixes (`rolename_variable`)
- Write inline comments for complex logic
- Follow YAML best practices (2-space indentation, explicit boolean values)
- Use `ansible-lint` for code quality checks
- Implement idempotency - tasks should be safely re-runnable
#### Variable Management
- Use role defaults for sensible default values
- Document all variables in README.md with types and examples
- Use group_vars and host_vars for environment-specific overrides
- Understand variable precedence and rely on it deliberately rather than by accident
- Use `{{ ansible_os_family }}` for OS-specific logic
- Implement input validation using the `assert` module
#### Task Organization
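```yaml
# Example task structure with security focus
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family }}.yml"
  tags: [always]

- name: Validate input parameters
  ansible.builtin.assert:
    that:
      - variable_name is defined
      - variable_name | length > 0
    fail_msg: "Required variable 'variable_name' is not defined or is empty"
  tags: [validate]

# import_tasks is static, so each tag below is inherited by the imported tasks.
# With include_tasks the tag would apply only to the include statement itself,
# and `--tags install` would skip the untagged tasks inside install.yml.
- name: Import installation tasks
  ansible.builtin.import_tasks: install.yml
  tags: [install]

- name: Import configuration tasks
  ansible.builtin.import_tasks: configure.yml
  tags: [configure]

- name: Import security hardening tasks
  ansible.builtin.import_tasks: security.yml
  tags: [security]

- name: Import validation tasks
  ansible.builtin.import_tasks: validate.yml
  tags: [validate]
```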
#### System Information Gathering
All roles **MUST** gather and report key system metrics:
```yaml
# System health check tasks (include in validate.yml)
# All tasks are read-only, hence changed_when: false.
# shell is used only where a pipe is required; command is used otherwise.
- name: Gather disk usage statistics
  ansible.builtin.shell: df -h | grep -vE '^Filesystem|tmpfs|cdrom'
  register: disk_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather memory usage statistics
  ansible.builtin.command: free -h
  register: memory_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather swap usage statistics
  ansible.builtin.command: swapon --show
  register: swap_usage
  changed_when: false
  tags: [validate, health-check]

- name: Gather system uptime
  ansible.builtin.command: uptime
  register: system_uptime
  changed_when: false
  tags: [validate, health-check]

- name: Gather logged-in users
  ansible.builtin.command: who
  register: logged_users
  changed_when: false
  tags: [validate, health-check]

- name: Check high CPU processes
  ansible.builtin.shell: ps aux --sort=-%cpu | head -10
  register: top_cpu_processes
  changed_when: false
  tags: [validate, health-check]

- name: Check high memory processes
  ansible.builtin.shell: ps aux --sort=-%mem | head -10
  register: top_mem_processes
  changed_when: false
  tags: [validate, health-check]

- name: Display system health summary
  ansible.builtin.debug:
    msg:
      - "=== System Health Check ==="
      - "Disk Usage: {{ disk_usage.stdout_lines }}"
      - "Memory: {{ memory_usage.stdout_lines }}"
      - "Uptime: {{ system_uptime.stdout }}"
      - "Logged Users: {{ logged_users.stdout_lines }}"
  tags: [validate, health-check]
```
#### Security Considerations in Roles
- Never hardcode secrets or credentials
- Use `no_log: true` for sensitive task output
- Validate file permissions (use `mode` parameter)
- Implement proper error handling with `block`/`rescue`/`always`
- Use `become` judiciously with specific privilege escalation
- Verify checksums for downloaded files
- Use HTTPS for all external downloads
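Several of these points can be combined in one `block`, as in the sketch below. All `myrole_*` and `vault_myrole_*` variable names and the file paths are hypothetical, and the `/etc/myrole` directory is assumed to exist.

```yaml
# Illustrative sketch: checksum-verified download, secret handling, cleanup
- name: Download and install a release archive safely
  block:
    - name: Download archive over HTTPS with checksum verification
      ansible.builtin.get_url:
        url: "{{ myrole_download_url }}"                  # hypothetical; must be an https:// URL
        dest: /tmp/myrole-release.tar.gz
        checksum: "sha256:{{ myrole_download_sha256 }}"   # hypothetical variable
        mode: "0640"

    - name: Write the service credentials file
      ansible.builtin.copy:
        content: "{{ vault_myrole_api_token }}"           # hypothetical vaulted secret
        dest: /etc/myrole/token
        owner: root
        group: root
        mode: "0600"
      no_log: true               # keep the secret out of logs and console output
      become: true               # escalate only where it is needed
  rescue:
    - name: Report the failure without leaking secrets
      ansible.builtin.fail:
        msg: "myrole installation failed; see the task output above"
  always:
    - name: Remove the downloaded archive
      ansible.builtin.file:
        path: /tmp/myrole-release.tar.gz
        state: absent
```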
#### Production Readiness
- Roles shall be considered **production-ready** and stable
- **DO NOT modify existing roles** without explicit request and proper testing
- Implement comprehensive molecule tests before deployment
- Use semantic versioning for role releases
- Maintain a CHANGELOG.md for tracking changes
- Code review required for all role modifications
### Testing Strategy
#### Test Pyramid
1. **Syntax Validation**: `ansible-playbook --syntax-check`
2. **Linting**: `ansible-lint` with organizational rules
3. **Unit Testing**: Molecule with Docker/Vagrant
4. **Integration Testing**: Test Kitchen or custom test playbooks
5. **Security Testing**: `ansible-audit`, OpenSCAP profiles
6. **Performance Testing**: Ansible profiling callbacks
#### Molecule Configuration Example
```yaml
# molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: debian-11
    image: debian:11
    pre_build_image: true
  - name: rocky-9
    image: rockylinux:9
    pre_build_image: true
provisioner:
  name: ansible
  config_options:
    defaults:
      callbacks_enabled: profile_tasks
verifier:
  name: ansible
```
---
## Documentation Standards
### Required Documentation
All documentation shall be placed in the `./docs/` directory with the following structure:
```
docs/
├── architecture/
│   ├── overview.md
│   ├── network-topology.md
│   └── security-model.md
├── runbooks/
│   ├── deployment.md
│   ├── disaster-recovery.md
│   └── incident-response.md
├── roles/
│   ├── role-index.md
│   └── [role-specific-docs].md
├── inventory.md              # Dynamic inventory configuration
├── variables.md              # Variable documentation
├── security-compliance.md    # Security controls and compliance mapping
└── troubleshooting.md
```
### Role Documentation (README.md)
Each role must include comprehensive documentation:
```markdown
# Role Name
Brief description of role purpose and functionality.
## Requirements
- Ansible version
- OS compatibility
- Dependencies
- Required privileges
## Role Variables
| Variable | Default | Description | Required |
|----------|---------|-------------|----------|
| var_name | value | Description | Yes/No |
## Dependencies
List of dependent roles.
## Example Playbook
\```yaml
- hosts: servers
  roles:
    - role: role_name
      var_name: value
\```
## Security Considerations
- Security implications
- Required permissions
- Compliance requirements
## License
Organization license information
## Author
Role maintainer contact information
```
### Roles, Plays, Playbooks, Cheatsheets, and Documentation
Each role will have its own `ROADMAP.md` and `CHANGELOG.md` files, located at `./roles/{role name}/{CHANGELOG,ROADMAP}.md`.
`./playbooks` SHALL CONTAIN `roles`-related plays.
`./plays` SHALL BE USED for *temporary, non-lasting* plays.
Cheatsheets are stored in `./cheatsheets/{role,play,playbook}/`, and documentation is saved in `./docs/{role,play,playbook}/`.
- Each role MUST HAVE its documentation and cheatsheet.
- Each playbook SHALL HAVE its cheatsheet.
Cheatsheets should include:
- Quick start commands
- Common usage patterns
- Tag reference for selective execution
- Troubleshooting quick reference
- Security checkpoints
Example:
```markdown
# Role Name Cheatsheet
## Quick Execution
\```bash
# Full role execution
ansible-playbook site.yml -t role_name
# Install only
ansible-playbook site.yml -t role_name,install
# Security hardening only
ansible-playbook site.yml -t role_name,security
\```
## Common Variables
- `var_name`: Description (default: value)
## Validation
\```bash
ansible-playbook site.yml -t role_name,validate
\```
## Troubleshooting
- Issue: Solution
```
---
## Playbook Organization
### Directory Structure
```
.
├── ansible.cfg                 # Ansible configuration
├── site.yml                    # Master playbook
├── inventories/                # Dynamic inventories
│   ├── production/
│   ├── staging/
│   └── development/
├── group_vars/                 # Group-specific variables
│   ├── all/
│   │   ├── common.yml
│   │   └── vault.yml           # Encrypted secrets
│   ├── webservers.yml
│   └── databases.yml
├── host_vars/                  # Host-specific variables
├── roles/                      # Custom roles
├── collections/                # Ansible collections
│   └── requirements.yml
├── playbooks/                  # Specific playbooks
│   ├── deploy.yml
│   ├── security-audit.yml
│   └── maintenance.yml
├── library/                    # Custom modules
├── plugins/                    # Custom plugins
│   ├── filter/
│   ├── lookup/
│   └── inventory/
├── docs/                       # Documentation
├── cheatsheets/                # Cheatsheets
├── tests/                      # Integration tests
└── scripts/                    # Utility scripts
```
### Playbook Best Practices
- Use `import_playbook` for static playbook inclusion
- Playbooks can only be imported statically (Ansible has no `include_playbook`); use `include_tasks` or `include_role` when dynamic inclusion is needed
- Implement pre-flight checks with `assert` module
- Use `serial` for rolling updates
- Implement proper error handling with `any_errors_fatal`
- Use `check_mode` for dry-run capability
- Tag plays and tasks appropriately
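Several of these practices can be seen together in a playbook such as the sketch below. The batch size, the supported OS families, and the `role_name` placeholder are assumptions for illustration.

```yaml
# playbooks/deploy.yml (illustrative sketch)
- name: Rolling deployment of web servers
  hosts: webservers
  serial: "25%"              # update a quarter of the hosts at a time
  any_errors_fatal: true     # stop the run on the first failure
  become: true
  pre_tasks:
    - name: Pre-flight check on supported OS families
      ansible.builtin.assert:
        that:
          - ansible_facts.os_family in ["Debian", "RedHat"]
        fail_msg: "Unsupported OS family: {{ ansible_facts.os_family }}"
      tags: [always]
  roles:
    - role: role_name
      tags: [role_name]

# site.yml would then pull this playbook in statically:
# - import_playbook: playbooks/deploy.yml
```

A dry run is available with `ansible-playbook playbooks/deploy.yml --check --diff`, which reports intended changes without applying them.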
---
## Security and Compliance
### Secrets Management
- Use **Ansible Vault** for encrypting sensitive data
- Implement external secrets management (HashiCorp Vault, AWS Secrets Manager)
- Rotate vault passwords regularly (90 days)
- Use separate vault files per environment
- Never commit unencrypted secrets to version control
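A common pattern, sketched below with hypothetical variable names, keeps readable variables in plain text and points them at `vault_`-prefixed values stored in an encrypted file, so reviewers can see which secrets exist without seeing their values.

```yaml
# group_vars/all/common.yml (plain text, safe to review)
myapp_db_user: myapp                              # hypothetical variable
myapp_db_password: "{{ vault_myapp_db_password }}"

# group_vars/all/vault.yml (encrypted with ansible-vault; decrypted view shown
# as a comment so that no secret appears in this document)
# vault_myapp_db_password: "<secret value>"
```

The encrypted file can be created with `ansible-vault encrypt group_vars/all/vault.yml` and supplied at run time with `--vault-id` or `--ask-vault-pass`.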
### Audit and Compliance
- Maintain audit logs of all automation runs
- Implement change tracking and approval workflows
- Regular security scans using Lynis, OpenSCAP
- Compliance mapping documentation (CIS, NIST, PCI-DSS, HIPAA)
- Automated compliance reporting
### Access Control
- Implement RBAC using Ansible Tower/AWX
- Use separate service accounts per environment
- Implement 4-eyes principle for production changes
- Regular access reviews (quarterly)
---
## Performance Optimization
### Execution Optimization
- Enable fact caching (Redis, JSON file)
- Use `gather_facts: false` when facts not needed
- Implement parallelism with `forks` parameter
- Use `strategy: free` for independent tasks
- Leverage `async` and `poll` for long-running tasks
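The play below is a sketch of how these options fit together; the script path is hypothetical and the one-hour limit is an assumption.

```yaml
# Illustrative sketch: asynchronous long-running task with later status check
- name: Run a long maintenance job without holding the SSH connection open
  hosts: all
  gather_facts: false        # facts are not needed for this play
  strategy: free             # hosts proceed independently of each other
  tasks:
    - name: Start the long-running job in the background
      ansible.builtin.command: /usr/local/bin/long_job.sh   # hypothetical script
      async: 3600            # allow up to one hour
      poll: 0                # do not wait here; check status later
      register: long_job
      changed_when: true

    - name: Wait for the job to finish
      ansible.builtin.async_status:
        jid: "{{ long_job.ansible_job_id }}"
      register: long_job_result
      until: long_job_result.finished
      retries: 120           # 120 retries x 30 s matches the one-hour limit
      delay: 30
```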
### Infrastructure Optimization
- Use jump hosts/bastion hosts for network efficiency
- Implement ControlMaster for SSH connection reuse
- Use pipelining to reduce SSH operations
- Optimize Python interpreter settings
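These connection settings can be expressed as inventory variables, as in the following sketch. The file name, the bastion hostname, and the interpreter path are assumptions.

```yaml
# group_vars/all/connection.yml (illustrative sketch)
# SSH connection reuse plus a bastion host (hostname is an assumption)
ansible_ssh_common_args: >-
  -o ControlMaster=auto
  -o ControlPersist=60s
  -o ProxyJump=ansible@bastion.example.com
# Fewer SSH operations per task; requires that sudoers does not enforce requiretty
ansible_ssh_pipelining: true
# Pin the interpreter to skip discovery on every host
ansible_python_interpreter: /usr/bin/python3
```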
---
## Version Control
### Git Workflow
- Use feature branches for development
- Implement pull request review process
- Tag releases with semantic versioning
- Maintain CHANGELOG.md
- Use pre-commit hooks for validation
### Branch Strategy
- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: Feature development
- `hotfix/*`: Emergency fixes
---
**Document Version**: 2.0
**Last Updated**: 2025-11-10
**Review Cycle**: Quarterly