# Infrastructure Architecture Overview

## Executive Summary

This document provides a comprehensive overview of the Ansible-based infrastructure automation architecture. The system is designed with security-first principles, leveraging Infrastructure as Code (IaC) best practices for automated provisioning, configuration management, and operational excellence.

**Architecture Version**: 1.0.0
**Last Updated**: 2025-11-11
**Document Owner**: Ansible Infrastructure Team

---

## Architecture Principles

### Security-First Design

All infrastructure components implement defense-in-depth security:

- **Least Privilege**: Service accounts with minimal required permissions
- **Encryption**: Data encrypted at rest and in transit
- **Hardening**: CIS Benchmark-compliant system configuration
- **Auditing**: Comprehensive logging and audit trails
- **Automation**: Security patches applied automatically

### Infrastructure as Code (IaC)

All infrastructure is defined, versioned, and managed as code:

- **Version Control**: Git-based change tracking
- **Declarative Configuration**: Ansible playbooks and roles
- **Idempotency**: Safe re-execution without side effects
- **Documentation**: Self-documenting through code

### Scalability & Modularity

Architecture scales from small to enterprise deployments:

- **Modular Roles**: Single-purpose, reusable components
- **Dynamic Inventories**: Auto-discovery of infrastructure
- **Parallel Execution**: Concurrent operations for speed
- **Horizontal Scaling**: Add capacity by adding hosts

---

## High-Level Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                         Management Layer                         │
│   ┌─────────────────┐          ┌──────────────────┐              │
│   │ Ansible Control │─────────▶│  Git Repository  │              │
│   │      Node       │          │     (Gitea)      │              │
│   │                 │          └──────────────────┘              │
│   │  - Playbooks    │          ┌──────────────────┐              │
│   │  - Inventories  │─────────▶│  Secret Manager  │              │
│   │  - Roles        │          │  (Ansible Vault) │              │
│   └────────┬────────┘          └──────────────────┘              │
└───────────┼──────────────────────────────────────────────────────┘
            │
            │ SSH (port 22)
            │ Encrypted, Key-based Auth
            │
┌───────────┼──────────────────────────────────────────────────────┐
│           │                  Compute Layer                       │
│           ▼                                                      │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │                      Hypervisor Hosts                       │ │
│  │   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │ │
│  │   │ KVM/Libvirt  │   │ KVM/Libvirt  │   │ KVM/Libvirt  │    │ │
│  │   │  Hypervisor  │   │  Hypervisor  │   │  Hypervisor  │    │ │
│  │   │  (grokbox)   │   │    (hv02)    │   │    (hv03)    │    │ │
│  │   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘    │ │
│  └──────────┼──────────────────┼──────────────────┼────────────┘ │
│             │                  │                  │              │
│             ▼                  ▼                  ▼              │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │                   Guest Virtual Machines                    │ │
│  │                                                             │ │
│  │   ┌──────────┐   ┌──────────┐   ┌──────────┐  ┌──────────┐  │ │
│  │   │   Web    │   │   App    │   │ Database │  │  Cache   │  │ │
│  │   │ Servers  │   │ Servers  │   │ Servers  │  │ Servers  │  │ │
│  │   └──────────┘   └──────────┘   └──────────┘  └──────────┘  │ │
│  │                                                             │ │
│  │   - SELinux/AppArmor Enforcing                              │ │
│  │   - Firewall (UFW/firewalld)                                │ │
│  │   - Automatic Security Updates                              │ │
│  │   - LVM Storage Management                                  │ │
│  └─────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
            │
            │ Logs, Metrics, Events
            ▼
┌──────────────────────────────────────────────────────────────────┐
│                        Observability Layer                       │
│   ┌────────────┐       ┌────────────┐       ┌────────────┐       │
│   │  Logging   │       │ Monitoring │       │   Audit    │       │
│   │  (Future)  │       │  (Future)  │       │    Logs    │       │
│   └────────────┘       └────────────┘       └────────────┘       │
└──────────────────────────────────────────────────────────────────┘
```

---

## Component Architecture

### Management Layer

#### Ansible Control Node

**Purpose**: Central orchestration and automation hub

**Components**:
- Ansible Core (2.12+)
- Python 3.x
- Custom roles and playbooks
- Dynamic inventory plugins
- Ansible Vault for secrets

**Responsibilities**:
- Execute playbooks and roles
- Manage inventory (dynamic and static)
- Secure secrets management
- Version control integration
- Audit log collection

**Security Controls**:
- SSH key-based authentication only
- No password-based access
- Encrypted secrets (Ansible Vault)
- Git-backed change tracking
- Limited user access with RBAC

#### Git Repository (Gitea)

**Purpose**: Version control for Infrastructure as Code

**Hosted**: https://git.mymx.me
**Authentication**: SSH keys, user accounts

**Content**:
- Ansible playbooks
- Role definitions
- Inventory configurations (public)
- Documentation
- Scripts and utilities

**Workflow**:
- Feature branch development
- Pull request reviews
- Main branch protection
- Semantic versioning tags

**Note**: Secrets stored in separate private repository

#### Secret Management

**Primary**: Ansible Vault (file-based encryption)
**Future**: HashiCorp Vault, AWS Secrets Manager integration

**Secrets Managed**:
- SSH private keys
- Service account credentials
- API tokens
- Encryption certificates
- Database passwords

**Location**: `./secrets` directory (private git submodule)

### Compute Layer

#### Hypervisor Hosts

**Platform**: KVM/libvirt on Linux (Debian 12, Ubuntu 22.04, AlmaLinux 9)

**Key Capabilities**:
- Hardware virtualization (Intel VT-x / AMD-V)
- Nested virtualization support
- Storage pools (LVM-backed)
- Virtual networking (bridges, NAT)
- Live migration (planned)

**Resource Allocation**:
- CPU overcommit ratio: 2:1 (2 vCPUs per physical core)
- Memory overcommit: Disabled for production
- Storage: Thin provisioning with LVM

**Management**:
- virsh CLI
- libvirt API
- Ansible automation
- No GUI (security requirement)

#### Guest Virtual Machines

**Provisioning**: Automated via `deploy_linux_vm` role

**Supported Distributions**:
- Debian 11, 12
- Ubuntu 20.04, 22.04, 24.04 LTS
- RHEL 8, 9
- AlmaLinux 8, 9
- Rocky Linux 8, 9
- openSUSE Leap 15.5, 15.6

**Standard Configuration**:
- Cloud-init provisioning
- LVM storage (CLAUDE.md compliant)
- SSH hardening (key-only, no root login)
- SELinux enforcing (RHEL) / AppArmor (Debian)
- Firewall enabled (UFW/firewalld)
- Automatic security updates
- Audit daemon (auditd)
- Time synchronization (chrony)

**Resource Tiers**:

| Tier | vCPUs | RAM | Disk | Use Case |
|------|-------|-----|------|----------|
| Small | 2 | 2 GB | 30 GB | Development, testing |
| Medium | 4 | 8 GB | 50 GB | Web servers, app servers |
| Large | 8 | 16 GB | 100 GB | Databases, data processing |
| XLarge | 16+ | 32+ GB | 200+ GB | High-performance applications |

### Observability Layer (Planned)

#### Logging

**Future Integration**: ELK Stack, Graylog, or Loki

**Log Sources**:
- System logs (rsyslog/journald)
- Application logs
- Audit logs (auditd)
- Security events
- Ansible execution logs

**Retention**: 30 days local, 1 year centralized

#### Monitoring

**Future Integration**: Prometheus + Grafana

**Metrics Collected**:
- CPU, memory, disk, network utilization
- Service availability
- Application performance
- Infrastructure health

**Alerting**: PagerDuty, Slack, Email

#### Audit & Compliance

**Current**:
- auditd on all systems
- Ansible execution logs
- Git change tracking

**Future**:
- Centralized audit log aggregation
- SIEM integration
- Compliance dashboards (CIS, NIST)

---

## Deployment Patterns

### Greenfield Deployment

**Scenario**: New infrastructure from scratch

```
1. Setup Ansible Control Node
   └─▶ Install Ansible
   └─▶ Clone git repository
   └─▶ Configure inventories
   └─▶ Setup secrets management

2. Provision Hypervisors
   └─▶ Install KVM/libvirt
   └─▶ Configure storage pools
   └─▶ Setup networking
   └─▶ Apply security hardening

3. Deploy Guest VMs
   └─▶ Use deploy_linux_vm role
   └─▶ Apply LVM configuration
   └─▶ Verify security posture

4. Configure Applications
   └─▶ Apply application roles
   └─▶ Configure services
   └─▶ Implement monitoring

5. Validate & Document
   └─▶ Run system_info role
   └─▶ Generate inventory
   └─▶ Update documentation
```

### Incremental Expansion

**Scenario**: Add capacity to existing infrastructure
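In practice, adding capacity mostly comes down to running the `deploy_linux_vm` role against a hypervisor with per-VM variables. A minimal sketch, assuming hypothetical variable names (`vm_name`, `vm_vcpus`, `vm_memory_mb`, `vm_disk_gb`) — check the role's defaults for its actual interface:

```yaml
# Sketch only: the variable names below are illustrative,
# not the role's confirmed interface.
- name: Deploy an additional guest VM (Medium tier)
  hosts: hypervisors
  become: true
  roles:
    - role: deploy_linux_vm
      vars:
        vm_name: web03
        vm_vcpus: 4
        vm_memory_mb: 8192
        vm_disk_gb: 50
```

Because the role is idempotent, re-running this playbook against an already-provisioned VM should be safe and report no changes.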

```
1. Add Hypervisor (if needed)
   └─▶ Physical installation
   └─▶ Ansible provisioning
   └─▶ Add to inventory

2. Deploy Additional VMs
   └─▶ Execute deploy_linux_vm role
   └─▶ Configure per requirements
   └─▶ Integrate with load balancer

3. Update Inventory
   └─▶ Refresh dynamic inventory
   └─▶ Update group assignments
   └─▶ Verify connectivity

4. Apply Configuration
   └─▶ Run relevant playbooks
   └─▶ Validate functionality
   └─▶ Monitor performance
```

### Disaster Recovery

**Scenario**: Rebuild after failure

```
1. Assess Damage
   └─▶ Identify affected systems
   └─▶ Check backup status
   └─▶ Plan recovery order

2. Restore Hypervisor (if needed)
   └─▶ Reinstall from bare metal
   └─▶ Apply Ansible configuration
   └─▶ Restore storage pools

3. Restore VMs
   └─▶ Restore from backups, OR
   └─▶ Redeploy with deploy_linux_vm
   └─▶ Restore application data

4. Verify & Resume
   └─▶ Run validation checks
   └─▶ Test application functionality
   └─▶ Resume normal operations
```

---

## Data Flow

### Provisioning Flow

```
Ansible Control
      │
      │ 1. Read inventory
      │    (dynamic or static)
      ▼
  Inventory
      │
      │ 2. Execute playbook
      │    with role(s)
      ▼
  Hypervisor
      │
      │ 3. Create VM
      │    - Download cloud image
      │    - Create disks
      │    - Generate cloud-init ISO
      │    - Define & start VM
      ▼
   Guest VM
      │
      │ 4. Cloud-init first boot
      │    - User creation
      │    - SSH key deployment
      │    - Package installation
      │    - Security hardening
      ▼
Guest VM (Running)
      │
      │ 5. Post-deployment
      │    - LVM configuration
      │    - Additional hardening
      │    - Service configuration
      ▼
Guest VM (Ready)
```

### Configuration Management Flow

```
Git Repository
      │
      │ 1. Developer commits changes
      │    (playbook, role, config)
      ▼
Pull Request
      │
      │ 2. Code review
      │    Approval required
      ▼
Main Branch
      │
      │ 3. Ansible control pulls changes
      │    (manual or automated)
      ▼
Ansible Control
      │
      │ 4. Execute playbook
      │    Target specific environment
      ▼
Target Hosts
      │
      │ 5. Apply configuration
      │    Idempotent execution
      ▼
Updated State
      │
      │ 6. Validation
      │    Verify desired state
      ▼
  Audit Log
```

### Information Gathering Flow

```
Ansible Control
      │
      │ 1. Execute gather_system_info.yml
      ▼
Target Hosts
      │
      │ 2. Collect data
      │    - CPU, GPU, Memory
      │    - Disk, Network
      │    - Hypervisor info
      ▼
system_info role
      │
      │ 3. Aggregate and format
      │    JSON structure
      ▼
Ansible Control
      │
      │ 4. Save to local filesystem
      │    ./stats/machines//
      ▼
JSON Files
      │
      │ 5. Query and analyze
      │    - jq queries
      │    - Report generation
      │    - CMDB sync
      ▼
Reports/Dashboards
```

---

## Environment Segregation

### Environment Structure

```
inventories/
├── production/
│   ├── hosts.yml (or dynamic plugin config)
│   └── group_vars/
│       ├── all.yml
│       └── webservers.yml
├── staging/
│   ├── hosts.yml
│   └── group_vars/
│       └── all.yml
└── development/
    ├── hosts.yml
    └── group_vars/
        └── all.yml
```

### Environment Isolation

| Environment | Purpose | Change Control | Automation | Data |
|-------------|---------|----------------|------------|------|
| **Production** | Live systems | Strict approval | Scheduled | Real |
| **Staging** | Pre-production testing | Approval required | On-demand | Sanitized |
| **Development** | Feature development | Minimal | On-demand | Synthetic |

### Promotion Pipeline

```
Development
      │
      │ 1. Develop & test features
      │    No approval required
      ▼
  Staging
      │
      │ 2. Integration testing
      │    Approval: Tech Lead
      ▼
Production
      │
      │ 3. Gradual rollout
      │    Approval: Operations Manager
      ▼
   Live
```

---

## Scaling Strategy

### Horizontal Scaling

**Add compute capacity**:
- Add hypervisor hosts
- Deploy additional VMs
- Update load balancer configuration
- Rebalance workloads

**Automation**:
- Dynamic inventory auto-discovers new hosts
- Ansible playbooks target groups, not individuals
- Configuration applied uniformly

### Vertical Scaling

**Increase VM resources**:
- Shut down the VM
- Modify vCPU/memory allocation (virsh)
- Resize disk volumes (LVM)
- Restart the VM
- Verify application performance

### Storage Scaling

**Expand LVM volumes**:

```bash
# Add new disk to hypervisor
# Attach to VM as /dev/vdc

# Extend the volume group with the new physical volume
pvcreate /dev/vdc
vgextend vg_system /dev/vdc

# Extend the logical volume, then grow the filesystem
lvextend -L +50G /dev/vg_system/lv_var
resize2fs /dev/vg_system/lv_var   # ext4
# or
xfs_growfs /var                   # xfs (grows the mounted filesystem)
```

---

## High Availability & Disaster Recovery

### Current State

**Single Points of Failure**:
- Ansible control node (manual failover)
- Individual hypervisors (VM migration required)
- No automated failover

**Mitigation**:
- Regular backups (VM snapshots)
- Documentation for rebuild
- Idempotent playbooks for re-deployment

### Future Enhancements (Planned)

**High Availability**:
- Multiple Ansible control nodes (Ansible Tower/AWX)
- Hypervisor clustering (Proxmox cluster)
- Load-balanced application tiers
- Database replication (PostgreSQL streaming)

**Disaster Recovery**:
- Automated backup solution
- Off-site backup replication
- DR site with regular testing
- Documented RTO/RPO objectives

---

## Performance Considerations

### Ansible Execution Optimization

- **Fact Caching**: Reduces gather time
- **Parallelism**: Increase forks for concurrent execution
- **Pipelining**: Reduces SSH overhead
- **Strategy Plugins**: Use `free` strategy when tasks are independent

### VM Performance Tuning

- **CPU Pinning**: For latency-sensitive applications
- **NUMA Awareness**: Optimize memory access
- **virtio Drivers**: Use paravirtualized devices
- **Disk I/O**: Use virtio-scsi with native AIO

### Network Performance

- **SR-IOV**: For high-throughput networking
- **Bridge Offloading**: Reduce CPU overhead
- **MTU Optimization**: Jumbo frames where supported

---

## Cost Optimization

### Resource Efficiency

- **Right-Sizing**: Match VM resources to actual needs
- **Consolidation**: Maximize hypervisor utilization
- **Thin Provisioning**: Allocate storage on-demand
- **Decommissioning**: Remove unused infrastructure

### Automation Benefits

- **Reduced Manual Labor**: Faster deployments
- **Fewer Errors**: Consistent configurations
- **Faster Recovery**: Automated DR procedures
- **Better Utilization**: Data-driven capacity planning

---

## Related Documentation

- [Network Topology](./network-topology.md)
- [Security Model](./security-model.md)
- [Role Index](../roles/role-index.md)
- [CLAUDE.md Guidelines](../../CLAUDE.md)

---

**Document Version**: 1.0.0
**Last Updated**: 2025-11-11
**Review Schedule**: Quarterly
**Document Owner**: Ansible Infrastructure Team
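As a closing appendix, the execution optimizations discussed under Performance Considerations can be applied per-session through Ansible's standard environment-variable overrides. A minimal sketch with illustrative values (not tuned recommendations):

```shell
# Per-session Ansible tuning; the values here are illustrative only.
export ANSIBLE_FORKS=20                            # parallelism: more hosts per batch
export ANSIBLE_PIPELINING=True                     # pipelining: fewer SSH round-trips
export ANSIBLE_CACHE_PLUGIN=jsonfile               # fact caching backend
export ANSIBLE_CACHE_PLUGIN_CONNECTION="$HOME/.ansible/facts"
export ANSIBLE_CACHE_PLUGIN_TIMEOUT=86400          # keep cached facts for 24 hours
export ANSIBLE_STRATEGY=free                       # independent tasks run unsynchronized

echo "forks=$ANSIBLE_FORKS pipelining=$ANSIBLE_PIPELINING strategy=$ANSIBLE_STRATEGY"
# → forks=20 pipelining=True strategy=free
```

The same settings can equally live in `ansible.cfg` under version control; environment variables are handy for one-off experiments before committing a value.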