
# System Information Gathering Role Documentation

## Overview

The `system_info` role gathers a hardware and software inventory from managed hosts. It collects details on CPU, GPU, memory, storage, network, and virtualization/hypervisor configuration, and exports them as JSON.

## Purpose

- **Infrastructure Inventory**: Maintain up-to-date hardware and software inventory
- **Capacity Planning**: Track resource utilization and plan for scaling
- **Compliance Documentation**: Support audit requirements with detailed system information
- **Troubleshooting**: Provide baseline configuration data for issue resolution
- **Monitoring Integration**: Feed data into monitoring and CMDB systems

## Architecture

### Data Collection Flow

```
┌─────────────────┐
│  Ansible Facts  │
│   (gathered)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Hardware Info  │─────▶│   CPU Details    │
│   Collection    │      │   GPU Detection  │
│                 │      │   Memory Info    │
└────────┬────────┘      │   Disk Layout    │
         │               └──────────────────┘
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Hypervisor     │─────▶│  KVM/Libvirt     │
│   Detection     │      │  Proxmox VE      │
│                 │      │  LXD/Docker      │
└────────┬────────┘      │  VMware/Hyper-V  │
         │               └──────────────────┘
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Aggregation    │─────▶│  JSON Export     │
│  & Export       │      │  Summary Report  │
│                 │      │  Timestamped     │
└─────────────────┘      └──────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│  ./stats/machines/<fqdn>/           │
│  ├── system_info.json               │
│  ├── system_info_<timestamp>.json   │
│  └── summary.txt                    │
└─────────────────────────────────────┘
```
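
The output layout above can also be read from a script. The following is a minimal sketch, assuming the file names shown in the diagram and a timestamp format that sorts chronologically by name; `find_exports` is a hypothetical helper and not part of the role.

```python
from pathlib import Path


def find_exports(base_dir, fqdn):
    """Return (latest, snapshots) for one host under base_dir/<fqdn>/.

    latest    -- Path to system_info.json, or None if it does not exist
    snapshots -- system_info_<timestamp>.json files, sorted by file name
    """
    host_dir = Path(base_dir) / fqdn
    if not host_dir.is_dir():
        return None, []
    latest = host_dir / "system_info.json"
    # The pattern requires an underscore after "system_info",
    # so the un-timestamped file is not matched.
    snapshots = sorted(host_dir.glob("system_info_*.json"))
    return (latest if latest.is_file() else None), snapshots
```

For example, `find_exports("./stats/machines", "web01.example.com")` would return the current export and its timestamped history for that (hypothetical) host.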

### Task Organization

The role is organized into modular task files:

- `main.yml`: Orchestration and task inclusion
- `install.yml`: Package installation (OS-specific)
- `gather_system.yml`: OS and system information
- `gather_cpu.yml`: CPU details and capabilities
- `gather_gpu.yml`: GPU detection and details
- `gather_memory.yml`: Memory and swap information
- `gather_disk.yml`: Disk, LVM, and RAID information
- `gather_network.yml`: Network interfaces and configuration
- `detect_hypervisor.yml`: Virtualization platform detection
- `export_stats.yml`: JSON aggregation and export
- `validate.yml`: Health checks and validation

## Integration Points

### With Other Roles

The `system_info` role can be used in conjunction with:

- **Monitoring roles**: Feed collected data into Prometheus, Grafana, or other monitoring systems
- **CMDB integration**: Export to ServiceNow, NetBox, or other CMDBs
- **Capacity planning tools**: Provide data for capacity analysis
- **Compliance scanning**: Support CIS, NIST, or custom compliance checks

### With External Systems

#### Example: Export to NetBox

```yaml
- name: Sync to NetBox CMDB
  hosts: all
  tasks:
    - name: Include system_info role
      include_role:
        name: system_info

    - name: Push to NetBox
      uri:
        url: "https://netbox.example.com/api/dcim/devices/"
        method: POST
        body_format: json
        status_code: 201  # NetBox answers a successful create with 201 Created
        headers:
          Authorization: "Token {{ netbox_api_token }}"
        body:
          name: "{{ ansible_fqdn }}"
          # NetBox may expect an existing device type (ID or lookup) here, not free text
          device_type: "{{ system_info_hardware.product }}"
          custom_fields:
            cpu_model: "{{ system_info_cpu.model }}"
            memory_mb: "{{ system_info_memory.total_mb }}"
      delegate_to: localhost
```

#### Example: Prometheus Exporter

```yaml
- name: Export metrics for Prometheus
  copy:
    content: |
      # HELP system_info_cpu_count Number of vCPUs
      # TYPE system_info_cpu_count gauge
      system_info_cpu_count{host="{{ ansible_fqdn }}"} {{ system_info_cpu.count.vcpus }}

      # HELP system_info_memory_total_mb Total memory in MB
      # TYPE system_info_memory_total_mb gauge
      system_info_memory_total_mb{host="{{ ansible_fqdn }}"} {{ system_info_memory.total_mb }}
    dest: "/var/lib/node_exporter/textfile_collector/system_info.prom"
    mode: "0644"
  # No delegate_to: the task already runs on the host being inventoried
```
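
The same metrics could be rendered on the control node from an exported JSON file instead of from role variables. This is a minimal sketch, assuming the export mirrors the variable layout used above (`cpu.count.vcpus`, `memory.total_mb`); `render_metrics` is a hypothetical helper.

```python
def render_metrics(export):
    """Render two gauges in the Prometheus text exposition format."""
    host = export["host_info"]["fqdn"]
    metrics = [
        ("system_info_cpu_count", "Number of vCPUs",
         export["cpu"]["count"]["vcpus"]),
        ("system_info_memory_total_mb", "Total memory in MB",
         export["memory"]["total_mb"]),
    ]
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        # Doubled braces emit the literal { } around the label set.
        lines.append(f'{name}{{host="{host}"}} {value}')
    return "\n".join(lines) + "\n"
```

The returned string can be written to a `.prom` file in the textfile collector directory.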

## Data Dictionary

### JSON Schema

The exported JSON follows this structure:

```json
{
  "collection_info": {
    "timestamp": "ISO8601 datetime",
    "timestamp_epoch": "Unix epoch",
    "collected_by": "ansible",
    "role_version": "semver",
    "ansible_version": "version string"
  },
  "host_info": {
    "hostname": "short hostname",
    "fqdn": "fully qualified domain name",
    "uptime": "human readable uptime",
    "boot_time": "boot timestamp"
  },
  "system": {
    "distribution": "OS name",
    "distribution_version": "version",
    "distribution_release": "codename",
    "distribution_major_version": "major version",
    "os_family": "Debian|RedHat"
  },
  "kernel": {
    "version": "kernel version",
    "architecture": "x86_64|aarch64|etc"
  },
  "hardware": {
    "manufacturer": "hardware vendor",
    "product": "product name",
    "serial": "serial number",
    "uuid": "system UUID"
  },
  "security": {
    "selinux": "Enforcing|Permissive|Disabled|N/A",
    "apparmor": "Enabled|Disabled|N/A"
  },
  "cpu": { /* detailed CPU information */ },
  "gpu": { /* GPU detection and details */ },
  "memory": { /* memory statistics */ },
  "swap": { /* swap configuration */ },
  "disk": { /* disk and storage information */ },
  "network": { /* network configuration */ },
  "hypervisor": { /* virtualization details */ }
}
```
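
A consumer may want to confirm that an export contains every top-level section before processing it. The sketch below checks only for the keys listed in the structure above, not their contents; `missing_sections` is a hypothetical helper.

```python
# Top-level keys taken from the structure documented above.
EXPECTED_SECTIONS = (
    "collection_info", "host_info", "system", "kernel", "hardware",
    "security", "cpu", "gpu", "memory", "swap", "disk", "network",
    "hypervisor",
)


def missing_sections(export):
    """Return the expected top-level keys that are absent from an export."""
    return [key for key in EXPECTED_SECTIONS if key not in export]
```

An empty list means all sections are present; a partial run would typically show up as one or more missing keys.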

## Use Cases

### 1. Infrastructure Audit

Generate a complete inventory of all infrastructure:

```bash
# Gather information from all hosts
ansible-playbook playbooks/gather_system_info.yml

# Generate CSV report (-s slurps all files so the header is printed once)
jq -rs '(["FQDN","OS","CPU","Memory","Disk","Hypervisor"] | @csv),
        (.[] | [.host_info.fqdn, .system.distribution, .cpu.model,
                (.memory.total_mb|tostring), (.disk.physical_disks|length|tostring),
                (.hypervisor.is_hypervisor|tostring)] | @csv)' \
  stats/machines/*/system_info.json > infrastructure_inventory.csv
```

### 2. License Compliance

Track CPU cores for license management:

```bash
# Count total CPU cores across infrastructure
jq -s 'map(.cpu.count.total_cores | tonumber) | add' \
  stats/machines/*/system_info.json
```

### 3. Capacity Planning

Identify hosts nearing resource limits:

```bash
# Find hosts with >80% memory usage
jq -r 'select(.memory.usage_percent > 80) |
       "\(.host_info.fqdn): \(.memory.usage_percent)%"' \
  stats/machines/*/system_info.json

# Find hosts with low disk space (any filesystem at 90% or more)
# test() does regex matching; contains() only matches literal substrings
jq -r 'select(any(.disk.usage_human[]; test("\\b(9[0-9]|100)%"))) |
       .host_info.fqdn' \
  stats/machines/*/system_info.json
```
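
The same checks can be sketched in Python when thresholds need to be configurable. This assumes, as the `jq` filters do, that `memory.usage_percent` is numeric and that `disk.usage_human` is a list of `df -h` style lines; both function names are hypothetical.

```python
import re

# Matches usage figures such as "92%" in a df-style line.
_PERCENT = re.compile(r"\b(\d{1,3})%")


def max_disk_usage_percent(usage_lines):
    """Return the highest NN% value found in the lines, or None if none."""
    values = [int(m.group(1))
              for line in usage_lines
              for m in _PERCENT.finditer(line)]
    return max(values) if values else None


def hosts_over_threshold(exports, memory_pct=80, disk_pct=90):
    """Return FQDNs whose memory or disk usage exceeds the thresholds."""
    flagged = []
    for export in exports:
        mem = float(export["memory"]["usage_percent"])
        disk = max_disk_usage_percent(export["disk"]["usage_human"])
        if mem > memory_pct or (disk is not None and disk >= disk_pct):
            flagged.append(export["host_info"]["fqdn"])
    return flagged
```

`exports` is expected to be a list of dictionaries loaded from the per-host `system_info.json` files.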

### 4. Hypervisor Inventory

List all hypervisors and their VM counts:

```bash
# KVM/Libvirt hypervisors
jq -r 'select(.hypervisor.kvm_libvirt.installed == true) |
       "\(.host_info.fqdn): \(.hypervisor.kvm_libvirt.running_vms) running, \(.hypervisor.kvm_libvirt.total_vms) total"' \
  stats/machines/*/system_info.json

# Proxmox hosts
jq -r 'select(.hypervisor.proxmox.installed == true) |
       "\(.host_info.fqdn): \(.hypervisor.proxmox.version)"' \
  stats/machines/*/system_info.json
```

### 5. Security Compliance

Verify SELinux/AppArmor status:

```bash
# Check SELinux enforcement
jq -r 'select(.security.selinux != "Enforcing" and .security.selinux != "N/A") |
       "\(.host_info.fqdn): SELinux is \(.security.selinux)"' \
  stats/machines/*/system_info.json

# List CPU vulnerabilities
jq -r '"\(.host_info.fqdn):", .cpu.vulnerabilities[]' \
  stats/machines/*/system_info.json
```

## Performance Considerations

### Execution Time

Typical execution times per host:
- **Minimal gathering** (CPU, memory only): 15-20 seconds
- **Standard gathering** (all defaults): 30-45 seconds
- **Comprehensive** (with raw outputs): 45-60 seconds

Factors affecting performance:
- Number of network interfaces
- Number of disk devices
- Hypervisor API response time
- SMART disk scanning (slowest component)

### Optimization Strategies

1. **Parallel execution**: Use `-f` flag to increase parallelism
   ```bash
   ansible-playbook site.yml -t system_info -f 20
   ```

2. **Skip slow components**: Disable unnecessary gathering
   ```yaml
   system_info_gather_network: false  # Skip if not needed
   ```

3. **Cache facts**: Enable fact caching in ansible.cfg
   ```ini
   [defaults]
   gathering = smart
   fact_caching = jsonfile
   fact_caching_connection = /tmp/ansible_facts
   fact_caching_timeout = 3600
   ```

## Security Best Practices

### Data Protection

- **Sensitive information**: Statistics include serial numbers, UUIDs, and network topology
- **Access control**: Restrict read access to statistics directory
- **Encryption**: Consider encrypting the statistics directory for sensitive environments
- **Retention**: Implement rotation policy for timestamped backups
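
A rotation policy for the timestamped files could look like the sketch below. It assumes the `system_info_<timestamp>.json` naming from the export layout and uses file modification time as the age; `prune_snapshots`, its defaults, and the 30-day window are illustrative assumptions, not behaviour of the role.

```python
import time
from pathlib import Path


def prune_snapshots(host_dir, max_age_days=30, keep_min=1, now=None,
                    dry_run=True):
    """Find (and optionally delete) snapshots older than max_age_days.

    The newest keep_min snapshots are always kept, whatever their age.
    Returns the list of expired paths, oldest first.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    snapshots = sorted(Path(host_dir).glob("system_info_*.json"),
                       key=lambda p: p.stat().st_mtime)
    candidates = snapshots[:-keep_min] if keep_min > 0 else snapshots
    expired = [p for p in candidates if p.stat().st_mtime < cutoff]
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

The function defaults to a dry run so the list of files can be reviewed before anything is deleted. The current `system_info.json` is never matched by the glob pattern.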

### Execution Security

- **Privilege escalation**: Role requires sudo/root for hardware information
- **Audit logging**: All executions are logged via Ansible
- **Minimal changes**: Apart from the packages installed by `install.yml`, the role performs no modifications to managed systems
- **No secrets**: Role does not collect or expose credentials

## Troubleshooting Guide

### Common Problems

#### Problem: "Package installation failed"

**Symptoms**: Role fails during the install phase

**Cause**: No internet access or repository issues

**Solution**:
```bash
# Pre-install packages manually
ansible all -m package -a "name=lshw,dmidecode,pciutils state=present" --become

# Or skip installation
ansible-playbook site.yml -t system_info --skip-tags install
```

#### Problem: "Statistics directory not created"

**Symptoms**: No output files generated

**Cause**: Permission issues on the control node

**Solution**:
```bash
# Create the directory and make sure it is writable
mkdir -p ./stats/machines
chmod 755 ./stats/machines

# Or specify writable directory
ansible-playbook site.yml -e "system_info_stats_base_dir=/tmp/stats"
```

#### Problem: "Invalid JSON output"

**Symptoms**: `jq` reports parsing errors

**Cause**: Incomplete execution or a full disk

**Solution**:
```bash
# Validate JSON files
for f in ./stats/machines/*/system_info.json; do
  jq empty "$f" 2>&1 || echo "Invalid: $f"
done

# Re-run for failed hosts
ansible-playbook site.yml -l failed_host -t system_info
```

## Maintenance

### Regular Updates

- **Quarterly review**: Update role for new hypervisor versions
- **OS compatibility**: Test with new OS releases
- **Package updates**: Verify new package versions don't break collection
- **Documentation**: Keep examples and use cases current

### Monitoring

Track role health metrics:
- Execution success rate
- Average execution time
- Output file sizes
- JSON validation failures
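
Two of these metrics, output file sizes and JSON validation failures, can be derived directly from the statistics directory. The sketch below assumes the `<base_dir>/<fqdn>/system_info.json` layout; `export_health` is a hypothetical helper. Success rate and execution time would have to come from Ansible's own run output.

```python
import json
from pathlib import Path


def export_health(base_dir):
    """Summarise the system_info.json files found under base_dir/*/."""
    sizes, invalid = [], []
    for path in sorted(Path(base_dir).glob("*/system_info.json")):
        sizes.append(path.stat().st_size)
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Truncated or corrupt export, for example after a full disk
            invalid.append(str(path))
    return {
        "files": len(sizes),
        "total_bytes": sum(sizes),
        "invalid": invalid,
    }
```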

### Backup Strategy

```bash
# Daily backup of statistics (a crontab entry must stay on a single line)
0 3 * * * tar -czf /backup/ansible-stats-$(date +\%Y\%m\%d).tar.gz /opt/ansible/stats/machines/

# Cleanup old backups (keep 30 days)
0 4 * * * find /backup -name 'ansible-stats-*.tar.gz' -mtime +30 -delete
```

## Advanced Usage

### Custom Filters

Create custom Ansible filters for data processing:

```python
# filter_plugins/system_info_filters.py
def format_memory(value_mb):
    """Convert a size in MB to a human-readable string (MB, GB, or TB)."""
    value_mb = float(value_mb)  # Jinja2 may hand the value over as a string
    if value_mb < 1024:
        return f"{value_mb:.0f} MB"
    elif value_mb < 1048576:
        return f"{value_mb / 1024:.1f} GB"
    else:
        return f"{value_mb / 1048576:.1f} TB"


class FilterModule(object):
    """Expose the filters above to Ansible."""

    def filters(self):
        return {
            'format_memory': format_memory
        }
```

### Dynamic Inventory Integration

Use collected data for dynamic grouping:

```python
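#!/usr/bin/env python3
# system_info_inventory.py: executable inventory script (not an inventory plugin class)
# Create dynamic groups based on collected information
import glob
import json

groups = {
    'hypervisors': [],
    'virtual_machines': [],
    'high_memory': [],
    'gpu_enabled': []
}

for stats_file in glob.glob('stats/machines/*/system_info.json'):
    with open(stats_file) as f:
        data = json.load(f)
    fqdn = data['host_info']['fqdn']

    if data['hypervisor']['is_hypervisor']:
        groups['hypervisors'].append(fqdn)
    if data['hypervisor']['is_virtual']:
        groups['virtual_machines'].append(fqdn)
    if data['memory']['total_mb'] > 64000:
        groups['high_memory'].append(fqdn)
    if data['gpu']['detected']:
        groups['gpu_enabled'].append(fqdn)

# Print the JSON structure Ansible expects from an inventory script (--list).
# An empty _meta.hostvars stops Ansible from calling the script once per host.
inventory = {name: {'hosts': hosts} for name, hosts in groups.items()}
inventory['_meta'] = {'hostvars': {}}
print(json.dumps(inventory, indent=2))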
```

## Related Documentation

- [Main README](../../roles/system_info/README.md)
- [Cheatsheet](../../cheatsheets/system_info.md)
- [Ansible Best Practices](https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html)

## Changelog

See role README.md for version history and changes.

---

**Document Version**: 1.0.0
**Last Updated**: 2025-01-11
**Maintained By**: Ansible Infrastructure Team