diff --git a/docs/docker-userns-testing-guide.md b/docs/docker-userns-testing-guide.md new file mode 100644 index 0000000..fe7b306 --- /dev/null +++ b/docs/docker-userns-testing-guide.md @@ -0,0 +1,762 @@ +# Docker User Namespace Remapping - Testing and Implementation Guide + +**Document Version:** 1.0 +**Last Updated:** 2025-11-11 +**Risk Level:** HIGH +**Testing Required:** YES (Mandatory in dev/test first) + +--- + +## Table of Contents + +1. [Overview](#overview) +2. [Security Benefits](#security-benefits) +3. [Prerequisites](#prerequisites) +4. [Testing Phase (Week 48-49)](#testing-phase-week-48-49) +5. [Production Implementation (Week 50)](#production-implementation-week-50) +6. [Mailcow-Specific Considerations](#mailcow-specific-considerations) +7. [Troubleshooting](#troubleshooting) + +--- + +## Overview + +User namespace remapping is a Docker security feature that maps container UID/GIDs to different values on the host, preventing container root from being host root. + +### Current Status + +| Host | User Namespaces | Risk Level | Implementation Priority | +|------|-----------------|------------|------------------------| +| pihole | Not configured | MEDIUM | Week 49 (after testing) | +| mymx | Not configured | HIGH | Week 50 (mailcow complexity) | + +### Impact Assessment + +**Benefits:** +- ✅ Container root ≠ host root (major security improvement) +- ✅ Reduces container escape impact +- ✅ CIS Docker Benchmark compliance (2.13) + +**Risks:** +- ⚠️ **ALL containers must be recreated** +- ⚠️ Volume permissions must be remapped +- ⚠️ Breaking change for existing deployments +- ⚠️ Mailcow may have specific requirements + +**Recommendation:** Test thoroughly in dev, then pihole, then mymx (last) + +--- + +## Security Benefits + +### Without User Namespace Remapping (Current State) + +``` +Container: Host: +UID 0 (root) → UID 0 (root) ❌ DANGEROUS +UID 1000 → UID 1000 +``` + +**Problem:** Container root can potentially escape and has host root privileges. 
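One way to tell which of these two states a host is in is to read the `uid_map` of a container's main process. The following is a minimal sketch, not part of Docker: `classify_uid_map` is a hypothetical helper, and the sample lines assume the 165536 base used throughout this guide.

```shell
#!/bin/sh
# Hypothetical helper: classify one line of /proc/<pid>/uid_map.
# Line format: "<first UID inside> <first UID on host> <range length>".
classify_uid_map() {
  # Word splitting on $1 is intentional: the kernel pads the columns with spaces.
  set -- $1
  if [ "$1" = "$2" ]; then
    echo "identity"   # container UID 0 is host UID 0 (no remapping)
  else
    echo "remapped"   # container UID 0 is host UID $2
  fi
}

classify_uid_map "0 0 4294967295"   # typical line without remapping
classify_uid_map "0 165536 65536"   # typical line with userns-remap (assumed base)
```

On a live host the input could come from `sudo head -n 1 /proc/$(docker inspect -f '{{.State.Pid}}' userns-test)/uid_map`, where `userns-test` is the test container created later in this guide.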
+ +### With User Namespace Remapping (Target State) + +``` +Container: Host: +UID 0 (root) → UID 165536 ✅ SAFE +UID 1000 → UID 166536 +``` + +**Benefit:** Container root is unprivileged user on host. + +--- + +## Prerequisites + +### Before Starting Testing + +1. **VM Snapshots Created** + ```bash + ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['pihole', 'mymx']" + ``` + +2. **Rollback Procedures Reviewed** + - Read: `docs/runbooks/docker-configuration-rollback.md` + - Understand VM snapshot restore process + - Have emergency contact information ready + +3. **Maintenance Window Scheduled** + - Duration: 2-3 hours for testing + - Low-traffic period recommended + - Second person available for verification + +4. **Documentation Ready** + - This guide printed or accessible offline + - Docker and mailcow documentation available + - Notepad for documenting issues + +--- + +## Testing Phase (Week 48-49) + +### Phase 1: Test Environment Setup (Week 48) + +**Objective:** Validate user namespace remapping with simple container + +#### Option A: Use derp VM (Recommended) + +```bash +# 1. Start derp VM (if stopped) +ssh grokbox "sudo virsh start derp" + +# 2. Create ansible user and configure SSH +# (Use deploy_linux_vm role or manual setup) + +# 3. Install Docker +ansible derp -m apt -a "name=docker.io state=present" -b + +# 4. Create snapshot before testing +ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['derp']" +``` + +#### Option B: Create temporary test container on existing host + +```bash +# On pihole (low risk - only 1 container) +# Create test container first + +docker run -d --name userns-test \ + -v test-volume:/data \ + alpine:latest sleep infinity +``` + +### Phase 2: Enable User Namespace Remapping (Week 48) + +#### Step 1: Configure Docker Daemon + +```bash +# On test host (derp or pihole) +sudo tee /etc/docker/daemon.json < /data/test.txt' + +# Check file ownership on host +# Volume location changed! 
It's now in: +sudo ls -la /var/lib/docker/165536.165536/volumes/test-volume/_data/ + +# UID should be 165536 (remapped root) + +# Test read/write in container +docker exec userns-test cat /data/test.txt +docker exec userns-test sh -c 'echo "test2" >> /data/test.txt' +``` + +### Phase 3: Test with Real Application (Week 48-49) + +#### Test Scenario 1: Simple Web Server (pihole preparation) + +```bash +# Deploy nginx with volume +docker run -d --name test-nginx \ + -p 8080:80 \ + -v nginx-data:/usr/share/nginx/html \ + nginx:alpine + +# Test access +curl http://localhost:8080 + +# Create content +docker exec test-nginx sh -c 'echo "

User Namespace Test

" > /usr/share/nginx/html/test.html' + +# Verify access +curl http://localhost:8080/test.html + +# Check logs +docker logs test-nginx +``` + +#### Test Scenario 2: Database Container (mailcow preparation) + +```bash +# Deploy MariaDB with volume +docker run -d --name test-db \ + -e MYSQL_ROOT_PASSWORD=testpass123 \ + -v mysql-data:/var/lib/mysql \ + mariadb:10.11 + +# Wait for startup +sleep 30 + +# Test database (no space between -p and the password) +docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" + +# Create test database +docker exec test-db mysql -ptestpass123 -e "CREATE DATABASE testdb;" + +# Stop and restart to test persistence +docker stop test-db +docker start test-db +sleep 20 + +# Verify data persisted +docker exec test-db mysql -ptestpass123 -e "SHOW DATABASES;" | grep testdb +``` + +#### Test Scenario 3: Application with File Uploads + +```bash +# Create upload directory +mkdir -p /tmp/test-uploads + +# Run container with bind mount +docker run -d --name test-upload \ + -v /tmp/test-uploads:/uploads \ + alpine:latest sleep infinity + +# Test file creation +docker exec test-upload sh -c 'echo "test" > /uploads/test.txt' + +# Check host permissions +ls -la /tmp/test-uploads/ +# File should be owned by UID 165536 + +# Test file access from container +docker exec test-upload cat /uploads/test.txt +``` + +### Phase 4: Identify Issues (Week 48-49) + +#### Common Issues to Check + +1. **Permission Denied Errors** + ```bash + # Check container logs + docker logs 2>&1 | grep -i "permission" + ``` + +2. **Volume Mount Failures** + ```bash + # List volumes + docker volume ls + + # Inspect volume + docker volume inspect + + # Check actual location on disk + sudo ls -la /var/lib/docker/*/volumes/ + ``` + +3. **Bind Mount Issues** + ```bash + # For bind mounts, may need to adjust host permissions + # Example: Allow remapped UID to write + sudo chown 165536:165536 /path/to/host/dir + ``` + +4. 
**Privileged Container Conflicts** + ```bash + # Test if privileged containers still work + # Note: with userns-remap enabled, --privileged is rejected unless the + # container also opts out of remapping with --userns=host + docker run --rm --privileged --userns=host alpine:latest id + ``` + +#### Document All Findings + +Create test log: +```markdown +## User Namespace Remapping Test Log + +Date: +Host: +Docker Version: + +### Test 1: Simple Container +- Result: PASS/FAIL +- Issues: +- Notes: + +### Test 2: Web Server +- Result: PASS/FAIL +- Issues: +- Notes: + +### Test 3: Database +- Result: PASS/FAIL +- Issues: +- Notes: + +### Conclusion +Ready for production: YES/NO +Blockers: +``` + +--- + +## Production Implementation (Week 50) + +### Implementation Order + +1. **pihole** (Week 49 end / Week 50 start) - Lowest risk +2. **mymx** (Week 50 end) - Highest risk, requires mailcow-specific testing + +### pihole Implementation + +**Prerequisites:** +- ✅ Testing completed successfully on derp/test environment +- ✅ VM snapshot created +- ✅ Maintenance window scheduled +- ✅ Rollback procedure reviewed + +**Steps:** + +```bash +# 1. Create snapshot +ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['pihole']" \ + -e "snapshot_description='Pre user namespace implementation'" + +# 2. Backup current configuration +ansible pihole -m shell -a "sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s)" -b + +# 3. Stop pihole container +ansible pihole -m shell -a "docker stop pihole" -b + +# 4. Configure user namespace remapping +ansible pihole -m copy -b -a " + dest=/etc/docker/daemon.json + content='{\"userns-remap\": \"default\"}' + owner=root + group=root + mode='0644' +" + +# 5. Restart Docker +ansible pihole -m systemd -a "name=docker state=restarted" -b + +# 6. Verify Docker started +ansible pihole -m shell -a "docker info | grep -i userns" -b + +# 7. Recreate pihole container (adjust based on actual deployment) +# If using docker run command, re-run it +# If using docker-compose, run: docker-compose up -d + +# 8. 
Verify pihole is working +ansible pihole -m shell -a "docker ps" -b +ansible pihole -m shell -a "docker logs pihole --tail 50" -b + +# 9. Test DNS functionality +dig @192.168.122.12 google.com + +# 10. Monitor for 1 hour +watch -n 60 'ansible pihole -m shell -a "docker ps" -b' +``` + +**Rollback if Issues:** +```bash +# Follow docs/runbooks/docker-configuration-rollback.md +# Procedure 3: User Namespace Remapping Rollback +``` + +--- + +## Mailcow-Specific Considerations + +### Why Mailcow is Complex + +1. **Multiple interconnected containers** (24 containers) +2. **Persistent data in multiple volumes** (mail, databases, configs) +3. **File permissions critical** for mail delivery +4. **Active production service** - downtime impact high + +### Mailcow Testing Approach (Week 49-50) + +#### Phase 1: Research (Week 49) + +```bash +# 1. Check mailcow documentation +# Search: "user namespace" or "userns-remap" +# URL: https://docs.mailcow.email/ + +# 2. Check mailcow GitHub issues +# Search for: userns, user namespace, permission issues + +# 3. 
Check mailcow community forum +# URL: https://community.mailcow.email/ +# Search for similar implementations +``` + +#### Phase 2: Mailcow Test Environment (Week 49) + +**Option A: Deploy test mailcow on derp** + +```bash +# Requires: +# - 4GB+ RAM (derp may be too small) +# - 20GB+ disk space +# - Domain for testing + +# Install mailcow on derp +git clone https://github.com/mailcow/mailcow-dockerized +cd mailcow-dockerized +./generate_config.sh +docker-compose up -d +``` + +**Option B: Clone mymx mailcow config to test environment** + +```bash +# Create test VM clone +# Copy mailcow configuration +# Test with user namespaces +``` + +#### Phase 3: Mailcow Volume Analysis (Week 49) + +```bash +# On mymx, identify all volumes +docker volume ls | grep mailcow + +# Check critical volumes +docker volume inspect mailcowdockerized_vmail-vol-1 +docker volume inspect mailcowdockerized_mysql-vol-1 + +# Document current permissions +for vol in $(docker volume ls -q | grep mailcow); do + echo "=== $vol ===" + sudo ls -la /var/lib/docker/volumes/$vol/_data/ | head -20 +done > /tmp/mailcow-permissions-before.txt +``` + +#### Phase 4: Mailcow Implementation (Week 50 - IF testing successful) + +**ONLY proceed if:** +- ✅ Testing in dev environment successful +- ✅ pihole implementation successful +- ✅ Mailcow community confirms no known issues +- ✅ Extended maintenance window available (2-4 hours) +- ✅ Full backups completed +- ✅ Rollback tested and confirmed working + +**Implementation Steps:** + +```bash +# 1. Create snapshot +ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['mymx']" \ + -e "snapshot_description='Pre mailcow user namespace'" + +# 2. Backup ALL mailcow data +ansible mymx -m shell -a "cd /opt/mailcow-dockerized && ./helper-scripts/backup_and_restore.sh backup all" -b + +# 3. Stop mailcow +ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose down" -b + +# 4. 
Backup current state +ansible mymx -m shell -a " + sudo tar -czf /root/mailcow-pre-userns-$(date +%s).tar.gz \ + /etc/docker \ + /opt/mailcow-dockerized \ + /var/lib/docker/volumes/mailcow* +" -b + +# 5. Configure user namespace +ansible mymx -m shell -a " + sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s) + echo '{\"userns-remap\": \"default\"}' | sudo tee /etc/docker/daemon.json +" -b + +# 6. Restart Docker +ansible mymx -m systemd -a "name=docker state=restarted" -b + +# 7. Verify Docker started with user namespaces +ansible mymx -m shell -a "docker info | grep -i userns" -b + +# 8. Start mailcow (will recreate all containers) +ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose up -d" -b + +# 9. Monitor startup +watch -n 10 'ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose ps" -b' + +# 10. Check logs for permission errors +ansible mymx -m shell -a "cd /opt/mailcow-dockerized && docker-compose logs --tail 100" -b | grep -i "permission\|denied\|failed" + +# 11. Test mail functionality +# - Send test email +# - Receive test email +# - Check webmail access +# - Verify SOGo groupware +# - Test IMAP/SMTP connections + +# 12. Monitor for 4-8 hours before declaring success +``` + +**Known Potential Issues with Mailcow:** + +1. **Vmail Volume Permissions** + ```bash + # If mail delivery fails with permission errors + # May need to adjust permissions (LAST RESORT) + sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes/mailcowdockerized_vmail-vol-1/_data/ + ``` + +2. **MySQL Volume Issues** + ```bash + # If database won't start + # Check MySQL logs + docker logs mailcowdockerized-mysql-mailcow-1 + + # May need database permission fixes + # This is why testing is CRITICAL + ``` + +3. 
**Dovecot Permission Issues** + ```bash + # Dovecot is sensitive to mail file permissions + # May require config adjustments in mailcow.conf + ``` + +### Mailcow Rollback Decision Point + +**Roll back immediately if:** +- Docker daemon won't start +- MySQL container won't start +- Cannot send/receive mail after 15 minutes +- Permission errors in critical containers +- Data appears missing/inaccessible + +**Use VM snapshot restore if:** +- Multiple containers failing +- Data corruption suspected +- Cannot resolve within 30 minutes + +--- + +## Troubleshooting + +### Issue 1: Docker Daemon Won't Start + +**Symptoms:** +```bash +systemctl status docker +# Failed to start Docker Application Container Engine +``` + +**Solutions:** +```bash +# Check logs +journalctl -u docker -n 100 --no-pager + +# Common causes: +# 1. Invalid daemon.json syntax +cat /etc/docker/daemon.json | jq '.' + +# 2. Subuid/subgid not configured +cat /etc/subuid +cat /etc/subgid +# Should have dockremap:165536:65536 + +# 3. Restore backup +sudo cp /etc/docker/daemon.json.backup. /etc/docker/daemon.json +sudo systemctl start docker +``` + +### Issue 2: Container Won't Start - Permission Denied + +**Symptoms:** +```bash +docker logs +# Permission denied errors +``` + +**Solutions:** +```bash +# 1. Check volume location +docker volume inspect + +# 2. Check permissions on host +sudo ls -la /var/lib/docker/165536.165536/volumes//_data/ + +# 3. 
If permissions are wrong, they may need adjusting +# (Avoid this if possible - indicates larger problem) +sudo chown -R 165536:165536 /var/lib/docker/165536.165536/volumes//_data/ +``` + +### Issue 3: Bind Mounts Not Working + +**Symptoms:** +```bash +docker logs +# Cannot access /bind/mount/path +``` + +**Solutions:** +```bash +# Bind mounts need host directory permissions adjusted +sudo chown 165536:165536 /path/to/bind/mount + +# Or use volumes instead of bind mounts +# Volumes are handled automatically by Docker +``` + +### Issue 4: Privileged Container Needed + +**Note:** Privileged containers (like mailcow netfilter) cannot run under user namespace remapping. They must opt out with `--userns=host` (`userns_mode: "host"` in Compose), which bypasses the remap for that container. + +```bash +# Verify privileged container still works +docker inspect | grep -i privileged +# Should show: "Privileged": true + +# With --userns=host, privileged containers run as actual host root (userns bypassed) +# This is expected for netfilter, acceptable risk (documented) +``` + +--- + +## Success Criteria + +### Testing Phase Success (Before Production) + +- [ ] Simple container runs successfully +- [ ] Web server container accessible +- [ ] Database container stores/retrieves data +- [ ] Volume permissions correct (165536 UID) +- [ ] Bind mounts work (if needed) +- [ ] No permission errors in logs +- [ ] Can recreate containers after Docker restart +- [ ] Rollback procedure tested and successful + +### Production Implementation Success + +#### pihole +- [ ] VM snapshot created +- [ ] Docker daemon running with user namespaces +- [ ] pihole container running +- [ ] DNS queries working +- [ ] No permission errors in logs +- [ ] Monitoring shows normal operation for 24+ hours + +#### mymx/mailcow +- [ ] VM snapshot created +- [ ] Docker daemon running with user namespaces +- [ ] All 24 containers running +- [ ] Can send email +- [ ] Can receive email +- [ ] Webmail accessible +- [ ] SOGo groupware working +- [ ] No permission errors in logs +- [ ] Monitoring shows normal operation for 48+ hours +- [ ] Full service verification completed + 
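The "No permission errors in logs" items in the checklists above could be checked mechanically rather than by eye. This is a minimal sketch only: `count_permission_errors` is a hypothetical helper, and its pattern simply mirrors the `permission|denied|failed` grep used in the mailcow implementation steps.

```shell
#!/bin/sh
# Hypothetical helper: count log lines that suggest a permission problem.
# Reads log text on stdin and prints the number of matching lines.
count_permission_errors() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -ciE 'permission|denied|failed' || true
}

# Sample input; only the second line matches.
printf 'started ok\nopen /data: Permission denied\n' | count_permission_errors
```

Against a real container this might be fed with `docker logs pihole --tail 200 2>&1 | count_permission_errors`; a result of `0` would satisfy the checklist item, subject to how noisy the word "failed" is in that service's normal logs.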
+--- + +## Decision Tree + +``` +START: Ready to enable user namespaces? +│ +├─ Testing completed in dev? +│ ├─ NO → STOP: Complete testing first +│ └─ YES → Continue +│ +├─ VM snapshots created? +│ ├─ NO → STOP: Create snapshots first +│ └─ YES → Continue +│ +├─ Rollback procedure reviewed? +│ ├─ NO → STOP: Review rollback docs +│ └─ YES → Continue +│ +├─ Which host? +│ ├─ pihole → Proceed (lower risk) +│ └─ mymx → Additional checks needed +│ │ +│ ├─ Mailcow community research done? +│ │ ├─ NO → STOP: Research first +│ │ └─ YES → Continue +│ │ +│ ├─ pihole implementation successful? +│ │ ├─ NO → STOP: Fix pihole first +│ │ └─ YES → Continue +│ │ +│ ├─ Extended maintenance window? +│ │ ├─ NO → STOP: Schedule proper window +│ │ └─ YES → Proceed with caution +│ │ +│ └─ Proceed with mymx (high risk) +``` + +--- + +## References + +- Docker User Namespace Documentation: https://docs.docker.com/engine/security/userns-remap/ +- CIS Docker Benchmark 2.13: Enable user namespace support +- Mailcow Documentation: https://docs.mailcow.email/ +- NIST SP 800-190: Section 4.4 - Host OS and multi-tenancy + +--- + +**Document Version:** 1.0 +**Next Review:** After testing completion (Week 49) +**Owner:** Infrastructure Security Team diff --git a/docs/runbooks/docker-configuration-rollback.md b/docs/runbooks/docker-configuration-rollback.md new file mode 100644 index 0000000..24d8b0f --- /dev/null +++ b/docs/runbooks/docker-configuration-rollback.md @@ -0,0 +1,549 @@ +# Docker Configuration Rollback Procedures + +**Document Version:** 1.0 +**Last Updated:** 2025-11-11 +**Owner:** Infrastructure Team +**Risk Level:** HIGH - User Namespace Remapping / LOW - Resource Limits + +--- + +## Table of Contents + +1. [Overview](#overview) +2. [Pre-Change Requirements](#pre-change-requirements) +3. [Rollback Procedures](#rollback-procedures) +4. [Specific Scenarios](#specific-scenarios) +5. 
[Emergency Contacts](#emergency-contacts) + +--- + +## Overview + +This runbook provides step-by-step rollback procedures for Docker configuration changes, with special focus on high-risk modifications like user namespace remapping. + +### Risk Classification + +| Change Type | Risk Level | Rollback Complexity | Downtime | +|-------------|-----------|---------------------|----------| +| Resource limits | LOW | Simple | < 1 min | +| Image version pinning | LOW | Simple | < 1 min | +| User namespace remapping | HIGH | Complex | 5-15 min | +| Network configuration | MEDIUM | Moderate | 2-5 min | +| Storage driver change | CRITICAL | Complex | 15-30 min | + +--- + +## Pre-Change Requirements + +### Before ANY Docker Configuration Change + +**MANDATORY STEPS - DO NOT SKIP:** + +1. **Create VM Snapshot** + ```bash + # From Ansible control node + ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['pihole']" \ + -e "snapshot_description='Pre Docker config change'" + ``` + +2. **Backup Docker Configuration** + ```bash + # On target host + sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.backup.$(date +%s) + sudo tar -czf /root/docker-backup-$(date +%s).tar.gz \ + /etc/docker \ + /var/lib/docker/volumes + ``` + +3. **Document Current State** + ```bash + # Capture current container list + docker ps -a --format "table {{.Names}}\t{{.Image}}\t{{.Status}}" > /tmp/containers-before.txt + + # Capture current configuration + docker info > /tmp/docker-info-before.txt + + # Capture volume list + docker volume ls > /tmp/volumes-before.txt + ``` + +4. **Verify Connectivity** + ```bash + # Test from Ansible control node + ansible pihole -m ping + ansible pihole -m shell -a "docker ps" + ``` + +5. 
**Schedule Maintenance Window** + - Notify stakeholders + - Plan for 30-60 minute window + - Have second person available for verification + +--- + +## Rollback Procedures + +### Procedure 1: Quick Rollback (Resource Limits / Image Versions) + +**Time Estimate:** 1-2 minutes +**Risk:** LOW +**Downtime:** < 1 minute per container + +#### Steps + +1. **Stop affected container** + ```bash + docker stop + ``` + +2. **Restore previous configuration** + ```bash + # For docker run commands + # Simply re-run with old parameters + + # For docker-compose + git checkout HEAD~1 docker-compose.yml + docker-compose up -d + ``` + +3. **Verify service** + ```bash + docker ps | grep + docker logs --tail 50 + + # Test application functionality + curl -I http:// + ``` + +#### Success Criteria +- Container running +- Logs show normal operation +- Service accessible +- No errors in `docker logs` + +--- + +### Procedure 2: Daemon Configuration Rollback (Non-Breaking Changes) + +**Time Estimate:** 3-5 minutes +**Risk:** MEDIUM +**Downtime:** 2-3 minutes + +#### Steps + +1. **Restore daemon.json** + ```bash + sudo cp /etc/docker/daemon.json.backup. /etc/docker/daemon.json + ``` + +2. **Restart Docker daemon** + ```bash + sudo systemctl restart docker + ``` + +3. **Verify Docker is running** + ```bash + sudo systemctl status docker + docker info + ``` + +4. **Check all containers** + ```bash + docker ps -a + + # Restart any stopped containers + docker start $(docker ps -aq) + ``` + +5. **Verify services** + ```bash + # Test each service + docker logs --tail 20 + ``` + +#### Success Criteria +- Docker daemon running +- All containers started +- Services accessible +- No errors in `journalctl -u docker` + +--- + +### Procedure 3: User Namespace Remapping Rollback (HIGH RISK) + +**Time Estimate:** 10-15 minutes +**Risk:** HIGH +**Downtime:** 10-15 minutes +**Data Loss Risk:** LOW (if volumes backed up) + +⚠️ **WARNING:** This is the most complex rollback. Follow carefully. 
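The steps below restore a timestamped `daemon.json.backup.<epoch>` file, and under time pressure it is easy to pick the wrong one. A minimal sketch for selecting the newest backup follows; `latest_daemon_backup` is a hypothetical helper, and it assumes the backups were made with the `$(date +%s)` suffix shown in the pre-change requirements.

```shell
#!/bin/sh
# Hypothetical helper: print the newest daemon.json.backup.<epoch> file
# in a directory (default /etc/docker). Returns 1 if none is found.
latest_daemon_backup() {
  dir="${1:-/etc/docker}"
  best=""
  best_ts=0
  for f in "$dir"/daemon.json.backup.*; do
    [ -e "$f" ] || continue          # glob matched nothing
    ts="${f##*.}"                    # text after the last dot
    case "$ts" in
      ''|*[!0-9]*) continue ;;       # skip non-numeric suffixes
    esac
    if [ "$ts" -gt "$best_ts" ]; then
      best_ts="$ts"
      best="$f"
    fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}
```

It could then be used as `sudo cp "$(latest_daemon_backup)" /etc/docker/daemon.json`, after confirming that the printed file is the pre-change backup and not one taken after `userns-remap` was added.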
+ +#### Pre-Rollback Verification + +```bash +# Verify snapshot exists +ssh grokbox "sudo virsh snapshot-list " + +# Verify backup archive exists +ls -lh /root/docker-backup-*.tar.gz +``` + +#### Steps + +1. **Stop all containers gracefully** + ```bash + # Mailcow example + cd /opt/mailcow-dockerized + docker-compose down + + # Or generic + docker stop $(docker ps -q) + ``` + +2. **Stop Docker daemon** + ```bash + sudo systemctl stop docker + ``` + +3. **Restore daemon.json (remove userns-remap)** + ```bash + sudo cp /etc/docker/daemon.json.backup. /etc/docker/daemon.json + + # Verify userns-remap is removed + grep -i userns /etc/docker/daemon.json + ``` + +4. **CRITICAL: Handle user namespace volume mappings** + ```bash + # User namespaced volumes are in a different location + # /var/lib/docker/./volumes/ + + # List namespaced volumes + sudo ls -la /var/lib/docker/*/volumes/ + + # Copy volumes back to main location (if needed) + sudo rsync -av /var/lib/docker/*/volumes/* /var/lib/docker/volumes/ + ``` + +5. **Start Docker daemon** + ```bash + sudo systemctl start docker + sudo systemctl status docker + ``` + +6. **Verify Docker info** + ```bash + docker info | grep -i "userns" + # Should NOT show user namespace remapping + ``` + +7. **Recreate containers** + ```bash + # Mailcow example + cd /opt/mailcow-dockerized + docker-compose up -d + + # Wait for all containers to start + watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"' + ``` + +8. **Verify all services** + ```bash + # Check container logs + docker-compose logs --tail 50 + + # Test services + curl -I https://cow.mymx.me + + # Verify email functionality (mailcow) + docker-compose exec postfix-mailcow postqueue -p + ``` + +#### If Rollback Fails: VM Snapshot Restore + +```bash +# From Ansible control node or directly on hypervisor + +# 1. Shutdown VM +ssh grokbox "sudo virsh shutdown " + +# 2. Wait for shutdown (max 60 seconds) +sleep 30 + +# 3. 
Force stop if needed +ssh grokbox "sudo virsh destroy " + +# 4. Revert to snapshot +ssh grokbox "sudo virsh snapshot-revert backup_" + +# 5. Start VM +ssh grokbox "sudo virsh start " + +# 6. Verify SSH access (may take 1-2 minutes) +ansible -m ping + +# 7. Verify services +ansible -m shell -a "docker ps" +``` + +#### Success Criteria +- Docker daemon running WITHOUT user namespace remapping +- All containers running +- All services accessible +- Volume data intact +- No permission errors in logs + +--- + +## Specific Scenarios + +### Scenario A: Mailcow Container Won't Start After Namespace Change + +**Symptoms:** +- Containers exit immediately +- Permission denied errors in logs +- Volume mount failures + +**Solution:** +```bash +# 1. Check volume permissions +docker run --rm -v mailcowdockerized_vmail-vol-1:/volume alpine ls -la /volume + +# 2. Fix permissions if needed (DANGEROUS - only if you know UID mapping) +# This example assumes standard userns mapping (165536 offset) +sudo chown -R 165536:165536 /var/lib/docker/volumes/mailcowdockerized_vmail-vol-1 + +# 3. If permissions are unfixable, revert to snapshot +# See "VM Snapshot Restore" above +``` + +### Scenario B: Docker Daemon Won't Start After Config Change + +**Symptoms:** +- `systemctl start docker` fails +- Errors in `journalctl -u docker` + +**Solution:** +```bash +# 1. Check exact error +sudo journalctl -u docker -n 50 --no-pager + +# 2. Validate daemon.json syntax +sudo cat /etc/docker/daemon.json | jq '.' + +# 3. If syntax error, restore backup +sudo cp /etc/docker/daemon.json.backup. /etc/docker/daemon.json + +# 4. If configuration conflict, check docs +sudo dockerd --validate --config-file /etc/docker/daemon.json + +# 5. Start daemon +sudo systemctl start docker +``` + +### Scenario C: Data Loss After Namespace Change + +**Symptoms:** +- Volumes appear empty +- Database containers can't find data +- Application state lost + +**Solution:** +```bash +# 1. 
STOP - Do not proceed with data recovery attempts +# 2. DO NOT restart containers +# 3. Immediately revert to snapshot + +ssh grokbox "sudo virsh snapshot-revert backup_" + +# 4. After VM restore, verify data +docker exec + +# Example for MySQL +docker exec mailcowdockerized-mysql-mailcow-1 mysql -u root -p -e "SHOW DATABASES;" +``` + +--- + +## Testing Rollback Procedures + +### Monthly Rollback Drill + +**Schedule:** First Monday of each month +**Duration:** 30 minutes +**Environment:** Development/Test VMs only + +#### Drill Steps + +1. **Create test VM or use derp** + ```bash + # Deploy test container + docker run -d --name test-nginx nginx:latest + ``` + +2. **Create snapshot** + ```bash + ansible-playbook playbooks/backup_vm_snapshot.yml \ + -e "target_vms=['test-vm']" + ``` + +3. **Make intentional breaking change** + ```bash + # Break Docker config + echo '{"invalid": json}' | sudo tee /etc/docker/daemon.json + sudo systemctl restart docker # This will fail + ``` + +4. **Practice rollback** + ```bash + # Follow Procedure 2 above + sudo cp /etc/docker/daemon.json.backup. /etc/docker/daemon.json + sudo systemctl start docker + ``` + +5. **Practice snapshot restore** + ```bash + # Follow VM Snapshot Restore procedure + ssh grokbox "sudo virsh snapshot-revert test-vm backup_" + ``` + +6. 
**Document issues found** + - Update this runbook + - Note any steps that were unclear + - Time each procedure + +--- + +## Emergency Contacts + +### Escalation Path + +| Level | Contact | Response Time | Responsibility | +|-------|---------|---------------|----------------| +| L1 | Infrastructure Team | Immediate | Execute runbook | +| L2 | Senior Sysadmin | 15 minutes | Complex issues | +| L3 | Vendor Support | 1-4 hours | Critical failures | + +### Service-Specific Contacts + +**Mailcow:** +- Documentation: https://docs.mailcow.email/ +- Community: https://community.mailcow.email/ +- Emergency: Check for known issues in GitHub + +**Docker:** +- Documentation: https://docs.docker.com/ +- Community Forums: https://forums.docker.com/ + +--- + +## Post-Rollback Actions + +### After Any Rollback + +1. **Update incident log** + ```markdown + Date: + VM: + Change Attempted: + Rollback Procedure Used: + Success: Yes/No + Time to Restore: + Issues Encountered: + ``` + +2. **Verify service monitoring** + - Check all alerts cleared + - Verify metrics returning to normal + - Test service endpoints + +3. **Document lessons learned** + - What went wrong? + - What could be improved? + - Update this runbook + +4. **Schedule post-mortem** (for critical incidents) + - Within 48 hours + - All stakeholders present + - Action items assigned + +5. **Update change management records** + - Mark change as rolled back + - Document reason for failure + - Plan for retry (if applicable) + +--- + +## Preventive Measures + +### Before Making High-Risk Changes + +1. **Test in development first** + - Use derp VM or test environment + - Replicate production as closely as possible + - Document exact steps that work + +2. **Review Docker/Mailcow changelogs** + - Check for known issues + - Review breaking changes + - Search community forums + +3. **Peer review change plan** + - Have colleague review procedure + - Walk through rollback steps + - Verify backup procedures + +4. 
**Schedule during low-traffic period** + - Weekend or late evening + - Notify users in advance + - Have monitoring ready + +--- + +## Appendix A: Quick Reference Commands + +### Snapshot Management +```bash +# Create snapshot +ansible-playbook playbooks/backup_vm_snapshot.yml -e "target_vms=['vm']" + +# List snapshots +ssh grokbox "sudo virsh snapshot-list " + +# Revert to snapshot +ssh grokbox "sudo virsh snapshot-revert " + +# Delete snapshot +ssh grokbox "sudo virsh snapshot-delete " +``` + +### Docker Backup/Restore +```bash +# Backup +sudo tar -czf docker-backup.tar.gz /etc/docker /var/lib/docker/volumes + +# Restore +sudo tar -xzf docker-backup.tar.gz -C / +``` + +### Service Verification +```bash +# Docker +systemctl status docker +docker info +docker ps + +# Mailcow +cd /opt/mailcow-dockerized +docker-compose ps +docker-compose logs --tail 50 +``` + +--- + +**Document End** + +**Review Schedule:** Monthly +**Next Review:** 2025-12-11 +**Approval:** Infrastructure Team Lead diff --git a/playbooks/backup_vm_snapshot.yml b/playbooks/backup_vm_snapshot.yml new file mode 100644 index 0000000..80a8950 --- /dev/null +++ b/playbooks/backup_vm_snapshot.yml @@ -0,0 +1,206 @@ +--- +# ============================================================================== +# VM Snapshot Backup Playbook +# ============================================================================== +# Create snapshots of VMs before risky operations +# Supports KVM/libvirt VMs via hypervisor connection +# ============================================================================== + +- name: Create VM Snapshots for Backup + hosts: localhost + gather_facts: true + vars: + hypervisor_uri: "qemu+ssh://grok@grok.home.serneels.xyz/system" + snapshot_description: "Pre-maintenance backup" + snapshot_prefix: "backup" + target_vms: [] # Empty list means all running VMs + + tasks: + - name: Display snapshot operation information + ansible.builtin.debug: + msg: + - "=== VM Snapshot Backup Operation ===" 
+ - "Hypervisor: {{ hypervisor_uri }}" + - "Date: {{ ansible_date_time.iso8601 }}" + - "Target VMs: {{ target_vms | default('all running VMs') }}" + tags: [always] + + - name: Validate target_vms variable + ansible.builtin.assert: + that: + - target_vms is defined + - target_vms is iterable + fail_msg: "target_vms must be a list of VM names" + tags: [always] + + # ========================================================================== + # Get VM List + # ========================================================================== + + - name: Get list of all running VMs + ansible.builtin.shell: | + ssh grokbox "sudo virsh list --name" + register: all_vms_raw + changed_when: false + when: target_vms | length == 0 + tags: [discover] + + - name: Parse running VMs list + ansible.builtin.set_fact: + discovered_vms: "{{ all_vms_raw.stdout_lines | select() | list }}" + when: target_vms | length == 0 + tags: [discover] + + - name: Set final VM list + ansible.builtin.set_fact: + vms_to_backup: "{{ target_vms if target_vms | length > 0 else discovered_vms }}" + tags: [discover] + + - name: Display VMs to be backed up + ansible.builtin.debug: + msg: "VMs to backup: {{ vms_to_backup }}" + tags: [discover] + + # ========================================================================== + # Pre-flight Checks + # ========================================================================== + + - name: Check if VMs exist and are running + ansible.builtin.shell: | + ssh grokbox "sudo virsh domstate {{ item }}" + register: vm_states + failed_when: vm_states.rc != 0 + changed_when: false + loop: "{{ vms_to_backup }}" + tags: [validate] + + - name: Verify all VMs are running + ansible.builtin.assert: + that: + - item.stdout == 'running' + fail_msg: "VM {{ item.item }} is not running (state: {{ item.stdout }})" + success_msg: "VM {{ item.item }} is running" + loop: "{{ vm_states.results }}" + tags: [validate] + + - name: Check for existing snapshots + ansible.builtin.shell: | + ssh 
grokbox "sudo virsh snapshot-list {{ item }} --name" + register: existing_snapshots + changed_when: false + loop: "{{ vms_to_backup }}" + tags: [validate] + + - name: Display existing snapshots + ansible.builtin.debug: + msg: + - "VM: {{ item.item }}" + - "Existing snapshots: {{ item.stdout_lines | default(['none']) | join(', ') }}" + loop: "{{ existing_snapshots.results }}" + tags: [validate] + + # ========================================================================== + # Create Snapshots + # ========================================================================== + + - name: Generate snapshot name with timestamp + ansible.builtin.set_fact: + snapshot_timestamp: "{{ ansible_date_time.epoch }}" + tags: [snapshot] + + - name: Create VM snapshots + ansible.builtin.shell: | + ssh grokbox "sudo virsh snapshot-create-as {{ item }} \ + --name '{{ snapshot_prefix }}_{{ snapshot_timestamp }}' \ + --description '{{ snapshot_description }} - {{ ansible_date_time.iso8601 }}' \ + --atomic" + register: snapshot_create + loop: "{{ vms_to_backup }}" + tags: [snapshot] + + - name: Verify snapshot creation + ansible.builtin.shell: | + ssh grokbox "sudo virsh snapshot-info {{ item }} {{ snapshot_prefix }}_{{ snapshot_timestamp }}" + register: snapshot_info + changed_when: false + loop: "{{ vms_to_backup }}" + tags: [snapshot, verify] + + # ========================================================================== + # Generate Backup Report + # ========================================================================== + + - name: Create backup report directory + ansible.builtin.file: + path: "./stats/vm_backups" + state: directory + mode: '0755' + tags: [report] + + - name: Generate backup report + ansible.builtin.copy: + content: | + ================================================================================ + VM SNAPSHOT BACKUP REPORT + ================================================================================ + Date: {{ ansible_date_time.iso8601 }} + Hypervisor: 
{{ hypervisor_uri }} + Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }} + Description: {{ snapshot_description }} + + VMs Backed Up: + {% for vm in vms_to_backup %} + - {{ vm }} + {% endfor %} + + Snapshot Details: + {% for result in snapshot_info.results %} + + VM: {{ result.item }} + {{ result.stdout }} + {% endfor %} + + ROLLBACK INSTRUCTIONS + ================================================================================ + + To restore a VM to this snapshot: + + 1. Stop the VM (if running): + ssh grokbox "sudo virsh shutdown <vm-name>" + + 2. Revert to snapshot: + ssh grokbox "sudo virsh snapshot-revert <vm-name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}" + + 3. Start the VM: + ssh grokbox "sudo virsh start <vm-name>" + + To delete this snapshot after verification: + ssh grokbox "sudo virsh snapshot-delete <vm-name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}" + + ================================================================================ + END OF REPORT + ================================================================================ + dest: "./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt" + mode: '0644' + tags: [report] + + # ========================================================================== + # Display Summary + # ========================================================================== + + - name: Display backup summary + ansible.builtin.debug: + msg: + - "=== VM Snapshot Backup Complete ===" + - "Snapshot Name: {{ snapshot_prefix }}_{{ snapshot_timestamp }}" + - "VMs Backed Up: {{ vms_to_backup | length }}" + - "Backup Report: ./stats/vm_backups/backup_{{ snapshot_timestamp }}.txt" + - "" + - "⚠️ IMPORTANT NOTES:" + - "1. Snapshots are point-in-time copies" + - "2. Test restoration procedure before relying on snapshots" + - "3. Snapshots consume disk space - clean up old snapshots" + - "4. For critical changes, consider full VM backups" + - "" + - "To restore: virsh snapshot-revert <vm-name> {{ snapshot_prefix }}_{{ snapshot_timestamp }}" + tags: [always]
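Note 3 in the backup summary asks operators to clean up old snapshots, but the playbook itself has no pruning step. Because the playbook names snapshots `{{ snapshot_prefix }}_{{ snapshot_timestamp }}` with an epoch timestamp, stale ones can be picked out by name alone. The following is a minimal sketch under that assumption; `stale_snapshots` is a hypothetical helper that is not part of this change, and the 14-day threshold in the usage comment is only an illustration.

```bash
# Hypothetical helper (not part of the playbook): read snapshot names on stdin
# and print those that follow <prefix>_<epoch> and are older than MAX_AGE_DAYS.
# Usage: stale_snapshots PREFIX MAX_AGE_DAYS [NOW_EPOCH]
stale_snapshots() (
  prefix=$1
  max_age_days=$2
  now=${3:-$(date +%s)}                      # default to the current time
  cutoff=$(( now - max_age_days * 86400 ))   # anything before this is stale
  while IFS= read -r name; do
    # Keep only names that start with "<prefix>_".
    case "$name" in
      "${prefix}_"*) epoch=${name#"${prefix}_"} ;;
      *) continue ;;
    esac
    # Skip names whose suffix is empty or not purely numeric.
    case "$epoch" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$epoch" -lt "$cutoff" ]; then
      printf '%s\n' "$name"
    fi
  done
)

# Example (assumes the pihole VM and the default "backup" prefix):
#   ssh grokbox "sudo virsh snapshot-list pihole --name" | stale_snapshots backup 14
```

The helper only prints candidates; it deletes nothing. Reviewing the list before passing each name to `virsh snapshot-delete` keeps the cleanup deliberate, which matters when a snapshot is the only rollback path.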